Mining Frequent Generator Itemsets using the DefMe Algorithm (SPMF documentation)

This example explains how to run the DefMe algorithm using the SPMF open-source data mining library.

How to run this example?

What is DefMe?

DefMe is an algorithm proposed at PAKDD 2014 for discovering minimal patterns in set systems. If it is applied to itemset mining, it will discover frequent itemset generator. In SPMF, we have implemented it for this purpose.

DefMe is the our knowledge the only real depth-first search algorithm for mining generator itemsets (it does not need to use a hash table or store candidates). It is interesting to have a depth-first search algorithm since depth-first search algorithm are generally faster than Apriori-based algorithms.

Another important point about DefMe is that unlike Pascal, DefMe only find frequent generator itemsets rather than generating all frequent itemsets and identifying which one are generators.

What is the input of the DefMe algorithm?

The input is a transaction database (aka binary context) and a threshold named minsup (a value between 0 and 100 %).

A transaction database is a set of transactions. Each transaction is a set of items. For example, consider the following transaction database. It contains 5 transactions (t1, t2, ..., t5) and 5 items (1,2, 3, 4, 5). For example, the first transaction represents the set of items 1, 2, 4 and 5. This database is provided as the file contextZart.txt in the SPMF distribution. It is important to note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted by lexicographical order in a transaction.

Transaction id Items
t1 {1, 2, 4, 5}
t2 {1, 3}
t3 {1, 2, 3, 5}
t4 {2, 3, 5}
t5 {1, 2, 3, 5}

What is the output of the DefMe algorithm?

The output of the DefMe algorithm for a transaction database and a minimum support threshold minsup is the set of all frequent itemsets and their support, and a flag indicating which itemsets is a generator.

To explain what is a frequent itemset and a generator, it is necessary to review a few definitions.

An itemset is a unordered set of distinct items. The support of an itemset is the number of transactions that contain the itemset. For example, the itemset {1, 3} has a support of 3 because it appears in three transactions from the database (t2, t3 and t5). A frequent itemset is an itemset that appears in at least minsup transactions from the transaction database. A generator is an itemset X such that there does not exist an itemset Y strictly included in X that has the same support.

By running DefMe with the previous transaction database and a minsup of 40% (2 transactions), we obtain the following result:

itemsets support
{} 5
{1} 4
{2} 4
{3} 4
{5} 4
{1, 2} 3
{1, 3} 3
{1, 5} 3
{2, 3} 3
{2, 5} 4
{3, 5} 3
{1, 2, 3} 2
{1, 3, 5} 2

How should I interpret the results?

In the results, for each generator itemset found, its support is indicated. For example, the itemset {1,2,3} has a support of 2 because it appears in 2 transactions (t3 and t5).

Input file format

The input file format used by DefMe is defined as follows. It is a text file. An item is represented by a positive integer. A transaction is a line in the text file. In each line (transaction), items are separated by a single space. It is assumed that all items within a same transaction (line) are sorted according to a total order (e.g. ascending order) and that no item can appear twice within the same line.

For example, for the previous example, the input file is defined as follows:

1 2 4 5
1 3
1 2 3 5
2 3 5
1 2 3 5

Note that it is also possible to use the ARFF format as an alternative to the default input format. The specification of the ARFF format can be found here. Most features of the ARFF format are supported except that (1) the character "=" is forbidden and (2) escape characters are not considered. Note that when the ARFF format is used, the performance of the data mining algorithms will be slightly less than if the native SPMF file format is used because a conversion of the input file will be automatically performed before launching the algorithm and the result will also have to be converted. This cost however should be small.

Output file format

The output file format is defined as follows. It is a text file, where each line represents a frequent itemset. On each line, the items of the itemset are first listed. Each item is represented by an integer and it is followed by a single space. After, all the items, the keyword "#SUP:" appears, which is followed by an integer indicating the support of the itemset, expressed as a number of transactions. For instance, the first line indicates that the empty set is a generator having a support of 5 transactions. The second line indicates that the itemset {1} has a support of 4 transactions.

#SUP: 5
1 #SUP: 4
1 2 #SUP: 3
1 2 3 #SUP: 2
1 3 #SUP: 3
1 3 5 #SUP: 2
1 5 #SUP: 3
2 #SUP: 4
2 3 #SUP: 3
3 #SUP: 4
3 5 #SUP: 3
5 #SUP: 4

Note that if the ARFF format is used as input instead of the default input format, the output format will be the same except that items will be represented by strings instead of integers.

Optional parameter(s):

Sometimes, there may be just too many itemsets, and itemsets containing many items may not be interesting. Thus, it is also possible to specify an optional parameter in the user interface of SPMF:

If you are using the command line interface of SPMF, it is also possible to use this optional parameter by adding it at the end of the command. For example:
java -jar spmf.jar run DefMe contextZart.txt output.txt 40% 2
means to run the above example to find only frequent itemsets having 2 items or less.

Optional feature: giving names to items

Some users have requested the feature of given names to items instead of using numbers. This feature is offered in the user interface of SPMF and in the command line of SPMF. To use this feature, your file must include @CONVERTED_FROM_TEXT as first line and then several lines to define the names of items in your file. For example, consider the example database "contextZart.txt". Here we have modified the file to give names to the items: 

@CONVERTED_FROM_TEXT
@ITEM=1=apple
@ITEM=2=orange
@ITEM=3=tomato
@ITEM=4=milk
@ITEM=5=bread
1 2 4 5
1 3
1 2 3 5
2 3 5
1 2 3 5

In this file, the first line indicates, that it is a file where names are given to items. Then, the second line indicates that the item 1 is called "apple". The third line indicates that the item 2 is called "orange". Then the following lines define transactions in the SPMF format.

Then, if we apply the algorithm using this file using the user interface of SPMF or the command line, the output file contains several patterns, including the following ones:

orange tomato #SUP: 3
tomato #SUP: 4
tomato bread #SUP: 3
bread #SUP: 4

Note that this feature could be also used from the source code of SPMF using the ResultConverter class. However, there is currently no example provided for using it from the source code.

Performance

The DefMe algorithm should be more efficient than Apriori-based algorithm such as Zart or Pascal. However, no performance comparison has been done by the authors of DefMe.

Where can I get more information about the Pascal algorithm?

The DefMe algorithm is described in this paper:

Arnaud Soulet, François Rioult (2014). Efficiently Depth-First Minimal Pattern Mining. PAKDD (1) 2014: 28-39

Also, for a good overview of frequent itemset mining algorithms, you may read this survey paper.