Mining Frequent Itemsets with Multiple Support Thresholds Using the CFPGrowth++ Algorithm (SPMF documentation)
This example explains how to run the CFPGrowth++ algorithm using the SPMF open-source data mining library.
How to run this example?
- If you are using the graphical interface, (1) choose the "CFPGrowth++" algorithm, (2) select the input file "contextCFPGrowth.txt", (3) set the output file name (e.g. "output.txt"), (4) put "MIS.txt" in the "MIS file name" text field, and (5) click "Run algorithm".
- If you want to execute this example from the command line, then execute this command:
java -jar spmf.jar run CFPGrowth++ contextCFPGrowth.txt output.txt MIS.txt
in a folder containing spmf.jar and the example input files contextCFPGrowth.txt and MIS.txt.
- If you are using the source code version of SPMF, launch the file "MainTestCFPGrowth_saveToFile.java" in the package ca.pfv.SPMF.tests.
What is CFPGrowth++?
CFPGrowth++ is an algorithm for mining frequent itemsets using multiple minimum support thresholds, one for each item. It is an extension of the CFPGrowth algorithm, which addresses the same task.
What is the input of this algorithm?
The input of CFPGrowth++ is a transaction database and a list of minimum support thresholds indicating the minimum support threshold for each item.
A transaction database is a set of transactions, where each transaction is a list of distinct items (symbols). For example, consider the following transaction database. It consists of 5 transactions (t1, t2, ..., t5) and 8 items (1, 2, 3, 4, 5, 6, 7, 8). For instance, transaction t1 is the set of items {1, 3, 4, 6}. This database is provided in the file "contextCFPGrowth.txt" of the SPMF distribution. It is important to note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted by lexicographical order in a transaction.
Transaction ID | items |
t1 | {1, 3, 4, 6} |
t2 | {1, 3, 5, 6, 7} |
t3 | {1, 2, 3, 6, 8} |
t4 | {2, 6, 7} |
t5 | {2, 3} |
The list of minimum support thresholds is stored in a text file that is read as input by the algorithm. It is provided in the file "MIS.txt":
item | minimum support threshold |
1 | 1 |
2 | 2 |
3 | 3 |
4 | 3 |
5 | 2 |
6 | 3 |
7 | 2 |
8 | 1 |
This file indicates, for example, that the minimum support threshold to be used for item 6 is 3.
What is the output of this algorithm?
The output of CFPGrowth++ is the set of all frequent itemsets contained in the database.
What is a frequent itemset? The support of an itemset is the number of transactions containing the itemset. An itemset is a frequent itemset if its support is greater than or equal to the smallest minimum support threshold among the minimum support thresholds of all its items. For example, the itemset {1 2 8} is frequent because it appears in one transaction (t3), so its support is 1, which is no less than the smallest of the minimum supports of item 1, item 2 and item 8 (respectively 1, 2 and 1).
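The definition above can be checked by hand on the example data. The following is a minimal Python sketch, independent of SPMF; the names TRANSACTIONS, MIS, support and is_frequent are chosen here for illustration and are not part of the library.

```python
# Example database and thresholds copied from the two tables above.
TRANSACTIONS = [
    {1, 3, 4, 6},
    {1, 3, 5, 6, 7},
    {1, 2, 3, 6, 8},
    {2, 6, 7},
    {2, 3},
]
MIS = {1: 1, 2: 2, 3: 3, 4: 3, 5: 2, 6: 3, 7: 2, 8: 1}


def support(itemset, transactions):
    """Count the transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)


def is_frequent(itemset, transactions, mis):
    """Compare the support with the smallest threshold among the items."""
    threshold = min(mis[item] for item in itemset)
    return support(itemset, transactions) >= threshold


print(support({1, 2, 8}, TRANSACTIONS))           # support of {1, 2, 8}
print(is_frequent({1, 2, 8}, TRANSACTIONS, MIS))  # threshold is min(1, 2, 1)
print(is_frequent({2, 7}, TRANSACTIONS, MIS))     # threshold is min(2, 2)
```

The itemset {2, 7} appears only in t4, so with a threshold of 2 it is not frequent, whereas {1, 2, 8} is frequent with a support of 1 because item 1 and item 8 lower the threshold to 1.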
Why is CFPGrowth++ useful? It is useful because it permits setting lower minimum support thresholds for rare items. Therefore, it allows discovering frequent itemsets containing rare items.
If we run CFPGrowth++ on the previous transaction database with the MIS.txt file previously described, we get the following result, where each line represents an itemset followed by ":" and then its absolute support:
8:1
8 1:1
8 1 2:1 // for example, this itemset is {1, 2, 8}, and it has a support of 1.
8 1 2 6:1
8 1 2 6 3:1
8 1 2 3:1
8 1 6:1
8 1 6 3:1
8 1 3:1
8 2:1
8 2 6:1
8 2 6 3:1
8 2 3:1
8 6:1
8 6 3:1
8 3:1
1:3 // for example, this itemset is {1}, and it has a support of 3.
1 7:1
1 7 5:1
1 7 5 6:1
1 7 5 6 3:1
1 7 5 3:1
1 7 6:1
1 7 6 3:1
1 7 3:1
1 5:1
1 5 6:1
1 5 6 3:1
1 5 3:1
1 2:1
1 2 6:1
1 2 6 3:1
1 2 3:1
1 6:3
1 6 4:1
1 6 4 3:1
1 6 3:3
1 4:1
1 4 3:1
1 3:3
7:2
7 6:2
2:3
2 6:2
2 3:2
6:4
6 3:3
3:4
Note: If you are using the GUI version of SPMF, the file containing the minimum supports must be located in the same folder as the input file containing the transaction database.
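To cross-check a result of this kind on a small database, one can enumerate every itemset that occurs in at least one transaction and keep those that satisfy the definition. The sketch below is a brute-force check written for this example only. It is not CFPGrowth++, it does not use SPMF, and it assumes that every threshold is at least 1. The function name frequent_itemsets is hypothetical.

```python
from itertools import combinations

# Example database and thresholds copied from the tables above.
TRANSACTIONS = [
    {1, 3, 4, 6},
    {1, 3, 5, 6, 7},
    {1, 2, 3, 6, 8},
    {2, 6, 7},
    {2, 3},
]
MIS = {1: 1, 2: 2, 3: 3, 4: 3, 5: 2, 6: 3, 7: 2, 8: 1}


def frequent_itemsets(transactions, mis):
    """Return {itemset: support} for itemsets meeting their smallest MIS.

    Brute force: every non-empty subset of every transaction is counted,
    so the cost grows exponentially with the transaction length.
    """
    counts = {}
    for transaction in transactions:
        items = sorted(transaction)
        for size in range(1, len(items) + 1):
            for combo in combinations(items, size):
                counts[combo] = counts.get(combo, 0) + 1
    return {
        itemset: count
        for itemset, count in counts.items()
        if count >= min(mis[item] for item in itemset)
    }


result = frequent_itemsets(TRANSACTIONS, MIS)
for itemset, count in sorted(result.items()):
    print(" ".join(str(item) for item in itemset) + ":" + str(count))
```

This sketch prints the items of each itemset in ascending order, so the order of items within a line and the order of the lines differ from the SPMF output, but the itemsets and their supports can be compared one by one.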
Input file format
CFPGrowth++ takes two input files, defined as follows.
The first file (e.g. contextCFPGrowth.txt) is a text file containing the transactions. Each line represents a transaction. The items in the transaction are listed on the corresponding line. An item is represented by a positive integer. Each item is separated from the following item by a space. It is assumed that items are sorted according to a total order and that no item can appear twice in the same transaction. For the previous example, the input file is defined as follows:
1 3 4 6
1 3 5 6 7
1 2 3 6 8
2 6 7
2 3
Consider the first line. It means that the first transaction is the itemset {1, 3, 4, 6}. The following lines follow the same format.
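As an illustration of this format, here is a minimal Python sketch that parses such a file and checks the stated assumptions. It is not the SPMF reader. The function names are hypothetical, and the sketch assumes that the total order on items is the increasing numeric order, which holds for the example file.

```python
def parse_transaction(line):
    """Parse one line into a list of item ids and check the assumptions."""
    items = [int(token) for token in line.split()]
    if any(item <= 0 for item in items):
        raise ValueError(f"items must be positive integers: {line!r}")
    # Strictly increasing order also rules out duplicate items.
    if any(a >= b for a, b in zip(items, items[1:])):
        raise ValueError(f"items must be distinct and increasing: {line!r}")
    return items


def parse_database(text):
    """Parse a whole transaction file, skipping blank lines."""
    return [parse_transaction(line) for line in text.splitlines() if line.strip()]


EXAMPLE = """1 3 4 6
1 3 5 6 7
1 2 3 6 8
2 6 7
2 3
"""

database = parse_database(EXAMPLE)
print(len(database))  # number of transactions
print(database[0])    # items of the first transaction
```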
The second file is a text file (e.g. MIS.txt) which provides the minimum support to be used for each item. Each line indicates the minimum support for an item and consists of two integer values separated by a single space. The first value is the item. The second value is the minimum support value to be used for this item. For example, here is the file used in this example. The first line indicates that for item "1" the minimum support to be used is 1 (one transaction). The other lines follow the same format.
1 1
2 2
3 3
4 3
5 2
6 3
7 2
8 1
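The MIS file can be read into a simple lookup table. The following Python sketch is only an illustration of the format described above and is not taken from SPMF; the name parse_mis is hypothetical.

```python
def parse_mis(text):
    """Map each item to its minimum support, one "item threshold" pair per line."""
    mis = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Unpacking fails with ValueError unless the line has exactly two values.
        item, threshold = (int(token) for token in line.split())
        mis[item] = threshold
    return mis


MIS_TEXT = """1 1
2 2
3 3
4 3
5 2
6 3
7 2
8 1
"""

mis = parse_mis(MIS_TEXT)
print(mis[6])                             # threshold of item 6
print(min(mis[i] for i in (1, 2, 8)))     # threshold applied to {1, 2, 8}
```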
Output file format
The output file format of CFPGrowth++ is defined as follows. It is a text file, where each line represents a frequent itemset. On each line, the items of the itemset are first listed. Each item is represented by an integer, followed by a single space. After all the items, the keyword "#SUP:" appears, followed by an integer value indicating the support of that itemset.
8 #SUP: 1
8 1 #SUP: 1
8 1 2 #SUP: 1
8 1 2 6 #SUP: 1
8 1 2 6 3 #SUP: 1
8 1 2 3 #SUP: 1
8 1 6 #SUP: 1
8 1 6 3 #SUP: 1
8 1 3 #SUP: 1
8 2 #SUP: 1
8 2 6 #SUP: 1
8 2 6 3 #SUP: 1
8 2 3 #SUP: 1
8 6 #SUP: 1
8 6 3 #SUP: 1
8 3 #SUP: 1
1 #SUP: 3
1 7 #SUP: 1
1 7 5 #SUP: 1
1 7 5 6 #SUP: 1
1 7 5 6 3 #SUP: 1
1 7 5 3 #SUP: 1
1 7 6 #SUP: 1
1 7 6 3 #SUP: 1
1 7 3 #SUP: 1
1 5 #SUP: 1
1 5 6 #SUP: 1
1 5 6 3 #SUP: 1
1 5 3 #SUP: 1
1 2 #SUP: 1
1 2 6 #SUP: 1
1 2 6 3 #SUP: 1
1 2 3 #SUP: 1
1 6 #SUP: 3
1 6 4 #SUP: 1
1 6 4 3 #SUP: 1
1 6 3 #SUP: 3
1 4 #SUP: 1
1 4 3 #SUP: 1
1 3 #SUP: 3
7 #SUP: 2
7 6 #SUP: 2
2 #SUP: 3
2 6 #SUP: 2
2 3 #SUP: 2
6 #SUP: 4
6 3 #SUP: 3
3 #SUP: 4
For example, the last line indicates that the itemset {3} has a support of 4 transactions. The other lines follow the same format.
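A line in this format can be split on the "#SUP:" keyword. The Python sketch below shows one way to read the output file back for further processing. It assumes integer items (that is, an output file without item names), and the name parse_output_line is hypothetical.

```python
def parse_output_line(line):
    """Split "item item ... #SUP: n" into (list of items, support)."""
    items_part, support_part = line.split("#SUP:")
    items = [int(token) for token in items_part.split()]
    return items, int(support_part)


lines = [
    "8 1 2 6 3 #SUP: 1",
    "1 6 3 #SUP: 3",
    "3 #SUP: 4",
]
for line in lines:
    items, sup = parse_output_line(line)
    print(items, sup)
```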
Implementation details
In the source code version of SPMF, there are two example files for CFPGrowth++: one that saves the result to a file (MainTestCFPGrowth_saveToFile.java) and one that keeps the result in memory (MainTestCFPGrowth_saveToMemory.java). In the graphical interface and the command line interface, only the version that saves to a file is offered.
Optional feature: giving names to items
Some users have requested the feature of giving names to items instead of using numbers. This feature is offered in the user interface of SPMF and in the command line of SPMF. To use this feature, your file must include @CONVERTED_FROM_TEXT as its first line, followed by several lines that define the names of the items in your file. For example, consider the example database "contextCFPGrowth.txt". Here we have modified the file to give names to the items:
@CONVERTED_FROM_TEXT
@ITEM=1=apple
@ITEM=2=orange
@ITEM=3=tomato
@ITEM=4=milk
@ITEM=5=bread
@ITEM=6=noodle
@ITEM=7=rice
@ITEM=8=potato
1 3 4 6
1 3 5 6 7
1 2 3 6 8
2 6 7
2 3
In this file, the first line indicates that it is a file where names are given to items. Then, the second line indicates that the item 1 is called "apple". The third line indicates that the item 2 is called "orange", and so on for the other items. The lines that follow the item definitions are the transactions in the SPMF format.
Then, if we apply the algorithm to this file using the user interface of SPMF or the command line, the output file contains several patterns, including the following ones:
orange tomato #SUP: 2
noodle #SUP: 4
noodle tomato #SUP: 3
Note that this feature could also be used from the source code of SPMF using the ResultConverter class. However, there is currently no example provided for using it from the source code.
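To make the role of the @ITEM lines concrete, here is a small Python sketch that reads the name definitions and renames the items of an itemset. It is a standalone illustration and not the ResultConverter class of SPMF; the function names read_item_names and rename are hypothetical.

```python
def read_item_names(lines):
    """Collect the item names declared by lines of the form @ITEM=<id>=<name>."""
    names = {}
    for line in lines:
        if line.startswith("@ITEM="):
            # Split at most twice so that a name may itself contain "=".
            _, item_id, name = line.strip().split("=", 2)
            names[int(item_id)] = name
    return names


def rename(itemset, names):
    """Replace each item id by its name, keeping the id if no name is known."""
    return " ".join(names.get(item, str(item)) for item in itemset)


HEADER = [
    "@CONVERTED_FROM_TEXT",
    "@ITEM=1=apple",
    "@ITEM=2=orange",
    "@ITEM=3=tomato",
    "@ITEM=4=milk",
    "@ITEM=5=bread",
    "@ITEM=6=noodle",
    "@ITEM=7=rice",
    "@ITEM=8=potato",
]

names = read_item_names(HEADER)
print(rename([2, 3], names))  # itemset written as "2 3" in the numeric output
print(rename([6, 3], names))  # itemset written as "6 3" in the numeric output
```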
Performance
CFPGrowth++ is a very efficient algorithm. It is based on FPGrowth.
SPMF also offers the MSApriori algorithm, which is less efficient than CFPGrowth++. Note that there is one important difference between the input of CFPGrowth++ and MSApriori in SPMF. The MSApriori algorithm sets the multiple minimum supports by using two special parameters named LS and BETA (see the example describing MSApriori for more details). The CFPGrowth++ implementation instead uses a list of minimum support values stored in a text file.
Where can I get more information about the CFPGrowth++ algorithm?
This article describes the original CFPGrowth algorithm:
Y.-H. Hu, Y.-L. Chen: Mining association rules with multiple minimum supports: a new mining algorithm and a support tuning mechanism. Decision Support Systems 42(1): 1-24 (2006)
This article describes CFPGrowth++, the extension of CFPGrowth that is implemented in SPMF, which introduces a few additional optimizations:
Kiran, R. U., & Reddy, P. K. (2011). Novel techniques to reduce search space in multiple minimum supports-based frequent pattern mining algorithms. In Proceedings of the 14th International Conference on Extending Database Technology, ACM, pp. 11-20.
For a good overview of frequent itemset mining algorithms, you may read this survey paper.