Mining Frequent Maximal Itemsets using the Charm-MFI Algorithm (SPMF documentation)
This example explains how to run the Charm-MFI algorithm using the SPMF open-source data mining library.
How to run this example?
- If you are using the graphical interface, (1) choose the "Charm_MFI" algorithm, (2) select the input file "contextPasquier99.txt", (3) set the output file name (e.g. "output.txt"), (4) set minsup to 40%, and (5) click "Run algorithm".
- If you want to execute this example from the command line, then execute this command in a folder containing spmf.jar and the example input file contextPasquier99.txt:
java -jar spmf.jar run Charm_MFI contextPasquier99.txt output.txt 40%
- To run this example with the source code version of SPMF, launch the file "MainTestCharmMFI.java" in the package ca.pfv.SPMF.tests.
What is Charm-MFI?
Charm-MFI is an algorithm for discovering frequent maximal itemsets in a transaction database.
Charm-MFI is not an efficient algorithm because it
discovers maximal itemsets by performing post-processing after
discovering frequent closed itemsets with the Charm algorithm (hence
the name: Charm-MFI). A more efficient algorithm for mining maximal
itemsets named FPMax is provided in SPMF.
Moreover, note that the original Charm-MFI algorithm is not correct. In SPMF, it has been fixed so that it generates the correct result.
What is the input of the Charm-MFI algorithm?
The input is a transaction database (aka binary context) and a threshold named minsup (a value between 0 and 100 %).
A transaction database is a set of transactions. Each transaction is a set of items. For example, consider the following transaction database. It contains 5 transactions (t1, t2, ..., t5) and 5 items (1, 2, 3, 4, 5). For example, the first transaction represents the set of items 1, 3 and 4. This database is provided as the file contextPasquier99.txt in the SPMF distribution. It is important to note that an item is not allowed to appear twice in the same transaction and that the items of a transaction are assumed to be sorted in lexicographical order.
Transaction id | Items |
t1 | {1, 3, 4} |
t2 | {2, 3, 5} |
t3 | {1, 2, 3, 5} |
t4 | {2, 5} |
t5 | {1, 2, 3, 5} |
What is the output of the Charm-MFI algorithm?
Charm-MFI outputs frequent maximal itemsets. To explain what is a frequent maximal itemset, it is necessary to review a few definitions.
An itemset is an unordered set of distinct items. The support of an itemset is the number of transactions that contain the itemset. For example, the itemset {1, 3} has a support of 3 because it appears in three transactions (t1, t3, t5) of the previous transaction database.
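The definition of support can be checked with a few lines of Java. The following is a minimal sketch, not code from SPMF: the class name SupportExample and its method are hypothetical, and the example database is hard-coded instead of being read from contextPasquier99.txt.

```java
import java.util.List;
import java.util.Set;

class SupportExample {
    // The example database (t1..t5), hard-coded for illustration.
    static final List<Set<Integer>> DATABASE = List.of(
            Set.of(1, 3, 4),
            Set.of(2, 3, 5),
            Set.of(1, 2, 3, 5),
            Set.of(2, 5),
            Set.of(1, 2, 3, 5));

    /** Counts the transactions that contain every item of the itemset. */
    static int support(List<Set<Integer>> database, Set<Integer> itemset) {
        int count = 0;
        for (Set<Integer> transaction : database) {
            if (transaction.containsAll(itemset)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(support(DATABASE, Set.of(1, 3))); // prints 3
    }
}
```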
A frequent itemset is an itemset that appears in at least minsup transactions of the transaction database. A frequent closed itemset is a frequent itemset that is not included in a proper superset having the same support. A frequent maximal itemset is a frequent itemset that is not included in a proper superset that is also frequent. The set of frequent maximal itemsets is thus a subset of the set of frequent closed itemsets, which is itself a subset of the set of frequent itemsets.
Why is it interesting to discover frequent maximal itemsets? The reason is that the set of frequent maximal itemsets is usually much smaller than the set of frequent itemsets, and also smaller than the set of frequent closed itemsets. However, unlike frequent closed itemsets, frequent maximal itemsets are not a lossless representation of the set of frequent itemsets: all frequent itemsets can be regenerated from the frequent maximal itemsets, but their supports cannot be recovered without scanning the database.
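To illustrate why the frequent itemsets can be regenerated from the maximal ones, the sketch below enumerates every non-empty subset of each maximal itemset. It is only an illustration: the class RegenerateFromMaximal is hypothetical and not part of SPMF, each itemset is assumed to be given as a list sorted in ascending order, and no support values are produced because they cannot be derived from the maximal itemsets alone.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class RegenerateFromMaximal {
    /**
     * Returns every non-empty subset of every maximal itemset, without duplicates.
     * Assumes each maximal itemset is sorted in ascending order.
     */
    static Set<List<Integer>> allFrequentItemsets(List<List<Integer>> maximalItemsets) {
        Set<List<Integer>> result = new HashSet<>();
        for (List<Integer> maximal : maximalItemsets) {
            int n = maximal.size();
            // Each bitmask from 1 to 2^n - 1 selects one non-empty subset.
            for (int mask = 1; mask < (1 << n); mask++) {
                List<Integer> subset = new ArrayList<>();
                for (int i = 0; i < n; i++) {
                    if ((mask & (1 << i)) != 0) {
                        subset.add(maximal.get(i));
                    }
                }
                result.add(subset);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // A single maximal itemset with 4 items yields 2^4 - 1 = 15 itemsets.
        System.out.println(allFrequentItemsets(List.of(List.of(1, 2, 3, 5))).size()); // prints 15
    }
}
```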
If we apply Charm-MFI on the previous transaction database with a minsup of 40 % (2 transactions), we get the following result:
frequent maximal itemsets | support |
{1, 2, 3, 5} | 2 |
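The result above can be reproduced by applying the definitions directly. The following brute-force sketch enumerates all itemsets over the 5 items and filters them. It is not how Charm-MFI or SPMF works internally; the class ItemsetFamilies and its methods are hypothetical, and the conversion of the percentage threshold to a transaction count by rounding up is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

class ItemsetFamilies {
    // The example database (t1..t5) over items 1..5.
    static final int[][] DATABASE = {
        {1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}, {1, 2, 3, 5}
    };
    static final int ITEM_COUNT = 5;

    /** Encodes items as a bitmask: item i sets bit (i - 1). */
    static int toMask(int... items) {
        int mask = 0;
        for (int item : items) {
            mask |= 1 << (item - 1);
        }
        return mask;
    }

    /** Number of transactions that contain all items of the itemset. */
    static int support(int itemset) {
        int count = 0;
        for (int[] transaction : DATABASE) {
            if ((toMask(transaction) & itemset) == itemset) {
                count++;
            }
        }
        return count;
    }

    /** Converts a percentage to a transaction count, rounding up (assumption). */
    static int minsupCount(int percent) {
        return (percent * DATABASE.length + 99) / 100;
    }

    static List<Integer> frequent(int minsup) {
        List<Integer> result = new ArrayList<>();
        for (int itemset = 1; itemset < (1 << ITEM_COUNT); itemset++) {
            if (support(itemset) >= minsup) {
                result.add(itemset);
            }
        }
        return result;
    }

    /**
     * Keeps the frequent itemsets that have no frequent proper superset.
     * With sameSupportOnly set, only supersets of equal support disqualify
     * an itemset, which yields the closed itemsets instead of the maximal ones.
     */
    static List<Integer> filter(int minsup, boolean sameSupportOnly) {
        List<Integer> freq = frequent(minsup);
        List<Integer> result = new ArrayList<>();
        for (int itemset : freq) {
            boolean keep = true;
            for (int other : freq) {
                boolean properSuperset = other != itemset && (other & itemset) == itemset;
                if (properSuperset && (!sameSupportOnly || support(other) == support(itemset))) {
                    keep = false;
                    break;
                }
            }
            if (keep) {
                result.add(itemset);
            }
        }
        return result;
    }

    static List<Integer> closed(int minsup) { return filter(minsup, true); }

    static List<Integer> maximal(int minsup) { return filter(minsup, false); }

    /** Formats a bitmask as an itemset such as {1, 2, 3, 5}. */
    static String format(int itemset) {
        List<Integer> items = new ArrayList<>();
        for (int i = 0; i < ITEM_COUNT; i++) {
            if ((itemset & (1 << i)) != 0) {
                items.add(i + 1);
            }
        }
        return items.toString().replace('[', '{').replace(']', '}');
    }

    public static void main(String[] args) {
        int minsup = minsupCount(40); // 40 % of 5 transactions = 2
        System.out.println(frequent(minsup).size() + " frequent, "
                + closed(minsup).size() + " closed");
        for (int itemset : maximal(minsup)) {
            System.out.println(format(itemset) + " support " + support(itemset));
        }
    }
}
```

On the example database, this sketch finds 15 frequent itemsets and 5 frequent closed itemsets, but only one frequent maximal itemset, which shows how compact the maximal representation can be.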
How should I interpret the results?
In the results, each frequent maximal itemset is annotated with its support. For example, the itemset {1, 2, 3, 5} is a maximal itemset having a support of 2 because it appears in transactions t3 and t5. The itemset {2, 5} has a support of 4 and is not a maximal itemset because it is included in {2, 3, 5}, which is a frequent itemset.
Input file format
The input file format used by Charm-MFI is defined as follows. It is a text file. An item is represented by a positive integer. A transaction is a line in the text file. In each line (transaction), items are separated by a single space. It is assumed that all items within the same transaction (line) are sorted according to a total order (e.g. ascending order) and that no item can appear twice within the same line.
For example, for the previous example, the input file is defined as follows:
1 3 4
2 3 5
1 2 3 5
2 5
1 2 3 5
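The format rules above can be checked with a small parser. This is a minimal sketch and not SPMF's own reader: the class TransactionFileParser is hypothetical, it assumes ascending order as the total order, and skipping blank lines is an assumption. The lines could come, for instance, from Files.readAllLines(Path.of("contextPasquier99.txt")).

```java
import java.util.ArrayList;
import java.util.List;

class TransactionFileParser {
    /**
     * Parses one transaction per line, with items separated by single spaces.
     * Rejects items that are not positive or not in strictly ascending order,
     * which also rejects an item that appears twice in a line.
     */
    static List<List<Integer>> parse(List<String> lines) {
        List<List<Integer>> database = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // assumption: blank lines are ignored
            }
            List<Integer> transaction = new ArrayList<>();
            int previous = 0;
            for (String token : line.trim().split(" ")) {
                int item = Integer.parseInt(token);
                if (item <= previous) {
                    throw new IllegalArgumentException(
                            "Items must be positive and strictly ascending: " + line);
                }
                transaction.add(item);
                previous = item;
            }
            database.add(transaction);
        }
        return database;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("1 3 4", "2 3 5", "1 2 3 5", "2 5", "1 2 3 5");
        System.out.println(parse(lines).size()); // prints 5
    }
}
```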
Note that it is also possible to use the ARFF format as an alternative to the default input format. The specification of the ARFF format can be found here. Most features of the ARFF format are supported except that (1) the character "=" is forbidden and (2) escape characters are not considered. Note that when the ARFF format is used, the data mining algorithms will run slightly slower than with the native SPMF file format, because the input file is automatically converted before the algorithm is launched and the result also has to be converted. This cost should however be small.
Output file format
The output file format is defined as follows. It is a text file, where each line represents a maximal itemset. On each line, the items of the itemset are listed first. Each item is represented by an integer and is followed by a single space. After all the items, the keyword "#SUP:" appears, followed by an integer indicating the support of the itemset, expressed as a number of transactions. For example, we show below the output file for this example, which consists of a single line. This line indicates that the maximal itemset consisting of item 1, item 2, item 3 and item 5 has a support of 2 transactions.
1 2 3 5 #SUP: 2
Note that if the ARFF format is used as input instead of the default input format, the output format will be the same except that items will be represented by strings instead of integers.
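A program that consumes the output file can split each line at the "#SUP:" keyword. The sketch below is one possible way to do it and is not part of SPMF: the classes OutputLineParser and ParsedPattern are hypothetical. Items are kept as strings so that the same code also accepts output produced from ARFF input.

```java
import java.util.Arrays;
import java.util.List;

class OutputLineParser {
    /** Hypothetical holder for one parsed output line. */
    static final class ParsedPattern {
        final List<String> items;
        final int support;

        ParsedPattern(List<String> items, int support) {
            this.items = items;
            this.support = support;
        }
    }

    /** Parses a line such as "1 2 3 5 #SUP: 2". */
    static ParsedPattern parse(String line) {
        String keyword = "#SUP:";
        int marker = line.indexOf(keyword);
        if (marker < 0) {
            throw new IllegalArgumentException("Missing " + keyword + " in: " + line);
        }
        // Items come before the keyword, separated by single spaces.
        List<String> items = Arrays.asList(line.substring(0, marker).trim().split(" "));
        // The support is the first token after the keyword.
        String afterKeyword = line.substring(marker + keyword.length()).trim();
        int support = Integer.parseInt(afterKeyword.split(" ")[0]);
        return new ParsedPattern(items, support);
    }

    public static void main(String[] args) {
        ParsedPattern pattern = parse("1 2 3 5 #SUP: 2");
        System.out.println(pattern.items + " " + pattern.support); // prints [1, 2, 3, 5] 2
    }
}
```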
Optional parameter(s)
This implementation of Charm_MFI accepts the following optional parameter(s):
- "show transaction ids?" (true/false) This parameter specifies whether the ids of the transactions containing a pattern should be output for each pattern found. For example, if the parameter is set to true, each pattern in the output file will be followed by the keyword #TID: followed by a list of transaction ids (integers separated by spaces). For example, a line terminated by "#TID: 0 2" means that the pattern on this line appears in the first and the third transactions of the transaction database (transactions with ids 0 and 2).
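Since transaction ids in the output start at 0 while the example tables name transactions t1, t2, ..., the ids must be shifted by one when relating them to the tables. The sketch below extracts the ids from a line and converts them. It is a hypothetical helper, not part of SPMF, and the sample line used in main is an assumed shape in which the #TID: list follows the support.

```java
import java.util.ArrayList;
import java.util.List;

class TidListParser {
    /** Returns the transaction ids listed after "#TID:", or an empty list if absent. */
    static List<Integer> parseTids(String line) {
        String keyword = "#TID:";
        List<Integer> tids = new ArrayList<>();
        int marker = line.indexOf(keyword);
        if (marker < 0) {
            return tids;
        }
        String rest = line.substring(marker + keyword.length()).trim();
        if (rest.isEmpty()) {
            return tids;
        }
        for (String token : rest.split(" ")) {
            tids.add(Integer.parseInt(token));
        }
        return tids;
    }

    /** Converts a 0-based id to the label used in the example tables (0 becomes t1). */
    static String toLabel(int tid) {
        return "t" + (tid + 1);
    }

    public static void main(String[] args) {
        // Assumed line shape for the example pattern {1, 2, 3, 5}.
        for (int tid : parseTids("1 2 3 5 #SUP: 2 #TID: 2 4")) {
            System.out.println(toLabel(tid)); // prints t3 then t5
        }
    }
}
```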
These parameter(s) are available in the GUI of SPMF and also in the example(s) "MainTestCharmMFI_SaveToFile.java" provided in the source code of SPMF.
The parameter(s) can also be used in the command line with the Jar file. To use these optional parameter(s) in the command line, append them after the other parameters. Consider this example:
java -jar spmf.jar run Charm_MFI contextPasquier99.txt output.txt 40% true
This command applies the algorithm to the file "contextPasquier99.txt" and writes the results to "output.txt". Moreover, it specifies that the user wants to find patterns for minsup = 40% of the transactions, and that transaction ids should be output for each pattern found.
Performance
The Charm-MFI algorithm is not a very efficient algorithm because it finds frequent maximal itemsets by post-processing instead of finding them directly.
A more efficient algorithm for mining maximal itemsets named FPMax is provided in SPMF.
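The post-processing idea can be sketched as follows: given the frequent closed itemsets, keep only those that have no proper superset among them. This is a naive quadratic filter written for illustration under that assumption. It is not the procedure from the original description of Charm-MFI nor SPMF's corrected implementation, and the class ClosedToMaximal is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class ClosedToMaximal {
    /** Keeps the closed itemsets that have no proper superset in the same list. */
    static List<Set<Integer>> keepMaximal(List<Set<Integer>> closedItemsets) {
        List<Set<Integer>> maximal = new ArrayList<>();
        for (Set<Integer> candidate : closedItemsets) {
            boolean hasProperSuperset = false;
            for (Set<Integer> other : closedItemsets) {
                if (other.size() > candidate.size() && other.containsAll(candidate)) {
                    hasProperSuperset = true;
                    break;
                }
            }
            if (!hasProperSuperset) {
                maximal.add(candidate);
            }
        }
        return maximal;
    }

    public static void main(String[] args) {
        // Frequent closed itemsets of contextPasquier99.txt for minsup = 40 %.
        List<Set<Integer>> closed = List.of(
                Set.of(3), Set.of(1, 3), Set.of(2, 5),
                Set.of(2, 3, 5), Set.of(1, 2, 3, 5));
        for (Set<Integer> itemset : keepMaximal(closed)) {
            System.out.println(new TreeSet<>(itemset)); // prints [1, 2, 3, 5]
        }
    }
}
```

The filter is sufficient because every frequent itemset is contained in a frequent closed itemset with the same support, so a closed itemset without a closed proper superset has no frequent proper superset at all.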
Optional feature: giving names to items
Some users have requested the ability to give names to items instead of using numbers. This feature is offered in the user interface of SPMF and in the command line of SPMF. To use this feature, your file must include @CONVERTED_FROM_TEXT as its first line, followed by several lines that define the names of the items in your file. For example, consider the example database "contextPasquier99.txt". Here we have modified the file to give names to the items:
@CONVERTED_FROM_TEXT
@ITEM=1=apple
@ITEM=2=orange
@ITEM=3=tomato
@ITEM=4=milk
@ITEM=5=bread
1 3 4
2 3 5
1 2 3 5
2 5
1 2 3 5
In this file, the first line indicates that it is a file where names are given to items. Then, the second line indicates that the item 1 is called "apple". The third line indicates that the item 2 is called "orange", and so on for items 3 to 5. The lines after the item definitions define transactions in the SPMF format.
Then, if we apply the algorithm to this file using the user interface of SPMF or the command line, the output file contains this pattern:
apple orange tomato bread #SUP: 2
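The mapping from item numbers to names can be sketched as below. This only illustrates the file convention described above. It is not SPMF's ResultConverter class, and the class ItemNameTranslator and its methods are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ItemNameTranslator {
    /** Collects the names declared by lines of the form "@ITEM=<id>=<name>". */
    static Map<String, String> readNames(List<String> lines) {
        Map<String, String> names = new HashMap<>();
        for (String line : lines) {
            if (line.startsWith("@ITEM=")) {
                String[] parts = line.split("=", 3); // "@ITEM", id, name
                if (parts.length == 3) {
                    names.put(parts[1], parts[2]);
                }
            }
        }
        return names;
    }

    /** Replaces item ids by names in a line such as "1 2 3 5 #SUP: 2". */
    static String translate(String outputLine, Map<String, String> names) {
        int marker = outputLine.indexOf("#SUP:");
        String itemPart = marker < 0 ? outputLine : outputLine.substring(0, marker);
        String rest = marker < 0 ? "" : outputLine.substring(marker);
        StringBuilder result = new StringBuilder();
        for (String token : itemPart.trim().split(" ")) {
            // Items without a declared name are left unchanged.
            result.append(names.getOrDefault(token, token)).append(' ');
        }
        return result.append(rest).toString().trim();
    }

    public static void main(String[] args) {
        Map<String, String> names = readNames(List.of(
                "@CONVERTED_FROM_TEXT", "@ITEM=1=apple", "@ITEM=2=orange",
                "@ITEM=3=tomato", "@ITEM=4=milk", "@ITEM=5=bread"));
        System.out.println(translate("1 2 3 5 #SUP: 2", names));
        // prints apple orange tomato bread #SUP: 2
    }
}
```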
Note that this feature could also be used from the source code of SPMF using the ResultConverter class. However, there is currently no example provided for using it from the source code.
Where can I get more information about the Charm-MFI algorithm?
The Charm-MFI algorithm is described in this thesis (in French only):
L. Szathmary (2006). Symbolic Data Mining Methods with the Coron Platform.
Also, for a good overview of frequent itemset mining algorithms, you may read this survey paper.