Mining Perfectly Rare Itemsets using the AprioriInverse Algorithm (SPMF documentation)
This example explains how to run the AprioriInverse algorithm using the SPMF open-source data mining library.
How to run this example?
- If you are using the graphical interface, (1) choose the "AprioriInverse" algorithm, (2) select the input file "contextInverse.txt", (3) set the output file name (e.g. "output.txt") (4) set minsup = 0.1 % and maxsup of 61 % and (5) click "Run algorithm".
- If you want to execute this example from the command line,
then execute this command:
java -jar spmf.jar run AprioriInverse contextZart.txt output.txt 10% 61% in a folder containing spmf.jar and the example input file contextInverse.txt. - If you are using the source code version of SPMF, launch the file "MainTestAprioriInverse.java" in the package ca.pfv.SPMF.tests.
What is AprioriInverse?
AprioriInverse is an algorithm mining perfectly rare itemsets. Why mining perfectly rare itemsets? One reason is that it is useful for generating the set of sporadic association rules.
What is the input?
The input is a transaction database (aka binary context) and two thresholds named minsup and maxsup (a value between 0 and 100 %), where maxsup > minsup.
A transaction database is a set of transactions. Each transaction is a set of items. For example, consider the following transaction database. It contains 5 transactions (t1, t2, ..., t5) and 5 items (1,2, 3, 4, 5). For example, the first transaction represents the set of items 1, 2, 4 and 5. This database is provided as the file contextInverse.txt in the SPMF distribution. It is important to note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted by lexicographical order in a transaction.
Transaction id | Items |
t1 | {1, 2, 4, 5} |
t2 | {1, 3} |
t3 | {1, 2, 3, 5} |
t4 | {2, 3} |
t5 | {1, 2, 4, 5} |
What is the output?
The output of AprioriInverse is the set of all perfectly rare itemsets in the database such that their support is lower than maxsup and higher than minsup. To explain what it a perfectly rare itemset, it is necessary to review a few definitions. An itemset is an unordered set of distinct items. The support of an itemset is the number of transactions that contain the itemset divided by the total number of transactions. For example, the itemset {1, 2} has a support of 60% because it appears in 3 transactions out of 5 in the previous database (it appears in t1, t2 and t5). A frequent itemset is an itemset that has a support no less than the maxsup parameter. A perfectly rare itemset (aka sporadic itemset) is an itemset that is not a frequent itemset and that all its proper subsets are also not frequent itemsets. Moreover, it has to have a support higher or equal to the minsup threshold.
By running the AprioriInverse algorithm with minsup = 0.1 % and maxsup of 61 % and this transaction database, we obtain the following set of perfectly rare itemsets (see Koh & Roundtree 2005 for further details):
Perfectly Rare Itemsets | Support |
{3} | 60 % |
{4} | 40 % |
{5} | 60 % |
{4, 5} | 40 % |
{3, 5} | 20 % |
Input file format
The input file format of AprioriInverse is defined as follows. It is a text file. An item is represented by a positive integer. A transaction is a line in the text file. In each line (transaction), items are separated by a single space. It is assumed that all items within a same transaction (line) are sorted according to a total order (e.g. ascending order) and that no item can appear twice within the same line.
For example, for the previous example, the input file is defined as follows:
1 2 4 5
1 3
1 2 3 5
2 3 5
1 2 3 5
Note that it is also possible to use the ARFF format as an alternative to the default input format. The specification of the ARFF format can be found here. Most features of the ARFF format are supported except that (1) the character "=" is forbidden and (2) escape characters are not considered. Note that when the ARFF format is used, the performance of the data mining algorithms will be slightly less than if the native SPMF file format is used because a conversion of the input file will be automatically performed before launching the algorithm and the result will also have to be converted. This cost however should be small.
Output file format
The output file format of AprioriInverse is defined as follows. It is a text file, where each line represents a perfectly rare itemset. On each line, the items of the itemset are first listed. Each item is represented by an integer and it is followed by a single space. After, all the items, the keyword "#SUP:" appears, which is followed by an integer indicating the support of the itemset, expressed as a number of transactions. For example, we show below the output file for this example.
3 #SUP: 3
4 #SUP: 2
5 #SUP: 3
3 5 #SUP: 1
4 5 #SUP: 2
The output file here consists of five lines which indicate that the itemsets {3}, {4}, {5}, {3, 5}, {4, 5} are perfectly rare itemsets having respectively a support of 3, 2, 3 1 and 2 transactions.
Note that if the ARFF format is used as input instead of the default input format, the output format will be the same except that items will be represented by strings instead of integers.
Optional parameter(s) :
There is an alternative implementation of AprioriInverse in SPMF called "AprioriInverse_TID". This implementation is based on AprioriTID instead of the standard Apriori algorithm. The key difference is that the identifiers of transactions where patterns are found are kept in memory to avoid scanning the database. This can be faster on some datasets. Beside, this implementation offers an additional parameter:
"show transaction ids?" (true/false) This parameter allows to specify that transaction ids of transactions containing a pattern should be output for each pattern found. For example, if the parameter is set to true, each pattern in the output file will be followed by the keyword #TID followed by a list of transaction ids (integers separated by space). For example, a line terminated by "#TID: 0 2" means that the pattern on this line appears in the first and the third transactions of the transaction database (transactions with ids 0 and 2).
Optional feature: giving names to items
Some users have requested the feature of given names to items instead of using numbers. This feature is offered in the user interface of SPMF and in the command line of SPMF. To use this feature, your file must include @CONVERTED_FROM_TEXT as first line and then several lines to define the names of items in your file. For example, consider the example database "contextInverse.txt". Here we have modified the file to give names to the items:
@CONVERTED_FROM_TEXT
@ITEM=1=apple
@ITEM=2=orange
@ITEM=3=tomato
@ITEM=4=milk
@ITEM=5=bread
1 2 4 5
1 3
1 2 3 5
2 3
1 2 4 5
In this file, the first line indicates, that it is a file where names are given to items. Then, the second line indicates that the item 1 is called "apple". The third line indicates that the item 2 is called "orange". Then the following lines define transactions in the SPMF format.
Then, if we apply the algorithm using this file using the user interface of SPMF or the command line, the output file contains several patterns:
tomato #SUP: 3
milk #SUP: 2
bread #SUP: 3
tomato bread #SUP: 1
milk bread #SUP: 2
Note that this feature could be also used from the source code of SPMF using the ResultConverter class. However, there is currently no example provided for using it from the source code.
Performance
AprioriInverse is the only algorithm for perfectly rare itemset mining offered in SPMF. Since it is based on Apriori, it suffers from the same fundamental limitations (it may generate too much candidates and may generate candidates that do not appear in the database).
Where can I get more information about this algorithm?
The AprioriInverse algorithm is described in this p aper:
Yun Sing Koh, Nathan Rountree: Finding Sporadic Rules Using Apriori-Inverse. PAKDD 2005: 97-106
For a good overview of frequent itemset mining algorithms, you may read this survey paper.