Mining Perfectly Sporadic Association Rules (SPMF documentation)

This example explains how to mine perfectly sporadic association rules using the SPMF open-source data mining library.

How to run this example?

What is this algorithm?

This is an algorithm for mining perfectly sporadic association rules. The algorithm first uses AprioriInverse to generate perfectly rare itemsets. Then, it uses these itemsets to generate the association rules.

What is the input?

The input of this algorithm is a transaction database and three thresholds named minusp, maxsup and minconf. A transaction database is a set of transactions. A transaction is a set of distinct items (symbols), assumed to be sorted in lexical order. For example, the following transactions database contains 5 transactions (t1,t2...t5) and 5 items (1,2,3,4,5). This database is provided in the file "contextInverse.txt" of the SPMF distribution:

Transaction id Items
t1 {1, 2, 4, 5}
t2 {1, 3}
t3 {1, 2, 3, 5}
t4 {2, 3}
t5 {1, 2, 4, 5}

What is the output?

The output is the set of perfectly sporadic association rules respecting the minconf (a value in [0,1]), minsup (a value in [0,1]) and maxsup (a value in [0,1]) parameters.

To explain what it a perfectly sporadic association rule, we need to review some definitions. An itemset is an unordered set of distinct items. The support of an itemset is the number of transactions that contain the itemset divided by the total number of transactions. For example, the itemset {1, 2} has a support of 60% because it appears in 3 transactions out of 5 (it appears in t1, t2 and t5). A frequent itemset is an itemset that has a support that is no less than the maxsup parameter.

A perfectly rare itemset (aka sporadic itemset) is an itemset that is not a frequent itemset and that all its subsets are also not frequent itemsets. Moreover, it has to have a support higher or equal to the minsup threshold.

An association rule X==>Y is a relationship between two itemsets (sets of items) X and Y such that the intersection of X and Y is empty. The support of a rule is the number of transactions that contains X∪Y divided by the total number of transactions. The confidence of a rule is the number of transactions that contains X∪Y divided by the number of transactions that contain X.

A perfectly sporadic association rule X==>Y is an association rule such that the confidence is higher or equal to minconf and the support of any non empty subset of X∪Y is lower than maxsup.

For example, let's apply the algorithm with minsup = 0.1 %, maxsup of 60 % and minconf = 60 %.

The first step that the algorithm perform is to apply AprioriInverse algorithm with minsup = 0.1 % and maxsup of 60 %. The result is the following set of perfectly rare itemsets:

Perfectly Rare Itemsets Support
{3} 60 %
{4} 40 %
{5} 60 %
{4, 5} 40 %
{3, 5} 20 %

Then, the second step is to generate all perfectly sporadic association rules respecting minconf by using the perfectly rare itemsets found in the first step. The result is :

Rule Support Confidence
5 ==> 4 40 % 60 %
4 ==> 5 40 % 100 %

How to interpret the result?

For example, consider the rule 5 ==> 4. It means that if item 5 appears in a transaction, it is likely to be associated with item 4 with a confidence of 60 % (because 5 and 4 appears together in 40% of the transactions where 5 appears). Moreover, this rule has a support of 40 % because it appears in 40% of the transactions of this database.

Input file format

The input file format is a text file containing transactions. Each lines represents a transaction. The items in the transaction are listed on the corresponding line. An item is represented by a positive integer. Each item is separated from the following item by a space. It is assumed that items are sorted according to a total order and that no item can appear twice in the same transaction. For example, for the previous example, the input file is defined as follows:

1 2 4 5
1 3
1 2 3 5
2 3
1 2 4 5

Consider the first line. It means that the first transaction is the itemset {1, 2, 4 and 5}. The following lines follow the same format.

Note that it is also possible to use the ARFF format as an alternative to the default input format. The specification of the ARFF format can be found here. Most features of the ARFF format are supported except that (1) the character "=" is forbidden and (2) escape characters are not considered. Note that when the ARFF format is used, the performance of the data mining algorithms will be slightly less than if the native SPMF file format is used because a conversion of the input file will be automatically performed before launching the algorithm and the result will also have to be converted. This cost however should be small.

Output file format

The output file format is defined as follows. It is a text file, where each line represents an association rule. On each line, the items of the rule antecedent are first listed. Each item is represented by an integer, followed by a single space. After, that the keyword "==>" appears followed by a space. Then, the items of the rule consequent are listed. Each item is represented by an integer, followed by a single space. Then, the keyword " #SUP: " appears followed by the support of the rule represented by an integer. Then, the keyword " #CONF: " appears followed by the confidence of the rule represented by a double value (a value between 0 and 1, inclusively). For example, here is the output file for this example:

5 ==> 4 #SUP: 2 #CONF: 0,6
4 ==> 5 #SUP: 2 #CONF: 1

For example, the first line indicates that the association rule {5} --> {4} has a support of 2 transactions and a confidence of 60 %. The second line follow the same format.

Note that if the ARFF format is used as input instead of the default input format, the output format will be the same except that items will be represented by strings instead of integers.

Optional feature: giving names to items

Some users have requested the feature of given names to items instead of using numbers. This feature is offered in the user interface of SPMF and in the command line of SPMF. To use this feature, your file must include @CONVERTED_FROM_TEXT as first line and then several lines to define the names of items in your file. For example, consider the example database "contextInverse.txt". Here we have modified the file to give names to the items: 

@CONVERTED_FROM_TEXT
@ITEM=1=apple
@ITEM=2=orange
@ITEM=3=tomato
@ITEM=4=milk
@ITEM=5=bread
1 2 4 5
1 3
1 2 3 5
2 3
1 2 4 5

In this file, the first line indicates, that it is a file where names are given to items. Then, the second line indicates that the item 1 is called "apple". The third line indicates that the item 2 is called "orange". Then the following lines define four sequences in the SPMF format.

Then, if we apply a sequential pattern mining algorithm using this file using the user interface of SPMF or the command line, the output file contains several patterns, including the following ones:

bread ==> milk #SUP: 2 #CONF: 0.6666666666666666
milk ==> bread #SUP: 2 #CONF: 1.0

Note that this feature could be also used from the source code of SPMF using the ResultConverter class. However, there is currently no example provided for using it from the source code.

Where can I get more information about this algorithm?

The AprioriInverse algorithm and how to generate sporadic rules are described in this paper:

Yun Sing Koh, Nathan Rountree: Finding Sporadic Rules Using Apriori-Inverse. PAKDD 2005: 97-106.

For a good overview of itemset mining and association rule mining, you may read this survey paper.