Mining All Association Rules using the GCD Algorithm ( SPMF documentation)

This example explains how to run the GCD algorithm using the SPMF open-source data mining library.

How to run this example?

What is this algorithm?

This algorithm finds the association rules in a given transaction database or sequence of transactions/events, using GCD calculations for prime numbers.It is an original algorithm implemented by Ahmed El-Serafy and Hazem El-Raffiee.

What is the input?

The input is a transaction database (aka binary context) and three thresholds named minsup (a value between 0 and 1), minconf (a value between 0 and 1), and maxcomb (a positive integer).

A transaction database is a set of transactions. Each transaction is a set of distinct items. For example, consider the following transaction database. It contains 6 transactions (t1, t2, ..., t5, t6) and 5 items (1,2, 3, 4, 5). For example, the first transaction represents the set of items 1, 2, 4 and 5. This database is provided as the file contextIGB.txt in the SPMF distribution. It is important to note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted by lexicographical order in a transaction.

Transaction id Items
t1 {1, 2, 4, 5}
t2 {2, 3, 5}
t3 {1, 2, 4, 5}
t4 {1, 2, 3, 5}
t5 {1, 2, 3, 4, 5}
t6 {2, 3, 4}

What is the output?

The output of an association rule mining algorithm is a set of association rules respecting the user-specified minsup and minconf thresholds. To explain how this algorithm works, it is necessary to review some definitions. An association rule X==>Y is a relationship between two itemsets (sets of items) X and Y such that the intersection of X and Y is empty. The support of a rule is the number of transactions that contains X∪Y. The confidence of a rule is the number of transactions that contains X∪Y divided by the number of transactions that contain X.

If we apply an association rule mining algorithm, it will return all the rules having a support and confidence respectively no less than minsup and minconf.

For example, by applying the algorithm with minsup = 0.5 (50%), minconf = 0.6 (60%), and maxcomb = 3, we obtain 56 associations rules (run the example in the SPMF distribution to see the result).

Now let's explain the "maxcomb" parameter taken by the GCD algorithm. This parameter is used by the algorithm when finding the GCD (greatest common divisors) between two transactions. For example, consider 385, which comes from the multiplication of (5, 7 and 11), this actually means that (5), (7), (11), (5, 7), (5, 11), (7, 11), (5, 7, 11) are all common combinations between these two transactions. For larger GCD's, calculating all combinations grows exponentially in both time and memory. Hence, we introduced this parameter, to limit the maximum combinations' length generated from a single GCD. Although increasing this number might seem to provide more accurate results, the experiments showed that larger association rules occur at lower support (less important to the user). Hence, setting this parameter to values from 1 to 4 produces reasonable results.

Input file format

The input file format is a text file containing transactions. Each lines represents a transaction. The items in the transaction are listed on the corresponding line. An item is represented by a positive integer. Each item is separated from the following item by a space. It is assumed that items are sorted according to a total order and that no item can appear twice in the same transaction. For example, for the previous example, the input file is defined as follows:

1 2 4 5
2 3 5
1 2 4 5
1 2 3 5
1 2 3 4 5
2 3 4

Consider the first line. It means that the first transaction is the itemset {1, 2, 4 and 5}. The following lines follow the same format.

Note that it is also possible to use the ARFF format as an alternative to the default input format. The specification of the ARFF format can be found here. Most features of the ARFF format are supported except that (1) the character "=" is forbidden and (2) escape characters are not considered. Note that when the ARFF format is used, the performance of the data mining algorithms will be slightly less than if the native SPMF file format is used because a conversion of the input file will be automatically performed before launching the algorithm and the result will also have to be converted. This cost however should be small.

Output file format

The output file format is defined as follows. It is a text file, where each line represents an association rule. On each line, the items of the rule antecedent are first listed. Each item is represented by an integer, followed by a single space. After, that the keyword "==>" appears followed by a space. Then, the items of the rule consequent are listed. Each item is represented by an integer, followed by a single space. Then, the keyword " #SUP: " appears followed by the support of the rule represented by an integer. Then, the keyword " #CONF: " appears followed by the confidence of the rule represented by a double value (a value between 0 and 1, inclusively). For example, here is a few lines from the output file for this example:

1 ==> 2 4 5 #SUP: 3 #CONF: 0,75
5 ==> 1 2 4 #SUP: 3 #CONF: 0,6
4 ==> 1 2 5 #SUP: 3 #CONF: 0,75

For example, the first line indicates that the association rule {1} --> {2, 4, 5} has a support of 3 transactions and a confidence of 75 %. The other lines follow the same format.

Note that if the ARFF format is used as input instead of the default input format, the output format will be the same except that items will be represented by strings instead of integers.

Optional feature: giving names to items

Some users have requested the feature of given names to items instead of using numbers. This feature is offered in the user interface of SPMF and in the command line of SPMF. To use this feature, your file must include @CONVERTED_FROM_TEXT as first line and then several lines to define the names of items in your file. For example, consider the example database "contextIGB.txt". Here we have modified the file to give names to the items: 

1 2 4 5
2 3 5
1 2 4 5
1 2 3 5
1 2 3 4 5
2 3 4

In this file, the first line indicates, that it is a file where names are given to items. Then, the second line indicates that the item 1 is called "apple". The third line indicates that the item 2 is called "orange". Then the following lines define transactions in the SPMF format.

Then, if we apply the algorithm using this file using the user interface of SPMF or the command line, the output file contains several patterns, including the following ones:

apple ==> milk bread #SUP: 3 #CONF: 0.75
tomato bread ==> orange #SUP: 3 #CONF: 1.0
orange bread ==> tomato #SUP: 3 #CONF: 0.6
orange tomato ==> bread #SUP: 3 #CONF: 0.75

Note that this feature could be also used from the source code of SPMF using the ResultConverter class. However, there is currently no example provided for using it from the source code.

Where can I get more information about this algorithm?

The GCD Association Rules algorithm is an original algorithm. More information about it can be obtained from the bitbucket repository dedicated to this algorithm: