Mining Indirect Association Rules with the INDIRECT algorithm (SPMF documentation)

This example explains how to run the INDIRECT algorithm using the SPMF open-source data mining library.

How to run this example?

If you are using the graphical interface, (1) choose the "Indirect_association_rules" algorithm, (2) select the input file "contextIndirect.txt", (3) set the output file name (e.g. "output.txt"), (4) set minsup = 60 %, ts = 50 % and minconf = 10 %, and (5) click "Run algorithm".
If you want to execute this example from the command line, run the following command in a folder containing spmf.jar and the example input file contextIndirect.txt:
java -jar spmf.jar run Indirect_association_rules contextIndirect.txt output.txt 60% 50% 10%
If you are using the source code version of SPMF, launch the file "MainTestIndirectRules_saveToFile.java" in the package ca.pfv.SPMF.tests.
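
For scripted runs, the command-line call above can be wrapped in a few lines of Python. This is only a sketch: the helper names build_indirect_command and run_indirect are hypothetical (they are not part of SPMF), and it assumes that java is on the PATH and that spmf.jar is in the current folder.

```python
import subprocess


def build_indirect_command(input_path, output_path, minsup="60%", ts="50%",
                           minconf="10%", jar_path="spmf.jar"):
    """Return the argument list for the SPMF command line documented above."""
    return ["java", "-jar", jar_path, "run", "Indirect_association_rules",
            input_path, output_path, minsup, ts, minconf]


def run_indirect(input_path, output_path, **options):
    """Run SPMF; raises subprocess.CalledProcessError if java exits with an error."""
    subprocess.run(build_indirect_command(input_path, output_path, **options),
                   check=True)
```

Calling run_indirect("contextIndirect.txt", "output.txt") is then equivalent to typing the command by hand.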

What is the INDIRECT algorithm?

Indirect (Tan et al., PKDD 2000; Tan, Steinbach & Kumar, 2006, p.469) is an algorithm for discovering indirect associations between items in transaction databases.

Why is this algorithm important? Traditional association rule mining algorithms focus on direct associations between itemsets. This algorithm can discover indirect associations, which can be useful in domains such as biology. Indirect association rule mining also has applications such as stock market analysis and competitive product analysis (Tan et al., 2000).

What is the input?

The input of the Indirect algorithm is a transaction database and three parameters named minsup, ts and minconf. Each parameter is a value in [0,1] that represents a percentage.

A transaction database is a set of transactions. Each transaction is a set of distinct items. For example, consider the following transaction database. It contains 5 transactions (t1, t2, ..., t5) and 5 items (1, 2, 3, 4, 5). For example, the first transaction represents the set of items 1 and 4. This database is provided as the file contextIndirect.txt in the SPMF distribution. It is important to note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted by lexicographical order in a transaction.

Transaction id	Items
t1	{1, 4}
t2	{2, 3, 4}
t3	{1, 2, 4, 5}
t4	{4, 5}
t5	{1, 2, 4, 5}

The three numeric parameters of the indirect algorithm are:

minsup : called the "mediator minimum support".
ts : the itempair support threshold. Unlike minsup, it is an upper bound: the two indirectly associated items must appear together in fewer than ts of the transactions.
minconf : the minimum confidence required for indirect associations (note that the original article uses the IS measure instead of the confidence).

What is the output?

The result is all indirect associations respecting the parameters minsup, ts and minconf. An indirect association has the form {x,y} ==> M, where x and y are single items and M is an itemset called the "mediator".

An indirect association has to respect the following conditions:

The number of transactions containing all items of {x}∪ M divided by the total number of transactions must be greater than or equal to minsup.
The number of transactions containing all items of {y}∪ M divided by the total number of transactions must be greater than or equal to minsup.
The number of transactions containing {x,y} divided by the total number of transactions must be smaller than ts.
The confidence of {x} with respect to M and the confidence of {y} with respect to M must both be greater than or equal to minconf. The confidence of an itemset X with respect to another itemset Y is defined as the number of transactions that contain both X and Y divided by the number of transactions that contain X.
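
For reference, the four conditions can be written compactly as follows. The notation is an assumption introduced here for illustration, not taken from SPMF: σ(X) denotes the number of transactions containing the itemset X, and |D| denotes the total number of transactions.

```latex
\begin{aligned}
\frac{\sigma(\{x\} \cup M)}{|D|} &\ge \mathit{minsup}, &
\frac{\sigma(\{y\} \cup M)}{|D|} &\ge \mathit{minsup}, \\
\frac{\sigma(\{x, y\})}{|D|} &< \mathit{ts}, \\
\frac{\sigma(\{x\} \cup M)}{\sigma(\{x\})} &\ge \mathit{minconf}, &
\frac{\sigma(\{y\} \cup M)}{\sigma(\{y\})} &\ge \mathit{minconf}.
\end{aligned}
```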

For example, by applying the indirect algorithm with minsup = 60 %, ts = 50 % and minconf= 10%, we obtain 3 indirect association rules:

{1, 2 | {4}}, which means that 1 and 2 are indirectly associated by the mediator {4}.
{1, 5 | {4}}, which means that 1 and 5 are indirectly associated by the mediator {4}.
{2, 5 | {4}}, which means that 2 and 5 are indirectly associated by the mediator {4}.

To see additional details about each of these three indirect rules, run this example.
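
The three rules can also be checked by hand with a brute-force enumeration. The sketch below is not the INDIRECT algorithm and not the SPMF implementation; it simply tests the four conditions on every candidate (x, y, mediator), which is only practical for tiny databases. The function names are hypothetical, and the transactions are those of contextIndirect.txt as listed in the "Input file format" section.

```python
from itertools import combinations

# Transactions of contextIndirect.txt, one set of items per transaction.
DATABASE = [{1, 4}, {2, 3, 4}, {1, 2, 4, 5}, {4, 5}, {1, 2, 4, 5}]


def support(itemset, database):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for transaction in database if itemset <= transaction)


def indirect_associations(database, minsup, ts, minconf):
    """Return (x, y, mediator, sup_x, sup_y, conf_x, conf_y) for every rule found."""
    n = len(database)
    items = sorted(set().union(*database))
    results = []
    for x, y in combinations(items, 2):
        # The item pair itself must be infrequent (support below ts).
        if support({x, y}, database) / n >= ts:
            continue
        others = [item for item in items if item not in (x, y)]
        for size in range(1, len(others) + 1):
            for mediator in combinations(others, size):
                sup_x = support(set(mediator) | {x}, database)
                sup_y = support(set(mediator) | {y}, database)
                # Both items must be frequent together with the mediator.
                if sup_x / n < minsup or sup_y / n < minsup:
                    continue
                # Confidence of each item with respect to the mediator.
                conf_x = sup_x / support({x}, database)
                conf_y = sup_y / support({y}, database)
                if conf_x >= minconf and conf_y >= minconf:
                    results.append((x, y, mediator, sup_x, sup_y, conf_x, conf_y))
    return results


for rule in indirect_associations(DATABASE, minsup=0.6, ts=0.5, minconf=0.1):
    print(rule)
```

With minsup = 0.6, ts = 0.5 and minconf = 0.1, the loop prints one tuple for each of the pairs (1, 2), (1, 5) and (2, 5), each with the mediator (4,).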

Input file format

The input file format is a text file containing transactions. Each line represents a transaction. The items in the transaction are listed on the corresponding line. An item is represented by a positive integer. Each item is separated from the following item by a space. It is assumed that items are sorted according to a total order and that no item can appear twice in the same transaction. For example, for the previous example, the input file is defined as follows:

1 4
2 3 4
1 2 4 5
4 5
1 2 4 5

This file contains five lines (five transactions). Consider the first line. It means that the first transaction is the itemset {1, 4}. The following lines follow the same format.
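
The format is simple enough to read and validate in a few lines. The following sketch uses a hypothetical function, parse_transactions, which is not SPMF code. It assumes that the total order on items is the numeric order, and that blank lines and lines starting with "#", "%" or "@" carry comments or metadata rather than transactions.

```python
def parse_transactions(text):
    """Parse SPMF transaction text into a list of lists of integer items."""
    database = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        # Assumption: these lines are comments or metadata, not transactions.
        if not line or line[0] in "#%@":
            continue
        items = [int(token) for token in line.split()]
        if any(item <= 0 for item in items):
            raise ValueError(f"line {line_number}: items must be positive integers")
        # Strictly increasing items means sorted and without duplicates.
        if any(a >= b for a, b in zip(items, items[1:])):
            raise ValueError(f"line {line_number}: items must be sorted and distinct")
        database.append(items)
    return database


EXAMPLE = """1 4
2 3 4
1 2 4 5
4 5
1 2 4 5
"""

print(parse_transactions(EXAMPLE))
```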

Note that it is also possible to use the ARFF format as an alternative to the default input format. The specification of the ARFF format can be found here. Most features of the ARFF format are supported except that (1) the character "=" is forbidden and (2) escape characters are not considered. Note that when the ARFF format is used, the algorithm runs slightly slower than with the native SPMF file format, because the input file is automatically converted before the algorithm is launched and the result also has to be converted. This cost should however be small.

Output file format

The output file format is defined as follows. It is a text file where each line represents an indirect association rule. Each line starts with "(a=x b=y | mediator=M )", indicating that the line represents the rule {x,y} ==> M, where x and y are integers representing items and M lists the items of the mediator. Then, the keyword "#sup(a,mediator)=" is followed by the support of {x}∪ M expressed as a number of transactions (an integer). Then, the keyword "#sup(b,mediator)=" is followed by the support of {y}∪ M expressed as a number of transactions (an integer). Then, the keyword "#conf(a,mediator)= " is followed by the confidence of a with respect to the mediator, expressed as a double value in the [0, 1] interval. Finally, the keyword "#conf(b,mediator)= " is followed by the confidence of b with respect to the mediator, expressed as a double value in the [0, 1] interval.

For example, the output file of this example is:

(a=1 b=2 | mediator=4 ) #sup(a,mediator)= 3 #sup(b,mediator)= 3 #conf(a,mediator)= 1.0 #conf(b,mediator)= 1.0
(a=1 b=5 | mediator=4 ) #sup(a,mediator)= 3 #sup(b,mediator)= 3 #conf(a,mediator)= 1.0 #conf(b,mediator)= 1.0
(a=2 b=5 | mediator=4 ) #sup(a,mediator)= 3 #sup(b,mediator)= 3 #conf(a,mediator)= 1.0 #conf(b,mediator)= 1.0

This file contains three lines (three indirect association rules). Consider the first line. It indicates that items 1 and 2 are indirectly associated by the item 4 as mediator. Furthermore, it indicates that the support of {1, 4} is 3 transactions, the support of {2, 4} is 3 transactions, the confidence of item 1 with respect to item 4 is 100 % and the confidence of item 2 with respect to item 4 is 100 %. The other lines follow the same format.
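
To post-process the results, each output line can be split into its fields with a regular expression. The sketch below is not part of SPMF: parse_rule is a hypothetical helper, and it assumes that a mediator with several items is written as items separated by spaces. Items are kept as strings so that the same code would also accept item names.

```python
import re

# One output line: "(a=x b=y | mediator=M ) #sup(...)= n #sup(...)= n #conf(...)= v #conf(...)= v"
RULE_PATTERN = re.compile(
    r"\(a=\s*(?P<a>\S+)\s+b=\s*(?P<b>\S+)\s*\|\s*mediator=\s*(?P<mediator>.*?)\s*\)\s*"
    r"#sup\(a,mediator\)=\s*(?P<sup_a>\d+)\s*"
    r"#sup\(b,mediator\)=\s*(?P<sup_b>\d+)\s*"
    r"#conf\(a,mediator\)=\s*(?P<conf_a>\S+)\s*"
    r"#conf\(b,mediator\)=\s*(?P<conf_b>\S+)"
)


def parse_rule(line):
    """Parse one output line into a dict; raise ValueError if it does not match."""
    match = RULE_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"not an indirect association line: {line!r}")
    return {
        "a": match["a"],
        "b": match["b"],
        # Assumption: mediator items are separated by spaces.
        "mediator": match["mediator"].split(),
        "sup_a": int(match["sup_a"]),
        "sup_b": int(match["sup_b"]),
        "conf_a": float(match["conf_a"]),
        "conf_b": float(match["conf_b"]),
    }


FIRST_LINE = ("(a=1 b=2 | mediator=4 ) #sup(a,mediator)= 3 #sup(b,mediator)= 3 "
              "#conf(a,mediator)= 1.0 #conf(b,mediator)= 1.0")
print(parse_rule(FIRST_LINE))
```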

Note that if the ARFF format is used as input instead of the default input format, the output format will be the same except that items will be represented by strings instead of integers.

Optional feature: giving names to items

Some users have requested the feature of giving names to items instead of using numbers. This feature is offered in the user interface of SPMF and in the command line of SPMF. To use this feature, your file must include @CONVERTED_FROM_TEXT as its first line, followed by several lines that define the names of the items in your file. For example, consider the example database "contextIndirect.txt". Here we have modified the file to give names to the items:

@CONVERTED_FROM_TEXT
@ITEM=1=apple
@ITEM=2=orange
@ITEM=3=tomato
@ITEM=4=milk
@ITEM=5=bread
1 4
2 3 4
1 2 4 5
4 5
1 2 4 5

In this file, the first line indicates that it is a file where names are given to items. Then, the second line indicates that the item 1 is called "apple". The third line indicates that the item 2 is called "orange", and so on for the other items. The remaining lines define transactions in the SPMF format.
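
The header lines are easy to read back into a mapping from item numbers to names. The sketch below is an illustration only; read_item_names is a hypothetical helper and is unrelated to the ResultConverter class of SPMF.

```python
def read_item_names(lines):
    """Map item numbers to names using the @ITEM=<number>=<name> header lines."""
    names = {}
    for line in lines:
        line = line.strip()
        if line.startswith("@ITEM="):
            # Split "@ITEM=1=apple" into the number "1" and the name "apple".
            item_id, name = line[len("@ITEM="):].split("=", 1)
            names[int(item_id)] = name
    return names


HEADER = [
    "@CONVERTED_FROM_TEXT",
    "@ITEM=1=apple",
    "@ITEM=2=orange",
    "@ITEM=3=tomato",
    "@ITEM=4=milk",
    "@ITEM=5=bread",
]

print(read_item_names(HEADER))
```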

Then, if we apply the algorithm to this file using the user interface of SPMF or the command line, the output file contains several patterns, including the following ones:

(a= apple b= orange | mediator= milk ) #sup(a,mediator)= 3 #sup(b,mediator)= 3 #conf(a,mediator)= 1.0 #conf(b,mediator)= 1.0
(a= apple b= bread | mediator= milk ) #sup(a,mediator)= 3 #sup(b,mediator)= 3 #conf(a,mediator)= 1.0 #conf(b,mediator)= 1.0
(a= orange b= bread | mediator= milk ) #sup(a,mediator)= 3 #sup(b,mediator)= 3 #conf(a,mediator)= 1.0 #conf(b,mediator)= 1.0

Note that this feature could also be used from the source code of SPMF using the ResultConverter class. However, there is currently no example provided for using it from the source code.

Implementation details

The implementation attempts to be as faithful as possible to the original algorithm, except that the confidence is used instead of the IS measure.

Note that some algorithms, such as HI-Mine, are claimed to be more efficient than Indirect, but they have not been implemented in SPMF.

Where can I get more information about indirect association rules?

The concept of indirect associations was proposed by Tan et al. (2000) in this conference paper:

Pang-Ning Tan, Vipin Kumar, Jaideep Srivastava: Indirect Association: Mining Higher Order Dependencies in Data. PKDD 2000: 632-637

Moreover, note that the book "Introduction to Data Mining" by Tan, Steinbach and Kumar provides an overview of indirect association rules that is easy to read.

For a good overview of itemset mining and association rule mining, you may read this survey paper.