Building, updating incrementally and using an Itemset-Tree to generate targeted frequent itemsets and association rules (SPMF documentation)

This example explains how to run the Itemset-Tree algorithm using the SPMF open-source data mining library.

How to run this example?

What is an itemset-tree?

An itemset-tree is a special data structure that can be used for performing efficient queries about itemsets and association rules in a transaction database without having to generate all of them beforehand.

An itemset-tree has the nice property of being incremental, which means that new transactions can be added to an existing itemset tree very efficiently without having to rebuild the tree from scratch. An itemset-tree also has the property of being compact.

How to use it?

An itemset-tree is built by inserting a set of transactions into the tree. A transaction is simply a set of distinct items. For example, we could insert the following 6 transactions (t1,t2...t5) into an itemset-tree. In this example, the transaction t1 represents the set of items {1, 4}. This set of transactions is provided in the file "contextItemsetTree.txt" of the SPMF distribution.

transaction IDs items
t1 {1, 4}
t2 {2, 5}
t3 {1, 2, 3, 4, 5}
t4 {1, 2, 4}
t5 {2, 5}
t6 {2, 4}

The result of the insertion of these six transactions is the following itemset-tree (see the article by Kubat for more details).

{} sup=6
[2 ] sup=3
[2 5 ] sup=2
[2 4 ] sup=1
[1 ] sup=3
[1 2 ] sup=2
[1 2 4 ] sup=1
[1 2 3 4 5 ] sup=1
[1 4 ] sup=1

The root is the empty itemset {} and the leafs are {2, 5}, {2, 4}, {1 2 4},{1 2 3 4 5} and {1, 4}.

Once an itemset-tree has been created, it is possible to update it by inserting a new transaction. For example, in this example provided in the source code, we update the previous tree by adding a new transaction {4, 5}. The result is this tree:

{} sup=7
[2 ] sup=3
[2 5 ] sup=2
[2 4 ] sup=1
[1 ] sup=3
[1 2 ] sup=2
[1 2 4 ] sup=1
[1 2 3 4 5 ] sup=1
[1 4 ] sup=1
[4 5 ] sup=1

Next, it is shown how to query the tree to determine the support of a target itemset efficiently. For example, if we execute the query of finding the support of the itemset {2}, the support is determined to be 5 because 2 appear in 5 transactions.

After that the source code offers an example of how to use the itemset tree to get all itemsets that subsume an itemset and to get their support. For example, if we use the itemset {1 2} for this query the result is:

[1 2 ] supp:2
[1 2 3 ] supp:1
[1 2 4 ] supp:2
[1 2 5 ] supp:1
[1 2 3 4 ] supp:1
[1 2 3 5 ] supp:1
[1 2 4 5 ] supp:1
[1 2 3 4 5 ] supp:1

Another example provided is how to use the tree to find all itemsets that subsume an itemset such that the support is higher or equal to a user-specified threshold named minsup (a positive integer representing a number of transactions). For example, if we execute this query with the itemset {1} and minsup =2, we get this result:

[1 ] supp:3
[1 2 ] supp:2
[1 4 ] supp:3
[1 2 4 ] supp:2 Lastly, another example is how to generate all association rules having a target itemset as antecedent and a support and confidence respectively higher or equal to some user-specificed thresholds minsup (a positive integer representing a number of transactions) and minconf (a value between 0 and 1). For example, if the target itemset is {1} and minconf = 0.1 and minsup = 2, the result is:[ 1 ] ==> [2 ] sup=2 conf=0.666666666666666

[ 1 ] ==> [4 ] sup=3 conf=1.0

[ 1 ] ==> [2 4 ] sup=2 conf=0.66666666666666

Input and output file format

There is no need to use an input and output file with an itemset tree because it is an incremental data structure that is designed for live update and live targeted queries rather than batch processing.

However, it is possible to load a transaction database in an itemset tree. In this case, a file is loaded. The file is defined as a text file where each line represents a transactions. Each item is represented by an integer and it is assumed that all transactions are sorted according to a total order and that no item can appear twice in the same transaction. On any given line, the items of the corresponding transaction are listed such that each item is separated from the following item by a single space. For example, the file "contextItemsetTree.txt" that is provided contains the following content:1 4
2 5
1 2 3 4 5
1 2 4
2 5
2 4

There is a total of six transactions (six lines) in the file. The first line represents the transaction {1, 4} (containing items 1 and 4). The second line represents the transaction {2, 5}. The third line represents the transaction {1, 2, 3, 4, 5}. The following lines follow the same format.

Performance

The itemset-tree is an efficient data structure for the case of a database that needs to be updated frequently and where targeted queries need to be performed. For details about the complexity in terms of space and time, please refer to the article by Kubat et al., which provides an extensive discussion of the complexity

Where can I get more information about the Itemset-tree data structure and related algorithms?

This article describes the itemset-tree and related algorithms for querying it:

Miroslav Kubat, Aladdin Hafez, Vijay V. Raghavan, Jayakrishna R. Lekkala, Wei Kian Chen: Itemset Trees for Targeted Association Querying. IEEE Trans. Knowl. Data Eng. 15(6): 1522-1534 (2003)

For a good overview of frequent itemset mining algorithms, you may read this survey paper.