Building, updating incrementally and using a Memory-Efficient Itemset-Tree to generate targeted frequent itemsets and association rules (SPMF documentation)

This example explains how to run the Memory-Efficient Itemset-Tree algorithm using the SPMF open-source data mining library.

How to run this example?

What is a Memory-Efficient Itemset-Tree (MEIT)?

An itemset-tree (IT) is a special data structure that can be used for performing efficient queries about itemsets and association rules in a transaction database without having to generate all of them beforehand.

An itemset-tree has the nice property of being incremental, which means that new transactions can be added to an existing itemset tree very efficiently without having to rebuild the tree from scratch. An itemset-tree also has the property of being compact.

The Memory-Efficient Itemset-Tree (MEIT) is a modification of the original Itemset-Tree structure that uses about twice less memory than the regular itemset-tree (see the paper describing MEIT for a performance comparison). But it runs about twice slower. Therefore, choosing between using an IT or MEIT is a trade-off between memory and speed.

How to use it?

A Memory-Efficient Itemset-Tree (MEIT) is built by inserting a set of transactions into the tree. A transaction is simply a set of distinct items. For example, we could insert the following 6 transactions (t1,t2...t5) into an itemset-tree. In this example, the transaction t1 represents the set of items {1, 4}. This set of transactions is provided in the file "contextItemsetTree.txt" of the SPMF distribution.

transaction IDs items
t1 {1, 4}
t2 {2, 5}
t3 {1, 2, 3, 4, 5}
t4 {1, 2, 4}
t5 {2, 5}
t6 {2, 4}

The result of the insertion of these six transactions is the following MEIT.

{} sup=6
[2 ] sup=3
[5 ] sup=2
[4 ] sup=1
[1 ] sup=3
[2 ] sup=2
[4 ] sup=1
[3 5 ] sup=1
[4 ] sup=1

The root is the empty itemset {} and the leafs are {5}, {4}, {4},{3 5} and {4}.

Once an itemset-tree has been created, it is possible to update it by inserting a new transaction. For example, in this example provided in the source code, we update the previous tree by adding a new transaction {4, 5}. The result is this tree:

{} sup=7
[2 ] sup=3
[5 ] sup=2
[4 ] sup=1
[1 ] sup=3
[2 ] sup=2
[4 ] sup=1
[3 5 ] sup=1
[4 ] sup=1
[4 5 ] sup=1

Next, it is shown how to query the tree to determine the support of a target itemset efficiently. For example, if we execute the query of finding the support of the itemset {2}, the support is determined to be 5 because 2 appear in 5 transactions.

After that the source code offers an example of how to use the itemset tree to get all itemsets that subsume an itemset and to get their support. For example, if we use the itemset {1 2} for this query the result is:

[1 2 ] supp:2
[1 2 3 ] supp:1
[1 2 4 ] supp:2
[1 2 5 ] supp:1
[1 2 3 4 ] supp:1
[1 2 3 5 ] supp:1
[1 2 4 5 ] supp:1
[1 2 3 4 5 ] supp:1

Another example provided is how to use the tree to find all itemsets that subsume an itemset such that the support is higher or equal to a user-specified threshold named minsup (a positive integer representing a number of transactions). For example, if we execute this query with the itemset {1} and minsup =2, we get this result:

[1 ] supp:3
[1 2 ] supp:2
[1 4 ] supp:3
[1 2 4 ] supp:2

Lastly, another example is how to generate all association rules having a target itemset as antecedent and a support and confidence respectively higher or equal to some user-specificed thresholds minsup (a positive integer representing a number of transactions) and minconf (a value between 0 and 1). For example, if the target itemset is {1} and minconf = 0.1 and minsup = 2, the result is:

[ 1 ] ==> [2 ] sup=2 conf=0.666666666666666

[ 1 ] ==> [4 ] sup=3 conf=1.0

[ 1 ] ==> [2 4 ] sup=2 conf=0.66666666666666

Input and output file format

There is no need to use an input and output file with aa memory-efficient itemset tree because it is an incremental data structure that is designed for live update and live targeted queries rather than batch processing.

However, it is possible to load a transaction database in a memory-efficient itemset tree. In this case, a file is loaded. The file is defined as a text file where each line represents a transactions. Each item is represented by an integer and it is assumed that all transactions are sorted according to a total order and that no item can appear twice in the same transaction. On any given line, the items of the corresponding transaction are listed such that each item is separated from the following item by a single space. For example, the file "contextItemsetTree.txt" that is provided contains the following content:

1 4
2 5
1 2 3 4 5
1 2 4
2 5
2 4

There is a total of six transactions (six lines) in the file. The first line represents the transaction {1, 4} (containing items 1 and 4). The second line represents the transaction {2, 5}. The third line represents the transaction {1, 2, 3, 4, 5}. The following lines follow the same format.

Performance

The Memory-Efficient Itemset-Tree (MEIT) is an efficient data structure for the case of a database that needs to be updated frequently and where targeted queries need to be performed on itemsets and association rules.

The MEIT is a modification of the original Itemset-Tree (MEIT). According to our experiments, the MEIT uses about twice less memory than the IT but is about twice slower for answering queries. Therefore, choosing between MEIT and IT is a compromise between speed and memory.

Where can I get more information about the Itemset-tree data structure and related algorithms?

This article describes the Memory-Efficient Itemset-tree:

Fournier-Viger, P., Mwamikazi, E., Gueniche, T., Faghihi, U. (2013). Memory Efficient Itemset Tree for Targeted Association Rule Mining. Proc. 9th International Conference on Advanced Data Mining and Applications (ADMA 2013) Part II, Springer LNAI 8347, pp. 95-106.

For a good overview of frequent itemset mining algorithms, you may read this survey paper.