Mining Recent Frequent Itemsets from a Data Stream using the estDec+ Algorithm (SPMF documentation)
This example explains how to run the estDec+ algorithm using the SPMF open-source data mining library.
How to run this example?
- This example is not available in the release version of SPMF.
- To run this example with the source code version of SPMF, launch the file "MainTestEstDecPlus_saveToFile.java" in the package ca.pfv.SPMF.tests.
What is estDec+?
estDec+ is an algorithm for mining recent frequent itemsets from a data stream. It is an extension of estDec proposed by Chang et al. in 2005. The main difference with estDec is to use a compressed tree to maintain information about recent frequent itemsets, which may be more memory efficient in some cases but may decrease accuracy. Note that the version of estDec+ implemented here is based on the 2014 paper by Chang et al.
Why is it useful? Because most itemset mining algorithms such as Apriori, FPGrowth and Eclat are batch algorithms. This means that if the input transaction database is updated, those algorithms need to be run again from zero to update the result, which is inefficient. Stream mining algorithms such as estDec+ are designed for discovering patterns in a stream (a potentially infinite sequence of transactions) and for updating the results incrementally after each new transaction. Stream mining algorithms assume that each transaction in a database can only be read once. The estDec+ algorithm is also interesting because it mines recent frequent itemsets, which means that it put more weight on recent transactions than on older transactions when searching for recent frequent itemsets. This allows estDec+ to learn new trends and to forgot older trends.
What is the input of estDec+?
The input of estDec+ is a stream of transactions and a support threshold minsup. Each transaction is a set of items (symbols). For example, consider the following six transactions (t1, t2, ..., t6) and 5 items (1,2, 3, 4, 5). For example, the first transaction represents the set of items 1, 2, 4 and 5. It is important to note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted by lexicographical order in a transaction. estDec+ is an algorithm for processing a stream. This means that estDec+ is allowed to read each transaction only once because a stream is assumed to be potentially infinite and coming at high speed.
Transaction ID | Items |
t1 | {1, 2, 4, 5} |
t2 | {2, 3, 5} |
t3 | {1, 2, 4, 5} |
t4 | {1, 2, 3, 5} |
t5 | {1, 2, 3, 4, 5} |
t6 | {2, 3, 5} |
What is the output of estDec+?
estDec+ produces as output the set of recent frequent itemsets contained in the transactions that estDec+ has seen until now. It is said that estDec mines recent frequent itemsets because estDec utilizes a decay function so that estDec puts more weight on recent transactions than on older ones (frequency of an itemset). This allows estDec+ to learn new trends and to forgot older trends.
The output is a set of recent frequent itemsets. The support count of an itemset is the number of transactions that contains the itemset. For example, the itemset {1, 2, 4} has a support count of 1 because it only appear in t1. The support of an itemset is the number of transaction were the itemset appears divided by the total number of transactions seen until now. A frequent itemset is an itemset that has a support higher or equal to minsup.
The estDec+ algorithm is an approximate algorithm. It approximate the support of itemsets and returns itemsets that have an estimated support higher or equal to minsup.
For example, consider the example MainTestEstDecPlus_saveToFile.java. This example consists of loading the transactions from a file named "contextIGB.txt" provided in the SPMF distribution. Then, this example show how to save the result to a file. Here is the output:
2 5 #SUP: 1.0
1 4 5 #SUP: 0.5
1 2 3 #SUP: 0.5
5 #SUP: 1.0
1 2 5 #SUP: 0.5
1 #SUP:0.66
1 5 #SUP: 0.5555555555555556
1 2 4 #SUP: 0.5
4 5 #SUP: 0.5
2 4 #SUP: 0.5
1 4 #SUP: 0.5555555555555556
1 3 #SUP: 0.5555555555555556
4 #SUP: 0.5
1 3 5 #SUP: 0.5
2 3 #SUP:0.66
1 2 #SUP: 0.5555555555555556
3 4 #SUP:0.66
2 #SUP: 1.0
3 5 #SUP:0.66
2 4 5 #SUP: 0.5
3 #SUP:0.66
2 3 5 #SUP:0.66
For example, consider line 1. It indicates that the pattern {1, 2, 5} is a recent frequent itemsets with an estimated support of 50%
Note that we also provide a second example named MainTestEstDec_saveToMemory.java. This example shows how to process a set of transactions from memory instead of from a file and to keep the result into memory instead of saving the result to a file. This is especially useful, if you wish to integrate estDec into another Java program. The example also shows how to set the decay rate.
Input file format
The estDec algorithm can either take as input a stream in memory or read transactions from a file. The input file format of estDec is defined as follows. It is a text file. An item is represented by a positive integer. A transaction is a line in the text file. In each line (transaction), items are separated by a single space. It is assumed that all items within a same transaction (line) are sorted according to a total order (e.g. ascending order) and that no item can appear twice within the same line.
For example, for the previous example, the input file is defined as follows:
1 2 4 5
2 3 5
1 2 4 5
1 2 3 5
1 2 3 4 5
2 3 5
Output file format
The output file format of estDec+ is defined as follows. It is a text file, where each line represents a recently frequent itemset. On each line, the items of the itemset are first listed. Each item is represented by an integer and it is followed by a single space. After, all the items, the keyword "#SUP:" appears, which is followed by an integer indicating the support of the itemset, expressed as a number of transactions. For example, here are few lines of the output file for this example.
1 2 3 #SUP: 0.5
5 #SUP: 1.0
1 2 5 #SUP: 0.5
The output file here consists of the first line indicates that the itemset {1, 2, 3} has an estimated support of 50 %.
Performance
estDec+ is a reasonably efficient algorithm. When minsup is high, it may use less memory than the original estDec algorithm because the CP-Tree is generally smaller than the estTree.
Where can I get more information about this algorithm?
The estDec+ algorithm is described in this paper:
Se Jung Shin , Dae Su Lee , Won Suk Lee, “CP-tree: An adaptive synopsis structure for compressing frequent itemsets over online data streams”, Information Sciences,Volume 278, 10 September 2014, Pages 559–576
For a good overview of frequent itemset mining algorithms, you may read this survey paper.