Mining Recent Frequent Itemsets from a Data Stream using the estDec Algorithm (SPMF documentation)

This example explains how to run the estDec algorithm using the SPMF open-source data mining library.

How to run this example?

This example is not available in the release version of SPMF.
To run this example with the source code version of SPMF, launch the file "MainTestEstDec_saveToFile.java" in the package ca.pfv.SPMF.tests.

What is estDec?

estDec is an algorithm for mining recent frequent itemsets from a data stream. It was proposed by Chang et al. (2003).

Why is it useful? Because most itemset mining algorithms such as Apriori, FPGrowth and Eclat are batch algorithms. This means that if the input transaction database is updated, those algorithms need to be run again from zero to update the result, which is inefficient. Stream mining algorithms such as estDec are designed for discovering patterns in a stream (a potentially infinite sequence of transactions) and for updating the results incrementally after each new transaction. Stream mining algorithms assume that each transaction in a database can only be read once. The estDec algorithm is also interesting because it mines recent frequent itemsets, which means that it put more weight on recent transactions than on older transactions when searching from frequent itemsets. This allows estDec to learn new trends and to forgot older trends.

What is the input of estDec?

The input of estDec is a stream of transactions and a support threshold minsup. Each transaction is a set of items (symbols). For example, consider the following six transactions (t1, t2, ..., t6) and 5 items (1,2, 3, 4, 5). For example, the first transaction represents the set of items 1, 2, 4 and 5. It is important to note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted by lexicographical order in a transaction. estDec is an algorithm for processing a stream. This means that estDec is allowed to read each transaction only once because a stream is assumed to be potentially infinite and coming at high speed.

Transaction ID	Items
t1	{1, 2, 4, 5}
t2	{2, 3, 5}
t3	{1, 2, 4, 5}
t4	{1, 2, 3, 5}
t5	{1, 2, 3, 4, 5}
t6	{2, 3, 5}

What is the output of estDec?

estDec produces as output the set of recent frequent itemsets contained in the transactions that estDec has seen until now. It is said that estDec mines recent frequent itemsets because estDec utilizes a decay function so that estDec puts more weight on recent transactions than on older ones (frequency of an itemset). This allows estDec to learn new trends and to forgot older trends.

The output is a set of recent frequent itemsets. The support count of an itemset is the number of transactions that contains the itemset. For example, the itemset {1, 2, 4} has a support count of 1 because it only appear in t1. The support of an itemset is the number of transaction were the itemset appears divided by the total number of transactions seen until now. A frequent itemset is an itemset that has a support higher or equal to minsup.

The estDec algorithm is an approximate algorithm. It approximate the support of itemsets and returns itemsets that have an estimated support higher than minsup.

For example, consider the example MainTestEstDec_saveToFile.java. This example consists of loading the transactions from a file named "contextIGB.txt" provided in the SPMF distribution. Then, this example show how to save the result to a file. Here is the output:

3 5 #SUP: 0.5000519860383547
2 #SUP: 0.8333622131312072
1 2 3 #SUP: 0.33335643690463074
3 #SUP: 0.5000519860383547
1 4 #SUP: 0.3333448844517001
3 4 #SUP: 0.19334881331065332
1 3 5 #SUP: 0.33335643690463074
1 2 5 #SUP: 0.5000173262771588
2 5 #SUP: 0.8333622131312072
1 #SUP: 0.5000173262771588
2 3 5 #SUP: 0.5000519860383547
1 5 #SUP: 0.5000173262771588
2 3 #SUP: 0.5000519860383547
4 #SUP: 0.3333448844517001
1 4 5 #SUP: 0.3333448844517001
2 4 5 #SUP: 0.3333448844517001
1 2 #SUP: 0.5000173262771588
5 #SUP: 0.8333622131312072
1 3 #SUP: 0.33335643690463074
2 4 #SUP: 0.3333448844517001
1 2 4 #SUP: 0.3333448844517001
4 5 #SUP: 0.3333448844517001

For example, consider line 1. It indicates that the pattern {1, 2, 5} is a recent frequent itemsets with an estimated support of 50%

Note that we also provide a second example named MainTestEstDec_saveToMemory.java. This example shows how to process a set of transactions from memory instead of from a file and to keep the result into memory instead of saving the result to a file. This is especially useful, if you wish to integrate estDec into another Java program. The example also shows how to set the decay rate.

Input file format

The estDec algorithm can either take as input a stream in memory or read transactions from a file. The input file format of estDec is defined as follows. It is a text file. An item is represented by a positive integer. A transaction is a line in the text file. In each line (transaction), items are separated by a single space. It is assumed that all items within a same transaction (line) are sorted according to a total order (e.g. ascending order) and that no item can appear twice within the same line.

For example, for the previous example, the input file is defined as follows:

1 2 4 5
2 3 5
1 2 4 5
1 2 3 5
1 2 3 4 5
2 3 5

Output file format

The output file format of estDec is defined as follows. It is a text file, where each line represents a recently frequent itemset. On each line, the items of the itemset are first listed. Each item is represented by an integer and it is followed by a single space. After, all the items, the keyword "#SUP:" appears, which is followed by an integer indicating the support of the itemset, expressed as a number of transactions. For example, here are few lines of the output file for this example.

3 5 #SUP: 0.5000519860383547
2 #SUP: 0.8333622131312072
1 2 3 #SUP: 0.33335643690463074

The output file here consists of the first line indicates that the itemset {1, 2, 3} has an estimated support of 50 %.

Performance

estDec is a reasonably efficient algorithm.

Where can I get more information about this algorithm?

The estDec algorithm is described in this paper:

Joong Hyuk Chang, Won Suk Lee: Finding recent frequent itemsets adaptively over online data streams. KDD 2003: 487-492

For a good overview of frequent itemset mining algorithms, you may read this survey paper.