Mining Frequent Closed Itemsets from a Data Stream using the CloStream Algorithm (SPMF documentation)

This example explains how to run the CloStream algorithm using the SPMF open-source data mining library.

How to run this example?

What is CloStream?

CloStream is an algorithm for incrementally mining closed itemsets from a data stream. It was proposed by Yen et al. (2009).

Why is it useful? Because most closed itemset mining algorithms such as Charm, DCI_Closed and AprioriClose are batch algorithms. This means that if the transaction database is updated, we need to run the algorithms again to update the set of closed itemsets. If there is constant insertion of new transactions and the results need to be updated often, it may become very costly to use these algorithms. A stream mining algorithm like CloStream is specially designed to handle this situation. It assumes that each transaction in a database can only be read once and that new transaction appears regularly. Every time that a new transaction appear, the result is updated by CloStream.

What is the input of CloStream?

The input of CloStream is a stream of transactions. Each transaction is a set of items (symbols). For example, consider the following five transactions (t1, t2, ..., t5) and 5 items (1,2, 3, 4, 5). For example, the first transaction represents the set of items 1, 2 and 4. It is important to note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted by lexicographical order in a transaction. CloStream is an algorithm for processing a stream. This means that CloStream is allowed to read each transaction only once because a stream is assumed to be potentially infinite and coming at high speed.

Transaction ID Items
t1 {1,2 4}
t2 {2, 3, 5}
t3 {1, 2, 3, 5}
t4 {2, 5}
t5 {2, 3, 5}

What is the output of CloStream?

CloStream produces as output the set of closed itemsets contained in the transactions that it has seen until now. An itemset is an unordered set of distinct items. The support of an itemset is the number of transactions that contains the itemset. For example, the itemset {1, 2, 4} has a support of 1 because it only appear in t1. A closed itemset is an itemset that is not included in another itemset having the same support. For example, if we apply CloStream to the five following transactions, the final result is:
closed itemsets support
{} 5
{3} 4
{1, 3} 3
{1, 3, 4} 1
{2, 5} 4
{2, 3, 5} 3
{1, 2, 3, 5} 2

For example, the itemset {2, 3, 5} has a support of 3 because it appears in transactions t2, t4 and t5. It is a closed itemset because it has no proper superset having the same support.

Input and output file format

This is not applicable for this algorithm since it is designed for a stream of data (see the source code example referenced above to understand how to use this algorithm).

Performance

CloStream is a reasonably efficient algorithm. A limitation of this algorithm is that it is not possible to set a minimum support threshold. Therefore, if the number of closed itemsets is large, this algorithm may use too much memory. However, CloStream has the advantage of being very simple an easy to implement.

Where can I get more information about this algorithm?

The CloStream algorithm is described in this paper:

Show-Jane Yen, Yue-Shi Lee, Cheng-Wei Wu, Chin-Lin Lin: An Efficient Algorithm for Maintaining Frequent Closed Itemsets over Data Stream. IEA/AIE 2009: 767-776.

For a good overview of frequent itemset mining algorithms, you may read this survey paper.