# Mining Frequent Closed Itemsets from a Data Stream using the CloStream Algorithm (SPMF documentation)

This example explains how to run the **CloStream algorithm** using the SPMF open-source data mining library.

## How to run this example?

**This example is not available in the release version of SPMF.****To run this example with the source code version of SPMF,**launch the file**"MainTestCloStream.java"**in the package**ca.pfv.SPMF.tests**.

## What is CloStream?

**CloStream** is an algorithm for **incrementally
mining**** closed itemsets **from a data stream.
It was proposed by Yen et al. (2009).

Why is it useful? Because most closed itemset mining algorithms
such as Charm, DCI_Closed and AprioriClose are batch algorithms. This
means that if the transaction database is updated, we need to run the
algorithms again to update the set of closed itemsets. If there is
constant insertion of new transactions and the results need to be
updated often, it may become very costly to use these algorithms. A **stream
mining algorithm** like **CloStream** is specially
designed to handle this situation. It assumes that each transaction in
a database can only be read once and that new transaction appears
regularly. Every time that a new transaction appear, the result is
updated by CloStream.

## What is the input of CloStream?

The input of **CloStream** is a stream of
transactions. Each **transaction** is a set of items
(symbols). For example, consider the following five transactions (t1,
t2, ..., t5) and 5 items (1,2, 3, 4, 5). For example, the first
transaction represents the set of items 1, 2 and 4. It is important to
note that an item is not allowed to appear twice in the same
transaction and that items are assumed to be sorted by lexicographical
order in a transaction. **CloStream** is an algorithm for
processing a stream. This means that CloStream is allowed to read each
transaction only once because a stream is assumed to be potentially
infinite and coming at high speed.

Transaction ID |
Items |

t1 |
{1,2 4} |

t2 |
{2, 3, 5} |

t3 |
{1, 2, 3, 5} |

t4 |
{2, 5} |

t5 |
{2, 3, 5} |

## What is the output of CloStream?

**CloStream**produces as output the set of

**closed itemset**s contained in the transactions that it has seen until now. An itemset is an unordered set of distinct items. The

**support of an itemset**is the number of transactions that contains the itemset. For example, the itemset {1, 2, 4} has a support of 1 because it only appear in t1. A

**closed itemset**is an itemset that is not included in another itemset having the same support. For example, if we apply

**CloStream**to the five following transactions, the final result is:

closed itemsets |
support |

{} | 5 |

{3} | 4 |

{1, 3} | 3 |

{1, 3, 4} | 1 |

{2, 5} | 4 |

{2, 3, 5} | 3 |

{1, 2, 3, 5} | 2 |

For example, the itemset {2, 3, 5} has a support of 3 because it appears in transactions t2, t4 and t5. It is a closed itemset because it has no proper superset having the same support.

## Input and output file format

This is not applicable for this algorithm since it is designed for a stream of data (see the source code example referenced above to understand how to use this algorithm).

## Performance

**CloStream** is a reasonably efficient algorithm. A
limitation of this algorithm is that it is not possible to set a
minimum support threshold. Therefore, if the number of closed itemsets
is large, this algorithm may use too much memory. However, CloStream
has the advantage of being very simple an easy to implement.

## Where can I get more information about this algorithm?

The **CloStream** algorithm is described in this
paper:

Show-Jane Yen, Yue-Shi Lee, Cheng-Wei Wu, Chin-Lin Lin: An Efficient Algorithm for Maintaining Frequent Closed Itemsets over Data Stream. IEA/AIE 2009: 767-776.

For a good overview of **frequent itemset mining algorithms**, you may read this survey paper.