Mining Frequent Closed Itemsets using the LCM Algorithm (SPMF documentation)

This example explains how to run the LCM algorithm using the SPMF open-source data mining library.

How to run this example?

What is LCM?

LCM belongs to the LCM family of algorithms for mining frequent closed itemsets. LCM won the FIMI 2004 competition and is considered one of the fastest closed itemset mining algorithms.

In this implementation, we have attempted to replicate LCM v2 as used in FIMI 2004. Most of the key features of LCM have been replicated (anytime database reduction, occurrence delivery, etc.). However, a few optimizations have been left out for now (transaction merging, removing locally infrequent items). They may be added in a future version of SPMF.

What is the input of the LCM algorithm?

The input is a transaction database (aka binary context) and a threshold named minsup (a value between 0 and 100 %).

A transaction database is a set of transactions. Each transaction is a set of items. For example, consider the following transaction database. It contains 5 transactions (t1, t2, ..., t5) and 5 items (1, 2, 3, 4, 5). The first transaction represents the set of items 1, 3 and 4. This database is provided as the file contextPasquier99.txt in the SPMF distribution. Note that an item is not allowed to appear twice in the same transaction, and that the items of a transaction are assumed to be sorted according to a total order (e.g. ascending order).

Transaction id Items
t1 {1, 3, 4}
t2 {2, 3, 5}
t3 {1, 2, 3, 5}
t4 {2, 5}
t5 {1, 2, 3, 5}

What is the output of the LCM algorithm?

LCM outputs frequent closed itemsets. To explain what a frequent closed itemset is, it is necessary to review a few definitions.

An itemset is an unordered set of distinct items. The support of an itemset is the number of transactions that contain the itemset. For example, the itemset {1, 3} has a support of 3 because it appears in three transactions (t1, t3, t5) from the previous transaction database.
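The definition of support can be checked with a few lines of Python. This is only an illustrative sketch: the names `DATABASE` and `support` are made up for this example and are not part of SPMF.

```python
# The example database from this page (t1 to t5), one set per transaction.
DATABASE = [
    {1, 3, 4},     # t1
    {2, 3, 5},     # t2
    {1, 2, 3, 5},  # t3
    {2, 5},        # t4
    {1, 2, 3, 5},  # t5
]


def support(itemset, database):
    """Count the transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for transaction in database if items <= transaction)


print(support({1, 3}, DATABASE))  # prints 3 (t1, t3 and t5)
```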

A frequent itemset is an itemset that appears in at least minsup transactions from the transaction database. A frequent closed itemset is a frequent itemset that is not included in a proper superset having exactly the same support. The set of frequent closed itemsets is thus a subset of the set of frequent itemsets. Why is it interesting to discover frequent closed itemsets? The reason is that the set of frequent closed itemsets is usually much smaller than the set of frequent itemsets, and it can be shown that no information is lost by discovering only frequent closed itemsets, because all the frequent itemsets and their supports can be regenerated from the set of frequent closed itemsets (see Zaki (2002) for more details).

If we apply LCM on the previous transaction database with a minsup of 40 % (2 transactions), we get the following result:

frequent closed itemsets support
{3} 4
{1, 3} 3
{2, 5} 4
{2, 3, 5} 3
{1, 2, 3, 5} 2
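The result above can be reproduced by brute force on such a small database. The sketch below is not the LCM algorithm and not SPMF code: it enumerates every candidate itemset, which is only practical for tiny examples. It assumes the threshold is given as a number of transactions (2, which is 40 % of 5). The function names are hypothetical.

```python
from itertools import combinations

DATABASE = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}, {1, 2, 3, 5}]


def support(itemset, database):
    """Count the transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for transaction in database if items <= transaction)


def frequent_closed_itemsets(database, min_count):
    """Naive enumeration of frequent closed itemsets (not LCM)."""
    all_items = sorted(set().union(*database))

    # Step 1: keep every itemset whose support reaches the threshold.
    frequent = {}
    for size in range(1, len(all_items) + 1):
        for candidate in combinations(all_items, size):
            count = support(candidate, database)
            if count >= min_count:
                frequent[candidate] = count

    # Step 2: an itemset is closed if no proper superset has the same
    # support. Such a superset would itself be frequent, so it is enough
    # to look among the frequent itemsets.
    closed = {}
    for itemset, count in frequent.items():
        has_equal_superset = any(
            set(itemset) < set(other) and other_count == count
            for other, other_count in frequent.items()
        )
        if not has_equal_superset:
            closed[itemset] = count
    return closed


for itemset, count in frequent_closed_itemsets(DATABASE, 2).items():
    print(itemset, count)
```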

If you compare this result with the output of a frequent itemset mining algorithm like Apriori, you will notice that LCM finds only 5 closed itemsets, whereas Apriori finds 15 frequent itemsets. This shows that the set of frequent closed itemsets can be much smaller than the set of frequent itemsets.
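To illustrate why no information is lost, the following sketch rebuilds the frequent itemsets from the five closed itemsets listed above. It relies on the property that the support of a frequent itemset equals the largest support among the closed itemsets that contain it. The names `CLOSED` and `regenerate_frequent` are illustrative only and are not part of SPMF.

```python
from itertools import combinations

# The five frequent closed itemsets of the example, with their supports.
CLOSED = {
    frozenset({3}): 4,
    frozenset({1, 3}): 3,
    frozenset({2, 5}): 4,
    frozenset({2, 3, 5}): 3,
    frozenset({1, 2, 3, 5}): 2,
}


def regenerate_frequent(closed):
    """Derive all frequent itemsets and their supports from closed ones."""
    frequent = {}
    for closed_itemset, count in closed.items():
        items = sorted(closed_itemset)
        # Every non-empty subset of a closed itemset is frequent; its
        # support is the maximum over the closed itemsets containing it.
        for size in range(1, len(items) + 1):
            for subset in combinations(items, size):
                frequent[subset] = max(frequent.get(subset, 0), count)
    return frequent


frequent = regenerate_frequent(CLOSED)
print(len(frequent))     # prints 15
print(frequent[(2, 3)])  # prints 3
```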

How should I interpret the results?

In the results, each frequent closed itemset is annotated with its support. For example, the itemset {2, 3, 5} has a support of 3 because it appears in transactions t2, t3 and t5. It is a frequent itemset because its support is greater than or equal to the minsup parameter. It is a closed itemset because it has no proper superset having exactly the same support.

Input file format

The input file format used by LCM is defined as follows. It is a text file. An item is represented by a positive integer. A transaction is a line in the text file. In each line (transaction), items are separated by a single space. It is assumed that all items within the same transaction (line) are sorted according to a total order (e.g. ascending order) and that no item appears twice within the same line.

For example, for the previous example, the input file is defined as follows:

1 3 4
2 3 5
1 2 3 5
2 5
1 2 3 5
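The format rules above are simple enough to check before running the algorithm. The following sketch reads the file content from a string and validates each line. It is not SPMF's own reader: `parse_transactions` is a hypothetical helper, and two assumptions are made here, namely that the total order is ascending numeric order and that blank lines are skipped.

```python
EXAMPLE_INPUT = """1 3 4
2 3 5
1 2 3 5
2 5
1 2 3 5
"""


def parse_transactions(text):
    """Parse one transaction per line and check the format constraints."""
    transactions = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        items = [int(token) for token in line.split()]
        if any(item <= 0 for item in items):
            raise ValueError(f"line {line_number}: items must be positive")
        # Strictly ascending order rules out both unsorted lines and
        # items that appear twice in the same transaction.
        if any(a >= b for a, b in zip(items, items[1:])):
            raise ValueError(f"line {line_number}: items unsorted or repeated")
        transactions.append(items)
    return transactions


print(parse_transactions(EXAMPLE_INPUT))
```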

Note that it is also possible to use the ARFF format as an alternative to the default input format. The specification of the ARFF format is available online. Most features of the ARFF format are supported, except that (1) the character "=" is forbidden and (2) escape characters are not considered. Note that when the ARFF format is used, the data mining algorithms will be slightly slower than with the native SPMF file format, because the input file is automatically converted before launching the algorithm and the result also has to be converted. This cost should however be small.

Output file format

The output file format is defined as follows. It is a text file, where each line represents a frequent closed itemset. On each line, the items of the itemset are listed first. Each item is represented by an integer and is followed by a single space. After all the items, the keyword "#SUP:" appears, followed by an integer indicating the support of the itemset, expressed as a number of transactions. For example, we show below the output file for this example. The second line indicates the frequent closed itemset consisting of the items 1 and 3, and it indicates that this itemset has a support of 3 transactions.

3 #SUP: 4
1 3 #SUP: 3
2 5 #SUP: 4
2 3 5 #SUP: 3
1 2 3 5 #SUP: 2
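A line of this output format can be split on the "#SUP:" keyword. The sketch below uses a hypothetical helper, `parse_output_line`, which is not part of SPMF, and it assumes integer items, that is, the default input format rather than ARFF.

```python
def parse_output_line(line):
    """Split an output line into (list of items, support)."""
    items_part, keyword, support_part = line.partition("#SUP:")
    if not keyword:
        raise ValueError("missing #SUP: keyword")
    items = [int(token) for token in items_part.split()]
    return items, int(support_part)


print(parse_output_line("1 3 #SUP: 3"))  # prints ([1, 3], 3)
```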

Note that if the ARFF format is used as input instead of the default input format, the output format will be the same except that items will be represented by strings instead of integers.

Optional feature: giving names to items

Some users have requested the ability to give names to items instead of using numbers. This feature is offered in the user interface of SPMF and in the command line of SPMF. To use this feature, your file must include @CONVERTED_FROM_TEXT as its first line, followed by several lines that define the names of the items in your file. For example, consider the example database "contextPasquier99.txt". Here we have modified the file to give names to the items:

@CONVERTED_FROM_TEXT
@ITEM=1=apple
@ITEM=2=orange
@ITEM=3=tomato
@ITEM=4=milk
@ITEM=5=bread
1 3 4
2 3 5
1 2 3 5
2 5
1 2 3 5

In this file, the first line indicates that it is a file where names are given to items. Then, the second line indicates that the item 1 is called "apple". The third line indicates that the item 2 is called "orange". The remaining lines define transactions in the SPMF format.

Then, if we apply the algorithm to this file using the user interface of SPMF or the command line, the output file contains several patterns, including the following ones:

orange tomato bread #SUP: 3
orange bread #SUP: 4
apple orange tomato bread #SUP: 2
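The effect of the renaming can be imitated on an ordinary numeric output line. This sketch does not use SPMF or its ResultConverter class. The mapping contains only the names that can be read from the output above (1 is apple, 2 is orange, 3 is tomato, 5 is bread), and `rename_items` is a hypothetical helper that leaves an item without a known name as a number.

```python
# Names inferred from the output shown above; other items stay numeric.
ITEM_NAMES = {1: "apple", 2: "orange", 3: "tomato", 5: "bread"}


def rename_items(line, names):
    """Replace item numbers by names in one '#SUP:' output line."""
    items_part, keyword, support_part = line.partition("#SUP:")
    renamed = [names.get(int(token), token) for token in items_part.split()]
    return " ".join(renamed) + " " + keyword + support_part


print(rename_items("2 3 5 #SUP: 3", ITEM_NAMES))
# prints: orange tomato bread #SUP: 3
```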

Note that this feature could also be used from the source code of SPMF using the ResultConverter class. However, there is currently no example provided for using it from the source code.


Performance

There exist several algorithms for mining closed itemsets. LCM won the FIMI 2004 competition, so it is probably one of the best. In this implementation, we have attempted to replicate v2 of the algorithm, but some optimizations have been left out (transaction merging and removing locally infrequent items). The algorithm seems to perform very well on sparse datasets. According to some preliminary experiments, it can be faster than Charm, dCharm and DCI_closed on sparse datasets, but may perform less well on dense datasets.

Implementation details

In the source code version of SPMF, there are two versions of LCM. The version "" keeps the result in memory. The version named "" saves the result to a file. In the graphical user interface and the command line interface, only the second version is offered.

Where can I get more information about the LCM algorithm?

The following article describes the LCM v2 family of algorithms:

Takeaki Uno, Masashi Kiyomi and Hiroki Arimura (2004). LCM ver. 2: Efficient Mining Algorithms for Frequent/Closed/Maximal Itemsets. Proc. IEEE ICDM Workshop on Frequent Itemset Mining Implementations, Brighton, UK, November 1, 2004.

Also, for a good overview of frequent itemset mining algorithms, you may read this survey paper.