Mining Frequent Itemsets using the SSFIM Algorithm (SPMF documentation)

This example explains how to run the SSFIM algorithm using the SPMF open-source data mining library.

How to run this example?

What is SSFIM?

SSFIM is an algorithm for discovering frequent itemsets in a transaction database. SSFIM was proposed by Djenouri (2017). It is a brute force algorithms that generates all posibilities. Thus, it performs well on datasets with a very small search space. On eight standard benchmark datasets (see performance section below), it run out of memory on 7 out of 8 datasets for minsup = 0.99, while some algorithms like FPGrowth can finish sometimes in less than 1 second with just 50 Mb of Memory on thresholds as low as minsup = 0.01. Only for datasets where the search space is small like Chess, the SSFIM algorithm can run efficiently. See the "performance" section below for more details.

What is the input of the SSFIM algorithm?

The input is a transaction database (aka binary context) and a threshold named minsup (a value between 0 and 100 %).

A transaction database is a set of transactions. Each transaction is a set of items. For example, consider the following transaction database. It contains 5 transactions (t1, t2, ..., t5) and 5 items (1,2, 3, 4, 5). For example, the first transaction represents the set of items 1, 3 and 4. This database is provided as the file contextPasquier99.txt in the SPMF distribution. It is important to note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted by lexicographical order in a transaction.

Transaction id Items
t1 {1, 3, 4}
t2 {2, 3, 5}
t3 {1, 2, 3, 5}
t4 {2, 5}
t5 {1, 2, 3, 5}

What is the output of the SSFIM algorithm?

SSFIM is an algorithm for discovering itemsets (group of items) occurring frequently in a transaction database (frequent itemsets). A frequent itemset is an itemset appearing in at least minsup transactions from the transaction database, where minsup is a parameter given by the user.

For example, if SSFIM is run on the previous transaction database with a minsup of 40 % (2 transactions), SSFIM produces the following result:

itemsets support
{1} 3
{2} 4
{3} 4
{5} 4
{1, 2} 2
{1, 3} 3
{1, 5} 2
{2, 3} 3
{2, 5} 4
{3, 5} 3
{1, 2, 3} 2
{1, 2, 5} 2
{1, 3, 5} 2
{2, 3, 5} 3
{1, 2, 3, 5} 2

How should I interpret the results?

In the results, each itemset is annotated with its support. The support of an itemset is how many times the itemset appears in the transaction database. For example, the itemset {2, 3 5} has a support of 3 because it appears in transactions t2, t3 and t5. It is a frequent itemset because its support is higher or equal to the minsup parameter.

Input file format

The input file format used by SSFIM is defined as follows. It is a text file. An item is represented by a positive integer. A transaction is a line in the text file. In each line (transaction), items are separated by a single space. It is assumed that all items within a same transaction (line) are sorted according to a total order (e.g. ascending order) and that no item can appear twice within the same line.

For example, for the previous example, the input file is defined as follows:

1 3 4
2 3 5
1 2 3 5
2 5
1 2 3 5

Note that it is also possible to use the ARFF format as an alternative to the default input format. The specification of the ARFF format can be found here. Most features of the ARFF format are supported except that (1) the character "=" is forbidden and (2) escape characters are not considered. Note that when the ARFF format is used, the performance of the data mining algorithms will be slightly less than if the native SPMF file format is used because a conversion of the input file will be automatically performed before launching the algorithm and the result will also have to be converted. This cost however should be small.

Output file format

The output file format is defined as follows. It is a text file, where each line represents a frequent itemset. On each line, the items of the itemset are first listed. Each item is represented by an integer and it is followed by a single space. After, all the items, the keyword "#SUP:" appears, which is followed by an integer indicating the support of the itemset, expressed as a number of transactions. For example, here is the output file for this example. The first line indicates the frequent itemset consisting of the item 1 and it indicates that this itemset has a support of 3 transactions.

1 #SUP: 3
2 #SUP: 4
3 #SUP: 4
5 #SUP: 4
1 2 #SUP: 2
1 3 #SUP: 3
1 5 #SUP: 2
2 3 #SUP: 3
2 5 #SUP: 4
3 5 #SUP: 3
1 2 3 #SUP: 2
1 2 5 #SUP: 2
1 3 5 #SUP: 2
2 3 5 #SUP: 3
1 2 3 5 #SUP: 2

Note that if the ARFF format is used as input instead of the default input format, the output format will be the same except that items will be represented by strings instead of integers.

Performance

There exists several algorithms for mining frequent itemsets. Here is a comparison of the performance of SSFIM with FPGrowth on eight benchmark datasets.

PSUMB dataset
Time Memory
SSFIM minsup =0.99 Out of memory Out of memory
FPGrowth minsup = 0.7 6 seconds 72 Mb
Accidents dataset
Time Memory
SSFIM minsup =0.99 Out of memory Out of memory
FPGrowth minsup = 0.3 7 seconds 125 Mb
Connect dataset
Time Memory
SSFIM minsup =0.99 Out of memory  Out of memory
FPGrowth minsup = 0.9 1 second 62 Mb
Retail dataset
Time Memory
SSFIM minsup =0.99 Out of memory Out of memory
FPGrowth minsup = 0.01 0.4 second 76 Mb
Mushroom dataset
Time Memory
SSFIM minsup =0.99 Out of memory Out of memory
FPGrowth minsup = 0.1 0.5 second 45 Mb
BMS1 dataset
Time Memory
SSFIM minsup =0.99 Out of memory Out of memory
FPGrowth minsup = 0.1 0.08 second 50 Mb
Chainstore dataset
Time Memory
SSFIM minsup =0.99 Out of memory Out of memory
FPGrowth minsup = 0.1 3.9 second 50 Mb
Chess dataset
Time Memory
SSFIM minsup =0.5 0.02 second 49 Mb
FPGrowth minsup = 0.1 1.6 second 57 Mb

As it can be seen above, SSFIM can only return a result for one of the eight standard benchmark datasets tested above (Chess). The reason is that SSFIM uses a brute force approach that generates all possibilities, which can only terminates without running out of memory when the search space is small.. Thus, even for very high minsup threshold values like 0.99, on many datasets, SSFIM runs out of memory, while FPGrowth can run efficiently for thresholds as low as 0.01 with a very small amount of memory. The only dataset where SSFIM is efficient in the above comparison is Chess because the search space is small.

Where can I get more information about the SSFIM algorithm?

This is the conference article describing SSFIM:

Djenouri, Y., Comuzzi, M. and Djenouri, D., 2017, May. SS-FIM: Single Scan for Frequent Itemsets Mining in Transactional Databases. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (pp. 644-654). Springer.

Also, for a good overview of frequent itemset mining algorithms, you may read this survey paper.