Mining Rare Itemsets using the RP-Growth Algorithm (SPMF documentation)

This example explains how to run the RP-Growth algorithm using the SPMF open-source data mining library.

How to run this example?

What is RPGrowth?

RPGrowth is an adaptation of the FPGrowth algorithm for discovering rare item sets in a transaction database. FPGrowth was proposed by Han et al. (2000) and the adaptation of the FPGrowth to find rare item sets was proposed by Sidney Tsang, Yun Sing Koh, and Gillian Dobbie (2011). Being modified from the FPGrowth algorithm, it retains the speed, memory efficiency, and the usage of the FP-Tree.

What is the input of the RPGrowth algorithm?

The input of RPGrowth is a transaction database (aka binary context) and thresholds named minsup (which represents the upper boundary of what is considered “rare”, a value between 0 and 100% -- it also can be considered the minimum of what is frequent) and minraresup (which represents the lower boundary of what is considered “rare”, a value between 0 and 100%).
NOTE: minraresup and minsup represent a range. That being said, minraresup must be lower than minsup.
A transaction database is a set of transactions. Each transaction is a set of items. For example, consider the following transaction database. It contains 5 transactions (t1, t2, …, t5) and 5 items (1, 2, 3, 4, 5). For example, the first transaction represents the set of items 1, 2, and 4. This database is provided as the file contextRP in the SPMF distribution. It is important to note than an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted by lexicographical order in a transaction.


Transaction id

Items

t1

{1, 2, 4}

t2

{1, 3}

t3

{1, 2, 3, 5}

t4

{2, 3, 5}

t5

{1, 2, 3, 5}

What is the output of the RPGrowth algorithm?

RPGrowth is an algorithm for discovering item sets (group of items) occurring infrequently in a transaction database (rare item sets). A rare itemset is an itemset appearing within the minraresup and minsup range (e.g. minraresup is set to 10% and minsup is set to 60%), where both minraresup and minsup are parameters defined by the user. NOTE: RPGrowth has the requirement that any “rare” item set of size 2 or greater contain at least on item (item set of size 1) that was considered “rare”.
For example, if RPGrowth is run on the previous transaction database with a minsup of 80% (4 transactions) and minraresup 10% (1 transaction), RPGrowth produces the following result:


Itemsets

Support

{4}

1

{5}

3

{1, 4}

1

{2, 4}

1

{1, 5}

2

{3, 5}

3

{2, 5}

3

{1, 2, 4}

1

{1, 3, 5}

2

{2, 3, 5}

3

{1, 2, 5}

2

{1, 2, 3, 5}

2

Another example, if RPGrowth is run on the same transaction database again with minsup dropped to 40% (2 transactions) and minraresup remaining at 10% (1 transaction), RPGrowth produces the following result:


Itemsets

Support

{4}

1

{1, 4}

1

{2, 4}

1

{1, 2, 4}

1

How should I interpret the results?

In the results, each itemset is annotated with its support. The support of an itemset is how many times the itemset appears in the transaction database. For example, the itemset {1, 2, 4} has a support of 1 because it appears in only t1. It is a rare itemset because it falls within the user determined range of what is considered “rare” (and meets previously mentioned constraints).

Input file format:

The input file format used by the RPGrowth is defined as follows. It is a text file. An item is represented by a positive integer. A transaction is a line in the text file. In each line (transaction), items are separated by a single space. It is assumed that all items within a same transaction (line) are sorted according to total order (e.g. ascending order) and that no item can appear twice within the same line.

For example, for the previous example, the input file is defined as follows:
1 2 4
1 3
1 2 3 5
2 3 5
1 2 3 5

Output file format:

The output file format is defined as follows. It is a text file, where each line represents a frequent itemset. On each line, the items of the item sets are first listed. Each item is represented by an integer and it is followed by a single space. After, all the items, the keyword “#SUP:” appears, which is followed by an integer indicating the support of the itemset, expressed as a number of transactions. For example, here is the output file for the first example. The first line indicates the frequent itemset consisting of the item 4 and it indicates that this itemset has a support of 1 transaction.

4 #SUP: 1
1 4 #SUP: 1
2 4 #SUP: 1
1 2 4 #SUP: 1
5 #SUP: 3
1 5 #SUP: 2
3 5 #SUP: 3
1 3 5 #SUP: 2
2 3 5 #SUP: 3
1 2 3 5 #SUP: 2
2 5 #SUP: 3
1 2 5 #SUP: 2

Optional feature: constraints on the size of itemsets

Sometimes, there may be just too many itemsets, and itemsets containing many items may not be interesting. Thus, it is also possible to specify two optional parameters in the user interface of SPMF:

If you are using the command line interface of SPMF, it is also possible to use these optional parameters by adding them at the end of the command. For example:
java –jar spmf.jar run RPGrowth_itemsets contextRP.txt output.txt 60% 10% 2 3
means to run the above example but to only find frequent itemsets containing at least 2 items and no more than 3 items.

Optional feature: giving names to items

Some users have requested the feature of given names to items instead of using numbers. This feature is offered in the user interface of SPMF and in the command line of SPMF. To use this feature, your file must include @CONVERTED_FROM_TEXT as first line and then several lines to define the names of items in your file. For example, consider the example database "contextRP.txt". Here we have modified the file to give names to the items: 

@CONVERTED_FROM_TEXT
@ITEM=1=apple
@ITEM=2=orange
@ITEM=3=tomato
@ITEM=4=milk
@ITEM=5=bread
1 3 4
2 3 5
1 2 3 5
2 5
1 2 3 5

In this file, the first line indicates, that it is a file where names are given to items. Then, the second line indicates that the item 1 is called "apple". The third line indicates that the item 2 is called "orange". Then the following lines define four sequences in the SPMF format.

Then, if we apply a sequential pattern mining algorithm using this file using the user interface of SPMF or the command line, the output file contains several patterns:

milk #SUP: 1
tomato milk #SUP: 1
apple milk #SUP: 1
apple tomato milk #SUP: 1

Note that this feature could be also used from the source code of SPMF using the ResultConverter class. However, there is currently no example provided for using it from the source code.

Performance

RPGrowth is the only algorithm for all rare item sets that branch off initial singleton rare items offered in SMPF. Being an extension of FPGrowth; RPGrowth retains the speed, memory efficiency, and the usage of the FP-Tree.

Where can I get more information about the RPGrowth algorithm?

This is the journal article describing the original FPGrowth algorithm:

Jiawei Han, Jian Pei, Yiwen Yin, Runying Mao: Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach. Data Min. Knowl. Discov. 8(1): 53-87 (2004)

This is the journal article describing the modification of the FPGrowth algorithm and FP-Tree to handle rare itemsets rather than frequent itemsets:

Sidney Tsang, Yun Sing Koh, Gillian Dobbie, RP-Tree: Rare Pattern Tree Mining, International Conference of Data Warehousing and Knowledge Discovery, 277-288 (2011)
https://link.springer.com/chapter/10.1007/978-3-642-23544-3_21