Mining High-Utility Itemsets in a Transaction Database using the R-Miner Algorithm (SPMF documentation)

This example explains how to run the R-Miner algorithm using the SPMF open-source data mining library.

How to run this example?

What is R-Miner?

R-Miner (Pushp Sra and Satish Chand, 2022) is an algorithm for discovering high-utility itemsets in a transaction database containing utility information.

High utility itemset mining has several applications, such as discovering the groups of items in the transactions of a store that generate the most profit. A database containing utility information is a database where items can have quantities and a unit price. Although high utility itemset mining algorithms are often presented in the context of market basket analysis, there exist other applications.

What is the input?

R-Miner takes as input a transaction database with utility information and a minimum utility threshold min_utility (a positive integer). Let's consider the following database consisting of 5 transactions (t1, t2, ..., t5) and 7 items (1, 2, 3, 4, 5, 6, 7). This database is provided in the text file "DB_utility.txt" in the package ca.pfv.spmf.tests of the SPMF distribution.


      Items          Transaction utility    Item utilities for this transaction
t1    3 5 1 2 4 6    30                     1 3 5 10 6 5
t2    3 5 2 4        20                     3 3 8 6
t3    3 1 4          8                      1 5 2
t4    3 5 1 7        27                     6 6 10 5
t5    3 5 2 7        11                     2 3 4 2

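Each line of the database gives the items of a transaction, the transaction utility, and the utility of each item in that transaction, listed in the same order as the items.

Note that the transaction utility on each line is the sum of the item utilities on that line.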
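The relationship between the transaction utility and the item utilities can be checked with a few lines of code. The following is a minimal Python sketch, not part of SPMF: the names DATABASE and is_consistent are hypothetical and chosen only for illustration, and the data is copied from the example database above.

```python
# Example database from the text: (items, transaction utility, item utilities).
DATABASE = [
    ([3, 5, 1, 2, 4, 6], 30, [1, 3, 5, 10, 6, 5]),
    ([3, 5, 2, 4], 20, [3, 3, 8, 6]),
    ([3, 1, 4], 8, [1, 5, 2]),
    ([3, 5, 1, 7], 27, [6, 6, 10, 5]),
    ([3, 5, 2, 7], 11, [2, 3, 4, 2]),
]


def is_consistent(items, transaction_utility, item_utilities):
    """Return True if there is one utility per item and the utilities
    sum to the transaction utility."""
    return (
        len(items) == len(item_utilities)
        and sum(item_utilities) == transaction_utility
    )


# Report the check for each transaction t1..t5.
for number, row in enumerate(DATABASE, start=1):
    print(f"t{number}: consistent = {is_consistent(*row)}")
```

A check like this can be useful before mining a database that was built by hand.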

What are real-life examples of such a database? There are several applications in real life. One application is a customer transaction database. Imagine that each transaction represents the items purchased by a customer. The first customer named "t1" bought items 3, 5, 1, 2, 4 and 6. The amount of money spent for each item is respectively 1 $, 3 $, 5 $, 10 $, 6 $ and 5 $. The total amount of money spent in this transaction is 1 + 3 + 5 + 10 + 6 + 5 = 30 $.

What is the output?

The output of R-Miner is the set of high utility itemsets, that is, the itemsets having a utility no less than the min_utility threshold (a positive integer) set by the user. To explain what a high utility itemset is, it is necessary to review some definitions.

An itemset is an unordered set of distinct items. The utility of an itemset in a transaction is the sum of the utilities of its items in that transaction. For example, the utility of the itemset {1 4} in transaction t1 is 5 + 6 = 11 and the utility of {1 4} in transaction t3 is 5 + 2 = 7. The utility of an itemset in a database is the sum of its utilities in all transactions where it appears. For example, the utility of {1 4} in the database is the utility of {1 4} in t1 plus the utility of {1 4} in t3, for a total of 11 + 7 = 18.

A high utility itemset is an itemset whose utility is no less than min_utility. For example, if we run R-Miner with a minimum utility of 30, we obtain 8 high-utility itemsets:

itemsets         utility    support
{2 4}            30         40 % (2 transactions)
{2 5}            31         60 % (3 transactions)
{1 3 5}          31         40 % (2 transactions)
{2 3 4}          34         40 % (2 transactions)
{2 3 5}          37         60 % (3 transactions)
{2 4 5}          36         40 % (2 transactions)
{2 3 4 5}        40         40 % (2 transactions)
{1 2 3 4 5 6}    30         20 % (1 transaction)

If the database is a transaction database from a store, we could interpret these results as all the groups of items bought together that generated a profit of 30 $ or more.
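To make the definitions of utility and high utility itemset concrete, the following Python sketch recomputes them by brute force on the example database. It is not the R-Miner algorithm and does not use the SPMF API: it simply enumerates every possible itemset, which is only practical for a tiny database such as this one. The names DATABASE, utility, support and brute_force_high_utility_itemsets are hypothetical.

```python
from itertools import combinations

# Each transaction maps an item to its utility in that transaction
# (data copied from the example database).
DATABASE = [
    {3: 1, 5: 3, 1: 5, 2: 10, 4: 6, 6: 5},  # t1
    {3: 3, 5: 3, 2: 8, 4: 6},               # t2
    {3: 1, 1: 5, 4: 2},                     # t3
    {3: 6, 5: 6, 1: 10, 7: 5},              # t4
    {3: 2, 5: 3, 2: 4, 7: 2},               # t5
]


def utility(itemset, database):
    """Sum the utility of the itemset over every transaction that
    contains all of its items."""
    return sum(
        sum(transaction[item] for item in itemset)
        for transaction in database
        if all(item in transaction for item in itemset)
    )


def support(itemset, database):
    """Count the transactions that contain all items of the itemset."""
    return sum(
        1 for transaction in database
        if all(item in transaction for item in itemset)
    )


def brute_force_high_utility_itemsets(database, min_utility):
    """Enumerate all itemsets and keep those with utility >= min_utility.

    Exponential in the number of items; for illustration only."""
    items = sorted({item for transaction in database for item in transaction})
    result = {}
    for size in range(1, len(items) + 1):
        for itemset in combinations(items, size):
            value = utility(itemset, database)
            if value >= min_utility:
                result[itemset] = value
    return result


for itemset, value in brute_force_high_utility_itemsets(DATABASE, 30).items():
    print(itemset, value, support(itemset, DATABASE))
```

Because the search space doubles with each additional item, real databases require a dedicated algorithm such as R-Miner rather than this exhaustive enumeration.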

Input file format

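The input file format of R-Miner is defined as follows. It is a text file. Each line represents a transaction. Each line is composed of three sections separated by the ":" character: the items of the transaction, separated by single spaces; the transaction utility; and the utility of each item in the transaction, separated by single spaces and listed in the same order as the items.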

For example, for the previous example, the input file is defined as follows:

3 5 1 2 4 6:30:1 3 5 10 6 5
3 5 2 4:20:3 3 8 6
3 1 4:8:1 5 2
3 5 1 7:27:6 6 10 5
3 5 2 7:11:2 3 4 2

Consider the first line. It means that the transaction {3, 5, 1, 2, 4, 6} has a total utility of 30 and that items 3, 5, 1, 2, 4 and 6 respectively have a utility of 1, 3, 5, 10, 6 and 5 in this transaction. The following lines follow the same format.
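As an illustration of this format, here is a minimal Python sketch that splits one line of the input file into its three sections. The function name parse_transaction is hypothetical and is not part of SPMF; the sketch assumes every line is well formed, apart from a check that each item has exactly one utility.

```python
def parse_transaction(line):
    """Parse one line of the form 'items:transaction utility:item utilities'.

    Returns (items, transaction_utility, item_utilities)."""
    items_part, utility_part, item_utilities_part = line.strip().split(":")
    items = [int(token) for token in items_part.split()]
    transaction_utility = int(utility_part)
    item_utilities = [int(token) for token in item_utilities_part.split()]
    if len(items) != len(item_utilities):
        raise ValueError("each item must have exactly one utility value")
    return items, transaction_utility, item_utilities


# First line of the example input file.
print(parse_transaction("3 5 1 2 4 6:30:1 3 5 10 6 5"))
```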

Output file format

The output file format of R-Miner is defined as follows. It is a text file, where each line represents a high utility itemset. On each line, the items of the itemset are first listed. Each item is represented by an integer, followed by a single space. After all the items, the keyword " #UTIL: " appears and is followed by the utility of the itemset. For example, we show below the output file for this example.

2 4 #UTIL: 30
2 5 #UTIL: 31
1 3 5 #UTIL: 31
2 3 4 #UTIL: 34
2 3 5 #UTIL: 37
2 4 5 #UTIL: 36
2 3 4 5 #UTIL: 40
1 2 3 4 5 6 #UTIL: 30

For example, the first line indicates that the itemset {2, 4} has a utility of 30. The following lines follow the same format.
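A result file in this format can be read back with a short script. The following Python sketch is only an illustration; the function name parse_output_line is hypothetical and is not part of SPMF, and the sketch assumes each line contains the "#UTIL:" keyword exactly once.

```python
def parse_output_line(line):
    """Parse one output line such as '2 4 #UTIL: 30'.

    Returns (items, utility)."""
    items_part, utility_part = line.split("#UTIL:")
    items = [int(token) for token in items_part.split()]
    return items, int(utility_part)


# First line of the example output file.
print(parse_output_line("2 4 #UTIL: 30"))
```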

Performance

This is the original implementation of the R-Miner algorithm.

Implementation details

The version implemented here contains all the optimizations described in the paper proposing R-Miner. Note that the input format is not exactly the same as the one described in the original article, but it is equivalent.

Where can I get more information about the R-Miner algorithm?

The R-Miner algorithm is presented in this paper:

Pushp Sra, Satish Chand (2023). A residual utility based concept for high utility itemset mining. KAIS journal.

Besides, for a general overview of high utility itemset mining, you may read this survey paper.