Mining Maximal High-Utility Itemsets from a transaction database with utility information using the CHUI-Miner(Max) Algorithm (SPMF documentation)

This example explains how to run the CHUI-Miner(Max) algorithm using the SPMF open-source data mining library.

How to run this example?

What is CHUI-Miner?

CHUI-Miner(Max) (Wu et al., 2019) is an algorithm for discovering maximal high-utility itemsets in a transaction database containing utility information.

There has been many work on the topic of high-utility itemset mining. A limitation of many high-utility itemset mining algorithms is that they generate too much itemsets as output. The CHUI-Miner(Max) algorithm was designed to discover only the high-utility itemsets that are maximal. A maximal high utility itemset is an itemset that is not included in any other high utility itemsets.

What is the input?

CHUI-Miner(Max) takes as input a transaction database with utility information and a minimum utility threshold min_utility (a positive integer). Let's consider the following database consisting of 5 transactions (t1,t2...t5) and 7 items (1, 2, 3, 4, 5, 6, 7). This database is provided in the text file "DB_utility.txt" in the package ca.pfv.spmf.tests of the SPMF distribution.


Items Transaction utility Item utilities for this transaction
t1 3 5 1 2 4 6 30 1 3 5 10 6 5
t2 3 5 2 4 20 3 3 8 6
t3 3 1 4 8 1 5 2
t4 3 5 1 7 27 6 6 10 5
t5 3 5 2 7 11 2 3 4 2

Each line of the database is:

Note that the value in the second column for each line is the sum of the values in the third column.

What are real-life examples of such a database? There are several applications in real life. One application is a customer transaction database. Imagine that each transaction represents the items purchased by a customer. The first customer named "t1" bought items 3, 5, 1, 2, 4 and 6. The amount of money spent for each item is respectively 1 $, 3 $, 5 $, 10 $, 6 $ and 5 $. The total amount of money spent in this transaction is 1 + 3 + 5 + 10 + 6 + 5 = 30 $.

What is the output?

The output of CHUI-Miner(Max) is the set of maximal high utility itemsets having a utility no less than a min_utility threshold (a positive integer) set by the user. To explain what is a maximal high utility itemset, it is necessary to review some definitions.

An itemset is an unordered set of distinct items. The utility of an itemset in a transaction is the sum of the utility of its items in the transaction. For example, the utility of the itemset {1 4} in transaction t1 is 5 + 6 = 11 and the utility of {1 4} in transaction t3 is 5 + 2 = 7. The utility of an itemset in a database is the sum of its utility in all transactions where it appears. For example, the utility of {1 4} in the database is the utility of {1 4} in t1 plus the utility of {1 4} in t3, for a total of 11 + 7 = 18. A high utility itemset is an itemset such that its utility is no less than min_utility.

A maximal high utility itemset (CHUI) is a high-utility itemset that not included in any other high utility itemset.

For example, if we run CHUI-Miner(Max) with a minimum utility of 25 we obtain 2 maximal high-utility itemsets:

itemsets utility
{1, 2, 3, 4, 5, 6} 30
{1, 3, 5, 7} 27

If the database is a transaction database from a store, we could interpret these results as all the largest groups of items bought together that generated a profit of 25 $ or more.

Input file format

The input file format of CHUI-Miner(Max) is defined as follows. It is a text file. Each lines represents a transaction. Each line is composed of three sections, as follows.

For example, for the previous example, the input file is defined as follows:

3 5 1 2 4 6:30:1 3 5 10 6 5
3 5 2 4:20:3 3 8 6
3 1 4:8:1 5 2
3 5 1 7:27:6 6 10 5
3 5 2 7:11:2 3 4 2

Consider the first line. It means that the transaction {3, 5, 1, 2, 4, 6} has a total utility of 30 and that items 3, 5, 1, 2, 4 and 6 respectively have a utility of 1, 3, 5, 10, 6 and 5 in this transaction. The following lines follow the same format.

Output file format

The output file format of CHUI-Miner(Max) is defined as follows. It is a text file, where each line represents a maximal high utility itemset. On each line, the items of the itemset are first listed. Each item is represented by an integer, followed by a single space. After, all the items, the keyword "#SUP:" appears and is followed by the support of the itemset. The support of an itemset is how many times the itemset appeared in the database. Then, the keyword #UTIL: " appears and is followed by the utility of the itemset. For example, we show below the output file for this example.

6 4 2 1 5 3 #SUP: 1 #UTIL: 30
7 5 3 1 #SUP: 1 #UTIL: 27

For example, the second line indicates that the itemset {1, 3, 5, 7} has a support of 1 transaction and a utility of 27$. The other lines follows the same format.

Performance

High utility itemset mining is a more difficult problem than frequent itemset mining. Therefore, high-utility itemset mining algorithms are generally slower than frequent itemset mining algorithms. The CHUI-Miner(Max) algorithm was proposed in 2019 to discover only the high-utility itemsets that are maximal. It is generally faster to discover maximal high utility itemsets than discovering all high-utility itemsets.

Note that there exists another variation of this algorithm named CHUI-Miner or CHUI-Miner(Closed) for discovering the closed high-utility itemsets. It is also included in SPMF.

Implementation details

This is an implementation of CHUI-Miner(Max), implemented by P. Fournier-Viger. This is an alternative implementation that was not used in the paper. The main differences with the implementation in the paper is that this implementation (1) does not calculate utility-unit arrays (see the paper) and (2) adds the EUCP optimizations introduced in the FHM algorithm.

In the source code version of SPMF, there are two examples of using CHUI-Miner(Max) in the package ca.pfv.spmf.tests. The first one is MainTestCHUIMinerMax_saveToFile, which saves the result to an output file. The second one is MainTestCHUIMinerMax_saveToMemory, which saves the result to memory.

Where can I get more information about the CHUI-Miner(Max) algorithm?

This is the reference of the article describing the CHUI-Miner(Max) algorithm:

Wu. C.-W., Fournier-Viger, P., Gu, J. Y., Tseng, V.S. (2019). Mining Compact High Utility Itemsets without Candidate Generation . In: Fournier-Viger et al. (eds). High-Utility Pattern Mining: Theory, Algorithms and Applications, Springer.

Besides, for a general overview of high utility itemset mining, you may read this survey paper.