Mining the Top-K High-Utility Itemsets in a Transaction Database using the TKU-CE+ Algorithm (SPMF documentation)

This example explains how to run the TKU-CE+ algorithm using the SPMF open-source data mining library.

How to run this example?

What is TKU-CE+?

TKU-CE+ (Song et al, 2021) is an algorithm for discovering the top-k high-utility itemsets in a transaction database containing utility information. It is heuristic algorithm, that is it provides a trade-off between speed and completeness.  It is an improved version of the TKU-CE algorithm, also offered in SPMF.

High utility itemset mining has several applications such as discovering groups of items in transactions of a store that generate the most profit. A database containing utility information is a database where items can have quantities and a unit price. Although these algorithms are often presented in the context of market basket analysis, there exist other applications.

What is the input?

TKU-CE+ takes as input a transaction database with utility information and a parameter k (a positive integer). Let's consider the following database consisting of 5 transactions (t1,t2...t5) and 7 items (1, 2, 3, 4, 5, 6, 7). This database is provided in the text file "DB_utility.txt" in the package ca.pfv.spmf.tests of the SPMF distribution.


Items Transaction utility Item utilities for this transaction
t1 3 5 1 2 4 6 30 1 3 5 10 6 5
t2 3 5 2 4 20 3 3 8 6
t3 3 1 4 8 1 5 2
t4 3 5 1 7 27 6 6 10 5
t5 3 5 2 7 11 2 3 4 2

Each line of the database is:

Note that the value in the second column for each line is the sum of the values in the third column.

What are real-life examples of such a database? There are several applications in real life. One application is a customer transaction database. Imagine that each transaction represents the items purchased by a customer. The first customer named "t1" bought items 3, 5, 1, 2, 4 and 6. The amount of money spent for each item is respectively 1 $, 3 $, 5 $, 10 $, 6 $ and 5 $. The total amount of money spent in this transaction is 1 + 3 + 5 + 10 + 6 + 5 = 30 $.

What is the output?

The output of TKU-CE+ is an approxmation of the top-k high utility itemsets, that is the k itemsets that have the highest utility in the transaction database taken as input. To explain what is a top-k high utility itemset, it is necessary to review some definitions. An itemset is an unordered set of distinct items. The utility of an itemset in a transaction is the sum of the utility of its items in the transaction. For example, the utility of the itemset {1 4} in transaction t1 is 5 + 6 = 11 and the utility of {1 4} in transaction t3 is 5 + 2 = 7. The utility of an itemset in a database is the sum of its utility in all transactions where it appears. For example, the utility of {1 4} in the database is the utility of {1 4} in t1 plus the utility of {1 4} in t3, for a total of 11 + 7 = 18. The top-k high utility itemsets is the set of the k itemsets that have the highest utility. It is to be noted that in some cases, it is possible that the algorithm returns more than k itemsets if several itemsets have the same utility. For example, if we run TKU-CE+ with a k = 3, we may obtain the following top-3 high-utility itemsets:

itemsets utility
{2 3 5} 37
{2 4 5} 36
{2 3 4 5} 40

Note that the result may vary is it is a randomized algorithm.

If the database is a transaction database from a store, we could interpret these results as all the k groups of items bought together that generated the highest profit.

Input file format

The input file format of TKU-CE+ is defined as follows. It is a text file. Each lines represents a transaction. Each line is composed of three sections, as follows.

For example, for the previous example, the input file is defined as follows:

3 5 1 2 4 6:30:1 3 5 10 6 5
3 5 2 4:20:3 3 8 6
3 1 4:8:1 5 2
3 5 1 7:27:6 6 10 5
3 5 2 7:11:2 3 4 2

Consider the first line. It means that the transaction {3, 5, 1, 2, 4, 6} has a total utility of 30 and that items 3, 5, 1, 2, 4 and 6 respectively have a utility of 1, 3, 5, 10, 6 and 5 in this transaction. The following lines follow the same format.

Output file format

The output file format of TKU-CE+ is defined as follows. It is a text file, where each line represents a top-k high utility itemset. On each line, the items of the itemset are first listed. Each item is represented by an integer, followed by a single space. After, all the items, the keyword " #UTIL: " appears and is followed by the utility of the itemset. For example, we show below the output file for this example.

2 3 4 5 #UTIL:40
2 3 5 #UTIL:37
2 4 5 #UTIL:36

For example, the second line indicates that the itemset {2, 3, 5} has a utility of 37. The following lines follows the same format.

Performance

Top-k high utility itemset mining is a more difficult problem than high-utility itemset mining. TKU-CE+ is an improved version of the TKU-CE algorithm, to find a quick solution to this problem using an heuristic approach.

This is the original implementation of TKU-CE+.

Implementation details

Note that the input format is not exactly the same as described in the article. But it is equivalent.

Where can I get more information about the TKU-CE+ algorithm?

The TKU-CE+ algorithm was proposed in this paper:

    Song, W. et al. (2021). Heuristically mining the top-k high-utility itemsets with cross-entropy optimization  Applied Intelligence, Springer, DOI: 10.1007/s10489-021-02576-z

Besides, for a general overview of high utility itemset mining, you may read this survey paper.