Mining High-Utility Itemsets from a Database with Positive or Negative Unit Profit using the HUINIV-Mine Algorithm (SPMF documentation)

This example explains how to run the HUINIV-Mine algorithm using the SPMF open-source data mining library.

How to run this example?

What is HUINIV-Mine?

HUINIV-Mine is an algorithm for discovering high-utility itemsets in a transaction database containing utility information. It is an extension of the Two-Phase algorithm designed for mining patterns in a transaction database where items may have negative unit profit values.

Items with negative values are interesting in real-life scenarios. Often in a retail store, items may be sold at a loss. If traditional high utility itemset mining algorithms such as Two-Phase, IHUP, UPGrowth, HUI-Miner and FHM are appied on such database, it was demonstrated that they may not discover the correct restults. To address this issue, the HUINIV-Mine algorithm was proposed. However, faster algorithms now exists, such as FHN, also offered in SPMF.

What is the input?

HUINIV-Mine takes as input a transaction database with utility information and a minimum utility threshold min_utility (a positive integer). Let's consider the following database consisting of 10 transactions (t1,t2...t10) and 5 items (1, 2, 3, 4, 5). This database is provided in the text file "DB_NegativeUtility.txt" in the package ca.pfv.spmf.tests of the SPMF distribution.


Items Transaction utility Item utilities for this transaction
t1 1 4 5 27 5 12 10
t2 2 3 4 36 -3 -4 36
t3 1 4 45 15 30
t4 1 5 15 5 10
t5 2 3 4 36 -3 -4 36
t6 2 3 5 20 -3 -2 20
t7 1 10 10
t8 1 4 21 15 6
t9 2 3 4 24 -3 -2 24
t10 1 5 15 5 10

Each line of the database is:

Note that the value in the second column for each line is the sum of the values in the third column.

What are real-life examples of such a database? There are several applications in real life. One application is a customer transaction database. Imagine that each transaction represents the items purchased by a customer. The first customer named "t1" bought items 1, 4 and 5. The amount of profit generated by the sale of each of these item is respectively 5 $, 12 $ and 10 $. The total amount of money spent in this transaction is 5 + 12 + 10 = 27 $.

What is the output?

The output of HUINIV-Mineis the set of high utility itemsets having a utility no less than a min_utility threshold (a positive integer) set by the user. To explain what is a high utility itemset, it is necessary to review some definitions. An itemset is an unordered set of distinct items. The utility of an itemset in a transaction is the sum of the utility of its items in the transaction. For example, the utility of the itemset {1 4} in transaction t1 is 5 + 12 = 17 and the utility of {1 4} in transaction t3 is 15 + 30 = 45. The utility of an itemset in a database is the sum of its utility in all transactions where it appears. For example, the utility of {1 4} in the database is the utility of {1 4} in t1 plus the utility of {1 4} in t3, plus the utility of {1 4} in t8, for a total of 17 + 45 + 21 = 83. A high utility itemset is an itemset such that its utility is no less than min_utility For example, if we run HUINIV-Mine with a minimum utility of 30, we obtain 8 high-utility itemsets:

itemsets utility ($)
{5} 50
{1 5} 45
{1} 55
{1 4} 83
{4} 144
{2 4} 87
{2 3 4} 77
{3 4} 86

If the database is a transaction database from a store, we could interpret these results as all the groups of items bought together that generated a profit of 30 $ or more.

Input file format

The input file format of HUINIV-Mine is defined as follows. It is a text file. Each lines represents a transaction. Each line is composed of three sections, as follows.

For example, for the previous example, the input file is defined as follows:

1 4 5:27:5 12 10
2 3 4:36:-3 -4 36
1 4:45:15 30
1 5:15:5 10
2 3 4:36:-3 -4 36
2 3 5:20:-3 -2 20
1:10:10
1 4:21:15 6
2 3 4:24:-3 -2 24
1 5:15:5 10

Consider the first line. It means that the transaction {1, 4, 5} has a total utility of 27 and that items 1, 4and 5 respectively have a utility of 5, 12 and 10 in this transaction. The following lines follow the same format.

Output file format

The output file format of HUINIV-Mine is defined as follows. It is a text file, where each line represents a high utility itemset. On each line, the items of the itemset are first listed. Each item is represented by an integer, followed by a single space. After, all the items, the keyword " #UTIL: " appears and is followed by the utility of the itemset. For example, we show below the output file for this example.

5 #UTIL: 50
5 1 #UTIL: 45
1 #UTIL: 55
1 4 #UTIL: 83
4 #UTIL: 144
4 2 #UTIL: 87
4 2 3 #UTIL: 77
4 3 #UTIL: 86

For example, the second line indicates that the itemset {1, 5} has a utility of 45. The other lines follows the same format.

Performance

High utility itemset mining is a more difficult problem than frequent itemset mining. Therefore, high-utility itemset mining algorithms are generally slower than frequent itemset mining algorithms. The HUINIV-Mine is the first algorithm for high-utility itemset mining with negative unit profit. However, faster algorithms have now been proposed such as FHN (2014), also offered in SPMF.

Where can I get more information about the HUINIV-Mine algorithm?

This is the reference of the article describing the HUINIV-Mine algorithm:

Chu, Chun-Jung, Vincent S. Tseng, and Tyne Liang. "An efficient algorithm for mining high utility itemsets with negative item values in large databases." Applied Mathematics and Computation 215.2 (2009): 767-778.

Besides, for a general overview of high utility itemset mining, you may read this survey paper.