Mining High Utility Association Rules using the HGB / HGB-ALL Algorithm (SPMF documentation)

This example explains how to run the HGB and HGB_All algorithm using the SPMF open-source data mining library to discover high utility association rules..

How to run this example?

To run the example with the HGB algorithm for mining all non redundant high utility association rules:

To run the example with the HGB_All algorithm for mining all high utility association rules:

What is HGB / HGB_All ?

HGB and HGB_All are names given in SPMF to two algorithms proposed by Sahoo et al. (2015) for mining high utility association rules. Those algorithms are post-processing algorithms to extract high utility association rules using the results from the HUCI-Miner algorithm. The HGB_All algorithm is designed to find all high utility association rules, while the HGB algorithm finds only the non redundant association rules, which is a subset of all rules.

The advantage of high utility association rules over traditional association rules is that the concept of utility or importance is taken into account. The utility can be used to model factors such as the profit yield by the sales of items, and can be also applied to other application.

What is the input?

HGB / HGB_ALL takes as input a transaction database with utility information, a minimum utility threshold min_utility (a positive integer) and a minimum confidence threshold minconf (a double value representing a percentage) . Let's consider the following database consisting of 5 transactions (t1,t2...t5) and 7 items (1, 2, 3, 4, 5, 6, 7). This database is provided in the text file "DB_utility.txt" in the package ca.pfv.spmf.tests of the SPMF distribution.


Items Transaction utility Item utilities for this transaction
t1 3 5 1 2 4 6 30 1 3 5 10 6 5
t2 3 5 2 4 20 3 3 8 6
t3 3 1 4 8 1 5 2
t4 3 5 1 7 27 6 6 10 5
t5 3 5 2 7 11 2 3 4 2

Each line of the database is:

Note that the value in the second column for each line is the sum of the values in the third column.

What are real-life examples of such a database? There are several applications in real life. One application is a customer transaction database. Imagine that each transaction represents the items purchased by a customer. The first customer named "t1" bought items 3, 5, 1, 2, 4 and 6. The amount of money spent for each item is respectively 1 $, 3 $, 5 $, 10 $, 6 $ and 5 $. The total amount of money spent in this transaction is 1 + 3 + 5 + 10 + 6 + 5 = 30 $.

What is the output?

The output of HGB and HGB_ALL are high utility association rules. To explain what is a high utility association rule, it is necessary to review some definitions.

An itemset is an unordered set of distinct items.

An association rule is an implication of the form X ==> Y where X and Y are two non empty itemsets, and where X and Y are disjoint. X is called the antecedent of the rule and Y is called the consequent.

The utility of an itemset in a transaction is the sum of the utility of its items in the transaction. For example, the utility of the itemset {1 4} in transaction t1 is 5 + 6 = 11 and the utility of {1 4} in transaction t3 is 5 + 2 = 7. The utility of an itemset in a database is the sum of its utility in all transactions where it appears. For example, the utility of {1 4} in the database is the utility of {1 4} in t1 plus the utility of {1 4} in t3, for a total of 11 + 7 = 18. A high utility itemset is an itemset such that its utility is no less than min_utility.

An association rule X ==> Y is said to be a high utility association rule if it satisfy the following conditions:

The support of an itemset is the number of transactions that contain the itemset. For example, the itemset {1, 3, 5} has a support of 2 because it appears in three transactions from the database (t1 and t4). A closed is an itemset X such that there does not exist an itemset Y strictly included in X that has the same support. For example, itemset {1, 3, 5} is a closed itemset.

A closed high utility itemset (CHUI) is a high-utility itemset that is a closed itemset.

A high utility generator, as defined by Sahoo (2015), is a high utility itemsets that has no subset that is also a high utility itemset and has the same support

A high utility association rule is said to be a non redundant association rule if X U Y is a closed itemset and X is a generator.

For example, if we run HGB to find the non redundant high utility association rules with a minimum utility of 30 and minconf = 0.5 (which means 50%) we obtain 4 rules:

Association rule utility utility of antecedent utility confidence
{2 5} ==> {3} 37 31 100%
{2 4}==> {3 5} 40 30 100 %
{2 5} ==> [3 4] 40 31 77.4 %
{2 4}==>[1 3 5 6] 30 30 53.3 %

Now if we run the HGB_All algorithms, we find 13 rules, as follows:

Association rule utility utility of antecedent utility confidence
2 4 ==> 5 36 30 100%
2 4 ==> 3 34 30 100 %
2 5 ==> 4 36 31 77.4 %
2 5 ==> 3 37 31 100%
2 4 ==> 3 5 40 30 100%
2 5 ==> 3 4 40 31 77.4 %
2 3 4 ==> 5 40 34 100%
2 4 5 ==> 3 40 36 100%
2 3 5 ==> 4 40 37 75.7%
2 4 ==> 1 3 5 6 30 30 53.3%
2 3 4 ==> 1 5 6 30 34 50 %
2 4 5 ==> 1 3 6 30 36 52.78 %
2 3 4 5 ==> 1 6 30 40 50 %

Input file format

The input file format of HUCI-Miner is defined as follows. It is a text file. Each line represents a transaction. Each line is composed of three sections, as follows.

For example, for the previous example, the input file is defined as follows:

3 5 1 2 4 6:30:1 3 5 10 6 5
3 5 2 4:20:3 3 8 6
3 1 4:8:1 5 2
3 5 1 7:27:6 6 10 5
3 5 2 7:11:2 3 4 2

Consider the first line. It means that the transaction {3, 5, 1, 2, 4, 6} has a total utility of 30 and that items 3, 5, 1, 2, 4 and 6 respectively have a utility of 1, 3, 5, 10, 6 and 5 in this transaction. The following lines follow the same format.

Output file format

The output file format of HGB is defined as follows. It is a text file, that contains a list of high utility association rules. Each line contains a rule. First, items from the antecedent are listed, separated by spaces. Then, there is "==>", followed by the list of items from the consequent, separated by spaces. Then, #UTIL: is followed by the utility of the rule. Then the keyword #AUTIL: is followed by the utility of the rule antecedent. Then, the keyword #UCONF: is followed by the utility confidence. Here is the output of HGB for the above example:

2 5 ==> 3 #UTIL: 37 #AUTIL: 31 #UCONF: 1.0
2 4 ==> 3 5 #UTIL: 40 #AUTIL: 30 #UCONF: 1.0
2 5 ==> 3 4 #UTIL: 40 #AUTIL: 31 #UCONF: 0.7741935483870968
2 4 ==> 1 3 5 6 #UTIL: 30 #AUTIL: 30 #UCONF: 0.5333333333333333

And here is the output of HGB_All for the example:

2 4 ==> 5 #UTIL: 36 #AUTIL: 30 #UCONF: 1.0
2 4 ==> 3 #UTIL: 34 #AUTIL: 30 #UCONF: 1.0
2 5 ==> 4 #UTIL: 36 #AUTIL: 31 #UCONF: 0.7741935483870968
2 5 ==> 3 #UTIL: 37 #AUTIL: 31 #UCONF: 1.0
2 4 ==> 3 5 #UTIL: 40 #AUTIL: 30 #UCONF: 1.0
2 5 ==> 3 4 #UTIL: 40 #AUTIL: 31 #UCONF: 0.7741935483870968
2 3 4 ==> 5 #UTIL: 40 #AUTIL: 34 #UCONF: 1.0
2 4 5 ==> 3 #UTIL: 40 #AUTIL: 36 #UCONF: 1.0
2 3 5 ==> 4 #UTIL: 40 #AUTIL: 37 #UCONF: 0.7567567567567568
2 4 ==> 1 3 5 6 #UTIL: 30 #AUTIL: 30 #UCONF: 0.5333333333333333
2 3 4 ==> 1 5 6 #UTIL: 30 #AUTIL: 34 #UCONF: 0.5
2 4 5 ==> 1 3 6 #UTIL: 30 #AUTIL: 36 #UCONF: 0.5277777777777778
2 3 4 5 ==> 1 6 #UTIL: 30 #AUTIL: 40 #UCONF: 0.5

Performance

This is the original implementation of HGB and HGB_All, which also uses the original implementation of HUCI_Miner.

Where can I get more information about the HGB, HGB_All and HUCI_Miner algorithms?

This is the reference of the article describing the HUCI_Miner algorithm, as well as HGB and HGB_All:

Sahoo, Jayakrushna, Das, A., Goswami, A., (2015) An efficient approach for mining association rules from high utility itemsets. Expert Syst. Appl. 42, 5754-5778.

Besides, for a general overview of high utility itemset mining, you may read this survey paper.