Mining Minimal High-Utility Itemsets from a transaction database with utility information using the MinFHM Algorithm (SPMF documentation)

This example explains how to run the MinFHM algorithm using the SPMF open-source data mining library.

How to run this example?

What is MinFHM?

MinFHM (Fournier-Viger et al., 2016) is an algorithm for discovering minimal high-utility itemsets in a transaction database containing utility information.

There has been quite a huge amount of work on the topic of high-utility itemset mining in recent years. High-utility itemset mining consists of finding sets of items that yield a high profit in a database of customer transaactions where the purchase quantities of items in transactions is indicated and each item has a unit profit. Several algorithms have been proposed for high-utility itemset mining. However, they may find a huge number of patterns. These patterns are often very long and often represent rare cases, as in real-life, few customers exactly buy the same large set of items. For marketing purpose, a retailer may be more interested in finding the smallest sets of items that generate a high profit, since it is easier to co-promote a small set of items targeted at many customers rather than a large set of items targeted at few customers. The MinFHM algorithm was designed to address this issues by discovering only the high-utility itemsets that are minimal.

A high-utility itemset is said to be minimal if it has no subset that is also a high-utility itemset. In terms of application to transaction database, the concept of minimal high-utility itemsets can be understood as the smallest sets of items that yield a high profit.. The concept of minimal high-utility itemset can also be understood as the opposite of the concept of maximal high-utility itemset proposed in other work.

This is the original implementation of MinFHM.

What is the input?

MinFHM takes as input a transaction database with utility information and a minimum utility threshold min_utility (a positive integer). Let's consider the following database consisting of 5 transactions (t1,t2...t5) and 7 items (1, 2, 3, 4, 5, 6, 7). This database is provided in the text file "DB_utility.txt" in the package ca.pfv.spmf.tests of the SPMF distribution.


Items Transaction utility Item utilities for this transaction
t1 3 5 1 2 4 6 30 1 3 5 10 6 5
t2 3 5 2 4 20 3 3 8 6
t3 3 1 4 8 1 5 2
t4 3 5 1 7 27 6 6 10 5
t5 3 5 2 7 11 2 3 4 2

Each line of the database is:

Note that the value in the second column for each line is the sum of the values in the third column.

What are real-life examples of such a database? There are several applications in real life. One application is a customer transaction database. Imagine that each transaction represents the items purchased by a customer. The first customer named "t1" bought items 3, 5, 1, 2, 4 and 6. The amount of money spent for each item is respectively 1 $, 3 $, 5 $, 10 $, 6 $ and 5 $. The total amount of money spent in this transaction is 1 + 3 + 5 + 10 + 6 + 5 = 30 $.

What is the output?

The output of MinFHM is the set of minimal high utility itemsets having a utility no less than a min_utility threshold (a positive integer) set by the user. To explain what is a minimal high-utility itemsets, it is necessary to review some definitions.

An itemset is an unordered set of distinct items. The utility of an itemset in a transaction is the sum of the utility of its items in the transaction. For example, the utility of the itemset {1 4} in transaction t1 is 5 + 6 = 11 and the utility of {1 4} in transaction t3 is 5 + 2 = 7. The utility of an itemset in a database is the sum of its utility in all transactions where it appears. For example, the utility of {1 4} in the database is the utility of {1 4} in t1 plus the utility of {1 4} in t3, for a total of 11 + 7 = 18. A high utility itemset is an itemset such that its utility is no less than min_utility.

A minimal high utility itemsets (MinHUI) is a high-utility itemset that is has no subset that is a high-utility itemset

For example, if we run MinFHM with a minimum utility of 30, we obtain 3 minimal high-utility itemsets:

itemsets utility
{2, 4} 30
{1 3 5} 31
{2 5} 31

If the database is a transaction database from a store, we could interpret these results as all the smallest groups of items bought together that generated a profit of 30 $ or more (that are minimal).

Input file format

The input file format of MinFHM is defined as follows. It is a text file. Each lines represents a transaction. Each line is composed of three sections, as follows.

For example, for the previous example, the input file is defined as follows:

3 5 1 2 4 6:30:1 3 5 10 6 5
3 5 2 4:20:3 3 8 6
3 1 4:8:1 5 2
3 5 1 7:27:6 6 10 5
3 5 2 7:11:2 3 4 2

Consider the first line. It means that the transaction {3, 5, 1, 2, 4, 6} has a total utility of 30 and that items 3, 5, 1, 2, 4 and 6 respectively have a utility of 1, 3, 5, 10, 6 and 5 in this transaction. The following lines follow the same format.

Output file format

The output file format of MinFHM is defined as follows. It is a text file, where each line represents a minimal high utility itemset. On each line, the items of the itemset are first listed. Each item is represented by an integer, followed by a single space. After, all the items, the keyword #UTIL: " appears and is followed by the utility of the itemset. For example, we show below the output file for this example.

4 2 #UTIL: 30
2 5 #UTIL: 31
1 5 3 #UTIL: 31

For example, the third line indicates that the itemset {2, 4} has a utility of 30$. The following lines follows the same format.

Performance

High utility itemset mining is a more difficult problem than frequent itemset mining. Therefore, high-utility itemset mining algorithms are generally slower than frequent itemset mining algorithms. The MinFHM algorithm was proposed in 2016 to discover only the high-utility itemsets that are minimal. It was found that MinFHM can be orders of magitude faster than algorithms such as FHM for mining all high-utility itemsets.

Implementation details

This is the original implementation of the MinFHM algorithm

Where can I get more information about the MinFHM algorithm?

This is the reference of the article describing the MinFHM algorithm:

Fournier-Viger, P., Lin, C.W., Wu, C.-W., Tseng, V. S., Faghihi, U. (2016). Mining Minimal High-Utility Itemsets. Proc. 27th International Conference on Database and Expert Systems Applications (DEXA 2016). Springer, LNCS, 13 pages, to appear

You can also view a video presentation of the MinFHM algorithm

Besides, for a good overview of itemset mining algorithms, you may read this survey paper.