SPMF documentation > Mining Quantiative High Utility Itemsets using the VHUQI Algorithm

This example explains how to run the VHUQI algorithm using the SPMF open-source data mining library.

How to run this example?

What is VHUQI?

VHUQI is an algorithm for mining quantitative high utility itemsets. Unlike traditional high utility itemsets, the VHUQI algorithm will not only find patterns that have a high utility, but also give information about the ranges of purchase quantities (internal utility) that lead to this high utility.

What is the input?

The input of VHUQI is a transaction database with purchase quantities (internal utility) and unit profit information (external utility). Moreover, three other parameter must be set:

For details about the meaning of these parameters, please see the VHUQI research paper.

A transaction database is a set of transactions, where each transaction is a list of distinct items (symbols). For example, let's consider the following transaction database. It consists of 4 transactions (t1, t2, ..., t4) and 4 items (1, 2, 3, 4). For instance, transaction t1 is the set of items {1, 2, 3, 4}. In a transaction database, each item has a corresponding purchase quantity. For examle, in the first transaction, the quantity of item 1 is 3, the quantity of item 2 is 3, the quantity of item 2 is 6, and the quantity of item 4 is 5. This database is provided in the file "HUQI_DB.txt" of the SPMF distribution. It is important to note that an item is not allowed to appear twice in the same transaction, and that items are assumed to be sorted according to some order (e.g. ascending order of numbers).

Transaction ID

Items

Quantities

t1

1 2 3 4

3 3 2 5

t2

1 2 3

2 3 3

t3

1 2 4

7 6 4

t4

1 2 3 4

7 8 9 6

Besides a table indicating the unit profit of each item is required. For example, consider the following table:

Item

Unit profit

1

42

2

87

3

94

4

97

It indicates that the unit profit yield by selling the items 1, 2, 3 and 4 are 5, 4, 6 and 2, respectively. This table is provided in the file "HUQI_DB_profit.txt"

Note that in research papers, unit profit is often called "external utility" while purchase quantities are called "internal utility". This is the terminology used by researchers. Although in the above example, utility is used to represent quantities and profit, it could be more generally be used to represent the relative importance of items to the user. Thus, it can be also applied to other types of data besides shopping databases.

What is the output?

The output file format is defined as follows. It is a text file, where each line represents a quantitative high utility itemset. An itemset contains several items. Each item is represented in the form (x,y) where x is the item and y is its purchase quantity, or by the form (x,y,z) indicating that from y to z units of item x where purchased. At the end of each line, the utility of the itemset is indicated after the keyword #UTIL: For examle, here is a few line of the output file for the above example:

(2,6) (1,7) #UTIL: 816
(2,6) (4,4) (1,7) #UTIL: 1204
(4,4,5) (1,7) #UTIL: 682

For example, the second line indicates that the itemset containing 6 units of item 2, 4 units of item 4, and 7 units of item 1 yield a utility of 1204 $. The third line indicates that the purchase of 4 to 5 units of item 4, and 7 units of item 1, yield a 682 $ profit.

Input file format

VHUQI takes two files as input, defined as follows.

The first file (e.g. HUQI_DB.txt) is a text file containing transactions. Each lines represents a transaction. Each line is a list of positive numbers where an item is followed by a space and then its quantity. Then, the next item appears, followed by a space, followed by its quantity, and so on... This is the file HUQI_DB.txt used in the example.

(1,3) (2,3) (3,2) (4,5)
(1,2) (2,3) (3,3)
(1,7) (2,6) (4,4)
(1,7) (2,8) (3,9) (4,6)

The second file is a text file (e.g. HUQI_DB.txt) which provides the unit profit of each item. Each is an item, followed by a space, followed by its unit profit. This is the file HUQI_DB_profit.txt used in the above example.

1, 42
2, 87
3, 94
4, 97

Output file format

The output file format is defined as follows. It is a text file, where each line represents a quantitative high utility itemset. An itemset contains several items. Each item is represented in the form (x,y) where x is the item and y is its purchase quantity, or by the form (x,y,z) indicating that from y to z units of item x where purchased. At the end of each line, the utility of the itemset is indicated after the keyword #UTIL: For examle, here is a few line of the output file for the above example:

(3,9) #UTIL: 846
(2,8) #UTIL: 696
(4,6) #UTIL: 582
(1,7) #UTIL: 588
(4,4,5) #UTIL: 873
(3,9) (2,8) #UTIL: 1542
(3,9) (4,6) #UTIL: 1428
(3,9) (1,7) #UTIL: 1140
(3,9) (2,8) (4,6) #UTIL: 2124
(3,9) (2,8) (1,7) #UTIL: 1836
(3,9) (2,8) (4,6) (1,7) #UTIL: 2418
(3,9) (4,6) (1,7) #UTIL: 1722
(2,8) (4,6) #UTIL: 1278
(2,8) (1,7) #UTIL: 990
(2,8) (4,6) (1,7) #UTIL: 1572
(4,6) (1,7) #UTIL: 876
(2,6) (4,4) #UTIL: 910
(2,6) (1,7) #UTIL: 816
(2,6) (4,4) (1,7) #UTIL: 1204
(4,4,5) (1,7) #UTIL: 682
(4,4,5) (2,3) #UTIL: 746
(4,4,5) (3,2) #UTIL: 673
(4,4,5) (1,3) #UTIL: 611
(4,4,5) (2,3) (3,2) #UTIL: 934
(4,4,5) (2,3) (1,3) #UTIL: 872
(4,4,5) (2,3) (3,2) (1,3) #UTIL: 1060
(4,4,5) (3,2) (1,3) #UTIL: 799
(4,5) (2,3) #UTIL: 746
(4,5) (3,2) #UTIL: 673
(4,5) (1,3) #UTIL: 611
(4,5) (2,3) (3,2) #UTIL: 934
(4,5) (2,3) (1,3) #UTIL: 872
(4,5) (2,3) (3,2) (1,3) #UTIL: 1060
(4,5) (3,2) (1,3) #UTIL: 799
(4,4) (1,7) #UTIL: 682
(3,3) (2,3) #UTIL: 543
(3,3) (2,3) (1,2) #UTIL: 627
(2,3) (1,2,3) #UTIL: 732
(2,3) (3,2) (1,3) #UTIL: 575

Implementation details

This is the original implementation of the algorithm.

Where can I get more information about the VHUQI algorithm?

This is the reference of the article describing the VHUQI algorithm:

C. H. Li, C. Wu and V. S. Tseng, “Efficient Vertical Mining of High Utility Quantitative Itemsets, Proc. of Int’l Conf. on Granular Computing,pp. 155-160, 2014.

Besides, for a general overview of high utility itemset mining, you may read this survey paper.

<< Return to table of contents of SPMF documentation

Copyright © 2008-2020 Philippe Fournier-Viger. All rights reserved.