Mining Skyline High-Utility Itemsets in a transaction database with utility information using the SkyMine Algorithm (SPMF documentation)

This example explains how to run the SkyMine algorithm using the SPMF open-source data mining library.

How to run this example?

What is SkyMine?

SkyMine (Goyal et al., 2015) is an algorithm for discovering skyline high-utility itemsets in a transaction database containing utility information.

This is the original implementation of SkyMine.

What is the input?

SkyMine takes as input a transaction database with purchase quantities, a table indicating the utility of items, and a minimum utility threshold min_utility (a positive integer). Let's consider the following database consisting of 5 transactions (t1, t2, ..., t5) and 9 items (1, 2, 3, 4, 5, 6, 7, 8, 9). This database is provided in the text file "SkyMineTransaction.txt" in the package ca.pfv.spmf.tests of the SPMF distribution.

Transaction   Items          Item purchase quantities for this transaction
t1            1 3 4 8        1 1 1 1
t2            1 3 5 7        2 6 2 5
t3            1 2 3 4 5 6    1 2 1 6 1 5
t4            2 3 5 7        2 2 1 2
t5            1 3 4 9        1 1 1 1

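Each line of the database is a transaction: a set of items (first column of items in the table), followed by the purchase quantity of each of these items in that transaction (second column of the table).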

For example, the second line of the database indicates that in the second transaction, the items 1, 3, 5, and 7 were purchased respectively with quantities of 2, 6, 2, and 5.

Moreover, another table must be provided to indicate the unit profit of each item (how much profit is generated by the sale of one unit of each item). For example, consider the utility table provided in the file "SkyMineItemUtilities.txt" (below). The first line indicates that each unit sold of item 1 yields a profit of 5 $.

Item Utility (unit profit)
1 5
2 2
3 1
4 2
5 3
6 1
7 1
8 1
9 25

What are real-life examples of such a database? One application is a customer transaction database, where each transaction represents the items purchased by a customer. The first customer, named "t1", bought items 1, 3, 4 and 8, with purchase quantities of 1, 1, 1 and 1 respectively. Since the unit profits of these items are 5, 1, 2 and 1, the total profit generated by this transaction is (1*5)+(1*1)+(1*2)+(1*1) = 9 $.
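As a minimal sketch, the profit of transaction t1 can be computed from the two tables as the sum of quantity times unit profit. The variable names are chosen for this illustration only and are not part of SPMF.

```python
# Unit profits copied from the utility table (item -> profit per unit sold).
unit_profit = {1: 5, 2: 2, 3: 1, 4: 2, 5: 3, 6: 1, 7: 1, 8: 1, 9: 25}

# Transaction t1 copied from the transaction table (item -> purchase quantity).
t1 = {1: 1, 3: 1, 4: 1, 8: 1}

# Profit of a transaction = sum over its items of quantity * unit profit.
t1_profit = sum(quantity * unit_profit[item] for item, quantity in t1.items())
print(t1_profit)  # 9
```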

What is the output?

The output of SkyMine is the set of skyline high-utility itemsets. To explain what a skyline high-utility itemset is, it is necessary to review some definitions.

An itemset is an unordered set of distinct items. The utility of an item in a transaction is the product of its purchase quantity in the transaction and its unit profit. For example, the utility of item 3 in transaction t2 is (6*1) = 6 $. The utility of an itemset in a transaction is the sum of the utilities of its items in that transaction. For example, the utility of the itemset {5, 7} in transaction t2 is (2*3)+(5*1) = 11 $ and the utility of {5, 7} in transaction t4 is (1*3)+(2*1) = 5 $. The utility of an itemset in a database is the sum of its utilities in all transactions where it appears. For example, the utility of {5, 7} in the database is the utility of {5, 7} in t2 plus the utility of {5, 7} in t4, for a total of 11 + 5 = 16 $. The utility of an itemset X is denoted as u(X). Thus u({5, 7}) = 16 $.

The support of an itemset is the number of transactions that contain the itemset. For example, the support of the itemset {5, 7} is sup({5, 7}) = 2 transactions because it appears in transactions t2 and t4.
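The utility and support definitions can be written down directly. The sketch below is only an illustration, not SPMF code: the function names are hypothetical, and the data is the five-transaction table (t1 to t5) and the utility table from the input section.

```python
# Unit profits from the utility table (item -> profit per unit sold).
unit_profit = {1: 5, 2: 2, 3: 1, 4: 2, 5: 3, 6: 1, 7: 1, 8: 1, 9: 25}

# The five transactions of the table (item -> purchase quantity).
database = {
    "t1": {1: 1, 3: 1, 4: 1, 8: 1},
    "t2": {1: 2, 3: 6, 5: 2, 7: 5},
    "t3": {1: 1, 2: 2, 3: 1, 4: 6, 5: 1, 6: 5},
    "t4": {2: 2, 3: 2, 5: 1, 7: 2},
    "t5": {1: 1, 3: 1, 4: 1, 9: 1},
}

def contains(transaction, itemset):
    """True if every item of the itemset occurs in the transaction."""
    return all(item in transaction for item in itemset)

def utility_in_transaction(itemset, transaction):
    """Utility of the itemset in one transaction (0 if it does not appear)."""
    if not contains(transaction, itemset):
        return 0
    return sum(transaction[item] * unit_profit[item] for item in itemset)

def utility(itemset):
    """u(X): sum of the utilities of X over all transactions containing X."""
    return sum(utility_in_transaction(itemset, t) for t in database.values())

def support(itemset):
    """sup(X): number of transactions containing X."""
    return sum(1 for t in database.values() if contains(t, itemset))

print(utility_in_transaction({5, 7}, database["t2"]))  # 11
print(utility_in_transaction({5, 7}, database["t4"]))  # 5
print(utility({5, 7}))                                 # 16
print(support({5, 7}))                                 # 2
```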

An itemset X is said to dominate another itemset Y if and only if sup(X) >= sup(Y) and u(X) > u(Y), or sup(X) > sup(Y) and u(X) >= u(Y).

A skyline high-utility itemset is an itemset that is not dominated by any other itemset in the transaction database.
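The dominance test and the skyline filter can be sketched as follows. This is an illustration of the two definitions only, not of how SkyMine works internally. The function names are hypothetical, and the (support, utility) pairs were computed by hand from the six transactions listed in the input file format section.

```python
def dominates(x, y):
    """True if x dominates y, where x and y are (support, utility) pairs."""
    sup_x, u_x = x
    sup_y, u_y = y
    return (sup_x >= sup_y and u_x > u_y) or (sup_x > sup_y and u_x >= u_y)

def skyline(measures):
    """Keep only the itemsets that are dominated by no other itemset."""
    return {
        name: m
        for name, m in measures.items()
        if not any(dominates(other, m) for other in measures.values())
    }

# A few itemsets with their (support, utility), computed by hand.
measures = {
    "{3}": (6, 14),
    "{1, 3}": (4, 34),
    "{1, 3, 4}": (3, 34),
    "{2, 3, 4, 5}": (2, 40),
}

# {1, 3} dominates {1, 3, 4}: higher support and equal utility.
print(dominates(measures["{1, 3}"], measures["{1, 3, 4}"]))  # True
print(sorted(skyline(measures)))  # ['{1, 3}', '{2, 3, 4, 5}', '{3}']
```

Note that an itemset never dominates itself, since both conditions require one strict inequality.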

For example, if we run SkyMine, we obtain 3 skyline high-utility itemsets:

itemsets utility
{3} 14
{1, 3} 34
{2, 3, 4, 5} 40
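On a database this small, the result can be reproduced by brute force: enumerate every itemset, compute its support and utility, and keep the non-dominated ones. The sketch below is not the SkyMine algorithm, which is designed to avoid exhaustive enumeration; it is only a check of the definitions. It uses the six transactions of the input file listed in the input file format section, and all names are hypothetical.

```python
from itertools import combinations

# Unit profits (item -> profit per unit sold).
unit_profit = {1: 5, 2: 2, 3: 1, 4: 2, 5: 3, 6: 1, 7: 1, 8: 1, 9: 25}

# The six lines of the transaction file (item -> purchase quantity).
transactions = [
    {1: 1, 3: 1, 4: 1, 8: 1},
    {1: 2, 3: 6, 5: 2, 7: 5},
    {1: 1, 2: 2, 3: 1, 4: 6, 5: 1, 6: 5},
    {2: 4, 3: 3, 4: 3, 5: 1},
    {2: 2, 3: 2, 5: 1, 7: 2},
    {1: 1, 3: 1, 4: 1, 9: 1},
]

def support_and_utility(itemset):
    """Return (sup(X), u(X)) for the itemset X over all transactions."""
    sup, util = 0, 0
    for t in transactions:
        if all(item in t for item in itemset):
            sup += 1
            util += sum(t[item] * unit_profit[item] for item in itemset)
    return sup, util

def dominates(x, y):
    """True if the (support, utility) pair x dominates the pair y."""
    return (x[0] >= y[0] and x[1] > y[1]) or (x[0] > y[0] and x[1] >= y[1])

# Enumerate all non-empty itemsets that appear in at least one transaction.
items = sorted(unit_profit)
candidates = {}
for size in range(1, len(items) + 1):
    for itemset in combinations(items, size):
        sup, util = support_and_utility(itemset)
        if sup > 0:
            candidates[itemset] = (sup, util)

# Keep the itemsets that no other candidate dominates.
skyline = {
    itemset: m
    for itemset, m in candidates.items()
    if not any(dominates(other, m) for other in candidates.values())
}

for itemset in sorted(skyline, key=lambda x: skyline[x][1]):
    sup, util = skyline[itemset]
    print(itemset, "support:", sup, "utility:", util)
```

With this data, the three itemsets that remain are {3}, {1, 3} and {2, 3, 4, 5}, with utilities 14, 34 and 40.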

If the database is a transaction database from a store, we could interpret these results as the itemsets that are not dominated by any other itemset in terms of selling frequency and utility.

Input file format

The input file format of the transaction file of SkyMine is defined as follows. It is a text file. Each line represents a transaction. Each transaction is a list of items separated by single spaces. Each item is a positive integer followed by ":" and its purchase quantity in the transaction. Note that it is assumed that the items on each line are ordered according to some total order, such as the alphabetical order. For the previous example, the input file SkyMineTransaction.txt is shown below. It contains six lines: the fourth line (2:4 3:3 4:3 5:1) does not appear in the table given earlier, and the output reported in this example corresponds to these six lines.

1:1 3:1 4:1 8:1
1:2 3:6 5:2 7:5
1:1 2:2 3:1 4:6 5:1 6:5
2:4 3:3 4:3 5:1
2:2 3:2 5:1 7:2
1:1 3:1 4:1 9:1

For example, the second line indicates that the items 1, 3, 5 and 7 respectively have a purchase quantity of 2, 6, 2 and 5 in that transaction.
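A line in this format can be parsed with a few string operations. The following is a minimal sketch, assuming well-formed input. The function name is hypothetical and this is not the parser used by SPMF.

```python
def parse_transaction(line):
    """Parse one line such as '1:2 3:6 5:2 7:5' into {item: quantity}."""
    transaction = {}
    for token in line.split():
        item, quantity = token.split(":")
        transaction[int(item)] = int(quantity)
    return transaction

second_line = parse_transaction("1:2 3:6 5:2 7:5")
print(second_line)  # {1: 2, 3: 6, 5: 2, 7: 5}
```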

The input format of the second file, indicating the utility (unit profit) of each item, is defined as follows. Each line is an item, followed by a space, followed by the unit profit of the item. For example, consider the content of the file "SkyMineItemUtilities.txt", shown below. The first line indicates that the item 1 has a unit profit of 5$. The other lines follow the same format.

1 5
2 2
3 1
4 2
5 3
6 1
7 1
8 1
9 25
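The utility file can be read in the same spirit. This is a sketch only, assuming integer items and integer unit profits as in the example. The function name is hypothetical, and blank lines are skipped as a convenience rather than as a documented feature of the format.

```python
def parse_utility_table(text):
    """Parse lines of the form '<item> <unit profit>' into {item: profit}."""
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        item, profit = line.split()
        table[int(item)] = int(profit)
    return table

example = "1 5\n2 2\n3 1\n4 2\n5 3\n6 1\n7 1\n8 1\n9 25\n"
unit_profit = parse_utility_table(example)
print(unit_profit[1], unit_profit[9])  # 5 25
```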

Output file format

The output file format of SkyMine is defined as follows. It is a text file, where each line represents a skyline high-utility itemset. On each line, the items of the itemset are first listed. Each item is represented by an integer, followed by a single space. Then, the keyword "#UTIL:" appears and is followed by the utility of the itemset. For example, the output file for this example is shown below.

3 #UTIL: 14
1 3 #UTIL: 34
2 3 4 5 #UTIL: 40

For example, the third line indicates that the itemset {2, 3, 4, 5} has a utility of 40 $. The other lines follow the same format.
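For post-processing, an output line can be split on the "#UTIL:" keyword. The sketch below shows one way to read and write such lines. Both function names are hypothetical helpers, not part of SPMF.

```python
def parse_output_line(line):
    """Parse '2 3 4 5 #UTIL: 40' into ([2, 3, 4, 5], 40)."""
    items_part, utility_part = line.split("#UTIL:")
    items = [int(token) for token in items_part.split()]
    return items, int(utility_part)

def format_output_line(items, utility):
    """Inverse of parse_output_line: items, then '#UTIL:' and the utility."""
    return " ".join(str(item) for item in items) + " #UTIL: " + str(utility)

parsed = parse_output_line("2 3 4 5 #UTIL: 40")
print(parsed)                              # ([2, 3, 4, 5], 40)
print(format_output_line([1, 3], 34))      # 1 3 #UTIL: 34
```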

Performance

SkyMine is the original algorithm for mining skyline high-utility itemsets.

Where can I get more information about the algorithm?

This is the reference of the article describing the algorithm:

Goyal, V., Sureka, A., & Patel, D. (2015). Efficient Skyline Itemsets Mining. In Proceedings of the Eighth International C* Conference on Computer Science & Software Engineering (pp. 119-124). ACM.

For a good overview of itemset mining algorithms, you may read this survey paper.