Mining Cross-Level High-Utility Itemsets in a Transaction Database using the FEACP Algorithm (SPMF documentation)

This example explains how to run the FEACP algorithm using the SPMF open-source data mining library.

How to run this example?

If you are using the graphical interface, (1) choose the "FEACP" algorithm, (2) select the input file "transaction_CLHMiner.txt" and the taxonomy file as "taxonomy_CLHMiner.txt", (3) set the output file name (e.g. "output.txt") (4) set the parameter minutil to 60 and (5) click "Run algorithm".
If you want to execute this example from the command line, then execute this command:
java -jar spmf.jar run CLHMiner transaction_CLHMiner.txt output.txt 60 taxonomy_CLHMiner.txt in a folder containing spmf.jar and the example input file taxonomy_CLHMiner.txt and transactions_CLHMiner.txt.
If you are using the source code version of SPMF, launch the file "MainTestCLHMiner.java" in the package ca.pfv.SPMF.tests.

What is FEACP?

FEACP is an algorithm for discovering the cross-level high-utility itemsets in a transaction database containing utility information and having a taxonomy.

High utility itemset mining has several applications such as discovering groups of items in transactions of a store that generate the most profit. A database containing utility information is a database where items can have quantities and a unit price. Although these algorithms are often presented in the context of market basket analysis, there exist other applications.

Contrarily to traditional high utility itemset mining algorithm, cross-level high utility itemset mining allows to consider a taxonomy of items (items can be organized into a hierarchy of categories). Then, patterns can be found involving these categories. There are several algorithms for cross-level high utility itemset mining such as CLH-Miner and FEACP. FEACP is highly efficient, and can be viewed as an extension of the EFIM algorithm.

What is the input?

FEACP takes as input a transaction database with utility information, a taxonomy database, and a minimum utility threshold min_utility (a positive number). Let’s consider the following transaction database consisting 7 transactions (t1, t2, …, t7) and 5 items (1, 2, 3, 4, 5). This transaction database is provided in the text file “transaction_CLHMiner.txt” in the package ca.pfv.spmf.tests of the SPMF distribution.

	Items	Transaction_utility	Item_utilities for this transaction
t1	1 3	6	5 1
t2	5	3	3
t3	1 2 3 4 5	25	5 10 1 6 3
t4	2 3 4 5	20	8 3 6 3
t5	1 3 4	5	5 1 2
t6	1 3 5	22	10 6 6
t7	2 3 5	9	4 2 3

Each line of the transaction database contains:

a set of items (the first column of the table),
the sum of the utilities (e.g. profit) of these items in this transaction (the second column of the table),
the utility of each item for this transaction (e.g. profit generated by this item for this transaction) (the third column of the table).

For example, the first transaction may indicate that some customer has purchased some items 1 and 3, that the profit generated by the sale of item 1 is 5, and the profit generated by the sale of item 3 is 1. The total utility (profit) is in this case 5 + 1 = 6. Thus, 6 is shown in the second column.

Note that the value in the second column for each line is the sum of the values in the third column.

The following table is the taxonomy database taxonomy_CLHMiner.txt for this example, which consists of 5 items and 3 generalized items. This taxonomy database is provided in the text file “taxonomy_CLHMiner.txt” in the package ca.pfv.spmf.tests of the SPMF distribution.

Item	Generalized item
1	6
2	6
3	7
4	8
5	8
6	7

Each line of the taxonomy database is:

an item (the first column of the table),
a generalized item (the category to which the item belongs)

For example, the first line indicates that item 1 is a special case (a specialization) of the item 6. The second line indicates that the item 2 is also a special case of the item 6. Thus, the item 6 can be viewed as category containing the items 1 and 2. The other lines have a similar meaning.

What is the output?

The output is the set of cross-level high utility itemsets having a utility no less than a min_utility threshold (a positive number) set by the user. To explain what is a cross-level high utility itemset, it is necessary to review some definitions.

An itemset is an unordered set of distinct items. The utility of an itemset in a transaction is the sum of the utility of its items in the transaction. The utility of a generalized itemset in a transaction is the sum of the utility of its descendant items in the transaction. For example, the utility of the itemset {1 3} in transaction t3 is 5 + 1 = 6 and the utility of {1 3} in t6 = 10 + 6 = 16. Because item 1 and item 2 are the descendants of itemset 6. The utility of 6 in t3 is 5 + 10 = 15. The utility of generalized itemset {3, 6} is 1+ 15 = 16. The utility of a (generalized) itemset in a database is the sum of its utility in all transactions where it appears. For example, the utility of {1 3} in the database is the utility of {1 3} in t1 plus the utility of {1 3} in t3 plus the utility of {1 3} in t5 plus the utility of {1 3} in t6, for a total of 6 + 6 + 6 + 16 = 34. The utility of {3 6} in the database is 6 + 16 + 11 + 6 + 16 + 6 = 61. A cross-level high utility itemset is a (generalized) itemset such that its utility is no less than min_utility. For example, if we run CLHMiner with a minimum utility of 60, we obtain 7 cross-level high utility itemsets.

Itemsets	Utility	Support
{7 8}	84	71% (5 transactions)
{3 6 8}	84	71% (5 transactions)
{6 8}	71	71% (5 transactions)
(5 7)	64	57% (4 transactions)
{3 5 6}	64	57% (4 transactions)
{7}	61	86% (6 transactions)
{3 6}	61	86% (6 transactions)

If the database is a transaction database from a store, we could interpret these results as all the sets of items that yield a profit that is no less than 60$.

Input file format

The input file format is defined as follows. The transaction database and taxonomy database are both text files. In the transaction database, each line represents a transaction. Each line is composed of three sections, as follows.

First, the items contained in the transaction are listed. An item is represented by a positive integer. Each item is separated from the next item by a single space. It is assumed that all items within a same transaction are sorted according to a total order (e.g. ascending order) and that no item can appear twice within the same transaction.
Second, the symbol “:” appears and is followed by the transaction utility (an integer).
Third, the symbol “:” appears and is followed by the utility of each item in this transaction (an integer), separated by single spaces.

For example, for the previous example, the input file for the transaction is defined as follows:

1 3:6:5 1
5:3:3
1 2 3 4 5:25:5 10 1 6 3
2 3 4 5:20:8 3 6 3
1 3 4:8:5 1 2
1 3 5:22:10 6 6
2 3 5:9:4 2 3

Consider the first line. It means that the transaction {1, 3} has a total utility of 6 and that items 1 and 3 respectively have a utility of 5 and 1 in this transaction. The following lines follow the same format.

In the taxonomy database, each line represents a category relationship. Each line is composed of two sections, as follows.

First, the item.
Second, the symbol “,” appears and is followed by a generalized item, representing the category that the item belongs.

For example, for the previous example, the input file for the taxonomy is defined as follows:

1,6
2,6
3,7
4,8
5,8
6,7
Consider the first line. It means that the item 1 is belongs to the generalized item 6.

Output file format

The output file format is defined as follows. It is a text file, where each line represents a cross-level high utility itemset. On each line, the items of the itemset are first listed. Each item is represented by an integer, followed by a single space. After all the items, the key word “#UTIL:” appears and is followed by the utility of the itemset. For example, we show below the output file for this example.

8 7 #UTIL: 84
8 3 6 #UTIL: 84
8 6 #UTIL: 71
7 #UTIL: 61
7 5 #UTIL: 64
5 3 6 #UTIL: 64
3 6 #UTIL: 61

For example, the fourth line indicates that the itemset {7} has a utility of 61. The following lines follows the same format.

Performance

This is the original implementation of the algorithm.

Where can I get more information about the FEACP algorithm?

The FEACP algorithm was proposed in this paper:

N. T. Tung, Loan T. T. Nguyen, Trinh D. D. Nguyen, Philippe Fournier-Viger, Ngoc Thanh Nguyen, Bay Vo: Efficient mining of cross-level high-utility itemsets in taxonomy quantitative databases. Inf. Sci. 587: 41-62 (2022)

Besides, for a general overview of high utility itemset mining, you may read this survey paper.