Mining Irregular High-Utility Itemsets Using the PHM_irregular Algorithm (SPMF documentation)
This example explains how to run the PHM_irregular algorithm using the SPMF open-source data mining library.
How to run this example?
- If you are using the graphical interface, (1) choose the "PHM_irregular" algorithm, (2) select the input file "DB_UtilityPerHUIs.txt", (3) set the output file name (e.g. "output.txt") (4) set the minimum utility threshold to 20 and the regularity threshold to 2, respectively and (5) click "Run algorithm".
- If you want to execute this example from the command line,
then execute this command:
java -jar spmf.jar run PHM_irregular DB_UtilityPerHUIs.txt output.txt 20 2 in a folder containing spmf.jar and the example input file DB_UtilityPerHUIs.txt. - If you are using the source code version of SPMF, to run respectively PHM_irregular, launch the file "MainTestPHM_irregular.java"in the package ca.pfv.SPMF.tests.
What is PHM_irregular ?
PHM_irregular is a variation of the PHM algorithm for discovering irregular high utility itemsets in a sequence of transactions (a transaction database) having information about the utility of items. The PHM algorithm was proposed by Fournier-Viger et al. (2016).
An irregular high utility itemset is an itemset that does not periodically appear in a sequence of transaction and that generate a high profit (have a high utility). Discovering irregular high utility itemset may have many applications such as discovering irregular purchase behaviors of customers of a retail store, which yield a high profit.
The PHM_irregular algorithm is a variation of the PHM algorithm also offered in SPMF, as the measure of regularity is basically equivalent to the measure of periodicity .
What is the input of the PHM_irregular algorithm?
The input is a transaction database (a sequence of transactions) and four parameters that are set by the user:
- a minimum utility threshold minutil ( a positive integer)
- a regularity threshold ( a positive integer) - note: this is equivalent to the minper threshold of the PHM algorithm
Note that two optional parameters are also offered to specify constraints on the minimum and maximum number of items that patterns should contain (positive integers).
A transaction database is a set of transactions. Let's consider the following database consisting of 7 transactions (t1,t2...t5,t6, t7) and 7 items (1, 2, 3, 4, 5, 6, 7). This database is provided in the text file "DB_utilityPerHUIs.txt" in the package ca.pfv.spmf.tests of the SPMF distribution.
Items | Transaction utility | Item utilities for this transaction | |
t1 | {1, 3} | 6 | 5 1 |
t2 | {5} | 3 | 3 |
t3 | {1, 2, 3, 4, 5} | 25 | 5 10 1 6 3 |
t4 | {2, 3, 4, 5} | 20 | 8 3 6 3 |
t5 | {1, 3, 4} | 8 | 5 1 2 |
t6 | {1,3,5} | 22 | 10 6 6 |
t7 | {2, 3, 5} | 9 | 4 2 3 |
Each line of the database is:
- a set of items (the first column of the table),
- the sum of the utilities (e.g. profit) of these items in this transaction (the second column of the table),
- the utility of each item for this transaction (e.g. profit generated by this item for this transaction)(the third column of the table).
Note that the value in the second column for each line is the sum of the values in the third column.
What are real-life examples of such a database? There are several applications in real life. One application is a customer transaction database. Imagine that each transaction represents the items purchased by a customer on a given day. The first transaction named "t1" represents the purchase of items 3, 5, 1, 2, 4 and 6. The amount of money spent for each item is respectively 1 $, 3 $, 5 $, 10 $, 6 $ and 5 $. The total amount of money spent in this transaction is 1 + 3 + 5 + 10 + 6 + 5 = 30 $.
What is the output of the algorithm?
PHM_irregular is an algorithm for discovering irregular high-utility itemsets. An itemset is a group of items. The PHM_irregular algorithm finds itemsets that irregularly appears in a sequence of transactions and generate a high profit (have a high utility). To measure if an itemset is irregular, the algorithm calculates its periods. To explain this in more details, it is necessary to introduce a few definitions.
The set of transactions containing an itemset X is denoted as g(X). For example, consider the itemset {1, 3}. The set g({1,3}) is equal to {t1, t3, t5, t6}. In other words, the itemset {1, 3} appears in the transactions t1, t3, t5 and t6. It is also said that these are four occurrences of the itemset {1,3}.
Now, to assess the irregularity (i.e. non periodicity) of an itemset X, its list of periods is calculated. A period is the time in terms of number of transactions between two occurences of an itemset in the database (see the paper for the formal definition). For example, the periods of the itemset {1, 3} are {1,2,2,1,1}. The first period of {1,3} is 1 because {1,3} appears in the first transaction after the creation of the database. The second period of {1,3} is 2 because the itemset appears in transaction t3, which is two transactions after t1. The third period of {1,3} is 2 because the itemset appears in transactions t5, which is two transactions after t3. Then, the fourth period of {1,3} is 1 because the itemset appears in t6, which is one transaction after t5. Finally, the fifth period of {1,3} is 1 because there is one transaction appearing after the last occurrence of {1,3} in the database (in t6).
The PHM_irregular algorithms utilize the list of periods of an itemset X to calculate its regularity. The regularity (also known as periodicity in PHM) is the smallest period among the periods of the itemset (note that the first and last periods are excluded from the calculation of the regularity - see the paper for details).
The utility of an itemset in a transaction is the sum of the utility of its items in the transaction. For example, the utility of the itemset {1 3} in transaction t1 is 5 + 1 = 6 and the utility of {1 3} in transaction t3 is 5 + 1 = 6. The utility of an itemset in a database is the sum of its utility in all transactions where it appears. For example, the utility of {13} in the database is the utility of {1 3} in t1 plus the utility of {1 3} in t3, plus the utility of {1 3} in t5, plus the utility of {1 3} in t6, for a total of 6 + 6 + 6 + 16= 34. A high utility itemset is an itemset such that its utility is no less than min_utility.
The PHM_irregular algorithm finds all the high-utility itemsets that have a regularity that is no less than the regularity threshold. Those itemsets are called irregular high-utility itemsets.
For example, if PHM_irregular is run on the previous transaction database with a minutil = 20, minper = 1, maxper = 3 , minavgper = 1 and maxavgper = 2, the PHM_irregular algorithm finds 7 irregular high-utility itemsets.
itemset | utility | regularity ( periodicity) |
{4, 2, 1} | 21 | 4 |
{4,2, 1, 5} | 24 | 4 |
{4, 2, 1, 5, 3} | 25 | 4 |
{4, 2, 1, 3} | 22 | 4 |
{4, 1, 3} | 20 | 3 |
{1, 5} | 24 | 3 |
{1, 5, 3} | 31 | 3 |
How should I interpret the results?
Each irregular high-utility itemset is annotated with its utility (e.g. profit), and its regularity (periodicty). For example, the itemset {1, 5} yield a profit of 24$, has a regularity of 3.
Input file format
The input file format used by the algorithm is defined as follows. It is a text file. An item is represented by a positive integer. A transaction is a line in the text file. In each line (transaction), items are separated by a single space. It is assumed that all items within a same transaction (line) are sorted according to a total order (e.g. ascending order) and that no item can appear twice within the same line.
For example, for the previous example, the input file is defined as follows:
3 1:6:1 5
5:3:3
3 5 1 2 4:25:1 3 5 10 6
3 5 2 4:20:3 3 8 6
3 1 4:8:1 5 2
3 5 1:22:6 6 10
3 5 2:9:2 3 4
Output file format
The output file format is defined as follows. It is a text file, where each line represents an irregular itemset. On each line, the items of the itemset are first listed. Each item is represented by an integer and it is followed by a single space. After, all the items, the keyword #UTIL: appears followed by a single space, the utility of the itemset, and a single space. Then, the keyword "#REG:" appears, which is followed by the regularity of the itemset.
For example, here is the output file for this example. The first line indicates that the itemset {1, 2, 4} is an irregular high-utility itemset, having a utility of 21$ and a regularity of 4 transactions.
4 2 1 #UTIL: 21 #REG: 4
4 2 1 5 #UTIL: 24 #REG: 4
4 2 1 5 3 #UTIL: 25 #REG: 4
4 2 1 3 #UTIL: 22 #REG: 4
4 1 3 #UTIL: 20 #REG: 3
1 5 #UTIL: 24 #REG: 3
1 5 3 #UTIL: 31 #REG: 3
Where can I get more information about the PHM_irregular and PHM algorithm?
This is the article proposing the original PHM algorithm for periodic high utility itemset mining:
Fournier-Viger, P., Lin, C.W., Duong, Q.-H., Dam, T.-L. (2016). PHM: Mining Periodic High-Utility Itemsets. Proc. 16th Industrial Conference on Data Mining. Springer LNAI 9728, 15 pages.
The PHM_irregular is a simple variation of the PHM algorithm that has been published in this chapter:
Fournier-Viger, P., Wu. Y., Dinh, D.-T., Song, W., Lin, J.C.W. (2021). Discovering Periodic High Utility Itemsets in a Discrete Sequence. In the book “Periodic Pattern Mining: Theory, Algorithms and Application”, Springer, to appear.
Besides, for a general overview of high utility itemset mining, you may read this survey paper.