Mining Periodic High--Utility Itemsets Using the PHM Algorithm (SPMF documentation)
This example explains how to run the PHM algorithm using the SPMF open-source data mining library.
How to run this example?
- If you are using the graphical interface, (1) choose the "PHM" algorithm, (2) select the input file "DB_UtilityPerHUIs.txt", (3) set the output file name (e.g. "output.txt") (4) set the minimum utility threshold, minimum periodicity, the maximum periodicity, the minimum average periodicity and the maximm average periodicity, respectively to 20, 1, 3, 1, 2 and (5) click "Run algorithm".
- If you want to execute this example from the command line,
then execute this command:
java -jar spmf.jar run PHM DB_UtilityPerHUIs.txt output.txt 20 1 3 1 2 in a folder containing spmf.jar and the example input file DB_UtilityPerHUIs.txt. - If you are using the source code version of SPMF, to run respectively PHM, launch the file "MainTestPHM.java"in the package ca.pfv.SPMF.tests.
What is PHM ?
PHM is an algorithm for discovering periodic high-utility itemsets in a sequence of transactions (a transaction database) having information about the utility of items. It was proposed by Fournier-Viger et al. (2016). PHM can discover patterns that periodically appears in a sequence of transactions and that generate a high profit (have a hih utility). Periodic high utility itemset mining has many applications such as discovering periodic purchase behaviors of customers of a retail store, which yield a high profit.
The PHM algorithm is similar to the PFPM algorithm also offered in SPMF. The main difference is that PHM consider the additional constraint of utility (profit generated by itemsets). Thus, there is an additional constraint that we not only want to find periodic patterns, but also profitable patterns.
What is the input of the PHM algorithm?
The input is a transaction database (a sequence of transactions) and four parameters that are set by the user:
- a minimum utility threshold minutil ( a positive integer)
- a minimum periodicity threshold minper ( a positive integer)
- a maximum periodicity threshold maxper ( a positive integer)
- a minimum average periodicity threshold minavgper ( a positive integer)
- a maximum average periodicity threshold maxavgper ( a positive integer)
Note that two optional parameters are also offered to specify constraints on the minimum and maximum number of items that patterns should contain (positive integers).
A transaction database is a set of transactions. Let's consider the following database consisting of 7 transactions (t1,t2...t5,t6, t7) and 7 items (1, 2, 3, 4, 5, 6, 7). This database is provided in the text file "DB_utilityPerHUIs.txt" in the package ca.pfv.spmf.tests of the SPMF distribution.
Items | Transaction utility | Item utilities for this transaction | |
t1 | {1, 3} | 6 | 5 1 |
t2 | {5} | 3 | 3 |
t3 | {1, 2, 3, 4, 5} | 25 | 5 10 1 6 3 |
t4 | {2, 3, 4, 5} | 20 | 8 3 6 3 |
t5 | {1, 3, 4} | 8 | 5 1 2 |
t6 | {1,3,5} | 22 | 10 6 6 |
t7 | {2, 3, 5} | 9 | 4 2 3 |
Each line of the database is:
- a set of items (the first column of the table),
- the sum of the utilities (e.g. profit) of these items in this transaction (the second column of the table),
- the utility of each item for this transaction (e.g. profit generated by this item for this transaction)(the third column of the table).
Note that the value in the second column for each line is the sum of the values in the third column.
What are real-life examples of such a database? There are several applications in real life. One application is a customer transaction database. Imagine that each transaction represents the items purchased by a customer on a given day. The first transaction named "t1" represents the purchase of items 3, 5, 1, 2, 4 and 6. The amount of money spent for each item is respectively 1 $, 3 $, 5 $, 10 $, 6 $ and 5 $. The total amount of money spent in this transaction is 1 + 3 + 5 + 10 + 6 + 5 = 30 $.
What is the output of the algorithm?
PHM is an algorithm for discovering periodic high-utility itemsets. An itemset is a group of items. The PHM algorithm finds itemsets that appears periodically in a sequence of transactions and generate a high profit (have a high utility). To measure if an itemset is periodic, the algorithm calculates its periods. To explain this in more details, it is necessary to introduce a few definitions.
The set of transactions containing an itemset X is denoted as g(X). For example, consider the itemset {1, 3}. The set g({1,3}) is equal to {t1, t3, t5, t6}. In other words, the itemset {1, 3} appears in the transactions t1, t3, t5 and t6. It is also said that these are four occurrences of the itemset {1,3}.
Now, to assess the periodicity of an itemset X, its list of periods is calculated. A period is the time in terms of number of transactions between two occurences of an itemset in the database (see the paper for the formal definition). For example, the periods of the itemset {1, 3} are {1,2,2,1,1}. The first period of {1,3} is 1 because {1,3} appears in the first transaction after the creation of the database. The second period of {1,3} is 2 because the itemset appears in transaction t3, which is two transactions after t1. The third period of {1,3} is 2 because the itemset appears in transactions t5, which is two transactions after t3. Then, the fourth period of {1,3} is 1 because the itemset appears in t6, which is one transaction after t5. Finally, the fifth period of {1,3} is 1 because there is one transaction appearing after the last occurrence of {1,3} in the database (in t6).
The PHM algorithms utilize the list of periods of an itemset X to calculate its average periodicity, minimum periodicity and maximum periodicity. The average periodicity is calculated as the average of the periods of the itemset. The minimum periodicity is the smallest period among the periods of the itemset (note that the first and last periods are excluded from the calculation of the minimum periodicity - see the paper for details). The maximum periodicity is the largest period among the periods of the itemset.
The utility of an itemset in a transaction is the sum of the utility of its items in the transaction. For example, the utility of the itemset {1 3} in transaction t1 is 5 + 1 = 6 and the utility of {1 3} in transaction t3 is 5 + 1 = 6. The utility of an itemset in a database is the sum of its utility in all transactions where it appears. For example, the utility of {13} in the database is the utility of {1 3} in t1 plus the utility of {1 3} in t3, plus the utility of {1 3} in t5, plus the utility of {1 3} in t6, for a total of 6 + 6 + 6 + 16= 34. A high utility itemset is an itemset such that its utility is no less than min_utility.
The PHM algorithm finds all the high-utility itemsets that have a minimum periodicity and maximum periodicity that are not less than the minper and maxper thresholds, and an average periodicity that is not less than minavgper and not greater than maxavgper.Those itemsets are called periodic high-utility itemsets.
For example, if PHM is run on the previous transaction database with a minutil = 20, minper = 1, maxper = 3 , minavgper = 1 and maxavgper = 2, the PHM algorithm finds 7 periodic high-utility itemsets.
itemset | utility | support (number of transactions where the itemset appear) | minimum periodicity | maximum periodicity | average periodicity |
{2} | 22 | 3 | 1 | 3 | 1.75 |
{2, 5} | 31 | 3 | 1 | 3 | 1.75 |
{2, 3, 5} | 37 | 3 | 1 | 3 | 1.75 |
{2, 3} | 28 | 3 | 1 | 3 | 1.75 |
{1} | 25 | 4 | 1 | 2 | 1.4 |
{1, 3} | 34 | 4 | 1 | 2 | 1.4 |
{3, 5} | 27 | 4 | 1 | 3 | 1.4 |
How should I interpret the results?
Each periodic high-utility itemset is annotated with its utility (e.g. profit), support (number of transactions where it appears) as well as its minimum/maximum periodicity and average periodicity. For example, the itemset {1, 3} yield a profit of 22$, has a support of 4 because it appears in four transactions (t1, t3, t5 and t6. The average periodicity of {1,3} is 1.4 because on average it appears every 1.4 transactions in terms of time. The smallest period of {1,3} (minimum periodicity) is 1 and the largest period of {1,3} is 2 transactions.This indicates that {3} appears quite periodically and is quite profitable.
Input file format
The input file format used by the algorithm is defined as follows. It is a text file. An item is represented by a positive integer. A transaction is a line in the text file. In each line (transaction), items are separated by a single space. It is assumed that all items within a same transaction (line) are sorted according to a total order (e.g. ascending order) and that no item can appear twice within the same line.
For example, for the previous example, the input file is defined as follows:
3 1:6:1 5
5:3:3
3 5 1 2 4:25:1 3 5 10 6
3 5 2 4:20:3 3 8 6
3 1 4:8:1 5 2
3 5 1:22:6 6 10
3 5 2:9:2 3 4
Output file format
The output file format is defined as follows. It is a text file, where each line represents a periodic itemset. On each line, the items of the itemset are first listed. Each item is represented by an integer and it is followed by a single space. After, all the items, the keyword #UTIL: appears followed by a single space, the utility of the itemset, and a single space. Then, the keyword "#SUP:" appears, which is followed by an integer indicating the support of the itemset, expressed as a number of transactions. Then, the keyword #MINPER: appears and is followed by a space, an integer indicating the minimum periodicity of the itemset, and another space.Then, the keyword #MAXPER: appears and is followed by a space, an integer indicating the maximal periodicity of the itemset, and another space. Then, the keyword #AVGPER: appears and is followed by a space, a double value indicating the average periodicity of the itemset.
For example, here is the output file for this example. The first line indicates that the itemset {2} is a periodic high-utility itemset, having a utility of 22$, a support of 3 transactions, a minimum periodicity of 1 transactions, a maximum periodicity of 3 transactions and an average periodicity f 1.75 transactions..
2 #UTIL: 22 #SUP: 3 #MINPER: 1 #MAXPER: 3 #AVGPER: 1.75
2 5 #UTIL: 31 #SUP: 3 #MINPER: 1 #MAXPER: 3 #AVGPER: 1.75
2 5 3 #UTIL: 37 #SUP: 3 #MINPER: 1 #MAXPER: 3 #AVGPER: 1.75
2 3 #UTIL: 28 #SUP: 3 #MINPER: 1 #MAXPER: 3 #AVGPER: 1.75
1 #UTIL: 25 #SUP: 4 #MINPER: 1 #MAXPER: 2 #AVGPER: 1.4
1 3 #UTIL: 34 #SUP: 4 #MINPER: 1 #MAXPER: 2 #AVGPER: 1.4
5 3 #UTIL: 27 #SUP: 4 #MINPER: 1 #MAXPER: 3 #AVGPER: 1.4
Performance
PHM is currently the only algorithm for mining high-utility periodic patterns in sequence of transactions containing profit informaiton (items have weights/unit profit, and transactions indicate purchase quantities for items), which is offered in SPMF. It was shown that mining periodic high-utility itemsets can be faster than mining all high-utility itemsets, because PHM can filter many non-periodical patterns.
Where can I get more information about the PHM algorithm?
This is the article proposing the PHM algorithm:
Fournier-Viger, P., Lin, C.W., Duong, Q.-H., Dam, T.-L. (2016). PHM: Mining Periodic High-Utility Itemsets. Proc. 16th Industrial Conference on Data Mining. Springer LNAI 9728, 15 pages.