This example explains how to run the **PHM_irregular algorithm** using the SPMF open-source data mining library.

How to run this example?

**If you are using the graphical interface,**(1) choose the**"PHM_irregular"**algorithm**,**(2) select the input file**"DB_UtilityPerHUIs.txt"**, (3) set the output file name (e.g. "output.txt") (4) set the**minimum utility threshold**to 20 and the**regularity threshold**to 2, respectively and (5) click "Run algorithm".**If you want to execute this example from the command line**, then execute this command:

java -jar spmf.jar run**PHM**_irregular DB_UtilityPerHUIs.txt output.txt 20 2 in a folder containing spmf.jar and the example input file**DB_UtilityPerHUIs**.txt.**If you are using the source code version of SPMF,**to run respectively**PHM_irregular****,**launch the file**"MainTestPHM_irregular.java"**in the package**ca.pfv.SPMF.tests**.

What is PHM_irregular ?

PHM_irregularis a variation of the PHM algorithm for discovering irregular high utility itemsets in a sequence of transactions (a transaction database) having information about the utility of items. ThePHMalgorithm was proposed by Fournier-Viger et al. (2016).An irregular high utility itemset is an itemset that

does notperiodically appear in a sequence of transaction and that generate a high profit (have a high utility). Discovering irregular high utility itemset may have many applications such as discovering irregular purchase behaviors of customers of a retail store, which yield a high profit.The

PHM_irregularalgorithm is a variation of the PHM algorithm also offered in SPMF, as the measure of regularity is basically equivalent to the measure of periodicity .

What is the input of the PHM_irregular algorithm?

The input is a

transaction database(a sequence of transactions) and four parameters that are set by the user:

- a minimum utility threshold
minutil ( a positive integer)- a regularity threshold
( a positive integer)-note: this is equivalent to the minper threshold of the PHM algorithmNote that two optional parameters are also offered to specify constraints on the minimum and maximum number of items that patterns should contain (positive integers).

A

transaction databaseis a set of transactions. Let's consider the following database consisting of 7 transactions (t1,t2...t5,t6, t7) and 7 items (1, 2, 3, 4, 5, 6, 7). This database is provided in the text file "DB_utilityPerHUIs.txt" in the packageca.pfv.spmf.testsof the SPMF distribution.

ItemsTransaction utilityItem utilities for this transactiont1{1, 3} 6 5 1 t2{5} 3 3 t3{1, 2, 3, 4, 5} 25 5 10 1 6 3 t4{2, 3, 4, 5} 20 8 3 6 3 t5{1, 3, 4} 8 5 1 2 t6{1,3,5} 22 10 6 6 t7{2, 3, 5} 9 4 2 3 Each line of the

databaseis:

- a set of items (the first column of the table),
- the sum of the utilities (e.g. profit) of these items in this transaction (the second column of the table),
- the utility of each item for this transaction (e.g. profit generated by this item for this transaction)(the third column of the table).
Note that the value in the second column for each line is the sum of the values in the third column.

What are real-life examples of such a database? There are several applications in real life. One application is a customer transaction database. Imagine that each transaction represents the items purchased by a customer on a given day. The first transaction named "

t1" represents the purchase of items 3, 5, 1, 2, 4 and 6. The amount of money spent for each item is respectively 1 $, 3 $, 5 $, 10 $, 6 $ and 5 $. The total amount of money spent in this transaction is 1 + 3 + 5 + 10 + 6 + 5 = 30 $.

What is the output of the algorithm?

PHM_irregularis an algorithm for discoveringirregular high-utility itemsets. An itemset is a group of items. ThePHM_irregularalgorithm finds itemsets that irregularly appears in a sequence of transactions and generate a high profit (have a high utility). To measure if an itemset is irregular, the algorithm calculates its periods. To explain this in more details, it is necessary to introduce a few definitions.The set of transactions containing an itemset X is denoted as g(X). For example, consider the itemset {1, 3}. The set g({1,3}) is equal to {t1, t3, t5, t6}. In other words, the itemset {1, 3} appears in the transactions t1, t3, t5 and t6. It is also said that these are four occurrences of the itemset {1,3}.

Now, to assess the

irregularity (i.e. non periodicity)of an itemset X, its list of periods is calculated. Aperiodis the time in terms of number of transactions between two occurences of an itemset in the database (see the paper for the formal definition). For example, the periods of the itemset {1, 3} are {1,2,2,1,1}. The first period of {1,3} is 1 because {1,3} appears in the first transaction after the creation of the database. The second period of {1,3} is 2 because the itemset appears in transaction t3, which is two transactions after t1. The third period of {1,3} is 2 because the itemset appears in transactions t5, which is two transactions after t3. Then, the fourth period of {1,3} is 1 because the itemset appears in t6, which is one transaction after t5. Finally, the fifth period of {1,3} is 1 because there is one transaction appearing after the last occurrence of {1,3} in the database (in t6).The PHM_irregular algorithms utilize the list of periods of an itemset X to calculate its regularity. The

regularity (also known as periodicity in PHM)is the smallest period among the periods of the itemset (note that the first and last periods are excluded from the calculation of the regularity- see the paper for details).The

utility of an itemset in a transactionis the sum of the utility of its items in the transaction. For example, the utility of the itemset {1 3} in transaction t1 is 5 + 1 = 6 and the utility of {1 3} in transaction t3 is 5 + 1 = 6. Theutility of an itemset in a databaseis the sum of its utility in all transactions where it appears. For example, the utility of {13} in the database is the utility of {1 3} in t1 plus the utility of {1 3} in t3, plus the utility of {1 3} in t5, plus the utility of {1 3} in t6, for a total of 6 + 6 + 6 + 16= 34. Ahigh utility itemsetis an itemset such that its utility is no less thanmin_utility.The

PHM_irregularalgorithm finds all the high-utility itemsets that have a regularity that is no less than theregularitythreshold. Those itemsets are calledirregular high-utility itemsets.For example, if

PHM_irregularis run on the previous transaction database with aminutil= 20,minper =1,maxper= 3 ,minavgper= 1 andmaxavgper= 2, thePHM_irregularalgorithm finds 7 irregular high-utility itemsets.

itemset |
utility |
regularity ( periodicity) |

{4, 2, 1} | 21 | 4 |

{4,2, 1, 5} | 24 | 4 |

{4, 2, 1, 5, 3} | 25 | 4 |

{4, 2, 1, 3} | 22 | 4 |

{4, 1, 3} | 20 | 3 |

{1, 5} | 24 | 3 |

{1, 5, 3} | 31 | 3 |

How should I interpret the results?

Each irregular high-utility itemset is annotated with its utility (e.g. profit), and its regularity (periodicty). For example, the itemset {1, 5} yield a profit of 24$, has a regularity of 3.

Input file format

The

input file formatused by the algorithm is defined as follows. It is a text file. An item is represented by a positive integer. A transaction is a line in the text file. In each line (transaction), items are separated by a single space. It is assumed that all items within a same transaction (line) are sorted according to a total order (e.g. ascending order) and that no item can appear twice within the same line.For example, for the previous example, the input file is defined as follows:

3 1:6:1 5

5:3:3

3 5 1 2 4:25:1 3 5 10 6

3 5 2 4:20:3 3 8 6

3 1 4:8:1 5 2

3 5 1:22:6 6 10

3 5 2:9:2 3 4

Output file format

The

output file formatis defined as follows. It is a text file, where each line represents an irregular itemset. On each line, the items of the itemset are first listed. Each item is represented by an integer and it is followed by a single space. After, all the items, the keyword #UTIL: appears followed by a single space, the utility of the itemset, and a single space. Then, the keyword "#REG:" appears, which is followed by the regularity of the itemset.For example, here is the output file for this example. The first line indicates that the itemset {1, 2, 4} is an irregular high-utility itemset, having a utility of 21$ and a regularity of 4 transactions.

4 2 1 #UTIL: 21 #REG: 4

4 2 1 5 #UTIL: 24 #REG: 4

4 2 1 5 3 #UTIL: 25 #REG: 4

4 2 1 3 #UTIL: 22 #REG: 4

4 1 3 #UTIL: 20 #REG: 3

1 5 #UTIL: 24 #REG: 3

1 5 3 #UTIL: 31 #REG: 3

Where can I get more information about the **PHM_irregular **and **PHM** algorithm?

This is the article proposing the original PHM algorithm for periodic high utility itemset mining:

Fournier-Viger, P., Lin, C.W., Duong, Q.-H., Dam, T.-L. (2016). PHM: Mining Periodic High-Utility Itemsets. Proc. 16th Industrial Conference on Data Mining. Springer LNAI 9728, 15 pages.The PHM_irregular is a simple variation of the PHM algorithm that has not been published because it can be seen as a special case of the problem addressed by PHM. That problem called irregular high utility itemset was proposed by Amphawan et al. (2016). A good overview of this problem is presented by Amphawan et al. in the following book chapter:

Laoviboon, S., & Amphawan, K. (2019). Mining High-Utility Irregular Itemsets. In

High-Utility Pattern Mining(pp. 175-205). Springer, Cham.Besides, for a general overview of

high utility itemset mining, you may read this survey paper.

<< Return to table of contents of SPMF documentation

Copyright © 2008-2020 Philippe Fournier-Viger. All rights reserved.