Mining High-Utility Probability Sequential Patterns from a Sequence Database with utility and probability information using the UHUSPM Algorithm (SPMF documentation)

This example explains how to run the UHUSPM algorithm using the SPMF open-source data mining library.

How to run this example?

If you are using the graphical interface, (1) choose the "UHUSPM" algorithm, (2) select the input file "contextPHUSPM.txt", (3) set the output file name (e.g. "output.txt") (4) set the minimum utility to 20, (5) set the minimum probability to 1.4, and (5) click "Run algorithm".
If you want to execute this example from the command line, then execute this command:
java -jar spmf.jar runUHUSPMcontextPHUSPM.txt output.txt 20 1.4 in a folder containing spmf.jar and the example input file contextPHUSPM.txt.
If you are using the source code version of SPMF, launch the file "MainTestUHUSPM.java" in the package ca.pfv.SPMF.tests.

What is UHUSPM?

UHUSPM (Zhang et al., 2018) is an algorithm for discovering high-utility sequential patterns in an uncertain sequence database containing utility and probability information.

What is the input?

The algorithm takes as input:

a sequence database with utility information,
a minimum utility threshold minUtility (a positive integer)
and a minimum support threshold minSupport (a positive float).

Let's consider the following sequence database consisting 4 sequences of transactions (s1,s2, s3, s4) and 7 items (1, 2, 3, 4, 5, 6, 7). This database is provided in the text file "contextPHUSPM.txt" in the package ca.pfv.spmf.tests of the SPMF distribution.

SID	Sequence	Probability
s1	<[(1, 6)], [(2, 4)], [(1, 3), (3, 2), (5, 2)] >	0.6
s2	<[(2, 1), (3, 3], [(1, 2), (4, 4], [(1, 4, (2, 2)]>	0.8
s3	<[(6, 2)], [(6, 2), (4, 4], [(2, 4), (3, 3]>	0.5
s4	<[(2, 1)], [(3, 3)], [(1, 2), (2, 2)], [(1, 8), (2, 1)]>	0.9

Each line of the database is a sequence:

each sequence is an ordered list of itemsets, such that sequences are enclosed by < >, and itemsets are enclosed by [].
each itemset contains a set of items (symbols) represented by integers (in this example: 1, 2, 3, 4, 5, 6)
each item is followed by a utility value (e.g. sale profit), indicated after “,”
each sequence has a probability (indicated in the "Probability" column in the above table)

Note that this representation of the input database is not exactly the same as in the paper about PHUSPM. However, it is equivalent.
What are real-life examples of such a database? A typical example is a database containing sequences of customer transactions. Imagine that each sequence represents the transactions made by a customer. The first customer named "s1" bought items 1 the item generated a profit of 6$. Then, the customer bought item 2 for 4 $. Then, the customer bought items 1, 3 and 5, and those items respectively generated a profit of 3$, 2$ and 2$. The probability associated to this sequence is 0.6.

What is the output?

The output of the algorithm is the set of high utility probability sequential patterns meeting the constraints specified by the parameters.

A sequential pattern is a sequence of itemsets X1, X2, ... Xk, where X1, X2... Xk are itemsets (sets of items). A sequential pattern is said to occur in another sequence SB = Y1, Y2, ... Ym, where Y1, Y2... Ym are itemsets, if and only if there exists integers 1 <= i1 < i2... < ik <= m such that X1 is a subset of Yi1, X2 s a subset of Yi2, ... Xk s a subset of Yik.

The utility (profit) of a sequential pattern is the sum of the maximum utility (profit) generated by the pattern in each sequences where it appears. The probability of a sequential pattern is the sum of the probability generated by the pattern in each sequences where is appears. For example, the sequence<(1), (1)> appears in sequences s1,s2, and s4. In s1, the profit and probability generated by that patern are 6 + 3 = 9 $ and 0.6, respectively. In s2, the profit and probability generated by that pattern are 2 + 4 = 6 $ and 0.8, respectively. In s4, the profit and probability generated by that pattern are 2 + 8 = 10$ and 0.9, respectively. Thus, the total utility of that pattern in the database is 9 + 6 + 10 = 25 $ and the probability is 0.6 + 0.8 + 0.9 = 3.2.

The algorithm returns all high-utility probability sequential patterns, such that each pattern the two following criteria:

the utility of the sequential pattern in the database is no less than a minimum utility threshold set by the user,
the probability of the sequential pattern in the database is no less than a minimum expected support threshold set by the user,

For example, if we run the algorithm with a minimum utility of 20 and a minimum probability of 1.4, we obtain the following high probability-utility sequential patterns:

sequence	utility	probability
<(1), (1)>	25	2.3
<(2), (1)>	22	2.3
<(2), (1), (1, 2)>	21	1.7
<(3), (1, 2)>	21	1.7
<(3), (1), (1)>	22	1.7
<(3), (1), (1, 2)>	25	1.7

Input file format

The input file format of that algorithm is defined as follows. It is a text file.

Each lines represents a sequence of transactions.
Each transaction is separated by the keyword -1.
A transaction is a list of items (positive integers). Each item is followed by a positive integer, which is item profit. If there is more than one item, they are separated by ",".
At the end of a sequence is the probability of the sequence
In a transaction, it is assumed that items are sorted according to some order (eg. alphabetical order).

For example, for the previous example, the input file is defined as follows:
1 6 -1 2 4 -1 3 3, 5 2, 1 3 -1 17 -1 0.6
2 1, 3 3 -1 1 2, 4 4 -1 1 4, 2 2 -1 16 -1 0.8
6 2 -1 6 2, 4 4 -1 2 4, 3 3 -1 15 -1 0.5
2 1 -1 3 3 -1 1 2, 2 2 -1 1 8, 2 1 -1 17 -1 0.9

For example, consider the first line. It means that the first customer bought items 1, and the item generated a profit 6$. Then, the customer bought item 2 for 4$. Then, the customer bought item 3 for 3$, $5 for 2$ and 1 for 3$. The total utility (profit) generated by that sequence of transaction is 6$ + 4$ + 3$ + 2$ + 3$ = 17 $, and the probability associated to that sequence is 0.6. Other lines follow the same format.

Output file format

The output file format of USPAN is defined as follows. It is a text file, where each line represents a high utility sequential pattern. Each line, first indicate the sequential patterns which is a list of itemsets. Each itemset is represented by a list of positive integers. Each itemset is separated by a -1. Then, the keyword "#UTIL" appears followed by the utility of the sequential pattern.

1 4 -1 3 -1 2 -1 #UTIL: 37
1 4 -1 3 -1 7 -1 #UTIL: 35
1 -1 3 -1 7 -1 #UTIL: 36
3 -1 #UTIL: 35
3 -1 7 -1 #UTIL: 40
4 -1 3 -1 2 -1 #UTIL: 36
4 -1 3 -1 2 -1 5 -1 #UTIL: 37
4 -1 3 -1 2 -1 7 -1 #UTIL: 38
4 -1 3 -1 5 7 -1 #UTIL: 35

For example, the first line represents the pattern of buying items 1 and 4 together, then buying item 3, then buying item 2. This pattern has a total utility of 37, meaning that it generated a 37 $ profit. The other lines follow the same format.

Optional feature: giving names to items

The output file format of the algorithm is defined as follows. It is a text file, where each line represents a high utility probability sequential pattern. Each line, first indicates the sequential patterns which is a list of itemsets. Each itemset is represented by a list of positive integers separated by spaces. Each itemset is separated by a -1. Then, the keyword "#UTIL" appears followed by the utility of the sequential pattern. Then, the keyword "#SP" appears followed by the probability of the sequential pattern.

1 -1 1 -1 #UITL: 25 #SP: 2.3000002
2 -1 1 -1 #UITL: 22 #SP: 2.3000002
3 -1 1 2 -1 #UITL: 21 #SP: 1.7
3 -1 1 -1 1 -1 #UITL: 22 #SP: 1.7
3 -1 1 -1 1 2 -1 #UITL: 25 #SP: 1.7
2 -1 1 -1 1 2 -1 #UITL: 21 #SP: 1.7

For example, the first line represents the pattern of buying item 1 followed by buying the item 1. This pattern has a total utility of 25, meaning that it generated a 25 $ profit and a probability of 2.3.

The other lines follow the same format.

Performance

High utility probability sequential pattern mining is a variation of the problem of high-utility sequential pattern mining where probability are introduced to annotate sequences. This is the original implementation of the algorithm.

Where can I get more information about the algorithm?

This is the article describing the algorithm:

Zhang, B., Lin, J. C.-W., Li, T., Gan, W., Fournier-Viger, P. (2017). Mining High Utility-Probability Sequential Patterns in Uncertain Databases. PLoS One, Public Library of Science, to appear.

Besides, you may read this survey of sequential pattern mining, which gives an overview of sequential pattern mining algorithms.