This example explains how to run the **PHUSPM algorithm** using the SPMF open-source data mining library.

How to run this example?

**If you are using the graphical interface,**(1) choose the**"PHUSPM"**algorithm**,**(2) select the input file**"contextPHUSPM.txt"**, (3) set the output file name (e.g. "output.txt") (4) set the minimum utility to 20, (5) set the minimum probability to 1.4, and (5) click "Run algorithm".**If you want to execute this example from the command line**, then execute this command:

java -jar spmf.jar run**PHUSPM****contextPHUSPM**.txt output.txt 20 1.4 in a folder containing spmf.jar and the example input file**contextPHUSPM**.txt.**If you are using the source code version of SPMF,**launch the file**"MainTestPHUSPM.java"**in the package**ca.pfv.SPMF.tests**.

What is **PHUSPM**?

PHUSPM(Zhang et al., 2018) is an algorithm for discoveringhigh-utility sequential patternsin anuncertainsequence databasecontaining utility and probability information.

What is the input?

The algorithm takes as

input:

- a sequence database with utility information,
- a minimum utility threshold
minUtility(a positive integer)- and a minimum support threshold
minSupport(a positive float).Let's consider the following sequence database consisting 4 sequences of transactions (s1,s2, s3, s4) and 7 items (1, 2, 3, 4, 5, 6, 7). This database is provided in the text file "

contextPHUSPM.txt" in the packageca.pfv.spmf.testsof the SPMF distribution.

SID

Sequence

Probabilitys1

<[(1, 6)], [(2, 4)], [(1, 3), (3, 2), (5, 2)] >

0.6

s2

<[(2, 1), (3, 3], [(1, 2), (4, 4], [(1, 4, (2, 2)]>

0.8

s3

<[(6, 2)], [(6, 2), (4, 4], [(2, 4), (3, 3]>

0.5

s4

<[(2, 1)], [(3, 3)], [(1, 2), (2, 2)], [(1, 8), (2, 1)]>

0.9

Each line of the database is a sequence:

- each sequence is an ordered list of itemsets, such that sequences are enclosed by < >, and itemsets are enclosed by [].
- each itemset contains a set of items (symbols) represented by integers (in this example: 1, 2, 3, 4, 5, 6)
- each item is followed by a utility value (e.g. sale profit), indicated after “,”
- each sequence has a probability (indicated in the "Probability" column in the above table)
Note that this representation of the input database is not exactly the same as in the paper about PHUSPM. However, it is equivalent.

What are real-life examples of such a database? A typical example is a database containing sequences of customer transactions. Imagine that each sequence represents the transactions made by a customer. The first customer named "s1" bought items 1 the item generated a profit of 6$. Then, the customer bought item 2 for 4 $. Then, the customer bought items 1, 3 and 5, and those items respectively generated a profit of 3$, 2$ and 2$. The probability associated to this sequence is 0.6.

What is the output?

The

output of the algorithmis the set ofhigh utility probability sequential patternsmeeting the constraints specified by the parameters.A

sequential patternis a sequence of itemsets X1, X2, ... Xk, where X1, X2... Xk are itemsets(sets of items). A sequential pattern is said to occur in another sequence SB= Y1, Y2, ... Ym, where Y1, Y2... Ym are itemsets, if and only if there exists integers 1 <= i1 < i2... < ik <= m such that X1is a subset ofYi1, X2s a subset ofYi2, ... Xks a subset ofYik.The

utility (profit) of a sequential patternis the sum of the maximum utility (profit) generated by the pattern in each sequences where it appears. Theprobabilityof a sequential patternis the sum of the probability generated by the pattern in each sequences where is appears. For example, the sequence<(1), (1)> appears in sequences s1,s2, and s4. In s1, the profit and probability generated by that patern are 6 + 3 = 9 $ and 0.6, respectively. In s2, the profit and probability generated by that pattern are 2 + 4 = 6 $ and 0.8, respectively. In s4, the profit and probability generated by that pattern are 2 + 8 = 10$ and 0.9, respectively. Thus, the total utility of that pattern in the database is 9 + 6 + 10 = 25 $ and the probability is 0.6 + 0.8 + 0.9 = 3.2.The algorithm returns all

high-utility probability sequential patterns,such that each pattern the two following criteria:

- the utility of the sequential pattern in the database is no less than a minimum utility threshold set by the user,
- the probability of the sequential pattern in the database is no less than a minimum expected support threshold set by the user,
For example, if we run the algorithm with a minimum utility of 20 and a minimum probability of 1.4, we obtain the following high probability-utility sequential patterns:

sequence

utility

probability<(1), (1)>

25

2.3

<(2), (1)>

22

2.3

<(2), (1), (1, 2)>

21

1.7

<(3), (1, 2)>

21

1.7

<(3), (1), (1)>

22

1.7

<(3), (1), (1, 2)>

25

1.7

Input file format

The

input file format of that algorithmis defined as follows. It is a text file.

- Each lines represents a sequence of transactions.
- Each transaction is separated by the keyword -1.
- A transaction is a list of items (positive integers). Each item is followed by a positive integer, which is item profit. If there is more than one item, they are separated by ",".
- At the end of a sequence is the probability of the sequence
- In a transaction, it is assumed that items are sorted according to some order (eg. alphabetical order).
For example, for the previous example, the input file is defined as follows:

1 6 -1 2 4 -1 3 3, 5 2, 1 3 -1 17 -1 0.6

2 1, 3 3 -1 1 2, 4 4 -1 1 4, 2 2 -1 16 -1 0.8

6 2 -1 6 2, 4 4 -1 2 4, 3 3 -1 15 -1 0.5

2 1 -1 3 3 -1 1 2, 2 2 -1 1 8, 2 1 -1 17 -1 0.9For example, consider the first line. It means that the first customer bought items 1, and the item generated a profit 6$. Then, the customer bought item 2 for 4$. Then, the customer bought item 3 for 3$, $5 for 2$ and 1 for 3$. The total utility (profit) generated by that sequence of transaction is 6$ + 4$ + 3$ + 2$ + 3$ = 17 $, and the probability associated to that sequence is 0.6. Other lines follow the same format..

Output file format

The

output file formatof the algorithmis defined as follows. It is a text file, where each line represents ahigh utility probability sequential pattern. Each line, first indicates the sequential patterns which is a list of itemsets. Each itemset is represented by a list of positive integers separated by spaces. Each itemset is separated by a -1. Then, the keyword "#UTIL" appears followed by the utility of the sequential pattern. Then, the keyword "#SP" appears followed by the probability of the sequential pattern.1 -1 1 -1 #UITL: 25 #SP: 2.3000002

2 -1 1 -1 #UITL: 22 #SP: 2.3000002

3 -1 1 2 -1 #UITL: 21 #SP: 1.7

3 -1 1 -1 1 -1 #UITL: 22 #SP: 1.7

3 -1 1 -1 1 2 -1 #UITL: 25 #SP: 1.7

2 -1 1 -1 1 2 -1 #UITL: 21 #SP: 1.7For example, the first line represents the pattern of buying item 1 followed by buying the item 1. This pattern has a total utility of 25, meaning that it generated a 25 $ profit and a probability of 2.3.

The other lines follow the same format.

Performance

High utility probability sequential pattern miningis a variation of the problem ofhigh-utility sequential pattern miningwhere probability are introduced to annotate sequences. This is the original implementation of the algorithm.

Where can I get more information about the algorithm?

This is the article describing the algorithm:

Zhang, B., Lin, J. C.-W., Li, T., Gan, W., Fournier-Viger, P. (2017). Mining High Utility-Probability Sequential Patterns in Uncertain Databases. PLoS One, Public Library of Science, to appear.

Besides, you may read this survey of sequential pattern mining, which gives an overview of sequential pattern mining algorithms.

<< Return to table of contents of SPMF documentation

Copyright © 2008-2020 Philippe Fournier-Viger. All rights reserved.