Mining High-Utility Sequential Patterns from a Sequence Database with utility information using the USPAN Algorithm (SPMF documentation)

This example explains how to run the USpan algorithm using the SPMF open-source data mining library.

How to run this example?

What is USpan?

USpan (Zida et al, 2012) is a famous algorithm for discovering high-utility sequential patterns in a sequence database containing utility information.

An typical example of a sequence database with utility information is a database of customer transactions containing sequences of transactions performed by customers, where each transaction is a set of items annotated with the profit generated by the sale of items. The goal of high-utility sequential rule mining is to find patterns of the form A, B, C, meaning that several customers have bought items A, followed by buying item B, followed by buying item C, and that this pattern generated a high profit. Although, this algorithm is designed for the scenario of sequence of transactions, the task is general and could be applied to other types of data such as sequences of webpages visited by user on a website, where the sale profit is replaced by the time spent on webpages.

A limitation of the problem of high-utility sequential pattern mining is that patterns are only found based on the profit that they generate but there is no measure of the confidence that these patterns will be followed. For example, a pattern A,B,C may have a high utility but most customers may still buy items A,B without buying C. The alternative that proposed a solution to this problem is high-utility sequential rule mining, which discover rules of the form A -> BC with a confidence (conditional probability). The algorithm HUSRM also offered in SPMF finds the high-utility sequential rules.

What is the input?

USPAN takes as input a sequence database with utility information, a minimum utility threshold min_utility (a positive integer), an optionally, a maximum pattern length parameter (a positive integer) indicating the maximum number of items that a pattern should contani.

Let's consider the following sequence database consisting of 4 sequences of transactions (s1,s2, s3, s4) and 7 items (1, 2, 3, 4, 5, 6, 7). This database is provided in the text file "DataBase_HUSRM.txt" in the package ca.pfv.spmf.tests of the SPMF distribution.


Sequence Sequence utility
s1 {1[1],2[4]},{3[10]},{6[9]},{7[2]},{5[1]} 27
s2 {1[1],4[12]},{3[20]},{2[4]},{5[1],7[2]} 40
s3 {1[1]},{2[4]},{6[9]},{5[1]} 15
s4 {1[3],2[4],3[5]},{6[3],7[1]} 16

Each line of the database is a sequence:

Note that this representation of the input database is not exactly the same as in the paper about USpan. However, it is equivalent.

What are real-life examples of such a database? A typical example is a database containing sequences of customer transactions. Imagine that each sequence represents the transactions made by a customer. The first customer named "s1" bought items 1 and 2, and those items respectively generated a profit of 1$ and 4$. Then, the customer bought item 3 for 10$. Then, the customer bought item 6 for 9 $. Then, the customer bought items 7 for 2$. Then the customer bought item 5 for 1 $.

What is the output?

The output of USPAN is the set of high utility sequential patterns meeting the criteria specified by the user

A sequential pattern is a sequence of itemsets X1, X2, ... Xk, where X1, X2... Xk are itemsets (sets of items). A sequential pattern is said to occur in another sequence SB = Y1, Y2, ... Ym, where Y1, Y2... Ym are itemsets, if and only if there exists integers 1 <= i1 < i2... < ik <= m such that X1 is a subset of Yi1, X2 s a subset of Yi2, ... Xk s a subset of Yik.

The utility (profit) of a sequential pattern is the sum of the maximum utility (profit) generated by the pattern in each sequences where it appears. For example, the rule (3)(7) appears in sequences s1,s2, and s4. In s1, the profit generated by that patern is 10 + 2 = 12 $. In s2, the profit generated by that pattern is 20 + 2 = 22 $. In s4, the profit generated by that pattern is 5+1 = 6$. Thus, the total utility of that rule in the database is 12 + 22 + 6 = 40 $.

The USPAN algorithm returns all high-utility sequential patterns, such that each pattern the two following criteria:

For example, if we run USPANwith a minimum utility of 35 and a maximum pattern length of 4 items, we obtain 9 high-utility sequential patterns:

rule utility
(1, 4), (3) (2) 37
(1, 4) (3) (7) 35
(1) (3) (7) 36
(3) 35
(3) (7) 40
(4) (3) (2) 36
(4) (3) (2) (5) 37
(4) (3) (2) (7) 38
(4) (3) (2) (5, 7) 35

If the database is a transaction database from a store, we could interpret these results as rules representing the purchasing behavior of customers, such that these rules have a high confidence and generate a high profit. For example, the rule {1,3} -> {7} means that all customers buying the items 1 and 3 always buy the item 7 thereafter (since the confidence is 100%) and that this rule has generated a profit of 57 $ and appear in three sequences.

Input file format

The input file format of USPAN is defined as follows. It is a text file.

For example, for the previous example, the input file is defined as follows:

1[1] 2[4] -1 3[10] -1 6[9] -1 7[2] -1 5[1] -1 -2 SUtility:27
1[1] 4[12] -1 3[20] -1 2[4] -1 5[1] 7[2] -1 -2 SUtility:40
1[1] -1 2[4] -1 6[9] -1 5[1] -1 -2 SUtility:15
1[3] 2[4] 3[5] -1 6[3] 7[1] -1 -2 SUtility:16

For examle, consider the first line. It means that the first customer nbought items 1 and 2, and those items respectively generated a profit of 1$ and 4$. Then, the customer bought item 3 for 10$. Then, the customer bought item 6 for 9 $. Then, the customer bought items 7 for 2$. Then the customer bought item 5 for 1 $. Thus, this customer has made 5 transaction. The total utility (profit) generated by that sequence of transaction is 1$ + 4$ + 10$ + 9$ + 2$ + 1$ = 27 $.

Output file format

The output file format of USPAN is defined as follows. It is a text file, where each line represents a high utility sequential pattern. Each line, first indicate the sequential patterns which is a list of itemsets. Each itemset is represented by a list of positive integers. Each itemset is separated by a -1. Then, the keyword "#UTIL" appears followed by the utility of the sequential pattern.

1 4 -1 3 -1 2 -1 #UTIL: 37
1 4 -1 3 -1 7 -1 #UTIL: 35
1 -1 3 -1 7 -1 #UTIL: 36
3 -1 #UTIL: 35
3 -1 7 -1 #UTIL: 40
4 -1 3 -1 2 -1 #UTIL: 36
4 -1 3 -1 2 -1 5 -1 #UTIL: 37
4 -1 3 -1 2 -1 7 -1 #UTIL: 38
4 -1 3 -1 5 7 -1 #UTIL: 35

For example, the first line represents the pattern of buying items 1 and 4 together, then buying item 3, then buying item 2. This pattern has a total utility of 37, meaning that it generated a 37 $ profit. The other lines follow the same format.

Optional feature: giving names to items

Some users have requested the feature of given names to items instead of using numbers. This feature is offered in the user interface of SPMF and in the command line of SPMF. To use this feature, your file must include @CONVERTED_FROM_TEXT as first line and then several lines to define the names of items in your file. For example, consider the example database "contextPrefixSpan.txt". Here we have modified the file to give names to the items:

@CONVERTED_FROM_TEXT
@ITEM=1=apple
@ITEM=2=orange
@ITEM=3=tomato
@ITEM=4=milk
@ITEM=5=bread
@ITEM=6=noodle
@ITEM=7=rice
@ITEM=-1=|
1 -1 1 2 3 -1 1 3 -1 4 -1 3 6 -1 -2
1 4 -1 3 -1 2 3 -1 1 5 -1 -2
5 6 -1 1 2 -1 4 6 -1 3 -1 2 -1 -2
5 -1 7 -1 1 6 -1 3 -1 2 -1 3 -1 -2

In this file, the first line indicates, that it is a file where names are given to items. Then, the second line indicates that the item 1 is called "apple". The third line indicates that the item 2 is called "orange". The 9th line indicates that the symbol "-1" must be replaced by "|". Then the following lines define four sequences in the SPMF format.

Then, if we apply a sequential pattern mining algorithm using this file using the user interface of SPMF or the command line, the output file contains several patterns having this format (the content may be actually different):

apple | orange | #UTIL: 37

Note that this feature could be also used from the source code of SPMF using the ResultConverter class. However, there is currently no example provided.

Performance

High utility sequential pattern mining is a more difficult problem than sequential pattern mining. Therefore, high-utility sequential pattern mining algorithms are generally slower than sequential pattern mining algorithm. For this reason, it is wise to use the optional maximum pattern length constraint when using USpan, to reduce the number of patterns found, and thus the size of the search space.

It is also worth noting that in the USpan paper they do not compare the performance of their algorithm with previous algorithms for high-utility sequential pattern mining.

Where can I get more information about the USPAN algorithm?

This is the article describing the USPAN algorithm:

Yin, Junfu, Zhigang Zheng, and Longbing Cao. "USpan: an efficient algorithm for mining high utility sequential patterns." Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2012.

Besides, you may read this survey of sequential pattern mining, which gives an overview of sequential pattern mining algorithms.