Mining High-Utility Episodes in a Complex Event Sequence using the US-SPAN Algorithm (SPMF documentation)
This example explains how to run the UP-SPAN algorithm using the SPMF open-source data mining library.
How to run this example?
- If you are using the graphical interface, (1) choose the "UP-SPAN" algorithm, (2) select the input file "ExampleTUP.txt", (3) set the output file name (e.g. "output.txt") (4) set the minimum utility to 0.56, (5) set the maximum time duration to 2, (6) set outputSingleEvents = true and (6) click "Run algorithm".
- If you want to execute this example from the command line,
then execute this command:
java -jar spmf.jar run UP-SPAN ExampleTUP.txt output.txt 0.56 2 false in a folder containing spmf.jar and the example input file ExampleTUP.txt.
- If you are using the source code version of SPMF, launch the file "MainTestUP_SPAN.java" in the package ca.pfv.SPMF.tests.
What is UP-SPAN?
UP-SPAN (Wu et al., 2013) is an algorithm for discovering high-utility episodes in a complex event sequence containing utility information (which is equivalent to a transaction database orderd by time).
High utility episode mining can be used to discover subsequences of events (episodes) that have a high importance (e.g. yield a high profit) in a complex event sequence. A complex event sequence is an ordered list of event sets, where an event sets is a set of events occuring simultaneously. In a complex event sequence with utility information, each event is annotated with some utility information (numbers indicating the importance of the event - e.g. profit yield by the event).
Although high utility episode mining algorithms are often presented in the context of analyzing customer transaction data, there exist other applications.
What is the input?
UP-SPAN takes as input a complex event sequence (also called a transaction database) with utility information, a minimum utility threshold min_utility (a percentage), a minimum time duration (an integer), and a optional boolean parameter outputSingleEvents. Let's consider the following sequence consisting of 5 time points (t1,t2...t6) and 7 event types (1, 2, 3, 4, 5, 6, 7). This database is provided in the text file "ExampleTUP.txt" in the package ca.pfv.spmf.tests of the SPMF distribution.
||Events||Total event utility||Utilities of each event|
|t1||5 3||14||10 4|
|t3||5 4||13||10 3|
|t5||4 1||3||1 2|
|t6||7 6||5||2 3|
Each line of the sequence is called an event set and consists of:
- a time point (first column)
- a set of events that are considered simultaneous at this time point (the second column of the table),
- the sum of the utilities (e.g. profit) of the events (the third column of the table),
- the utility (importance) of each event at this time point (the fourth column of the table).
Note that the value in the third column of each line is the sum of the values in the fourth column.
What are real-life examples of such a database? There are several applications in real life. One application is a sequence of customer transactions. Imagine that each time points is a set of items purchased by a customer at a given time. In that case, a time point is a transaction and events are items purchased by thecustomer. Then, the utilities of events are the profit generated by the sale of each item (event) in each transaction (time point). In that example, the first transaction (time point) named "t1" represents that a customer bought items 5 and 3. The amount of money (utility) spent for each item is respectively 10$ and 4 $. The total amount of money spent in this transaction is 10 + 4 = 14 $.
What is the output?
The output of UP-SPAN is the set of high utility episodes having a utility no less than a min_utility threshold (a positive integer) set by the user. To explain what is a high utility episodes, it is necessary to review some definitions.
An episode is a subsequence of the input sequence. An occurrence of an episode in a sequence is a set of no more than maxTimeDuration+1 consecutive events that contains the episode . The utility of an episode is the sum the utilities of its minimal occurrences in the sequence (for a more formal definition, see the paper by Wu et al., 2013).
For example, consider the episode <(3), (4,5), (2)> which represents that event 3 was followed by events 4 and 5 at the same time, and then followed by event 2. If maxTimeDuration = 2, and occurrence of this episode is the time points t2, t3, t4. The utility of this episode is the sum of the utility of <(3), (4,5), (2)> at the time points t2, t3, t4, that is 10 + 10 + 3 + 16 = 39.
The total utility of the input sequence total_utility is the sum of the third column in the previous example, that is: 14 + 10 + 13 + 16 + 3 + 5 = 61.
A high utility episode is an episode such that its utility is no less than min_utility * total_utility. For a given input sequence, the UP-SPAN algorithm outputs all high utility episodes. For example, consider that min_utility = 0.56. If we run UP-SPAN with min_utility = 0.56, maxTimeDuration = 2, we obtain 3 high-utility episodes that have a utility no less than 0.56 * 61 = 34.16.
|high utility episode||utility|
|(3, 5)(3)(4, 5)||37|
If the input sequence is a sequence of transactions made by a customer in a store, we could interpret these results as all the subsequences of purchases made by a customer that generated a profit of 34.16 $ or more.
Note that by the definition of the problem of high utility episode mining, episodes containing only 1 item are not output. To include these episodes in the results, set the parameter outputSingleEpisodes to true.
Input file format
The input file format of UP-SPAN is defined as follows. It is a text file. Each lines represents an event set. Each line is composed of three sections, as follows.
- First, the events (items) contained in the event set are listed. An event is represented by a positive integer. Each item is separated from the next item by a single space. It is assumed that all events within a same event set (line) are sorted according to a total order (e.g. ascending order) and that no event can appear twice within the same event set.
- Second, the symbol ":" appears and is followed by the total utility of the event set (an integer).
- Third, the symbol ":" appears and is followed by the utility of each event in this event set (an integer), separated by single spaces.
For example, for the previous example, the input file is defined as follows:
5 3:14:10 4
5 4:13:10 3
4 1:3:1 2
7 6:5:2 3
Consider the first line. It means that at the first time point, two events named 5 and 3 occurred, and that these events respectively have a utility of 10 and 4. The total utility of this event set is thus 10 + 4 = 14. The following lines follow the same format.
Output file format
The output file format is defined as follows. It is a text file. Each line is a high utility episode. Each event in a high utility episode is a positive integer and items from the same event set within an episode are separated by single spaces. The value "-1" indicates the end of an event set. On each line, the episode is first indicated. Then, the keyword "#UTIL:" appears followed by an integer indicating the utility of the pattern (a positive integer). For example, a few lines from the output file from the previous example are shown below:
3 5 -1 3 -1 4 5 -1 #UTIL: 37
3 -1 4 5 -1 2 -1 #UTIL: 39
3 -1 5 -1 2 -1 #UTIL: 36
The first line indicates that the high utility episode consisting of the event set (3, 5), followed by the itemset (3), followed by (4, 5) has a utility of 37.
High utility episode mining is a more difficult problem than the traditional problem of frequent episode mining, as the former is more general.. UP-SPAN is the first algorithm for high utility episode mining. Recently, some faster algorithms like HUE-SPAN have been proposed (also offered in SPMF).
Important note: If you are interested to compare the episodes found by UP-SPAN with those found by HUE-SPAN (Fournier-Viger et al., 2019) , it should be noted that the definition of window is slightly different for these two algorithms. The difference is that maxWindow must be increased by 1 for HUE-Span to obtain the same result as UP-Span. Thus in the above example, if maxWindow = 2 for HUE-SPAN, then UP-SPAN must be run with maxWindow =1 to obtain the same result. Besides, it is also important to note that HUE-SPAN let the user use a novel definition of utility (as presented in the HUE-SPAN paper). To obtain the same results as UP-SPAN using HUE-SPAN it is necessary to use the old definition of utility. This is achieved by setting UseTraditionalUtility = true in the graphical user interface of SPMF or by setting CheckMaximumUtility = false in the source code of SPMF. Then the two algorithms should return the same result.
Besides other algorithms for variations of the high utility episodes mining are offered in SPMF such as the TUP algorithm for top-k high utility episode mining.
This is the original implementation of UP-SPAN obtained from UP-Miner under the GPL license.
Where can I get more information about the UP-SPAN algorithm?
This is the reference of the article describing the UP-SPAN algorithm:
Wu, Cheng-Wei, et al. "Mining high utility episodes in complex event sequences." Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2013.