This example explains how to run the UP-SPAN algorithm using the SPMF open-source data mining library.
How to run this example?
What is UP-SPAN?
UP-SPAN (Wu et al., 2013) is an algorithm for discovering high-utility episodes in a complex event sequence containing utility information (which is equivalent to a transaction database orderd by time).
High utility episode mining can be used to discover subsequences of events (episodes) that have a high importance (e.g. yield a high profit) in a complex event sequence. A complex event sequence is an ordered list of event sets, where an event sets is a set of events occuring simultaneously. In a complex event sequence with utility information, each event is annotated with some utility information (numbers indicating the importance of the event - e.g. profit yield by the event).
Although high utility episode mining algorithms are often presented in the context of analyzing customer transaction data, there exist other applications.
What is the input?
UP-SPAN takes as input a complex event sequence (also called a transaction database) with utility information, a minimum utility threshold min_utility (a percentage), a minimum time duration (an integer), and a optional boolean parameter outputSingleEvents. Let's consider the following sequence consisting of 5 time points (t1,t2...t6) and 7 event types (1, 2, 3, 4, 5, 6, 7). This database is provided in the text file "ExampleTUP.txt" in the package ca.pfv.spmf.tests of the SPMF distribution.
Time points Events Total event utility Utilities of each event t1 5 3 14 10 4 t2 3 10 10 t3 5 4 13 10 3 t4 2 4 4 t5 4 1 3 1 2 t6 7 6 5 2 3
Each line of the sequence is called an event set and consists of:
- a time point (first column)
- a set of events that are considered simultaneous at this time point (the second column of the table),
- the sum of the utilities (e.g. profit) of the events (the third column of the table),
- the utility (importance) of each event at this time point (the fourth column of the table).
Note that the value in the third column of each line is the sum of the values in the fourth column.
What are real-life examples of such a database? There are several applications in real life. One application is a sequence of customer transactions. Imagine that each time points is a set of items purchased by a customer at a given time. In that case, a time point is a transaction and events are items purchased by thecustomer. Then, the utilities of events are the profit generated by the sale of each item (event) in each transaction (time point). In that example, the first transaction (time point) named "t1" represents that a customer bought items 5 and 3. The amount of money (utility) spent for each item is respectively 10$ and 4 $. The total amount of money spent in this transaction is 10 + 4 = 14 $.
What is the output?
The output of UP-SPAN is the set of high utility episodes having a utility no less than a min_utility threshold (a positive integer) set by the user. To explain what is a high utility episodes, it is necessary to review some definitions.
An episode is a subsequence of the input sequence. An occurrence of an episode in a sequence is a set of no more than maxTimeDuration+1 consecutive events that contains the episode . The utility of an episode is the sum the utilities of its minimal occurrences in the sequence (for a more formal definition, see the paper by Wu et al., 2013).
For example, consider the episode <(3), (4,5), (2)> which represents that event 3 was followed by events 4 and 5 at the same time, and then followed by event 2. If maxTimeDuration = 2, and occurrence of this episode is the time points t2, t3, t4. The utility of this episode is the sum of the utility of <(3), (4,5), (2)> at the time points t2, t3, t4, that is 10 + 10 + 3 + 16 = 39.
The total utility of the input sequence total_utility is the sum of the third column in the previous example, that is: 14 + 10 + 13 + 16 + 3 + 5 = 61.
A high utility episode is an episode such that its utility is no less than min_utility * total_utility. For a given input sequence, the UP-SPAN algorithm outputs all high utility episodes. For example, consider that min_utility = 0.56. If we run UP-SPAN with min_utility = 0.56, maxTimeDuration = 2, we obtain 3 high-utility episodes that have a utility no less than 0.56 * 61 = 34.16.
|high utility episode||utility|
|(3, 5)(3)(4, 5)||37|
If the input sequence is a sequence of transactions made by a customer in a store, we could interpret these results as all the subsequences of purchases made by a customer that generated a profit of 34.16 $ or more.
Note that by the definition of the problem of high utility episode mining, episodes containing only 1 item are not output. To include these episodes in the results, set the parameter outputSingleEpisodes to true.
Input file format
The input file format of UP-SPAN is defined as follows. It is a text file. Each lines represents an event set. Each line is composed of three sections, as follows.
- First, the events (items) contained in the event set are listed. An event is represented by a positive integer. Each item is separated from the next item by a single space. It is assumed that all events within a same event set (line) are sorted according to a total order (e.g. ascending order) and that no event can appear twice within the same event set.
- Second, the symbol ":" appears and is followed by the total utility of the event set (an integer).
- Third, the symbol ":" appears and is followed by the utility of each event in this event set (an integer), separated by single spaces.
For example, for the previous example, the input file is defined as follows:
5 3:14:10 4
5 4:13:10 3
4 1:3:1 2
7 6:5:2 3
Consider the first line. It means that at the first time point, two events named 5 and 3 occurred, and that these events respectively have a utility of 10 and 4. The total utility of this event set is thus 10 + 4 = 14. The following lines follow the same format.
Output file format
The output file format is defined as follows. It is a text file. Each line is a high utility episode. Each event in a high utility episode is a positive integer and items from the same event set within an episode are separated by single spaces. The value "-1" indicates the end of an event set. On each line, the episode is first indicated. Then, the keyword "#UTIL:" appears followed by an integer indicating the utility of the pattern (a positive integer). For example, a few lines from the output file from the previous example are shown below:
3 5 -1 3 -1 4 5 -1 #UTIL: 37
3 -1 4 5 -1 2 -1 #UTIL: 39
3 -1 5 -1 2 -1 #UTIL: 36
The first line indicates that the high utility episode consisting of the event set (3, 5), followed by the itemset (3), followed by (4, 5) has a utility of 37.
High utility episode mining is a more difficult problem than the traditional problem of frequent episode mining, as the former is more general.. UP-SPAN is the first algorithm for high utility episode mining. Other algorithms are offered in SPMF such as the TUP algorithm.
This is the original implementation of UP-SPAN obtained from UP-Miner under the GPL license.
Where can I get more information about the UP-SPAN algorithm?
This is the reference of the article describing the UP-SPAN algorithm:
Wu, Cheng-Wei, et al. "Mining high utility episodes in complex event sequences." Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2013.
<< Return to table of contents of SPMF documentation
Copyright © 2008-2020 Philippe Fournier-Viger. All rights reserved.