This example explains how to run the TUP algorithm using the SPMF open-source data mining library.
How to run this example?
What is TUP?
TUP (Rathore et al., 2016) is an algorithm for discovering the top-k high-utility episodes in a complex event sequence containing utility information (which is equivalent to a transaction database orderd by time).
High utility episode mining can be used to discover subsequences of events (episodes) that have a high importance (e.g. yield a high profit) in a complex event sequence. A complex event sequence is an ordered list of event sets, where an event sets is a set of events occuring simultaneously. In a complex event sequence with utility information, each event is annotated with some utility information (numbers indicating the importance of the event - e.g. profit yield by the event).
Although high utility episode mining algorithms are often presented in the context of analyzing customer transaction data, there exist other applications.
Two versions of the TUP algorithms are offered in SPMF called TUP_Combined and TUP_Preinsertion, respectively. They are two versions using different optimizations, which produce the same result but may be more or less efficient on different sequences.
Note that this is the original implementation of TUP.
What is the input?
TUP takes as input a complex event sequence (also called a transaction database) with utility information, a parameter k representing the number of patterns to be output and a minimum time duration (an integer). Let's consider the following sequence consisting of 5 time points (t1,t2...t6) and 7 event types (1, 2, 3, 4, 5, 6, 7). This database is provided in the text file "ExampleTUP.txt" in the package ca.pfv.spmf.tests of the SPMF distribution.
Time points Events Total event utility Utilities of each event t1 5 3 14 10 4 t2 3 10 10 t3 5 4 13 10 3 t4 2 4 4 t5 4 1 3 1 2 t6 7 6 5 2 3
Each line of the sequence is called an event set and consists of:
- a time point (first column)
- a set of events that are considered simultaneous at this time point (the second column of the table),
- the sum of the utilities (e.g. profit) of the events (the third column of the table),
- the utility (importance) of each event at this time point (the fourth column of the table).
Note that the value in the third column of each line is the sum of the values in the fourth column.
What are real-life examples of such a database? There are several applications in real life. One application is a sequence of customer transactions. Imagine that each time points is a set of items purchased by a customer at a given time. In that case, a time point is a transaction and events are items purchased by thecustomer. Then, the utilities of events are the profit generated by the sale of each item (event) in each transaction (time point). In that example, the first transaction (time point) named "t1" represents that a customer bought items 5 and 3. The amount of money (utility) spent for each item is respectively 10$ and 4 $. The total amount of money spent in this transaction is 10 + 4 = 14 $.
What is the output?
The output of TUP is the k episodes that have the highest utility in the input sequence (called the top k high utility episodes). To explain what is a top-k high utility episode, it is necessary to review some definitions.
An episode is a subsequence of the input sequence. An occurrence of an episode in a sequence is a set of no more than maxTimeDuration+1 consecutive events that contains the episode . The utility of an episode is the sum the utilities of its minimal occurrences in the sequence (for a more formal definition, see the paper by Wu et al., 2013).
For example, consider the episode <(3), (4,5), (2)> which represents that event 3 was followed by events 4 and 5 at the same time, and then followed by event 2. If maxTimeDuration = 2, and occurrence of this episode is the time points t2, t3, t4. The utility of this episode is the sum of the utility of <(3), (4,5), (2)> at the time points t2, t3, t4, that is 10 + 10 + 3 + 16 = 39.
For a given input sequence, the TUP algorithm outputs the k episodes having the highest utilities. For example, consider that k = 3. If we run TUP with maxTimeDuration = 2 and k = 2, we obtain the following 3 episodes.
|top-k high utility episodes||utility|
|(3, 5)(3)(4, 5)||37|
If the input sequence is a sequence of transactions made by a customer in a store, we could interpret these results as all the three subsequences of purchases made by a customer that generated the highest profit whithin 3 consecutive events..
Note that by the definition of the problem of high utility episode mining, episodes containing only 1 item are not output.
Input file format
The input file format of TUP is defined as follows. It is a text file. Each lines represents an event set. Each line is composed of three sections, as follows.
- First, the events (items) contained in the event set are listed. An event is represented by a positive integer. Each item is separated from the next item by a single space. It is assumed that all events within a same event set (line) are sorted according to a total order (e.g. ascending order) and that no event can appear twice within the same event set.
- Second, the symbol ":" appears and is followed by the total utility of the event set (a double value).
- Third, the symbol ":" appears and is followed by the utility of each event in this event set (an integer), separated by single spaces.
For example, for the previous example, the input file is defined as follows:
5 3:14:10 4
5 4:13:10 3
4 1:3:1 2
7 6:5:2 3
Consider the first line. It means that at the first time point, two events named 5 and 3 occurred, and that these events respectively have a utility of 10 and 4. The total utility of this event set is thus 10 + 4 = 14. The following lines follow the same format.
Output file format
The output file format is defined as follows. It is a text file. Each line is a high utility episode. Each event in a high utility episode is a positive integer and items from the same event set within an episode are separated by single spaces. The value "-1" indicates the end of an event set. On each line, the episode is first indicated. Then, the keyword "#UTIL:" appears followed by a double value indicating the utility of the pattern (a positive number). For example, a few lines from the output file from the previous example are shown below:
3 -1 5 -1 2 -1 #UTIL: 36.0
3 -1 5 4 -1 2 -1 #UTIL: 39.0
5 3 -1 3 -1 5 4 -1 #UTIL: 37.0
The first line indicates that the high utility episode consisting of the event set (3, 5), followed by the itemset (3), followed by (4, 5) has a utility of 37.
High utility episode mining is a more difficult problem than the traditional problem of frequent episode mining, as the former is more general.. TUP (2016) is the first algorithm for mining the top-k high utility episode. Other algorithms for high utility episode mining are offered in SPMF such as UP-SPAN (2013).
In the source code, there is also two examples showing how to use TUP to keep the results in memory instead of saving them to a file. This is shown in the following two examples of the source code:
This is the original implementation of TUP.
Where can I get more information about the TUP algorithm?
This is the reference of the article describing the TUP algorithm:
<< Return to table of contents of SPMF documentation
Copyright © 2008-2020 Philippe Fournier-Viger. All rights reserved.