Mining High-Utility Episodes in a Complex Event Sequence using the HUE-SPAN Algorithm (SPMF documentation)

This example explains how to run the HUE-SPAN algorithm using the SPMF open-source data mining library.

How to run this example?

What is HUE-SPAN?

HUE-SPAN (Fournier-Viger et al., 2019) is an algorithm for discovering high-utility episodes in a complex event sequence containing utility information. It was shown to outperform the previous state-of-the-art algorithm, named US-SPAN (Wu et al., 2003). HUE-SPAN also introduces a new way of counting the utility to avoid some problem of US-SPAN such that the utility of some episodes may be underestimated (see the paper for details) but it can also use the traditional definition of utility.

High utility episode mining can be used to discover subsequences of events (episodes) that have a high importance (e.g. yield a high profit) in a complex event sequence. A complex event sequence is an ordered list of event sets, where an event sets is a set of events occuring simultaneously. In a complex event sequence with utility information, each event is annotated with some utility information (numbers indicating the importance of the event - e.g. profit yield by the event).

Although high utility episode mining algorithms are often presented in the context of analyzing customer transaction data, there exist other applications.

What is the input?

HUE-SPAN takes as input a complex event sequence (also called a transaction database) with utility information, a minimum utility threshold min_utility (a percentage), a minimum time duration (an integer), and a optional boolean parameter outputSingleEvents. Let's consider the following database consisting of 5 time points (t1,t2...t5) and 7 event types (1, 2, 3, 4, 5, 6, 7). This database is provided in the text file "contextHUE_Span.txt" in the package ca.pfv.spmf.tests of the SPMF distribution.

Time points

Events

Total event utility

Utilities of each event (item)

t1

1

2

2

t2

2 4

4

2 2

t3

2 3

6

3 3

t4

1 3

7

4 3

t5

4

2

2

Each line of the sequence is called an event set and consists of:

Note that the value in the third column of each line is the sum of the values in the fourth column.

What are real-life examples of such a database? There are several applications in real life. One application is a customer transaction database. Imagine that each transaction represents the items purchased by a customer. The fourth customer named "t4" bought items 1 and 3. The amount of money spent for each item is respectively 4 $ and 3 $. The total amount of money spent in this transaction is 4 + 3 = 7 $.

What is the output?

The output of UP-SPAN is the set of high utility episodes. To explain what is a high utility episode, it is necessary to review some definitions.

An episode is a subsequence of the input sequence. An occurrence of an episode in a sequence is a set of no more than maxTimeDuration+1 consecutive events that contains the episode . The utility of an episode is the sum the utilities of its minimal occurrences in the sequence (see the paper for a more formal definition).

For example, consider the episode <(2), (3)> which represents that event 2 was followed by events 3. If maxDur = 2, and minimal occurrences of this episode are [t2, t3] and [t3, t4]. The utility of this episode is the sum of the utility of <(2), (3)> for the minimal occurrences [t2, t3] and [t3, t4]., that is (2+3)+(3+3) = 11.

The total utility of the input sequence, denoted as total_utility, is the sum of the third column in the previous example, that is: 2+4+6+7+2 = 21

A high utility episode is an episode such that its utility is no less than min_utility * total_utility. For a given input sequence, the HUE-Span algorithm outputs all high utility episodes. For example, consider that min_utility = 0.45. If we run HUE-Span with minUtil = 0.45, maxDur = 2, we obtain 7 high-utility episodes that have a utility no less than 0.45 * 21 = 9.45

High utility episode

Utility

<(1), (4)>

10

<(2, 3), (1)>

10

<(2, 3), (1, 3)>

13

<(2), (1, 3)>

10

<(2), (3)>

11

<(3), (1, 3)>

10

<(2, 4), (2, 3)>

10

If the input sequence is a sequence of transactions made by a customer in a store, we could interpret these results as all the subsequences of purchases made by a customer that generated a profit of 9.45 $ or more.

Input file format

The input file format of HUE-SPAN is defined as follows. It is a text file. Each lines represents an event set. Each line is composed of three sections, as follows.

For example, for the previous example, the input file is defined as follows:

1:2:2
2 4:4:2 2
2 3:6:3 3
1 3:7:4 3
4:2:

Consider the second line. It means that at the second time point, two events 2 and 4 occurred, and that these events respectively have a utility of 2 and 2. The total utility of this event set is thus 2 + 2 = 4. The following lines follow the same format.

Output file format

The output file format is defined as follows. It is a text file. Each line is a high utility episode. Each event in a high utility episode is a positive integer and items from the same event set within an episode are separated by single spaces. The value "-1" indicates the end of an event set. On each line, the episode is first indicated. Then, the keyword "#Utility:" appears followed by an integer indicating the utility of the pattern (a positive integer). For example, a few lines from the output file from the previous example are shown below:

1 -1 4 -1 #UTIL: 10
2 3 -1 1 -1 #UTIL: 10
2 3 -1 1 3 -1 #UTIL: 13
2 -1 1 3 -1 #UTIL: 10
2 -1 3 -1 #UTIL: 11
3 -1 1 3 -1 #UTIL: 10
4 2 -1 2 3 -1 #UTIL: 10

For instance, the first line indicates that the high utility episode consisting of the itemset (1), followed by the itemset (4) has a utility of 10.

Comparison of HUE-SPAN with UP-SPAN

Important note: If you are interested to compare the episodes found by UP-SPAN (Wu et al., 2013) with those found by HUE-SPAN, it should be noted that the definition of window is slightly different for these two algorithms. The difference is that maxWindow must be increased by 1 for HUE-Span to obtain the same result as UP-Span. Thus in the above example, if maxWindow = 2 for HUE-SPAN, then UP-SPAN must be run with maxWindow =1 to obtain the same result. Besides, it is also important to note that HUE-SPAN let the user use a novel definition of utility (as presented in the HUE-SPAN paper). To obtain the same results as UP-SPAN using HUE-SPAN it is necessary to use the old definition of utility. This is achieved by setting UseTraditionalUtility = true in the graphical user interface of SPMF or by setting CheckMaximumUtility = false in the source code of SPMF. Then the two algorithms should return the same result. But as of 2020-03-04, a bug has been found in UP-Span such that it can miss some patterns (thanks to Acquah Hackman for reporting the bug). For example, in the above database, minutil = 0.45 and maxWindow = 2, UP-Span misses the episode 2 -1 1 3 -1 #UTIL: 10 that exists in the database, while HUE-SPAN with maxWindow = 3 can find it. We will try to fix that bug in UP-SPAN soon!

Performance

The first algorithm for high utility episode mining is UP-SPAN (Wu et al., 2013). The HUE-SPAN algorithm was shown to be faster than UP-SPAN. This is the original implementation of HUE-SPAN.

Optional parameters

In the release version of SPMF, there is an optional parameter "Use the traditional utility?", which allows to change the definition of utility to use the traditional one instead of the new one presented in the HUE-SPAN paper. In that paper, it was argued that the new definition is better. But using the traditional definition of utility can be interesting to compare the performance with other algorithms like UP-SPAN, for example. This optional parameter takes two values : true or false. For using this parameter in the command line, the parameter is simply added at the end of the command such as: java -jar spmf.jar run HUE-SPAN contextHUE_Span.txt output.txt 0.45 2 true, where true indicates to use the traditional definition of utility.

In the source code version of SPMF, there are some additional parameters that can be used to deactivate some optimizations. Normally, it is not necessary to deactivate them, unless for doing some performance experiments.

Implementation details

The version implemented here contains all the optimizations described in the paper proposing HUE-SPAN. Note that the input format is not exactly the same as described in the original article. But it is equivalent.

Where can I get more information about the HUE-SPAN algorithm?

Here is the paper describing the HUE-SPAN algorithm:

Fournier-Viger, P., Yang, P., Lin, J. C.-W., Yun, U. (2019). HUE-SPAN: Fast High Utility Episode Mining. Proc. 14th Intern. Conference on Advanced Data Mining and Applications (ADMA 2019) Springer LNAI, pp. 169-184.

Also, here is a powerpoint presentation of HUE-SPAN.

Besides, for a general overview of high utility itemset mining, you may read this survey paper.