Mining Frequent Episodes in a Complex Event Sequence using the MINEPI+ Algorithm (SPMF documentation)

This example explains how to run the MINEPI+ algorithm using the SPMF open-source data mining library.

How to run this example?

What is MINEPI+?

MINEPI+ (Kuo-Yu et al., 2008) is an algorithm for discovering frequent episodes in a complex event sequence. In simple words, frequent episode mining means to look for subsequences that appear frequently in a long sequence of events, where some events may be simultaneous.

Various algorithms have been proposed to find frequent episodes in data. MINEPI+ is one of the famous ones. It is to be noted that different algorithms for episode mining do not count the support (occurrence frequency) of an episode in the same way. The MINEPI+ and EMMA algorithm count the support using a concept of head frequency (see the paper for more details).

Also, it is to be noted that the MINEPI+ algorithm can be applied on datasets that have timestamps or that do not have timestamps by setting the third parameter of the algorithm to false or true, respectively.

Frequent episode mining has many possible applications as data from many domains can be encoded as sequences of events.

What is the input?

The MINEPI+ algorithm takes as input an event sequence with a minimum support threshold, a maximum window length and a boolean parameter self_increment, that must be set to true if the input dataset has no timestamps or otherwise false. Let’s consider following sequence consisting of 11 time points (t1, t2, …, t10) and 4 events type (1, 2, 3, 4). This database is provided in the text file "contextEMMA.txt" in the package ca.pfv.spmf.tests of the SPMF distribution.

Time points

Itemset (event set)

1

1

2

1

3

1 2

6

1

7

1 2

8

3

9

2

11

4

Each line of the sequence is called an event set and consists of:

For example, in the above example, at time point 1, the event 1 occurred. Then at time point 2, the event 1 occurred again. Then at time point 3, the events 1 and 2 occurred simultaneously. And so on.

What is the output?

The output of EMMA is the set of frequent episodes having a support no less than a minSup threshold (a positive integer) set by the user. To explain what is a frequent episode, it is necessary to review some definitions

An episode is a sequence of events. It is said to appear in a time interval [ti,tj] (where ti and tj are time points) if all the events from the episode appear in the same order in the input sequence in that time interval. For example, consider the episode <(1),(2)>, which means event 1 followed by event 2. This episode appears in the time interval [t2,t3] because in that interval the event 1 appears (at time point t2) followed by event 2 (at time point t3). Moreover, it is important to note that each time interval is required to have a time duration (tj - ti) that is smaller than the maximum window parameter set by the user.

Given an episode, it is possible to find all the time intervals where it appears in the input sequence. For an interval [ti,tj], ti is said to be the starting point of the time interval. The support of an episode is the number of different starting points of intervals that contains this episode. For example, for a maximum window of 2, the episode <(1),(2)> appears in the ime intervals [t2,3] and [t6,t7]. Because that episode has two different starting points (t2 and t6), its support is 2.
If we change the maximum window parameter to 3, then <(1),(2)> appears in time intervals [t1,t3] [t2,t3], [t6,t7]and [t7,t9]. Because that episode has four different starting points t1,t2,t6,t7), its support is 4.

Another example is the episode <(1,2)> which means that events 1 and 2 appeared at the same time. This episode appears in two time intervals that are [t3,t3] and [t7,t7]. These time intervals have two different starting points (t3 and t7). Thus the support of that episode is 2.

A frequent episode is an episode that has a support no less than minSup.

For example, if set minSup = 2 and maxWindow = 2. We obtain 6 frequent episodes with the following support values:

Frequent episode

Support

<(1)>

5

<(2)>

3

<(1, 2)>

2

<(1), (1)>

3

<(1), (2)>

2

<(1), (1, 2)>

2

For instance, the last episode <(1), (1, 2)> ( indicating that event 1 was followed by event 1 and 2 at the same time) has a support of 2.

Input file format

The input file format of MINEPI+ is defined as follows. It is a text file. Each lines represents an event set. Each line is composed of three sections, as follows.

For example, for the previous example, the input file is defined as follows:

1:2:2
2 4:4:2 2
2 3:6:3 3
1 3:7:4 3
4:2:

Consider the second line. It means that at the second time point, two events 2 and 4 occurred, and that these events respectively have a utility of 2 and 2. The total utility of this event set is thus 2 + 2 = 4. The following lines follow the same format.

Output file format

The output file format is defined as follows. It is a text file. Each line is a frequent episode. Each event in a frequent episode is a positive integer and items from the same event set within an episode are separated by single spaces. The value "-1" indicates the end of an event set. On each line, the episode is first indicated. Then, the keyword "#SUP:" appears followed by an integer indicating the support of the pattern (a positive integer). For example, a few lines from the output file from the previous example are shown below:

1-1  #SUP : 5
2-1  #SUP : 3
1 2-1  #SUP : 2
1 -1 1-1  #SUP : 3
1 -1 2-1  #SUP : 2
1 -1 1 2-1  #SUP : 2

For instance, the last line indicates that event 1 followed by events 1 and 2 simultaneously, has a support of 2. Other lines follow the same format.

Performance

The first algorithm for high utility episode mining is UP-SPAN (Wu et al., 2013). The MINEPI+ algorithm was shown to be faster than UP-SPAN. This is the original implementation of MINEPI+.

Optional parameters

In the release version of SPMF, there is an optional parameter "Use the traditional utility?", which allows to change the definition of utility to use the traditional one instead of the new one presented in the MINEPI+ paper. In that paper, it was argued that the new definition is better. But using the traditional definition of utility can be interesting to compare the performance with other algorithms like UP-SPAN, for example. This optional parameter takes two values : true or false. For using this parameter in the command line, the parameter is simply added at the end of the command such as: java -jar spmf.jar run MINEPI+ contextEMMA.txt output.txt 0.45 2 true, where true indicates to use the traditional definition of utility.

In the source code version of SPMF, there are some additional parameters that can be used to deactivate some optimizations. Normally, it is not necessary to deactivate them, unless for doing some performance experiments.

Implementation details

The version implemented here contains all the optimizations described in the paper proposing MINEPI+. Note that the input format is not exactly the same as described in the original article. But it is equivalent.

Where can I get more information about the MINEPI+ algorithm?

Here is the paper describing the MINEPI+ algorithm:

Fournier-Viger, P., Yang, P., Lin, J. C.-W., Yun, U. (2019). MINEPI+: Fast High Utility Episode Mining. Proc. 14th Intern. Conference on Advanced Data Mining and Applications (ADMA 2019) Springer LNAI, pp. 169-184.