Mining Frequent Episodes in a Complex Event Sequence using the MINEPI+ Algorithm (SPMF documentation)
This example explains how to run the MINEPI+ algorithm using the SPMF open-source data mining library.
How to run this example?
- If you are using the graphical interface, (1) choose the "MINEPI+" algorithm, (2) select the input file "contextEMMA.txt", (3) set the output file name (e.g. "output.txt") (4) set the minimum support, maximum window and the boolean parameter indicating that timestamps are not provided respectively to 2, 2, false and (5) click "Run algorithm".
- If you want to execute this example from the command line,
then execute this command:
java -jar spmf.jar run MINEPI+ contextEMMA.txt output.txt 2 2 false in a folder containing spmf.jar and the example input file contextEMMA.txt. - If you are using the source code version of SPMF, launch the file "MainTestEMMA.java" in the package ca.pfv.SPMF.tests.
What is MINEPI+?
MINEPI+ (Kuo-Yu et al., 2008) is an algorithm for discovering frequent episodes in a complex event sequence. In simple words, frequent episode mining means to look for subsequences that appear frequently in a long sequence of events, where some events may be simultaneous.
Various algorithms have been proposed to find frequent episodes in data. MINEPI+ is one of the famous ones. It is to be noted that different algorithms for episode mining do not count the support (occurrence frequency) of an episode in the same way. The MINEPI+ and EMMA algorithm count the support using a concept of head frequency (see the paper for more details).
Also, it is to be noted that the MINEPI+ algorithm can be applied on datasets that have timestamps or that do not have timestamps by setting the third parameter of the algorithm to false or true, respectively.
Frequent episode mining has many possible applications as data from many domains can be encoded as sequences of events.
What is the input?
The MINEPI+ algorithm takes as input an event sequence with a minimum support threshold, a maximum window length and a boolean parameter self_increment, that must be set to true if the input dataset has no timestamps or otherwise false. Let’s consider following sequence consisting of 11 time points (t1, t2, …, t10) and 4 events type (1, 2, 3, 4). This database is provided in the text file "contextEMMA.txt" in the package ca.pfv.spmf.tests of the SPMF distribution.
Time points |
Itemset (event set) |
1 |
1 |
2 |
1 |
3 |
1 2 |
6 |
1 |
7 |
1 2 |
8 |
3 |
9 |
2 |
11 |
4 |
Each line of the sequence is called an event set and consists of:
- a time point (first column)
- an event set (second column)
For example, in the above example, at time point 1, the event 1 occurred. Then at time point 2, the event 1 occurred again. Then at time point 3, the events 1 and 2 occurred simultaneously. And so on.
What is the output?
The output of EMMA is the set of frequent episodes having a support no less than a minSup threshold (a positive integer) set by the user. To explain what is a frequent episode, it is necessary to review some definitions
An episode is a sequence of events. It is said to appear in a time interval [ti,tj] (where ti and tj are time points) if all the events from the episode appear in the same order in the input sequence in that time interval. For example, consider the episode <(1),(2)>, which means event 1 followed by event 2. This episode appears in the time interval [t2,t3] because in that interval the event 1 appears (at time point t2) followed by event 2 (at time point t3). Moreover, it is important to note that each time interval is required to have a time duration (tj - ti) that is smaller than the maximum window parameter set by the user.
Given an episode, it is possible to find all the time intervals where it appears in the input sequence. For an interval [ti,tj], ti is said to be the starting point of the time interval. The support of an episode is the number of different starting points of intervals that contains this episode. For example, for a maximum window of 2, the episode <(1),(2)> appears in the ime intervals [t2,3] and [t6,t7]. Because that episode has two different starting points (t2 and t6), its support is 2.
If we change the maximum window parameter to 3, then <(1),(2)> appears in time intervals [t1,t3] [t2,t3], [t6,t7]and [t7,t9]. Because that episode has four different starting points t1,t2,t6,t7), its support is 4.
Another example is the episode <(1,2)> which means that events 1 and 2 appeared at the same time. This episode appears in two time intervals that are [t3,t3] and [t7,t7]. These time intervals have two different starting points (t3 and t7). Thus the support of that episode is 2.
A frequent episode is an episode that has a support no less than minSup.
For example, if set minSup = 2 and maxWindow = 2. We obtain 6 frequent episodes with the following support values:
Frequent episode |
Support |
<(1)> |
5 |
<(2)> |
3 |
<(1, 2)> |
2 |
<(1), (1)> |
3 |
<(1), (2)> |
2 |
<(1), (1, 2)> |
2 |
For instance, the last episode <(1), (1, 2)> ( indicating that event 1 was followed by event 1 and 2 at the same time) has a support of 2.
Input file format
The input file format of MINEPI+ is defined as follows. It is a text file. Each lines represents an event set. Each line is composed of three sections, as follows.
- First, the events (items) contained in the event set are listed. An event is represented by a positive integer. Each item is separated from the next item by a single space. It is assumed that all events within a same event set (line) are sorted according to a total order (e.g. ascending order) and that no event can appear twice within the same event set.
- Second, the symbol ":" appears and is followed by the total utility of the event set (an integer).
- Third, the symbol ":" appears and is followed by the utility of each event in this event set (an integer), separated by single spaces.
For example, for the previous example, the input file is defined as follows:
1:2:2
2 4:4:2 2
2 3:6:3 3
1 3:7:4 3
4:2:
Consider the second line. It means that at the second time point, two events 2 and 4 occurred, and that these events respectively have a utility of 2 and 2. The total utility of this event set is thus 2 + 2 = 4. The following lines follow the same format.
Output file format
The output file format is defined as follows. It is a text file. Each line is a frequent episode. Each event in a frequent episode is a positive integer and items from the same event set within an episode are separated by single spaces. The value "-1" indicates the end of an event set. On each line, the episode is first indicated. Then, the keyword "#SUP:" appears followed by an integer indicating the support of the pattern (a positive integer). For example, a few lines from the output file from the previous example are shown below:
1-1 #SUP : 5
2-1 #SUP : 3
1 2-1 #SUP : 2
1 -1 1-1 #SUP : 3
1 -1 2-1 #SUP : 2
1 -1 1 2-1 #SUP : 2
For instance, the last line indicates that event 1 followed by events 1 and 2 simultaneously, has a support of 2. Other lines follow the same format.
Performance
The first algorithm for high utility episode mining is UP-SPAN (Wu et al., 2013). The MINEPI+ algorithm was shown to be faster than UP-SPAN. This is the original implementation of MINEPI+.
Optional parameters
In the release version of SPMF, there is an optional parameter "Use the traditional utility?", which allows to change the definition of utility to use the traditional one instead of the new one presented in the MINEPI+ paper. In that paper, it was argued that the new definition is better. But using the traditional definition of utility can be interesting to compare the performance with other algorithms like UP-SPAN, for example. This optional parameter takes two values : true or false. For using this parameter in the command line, the parameter is simply added at the end of the command such as: java -jar spmf.jar run MINEPI+ contextEMMA.txt output.txt 0.45 2 true, where true indicates to use the traditional definition of utility.
In the source code version of SPMF, there are some additional parameters that can be used to deactivate some optimizations. Normally, it is not necessary to deactivate them, unless for doing some performance experiments.
Implementation details
The version implemented here contains all the optimizations described in the paper proposing MINEPI+. Note that the input format is not exactly the same as described in the original article. But it is equivalent.
Where can I get more information about the MINEPI+ algorithm?
Here is the paper describing the MINEPI+ algorithm:
Fournier-Viger, P., Yang, P., Lin, J. C.-W., Yun, U. (2019). MINEPI+: Fast High Utility Episode Mining. Proc. 14th Intern. Conference on Advanced Data Mining and Applications (ADMA 2019) Springer LNAI, pp. 169-184.