Mining Frequent Episodes in a Simple Event Sequence using the MINEPI Algorithm (SPMF documentation)
This example explains how to run the MINEPI algorithm using the SPMF open-source data mining library.
How to run this example?
- If you are using the graphical interface, (1) choose the "MINEPI" algorithm, (2) select the input file "contextMINEPI.txt", (3) set the output file name (e.g. "output.txt") (4) set the minimum support, maximum window and the boolean parameter indicating that timestamps are provided respectively to 2, 2, false and and (5) click "Run algorithm".
- If you want to execute this example from the command line,
then execute this command:
java -jar spmf.jar run MINEPI contextMINEPI.txt output.txt 2 2 false in a folder containing spmf.jar and the example input file contextMINEPI.txt. - If you are using the source code version of SPMF, launch the file "MainTestMINEPI.java" in the package ca.pfv.SPMF.tests.
What is MINEPI?
MINEPI (Mannila et al., 1997) is the first algorithm for mining frequent episodes in a sequence of events. A frequent episode is a subsequence that appear frequently whithin some time window in a long sequence of events.
Various algorithms have been proposed to find frequent episodes in data. It is to be noted that different algorithms do not count the support (occurrence frequency) of an episode in the same way. The MINEPI algorithm count the support using the concept of minimal occurrences.
It should also be noted that this implementation of MINEPI does not allow simultaneous events. Other algorithms for frequent episode mining, offered in SPMF like EMMA and MINEPI+ allow simultaneous events.
What is the input?
MINEPI takes as input an event sequence with a minimum support, a maximum window and a boolean parameter self_increment, that must be set to true if the input dataset has no timestamps or otherwise false. Let’s consider following sequence consisting of 10 time points (t1, t2, …, t10) and 4 events type (1, 2, 3, 4). This database is provided in the text file "contextMINEPI.txt" in the package ca.pfv.spmf.tests of the SPMF distribution.
Time points |
Event |
t1 |
1 |
t2 |
2 |
t3 |
1 |
t4 |
2 |
t5 |
4 |
t6 |
3 |
t7 |
2 |
t8 |
3 |
t9 |
2 |
t10 |
1 |
Each line of the sequence is called an event set and consists of:
- a time point (first column)
- an event (second column)
What are real-life examples of such a database? There are several applications in real life. One application is to analyse text data. For example, the events could represents words in a text. Another applications is to analyse a sequence of events in a computer network.
What is the output?
The output of MINEPI is the set of frequent episodes having a support no less than a minSup threshold (a positive integer) set by the user. To explain what is a frequent episodes, it is necessary to review some definitions.
The support of an episode is the number of its minimal occurrence. A minimal occurrence of an epiosde is a time-interval that there is no other sub-time-interval that contains this episode, and that time interval is no greater than the maximum window parameter. For example, consider the episode <(1), (2)> which means that event 1 was followed by event 2. The minimal occurrences of <(1), (2)> are [t1, t2] and [t3, t4]. It can be oserved that [t1, t3] also contains <(1), (2)>, but [t1, t3] is not a minimal occurrence of <(1), (2)>, because there exist a sub-time-interval [t1, t2] that contains <(1), (2)>. A minimal occurrence should satisfy the maximum window threshold provided by user.
A frequent epiosde is an episode having a support that is no less than minSup. For example, if set minSup = 2 and maxWindow = 2. We obtain 6 frequent episodes.
Frequent episode |
Support |
<(1)> |
3 |
<(2)> |
4 |
<(3)> |
2 |
<(1), (2)> |
2 |
<(2), (1)> |
2 |
<(3), (2)> |
2 |
For instance, the last episode indicating that event 3 was followed by event 2, has a support of 2.
Input file format
The input format is a text file. An item (event) is represented by a positive integer. A line must contains not more than one item (event). This is a restriction for this implementation of MINEPI. An item is followed by the character "|" and then the timestamp of this line. Note that it is possible to run MINEPI on a database that has no timestamps. In that case, the boolean parameter (the third parameter of the algorithm) should be set to true to indicate that the dataset has no timestamps.
In the previous example, the input file is defined as follows:
1|1
2|2
1|3
2|4
4|5
3|6
2|7
3|8
2|9
1|10
Consider the third line. It means that at the third time point, the event 1 occurred. The other lines follow the same format.
Output file format
The output file format is defined as follows. It is a text file, where each line represents a frequent episode. Each event in frequent episodes is a positive number. The value "-1" indicates the end of an event. On each line, the episode is first indicated. Then, the keyword "#SUP:" appears followed by an integer indicating the support of the episode (a positive integer). For example, the output file is shown below for this example:
1 -1 #SUP : 3
2 -1 #SUP : 4
3 -1 #SUP : 2
1 -1 2 -1 #SUP : 2
2 -1 1 -1 #SUP : 2
3 -1 2 -1 #SUP : 2
For instance, the last line indicates that the frequent episode <(3),(2)>, that is event 3 followed by event 2, has a support of 2. The other lines follow the same format.
Performance
MINEPI is the first algorithm for mining frequent episodes. It is not the fastest algorithm as it use a breadth-first search. Moreover, it is here limited to processing sequences without simultaneous events. But MINEPI is interesting because it uses a concept of minimal occurrences. Other algorithms like EMMA and MINEPI+ allows to process simultaneous events, and use a different ways of counting the support, which was argued to be more useful by the authors of EMMA. These algorithms are also offered in SPMF
Where can I get more information about the MINEPI algorithm?
The MINEPI algorithm is described in this article:
Mannila H, Toivonen H, Verkamo AI (1997) Discovery of frequent episodes in event sequences. Data Mining Knowl Discov 1(3):259–289