Mining Episode Rules in a Complex Sequence using the TKE, MINEPI+ or EMMA algorithm (SPMF documentation)

This example explains how to mine Episode rules using the TKE, MINEPI or EMMA algorithm using the SPMF open-source data mining library.

How to run this example?

What are these algorithms?

There exists several algorithms such as TKE, EMMA and MINEPI+ to find frequent episodes in a complex event sequence. That means to finds subsequences that appear many times in a long sequence of events, where some events may be simultaneous. But although knowing that some subsequence of events appear many times in a sequence is useful, it is not so suitable for making predictions. For this reason the concept of episode rules was proposed where the goal is to extract rules of the form X --> Y indicating that if some sequence of events appears it will be followed by another sequence of events with some given confidience (e.g. conditional probability P(Y|X)).

The standard episode rules are generated as a post-processing step after applying an algorithm such as TKE, EMMA and MINEPI+. That post-processing step was described by Manila (1995) and can be applied to most episode mining algorithms. It is to be noted that because TKE, EMMA and MINEPI+ use different search procedure and definitions, they do not always generate the same episode rules.

There is another example in SPMF for describing the TKE, EMMA and MINEPI+ algorithms for mining frequent episodes. This example will only describe how to generate the episode rules using these algorithms.

Also, note that there are also some other algorithms in SPMF for extracting other types of episode rules such as POERM. It was argued that the episode rules extracted by POERM are more general than those described in this example and may thus be more useful. For more details, you may see the example describing POERM.

What is the input?

The TKE, EMMA and MINEPI+ algorithms takes as input an event sequence Let’s consider following sequence consisting of 11 time points (t1, t2, …, t10) and 4 events type (1, 2, 3, 4). This database is provided in the text file "contextEMMA.txt" in the package ca.pfv.spmf.tests of the SPMF distribution.

Time points

Itemset (event set)

1

1

2

1

3

1 2

6

1

7

1 2

8

3

9

2

11

4

Each line of the sequence is called an event set and consists of:

For example, in the above example, at time point 1, the event 1 occurred. Then at time point 2, the event 1 occurred again. Then at time point 3, the events 1 and 2 occurred simultaneously. And so on.

To run the TKE, EMMA and MINEPI+ algorithms to generate episode rules, it is also required to provide some parameters.

What is the output?

The output of these algorithms is a set of episode rules that have a high support (occurrence frequency) and a high confidence. More precisely, the algorithms will output all the episode rules that have a support greater or equal to the minimum support threshold set by the user, and a confidence that is greater or equal to the minimum confidence set by the user.

For a rule X --> Y, the support means the number of times that X is followed by Y in the input sequence (whithin a time interval no greater than the maximum time duration).

For a rule X --> Y, the confidence means the number of times that X is followed by Y in the input sequence divided by the number of times that X appears (whithin a time interval no greater than the maximum time duration).

For a more formal definition, you may also see the research papers describing these algorithms.

For example, if we use the TKE algorithm to generate episode rules, set k = 6, max time duration= 2, minimum confidence = 0.2, max consequent count = 1 and minimum support = 2, then we find the following episode rules:

Frequent episode

Support

Confidence

{1} ==> {2}

2 0.4

{1} ==> {1,2}

2 0.4

{1} ==> {1}

3 0.6

For instance, the last episode {1} ==> {1,2} ( indicating that event 1 was followed by event 1 and 2 at the same time), that this rule appeared two times (support = 2) and that the confidence of this rule is 0.4. This means that 40 % of the time when {1} appears, it is followed by {1,2} whithin the maximum time duration of 2..

Input file format

The input file format of that algorithm is a text file. An item (event) is represented by a positive integer. A transaction (event set) is a line in the text file. In each line (event set), items are separated by a single space. It is assumed that all items (events) within a same transaction line) are sorted according to a total order (e.g. ascending order) and that no item can appear twice within the same event set. Each line is optionally followed by the character "|" and then the timestamp of the event set (line). Note that it is possible to run TKE on a database that has no timestamps. In that case, the boolean parameter (the third parameter) should be set to true to indicate that the dataset has no timestamps. If there is not timestamps and the parameter is set to true, the algorithm will assume that each line has a timestamp that is increasing by 1.
In the previous example, the input file is defined as follows:

1|1
1|2
1 2|3
1|6
1 2|7
3|8
2|9
4|1

Output file format

The output file format is defined as follows. It is a text file. Each line is an episode rules.

Each line contains the antecedent of a rule, which is an episode, followed by " ==> ", followed by the consequent of the rule, which is another episode. Then, the keyword "#SUP:" appears followed by an integer indicating the support of the pattern (a positive integer). For example, a few lines from the output file from the previous example are shown below:

{1} ==> {2} #SUP: 2 #CONF: 0.4
{1} ==> {1,2} #SUP: 2 #CONF: 0.4
{1} ==> {1} #SUP: 3 #CONF: 0.6

For instance, the second line indicates that event 1 is followed by events 1 and 2 simultaneously, has a support of 2 and a confidence of 40%. Other lines follow the same format.

Note that by changing the parameters, it is possible to find more complex rules such as:

{1}{2} ==> {1,2} #SUP: 2 #CONF: 0.4

This rules indicates that if 1 is followed by 2, then it is followed by 1 and 2 at the same time, and this rule has a support of 2 and a confidence of 40%.

Performance and Implementation details

This example has described how to generate episode rules using the output of the TKE, EMMA and MINEPI+ algorithm. This is done based on the standard algorithm for generating episode rules proposed by Toivonen and Manila (1995). There are also other ways of generating episode rules. For example, the POERM algorithm also offered in SPMF can generate another type of episode rules called Partially-Ordered Episode rules, which are more general than the rules described in this example.

Where can I get more information about these algorithms?

The TKE algorithm is described in this article:

Fournier-Viger, P., Wang, Y., Yang, P., Lin, J. C.-W., Yun, U. (2020). TKE: Mining Top-K Frequent Episodes. Proc. 33rd Intern. Conf. on Industrial, Engineering and Other Applications of Applied Intelligent Systems (IEA AIE 2020), Springer LNAI, 12 pages

Here is the paper describing the MINEPI+ algorithm:

Fournier-Viger, P., Yang, P., Lin, J. C.-W., Yun, U. (2019). MINEPI+: Fast High Utility Episode Mining. Proc. 14th Intern. Conference on Advanced Data Mining and Applications (ADMA 2019) Springer LNAI, pp. 169-184.

The EMMA algorithm is described in this article:

Kuo-Yu Huang,Chia-Hui Chang (2008). Efficient mining of frequent episodes from complex sequences.Inf. Syst.33(1):96-114    

The first paper discussing about episode rules is :

Mannila, H., Toivonen, H., Verkamo, A.I.: Discovering frequent episodes in sequences. In: Proc. 1st Int. Conf. on Knowledge Discovery and Data Mining (1995)