Mining Frequent Episodes and Rules in a Complex Sequence using the EMDO or EMDO-Rules algorithms (SPMF documentation)

This example explains how to run the EMDO and EMDO-Rules algorithms using the SPMF open-source data mining library.

How to run this example?

If you want to run EMDO:

If you are using the graphical interface, (1) choose the "EMDO" algorithm, (2) select the input file "db_emdo.txt", (3) set the output file name (e.g. "output.txt") (4) set the minimum support to 2, and then (5) click "Run algorithm".
If you want to execute this example from the command line, then execute this command:
java -jar spmf.jar run EMDO db_emdo.txt output.txt 2
in a folder containing spmf.jar and the example input file db_emdo.txt.
If you are using the source code version of SPMF, launch the file "MainTestEMDO.java"in the package ca.pfv.SPMF.tests.

If you want to run EMDO-Rules:

If you are using the graphical interface, (1) choose the "EMDO-Rules" algorithm, (2) select the input file "db_emdo.txt", (3) set the output file name (e.g. "output_rules.txt") (4) set the minimum support to 2 and the minimum confidence to 0.2, and then (5) click "Run algorithm".
If you want to execute this example from the command line, then execute this command:
java -jar spmf.jar run EMDO-Rules db_emdo.txt output_rules.txt 2 0.2
in a folder containing spmf.jar and the example input file db_emdo.txt.
If you are using the source code version of SPMF, launch the file "MainTestEMDO.java"in the package ca.pfv.SPMF.tests.

What is EMDO?

EMDO (Ouarem et al, 2024) is an algorithm for discovering the frequent parallel episodes in a complex event sequence. In simple words, the algorithm is used to analyze a sequence of symbols or events that have timestamps. The output is sets of events that frequently appear without constraints on their order..

EMDO-Rules is an algorithm based on EMDO that applies an additional processing step to further derive implication rules from the frequent parallel episodes. The output is some rules of the form X-->Y indicating that if some event(s) X appeared in any order in a sequence at some given time, they are likely to be followed by some other event(s) appearing in any order at a subsequent time. Finding such rule is useful to discover some sequential or temporal relationships in a sequence. For example, it can be used to analyze the purchase behavior of a customer (a sequence of purchases), or it could be used to find relationships between some words in a text (a sequence of words).

Various algorithms have been proposed to find episode rules. Diifferent algorithms use different ways of counting how many times an episode or a rule appears in a sequence. The EMDO algorithm is special in that it uses the frequency based on distinct occurrences. Thus, when counting how many times an episode appears, the algorithm only count the maximal set of occurrences that are distinct. This can be interesting for some applications.

This can be viewed as similar to some other episode rule mining algorithms like POERM, which also find rules between unordered sets of events. However, the method for counting the support is different between EMDO and POERM. Depending on an application's needs, one algorithm or the other may be more suitable.

What is the input?

The algorithm takes as input an event sequence with a minimum support threshold, amd a mininum confidence threshold. Let’s consider following sequence consisting of 7 time points (t1, t2, …, t7) and five events type (1, 2, 3, 4, 5). This database is provided in the text file "db_emdo.txt" in the package ca.pfv.spmf.tests of the SPMF distribution.

Timestamp	Itemset (event set)
1	1 2
2	3 4
3	1 4 5
4	1 2 3
5	1
6	3 5
7	2 4

Each line of the sequence is called an event set and consists of:

a time point (first column)
an event set (second column)

For example, in the above example, at time point 1, the events 1 and 2 occurred. Then at time point 2, the event 3 and 4 occurred. Then at time point 3, the events 1, 4, and 5 occurred simultaneously. And so on.

What is the output?

To explain what is the output of EMDO, it is necessary to explain some definitions.

A parallel episode X = {e1, e2, ... en} is a set of events that appear in an input sequence. The events are unordered, meaning that they are not required to appear at the same time or in a specific order.

An occurrence of a parallel episode is a set of timestamps where all the events from the episode can be found. For instance, an occurrence of the episode {1,2,3} in the example sequence is the timestamps {timestamp:1, timestamp:2} because the items 1,2 and 3 can be all found in these two timestamps when taken as a whole.

An episode can have multiple occurrences in a sequence. For example, the episode {1,2,3} has an occurrence {timestamp:1, timestamp:2} but also other occurrences such as {timestamp:4} and {timestamp:1, timestamp:4} to name just a few.

Two occurrences of an episode are distinct if they dont share any timestamps. For example, the occurrence {timestamp:1, timestamp:2} is distinct from {timestamp:4}. But the occurrence {timestamp:1, timestamp:2} is not distinct from {timestamp:1, timestamp:4}

The support of an episode is the maximum number of distinct occurrences that can be found in the sequence. For example, the episode {1,2,3} has a support of 3 because at most three distinct occurrences can be found in the sequence such as: {timestamp:1, timestamp:2}, {timestamp:4}, {timestamp:5, timestamp:6,timestamp:7}

The output of the EMDO algorithm is all parallel episodes that have a support no less than a minimum support threshold (minsup) , set by the user.

For instance, for minsup = 2, the output is 31 frequent parallel episodes such as:

{2, 3, 4} support = 3
{2, 4, 5} support = 2
{2, 3, 4, 5} support = 2
{2, 3, 5} support = 2
{1, 2, 3, 5} support = 2
{1, 3, 5} support = 2
....

To explain what is the output of EMDO-Rules, it is necessary to explain some additional definitions.

An episode rule is an implication of the form X ==> Y between two frequent parallel episodes X and Y. The meaning of a rule is that if the events from X appear in a sequence, it will trigger the events from Y. To capture this idea that X triggers Y, it is required that the occurrences of Y be after those of X.

An occurrence of an episode rule is a list of timestamps where all the events from X can be found, and those of Y after those of X.

The support of an episode rule X ==> Y is the maximum number of occurrences or the rule in the sequence.

The confidence of an episode rule X ==> Y is the support of the rule, divided by the support of X.

The algorithm output all episode rules that have a confidence no less than a user-defined minimum confidence threshold (minconf).
For example, here are some of the rules discovered in the example sequence for minconf = 0.2 (which means 20%):

{1} ==> {5} SUPPORT: 2 CONFIDENCE: 0.5

{1} ==> {3} SUPPORT: 3 CONFIDENCE: 0.75

{1} ==> {4} SUPPORT: 2 CONFIDENCEF: 0.5

{1} ==> {1} SUPPORT: 3 CONFIDENCE: 0.75

{1} ==> {2} SUPPORT: 2 CONFIDENCE: 0.5

{1} ==> {5,3} SUPPORT: 2 CONFIDENCE: 0.5

{1} ==> {5,3,4} SUPPORT: 2 CONFIDENCE: 0.5

Input file format

The input file format of that algorithm is a text file. An item (event) is represented by a positive integer. A transaction (event set) is a line in the text file. In each line (event set), items are separated by a single space. It is assumed that all items (events) within a same transaction line) are sorted according to a total order (e.g. ascending order) and that no item can appear twice within the same event set. Each line is optionally followed by the character "|" and then the timestamp of the event set (line).
In the previous example, the input file is defined as follows:

1 2|1
3 4|2
1 4 5|3
1 2 3|4
1|5
3 5|6
2 4|7

Output file format

The output file format is defined as follows. It is a text file. Each line is an episode rule. Each event in an episode rule is a positive integer On each line, the items from the rule antecedent are first listed, separated by single spaces. Then the keyword "==>" appears, followed by the items from the rule consequent, separated by single spaces. Then, the keyword "#SUP:" appears followed by an integer indicating the support of the rule as a number of occurrences. Then, the keyword "#CONF:" appears followed by a double values in the [0, 1] interval indicating the confidence of the rule (e.g. 0.5 means 50%). The output for the example is:

{5} ==> {5,3,4} #SUP: 1 #CONF: 0.5

Consider the first line. It indicates that the rule {5} ==> {3, 4, 5} has a support of 1 (it appears once in the input sequence) and that the confidence of this rule is 20%. The next lines follow the same format

Performance

This is the original implementation of EMDO.

Where can I get more information about the algorithm?

The EMDO algorithm is described in this article:

O. Ouarem, F. Nouioua, P. Fournier-Viger: Discovering frequent parallel episodes in complex event sequences by counting distinct occurrences. Appl. Intell. 54(11-12): 701-721 (2024)