Mining Partially-Ordered Episode Rules in a Complex Sequence using the POERM or POERM-ALL algorithms (SPMF documentation)
This example explains how to run the POERM (or POERM-ALL) algorithm using the SPMF open-source data mining library.
How to run this example?
- If you are using the graphical interface, (1) choose the "POERM" or "POERMALL" algorithm, (2) select the input file "contextEMMA.txt", (3) set the output file name (e.g. "output.txt") (4) set the minimum support to 2, XSpan = 2, YSpan = 2, the minimum confidence to 0.5 (which means 50%), the XYSpan to 3, and set "Without timestamps?" to false, and then (5) click "Run algorithm".
- If you want to execute this example from the command line,
then execute this command:
java -jar spmf.jar run POERM contextEMMA.txt output.txt 2 2 2 0.5 3 false in a folder containing spmf.jar and the example input file contextEMMA.txt. - If you are using the source code version of SPMF, launch
the file "MainTestPOERM.java" or "MainTestPOERM-ALL.java" in the package ca.pfv.SPMF.tests.
What is POERM and POERM-ALL?
POERM and POERM-ALL (Fournier-Viger, P., Chen, Y. et al., 2021) is an algorithm for discovering partially-ordered episode rules in a complex event sequence. In simple words, the algorithm is used to analyze a sequence of symbols or events that can have timestamps or not. The output is some rules of the form X-->Y indicating that if some event(s) or symbol(s) X appear in the sequence, they will be followed by some other event(s) or symbol(s) Y. Finding such rule is useful to discover some sequential or temporal relationships in a sequence. For example, it can be used to analyze the purchase behavior of a customer (a sequence of purchases), or it could be used to find relationships between some words in a text (a sequence of words).
Various algorithms have been proposed to find episode rules. POERM is a recent algorithm that is designed to find a specific type of rules called "partially ordered" episode rules. Partially ordered means that these rules are not totally ordered. In a rule X --> Y, we just require that symbols in X appear before those in Y but there is not order requirement between the events in X, or between the events in Y. It was argued that these rules are more general than the traditional episode rules used in previous papers.
POERM and POERM-ALL both have the same input and output but they use different strategies to found the result. POERM is generally more efficient than POERM-ALL.
What is the input?
The algorithm takes as input an event sequence with a minimum support threshold, a mininum confidence threshold, some constraints on rules to be found called XSpan, YSpan, XYSpan and a boolean parameter without_timestamp? (self-increment), that must be set to true if the input dataset has no timestamps or otherwise false. Let’s consider following sequence consisting of 11 time points (t1, t2, …, t10) and 4 events type (1, 2, 3, 4). This database is provided in the text file "contextEMMA.txt" in the package ca.pfv.spmf.tests of the SPMF distribution.
Time points |
Itemset (event set) |
1 |
1 |
2 |
1 |
3 |
1 2 |
6 |
1 |
7 |
1 2 |
8 |
3 |
9 |
2 |
11 |
4 |
Each line of the sequence is called an event set and consists of:
- a time point (first column)
- an event set (second column)
For example, in the above example, at time point 1, the event 1 occurred. Then at time point 2, the event 1 occurred again. Then at time point 3, the events 1 and 2 occurred simultaneously. And so on.
To make it easier to understand, we could also represent the above sequence visually like this:
It is possible to apply the POERM and POERM algorithms on dataset that have no timestamps. In that case, it will be assumed that timestamps are 1,2,3,4...
What is the output?
The output of POERM or POERM-ALL is the set of partially-ordered episode rules having a support (occurrence frequency) that is no less than a minSup threshold (a positive integer) set by the user, a confidence that is no less than a minimum confidence threshold (a double number representing a percentage), and that meet some constraints XSpan, YSpan and XYSpan.
A rule is denoted as X --> Y and it is a relationship between two sets of events X and Y. The intuitive meaning of a rule X--> Y is that if all events from X appear in any order in a sequence, they will be followed by all events from Y. To avoid ﬁnding rules containing events that are too far appart, the algorithm allows to specify three constraints: (1) events from X must appear whithin some maximum amount of time XSpan ∈ Z+, (2) events from Y must appear whithin some maximum amount of time YSpan ∈ Z+, and (3) the time between X and Y must be no less than a constant XYSpan ∈ Z+. The three constraints XSpan, YSpan and XYSpan are parameters of the algorithm , and are illustrated in the figure below for a rule X → Y .
The support of a rule X-->Y means how many times the rule appeared in the input sequence (how many times X was followed by Y, and respected the three time constraints). For example, if a rule appears three times, we will say that the support of the rule is 3.
The confidence of a rule X-->Y means how many times X-->Y appeared in the input sequence, divided by how many times X appeared in the input sequence. Thus, you can think about the confidence about the conditional probability P(Y|X), that is how likely it is that Y will follow X in the sequence. The confidence is expressed as a percentage. For example, a confidence of 50% means that 50% of the times X was followed by Y in the input sequence.
The output of the algorithm is all the episode rules that (1) have a confidence greater or equal to the minConf parameter, and (2) a support greater or equal to the minSup parameter, given the three time constraints.
Now, let's look at the result for the above example. Four rules are found:
Episode rules |
Support | Confidence |
1 2 --> 1 |
2 | 0.5 (which means 50%) |
1 2 --> 2 |
2 | 0.5 (which means 50%) |
1 2 -> 3 |
2 | 0.5 (which means 50%) |
1 2 -> 2 3 |
2 | 0.5 (which means 50%) |
For instance, the first episode rule {1,2} --> 1 ( indicating that event 1 and 2 are followed by event 1) has a support of 2 and a confidence of 50%. The two occurrences of this rule are illustrated in the picture below:
Since the rule appears twice, the support is two. Moreover, it can be observed that there are two occurrences of X = {1,2} but only one is followed by Y={1}. Hence, the confidence is 50%.
Input file format
The input file format of that algorithm is a text file. An item (event) is represented by a positive integer. A transaction (event set) is a line in the text file. In each line (event set), items are separated by a single space. It is assumed that all items (events) within a same transaction line) are sorted according to a total order (e.g. ascending order) and that no item can appear twice within the same event set. Each line is optionally followed by the character "|" and then the timestamp of the event set (line). Note that it is possible to run the algorithms on a database that has no timestamps. In that case, the boolean parameter "without timestamps (self-increment)" should be set to true to indicate that the dataset has no timestamps. If there is no timestamps and the parameter is set to true, the algorithm will assume that each line has a timestamp that is increasing by 1.
In the previous example, the input file is defined as follows:
1|1
1|2
1 2|3
1|6
1 2|7
3|8
2|9
4|1
Output file format
The output file format is defined as follows. It is a text file. Each line is a partially episode rule. Each event in a partially episode rule is a positive integer On each line, the items from the rule antecedent are first listed, separated by single spaces. Then the keyword "==>" appears, followed by the items from the rule consequent, separated by single spaces. Then, the keyword "#SUP:" appears followed by an integer indicating the support of the rule as a number of sequences. Then, the keyword "#CONF:" appears followed by a double values in the [0, 1] interval indicating the confidence of the rule (e.g. 0.5 means 50%). The output for the example is:
1 2 ==> 1 #SUP: 2 #CONF: 0.5
1 2 ==> 2 #SUP: 2 #CONF: 0.5
1 2 ==> 3 #SUP: 2 #CONF: 0.5
1 2 ==> 2 3 #SUP: 2 #CONF: 0.5
Consider the first line. It indicates that the rule {1, 2} ==> {1} has a support of 2 (it appears twice in the input sequence) and that the confidence of this rule is 50%. The next lines follow the same format
Performance
POERM and POERM-ALL are the first algorithms for mining partially-ordered episode rules. POERM should generally be more efficient than POERM-ALL. A comparison of both algorithms is presented in the paper by Fournier-Viger et al. (2021)
Implementation details
This is the original implementations of those algorithms.
Where can I get more information about the POERM and POERM-ALL algorithms?
The POERM and POERM-ALL algorithm are described in this article:
Fournier-Viger, P., Chen, Y., Nouioua, F., Lin, J. C.-W. (2020). Mining Partially-Ordered Episode Rules in an Event Sequence. Proc. of the 13th Asian Conference on Intelligent Information and Database Systems (ACIIDS 2021),