Mining Sequential Rules between Sequential Patterns with the RuleGen algorithm (SPMF documentation)

This example explains how to run the RuleGen algorithm using the SPMF open-source data mining library.

How to run this example?

What is RuleGen?

RuleGen is an algorithm for discovering sequential rules that appears in sequence databases. It was proposed by Zaki (2000).

What is the input of RuleGen ?

The input of RuleGen is a sequence database and two user-specified thresholds named minsup (a value in [0, 1] representing a percentage) and minconf (a value in [0, 1] representing a percentage).

A sequence database is a set of sequences where each sequence is a list of itemsets. An itemset is an unordered set of items. For example, the table shown below contains four sequences. The first sequence, named S1, contains 5 itemsets. It means that item 1 was followed by items 1 2 and 3 at the same time, which were followed by 1 and 3, followed by 4, and followed by 3 and 6. It is assumed that items in an itemset are sorted in lexicographical order. This database is provided in the file "contextPrefixSpan.txt" of the SPMF distribution.

ID Sequences
S1 (1), (1 2 3), (1 3), (4), (3 6)
S2 (1 4), (3), (2 3), (1 5)
S3 (5 6), (1 2), (4 6), (3), (2)
S4 (5), (7), (1 6), (3), (2), (3)

What is the output of RuleGen ?

The RuleGen algorithms outputs all sequential rules having a support and confidence respectively higher or equals to user-specified minsup and minconf thresholds.

A rule X==>Y is defined by RuleGen as a sequential relationship between two sequential patterns X and Y. The confidence of a rule X ==> Y is defined as the number of sequences containing X divided by the number of sequences containing Y. The support of a rule X ==> is defined as the number of sequences containing Y.

In this example, we apply RuleGen with minsup = 75 %, minconf= 50%. We obtains 21 sequential rules:

Rule Support Confidence

{1 } ==> {1 }{2 }

4

1.0

{1 } ==> {1 }{3 }

4

1.0

{1 } ==> {1 }{3 }{2 }

3

0.75

{1 } ==> {1 }{3 }{3 }

3

0.75

{2 } ==> {1 }{2 }

4

1.0

{2 } ==> {2 }{3 }

3

0.75

{2 } ==> {3 }{2 }

3

0.75

{2 } ==> {1 }{3 }{2 }

3

0.75

{3 } ==> {1 }{3 }

4

1.0

{3 } ==> {2 }{3 }

3

0.75

{3 } ==> {3 }{2 }

3

0.75

{3 } ==> {3 }{3 }

3

0.75

{3 } ==> {4 }{3 }

3

0.75

{3 } ==> {1 }{3 }{2 }

3

0.75

{3 } ==> {1 }{3 }{3 }

3

0.75

{4 } ==> {4 }{3 }

3

1.0

{1 }{2 } ==> {1 }{3 }{2 }

3

0.75

{1 }{3 } ==> {1 }{3 }{2 }

3

0.75

{1 }{3 } ==> {1 }{3 }{3 }

3

0.75

{3 }{2 } ==> {1 }{3 }{2 }

3

1.0

{3 }{3 } ==> {1 }{3 }{3 }

3

1.0

For example, the rule {1 } ==> {1 }{3 }{3 } means that if the sequential pattern {1} appears in a sequence, the sequential pattern {1} {3 }{3 } also appear in this sequence. Consider the rule {3 }{2 } ==> {1 }{3 }{2 }. It means that if the pattern {3 }{2 } appears in a sequence, the pattern {1 }{3 }{2 } also appears in that sequence. For example, the antecedent of this rule appears in  sequence s1, as follows: (1), (1 2 3), (1 3), (4), (3 6) and the consequent of this rule appears in  sequence s1 as follows: (1), (1 2 3), (1 3), (4), (3 6).

Input file format

The input file format is defined as follows. It is a text file where each line represents a sequence from a sequence database. Each item from a sequence is a postive integer and items from the same itemset within a sequence are separated by single spaces. Note that it is assumed that items within a same itemset are sorted according to a total order and that no item can appear twice in the same itemset. The value "-1" indicates the end of an itemset. The value "-2" indicates the end of a sequence (it appears at the end of each line). For example, the sample input file "contextPrefixSpan.txt" contains the following four lines (four sequences).

1 -1 1 2 3 -1 1 3 -1 4 -1 3 6 -1 -2
1 4 -1 3 -1 2 3 -1 1 5 -1 -2
5 6 -1 1 2 -1 4 6 -1 3 -1 2 -1 -2
5 -1 7 -1 1 6 -1 3 -1 2 -1 3 -1 -2

The first line represents a sequence where the itemset {1} is followed by the itemset {1, 2, 3}, followed by the itemset {1, 3}, followed by the itemset {4}, followed by the itemset {3, 6}. The next lines follow the same format.

Note that it is also possible to use a text file containing a text (several sentences) if the text file has the ".text" extension, as an alternative to the default input format. If the algorithm is applied on a text file from the graphical interface or command line interface, the text file will be automatically converted to the SPMF format, by dividing the text into sentences separated by ".", "?" and "!", where each word is considered as an item. Note that when a text file is used as input of a data mining algorithm, the performance will be slightly less than if the native SPMF file format is used because a conversion of the input file will be automatically performed before launching the algorithm and the result will also have to be converted. This cost however should be small.

Output file format

The output file format is defined as follows. It is a text file. Each line is a sequential rule and consists of three parts:

For example, the output file from the previous example is shown below:

{1 } ==> {1 }{2 } #SUP: 4 #CONF: 1.0
{1 } ==> {1 }{3 } #SUP: 4 #CONF: 1.0
{1 } ==> {1 }{3 }{2 } #SUP: 3 #CONF: 0.75
{1 } ==> {1 }{3 }{3 } #SUP: 3 #CONF: 0.75
{2 } ==> {1 }{2 } #SUP: 4 #CONF: 1.0
{2 } ==> {2 }{3 } #SUP: 3 #CONF: 0.75
{2 } ==> {3 }{2 } #SUP: 3 #CONF: 0.75
{2 } ==> {1 }{3 }{2 } #SUP: 3 #CONF: 0.75
{3 } ==> {1 }{3 } #SUP: 4 #CONF: 1.0
{3 } ==> {2 }{3 } #SUP: 3 #CONF: 0.75
{3 } ==> {3 }{2 } #SUP: 3 #CONF: 0.75
{3 } ==> {3 }{3 } #SUP: 3 #CONF: 0.75
{3 } ==> {4 }{3 } #SUP: 3 #CONF: 0.75
{3 } ==> {1 }{3 }{2 } #SUP: 3 #CONF: 0.75
{3 } ==> {1 }{3 }{3 } #SUP: 3 #CONF: 0.75
{4 } ==> {4 }{3 } #SUP: 3 #CONF: 1.0
{1 }{2 } ==> {1 }{3 }{2 } #SUP: 3 #CONF: 0.75
{1 }{3 } ==> {1 }{3 }{2 } #SUP: 3 #CONF: 0.75
{1 }{3 } ==> {1 }{3 }{3 } #SUP: 3 #CONF: 0.75
{3 }{3 } ==> {1 }{3 }{3 } #SUP: 3 #CONF: 1.0
{3 }{2 } ==> {1 }{3 }{2 } #SUP: 3 #CONF: 1.0

Consider the last line. It represents a rule where the antecedent is the itemset {3} followed by the itemset {2}, and the consequent is the itemset {1} followed by {3}, followed by {2}. The rule has a support of 3 sequences and a confidence of 100%. The other lines of the output file follow the same format.

Implementation details

The RuleGen algorithm first apply a sequential pattern mining algorithm and then combines pairs of sequential patterns to generate rules between two sequential patterns. Note that in our implementation we use the PrefixSpan algorithm for mining sequential patterns instead of SPADE because the PrefixSpan algorithm is generally faster than SPADE.

Also, it is important to note that rules found by RuleGen always have the form X == > Y such that X is a subsequence of Y. This definition of a sequential rule is different from the definition of a sequential rules used by other sequential rule mining algorithms offered in SPMF such as CMRules, CMDeo, RuleGrowth, TRuleGrowth, TopSeqRules and TNS where X and Y are unordered itemsets, and X is not a subset of Y. The rules found by these latter algorithms are more general. Moreover, we have shown that we can achieve higher prediction accuracy by using the kind of rules found by RuleGrowth, CMRules instead of using the rules generated by RuleGen. (see this article for details).

Where can I get more information about this algorithm?

The RuleGen algorithm is described in this paper:

Mohammed Javeed Zaki: Scalable Algorithms for Association Mining. IEEE Trans. Knowl. Data Eng. 12(3): 372-390 (2000)