Mining Sequential Rules Common to Several Sequences with the Window Size Constraint (SPMF documentation)

This example explains how to run the TRuleGrowth algorithm using the SPMF open-source data mining library.

How to run this example?

If you are using the graphical interface, (1) choose the "TRuleGrowth" algorithm, (2) select the input file "contextPrefixSpan.txt", (3) set the output file name (e.g. "output.txt") (4) set minsup = 0.7, minconf =0.8 and window_size = 3 (5) click "Run algorithm".
If you want to execute this example from the command line, then execute this command:
java -jar spmf.jar run TRuleGrowth contextPrefixSpan.txt output.txt 70% 80% 3 in a folder containing spmf.jar and the example input file contextPrefixSpan.txt.
If you are using the source code version of SPMF, launch the file "MainTestTRuleGrowth.java" in the package ca.pfv.SPMF.tests.

What is TRuleGrowth?

TRuleGrowth is an algorithm for discovering sequential rules that appears in sequence databases. It was proposed by Fournier-Viger in 2012. It is a variation of the RuleGrowth algorithm that accepts a window size constraint.

What is the input of TRuleGrowth ?

The input of TRuleGrowthis a sequence database, two user-specified thresholds named minsup (a value in [0, 1] representing a percentage) and minconf (a value in [0, 1] representing a percentage) and a parameter named window_size (an integer >=0).

A sequence database is a set of sequences where each sequence is a list of itemsets. An itemset is an unordered set of items. For example, the table shown below contains four sequences. The first sequence, named S1, contains 5 itemsets. It means that item 1 was followed by items 1 2 and 3 at the same time, which were followed by 1 and 3, followed by 4, and followed by 3 and 6. It is assumed that items in an itemset are sorted in lexicographical order. This database is provided in the file "contextPrefixSpan.txt" of the SPMF distribution.

ID	Sequences
S1	(1), (1 2 3), (1 3), (4), (3 6)
S2	(1 4), (3), (2 3), (1 5)
S3	(5 6), (1 2), (4 6), (3), (2)
S4	(5), (7), (1 6), (3), (2), (3)

What is the output of TRuleGrowth ?

Given a sequence database, and parameters named minsup, minconf and window_size, TRuleGrowth outputs all sequential rules having a support and confidence respectively higher than minsup and minconf that appears within window_size consecutive itemsets

A sequential rule X==>Y is a sequential relationship between two sets of items X and Y such that X and Y are disjoint, and that X is unordered and Y is unordered. The support of a rule X==>Y is the number of sequences that contain all items of X before all items from Y in within the window defined by window_size, divided by the number of sequences in the database. The confidence of a rule is the number of sequences that contain all items of X before all items from Y in the window defined by window_size, divided by the number of sequences that contains items in X in the window defined by window_size.

For example, if we set minsup = 0.7, minconf =0.8 and window_size = 3, TRuleGrowth discovers 4 rules:

Rule	Support	Confidence
{1 } ==> {2 }	80 % (4 sequences)	100 %
{1 } ==> {2, 3 }	80 % (4 sequences)	100 %
{1 } ==> {3 }	80 % (4 sequences)	100 %
{4 } ==> {3 }	60 % (3 sequences)	100 %

For example, the rule {1} ==> {2 3} means that if 1 appears in a sequence, it will be followed by 2 and (in any order) with a confidence of 100 %. Moreover, this rule has a support of 100 % because it appears in four sequences (S1, S2, S3 and S4) out of four sequences within window_size consecutive itemsets.

Optional parameters

The RuleGrowth implementation allows to specify some optional parameters :

"minimum antecedent length" allows to specify the maximum number of items that can appear in the left side (antecedent) of a rule
"maximum consequent length"allows to specify the maximum number of items that can appear in the right side (consequent) of a rule

These parameters are available in the GUI of SPMF and also in the example "MainTestTRuleGrowth.java" provided in the source code of SPMF.

The parameter(s) can be also used in the command line with the Jar file. If you want to use these optional parameters in the command line, it can be done as follows. Consider this example:
java -jar spmf.jar run TRuleGrowth contextPrefixSpan.txt output.txt 75% 50% 3 2 3
This command means to apply RuleGrowth on the file "contextPrefixSpan.txt" and output the results to "output.txt". Moreover, it specifies that the user wants to find patterns for minsup = 75 %, minconf = 50 %, a window size of 3 itemsets, and rules found must contain respectively a maximum of 2 items and 3 items in their antecedent and consequent.

Input file format

The input file format is defined as follows. It is a text file where each line represents a sequence from a sequence database. Each item from a sequence is a postive integer and items from the same itemset within a sequence are separated by single spaces. Note that it is assumed that items within a same itemset are sorted according to a total order and that no item can appear twice in the same itemset. The value "-1" indicates the end of an itemset. The value "-2" indicates the end of a sequence (it appears at the end of each line). For example, the sample input file "contextPrefixSpan.txt" contains the following four lines (four sequences).

1 -1 1 2 3 -1 1 3 -1 4 -1 3 6 -1 -2
1 4 -1 3 -1 2 3 -1 1 5 -1 -2
5 6 -1 1 2 -1 4 6 -1 3 -1 2 -1 -2
5 -1 7 -1 1 6 -1 3 -1 2 -1 3 -1 -2

The first line represents a sequence where the itemset {1} is followed by the itemset {1, 2, 3}, followed by the itemset {1, 3}, followed by the itemset {4}, followed by the itemset {3, 6}. The next lines follow the same format.

Note that it is also possible to use a text file containing a text (several sentences) if the text file has the ".text" extension, as an alternative to the default input format. If the algorithm is applied on a text file from the graphical interface or command line interface, the text file will be automatically converted to the SPMF format, by dividing the text into sentences separated by ".", "?" and "!", where each word is considered as an item. Note that when a text file is used as input of a data mining algorithm, the performance will be slightly less than if the native SPMF file format is used because a conversion of the input file will be automatically performed before launching the algorithm and the result will also have to be converted. This cost however should be small.

Output file format

The output file format is defined as follows. It is a text file. Each line is a sequential rule. Each item from a sequential rule is a postive integer. On each line, the items from the rule antecedent are first listed, separated by single spaces. Then the keyword "==>" appears, followed by the items from the rule consequent, separated by single spaces. Then, the keyword "#SUP:" appears followed by an integer indicating the support of the rule as a number of sequences. Then, the keyword "#CONF:" appears followed by a double values in the [0, 1] interval indicating the confidence of the rule. For example, the output file from the previous example is shown below:

1 ==> 2 #SUP: 4 #CONF: 1.0
1 ==> 2,3 #SUP: 4 #CONF: 1.0
1 ==> 3 #SUP: 4 #CONF: 1.0
4 ==> 3 #SUP: 3 #CONF: 1.0

Consider the second line. It indicates that the rule {1} ==> {2, 3} has a support of 4 sequences and a confidence of 100 %. The other lines follow the same format.

Optional feature: giving names to items

Some users have requested the feature of given names to items instead of using numbers. This feature is offered in the user interface of SPMF and in the command line of SPMF. To use this feature, your file must include @CONVERTED_FROM_TEXT as first line and then several lines to define the names of items in your file. For example, consider the example database "contextPrefixSpan.txt". Here we have modified the file to give names to the items:

@CONVERTED_FROM_TEXT
@ITEM=1=apple
@ITEM=2=orange
@ITEM=3=tomato
@ITEM=4=milk
@ITEM=5=bread
@ITEM=6=noodle
@ITEM=7=rice
@ITEM=-1=|
1 -1 1 2 3 -1 1 3 -1 4 -1 3 6 -1 -2
1 4 -1 3 -1 2 3 -1 1 5 -1 -2
5 6 -1 1 2 -1 4 6 -1 3 -1 2 -1 -2
5 -1 7 -1 1 6 -1 3 -1 2 -1 3 -1 -2

In this file, the first line indicates, that it is a file where names are given to items. Then, the second line indicates that the item 1 is called "apple". The third line indicates that the item 2 is called "orange". The 9th line indicates that the symbol "-1" must be replaced by "|". Then the following lines define four sequences in the SPMF format.

Then, if we apply a sequential pattern mining algorithm using this file using the user interface of SPMF or the command line, the output file contains several patterns, including the following ones:

apple ==> orange #SUP: 4 #CONF: 1.0
apple ==> orange,tomato #SUP: 4 #CONF: 1.0
apple ==> tomato #SUP: 4 #CONF: 1.0

Note that this feature could be also used from the source code of SPMF using the ResultConverter class. However, there is currently no example provided.

Performance

TRuleGrowth is a very efficient algorihtm. It is faster and more memory efficient than CMDeo and CMRules. If the windows_size constraint is used, it can also be much faster than RuleGrowth depending on how the window_size constraint is set.

Implementation details

In SPMF, there is also a version of TRuleGrowth that accepts strings instead of integers. It is available under the name "TRuleGrowth with strings" in the release version of SPMF or in the package ca.pfv.spmf.sequential_rules.trulegrowth_with_strings for the source code version of SPMF. To run it, you should use the input file: contextPrefixSpanStrings.txt.

Where can I get more information about this algorithm?

The TRuleGrowth algorithm is described in this paper:

Fournier-Viger, P., Wu, C.-W., Tseng, V.S., Nkambou, R. (2012). Mining Sequential Rules Common to Several Sequences with the Window Size Constraint. Proceedings of the 25th Canadian Conf. on Artificial Intelligence (AI 2012), Springer, LNAI 7310, pp.299-304