Mining Frequent Closed Sequential Patterns by Post-Processing using SPAM or PrefixSpan (SPMF documentation)

This example explains how to mine closed sequential patterns by post-processing using the SPMF open-source data mining library.

What is this?

This example shows how to use the PrefixSpan and SPAM algorithms to discover all sequential patterns and keep only closed patterns by post-processing. This should be less efficient than using a dedicated algorithm for closed pattern mining like ClaSP, CloSpan and BIDE+.

How to run this example?

If you want to use SPAM with post-processing:

If you want to use PrefixSpan with post-processing:

What is the input ?

The input is a sequence database and a user-specified threshold named minsup (a value in [0,1] representing a percentage).

A sequence database is a set of sequences where each sequence is a list of itemsets. An itemset is an unordered set of items. For example, the table shown below contains four sequences. The first sequence, named S1, contains 5 itemsets. It means that item 1 was followed by items 1 2 and 3 at the same time, which were followed by 1 and 3, followed by 4, and followed by 3 and 6. It is assumed that items in an itemset are sorted in lexicographical order. This database is provided in the file "contextPrefixSpan.txt" of the SPMF distribution. Note that it is assumed that no items appear twice in the same itemset and that items in an itemset are lexically ordered.

ID Sequences
S1 (1), (1 2 3), (1 3), (4), (3 6)
S2 (1 4), (3), (2 3), (1 5)
S3 (5 6), (1 2), (4 6), (3), (2)
S4 (5), (7), (1 6), (3), (2), (3)

What is the output?

The output is all frequent closed sequential patterns that occurs in a sequence database.

To explain more formally what is a closed sequential pattern, it is necessary to review some definition.

A sequential pattern is a sequence. A sequence SA = X1, X2, ... Xk, where X1, X2... Xk are itemsets is said to occur in another sequence SB = Y1, Y2, ... Ym, where Y1, Y2... Ym are itemsets, if and only if there exists integers 1 <= i1 < i2... < ik <= m such that X1 ⊆ Yi1, X2 ⊆ Yi2, ... Xk ⊆ Yik.

The support of a sequential pattern is the number of sequences where the pattern occurs divided by the total number of sequences in the database.

A frequent sequential pattern is a sequential pattern having a support no less than the minsup parameter provided by the user.

A closed sequential pattern is a sequential pattern such that it is not strictly included in another pattern having the same support.

Why mining closed sequential patterns? It can be shown that the set of closed sequential patterns is generally much smaller than the set of sequential patterns and that no information small. Moreover, finding closed sequential patterns is often much more efficient than discovering all patterns.

For example, for minsup= 50 %, the following patterns are found in the previous sequence database .

ID Closed Sequential Pattern Support
S1 (6) 75 %
S2 (5) 75 %
S3 (2), (3) 75 %
S4 (1), (2) 100 %
S5 (1), (3) 100 %
S6 (1 2), (6) 50 %
S7 (4), (3) 75 %
S8 (1) (2), (3) 50 %
S9 (1), (2 3), (1) 50 %
S10 (1), (3), (2) 75 %
S11 (1), (3), (3) 75 %
S12 (1 2), (4), (3) 50 %
S13 (6), (2), (3) 50 %
S14 (5), (2), (3) 50 %
S15 (4), (3), (2) 50 %
S16 (5), (6), (3), (2) 50 %
S17 (5), (1), (3), (2) 50 %

For instance, the sequential pattern "(1,2),(6)" appears in the first and third sequence (it has therefore a support of 50%). Another pattern is "(4), (3)". It appears in the first, second and third sequence (it has thus a support of 75 %).

Optional parameter(s)

The algorithm implementation allows to specify additional optional parameter(s) :

These parameter(s) are available in the GUI of SPMF and also in the example(s) "MainTestSPAM_PostProcessingStepForClosedMining ... .java" provided in the source code of SPMF.

The parameter(s) can be also used in the command line with the Jar file. If you want to use these optional parameter(s) in the command line, it can be done as follows. Consider this example:
java -jar spmf.jar run SPAM_PostProcessingClosed contextPrefixSpan.txt output.txt 50% true
This command means to apply the algorithm on the file "contextPrefixSpan.txt" and output the results to "output.txt". Moreover, it specifies that the user wants to find patterns for minsup = 50%, and sequence ids should be output for each pattern found.

Input file format

The input file format is defined as follows. It is a text file where each line represents a sequence from a sequence database. Each item from a sequence is a postive integer and items from the same itemset within a sequence are separated by single space. Note that it is assumed that items within a same itemset are sorted according to a total order and that no item can appear twice in the same itemset. The value "-1" indicates the end of an itemset. The value "-2" indicates the end of a sequence (it appears at the end of each line). For example, the input file "contextPrefixSpan.txt" contains the following four lines (four sequences).

1 -1 1 2 3 -1 1 3 -1 4 -1 3 6 -1 -2
1 4 -1 3 -1 2 3 -1 1 5 -1 -2
5 6 -1 1 2 -1 4 6 -1 3 -1 2 -1 -2
5 -1 7 -1 1 6 -1 3 -1 2 -1 3 -1 -2

The first line represents a sequence where the itemset {1} is followed by the itemset {1, 2, 3}, followed by the itemset {1, 3}, followed by the itemset {4}, followed by the itemset {3, 6}. The next lines follow the same format.

Note that it is also possible to use a text file containing a text (several sentences) if the text file has the ".text" extension, as an alternative to the default input format. If the algorithm is applied on a text file from the graphical interface or command line interface, the text file will be automatically converted to the SPMF format, by dividing the text into sentences separated by ".", "?" and "!", where each word is considered as an item. Note that when a text file is used as input of a data mining algorithm, the performance will be slightly less than if the native SPMF file format is used because a conversion of the input file will be automatically performed before launching the algorithm and the result will also have to be converted. This cost however should be small.

Output file format

The output file format is defined as follows. It is a text file. Each line is a frequent sequential pattern. Each item from a sequential pattern is a postive integer and items from the same itemset within a sequence are separated by single spaces. The value "-1" indicates the end of an itemset. On each line, the sequential pattern is first indicated. Then, the keyword " #SUP: " appears followed by an integer indicating the support of the pattern as a number of sequences. For example, a few lines from the output file from the previous example are shown below:

1 2 -1 4 -1 3 -1 #SUP: 2
1 2 -1 6 -1 #SUP: 2
1 -1 2 -1 3 -1 #SUP: 2

The first line indicates that the frequent sequential pattern consisting of the itemset {1, 2}, followed by the itemset {4}, followed by the itemset {3} has a support of 2 sequences. The next lines follow the same format.

Performance

Mining closed patterns by post-processing should be less efficient than using a dedicated algorithms for closed sequential pattern mining.

Implementation details

The implementations of SPAM and PrefixSpan used in this example are made by Antonio Gomariz Peñalver (AGP).

In the source code, there is also two test files that shows how to keep the result into memory instead of saving it to a file.

Where can I get more information?

For information about SPAM and PrefixSpan, see the examples about PrefixSpan and SPAM.

Besides, you may read this survey of sequential pattern mining, which gives an overview of sequential pattern mining algorithms.