Mining Top-K Sequential Patterns Using the TKS Algorithm (SPMF documentation)

This example explains how to run the TKS algorithm using the SPMF open-source data mining library.

How to run this example?

If you are using the graphical interface, (1) choose the "TKS" algorithm, (2) select the input file "contextPrefixSpan.txt", (3) set the output file name (e.g. "output.txt") (4) set k = 5 and (5) click "Run algorithm".
If you want to execute this example from the command line, then execute this command:
java -jar spmf.jar run TKS contextPrefixSpan.txt output.txt 5 in a folder containing spmf.jar and the example input file contextPrefixSpan.txt.
If you are using the source code version of SPMF, launch the file "MainTestTKS.java" in the package ca.pfv.SPMF.tests.

What is TKS?

TKS is an algorithm for discovering the top-k most frequent sequential patterns in a sequence database. TKS was proposed by Fournier-Viger et al.(2013).

What is the input of TKS?

The input of TKS is a sequence database and a user-specified parameter named k (a positive integer representing the desired number of patterns to be found).

A sequence database is a set of sequences where each sequence is a list of itemsets. An itemset is an unordered set of items. For example, the table shown below contains four sequences. The first sequence, named S1, contains 5 itemsets. It means that item 1 was followed by items 1 2 and 3 at the same time, which were followed by 1 and 3, followed by 4, and followed by 3 and 6. It is assumed that items in an itemset are sorted in lexicographical order. This database is provided in the file "contextPrefixSpan.txt" of the SPMF distribution. Note that it is assumed that no items appear twice in the same itemset and that items in an itemset are lexically ordered.

ID	Sequences
S1	(1), (1 2 3), (1 3), (4), (3 6)
S2	(1 4), (3), (2 3), (1 5)
S3	(5 6), (1 2), (4 6), (3), (2)
S4	(5), (7), (1 6), (3), (2), (3)

What is the output of TKS?

TKS discovers the top-k most frequent sequential patterns that occurs in the input sequence database, where k is set by the user. Note that it is possible that TKS returns more than k patterns if several patterns have exactly the same support.

To explain more formally what is a sequential pattern, it is necessary to review some definition.

A sequential pattern is a sequence. A sequence SA = X1, X2, ... Xk, where X1, X2... Xk are itemsets is said to occur in another sequence SB = Y1, Y2, ... Ym, where Y1, Y2... Ym are itemsets, if and only if there exists integers 1 <= i1 < i2... < ik <= m such that X1 ⊆ Yi1, X2 ⊆ Yi2, ... Xk ⊆ Yik.

The support of a sequential pattern is the number of sequences where the pattern occurs divided by the total number of sequences in the database.

Why using TKS? It is often hard to set the minsup threshold of sequential pattern mining algorithm to get a fixed number of patterns without running the algorithms several times and fine-tuning the parameter. With a top-k sequential pattern mining algorithm, the user can set k the number of patterns to be output directly, which is more intuitive than using minsup.

For example, if we run TKS with k=5 on the sequence database, the top-5 most frequent patterns are:

ID	Sequential Pattern	Support
S1	(2)	100 %
S2	(1) (2)	100 %
S3	(1), (3)	100 %
S4	(3)	100 %
S5	(1)	100 %

For instance, the sequential pattern "(1),(2)" appears in the first, second, third and fourth sequence (it has therefore a support of 100%).

Optional parameters

The TKS implementation allows to specify four optional parameters :

"minimum pattern length" allows to specify the minimum number of items that patterns found should contain.
"maximum pattern length" allows to specify the maximum number of items that patterns found should contain.
"required items" allow to specify a set of items that must appears in every patterns found.
"max gap" allows to specify if gaps are allowed in sequential patterns. For example, if "max gap" is set to 1, no gap is allowed (i.e. each consecutive itemset of a pattern must appear consecutively in a sequence). If "max gap" is set to N, a gap of N-1 itemsets is allowed between two consecutive itemsets of a pattern. If the parameter is not used, by default "max gap" is set to +∞.

These parameters are available in the GUI of SPMF and also in the example "MainTestTKS.java" provided in the source code of SPMF.

If you want to use these optional parameters in the command line, it can be done as follows. Consider the command:
java -jar spmf.jar run TKS contextPrefixSpan.txt output.txt 5 2 6 1,3 1
It means that the user want to apply TKS on the file "contextPrefixSpan.txt" and output the results to "output.txt". Moreover, the user wants to find the top-5 patterns (k = 5) where patterns must have a minimum length of 2 item, a maximum length of 6 items, must contain items 1 and 3, and have no gap between itemsets.
Now, let's say that you want to run the TKS algorithm again with the same parameters except that you don't want to use the "required items" parameter. You could do as follows:
java -jar spmf.jar run TKS contextPrefixSpan.txt output.txt 5 2 6 "" 1

Input file format

The input file format is defined as follows. It is a text file where each line represents a sequence from a sequence database. Each item from a sequence is a postive integer and items from the same itemset within a sequence are separated by single space. Note that it is assumed that items within a same itemset are sorted according to a total order and that no item can appear twice in the same itemset. The value "-1" indicates the end of an itemset. The value "-2" indicates the end of a sequence (it appears at the end of each line). For example, the input file "contextPrefixSpan.txt" contains the following four lines (four sequences).

1 -1 1 2 3 -1 1 3 -1 4 -1 3 6 -1 -2
1 4 -1 3 -1 2 3 -1 1 5 -1 -2
5 6 -1 1 2 -1 4 6 -1 3 -1 2 -1 -2
5 -1 7 -1 1 6 -1 3 -1 2 -1 3 -1 -2

The first line represents a sequence where the itemset {1} is followed by the itemset {1, 2, 3}, followed by the itemset {1, 3}, followed by the itemset {4}, followed by the itemset {3, 6}. The next lines follow the same format.

Note that it is also possible to use a text file containing a text (several sentences) if the text file has the ".text" extension, as an alternative to the default input format. If the algorithm is applied on a text file from the graphical interface or command line interface, the text file will be automatically converted to the SPMF format, by dividing the text into sentences separated by ".", "?" and "!", where each word is considered as an item. Note that when a text file is used as input of a data mining algorithm, the performance will be slightly less than if the native SPMF file format is used because a conversion of the input file will be automatically performed before launching the algorithm and the result will also have to be converted. This cost however should be small.

Output file format

The output file format is defined as follows. It is a text file. Each line is a frequent sequential pattern. Each item from a sequential pattern is a postive integer and items from the same itemset within a sequence are separated by single spaces. The value "-1" indicates the end of an itemset. On each line, the sequential pattern is first indicated. Then, the keyword " #SUP: " appears followed by an integer indicating the support of the pattern as a number of sequences. For example, a few lines of an output file (different from the example above) could be:

1 -1 1 2 -1 #SUP: 2
5 -1 7 -1 3 -1 2 -1 #SUP: 2
5 -1 1 -1 3 -1 2 -1 #SUP: 2

The first line indicates that the frequent sequential pattern consisting of the itemset {1}, followed by the itemset {1, 2}, has a support of 2 sequences. The next lines follow the same format.

Optional feature: giving names to items

Some users have requested the feature of given names to items instead of using numbers. This feature is offered in the user interface of SPMF and in the command line of SPMF. To use this feature, your file must include @CONVERTED_FROM_TEXT as first line and then several lines to define the names of items in your file. For example, consider the example database "contextPrefixSpan.txt". Here we have modified the file to give names to the items:

@CONVERTED_FROM_TEXT
@ITEM=1=apple
@ITEM=2=orange
@ITEM=3=tomato
@ITEM=4=milk
@ITEM=5=bread
@ITEM=6=noodle
@ITEM=7=rice
@ITEM=-1=|
1 -1 1 2 3 -1 1 3 -1 4 -1 3 6 -1 -2
1 4 -1 3 -1 2 3 -1 1 5 -1 -2
5 6 -1 1 2 -1 4 6 -1 3 -1 2 -1 -2
5 -1 7 -1 1 6 -1 3 -1 2 -1 3 -1 -2

In this file, the first line indicates, that it is a file where names are given to items. Then, the second line indicates that the item 1 is called "apple". The third line indicates that the item 2 is called "orange". The 9th line indicates that the symbol "-1" must be replaced by "|". Then the following lines define four sequences in the SPMF format.

Then, if we apply a sequential pattern mining algorithm using this file using the user interface of SPMF or the command line, the output file contains several patterns, having this format:

apple | orange | #SUP: 4

Note that this feature could be also used from the source code of SPMF using the ResultConverter class. However, there is currently no example provided.

Performance

TKS is the fastest top-k sequential pattern mining algorithm offered in SPMF.

Where can I get more information about this algorithm?

The TKS algorithm is described in this paper:

Fournier-Viger, P., Gomariz, A., Gueniche, T., Mwamikazi, E., Thomas, R. (2013). Efficient Mining of Top-K Sequential Patterns. Proc. 9th International Conference on Advanced Data Mining and Applications (ADMA 2013) Part I, Springer LNAI 8346, pp. 109-120.

Besides, you may read this survey of sequential pattern mining, which gives an overview of sequential pattern mining algorithms.