Mining Compressing Sequential Patterns Using the GoKrimp Algorithm (SPMF documentation)

This example explains how to run the GoKrimp algorithm using the SPMF open-source data mining library.

How to run this example?

What is GoKrimp?

The GoKrimp algorithm finds a non-redundant set of compressing sequential patterns based on the Minimum Description Length principle. It searches for a set of sequential pattern that compresses the data most. In this way, the set of patterns found by GoKrimp is usually non-redundant, and they are much more meaningful compared the the set of frequent patterns.

Another important property of GoKrimp is that it is parameter-free, thus users are not required to fine tune the parameters such as minimum support which is a difficult task in many applications.

The Gokrimp algorithm was first published in the SIAM Data Mining conference in 2012 and was chosen to include in the special issue of the best papers of SDM 2012 in the journal of Statistical Analysis and Data Mining (SADM Wiley 2014).

This implementation is the original implementation of GoKrimp.

What is the input of GoKrimp?

The input is a sequence database and a label file (optional).

A sequence database is a set of sequences where each sequence is an ordered list of items. Current version of GoKrimp does not work for a sequence of itemsets but the file input format is consistent with the standard file input format of the SPMF package.

The input file format is defined as follows. It is a text file where each line represents a sequence from a sequence database. Each item from a sequence is a positive integer and it is separated by the value "-1". The value "-2" indicates the end of a sequence (it appears at the end of each line). For example, the input file may contain the following two lines (two sequences).

1 -1 2 -1 3 -1 4 -1 3-1 -2
4 -1 3 -1 2 -1 1 -1 -2

The first line represents a sequence where the item {1} is followed by the item { 2}, followed by the item {3}, followed by the item {4} and followed by the item {3}. The next lines follow the same format.

The label file format is defined as follows. It is a text file where each line contains a string indicating the label of the item with identifier equal to the line number. For example the label file may contain the following 4 lines:

Support
Vector
Machine
Algorithm

which indicates that the label of the item 1 is “Support”, of item 2 is “Vector”, of item 3 is “Machine”, and of item 4 is “Algorithm”. Label file is optional.

Output file format

The output file format is defined as follows. It is a text file. Each line corresponds to a compressing sequential pattern. Each item separated by a single space from a compressing sequential pattern is either represented by a positive integer or by its label if the label file is provided. On each line, the compressing sequential pattern is first indicated. Then, the keyword "#SUP" appears followed by a real number indicating the contribution of the pattern in the compression. For example, here a few lines from the output file from the journal of machine learning dataset (jmlr_goKrimp.dat):

support vector machin #SUP: 1922.0710148279322
real world #SUP: 598.4753133154009
machin learn #SUP: 514.3586664227769
state art #SUP: 412.9730013575172
high dimension #SUP: 362.7776787300827
reproduc hilbert space #SUP: 359.42939766764175
neural network #SUP: 210.35608129308093
experiment result #SUP: 187.4169747827109
compon analysi #SUP: 176.54417917714454
supervis learn #SUP: 160.87427082075737
support vector #SUP: 148.74911007808987
well known #SUP: 138.22464635269716
hilbert space #SUP: 21.132125171017833
Compressed size: 839792.3563524645, uncompressed size: 845005.7388124614, compression ratio: 1.006207942261633
Running time: 2 seconds

The first line indicates that the compressing sequential pattern is support vector machin followed by the #SUP tag indicating the compression contribution of this pattern. The last two lines shows the compressed size and uncompressed size of the data in the number of bits, the compression ratio and the running times in seconds.

Performance

GoKrimp is very efficient. It output one pattern in each step so you can terminate the algorithm at anytime. In each step, GoKrimp starts from a seed item (usually frequent items) and tries to extend this item with related items chosen by the Sign Test. When extension of a pattern does not give significant compression benefit, GoKrimp outputs the pattern and starts looking for the next pattern with the new seeds.

Tips for using the source code:

GoKrimp algorithm uses the SignTest to test for dependency between a pattern and events used to extend a pattern. It requires that the event occurs at least in N=25 sequences to perform the test properly. If you have a very long sequence instead of a database of many sequences, you should split the long sequences into a set of short sequences.

The source code contains the SeqKrimp algorithm implementation which gets the candidate pattern set and returns a good subset of compressing patterns. You can feed SeqKrimp with any frequent pattern mining algorithm, e.g. BIDE+ and PreFixSpan.

The default encoding for an integer in our implementation is the Elias Gamma code, you can try the Elias Delta code by uncomenting the Elias Delta returning code in the function bits of GoKrimp class. The result might be slightly different.

The fields NSTART and NRELATED in the GoKrimp class are used to control the number of initial most frequent events used as seeds to extend to get the set of patterns and the maximum number of related events used to extend a given candidate pattern. The default value for NSTART and NRELATED is 1000. You can change this value to smaller value if the source code runs too long yet the quality of the results will be affected.

The GoKrimp algorithm is described in this paper:

Hoang Thanh Lam, Fabian Mörchen, Dmitriy Fradkin, Toon Calders: Mining Compressing Sequential Patterns in Journal of Statistical Analysis and Data Mining 7(1): 34-52 (2014) by Wiley.

Acknowledgements
The work was done when the first author was doing his PhD at Eindhoven University of Technology, the Netherlands under the support by the Netherlands Organisation for Scientific Research (NWO) in the project COMPASS. Part of the work has been done at Siemens Corporate Research center in Princeton, NJ USA.