Mining Interesting Sequential Patterns In a Single or Multiple Sequences Using the QCSP Algorithm (SPMF documentation)

This example explains how to run the QCSP algorithm using the SPMF open-source data mining library.

How to run this example?

If you are using the graphical interface, (1) choose the "QCSP" algorithm, (2) select the input file "moby_fragment.seq" (a single sequence or multiple sequences seperated by -2), (3) set the output file name (e.g. "output.txt"), (4) set Minsup = 3 (absolute support of a single item), (5) set Alpha = 3 (count cohesive patterns if duration is less than |X|*alpha), (6) set Maximum pattern length = 4, (7) set Top-k patterns = 25 (return k most cohesive patterns), (8) set the label file as "moby_fragment.lab" and (9) click "Run algorithm".
Note that the algorithm can run without the label file if none is available.
If you want to execute this example from the command line, then execute this command:
java -jar spmf.jar run QCSP moby_fragment.dat output.txt 3 3 4 25 moby_fragment.lab in a folder containing spmf.jar and the example input file moby_fragment.dat and moby_fragment.lab.
If you are using the source code version of SPMF, launch the file "MainTestQCSP_saveToMemory.java" in the package ca.pfv.spmf.test, or "MainTestQCSP_saveToFile.java"

What is QCSP?

The QCSP algorithm finds the top-k most quantile-based cohesive sequential patterns. Quantile-based cohesion is defined as ratio between the number of pattern occurrences that are near, versus all occurrences. That is we compute all minimal windows having length <= pattern-length * alpha. In this way, the set of patterns found by QCSP are much more meaningful compared the the set of frequent patterns.

In contrast to cohesive itemsets, quantile-based cohesion is also more robust in the presence of outliers or noise. In our paper we compare QCSP with comparable methods such as GoKrimp and Skopus. An important property of QCSP is that it find less frequent, but very cohesive patterns, other methods fail to find.

Another import advantage is that QCSP can find sequential patterns with repeating items and can take both multiple sequence and a single sequence as input. In the latter case, QCSP finds minimal windows in the original sequence, without the need to create a sequential database using a sliding window.

The QCSP algorithm was first published in the SIAM Data Mining conference in 2018.

What is the input of QCSP?

The input is a sequence database and a (optional) label file similar to GoKrimp.

A sequence database is a set of sequences where each sequence is an ordered list of items. The current version of QCSP does not work for a sequence of itemsets but the file input format is consistent with the standard file input format of the SPMF package.

The input file format is a text file where each line represents a sequence. Each item from a sequence is a positive integer (> 0) and it is separated by the value "-1". The value "-2" indicates the end of a sequence (it appears at the end of each line). For example, the input file may contain the following two lines (two sequences).

1 -1 2 -1 3 -1 4 -1 3-1 -2
4 -1 3 -1 2 -1 1 -1 -2

The first line represents a sequence where the item {1} is followed by the item { 2}, followed by the item {3}, followed by the item {4} and followed by the item {3}. The next lines follow the same format.

Alternatively we support a single long sequence of items. For example:

1 2 3 4 3 0 0 0 0 0 0 4 3 2 1

Here the sequence starts with item {1}, followed by {2}, etc. The special item value "0" is used for denoting gaps. A value of "-1" can be used to seperate items, but is optional. The element "-2" is ignored for a single long sequence.

The label file format is defined as follows. It is a text file where each line contains a string indicating the label of the item with identifier equal to the line number. For example the label file may contain the following 4 lines:

Support
Vector
Machine
Algorithm

which indicates that the label of the item 1 is “Support”, of item 2 is “Vector”, of item 3 is “Machine”, and of item 4 is “Algorithm”. Label file is optional.

Output file format

The output file format is defined as follows. Each line corresponds to a quantile-based sequential pattern, its support (defined using minimal windows) and the quantile-based cohesion score. Each item is separated by a single space from a top-k quantile-based cohesive pattern is either represented by a positive integer or by its label if the label file is provided. On each line, the compressing sequential pattern is first indicated. Then, the keyword "#SUP" appears followed by the support, and the keyword "QCOH" followed by the ratio between pattern occurrences that are near versus all occurrences. For an example for multiple sequencess (see "MainTestQCSP_saveToMemory.java"), here a few lines from the output file from the journal of machine learning dataset used as an example by GoKrimp.
>Started QSCP.run()
Parameters: alpha=2,000000, maxsize=6, top-k=25, pruningOf=false
<Finished QSCP.run(). Took 47,4 sec
============================
(mont,carlo) #SUP: 32 #QCOH: 1,000
(nearest,neighbor) #SUP: 70 #QCOH: 0,886
(kullback,leibler) #SUP: 16 #QCOH: 0,875
(support,vector) #SUP: 434 #QCOH: 0,765
(emot,cognit) #SUP: 10 #QCOH: 0,800
(http,www) #SUP: 23 #QCOH: 0,783
(gpl,licens) #SUP: 13 #QCOH: 0,769
(cross,valid) #SUP: 99 #QCOH: 0,747
(reproduc,hilbert) #SUP: 81 #QCOH: 0,716
(real,world) #SUP: 233 #QCOH: 0,670
(gnu,licens) #SUP: 12 #QCOH: 0,667
(heavi,tail) #SUP: 9 #QCOH: 0,667
(gnu,gpl) #SUP: 9 #QCOH: 0,667
(belief,propag) #SUP: 61 #QCOH: 0,639
(support,vector,machin) #SUP: 741 #QCOH: 0,565
(handwritten,digit) #SUP: 22 #QCOH: 0,636
(www,edu) #SUP: 16 #QCOH: 0,625
(health,care) #SUP: 13 #QCOH: 0,615
(support,machin) #SUP: 501 #QCOH: 0,559
(plai,role) #SUP: 53 #QCOH: 0,604
(releas,gpl) #SUP: 10 #QCOH: 0,600
(state,art) #SUP: 187 #QCOH: 0,578
(largest,deflag) #SUP: 14 #QCOH: 0,571
(isotrop,concav) #SUP: 18 #QCOH: 0,556
(gnu,gpl,licens) #SUP: 17 #QCOH: 0,529

The first pattern has the highest value for quantile-based cohesion. For mont carlo, the #SUP tag indicates that either month or carlo occur 32 times. The QCOH score is 1.0 meaning that all occurrences are cohesive. Since alpha is set to 2, this means that each occurrence is within a minimal window of at most 4 (thus at most two gaps, or other items).

For a example of a single sequence, we use a small fragment of the novel Moby Dick by Herman Melville (see "MainTestQCSP_singleSequence.java"), namely:

Receiving the top-maul from Starbuck, he advanced towards the main-mast with the hammer uplifted in one hand, exhibiting the gold with the other, and with a high raised voice exclaiming: Whosoever of ye raises me a white-headed whale with a wrinkled brow and a crooked jaw; whosoever of ye raises me that white-headed whale, with three holes punctured in his starboard fluke - look ye, whosoever of ye raises me that same white whale, he shall have this gold ounce, my boys!

"Huzza! huzza!" cried the seamen, as with swinging tarpaulins they hailed the act of nailing the gold to the mast.

It's a white whale," I say, resumed Ahab, as he threw down the top-maul; a white whale. "Skin your eyes for him, men; look sharp for white water; if ye see but a bubble, sing out."

All this while Tashtego, Daggoo, and Queequeg had looked on with even more intense interest and surprise than the rest, and at the mention of the wrinkled brow and crooked jaw they had started as if each was separately touched by some specific recollection.

"Captain Ahab," said Tashtego, "that white whale must be the same that some call Moby Dick."

"Moby Dick?" shouted Ahab. "Do ye know the white whale then, Tash?"

"Does he fan-tail a little curious, sir, before he goes down?" said the Gay-Header deliberately.

"And has he a curious spout, too," said Daggoo, "very bushy, even for a parmacetty, and mighty quick, Captain Ahab?"

"And he have one, two, tree - oh! good many iron in him hide, too, Captain," cried Queequeg disjointedly, "all twiske-tee betwisk, like him - him - "

An the output is (after removing subsets, or keeping maximal cohesive sequential patterns) is:
>Started QSCP.run()
Parameters: alpha=2.000000, maxsize=10, top-k=20, pruningOf=false
<Finished QSCP.run(). Took 2.1 sec
============================
QC Patterns:
(moby,dick) #SUP: 14 #QCOH: 1.000
(wrinkled,brow,crooked,jaw) #SUP: 8 #QCOH: 1.000
(top,maul) #SUP: 4 #QCOH: 1.000
(white,whale) #SUP: 18 #QCOH: 0.889
(d,see) #SUP: 5 #QCOH: 0.800
(this,ounce) #SUP: 5 #QCOH: 0.800
(cried,queequeg) #SUP: 6 #QCOH: 0.667
(whosoever,raises,me,headed) #SUP: 13 #QCOH: 0.615
(captain,ahab) #SUP: 13 #QCOH: 0.615
(huzza,cried) #SUP: 5 #QCOH: 0.600

Performance

QCSP is very efficient, since we use a variantion of constrainted prefix-project pattern growth with pruning using an upper bound. For high values of maxsize the algorithm can take a slightly longer time.

How to know more about this algorithm?

The QCSP algorithm is described in this paper:

Len Feremans, Boris Cule, Bart Goethals: Mining Top-k Quantile-based Cohesive Sequential Patterns in Proceedings of the 2018 SIAM international conference on data mining (SDM). Society for Industrial and Applied Mathematics, 2018.

Acknowledgements
The work was done when Len Feremans was doing his PhD at University of Antwerpen, within Adrem Data Lab. The authors would like to thank the VLAIO SBO HYMOP project for funding this research.