Example: Mining Sequential Patterns with Flexible Time Constraints from a Time-Extended Sequence Database with the SPM-FC-L algorithm (SPMF documentation)
This example explains how to run the SPM_FC_L algorithm using the SPMF open-source data mining library.
How to run this example?
- If you are using the graphical interface, (1) choose the "SPM_FC_L" algorithm, (2) select the input file "mooc_small.txt", (3) set the output file name (e.g. "output.txt") (4) minsup= 0.08, alpha= 1.666, beta = 0.5, gamma= 0.33, mingap = 0, maxgap = 12345678, winsize = 0, and (5) click "Run algorithm".
- If you want to execute this example from the command line,
then execute this command:
java -jar spmf.jar run SPM_FC_L mooc_small.txt output.txt 0.008 1.666 0.5 0.33 0 12345678 0
in a folder containing spmf.jar and the example input file mooc_small.txt.
- To run this example with the source code version of SPMF, launch the file "MainTestSPM_FC_L_saveToFile.java" in the package ca.pfv.SPMF.tests.
What is this algorithm?
The SPM_FC_L algorithm is an algorithm for discovering frequent sequential patterns respecting some flexible time-constraints from sequences that contain time information. SPM_FC_L was designed specially for analyzing MOOC data (e-learning data) but can be applied to data from other domains as well where events are recorded as sequences and have timestamps.
The SPM_FC_L algorithm was proposed by Song et al. (2022), together with another algorithm called SPM_FC_P, also offered in SPMF. This is the original implementation of the SPM_FC_L algorithm.
What is the input?
The input is a time-extended sequence database and some constraints.
A time-extended sequence database for this algorithm is a set of time-extended sequences. A time-extended sequence is a list of items (symbols or events), ordered by time, where each item is anotated with a timestamp (an integer value). For example, a time-extended sequence could represents the sequence of courses that a given student has been enrolled in with the enrollment times as timestamps.
The file "mooc_small.txt" contained in the distribution of SPMF contains 21 sequences of such data, which was collected from an e-learning environment.
<2441> 0 -1 <2770> 1 -1 <2770> 2 -1 <2930> 3 -1 <4100> 4 -1 <4100> 5 -1 <4250> 6 -1 <5370> 7 -1 -2
<821> 8 -1 <971> 9 -1 <1940> 10 -1 <2220> 11 -1 <2881> 12 -1 <4091> 13 -1 <4091> 14 -1 <4171> 15 -1 <4240> 16 -1 -2
<180> 17 -1 <191> 18 -1 <191> 19 -1 <1120> 20 -1 <3250> 9 -1 <4111> 2 -1 <4541> 21 -1 <4541> 22 -1 <4541> 23 -1 <5450> 24 -1 <5450> 25 -1 -2
<1880> 17 -1 <2310> 26 -1 <4020> 2 -1 -2
<510> 10 -1 <510> 27 -1 <2841> 28 -1 <2841> 29 -1 <3001> 15 -1 <3471> 17 -1 <5450> 25 -1 -2
<1091> 30 -1 <1091> 14 -1 <1321> 31 -1 -2
<831> 10 -1 <870> 23 -1 <1130> 19 -1 <1131> 17 -1 -2
<4561> 14 -1 <4561> 17 -1 <4561> 10 -1 <4611> 30 -1 <4691> 26 -1 <5110> 42 -1 <5361> 168 -1 -2
<4551> 357 -1 <4551> 117 -1 <4560> 2 -1 <4560> 428 -1 <5201> 378 -1 -2
<4560> 291 -1 <4560> 200 -1 <4560> 277 -1 <4560> 287 -1 <4560> 300 -1 <4560> 299 -1 -2
<4560> 433 -1 <4560> 223 -1 <4560> 408 -1 <4560> 227 -1 <4560> 300 -1 <4580> 118 -1 <4580> 432 -1 <4580> 350 -1 <4581> 436 -1 <4581> 181 -1 <4601> 430 -1 <4601> 307 -1 <4601> 214 -1 <4601> 349 -1 <4601> 158 -1 <4601> 428 -1 <4601> 451 -1 <4611> 291 -1 <4611> 341 -1 <4611> 464 -1 <4611> 163 -1 <4621> 439 -1 -2
<4560> 433 -1 <4560> 223 -1 <4560> 408 -1 <4560> 227 -1 <4560> 300 -1 <4580> 118 -1 <4580> 432 -1 <4580> 350 -1 <4581> 181 -1 <4591> 450 -1 <4601> 430 -1 <4601> 307 -1 <4601> 214 -1 <4601> 282 -1 <4601> 349 -1 <4601> 158 -1 <4601> 428 -1 <4601> 451 -1 <4611> 291 -1 <4611> 341 -1 <4611> 464 -1 <4611> 163 -1 <4621> 457 -1 -2
<4560> 15 -1 <4560> 5 -1 <4560> 398 -1 <4560> 262 -1 -2
<4560> 271 -1 <4561> 18 -1 <4561> 49 -1 <4561> 19 -1 <4561> 137 -1 <4561> 347 -1 -2
<4560> 457 -1 <4560> 386 -1 <4560> 264 -1 <4561> 46 -1 <4621> 256 -1 <4621> 336 -1 <4761> 68 -1 <4761> 269 -1 <4761> 122 -1 <4761> 291 -1 <4770> 57 -1 <4770> 239 -1 <4770> 89 -1 <4770> 262 -1 <4770> 177 -1 <4771> 17 -1 <4771> 18 -1 <4771> 254 -1 <4771> 15 -1 <4771> 251 -1 <4771> 58 -1 <4771> 517 -1 <4771> 92 -1 <4771> 138 -1 <4771> 140 -1 <4771> 20 -1 <4781> 123 -1 <4781> 65 -1 <4781> 206 -1 <4791> 40 -1 <4791> 4 -1 <4801> 313 -1 <5010> 595 -1 <5010> 477 -1 <5010> 609 -1 <5011> 84 -1 <5050> 266 -1 <5051> 53 -1 <5060> 292 -1 <5060> 169 -1 <5060> 26 -1 <5060> 268 -1 <5060> 41 -1 <5060> 49 -1 <5061> 171 -1 <5061> 22 -1 <5070> 282 -1 <5070> 48 -1 <5070> 112 -1 <5070> 64 -1 <5071> 309 -1 <5080> 63 -1 <5080> 353 -1 <5080> 29 -1 <5080> 312 -1 <5091> 235 -1 <5110> 119 -1 <5111> 39 -1 <5111> 437 -1 <5111> 121 -1 <5111> 658 -1 <5130> 111 -1 <5130> 11 -1 <5130> 217 -1 <5130> 21 -1 <5130> 87 -1 <5130> 16 -1 <5130> 174 -1 <5130> 267 -1 <5230> 172 -1 <5230> 551 -1 <5230> 545 -1 -2
<4560> 306 -1 <4560> 162 -1 <4560> 203 -1 -2
<4561> 58 -1 <4561> 2 -1 <4561> 15 -1 <5060> 515 -1 <5091> 140 -1 <5330> 10 -1 <5461> 146 -1 <5461> 53 -1 -2
<4561> 58 -1 <4561> 2 -1 <4561> 15 -1 <5060> 515 -1 <5091> 140 -1 <5330> 10 -1 <5461> 146 -1 <5461> 53 -1 -2
<4561> 140 -1 <4561> 103 -1 <4561> 3 -1 -2
<4561> 571 -1 <4561> 219 -1 <4561> 116 -1 <4561> 239 -1 -2
<4561> 14 -1 <4561> 10 -1 <4561> 193 -1 -2
In that file, each line is a time-extended sequence.
Let's take the first line of that file as example:
<2441> 0 -1 <2770> 1 -1 <2770> 2 -1 <2930> 3 -1 <4100> 4 -1 <4100> 5 -1 <4250> 6 -1 <5370> 7 -1 -2
The first line is a time-extended sequence, which indicates that an event of type 0 occurred at time 2441, was followed by an event of type 1 at time 2770, and then it indicates that an event of type 2 was observed at time 2770. Thereafter, an event of type 3 occurred at time 2930. It was then followed by event of type 4 and an event of type 5 at time 4100, and then event of type 6 at time 4250, and finally an event of type 7 was observed after, at time 5370. The following lines follow the same format.
Thus, a time-extended sequence database contains one or more lines, where each line is a time-extended sequence. Each line starts by a timestamp <x> where x is a double value indicating a time. Then, it is followed by an item (a positive integer) indicating an event type that occurred at that time. Then the separator -1 appears. Thereafter, another timestamp may appear followed by another item and another separator (-1). This is repeated until the end of the sequence, which is indicated by special value -2.
The SPM_FC_L algorithm takes as input a time-extended sequence database in this format and outputs all the sequential patterns that respect some flexible time constraints that covers three aspects called the length, discreteness and validity.
To find sequential patterns using this algorithm, the user needs to provide four parameters, which define constraints on sequential patterns that should be found (see the paper by Song et al., 2022 for full details):
- minimum support (minsup): it indicates that a sequential pattern should have some minimum occurrence frequency in the sequence database (a positive integer between 0 and 1, representing a percentage)
- alpha : this is a weight indicating the importance of the length factor ( a double value between 0 and 1)
- beta : this is a weight indicating the importance of the discreteness factor (a double value between 0 and 1)
- gamma : this is a weight indicating the importance of the validity factor (a double value between 0 and 1)
- mingap: a minimum gap between items of a sequential patterns (an double value no less than 0)
- maxgap: a maximum gap between items of a sequential patterns (an double value no less than 1)
- winsize: a window size (a positive double value)
What is the output?
The output is a set of sequential patterns meeting the constraints given by the user. For example, if we run the algorithm with minsup= 8 %, alpha= 1.666, beta = 0.5, gamma= 0.33, we obtain the following output file result containing 22 sequential patterns (lines):
2 -1 #LENGTH: 4.742142018254623 #DISCRETE: 6.0 #VALIDITY: 5.608074706196417 #INTEGRATION: 5.659715238441243
10 -1 #LENGTH: 5.636461733338858 #DISCRETE: 7.0 #VALIDITY: 6.477432941595223 #INTEGRATION: 6.598554602754884
14 -1 #LENGTH: 3.342381145736094 #DISCRETE: 4.0 #VALIDITY: 4.0 #INTEGRATION: 3.890396857622682
15 -1 #LENGTH: 4.030452846817028 #DISCRETE: 6.0 #VALIDITY: 5.869358235398806 #INTEGRATION: 5.628194886269107
17 -1 #LENGTH: 4.147545808557844 #DISCRETE: 6.0 #VALIDITY: 5.738716470797612 #INTEGRATION: 5.6041631250255115
18 -1 #LENGTH: 1.6024877295108262 #DISCRETE: 3.0 #VALIDITY: 3.0 #INTEGRATION: 2.7670812882518043
19 -1 #LENGTH: 2.3869433491153034 #DISCRETE: 3.0 #VALIDITY: 2.869358235398806 #INTEGRATION: 2.8542766366521524
26 -1 #LENGTH: 1.7874436990180838 #DISCRETE: 3.0 #VALIDITY: 2.7387164707976117 #INTEGRATION: 2.710812773435551
53 -1 #LENGTH: 1.622573056608136 #DISCRETE: 3.0 #VALIDITY: 3.0 #INTEGRATION: 2.7704288427680224
58 -1 #LENGTH: 1.622573056608136 #DISCRETE: 3.0 #VALIDITY: 3.0 #INTEGRATION: 2.7704288427680224
140 -1 #LENGTH: 2.527410247298039 #DISCRETE: 4.0 #VALIDITY: 4.0 #INTEGRATION: 3.754568374549673
291 -1 #LENGTH: 1.854310289492285 #DISCRETE: 4.0 #VALIDITY: 3.869358235398806 #INTEGRATION: 3.5988377933816493
300 -1 #LENGTH: 1.7635928832441574 #DISCRETE: 3.0 #VALIDITY: 2.6080747061964176 #INTEGRATION: 2.6632903826061654
428 -1 #LENGTH: 1.7913439120061654 #DISCRETE: 3.0 #VALIDITY: 2.869358235398806 #INTEGRATION: 2.755010063800629
15 -1 53 -1 #LENGTH: 1.622573056608136 #DISCRETE: 2.9532050081194936 #VALIDITY: 3.0 #INTEGRATION: 2.7470313468277694
15 -1 140 -1 #LENGTH: 1.622573056608136 #DISCRETE: 2.9844142353923804 #VALIDITY: 3.0 #INTEGRATION: 2.7626359604642126
17 -1 26 -1 #LENGTH: 1.7874436990180838 #DISCRETE: 2.9920534271349286 #VALIDITY: 2.625141976854531 #INTEGRATION: 2.6689813223553216
58 -1 53 -1 #LENGTH: 1.622573056608136 #DISCRETE: 2.9532050081194936 #VALIDITY: 3.0 #INTEGRATION: 2.7470313468277694
140 -1 53 -1 #LENGTH: 1.622573056608136 #DISCRETE: 2.9902077044833963 #VALIDITY: 3.0 #INTEGRATION: 2.7655326950097208
58 -1 140 -1 #LENGTH: 1.622573056608136 #DISCRETE: 2.9844142353923804 #VALIDITY: 3.0 #INTEGRATION: 2.7626359604642126
15 -1 140 -1 53 -1 #LENGTH: 1.622573056608136 #DISCRETE: 2.9520143458422625 #VALIDITY: 3.0 #INTEGRATION: 2.746436015689154
58 -1 140 -1 53 -1 #LENGTH: 1.622573056608136 #DISCRETE: 2.9520143458422625 #VALIDITY: 3.0 #INTEGRATION: 2.746436015689154
The output file format is defined as follows. It is a text file, where each line is a sequential pattern. Each line starts by listing the items of the sequential pattern, where each item is represented by a positive integer. The items of a sequential patterns are separated by single spaces and the symbol -1. Finally, after all the items of a sequential pattern, the keywords #LENGTH, #DISCRETE, #VALIDITY and #INTEGRATION appear followed by the corresponding values for the sequential pattern.
For example, consider the first line:
2 -1 #LENGTH: 4.742142018254623 #DISCRETE: 6.0 #VALIDITY: 5.608074706196417 #INTEGRATION: 5.659715238441243
It represents a sequential pattern consisting of an event of type 2, and this pattern has a #LENGTH, #DISCRETE #VALIDITY and #INTEGRATION values of 4.74, 6.0, 5.6 and 5.65, respectively.
Consider another example:
140 -1 53 -1 #LENGTH: 1.622573056608136 #DISCRETE: 2.9902077044833963 #VALIDITY: 3.0 #INTEGRATION: 2.7655326950097208
This sequential pattern indicates that an event of type 140 was followed by an event of type 53.
The other lines follow the same format.
For more details about how to interpret these results, please see the paper by Song et al. (2022)
Implementation details
This is the original implementation of the SPM_FC_L algorithm.
Optional parameter(s)
This algorithm's implementation allows to specify additional optional parameter(s) :
- "show sequences ids?" (true/false) This parameter allows to specify that sequence ids of sequences containing a pattern should be output for each pattern found. For example, if the parameter is set to true, each pattern in the output file will be followed by the keyword #SID followed by a list of sequences ids (integers separated by space). For example, a line terminated by "#SID: 0 2" means that the pattern on this line appears in the first and the third sequences of the sequence database (sequences with ids 0 and 2).
These parameter(s) are available in the GUI of SPMF and also in the example(s) "MainTest ... .java" provided in the source code of SPMF.
The parameter(s) can be also used in the command line with the Jar
file. If you want to use this optional parameter(s) in the command
line for the example described in this page,, it can be done as follows. Use this command:
java -jar spmf.jar run SPM_FC_L mooc_small.txt
output.txt 0.008 1.666 0.5 0.33 0 12345678 0 true
where last value "true" means to show the sequence ids in the output.
Where can I get more information about this algorithm?
The SPM_FC_L algorithm is described in this paper
Song, W., Ye, W., Fournier-Viger, P. (2022). Mining sequential patterns with flexible constraints from MOOC data. Applied Intelligence
Besides, you may read this survey of sequential pattern mining, which gives an overview of various sequential pattern mining algorithms.