(Closed) Multi-dimensional Sequential Patterns Mining with Time Constraints (SPMF documentation)
This example explains how to run the Fournier08 algorithm using the SPMF open-source data mining library.
How to run this example?
- If you are using the graphical interface, (1) choose the "SeqDim_(BIDE+AprioriClose)+time" algorithm, (2) select the input file "ContextMDSequence.txt", (3) set the output file name (e.g. "output.txt") (4) set minsup = 50%, mininterval=0, maxinterval=1000, minwholeinterval=0, maxwholeinterval=1000 and (5) click "Run algorithm".
- If you want to execute this example from the command line,
then execute this command:
java -jar spmf.jar run SeqDim_(BIDE+AprioriClose)+time ContextMDSequence.txt output.txt 50% 0 1000 0 1000 in a folder containing spmf.jar and the example input file ContextMDSequence.txt. - If you are using the source code version of SPMF, launch the file "MainTestSequentialPatternsMining4.java" in the package ca.pfv.SPMF.tests.
What is this algorithm?
The Fournier-Viger et al., 2008 algorithm is a sequential pattern mining algorithm combining features from several other sequential pattern mining algorithms. It offers also some original features. In this example, we show how it can be used to discover multi-dimensional sequential patterns with time-constraints.
Multi-dimensional sequential pattern mining is an extension of the problem of sequential pattern mining that consider the context of each sequences. We have taken the SeqDIM algorithm for multi-dimensional sequential pattern mining and modified it to use features of the Hirate & Yamana (2008) algorithms so that it only discovers patterns that respect time constraints set by the user.
What is the input?
The input is a multidimensional time-extended sequence database. We here only provide a brief explanation. Please refer to the article by Fournier-Viger et al., 2008 for more details and a formal definition.
A time-extended multidimensional sequence database is a set of time-extended multi-dimensional sequences. A time-extended multi-dimensional sequence (here called MD-Sequence) is a time-extended sequence (as defined by Hirate & Yamana) but with dimensional information (as defined by Pinto et al. 2001). The set of dimensional values for an MD-Sequence is called an MD-Pattern. For a multi-dimensional database, there is a fix set of dimensions. Each dimensions can take a symbolic value or the value "*" which means any value. In the following MD-Database, there is four MD-Sequences named S1, S2, S3, S4.
MD-Sequences |
||||
ID | MD-Patterns | Sequences | ||
d1 | d2 | d3 | ||
S1 | 1 | 1 | 1 | (0, 2 4), (1, 3), (2, 2), (3, 1) |
S2 | 1 | 2 | 2 | (0, 2 6), (1, 3 5), (2, 6 7) |
S3 | 1 | 2 | 1 | (0, 1 8), (1, 1), (2, 2), (3, 6) |
S4 | * | 3 | 3 | (0, 2 5), (1, 3 5) |
The task of multi-dimensional sequential pattern mining consists of finding MD-Sequences that have a support higher than a minimum support threshold minsup for a MD-Database. Furthermore, in the Fournier-Viger algorithm, we offers the possibility to mine only frequent closed MD-Sequences, by implementing the idea of Songram et al. 2006. A frequent closed MD-Sequence is a frequent MD-Sequence that is not included in any other MD-Sequence having the same support.
This algorithm has five parameters:
- minsup (a value in [0,1] representing a percentage),
- mininterval (the minimum amount of time between two consecutive itemsets in the patterns to be found) (an integer>=0),
- maxinterval (the maximum amount of time between two consecutive itemsets in the patterns to be found) (an integer>=1)
- minwholeinterval(the minimum time duration of patterns to be found) (an integer>=0)
- maxwholeinterval (the maximum time duration of the patterns to be found) (an integer>=1)
What is the output?
The output is the set of frequent closed MD-Sequences contained in the database (see the Fournier-Viger, 2008 paper for a formal definition).
For example, if we mine frequent closed MD-Sequences from the previous database with a minsup of 50% mininterval=0, maxinterval=1000, minwholeinterval=0, maxwholeinterval=1000, we obtain the following result:
Frequent Closed MD-Sequences |
||||
ID | MD-Patterns | Sequences | ||
d1 | d2 | d3 | ||
P1 | * | * | * | (0, 2) |
P2 | 1 | * | 1 | (0, 1) |
P3 | 1 | 2 | * | (0, 6) |
P4 | * | * | * | (0, 2), (1, 3 5) |
P5 | * | * | * | (0, 2), (1, 3) |
If we mine frequent MD-sequences instead of frequent closed MD-Sequences, we will obtain 23 frequent MD-Sequences instead.
Input file format
The input file format is defined as follows. It is a text file where each line represents a multi-dimensional time-extended sequence from a multi-dimensional time-extended sequence database. Each line consists of two parts.
- The first part is a list of dimension values separated by single spaces. A dimension value is a positive integer or the symbol "*" meaning "any values". Finally, the value "-3" indicates the end of the first part. Note that each line should have the same number of dimension values.
- The second part is a list of itemsets, where each itemset has a timestamp represented by a positive integer and each item is represented by a positive integer. Each itemset is first represented by its timestamp between the "<" and "> symbols. Then, the items of the itemset appear separated by single spaces. Finally, the end of an itemset is indicated by "-1". After all the itemsets, the end of a sequence (line) is indicated by the symbol "-2". Note that it is assumed that items are sorted according to a total order in each itemset and that no item appears twice in the same itemset.
For example, the input file "contextSequencesTimeExtended.txt" contains the following four lines (four sequences).
1 1 1 -3 <0> 2 4 -1 <1> 3 -1 <2> 2 -1
<3> 1 -1 -2
1 2 2 -3 <0> 2 6 -1 <1> 3 5 -1 <2> 6 7 -1 -2
1 2 1 -3 <0> 1 8 -1 <1> 1 -1 <2> 2 -1 <3> 6 -1
-2
* 3 3 -3 <0> 2 5 -1 <1> 3 5 -1 -2
Consider the second line. It indicates that the second multi-dimensional time-extended sequence of this database has the dimension values 1, 2 and 2. Furthermore, the first itemset is {2, 4} with a timestamp of 0. Then, the item 3 appears with a timestamp of 1. Then the item 2 appears with a timestamp of 2. Finally, the item 1 appears with a timestamp of 3. The other sequence follows the same format. Note that timestamps do not need to be consecutive integers. But they should increase for each succesive itemset within a same sequence.
Output file format
The output file format is defined as follows. It is a text file. Each line is a frequent (closed) MD sequential pattern. Each line is separated into three parts: (1) a MD-pattern, (2) a sequence and (3) the support of the MD sequential pattern formed by the first and second part.
- The first part is a list of dimension values separated by single spaces between the symbols "[" and "]". A dimension value is a positive integer or the symbol "*" meaning "any values". Note that each line have the same number of dimension values.
- The second part of each line is a sequence. A sequence is a list of itemset. Each itemset is represented by {t=x, ...} where x is an integer indicating that the itemset has the relative timestamp "x" and where "..." is a list of items separated by single spaces. Each item in an itemset is represented by a postive integers. Note that it is assumed that items within a same itemset are sorted according to a total order and that no item can appear twice in the same itemset.
- The third part is the keyword "#SUP:" followed by an integer indicating the support of the pattern as a number of sequences.
For example, here is the output file for this example:
[ * * * ]{t=0, 2 }{t=1, 3 5 } #SUP: 2
[ * * * ]{t=0, 2 }{t=1, 3 } #SUP: 3
[ 1 * * ]{t=0, 2 }{t=1, 3 } #SUP: 2
[ * * * ]{t=0, 2 } #SUP: 4
[ 1 * * ]{t=0, 2 } #SUP: 3
[ 1 * 1 ]{t=0, 1 } #SUP: 2
[ 1 2 * ]{t=0, 6 } #SUP: 2
Consider the first line. It presents a MD-sequential pattern having the dimension values 1, 2 and *. Furthemore, the line indicates that the sequence of this MD sequential pattern is the itemset {2} followed by the itemset {6} and that the MD-Sequential-pattern has a support of 2 transactions. The next lines follow the same format.
Where can I get more information about this algorithm?
The algorithm is described in this paper:
Fournier-Viger, P., Nkambou, R & Mephu Nguifo, E. (2008), A Knowledge Discovery Framework for Learning Task Models from User Interactions in Intelligent Tutoring Systems. Proceedings of the 7th Mexican International Conference on Artificial Intelligence (MICAI 2008). LNAI 5317, Springer, pp. 765-778.
The idea of using time-constraints is based on this paper
Hirate & Yamana (2006) Generalized Sequential Pattern Mining with Item Intervals. JCP 1(3): 51-60.
The idea of multi-dimensional pattern mining is based on this paper:
H. Pinto, J. Han, J Pei, K. Wang, Q. Chen, U. Dayal: Multi-Dimensional Sequential Pattern Mining. CIKM 2001: 81-88
The idea of closed multi-dimensional pattern mining is based on this paper:
P. Songram, V. Boonjing, S. Intakosum: Closed Multi-dimensional Sequential-Pattern Minin. Proc. of ITNG 2006.
The idea of mining closed sequential pattern is based on this paper:
J. Wang, J. Han: BIDE: Efficient Mining of Frequent Closed Sequences. ICDE 2004: 79-90
Besides, you may read this survey of sequential pattern mining, which gives an overview of sequential pattern mining algorithms.