(Closed) Multi-dimensional Sequential Patterns Mining with Time Constraints (SPMF documentation)

This example explains how to run the Fournier08 algorithm using the SPMF open-source data mining library.

How to run this example?

What is this algorithm?

The Fournier-Viger et al., 2008 algorithm is a sequential pattern mining algorithm combining features from several other sequential pattern mining algorithms. It offers also some original features. In this example, we show how it can be used to discover multi-dimensional sequential patterns with time-constraints.

Multi-dimensional sequential pattern mining is an extension of the problem of sequential pattern mining that consider the context of each sequences. We have taken the SeqDIM algorithm for multi-dimensional sequential pattern mining and modified it to use features of the Hirate & Yamana (2008) algorithms so that it only discovers patterns that respect time constraints set by the user.

What is the input?

The input is a multidimensional time-extended sequence database. We here only provide a brief explanation. Please refer to the article by Fournier-Viger et al., 2008 for more details and a formal definition.

A time-extended multidimensional sequence database is a set of time-extended multi-dimensional sequences. A time-extended multi-dimensional sequence (here called MD-Sequence) is a time-extended sequence (as defined by Hirate & Yamana) but with dimensional information (as defined by Pinto et al. 2001). The set of dimensional values for an MD-Sequence is called an MD-Pattern. For a multi-dimensional database, there is a fix set of dimensions. Each dimensions can take a symbolic value or the value "*" which means any value. In the following MD-Database, there is four MD-Sequences named S1, S2, S3, S4.

MD-Sequences
ID MD-Patterns Sequences

d1 d2 d3
S1 1 1 1 (0, 2 4), (1, 3), (2, 2), (3, 1)
S2 1 2 2 (0, 2 6), (1, 3 5), (2, 6 7)
S3 1 2 1 (0, 1 8), (1, 1), (2, 2), (3, 6)
S4 * 3 3 (0, 2 5), (1, 3 5)

The task of multi-dimensional sequential pattern mining consists of finding MD-Sequences that have a support higher than a minimum support threshold minsup for a MD-Database. Furthermore, in the Fournier-Viger algorithm, we offers the possibility to mine only frequent closed MD-Sequences, by implementing the idea of Songram et al. 2006. A frequent closed MD-Sequence is a frequent MD-Sequence that is not included in any other MD-Sequence having the same support.

This algorithm has five parameters:

What is the output?

The output is the set of frequent closed MD-Sequences contained in the database (see the Fournier-Viger, 2008 paper for a formal definition).

For example, if we mine frequent closed MD-Sequences from the previous database with a minsup of 50% mininterval=0, maxinterval=1000, minwholeinterval=0, maxwholeinterval=1000, we obtain the following result:

Frequent Closed MD-Sequences
ID MD-Patterns Sequences

d1 d2 d3
P1 * * * (0, 2)
P2 1 * 1 (0, 1)
P3 1 2 * (0, 6)
P4 * * * (0, 2), (1, 3 5)
P5 * * * (0, 2), (1, 3)

If we mine frequent MD-sequences instead of frequent closed MD-Sequences, we will obtain 23 frequent MD-Sequences instead.

Input file format

The input file format is defined as follows. It is a text file where each line represents a multi-dimensional time-extended sequence from a multi-dimensional time-extended sequence database. Each line consists of two parts.

For example, the input file "ContextMDSequence.txt" contains the following four lines (four sequences).

1 1 1 -3 <0> 2 4 -1 <1> 3 -1 <2> 2 -1 <3> 1 -1 -2
1 2 2 -3 <0> 2 6 -1 <1> 3 5 -1 <2> 6 7 -1 -2
1 2 1 -3 <0> 1 8 -1 <1> 1 -1 <2> 2 -1 <3> 6 -1 -2
* 3 3 -3 <0> 2 5 -1 <1> 3 5 -1 -2

Consider the second line. It indicates that the second multi-dimensional time-extended sequence of this database has the dimension values 1, 2 and 2. Furthermore, the first itemset is {2, 4} with a timestamp of 0. Then, the item 3 appears with a timestamp of 1. Then the item 2 appears with a timestamp of 2. Finally, the item 1 appears with a timestamp of 3. The other sequence follows the same format. Note that timestamps do not need to be consecutive integers. But they should increase for each succesive itemset within a same sequence.

Output file format

The output file format is defined as follows. It is a text file. Each line is a frequent (closed) MD sequential pattern. Each line is separated into three parts: (1) a MD-pattern, (2) a sequence and (3) the support of the MD sequential pattern formed by the first and second part.

For example, here is the output file for this example:

[ * * * ]{t=0, 2 }{t=1, 3 5 } #SUP: 2
[ * * * ]{t=0, 2 }{t=1, 3 } #SUP: 3
[ 1 * * ]{t=0, 2 }{t=1, 3 } #SUP: 2
[ * * * ]{t=0, 2 } #SUP: 4
[ 1 * * ]{t=0, 2 } #SUP: 3
[ 1 * 1 ]{t=0, 1 } #SUP: 2
[ 1 2 * ]{t=0, 6 } #SUP: 2

Consider the first line. It presents a MD-sequential pattern having the dimension values 1, 2 and *. Furthemore, the line indicates that the sequence of this MD sequential pattern is the itemset {2} followed by the itemset {6} and that the MD-Sequential-pattern has a support of 2 transactions. The next lines follow the same format.

Where can I get more information about this algorithm?

The algorithm is described in this paper:

Fournier-Viger, P., Nkambou, R & Mephu Nguifo, E. (2008), A Knowledge Discovery Framework for Learning Task Models from User Interactions in Intelligent Tutoring Systems. Proceedings of the 7th Mexican International Conference on Artificial Intelligence (MICAI 2008). LNAI 5317, Springer, pp. 765-778.

The idea of using time-constraints is based on this paper

Hirate & Yamana (2006) Generalized Sequential Pattern Mining with Item Intervals. JCP 1(3): 51-60.

The idea of multi-dimensional pattern mining is based on this paper:

H. Pinto, J. Han, J Pei, K. Wang, Q. Chen, U. Dayal: Multi-Dimensional Sequential Pattern Mining. CIKM 2001: 81-88

The idea of closed multi-dimensional pattern mining is based on this paper:

P. Songram, V. Boonjing, S. Intakosum: Closed Multi-dimensional Sequential-Pattern Minin. Proc. of ITNG 2006.

The idea of mining closed sequential pattern is based on this paper:

J. Wang, J. Han: BIDE: Efficient Mining of Frequent Closed Sequences. ICDE 2004: 79-90

Besides, you may read this survey of sequential pattern mining, which gives an overview of sequential pattern mining algorithms.