Mining Sequential Patterns with Time Constraints from a Time-Extended Sequence Database containing Valued Items (SPMF documentation)

This example explains how to run the Fournier08 algorithm using the SPMF open-source data mining library.

How to run this example?

To run this example with the source code version of SPMF, launch the file "MainTestSequentialPatternsMining3.java" in the package ca.pfv.SPMF.tests.

This example is not available in the release version of SPMF.

What is this algorithm?

The Fournier-Viger et al., 2008 algorithm is a sequential pattern mining algorithm combining features from several other sequential pattern mining algorithms. It also offers some original features. In this example, we show how it can be used to discover sequential patterns with time-constraints from a sequence database where items are annotated with integer values.

To create an algorithm that can do that, we have extended the Hirate & Yamana (2006) algorithm to accept items with integer values.

What is the input?

The input is a time-extended sequence database where items are annotated with integer values (Fournier-Viger et al., 2008), a minsup threshold (a value in [0, 1] representing a percentage) and some constraints.

Such a database is defined as a set of sequences. A sequence is here a list of itemsets (groups of items), where each item is a symbol represented by an integer and each item is annotated with an integer representing a value associated with the item. Furthermore, an itemset has a timestamp indicating the time at which it occured. Note that an item is not allowed to occur more than once in an itemset and that items in an itemset are assumed to be lexically ordered.

For example, consider the following database. These integer values that are annotation to items are indicated with a bold font and blue color. Consider the sequence S1 of this database. At time 0, the item 1 occured with the value 2, and item 2 is not annotated with a value. At time 1, item 3 appeared. At time 2, item 6 occured. At time 3, item 5 occured with the value 1.

ID	Sequences
S1	(0, 1(2) 2), (1, 3), (2, 6), (3, 5(1))
S2	(0, 1(2) 2), (1, 4(8)), (2, 3), (3, 5(2) 6 7)
S3	(0, 1(3) 2), (1, 4(7)), (2, 3), (3, 5(4))
S4	(0, 1(3) 2), (1, 4(6)), (2, 5(5)), (3, 6 7)

What is the output?

To mine sequential patterns from a time-extended sequence database where items can be annotated with integer values, we have added an original mechanism in the Fournier-Viger et al. algorithm. Because it is a little bit complicated to explain, please refer to Fournier-Viger et al., 2008 for a detailed description.

As PrefixSpan, the algorithm grows patterns one item at a time by recursively projecting a database by a frequent item. However, we have modified this step so that when the support of an item that is annotated is higher or equals to 2 * minsup, the database projection operation calls the K-Means algorithm [15] to try to separate these values in two or more clusters. Thereafter, the database will be projected separately for each group of values. Thus, the different groups will be considered as different items.

This is best illustrated with an example. If we mine patterns from the first table with minsup = 50%, we can get 56 sequential patterns. Note however that the number of patterns found can vary because K-Means is a randomized algorithm. For this example, six patterns found are presented in the next table:

ID	Sequential Patterns	Support
P1	(0, 3)	75%
P2	(0, 5 (average: 3 min:1 max:5))	100 %
P3	(0, 6)	75 %
P4	(0, 3), (1, 6)	50 %
P5	(0, 1 (average: 3 min: 3 max: 3) 2), (1 4 (average: 6.5 min: 6 max: 7))	50 %
P6	(0, 1 ((average: 2 min: 2 max: 2) 2), (3, 5 (average: 3.5 min: 1 max: 2))	50 %
...	...	...

When the algorithm was executed, as some point, it considered projecting the database with item "1" because this item is frequent. Item "1" is annotated with values 2, 2, 3 and 3 in sequences S1, S2, S3 and S4, respectively. The algorithm applied the K-Means to find clusters. Two clusters were found. They are {2, 2}and {3, 3}. We can see this in sequences P5 and P6. In sequence P5, item "1" represents the cluster {3, 3}, whereas sequence P6 include item "1" with the cluster {2, 2}. This feature of clustering is useful as it allows to group similar values together for an item and treat them differently.

In the source code of MainTestSequentialPatternsMining3.java, there is a few parameters for K-Means. Two of these parameters are particularly important:

MaxK (an integer value >=1): This parameter is the maximum number of clusters to create. If you set this parameter to two for example, no more than two clusters will be made every time that K-Means is called. If you want to desactivate the clustering, this parameter should be set to 1 (only one cluster will always be made).
numberOfTriesForEachK (an integer value >=1): This parameter indicates the number of times that K-Means should be run each time there is some values to be clustered. Normally, this value should be set to 1. If you set this parameter to a value different than 1, K-Means will be run the number of times that you specify and the maximum number of clusters found will be kept. This parameter was added because K-Means is a randomized algorithm.

Important note: If the clustering described in this example is used jointly with the mining of closed sequential patterns (described in a previous example), the set of patterns found may not be a lossless representation of all patterns.

Input file format

The input file format is defined as follows. It is a text file where each line represents a time-extended sequence from a sequence database. Each line is a list of itemsets, where each itemset has a timestamp represented by a positive integer and each item is represented by a positive integer. Each itemset is first represented by its timestamp between the "<" and "> symbol. Then, the items of the itemset appear separated by single spaces. Finally, the end of an itemset is indicated by "-1". After all the itemsets, the end of a sequence (line) is indicated by the symbol "-2". Note that it is assumed that items are sorted according to a total order in each itemset and that no item appears twice in the same itemset. Furthermore, items can be annotated with a positive integer value specified between the "(" and ")" symbols.

For example, the input file "contextSequencesTimeExtended_ValuedItems.txt" contains the following four lines (four sequences).

<0> 1(2) 2 -1 <1> 3 -1 <2> 6 -1 <3> 5(1) -1 -2
<0> 1(2) 2 -1 <1> 4(8) -1 <2> 3 -1 <3> 5(2) 6 7 -1 -2
<0> 1(3) 2 -1 <1> 4(7) -1 <2> 3 -1 <3> 5(4) -1 -2
<0> 1(3) 2 -1 <1> 4(6) -1 <2> 5(5) -1 <3> 6 7 -1 -2

Consider the first line. It indicates that at time "0" the item 1 appears with the value 2 and the item 2 with no value (a value of 0 is assumed by default). Then, at time 1, the item 3 appears with no value. Then, at time 2, the item 6 appears with no value. Then at time 3, the item 5 appears with the value 1. The following sequences follow the same format.

Output file format

The output file format is defined as follows. It is a text file. Each line is a sequential pattern. Each line starts by listing the itemsets of the sequential pattern, where each itemset has a relative timestamp represented by a positive integer between the "<" and "> symbol. Then the timestamp is followed by each item in the itemset, each represented by a positive integer. The items of the itemset appear separated by single spaces and the symbol "-1" indicates the end of an itemset. Finally, after all the itemsets of a sequential pattern, the keyword "#SUP:" is followed by an integer indicating the support of the pattern as a number of sequences. Note that each item can be annotated with information about values specified between the "(" and ")" symbols. Information about values are a double value indicating the average value for this item in this pattern and the minimum and maximum values. For example, here is a few lines from the output file from the previous example:

<0> 1 (2.5, min=2.0 max=3.0) -1 <3> 7 -1 #SUP: 2
<0> 1 (2.5, min=2.0 max=3.0) -1 <3> 6 7 -1 #SUP: 2
<0> 1 (2.5, min=2.0 max=3.0) -1 <3> 6 -1 #SUP: 2

Consider the first line. It represents a sequential pattern. The first itemset occured has a relative timestamp of 0. Furthermore, it contains the item 1 with an average value of 2.5 and a minimum and maximum values respectively of 2 and 3. Then, a second itemset appears with a relative timestamp of 3. It contains the item 7. The support of the sequential pattern is 2 sequences. The two other lines follow the same format.

Where can I get more information about this algorithm?

The algorithm is described in this paper:

Fournier-Viger, P., Nkambou, R & Mephu Nguifo, E. (2008), A Knowledge Discovery Framework for Learning Task Models from User Interactions in Intelligent Tutoring Systems. Proceedings of the 7th Mexican International Conference on Artificial Intelligence (MICAI 2008). LNAI 5317, Springer, pp. 765-778.

Besides, you may read this survey of sequential pattern mining, which gives an overview of sequential pattern mining algorithms.