Mining Closed Sequential Patterns with Time Constraints from a Time-Extended Sequence Database (SPMF documentation)

This example explains how to run the Fournier-08 algorithm using the SPMF open-source data mining library.

How to run this example?

What is this algorithm?

The Fournier-Viger et al., 2008 algorithm is a sequential pattern mining algorithm combining features from several other sequential pattern mining algorithms. It also offers some original features. In this example, we show how it can be used to discover closed sequential patterns with time-constraints.

Closed sequential patterns is a compact representation of all sequential patterns. Mining closed sequential patterns is important because it can greatly reduce the number of patterns found without loss of information. Using time-constraint is important because it allows to filter unininteresting patterns according to time-related constraints.

Mining closed patterns or using time constraints is also important because it can greatly improve the speed and memory usage when these constraints are used.

What is the input?

The input is a time-extended sequence database (as defined by Hirate-Yamana, 2006) and some constraints.

A time-extended sequence database is a set of time-extended sequences. A time-extended sequences is a list of itemsets (groups of items). Each itemset is anotated with a timestamp that is an integer value.

For example, consider the following time-extended sequence database provided in the file contextSequencesTimeExtended.txt of the SPMF distribution. The database contains 4 time-extended sequences. Each sequence contains itemsets that are annotated with a timestamp. For example, consider the sequence S1. This sequence indicates that itemset {1} appeared at time 0. It was followed by the itemset {1, 2, 3} at time 1. This latter itemset was followed by the itemset {1 2} at time 2.

ID Sequences
S1 (0, 1), (1, 1 2 3}), (2, 1 3)
S2 (0, 1 ) (1, 1 2 ), (2, 1 2 3), (3, 1 2 3 )
S3 (0, 1 2), (1, 1 2 )
S4 (0, 2), (1, 1 2 3 )

The algorithms discovers closed time-extended sequential patterns that are common to several sequences. To do that, the user needs to provide five constraints (see the paper by Hirate & Yamana, 2006 for full details):

Note : It is recommended to set max_whole_interval to a very large value because if it is used with closed sequential pattern mining, the algorithm become approximate and may not return all closed itemsets respecting the time constraint (the reason is that this constraint is not compatible with the " backScan pruning" of the BIDE+ algorithm).

What is the output?

The output is a set of closed time-extended sequential patterns meeting the constraints given by the user. For example, if we run the algorithm with minsup= 55 %, min_time_interval = 0, max_time_interval = 2, min_whole_interval = 0, max_whole_interval = 100, we obtain the following results:

ID Sequential Patterns Support
S1 (0, 1 2 3) 75%
S2 (0, 1 2) 100 %
S3 (0, 2), (1, 1 2) 75 %
S4 (0, 2), (1, 1 3) 75 %
S5 (0, 2), (1, 1) 100 %
S6 (0, 1), (1, 1 2) 75 %
S7 (0, 1 2), (1, 1) 75 %

For instance, the last pattern S7 indicates that the items 1 and 2 were followed by item 1 one time unit after. This pattern has a support of 75 % because it appears in S1, S2 and S3. It is important to note that the timestamps in the sequential patterns found are relative. For example, the pattern S6 is considered to appear in S1, S2 and S3 because {1} appears one time unit after {1, 2} in all of these sequences, even though the timestamps do not need to be the same in all of these sequences.

If you compare the results of this example with the previous example, you can observe that the number of closed time-extended sequential patterns (6) is much smaller than the number of time-extended sequential patterns (15) found in the previous example.

Input file format

The input file format is defined as follows. It is a text file where each line represents a time-extended sequence from a sequence database. Each line is a list of itemsets, where each itemset has a timestamp represented by a positive integer and each item is represented by a positive integer. Each itemset is first represented by it timestamp between the " <" and " > symbol. Then, the items of the itemset appear separated by single spaces. Finally, the end of an itemset is indicated by " -1" . After all the itemsets, the end of a sequence (line) is indicated by the symbol " -2" . Note that it is assumed that items are sorted according to a total order in each itemset and that no item appears twice in the same itemset.

For example, the input file " contextSequencesTimeExtended.txt" contains the following four lines (four sequences).

<0> 1 -1 <1> 1 2 3 -1 <2> 1 3 -1 -2
<0> 1 -1 <1> 1 2 -1 <2> 1 2 3 -1 <3> 1 2 3 -1 -2
<0> 1 2 -1 <1> 1 2 -1 -2
<0> 2 -1 <1> 1 2 3 -1 -2

Consider the first line. It indicates that at time " 0" the itemset {1} appeared, followed by the itemset {1, 2, 3} at time 1, then followed by the itemset {1, 3} at time 2. Note that timestamps do not need to be consecutive integers. But they should increase for each succesive itemset within a sequence. The second, third and fourth line follow the same format.

Output file format

The output file format is defined as follows. It is a text file. Each line is a time-extended frequent closed sequential pattern. Each line starts by listing the itemsets of the sequential pattern, where each itemset has a relative timestamp represented by a positive integer between the " <" and " > symbol. Then the timestamp is followed by each item in the itemset, each represented by a positive integer. The items of the itemset appear separated by single spaces and the symbol " -1" indicates the end of an itemset. Finally, after all the itemsets of a sequential pattern, the keyword " #SUP:" is followed by an integer indicating the support of the pattern as a number of sequences. For example, here is a two lines from the output file from the previous example:

<0> 1 2 -1 <1> 1 3 -1 #SUP: 2
<0> 1 2 -1 <1> 1 2 -1 #SUP: 2

Consider the first line. It represents a sequential pattern having the itemset {1, 2} with a relative timestamp of 0, followed by the itemset {1, 3} one time unit later. This pattern has a support of 2 sequences. The second line follow the same format.

Implementation details

To propose an algorithm to discover closed sequential patterns with time constraints, we have combined ideas from the BIDE+ algorithm and the Hirate & Yamana algorithm. Both of these algorithms are based on the PrefixSpan algorithm.

Where can I get more information about this algorithm?

The Fournier-Viger (2008) algorithm is described in this paper:

Fournier-Viger, P., Nkambou, R & Mephu Nguifo, E. (2008), A Knowledge Discovery Framework for Learning Task Models from User Interactions in Intelligent Tutoring Systems. Proceedings of the 7th Mexican International Conference on Artificial Intelligence (MICAI 2008). LNAI 5317, Springer, pp. 765-778.

The idea of using time-constraints is based on this paper

Yu Hirate, Hayato Yamana (2006) Generalized Sequential Pattern Mining with Item Intervals. JCP 1(3): 51-60.

The idea of mining closed sequential pattern is based on this paper:

J. Wang, J. Han: BIDE: Efficient Mining of Frequent Closed Sequences. ICDE 2004: 79-90

Besides, you may read this survey of sequential pattern mining, which gives an overview of sequential pattern mining algorithms.