# Mining Closed Sequential Patterns with Time Constraints from a Time-Extended Sequence Database (SPMF documentation)

This example explains how to run the **Fournier-08 algorithm** using the SPMF open-source data mining library.

### How to run this example?

**If you are using the graphical interface,**(1) choose the**"Fournier08-Closed+time"**algorithm**,**(2) select the input file**"contextSequencesTimeExtended.txt"**, (3) set the output file name (e.g. "**output**.txt" ) (4) minsup= 55 %, min_time_interval = 0, max_time_interval = 2, min_whole_interval = 0, max_whole_interval = 2, (5) click " Run algorithm" .**If you want to execute this example from the command line**, then execute this command:

java -jar spmf.jar run Fournier08-Closed+time**contextSequencesTimeExtended**.txt output.txt 55% 0 2 0 2 in a folder containing spmf.jar and the example input file contextSequencesTimeExtended.txt.**To run this example with the source code version of SPMF,**launch the file**" MainTestSequentialPatternsMining2_saveToMemory.java"**in the package**ca.pfv.SPMF.tests**.

### What is this algorithm?

The Fournier-Viger
et al., 2008 algorithm is a sequential pattern mining algorithm
combining features from several other sequential pattern mining
algorithms. It also offers some original features. In this example, we
show how it can be used to discover **closed sequential patterns
with time-constraints. **

Closed sequential patterns is a compact representation of all sequential patterns. Mining closed sequential patterns is important because it can greatly reduce the number of patterns found without loss of information. Using time-constraint is important because it allows to filter unininteresting patterns according to time-related constraints.

Mining closed patterns or using time constraints is also important because it can greatly improve the speed and memory usage when these constraints are used.

### What is the input?

The input is a **time-extended sequence** **database** (as defined by Hirate-Yamana,
2006) and some **constraints**.

A **time-extended sequence database** is a set of **time-extended
sequences**. A** time-extended sequences** is a
list of itemsets (groups of items). Each **itemset** is
anotated with a **timestamp** that is an integer value.

For example, consider the following time-extended sequence
database provided in the file **contextSequencesTimeExtended.txt **of the SPMF distribution. The database contains 4
time-extended sequences. Each sequence contains itemsets that are
annotated with a timestamp. For example, consider the sequence S1. This
sequence indicates that itemset {1} appeared at time 0. It was followed
by the itemset {1, 2, 3} at time 1. This latter itemset was followed by
the itemset {1 2} at time 2.

ID |
Sequences |

S1 | (0, 1), (1, 1 2 3}), (2,
1 3) |

S2 | (0, 1 ) (1,
1 2 ), (2, 1 2 3), (3, 1 2 3 ) |

S3 | (0, 1 2), (1, 1 2 ) |

S4 | (0, 2), (1, 1 2 3 ) |

The algorithms discovers **closed time-extended sequential
patterns** that are common to several sequences. To do that, the
user needs to provide five constraints (see the paper by Hirate &
Yamana, 2006 for full details):

- minimum support (
**minsup**): the minimum number of sequences that should contain a sequential patterns (an integer >=1) - minimum time interval allowed between two succesive itemsets of
a sequential pattern (
**min_time_interval**) (an integer >=0) - maximum time interval allowed between two succesive itemsets of
a sequential pattern (
**max_time_interval**) (an integer >=0) - minimum time interval allowed between the first itemset and the
last itemset of a sequential pattern (
**min_whole_interval**) (an integer >=0) - maximum time interval allowed between the first itemset and the
last itemset of a sequential pattern (
**max_whole_interval**) (an integer >=0)

**Note : **It is recommended to set **max_whole_interval** to a very large value because if it is used with closed sequential
pattern mining, the algorithm become approximate and may not return all
closed itemsets respecting the time constraint (the reason is that this
constraint is not compatible with the "
backScan pruning"
of the BIDE+
algorithm).

### What is the output?

The output is a set of **closed time-extended sequential
patterns **meeting the constraints given by the user. For
example, if we run the algorithm with minsup= 55 %, min_time_interval =
0, max_time_interval = 2, min_whole_interval = 0, max_whole_interval =
100, we obtain the following results:

ID |
Sequential Patterns |
Support |

S1 | (0, 1 2 3) |
75% |

S2 | (0, 1 2) |
100 % |

S3 | (0, 2), (1, 1 2) |
75 % |

S4 | (0, 2), (1, 1 3) |
75 % |

S5 | (0, 2), (1, 1) |
100 % |

S6 | (0, 1), (1, 1 2) |
75 % |

S7 | (0, 1 2), (1, 1) |
75 % |

For instance, the last pattern S7 indicates that the items 1 and 2 were followed by item 1 one time unit after. This pattern has a support of 75 % because it appears in S1, S2 and S3. It is important to note that the timestamps in the sequential patterns found are relative. For example, the pattern S6 is considered to appear in S1, S2 and S3 because {1} appears one time unit after {1, 2} in all of these sequences, even though the timestamps do not need to be the same in all of these sequences.

If you compare the results of this example with the previous example, you can observe that the number of closed time-extended sequential patterns (6) is much smaller than the number of time-extended sequential patterns (15) found in the previous example.

### Input file format

The **input file format** is defined as follows. It
is a text file where each line represents a time-extended sequence from
a sequence database. Each line is a list of itemsets, where each
itemset has a timestamp represented by a positive integer and each item
is represented by a positive integer. Each itemset is first represented
by it timestamp between the "
<"
and "
> symbol. Then, the items of
the itemset appear separated by single spaces. Finally, the end of an
itemset is indicated by "
-1"
. After all the itemsets, the end of a
sequence (line) is indicated by the symbol "
-2"
. Note that it is
assumed that items are sorted according to a total order in each
itemset and that no item appears twice in the same itemset.

For example, the input file "
**contextSequencesTimeExtended.txt**"
contains the following four lines (four sequences).

<0> 1 -1 <1> 1 2 3 -1 <2> 1 3 -1 -2

<0> 1 -1 <1> 1 2 -1 <2> 1 2 3 -1 <3> 1 2 3 -1 -2

<0> 1 2 -1 <1> 1 2 -1 -2

<0> 2 -1 <1> 1 2 3 -1 -2

Consider the first line. It indicates that at time " 0" the itemset {1} appeared, followed by the itemset {1, 2, 3} at time 1, then followed by the itemset {1, 3} at time 2. Note that timestamps do not need to be consecutive integers. But they should increase for each succesive itemset within a sequence. The second, third and fourth line follow the same format.

### Output file format

The **output file format** is defined as follows. It
is a text file. Each line is a time-extended frequent closed sequential
pattern. Each line starts by listing the itemsets of the sequential
pattern, where each itemset has a relative timestamp represented by a
positive integer between the "
<"
and "
> symbol. Then the
timestamp is followed by each item in the itemset, each represented by
a positive integer. The items of the itemset appear separated by single
spaces and the symbol "
-1"
indicates the end of an itemset. Finally,
after all the itemsets of a sequential pattern, the keyword "
#SUP:"
is
followed by an integer indicating the support of the pattern as a
number of sequences. For example, here is a two lines from the output
file from the previous example:

<0> 1 2 -1 <1> 1 3 -1 #SUP: 2

<0> 1 2 -1 <1> 1 2 -1 #SUP: 2

Consider the first line. It represents a sequential pattern having the itemset {1, 2} with a relative timestamp of 0, followed by the itemset {1, 3} one time unit later. This pattern has a support of 2 sequences. The second line follow the same format.

### Implementation details

To propose an algorithm to discover closed sequential patterns with time constraints, we have combined ideas from the BIDE+ algorithm and the Hirate & Yamana algorithm. Both of these algorithms are based on the PrefixSpan algorithm.

### Where can I get more information about this algorithm?

The Fournier-Viger (2008) algorithm is described in this paper:

* Fournier-Viger, P., Nkambou, R & Mephu Nguifo, E. (2008), A Knowledge Discovery Framework for Learning Task Models from User
Interactions in Intelligent Tutoring Systems. Proceedings of the
7th Mexican International Conference on Artificial Intelligence (MICAI
2008). LNAI 5317, Springer, pp. 765-778.*

The idea of using time-constraints is based on this paper

*Yu Hirate, Hayato Yamana (2006) Generalized Sequential Pattern Mining with Item
Intervals. JCP 1(3): 51-60.*

The idea of mining closed sequential pattern is based on this paper:

*J. Wang, J. Han: BIDE: Efficient Mining of Frequent Closed Sequences.
ICDE 2004: 79-90*

Besides, you may read this survey of sequential pattern mining, which gives an overview of sequential pattern mining algorithms.