Mining Frequent Multi-dimensional Sequential Patterns from a Multi-dimensional Sequence Database with SeqDIM, using PrefixSpan and Apriori (SPMF documentation)

This example explains how to run the SeqDim algorithm using the SPMF open-source data mining library.

How to run this example?

If you are using the graphical interface, (1) choose the "SeqDim_(PrefixSpan+Apriori)" algorithm, (2) select the input file "ContextMDSequenceNoTime.txt", (3) set the output file name (e.g. "output.txt") (4) set minsup = 75% and (5) click "Run algorithm".
If you want to execute this example from the command line, then execute this command:
java -jar spmf.jar run SeqDim_(PrefixSpan+Apriori) ContextMDSequenceNoTime.txt output.txt 75% in a folder containing spmf.jar and the example input file ContextMDSequenceNoTime.txt.
If you are using the source code version of SPMF, launch the file "MainTestMultiDimSequentialPatternMining.java" in the package ca.pfv.SPMF.tests.

What is SeqDIM?

SeqDIM is an algorithm for discovering multi-dimensional sequential patterns in a multi-dimensional sequence database.

Why multi-dimensional sequential pattern is interesting? The reason is that regular sequential pattern mining algorithms do not consider the context of sequences. For example, the context of a sequence of transactions at a supermarket could be the profile of the customer such as his age, the city where he lives, etc.

In multi-dimensional sequential pattern mining, a sequence database is annotated with dimension to indicate the context of the sequence. The dimensions are symbolic values.

Multi-dimensional sequential pattern mining algorithm discovers sequential patterns with context information. This can be very useful for real applications such as market basket analysis. For example, one could find patterns specific to customers who are teenagers and lives in a particular city, or items boughts by adults living in another city.

What is the input?

The input is a multi-dimensional sequence database (as defined by Pinto et al. 2001) and a threshold named minsup (a value in [0,1] representing a percentage).

A multi-dimensional database is a set of multi-dimensional sequences and a set of dimensions d1, d2... dn. A multi-dimensional sequence (MD-Sequence) is composed of an MD-pattern and a sequence. A sequence is an ordered list of itemsets (groups of items). Note that it is assumed that no items appear twice in the same itemset and that items in an itemset are lexically ordered. An MD-pattern is a set of symbolic values for the dimensions (here represented by integer numbers).

For example, consider the following database, provided in the file "ContextMDSequenceNoTime.txt" of the SPMF distribution. The database contains 4 MD-sequences.

MD-Sequences
ID	MD-Patterns			Sequences
	d1	d2	d3
S1	1	1	1	(2 4), (3), (2), (1)
S2	1	2	2	(2 6), (3 5), (6 7)
S3	1	2	1	(1 8), (1), (2), (6)
S4	*	3	3	(2 5), (3 5)

For instance, the first MD-Sequence represents that items 2 and 4 appeared together, then were followed by 3, which was followed by item 2, wich was followed by item 1. The context of this sequence is the value 1 for dimension d1, the value 1 for dimension d2 and the value 1 for dimension d3. Note that the value "*" in the fourth MD-sequence means "any values".

What is the output?

The output is the set of all multi-dimensional sequential patterns that appears in at least minsup sequences of the database. Here, we will not provide a formal definition but rather show an example. For a formal definition of what is a multi-dimensional sequential pattern you can refer to the paper by Pinto et al. (2001) which explains it very well.

Let's look at the example. If we apply SeqDIM on the previous database with a minsup of 50%, we obtain the following result:

Frequent MD-Sequences
ID	MD-Patterns			Sequences	Support
	d1	d2	d3
P1	*	*	*	(3)	75 %
P2	*	*	*	(2)	100 %
P3	1	*	*	(2)	75 %
P4	*	*	*	(2), (3)	75 %

For instance, the third pattern (P3) represent the sequence containing the item 2 only with the value 1 for dimension d1. Note that the value"*" for dimension d2 and d3 means "any values". This pattern P3 has a support of 75% because it appears in 75 % of the sequences of the original database (it appears in S1, S2 and S3).

Input file format

The input file format is defined as follows. It is a text file where each line represents a multi-dimensional sequence from a sequence database. Each line is separated into two parts: (1) a MD-pattern and (2) a sequence.

The first part is a list of dimension values separated by single spaces. A dimension value is a positive integer or the symbol "*" meaning "any values". Finally, the value "-3" indicates the end of the first part. Note that each line should have the same number of dimension values.
The second part of each line is a sequence. Each item in a sequence is represented by a postive integers and items from the same itemset within a sequence are separated by single space. Note that it is assumed that items within a same itemset are sorted according to a total order and that no item can appear twice in the same itemset. The value "-1" indicates the end of an itemset. The value "-2" indicates the end of a sequence (it appears at the end of each line).

For example, the input file "ContextMDSequenceNoTime.txt" contains the following four lines (four sequences).

1 1 1 -3 2 4 -1 3 -1 2 -1 1 -1 -2
1 2 2 -3 2 6 -1 3 5 -1 6 7 -1 -2
1 2 1 -3 1 8 -1 1 -1 2 -1 6 -1 -2
* 3 3 -3 2 5 -1 3 5 -1 -2

This file contains four MD-sequences (four lines). Each line has 3 dimensions in each MD-Pattern. For example, consider the second line. It represents a MD-sequence where the value for the three dimensions are respectively 1, 2 and 2. Then, the sequence in this MD-Sequence is the itemset {2, 6} followed by the itemset {3, 5}, followed by the itemset {6, 7}.

Output file format

The output file format is defined as follows. It is a text file. Each line is a frequent MD sequential pattern. Each line is separated into three parts: (1) a MD-pattern, (2) a sequence and (3) the support of the MD sequential pattern formed by the first and second part.

The first part is a list of dimension values separated by single spaces between the symbols "[" and "]". A dimension value is a positive integer or the symbol "*" meaning "any values". Note that each line have the same number of dimension values.
The second part of each line is a sequence. A sequence is a list of itemset. Each itemset is represented by {t=0, ...} where "..." is a list of items separated by single spaces.
Each item in an itemset is represented by a postive integers. Note that it is assumed that items within a same itemset are sorted according to a total order and that no item can appear twice in the same itemset.
The third part is the keyword "#SUP:" followed by an integer indicating the support of the pattern as a number of sequences.

For example, a few lines from the output file from the previous example are shown below:

[ 1 * * ]{t=0, 2 }{t=0, 3 } #SUP: 2
[ * * * ]{t=0, 2 }{t=0, 3 } #SUP: 3
[ * * * ]{t=0, 2 }{t=0, 3 5 } #SUP: 2

Consider the first line. It presents a MD-sequential pattern having the dimension values 1, * and *. Furthemore, the line indicates that this pattern is for the sequence containing the itemset {2} followed by the itemset {3} and that the MD-Sequential-pattern has a support of 2 transactions. The next lines follow the same format.

Implementation details

The SeqDIM algorithm is a meta-algorithm. It requires a sequential pattern mining algorithm for discovering sequential patterns and an itemset mining algorithm to deal with the dimensions. In our implementations, we have used the PrefixSpan algorithm (Pei et al., 2004) for sequential pattern mining and the Apriori algorithm (Agrawal & Srikant, 1994) for dealing with the dimensions. PrefixSpan is a very fast algorithm. However, Apriori is not the most efficient algorithm. It could be replaced by FPGrowth in future version for more efficiency.

Note also that SeqDIM can generate a lot of patterns. A solution to this problem is to use the algorithm by Songram et al., also offered in SPMF. This latter algorithm eliminates a lot of redundancy in patterns without any loss of information.

Also note that the implementation of SeqDIM in SPMF needs a little bit refactoring as it is currently integrated with the Fournier-Viger (2008) algorithm in the code. In a future version of SPMF, the two algorithms will be separated.

Where can I get more information about this algorithm?

The algorithm is described in this paper:

H. Pinto, J. Han, J Pei, K. Wang, Q. Chen, U. Dayal: Multi-Dimensional Sequential Pattern Mining. CIKM 2001: 81-88