Mining Frequent Closed Multi-dimensional Sequential Patterns from a Sequence Database with SeqDIM/Songram, using Bide+ and AprioriClose/Charm (SPMF documentation)

This example explains how to run the SeqDIM (BIde+/Charm/AprioriClose) algorithm using the SPMF open-source data mining library.

How to run this example?

What is the Songram et al. algorithm?

The Songram et al. algorithm is an algorithm for discovering closed multi-dimensional sequential patterns in a multi-dimensional sequence database.

Why multi-dimensional sequential pattern is interesting? The reason is that regular sequential pattern mining algorithms do not consider the context of sequences. For example, the context of a sequence of transactions from a supermarket could be the profile of the customer such as his age, the city where he lives, etc.

In multi-dimensional sequential pattern mining, a sequence database is annotated with dimension to indicate the context of the sequence. The dimensions are symbolic values.

Multi-dimensional sequential pattern mining algorithm discovers sequential patterns with context information. This can be very useful for real applications such as market basket analysis. For example, one could find patterns specific to customers who are teenagers and lives in a particular city, or items boughts by adults living in another city.

However, there is a problem with classical multi-dimensional sequential pattern mining algorithm such as the one by Pinto et al. (2001). It is that there can be a lot of redundancy in the results. For this reason, Songram et al. proposed to discover closed multi-dimensional sequential patterns. This allows to find a subset of all patterns that eliminates a great deal of redundancy without any information loss. The algorithm by Songram et al. is more efficient than the one of Pinto (2001) in terms of execution time and memory usage.

What is the input?

The input is a multi-dimensional sequence database (as defined by Pinto et al. 2001) and a threshold named minsup (a value in [0,1] representing a percentage).

A multi-dimensional database is a set of multi-dimensional sequences and a set of dimensions d1, d2... dn. A multi-dimensional sequence (MD-Sequence) is composed of an MD-pattern and a sequence. A sequence is an ordered list of itemsets (groups of items). An MD-pattern is a set of symbolic values for the dimensions (here represented by integer numbers).

For example, consider the following database, provided in the file "ContextMDSequenceNoTime.txt" of the SPMF distribution. The database contains 4 MD-sequences.

ID MD-Patterns Sequences

d1 d2 d3
S1 1 1 1 (2 4), (3), (2), (1)
S2 1 2 2 (2 6), (3 5), (6 7)
S3 1 2 1 (1 8), (1), (2), (6)
S4 * 3 3 (2 5), (3 5)

For instance, the first MD-Sequence represents that items 2 and 4 appeared together, then were followed by 3, which was followed by item 2, wich was followed by item 1. The context of this sequence is the value 1 for dimension d1, the value 1 for dimension d2 and the value 1 for dimension d3. Note that the value "*" in the fourth MD-sequence means "any values".

What is the output?

The output is the set of all closed multi-dimensional sequential patterns that appears in at least minsup sequences of the database. The difference with the SeqDIM algorithm is that this algorithm discovers "closed" multi-dimensional patterns. A "closed" multi-dimensional pattern is a multi-dimensional pattern such that no other pattern is include in it and appears in exactly the same set of sequences (see the Songram et al. paper for a formal definition).

Let's look at an example. If we apply the Songram et al. algorithm on the previous database with a minsup of 50%, we obtain the following result:

Frequent Closed MD-Sequences
ID MD-Patterns Sequences Support

d1 d2 d3

P1 * * * (2) 100 %
P2 1 * 1 (1) 50 %
P3 1 2 * (2), (6) 50 %
P4 * * * (2), (3 5) 50 %
P5 * * * (2), (3) 75 %
P6 1 * * (2), (3) 50 %
P7 1 * * (2) 75 %

For instance, the second patterns found (P2) represents that items 2 is followed by item 6. The context of this pattern is the value 1 for dimension d1, 2 for dimension d2 and any value for dimension d3. Note that the value "*" means "any values". This pattern is said to have a support of 50% because it appears in 50 % of the MD-Sequences from the original database (it appears in S2 and S3).

Input file format

The input file format is defined as follows. It is a text file where each line represents a multi-dimensional sequence from a sequence database. Each line is separated into two parts: (1) a MD-pattern and (2) a sequence.

For example, the input file "ContextMDSequenceNoTime.txt" contains the following four lines (four sequences).

1 1 1 -3 2 4 -1 3 -1 2 -1 1 -1 -2
1 2 2 -3 2 6 -1 3 5 -1 6 7 -1 -2
1 2 1 -3 1 8 -1 1 -1 2 -1 6 -1 -2
* 3 3 -3 2 5 -1 3 5 -1 -2

This file contains four MD-sequences (four lines). Each line has 3 dimensions in each MD-Pattern. For example, consider the second line. It represents a MD-sequence where the value for the three dimensions are respectively 1, 2 and 2. Then, the sequence in this MD-Sequence is the itemset {2, 6} followed by the itemset {3, 5}, followed by the itemset {6, 7}.

Output file format

The output file format is defined as follows. It is a text file. Each line is a frequent MD sequential pattern. Each line is separated into three parts: (1) a MD-pattern, (2) a sequence and (3) the support of the MD sequential pattern formed by the first and second part.

For example, here is the output file from the previous example:

[ 1 2 * ]{t=0, 2 }{t=0, 6 } #SUP: 2
[ 1 * * ]{t=0, 2 }{t=0, 3 } #SUP: 2
[ 1 * * ]{t=0, 2 } #SUP: 3
[ 1 * 1 ]{t=0, 1 } #SUP: 2

Consider the first line. It presents a MD-sequential pattern having the dimension values 1, 2 and *. Furthemore, the line indicates that the sequence of this MD sequential pattern is the itemset {2} followed by the itemset {6} and that the MD-Sequential-pattern has a support of 2 transactions. The next lines follow the same format.

Implementation details

The Songram et al. algorithm is a meta-algorithm. It requires a closed sequential pattern mining algorithm for discovering sequential patterns and a closed itemset mining algorithm to deal with the dimensions. Our implementation uses the SeqDIM/Songram algorithm (Pinto et al. 2001, Songram et al. 2006) in combination with BIDE+ (Wang et al. 2007) and AprioriClose(Pasquier et al., 1999) or Charm (Zaki, 2002). To choose AprioriClose in the graphical user interface, select "SeqDim_(BIDE+AprioriClose)". To use Charm, select "SeqDim_(BIDE+Charm)"

Note that the implementation of Songram et al. algorithm in SPMF needs a little bit refactoring as it is currently integrated with the Fournier-Viger (2008) algorithm in the code. In a future version of SPMF, they will be separated. This is not really a problem for performance but it would make it easier to reuse the algorithms if both were separated.

Where can I get more information about this algorithm?

The algorithm is described in this paper:

P. Songram, V. Boonjing, S. Intakosum: Closed Multi-dimensional Sequential-Pattern Minin. Proc. of ITNG 2006.

The idea of multi-dimensional pattern mining is based on this paper:

H. Pinto, J. Han, J Pei, K. Wang, Q. Chen, U. Dayal: Multi-Dimensional Sequential Pattern Mining. CIKM 2001: 81-88

The idea of mining closed sequential pattern is based on this paper:

J. Wang, J. Han: BIDE: Efficient Mining of Frequent Closed Sequences. ICDE 2004: 79-90

Besides, you may read this survey of sequential pattern mining, which gives an overview of sequential pattern mining algorithms.