Charm (SPMF documentation)

This example explains how to run the SeqDIM (BIde+/Charm/AprioriClose) algorithm using the SPMF open-source data mining library.

How to run this example?

If you are using the graphical interface, (1) choose the "SeqDim_(BIDE+AprioriClose)"quot; algorithm, (2) select the input file "ContextMDSequenceNoTime.txt", (3) set the output file name (e.g. "output.txt") (4) set minsup = 50% and (5) click "Run algorithm".
If you want to execute this example from the command line, then execute this command:
java -jar spmf.jar run SeqDim_(BIDE+AprioriClose) ContextMDSequenceNoTime.txt output.txt 50% in a folder containing spmf.jar and the example input file ContextMDSequenceNoTime.txt.
If you are using the source code version of SPMF, launch the file "MainTestMultiDimSequentialPatternMiningClosed.java" in the package ca.pfv.SPMF.tests.

What is the Songram et al. algorithm?

The Songram et al. algorithm is an algorithm for discovering closed multi-dimensional sequential patterns in a multi-dimensional sequence database.

Why multi-dimensional sequential pattern is interesting? The reason is that regular sequential pattern mining algorithms do not consider the context of sequences. For example, the context of a sequence of transactions from a supermarket could be the profile of the customer such as his age, the city where he lives, etc.

In multi-dimensional sequential pattern mining, a sequence database is annotated with dimension to indicate the context of the sequence. The dimensions are symbolic values.

Multi-dimensional sequential pattern mining algorithm discovers sequential patterns with context information. This can be very useful for real applications such as market basket analysis. For example, one could find patterns specific to customers who are teenagers and lives in a particular city, or items boughts by adults living in another city.

However, there is a problem with classical multi-dimensional sequential pattern mining algorithm such as the one by Pinto et al. (2001). It is that there can be a lot of redundancy in the results. For this reason, Songram et al. proposed to discover closed multi-dimensional sequential patterns. This allows to find a subset of all patterns that eliminates a great deal of redundancy without any information loss. The algorithm by Songram et al. is more efficient than the one of Pinto (2001) in terms of execution time and memory usage.

What is the input?

The input is a multi-dimensional sequence database (as defined by Pinto et al. 2001) and a threshold named minsup (a value in [0,1] representing a percentage).

A multi-dimensional database is a set of multi-dimensional sequences and a set of dimensions d1, d2... dn. A multi-dimensional sequence (MD-Sequence) is composed of an MD-pattern and a sequence. A sequence is an ordered list of itemsets (groups of items). An MD-pattern is a set of symbolic values for the dimensions (here represented by integer numbers).

For example, consider the following database, provided in the file "ContextMDSequenceNoTime.txt" of the SPMF distribution. The database contains 4 MD-sequences.

MD-Sequences
ID	MD-Patterns			Sequences
	d1	d2	d3
S1	1	1	1	(2 4), (3), (2), (1)
S2	1	2	2	(2 6), (3 5), (6 7)
S3	1	2	1	(1 8), (1), (2), (6)
S4	*	3	3	(2 5), (3 5)

For instance, the first MD-Sequence represents that items 2 and 4 appeared together, then were followed by 3, which was followed by item 2, wich was followed by item 1. The context of this sequence is the value 1 for dimension d1, the value 1 for dimension d2 and the value 1 for dimension d3. Note that the value "*" in the fourth MD-sequence means "any values".

What is the output?

The output is the set of all closed multi-dimensional sequential patterns that appears in at least minsup sequences of the database. The difference with the SeqDIM algorithm is that this algorithm discovers "closed" multi-dimensional patterns. A "closed" multi-dimensional pattern is a multi-dimensional pattern such that no other pattern is include in it and appears in exactly the same set of sequences (see the Songram et al. paper for a formal definition).

Let's look at an example. If we apply the Songram et al. algorithm on the previous database with a minsup of 50%, we obtain the following result:

Frequent Closed MD-Sequences
ID	MD-Patterns			Sequences	Support
	d1	d2	d3
P1	*	*	*	(2)	100 %
P2	1	*	1	(1)	50 %
P3	1	2	*	(2), (6)	50 %
P4	*	*	*	(2), (3 5)	50 %
P5	*	*	*	(2), (3)	75 %
P6	1	*	*	(2), (3)	50 %
P7	1	*	*	(2)	75 %

For instance, the second patterns found (P2) represents that items 2 is followed by item 6. The context of this pattern is the value 1 for dimension d1, 2 for dimension d2 and any value for dimension d3. Note that the value "*" means "any values". This pattern is said to have a support of 50% because it appears in 50 % of the MD-Sequences from the original database (it appears in S2 and S3).

Input file format

The input file format is defined as follows. It is a text file where each line represents a multi-dimensional sequence from a sequence database. Each line is separated into two parts: (1) a MD-pattern and (2) a sequence.

The first part is a list of dimension values separated by single spaces. A dimension value is a positive integer or the symbol "*" meaning "any values". Finally, the value "-3" indicates the end of the first part. Note that each line should have the same number of dimension values.
The second part of each line is a sequence. Each item in a sequence is represented by a postive integers and items from the same itemset within a sequence are separated by single space. Note that it is assumed that items within a same itemset are sorted according to a total order and that no item can appear twice in the same itemset. The value "-1" indicates the end of an itemset. The value "-2" indicates the end of a sequence (it appears at the end of each line).

For example, the input file "ContextMDSequenceNoTime.txt" contains the following four lines (four sequences).

1 1 1 -3 2 4 -1 3 -1 2 -1 1 -1 -2
1 2 2 -3 2 6 -1 3 5 -1 6 7 -1 -2
1 2 1 -3 1 8 -1 1 -1 2 -1 6 -1 -2
* 3 3 -3 2 5 -1 3 5 -1 -2

This file contains four MD-sequences (four lines). Each line has 3 dimensions in each MD-Pattern. For example, consider the second line. It represents a MD-sequence where the value for the three dimensions are respectively 1, 2 and 2. Then, the sequence in this MD-Sequence is the itemset {2, 6} followed by the itemset {3, 5}, followed by the itemset {6, 7}.

Output file format

The output file format is defined as follows. It is a text file. Each line is a frequent MD sequential pattern. Each line is separated into three parts: (1) a MD-pattern, (2) a sequence and (3) the support of the MD sequential pattern formed by the first and second part.

The first part is a list of dimension values separated by single spaces between the symbols "[" and "]". A dimension value is a positive integer or the symbol "*" meaning "any values". Note that each line have the same number of dimension values.
The second part of each line is a sequence. A sequence is a list of itemset.Each itemset is represented by {t=0, ...} where "..." is a list of items separated by single spaces. Each item in a is represented by a postive integers. Note that it is assumed that items within a same itemset are sorted according to a total order and that no item can appear twice in the same itemset.
The third part is the keyword "#SUP:" followed by an integer indicating the support of the MD

For example, here is the output file from the previous example:

[ 1 2 * ]{t=0, 2 }{t=0, 6 } #SUP: 2
[ 1 * * ]{t=0, 2 }{t=0, 3 } #SUP: 2
[ 1 * * ]{t=0, 2 } #SUP: 3
[ 1 * 1 ]{t=0, 1 } #SUP: 2

Consider the first line. It presents a MD-sequential pattern having the dimension values 1, 2 and *. Furthemore, the line indicates that the sequence of this MD sequential pattern is the itemset {2} followed by the itemset {6} and that the MD-Sequential-pattern has a support of 2 transactions. The next lines follow the same format.

Implementation details

The Songram et al. algorithm is a meta-algorithm. It requires a closed sequential pattern mining algorithm for discovering sequential patterns and a closed itemset mining algorithm to deal with the dimensions. Our implementation uses the SeqDIM/Songram algorithm (Pinto et al. 2001, Songram et al. 2006) in combination with BIDE+ (Wang et al. 2007) and AprioriClose(Pasquier et al., 1999) or Charm (Zaki, 2002). To choose AprioriClose in the graphical user interface, select "SeqDim_(BIDE+AprioriClose)". To use Charm, select "SeqDim_(BIDE+Charm)"

Note that the implementation of Songram et al. algorithm in SPMF needs a little bit refactoring as it is currently integrated with the Fournier-Viger (2008) algorithm in the code. In a future version of SPMF, they will be separated. This is not really a problem for performance but it would make it easier to reuse the algorithms if both were separated.

Where can I get more information about this algorithm?

The algorithm is described in this paper:

P. Songram, V. Boonjing, S. Intakosum: Closed Multi-dimensional Sequential-Pattern Minin. Proc. of ITNG 2006.

The idea of multi-dimensional pattern mining is based on this paper:

H. Pinto, J. Han, J Pei, K. Wang, Q. Chen, U. Dayal: Multi-Dimensional Sequential Pattern Mining. CIKM 2001: 81-88

The idea of mining closed sequential pattern is based on this paper:

J. Wang, J. Han: BIDE: Efficient Mining of Frequent Closed Sequences. ICDE 2004: 79-90

Besides, you may read this survey of sequential pattern mining, which gives an overview of sequential pattern mining algorithms.