SPMFA Sequential Pattern Mining Framework

Introduction

Algorithms

Download

Documentation

License

 

 

03589 visitors since 2010-02

Documentation

This section provides examples of how to use the SPMF framework. You can execute these examples by using the tests files located in the ca.pfv.spmf.tests package of the SPMF distribution. When I will have some extra time I will create more examples and spend more time to clean the code, add more comments, optimize the code and improve this documentation. If you have any question or if you want to report a bug, you can contact me. I will be pleased to try to help you, as much as I can, and fix the bugs. You can also have a look at the various articles that I have referenced on this page to learn more about each algorithm.

List of examples

Frequent Itemset Mining and Rare Itemset Mining

Association Rule Mining

Clustering

Sequential Pattern Mining

Multi-dimensional Sequential Pattern Mining

Sequential Pattern Mining with the Fournier-Viger et al. (2008) algorithm

Example 1: Mining Frequent Itemsets by Using the Apriori Algorithm

This example is provided as the files "MainTestApriori1.java" and "contextPasquier99.txt" in the SPMF distribution.

Let's consider the following dataset from the article Pasquier et al., 1999. This binary context contains 5 transactions (t1, t2, ..., t5) and 5 items (1,2,...5).

  1 2 3 4 5
t1 x   x x  
t2   x x   x
t3 x x x   x
t4   x     x
t5 x x x   x

The output of the Apriori algorithm for a binary context and a minimum support thresold minsup is the set of all frequent itemsets and their support.

By running Apriori, with the previous binary context and a minsup of 40% (2 transactions), we obtain the following result:

itemsets support
{} 5
{1} 3
{2} 4
{3} 4
{5} 4
{1, 2} 2
{1, 3} 3
{1, 5} 2
{2, 3} 3
{2, 5} 4
{3, 5} 3
{1, 2, 3} 2
{1, 2, 5} 3
{1, 3, 5} 2
{2, 3, 5} 3
{1, 2, 3, 5} 2

Example 2: Mining Frequent Itemsets by Using the AprioriTid Algorithm

This example is provided as the files "MainTestAprioriTID.java" and "contextPasquier99.txt" in the SPMF distribution.

Let's consider the following dataset from the article Pasquier et al., 1999. This binary context contains 5 transactions (t1, t2, ..., t5) and 5 items (1,2,...5).

  1 2 3 4 5
t1 x   x x  
t2   x x   x
t3 x x x   x
t4   x     x
t5 x x x   x

The output of the AprioriTID algorithm (Agrawal & Srikant, 94) for a binary context and a minimum support thresold minsup is the set of all frequent itemsets and their support.

By running AprioriTID, with the previous binary context and a minsup of 40% (2 transactions), we obtain the following result:

itemsets support
{} 5
{1} 3
{2} 4
{3} 4
{5} 4
{1, 2} 2
{1, 3} 3
{1, 5} 2
{2, 3} 3
{2, 5} 4
{3, 5} 3
{1, 2, 3} 2
{1, 2, 5} 3
{1, 3, 5} 2
{2, 3, 5} 3
{1, 2, 3, 5} 2

Example 3: Mining Frequent Itemsets by Using the FP-Growth Algorithm

This example is provided as the files "MainTestFPGrowth.java" and "contextPasquier99.txt" in the SPMF distribution.

Let's consider the following dataset from the article Pasquier et al., 1999. This binary context contains 5 transactions (t1, t2, ..., t5) and 5 items (1,2,...5).

  1 2 3 4 5
t1 x   x x  
t2   x x   x
t3 x x x   x
t4   x     x
t5 x x x   x

The output of the FP-Growth algorithm for a binary context and a minimum support thresold minsup is the set of all frequent itemsets and their support.

By running FP-Growth, with the previous binary context and a minsup of 40% (2 transactions), we obtain the following result:

frequent itemsets support
{} 5
{1} 3
{2} 4
{3} 4
{5} 4
{1, 2} 2
{1, 3} 3
{1, 5} 2
{2, 3} 3
{2, 5} 4
{3, 5} 3
{1, 2, 3} 2
{1, 2, 5} 3
{1, 3, 5} 2
{2, 3, 5} 3
{1, 2, 3, 5} 2

Example 4: Mining Frequent Itemsets by Using the Relim Algorithm

This example is provided as the files "MainTestRelim.java" and "contextPasquier99.txt" in the SPMF distribution.

Let's consider the following dataset from the article Pasquier et al., 1999. This binary context contains 5 transactions (t1, t2, ..., t5) and 5 items (1,2,...5).

  1 2 3 4 5
t1 x   x x  
t2   x x   x
t3 x x x   x
t4   x     x
t5 x x x   x

The output of the Relim algorithm for a binary context and a minimum support thresold minsup is the set of all frequent itemsets and their support.

By running Relim, with the previous binary context and a minsup of 40% (2 transactions), we obtain the following result:

frequent itemsets support
{} 5
{1} 3
{2} 4
{3} 4
{5} 4
{1, 2} 2
{1, 3} 3
{1, 5} 2
{2, 3} 3
{2, 5} 4
{3, 5} 3
{1, 2, 3} 2
{1, 2, 5} 3
{1, 3, 5} 2
{2, 3, 5} 3
{1, 2, 3, 5} 2

Example 5: Mining Frequent Itemsets by Using the Eclat Algorithm

This example is provided as the files "MainTestEclat.java" and "contextPasquier99.txt" in the SPMF distribution.

Let's consider the following dataset from the article Pasquier et al., 1999. This binary context contains 5 transactions (t1, t2, ..., t5) and 5 items (1,2,...5).

  1 2 3 4 5
t1 x   x x  
t2   x x   x
t3 x x x   x
t4   x     x
t5 x x x   x

The output of the Eclat algorithm for a binary context and a minimum support thresold minsup is the set of all frequent itemsets and their support.

By running Eclat, with the previous binary context and a minsup of 40% (2 transactions), we obtain the following result:

frequent itemsets support
{} 5
{1} 3
{2} 4
{3} 4
{5} 4
{1, 2} 2
{1, 3} 3
{1, 5} 2
{2, 3} 3
{2, 5} 4
{3, 5} 3
{1, 2, 3} 2
{1, 2, 5} 3
{1, 3, 5} 2
{2, 3, 5} 3
{1, 2, 3, 5} 2

Example 6: Mining Frequent Closed Itemsets Using the AprioriClose Algorithm

This example is provided as the files "MainTestAprioriClose1.java" and "contextPasquier99.txt" in the SPMF distribution.

Let's consider the following dataset from the article Pasquier et al., 1999. This binary context contains 5 transactions (t1, t2, ..., t5) and 5 items (1,2,...5).

  1 2 3 4 5
t1 x   x x  
t2   x x   x
t3 x x x   x
t4   x     x
t5 x x x   x

The output of the AprioriClose algorithm for a binary context and a minimum support thresold minsup is the set of all frequent itemsets and their support, and the set of frequent closed itemsets.

By running AprioriClose, with the previous binary context and a minsup of 40% (2 transactions), we obtain the following result:

frequent itemsets support is closed?
{} 5 yes
{1} 3 no
{2} 4 no
{3} 4 yes
{5} 4 no
{1, 2} 2 no
{1, 3} 3 yes
{1, 5} 2 no
{2, 3} 3 no
{2, 5} 4 yes
{3, 5} 3 no
{1, 2, 3} 2 no
{1, 2, 5} 3 no
{1, 3, 5} 2 no
{2, 3, 5} 3 yes
{1, 2, 3, 5} 2 yes

Example 7: Mining Frequent Closed Itemsets Using the Charm Algorithm

This example is provided as the files "MainTestCharm.java" and "contextPasquier99.txt" in the SPMF distribution.

Let's consider the following dataset from the article Pasquier et al., 1999. This binary context contains 5 transactions (t1, t2, ..., t5) and 5 items (1,2,...5).

  1 2 3 4 5
t1 x   x x  
t2   x x   x
t3 x x x   x
t4   x     x
t5 x x x   x

The output of the Charm algorithm for a binary context and a minimum support thresold minsup is the set of all frequent closed itemsets and their support..

By running Charm, with the previous binary context and a minsup of 40% (2 transactions), we obtain the following result:

frequent closed itemsets support
{} 5
{3} 4
{1, 3} 3
{2, 5} 4
{2, 3, 5} 3
{1, 2, 3, 5} 2

Example 8: Mining Frequent Maximal Itemsets Using the Charm-MFI Algorithm

This example is provided as the files "MainTestCharmMFI.java" and "contextPasquier99.txt" in the SPMF distribution.

Let's consider the following dataset from the article Pasquier et al., 1999. This binary context contains 5 transactions (t1, t2, ..., t5) and 5 items (1,2,...5).

  1 2 3 4 5
t1 x   x x  
t2   x x   x
t3 x x x   x
t4   x     x
t5 x x x   x

The Charm-MFI algorithm consist of applying the Charm algorithm for finding frequent closed itemsets and then to check which of them are maximal itemsets. The output is the set of frequent closed itemsets where a boolean value indicate for each of them if they are maximal or not.

By running Charm-MFI, with the previous binary context and a minsup of 40% (2 transactions), we obtain the following result:

frequent closed itemsets support is maximal?
{} 5 yes
{3} 4 no
{1, 3} 3 yes
{2, 5} 4 no
{2, 3, 5} 3 yes
{1, 2, 3, 5} 2 yes

Example 9: Mining Frequent Closed Itemsets and Minimal Generators Using the Zart Algorithm

This example is provided as the files "MainTestZart.java" and "contextZart.txt" in the SPMF distribution.

Let's consider the following dataset from the article Szathmary et al. 2007. This binary context contains 5 transactions (t1, t2, ..., t5) and 5 items (1,2,...5). This binary context is provided in the file "contextZart.txt" of the SPMF distribution.

  1 2 3 4 5
t1 x x   x x
t2 x   x    
t3 x x x   x
t4   x x   x
t5 x x x   x

The output of the Zart algorithm for a binary context and a minimum support thresold minsup is the set of all frequent itemsets and their support, and the associated minimal generator(s) for each closed frequent itemset.

By running Zart with the previous binary context and a minsup of 40% (2 transactions), we obtain the following result:

itemsets support is closed? minimal generators
{} 5 yes {}
{1} 4 yes {1}
{2} 4 no
{3} 4 yes {3}
{5} 4 no
{1, 2} 3 no
{1, 3} 3 yes {1,3}
{1, 5} 3 no
{2, 3} 3 no
{2, 5} 4 yes {2}, {5}
{3, 5} 3 no
{1, 2, 3} 2 no
{1, 2, 5} 3 yes {1, 2}, {1, 5}
{1, 3, 5} 2 no
{2, 3, 5} 3 yes {2, 3}, {3, 5}
{1, 2, 3, 5} 2 yes {1, 2, 3}, {1, 3, 5}

Example 10: Mining Pseudo-Closed Itemsets

This example is provided as the files "MainTestPseudoClosedItemsets.java" and "contextPasquier99.txt" in the SPMF distribution.

Let's consider the following dataset from the article Pasquier et al., 1999. This binary context contains 5 transactions (t1, t2, ..., t5) and 5 items (1,2,...5).

  1 2 3 4 5
t1 x   x x  
t2   x x   x
t3 x x x   x
t4   x     x
t5 x x x   x

We can mine all pseudo-closed itemsets from this context with minsup = 20%. Finding pseudo-closed itemsets, is an important step for generating the Guigues-Duquenne basis of association rules (see next example).

We obtain the following set of pseudo-closed itemsets:

Pseudo-closed itemsets Support Closure
{2} 80 % 2 5
{1} 60 % 1 3
{4} 20 % 1 3 4
{5} 80 % 2 5

Example 11: Mining Minimal Rare Itemsets

This example is provided as the files "MainTestAprioriRare.java" and "contextZart.txt" in the SPMF distribution.

Let's consider the following dataset from the article Szathmary et al. 2007b. This binary context contains 5 transactions (t1, t2, ..., t5) and 5 items (1,2,...5). This binary context is provided in the file "contextZart.txt" of the SPMF distribution.

  1 2 3 4 5
t1 x x   x x
t2 x   x    
t3 x x x   x
t4   x x   x
t5 x x x   x

By mining running the AprioriRare algorithm with minsup = 60 % and this binary context, we obtain the following set of minimal rare itemsets (see Szathmary et al. 2007b for further details):

Minimal Rare Itemsets Support
{4} 20 %
{1, 3, 5} 40 %
{1, 2, 3} 40 %

and the set of frequent itemsets :

Frequent Itemsets Support
{} 100 %
{1} 80 %
{2} 80 %
{3} 80 %
{5} 80 %
{1, 2} 60 %
{1, 3} 60 %
{1, 5} 60 %
{2, 3} 60 %
{2, 5} 80 %
{3, 5} 60 %
{1, 2, 3} 40 %
{1, 2, 5} 60 %
{1, 3, 5} 40 %
{2, 3, 5} 60 %
{1, 2, 3, 5} 40 %

Example 12: Mining Perfectly Rare Itemsets Using the AprioriInverse Algorithm

This example is provided as the files "MainTestAprioriInverse.java" and "contextInverse.txt" in the SPMF distribution.

Let's consider the following binary context, consisting of 5 transactions (t1,t2...t5) and 5 items (1,2,3,4,5). This context is provided in the file "contextInverse.txt" of the SPMF distribution.:

  1 2 3 4 5
t1 x x   x x
t2 x   x    
t3 x x x   x
t4   x x    
t5 x x   x x

By mining running the AprioriInverse algorithm with minsup = 0.001 % and maxsup of 0.60 % and this binary context, we obtain the following set of perfectly rare itemsets (see Koh & Roundtree 2005 for further details):

Perfectly Rare Itemsets Support
{3} 60 %
{4} 40 %
{5} 60 %
{4, 5} 40 %
{3, 5} 20 %

Example 13: Mining Closed Itemsets from a Data Stream Using the CloStream Algorithm

This example is provided as the files "MainTestCloStream.java"in the SPMF distribution.

CloStream (Yen et al., 2009) is an algorithm for incremental mining of closed itemsets from a data stream. The following example consists of applying CloStream on the following transactions, processed one by one as a stream:

  1 2 3 4 5
t1 1   2 x  
t2   x x   x
t3 x x x   x
t4   x     x
t5 x x x   x

By running CloStream, we obtain the following result:

closed itemsets support
{} 5
{3} 4
{1, 3} 3
{1, 3, 4} 1
{2, 5} 4
{2, 3, 5} 3
{1, 2, 3, 5} 2

Example 14: Mining All Association Rules

This example is provided as the files "MainTestAllAssociationRules.java" and "contextIGB.txt" in the SPMF distribution.

Let's consider the following binary context, consisting of 6 transactions (t1,t2...t6) and 5 items (1,2,3,4,5). This context is provided in the file "contextIGB.txt" of the SPMF distribution.:

  1 2 3 4 5
t1 x x   x x
t2   x x   x
t3 x x   x x
t4 x x x   x
t5 x x x x x
t6   x x x  

Given a minimum support threshold (minsup) and a minimum confidence threshold (minconf) and a binary context, we can mine the whole set of association rules by (1) using the Apriori algorithm to find frequent itemsets and then (2) using the frequent itemsets to generate the set of all association rules as described by Agrawal & Srikant, 1994.

By applying this combination of algorithms with minsup = 50 %, minconf= 60%, we obtains 55 associations rules (run the example in the SPMF distribution to see the result)

Important note: In this example, the Apriori algorithm is used as the basis for finding association rules. If you want to use the FP-GROWTH algorithm instead of the APRIORI algorithm, please refer to the test file "MainTestAllAssociationRules_FPGrowth_version.java" in the SPMF distribution. It performs the same process. But it uses FP-GROWTH instead of Apriori..

Example 15: Mining Guigues-Duquenne Basis of Exact Rules

This example is provided as the files "MainTestGDBasis.java" and "contextPasquier99.txt" in the SPMF distribution.

Let's consider the following dataset from the article Pasquier et al., 1999. This binary context contains 5 transactions (t1, t2, ..., t5) and 5 items (1,2,...5).

  1 2 3 4 5
t1 x   x x  
t2   x x   x
t3 x x x   x
t4   x     x
t5 x x x   x

Given a minimum support threshold (minsup) and a minimum confidence threshold (minconf) and a binary context, we can compute the Guigues-Duquenne basis of association rules. With the previously described context minsup = 20 % and minconf = 50%, we discover the following set of association rules:

Rule Support Confidence
2 ==> 5 80 % 100 %
1 ==>3 60 % 75 %
5 ==> 2 80 % 100 %

Example 16: Mining Proper Basis for Approximative Rules

This example is provided as the files "MainTestProperBasis.java" and "contextPasquier99.txt" in the SPMF distribution.

Let's consider the following dataset from the article Pasquier et al., 1999. This binary context contains 5 transactions (t1, t2, ..., t5) and 5 items (1,2,...5).

  1 2 3 4 5
t1 x   x x  
t2   x x   x
t3 x x x   x
t4   x     x
t5 x x x   x

Given a minimum support threshold (minsup) and a minimum confidence threshold (minconf) and a binary context, we can compute the Proper basis for approximate association rules. With the previously described context minsup = 40 % and minconf = 50%, we discover the following set of association rules:

Rule Support Confidence
3 ==>1 60 % 75 %
2, 5 ==> 3 60 % 75 %
3 ==> 2, 5 60 % 75 %
2, 3, 5 ==> 1 40 % 66 %
2, 5 ==> 1, 3 40 % 50 %
1, 3 ==> 2, 5 40 % 66 %
3 ==> 1, 2, 5 40 % 50 %

Example 17: Mining Structural Basis for Approximative Rules

This example is provided as the files "MainTestStructuralBasis.java" and "contextPasquier99.txt" in the SPMF distribution.

Let's consider the following dataset from the article Pasquier et al., 1999. This binary context contains 5 transactions (t1, t2, ..., t5) and 5 items (1,2,...5). This binary context is provided in the file "contextPasquier99.txt" of the SPMF distribution.

  1 2 3 4 5
t1 x   x x  
t2   x x   x
t3 x x x   x
t4   x     x
t5 x x x   x

Given a minimum support threshold (minsup) and a minimum confidence threshold (minconf) and a binary context, we can compute the Structural basis for approximate association rules. With the previously described context minsup = 40 % and minconf = 50%, we discover the following set of association rules:

Rule Support Confidence
3 ==>1 60 % 75 %
2, 5 ==> 3 60 % 75 %
1, 3 ==> 2, 5 40 % 66 %

Example 18: Mining the IGB basis of Association Rules

This example is provided as the files "MainTestIGB.java" and "contextIGB.txt" in the SPMF distribution.

Let's consider the following binary context, consisting of 8 transactions (t1,t2...t8) and 5 items (1,2,3,4,5), taken from the article of Gasmi et al., 2005. This context is provided in the file "contextIGB.txt" of the SPMF distribution.:

  1 2 3 4 5
t1 x x   x x
t2   x x   x
t3 x x   x x
t4 x x x   x
t5 x x x x x
t6   x x x  

Given a minimum support threshold (minsup) and a minimum confidence threshold (minconf) and a binary context, the IGB algorithm can find the IGB basis of association rules. It is a compact set of association rules that is both informative and generic.

By applying the IGB algorithm on the context previously described with minsup = 0.50 % and minconf= 0.61%, we discover the following set of association rules:

Rule Support Confidence
1 ==> 2, 4, 5 0.50 0.75
4 ==> 1, 2, 5 0.50 0.75
3 ==> 2, 5 0.50 0.75
{} ==> 2, 3 0.66 0.66
{} ==> 1, 2, 5 0.66 0.66
{} ==> 2, 4 0.66 0.66
{} ==> 2, 5 0.83 0.83
{} ==> 2 1 1

Example 19: Mining Perfectly Sporadic Association Rules

This example is provided as the files "MainTestAllPerfectlySporadicRules.java" and "contextInverse.txt" in the SPMF distribution.

Let's consider the following binary context, consisting of 5 transactions (t1,t2...t5) and 5 items (1,2,3,4,5). This context is provided in the file "contextInverse.txt" of the SPMF distribution.:

  1 2 3 4 5
t1 x x   x x
t2 x   x    
t3 x x x   x
t4   x x    
t5 x x   x x

By mining running the AprioriInverse algorithm with minsup = 0.001 % and maxsup of 0.60 % and this binary context, we obtain the following set of perfectly rare itemsets (see Koh & Roundtree 2005 for further details):

Perfectly Rare Itemsets Support
{3} 60 %
{4} 40 %
{5} 60 %
{4, 5} 40 %
{3, 5} 20 %

Then, by generating all association rules with using the perfectly rare itemsets (see Koh & Roundtree 2005 for further details) and a minconf of 60 %, the set of perfectly sporadic association rules is found:

Rule Support Confidence
5 ==> 4 40 % 60 %
4 ==> 5 40 % 100 %

Example 20: Mining Closed Association Rules

This example is provided as the files "MainTestClosedAssociationRules.java" and "contextZart.txt" in the SPMF distribution.

Let's consider the following dataset. This binary context contains 5 transactions (t1, t2, ..., t5) and 5 items (1,2,...5). This binary context is provided in the file "contextZart.txt" of the SPMF distribution.

  1 2 3 4 5
t1 x x   x x
t2 x   x    
t3 x x x   x
t4   x x   x
t5 x x x   x

Given a minimum support threshold (minsup) and a minimum confidence threshold (minconf) and a binary context, we can mine the whole set of closed association rules by (1) using the AprioriClose algorithm to find frequent closed itemsets and then (2) using the frequent closed itemsets to generate the set of all closed association rules as described by Szathmary (2006).

By applying this algorithm with minsup = 60 %, minconf= 60%, we obtains 16 closed associations rules (run the example in the SPMF distribution to see the result)

Example 21: Mining Minimal Non Redundant Association Rules

This example is provided as the files "MainTestMNRRules.java" and "contextZart.txt" in the SPMF distribution.

Let's consider the following dataset. This binary context contains 5 transactions (t1, t2, ..., t5) and 5 items (1,2,...5). This binary context is provided in the file "contextZart.txt" of the SPMF distribution.

  1 2 3 4 5
t1 x x   x x
t2 x   x    
t3 x x x   x
t4   x x   x
t5 x x x   x

Given a minimum support threshold (minsup) and a minimum confidence threshold (minconf) and a binary context, we can mine the whole set of minimal non redundant association rules (Kryszkiewicz, 1998). In this implementation, we use the Zart algorithm (Szathmary et al. 2006) to discover closed itemsets and minimal generator. From this, the set, the set of minimal non redundant association rules can be generated (Kryszkiewicz, 1998; Szathmary et al. 2006) .

By applying this algorithm with minsup = 60 %, minconf= 60%, we obtains 14 minimal non rendudant associations rules (run the example in the SPMF distribution to see the result)

Example 22: Clustering Values with the K-Means algorithm

This example is provided as the file "MainTestKMeans.java" in the SPMF distribution.

The K-Means algorithm takes as input a set of values and can create K clusters. Note that running K-Means with the same data does not always generate the same result as it initially generate random clusters.

Let's consider the following set of values: {2, 2, 3, 7, 7, 7, 7, 3}

By running K-Means with this set of values and K=3, we can obtain the following clusters:

Cluster 1 : {7,7,7,7} Cluster 2 : {2, 2} Cluster 3 : {3, 3}

Example 23: Clustering Values with an Extended K-Means algorithm

This example is provided as the file "MainTestKMeans2.java" in the SPMF distribution.

We offer a second implementation of the K-Means algorithm that offers a few more parameters. It takes as input (1) Kmax, a maximum number of clusters to be created, (2) X, the minimum size required for each cluster and (4) Y, the number of tries for each K.

The algorithm proceed as follow. First it applies K-Means to find K=1 clusters with a size higher than X. It repeat this process Y times and notes the higher number of clusters that was found for K. Then it repeat the same process again with K=K+1, K=K+2 and so on... The algorithm stops when the number of clusters found does not increase for two sucessive K or if K=Kmax.

Let's consider the following set of values:

{2, 2, 3, 7, 7, 7, 7, 3}

By running K-Means with this set of values and X=4, Y=5 and KMax = 6, we can obtain the following clusters:

Cluster 1 : {7,7,7,7} Cluster 2 : {2, 2, 3, 3}

Example 24: Clustering Values with a Hierarchical Clustering algorithm

This example is provided as the file "MainTestHierarchicalClustering.java" in the SPMF distribution.

We have implemented a hierarchical clustering algorithm that is based on the description of Hierarchical Clustering Algorithms from
http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/hierarchical.html. This is a Hierarchical Clustering with a constant "threshold" that indicate
the maximal distance between two clusters to group them together and create a bigger cluster. The distance between two clusters is calculated as the distance between the medians of each cluster.Initially, a cluster is created for each values. Then clusters are combined until no cluster can be combined into bigger clusters.

Let's consider the following set of values:

{2, 2, 3, 7, 7, 8, 9, 3}

By running the algorithm with this set of values and threshold=2.0, we obtain the following clusters:

Cluster 1 : {2,2,3,3} Cluster 2 : {7, 7, 8} Cluster 3 : {9}

Example 25: Mining Frequent Sequential Patterns Using the PrefixSpan Algorithm

This example is provided as the files "MainTestPrefixSpan.java" and "contextPrefixSpan.txt" in the SPMF distribution.

Note: This example uses the PrefixSpan algorithm (Pei et al., 2004).

Consider the following database. This database is provided in the file "contextPrefixSpan.txt" of the SPMF distribution. It is a sequence database (as defined in Pei et al., 2004). The database contains 4 sequences.

ID Sequences
S1 (1), (1 2 3), (1 3), (4), (3 6)
S2 (1 4), (3), (2 3), (1 5)
S3 (5 6), (1 2), (4 6), (3), (2)
S4 (5), (7), (1 6), (3), (2), (3)

From a sequence database, the basic task of sequential pattern mining is to find all sequential patterns that occurs in more than minsup sequences of the database.

From this dataset, if we run the algorithm with minsup= 50 %, 53 sequential patterns will be found.

The list is too long to be presented here. An example of pattern found is "(1,2),(6)" which appears in the first and third sequence (it has therefore a support of 50%). Another pattern is "(4), (3), (2)". It appears in the second and third sequence (it has thus a support of 50 %).

Example 26: Mining Frequent Closed Sequential Patterns Using the BIDE+ Algorithm

This example is provided as the files "MainTestBIDEPlus.java" and "contextPrefixSpan.txt" in the SPMF distribution.

Note: This example uses the BIDE+ algorithm (Wang et al. 2007).

Consider the following database. This database is provided in the file "contextPrefixSpan.txt" of the SPMF distribution. It is a sequence database (as defined in Pei et al., 2004). The database contains 4 sequences.

ID Sequences
S1 (1), (1 2 3), (1 3), (4), (3 6)
S2 (1 4), (3), (2 3), (1 5)
S3 (5 6), (1 2), (4 6), (3), (2)
S4 (5), (7), (1 6), (3), (2), (3)

From this dataset, if we run the BIDE+ algorithm with minsup= 50 %, sixteen frequent closed sequential patterns will be found:

ID Closed Sequential Pattern Support
S1 (6) 75 %
S2 (5) 75 %
S3 (2), (3) 75 %
S4 (1), (2) 100 %
S5 (1), (3) 100 %
S6 (1 2), (6) 50 %
S7 (4), (3) 75 %
S8 (1) (2), (3) 50 %
S9 (1), (2 3), (1) 50 %
S10 (1), (3), (2) 75 %
S11 (1), (3), (3) 75 %
S12 (1 2), (4), (3) 50 %
S13 (6), (2), (3) 50 %
S14 (5), (2), (3) 50 %
S15 (4), (3), (2) 50 %
S16 (5), (6), (3), (2) 50 %
S17 (5), (1), (3), (2) 50 %

A closed sequential pattern is a pattern that is not included in another pattern having the same support.

Example 27: Mining Frequent Multi-dimensional Sequential Patterns from a Multi-dimensional Sequence Database with SeqDIM, using PrefixSpan and Apriori

This example is provided as the files "MainTestMultiDimSequentialPatternMining.java" and "ContextMDSequenceNoTime.txt" in the SPMF distribution.

Note: This example uses the SeqDIM algorithm (Pinto et al. 2001) in combination with PrefixSpan (Pei et al., 2004) and Apriori .(Agrawal & Srikant, 1994)

Consider the following multi-dimensional sequence database. This database is provided in the file "ContextMDSequenceNoTime.txt" of the SPMF distribution. It is a multi-dimensional sequence database (as defined in Pinto et al. 2001). The database contains 4 MD-sequences.

MD-Sequences
ID MD-Patterns Sequences
  d1 d2 d3  
S1 1 1 1 (2 4), (3), (2), (1)
S2 1 2 2 (2 6), (3 5), (6 7)
S3 1 2 1 (1 8), (1), (2), (6)
S4 * 3 3 (2 5), (3 5)

The task of multi-dimensional sequential pattern mining consists of finding MD-Sequences that have a support higher than a minimum support threshold minsup for a MD-Database. If we mine frequent MD-Sequences from the previous database with a minsup of 75%, we obtain the following result:

Frequent MD-Sequences
ID MD-Patterns Sequences Support
  d1 d2 d3    
P1 * * * (3) 75 %
P2 * * * (2) 100 %
P3 1 * * (2) 75 %
P4 * * * (2), (3) 75 %

Example 28: Mining Frequent Closed Multi-dimensional Sequential Patterns from a Sequence Database with SeqDIM/Songram, using Bide+ and AprioriClose

This example is provided as the files "MainTestMultiDimSequentialPatternMiningClosed.java" and "ContextMDSequenceNoTime.txt" in the SPMF distribution.

Note: This example uses the SeqDIM/Songram algorithm (Pinto et al. 2001, Songram et al. 2006) in combination with BIDE+ (Wang et al. 2007) and AprioriClose(Pasquier et al., 1999).

Consider the following multi-dimensional sequence database. This database is provided in the file "ContextMDSequenceNoTime.txt" of the SPMF distribution. It is a multi-dimensional sequence database (as defined in Pinto et al. 2001). The database contains 4 MD-sequences.

MD-Sequences
ID MD-Patterns Sequences
  d1 d2 d3  
S1 1 1 1 (2 4), (3), (2), (1)
S2 1 2 2 (2 6), (3 5), (6 7)
S3 1 2 1 (1 8), (1), (2), (6)
S4 * 3 3 (2 5), (3 5)

The task of multi-dimensional sequential pattern mining consists of finding MD-Sequences that have a support higher than a minimum support threshold minsup for a MD-Database. If we mine frequent closed MD-Sequences from the previous database with a minsup of 50%, we obtain the following result:

Frequent Closed MD-Sequences
ID MD-Patterns Sequences Support
  d1 d2 d3    
P1 * * * (2) 100 %
P2 1 * 1 (1) 50 %
P3 1 2 * (2), (6) 50 %
P4 * * * (2), (3 5) 50 %
P5 * * * (2), (3) 75 %

A closed MD-Sequence is an MD-Sequence that does not appear in another MD-Sequence having the same support.

Example 29: Mining Sequential Patterns with Time Constraints from a Time-Extended Sequence Database

This example is provided as the files "MainTestSequentialPatternsMining1.java" and "contextSequencesTimeExtended.txt" in the SPMF distribution.

Note: This example uses the Fournier-Viger et al., 2008 algorithm that combines many features from other sequential pattern mining algorithms and offers also some original features. Look at the other examples included in this documentation for other possibilities.

Consider the following database. This database is a time-extended sequence database (as defined by Hirate-Yamana, 2006). The database contains 4 sequences. Each sequence contains itemsets that are annotated with a timestamp. For example, consider the sequence S1. This sequence indicates that itemset {1} appeared at time 0. It was followed by the itemset {1, 2, 3} at time 1. Finally, itemset {1 2} appeared at time 2.

ID Sequences
S1 (0, 1), (1, 1 2 3}), (2, 1 3)
S2 (0, 1 ) (1, 1 2 ), (2, 1 2 3), (3, 1 2 3 )
S3 (0, 1 2), (1, 1 2 )
S4 (0, 2), (1, 1 2 3 )

From a sequence database, the basic task of sequential pattern mining is to find all sequential patterns that occurs more than minsup times in the database. In the case of a time extended database, the Fournier-Viger algorithm offers the following parameters based on Hirate-Yamana, 2006. Each sequential patterns must respect these parameters.

  • minimum support (minsup)
  • minimum time interval between two succesive itemsets of a sequence (min_time_interval)
  • maximum time interval between two succesive itemsets of a sequence (min_time_interval)
  • minimum time interval between the first itemset and the last itemset of a sequence (min_whole_interval)
  • maximum time interval between the first itemset and the last itemset of a sequence (max_whole_interval)

From the dataset, if we run the algorithm with minsup= 55 %, min_time_interval = 0, max_time_interval = 2, min_whole_interval = 0, max_whole_interval = 2, we obtain the following results:

ID Sequential Patterns Support
S1 (0, 3) 75 %
S2 (0, 2 3) 75 %
S3 (0, 2) 100 %
S4 (0, 1 2 3) 75%
S5 (0, 1 2) 100 %
S6 (0, 1 3) 75 %
S7 (0, 1) 100 %
S8 (0, 2), (1, 3) 75 %
S9 (0, 2), (1, 1 2) 75 %
S10 (0, 2), (1, 1 3) 75 %
S11 (0, 2), (1, 1) 100 %
S12 (0, 2), (1, 2) 75 %
S13 (0, 1), (1, 1 2) 75 %
S14 (0, 1), (1, 1) 75 %
S15 (0, 1), (1, 2) 75 %
S16 (0, 1 2), (1, 1) 75 %

Example 30: Mining Closed Sequential Patterns with Time Constraints from a Time-Extended Sequence Database

This example is provided as the files "MainTestSequentialPatternsMining2.java" and "contextSequencesTimeExtended.txt" in the SPMF distribution.

Note: This example uses the Fournier-Viger et al., 2008 algorithm that combines many features from other sequential pattern mining algorithms and offers also some original features. Look at the other examples included in this documentation for other possibilities.

The results from the previous example are very interesting but there is redundacy among the patterns found. For example, pattern S14 is included in pattern S13 and both patterns have exactly the same support. For eliminating redundancy, we have included the possibility to mine only closed sequential patterns. A closed sequential patterns is a pattern that is not included in another pattern that have exactly the same support.A closed pattern induces an equivalence class of pattern sharing the same closure, i.e. all the patterns belonging to the equivalence class are verified by exactly the same set of plans. Those patterns are partially ordered, e.g. considering the inclusion relation. The smallest elements in the equivalence class are called minimal generators, and the unique maximal element is called the closed pattern. The set of closed frequent sequences is a compact representation of all paterns as it allows reconstituting the set of all frequent sequences and their support (no information is loss).

To extend our algorithm to mine closed sequential patterns, we have integrated the BIDE+ mechanism (Wang et al. 2007).

Note : It is very important to note that if we mine closed sequences, we should not use the parameter "max_whole_interval" discussed in the previous example, since this parameter is not compatible with the "backScan pruning" of the BIDE+ mechanism.

Now, consider the time-extended sequence database from the previous example:

ID Sequences
S1 (0, 1), (1, 1 2 3}), (2, 1 3)
S2 (0, 1 ) (1, 1 2 ), (2, 1 2 3), (3, 1 2 3 )
S3 (0, 1 2), (1, 1 2 )
S4 (0, 2), (1, 1 2 3 )

If we mine closed sequential patterns with minsup= 55 %, min_time_interval = 0, max_time_interval = 2, we obtain the following set of closed sequential patterns:

ID Sequential Patterns Support
S1 (0, 1 2 3) 75%
S2 (0, 1 2) 100 %
S3 (0, 2), (1, 1 2) 75 %
S4 (0, 2), (1, 1 3) 75 %
S5 (0, 2), (1, 1) 100 %
S6 (0, 1), (1, 1 2) 75 %
S7 (0, 1 2), (1, 1) 75 %

We can observe that the number of closed sequential patterns (6) is much smaller than the number of sequential patterns (15) found in the previous example.

Example 31: Mining Sequential Patterns with Time Constraints from a Time-Extended Sequence Database containing Valued Items

This example is provided as the files "MainTestSequentialPatternsMining3.java" and "contextSequencesTimeExtended_ValuedItems.txt" in the SPMF distribution.

Note: This example uses the Fournier-Viger et al., 2008 algorithm that combines many features from other sequential pattern mining algorithms and offers also some original features. Look at the different examples included in this documentation to know all the possibilities.

Consider the following time-extended sequence database where items can be annotated with numeric values. These values are here indicated with a bold font and blue color. Consider the sequence S1 of this database. At time 0, the item 1 is annotated with the value 2, and item 2 is not annotated with a value. At time 1, item 3 appeared. At time 2, item 6 occured. At time 3, item 5 has value 1.

ID Sequences
S1 (0, 1(2) 2), (1, 3), (2, 6), (3, 5(1))
S2 (0, 1(2) 2), (1, 4(8)), (2, 3), (3, 5(2) 6 7)
S3 (0, 1(3) 2), (1, 4(7)), (2, 3), (3, 5(4))
S4 (0, 1(3) 2), (1, 4(6)), (2, 5(5)), (3, 6 7)

To mine sequential patterns from a time-extended sequence database where items can be annotated with integer values, we have added an original mechanism in the Fournier-Viger et al. algorithm (see Fournier-Viger et al., 2008 for a detailed description). As PrefixSpan, the algorithm grows patterns one item at a time by recursively projecting a database by a frequent item. However, we have modified this step so that when the support of an item that is annotated with values is higher or equals to 2 * minsup, the database projection operation calls the K-Means algorithm [15] to try to separate these values in two or more clusters. Thereafter, the database will be projected separately for each group of values. Thus, the different groups will be considered as different items.

This is best illustrated with an example. If we mine patterns from the first table with minsup = 50%, we can get 56 sequential patterns. Note however that the number of patterns found can vary because K-Means is a randomized algorithm. For this example, six patterns found are presented in the next table:

ID Sequential Patterns Support
P1 (0, 3) 75%
P2 (0, 5 (average: 3 min:1 max:5)) 100 %
P3 (0, 6) 75 %
P4 (0, 3), (1, 6) 50 %
P5 (0, 1 (average: 3 min: 3 max: 3) 2), (1 4 (average: 6.5 min: 6 max: 7)) 50 %
P6 (0, 1 ((average: 2 min: 2 max: 2) 2), (3, 5 (average: 3.5 min: 1 max: 2)) 50 %
... ... ...

When the algorithm was executed, as some point, it considered projecting the database with item "1" because this item is frequent. Item "1" is annotated with values 2, 2, 3 and 3 in sequences S1, S2, S3 and S4, respectively. The algorithm applied the K-Means to find clusters. Two clusters were found. They are {2, 2}and {3, 3}. We can see this in sequences P5 and P6. In sequence P5, item "1" represents the cluster {3, 3}, whereas sequence P6 include item "1" with the cluster {2, 2}. This feature of clustering is useful as it allows to group similar values together for an item and treat them differently.

In the source code of MainTestSequentialPatternsMining3.java, there is a few parameters for K-Means. Two of these parameters are particularly important:

  • MaxK : This parameter is the maximum number of clusters to create. If you set this parameter to two for example, no more than two clusters will be made every time that K-Means is called. If you want to desactivate the clustering, this parameter should be set to 1 (only one cluster will always be made).
  • numberOfTriesForEachK: This parameter indicates the number of times that K-Means should be run each time there is some values to be clustered. Normally, this value should be set to 1. If you set this parameter to a value different than 1, K-Means will be run the number of times that you specify and the maximum number of clusters found will be kept. This parameter was added because K-Means is a randomized algorithm.

Important note: If the clustering described in this example is used jointly with the mining of closed sequential patterns (described in example #29), the set of patterns found may not be a lossless representation of all patterns.

Example 32: (Closed) Multi-dimensional Sequential Patterns Mining with Time Constraints

This example is provided as the files "MainTestSequentialPatternsMining4.java" and "ContextMDSequence.txt" in the SPMF distribution.

Note: This example uses the Fournier-Viger et al., 2008 algorithm that combines many features from other sequential pattern mining algorithms and offers also some original features. Look at the different examples included in this documentation to know all the possibilities.

Consider the following multi-dimensional database (MD-Database). A multidimensional database is a sequence database where each sequence (an MD-Sequence) is annotated with dimensional information (Pinto et al. 2001). The set of dimensional values for an MD-Sequence is called an MD-Pattern. For a multi-dimensional database, there is a fix set of dimensions. Each dimensions can take a symbolic value or the value "*" which means any value. In the following MD-Database, there is four MD-Sequences named S1, S2, S3, S4.

MD-Sequences
ID MD-Patterns Sequences
  d1 d2 d3  
S1 1 1 1 (0, 2 4), (1, 3), (2, 2), (3, 1)
S2 1 2 2 (0, 2 6), (1, 3 5), (2, 6 7)
S3 1 2 1 (0, 1 8), (1, 1), (2, 2), (3, 6)
S4 * 3 3 (0, 2 5), (1, 3 5)

The task of multi-dimensional sequential pattern mining consists of finding MD-Sequences that have a support higher than a minimum support threshold minsup for a MD-Database. Furthermore, in the Fournier-Viger algorithm, we offers the possibility to mine only frequent closed MD-Sequences, by implementing the idea of Songram et al. 2006. A frequent closed MD-Sequence is a frequent MD-Sequence that is not included in any other MD-Sequence having the same support.

If we mine frequent closed MD-Sequences from the previous database with a minsup of 50%, we obtain the following result:

Frequent Closed MD-Sequences
ID MD-Patterns Sequences
  d1 d2 d3  
P1 * * * (0, 2)
P2 1 * 1 (0, 1)
P3 1 2 * (0, 6)
P4 * * * (0, 2), (1, 3 5)
P5 * * * (0, 2), (1, 3)

If we mine frequent MD-sequences instead of frequent closed MD-Sequences, we will obtain 23 frequent MD-Sequences.

Copyright © 2008-2010 Philippe Fournier-Viger. All rights reserved.