Documentation

This section provides examples of how to use SPMF to perform various data mining tasks.

If you have any questions or want to report a bug, you can check the FAQ, post in the forum or contact me. You can also have a look at the various articles that I have referenced on the algorithms page of this website to learn more about each algorithm.

List of examples

Itemset Mining (Frequent Itemsets, Rare Itemsets, etc.)

Association Rule Mining

Clustering

Sequential Pattern Mining

Multi-dimensional Sequential Pattern Mining

Sequential Pattern Mining with the Fournier-Viger et al. (2008) algorithm

Sequential Rule Mining

Classification

Tools

 

Example 1 : Mining Frequent Itemsets by Using the Apriori Algorithm

How to run this example?

What is Apriori?

Apriori is an algorithm for discovering frequent itemsets in transaction databases. It was proposed by Agrawal & Srikant (1994).

What is the input of the Apriori algorithm?

The input is a transaction database (a.k.a. binary context) and a threshold named minsup (a value between 0 and 100 %).

A transaction database is a set of transactions. Each transaction is a set of items. For example, consider the following transaction database. It contains 5 transactions (t1, t2, ..., t5) and 5 items (1, 2, 3, 4, 5). For example, the first transaction represents the set of items 1, 3 and 4. This database is provided as the file contextPasquier99.txt in the SPMF distribution. It is important to note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted by lexicographical order in a transaction.

Transaction id Items
t1 {1, 3, 4}
t2 {2, 3, 5}
t3 {1, 2, 3, 5}
t4 {2, 5}
t5 {1, 2, 3, 5}
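
In SPMF's input format, this database is encoded as a plain text file with one line per transaction, where the items of a transaction are positive integers separated by single spaces. The file contextPasquier99.txt thus contains the following five lines:

1 3 4
2 3 5
1 2 3 5
2 5
1 2 3 5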

What is the output of the Apriori algorithm?

Apriori is an algorithm for discovering itemsets (groups of items) occurring frequently in a transaction database (frequent itemsets). A frequent itemset is an itemset appearing in at least minsup transactions from the transaction database, where minsup is a parameter given by the user.

For example, if Apriori is run on the previous transaction database with a minsup of 40 % (2 transactions), Apriori produces the following result:

itemsets support
{1} 3
{2} 4
{3} 4
{5} 4
{1, 2} 2
{1, 3} 3
{1, 5} 2
{2, 3} 3
{2, 5} 4
{3, 5} 3
{1, 2, 3} 2
{1, 2, 5} 2
{1, 3, 5} 2
{2, 3, 5} 3
{1, 2, 3, 5} 2

How should I interpret the results?

In the results, each itemset is annotated with its support. The support of an itemset is the number of transactions in which it appears. For example, the itemset {2, 3, 5} has a support of 3 because it appears in transactions t2, t3 and t5. It is a frequent itemset because its support is greater than or equal to the minsup parameter.
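
To make the definition of support concrete, here is a minimal Java sketch (an illustration only, not SPMF code) that counts the support of the itemset {2, 3, 5} in the example database:

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SupportExample {
    public static void main(String[] args) {
        // the example transaction database (contextPasquier99)
        List<Set<Integer>> database = Arrays.asList(
            new HashSet<>(Arrays.asList(1, 3, 4)),       // t1
            new HashSet<>(Arrays.asList(2, 3, 5)),       // t2
            new HashSet<>(Arrays.asList(1, 2, 3, 5)),    // t3
            new HashSet<>(Arrays.asList(2, 5)),          // t4
            new HashSet<>(Arrays.asList(1, 2, 3, 5)));   // t5

        Set<Integer> itemset = new HashSet<>(Arrays.asList(2, 3, 5));

        // the support of an itemset is the number of transactions that contain it
        long support = database.stream()
                               .filter(transaction -> transaction.containsAll(itemset))
                               .count();
        System.out.println("support of {2, 3, 5} = " + support); // prints 3
    }
}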

Performance

The Apriori algorithm is important for historical reasons and also because it is simple and easy to learn. However, faster and more memory-efficient algorithms have since been proposed. If efficiency is required, it is recommended to use a more efficient algorithm such as FPGrowth instead of Apriori. You can see a performance comparison of Apriori, FPGrowth, and other frequent itemset mining algorithms by clicking on the "performance" section of this website.

Implementation details

In SPMF, there is also an implementation of Apriori that uses a hash-tree as an internal structure to store candidates. This structure provides a more efficient way to count the support of itemsets. This version of Apriori is named "Apriori_with_hash_tree" in the GUI of SPMF and the command line. For the source code version, it can be run by executing the test file MainTestAprioriHT_saveToFile.java. This version of Apriori can be up to twice as fast as the regular version in some cases, but it uses more memory. It has two parameters: (1) minsup and (2) the number of child nodes that each node in the hash-tree should have. For the second parameter, we suggest using a value of 30.
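
If you are using SPMF as a library from Java source code rather than through the graphical interface, running Apriori looks roughly like the sketch below. The class name AlgoApriori, its package and the runAlgorithm(minsup, input, output) signature are based on the SPMF source distribution but may vary between versions; the bundled test file MainTestApriori_saveToFile.java is the authoritative example.

import ca.pfv.spmf.algorithms.frequentpatterns.apriori.AlgoApriori;

public class RunApriori {
    public static void main(String[] args) throws Exception {
        String input = "contextPasquier99.txt"; // the transaction database
        String output = "output.txt";           // where the frequent itemsets will be written
        double minsup = 0.4;                    // 40 %, i.e., 2 of the 5 transactions

        AlgoApriori apriori = new AlgoApriori();
        apriori.runAlgorithm(minsup, input, output);
        apriori.printStats();                   // prints runtime, memory usage and itemset count
    }
}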

Where can I get more information about the Apriori algorithm?

This is the technical report published in 1994 describing Apriori.

R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. Research Report RJ 9839, IBM Almaden Research Center, San Jose, California, June 1994.

You can also read chapter 6 of the book "Introduction to Data Mining", which provides a nice and easy-to-understand introduction to Apriori.

Example 2 : Mining Frequent Itemsets by Using the AprioriTid Algorithm

How to run this example?

What is AprioriTID?

AprioriTID is an algorithm for discovering frequent itemsets (groups of items appearing frequently) in a transaction database. It was proposed by Agrawal & Srikant (1994).

AprioriTID is a variation of the Apriori algorithm. It was proposed in the same article as Apriori as an alternative implementation. It produces the same output as Apriori but uses a different mechanism for counting the support of itemsets.

What is the input of the AprioriTID algorithm?

The input is a transaction database (a.k.a. binary context) and a threshold named minsup (a value between 0 and 100 %).

A transaction database is a set of transactions. Each transaction is a set of items. For example, consider the following transaction database. It contains 5 transactions (t1, t2, ..., t5) and 5 items (1, 2, 3, 4, 5). For example, the first transaction represents the set of items 1, 3 and 4. This database is provided as the file contextPasquier99.txt in the SPMF distribution. It is important to note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted by lexicographical order in a transaction.

Transaction id Items
t1 {1, 3, 4}
t2 {2, 3, 5}
t3 {1, 2, 3, 5}
t4 {2, 5}
t5 {1, 2, 3, 5}

What is the output of the AprioriTID algorithm?

AprioriTID is an algorithm for discovering itemsets (groups of items) occurring frequently in a transaction database (frequent itemsets). A frequent itemset is an itemset appearing in at least minsup transactions from the transaction database, where minsup is a parameter given by the user.

For example, if AprioriTID is run on the previous transaction database with a minsup of 40 % (2 transactions), AprioriTID produces the following result:

itemsets support
{1} 3
{2} 4
{3} 4
{5} 4
{1, 2} 2
{1, 3} 3
{1, 5} 2
{2, 3} 3
{2, 5} 4
{3, 5} 3
{1, 2, 3} 2
{1, 2, 5} 2
{1, 3, 5} 2
{2, 3, 5} 3
{1, 2, 3, 5} 2

How should I interpret the results?

In the results, each itemset is annotated with its support. The support of an itemset is the number of transactions in which it appears. For example, the itemset {2, 3, 5} has a support of 3 because it appears in transactions t2, t3 and t5. It is a frequent itemset because its support is greater than or equal to the minsup parameter.

Performance

The Apriori and AprioriTID algorithms are important for historical reasons and also because they are simple and easy to learn. However, faster and more memory-efficient algorithms have since been proposed. For efficiency, it is recommended to use a more efficient algorithm such as FPGrowth instead of AprioriTID or Apriori. You can see a performance comparison of Apriori, AprioriTID, FPGrowth, and other frequent itemset mining algorithms by clicking on the "performance" section of this website.

Implementation details

There are two versions of AprioriTID in SPMF. The first one is called AprioriTID and is the regular AprioriTID algorithm. The second one is called AprioriTID_Bitset and uses bitsets instead of HashSets of Integers as internal structures to represent sets of transaction IDs. The advantage of the bitset version is that representing sets of transaction IDs as bitsets is more memory efficient, and intersecting two sets of transaction IDs is faster with bitsets (it is done with a logical AND operation).
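
To illustrate the difference between the two representations, the following self-contained Java sketch (an illustration only, not SPMF code) intersects the tidsets of items 2 and 3 from the example database with both data structures; with bitsets, the intersection reduces to a logical AND:

import java.util.Arrays;
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

public class TidsetIntersection {
    public static void main(String[] args) {
        // tidset of item 2 = {t2, t3, t4, t5}, tidset of item 3 = {t1, t2, t3, t5}
        Set<Integer> tids2 = new HashSet<>(Arrays.asList(2, 3, 4, 5));
        Set<Integer> tids3 = new HashSet<>(Arrays.asList(1, 2, 3, 5));

        // HashSet version: element-by-element intersection
        Set<Integer> intersection = new HashSet<>(tids2);
        intersection.retainAll(tids3);
        System.out.println("tidset of {2, 3}: " + intersection); // [2, 3, 5], support = 3

        // bitset version: one bit per transaction id, intersection = logical AND
        BitSet bits2 = new BitSet();
        for (int tid : tids2) bits2.set(tid);
        BitSet bits3 = new BitSet();
        for (int tid : tids3) bits3.set(tid);
        bits2.and(bits3);
        System.out.println("support of {2, 3}: " + bits2.cardinality()); // 3
    }
}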

Where can I get more information about the AprioriTID algorithm?

This is the technical report published in 1994 describing Apriori and AprioriTID.

R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. Research Report RJ 9839, IBM Almaden Research Center, San Jose, California, June 1994.

You can also read chapter 6 of the book "Introduction to Data Mining", which provides a nice and easy-to-understand introduction to Apriori.

Example 3 : Mining Frequent Itemsets by Using the FP-Growth Algorithm

How to run this example?

What is FPGrowth?

FPGrowth is an algorithm for discovering frequent itemsets in a transaction database. It was proposed by Han et al. (2000). FPGrowth is a very fast and memory-efficient algorithm. It uses a special internal structure called an FP-Tree.

What is the input of the FPGrowth algorithm?

The input of FPGrowth is a transaction database (a.k.a. binary context) and a threshold named minsup (a value between 0 and 100 %).

A transaction database is a set of transactions. Each transaction is a set of items. For example, consider the following transaction database. It contains 5 transactions (t1, t2, ..., t5) and 5 items (1, 2, 3, 4, 5). For example, the first transaction represents the set of items 1, 3 and 4. This database is provided as the file contextPasquier99.txt in the SPMF distribution. It is important to note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted by lexicographical order in a transaction.

Transaction id Items
t1 {1, 3, 4}
t2 {2, 3, 5}
t3 {1, 2, 3, 5}
t4 {2, 5}
t5 {1, 2, 3, 5}

What is the output of the FPGrowth algorithm?

FPGrowth is an algorithm for discovering itemsets (groups of items) occurring frequently in a transaction database (frequent itemsets). A frequent itemset is an itemset appearing in at least minsup transactions from the transaction database, where minsup is a parameter given by the user.

For example, if FPGrowth is run on the previous transaction database with a minsup of 40 % (2 transactions), FPGrowth produces the following result:

itemsets support
{1} 3
{2} 4
{3} 4
{5} 4
{1, 2} 2
{1, 3} 3
{1, 5} 2
{2, 3} 3
{2, 5} 4
{3, 5} 3
{1, 2, 3} 2
{1, 2, 5} 2
{1, 3, 5} 2
{2, 3, 5} 3
{1, 2, 3, 5} 2

How should I interpret the results?

In the results, each itemset is annotated with its support. The support of an itemset is the number of transactions in which it appears. For example, the itemset {2, 3, 5} has a support of 3 because it appears in transactions t2, t3 and t5. It is a frequent itemset because its support is greater than or equal to the minsup parameter.

Performance

There exist several algorithms for mining frequent itemsets. In SPMF, you can try, for example, Apriori, AprioriTID, Eclat, HMine, Relim and more. Among all these algorithms, FPGrowth is generally the fastest and most memory efficient. You can see a performance comparison by clicking on the "performance" section of this website.

Where can I get more information about the FPGrowth algorithm?

This is the journal article describing FPGrowth:

Jiawei Han, Jian Pei, Yiwen Yin, Runying Mao: Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach. Data Min. Knowl. Discov. 8(1): 53-87 (2004)

You can also read chapter 6 of the book "Introduction to Data Mining", which provides an easy-to-understand introduction to FPGrowth (but does not give all the details).

Example 4 : Mining Frequent Itemsets by Using the Relim Algorithm

How to run this example?

What is Relim?

Relim is an algorithm for discovering frequent itemsets in a transaction database. It was proposed by Borgelt (2005). It is not a very efficient algorithm and is included in SPMF mainly for comparison purposes.

What is the input of the Relim algorithm?

The input is a transaction database (a.k.a. binary context) and a threshold named minsup (a value between 0 and 100 %).

A transaction database is a set of transactions. Each transaction is a set of items. For example, consider the following transaction database. It contains 5 transactions (t1, t2, ..., t5) and 5 items (1, 2, 3, 4, 5). For example, the first transaction represents the set of items 1, 3 and 4. This database is provided as the file contextPasquier99.txt in the SPMF distribution. It is important to note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted by lexicographical order in a transaction.

Transaction id Items
t1 {1, 3, 4}
t2 {2, 3, 5}
t3 {1, 2, 3, 5}
t4 {2, 5}
t5 {1, 2, 3, 5}

What is the output of the Relim algorithm?

Relim is an algorithm for discovering itemsets (groups of items) occurring frequently in a transaction database (frequent itemsets). A frequent itemset is an itemset appearing in at least minsup transactions from the transaction database, where minsup is a parameter given by the user.

For example, if Relim is run on the previous transaction database with a minsup of 40 % (2 transactions), Relim produces the following result:

itemsets support
{1} 3
{2} 4
{3} 4
{5} 4
{1, 2} 2
{1, 3} 3
{1, 5} 2
{2, 3} 3
{2, 5} 4
{3, 5} 3
{1, 2, 3} 2
{1, 2, 5} 2
{1, 3, 5} 2
{2, 3, 5} 3
{1, 2, 3, 5} 2

How should I interpret the results?

In the results, each itemset is annotated with its support. The support of an itemset is the number of transactions in which it appears. For example, the itemset {2, 3, 5} has a support of 3 because it appears in transactions t2, t3 and t5. It is a frequent itemset because its support is greater than or equal to the minsup parameter.

Performance

There exist several algorithms for mining frequent itemsets. Relim is not among the most efficient; FPGrowth is recommended for better performance. You can see a performance comparison by clicking on the "performance" section of this website.

Where can I get more information about the Relim algorithm?

This is the conference article describing Relim:

Christian Borgelt: Keeping Things Simple: Finding Frequent Item Sets by Recursive Elimination. Proc. Workshop on Open Source Data Mining Software (OSDM'05, Chicago, IL), 66-70. ACM Press, New York, NY, USA, 2005.

Note that the author of Relim and collaborators have proposed extensions and additional optimizations of Relim that I have not implemented.

Example 5 : Mining Frequent Itemsets by Using the Eclat Algorithm

How to run this example?

What is Eclat ?

Eclat is an algorithm for discovering frequent itemsets in a transaction database. It was proposed by Zaki (2000). Unlike algorithms such as Apriori, Eclat uses a depth-first search for discovering frequent itemsets instead of a breadth-first search.

What is the input of the Eclat algorithm?

The input is a transaction database (a.k.a. binary context) and a threshold named minsup (a value between 0 and 100 %).

A transaction database is a set of transactions. Each transaction is a set of items. For example, consider the following transaction database. It contains 5 transactions (t1, t2, ..., t5) and 5 items (1, 2, 3, 4, 5). For example, the first transaction represents the set of items 1, 3 and 4. This database is provided as the file contextPasquier99.txt in the SPMF distribution. It is important to note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted by lexicographical order in a transaction.

Transaction id Items
t1 {1, 3, 4}
t2 {2, 3, 5}
t3 {1, 2, 3, 5}
t4 {2, 5}
t5 {1, 2, 3, 5}

What is the output of the Eclat algorithm?

Eclat is an algorithm for discovering itemsets (groups of items) occurring frequently in a transaction database (frequent itemsets). A frequent itemset is an itemset appearing in at least minsup transactions from the transaction database, where minsup is a parameter given by the user.

For example, if Eclat is run on the previous transaction database with a minsup of 40 % (2 transactions), Eclat produces the following result:

itemsets support
{1} 3
{2} 4
{3} 4
{5} 4
{1, 2} 2
{1, 3} 3
{1, 5} 2
{2, 3} 3
{2, 5} 4
{3, 5} 3
{1, 2, 3} 2
{1, 2, 5} 2
{1, 3, 5} 2
{2, 3, 5} 3
{1, 2, 3, 5} 2

How should I interpret the results?

Each frequent itemset is annotated with its support. The support of an itemset is the number of transactions in which it appears. For example, the itemset {2, 3, 5} has a support of 3 because it appears in transactions t2, t3 and t5. It is a frequent itemset because its support is greater than or equal to the minsup parameter.

Performance

There exist several algorithms for mining frequent itemsets. Eclat is one of the best, although FPGrowth generally performs better. You can see a performance comparison by clicking on the "performance" section of this website.

Nevertheless, the Eclat algorithm is interesting because it uses a depth-first search. For some extensions of the problem of itemset mining such as mining high utility itemsets (see the HUI-Miner algorithm), the search procedure of Eclat works very well.

Implementation details

In SPMF, there are two versions of Eclat. The first one is named "Eclat" and uses HashSets for representing sets of transaction IDs (tidsets). The second version is named "Eclat_bitset" and uses bitsets for representing tidsets. Using bitsets is generally more memory efficient and can also make the algorithm faster, depending on the dataset.
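
The following self-contained Java sketch (a much-simplified illustration, not SPMF's actual implementation) shows the depth-first search idea: each itemset is recursively extended with larger items, and the tidset of an extension is obtained by intersecting tidsets:

import java.util.*;

public class EclatSketch {
    static int minsup = 2;

    public static void main(String[] args) {
        // vertical representation of contextPasquier99: item -> set of transaction ids
        Map<Integer, Set<Integer>> vertical = new TreeMap<>();
        vertical.put(1, new HashSet<>(Arrays.asList(1, 3, 5)));
        vertical.put(2, new HashSet<>(Arrays.asList(2, 3, 4, 5)));
        vertical.put(3, new HashSet<>(Arrays.asList(1, 2, 3, 5)));
        vertical.put(5, new HashSet<>(Arrays.asList(2, 3, 4, 5)));
        // item 4 is omitted: its tidset {1} is already below minsup
        explore(new ArrayList<>(), new ArrayList<>(vertical.keySet()), vertical);
    }

    static void explore(List<Integer> prefix, List<Integer> extensions,
                        Map<Integer, Set<Integer>> tidsets) {
        for (int i = 0; i < extensions.size(); i++) {
            int item = extensions.get(i);
            List<Integer> itemset = new ArrayList<>(prefix);
            itemset.add(item);
            Set<Integer> tidset = tidsets.get(item);
            System.out.println(itemset + " support=" + tidset.size());

            // build the tidsets of the extensions of this itemset by intersection
            Map<Integer, Set<Integer>> nextTidsets = new TreeMap<>();
            List<Integer> nextExtensions = new ArrayList<>();
            for (int j = i + 1; j < extensions.size(); j++) {
                int other = extensions.get(j);
                Set<Integer> inter = new HashSet<>(tidset);
                inter.retainAll(tidsets.get(other));
                if (inter.size() >= minsup) {      // keep only frequent extensions
                    nextTidsets.put(other, inter);
                    nextExtensions.add(other);
                }
            }
            explore(itemset, nextExtensions, nextTidsets);  // depth-first recursion
        }
    }
}

Running this sketch prints the same 15 frequent itemsets as the table above, with their supports.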

Where can I get more information about the Eclat algorithm?

Here is an article describing the Eclat algorithm:

Mohammed Javeed Zaki: Scalable Algorithms for Association Mining. IEEE Trans. Knowl. Data Eng. 12(3): 372-390 (2000)

Example 6 : Mining Frequent Itemsets by Using the HMine Algorithm

How to run this example?

What is H-Mine ?

H-Mine is an algorithm for discovering frequent itemsets in transaction databases, proposed by Pei et al. (2001). Unlike earlier algorithms such as Apriori, H-Mine uses a pattern-growth approach to discover frequent itemsets.

What is the input of the H-Mine algorithm?

The input of H-Mine is a transaction database (a.k.a. binary context) and a threshold named minsup (a value between 0 and 100 %).

A transaction database is a set of transactions. Each transaction is a set of items. For example, consider the following transaction database. It contains 5 transactions (t1, t2, ..., t5) and 5 items (1, 2, 3, 4, 5). For example, the first transaction represents the set of items 1, 3 and 4. This database is provided as the file contextPasquier99.txt in the SPMF distribution. It is important to note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted by lexicographical order in a transaction.

Transaction id Items
t1 {1, 3, 4}
t2 {2, 3, 5}
t3 {1, 2, 3, 5}
t4 {2, 5}
t5 {1, 2, 3, 5}

What is the output of the H-Mine algorithm?

H-Mine is an algorithm for discovering itemsets (groups of items) occurring frequently in a transaction database (frequent itemsets). A frequent itemset is an itemset appearing in at least minsup transactions from the transaction database, where minsup is a parameter given by the user.

For example, if H-Mine is run on the previous transaction database with a minsup of 40 % (2 transactions), H-Mine produces the following result:

itemsets support
{1} 3
{2} 4
{3} 4
{5} 4
{1, 2} 2
{1, 3} 3
{1, 5} 2
{2, 3} 3
{2, 5} 4
{3, 5} 3
{1, 2, 3} 2
{1, 2, 5} 2
{1, 3, 5} 2
{2, 3, 5} 3
{1, 2, 3, 5} 2

How should I interpret the results?

In the results, each itemset is annotated with its support. The support of an itemset is the number of transactions in which it appears. For example, the itemset {2, 3, 5} has a support of 3 because it appears in transactions t2, t3 and t5. It is a frequent itemset because its support is greater than or equal to the minsup parameter.

Performance

There exist several algorithms for mining frequent itemsets. H-Mine is claimed by its authors to be one of the best. However, the SPMF implementation of H-Mine does not perform very well, so it is recommended to use FPGrowth instead. You can see a performance comparison by clicking on the "performance" section of this website.

Where can I get more information about the H-Mine algorithm?

Here is an article describing the H-Mine algorithm:

J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, and D. Yang. "H-Mine: Fast and space-preserving frequent pattern mining in large databases". IIE Transactions, Volume 39, Issue 6, pages 593-605, June 2007, Taylor & Francis.

Example 7 : Mining Frequent Closed Itemsets Using the AprioriClose Algorithm

How to run this example?

What is AprioriClose?

AprioriClose (a.k.a. Close) is an algorithm for discovering frequent closed itemsets in a transaction database. It was proposed by Pasquier et al. (1999).

What is the input of the AprioriClose algorithm?

The input is a transaction database (a.k.a. binary context) and a threshold named minsup (a value between 0 and 100 %).

A transaction database is a set of transactions. Each transaction is a set of items. For example, consider the following transaction database. It contains 5 transactions (t1, t2, ..., t5) and 5 items (1, 2, 3, 4, 5). For example, the first transaction represents the set of items 1, 3 and 4. This database is provided as the file contextPasquier99.txt in the SPMF distribution. It is important to note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted by lexicographical order in a transaction.

Transaction id Items
t1 {1, 3, 4}
t2 {2, 3, 5}
t3 {1, 2, 3, 5}
t4 {2, 5}
t5 {1, 2, 3, 5}

What is the output of the AprioriClose algorithm?

AprioriClose outputs frequent closed itemsets. To explain what a frequent closed itemset is, it is necessary to review a few definitions.

An itemset is an unordered set of distinct items. The support of an itemset is the number of transactions that contain the itemset. For example, consider the itemset {1, 3}. It has a support of 3 because it appears in three transactions (t1, t3 and t5) from the transaction database.

A frequent itemset is an itemset that appears in at least minsup transactions from the transaction database, where minsup is a threshold set by the user. A frequent closed itemset is a frequent itemset that is not included in a proper superset having exactly the same support. The set of frequent closed itemsets is thus a subset of the set of frequent itemsets. Why is it interesting to discover frequent closed itemsets? The reason is that the set of frequent closed itemsets is usually much smaller than the set of frequent itemsets, and no information is lost: all the frequent itemsets can be regenerated from the set of frequent closed itemsets (see Pasquier (1999) for more details).

If we apply AprioriClose on the previous transaction database with a minsup of 40 % (2 transactions), we get the following five frequent closed itemsets:

frequent closed itemsets support
{3} 4
{1, 3} 3
{2, 5} 4
{2, 3, 5} 3
{1, 2, 3, 5} 2
If you applied the regular Apriori algorithm instead of AprioriClose, you would get 15 itemsets instead of 5, which shows that the set of frequent closed itemsets can be much smaller than the set of frequent itemsets.

How should I interpret the results?

In the results, each frequent closed itemset is annotated with its support. For example, the itemset {2, 3, 5} has a support of 3 because it appears in transactions t2, t3 and t5. The itemset {2, 3, 5} is a frequent itemset because its support is greater than or equal to the minsup parameter. Furthermore, it is a closed itemset because it has no proper superset having exactly the same support.
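
The closedness test can be expressed directly in code. Here is a minimal Java sketch (an illustration only, not SPMF code): an itemset is closed if intersecting all the transactions that contain it gives back exactly the itemset itself.

import java.util.*;

public class ClosedCheck {
    public static void main(String[] args) {
        List<Set<Integer>> db = Arrays.asList(
            new HashSet<>(Arrays.asList(1, 3, 4)),       // t1
            new HashSet<>(Arrays.asList(2, 3, 5)),       // t2
            new HashSet<>(Arrays.asList(1, 2, 3, 5)),    // t3
            new HashSet<>(Arrays.asList(2, 5)),          // t4
            new HashSet<>(Arrays.asList(1, 2, 3, 5)));   // t5

        System.out.println(isClosed(new HashSet<>(Arrays.asList(1, 3)), db)); // true
        System.out.println(isClosed(new HashSet<>(Arrays.asList(1, 5)), db)); // false: its closure is {1, 2, 3, 5}
    }

    // an itemset is closed if the intersection of all transactions containing it
    // adds no new item (i.e., the closure equals the itemset)
    static boolean isClosed(Set<Integer> itemset, List<Set<Integer>> db) {
        Set<Integer> closure = null;
        for (Set<Integer> t : db) {
            if (t.containsAll(itemset)) {
                if (closure == null) closure = new HashSet<>(t);
                else closure.retainAll(t);
            }
        }
        return closure != null && closure.equals(itemset);
    }
}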

Performance

The AprioriClose algorithm is important for historical reasons because it is the first algorithm for mining frequent closed itemsets. However, several other algorithms for mining frequent closed itemsets now exist. In SPMF, it is recommended to use DCI_Closed or Charm instead of AprioriClose, because they are more efficient.

Implementation details

In SPMF, there are two versions of AprioriClose. The first version is named "AprioriClose" and is based on the Apriori algorithm. The second version is named "Apriori_TIDClose" and is based on the AprioriTID algorithm instead of Apriori (it uses tidsets to calculate support, which reduces the number of database scans). Both versions are available in the graphical user interface of SPMF. In the source code, the files "MainTestAprioriClose1.java" and "MainTestAprioriTIDClose.java" respectively correspond to these two versions.

Where can I get more information about the AprioriClose algorithm?

The following article describes the AprioriClose algorithm:

Nicolas Pasquier, Yves Bastide, Rafik Taouil, Lotfi Lakhal: Discovering Frequent Closed Itemsets for Association Rules. ICDT 1999: 398-416

Example 8 : Mining Frequent Closed Itemsets Using the DCI_Closed Algorithm

How to run this example?

What is DCI_Closed?

DCI_Closed is an algorithm for discovering frequent closed itemsets in a transaction database. DCI_Closed was proposed by Lucchese et al. (2004).

What is the input of the DCI_Closed algorithm?

The input is a transaction database (a.k.a. binary context) and a threshold named minsup (a value between 0 and 100 %).

A transaction database is a set of transactions. Each transaction is a set of items. For example, consider the following transaction database. It contains 5 transactions (t1, t2, ..., t5) and 5 items (1, 2, 3, 4, 5). For example, the first transaction represents the set of items 1, 3 and 4. This database is provided as the file contextPasquier99.txt in the SPMF distribution. It is important to note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted by lexicographical order in a transaction.

Transaction id Items
t1 {1, 3, 4}
t2 {2, 3, 5}
t3 {1, 2, 3, 5}
t4 {2, 5}
t5 {1, 2, 3, 5}

What is the output of the DCI_Closed algorithm?

DCI_Closed outputs frequent closed itemsets. To explain what a frequent closed itemset is, it is necessary to review a few definitions.

An itemset is an unordered set of distinct items. The support of an itemset is the number of transactions that contain the itemset. For example, consider the itemset {1, 3}. It has a support of 3 because it appears in three transactions (t1, t3 and t5) from the transaction database.

A frequent itemset is an itemset that appears in at least minsup transactions from the transaction database. A frequent closed itemset is a frequent itemset that is not included in a proper superset having exactly the same support. The set of frequent closed itemsets is thus a subset of the set of frequent itemsets. Why is it interesting to discover frequent closed itemsets? The reason is that the set of frequent closed itemsets is usually much smaller than the set of frequent itemsets, and no information is lost: all the frequent itemsets can be regenerated from the set of frequent closed itemsets (see Lucchese (2004) for more details).

If we apply DCI_Closed on the transaction database with a minsup of 2 transactions, we get the following result:

frequent closed itemsets support
{3} 4
{1, 3} 3
{2, 5} 4
{2, 3, 5} 3
{1, 2, 3, 5} 2
If you compare this result with the output of a frequent itemset mining algorithm such as Apriori, you will notice that DCI_Closed finds only 5 closed itemsets instead of the 15 itemsets found by Apriori, which shows that the set of frequent closed itemsets can be much smaller than the set of frequent itemsets.

How should I interpret the results?

In the results, each frequent closed itemset is annotated with its support. For example, the itemset {2, 3, 5} has a support of 3 because it appears in transactions t2, t3 and t5. It is a frequent itemset because its support is greater than or equal to the minsup parameter. It is a closed itemset because it has no proper superset having exactly the same support.

Performance

The DCI_Closed algorithm is one of the fastest algorithms for frequent closed itemset mining. The version in SPMF is optimized and very efficient. SPMF also offers other algorithms for frequent closed itemset mining such as Charm and AprioriClose. DCI_Closed and Charm are more efficient than AprioriClose.

Implementation details

In the source code version of SPMF, there are two versions of DCI_Closed. The first one uses HashSet to store transaction IDs. The second one is an optimized version that uses a bit matrix to store transaction IDs and includes additional optimizations. The first version can be tested by running MainTestDCI_Closed.java and the second version by running MainTestDCI_Closed_Optimized.java. In the release version of SPMF, only the optimized version of DCI_Closed is available in the graphical user interface and command line interface.

Where can I get more information about the DCI_Closed algorithm?

Here is an article describing the DCI_Closed algorithm:

Claudio Lucchese, Salvatore Orlando, Raffaele Perego: DCI Closed: A Fast and Memory Efficient Algorithm to Mine Frequent Closed Itemsets. FIMI 2004

Example 9 : Mining Frequent Closed Itemsets Using the Charm Algorithm

How to run this example?

What is Charm?

Charm is an algorithm for discovering frequent closed itemsets in a transaction database. It was proposed by Zaki & Hsiao (2002).

What is the input of the Charm algorithm?

The input is a transaction database (a.k.a. binary context) and a threshold named minsup (a value between 0 and 100 %).

A transaction database is a set of transactions. Each transaction is a set of items. For example, consider the following transaction database. It contains 5 transactions (t1, t2, ..., t5) and 5 items (1, 2, 3, 4, 5). For example, the first transaction represents the set of items 1, 3 and 4. This database is provided as the file contextPasquier99.txt in the SPMF distribution. It is important to note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted by lexicographical order in a transaction.

Transaction id Items
t1 {1, 3, 4}
t2 {2, 3, 5}
t3 {1, 2, 3, 5}
t4 {2, 5}
t5 {1, 2, 3, 5}

What is the output of the Charm algorithm?

Charm outputs frequent closed itemsets. To explain what a frequent closed itemset is, it is necessary to review a few definitions.

An itemset is an unordered set of distinct items. The support of an itemset is the number of transactions that contain the itemset. For example, the itemset {1, 3} has a support of 3 because it appears in three transactions from the previous transaction database.

A frequent itemset is an itemset that appears in at least minsup transactions from the transaction database. A frequent closed itemset is a frequent itemset that is not included in a proper superset having exactly the same support. The set of frequent closed itemsets is thus a subset of the set of frequent itemsets. Why is it interesting to discover frequent closed itemsets? The reason is that the set of frequent closed itemsets is usually much smaller than the set of frequent itemsets, and no information is lost by discovering only frequent closed itemsets (all the frequent itemsets can be regenerated from the set of frequent closed itemsets; see Zaki (2002) for more details).

If we apply Charm on the previous transaction database with a minsup of 40 % (2 transactions), we get the following result:

frequent closed itemsets support
{3} 4
{1, 3} 3
{2, 5} 4
{2, 3, 5} 3
{1, 2, 3, 5} 2
If you compare this result with the output of a frequent itemset mining algorithm such as Apriori, you will notice that Charm finds only 5 closed itemsets instead of the 15 itemsets found by Apriori, which shows that the set of frequent closed itemsets can be much smaller than the set of frequent itemsets.

How should I interpret the results?

In the results, each frequent closed itemset is annotated with its support. For example, the itemset {2, 3, 5} has a support of 3 because it appears in transactions t2, t3 and t5. It is a frequent itemset because its support is greater than or equal to the minsup parameter. It is a closed itemset because it has no proper superset having exactly the same support.

Performance

The Charm algorithm is an important algorithm because it is one of the first depth-first algorithms for mining frequent closed itemsets. In SPMF, Charm and DCI_Closed are the two most efficient algorithms for frequent closed itemset mining.

Implementation details

In SPMF, there are two versions of Charm. The first version is named "Charm" and uses HashSet to represent sets of transaction IDs. It can be run with the graphical and command line interfaces. In the source code, it can be executed by running "MainTestCharm_saveToMemory".

The second version of Charm is named "Charm_bitset". It uses bitsets to represent sets of transaction IDs. It can be run with the graphical and command line interfaces. In the source code, it can be executed by running MainTestCharm_bitset_saveToFile.java. This version of Charm should generally be faster and more memory-efficient than the version not using bitsets.

Where can I get more information about the Charm algorithm?

This article describes the Charm algorithm:

Mohammed Javeed Zaki, Ching-Jiu Hsiao: CHARM: An Efficient Algorithm for Closed Itemset Mining. SDM 2002.

Example 10 : Mining Frequent Maximal Itemsets Using the Charm-MFI Algorithm

How to run this example?

What is Charm-MFI?

Charm-MFI is an algorithm for discovering frequent maximal itemsets in a transaction database.

Charm-MFI is not an efficient algorithm because it discovers maximal itemsets by performing post-processing after discovering frequent closed itemsets with the Charm algorithm (hence the name: Charm-MFI). Moreover, note that the original Charm-MFI algorithm is not correct. In SPMF, it has been fixed so that it generates the correct result.

What is the input of the Charm-MFI algorithm?

The input is a transaction database (a.k.a. binary context) and a threshold named minsup (a value between 0 and 100 %).

A transaction database is a set of transactions. Each transaction is a set of items. For example, consider the following transaction database. It contains 5 transactions (t1, t2, ..., t5) and 5 items (1, 2, 3, 4, 5). For example, the first transaction represents the set of items 1, 3 and 4. This database is provided as the file contextPasquier99.txt in the SPMF distribution. It is important to note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted by lexicographical order in a transaction.

Transaction id Items
t1 {1, 3, 4}
t2 {2, 3, 5}
t3 {1, 2, 3, 5}
t4 {2, 5}
t5 {1, 2, 3, 5}

What is the output of the Charm-MFI algorithm?

Charm-MFI outputs frequent maximal itemsets. To explain what a frequent maximal itemset is, it is necessary to review a few definitions.

An itemset is an unordered set of distinct items. The support of an itemset is the number of transactions that contain the itemset. For example, the itemset {1, 3} has a support of 3 because it appears in three transactions from the previous transaction database.

A frequent itemset is an itemset that appears in at least minsup transactions from the transaction database. A frequent closed itemset is a frequent itemset that is not included in a proper superset having the same support. A frequent maximal itemset is a frequent itemset that is not included in a proper superset that is frequent. The set of frequent maximal itemsets is thus a subset of the set of frequent closed itemsets, which is itself a subset of the set of frequent itemsets. Why is it interesting to discover frequent maximal itemsets? The reason is that the set of frequent maximal itemsets is usually much smaller than the set of frequent itemsets and also smaller than the set of frequent closed itemsets. However, unlike frequent closed itemsets, frequent maximal itemsets are not a lossless representation of the set of frequent itemsets: all frequent itemsets can be regenerated from the set of frequent maximal itemsets, but their support cannot be recovered without scanning the database.

If we apply Charm-MFI on the previous transaction database with a minsup of 40 % (2 transactions), we get the following result:

frequent maximal itemsets support
{1, 2, 3, 5} 2
This itemset is the only maximal itemset, and it has a support of 2 because it appears in two transactions.

How should I interpret the results?

In the results, each frequent maximal itemset is annotated with its support. For example, the itemset {1, 2, 3, 5} is a maximal itemset having a support of 2 because it appears in transactions t3 and t5. The itemset {2, 5} has a support of 4 but is not a maximal itemset because it is included in {2, 3, 5}, which is a frequent itemset.
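
The post-processing idea behind Charm-MFI can be sketched as follows (an illustration only, not SPMF code): starting from the five frequent closed itemsets that Charm finds on this database (see Example 9), keep only those that have no proper superset in the collection.

import java.util.*;

public class MaximalFilter {
    public static void main(String[] args) {
        // the five frequent closed itemsets found by Charm on the example database
        List<Set<Integer>> closed = Arrays.asList(
            new HashSet<>(Arrays.asList(3)),
            new HashSet<>(Arrays.asList(1, 3)),
            new HashSet<>(Arrays.asList(2, 5)),
            new HashSet<>(Arrays.asList(2, 3, 5)),
            new HashSet<>(Arrays.asList(1, 2, 3, 5)));

        // a closed itemset is maximal if no other closed itemset is a proper superset
        for (Set<Integer> candidate : closed) {
            boolean maximal = true;
            for (Set<Integer> other : closed) {
                if (other.size() > candidate.size() && other.containsAll(candidate)) {
                    maximal = false;
                    break;
                }
            }
            if (maximal) System.out.println(candidate + " is maximal");
        }
        // only {1, 2, 3, 5} is printed
    }
}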

Performance

The Charm-MFI algorithm is not very efficient because it finds frequent maximal itemsets by post-processing instead of finding them directly. However, it is currently the only algorithm included in SPMF for discovering maximal itemsets. Other algorithms may eventually be added to SPMF for this task.

Where can I get more information about the Charm-MFI algorithm?

The Charm-MFI algorithm is described in this thesis (in French language only):

L. Szathmary (2006). Symbolic Data Mining Methods with the Coron Platform.

Example 11 : Mining Frequent Closed Itemsets and Minimal Generators Using the Zart Algorithm

How to run this example?

What is Zart?

Zart is an algorithm for discovering frequent closed itemsets and their corresponding generators in a transaction database.

Zart is an Apriori-based algorithm. Why is it useful to discover closed itemsets and their generators at the same time? One reason is that this information is necessary to generate some special kinds of association rules, such as the IGB basis of association rules (see the IGB example for more information).

What is the input of the Zart algorithm?

The input is a transaction database (a.k.a. binary context) and a threshold named minsup (a value between 0 and 100 %).

A transaction database is a set of transactions. Each transaction is a set of items. For example, consider the following transaction database. It contains 5 transactions (t1, t2, ..., t5) and 5 items (1, 2, 3, 4, 5). For example, the first transaction represents the set of items 1, 2, 4 and 5. This database is provided as the file contextZart.txt in the SPMF distribution. It is important to note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted by lexicographical order in a transaction.

Transaction id Items
t1 {1, 2, 4, 5}
t2 {1, 3}
t3 {1, 2, 3, 5}
t4 {2, 3, 5}
t5 {1, 2, 3, 5}

What is the output of the Zart algorithm?

The output of the Zart algorithm, for a transaction database and a minimum support threshold minsup, is the set of all frequent closed itemsets with their support, together with the associated generator(s) of each frequent closed itemset.

To explain what a frequent closed itemset and a generator are, it is necessary to review a few definitions.

An itemset is an unordered set of distinct items. The support of an itemset is the number of transactions that contain the itemset. For example, the itemset {1, 3} has a support of 3 because it appears in three transactions from the database (t2, t3 and t5). A frequent itemset is an itemset that appears in at least minsup transactions from the transaction database. A frequent closed itemset is a frequent itemset that is not included in a proper superset having the same support. A generator Y of a closed itemset X is an itemset such that (1) it has the same support as X and (2) it does not have any subset having the same support.

By running Zart with the previous transaction database and a minsup of 40% (2 transactions), we obtain the following result:

itemsets support is closed? minimal generators
{} 5 yes {}
{1} 4 yes {1}
{2} 4 no
{3} 4 yes {3}
{5} 4 no
{1, 2} 3 no
{1, 3} 3 yes {1,3}
{1, 5} 3 no
{2, 3} 3 no
{2, 5} 4 yes {2}, {5}
{3, 5} 3 no
{1, 2, 3} 2 no
{1, 2, 5} 3 yes {1, 2}, {1, 5}
{1, 3, 5} 2 no
{2, 3, 5} 3 yes {2, 3}, {3, 5}
{1, 2, 3, 5} 2 yes {1, 2, 3}, {1, 3, 5}

How should I interpret the results?

In the results, all frequent itemsets are shown. Each frequent itemset that is a closed itemset is marked as such ("yes"), and for each closed itemset, its support and its list of minimal generators are indicated. For example, the itemset {1, 2, 3, 5} has a support of 2 because it appears in 2 transactions (t3 and t5). It is a closed itemset because it has no proper superset having the same support. Moreover, it has two generators: {1, 2, 3} and {1, 3, 5}. By definition, these generators have the same support as {1, 2, 3, 5}.

Consider another example: the itemset {1, 3, 5} has a support of 2 and is not closed because it has a proper superset, {1, 2, 3, 5}, with the same support.
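
The generator test can also be expressed in code. The following Java sketch (an illustration only, not SPMF code) checks whether an itemset is a generator by verifying that removing any single item strictly increases the support; by the anti-monotonicity of support, testing the immediate subsets is sufficient.

import java.util.*;

public class GeneratorCheck {
    static List<Set<Integer>> db = Arrays.asList(
        new HashSet<>(Arrays.asList(1, 2, 4, 5)),    // t1
        new HashSet<>(Arrays.asList(1, 3)),          // t2
        new HashSet<>(Arrays.asList(1, 2, 3, 5)),    // t3
        new HashSet<>(Arrays.asList(2, 3, 5)),       // t4
        new HashSet<>(Arrays.asList(1, 2, 3, 5)));   // t5 (contextZart)

    public static void main(String[] args) {
        System.out.println(isGenerator(new HashSet<>(Arrays.asList(1, 2)))); // true
        System.out.println(isGenerator(new HashSet<>(Arrays.asList(2, 5)))); // false: {2} has the same support
    }

    static int support(Set<Integer> itemset) {
        int count = 0;
        for (Set<Integer> t : db) if (t.containsAll(itemset)) count++;
        return count;
    }

    // X is a generator iff no immediate subset of X has the same support as X
    static boolean isGenerator(Set<Integer> itemset) {
        int sup = support(itemset);
        for (Integer item : itemset) {
            Set<Integer> subset = new HashSet<>(itemset);
            subset.remove(item);
            if (support(subset) == sup) return false;
        }
        return true;
    }
}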

Implementation details

In the source code version of SPMF, there are two versions of Zart. The version "MainTestZart.java" keeps the result in memory, while the version named "MainTestZart_saveToFile.java" saves the result to a file. In the graphical user interface and command line interface, only the second version is offered.

Performance

The Zart algorithm is not very efficient because it is based on Apriori. If you only want to discover closed itemsets and do not need the generators, you should instead use DCI_Closed or Charm, which are more efficient for closed itemset mining. However, in some cases it is desirable to discover closed itemsets together with their generators (for example, to generate IGB association rules). For these cases, Zart is an appropriate algorithm.

Where can I get more information about the Zart algorithm?

The Zart algorithm is described in this paper:

Laszlo Szathmary, Amedeo Napoli, Sergei O. Kuznetsov: ZART: A Multifunctional Itemset Mining Algorithm. Proc. CLA 2007.

Example 12 : Mining Minimal Rare Itemsets

How to run this example?

What is AprioriRare?

AprioriRare is an algorithm for mining minimal rare itemsets from a transaction database. It is an Apriori-based algorithm. It was proposed by Szathmary et al. (2007).

What is the input ?

The input is a transaction database (a.k.a. binary context) and a threshold named minsup (a value between 0 and 100 %).

A transaction database is a set of transactions. Each transaction is a set of items. For example, consider the following transaction database. It contains 5 transactions (t1, t2, ..., t5) and 5 items (1, 2, 3, 4, 5). For example, the first transaction represents the set of items 1, 2, 4 and 5. This database is provided as the file contextZart.txt in the SPMF distribution. It is important to note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted by lexicographical order in a transaction.

Transaction id Items
t1 {1, 2, 4, 5}
t2 {1, 3}
t3 {1, 2, 3, 5}
t4 {2, 3, 5}
t5 {1, 2, 3, 5}

What is the output?

The output of AprioriRare is the set of minimal rare itemsets. To explain what a minimal rare itemset is, it is necessary to review a few definitions. An itemset is an unordered set of distinct items. The support of an itemset is the number of transactions that contain the itemset, divided by the total number of transactions. For example, the itemset {1, 2} has a support of 60% because it appears in 3 transactions out of 5 in the previous database (t1, t3 and t5). A frequent itemset is an itemset whose support is no less than the minsup parameter. A minimal rare itemset is an itemset that is not frequent but whose proper subsets are all frequent.

For example, if we run the AprioriRare algorithm with minsup = 60 % on the previous transaction database, we obtain the following set of minimal rare itemsets:

Minimal Rare Itemsets Support
{4} 20 %
{1, 3, 5} 40 %
{1, 2, 3} 40 %
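
The definition of a minimal rare itemset translates directly into code. Here is a minimal Java sketch (an illustration only, not SPMF code) that checks the two conditions: the itemset itself is not frequent, while each immediate subset is frequent (which, by anti-monotonicity, implies that all subsets are frequent).

import java.util.*;

public class MinimalRareCheck {
    static List<Set<Integer>> db = Arrays.asList(
        new HashSet<>(Arrays.asList(1, 2, 4, 5)),    // t1
        new HashSet<>(Arrays.asList(1, 3)),          // t2
        new HashSet<>(Arrays.asList(1, 2, 3, 5)),    // t3
        new HashSet<>(Arrays.asList(2, 3, 5)),       // t4
        new HashSet<>(Arrays.asList(1, 2, 3, 5)));   // t5 (contextZart)
    static int minsupCount = 3;                      // 60 % of 5 transactions

    public static void main(String[] args) {
        System.out.println(isMinimalRare(new HashSet<>(Arrays.asList(1, 3, 5))));    // true
        System.out.println(isMinimalRare(new HashSet<>(Arrays.asList(1, 2, 3, 5)))); // false: {1, 2, 3} is already rare
    }

    static int support(Set<Integer> itemset) {
        int count = 0;
        for (Set<Integer> t : db) if (t.containsAll(itemset)) count++;
        return count;
    }

    // minimal rare = the itemset is infrequent, but every immediate subset is frequent
    static boolean isMinimalRare(Set<Integer> itemset) {
        if (support(itemset) >= minsupCount) return false;   // not rare
        for (Integer item : itemset) {
            Set<Integer> subset = new HashSet<>(itemset);
            subset.remove(item);
            if (support(subset) < minsupCount) return false; // a subset is rare too
        }
        return true;
    }
}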

Performance

AprioriRare is the only algorithm for minimal rare itemset mining offered in SPMF. Since it is based on Apriori, it suffers from the same fundamental limitations (it may generate too many candidates and it may generate candidates that do not appear in the database).

Where can I get more information about this algorithm?

The AprioriRare algorithm is described in this paper:

Laszlo Szathmary, Amedeo Napoli, Petko Valtchev: Towards Rare Itemset Mining. ICTAI (1) 2007: 305-312

Example 13 : Mining Perfectly Rare Itemsets Using the AprioriInverse Algorithm

How to run this example?

What is AprioriInverse?

AprioriInverse is an algorithm for mining perfectly rare itemsets. It was proposed by Koh & Rountree (2005). Why mine perfectly rare itemsets? One reason is that they are useful for generating the set of sporadic association rules.

What is the input?

The input is a transaction database (a.k.a. binary context) and a threshold named minsup (a value between 0 and 100 %).

A transaction database is a set of transactions. Each transaction is a set of items. For example, consider the following transaction database. It contains 5 transactions (t1, t2, ..., t5) and 5 items (1, 2, 3, 4, 5). For example, the first transaction represents the set of items 1, 2, 4 and 5. This database is provided as the file contextInverse.txt in the SPMF distribution. It is important to note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted by lexicographical order in a transaction.

Transaction id Items
t1 {1, 2, 4, 5}
t2 {1, 3}
t3 {1, 2, 3, 5}
t4 {2, 3}
t5 {1, 2, 4, 5}

What is the output?

The output of AprioriInverse is the set of all perfectly rare itemsets in the database whose support is lower than maxsup and no less than minsup. To explain what a perfectly rare itemset is, it is necessary to review a few definitions. An itemset is an unordered set of distinct items. The support of an itemset is the number of transactions that contain the itemset, divided by the total number of transactions. For example, the itemset {1, 2} has a support of 60% because it appears in 3 transactions out of 5 in the previous database (t1, t3 and t5). A frequent itemset is an itemset that has a support no less than the maxsup parameter. A perfectly rare itemset (a.k.a. sporadic itemset) is an itemset that is not frequent and whose proper subsets are all not frequent. Moreover, its support must be greater than or equal to the minsup threshold.

By running the AprioriInverse algorithm with minsup = 0.1 % and maxsup = 60 % on this transaction database, we obtain the following set of perfectly rare itemsets (see Koh & Rountree (2005) for further details):

Perfectly Rare Itemsets Support
{3} 60 %
{4} 40 %
{5} 60 %
{4, 5} 40 %
{3, 5} 20 %

Performance

AprioriInverse is the only algorithm for perfectly rare itemset mining offered in SPMF. Since it is based on Apriori, it suffers from the same fundamental limitations (it may generate too many candidates and may generate candidates that do not appear in the database).

Where can I get more information about this algorithm?

The AprioriInverse algorithm is described in this paper:

Yun Sing Koh, Nathan Rountree: Finding Sporadic Rules Using Apriori-Inverse. PAKDD 2005: 97-106

Example 14 : Mining Closed Itemsets from a Data Stream Using the CloStream Algorithm

How to run this example?

What is CloStream?

CloStream is an algorithm for incrementally mining closed itemsets from a data stream. It was proposed by Yen et al. (2009).

Why is it useful? Because most closed itemset mining algorithms such as Charm, DCI_Closed and AprioriClose are batch algorithms: if the transaction database is updated, they need to be run again to update the set of closed itemsets. If new transactions are inserted constantly and the results need to be updated often, using these algorithms may become very costly. A stream mining algorithm like CloStream is specially designed to handle this situation. It assumes that each transaction can be read only once and that new transactions arrive regularly. Every time a new transaction arrives, CloStream updates its result.

What is the input of CloStream?

The input of CloStream is a stream of transactions. Each transaction is a set of items (symbols). For example, consider the following five transactions (t1, t2, ..., t5) over 5 items (1, 2, 3, 4, 5). The first transaction represents the set of items 1, 3 and 4. It is important to note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted by lexicographical order in a transaction. CloStream is an algorithm for processing a stream: it is allowed to read each transaction only once, because a stream is assumed to be potentially infinite and arriving at high speed.

Transaction ID Items
t1 {1, 3, 4}
t2 {2, 3, 5}
t3 {1, 2, 3, 5}
t4 {2, 5}
t5 {1, 2, 3, 5}

What is the output of CloStream?

CloStream produces as output the set of closed itemsets contained in the transactions that it has seen until now. An itemset is an unordered set of distinct items. The support of an itemset is the number of transactions that contain the itemset. For example, the itemset {1, 3, 4} has a support of 1 because it appears only in t1. A closed itemset is an itemset that is not included in a proper superset having the same support. For example, if we apply CloStream to the five transactions above, the final result is:
closed itemsets support
{} 5
{3} 4
{1, 3} 3
{1, 3, 4} 1
{2, 5} 4
{2, 3, 5} 3
{1, 2, 3, 5} 2

For example, the itemset {2, 3, 5} has a support of 3 because it appears in transactions t2, t3 and t5. It is a closed itemset because it has no proper superset having the same support.

Performance

CloStream is a reasonably efficient algorithm. A limitation of this algorithm is that it is not possible to set a minimum support threshold. Therefore, if the number of closed itemsets is large, this algorithm may use too much memory. However, CloStream has the advantage of being very simple and easy to implement.

Where can I get more information about this algorithm?

The CloStream algorithm is described in this paper:

Show-Jane Yen, Yue-Shi Lee, Cheng-Wei Wu, Chin-Lin Lin: An Efficient Algorithm for Maintaining Frequent Closed Itemsets over Data Stream. IEA/AIE 2009: 767-776.

Example 15 : Mining Frequent Itemsets from Uncertain Data Using the U-Apriori Algorithm

How to run this example?

What is UApriori?

UApriori is an algorithm for mining frequent itemsets from a transaction database where the data is uncertain (contains probabilities). The UApriori algorithm was proposed by Chui et al. (2007).

This algorithm can have multiple applications such as in mining medical data or sensor data where observations may be uncertain.

What is the input ?

UApriori takes as input a transaction database containing probabilities and a minimum expected support threshold (a value between 0 and 1). A transaction database is a set of transactions, where each transaction is a set of items. In UApriori, each item in a transaction is annotated with an existential probability. For example, consider the following transaction database, consisting of four transactions (t1, t2, t3, t4) and five items (1, 2, 3, 4, 5). The transaction t1 contains item 1 with a probability of 0.5, item 2 with a probability of 0.4, item 4 with a probability of 0.3 and item 5 with a probability of 0.7. This database is provided in the file "contextUncertain.txt" of the SPMF distribution:

     item 1   item 2   item 3   item 4   item 5
t1    0.5      0.4       -       0.3      0.7
t2     -       0.5      0.4       -       0.4
t3    0.6      0.5       -       0.1      0.5
t4    0.7      0.4      0.3       -       0.9

What is the output?

The output of U-Apriori is the set of frequent itemsets. Note that the definition of a frequent itemset is here different from the definition used by the regular Apriori algorithm because we have to consider the existential probabilities.

The expected support of an itemset in a transaction is defined as the product of the existential probabilities of the itemset's items in that transaction. It is a value between 0 and 1. For example, the expected support of the itemset {1, 2} in transaction t1 is 0.5 x 0.4 = 0.2. The expected support of an itemset in a transaction database is the sum of its expected support in all transactions where it occurs. For example, the expected support of the itemset {2, 3} is the sum of its expected support in t2 and t4: 0.5 x 0.4 + 0.4 x 0.3 = 0.32. A frequent itemset is an itemset whose expected support is greater than or equal to the minimum expected support set by the user. For example, by running U-Apriori with a minimum expected support of 0.10, we obtain 19 frequent itemsets, including:

itemsets expected support
{2 3 5} 0.19
{1 3 5} 0.19
{1 4 5} 0.14
{2 4 5} 0.11
{1 2 5} 0.54
{1 5} 1.28
{1 3} 0.21
{1 4} 0.21
{2 3} 0.32
{1 2} 0.78
... ...
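
The expected support computation can be made concrete with a few lines of Java. The sketch below (an illustration only, not SPMF code) recomputes the expected support of {2, 3} in the example database:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ExpectedSupport {
    public static void main(String[] args) {
        // contextUncertain: one map per transaction, item -> existential probability
        List<Map<Integer, Double>> db = new ArrayList<>();
        db.add(transaction(new int[]{1, 2, 4, 5}, new double[]{0.5, 0.4, 0.3, 0.7})); // t1
        db.add(transaction(new int[]{2, 3, 5},    new double[]{0.5, 0.4, 0.4}));      // t2
        db.add(transaction(new int[]{1, 2, 4, 5}, new double[]{0.6, 0.5, 0.1, 0.5})); // t3
        db.add(transaction(new int[]{1, 2, 3, 5}, new double[]{0.7, 0.4, 0.3, 0.9})); // t4

        // expected support = sum over transactions of the product of the
        // probabilities of the itemset's items (0 if an item is absent)
        int[] itemset = {2, 3};
        double expected = 0;
        for (Map<Integer, Double> t : db) {
            double product = 1;
            for (int item : itemset) {
                product *= t.getOrDefault(item, 0.0);
            }
            expected += product;
        }
        System.out.println("expected support of {2, 3} = " + expected); // ~0.32
    }

    static Map<Integer, Double> transaction(int[] items, double[] probas) {
        Map<Integer, Double> map = new HashMap<>();
        for (int i = 0; i < items.length; i++) map.put(items[i], probas[i]);
        return map;
    }
}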

Performance

UApriori is not the most efficient algorithm for uncertain itemset mining, but it is simple and was the first algorithm designed for this task.

Where can I get more information about the UApriori algorithm?

Here is an article describing the UApriori algorithm:

Chun Kit Chui, Ben Kao, Edward Hung: Mining Frequent Itemsets from Uncertain Data. PAKDD 2007: 47-58

Example 16 : Mining High-Utility Itemsets from a Database with Utility Information with the Two-Phase Algorithm

How to run this example?

What is Two-Phase?

Two-Phase (Liu et al., 2005) is an algorithm for discovering high-utility itemsets in a transaction database containing utility information.

High utility itemset mining has several applications such as discovering groups of items in transactions of a store that generate the most profit. A database containing utility information is a database where items can have quantities and a unit price. Although these algorithms are often presented in the context of market basket analysis, there exist other applications.

What is the input?

Two-Phase takes as input a transaction database with utility information and a minimum utility threshold min_utility (a positive integer). Let's consider the following database, consisting of 5 transactions (t1, t2, ..., t5) and 7 items (1, 2, 3, 4, 5, 6, 7). This database is provided in the text file "DB_utility.txt" in the package ca.pfv.spmf.tests of the SPMF distribution.

    Items                 Transaction utility   Item utilities for this transaction
t1  {3, 5, 1, 2, 4, 6}            30            1, 3, 5, 10, 6, 5
t2  {3, 5, 2, 4}                  20            3, 3, 8, 6
t3  {3, 1, 4}                      8            1, 5, 2
t4  {3, 5, 1, 7}                  27            6, 6, 10, 5
t5  {3, 5, 2, 7}                  11            2, 3, 4, 2

Each line of the database represents a transaction and indicates: (1) the items it contains, (2) its transaction utility and (3) the utility of each of its items.

Note that the value in the second column for each line is the sum of the values in the third column.

What are real-life examples of such a database? One application is a customer transaction database. Imagine that each transaction represents the items purchased by a customer. The first customer, "t1", bought items 3, 5, 1, 2, 4 and 6. The amount of money spent for each item is respectively 1 $, 3 $, 5 $, 10 $, 6 $ and 5 $. The total amount of money spent in this transaction is 1 + 3 + 5 + 10 + 6 + 5 = 30 $.

What is the output?

The output of Two-Phase is the set of high utility itemsets having a utility no less than the minutility threshold (a positive integer) set by the user. To explain what a high utility itemset is, it is necessary to review some definitions. An itemset is an unordered set of distinct items. The utility of an itemset in a transaction is the sum of the utilities of its items in that transaction. For example, the utility of the itemset {1, 4} in transaction t1 is 5 + 6 = 11, and the utility of {1, 4} in transaction t3 is 5 + 2 = 7. The utility of an itemset in a database is the sum of its utility in all transactions where it appears. For example, the utility of {1, 4} in the database is its utility in t1 plus its utility in t3, for a total of 11 + 7 = 18. A high utility itemset is an itemset whose utility is no less than minutility. For example, if we run Two-Phase with a minimum utility of 30, we obtain 8 high-utility itemsets:

itemsets utility support
{2 4} 30 40 % (2 transactions)
{2 5} 31 60 % (3 transactions)
{1 3 5} 31 40 % (2 transactions)
{2 3 4} 34 40 % (2 transactions)
{2 3 5} 37 60 % (3 transactions)
{2 4 5} 36 40 % (2 transactions)
{2 3 4 5} 40 40 % (2 transactions)
{1 2 3 4 5 6} 30 20 % (1 transaction)

If the database is a transaction database from a store, we could interpret these results as all the groups of items purchased together that generated a profit of 30 $ or more.
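
The utility computation can also be made concrete with a few lines of Java. The sketch below (an illustration only, not SPMF code) recomputes the utility of the itemset {1, 4} in the example database:

import java.util.*;

public class UtilityExample {
    public static void main(String[] args) {
        // DB_utility: one map per transaction, item -> utility of the item in that transaction
        List<Map<Integer, Integer>> db = new ArrayList<>();
        db.add(transaction(new int[]{3, 5, 1, 2, 4, 6}, new int[]{1, 3, 5, 10, 6, 5})); // t1
        db.add(transaction(new int[]{3, 5, 2, 4},       new int[]{3, 3, 8, 6}));        // t2
        db.add(transaction(new int[]{3, 1, 4},          new int[]{1, 5, 2}));           // t3
        db.add(transaction(new int[]{3, 5, 1, 7},       new int[]{6, 6, 10, 5}));       // t4
        db.add(transaction(new int[]{3, 5, 2, 7},       new int[]{2, 3, 4, 2}));        // t5

        Set<Integer> itemset = new HashSet<>(Arrays.asList(1, 4));

        // sum the utilities of the itemset's items over the transactions containing it
        int utility = 0;
        for (Map<Integer, Integer> t : db) {
            if (t.keySet().containsAll(itemset)) {
                for (int item : itemset) {
                    utility += t.get(item);
                }
            }
        }
        System.out.println("utility of {1, 4} = " + utility); // 11 (in t1) + 7 (in t3) = 18
    }

    static Map<Integer, Integer> transaction(int[] items, int[] utilities) {
        Map<Integer, Integer> map = new HashMap<>();
        for (int i = 0; i < items.length; i++) {
            map.put(items[i], utilities[i]);
        }
        return map;
    }
}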

Performance

High utility itemset mining is a much more difficult problem than frequent itemset mining. Therefore, algorithms for high-utility itemset mining are generally slower than frequent itemset mining algorithms.

The Two-Phase algorithm is an important algorithm because it introduced the concept of mining high utility itemsets in two phases: first overestimating the utility of itemsets in phase I, and then calculating their exact utility in phase II. However, more efficient algorithms now exist. For efficiency, it is recommended to use an algorithm such as HUI-Miner, which is also included in SPMF and is one of the most efficient algorithms for this problem.

Implementation details

In the source code version of SPMF, there are two versions of Two-Phase: one that saves the result to a file (MainTestTwoPhaseAlgorithm_saveToFile.java) and one that saves the result to memory (MainTestTwoPhaseAlgorithm_saveToMemory.java). In the graphical interface and command line interface, only the version that saves to file is offered.

Also note that the input format is not exactly the same as described in the original article, but it is equivalent.

Where can I get more information about the Two-Phase algorithm?

Here is an article describing the Two-Phase algorithm:

Y. Liu, W.-K. Liao, A. N. Choudhary: A Two-Phase Algorithm for Fast Discovery of High Utility Itemsets. PAKDD 2005: 689-695

Example 17 : Mining High-Utility Itemsets from a Database with Utility Information with the HUI-Miner Algorithm

How to run this example?

What is HUI-Miner?

HUI-Miner (Mengchi Liu & Junfeng Qu, CIKM 2012) is an algorithm for discovering high-utility itemsets in a transaction database containing utility information.

High utility itemset mining has several applications such as discovering groups of items in transactions of a store that generate the most profit. A database containing utility information is a database where items can have quantities and a unit price. Although these algorithms are often presented in the context of market basket analysis, there exist other applications.

What is the input?

HUI-Miner takes as input a transaction database with utility information and a minimum utility threshold min_utility (a positive integer). Let's consider the following database, consisting of 5 transactions (t1, t2, ..., t5) and 7 items (1, 2, 3, 4, 5, 6, 7). This database is provided in the text file "DB_utility.txt" in the package ca.pfv.spmf.tests of the SPMF distribution.

Transaction id | Items | Transaction utility | Item utilities for this transaction
t1 | 3 5 1 2 4 6 | 30 | 1 3 5 10 6 5
t2 | 3 5 2 4 | 20 | 3 3 8 6
t3 | 3 1 4 | 8 | 1 5 2
t4 | 3 5 1 7 | 27 | 6 6 10 5
t5 | 3 5 2 7 | 11 | 2 3 4 2

Each line of the database represents a transaction and is composed of three parts: the list of items of the transaction, the transaction utility (the total utility of the transaction), and the utility of each item in this transaction.

Note that the transaction utility of each line is the sum of the item utilities of that line (the last column).

What are real-life examples of such a database? There are several applications in real life. One application is a customer transaction database. Imagine that each transaction represents the items purchased by a customer. The first customer named "t1" bought items 3, 5, 1, 2, 4 and 6. The amount of money spent for each item is respectively 1 $, 3 $, 5 $, 10 $, 6 $ and 5 $. The total amount of money spent in this transaction is 1 + 3 + 5 + 10 + 6 + 5 = 30 $.

What is the output?

The output of HUI-Miner is the set of high utility itemsets having a utility no less than the min_utility threshold (a positive integer) set by the user. To explain what a high utility itemset is, it is necessary to review some definitions. An itemset is an unordered set of distinct items. The utility of an itemset in a transaction is the sum of the utilities of its items in the transaction. For example, the utility of the itemset {1 4} in transaction t1 is 5 + 6 = 11 and the utility of {1 4} in transaction t3 is 5 + 2 = 7. The utility of an itemset in a database is the sum of its utility in all transactions where it appears. For example, the utility of {1 4} in the database is the utility of {1 4} in t1 plus the utility of {1 4} in t3, for a total of 11 + 7 = 18. A high utility itemset is an itemset whose utility is no less than min_utility. For example, if we run HUI-Miner with a minimum utility of 30, we obtain 8 high-utility itemsets:

itemsets utility support
{2 4} 30 40 % (2 transactions)
{2 5} 31 60 % (3 transactions)
{1 3 5} 31 40 % (2 transactions)
{2 3 4} 34 40 % (2 transactions)
{2 3 5} 37 60 % (3 transactions)
{2 4 5} 36 40 % (2 transactions)
{2 3 4 5} 40 40 % (2 transactions)
{1 2 3 4 5 6} 30 20 % (1 transaction)

If the database is a transaction database from a store, we could interpret these results as all the groups of items bought together that generated a profit of 30 $ or more.

Performance

High utility itemset mining is a more difficult problem than frequent itemset mining. Therefore, algorithms for high-utility itemset mining are generally slower than frequent itemset mining algorithms. The HUI-Miner algorithm is reported as one of the most efficient algorithms for high utility itemset mining.
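
The efficiency of HUI-Miner comes from a vertical structure called a utility-list. The following Java sketch shows the rough idea (the field names are illustrative and do not necessarily match the SPMF source code): each element stores, for one transaction, the transaction id, the utility of the itemset in that transaction (iutil), and the utility of the items that can still extend the itemset in that transaction (rutil).

import java.util.*;

class UtilityList {
    record Element(int tid, int iutil, int rutil) {}
    final List<Element> elements = new ArrayList<>();

    // The exact utility of the itemset is the sum of the iutil values.
    int utility() {
        return elements.stream().mapToInt(Element::iutil).sum();
    }

    // Upper bound on the utility of any extension of the itemset;
    // if it is below min_utility, the whole branch can be pruned.
    int extensionUpperBound() {
        return elements.stream().mapToInt(e -> e.iutil() + e.rutil()).sum();
    }
}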

Implementation details

The version implemented here contains all the optimizations described in the paper proposing HUI-Miner. Note that the input format is not exactly the same as described in the original article, but it is equivalent.

Where can I get more information about the HUI-Miner algorithm?

This is the reference of the article describing the HUI-Miner algorithm:

M. Liu, J.-F. Qu: Mining high utility itemsets without candidate generation. CIKM 2012, 55-64

Example 18 : Mining Erasable Itemsets from a Product Database with the VME algorithm

How to run this example?

What is the VME algorithm?

VME (Deng & Xu, 2010) is an algorithm for mining erasable itemsets from a product database with profit information.

What is the input?

VME takes as input a product database and a threshold (a value between 0 and 100 %). A product is defined as a set of items that are used to assemble the product. Moreover, each product is annotated with a profit (a positive integer) that indicates how much money this product generates for the company. For example, let's consider the following product database, consisting of 6 products and 7 items (this example is taken from the article of Deng & Xu, 2010). Each product is annotated with its profit. For example, the first line indicates that product 1 generates a total profit of 50 $ for the company and that its assembly requires parts 2, 3, 4 and 6. This product database is provided in the file "contextVME.txt" of the SPMF distribution:

  profit items
product1 50$ {2, 3, 4, 6}
product2 20$ {2, 5, 7}
product3 50$ {1, 2, 3, 5}
product4 800$ {1, 2, 4}
product5 30$ {6, 7}
product6 50$ {3, 4}

What is the output?

The output is the set of erasable itemsets generating a loss of profit lower than or equal to the user-specified threshold. The idea is to discover items that the company could stop manufacturing while minimizing the amount of profit lost by being unable to build products.

To explain more formally what an erasable itemset is, it is necessary to review some definitions. An itemset is an unordered set of distinct items. The loss of profit generated by an itemset is defined as the sum of the profits of all products containing an item from this itemset. For example, the loss of profit of the itemset {5, 6} is the sum of the profits of the products containing 5 and/or 6: 50 $ + 20 $ + 50 $ + 30 $ = 150 $. The loss of profit can also be expressed as a percentage of the total profit of the database. For example, in this database the total profit is 50 + 20 + 50 + 800 + 30 + 50 = 1000 $. Therefore, the loss of profit of the itemset {5, 6} can be expressed as 15 % (150 / 1000 * 100).
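
As an illustration, here is a minimal Java sketch (hypothetical names, not the SPMF API) that computes the loss of profit of the itemset {5, 6} on the example product database:

import java.util.*;

public class LossOfProfitExample {
    public static void main(String[] args) {
        int[] profit = {50, 20, 50, 800, 30, 50};
        List<Set<Integer>> products = List.of(
            Set.of(2, 3, 4, 6), Set.of(2, 5, 7), Set.of(1, 2, 3, 5),
            Set.of(1, 2, 4), Set.of(6, 7), Set.of(3, 4));

        Set<Integer> itemset = Set.of(5, 6);
        int loss = 0;
        for (int i = 0; i < products.size(); i++) {
            // A product is affected if it uses at least one item of the itemset.
            if (!Collections.disjoint(products.get(i), itemset)) {
                loss += profit[i];
            }
        }
        System.out.println("Loss of profit of {5, 6}: " + loss + " $"); // 150 $
    }
}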

By running VME with a threshold of 15 %, we obtain 8 erasable itemsets (having a profit loss less or equal to 15% x 1000$ = 150 $):

erasable itemsets loss of profit ("gain")
{3} 150
{5} 70
{6} 80
{7} 50
{5 6} 150
{5 7} 100
{6 7} 100
{5 6 7} 150

This means that if the items from one of those erasable itemsets are not manufactured anymore, then the loss of profit will be lower or equal to 15%.

Performance

The VME algorithm is Apriori-based. It is not the fastest algorithm for this problem, but it is the only one offered in SPMF because this problem is not very popular. For more efficient algorithms, you can search for the papers of the same authors, who have proposed a few improved algorithms for this problem.

Where can I get more information about the VME algorithm?

Here is an article describing the VME algorithm:

Z. Deng, X. Xu: An Efficient Algorithm for Mining Erasable Itemsets. ADMA (1) 2010: 214-225.

Example 19 : Building, updating incrementally and using an Itemset-Tree to generate targeted frequent itemsets and association rules.

How to run this example?

What is an itemset-tree?

An itemset-tree is a special data structure that can be used for performing efficient queries about itemsets and association rules in a transaction database without having to generate all of them beforehand.

An itemset-tree has the nice property of being incremental, which means that new transactions can be added to an existing itemset-tree very efficiently, without having to rebuild the tree from scratch. An itemset-tree also has the property of being compact.

How to use it?

An itemset-tree is built by inserting a set of transactions into the tree. A transaction is simply a set of distinct items. For example, we could insert the following 6 transactions (t1, t2, ..., t6) into an itemset-tree. In this example, the transaction t1 represents the set of items {1, 4}. This set of transactions is provided in the file "contextItemsetTree.txt" of the SPMF distribution.

transaction IDs items
t1 {1, 4}
t2 {2, 5}
t3 {1, 2, 3, 4, 5}
t4 {1, 2, 4}
t5 {2, 5}
t6 {2, 4}

The result of the insertion of these six transactions is the following itemset-tree (see the article by Kubat et al. for more details).

{}   sup=6
  [2 ]   sup=3
    [2 5 ]   sup=2
    [2 4 ]   sup=1
  [1 ]   sup=3
    [1 2 ]   sup=2
      [1 2 4 ]   sup=1
      [1 2 3 4 5 ]   sup=1
    [1 4 ]   sup=1

The root is the empty itemset {} and the leaves are {2, 5}, {2, 4}, {1, 2, 4}, {1, 2, 3, 4, 5} and {1, 4}.
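
Conceptually, each node of the tree can be represented roughly as follows in Java (the names are illustrative; see the SPMF source code and the article by Kubat et al. for the actual structure):

import java.util.*;

class ItemsetTreeNode {
    int[] itemset;       // the sorted itemset stored in this node; the root holds {}
    int support;         // the number of transactions accounted for by this node
    List<ItemsetTreeNode> children = new ArrayList<>();

    ItemsetTreeNode(int[] itemset, int support) {
        this.itemset = itemset;
        this.support = support;
    }
}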

Once an itemset-tree has been created, it is possible to update it by inserting a new transaction. For example, in the example provided in the source code, we update the previous tree by adding a new transaction {4, 5}. The result is this tree:

{}   sup=7
  [2 ]   sup=3
    [2 5 ]   sup=2
    [2 4 ]   sup=1
  [1 ]   sup=3
    [1 2 ]   sup=2
      [1 2 4 ]   sup=1
      [1 2 3 4 5 ]   sup=1
    [1 4 ]   sup=1
  [4 5 ]   sup=1

Next, it is shown how to query the tree to determine the support of a target itemset efficiently. For example, if we execute the query of finding the support of the itemset {2}, the support is determined to be 5 because item 2 appears in 5 transactions.

After that, the source code offers an example of how to use the itemset-tree to get all the itemsets that subsume an itemset, together with their support. For example, if we use the itemset {1 2} for this query, the result is:

[1 2 ]    supp:2
[1 2 3 ]    supp:1
[1 2 4 ]    supp:2
[1 2 5 ]    supp:1
[1 2 3 4 ]    supp:1
[1 2 3 5 ]    supp:1
[1 2 4 5 ]    supp:1
[1 2 3 4 5 ]    supp:1

Another example provided is how to use the tree to find all itemsets that subsume an itemset such that the support is higher or equal to a user-specified threshold named minsup (a positive integer representing a number of transactions). For example, if we execute this query with the itemset {1} and minsup =2, we get this result:

[1 ]    supp:3
[1 2 ]    supp:2
[1 4 ]    supp:3
[1 2 4 ]    supp:2

Lastly, another example is how to generate all association rules having a target itemset as antecedent and a support and confidence respectively higher or equal to some user-specified thresholds minsup (a positive integer representing a number of transactions) and minconf (a value between 0 and 1). For example, if the target itemset is {1}, minconf = 0.1 and minsup = 2, the result is:

[ 1  ] ==> [2  ]  sup=2  conf=0.6666666666666666
[ 1  ] ==> [4  ]  sup=3  conf=1.0
[ 1  ] ==> [2 4  ]  sup=2  conf=0.6666666666666666

Performance

The itemset-tree is an efficient data structure for the case of a database that needs to be updated frequently and where targeted queries need to be performed. For details about its complexity in terms of space and time, please refer to the article by Kubat et al., which provides an extensive discussion of the complexity.

Where can I get more information about the Itemset-tree data structure and related algorithms?

This article describes the itemset-tree and related algorithms for querying it:

Miroslav Kubat, Aladdin Hafez, Vijay V. Raghavan, Jayakrishna R. Lekkala, Wei Kian Chen: Itemset Trees for Targeted Association Querying. IEEE Trans. Knowl. Data Eng. 15(6): 1522-1534 (2003)

Example 20 : Mining Frequent Itemsets with Multiple Support Thresholds Using the MSApriori Algorithm

How to run this example?

What is MSApriori?

MSApriori is an algorithm for mining frequent itemsets by using multiple minimum supports. It is a generalization of the Apriori algorithm, which uses a single minimum support threshold.

The idea behind MSApriori is that different minimum supports could be used to consider the fact that some items are less frequent than others in a dataset.

What is the input of this algorithm?

The input of MSApriori is a transaction database and two parameters named beta (a value between 0 and 1) and LS (a value between 0 and 1). These parameters are used to determine a minimum support for each item.

A transaction database is a set of transactions. Each transaction is a set of items. For example, consider the following transaction database. It contains 6 transactions (t1, t2, ..., t5, t6) and 5 items (1,2, 3, 4, 5). For example, the first transaction represents the set of items 1, 2, 4 and 5. This database is provided as the file contextIGB.txt in the SPMF distribution. It is important to note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted by lexicographical order in a transaction.

Transaction id Items
t1 {1, 2, 4, 5}
t2 {2, 3, 5}
t3 {1, 2, 4, 5}
t4 {1, 2, 3, 5}
t5 {1, 2, 3, 4, 5}
t6 {2, 3, 4}

What is the output of this algorithm?

The output of MSApriori is the set of all frequent itemsets contained in the database.

Contrarily to the original Apriori algorithm, MSApriori uses multiple minimum support thresholds instead of just one. In fact, MSApriori uses a minimum support value for each item. Because it would be time-consuming to set a minimum support threshold for each item of a large database, the thresholds are determined automatically by using two user-specified parameters named beta (0 <= B <= 1) and LS (0 <= LS <= 1).

The minimum support of an item k is then defined as the greatest value between (1) B × f(k), where f(k) is the support of item k in the database, and (2) the value LS.

Note that if B is set to 0, there will be a single minimum support for all items and this will be equivalent to the regular Apriori algorithm.
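
As an illustration, here is a minimal Java sketch (hypothetical names) of this computation, assuming MIS(k) = max(B × f(k), LS), where f(k) is the support of item k expressed as a fraction of the transactions:

public class MisExample {
    static double mis(double beta, double ls, double frequency) {
        // The minimum support of an item is the greatest of the two values.
        return Math.max(beta * frequency, ls);
    }

    public static void main(String[] args) {
        // Item 1 appears in 4 of the 6 transactions of the example database.
        double f1 = 4.0 / 6.0;
        // With beta = 0.4 and LS = 0.2: max(0.4 * 0.666..., 0.2) = 0.266...
        System.out.println("MIS(1) = " + mis(0.4, 0.2, f1));
    }
}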

The support of an itemset is the number of transactions containing the itemset divided by the total number of transactions. An itemset is a frequent itemset if its support is higher or equal to the smallest minimum support threshold from the minimum support thresholds of all its items.

Why is MSApriori useful? It is useful because it allows discovering frequent itemsets containing rare items (if their minimum support is set low).

If we run MSApriori on the previous transaction database with beta = 0.4 and LS = 0.2, we obtain the following result:

1 supp: 4
2 supp: 6
3 supp: 4
4 supp: 4
5 supp: 5
1 2  Support: 4
1 3  Support: 2
1 4  Support: 3
1 5  Support: 4
2 3  Support: 4
2 4  Support: 4
2 5  Support: 5
3 4  Support: 2
3 5  Support: 3
4 5  Support: 3
1 2 3  Support: 2
1 2 4  Support: 3
1 2 5  Support: 4
1 3 5  Support: 2
1 4 5  Support: 3
2 3 4  Support: 2
2 3 5  Support: 3
2 4 5  Support: 3
1 2 3 5  Support: 2
1 2 4 5  Support: 3
Note that here the support is expressed by an integer value which represents the number of transactions containing the itemset. For example, the itemset {2, 3, 5} has a support of 3 because it appears in three transactions, namely t2, t4 and t5. This integer value can be converted to a percentage by dividing it by the total number of transactions.

Performance

MSApriori is one of the first algorithms for mining itemsets with multiple minimum support thresholds. It is not the most efficient algorithm for this task because it is based on Apriori and thus suffers from the same limitations. If performance is important, it is recommended to use CFPGrowth, which is based on FPGrowth and is more efficient.

Note that there is one important difference between the input of CFPGrowth and MSApriori in SPMF. MSApriori sets the multiple minimum supports by using the LS and BETA values. The CFPGrowth implementation uses a list of minimum support values stored in a text file instead.

Where can I get more information about the MSApriori algorithm?

This article describes the MSApriori algorithm:

B. Liu, W. Hsu, Y. Ma, "Mining Association Rules with Multiple Minimum Supports" Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD-99), August 15-18, 1999, San Diego, CA, USA.

Example 21 : Mining Frequent Itemsets with Multiple Support Thresholds Using the CFPGrowth Algorithm

How to run this example?

What is CFPGrowth?

CFPGrowth is an algorithm for mining frequent itemsets by using multiple minimum supports. It is a generalization of the FPGrowth algorithm, which uses a single minimum support threshold.

What is the input of this algorithm?

The input of CFPGrowth is a transaction database and a list of minimum support thresholds indicating the minimum support threshold for each item.

A transaction database is a set of transactions, where each transaction is a list of distinct items (symbols). For example, let's consider the following transaction database. It consists of 5 transactions (t1, t2, ..., t5) and 8 items (1, 2, 3, 4, 5, 6, 7, 8). For instance, transaction t1 is the set of items {1, 3, 4, 6}. This database is provided in the file "contextCFPGrowth.txt" of the SPMF distribution. It is important to note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted by lexicographical order in a transaction.

Transaction ID items
t1 {1, 3, 4,6}
t2 {1, 3, 5, 6, 7}
t3 {1, 2, 3, 6, 8}
t4 {2, 6, 7}
t5 {2, 3}

The list of minimum support thresholds is stored in a text file that is read as input by the algorithm. It is provided in the file "MIS.txt":

item minimum support threshold
1 1
2 2
3 3
4 3
5 2
6 3
7 2
8 1

This file indicates, for example, that the minimum support threshold to be used for item 6 is 3.

What is the output of this algorithm?

The output of CFPgrowth is the set of all frequent itemsets contained in the database.

What is a frequent itemset? The support of an itemset is the number of transactions containing the itemset. An itemset is a frequent itemset if its support is higher or equal to the smallest minimum support threshold among the minimum support thresholds of its items. For example, the itemset {1 2 8} is frequent because it appears in one transaction (t3) and its support of 1 is thus no less than the smallest minimum support among those of item 1, item 2 and item 8, which are respectively 1, 2 and 1.
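
As an illustration, here is a minimal Java sketch (hypothetical names) of this test for the itemset {1, 2, 8} and the MIS.txt values shown above:

import java.util.*;

public class MultipleMisCheck {
    public static void main(String[] args) {
        // Minimum support threshold of each item, as in MIS.txt.
        Map<Integer, Integer> mis = Map.of(1, 1, 2, 2, 3, 3, 4, 3,
                                           5, 2, 6, 3, 7, 2, 8, 1);
        Set<Integer> itemset = Set.of(1, 2, 8);
        int support = 1; // {1, 2, 8} appears only in transaction t3

        // The smallest MIS among items 1, 2 and 8 is min(1, 2, 1) = 1.
        int smallestMis = itemset.stream().mapToInt(mis::get).min().getAsInt();
        System.out.println("frequent = " + (support >= smallestMis)); // true
    }
}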

Why is CFPGrowth useful? It is useful because it permits setting lower minimum support thresholds for rare items. Therefore, it allows discovering frequent itemsets containing rare items.

If we run CFPGrowth on the previous transaction database with the MIS.txt file previously described, we get the following result, where each line represents an itemset followed by ":" and then its absolute support:

8:1
8 1:1
8 1 2:1 // for example, this itemset is {1, 2, 8}, and it has a support of 1.
8 1 2 6:1
8 1 2 6 3:1
8 1 2 3:1
8 1 6:1
8 1 6 3:1
8 1 3:1
8 2:1
8 2 6:1
8 2 6 3:1
8 2 3:1
8 6:1
8 6 3:1
8 3:1
1:3 // for example, this itemset is {1}, and it has a support of 3.
1 7:1
1 7 5:1
1 7 5 6:1
1 7 5 6 3:1
1 7 5 3:1
1 7 6:1
1 7 6 3:1
1 7 3:1
1 5:1
1 5 6:1
1 5 6 3:1
1 5 3:1
1 2:1
1 2 6:1
1 2 6 3:1
1 2 3:1
1 6:3
1 6 4:1
1 6 4 3:1
1 6 3:3
1 4:1
1 4 3:1
1 3:3
7:2
7 6:2
2:3
2 6:2
2 3:2
6:4
6 3:3
3:4
Note: if you are using the GUI version of SPMF, the file containing the minimum support thresholds must be located in the same folder as the input file containing the transaction database.

Performance

CFPGrowth is a very efficient algorithm. It is based on FPGrowth.

SPMF also offers the MSApriori algorithm, which is less efficient than CFPGrowth. Note that there is one important difference between the input of CFPGrowth and MSApriori in SPMF. The MSApriori algorithm sets the multiple minimum supports by using special parameters named LS and BETA (see the example describing MSApriori for more details). The CFPGrowth implementation instead uses a list of minimum support values stored in a text file.

Where can I get more information about the CFPGrowth algorithm?

This article describes CFPGrowth:

Y.-H. Hu, Y.-L. Chen: Mining association rules with multiple minimum supports: a new mining algorithm and a support tuning mechanism. Decision Support Systems 42(1): 1-24 (2006)

Example 22 : Mining All Association Rules

How to run this example?

What is this algorithm?

It is an algorithm for discovering all association rules in a transaction database, following the two-step approach proposed by Agrawal & Srikant (1993). The first step is to discover frequent itemsets. The second step is to generate rules by using the frequent itemsets. The main difference with Agrawal & Srikant in this implementation is that FPGrowth is used to generate frequent itemsets instead of Apriori, because FPGrowth is more efficient.

What is the input?

The input is a transaction database (a.k.a binary context) and two thresholds named minsup (a value between 0 and 1) and minconf (a value between 0 and 1).

A transaction database is a set of transactions. Each transaction is a set of distinct items. For example, consider the following transaction database. It contains 6 transactions (t1, t2, ..., t5, t6) and 5 items (1,2, 3, 4, 5). For example, the first transaction represents the set of items 1, 2, 4 and 5. This database is provided as the file contextIGB.txt in the SPMF distribution. It is important to note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted by lexicographical order in a transaction.

Transaction id Items
t1 {1, 2, 4, 5}
t2 {2, 3, 5}
t3 {1, 2, 4, 5}
t4 {1, 2, 3, 5}
t5 {1, 2, 3, 4, 5}
t6 {2, 3, 4}

What is the output?

The output of an association rule mining algorithm is a set of association rules respecting the user-specified minsup and minconf thresholds. To explain how this algorithm works, it is necessary to review some definitions. An association rule X ==> Y is a relationship between two itemsets (sets of items) X and Y such that the intersection of X and Y is empty. The support of a rule is the number of transactions that contain X∪Y divided by the total number of transactions. The confidence of a rule is the number of transactions that contain X∪Y divided by the number of transactions that contain X.
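
As an illustration, here is a minimal Java sketch (hypothetical names, not the SPMF API) that computes the support and confidence of the rule {2} ==> {5} on the example database:

import java.util.*;

public class RuleMeasures {
    public static void main(String[] args) {
        List<Set<Integer>> db = List.of(
            Set.of(1, 2, 4, 5), Set.of(2, 3, 5), Set.of(1, 2, 4, 5),
            Set.of(1, 2, 3, 5), Set.of(1, 2, 3, 4, 5), Set.of(2, 3, 4));

        Set<Integer> x = Set.of(2), y = Set.of(5);
        Set<Integer> xy = new HashSet<>(x);
        xy.addAll(y);

        long supX = db.stream().filter(t -> t.containsAll(x)).count();   // 6
        long supXY = db.stream().filter(t -> t.containsAll(xy)).count(); // 5
        System.out.println("support = " + (double) supXY / db.size());   // 0.833...
        System.out.println("confidence = " + (double) supXY / supX);     // 0.833...
    }
}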

If we apply an association rule mining algorithm, it will return all the rules having a support and confidence respectively no less than minsup and minconf.

For example, by applying the algorithm with minsup = 0.5 (50 %) and minconf = 0.6 (60 %), we obtain 55 association rules (run the example in the SPMF distribution to see the result).

Implementation details

Association rule mining is traditionally performed in two steps: (1) mining frequent itemsets and (2) generating association rules by using the frequent itemsets. In this implementation, we use the FPGrowth algorithm for Step 1 because it is very efficient. For Step 2, we use the algorithm that was proposed by Agrawal & Srikant (1994).

Note that SPMF also offers the alternative of choosing Apriori instead of FPGrowth for Step 1. This is called the "Apriori_association_rules" algorithm in the graphical user interface and command line interface.

Where can I get more information about this algorithm?

The following technical report published in 1994 describes how to generate association rules from frequent itemsets (Step 2):

R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. Research Report RJ 9839, IBM Almaden Research Center, San Jose, California, June 1994.

You can also read chapter 6 of the book "Introduction to Data Mining", which provides a nice and easy to understand introduction to discovering frequent itemsets and generating association rules.

The following article describes the FPGrowth algorithm for mining frequent itemsets:

Jiawei Han, Jian Pei, Yiwen Yin, Runying Mao: Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach. Data Min. Knowl. Discov. 8(1): 53-87 (2004)

Example 23 : Mining All Association Rules with the Lift Measure

How to run this example?

What is this algorithm?

This is a variation of the algorithm for mining all association rules from a transaction database, described in the previous example.

Traditionally, association rule mining is performed by using two interestingness measures named the support and confidence to evaluate rules. In this example, we show how to use another popular measure called the lift (a.k.a. interest).

What is the input?

The input is a transaction database (a.k.a. binary context) and three thresholds named minsup (a value between 0 and 1), minconf (a value between 0 and 1) and minlift (a value between -infinity and +infinity).

A transaction database is a set of transactions. Each transaction is a set of distinct items. For example, consider the following transaction database. It contains 6 transactions (t1, t2, ..., t5, t6) and 5 items (1,2, 3, 4, 5). For example, the first transaction represents the set of items 1, 2, 4 and 5. This database is provided as the file contextIGB.txt in the SPMF distribution. It is important to note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted by lexicographical order in a transaction.

Transaction id Items
t1 {1, 2, 4, 5}
t2 {2, 3, 5}
t3 {1, 2, 4, 5}
t4 {1, 2, 3, 5}
t5 {1, 2, 3, 4, 5}
t6 {2, 3, 4}

What is the output ?

The output of this algorithm is a set of all the association rules that have a support, confidence and lift respectively higher than minsup, minconf and minlift.

The lift of a rule X ==> Y is calculated as lift(X ==> Y) = (sup(X∪Y) / N) / ((sup(X) / N) × (sup(Y) / N)), where sup(Z) denotes the number of transactions containing the itemset Z and N is the total number of transactions.

The confidence of a rule X ==> Y is calculated as conf(X ==> Y) = sup(X∪Y) / sup(X).

The support of a rule X ==> Y is defined as sup(X ==> Y) = sup(X∪Y) / N.
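
As an illustration, here is a minimal Java sketch (hypothetical names) of the lift computation, applied to the rule 4 5 ==> 1 of the output below (in the example database, sup({1, 4, 5}) = 3, sup({4, 5}) = 3, sup({1}) = 4 and N = 6):

public class LiftExample {
    static double lift(int supXY, int supX, int supY, int n) {
        // lift(X ==> Y) = (sup(XUY)/N) / ((sup(X)/N) * (sup(Y)/N))
        return ((double) supXY / n) / (((double) supX / n) * ((double) supY / n));
    }

    public static void main(String[] args) {
        System.out.println(lift(3, 3, 4, 6)); // prints 1.5
    }
}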

By applying the algorithm with minsup = 0.5, minconf = 0.9 and minlift = 1 on the previous database, we obtain 18 association rules:

rule 0: 4 ==> 2 support: 0.666 (4/6) confidence: 1.0 lift: 1.0
rule 1: 3 ==> 2 support: 0.666 (4/6) confidence: 1.0 lift: 1.0
rule 2: 1 ==> 5 support: 0.666 (4/6) confidence: 1.0 lift: 1.2
rule 3: 1 ==> 2 support: 0.666 (4/6) confidence: 1.0 lift: 1.0
rule 4: 5 ==> 2 support: 0.833 (5/6) confidence: 1.0 lift: 1.0
rule 5: 4 5 ==> 2 support: 0.5 (3/6) confidence: 1.0 lift: 1.0
rule 6: 1 4 ==> 5 support: 0.5 (3/6) confidence: 1.0 lift: 1.2
rule 7: 4 5 ==> 1 support: 0.5 (3/6) confidence: 1.0 lift: 1.5
rule 8: 1 4 ==> 2 support: 0.5 (3/6) confidence: 1.0 lift: 1.0
rule 9: 3 5 ==> 2 support: 0.5 (3/6) confidence: 1.0 lift: 1.0
rule 10: 1 5 ==> 2 support: 0.666 (4/6) confidence: 1.0 lift: 1.0
rule 11: 1 2 ==> 5 support: 0.666 (4/6) confidence: 1.0 lift: 1.2
rule 12: 1 ==> 2 5 support: 0.666 (4/6) confidence: 1.0 lift: 1.2
rule 13: 1 4 5 ==> 2 support: 0.5 (3/6) confidence: 1.0 lift: 1.0
rule 14: 1 2 4 ==> 5 support: 0.5 (3/6) confidence: 1.0 lift: 1.2
rule 15: 2 4 5 ==> 1 support: 0.5 (3/6) confidence: 1.0 lift: 1.5
rule 16: 4 5 ==> 1 2 support: 0.5 (3/6) confidence: 1.0 lift: 1.5
rule 17: 1 4 ==> 2 5 support: 0.5 (3/6) confidence: 1.0 lift: 1.5

How to interpret the results?

For an association rule X ==> Y, if the lift is equal to 1, it means that X and Y are independent. If the lift is higher than 1, it means that X and Y are positively correlated. If the lift is lower than 1, it means that X and Y are negatively correlated. For example, if we consider the rule {1, 4} ==> {2, 5}, it has a lift of 1.5, which means that the occurrence of the itemset {1, 4} is positively correlated with the occurrence of {2, 5}.

Implementation details

In the source code version of SPMF, there are two versions of this algorithm. The first version saves the result into memory ("MainTestAllAssociationRules_FPGrowth_wifthLift"). The second one saves the result to a file ("MainTestAllAssociationRules_FPGrowth_saveToFile_wifthLift").

Where can I get more information about this algorithm?

The following technical report published in 1994 describes how to generate association rules from frequent itemsets (Step 2):

R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. Research Report RJ 9839, IBM Almaden Research Center, San Jose, California, June 1994.

You can also read chapter 6 of the book "Introduction to Data Mining", which provides a nice and easy to understand introduction to discovering frequent itemsets and generating association rules, and also describes the advantages of using the lift measure.

The following article describes the FPGrowth algorithm for mining frequent itemsets:

Jiawei Han, Jian Pei, Yiwen Yin, Runying Mao: Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach. Data Min. Knowl. Discov. 8(1): 53-87 (2004)

Example 24 : Mining the IGB basis of Association Rules

How to run this example?

What is this algorithm?

This algorithm mines a subset of all association rules that is called IGB association rules (Informative and Generic Basis of Association Rules) from a transaction database.

To discover the IGB association rules, this algorithm performs two steps: (1) it first discovers closed itemsets and their associated generators by applying the Zart algorithm; (2) it then generates association rules by using the closed itemsets and generators.

What is the input?

The input is a transaction database and two thresholds named minsup (a value between 0 and 1) and minconf (a value between 0 and 1).

A transaction database is a set of transactions. Each transaction is a set of distinct items. For example, consider the following transaction database. It contains 6 transactions (t1, t2, ..., t5, t6) and 5 items (1,2, 3, 4, 5). For example, the first transaction represents the set of items 1, 2, 4 and 5. This database is provided as the file contextIGB.txt of the SPMF distribution. It is important to note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted by lexicographical order in a transaction.

Transaction id Items
t1 {1, 2, 4, 5}
t2 {2, 3, 5}
t3 {1, 2, 4, 5}
t4 {1, 2, 3, 5}
t5 {1, 2, 3, 4, 5}
t6 {2, 3, 4}

What is the output?

The output is the IGB basis of association rules. It is a compact set of association rules that is both informative and generic. To explain what the IGB basis of association rules is, it is necessary to review some definitions. An itemset is a group of items. The support of an itemset is the number of times that it appears in the database divided by the total number of transactions in the database. For example, the itemset {1, 3} has a support of 33 % because it appears in 2 out of the 6 transactions of the database.

An association rule X--> Y is an association between two itemsets X and Y that are disjoint. The support of an association rule is the number of transactions that contains X and Y divided by the total number of transactions. The confidence of an association rule is the number of transactions that contains X and Y divided by the number of transactions that contains X.

A closed itemset is an itemset that is strictly included in no itemset having the same support. An itemset Y is the closure of an itemset X if Y is a closed itemset, X is a subset of Y and X and Y have the same support. A generator Y of a closed itemset X is an itemset such that (1) it has the same support as X and (2) it does not have a subset having the same support.

The IGB set of association rules is the set of association rules of the form X ==> Y - X, where X is a minimal generator of Y, Y is a closed itemset having a support higher or equal to minsup, and the confidence of the rule is higher or equal to minconf.

For example, by applying the IGB algorithm on the transaction database previously described with minsup = 0.50 and minconf = 0.61, we obtain the following set of association rules:

Rule Support Confidence
1 ==> 2, 4, 5 0.50 0.75
4 ==> 1, 2, 5 0.50 0.75
3 ==> 2, 5 0.50 0.75
{} ==> 2, 3 0.66 0.66
{} ==> 1, 2, 5 0.66 0.66
{} ==> 2, 4 0.66 0.66
{} ==> 2, 5 0.83 0.83
{} ==> 2 1 1

Where can I get more information about IGB association rules?

This article describes IGB rules:

G. Gasmi, S. Ben Yahia, E. Mephu Nguifo, Y. Slimani: IGB: A New Informative Generic Base of Association Rules. PAKDD 2005: 81-90

Example 25 : Mining Perfectly Sporadic Association Rules

How to run this example?

What is this algorithm?

This is an algorithm for mining perfectly sporadic association rules. The algorithm first uses AprioriInverse to generate perfectly rare itemsets. Then, it uses these itemsets to generate the association rules.

What is the input?

The input of this algorithm is a transaction database and three thresholds named minsup, maxsup and minconf. A transaction database is a set of transactions. A transaction is a set of distinct items (symbols), assumed to be sorted in lexical order. For example, the following transaction database contains 5 transactions (t1, t2, ..., t5) and 5 items (1, 2, 3, 4, 5). This database is provided in the file "contextInverse.txt" of the SPMF distribution:

Transaction id Items
t1 {1, 2, 4, 5}
t2 {1, 3}
t3 {1, 2, 3, 5}
t4 {2, 3}
t5 {1, 2, 4, 5}

What is the output?

The output is the set of perfectly sporadic association rules respecting the minconf (a value in [0,1]), minsup (a value in [0,1]) and maxsup (a value in [0,1]) parameters.

To explain what a perfectly sporadic association rule is, we need to review some definitions. An itemset is an unordered set of distinct items. The support of an itemset is the number of transactions that contain the itemset divided by the total number of transactions. For example, the itemset {1, 2} has a support of 60 % because it appears in 3 transactions out of 5 (it appears in t1, t3 and t5). A frequent itemset is an itemset that has a support no less than the maxsup parameter.

A perfectly rare itemset (a.k.a. sporadic itemset) is an itemset that is not a frequent itemset and whose subsets are all not frequent itemsets. Moreover, it has to have a support higher or equal to the minsup threshold.

An association rule X==>Y is a relationship between two itemsets (sets of items) X and Y such that the intersection of X and Y is empty. The support of a rule is the number of transactions that contains X∪Y divided by the total number of transactions. The confidence of a rule is the number of transactions that contains X∪Y divided by the number of transactions that contain X.

A perfectly sporadic association rule X==>Y is an association rule such that the confidence is higher or equal to minconf and the support of any non empty subset of X∪Y is lower than maxsup.

For example, let's apply the algorithm with minsup = 0.1 %, maxsup of 60 % and minconf = 60 %.

The first step that the algorithm performs is to apply the AprioriInverse algorithm with minsup = 0.1 % and maxsup = 60 %. The result is the following set of perfectly rare itemsets:

Perfectly Rare Itemsets Support
{3} 60 %
{4} 40 %
{5} 60 %
{4, 5} 40 %
{3, 5} 20 %

Then, the second step is to generate all perfectly sporadic association rules respecting minconf by using the perfectly rare itemsets found in the first step. The result is:

Rule Support Confidence
5 ==> 4 40 % 66.6 %
4 ==> 5 40 % 100 %

How to interpret the result?

For example, consider the rule 5 ==> 4. It means that if item 5 appears in a transaction, it is likely to be associated with item 4 with a confidence of 66.6 % (because 4 and 5 appear together in 66.6 % of the transactions where 5 appears). Moreover, this rule has a support of 40 % because the pair {4, 5} appears in 40 % of the transactions of this database.

Where can I get more information about this algorithm?

The AprioriInverse algorithm and how to generate sporadic rules are described in this paper:

Yun Sing Koh, Nathan Rountree: Finding Sporadic Rules Using Apriori-Inverse. PAKDD 2005: 97-106

Example 26 : Mining Closed Association Rules

How to run this example?

What is this algorithm?

It is an algorithm for mining "closed association rules", which are a concise subset of all association rules.

What is the input of this algorithm?

The input is a transaction database (a.k.a. binary context) and two thresholds named minsup (a value in [0,1] that represents a percentage) and minconf (a value in [0,1] that represents a percentage).

A transaction database is a set of transactions. Each transaction is a set of distinct items. For example, consider the following transaction database. It contains 5 transactions (t1, t2, ..., t5) and 5 items (1,2, 3, 4, 5). For example, the first transaction represents the set of items 1, 2, 4 and 5. This database is provided as the file contextZart.txt in the SPMF distribution. It is important to note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted by lexicographical order in a transaction.

Transaction id Items
t1 {1, 2, 4, 5}
t2 {1, 3}
t3 {1, 2, 3, 5}
t4 {2, 3, 5}
t5 {1, 2, 3, 5}

What is the output of this algorithm?

Given the minimum support threshold (minsup) and minimum confidence threshold (minconf) set by the user, the algorithm returns the set of closed association rules that respect these thresholds. To explain what is a closed association rule, it is necessary to review some definitions.

An itemset is an unordered set of distinct items. The support of an itemset is the number of transactions that contain the itemset divided by the total number of transactions. For example, the itemset {1, 2} has a support of 60 % because it appears in 3 transactions out of 5 (it appears in t1, t3 and t5). A closed itemset is an itemset that is strictly included in no itemset having the same support.

An association rule X==>Y is a relationship between two itemsets (sets of items) X and Y such that the intersection of X and Y is empty. The support of a rule X==>Y is the number of transactions that contains X∪Y divided by the total number of transactions. The confidence of a rule X==>Y is the number of transactions that contains X∪Y divided by the number of transactions that contain X. A closed association rule is an association rule of the form X ==> Y such that the union of X and Y is a closed itemset.
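
As an illustration, here is a minimal Java sketch (hypothetical names, not the SPMF API) that checks whether the itemset {2, 5} is closed in the example database. It is sufficient to test the supersets obtained by adding a single item, because the support can only decrease when items are added:

import java.util.*;

public class ClosedCheck {
    static long support(List<Set<Integer>> db, Set<Integer> itemset) {
        return db.stream().filter(t -> t.containsAll(itemset)).count();
    }

    public static void main(String[] args) {
        List<Set<Integer>> db = List.of(
            Set.of(1, 2, 4, 5), Set.of(1, 3), Set.of(1, 2, 3, 5),
            Set.of(2, 3, 5), Set.of(1, 2, 3, 5));

        Set<Integer> x = Set.of(2, 5);
        long supX = support(db, x); // 4
        boolean closed = true;
        for (int item : Set.of(1, 2, 3, 4, 5)) {
            if (x.contains(item)) continue;
            Set<Integer> superset = new HashSet<>(x);
            superset.add(item);
            // A strict superset with the same support means x is not closed.
            if (support(db, superset) == supX) closed = false;
        }
        System.out.println("{2, 5} closed = " + closed); // true
    }
}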

The algorithm returns all closed association rules such that their support and confidence are respectively higher or equal to the minsup and minconf thresholds set by the user.

For instance, by applying this algorithm with minsup = 60 % and minconf = 60 %, we obtain 16 closed association rules:

1 ==> 3 #SUP: 3 #CONF: 0.75 // which means that this rule has a support of 3 transactions and a confidence of 75 %
3 ==> 1 #SUP: 3 #CONF: 0.75 // which means that this rule has a support of 3 transactions and a confidence of 75 %
2 ==> 5 #SUP: 4 #CONF: 1.0 // which means that this rule has a support of 4 transactions and a confidence of 100 %
5 ==> 2 #SUP: 4 #CONF: 1.0 // ...
2 5 ==> 1 #SUP: 3 #CONF: 0.75
1 5 ==> 2 #SUP: 3 #CONF: 1.0
1 2 ==> 5 #SUP: 3 #CONF: 1.0
1 ==> 2 5 #SUP: 3 #CONF: 0.75
2 ==> 1 5 #SUP: 3 #CONF: 0.75
5 ==> 1 2 #SUP: 3 #CONF: 0.75
3 5 ==> 2 #SUP: 3 #CONF: 1.0
2 3 ==> 5 #SUP: 3 #CONF: 1.0
2 5 ==> 3 #SUP: 3 #CONF: 0.75
5 ==> 2 3 #SUP: 3 #CONF: 0.75
3 ==> 2 5 #SUP: 3 #CONF: 0.75
2 ==> 3 5 #SUP: 3 #CONF: 0.75

Where can I get more information about closed association rules?

The following Ph.D. thesis proposed "closed association rules".

Szathmary, L. (2006). Symbolic Data Mining Methods with the Coron Platform. PhD thesis, University Henri Poincaré (Nancy 1), France.

Example 27 : Mining Minimal Non Redundant Association Rules

How to run this example?

What is this algorithm?

This algorithm discovers the set of "minimal non redundant association rules" (Kryszkiewicz, 1998), which is a lossless and compact set of association rules.

In this implementation we use the Zart algorithm for discovering closed itemsets and their associated generators. Then, this information is used to generate the "minimal non redundant association rules".

What is the input?

The input is a transaction database (a.k.a. binary context), a threshold named minconf (a value in [0,1] that represents a percentage) and a threshold named minsup (a value in [0,1] that represents a percentage).

A transaction database is a set of transactions. Each transaction is a set of items. For example, consider the following transaction database. It contains 5 transactions (t1, t2, ..., t5) and 5 items (1,2, 3, 4, 5). For example, the first transaction represents the set of items 1, 2, 4 and 5. This database is provided as the file contextZart.txt in the SPMF distribution. It is important to note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted by lexicographical order in a transaction.

Transaction id Items
t1 {1, 2, 4, 5}
t2 {1, 3}
t3 {1, 2, 3, 5}
t4 {2, 3, 5}
t5 {1, 2, 3, 5}

What is the output?

This algorithm returns the set of minimal non redundant association rules.

To explain what the set of minimal non redundant association rules is, it is necessary to review some definitions. An itemset is a set of distinct items. The support of an itemset is the number of times that it appears in the database divided by the total number of transactions in the database. For example, the itemset {1, 3} has a support of 60 % because it appears in 3 out of the 5 transactions of this database.

An association rule X--> Y is an association between two itemsets X and Y that are disjoint. The support of an association rule is the number of transactions that contains X and Y divided by the total number of transactions. The confidence of an association rule is the number of transactions that contains X and Y divided by the number of transactions that contains X.

A closed itemset is an itemset that is strictly included in no itemset having the same support. An itemset Y is the closure of an itemset X if Y is a closed itemset, X is a subset of Y and X and Y have the same support. A generator Y of a closed itemset X is an itemset such that (1) it has the same support as X and (2) it does not have a subset having the same support.

The set of minimal non redundant association rules is defined as the set of association rules of the form P1 ==> P2 / P1, where P1 is a generator of P2, P2 is a closed itemset, and the rule has a support and confidence respectively no less than minsup and minconf.

For example, by applying this algorithm with minsup = 60 % and minconf = 60 % on the previous database, we obtain 12 minimal non redundant association rules:

  rule 0:  3  ==> 1 support :  60% (3/5) confidence :  75%
  rule 1:  3  ==> 2 5 support :  60% (3/5) confidence :  75%
  rule 2:  1 2  ==> 5 support :  60% (3/5) confidence :  100%
  rule 3:  1 5  ==> 2 support :  60% (3/5) confidence :  100%
  rule 4:  2  ==> 1 5 support :  60% (3/5) confidence :  75%
  rule 5:  2  ==> 5 support :  80% (4/5) confidence :  100%
  rule 6:  2  ==> 3 5 support :  60% (3/5) confidence :  75%
  rule 7:  5  ==> 2 support :  80% (4/5) confidence :  100%
  rule 8:  1  ==> 2 5 support :  60% (3/5) confidence :  75%
  rule 9:  1  ==> 3 support :  60% (3/5) confidence :  75%
  rule 10:  2 3  ==> 5 support :  60% (3/5) confidence :  100%
  rule 11:  3 5  ==> 2 support :  60% (3/5) confidence :  100%

Where can I get more information about minimal non redundant association rules?

The following article provides detailed information about Minimal Non Redundant Association Rules:

M. Kryszkiewicz (1998). Representative Association Rules and Minimum Condition Maximum Consequence Association Rules. Proc. of PKDD '98, Nantes, France, September 23-26.

Example 28 : Mining Indirect Association Rules with the INDIRECT algorithm

How to run this example?

What is the INDIRECT algorithm?

Indirect (Tan et al., PKDD 2000; Tan, Steinbach & Kumar, 2006, p. 469) is an algorithm for discovering indirect associations between items in transaction databases.

Why is this algorithm important? Because traditional association rule mining algorithms focus on direct associations between itemsets. This algorithm can discover indirect associations, which can be useful in domains such as biology. Indirect association rule mining has various applications such as stock market analysis and competitive product analysis (Tan et al., 2000).

What is the input?

The input of the indirect algorithm is a transaction database and three parameters named minsup (a value in [0,1] that represents a percentage), ts (a value in [0,1] that represents a percentage) and minconf (a value in [0,1] that represents a percentage).

A transaction database is a set of transactions. Each transaction is a set of distinct items. For example, consider the following transaction database. It contains 5 transactions (t1, t2, ..., t5) and 5 items (1,2, 3, 4, 5). For example, the first transaction represents the set of items 1, 4 and 5. This database is provided as the file contextIndirect.txt in the SPMF distribution. It is important to note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted by lexicographical order in a transaction.

Transaction id Items
t1 {1, 4, 5}
t2 {2, 3, 4}
t3 {1, 2, 4, 5}
t4 {5}
t5 {1, 2, 4, 5}

The three numeric parameters of the indirect algorithm are minsup, ts and minconf; their roles are explained through the conditions given below.

What is the output?

The result is all indirect associations respecting the parameters minsup, ts and minconf. An indirect association has the form {x,y} ==> M, where x and y are single items and M is an itemset called the "mediator".

An indirect association {x, y} ==> M has to respect, roughly, the following conditions (see Tan et al., 2000 for the formal definition): (1) x and y rarely appear together, that is, the support of {x, y} is lower than ts; (2) x and y both frequently appear with the mediator, that is, the supports of {x} ∪ M and {y} ∪ M are no less than minsup; (3) x and y are each strongly associated with the mediator, which in this implementation means that the confidence of the associations between {x} and M and between {y} and M is no less than minconf.

For example, by applying the indirect algorithm with minsup = 60 %, ts = 50 % and minconf= 10%, we obtain 3 indirect association rules:

  1. {1, 2 | {4}}, which means that 1 and 2 are indirectly associated by the mediator {4}.
  2. {1, 5 | {4}}, which means that 1 and 5 are indirectly associated by the mediator {4}.
  3. {2, 5 | {4}}, which means that 2 and 5 are indirectly associated by the mediator {4}.

To see additional details about each of these three indirect rules, run this example.

Implementation details

The implementation attempts to be as faithful as possible to the original algorithm, except that the confidence is used instead of the IS measure.

Note that some algorithms, such as HI-Mine, are claimed to be more efficient than Indirect, but they have not been implemented in SPMF.

Where can I get more information about indirect association rules?

The concept of indirect associations was proposed by Tan et al. (2000) in this conference paper:

Pang-Ning Tan, Vipin Kumar, Jaideep Srivastava: Indirect Association: Mining Higher Order Dependencies in Data. PKDD 2000: 632-637

Moreover, note that the book "Introduction to Data Mining" by Tan, Steinbach and Kumar provides an easy to read overview of indirect association rules.

Example 29 : Hiding Sensitive Association Rules with the FHSAR algorithm

How to run this example?

What is FHSAR?

FHSAR is an algorithm for hiding sensitive association rules in a transaction database.

What are the applications? For example, consider a company that wants to release a transaction database to the public, but does not want to disclose some sensitive associations between items appearing in the database that could give a competitive advantage to its competitors. The FHSAR algorithm can hide these associations by modifying the database.

What is the input?

The FHSAR algorithm is designed to hide sensitive association rules in a transaction database so that they will not be found for given minsup and minconf thresholds generally used by association rule mining algorithms. The inputs are: minsup (a value in [0,1] that represents a percentage), minconf (a value in [0,1] that represents a percentage), a transaction database and some sensitive association rules to be hidden.

A transaction database is a set of transactions. Each transaction is a set of distinct items. For example, consider the following transaction database. It contains 6 transactions (t1, t2, ..., t5, t6) and 5 items (1, 2, 3, 4, 5). For example, the first transaction represents the set of items 1, 2, 4 and 5. This database is provided as the file contextIGB.txt in the SPMF distribution. It is important to note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted by lexicographical order in a transaction.

Transaction id Items
t1 {1, 2, 4, 5}
t2 {2, 3, 5}
t3 {1, 2, 4, 5}
t4 {1, 2, 3, 5}
t5 {1, 2, 3, 4, 5}
t6 {2, 3, 4}

An association rule X ==> Y is an association between two sets of items X and Y such that X and Y are disjoint. The support of an association rule X ==> Y is the number of transactions that contain both X and Y divided by the total number of transactions. The confidence of an association rule X ==> Y is the number of transactions that contain both X and Y divided by the number of transactions that contain X. For example, the rule {1 2} ==> {4 5} has a support of 50 % because it appears in 3 transactions out of 6. Furthermore, it has a confidence of 75 % because {1 2} appears in 4 transactions and {1, 2, 4, 5} appears in 3 transactions.

What is the output?

The output is a new transaction database such that the sensitive rules will not be found if an association rule mining algorithm is applied with minsup and minconf.

For example, we can apply FHSAR with the parameters minsup = 0.5 and minconf = 0.60 to hide a set of sensitive association rules provided in the file "sar.txt". The result is a new transaction database where these rules are hidden for the given thresholds minsup and minconf:

Transaction id Items
t1 {4, 5}
t2 {3, 5}
t3 {4, 5}
t4 {1, 2, 3, 5}
t5 {1, 2, 3, 4, 5}
t6 {2, 3, 4}

Note that the result of the algorithm is not always the same, because I use the HashSet data structure to represent transactions internally and this data structure does not keep the order of items. Therefore, the items that are removed may not be the same if the algorithm is run twice.

Where can I get more information about the FHSAR algorithm?

This algorithm was proposed in this paper:

C.-C.Weng, S.-T. Chen, H.-C. Lo: A Novel Algorithm for Completely Hiding Sensitive Association Rules. ISDA (3) 2008: 202-208

Example 30 : Mining the Top-K Association Rules

How to run this example?

What is TopKRules?

TopKRules is an algorithm for discovering the top-k association rules appearing in a transaction database.

Why is it useful to discover top-k association rules? Because other association rule mining algorithms require the user to set a minimum support (minsup) parameter that is hard to set (usually, users set it by trial and error, which is time-consuming). TopKRules solves this problem by letting users directly indicate k, the number of rules to be discovered, instead of using minsup.

What is the input of TopKRules ?

TopKRules takes three parameters as input: k, the number of rules to be discovered, a transaction database, and minconf (a value in [0,1] that represents a percentage).

A transaction database is a set of transactions. Each transaction is a set of items. For example, consider the following transaction database. It contains 6 transactions (t1, t2, ..., t5, t6) and 5 items (1,2, 3, 4, 5). For example, the first transaction represents the set of items 1, 2, 4 and 5. This database is provided as the file contextIGB.txt in the SPMF distribution. It is important to note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted by lexicographical order in a transaction.

Transaction id Items
t1 {1, 2, 4, 5}
t2 {2, 3, 5}
t3 {1, 2, 4, 5}
t4 {1, 2, 3, 5}
t5 {1, 2, 3, 4, 5}
t6 {2, 3, 4}

What is the output of TopKRules ?

TopKRules outputs the top-k association rules.

To explain what are top-k association rules, it is necessary to review some definitions. An itemset is a set of distinct items. The support of an itemset is the number of times that it appears in the database divided by the total number of transactions in the database. For example, the itemset {1 3} has a support of 33 % because it appears in 2 out of 6 transactions from the database.

An association rule X--> Y is an association between two itemsets X and Y that are disjoint. The support of an association rule is the number of transactions that contains X and Y divided by the total number of transactions. The confidence of an association rule is the number of transactions that contains X and Y divided by the number of transactions that contains X.

The top-k association rules are the k most frequent association rules in the database having a confidence higher or equal to minconf.

For example, if we run TopKRules with k = 2 and minconf = 0.8, we obtain the top-2 rules in the database having a confidence higher or equal to 80 %.

For instance, the rule 2 ==> 5 means that if item 2 appears in a transaction, it is likely to be associated with item 5 with a confidence of 83 %. Moreover, this rule has a support of 83 % because it appears in five transactions (t1, t2, t3, t4 and t5) out of the six transactions contained in this database.

It is important to note that for some values of k, the algorithm may return slightly more than k rules. This can happen if several rules have exactly the same support.

Performance

TopKRules is a very efficient algorithm for mining the top-k association rules.

It has the benefit of being very intuitive to use. However, it should be noted that the problem of top-k association rule mining is more computationally expensive than the problem of association rule mining. Using TopKRules is recommended for k values of up to 5000, depending on the dataset.

Besides, note that there is a variation of TopKRules named TNR that is available in SPMF. The improvement in TNR is that it eliminates some association rules that are deemed "redundant" (rules that are included in other rules having the same support and confidence - see the TNR example for the formal definition). Using TNR is more costly than using TopKRules but it brings the benefit of eliminating a type of redundancy in results.

Where can I get more information about this algorithm?

The TopKRules algorithm was proposed in this paper:

Fournier-Viger, P., Wu, C.-W., Tseng, V. S. (2012). Mining Top-K Association Rules. Proceedings of the 25th Canadian Conf. on Artificial Intelligence (AI 2012), Springer, LNAI 7310, pp. 61-73.

Example 31 : Mining the Top-K Non-Redundant Association Rules

How to run this example?

What is TNR?

TNR is an algorithm for discovering the top-k non-redundant association rules appearing in a transaction database. It is an approximate algorithm in the sense that it always generates non-redundant rules, but these rules may not always be the top-k non-redundant association rules. TNR uses a parameter named delta, an integer >= 0 that can be used to improve the chance that the result is exact (the higher the delta value, the higher the chance that the result is exact).

Why is it important to discover top-k non-redundant association rules? Because other association rule mining algorithms require the user to set a minimum support (minsup) parameter that is hard to set (usually, users set it by trial and error, which is time-consuming). Moreover, the result of association rule mining algorithms usually contains a high level of redundancy (for example, thousands of rules can be found that are variations of other rules having the same support and confidence). The TNR algorithm provides a solution to both of these problems by letting users directly indicate k, the number of rules to be discovered, and by eliminating redundancy in the results.

What is the input of TNR ?

TNR takes four parameters as input: k, the number of rules to be discovered, a transaction database, minconf (a value in [0,1] that represents a percentage), and delta (an integer >= 0).

A transaction database is a set of transactions. Each transaction is a set of items. For example, consider the following transaction database. It contains 6 transactions (t1, t2, ..., t5, t6) and 5 items (1,2, 3, 4, 5). For example, the first transaction represents the set of items 1, 2, 4 and 5. This database is provided as the file contextIGB.txt in the SPMF distribution. It is important to note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted by lexicographical order in a transaction.

Transaction id Items
t1 {1, 2, 4, 5}
t2 {2, 3, 5}
t3 {1, 2, 4, 5}
t4 {1, 2, 3, 5}
t5 {1, 2, 3, 4, 5}
t6 {2, 3, 4}

What is the output of TNR?

TNR outputs an approximation of the k most frequent non redundant association rules having a confidence higher or equal to minconf.

To explain what top-k non-redundant association rules are, it is necessary to review some definitions. An itemset is a set of distinct items. The support of an itemset is the number of times that it appears in the database divided by the total number of transactions in the database. For example, the itemset {1, 3} has a support of 33 % because it appears in 2 out of the 6 transactions of the database.

An association rule X --> Y is an association between two itemsets X and Y that are disjoint. The support of an association rule is the number of transactions that contain both X and Y divided by the total number of transactions. The confidence of an association rule is the number of transactions that contain both X and Y divided by the number of transactions that contain X.
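To make these definitions concrete, the following small self-contained Java sketch (independent of the SPMF code base) computes the support and confidence of the rule 2 ==> 5 on the database above, which is hard-coded for illustration:

import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Computes the support and confidence of a rule X ==> Y over a
// transaction database, following the definitions above.
public class RuleMeasures {

    static int countContaining(List<Set<Integer>> database, Set<Integer> itemset) {
        int count = 0;
        for (Set<Integer> transaction : database) {
            if (transaction.containsAll(itemset)) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // The example database contextIGB.txt shown above
        List<Set<Integer>> database = List.of(
            Set.of(1, 2, 4, 5), Set.of(2, 3, 5), Set.of(1, 2, 4, 5),
            Set.of(1, 2, 3, 5), Set.of(1, 2, 3, 4, 5), Set.of(2, 3, 4));

        Set<Integer> x = Set.of(2);  // antecedent
        Set<Integer> y = Set.of(5);  // consequent
        Set<Integer> xy = new HashSet<>(x);
        xy.addAll(y);

        int supXY = countContaining(database, xy); // transactions containing X and Y
        int supX = countContaining(database, x);   // transactions containing X
        double support = supXY / (double) database.size(); // 5/6 = 0.83
        double confidence = supXY / (double) supX;         // 5/6 = 0.83
        System.out.println("sup=" + support + " conf=" + confidence);
    }
}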

An association rule ra: X → Y is redundant with respect to another rule rb: X1 → Y1 if and only if ra ≠ rb, sup(ra) = sup(rb), conf(ra) = conf(rb), X1 ⊆ X and Y ⊆ Y1 (that is, ra is included in a rule rb having the same support and confidence).

The top-k non-redundant association rules are the k most frequent non-redundant association rules in the database having a confidence higher or equal to minconf.

For example, if we run TNR with k = 10, minconf = 0.5 and delta = 2, the following set of rules is found:

4, ==> 2,  sup= 4  conf= 1.0
2, ==> 1,5, sup= 4 conf= 0.6666666666666666
2, ==> 5, sup= 5 conf= 0.8333333333333334
5, ==> 2, sup= 5 conf= 1.0
5, ==> 1,2, sup= 4 conf= 0.8
1, ==> 2,5, sup= 4 conf= 1.0
2, ==> 3, sup= 4 conf= 0.6666666666666666
2, ==> 4, sup= 4 conf= 0.6666666666666666
3, ==> 2, sup= 4 conf= 1.0
1,4, ==> 2,5, sup= 3 conf= 1.0

For instance, the association rule 2 ==> 1,5 means that if item 2 appears, it is likely to be associated with items 1 and 5 with a confidence of 66 %. Moreover, this rule has a support of 66 % (sup = 4) because it appears in four transactions (t1, t3, t4 and t5) out of the six transactions contained in this database.

Note that for some values of k and some datasets, TNR may return more than k association rules. This can happen if several rules have exactly the same support, and it is normal. It is also possible that the algorithm returns slightly fewer than k association rules in some circumstances, because the algorithm is approximate.

Performance

TNR is an efficient algorithm. It is based on the TopKRules algorithm for discovering top-k association rules. The main differences between TNR and TopKRules are that TNR includes additional strategies to eliminate redundancy in results, and that TNR is an approximate algorithm, while TopKRules is not.

TNR and TopKRules are more intuitive to use than regular association rule mining algorithms. However, it should be noted that the problem of top-k association rule mining is more computationally expensive than the problem of association rule mining. Therefore, it is recommended to use TNR or TopKRules for k values of up to 5000, depending on the dataset. If more rules are needed, it may be more efficient to mine association rules with a classical association rule mining algorithm such as FPGrowth.

Where can I get more information about this algorithm?

The TNR algorithm is described in this paper:

Fournier-Viger, P., Tseng, V.S. (2012). Mining Top-K Non-Redundant Association Rules. Proc. 20th International Symposium on Methodologies for Intelligent Systems (ISMIS 2012), Springer, LNCS 7661, pp. 31- 40.

Example 32 : Clustering Values with the K-Means algorithm

How to run this example?

What is K-Means?

K-Means is one of the most famous clustering algorithms. It is used to separate a set of vectors of double values into groups of vectors (clusters) according to their similarity. In this implementation, the Euclidean distance is used to compute the similarity.

What is the input?

K-Means takes as input a set of vectors containing one or more double values and a parameter K (a positive integer >=1) indicating the number of clusters to be created.

An example of input is provided in the file "configKmeans.txt" of the SPMF distribution. It contains 10 lines, each representing a vector of four double values separated by spaces.

1 2 3 4
1 6 8 8
1 2 3 3
2 4 5 5
4 7 8 7
7 6 8 9
4 4 3 3
2 2 5 5
7 5 5 5
5 6 8 9

For example, the first line "1 2 3 4" represents a vector with four values, namely 1, 2, 3 and 4.

What is the output?

K-Means groups vectors into clusters according to their similarity. In SPMF, the similarity is defined as the Euclidean distance. K-Means returns K clusters or less.

Note that running K-Means with the same data does not always generate the same result because K-Means initializes clusters randomly.

By running K-Means on the previous input file with K = 3, we can obtain the following clusters:

cluster 1: [1.0,2.0,3.0,4.0][1.0,2.0,3.0,3.0][2.0,4.0,5.0,5.0][4.0,4.0,3.0,3.0][2.0,2.0,5.0,5.0]
cluster 2: [7.0,6.0,8.0,9.0][1.0,6.0,8.0,8.0][4.0,7.0,8.0,7.0][5.0,6.0,8.0,9.0]
cluster 3: [7.0,5.0,5.0,5.0]
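To illustrate how such clusters are computed, here is a minimal self-contained Java sketch of the classic K-Means procedure (random initialization, then alternating assignment and mean-update steps). It only illustrates the general method described above, not the actual SPMF implementation:

import java.util.Random;

// A minimal K-Means sketch: random initial centroids, then alternate between
// assigning vectors to the nearest centroid and recomputing the means.
public class SimpleKMeans {

    static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }

    public static int[] cluster(double[][] vectors, int k, int iterations) {
        Random random = new Random();
        int dim = vectors[0].length;
        // pick k random vectors as initial centroids (duplicates possible in this sketch)
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) centroids[c] = vectors[random.nextInt(vectors.length)].clone();

        int[] assignment = new int[vectors.length];
        for (int iter = 0; iter < iterations; iter++) {
            // assignment step: nearest centroid for each vector
            for (int i = 0; i < vectors.length; i++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (euclidean(vectors[i], centroids[c]) < euclidean(vectors[i], centroids[best])) best = c;
                }
                assignment[i] = best;
            }
            // update step: recompute each centroid as the mean of its vectors
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < vectors.length; i++) {
                counts[assignment[i]]++;
                for (int d = 0; d < dim; d++) sums[assignment[i]][d] += vectors[i][d];
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) {
                    for (int d = 0; d < dim; d++) centroids[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return assignment;
    }
}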

Where can I get more information about K-Means?

K-Means was proposed by MacQueen in 1967. It is one of the most famous data mining algorithms and is described in almost all data mining books that focus on algorithms, as well as on many websites. By searching on the web, you will find plenty of resources explaining K-Means.

Example 33 : Clustering Values with a Hierarchical Clustering algorithm

How to run this example?

What is this algorithm?

We have implemented a hierarchical clustering algorithm based on the description of hierarchical clustering algorithms from
http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/hierarchical.html. The algorithm is used to separate a set of vectors of double values into groups of vectors (clusters) according to their similarity. In this implementation, the Euclidean distance is used to compute the similarity.

The algorithm works as follows. It first creates a cluster for each single vector. Then, it recursively tries to merge clusters together to create larger clusters. To determine if two clusters can be merged, a parameter maxDistance indicates the maximal distance allowed between two clusters for merging them. The distance between two clusters is calculated as the Euclidean distance between the means of the two clusters.
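As an illustration of this procedure (not the actual SPMF implementation), here is a minimal self-contained Java sketch:

import java.util.ArrayList;
import java.util.List;

// A sketch of the agglomerative procedure described above: start with one
// cluster per vector, then repeatedly merge the two closest clusters while
// the Euclidean distance between their means is at most maxDistance.
public class SimpleHierarchicalClustering {

    static double[] mean(List<double[]> cluster) {
        double[] m = new double[cluster.get(0).length];
        for (double[] v : cluster)
            for (int d = 0; d < m.length; d++) m[d] += v[d] / cluster.size();
        return m;
    }

    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int d = 0; d < a.length; d++) sum += (a[d] - b[d]) * (a[d] - b[d]);
        return Math.sqrt(sum);
    }

    public static List<List<double[]>> cluster(double[][] vectors, double maxDistance) {
        List<List<double[]>> clusters = new ArrayList<>();
        for (double[] v : vectors) { // one cluster per vector
            List<double[]> c = new ArrayList<>();
            c.add(v);
            clusters.add(c);
        }
        while (true) {
            int bestI = -1, bestJ = -1;
            double bestDist = Double.MAX_VALUE;
            // find the pair of clusters whose means are closest
            for (int i = 0; i < clusters.size(); i++) {
                for (int j = i + 1; j < clusters.size(); j++) {
                    double d = distance(mean(clusters.get(i)), mean(clusters.get(j)));
                    if (d < bestDist) { bestDist = d; bestI = i; bestJ = j; }
                }
            }
            if (bestI == -1 || bestDist > maxDistance) break; // no pair close enough
            clusters.get(bestI).addAll(clusters.remove(bestJ)); // merge the pair
        }
        return clusters;
    }
}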

What is the input?

The input is a set of vectors containing double values. For example, the file "configKmeans.txt" is a sample input file containing 10 vectors (10 lines). Each vector (line) is a list of double values separated by spaces. Each vector should have the same size. For example, the first line of the input file is "1 2 3 4". It represents a vector with four values, namely 1, 2, 3 and 4.

1 2 3 4
1 6 8 8
1 2 3 3
2 4 5 5
4 7 8 7
7 6 8 9
4 4 3 3
2 2 5 5
7 5 5 5
5 6 8 9

Furthermore, the user should also provide a parameter called maxDistance (a positive value > 0) to the algorithm. This parameter indicates the maximal distance allowed between the means of two clusters for merging them into a single cluster.

What is the output?

The output is a set of clusters (each cluster being a set of vectors). For example, by running the algorithm with "configKmeans.txt" and maxDistance = 4, we obtain the following set of clusters:

cluster 1: [1.0,2.0,3.0,4.0][1.0,2.0,3.0,3.0][2.0,4.0,5.0,5.0][2.0,2.0,5.0,5.0][4.0,4.0,3.0,3.0]
cluster 2: [1.0,6.0,8.0,8.0]
cluster 3: [4.0,7.0,8.0,7.0][7.0,6.0,8.0,9.0][5.0,6.0,8.0,9.0]
cluster 4: [7.0,5.0,5.0,5.0]

Where can I get more information about Hierarchical clustering?

There is a good introduction here:
http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/hierarchical.html.

Moreover, you could also read the free chapter on clustering from the book "introduction to data mining" by Tan, Steinbach and Kumar on the book website.

Example 34 : Mining Frequent Sequential Patterns Using the PrefixSpan Algorithm

How to run this example?

To run the implementation of PrefixSpan by P. Fournier-Viger (PFV):

To run the implementation of PrefixSpan by A. Gomariz Peñalver (AGP):

What is PrefixSpan?

PrefixSpan is an algorithm for discovering sequential patterns in sequence databases, proposed by Pei et al. (2001).

What is the input of PrefixSpan?

The input of PrefixSpan is a sequence database and a user-specified threshold named minsup (a value in [0,1] representing a percentage). Moreover, the implementation in SPMF adds another parameter, which is the maximum sequential pattern length in terms of items.

A sequence database is a set of sequences where each sequence is a list of itemsets. An itemset is an unordered set of distinct items. For example, the table shown below contains four sequences. The first sequence, named S1, contains 5 itemsets. It means that item 1 appeared, followed by items 1, 2 and 3 at the same time, followed by items 1 and 3, followed by item 4, and finally by items 3 and 6. This database is provided in the file "contextPrefixSpan.txt" of the SPMF distribution. Note that it is assumed that no item appears twice in the same itemset and that items in an itemset are sorted in lexicographical order.

ID Sequences
S1 (1), (1 2 3), (1 3), (4), (3 6)
S2 (1 4), (3), (2 3), (1 5)
S3 (5 6), (1 2), (4 6), (3), (2)
S4 (5), (7), (1 6), (3), (2), (3)

What is the output of PrefixSpan?

PrefixSpan discovers all frequent sequential patterns occurring in a sequence database (subsequences that occur in at least minsup sequences of the database).

To explain more formally what a sequential pattern is, it is necessary to review some definitions.

A sequential pattern is a sequence. A sequence SA = X1, X2, ..., Xk, where X1, X2, ..., Xk are itemsets, is said to occur in another sequence SB = Y1, Y2, ..., Ym, where Y1, Y2, ..., Ym are itemsets, if and only if there exist integers 1 <= i1 < i2 < ... < ik <= m such that X1 ⊆ Yi1, X2 ⊆ Yi2, ..., Xk ⊆ Yik.

The support of a sequential pattern is the number of sequences where the pattern occurs divided by the total number of sequences in the database.

A frequent sequential pattern is a sequential pattern having a support no less than the minsup parameter provided by the user.
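To make the notion of occurrence concrete, the following self-contained Java sketch (independent of the SPMF code base) tests whether a sequential pattern occurs in a sequence and computes its support on the example database above:

import java.util.List;
import java.util.Set;

// Checks whether a sequential pattern occurs in a sequence, following the
// definition above: each itemset Xi of the pattern must be included in some
// itemset Yj of the sequence, in order and without reusing positions.
public class SequentialPatternSupport {

    static boolean occursIn(List<Set<Integer>> pattern, List<Set<Integer>> sequence) {
        int pos = 0; // next itemset of the pattern to match
        for (Set<Integer> itemset : sequence) {
            if (pos < pattern.size() && itemset.containsAll(pattern.get(pos))) pos++;
        }
        return pos == pattern.size();
    }

    static double support(List<Set<Integer>> pattern, List<List<Set<Integer>>> database) {
        int count = 0;
        for (List<Set<Integer>> sequence : database) {
            if (occursIn(pattern, sequence)) count++;
        }
        return count / (double) database.size();
    }

    public static void main(String[] args) {
        // The sequence database shown above (contextPrefixSpan.txt)
        List<List<Set<Integer>>> database = List.of(
            List.of(Set.of(1), Set.of(1, 2, 3), Set.of(1, 3), Set.of(4), Set.of(3, 6)),
            List.of(Set.of(1, 4), Set.of(3), Set.of(2, 3), Set.of(1, 5)),
            List.of(Set.of(5, 6), Set.of(1, 2), Set.of(4, 6), Set.of(3), Set.of(2)),
            List.of(Set.of(5), Set.of(7), Set.of(1, 6), Set.of(3), Set.of(2), Set.of(3)));

        List<Set<Integer>> pattern = List.of(Set.of(1, 2), Set.of(6)); // "(1,2),(6)"
        System.out.println(support(pattern, database)); // 0.5
    }
}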

For example, if we run PrefixSpan with minsup = 50 % and with a maximum pattern length of 100 items, 53 sequential patterns are found. The list is too long to be presented here. An example of pattern found is "(1,2),(6)", which appears in the first and the third sequence (it therefore has a support of 50%). This pattern has a length of 3 because it contains three items. Another pattern is "(4), (3), (2)". It appears in the second and third sequence (it thus has a support of 50 %). It also has a length of 3 because it contains 3 items.

Performance

PrefixSpan is one of the fastest sequential pattern mining algorithms. However, the SPAM and SPADE implementations in SPMF can be faster than PrefixSpan (see the "performance" section of the website for a performance comparison).

Implementation details

Note that in the source code, we also provide examples of how to keep the result in memory instead of saving it to a file. This can be useful if the algorithms are integrated into another Java program.

Note also that in the source code, there is a version of PrefixSpan, based on the PFV version, that takes as input a dataset with strings instead of integers. It can be run by using the files MainTestPrefixSpan_WithStrings_saveToMemory.java and MainTestPrefixSpanWithStrings_saveToFile.java. For the graphical user interface version of SPMF, it is possible to use the version of PrefixSpan that uses strings instead of integers by selecting "PrefixSpan with strings" and to test it with the input file contextPrefixSpanStrings.txt. The version of PrefixSpan with strings was made to temporarily accommodate the needs of some users of SPMF. In the future, it may be replaced by a more general mechanism for using files with strings for all algorithms.

Where can I get more information about PrefixSpan?

The PrefixSpan algorithm is described in this article:

J. Pei, J. Han, B. Mortazavi-Asl, J. Wang, H. Pinto, Q. Chen, U. Dayal, M. Hsu: Mining Sequential Patterns by Pattern-Growth: The PrefixSpan Approach. IEEE Trans. Knowl. Data Eng. 16(11): 1424-1440 (2004)

Example 35 : Mining Frequent Sequential Patterns Using the GSP Algorithm

How to run this example?

What is GSP?

GSP is one of the first algorithms for discovering sequential patterns in sequence databases, proposed by Srikant & Agrawal (1996). It uses an Apriori-like approach for discovering sequential patterns. Note that this version does not include the constraints proposed in the article.

What is the input of GSP?

The input of GSP is a sequence database and a user-specified threshold named minsup (a value in [0,1] representing a percentage). Moreover, the implementation in SPMF adds another parameter, which is the maximum sequential pattern length in terms of items.

A sequence database is a set of sequences where each sequence is a list of itemsets. An itemset is an unordered set of distinct items. For example, the table shown below contains four sequences. The first sequence, named S1, contains 5 itemsets. It means that item 1 appeared, followed by items 1, 2 and 3 at the same time, followed by items 1 and 3, followed by item 4, and finally by items 3 and 6. This database is provided in the file "contextPrefixSpan.txt" of the SPMF distribution. Note that it is assumed that no item appears twice in the same itemset and that items in an itemset are sorted in lexicographical order.

ID Sequences
S1 (1), (1 2 3), (1 3), (4), (3 6)
S2 (1 4), (3), (2 3), (1 5)
S3 (5 6), (1 2), (4 6), (3), (2)
S4 (5), (7), (1 6), (3), (2), (3)

What is the output of GSP?

GSP discovers all frequent sequential patterns occurring in a sequence database (subsequences that occur in at least minsup sequences of the database).

To explain more formally what a sequential pattern is, it is necessary to review some definitions.

A sequential pattern is a sequence. A sequence SA = X1, X2, ..., Xk, where X1, X2, ..., Xk are itemsets, is said to occur in another sequence SB = Y1, Y2, ..., Ym, where Y1, Y2, ..., Ym are itemsets, if and only if there exist integers 1 <= i1 < i2 < ... < ik <= m such that X1 ⊆ Yi1, X2 ⊆ Yi2, ..., Xk ⊆ Yik.

The support of a sequential pattern is the number of sequences where the pattern occurs divided by the total number of sequences in the database.

A frequent sequential pattern is a sequential pattern having a support no less than the minsup parameter provided by the user.

For example, if we run GSP with minsup = 50 % and with a maximum pattern length of 100 items, 53 sequential patterns are found. The list is too long to be presented here. An example of pattern found is "(1,2),(6)", which appears in the first and the third sequence (it therefore has a support of 50%). This pattern has a length of 3 because it contains three items. Another pattern is "(4), (3), (2)". It appears in the second and third sequence (it thus has a support of 50 %). It also has a length of 3 because it contains 3 items.

Performance

See the "performance" section of the website for a performance comparison with other sequential pattern mining algorithm.

Implementation details

The implementation is faithful to the article, except that the gap constraints and window constraints are currently not implemented (they will be considered in future versions of SPMF).

Note also that in the source code, we provide an example of how to run GSP and keep the result in memory instead of saving it to a file ("MainTestGSP_saveToMemory.java"). This can be useful if the algorithms are integrated into another Java program.

Where can I get more information about GSP?

The GSP algorithm is described in this article:

R. Srikant and R. Agrawal. 1996. Mining Sequential Patterns: Generalizations and Performance Improvements. In Proceedings of the 5th International Conference on Extending Database Technology: Advances in Database Technology (EDBT '96), Peter M. G. Apers, Mokrane Bouzeghoub, and Georges Gardarin (Eds.). Springer-Verlag, London, UK, UK, 3-17.

Example 36 : Mining Frequent Sequential Patterns Using the SPADE Algorithm

How to run this example?

What is SPADE?

SPADE is a popular sequential pattern mining algorithm proposed by Zaki (2001).

What is the input of SPADE?

The input of SPADE is a sequence database and a user-specified threshold named minsup (a value in [0,1] representing a percentage). Moreover, the implementation in SPMF adds another parameter, which is the maximum sequential pattern length in terms of items.

A sequence database is a set of sequences where each sequence is a list of itemsets. An itemset is an unordered set of distinct items. For example, the table shown below contains four sequences. The first sequence, named S1, contains 5 itemsets. It means that item 1 appeared, followed by items 1, 2 and 3 at the same time, followed by items 1 and 3, followed by item 4, and finally by items 3 and 6. This database is provided in the file "contextPrefixSpan.txt" of the SPMF distribution. Note that it is assumed that no item appears twice in the same itemset and that items in an itemset are sorted in lexicographical order.

ID Sequences
S1 (1), (1 2 3), (1 3), (4), (3 6)
S2 (1 4), (3), (2 3), (1 5)
S3 (5 6), (1 2), (4 6), (3), (2)
S4 (5), (7), (1 6), (3), (2), (3)

What is the output of SPADE?

SPADE discovers all frequent sequential patterns occurring in a sequence database (subsequences that occur in at least minsup sequences of the database).

To explain more formally what a sequential pattern is, it is necessary to review some definitions.

A sequential pattern is a sequence. A sequence SA = X1, X2, ..., Xk, where X1, X2, ..., Xk are itemsets, is said to occur in another sequence SB = Y1, Y2, ..., Ym, where Y1, Y2, ..., Ym are itemsets, if and only if there exist integers 1 <= i1 < i2 < ... < ik <= m such that X1 ⊆ Yi1, X2 ⊆ Yi2, ..., Xk ⊆ Yik.

The support of a sequential pattern is the number of sequences where the pattern occurs divided by the total number of sequences in the database.

A frequent sequential pattern is a sequential pattern having a support no less than the minsup parameter provided by the user.

For example, if we run SPADE with minsup = 50 % and with a maximum pattern length of 100 items, 53 sequential patterns are found. The list is too long to be presented here. An example of pattern found is "(1,2),(6)", which appears in the first and the third sequence (it therefore has a support of 50%). This pattern has a length of 3 because it contains three items. Another pattern is "(4), (3), (2)". It appears in the second and third sequence (it thus has a support of 50 %). It also has a length of 3 because it contains 3 items.

Performance

See the "performance" section of the website for a performance comparison with other sequential pattern mining algorithm.

Implementation details

In the source code, we also provide examples of how to keep the result in memory instead of saving it to a file. This can be useful if the algorithms are integrated into another Java program. Examples of how to keep the result in memory are named according to the following naming convention: "MainTest..._saveToMemory".

Note also that in the source code, there are three variations of the SPADE implementation that try different ways to perform the join of IdLists. The fastest implementation is the one named "Fat_Bitmap". It is the one offered in the graphical user interface. The corresponding test files are:

"MainTestSPADE_AGP_BitMap_saveToFile.java" "MainTestSPADE_AGP_BitMap_saveToMemory.java" "MainTestSPADE_AGP_EntryList_saveToFile.java" "MainTestSPADE_AGP_EntryList_saveToMemory.java" "MainTestSPADE_AGP_FatBitMap_saveToFile.java" "MainTestSPADE_AGP_FatBitMap_saveToMemory.java"

Lastly, in the source code, a parallelized version of SPADE is also offered:

"MainTestSPADE_AGP_Parallelized_BitMap_saveToFile.java" "MainTestSPADE_AGP_Parallelized_BitMap_saveToMemory.java" "MainTestSPADE_AGP_Parallelized_EntryList_saveToFile.java" "MainTestSPADE_AGP_Parallelized_EntryList_saveToMemory.java" "MainTestSPADE_AGP_Parallelized_FatBitMap_saveToFile.java" "MainTestSPADE_AGP_Parallelized_FatBitMap_saveToMemory.java"

Besides, note that an alternative input file contextSPADE.txt is provided. It contains the example used in the article proposing SPADE.
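To give an intuition of what an IdList join is (independently of the implementation variations listed above), here is a simplified Java sketch in which each pattern is represented by an IdList of (sequence id, itemset position) pairs:

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// A simplified illustration of the vertical representation used by SPADE:
// every pattern has an IdList of (sequence id, itemset position) pairs, and
// extending a pattern is done by joining two IdLists.
public class IdListJoinSketch {

    static final class Entry {
        final int sid; // sequence id
        final int eid; // position of the itemset in the sequence
        Entry(int sid, int eid) { this.sid = sid; this.eid = eid; }
    }

    // Temporal join (s-step): keep the occurrences of 'item' that appear
    // strictly after an occurrence of 'prefix' in the same sequence.
    static List<Entry> sStepJoin(List<Entry> prefix, List<Entry> item) {
        List<Entry> result = new ArrayList<>();
        for (Entry e : item) {
            for (Entry p : prefix) {
                if (p.sid == e.sid && p.eid < e.eid) {
                    result.add(e);
                    break;
                }
            }
        }
        return result;
    }

    // The support of an IdList is its number of distinct sequence ids.
    static int support(List<Entry> idList) {
        Set<Integer> sids = new HashSet<>();
        for (Entry e : idList) sids.add(e.sid);
        return sids.size();
    }
}

In the actual implementations, such IdLists are encoded as bitmaps or entry lists, which is precisely what the variations listed above differ on.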

Where can I get more information about SPADE?

The SPADE algorithm is described in this article:

Mohammed J. Zaki. 2001. SPADE: An Efficient Algorithm for Mining Frequent Sequences. Mach. Learn. 42, 1-2 (January 2001), 31-60. DOI=10.1023/A:1007652502315 http://dx.doi.org/10.1023/A:1007652502315

 

Example 37 : Mining Frequent Sequential Patterns Using the SPAM Algorithm

How to run this example?

To run the implementation of SPAM by P. Fournier-Viger (PFV):

To run the implementation of SPAM by A. Gomariz Peñalver (AGP):

What is SPAM?

SPAM is an algorithm for discovering frequent sequential patterns in a sequence database. It was proposed by Ayres et al. (2002).

What is the input of SPAM?

The input of SPAM is a sequence database and a user-specified threshold named minsup (a value in [0,1] representing a percentage). Moreover, the implementation in SPMF adds another parameter, which is the maximum sequential pattern length in terms of items.

A sequence database is a set of sequences where each sequence is a list of itemsets. An itemset is an unordered set of items. For example, the table shown below contains four sequences. The first sequence, named S1, contains 5 itemsets. It means that item 1 appeared, followed by items 1, 2 and 3 at the same time, followed by items 1 and 3, followed by item 4, and finally by items 3 and 6. This database is provided in the file "contextPrefixSpan.txt" of the SPMF distribution. Note that it is assumed that no item appears twice in the same itemset and that items in an itemset are sorted in lexicographical order.

ID Sequences
S1 (1), (1 2 3), (1 3), (4), (3 6)
S2 (1 4), (3), (2 3), (1 5)
S3 (5 6), (1 2), (4 6), (3), (2)
S4 (5), (7), (1 6), (3), (2), (3)

What is the output of SPAM?

SPAM discovers all frequent sequential patterns occurring in a sequence database (subsequences that occur in at least minsup sequences of the database).

To explain more formally what a sequential pattern is, it is necessary to review some definitions.

A sequential pattern is a sequence. A sequence SA = X1, X2, ..., Xk, where X1, X2, ..., Xk are itemsets, is said to occur in another sequence SB = Y1, Y2, ..., Ym, where Y1, Y2, ..., Ym are itemsets, if and only if there exist integers 1 <= i1 < i2 < ... < ik <= m such that X1 ⊆ Yi1, X2 ⊆ Yi2, ..., Xk ⊆ Yik.

The support of a sequential pattern is the number of sequences where the pattern occurs divided by the total number of sequences in the database.

A frequent sequential pattern is a sequential pattern having a support no less than the minsup parameter provided by the user.

For example, if we run SPAM with minsup = 50 %, 53 sequential patterns are found. The list is too long to be presented here. An example of pattern found is "(1,2),(6)", which appears in the first and the third sequence (it therefore has a support of 50%). This pattern has a length of 3 because it contains three items. Another pattern is "(4), (3), (2)". It appears in the second and third sequence (it thus has a support of 50 %). It also has a length of 3 because it contains 3 items.

Performance

SPAM is one of the fastest sequential pattern mining algorithms. The SPAM implementation in SPMF is reported to be faster than PrefixSpan (see the "performance" section of the website for a performance comparison).
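Much of this speed comes from the vertical bitmap representation that gives SPAM its name (see the paper below). As a rough illustration, here is a simplified Java sketch in which each item of one sequence has a bitmap with one bit per itemset, so that extending a pattern with an item inside the same itemset (an "i-step") is a bitwise AND. The real SPAM also encodes sequence boundaries and supports sequence extensions ("s-steps"), which this sketch omits:

import java.util.BitSet;

// A simplified illustration of SPAM's bitmap idea: every itemset of a
// sequence occupies one bit position, and each item has a bitmap with a bit
// set at every position where the item appears.
public class BitmapSketch {

    public static void main(String[] args) {
        // Bit positions 0..4 are the five itemsets of S1 in the example above:
        // (1), (1 2 3), (1 3), (4), (3 6)
        BitSet item1 = new BitSet();
        item1.set(0); item1.set(1); item1.set(2); // itemsets containing item 1
        BitSet item3 = new BitSet();
        item3.set(1); item3.set(2); item3.set(4); // itemsets containing item 3

        BitSet both = (BitSet) item1.clone();
        both.and(item3); // positions whose itemset contains both 1 and 3
        System.out.println(both); // {1, 2} -> the itemsets (1 2 3) and (1 3)
    }
}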

Implementation details

In the source code, we also provide examples of how to keep the result in memory instead of saving it to a file. This can be useful if the algorithms are integrated into another Java program. Examples of how to keep the result in memory are named according to the following naming convention: "MainTest..._saveToMemory".

For the AGP implementation of SPAM, several versions are provided in the source code that show different ways to perform the join of IdLists. The fastest implementation is the one named "Fat_Bitmap". It is the one offered in the graphical user interface.

The AGP and PFV implementations of SPAM share some source code but also have some significant differences. See the performance section of the website for a performance comparison (to be added at the end of August 2013).

Where can I get more information about SPAM?

The SPAM algorithm was proposed in this paper:

J. Ayres, J. Gehrke, T.Yiu, and J. Flannick. Sequential PAttern Mining Using Bitmaps. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Edmonton, Alberta, Canada, July 2002.

Example 38: Mining Frequent Closed Sequential Patterns Using the ClaSP Algorithm

How to run this example?

What is ClaSP?

ClaSP is a very efficient algorithm for discovering closed sequential patterns in sequence databases, proposed by Antonio Gomariz Peñalver et al. (2013). This implementation is the original implementation.

What is the input of ClaSP?

The input of ClaSP is a sequence database and a user-specified threshold named minsup (a value in [0,1] representing a percentage).

A sequence database is a set of sequences where each sequence is a list of itemsets. An itemset is an unordered set of items. For example, the table shown below contains four sequences. The first sequence, named S1, contains 5 itemsets. It means that item 1 appeared, followed by items 1, 2 and 3 at the same time, followed by items 1 and 3, followed by item 4, and finally by items 3 and 6. This database is provided in the file "contextPrefixSpan.txt" of the SPMF distribution. Note that it is assumed that no item appears twice in the same itemset and that items in an itemset are sorted in lexicographical order.

ID Sequences
S1 (1), (1 2 3), (1 3), (4), (3 6)
S2 (1 4), (3), (2 3), (1 5)
S3 (5 6), (1 2), (4 6), (3), (2)
S4 (5), (7), (1 6), (3), (2), (3)

What is the output of ClaSP?

ClaSP discovers all frequent closed sequential patterns that occur in a sequence database.

To explain more formally what a closed sequential pattern is, it is necessary to review some definitions.

A sequential pattern is a sequence. A sequence SA = X1, X2, ..., Xk, where X1, X2, ..., Xk are itemsets, is said to occur in another sequence SB = Y1, Y2, ..., Ym, where Y1, Y2, ..., Ym are itemsets, if and only if there exist integers 1 <= i1 < i2 < ... < ik <= m such that X1 ⊆ Yi1, X2 ⊆ Yi2, ..., Xk ⊆ Yik.

The support of a sequential pattern is the number of sequences where the pattern occurs divided by the total number of sequences in the database.

A frequent sequential pattern is a sequential pattern having a support no less than the minsup parameter provided by the user.

A closed sequential pattern is a sequential pattern such that it is not strictly included in another pattern having the same support.

Why use ClaSP? It can be shown that the set of closed sequential patterns is generally much smaller than the set of all sequential patterns, and that no information is lost. Moreover, finding closed sequential patterns is often much more efficient than discovering all patterns.

For example, if we run ClaSP with minsup= 50 % on the sequence database, the following patterns are found.

ID Closed Sequential Pattern Support
S1 (6) 75 %
S2 (5) 75 %
S3 (2), (3) 75 %
S4 (1), (2) 100 %
S5 (1), (3) 100 %
S6 (1 2), (6) 50 %
S7 (4), (3) 75 %
S8 (1) (2), (3) 50 %
S9 (1), (2 3), (1) 50 %
S10 (1), (3), (2) 75 %
S11 (1), (3), (3) 75 %
S12 (1 2), (4), (3) 50 %
S13 (6), (2), (3) 50 %
S14 (5), (2), (3) 50 %
S15 (4), (3), (2) 50 %
S16 (5), (6), (3), (2) 50 %
S17 (5), (1), (3), (2) 50 %

For instance, the sequential pattern "(1,2),(6)" appears in the first and third sequence (it has therefore a support of 50%). Another pattern is "(4), (3)". It appears in the second and third sequence (it has thus a support of 75 %).

Performance

ClaSP is a very efficient algorithm for closed sequential pattern mining. See the article proposing ClaSP for a performance comparison with CloSpan and SPADE.

Implementation details

In the source code version of SPMF, there is also an example of how to use ClaSP and keep the result in memory instead of saving it to a file ( MainTestClaSP_saveToMemory.java ).

An alternative input file contextClaSP.txt is also provided. It contains the example sequence database used in the article proposing ClaSP.

Where can I get more information about this algorithm?

The ClaSP algorithm was proposed in this paper:

A. Gomariz, M. Campos, R. Marín and B. Goethals (2013), ClaSP: An Efficient Algorithm for Mining Frequent Closed Sequences. Proc. PAKDD 2013, pp. 50-61.

Example 39 : Mining Frequent Closed Sequential Patterns Using the CloSpan Algorithm

How to run this example?

What is CloSpan?

CloSpan is a pattern-growth algorithm for discovering closed sequential patterns in sequence databases, proposed by Yan et al. (2003).

What is the input of CloSpan?

The input of CloSpan is a sequence database and a user-specified threshold named minsup (a value in [0,1] representing a percentage).

A sequence database is a set of sequences where each sequence is a list of itemsets. An itemset is an unordered set of items. For example, the table shown below contains four sequences. The first sequence, named S1, contains 5 itemsets. It means that item 1 appeared, followed by items 1, 2 and 3 at the same time, followed by items 1 and 3, followed by item 4, and finally by items 3 and 6. This database is provided in the file "contextPrefixSpan.txt" of the SPMF distribution. Note that it is assumed that no item appears twice in the same itemset and that items in an itemset are sorted in lexicographical order.

ID Sequences
S1 (1), (1 2 3), (1 3), (4), (3 6)
S2 (1 4), (3), (2 3), (1 5)
S3 (5 6), (1 2), (4 6), (3), (2)
S4 (5), (7), (1 6), (3), (2), (3)

What is the output of CloSpan?

CloSpan discovers all frequent closed sequential patterns that occur in a sequence database.

To explain more formally what a closed sequential pattern is, it is necessary to review some definitions.

A sequential pattern is a sequence. A sequence SA = X1, X2, ..., Xk, where X1, X2, ..., Xk are itemsets, is said to occur in another sequence SB = Y1, Y2, ..., Ym, where Y1, Y2, ..., Ym are itemsets, if and only if there exist integers 1 <= i1 < i2 < ... < ik <= m such that X1 ⊆ Yi1, X2 ⊆ Yi2, ..., Xk ⊆ Yik.

The support of a sequential pattern is the number of sequences where the pattern occurs divided by the total number of sequences in the database.

A frequent sequential pattern is a sequential pattern having a support no less than the minsup parameter provided by the user.

A closed sequential pattern is a sequential pattern such that it is not strictly included in another pattern having the same support.

Why use CloSpan? It can be shown that the set of closed sequential patterns is generally much smaller than the set of all sequential patterns, and that no information is lost. Moreover, finding closed sequential patterns is often much more efficient than discovering all patterns.

For example, if we run CloSpan with minsup= 50 % on the sequence database, the following patterns are found.

ID Closed Sequential Pattern Support
S1 (6) 75 %
S2 (5) 75 %
S3 (2), (3) 75 %
S4 (1), (2) 100 %
S5 (1), (3) 100 %
S6 (1 2), (6) 50 %
S7 (4), (3) 75 %
S8 (1) (2), (3) 50 %
S9 (1), (2 3), (1) 50 %
S10 (1), (3), (2) 75 %
S11 (1), (3), (3) 75 %
S12 (1 2), (4), (3) 50 %
S13 (6), (2), (3) 50 %
S14 (5), (2), (3) 50 %
S15 (4), (3), (2) 50 %
S16 (5), (6), (3), (2) 50 %
S17 (5), (1), (3), (2) 50 %

For instance, the sequential pattern "(1,2),(6)" appears in the first and third sequence (it has therefore a support of 50%). Another pattern is "(4), (3)". It appears in the second and third sequence (it has thus a support of 75 %).

Performance

CloSpan is an efficient algorithm for closed sequential pattern mining. However, it should be noted that some newer algorithms like ClaSP have shown better performance on many datasets (see the ClaSP paper for a performance comparison).

Implementation details

In the source code version of SPMF, there is also an example of how to use CloSpan and keep the result in memory instead of saving it to a file (MainTestCloSpan_saveToMemory.java).

An alternative input file contextCloSpan.txt is also provided. It contains the example sequence database used in the article proposing CloSpan.

Where can I get more information about this algorithm?

The CloSpan algorithm was proposed in this paper:

Yan, X., Han, J., & Afshar, R. (2003, May). CloSpan: Mining closed sequential patterns in large datasets. In Proc. 2003 SIAM Int’l Conf. Data Mining (SDM’03) (pp. 166-177).

Example 40 : Mining Frequent Closed Sequential Patterns Using the BIDE+ Algorithm

How to run this example?

What is BIDE+?

BIDE+ is an algorithm for discovering closed sequential patterns in sequence databases, proposed by Wang et al. (2007).

What is the input of BIDE+?

The input of BIDE+ is a sequence database and a user-specified threshold named minsup (a value in [0,1] representing a percentage).

A sequence database is a set of sequences where each sequence is a list of itemsets. An itemset is an unordered set of items. For example, the table shown below contains four sequences. The first sequence, named S1, contains 5 itemsets. It means that item 1 appeared, followed by items 1, 2 and 3 at the same time, followed by items 1 and 3, followed by item 4, and finally by items 3 and 6. This database is provided in the file "contextPrefixSpan.txt" of the SPMF distribution. Note that it is assumed that no item appears twice in the same itemset and that items in an itemset are sorted in lexicographical order.

ID Sequences
S1 (1), (1 2 3), (1 3), (4), (3 6)
S2 (1 4), (3), (2 3), (1 5)
S3 (5 6), (1 2), (4 6), (3), (2)
S4 (5), (7), (1 6), (3), (2), (3)

What is the output of BIDE+?

BIDE+ discovers all frequent closed sequential patterns that occur in a sequence database.

To explain more formally what a closed sequential pattern is, it is necessary to review some definitions.

A sequential pattern is a sequence. A sequence SA = X1, X2, ..., Xk, where X1, X2, ..., Xk are itemsets, is said to occur in another sequence SB = Y1, Y2, ..., Ym, where Y1, Y2, ..., Ym are itemsets, if and only if there exist integers 1 <= i1 < i2 < ... < ik <= m such that X1 ⊆ Yi1, X2 ⊆ Yi2, ..., Xk ⊆ Yik.

The support of a sequential pattern is the number of sequences where the pattern occurs divided by the total number of sequences in the database.

A frequent sequential pattern is a sequential pattern having a support no less than the minsup parameter provided by the user.

A closed sequential pattern is a sequential pattern such that it is not strictly included in another pattern having the same support.

Why use BIDE+? It can be shown that the set of closed sequential patterns is generally much smaller than the set of all sequential patterns, and that no information is lost. Moreover, finding closed sequential patterns is often much more efficient than discovering all patterns.

For example, if we run BIDE+ with minsup= 50 % on the sequence database, the following patterns are found.

ID Closed Sequential Pattern Support
S1 (6) 75 %
S2 (5) 75 %
S3 (2), (3) 75 %
S4 (1), (2) 100 %
S5 (1), (3) 100 %
S6 (1 2), (6) 50 %
S7 (4), (3) 75 %
S8 (1) (2), (3) 50 %
S9 (1), (2 3), (1) 50 %
S10 (1), (3), (2) 75 %
S11 (1), (3), (3) 75 %
S12 (1 2), (4), (3) 50 %
S13 (6), (2), (3) 50 %
S14 (5), (2), (3) 50 %
S15 (4), (3), (2) 50 %
S16 (5), (6), (3), (2) 50 %
S17 (5), (1), (3), (2) 50 %

For instance, the sequential pattern "(1,2),(6)" appears in the first and third sequence (it has therefore a support of 50%). Another pattern is "(4), (3)". It appears in the second and third sequence (it has thus a support of 75 %).

Performance

BIDE+ is a very efficient algorithm for closed sequential pattern mining. This implementation includes all the optimizations described in the paper.

Implementation details

I have included three versions of BIDE+ in the SPMF distribution. The first one keeps the frequent closed sequential patterns in memory and prints the results to the console (MainTestBIDEPlus_saveToMemory.java). The second one saves the result directly to a file (MainTestBIDEPlus_saveToFile.java); this version is faster.

The third version of BIDE+ accepts strings instead of integers. It is available under the name "BIDE+ with strings" in the GUI version of SPMF or in the package ca.pfv.spmf.sequential_rules.bide_with_strings for the source code version of SPMF. To run it, you should use the input file: contextPrefixSpanStrings.txt.

Where can I get more information about this algorithm?

The BIDE algorithm is described in this paper:

J. Wang, J. Han: BIDE: Efficient Mining of Frequent Closed Sequences. ICDE 2004: 79-90

Example 41 : Mining Frequent Closed Sequential Patterns by Post-Processing Using the PrefixSpan or SPAM Algorithm

What is this?

This example shows how to use the PrefixSpan and SPAM algorithms to discover all sequential patterns and keep only the closed patterns by post-processing. This should be less efficient than using a dedicated algorithm for closed sequential pattern mining such as ClaSP, CloSpan or BIDE+.

How to run this example?

If you want to use SPAM with post-processing:

If you want to use PrefixSpan with post-processing:

What is the input?

The input is a sequence database and a user-specified threshold named minsup (a value in [0,1] representing a percentage).

A sequence database is a set of sequences where each sequence is a list of itemsets. An itemset is an unordered set of items. For example, the table shown below contains four sequences. The first sequence, named S1, contains 5 itemsets. It means that item 1 appeared, followed by items 1, 2 and 3 at the same time, followed by items 1 and 3, followed by item 4, and finally by items 3 and 6. This database is provided in the file "contextPrefixSpan.txt" of the SPMF distribution. Note that it is assumed that no item appears twice in the same itemset and that items in an itemset are sorted in lexicographical order.

ID Sequences
S1 (1), (1 2 3), (1 3), (4), (3 6)
S2 (1 4), (3), (2 3), (1 5)
S3 (5 6), (1 2), (4 6), (3), (2)
S4 (5), (7), (1 6), (3), (2), (3)

What is the output?

The output is the set of all frequent closed sequential patterns that occur in the sequence database.

To explain more formally what a closed sequential pattern is, it is necessary to review some definitions.

A sequential pattern is a sequence. A sequence SA = X1, X2, ..., Xk, where X1, X2, ..., Xk are itemsets, is said to occur in another sequence SB = Y1, Y2, ..., Ym, where Y1, Y2, ..., Ym are itemsets, if and only if there exist integers 1 <= i1 < i2 < ... < ik <= m such that X1 ⊆ Yi1, X2 ⊆ Yi2, ..., Xk ⊆ Yik.

The support of a sequential pattern is the number of sequences where the pattern occurs divided by the total number of sequences in the database.

A frequent sequential pattern is a sequential pattern having a support no less than the minsup parameter provided by the user.

A closed sequential pattern is a sequential pattern such that it is not strictly included in another pattern having the same support.

Why mine closed sequential patterns? It can be shown that the set of closed sequential patterns is generally much smaller than the set of all sequential patterns, and that no information is lost. Moreover, finding closed sequential patterns is often much more efficient than discovering all patterns.

For example, for minsup = 50 %, the following patterns are found in the previous sequence database.

ID Closed Sequential Pattern Support
S1 (6) 75 %
S2 (5) 75 %
S3 (2), (3) 75 %
S4 (1), (2) 100 %
S5 (1), (3) 100 %
S6 (1 2), (6) 50 %
S7 (4), (3) 75 %
S8 (1) (2), (3) 50 %
S9 (1), (2 3), (1) 50 %
S10 (1), (3), (2) 75 %
S11 (1), (3), (3) 75 %
S12 (1 2), (4), (3) 50 %
S13 (6), (2), (3) 50 %
S14 (5), (2), (3) 50 %
S15 (4), (3), (2) 50 %
S16 (5), (6), (3), (2) 50 %
S17 (5), (1), (3), (2) 50 %

For instance, the sequential pattern "(1,2),(6)" appears in the first and third sequence (it has therefore a support of 50%). Another pattern is "(4), (3)". It appears in the second and third sequence (it has thus a support of 75 %).

Performance

Mining closed patterns by post-processing should be less efficient than using a dedicated algorithm for closed sequential pattern mining.

Implementation details

The implementations of SPAM and PrefixSpan used in this example are made by Antonio Gomariz Peñalver (AGP).

In the source code, there are also two test files that show how to keep the result in memory instead of saving it to a file.
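The post-processing itself is conceptually simple. The following self-contained Java sketch (not the SPMF code) keeps only the closed patterns among a set of patterns annotated with their support:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// A sketch of the post-processing described above: given frequent sequential
// patterns with their supports, keep only the closed ones, i.e. drop every
// pattern that is strictly included in another pattern with the same support.
public class ClosedPatternFilter {

    // Containment test following the definition above: does sa occur in sb?
    static boolean isContainedIn(List<Set<Integer>> sa, List<Set<Integer>> sb) {
        int pos = 0;
        for (Set<Integer> itemset : sb) {
            if (pos < sa.size() && itemset.containsAll(sa.get(pos))) pos++;
        }
        return pos == sa.size();
    }

    static List<List<Set<Integer>>> keepClosed(Map<List<Set<Integer>>, Integer> patterns) {
        List<List<Set<Integer>>> closed = new ArrayList<>();
        for (Map.Entry<List<Set<Integer>>, Integer> p : patterns.entrySet()) {
            boolean isClosed = true;
            for (Map.Entry<List<Set<Integer>>, Integer> q : patterns.entrySet()) {
                if (!p.getKey().equals(q.getKey())
                        && p.getValue().equals(q.getValue())
                        && isContainedIn(p.getKey(), q.getKey())) {
                    isClosed = false; // p is included in q and has the same support
                    break;
                }
            }
            if (isClosed) closed.add(p.getKey());
        }
        return closed;
    }
}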

Where can I get more information?

For information about SPAM and PrefixSpan, see the examples about PrefixSpan and SPAM.

Example 42 : Mining Frequent Multi-dimensional Sequential Patterns from a Multi-dimensional Sequence Database with SeqDIM, using PrefixSpan and Apriori

How to run this example?

What is SeqDIM?

SeqDIM is an algorithm for discovering multi-dimensional sequential patterns in a multi-dimensional sequence database.

Why is multi-dimensional sequential pattern mining interesting? The reason is that regular sequential pattern mining algorithms do not consider the context of sequences. For example, the context of a sequence of transactions at a supermarket could be the profile of the customer, such as his age, the city where he lives, etc.

In multi-dimensional sequential pattern mining, a sequence database is annotated with dimensions that indicate the context of each sequence. The dimensions are symbolic values.

Multi-dimensional sequential pattern mining algorithms discover sequential patterns with context information. This can be very useful for real applications such as market basket analysis. For example, one could find patterns specific to customers who are teenagers and live in a particular city, or items bought by adults living in another city.

What is the input?

The input is a multi-dimensional sequence database (as defined by Pinto et al. 2001) and a threshold named minsup (a value in [0,1] representing a percentage).

A multi-dimensional database is a set of multi-dimensional sequences and a set of dimensions d1, d2... dn. A multi-dimensional sequence (MD-Sequence) is composed of an MD-pattern and a sequence. A sequence is an ordered list of itemsets (groups of items). Note that it is assumed that no items appear twice in the same itemset and that items in an itemset are lexically ordered. An MD-pattern is a set of symbolic values for the dimensions (here represented by integer numbers).

For example, consider the following database, provided in the file "ContextMDSequenceNoTime.txt" of the SPMF distribution. The database contains 4 MD-sequences.

MD-Sequences
ID MD-Patterns Sequences
  d1 d2 d3  
S1 1 1 1 (2 4), (3), (2), (1)
S2 1 2 2 (2 6), (3 5), (6 7)
S3 1 2 1 (1 8), (1), (2), (6)
S4 * 3 3 (2 5), (3 5)

For instance, the first MD-Sequence represents that items 2 and 4 appeared together, then were followed by item 3, which was followed by item 2, which was followed by item 1. The context of this sequence is the value 1 for dimension d1, the value 1 for dimension d2 and the value 1 for dimension d3. Note that the value "*" in the fourth MD-sequence means "any value".

What is the output?

The output is the set of all multi-dimensional sequential patterns that appear in at least minsup sequences of the database. Here, we will not provide a formal definition but rather show an example. For a formal definition of what a multi-dimensional sequential pattern is, you can refer to the paper by Pinto et al. (2001), which explains it very well.

Let's look at the example. If we apply SeqDIM on the previous database with a minsup of 50%, we obtain the following result:

Frequent MD-Sequences
ID MD-Patterns Sequences Support
  d1 d2 d3    
P1 * * * (3) 75 %
P2 * * * (2) 100 %
P3 1 * * (2) 75 %
P4 * * * (2), (3) 75 %

For instance, the third pattern (P3) represents the sequence containing the item 2 only, with the value 1 for dimension d1. Note that the value "*" for dimensions d2 and d3 means "any value". This pattern P3 has a support of 75% because it appears in 75 % of the sequences of the original database (it appears in S1, S2 and S3).
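The matching of MD-patterns against the dimension values of an MD-sequence can be sketched as follows (a simplification, not the SPMF code), where the wildcard "*" in a pattern matches any value:

// A small sketch of MD-pattern matching: a pattern's dimension values match
// those of an MD-sequence if each value is equal or is the wildcard "*".
public class MDPatternMatch {

    static boolean matches(String[] pattern, String[] dimensions) {
        if (pattern.length != dimensions.length) return false;
        for (int i = 0; i < pattern.length; i++) {
            if (!pattern[i].equals("*") && !pattern[i].equals(dimensions[i])) {
                return false; // a fixed value that differs
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] p3 = {"1", "*", "*"}; // MD-pattern of P3 above
        String[] s1 = {"1", "1", "1"}; // dimensions of S1
        String[] s4 = {"*", "3", "3"}; // dimensions of S4
        System.out.println(matches(p3, s1)); // true
        System.out.println(matches(p3, s4)); // false: d1 of S4 is "*", not 1
    }
}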

Implementation details

The SeqDIM algorithm is a meta-algorithm. It requires a sequential pattern mining algorithm for discovering sequential patterns and an itemset mining algorithm to deal with the dimensions. In our implementation, we have used the PrefixSpan algorithm (Pei et al., 2004) for sequential pattern mining and the Apriori algorithm (Agrawal & Srikant, 1994) for dealing with the dimensions. PrefixSpan is a very fast algorithm. However, Apriori is not the most efficient algorithm. It could be replaced by FPGrowth in a future version for more efficiency.

Note also that SeqDIM can generate a lot of patterns. A solution to this problem is to use the algorithm by Songram et al., also offered in SPMF. This latter algorithm eliminates a lot of redundancy in patterns without any loss of information.

Also note that the implementation of SeqDIM in SPMF needs a little bit of refactoring, as it is currently integrated with the Fournier-Viger (2008) algorithm in the code. In a future version of SPMF, the two algorithms will be separated.

Where can I get more information about this algorithm?

The algorithm is described in this paper:

H. Pinto, J. Han, J Pei, K. Wang, Q. Chen, U. Dayal: Multi-Dimensional Sequential Pattern Mining. CIKM 2001: 81-88

Example 43 : Mining Frequent Closed Multi-dimensional Sequential Patterns from a Sequence Database with SeqDIM/Songram, using Bide+ and AprioriClose/Charm

How to run this example?

What is the Songram et al. algorithm?

The Songram et al. algorithm is an algorithm for discovering closed multi-dimensional sequential patterns in a multi-dimensional sequence database.

Why is multi-dimensional sequential pattern mining interesting? The reason is that regular sequential pattern mining algorithms do not consider the context of sequences. For example, the context of a sequence of transactions from a supermarket could be the profile of the customer, such as his age, the city where he lives, etc.

In multi-dimensional sequential pattern mining, a sequence database is annotated with dimensions that indicate the context of each sequence. The dimensions are symbolic values.

Multi-dimensional sequential pattern mining algorithms discover sequential patterns with context information. This can be very useful for real applications such as market basket analysis. For example, one could find patterns specific to customers who are teenagers and live in a particular city, or items bought by adults living in another city.

However, there is a problem with classical multi-dimensional sequential pattern mining algorithms such as the one by Pinto et al. (2001): there can be a lot of redundancy in the results. For this reason, Songram et al. proposed to discover closed multi-dimensional sequential patterns. This makes it possible to find a subset of all patterns that eliminates a great deal of redundancy without any information loss. The algorithm by Songram et al. is also more efficient than the one of Pinto et al. (2001) in terms of execution time and memory usage.

What is the input?

The input is a multi-dimensional sequence database (as defined by Pinto et al. 2001) and a threshold named minsup (a value in [0,1] representing a percentage).

A multi-dimensional database is a set of multi-dimensional sequences and a set of dimensions d1, d2... dn. A multi-dimensional sequence (MD-Sequence) is composed of an MD-pattern and a sequence. A sequence is an ordered list of itemsets (groups of items). An MD-pattern is a set of symbolic values for the dimensions (here represented by integer numbers).

For example, consider the following database, provided in the file "ContextMDSequenceNoTime.txt" of the SPMF distribution. The database contains 4 MD-sequences.

MD-Sequences
ID MD-Patterns Sequences
  d1 d2 d3  
S1 1 1 1 (2 4), (3), (2), (1)
S2 1 2 2 (2 6), (3 5), (6 7)
S3 1 2 1 (1 8), (1), (2), (6)
S4 * 3 3 (2 5), (3 5)

For instance, the first MD-Sequence represents that items 2 and 4 appeared together, then were followed by item 3, which was followed by item 2, which was followed by item 1. The context of this sequence is the value 1 for dimension d1, the value 1 for dimension d2 and the value 1 for dimension d3. Note that the value "*" in the fourth MD-sequence means "any value".

What is the output?

The output is the set of all closed multi-dimensional sequential patterns that appear in at least minsup sequences of the database. The difference with the SeqDIM algorithm is that this algorithm discovers "closed" multi-dimensional patterns. A "closed" multi-dimensional pattern is a multi-dimensional pattern that is not strictly included in another pattern appearing in exactly the same set of sequences (see the Songram et al. paper for a formal definition).

Let's look at an example. If we apply the Songram et al. algorithm on the previous database with a minsup of 50%, we obtain the following result:

Frequent Closed MD-Sequences
ID MD-Patterns Sequences Support
  d1 d2 d3    
P1 * * * (2) 100 %
P2 1 * 1 (1) 50 %
P3 1 2 * (2), (6) 50 %
P4 * * * (2), (3 5) 50 %
P5 * * * (2), (3) 75 %
P6 1 * * (2), (3) 50 %
P7 1 * * (2) 75 %

For instance, the third pattern found (P3) represents that item 2 is followed by item 6. The context of this pattern is the value 1 for dimension d1, the value 2 for dimension d2 and any value for dimension d3. Note that the value "*" means "any value". This pattern is said to have a support of 50% because it appears in 50 % of the MD-Sequences of the original database (it appears in S2 and S3).

Implementation details

The Songram et al. algorithm is a meta-algorithm. It requires a closed sequential pattern mining algorithm for discovering sequential patterns and a closed itemset mining algorithm to deal with the dimensions. Our implementation uses the SeqDIM/Songram algorithm (Pinto et al. 2001, Songram et al. 2006) in combination with BIDE+ (Wang et al. 2007) and AprioriClose (Pasquier et al., 1999) or Charm (Zaki, 2002). To use AprioriClose in the graphical user interface, select "SeqDim_(BIDE+AprioriClose)". To use Charm, select "SeqDim_(BIDE+Charm)".

Note that the implementation of the Songram et al. algorithm in SPMF needs a little bit of refactoring, as it is currently integrated with the Fournier-Viger (2008) algorithm in the code. In a future version of SPMF, they will be separated. This is not really a problem for performance, but it would make it easier to reuse the algorithms if both were separated.

Where can I get more information about this algorithm?

The algorithm is described in this paper:

P. Songram, V. Boonjing, S. Intakosum: Closed Multi-dimensional Sequential-Pattern Mining. Proc. of ITNG 2006.

The idea of multi-dimensional pattern mining is based on this paper:

H. Pinto, J. Han, J Pei, K. Wang, Q. Chen, U. Dayal: Multi-Dimensional Sequential Pattern Mining. CIKM 2001: 81-88

The idea of mining closed sequential pattern is based on this paper:

J. Wang, J. Han: BIDE: Efficient Mining of Frequent Closed Sequences. ICDE 2004: 79-90

Example 44 : Mining Sequential Patterns with Time Constraints from a Time-Extended Sequence Database

How to run this example?

What is this algorithm?

The Hirate & Yamana (2006) algorithm is an algorithm for discovering frequent sequential patterns that respect some time constraints, which are used to filter uninteresting patterns.

The idea of using time constraints is interesting because it can greatly reduce the number of patterns found. Mining is also faster and uses less memory than if all patterns were discovered.

Note that in our implementation, the Hirate & Yamana algorithm is not implemented as a standalone algorithm. The features of the Hirate-Yamana 2006 algorithm are rather integrated in the Fournier-Viger et al. (2008) algorithm that combines features from several other sequential pattern mining algorithms.

What is the input?

The input is a time-extended sequence database (as defined by Hirate-Yamana, 2006) and some constraints.

A time-extended sequence database is a set of time-extended sequences. A time-extended sequence is a list of itemsets (groups of items), where each itemset is annotated with a timestamp (an integer value). Note that it is assumed that an item does not appear more than once in an itemset and that items in an itemset are lexically ordered.

For example, consider the following time-extended sequence database provided in the file contextSequencesTimeExtended.txt of the SPMF distribution. The database contains 4 time-extended sequences. Each sequence contains itemsets that are annotated with a timestamp. For example, consider the sequence S1. This sequence indicates that itemset {1} appeared at time 0. It was followed by the itemset {1, 2, 3} at time 1, which was followed by the itemset {1, 3} at time 2.

ID Sequences
S1 (0, 1), (1, 1 2 3), (2, 1 3)
S2 (0, 1), (1, 1 2), (2, 1 2 3), (3, 1 2 3)
S3 (0, 1 2), (1, 1 2)
S4 (0, 2), (1, 1 2 3)

The algorithm discovers time-extended sequential patterns that are common to several sequences. To do that, the user needs to provide five constraints (see the paper by Hirate & Yamana, 2006 for full details): (1) minsup, the minimum support; (2) min_time_interval and (3) max_time_interval, the minimum and maximum time interval allowed between two consecutive itemsets of a pattern; and (4) min_whole_interval and (5) max_whole_interval, the minimum and maximum time interval allowed between the first itemset and the last itemset of a pattern.
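The time constraints themselves are easy to check on a given occurrence of a pattern, as the following self-contained Java sketch illustrates (this is an illustration of the constraint definitions, not the SPMF code):

import java.util.List;

// A sketch of the time-constraint checks applied to one occurrence of a
// pattern, where timestamps.get(i) is the timestamp of the i-th matched itemset.
public class TimeConstraintCheck {

    static boolean respectsConstraints(List<Integer> timestamps,
                                       int minTimeInterval, int maxTimeInterval,
                                       int minWholeInterval, int maxWholeInterval) {
        // interval between each pair of consecutive itemsets
        for (int i = 1; i < timestamps.size(); i++) {
            int gap = timestamps.get(i) - timestamps.get(i - 1);
            if (gap < minTimeInterval || gap > maxTimeInterval) return false;
        }
        // interval between the first and the last itemset
        int whole = timestamps.get(timestamps.size() - 1) - timestamps.get(0);
        return whole >= minWholeInterval && whole <= maxWholeInterval;
    }

    public static void main(String[] args) {
        // an occurrence whose two itemsets are one time unit apart, checked
        // against the constraints used in the example below
        System.out.println(respectsConstraints(List.of(0, 1), 0, 2, 0, 2)); // true
    }
}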

What is the output?

The output is a set of time-extended sequential patterns meeting the five constraints given by the user. For example, if we run the algorithm with minsup= 55 %, min_time_interval = 0, max_time_interval = 2, min_whole_interval = 0, max_whole_interval = 2, we obtain the following results:

ID Sequential Patterns Support
S1 (0, 3) 75 %
S2 (0, 2 3) 75 %
S3 (0, 2) 100 %
S4 (0, 1 2 3) 75%
S5 (0, 1 2) 100 %
S6 (0, 1 3) 75 %
S7 (0, 1) 100 %
S8 (0, 2), (1, 3) 75 %
S9 (0, 2), (1, 1 2) 75 %
S10 (0, 2), (1, 1 3) 75 %
S11 (0, 2), (1, 1) 100 %
S12 (0, 2), (1, 2) 75 %
S13 (0, 1), (1, 1 2) 75 %
S14 (0, 1), (1, 1) 75 %
S15 (0, 1), (1, 2) 75 %
S16 (0, 1 2), (1, 1) 75 %

For instance, the pattern S16 indicates that items 1 and 2 were followed by item 1 one time unit later. This pattern has a support of 75 % because it appears in S1, S2 and S3. It is important to note that the timestamps in the sequential patterns found are relative. For example, the pattern S16 is considered to appear in S1, S2 and S3 because {1} appears one time unit after the itemset {1, 2} in all of these sequences, even though the timestamps do not need to be the same in all of these sequences.

Implementation details

In this implementation, we have followed the Hirate & Yamana (2006) algorithm closely. The only difference is that we did not keep the idea of an interval itemization function for discretization. But we have kept the core idea, which is to use the time constraints.

Note that the Hirate & Yamana algorithm is an extension of the PrefixSpan algorithm. We have implemented it based on our implementation of PrefixSpan.

Where can I get more information about this algorithm?

The Hirate & Yamana algorithm is described in this paper:

Yu Hirate, Hayato Yamana (2006) Generalized Sequential Pattern Mining with Item Intervals. JCP 1(3): 51-60.

The implementation of the algorithm in SPMF is part of the Fournier-Viger (2008) algorithm, which is described in this paper:

Fournier-Viger, P., Nkambou, R & Mephu Nguifo, E. (2008), A Knowledge Discovery Framework for Learning Task Models from User Interactions in Intelligent Tutoring Systems. Proceedings of the 7th Mexican International Conference on Artificial Intelligence (MICAI 2008). LNAI 5317, Springer, pp. 765-778.

Example 45 : Mining Closed Sequential Patterns with Time Constraints from a Time-Extended Sequence Database

How to run this example?

What is this algorithm?

The Fournier-Viger et al. (2008) algorithm is a sequential pattern mining algorithm combining features from several other sequential pattern mining algorithms. It also offers some original features. In this example, we show how it can be used to discover closed sequential patterns with time constraints.

Closed sequential patterns are a compact representation of all sequential patterns. Mining closed sequential patterns is important because it can greatly reduce the number of patterns found without loss of information. Using time constraints is important because it makes it possible to filter uninteresting patterns according to time-related criteria.

Mining closed patterns or using time constraints is also important because it can greatly improve speed and memory usage.

What is the input?

The input is a time-extended sequence database (as defined by Hirate & Yamana, 2006) and some constraints.

A time-extended sequence database is a set of time-extended sequences. A time-extended sequence is a list of itemsets (groups of items). Each itemset is annotated with a timestamp, which is an integer value.

For example, consider the following time-extended sequence database provided in the file contextSequencesTimeExtended.txt of the SPMF distribution. The database contains 4 time-extended sequences. Each sequence contains itemsets that are annotated with a timestamp. For example, consider the sequence S1. This sequence indicates that itemset {1} appeared at time 0. It was followed by the itemset {1, 2, 3} at time 1. This latter itemset was followed by the itemset {1, 3} at time 2.

ID Sequences
S1 (0, 1), (1, 1 2 3), (2, 1 3)
S2 (0, 1), (1, 1 2), (2, 1 2 3), (3, 1 2 3)
S3 (0, 1 2), (1, 1 2)
S4 (0, 2), (1, 1 2 3)

The algorithm discovers closed time-extended sequential patterns that are common to several sequences. To do that, the user needs to provide the same five constraints as in the previous example: minsup, min_time_interval, max_time_interval, min_whole_interval and max_whole_interval (see the paper by Hirate & Yamana, 2006 for full details).

Note: it is recommended to set max_whole_interval to a very large value because, if it is used with closed sequential pattern mining, the algorithm becomes approximate and may not return all closed sequential patterns respecting the time constraints (the reason is that this constraint is not compatible with the "backScan" pruning of the BIDE+ algorithm).

What is the output?

The output is a set of closed time-extended sequential patterns meeting the constraints given by the user. For example, if we run the algorithm with minsup = 55 %, min_time_interval = 0, max_time_interval = 2, min_whole_interval = 0, max_whole_interval = 100, we obtain the following results:

ID Sequential Patterns Support
S1 (0, 1 2 3) 75 %
S2 (0, 1 2) 100 %
S3 (0, 2), (1, 1 2) 75 %
S4 (0, 2), (1, 1 3) 75 %
S5 (0, 2), (1, 1) 100 %
S6 (0, 1), (1, 1 2) 75 %
S7 (0, 1 2), (1, 1) 75 %

For instance, the last pattern S7 indicates that the items 1 and 2 were followed by item 1 one time unit later. This pattern has a support of 75 % because it appears in S1, S2 and S3. It is important to note that the timestamps in the sequential patterns found are relative. For example, the pattern S7 is considered to appear in S1, S2 and S3 because {1} appears one time unit after {1, 2} in all of these sequences, even though the timestamps do not need to be the same in all of these sequences.

If you compare the results of this example with the previous example, you can observe that the number of closed time-extended sequential patterns (7) is much smaller than the number of time-extended sequential patterns (16) found in the previous example.

Implementation details

To propose an algorithm to discover closed sequential patterns with time constraints, we have combined ideas from the BIDE+ algorithm and the Hirate & Yamana algorithm. Both of these algorithms are based on the PrefixSpan algorithm.

Where can I get more information about this algorithm?

The Fournier-Viger (2008) algorithm is described in this paper:

Fournier-Viger, P., Nkambou, R. & Mephu Nguifo, E. (2008). A Knowledge Discovery Framework for Learning Task Models from User Interactions in Intelligent Tutoring Systems. Proceedings of the 7th Mexican International Conference on Artificial Intelligence (MICAI 2008). LNAI 5317, Springer, pp. 765-778.

The idea of using time constraints is based on this paper:

Yu Hirate, Hayato Yamana (2006) Generalized Sequential Pattern Mining with Item Intervals. JCP 1(3): 51-60.

The idea of mining closed sequential pattern is based on this paper:

J. Wang, J. Han: BIDE: Efficient Mining of Frequent Closed Sequences. ICDE 2004: 79-90

Example 46 : Mining Sequential Patterns with Time Constraints from a Time-Extended Sequence Database containing Valued Items

How to run this example?

To run this example with the source code version of SPMF, launch the file "MainTestSequentialPatternsMining3.java" in the package ca.pfv.SPMF.tests.

This example is not available in the release version of SPMF.

What is this algorithm?

The Fournier-Viger et al. (2008) algorithm is a sequential pattern mining algorithm combining features from several other sequential pattern mining algorithms. It also offers some original features. In this example, we show how it can be used to discover sequential patterns with time constraints from a sequence database where items are annotated with integer values.

To create an algorithm that can do that, we have extended the Hirate & Yamana (2006) algorithm to accept items with integer values.

What is the input?

The input is a time-extended sequence database where items are annotated with integer values (Fournier-Viger et al., 2008), a minsup threshold (a value in [0, 1] representing a percentage) and some constraints.

Such a database is defined as a set of sequences. A sequence is here a list of itemsets (groups of items), where each item is a symbol represented by an integer, and where each item can be annotated with an integer representing a value associated with the item. Furthermore, each itemset has a timestamp indicating the time at which it occurred. Note that an item is not allowed to occur more than once in an itemset and that items in an itemset are assumed to be sorted in lexicographical order.

For example, consider the following database. The integer values annotated to items are indicated in parentheses. Consider the sequence S1 of this database. At time 0, the item 1 occurred with the value 2, and item 2 occurred without an annotated value. At time 1, item 3 occurred. At time 2, item 6 occurred. At time 3, item 5 occurred with the value 1.

ID Sequences
S1 (0, 1(2) 2), (1, 3), (2, 6), (3, 5(1))
S2 (0, 1(2) 2), (1, 4(8)), (2, 3), (3, 5(2) 6 7)
S3 (0, 1(3) 2), (1, 4(7)), (2, 3), (3, 5(4))
S4 (0, 1(3) 2), (1, 4(6)), (2, 5(5)), (3, 6 7)

What is the output?

To mine sequential patterns from a time-extended sequence database where items can be annotated with integer values, we have added an original mechanism to the Fournier-Viger et al. algorithm. Because it is a little bit complicated to explain, please refer to Fournier-Viger et al., 2008 for a detailed description.

Like PrefixSpan, the algorithm grows patterns one item at a time by recursively projecting the database by a frequent item. However, we have modified this step so that, when the support of an annotated item is higher than or equal to 2 * minsup, the database projection operation calls the K-Means algorithm to try to separate the annotation values into two or more clusters. Thereafter, the database is projected separately for each group of values. Thus, the different groups are considered as different items.
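As a rough illustration of this clustering step, here is a minimal 1-D K-Means sketch with two clusters (the actual implementation and its parameters may differ; the class and method names are hypothetical):

import java.util.ArrayList;
import java.util.List;

public class ValueClusteringSketch {

    // Splits the integer values annotated to an item into two clusters using 1-D K-Means.
    static List<List<Integer>> kmeans2(int[] values, int iterations) {
        double c1 = values[0];                      // initial centroids: first and last value
        double c2 = values[values.length - 1];
        List<List<Integer>> clusters = List.of();
        for (int it = 0; it < iterations; it++) {
            List<Integer> g1 = new ArrayList<>();
            List<Integer> g2 = new ArrayList<>();
            for (int v : values) {                  // assignment step: nearest centroid
                if (Math.abs(v - c1) <= Math.abs(v - c2)) g1.add(v); else g2.add(v);
            }
            c1 = mean(g1, c1);                      // update step: recompute centroids
            c2 = mean(g2, c2);
            clusters = List.of(g1, g2);
        }
        return clusters;
    }

    static double mean(List<Integer> group, double fallback) {
        if (group.isEmpty()) return fallback;
        double sum = 0;
        for (int v : group) sum += v;
        return sum / group.size();
    }

    public static void main(String[] args) {
        // The values of item "1" in S1..S4 are 2, 2, 3 and 3 (see the table above).
        System.out.println(kmeans2(new int[]{2, 2, 3, 3}, 10));  // prints: [[2, 2], [3, 3]]
    }
}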

This is best illustrated with an example. If we mine patterns from the database above with minsup = 50 %, we can get 56 sequential patterns. Note, however, that the number of patterns found can vary because K-Means is a randomized algorithm. For this example, six of the patterns found are presented in the next table:

ID Sequential Patterns Support
P1 (0, 3) 75%
P2 (0, 5 (average: 3 min: 1 max: 5)) 100 %
P3 (0, 6) 75 %
P4 (0, 3), (1, 6) 50 %
P5 (0, 1 (average: 3 min: 3 max: 3) 2), (1, 4 (average: 6.5 min: 6 max: 7)) 50 %
P6 (0, 1 (average: 2 min: 2 max: 2) 2), (3, 5 (average: 1.5 min: 1 max: 2)) 50 %
... ... ...

When the algorithm was executed, at some point it considered projecting the database with item "1" because this item is frequent. Item "1" is annotated with the values 2, 2, 3 and 3 in sequences S1, S2, S3 and S4, respectively. The algorithm applied K-Means to find clusters. Two clusters were found: {2, 2} and {3, 3}. We can see this in patterns P5 and P6. In pattern P5, item "1" represents the cluster {3, 3}, whereas pattern P6 includes item "1" with the cluster {2, 2}. This clustering feature is useful because it allows grouping similar values together for an item and treating the groups differently.

In the source code of MainTestSequentialPatternsMining3.java, there are a few parameters for K-Means. Two of these parameters are particularly important:

Important note: If the clustering described in this example is used jointly with the mining of closed sequential patterns (described in a previous example), the set of patterns found may not be a lossless representation of all patterns.

Where can I get more information about this algorithm?

The algorithm is described in this paper:

Fournier-Viger, P., Nkambou, R & Mephu Nguifo, E. (2008), A Knowledge Discovery Framework for Learning Task Models from User Interactions in Intelligent Tutoring Systems. Proceedings of the 7th Mexican International Conference on Artificial Intelligence (MICAI 2008). LNAI 5317, Springer, pp. 765-778.

Example 47 : (Closed) Multi-dimensional Sequential Pattern Mining with Time Constraints

How to run this example?

What is this algorithm?

The Fournier-Viger et al. (2008) algorithm is a sequential pattern mining algorithm combining features from several other sequential pattern mining algorithms. It also offers some original features. In this example, we show how it can be used to discover multi-dimensional sequential patterns with time constraints.

Multi-dimensional sequential pattern mining is an extension of the problem of sequential pattern mining that considers the context of each sequence. We have taken the SeqDIM algorithm for multi-dimensional sequential pattern mining and modified it to use features of the Hirate & Yamana (2006) algorithm so that it only discovers patterns that respect the time constraints set by the user.

What is the input?

The input is a multidimensional time-extended sequence database. We only provide a brief explanation here. Please refer to the article by Fournier-Viger et al., 2008 for more details and a formal definition.

A time-extended multidimensional sequence database is a set of time-extended multi-dimensional sequences. A time-extended multi-dimensional sequence (here called MD-Sequence) is a time-extended sequence (as defined by Hirate & Yamana) but with dimensional information (as defined by Pinto et al. 2001). The set of dimensional values for an MD-Sequence is called an MD-Pattern. For a multi-dimensional database, there is a fixed set of dimensions. Each dimension can take a symbolic value or the value "*", which means any value. In the following MD-Database, there are four MD-Sequences named S1, S2, S3 and S4.

MD-Sequences
ID MD-Patterns Sequences
  d1 d2 d3  
S1 1 1 1 (0, 2 4), (1, 3), (2, 2), (3, 1)
S2 1 2 2 (0, 2 6), (1, 3 5), (2, 6 7)
S3 1 2 1 (0, 1 8), (1, 1), (2, 2), (3, 6)
S4 * 3 3 (0, 2 5), (1, 3 5)

The task of multi-dimensional sequential pattern mining consists of finding MD-Sequences that have a support higher than a minimum support threshold minsup in an MD-Database. Furthermore, the Fournier-Viger algorithm offers the possibility to mine only frequent closed MD-Sequences, by implementing the idea of Songram et al. (2006). A frequent closed MD-Sequence is a frequent MD-Sequence that is not included in any other MD-Sequence having the same support.
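To illustrate how dimension values are compared, here is a minimal sketch (hypothetical names, not the SPMF code) of the containment test between two MD-Patterns, where the value "*" matches any value:

public class MDPatternSketch {

    // Returns true if the first MD-Pattern is equal to or more general than the second,
    // i.e. every dimension value is either "*" or identical in both patterns.
    static boolean contains(String[] general, String[] specific) {
        for (int i = 0; i < general.length; i++) {
            if (!general[i].equals("*") && !general[i].equals(specific[i])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] p2 = {"1", "*", "1"};  // the MD-Pattern of P2 in the result table below
        String[] s1 = {"1", "1", "1"};  // the MD-Pattern of sequence S1
        System.out.println(contains(p2, s1));  // prints: true
    }
}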

This algorithm has five parameters:

1) minsup, a minimum support threshold (a percentage)

2) mininterval, the minimum time interval allowed between two adjacent itemsets

3) maxinterval, the maximum time interval allowed between two adjacent itemsets

4) minwholeinterval, the minimum time interval allowed between the first and the last itemset of a pattern

5) maxwholeinterval, the maximum time interval allowed between the first and the last itemset of a pattern

What is the output?

The output is the set of frequent closed MD-Sequences contained in the database (see the Fournier-Viger, 2008 paper for a formal definition).

For example, if we mine frequent closed MD-Sequences from the previous database with minsup = 50 %, mininterval = 0, maxinterval = 1000, minwholeinterval = 0, maxwholeinterval = 1000, we obtain the following result:

Frequent Closed MD-Sequences
ID MD-Patterns Sequences
  d1 d2 d3  
P1 * * * (0, 2)
P2 1 * 1 (0, 1)
P3 1 2 * (0, 6)
P4 * * * (0, 2), (1, 3 5)
P5 * * * (0, 2), (1, 3)

If we mine frequent MD-sequences instead of frequent closed MD-Sequences, we will obtain 23 frequent MD-Sequences instead.

Where can I get more information about this algorithm?

The algorithm is described in this paper:

Fournier-Viger, P., Nkambou, R. & Mephu Nguifo, E. (2008). A Knowledge Discovery Framework for Learning Task Models from User Interactions in Intelligent Tutoring Systems. Proceedings of the 7th Mexican International Conference on Artificial Intelligence (MICAI 2008). LNAI 5317, Springer, pp. 765-778.

The idea of using time constraints is based on this paper:

Yu Hirate, Hayato Yamana (2006) Generalized Sequential Pattern Mining with Item Intervals. JCP 1(3): 51-60.

The idea of multi-dimensional pattern mining is based on this paper:

H. Pinto, J. Han, J Pei, K. Wang, Q. Chen, U. Dayal: Multi-Dimensional Sequential Pattern Mining. CIKM 2001: 81-88

The idea of closed multi-dimensional pattern mining is based on this paper:

P. Songram, V. Boonjing, S. Intakosum: Closed Multi-dimensional Sequential-Pattern Mining. Proc. of ITNG 2006.

The idea of mining closed sequential pattern is based on this paper:

J. Wang, J. Han: BIDE: Efficient Mining of Frequent Closed Sequences. ICDE 2004: 79-90

Example 48 : Mining Sequential Rules Common to Several Sequences with the CMRules algorithm

How to run this example?

What is CMRules?

CMRules is an algorithm for discovering sequential rules that appear in sequence databases. This algorithm was proposed by Fournier-Viger et al. in 2010.

What is the input of CMRules ?

The input of CMRules is a sequence database and two user-specified thresholds named minsup (a value in [0, 1] representing a percentage) and minconf (a value in [0, 1] representing a percentage).

A sequence database is a set of sequences where each sequence is a list of itemsets. An itemset is an unordered set of items. For example, the table shown below contains four sequences. The first sequence, named S1, contains 5 itemsets. It means that item 1 was followed by items 1, 2 and 3 at the same time, which were followed by 1 and 3, followed by 4, and followed by 3 and 6. It is assumed that items in an itemset are sorted in lexicographical order. This database is provided in the file "contextPrefixSpan.txt" of the SPMF distribution.

ID Sequences
S1 (1), (1 2 3), (1 3), (4), (3 6)
S2 (1 4), (3), (2 3), (1 5)
S3 (5 6), (1 2), (4 6), (3), (2)
S4 (5), (7), (1 6), (3), (2), (3)

What is the output of CMRules ?

Given a sequence database, and parameters named minsup and minconf, CMRules outputs all sequential rules having a support and confidence respectively higher than minsup and minconf.

A sequential rule X==>Y is a sequential relationship between two sets of items X and Y such that X and Y are disjoint, both X and Y are unordered, and the items of X must appear before the items of Y. The support of a rule X==>Y is the number of sequences that contain X∪Y, divided by the number of sequences in the database. The confidence of a rule is the number of sequences that contain X∪Y, divided by the number of sequences that contain X.
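The following sketch (a naive illustration, not the SPMF implementation) shows how the support and confidence of such a rule can be computed on the database above, where a rule occurs in a sequence if all items of X appear before all items of Y:

import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SeqRuleMetrics {

    // Does the rule X ==> Y occur in the sequence? True if there is a cut point such
    // that all items of X appear before it and all items of Y appear at or after it.
    static boolean occurs(List<Set<Integer>> sequence, Set<Integer> x, Set<Integer> y) {
        for (int cut = 1; cut < sequence.size(); cut++) {
            Set<Integer> before = new HashSet<>(), after = new HashSet<>();
            for (int i = 0; i < cut; i++) before.addAll(sequence.get(i));
            for (int i = cut; i < sequence.size(); i++) after.addAll(sequence.get(i));
            if (before.containsAll(x) && after.containsAll(y)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        List<List<Set<Integer>>> db = List.of(
            List.of(Set.of(1), Set.of(1, 2, 3), Set.of(1, 3), Set.of(4), Set.of(3, 6)), // S1
            List.of(Set.of(1, 4), Set.of(3), Set.of(2, 3), Set.of(1, 5)),               // S2
            List.of(Set.of(5, 6), Set.of(1, 2), Set.of(4, 6), Set.of(3), Set.of(2)),    // S3
            List.of(Set.of(5), Set.of(7), Set.of(1, 6), Set.of(3), Set.of(2), Set.of(3))); // S4

        Set<Integer> x = Set.of(1, 4), y = Set.of(3);   // the rule 1,4 ==> 3
        int xyCount = 0, xCount = 0;
        for (List<Set<Integer>> sequence : db) {
            if (occurs(sequence, x, y)) xyCount++;
            Set<Integer> allItems = new HashSet<>();
            sequence.forEach(allItems::addAll);
            if (allItems.containsAll(x)) xCount++;
        }
        System.out.println("support = " + (double) xyCount / db.size());  // prints: 0.75
        System.out.println("confidence = " + (double) xyCount / xCount);  // prints: 1.0
    }
}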

In this example, we apply CMRules with minsup = 75 % and minconf = 50 %. We obtain 9 sequential rules:

Rule Support Confidence
1 ==> 2 100 % 100 %
1 ==> 3 100 % 100 %
2 ==> 3 75 % 75 %
3 ==> 2 75 % 75 %
4 ==> 3 75 % 100 %
1 3 ==> 2 75 % 75 %
1 2 ==> 3 75 % 75 %
1 4 ==> 3 75 % 100 %
1 ==> 2 3 100 % 100 %

For example, the rule 1 4 ==> 3 means that if 1 and 4 appear in any order, they will be followed by 3 with a confidence of 100 %. Moreover, this rule has a support of 75 % because it appears in three sequences (S1, S2 and S3) out of four sequences.

Performance

CMRules is a relatively efficient algorithm. However, the RuleGrowth algorithm is faster.

What is interesting about CMRules is that it uses an association rule mining based approach to discover sequential rules. Therefore, it could be used to discover both sequential rules and association rules at the same time.

Where can I get more information about this algorithm?

The CMRules algorithm is described in this paper:

Fournier-Viger, P., Faghihi, U., Nkambou, R., Mephu Nguifo, E. (2012). CMRules: Mining Sequential Rules Common to Several Sequences. Knowledge-based Systems, Elsevier, 25(1): 63-76.

Example 49 : Mining Sequential Rules Common to Several Sequences with the CMDeo algorithm

How to run this example?

What is CMDeo ?

CMDeo is an algorithm for discovering sequential rules that appear in sequence databases. It was proposed by Fournier-Viger et al. in 2010.

What is the input of CMDeo ?

The input of CMDeo is a sequence database and two user-specified thresholds named minsup (a value in [0, 1] representing a percentage) and minconf (a value in [0, 1] representing a percentage).

A sequence database is a set of sequences where each sequence is a list of itemsets. An itemset is an unordered set of items. For example, the table shown below contains four sequences. The first sequence, named S1, contains 5 itemsets. It means that item 1 was followed by items 1, 2 and 3 at the same time, which were followed by 1 and 3, followed by 4, and followed by 3 and 6. It is assumed that items in an itemset are sorted in lexicographical order. This database is provided in the file "contextPrefixSpan.txt" of the SPMF distribution.

ID Sequences
S1 (1), (1 2 3), (1 3), (4), (3 6)
S2 (1 4), (3), (2 3), (1 5)
S3 (5 6), (1 2), (4 6), (3), (2)
S4 (5), (7), (1 6), (3), (2), (3)

What is the output of CMDeo ?

Given a sequence database, and parameters named minsup and minconf, CMDeo outputs all sequential rules having a support and confidence respectively higher than minsup and minconf.

A sequential rule X==>Y is a sequential relationship between two sets of items X and Y such that X and Y are disjoint, both X and Y are unordered, and the items of X must appear before the items of Y. The support of a rule X==>Y is the number of sequences that contain X∪Y, divided by the number of sequences in the database. The confidence of a rule is the number of sequences that contain X∪Y, divided by the number of sequences that contain X.

In this example, we apply CMDeo with minsup = 75 % and minconf = 50 %. We obtain 9 sequential rules:

Rule Support Confidence
1 ==> 2 100 % 100 %
1 ==> 3 100 % 100 %
2 ==> 3 75 % 75 %
3 ==> 2 75 % 75 %
4 ==> 3 75 % 100 %
1 3 ==> 2 75 % 75 %
1 2 ==> 3 75 % 75 %
1 4 ==> 3 75 % 100 %
1 ==> 2 3 100 % 100 %

For example, the rule 1 4 ==> 3 means that if 1 and 4 appear in any order, they will be followed by 3 with a confidence of 100 %. Moreover, this rule has a support of 75 % because it appears in three sequences (S1, S2 and S3) out of four sequences.

Performance

CMDeo is a relatively efficient algorithm. However, the RuleGrowth algorithm is faster.

Where can I get more information about this algorithm?

The CMDeo algorithm is described in this paper:

Fournier-Viger, P., Faghihi, U., Nkambou, R., Mephu Nguifo, E. (2012). CMRules: Mining Sequential Rules Common to Several Sequences. Knowledge-based Systems, Elsevier, 25(1): 63-76.

Example 50 : Mining Sequential Rules Common to Several Sequences with the RuleGrowth algorithm

How to run this example?

What is RuleGrowth?

RuleGrowth is an algorithm for discovering sequential rules that appear in sequence databases. It was proposed by Fournier-Viger et al. in 2011.

What is the input of RuleGrowth ?

The input of RuleGrowth is a sequence database and two user-specified thresholds named minsup (a value in [0, 1] representing a percentage) and minconf (a value in [0, 1] representing a percentage).

A sequence database is a set of sequences where each sequence is a list of itemsets. An itemset is an unordered set of items. For example, the table shown below contains four sequences. The first sequence, named S1, contains 5 itemsets. It means that item 1 was followed by items 1, 2 and 3 at the same time, which were followed by 1 and 3, followed by 4, and followed by 3 and 6. It is assumed that items in an itemset are sorted in lexicographical order. This database is provided in the file "contextPrefixSpan.txt" of the SPMF distribution.

ID Sequences
S1 (1), (1 2 3), (1 3), (4), (3 6)
S2 (1 4), (3), (2 3), (1 5)
S3 (5 6), (1 2), (4 6), (3), (2)
S4 (5), (7), (1 6), (3), (2), (3)

What is the output of RuleGrowth ?

Given a sequence database, and parameters named minsup and minconf, RuleGrowth outputs all sequential rules having a support and confidence respectively higher than minsup and minconf.

A sequential rule X==>Y is a sequential relationship between two sets of items X and Y such that X and Y are disjoint, both X and Y are unordered, and the items of X must appear before the items of Y. The support of a rule X==>Y is the number of sequences that contain X∪Y, divided by the number of sequences in the database. The confidence of a rule is the number of sequences that contain X∪Y, divided by the number of sequences that contain X.

In this example, we apply RuleGrowth with minsup = 75 % and minconf = 50 %. We obtain 9 sequential rules:

Rule Support Confidence
1 ==> 2 100 % 100 %
1 ==> 3 100 % 100 %
2 ==> 3 75 % 75 %
3 ==> 2 75 % 75 %
4 ==> 3 75 % 100 %
1 3 ==> 2 75 % 75 %
1 2 ==> 3 75 % 75 %
1 4 ==> 3 75 % 100 %
1 ==> 2 3 100 % 100 %

For example, the rule 1 4 ==> 3 means that if 1 and 4 appear in any order, they will be followed by 3 with a confidence of 100 %. Moreover, this rule has a support of 75 % because it appears in three sequences (S1, S2 and S3) out of four sequences.

Performance

RuleGrowth is a very efficient algorithm. It is faster and more memory efficient than CMDeo and CMRules.

Note that there is a variation of RuleGrowth that accepts time constraints. It is named TRuleGrowth and is also offered in SPMF. There is also a variation for mining top-k sequential rules, named TopSeqRules, offered in SPMF.

Where can I get more information about this algorithm?

The RuleGrowth algorithm is described in this paper:

Fournier-Viger, P., Nkambou, R. & Tseng, V. S. (2011). RuleGrowth: Mining Sequential Rules Common to Several Sequences by Pattern-Growth. Proceedings of the 26th Symposium on Applied Computing (ACM SAC 2011). ACM Press, pp. 954-959.

Example 51 : Mining Sequential Rules between Sequential Patterns with the RuleGen algorithm

How to run this example?

What is RuleGen?

RuleGen is an algorithm for discovering sequential rules that appear in sequence databases. It was proposed by Zaki (2000).

What is the input of RuleGen ?

The input of RuleGen is a sequence database and two user-specified thresholds named minsup (a value in [0, 1] representing a percentage) and minconf (a value in [0, 1] representing a percentage).

A sequence database is a set of sequences where each sequence is a list of itemsets. An itemset is an unordered set of items. For example, the table shown below contains four sequences. The first sequence, named S1, contains 5 itemsets. It means that item 1 was followed by items 1, 2 and 3 at the same time, which were followed by 1 and 3, followed by 4, and followed by 3 and 6. It is assumed that items in an itemset are sorted in lexicographical order. This database is provided in the file "contextPrefixSpan.txt" of the SPMF distribution.

ID Sequences
S1 (1), (1 2 3), (1 3), (4), (3 6)
S2 (1 4), (3), (2 3), (1 5)
S3 (5 6), (1 2), (4 6), (3), (2)
S4 (5), (7), (1 6), (3), (2), (3)

What is the output of RuleGen ?

The RuleGen algorithm outputs all sequential rules having a support and confidence respectively higher than or equal to the user-specified minsup and minconf thresholds.

A rule X==>Y is defined by RuleGen as a sequential relationship between two sequential patterns X and Y, where X is a subsequence of Y. The confidence of a rule X ==> Y is defined as the number of sequences containing Y divided by the number of sequences containing X. The support of a rule X ==> Y is defined as the number of sequences containing Y.
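The key primitive here is testing whether a sequential pattern is contained in a sequence. Below is a minimal sketch of that test and of how the support of a RuleGen rule follows from it (hypothetical names, not Zaki's implementation):

import java.util.List;
import java.util.Set;

public class RuleGenMetricsSketch {

    // Is the pattern (a list of itemsets) contained in the sequence, in order?
    static boolean isSubsequence(List<Set<Integer>> pattern, List<Set<Integer>> sequence) {
        int matched = 0;
        for (Set<Integer> itemset : sequence) {
            if (matched < pattern.size() && itemset.containsAll(pattern.get(matched))) {
                matched++;  // greedily match the next itemset of the pattern
            }
        }
        return matched == pattern.size();
    }

    public static void main(String[] args) {
        // S1 = (1), (1 2 3), (1 3), (4), (3 6)
        List<Set<Integer>> s1 = List.of(Set.of(1), Set.of(1, 2, 3),
                Set.of(1, 3), Set.of(4), Set.of(3, 6));
        // The consequent of the rule {1} ==> {1}{3}{3}: the support of the rule is
        // the number of sequences containing this pattern.
        List<Set<Integer>> y = List.of(Set.of(1), Set.of(3), Set.of(3));
        System.out.println(isSubsequence(y, s1));  // prints: true
    }
}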

In this example, we apply RuleGen with minsup = 75 % and minconf = 50 %. We obtain 21 sequential rules:

Rule Support Confidence
{1} ==> {1}{2} 4 1.0
{1} ==> {1}{3} 4 1.0
{1} ==> {1}{3}{2} 3 0.75
{1} ==> {1}{3}{3} 3 0.75
{2} ==> {1}{2} 4 1.0
{2} ==> {2}{3} 3 0.75
{2} ==> {3}{2} 3 0.75
{2} ==> {1}{3}{2} 3 0.75
{3} ==> {1}{3} 4 1.0
{3} ==> {2}{3} 3 0.75
{3} ==> {3}{2} 3 0.75
{3} ==> {3}{3} 3 0.75
{3} ==> {4}{3} 3 0.75
{3} ==> {1}{3}{2} 3 0.75
{3} ==> {1}{3}{3} 3 0.75
{4} ==> {4}{3} 3 1.0
{1}{2} ==> {1}{3}{2} 3 0.75
{1}{3} ==> {1}{3}{2} 3 0.75
{1}{3} ==> {1}{3}{3} 3 0.75
{3}{2} ==> {1}{3}{2} 3 1.0
{3}{3} ==> {1}{3}{3} 3 1.0

For example, the rule {1} ==> {1}{3}{3} means that if the sequential pattern {1} appears in a sequence, the sequential pattern {1}{3}{3} also appears in that sequence. In other words, it means that if {1} appears, it will be followed by {3}, which will be followed by {3}.

Implementation details

The RuleGen algorithm first applies a sequential pattern mining algorithm and then combines pairs of sequential patterns to generate rules between two sequential patterns. Note that in our implementation we use the PrefixSpan algorithm for mining sequential patterns instead of SPADE, because the PrefixSpan algorithm is generally faster than SPADE.

Also, it is important to note that rules found by RuleGen always have the form X ==> Y such that X is a subsequence of Y. This definition of a sequential rule is different from the definition used by other sequential rule mining algorithms offered in SPMF, such as CMRules, CMDeo, RuleGrowth, TRuleGrowth, TopSeqRules and TNS, where X and Y are unordered itemsets and X and Y are disjoint. The rules found by these latter algorithms are more general. Moreover, we have shown that higher prediction accuracy can be achieved by using the kind of rules found by RuleGrowth and CMRules instead of the rules generated by RuleGen (see this article for details).

Where can I get more information about this algorithm?

The RuleGen algorithm is described in this paper:

Mohammed Javeed Zaki: Scalable Algorithms for Association Mining. IEEE Trans. Knowl. Data Eng. 12(3): 372-390 (2000)

Example 52 : Mining Sequential Rules Common to Several Sequences with the Window Size Constraint

How to run this example?

What is TRuleGrowth?

TRuleGrowth is an algorithm for discovering sequential rules that appear in sequence databases. It was proposed by Fournier-Viger et al. in 2012. It is a variation of the RuleGrowth algorithm that accepts a window size constraint.

What is the input of TRuleGrowth ?

The input of TRuleGrowth is a sequence database, two user-specified thresholds named minsup (a value in [0, 1] representing a percentage) and minconf (a value in [0, 1] representing a percentage), and a parameter named window_size (an integer >= 0).

A sequence database is a set of sequences where each sequence is a list of itemsets. An itemset is an unordered set of items. For example, the table shown below contains four sequences. The first sequence, named S1, contains 5 itemsets. It means that item 1 was followed by items 1, 2 and 3 at the same time, which were followed by 1 and 3, followed by 4, and followed by 3 and 6. It is assumed that items in an itemset are sorted in lexicographical order. This database is provided in the file "contextPrefixSpan.txt" of the SPMF distribution.

ID Sequences
S1 (1), (1 2 3), (1 3), (4), (3 6)
S2 (1 4), (3), (2 3), (1 5)
S3 (5 6), (1 2), (4 6), (3), (2)
S4 (5), (7), (1 6), (3), (2), (3)

What is the output of TRuleGrowth ?

Given a sequence database and parameters named minsup, minconf and window_size, TRuleGrowth outputs all sequential rules having a support and confidence respectively higher than minsup and minconf, and that appear within window_size consecutive itemsets.

A sequential rule X==>Y is a sequential relationship between two sets of items X and Y such that X and Y are disjoint, both X and Y are unordered, and the items of X must appear before the items of Y. The support of a rule X==>Y is the number of sequences that contain X∪Y, divided by the number of sequences in the database. The confidence of a rule is the number of sequences that contain X∪Y, divided by the number of sequences that contain X.
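The window constraint restricts where a rule may occur: X and Y must both appear inside a window of at most window_size consecutive itemsets. A naive sketch of that test is shown below (hypothetical names; SPMF's actual procedure is different and more efficient):

import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class WindowConstraintSketch {

    // Does X ==> Y occur within some window of at most windowSize consecutive itemsets?
    static boolean occursInWindow(List<Set<Integer>> sequence, Set<Integer> x,
                                  Set<Integer> y, int windowSize) {
        for (int start = 0; start < sequence.size(); start++) {
            int end = Math.min(sequence.size(), start + windowSize);
            // Try every cut point inside the window: X before the cut, Y at or after it.
            for (int cut = start + 1; cut < end; cut++) {
                Set<Integer> before = new HashSet<>(), after = new HashSet<>();
                for (int i = start; i < cut; i++) before.addAll(sequence.get(i));
                for (int i = cut; i < end; i++) after.addAll(sequence.get(i));
                if (before.containsAll(x) && after.containsAll(y)) return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // S2 = (1 4), (3), (2 3), (1 5)
        List<Set<Integer>> s2 = List.of(Set.of(1, 4), Set.of(3), Set.of(2, 3), Set.of(1, 5));
        System.out.println(occursInWindow(s2, Set.of(4), Set.of(3), 3));  // prints: true
    }
}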

For example, if we set minsup = 0.7, minconf = 0.8 and window_size = 3, TRuleGrowth discovers 4 rules:

Rule Support Confidence
{1} ==> {2} 100 % (4 sequences) 100 %
{1} ==> {2, 3} 100 % (4 sequences) 100 %
{1} ==> {3} 100 % (4 sequences) 100 %
{4} ==> {3} 75 % (3 sequences) 100 %

For example, the rule {1} ==> {2, 3} means that if 1 appears in a sequence, it will be followed by 2 and 3 (in any order) with a confidence of 100 %. Moreover, this rule has a support of 100 % because it appears in four sequences (S1, S2, S3 and S4) out of four sequences within window_size consecutive itemsets.

Performance

TRuleGrowth is a very efficient algorithm. It is faster and more memory efficient than CMDeo and CMRules. If the window_size constraint is used, it can also be much faster than RuleGrowth, depending on how the window_size constraint is set.

Implementation details

In SPMF, there is also a version of TRuleGrowth that accepts strings instead of integers. It is available under the name "TRuleGrowth with strings" in the release version of SPMF or in the package ca.pfv.spmf.sequential_rules.trulegrowth_with_strings for the source code version of SPMF. To run it, you should use the input file: contextPrefixSpanStrings.txt.

Where can I get more information about this algorithm?

The TRuleGrowth algorithm is described in this paper:

Fournier-Viger, P., Wu, C.-W., Tseng, V.S., Nkambou, R. (2012). Mining Sequential Rules Common to Several Sequences with the Window Size Constraint. Proceedings of the 25th Canadian Conf. on Artificial Intelligence (AI 2012), Springer, LNAI 7310, pp.299-304

Example 53 : Mining the Top-K Sequential Rules

How to run this example?

What is TopSeqRules?

TopSeqRules is an algorithm for discovering the top-k sequential rules appearing in a sequence database.

Why is it important to discover top-k sequential rules? Because other sequential rule mining algorithms require the user to set a minimum support (minsup) parameter that is hard to set (usually, users set it by trial and error, which is time-consuming). The TopSeqRules algorithm solves this problem by letting users directly indicate k, the number of rules to be discovered.

What is the input of TopSeqRules ?

TopSeqRules takes three parameters as input:

1) a sequence database

2) k, the number of sequential rules to be discovered (an integer >= 1)

3) minconf, a minimum confidence threshold (a value in [0, 1] representing a percentage)

A sequence database is a set of sequences where each sequence is a list of itemsets. An itemset is an unordered set of items. For example, the table shown below contains four sequences. The first sequence, named S1, contains 5 itemsets. It means that item 1 was followed by items 1, 2 and 3 at the same time, which were followed by 1 and 3, followed by 4, and followed by 3 and 6. It is assumed that items in an itemset are sorted in lexicographical order. This database is provided in the file "contextPrefixSpan.txt" of the SPMF distribution.

ID Sequences
S1 (1), (1 2 3), (1 3), (4), (3 6)
S2 (1 4), (3), (2 3), (1 5)
S3 (5 6), (1 2), (4 6), (3), (2)
S4 (5), (7), (1 6), (3), (2), (3)

What is the output of TopSeqRules ?

TopSeqRules outputs the k most frequent sequential rules having a confidence higher than or equal to minconf.

A sequential rule X==>Y is a sequential relationship between two sets of items X and Y such that X and Y are disjoint, both X and Y are unordered, and the items of X must appear before the items of Y. The support of a sequential rule X==>Y is the number of sequences that contain X∪Y, divided by the number of sequences in the database. The confidence of a sequential rule is the number of sequences that contain X∪Y, divided by the number of sequences that contain X.

For example, if we run TopSeqRules with k = 3 and minconf = 0.8, the result is the following rules:

Rule Support Confidence
{3} ==> {4} 75 % (3 sequences) 100 %
{1} ==> {3} 100 % (4 sequences) 100 %
{1,4} ==> {3} 75 % (3 sequences) 100 %

These rules are the top three rules appearing in the sequence database having a confidence higher than or equal to 80 %.

For example, the rule {1,4} ==> {3} means that if 1 and 4 appear in any order, they will be followed by 3 with a confidence of 100 %. Moreover, this rule has a support of 75 % because it appears in three sequences (S1, S2 and S3) out of four sequences.

It is important to note that for some values of k, the algorithm may return slightly more rules than k. This can happen if several rules have exactly the same support, and it is normal.
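Internally, algorithms of this family typically keep the current best rules in a structure ordered by support and raise an internal minsup threshold as the structure fills up, which prunes the search space. Here is a minimal sketch of that bookkeeping (hypothetical, not the actual TopSeqRules code; the real algorithm also keeps rules tied on support, which is why more than k rules can be returned):

import java.util.PriorityQueue;

public class TopKBufferSketch {

    record Rule(String name, int support) {}

    public static void main(String[] args) {
        int k = 3;
        // Min-heap ordered by support: the weakest of the current top-k rules is on top.
        PriorityQueue<Rule> topK = new PriorityQueue<>((a, b) -> a.support() - b.support());
        int internalMinsup = 1;  // raised as better rules are found

        int[] candidateSupports = {3, 4, 2, 3, 4, 1};  // supports of rules found during the search
        for (int i = 0; i < candidateSupports.length; i++) {
            if (candidateSupports[i] < internalMinsup) continue;  // pruned
            topK.add(new Rule("r" + i, candidateSupports[i]));
            if (topK.size() > k) {
                topK.poll();                                 // evict the weakest rule
                internalMinsup = topK.peek().support();      // raise the internal threshold
            }
        }
        topK.forEach(rule -> System.out.println(rule.name() + " sup=" + rule.support()));
    }
}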

Performance

TopSeqRules is a very efficient algorithm. It is based on the RuleGrowth algorithm, which is one of the most efficient algorithms for mining sequential rules.

It is more intuitive to use TopSeqRules than RuleGrowth. However, it should be noted that the problem of top-k sequential rule mining is more computationally expensive than the problem of sequential rule mining. Therefore, it is recommended to use TopSeqRules for k values of up to 1000 or 2000, depending on the dataset. If more rules need to be found, it could be better to use RuleGrowth or TRuleGrowth.

Besides, note that there is a variation of TopSeqRules named TNS that is available in SPMF. The improvement in TNS is that it eliminates some sequential rules that are deemed "redundant" (rules that are included in other rules having the same support and confidence; see the TNS example for the formal definition). Using TNS is more costly than using TopSeqRules, but it brings the benefit of eliminating some redundancy in the results.

Where can I get more information about this algorithm?

The TopSeqRules algorithm is described in this paper:

Fournier-Viger, P. & Tseng, V. S. (2011). Mining Top-K Sequential Rules. Proceedings of the 7th Intern. Conf. on Advanced Data Mining and Applications (ADMA 2011). LNAI 7121, Springer, pp.180-194.

Example 54 : Mining the Top-K Non-Redundant Sequential Rules

How to run this example?

What is TNS?

TNS is an algorithm for discovering the top-k non-redundant sequential rules appearing in a sequence database. It is an approximate algorithm in the sense that it always generates non-redundant rules, but these may not always be the top-k non-redundant rules. TNS uses a parameter named delta, a positive integer that can be used to improve the chance that the result is exact (the higher the delta value, the higher the chance that the result will be exact).

Why is it important to discover top-k non-redundant sequential rules? Because other sequential rule mining algorithms require the user to set a minimum support (minsup) parameter that is hard to set (usually, users set it by trial and error, which is time-consuming). Moreover, the result of sequential rule mining algorithms usually contains a high level of redundancy (for example, thousands of rules can be found that are variations of other rules having the same support and confidence). The TNS algorithm provides a solution to both of these problems by letting users directly indicate k, the number of rules to be discovered, and by eliminating redundancy from the results.

What is the input of TNS ?

TNS takes four parameters as input:

1) a sequence database

2) k, the number of non-redundant sequential rules to be discovered (an integer >= 1)

3) minconf, a minimum confidence threshold (a value in [0, 1] representing a percentage)

4) delta, a positive integer used to improve the chance that the result is exact

A sequence database is a set of sequences where each sequence is a list of itemsets. An itemset is an unordered set of items. For example, the table shown below contains four sequences. The first sequence, named S1, contains 5 itemsets. It means that item 1 was followed by items 1, 2 and 3 at the same time, which were followed by 1 and 3, followed by 4, and followed by 3 and 6. It is assumed that items in an itemset are sorted in lexicographical order. This database is provided in the file "contextPrefixSpan.txt" of the SPMF distribution.

ID Sequences
S1 (1), (1 2 3), (1 3), (4), (3 6)
S2 (1 4), (3), (2 3), (1 5)
S3 (5 6), (1 2), (4 6), (3), (2)
S4 (5), (7), (1 6), (3), (2), (3)

What is the output of TNS ?

TNS outputs an approximation of the k most frequent non-redundant sequential rules having a confidence higher than or equal to minconf.

A sequential rule X==>Y is a sequential relationship between two sets of items X and Y such that X and Y are disjoint, both X and Y are unordered, and the items of X must appear before the items of Y. The support of a sequential rule X==>Y is denoted as sup(X==>Y) and defined as the number of sequences that contain X∪Y divided by the number of sequences in the database. The confidence of a sequential rule X==>Y is denoted as conf(X==>Y) and defined as the number of sequences that contain X∪Y, divided by the number of sequences that contain X.

A sequential rule ra: X → Y is redundant with respect to another rule rb: X1 → Y1 if and only if sup(ra) = sup(rb), conf(ra) = conf(rb), X ⊆ X1 and Y ⊆ Y1, and ra ≠ rb. In other words, a rule is redundant if it is included in another rule having the same support and the same confidence.

For example, if we run TNS with k = 10, minconf = 0.5 and delta = 2, the following set of non-redundant rules is found:

2 ==> 3  sup= 3  conf= 0.75
1,3 ==> 2  sup= 3  conf= 0.75
1,4 ==> 3  sup= 3  conf= 1.0
1 ==> 2,3  sup= 4  conf= 1.0
3 ==> 4  sup= 3  conf= 1.0
2,5 ==> 6  sup= 2  conf= 1.0
2,3 ==> 4  sup= 2  conf= 0.6666666666666666
1 ==> 2,3,4,6  sup= 2  conf= 0.5
3,5 ==> 6  sup= 2  conf= 1.0
2 ==> 3,4,6  sup= 2  conf= 0.5

For instance, the rule 1,4 ==> 3 means that if 1 and 4 appear in any order, they will be followed by 3 with a confidence of 100 %. Moreover, this rule has a support of 75 % (sup = 3) because it appears in three sequences (S1, S2 and S3) out of four sequences.

Note that for some values of k and some datasets, TNS may return more than k rules. This can happen if several rules have exactly the same support, and it is normal. It is also possible that the algorithm returns slightly fewer than k rules in some circumstances, because the algorithm is approximate.

Performance

TNS is an efficient algorithm. It is based on the TopSeqRules algorithm for discovering top-k sequential rules. The main difference between TNS and TopSeqRules is that TNS includes additional strategies to eliminate redundancy in the results, and that TNS is an approximate algorithm, while TopSeqRules is not.

TNS and TopSeqRules are more intuitive to use than regular sequential rule mining algorithms such as RuleGrowth. However, it should be noted that the problem of top-k sequential rule mining is more computationally expensive than the problem of sequential rule mining. Therefore, it is recommended to use TNS or TopSeqRules for k values of up to 1000 or 2000, depending on the dataset. If more rules need to be found, it could be better to use RuleGrowth or TRuleGrowth, for more efficiency.

Where can I get more information about this algorithm?

The TNS algorithm is described in this paper:

Fournier-Viger, P., Tseng, V. S. (2013). TNS: Mining Top-K Non-Redundant Sequential Rules. Proc. 28th Symposium on Applied Computing (ACM SAC 2013). ACM Press, to appear.

Example 55 : Creating a decision tree with the ID3 algorithm to predict the value of a target attribute

How to run this example?

To run this example with the source code version of SPMF, launch the file "MainTestID3.java" in the package ca.pfv.SPMF.tests.

This example is not available in the release version of SPMF.

What is the ID3 algorithm?

The ID3 algorithm is a classic data mining algorithm for classifying instances (a classifier). It is well-known and described in many artificial intelligence and data mining books.

What is the input?

The input is a set of training data for building a decision tree.

For instance, in this example, we use the following database (from the book "Machine Learning" by Tom Mitchell). This database is provided in the file "tennis.txt" of the SPMF distribution.

This database defines five attributes and contains fourteen instances.

outlook temp humid wind play?
sunny hot high weak no
sunny hot high strong no
overcast hot high weak yes
rain mild high weak yes
rain cool normal weak yes
rain cool normal strong no
overcast cool normal strong yes
sunny mild high weak no
sunny cool normal weak yes
rain mild normal weak yes
sunny mild normal strong yes
overcast mild high strong yes
overcast hot normal weak yes
rain mild high strong no

What is the output?

By applying the ID3 algorithm, a decision tree is created. To create the decision tree, we have to choose a target attribute. Here, we choose the attribute "play".

The following decision tree is created:

(Figure: the decision tree created by ID3 for the target attribute "play")

We can now use the tree to predict the value of the target attribute "play" for a new instance.

For example, consider this new instance, where the value for "play" is unknown.

sunny hot normal weak ?

By applying the decision tree, the value for the attribute "play" for this instance is "yes".
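The attribute tested at each node of the tree is the one with the highest information gain. As a rough illustration of the standard calculation (a sketch of the textbook formulas, not SPMF's code), the gain of the "wind" attribute on the fourteen instances above can be computed as follows:

public class InfoGainSketch {

    // Entropy of a two-class distribution: H = -p log2(p) - (1 - p) log2(1 - p).
    static double entropy(int yes, int no) {
        double total = yes + no, h = 0;
        for (double count : new double[]{yes, no}) {
            if (count > 0) {
                double p = count / total;
                h -= p * Math.log(p) / Math.log(2);
            }
        }
        return h;
    }

    public static void main(String[] args) {
        // The target attribute "play" has 9 yes and 5 no over the 14 instances.
        double base = entropy(9, 5);                                      // about 0.940
        // Splitting on "wind": weak = 6 yes / 2 no, strong = 3 yes / 3 no.
        double weighted = 8.0 / 14 * entropy(6, 2) + 6.0 / 14 * entropy(3, 3);
        System.out.println("gain(wind) = " + (base - weighted));          // about 0.048
    }
}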

Where can I get more information about the ID3 algorithm?

The ID3 algorithm was proposed by Quinlan (1986). It is one of the most popular algorithms for learning decision trees. By searching the web, you can find plenty of information about this algorithm. It is also described in several data mining and artificial intelligence books.

Example 56 : Converting a Sequence Database to SPMF Format

How to run this example?

The tool for converting a sequence database to SPMF format takes three parameters as input:

1) an input file (the sequence database to be converted)

2) an output file path (where the converted sequence database will be written)

3) the format of the input file (CSV_INTEGER, Kosarak, IBMGenerator, Snake or BMS)

The algorithm outputs a sequence database in SPMF format.

The CSV_INTEGER format is defined as follows: each line is a sequence. A line is a list of items, where each item is a positive integer and items are separated by commas.

For example, the following sequence database is in CSV_INTEGER format and contains four sequences:

1,2,3,4
5,6,7,8
5,6,7
1,2,3
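As a rough illustration of what the conversion does, here is a sketch (not the SPMF converter itself) that turns the CSV_INTEGER lines above into the SPMF sequence format, in which -1 closes an itemset and -2 ends a sequence, assuming each item forms its own itemset:

public class CsvToSpmfSketch {

    public static void main(String[] args) {
        String[] csvLines = {"1,2,3,4", "5,6,7,8", "5,6,7", "1,2,3"};
        for (String line : csvLines) {
            StringBuilder spmf = new StringBuilder();
            for (String item : line.split(",")) {
                spmf.append(item).append(" -1 ");  // -1 closes the single-item itemset
            }
            spmf.append("-2");                     // -2 marks the end of the sequence
            System.out.println(spmf);              // e.g. "1 -1 2 -1 3 -1 4 -1 -2"
        }
    }
}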

The Kosarak format is defined as follows: each line is a sequence. A line is a list of items, where each item is a positive integer and items are separated by single spaces.

For example, the following sequence database is in Kosarak format and contains four sequences:

1 2 3 4
5 6 7 8
5 6 7
1 2 3

The IBMGenerator format is the format used by the IBM Quest data generator. The format is defined as follows: each line is a sequence. A line is a list of items, where each item is a positive integer, each itemset is terminated by -1, and each sequence is terminated by -2.

For example, the following sequence database is in IBMGenerator format and contains four sequences:

1 -1 2 -1 3 -1 4 -1 -2
5 -1 6 -1 7 -1 8 -1 -2
5 -1 6 -1 7 -1 -2
1 -1 2 -1 3 -1 -2

The Snake format is defined as follows: each line is a sequence, and each character of a line is an item.

For example, the following sequence database is in Snake format and contains four sequences:

ABCD
ABAB
CACD
ADAC

The BMS format is defined as follows: each line contains a sequence id and an item (two positive integers separated by a space). Consecutive lines having the same sequence id form a single sequence.

For example, the following sequence database is in BMS format and contains four sequences with the ids 10, 20, 30 and 40, respectively:

10 1
10 2
10 3
10 4
20 5
20 6
20 7
20 8
30 5
30 6
30 7
40 1
40 2
40 3

Example 57 : Converting a Transaction Database to SPMF Format

How to run this example?

The tool for converting a transaction database to SPMF format takes three parameters as input:

1) an input file (the transaction database to be converted)

2) an output file path (where the converted transaction database will be written)

3) the format of the input file (currently, CSV_INTEGER)

The algorithm outputs a transaction database in SPMF format.

The CSV_INTEGER format is defined as follows: each line is a transaction. A line is a list of items, where each item is a positive integer and items are separated by commas.

For example, the following transaction database is in CSV_INTEGER format and contains four transactions:

1,2,3,4
5,6,7,8
5,6,7
1,2,3

Other formats will be added eventually.

Example 58 : Generating a Synthetic Sequence Database

How to run this example?

What is this tool?

This tool is a random generator of sequence databases. It can be used to generate synthetic sequence databases to compare the performance of data mining algorithms that take a sequence database as input.

Synthetic databases are often used in the data mining literature to evaluate algorithms. In particular, they are useful for comparing the scalability of algorithms. For example, one can generate sequence databases of various sizes and see how the algorithms react in terms of execution time and memory usage with respect to the database size.

What is the input?

The tool for generating a sequence database takes four parameters as input:

1) the number of sequences to be generated (an integer >= 1)

2) the maximum number of distinct items that the database should contain (an integer >= 1)

3) the number of items that each itemset should contain (an integer >= 1)

4) the number of itemsets that each sequence should contain (an integer >= 1)

What is the output?

The algorithm outputs a sequence database respecting these parameters. The database is generated by using a random number generator.
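A rough sketch of such a generator is shown below (the SPMF tool may differ in details such as the output format and how duplicate items are avoided; the names are hypothetical). It writes each sequence in the SPMF format, where -1 closes an itemset and -2 ends a sequence:

import java.util.Random;
import java.util.TreeSet;

public class SequenceGeneratorSketch {

    public static void main(String[] args) {
        int sequenceCount = 4, distinctItems = 10, itemsPerItemset = 3, itemsetsPerSequence = 2;
        Random random = new Random();

        for (int s = 0; s < sequenceCount; s++) {
            StringBuilder line = new StringBuilder();
            for (int i = 0; i < itemsetsPerSequence; i++) {
                // Draw distinct items for this itemset; a TreeSet keeps them sorted.
                // (itemsPerItemset must not exceed distinctItems, or this loop never ends.)
                TreeSet<Integer> itemset = new TreeSet<>();
                while (itemset.size() < itemsPerItemset) {
                    itemset.add(1 + random.nextInt(distinctItems));
                }
                for (int item : itemset) line.append(item).append(" -1 ");
            }
            line.append("-2");  // SPMF end-of-sequence marker
            System.out.println(line);
        }
    }
}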

Example 59 : Generating a Synthetic Sequence Database with Timestamps

How to run this example?

What is this tool?

This tool is a random generator of sequence databases with timestamps. It can be used to generate synthetic sequence databases with timestamps to compare the performance of data mining algorithms that take a sequence database with timestamps as input.

Synthetic databases are often used in the data mining literature to evaluate algorithms. In particular, they are useful for comparing the scalability of algorithms. For example, one can generate sequence databases of various sizes and see how the algorithms react in terms of execution time and memory usage with respect to the database size.

What is the input?

The tool for generating a sequence database with timestamps takes four parameters as input:

1) the number of sequences to be generated (an integer >= 1)

2) the maximum number of distinct items that the database should contain (an integer >= 1)

3) the number of items that each itemset should contain (an integer >= 1)

4) the number of itemsets that each sequence should contain (an integer >= 1)

What is the output?

The algorithm outputs a sequence database with timestamps respecting these parameters. The database is generated by using a random number generator.

Example 60 : Generating a Synthetic Transaction Database

How to run this example?

What is this tool?

This tool is a random generator of transaction databases. It can be used to generate synthetic transaction databases to compare the performance of data mining algorithms that take a transaction database as input.

Synthetic databases are often used in the data mining literature to evaluate algorithms. In particular, they are useful for comparing the scalability of algorithms. For example, one can generate transaction databases of various sizes and see how the algorithms react in terms of execution time and memory usage with respect to the database size.

What is the input?

The tool for generating a transaction database takes three parameters as input:

1) the number of transactions to be generated (an integer >= 1)

2) the maximum number of distinct items that the database should contain (an integer >= 1)

3) the number of items that each transaction should contain (an integer >= 1)

What is the output?

The algorithm outputs a transaction database respecting the parameters provided. A random number generator is used to generate the database.

Example 61 : Calculate Statistics for a Sequence Database

How to run this example?

What is this tool?

This tool generates statistics about a sequence database. It can be used to know, for example, if the database is dense or sparse before applying a data mining algorithm.

What is the input?

The input is a sequence database. A sequence database is a set of sequences. Each sequence is an ordered list of itemsets. An itemset is an unordered list of items (symbols). For example, consider the following database. This database is provided in the file "contextPrefixSpan.txt" of the SPMF distribution. It contains 4 sequences. The second sequence indicates that the itemset {1, 4} was followed by {3}, which was followed by {2, 3}, which was followed by {1, 5}. It is a sequence database (as defined in Pei et al., 2004).

ID Sequences
S1 (1), (1 2 3), (1 3), (4), (3 6)
S2 (1 4), (3), (2 3), (1 5)
S3 (5 6), (1 2), (4 6), (3), (2)
S4 (5), (7), (1 6), (3), (2), (3)

What is the output?

The output is statistics about the sequence database, such as the number of sequences, the number of distinct items, and the average number of itemsets per sequence.
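Statistics of this kind are straightforward to compute in a single pass over the database. Here is a minimal sketch (not the SPMF tool, whose output is more detailed) that derives a few of them from the database above:

import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class SequenceStatsSketch {

    public static void main(String[] args) {
        List<List<Set<Integer>>> db = List.of(
            List.of(Set.of(1), Set.of(1, 2, 3), Set.of(1, 3), Set.of(4), Set.of(3, 6)), // S1
            List.of(Set.of(1, 4), Set.of(3), Set.of(2, 3), Set.of(1, 5)),               // S2
            List.of(Set.of(5, 6), Set.of(1, 2), Set.of(4, 6), Set.of(3), Set.of(2)),    // S3
            List.of(Set.of(5), Set.of(7), Set.of(1, 6), Set.of(3), Set.of(2), Set.of(3))); // S4

        int itemsetCount = 0, itemCount = 0;
        Set<Integer> distinctItems = new TreeSet<>();
        for (List<Set<Integer>> sequence : db) {
            itemsetCount += sequence.size();
            for (Set<Integer> itemset : sequence) {
                itemCount += itemset.size();
                distinctItems.addAll(itemset);
            }
        }
        System.out.println("sequences: " + db.size());
        System.out.println("distinct items: " + distinctItems.size());
        System.out.println("average itemsets per sequence: " + (double) itemsetCount / db.size());
        System.out.println("average items per itemset: " + (double) itemCount / itemsetCount);
    }
}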

Example 62 : Using the ARFF format in the source code version of SPMF

The GUI interface and command line interface of SPMF can read the ARFF file format since version 0.93 of SPMF, and this is totally transparent to the user. But what if you want to use the ARFF format when running algorithms from the source code? This example explains how to do it, and it is quite simple.

But before presenting the example, let's explain a few things about how the ARFF support is implemented in SPMF:

Having said that, we will now explain how to use the ARFF format in the source code with an example. We will use the Apriori algorithm but the steps are the same for the other algorithms. We will first show how to run the Apriori algorithm if the input file is in SPMF format. Then, we will show how to run the Apriori algorithm if the input is in ARFF format to illustrate the differences.

If the input is in SPMF format

To run Apriori with a file "input.txt" in SPMF format with the parameter minsup = 0.4, the following code is used:

AlgoApriori apriori = new AlgoApriori();
apriori.runAlgorithm(0.4, "input.txt", "output.txt");

If the input is in ARFF format

Now let's say that the input file is in the ARFF format.

// We first need to convert the input file from ARFF to SPMF format. To do that, we create a transaction database converter. Then we call its method "convertARFFandReturnMap" to convert the input file to the SPMF format. It produces a converted input file named "input_converted.txt". Moreover, the conversion method returns a map containing mapping information between the data in ARFF format and the data in SPMF format.

TransactionDatabaseConverter converter = new TransactionDatabaseConverter();
Map<Integer, String> mapping = converter.convertARFFandReturnMap("input.arff", "input_converted.txt", Formats.ARFF, Integer.MAX_VALUE);

// Then we run the algorithm with the converted file "input_converted.txt". This creates a file "output.txt" containing the result.

AlgoApriori apriori = new AlgoApriori();
apriori.runAlgorithm(0.4, "input_converted.txt", "output.txt");

// Finally, we need to use the mapping to convert the output file so that the result is shown using the names that are found in the ARFF file rather than the integer-based representation used internally by the Apriori algorithm. This is very simple and performed as follows. The result is a file named "final_output.txt".

ResultConverter resultConverter = new ResultConverter();
resultConverter.convert(mapping, "output.txt", "final_output.txt");

What is the cost of using the ARFF format in terms of performance? The only additional cost when using ARFF is the cost of converting the input and output files, which is generally much smaller than the cost of performing the data mining itself. In the future, we plan to add support for SQL databases, Excel files and other formats by using a similar conversion mechanism that does not affect the performance of the mining phase. We also plan to add support for the visualization of patterns.