This section provides examples of how to use SPMF to perform various data mining tasks.
If you have any question or if you want to report a bug, you can check the FAQ, post in the forum or contact me. You can also have a look at the various articles that I have referenced on the algorithms page of this website to learn more about each algorithm.
Itemset Mining (Frequent Itemsets, Rare Itemsets, etc.)
Association Rule Mining
Clustering
Sequential Pattern Mining
Multi-dimensional Sequential Pattern Mining
Sequential Pattern Mining with the Fournier-Viger et al. (2008) algorithm
Sequential Rule Mining
Classification
Tools
How to run this example?
What is Apriori?
It is an algorithm for discovering frequent itemsets in transaction databases proposed by Agrawal & Srikant (1993).
What is the input of the Apriori algorithm?
The input is a transaction database (a.k.a. binary context) and a threshold named minsup.
A transaction database is a set of transactions. For example, consider the following transaction database (transaction database) from the article Pasquier et al., 1999. It contains 5 transactions (t1, t2, ..., t5) and 5 items (1,2,...5). Each transaction is an unordered set of items. For example, the first transaction represents the set of items 1, 3 and 4. This database is provided as the file contextPasquier99.txt in the SPMF distribution.
1 | 2 | 3 | 4 | 5 | |
t1 | x | x | x | ||
t2 | x | x | x | ||
t3 | x | x | x | x | |
t4 | x | x | |||
t5 | x | x | x | x |
What is the output of the Apriori algorithm?
Apriori is an algorithm for discovering itemsets (group of items) that appear frequently in a set of transactions.
The output of the Apriori algorithm is the set of all itemsets (groups of items) appearing in at least minsup transactions from the transaction database. Those itemsets are called "frequent itemsets".
For example, if Apriori is run on the previous transaction database with a minsup of 40 % (2 transactions), Apriori produces the the following result:
itemsets | support |
{1} | 3 |
{2} | 4 |
{3} | 4 |
{5} | 4 |
{1, 2} | 2 |
{1, 3} | 3 |
{1, 5} | 2 |
{2, 3} | 3 |
{2, 5} | 4 |
{3, 5} | 3 |
{1, 2, 3} | 2 |
{1, 2, 5} | 3 |
{1, 3, 5} | 2 |
{2, 3, 5} | 3 |
{1, 2, 3, 5} | 2 |
How should I interpret the results?
In the results, each itemset is annotated with its support. The support of an itemset is how many times the itemset appears in the transaction database. For example, the itemset {2, 3 5} has a support of 3 because it appears in transactions t2, t3 and t5.
Performance
The Apriori algorithm is an important algorithm for historical reasons and also because it is a simple algorithm that is easy to learn. However, faster and more memory efficient algorithms have been proposed. I recommend to use FPGrowth instead of Apriori. You can see a performance comparison of Apriori, FPGrowth, and other frequent itemset mining algorithms by clicking on the "performance" section of this website.
Implementation details
In SPMF, there is also an implementation of Apriori that uses a hash-tree as an internal structure to store candidates. It is named "Apriori_with_hash_tree" in the release version of SPMF. For the source code version, it can be run by executing the test file MainTestAprioriHT_saveToFile.java. This version of Apriori can be up to twice faster than the regular version in some cases but it uses more memory. This version of Apriori has two parameters: (1) minsup and (2) the number of child nodes that each node in the hash-tree should have. For the second parameter, we suggest to use the value 30.
Where can I get more information about the Apriori algorithm?
Here is the PDF of the technical report published in 1994 describing Apriori.
R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. Research Report RJ 9839, IBM Almaden Research Center, San Jose, California, June 1994.
You can also read chapter 6 of the book "introduction to data mining" which provide a nice and easy to understand introduction to Apriori.
How to run this example?
What is AprioriTID?
It is an algorithm for discovering frequent itemsets in transaction databases proposed by Agrawal & Srikant (1993)
It is a variation of the Apriori algorithm. It was proposed in the same article as Apriori as an alternative implementation of Apriori. It produces the same output as Apriori. But it uses a different mechanism for counting the support of itemsets.
What is the input of the AprioriTID algorithm?
The input is a transaction database (a.k.a. binary context) and a threshold named minsup.
A transaction database is a set of transactions. For example, consider the following transaction database (transaction database) from the article Pasquier et al., 1999. It contains 5 transactions (t1, t2, ..., t5) and 5 items (1,2,...5). Each transaction is an unordered set of items. For example, the first transaction represents the set of items 1, 3 and 4. This database is provided as the file contextPasquier99.txt in the SPMF distribution.
1 | 2 | 3 | 4 | 5 | |
t1 | x | x | x | ||
t2 | x | x | x | ||
t3 | x | x | x | x | |
t4 | x | x | |||
t5 | x | x | x | x |
What is the output of the AprioriTID algorithm?
AprioriTID is an algorithm for discovering itemsets (group of items) that appear frequently in a set of transactions.
The output of the AprioriTID algorithm is the set of all itemsets (groups of items) appearing in at least minsup transactions from the transaction database. Those itemsets are called "frequent itemsets".
For example, if AprioriTID is run on the previous transaction database with a minsup of 40 % (2 transactions), AprioriTID produces the the following result:
itemsets | support |
{1} | 3 |
{2} | 4 |
{3} | 4 |
{5} | 4 |
{1, 2} | 2 |
{1, 3} | 3 |
{1, 5} | 2 |
{2, 3} | 3 |
{2, 5} | 4 |
{3, 5} | 3 |
{1, 2, 3} | 2 |
{1, 2, 5} | 3 |
{1, 3, 5} | 2 |
{2, 3, 5} | 3 |
{1, 2, 3, 5} | 2 |
How should I interpret the results?
In the results, each itemset is annotated with its support. The support of an itemset is how many times the itemset appears in the transaction database. For example, the itemset {2, 3 5} has a support of 3 because it appears in transactions t2, t3 and t5.
Performance
The Apriori and AprioriTID algorithms are important algorithms for historical reasons and also because they are simple algorithms that are easy to learn. However, faster and more memory efficient algorithms have been proposed. I recommend to use FPGrowth instead of AprioriTID or Apriori. You can see a performance comparison of Apriori, AprioriTID, FPGrowth, and other frequent itemset mining algorithms by clicking on the "performance" section of this website.
Implementation details
I have included three versions of AprioriTID in the source code version of SPMF for learning purposes. The first one keeps the frequent itemsets into memory and print the results to the console (MainTestAprioriTID.java). The second one is a version that does not keep the results into memory, but save them to a file (MainTestAprioriTID_saveToFile.java). The third one (MainTestAprioriTID_bitset_saveToFile.java) is like the second one except that it uses bit vectors to represent tids sets. The third one is much faster than the two other versions and use less memory.
Where can I get more information about the Apriori and AprioriTID algorithm?
Here is the PDF of the technical report published in 1994 describing Apriori and AprioriTID.R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. Research Report RJ 9839, IBM Almaden Research Center, San Jose, California, June 1994.
You can also read chapter 6 of the book "introduction to data mining" which provide a nice and easy to understand introduction to Apriori.
How to run this example?
What is FPGrowth?
It is an algorithm for discovering frequent itemsets in transaction databases, proposed by Han et al. (2000)
It is a very fast an memory efficient algorithm, which uses a special internal structure called an FP-Tree.
What is the input of the FPGrowth algorithm?
The input is a transaction database (a.k.a. binary context) and a threshold named minsup.
A transaction database is a set of transactions. For example, consider the following transaction database (transaction database) from the article Pasquier et al., 1999. It contains 5 transactions (t1, t2, ..., t5) and 5 items (1,2,...5). Each transaction is an unordered set of items. For example, the first transaction represents the set of items 1, 3 and 4. This database is provided as the file contextPasquier99.txt in the SPMF distribution.
1 | 2 | 3 | 4 | 5 | |
t1 | x | x | x | ||
t2 | x | x | x | ||
t3 | x | x | x | x | |
t4 | x | x | |||
t5 | x | x | x | x |
What is the output of the FPGrowth algorithm?
FPGrowth is an algorithm for discovering itemsets (group of items) that appear frequently in a set of transactions.
The output of the FPGrowth algorithm is the set of all itemsets (groups of items) appearing in at least minsup transactions from the transaction database. Those itemsets are called "frequent itemsets".
For example, if FPGrowth is run on the previous transaction database with a minsup of 40 % (2 transactions), FPGrowth produces the the following result:
itemsets | support |
{1} | 3 |
{2} | 4 |
{3} | 4 |
{5} | 4 |
{1, 2} | 2 |
{1, 3} | 3 |
{1, 5} | 2 |
{2, 3} | 3 |
{2, 5} | 4 |
{3, 5} | 3 |
{1, 2, 3} | 2 |
{1, 2, 5} | 3 |
{1, 3, 5} | 2 |
{2, 3, 5} | 3 |
{1, 2, 3, 5} | 2 |
How should I interpret the results?
In the results, each itemset is annotated with its support. The support of an itemset is how many times the itemset appears in the transaction database. For example, the itemset {2, 3 5} has a support of 3 because it appears in transactions t2, t3 and t5.
Performance
There exists several algorithms for mining frequent itemsets. In SPMF, you can try for example Apriori, AprioriTID, Eclat, HMine, Relim and more. Among all these algorithms, FPGrowth is generally the best algorithms. You can see a performance comparison by clicking on the "performance" section of this website.
Where can I get more information about the FPGrowth algorithm?
This is the journal article describing FPGrowth:Jiawei Han, Jian Pei, Yiwen Yin, Runying Mao: Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach. Data Min. Knowl. Discov. 8(1): 53-87 (2004)
You can also read chapter 6 of the book "introduction to data mining" which provide an easy to understand introduction to FPGrowth (but does not give all the details).
How to run this example?
What is Relim?
It is an algorithm for discovering frequent itemsets in transaction databases, proposed by Borgelt (2005)
It is not a very efficient algorithm. It is included in SPMF for comparison purposes.
What is the input of the Relim algorithm?
The input is a transaction database (a.k.a. binary context) and a threshold named minsup.
A transaction database is a set of transactions. For example, consider the following transaction database (transaction database) from the article Pasquier et al., 1999. It contains 5 transactions (t1, t2, ..., t5) and 5 items (1,2,...5). Each transaction is an unordered set of items. For example, the first transaction represents the set of items 1, 3 and 4. This database is provided as the file contextPasquier99.txt in the SPMF distribution.
1 | 2 | 3 | 4 | 5 | |
t1 | x | x | x | ||
t2 | x | x | x | ||
t3 | x | x | x | x | |
t4 | x | x | |||
t5 | x | x | x | x |
What is the output of the Relim algorithm?
Relim is an algorithm for discovering itemsets (group of items) that appear frequently in a set of transactions.
The output of the Relim algorithm is the set of all itemsets (groups of items) appearing in at least minsup transactions from the transaction database. Those itemsets are called "frequent itemsets".
For example, if Relim is run on the previous transaction database with a minsup of 40 % (2 transactions), Relim produces the the following result:
itemsets | support |
{1} | 3 |
{2} | 4 |
{3} | 4 |
{5} | 4 |
{1, 2} | 2 |
{1, 3} | 3 |
{1, 5} | 2 |
{2, 3} | 3 |
{2, 5} | 4 |
{3, 5} | 3 |
{1, 2, 3} | 2 |
{1, 2, 5} | 3 |
{1, 3, 5} | 2 |
{2, 3, 5} | 3 |
{1, 2, 3, 5} | 2 |
How should I interpret the results?
In the results, each itemset is annotated with its support. The support of an itemset is how many times the itemset appears in the transaction database. For example, the itemset {2, 3 5} has a support of 3 because it appears in transactions t2, t3 and t5.
Performance
There exists several algorithms for mining frequent itemsets. Relim is not very efficient. It is included for comparison purposes. I would recommend to use FPGrowth for better performance. You can see a performance comparison by clicking on the "performance" section of this website.
Where can I get more information about the FPGrowth algorithm?
This is the conference article describing Relim:
Keeping Things Simple: Finding Frequent Item Sets by Recursive Elimination Christian Borgelt. Workshop Open Source Data Mining Software (OSDM'05, Chicago, IL), 66-70. ACM Press, New York, NY, USA 2005
Note that the author of Relim and collaborators have proposed extensions and additional optimizations of Relim that I have not implemented.
How to run this example?
What is Eclat ?
Eclat is an algorithm for discovering frequent itemsets in transaction databases, proposed by Zaki (2001). Contrarily to previous algorithms such as Apriori, Eclat uses a depth-first search for discovering frequent itemsets.
What is the input of the Eclat algorithm?
The input is a transaction database (a.k.a. binary context) and a threshold named minsup.
A transaction database is a set of transactions. For example, consider the following transaction database (transaction database) from the article Pasquier et al., 1999. It contains 5 transactions (t1, t2, ..., t5) and 5 items (1,2,...5). Each transaction is an unordered set of items. For example, the first transaction represents the set of items 1, 3 and 4. This database is provided as the file contextPasquier99.txt in the SPMF distribution.
1 | 2 | 3 | 4 | 5 | |
t1 | x | x | x | ||
t2 | x | x | x | ||
t3 | x | x | x | x | |
t4 | x | x | |||
t5 | x | x | x | x |
What is the output of the Eclat algorithm?
Eclat is an algorithm for discovering itemsets (group of items) that appear frequently in a set of transactions.
The output of the Eclat algorithm is the set of all itemsets (groups of items) appearing in at least minsup transactions from the transaction database. Those itemsets are called "frequent itemsets".
For example, if Eclat is run on the previous transaction database with a minsup of 40 % (2 transactions), Eclat produces the the following result:
itemsets | support |
{1} | 3 |
{2} | 4 |
{3} | 4 |
{5} | 4 |
{1, 2} | 2 |
{1, 3} | 3 |
{1, 5} | 2 |
{2, 3} | 3 |
{2, 5} | 4 |
{3, 5} | 3 |
{1, 2, 3} | 2 |
{1, 2, 5} | 3 |
{1, 3, 5} | 2 |
{2, 3, 5} | 3 |
{1, 2, 3, 5} | 2 |
How should I interpret the results?
In the results, each itemset is annotated with its support. The support of an itemset is how many times the itemset appears in the transaction database. For example, the itemset {2, 3 5} has a support of 3 because it appears in transactions t2, t3 and t5.
Performance
There exists several algorithms for mining frequent itemsets. Eclat is one of the best. But generally, FPGrowth is a better algorithm. You can see a performance comparison by clicking on the "performance" section of this website.
Nevertheless, the Eclat algorithm is interesting because it uses a depth-first search. For some extensions of the problem of itemset mining such as mining high utility itemsets (see the HUI-Miner algorithm), the search procedure of Eclat works very well.
Implementation details
In the source code version of SPMF, there are two version of ECLAT. The second version of ECLAT is available in MainTestEclat_bitset_saveToFile.java and it is generally faster than the basic version. It uses bit vectors for representing tids sets. In the GUI version of SPMF, only the bitset version of Eclat is available.
Where can I get more information about the Eclat algorithm?
Here is an article describing the Eclat algorithm:
Mohammed Javeed Zaki: Scalable Algorithms for Association Mining. IEEE Trans. Knowl. Data Eng. 12(3): 372-390 (2000)
How to run this example?
What is H-Mine ?
H-Mine is an algorithm for discovering frequent itemsets in transaction databases, proposed by Pei et al. (2001). Contrarily to previous algorithms such as Apriori, H-Mine uses a pattern-growth approach to discover frequent itemsets.
What is the input of the H-Mine algorithm?
The input is a transaction database (a.k.a. binary context) and a threshold named minsup.
A transaction database is a set of transactions. For example, consider the following transaction database (transaction database) from the article Pasquier et al., 1999. It contains 5 transactions (t1, t2, ..., t5) and 5 items (1,2,...5). Each transaction is an unordered set of items. For example, the first transaction represents the set of items 1, 3 and 4. This database is provided as the file contextPasquier99.txt in the SPMF distribution.
1 | 2 | 3 | 4 | 5 | |
t1 | x | x | x | ||
t2 | x | x | x | ||
t3 | x | x | x | x | |
t4 | x | x | |||
t5 | x | x | x | x |
What is the output of the H-Mine algorithm?
H-Mine is an algorithm for discovering itemsets (group of items) that appear frequently in a set of transactions.The output of the H-Mine algorithm is the set of all itemsets (groups of items) appearing in at least minsup transactions from the transaction database. Those itemsets are called "frequent itemsets".
For example, if H-Mine is run on the previous transaction database with a minsup of 40 % (2 transactions), H-Mine produces the the following result:
itemsets | support |
{1} | 3 |
{2} | 4 |
{3} | 4 |
{5} | 4 |
{1, 2} | 2 |
{1, 3} | 3 |
{1, 5} | 2 |
{2, 3} | 3 |
{2, 5} | 4 |
{3, 5} | 3 |
{1, 2, 3} | 2 |
{1, 2, 5} | 3 |
{1, 3, 5} | 2 |
{2, 3, 5} | 3 |
{1, 2, 3, 5} | 2 |
How should I interpret the results?
In the results, each itemset is annotated with its support. The support of an itemset is how many times the itemset appears in the transaction database. For example, the itemset {2, 3 5} has a support of 3 because it appears in transactions t2, t3 and t5.
Performance
There exists several algorithms for mining frequent itemsets. H-Mine is claimed to be one of the best by their author. However, in my implementation, it does not perform very well. I would recommend to use FPGrowth, instead. You can see a performance comparison by clicking on the "performance" section of this website.
Where can I get more information about the H-Mine algorithm?
Here is an article describing the H-Mine algorithm:
J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, and D. Yang. "H-Mine: Fast and space-preserving frequent pattern mining in large databases". IIE Transactions, Volume 39, Issue 6, pages 593-605, June 2007, Taylor & Francis.
How to run this example?
What is AprioClose?
AprioriClose (a.k.a Close) is an algorithm for discovering frequent closed itemsets in transaction databases, proposed by Pasquier et al. (1999).
What is the input of the AprioriClose algorithm?
The input is a transaction database (a.k.a. binary context) and a threshold named minsup.
A transaction database is a set of transactions. For example, consider the following transaction database (transaction database) from the article Pasquier et al., 1999. It contains 5 transactions (t1, t2, ..., t5) and 5 items (1,2,...5). Each transaction is an unordered set of items. For example, the first transaction represents the set of items 1, 3 and 4. This database is provided as the file contextPasquier99.txt in the SPMF distribution.
1 | 2 | 3 | 4 | 5 | |
t1 | x | x | x | ||
t2 | x | x | x | ||
t3 | x | x | x | x | |
t4 | x | x | |||
t5 | x | x | x | x |
What is the output of the AprioriClose algorithm?
AprioriClose outputs frequent closed itemsets.
To explain what is a frequent closed itemset, we need to review some definitions.
An itemset is a set of items. The support of an itemset is the number of transactions that contain the itemset. For example, consider the itemset {1, 3}. It has a support of 2 because it appears in two transactions from the transaction database.
A frequent itemset is an itemset that appears in at least minsup transactions from the transaction database. A frequent closed itemset is a frequent itemset that is not included in another itemset having exactly the same support. The set of frequent closed itemsets is thus a subset of the set of frequent itemsets. Why it is interesting to discover frequent closed itemsets ? The reason is that the set of frequent closed itemsets is usually much smaller than the set of frequent itemsets and it can be shown that no information is lost (all the frequent itemsets can be regenerated from the set of frequent closed itemsets - see Pasquier(1999) for more details).
If we apply AprioriClose on the transaction database with a minsup of 40 % (2 transactions), we get the following result:
frequent closed itemsets | support |
{3} | 4 |
{1, 3} | 3 |
{2, 5} | 4 |
{2, 3, 5} | 3 |
{1, 2, 3, 5} | 2 |
If you compare this result with the output from Apriori, you would notice that only 5 itemsets are found instead of about 15.
How should I interpret the results?
In the results, each frequent closed itemset is annotated with its support. For example, the itemset {2, 3 5} has a support of 3 because it appears in transactions t2, t3 and t5.
Performance
The AprioriClose algorithm is important for historical reasons because it is the first algorithm for mining frequent closed itemsets. However, there exists several other algorithms for mining frequent closed itemsets. In SPMF, I would recommend to use DCI_Closed, which is more efficient.
Implementation details
In the source code version of SPMF, there are two versions of AprioriClose. The second version is based on AprioriTID instead of Apriori (it uses tidsets to calculate support to reduce the number of database scans). This second version can be tested by running the example named "MainTestAprioriTIDClose.java" in the SPMF distribution.
Where can I get more information about the AprioriClose algorithm?
Here is an article describing the AprioriClose algorithm:
Nicolas Pasquier, Yves Bastide, Rafik Taouil, Lotfi Lakhal: Discovering Frequent Closed Itemsets for Association Rules. ICDT 1999: 398-416
How to run this example?
What is DCI_Closed?
DCI_Closed is an algorithm for discovering frequent closed itemsets in transaction databases, proposed by Lucchese et al. (2004).
What is the input of the DCI_Closed algorithm?
The input is a transaction database (a.k.a. binary context) and a threshold named minsup.
A transaction database is a set of transactions. For example, consider the following transaction database (transaction database) from the article Pasquier et al., 1999. It contains 5 transactions (t1, t2, ..., t5) and 5 items (1,2,...5). Each transaction is an unordered set of items. For example, the first transaction represents the set of items 1, 3 and 4. This database is provided as the file contextPasquier99.txt in the SPMF distribution.
1 | 2 | 3 | 4 | 5 | |
t1 | x | x | x | ||
t2 | x | x | x | ||
t3 | x | x | x | x | |
t4 | x | x | |||
t5 | x | x | x | x |
What is the output of the DCI_Closed algorithm?
DCI_Closed outputs frequent closed itemsets.
To explain what is a frequent closed itemset, we need to review some definitions.
An itemset is a set of items. The support of an itemset is the number of transactions that contain the itemset. For example, consider the itemset {1, 3}. It has a support of 2 because it appears in two transactions from the transaction database.
A frequent itemset is an itemset that appears in at least minsup transactions from the transaction database. A frequent closed itemset is a frequent itemset that is not included in another itemset having exactly the same support. The set of frequent closed itemsets is thus a subset of the set of frequent itemsets. Why it is interesting to discover frequent closed itemsets ? The reason is that the set of frequent closed itemsets is usually much smaller than the set of frequent itemsets and it can be shown that no information is lost (all the frequent itemsets can be regenerated from the set of frequent closed itemsets - see Lucchese (2004) for more details).
If we apply DCI_Closed on the transaction database with a minsup of 40 % (2 transactions), we get the following result:
frequent closed itemsets | support |
{3} | 4 |
{1, 3} | 3 |
{2, 5} | 4 |
{2, 3, 5} | 3 |
{1, 2, 3, 5} | 2 |
If you compare this result with the output from a frequent itemset mining algorithm like Apriori, you would notice that only 5 closed itemsets are found by AprioriClose instead of about 15 itemsets by Apriori.
How should I interpret the results?
In the results, each frequent closed itemset is annotated with its support. For example, the itemset {2, 3 5} has a support of 3 because it appears in transactions t2, t3 and t5.
Performance
The DCI_Closed algorithm is one of the faster algorithms for frequent closed itemset mining. The version in SPMF is optimized and very efficient. SPMF also offers other algorithm for frequent closed itemset mining such as Charm and AprioriClose. But they are less efficient.
Implementation details
In the source code version of SPMF, there are two versions of DCI_Closed. The first one uses HashSet to store the transaction ids. The second one is an optimized version that uses a bit matrix to store transactions ids, and also includes other optimizations. The first version can be tested by running MainTestDCI_Closed.java and the second version by running MainTestDCI_Closed_Optimized.java. In the release version of SPMF, only the optimized version of DCI_Closed is available.
Where can I get more information about the DCI_Closed algorithm?
Here is an article describing the DCI_Closed algorithm:
Claudio Lucchese, Salvatore Orlando, Raffaele Perego: DCI Closed: A Fast and Memory Efficient Algorithm to Mine Frequent Closed Itemsets. FIMI 2004
How to run this example?
What is Charm?
Charm is an algorithm for discovering frequent closed itemsets in transaction databases, proposed by Zaki (2002).
What is the input of the Charm algorithm?
The input is a transaction database (a.k.a. binary context) and a threshold named minsup.
A transaction database is a set of transactions. For example, consider the following transaction database (transaction database) from the article Pasquier et al., 1999. It contains 5 transactions (t1, t2, ..., t5) and 5 items (1,2,...5). Each transaction is an unordered set of items. For example, the first transaction represents the set of items 1, 3 and 4. This database is provided as the file contextPasquier99.txt in the SPMF distribution.
1 | 2 | 3 | 4 | 5 | |
t1 | x | x | x | ||
t2 | x | x | x | ||
t3 | x | x | x | x | |
t4 | x | x | |||
t5 | x | x | x | x |
What is the output of the Charm algorithm?
Charm outputs frequent closed itemsets.
To explain what is a frequent closed itemset, we need to review some definitions.
An itemset is a set of items. The support of an itemset is the number of transactions that contain the itemset. For example, consider the itemset {1, 3}. It has a support of 2 because it appears in two transactions from the transaction database.
A frequent itemset is an itemset that appears in at least minsup transactions from the transaction database. A frequent closed itemset is a frequent itemset that is not included in another itemset having exactly the same support. The set of frequent closed itemsets is thus a subset of the set of frequent itemsets. Why it is interesting to discover frequent closed itemsets ? The reason is that the set of frequent closed itemsets is usually much smaller than the set of frequent itemsets and it can be shown that no information is lost (all the frequent itemsets can be regenerated from the set of frequent closed itemsets - see Zaki(2002) for more details).
If we apply Charm on the transaction database with a minsup of 40 % (2 transactions), we get the following result:
frequent closed itemsets | support |
{3} | 4 |
{1, 3} | 3 |
{2, 5} | 4 |
{2, 3, 5} | 3 |
{1, 2, 3, 5} | 2 |
If you compare this result with the output from a frequent itemset mining algorithm like Apriori, you would notice that only 5 closed itemsets are found by Charm instead of about 15 itemsets by Apriori.
How should I interpret the results?
In the results, each frequent closed itemset is annotated with its support. For example, the itemset {2, 3 5} has a support of 3 because it appears in transactions t2, t3 and t5.
Performance
The Charm algorithm is an important algorithm because it is one of the first depth-first algorithm for mining frequent closed itemsets. However, I would recommend to use DCI_Closed, which is probably more efficient than Charm in most cases.
Implementation details
In the source code version of SPMF, there are two versions of Charm. The second version of Charm is available in MainTestCharm_bitset_saveToFile.java and it is much faster. It uses bit vectors for representing tids sets. In the release version of SPMF, only the second version is available from the graphical user interface.
Where can I get more information about the Charm algorithm?
Here is an article describing the Charm algorithm:
Mohammed Javeed Zaki, Ching-Jiu Hsiao: CHARM: An Efficient Algorithm for Closed Itemset Mining. SDM 2002.
How to run this example?
This example is not available in the release version of SPMF.
To run this example with the source code version of SPMF, launch the file "MainTestCharmMFI.java" in the package ca.pfv.SPMF.tests.
What is Charm-MFI?
Charm-MFI is an algorithm for discovering frequent maximal itemsets in a transaction database.
Charm-MFI is not efficient because it discovers maximal itemsets by performing post-processing after discovering closed itemsets by using Charm (hence the name: Charm-MFI). Moreover, note that the original Charm-MFI algorithm is not correct. I had to fix it.
What is the input of the Charm-MFI algorithm?
The input is a transaction database (a.k.a. binary context) and a threshold named minsup.
A transaction database is a set of transactions. For example, consider the following transaction database (transaction database) from the article Pasquier et al., 1999. It contains 5 transactions (t1, t2, ..., t5) and 5 items (1,2,...5). Each transaction is an unordered set of items. For example, the first transaction represents the set of items 1, 3 and 4. This database is provided as the file contextPasquier99.txt in the SPMF distribution.
1 | 2 | 3 | 4 | 5 | |
t1 | x | x | x | ||
t2 | x | x | x | ||
t3 | x | x | x | x | |
t4 | x | x | |||
t5 | x | x | x | x |
What is the output of the Charm-MFI algorithm?
Charm-MFI outputs frequent maximal itemsets.
To explain what is a frequent maximal itemset, we need to review some definitions.
An itemset is a set of items. The support of an itemset is the number of transactions that contain the itemset. For example, consider the itemset {1, 3}. It has a support of 2 because it appears in two transactions from the transaction database.
A frequent itemset is an itemset that appears in at least minsup transactions from the transaction database. A frequent closed itemset is a frequent itemset that is not included in another frequent itemset having the same support. A frequent maximal itemset is a frequent itemset that is not included in another frequent itemset. The set of frequent maximal itemsets is thus a subset of the set of frequent itemsets. Why it is interesting to discover frequent maximal itemsets ? The reason is that the set of frequent maximal itemsets is usually much smaller than the set of frequent itemsets and also smaller than the set of frequent closed itemsets. However, unlike frequent closed itemsets, frequent maximal itemsets are not a lossless representation of the set of frequent itemsets (we can regenerate all frequent itemsets from the set of frequent maximal itemsets but we would not be able to get their support).
If we apply Charm-MFI on the transaction database with a minsup of 40 % (2 transactions), we get the following result:
frequent maximal itemsets | support |
{1, 2, 3, 5} | 2 |
This itemset is the only maximal itemsets itemsets and it has a support of 2 because it appears in two transactions.
How should I interpret the results?
In the results, each frequent maximum itemset is annotated with its support. For example, the itemset {2, 3 5} is a maximal itemset having a support of 3 because it appears in transactions t2, t3 and t5. The itemset {2, 5} has a support of 4 and is not a maximal itemset because it is included in {2, 3, 5}
Performance
The Charm-MFI algorithm is not a very efficient algorithm because it finds frequent maximal itemsets by post-processing instead of finding them directly. However, it is the only algorithm that is included in SPMF for this task. Eventually, other algorithms will be added to SPMF.
Where can I get more information about the Charm-MFI algorithm?
The Charm-MFI algorithm is described in this thesis (in French language only):
L. Szathmary (2006). Symbolic Data Mining Methods with the Coron Platform.
How to run this example?
What is Zart?
Zart is an algorithm for discovering frequent closed itemsets and their corresponding generators in a transaction database.
Zart is an Apriori-based algorithm. Why is it useful to discover closed itemsets and their generators at the same time? One reason is that this information is necessary to generate some special kind of association rules such as the IGB basis of association rules (see the example for IGB).
What is the input of the Zart algorithm?
The input is a transaction database (a.k.a. binary context) and a threshold named minsup.
A transaction database is a set of transactions. Let's consider the following dataset from the article Szathmary et al. 2007. This transaction database contains 5 transactions (t1, t2, ..., t5) and 5 items (1,2,...5). This transaction database is provided in the file "contextZart.txt" of the SPMF distribution. Each transaction is an unordered set of items. For example, the first transaction represents the set of items 1, 2, 4 and 5.
1 | 2 | 3 | 4 | 5 | |
t1 | x | x | x | x | |
t2 | x | x | |||
t3 | x | x | x | x | |
t4 | x | x | x | ||
t5 | x | x | x | x |
What is the output of the Zart algorithm?
The output of the Zart algorithm for a transaction database and a minimum support thresold minsup is the set of all frequent closed itemsets and their support, and the associated generator(s) for each closed frequent itemset.
To explain what is a frequent closed itemset and a generator, we need to review some definitions.
An itemset is a set of items. The support of an itemset is the number of transactions that contain the itemset. For example, consider the itemset {1, 3}. It has a support of 3 because it appears in three transactions from the database (t2, t3 and t5).
A frequent itemset is an itemset that appears in at least minsup transactions from the transaction database. A frequent closed itemset is a frequent itemset that is not included in another frequent itemset having the same support. A generator Y of a closed itemset X is an itemset such that (1) it has the same support as X and (2) it does not have any subset having the same support.
By running Zart with the previous transaction database and a minsup of 40% (2 transactions), we obtain the following result:
itemsets | support | is closed? | minimal generators |
{} | 5 | yes | {} |
{1} | 4 | yes | {1} |
{2} | 4 | no | |
{3} | 4 | yes | {3} |
{5} | 4 | no | |
{1, 2} | 3 | no | |
{1, 3} | 3 | yes | {1,3} |
{1, 5} | 3 | no | |
{2, 3} | 3 | no | |
{2, 5} | 4 | yes | {2}, {5} |
{3, 5} | 3 | no | |
{1, 2, 3} | 2 | no | |
{1, 2, 5} | 3 | yes | {1, 2}, {1, 5} |
{1, 3, 5} | 2 | no | |
{2, 3, 5} | 3 | yes | {2, 3}, {3, 5} |
{1, 2, 3, 5} | 2 | yes | {1, 2, 3}, {1, 3, 5} |
How should I interpret the results?
In the results, each frequent itemset is shown. Each frequent itemset that is a closed itemset is marked as such. For each closed itemset, its support is indicated and its list of generators. For example, the itemset {1,2,3,5} has a support of 2 because it appers in 2 transactions (t3 and t5). It is closed because there is no superset having the same support. Moreover is has two generators: {1, 2, 3} and {1, 3, 5}. By definition, these generators have the same support as {1, 2, 3, 5}.
Another example. The itemset {1, 3, 5} is not closed and it has a support of 2. It is not closed because it is included in {1, 2, 3, 5} that has the same support.
Implementation details
In the source code version of SPMF, there is a version of Zart that keep the results in memory ("MainTestZart.java") and a version that saves to a file ("MainTestZart_saveToFile.java"). In the release version of SPMF, the GUI only offer the version that saves to a file.
Performance
The Zart algorithm is not a very efficient algorithm because it is based on Apriori. If someone only want to discover closed itemsets and do not need the information about generators, then he should instead use DCI_Closed, which is very efficient for closed itemset mining. However, in some cases it is desirable to discover closed itemset and their corresponding generators (for example to generate IGB association rules). For these cases, Zart is an appropriate algorithm.
Where can I get more information about the Zart algorithm?
The Zart algorithm is described in this paper:
L. Szathmary, A. Napoli, S. O. Kuznetsov. ZART: A Multifunctional Itemset Mining Algorithm. Laszlo Szathmary, Amedeo Napoli, Sergei O. Kuznetsov In: CLA, 2007.
How to run this example?
What is this algorithm?
This algorithm generated pseudo-closed itemsets from a transaction database. Discovering pseudo-closed itemsets is useful for generating some special kind of association rules such as the Guigues-Duquenne Basis of Exact Rules (see the example on this topic for more details).
What is the input?
The input is a transaction database (a.k.a. binary context) and a threshold named minsup.
A transaction database is a set of transactions. For example, consider the following transaction database (transaction database) from the article Pasquier et al., 1999. It contains 5 transactions (t1, t2, ..., t5) and 5 items (1,2,...5). Each transaction is an unordered set of items. For example, the first transaction represents the set of items 1, 3 and 4.
1 2 3 4 5 t1 x x x t2 x x x t3 x x x x t4 x x t5 x x x x
What is the output?
The output is the set of all pseudo-closed itemsets.
An itemset is a group of items. The support of an itemset is the number of transactions that contains the items divided by the total number of transactions. For example, the itemset {2, 3 5} has a support of 60 % because it appears in three out of five transactions. A closed itemset is an itemset such that it is included in no itemset having the same support. An itemset Y is the closure of an itemset X if Y is a closed itemset, X is a subset of Y and X and Y have the same support. A pseudo-closed itemset is an itemset such that it is not closed and it contains the closure of all its subsets that are pseudo-closed itemsets.
For example, if mine all pseudo-closed itemsets from this database with minsup = 20%, we obtain the following set of pseudo-closed itemsets:
Pseudo-closed itemsets | Support | Closure |
{2} | 80 % | 2 5 |
{1} | 60 % | 1 3 |
{4} | 20 % | 1 3 4 |
{5} | 80 % | 2 5 |
This table indicates for example that there is a pseudo closed itemset {2} that appears in 80% of the transactions of this database and its closure is the closed itemset {2, 5}.
Where can I get more information about this algorithm?
This paper explain this algorithm and the theory on which is it based.
N. Pasquier, Y. Bastide, R. Taouil, L. Lakhal: Discovering Frequent Closed Itemsets for Association Rules. ICDT 1999: 398-416
How to run this example?
What is AprioriRare?
AprioriRare is an algorithm for mining minimal rare itemsets from a transaction database. It is an Apriori-based algorithm.
What is the input ?
The input is a transaction database (a.k.a. binary context) and a threshold named minsup.
A transaction database is a set of transactions. Let's consider the following dataset from the article Szathmary et al. 2007. This transaction database contains 5 transactions (t1, t2, ..., t5) and 5 items (1,2,...5). This transaction database is provided in the file "contextZart.txt" of the SPMF distribution. Each transaction is an unordered set of items. For example, the first transaction represents the set of items 1, 2, 4 and 5.
1 | 2 | 3 | 4 | 5 | |
t1 | x | x | x | x | |
t2 | x | x | |||
t3 | x | x | x | x | |
t4 | x | x | x | ||
t5 | x | x | x | x |
What is the output?
The output of AprioriRare is the set of minimal rare itemset.
To explain what it a minimal rare itemset, we need to review some definitions. An itemset is a group of items. The support of an itemset is the number of transactions that contain the itemset divided by the total number of transactions. For example, the itemset {1, 2} has a support of 60% because it appears in 3 transactions out of 5 (it appears in t1, t2 and t5). A frequent itemset is an itemset that has a support no less than the minsup parameter. A minimal rare itemset is an itemset that is not a frequent itemset and that all its subsets are frequent itemsets.
If we run AprioriRare algorithm with minsup = 60 % and this transaction database, we obtain the following set of minimal rare itemsets (see Szathmary et al. 2007b for further details):
Minimal Rare Itemsets | Support |
{4} | 20 % |
{1, 3, 5} | 40 % |
{1, 2, 3} | 40 % |
Performance
AprioriRare is the only algorithm for minimal rare itemset mining offered in SPMF. Since it is based on Apriori, it suffers from the same fundamental limitations (it may generate too much candidates and may generate candidates that do not appear in the database).
Where can I get more information about this algorithm?
The AprioriRare algorithm is described in this paper:
Laszlo Szathmary, Amedeo Napoli, Petko Valtchev: Towards Rare Itemset Mining. ICTAI (1) 2007: 305-312
How to run this example?
What is AprioriInverse?
AprioriInverse is an algorithm mining perfectly rare itemsets. Why mining perfectly rare itemsets? One reason is that it is useful for generating the set of sporadic association rules.
What is the input?
The input of AprioriInverse is a transaction database and two thresholds named minsup and maxsup. A transaction database is a set of transactions. A transaction is a set of items (symbols). For example, the following transactions database contains 5 transactions (t1,t2...t5) and 5 items (1,2,3,4,5). This database is provided in the file "contextInverse.txt" of the SPMF distribution.:
1 | 2 | 3 | 4 | 5 | |
t1 | x | x | x | x | |
t2 | x | x | |||
t3 | x | x | x | x | |
t4 | x | x | |||
t5 | x | x | x | x |
What is the output?
The output of AprioriInverse is the set of all perfectly rare itemsets in the database such that their support is lower than maxsup and higher than minsup.
To explain what it a perfectly rare itemset, we need to review some definitions. An itemset is a set of items. The support of an itemset is the number of transactions that contain the itemset divided by the total number of transactions. For example, the itemset {1, 2} has a support of 60% because it appears in 3 transactions out of 5 (it appears in t1, t2 and t5). A frequent itemset is an itemset that has a support no less than the maxsup parameter.
A perfectly rare itemset (a.k.a. sporadic itemset) is an itemset that is not a frequent itemset and that all its subsets are also not frequent itemsets. Moreover, it has to have a support higher or equal to the minsup threshold.
By running the AprioriInverse algorithm with minsup = 0.1 % and maxsup of 60 % and this transaction database, we obtain the following set of perfectly rare itemsets (see Koh & Roundtree 2005 for further details):
Perfectly Rare Itemsets | Support |
{3} | 60 % |
{4} | 40 % |
{5} | 60 % |
{4, 5} | 40 % |
{3, 5} | 20 % |
Performance
AprioriInverse is the only algorithm for perfectly rare itemset mining offered in SPMF. Since it is based on Apriori, it suffers from the same fundamental limitations (it may generate too much candidates and may generate candidates that do not appear in the database).
Where can I get more information about this algorithm?
The AprioriInverse algorithm is described in this p aper:
Yun Sing Koh, Nathan Rountree: Finding Sporadic Rules Using Apriori-Inverse. PAKDD 2005: 97-106
How to run this example?
What is CloStream?
CloStream (Yen et al., 2009) is an algorithm for incrementally mining closed itemsets from a data stream.
Why is it useful? Because most closed itemset mining algorithms such as Charm, DCI_Closed and AprioriClose are batch algorithms. This means that if the transaction database is updated, we need to run the algorithms again to update the set of closed itemsets. If there is constant insertion of new transactions and the results need to be updated often, it may become costly. A stream mining algorithm assume that each transaction in a database can only be read once at that new transaction appears regularly. Every time that a new transaction appear, the result is updated.
What is the input of CloStream?
The input of CloStream is a stream of transactions. For example, consider the five following transactions. A transaction is a set of items. For example, the first transaction t1 contains items 1, 2 and 4.
t1 | 1 | 2 | 4 | ||
t2 | 2 | 3 | 5 | ||
t3 | 1 | 2 | 3 | 5 | |
t4 | 2 | 5 | |||
t5 | 1 | 2 | 3 | 5 |
What is the output of CloStream?
CloStream produces as output the set of closed itemsets contained in the database. An itemset is a group of items. The support of an itemset is the number of transactions that contains the itemset. For example, the itemset {1, 2, 4} has a support of 1 because it only appear in t1. A closed itemset is an itemset that is not included in another itemset having the same support.For example, if we apply CloStream to the five transactions, the final result is:
closed itemsets | support |
{} | 5 |
{3} | 4 |
{1, 3} | 3 |
{1, 3, 4} | 1 |
{2, 5} | 4 |
{2, 3, 5} | 3 |
{1, 2, 3, 5} | 2 |
Performance
CloStream is a reasonably efficient algorithm. The only limitation is that it is not possible to set a minimum support threshold. Therefore, if the number of closed itemset is large, this algorithm may use too much memory. However, CloStream has the advantage of being very simple an easy to implement.
Where can I get more information about this algorithm?
The CloStream algorithm is described in this paper:
Show-Jane Yen, Yue-Shi Lee, Cheng-Wei Wu, Chin-Lin Lin: An Efficient Algorithm for Maintaining Frequent Closed Itemsets over Data Stream. IEA/AIE 2009: 767-776
How to run this example?
What is UApriori?
UApriori is an algorithm for mining frequent itemsets from a transaction database where the data is uncertain (contains probabilities. It was proposed by Chui et al. 2007.
What is the input ?
UApriori takes as input a transaction database and a minnimum expected support. A transaction database is a set of transactions where each transaction is a set of items. In UApriori, we assume that each item in a transaction is annotated with an existential probability. For example, let's consider the following transaction database, consisting of 4 transactions (t1,t2...t5) and 5 items (1,2,3,4,5). The transaction t1 contains item1 with a probability of 0.5, item 2 with a probability of 0.4, item 4 with a probability of 0.3 and item 5 with a probability of 0.7. This database is provided in the file "contextUncertain.txt" of the SPMF distribution:
1 | 2 | 3 | 4 | 5 | |
t1 | 0.5 | 0.4 | 0.3 | 0.7 | |
t2 | 0.5 | 0.4 | 0.4 | ||
t3 | 0.6 | 0.5 | 0.1 | 0.5 | |
t4 | 0.7 | 0.4 | 0.3 | 0.9 |
What is the output?
The output of U-Apriori is the set of frequent itemsets. Note that the definition of frequent itemset is here different from the definition used by the regular Apriori algorithm because we have to consider the existential probabilities.
The expected support of an itemset in a transaction is defined as the product of the existential probability of each item from the itemset in this transaction. For example, the support of itemset {1, 2} in transaction t1 is 0.5 x 0.4. The expected support of an itemset in a transaction database is the sum of its support in all transactions where it occurs. For example, the expected support of itemset {2, 3} is the sum of its expected support in t2 and t4 : 0.5 x 0.4 + 0.4 x 0.3 = 0.32.
A frequent itemset is an itemset that has an expected support higher or equal to the minimum expected support set by the user.
By running U-Apriori with a minimum expected support of 0.10, we obtain 19 frequent itemsets, including:
itemsets | expected support |
{2 3 5} | 0.19 |
{1 3 5} | 0.19 |
{1 4 5} | 0.14 |
{2 4 5} | 0.11 |
{1 2 5} | 0.54 |
{1 5} | 1.28 |
{1 3} | 0.21 |
{1 4} | 0.21 |
{2 3} | 0.32 |
{1 2} | 0.78 |
... | ... |
Performance
UApriori is the not the most efficient algorithm for uncertain itemset mining but it is simple and it is the first algorithm designed for this task.
Where can I get more information about the UApriori algorithm?
Here is an article describing the UApriori algorithm:
C. Kit Chui, B. Kao, E. Hung: Mining Frequent Itemsets from Uncertain Data. PAKDD 2007: 47-58
How to run this example?
What is Two-Phase?
Two-Phase (Liu et al., 2005) is an algorithm for discovering high-utility itemsets in a transaction database containing utility information.
What is the input?
Two-phase takes as input as input a transaction database with utility information and a minimum utility threshold min_utility.
Let's consider the following database consisting of 5 transactions (t1,t2...t5) and 7 items (1, 2, 3, 4, 5, 6, 7). This database is provided in the text file "DB_utility.txt" in the package ca.pfv.spmf.tests of the SPMF distribution.
Items | Transaction utility | Item utilities for this transaction | |
t1 | 3 5 1 2 4 6 | 30 | 1 3 5 10 6 5 |
t2 | 3 5 2 4 | 20 | 3 3 8 6 |
t3 | 3 1 4 | 8 | 1 5 2 |
t4 | 3 5 1 7 | 27 | 6 6 10 5 |
t5 | 3 5 2 7 | 11 | 2 3 4 2 |
Each line of the database is:
- a set of items (the first column of the table),
- the sum of the utilities of these items in this transaction (the second column of the table),
- the utility of each item for this transaction (the third column of the table).
Note that the the value in the second column for each line is the sum of the values in the third column.
What are real-life examples of such a database? There are several applications in real life. One application is a customer transaction database. Imagine that each transaction the items purchased by a customer. The first customer named "t1" bought items 3, 5, 1, 2, 4 and 6. The amount of money spent for each item is respectively 1, 3, 5, 10 , 6 and 5. The total amount of money spent in this transaction is 1 + 3 + 5 + 10 + 6 + 5 = 30.
What is the output?
The output of Two-Phase is the set of high utility itemsets having a utility no less than the minutility threshold set by the user.
Let me explain in more details. An itemset is set of items. The utility of an itemset in a transaction is the sum of the utility of its items in the transaction. For example, the utility of the itemset {1 4} in transaction t1 is 5 + 6 = 11 and the utility of {1 4} in transaction t3 is 5 + 2 = 7. The utility of an itemset in a database is the sum of its utility in all transactions where it appears. For example, the utility of {1 4} in the database is the utility of {1 4} in t1 plus the utility of {1 4} in t3, for a total of 11 + 7 = 18. A high utility itemset is an itemset such its utility is no less than minutility.
If we run Two Phase with a minimum utility of 30, we obtain 8 high-utility itemsets:
itemsets | utility | support |
{2 4} | 30 | 40 % (2 transactions) |
{2 5} | 31 | 60 % (3 transactions) |
{1 3 5} | 31 | 40 % (2 transactions) |
{2 3 4} | 34 | 40 % (2 transactions) |
{2 3 5} | 37 | 60 % (3 transactions) |
{2 4 5} | 36 | 40 % (2 transactions) |
{2 3 4 5} | 40 | 40 % (2 transactions) |
{1 2 3 4 5 6} | 30 | 20 % (1 transactions) |
If the database is a transaction database from a store, we could intrepret these results as all the groups of items bought together that generated a profit of 30 $ or more.
Performance
High utility itemset mining is a more difficult problem than frequent itemset mining. Therefore, algorithms for high-utility itemset mining are generally slower than frequent itemset mining algorithms.
The Two-Phase algorithm is an important algorithm because it introduced the concept of mining high utility itemset by using two phases (see the paper for details). However, there are now some more efficient algorithms. I would recomment to use HUI-Miner that is also included in SPMF and is more efficient.
Implementation details
In the source code version of SPMF, there are two versions of Two-Phase: one that saves the result to a file (MainTestTwoPhaseAlgorithm_saveToFile.java) and one that saves the result to memory (MainTestTwoPhaseAlgorithm.java).
Also note that the input format is not exactly the same as described in the original article. But it is equivalent.
Where can I get more information about the Two-Phase algorithm?
Here is an article describing the Two-Phase algorithm:
Y. Liu, W.-K. Liao, A. N. Choudhary: A Two-Phase Algorithm for Fast Discovery of High Utility Itemsets. PAKDD 2005: 689-695
How to run this example?
What is HUI-Miner?
HUI-Miner (Mengchi Liu & Junfeng Qu, CIKM 2012) is an algorithm for discovering high-utility itemsets in a transaction database containing utility information.
What is the input?
HUI-Miner takes as input as input a transaction database with utility information and a minimum utility threshold min_utility.
Let's consider the following database consisting of 5 transactions (t1,t2...t5) and 7 items (1, 2, 3, 4, 5, 6, 7). This database is provided in the text file "DB_utility.txt" in the package ca.pfv.spmf.tests of the SPMF distribution.
Items | Transaction utility | Item utilities for this transaction | |
t1 | 3 5 1 2 4 6 | 30 | 1 3 5 10 6 5 |
t2 | 3 5 2 4 | 20 | 3 3 8 6 |
t3 | 3 1 4 | 8 | 1 5 2 |
t4 | 3 5 1 7 | 27 | 6 6 10 5 |
t5 | 3 5 2 7 | 11 | 2 3 4 2 |
Each line of the database is:
- a set of items (the first column of the table),
- the sum of the utilities of these items in this transaction (the second column of the table),
- the utility of each item for this transaction (the third column of the table).
Note that the the value in the second column for each line is the sum of the values in the third column.
What are real-life examples of such a database? There are several applications in real life. One application is a customer transaction database. Imagine that each transaction the items purchased by a customer. The first customer named "t1" bought items 3, 5, 1, 2, 4 and 6. The amount of money spent for each item is respectively 1, 3, 5, 10 , 6 and 5. The total amount of money spent in this transaction is 1 + 3 + 5 + 10 + 6 + 5 = 30.
What is the output?
The output of HUI-Miner is the set of high utility itemsets having a utility no less than the minutility threshold set by the user.
Let me explain in more details. An itemset is set of items. The utility of an itemset in a transaction is the sum of the utility of its items in the transaction. For example, the utility of the itemset {1 4} in transaction t1 is 5 + 6 = 11 and the utility of {1 4} in transaction t3 is 5 + 2 = 7. The utility of an itemset in a database is the sum of its utility in all transactions where it appears. For example, the utility of {1 4} in the database is the utility of {1 4} in t1 plus the utility of {1 4} in t3, for a total of 11 + 7 = 18. A high utility itemset is an itemset such its utility is no less than minutility.
If we run HUI-Miner with a minimum utility of 30, we obtain 8 high-utility itemsets:
itemsets | utility | support |
{2 4} | 30 | 40 % (2 transactions) |
{2 5} | 31 | 60 % (3 transactions) |
{1 3 5} | 31 | 40 % (2 transactions) |
{2 3 4} | 34 | 40 % (2 transactions) |
{2 3 5} | 37 | 60 % (3 transactions) |
{2 4 5} | 36 | 40 % (2 transactions) |
{2 3 4 5} | 40 | 40 % (2 transactions) |
{1 2 3 4 5 6} | 30 | 20 % (1 transactions) |
If the database is a transaction database from a store, we could intrepret these results as all the groups of items bought together that generated a profit of 30 $ or more.
Performance
High utility itemset mining is a more difficult problem than frequent itemset mining. Therefore, algorithms for high-utility itemset mining are generally slower than frequent itemset mining algorithms.
The HUI-Miner algorithm is reported as one of the most efficient algorithm for high utility itemset mining.
Implementation details
The version implemented here contains all the optimizations described in the paper proposing HUI-Miner.
Note that the input format is not exactly the same as described in the original article. But it is equivalent.
Where can I get more information about the HUI-Miner algorithm?
This is the reference of the article describing the HUI-Miner algorithm:
M. Liu, J.-F. Qu: Mining high utility itemsets without candidate generation. CIKM 2012, 55-64
How to run this example?
What is the VME algorithm?
VME (Deng & Xu, 2010) is an algorithm for mining erasable itemsets from a transaction database with profit information.
What is the input?
VME takes as input a product database where each product is a set of items and a threshold. Moreover each product is annotated with a profit
For example, let's consider the following product database, consisting of 6 products and 7 items (this example is taken from the article of Deng & Xu, 2010). Each product is annotated with a profit. This product database is provided in the file "contextVME.txt" of the SPMF distribution.:
profit | item1 | item2 | item3 | item4 | item5 | item6 | item7 | |
product1 | 50$ | x | x | x | x | |||
product2 | 20$ | x | x | x | ||||
product3 | 50$ | x | x | x | x | |||
product4 | 800$ | x | x | x | ||||
product5 | 30$ | x | x | |||||
product6 | 50$ | x | x |
What is the output?
The output is the set of erasable itemsets generating a loss of profit lower or equal to a user-specificed threshold.
An itemset is a group of items. The loss of profit generated byan itemset is defined as the sum of the product profit for all products containing an item from this itemset. For example, the lost of profit of itemset {5, 6} is the sum of the profits of products containing 5 and/or 6: 50$ + 20 $ + 50 $ + 30 $ = 150 $. The loss of profit can also be expressed as a percentage of the total profit of the database. For example, in this database the total profit is 50 + 20 + 50 + 800 + 30 + 50 = 1000$.
By running VME with a threshold of 15 %, we obtain 8 erasable itemsets (having a profit less or equal to 15% x 1000$ = 150 $):
erasable itemsets | loss of profit ("gain") |
{3} | 150 |
{5} | 70 |
{6} | 80 |
{7} | 50 |
{5 6} | 150 |
{5 7} | 100 |
{6 7} | 100 |
{5 6 7} | 150 |
Performance
The VME algorithm is Apriori-based. It is not the fastest algorithm for this problem. But it is the only one available in SPMF because this problem is not very popular. For other algorithms , you can search for the author names. They have proposed a few algorithms with some improvements.
Where can I get more information about the VME algorithm?
Here is an article describing the VME algorithm:
Z. Deng, X. Xu: An Efficient Algorithm for Mining Erasable Itemsets. ADMA (1) 2010: 214-225.
How to run this example?
What is an itemset-tree?
An itemset-tree is a special data structure that can be used for performing efficient queries about itemsets and association rules in a database without having to generate all of them beforehand.
An itemset-tree has the nice property of being incremental, which means that new transactions can be added to an existing itemset tree very efficiently without having to rebuid the tree. It also has the property of being compact.
How to use it?
An itemset-tree is build by inserting a set of transactions. A transaction is simply an unordered set of items. For example, we could insert the following 6 transactions (t1,t2...t5) into an itemset-tree. In this example, the transaction t1 represents the set of items {1, 4}. This set of transactions is provided in the file "contextItemsetTree.txt" of the SPMF distribution.
1 | 2 | 3 | 4 | 5 | |
t1 | x | x | |||
t2 | x | x | |||
t3 | x | x | x | x | x |
t4 | x | x | x | ||
t5 | x | x | |||
t6 | x | x |
The result of the insertion of these six transactions is an itemset-tree (see the article by Kubat for more details).
{} sup=6 [2 ] sup=3 [2 5 ] sup=2 [2 4 ] sup=1 [1 ] sup=3 [1 2 ] sup=2 [1 2 4 ] sup=1 [1 2 3 4 5 ] sup=1 [1 4 ] sup=1The root is the empty itemset {} and the leafs are {2, 5}, {2, 4}, {1 2 4},{1 2 3 4 5} and {1, 4}.
Once an itemset-tree has been created, it is possible to update it by inserting a new transaction. For example, in this example provided in the source code, we update the previous tree by adding a new transaction {4, 5}. The result is this tree:
{} sup=7 [2 ] sup=3 [2 5 ] sup=2 [2 4 ] sup=1 [1 ] sup=3 [1 2 ] sup=2 [1 2 4 ] sup=1 [1 2 3 4 5 ] sup=1 [1 4 ] sup=1 [4 5 ] sup=1Next, it is shown how to query the tree to determine the support of a target itemset efficiently. For example, if we execute the query with {2}, the support is determined to be 5 because 2 appear in 5 transactions.
After that the source code offers an example of how to use the itemset tree to get all itemsets that subsume an itemset and their support. For example, if we use {1 2} as a query the result is:
[1 2 ] supp:2 [1 2 3 ] supp:1 [1 2 4 ] supp:2 [1 2 5 ] supp:1 [1 2 3 4 ] supp:1 [1 2 3 5 ] supp:1 [1 2 4 5 ] supp:1 [1 2 3 4 5 ] supp:1Another example provided is how to use the tree to find all itemsets that subsume an itemset such that the support is higher or equal than a user-specified threshold named minsup. For example, if we execute this query with the itemset {1} and minsup =2, we get this result:
[1 ] supp:3 [1 2 ] supp:2 [1 4 ] supp:3 [1 2 4 ] supp:2Lastly, another example is how to generate all association rules having a target itemset as antecedent and a support and confidence respectively higher or equal to some user-speficied thresholds minsup and minconf. For example, if the target itemset is {1} and minconf = 0.1 and minsup = 2, the result is:
[ 1 ] ==> [2 ] sup=2 conf=0.6666666666666666 [ 1 ] ==> [4 ] sup=3 conf=1.0 [ 1 ] ==> [2 4 ] sup=2 conf=0.6666666666666666
Performance
This data structure is efficient for the case of a database that need to be updated and where targeted queries need to be performed. For details about the complexity in terms of space and time, please refer to the article by Kubat et al.
Where can I get more information about the Itemset-tree data structure and related algorithms?
Here is an article describing the itemset-tree and related algorithms:
Miroslav Kubat, Aladdin Hafez, Vijay V. Raghavan, Jayakrishna R. Lekkala, Wei Kian Chen: Itemset Trees for Targeted Association Querying. IEEE Trans. Knowl. Data Eng. 15(6): 1522-1534 (2003)
How to run this example?
What is MISApriori?
MISApriori is an algorithm for mining frequent itemsets by using multiple minimum supports. It is a generalization of the Apriori algorithm, which uses a single minimum support threshold.
What is the input of this algorithm?
The input of MSApriori is a transaction database and two parameters named beta and LS. These parameters are used to determine a minimum support for each item.
A transaction database is a set of transactions, where each transaction is an unordered set of items (symbols). For example, let's consider the following transaction database. It consists of 6 transactions (t1,t2...t6) and 5 items (1,2,3,4,5). For instance, transaction t1 is the set of items {1, 2, 4, 5}. This database is provided in the file "contextIGB.txt" of the SPMF distribution.
1 | 2 | 3 | 4 | 5 | |
t1 | x | x | x | x | |
t2 | x | x | x | ||
t3 | x | x | x | x | |
t4 | x | x | x | x | |
t5 | x | x | x | x | x |
t6 | x | x | x |
What is the output of this algorithm?
The output of MSApriori is the set of all frequent itemsets contained in the database.
Contrarily to the original Apriori algorithm, MSApriori use multiple minimum supports thresholds instead of just one. In fact, MSApriori uses a minimum support value for each item. Because it would be time consuming to set a minimum support threshold value for each item for a large database, the thresholds are determined automatically by using two user-specified parameters named beta (0 <= B <= 1) and LS (0 <= LS <= 1).
The minimum support of an item k is then defined as the greatest value between:
- LS
- and B x f(k) where f(k) is the number of transactions containing the item k.
So if B is set to 0, there will be a single minimum support for all items and this will be equivalent to the regular Apriori algorithm.
The support of an itemset is the number of transactions containing the itemset divided by the total number of transactions. An itemset is a frequent itemset if its support is higher or equal to the smallest minimum support threshold from the minimum support thresholds of all its items.
Why MSApriori is useful? It is useful because it allows discovering frequent itemsets containing rare items (if their minimum support is set low).
If we run MSApriori on the previous transaction database with beta = 0.4 and LS = 0.2, we obtain the following result:
1 supp: 4 2 supp: 6 3 supp: 4 4 supp: 4 5 supp: 5 1 2 Support: 4 1 3 Support: 2 1 4 Support: 3 1 5 Support: 4 2 3 Support: 4 2 4 Support: 4 2 5 Support: 5 3 4 Support: 2 3 5 Support: 3 4 5 Support: 3 1 2 3 Support: 2 1 2 4 Support: 3 1 2 5 Support: 4 1 3 5 Support: 2 1 4 5 Support: 3 2 3 4 Support: 2 2 3 5 Support: 3 2 4 5 Support: 3 1 2 3 5 Support: 2 1 2 4 5 Support:Note that here the support is expressed by an integer value which represents the number of transactions containing the itemset. For example, itemset {2, 3 5} has a support of 3 because it appears in three transactions, namely t2, t4 and t5. This integer value can be converted as a percentage by dividing by the total number of transactions.
Performance
MSApriori is one of the first algorithm for mining itemsets with multiple minimum support thresholds. It is not the most efficient algorithm for this task because it is based on Apriori. If performance is important, it is recommend to use CFPGrowth, which is based on FPGrowth and is more efficient.
Note that there is one important difference between the input of CFPGrowth and MSApriori in SPMF. The MISApriori works by setting the multiple minimum supports by using the LS and BETA values. The CFPGrowth implementation uses a list of minimum support values stored in a text file.
Where can I get more information about the MSApriori algorithm?
This article describes MSApriori:
B. Liu, W. Hsu, Y. Ma, "Mining Association Rules with Multiple Minimum Supports" Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD-99), August 15-18, 1999, San Diego, CA, USA.
How to run this example?
What is CFPGrowth?
CFPGrowth is an algorithm for mining frequent itemsets by using multiple minimum supports. It is a generalization of the Apriori algorithm, which uses a single minimum support threshold.
What is the input of this algorithm?
The input of CFPGrowth is a transaction database and a list of minimum support thresholds indicating the minimum support threshold for each item.
A transaction database is a set of transactions, where each transaction is an unordered set of items (symbols). For example, let's consider the following transaction database. It consists of 5 transactions (t1,t2...t6) and 8 items (1,2,3,4,5,6,7,8). For instance, transaction t1 is the set of items {1, 3, 4, 6}. This database is provided in the file "contextCFPGrowth.txt" of the SPMF distribution.
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | |
t1 | x | x | x | x | ||||
t2 | x | x | x | x | x | |||
t3 | x | x | x | x | x | |||
t4 | x | x | x | |||||
t5 | x | x |
The list of minimum support threshold is stored in a text file that is read as input by the algorithm. This is provided in the file "MIS.txt":
item | support threshold |
1 | 1 |
2 | 2 |
3 | 3 |
4 | 3 |
5 | 2 |
6 | 3 |
7 | 2 |
8 | 1 |
This file indicated for example that the minimum support threshold to be used for item 6 is 3.
What is the output of this algorithm?
The output of CFPgrowth is the set of all frequent itemsets contained in the database.
What is a frequent itemset ? The support of an itemset is the number of transactions containing the itemset. An itemset is a frequent itemset if its support is higher or equal to the smallest minimum support threshold among the minimum support thresholds of all its items. For example, the itemset {1 2 8} is frequent because it appears in one transactions (t3) and its support is higher than the smallest minimum support among the minimum support of item 1, item 2 and item 8.
Why CFPGrowth is useful? It is useful because lower minimum support thresholds can be associated to rare items. Therefore, it allows discovering frequent itemsets containing rare items.
If we run CFPGrowth on the previous transaction database with the MIS.txt file previously described, we get the following result, where each line represents an itemsets followed by ":" and then its absolute support.:
8:1Note: If you are using the GUI version of SPMF the file containing the minimum support must be located in the same folder as the input file containing the transaction database.
8 1:1
8 1 2:1 // for example, this itemset is {1, 2, 8}, and it has a support of 1.
8 1 2 6:1
8 1 2 6 3:1
8 1 2 3:1
8 1 6:1
8 1 6 3:1
8 1 3:1
8 2:1
8 2 6:1
8 2 6 3:1
8 2 3:1
8 6:1
8 6 3:1
8 3:1
1:3 // for example, this itemset is {1}, and it has a support of 3.
1 7:1
1 7 5:1
1 7 5 6:1
1 7 5 6 3:1
1 7 5 3:1
1 7 6:1
1 7 6 3:1
1 7 3:1
1 5:1
1 5 6:1
1 5 6 3:1
1 5 3:1
1 2:1
1 2 6:1
1 2 6 3:1
1 2 3:1
1 6:3
1 6 4:1
1 6 4 3:1
1 6 3:3
1 4:1
1 4 3:1
1 3:3
7:2
7 6:2
2:3
2 6:2
2 3:2
6:4
6 3:3
3:4
Performance
CFPGrowth is a very efficient algorithm. It is based on FPGrowth.
Note that SPMF also offers MISApriori, which is less efficient. Note that there is one important difference between the input of CFPGrowth and MSApriori in SPMF. The MISApriori works by setting the multiple minimum supports by using some special parameters named LS and BETA (see the example describing MISApriori for more details). The CFPGrowth implementation instead uses a list of minimum support values stored in a text file.
Where can I get more information about the CFPGrowth algorithm?
This article describes CFPGrowth:
Y.-H. Hu, Y.-L. Chen: Mining association rules with multiple minimum supports: a new mining algorithm and a support tuning mechanism. Decision Support Systems 42(1): 1-24 (2006)
How to run this example?
What is this algorithm?
It is an algorithm for discovering all association rules in a transaction database, following the two steps approach proposed by Agrawal & Srikant (1993). The first step is to discover frequent itemsets. The second step is to generate rules by using the frequent itemsets. The main difference with Agrawal & Srikant in this implementation is that FPGrowth is used to generate frequent itemsets instead of Apriori because FPGrowth is more efficient.
What is the input?
The input is a transaction database (a.k.a binary context) and two thresholds named minsup and minconf.A transaction database is a set of transactions, where each transaction is an unordered set of items (symbols). For example, let's consider the following transaction database. It consists of 6 transactions (t1,t2...t6) and 5 items (1,2,3,4,5). For instance, transaction t1 is the set of items {1, 2, 4, 5}. This database is provided in the file contextIGB.txt of the SPMF distribution.
1 | 2 | 3 | 4 | 5 | |
t1 | x | x | x | x | |
t2 | x | x | x | ||
t3 | x | x | x | x | |
t4 | x | x | x | x | |
t5 | x | x | x | x | x |
t6 | x | x | x |
What is the output?
The output of an association rule mining algorithm is a set of association rules respecting the user-specified minsup and minconf thresholds.
To explain how it works, we need to review some definitions. An association rule X==>Y is a relationship between two itemsets (sets of items) X and Y such that the intersection of X and Y is empty. The support of a rule is the number of transactions that contains X∪Y. The confidence of a rule is the number of transactions that contains X∪Y divided by the number of transactions that contain X.
So if we apply an association rule mining algorithm, it will return all the rules having a support and confidence respectively no less than minsup and minconf.
For example, by applying the algorithm with minsup = 50 %, minconf= 60%, we obtains 55 associations rules (run the example in the SPMF distribution to see the result).
Implementation details
Association rule mining is traditionnaly performed in two steps : (1) mining frequent itemset and (2) generating association rules by using frequent itemsets. In this implementation, we use the FPGrowth algorithm for Step 1 because it is very efficient. For Step 2, we use the algorithm that was proposed by Agrawal & Srikant (1994).Note that in the source code version of SPMF, there is also a version of this algorithm that use Apriori in Step 1. But it is slower than FPGrowth.
Where can I get more information about this algorithm?
Here is the PDF of the technical report published in 1994 that describe how to generate association rules from frequent itemsets (Step 2).
R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. Research Report RJ 9839, IBM Almaden Research Center, San Jose, California, June 1994.
You can also read chapter 6 of the book "introduction to data mining" which provide a nice and easy to understand introduction to how to discover frequent itemsets and generate association rules.
This is an article describing the FPGrowth algorithm for mining frequent itemsets.
Jiawei Han, Jian Pei, Yiwen Yin, Runying Mao: Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach. Data Min. Knowl. Discov. 8(1): 53-87 (2004)
How to run this example?
What is this algorithm?
This is a variation of the algorithm for mining all association rules from a transaction database, described in the previous example.
Traditionnally, association rule mining is performed by using two measures named the support and confidence to evaluate rules. In this example, we show how to use another popular measure that is called the lift or interest.
What is the input?
The input is a transaction database (a.k.a binary context) and three thresholds named minsup,minconf and minlift.A transaction database is a set of transactions, where each transaction is an unordered set of items (symbols). For example, let's consider the following transaction database. It consists of 6 transactions (t1,t2...t6) and 5 items (1,2,3,4,5). For instance, transaction t1 is the set of items {1, 2, 4, 5}. This database is provided in the file contextIGB.txt of the SPMF distribution.
1 | 2 | 3 | 4 | 5 | |
t1 | x | x | x | x | |
t2 | x | x | x | ||
t3 | x | x | x | x | |
t4 | x | x | x | x | |
t5 | x | x | x | x | x |
t6 | x | x | x |
What is the output ?
The output is a set of all the association rules that have a support, confidence and lift respectively higher than minsup, minconf and minlift.
The lift of a rule X-->Y is calculated as lift(X-->Y) = ( (sup(X U Y)* N) / (sup(X)*sup(Y) ), where
- N is the number of transactions in the transaction database,
- sup(X∪Y) is the number of transactions containing X and Y,
- sup(X) is the number of transactions containing X
- sup(Y) is the number of transactions containing Y.
The confidence of a rule X-->Y is calculated as conf(X-->Y) = sup(X U Y) / (sup(X)).
The support of a rule X -->Y is defined as sup(X-->Y) = sup(X∪Y) / N
By applying the algorithm with minsup = 0.5, minconf= 0.9 and minlift = 1 on the previous database, we obtains 18 associations rules:
rule 0: 4 ==> 2 support : 0.666 (4/6) confidence : 1.0 lift : 1.0
rule 1: 3 ==> 2 support : 0.666 (4/6) confidence : 1.0 lift : 1.0
rule 2: 1 ==> 5 support : 0.666 (4/6) confidence : 1.0 lift : 1.2
rule 3: 1 ==> 2 support : 0.666 (4/6) confidence : 1.0 lift : 1.0
rule 4: 5 ==> 2 support : 0.833(5/6) confidence : 1.0 lift : 1.0
rule 5: 4 5 ==> 2 support : 0.5 (3/6) confidence : 1.0 lift : 1.0
rule 6: 1 4 ==> 5 support : 0.5 (3/6) confidence : 1.0 lift : 1.2
rule 7: 4 5 ==> 1 support : 0.5 (3/6) confidence : 1.0 lift : 1.5
rule 8: 1 4 ==> 2 support : 0.5 (3/6) confidence : 1.0 lift : 1.0
rule 9: 3 5 ==> 2 support : 0.5 (3/6) confidence : 1.0 lift : 1.0
rule 10: 1 5 ==> 2 support : 0.66 (4/6) confidence : 1.0 lift : 1.0
rule 11: 1 2 ==> 5 support : 0.66 (4/6) confidence : 1.0 lift : 1.2
rule 12: 1 ==> 2 5 support : 0.66 (4/6) confidence : 1.0 lift : 1.2
rule 13: 1 4 5 ==> 2 support : 0.5 (3/6) confidence : 1.0 lift : 1.0
rule 14: 1 2 4 ==> 5 support : 0.5 (3/6) confidence : 1.0 lift : 1.2
rule 15: 2 4 5 ==> 1 support : 0.5 (3/6) confidence : 1.0 lift : 1.5
rule 16: 4 5 ==> 1 2 support : 0.5 (3/6) confidence : 1.0 lift : 1.5
rule 17: 1 4 ==> 2 5 support : 0.5 (3/6) confidence : 1.0 lift : 1.5
How to interpret the results?
For an association rule X ==> Y, if the lift is equal to 1, it means that X and Y are independent. If the lift is higher than 1, it means that X and Y are positively correlated. If the lift is lower than 1, it means that X and Y are negatively correlated.
Implementation details
In the source code version of SPMF, there are two versions of this algorithm. The first version saves the result into memory ("MainTestAllAssociationRules_FPGrowth_wifthLift").. The second one saves the result to a file ("MainTestAllAssociationRules_FPGrowth_saveToFile_wifthLift").
How to run this example?
What is this algorithm?
It is an algorithm (Pasquier, 1999) for discovering the Guigues-Duquenne basis of association rules (a.k.a. exact association rules).
What is the input?
The input is a transaction database (a.k.a. binary context) and two thresholds named minsup and minconf.
A transaction database is a set of transactions. For example, consider the following transaction database (transaction database). It contains 5 transactions (t1, t2, ..., t5) and 5 items (1,2,...5). Each transaction is an unordered set of items. For example, the first transaction represents the set of items 1, 3 and 4. This database is provided as the file contextPasquier99.txt in the SPMF distribution.
1 | 2 | 3 | 4 | 5 | |
t1 | x | x | x | ||
t2 | x | x | x | ||
t3 | x | x | x | x | |
t4 | x | x | |||
t5 | x | x | x | x |
What is the output?
The output is the Guigues-Duquenne basis of association rules respecting the minsup and minconf thresholds set by the user.
To explain what the result, let's review some definition. An itemset is a group of items. The support of an itemset is the number of times that it appears in the database divided by the total number of transactions in the database. For example, the itemset {1 3} has a support of 60 % because it appears in 3 out of 5 transactions from the database.
An association rule X--> Y is an association between two itemsets X and Y are disjoint. The support of an association rule is the number of transactions that contains X and Y divided by the total number of transactions. The confidence of an association rule is the number of transactions that contains X and Y divided by the number of transactions that contains X.
A closed itemset is an itemset such that it is included in no itemset having the same support. An itemset Y is the closure of an itemset X if Y is a closed itemset, X is a subset of Y and X and Y have the same support. A pseudo-closed itemset is an itemset such that it is not closed and it contains the closure of all its subsets that are pseudo-closed itemsets.
The Guigues-Duqenne basis of association rules is the set of association rules contained in a database such that their support is higher than minsup and their confidence is 100%. Moreover, each rule from this basis must have the form X--> Y-X where X is a pseudo-closed itemset and Y is a closed itemset such that Y is the closure of X.
For example, if we run the algorithm on the previous database with minsup = 20 % and minconf = 100%, we discover the following set of association rules:
Rule | Support | Confidence |
2 ==> 5 | 80 % | 100 % |
5 ==> 2 | 80 % | 100 % |
Note that the Guigues-Duquenne Basis assumes that minconf is always equals to 100%. However, in SPMF it is possible to set minconf to smaller values.
Performance
This implementation is based on AprioriClose, which is not the most efficient algorithm for closed itemset mining.
Where can I get more information about this algorithm?
This paper explain this algorithm and the theory on which is it based.
Nicolas Pasquier, Yves Bastide, Rafik Taouil, Lotfi Lakhal: Discovering Frequent Closed Itemsets for Association Rules. ICDT 1999: 398-416
How to run this example?
To run this example with the source code version of SPMF, launch the file "MainTestProperBasis.java" in the package ca.pfv.SPMF.tests.
This example is not availabe in the release version of SPMF.
What is this algorithm?
It is an algorithm (Pasquier, 1999) for discovering the proprer basis for approximate association rules.
What is the input?
The input is a transaction database (a.k.a. binary context) and two thresholds named minsup and minconf.
A transaction database is a set of transactions. For example, consider the following transaction database (transaction database). It contains 5 transactions (t1, t2, ..., t5) and 5 items (1,2,...5). Each transaction is an unordered set of items. For example, the first transaction represents the set of items 1, 3 and 4. This database is provided as the file contextPasquier99.txt in the SPMF distribution.
1 | 2 | 3 | 4 | 5 | |
t1 | x | x | x | ||
t2 | x | x | x | ||
t3 | x | x | x | x | |
t4 | x | x | |||
t5 | x | x | x | x |
What is the output?
The output is the propoer basis of approximate association rules respecting the minsup and minconf thresholds set by the user.
Because the definition of the proper basis of approximate association rules is a little bit complicated, it is not explained here. You can refer to the paper by Pasquier et al. (1999) for more details.
If we run the algorithm on the previous database with minsup = 40 % and minconf = 50%, we discover the following set of association rules:
Rule Support Confidence 3 ==>1 60 % 75 % 2, 5 ==> 3 60 % 75 % 3 ==> 2, 5 60 % 75 % 2, 3, 5 ==> 1 40 % 66 % 2, 5 ==> 1, 3 40 % 50 % 1, 3 ==> 2, 5 40 % 66 % 3 ==> 1, 2, 5 40 % 50 %
Where can I get more information about this algorithm?
This paper explain this algorithm and the theory on which is it based.
Nicolas Pasquier, Yves Bastide, Rafik Taouil, Lotfi Lakhal: Discovering Frequent Closed Itemsets for Association Rules. ICDT 1999: 398-416
You may also have a look at the Ph.D thesis of N. Pasquier (in French).
How to run this example?
To run this example with the source code version of SPMF, launch the file "MainTestStructuralBasis.java" in the package ca.pfv.SPMF.tests.
This example is not availabe in the release version of SPMF.
What is this algorithm?
It is an algorithm (Pasquier, 1999) for discovering the structural basis for approximate association rules.
What is the input?
The input is a transaction database (a.k.a. binary context) and two thresholds named minsup and minconf.
A transaction database is a set of transactions. For example, consider the following transaction database (transaction database). It contains 5 transactions (t1, t2, ..., t5) and 5 items (1,2,...5). Each transaction is an unordered set of items. For example, the first transaction represents the set of items 1, 3 and 4. This database is provided as the file contextPasquier99.txt in the SPMF distribution.
1 | 2 | 3 | 4 | 5 | |
t1 | x | x | x | ||
t2 | x | x | x | ||
t3 | x | x | x | x | |
t4 | x | x | |||
t5 | x | x | x | x |
What is the output?
The output is the structural basis of approximate association rules respecting the minsup and minconf thresholds set by the user.
Because the definition of the structural basis of approximate association rules is a little bit complicated, it is not explained here. You can refer to the paper by Pasquier et al. (1999) for more details.
If we run the algorithm on the previous database with minsup = 40 % and minconf = 50%, we discover the following set of association rules:
Rule Support Confidence 3 ==>1 60 % 75 % 2, 5 ==> 3 60 % 75 % 1, 3 ==> 2, 5 40 % 66 %
Where can I get more information about this algorithm?
This paper explain this algorithm and the theory on which is it based.
Nicolas Pasquier, Yves Bastide, Rafik Taouil, Lotfi Lakhal: Discovering Frequent Closed Itemsets for Association Rules. ICDT 1999: 398-416
You may also have a look at the Ph.D thesis of N. Pasquier (in French).
How to run this example?
What is this algorithm?
This algorithm mines a subset of all association rules that is called IGB (Informative and Generic Basis of Association Rules) from a transaction database.
To discover the IGB association rules, this algorithmp perform two steps: (1) first it discovers Closed itemsets and their associated generators by applying the Zart algorithm. Then (2), association rules are generated by using the closed itemsets and generators.
What is the input?
The input is a transaction database and two thresholds named minsup and minconf.
A transaction database is a set of transactions, where each transaction is an unordered set of items (symbols). For example, let's consider the following transaction database. It consists of 6 transactions (t1,t2...t6) and 5 items (1,2,3,4,5). For instance, transaction t1 is the set of items {1, 2, 4, 5}. This database is provided in the file "contextIGB.txt" of the SPMF distribution.
1 2 3 4 5 t1 x x x x t2 x x x t3 x x x x t4 x x x x t5 x x x x x t6 x x x
What is the output?
The output is the IGB basis of association rules. It is a compact set of association rules that is both informative and generic.
To explain what is the IGB basis of association rules, let's review some definition. An itemset is a group of items. The support of an itemset is the number of times that it appears in the database divided by the total number of transactions in the database. For example, the itemset {1 3} has a support of 33 % because it appears in 2 out of 6 transactions from the database.
An association rule X--> Y is an association between two itemsets X and Y are disjoint. The support of an association rule is the number of transactions that contains X and Y divided by the total number of transactions. The confidence of an association rule is the number of transactions that contains X and Y divided by the number of transactions that contains X.
A closed itemset is an itemset such that it is included in no itemset having the same support. An itemset Y is the closure of an itemset X if Y is a closed itemset, X is a subset of Y and X and Y have the same support. A generator Y of a closed itemset X is an itemset such that (1) it has the same support as X and (2) it does not have any subset having the same support.
The IGB set of association rules is the set of rules of the form X ==> Y - X, where X is a minimal generator of Y, Y is a closed itemset having a support higher or equal to minsup, and the confidence of the rule is higher or equal to minconf.
For example, by applying the IGB algorithm on the context previously described with minsup = 0.50 and minconf= 0.61, we discover the following set of association rules:
Rule | Support | Confidence |
1 ==> 2, 4, 5 | 0.50 | 0.75 |
4 ==> 1, 2, 5 | 0.50 | 0.75 |
3 ==> 2, 5 | 0.50 | 0.75 |
{} ==> 2, 3 | 0.66 | 0.66 |
{} ==> 1, 2, 5 | 0.66 | 0.66 |
{} ==> 2, 4 | 0.66 | 0.66 |
{} ==> 2, 5 | 0.83 | 0.83 |
{} ==> 2 | 1 | 1 |
Where can I get more information about IGB association rules?
This article described IGB rules:
G. Gasmi, S. Ben Yahia, E. Mephu Nguifo, Y. Slimani: IGB: A New Informative Generic Base of Association Rules. PAKDD 2005: 81-90
How to run this example?
To run this example with the source code version of SPMF, launch the file "MainTestAllPerfectlySporadicRules.java" in the package ca.pfv.SPMF.tests.
This example is not availabe in the release version of SPMF.
What is this algorithm?
This is an algorithm for mining perfectly sporadic association rules. It first uses AprioriInverse to generate perfectly rare itemsets. Then, it uses these itemsets to generate the association rules.
What is the input?
The input of this algorithm is a transaction database and three thresholds named minsu, maxsup and minconf. A transaction database is a set of transactions. A transaction is a set of items (symbols). For example, the following transactions database contains 5 transactions (t1,t2...t5) and 5 items (1,2,3,4,5). This database is provided in the file "contextInverse.txt" of the SPMF distribution.:
1 | 2 | 3 | 4 | 5 | |
t1 | x | x | x | x | |
t2 | x | x | |||
t3 | x | x | x | x | |
t4 | x | x | |||
t5 | x | x | x | x |
What is the output?
The output is the set of perfectly sporadic association rules respecting the minconf, minsup and maxsup parameters.
To explain what it a perfectly sporadic association rule, we need to review some definitions. An itemset is a set of items. The support of an itemset is the number of transactions that contain the itemset divided by the total number of transactions. For example, the itemset {1, 2} has a support of 60% because it appears in 3 transactions out of 5 (it appears in t1, t2 and t5). A frequent itemset is an itemset that has a support no less than the maxsup parameter.
A perfectly rare itemset (a.k.a. sporadic itemset) is an itemset that is not a frequent itemset and that all its subsets are also not frequent itemsets. Moreover, it has to have a support higher or equal to the minsup threshold.
An association rule X==>Y is a relationship between two itemsets (sets of items) X and Y such that the intersection of X and Y is empty. The support of a rule is the number of transactions that contains X∪Y. The confidence of a rule is the number of transactions that contains X∪Y divided by the number of transactions that contain X.
A perfectly sporadic association rule X==>Y is an association rule such that the confidence is higher or equal to minconf and the support of any non empty subset of X∪Y is lower than maxsup.
For example, let's apply the algorithm with minsup = 0.1 %, maxsup of 60 % and minconf = 60 %.
The first step that the algorithm perform is to apply AprioriInverse algorithm with minsup = 0.1 % and maxsup of 60 %. The result is the following set of perfectly rare itemsets:
Perfectly Rare Itemsets | Support |
{3} | 60 % |
{4} | 40 % |
{5} | 60 % |
{4, 5} | 40 % |
{3, 5} | 20 % |
Then, the second step is to generate all perfectly sporadic association rules respecting minconf by using the perfectly rare itemset. The result is :
Rule | Support | Confidence |
5 ==> 4 | 40 % | 60 % |
4 ==> 5 | 40 % | 100 % |
How to interpret the result? For example, consider the rule 5 ==> 4. It means that if item 5 appears in a transaction, it is likely to be associated with item 4 with a confidence of 60 % (because 5 and 4 appears together in 40% of the transactions where 5 appears). Moreover, this rule has a support of 40 % because it appears in 40% of the transactions of this database.
Where can I get more information about this algorithm?
The AprioriInverse algorithm and how to generate sporadic rules are described in this paper:
Yun Sing Koh, Nathan Rountree: Finding Sporadic Rules Using Apriori-Inverse. PAKDD 2005: 97-106
How to run this example?
What is this algorithm?
It is an algorithm for mining "closed association rules", which are a concise subset of all association rules.
What is the input of this algorithm?
The input is a transaction database (a.k.a. binary context) and thresholds named minsup and minconf.
A transaction database is a set of transactions. Let's consider the following dataset from the article Szathmary et al. 2007. This transaction database contains 5 transactions (t1, t2, ..., t5) and 5 items (1,2,...5). This transaction database is provided in the file "contextZart.txt" of the SPMF distribution. Each transaction is an unordered set of items. For example, the first transaction represents the set of items 1, 2, 4 and 5.
1 | 2 | 3 | 4 | 5 | |
t1 | x | x | x | x | |
t2 | x | x | |||
t3 | x | x | x | x | |
t4 | x | x | x | ||
t5 | x | x | x | x |
What is the output of this algorithm?
Given the minimum support threshold (minsup) and minimum confidence threshold (minconf) set by the user, the algorithm returns the whole set of closed association rules.
A closed association rule is an associatino rule of the form X ==> Y such that the union of X and Y is a closed itemset (see Szathmary, 2006 for details).
For instance, by applying this algorithm with minsup = 60 %, minconf= 60%, we obtains 16 closed associations rules (run the example in the SPMF distribution to see the result).
Where can I get more information about closed association rules?
The following Ph.D. thesis proposed "closed association rules".Szathmary, L. (2006). Symbolic Data Mining Methods with the Coron Platform. Szathmary, L. PhD thesis, University Henri Poincaré — Nancy 1, France.
How to run this example?
What is this algorithm?
This algorithm discover the set of "minimal non redundant association rules" (Kryszkiewicz, 1998), which is a lossless and compact set of association rules.
In this implementation we use the Zart algorithm for discovering closed itemsets and their associated generators. Then, this information is used to generate the "minimal non redundant association rules".
What is the input?
The input is a transaction database (a.k.a. binary context), a threshold named minconf and a threshold named minsup.
A transaction database is a set of transactions. Let's consider the following dataset from the article Szathmary et al. 2007. This transaction database contains 5 transactions (t1, t2, ..., t5) and 5 items (1,2,...5). This transaction database is provided in the file "contextZart.txt" of the SPMF distribution. Each transaction is an unordered set of items. For example, the first transaction represents the set of items 1, 2, 4 and 5.
1 2 3 4 5 t1 x x x x t2 x x t3 x x x x t4 x x x t5 x x x x
What is the output?
This algorithm returns the set of minimal non redundant association rules.
This set is defined as the set of association rules of the form P1 ==> P2 / P1, where P1 is a generator of P2, P2 is a closed itemset, and the rule has a support and confidence respectively no less than minsup and minconf. The formal definition can be found in Kryszkiewicz, 1998.
For example, by applying this algorithm with minsup = 60 %, minconf= 60% on the previous database, we obtains 11 minimal non rendudant associations rules (run the example in the SPMF distribution to see the result):
rule 0: 3 ==> 1 support : 60% (3/5) confidence : 75% rule 1: 3 ==> 2 5 support : 60% (3/5) confidence : 75% rule 2: 1 2 ==> 5 support : 60% (3/5) confidence : 100% rule 3: 1 5 ==> 2 support : 60% (3/5) confidence : 100% rule 4: 2 ==> 1 5 support : 60% (3/5) confidence : 75% rule 5: 2 ==> 5 support : 80% (4/5) confidence : 100% rule 6: 2 ==> 3 5 support : 60% (3/5) confidence : 75% rule 7: 5 ==> 2 support : 80% (4/5) confidence : 100% rule 8: 1 ==> 2 5 support : 60% (3/5) confidence : 75% rule 9: 1 ==> 3 support : 60% (3/5) confidence : 75% rule 10: 2 3 ==> 5 support : 60% (3/5) confidence : 100% rule 11: 3 5 ==> 2 support : 60% (3/5) confidence : 100%
Where can I get more information about closed association rules?
The following article provided detailed information about Minimal Non Redundant Association Rules:M. Kryszkiewicz (1998). Representative Association Rules and Minimum Condition Maximum Consequence Association Rules. Proc. of PKDD '98, Nantes, France, September 23-26.
How to run this example?
What is the INDIRECT algorithm?
Indirect (Tan et al., KDD 2000; Tan, Steinbach & Kumar, 2006, p.469) is an algorithm for discovering indirect associations between items in transactions databases.
Why this algorithm is important? Because traditional association rule mining algorithms focus on direct associations. This algorithm can discover indirect associations. It has various applications such as stock market analysis and competitive product analysis (Tan et al., 2000).
What is the input?
The input of the indirect algorithm is a transaction database and three parameters named minsup, ts and minconf. A transaction database is a set of transactions. A transaction is an unordered set of items (symbols). For example, let's consider the following transaction database, containing 5 transactions (t1, t2, ..., t5) and 5 items (1,2,...5). This transaction database is provided in the file "contextIndirect.txt" of the SPMF distribution. The first transaction represents the set of items 1 and 4. The second transaction represent the set of items 2, 3 and 4.
1 | 2 | 3 | 4 | 5 | |
t1 | x | x | |||
t2 | x | x | x | ||
t3 | x | x | x | x | |
t4 | x | ||||
t5 | x | x | x | x |
The three numeric parameters of the indirect algorithm are:
What is the output?
The result is all indirect associations respecting the three parameters minsup, ts and minconf. An indirect association has the form {x,y} ==> M, where x and y are single items and M is an itemset called the "mediator".
An indirect associations has to respect the following conditions.
- The number of transactions containing all items of {x}∪ M divided by the total number of transaction must be higher or equal to minsup.
- The number of transactions containing all items of {y}∪ M divided by the total number of transaction must be higher or equal to minsup.
- The number of transactions containing {x,y} divided by the total number of transaction must be higher or equal to ts.
- The confidence of {x}with respect to M and {y}with respect M must be higher or equal to minconf. The confidence of an itemset X with respect to another itemset Y is defined as the number of transactions that contains X and Y divided by the number of transactions that contain X.
For example, by applying the indirect algorithm with minsup = 60 %, ts = 50 % and minconf= 10%, we obtain 3 indirect association rules:
- {1, 2 | {4}}, which means that 1 and 2 are indirectly associated by the mediator {4 }.
- {1, 5 | {4}}, which means that 1 and 5 are indirectly associated by the mediator {4 }.
- {2, 5 | {4}}, which means that 1 and 5 are indirectly associated by the mediator {4 }.
To see additional details about each of these three indirect rules, run the algorithm with the SPMF software.
Implementation details
The implementation attemps to be as faitful as possible to the original algorithm, except that the confidence is used instead of the IS measure.
Note that some algorithms claimed to be more efficient than Indirect such as HI-Mine but it is not implemented in SPMF.
Where can I get more information about indirect association rules?
The concept of indirect associations was propsoed by Tan (2000) in this conference paper:
Pang-Ning Tan, Vipin Kumar, Jaideep Srivastava: Indirect Association: Mining Higher Order Dependencies in Data. PKDD 2000: 632-637
Moreover, note that the book "Introduction do data mining" by Tan, Steinbach and Kumar provides an overview of indirect association rules.
How to run this example?
What is FHSAR?
FHSAR is an algorithm for hiding sensitive association rules in a transaction database.
What are the applications? For example, consider a company that want to release a transaction database to the public. But it do not want to disclose some sensitive associations between items that appear in the database and that could give a competitive advantage to their competitor. The FHSAR algorithm can hide these associations without making major changes to the database.
What is the input?
The FHSAR algorithm is designed to hide sensitive association rules in a database so that they will not be found for a given minsup and minconf thresholds. The input are: minsup, minconf, a transaction database and some sensitive association rules.
A transaction database is a set of transactions. Each transaction is an unordered set of items. For example, consider the following transaction database. It contains 6 transactions (t1,t2...t6) and 5 items (1,2,3,4,5). The first transaction represents the set of items 1,2, 4 and 5. This database is provided in the file "contextIGB.txt" of the SPMF distribution.:
1 | 2 | 3 | 4 | 5 | |
t1 | x | x | x | x | |
t2 | x | x | x | ||
t3 | x | x | x | x | |
t4 | x | x | x | x | |
t5 | x | x | x | x | x |
t6 | x | x | x |
An association rule X==>Yis an association between two sets of items X and Y. The support of an association rule X==>Yis the number of transactions that contains both X and Y divided by the total number of transactions. The confidence of an association rule X==>Y is the number of transactions that contains both X and Y divided by the number of transactions that contain X. For example, the rule {1 2} ==> {4 5} has a support of 50 % because it appears in 3 transactions out of 5. Furthermore, it has a confidence of 75 % because {1 2} appears in 4 transactions and {1, 2, 4, 5} appears in 3 transactions.
What is the output?
The output is a new transaction database such that the sensitive rules will not be found if an association rule mining algorithm is applied with minsup and minconf.
For example, we can apply FHSAR with the parameters minsup = 0.5 and minconf = 0.60 to hide the following rules provided in the file "sar.txt":
- 4 ==> 1
- 1 2 ==> 4 5
- 5 ==> 2
The result is a new transaction database where these rules are hidden for the given thresholds minsup and minconf:
1 2 3 4 5 t1 x x t2 x x t3 x x t4 x x x x t5 x x x x x t6 x x x Note that the result of the algorithm is not always the same because I use some HashSet to represent transactions internally and they do not keep the order. Therefore, the items that are removed may not be the same if the algorithm is run twice.
Where can I get more information about the FHSAR algorithm?
This algorithm was proposed in this paper:
C.-C.Weng, S.-T. Chen, H.-C. Lo: A Novel Algorithm for Completely Hiding Sensitive Association Rules. ISDA (3) 2008: 202-208
How to run this example?
What is TopKRules?
TopKRules is an algorithm for discovering the top-k association rules appearing in a transaction database.
Why is it useful to discover top-k association rules? Because other association rule mining algorithms requires to set a minimum support (minsup) parameter that is hard to set (usually users set it by trial and error, which is time consuming). TopKRules solves this problem by letting users directly indicate k, the number of rules to be discovered.
What is the input of TopKRules ?
TopKRules takes three parameters as input:
- a transaction database,
- a parameter k representing the number of association rules to be discovered,
- a parameter minconf representing the minimum confidence that the association rules should have.
A transaction database is a set of transactions, where each transaction is an unordered set of items (symbols). For example, let's consider the following transaction database. It consists of 6 transactions (t1,t2...t6) and 5 items (1,2,3,4,5). For instance, transaction t1 is the set of items {1, 2, 4, 5}. This database is provided in the file "contextIGB.txt" of the SPMF distribution.
1 2 3 4 5 t1 x x x x t2 x x x t3 x x x x t4 x x x x t5 x x x x x t6 x x x
What is the output of TopKRules ?
TopKRules outputs the top-k association rules, which are the k most frequent association rules having a confidence higher or equal to minconf.
An association rule X==>Y is a relationship between two itemsets (sets of items) X and Y such that the intersection of X and Y is empty. The support of an association rule is the number of transactions that contains X∪Y. The confidence of an association rule is the number of transactions that contains X∪Y divided by the number of transactions that contain X.
For example, if we run TopKRules with k = 2 and minconf = 0.8, we obtain the the top-2 rules in the database having a confidene higher or equals to 80 %.
- 2 ==> 5, which have a support of 5 (it appears in 5 sequences) and a confidence of 83%
- 5 ==> 2, which have a support of 5 (it appears in 5 sequences) and a confidence of 100 %
For instance, the rule 2 ==>5 means that if item 2 appears, it is likely to be associated with item 5 with a confidence of 83%. Moreover, this rule has a support of 83 % because it appears in five transactions (S1, S2 and S3) out of the six transactions contained in this database.
It is important to note that for some values of k, the algorithm may return slightly more rules than k. This is can happen if several rules have exactly the same support.
Performance
TopKRules is a very efficient algorithm for mining the top-k association rules.
It provides the benefits that it is very intuitive to use. It should be note that the problem of top-k association rule mining is more computationally expensive than the problem of association rule mining. Using TopKRules is recommended for k values up to 5000, depending on the datasets.
Besides, note that there is a variation of TopKRules named TNR that is available in SPMF. The improvement in TNR is that it eliminates some association rules that are deemed "redundant" (rules that are included in other rules having the same support and confidence - see the TNR example for the formal definition). Using TNR is more costly than using TopKRules but it brings the benefit of eliminating some redundancy in the results.
Where can I get more information about this algorithm?
The TopKRules algorithm was proposed in this paper:
Fournier-Viger, P., Wu, C.-W., Tseng, V. S. (2012). Mining Top-K Association Rules. Proceedings of the 25th Canadian Conf. on Artificial Intelligence (AI 2012), Springer, LNAI 7310, pp. 61-73.
How to run this example?
What is TopKRules?
TNR is an algorithm for discovering the top-k non-redundant association rules appearing in a transaction database. It is an approximate algorithm in the sense that it always generates non-redundant rules. But these rules may not always be the top-k non-redundant association rules. TNR uses a parameter named delta, which is a positive integer that can be used to improve the chance that the result is exact (the higher delta value, the more chances that the result will be exact).
Why is it important to discover top-k non-redundant association rules? Because other association rule mining algorithms requires that the user set a minimum support (minsup) parameter that is hard to set (usually users set it by trial and error, which is time consuming). Moreover, the result of association rule mining algorithms usually contains a high level of redundancy (for example, thousands of rules can be found that are variation of other rules having the same support and confidence). The TNR algorithm provide solution to both of these problems by letting users directly indicate k, the number of rules to be discovered, and by eliminating redundancy in results.
What is the input of TNR ?
TNR takes four parameters as input:
- a transaction database,
- a parameter k representing the number of rules to be discovered,
- a parameter minconf representing the minimum confidence that association rules should have.
- a parameter delta (an integer) that is used to increase the chances of having an exact result (because this algorihm is approximate).
A transaction database is a set of transactions, where each transaction is an unordered set of items (symbols). For example, let's consider the following transaction database. It consists of 6 transactions (t1,t2...t6) and 5 items (1,2,3,4,5). For instance, transaction t1 is the set of items {1, 2, 4, 5}. This database is provided in the file "contextIGB.txt" of the SPMF distribution.
1 2 3 4 5 t1 x x x x t2 x x x t3 x x x x t4 x x x x t5 x x x x x t6 x x x
What is the output of TNS ?
TNR outputs an approximation of the k most frequent non redundant association rules having a confidence higher or equal to minconf.
An association rule X==>Y is a relationship between two itemsets (sets of items) X and Y such that the intersection of X and Y is empty. The support of an association rule is the number of transactions that contains X∪Y. The confidence of an association rule is the number of transactions that contains X∪Y divided by the number of transactions that contain X.
An association rule ra: X → Y is redundant with respect to another rule rb : X1 → Y1 if and only if:
- conf(ra) = conf(rb)
- sup(ra) = sup(rb)
- X1 ⊆ X ∧ Y ⊆ Y1.
For example, If we run TNR with k = 10 and minconf = 0.5 and delta = 2, the following set of rules is found
4, ==> 2, sup= 4 conf= 1.0
2, ==> 1,5, sup= 4 conf= 0.6666666666666666
2, ==> 5, sup= 5 conf= 0.8333333333333334
5, ==> 2, sup= 5 conf= 1.0
5, ==> 1,2, sup= 4 conf= 0.8
1, ==> 2,5, sup= 4 conf= 1.0
2, ==> 3, sup= 4 conf= 0.6666666666666666
2, ==> 4, sup= 4 conf= 0.6666666666666666
3, ==> 2, sup= 4 conf= 1.0
1,4, ==> 2,5, sup= 3 conf= 1.0For instance, the association rule 2 ==> 1 5 means that if items 2 appears, it is likely to be associated with item 1 and item 5 with a confidence of 66 %. Moreover, this rule has a support 66 % (sup = 4) because it appears in three transaction (S1, S2 and S3) out of six transactions contained in this database.
Note that for some values of k and some datasets, TNR may return more than k association rules. This can happen if several rules have exactly the same support, and it is normal. It is also possible that the algorithm returns slightly less than k association rules in some circonstances because the algorithm is approximate.
Performance
TNR is an efficient algorihtm. It is based on the TopKRules algorithm for discovering top-k association rule. The main difference between TNR and TopKRules is that TNR includes additional strategies to eliminate redundancy in results, and that TNR is an approximate algorithm, while TopKRules is not.
TNR and TopKRules are more intuitive to use than regular association rule mining algorithms. However, it should be note that the problem of top-k association rule mining is more computationally expensive than the problem of association rule mining. Therefore, it is recommended to use TNR or TopKRules for k values of up to 5000 depending on the dataset. If more rules should be found, it could be better to find association rule swith FPGrowth, for more efficiency.
Where can I get more information about this algorithm?
The TNR algorithm was proposed in this paper:
Fournier-Viger, P., Tseng, V.S. (2012). Mining Top-K Non-Redundant Association Rules. Proc. 20th International Symposium on Methodologies for Intelligent Systems (ISMIS 2012), Springer, LNCS 7661, pp. 31- 40.
How to run this example?
What is K-Means?
K-Means is one of the most famous clustering algorithms.
What is the input?
K-Means takes as input a set of vectors containing one or more double values and a parameter K.
An example input is provided : "configKmeans.txt". It contains 10 lines, each representing a vector of data containing four doubles values separated by a space.
For example, the first line : 1 2 3 4 represents a vector with four values, namely 1,2,3 and 4.
What is the output?
K-Means groups vectors in clusters according to their similarity. In SPMF, the similarity is defined as the euclidian distance.
K-Means returns K clusters or less.
Note that running K-Means with the same data does not always generate the same result because K-Means initializes clusters randomly.
By running K-Means on the input file and K=3, we can obtain the following clusters:
cluster 1: [1.0,2.0,3.0,4.0][1.0,2.0,3.0,3.0][2.0,4.0,5.0,5.0][4.0,4.0,3.0,3.0][2.0,2.0,5.0,5.0]
cluster 2: [7.0,6.0,8.0,9.0][1.0,6.0,8.0,8.0][4.0,7.0,8.0,7.0][5.0,6.0,8.0,9.0]
cluster 3: [7.0,5.0,5.0,5.0]
Where can I get more information about K-Means?
K-Means was proposed by MacQueen in 1967. K-means is one of the most famous data mining algorithm. It is described in almost all data mining books that focus on algorithms, and on several websites. By searching on the web, you will find plenty of ressources about K-Means.
How to run this example?
What is this algorithm?
We have implemented a hierarchical clustering algorithm that is based on the description of Hierarchical Clustering Algorithms from
http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/hierarchical.html. This is a Hierarchical Clustering with a constant "threshold" that indicate
the maximal distance between two clusters to group them together and create a bigger cluster. The distance between two clusters is calculated as the distance between the means of each cluster. Initially, a cluster is created for each values. Then clusters are combined iteratively until no cluster can be merged into bigger clusters.
What is the input?
The input is a set of vectors containing double values. For example, the file "configKmeans.txt" is a sample input file containing 10 vectors (10 lines). Each vector (line) is a list of double values separated by spaces. Each vector should have the same size. For example, the first line of the input file is "1 2 3 4". It represents a vector with four values, namely 1,2,3 and 4.
Furthermore, the user should also provide a parameter called maxDistance to the algorithm. This parameter indicate the maximal distance allowed between the mean of two clusters so that they could be merged into a single cluster.
What is the output?
The output is a set of cluster (a set of vectors). By running the algorithm with "configKmeans.txt" and maxDistance = 4, we obtain the following clusters:
cluster 1: [2.0,2.6,3.8,4.0][1.0,2.0,3.0,3.0][2.0,3.0,5.0,5.0][2.0,2.0,5.0,5.0][4.0,4.0,3.0,3.0]
cluster 2: [1.0,6.0,8.0,8.0]
cluster 3: [5.0,6.333333333333333,8.0,8.333333333333334][6.0,6.0,8.0,9.0][5.0,6.0,8.0,9.0]
cluster 4: [7.0,5.0,5.0,5.0]
Where can I get more information about Hierarchical clustering?
There is a good introduction here:
http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/hierarchical.html.Moreover, you could also read the free chapter on clustering from the book "introduction to data mining" by Tan, Steinbach and Kumar on the book website.
How to run this example?
What is PrefixSpan?
PrefixSpan is an algorithm for discovering sequential patterns in sequence databases, proposed by Pei et al. (2001).
What is the input of PrefixSpan?
The input of PrefixSpan is a sequence database and a user-specified threshold named minsup.
A sequence database is a set of sequences where each sequence is a list of itemsets. An itemset is an unordered set of items. For example, the table shown below contains four sequences. The first sequence, named S1, contains 5 itemsets. It means that item 1 was followed by items 1 2 and 3 at the same time, which were followed by 1 and 3, followed by 4, and followed by 3 and 6. It is assumed that items in an itemset are sorted in lexicographical order. This database is provided in the file "contextPrefixSpan.txt" of the SPMF distribution.
ID | Sequences |
S1 | (1), (1 2 3), (1 3), (4), (3 6) |
S2 | (1 4), (3), (2 3), (1 5) |
S3 | (5 6), (1 2), (4 6), (3), (2) |
S4 | (5), (7), (1 6), (3), (2), (3) |
What is the output of PrefixSpan?
PrefixSpan discovers all sequential patterns that occurs in more than minsup sequences of the database.
For example, if we run PrefixSpan with minsup= 50 %, 53 sequential patterns will be found. The list is too long to be presented here. An example of pattern found is "(1,2),(6)" which appears in the first and third sequence (it has therefore a support of 50%). Another pattern is "(4), (3), (2)". It appears in the second and third sequence (it has thus a support of 50 %).
Performance
PrefixSpan is one of the fastest sequential pattern mining algorithm.The SPAM implementation in SPMF can be faster than PrefixSpan in some cases (see the "performance" section of the website for a performance comparison).
Implementation details
I have included four versions of PrefixSpan in the source code version of SPMF. The first one keeps the frequent itemsets into memory and print the results to the console (MainTestPrefixSpan.java). The second one is a version that save the result directly to a file (MainTestPrefixSpan_saveToFile.java). It is more efficient. The third takes as input a dataset with strings instead of integers and keep the result into memory and output it to the console (MainTestPrefixSpanWithStrings.java) The fourth one takes as input a dataset with strings instead of integers from a file and save the result to a file (MainTestPrefixSpanWithStrings_saveToFile.java). For the graphical user interface version of SPMF, it is possible to use the version of PrefixSpan that uses Strings instead of integer by selecting "PrefixSpan with strings" and to test it with the input file contextPrefixSpanStrings.txt.
Where can I get more information about PrefixSpan?
The PrefixSpan algorithm is described in this article:
J. Pei, J. Han, B. Mortazavi-Asl, J. Wang, H. Pinto, Q. Chen, U. Dayal, M. Hsu: Mining Sequential Patterns by Pattern-Growth: The PrefixSpan Approach. IEEE Trans. Knowl. Data Eng. 16(11): 1424-1440 (2004)
How to run this example?
What is SPAM?
SPAM is an algorithm for discovering sequential patterns in sequence databases, proposed by Ayres (2002).
What is the input of SPAM?
The input of SPAM is a sequence database and a user-specified threshold named minsup.
A sequence database is a set of sequences where each sequence is a list of itemsets. An itemset is an unordered set of items. For example, the table shown below contains four sequences. The first sequence, named S1, contains 5 itemsets. It means that item 1 was followed by items 1 2 and 3 at the same time, which were followed by 1 and 3, followed by 4, and followed by 3 and 6. It is assumed that items in an itemset are sorted in lexicographical order. This database is provided in the file "contextPrefixSpan.txt" of the SPMF distribution.
ID | Sequences |
S1 | (1), (1 2 3), (1 3), (4), (3 6) |
S2 | (1 4), (3), (2 3), (1 5) |
S3 | (5 6), (1 2), (4 6), (3), (2) |
S4 | (5), (7), (1 6), (3), (2), (3) |
What is the output of SPAM?
SPAM discovers all sequential patterns that occurs in more than minsup sequences of the database.
For example, , if we run SPAM with minsup= 50 %, 53 sequential patterns will be found. The list is too long to be presented here. An example of pattern found is "(1,2),(6)" which appears in the first and third sequence (it has therefore a support of 50%). Another pattern is "(4), (3), (2)". It appears in the second and third sequence (it has thus a support of 50 %).
Performance
SPAM is one of the fastest sequential pattern mining algorithm.The SPAM implementation in SPMF is reported to be faster than PrefixSpan (see the "performance" section of the website for a performance comparison).
Implementation details
I have implemented SPAM with all the main optimizations presented in the paper.
Where can I get more information about SPAM?
The SPAM algorithm was proposed in this paper:
J. Ayres, J. Gehrke, T.Yiu, and J. Flannick. Sequential PAttern Mining Using Bitmaps. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Edmonton, Alberta, Canada, July 2002.
How to run this example?
What is BIDE+?
BIDE+ is an algorithm for discovering closed sequential patterns in sequence databases, proposed by Wang et al.(2007).
What is the input of BIDE+?
The input of BIDE+ is a sequence database and a user-specified threshold named minsup.
A sequence database is a set of sequences where each sequence is a list of itemsets. An itemset is an unordered set of items. For example, the table shown below contains four sequences. The first sequence, named S1, contains 5 itemsets. It means that item 1 was followed by items 1 2 and 3 at the same time, which were followed by 1 and 3, followed by 4, and followed by 3 and 6. It is assumed that items in an itemset are sorted in lexicographical order. This database is provided in the file "contextPrefixSpan.txt" of the SPMF distribution.
ID | Sequences |
S1 | (1), (1 2 3), (1 3), (4), (3 6) |
S2 | (1 4), (3), (2 3), (1 5) |
S3 | (5 6), (1 2), (4 6), (3), (2) |
S4 | (5), (7), (1 6), (3), (2), (3) |
What is the output of BIDE+?
BIDE+ discovers all closed sequential patterns that occurs in at least minsup sequences of the database.
A sequential pattern is a subsequence of a sequence. The support of a sequential pattern is the number of sequences from the database that contains it. A closed sequential pattern is a pattern that is not included in another sequential pattern having the same support.
Why using BIDE+? It can be shown that the set of closed sequential patterns is generally much smaller than the set of sequential patterns and that no information small. Moreover, finding closed sequential patterns is often much more efficient than discovering all patterns.
For example, , if we run BIDE+ with minsup= 50 % on the sequence database, the following patterns are found.
ID Closed Sequential Pattern Support S1 (6) 75 % S2 (5) 75 % S3 (2), (3) 75 % S4 (1), (2) 100 % S5 (1), (3) 100 % S6 (1 2), (6) 50 % S7 (4), (3) 75 % S8 (1) (2), (3) 50 % S9 (1), (2 3), (1) 50 % S10 (1), (3), (2) 75 % S11 (1), (3), (3) 75 % S12 (1 2), (4), (3) 50 % S13 (6), (2), (3) 50 % S14 (5), (2), (3) 50 % S15 (4), (3), (2) 50 % S16 (5), (6), (3), (2) 50 % S17 (5), (1), (3), (2) 50 % For instance, the sequential pattern "(1,2),(6)" appears in the first and third sequence (it has therefore a support of 50%). Another pattern is "(4), (3)". It appears in the second and third sequence (it has thus a support of 75 %).
Performance
BIDE+ is a very efficient algorithm for closed sequential pattern mining. This implementations includes all the optimizations.
Implementation details
I have included three versions of BIDE+ in the SPMF distribution. The first one keeps the frequent itemsets into memory and print the results to the console (MainTestBIDEPlus.java). The second one is a version that save the result directly to a file (MainTestBIDEPlus_saveToFile.java). The second version is faster.
The third version of BIDE+ accepts strings instead of integers. It is available under the name "BIDE+ with strings" in the GUI version of SPMF or in the package ca.pfv.spmf.sequential_rules.bide_with_strings for the source code version of SPMF. To run it, you should use the input file: contextPrefixSpanStrings.txt.
Where can I get more information about this algorithm?
The BIDE algorithm is described in this paper:
J. Wang, J. Han: BIDE: Efficient Mining of Frequent Closed Sequences. ICDE 2004: 79-90
How to run this example?
What is SeqDIM?
SeqDIM is an algorithm for discovering multi-dimensional sequential patterns in a multi-dimensional sequence database.
Why multi-dimensional sequential pattern is interesting? The reason is that regular sequential pattern mining algorithms do not consider the context of sequences. For example, the context of a sequence of transactions at a supermarket could be the profile of the customer such as his age, the city where he lives, etc.
In multi-dimensional sequential pattern mining, a sequence database is annotated with dimension to indicate the context of the sequence. The dimensions are symbolic values.
Multi-dimensional sequential pattern mining algorithm discovers sequential patterns with context information. This can be very useful for real applications such as market basket analysis. For example, one could find patterns specific to customers who are teenagers and lives in a particular city, or items boughts by adults living in another city.
What is the input?
The input is a multi-dimensional sequence database (as defined by Pinto et al. 2001) and a threshold named minsup.
A multi-dimensional database is a set of multi-dimensional sequences and a set of dimensions d1, d2... dn. A multi-dimensional sequence (MD-Sequence) is composed of an MD-pattern and a sequence. A sequence is an ordered list of itemsets (groups of items). An MD-pattern is a set of symbolic values for the dimensions (here represented by integer numbers).
For example, consider the following database, provided in the file "ContextMDSequenceNoTime.txt" of the SPMF distribution. The database contains 4 MD-sequences.
MD-Sequences |
||||
ID | MD-Patterns | Sequences | ||
d1 | d2 | d3 | ||
S1 | 1 | 1 | 1 | (2 4), (3), (2), (1) |
S2 | 1 | 2 | 2 | (2 6), (3 5), (6 7) |
S3 | 1 | 2 | 1 | (1 8), (1), (2), (6) |
S4 | * | 3 | 3 | (2 5), (3 5) |
For instance, the first MD-Sequence represents that items 2 and 4 appeared together, then were followed by 3, which was followed by item 2, wich was followed by item 1. The context of this sequence is the value 1 for dimension d1, the value 1 for dimension d2 and the value 1 for dimension d3. Note that the value "*" in the fourth MD-sequence means "any values".
What is the output?
The output is the set of all multi-dimensional sequential patterns that appears in at least minsup sequences of the database. Here, we will not provide a formal definition but rather show an example. For a formal definition of what is a multi-dimensional sequential pattern you can refer to the paper by Pinto et al. (2001).
Let's look at the example. If we apply SeqDIM on the previous database with a minsup of 50%, we obtain the following result:
Frequent MD-Sequences |
|||||
ID | MD-Patterns | Sequences | Support | ||
d1 | d2 | d3 | |||
P1 | * | * | * | (3) | 75 % |
P2 | * | * | * | (2) | 100 % |
P3 | 1 | * | * | (2) | 75 % |
P4 | * | * | * | (2), (3) | 75 % |
For instance, the third pattern (P3) represent the sequence containing the item 2 only with the value 1 for dimension d1. Note that the "*" for dimension d2 and d3 means "any values". This pattern P3 has a support of 75% because it appears in 75 % of the sequences of the original database (it appears in S1, S2 and S3).
Implementation details
The SeqDIM algorithm is a meta-algorithm. It requires a sequential pattern mining algorithm for discovering sequential patterns and an itemset mining algorithm to deal with the dimensions. In our implementations, we have used the PrefixSpan algorithm (Pei et al., 2004) for sequential pattern mining and the Apriori algorithm (Agrawal & Srikant, 1994) for dealing with the dimensions. PrefixSpan is a very fast algorithm. However, Apriori is not the most efficient algorithm. It could be replaced by FPGrowth in future version for more efficiency.
Note also that SeqDIM can generate a lot of patterns. A solution to this problem is to use the algorithm by Songram et al., also offered in SPMF. This latter algorithm eliminates a lot of redundancy in patterns without any loss of information.
Also note that the implementation of SeqDIM in SPMF needs a little bit refactoring as it is currently integrated with the Fournier-Viger (2008) algorithm in the code. In future versions, I will separate both algorithms.
Where can I get more information about this algorithm?
The algorithm is described in this paper:
H. Pinto, J. Han, J Pei, K. Wang, Q. Chen, U. Dayal: Multi-Dimensional Sequential Pattern Mining. CIKM 2001: 81-88
How to run this example?
What is the Songram et al. algorithm?
The Songram et al. algorithm is an algorithm for discovering closed multi-dimensional sequential patterns in a multi-dimensional sequence database.
Why multi-dimensional sequential pattern is interesting? The reason is that regular sequential pattern mining algorithms do not consider the context of sequences. For example, the context of a sequence of transactions from a supermarket could be the profile of the customer such as his age, the city where he lives, etc.
In multi-dimensional sequential pattern mining, a sequence database is annotated with dimension to indicate the context of the sequence. The dimensions are symbolic values.
Multi-dimensional sequential pattern mining algorithm discovers sequential patterns with context information. This can be very useful for real applications such as market basket analysis. For example, one could find patterns specific to customers who are teenagers and lives in a particular city, or items boughts by adults living in another city.
However, there is a problem with classical multi-dimensional sequential pattern mining algorithm. It is that there can be a lot of redundancy in the results. For this reason, Songram et al. proposed to discover closed multi-dimensional sequential patterns. This allows to find a subset of all patterns that eliminates a great deal of redundancy without any information loss. It is more efficient in time and memory than previous algorithms..
What is the input?
The input is a multi-dimensional sequence database (as defined by Pinto et al. 2001) and a threshold named minsup.
A multi-dimensional database is a set of multi-dimensional sequences and a set of dimensions d1, d2... dn. A multi-dimensional sequence (MD-Sequence) is composed of an MD-pattern and a sequence. A sequence is an ordered list of itemsets (groups of items). An MD-pattern is a set of symbolic values for the dimensions (here represented by integer numbers).
For example, consider the following database, provided in the file "ContextMDSequenceNoTime.txt" of the SPMF distribution. The database contains 4 MD-sequences.
MD-Sequences |
||||
ID | MD-Patterns | Sequences | ||
d1 | d2 | d3 | ||
S1 | 1 | 1 | 1 | (2 4), (3), (2), (1) |
S2 | 1 | 2 | 2 | (2 6), (3 5), (6 7) |
S3 | 1 | 2 | 1 | (1 8), (1), (2), (6) |
S4 | * | 3 | 3 | (2 5), (3 5) |
For instance, the first MD-Sequence represents that items 2 and 4 appeared together, then were followed by 3, which was followed by item 2, wich was followed by item 1. The context of this sequence is the value 1 for dimension d1, the value 1 for dimension d2 and the value 1 for dimension d3. Note that the value "*" in the fourth MD-sequence means "any values".
What is the output?
The output is the set of all closed multi-dimensional sequential patterns that appears in at least minsup sequences of the database. The difference with the SeqDIM algorithm is that this algorithm discovers "closed" multi-dimensional patterns. A "closed" multi-dimensional pattern is a multi-dimensional pattern such that no other pattern is include in it and appears in exactly the same set of sequences (see the Songram et al. paper for more details).
Let's look at an example. If we apply the Songram et al. algorithm on the previous database with a minsup of 50%, we obtain the following result:
Frequent Closed MD-Sequences |
|||||
ID | MD-Patterns | Sequences | Support | ||
d1 | d2 | d3 | |||
P1 | * | * | * | (2) | 100 % |
P2 | 1 | * | 1 | (1) | 50 % |
P3 | 1 | 2 | * | (2), (6) | 50 % |
P4 | * | * | * | (2), (3 5) | 50 % |
P5 | * | * | * | (2), (3) | 75 % |
P6 | 1 | * | * | (2), (3) | 50 % |
P7 | 1 | * | * | (2) | 75 % |
For instance, the second patterns found (P2) represents that items 2 is followed by item 6. The context of this pattern is the value 1 for dimension d1, 2 for dimension d2 and any value for dimension d3. Note that the value "*" means "any values". This pattern is said to have a support of 50% because it appears in 50 % of the MD-Sequences from the original database (it appears in S2 and S3).
Implementation details
The Songram et al. algorithm is a meta-algorithm. It requires a closed sequential pattern mining algorithm for discovering sequential patterns and a closed itemset mining algorithm to deal with the dimensions. Our implementation uses the SeqDIM/Songram algorithm (Pinto et al. 2001, Songram et al. 2006) in combination with BIDE+ (Wang et al. 2007) and AprioriClose(Pasquier et al., 1999) or Charm (Zaki, 2002).
Note that the implementation of Songram et al. algorithm in SPMF needs a little bit refactoring as it is currently integrated with the Fournier-Viger (2008) algorithm in the code. In future versions, I will separate both algorithms. This is not really a problem for performance but it would make it easier to reuse the algorithms if both were separated.
Where can I get more information about this algorithm?
The algorithm is described in this paper:
P. Songram, V. Boonjing, S. Intakosum: Closed Multi-dimensional Sequential-Pattern Minin. Proc. of ITNG 2006.
The idea of multi-dimensional pattern mining is based on this paper:
H. Pinto, J. Han, J Pei, K. Wang, Q. Chen, U. Dayal: Multi-Dimensional Sequential Pattern Mining. CIKM 2001: 81-88
The idea of mining closed sequential pattern is based on this paper:
J. Wang, J. Han: BIDE: Efficient Mining of Frequent Closed Sequences. ICDE 2004: 79-90
How to run this example?
To run this example with the source code version of SPMF, launch the file "MainTestSequentialPatternsMining1.java" in the package ca.pfv.SPMF.tests.
This example is not available in the release version of SPMF.
What is this algorithm?
The Hirate-Yamana, 2006 algorithm is an algorithm for discovering sequential patterns with time-constraints to filter uninteresting patterns.
The idea of using time constraints is interesting because it can greatly reduce the number of patterns found and it is also faster and use less memory than if all patterns are discovered.
Note that in the implementation, we have implemented the features of the Hirate-Yamana 2006 algorithm in the Fournier-Viger et al. (2008) algorithm that combines features from several other sequential pattern mining algorithms.
What is the input?
The input is a time-extended sequence database (as defined by Hirate-Yamana, 2006)and some constraints.
A time-extended sequence database is a set of time-extended sequences. Atime-extended sequencesis a list of itemsets (groups of items). Each itemset is anotated with a timestamp that is an integer value.
For example, consider the following time-extended sequence database provided in the file contextSequencesTimeExtended.txt of the SPMF distribution. The database contains 4 time-extended sequences. Each sequence contains itemsets that are annotated with a timestamp. For example, consider the sequence S1. This sequence indicates that itemset {1} appeared at time 0. It was followed by the itemset {1, 2, 3} at time 1. This latter itemset was followed by the itemset {1 2} at time 2.
ID | Sequences |
S1 | (0, 1), (1, 1 2 3}), (2, 1 3) |
S2 | (0, 1 ) (1, 1 2 ), (2, 1 2 3), (3, 1 2 3 ) |
S3 | (0, 1 2), (1, 1 2 ) |
S4 | (0, 2), (1, 1 2 3 ) |
The algorithms discovers time-extended sequential patterns that are common to several sequences. To do that, the user needs to provide five constraints (see the paper by Hirate & Yamana, 2006 for full details):
- minimum support (minsup): the minimum number of sequences that should contain a sequential patterns
- minimum time interval allowed between two succesive itemsets of a sequential pattern (min_time_interval)
- maximum time interval allowed between two succesive itemsets of a sequential pattern (min_time_interval)
- minimum time interval allowed between the first itemset and the last itemset of a sequential pattern (min_whole_interval)
- maximum time interval allowed between the first itemset and the last itemset of a sequential pattern (max_whole_interval
What is the output?
The output is a set of time-extended sequential patterns meeting the constraints given by the user. For example, if we run the algorithm with minsup= 55 %, min_time_interval = 0, max_time_interval = 2, min_whole_interval = 0, max_whole_interval = 2, we obtain the following results:
ID | Sequential Patterns | Support |
S1 | (0, 3) | 75 % |
S2 | (0, 2 3) | 75 % |
S3 | (0, 2) | 100 % |
S4 | (0, 1 2 3) | 75% |
S5 | (0, 1 2) | 100 % |
S6 | (0, 1 3) | 75 % |
S7 | (0, 1) | 100 % |
S8 | (0, 2), (1, 3) | 75 % |
S9 | (0, 2), (1, 1 2) | 75 % |
S10 | (0, 2), (1, 1 3) | 75 % |
S11 | (0, 2), (1, 1) | 100 % |
S12 | (0, 2), (1, 2) | 75 % |
S13 | (0, 1), (1, 1 2) | 75 % |
S14 | (0, 1), (1, 1) | 75 % |
S15 | (0, 1), (1, 2) | 75 % |
S16 | (0, 1 2), (1, 1) | 75 % |
For instance, the last pattern S16 indicates that the items 1 and 2 were followed by item 1 one time unit after. This pattern has a support of 75 % because it appears in S1, S2 and S3. It is important to note that the timestamps in the sequential patterns found are relative. For example, the pattern S16 is considered to appear in S1, S2 and S3 because {1} appears one time unit after {1, 2} in all of these sequences, even though the timestamps do not need to be the same in all of these sequences.
Implementation details
In this implementation, we have followed the Hirate & Yamana (2006) algorithm closely. The only difference is that we did not keep the idea of interval itemization function for discretization. But we have keep the core idea which is to use time constraints.
Note that the Hirate & Yamana algorithm is an extension of the PrefixSpan algorithm. We have implemented based on our implementation of PrefixSpan.
Where can I get more information about this algorithm?
The Hirate & Yamana algorithm is described in this paper
Yu Hirate, Hayato Yamana (2006) Generalized Sequential Pattern Mining with Item Intervals. JCP 1(3): 51-60.
The implementation as part of the Fournier-Viger (2008) algorithm is described in this paper:
Fournier-Viger, P., Nkambou, R & Mephu Nguifo, E. (2008), A Knowledge Discovery Framework for Learning Task Models from User Interactions in Intelligent Tutoring Systems. Proceedings of the 7th Mexican International Conference on Artificial Intelligence (MICAI 2008). LNAI 5317, Springer, pp. 765-778.
How to run this example?
To run this example with the source code version of SPMF, launch the file "MainTestSequentialPatternsMining2.java" in the package ca.pfv.SPMF.tests.
This example is not available in the release version of SPMF.
What is this algorithm?
The Fournier-Viger et al., 2008 algorithm is a sequential pattern mining algorithm combining features from several other sequential pattern mining algorithms. It also offers some original features. In this example, we show how it can be used to discover closed sequential patterns with time-constraints.
Closed sequential patterns is a compact representation of all sequential patterns. Mining closed sequential patterns is important because it can greatly reduce the number of patterns found without loss of information. Using time-constraint is important because it allows to filter unininteresting patterns according to time-related constraints.
What is the input?
The input is a time-extended sequence database (as defined by Hirate-Yamana, 2006)and some constraints.
A time-extended sequence database is a set of time-extended sequences. A time-extended sequences is a list of itemsets (groups of items). Each itemset is anotated with a timestamp that is an integer value.
For example, consider the following time-extended sequence database provided in the file contextSequencesTimeExtended.txt of the SPMF distribution. The database contains 4 time-extended sequences. Each sequence contains itemsets that are annotated with a timestamp. For example, consider the sequence S1. This sequence indicates that itemset {1} appeared at time 0. It was followed by the itemset {1, 2, 3} at time 1. This latter itemset was followed by the itemset {1 2} at time 2.
ID | Sequences |
S1 | (0, 1), (1, 1 2 3}), (2, 1 3) |
S2 | (0, 1 ) (1, 1 2 ), (2, 1 2 3), (3, 1 2 3 ) |
S3 | (0, 1 2), (1, 1 2 ) |
S4 | (0, 2), (1, 1 2 3 ) |
The algorithms discovers closed time-extended sequential patterns that are common to several sequences. To do that, the user needs to provide five constraints (see the paper by Hirate & Yamana, 2006 for full details):
- minimum support (minsup): the minimum number of sequences that should contain a sequential patterns
- minimum time interval allowed between two succesive itemsets of a sequential pattern (min_time_interval)
- maximum time interval allowed between two succesive itemsets of a sequential pattern (min_time_interval)
- minimum time interval allowed between the first itemset and the last itemset of a sequential pattern (min_whole_interval)
- maximum time interval allowed between the first itemset and the last itemset of a sequential pattern (max_whole_interval)
Note : It is recommended to set max_whole_interval to a very large value because if it is used with closed sequential pattern mining, the algorithm become approximate and may not return all closed itemsets respecting the time constraint (the reason is that this constraint is not compatible with the "backScan pruning" of the BIDE+ algorithm).
What is the output?
The output is a set of closed time-extended sequential patterns meeting the constraints given by the user. For example, if we run the algorithm with minsup= 55 %, min_time_interval = 0, max_time_interval = 2, min_whole_interval = 0, max_whole_interval = 100, we obtain the following results:
ID Sequential Patterns Support S1 (0, 1 2 3) 75% S2 (0, 1 2) 100 % S3 (0, 2), (1, 1 2) 75 % S4 (0, 2), (1, 1 3) 75 % S5 (0, 2), (1, 1) 100 % S6 (0, 1), (1, 1 2) 75 % S7 (0, 1 2), (1, 1) 75 % For instance, the last pattern S7 indicates that the items 1 and 2 were followed by item 1 one time unit after. This pattern has a support of 75 % because it appears in S1, S2 and S3. It is important to note that the timestamps in the sequential patterns found are relative. For example, the pattern S16 is considered to appear in S1, S2 and S3 because {1} appears one time unit after {1, 2} in all of these sequences, even though the timestamps do not need to be the same in all of these sequences.
If you compare the results of this example with the previous example, you can observe that the number of closed time-extended sequential patterns (6) is much smaller than the number of time-extended sequential patterns (15) found in the previous example.
Implementation details
To propose an algorithm to discover closed sequential patterns with time constraints, we have combined ideas from the BIDE+ algorithm and the Hirate & Yamana algorithm. Both of these algorithms are based on the PrefixSpan algorithm.
Where can I get more information about this algorithm?
The Fournier-Viger (2008) algorithm is described in this paper:
Fournier-Viger, P., Nkambou, R & Mephu Nguifo, E. (2008), A Knowledge Discovery Framework for Learning Task Models from User Interactions in Intelligent Tutoring Systems. Proceedings of the 7th Mexican International Conference on Artificial Intelligence (MICAI 2008). LNAI 5317, Springer, pp. 765-778.
The idea of using time-constraints is based on this paper
Yu Hirate, Hayato Yamana (2006) Generalized Sequential Pattern Mining with Item Intervals. JCP 1(3): 51-60.
The idea of mining closed sequential pattern is based on this paper:
J. Wang, J. Han: BIDE: Efficient Mining of Frequent Closed Sequences. ICDE 2004: 79-90
How to run this example?
To run this example with the source code version of SPMF, launch the file "MainTestSequentialPatternsMining3.java" in the package ca.pfv.SPMF.tests.
This example is not available in the release version of SPMF.
What is this algorithm?
The Fournier-Viger et al., 2008 algorithm is a sequential pattern mining algorithm combining features from several other sequential pattern mining algorithms. It also offers some original features. In this example, we show how it can be used to discover sequential patterns with time-constraints from a sequence database where items are annotated with integer values.
To create an algorithm that can do that, we have extended the Hirate & Yamana (2006) algorithm to accept items with integer values.
What is the input?
The input is a time-extended sequence database where items are annotated with integer values (Fournier-Viger et al., 2008) and some constraints.
Such a database is defined as a set of sequences. A sequence is here a list of itemsets (groups of items), where each item is a symbol represented by an integer and each item is annotated with an integer representing a value associated with the item. Furthermore, an itemset has a timestamp indicating the time at which it occured.
For example, consider the following database. These integer values that are annotation to items are indicated with a bold font and blue color. Consider the sequence S1 of this database. At time 0, the item 1 occured with the value 2, and item 2 is not annotated with a value. At time 1, item 3 appeared. At time 2, item 6 occured. At time 3, item 5 occured with the value 1.
ID | Sequences |
S1 | (0, 1(2) 2), (1, 3), (2, 6), (3, 5(1)) |
S2 | (0, 1(2) 2), (1, 4(8)), (2, 3), (3, 5(2) 6 7) |
S3 | (0, 1(3) 2), (1, 4(7)), (2, 3), (3, 5(4)) |
S4 | (0, 1(3) 2), (1, 4(6)), (2, 5(5)), (3, 6 7) |
What is the output?
To mine sequential patterns from a time-extended sequence database where items can be annotated with integer values, we have added an original mechanism in the Fournier-Viger et al. algorithm. Because it is a little bit complicated to explain, please refer to Fournier-Viger et al., 2008 for a detailed description.
As PrefixSpan, the algorithm grows patterns one item at a time by recursively projecting a database by a frequent item. However, we have modified this step so that when the support of an item that is annotated is higher or equals to 2 * minsup, the database projection operation calls the K-Means algorithm [15] to try to separate these values in two or more clusters. Thereafter, the database will be projected separately for each group of values. Thus, the different groups will be considered as different items.
This is best illustrated with an example. If we mine patterns from the first table with minsup = 50%, we can get 56 sequential patterns. Note however that the number of patterns found can vary because K-Means is a randomized algorithm. For this example, six patterns found are presented in the next table:
ID | Sequential Patterns | Support |
P1 | (0, 3) | 75% |
P2 | (0, 5 (average: 3 min:1 max:5)) | 100 % |
P3 | (0, 6) | 75 % |
P4 | (0, 3), (1, 6) | 50 % |
P5 | (0, 1 (average: 3 min: 3 max: 3) 2), (1 4 (average: 6.5 min: 6 max: 7)) | 50 % |
P6 | (0, 1 ((average: 2 min: 2 max: 2) 2), (3, 5 (average: 3.5 min: 1 max: 2)) | 50 % |
... | ... | ... |
When the algorithm was executed, as some point, it considered projecting the database with item "1" because this item is frequent. Item "1" is annotated with values 2, 2, 3 and 3 in sequences S1, S2, S3 and S4, respectively. The algorithm applied the K-Means to find clusters. Two clusters were found. They are {2, 2}and {3, 3}. We can see this in sequences P5 and P6. In sequence P5, item "1" represents the cluster {3, 3}, whereas sequence P6 include item "1" with the cluster {2, 2}. This feature of clustering is useful as it allows to group similar values together for an item and treat them differently.
In the source code of MainTestSequentialPatternsMining3.java, there is a few parameters for K-Means. Two of these parameters are particularly important:
- MaxK : This parameter is the maximum number of clusters to create. If you set this parameter to two for example, no more than two clusters will be made every time that K-Means is called. If you want to desactivate the clustering, this parameter should be set to 1 (only one cluster will always be made).
- numberOfTriesForEachK: This parameter indicates the number of times that K-Means should be run each time there is some values to be clustered. Normally, this value should be set to 1. If you set this parameter to a value different than 1, K-Means will be run the number of times that you specify and the maximum number of clusters found will be kept. This parameter was added because K-Means is a randomized algorithm.
Important note: If the clustering described in this example is used jointly with the mining of closed sequential patterns (described in a previous example), the set of patterns found may not be a lossless representation of all patterns.
Where can I get more information about this algorithm?
The algorithm is described in this paper:
Fournier-Viger, P., Nkambou, R & Mephu Nguifo, E. (2008), A Knowledge Discovery Framework for Learning Task Models from User Interactions in Intelligent Tutoring Systems. Proceedings of the 7th Mexican International Conference on Artificial Intelligence (MICAI 2008). LNAI 5317, Springer, pp. 765-778.
How to run this example?
What is this algorithm?
The Fournier-Viger et al., 2008 algorithm is a sequential pattern mining algorithm combining features from several other sequential pattern mining algorithms. It offers also some original features. In this example, we show how it can be used to discover multi-dimensional sequential patterns with time-constraints.
Multi-dimensional sequential patterns is an extension of the problem of sequential pattern mining that consider the context of each sequences. We have taken the SeqDIM algorithm for multi-dimensional sequential pattern mining and modified it to use features of the Hirate & Yamana (2008) algorithms so that it can only discovers patterns that respect some time constraint set by the user.
What is the input?
The input is a multidimensional time-extended sequence database. We here only provide a brief explanation. Please refer to the article by Fournier-Viger et al., 2008 for more details and a formal definition.
A time-extended multidimensional sequence database is a set of time-extended multi-dimensional sequences. A time-extended multi-dimensional sequence (here called MD-Sequence) is a time-extended sequence (as defined by Hirate & Yamana) but with dimensional information (as defined by Pinto et al. 2001). The set of dimensional values for an MD-Sequence is called an MD-Pattern. For a multi-dimensional database, there is a fix set of dimensions. Each dimensions can take a symbolic value or the value "*" which means any value. In the following MD-Database, there is four MD-Sequences named S1, S2, S3, S4.
MD-Sequences |
||||
ID | MD-Patterns | Sequences | ||
d1 | d2 | d3 | ||
S1 | 1 | 1 | 1 | (0, 2 4), (1, 3), (2, 2), (3, 1) |
S2 | 1 | 2 | 2 | (0, 2 6), (1, 3 5), (2, 6 7) |
S3 | 1 | 2 | 1 | (0, 1 8), (1, 1), (2, 2), (3, 6) |
S4 | * | 3 | 3 | (0, 2 5), (1, 3 5) |
The task of multi-dimensional sequential pattern mining consists of finding MD-Sequences that have a support higher than a minimum support threshold minsup for a MD-Database. Furthermore, in the Fournier-Viger algorithm, we offers the possibility to mine only frequent closed MD-Sequences, by implementing the idea of Songram et al. 2006. A frequent closed MD-Sequence is a frequent MD-Sequence that is not included in any other MD-Sequence having the same support.
This algorithm has five parameters: minsup, mininterval (the minimum amount of time between two consecutive itemsets in the patterns to be found), maxinterval (the maximum amount of time between two consecutive itemsets in the patterns to be found),minwholeinterval(the minimum time duration of patterns to be found),maxwholeinterval (the maximum time duration of the patterns to be found),
What is the output?
If we mine frequent closed MD-Sequences from the previous database with a minsup of 50% mininterval=0, maxinterval=1000, minwholeinterval=0, maxwholeinterval=1000, we obtain the following result:
Frequent Closed MD-Sequences |
||||
ID | MD-Patterns | Sequences | ||
d1 | d2 | d3 | ||
P1 | * | * | * | (0, 2) |
P2 | 1 | * | 1 | (0, 1) |
P3 | 1 | 2 | * | (0, 6) |
P4 | * | * | * | (0, 2), (1, 3 5) |
P5 | * | * | * | (0, 2), (1, 3) |
If we mine frequent MD-sequences instead of frequent closed MD-Sequences, we will obtain 23 frequent MD-Sequences instead.
Where can I get more information about this algorithm?
The algorithm is described in this paper:
Fournier-Viger, P., Nkambou, R & Mephu Nguifo, E. (2008), A Knowledge Discovery Framework for Learning Task Models from User Interactions in Intelligent Tutoring Systems. Proceedings of the 7th Mexican International Conference on Artificial Intelligence (MICAI 2008). LNAI 5317, Springer, pp. 765-778.
The idea of using time-constraints is based on this paper
Yu Hirate, Hayato Yamana (2006) Generalized Sequential Pattern Mining with Item Intervals. JCP 1(3): 51-60.
The idea of multi-dimensional pattern mining is based on this paper:
H. Pinto, J. Han, J Pei, K. Wang, Q. Chen, U. Dayal: Multi-Dimensional Sequential Pattern Mining. CIKM 2001: 81-88
The idea of closed multi-dimensional pattern mining is based on this paper:
P. Songram, V. Boonjing, S. Intakosum: Closed Multi-dimensional Sequential-Pattern Minin. Proc. of ITNG 2006.
The idea of mining closed sequential pattern is based on this paper:
J. Wang, J. Han: BIDE: Efficient Mining of Frequent Closed Sequences. ICDE 2004: 79-90
How to run this example?
What is CMRules?
CMRules is an algorithm for discovering sequential rules that appears in sequence databases. It was proposed by Fournier-Viger in 2010.
What is the input of CMRules ?
The input of CMRules is a sequence database and two user-specified thresholds named minsup and minconf.
A sequence database is a set of sequences where each sequence is a list of itemsets. An itemset is an unordered set of items. For example, the table shown below contains four sequences. The first sequence, named S1, contains 5 itemsets. It means that item 1 was followed by items 1 2 and 3 at the same time, which were followed by 1 and 3, followed by 4, and followed by 3 and 6. It is assumed that items in an itemset are sorted in lexicographical order. This database is provided in the file "contextPrefixSpan.txt" of the SPMF distribution.
ID | Sequences |
S1 | (1), (1 2 3), (1 3), (4), (3 6) |
S2 | (1 4), (3), (2 3), (1 5) |
S3 | (5 6), (1 2), (4 6), (3), (2) |
S4 | (5), (7), (1 6), (3), (2), (3) |
What is the output of CMRules ?
Given a sequence database, and parameters named minsup and minconf, CMRules outputs all sequential rules having a support and confidence respectively higher than minsup and minconf.
A sequential rule X==>Y is a sequential relationship between two sets of items X and Y such that X and Y are disjoint, and that X is unordered and Y is unordered. The support of a rule X==>Y is the number of sequences that contains X∪Y divided by the number of sequences in the database. The confidence of a rule is the number of sequences that contains X∪Y, divided by the number of sequences that contains X.
In this example, we apply CMRules with minsup = 75 %, minconf= 50%. We obtains 9 sequential rules:
Rule | Support | Confidence |
1 ==> 2 | 100 % | 100 % |
1 ==>3 | 100 % | 100 % |
2 ==> 3 | 75 % | 75 % |
3 ==> 2 | 75 % | 75 % |
4 ==> 3 | 75 % | 100 % |
1 3 ==> 2 | 75 % | 75 % |
1 2 ==> 3 | 75 % | 75 % |
1 4 ==> 3 | 75 % | 100 % |
1 ==> 2 3 | 100 % | 100 % |
For example, the rule 1 4 ==> 3 means that if 1 an 4 appears in any order they will be followed by 3 with a confidence of 100 %. Moreover, this rule has a support of 75 % because it appears in three sequences (S1, S2 and S3) out of four sequences.
Performance
CMRules is a relatively efficient algorihtm. However, the RuleGrowth algorithm is faster.
What is interesting about CMRules is that it uses an association rule mining based approach for discovering sequential rules. Therefore it could be used to discover both sequential rules and association rules at the same time.
Where can I get more information about this algorithm?
The CMRules algorithm is described in this paper:
Fournier-Viger, P., Faghihi, U., Nkambou, R., Mephu Nguifo, E. (2012). CMRules: Mining Sequential Rules Common to Several Sequences. Knowledge-based Systems, Elsevier, 25(1): 63-76.
How to run this example?
What is CMDeo ?
CMDeo is an algorithm for discovering sequential rules that appears in sequence databases. It was proposed by Fournier-Viger in 2010.
What is the input of CMDeo ?
The input of CMDeo is a sequence database and two user-specified thresholds named minsup and minconf.
A sequence database is a set of sequences where each sequence is a list of itemsets. An itemset is an unordered set of items. For example, the table shown below contains four sequences. The first sequence, named S1, contains 5 itemsets. It means that item 1 was followed by items 1 2 and 3 at the same time, which were followed by 1 and 3, followed by 4, and followed by 3 and 6. It is assumed that items in an itemset are sorted in lexicographical order. This database is provided in the file "contextPrefixSpan.txt" of the SPMF distribution.
ID | Sequences |
S1 | (1), (1 2 3), (1 3), (4), (3 6) |
S2 | (1 4), (3), (2 3), (1 5) |
S3 | (5 6), (1 2), (4 6), (3), (2) |
S4 | (5), (7), (1 6), (3), (2), (3) |
What is the output of CMDeo ?
Given a sequence database, and parameters named minsup and minconf, CMDeo outputs all sequential rules having a support and confidence respectively higher than minsup and minconf.
A sequential rule X==>Y is a sequential relationship between two sets of items X and Y such that X and Y are disjoint, and that X is unordered and Y is unordered. The support of a rule X==>Y is the number of sequences that contains X∪Y divided by the number of sequences in the database. The confidence of a rule is the number of sequences that contains X∪Y, divided by the number of sequences that contains X.
In this example, we apply CMDeo with minsup = 75 %, minconf= 50%. We obtains 9 sequential rules:
Rule | Support | Confidence |
1 ==> 2 | 100 % | 100 % |
1 ==>3 | 100 % | 100 % |
2 ==> 3 | 75 % | 75 % |
3 ==> 2 | 75 % | 75 % |
4 ==> 3 | 75 % | 100 % |
1 3 ==> 2 | 75 % | 75 % |
1 2 ==> 3 | 75 % | 75 % |
1 4 ==> 3 | 75 % | 100 % |
1 ==> 2 3 | 100 % | 100 % |
For example, the rule 1 4 ==> 3 means that if 1 an 4 appears in any order they will be followed by 3 with a confidence of 100 %. Moreover, this rule has a support of 75 % because it appears in three sequences (S1, S2 and S3) out of four sequences.
Performance
CMDeo is a relatively efficient algorihtm. However, the RuleGrowth algorithm is faster.
Where can I get more information about this algorithm?
The CMDeo algorithm is described in this paper:
Fournier-Viger, P., Faghihi, U., Nkambou, R., Mephu Nguifo, E. (2012). CMRules: Mining Sequential Rules Common to Several Sequences. Knowledge-based Systems, Elsevier, 25(1): 63-76.
How to run this example?
What is RuleGrowth?
RuleGrowth is an algorithm for discovering sequential rules that appears in sequence databases. It was proposed by Fournier-Viger in 2011.
What is the input of RuleGrowth ?
The input of RuleGrowth is a sequence database and two user-specified thresholds named minsup and minconf.
A sequence database is a set of sequences where each sequence is a list of itemsets. An itemset is an unordered set of items. For example, the table shown below contains four sequences. The first sequence, named S1, contains 5 itemsets. It means that item 1 was followed by items 1 2 and 3 at the same time, which were followed by 1 and 3, followed by 4, and followed by 3 and 6. It is assumed that items in an itemset are sorted in lexicographical order. This database is provided in the file "contextPrefixSpan.txt" of the SPMF distribution.
ID | Sequences |
S1 | (1), (1 2 3), (1 3), (4), (3 6) |
S2 | (1 4), (3), (2 3), (1 5) |
S3 | (5 6), (1 2), (4 6), (3), (2) |
S4 | (5), (7), (1 6), (3), (2), (3) |
What is the output of RuleGrowth ?
Given a sequence database, and parameters named minsup and minconf, RuleGrowth outputs all sequential rules having a support and confidence respectively higher than minsup and minconf.
A sequential rule X==>Y is a sequential relationship between two sets of items X and Y such that X and Y are disjoint, and that X is unordered and Y is unordered. The support of a rule X==>Y is the number of sequences that contains X∪Y divided by the number of sequences in the database. The confidence of a rule is the number of sequences that contains X∪Y, divided by the number of sequences that contains X.
In this example, we apply RuleGrowth with minsup = 75 %, minconf= 50%. We obtains 9 sequential rules:
Rule | Support | Confidence |
1 ==> 2 | 100 % | 100 % |
1 ==>3 | 100 % | 100 % |
2 ==> 3 | 75 % | 75 % |
3 ==> 2 | 75 % | 75 % |
4 ==> 3 | 75 % | 100 % |
1 3 ==> 2 | 75 % | 75 % |
1 2 ==> 3 | 75 % | 75 % |
1 4 ==> 3 | 75 % | 100 % |
1 ==> 2 3 | 100 % | 100 % |
For example, the rule 1 4 ==> 3 means that if 1 an 4 appears in any order they will be followed by 3 with a confidence of 100 %. Moreover, this rule has a support of 75 % because it appears in three sequences (S1, S2 and S3) out of four sequences.
Performance
RuleGrowth is a very efficient algorihtm. It is faster and more memory efficient than CMDeo and CMRules.
Note that there is a variation of RuleGrowth that accepts time constraints. It is named TRuleGrowth and it is also offered in SPMF. There is also a variations for mining top-k sequential rules named TopSeqRules offered in SPMF.
Where can I get more information about this algorithm?
The RuleGrowth algorithm is described in this paper:
Fournier-Viger, P., Nkambou, R. & Tseng, V. S. (2011). RuleGrowth: Mining Sequential Rules Common to Several Sequences by Pattern-Growth. Proceedings of the 26th Symposium on Applied Computing (ACM SAC 2011). ACM Press, pp. 954-959.
How to run this example?
What is RuleGen?
RuleGen is an algorithm for discovering sequential rules that appears in sequence databases. It was proposed by Zaki (2000).
What is the input of RuleGen ?
The input of RuleGen is a sequence database and two user-specified thresholds named minsup and minconf.
A sequence database is a set of sequences where each sequence is a list of itemsets. An itemset is an unordered set of items. For example, the table shown below contains four sequences. The first sequence, named S1, contains 5 itemsets. It means that item 1 was followed by items 1 2 and 3 at the same time, which were followed by 1 and 3, followed by 4, and followed by 3 and 6. It is assumed that items in an itemset are sorted in lexicographical order. This database is provided in the file "contextPrefixSpan.txt" of the SPMF distribution.
ID | Sequences |
S1 | (1), (1 2 3), (1 3), (4), (3 6) |
S2 | (1 4), (3), (2 3), (1 5) |
S3 | (5 6), (1 2), (4 6), (3), (2) |
S4 | (5), (7), (1 6), (3), (2), (3) |
What is the output of RuleGen ?
The RuleGen algorithms outputs all sequential rules having a support and confidence respectively higher or equals to user-specified minsup and minconf thresholds.
A rule X==>Y is defined by RuleGen as a sequential relationship between two sequential patterns X and Y. The confidence of a rule X ==> Y is defined as the number of sequences containing X divided by the number of sequences containing Y. The support of a rule X ==> is defined as the number of sequences containing Y.
In this example, we apply RuleGen with minsup = 75 %, minconf= 50%. We obtains 21 sequential rules:
Rule | Support | Confidence |
{1 } ==> {1 }{2 } |
4 |
1.0 |
{1 } ==> {1 }{3 } |
4 |
1.0 |
{1 } ==> {1 }{3 }{2 } |
3 |
0.75 |
{1 } ==> {1 }{3 }{3 } |
3 |
0.75 |
{2 } ==> {1 }{2 } |
4 |
1.0 |
{2 } ==> {2 }{3 } |
3 |
0.75 |
{2 } ==> {3 }{2 } |
3 |
0.75 |
{2 } ==> {1 }{3 }{2 } |
3 |
0.75 |
{3 } ==> {1 }{3 } |
4 |
1.0 |
{3 } ==> {2 }{3 } |
3 |
0.75 |
{3 } ==> {3 }{2 } |
3 |
0.75 |
{3 } ==> {3 }{3 } |
3 |
0.75 |
{3 } ==> {4 }{3 } |
3 |
0.75 |
{3 } ==> {1 }{3 }{2 } |
3 |
0.75 |
{3 } ==> {1 }{3 }{3 } |
3 |
0.75 |
{4 } ==> {4 }{3 } |
3 |
1.0 |
{1 }{2 } ==> {1 }{3 }{2 } |
3 |
0.75 |
{1 }{3 } ==> {1 }{3 }{2 } |
3 |
0.75 |
{1 }{3 } ==> {1 }{3 }{3 } |
3 |
0.75 |
{3 }{2 } ==> {1 }{3 }{2 } |
3 |
1.0 |
{3 }{3 } ==> {1 }{3 }{3 } |
3 |
1.0 |
For example, the rule {1 } ==> {1 }{3 }{3 } means that if the sequential pattern {1} appears in a sequence, the sequential pattern {1} {3 }{3 } also appear in this sequence. In other words, it means that if {1} appears, it will be followed by {3}, {3}.
Implementation details
The RuleGen algorithm first apply a sequential pattern mining algorithm and then combines pairs of sequential patterns to generate rules between two sequential patterns. Note that in our implementation we use the PrefixSpan algorithm for mining sequential patterns instead of SPADE because the PrefixSpan algorithm is generally faster than SPADE.
Also, it is important to note that rules found by RuleGen always have the form X == > Y such that X is a subsequence of Y. This definition of a sequential rule is different from the definition of a sequential rules used by other sequential rule mining algorithms offered in SPMF such as CMRules, CMDeo, RuleGrowth, TRuleGrowth, TopSeqRules and TNS where X and Y are unordered itemsets, and X is not a subset of Y. The rules found by these latter algorithms are more general. Moreover, we have shown that we can achieve higher prediction accuracy by using the kind of rules found by RuleGrowth, CMRules instead of using the rules generated by RuleGen. (see this article for details).
Where can I get more information about this algorithm?
The RuleGen algorithm is described in this paper:
Mohammed Javeed Zaki: Scalable Algorithms for Association Mining. IEEE Trans. Knowl. Data Eng. 12(3): 372-390 (2000)
How to run this example?
What is TRuleGrowth?
TRuleGrowth is an algorithm for discovering sequential rules that appears in sequence databases. It was proposed by Fournier-Viger in 2012. It is a variation of the RuleGrowth algorithm that accepts a window size constraint.
What is the input of TRuleGrowth ?
The input of TRuleGrowth is a sequence database and two user-specified thresholds named minsup and minconf.
A sequence database is a set of sequences where each sequence is a list of itemsets. An itemset is an unordered set of items. For example, the table shown below contains four sequences. The first sequence, named S1, contains 5 itemsets. It means that item 1 was followed by items 1 2 and 3 at the same time, which were followed by 1 and 3, followed by 4, and followed by 3 and 6. It is assumed that items in an itemset are sorted in lexicographical order. This database is provided in the file "contextPrefixSpan.txt" of the SPMF distribution.
ID | Sequences |
S1 | (1), (1 2 3), (1 3), (4), (3 6) |
S2 | (1 4), (3), (2 3), (1 5) |
S3 | (5 6), (1 2), (4 6), (3), (2) |
S4 | (5), (7), (1 6), (3), (2), (3) |
What is the output of TRuleGrowth ?
Given a sequence database, and parameters named minsup, minconf and window_size, TRuleGrowth outputs all sequential rules having a support and confidence respectively higher than minsup and minconf that appears within window_size consecutive itemsets
A sequential rule X==>Y is a sequential relationship between two sets of items X and Y such that X and Y are disjoint, and that X is unordered and Y is unordered. The support of a rule X==>Y is the number of sequences that contains X∪Y divided by the number of sequences in the database. The confidence of a rule is the number of sequences that contains X∪Y, divided by the number of sequences that contains X.
For example, if we set minsup = 0.7, minconf =0.8 and window_size = 3, TRuleGrowth discovers 4 rules:
Rule Support Confidence {1 } ==> {2 }
80 % (4 sequences)
100 %
{1 } ==> {2, 3 }
80 % (4 sequences)
100 %
{1 } ==> {3 } 80 % (4 sequences) 100 % {4 } ==> {3 }
60 % (3 sequences)
100 %
For example, the rule {1} ==> {2 3} means that if 1 appears in a sequence, it will be followed by 2 and (in any order) with a confidence of 100 %. Moreover, this rule has a support of 100 % because it appears in four sequences (S1, S2, S3 and S4) out of four sequences within window_size consecutive itemsets.
Performance
TRuleGrowth is a very efficient algorihtm. It is faster and more memory efficient than CMDeo and CMRules. If the windows_size constraint is used, it can also be much faster than RuleGrowth depending on how the window_size constraint is set.
Implementation details
In SPMF, there is also a version of TRuleGrowth that accepts strings instead of integers. It is available under the name "TRuleGrowth with strings" in the release version of SPMF or in the package ca.pfv.spmf.sequential_rules.trulegrowth_with_strings for the source code version of SPMF. To run it, you should use the input file: contextPrefixSpanStrings.txt.
Where can I get more information about this algorithm?
The TRuleGrowth algorithm is described in this paper:
Fournier-Viger, P., Wu, C.-W., Tseng, V.S., Nkambou, R. (2012). Mining Sequential Rules Common to Several Sequences with the Window Size Constraint. Proceedings of the 25th Canadian Conf. on Artificial Intelligence (AI 2012), Springer, LNAI 7310, pp.299-304
How to run this example?
What is TopSeqRules?
TopSeqRules is an algorithm for discovering the top-k sequential rules appearing in a sequence database.
Why is it important to discover top-k sequential rules? Because other sequential rule mining algorithms requires the user to set a minimum support (minsup) parameter that is hard to set (usually users set it by trial and error, which is time consuming). The TopSeqRules algorithm solve this problem by letting users directly indicate k, the number of rules to be discovered.
What is the input of TopSeqRules ?
TopSeqRules takes three parameters as input:
- a sequence database,
- a parameter k representing the number of rules to be discovered,
- a parameter minconf representing the minimum confidence that the rules should have.
A sequence database is a set of sequences where each sequence is a list of itemsets. An itemset is an unordered set of items. For example, the table shown below contains four sequences. The first sequence, named S1, contains 5 itemsets. It means that item 1 was followed by items 1 2 and 3 at the same time, which were followed by 1 and 3, followed by 4, and followed by 3 and 6. It is assumed that items in an itemset are sorted in lexicographical order. This database is provided in the file "contextPrefixSpan.txt" of the SPMF distribution.
ID | Sequences |
S1 | (1), (1 2 3), (1 3), (4), (3 6) |
S2 | (1 4), (3), (2 3), (1 5) |
S3 | (5 6), (1 2), (4 6), (3), (2) |
S4 | (5), (7), (1 6), (3), (2), (3) |
What is the output of TopSeqRules ?
TopSeqRules outputs the k most frequent sequential rules having a confidence higher or equal to minconf.
A sequential rule X==>Y is a sequential relationship between two sets of items X and Y such that X and Y are disjoint, and that X is unordered and Y is unordered. The support of a sequential rule X==>Y is the number of sequences that contains X∪Y divided by the number of sequences in the database. The confidence of a sequential rule is the number of sequences that contains X∪Y, divided by the number of sequences that contains X.
For example, if we run TopSeqRules with k = 3 and minconf = 0.8, the result is the following rules:
Rule Support Confidence {3 } ==> {4 }
75 % (3 sequences)
100 %
{1 } ==> {3 }
100 % (4 sequences)
100 %
{1,4 } ==> {3 } 75 % (3 sequences) 100 % These rules are the top three rules appearing in the sequence database having a confidence higher or equals to 80 %.
For example, the rule 1 4 ==> 3 means that if 1 an 4 appears in any order they will be followed by 3 with a confidence of 100 %. Moreover, this rule has a support 75 % because it appears in three sequences (S1, S2 and S3) out of four sequences.
It is important to note that for some values of k, the algorithm may return slightly more rules than k. This can happen if several rules have exactly the same support, and it is normal.
Performance
TopSeqRules is a very efficient algorihtm. It is based on the RuleGrowth algorithm, which is one of the most efficient algorithm for mining sequential rules.
It is more intuitive to use TopSeqRules than RuleGrowth. However, it should be note that the problem of top-k sequential rule mining is more computationally expensive than the problem of sequential rule mining. Therefore, it is recommended to use TopSeqRules for k values of up to 1000 or 2000 depending on the dataset. If more rules should be found, it could be better to use RuleGrowth or TRuleGrowth.
Besides, note that there is a variation of TopSeqRules named TNS that is available in SPMF. The improvement in TNS is that it eliminate some sequential rules that are deemed "redundant" (rules that are included in other rules having the same support and confidence - see the TNS example for the formal definition). Using TNS is more costly than using TopSeqRules but it brings the benefit of eliminating some redundancy in the results.
Where can I get more information about this algorithm?
The TopSeqRules algorithm is described in this paper:
Fournier-Viger, P. & Tseng, V. S. (2011). Mining Top-K Sequential Rules. Proceedings of the 7th Intern. Conf. on Advanced Data Mining and Applications (ADMA 2011). LNAI 7121, Springer, pp.180-194.
How to run this example?
What is TNS?
TNS is an algorithm for discovering the top-k non-redundant sequential rules appearing in a sequence database. It is an approximate algorithm in the sense that it always generates non-redundant rules. But these may not always be the top-k non-redundant rules. TNS uses a parameter named delta, which is a positive integer that can be used to improve the chance that the result is exact (the higher delta value, the more chances that the result will be exact).
Why is it important to discover top-k non-redundant sequential rules? Because other sequential rule mining algorithms requires that the user set a minimum support (minsup) parameter that is hard to set (usually users set it by trial and error, which is time consuming). Moreover, the result of sequential rule mining algorithms usually contains a high level of redundancy (for example, thousands of rules can be found that are variation of other rules having the same support and confidence). The TNS algorithm provide solution to both of these problems by letting users directly indicate k, the number of rules to be discovered, and by eliminating redundancy in results.
What is the input of TNS ?
TNS takes four parameters as input:
- a sequence database,
- a parameter k representing the number of rules to be discovered,
- a parameter minconf representing the minimum confidence that rules should have.
- a parameter delta (an integer) that is used to increase the chances of having an exact result (because this algorihm is approximate).
A sequence database is a set of sequences where each sequence is a list of itemsets. An itemset is an unordered set of items. For example, the table shown below contains four sequences. The first sequence, named S1, contains 5 itemsets. It means that item 1 was followed by items 1 2 and 3 at the same time, which were followed by 1 and 3, followed by 4, and followed by 3 and 6. It is assumed that items in an itemset are sorted in lexicographical order. This database is provided in the file "contextPrefixSpan.txt" of the SPMF distribution.
ID | Sequences |
S1 | (1), (1 2 3), (1 3), (4), (3 6) |
S2 | (1 4), (3), (2 3), (1 5) |
S3 | (5 6), (1 2), (4 6), (3), (2) |
S4 | (5), (7), (1 6), (3), (2), (3) |
What is the output of TNS ?
TNS outputs an approximation of the k most frequent non redundant sequential rules having a confidence higher or equal to minconf.
A sequential rule X==>Y is a sequential relationship between two sets of items X and Y such that X and Y are disjoint, and that X is unordered and Y is unordered. The support of a sequential rule X==>Y is denoted as sup(X==>Y) and defined as the number of sequences that contains X∪Y divided by the number of sequences in the database. The confidence of a sequential rule X==>Y is denoted as conf(X==>Y) and defined as the number of sequences that contains X∪Y, divided by the number of sequences that contains X.
A sequential rule ra: X → Y is redundant with respect to another rule rb : X1 → Y1 if and only if:
- conf(ra) = conf(rb)
- sup(ra) = sup(rb)
- X1 ⊆ X ∧ Y ⊆ Y1.
For example, If we run TNS with k = 10 and minconf = 0.5 and delta = 2, the following set of non-redundant rules is found
2 ==> 3 sup= 3 conf= 0.75 1,3 ==> 2 sup= 3 conf= 0.75 1,4 ==> 3 sup= 3 conf= 1.0 1 ==> 2,3 sup= 4 conf= 1.0 3 ==> 4 sup= 3 conf= 1.0 2,5 ==> 6 sup= 2 conf= 1.0 2,3 ==> 4 sup= 2 conf= 0.6666666666666666 1 ==> 2,3,4,6 sup= 2 conf= 0.5 3,5 ==> 6 sup= 2 conf= 1.0 2 ==> 3,4,6 sup= 2 conf= 0.5For instance, the rule 1 4 ==> 3 means that if 1 an 4 appears in any order they will be followed by 3 with a confidence of 100 %. Moreover, this rule has a support 75 % (sup = 3) because it appears in three sequences (S1, S2 and S3) out of four sequences.
Note that for some values of k and some datasets, TNS may return more than k rules. This can happen if several rules have exactly the same support, and it is normal. It is also possible that the algorithm returns slightly less than k rules in some circonstances because the algorithm is approximate.
Performance
TNS is an efficient algorihtm. It is based on the TopSeqRules algorithm for discovering top-k sequential rules. The main difference between TNS and TopSeqRules is that TNS includes additional strategies to eliminate redundancy in results, and that TNS is an approximate algorithm, while TopSeqRules is not.
TNS and TopSeqRules are more intuitive to use than regular sequential rule mining algorithms such as RuleGrowth. However, it should be note that the problem of top-k sequential rule mining is more computationally expensive than the problem of sequential rule mining. Therefore, it is recommended to use TNS or TopSeqRules for k values of up to 1000 or 2000 depending on the dataset. If more rules should be found, it could be better to use RuleGrowth or TRuleGrowth, for more efficiency.
Where can I get more information about this algorithm?
The TNS algorithm is described in this paper:
Fournier-Viger, P., Tseng, V. S. (2013). TNS: Mining Top-K Non-Redundant Sequential Rules. Proc. 28th Symposium on Applied Computing (ACM SAC 2013). ACM Press, to appear.
How to run this example?
To run this example with the source code version of SPMF, launch the file "MainTestID3.java" in the package ca.pfv.SPMF.tests.
This example is not available in the release version of SPMF.
What is the ID3 algorithm?
The ID3 algorithm is a classic data mining algorithm for classifying instances (a classifier). It is well-known and described in many artificial intelligence and data mining books.
What is the input?
The input is a set of training data for building a decision tree.
For instance, in this example, we use the following database (from the book "Machine Learning by Tom Mitchell). This database is provided in the file "tennis.txt" of the SPMF distribution.
This database defines five attributes and contains fourteen instances.
outlook | temp | humid | wind | play? |
sunny | hot | high | weak | no |
sunny | hot | high | strong | no |
overcast | hot | high | weak | yes |
rain | mild | high | weak | yes |
rain | cool | normal | weak | yes |
rain | cool | normal | strong | no |
overcast | cool | normal | strong | yes |
sunny | mild | high | weak | no |
sunny | cool | normal | weak | yes |
rain | mild | normal | weak | yes |
sunny | mild | normal | strong | yes |
overcast | mild | high | strong | yes |
overcast | hot | normal | weak | yes |
rain | mild | high | strong | no |
What is the output?
By applying the ID3 algorithm, a decision tree is created. To create the decision tree, we have to choose a target attribute. Here, we choose the attribute "play".
The following decision tree is created:
We can now use the tree to predict the value of the target attribute "play" for a new instance.
For example, consider this new instance, where the value for "play" is unknown.
sunny | hot | normal | weak | ? |
By applying the decision tree, the value for the attribute "play" for this instance is "yes".
Where can I get more information about the ID3 algorithm?
The ID3 algorithm was proposed by Quinlan (1986). It is one of the most popular algorithm for learning decision trees. By searching on the web, you can find plenty of information on this algorithm. It is also described in several data mining books and artificial intelligence books.
How to run this example?
The tool for converting a sequence databases to SPMF format takes three prameters as input:
The algorithm outputs a sequence database in SPMF format.
The CSV format is defined as follows:
For example, the follwing sequence databasee is in CSV format and contains four sequences:
1,2,3,4 5,6,7,8 5,6,7 1,2,3
The Kosarak format is defined as follows:
For example, the follwing sequence databasee is in Kosarak format and contains four sequences:
1 2 3 4 5 6 7 8 5 6 7 1 2 3
The IBMGenerator format is the format used by the IBM Data Quest Generator. The format is defined as follows:
For example, the follwing sequence databasee is in Kosarak format and contains four sequences:
1 -1 2 -1 3 -1 4 -1 -2 5 -1 6 -1 7 -1 8 -1 -2 5 -1 6 -1 7 -1 -2 1 -1 2 -1 3 -1 -2
The Snake format is defined as follows:
For example, the follwing sequence databasee is in Snake format and contains four sequences:
ABCD ABAB CACD ADAC
The BMS format is defined as follows:
For example, the follwing sequence databasee is in BMS format and contains four sequences with the ids 10, 20, 30 and 40, respectively:
10 1 10 2 10 3 10 4 20 5 20 6 20 7 20 8 30 5 30 6 30 7 40 1 40 2 40 3
How to run this example?
What is this tool?
This tool is a random generator of sequence databases. It can be used to generate synthetic sequence databases to compare the performance of data mining algorithms that takes a sequence database as input.
Synthetic databases are often used in the data mining litterature to evaluate algorithms. In particular, they are useful for comparing the scalability of algorithms. For example, one can generate sequence databases having various size and see how the algorithms react in terms of execution time and memory usage with respect to the database size.
What is the input?
What is the output?The tool for generating a sequence databases takes four prameters as input:
1) the number of sequences to be generated
2) the maximum number of distinct item that the database should contain,
3) the number of items that each itemset should contain
4) the number of itemsets that each sequence should contain
The algorithm outputs a sequence database respecting these parameters. The database is generated by using a random number generator.
How to run this example?
What is this tool?
This tool is a random generator of sequence databases with timestamps. It can be used to generate synthetic sequence databases with timestamps to compare the performance of data mining algorithms that takes a sequence database with timestamps as input.
Synthetic databases are often used in the data mining litterature to evaluate algorithms. In particular, they are useful for comparing the scalability of algorithms. For example, one can generate sequence databases having various size and see how the algorithms react in terms of execution time and memory usage with respect to the database size.
What is the input?
What is the output?The tool for generating a sequence databases with timestamps takes four prameters as input:
1) the number of sequences to be generated
2) the maximum number of distinct item that the database should contain,
3) the number of items that each itemset should contain
4) the number of itemsets that each sequence should contain
The algorithm outputs a sequence database with timestamps respecting these parameters. The database is generated by using a random number generator.
How to run this example?
What is this tool?
This tool is a random generator of transaction databases. It can be used to generate synthetic transaction databases with timestamps to compare the performance of data mining algorithms that takes a transaction database as input.
Synthetic databases are often used in the data mining litterature to evaluate algorithms. In particular, they are useful for comparing the scalability of algorithms. For example, one can generate sequence databases having various size and see how the algorithms react in terms of execution time and memory usage with respect to the database size.
What is the input?
What is the output?The tool for generating a transaction takes three prameters as input:
1) the number of transactions to be generated
2) the maximum number of distinct item that the database should contain,,
3) the number of items that each transaction should contain
The algorithm outputs a transaction database database respecting the parameters provided. A random number generator is used to generate the database.
How to run this example?
What is this tool?
This tool is a tool for generating statistics about a sequence database. It can be used to know for example if the database is dense or sparse before applying a data mining algorithms.
What is the input?
The input is a sequence database. A sequence database is a set of sequence. Each sequence is an ordered list of itemsets. An itemset is an unordered list of items (symbols). For example, consider the following database. This database is provided in the file "contextPrefixSpan.txt" of the SPMF distribution. It contains 4 sequences. The second sequence represents that the set of items {1 4} was followed by {3}, which was followed by {2, 3}, which were followed by {1, 5}. It is a sequence database (as defined in Pei et al., 2004).
ID | Sequences |
S1 | (1), (1 2 3), (1 3), (4), (3 6) |
S2 | (1 4), (3), (2 3), (1 5) |
S3 | (5 6), (1 2), (4 6), (3), (2) |
S4 | (5), (7), (1 6), (3), (2), (3) |
What is the output?
The output is statistics about the sequence database. For example, if we use the tool on the previous sequence database given as example, we get the following statistics:
- the number of distinct items in this database: 7
- the largest item id: 7
- the average number of itemsets per sequence: 5
- the average number of distinct items per sequence: 5.5
- the average number of occurences in a sequence for each item appearing in a sequence: 1.41
- the average number of items per itemset: 1.55