There are two versions of SPMF.
- The source code version includes all the algorithms. It requires prior experience with Java
for compiling the source code and running the examples.
- The release
version provides a graphical user interface and a command line interface.
all of the algorithms except a few exceptions.
|Source code version (138 algorithms)
||Release version (125 algorithms)
1) Download spmf.zip
2) Read the instructions for installing and running the source
code: how_to_install.txt .
After you have installed the source code, if you intend to modify the
source code and/or reuse it in other Java projects, you may want to
read the developer's guide, which provides information about the source code organization.
Algorithms included: all the algorithms
1) Download spmf.jar and
the sample data files test_files.zip
2) If you want to use the graphical interface, follow these instructions: how_to_run_the
If you want to use the command line interface, follow these instructions:
Algorithms included: all except CloStream, estDec, estDec+, ItemsetTree, Memory-Efficient Itemset Tree, ID3, and a few others
If you have any questions, you may first have a look at the FAQ,
and then ask your question in the data
mining forum. If the question has to be private, you can send-me an e-mail.
- v. 2.20 - 2017-11-6 (bug fixes)
- Bug fixe(s):
- Fixed a bug such that the optional sequence identifiers in the output of some sequential pattern mining algorithms were incorrect. According to the documentation, sequence identifiers should start at 0, while for some algorithms, the sequence identifiers were starting from 1. Now the sequence identifiers start from 0 for all the algorithms. Thanks to Mathieu Gousseff for reporting the bug.
- Fixed a bug in the USpan algorithm such that the SWU upper-bound was looser than it should (thanks to Tin Truong Chi for finding and fixing the bug).
- Fixed a bug for the FEAT algorithm such that it was throwing exception when using the optional parameter to show sequence identifiers..
- Known issues:
- These is a bug in the BIDE+ implementation such that some incorrect results may be obtained for sequences with multiple items per itemset.
- v. 2.19 - 2017-10-22 (4 new algorithms)
- New algorihtm(s):
- An algorithm to calculate the regression line of a time series using the least squares method. After applying the algorithm to train a linear regression model, the model.can be used to make some simple predictions.
- The SSFIM algorithm for frequent itemset mining
- The mHUIMiner algorithm for high utility itemset mining (by Peng et al. from GitHub, GPL license)
- The ULB-Miner algorithm for high utility itemset mining (by Duong, H, Fournier-Viger et al.)
- New feature(s):
- Added an implementation of FHM called FHM(float) which can take utility values as float values instead of integers.
- Added the possibility of specifying a maximum pattern length to the following algorithms : Apriori, AprioriHT, FPGrowth, FPGrowth_association_rules, FPGrowth_association_rules_with_lift
- For sequence prediction, the Evaluator class was modified so that the SPMF format is used to compare sequence prediction models instead of another format.
- Bug fixe(s):
- Fixed a bug in the new version of the Apriori implementation with length constraint. Thanks to Muhammad Yasir Chaudhry for reporting the bug.
- Fixed a bug in the HUIM-BSO, HUIM-BSPtree algorithms in terms of supported input file format. Thanks to Majdi Mafarja for reporting the problem.
- Fixed a bug in the output format of PrefixSpan and BIDE+ algorithms (some -1 were missing in some cases). Thanks to Matthieu Gousseff for reporting the bug.
- Improved the documentation of SPMF by dividing the single documentation page into multiple webpages (for achival purpose, the old documentation page for SPMF 2.18 can be found here).
- v. 2.18 - 2017-08-06 (new versions of two algorithms, fix bug, fix dataset issue)
- Added two new versions of
the AprioriRare and AprioriInverse algorithms called "AprioriRare_TID" and "AprioriInverse_TID". These versions are based on AprioriTID instead of the regular Apriori. They thus keep transactions identifiers of patterns in memory to avoid scanning the database repeatedly, and can output the transaction ids to the output file (by setting the parameter "Show transactions IDs? to true in the user interface).
- Fixed and reuploaded the "retail" and "pumsb" datasets. They contained an item with the id "0". But some algorithms such as EFIM assume that item identifiers must be positive (thanks to Srikumar Krishnamoorthy for reporting this problem)
- Added a tool to add a value to all item identifiers in a transaction database. This was used to fix the above dataset problem.
- Fixed a bug in the generation of closed association rule mining using the FPClose algorithm (thanks to Benjamin Andow for reporting the bug)
- v. 2.17 - 2017-07-03
- Added optional parameters for the PFPM and PHM algorithms to specify the minimum and maximum number of items that patterns should contain.
- Modified the user interface so that algorithms can have up to seven parameters.
- v. 2.16
- Added a new feature to the CPT and CPT+ sequence prediction models. The user can now obtain information about how a prediction was made. By using the method getCountTable, one can obtain all the symbols and their scores calculated by the model. This explains what is the basis for a given prediction.
- v. 2.15 (4 new algorithms)
- Added implementation of four algorithms (by Jerry Chun-Wei Lin, Ting Li, et al.)
- the FFI-Miner algorithm for mining fuzzy frequent itemsets in a quantitative transaction database (similar to a database with utility values)
- the MMFI-Miner algorithm for mining fuzzy frequent itemsets in a quantitative transaction database
- the HAUI-Miner algorithm for mining high average utility itemsets in a transaction database with utility values.
- the HAUI-MMAU algorithm for mining high average utility itemsets in a transaction database with utility values using multiple minimum average-utility thresholds.
- v. 2.14 (bug fix)
- Fixed a bug the USpan implementation such that some patterns could be missed (thanks to Tai Dinh and Tin Truong Chi for reporting the bug).
- v. 2.13 (bug fix)
- Fixed a bug in Closed association rule mining with FPClose. Some exception was thrown in some rare case (thanks to Tarannum Zaman).
- v. 2.12 - 2017-02-05
- Added a new optional parameter to several itemset mining algorithms to let the user decide whether transactions identifiers should be shown in the output file, for each pattern found. The algorithms that support this feature are: AprioriTID, AprioriTID_bitset, Apriori_TIDClose, Charm_bitset, Charm_MFI, Eclat, Eclat_bitset, DCI_closed, CORI. In the user interface of SPMF, the new optional parameter is displayed as "Show transactions IDs? (optional)".
- v. 2.11 - 2017-01-27 (5 new algorithms)
- Five new algorithms for high-utility itemset mining have been added (by Jerry Chun-Wei Lin, Lu Yang, Philippe Fournier-Viger)
- SFUPMinerUgmax for mining skyline frequent-utility patterns
- HUIM-GA and HUIM-GA-tree algorithms for mining high-utility itemsets using genetic algorithms
- HUIM-BPSO and HUIM-BPSO-tree algorithms for mining high-utility itemsets using particle-swarm optimization
- v. 2.10 - 2017-01-17
- The SAX algorithm has now a new optional parameter "deactivatePAA". It allows to deactivate the transformation to the piecewise aggregate approximation (PAA) when applying SAX. This allows to convert a files containing several time series having different lengths to their SAX representations while preserving their original lengths (rather than converting all of them to time series having the same length).
- Fixed a bug in the TopKRules algorithm that was introduced in a previous version of SPMF. The output was correct but the algorithm was not using the set "candidates" in the most efficient way. (Thanks to Bima Haryanto Putra for reporting the bug)
- Fixed a bug in the MaxSP algorithm (thanks to Natalia Mord for proposing the bug fix).
- v. 2.09 - 2016-12-28
- Added a vizualization tool called the Instance viewer for visualizing the input files of clustering algorithms such as K-Means and DBScan
- Improved the documentation of the clustering algorithms with some more interesting examples and pictures. Moreover, also did some minor improvements to the code of clustering algorithms. In particular, the input file format for clustering algorithms now let the user specify the names of attributes used to describe the instances.
- I have also improved the Cluster Viewer to let the user select which attributes should be visualized when displaying clusters. Thus the Cluster Viewer can now be used to visualize instances having more than 2 attributes.
- Fixed a bug in the user/interface and command line interface of SPMF for the parameter "required items" of the TKS algorithm.
- v. 2.08 - 2016-12-25 (cluster visualization)
- Added a vizualization tool called the Cluster Viewer for visualizing clusters of 2D points found by clustering algorithms such as K-Means and DBScan
- Moved the TimeSeries viewer to another package and added a few additional features to its user interface..
- v. 2.07 - 2016-12-21
- Modified the clustering algorithms (K-Means, Bisecting K-means, Hierarchical clustering, DBScan and OPTICS) such that:
- a label (a name) can be assigned to each instance in the input file. The names of instances are now displayed in the output of these algorithms. This provides more meaningful results.
- a separator such as " " can be provided as parameter to these algorithms. The separator indicates which character is used in the input file to separate values. As a result, most clustering algorithms are now compatible with the time series file format and can be applied to time series (when using the ',' separator).
- Fixed a bug when running the OPTICS algorithm in the user interface or command line interface of SPMF.
- Minor improvements to the Time Series Viewer. When the user moves the mouse over a time series, the name of the time series is shown. Also other minor changes.
- v. 2.06 - 2016-12-18 (time-series mining)
- Added support for time-series data mining
- an implementation of the SAX algorithm is provided for converting time series to sequence(s) of symbols. This is useful to then apply traditional sequential pattern mining algorithms or sequential rule mining algorithms to time series.
- an algorithm to calculate the moving average of a time-series (this is useful for making a time series appears more "smooth" by removing noise)
- an algorithm to calculate the piecewise-aggregate approximation of a time-series, which is used to reduce the dimensionality of a time-series
- an algorithm to split a time-series into a given number of time series, or by number of data points.
- a vizualization tool called TimeSeriesViewer for visualizing time-series.
- Fixed an encoding bug for the conversion of chinese texts to sequences such that chinese characters were not appearing.
- Fixed a bug related to the command line interface of SPMF
- Updated the developer's guide on the website with some minor modifications.
- v. 2.05 - 2016-11-16
- Fixed a bug in the command line interface (thanks to Andrey Shestakov for reporting the bug)
- v.2.04- 2016-10-14
- Improved the graphical user interface and command line interface of SPMF so that more informative messages are shown to the user when an algorithm parameter is missing or when the value is of an incorrect type. This will make the user interface more user-friendly (thanks to Slimane Oulad Naoui for this suggestion).
- v.2.03- 2016-10-13
- Fixed a bug in the VMSP algorithm (AlgoVMSP.java) such that some patterns were missing in some cases when the maxgap constraint was used (thanks to Antoine Pigeau for reporting the problem)
- v.2.02- 2016-10-12
- Added support for mining TEXT files with Chinese text (by supporting the Chinese punctuation).
- Fixed a bug in the FOSHU and TSHOUN algorithms, an updated the documentation and sample input file for these algorithms (thanks to Yimin Zhang for reporting the problem)
- v.2.01 - 2016-09-16 (several improvements)
- Added the support for TEXT files. Using the graphical interface or command line, it is now possible to apply most sequential pattern mining and sequential rule mining algorithms directly to a text file. There is two ways of applying an algorithm on a text file. The first way is to apply the algorithm "Convert_TEXT_file_to_sequence_database" to transform a text file into a sequence database. Then this file can be used with most algorithms for sequential pattern or rule mining using the user interface or command line. The second way is to rename the text file with the extension ".text". Then when using the graphical interface or command line, SPMF will automatically convert the file to the SPMF format, run the selected algorithm, and then show the results in terms of words in the text file rather than integers. This is a feature that has been requested by several users. It is useful for performing data mining on text files without having to write code for converting text to sequences as it was previously required. For now, SPMF only supports the default text file encoding supported by Java. In the future, some options will be added to let the user choose other encodings as well. There is a new example in the documentation that provides also some explanations about how to use text files when running algorithms using the source code version of SPMF. Moreover, a tutorial on the blog explains some of the possibilities for analyzing text documents using this new version of SPMF.
- A new system has been designed for adding new algorithms to SPMF. To add an algorithm, an instance of the class DescriptionOfAlgorithm must be created for the new algorithm in the package "ca.pfv.spmf.algorithmmanager.descriptions". It allows to indicate the type of input, output, the parameters, etc. of the algorithm. This is then used to automatically generate the list of algorithms in the user interface of SPMF, unlike in previous versions of SPMF where this list was hard-coded. In the future, the descriptions of algorithms could be used to build a more complex user interface where user could visually combine various algorithms as a workflow. Another interesting possibility is to provide a user interface to run multiple algorithms one after the other, or to launch experiments where the parameters are varied automatically. This will be considered for features in future releases of SPMF. Moreover, another idea is to use the algorithm descriptions for adding a plug-in system in SPMF for importing algorithms from other jar files. In the next few days, I will also update the developper's guide to add more documentation.
- Added the lift measure to the CMDeo algorithm (thanks to Ryan Panos).
- Updated the code of the GCD algorithm for association rule mining with an improved version (thanks to Ahmed El-Serafy, Hazem El-Raffiee).
- Fixed some minor errors in the documentation.
- Fixed a bug in the CMDeo algorithm that could trigger an ArrayOutOfBoundException .
- Fixed a bug in the Pascal implementation (the support of single items was incorrect in some cases).
- Fixed a bug in the user interface for the FP-Close algorithm.
- Fixed a bug in the VMSP and VGEN algorithms that occurred when maxLength was set to 1.
- v.0.99j - 2016-06-16
- Fixed a bug in the VMSP implementation (thanks to Himel Dev for reporting the bug)
- Two additional large itemset mining datasets have been added to the datasets page of the website: PowerC and Susy (thanks to Zhang Zhongjie)
- v.0.99i - 2016-06-09 (1 new algorithm)
- Added an implementation of the GCD algorithm for mining association rules (thanks to Ahmed El-Serafy, Hazem El-Raffiee for providing the implementation)
- v.0.99h - 2016-06-09
- Added a tool to resize databases in SPMF format using a percentage of lines from an original database (useful for performing scalability experiments)
- Fixed a bug in the FHSAR implementation (thanks to Gehad Ahmed Soltan Abd-Elaleem for reporting the bug)
- v.0.99g - 2016-06-02 (4 new algorithms)
- Added an implementation of the MinFHM algorithm for mining minimal high-utility itemsets.
- Added an implementation of the PFPM algorithm for mining frequent periodic patterns in a transaction database (a sequence of transactions)
- Added an implementation of the PHM algorithm for mining periodic patterns that have a high utility (e.g. yield a high profit) in a sequence of transactions (a transaction database)
- Added an implementation of the SkyMine algorithm for mining skyline high-utility itemsets (thanks to V. Goyal. et al.)
- Added a tool to remove utility information from transactions databases containing utility information.
- v.0.99f - 2016-05-30
- Fixed a bug that may generate incorrect support count in VMSP and other SPAM based algorithms in some specific cases. The bug was introduced in a previous version when adding the maxgap constraint to SPAM based algorithms (thanks to Preethy Varma for reporting the bug)
- Seven large datasets for itemset mining have been added to the datasets page of the website: kddcup99, Skin, Pamp, USCensus, OnlineRetail, and RecordLink (thanks to Zhang Zhongjie)
- v0.99e - 2016-03-29 (1 new algorithm)
- Added the original implementation of the EFIM-Closed algorithm for mining closed high-utility itemsets.
- v0.99d - 2016-03-23 (1 new algorithm)
- Added the original implementation of the FHM+ algorithm for efficiently mining high-utility itemsets with length constraints.
- v0.99c - 2016-03-13
- Fixed bugs in the new BIDE+ and PrefixSpan implementation that occurred for sequences containing multiple items per itemset.
- v0.99b - 2016-02-28
- I have further optimized the new Prefixspan implementation, in the package ca.pfv.spmf.algorithms.sequentialpatterns.prefixspan.
- I have replaced the old implementation of BIDE+ with a new implementation. The new implementation is in the package ca.pfv.spmf.algorithms.sequentialpatterns.prefixspan. This new implementation is faster and more memory efficient (up to 10 times faster on some dataset, and uses less memory). I have tested this implementation quite well. But if you find some issues, please let me know. Note that some algorithms may still rely on the old implementation (e.g. the Fournier08 algorithm). I will further clean the code in upcoming versions of SPMF to avoid keeping two versions of BIDE+.
- Fixed a bug in the FOSHU and TS-HOUN algorithms. The absolute value of to(X) is now used to calculate the relative utility of an itemset X.
- v0.99 - 2016-02-21
- In this new version, I have replaced the Prefixspan implementation with a new implementation, in the package ca.pfv.spmf.algorithms.sequentialpatterns.prefixspan. This is something that I have wanted to do for a while since the previous version had been implemented a long time ago. The new version is based on different design decisions and includes some additional optimizations. It can thus be more than 10 times faster than the previous implementation on some dataset and use three times less memory. This also makes the RuleGen algorithm faster since it relies on PrefixSpan. Note that some algorithms may still rely on the old implementation.
- v0.98e - 2016-02-05
- Added the possibility to mine closed association rules using FPClose. The version using FPClose can be 10 times faster than the version using Charm for the step of rule generation because FPClose stores closed itemsets in a CFI-tree.
- v0.98d - 2016-02-02
- Fixed a bug in DBScan, Optics, and the KD-Tree implementation.
- v0.98c - 2016-01-29 (1 new algorithm):
- Added an implementation of the USPAN algorithm for mining high-utility sequential patterns.
- v0.98b - 2016-01-28 (1 new algorithm):
- Added an implementation of the FCHM algorithm for mining correlated high-utility itemsets using the bond measure.
- Modified the output of TKS to remove the -2 at the end of each pattern found, so that the output is similar to other sequential pattern mining algorithms.
- v0.98 - 2016-01-14 (added a new window for result visualization)
- This new version offers a new user user interface for vizualizing results. It is a window specifically designed for visualizing patterns found by pattern mining algorithms but it works with clustering algorithms and most algorithms. This window can be accessed when using the graphical interface of SPMF by selecting the checkbox "using SPMF viewer". The new window show the patterns found by an algorithm in a table, and it let the user apply some filters to select patterns or to sort the patterns by ascending or descending orders using various measures such as support and confidence (depending on the algorithms) by clicking on the column headers. This window for visualizing patterns should work with most algorithms offered in SPMF. If you find some bugs related to this new window for visualizing results, or if you have ideas to improve the user interface of SPMF, you may let me know.
- Besides, I fixed a few bugs.
- v0.97d / 0.97e - 2015-12-06 (2 new algorithm)
- Added an implementation of the GHUI-Miner algorithm
for mining generators of high-utility itemsets in a transaction database having utility information.
- Added an implementation of the CHUI-Miner algorithm for
mining closed high-utility itemsets in a transaction database having utility information.
a bug in MaxSP. No result where generated for minsup = 0 sequence. Now,
if the user set minsup = 0 sequence, MaxSP change minsup to 1 sequence
(because it does not make sense to generate patterns that do not exist
in the database).
- Fixed a bug in FPClose (thanks to Jamshi Nazeer for reporting the bug)
- Fixed a bug in the GoKrimp algorithm when reading a file without optional labels (thanks to Jaroslav Fowkes and Thomas Christie for reporting the bug)
- Fixed a bug in the ClaSP / CMClasp algorithms when handling databases with multiple items per itemsets (thanks to Tin Truong Chi for the bug fix)
- v0.97c - 2015-10-28 (1 new algorithm)
- Added a variation of the FHM algorithm named FHMFreq for mining frequent high-utility itemsets. This is a quite simple modification of FHM to add the minsup threshold.
- v0.97b - 2015-10-06 (1 new algorithm)
- Added an implementation of a Naive Bayes Document Classifier (implemented by Sabarish Raghu)
- Fixed a bug in the text clusterer (bug reported and fixed by Dharmen Punjani)
- Fixed a bug in FPClose (thanks to Insil Yun)
- v0.97a - 2015-09-19
- Fixed a bug in the graphical interface for the SPAM algorithm (thanks to Martin Böckle for reporting the bug).
- Added the minimum pattern length constraint for the SPAM algorithm.
- v0.97 - 2015-09-12 (major revision - 16 new algorithms)
- Added several sequence prediction algorithms to SPMF.
The algorithms are CPT+, CPT, PPM, DG, AKOM, TDAG and LZ78.
Those algorithms are designed to predict the next symbol of a given
sequence based on a set of training sequences. The algorithms are
implemented by Ted Gueniche, as part of its Ipredict project.
- Added an implementations of FOSHU and TS-HOUN for on-shelf high-utility itemset mining.
- Added an implementations of EIHI and HUI-LIST-INS for incremental high-utility itemset mining.
- Added an implementation of HUSRM for high-utility sequential rule mining.
- Added an implementation of EFIM, d2HUP and HUP-Miner algorithms for high-utility itemset mining.
- Added an implementation of HUG-Miner for mining high-utility generator patterns
- Added more performance comparisons on the "Performance" page of the website
- Added datasets for on-shelf utility mining and high utility sequential rule mining on the "Datasets" page of the website.
- Fixed a bug in the "maxgap" constraint implementation for the TKS, CM-SPAM
algorithms and other SPAM based algorithms, that sometimes occured when
an item appeared multiple times in the same sequence.
- Updated the map of data mining algorithms.
version requires to have Java 1.8, installed on your machine. It may be
necessary to update the Java SDK on your machine and perhaps also your
development environment such as Eclipse.
- v0.96r20 - 2015-08-25
an optional parameter to SPAM, VMSP, VGEN and TKS to show the
identifiers of sequences containing each pattern found. If this
parameter is set to true, the identifiers of sequences will be shown in
the output by these algorithms.
- v0.96r19 - 2015-08-18
a bug in the FPGrowth algorithm that was introduced in v96r14 when some
optimizations where made to the FPGrowth code (thanks to Masanori Akiyoshi for finding the bug). The support of itemsets was in some cases incorrectly calculated.
- Fixed an integer overflow problem occuring only for very large datasets for FHM, FHN and HUI-Miner.
- Fixed a bug in the CORI algorithm (thanks to Pierre-Emmanuel Leroy)
- v0.96r17/r18 - 2015-05-26
- Further optimization of memory usage for the Eclat, dEclat, Cori and DefMe algorithms.
- Fixed a bug in the correlation distance function for clustering.
- Fixed a bug that occurred when using the "maxgap" constraint in the VMSP, VGEN, CM-SPAM, SPAM and TKS algorithms (thanks to Choong Shin Siang and Wong Li Pei for reporting the bug).
- Optimized the H-Mine algorithm implementation.
- Fixed a bug in FHN.
- Fixed a bug in the UPGrowth+ implementation (thanks to Prashant Barhate for contributing this implementation)
- v0.96r16 - 2015-04-28 (1 new algorithm)
v0.96r15 - 2015-04-24 (4 new algorithms)
- Added an implementation of the CORI algorithm for mining rare correlated itemsets from a transaction database.
- Added an implementation of the FPClose algorithm for mining closed frequent itemsets from a transaction database.
- Added an implementation of the PrePost+ algorithm (a variation of PrePost) for frequent itemset mining
- Added an implementation of the FHN and HUINIV-Mine algorithms for high-utility itemset mining with negative or positive unit profit values.
- v0.96r14 - 2015-04-05 (1 new algorithm)
- Added an implementation of the FPMax algorithm for mining maximal frequent itemsets from a transaction database.
- Fixed a bug in the dCharm_bitset implementation. The result was sometimes incorrect.
- The runAlgorithm() method of CommandProcessor is now public as requested by some user.
- v0.96r13 - 2015-03-22 (1 new algorithm)
- Added an implementation of the Optics algorithm.
Optics generates a cluster-ordering from a set of double vectors. From
this ordering, various things can be done. I have implemented the
method extractDBScan() method to use this ordering to generate DBScan
style clusters. I have however not implemented the alternative
extractCluster() method, described in the paper.
- v0.96r12 - 2015-03-21
a bug in the hash function of CloSpan, ClaSP and CM-ClaSP, that
provoked a StackOverflow exception for these algorithms in some rare
cases (thanks to Wen Zhang for reporting the bug and Antonio Gomariz
for fixing it).
- v0.96r11 - 2015-03-16
a bug in the Zart algorithm (reported by Asmaa) that was generating an
ArrayOutOfBound exception when no single items were frequent.
Furthermore, I have modified the outpout of Zart to make it clearer and
updated the documentation.
- v0.96r10 - 2015-03-12
the graphical user interface of SPMF (files in the package
ca.pfv.spmf.gui) so that when the user is launching an algorithm, it is
now done in a separated thread and a button "Stop algorithm" is
available to stop the algorithm execution if it is taking too much time.
- v0.96r9 - 2015-03-07 (1 new algorithm)
- Added an implementation of the DBScan algorithm for density-based clustering.
- Added the feature of searching all points within a radius to the KD-Tree implementation.
- v0.96r8 - 2015-03-05
New feature for most sequential pattern mining algorithms: the user can
now request to show the corresponding sequence ids for each pattern
found. In other words, for each pattern found, SPMF can now show the
ids of the sequences where the pattern appears. This feature was added
in BIDE+, ClaSP, CM-ClaSP, CloSpan, CM-SPADE, CM-SPAM, SPADE, SPAM-AGP,
GSP, PrefixSpan, TSP, MaxSP, FEAT and FSGP. The documentation will be
Moreover, I fixed some minor issues in the code of FEAT
and FSGP (the code for saving to file was not working as expected for
- v0.96r7 - 2015-02-17 (1 new algorithm)
- Added an implementation of the Bisecting K-Means clustering algorithm.
more features to the K-Means and Hierarchical Clustering algorithms.
Previously, the euclidian distance was the only distance function
available. Now, the user can choose between Euclidian distance,
correlation distance, cosine distance, Manathan distance and Jaccard
- Update: 2015-02-19: Fixed a bug in the command line interface of SPMF (bug reported by Wen Zhang).
- v0.96r6 - 2015-02-16
- Fixed bugs in the dEclat, dCharm algorithms and FIN/PrePost implementations.
- v0.96r5 - 2015-02-13
the "max gap" parameter for the VGEN, VMSP, TKS, SPAM, SPAM and CM-SPAM
sequential pattern mining algorithms. It is an optional parameter that
allows to specify if gaps are allowed in sequential patterns. For
example, if "max gap" is set to 1, no gap is allowed (i.e. each
consecutive itemset of a pattern must appear consecutively in a
sequence). If "max gap" is set to N, a gap of N-1 itemsets is allowed
between two consecutive itemsets of a pattern. If the parameter is not
used, by default "max gap" is set to +∞.
- Fixed a bug in the Itemset-Tree and Memory Efficient Itemset-Tree implementations (thanks to Ryan G. Benton for reporting and fixing the bug). The support of itemsets was sometimes calculated incorrectly.
- v0.96r3/r4 - 2015-02-05 (3 new algorithms)
- Added an implementation of a Text Clusterer using the tf*idf measure (thanks to Sabarish Raghu)
- Added an implementation of the EstDec+ algorithm, for mining recent frequent itemsets from data streams (thanks to Azadeh Soltani).
- Memory optimizations of the EstDec algorithm. The algorithm can now use up to 4 times less memory (thanks to Azadeh Soltani).
- Memory optimization of the FPGrowth algorithm
to reduce the number of object creation. Reduces memory usage by up to
2 times on some datasets. Also optimized how FPGrowth enumerate all
itemsets when an FP-Tree contains a single branch.
- Refactoring of classes in the package ca.pfv.spmf.gui to separate the main class file and command line interface (Main) from the graphical interface (MainWindow)
and from the code for launching algorithms (CommandProcessor) used by
both the command line interface and GUI. From now on, the class "Main" will be the main class of the SPMF library.
of the CM-ClaSP and ClaSP code so that the debuging code for showing
the trie is now located in a separated class (ShowTrie). This is to
avoid HeadlessExceptions when running ClaSP/CM-ClaSP in a headless
environment (bug reported by Wen Zhang), that is an environment where graphical interface is available (e.g. on some Linux servers).
- Fixed a bug in BIDE+ that was causing ArrayOutOfBound exception on some datasets (thanks to Mehran Memon for reporting this bug).
- Other minor modifications to remove some warnings.
- Fixed a bug in dEclat implementations. Due to an error in how methods were overloaded, some code of Eclat was executed in dEclat.
- Fixed a bug. MaxSP was not working when executed from the GUI.
- Fixed a bug about how an ID3 decision tree was printed to console (thanks to G. Gutierrez for reporting this bug)
- Optimization: replaced StringBuffer by StringBuilder in all classes (since it is more efficient).
- Added the BMS2 dataset to the "dataset" page of the website.
- Added more features to the CM-SPAM algorithm
- allow the user to specify items that need to appear in patterns found.
- allow the user to specify the minimum/maximum length of patterns to be found.
more features to the RuleGrowth/TRulegrowth algorithms. User can now
specify the maximum size of rule antecedents/consequents to be found.
- v0.96r2 - 2014-11-30
- Added more features to the TKS algorithm for top-k sequential pattern mining
- modified TKS to allows the user to specify items that need to appear in patterns found.
- modified TKS to allows the user to specify the minimum length of patterns to be found.
the "fix transaction database" tool. It is a tool that fix some common
problems that may be found in transaction database files created by
users. This tool (1) removes duplicate items in transactions of a
transaction database and (2) sort items in transactions (those
requirements are assumed by most itemset mining algorithms).
- v0.96r - 2014-11-24 (4 new algorithms)
- Added the FIN and PREPOST algorithms (thanks to Zhihong Deng for
providing the original C++ source code, that I have converted to Java).
These two algorithms are very recent frequent itemset mining algorithm
and are very fast. According to some preliminary experiments, for
example, PrePost can be a few times faster than FPGrowth on some
- Added implementations of the LCMFreq, LCM
and LCMMax algorithms for respectively mining frequent itemsets, closed
itemsets and maximal itemsets (thanks to Alan Souza for
providing an implementation of LCM). I have modified the source code to
add a few optimizations. The implementation of LCM is based on the
paper describing LCM v.2 by Uno, although it does not perform
transaction merging yet (some more optimizations could be added in the
future). LCM is an interesting algorithm because it was the winner of
the FIMI 2004 competition.
- Update (2014-11-24): There was a bug in my modifications of LCM to implement LCMMax, so LCMMax is temporarilly removed from the source code until I can fix the issue. It may takes a few days or more before I can fix it.
- v0.96q - 2014-09-25
- Fixed a bug in the compare() method of the Rule class used by TNS and TopSeqRules (thanks to C. Albert Thompson for reporting the bug).
- New tools:
a tool to add consecutive timestamps to a sequence database (this is
useful for generating datasets with timestamps for testing algorithms
that require timestamps).
- Added a tool for converting
a transaction database to a sequence database (this can be useful for
generating datasets for experiments, though in real-life, it may not
make sense to convert transactions without ordering to sequences with
- Added a tool to add synthetic
utility values to a transaction database (this is useful for generating
datasets to be used in high utility itemset mining).
- v0.96p- 2014-09-14 (3 new algorithms)
- Added an implementation of the ERMINER algorithm for sequential rule mining.
- Added an implementation of the IHUP and UP-GROWTH algorithms for high-utility itemset mining (thanks to Prashant Barhate for implementing these algorithms).
- v0.96o- 2014-09-12
a bug in Eclat so that frequent itemsets found where not correctly
separated by their size and another bug in Eclat such that Eclat was
not pruning some itemsets containing 2 items when the triangular matrix
was deactivated. (thanks to Abdalghani Abujabal).
- Fixed a bug that occured when running the "Fournier08-Closed+time" algorithm using the GUI, (thanks to Nahumi)
- v0.96n- 2014-08-15
a tool to convert a sequence database to a transaction database. This
tool is useful for example to apply an algorithm designed for a
transaction database to a sequence database (e.g. mining association
rules in a sequence database).
- Added a tool to generate statistics about a transaction database.
- Fixed a bug in the association rule generation using CFPgrowth that was introduced in a previous version.
- v0.96j- 2014-06-23
- Fixed a bug in the LAPIN implementation because of overflow and cleaned the code, and added some comments.
- v0.96k- 2014-06-22 (1 new algorithm)
v0.96i - 2014-06-15 (2 new algorithms)
the LAPIN (aka LAPIN-SPAM) algorithm for sequential pattern mining.
This is a first implementation in SPMF based on the LAPIN-LCI variation
of LAPIN described in the technical report of LAPIN.
the dECLAT and dCHARM algorithm for mining frequent itemsets. Those
algorithms are respectively variations of the Eclat and Charm
algorithms. The difference is that it uses the diffsets data structure
rather than tidsets. We provide an implementation of dEclat using sets
of integers to represent diffsets ("dEclat") and one version using
bitsets ("dEclat_bitset"). For dCharm, only a version using bitsets is
- v0.96h - 2014-06-15 (1 new algorithm)
- Added the DEFME algorithm for mining frequent generator itemsets using a depth-first search.
- v0.96g - 2014-06-14
- Fixed bugs in FEAT/ FSGP that occurred when multiple items per itemsets appeared in input sequences.
- v0.96f - 2014-06-11
- Optimized the code for association rule generation. Up to 10 times faster on some datasets.
- Improved the source code for closed association rule mining and merge some classes.
a class ca.pfv.spmf.algorithms.ArraysAlgo to put all important
algorithms on sorted list of integers that are shared by several
algorithms (to remove some redundancy in the source code).
- v0.96e - 2014-06-10
v0.96d - 2014-05-22(3 new algorithms)
- Optimization of the binary search in Apriori based algorithms (Apriori, AprioriClose, AprioriInverse...), as well as in FHM and HUI-Miner.
- Major optimizations of the Eclat and Charm
algorithms. I have re-implemented most of the code.
the encoding of Java source code files from ISO-8859-1 to UTF-8 to
remove warnings when compiling the code in Net Beans (Thanks to M.
Witbrock for reporting this issue)
- Added a performance comparison with a closed source data mining library in the "performance" section of the website.
- three new algorithms:
- FEAT, FSGP and VGEN for mining frequent sequential generator patterns from a sequence database.
- fixed an array out of bound exception in the FPGrowth algorithm that occurred when all items are infrequent (thanks to Aman).
- v0.96c - 2014-04-30
- fixed a bug in association rule generation with CFPGrowth (AlgoCFPGrowth.java), thanks to Manperta Negara Situmorang.
- v0.96b - 2014-04-24
- Major optimization of all FP-Growth based algorithms (FPGrowth, FPGrowth_with_strings, CFPGrowth++), thanks to Dan Cappucio. The modification is to add mapItemLastNodes in the FPTree / MISTree classes (see the "performance" section of the website for an overview of the speed improvement).
- fixed a bug in the CMDeo algorithm that was causing an array out of bound exception
- v0.96 - 2014-04-06 (3 new algorithms)
- new algorithms:
- added the GoKrimp and SeqKrimp algorithms (by Thanh Lam Hoang et al) to discover compressing sequential patterns directly of by post-processing.
- added the FHM algorithm for mining high-utility itemsets
- v0.95d - 2014-03-10
v0.95c - 2014-03-10
- CFPGrowth has been renamed CFPGrowth++ since it includes the optimizations proposed in CFPGrowth++.
- fixed a bug in the VMSP algorithm (no result was shown)
- v0.95b - 2014-03-06
- fixed a bug in the TSP implementation
- fix some inconsistencies in the source code of some sequential pattern mining algorithms (thanks to C. Zhou)
- v0.95 - 2014-02-28 (major revision - 10 new algorithms)
- new algorithms for pattern mining:
- TKS for top-k sequential pattern mining
- TSP for top-k sequential pattern mining
- VMSP for maximal sequential pattern mining
- MaxSP for maximal sequential pattern mining
- ESTDEC for mining recent frequent itemsets from a stream (by Azadeh Soltani)
- MEIT (Memory Efficient Itemset-Tree), a data structure for targeted association rule mining
- CM-SPAM for sequential pattern mining
- CM-SPADE for sequential pattern mining
- CM-ClaSP for closed sequential pattern mining
- PASCAL for mining frequent itemsets and identifying generators
- added a few minor optimizations to Charm and Eclat
- added the possibility to generate association rules from the output of the CFPGrowth algorithm.
- refactoring of SPADE, Clasp, SPAM_AGP, GSP and PrefixSpan_AGP
- Updated the map of data mining algorithms.
- improved the documentation web page of the website to add the description of the file formats of each algorithm.
- fixed a bug in the ID3 algorithm implementation
- fixed a bug to use the hierarchical clustering algorithm in the GUI (thanks to A. Rai)
- v0.94d - 2014-01-25
- v0.94c - 2013-11-26
- fixed rounding inconsistencies among sequential pattern mining algorithms (thanks to A. Pramudita).
- v0.94b - 2013-10-07
- Optimized the BIDE+ algorithm.
- fixed a bug in the trimBeginingAndEnd method of PseudoSequenceBIDE.java for the BIDE+ algorithm.
- Cleaned the code of FPGrowth for the case of a tree with a single path (thanks to R. Loomba).
- v0.94 - 2013-08-12 (major revision - 6 new algorithms)
v 0.93e - 2013-06-06
v 0.93d - 2013-06-01
- Several new sequential pattern mining algorithms by Antonio Gomariz Peñalver:
- SPADE (regular and parallelized versions)
- ClaSP, a very efficient vertical algorithm for closed sequential pattern mining
- PrefixSpan and SPAM (alternative implementations)
- Closed sequential pattern mining by post-processing with PrefixSpan and SPAM.
- Also, new test files: contextCloSpan.txt, contextClaSP.txt and contextSPADE.txt
- Bug fixes
- fixed an important in the class AbstractOrderedItemset of SPMF 0.93, which have affected the result of several algorithms
including the algorithm for mining MNR rules.(thanks to Faizal Feroz):
- fixed a bug in the class ItemsetTree (thanks to Faizal Feroz):
- fixed a bug of integer overflow for large datasets (e.g. accidents) that occurred in the hashcode function of Charm (class HashTable) and other algorithms using the same class (thanks to K. Srinvas Rao)
- fixed a bug in Charm (bug also introduced in 0.93 due to refactoring).
- fixed a bug in TNS/TopSeqRules (thanks to Peter Toth)
- Updated the map of data mining algorithms and documentation.
v 0.93c - 2013-05-31
- added a new datasets page on the website.
support for the ARFF file format (a popular file format that represent
a relational database table as a text file). The ARFF format can be
used as input in the command line interface and graphical interface of
SPMF by algorithms that take a transaction database as input (most
itemset mining and association rule mining algorithms). This version
support all features of ARFF except that (1) the character "=" is
forbidden and (2) escape characters are not considered. Note that when
the ARFF format is used, the performance will be less than if the
native SPMF file format is used because a conversion has to be
performed. However, this additional cost should be small. Note that
SPMF also support a few other formats besides ARFF (see the last
examples in the documentation on file conversion for more information).
However, only the ARFF format is converted on-the-fly (other formats
have to be converted manually before applying an algorithm). 36
datasets in the ARFF format can be found in the datasets page of this website.
v 0.93b - 2013-05-21
- added a tool to convert the CSV format with positive integers to a transaction database in SPMF format.
- improved the documentation
- fixed a bug in the hierarchical clustering algorithm
- fixed a bug in the sequence database generator and transaction database generator.
- added the Hirate-Yamana algorithm to the GUI interface and command line interface
- v 0.93 - 2013-05-09
(major revision - 7 new algorithms)
- Several packages and files have been renamed for a better organization of the source code.
- Merged some files that were duplicated in various packages for less redundancy.
- Optimized several Apriori-based algorithms such as AprioriRare and AprioriInverse.
- Some algorithms have been optimized to use better data structures such as using arrays instead of list of integers.
7 algorithms to the GUI: Apriori_TIDClose, AprioriTID, Apriori -
Association rules, Sporadic association rules, Indirect association
rules, IGB, MNR.
- Fixed some small bugs in the source code.
to standardize as much as possible the output files written by the
algorithms so that algorithms performing the same task will output the
same file format.
- The CHARM-MFI algorithms was not
generating the correct result. I have found that the problem is the
algorithm itself, which is incorrect as it is described in Szathmary
(2006) for some special cases. I have adapted the algorithm so that it
generates the correct result.
- I have removed four
less popular algorithms that were not well-documented and not offered
in the release version of SPMF: the algorithm for mining pseudo-closed
itemsets, the Guigues-Duquenne basis, proper basis and the structural
basis of association rules. If you want these algorithms, you can
download version 0.92 of SPMF to get them.
- Added the maximum pattern length constraint to PrefixSpan and SPAM algorithms.
- v 092c - 2013-04-08
- v 092b - 2013-03-14
- fixed a bug in the calculation of the lift measure for association rules.
- v 092 - 2013-03-05
(1 new algorithm)
- added an implementation of HUI-Miner, one of the best algorithms for high utility itemset mining.
- v 091 - 2012-12-29
an implementation of Apriori that uses a hash-tree to store candidates
to calculate the support and generate candidates more efficiently (it
is named "Apriori_with_hash_tree" or "AprioriHT"). It can be up to
twice faster than the previous version (a performance comparison).
- added a version of FPGrowth that accepts strings instead of integers as input (FPGrowth_itemsets_with_strings)
- v 090 - 2012-12-25:
v 0.89 - 2012-08-26:
- added a tool to generate transaction databases.
- added a tool to generate sequence databases.
- added a tool to convert sequence databases to the SPMF format.
- added a command line interface to run algorithms from the
- added an implementation of TNR for top-k non-redundant association
- added an implementation of TNS for top-k non-redundant sequential
- clean the source code a little bit.
- fixed some small bugs in the Indirect, FHSAR and ZART algorithms.
- added an implementation of CFPGROWTH for mining itemsets with
multiple support thresholds (implemented by Azadeh Soltani).
- v 0.88 - 2012-08-22:
- added an implementation of MSAPRIORI for mining itemsets with
multiple support thresholds (implemented by Azadeh Soltani).
- fixed a small bug in the red-black tree implementation used by
TOPKRULES and TOPSEQRULES
- fixed a small bug in Cluster.java (thanks to F. Jafari)
- fixed a small bug in TRULEGROWTH.
- added implementations of TRULEGROWTH and BIDE+ that accepts
strings instead of integers as input.
- v 0.87 - 2012-07-28:
- improved the user interface so that (1) example parameter values
are shown for each parameter and (2) that percentage values can be
entered either in decimal format (e.g. 0.5) or as a percentage (e.g.
- fixed a bug in the hierarchical clustering algorithm in the GUI
version of SPMF
- v 0.86 - 2012-07-26:
- modified the user interface so that algorithms are presented by
their category in the combo box such as "sequential pattern mining",
"sequential rule mining", "itemset mining", "clustering", etc.
- optimized the basic Apriori implementation with binary search for
checking subsets of candidates, arrays of integers instead of lists,
- v 0.85 - 2012-07-17:
- added several algorithms to the GUI version of SPMF: K-MEANS,
TWO-PHASE, VME, ZART, RELIM, RULEGEN, SEQ-DIM, etc.
- improved the version of K-Means and the hierarchical clustering
algorithm so that it can work with vectors and cleaned the code..
- added some small optimizations to the RELIM and ZART
- cleaned the implementation of the algorithm for mining
- cleaned the code of algorithms for mining multi-dimensional
sequential patterns and modified them so that they save the results
to a file.
- cleaned the source code of Apriori-based algorithms.
- v 0.84 - 2012-07-15: added a few algorithms for
building, updating and querying an Itemset-Tree. An itemset
tree is a special structure representing a database that allows
efficiently generating targeted association rules, frequent itemsets and
to get the support of any itemset. This structure can be updated
incrementally (only available in the source code version of SPMF).
- v. 0.83 - 2012-07-04: added the possibility of mining
association rule with the lift measure and the minlift threshold.
- v. 0.82 - 2012-06-30: fixed a bug in the SPAM
implementation that occurred when minsup =0 (thanks to D. Bhatt).
- v. 0.81 - 2012-04-13: improved the SPAM
implementation. The number of bits by sequence is now variable. The
algorithm is therefore more memory efficient and can run on larger
datasets with longer sequences.
- v. 0.80 - 2012-04-08: improved the user interface
(thanks to Hanane Amirat), changed the license of the software to GPL v3, fixed a minor bug in the
TRuleGrowth algorithm, cleaned the source code of several algorithms by
removing some unused methods.
- v. 0.79 - 2012-03-17: added five Apriori-based
algorithms to the GUI version (Apriori, AprioriClose, AprioriRare,
AprioriInverse, UApriori) and made some minor improvements.
- v. 0.78 - 2012-03-05:
- Added the TRULEGROWTH for mining sequential rules with the window
- Added the TOPKRULES algorithm for mining the top-k association
rules in a transaction database.
- Added the TOPSEQRULES algorithm for mining the top-k sequential
rules in a sequence database.
- Added an implementation of FPGROWTH that saves the result to a
file instead of keeping the result into memory.
- Cleaned the implementation of PREFIXSPAN. I removed some unused
variables in the pseudo-sequence implementation (thanks to
shouwangji@___ for reporting this),
- Added an implementation of the KD-TREE data structure,
- Added a simple graphical user interface (ca.pfv.spmf.gui.MainWindow)
that allows to run 17 main algorithms (other algorithms will be
added to the user interface later).
- v.0.77 - 2011-10-28: Added an implementation of FHSAR
for association rule hiding.
- v.0.76 - 2011-10-22: Added faster and more memory
efficient implementations of AprioriTID, ECLAT and CHARM that use bit
vectors for representing tids sets.
- v.0.75 - 2011-10-18: Added an implementation of
INDIRECT for mining "indirect association rules".
- v.0.74 - 2011-08-11: Added an implementation of SPAM
for sequential pattern mining
- v.0.73 - 2011-09: cleaned the implementation of
PREFIXSPAN and BIDE+ and made some optimizations, added an implementation
of RULEGEN for generating sequential rules from sequential patterns.
- v.0.72 - 2011-07: Added implementations of RULEGROWTH
for mining sequential rules,ID3 for creating decision trees, VME for
mining erasable itemsets. Also, I cleaned a little bit the
implementations of K-Means, the hierarchical clustering algorithm,
AprioriTID and Apriori_TIDClose, CMRules and CMDeo.
- v.0.71 - 2011-02-01: fixed a bug in the CMRULES and
- v 0.70 - 2011-01-06: added an implementation of the
H-MINE algorithm for mining frequent itemsets.
- v 0.69- 2010-11-27: added two implementations of the
DCI_CLOSED algorithm for mining frequent closed itemsets (one
straigthforward implementation and one with optimizations).
- v 0.68 - 2010-11-09: improved the performance of all
Apriori-based algorithms. Also, I have added implementations of four
- the TWO-PHASE algorithm for mining high-utility itemsets,
Apriori_TIDClose for frequent closed itemset mining, and CMRULES and
CMDEO algorithms for mining sequential rules.
- v 0.67 - 2010-10-07: added an implementation of
U-Apriori for mining frequent itemsets from uncertain data, fixed a bug
in the CHARM and ECLAT algorithms
- v 0.66 - 2010-08-25: added an implementation of
AprioriTID and some code for generating association rules by using
- v 0.65 - 2010-08-14: fixed a bug in the BIDE+
algorithm (thanks to Brock for reporting it)
- v 0.64 - 2010-07-19: minor
changes (add comments, fixed minor bugs, etc.)
- v 0.63 - 2010-04: minor
changes (add comments, fixed minor bugs, etc.)
- v.062 - 2010-04-11: fixed a bug in the triangular
matrix used by the CHARM and ECLAT algorithms.
- v.061 - 2010-03-20: implementation of FP-GROWTH,
fixed a bug in BIDE+ (thanks to G. Bruno for reporting it) and fixed a
bug in RELIM
- v.060 - 2010-03-15: fixed a bug in CHARM/ECLAT
occurring if items are missing in the input file (thanks to A. Pardeshi
for reporting it)
- v.059 - 2010-02-05: fixed a bug in the BIDE+
implementation (thanks to G. Bruno for reporting it)
- v.0.58 - 2009-12-15: implementations of the CHARM,
CHARM-MFI and ECLAT algorithms
- v.0.57 - 2009-11: implementation of the BIDE+
- v.0.56 : 2009-10-09:
implementations of the RELIM and PREFIXSPAN algorithms
- v.0.55 : 2009-08-01: cleaned the
code and fixed some bugs
- v.050 - 2009-05-31 implementation of APRIORI-RARE
and the APRIORI algorithm for mining association rules
- v.049 - 2008-12-07: initial release.