SPMFAn Open-Source Data Mining Library

Introduction

Algorithms

Download

Documentation

Datasets

FAQ

License

Contributors

Citations

Performance

Developers' guide

Videos

Forum

Blog

Other resources

691510 visitors
since 2010

-------------
* We are hiring a postdoctoral researcher (details here) and a Ph.D. student.  Send your CV with cover letter to Prof. Fournier-Viger.
-------------

 

Algorithms

SPMF offers implementations of the following data mining algorithms.

Sequential Pattern Mining

These algorithms discover sequential patterns in a set of sequences. For a good overview of sequential pattern mining algorithms, please read this survey paper.

Sequential Rule Mining

These algorithms discover sequential rules in a set of sequences.

Sequence Prediction

These algorithms predict the next symbol(s) of a sequence based on a set of training sequences

Itemset Mining

These algorithms discover interesting itemsets that appear in a transaction database (a transaction is a set of symbols). For a good overview of itemset mining, please read this survey paper.

  • algorithms for discovering frequent itemsets in a transaction database.
  • algorithms for discovering frequent closed itemsets in a transaction database.
  • algorithms for recovering all frequent itemsets from frequent closed itemsets:
    • the LevelWise algorithm (Pasquier et al., 1999)
    • the DFI-Growth algorithm (___ et al., 2018) new
  • algorithms for discovering frequent maximal itemsets in a transaction database.
    • the FPMax algorithm (Grahne and Zhu, 2003)
    • the Charm-MFI algorithm for discovering frequent closed itemsets and maximal frequent itemsets by post-processing in a transaction database (Szathmary et al. 2006)
  • algorithms for mining frequent itemsets with multiple minimum supports
  • algorithms for mining generator itemsets in a transaction database
    • the DefMe algorithm for mining frequent generator itemsets in a transaction database (Soulet & Rioult, 2014)
    • the Pascal algorithm for mining frequent itemsets, and identifying at the same time which one are generators (Bastide et al., 2002)
    • the Zart algorithm for discovering frequent closed itemsets and their generators in a transaction database (Szathmary et al. 2007)
  • algorithms for mining rare itemsets and/or correlated itemsets in a transaction database
    • the AprioriInverse algorithm for mining perfectly rare itemsets (Koh & Roundtree, 2005)
    • the AprioriRare algorithm for mining minimal rare itemsets and frequent itemsets (Szathmary et al. 2007b)
    • the CORI algorithm for mining minimal rare correlated itemsets using the support and bond measures (Bouasker et al. 2015)
  • algorithms for performing targeted and dynamic queries about association rules and frequent itemsets.
    • the Itemset-Tree, a data structure that can be updated incrementally, and algorithms for querying it. (Kubat et al, 2003)
    • the Memory-Efficient Itemset-Tree, a data structure that can be updated incrementally, and algorithms for querying it. (Fournier-Viger, 2013, powerpoint)
  • algorithms to discover frequent itemsets in a stream
    • the estDec algorithm for mining recent frequent itemsets in a data stream (Chang & Lee, 2003)
    • the estDec+ algorithm for mining recent frequent itemsets in a data stream (Shin et al., 2014)
    • the CloStream algorithm for mining frequent closed itemsets in a data stream (Yen et al, 2009)
  • the U-Apriori algorithm for mining frequent itemsets in uncertain data (Chui et al, 2007)
  • the VME algorithm for mining erasable itemsets (Deng & Xu, 2010)
  • algorithms to discover fuzzy frequent itemsets in a quantitative transaction database

Periodic Pattern Mining

These algorithms discover patterns that periodically appear in a sequence of complex events (a transaction database)

  • the PFPM algorithm (Fournier-Viger et al, 2016a, powerpoint, video) for mining frequent periodic patterns in a sequence of transactions (a transaction database))new
  • the PHM algorithm (Fournier-Viger et al, 2016b, powerpoint) for mining periodic high-utility patterns (periodic patterns that yield a high profit) in a sequence of transactions (a transaction database) containing utility information new

Episode Mining

These algorithms discover episodes that appear in a sequence of complex events (which is simlar to discovering sequential patterns in a transaction database)

  • the TUP algorithm (Rathore et al., 2016) for mining the top-k high utility episodes in a sequence of complex events (a transaction database) with utility information new
  • the US-SPAN algorithm (Wu et al., 2013 ) for mining high utility episodes in a sequence of complex events (a transaction database) with utility information new

High-Utility Pattern Mining

These algorithms discover patterns having a high utility (importance) in different kinds of data

Association Rule Mining

These algorithms discover interesting associations between symbols in a transaction database.

  • an algorithm for mining all association rules in a transaction database (Agrawal & Srikant, 1994)
  • an algorithm for mining all association rules with the lift measure in a transaction database (adapted from Agrawal & Srikant, 1994)
  • an algorithm for mining the IGB informative and generic basis of association rules in a transaction database (Gasmi et al., 2005)
  • an algorithm for mining perfectly sporadic association rules (Koh & Roundtree, 2005)
  • an algorithm for mining closed association rules (Szathmary et al. 2006).
  • an algorithm for mining minimal non redundant association rules (Kryszkiewicz, 1998)
  • the Indirect algorithm for mining indirect association rules (Tan et al. 2000; Tan et 2006)
  • the FHSAR algorithm for hiding sensitive association rules (Weng et al. 2008)
  • the TopKRules algorithm for mining the top-k association rules (Fournier-Viger, 2012b, powerpoint)
  • the TopKClassRules algorithm for mining the top-k class association rules (a variation of TopKRules. This latter is described in Fournier-Viger, 2012b, powerpoint)
  • the TNR algorithm for mining top-k non-redundant association rules (Fournier-Viger 2012d, powerpoint)

Clustering

These algorithms automatically find clusters in different kinds of data

  • the original K-Means algorithm (MacQueen, 1967)
  • the Bisecting K-Means algorithm (Steinbach et al, 2000)
  • algorithms for density-based clustering
    • the DBScan algorithm (Ester et al., 1996)
    • the Optics algorithm to extract a cluster ordering of points, which can then be use to generate DBScan style clusters and more (Ankerst et al, 1999)
  • a hierarchical clustering algorithm
  • a tool called Cluster Viewer for visualizing clusters
  • a tool called Instance Viewer for visualizing the input of clustering algorithms

Time series mining

These algorithms perform various data analysis task on time series.

  • an algorithm for converting a time series to a sequence of symbols using the SAX representation of time series. Note that if one converts a set of time series with SAX, he will obtain a sequence database, which allows to then apply traditional algorihtms for sequential rule mining and sequential pattern mining on time series (SAX, 2007). new
  • algorithms for calculating the prior moving average of a time series (to remove noise) new
  • algorithms for calculating the cumulative moving average f a time series (to remove noise) new
  • algorithms for calculating the central moving average of a time series (to remove noise) new
  • an algorithm for calculating the median smoothing of a time series (to remove noise) new
  • an algorithm for calculating the exponential smoothing of a time series (to remove noise) new
  • an algorithm for calculating the min max normalization of a time series new
  • an algorithm for calculating the autocorrelation function of a time series new
  • an algorithm for calculating the standardization of a time series new
  • an algorithm for calculating the first and second order differencing of a time series
  • an algorithm for calculating the piecewise aggregate approximation of a time series (to reduce the number of data points of a time series)
  • an algorithm for calculating the linear regression of a time series (using the least squares method) new
  • an algorithm for splitting a time series into segments of a given length
  • an algorithm for splitting a time series into a given number of segments
  • algorithms to cluster time series (group time-series according to their similarities). This can be done by applying the clustering algorithms offered in SPMF (K-Means, Bisecting K-Means, DBScan, OPTICS, Hierarchical clustering) on time series.
  • a tool called Time Series Viewer for visualizing time series new

Classification

  • the ID3 algorithm for building decision trees (Quinlan, 1986)

Text mining

  • an algorithm for classifying text documents using a Naive Bayes classifier approach (S. Raghu, 2015)
  • an algorithm for clustering texts using the tf*idf measure (S. Raghu, 2015)

Data structures

  • red-black tree,
  • itemset-tree,
  • binary tree,
  • KD-tree,
  • triangular matrix.

Tools

  • A tool for generating a synthetic transaction database
  • A tool for generating a synthetic sequence database
  • A tool for generating a synthetic sequence database with timestamps
  • A tool for calculating statistics about a transaction database
  • A tool for calculating statistics about a transaction database with utility information
  • A tool for calculating statistics about a sequence database
  • A tool for converting a sequence database to a transaction database
  • A tool for converting a transaction database to a sequence database
  • A tool for converting a text file to a sequence database (each sentences becomes a sequence)
  • A tool for converting a sequence database in various formats (CSV, KOSARAK, BMS, IBM...) to a sequence database in SPMF format
  • A tool for converting a transaction database in various formats (CSV...) to a transaction database in SPMF format
  • A tool for converting time-series to a sequence database
  • A tool to generate utility values for a transaction database
  • A tool to add timestamps to a sequence database
  • A tool for removing utility information from a database  having utility information
  • A tool to resize a database in SPMF format (a text file) using a percentage of lines of data from an original database.
  • A tool for visualizing time-series

Visual map of algorithms

You can visualize the relationship between the various data mining algorithms offered in SPMF by clicking on this map (last updated : 2015/09/12 - SPMF 0.97):

map_algorithms_spmf_data_mining092_small

Copyright © 2008-2018 Philippe Fournier-Viger. All rights reserved.