SPMFAn Open-Source Data Mining Library

Introduction

Algorithms

Download

Documentation

Datasets

FAQ

License

Contributors

Citations

Performance

Developers' guide

Forum

Blog

Other resources

600687 visitors
since 2010

-------------
* We are hiring a postdoctoral researcher (details here) and a Ph.D. student.  Send your CV with cover letter to Prof. Fournier-Viger.
-------------

 

Algorithms

SPMF offers implementations of the following data mining algorithms.

Sequential Pattern Mining

These algorithms discover sequential patterns in a set of sequences. For a good overview of sequential pattern mining algorithms, please read this survey paper.

Sequential Rule Mining

These algorithms discover sequential rules in a set of sequences.

Sequence Prediction

These algorithms predict the next symbol(s) of a sequence based on a set of training sequences

Itemset Mining

These algorithms discover interesting itemsets that appear in a transaction database (a transaction is a set of symbols). For a good overview of itemset mining, please read this survey paper.

Periodic Pattern Mining

These algorithms discover patterns that periodically appear in a sequence of complex events (a transaction database)

  • the PFPM algorithm (Fournier-Viger et al, 2016a) for mining frequent periodic patterns in a sequence of transactions (a transaction database))new
  • the PHM algorithm (Fournier-Viger et al, 2016b, powerpoint) for mining periodic high-utility patterns (periodic patterns that yield a high profit) in a sequence of transactions (a transaction database) containing utility information new

Episode Mining

These algorithms discover episodes that appear in a sequence of complex events (which is simlar to discovering sequential patterns in a transaction database)

  • the TUP algorithm (Rathore et al., 2016) for mining the top-k high utility episodes in a sequence of complex events (a transaction database) with utility information new
  • the US-SPAN algorithm (Wu et al., 2013 ) for mining high utility episodes in a sequence of complex events (a transaction database) with utility information new

High-Utility Pattern Mining

These algorithms discover patterns having a high utility (importance) in different kinds of data

Association Rule Mining

These algorithms discover interesting associations between symbols in a transaction database.

  • an algorithm for mining all association rules in a transaction database (Agrawal & Srikant, 1994)
  • an algorithm for mining all association rules with the lift measure in a transaction database (adapted from Agrawal & Srikant, 1994)
  • an algorithm for mining the IGB informative and generic basis of association rules in a transaction database (Gasmi et al., 2005)
  • an algorithm for mining perfectly sporadic association rules (Koh & Roundtree, 2005)
  • an algorithm for mining closed association rules (Szathmary et al. 2006).
  • an algorithm for mining minimal non redundant association rules (Kryszkiewicz, 1998)
  • the Indirect algorithm for mining indirect association rules (Tan et al. 2000; Tan et 2006)
  • the FHSAR algorithm for hiding sensitive association rules (Weng et al. 2008)
  • the TopKRules algorithm for mining the top-k association rules (Fournier-Viger, 2012b, powerpoint)
  • the TopKClassRules algorithm for mining the top-k class association rules (a variation of TopKRules. This latter is described in Fournier-Viger, 2012b, powerpoint)
  • the TNR algorithm for mining top-k non-redundant association rules (Fournier-Viger 2012d, powerpoint)

Clustering

These algorithms automatically find clusters in different kinds of data

  • the original K-Means algorithm (MacQueen, 1967)
  • the Bisecting K-Means algorithm (Steinbach et al, 2000)
  • algorithms for density-based clustering
    • the DBScan algorithm (Ester et al., 1996)
    • the Optics algorithm to extract a cluster ordering of points, which can then be use to generate DBScan style clusters and more (Ankerst et al, 1999)
  • a hierarchical clustering algorithm
  • a tool called Cluster Viewer for visualizing clusters new
  • a tool called Instance Viewer for visualizing the input of clustering algorithmsnew

Time series mining

These algorithms perform various data analysis task on time series.

  • an algorithm for converting a time series to a sequence of symbols using the SAX representation of time series. Note that if one converts a set of time series with SAX, he will obtain a sequence database, which allows to then apply traditional algorihtms for sequential rule mining and sequential pattern mining on time series (SAX, 2007). new
  • algorithms for calculating the prior moving average of a time series (to remove noise) new
  • algorithms for calculating the cumulative moving average f a time series (to remove noise) new
  • algorithms for calculating the central moving average of a time series (to remove noise) new
  • an algorithm for calculating the median smoothing of a time series (to remove noise) new
  • an algorithm for calculating the exponential smoothing of a time series (to remove noise) new
  • an algorithm for calculating the min max normalization of a time series new
  • an algorithm for calculating the autocorrelation function of a time series new
  • an algorithm for calculating the standardization of a time series new
  • an algorithm for calculating the first and second order differencing of a time series new
  • an algorithm for calculating the piecewise aggregate approximation of a time series (to reduce the number of data points of a time series) new
  • an algorithm for calculating the linear regression of a time series (using the least squares method) new
  • an algorithm for splitting a time series into segments of a given length
  • an algorithm for splitting a time series into a given number of segments
  • algorithms to cluster time series (group time-series according to their similarities). This can be done by applying the clustering algorithms offered in SPMF (K-Means, Bisecting K-Means, DBScan, OPTICS, Hierarchical clustering) on time series.
  • a tool called Time Series Viewer for visualizing time series new

Classification

  • the ID3 algorithm for building decision trees (Quinlan, 1986)

Text mining

  • an algorithm for classifying text documents using a Naive Bayes classifier approach (S. Raghu, 2015)
  • an algorithm for clustering texts using the tf*idf measure (S. Raghu, 2015)

Data structures

  • red-black tree,
  • itemset-tree,
  • binary tree,
  • KD-tree,
  • triangular matrix.

Tools

  • A tool for generating a synthetic transaction database
  • A tool for generating a synthetic sequence database
  • A tool for generating a synthetic sequence database with timestamps
  • A tool for calculating statistics about a transaction database
  • A tool for calculating statistics about a transaction database with utility information
  • A tool for calculating statistics about a sequence database
  • A tool for converting a sequence database to a transaction database
  • A tool for converting a transaction database to a sequence database
  • A tool for converting a text file to a sequence database (each sentences becomes a sequence)
  • A tool for converting a sequence database in various formats (CSV, KOSARAK, BMS, IBM...) to a sequence database in SPMF format
  • A tool for converting a transaction database in various formats (CSV...) to a transaction database in SPMF format
  • A tool for converting time-series to a sequence database
  • A tool to generate utility values for a transaction database
  • A tool to add timestamps to a sequence database
  • A tool for removing utility information from a database  having utility information
  • A tool to resize a database in SPMF format (a text file) using a percentage of lines of data from an original database.
  • A tool for visualizing time-series

Visual map of algorithms

You can visualize the relationship between the various data mining algorithms offered in SPMF by clicking on this map (last updated : 2015/09/12 - SPMF 0.97):

map_algorithms_spmf_data_mining092_small

Copyright © 2008-2018 Philippe Fournier-Viger. All rights reserved.