SPMF: An Open-Source Data Mining Library

Datasets

SPMF natively uses text files as input. Small example input files for each algorithm are described in the SPMF documentation. These sample files can be downloaded from the download page (test_files.zip) for the release version of SPMF, and are included with the source code for the source code version. However, they are quite small. This page therefore provides larger datasets that can be used with SPMF and that are often used in the data mining literature to evaluate and compare algorithm performance. Unless otherwise indicated, the datasets are in SPMF format.

Datasets for Sequential Pattern Mining / Sequential Rule Mining / Sequence Prediction

Real-life datasets in SPMF format:

  • BMSWebView1 (Gazelle) (KDD CUP 2000). This dataset contains 59,601 sequences of click-stream data from an e-commerce website. It contains 497 distinct items. The average sequence length is 2.42 items with a standard deviation of 3.22. Some sequences are long: for example, 318 sequences contain more than 20 items.
  • BMSWebView2 (Gazelle) (KDD CUP 2000). This is a second dataset used in the KDD CUP 2000 competition. It contains 77,512 sequences of click-stream data and 3,340 distinct items. The average sequence length is 4.62 items with a standard deviation of 6.07 items.
  • Kosarak. This is a very large dataset containing 990,000 sequences of click-stream data from a Hungarian news portal. The dataset in its original format can be found at http://fimi.ua.ac.be/data/. The dataset converted to SPMF format is provided here. Because this dataset is very large, we also provide a subset of only 10,000 sequences in SPMF format here and a subset of 25,000 sequences here.
  • Sign: a dataset of sign language utterances containing approximately 800 sequences. The original dataset file, in another format, can be obtained here along with more details on this dataset.
  • Bible. This dataset is a conversion of the Bible into a sequence database (each word is an item). It contains 36,369 sequences and 13,905 distinct items. The average length of a sequence is 21.6 items. The average number of distinct items per sequence is 17.84.
  • Leviathan. This dataset is a conversion of the novel Leviathan by Thomas Hobbes (1651) into a sequence database (each word is an item). It contains 5,834 sequences and 9,025 distinct items. The average number of items per sequence is 33.8. The average number of distinct items per sequence is 26.34.
  • MSNBC: a dataset of click-stream data. The original dataset, obtained from the UCI repository, contains 989,818 sequences. Here the shortest sequences have been removed to keep only 31,790 sequences. The number of distinct items in this dataset is 17 (an item is a webpage category). The average number of itemsets per sequence is 13.33. The average number of distinct items per sequence is 5.33. Update: if you need the full dataset, you can download it.
  • FIFA: a dataset of 20,450 sequences of click-stream data from the website of FIFA World Cup 98. It has 2,990 distinct items (webpages). The average sequence length is 34.74 items with a standard deviation of 24.08 items. This dataset was created by processing a part of the World Cup web logs offered here.
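
The statistics quoted for the datasets above (number of sequences, number of distinct items, average sequence length, standard deviation) can be recomputed from a file. The following Python sketch is not part of SPMF. It assumes the SPMF sequence format as we understand it from the SPMF documentation: one sequence per line, items written as integers separated by spaces, -1 closing each itemset and -2 closing the sequence. The function name, the handling of comment lines and the sample data are illustrative assumptions. It also assumes a population standard deviation, which may differ from how the figures above were computed.

```python
import statistics


def sequence_stats(lines):
    """Summarize sequences given as lines in the (assumed) SPMF sequence format."""
    lengths = []      # number of items in each sequence
    distinct = set()  # all distinct items seen in the database
    for line in lines:
        line = line.strip()
        # Skip blank lines and lines assumed to be comments or metadata.
        if not line or line[0] in "#%@":
            continue
        # -1 ends an itemset and -2 ends the sequence; neither is an item.
        items = [int(tok) for tok in line.split() if int(tok) >= 0]
        lengths.append(len(items))
        distinct.update(items)
    return {
        "sequences": len(lengths),
        "distinct_items": len(distinct),
        "avg_length": statistics.mean(lengths),
        "stdev_length": statistics.pstdev(lengths),
    }


# Hypothetical three-sequence database, for illustration only.
sample = [
    "1 -1 2 -1 3 -1 -2",
    "2 -1 2 -1 -2",
    "4 -1 -2",
]
print(sequence_stats(sample))
```

To run it on a downloaded dataset, pass an open file object instead of the sample list, since iterating over a file yields its lines.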

Synthetic datasets:

Datasets for Frequent Itemset mining / Association Rule Mining

Real-life datasets in SPMF format

Datasets obtained from http://fimi.ua.ac.be/data/, which can be used directly in SPMF:

  • retail: anonymized retail market basket data from an anonymous Belgian retail store.
  • mushrooms: prepared based on the UCI mushrooms dataset.
  • pumsb: census data for population and housing.
  • chess: prepared based on the UCI chess dataset.
  • connect: prepared based on the UCI connect-4 dataset.
  • accidents: anonymized traffic accident data.
  • BMS_WebView_1: click-stream data from a webstore used in KDD-Cup 2000 (note: this version of the dataset has been prepared for itemset mining / association rule mining).
  • BMS_WebView_2: click-stream data from a webstore used in KDD-Cup 2000 (note: this version of the dataset has been prepared for itemset mining / association rule mining).

Two additional retail store datasets

  • chainstore: dataset of customer transactions from a retail store, obtained and transformed from NU-Mine Bench
  • foodmart: dataset of customer transactions from a retail store, obtained and transformed from SQL-Server 2000

Nine large datasets, some of them with item labels, converted by Zhang Zhongjie (original datasets from the UCI repository)

  • kddcup99: This dataset was transformed from the KDD CUP 1999 dataset, found at https://archive.ics.uci.edu/ml/datasets/KDD+Cup+1999+Data. 1,000,000 instances were transformed to transactions, and the dataset contains 135 items. The meaning of each item is given in the file kddcup99Attributes.xlsx.
  • OnlineRetail: This dataset was transformed from the Online Retail dataset, found at https://archive.ics.uci.edu/ml/datasets/Online+Retail. The transformed dataset contains 541,909 transactions and 2,603 items. The meaning of each item is given in the file OnlineRetailAttributes.xlsx.
  • PAMP: This dataset was transformed from the PAMAP2 dataset, obtained from https://archive.ics.uci.edu/ml/datasets/PAMAP2+Physical+Activity+Monitoring. 1,000,000 instances were transformed to transactions. The dataset contains 141 distinct items. The meaning of each item is given in the file PAMPAttributes.xlsx.
  • Skin: This dataset was transformed from the Skin Segmentation dataset, obtained from https://archive.ics.uci.edu/ml/datasets/Skin+Segmentation. It has 245,057 transactions and 11 items. The meaning of each item is given in the file SkinAttributes.xlsx.
  • USCensus: This dataset was transformed from the US Census 1990 dataset, obtained from https://archive.ics.uci.edu/ml/datasets/US+Census+Data+(1990). 1,000,000 instances were transformed to transactions, and there are 396 distinct items. The meaning of each item is given in the file USCensusAttributes.xlsx.
  • RecordLink: The original dataset comes from http://archive.ics.uci.edu/ml/datasets/Record+Linkage+Comparison+Patterns. It contains 574,913 instances, which were transformed to transactions, and 29 distinct items.
  • PowerC: This dataset is about household electric power consumption. It was prepared using the original dataset from https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption. The dataset contains 1,040,000 instances, which were transformed to transactions. The value range of each attribute from the 3rd to the 9th is divided into 10 equal parts, and each part is represented by a number. As a result, the dataset has 140 items.
  • Susy: This dataset is related to physics, and more precisely to particles detected using a particle accelerator. The dataset contains 5,000,000 instances prepared using the original dataset obtained from http://archive.ics.uci.edu/ml/datasets/SUSY. The instances were transformed to transactions such that the value range of every attribute is divided into 10 equal parts and each part is represented by a number. As a result, the dataset has 190 items.
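
The transformation described for PowerC and Susy (dividing the value range of each attribute into 10 equal parts and representing each part by a number) can be sketched as follows. This is a minimal Python illustration of equal-width binning, not the conversion program actually used for these datasets. The function name, the item numbering scheme (attribute index times the number of bins, plus the bin index, plus one) and the sample rows are assumptions, and the sketch does not try to reproduce the item counts reported above.

```python
def discretize_to_transactions(rows, bins=10):
    """Turn numeric rows into transactions using equal-width bins per attribute."""
    n_attrs = len(rows[0])
    # Smallest and largest observed value of each attribute.
    lows = [min(row[j] for row in rows) for j in range(n_attrs)]
    highs = [max(row[j] for row in rows) for j in range(n_attrs)]
    transactions = []
    for row in rows:
        items = []
        for j, value in enumerate(row):
            width = (highs[j] - lows[j]) / bins
            if width == 0:
                b = 0  # constant attribute: everything falls in the first bin
            else:
                # The maximum value would land in bin `bins`; clamp it to the last bin.
                b = min(int((value - lows[j]) / width), bins - 1)
            # Assumed numbering: each attribute gets its own block of `bins` item ids.
            items.append(j * bins + b + 1)
        transactions.append(items)
    return transactions


# Hypothetical data with two numeric attributes, for illustration only.
rows = [
    [0.0, 100.0],
    [5.0, 150.0],
    [10.0, 200.0],
]
for transaction in discretize_to_transactions(rows):
    print(" ".join(str(item) for item in transaction))
```

With this numbering, every instance becomes a transaction with exactly one item per attribute, so transactions all have the same length.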

Real-life datasets in ARFF format.

The GUI and command line interface of SPMF version 0.93d and higher can read ARFF (Attribute-Relation File Format) files. You can download a collection of 36 real-life datasets in ARFF format from the UCI machine learning repository. These files can be used with all association rule mining and itemset mining algorithms that take a transaction database as input. Note that the "MainTest..." examples in the source code have not yet been adapted to read ARFF files (for now, ARFF files are only accepted by the GUI and command line interface). Most features of the ARFF format are supported, except that (1) the character "=" is forbidden and (2) escape characters are not considered. Note that when the ARFF format is used, the data mining algorithms will run slightly slower than with the native SPMF file format, because the input file is automatically converted before the algorithm is launched and the result must also be converted. This cost should however be small. The specification of the ARFF format can be found here.
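
Because of the two restrictions mentioned above, it can be useful to scan an ARFF file before giving it to SPMF. The following Python sketch is not part of SPMF. It simply reports lines that contain the character "=" or a backslash (the escape character in ARFF). The function name and sample lines are hypothetical, and skipping ARFF comment lines (those starting with "%") is an assumption about what SPMF tolerates.

```python
def find_unsupported_arff_lines(lines):
    """Return (line number, reason) pairs for lines that may not be readable."""
    problems = []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        # Assumption: blank lines and ARFF comments (starting with %) are harmless.
        if not text or text.startswith("%"):
            continue
        if "=" in text:
            problems.append((number, "contains '='"))
        if "\\" in text:
            problems.append((number, "contains a backslash"))
    return problems


# Hypothetical ARFF content, for illustration only.
sample = [
    "@relation demo",
    "@attribute label string",
    "@data",
    "'x=1'",
    r"'it\'s'",
    "plain",
]
for number, reason in find_unsupported_arff_lines(sample):
    print("line", number, reason)
```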

Synthetic datasets

It is possible to generate synthetic transaction databases by using the random transaction database generator provided in SPMF (see this example in the documentation to learn how to use it), which is flexible and easy to use.
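
To give an idea of what such a generator does, here is a minimal Python sketch. It is not the generator provided in SPMF and it does not use SPMF's parameters. The function name and its arguments are hypothetical. It assumes the SPMF transaction format as we understand it from the documentation: one transaction per line, items written as positive integers separated by single spaces, sorted and without duplicates.

```python
import random


def generate_transactions(n_transactions, n_items, max_length, seed=0):
    """Return random transactions as lines in the (assumed) SPMF transaction format."""
    rng = random.Random(seed)  # fixed seed so that the output is reproducible
    lines = []
    for _ in range(n_transactions):
        # Each transaction holds between 1 and max_length distinct items.
        length = rng.randint(1, min(max_length, n_items))
        items = sorted(rng.sample(range(1, n_items + 1), length))
        lines.append(" ".join(str(item) for item in items))
    return lines


for line in generate_transactions(n_transactions=5, n_items=20, max_length=4):
    print(line)
```

Items are drawn uniformly here, so the result has none of the skewed item frequencies found in real-life data. For realistic benchmarks, the real-life datasets above are preferable.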

Moreover, you can download the following synthetic datasets often used in the data mining literature, generated by the IBM Generator. The names follow this convention: D: number of sequences in the dataset, C: average number of itemsets per sequence, T: average number of items per itemset, I: average size of itemsets in potentially frequent sequences.

Another alternative for generating synthetic transaction databases is to use a Matlab program provided by Ashwin Balani.

Datasets for High-Utility Pattern Mining

The following datasets can be used with high-utility itemset mining algorithms such as EFIM, FHM, HUI-Miner, IHUP, UPGrowth and Two-Phase. These are real datasets with synthetic utility values, except for the Chainstore dataset, which contains real utility values. The internal utility values have been generated using a uniform distribution in [1, 10]. The external utility values have been generated using a Gaussian (normal) distribution. For information about the format of these files, see the examples in the documentation of EFIM, FHM, HUI-Miner or Two-Phase.
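
The way synthetic utility values are attached to a transaction database can be sketched in Python as follows. This is an illustration only, not the program used to prepare the datasets. The mean and standard deviation of the Gaussian distribution, the clamping of unit profits to at least 1, the function name and the sample transactions are all assumptions. The output format is assumed to be the one shown in the documentation of the algorithms listed above: the items, then the transaction utility, then the utility of each item, separated by colons.

```python
import random


def add_synthetic_utilities(transactions, seed=0, mean=5.0, stdev=2.0):
    """Attach synthetic utilities to transactions given as lists of item ids."""
    rng = random.Random(seed)
    unit_profit = {}  # external utility: one value per item, shared by all transactions
    lines = []
    for items in transactions:
        utilities = []
        for item in items:
            if item not in unit_profit:
                # Gaussian draw; the parameters and the floor of 1 are assumptions.
                unit_profit[item] = max(1, round(rng.gauss(mean, stdev)))
            quantity = rng.randint(1, 10)  # internal utility, uniform in [1, 10]
            utilities.append(quantity * unit_profit[item])
        lines.append("{}:{}:{}".format(
            " ".join(str(item) for item in items),
            sum(utilities),
            " ".join(str(u) for u in utilities),
        ))
    return lines


# Hypothetical transactions, for illustration only.
for line in add_synthetic_utilities([[1, 2, 3], [2, 4]]):
    print(line)
```

The unit profit of an item is drawn once and reused, while the purchase quantity is drawn again for every transaction, which matches the usual distinction between external and internal utility.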

Here are a few datasets in SPMF format for high-utility itemset mining with negative unit profit values. Those datasets have been generated to be used in the FHN paper (Fournier-Viger et al., 2014) and can be used with the FHN and HUINIV-Mine algorithms. See the FHN paper for details.

Here are a few datasets in SPMF format for on-shelf high-utility itemset mining with negative unit profit values. Those datasets have been generated to be used in the FOSHU paper (Fournier-Viger et al., 2015) and can be used with the FOSHU and TS-HOUN algorithms. See the FOSHU paper for details.

Here are some datasets in SPMF format for high-utility sequential rule mining or high-utility sequential pattern mining. Those datasets were generated for the HUSRM paper (Zida et al., 2015) and can be used with the HUSRM and USpan algorithms. See the HUSRM paper for more information.

Datasets for Clustering

A Matlab program provided by Ashwin Balani can be used to generate synthetic clustering datasets.

 
Copyright © 2008-2017 Philippe Fournier-Viger. All rights reserved.