Datasets

The SPMF software natively uses text files as input. Small example files for each algorithm are described in the SPMF documentation. These sample input files can be downloaded from the download page (test_files.zip) for the release version of SPMF, and are included with the source code for the source code version of SPMF. However, these datasets are quite small. For this reason, this webpage provides larger datasets that can be used with SPMF and that are often used in the data mining literature for evaluating and comparing algorithm performance. Unless otherwise indicated, the datasets are in SPMF format.

Datasets for Sequential Pattern Mining / Sequential Rule Mining / Sequence Prediction

Real-life datasets in SPMF format:

Dataset name Description Sequence count Item count Average sequence length Has item names?
BMSWebView1 (Gazelle) (KDD CUP 2000) This dataset was used in KDD CUP 2000. It contains clickstream data from an e-commerce website. Some sequences in this dataset are long; for example, 318 sequences contain more than 20 items. 59,601 497 2.42 No
BMSWebView2 (Gazelle) (KDD CUP 2000) This dataset was used in KDD CUP 2000. It contains clickstream data from an e-commerce website. 77,512 3,340 4.62 No

Kosarak

a subset of only 10 000 sequences: here
a subset of 25 000 sequences: here.

This is a very large dataset containing 990,000 sequences of click-stream data from a Hungarian news portal.
The dataset was converted to SPMF format using the original data from: http://fimi.ua.ac.be/data/.
(2020-1-10: the kosarak dataset file has been updated as someone informed me that some sequences were missing)
990,000 41,270 8.1 No
Sign a dataset of sign language utterances containing approximately 800 sequences
The original dataset file in another format can be obtained here with more details on this dataset.
~800 267 51.997 No
Bible

This dataset is a conversion of the Bible into a sequence database (each word is an item).

36,369 13,905 21.6 Yes
Bible_with_items
Leviathan This dataset is a conversion of the novel Leviathan by Thomas Hobbes (1651) as a sequence database (each word is an item). 5,834 9,025 33.8 Yes
Leviathan_with_items
MSNBC a dataset of click-stream data from the MSNBC website, converted from original data from the UCI repository.
The shortest sequences have been removed to keep only 31,790 sequences. If you need the full dataset, you can download it.
989,818 17 13.23 No
FIFA Click stream data from the website of FIFA World Cup 98 20,450 2,990 34.74 No
ProofSequences a dataset containing sequences of proof steps for mathematical proofs. The dataset is described in the paper:
Nawaz, M. S., Sun, M., Fournier-Viger, P. (2019). Proof Guidance in PVS with Sequential Pattern Mining. Proc. of 9th Intern. Conf. Fundamentals of Software Engineering (FSEN 2019), 15 pages, Springer, LNCS.
35 13 12.34 Yes
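
The statistics reported in the table above (sequence count, item count, average sequence length) can be recomputed from a dataset file. The following is a minimal sketch in Python, assuming the usual SPMF sequence format in which items are positive integers separated by single spaces, -1 ends an itemset and -2 ends a sequence. The function name, the skipping of lines starting with #, % or @, and the choice to measure sequence length in items are assumptions made for illustration.

```python
from typing import Iterable


def sequence_stats(lines: Iterable[str]) -> dict:
    """Compute basic statistics for a sequence database in SPMF format."""
    sequence_count = 0
    total_items = 0
    distinct_items = set()
    for line in lines:
        line = line.strip()
        # Skip blank lines and metadata/comment lines (assumed prefixes: #, %, @).
        if not line or line[0] in "#%@":
            continue
        # Keep only items: -1 (end of itemset) and -2 (end of sequence) are separators.
        items = [int(tok) for tok in line.split() if tok not in ("-1", "-2")]
        sequence_count += 1
        total_items += len(items)
        distinct_items.update(items)
    average = total_items / sequence_count if sequence_count else 0.0
    return {
        "sequence_count": sequence_count,
        "item_count": len(distinct_items),
        "average_sequence_length": average,
    }


# Toy database of three sequences (hypothetical data, not one of the datasets above).
example = [
    "1 -1 1 2 3 -1 1 3 -1 -2",
    "1 4 -1 3 -1 -2",
    "5 -1 -2",
]
stats = sequence_stats(example)
print(stats)
```

For a real dataset, an open file object can be passed instead of the list, since the function only iterates over lines.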

Synthetic datasets:

To generate synthetic sequence databases, you can use the sequence database generator provided in SPMF (see this example and this example in the documentation to know how to use it), which is flexible and easy to use.

Here are some synthetic sequence databases generated with the IBM Quest Dataset Generator, converted to the SPMF format:

Dataset name Description Sequence count Item count Average number of itemsets per sequence Average number of distinct items per sequence
data.slen_10.tlen_1.seq.patlen_2.lit.patlen_8.nitems_5000_spmf.txt Synthetic data 50,351 41,911 1.63 13.23
data.slen_10.tlen_1.seq.patlen_3.lit.patlen_8.nitems_5000_spmf.txt Synthetic data 47,785 53,137 1.93 15.63
data.slen_10.tlen_1.seq.patlen_4.lit.patlen_8.nitems_5000_spmf.txt Synthetic data 47,556 62,296 2.24 17.96
data.slen_10.tlen_1.seq.patlen_5.lit.patlen_8.nitems_5000_spmf.txt Synthetic data 47,988 69,686 2.56 20.49
data.slen_10.tlen_1.seq.patlen_6.lit.patlen_8.nitems_5000_spmf.txt Synthetic data 48,467 75,476 2.84 22.7
data.slen_8.tlen_1.seq.patlen_2.lit.patlen_8.nitems_5000_spmf.txt Synthetic data 45,535 41,270 1.52 12.30
data.slen_8.tlen_1.seq.patlen_3.lit.patlen_8.nitems_5000_spmf.txt Synthetic data 45,452 52,551 1.80 14.54
data.slen_8.tlen_1.seq.patlen_4.lit.patlen_8.nitems_5000_spmf.txt Synthetic data 46,131 61,137 2.09 16.71
data.slen_8.tlen_1.seq.patlen_5.lit.patlen_8.nitems_5000_spmf.txt Synthetic data 47,133 68,240 2.36 18.82

It is also possible to generate sequence databases by using the IBM Generator. Files generated by the IBM Generator can be converted to SPMF format by using the conversion tool provided in SPMF (see this example in the documentation for how to convert files).

Another alternative for generating synthetic sequence databases is to use a Matlab program provided by Ashwin Balani.

Datasets for Frequent Itemset mining / Association Rule Mining / Periodic pattern mining / Frequent episode mining

Datasets in SPMF format

These datasets can be directly used in SPMF:

Dataset name Description Transaction count Item count Average item count per transaction Has item names?
retail customer transactions from an anonymous Belgian retail store.
(source FIMI: http://fimi.ua.ac.be/data/)
88,162 16,470 10.30 No
mushrooms

prepared based on the UCI mushrooms dataset
(source: FIMI)

8,416 119 23 No
pumsb

census data for population and housing
(source: FIMI)

49,046 2,113 74 No
chess prepared based on the UCI chess dataset
(source: FIMI)
3,196 75 37 No
connect prepared based on the UCI connect-4 dataset (source: FIMI) 67,557 129 43 No
accidents anonymized traffic accident data (source: FIMI) 340,183 468 33.8 No
BMS_WebView_1 click-stream data from a webstore used in KDD-Cup 2000 (note: this version of the dataset has been prepared for itemset mining / association rule mining)
(source: FIMI)
59,602 497 2.51 No
BMS_WebView_2 click-stream data from a webstore used in KDD-Cup 2000 (note: this version of the dataset has been prepared for itemset mining / association rule mining)
(source: FIMI)
77,512 3,340 4.62 No
Kosarak transactions from click-stream data from a Hungarian news portal.
(source: FIMI)
990,002 41,270 8.1 No
chainstore customer transactions from a major grocery store chain in California, USA 1,112,949 46,086 7.23 No
foodmart customer transactions from a retail store, obtained and transformed from SQL-Server 2000 4,141 1,559 4.42 No
kddcup99 This dataset is transformed from the KDD CUP 1999 dataset, found at https://archive.ics.uci.edu/ml/datasets/KDD+Cup+1999+Data 1,000,000 135 16

Yes, kddcup99Attributes.xlsx

OnlineRetail

This dataset is transformed from the Online Retail dataset, found at https://archive.ics.uci.edu/ml/datasets/Online+Retail.
Converted by Zhang Zhongjie.
Note: there is another dataset also called OnlineRetail in some papers, which is in fact the ECommerce dataset.

541,909 2,603 4.37

Yes, OnlineRetailAttributes.xlsx.

PAMP

This dataset is transformed from the PAMAP2 dataset, obtained from https://archive.ics.uci.edu/ml/datasets/PAMAP2+Physical+Activity+Monitoring
Note: In the original data, a few lines contained NaN values. These values have been removed.
Converted  by Zhang Zhongjie.

1,000,000 141 23.93 Yes,
PAMPAttributes.xlsx
Skin This dataset is transformed from the dataset Skin Segmentation, obtained from https://archive.ics.uci.edu/ml/datasets/Skin+Segmentation
Converted  by Zhang Zhongjie.
245,057 11 4.0 Yes,
SkinAttributes.xlsx
USCensus This dataset is transformed from the US Census 1990 dataset, obtained from https://archive.ics.uci.edu/ml/datasets/US+Census+Data+(1990)
Converted  by Zhang Zhongjie.
1,000,000 396 68.0 No
RecordLink This dataset is transformed from http://archive.ics.uci.edu/ml/datasets/Record+Linkage+Comparison+Patterns.
Converted  by Zhang Zhongjie.
574,913 29 10

Yes
RecordLinkAttribute.xslx

PowerC The dataset is about household electric power consumption. It was prepared using the original dataset from https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption. The dataset contains instances, which were transformed to transactions, where value fields of the 3rd attribute to the 9th attribute are divided in 10 equal parts and every part is represented by a number. As a result, the dataset has 140 items.
Converted  by Zhang Zhongjie.
1,040,000 140 7 No
Susy

This dataset is related to physics. It is related to particles detected using a particle accelerator. The dataset contains instances prepared using the original dataset obtained from http://archive.ics.uci.edu/ml/datasets/SUSY. The instances were transformed to transactions, such that the value field of every attribute is divided into 10 equal parts and every part is represented by a number. As a result, the dataset has 190 items.
Converted by Zhang Zhongjie.

5,000,000 190 19 No
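
The PowerC and Susy descriptions above mention that numeric attribute values were divided into 10 equal parts, each part being represented by a number. A minimal sketch of such an equal-width discretization is given below. The exact procedure used to prepare these datasets is not described on this page, so the bin boundaries (minimum to maximum of each attribute), the item numbering scheme and the function name are assumptions made for illustration.

```python
from typing import List


def discretize(rows: List[List[float]], bins: int = 10) -> List[List[int]]:
    """Map each numeric value to an item ID using equal-width bins per attribute."""
    n_attrs = len(rows[0])
    lows = [min(r[a] for r in rows) for a in range(n_attrs)]
    highs = [max(r[a] for r in rows) for a in range(n_attrs)]
    transactions = []
    for row in rows:
        items = []
        for a, value in enumerate(row):
            width = (highs[a] - lows[a]) / bins
            # A constant attribute has zero width: put every value in bin 0.
            b = 0 if width == 0 else int((value - lows[a]) / width)
            b = min(b, bins - 1)  # the maximum value falls in the last bin
            # Assumed numbering: attribute a owns item IDs a*bins+1 .. a*bins+bins.
            items.append(a * bins + b + 1)
        transactions.append(items)
    return transactions


# Hypothetical instances with two numeric attributes.
rows = [[0.0, 5.0], [4.5, 5.0], [10.0, 5.0]]
transactions = discretize(rows)
for t in transactions:
    # One transaction per line, items separated by single spaces.
    print(" ".join(map(str, t)))
```

With this numbering, a dataset with k attributes and 10 bins per attribute has at most 10k distinct items; for instance, 19 attributes would give at most 190 items.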

Synthetic datasets

Here are a few synthetic datasets in SPMF format:

Dataset name Description Transaction count Item count Average item count per transaction
c20d10k synthetic dataset 10,000 192 20
c73d10k synthetic dataset 10,000 1,592 73
t25i10d10k synthetic dataset 9,976 929 24.77
t20i6d100k synthetic dataset 99,922 893 19.90

It is possible to generate synthetic transaction databases by using the random transaction database generator provided in SPMF (see this example in the documentation to know how to do it), which is flexible and easy to use.

Moreover, you can download the following synthetic datasets often used in the data mining literature, generated by the IBM Generator. The names are given according to this convention: D: number of sequences in the dataset, C: average number of itemsets per sequence, T: average number of items per itemset, I: average size of itemsets in potentially frequent sequences.

Another alternative for generating synthetic transaction databases is to use a Matlab program provided by Ashwin Balani.

Real-life datasets in SPMF format, having timestamps:

The datasets containing timestamps can be directly used in SPMF with algorithms that accept timestamps:

Dataset name Description Transaction count Item count Average item count per transaction Has item names?
ECommerce_time_without_utility
customer transactions occurring between 01/12/2010 and 09/12/2011 of a UK-based and registered non-store online retail (this is the version with timestamps). If you want to know the meaning of the items in this dataset, the list of item names is available.
The data was prepared by Yimin Zhang using public data.

17,535

3,803 15.4 Yes

list of item names

Real-life datasets in ARFF format.

The GUI and command line interface of SPMF 093d and higher can read ARFF (Attribute Relational File Format) files. You can download a collection of 36 real-life datasets in ARFF format from the UCI machine learning repository. These files can be used with all association rule mining and itemset mining algorithms that take a transaction database as input. Note that the "MainTest..." examples in the source code have not yet been adapted to read ARFF files (for now, ARFF files are only accepted by the GUI and command line interface). Most features of the ARFF format are supported, except that (1) the character "=" is forbidden and (2) escape characters are not considered. Note that when the ARFF format is used, the data mining algorithms will be slightly slower than with the native SPMF file format, because the input file is automatically converted before launching the algorithm and the result also has to be converted. This cost should however be small. The specification of the ARFF format can be found here.

Datasets for High-Utility Pattern Mining

The following transaction databases can be used with high-utility itemset mining algorithms such as EFIM, FHM, HUI-Miner, IHUP, UPGrowth and Two-Phase. For information about the format of these files, see the examples in the documentation of EFIM, FHM, HUI-Miner or Two-Phase.

Real-life transaction datasets in SPMF format with real utility values

These are three real-life customer transaction datasets with real utility values:

Dataset name Description Transaction count Item count Average item count per transaction Has real utility values? Has item names?
foodmart_utility dataset of customer transactions from a retail store, obtained and transformed from SQL-Server 2000 4,141 1,559 4.42 Yes No
chainstore_utility

dataset of customer transactions from a major grocery store chain in California, USA, containing 1,112,949 transactions and 46,086 items, obtained and transformed from NU-Mine Bench.

The original data is available here (not in SPMF format)

1,112,949 46,086 7.23 Yes No
ECommerce_retail_utility_no_timestamps

transactional data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 of a UK-based and registered non-store online retail (this is the version with real utility values but without timestamps).

The data was prepared by Yimin Zhang using public data for his paper about peak high utility itemset mining.

17,535

3,803 15.4 Yes

Yes

list of item names

Real-life transaction datasets in SPMF format having synthetic (fake) utility values

These datasets are real datasets but with synthetic utility values. The internal utility values have been generated using a uniform distribution in [1, 10].

Dataset name Description Transaction count Item count Average item count per transaction Has real utility values? Has item names?
retail_utility customer transactions from an anonymous Belgian retail store.
(source FIMI: http://fimi.ua.ac.be/data/)
88,162 16,470 10.30 No (synthetic) No
mushroom_utility

prepared based on the UCI mushrooms dataset
(source: FIMI)

8,416 119 23 No (synthetic) No
pumsb_utility

census data for population and housing
(source: FIMI)

49,046 2,113 74 No (synthetic) No
chess_utility prepared based on the UCI chess dataset
(source: FIMI)
3,196 75 37 No (synthetic) No
connect_utility prepared based on the UCI connect-4 dataset (source: FIMI) 67,557 129 43 No (synthetic) No
accidents_utility anonymized traffic accident data (source: FIMI) 340,183 468 33.8 No (synthetic) No
kosarak_utility transactions from click-stream data from a Hungarian news portal.
(source: FIMI)
990,002 41,270 8.1 No (synthetic) No
BMS_utility click-stream data from a webstore used in KDD-Cup 2000
(note: this version of the dataset has been prepared for high utility itemset mining)
(source: FIMI)
77,512 3,340 4.62 No (synthetic) No
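
As stated above, the internal utility values of these datasets were generated using a uniform distribution in [1, 10]. The sketch below illustrates the idea on toy data. It assumes the text format used by the high-utility itemset mining examples in the SPMF documentation (items, then the transaction utility, then the utility of each item, separated by colons), integer quantities, and a hypothetical table of unit profits; how the unit profits of the above datasets were obtained is not described on this page, and the function name is an assumption.

```python
import random
from typing import Dict, List


def add_synthetic_utilities(
    transactions: List[List[int]],
    unit_profit: Dict[int, int],
    rng: random.Random,
) -> List[str]:
    """Attach synthetic utility values to transactions and format them as text lines."""
    lines = []
    for items in transactions:
        # Internal utility (quantity): integer drawn uniformly from [1, 10].
        quantities = [rng.randint(1, 10) for _ in items]
        # Utility of an item in a transaction = quantity * unit profit of the item.
        utilities = [q * unit_profit[i] for q, i in zip(quantities, items)]
        item_part = " ".join(map(str, items))
        utility_part = " ".join(map(str, utilities))
        # Line layout: items : transaction utility : utility of each item.
        lines.append(f"{item_part}:{sum(utilities)}:{utility_part}")
    return lines


# Hypothetical inputs for illustration only.
transactions = [[1, 2, 3], [2, 4]]
unit_profit = {1: 5, 2: 2, 3: 1, 4: 3}
lines = add_synthetic_utilities(transactions, unit_profit, random.Random(42))
for line in lines:
    print(line)
```

Passing a seeded random.Random instance keeps the generated file reproducible, which is convenient when the same synthetic values must be reused across experiments.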

Datasets in SPMF format having utility values and timestamps

Here are a few transaction databases in SPMF format for high utility itemset mining that contain timestamps. These datasets were prepared by Yimin Zhang for the paper in Information Sciences about the LHUI-Miner and PHUI-Miner algorithms (Fournier-Viger et al., 2019). These datasets were used for discovering local high utility itemsets and peak high utility itemsets, but can be used for other tasks such as recent high utility pattern mining.

Dataset name Description Transaction count Item count Average item count per transaction Has real utility values? Has real timestamps? Has item names?
ECommerce_retail_utility_timestamps

transactional data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 of a UK-based and registered non-store online retail (this is the version with real utility values and timestamps).

The data was prepared by Yimin Zhang using public data for his paper about peak high utility itemset mining.

17,535

3,803 15.4 Yes Yes

Yes

list of item names

foodmart_utility_timestamp dataset of customer transactions from a retail store, obtained and transformed from SQL-Server 2000 4,141 1,559 4.42 Yes No (synthetic) No
retail_utility_timestamp customer transactions from an anonymous Belgian retail store.
(source FIMI: http://fimi.ua.ac.be/data/)
88,162 16,470 10.30 No (synthetic) No (synthetic) No
kosarak_utility_timestamp transactions from click-stream data from a Hungarian news portal.
(source: FIMI)
990,002 41,270 8.1 No (synthetic) No (synthetic) No
mushroom_utility_timestamp

prepared based on the UCI mushrooms dataset
(source: FIMI)

8,416 119 23 No (synthetic) No (synthetic) No

Datasets for high utility itemset mining with negative unit profit values.

Here are a few transaction databases in SPMF format for high-utility itemset mining with negative unit profit values. Those datasets have been generated to be used in the FHN paper (Fournier-Viger et al., 2014) and can be used with the FHN and HUINIV-Mine algorithms. See the FHN paper for details.

Dataset name Description Transaction count Item count Average item count per transaction Has real utility values? Has item names?
retail_negative customer transactions from an anonymous Belgian retail store.
(source FIMI: http://fimi.ua.ac.be/data/)
88,162 16,470 10.30 No (synthetic) No
mushroom_negative

prepared based on the UCI mushrooms dataset
(source: FIMI)

8,416 119 23 No (synthetic) No
pumsb_negative

census data for population and housing
(source: FIMI)

49,046 2,113 74 No (synthetic) No
chess_negative prepared based on the UCI chess dataset
(source: FIMI)
3,196 75 37 No (synthetic) No
accidents_negative anonymized traffic accident data (source: FIMI) 340,183 468 33.8 No (synthetic) No
kosarak_negative transactions from click-stream data from a Hungarian news portal.
(source: FIMI)
990,002 41,270 8.1 No (synthetic) No

Datasets for on-shelf high-utility itemset mining with negative unit profit values

Here are a few transaction databases in SPMF format for on-shelf high-utility itemset mining with negative unit profit values. Those datasets have been generated to be used in the FOSHU paper (Fournier-Viger et al., 2015) and can be used with the FOSHU and TS-HOUN algorithms. See the FOSHU paper for details.

Datasets for high-utility sequential rule mining or high-utility sequential pattern mining

    Here are some sequence databases in SPMF format for high-utility sequential rule mining or high-utility sequential pattern mining. Those datasets were generated for the HUSRM paper (Zida et al., 2015), and can be used with the HUSRM and USpan algorithms. See the HUSRM paper for more information.

    Dataset name Description Sequence count Item count Average sequence length Has item names?
    BMS_sequence_utility This dataset was used in KDD CUP 2000. It contains clickstream data from an e-commerce website. 77,512 3,340 4.62 No

    Kosarak10k_sequence_utility

    This is a subset of 10,000 sequences of the Kosarak click-stream data from a Hungarian news portal.
    The dataset was converted to SPMF format using the original data from: http://fimi.ua.ac.be/data/.
    (2020-1-10: the kosarak dataset file has been updated as someone informed me that some sequences were missing)
    10,000     No
    SIGN_sequence_utility a dataset of sign language utterances containing approximately 800 sequences
    The original dataset file in another format can be obtained here with more details on this dataset.
    ~800 267 51.997 No
    Bible_sequence_utility

    This dataset is a conversion of the Bible into a sequence database (each word is an item).

    36,369 13,905 21.6 Yes
    Bible_with_items
    FIFA_sequence_utility Click stream data from the website of FIFA World Cup 98 20,450 2,990 34.74 No

Datasets for Cost-effective Pattern Mining

In the paper by Fournier-Viger et al. (2020), it was proposed to find cost-effective patterns in sequences with utility and cost information. This type of data is a set of sequences where each sequence is an ordered list of activities, and each activity has a cost value representing the amount of resources (e.g. time, money) spent to perform the activity. Moreover, each sequence has a utility label that can be either binary or numeric, which represents a positive or negative outcome. This type of data can represent, for example, how students use an e-learning system. In that case, each sequence is a list of activities or sessions performed by a student, and the cost represents the time spent studying for each activity. The utility represents the outcome of a sequence, such as the final exam score of a student after performing the activities (numeric utility), or whether the student passed or failed the course or exam (binary utility). Another example of cost-utility sequences is hospital data, where a sequence is a list of medicines taken by a patient, the cost indicates how much money or time was spent on these medical treatments, and the utility is whether the patient was cured or died.

From cost/utility sequences, we can then find patterns that have a low cost but typically lead to a high utility (called low-cost high-utility patterns or cost-efficient patterns) by applying algorithms such as CEPB, CEPN and CorCEPB. In the above examples, some cost-effective patterns may be that studying some e-learning materials A and B typically requires little time but leads to high scores, or that taking some given medicines has a low cost but a high probability of curing a disease.
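
To make the notions of cost and utility of a pattern more concrete, here is a small sketch in Python. It is not the CEPB, CEPN or CorCEPB algorithm: it only evaluates one given pattern on a toy database, under the simplifying assumptions that the cost of a pattern in a sequence is the sum of the costs of its leftmost occurrence, and that a pattern is summarized by its average cost and average utility over the sequences containing it. The data, type alias and function names are hypothetical.

```python
from typing import List, Optional, Tuple

# A sequence is a list of (activity, cost) pairs together with a utility value.
Sequence = Tuple[List[Tuple[str, float]], float]


def pattern_cost(events: List[Tuple[str, float]], pattern: List[str]) -> Optional[float]:
    """Return the cost of the leftmost occurrence of pattern in events, or None."""
    total, pos = 0.0, 0
    for activity, cost in events:
        if pos < len(pattern) and activity == pattern[pos]:
            total += cost
            pos += 1
    return total if pos == len(pattern) else None


def evaluate(database: List[Sequence], pattern: List[str]) -> Tuple[int, float, float]:
    """Return (support, average cost, average utility) of pattern in database."""
    costs, utilities = [], []
    for events, utility in database:
        c = pattern_cost(events, pattern)
        if c is not None:
            costs.append(c)
            utilities.append(utility)
    support = len(costs)
    if support == 0:
        return 0, 0.0, 0.0
    return support, sum(costs) / support, sum(utilities) / support


# Toy e-learning data: cost = minutes spent on an activity, utility = exam score.
database = [
    ([("A", 10.0), ("B", 5.0), ("C", 40.0)], 90.0),
    ([("A", 20.0), ("C", 30.0), ("B", 15.0)], 80.0),
    ([("C", 50.0), ("A", 25.0)], 40.0),
]
result = evaluate(database, ["A", "B"])
print(result)
```

In this toy example, the pattern A followed by B appears in two of the three sequences. A pattern with a low average cost and a high average utility is the kind of pattern that cost-effective pattern mining looks for; with a binary utility (1 for success, 0 for failure), the average utility becomes the proportion of successful sequences.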

Here are some datasets:

Dataset name Description Sequence count Type of utility
allSessions_binary.txt

This is an e-learning dataset, where each sequence represents a student using an e-learning platform named Deeds, and there are 62 students.

A sequence contains a list of sessions taken by a student. There are six different sessions denoted as 1,2,3,4,5 and 6. In a sequence, each session has a cost value that represents the time spent by a student on the session for studying or doing some learning activities.

Each sequence has a binary utility value, which indicates the outcome of the sequence. Here the utility value indicates if a student has failed or passed the final exam after doing the learning sessions.

This dataset can be used with the CEPB and CorCEPB algorithms.

Note: This dataset was created by converting public data from Vahdat et al to the SPMF format. A threshold of 60% was assumed to be the score for passing the final exam.

62 Binary
session6_numeric.txt

This dataset is also an e-learning dataset from students using an e-learning platform named Deeds. This dataset contains 50 sequences.

Each sequence indicates the list of activities done by a student during a learning session called SESSION 6. Each activity has a cost value representing the amount of time that the student has spent doing the activity. Moreover, each sequence has a numeric utility value indicating the score that the student had at the test at the end of SESSION 6.

This dataset can be used with the CEPN algorithm.

Note: This dataset was created by converting public data from Vahdat et al to the SPMF format. Note that this file only contains SESSION 6.

50 Numeric
session6_binary.txt

This is the same as above, except that the utility has been converted to binary values instead of numeric values.

This dataset can be used with the CEPB and CorCEPB algorithms.

50 Binary
session5_numeric.txt This is similar to above, except that this is the data for SESSION 5 (numeric utility). 53 Numeric
session4_numeric.txt This is similar to above, except that this is the data for SESSION 4 (numeric utility). 54 Numeric
session4_binary.txt This is similar to above, except that this is the data for SESSION 4 (binary utility). 54 Binary

The original dataset was made public by Vahdat et al. and can be downloaded here but it is not in SPMF format. However, it gives a lot of details about the meaning of the data. This is very useful for understanding the meaning of the patterns found in the above data. This is the paper by Vahdat et al.:

M. Vahdat, L. Oneto, D. Anguita, M. Funk, M. Rauterberg.: A learning analytics approach to correlate the academic achievements of students with interaction data from an educational simulator. In: G. Conole et al. (eds.): EC-TEL 2015, LNCS 9307, pp. 352-366. Springer (2015).
DOI: 10.1007/978-3-319-24258-3_26

And this is the paper about cost-effective pattern mining with the CEPB, CEPN and CorCEPB algorithms, for which the datasets have been prepared for SPMF:

Fournier-Viger, P., Li, J., Lin, J. C., Chi, T. T., Kiran, R. U. (2020). Mining Cost-Effective Patterns in Event Logs. Knowledge-Based Systems (KBS), Elsevier

Datasets for subgraph mining

Here are a few datasets commonly used to evaluate subgraph mining algorithms such as TKG and gSpan:

Dataset name Description Graph count Average node count per graph Average edge count per graph Vertex label count Edge label count
Chemical_340 a database of 340 graphs about chemistry

340

27.02 27.40 66 4
Compounds_422 a database of 422 graphs about compounds 422 39.61 42.31 4 21
Mutag a dataset of nitro compounds labeled regarding whether they have a mutagenic effect on a bacterium 344 17.93 19.79 7 11
PTC chemical compounds where their labels indicate carcinogenicity for male and female rats 4,110 14.29 14.69 19  
NCI1 datasets of chemical compounds screened for activity against non-small cell lung cancer and ovarian cancer cell lines 4,127 29.87 32.30 37 3
NCI109 datasets of chemical compounds screened for activity against non-small cell lung cancer and ovarian cancer cell lines 1,113 29.68 32.13 38 3
Proteins graph collection where nodes represent secondary structure elements and edges indicate neighborhood in the amino-acid sequence or in 3-dimension space 600 39.06 72.82 3 1
Enzymes protein tertiary structures obtained from the BRENDA enzyme database 1,178 284.32 62.14 3 1
DandD a dataset of protein structures where nodes are amino acids and edges indicate spatial closeness, which are classified into enzymes or non-enzymes 1,000 19.77 715.66 82 1
IMDB-B a movie collaboration dataset collected from IMDB. Each graph is an ego-network where nodes represent actors/actresses and edges indicate if they appear in the same movie. Each graph is categorized into one of two genres (Action or Romance). 1,500 13 96.53 65 1
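
The columns of the table above (graph count, average node and edge counts, number of distinct labels) can be recomputed from a graph database file. The sketch below assumes the gSpan-style text format, in which a line "t # N" starts a graph, "v ID LABEL" declares a vertex and "e ID1 ID2 LABEL" declares an edge; that the files listed above follow exactly this layout is an assumption, as is the function name.

```python
from typing import Iterable


def graph_stats(lines: Iterable[str]) -> dict:
    """Compute basic statistics for a graph database in gSpan-style text format."""
    graph_count = vertex_count = edge_count = 0
    vertex_labels, edge_labels = set(), set()
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "t":
            # Some files end with a "t # -1" terminator line: do not count it (assumption).
            if parts[-1] != "-1":
                graph_count += 1
        elif parts[0] == "v":
            vertex_count += 1
            vertex_labels.add(parts[2])
        elif parts[0] == "e":
            edge_count += 1
            edge_labels.add(parts[3])
    return {
        "graph_count": graph_count,
        "avg_nodes": vertex_count / graph_count if graph_count else 0.0,
        "avg_edges": edge_count / graph_count if graph_count else 0.0,
        "vertex_label_count": len(vertex_labels),
        "edge_label_count": len(edge_labels),
    }


# Toy database of two graphs (hypothetical data).
example = """t # 0
v 0 10
v 1 11
v 2 10
e 0 1 20
e 1 2 21
t # 1
v 0 10
v 1 12
e 0 1 20
""".splitlines()
gstats = graph_stats(example)
print(gstats)
```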

The first two datasets were obtained from the Web and are probably the most famous datasets in subgraph mining.

The other datasets were prepared by Dang Nguyen et al. based on data from various sources, were obtained from github.com/nphdang/gspan/, and were used in this paper: Dang Nguyen, Wei Luo, Tu Dinh Nguyen, Svetha Venkatesh, Dinh Phung (2018). Learning Graph Representation via Frequent Subgraphs. SDM 2018, San Diego, USA. SIAM, 306-314. The descriptions of these latter datasets in the above table were also taken from that paper.

Datasets for Clustering

A Matlab program provided by Ashwin Balani can be used to generate synthetic clustering datasets.