Datasets

The SPMF software natively uses text files as input. Small example input files for each algorithm are described in the documentation of SPMF. These sample input files can be downloaded from the download page (test_files.zip) for the release version of SPMF, and are included with the source code for the source code version of SPMF. However, these datasets are quite small. For this reason, this webpage provides larger datasets that can be used with SPMF and that are often used in the data mining literature for evaluating and comparing algorithm performance. Unless otherwise indicated, the datasets are in SPMF format.

The datasets are divided into the following categories:

Datasets for Sequential Pattern Mining / Sequential Rule Mining / Sequence Prediction

Real-life datasets in SPMF format

Dataset name Description Sequence count Item count Average sequence length Has item names?
BMSWebView1 (Gazelle) ( KDD CUP 2000)

This dataset was used in KDD CUP 2000. It contains clickstream data from an e-commerce

In this dataset, there are some long sequences. For example, 318 sequences contain more than 20 items.

59,601 497 2.42 No
BMSWebView2 (Gazelle) ( KDD CUP 2000) This dataset was used in KDD CUP 2000. It contains clickstream data from an e-commerce 77,512 3,340 4.62 No

Kosarak

a subset of only 10 000 sequences: here
a subset of 25 000 sequences: here.

This is a very large dataset containing 990,000 sequences of click-stream data from a Hungarian news portal.
The dataset was converted to SPMF format using the original data from: http://fimi.ua.ac.be/data/.
(2020-1-10: the kosarak dataset file has been updated as someone informed me that some sequences were missing)
990,000 41,270 8.1 No
Sign a dataset of sign language utterances containing approximately 800 sequences
The original dataset file in another format can be obtained here with more details on this dataset.
~800 267 51.997 No
Bible

This dataset is a conversion of the Bible into a sequence database (each word is an item).

36,369 13,905 21.6 Yes
Bible_with_items
Leviathan This dataset is a conversion of the novel Leviathan by Thomas Hobbes (1651) as a sequence database (each word is an item). 5,834 9,025 33.8 Yes
Leviathan_with_items
MSNBC a dataset of click-stream data from the MSNBC website, converted from original data from the UCI repository.
The shortest sequences have been removed to keep only 31,790 sequences. If you need the full dataset, you can download it.
989,818 17 13.23 No
FIFA Click stream data from the website of FIFA World Cup 98 20,450 2,990 34.74 No
BIKE

This dataset contains sequences of locations where shared bikes were parked in a city. Each item represents a bike sharing station and each sequence indicates the different locations of a bike over time.

The dataset was obtained from the Github of Andrea Tonon, and is a transformation from Kaggle data (https://www.kaggle.com/cityofLA/los-angeles-metro-bike-share-trip-data)

21,078 67 7.27 No
MT745584 (COVID-19 genome sequence) This is the nucleotide sequence of the MT745584 strain of the COVID-19 coronavirus, collected on 2020-07-13 in Bahrain. The dataset was obtained from a public database and converted to the SPMF format by M. Saqib Nawaz for his paper: Using Artificial Intelligence Techniques for COVID-19 Genome Analysis. More datasets related to this paper and code can be found on github. 497 4 16 Original data
ProofSequences a dataset containing sequences of proof steps for mathematical proofs. The dataset is described in the paper:
Nawaz, M. S., Sun, M., Fournier-Viger, P. (2019). Proof Guidance in PVS with Sequential Pattern Mining. Proc. of 9th Intern. Conf. Fundamentals of Software Engineering (FSEN 2019), 15 pages, Springer, LNCS.
35 13 12.34 Yes
E-Shop Click-stream data for online shopping. The dataset contains sequences of clicks (clickstream) from an online store offering clothing for pregnant women.

Notes: Dataset cleaned, pre-processed and transformed by Frederic Flouvat based on data from UCI repository.
A github project by F. Flouvat contains details and code about the transformation.
This data is interesting for algorithm testing because it consists of sequences of itemsets instead of sequences of items.
Average number of items per itemset : 9.0
24026 317 61.98 Yes

Item meanings (JSON format)
MicroblogPCU Datasets about spam in micro-blogs.

Notes: Dataset cleaned, pre-processed and transformed by Frederic Flouvat based on data from UCI repository.
A github project by F. Flouvat contains details and code about the transformation.
This data is interesting for algorithm testing because it consists of sequences of itemsets instead of sequences of items.
Average number of items per itemset : 6.85
429 50505 296.37 Yes
Item meanings (JSON format)
OnlineRetail_II_all This Online Retail II data set contains all the transactions occurring for a UK-based and registered, non-store online retailer between 01/12/2009 and 09/12/2011. The company mainly sells unique all-occasion gift-ware. Many customers of the company are wholesalers.

Notes: Dataset cleaned, pre-processed and transformed by Frederic Flouvat based on data from UCI repository.
A github project by F. Flouvat contains details and code about the transformation.
This data is interesting for algorithm testing because it consists of sequences of itemsets instead of sequences of items.
Average number of items per itemset : 22.775
4383 41431 32.154 Yes
Item meanings (JSON format)
OnlineRetail_II_best

This Online Retail II data set contains all the transactions occurring for a UK-based and registered, non-store online retailer between 01/12/2009 and 09/12/2011. The company mainly sells unique all-occasion gift-ware. Many customers of the company are wholesalers.

Notes: Dataset cleaned, pre-processed and transformed by Frederic Flouvat based on data from UCI repository.
A github project by F. Flouvat contains details and code about the transformation.
This data is interesting for algorithm testing because it consists of sequences of itemsets instead of sequences of items.
Average number of items per itemset : 9.0

Note: The dataset had some errors with respect to the format. They have been fixed (2023-7-15).

4383 10157 48.23 Yes
Item meanings (JSON format)
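
The statistics listed for the sequence datasets above (sequence count, item count, number of itemsets per sequence) can be recomputed from a file. Below is a minimal Python sketch, assuming the usual SPMF sequence format in which tokens are separated by spaces, -1 marks the end of an itemset and -2 marks the end of a sequence. It also assumes that lines starting with #, % or @ are comments or metadata. The function names are hypothetical and are not part of SPMF.

```python
from typing import Dict, Iterable, List


def parse_spmf_sequence(line: str) -> List[List[str]]:
    """Parse one line of a sequence database into a list of itemsets."""
    itemsets: List[List[str]] = []
    current: List[str] = []
    for token in line.split():
        if token == "-1":        # end of the current itemset
            itemsets.append(current)
            current = []
        elif token == "-2":      # end of the sequence
            break
        else:
            current.append(token)
    if current:                  # tolerate a missing -1 before -2
        itemsets.append(current)
    return itemsets


def sequence_stats(lines: Iterable[str]) -> Dict[str, float]:
    """Compute basic statistics of a sequence database."""
    sequences = []
    for line in lines:
        line = line.strip()
        # Assumption: blank lines and lines starting with #, % or @ carry no sequence.
        if not line or line[0] in "#%@":
            continue
        sequences.append(parse_spmf_sequence(line))

    distinct_items = {item for seq in sequences for itemset in seq for item in itemset}
    total_itemsets = sum(len(seq) for seq in sequences)
    total_items = sum(len(itemset) for seq in sequences for itemset in seq)
    count = len(sequences)
    return {
        "sequence_count": count,
        "item_count": len(distinct_items),
        "avg_itemsets_per_sequence": total_itemsets / count if count else 0.0,
        "avg_items_per_itemset": total_items / total_itemsets if total_itemsets else 0.0,
    }


if __name__ == "__main__":
    # Two small made-up sequences, not taken from any dataset listed here.
    demo = ["1 -1 2 3 -1 -2", "4 5 -1 1 -1 6 -1 -2"]
    print(sequence_stats(demo))
```

To use it on a downloaded file, pass an open file object, for example sequence_stats(open(path)) where path is the location of the file on disk. Reading line by line keeps memory use low for the parsing step, although the sketch still stores all parsed sequences.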

A collection of 30 books converted to SPMF format (sequences of words or sequences of parts of speech)

Below is a dataset consisting of a collection of 30 public domain books that have been prepared and converted to sequences in the SPMF format by Jean-Marc Pokou et al. (2016). Each book can be used to extract patterns using sequential pattern mining or sequential rule mining algorithms.

These books were written by 10 different English novelists from the 19th century. The total number of words/sentences in the corpus of each author is as follows: Catharine Traill (276,829/ 6,588), Emerson Hough (295,166/ 15,643), Henry Adams (447,337/ 14,356), Herman Melville (208,662/ 8,203), Jacob Abbott (179,874/ 5,804), Louisa May Alcott (220,775/ 7,769), Lydia Maria Child (369,222/ 15,159), Margaret Fuller (347,303/ 11,254), Stephen Crane (214,368/ 12,177), and Thornton W. Burgess (55,916/ 2,950). The list of books is:

Author: datasets (books) in SPMF format
Catharine Traill
- A Tale of The Rice Lake Plains
- Lost in the Backwoods
- The Backwoods of Canada
Emerson Hough
- The Girl at the Halfway House
- The Law of the Land
- The Man Next Door
Henry Adams
- Democracy, an American novel
- Mont-Saint-Michel and Chartres
- The Education of Henry Adams
Herman Melville
- I and My Chimney
- Israel Potter
- The Confidence-Man: His Masquerade
Jacob Abbott
- Alexander the Great
- History of Julius Caesar
- Queen Elizabeth
Louisa May Alcott
- Eight Cousins
- Rose in Bloom
- The Mysterious Key and What It Opened
Lydia Maria Child
- A Romance of the Republic
- Isaac T. Hopper
- Philothea
Margaret Fuller
- Life Without and Life Within
- Summer on the Lakes, in 1843
- Woman in the Nineteenth Century
Stephen Crane
- Active Service
- Last Words
- The Third Violet
Thornton W. Burgess
- The Adventures of Buster Bear
- The Adventures of Chatterer the Red Squirrel
- The Adventures of Grandfather Frog

There are two versions of each dataset: sequences of words and sequences of part-of-speech (POS) tags (obtained with the Stanford NLP tagger).

Here are the links to download the books:

If you use the above book datasets, you may want to cite this paper:

Pokou J. M., Fournier-Viger, P., Moghrabi, C. (2016). Authorship Attribution Using Small Sets of Frequent Part-of-Speech Skip-grams. Proc. 29th Intern. Florida Artificial Intelligence Research Society Conference (FLAIRS 29), AAAI Press, pp. 86-91

Sequences of API calls of malware programs

Here is a large dataset (compressed ZIP file: 6 MB, uncompressed: 813 MB) containing the sequences of API calls from different types of malware programs.
This dataset contains 8 files, each corresponding to a different type of malware such as viruses, backdoors and spyware. Each file is a sequence database containing the sequences of API calls of multiple programs of a given type.

The table below summarizes the information about these 8 files:

File | Type of malware | Samples (sequences) | API calls (distinct items) | Maximum sequence length | Average sequence length
Adwaretranslated.txt | Adware | 379 | 212 | 1,450,685 | 6,867
Backdoortranslated.txt | Backdoor | 1001 | 227 | 1,402,652 | 11,293
Downloadertranslated.txt | Downloader | 1001 | 232 | 870,719 | 6,522
Droppertranslated.txt | Dropper | 891 | 226 | 1,068,329 | 16,008
Spywaretranslated.txt | Spyware | 832 | 229 | 1,764,421 | 46,951
Trojantranslated.txt | Trojan | 1001 | 232 | 1,232,913 | 13,818
Virustranslated.txt | Virus | 1001 | 241 | 1,062,231 | 18,370
Wormstranslated.txt | Worms | 1001 | 236 | 1,245,582 | 33,614

These datasets can be used with sequential pattern mining algorithms and sequential rule mining algorithms.

More information about these datasets can be found in this paper:

Nawaz, M. S., Fournier-Viger, P., Nawaz, M. Z., Chen, G., Wu, Y. (2022). MalSPM: Metamorphic Malware Behavior Analysis and Classification using Sequential Pattern Mining. Computers & Security, Elsevier, to appear. DOI: 10.1016/j.cose.2022.102741

More data related to this paper can also be found on Github.

Sequences of MOOC data with timestamps

This page also provides a time-extended sequence database that contains MOOC data (e-learning data) and can be used with sequential pattern mining algorithms such as SPM_FC_P and SPM_FC_L. It can be downloaded here: mooc.txt

This dataset was originally a “Course Recommendation” dataset provided by the MoocData platform and was transformed into SPMF format by Song et al. The original dataset was collected from XuetangX, one of the largest MOOC platforms in China. Originally used for course recommendation, the dataset contains 82,535 course enrollment sequences from XuetangX, recorded from October 1, 2016 to March 31, 2018, a time span of 547 days. The number of courses is 1,302. The length of the longest sequence is 398, the length of the shortest sequence is 3, and the average sequence length is 5.19.

See this paper for more details about this dataset and how it can be used:

Song, W., Ye, W., Fournier-Viger, P. (2022). Mining sequential patterns with flexible constraints from MOOC data. Applied Intelligence

Synthetic datasets

To generate synthetic sequence databases, you can use the sequence database generator provided in SPMF (see this example and this example in the documentation to know how to use it), which is flexible and easy to use.

Here are some synthetic sequence databases generated with the IBM Quest Dataset Generator, converted to the SPMF format:

Dataset name | Description | Sequence count | Item count | Average number of itemsets per sequence | Average number of distinct items per sequence
data.slen_10.tlen_1.seq.patlen_2.lit.patlen_8.nitems_5000_spmf.txt | Synthetic data | 50,351 | 41,911 | 1.63 | 13.23
data.slen_10.tlen_1.seq.patlen_3.lit.patlen_8.nitems_5000_spmf.txt | Synthetic data | 47,785 | 53,137 | 1.93 | 15.63
data.slen_10.tlen_1.seq.patlen_4.lit.patlen_8.nitems_5000_spmf.txt | Synthetic data | 47,556 | 62,296 | 2.24 | 17.96
data.slen_10.tlen_1.seq.patlen_5.lit.patlen_8.nitems_5000_spmf.txt | Synthetic data | 47,988 | 69,686 | 2.56 | 20.49
data.slen_10.tlen_1.seq.patlen_6.lit.patlen_8.nitems_5000_spmf.txt | Synthetic data | 48,467 | 75,476 | 2.84 | 22.7
data.slen_8.tlen_1.seq.patlen_2.lit.patlen_8.nitems_5000_spmf.txt | Synthetic data | 45,535 | 41,270 | 1.52 | 12.30
data.slen_8.tlen_1.seq.patlen_3.lit.patlen_8.nitems_5000_spmf.txt | Synthetic data | 45,452 | 52,551 | 1.80 | 14.54
data.slen_8.tlen_1.seq.patlen_4.lit.patlen_8.nitems_5000_spmf.txt | Synthetic data | 46,131 | 61,137 | 2.09 | 16.71
data.slen_8.tlen_1.seq.patlen_5.lit.patlen_8.nitems_5000_spmf.txt | Synthetic data | 47,133 | 68,240 | 2.36 | 18.82

It is also possible to generate sequence databases by using the IBM Generator. Files generated by the IBM Generator can be converted to SPMF format by using the conversion tool provided in SPMF (see this example in the documentation for how to convert files).

Another alternative for generating synthetic sequence databases is to use a Matlab program provided by Ashwin Balani.

Datasets that contain time-interval sequences

There are also several sequence datasets where events are described using time intervals (each event has a start time and an end time, and hence a duration). Those datasets should be used with specialized algorithms for time-interval related pattern mining such as FastTIRP and VertTIRP. The datasets were obtained from the public repository of github user @omerh18 and converted to the SPMF format to make them easy to use with SPMF.

Dataset name | Sequence count | Event type count | Avg. number of time intervals per sequence
asl_spmf.csv | 65 | 146 | 31.32
aslgt_spmf.csv | 1751 | 47 | 41.26
auslan2_spmf.csv | 200 | 12 | 4.49
blocks_spmf.csv | 210 | 8 | 5.74
context_spmf.csv | 240 | 54 | 53.81
ct1_spmf.csv | 1060 | 49 | 42.68
ct2_spmf.csv | 576 | 64 | 307.14
diabetes_spmf.csv | 2038 | 35 | 39.52
hepatitis_spmf.csv | 498 | 63 | 96.44
pioneer_spmf.csv | 160 | 92 | 30.51
skating_spmf.csv | 530 | 41 | 35.68
smarthome_spmf.csv | 89 | 92 | 260.81
st1_spmf.csv | 2746 | 1299 | 59.53
st2_spmf.csv | 1805 | 1299 | 530.44

Datasets for Frequent Itemset mining / Association Rule Mining / Periodic pattern mining / Frequent episode mining

Datasets in SPMF format

These datasets can be directly used in SPMF:

Dataset name Description Transaction count Item count (I) Average item count per transaction (A) Density (%)
(A / I ) * 100
Has item names?
retail customer transactions from an anonymous Belgian retail store.
(source FIMI: http://fimi.ua.ac.be/data/)
88,162 16470 10.30 0.06 % No
mushrooms

prepared based on the UCI mushrooms dataset
(source: FIMI)

8,416 119 23 19.33 % No
pumsb

census data for population and housing
(source: FIMI)

49,046 2113 74 3.50 % No
chess prepared based on the UCI chess dataset
(source: FIMI)
3,196 75 37 49.33 % No
connect prepared based on the UCI connect-4 dataset (source: FIMI) 67,557 129 43 33.33 % No
accidents anonymized traffic accident data (source: FIMI) 340,183 468 33.8 7.22 % No
BMS_WebView_1 click-stream data from a webstore used in KDD-Cup 2000 (note: this version of the dataset has been prepared for itemset mining / association rule mining)
(source: FIMI)
59,602 497 2.51 0.51 % No
BMS_WebView_2 click-stream data from a webstore used in KDD-Cup 2000 (note: this version of the dataset has been prepared for itemset mining / association rule mining)
(source: FIMI)
77,512 3340 4.62 0.14 % No
Kosarak transactions from click-stream data from a Hungarian news portal.
(source: FIMI)
990,002 41270 8.1 0.02 % No
chainstore customer transactions from a major grocery store chain in California, USA 1,112,949 46086 7.23 0.02 % No
foodmart customer transactions from a retail store, obtained and transformed from SQL-Server 2000 4,141 1559 4.42 0.28 % No
Fruithut

This is a dataset of customer transactions from a US retail store focusing on selling fruits.

The dataset contains 181,970 transactions and 1,265 different items. The largest transaction contains 36 items, while on average a customer purchases 3.58 items per transaction.

You can also download the taxonomy of items for this dataset, which can be useful for cross-level or multi-level itemset mining. The taxonomy is provided in a separate file where each line indicates that an item is a specialization of another item (a product belongs to a product category). An item can be a product or a category. The taxonomy has 4 levels and 43 categories. See the paper of Ying Wang et al. (2020) for more details.

The data was obtained from Kaggle and then transformed to the SPMF format by Ying Wang et al. for her paper about mining cross-level high utility itemsets (2020).

181,970 1265 3.58 0.28 %

Yes, item names are included in the file

 

Liquor_11

This is a dataset of 9,284 customer transactions from US liquor stores in the state of Iowa.

You can also download the taxonomy of items for this dataset, which can be useful for cross-level or multi-level itemset mining. The taxonomy is provided in a separate file where each line indicates that an item is a specialization of another item (a product belongs to a product category). An item can be a product or a category. The taxonomy has 7 levels and 77 categories. See the paper of Ying Wang et al. (2020) for more details.

The data was obtained from the internet and then transformed to the SPMF format by Ying Wang et al. for her paper about mining cross-level high utility itemsets (2020). In particular, the taxonomy was created by applying some NLP to the item names.

The dataset contains only transactions with no more than 11 items. If you want all transactions with no more than 15 items, you can download liquor_15.txt, or if you want no more than 5 items, you can download liquor_5.txt.

9,284 2626 2.7 0.10 % Yes (temporarily not available)
kddcup99 This dataset is transformed from the KDD CUP 1999 dataset, found at https://archive.ics.uci.edu/ml/datasets/KDD+Cup+1999+Data 1,000,000 135 16 11.85 %

Yes, kddcup99Attributes.xlsx

OnlineRetail

This dataset is transformed from the Online Retail dataset, found at https://archive.ics.uci.edu/ml/datasets/Online+Retail.
Converted by Zhongjie Zhang.
Note: there is another dataset also called OnlineRetail in some papers, which is in fact the ECommerce dataset.

541,909 2603 4.37 0.17 %

Yes, OnlineRetailAttributes.xlsx.

PAMP

This dataset is transformed from the PAMAP2 dataset, obtained from https://archive.ics.uci.edu/ml/datasets/PAMAP2+Physical+Activity+Monitoring
Note: In the original data, a few lines contained NaN values. These values have been removed.
Converted  by Zhongjie Zhang.

1,000,000 141 23.93 16.97 % Yes,
PAMPAttributes.xlsx
Skin This dataset is transformed from the dataset Skin Segmentation, obtained from https://archive.ics.uci.edu/ml/datasets/Skin+Segmentation
Converted  by Zhongjie Zhang.
245,057 11 4.0 36.36 % Yes,
SkinAttributes.xlsx
USCensus This dataset is transformed from the US Census 1990 dataset, obtained from https://archive.ics.uci.edu/ml/datasets/US+Census+Data+(1990)
Converted  by Zhongjie Zhang.
1,000,000 396 68.0 17.17 % No
RecordLink This dataset is transformed from http://archive.ics.uci.edu/ml/datasets/Record+Linkage+Comparison+Patterns.
Converted  by Zhongjie Zhang.
574,913 29 10 34.48 %

Yes
RecordLinkAttribute.xslx

PowerC The dataset is about household electric power consumption. It was prepared using the original dataset from https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption. The dataset contains instances, which were transformed to transactions, where the values of the 3rd to the 9th attributes are divided into 10 equal parts and every part is represented by a number. As a result, the dataset has 140 items.
Converted  by Zhongjie Zhang.
1,040,000 140 7 5.00 % No
Susy

This dataset is related to physics, more precisely to particles detected using a particle accelerator. The dataset contains instances prepared using the original dataset obtained from http://archive.ics.uci.edu/ml/datasets/SUSY. The instances were transformed to transactions, such that the values of every attribute are divided into 10 equal parts and every part is represented by a number. As a result, the dataset has 190 items.
Converted  by Zhongjie Zhang.

5,000,000 190 19 10.00 % No
Chicago_Crimes_2001_to_2017_FIM The dataset was converted from the dataset 'Crimes in Chicago' (https://www.kaggle.com/datasets/currie32/crimes-in-chicago) by Zhongjie Zhang. The dataset records the crimes that have occurred in Chicago from 2001 to 2017.

Every transaction corresponds to a <month, area> pair. A transaction describes the crimes that have occurred in a specific area during a specific month. Items represent the types of crimes, while the utility of an item in a transaction indicates the number of occurrences of that crime. For a description of the types of crimes, see the file describing the item names of that dataset.
2,662,309 35 1.795 5.13% Yes, the description of items is here
Yoo-choose-buy-FIM

This dataset was obtained from the RecSys2015 challenge and converted to the SPMF format so that it can be used with SPMF. The dataset contains transactions from customers who have purchased items in an online store.

Each transaction corresponds to a customer and indicates the products that have been purchased.

Note on the conversion: items having a quantity of 0 or a price of 0 were eliminated.

234,300 16,004 2.165 0.01% No
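
The density column in the table above follows the formula (A / I) * 100. For example, for chess, (37 / 75) * 100 gives approximately 49.33 %. Below is a minimal Python sketch that recomputes the table columns from a file, assuming a transaction database with one transaction per line and items separated by spaces, and assuming that lines starting with #, % or @ are comments or metadata. The function names are hypothetical and are not part of SPMF.

```python
from typing import Iterable, Tuple


def density_percent(avg_items_per_transaction: float, item_count: int) -> float:
    """Density in percent, computed as (A / I) * 100."""
    return avg_items_per_transaction / item_count * 100


def transaction_stats(lines: Iterable[str]) -> Tuple[int, int, float, float]:
    """Return (transaction count, item count I, average length A, density %)."""
    transaction_count = 0
    total_items = 0
    distinct_items = set()
    for line in lines:
        line = line.strip()
        # Assumption: blank lines and lines starting with #, % or @ carry no transaction.
        if not line or line[0] in "#%@":
            continue
        items = line.split()
        transaction_count += 1
        total_items += len(items)
        distinct_items.update(items)

    if transaction_count == 0:
        return 0, 0, 0.0, 0.0
    average = total_items / transaction_count
    return (transaction_count, len(distinct_items), average,
            density_percent(average, len(distinct_items)))


if __name__ == "__main__":
    # Values taken from the chess row of the table: A = 37, I = 75.
    print(round(density_percent(37, 75), 2))
    # Three small made-up transactions, not taken from any dataset listed here.
    print(transaction_stats(["1 2 3", "2 3", "1 4"]))
```

A low density (such as 0.02 % for Kosarak) indicates a sparse dataset where each transaction contains a tiny fraction of all items, whereas a high density (such as 49.33 % for chess) indicates a dense dataset.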

Synthetic datasets

Here are a few synthetic datasets in SPMF format:

Dataset name | Description | Transaction count | Item count (I) | Average item count per transaction (A) | Density (A / I) * 100
c20d10k | synthetic dataset | 10,000 | 192 | 20 | 10.42 %
c73d10k | synthetic dataset | 10,000 | 1592 | 73 | 4.59 %
t25i10d10k | synthetic dataset | 9,976 | 929 | 24.77 | 2.67 %
t20i6d100k | synthetic dataset | 99,922 | 893 | 19.90 | 2.23 %

It is possible to generate synthetic transaction databases by using the random transaction database generator provided in SPMF (see this example in the documentation to know how to do it), which is flexible and easy to use.

Moreover, you can download the following synthetic datasets often used in the data mining literature, generated by the IBM Generator. The names are given according to this convention: D: number of sequences in the dataset, C: average number of itemsets per sequence, T: average number of items per itemset, I: average size of itemsets in potentially frequent sequences.

Another alternative for generating synthetic transaction databases is to use a Matlab program provided by Ashwin Balani.

Real-life datasets in SPMF format, having timestamps

The datasets containing timestamps can be directly used in SPMF with algorithms that accept timestamps:

Dataset name Description Transaction count Item count (I) Average item count per transaction (A) Density
(A / I ) * 100
Has item names?
ECommerce_time_without_utility customer transactions occurring between 01/12/2010 and 09/12/2011 of a UK-based and registered non-store online retailer (this is the version with timestamps). If you want to know the meaning of the items in this dataset, the list of item names is available.
The data was prepared by Yimin Zhang using public data.

14,975

3,468 11.71 0.34 % Yes

list of item names

Real-life datasets in ARFF format

The GUI and command line interface of SPMF (093d and higher) can read ARFF (Attribute-Relation File Format) files. You can download a collection of 36 real-life datasets in ARFF format from the UCI machine learning repository. These files can be used with all association rule mining and itemset mining algorithms that take a transaction database as input. Note that the MainTest... examples in the source code have not been adapted to read ARFF files yet (for now, ARFF files are only accepted by the GUI and command line interface). Most features of the ARFF format are supported, except that (1) the character = is forbidden and (2) escape characters are not considered. Note that when the ARFF format is used, the data mining algorithms will be slightly slower than with the native SPMF file format, because the input file is automatically converted before launching the algorithm and the result has to be converted back. This cost should however be small. The specification of the ARFF format can be found here.
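
Because of the two restrictions above, it can be useful to check an ARFF file before loading it. Below is a minimal Python sketch that reports lines containing the character = or a backslash (taken here as a sign of an escape sequence). It assumes only that ARFF comment lines start with %. The function name is hypothetical, and this check is not part of SPMF.

```python
from typing import Iterable, List, Tuple


def find_unsupported_arff_lines(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line number, reason) pairs for lines that may not be read correctly."""
    problems: List[Tuple[int, str]] = []
    for number, line in enumerate(lines, start=1):
        stripped = line.strip()
        # Comment lines in ARFF start with %, so their content is ignored here.
        if not stripped or stripped.startswith("%"):
            continue
        if "=" in stripped:
            problems.append((number, "contains '='"))
        if "\\" in stripped:
            problems.append((number, "contains a backslash"))
    return problems


if __name__ == "__main__":
    # A small made-up ARFF fragment used only for illustration.
    demo = [
        "% a comment, where = is harmless",
        "@relation demo",
        "@attribute color {red,green}",
        "@data",
        "red",
        "a=b",
    ]
    for number, reason in find_unsupported_arff_lines(demo):
        print(f"line {number}: {reason}")
```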

Datasets for High-Utility Pattern Mining

The following transaction databases can be used with high-utility itemset mining algorithms such as EFIM, FHM, HUI-Miner, IHUP, UPGrowth and Two-Phase. For information about the format of these files, see the examples in the documentation of EFIM, FHM, HUI-Miner or Two-Phase.

Real-life transaction datasets in SPMF format with real utility values

These are real-life customer transaction datasets with real utility values:

Dataset name Description Transaction count Item count (I) Average item count per transaction (A) Density (%)
(A / I ) * 100
Has real utility values? Has item names?
foodmart_utility dataset of customer transactions from a retail store, obtained and transformed from SQL-Server 2000 4,141 1559 4.42 0.28 % Yes No
chainstore_utility

dataset of customer transactions from a major grocery store chain in California, USA, containing 1,112,949 transactions and 46,086 items, obtained and transformed from NU-Mine Bench.

The original data is available here (not in SPMF format)

1,112,949 46086 7.23 0.02 % Yes No
ECommerce_retail_utility_no_timestamps

transactional data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 of a UK-based and registered non-store online retailer (this is the version with real utility values but without timestamps).

The data was prepared by Yimin Zhang using public data for his paper about peak high utility itemset mining.

14,975

3468 11.71 0.34 % Yes

Yes

list of item names

Fruithut_utility

This is a dataset of customer transactions from a US retail store focusing on selling fruits.

The dataset contains 181,970 transactions and 1,265 different items. The largest transaction contains 36 items, while on average a customer purchases 3.58 items per transaction.

You can also download the taxonomy of items for this dataset, which can be useful for cross-level or multi-level itemset mining. The taxonomy is provided in a separate file where each line indicates that an item is a specialization of another item (a product belongs to a product category). An item can be a product or a category. The taxonomy has 4 levels and 43 categories. See the paper of Ying Wang et al. (2020) for more details.

The data was obtained from Kaggle and then transformed to the SPMF format by Ying Wang et al. for her paper about mining cross-level high utility itemsets (2020).

181,970 1265 3.58 0.28 %

 

Yes

Yes, item names are included in the file
Liquor_11

This is a dataset of 9,284 customer transactions from US liquor stores in the state of Iowa.

You can also download the taxonomy of items for this dataset, which can be useful for cross-level or multi-level itemset mining. The taxonomy is provided in a separate file where each line indicates that an item is a specialization of another item (a product belongs to a product category). An item can be a product or a category. The taxonomy has 7 levels and 77 categories. See the paper of Ying Wang et al. (2020) for more details.

The data was obtained from the internet and then transformed to the SPMF format by Ying Wang et al. for her paper about mining cross-level high utility itemsets (2020). In particular, the taxonomy was created by applying some NLP to the item names.

The dataset contains only transactions with no more than 11 items. If you want all transactions with no more than 15 items, you can download liquor_15.txt, or if you want no more than 5 items, you can download liquor_5.txt.

9,284 2626 2.7 0.10 % Yes Yes (but temporarily not available)
Chicago_Crimes_2001_to_2017_utility The dataset was converted from the dataset 'Crimes in Chicago' (https://www.kaggle.com/datasets/currie32/crimes-in-chicago) by Zhongjie Zhang. The dataset records the crimes that have occurred in Chicago from 2001 to 2017.

Every transaction corresponds to a <month, area> pair. A transaction describes the crimes that have occurred in a specific area during a specific month. Items represent the types of crimes, while the utility of an item in a transaction indicates the number of occurrences of that crime. For a description of the types of crimes, see the file describing the item names of that dataset.
2,662,309 35 1.795 5.13% Yes Yes, the description of items is here
Yoo-choose-buy-Utility

This dataset was obtained from the RecSys2015 challenge and converted to the SPMF format so that it can be used with SPMF. The dataset contains transactions from customers who have purchased items in an online store.

Each transaction corresponds to a customer and indicates the products that have been purchased.

Note on the conversion: items having a quantity of 0 or a price of 0 were eliminated.

234,300 16,004 2.165 0.01% Yes No

Real-life transaction datasets in SPMF format having synthetic (fake) utility values

These datasets are real datasets but with synthetic utility values. The internal utility values have been generated using a uniform distribution in [1, 10].
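
To illustrate how synthetic utility values of this kind can be attached to a transaction database, here is a minimal Python sketch. It assumes the SPMF utility format where each line has the form "items:transaction utility:item utilities", draws integer quantities uniformly from [1, 10], and uses made-up unit profits supplied by the caller. The function name is hypothetical, and the exact procedure used to prepare the datasets below may differ.

```python
import random
from typing import Dict, List, Sequence


def add_synthetic_utilities(transactions: Sequence[Sequence[int]],
                            unit_profit: Dict[int, int],
                            seed: int = 0) -> List[str]:
    """Return one utility-format line per transaction.

    The internal utility (quantity) of each item is an integer drawn
    uniformly from [1, 10]. The utility of an item in a transaction is
    quantity * unit profit, and the transaction utility is their sum.
    """
    rng = random.Random(seed)  # fixed seed so the output is reproducible
    lines = []
    for items in transactions:
        utilities = [rng.randint(1, 10) * unit_profit[item] for item in items]
        lines.append("{}:{}:{}".format(
            " ".join(str(item) for item in items),
            sum(utilities),
            " ".join(str(u) for u in utilities),
        ))
    return lines


if __name__ == "__main__":
    # Made-up transactions and unit profits, used only for illustration.
    demo_transactions = [[1, 2, 3], [2, 5]]
    demo_profits = {1: 5, 2: 2, 3: 1, 5: 3}
    for line in add_synthetic_utilities(demo_transactions, demo_profits):
        print(line)
```

Since the utilities are synthetic, results obtained on such files are useful for comparing algorithm performance but do not reflect real profits.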

Dataset name Description Transaction count Item count (I) Average item count per transaction (A) Density (%)
(A / I ) * 100
Has real utility values? Has item names?
retail_utility customer transactions from an anonymous Belgian retail store.
(source FIMI: http://fimi.ua.ac.be/data/)
88,162 16470 10.30
0.06 %
No (synthetic) No
mushroom_utility

prepared based on the UCI mushrooms dataset
(source: FIMI)

8,416 119 23
19.33 %
No (synthetic) No
pumsb_utility,

census data for population and housing
(source: FIMI)

49,046 2113 74
3.50 %
No (synthetic) No
chess_utility, prepared based on the UCI chess dataset
(source: FIMI)
3,196 75 37
49.33 %
No (synthetic) No
connect_utilitty, prepared based on the UCI connect-4 dataset (source: FIMI) 67,557 129 43
33.33 %
No (synthetic) No
accidents_utility, anonymized traffic accident data (source: FIMI) 340,183 468 33.8
7.22 %
No (synthetic) No
kosarak_utility, transactions from click-stream data from a Hungarian news portal.
(source: FIMI)
990,002 41270 8.1
0.02 %
No (synthetic) No
BMS_utilitty, click-stream data from a webstore used in KDD-Cup 2000
(note: this version of the dataset has been prepared for high utility itemset mining )
(source: FIMI)
77,512 3340 4.62
0.14 %
No (synthetic) No

Datasets in SPMF format having utility values and timestamps

Here are a few transaction databases in SPMF format for high utility itemset mining that contain timestamps. These datasets were prepared by Yimin Zhang for the paper in Information Sciences about the LHUI-Miner and PHUI-Miner algorithms (Fournier-Viger et al., 2019). These datasets were used for discovering local high utility itemsets and peak high utility itemsets, but can be used for other tasks such as recent high utility pattern mining.

Dataset name Description Transaction count Item count (I) Average item count per transaction (A) Density (%)
(A / I ) * 100
Has real utility values? Has real timestamps? Has item names?
ECommerce_retail_utility_timestamps

transactional data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 of a UK-based and registered non-store online retailer (this is the version with real utility values and with timestamps).

The data was prepared by Yimin Zhang using public data for his paper about peak high utility itemset mining.

14,975

3468 11.71 0.34 % Yes Yes

Yes

list of item names

Fruithut_utility_timestamps

This is a dataset of customer transactions from a US retail store focusing on selling fruits.

The dataset contains 181,970 transactions, 1,265 different items. The largest transactions contains 36 items, while on average a customer purchase 3.58 items per transaction.

You can also download the taxonomy of items for this dataset, which can be useful for cross-level or multi-level itemset mining. The taxonomy is provided in a separate file where each line indicates that an item is a specialization of another item (a product belongs to a product category). An item can be a product or a category. The taxonomy has 4 levels and 43 categories. See the paper by Ying Wang et al. for more details.

The data was obtained from Kaggle and then transformed to the SPMF format by Ying Wang et al. for her paper about mining cross-level high utility itemsets (2020).

181,970 1,265 3.58 0.28 % Yes Yes Yes (item names are included in the file)
foodmart_utility_timestamp dataset of customer transactions from a retail store, obtained and transformed from SQL-Server 2000 4,141 1,559 4.42 0.28 % Yes No (synthetic) No
retail_utility_timestamp customer transactions from an anonymous Belgian retail store.
(source FIMI: http://fimi.ua.ac.be/data/)
88,162 16,470 10.30 0.06 % No (synthetic) No (synthetic) No
kosarak_utility_timestamp transactions from click-stream data from a Hungarian news portal.
(source: FIMI)
990,002 41,270 8.1 0.02 % No (synthetic) No (synthetic) No
mushroom_utility_timestamp

prepared based on the UCI mushrooms dataset
(source: FIMI)

8,416 119 23 19.33 % No (synthetic) No (synthetic) No
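
The Density column in these tables is defined as (A / I) * 100, where A is the average item count per transaction and I is the item count. As a quick check, the following minimal Python sketch recomputes the density for three rows, using values copied from the tables on this page. The function name density_percent is only an illustrative choice and is not part of SPMF.

```python
def density_percent(avg_items_per_transaction: float, item_count: int) -> float:
    """Return the density (A / I) * 100, expressed as a percentage."""
    if item_count <= 0:
        raise ValueError("item_count must be positive")
    return avg_items_per_transaction / item_count * 100


# (A, I) pairs copied from the tables on this page
examples = {
    "mushroom": (23, 119),
    "foodmart": (4.42, 1559),
    "kosarak": (8.1, 41270),
}

for name, (a, i) in examples.items():
    # Two decimals, as in the Density column
    print(f"{name}: {density_percent(a, i):.2f} %")
```

A low density (such as for kosarak) means that a typical transaction contains only a tiny fraction of all items, while a high density (such as for mushroom) means that transactions share many items.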

Datasets for high utility itemset mining with negative unit profit values

Here are a few transaction databases in SPMF format for high-utility itemset mining with negative unit profit values. Those datasets have been generated to be used in the FHN paper (Fournier-Viger et al., 2014) and can be used with the FHN and HUINIV-Mine algorithms. See the FHN paper for details.

Dataset name Description Transaction count Item count (I) Average item count per transaction (A) Density (%) = (A / I) * 100 Has real utility values? Has item names?
retail_negative, customer transactions from an anonymous Belgian retail store.
(source FIMI: http://fimi.ua.ac.be/data/)
88,162 16,470 10.30 0.06 % No (synthetic) No
mushroom_negative

prepared based on the UCI mushrooms dataset
(source: FIMI)

8,416 119 23 19.33 % No (synthetic) No
pumsb_negative,

census data for population and housing
(source: FIMI)

49,046 2113 74 3.50 % No (synthetic) No
chess_negative, prepared based on the UCI chess dataset
(source: FIMI)
3,196 75 37 49.33 % No (synthetic) No
accidents_negative, anonymized traffic accident data (source: FIMI) 340,183 468 33.8 7.22 % No (synthetic) No
kosarak_negative transactions from click-stream data from a Hungarian news portal.
(source: FIMI)
990,002 41270 8.1 0.02 % No (synthetic) No

Datasets for high utility quantitative itemset mining

The following datasets have been used for the FHUQI-Miner paper (Nouioua et al., 2021) about high utility quantitative itemset mining, and can also be used with the VHUQI algorithm. For these algorithms, two files are provided for each dataset: the first file contains transactions with quantities for each item (e.g. retail.txt). The second file contains the unit profit or weight of each item (e.g. retail1profit.txt). For more details about the format, please see the documentation page about FHUQI-Miner or VHUQI.
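
To illustrate how the two files relate, the sketch below combines per-transaction quantities with per-item unit profits, using the usual definition of utility as quantity multiplied by unit profit. It is only an illustration under stated assumptions: the items, quantities and profits are made up, the data is held in plain Python dictionaries, and no attempt is made to parse the actual file layout, which is described in the FHUQI-Miner and VHUQI documentation.

```python
from typing import Dict, List

# Hypothetical content of the second file: unit profit (or weight) of each item
unit_profit: Dict[str, int] = {"bread": 2, "milk": 3, "apple": 1}

# Hypothetical content of the first file: purchased quantity of each item,
# one dictionary per transaction
transactions: List[Dict[str, int]] = [
    {"bread": 1, "milk": 2},
    {"apple": 5, "bread": 2},
]


def transaction_utility(transaction: Dict[str, int], profits: Dict[str, int]) -> int:
    """Sum of quantity * unit profit over the items of one transaction."""
    return sum(quantity * profits[item] for item, quantity in transaction.items())


utilities = [transaction_utility(t, unit_profit) for t in transactions]
print(utilities)  # [8, 9]
```

Keeping quantities and unit profits separate means that the same transactions can be reused with a different profit table without being rewritten.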

Dataset name Description Transaction count Item count (I) Average item count per transaction (A) Density (%) = (A / I) * 100 Has real utility and quantity values? Has item names?

retail.txt

retailf1profit.txt

customer transactions from an anonymous Belgian retail store.
(note: this version of the dataset has been adapted for high utility quantitative itemset mining)
88162 16470 10.30 0.06 % No (synthetic) No

pumsb.txt

pumsbf1profit.txt

 

census data for population and housing
(note: this version of the dataset has been adapted for high utility quantitative itemset mining)

49046 2113 74 3.50 % No (synthetic) No

connect.txt

connect1profit.txt

prepared based on the UCI connect-4 dataset
(note: this version of the dataset has been adapted for high utility quantitative itemset mining )
67557 129 43 33.33 % No (synthetic) No

bms.txt

bmsf1profit.txt

 

click-stream data from a webstore used in KDD-Cup 2000
(note: this version of the dataset has been adapted for high utility quantitative itemset mining)
59601 497 2.42 0.51 % No (synthetic) No

bms2.txt

bmsf2profit.txt

click-stream data from a webstore used in KDD-Cup 2000

(note: this version of the dataset has been adapted for high utility quantitative itemset mining)

77512 3340 4.62 0.14 % No (synthetic) No

foodmart

foodmartf1profit.txt

dataset of customer transactions from a retail store, obtained and transformed from SQL-Server 2000
(note: this version of the dataset has been adapted for high utility quantitative itemset mining)
4141 1559 4.42 0.28 % Yes No

Datasets for on-shelf high-utility itemset mining with negative unit profit values

Datasets for high-utility sequential rule mining or high-utility sequential pattern mining

    Some sequence databases in SPMF format for high-utility sequential rule mining or high-utility sequential pattern mining. Those datasets were generated for the HUSRM paper (Zida et al., 2015), and can be used with the HUSRM and USpan algorithms. See the HUSRM paper for more information.

    Dataset name Description Sequence count Item count Average sequence length Has item names?
    BMS_sequence_utility This dataset was used in KDD CUP 2000. It contains clickstream data from an e-commerce 77,512 3,340 4.62 No

    Kosarak10k_sequence_utility

    This is a subset of 10,000 sequences of the Kosarak click-stream data from a Hungarian news portal.
    The dataset was converted to SPMF format using the original data from: http://fimi.ua.ac.be/data/.
    (2020-1-10: the kosarak dataset file has been updated as someone informed me that some sequences were missing)
    10,000     No
    SIGN_sequence_utility a dataset of sign language utterances containing approximately 800 sequences
    The original dataset file in another format can be obtained here with more details on this dataset.
    ~800 267 51.997 No
    Bible_sequence_utility

    This dataset is a conversion of the Bible into a sequence database (each word is an item).

    36,369 13,905 21.6 Yes
    Bible_with_items
    FIFA_sequence_utility Click stream data from the website of FIFA World Cup 98 20,450 2,990 34.74 No

Datasets for Cost-effective Pattern Mining (with utility and cost)

In the paper by Fournier-Viger et al. (2020), it was proposed to find cost-effective patterns in sequences with utility and cost information. This type of data is a set of sequences where each sequence is an ordered list of activities, and each activity has a cost value representing the amount of resources (e.g. time, money) spent to perform it. Moreover, each sequence has a utility label, either binary or numeric, which represents a positive or negative outcome. This type of data can represent, for example, how students use an e-learning system. In that case, each sequence is a list of activities or sessions performed by a student, and the cost represents the time spent studying for each activity. The utility represents the outcome of a sequence, such as the final exam score of a student after performing the activities (numeric utility), or whether the student passed or failed the course or exam (binary utility). Another example of cost-utility sequences is hospital data, where a sequence is a list of medicines taken by a patient, the cost indicates how much money or time was spent on these medical treatments, and the utility is whether the patient was cured or died.

From cost/utility sequences, we can then find patterns that have a low cost but typically lead to a high utility (called low-cost high utility patterns or cost-efficient patterns) by applying algorithms such as CEPB, CEPN and CorCEPB. In the above examples, some cost-effective patterns may be that studying some e-learning materials A and B typically requires little time but leads to high scores, or that taking some given medicines has a low cost but a high probability of curing a disease.
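
To make this kind of data more concrete, here is a minimal Python sketch of cost/utility sequences with a binary utility. Everything in it is an assumption made for illustration: the data is made up, the names pattern_cost and summarize are hypothetical, and the two reported values (average cost of a pattern and share of positive outcomes among the sequences that contain it) are simple descriptive measures, not the exact measures or the search procedure used by CEPB, CEPN or CorCEPB.

```python
from typing import List, Optional, Tuple

# One sequence = an ordered list of (activity, cost) pairs plus a binary utility
Sequence = Tuple[List[Tuple[str, float]], bool]

# Made-up example data
sequences: List[Sequence] = [
    ([("A", 10), ("B", 5), ("C", 30)], True),
    ([("A", 12), ("C", 40)], False),
    ([("B", 4), ("A", 8), ("B", 6)], True),
]


def pattern_cost(events: List[Tuple[str, float]], pattern: List[str]) -> Optional[float]:
    """Cost of the first left-to-right occurrence of `pattern` as a
    subsequence of `events`, or None if the pattern does not occur."""
    total, position = 0.0, 0
    for activity, cost in events:
        if position < len(pattern) and activity == pattern[position]:
            total += cost
            position += 1
    return total if position == len(pattern) else None


def summarize(pattern: List[str], data: List[Sequence]) -> Optional[Tuple[float, float]]:
    """Return (average cost, share of positive outcomes) over the sequences
    containing `pattern`, or None if no sequence contains it."""
    costs, positives = [], 0
    for events, outcome in data:
        cost = pattern_cost(events, pattern)
        if cost is not None:
            costs.append(cost)
            positives += outcome  # True counts as 1, False as 0
    if not costs:
        return None
    return sum(costs) / len(costs), positives / len(costs)


print(summarize(["A", "B"], sequences))  # (14.5, 1.0)
print(summarize(["A", "C"], sequences))  # (46.0, 0.5)
```

In this toy data, the pattern A, B is cheaper on average and is always followed by a positive outcome, which is the kind of pattern that cost-effective pattern mining aims to find.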

Here are some datasets:

Dataset name Description Sequence count Type of utility
allSessions_binary.txt

This is an e-learning dataset, where each sequence represents a student using an e-learning platform named Deeds, and there are 62 students.

A sequence contains a list of sessions taken by a student. There are six different sessions denoted as 1,2,3,4,5 and 6. In a sequence, each session has a cost value that represents the time spent by a student on the session for studying or doing some learning activities.

Each sequence has a binary utility value, which indicates the outcome of the sequence. Here the utility value indicates if a student has failed or passed the final exam after doing the learning sessions.

This dataset can be used with the CEPB and CorCEPB algorithms.

Note: This dataset was created by converting public data from Vahdat et al. to the SPMF format. A threshold of 60% was assumed to be the score for passing the final exam.

62 Binary
session6_numeric.txt

This dataset is also an e-learning dataset from students using an e-learning platform named Deeds. This dataset contains 50 sequences.

Each sequence indicates the list of activities done by a student during a learning session called SESSION 6. Each activity has a cost value representing the amount of time that the student has spent doing the activity. Moreover, each sequence has a numeric utility value indicating the score that the student had at the test at the end of SESSION 6.

This dataset can be used with the CEPN algorithm.

Note: This dataset was created by converting public data from Vahdat et al. to the SPMF format. Note that this file only contains SESSION 6.

50 Numeric
session6_binary.txt

This is the same as above, except that the utility has been converted to binary values instead of numeric values.

This dataset can be used with the CEPB and CorCEPB algorithms.

50 Binary
session5_numeric.txt This is similar to above, except that this is the data for SESSION 5 (numeric utility). 53 Numeric
session4_numeric.txt This is similar to above, except that this is the data for SESSION 4 (numeric utility). 54 Numeric
session4_binary.txt This is similar to above, except that this is the data for SESSION 4 (binary utility). 54 Binary

The original dataset was made public by Vahdat et al. and can be downloaded here but it is not in SPMF format. However, it gives a lot of details about the meaning of the data. This is very useful for understanding the meaning of the patterns found in the above data. This is the paper by Vahdat et al.:

M. Vahdat, L. Oneto, D. Anguita, M. Funk, M. Rauterberg.: A learning analytics approach to correlate the academic achievements of students with interaction data from an educational simulator. In: G. Conole et al. (eds.): EC-TEL 2015, LNCS 9307, pp. 352-366. Springer (2015).
DOI: 10.1007/978-3-319-24258-3_26

And this is the paper about cost-effective pattern mining with the CEPB, CEPN and CorCEPB algorithms, for which the datasets have been prepared for SPMF:

Fournier-Viger, P., Li, J., Lin, J. C., Chi, T. T., Kiran, R. U. (2020). Mining Cost-Effective Patterns in Event Logs. Knowledge-Based Systems (KBS), Elsevier

New: I have created some versions of the above datasets where sequences with cost values have been transformed into transactions with cost values. If you use those transformed transaction datasets, please cite our paper "LCIM: Mining Low Cost and High Utility Itemsets" by Nawaz et al., published in MIWAI 2022.

Dataset name Description Transaction count Type of utility
allSessions_binary_trans.txt

See above

62 Binary
session6_numeric_trans.txt

See above

50 Numeric
session6_binary_trans.txt

See above

50 Binary
session5_numeric_trans.txt See above 53 Numeric
session4_numeric_trans.txt See above 54 Numeric
session4_binary_trans.txt See above 54 Binary

Datasets for graph pattern mining

Datasets for mining subgraphs in a graph database

Here are a few datasets that can be used to evaluate subgraph mining algorithms such as TKG, gSpan and cgSpan. Each dataset contains a graph database (multiple graphs).

Dataset name Description Graph count Average node count per graph Average edge count per graph Vertex label count Edge label count Label file?
Chemical_340 a database of 340 graphs about chemistry 340 27.02 27.40 66 4 No
Coumpounds_422 a database of 422 graphs about compounds 422 39.61 42.31 21 4 No
Mutag a dataset of nitro compounds labeled regarding whether they have a mutagenic effect on a bacterium 188 17.93 19.79 7 11 Yes
PTC chemical compounds where their labels indicate carcinogenicity for male and female rats 344 25.55 25.96 19 1 Yes
NCI1 datasets of chemical compounds screened for activity against non-small cell lung cancer and ovarian cancer cell lines 4,110 29.87 32.30 37 3 Yes
NCI109 datasets of chemical compounds screened for activity against non-small cell lung cancer and ovarian cancer cell lines 4,127 29.68 32.13 38 3 Yes
Proteins graph collection where nodes represent secondary structure elements and edges indicate neighborhood in the amino-acid sequence or in 3-dimension space 1,113 39.06 72.82 3 1 Yes
Enzymes protein tertiary structures obtained from the BRENDA enzyme database 600 32.63 62.14 3 1 Yes
DandD a dataset of protein structures where nodes are amino acids and edges indicate spatial closeness, which are classified into enzymes or non-enzymes 1,178 284.31 715.66 82 1 Yes
IMDB-B a movie collaboration dataset that is collected from IMDB. Each graph is an ego-network where nodes represent actors/actresses and edges indicate if they appear in the same movie. Each graph is categorized into one of two genres (Action or Romance). 1,000 19.77 96.53 65 1 Yes
5newsgroup social network dataset 4976 86.86 352.65 27,881 1 Yes
collab social network dataset 5,000 74.49 2424.63 367 1 Yes
reddit_binary social network dataset 2,000 429.61 497.75 565 1 Yes
reddit_multi_12k social network dataset 11,929 391.41 456.89 909 1 Yes
reddit_multi_5k social network dataset 4,999 508.51 594.87 733 1 Yes
webkb social network dataset 4,167 77.80 318.15 4 7,770 Yes

The first two datasets were obtained from the Web and are probably the two most famous datasets in subgraph mining.

The other datasets were prepared by Dang Nguyen et al. based on data from various sources, were obtained from github.com/nphdang/gspan/, and were used in this paper: Dang Nguyen, Wei Luo, Tu Dinh Nguyen, Svetha Venkatesh, Dinh Phung (2018). Learning Graph Representation via Frequent Subgraphs. SDM 2018, San Diego, USA. SIAM, 306-314. The descriptions of these latter datasets in the above table were also taken from that paper.

Note that some datasets have label files. Those label files are not used by the algorithms offered in SPMF. They were generated when the datasets were prepared. These files indicate the correspondence between labels in the original datasets and those of the transformed datasets offered on this page. These label files are offered here because they may be useful to some users.
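
The statistics reported in the above table (graph count, average node and edge counts, label counts) can be recomputed with a short script. The sketch below is a minimal Python example that assumes a gSpan-style text layout, where a line "t # N" starts a graph, "v M L" declares vertex M with label L, and "e P Q L" declares an edge between vertices P and Q with label L. This layout is an assumption to check against the SPMF documentation and the downloaded files, and the function name graph_database_stats is hypothetical.

```python
from typing import Dict, Iterable


def graph_database_stats(lines: Iterable[str]) -> Dict[str, float]:
    """Compute summary statistics for a graph database given as text lines.

    Assumed line layout (to be verified against the actual files):
      t # <graph id>          starts a new graph
      v <vertex id> <label>   declares a vertex
      e <v1> <v2> <label>     declares an edge
    """
    graphs = nodes = edges = 0
    vertex_labels, edge_labels = set(), set()
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if parts[0] == "t":
            # Some files end with a "t # -1" terminator; it is not a graph
            if parts[-1] != "-1":
                graphs += 1
        elif parts[0] == "v":
            nodes += 1
            vertex_labels.add(parts[2])
        elif parts[0] == "e":
            edges += 1
            edge_labels.add(parts[3])
    if graphs == 0:
        raise ValueError("no graph found in the input")
    return {
        "graph_count": graphs,
        "avg_nodes": nodes / graphs,
        "avg_edges": edges / graphs,
        "vertex_label_count": len(vertex_labels),
        "edge_label_count": len(edge_labels),
    }


# Tiny made-up database with two graphs
sample = """t # 0
v 0 10
v 1 11
v 2 10
e 0 1 20
e 1 2 21
t # 1
v 0 10
v 1 12
e 0 1 20
"""

print(graph_database_stats(sample.splitlines()))
```

To run it on a downloaded dataset, the lines of the file can be passed instead of the sample, for example by opening the file and passing the file object to the function.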

Datasets for Mining Patterns in Dynamic Attributed Graphs

A dynamic attributed graph is a graph that changes over time and where vertices may have multiple numerical attributes. Some algorithms offered in SPMF, such as AER-Miner and TSEQMiner, are designed to discover interesting patterns in dynamic attributed graphs. The datasets used in the AER-Miner and TSEQMiner papers are available here: the datasets (ZIP, 5.0 MB) and some variations for scalability experiments (ZIP, 138 MB). Please see these papers for a description of these datasets.
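
As an informal illustration of the definition above, a dynamic attributed graph can be represented in memory as one snapshot per timestamp, each holding the edges and the numerical attribute values of the vertices at that time. The Python sketch below is only one possible representation with made-up data. The class Snapshot and the function attribute_changes are hypothetical names, and this is not the file format of the datasets above nor the internal representation used by AER-Miner or TSEQMiner.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Snapshot:
    """State of the graph at one timestamp."""
    edges: Set[Tuple[str, str]] = field(default_factory=set)
    # vertex -> attribute name -> numerical value
    attributes: Dict[str, Dict[str, float]] = field(default_factory=dict)


def attribute_changes(snapshots: List[Snapshot], vertex: str, attribute: str) -> List[float]:
    """Differences between consecutive recorded values of one attribute of one
    vertex, skipping the snapshots where the vertex or attribute is absent."""
    values = [
        s.attributes[vertex][attribute]
        for s in snapshots
        if vertex in s.attributes and attribute in s.attributes[vertex]
    ]
    return [later - earlier for earlier, later in zip(values, values[1:])]


# Made-up example: vertex v3 and the edge (v2, v3) appear at the second timestamp
snapshots = [
    Snapshot({("v1", "v2")}, {"v1": {"temperature": 20.0}, "v2": {"temperature": 18.0}}),
    Snapshot({("v1", "v2"), ("v2", "v3")},
             {"v1": {"temperature": 22.5}, "v2": {"temperature": 18.0}, "v3": {"temperature": 5.0}}),
    Snapshot({("v2", "v3")},
             {"v1": {"temperature": 21.0}, "v2": {"temperature": 17.0}, "v3": {"temperature": 7.0}}),
]

print(attribute_changes(snapshots, "v1", "temperature"))  # [2.5, -1.5]
```

Storing one snapshot per timestamp keeps the example simple; for large graphs that change little between timestamps, storing only the differences would likely be more compact.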

Datasets for Clustering

A Matlab program provided by Ashwin Balani to generate synthetic clustering datasets.