SPMF: A Java Open-Source Data Mining Library

SPMF - Public Datasets


The SPMF software natively uses text files as input. Some small examples of text files that can be used with each algorithm are described in the documentation of SPMF. These sample input files can be downloaded from the download page (test_files.zip) for the release version of SPMF, and are included with the source code for the source code version of SPMF. However, these datasets are quite small. For this reason, this webpage provides larger datasets that can be used with SPMF and that are often used in the data mining literature for evaluating and comparing algorithm performance. Unless otherwise indicated, the datasets are in SPMF format.

The datasets are divided into the following categories:

Datasets for Sequential Pattern Mining / Sequential Rule Mining / Sequence Prediction
Datasets for Frequent Itemset mining / Association Rule Mining / Periodic pattern mining / Frequent episode mining
Datasets for High-Utility Pattern Mining
Datasets for Cost-effective Pattern Mining (with utility and cost)
Graph datasets for graph pattern mining
- Datasets for mining subgraphs in a graph database
- Datasets for mining subgraphs in a dynamic attributed graph
Datasets for clustering

Datasets for Sequential Pattern Mining / Sequential Rule Mining / Sequence Prediction

Real-life datasets in SPMF format

Dataset name	Description	Sequence count	Item count	Average sequence length	Has item names?
BMSWebView1 (Gazelle) (KDD CUP 2000)	This dataset was used in KDD CUP 2000. It contains clickstream data from an e-commerce website. In this dataset, there are some long sequences. For example, 318 sequences contain more than 20 items.	59,601	497	2.42	No
BMSWebView2 (Gazelle) (KDD CUP 2000)	This dataset was used in KDD CUP 2000. It contains clickstream data from an e-commerce website.	77,512	3,340	4.62	No
Kosarak (a subset of only 10,000 sequences: here; a subset of 25,000 sequences: here)	This is a very large dataset containing 990,000 sequences of click-stream data from a Hungarian news portal. The dataset was converted to the SPMF format using the original data from: http://fimi.ua.ac.be/data/. (2020-1-10: the Kosarak dataset file was updated after a report that some sequences were missing.)	990,000	41,270	8.1	No
Sign	A dataset of sign language utterances containing approximately 800 sequences. The original dataset file in another format can be obtained here, with more details on this dataset.	~800	267	51.997	No
Bible	This dataset is a conversion of the Bible into a sequence database (each word is an item).	36,369	13,905	21.6	Yes Bible_with_items
Leviathan	This dataset is a conversion of the novel Leviathan by Thomas Hobbes (1651) as a sequence database (each word is an item).	5,834	9,025	33.8	Yes Leviathan_with_items
MSNBC	a dataset of click-stream data from the MSNBC website, converted from original data from the UCI repository. The shortest sequences have been removed to keep only 31,790 sequences. If you need the full dataset, you can download it.	989,818	17	13.23	No
FIFA	Click stream data from the website of FIFA World Cup 98	20,450	2,990	34.74	No
BIKE	This dataset contains sequences of locations where shared bikes were parked in a city. Each item represents a bike sharing station and each sequence indicates the different locations of a bike over time. The dataset was obtained from the GitHub of Andrea Tonon, and is a transformation of Kaggle data (https://www.kaggle.com/cityofLA/los-angeles-metro-bike-share-trip-data)	21,078	67	7.27	No
MT745584 (COVID-19 genome sequence)	This is the nucleotide sequence of the MT745584 strain of the COVID-19 coronavirus, collected on 2020-07-13 in Bahrain. The dataset was obtained from a public database and converted to the SPMF format by M. Saqib Nawaz for his paper: Using Artificial Intelligence Techniques for COVID-19 Genome Analysis. More datasets related to this paper and code can be found on GitHub.	497	4	16	Original data
ProofSequences	a dataset containing sequences of proof steps for mathematical proofs. The dataset is described in the paper: Nawaz, M. S., Sun, M., Fournier-Viger, P. (2019). Proof Guidance in PVS with Sequential Pattern Mining. Proc. of 9th Intern. Conf. Fundamentals of Software Engineering (FSEN 2019), 15 pages, Springer, LNCS.	35	13	12.34	Yes
E-Shop	Click-stream data for online shopping. The dataset contains sequences of clicks (clickstream) from an online store offering clothing for pregnant women. Notes: Dataset cleaned, pre-processed and transformed by Frederic Flouvat based on data from the UCI repository. A GitHub project by F. Flouvat contains details and code about the transformation. This data is interesting for algorithm testing because it consists of sequences of itemsets instead of sequences of items. Average number of items per itemset: 9.0	24026	317	61.98	Yes Item meanings (JSON format)
MicroblogPCU	Datasets about spam in micro-blogs. Notes: Dataset cleaned, pre-processed and transformed by Frederic Flouvat based on data from the UCI repository. A GitHub project by F. Flouvat contains details and code about the transformation. This data is interesting for algorithm testing because it consists of sequences of itemsets instead of sequences of items. Average number of items per itemset: 6.85	429	50505	296.37	Yes Item meanings (JSON format)
OnlineRetail_II_all	This Online Retail II data set contains all the transactions occurring for a UK-based and registered, non-store online retailer between 01/12/2009 and 09/12/2011. The company mainly sells unique all-occasion gift-ware. Many customers of the company are wholesalers. Notes: Dataset cleaned, pre-processed and transformed by Frederic Flouvat based on data from the UCI repository. A GitHub project by F. Flouvat contains details and code about the transformation. This data is interesting for algorithm testing because it consists of sequences of itemsets instead of sequences of items. Average number of items per itemset: 22.775	4383	41431	32.154	Yes Item meanings (JSON format)
OnlineRetail_II_best	This Online Retail II data set contains all the transactions occurring for a UK-based and registered, non-store online retailer between 01/12/2009 and 09/12/2011. The company mainly sells unique all-occasion gift-ware. Many customers of the company are wholesalers. Notes: Dataset cleaned, pre-processed and transformed by Frederic Flouvat based on data from the UCI repository. A GitHub project by F. Flouvat contains details and code about the transformation. This data is interesting for algorithm testing because it consists of sequences of itemsets instead of sequences of items. Average number of items per itemset: 9.0. Note: The dataset had some format errors, which were fixed on 2023-7-15.	4383	10157	48.23	Yes Item meanings (JSON format)
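
The sequence count, item count and average sequence length reported in the table above can be recomputed from a downloaded file. The following is a minimal Python sketch, not part of SPMF. It assumes the usual SPMF sequence format, in which items are positive integers separated by spaces, -1 marks the end of an itemset and -2 marks the end of a sequence, and it assumes that lines starting with #, % or @ are comments or metadata. The function name sequence_stats is hypothetical, and files with timestamps are not handled.

```python
def sequence_stats(lines):
    """Return (sequence count, distinct item count, average items per sequence)."""
    n_sequences = 0
    total_items = 0
    distinct_items = set()
    for line in lines:
        line = line.strip()
        # Skip blank lines and comment/metadata lines (assumed prefixes).
        if not line or line[0] in "#%@":
            continue
        # Drop the itemset (-1) and sequence (-2) separators, keep the items.
        items = [int(tok) for tok in line.split() if tok not in ("-1", "-2")]
        n_sequences += 1
        total_items += len(items)
        distinct_items.update(items)
    average = total_items / n_sequences if n_sequences else 0.0
    return n_sequences, len(distinct_items), average


# Two tiny sequences: <(1)(2 3)> and <(4)(1)>
sample = ["1 -1 2 3 -1 -2", "4 -1 1 -1 -2"]
print(sequence_stats(sample))  # (2, 4, 2.5)
```

For a real file, the same function can be called on an open file object, for example sequence_stats(open(path)) where path is the location of the downloaded dataset.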

A collection of 30 books converted to SPMF format (sequences of words or sequences of parts of speech)

Below is a dataset that consists of 30 public domain books that have been prepared and converted to sequences in the SPMF format by Jean-Marc Pokou et al. (2016). Each book can be used to extract patterns using sequential pattern mining or sequential rule mining algorithms.

These books are written by 10 different English novelists from the 19th century. The total number of words/sentences in the corpus of each author is as follows: Catharine Traill (276,829/ 6,588), Emerson Hough (295,166/ 15,643), Henry Adams (447,337/ 14,356), Herman Melville (208,662/ 8,203), Jacob Abbott (179,874/ 5,804), Louisa May Alcott (220,775/ 7,769), Lydia Maria Child (369,222/ 15,159), Margaret Fuller (347,303/ 11,254), Stephen Crane (214,368/ 12,177), and Thornton W. Burgess (55,916/ 2,950). The list of books is:

Author	Datasets (books) in SPMF format
Catharine Traill	- A Tale of The Rice Lake Plains - Lost in the Backwoods - The Backwoods of Canada
Emerson Hough	- The Girl at the Halfway House - The Law of the Land - The Man Next Door
Henry Adams	- Democracy, an American novel - Mont-Saint-Michel and Chartres - The Education of Henry Adams
Herman Melville	- I and My Chimney - Israel Potter - The Confidence-Man His Masquerade
Jacob Abbott	- Alexander the Great - History of Julius Caesar - Queen Elizabeth
Louisa May Alcott	- Eight Cousins - Rose in Bloom - The Mysterious Key and What It Opened
Lydia Maria Child	- A Romance of the Republic - Isaac T. Hopper - Philothea
Margaret Fuller	- Life Without and Life Within - Summer on the Lakes, in 1843 - Woman in the Nineteenth Century
Stephen Crane	- Active Service - Last Words - The Third Violet
Thornton W. Burgess	- The Adventures of Buster Bear - The Adventures of Chatterer the Red Squirrel - The Adventures of Grandfather Frog

There are two versions of each dataset: sequences of words and sequences of parts of speech (POS) (obtained with the Stanford NLP tagger).
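
To give an idea of what the word version of such a dataset looks like, here is a minimal Python sketch that turns raw text into a sequence database where each word is an item. It is only an illustration and not the procedure used by Pokou et al. It assumes that one sentence becomes one sequence and that each word is its own itemset, and it writes item names using @CONVERTED_FROM_TEXT and @ITEM=id=name header lines, which is my understanding of how SPMF stores item names. The function name text_to_spmf and the naive sentence splitting are hypothetical choices.

```python
import re


def text_to_spmf(text):
    """Convert text to SPMF-style lines: one sentence per sequence, one word per itemset."""
    word_ids = {}   # word -> integer item identifier, assigned in order of first appearance
    sequences = []
    # Naive sentence splitting on ., ! and ? (an assumption, not a real tokenizer).
    for sentence in re.split(r"[.!?]+", text):
        words = re.findall(r"[a-z']+", sentence.lower())
        if not words:
            continue
        ids = [word_ids.setdefault(w, len(word_ids) + 1) for w in words]
        # Each item is followed by -1 (end of itemset); -2 ends the sequence.
        sequences.append(" ".join(f"{i} -1" for i in ids) + " -2")
    header = ["@CONVERTED_FROM_TEXT"]
    header += [f"@ITEM={i}={w}" for w, i in word_ids.items()]
    return header + sequences


for line in text_to_spmf("The cat sat. The dog sat!"):
    print(line)
```

The POS version would be obtained in the same way after replacing each word with its part-of-speech tag, which requires a tagger and is not shown here.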

Here are the links to download the books:

If you use the above book datasets, you may want to cite this paper:

Pokou J. M., Fournier-Viger, P., Moghrabi, C. (2016). Authorship Attribution Using Small Sets of Frequent Part-of-Speech Skip-grams. Proc. 29th Intern. Florida Artificial Intelligence Research Society Conference (FLAIRS 29), AAAI Press, pp. 86-91

Sequences of API calls of malware programs

Here is a large dataset (compressed ZIP file 6 MB - uncompressed 813 MB ) containing the sequences of API calls from different types of malware programs.
This dataset contains 8 files, each corresponding to a different type of malware such as viruses, backdoors and spyware. Each file is a sequence database containing the sequences of API calls of multiple programs of a given type.

The table below summarizes the information about these 8 files:

File	Type of malware	Samples (sequences)	API calls (distinct items)	Maximum sequence length	Average sequence length
Adwaretranslated.txt	Adware	379	212	1,450,685	6,867
Backdoortranslated.txt	Backdoor	1001	227	1,402,652	11,293
Downloadertranslated.txt	Downloader	1001	232	870,719	6,522
Droppertranslated.txt	Dropper	891	226	1,068,329	16,008
Spywaretranslated.txt	Spyware	832	229	1,764,421	46,951
Trojantranslated.txt	Trojan	1001	232	1,232,913	13,818
Virustranslated.txt	Virus	1001	241	1,062,231	18,370
Wormstranslated.txt	Worms	1001	236	1,245,582	33,614

These datasets can be used with sequential pattern mining algorithms and sequential rule mining algorithms.

More information about these datasets can be found in this paper:

Nawaz, M. S., Fournier-Viger, P., Nawaz, M. Z., Chen, G., Wu, Y. (2022). MalSPM: Metamorphic Malware Behavior Analysis and Classification using Sequential Pattern Mining. Computers & Security, Elsevier, to appear. DOI: 10.1016/j.cose.2022.102741

More data related to this paper can also be found on Github.

Sequences of MOOC data with timestamps

This page also provides a time-extended sequence database that contains MOOC data (e-learning data) and can be used with sequential pattern mining algorithms such as SPM_FC_P and SPM_FC_L. It can be downloaded here: mooc.txt

This dataset was originally a “Course Recommendation” dataset provided by the MoocData platform but has been transformed by Song et al. into the SPMF format. The original dataset was collected from XuetangX, one of the largest MOOC platforms in China. Originally used for course recommendation, the dataset contains the records of 82,535 course enrollment sequences from XuetangX from October 1, 2016 to March 31, 2018, a time span of 547 days. The number of courses is 1,302. The length of the longest sequence is 398, the length of the shortest sequence is 3, and the average sequence length is 5.19.

See this paper for more details about this dataset and how it can be used:

Song, W., Ye, W., Fournier-Viger, P. (2022). Mining sequential patterns with flexible constraints from MOOC data. Applied Intelligence

Synthetic datasets

To generate synthetic sequence databases, you can use the sequence database generator provided in SPMF (see this example and this example in the documentation to know how to use it), which is flexible and easy to use.

Here are some synthetic sequence databases generated with the IBM Quest Dataset Generator, converted to the SPMF format:

Dataset name	Description	Sequence count	Item count	Average number of itemsets per sequence	Average number of distinct items per sequence
data.slen_10.tlen_1.seq.patlen_2.lit.patlen_8.nitems_5000_spmf.txt	Synthetic data	50,351	41,911	1.63	13.23
data.slen_10.tlen_1.seq.patlen_3.lit.patlen_8.nitems_5000_spmf.txt	Synthetic data	47,785	53,137	1.93	15.63
data.slen_10.tlen_1.seq.patlen_4.lit.patlen_8.nitems_5000_spmf.txt	Synthetic data	47,556	62,296	2.24	17.96
data.slen_10.tlen_1.seq.patlen_5.lit.patlen_8.nitems_5000_spmf.txt	Synthetic data	47,988	69,686	2.56	20.49
data.slen_10.tlen_1.seq.patlen_6.lit.patlen_8.nitems_5000_spmf.txt	Synthetic data	48,467	75,476	2.84	22.7
data.slen_8.tlen_1.seq.patlen_2.lit.patlen_8.nitems_5000_spmf.txt	Synthetic data	45,535	41,270	1.52	12.30
data.slen_8.tlen_1.seq.patlen_3.lit.patlen_8.nitems_5000_spmf.txt	Synthetic data	45,452	52,551	1.80	14.54
data.slen_8.tlen_1.seq.patlen_4.lit.patlen_8.nitems_5000_spmf.txt	Synthetic data	46,131	61,137	2.09	16.71
data.slen_8.tlen_1.seq.patlen_5.lit.patlen_8.nitems_5000_spmf.txt	Synthetic data	47,133	68,240	2.36	18.82

It is also possible to generate sequence databases by using the IBM Generator. Files generated by the IBM Generator can be converted to the SPMF format by using the conversion tool provided in SPMF (see this example in the documentation for how to convert files).

Another alternative for generating synthetic sequence databases is to use a Matlab program provided by Ashwin Balani.

Datasets that contain time-interval sequences

There are also several sequence datasets where events are described using time intervals (each event has a start time and an end time, and thus a duration). Those datasets should be used with specialized algorithms for time-interval related pattern mining such as FastTIRP and VertTIRP. The datasets were obtained from the public repository of GitHub user @omerh18 and converted to the SPMF format to make them easy to use with SPMF.

Dataset name	Sequence count	Event type count	Avg. number of time intervals per sequence
asl_spmf.csv	65	146	31.32
aslgt_spmf.csv	1751	47	41.26
auslan2_spmf.csv	200	12	4.49
blocks_spmf.csv	210	8	5.74
context_spmf.csv	240	54	53.81
ct1_spmf.csv	1060	49	42.68
ct2_spmf.csv	576	64	307.14
diabetes_spmf.csv	2038	35	39.52
hepatitis_spmf.csv	498	63	96.44
pioneer_spmf.csv	160	92	30.51
skating_spmf.csv	530	41	35.68
smarthome_spmf.csv	89	92	260.81
st1_spmf.csv	2746	1299	59.53
st2_spmf.csv	1805	1299	530.44

Datasets for Frequent Itemset mining / Association Rule Mining / Periodic pattern mining / Frequent episode mining

Datasets in SPMF format

These datasets can be directly used in SPMF:

Dataset name	Description	Transaction count	Item count (I)	Average item count per transaction (A)	Density (%) (A / I ) * 100	Has item names?
retail	customer transactions from an anonymous Belgian retail store. (source FIMI: http://fimi.ua.ac.be/data/)	88,162	16470	10.30	0.06 %	No
mushrooms	prepared based on the UCI mushrooms dataset (source: FIMI)	8,416	119	23	19.33 %	No
pumsb	census data for population and housing (source: FIMI)	49,046	2113	74	3.50 %	No
chess	prepared based on the UCI chess dataset (source: FIMI)	3,196	75	37	49.33 %	No
connect	prepared based on the UCI connect-4 dataset (source: FIMI)	67,557	129	43	33.33 %	No
accidents	anonymized traffic accident data (source: FIMI)	340,183	468	33.8	7.22 %	No
BMS_WebView_1	click-stream data from a webstore used in KDD-Cup 2000 (note: this version of the dataset has been prepared for itemset mining / association rule mining) (source: FIMI)	59,602	497	2.51	0.51 %	No
BMS_WebView_2	click-stream data from a webstore used in KDD-Cup 2000 (note: this version of the dataset has been prepared for itemset mining / association rule mining) (source: FIMI)	77,512	3340	4.62	0.14 %	No
Kosarak	transactions from click-stream data from a Hungarian news portal. (source: FIMI)	990,002	41270	8.1	0.02 %	No
chainstore	customer transactions from a major grocery store chain in California, USA	1,112,949	46086	7.23	0.02 %	No
foodmart	customer transactions from a retail store, obtained and transformed from SQL-Server 2000	4,141	1559	4.42	0.28 %	No
Fruithut	This is a dataset of customer transactions from a US retail store focusing on selling fruits. The dataset contains 181,970 transactions and 1,265 different items. The largest transaction contains 36 items, while on average a customer purchases 3.58 items per transaction. You can also download the taxonomy of items for this dataset, which can be useful for cross-level or multi-level itemset mining. The taxonomy is provided in a separate file where each line indicates that an item is a specialization of another item (a product belongs to a product category). An item can be a product or a category. The taxonomy has 4 levels and 43 categories. See the paper of Ying Wang et al. (2020) for more details. The data was obtained from Kaggle and then transformed to the SPMF format by Ying Wang et al. for her paper about mining cross-level high utility itemsets (2020).	181,970	1265	3.58	0.28 %	Yes, item names are included in the file
Liquor_11	This is a dataset of 9,284 customer transactions from US liquor stores in the state of Iowa. You can also download the taxonomy of items for this dataset, which can be useful for cross-level or multi-level itemset mining. The taxonomy is provided in a separate file where each line indicates that an item is a specialization of another item (a product belongs to a product category). An item can be a product or a category. The taxonomy has 7 levels and 77 categories. See the paper of Ying Wang et al. (2020) for more details. The data was obtained from the internet and then transformed to the SPMF format by Ying Wang et al. for her paper about mining cross-level high utility itemsets (2020). In particular, the taxonomy was created by applying some NLP to the item names. The dataset contains only transactions with no more than 11 items. If you want all transactions with no more than 15 items, you can download liquor_15.txt, or if you want no more than 5 items, you can download liquor_5.txt.	9,284	2626	2.7	0.10 %	Yes (temporarily not available)
kddcup99	This dataset is transformed from the KDD CUP 1999 dataset, found at https://archive.ics.uci.edu/ml/publicdatasets/KDD+Cup+1999+Data	1,000,000	135	16	11.85 %	Yes, kddcup99Attributes.xlsx
OnlineRetail	This dataset is transformed from the Online Retail dataset, found at https://archive.ics.uci.edu/ml/publicdatasets/Online+Retail. Converted by Zhongjie Zhang. Note: there is another dataset also called OnlineRetail in some papers, which is in fact the ECommerce dataset.	541,909	2603	4.37	0.17 %	Yes, OnlineRetailAttributes.xlsx.
PAMP	This dataset is transformed from the PAMAP2 dataset, obtained from https://archive.ics.uci.edu/ml/publicdatasets/PAMAP2+Physical+Activity+Monitoring Note: In the original data, a few lines contained NaN values. These values have been removed. Converted by Zhongjie Zhang.	1,000,000	141	23.93	16.97 %	Yes, PAMPAttributes.xlsx
Skin	This dataset is transformed from the dataset Skin Segmentation, obtained from https://archive.ics.uci.edu/ml/publicdatasets/Skin+Segmentation Converted by Zhongjie Zhang.	245,057	11	4.0	36.36 %	Yes, SkinAttributes.xlsx
USCensus	This dataset is transformed from the US Census 1990 dataset, obtained from https://archive.ics.uci.edu/ml/publicdatasets/US+Census+Data+(1990) Converted by Zhongjie Zhang.	1,000,000	396	68.0	17.17 %	No
RecordLink	This dataset is transformed from http://archive.ics.uci.edu/ml/publicdatasets/Record+Linkage+Comparison+Patterns. Converted by Zhongjie Zhang.	574,913	29	10	34.48 %	Yes RecordLinkAttribute.xslx
PowerC	The dataset is about household electric power consumption. It was prepared using the original dataset from https://archive.ics.uci.edu/ml/publicdatasets/Individual+household+electric+power+consumption. The dataset contains instances, which were transformed to transactions, where value fields of the 3rd attribute to the 9th attribute are divided in 10 equal parts and every part is represented by a number. As a result, the dataset has 140 items. Converted by Zhongjie Zhang.	1,040,000	140	7	5.00 %	No
Susy	This dataset is related to physics. It is related to particles detected using a particle accelerator. The dataset contains instances prepared using the original dataset obtained from http://archive.ics.uci.edu/ml/publicdatasets/SUSY. The instances were transformed to transactions, such that the value field of every attribute is divided in 10 equal parts and every part is represented by a number. As a result, the dataset has 190 items. Converted by Zhongjie Zhang.	5,000,000	190	19	10.00 %	No
Chicago_Crimes_2001_to_2017_FIM	The dataset was converted from the dataset 'Crimes in Chicago' (https://www.kaggle.com/publicdatasets/currie32/crimes-in-chicago) by Zhongjie Zhang. The dataset records the crimes that have occurred in Chicago from 2001 to 2017. Every transaction corresponds to a <month, area> pair. A transaction describes the crimes that have occurred in a specific area during a specific month. Items represent the types of crimes, while the utility of an item in a transaction indicates the number of occurrences of that crime. For a description of the types of crimes, see the file describing the item names of that dataset.	2,662,309	35	1.795	5.13%	Yes, the description of items is here
Yoo-choose-buy-FIM	This dataset was obtained from the RecSys2015 challenge and converted to the SPMF format so that it can be used with SPMF. The dataset contains transactions from customers who have purchased items in an online store. Each transaction corresponds to a customer and indicates the products that have been purchased. Note on the conversion: items having a quantity of 0 or a price of 0 were eliminated.	234,300	16,004	2.165	0.01%	No
instacart_priorFIM instacart_trainFIM	Those are two customer transaction datasets obtained by transforming the data from the instacart competition that was held on Kaggle in 2017. They can directly be used for frequent itemset mining and association rule mining. To help interpret the results, the product names are provided in another file: here. In addition to these two files, there is also a file taxonomy.txt that provides a taxonomy of all items for this dataset. A taxonomy organizes the items into categories and subcategories. This taxonomy file contains several lines, where each line has the form X,Y indicating that an item or a category X is contained in a category Y. The category names for this dataset are provided in another file: instacart_category_names.txt	3,214,874 131,209	49,677 39,123	10.08 10.55	0.002 % 0.002 %	Yes
TH1 (compressed in 7z format)	A large dataset of customer transactions from grocery store(s). It was obtained from the public dataset TH1 on kaggle from @alexkever and transformed to the SPMF format, with some modifications such as to sort items in transactions, and remove duplicate items. The names of products are not included.	2,654,710	828	10.99	1.32%	No
TH2 (compressed in 7z format)	A very large dataset of customer transactions from grocery store(s). It was obtained from the public dataset TH2 on kaggle from @alexkever and transformed to the SPMF format, with some modifications such as to sort items in transactions, and remove duplicate items. The names of products are not included.	26,496,645	836	10.96	1.31%	No
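
The taxonomy files mentioned in the table above (lines of the form X,Y, meaning that the item or category X is contained in the category Y) are easy to load for cross-level or multi-level mining. Here is a minimal Python sketch, assuming that each item or category has at most one parent and that identifiers can be kept as plain strings. The function names load_taxonomy and ancestors are hypothetical and are not part of SPMF.

```python
def load_taxonomy(lines):
    """Read 'child,parent' lines into a dictionary mapping each child to its parent."""
    parent = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        child, par = line.split(",")
        parent[child.strip()] = par.strip()
    return parent


def ancestors(item, parent):
    """Return the categories that contain the item, from most specific to most general."""
    chain = []
    seen = {item}
    while item in parent:
        item = parent[item]
        if item in seen:  # stop if the file accidentally contains a cycle
            break
        seen.add(item)
        chain.append(item)
    return chain


taxonomy = load_taxonomy(["1,100", "2,100", "100,200"])
print(ancestors("1", taxonomy))  # ['100', '200']
```

With a real file, load_taxonomy can be given an open file object instead of a list of strings.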

Synthetic datasets

Here are a few synthetic datasets in SPMF format:

Dataset name	Description	Transaction count	Item count (I)	Average item count per transaction (A)	Density (A / I ) * 100
c20d10k	synthetic dataset	10,000	192	20	10.42 %
c73d10k	synthetic dataset	10,000	1592	73	4.59 %
t25i10d10k	synthetic dataset	9,976	929	24.77	2.67 %
t20i6d100k	synthetic dataset	99,922	893	19.90	2.23 %
T10I4D100K	synthetic dataset	100,000	870	10.10	1.16 %
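
The density column in the tables above is defined as (A / I) * 100, where A is the average item count per transaction and I is the number of distinct items. For example, for the retail dataset, 10.30 / 16470 * 100 is approximately 0.06 %. The following minimal Python sketch computes these statistics for a transaction database, assuming the plain SPMF transaction format (one transaction per line, items separated by single spaces) and assuming that lines starting with #, % or @ are comments or metadata. The function name transaction_stats is hypothetical.

```python
def transaction_stats(lines):
    """Return (transaction count, item count I, average length A, density in percent)."""
    n_transactions = 0
    total_items = 0
    distinct_items = set()
    for line in lines:
        line = line.strip()
        # Skip blank lines and comment/metadata lines (assumed prefixes).
        if not line or line[0] in "#%@":
            continue
        items = line.split()
        n_transactions += 1
        total_items += len(items)
        distinct_items.update(items)
    if n_transactions == 0:
        return 0, 0, 0.0, 0.0
    average = total_items / n_transactions
    density = average / len(distinct_items) * 100
    return n_transactions, len(distinct_items), average, density


sample = ["1 2", "3 4", "1 4"]
print(transaction_stats(sample))  # (3, 4, 2.0, 50.0)
```

A dense dataset such as chess (49.33 %) tends to produce many long patterns, while a sparse dataset such as retail (0.06 %) behaves very differently, which is why both kinds are commonly used when comparing algorithms.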

It is possible to generate synthetic transaction databases by using the random transaction database generator provided in SPMF (see this example in the documentation to know how to do it), which is flexible and easy to use.

Moreover, you can download the following synthetic datasets often used in the data mining literature, generated by the IBM Generator. The names follow this convention: D: number of sequences in the dataset, C: average number of itemsets per sequence, T: average number of items per itemset, I: average size of itemsets in potentially frequent sequences.

Another alternative for generating synthetic transaction databases is to use a Matlab program provided by Ashwin Balani.

Real-life datasets in SPMF format, having timestamps

The datasets containing timestamps can be directly used in SPMF with algorithms that accept timestamps:

Dataset name	Description	Transaction count	Item count (I)	Average item count per transaction (A)	Density (A / I ) * 100	Has item names?
ECommerce_time_without_utility	customer transactions occurring between 01/12/2010 and 09/12/2011 of a UK-based and registered non-store online retail (this is the version with timestamps ). If you want to know the meaning of the items in this dataset, the list of item names is available. The data was prepared by Yimin Zhang using public data.	14,975	3,468	11.71	0.34 %	Yes list of item names

Real-life datasets in ARFF format

The GUI and command line interface of SPMF version 0.93d and higher can read ARFF (Attribute-Relation File Format) files. You can download a collection of 36 real-life datasets in ARFF format from the UCI machine learning repository. These files can be used with all association rule mining and itemset mining algorithms that take a transaction database as input. Note that the MainTest... examples in the source code have not been adapted to read ARFF files yet (for now, ARFF files are only accepted by the GUI and command line interface). Most features of the ARFF format are supported, except that (1) the character = is forbidden and (2) escape characters are not considered. Note that when the ARFF format is used, the data mining algorithms will run slightly slower than with the native SPMF file format, because the input file is automatically converted before the algorithm is launched and the result also has to be converted. This cost should however be small. The specification of the ARFF format can be found here.
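
To illustrate the kind of conversion involved, here is a minimal Python sketch that turns a simple dense ARFF file into a transaction database. This is not the converter used by SPMF. It assumes, as an illustration only, that each attribute=value pair becomes one item with an integer identifier, that attribute names and values are unquoted and contain no commas, and that ? marks a missing value. The function name arff_to_transactions is hypothetical.

```python
def arff_to_transactions(lines):
    """Map each 'attribute=value' pair of a dense ARFF file to an integer item."""
    attributes = []   # attribute names, in declaration order
    item_ids = {}     # 'attribute=value' -> integer item identifier
    transactions = []
    in_data = False
    for line in lines:
        line = line.strip()
        if not line or line.startswith("%"):   # blank line or ARFF comment
            continue
        lower = line.lower()
        if lower.startswith("@attribute"):
            attributes.append(line.split()[1])
        elif lower.startswith("@data"):
            in_data = True
        elif in_data:
            values = [v.strip() for v in line.split(",")]
            items = []
            for name, value in zip(attributes, values):
                if value == "?":               # missing value: no item
                    continue
                key = f"{name}={value}"
                items.append(item_ids.setdefault(key, len(item_ids) + 1))
            transactions.append(sorted(items))
    return transactions, item_ids


arff = [
    "@relation weather",
    "@attribute outlook {sunny,rainy}",
    "@attribute windy {true,false}",
    "@data",
    "sunny,true",
    "rainy,true",
    "sunny,?",
]
transactions, item_ids = arff_to_transactions(arff)
print(transactions)  # [[1, 2], [2, 3], [1]]
```

Under this assumed encoding, the character = separates attribute names from values in item names, which would be one plausible reason for forbidding it in the input.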

Datasets for High-Utility Pattern Mining

The following transaction databases can be used with high-utility itemset mining algorithms such as EFIM, FHM, HUI-Miner, IHUP, UPGrowth and Two-Phase. For information about the format of these files, see the examples in the documentation of EFIM, FHM, HUI-Miner or Two-Phase.

Real-life transaction datasets in SPMF format with real utility values

The following are real-life customer transaction datasets with real utility values:

Dataset name	Description	Transaction count	Item count (I)	Average item count per transaction (A)	Density (%) (A / I ) * 100	Has real utility values?	Has item names?
foodmart_utility	dataset of customer transactions from a retail store, obtained and transformed from SQL-Server 2000	4,141	1559	4.42	0.28 %	Yes	No
chainstore_utility	dataset of customer transactions from a major grocery store chain in California, USA, containing 1,112,949 transactions and 46,086 items, obtained and transformed from NU-Mine Bench. The original data is available here (not in SPMF format)	1,112,949	46086	7.23	0.02 %	Yes	No
ECommerce_retail_utility_no_timestamps	transactional data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 of a UK-based and registered non-store online retail (this is the version with real utility values but without timestamps). The data was prepared by Yimin Zhang using public data for his paper about peak high utility itemset mining.	14,975	3468	11.71	0.34 %	Yes	Yes list of item names
Fruithut_utility	This is a dataset of customer transactions from a US retail store focusing on selling fruits. The dataset contains 181,970 transactions and 1,265 different items. The largest transaction contains 36 items, while on average a customer purchases 3.58 items per transaction. You can also download the taxonomy of items for this dataset, which can be useful for cross-level or multi-level itemset mining. The taxonomy is provided in a separate file where each line indicates that an item is a specialization of another item (a product belongs to a product category). An item can be a product or a category. The taxonomy has 4 levels and 43 categories. See the paper of Ying Wang et al. (2020) for more details. The data was obtained from Kaggle and then transformed to the SPMF format by Ying Wang et al. for her paper about mining cross-level high utility itemsets (2020).	181,970	1265	3.58	0.28 %	Yes	Yes, item names are included in the file
Liquor_11	This is a dataset of 9,284 customer transactions from US liquor stores in the state of Iowa. You can also download the taxonomy of items for this dataset, which can be useful for cross-level or multi-level itemset mining. The taxonomy is provided in a separate file where each line indicates that an item is a specialization of another item (a product belongs to a product category). An item can be a product or a category. The taxonomy has 7 levels and 77 categories. See the paper of Ying Wang et al. (2020) for more details. The data was obtained from the internet and then transformed to the SPMF format by Ying Wang et al. for her paper about mining cross-level high utility itemsets (2020). In particular, the taxonomy was created by applying some NLP to the item names. The dataset contains only transactions with no more than 11 items. If you want all transactions with no more than 15 items, you can download liquor_15.txt, or if you want no more than 5 items, you can download liquor_5.txt.	9,284	2626	2.7	0.10 %	Yes	Yes (but temporarily not available)
Chicago_Crimes_2001_to_2017_utility	The dataset was converted from the dataset 'Crimes in Chicago' (https://www.kaggle.com/publicdatasets/currie32/crimes-in-chicago) by Zhongjie Zhang. The dataset records the crimes that have occurred in Chicago from 2001 to 2017. Every transaction corresponds to a <month, area> pair. A transaction describes the crimes that have occurred in a specific area during a specific month. Items represent the types of crimes, while the utility of an item in a transaction indicates the number of occurrences of that crime. For a description of the types of crimes, see the file describing the item names of that dataset.	2,662,309	35	1.795	5.13%	Yes	Yes, the description of items is here
Yoo-choose-buy-Utility	This dataset was obtained from the RecSys2015 challenge and converted to the SPMF format so that it can be used with SPMF. The dataset contains transactions from customers who have purchased items in an online store. Each transaction corresponds to a customer and indicates the products that have been purchased. Note on the conversion: items having a quantity of 0 or a price of 0 were eliminated.	234,300	16,004	2.165	0.01%	Yes	No

Real-life transaction datasets in SPMF format having synthetic (fake) utility values

These datasets are real datasets but with synthetic utility values. The internal utility values have been generated using a uniform distribution in [1, 10].

Dataset name	Description	Transaction count	Item count (I)	Average item count per transaction (A)	Density (%) (A / I ) * 100	Has real utility values?	Has item names?
retail_utility	customer transactions from an anonymous Belgian retail store. (source FIMI: http://fimi.ua.ac.be/data/)	88,162	16470	10,30	0.06 %	No (synthetic)	No
mushroom_utility	prepared based on the UCI mushrooms dataset (source: FIMI)	8,416	119	23	19.33 %	No (synthetic)	No
pumsb_utility	census data for population and housing (source: FIMI)	49,046	2113	74	3.50 %	No (synthetic)	No
chess_utility	prepared based on the UCI chess dataset (source: FIMI)	3,196	75	37	49.33 %	No (synthetic)	No
connect_utilitty	prepared based on the UCI connect-4 dataset (source: FIMI)	67,557	129	43	33.33 %	No (synthetic)	No
accidents_utility	anonymized traffic accident data (source: FIMI)	340,183	468	33.8	7.22 %	No (synthetic)	No
kosarak_utility	transactions from click-stream data from a Hungarian news portal. (source: FIMI)	990,002	41270	8.1	0.02 %	No (synthetic)	No
BMS_utilitty	click-stream data from a webstore used in KDD-Cup 2000 (note: this version of the dataset has been prepared for high utility itemset mining) (source: FIMI)	77,512	3340	4.62	0.14 %	No (synthetic)	No
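The density column in these tables follows the formula given in the header, (A / I) * 100, where A is the average item count per transaction and I is the item count. The short Python sketch below recomputes it for three rows of the table above; the function name density_percent is only illustrative.

```python
def density_percent(avg_items_per_transaction, item_count):
    """Density in percent, as defined in the table header: (A / I) * 100."""
    return avg_items_per_transaction / item_count * 100


# (A, I) pairs copied from the table above.
rows = {
    "mushroom_utility": (23, 119),
    "chess_utility": (37, 75),
    "accidents_utility": (33.8, 468),
}

for name, (a, i) in rows.items():
    print(f"{name}: {density_percent(a, i):.2f} %")
# mushroom_utility: 19.33 %
# chess_utility: 49.33 %
# accidents_utility: 7.22 %
```

A low density (such as 0.02 % for kosarak) means that each transaction contains only a tiny fraction of the available items, which is typical of sparse click-stream or retail data.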

Datasets in SPMF format having utility values and timestamps

Here are a few transaction databases in SPMF format for high utility itemset mining that contain timestamps. These datasets were prepared by Yimin Zhang for the paper in Information Sciences about the LHUI-Miner and PHUI-Miner algorithms (Fournier-Viger et al., 2019). These datasets were used for discovering local high utility itemsets and peak high utility itemsets, but can be used for other tasks such as recent high utility pattern mining.

Dataset name	Description	Transaction count	Item count (I)	Average item count per transaction (A)	Density (%) (A / I ) * 100	Has real utility values?	Has real timestamps?	Has item names?
ECommerce_retail_utility_timestamps	transactional dataset which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer (this is the version with real utility values and with timestamps). The data was prepared by Yimin Zhang using public data for his paper about peak high utility itemset mining.	14,975	3468	11.71	0.34 %	Yes	Yes	Yes list of item names
Fruithut_utility_timestamps	This is a dataset of customer transactions from a US retail store focusing on selling fruits. The dataset contains 181,970 transactions and 1,265 different items. The largest transaction contains 36 items, while on average a customer purchases 3.58 items per transaction. You can also download the taxonomy of items for this dataset, which can be useful for cross-level or multi-level itemset mining. The taxonomy is provided in a separate file where each line indicates that an item is a specialization of another item (a product belongs to a product category). An item can be a product or a category. The taxonomy has 4 levels and 43 categories. See the paper of Ying Wang et al. above for more details. The data was obtained from Kaggle and then transformed to the SPMF format by Ying Wang et al. for her paper about mining cross-level high utility itemsets (2020).	181,970	1265	3.58	0.28 %	Yes	Yes	Yes, item names are included in the file
foodmart_utility_timestamp	dataset of customer transactions from a retail store, obtained and transformed from SQL-Server 2000	4,141	1,559	4.42	0.28 %	Yes	No (synthetic)	No
retail_utility_timestamp	customer transactions from an anonymous Belgian retail store. (source FIMI: http://fimi.ua.ac.be/data/)	88,162	16,470	10.30	0.06 %	No (synthetic)	No (synthetic)	No
kosarak_utility_timestamp	transactions from click-stream data from a Hungarian news portal. (source: FIMI)	990,002	41,270	8.1	0.02 %	No (synthetic)	No (synthetic)	No
mushroom_utility_timestamp	prepared based on the UCI mushrooms dataset (source: FIMI)	8,416	119	23	19.33 %	No (synthetic)	No (synthetic)	No

Datasets for high utility itemset mining with negative unit profit values

Here are a few transaction databases in SPMF format for high-utility itemset mining with negative unit profit values. Those datasets have been generated to be used in the FHN paper (Fournier-Viger et al., 2014) and can be used with the FHN and HUINIV-Mine algorithms. See the FHN paper for details.

Dataset name	Description	Transaction count	Item count (I)	Average item count per transaction (A)	Density (%) (A / I ) * 100	Has real utility values?	Has item names?
retail_negative	customer transactions from an anonymous Belgian retail store. (source FIMI: http://fimi.ua.ac.be/data/)	88,162	16470	10.30	0.06 %	No (synthetic)	No
mushroom_negative	prepared based on the UCI mushrooms dataset (source: FIMI)	8,416	119	23	19.33 %	No (synthetic)	No
pumsb_negative	census data for population and housing (source: FIMI)	49,046	2113	74	3.50 %	No (synthetic)	No
chess_negative	prepared based on the UCI chess dataset (source: FIMI)	3,196	75	37	49.33 %	No (synthetic)	No
accidents_negative	anonymized traffic accident data (source: FIMI)	340,183	468	33.8	7.22 %	No (synthetic)	No
kosarak_negative	transactions from click-stream data from a Hungarian news portal. (source: FIMI)	990,002	41270	8.1	0.02 %	No (synthetic)	No

Datasets for high utility quantitative itemset mining

The following datasets have been used for the FHUQI-Miner paper (Nouioua et al., 2021) about high utility quantitative itemset mining, and can also be used with the VHUQI algorithm. For these algorithms, two files are provided for each dataset: the first file contains transactions with quantities for each item (e.g. retail.txt). The second file contains the unit profit or weight of each item (e.g. retail1profit.txt). For more details about the format, please see the documentation page about FHUQI-Miner or VHUQI.

Dataset name	Description	Transaction count	Item count (I)	Average item count per transaction (A)	Density (%) (A / I ) * 100	Has real utility and quantity values?	Has item names?
retail.txt retailf1profit.txt	customer transactions from an anonymous Belgian retail store. (note: this version of the dataset has been adapted for high utility quantitative itemset mining)	88162	16470	10.30	0.06 %	No (synthetic)	No
pumsb.txt pumsbf1profit.txt	census data for population and housing (note: this version of the dataset has been adapted for high utility quantitative itemset mining)	49046	2113	74	3.50 %	No (synthetic)	No
connect.txt connect1profit.txt	prepared based on the UCI connect-4 dataset (note: this version of the dataset has been adapted for high utility quantitative itemset mining)	67557	129	43	33.33 %	No (synthetic)	No
bms.txt bmsf1profit.txt	click-stream data from a webstore used in KDD-Cup 2000 (note: this version of the dataset has been adapted for high utility quantitative itemset mining)	59601	497	2.42	0.51 %	No (synthetic)	No
bms2.txt bmsf2profit.txt	click-stream data from a webstore used in KDD-Cup 2000 (note: this version of the dataset has been adapted for high utility quantitative itemset mining)	77512	3340	4.62	0.14 %	No (synthetic)	No
foodmart foodmartf1profit.txt	dataset of customer transactions from a retail store, obtained and transformed from SQL-Server 2000 (note: this version of the dataset has been adapted for high utility quantitative itemset mining)	4141	1559	4.42	0.28 %	Yes	No

Datasets for on-shelf high-utility itemset mining with negative unit profit values

Here are a few transaction databases in SPMF format for on-shelf high-utility itemset mining with negative unit profit values. Those datasets have been generated to be used in the FOSHU paper (Fournier-Viger et al., 2015) and can be used with the FOSHU and TS-HOUN algorithms. See the FOSHU paper for details.

kosarak (5 periods, 25 periods, 50 periods)
accidents (5 periods, 25 periods, 50 periods)
chess (5 periods, 25 periods, 50 periods)
mushroom (5 periods, 25 periods, 50 periods)
pumsb (5 periods, 25 periods, 50 periods)
retail (5 periods, 25 periods, 50 periods)

Datasets for high-utility sequential rule mining or high-utility sequential pattern mining

Here are some sequence databases in SPMF format for high-utility sequential rule mining or high-utility sequential pattern mining. Those datasets were generated for the HUSRM paper (Zida et al., 2015), and can be used with the HUSRM and USpan algorithms. See the HUSRM paper for more information.

Dataset name	Description	Sequence count	Item count	Average sequence length	Has item names?
BMS_sequence_utility	This dataset was used in KDD CUP 2000. It contains clickstream data from an e-commerce webstore.	77,512	3,340	4.62	No
Kosarak10k_sequence_utility	This is a subset of 10,000 sequences of the Kosarak click-stream data from a Hungarian news portal. The dataset was converted to SPMF format using the original data from: http://fimi.ua.ac.be/data/. (2020-1-10: the kosarak dataset file has been updated as someone informed me that some sequences were missing)	10,000			No
SIGN_sequence_utility	A dataset of sign language utterances containing approximately 800 sequences. The original dataset file, in another format, can be obtained here with more details on this dataset.	~800	267	51.997	No
Bible_sequence_utility	This dataset is a conversion of the Bible into a sequence database (each word is an item).	36,369	13,905	21.6	Yes Bible_with_items
FIFA_sequence_utility	Click stream data from the website of FIFA World Cup 98	20,450	2,990	34.74	No

Datasets for Cost-effective Pattern Mining (with utility and cost)

In the paper by Fournier-Viger et al. (2020), it was proposed to find cost-effective patterns in sequences with utility and cost information. This type of data is a set of sequences where each sequence is an ordered list of activities, and each activity has a cost value representing the amount of resources (e.g. time, money) spent to perform the activity. Moreover, each sequence has a utility label that can be either binary or numeric, which represents a positive or negative outcome. This type of data can represent, for example, how students use an e-learning system. In that case, each sequence is a list of activities or sessions performed by a student, and the cost represents the time spent studying for each activity. The utility represents the outcome of a sequence, such as the final exam score of a student after performing the activities (numeric utility), or whether the student passed or failed the course or exam (binary utility). Another example of cost-utility sequences is hospital data, where a sequence is a list of medicines taken by a patient, the cost indicates how much money or time was spent on these medical treatments, and the utility is whether the patient was cured or died.

From cost/utility sequences, we can then find patterns that have a low cost but typically lead to a high utility (called low-cost high utility patterns or cost-efficient patterns) by applying algorithms such as CEPB, CEPN and CorCEPB. In the above examples, some cost-effective patterns may be that studying some e-learning materials A and B typically requires little time but leads to high scores, or that taking some given medicines has a low cost but a high probability of curing a disease.
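To make the idea more concrete, here is a small Python sketch that scores one candidate pattern on a toy database of cost/utility sequences with binary utility. It is a simplified illustration only and does not reproduce the measures or the search procedure of CEPB, CEPN or CorCEPB (see the paper for those). The toy data, the function names match_cost and evaluate, and the choice of summing the costs of the first in-order occurrence are all assumptions made for this example.

```python
def match_cost(pattern, sequence):
    """Cost of the first in-order occurrence of `pattern` in `sequence`.

    `sequence` is a list of (activity, cost) pairs. Returns None when the
    pattern does not occur as a subsequence.
    """
    cost = 0.0
    position = 0  # index of the next pattern element to match
    for activity, activity_cost in sequence:
        if position < len(pattern) and activity == pattern[position]:
            cost += activity_cost
            position += 1
    return cost if position == len(pattern) else None


def evaluate(pattern, database):
    """Support, average cost and rate of positive outcomes for a pattern."""
    costs, outcomes = [], []
    for sequence, utility in database:
        cost = match_cost(pattern, sequence)
        if cost is not None:
            costs.append(cost)
            outcomes.append(utility)
    if not costs:
        return None
    return {
        "support": len(costs),
        "avg_cost": sum(costs) / len(costs),
        "positive_rate": sum(outcomes) / len(outcomes),
    }


# Hypothetical toy data: (list of (activity, cost), binary utility).
database = [
    ([("A", 10), ("B", 5), ("C", 30)], 1),
    ([("A", 20), ("C", 40)], 0),
    ([("B", 15), ("A", 10), ("B", 5)], 1),
]

stats = evaluate(["A", "B"], database)
print(stats)
```

In this toy database, the pattern A followed by B occurs in two of the three sequences, at a low cost and always with a positive outcome, which is the kind of pattern described above.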

Here are some datasets:

Dataset name	Description	Sequence count	Type of utility
allSessions_binary.txt	This is an e-learning dataset, where each sequence represents a student using an e-learning platform named Deeds, and there are 62 students. A sequence contains a list of sessions taken by a student. There are six different sessions denoted as 1, 2, 3, 4, 5 and 6. In a sequence, each session has a cost value that represents the time spent by a student on the session for studying or doing some learning activities. Each sequence has a binary utility value, which indicates the outcome of the sequence. Here the utility value indicates if a student has failed or passed the final exam after doing the learning sessions. This dataset can be used with the CEPB and CorCEPB algorithms. Note: This dataset was created by converting public data from Vahdat et al. to the SPMF format. A threshold of 60% was assumed to be the score for passing the final exam.	62	Binary
session6_numeric.txt	This dataset is also an e-learning dataset from students using an e-learning platform named Deeds. This dataset contains 50 sequences. Each sequence indicates the list of activities done by a student during a learning session called SESSION 6. Each activity has a cost value representing the amount of time that the student has spent doing the activity. Moreover, each sequence has a numeric utility value indicating the score that the student had at the test at the end of SESSION 6. This dataset can be used with the CEPN algorithm. Note: This dataset was created by converting public data from Vahdat et al. to the SPMF format. Note that this file only contains SESSION 6.	50	Numeric
session6_binary.txt	This is the same as above, except that the utility has been converted to binary values instead of numeric values. This dataset can be used with the CEPB and CorCEPB algorithms.	50	Binary
session5_numeric.txt	This is similar to above, except that this is the data for SESSION 5 (numeric utility).	53	Numeric
session4_numeric.txt	This is similar to above, except that this is the data for SESSION 4 (numeric utility).	54	Numeric
session4_binary.txt	This is similar to above, except that this is the data for SESSION 4 (binary utility).	54	Binary

The original dataset was made public by Vahdat et al. and can be downloaded here but it is not in SPMF format. However, it gives a lot of details about the meaning of the data. This is very useful for understanding the meaning of the patterns found in the above data. This is the paper by Vahdat et al.:

M. Vahdat, L. Oneto, D. Anguita, M. Funk, M. Rauterberg.: A learning analytics approach to correlate the academic achievements of students with interaction data from an educational simulator. In: G. Conole et al. (eds.): EC-TEL 2015, LNCS 9307, pp. 352-366. Springer (2015).
DOI: 10.1007/978-3-319-24258-3_26

And this is the paper about cost-effective pattern mining with the CEPB, CEPN and CorCEPB algorithms, for which the datasets have been prepared for SPMF:

Fournier-Viger, P., Li, J., Lin, J. C., Chi, T. T., Kiran, R. U. (2020). Mining Cost-Effective Patterns in Event Logs. Knowledge-Based Systems (KBS), Elsevier

New: I have created some versions of the above datasets where sequences with cost values have been transformed into transactions with cost values. If you use those transformed transaction datasets please cite our paper "LCIM: Mining Low Cost and High Utility Itemsets" published in MIWAI 2022 by Nawaz et al., 2022.

Dataset name	Description	Transaction count	Type of utility
allSessions_binary_trans.txt	See above	62	Binary
session6_numeric_trans.txt	See above	50	Numeric
session6_binary_trans.txt	See above	50	Binary
session5_numeric_trans.txt	See above	53	Numeric
session4_numeric_trans.txt	See above	54	Numeric
session4_binary_trans.txt	See above	54	Binary

Datasets for graph pattern mining

Datasets for mining subgraphs in a graph database

Here are a few datasets that can be used to evaluate subgraph mining algorithms such as TKG, gSpan and cgSpan. Each dataset contains a graph database (multiple graphs).

Dataset name	Description	Graph count	Average node count per graph	Average edge count per graph	Vertex label count	Edge label count	Label file?
Chemical_340	a database of 340 graphs about chemistry	340	27.02	27.40	66	4	No
Coumpounds_422	a database of 422 graphs about compounds	422	39.61	42.31	21	4	No
Mutag	a dataset of nitro compounds labeled regarding whether they have a mutagenic effect on a bacterium	188	17.93	19.79	7	11	Yes
PTC	chemical compounds where their labels indicate carcinogenicity for male and female rats	344	25.55	25.96	19	1	Yes
NCI1	datasets of chemical compounds screened for activity against non-small cell lung cancer and ovarian cancer cell lines	4,110	29.87	32.30	37	3	Yes
NCI109	datasets of chemical compounds screened for activity against non-small cell lung cancer and ovarian cancer cell lines	4,127	29.68	32.13	38	3	Yes
Proteins	graph collection where nodes represent secondary structure elements and edges indicate neighborhood in the amino-acid sequence or in 3-dimension space	1,113	39.06	72.82	3	1	Yes
Enzymes	protein tertiary structures obtained from the BRENDA enzyme database	600	32.63	62.14	3	1	Yes
DandD	a dataset of protein structures where nodes are amino acids and edges indicate spatial closeness, which are classified into enzymes or non-enzymes	1,178	284.31	715.66	82	1	Yes
IMDB-B	a movie collaboration dataset that is collected from IMDB. Each graph is an ego-network where nodes represent actors/actresses and edges indicate if they appear in the same movie. Each graph is categorized into one of two genres (Action or Romance).	1,000	19.77	96.53	65	1	Yes
5newsgroup	social network dataset	4976	86.86	352.65	27,881	1	Yes
collab	social network dataset	5,000	74.49	2424.63	367	1	Yes
reddit_binary	social network dataset	2,000	429.61	497.75	565	1	Yes
reddit_multi_12k	social network dataset	11,929	391.41	456.89	909	1	Yes
reddit_multi_5k	social network dataset	4,999	508.51	594.87	733	1	Yes
webkb	social network dataset	4,167	77.80	318.15	4	7,770	Yes

The first two datasets were obtained from the Web and are probably the two most famous datasets in subgraph mining.

The other datasets were prepared by Dang Nguyen et al. based on data from various sources, were obtained from github.com/nphdang/gspan/, and were used in this paper: Dang Nguyen, Wei Luo, Tu Dinh Nguyen, Svetha Venkatesh, Dinh Phung (2018). Learning Graph Representation via Frequent Subgraphs. SDM 2018, San Diego, USA. SIAM, 306-314. The descriptions of these latter datasets in the above table were obtained from that paper too.

Note that some datasets have label files. Those label files are not used by the algorithms offered in SPMF. They were generated when the datasets were prepared. These files indicate the correspondence between labels in the original datasets and those of the transformed datasets offered on this page. These label files are offered here because they may be useful to some users.
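The statistics in the table above (graph count, average node and edge counts, number of distinct labels) can be recomputed from a graph database file with a few lines of code. The Python sketch below assumes the common gSpan-style text layout, where a line "t # N" starts graph N, "v M L" declares vertex M with label L, and "e P Q L" declares an edge between vertices P and Q with label L. This layout is an assumption here, so please check it against the SPMF documentation of the algorithm you use. The function name graph_database_stats and the sample data are made up for the example.

```python
def graph_database_stats(lines):
    """Compute simple statistics for a graph database in gSpan-style text form."""
    graph_count = node_total = edge_total = 0
    vertex_labels, edge_labels = set(), set()
    for raw in lines:
        parts = raw.split()
        if not parts:
            continue  # skip blank lines
        if parts[0] == "t":  # start of a new graph
            graph_count += 1
        elif parts[0] == "v":  # vertex line: v <id> <label>
            node_total += 1
            vertex_labels.add(parts[2])
        elif parts[0] == "e":  # edge line: e <v1> <v2> <label>
            edge_total += 1
            edge_labels.add(parts[3])
    divisor = graph_count or 1  # avoid dividing by zero on empty input
    return {
        "graph_count": graph_count,
        "avg_nodes": node_total / divisor,
        "avg_edges": edge_total / divisor,
        "vertex_label_count": len(vertex_labels),
        "edge_label_count": len(edge_labels),
    }


# Hypothetical two-graph database used only to exercise the function.
sample = """t # 0
v 0 10
v 1 11
e 0 1 20
t # 1
v 0 10
v 1 11
v 2 10
e 0 1 20
e 1 2 21
""".splitlines()

stats = graph_database_stats(sample)
print(stats)
```

To apply it to a real file, the lines can be read with open(path, encoding="utf-8") and passed to the same function, since it only needs an iterable of strings.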

Datasets for Mining Patterns in Dynamic Attributed Graphs

A dynamic attributed graph is a graph that changes over time and where vertices may have multiple numerical attributes. Some algorithms offered in SPMF, such as AER-Miner and TSEQMiner, are designed to discover interesting patterns in dynamic attributed graphs. The datasets used in the AER-Miner and TSEQMiner papers are available here: the datasets (ZIP, 5.0 MB) and some variations for scalability experiments (ZIP, 138 MB). Please see these papers for a description of these datasets.

Datasets for Clustering

A Matlab program provided by Ashwin Balani to generate synthetic clustering datasets.
