Datasets
The SPMF software natively uses text files as input. Small example input files for each algorithm are described in the SPMF documentation. These sample files can be downloaded from the download page (test_files.zip) for the release version of SPMF, and are included with the source code for the source code version of SPMF. However, these datasets are quite small. For this reason, this webpage provides larger datasets that can be used with SPMF and that are often used in the data mining literature for evaluating and comparing algorithm performance. Unless otherwise indicated, the datasets are in SPMF format.
The datasets are divided into the following categories:
- Datasets for Sequential Pattern Mining / Sequential Rule Mining / Sequence Prediction
- Datasets for Frequent Itemset mining / Association Rule Mining / Periodic pattern mining / Frequent episode mining
- Datasets for High-Utility Pattern Mining
- Real-life transaction datasets in SPMF format with real utility values
- Real-life transaction datasets in SPMF format having synthetic (fake) utility values
- Datasets in SPMF format having utility values and timestamps
- Datasets for high utility itemset mining with negative unit profit values
- Datasets for high utility quantitative itemset mining
- Datasets for on-shelf high-utility itemset mining with negative unit profit values
- Datasets for high-utility sequential rule mining or high-utility sequential pattern mining
- Datasets for Cost-effective Pattern Mining (with utility and cost)
- Graph datasets for graph pattern mining
- Datasets for clustering
Datasets for Sequential Pattern Mining / Sequential Rule Mining / Sequence Prediction
Real-life datasets in SPMF format
Dataset name | Description | Sequence count | Item count | Average sequence length | Has item names? |
BMSWebView1 (Gazelle) (KDD CUP 2000) | This dataset was used in KDD CUP 2000. It contains clickstream data from an e-commerce website. Some of its sequences are long: for example, 318 sequences contain more than 20 items. | 59,601 | 497 | 2.42 | No |
BMSWebView2 (Gazelle) (KDD CUP 2000) | This dataset was used in KDD CUP 2000. It contains clickstream data from an e-commerce website. | 77,512 | 3,340 | 4.62 | No |
Kosarak (a subset of only 10,000 sequences: here) | This is a very large dataset containing 990,000 sequences of click-stream data from a Hungarian news portal. The dataset was converted to SPMF format using the original data from: http://fimi.ua.ac.be/data/. (2020-1-10: the kosarak dataset file has been updated, as someone reported that some sequences were missing.) | 990,000 | 41,270 | 8.1 | No |
Sign | A dataset of sign language utterances containing approximately 800 sequences. The original dataset file in another format can be obtained here, with more details on this dataset. | ~800 | 267 | 51.997 | No |
Bible | This dataset is a conversion of the Bible into a sequence database (each word is an item). | 36,369 | 13,905 | 21.6 | Yes (Bible_with_items) |
Leviathan | This dataset is a conversion of the novel Leviathan by Thomas Hobbes (1651) into a sequence database (each word is an item). | 5,834 | 9,025 | 33.8 | Yes (Leviathan_with_items) |
MSNBC | A dataset of click-stream data from the MSNBC website, converted from original data from the UCI repository. The shortest sequences have been removed to keep only 31,790 sequences. If you need the full dataset, you can download it. | 989,818 | 17 | 13.23 | No |
FIFA | Click stream data from the website of FIFA World Cup 98 | 20,450 | 2,990 | 34.74 | No |
BIKE | This dataset contains sequences of locations where shared bikes were parked in a city. Each item represents a bike sharing station and each sequence indicates the different locations of a bike over time. The dataset was obtained from the Github of Andrea Tonon, and is a transformation of Kaggle data (https://www.kaggle.com/cityofLA/los-angeles-metro-bike-share-trip-data). | 21,078 | 67 | 7.27 | No |
MT745584 (COVID-19 genome sequence) | This is the nucleotide sequence of the MT745584 strain of the COVID-19 coronavirus, collected on 2020-07-13 in Bahrain. The dataset was obtained from a public database and converted to the SPMF format by M. Saqib Nawaz for his paper: Using Artificial Intelligence Techniques for COVID-19 Genome Analysis. More datasets related to this paper and code can be found on github. | 497 | 4 | 16 | Original data |
ProofSequences | A dataset containing sequences of proof steps for mathematical proofs. The dataset is described in the paper: Nawaz, M. S., Sun, M., Fournier-Viger, P. (2019). Proof Guidance in PVS with Sequential Pattern Mining. Proc. of 9th Intern. Conf. Fundamentals of Software Engineering (FSEN 2019), 15 pages, Springer, LNCS. | 35 | 13 | 12.34 | Yes |
E-Shop | Click-stream data for online shopping. The dataset contains sequences of clicks (clickstream) from an online store offering clothing for pregnant women. Notes: Dataset cleaned, pre-processed and transformed by Frederic Flouvat based on data from the UCI repository. A github project by F. Flouvat contains details and code about the transformation. This data is interesting for algorithm testing because it consists of sequences of itemsets rather than sequences of items. Average number of items per itemset: 9.0 | 24026 | 317 | 61.98 | Yes (item meanings in JSON format) |
MicroblogPCU | Datasets about spam in micro-blogs. Notes: Dataset cleaned, pre-processed and transformed by Frederic Flouvat based on data from the UCI repository. A github project by F. Flouvat contains details and code about the transformation. This data is interesting for algorithm testing because it consists of sequences of itemsets rather than sequences of items. Average number of items per itemset: 6.85 | 429 | 50505 | 296.37 | Yes (item meanings in JSON format) |
OnlineRetail_II_all | This Online Retail II dataset contains all the transactions occurring for a UK-based and registered, non-store online retailer between 01/12/2009 and 09/12/2011. The company mainly sells unique all-occasion gift-ware. Many customers of the company are wholesalers. Notes: Dataset cleaned, pre-processed and transformed by Frederic Flouvat based on data from the UCI repository. A github project by F. Flouvat contains details and code about the transformation. This data is interesting for algorithm testing because it consists of sequences of itemsets rather than sequences of items. Average number of items per itemset: 22.775 | 4383 | 41431 | 32.154 | Yes (item meanings in JSON format) |
OnlineRetail_II_best | This Online Retail II dataset contains all the transactions occurring for a UK-based and registered, non-store online retailer between 01/12/2009 and 09/12/2011. The company mainly sells unique all-occasion gift-ware. Many customers of the company are wholesalers. Note: The dataset had some format errors, which have been fixed (2023-7-15). | 4383 | 10157 | 48.23 | Yes (item meanings in JSON format) |
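The statistics reported in the table above can be recomputed from a dataset file. The following is a minimal Python sketch, assuming the sequence database layout described in the SPMF documentation: one sequence per line, items separated by spaces, `-1` closing each itemset and `-2` closing the sequence, with lines starting with `#`, `%` or `@` treated as comments or metadata. The function name `sequence_db_stats` is hypothetical. Because this page does not state whether "Average sequence length" counts items or itemsets, the sketch reports both.

```python
def sequence_db_stats(lines):
    """Return (sequence count, distinct item count,
    average itemsets per sequence, average items per sequence)."""
    sequence_count = 0
    total_itemsets = 0
    total_items = 0
    distinct_items = set()
    for line in lines:
        line = line.strip()
        # Skip blank lines and comment/metadata lines.
        if not line or line[0] in "#%@":
            continue
        sequence_count += 1
        for token in line.split():
            if token == "-1":    # end of an itemset
                total_itemsets += 1
            elif token == "-2":  # end of the sequence
                break
            else:                # a regular item
                distinct_items.add(token)
                total_items += 1
    if sequence_count == 0:
        return 0, 0, 0.0, 0.0
    return (sequence_count, len(distinct_items),
            total_itemsets / sequence_count, total_items / sequence_count)


# Two small sequences; the second one contains a two-item itemset.
sample = ["1 -1 2 3 -1 -2", "4 -1 -2"]
print(sequence_db_stats(sample))
```

To use it on a downloaded file, pass an open file object instead of a list, for example `sequence_db_stats(open("dataset.txt"))`, where `dataset.txt` stands for any of the files above.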
A collection of 30 books converted to SPMF format (sequences of words or sequences of parts of speech)
Below is a collection of 30 public domain books that have been prepared and converted to sequences in the SPMF format by Jean-Marc Pokou et al. (2016). Each book can be used to extract patterns using sequential pattern mining or sequential rule mining algorithms.
These books were written by 10 different English-language novelists of the 19th century. The total number of words/sentences in the corpus of each author is as follows: Catharine Traill (276,829 / 6,588), Emerson Hough (295,166 / 15,643), Henry Adams (447,337 / 14,356), Herman Melville (208,662 / 8,203), Jacob Abbott (179,874 / 5,804), Louisa May Alcott (220,775 / 7,769), Lydia Maria Child (369,222 / 15,159), Margaret Fuller (347,303 / 11,254), Stephen Crane (214,368 / 12,177), and Thornton W. Burgess (55,916 / 2,950). The list of books is:
Author | Datasets (books) in SPMF format |
Catharine Traill | - A Tale of The Rice Lake Plains - Lost in the Backwoods - The Backwoods of Canada |
Emerson Hough | - The Girl at the Halfway House - The Law of the Land - The Man Next Door |
Henry Adams | - Democracy, an American novel - Mont-Saint-Michel and Chartres - The Education of Henry Adams |
Herman Melville | - I and My Chimney - Israel Potter - The Confidence-Man: His Masquerade |
Jacob Abbott | - Alexander the Great - History of Julius Caesar - Queen Elizabeth |
Louisa May Alcott | - Eight Cousins - Rose in Bloom - The Mysterious Key and What It Opened |
Lydia Maria Child | - A Romance of the Republic - Isaac T. Hopper - Philothea |
Margaret Fuller | - Life Without and Life Within - Summer on the Lakes, in 1843 - Woman in the Nineteenth Century |
Stephen Crane | - Active Service - Last Words - The Third Violet |
Thornton W. Burgess | - The Adventures of Buster Bear - The Adventures of Chatterer the Red Squirrel - The Adventures of Grandfather Frog |
There are two versions of each dataset: sequences of words and sequences of parts of speech (POS), obtained with the Stanford NLP tagger.
Here are the links to download the books:
- Books in SPMF format (sequences of words)
- Books in SPMF format (sequences of parts of speech)
- Books in SPMF format with item names (sequences of words)
- Books in SPMF format with item names (sequences of parts of speech)
- Original books as text files
If you use the above book datasets, you may want to cite this paper:
Pokou J. M., Fournier-Viger, P., Moghrabi, C. (2016). Authorship Attribution Using Small Sets of Frequent Part-of-Speech Skip-grams. Proc. 29th Intern. Florida Artificial Intelligence Research Society Conference (FLAIRS 29), AAAI Press, pp. 86-91
Sequences of API calls of malware programs
Here is a large dataset (compressed ZIP file: 6 MB, uncompressed: 813 MB) containing the sequences of API calls from different types of malware programs.
This dataset contains 8 files, each corresponding to a different type of malware such as viruses, backdoors and spyware. Each file is a sequence database containing the sequences of API calls of multiple programs of a given type.
The table below summarizes the information about these 8 files:
File | Type of malware | Samples (sequences) | API calls (distinct items) | Maximum sequence length | Average sequence length |
Adwaretranslated.txt | Adware | 379 | 212 | 1,450,685 | 6,867 |
Backdoortranslated.txt | Backdoor | 1001 | 227 | 1,402,652 | 11,293 |
Downloadertranslated.txt | Downloader | 1001 | 232 | 870,719 | 6,522 |
Droppertranslated.txt | Dropper | 891 | 226 | 1,068,329 | 16,008 |
Spywaretranslated.txt | Spyware | 832 | 229 | 1,764,421 | 46,951 |
Trojantranslated.txt | Trojan | 1001 | 232 | 1,232,913 | 13,818 |
Virustranslated.txt | Virus | 1001 | 241 | 1,062,231 | 18,370 |
Wormstranslated.txt | Worms | 1001 | 236 | 1,245,582 | 33,614 |
These datasets can be used with sequential pattern mining algorithms and sequential rule mining algorithms.
More information about these datasets can be found in this paper:
Nawaz, M. S., Fournier-Viger, P., Nawaz, M. Z., Chen, G., Wu, Y. (2022). MalSPM: Metamorphic Malware Behavior Analysis and Classification using Sequential Pattern Mining. Computers & Security, Elsevier, to appear. DOI: 10.1016/j.cose.2022.102741
More data related to this paper can also be found on Github.
Sequences of MOOC data with timestamps
This page also provides a time-extended sequence database that contains MOOC data (e-learning data) and can be used with sequential pattern mining algorithms such as SPM_FC_P and SPM_FC_L. It can be downloaded here: mooc.txt
This dataset was originally a “Course Recommendation” dataset provided by the MoocData platform and has been transformed to SPMF format by Song et al. The original dataset was collected from XuetangX, one of the largest MOOC platforms in China. It contains the records of 82,535 course enrollment sequences from XuetangX, from October 1, 2016 to March 31, 2018 (a time span of 547 days). The number of courses is 1,302. The longest sequence has a length of 398, the shortest has a length of 3, and the average sequence length is 5.19.
See this paper for more details about this dataset and how it can be used:
Song, W., Ye, W., Fournier-Viger, P. (2022). Mining sequential patterns with flexible constraints from MOOC data. Applied Intelligence
Synthetic datasets
To generate synthetic sequence databases, you can use the sequence database generator provided in SPMF (see this example and this example in the documentation to know how to use it), which is flexible and easy to use.
Here are some synthetic sequence databases generated with the IBM Quest Dataset Generator, converted to the SPMF format:
It is also possible to generate sequence databases by using the IBM Generator. Files generated by the IBM Generator can be converted to SPMF format by using the conversion tool provided in SPMF (see this example in the documentation for how to convert files).
Another alternative for generating synthetic sequence databases is to use a Matlab program provided by Ashwin Balani.
Datasets that contain time-interval sequences
There are also several sequence datasets where events are described using time intervals (each event has a start time and an end time, and therefore a duration). These datasets should be used with specialized algorithms for time-interval related pattern mining such as FastTIRP and VertTIRP. The datasets were obtained from the public repository of github user @omerh18 and converted to the SPMF format to make them easy to use with SPMF.
Dataset name | Sequence count | Event type count | Avg. number of time intervals per sequence |
asl_spmf.csv | 65 | 146 | 31.32 |
aslgt_spmf.csv | 1751 | 47 | 41.26 |
auslan2_spmf.csv | 200 | 12 | 4.49 |
blocks_spmf.csv | 210 | 8 | 5.74 |
context_spmf.csv | 240 | 54 | 53.81 |
ct1_spmf.csv | 1060 | 49 | 42.68 |
ct2_spmf.csv | 576 | 64 | 307.14 |
diabetes_spmf.csv | 2038 | 35 | 39.52 |
hepatitis_spmf.csv | 498 | 63 | 96.44 |
pioneer_spmf.csv | 160 | 92 | 30.51 |
skating_spmf.csv | 530 | 41 | 35.68 |
smarthome_spmf.csv | 89 | 92 | 260.81 |
st1_spmf.csv | 2746 | 1299 | 59.53 |
st2_spmf.csv | 1805 | 1299 | 530.44 |
Datasets for Frequent Itemset mining / Association Rule Mining / Periodic pattern mining / Frequent episode mining
Datasets in SPMF format
These datasets can be directly used in SPMF:
Dataset name | Description | Transaction count | Item count (I) | Average item count per transaction (A) | Density (%) = (A / I) * 100 | Has item names? |
retail | Customer transactions from an anonymous Belgian retail store (source: FIMI, http://fimi.ua.ac.be/data/). | 88,162 | 16470 | 10.30 | 0.06 % | No |
mushrooms | Prepared based on the UCI mushrooms dataset. | 8,416 | 119 | 23 | 19.33 % | No |
pumsb | Census data for population and housing. | 49,046 | 2113 | 74 | 3.50 % | No |
chess | Prepared based on the UCI chess dataset (source: FIMI). | 3,196 | 75 | 37 | 49.33 % | No |
connect | prepared based on the UCI connect-4 dataset (source: FIMI) | 67,557 | 129 | 43 | 33.33 % | No |
accidents | anonymized traffic accident data (source: FIMI) | 340,183 | 468 | 33.8 | 7.22 % | No |
BMS_WebView_1 | Click-stream data from a webstore used in KDD-Cup 2000 (note: this version of the dataset has been prepared for itemset mining / association rule mining) (source: FIMI). | 59,602 | 497 | 2.51 | 0.51 % | No |
BMS_WebView_2 | Click-stream data from a webstore used in KDD-Cup 2000 (note: this version of the dataset has been prepared for itemset mining / association rule mining) (source: FIMI). | 77,512 | 3340 | 4.62 | 0.14 % | No |
Kosarak | Transactions from click-stream data from a Hungarian news portal (source: FIMI). | 990,002 | 41270 | 8.1 | 0.02 % | No |
chainstore | Customer transactions from a major grocery store chain in California, USA. | 1,112,949 | 46086 | 7.23 | 0.02 % | No |
foodmart | customer transactions from a retail store, obtained and transformed from SQL-Server 2000 | 4,141 | 1559 | 4.42 | 0.28 % | No |
Fruithut | This is a dataset of customer transactions from a US retail store focusing on selling fruits. The dataset contains 181,970 transactions and 1,265 different items. The largest transaction contains 36 items, while on average a customer purchases 3.58 items per transaction. You can also download the taxonomy of items for this dataset, which can be useful for cross-level or multi-level itemset mining. The taxonomy is provided in a separate file where each line indicates that an item is a specialization of another item (a product belongs to a product category). An item can be a product or a category. The taxonomy has 4 levels and 43 categories. See the paper of Ying Wang et al. (2020) about mining cross-level high utility itemsets for more details. The data was obtained from Kaggle and then transformed to the SPMF format by Ying Wang et al. for that paper. | 181,970 | 1265 | 3.58 | 0.28 % | Yes, item names are included in the file |
Liquor_11 | This is a dataset of 9,284 customer transactions from US liquor stores in the state of Iowa. You can also download the taxonomy of items for this dataset, which can be useful for cross-level or multi-level itemset mining. The taxonomy is provided in a separate file where each line indicates that an item is a specialization of another item (a product belongs to a product category). An item can be a product or a category. The taxonomy has 7 levels and 77 categories. See the paper of Ying Wang et al. (2020) about mining cross-level high utility itemsets for more details. The data was obtained from the internet and then transformed to the SPMF format by Ying Wang et al. for that paper. In particular, the taxonomy was created by applying NLP to the item names. The dataset contains only transactions with no more than 11 items. If you want all transactions with no more than 15 items, you can download liquor_15.txt, or if you want no more than 5 items, you can download liquor_5.txt. | 9,284 | 2626 | 2.7 | 0.10 % | Yes (temporarily not available) |
kddcup99 | This dataset is transformed from the KDD CUP 1999 dataset, found at https://archive.ics.uci.edu/ml/datasets/KDD+Cup+1999+Data. | 1,000,000 | 135 | 16 | 11.85 % | |
OnlineRetail | This dataset is transformed from the Online Retail dataset, found at https://archive.ics.uci.edu/ml/datasets/Online+Retail. | 541,909 | 2603 | 4.37 | 0.17 % | |
PAMP | This dataset is transformed from the PAMAP2 dataset, obtained from https://archive.ics.uci.edu/ml/datasets/PAMAP2+Physical+Activity+Monitoring | 1,000,000 | 141 | 23.93 | 16.97 % | Yes, PAMPAttributes.xlsx |
Skin | This dataset is transformed from the Skin Segmentation dataset, obtained from https://archive.ics.uci.edu/ml/datasets/Skin+Segmentation. Converted by Zhongjie Zhang. | 245,057 | 11 | 4.0 | 36.36 % | Yes, SkinAttributes.xlsx |
USCensus | This dataset is transformed from the US Census 1990 dataset, obtained from https://archive.ics.uci.edu/ml/datasets/US+Census+Data+(1990). Converted by Zhongjie Zhang. | 1,000,000 | 396 | 68.0 | 17.17 % | No |
RecordLink | This dataset is transformed from http://archive.ics.uci.edu/ml/datasets/Record+Linkage+Comparison+Patterns. Converted by Zhongjie Zhang. | 574,913 | 29 | 10 | 34.48 % | |
PowerC | This dataset is about household electric power consumption. It was prepared using the original dataset from https://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption. The instances of the original dataset were transformed to transactions, where the values of the 3rd to the 9th attributes are divided into 10 equal parts and each part is represented by a number. As a result, the dataset has 140 items. Converted by Zhongjie Zhang. | 1,040,000 | 140 | 7 | 5.00 % | No |
Susy | This dataset is related to physics, and more precisely to particles detected using a particle accelerator. The dataset contains instances prepared using the original dataset obtained from http://archive.ics.uci.edu/ml/datasets/SUSY. The instances were transformed to transactions, such that the values of every attribute are divided into 10 equal parts and each part is represented by a number. As a result, the dataset has 190 items. | 5,000,000 | 190 | 19 | 10.00 % | No |
Chicago_Crimes_2001_to_2017_FIM | The dataset was converted from the dataset 'Crimes in Chicago' (https://www.kaggle.com/datasets/currie32/crimes-in-chicago) by Zhongjie Zhang. The dataset records the crimes that occurred in Chicago from 2001 to 2017. Every transaction corresponds to a <month, area> pair and describes the crimes that occurred in a specific area during a specific month. Items represent the types of crimes, while the utility of an item in a transaction indicates the number of occurrences of that crime. For a description of the types of crimes, see the file describing the item names of that dataset. | 2,662,309 | 35 | 1.795 | 5.13% | Yes, the description of items is here |
Yoo-choose-buy-FIM | This dataset was obtained from the RecSys2015 challenge and converted to the SPMF format so that it can be used with SPMF. The dataset contains transactions from customers who have purchased items in an online store. Note on the conversion: items having a quantity of 0 or a price of 0 were eliminated. | 234,300 | 16,004 | 2.165 | 0.01% | No |
instacart_priorFIM / instacart_trainFIM | These are two customer transaction datasets obtained by transforming the data from the instacart competition that was held on Kaggle in 2017. They can directly be used for frequent itemset mining and association rule mining. To help interpret the results, the product names are provided in another file: here. In addition to these two files, there is also a file taxonomy.txt that provides a taxonomy of all items for this dataset, that is, categories and subcategories of items. This taxonomy file contains several lines, where each line has the form X,Y, indicating that an item or a category X is contained in a category Y. The category names for this dataset are provided in another file: instacart_category_names.txt | 3,214,874 / 131,209 | 49,677 / 39,123 | 10.08 / 10.55 | 0.02 % / 0.03 % | Yes |
TH1 (compressed in 7z format) | A large dataset of customer transactions from grocery store(s). It was obtained from the public dataset TH1 on kaggle from @alexkever and transformed to the SPMF format, with some modifications such as sorting the items in each transaction and removing duplicate items. The names of products are not included. | 2,654,710 | 828 | 10.99 | 1.32% | No |
TH2 (compressed in 7z format) | A very large dataset of customer transactions from grocery store(s). It was obtained from the public dataset TH2 on kaggle from @alexkever and transformed to the SPMF format, with some modifications such as sorting the items in each transaction and removing duplicate items. The names of products are not included. | 26,496,645 | 836 | 10.96 | 1.31% | No |
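The taxonomy files mentioned in the table (for example taxonomy.txt for the instacart datasets, where each line has the form X,Y) can be loaded with a few lines of code. The following is a minimal Python sketch, assuming that every item or category has at most one parent category; the function names `load_taxonomy` and `ancestors` and the sample identifiers are hypothetical.

```python
def load_taxonomy(lines):
    """Map each child (item or category) to the category that contains it."""
    parent_of = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Each line has the form "X,Y": X is contained in category Y.
        child, parent = (part.strip() for part in line.split(",", 1))
        parent_of[child] = parent
    return parent_of


def ancestors(node, parent_of):
    """List the categories above `node`, from the nearest to the most general."""
    chain = []
    seen = {node}
    while node in parent_of:
        node = parent_of[node]
        if node in seen:  # guard against a malformed file containing a cycle
            break
        seen.add(node)
        chain.append(node)
    return chain


# Items 1 and 2 belong to category 100, which belongs to category 200.
taxonomy = load_taxonomy(["1,100", "2,100", "100,200"])
print(ancestors("1", taxonomy))  # ['100', '200']
```

Walking up the chain in this way is what a cross-level or multi-level mining algorithm needs when it generalizes an item to its categories.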
Synthetic datasets
Here are a few synthetic datasets in SPMF format:
Dataset name | Description | Transaction count | Item count (I) | Average item count per transaction (A) | Density (%) = (A / I) * 100 |
c20d10k | synthetic dataset | 10,000 | 192 | 20 | 10.42 % |
c73d10k | synthetic dataset | 10,000 | 1592 | 73 | 4.59 % |
t25i10d10k | synthetic dataset | 9,976 | 929 | 24.77 | 2.67 % |
t20i6d100k | synthetic dataset | 99,922 | 893 | 19.90 | 2.23 % |
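The Density column of the tables above is the average transaction length A divided by the number of distinct items I, expressed as a percentage. For instance, for chess, (37 / 75) * 100 = 49.33 %. The following is a minimal Python sketch of that computation, assuming the SPMF transaction format of one transaction per line with items separated by spaces, and lines starting with `#`, `%` or `@` treated as comments or metadata. The function name `transaction_db_density` is hypothetical.

```python
def transaction_db_density(lines):
    """Return (transaction count, item count I, average length A, density in %)."""
    transaction_count = 0
    total_items = 0
    distinct_items = set()
    for line in lines:
        line = line.strip()
        # Skip blank lines and comment/metadata lines.
        if not line or line[0] in "#%@":
            continue
        items = line.split()
        transaction_count += 1
        total_items += len(items)
        distinct_items.update(items)
    if transaction_count == 0:
        return 0, 0, 0.0, 0.0
    average = total_items / transaction_count
    density = average / len(distinct_items) * 100
    return transaction_count, len(distinct_items), average, density


sample = ["1 3 4", "2 3 5", "1 2 3 5", "2 5", "1 2 3 5"]
count, item_count, average, density = transaction_db_density(sample)
print(count, item_count, round(average, 2), round(density, 2))
```

A low density (as for retail or Kosarak) indicates a sparse dataset, while a high density (as for chess or connect) indicates a dense one, which typically produces many more frequent itemsets at a given support threshold.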
It is possible to generate synthetic transaction databases by using the random transaction database generator provided in SPMF (see this example in the documentation to know how to do it), which is flexible and easy to use.
Moreover, you can download the following synthetic datasets often used in the data mining literature, generated by the IBM Generator. The names are given according to this convention: D: number of sequences in the dataset, C: average number of itemsets per sequence, T: average number of items per itemset, I: average size of itemsets in potentially frequent sequences.
Another alternative for generating synthetic transaction databases is to use a Matlab program provided by Ashwin Balani.
Real-life datasets in SPMF format, having timestamps
The datasets containing timestamps can be directly used in SPMF with algorithms that accept timestamps:
Dataset name | Description | Transaction count | Item count (I) | Average item count per transaction (A) | Density (%) = (A / I) * 100 | Has item names? |
ECommerce_time_without_utility | Customer transactions occurring between 01/12/2010 and 09/12/2011 of a UK-based and registered non-store online retailer (this is the version with timestamps). If you want to know the meaning of the items in this dataset, the list of item names is available. The data was prepared by Yimin Zhang using public data. | 14,975 | 3,468 | 11.71 | 0.34 % | Yes (list of item names) |
Real-life datasets in ARFF format
The GUI and command line interface of SPMF (version 093d and higher) can read ARFF (Attribute-Relation File Format) files. You can download a collection of 36 real-life datasets in ARFF format from the UCI machine learning repository. These files can be used with all association rule mining and itemset mining algorithms that take a transaction database as input. Note that the examples MainTest... in the source code have not been adapted to read ARFF files yet (for now, ARFF files are only accepted by the GUI and command line interface). Most features of the ARFF format are supported, except that (1) the character = is forbidden and (2) escape characters are not considered. Note that when the ARFF format is used, the data mining algorithms will be slightly slower than with the native SPMF file format, because the input file is automatically converted before launching the algorithm and the result has to be converted back. This cost, however, should be small. The specification of the ARFF format can be found here.
Datasets for High-Utility Pattern Mining
The following transaction databases can be used with high-utility itemset mining algorithms such as EFIM, FHM, HUI-Miner, IHUP, UPGrowth and Two-Phase. For information about the format of these files, see the examples in the documentation of EFIM, FHM, HUI-Miner or Two-Phase.
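As a rough illustration of that format, here is a minimal Python sketch of a reader for one transaction line. It assumes the layout shown in the SPMF documentation examples for these algorithms: three sections separated by `:`, namely the items, the transaction utility, and the utility of each item in the transaction, with values separated by spaces. It also assumes integer utility values. The function name `parse_utility_transaction` is hypothetical, and the SPMF documentation remains the authoritative description.

```python
def parse_utility_transaction(line):
    """Split 'items:transaction utility:item utilities' into Python values."""
    items_part, total_part, utilities_part = line.strip().split(":")
    items = [int(token) for token in items_part.split()]
    utilities = [int(token) for token in utilities_part.split()]
    if len(items) != len(utilities):
        raise ValueError("each item must have exactly one utility value")
    return items, int(total_part), utilities


# A transaction with six items whose utilities sum to 30.
items, total, utilities = parse_utility_transaction("3 5 1 2 4 6:30:1 3 5 10 6 5")
print(items)      # [3, 5, 1, 2, 4, 6]
print(total)      # 30
print(utilities)  # [1, 3, 5, 10, 6, 5]
```

Checking that the transaction utility equals the sum of the item utilities is a quick sanity test for files that contain only positive utility values.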
Real-life transaction datasets in SPMF format with real utility values
These are real-life customer transaction datasets with real utility values:
Dataset name | Description | Transaction count | Item count (I) | Average item count per transaction (A) | Density (%) = (A / I) * 100 | Has real utility values? | Has item names? |
foodmart_utility | dataset of customer transactions from a retail store, obtained and transformed from SQL-Server 2000 | 4,141 | 1559 | 4.42 | 0.28 % | Yes | No |
chainstore_utility | Dataset of customer transactions from a major grocery store chain in California, USA, containing 1,112,949 transactions and 46,086 items, obtained and transformed from NU-Mine Bench. The original data is available here (not in SPMF format). | 1,112,949 | 46086 | 7.23 | 0.02 % | Yes | No |
ECommerce_retail_utility_no_timestamps | Transactional dataset that contains all the transactions occurring between 01/12/2010 and 09/12/2011 of a UK-based and registered non-store online retailer (this is the version with real utility values but without timestamps). The data was prepared by Yimin Zhang using public data for his paper about peak high utility itemset mining. | 14,975 | 3468 | 11.71 | 0.34 % | Yes | Yes |
Fruithut_utility | This is a dataset of customer transactions from a US retail store focusing on selling fruits. The dataset contains 181,970 transactions and 1,265 different items. The largest transaction contains 36 items, while on average a customer purchases 3.58 items per transaction. You can also download the taxonomy of items for this dataset, which can be useful for cross-level or multi-level itemset mining. The taxonomy is provided in a separate file where each line indicates that an item is a specialization of another item (a product belongs to a product category). An item can be a product or a category. The taxonomy has 4 levels and 43 categories. See the paper of Ying Wang et al. (2020) about mining cross-level high utility itemsets for more details. The data was obtained from Kaggle and then transformed to the SPMF format by Ying Wang et al. for that paper. | 181,970 | 1265 | 3.58 | 0.28 % | Yes | Yes, item names are included in the file |
Liquor_11 | This is a dataset of 9,284 customer transactions from US liquor stores in the state of Iowa. You can also download the taxonomy of items for this dataset, which can be useful for cross-level or multi-level itemset mining. The taxonomy is provided in a separate file where each line indicates that an item is a specialization of another item (a product belongs to a product category). An item can be a product or a category. The taxonomy has 7 levels and 77 categories. See the paper of Ying Wang et al. (2020) about mining cross-level high utility itemsets for more details. The data was obtained from the internet and then transformed to the SPMF format by Ying Wang et al. for that paper. In particular, the taxonomy was created by applying NLP to the item names. The dataset contains only transactions with no more than 11 items. If you want all transactions with no more than 15 items, you can download liquor_15.txt, or if you want no more than 5 items, you can download liquor_5.txt. | 9,284 | 2626 | 2.7 | 0.10 % | Yes | Yes (but temporarily not available) |
Chicago_Crimes_2001_to_2017_utility | The dataset was converted from the dataset 'Crimes in Chicago' (https://www.kaggle.com/datasets/currie32/crimes-in-chicago) by Zhongjie Zhang. The dataset records the crimes that occurred in Chicago from 2001 to 2017. Every transaction corresponds to a <month, area> pair and describes the crimes that occurred in a specific area during a specific month. Items represent the types of crimes, while the utility of an item in a transaction indicates the number of occurrences of that crime. For a description of the types of crimes, see the file describing the item names of that dataset. | 2,662,309 | 35 | 1.795 | 5.13% | Yes | Yes, the description of items is here |
Yoo-choose-buy-Utility | This dataset was obtained from the RecSys2015 challenge and converted to the SPMF format so that it can be used with SPMF. The dataset contains transactions from customers who have purchased items in an online store. Note on the conversion: items having a quantity of 0 or a price of 0 were eliminated. | 234,300 | 16,004 | 2.165 | 0.01% | Yes | No |
Real-life transaction datasets in SPMF format having synthetic (fake) utility values
These datasets are real datasets but with synthetic utility values. The internal utility values have been generated using a uniform distribution in [1, 10].
Dataset name | Description | Transaction count | Item count (I) | Average item count per transaction (A) | Density (%) (A / I ) * 100 | Has real utility values? | Has item names? |
retail_utility | customer transactions from an anonymous Belgian retail store (source: FIMI, http://fimi.ua.ac.be/data/) | 88,162 | 16,470 | 10.30 | 0.06 % | No (synthetic) | No |
mushroom_utility | prepared based on the UCI mushrooms dataset | 8,416 | 119 | 23 | 19.33 % | No (synthetic) | No |
pumsb_utility | census data for population and housing | 49,046 | 2,113 | 74 | 3.50 % | No (synthetic) | No |
chess_utility | prepared based on the UCI chess dataset (source: FIMI) | 3,196 | 75 | 37 | 49.33 % | No (synthetic) | No |
connect_utility | prepared based on the UCI connect-4 dataset (source: FIMI) | 67,557 | 129 | 43 | 33.33 % | No (synthetic) | No |
accidents_utility | anonymized traffic accident data (source: FIMI) | 340,183 | 468 | 33.8 | 7.22 % | No (synthetic) | No |
kosarak_utility | transactions from click-stream data from a Hungarian news portal (source: FIMI) | 990,002 | 41,270 | 8.1 | 0.02 % | No (synthetic) | No |
BMS_utility | click-stream data from a webstore used in KDD-Cup 2000 (note: this version of the dataset has been prepared for high utility itemset mining) (source: FIMI) | 77,512 | 3,340 | 4.62 | 0.14 % | No (synthetic) | No |
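As a rough illustration of the synthetic utility values described above, the following Python sketch attaches a random utility to every item of a transaction and prints the result in the usual SPMF utility layout (items, then the transaction utility, then the per-item utilities, separated by colons). The function name `add_synthetic_utilities`, the fixed seed, and the choice of integer values with an implicit unit profit of 1 are assumptions made for this example. The datasets on this page may have been generated differently.

```python
import random


def add_synthetic_utilities(transactions, seed=0, low=1, high=10):
    """Attach synthetic utility values to plain transactions.

    Assumption: each item's utility is an integer drawn uniformly from
    low..high, and the unit profit of every item is taken to be 1.
    Each output line has the form  items:transaction_utility:item_utilities
    """
    rng = random.Random(seed)  # fixed seed so the output is reproducible
    lines = []
    for items in transactions:
        utilities = [rng.randint(low, high) for _ in items]
        lines.append("{}:{}:{}".format(
            " ".join(str(item) for item in items),
            sum(utilities),
            " ".join(str(u) for u in utilities),
        ))
    return lines


# Three small hypothetical transactions (lists of item identifiers).
sample = [[1, 3, 4], [2, 5], [1, 2, 3, 5]]
utility_lines = add_synthetic_utilities(sample)
for line in utility_lines:
    print(line)
```

The middle field of each line is the sum of the per-item utilities, which is what high utility itemset mining algorithms expect as the transaction utility.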
Datasets in SPMF format having utility values and timestamps
Here are a few transaction databases in SPMF format for high utility itemset mining that contain timestamps. These datasets were prepared by Yimin Zhang for the paper in Information Sciences about the LHUI-Miner and PHUI-Miner algorithms (Fournier-Viger et al., 2019). These datasets were used for discovering local high utility itemsets and peak high utility itemsets, but can be used for other tasks such as recent high utility pattern mining.
Dataset name | Description | Transaction count | Item count (I) | Average item count per transaction (A) | Density (%) (A / I ) * 100 | Has real utility values? | Has real timestamps? | Has item names? |
ECommerce_retail_utility_timestamps | transactional dataset which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer (this is the version with real utility values and with timestamps). The data was prepared by Yimin Zhang using public data for his paper about peak high utility itemset mining. | 14,975 | 3,468 | 11.71 | 0.34 % | Yes | Yes | Yes |
Fruithut_utility_timestamps | This is a dataset of customer transactions from a US retail store focusing on selling fruits. The dataset contains 181,970 transactions and 1,265 different items. The largest transaction contains 36 items, while on average a customer purchases 3.58 items per transaction. You can also download the taxonomy of items for this dataset, which can be useful for cross-level or multi-level itemset mining. The taxonomy is provided in a separate file where each line indicates that an item is a specialization of another item (a product belongs to a product category). An item can be a product or a category. The taxonomy has 4 levels and 43 categories. See the paper of Ying Wang et al. above for more details. The data was obtained from Kaggle and then transformed to the SPMF format by Ying Wang et al. for her paper about mining cross-level high utility itemsets (2020). | 181,970 | 1,265 | 3.58 | 0.28 % | Yes | Yes | Yes, item names are included in the file |
foodmart_utility_timestamp | dataset of customer transactions from a retail store, obtained and transformed from SQL-Server 2000 | 4,141 | 1,559 | 4.42 | 0.28 % | Yes | No (synthetic) | No |
retail_utility_timestamp | customer transactions from an anonymous Belgian retail store (source: FIMI, http://fimi.ua.ac.be/data/) | 88,162 | 16,470 | 10.30 | 0.06 % | No (synthetic) | No (synthetic) | No |
kosarak_utility_timestamp | transactions from click-stream data from a Hungarian news portal (source: FIMI) | 990,002 | 41,270 | 8.1 | 0.02 % | No (synthetic) | No (synthetic) | No |
mushroom_utility_timestamp | prepared based on the UCI mushrooms dataset | 8,416 | 119 | 23 | 19.33 % | No (synthetic) | No (synthetic) | No |
Datasets for high utility itemset mining with negative unit profit values
Here are a few transaction databases in SPMF format for high-utility itemset mining with negative unit profit values. Those datasets have been generated to be used in the FHN paper (Fournier-Viger et al., 2014) and can be used with the FHN and HUINIV-Mine algorithms. See the FHN paper for details.
Dataset name | Description | Transaction count | Item count (I) | Average item count per transaction (A) | Density (%) (A / I ) * 100 | Has real utility values? | Has item names? |
retail_negative | customer transactions from an anonymous Belgian retail store (source: FIMI, http://fimi.ua.ac.be/data/) | 88,162 | 16,470 | 10.30 | 0.06 % | No (synthetic) | No |
mushroom_negative | prepared based on the UCI mushrooms dataset | 8,416 | 119 | 23 | 19.33 % | No (synthetic) | No |
pumsb_negative | census data for population and housing | 49,046 | 2,113 | 74 | 3.50 % | No (synthetic) | No |
chess_negative | prepared based on the UCI chess dataset (source: FIMI) | 3,196 | 75 | 37 | 49.33 % | No (synthetic) | No |
accidents_negative | anonymized traffic accident data (source: FIMI) | 340,183 | 468 | 33.8 | 7.22 % | No (synthetic) | No |
kosarak_negative | transactions from click-stream data from a Hungarian news portal (source: FIMI) | 990,002 | 41,270 | 8.1 | 0.02 % | No (synthetic) | No |
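The density column used in the tables on this page is defined as (A / I) * 100, where A is the average item count per transaction and I is the item count. The short Python sketch below recomputes it for three of the datasets listed above. The function name `density_percent` is only an illustrative choice.

```python
def density_percent(avg_items_per_transaction, item_count):
    """Density (%) = (A / I) * 100, as defined in the table headers."""
    return avg_items_per_transaction / item_count * 100


# (A, I) pairs copied from the tables on this page.
rows = {
    "retail": (10.30, 16470),
    "mushroom": (23, 119),
    "chess": (37, 75),
}

for name, (a, i) in rows.items():
    print(f"{name}: {density_percent(a, i):.2f} %")
# retail: 0.06 %
# mushroom: 19.33 %
# chess: 49.33 %
```

A low density (such as retail) means that each transaction contains only a tiny fraction of all items, while a high density (such as chess) means that transactions share many items.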
Datasets for high utility quantitative itemset mining
The following datasets have been used for the FHUQI-Miner paper (Nouioua et al., 2021) about high utility quantitative itemset mining, and can also be used with the VHUQI algorithm. For these algorithms, two files are provided for each dataset: the first file contains transactions with quantities for each item (e.g. retail.txt). The second file contains the unit profit or weight of each item (e.g. retail1profit.txt). For more details about the format, please see the documentation page about FHUQI-Miner or VHUQI.
Dataset name | Description | Transaction count | Item count (I) | Average item count per transaction (A) | Density (%) (A / I ) * 100 | Has real utility and quantity values? | Has item names? |
| customer transactions from an anonymous Belgian retail store (note: this version of the dataset has been adapted for high utility quantitative itemset mining) | 88162 | 16470 | 10.30 | 0.06 % | No (synthetic) | No |
| census data for population and housing | 49046 | 2113 | 74 | 3.50 % | No (synthetic) | No |
| prepared based on the UCI connect-4 dataset (note: this version of the dataset has been adapted for high utility quantitative itemset mining) | 67557 | 129 | 43 | 33.33 % | No (synthetic) | No |
| click-stream data from a webstore used in KDD-Cup 2000 (note: this version of the dataset has been adapted for high utility quantitative itemset mining) | 59601 | 497 | 2.42 | 0.51 % | No (synthetic) | No |
| click-stream data from a webstore used in KDD-Cup 2000 (note: this version of the dataset has been adapted for high utility quantitative itemset mining) | 77512 | 3340 | 4.62 | 0.14 % | No (synthetic) | No |
| dataset of customer transactions from a retail store, obtained and transformed from SQL-Server 2000 (note: this version of the dataset has been adapted for high utility quantitative itemset mining) | 4141 | 1559 | 4.42 | 0.28 % | Yes | No |
Datasets for on-shelf high-utility itemset mining with negative unit profit values
Here are a few transaction databases in SPMF format for on-shelf high-utility itemset mining with negative unit profit values. Those datasets have been generated to be used in the FOSHU paper (Fournier-Viger et al., 2015) and can be used with the FOSHU and TS-HOUN algorithms. See the FOSHU paper for details.
- kosarak (5 periods, 25 periods, 50 periods)
- accidents (5 periods, 25 periods, 50 periods)
- chess (5 periods, 25 periods, 50 periods)
- mushroom (5 periods, 25 periods, 50 periods)
- pumsb (5 periods, 25 periods, 50 periods)
- retail (5 periods, 25 periods, 50 periods)
Datasets for high-utility sequential rule mining or high-utility sequential pattern mining
Here are some sequence databases in SPMF format for high-utility sequential rule mining or high-utility sequential pattern mining. Those datasets were generated for the HUSRM paper (Zida et al., 2015), and can be used with the HUSRM and USpan algorithms. See the HUSRM paper for more information.
Dataset name | Description | Sequence count | Item count | Average sequence length | Has item names? |
BMS_sequence_utility | This dataset was used in KDD CUP 2000. It contains clickstream data from an e-commerce | 77,512 | 3,340 | 4.62 | No |
| This is a subset of 10,000 sequences of the Kosarak click-stream data from a Hungarian news portal. The dataset was converted to the SPMF format using the original data from: http://fimi.ua.ac.be/data/. (2020-1-10: the kosarak dataset file has been updated as someone informed me that some sequences were missing) | 10,000 | | | No |
SIGN_sequence_utility | a dataset of sign language utterances containing approximately 800 sequences. The original dataset file in another format can be obtained here with more details on this dataset. | ~800 | 267 | 51.997 | No |
Bible_sequence_utility | This dataset is a conversion of the Bible into a sequence database (each word is an item). | 36,369 | 13,905 | 21.6 | Yes Bible_with_items |
FIFA_sequence_utility | Click stream data from the website of FIFA World Cup 98 | 20,450 | 2,990 | 34.74 | No |
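The statistics in the table above (sequence count, item count, average sequence length) can be recomputed from a dataset file with a few lines of Python. The sketch below assumes the text layout used by SPMF for sequences with utility values, where each item is written as `item[utility]`, `-1` ends an itemset, `-2` ends a sequence, and a trailing `SUtility:` field gives the sequence utility. It also assumes that sequence length is counted in items. Both points should be checked against the SPMF documentation, and the function name `sequence_stats` and the two sample lines are hypothetical.

```python
import re

# An item token such as "3[10]": item identifier 3 with utility 10.
ITEM_TOKEN = re.compile(r"^(\d+)\[(\d+)\]$")


def sequence_stats(lines):
    """Return (sequence count, distinct item count, average items per sequence)."""
    sequence_count = 0
    total_items = 0
    distinct_items = set()
    for line in lines:
        line = line.strip()
        # Skip blank lines and lines that look like comments or metadata.
        if not line or line[0] in "#%@":
            continue
        sequence_count += 1
        for token in line.split():
            match = ITEM_TOKEN.match(token)
            if match:  # separators (-1, -2) and SUtility:... are ignored
                distinct_items.add(int(match.group(1)))
                total_items += 1
    average = total_items / sequence_count if sequence_count else 0.0
    return sequence_count, len(distinct_items), average


sample_lines = [
    "1[1] 2[4] -1 3[10] -1 6[9] -1 -2 SUtility:24",
    "1[1] -1 4[12] 3[20] -1 -2 SUtility:33",
]
print(sequence_stats(sample_lines))  # (2, 5, 3.5)
```

To apply it to a downloaded file, pass an open file object instead of `sample_lines`, since the function only iterates over lines.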
Datasets for Cost-effective Pattern Mining (with utility and cost)
In the paper by Fournier-Viger et al. (2020), it was proposed to find cost-effective patterns in sequences with utility and cost information. This type of data is a set of sequences where each sequence is an ordered list of activities, and each activity has a cost value representing the amount of resources (e.g. time, money) spent to perform the activity. Moreover, each sequence has a utility label that can be either binary or numeric, which represents a positive or negative outcome. This type of data can represent, for example, how students use an e-learning system. In that case, each sequence is a list of activities or sessions performed by a student, and the cost represents the time spent studying for each activity. The utility represents the outcome of a sequence, such as the final exam score of a student after performing the activities (numeric utility), or whether the student passed or failed the course or exam (binary utility). Another example of cost-utility sequences is hospital data, where a sequence is a list of medicines taken by a patient, the cost indicates how much money or time was spent on these medical treatments, and the utility is whether the patient was cured or died.
From cost/utility sequences, we can then find patterns that have a low cost but typically lead to a high utility (called low-cost high utility patterns or cost-efficient patterns) by applying algorithms such as CEPB, CEPN and CorCEPB. In the above examples, some cost-effective patterns may be that studying some e-learning materials A and B typically requires little time but leads to high scores, or that taking some given medicines has a low cost but a high chance of curing a disease.
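To make the idea of a low-cost, high-utility pattern more concrete, here is a small Python sketch. It is not the CEPB, CEPN or CorCEPB algorithm and does not use their exact measures. It only computes, for one candidate pattern, how many sequences contain it, the average cost of its first occurrence, and the fraction of those sequences with a positive outcome. The data, the function name `pattern_summary` and the greedy matching rule are all assumptions made for illustration.

```python
def pattern_summary(sequences, pattern):
    """Summarize one pattern over cost/utility sequences.

    sequences: list of (events, outcome) pairs, where events is a list of
               (activity, cost) tuples and outcome is 1 (positive) or 0.
    pattern:   ordered list of activities, matched greedily as a subsequence.
    Returns None if no sequence contains the pattern.
    """
    costs, outcomes = [], []
    for events, outcome in sequences:
        cost, position = 0, 0
        for activity, activity_cost in events:
            if position < len(pattern) and activity == pattern[position]:
                cost += activity_cost
                position += 1
        if position == len(pattern):  # the whole pattern was matched
            costs.append(cost)
            outcomes.append(outcome)
    if not costs:
        return None
    return {
        "support": len(costs),
        "avg_cost": sum(costs) / len(costs),
        "positive_rate": sum(outcomes) / len(outcomes),
    }


# Hypothetical e-learning data: cost = minutes spent, outcome = passed (1) or failed (0).
students = [
    ([("A", 5), ("B", 10), ("C", 40)], 1),
    ([("A", 7), ("C", 30), ("B", 8)], 1),
    ([("C", 50), ("A", 6)], 0),
    ([("A", 5), ("B", 10)], 0),
]

print(pattern_summary(students, ["A", "B"]))
print(pattern_summary(students, ["C"]))
```

In this toy data, the pattern A followed by B and the pattern C have the same positive rate, but A followed by B costs 15 minutes on average against 40 for C, so it would be the more cost-effective of the two.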
Here are some datasets:
Dataset name | Description | Sequence count | Type of utility |
allSessions_binary.txt | This is an e-learning dataset, where each sequence represents a student using an e-learning platform named Deeds, and there are 62 students. A sequence contains a list of sessions taken by a student. There are six different sessions denoted as 1, 2, 3, 4, 5 and 6. In a sequence, each session has a cost value that represents the time spent by a student on the session for studying or doing some learning activities. Each sequence has a binary utility value, which indicates the outcome of the sequence. Here the utility value indicates if a student has failed or passed the final exam after doing the learning sessions. This dataset can be used with the CEPB and CorCEPB algorithms. Note: This dataset was created by converting public data from Vahdat et al. to the SPMF format. A threshold of 60% was assumed to be the score for passing the final exam. | 62 | Binary |
session6_numeric.txt | This dataset is also an e-learning dataset from students using an e-learning platform named Deeds. This dataset contains 50 sequences. Each sequence indicates the list of activities done by a student during a learning session called SESSION 6. Each activity has a cost value representing the amount of time that the student has spent doing the activity. Moreover, each sequence has a numeric utility value indicating the score that the student obtained on the test at the end of SESSION 6. This dataset can be used with the CEPN algorithm. Note: This dataset was created by converting public data from Vahdat et al. to the SPMF format. Note that this file only contains SESSION 6. | 50 | Numeric |
session6_binary.txt | This is the same as above, except that the utility has been converted to binary values instead of numeric values. This dataset can be used with the CEPB and CorCEPB algorithms. | 50 | Binary |
session5_numeric.txt | This is similar to above, except that this is the data for SESSION 5 (numeric utility). | 53 | Numeric |
session4_numeric.txt | This is similar to above, except that this is the data for SESSION 4 (numeric utility). | 54 | Numeric |
session4_binary.txt | This is similar to above, except that this is the data for SESSION 4 (binary utility). | 54 | Binary |
The original dataset was made public by Vahdat et al. and can be downloaded here but it is not in SPMF format. However, it gives a lot of details about the meaning of the data. This is very useful for understanding the meaning of the patterns found in the above data. This is the paper by Vahdat et al.:
M. Vahdat, L. Oneto, D. Anguita, M. Funk, M. Rauterberg: A learning analytics approach to correlate the academic achievements of students with interaction data from an educational simulator. In: G. Conole et al. (eds.): EC-TEL 2015, LNCS 9307, pp. 352-366. Springer (2015). DOI: 10.1007/978-3-319-24258-3_26
And this is the paper about cost-effective pattern mining with the CEPB, CEPN and CorCEPB algorithms, for which the datasets have been prepared for SPMF:
Fournier-Viger, P., Li, J., Lin, J. C., Chi, T. T., Kiran, R. U. (2020). Mining Cost-Effective Patterns in Event Logs. Knowledge-Based Systems (KBS), Elsevier
New: I have created some versions of the above datasets where sequences with cost values have been transformed into transactions with cost values. If you use those transformed transaction datasets please cite our paper "LCIM: Mining Low Cost and High Utility Itemsets" published in MIWAI 2022 by Nawaz et al., 2022.
Dataset name | Description | Transaction count | Type of utility |
allSessions_binary_trans.txt | See above | 62 | Binary |
session6_numeric_trans.txt | See above | 50 | Numeric |
session6_binary_trans.txt | See above | 50 | Binary |
session5_numeric_trans.txt | See above | 53 | Numeric |
session4_numeric_trans.txt | See above | 54 | Numeric |
session4_binary_trans.txt | See above | 54 | Binary |
Datasets for graph pattern mining
Datasets for mining subgraphs in a graph database
Here are a few datasets that can be used to evaluate subgraph mining algorithms such as TKG, gSpan and cgSpan. Each dataset contains a graph database (multiple graphs).
Dataset name | Description | Graph count | Average node count per graph | Average edge count per graph | Vertex label count | Edge label count | Label file? |
Chemical_340 | a database of 340 graphs about chemistry | 340 | 27.02 | 27.40 | 66 | 4 | No |
Coumpounds_422 | a database of 422 graphs about compounds | 422 | 39.61 | 42.31 | 21 | 4 | No |
Mutag | a dataset of nitro compounds labeled regarding whether they have a mutagenic effect on a bacterium | 188 | 17.93 | 19.79 | 7 | 11 | Yes |
PTC | chemical compounds where their labels indicate carcinogenicity for male and female rats | 344 | 25.55 | 25.96 | 19 | 1 | Yes |
NCI1 | datasets of chemical compounds screened for activity against non-small cell lung cancer and ovarian cancer cell lines | 4,110 | 29.87 | 32.30 | 37 | 3 | Yes |
NCI109 | datasets of chemical compounds screened for activity against non-small cell lung cancer and ovarian cancer cell lines | 4,127 | 29.68 | 32.13 | 38 | 3 | Yes |
Proteins | graph collection where nodes represent secondary structure elements and edges indicate neighborhood in the amino-acid sequence or in 3-dimension space | 1,113 | 39.06 | 72.82 | 3 | 1 | Yes |
Enzymes | protein tertiary structures obtained from the BRENDA enzyme database | 600 | 32.63 | 62.14 | 3 | 1 | Yes |
DandD | a dataset of protein structures where nodes are amino acids and edges indicate spatial closeness, which are classified into enzymes or non-enzymes | 1,178 | 284.31 | 715.66 | 82 | 1 | Yes |
IMDB-B | a movie collaboration dataset that is collected from IMDB. Each graph is an ego-network where nodes represent actors/actresses and edges indicate if they appear in the same movie. Each graph is categorized into one of two genres (Action or Romance). | 1,000 | 19.77 | 96.53 | 65 | 1 | Yes |
5newsgroup | social network dataset | 4976 | 86.86 | 352.65 | 27,881 | 1 | Yes |
collab | social network dataset | 5,000 | 74.49 | 2424.63 | 367 | 1 | Yes |
reddit_binary | social network dataset | 2,000 | 429.61 | 497.75 | 565 | 1 | Yes |
reddit_multi_12k | social network dataset | 11,929 | 391.41 | 456.89 | 909 | 1 | Yes |
reddit_multi_5k | social network dataset | 4,999 | 508.51 | 594.87 | 733 | 1 | Yes |
webkb | social network dataset | 4,167 | 77.80 | 318.15 | 4 | 7,770 | Yes |
The first two datasets were obtained from the Web and are probably the two most famous datasets in subgraph mining.
The other datasets were prepared by Dang Nguyen et al. based on data from various sources, were obtained from github.com/nphdang/gspan/, and were used in this paper: Dang Nguyen, Wei Luo, Tu Dinh Nguyen, Svetha Venkatesh, Dinh Phung (2018). Learning Graph Representation via Frequent Subgraphs. SDM 2018, San Diego, USA. SIAM, 306-314. The descriptions of these latter datasets in the above table were also taken from that paper.
Note that some datasets have label files. Those label files are not used by the algorithms offered in SPMF. They were generated when the datasets were prepared. These files indicate the correspondence between labels in the original datasets and those of the transformed datasets offered on this page. They are provided here because they may be useful to some users.
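The columns of the table above (graph count, average node and edge counts per graph, number of distinct vertex and edge labels) can be recomputed from a graph database file. The Python sketch below assumes the common gSpan-style text layout, where `t # N` starts graph N, `v M L` declares vertex M with label L, and `e P Q L` declares an edge between vertices P and Q with label L. Whether a given file on this page follows exactly this layout should be checked before use. The function name `graph_db_stats` and the sample data are hypothetical.

```python
def graph_db_stats(lines):
    """Compute summary statistics for a graph database in gSpan-style text."""
    graphs = nodes = edges = 0
    vertex_labels, edge_labels = set(), set()
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "t":
            # Some files end with a "t # -1" terminator, which is not a graph.
            if parts[-1] != "-1":
                graphs += 1
        elif parts[0] == "v":
            nodes += 1
            vertex_labels.add(parts[2])
        elif parts[0] == "e":
            edges += 1
            edge_labels.add(parts[3])
    return {
        "graph_count": graphs,
        "avg_nodes": nodes / graphs if graphs else 0.0,
        "avg_edges": edges / graphs if graphs else 0.0,
        "vertex_label_count": len(vertex_labels),
        "edge_label_count": len(edge_labels),
    }


sample_graphs = """\
t # 0
v 0 10
v 1 11
v 2 10
e 0 1 20
e 1 2 21
t # 1
v 0 11
v 1 12
e 0 1 20
""".splitlines()

print(graph_db_stats(sample_graphs))
```

For the two small graphs in the sample, this reports 2 graphs, 2.5 nodes and 1.5 edges per graph on average, 3 vertex labels and 2 edge labels.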
Datasets for Mining Patterns in Dynamic Attributed Graphs
A dynamic attributed graph is a graph that changes over time and where vertices may have multiple numerical attributes. Some algorithms offered in SPMF, such as AER-Miner and TSEQMiner, are designed to discover interesting patterns in dynamic attributed graphs. The datasets used in the AER-Miner and TSEQMiner papers are available here: the datasets (ZIP, 5.0 MB) and some variations for scalability experiments (ZIP, 138 MB). Please see these papers for a description of these datasets.
Datasets for Clustering
A Matlab program provided by Ashwin Balani to generate synthetic clustering datasets.