Generating a Synthetic Transaction Database (SPMF documentation)

This example explains how to generate a synthetic transaction database using the SPMF open-source data mining library.

How to run this example?

If you are using the graphical interface, (1) choose the "Generate_a_transaction_database" algorithm, (2) set the output file name (e.g. "output.txt"), (3) choose 100 transactions, 1000 maximum distinct items and 10 items per transaction, and (4) click "Run algorithm".
If you want to execute this example from the command line, then execute this command in a folder containing spmf.jar:
java -jar spmf.jar run Generate_a_transaction_database output.txt 100 1000 10
If you are using the source code version of SPMF, launch the file "MainTestGenerateTransactionDatabase.java" in the package ca.pfv.SPMF.tests.

What is this tool?

This tool is a random generator of transaction databases. It can be used to generate synthetic transaction databases to compare the performance of data mining algorithms that take a transaction database as input.

Synthetic databases are often used in the data mining literature to evaluate algorithms. In particular, they are useful for comparing the scalability of algorithms. For example, one can generate transaction databases of various sizes and see how the algorithms react in terms of execution time and memory usage with respect to the database size.

What is the input?

The tool for generating a transaction database takes three parameters as input:

1) the number of transactions to be generated (an integer >= 1)

2) the maximum number of distinct items that the database should contain (an integer >= 1)

3) the number of items that each transaction should contain (an integer >= 1)

What is the output?

The algorithm outputs a transaction database respecting the parameters provided. A random number generator is used to generate the database.
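
To make the role of the three parameters concrete, here is a minimal sketch of how such a random generator could work. It is not the SPMF implementation: the class name TransactionDatabaseSketch, the method generate, and the seed argument are hypothetical. The sketch also assumes that every transaction contains exactly the requested number of items, that items are drawn uniformly from 1 to the maximum number of distinct items, and that the number of items per transaction does not exceed that maximum.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.TreeSet;

// Illustrative sketch only: NOT the SPMF implementation (all names are hypothetical).
final class TransactionDatabaseSketch {

    /**
     * Builds transactionCount lines, each holding itemsPerTransaction distinct
     * items drawn uniformly from 1..maxDistinctItems, in ascending order.
     */
    static List<String> generate(int transactionCount, int maxDistinctItems,
                                 int itemsPerTransaction, long seed) {
        if (transactionCount < 1 || maxDistinctItems < 1 || itemsPerTransaction < 1) {
            throw new IllegalArgumentException("all parameters must be >= 1");
        }
        // Assumption of this sketch: a transaction cannot hold more distinct
        // items than the database is allowed to contain.
        if (itemsPerTransaction > maxDistinctItems) {
            throw new IllegalArgumentException("itemsPerTransaction exceeds maxDistinctItems");
        }
        Random random = new Random(seed); // fixed seed gives a reproducible database
        List<String> lines = new ArrayList<>();
        for (int t = 0; t < transactionCount; t++) {
            // A TreeSet rejects duplicates and keeps the items sorted.
            TreeSet<Integer> items = new TreeSet<>();
            while (items.size() < itemsPerTransaction) {
                items.add(random.nextInt(maxDistinctItems) + 1); // item in 1..maxDistinctItems
            }
            StringBuilder line = new StringBuilder();
            for (int item : items) {
                if (line.length() > 0) {
                    line.append(' ');
                }
                line.append(item);
            }
            lines.add(line.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        // Same parameters as the example above: 100 transactions, 1000 items, 10 per transaction.
        for (String line : generate(100, 1000, 10, 42L)) {
            System.out.println(line);
        }
    }
}
```

Redrawing until the set is full is simple, but it becomes slow when the number of items per transaction is close to the maximum number of distinct items. Shuffling the list of all items and keeping a prefix would be a reasonable alternative in that case.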

A transaction database is a set of transactions. Each transaction is a set of items. For example, consider the following transaction database. It contains 5 transactions (t1, t2, ..., t5) and 5 items (1, 2, 3, 4, 5). The first transaction represents the set of items 1, 3 and 4. It is important to note that an item is not allowed to appear twice in the same transaction and that items are assumed to be sorted by lexicographical order in a transaction.

Transaction id	Items
t1	{1, 3, 4}
t2	{2, 3, 5}
t3	{1, 2, 3, 5}
t4	{2, 5}
t5	{1, 2, 3, 5}

Output file format

The output file format is defined as follows. It is a text file. An item is represented by a positive integer. A transaction is a line in the text file. In each line (transaction), items are separated by a single space. It is assumed that all items within the same transaction (line) are sorted according to a total order (e.g. ascending order) and that no item can appear twice within the same line.

For the previous example, an output file could be the following:

1 3 4
2 3 5
1 2 3 5
2 5
1 2 3 5
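
The format rules above can be checked mechanically. The following sketch parses one line of such a file and rejects lines that break the rules. It is not part of SPMF: the class name TransactionFormatSketch and the method parseTransaction are hypothetical, and the sketch assumes that the total order is ascending numeric order.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: NOT part of SPMF (all names are hypothetical).
final class TransactionFormatSketch {

    /**
     * Parses one line of the file into a list of items. Throws
     * IllegalArgumentException if an item is not a positive integer or if the
     * items are not in strictly ascending order (which also rules out duplicates).
     */
    static List<Integer> parseTransaction(String line) {
        List<Integer> items = new ArrayList<>();
        int previous = 0;
        for (String token : line.split(" ")) {
            // NumberFormatException (a subclass of IllegalArgumentException)
            // is thrown here for tokens that are not integers.
            int item = Integer.parseInt(token);
            if (item <= 0) {
                throw new IllegalArgumentException("item must be positive: " + item);
            }
            if (item <= previous) {
                throw new IllegalArgumentException("items must be strictly ascending: " + line);
            }
            items.add(item);
            previous = item;
        }
        return items;
    }

    public static void main(String[] args) {
        // The five lines of the example output file shown above.
        String[] lines = {"1 3 4", "2 3 5", "1 2 3 5", "2 5", "1 2 3 5"};
        for (String line : lines) {
            System.out.println(parseTransaction(line));
        }
    }
}
```

Checking that each item is strictly greater than the previous one covers both the ordering rule and the rule that no item appears twice, so no separate duplicate check is needed.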