Generating a Synthetic Sequence Database (SPMF documentation)

This example explains how to generate a synthetic sequence database using the SPMF open-source data mining library.

How to run this example?

If you are using the graphical interface, (1) choose the "Generate_a_sequence_database" algorithm, (2) set the output file name (e.g. "output.txt"), (3) choose 100 sequences, 1000 maximum distinct items, 3 items per itemset and 7 itemsets per sequence, and (4) click "Run algorithm".
If you want to execute this example from the command line, then execute this command in a folder containing spmf.jar:
java -jar spmf.jar run Generate_a_sequence_database output.txt 100 1000 3 7
The four numbers are, in order, the number of sequences, the maximum number of distinct items, the number of items per itemset and the number of itemsets per sequence.
If you are using the source code version of SPMF, launch the file "MainTestGenerateSequenceDatabase.java" in the package ca.pfv.SPMF.tests.

What is this tool?

This tool is a random generator of sequence databases. It can be used to generate synthetic sequence databases for comparing the performance of data mining algorithms that take a sequence database as input.

Synthetic databases are often used in the data mining literature to evaluate algorithms. In particular, they are useful for comparing the scalability of algorithms. For example, one can generate sequence databases of various sizes and observe how the execution time and memory usage of each algorithm change with the database size.
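
As an illustration of such a scalability experiment, the following minimal sketch builds one generator command per database size, reusing the command-line syntax shown above. The class name, the method name, the chosen sizes and the "db_<size>.txt" output file names are assumptions made for this illustration and are not part of SPMF. The sketch only prints the commands; it does not run them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of SPMF: prepares generator commands for several database sizes.
class ScalabilityCommandsSketch {

    // Returns one command line per requested number of sequences.
    // The other three parameters are kept fixed so that only the database size varies.
    static List<String> buildCommands(int[] sequenceCounts, int maxDistinctItems,
                                      int itemsPerItemset, int itemsetsPerSequence) {
        List<String> commands = new ArrayList<>();
        for (int count : sequenceCounts) {
            // Assumed naming scheme for the output files (illustration only).
            String outputFile = "db_" + count + ".txt";
            commands.add(String.join(" ",
                    "java", "-jar", "spmf.jar", "run", "Generate_a_sequence_database",
                    outputFile,
                    String.valueOf(count),
                    String.valueOf(maxDistinctItems),
                    String.valueOf(itemsPerItemset),
                    String.valueOf(itemsetsPerSequence)));
        }
        return commands;
    }

    public static void main(String[] args) {
        // Example sizes chosen arbitrarily for the illustration.
        int[] sizes = {100, 1000, 10000};
        for (String command : buildCommands(sizes, 1000, 3, 7)) {
            System.out.println(command);
        }
    }
}
```

Each printed command can then be executed in a folder containing spmf.jar, and the resulting files can be given to the algorithms being compared.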

What is the input?

The tool for generating a sequence database takes four parameters as input:

1) the number of sequences to be generated (an integer >= 1)

2) the maximum number of distinct items that the database should contain (an integer >= 1)

3) the number of items that each itemset should contain (an integer >= 1)

4) the number of itemsets that each sequence should contain (an integer >= 1)

What is the output?

The tool outputs a sequence database that respects these parameters. The database is generated using a random number generator.
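
To make the role of the four parameters concrete, here is a minimal sketch of how such a random generator could work. It is not the SPMF implementation: the class name, the method name, the in-memory list representation and the seed argument are assumptions made for this illustration, and the sketch draws items independently, so the same item may appear more than once in an itemset.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of a random sequence database generator (not SPMF's actual code).
class SequenceDatabaseSketch {

    // Builds sequenceCount sequences. Each sequence contains itemsetsPerSequence itemsets,
    // and each itemset contains itemsPerItemset item ids drawn from 1..maxDistinctItems.
    static List<List<List<Integer>>> generate(int sequenceCount, int maxDistinctItems,
                                              int itemsPerItemset, int itemsetsPerSequence,
                                              long seed) {
        if (sequenceCount < 1 || maxDistinctItems < 1
                || itemsPerItemset < 1 || itemsetsPerSequence < 1) {
            throw new IllegalArgumentException("all parameters must be >= 1");
        }
        // The seed is an assumption of this sketch; it makes the output reproducible.
        Random random = new Random(seed);
        List<List<List<Integer>>> database = new ArrayList<>();
        for (int s = 0; s < sequenceCount; s++) {
            List<List<Integer>> sequence = new ArrayList<>();
            for (int i = 0; i < itemsetsPerSequence; i++) {
                List<Integer> itemset = new ArrayList<>();
                for (int k = 0; k < itemsPerItemset; k++) {
                    // nextInt(n) returns a value in 0..n-1, so add 1 to obtain 1..n.
                    itemset.add(random.nextInt(maxDistinctItems) + 1);
                }
                sequence.add(itemset);
            }
            database.add(sequence);
        }
        return database;
    }

    public static void main(String[] args) {
        // Same parameter values as in the example above: 100 sequences,
        // 1000 maximum distinct items, 3 items per itemset, 7 itemsets per sequence.
        List<List<List<Integer>>> database = generate(100, 1000, 3, 7, 42L);
        System.out.println(database.size() + " sequences generated");
        System.out.println("First sequence: " + database.get(0));
    }
}
```

Because items are drawn from the range 1 to the maximum number of distinct items, the generated database contains at most that many distinct items, and possibly fewer when the database is small.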