Calculate Statistics for a Transaction Database (SPMF documentation)

This example explains how to calculate statistics for a transaction database using the SPMF open-source data mining library.

How to run this example?

What is this tool?

This tool generates statistics about a transaction database. For example, it can be used to find out whether a database is dense or sparse before applying a data mining algorithm.

What is the input?

The input is a transaction database (also known as a formal context). A transaction database is a set of transactions. Each transaction is an unordered set of items (symbols) represented by positive integers. For example, consider the following database. This database is provided in the file "contextPasquier99.txt" of the SPMF distribution. It contains five transactions. The first transaction contains the set of items {1, 3, 4}.

Transaction id Items
t1 {1, 3, 4}
t2 {2, 3, 5}
t3 {1, 2, 3, 5}
t4 {2, 5}
t5 {1, 2, 3, 5}
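
To make the notion of a dense or sparse database more concrete, here is a minimal Python sketch that computes a few descriptive statistics for the five transactions above. It is not the SPMF tool and does not reproduce its output. The function name basic_statistics and the density definition used here (average transaction length divided by the number of distinct items) are assumptions chosen for illustration.

```python
from collections import Counter

# The five example transactions from the table above.
database = [
    {1, 3, 4},
    {2, 3, 5},
    {1, 2, 3, 5},
    {2, 5},
    {1, 2, 3, 5},
]


def basic_statistics(transactions):
    """Return a few descriptive statistics for a list of item sets.

    Hypothetical helper for illustration; not part of SPMF.
    """
    if not transactions:
        raise ValueError("the database must contain at least one transaction")

    # Count in how many transactions each item appears.
    item_counts = Counter(item for t in transactions for item in t)
    lengths = [len(t) for t in transactions]
    n_items = len(item_counts)
    if n_items == 0:
        raise ValueError("the database must contain at least one item")
    avg_length = sum(lengths) / len(transactions)

    return {
        "transactions": len(transactions),
        "distinct_items": n_items,
        "min_length": min(lengths),
        "max_length": max(lengths),
        "avg_length": avg_length,
        # Assumed definition: fraction of filled cells in the
        # transaction-by-item matrix (closer to 1 means denser).
        "density": avg_length / n_items,
    }


stats = basic_statistics(database)
for name, value in stats.items():
    print(f"{name}: {value}")
```

Under this assumed definition, a value close to 1 means that most items appear in most transactions (a dense database), while a value close to 0 indicates a sparse one.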

What is the output?

The output is a set of statistics about the transaction database. For example, if we use the tool on the transaction database given above, we get the following statistics:

Input file format

The input file format is defined as follows. It is a text file. An item is represented by a positive integer. A transaction is a line in the text file. In each line (transaction), items are separated by a single space. It is assumed that all items within the same transaction (line) are sorted according to a total order (e.g. ascending order) and that no item appears twice within the same line.
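
The format rules above can be sketched as a small Python reader. This is an illustrative sketch, not SPMF code: the function names parse_transaction_line and read_transactions are hypothetical. The reader checks that items are positive integers and that no item is repeated within a line. It does not check the item order, because the format only requires some total order, not necessarily ascending order.

```python
def parse_transaction_line(line):
    """Parse one line of the input format into a list of item integers.

    Hypothetical helper for illustration; not part of SPMF.
    """
    # split() with no argument also tolerates a trailing newline.
    items = [int(token) for token in line.split()]
    if any(item <= 0 for item in items):
        raise ValueError(f"items must be positive integers: {line!r}")
    if len(set(items)) != len(items):
        raise ValueError(f"an item appears twice in the same line: {line!r}")
    return items


def read_transactions(lines):
    """Read all non-empty lines as transactions."""
    return [parse_transaction_line(line) for line in lines if line.strip()]


# The five transactions of the example database, one per line.
sample = """1 3 4
2 3 5
1 2 3 5
2 5
1 2 3 5
"""

transactions = read_transactions(sample.splitlines())
print(transactions)
```

To read an actual file, the same function can be given an open file object, for example read_transactions(open("contextPasquier99.txt")), assuming the file is in the current directory.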

For example, the input file for the example database given above could be the following:

1 3 4
2 3 5
1 2 3 5
2 5