Calculate Statistics for a Sequence Database (SPMF documentation)

This example explains how to calculate statistics for a sequence database using the SPMF open-source data mining library.

How to run this example?

What is this tool?

This tool is a tool for generating statistics about a sequence database. It can be used to know for example if the database is dense or sparse before applying a data mining algorithms.

What is the input?

The input is a sequence database. A sequence database is a set of sequence. Each sequence is an ordered list of itemsets. An itemset is an unordered list of items (symbols). For example, consider the following database. This database is provided in the file "contextPrefixSpan.txt" of the SPMF distribution. It contains 4 sequences. The second sequence represents that the set of items {1 4} was followed by {3}, which was followed by {2, 3}, which were followed by {1, 5}. It is a sequence database (as defined in Pei et al., 2004).

ID Sequences
S1 (1), (1 2 3), (1 3), (4), (3 6)
S2 (1 4), (3), (2 3), (1 5)
S3 (5 6), (1 2), (4 6), (3), (2)
S4 (5), (7), (1 6), (3), (2), (3)

What is the output?

The output is statistics about the sequence database. For example, if we use the tool on the previous sequence database given as example, we get the following statistics:

Input file format

The input file format is defined as follows. It is a text file where each line represents a sequence from a sequence database. Each item from a sequence is a positive integer and items from the same itemset within a sequence are separated by single space. Note that it is assumed that items within a same itemset are sorted according to a total order and that no item can appear twice in the same itemset. The value "-1" indicates the end of an itemset. The value "-2" indicates the end of a sequence (it appears at the end of each line). For example, the input file "contextPrefixSpan.txt" contains the following four lines (four sequences).

1 -1 1 2 3 -1 1 3 -1 4 -1 3 6 -1 -2
1 4 -1 3 -1 2 3 -1 1 5 -1 -2
5 6 -1 1 2 -1 4 6 -1 3 -1 2 -1 -2
5 -1 7 -1 1 6 -1 3 -1 2 -1 3 -1 -2

The first line represents a sequence where the itemset {1} is followed by the itemset {1, 2, 3}, followed by the itemset {1, 3}, followed by the itemset {4}, followed by the itemset {3, 6}. The next lines follow the same format.