Fix a Sequence Database (SPMF documentation)

This example explains how to fix a sequence database using the SPMF open-source data mining library.

How to run this example?

What is this tool?

The tool "Fix_a_sequence_database" is a small program that fix some common problems in a sequence database file in SPMF format. The tool fixes two common problems: (1) an item appears more than once in an itemset from a sequence, (2) items within an itemset are not sorted by some total order. To fix the first problem the tool will keep only one occurrence of each item in an itemset from a sequence. So if an items appears more than once in an itemset, it will appears only once after applying the tool. To fix the second problem, the tool sorts each itemset according to the lexicographical ordering (because it is required by most sequential pattern mining algorithms).

What is the input?

The input is a sequence database in SPMF format that need to be fixed. A sequence database is a set of sequences where each sequence is a list of itemsets. An itemset is a set of distinct items, that should be ordered according to some total order for processing purposes. For example, the table shown below contains two sequences. These sequences do not follow the above requirements and hence need to be fixed. The first sequence contains two itemsets. The first itemset contains the items 4 and 5 but the items are repeated several times which is forbidden. Then the second itemset contains the items 4, 5 and 6. The second sequence contains three itemsets. The first one contains the item 7. The second itemset contains the item 8. And the third itemset contains the items 7, 8 and 9 but they are not sorted in lexicographical order. Thus, we will fix this problem by applying the tool.

ID Sequences
S1 (4 4 4 5 5) (4 5 6)
S2 (7) (8) (9 8 7)

What is the output?

The output is a sequence database where each itemset in a sequence is sorted and no item appears more than once. For example, the output using the above example is:

ID Sequences
s1 (4 5) (4 5 6)
s2 (7) (8) (7 8 9)

Input file format

The input file format is defined as follows. It is a text file where each line represents a sequence from a sequence database. Each item from a sequence is a positive integer and items from the same itemset within a sequence are separated by single space. Note that it is assumed that items within a same itemset are sorted according to a total order and that no item can appear twice in the same itemset. The value "-1" indicates the end of an itemset. The value "-2" indicates the end of a sequence (it appears at the end of each line).

For example, for the previous example, an input file that is incorrect is provided:

4 4 4 5 5 -1 4 5 6 -1 -2
7 -1 8 -1 9 8 7 -1 -2

Output file format

The output file format is the same as the input format. But the problems contained in the input file have been fixed.

4 5 -1 4 5 6 -1 -2
7 -1 8 -1 7 8 9 -1 -2