Fix a Transaction Database (SPMF documentation)

This example explains how to fix a transaction database using the SPMF open-source data mining library.

How to run this example?

What is this tool?

The tool "Fix_a_transaction_database" is a small program that fix some common problems in a transaction database file in SPMF format. The tool fixes two common problems: (1) an item appears more than once in a transaction, (2) transactions are not sorted. To fix the first problem the tool will keep only one occurrence of each item in a transaction. So if an items appears more than once in a transaction, it will appears only once after applying the tool. To fix the second problem, the tool sorts each transaction according to the lexicographical ordering (because it is required by most itemset and association rule mining algorithms).

What is the input?

The input is a transaction database in SPMF format that need to be fixed. A transaction database is a set of transactions. Each transactions an unordered set of items (symbols) represented by positive integers. For example, consider the following database. This database is provided in the file "contextIncorrect.txt" of the SPMF distribution. It contains three transactions. The first transactions contains the set of items {1, 3, 4}. However, this transaction database has some problems. The first transaction contains an item 3 that appears more than once. Moreover, transactions are not sorted.

Transaction id Items
t1 {1, 3, 3 4, 3}
t2 {5, 3, 2}
t3 {1, 2, 3, 5}

What is the output?

The output is a transaction database where each transaction is sorted and no item appears more than once. For example, the output using the above example is:

Transaction id Items
t1 {1, 3, 4}
t2 {2, 3, 5}
t3 {1, 2, 3, 5}

Input file format

The input file format is defined as follows. It is a text file. An item is represented by a positive integer. A transaction is a line in the text file. In each line (transaction), items are separated by a single space. It is assumed that all items within a same transaction (line) are sorted according to a total order (e.g. ascending order) and that no item can appear twice within the same line.

For example, for the previous example, an input file that is incorrect is provided:

1 3 3 4 3
5 3 2
1 2 3 5

Output file format

The output file format is the same as the input format. But the problems contained in the input file have been fixed.

1 3 4
2 3 5
1 2 3 5