Fix a Transaction Database (SPMF documentation)

This example explains how to fix a transaction database using the SPMF open-source data mining library.

How to run this example?

If you are using the graphical interface, (1) choose the "Fix_a_transaction_database" algorithm, (2) choose the input file contextIncorrect.txt (3) set the output file to "output.txt", and then (4) click "Run algorithm".
If you want to execute this example from the command line, then execute this command:
java -jar spmf.jar run Fix_a_transaction_database contextIncorrect.txt output.txt in a folder containing spmf.jar and the input file contextIncorrect.txt.
If you are using the source code version of SPMF, launch the file "MainTestFixTransactionDatabase.java" in the package ca.pfv.SPMF.tests.

What is this tool?

The tool "Fix_a_transaction_database" is a small program that fix some common problems in a transaction database file in SPMF format. The tool fixes two common problems: (1) an item appears more than once in a transaction, (2) transactions are not sorted. To fix the first problem the tool will keep only one occurrence of each item in a transaction. So if an items appears more than once in a transaction, it will appears only once after applying the tool. To fix the second problem, the tool sorts each transaction according to the lexicographical ordering (because it is required by most itemset and association rule mining algorithms).

What is the input?

The input is a transaction database in SPMF format that need to be fixed. A transaction database is a set of transactions. Each transactions an unordered set of items (symbols) represented by positive integers. For example, consider the following database. This database is provided in the file "contextIncorrect.txt" of the SPMF distribution. It contains three transactions. The first transactions contains the set of items {1, 3, 4}. However, this transaction database has some problems. The first transaction contains an item 3 that appears more than once. Moreover, transactions are not sorted.

Transaction id	Items
t1	{1, 3, 3 4, 3}
t2	{5, 3, 2}
t3	{1, 2, 3, 5}

What is the output?

The output is a transaction database where each transaction is sorted and no item appears more than once. For example, the output using the above example is:

Transaction id	Items
t1	{1, 3, 4}
t2	{2, 3, 5}
t3	{1, 2, 3, 5}

Input file format

The input file format is defined as follows. It is a text file. An item is represented by a positive integer. A transaction is a line in the text file. In each line (transaction), items are separated by a single space. It is assumed that all items within a same transaction (line) are sorted according to a total order (e.g. ascending order) and that no item can appear twice within the same line.

For example, for the previous example, an input file that is incorrect is provided:

1 3 3 4 3
5 3 2
1 2 3 5

Output file format

The output file format is the same as the input format. But the problems contained in the input file have been fixed.

1 3 4
2 3 5
1 2 3 5