Fix Item Identifiers in a Transaction Database (SPMF documentation)

This example explains how to fix item identifiers in a transaction database using the SPMF open-source data mining library.

How to run this example?

What is this tool?

The tool "Fix_item_ids_in_transaction_database" is a small program that increases or decreases the identifiers of all items in a given transaction database file in SPMF format. It was created because some algorithms require that all item identifiers be positive (e.g. 1, 2, 3...), but some datasets contain an item "0". In such cases, this tool can be used to quickly fix the database by incrementing all item identifiers by 1.
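As an illustration of the situation described above, the following Python sketch computes the smallest increment that makes every item identifier positive. It is not part of SPMF; the function name offset_to_make_positive and the sample dataset are hypothetical and used only for this example.

```python
def offset_to_make_positive(transactions):
    """Return the smallest increment that makes every item id >= 1.

    Returns 0 when all ids are already positive or the database is empty.
    """
    # default=1 covers an empty database, for which no shift is needed.
    smallest = min(
        (item for transaction in transactions for item in transaction),
        default=1,
    )
    return max(0, 1 - smallest)


# Hypothetical dataset containing the problematic item 0.
database = [[0, 2, 3], [1, 2, 4], [0, 1]]
print(offset_to_make_positive(database))  # prints 1
```

For a database whose smallest item is 0, the result is 1, which matches the increment suggested above.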

What is the input?

The input is a transaction database in SPMF format that needs to be fixed. A transaction database is a set of transactions. Each transaction is an unordered set of items (symbols) represented by positive integers. For example, consider the following database. This database is provided in the file "contextPasquier99.txt" of the SPMF distribution. It contains five transactions. The first transaction contains the set of items {1, 3, 4}.

Transaction id Items
t1 {1, 3, 4}
t2 {2, 3, 5}
t3 {1, 2, 3, 5}
t4 {2, 5}
t5 {1, 2, 3, 5}


What is the output?

The output is a transaction database where all item ids in transactions are incremented by a user-defined value. For example, applying the value "1" to the previous database gives the following database:

Transaction id Items
t1 {2, 4, 5}
t2 {3, 4, 6}
t3 {2, 3, 4, 6}
t4 {3, 6}
t5 {2, 3, 4, 6}
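The transformation itself is simple arithmetic: the chosen value is added to every item of every transaction. A minimal Python sketch, assuming the database is held in memory as a list of lists (the function name shift_item_ids is hypothetical and is not SPMF's API):

```python
def shift_item_ids(transactions, offset):
    """Return a new database in which every item id is increased by offset.

    A negative offset decreases the ids instead.
    """
    return [[item + offset for item in transaction] for transaction in transactions]


# The five transactions of the example input database.
database = [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5], [1, 2, 3, 5]]

shifted = shift_item_ids(database, 1)
for tid, transaction in enumerate(shifted, start=1):
    print(f"t{tid}", transaction)
```

Because the same value is added to every item, the relative order of the items within each transaction is preserved.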

Input file format

The input file format is defined as follows. It is a text file. An item is represented by a positive integer. A transaction is a line in the text file. In each line (transaction), items are separated by a single space. It is assumed that all items within the same transaction (line) are sorted according to a total order (e.g. ascending order) and that no item can appear twice within the same line.

For example, for the previous example, the input file is defined as follows:

1 3 4
2 3 5
1 2 3 5
2 5
1 2 3 5
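To make the format concrete, here is a short Python sketch that reads such lines and checks the two assumptions stated above. It assumes the total order is ascending numeric order; the helper names parse_transaction and is_strictly_ascending are hypothetical and not part of SPMF.

```python
def parse_transaction(line):
    """Split one line of the file into a list of integer item ids."""
    return [int(token) for token in line.split()]


def is_strictly_ascending(items):
    """True if each item is smaller than the next one.

    Strictly ascending order also guarantees that no item appears twice.
    """
    return all(a < b for a, b in zip(items, items[1:]))


# The lines of the example input file.
lines = ["1 3 4", "2 3 5", "1 2 3 5", "2 5", "1 2 3 5"]

database = [parse_transaction(line) for line in lines]
print(database[0])                                      # prints [1, 3, 4]
print(all(is_strictly_ascending(t) for t in database))  # prints True
```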

Output file format

The output file format is the same as the input file format, but each item identifier is changed by adding the user-specified value (e.g. "1"). For the previous example, the output file is:

2 4 5
3 4 6
2 3 4 6
3 6
2 3 4 6
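The whole conversion from input text to output text can be sketched in a few lines of Python. This is an illustration of the file formats only, not SPMF's implementation; the function name shift_database is hypothetical, and skipping blank lines is an assumption of this sketch. In-memory text streams stand in for files here; file objects returned by open() can be passed in the same way.

```python
import io


def shift_database(source, target, offset):
    """Copy a transaction database from source to target, adding offset to every item id."""
    for line in source:
        tokens = line.split()
        if not tokens:
            continue  # assumption: blank lines are skipped
        shifted = (str(int(token) + offset) for token in tokens)
        # Items stay separated by a single space, one transaction per line.
        target.write(" ".join(shifted) + "\n")


# The example input file, held in memory for the demonstration.
source = io.StringIO("1 3 4\n2 3 5\n1 2 3 5\n2 5\n1 2 3 5\n")
target = io.StringIO()

shift_database(source, target, 1)
print(target.getvalue(), end="")
```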