Fix a Transaction Database with Utility and Time Information (SPMF documentation)

This example explains how to fix a transaction database with utility and time information using the SPMF open-source data mining library.

How to run this example?

What is this tool?

The tool "Fix_a_transaction_database" is a small program that fix some common problems in a transaction database file in SPMF format. The tool fixes two common problems: (1) an item appears more than once in a transaction, (2) transactions are not sorted. To fix the first problem the tool will keep only one occurrence of each item in a transaction. So if an items appears more than once in a transaction, it will appears only once after applying the tool. To fix the second problem, the tool sorts each transaction according to the lexicographical ordering (because it is required by most itemset and association rule mining algorithms).

What is the input?

The input is a transaction database with utility information in the SPMF format. Let's consider the following database consisting of 8 transactions (t1,t2...t8) and 5 items (1, 2, 3, 4, 5), and each transaction is associated with a timestamp. This database is provided in the text file "DB_broken.txt" in the package ca.pfv.spmf.tests of the SPMF distribution

Items

Transaction utility

Item utilities for this transaction

Timestamp

5 2 3 5 5

9

1000 4 2 3 1000

1

2 3 4 5

18

8 3 4 3

3

2 3 4 5

9

4 2 10 3

3

1 2 3 4 5

58

10 20 2 20 6

5

1 3 5

22

10 6 6

6

2 3 5

14

8 3 3

7

1 3 4

16

10 2 4

9

1 3 5

22

10 6 6

10

Each line of the database is:

Note that the value in the second column for each line is the sum of the values in the third column. And the value of the fourth column is optional.

In this example database, there is a few problems, highlighted in red above. These problems are in the first transaction. It can be observed that the item 5 appears three times in the first column, which is not allowed. Moreover, the value in the second column is incorrect (it should be the sum of the values in the third column. Besides, the items in the first column are not sorted according to some order (a requirement).

What is the output?

The output is a transaction database where each transaction is sorted and no item appears more than once, and the second column is correctly calculated. For example, the output using the above example is:

Items

Transaction utility

Item utilities for this transaction

Timestamp

2 3 5

1109

4 2 1103

1

2 3 4 5

18

8 3 4 3

3

2 3 4 5

9

4 2 10 3

3

1 2 3 4 5

58

10 20 2 20 6

5

1 3 5

22

10 6 6

6

2 3 5

14

8 3 3

7

1 3 4

16

10 2 4

9

1 3 5

22

10 6 6

10

The modifications are highlighted in red color, and are in the first transaction. Rather than having the item "5" three times in the first column, it appears once. Moreover, the transaction utility (second column) is now correctly calculated as 1109. Lastly, in the third column the utility values of 5 is not 1000 + 100 + 3 = 1103 since we have combined the three occurrences of item "5" together.

Input file format

The input file format of LHUI-Miner is defined as follows. It is a text file. Each line represents a transaction. Each line is composed of four sections, as follows.

For example, for the previous example, the input file is defined as follows:

5 2 3 5 5:9:1000 4 2 3 100:1
2 3 4 5:18:8 3 4 3:3
2 3 4 5:9:4 2 10 3:3
1 2 3 4 5:58:10 20 2 20 6:5
1 3 5:22:10 6 6:6
2 3 5:14:8 3 3:7
1 3 4:16:10 2 4:9
1 3 5:22:10 6 6:10

Consider the last line. It means that the transaction {1, 3, 5} has a total utility of 22 and that items 1, 3 and 5 respectively have a utility of 10, 6 and 6 in this transaction, and the transaction were made at time 10. The other lines follow the same format.

Output file format

The output file format is the same as the input format. But the problems contained in the input file have been fixed.

2 3 5:1109:4 2 1103:1
2 3 4 5:18:8 3 4 3:3
2 3 4 5:19:4 2 10 3:3
1 2 3 4 5:58:10 20 2 20 6:5
1 3 5:22:10 6 6:6
2 3 5:14:8 3 3:7
1 3 4:16:10 2 4:9
1 3 5:22:10 6 6:10