Fix a Transaction Database with Utility and Time Information (SPMF documentation)
This example explains how to fix a transaction database with utility and time information using the SPMF open-source data mining library.
How to run this example?
- If you are using the graphical interface, (1) choose the "Fix_a_transaction_database_with_utility_time" algorithm, (2) choose the input file DB_broken.txt (3) set the output file to "output.txt", and then (4) click "Run algorithm".
- If you want to execute this example from the command
line, then execute this command:
java -jar spmf.jar runFix_a_transaction_database_with_utility_time DB_broken.txt output.txt in a folder containing spmf.jar and the input file DB_broken.txt. - If you are using the source code version of SPMF, launch the file "MainTestFixTDBTimeUtility.java" in the package ca.pfv.SPMF.tests.
What is this tool?
The tool "Fix_a_transaction_database" is a small program that fix some common problems in a transaction database file in SPMF format. The tool fixes two common problems: (1) an item appears more than once in a transaction, (2) transactions are not sorted. To fix the first problem the tool will keep only one occurrence of each item in a transaction. So if an items appears more than once in a transaction, it will appears only once after applying the tool. To fix the second problem, the tool sorts each transaction according to the lexicographical ordering (because it is required by most itemset and association rule mining algorithms).
What is the input?
The input is a transaction database with utility information in the SPMF format. Let's consider the following database consisting of 8 transactions (t1,t2...t8) and 5 items (1, 2, 3, 4, 5), and each transaction is associated with a timestamp. This database is provided in the text file "DB_broken.txt" in the package ca.pfv.spmf.tests of the SPMF distribution
Items |
Transaction utility |
Item utilities for this transaction |
Timestamp |
5 2 3 5 5 |
9 |
1000 4 2 3 1000 |
1 |
2 3 4 5 |
18 |
8 3 4 3 |
3 |
2 3 4 5 |
9 |
4 2 10 3 |
3 |
1 2 3 4 5 |
58 |
10 20 2 20 6 |
5 |
1 3 5 |
22 |
10 6 6 |
6 |
2 3 5 |
14 |
8 3 3 |
7 |
1 3 4 |
16 |
10 2 4 |
9 |
1 3 5 |
22 |
10 6 6 |
10 |
Each line of the database is:
- a set of items (the first column of the table),
- the sum of the utilities (e.g. profit) of these items in this transaction (the second column of the table),
- the utility of each item for this transaction (e.g. profit generated by this item for this transaction)(the third column of the table).
- the timestamp which indicated when this transaction was made (the fourth column of the table)
Note that the value in the second column for each line is the sum of the values in the third column. And the value of the fourth column is optional.
In this example database, there is a few problems, highlighted in red above. These problems are in the first transaction. It can be observed that the item 5 appears three times in the first column, which is not allowed. Moreover, the value in the second column is incorrect (it should be the sum of the values in the third column. Besides, the items in the first column are not sorted according to some order (a requirement).
What is the output?
The output is a transaction database where each transaction is sorted and no item appears more than once, and the second column is correctly calculated. For example, the output using the above example is:
Items |
Transaction utility |
Item utilities for this transaction |
Timestamp |
2 3 5 |
1109 |
4 2 1103 |
1 |
2 3 4 5 |
18 |
8 3 4 3 |
3 |
2 3 4 5 |
9 |
4 2 10 3 |
3 |
1 2 3 4 5 |
58 |
10 20 2 20 6 |
5 |
1 3 5 |
22 |
10 6 6 |
6 |
2 3 5 |
14 |
8 3 3 |
7 |
1 3 4 |
16 |
10 2 4 |
9 |
1 3 5 |
22 |
10 6 6 |
10 |
The modifications are highlighted in red color, and are in the first transaction. Rather than having the item "5" three times in the first column, it appears once. Moreover, the transaction utility (second column) is now correctly calculated as 1109. Lastly, in the third column the utility values of 5 is not 1000 + 100 + 3 = 1103 since we have combined the three occurrences of item "5" together.
Input file format
The input file format of LHUI-Miner is defined as follows. It is a text file. Each line represents a transaction. Each line is composed of four sections, as follows.
- First, the items contained in the transaction are listed. An item is represented by a positive integer. Each item is separated from the next item by a single space. It is assumed that all items within a same transaction (line) are sorted according to a total order (e.g. ascending order) and that no item can appear twice within the same transaction.
- Second, the symbol ":" appears and is followed by the transaction utility (an integer).
- Third, the symbol ":" appears and is followed by the utility of each item in this transaction (an integer), separated by single spaces.
- Fourth, the symbol ":" appears and is followed by the timestamp (an integer) indicating when the transaction were made.
For example, for the previous example, the input file is defined as follows:
5 2 3 5 5:9:1000 4 2 3 100:1
2 3 4 5:18:8 3 4 3:3
2 3 4 5:9:4 2 10 3:3
1 2 3 4 5:58:10 20 2 20 6:5
1 3 5:22:10 6 6:6
2 3 5:14:8 3 3:7
1 3 4:16:10 2 4:9
1 3 5:22:10 6 6:10
Consider the last line. It means that the transaction {1, 3, 5} has a total utility of 22 and that items 1, 3 and 5 respectively have a utility of 10, 6 and 6 in this transaction, and the transaction were made at time 10. The other lines follow the same format.
Output file format
The output file format is the same as the input format. But the problems contained in the input file have been fixed.
2 3 5:1109:4 2 1103:1
2 3 4 5:18:8 3 4 3:3
2 3 4 5:19:4 2 10 3:3
1 2 3 4 5:58:10 20 2 20 6:5
1 3 5:22:10 6 6:6
2 3 5:14:8 3 3:7
1 3 4:16:10 2 4:9
1 3 5:22:10 6 6:10