Converting a Sequence Database with Cost Values to a Transaction Database with Cost Values (SPMF documentation)

This example explains how to convert a sequence database to a transaction database using the SPMF open-source data mining library.

How to run this example?

What is this tool?

This tool converts a sequence database with cost values to a transaction database with cost values by removing the ordering between items. It is useful when you have a sequence database with cost values and want to apply an algorithm designed for transaction databases with cost values.

What is the input?

The tool takes two parameters as input:

The input file is a sequence database in which each sequence is an ordered list of events, each event has a cost value (a positive integer), and each sequence has a utility value (a numeric value such that a higher value indicates a better outcome or result). Moreover, the user must set the following parameters: (1) a minimum support threshold minsup (a positive integer), (2) a maximum cost threshold maxcost (a positive integer), and (3) a minimum occupancy threshold minoccupancy (a value in the [0,1] interval).

For example, consider the following sequence database, which is provided in the file example_CEPN.txt of the SPMF distribution:

1[20] -1 2[40] -1 3[50] -1 4[20] -1 -2 SUtility:80
2[25] -1 4[12] -1 3[30] -1 5[25] -1 -2 SUtility:60
1[25] -1 5[14] -1 2[30] -1 -2 SUtility:50
1[40] -1 2[16] -1 4[40] -1 -2 SUtility:40
2[20] -1 5[24] -1 3[20] -1 -2 SUtility:70

This database contains five lines. Each line is a sequence.

Moreover, each sequence (line) is an ordered list of events separated by -1.

An event is represented by a positive integer and is followed by a cost value (e.g. the time spent on the event) indicated between square brackets [ ]. A cost is a positive integer.

The end of a sequence is indicated by -2. Finally, at the end of each line, the keyword "SUtility:" is followed by a positive integer which indicates how good the outcome of this sequence is (e.g. it could represent a final exam score).

For example, the first line indicates that event "1" had a cost of 20, was followed by event "2" with a cost of 40, followed by event "3" with a cost of 50, followed by event "4" with a cost of 20. Moreover, this sequence has a utility of 80, which is a good outcome compared to the other sequences in this database. The other sequences follow the same format.
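The input format described above can be read with a few lines of code. The following is a minimal Python sketch, not SPMF's own implementation; the names EVENT_PATTERN and parse_sequence_line are hypothetical and chosen only for illustration. It assumes every line ends with the "SUtility:" keyword followed by an integer.

```python
import re

# One event looks like "<item>[<cost>]", for example "3[50]".
EVENT_PATTERN = re.compile(r"(\d+)\[(\d+)\]")


def parse_sequence_line(line):
    """Return ([(item, cost), ...], utility) for one line of the input file.

    Hypothetical helper for illustration; it is not part of SPMF.
    """
    body, _, utility_text = line.partition("SUtility:")
    events = []
    for token in body.split():
        if token == "-1":   # separator between events
            continue
        if token == "-2":   # end-of-sequence marker
            break
        match = EVENT_PATTERN.fullmatch(token)
        if match is None:
            raise ValueError(f"unexpected token: {token!r}")
        events.append((int(match.group(1)), int(match.group(2))))
    return events, int(utility_text)


# First sequence of example_CEPN.txt
events, utility = parse_sequence_line(
    "1[20] -1 2[40] -1 3[50] -1 4[20] -1 -2 SUtility:80"
)
print(events, utility)
```

The events are kept in their original order here, because the order is only discarded by the conversion step.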

This database could, for example, represent sequences of learning activities performed by learners, where the events 1, 2, 3, 4 and 5 are learning activities, the cost values are the time spent on each learning activity, and the utility is the grade obtained at a final exam.

What is the output?

The output is a transaction database in SPMF format with cost values. A transaction database is a set of transactions. Each transaction is an unordered set of items (symbols) represented by positive integers. The output for this example is the following transaction database. It contains five transactions. The first transaction contains the set of items {1, 2, 3, 4}.

Transaction id   Utility   Items          Costs
t1               80        {1, 2, 3, 4}   20 40 50 20
t2               60        {2, 3, 4, 5}   25 30 12 25
t3               50        {1, 2, 5}      25 30 14
t4               40        {1, 2, 4}      40 16 40
t5               70        {2, 3, 5}      20 20 24

The first transaction t1 has a utility of 80. Moreover, it indicates that items 1, 2, 3 and 4 appear with cost values of 20, 40, 50 and 20, respectively.
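The step that removes the ordering can be sketched as follows. This is an illustrative Python sketch with a hypothetical function name (to_transaction), not the code used by SPMF. It assumes that items are listed in ascending order, which is inferred from the example above, and that each cost stays attached to its item. The documentation does not say what happens when an item occurs more than once in a sequence, so the sketch simply rejects that case.

```python
def to_transaction(events):
    """Turn ordered (item, cost) pairs into (items, costs) sorted by item.

    Hypothetical helper for illustration; it is not part of SPMF.
    """
    seen = set()
    for item, _ in events:
        if item in seen:
            # Assumption: repeated items are not covered by the documentation,
            # so this sketch refuses them instead of guessing a merge rule.
            raise ValueError(f"item {item} occurs more than once")
        seen.add(item)
    ordered = sorted(events)  # sort by item, which discards the sequence order
    items = [item for item, _ in ordered]
    costs = [cost for _, cost in ordered]
    return items, costs


# Second sequence of the example: 2[25] -1 4[12] -1 3[30] -1 5[25]
items, costs = to_transaction([(2, 25), (4, 12), (3, 30), (5, 25)])
print(items, costs)  # [2, 3, 4, 5] [25, 30, 12, 25]
```

Note how the cost 12 of item 4 moves to the third position together with its item, which matches transaction t2 in the table.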

Output file format

The output file format is defined as follows. It is a text file. Each line is a transaction. First, the items of the transaction are listed as integers separated by single spaces. Then the symbol ":" appears, followed by the utility of the transaction (an integer). Then the symbol ":" appears again, followed by the list of cost values of the items (integers separated by single spaces), in the same order as the items.

For example, for the previous example, the output file is defined as follows:

1 2 3 4:80:20 40 50 20
2 3 4 5:60:25 30 12 25
1 2 5:50:25 30 14
1 2 4:40:40 16 40
2 3 5:70:20 20 24
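As a rough illustration of this output format, the following Python sketch writes and reads one output line. The function names format_transaction and parse_transaction_line are hypothetical and are not part of SPMF.

```python
def format_transaction(items, utility, costs):
    """Build one output line: "<items>:<utility>:<costs>"."""
    if len(items) != len(costs):
        raise ValueError("each item needs exactly one cost value")
    items_text = " ".join(str(item) for item in items)
    costs_text = " ".join(str(cost) for cost in costs)
    return f"{items_text}:{utility}:{costs_text}"


def parse_transaction_line(line):
    """Read one output line back into (items, utility, costs)."""
    items_text, utility_text, costs_text = line.strip().split(":")
    items = [int(token) for token in items_text.split()]
    costs = [int(token) for token in costs_text.split()]
    return items, int(utility_text), costs


line = format_transaction([1, 2, 3, 4], 80, [20, 40, 50, 20])
print(line)  # 1 2 3 4:80:20 40 50 20
print(parse_transaction_line(line))
```

Reading a line back and comparing it with the original values is a simple way to check that a converted file is well formed.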