Generating synthetic utility values for a transaction database without utility values (SPMF documentation)

This example explains how to generate synthetic utility values for a transaction database without utility values using the SPMF open-source data mining library.

How to run this example?

What is this tool?

This tool generates synthetic utility values for a transaction database that has no utility values. This is useful for generating datasets that can be used for high-utility itemset mining.

Transaction databases with synthetic utility values are often used in the data mining literature to evaluate high-utility itemset mining algorithms.

What is the input?

The tool takes as input a file (a transaction database) and two parameters that are used for generating the synthetic utility values.

A transaction database is a set of transactions, and each transaction is a set of items. For example, consider the following transaction database. It contains 5 transactions (t1, t2, ..., t5) and 5 items (1, 2, 3, 4, 5). The first transaction represents the set of items 1, 3 and 4. This database is provided as the file contextPasquier99.txt in the SPMF distribution. Note that an item is not allowed to appear twice in the same transaction and that the items in a transaction are assumed to be sorted in lexicographical order.

Transaction id   Items
t1               {1, 3, 4}
t2               {2, 3, 5}
t3               {1, 2, 3, 5}
t4               {2, 5}
t5               {1, 2, 3, 5}
 

What is the output?

The output is a transaction database with utility information.

It is a transaction database where each item appearing in a transaction has a purchase quantity (a.k.a. internal utility). Furthermore, each item has a weight (a.k.a. external utility) that can be interpreted as the unit profit for one unit of the item.
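The relation between purchase quantities, unit profits and item utilities can be illustrated with a minimal Java sketch. It is not the generator implemented in SPMF: the class name, the parameters maxQuantity, maxUnitProfit and seed, and the choice of uniform random draws are all assumptions made for illustration. Each distinct item receives one unit profit, each occurrence of an item receives a quantity, and the utility of the item in a transaction is the product of the two.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

/** Illustrative sketch only; this is NOT the SPMF implementation. */
class SyntheticUtilitySketch {

    /**
     * Returns, for each transaction, the utility of each item
     * (quantity * unit profit), in the same order as the items.
     * maxQuantity, maxUnitProfit and seed are hypothetical parameters.
     */
    static int[][] generate(int[][] transactions, int maxQuantity,
                            int maxUnitProfit, long seed) {
        Random random = new Random(seed);
        // External utility: one unit profit per distinct item, shared by all transactions.
        Map<Integer, Integer> unitProfit = new HashMap<>();
        int[][] utilities = new int[transactions.length][];
        for (int t = 0; t < transactions.length; t++) {
            utilities[t] = new int[transactions[t].length];
            for (int i = 0; i < transactions[t].length; i++) {
                int item = transactions[t][i];
                // Assumption: unit profits are drawn uniformly from 1..maxUnitProfit.
                int profit = unitProfit.computeIfAbsent(
                        item, k -> 1 + random.nextInt(maxUnitProfit));
                // Internal utility: a quantity drawn uniformly from 1..maxQuantity.
                int quantity = 1 + random.nextInt(maxQuantity);
                utilities[t][i] = quantity * profit;
            }
        }
        return utilities;
    }

    public static void main(String[] args) {
        int[][] db = { {1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}, {1, 2, 3, 5} };
        int[][] utilities = generate(db, 10, 5, 42L);
        for (int t = 0; t < db.length; t++) {
            System.out.println(Arrays.toString(db[t]) + " -> "
                    + Arrays.toString(utilities[t]));
        }
    }
}
```

Using a fixed seed makes the generated values reproducible, which is convenient when the same synthetic dataset must be regenerated for several experiments.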

In SPMF, a transaction database with utility information is represented as follows. Consider the following database:


      Items      Transaction utility   Item utilities for this transaction
t1    1 3 4      9                     1 3 5
t2    2 3 5      14                    3 3 8
t3    1 2 3 5    9                     1 5 2 1
t4    2 5        12                    6 6
t5    1 2 3 5    11                    2 3 4 2

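Each line of the database consists of the items of the transaction (first column), the transaction utility (second column), and the utility of each item in this transaction (third column).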

Note that the value in the second column for each line is the sum of the values in the third column.

What are real-life examples of such a database? There are several real-life applications. One is a customer transaction database, where each transaction represents the items purchased by a customer. The first customer, named "t1", bought items 1, 3 and 4. The amounts of money spent on these items are respectively $1, $3 and $5. The total amount of money spent in this transaction is 1 + 3 + 5 = $9.

Input file format

The input file format is defined as follows. It is a text file. An item is represented by a positive integer. A transaction is a line in the text file. In each line (transaction), items are separated by a single space. It is assumed that all items within the same transaction (line) are sorted according to a total order (e.g. ascending order) and that no item can appear twice within the same line.

For the previous example, the input file is defined as follows:

1 3 4
2 3 5
1 2 3 5
2 5
1 2 3 5
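
As an illustration of this input format, the following Java sketch parses transactions from text lines and checks the stated assumptions. It is not SPMF code: the class and method names are hypothetical, ascending order is assumed as the total order, and skipping blank lines is an additional assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only; names are hypothetical and not part of SPMF. */
class TransactionReaderSketch {

    /** Parses one transaction per line; items are separated by single spaces. */
    static List<int[]> parse(List<String> lines) {
        List<int[]> transactions = new ArrayList<>();
        for (String rawLine : lines) {
            String line = rawLine.trim();
            if (line.isEmpty()) {
                continue; // assumption: blank lines are ignored
            }
            String[] tokens = line.split(" ");
            int[] items = new int[tokens.length];
            for (int i = 0; i < tokens.length; i++) {
                items[i] = Integer.parseInt(tokens[i]);
                if (items[i] <= 0) {
                    throw new IllegalArgumentException("Item is not a positive integer: " + line);
                }
                // Strictly ascending order also guarantees that no item appears twice.
                if (i > 0 && items[i] <= items[i - 1]) {
                    throw new IllegalArgumentException("Items are not strictly ascending: " + line);
                }
            }
            transactions.add(items);
        }
        return transactions;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("1 3 4", "2 3 5", "1 2 3 5", "2 5", "1 2 3 5");
        for (int[] transaction : parse(lines)) {
            System.out.println(Arrays.toString(transaction));
        }
    }
}
```

To read an actual file, the lines could be obtained with java.nio.file.Files.readAllLines and passed to parse.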

Output file format

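The output file format is defined as follows. It is a text file. Each line represents a transaction with utility information. Each line is composed of three sections separated by the character ":". The first section lists the items of the transaction, separated by single spaces. The second section is the transaction utility. The third section lists the utility of each item in this transaction, separated by single spaces and in the same order as the items.

For example, an output file that can be generated for the example input file could be the following: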

1 3 4:9:1 3 5
2 3 5:14:3 3 8
1 2 3 5:9:1 5 2 1
2 5:12:6 6
1 2 3 5:11:2 3 4 2


      Items      Transaction utility   Item utilities for this transaction
t1    1 3 4      9                     1 3 5
t2    2 3 5      14                    3 3 8
t3    1 2 3 5    9                     1 5 2 1
t4    2 5        12                    6 6
t5    1 2 3 5    11                    2 3 4 2

Consider the first line. It means that the transaction {1, 3, 4} has a total utility of 9 and that items 1, 3 and 4 respectively have a utility of 1, 3 and 5 in this transaction.
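
The interpretation of a line of the output file can be sketched in Java as follows. This is a minimal illustration, not SPMF code: the class name and its fields are hypothetical. The sketch splits a line on the ":" character and checks that the transaction utility equals the sum of the item utilities.

```java
import java.util.Arrays;

/** Illustrative sketch only; names are hypothetical and not part of SPMF. */
class UtilityLineSketch {
    final int[] items;
    final int transactionUtility;
    final int[] utilities;

    private UtilityLineSketch(int[] items, int transactionUtility, int[] utilities) {
        this.items = items;
        this.transactionUtility = transactionUtility;
        this.utilities = utilities;
    }

    /** Parses a line such as "1 3 4:9:1 3 5". */
    static UtilityLineSketch parse(String line) {
        String[] sections = line.split(":");
        if (sections.length != 3) {
            throw new IllegalArgumentException("Expected three sections: " + line);
        }
        int[] items = toInts(sections[0]);
        int transactionUtility = Integer.parseInt(sections[1].trim());
        int[] utilities = toInts(sections[2]);
        if (items.length != utilities.length) {
            throw new IllegalArgumentException("One utility is expected per item: " + line);
        }
        // The transaction utility must be the sum of the item utilities.
        if (Arrays.stream(utilities).sum() != transactionUtility) {
            throw new IllegalArgumentException("Utilities do not sum to the transaction utility: " + line);
        }
        return new UtilityLineSketch(items, transactionUtility, utilities);
    }

    private static int[] toInts(String section) {
        return Arrays.stream(section.trim().split(" "))
                .mapToInt(Integer::parseInt)
                .toArray();
    }

    public static void main(String[] args) {
        UtilityLineSketch t1 = parse("1 3 4:9:1 3 5");
        System.out.println("items=" + Arrays.toString(t1.items)
                + " transactionUtility=" + t1.transactionUtility
                + " utilities=" + Arrays.toString(t1.utilities));
    }
}
```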