Calculate Statistics for a Product Transaction Database (SPMF documentation)
This example explains how to calculate statistics for a product transaction database using the SPMF open-source data mining library.
How to run this example?
- If you are using the graphical interface, (1) choose the "Calculate_stats_for_a_transaction_database" algorithm, (2) choose the input file contextVME.txt, and (3) click "Run algorithm".
- If you want to execute this example from the command line, then execute this command in a folder containing spmf.jar and the input file contextVME.txt:
java -jar spmf.jar run Calculate_stats_for_a_transaction_database contextVME.txt no_output_file
- If you are using the source code version of SPMF, launch the file "MainTestStatsTransactionDBProduct.java" in the package ca.pfv.SPMF.tests.
What is this tool?
This tool generates statistics about a product transaction database, as used by algorithms such as VME. It can be used, for example, to find out whether a database is dense or sparse before applying a data mining algorithm.
What is the input?
The input is a product transaction database (aka formal context). A product is defined as the set of items that are used to assemble it. Moreover, each product is annotated with a profit (a positive integer) that indicates how much money this product generates for the company. For example, consider the following product database, consisting of 6 products and 7 items (this example is taken from the article of Deng & Xu, 2010). The first line indicates that product 1 generates a total profit of 50 $ for the company and that its assembly requires parts 2, 3, 4 and 6. This product database is provided in the file "contextVME.txt" of the SPMF distribution:
product | profit | items |
product1 | 50$ | {2, 3, 4, 6} |
product2 | 20$ | {2, 5, 7} |
product3 | 50$ | {1, 2, 3, 5} |
product4 | 800$ | {1, 2, 4} |
product5 | 30$ | {6, 7} |
product6 | 50$ | {3, 4} |
What is the output?
The output consists of statistics about the product transaction database. For example, applying the tool to the product transaction database given above produces the following statistics:
============ TRANSACTION DATABASE STATS ==========
Number of transactions : 6
File C:\Users\Phil\Desktop\test_files\contextVME.txt
Number of distinct items: 7
Smallest item id: 1
Largest item id: 7
Average number of items per transaction: 3.0 standard deviation: 0.816496580927726 variance: 0.6666666666666666
Average profit per product (transaction): 166.66666666666666 standard deviation: 283.4705550062573 variance: 80355.55555555555
Average item support in the database: 2.5714285714285716 standard deviation: 0.7284313590846836 variance: 0.5306122448979592 min value: 2 max value: 4
Database density: 42.857142857142854 %
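The figures above can be checked by hand. The following is a minimal sketch, not SPMF's own code: the class name ProductDbStats and its methods are hypothetical, and the formulas are assumptions that happen to agree with the reported values for this example (population variance, dividing by n, and density taken as the total number of item occurrences divided by the number of transactions times the number of distinct items).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class, not part of SPMF.
public class ProductDbStats {

    /** Arithmetic mean of the given values. */
    public static double mean(double[] values) {
        double sum = 0;
        for (double v : values) sum += v;
        return sum / values.length;
    }

    /** Population variance (divides by n, not n - 1); an assumed formula. */
    public static double variance(double[] values) {
        double m = mean(values);
        double sumSq = 0;
        for (double v : values) sumSq += (v - m) * (v - m);
        return sumSq / values.length;
    }

    /** Support of each item: the number of transactions containing it. */
    public static Map<Integer, Integer> itemSupports(int[][] transactions) {
        Map<Integer, Integer> supports = new HashMap<>();
        for (int[] t : transactions) {
            for (int item : t) supports.merge(item, 1, Integer::sum);
        }
        return supports;
    }

    /** Density in percent: item occurrences / (transactions x distinct items). */
    public static double density(int[][] transactions) {
        long occurrences = 0;
        for (int[] t : transactions) occurrences += t.length;
        int distinct = itemSupports(transactions).size();
        return 100.0 * occurrences / ((double) transactions.length * distinct);
    }

    public static void main(String[] args) {
        // The six products of contextVME.txt, items only.
        int[][] db = {
            {2, 3, 4, 6}, {2, 5, 7}, {1, 2, 3, 5}, {1, 2, 4}, {6, 7}, {3, 4}
        };
        // The matching profits, in the same order.
        double[] profits = {50, 20, 50, 800, 30, 50};

        System.out.println("Average profit: " + mean(profits));
        System.out.println("Profit variance: " + variance(profits));
        System.out.println("Profit standard deviation: " + Math.sqrt(variance(profits)));
        System.out.println("Item supports: " + itemSupports(db));
        System.out.println("Database density: " + density(db) + " %");
    }
}
```

For this database there are 18 item occurrences over 6 transactions and 7 distinct items, so the density is 18 / 42, or about 42.86 %, which matches the value reported by the tool.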
Input file format
The input file format is defined as follows. It is a text file. Each line represents a transaction. Each line is composed of two sections, as follows.
- First, the profit of the transaction is indicated by an integer number, followed by a single space.
- Second, the items in the transaction are listed. An item is represented by a positive integer. Each item is separated from the following item by a space. It is assumed that items are sorted according to a total order and that no item can appear twice in the same transaction.
For example, for the previous example, the input file is defined as follows:
50 2 3 4 6
20 2 5 7
50 1 2 3 5
800 1 2 4
30 6 7
50 3 4
Consider the first line. It means that the first transaction has a profit of 50 and contains the items 2, 3, 4 and 6. The following lines follow the same format.
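The format described above can be read with a few lines of Java. This is only an illustrative sketch under the stated format (profit first, then the items, all separated by spaces); the class ProductLineParser and its nested Product type are hypothetical names and are not part of the SPMF API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser for one line of a product transaction file; not SPMF code.
public class ProductLineParser {

    /** Simple holder for one product: its profit and the items it is made of. */
    public static final class Product {
        public final int profit;
        public final List<Integer> items;

        Product(int profit, List<Integer> items) {
            this.profit = profit;
            this.items = items;
        }
    }

    /** Parses a line such as "50 2 3 4 6": the first token is the profit, the rest are items. */
    public static Product parseLine(String line) {
        String[] tokens = line.trim().split("\\s+");
        int profit = Integer.parseInt(tokens[0]);
        List<Integer> items = new ArrayList<>();
        for (int i = 1; i < tokens.length; i++) {
            items.add(Integer.parseInt(tokens[i]));
        }
        return new Product(profit, items);
    }

    public static void main(String[] args) {
        Product first = parseLine("50 2 3 4 6");
        System.out.println("profit=" + first.profit + " items=" + first.items);
    }
}
```

The sketch does not check that items are sorted or that no item appears twice; the format assumes both, so a stricter reader could reject lines that violate them.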