Calculate Statistics for a Transaction Database (SPMF documentation)
This example explains how to calculate statistics for a transaction database using the SPMF open-source data mining library.
How to run this example?
- If you are using the graphical interface, (1) choose the "Calculate_stats_for_a_transaction_database" algorithm, (2) choose the input file contextPasquier99.txt, and (3) click "Run algorithm".
- If you want to execute this example from the command line, then execute this command:
java -jar spmf.jar run Calculate_stats_for_a_transaction_database contextPasquier99.txt no_output_file
in a folder containing spmf.jar and the input file contextPasquier99.txt.
- If you are using the source code version of SPMF, launch the file "MainTestGenerateTransactionDatabaseStats.java" in the package ca.pfv.spmf.test.
What is this tool?
This tool generates statistics about a transaction database. It can be used, for example, to find out whether a database is dense or sparse before applying a data mining algorithm.
What is the input?
The input is a transaction database (aka formal context). A transaction database is a set of transactions. Each transaction is an unordered set of items (symbols) represented by positive integers. For example, consider the following database. This database is provided in the file "contextPasquier99.txt" of the SPMF distribution. It contains five transactions. The first transaction contains the set of items {1, 3, 4}.
Transaction id | Items |
t1 | {1, 3, 4} |
t2 | {2, 3, 5} |
t3 | {1, 2, 3, 5} |
t4 | {2, 5} |
t5 | {1, 2, 3, 5} |
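To make the definition concrete, the table above can be held in memory as a list of item sets. The following is only a minimal sketch: the class TransactionDatabaseSketch and its methods are hypothetical names chosen for illustration and are not part of the SPMF API. It assumes Java 9 or later (for List.of and Set.of).

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch only: TransactionDatabaseSketch is a hypothetical class,
// not part of the SPMF API.
final class TransactionDatabaseSketch {

    // The five transactions t1..t5 from the table above.
    // Each transaction is an unordered set of positive integers.
    static List<Set<Integer>> exampleDatabase() {
        return List.of(
            Set.of(1, 3, 4),
            Set.of(2, 3, 5),
            Set.of(1, 2, 3, 5),
            Set.of(2, 5),
            Set.of(1, 2, 3, 5));
    }

    // Collects every item that occurs in at least one transaction,
    // kept in ascending order by the TreeSet.
    static TreeSet<Integer> distinctItems(List<Set<Integer>> database) {
        TreeSet<Integer> items = new TreeSet<>();
        for (Set<Integer> transaction : database) {
            items.addAll(transaction);
        }
        return items;
    }

    public static void main(String[] args) {
        List<Set<Integer>> database = exampleDatabase();
        TreeSet<Integer> items = distinctItems(database);
        System.out.println("Number of transactions: " + database.size());
        System.out.println("Number of distinct items: " + items.size());
        System.out.println("Smallest item id: " + items.first());
        System.out.println("Largest item id: " + items.last());
    }
}
```

For this database the sketch reports 5 transactions and 5 distinct items, with item ids ranging from 1 to 5.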
What is the output?
The output is a set of statistics about the transaction database. For example, running the tool on the example database above produces the following statistics:
- The number of transactions is 5
- File C:\Users\ph\Desktop\SPMF\ca\pfv\spmf\test\contextPasquier99.txt
- The number of distinct items is 5
- The smallest item id is 1
- The largest item id is 5
- The average number of items per transactions is 3 (standard deviation: 0.7 variance: 0.5)
- The average support of items in this database is : 2.4
(standard deviation: 0.8 variance: 0.64 min value: 1 max value: 3)
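The kind of computation behind statistics such as these can be sketched as follows. This is not SPMF's implementation: ItemSupportSketch and its methods are hypothetical names, the variance is assumed to be the population variance (dividing by n), and the density measure (average transaction length divided by the number of distinct items) is one informal choice rather than a value the tool reports. The sketch works on all five transactions of the example database, and its results are not guaranteed to match the figures printed by the tool, which may round or define some quantities differently.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only: ItemSupportSketch is a hypothetical class,
// not SPMF's own implementation.
final class ItemSupportSketch {

    // Support of an item = number of transactions that contain it.
    // Relies on no item appearing twice within the same transaction.
    static Map<Integer, Integer> supports(List<List<Integer>> database) {
        Map<Integer, Integer> support = new TreeMap<>();
        for (List<Integer> transaction : database) {
            for (int item : transaction) {
                support.merge(item, 1, Integer::sum);
            }
        }
        return support;
    }

    // Arithmetic mean; returns NaN for an empty collection.
    static double mean(Collection<Integer> values) {
        double sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum / values.size();
    }

    // Population variance (divides by n, not n - 1); an assumption of this sketch.
    static double variance(Collection<Integer> values) {
        double m = mean(values);
        double sumOfSquares = 0;
        for (int v : values) {
            sumOfSquares += (v - m) * (v - m);
        }
        return sumOfSquares / values.size();
    }

    public static void main(String[] args) {
        List<List<Integer>> database = List.of(
            List.of(1, 3, 4),
            List.of(2, 3, 5),
            List.of(1, 2, 3, 5),
            List.of(2, 5),
            List.of(1, 2, 3, 5));

        List<Integer> sizes = new ArrayList<>();
        for (List<Integer> transaction : database) {
            sizes.add(transaction.size());
        }
        Map<Integer, Integer> support = supports(database);

        System.out.println("Support per item: " + support);
        System.out.printf("Average transaction length: %.2f (variance %.2f)%n",
            mean(sizes), variance(sizes));
        System.out.printf("Average item support: %.2f (variance %.2f)%n",
            mean(support.values()), variance(support.values()));
        // Assumed density measure: values near 1 suggest a dense database.
        System.out.printf("Density: %.2f%n", mean(sizes) / support.size());
    }
}
```

A TreeMap is used so that the supports are printed in ascending order of item id. The standard deviation, if needed, is the square root of the variance (Math.sqrt).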
Input file format
The input file format is defined as follows. It is a text file. An item is represented by a positive integer. A transaction is a line in the text file. In each line (transaction), items are separated by a single space. It is assumed that all items within the same transaction (line) are sorted according to a total order (e.g. ascending order) and that no item appears twice within the same line.
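The format rules above can be checked with a small parser. The following is a sketch under stated assumptions, not the reader used by SPMF: TransactionFileFormatSketch and its methods are hypothetical names, ascending order is assumed to be the total order in use, and blank lines are skipped as a convenience of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: TransactionFileFormatSketch is a hypothetical class,
// not the reader used by SPMF.
final class TransactionFileFormatSketch {

    // Parses one line into a transaction and enforces the format rules:
    // positive integers, single-space separators, strictly ascending order
    // (which also rules out an item appearing twice).
    static List<Integer> parseTransaction(String line) {
        List<Integer> items = new ArrayList<>();
        for (String token : line.trim().split(" ")) {
            // Throws NumberFormatException for non-numeric or empty tokens,
            // e.g. when two spaces separate items.
            int item = Integer.parseInt(token);
            if (item <= 0) {
                throw new IllegalArgumentException("Item is not a positive integer: " + token);
            }
            if (!items.isEmpty() && item <= items.get(items.size() - 1)) {
                throw new IllegalArgumentException("Items are not in strictly ascending order: " + line);
            }
            items.add(item);
        }
        return items;
    }

    // Parses all lines of a file; blank lines are skipped (an assumption of this sketch).
    static List<List<Integer>> parseDatabase(List<String> lines) {
        List<List<Integer>> database = new ArrayList<>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                database.add(parseTransaction(line));
            }
        }
        return database;
    }

    public static void main(String[] args) {
        // The five transactions of the example database, one per line.
        List<String> lines = List.of("1 3 4", "2 3 5", "1 2 3 5", "2 5", "1 2 3 5");
        System.out.println(parseDatabase(lines));
    }
}
```

To read an actual file, the lines could be obtained with java.nio.file.Files.readAllLines(Path.of("contextPasquier99.txt")) and passed to parseDatabase.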
For example, for the previous example, the input file could be the following:
1 3 4
2 3 5
1 2 3 5
2 5