Resize a database in SPMF format (a text file) (SPMF documentation)

This example explains how to resize a database in SPMF format (a text file) using the SPMF open-source data mining library.

How to run this example?

If you are using the graphical interface, (1) choose the "Resize_a_database" algorithm, (2) choose the input file DB_UtilityPerHUIs.txt, (3) enter 0.7 as the percentage (which means 70%), and (4) click "Run algorithm".
If you want to execute this example from the command line, then execute this command:
java -jar spmf.jar run Resize_a_database DB_UtilityPerHUIs.txt output.txt 0.7 in a folder containing spmf.jar and the input file DB_UtilityPerHUIs.txt.
If you are using the source code version of SPMF, launch the file "MainTestResizeDatabaseTool.java" in the package ca.pfv.spmf.test.

What is this tool?

This tool is a small program that resizes a database by keeping X% of the transactions of an original database. The tool takes as input an original database and a percentage X. It then outputs a new file containing the first X% of the lines of data from the original database. For example, if a database contains 100,000 transactions and this tool is used with a percentage of 75%, the output will be a database containing the first 75,000 transactions from the original database. This program is designed to work with any database file in SPMF format (text file). This tool is useful for performing scalability experiments when comparing algorithms. For example, one may want to observe the behavior of some algorithms when using 25%, 50%, 75% and 100% of the database.

What is the input?

The input is a text file in SPMF format. It could be, for example, a transaction database, a sequence database, or another type of database used by the algorithms offered in SPMF. In addition, the user has to specify a percentage X.

What is the output?

The output is a new file containing X% of the lines of data from the input file.

Example

For example, suppose the user applies the tool with X = 70% to the following file, DB_UtilityPerHUIs.txt, which contains 7 transactions:

3 1:6:1 5
5:3:3
3 5 1 2 4:25:1 3 5 10 6
3 5 2 4:20:3 3 8 6
3 1 4:8:1 5 2
3 5 1:22:6 6 10
3 5 2:9:2 3 4

The output is a new file (output.txt in this example) containing the first 5 transactions (because 70% of 7 transactions is 4.9, which is rounded to 5 transactions in this example):

3 1:6:1 5
5:3:3
3 5 1 2 4:25:1 3 5 10 6
3 5 2 4:20:3 3 8 6
3 1 4:8:1 5 2
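
The behavior described above can be illustrated with a short Java sketch. This is not the code of the SPMF tool. The class name ResizeSketch and the method firstFraction are hypothetical names introduced only for this illustration. The sketch also assumes that the number of lines to keep is rounded to the nearest integer, which matches the example (4.9 becomes 5). The rounding rule actually used by the tool is not stated here.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; it is not part of the SPMF library.
class ResizeSketch {

    // Returns the first round(fraction * n) lines of a database with n lines.
    // Rounding to the nearest integer is an assumption chosen to match the example.
    static List<String> firstFraction(List<String> lines, double fraction) {
        if (fraction < 0.0 || fraction > 1.0) {
            throw new IllegalArgumentException("fraction must be between 0 and 1");
        }
        int keep = (int) Math.round(lines.size() * fraction);
        return lines.subList(0, keep);
    }

    public static void main(String[] args) {
        // The seven transactions of DB_UtilityPerHUIs.txt from the example.
        List<String> database = Arrays.asList(
            "3 1:6:1 5",
            "5:3:3",
            "3 5 1 2 4:25:1 3 5 10 6",
            "3 5 2 4:20:3 3 8 6",
            "3 1 4:8:1 5 2",
            "3 5 1:22:6 6 10",
            "3 5 2:9:2 3 4");

        // Keep 70% of the lines and print them, one per line.
        for (String line : firstFraction(database, 0.7)) {
            System.out.println(line);
        }
    }
}
```

With a fraction of 0.7, the sketch keeps the first five of the seven lines, the same lines as in output.txt above. For a scalability experiment, the same method could be called with 0.25, 0.5, 0.75 and 1.0 to obtain databases of increasing size.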