Sample records from a database in SPMF format (SPMF documentation)

This webpage explains how to sample records from a database in SPMF format (a text file) using the SPMF open-source data mining library.

SPMF offers a set of tools to sample records (lines) from a dataset file. Sampling is useful when you want to work with a smaller subset of a large dataset for testing or performance reasons.

SPMF offers four variants of this tool:

Sample_Records: Sample a fixed number of records, with or without replacement.
Sample_Records_Percentage: Sample a percentage of records, with or without replacement.
Sample_Records_Reservoir: Memory-efficient reservoir sampling for large files.
Sample_Records_With_Seed: Sample with a fixed random seed for reproducible results.

All variants preserve metadata lines (lines that are empty or start with #, %, or @) and copy them to the output file without modification. Only data records are sampled.
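
To make the distinction between metadata lines and data records concrete, here is a minimal Java sketch of the rule described above. It is not taken from the SPMF source code: the class name MetadataSplitSketch and its methods are hypothetical, and the sketch assumes that "empty" means a line of length zero (lines containing only whitespace are treated as data here).

```java
import java.util.ArrayList;
import java.util.List;

public class MetadataSplitSketch {

    /** Returns true for lines treated as metadata: empty, or starting with #, % or @. */
    static boolean isMetadata(String line) {
        if (line.isEmpty()) {
            return true;
        }
        char first = line.charAt(0);
        return first == '#' || first == '%' || first == '@';
    }

    /** Collects only the data records, i.e. the lines that are eligible for sampling. */
    static List<String> dataRecords(List<String> lines) {
        List<String> records = new ArrayList<>();
        for (String line : lines) {
            if (!isMetadata(line)) {
                records.add(line);
            }
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("@CONVERTED_FROM_TEXT", "# a comment", "1 3 4", "", "2 3 5");
        System.out.println(dataRecords(lines)); // prints [1 3 4, 2 3 5]
    }
}
```

In this sketch, only the two lines "1 3 4" and "2 3 5" would be candidates for sampling, while the other three lines would be copied to the output unchanged.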

Below, four examples are provided, one for each variant of the tool.

Example 1: Sample a fixed number of records (Sample_Records)

Consider the following input file contextPasquier99.txt:

1 3 4
2 3 5
1 2 3 5
2 5
1 2 3 5

This file contains 5 data records (transactions). If we sample 3 records without replacement, a possible output is:

2 3 5
1 2 3 5
2 5

This can be achieved by running the algorithm Sample_Records. The output is then saved in an output file, which contains 3 randomly selected records from the original 5 records. Since sampling is random, running the tool again may produce different results.
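
For readers who want to see the difference between sampling with and without replacement, the following Java sketch illustrates one possible way to do it. It is an illustration only and not the implementation used by SPMF: the class SampleRecordsSketch and the method sample are hypothetical names, and the order of the sampled records in the output is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class SampleRecordsSketch {

    /** Draws count records from the list (hypothetical helper, not SPMF's code). */
    static List<String> sample(List<String> records, int count,
                               boolean withReplacement, Random random) {
        if (count < 0) {
            throw new IllegalArgumentException("count must be non-negative");
        }
        List<String> result = new ArrayList<>();
        if (withReplacement) {
            if (count > 0 && records.isEmpty()) {
                throw new IllegalArgumentException("cannot sample from an empty list");
            }
            // Each draw is independent, so the same record can be picked several times.
            for (int i = 0; i < count; i++) {
                result.add(records.get(random.nextInt(records.size())));
            }
        } else {
            if (count > records.size()) {
                throw new IllegalArgumentException("count exceeds the number of records");
            }
            // Shuffle a copy and keep the first count entries,
            // so that no line of the input is selected twice.
            List<String> shuffled = new ArrayList<>(records);
            Collections.shuffle(shuffled, random);
            result.addAll(shuffled.subList(0, count));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> records = List.of("1 3 4", "2 3 5", "1 2 3 5", "2 5", "1 2 3 5");
        // Prints 3 randomly chosen records; the selection varies between runs.
        for (String record : sample(records, 3, false, new Random())) {
            System.out.println(record);
        }
    }
}
```

Note that without replacement, the sample cannot be larger than the number of data records, whereas with replacement, the same record may appear several times in the output.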

How to run this example?

If you are using the graphical interface, (1) choose the "Sample_Records" algorithm, (2) choose the input file contextPasquier99.txt, (3) set the sample count to 3, (4) set "with replacement" to false, and (5) click "Run algorithm".

If you want to execute this example from the command line, then execute this command:
java -jar spmf.jar run Sample_Records contextPasquier99.txt output.txt 3 false
in a folder containing spmf.jar and the input file contextPasquier99.txt.

If you are using the source code version of SPMF, launch the file "MainTestSampleRecords.java" in the package ca.pfv.spmf.tests.

Example 2: Sample a percentage of records (Sample_Records_Percentage)

Consider the following input file contextPasquier99.txt:

1 3 4
2 3 5
1 2 3 5
2 5
1 2 3 5

This file contains 5 data records. If we sample 60% of records (0.6) without replacement, a possible output is:

1 3 4
1 2 3 5
2 5

This can be achieved by running the algorithm Sample_Records_Percentage. The output is then saved in an output file, which contains 3 records (60% of 5 records is exactly 3 records; in general, the computed number of records is rounded to a whole number).
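
The conversion from a percentage to a number of records can be sketched as follows in Java. This is only an illustration: the method name sampleCount is hypothetical, and the use of Math.round (rounding to the nearest integer, with halves rounded up) and the restriction of the percentage to the range 0 to 1 are assumptions of this sketch, since this page does not specify how the tool rounds.

```java
public class PercentageCountSketch {

    /**
     * Converts a percentage given as a fraction (e.g. 0.6 for 60%) into a record count.
     * Assumption: the result is rounded to the nearest integer.
     */
    static int sampleCount(int totalRecords, double percentage) {
        if (percentage < 0.0 || percentage > 1.0) {
            throw new IllegalArgumentException("percentage must be between 0 and 1");
        }
        return (int) Math.round(totalRecords * percentage);
    }

    public static void main(String[] args) {
        System.out.println(sampleCount(5, 0.6)); // prints 3
    }
}
```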

How to run this example?

If you are using the graphical interface, (1) choose the "Sample_Records_Percentage" algorithm, (2) choose the input file contextPasquier99.txt, (3) set the percentage to 0.6 (which means 60%), (4) set "with replacement" to false, and (5) click "Run algorithm".

If you want to execute this example from the command line, then execute this command:
java -jar spmf.jar run Sample_Records_Percentage contextPasquier99.txt output.txt 0.6 false
in a folder containing spmf.jar and the input file contextPasquier99.txt.

If you are using the source code version of SPMF, launch the file "MainTestSampleRecords.java" in the package ca.pfv.spmf.tests.

Example 3: Memory-efficient reservoir sampling (Sample_Records_Reservoir)

Consider the following input file contextPasquier99.txt:

1 3 4
2 3 5
1 2 3 5
2 5
1 2 3 5

This file contains 5 data records. If we sample 2 records using reservoir sampling, a possible output is:

2 5
1 3 4

This can be achieved by running the algorithm Sample_Records_Reservoir. The output is then saved in an output file, which contains 2 records. Reservoir sampling is particularly useful for very large files because it does not require loading all records into memory at once. It processes records one at a time and always samples without replacement.
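
To give an idea of how reservoir sampling processes records one at a time, here is a minimal Java sketch of the classic reservoir sampling scheme (often called Algorithm R). It is not the code of Sample_Records_Reservoir: the class ReservoirSketch and the method reservoirSample are hypothetical names, and for brevity the sketch assumes that the input contains only data records (no metadata lines).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

public class ReservoirSketch {

    /** Keeps a uniform random sample of at most k records while reading the input once. */
    static List<String> reservoirSample(Iterator<String> records, int k, Random random) {
        if (k < 0) {
            throw new IllegalArgumentException("k must be non-negative");
        }
        List<String> reservoir = new ArrayList<>(k);
        long seen = 0; // number of records read so far
        while (records.hasNext()) {
            String record = records.next();
            seen++;
            if (reservoir.size() < k) {
                // The first k records fill the reservoir.
                reservoir.add(record);
            } else {
                // Pick a position uniformly in [0, seen); the new record replaces
                // an entry of the reservoir only if the position is below k.
                long j = (long) (random.nextDouble() * seen);
                if (j < k) {
                    reservoir.set((int) j, record);
                }
            }
        }
        return reservoir;
    }

    public static void main(String[] args) throws IOException {
        String data = "1 3 4\n2 3 5\n1 2 3 5\n2 5\n1 2 3 5\n";
        try (BufferedReader reader = new BufferedReader(new StringReader(data))) {
            // Lines are pulled lazily from the reader, one at a time.
            List<String> sample = reservoirSample(reader.lines().iterator(), 2, new Random());
            // Prints 2 randomly chosen records; the selection varies between runs.
            sample.forEach(System.out::println);
        }
    }
}
```

In this sketch, only the k records of the reservoir are kept in memory at any time, which is why the approach is suitable for very large files.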

How to run this example?

If you are using the graphical interface, (1) choose the "Sample_Records_Reservoir" algorithm, (2) choose the input file contextPasquier99.txt, (3) set the sample count to 2, and (4) click "Run algorithm".

If you want to execute this example from the command line, then execute this command:
java -jar spmf.jar run Sample_Records_Reservoir contextPasquier99.txt output.txt 2
in a folder containing spmf.jar and the input file contextPasquier99.txt.

If you are using the source code version of SPMF, launch the file "MainTestSampleRecords.java" in the package ca.pfv.spmf.tests.

Example 4: Sample with a fixed seed for reproducibility (Sample_Records_With_Seed)

Consider the following input file contextPasquier99.txt:

1 3 4
2 3 5
1 2 3 5
2 5
1 2 3 5

This file contains 5 data records. If we sample 3 records without replacement using the random seed 12345, the output is the same every time the tool is run with that seed:

2 3 5
1 2 3 5
1 3 4

This can be achieved by running the algorithm Sample_Records_With_Seed. The output is then saved in an output file, which contains 3 records.

Using a fixed seed is useful for reproducibility in experiments. Running the tool again with the same seed will produce the exact same sample.
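
The effect of a fixed seed can be illustrated with the following Java sketch, which uses java.util.Random initialized with a seed. This is an illustration under assumptions and not SPMF's implementation: the class SeededSampleSketch and the method sampleWithSeed are hypothetical names, and the records selected by this sketch are not expected to match the output of Sample_Records_With_Seed shown above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class SeededSampleSketch {

    /** Samples count records without replacement, using a generator created from the seed. */
    static List<String> sampleWithSeed(List<String> records, int count, long seed) {
        if (count < 0 || count > records.size()) {
            throw new IllegalArgumentException("count must be between 0 and the number of records");
        }
        List<String> shuffled = new ArrayList<>(records);
        // A Random created with the same seed yields the same sequence of numbers,
        // so the shuffle, and therefore the sample, is identical on every run.
        Collections.shuffle(shuffled, new Random(seed));
        return new ArrayList<>(shuffled.subList(0, count));
    }

    public static void main(String[] args) {
        List<String> records = List.of("1 3 4", "2 3 5", "1 2 3 5", "2 5", "1 2 3 5");
        List<String> first = sampleWithSeed(records, 3, 12345L);
        List<String> second = sampleWithSeed(records, 3, 12345L);
        System.out.println(first.equals(second)); // prints true
    }
}
```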

How to run this example?

If you are using the graphical interface, (1) choose the "Sample_Records_With_Seed" algorithm, (2) choose the input file contextPasquier99.txt, (3) set the sample count to 3, (4) set "with replacement" to false, (5) set the random seed to 12345, and (6) click "Run algorithm".

If you want to execute this example from the command line, then execute this command:
java -jar spmf.jar run Sample_Records_With_Seed contextPasquier99.txt output.txt 3 false 12345
in a folder containing spmf.jar and the input file contextPasquier99.txt.

If you are using the source code version of SPMF, launch the file "MainTestSampleRecords.java" in the package ca.pfv.spmf.tests.