Remove duplicate records from a database in SPMF format (a text file) (SPMF documentation)

This example explains how to remove duplicate records from a database in SPMF format (a text file) using the SPMF open-source data mining library.

How to run this example?

What is this tool?

This tool removes duplicate records (lines) from a dataset file. It keeps only the first occurrence of each unique record and removes all subsequent duplicates.

This is useful for cleaning datasets before applying data mining algorithms. The tool preserves metadata lines (lines that are empty or start with #, %, or @) and copies them to the output file without modification. Only data records are checked for duplicates.
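The behavior described above can be sketched as follows. This is a hypothetical illustration, not the actual SPMF source code: the class and method names (RemoveDuplicates, removeDuplicates) are made up for this example, and it processes lines in memory rather than reading and writing files as the real tool does.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the duplicate-removal logic (not the actual SPMF implementation).
public class RemoveDuplicates {

    /** Returns the given lines with duplicate data records removed.
     *  Metadata lines (empty lines, or lines starting with #, % or @)
     *  are copied without modification and are never deduplicated. */
    public static List<String> removeDuplicates(List<String> lines) {
        Set<String> seen = new HashSet<>();
        List<String> result = new ArrayList<>();
        for (String line : lines) {
            if (line.isEmpty() || line.startsWith("#")
                    || line.startsWith("%") || line.startsWith("@")) {
                // metadata line: keep as-is
                result.add(line);
            } else if (seen.add(line)) {
                // seen.add(...) returns true only for the first occurrence,
                // so subsequent duplicates are skipped
                result.add(line);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The five records of contextPasquier99.txt from this page
        List<String> input = List.of("1 3 4", "2 3 5", "1 2 3 5", "2 5", "1 2 3 5");
        for (String line : removeDuplicates(input)) {
            System.out.println(line);
        }
        // prints: 1 3 4 / 2 3 5 / 1 2 3 5 / 2 5
    }
}
```

Running this on the example database from this page keeps the first four records and drops the second occurrence of "1 2 3 5".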

What is the input?

The input is a text file in SPMF format. It could be, for example, a transaction database, a sequence database, or another type of database used by the algorithms offered in SPMF.

For example, this is the file contextPasquier99.txt:

1 3 4
2 3 5
1 2 3 5
2 5
1 2 3 5

This file contains 5 records (each line is a record). It is a transaction database as used by algorithms such as Apriori.

What is the output?

The output is a text file with the same format as the input, but with duplicate data records removed. Only the first occurrence of each unique record is kept. In this example, the output is a file output.txt as follows:
1 3 4
2 3 5
1 2 3 5
2 5

The fifth line of the input file has been removed because it is a duplicate of the third line.