Remove duplicate records from a database in SPMF format (a text file) (SPMF documentation)
This example explains how to remove duplicate records from a database in SPMF format (a text file) using the SPMF open-source data mining library.
How to run this example?
- If you are using the graphical interface, (1) choose the "Remove_Duplicate_Records" algorithm, (2) choose the input file contextPasquier99.txt, and (3) click "Run algorithm".
- If you want to execute this example from the command line, execute the following command in a folder containing spmf.jar and the input file contextPasquier99.txt:
java -jar spmf.jar run Remove_Duplicate_Records contextPasquier99.txt output.txt
- If you are using the source code version of SPMF, launch the file "MainTestRemoveDuplicateRecords.java" in the package ca.pfv.spmf.tests.
What is this tool?
This tool removes duplicate records (lines) from a dataset file. It keeps only the first occurrence of each unique record and removes all subsequent duplicates.
This is useful for cleaning datasets before applying data mining algorithms. The tool preserves metadata lines (lines that are empty or start with #, %, or @) and copies them to the output file without modification. Only data records are checked for duplicates.
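For illustration, here is a minimal Java sketch of this logic. It is not the SPMF implementation: the class name RemoveDuplicateRecordsSketch and the method removeDuplicates are hypothetical, written only to show the technique (copy metadata lines unchanged, keep the first occurrence of each data record).

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch, not the SPMF source code.
public class RemoveDuplicateRecordsSketch {

    public static void removeDuplicates(String inputPath, String outputPath)
            throws IOException {
        Set<String> seen = new HashSet<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(inputPath));
             BufferedWriter writer = new BufferedWriter(new FileWriter(outputPath))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Metadata lines (empty, or starting with #, %, or @) are
                // copied to the output without being checked for duplicates.
                if (line.isEmpty() || line.startsWith("#")
                        || line.startsWith("%") || line.startsWith("@")) {
                    writer.write(line);
                    writer.newLine();
                    continue;
                }
                // add() returns true only the first time a record is seen,
                // so only the first occurrence of each record is written.
                if (seen.add(line)) {
                    writer.write(line);
                    writer.newLine();
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        removeDuplicates("contextPasquier99.txt", "output.txt");
    }
}

On the example file contextPasquier99.txt, this sketch produces the same result as the output shown below.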
What is the input?
The input is a text file in SPMF format. It can be, for example, a transaction database, a sequence database, or any other type of database used by the algorithms offered in SPMF.
For example, this is the file contextPasquier99.txt:
1 3 4
2 3 5
1 2 3 5
2 5
1 2 3 5
This file contains 5 records (each line is a record). It is a transaction database as used by algorithms such as Apriori.
What is the output?
The output is a text file with the same format as the input, but with duplicate data records removed. Only the first occurrence of each unique record is kept. In this example, the output is a file output.txt as follows:
1 3 4
2 3 5
1 2 3 5
2 5
The fifth line of the input file has been removed because it is a duplicate of the third line.