Using a TEXT file as input in the source code version of SPMF (SPMF documentation)

This example explains how to use a TEXT file as input when running an algorithm from the source code of the SPMF open-source data mining library.

Since version 2.01 of SPMF, the graphical user interface and the command line interface can read text files as input if they have the ".text" extension, and this is completely transparent to the user. This is supported for most sequential pattern mining algorithms and sequential rule mining algorithms. It is not yet supported for itemset mining or association rule mining algorithms. This example describes another possibility: using a TEXT file as input when running an algorithm from the source code. Doing so is quite simple.

But before presenting the example, let's explain a few things about how the TEXT file support is implemented in SPMF:

Having said that, we will now explain how to use the TEXT file format in the source code with an example. We will use the ERMiner algorithm, but the steps are the same for the other algorithms. We will first show how to run the ERMiner algorithm if the input file is in SPMF format. Then, we will show how to run the ERMiner algorithm if the input is a text file, to illustrate the differences.

If the input is in SPMF format

To run ERMiner with a file "contextPrefixSpan.txt" in SPMF format with the parameters minsup = 3 sequences and minconf = 0.5 (the last two arguments of the call), the following code is used:

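String input = "contextPrefixSpan.txt";  // input file in SPMF format
String output = "output.txt";            // file where the result is written
AlgoERMiner algo = new AlgoERMiner();
algo.runAlgorithm(input, output, 3, 0.5);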

If the input is a TEXT file

Now let's say that the input file is a text document.

// We first need to convert the input file from TEXT to SPMF format. To do that, we create a sequence database converter. Then we call its method "convertTEXTandReturnMap" to convert the input file to the SPMF format. It produces a converted input file named "example2_converted.txt". Moreover, the conversion method returns a map containing the mapping between the data in the TEXT file and the data in SPMF format.

SequenceDatabaseConverter converter = new SequenceDatabaseConverter();
Map<Integer, String> mapping = converter.convertTEXTandReturnMap("example2.text", "example2_converted.txt", Integer.MAX_VALUE);
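To make the role of the returned map more concrete, here is a minimal, self-contained sketch of the general idea behind such a conversion. It is not the code of SequenceDatabaseConverter. The class name TextToSequenceSketch is hypothetical, and the sketch assumes that each sentence (ended by '.', '?' or '!') becomes one sequence and that each distinct word receives an integer identifier. SPMF may tokenize the text differently. The output lines follow the SPMF sequence format, where -1 ends an itemset and -2 ends a sequence.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: this is NOT SPMF's SequenceDatabaseConverter.
class TextToSequenceSketch {

    /** Maps each integer item identifier back to the word it stands for. */
    final Map<Integer, String> idToWord = new LinkedHashMap<>();
    private final Map<String, Integer> wordToId = new LinkedHashMap<>();

    /** Converts a text into SPMF-style sequence lines, one per sentence. */
    List<String> convert(String text) {
        List<String> lines = new ArrayList<>();
        // Assumption: a sentence ends with '.', '?' or '!'.
        for (String sentence : text.split("[.?!]")) {
            StringBuilder line = new StringBuilder();
            for (String word : sentence.trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue; // skip empty sentences
                }
                Integer id = wordToId.get(word);
                if (id == null) {
                    // First time this word is seen: give it the next identifier.
                    id = wordToId.size() + 1;
                    wordToId.put(word, id);
                    idToWord.put(id, word);
                }
                line.append(id).append(" -1 "); // one word per itemset
            }
            if (line.length() > 0) {
                lines.add(line.append("-2").toString()); // end of sequence
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        TextToSequenceSketch sketch = new TextToSequenceSketch();
        for (String line : sketch.convert("The cat sat. The dog sat!")) {
            System.out.println(line);
        }
        // Prints:
        // 1 -1 2 -1 3 -1 -2
        // 1 -1 4 -1 3 -1 -2
        System.out.println(sketch.idToWord); // {1=The, 2=cat, 3=sat, 4=dog}
    }
}
```

The map kept in idToWord plays the same role as the "mapping" variable in the SPMF code: it is the only information needed later to translate integer items back into words.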

// Then we run the algorithm with the converted file "example2_converted.txt". This creates a file "output.txt" containing the result.

AlgoERMiner algo = new AlgoERMiner();
algo.runAlgorithm("example2_converted.txt", "output.txt", 3, 0.5);

// Finally, we need to use the mapping to convert the output file so that the result is shown using the words that are found in the TEXT file rather than the integer-based representation used internally by the ERMiner algorithm. This is very simple and performed as follows. The result is a file named "final_output.txt".

ResultConverter converter2 = new ResultConverter();
converter2.convert(mapping, "output.txt", "final_output.txt");
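To illustrate what this last step amounts to, here is a minimal, self-contained sketch that rewrites one result line using a map from integers to words. It is not the code of ResultConverter. The class name ResultMappingSketch and the method toWords are hypothetical, and the sketch assumes that a rule is written as "items ==> items #SUP: ... #CONF: ...", with item identifiers appearing only before the first '#'. Check the actual output file of the algorithm before relying on this assumption.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only: this is NOT SPMF's ResultConverter.
class ResultMappingSketch {

    private static final Pattern ITEM = Pattern.compile("\\d+");

    /** Replaces integer items by words, leaving the statistics untouched. */
    static String toWords(String line, Map<Integer, String> mapping) {
        // Assumption: everything from the first '#' onward is statistics.
        int statsStart = line.indexOf('#');
        String items = statsStart < 0 ? line : line.substring(0, statsStart);
        String stats = statsStart < 0 ? "" : line.substring(statsStart);

        Matcher matcher = ITEM.matcher(items);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String word = mapping.get(Integer.parseInt(matcher.group()));
            // Keep the integer as-is if the mapping has no entry for it.
            String replacement = (word != null) ? word : matcher.group();
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out + stats;
    }

    public static void main(String[] args) {
        Map<Integer, String> mapping = new HashMap<>();
        mapping.put(1, "the");
        mapping.put(3, "sat");
        mapping.put(4, "dog");
        System.out.println(toWords("1,4 ==> 3 #SUP: 2 #CONF: 1.0", mapping));
        // Prints: the,dog ==> sat #SUP: 2 #CONF: 1.0
    }
}
```

Because the translation is a single pass over each output line, its cost grows with the size of the result file and not with the work done by the mining algorithm.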

What is the cost of using a TEXT file in terms of performance? The only additional cost when using a text file is the cost of converting the input and output files, which is generally much smaller than the cost of performing data mining. In the future, we plan to add support for SQL databases, Excel files and other formats by using a similar conversion mechanism that does not affect the performance of the mining phase. We also plan to add support for the visualizations of patterns.