Using a TEXT file as input in the source code version of SPMF (SPMF documentation)
This example explains how to use a TEXT file as input when running an algorithm from the source code of the SPMF open-source data mining library.
Since version 2.01, the graphical user interface and the command line interface of SPMF can read text files as input if they have the ".text" extension, and this is fully transparent to the user. This is supported for most sequential pattern mining and sequential rule mining algorithms. It is not yet supported for itemset mining or association rule mining algorithms. This example describes another possibility: using a TEXT file as input when running an algorithm from the source code, which is quite simple.
Before presenting the example, let's explain a few things about how the TEXT file support is implemented in SPMF:
- The TEXT format is only supported for algorithms that take a sequence database as input (most sequential rule and sequential pattern mining algorithms, such as CMRules and CM-SPAM).
- To convert a text file into a sequence database, the text is split into sentences, where a sentence ends with "?", "!" or ".". Each sentence is then viewed as a sequence of words.
- The words in a text file are naturally represented as strings, while the SPMF format and algorithms use an integer representation, which is much more memory and time efficient for the algorithms concerned. To support TEXT files while keeping this efficient integer representation, we implemented the support in a way that does not require modifying the source code of the algorithms. The solution is to convert the input before running an algorithm, and then to convert the output.
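To make the conversion idea in the points above more concrete, here is a minimal sketch in plain Java. It is not the code of SPMF's SequenceDatabaseConverter. The class name TextToSequencesSketch, the Converted holder, and the choices to split words on whitespace and to lowercase them are assumptions made only for illustration. It splits a text into sentences at ".", "?" or "!", assigns an integer to each distinct word, and keeps the integer-to-word map needed to translate results back later.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: this is NOT SPMF's SequenceDatabaseConverter.
class TextToSequencesSketch {

    // Hypothetical holder for the two results of the conversion.
    static final class Converted {
        final List<List<Integer>> sequences = new ArrayList<>();
        final Map<Integer, String> idToWord = new LinkedHashMap<>();
    }

    static Converted convert(String text) {
        Converted result = new Converted();
        Map<String, Integer> wordToId = new LinkedHashMap<>();
        // A sentence ends with ".", "?" or "!".
        for (String sentence : text.split("[.?!]")) {
            List<Integer> sequence = new ArrayList<>();
            // Assumption: words are separated by whitespace and lowercased.
            for (String word : sentence.trim().toLowerCase().split("\\s+")) {
                if (word.isEmpty()) {
                    continue;
                }
                Integer id = wordToId.get(word);
                if (id == null) {
                    // First time this word is seen: give it the next integer.
                    id = wordToId.size() + 1;
                    wordToId.put(word, id);
                    result.idToWord.put(id, word);
                }
                sequence.add(id);
            }
            if (!sequence.isEmpty()) {
                result.sequences.add(sequence);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Converted c = convert("The cat sleeps. Does the dog sleep? The cat runs!");
        System.out.println(c.sequences); // [[1, 2, 3], [4, 1, 5, 6], [1, 2, 7]]
        System.out.println(c.idToWord);  // {1=the, 2=cat, 3=sleeps, 4=does, 5=dog, 6=sleep, 7=runs}
    }
}
```

Each inner list plays the role of one sequence in the sequence database, and the map is the kind of information that must be kept to show the final result with words instead of integers.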
Having said that, we will now explain how to use the TEXT file format in the source code with an example. We will use the ERMiner algorithm, but the steps are the same for the other algorithms. We will first show how to run the ERMiner algorithm if the input file is in SPMF format. Then, we will show how to run the same algorithm if the input is a text file, to illustrate the differences.
If the input is in SPMF format
To run ERMiner with a file "contextPrefixSpan.txt" in SPMF format, passing the values 3 and 0.5 as its minimum support and minimum confidence thresholds, the following code is used:
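// Path of the input file in SPMF format
String input = "contextPrefixSpan.txt";
// Path of the output file (the name "output.txt" is chosen here for illustration)
String output = "output.txt";
AlgoERMiner algo = new AlgoERMiner();
algo.runAlgorithm(input, output, 3, 0.5);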
If the input is a TEXT file
Now let's say that the input file is a text document.
// We first need to convert the input file from TEXT to SPMF format. To do that, we create a sequence database converter. Then we call its method "convertTEXTandReturnMap" to convert the input file to the SPMF format. It produces a converted input file named "example2_converted.txt". Moreover, the conversion method returns a map containing mapping information between the data in TEXT format and the data in SPMF format.
SequenceDatabaseConverter converter = new SequenceDatabaseConverter();
Map<Integer, String> mapping = converter.convertTEXTandReturnMap("example2.text", "example2_converted.txt", Integer.MAX_VALUE);
// Then we run the algorithm with the converted file "example2_converted.txt". This creates a file "output.txt" containing the result.
AlgoERMiner algo = new AlgoERMiner();
algo.runAlgorithm("example2_converted.txt", "output.txt", 3, 0.5);
// Finally, we need to use the mapping to convert the output file so that the result is shown using the words found in the TEXT file rather than the integer-based representation used internally by the ERMiner algorithm. This is done as follows. The result is a file named "final_output.txt".
ResultConverter converter2 = new ResultConverter();
converter2.convert(mapping, "output.txt", "final_output.txt");
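To illustrate what converting the output with the mapping means, here is a minimal sketch in plain Java. It is not the code of SPMF's ResultConverter. The class name OutputMappingSketch and the line layout "1,2 ==> 3 #SUP: 2 #CONF: 1.0" are assumptions made for illustration. The sketch replaces each integer in the rule part of a line with its word and leaves everything from the first "#" onward unchanged.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only: this is NOT SPMF's ResultConverter.
class OutputMappingSketch {

    private static final Pattern ITEM_ID = Pattern.compile("\\d+");

    // Assumption: statistics start at the first '#' and must stay untouched.
    static String translate(String line, Map<Integer, String> idToWord) {
        int statsStart = line.indexOf('#');
        String rule = statsStart < 0 ? line : line.substring(0, statsStart);
        String stats = statsStart < 0 ? "" : line.substring(statsStart);

        Matcher matcher = ITEM_ID.matcher(rule);
        StringBuffer translated = new StringBuffer();
        while (matcher.find()) {
            int id = Integer.parseInt(matcher.group());
            // Keep the integer as-is if the mapping has no word for it.
            String word = idToWord.getOrDefault(id, matcher.group());
            matcher.appendReplacement(translated, Matcher.quoteReplacement(word));
        }
        matcher.appendTail(translated);
        return translated.toString() + stats;
    }

    public static void main(String[] args) {
        Map<Integer, String> mapping = new HashMap<>();
        mapping.put(1, "the");
        mapping.put(2, "cat");
        mapping.put(3, "sleeps");
        // prints: the,cat ==> sleeps #SUP: 2 #CONF: 1.0
        System.out.println(translate("1,2 ==> 3 #SUP: 2 #CONF: 1.0", mapping));
    }
}
```

Because this step only rewrites the result file, the mining algorithm itself keeps working on integers and its source code does not need to change.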
What is the cost of using a TEXT file in terms of performance? The only additional cost is the conversion of the input and output files, which is generally much smaller than the cost of performing data mining. In the future, we plan to add support for SQL databases, Excel files and other formats by using a similar conversion mechanism that does not affect the performance of the mining phase. We also plan to add support for the visualization of patterns.