SPMF: A Java Open-Source Data Mining Library

Developers' guide

This web page is intended for developers who want to understand and modify the source code of SPMF. This page provides information about how the source code is organized to make it easier to understand the source code, modify it and reuse it in other projects.

If you are looking for examples of how to run the algorithms, you should instead look at the documentation of SPMF. Also, another general source of information is the FAQ.

An overview of the architecture of SPMF

The architecture of SPMF is not very complicated. Below, I provide a picture that explains the main components of SPMF.

To use SPMF, a user can choose to use the Graphical interface or Command line interface. The user interacts with these two interfaces to run algorithms which are managed by a module called the Agorithm Manager. There are mainly three types of algorithms, which are (1) data pre-processing algorithms, (2) data mining algorithms, and (3) algorithms to either visualize data or patterns found in the data. The Algorithm Manager has the list of all available algorithms, and a description of each algorithm. The description of an algorithm indicates how many parameters it has, what are the data types of parameters, what is the algorithm name, etc. The input and output of algorithms are generally text files. A few different formats are supported, explained in the documentation of SPMF.

This architecture of SPMF is designed so that it is easy to add new algorithms by just giving a new algorithm description to the algorithm manager. Then, the new algorithm will automatically appear in the graphical user interface and command line interface.

To integrate SPMF in other software, it is possible to directly call the code of SPMF, or to call SPMF as an external program from the command line. There also exists some unofficial wrappers that have been developed which allows calling SPMF from other languages such R and Python.

Source code organization - packages

Now that you have an overview of the architecture, let me explain in more details how the source code of SPMF is organized.

The source code is organized into a package hierarchy (a package is a folder in Java terminology). Here is a description of the main packages in SPMF:

ca/pfv/spmf/ : this is the main package containing all the other packages of SPMF.

ca/pfv/spmf/algorithms/ this package contains the source code of all algorithms

ca/pfv/spmf/algorithms/associationrules/ : contains association rule mining algorithms.

ca/pfv/spmf/algorithms/classifiers/ : contains classification algorithms.

ca/pfv/spmf/algorithms/clustering/ : contains clustering algorithms.
ca/pfv/spmf/algorithms/episodes/ : contains episode mining algorithms.
ca/pfv/spmf/algorithms/frequentpatterns/ : contains itemset mining algorithms.
ca/pfv/spmf/algorithms/graph_mining/ : contains graph mining algorithms.
ca/pfv/spmf/algorithms/sequenceprediction/ : contains sequence prediction algorithms.
ca/pfv/spmf/algorithms/sequential_rules/ : contains sequential rule mining algorithms.
ca/pfv/spmf/algorithms/sequentialpatterns/ : contains sequential pattern mining algorithms.
ca/pfv/spmf/algorithms/sort/ : contains sorting algorithms.
ca/pfv/spmf/algorithms/timeseries/ : contains time series mining and analysis algorithms.

ca/pfv/spmf/algorithmmanager/ contains the Algorithm Manager, which is used for managing algorithms such as listing all the algorithms, their parameters, etc.
- ca/pfv/spmf/algorithmmanager/descriptions/ this package contains a description for each algorithm as a Java class. The description of an algorihtm indicates its input type, output type, authors, etc. The description of an algorithm is a subclass of the class DescriptionOfAlgorithm.

ca/pfv/spmf/datastructures : contains some special data structures that are used by some algorithms such as the triangular matrix.

ca/pfv/spmf/documentation/ : this package contains a text file that link to the documentation on the website.

ca/pfv/spmf/gui : contains the graphical user interface (MainWindow.Java) and command line (Main.Java). To run the graphical user interface from the source code, just run Main.Java.

ca/pfv/spmf/input/ : contains the code for reading some types of input file such as transaction databases and sequence databases. These classes are used by several algorithms.

ca/pfv/spmf/patterns/ : defines various types of patterns used by different algorithms such as itemsets and rules.

ca/pfv/spmf/test/ : contains examples of how to run each algorithm from the source code. Note that these files are intended for developers. Note that these files are not "unit tests" even though they are named "test".

ca/pfv/spmf/tools/ : contains various tools such as random dataset generators, a tool for generating statistics on sequence databases and a tool for converting various formats of sequence database to the SPMF format.

Source code organization - algorithms

In general, each algorithm has its own package containing its main files. A few files are shared by multiple algorithms and can be found in the ca/pfv/spmf/input/ or ca/pfv/spmf/patterns directories.

For example, the SPAM algorithm is located in the package ca/pfv/spmf/algorithms/sequentialpatterns/spam, which is a sub package of ca/pfv/spmf/algorithms/sequentialpatterns/ because SPAM is a sequential pattern mining algorithm.

Each algorithm has a main class with a name starting with "Algo..." and a method "runAlgorithm()" for running the algorithm. For example, the main file for the SPAM algorithm is AlgoSPAM.java and it has a method named runAlgorithm() to run it. This method takes three parameters as input : (1) an input file, (2) an output file and (3) a minsup threshold.

To see an example of how to run the SPAM algorithm from the source code, we should go in the ca/pfv/spmf/test/ folder where all the example files for developers are located. In this folder, all algorithms have at least one file named "MainTestXXXX.java" where XXXX is the name of the algorithm. For example, the file showing how to execute the SPAM algorithm is "MainTestSPAM_saveToFile.java". If we open the file, we see the following code:

...

// Load a sequence database
String input = fileToPath("contextPrefixSpan.txt");
String output = ".//output.txt";

// Create an instance of the algorithm
AlgoSPAM algo = new AlgoSPAM();

// execute the algorithm with minsup = 2 sequences (50 %)
algo.runAlgorithm(input, output, 0.5);
algo.printStatistics();

...

This example correspond to the SPAM example in the documentation section of the website. The input file is "contextPrefixSpan.txt", which correspond to the example in the documentation on the website. This input file can be found in ca/pfv/spmf/tests/ .

In this example, the output file path is ".//output.txt". This line can be replaced by whatever you like. You should chose a path that exist on your computer.

Finally, the line "runAlgorithm()" method launches the algorithm. In this example, we can see that the parameter 0.5 is used. To see the meaning of that parameter for the SPAM algorithm, we should look again at the documentation on the website (here). On the website it is explained that this parameter is "the minimum support".

Each algorithm also has a description in the package ca.pfv.spmf.algorithmmanager.descriptions. The description of an algorithm provides information about an algorithm such as its authors, input file type, output file type, types of parameters, and how the algorithm can be run. Descriptions of algorithms are used by the graphical interface to automatically build the list of algorithms that are offered in the user interface. Descriptions are also used by the command line interface to provide help to the user about what kind of parameters are expected by each algorithm. For example, for the SPAM algorithm, the class DescriptionAlgoSPAM describes the SPAM implementation offered in SPMF.

Compiling the source code

The source code of SPMF is designed to be easily reusable. To compile it, the only requirement is a recent version of the official Java JDK (>=1.8).

To compile the source code, it is more convenient to use an integrated development environment because there are many classes and packages. I recommend to use an IDE such as Eclipse, IntelliJ or Net Beans for compiling and modifying the source code. The instructions for installing the source code of SPMF in Eclipse, Net Beans, IntelliJ and compiling it can be found in this document: how_to_install. If you use another IDE, it should be straightforward to install the source code if you are familiar with Java.

Understanding an algorithm

A good effort has been made to put a lot of comments in the source code. However, to understand an algorithm, it is highly recommended to read the research paper describing the algorithm first. Link to most of the papers can be found on this page of the website.

Copying the source code of an algorithm in another Java project

Reusing the source code of a single algorithm from SPMF in another Java project is easy, since in general algorithms in SPMF are separated by packages, except for a few datastructures and classes that are shared by multiples algorithms.

If you want to reuse a single algorithm from SPMF without copying all the code from SPMF, you should first determine which classes are necessary. For example, if you want to copy the source code of the SPAM algorithm in another project, then you would need to first locate the package for the SPAM algorithm. It is ca/pfv/spmf/algorithms/sequentialpatterns/spam/ .

Then, you would need to copy the file AlgoSPAM.java in this package, as well as other files used by that file which are: AlgoSPAM. java, Bitmap.Java, Itemset.java and Prefix.Java.

Then you should check the first lines of those Java files in that package to see if there is some additional files that are required by looking at the import statements of each file. In the case of SPAM, there is only two such files indicated by the import statement in AlgoSPAM.java, which are:

ca.pfv.spmf.patterns.itemset_list_integers_without_support.Itemset;
ca.pfv.spmf.tools.MemoryLogger;

Therefore, we should also copy these files. Then we should also look at the import statement and dependency of these files to see if other files are required.

Note that each Java file has a package statement in its first line. If you move the files to another package, you should change the package statements accordingly.

Also, note that the source code of SPMF is distributed under the GPL license. If you copy parts of the source code of SPMF in another project, it should comply with the GPL license (see license). Also you should not delete the copyright statement on top of each file.

Naming conventions

In SPMF, the source code tries to follow the Standard Code Conventions for the Java Programming Language. Also, in general, the Javadoc convention is used for documenting the code.

How to generate the jar file for the GUI version of SPMF

If you want to regenerate the jar file for the GUI version of SPMF, you can do as follows in Eclipse:

Before starting, make sure that you run "Main.Java" which is in the package ca.pfv.spmf.gui at least once. This will create a launch configuration in Eclipse for Main.Java that you will need later.
Right-click on your project containing the source code of SPMF.
Select "Export", select "Runnable JAR file" and click "Next"
Under "Launch configuration" select the class that should be launched by the JAR file, which is Main.java.
Indicate where you want to export the Jar files in your computer by clicking "Browse..."
Click "Finish".

If you have performed these steps correctly, the Jar file should have been written in the location that you chose in Step 3. Note that it is possible that these steps are slightly different depending on your version of Eclipse.

Running the GUI from the source code

To run the graphical user interface from the source code, just run the file "Main.java" in the package ca/pfv/spmf/gui/.

Modifying the GUI of SPMF

The file "MainWindow.java" in the package ca/pfv/spmf/gui/ contains the graphical user interface of SPMF. To create/modify the user interface, I use the Eclipse development environment for Java.

In Eclipse, I right-click on the class MainWindow. Then I choose "Open with.." and then "Window Builder Editor". Then, you can click on the "Design" tab to modify/design a user interface visually and on the "Source" tab to edit the code. If you don't have "window builder" in Eclipse, you could update Eclipse or check this website: https://eclipse.org/windowbuilder/ or find another similar plugin. If you are using Netbeans instead of Eclipse, similar features should be offered.

After you have modified the user interface, to run it from the source code, just run the file "Main.java" in the package ca/pfv/spmf/gui/.

Contributing source code and bug fixes

If you find some bugs in the source code or errors in the documentation, please contact me directly and I will fix them as soon as possible (generally within a few days).

If you want to contribute source code, contact with me. You can send me an e-mail with your proposed modifications. I will test your code to see that it has no bugs and also have a look at the source code. If your code is OK, I will add it to the next release of SPMF and I will also add your name to the list of contributors. If you want to contribute source code of an algorithm to SPMF, it would be easier to integrate if you follow the same source code organization as in SPMF. But if you don't, it can also be OK. In this case, I can take care of integrating your source code in SPMF for you. If you have any questions about that, again, just send me an e-mail.

How to add an algorithm to the official version of SPMF?

To add a new algorithm to the official version of SPMF that is offered on the website, you should first contact me by e-mail to tell me that you are interested to add a new algorithm. After that, there are eight important steps that should be followed to integrate a new algorithm in the official version of SPMF.

1) OUTPUT FORMAT:

If your algorithm performs the same task as some other algorithm(s) already offered in SPMF, then your algorithm should have the same output format as those algorithm(s). For example, if you want to incorporate a new sequential pattern mining algorithm, the output format, should be the following:

1 2 -1 #SUP: 2
1 -1 2 -1 6 -1 #SUP: 2
1 -1 2 -1 5 -1 #SUP: 2
1 -1 3 -1 6 -1 #SUP: 2
1 -1 3 -1 5 -1 #SUP: 2
1 -1 6 -1 5 -1 #SUP: 2

Each line is a sequential pattern. Items are positive integers separated by spaces. After each itemset, there is a "-1" to indicate the end of an itemset.    #sup: indicate the support of the sequential pattern. The support is a positive integer representing a number of sequences.

2) INPUT FORMAT

If your algorithm takes the same input as some other algorithm(s) already offered in SPMF, then your algorithm should have the same input format as those algorithm(s). For example, if you want to incorporate a new sequential pattern mining algorithm, the input format should be the following:

1 2 -1 3 -1 6 -1 7 -1 5 -1 -2
1 4 -1 3 -1 2 -1 1 2 5 6 -1 -2
1 -1 2 -1 6 7 -1 5 -1 -2
2 -1 6 7 -1 -2

Each line is a sequence.   Each itemset is a set of items, where items are positive integers separated by spaces. "-1" indicates the end of an itemset. "-2" indicates the end of a sequence.

3) ADDING YOUR COPYRIGHT INFORMATION AND AUTHOR INFORMATION TO THE SOURCE CODE

In each files of your source code, you should mention that you are the author of these files and the copyright belongs to you. Furthermore, you should mention the license for using your code. Since the GPL3 license is used for all the code in SPMF, you should also use the same GPL3 license for code to be integrated in SPMF. To do that, you should add the following on top of each Java file:

/** * * * This is an implementation of the INSERT_NAME_OF_YOUR_ALGORITHM algorithm.
*
* Copyright (c) 2016 INSERT_YOUR_NAME
*
* This file is part of the SPMF DATA MINING SOFTWARE * (https://www.philippe-fournier-viger.com/spmf).
*
*
* SPMF is free software: you can redistribute it and/or modify it under the * terms of the GNU General Public License as published by the Free Software * Foundation, either version 3 of the License, or (at your option) any later * version. *

* SPMF is distributed in the hope that it will be useful, but WITHOUT ANY * WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR * A PARTICULAR PURPOSE. See the GNU General Public License for more details. *
*
* You should have received a copy of the GNU General Public License along with * SPMF. If not, see http://www.gnu.org/licenses/ .
*
* @author INSERT_YOUR_NAME
*/

4) VERIFYING THAT YOUR CODE HAS COMMENTS

Each class should contains at least some comments to indicate what is the class used for, what is each class variable and what each method is doing.

For example, the class AlgoHMine will contain a comment at the top of the file indicating what is this algorithm about:

    /** * An implementation of the HMine algorithm for mining         frequent itemsets from a * transaction database.
    *
    * It is based on the description in:
    *
    * Pei et al. (2007) H-Mine: Fast and space-preserving frequent     pattern mining * in large databases. IIE Transactions, 39, ; 593-605.
    *
    * @see supportList
    *@see Element
    * @author Philippe Fournier-Viger, 2015
    */

Besides, each class variable will have a comment such as:

    /** the time the algorithm started */
    long startTimestamp;

    /** the time the algorithm terminated */
    long endTimestamp;

    /** the number of patterns generated */
    int patternCount;

Besides, each method (function) will have a comment such as:

    /** * Run the algorithm
    * @param input the input file path
    * @param output the output file path
    * @param minSupport the minimum support threshold
    * @throws IOException exception if error while writing the file
    */
    public void runAlgorithm(String input, String output, double         minSupport) throws IOException {
        ...
    }

5) INTEGRATING THE SOURCE CODE

The next step is to integrate the source code into SPMF. This is quite simple because the source code is well-organized into packages in SPMF. For adding a new algorithm, you should in general create a sub-package for your algorithm, unless it shares several files with another algorithm.

For example, if you want to add a sequential pattern mining algorithm named ABC, you should create a sub package of the package:  ca/pfv/spmf/algorithms/sequentialpatterns/ such as ca/pfv/spmf/algorithms/sequentialpatterns/ABC/

You would then put the source code of your ABC algorithm in this folder.

5) PROVIDING AN EXAMPLE OF HOW TO RUN THE ALGORITHM

Beside that, it is important to provide an example of how to run the algorithm for other users.

In SPMF, I put these files in the package ca/pfv/spmf/tests/.

You can see in that folder some examples of how to do it such as MainTestPrefixSpan_saveToFile. This file for example explains how to run the PrefixSpan algorithm. You will notice that in this folder there is sometimes two test files for the same algorithm: a version "saveToFile" and a version "saveToMemory" indicating if the result of the algorithm should be saved to a file or kept into memory. If you want, you can just make the "saveToFile" version because saving to file is what is important for integrating the algorithm to the graphical interface and command line. Providing a version of your algorithm that save to memory is just to be convenient for users who want to integrate the source code in another software. If you want to do it, you can do it, but it is not necessary.

7) ADDING THE ALGORITHM TO THE GRAPHICAL INTERFACE

After that, the next step is to add the algorithms to the user interface. To do that, you need to provide a description of your algorithm in the package ca.pfv.spmf.algorithmmanager.descriptions. There are many examples in this package for various algorithm. You may copy one of the example an adapt it for your algorithm.

8) WRITING AN EXAMPLE FOR THE DOCUMENTATION

Lastly, the next step is to add a description of the example provided in "ca.pfv.spmf.tests". This description will be added to the "documentation" of SPMF on the website.

https://www.philippe-fournier-viger.com/spmf/index.php?link=documentation.php

Finally, the modifications should be sent to me so that they can be validated by me and integrated in the official version of SPMF on the website.