SPMFAn Open-Source Data Mining Library

Introduction

Algorithms

Download

Documentation

Datasets

FAQ

License

Contributors

Citations

Performance

Developers' guide

Forum

Mailing-list

Blog

 

-------------
We are hiring a postdoctoral researcher (details here) and a Ph.D. student.  Send your CV with cover letter to Prof. Fournier-Viger.
-------------

439086 visitors
since 2010

Developers' guide

This web page is intended for developers who want to modify the SPMF source code. This page provides information about how the source code is organized to make it easier to understand the source code, modify it and reuse it in other projects.

If you are looking for examples of how to run the algorithms, you should instead look at the documentation of SPMF.

Also, another source of information is the FAQ.

Compiling the source code

The source code of SPMF is designed to be easily reusable. To compile it, the only requirement is a recent version of the Java JDK (>=1.8).

To compile the source code, it is more convenient to use an integrated development  environment because there are many classes and packages. I recommend to use an IDE  such as Eclipse or Net Beans for compiling and modifying the source code. The  instructions for installing the source code of SPMF in Eclipse or Net Beans and compiling it can be found in this document: how_to_install.txt. If you use another IDE, it should be straightforward to install the source code if you are familiar with Java.

Source code organization - packages

The source code is organized into a package hierarchy (a package is a folder in Java terminology). Here is a description of the main packages in SPMF.

  • ca/pfv/spmf/ : this is the main package containing all the other packages of SPMF.
  • ca/pfv/spmf/algorithms/ : this package contains all the data mining algorithms
    • ca/pfv/spmf/algorithms/associationrules/ : contains the source code of association rule mining algorithms.
    • ca/pfv/spmf/algorithms/classifiers/ : contains the source code of all classification algorithms.
    • ca/pfv/spmf/algorithms/clustering/ : contains the source code of all the clustering algorithms.
    • ca/pfv/spmf/algorithms/frequentpatterns/ : contains the source code of all itemset mining algorithms.
    • ca/pfv/spmf/algorithms/sequential_rules/ : contains the source code of sequential rule mining algorithms.
    • ca/pfv/spmf/algorithms/sequentialpatterns/ : contains the source code of sequential pattern mining algorithms.
    • ca/pfv/spmf/algorithms/sort/ : contains the source code of sorting algorithms.
  • ca/pfv/spmf/datastructures : contains the source code of some special data structures that are used by some algorithms such as the triangular matrix.
  • ca/pfv/spmf/documentation/ : this package just contain a text file that link to the documentation on the website.
  • ca/pfv/spmf/gui : contains the source code of the graphical user interface (MainWindow.Java) and command line (Main.Java). To run the graphical user interface from the source code, just run Main.Java.
  • ca/pfv/spmf/input/ : contains the source code for reading some types of input file such as transaction databases and sequence databases. These classes are used by several algorithms.
  • ca/pfv/spmf/patterns/ : defines various types of patterns used by different algorithms such as itemsets and rules.
  • ca/pfv/spmf/tests/ : contains example files that show how to run each algorithm from the source code. It is useful for running the algorithms from the source code. Note that these files are intended for developers and some of them contains absolute file paths such as "c:/..." that should be replaced by a valid path on your computer before running them. Note also that these files are not "unit tests" even though they are named "test".
  • ca/pfv/spmf/tools/ : contains various tools such as random dataset generators, a tool for generating statistics on sequence databases and a tool for converting various formats of sequence database to the SPMF format.

Source code organization - algorithms

In general, each algorithm has its own package containing its main files. A few files are shared by multiple algorithms and can be found in the ca/pfv/spmf/input/ or ca/pfv/spmf/patterns directories.

For example, the SPAM algorithm is located in the package ca/pfv/spmf/algorithms/sequentialpatterns/spam, which is a sub package of ca/pfv/spmf/algorithms/sequentialpatterns/ because SPAM is a sequential pattern mining algorithm.

Each algorithm has a main class with a name starting with "Algo..." and a method "runAlgorithm()" for running the algorithm. For example, the main file for the SPAM algorithm is AlgoSPAM.java and it has a runAlgorithm() method to run the algorithm. This method takes three parameters as input : (1) an input file, (2) an output file and (3) a minsup threshold.

To see an example of how to run the SPAM algorithm from the source code, we should go in the ca/pfv/spmf/tests/ folder where all the example files for developers are located. In this folder, all algorithms have at least one file named "MainTestXXXX.java" where XXXX is the name of the algorithm. For example, the file showing how to execute the SPAM algorithm is "MainTestSPAM_saveToFile.java". If we open the file, we see the following code:

...

// Load a sequence database
String input = fileToPath("contextPrefixSpan.txt");
String output = ".//output.txt";

// Create an instance of the algorithm
AlgoSPAM algo = new AlgoSPAM();

// execute the algorithm with minsup = 2 sequences (50 %)
algo.runAlgorithm(input, output, 0.5);
algo.printStatistics();

...

This example correspond to the SPAM example in the documentation section of the website. The input file is "contextPrefixSpan.txt", which correspond to the example in the documentation on the website, which can be found in ca/pfv/spmf/tests/ .

In this example, the output file path is ".//output.txt". This line can be replaced by whatever you like. You should chose a path that exist on your computer.

Finally, the line "runAlgorithm()" method launches the algorithm. In this example, we can see that the parameter 0.5 is used. To see the meaning of that parameter for the SPAM algorithm, we should look again at the documentation on the website (here). On the website it is explained that this parameter is "the minimum support".

Understanding an algorithm

A good effort has been made to put a lot of comments in the source code. However, to understand an algorithm, it is highly recommended to read the research paper describing the algorithm first. Link to most of the papers can be found on this page of the website.

Copying the source code of an algorithm in another Java project

Reusing the source code of a single algorithm from SPMF in another Java project is easy, since in general algorithms in SPMF are separated by packages, except for a few datastructures and classes that are shared by multiples algorithms.

If you want to reuse a single algorithm from SPMF without copying all the code from SPMF, you should first determine which classes are necessary. For example, if you want to copy the source code of the SPAM algorithm in another project, then you would need to first locate the package for the SPAM algorithm. It is ca/pfv/spmf/algorithms/sequentialpatterns/spam_saveToFile/ .

Then, you would need to copy the file AlgoSPAM.java in this package, as well as other files used by that file which are: AlgoSPAM. java, Bitmap.Java, Itemset.java and Prefix.Java.

Then you should check the first lines of those Java files in that package to see if there is some additional files that are required by looking at the import statements of each file. In the case of SPAM, there is only two such files indicated by the import statement in AlgoSPAM.java, which are:

  • ca.pfv.spmf.patterns.itemset_list_integers_without_support.Itemset;
  • ca.pfv.spmf.tools.MemoryLogger;

Therefore, we should also copy these files. Then we should also look at the import statement and dependency of these files to see if other files are required.

Note that each Java file has a package statement in its first line. If you move the files to another package, you should change the package statements accordingly.

Also, note that the source code of SPMF is distributed under the GPL license. If you copy parts of the source code of SPMF in another project, it should comply with the GPL license (see license).

Naming conventions

In SPMF, the source code tries to follow the Standard Code Conventions for the Java Programming Language. Also, in general, the Javadoc convention is used for documenting the code.

How to generate the jar file for the GUI version of SPMF

If you want to regenerate the jar file for the GUI version of SPMF, you can do as follows in Eclipse:

  1. Right-click on your project containing the source code of SPMF.
  2. Select "Export".
  3. Select "Runnable Jar files" and click "Next"
  4. Select the main class to be used for the Jar file : ca.pfv.spmf.gui.MainGui
  5. Select where you want to export the Jar files in your computer by clicking "Browse..."
  6. Click "Finish".

If you have performed these steps correctly, the Jar file should have been written in the location that you chose in Step 3. Note that it is possible that these steps are slightly different depending on your version of Eclipse.

Running the GUI from the source code

To run the graphical user interface from the source code, just run the file "Main.java" in the package ca/pfv/spmf/gui/.

Modifying the GUI of SPMF

The file "MainWindow.java" in the package ca/pfv/spmf/gui/ contains the graphical user interface of SPMF. To create/modify the user interface, I use the Eclipse development environment for Java.

In Eclipse, I right-click on the class MainWindow. Then I choose "Open with.." and then "Window Builder Editor". Then, you can click on the "Design" tab to modify/design a user interface visually and on the "Source" tab to edit the code. If you don't have "window builder" in Eclipse, you could update Eclipse or check this website: https://eclipse.org/windowbuilder/ or find another similar plugin. If you are using Netbeans instead of Eclipse, similar features should be offered.

After you have modified the user interface, to run it from the source code, just run the file "Main.java" in the package ca/pfv/spmf/gui/.

Contributing source code and bug fixes

If you find some bugs in the source code or errors in the documentation, please contact me directly and I will fix them as soon as possible (generally within a few days).

If you want to contribute source code, contact with me. You can send me an e-mail with your proposed modifications. I will test your code to see that it has no bugs and also have a look at the source code. If your code is OK, I will add it to the next release of SPMF and I will also add your name to the list of contributors. If you want to contribute source code of an algorithm to SPMF, it would be easier to integrate if you follow the same source code organization as in SPMF. But if you don't, it can also be OK. In this case, I can take care of integrating your source code in SPMF for you. If you have any questions about that, again, just send me an e-mail.

How to add an algorithm to the official version of SPMF?

To add a new algorithm to the official version of SPMF that is offered on the website, you should first contact me by e-mail to tell me that you are interested to add a new algorithm. After that, there are seven important steps that should be followed to integrate a new algorithm in the official version of SPMF.

1) OUTPUT FORMAT:

If your algorithm performs the same task as some other algorithm(s) already offered in SPMF, then your algorithm should have the same output format as those algorithm(s). For example, if you want to incorporate a new sequential pattern mining algorithm, the output format, should be the following:

1 2 -1 #SUP: 2
1 -1 2 -1 6 -1 #SUP: 2
1 -1 2 -1 5 -1 #SUP: 2
1 -1 3 -1 6 -1 #SUP: 2
1 -1 3 -1 5 -1 #SUP: 2
1 -1 6 -1 5 -1 #SUP: 2

Each line is a sequential pattern.  Items are positive integers separated by spaces.  After each itemset, there is a "-1" to indicate the end of an itemset.    #sup: indicate the support of the sequential pattern. The support is a positive integer representing a number of sequences.

2) INPUT FORMAT  

If your algorithm takes the same input as some other algorithm(s) already offered in SPMF, then your algorithm should have the same input format as those algorithm(s). For example, if you want to incorporate a new sequential pattern mining algorithm, the input format should be the following:

1 2 -1 3 -1 6 -1 7 -1 5 -1 -2
1 4 -1 3 -1 2 -1 1 2 5 6 -1 -2
1 -1 2 -1 6 7 -1 5 -1 -2
2 -1 6 7 -1 -2 

Each line is a sequence.   Each itemset is a set of items, where items are positive integers separated by spaces. "-1"  indicates the end of an itemset. "-2" indicates the end of a sequence.

3) ADDING YOUR COPYRIGHT INFORMATION AND AUTHOR INFORMATION TO THE SOURCE CODE

In each files of your source code, you should mention that you are the author of these files and the copyright belongs to you.  Furthermore, you should mention the license for using your code. Since the GPL3 license is used for all the code in SPMF, you should also use the same GPL3 license for code to be integrated in SPMF.  To do that, you should add the following on top of each Java file:

/** * * * This is an implementation of the INSERT_NAME_OF_YOUR_ALGORITHM algorithm.
*
* Copyright (c) 2014 INSERT_YOUR_NAME
*
* This file is part of the SPMF DATA MINING SOFTWARE * (http://www.philippe-fournier-viger.com/spmf).
*
*
* SPMF is free software: you can redistribute it and/or modify it under the * terms of the GNU General Public License as published by the Free Software * Foundation, either version 3 of the License, or (at your option) any later * version. *

* SPMF is distributed in the hope that it will be useful, but WITHOUT ANY * WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR * A PARTICULAR PURPOSE. See the GNU General Public License for more details. *
*
* You should have received a copy of the GNU General Public License along with * SPMF. If not, see .
*
* @author INSERT_YOUR_NAME
*/

4) INTEGRATING THE SOURCE CODE

The next step is to integrate the source code into SPMF. This is quite simple because the source code is well-organized into packages in SPMF. For adding a new algorithm, you should in general create a sub-package for your algorithm, unless it shares several files with another algorithm.

For example, if you want to add a sequential pattern mining algorithm named ABC, you should create a sub package of the package:  ca/pfv/spmf/algorithms/sequentialpatterns/  such as ca/pfv/spmf/algorithms/sequentialpatterns/ABC/

You would then put the source code of your ABC algorithm in this folder.

5)  PROVIDING AN EXAMPLE OF HOW TO RUN THE ALGORITHM

Beside that, it is important to provide an example of how to run the algorithm for other users.

In SPMF, I put these files in the package  ca/pfv/spmf/tests/.

You can see in that folder some examples of how to do it such as MainTestPrefixSpan_saveToFile.  This file for example explains how to run the PrefixSpan algorithm. You will notice that in this folder there is sometimes two test files for the same algorithm: a version "saveToFile" and a version "saveToMemory" indicating if the result of the algorithm should be saved to a file or kept into memory.  If you want, you can just make the "saveToFile" version because saving to file is what is important for integrating the algorithm to the graphical interface and command line.  Providing a version of your algorithm that save to memory is just to be convenient for users who want to integrate the source code in another software.  If you want to do it, you can do it, but it is not necessary.

6)  ADDING THE ALGORITHM TO THE GRAPHICAL INTERFACE

After that, the next step is to add the algorithms to the user interface. This is done by modifying the class ca.pfv.spmf.gui.MainWindow.java.  There are several examples in this file. You should do the same thing as for the other algorithms. If you are not comfortable to do this step, I could do it for you.

7) WRITING AN EXAMPLE FOR THE DOCUMENTATION

Lastly, the next step is to add a description of the example provided in "ca.pfv.spmf.tests". This description will be added to the "documentation" of SPMF on the website.  

http://www.philippe-fournier-viger.com/spmf/index.php?link=documentation.php


Finally, the modifications should be sent to me so that they can be validated by me and integrated in the official version of SPMF on the website.

Other questions

If you have other questions about the source code, don't hesitate to write your questions in the forum or to contact me directly if your question has to be confidential. I usually answer all questions within 2 or 3 days, except when I'm very busy.

 

Copyright © 2008-2017 Philippe Fournier-Viger. All rights reserved.