Clustering using the AEDBScan Algorithm (SPMF documentation)

This example explains how to run the AEDBScan algorithm using the SPMF open-source data mining library.

How to run this example?

If you are using the graphical interface, (1) choose the "AEDBScan" algorithm, (2) select the input file "szu.txt", (3) set the output file name (e.g. "output.txt") (4) set minPts =100, and (5) click "Run algorithm".
Note that there is also an optional parameter called "separator". It can be used to read other types of input files where values are not separated by spaces. For example, for a file separated by the ',' character, the parameter "separator" would have to be set to ",".
If you want to execute this example from the command line, then execute this command:
java -jar spmf.jar run AEDBScan szu.txt output.txt 100 in a folder containing spmf.jar and the example input file szu.txt.
If you are using the source code version of SPMF, launch the file "MainTestAEDBScan_saveToFile.java" in the package ca.pfv.SPMF.tests.

What is AEDBScan?

AEDBScan is a variation of the DBScan algorithm where there is a single parameter instead of two. The idea is that the epsilon parameter of DBScan is set automatically by the algorithm. In practice, AEDBScan may work better or worse than DBScan depending on the parameters and datasets.

Implementation note: To avoid having a O(n^2) time complexity, this implementation uses a KD-Tree to store points internally.

What is the input?

AEDBScan takes as input (1) a set of instances having a name and containing one or more double values, and (2) a parameter minPts (a positive integer >=1) indicating the number of points that a core point need to have in its neighborhood (see paper about AEDBScan for more details).

The input file format is is a text file containing several instances.

The first lines (optional) specify the name of the attributes used for describing the instances. In this example, two attributes will be used, named X and Y. But note that more than two attributes could be used. Each attribute is specified on a separated line by the keyword "@ATTRIBUTEDEF=", followed by the attribute name

Then, each instance is described by two lines. The first line (which is optional) contains the string "@NAME=" followed by the name of the instance. Then, the second line provides a list of double values separated by single spaces.

An example of input is provided in the file "szu.txt" of the SPMF distribution. It contains hundreds of instances, each described by two attributes:.

1.12 3.65
8.71 1.86
9.23 2.26
3.68 5.28
0.42 4.50
7.99 5.36
2.10 4.31

For example, the first instance is a vector with two values: 1.12 and 3.65 for the first and second attributes, respectively.

This input file represents a set of vectors that have two dimensions. But note that, it is possible to use more than two attributes to describe instances. To give a better idea of what the input file looks like, here is a visual representation:

The distance function used by AEDBscan is the Euclidian distance.

What is the output?

AEDBScans groups vectors (points) in clusters based on density and distance between points.

Note that it is normal that AEDBScan may generate a cluster having less than minPts (this happens if the neibhoors of a core points get "stolen" by another cluster).

Note also that AEDBScan may eliminate points that are seen as noise (a point having less than minPts neighboors within a radius of epsilon)

By running AEDBScan on the previous input file and minPts =100, we obtain the following output file containing 15 clusters. The output file looks like this:

@ATTRIBUTEDEF=Attribute0
@ATTRIBUTEDEF=Attribute1
[Instance761 0.18 2.61][Instance6199 0.33 2.99][Instance4391 0.45 3.06][Instance7835 0.46 3.0][Instance6689 0.5 3.02] .....
[Instance2314 0.88 1.12][Instance3626 1.57 1.13][Instance262 1.58 1.14] ....
.....

The output file format is defined as follows. The first few lines indicate the attribute names. Each attribute is specified on a separated line with the keyword "ATTRIBUTEDEF=" followed by the attribute name (a string). Then, the list of clusters is indicated. Each cluster is specified on a separated line, listing the instances contained in the cluster. An instance is a name followed by a list of double values separated by " " and between the "[" and "]" characters.

The clusters found by the algorithm can be viewed visually using the "Cluster Viewer" provided in SPMF. If you are using the graphical interface of SPMF, select "Open output file using" the "Cluster Viewer" before pressing the "Run Algorithm" button. The result will be displayed in the Cluster Viewer, as follows:

As it can be seen in this example, the result can be more or less good. Points that are close to each other have put in the same clusters but some closers may be broken into separate parts. An interesting thing about AEDBScan is that it can find clusters of various shapes.

Using an Extended Input File Format

Note that it is also possible to use an extended input file format where names can be given to the attributes and instances in the input file. In this case, the first lines (optional) specify the name of the attributes used for describing the instances. We will give an example with two attributes named X and Y. Each attribute is specified on a separated line by the keyword "@ATTRIBUTEDEF=", followed by the attribute name. Then, each instance is described by two lines. The first line (which is optional) contains the string "@NAME=" followed by the name of the instance. Then, the second line provides a list of double values separated by single spaces. An example of input using this extended format is provided in the file "inputDBScan2.txt" of the SPMF distribution. It contains 31 instances, each described by two attribute called X and Y, and the first lines look like this:

@ATTRIBUTEDEF=X
@ATTRIBUTEDEF=Y
@NAME=Instance1
1 1
@NAME=Instance2
0 1
@NAME=Instance3
1 0
@NAME=Instance4
11 12
...

Applying AEDBScan to time series

Note that the AEDBScan algorithm implementation in SPMF can also be applied to time series database such as the file contextSAX.txt in the SPMF distribution. To apply this algorithm to time series, it is necessary to set the "separator" parameter of this algorithm to "," since time series files separate values by "," instead of separating by spaces.

Where can I get more information about AEDBScan?

AEDBScan is a most famous data mining algorithm for clustering. It is described in almost all data mining books that focus on algorithms, and on many websites. The original article describing AEDBScan is:

Mistry, Vidhi, et al. "AEDBSCAN—adaptive epsilon density-based spatial clustering of applications with noise." Progress in Advanced Computing and Intelligent Engineering: Proceedings of ICACIE 2019, Volume 2. Springer Singapore, 2021.