Clustering Texts with a text clusterer (SPMF documentation)

This example explains how to run the text clusterer using the SPMF open-source data mining library.

How to run this example?

If you are using the graphical interface, (1) choose the "TextClusterer" algorithm, (2) select the input file "input_text_clustering.txt", (3) set the output file name (e.g. "output.txt") (4) set "Perform stemming:" to true, (5) set "Remove stopwords:" to true, and (5) click "Run algorithm".
If you want to execute this example from the command line, then execute this command:
java -jar spmf.jar run TextClusterer input_text_clustering.txt output.txt true true in a folder containing spmf.jar and the example input file input_text_clustering.txt.
If you are using the source code version of SPMF, launch the file "MainTestTextClusterer.java" in the package ca.pfv.SPMF.tests.

What is this algorithm?

The algorithm is a text clusterer implemented by Sabarish Raghu. It can group texts automatically into clusters. Furthermore, it offers the possiblity of performing stemming and removing stop words before performing clustering.

The text clusterer works as follows:
1. Load the input file
2. Remove the stopwords (optional)
3. Stem the words (optional)
4. Calculate tf*idf value for each record in the input file.
5. Calculate similarity matrix by using the tfidf values of the records.
6. Take most similar records per each record and make them as clusters initially.
7. Use the transitive rule A,B are most similar and B,C are most similar; A and C are likely to be similar. This imply that A, B, C are in the same cluster.
8. Merge the clusters based on the above rule for all the records.
9. Write the final output i.e,; final sets of clusters to the output file.

StopWords
Words that are insignificant to identify the clusters. This algorithm by defauly uses the list of most popular stopwords. Anyways we can define our own stopword list or even not remove any word.

Stemming
Stemming is deriving the base word of a word.
For example: Identification is stemmed to Identity

For stemming, we use the famous Porter stemmer algorithm, which uses rules given by Porter to stem the words. This implementation is taken from Brian Goetz's implementation from the internet.

tf:term frequency
Term frequency defines how frequently a term occur in a document

Idf:Inverse document frequency
Inverse document frequency is the frequency of a word in the whole set of documents.

similarity matrix:
2 dimensional matrix representation of a record's similarity with all other records.

What is the input?

The input is a text file.

Each line of the text file represents a text.

A line starts with an integer id. Then, it is followed by a tab, and then a text where words are separated by spaces.

An example of input is provided in the file "input_text_clustering.txt" of the SPMF distribution. It contains 100 texts (100 lines). For example, here are two lines from this file:

692770 inventory taker description team members required physically count inventory various retailers enter information equipment inventory counted varies depending type store audited items may located floor tables shelves various heights items generally counted shelves may moved required inventories take approximately hours complete however may take longer depending size store level inventory counted

574319 leading hotel furniture supplier seeking project managers manage national international accounts bid packages unique exciting opportunity right individuals qualified applicants should send resumes talli globalallies com phone calls please

What is the output?

The output is a set of clusters of similar texts.

The output file format is defined as follows. The first line indicates the file format. Then, each following line contains a text id followed by a cluster id. For example, here are the first five lines of the output file obtained by applying the text clusterer on the sample input file:

RecordId Clusternum
171056 0
770853 0
247263 1
870007 1

Results can be interpreted as follows. The texts with ids "171056" and "770853" both belongs to the same cluster, having the id "0". Moreover, the texts with ids "247263" and "870007" both belongs to the same cluster, having the id "1".