This example explains how to run the text clusterer using the SPMF open-source data mining library.
How to run this example?
What is this algorithm?
The algorithm is a text clusterer implemented by Sabarish Raghu. It can group texts automatically into clusters. Furthermore, it offers the possiblity of performing stemming and removing stop words before performing clustering.
The text clusterer works as follows:
1. Load the input file
2. Remove the stopwords (optional)
3. Stem the words (optional)
4. Calculate tf*idf value for each record in the input file.
5. Calculate similarity matrix by using the tfidf values of the records.
6. Take most similar records per each record and make them as clusters initially.
7. Use the transitive rule A,B are most similar and B,C are most similar; A and C are likely to be similar. This imply that A, B, C are in the same cluster.
8. Merge the clusters based on the above rule for all the records.
9. Write the final output i.e,; final sets of clusters to the output file.
Words that are insignificant to identify the clusters. This algorithm by defauly uses the list of most popular stopwords. Anyways we can define our own stopword list or even not remove any word.
Stemming is deriving the base word of a word.
For example: Identification is stemmed to Identity
For stemming, we use the famous Porter stemmer algorithm, which uses rules given by Porter to stem the words. This implementation is taken from Brian Goetz's implementation from the internet.
Term frequency defines how frequently a term occur in a document
Idf:Inverse document frequency
Inverse document frequency is the frequency of a word in the whole set of documents.
2 dimensional matrix representation of a record's similarity with all other records.
What is the input?
The input is a text file.
Each line of the text file represents a text.
A line starts with an integer id. Then, it is followed by a tab, and then a text where words are separated by spaces.
An example of input is provided in the file "input_text_clustering.txt" of the SPMF distribution. It contains 100 texts (100 lines). For example, here are two lines from this file:
692770 inventory taker description team members required physically count inventory various retailers enter information equipment inventory counted varies depending type store audited items may located floor tables shelves various heights items generally counted shelves may moved required inventories take approximately hours complete however may take longer depending size store level inventory counted
574319 leading hotel furniture supplier seeking project managers manage national international accounts bid packages unique exciting opportunity right individuals qualified applicants should send resumes talli globalallies com phone calls please
What is the output?
The output is a set of clusters of similar texts.
The output file format is defined as follows. The first line indicates the file format. Then, each following line contains a text id followed by a cluster id. For example, here are the first five lines of the output file obtained by applying the text clusterer on the sample input file:
Results can be interpreted as follows. The texts with ids "171056" and "770853" both belongs to the same cluster, having the id "0". Moreover, the texts with ids "247263" and "870007" both belongs to the same cluster, having the id "1".
<< Return to table of contents of SPMF documentation
Copyright © 2008-2020 Philippe Fournier-Viger. All rights reserved.