Classifying Text documents using a Naive Bayes approach (SPMF documentation)

This example explains how to run the text classifier based on Naive Bayes using the SPMF open-source data mining library.

How to run this example?

What is this algorithm?

The algorithm is a text document classifier implemented by Sabarish Raghu. It can classify texts automatically into categories. For example, it can be used to classified text by subjects.

The input is a training set of texts classified by categories (with known categories) and a set of texts to be classified (with unknown categories).

The output is a category for each text from the testing set.

How the algorithm works?

The Naive Bayes classifier is a probabilistic classifier. It compute the probability of a document d being in a class c as follows:
P(c|d) ∝ P(c) Y 1≤k≤nd P(tk |c) where
nd is the length of the document. (number of tokens)
P(tk |c) is the conditional probability of term tk occurring in a document of class c
P(tk |c) as a measure of how much evidence tk contributes that c is the correct class.
P(c) is the prior probability of c.
If a document’s terms do not provide clear evidence for one class vs. another, we choose the c with highest P(c)

Pseudocode of the algorithm

Algorithm ():
Naïve Bayes(Test_Data_Dir, Training_Data_Dir)
{
For(each test file in test data directory)
For each class
Map<class, probability> ProbabilityMap;
For each word in test file
Wordprobability=Probability of occurance of that word in the class
ProbabilityMap.put(className,probability*Wordprobability)
Classified_class=Key of Max probability value
}

The algorithm contains a set of classes, as follows:

Class "TestRecord"
Holds the Test record as an object.

* String RecordId Filename of the Test File
* String fullRecord Test record as a single string.
* ArrayList<String> words words in the test record.

Class "OccurrenceProbabilties"
Used as a cache to store the probabilities of words associated with a particular class.
* String className Classname
* Hashmap<String,Double> Probability of the each word


Class "MemoryFile"
Holds the training record as an object.
* String className Class name of the training file
* ArrayList<String> content Words in the class.

Flow of the Code

1. Read each test file, remove stopwords, perform stemming and load in to objects.
2. Read each training file, remove stopwords, perform stemming and load in to objects.
3. For each test file, for each class name, for each word; check if the probability already exist in cache.
4. Else compute the probability of each word and multiply them to get overall probability for the test file.
5. Check which probability has maximum among the classes for the test file which gives the class value of the file.

There are two modes of execution

Take your choice depending upon the size of the dataset and computing power you have in the machine.

* In Memory
Training Data is loaded in to memory as objects.
Executes much faster
Significantly less number of file reads.
Higher memory load.

* File Read
Handles Training data as files as it is.
Executes slower
More number of file reads.
Significantly less memory load.

How to use this algorithm

An example of how to use the algorithm is provided in the file MainTestTextClassifier of the package ca.pfv.spmf.test. To run the algorithm, one needs to create an instance of the algorithm and call the runAlgorithm() method:

AlgoNaiveBayesClassifier nbClassifier=new AlgoNaiveBayesClassifier();
nbClassifier.runAlgorithm(<Training_Directory>,<Test_Directory>,<Output_Directory>,<Memory_Flag>);

The output is a file indicating the categories of each text from the testing set.

Output ‘output.tsv’ would be found in the output directory ‘output.tsv’

Input of the algorithm

The algorithm takes a input two directories. The first directory is a set of training texts. The second directory is a set of testing texts that need to be classified.

In the package ca.pfv.spmf.test.text_classification_set, there are some sample files for training and testing.

Please follow the following structure of directory for Test and training directory.

For training directory (to train the algorithm):

TrainingDirectoryName
--->ClassName1
--->Trainingfile1
--->Trainingfile2

--->TrainingFileN
--->ClassName2
--->Trainingfile1
--->Trainingfile2

.........

For the directory of test files (to be classified)

---->TestDirectoryName
---->Testfile1
---->Testfile2
---->Testfile3
...

Output of the algorithm

The algorithm outputs a file ‘output.tsv’ in the output directory indicating the categories (classes) attributed to each text of the test set.