How to train the KNN Classifier to Perform Classification (SPMF documentation)

This example explains how to run the KNN algorithm using the SPMF open-source data mining library.

What is KNN?

The KNN (k-nearest neighbors) algorithm is a very simple and popular algorithm for classification.

KNN is a classifier: it is designed for classifying instances. It is well known and is described in many artificial intelligence and data mining books.

The KNN algorithm takes as input a dataset that consists of a set of records described using attributes, which are assumed to be nominal attributes (strings). The goal of classification is to guess the missing value of an attribute called the target attribute, based on the values of the other attributes. For example, consider data about customers of a bank. Each record (customer) may be described using various attributes such as age, gender, city, education and steal-money?. Consider that steal-money? is an attribute that indicates whether a customer has stolen some money or not (yes or no). The goal of classification can be to guess whether a customer will steal money (yes or no) given the values of the other attributes (the age, gender, city and education) of that customer.

To do classification, the KNN algorithm first creates a model (a classifier) using training data (records where the target attribute value is known). After the model is created, it can be used to guess the missing value of the target attribute for a new record. For instance, using data about previous customers of the bank, it is possible to build a model that helps to guess whether new customers will steal money or not.

There exist many algorithms for classification in the data mining literature. KNN is popular because it is simple and quite effective.

A) How to train the KNN model to make a prediction

How to run this example?

What is the input?

The input is a dataset that contains a set of records described according to some attributes. For instance, in this example, we use a dataset called tennisExtended.txt, which is provided in the file "tennisExtended.txt" of the SPMF distribution. This dataset defines 7 attributes (outlook, temp, humid, wind, play, day and moon) and contains 19 instances (records).

Besides, the user needs to specify a value K, which is the number of most similar instances (records) that KNN uses to make a prediction.

What is the model created by KNN?

The KNN algorithm can be used to create a model that can then be used to perform classification. For KNN, the model just consists of keeping the dataset in memory. Then, KNN can analyze that model to guess the missing value of a target attribute for a new record. For example, let's say that there is a new record where the value of the attribute "play" is unknown:

???? rainy mild high strong monday small

 

KNN will try to guess the value of "play". To do this, KNN first finds the K instances in the dataset that are the most similar to the above instance. Then, KNN chooses the most frequent value of the "play" attribute among these K instances as the prediction. In the above example, if we set K = 3, the most frequent value is "yes". Thus, KNN will predict that the value ???? is "yes".
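The prediction step described above can be sketched in a few lines of Java. This is only an illustration and not the SPMF implementation: the class KnnSketch, the use of the number of differing attribute values as the distance, and the way ties are broken (file order) are all assumptions made for this sketch. To keep the result free of ties, it uses only five records taken from tennisExtended.txt.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative KNN for nominal attributes; NOT the SPMF implementation. */
final class KnnSketch {

    /** Five records from tennisExtended.txt; index 0 holds the target attribute "play". */
    static final List<String[]> TRAINING = Arrays.asList(
            new String[] {"yes", "rainy", "mild", "high", "strong", "tuesday", "small"},
            new String[] {"yes", "rainy", "mild", "normal", "weak", "monday", "small"},
            new String[] {"yes", "rainy", "mild", "high", "strong", "friday", "full"},
            new String[] {"no", "sunny", "hot", "high", "weak", "tuesday", "full"},
            new String[] {"no", "sunny", "mild", "high", "weak", "friday", "small"});

    /** The new record, without its unknown "play" value. */
    static final String[] QUERY = {"rainy", "mild", "high", "strong", "monday", "small"};

    /** Assumed distance: the number of attributes whose values differ. */
    static int distance(String[] a, String[] b) {
        int mismatches = 0;
        for (int i = 0; i < a.length; i++) {
            if (!a[i].equals(b[i])) {
                mismatches++;
            }
        }
        return mismatches;
    }

    /** Returns the most frequent target value among the k records closest to the query. */
    static String predict(List<String[]> training, String[] query, int k) {
        // Sort a copy by distance to the query (stable sort, so ties keep file order).
        List<String[]> sorted = new ArrayList<>(training);
        sorted.sort(Comparator.comparingInt(
                (String[] row) -> distance(Arrays.copyOfRange(row, 1, row.length), query)));

        // Count the target values of the k nearest records and keep the leading one.
        Map<String, Integer> votes = new HashMap<>();
        String best = null;
        for (String[] row : sorted.subList(0, Math.min(k, sorted.size()))) {
            int count = votes.merge(row[0], 1, Integer::sum);
            if (best == null || count > votes.get(best)) {
                best = row[0];
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(predict(TRAINING, QUERY, 3)); // prints "yes"
    }
}
```

In this sketch, the three nearest records are at distances 1, 2 and 2 and all have the value "yes". On the full dataset, several records can be equally close to the new record, so the result may depend on the distance and tie-breaking rules that are actually used.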

Optional feature: saving a model to a file using serialization to load it again into memory later

Training is generally very fast. But if you want to save a trained model to a file and load it into memory later, it is possible. Saving the model is done by uncommenting the following line of code in the example:

classifier.saveTrainedClassifierToFile("classifier.ser"); // Save the model to a file

Loading a saved model is done using the following line of code in the example:

classifier = Classifier.loadTrainedClassifierToFile("classifier.ser");

Optional feature: saving the model as a set of rules into a text file (for the purpose of analysis)

If you want to see the rules of the trained model, you may save them to a text file by uncommenting the following code in the example:

String rulesPath = "rulesPath.txt";
((RuleClassifier)classifier).writeRulesToFileSPMFFormatAsStrings(rulesPath,dataset);

where rulesPath.txt is the output file path and dataset is the training dataset.

Note that rules saved in this format cannot be loaded back into memory. If you want to save a model and load it into memory later, you should save it using serialization (see above) instead.

Input format (default)

A few input file formats are supported by this algorithm. The first one is a text file such as the dataset "tennisExtended.txt" used in this example:

play outlook temp humid wind day moon
no sunny hot high weak tuesday full
no sunny hot high strong tuesday small
yes overcast hot high weak tuesday full
yes rainy mild high strong tuesday small
yes rainy cool normal weak monday full
no rainy cool normal strong monday small
yes overcast cool normal strong monday full
no sunny mild high weak monday small
yes sunny cool normal weak monday full
yes rainy mild normal weak monday small
yes sunny mild normal strong friday full
no rainy cool normal strong friday small
yes overcast cool normal strong friday full
no sunny mild high weak friday small
yes sunny cool normal weak friday full
yes overcast hot high weak friday small
yes rainy mild high strong friday full
yes rainy mild normal weak friday small
yes sunny mild normal strong tuesday full

The first line indicates the names of the attributes, each separated by a space. Then each of the following lines is a record where attribute values are separated by spaces. There are 19 records.
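To make the format concrete, here is a minimal Java sketch of how such a file could be parsed. It is not the reader used by SPMF: the class SpaceSeparatedDatasetSketch and its methods are hypothetical names introduced for illustration only, and the sample text contains just the header and the first two records of the file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative parser for the default text format; NOT the SPMF dataset reader. */
final class SpaceSeparatedDatasetSketch {

    /** The header and the first two records of tennisExtended.txt. */
    static final String SAMPLE =
            "play outlook temp humid wind day moon\n"
          + "no sunny hot high weak tuesday full\n"
          + "no sunny hot high strong tuesday small\n";

    final String[] attributeNames;
    final List<String[]> records = new ArrayList<>();

    private SpaceSeparatedDatasetSketch(String[] attributeNames) {
        this.attributeNames = attributeNames;
    }

    /** Reads the attribute names from the first line and one record from each following line. */
    static SpaceSeparatedDatasetSketch parse(String text) {
        String[] lines = text.trim().split("\\R");
        SpaceSeparatedDatasetSketch dataset =
                new SpaceSeparatedDatasetSketch(lines[0].trim().split("\\s+"));
        for (int i = 1; i < lines.length; i++) {
            String line = lines[i].trim();
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            String[] values = line.split("\\s+");
            if (values.length != dataset.attributeNames.length) {
                throw new IllegalArgumentException("Line " + (i + 1) + " has "
                        + values.length + " values instead of " + dataset.attributeNames.length);
            }
            dataset.records.add(values);
        }
        return dataset;
    }

    /** Returns the position of an attribute (for example the target attribute), or -1. */
    int indexOf(String attributeName) {
        return Arrays.asList(attributeNames).indexOf(attributeName);
    }

    public static void main(String[] args) {
        SpaceSeparatedDatasetSketch dataset = parse(SAMPLE);
        System.out.println(dataset.attributeNames.length + " attributes, "
                + dataset.records.size() + " records"); // prints "7 attributes, 2 records"
    }
}
```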

Alternative input format (ARFF)

It is also possible to use a dataset encoded using the ARFF format as an alternative to the default dataset format. The specification of the ARFF format can be found here. Most features of the ARFF format are supported, except that all attribute values are treated as nominal values. This is due to the design of this KNN implementation, which only handles nominal values. If numerical values appear in the data, they are treated as nominal values (strings). To load a file using the ARFF format, the following lines of code can be used in the example:

String datasetPath = fileToPath("weather-train.arff");
ARFFDataset dataset = new ARFFDataset(datasetPath, targetClassName);

Using these lines, the dataset weather-train.arff will be used, which contains the following content:

@relation weather.tennis

@attribute outlook {sunny,overcast,rainy}
@attribute temperature {hot,mild,cool}
@attribute humidity {high,normal}
@attribute windy {strong,weak}
@attribute play {yes,no}

@data
sunny,hot,high,weak,no
sunny,hot,high,strong,no
overcast,hot,high,weak,yes
rainy,mild,high,weak,yes
rainy,cool,normal,weak,yes
rainy,cool,normal,strong,no
overcast,cool,normal,strong,yes
sunny,mild,high,weak,no
sunny,cool,normal,weak,yes
rainy,mild,normal,weak,yes
sunny,mild,normal,strong,yes

This dataset defines 5 attributes and 11 records (note that it is slightly different from the file tennisExtended.txt in the above example).

Alternative input format (CSV)

It is also possible to load a dataset encoded in the CSV format as an alternative to the default format and the ARFF format. A CSV file is a text file where attribute names and attribute values are separated by commas. To load a file encoded according to the CSV format, the following lines of code can be used in the example:

String datasetPath = fileToPath("tennisExtendedCSV.txt");
CSVDataset dataset = new CSVDataset(datasetPath, targetClassName);

Using these lines, the dataset tennisExtendedCSV.txt can be read, which contains the following content:

play,outlook,temp,humid,wind,day,moon
no,sunny,hot,high,weak,tuesday,full
no,sunny,hot,high,strong,tuesday,small
yes,overcast,hot,high,weak,tuesday,full
yes,rainy,mild,high,strong,tuesday,small
yes,rainy,cool,normal,weak,monday,full
no,rainy,cool,normal,strong,monday,small
yes,overcast,cool,normal,strong,monday,full
no,sunny,mild,high,weak,monday,small
yes,sunny,cool,normal,weak,monday,full
yes,rainy,mild,normal,weak,monday,small
yes,sunny,mild,normal,strong,friday,full
no,rainy,cool,normal,strong,friday,small
yes,overcast,cool,normal,strong,friday,full
no,sunny,mild,high,weak,friday,small
yes,sunny,cool,normal,weak,friday,full
yes,overcast,hot,high,weak,friday,small
yes,rainy,mild,high,strong,friday,full
yes,rainy,mild,normal,weak,friday,small
yes,sunny,mild,normal,strong,tuesday,full

The first line indicates the names of the attributes, each separated by a comma. Then each of the following lines is a record where attribute values are separated by commas. There are 19 records.

B) How to run batch experiments to test the KNN model for classification

The SPMF library includes code to automatically run experiments with KNN on a dataset.

How to run this example?

What is this example about?

In this example, the dataset tennisExtended.txt from the previous example is read into memory. It is then split into two parts: a training dataset (the first 50% of the records) and a testing dataset (the last 50% of the records).
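The split described above can be sketched as follows. This is not the SPMF code: the class HoldoutSketch is a hypothetical name, and rounding the cut point down when the number of records is odd is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative holdout split; NOT the SPMF implementation. */
final class HoldoutSketch {

    /**
     * Splits the records in file order.
     * Element 0 of the result is the training part, element 1 is the testing part.
     */
    static <T> List<List<T>> split(List<T> records, double trainingRatio) {
        // Assumption: the cut point is rounded down.
        int cut = (int) (records.size() * trainingRatio);
        List<List<T>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(records.subList(0, cut)));
        parts.add(new ArrayList<>(records.subList(cut, records.size())));
        return parts;
    }

    public static void main(String[] args) {
        // Stand-ins for the 19 records of tennisExtended.txt.
        List<Integer> recordIds = new ArrayList<>();
        for (int i = 0; i < 19; i++) {
            recordIds.add(i);
        }
        List<List<Integer>> parts = split(recordIds, 0.5);
        System.out.println(parts.get(0).size() + " training records, "
                + parts.get(1).size() + " testing records");
        // prints "9 training records, 10 testing records"
    }
}
```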

The KNN algorithm is then applied to build a model using the training dataset for the target attribute play.

Then, the model is applied to guess the values of the attribute play for all records in the test dataset. Statistics are then calculated in terms of various measures for evaluating a classifier and the results are presented in the console:

=== MODEL TRAINING RESULTS ===
#NAME: KNN
#RULECOUNT: 0 the number of rules in the model - this is not relevant for KNN
#TIMEms: 1 the time for training the model (ms)
#MEMORYmb: 2.482 the memory used for training the model (MB)

==== CLASSIFICATION RESULTS ON TRAINING DATA =====
#NAME: KNN
#ACCURACY: 0.778 accuracy of the model on training data
#RECALL: 0.775 recall of the model on training data
#PRECISION: 0.775 precision of the model on training data
#KAPPA: 0.55 Kappa Score of the model on the training data
#FMICRO: 0.7778 The F1 measure (micro) of the model on the training data
#FMACRO: 0.775 The F1 measure (macro) of the model on the training data
#TIMEms: 0 the time for making predictions using the training data (ms)
#MEMORYmb: 1.9771 the memory usage for making predictions using the training data (MB)
#NOPREDICTION: 0.0 the percentage of records for which no prediction was made for the training data

==== CLASSIFICATION RESULTS ON TESTING DATA =====
#NAME: KNN
#ACCURACY: 0.8 accuracy of the model on the testing data
#RECALL: 0.6875 recall of the model on the testing data
#PRECISION: 0.6875 precision of the model on the testing data
#KAPPA: 0.375 Kappa Score of the model on the testing data
#FMICRO: 0.8 The F1 measure (micro) of the model on the testing data
#FMACRO: 0.6875 The F1 measure (macro) of the model on the testing data
#TIMEms: 0 the time for making predictions using the testing data (ms)
#MEMORYmb: 1.971 the memory usage for making predictions using the testing data (MB)
#NOPREDICTION: 0.0 the percentage of records for which no prediction was made for the testing data

These measures are commonly used for evaluating classification models (classifiers).
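As an illustration of how some of these measures can be computed, here is a minimal Java sketch of the usual definitions of accuracy, macro-averaged recall and precision, and the Kappa score. It is not the SPMF code, and the class MetricsSketch is a hypothetical name. The actual values below are those of the last 10 records of tennisExtended.txt, but the predicted values are invented for this sketch: they were chosen so that the counts reproduce the testing figures shown above, and the predictions really made by KNN are not known from this page.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

/** Illustrative evaluation measures; NOT the SPMF implementation. */
final class MetricsSketch {

    /** "play" values of the last 10 records of tennisExtended.txt. */
    static final String[] ACTUAL =
            {"yes", "yes", "no", "yes", "no", "yes", "yes", "yes", "yes", "yes"};

    /** Invented predictions (assumption): 8 of 10 are correct. */
    static final String[] PREDICTED =
            {"yes", "yes", "yes", "yes", "no", "yes", "yes", "yes", "yes", "no"};

    static int count(String[] labels, String label) {
        int n = 0;
        for (String l : labels) {
            if (l.equals(label)) n++;
        }
        return n;
    }

    /** Number of records of the given class that were predicted as that class. */
    static int truePositives(String[] actual, String[] predicted, String label) {
        int n = 0;
        for (int i = 0; i < actual.length; i++) {
            if (actual[i].equals(label) && predicted[i].equals(label)) n++;
        }
        return n;
    }

    static double accuracy(String[] actual, String[] predicted) {
        int correct = 0;
        for (int i = 0; i < actual.length; i++) {
            if (actual[i].equals(predicted[i])) correct++;
        }
        return (double) correct / actual.length;
    }

    /** Average over the classes of (true positives / records of that class). */
    static double macroRecall(String[] actual, String[] predicted) {
        Set<String> classes = new LinkedHashSet<>(Arrays.asList(actual));
        double sum = 0;
        for (String c : classes) {
            sum += (double) truePositives(actual, predicted, c) / count(actual, c);
        }
        return sum / classes.size();
    }

    /** Average over the classes of (true positives / predictions of that class). */
    static double macroPrecision(String[] actual, String[] predicted) {
        Set<String> classes = new LinkedHashSet<>(Arrays.asList(actual));
        double sum = 0;
        for (String c : classes) {
            int predictedCount = count(predicted, c);
            sum += predictedCount == 0
                    ? 0 : (double) truePositives(actual, predicted, c) / predictedCount;
        }
        return sum / classes.size();
    }

    /** Kappa = (observed agreement - chance agreement) / (1 - chance agreement). */
    static double kappa(String[] actual, String[] predicted) {
        double n = actual.length;
        double chance = 0;
        for (String c : new LinkedHashSet<>(Arrays.asList(actual))) {
            chance += (count(actual, c) / n) * (count(predicted, c) / n);
        }
        return (accuracy(actual, predicted) - chance) / (1 - chance);
    }

    public static void main(String[] args) {
        System.out.println("accuracy  = " + accuracy(ACTUAL, PREDICTED));
        System.out.println("recall    = " + macroRecall(ACTUAL, PREDICTED));
        System.out.println("precision = " + macroPrecision(ACTUAL, PREDICTED));
        System.out.println("kappa     = " + kappa(ACTUAL, PREDICTED));
    }
}
```

With these invented predictions, 7 of the 8 "yes" records and 1 of the 2 "no" records are classified correctly, which gives an accuracy of 0.8, a macro recall and precision of 0.6875, and a Kappa score of 0.375 (up to floating-point rounding).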

Optional feature: Using K-Fold Cross validation

The above example has shown how to split a dataset into two parts (training and testing) to evaluate a classifier. This approach, called holdout, is useful. However, a problem is that only part of the data is used for training (e.g. 50%) and only part of the data (e.g. 50%) is used for testing. To be able to use all the data for training and all the data for testing, there is an alternative way of testing a classifier, which is called k-fold cross-validation. To use k-fold cross-validation, the user must set a parameter k (a positive integer) indicating the number of folds (parts). Note that this k is unrelated to the parameter K of KNN. For example, let's say that a dataset has 100 records and that k = 5. Then the dataset will be divided into 5 parts, each containing 20 records. Then, to evaluate the classifier, five experiments will be done. In each experiment, a different part is used for testing and the other four parts are used for training.

Then, the average of the results over the five experiments is presented to the user.
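The way the records can be divided among the experiments is sketched below in Java. This is not the SPMF code: the class KFoldSketch is a hypothetical name, and cutting the dataset into consecutive blocks in file order is an assumption of this sketch (other implementations may shuffle the records first).

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative k-fold partitioning; NOT the SPMF implementation. */
final class KFoldSketch {

    /**
     * Returns, for each fold, the range {start, end} of record positions used for testing
     * (start inclusive, end exclusive). All other records are used for training.
     */
    static List<int[]> testRanges(int recordCount, int k) {
        List<int[]> ranges = new ArrayList<>();
        for (int fold = 0; fold < k; fold++) {
            int start = fold * recordCount / k;
            int end = (fold + 1) * recordCount / k;
            ranges.add(new int[] {start, end});
        }
        return ranges;
    }

    /** Averages a measure (for example the accuracy) over the experiments. */
    static double average(double[] valuesPerFold) {
        double sum = 0;
        for (double value : valuesPerFold) {
            sum += value;
        }
        return sum / valuesPerFold.length;
    }

    public static void main(String[] args) {
        int recordCount = 100;
        List<int[]> ranges = testRanges(recordCount, 5);
        for (int fold = 0; fold < ranges.size(); fold++) {
            int[] range = ranges.get(fold);
            int testCount = range[1] - range[0];
            System.out.println("Experiment " + (fold + 1) + ": test on records "
                    + range[0] + " to " + (range[1] - 1) + ", train on the other "
                    + (recordCount - testCount) + " records");
        }
    }
}
```

With 100 records and k = 5, each record is used exactly once for testing and four times for training.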

To try k-fold cross-validation instead of holdout, you may run the example "MainTestKNN_batch_kfold.java" in the package ca.pfv.spmf.test of the SPMF distribution.