How to train the CBA Classifier to Perform Classification (SPMF documentation)

This example explains how to run the CBA algorithm using the SPMF open-source data mining library.

What is CBA?

CBA (Classification Based on Associations) is a classification algorithm proposed in the following paper:

B. Liu, W. Hsu, and Y. Ma, Integrating classification and association rule mining, Proc. 4th International Conference on Knowledge Discovery and Data Mining (KDD98), 1998, pp. 80-86.

The CBA algorithm takes as input a dataset that consists of a set of records described using attributes, which are assumed to be nominal (strings). The goal of classification is to guess the missing value of an attribute, called the target attribute, based on the values of the other attributes. For example, consider data about customers of a bank. Each record (customer) may be described using various attributes such as age, gender, city, education and steal-money?. Consider that steal-money? is an attribute that indicates whether a customer has stolen some money or not (yes or no). The goal of classification can be to guess whether a customer will steal money (yes or no) given the values of the other attributes (the age, gender, city and education) of that customer.

To do classification, the CBA algorithm first creates a model using training data (records where the target attribute value is known). A model (a classifier) is a set of rules called class association rules. After the model is created, it can be used to guess the missing value of the target attribute for a new record. For instance, using data about previous customers of the bank, it is possible to learn rules that can help to guess whether new customers will steal money or not. There exist many algorithms for classification in the data mining literature. Rule-based classification models such as CBA generally make good predictions, although other models may be more accurate. However, rule-based models have the advantage of being easily interpretable by humans.

This is a Java implementation of CBA. It was originally obtained under the GPL3 license from the LAC library of Padillo et al. (2020). The code of CBA from LAC was cleaned, adapted and integrated into SPMF.

A) How to train the CBA model to make a prediction

How to run this example?

What is the input?

The input is a dataset that contains a set of records described according to some attributes. For instance, in this example, we use a dataset called tennisExtended.txt. This dataset is provided in the file "tennisExtended.txt" of the SPMF distribution. This dataset defines 7 attributes (outlook, temp, humid, wind, play, day and moon) and contains 19 instances (records).

What is the model created by CBA?

The CBA algorithm creates a model that can be used to perform classification. The goal is to use that model to guess the missing value of a target attribute for a new record. For example, let's say that there is a new record where the value of the attribute "play" is unknown:

???? rainy mild high strong monday small

 

The goal of classification is to build a model that will be able to predict the value ???? as either yes or no.

To build a model, CBA takes as input:

The model created by CBA contains a set of rules. A rule is an implication of the form X ==> Y where Y is a target attribute value (e.g. yes or no) and X is a set of values from other attributes. For instance, a rule {mild,normal} ==> {yes} would indicate that if a record contains the values {mild} and {normal} then it is likely to have the value yes for the attribute play. The CBA algorithm can also create some negative rules such as {NOT mild, normal} ==> {no}. This rule indicates that if {mild} does not appear and {normal} appears in a record then the value of the attribute play is likely to be no.

To select a good set of rules to build a model, CBA relies on a few measures called the support, confidence and error.

The support of a rule X ==> Y is the number of records of the dataset that contain the values of X and Y together, divided by the total number of records. For example, the support of the rule {mild,normal} ==> {yes} is 4/19 because the values {mild,normal,yes} appear together in 4 records and there is a total of 19 records.

The confidence of a rule X ==> Y is the number of records of the dataset that contain the values of X and Y together, divided by the number of records that contain the values of X. For example, the confidence of the rule {mild,normal} ==> {yes} is 4/4 because the values {mild,normal,yes} appear together in 4 records and there are 4 records that contain {mild,normal}.
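
The two definitions above can be checked with a short program. The following is a minimal sketch in Java, not code from SPMF: the class RuleMeasures and its methods are hypothetical names introduced for illustration. It assumes that each value belongs to a single attribute (as is the case in tennisExtended.txt), so that a record can simply be treated as a set of values.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class (not part of SPMF) that computes rule measures.
class RuleMeasures {
    // The 19 records of tennisExtended.txt (header line omitted).
    static final String[] RECORDS = {
        "no sunny hot high weak tuesday full",
        "no sunny hot high strong tuesday small",
        "yes overcast hot high weak tuesday full",
        "yes rainy mild high strong tuesday small",
        "yes rainy cool normal weak monday full",
        "no rainy cool normal strong monday small",
        "yes overcast cool normal strong monday full",
        "no sunny mild high weak monday small",
        "yes sunny cool normal weak monday full",
        "yes rainy mild normal weak monday small",
        "yes sunny mild normal strong friday full",
        "no rainy cool normal strong friday small",
        "yes overcast cool normal strong friday full",
        "no sunny mild high weak friday small",
        "yes sunny cool normal weak friday full",
        "yes overcast hot high weak friday small",
        "yes rainy mild high strong friday full",
        "yes rainy mild normal weak friday small",
        "yes sunny mild normal strong tuesday full"
    };

    // Number of records that contain every value in 'values'.
    static long count(String[] records, Set<String> values) {
        return Arrays.stream(records)
                .filter(r -> new HashSet<>(Arrays.asList(r.split(" "))).containsAll(values))
                .count();
    }

    // Support of X ==> y: records containing X and y, divided by all records.
    static double support(String[] records, Set<String> x, String y) {
        Set<String> xy = new HashSet<>(x);
        xy.add(y);
        return (double) count(records, xy) / records.length;
    }

    // Confidence of X ==> y: records containing X and y, divided by records containing X.
    static double confidence(String[] records, Set<String> x, String y) {
        Set<String> xy = new HashSet<>(x);
        xy.add(y);
        return (double) count(records, xy) / count(records, x);
    }

    public static void main(String[] args) {
        Set<String> x = Set.of("mild", "normal");
        System.out.println(support(RECORDS, x, "yes"));    // 4/19, about 0.2105
        System.out.println(confidence(RECORDS, x, "yes")); // 4/4
    }
}
```

In this sketch the support is a ratio. The #SUP: field in the output of the example appears to report the corresponding number of records instead.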

By applying the CBA algorithm, a model is generated containing a set of rules where each rule has a confidence that is no less than minConf, and a support that is no less than minSup. In this example, the parameters are set as minSup = 0.4 and minConf = 0.4, and the following rules are obtained:

full ==> yes #SUP: 9 #CONF: 0.9 #ERROR: 0.4205307185207438
DEFAULT ==> no

where each line is a rule. The keywords #SUP:, #CONF:, and #ERROR: respectively indicate the support, confidence, and error of the rule.

The last rule, which is DEFAULT ==> no, is a default rule indicating that by default "no" should be considered as the class value if no rules are applicable.
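
In the CBA paper, a classifier is an ordered list of rules followed by a default class: a record is classified by the first rule whose left-hand side it satisfies, and by the default class otherwise. The following Java sketch illustrates that idea on the two rules above. The class RuleListSketch is a hypothetical name used only for illustration and is not SPMF's implementation, and the two records in main are made up for this example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration (not SPMF code) of an ordered rule list with a default class.
class RuleListSketch {
    // Ordered rules: left-hand side values -> predicted class (insertion order is kept).
    static final Map<Set<String>, String> RULES = new LinkedHashMap<>();
    static final String DEFAULT_CLASS = "no";

    static {
        RULES.put(Set.of("full"), "yes"); // full ==> yes
    }

    // Returns the class of the first rule matched by the record, or the default class.
    static String predict(String record) {
        Set<String> values = new HashSet<>(Arrays.asList(record.trim().split("\\s+")));
        for (Map.Entry<Set<String>, String> rule : RULES.entrySet()) {
            if (values.containsAll(rule.getKey())) {
                return rule.getValue();
            }
        }
        return DEFAULT_CLASS;
    }

    public static void main(String[] args) {
        // Made-up records with an unknown class value.
        System.out.println(predict("overcast cool normal strong friday full")); // yes
        System.out.println(predict("sunny hot high strong tuesday small"));     // no
    }
}
```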

Using the trained model, the CBA algorithm can make a prediction for the record:

???? rainy mild high strong monday small

The prediction is: yes

Optional feature: saving a model to a file using serialization to load it again into memory later

Training is generally very fast. But if you want to save a trained model to a file and load it into memory later, it is possible. Saving the model is done by uncommenting the following line of code in the example:

classifier.saveTrainedClassifierToFile("classifier.ser"); // Save the model to a file

Loading a saved model is done using the following lines of code in the example:

classifier = Classifier.loadTrainedClassifierToFile("classifier.ser"); // Load the model from a file
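
For readers unfamiliar with Java serialization, the following self-contained sketch shows the general save and load round trip using only the standard library. It does not use SPMF classes: TinyModel is a hypothetical stand-in for a trained classifier, and how SPMF implements its own save and load methods internally is not shown here.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration of saving and loading an object with Java serialization.
class SerializationSketch {
    // Stand-in for a trained model; not an SPMF class.
    static class TinyModel implements Serializable {
        private static final long serialVersionUID = 1L;
        final String defaultClass;

        TinyModel(String defaultClass) {
            this.defaultClass = defaultClass;
        }
    }

    // Writes the model to the given file.
    static void save(TinyModel model, Path file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(model);
        }
    }

    // Reads a model back from the given file.
    static TinyModel load(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return (TinyModel) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("classifier", ".ser");
        save(new TinyModel("no"), file);
        System.out.println(load(file).defaultClass); // no
        Files.delete(file);
    }
}
```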

Optional feature: saving the model as a set of rules into a text file (for the purpose of analysis)

If you want to see the rules of the trained model, you may save them to a text file by uncommenting the following code in the example:

String rulesPath = "rulesPath.txt";
((RuleClassifier)classifier).writeRulesToFileSPMFFormatAsStrings(rulesPath,dataset);

where rulesPath.txt is the output file path and dataset is the training dataset.

Note that rules saved in this format cannot be loaded back into memory. If you want to save a model and load it again later, you should use serialization (see above) instead.

Input format (default)

A few input file formats are supported by this algorithm. The first one is a text file such as the dataset "tennisExtended.txt" used in this example:

play outlook temp humid wind day moon
no sunny hot high weak tuesday full
no sunny hot high strong tuesday small
yes overcast hot high weak tuesday full
yes rainy mild high strong tuesday small
yes rainy cool normal weak monday full
no rainy cool normal strong monday small
yes overcast cool normal strong monday full
no sunny mild high weak monday small
yes sunny cool normal weak monday full
yes rainy mild normal weak monday small
yes sunny mild normal strong friday full
no rainy cool normal strong friday small
yes overcast cool normal strong friday full
no sunny mild high weak friday small
yes sunny cool normal weak friday full
yes overcast hot high weak friday small
yes rainy mild high strong friday full
yes rainy mild normal weak friday small
yes sunny mild normal strong tuesday full

The first line indicates the names of the attributes, each separated by a space. Then each of the following lines is a record where attribute values are separated by spaces. There are 19 records.
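
To make the format concrete, the following Java sketch parses such lines into attribute names and records, and extracts the values of one attribute. It is only an illustration under the assumption that values never contain spaces. The class SpaceSeparatedParser is a hypothetical name and is not the reader used by SPMF.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical parser (not SPMF code) for the space-separated dataset format.
class SpaceSeparatedParser {
    // Parsed dataset: attribute names plus one String[] per record.
    static class Table {
        final String[] attributes;
        final List<String[]> records = new ArrayList<>();

        Table(String[] attributes) {
            this.attributes = attributes;
        }

        // Values of the attribute called 'name', one per record.
        List<String> column(String name) {
            int index = Arrays.asList(attributes).indexOf(name);
            if (index < 0) {
                throw new IllegalArgumentException("Unknown attribute: " + name);
            }
            List<String> values = new ArrayList<>();
            for (String[] record : records) {
                values.add(record[index]);
            }
            return values;
        }
    }

    // The first line holds the attribute names; each other non-empty line is a record.
    static Table parse(List<String> lines) {
        Table table = new Table(lines.get(0).trim().split("\\s+"));
        for (String line : lines.subList(1, lines.size())) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] values = line.trim().split("\\s+");
            if (values.length != table.attributes.length) {
                throw new IllegalArgumentException("Wrong number of values: " + line);
            }
            table.records.add(values);
        }
        return table;
    }

    public static void main(String[] args) {
        Table table = parse(List.of(
                "play outlook temp humid wind day moon",
                "no sunny hot high weak tuesday full",
                "yes overcast hot high weak tuesday full"));
        System.out.println(table.column("play")); // [no, yes]
    }
}
```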

Alternative input format (ARFF)

It is also possible to use a dataset encoded using the ARFF format as an alternative to the default dataset format. ARFF is the dataset file format of the Weka software, and its specification is part of the Weka documentation. Most features of the ARFF format are supported, except that all attribute values are treated as nominal values. This is due to the design of CBA, which is only defined for nominal values. If numerical values are in the data, they will be treated as nominal values (strings). To load a file using the ARFF format, the following lines of code can be used in the example:

String datasetPath = fileToPath("weather-train.arff");
ARFFDataset dataset = new ARFFDataset(datasetPath, targetClassName);

Using these lines, the dataset weather-train.arff will be used, which contains the following content:

@relation weather.tennis

@attribute outlook {sunny,overcast,rainy}
@attribute temperature {hot,mild,cool}
@attribute humidity {high,normal}
@attribute windy {strong,weak}
@attribute play {yes,no}

@data
sunny,hot,high,weak,no
sunny,hot,high,strong,no
overcast,hot,high,weak,yes
rainy,mild,high,weak,yes
rainy,cool,normal,weak,yes
rainy,cool,normal,strong,no
overcast,cool,normal,strong,yes
sunny,mild,high,weak,no
sunny,cool,normal,weak,yes
rainy,mild,normal,weak,yes
sunny,mild,normal,strong,yes

This dataset defines 5 attributes and 11 records (note that it is slightly different from the file tennisExtended.txt in the above example).
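
As an illustration of how the header of such a file is organized, the following Java sketch collects the nominal attribute declarations that appear before the @data line. It is a deliberately simplified reading of the format (no quoted names, no comments, no numeric types) and is not the ARFF reader of SPMF. The class ArffHeaderSketch is a hypothetical name.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified reader (not SPMF code) for nominal @attribute lines.
class ArffHeaderSketch {
    // Maps each attribute name to its declared nominal values, in file order.
    static Map<String, List<String>> parseAttributes(List<String> lines) {
        Map<String, List<String>> attributes = new LinkedHashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.equalsIgnoreCase("@data")) {
                break; // the header ends where the data section starts
            }
            if (!trimmed.toLowerCase().startsWith("@attribute")) {
                continue; // e.g. @relation or a blank line
            }
            int open = trimmed.indexOf('{');
            int close = trimmed.lastIndexOf('}');
            if (open < 0 || close < open) {
                continue; // not a nominal declaration; ignored in this sketch
            }
            String name = trimmed.substring("@attribute".length(), open).trim();
            String[] values = trimmed.substring(open + 1, close).trim().split("\\s*,\\s*");
            attributes.put(name, Arrays.asList(values));
        }
        return attributes;
    }

    public static void main(String[] args) {
        Map<String, List<String>> attributes = parseAttributes(List.of(
                "@relation weather.tennis",
                "",
                "@attribute outlook {sunny,overcast,rainy}",
                "@attribute play {yes,no}",
                "@data",
                "sunny,hot,high,weak,no"));
        System.out.println(attributes.keySet());      // [outlook, play]
        System.out.println(attributes.get("play"));   // [yes, no]
    }
}
```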

Alternative input format (CSV)

It is also possible to load a dataset encoded in the CSV format as an alternative to the default format and the ARFF format. A CSV file is a text file where attribute names and attribute values are separated by commas. To load a file encoded according to the CSV format, the following lines of code can be used in the example:

String datasetPath = fileToPath("tennisExtendedCSV.txt");
CSVDataset dataset = new CSVDataset(datasetPath, targetClassName);

Using these lines, the dataset tennisExtendedCSV.txt can be read, which contains the following content:

play,outlook,temp,humid,wind,day,moon
no,sunny,hot,high,weak,tuesday,full
no,sunny,hot,high,strong,tuesday,small
yes,overcast,hot,high,weak,tuesday,full
yes,rainy,mild,high,strong,tuesday,small
yes,rainy,cool,normal,weak,monday,full
no,rainy,cool,normal,strong,monday,small
yes,overcast,cool,normal,strong,monday,full
no,sunny,mild,high,weak,monday,small
yes,sunny,cool,normal,weak,monday,full
yes,rainy,mild,normal,weak,monday,small
yes,sunny,mild,normal,strong,friday,full
no,rainy,cool,normal,strong,friday,small
yes,overcast,cool,normal,strong,friday,full
no,sunny,mild,high,weak,friday,small
yes,sunny,cool,normal,weak,friday,full
yes,overcast,hot,high,weak,friday,small
yes,rainy,mild,high,strong,friday,full
yes,rainy,mild,normal,weak,friday,small
yes,sunny,mild,normal,strong,tuesday,full

The first line indicates the names of the attributes, each separated by a comma. Then each of the following lines is a record where attribute values are separated by commas. There are 19 records.

B) How to run batch experiments to test the CBA model for classification

In the SPMF library there is some code to automatically run experiments with CBA on a dataset.

How to run this example?

What is this example about?

In this example, the dataset tennisExtended.txt from the previous example is read into memory. It is then split into two parts: a training dataset (the first 50% of the records) and a testing dataset (the last 50% of the records).
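
Such a split can be sketched in Java as follows. This is an illustration only and not the code used by SPMF. In particular, rounding the size of the training part down when the number of records is odd is an assumption, and HoldoutSketch is a hypothetical class name.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration (not SPMF code) of a holdout split.
class HoldoutSketch {
    // Returns [training, testing]: the first 'ratio' share of the records and the rest.
    static <T> List<List<T>> split(List<T> records, double ratio) {
        int cut = (int) (records.size() * ratio); // rounds down (an assumption)
        List<List<T>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(records.subList(0, cut)));
        parts.add(new ArrayList<>(records.subList(cut, records.size())));
        return parts;
    }

    public static void main(String[] args) {
        List<Integer> records = new ArrayList<>();
        for (int i = 0; i < 19; i++) {
            records.add(i); // record identifiers standing in for the 19 records
        }
        List<List<Integer>> parts = split(records, 0.5);
        System.out.println(parts.get(0).size()); // 9 training records
        System.out.println(parts.get(1).size()); // 10 testing records
    }
}
```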

The CBA algorithm is then applied to build a model using the training dataset for the target attribute play.

Then, the model is applied to guess the values of the attribute play for all records in the testing dataset. Statistics are then calculated in terms of various measures for evaluating a classifier, and the results are presented in the console. The results have the following format (the actual values may differ slightly):

=== MODEL TRAINING RESULTS ===
#NAME: CBA
#RULECOUNT: 4 the number of rules in the model
#TIMEms: 19 the time for training the model (ms)
#MEMORYmb: 1.9 the memory used for training the model (MB)

==== CLASSIFICATION RESULTS ON TRAINING DATA =====
#NAME: CBA
#ACCURACY: 0.7778 accuracy of the model on training data
#RECALL: 0.775 recall of the model on training data
#PRECISION: 0.775 precision of the model on training data
#KAPPA: 0.55 Kappa Score of the model on the training data
#FMICRO: 0.778 The F1 measure (micro) of the model on the training data
#FMACRO: 0.775 The F1 measure (macro) of the model on the training data
#TIMEms: 0 the time for making predictions using the training data (ms)
#MEMORYmb: 2.482 the memory usage for making predictions using the training data (MB)
#NOPREDICTION: 0.0 the percentage of records for which no prediction was made for the training data

==== CLASSIFICATION RESULTS ON TESTING DATA =====
#NAME: CBA
#ACCURACY: 1 accuracy of the model on the testing data
#RECALL: 1 recall of the model on the testing data
#PRECISION: 1 precision of the model on the testing data
#KAPPA: 1 Kappa Score of the model on the testing data
#FMICRO: 1 The F1 measure (micro) of the model on the testing data
#FMACRO: 1 The F1 measure (macro) of the model on the testing data
#TIMEms: 0 the time for making predictions using the testing data (ms)
#MEMORYmb: 2.482 the memory usage for making predictions using the testing data (MB)
#NOPREDICTION: 0.0 the percentage of records for which no prediction was made for the testing data

These measures are commonly used for evaluating classification models (classifiers).
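
As a reminder of how some of these measures are usually defined, the following Java sketch computes the accuracy, the macro-averaged recall and precision, and Cohen's kappa from a confusion matrix. It uses the standard textbook definitions and is not taken from SPMF. The class ClassifierMetrics is a hypothetical name, and the matrix in main is a made-up example (9 records, 7 of them correctly classified) chosen to be consistent with the training figures shown above.

```java
// Hypothetical illustration (not SPMF code) of common classification measures.
class ClassifierMetrics {
    // In all methods, m[actual][predicted] holds a number of records.

    // Share of records whose predicted class equals the actual class.
    static double accuracy(int[][] m) {
        double correct = 0, total = 0;
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m.length; j++) {
                total += m[i][j];
                if (i == j) correct += m[i][j];
            }
        }
        return correct / total;
    }

    // Average over classes of: correct predictions / records of that class.
    static double macroRecall(int[][] m) {
        double sum = 0;
        for (int c = 0; c < m.length; c++) {
            double row = 0;
            for (int j = 0; j < m.length; j++) row += m[c][j];
            sum += m[c][c] / row;
        }
        return sum / m.length;
    }

    // Average over classes of: correct predictions / predictions of that class.
    static double macroPrecision(int[][] m) {
        double sum = 0;
        for (int c = 0; c < m.length; c++) {
            double col = 0;
            for (int i = 0; i < m.length; i++) col += m[i][c];
            sum += m[c][c] / col;
        }
        return sum / m.length;
    }

    // Cohen's kappa: agreement corrected for the agreement expected by chance.
    static double kappa(int[][] m) {
        double total = 0, expected = 0;
        for (int[] row : m) for (int v : row) total += v;
        for (int c = 0; c < m.length; c++) {
            double row = 0, col = 0;
            for (int k = 0; k < m.length; k++) {
                row += m[c][k];
                col += m[k][c];
            }
            expected += (row / total) * (col / total);
        }
        return (accuracy(m) - expected) / (1 - expected);
    }

    public static void main(String[] args) {
        int[][] m = { {4, 1}, {1, 3} }; // made-up matrix; rows: actual yes/no
        System.out.println(accuracy(m));       // 7/9, about 0.7778
        System.out.println(macroRecall(m));    // about 0.775
        System.out.println(macroPrecision(m)); // about 0.775
        System.out.println(kappa(m));          // about 0.55
    }
}
```

Note that this sketch does not handle a class that is never predicted or never present, for which the corresponding ratio is undefined.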

Optional feature: Using K-Fold Cross validation

The above example has shown how to split a dataset into two parts (training and testing) to evaluate a classifier. This approach, called holdout, is useful. However, a limitation is that only part of the data (e.g. 50%) is used for training and only the remaining part (e.g. 50%) is used for testing. To use all the data for both training and testing, there is an alternative way of evaluating a classifier, called k-fold cross-validation. To use k-fold cross-validation, the user must set a parameter k (a positive integer) indicating the number of folds (parts). For example, let's say that a dataset has 100 records and that k = 5. Then the dataset will be divided into 5 parts, each containing 20 records. Then, to evaluate the classifier, five experiments will be done. In each experiment, a different part is used as the testing data and the four other parts are used together as the training data.

Then, the average of the results over the five experiments is presented to the user.
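
The way the folds are formed can be sketched in Java as follows, with each experiment testing on one fold and training on all the others. This is a generic illustration of k-fold cross-validation, not the code of SPMF: KFoldSketch is a hypothetical class name, and taking consecutive records without shuffling is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration (not SPMF code) of k-fold cross-validation splits.
class KFoldSketch {
    // Splits the records into k consecutive folds of (almost) equal size.
    static <T> List<List<T>> folds(List<T> records, int k) {
        List<List<T>> folds = new ArrayList<>();
        int n = records.size();
        for (int i = 0; i < k; i++) {
            int from = (int) ((long) i * n / k);
            int to = (int) ((long) (i + 1) * n / k);
            folds.add(new ArrayList<>(records.subList(from, to)));
        }
        return folds;
    }

    // Training data of the experiment that tests on fold 'testFold': all other folds.
    static <T> List<T> trainingFor(List<List<T>> folds, int testFold) {
        List<T> training = new ArrayList<>();
        for (int i = 0; i < folds.size(); i++) {
            if (i != testFold) {
                training.addAll(folds.get(i));
            }
        }
        return training;
    }

    public static void main(String[] args) {
        List<Integer> records = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            records.add(i); // record identifiers standing in for 100 records
        }
        List<List<Integer>> folds = folds(records, 5);
        for (int test = 0; test < folds.size(); test++) {
            // Each experiment: 80 training records and 20 testing records.
            System.out.println(trainingFor(folds, test).size() + " / " + folds.get(test).size());
        }
    }
}
```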

To try k-fold cross-validation instead of holdout, you may run the example "MainTestCBA_batch_kfold.java" in the package ca.pfv.spmf.test of the SPMF distribution.