Perform Sequence Prediction using the CPT Sequence Prediction Model (SPMF documentation)

This example explains how to run the CPT algorithm using the SPMF open-source data mining library.

How to run this example?

To run the implementation of CPT:

This example is not available in the graphical user interface of SPMF.
If you are using the source code version of SPMF, launch the file "MainTestCPT.java" in the package ca.pfv.SPMF.tests.

What is CPT (Compact Prediction Tree)?

CPT (Compact Prediction Tree) is a sequence prediction model. Sequence prediction consists of predicting the next symbol of a sequence based on a set of training sequences. This task has applications in many domains. For example, it can be used to predict the next webpage that a user will visit, based on the webpages previously visited by that user and by other users.

The CPT prediction model (2013) is an early version of the CPT+ model (Gueniche et al., 2015). CPT+ was shown to provide better accuracy than CPT and several other state-of-the-art prediction models, such as DG, AKOM, TDAG, and PPM, on various datasets (see Gueniche et al., 2015 for details).

The implementation of CPT in SPMF is the original implementation (obtained from the ipredict project).

What is the input of CPT?

The input of CPT is a sequence database containing training sequences. These sequences are used to train the prediction model.

In the context of CPT, a sequence database is a set of sequences where each sequence is a list of items (symbols). For example, the table shown below contains four sequences. The first sequence, named S1, contains 5 items. This sequence means that item 1 was followed by item 2, followed by item 3, followed by item 4, and followed by item 6. This database is provided in the file "contextCPT.txt" of the SPMF distribution.

ID	Sequences
S1	(1), (2), (3), (4), (6)
S2	(4), (3), (2), (5)
S3	(5), (1), (4), (3), (2)
S4	(5), (7), (1), (4), (2), (3)

CPT also takes a set of parameters as input.

What is the output of CPT?

CPT performs sequence prediction. After CPT has been trained with the input sequence database, it can predict the next symbol of a new sequence.

For example, if CPT is trained with the previous sequence database and parameters, it will predict that the next symbol following the sequence (1),(4) is the symbol (2).
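As a rough intuition only, the idea of predicting from similar training sequences can be sketched by counting the items that appear after the prefix in every training sequence containing all of its items. This is not the SPMF implementation: the class and method names are hypothetical, and the sketch assumes a simplified reading of the CPT paper. It leaves out the prediction tree, the scoring measure, and the Recursive Divider strategy.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical illustration, not part of SPMF and not the CPT algorithm itself.
class NextSymbolCountSketch {

    /**
     * For each training sequence that contains every item of the prefix,
     * counts the items located after the last prefix item in that sequence.
     * Each candidate item is counted at most once per training sequence.
     */
    static Map<Integer, Integer> countCandidates(List<List<Integer>> training,
                                                 List<Integer> prefix) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (List<Integer> sequence : training) {
            if (!sequence.containsAll(prefix)) {
                continue; // not similar to the prefix (assumption: all items must appear)
            }
            // Position of the last prefix item found in this sequence
            int last = -1;
            for (Integer item : prefix) {
                last = Math.max(last, sequence.lastIndexOf(item));
            }
            Set<Integer> seen = new HashSet<>();
            for (int i = last + 1; i < sequence.size(); i++) {
                Integer candidate = sequence.get(i);
                if (!prefix.contains(candidate) && seen.add(candidate)) {
                    counts.merge(candidate, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // The four sequences of the example database
        List<List<Integer>> training = Arrays.asList(
                Arrays.asList(1, 2, 3, 4, 6),
                Arrays.asList(4, 3, 2, 5),
                Arrays.asList(5, 1, 4, 3, 2),
                Arrays.asList(5, 7, 1, 4, 2, 3));
        // Prints {2=2, 3=2, 6=1}
        System.out.println(countCandidates(training, Arrays.asList(1, 4)));
    }
}
```

With this simplified count, items 2 and 3 are tied as candidates after (1),(4). The sketch therefore shows only where the candidate symbols come from; it does not reproduce how the trained model settles on a single prediction.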

Parameter(s)

There are several parameters that can be set for CPT. In the source code, these parameters are passed as a string to the class CPTPredictor. The main parameters are:

splitLength:6 : This parameter indicates whether the training sequences should be kept entirely or cut to reduce memory usage when training the model. If this parameter is set to 0, the training sequences are kept entirely. If the parameter is set to a value k > 0, then only the last k items of each sequence are used for training. This can reduce memory usage by reducing the size of the prediction model.
recursiveDividerMin:1 : The minimum number of items to be removed by the Recursive Divider strategy.
recursiveDividerMax:5 : The maximum number of items to be removed by the Recursive Divider strategy.
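The following sketch illustrates the name:value notation used above and the effect of splitLength on a training sequence. It assumes that the parameter string consists of name:value pairs separated by spaces; check "MainTestCPT.java" for the exact string expected by CPTPredictor. The class and its methods are hypothetical helpers, not part of SPMF.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of SPMF.
class CptParameterSketch {

    /** Parses a string such as "splitLength:6 recursiveDividerMin:1" into a map. */
    static Map<String, Integer> parse(String parameters) {
        Map<String, Integer> result = new LinkedHashMap<>();
        for (String pair : parameters.trim().split("\\s+")) {
            String[] nameAndValue = pair.split(":");
            result.put(nameAndValue[0], Integer.parseInt(nameAndValue[1]));
        }
        return result;
    }

    /**
     * Applies the splitLength rule described above: 0 keeps the whole
     * sequence, k > 0 keeps only the last k items.
     */
    static List<Integer> keepLast(List<Integer> sequence, int splitLength) {
        if (splitLength <= 0 || sequence.size() <= splitLength) {
            return sequence;
        }
        return sequence.subList(sequence.size() - splitLength, sequence.size());
    }

    public static void main(String[] args) {
        Map<String, Integer> parameters =
                parse("splitLength:6 recursiveDividerMin:1 recursiveDividerMax:5");
        System.out.println(parameters);

        // Sequence S4 of the example database, cut to its last 4 items
        List<Integer> s4 = Arrays.asList(5, 7, 1, 4, 2, 3);
        System.out.println(keepLast(s4, 4));
    }
}
```

With splitLength set to 6, none of the four example sequences is cut, since the longest one (S4) contains exactly 6 items.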

Input file format

The input file format is defined as follows. It is a text file where each line represents a sequence from a sequence database. Each item of a sequence is followed by a single space and the value "-1", which acts as a separator. The value "-2" indicates the end of a sequence (it appears at the end of each line). For example, the input file "contextCPT.txt" contains the following four lines (four sequences).

1 -1 2 -1 3 -1 4 -1 6 -1 -2
4 -1 3 -1 2 -1 5 -1 -2
5 -1 1 -1 4 -1 3 -1 2 -1 -2
5 -1 7 -1 1 -1 4 -1 2 -1 3 -1 -2

The first line represents a sequence where item 1 was followed by item 2, followed by item 3, followed by item 4, and followed by item 6. The next lines follow the same format.
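To make the role of the -1 and -2 values concrete, here is a minimal sketch of how one line in this format could be read into a list of items. The class is a hypothetical helper written for this page and is not the reader used by SPMF. It assumes that every item is a single integer, as in "contextCPT.txt".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of SPMF.
class SequenceFormatSketch {

    /** Parses one line such as "1 -1 2 -1 -2" into the list of items [1, 2]. */
    static List<Integer> parseLine(String line) {
        List<Integer> items = new ArrayList<>();
        if (line.trim().isEmpty()) {
            return items; // empty line: no items
        }
        for (String token : line.trim().split("\\s+")) {
            int value = Integer.parseInt(token);
            if (value == -2) {
                break; // -2 marks the end of the sequence
            }
            if (value != -1) {
                items.add(value); // -1 is only a separator and is skipped
            }
        }
        return items;
    }

    public static void main(String[] args) {
        String[] lines = {
            "1 -1 2 -1 3 -1 4 -1 6 -1 -2",
            "4 -1 3 -1 2 -1 5 -1 -2",
            "5 -1 1 -1 4 -1 3 -1 2 -1 -2",
            "5 -1 7 -1 1 -1 4 -1 2 -1 3 -1 -2"
        };
        for (String line : lines) {
            System.out.println(parseLine(line));
        }
    }
}
```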

Performance

CPT is a sequence prediction model that often provides high accuracy according to a performance comparison (see Gueniche et al., 2015). Training the model is also very fast. However, performing a prediction may be slower than with some other models. An improved version of CPT called CPT+ is also offered in SPMF. It generally consumes less memory, is faster, and provides better accuracy.

Where can I get more information about CPT?

The CPT (Compact Prediction Tree) sequence prediction model was proposed in this paper:

Gueniche, T., Fournier-Viger, P., Tseng, V. S. (2013). Compact Prediction Tree: A Lossless Model for Accurate Sequence Prediction. Proc. 9th Intern. Conference on Advanced Data Mining and Applications (ADMA 2013) Part II, Springer LNAI 8347, pp. 177-188.

You can also watch a video presentation of the CPT and CPT+ models (47 minutes).