Perform Sequence Prediction using the AKOM Sequence Prediction Model (SPMF documentation)
This example explains how to run the AKOM algorithm using the SPMF open-source data mining library.
How to run this example?
To run the implementation of AKOM (Dependency Graph)
- This Example is not available in the graphical user interface of SPMF.
- If you are using the source code version of SPMF, launch the file "MainTestAKOM.java" in the package ca.pfv.SPMF.tests
What is AKOM (All-k Order Markov)?
AKOM (All-k Order Markov) is a sequence prediction model proposed by Pitkow & Piroli (1999) that combines markovian models of order 1 to k, where k is parameter that need to be set by the user. This model is used for performing sequence predictions. A sequence prediction consists of predicting the next symbol of a sequence based on a set of training sequences. The task of sequence prediction has numerous applications in various domains. For example, it can be used to predict the next webpage that a user will visit based on previously visited webpages by the user and other users.
The AKOM prediction model can consume a huge amount of memory if the parameter k is set to a high value. But it can have a quite high accuracy. This is one reason why it is still popular. But AKOM is often outperformed by newer models such as CPT+ in terms of prediction accuracy and memory usage.
This implementation has been obtained from the ipredict project.
What is the input of AKOM?
The input of AKOM is a sequence database containing training sequences. These sequences are used to train the prediction model.
In the context of AKOM, a sequence database is a set of sequences where each sequence is a list of items (symbols). For example, the table shown below contains four sequences. The first sequence, named S1, contains 5 items. This sequence means that item 1 was followed by items 2, followed by item 3, followed by item 4, and followed by item 6. This database is provided in the file "contextCPT.txt" of the SPMF distribution.
|S1||(1), (2), (3), (4), (6)|
|S2||(4), (3), (2), (5)|
|S3||(5), (1), (4), (3), (2)|
|S4||(5), (7), (1), (4), (2), (3)|
The AKOM algorithm takes a value k as
parameter that indicates the order of the model. In the source code,
the value of parameter k is passed as a string to the class AKOMPredictor.
For example, the string "order:4" means to set the parameter k to 4.
This indicates that AKOM will create a model of order 4, which means
that it can use up to the four previous symbols in a sequence to
perform a prediction.
What is the output of AKOM?
AKOM performs sequence prediction. After AKOM has been trained with the input sequence database, it can predict the next symbol of a new sequence.
For example, if AKOM is trained with the previous sequence database and parameters, it will predict that the next symbol following the sequence (1),(4) is the symbol (2).
Input file format
The input file format is defined as follows. It is a text file where each line represents a sequence from a sequence database. Each item from a sequence is separated by single space and a -1. The value "-2" indicates the end of a sequence (it appears at the end of each line). For example, the input file "contextCPT.txt" contains the following four lines (four sequences).
1 -1 2 -1 3 -1 4 -1 6 -1 -2
4 -1 3 -1 2 -1 5 -1 -2
5 -1 1 -1 4 -1 3 -1 2 -1 -2
5 -1 7 -1 1 -1 4 -1 2 -1 3 -1 -2
The first line represents a sequence where the item 1 was followed by items 2, followed by item 3, followed by item 4, and followed by item 6. The next lines follow the same format.
AKOM is a sequence prediction model that can consume a huge amount of memory if the parameter k is set to a high value. Moreover, it can be outperformed in terms of prediction accuracy by newer models such as CPT+. One of the reason is that AKOM is not noise tolerant.
Where can I get more information about AKOM?
The All-k Order Markov sequence prediction model was proposed in this paper:
Pitkow, J., Pirolli, P.: Mining longest repeating subsequence to predict world wide web surng. In: Proc. 2nd USENIX Symposium on Internet Technologies and Systems, Boulder, CO, pp. 13–25 (1999)