Perform Sequence Prediction using the TDAG Sequence Prediction Model (SPMF documentation)

This example explains how to run the TDAG algorithm using the SPMF open-source data mining library.

How to run this example?

To run the implementation of TDAG (Transition Directed Acyclic Graph)

What is TDAG (Transition Directed Acyclic Graph)?

TDAG (Transition Directed Acyclic Graph) is a sequence prediction model proposed by Pitkow & Piroli (1999) that combines markovian models of order 1 to k, where k is parameter that need to be set by the user. This model is used for performing sequence predictions. A sequence prediction consists of predicting the next symbol of a sequence based on a set of training sequences. The task of sequence prediction has numerous applications in various domains. For example, it can be used to predict the next webpage that a user will visit based on previously visited webpages by the user and other users.

The TDAG prediction model is quite simple. This is one reason why it is still popular. But TDAG is often outperformed by newer models such as CPT+ in terms of prediction accuracy.

This implementation has been obtained from the ipredict project.

What is the input of TDAG?

The input of TDAG is a sequence database containing training sequences. These sequences are used to train the prediction model.

In the context of TDAG, a sequence database is a set of sequences where each sequence is a list of items (symbols). For example, the table shown below contains four sequences. The first sequence, named S1, contains 5 items. This sequence means that item 1 was followed by items 2, followed by item 3, followed by item 4, and followed by item 6. This database is provided in the file "contextCPT.txt" of the SPMF distribution.

ID Sequences
S1 (1), (2), (3), (4), (6)
S2 (4), (3), (2), (5)
S3 (5), (1), (4), (3), (2)
S4 (5), (7), (1), (4), (2), (3)

What is the output of TDAG?

TDAG performs sequence prediction. After TDAG has been trained with the input sequence database, it can predict the next symbol of a new sequence.

For example, if TDAG is trained with the previous sequence database and parameters, it will predict that the next symbol following the sequence (1),(2) is the symbol (3).

Input file format

The input file format is defined as follows. It is a text file where each line represents a sequence from a sequence database. Each item from a sequence is separated by single space and a -1. The value "-2" indicates the end of a sequence (it appears at the end of each line). For example, the input file "contextCPT.txt" contains the following four lines (four sequences).

1 -1 2 -1 3 -1 4 -1 6 -1 -2
4 -1 3 -1 2 -1 5 -1 -2
5 -1 1 -1 4 -1 3 -1 2 -1 -2
5 -1 7 -1 1 -1 4 -1 2 -1 3 -1 -2

The first line represents a sequence where the item 1 was followed by items 2, followed by item 3, followed by item 4, and followed by item 6. The next lines follow the same format.

Performance

TDAG is a sequence prediction model that is quite simple and thus is generally memory efficient. But, it can often be outperformed in terms of prediction accuracy by newer models such as CPT+.

Where can I get more information about TDAG?

The TDAG sequence prediction model was proposed in this paper:

Laird, P., Saul, R.: Discrete sequence prediction and its applications. Machine learning, vol. 15, no. 1, 43-68 (1994)