Convert time series to sequences using the SAX algorithm (SPMF documentation)

This example explains how to convert a time series to sequences by applying the SAX algorithm using the SPMF open-source data mining library.

How to run this example?

What is this algorithm?

Calculating the SAX of a time series is a popular and simple way of transforming time series to a symbolic representation. In other words, SAX is a way of transforming a time series (a sequence of numbers) to a sequence of symbols.

This implementation takes a set of one or more time series as input. Then, it transform the time series to their SAX representation.

The SAX algorithm was proposed by Lin et al. (2007) and is quite popular.

The interest of applying the SAX algorithm to time series, is that after obtaining the SAX representation, we can apply traditional symbolic pattern mining algorithms such as sequential pattern mining and sequential rule mining algorithms, also offered in SPMF.

What is the input of this algorithm?

The input is one or more time series. A time series is a sequence of floating-point decimal numbers (double values). A time-series can also have a name (a string).

Time series are used in many applications. An example of time series is the price of a stock on the stock market over time. Another example is a sequence of temperature readings collected using sensors.

For this example, consider the four following time series:

Name Data points
ECG1 1,2,3,4,5,6,7,8,9,10
ECG2 1.5,2.5,10,9,8,7,6,5
ECG3 -1,-2,-3,-4,-5
ECG4 -2.0,-3.0,-4.0,-5.0,-6.0

This example time series database is provided in the file contextSAX.txt of the SPMF distribution.

In SPMF, to read a time-series file, it is necessary to indicate the "separator", which is the character used to separate data points in the input file. In this example, the "separator" is the comma ',' symbol.

To calculate the SAX representation of a time series, it is necessary to also provides two additional parameters: a number of segments w, and a number of symbols v.

What is the output?

The output is the SAX representation of each time series received as input. For a given time series, the SAX representation is calculated as follows. First, the time series is divided into w segments, and each segments is replaced by the average of its data points. This is called the piecewise approximate aggregation (PAA) of the time series. Then, the value of each segment is replaced by a symbol. The number of symbol and the number of segment is selected by the user.

Now, the main question is how the symbols are chosen? The main idea in SAX is to assume that values follow a normal distribution and to choose the symbol to represent various interval of values such that each interval is equally probable under the normal distribution (see the paper of Lin et al. 2007 for a more detailed explanation).

For example, in the above example, if the number of segments (data points) is set to 3 data points, and the number of symbols is set to 4, the sax algorithm will create four symbols:

Symbol Interval of values represented by this symbol
a [-Infinity,-0.9413981789451658]
b [-0.9413981789451658,2.4642857142857144]
c [2.4642857142857144,5.869969607516595]
d [5.869969607516595,Infinity]

Using the above symbols, the SAX algorithm generate the following SAX representation of each time series:

Name Data points
ECG1_PAA b, c, d,
ECG2_PAA c, d, d
ECG3_PAA a, a, a
ECG4_PAA a, a, a

After obtaining this representation, it is possible to apply traditional pattern mining algorithm on the sequences of symbols. For example, in SPMF, several algorithms are provided for sequential pattern mining and sequential rule mining, which can be applied on sequence of symbols.

Input file format

The input file format used by this algorithm is efined as follows. It is a text file. The text file contains one or more time series. Each time series is represented by two lines in the input file. The first line contains the string "@NAME=" followed by the name of the time series. The second line is a list of data points, where data points are floating-point decimal numbers separated by a separator character (here the ',' symbol).

For example, for the previous example, the input file is defined as follows:

@NAME=ECG1
1,2,3,4,5,6,7,8,9,10
@NAME=ECG2
1.5,2.5,10,9,8,7,6,5
@NAME=ECG3
-1,-2,-3,-4,-5
@NAME=ECG4
-2.0,-3.0,-4.0,-5.0,-6.0

Consider the first two lines. It indicates that the first time series name is "ECG1" and that it consits of the data points: 1, 2, 3, 4, 5, 6, 7, 8, 9, and 10. Then, three other time series are provided in the same file, which follows the same format.

Output file format

The output file format is a sequence database containing one or more sequence. It is defined as follows. The first line contains the string "@CONVERTED_FROM_TIME_SERIES" to indicate that this file was obtained by converting some time series to sequences. The following defines the symbols produced by the SAX algorithm. Each line defining a symbol starts with "@ITEM=" and is followed by a symbol name (a positive integer), followed by "=", followed by the interval of values represented by this symbol. Then, the following lines are the sequences of symbols. Each sequence is represented by two lines. The first line contains "@NAME=" and is followed by the name of the sequence. The second line is the sequence. A sequence is a list of symbols separated by -1, and ending with a -2.

For example, the output of this example is:

@CONVERTED_FROM_TIME_SERIES
@ITEM=1=[-Infinity,-0.9413981789451658]
@ITEM=2=[-0.9413981789451658,2.4642857142857144]
@ITEM=3=[2.4642857142857144,5.869969607516595]
@ITEM=4=[5.869969607516595,Infinity]
@NAME=ECG1
2 -1 3 -1 4 -1 -2
@NAME=ECG2
3 -1 4 -1 4 -1 -2
@NAME=ECG3
1 -1 1 -1 1 -1 -2
@NAME=ECG4
1 -1 1 -1 1 -1 -2

The first five lines indicates that four symbols are defined called 1, 2, 3, 4, which were called a, b, c, d previously in this example.

Then, four sequences are defined.

The first sequence is 2, 3, 4, which was previously called b, c, d in this example.

Optional parameter

This implementation of SAX offers an optional parameter called "deactivatePAA". It is this used to directly convert timeseries to their SAX representation without first performing the transformation to the PAA representation. This parameter is optional. It is useful when converting several time series of different lengths to their SAX representation. When this parameter is set to true, the algorithm will preserve the original lengths of the time series rather than converting all of them to the same length. To this use this parameter in the user interface of SPMF, type the value "true". For the command line of SPMF, add the value "true" at the end of the line. For example: java -jar spmf.jar run Convert_time_series_to_sequence_database_using_SAX contextSAX.txt output.txt 3 4 , true

Where can I get more information about this algorithm?

The SAX representation is described in the paper of Lin et al. (2007):

Lin, J., Keogh, E., Wei, L., Lonardi, S.: Experiencing SAX: a novel symbolic representation of time series. Data Mining and Knowledge Discovery 15, 107–144 (2007)