Calculating statistics for time series (SPMF documentation)

This example explains how to calculate statistics for time-series using the SPMF open-source data mining library.

How to run this example?

If you are using the graphical interface, (1) choose the "Calculate_stats_for_time_series" algorithm, (2) choose the input file contextSAX.txt, (3) set the separator parameter to ",", and (4) click "Run algorithm".
If you want to execute this example from the command line, then execute this command:
java -jar spmf.jar run Calculate_stats_for_time_series contextSAX.txt no_output_file , in a folder containing spmf.jar and the input file contextSAX.txt.
If you are using the source code version of SPMF, launch the file "MainTestCalculateTimeSeriesStats.java" in the package ca.pfv.SPMF.tests.

What is this tool?

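This tool generates statistics about time series: the number of time series in a file and, for each time series, its number of data points and its minimum, maximum and average values.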

What is the input?

The input is one or more time series. A time series is a sequence of floating-point decimal numbers (double values). A time-series can also have a name (a string).

Time series are used in many applications. An example of time series is the price of a stock on the stock market over time. Another example is a sequence of temperature readings collected using sensors.

For this example, consider the four following time series:

Name	Data points
ECG1	1,2,3,4,5,6,7,8,9,10
ECG2	1.5,2.5,10,9,8,7,6,5
ECG3	-1,-2,-3,-4,-5
ECG4	-2.0,-3.0,-4.0,-5.0,-6.0

This example time series database is provided in the file contextSAX.txt of the SPMF distribution.

In SPMF, to read a time-series file, it is necessary to indicate the "separator", which is the character used to separate data points in the input file. In this example, the "separator" is the comma ',' symbol.
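To illustrate the role of the separator, here is a minimal sketch of how one line of data points could be split into double values. It is not the code used by SPMF: the class SeparatorSplitExample and the method parseDataPoints are hypothetical names chosen for this illustration. The sketch quotes the separator because String.split() interprets its argument as a regular expression, so a separator such as "|" or "." would otherwise be misread.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of the SPMF API.
class SeparatorSplitExample {

    // Split one line of data points using the given separator
    // and convert each token to a double value.
    static double[] parseDataPoints(String line, String separator) {
        // String.split() expects a regular expression, so the separator
        // is quoted to make it match literally.
        String[] tokens = line.split(Pattern.quote(separator));
        double[] values = new double[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            values[i] = Double.parseDouble(tokens[i].trim());
        }
        return values;
    }

    public static void main(String[] args) {
        // The data points of the time series ECG2 from this example
        double[] ecg2 = parseDataPoints("1.5,2.5,10,9,8,7,6,5", ",");
        System.out.println(ecg2.length + " data points, first = " + ecg2[0]);
    }
}
```

With the line of ECG2 and the separator ",", this sketch obtains 8 data points, which matches the number of data points of ECG2 in this example.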

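To calculate the SAX representation of a time series, it is necessary to also provide two additional parameters: a number of segments w, and a number of symbols v. These two parameters are not needed to calculate statistics: as shown in the instructions above, this tool only takes the separator as parameter.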

What is the output?

The output is statistics about the time series. For example, if we use the tool on the four time series given above as example, we get the following statistics:

============= TIME SERIES STATS ==========
Number of time series: 4
Statistics for time series: ECG1
Number of data points: 10
Min value: 1.0
Max value: 10.0
Average value: 5.5
=========================================
Statistics for time series: ECG2
Number of data points: 8
Min value: 1.5
Max value: 10.0
Average value: 6.125
=========================================
Statistics for time series: ECG3
Number of data points: 5
Min value: -5.0
Max value: 4.9E-324
Average value: -3.0
=========================================
Statistics for time series: ECG4
Number of data points: 5
Min value: -6.0
Max value: 4.9E-324
Average value: -4.0
=========================================
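Note that the "Max value" of 4.9E-324 displayed for ECG3 and ECG4 is not a data point of these time series: it is equal to the Java constant Double.MIN_VALUE, the smallest positive double value. The largest data points of ECG3 and ECG4 are -1 and -2, respectively. A plausible explanation, which is an assumption and has not been verified in the SPMF source code, is that the maximum is initialized with Double.MIN_VALUE, which is greater than any negative number. The following minimal sketch shows how the same statistics could be calculated so that the maximum is also correct for time series containing only negative values. The class TimeSeriesStatsSketch and the method stats are hypothetical names, not part of SPMF.

```java
// Hypothetical sketch for illustration only; not the SPMF implementation.
class TimeSeriesStatsSketch {

    // Return an array {min, max, average} for a non-empty array of data points.
    static double[] stats(double[] points) {
        // Start from infinities so that any data point, including a
        // negative one, replaces the initial values.
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        double sum = 0;
        for (double p : points) {
            min = Math.min(min, p);
            max = Math.max(max, p);
            sum += p;
        }
        return new double[] {min, max, sum / points.length};
    }

    public static void main(String[] args) {
        // The data points of the time series ECG3 from this example
        double[] ecg3 = {-1, -2, -3, -4, -5};
        double[] s = stats(ecg3);
        System.out.println("Number of data points: " + ecg3.length);
        System.out.println("Min value: " + s[0]);
        System.out.println("Max value: " + s[1]);
        System.out.println("Average value: " + s[2]);
        // Double.MIN_VALUE is the smallest positive double, not the most negative one
        System.out.println("Double.MIN_VALUE: " + Double.MIN_VALUE);
    }
}
```

For ECG3, this sketch gives a minimum of -5.0, a maximum of -1.0 and an average of -3.0. The minimum and average agree with the output shown above, and only the maximum differs.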

Input file format

The input file format used by this algorithm is defined as follows. It is a text file. The text file contains one or more time series. Each time series is represented by two lines in the input file. The first line contains the string "@NAME=" followed by the name of the time series. The second line is a list of data points, where data points are floating-point decimal numbers separated by a separator character (here the ',' symbol).

For example, for the previous example, the input file is defined as follows:

@NAME=ECG1
1,2,3,4,5,6,7,8,9,10
@NAME=ECG2
1.5,2.5,10,9,8,7,6,5
@NAME=ECG3
-1,-2,-3,-4,-5
@NAME=ECG4
-2.0,-3.0,-4.0,-5.0,-6.0

Consider the first two lines. They indicate that the name of the first time series is "ECG1" and that it consists of the data points: 1, 2, 3, 4, 5, 6, 7, 8, 9, and 10. Then, three other time series are provided in the same file, which follow the same format.
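The file format described above can be read with a few lines of code. The following is a minimal sketch, assuming that each time series is given as a "@NAME=" line followed by a line of data points, as in this example. It is not the reader used by SPMF: the class TimeSeriesFileSketch and the method read are hypothetical names chosen for this illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch for illustration only; not the SPMF implementation.
class TimeSeriesFileSketch {

    // Read all time series from the reader. The returned map keeps the
    // time series in file order, from name to data points.
    static Map<String, double[]> read(BufferedReader reader, String separator)
            throws IOException {
        Map<String, double[]> series = new LinkedHashMap<>();
        String name = null;
        String line;
        while ((line = reader.readLine()) != null) {
            line = line.trim();
            if (line.isEmpty()) {
                continue; // ignore blank lines
            }
            if (line.startsWith("@NAME=")) {
                // First line of a time series: its name
                name = line.substring("@NAME=".length());
            } else {
                // Second line of a time series: its data points
                if (name == null) {
                    throw new IllegalArgumentException("Data points without a @NAME= line");
                }
                String[] tokens = line.split(Pattern.quote(separator));
                double[] points = new double[tokens.length];
                for (int i = 0; i < tokens.length; i++) {
                    points[i] = Double.parseDouble(tokens[i].trim());
                }
                series.put(name, points);
                name = null;
            }
        }
        return series;
    }

    public static void main(String[] args) throws IOException {
        // The first two time series of this example, given as a string
        String content = "@NAME=ECG1\n1,2,3,4,5,6,7,8,9,10\n"
                + "@NAME=ECG2\n1.5,2.5,10,9,8,7,6,5\n";
        Map<String, double[]> series =
                read(new BufferedReader(new StringReader(content)), ",");
        for (Map.Entry<String, double[]> entry : series.entrySet()) {
            System.out.println(entry.getKey() + ": "
                    + entry.getValue().length + " data points");
        }
    }
}
```

To read the file contextSAX.txt instead of a string, the StringReader can be replaced by a java.io.FileReader created with the path of that file.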