Example: Calculate moving average of time series (SPMF)


This example explains how to calculate the moving average of a time series using the SPMF open-source data mining library.

How to run this example?

If you are using the graphical interface, (1) choose the "Calculate_moving_average_of_time_series" algorithm, (2) select the input file "contextSAX.txt", (3) set the separator to the comma ',', (4) set the window size to 4, and then (5) click "Run algorithm".
If you want to execute this example from the command line, then execute this command:
java -jar spmf.jar run Calculate_moving_average_of_time_series contextSAX.txt output.txt 4 ,
in a folder containing spmf.jar and the example input file contextSAX.txt.
If you are using the source code version of SPMF, launch the file "MainTestMovingAverageFromFileToFile.java" in the package ca.pfv.SPMF.tests to run this example.

What is the calculation of the moving average for time series?

Calculating the moving average is a simple but popular way of smoothing a time series to remove noise. It takes as a parameter a window size w (a number of data points). Then, for a time series, it replaces each data point by the average of that data point and the w-1 data points that precede it.
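Written as a formula, and assuming the data points of a time series are indexed x_1, ..., x_n (this notation is introduced here only for illustration and is not part of SPMF), the smoothed value m_i at a position i that has at least w-1 predecessors would be:

```latex
m_i = \frac{1}{w} \sum_{j=i-w+1}^{i} x_j , \qquad w \le i \le n
```

The first w-1 positions have fewer than w-1 predecessors. How this implementation treats them is explained in the "Implementation details" section of this page.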

What is the input of this algorithm?

The input is one or more time series. A time series is a sequence of floating-point decimal numbers (double values). A time series can also have a name (a string).

Time series are used in many applications. An example of time series is the price of a stock on the stock market over time. Another example is a sequence of temperature readings collected using sensors.

For this example, consider the four following time series:

Name    Data points
ECG1    1,2,3,4,5,6,7,8,9,10
ECG2    1.5,2.5,10,9,8,7,6,5
ECG3    -1,-2,-3,-4,-5
ECG4    -2.0,-3.0,-4.0,-5.0,-6.0

This example time series database is provided in the file contextSAX.txt of the SPMF distribution.

In SPMF, to read a time series file, it is necessary to indicate the "separator", which is the character used to separate data points in the input file. In this example, the "separator" is the comma ',' symbol.

To calculate the moving average, it is also necessary to provide a window size w, which is a number of data points. In this example, the window size is set to 4, so the moving average of each of the above time series is calculated over windows of 4 data points.

What is the output?

The output is the moving average of each time series received as input. The moving average is calculated by replacing each data point in a time series by the average of that data point and the w-1 data points that precede it in the same time series.

For the above example, with the window size set to 4 data points, the result is:

Name         Data points
ECG1_MAVG    1.0,1.5,2.0,2.5,3.5,4.5,5.5,6.5,7.5,8.5
ECG2_MAVG    1.5,2.0,4.666666666666667,5.75,7.375,8.5,7.5,6.5
ECG3_MAVG    -1.0,-1.5,-2.0,-2.5,-3.5
ECG4_MAVG    -2.0,-2.5,-3.0,-3.5,-4.5
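
As a check on the result above, the first values of ECG2_MAVG can be recomputed by hand with w = 4. Here m_i denotes the i-th output value (a notation used only for this illustration). The first three positions have fewer than three predecessors, so the average is taken over the points available so far, as explained in the "Implementation details" section:

```latex
\begin{aligned}
m_1 &= 1.5 \\
m_2 &= \tfrac{1.5 + 2.5}{2} = 2.0 \\
m_3 &= \tfrac{1.5 + 2.5 + 10}{3} \approx 4.6667 \\
m_4 &= \tfrac{1.5 + 2.5 + 10 + 9}{4} = 5.75 \\
m_5 &= \tfrac{2.5 + 10 + 9 + 8}{4} = 7.375
\end{aligned}
```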

To see the result visually, it is possible to use the SPMF time series viewer, described in another example of this documentation.

[Figure: the moving averages displayed in the SPMF time series viewer]

It is possible to see that the time series are less noisy. For example, the time series "ECG3" is much smoother after applying the moving average.

Input file format

The input file format is defined as follows. It is a text file that contains one or more time series. Each time series is represented by two lines in the input file. The first line contains the string "@NAME=" followed by the name of the time series. The second line is a list of data points, where data points are floating-point decimal numbers separated by a separator character (here the ',' symbol).

For the previous example, the input file is defined as follows:

@NAME=ECG1
1,2,3,4,5,6,7,8,9,10
@NAME=ECG2
1.5,2.5,10,9,8,7,6,5
@NAME=ECG3
-1,-2,-3,-4,-5
@NAME=ECG4
-2.0,-3.0,-4.0,-5.0,-6.0

Consider the first two lines. They indicate that the name of the first time series is "ECG1" and that it consists of the data points 1, 2, 3, 4, 5, 6, 7, 8, 9, and 10. Three other time series are then provided in the same file, following the same format.
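
To make the format concrete, here is a minimal Java sketch of a reader for it. It is not the SPMF reader: the class name TimeSeriesFormatSketch and the method parse are hypothetical, and the sketch assumes that every line of data points is preceded by an "@NAME=" line.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Illustrative sketch only; this is not the SPMF source code. */
class TimeSeriesFormatSketch {

    private static final String NAME_PREFIX = "@NAME=";

    /** Parses lines in the "@NAME=" format into a map from name to data points, in file order. */
    static Map<String, double[]> parse(List<String> lines, String separator) {
        Map<String, double[]> series = new LinkedHashMap<>();
        String currentName = null;
        for (String rawLine : lines) {
            String line = rawLine.trim();
            if (line.isEmpty()) {
                continue; // tolerate blank lines
            }
            if (line.startsWith(NAME_PREFIX)) {
                currentName = line.substring(NAME_PREFIX.length());
                continue;
            }
            if (currentName == null) {
                // Assumption of this sketch: a name line always comes first.
                throw new IllegalArgumentException("Data line without a name: " + line);
            }
            // Quote the separator so that characters such as '|' are not read as regex syntax.
            String[] tokens = line.split(Pattern.quote(separator));
            double[] values = new double[tokens.length];
            for (int i = 0; i < tokens.length; i++) {
                values[i] = Double.parseDouble(tokens[i].trim());
            }
            series.put(currentName, values);
            currentName = null;
        }
        return series;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "@NAME=ECG1", "1,2,3,4,5,6,7,8,9,10",
                "@NAME=ECG3", "-1,-2,-3,-4,-5");
        parse(lines, ",").forEach((name, values) ->
                System.out.println(name + " -> " + Arrays.toString(values)));
    }
}
```

The separator is passed as a parameter because, as noted earlier, SPMF asks the user to indicate which character separates the data points.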

Output file format

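The output file format is the same as the input file format. In this example, the name of each output time series is the name of the corresponding input time series followed by "_MAVG". The output file is: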

@NAME=ECG1_MAVG
1.0,1.5,2.0,2.5,3.5,4.5,5.5,6.5,7.5,8.5
@NAME=ECG2_MAVG
1.5,2.0,4.666666666666667,5.75,7.375,8.5,7.5,6.5
@NAME=ECG3_MAVG
-1.0,-1.5,-2.0,-2.5,-3.5
@NAME=ECG4_MAVG
-2.0,-2.5,-3.0,-3.5,-4.5
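
The following Java sketch shows one way such output lines could be produced. It is an illustration under assumptions, not the SPMF writer: the class OutputFormatSketch and the method format are hypothetical, and the "_MAVG" suffix is simply copied from the example output above.

```java
/** Illustrative sketch only; this is not the SPMF source code. */
class OutputFormatSketch {

    /** Builds the two output lines for one smoothed time series. */
    static String format(String name, double[] values, String separator) {
        StringBuilder sb = new StringBuilder();
        // First line: the name, with the suffix seen in the example output.
        sb.append("@NAME=").append(name).append("_MAVG").append('\n');
        // Second line: the data points, joined by the separator.
        for (int i = 0; i < values.length; i++) {
            if (i > 0) {
                sb.append(separator);
            }
            sb.append(values[i]); // same text as Double.toString, e.g. "1.0"
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        double[] smoothed = {-1.0, -1.5, -2.0, -2.5, -3.5};
        System.out.println(format("ECG3", smoothed, ","));
    }
}
```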

Implementation details

It is sometimes said that the moving average of a time series should not include the first w-1 points, because for these points there are not enough previous points to calculate an average over w points. In this implementation, a design decision is that the moving average should contain as many data points as the original time series. For the first w-1 points, the average is therefore taken over the points available so far: the first point of the moving average is identical to the first point of the original time series, the second point is the average of the first two points, and so on, up to the (w-1)-th point, which is the average of the first w-1 points of the original time series.
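
The behavior described above can be sketched in Java as follows. This is a minimal illustration, not the SPMF source code: the class MovingAverageSketch and the method movingAverage are hypothetical names, and the running-sum approach is one possible way to compute the result.

```java
import java.util.Arrays;

/** Illustrative sketch only; this is not the SPMF source code. */
class MovingAverageSketch {

    /**
     * Returns a series of the same length as the input. Position i holds the
     * average of the last min(i + 1, w) points ending at position i.
     */
    static double[] movingAverage(double[] series, int w) {
        if (w < 1) {
            throw new IllegalArgumentException("window size must be >= 1");
        }
        double[] result = new double[series.length];
        double sum = 0.0; // running sum of the points currently in the window
        for (int i = 0; i < series.length; i++) {
            sum += series[i];
            if (i >= w) {
                sum -= series[i - w]; // remove the point that just left the window
            }
            // The first w-1 positions use a shorter window.
            int count = Math.min(i + 1, w);
            result[i] = sum / count;
        }
        return result;
    }

    public static void main(String[] args) {
        double[] ecg1 = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        // Prints [1.0, 1.5, 2.0, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5]
        System.out.println(Arrays.toString(movingAverage(ecg1, 4)));
    }
}
```

A running sum keeps the cost linear in the length of the series, but it can accumulate floating-point rounding error on long series. Summing each window from scratch avoids this at the cost of more additions.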

Where can I get more information about the moving average?

The moving average is a very basic operation for time series. It is described on many websites and in many books.
