Incremental Statistics

Incremental statistics form the basis of handystats metrics' data aggregation mechanism, so understanding this type of statistics, along with its pros and cons, is crucial.

Definition

Incremental statistics are statistics over a dataset of (value, timestamp) pairs that can be maintained and updated on each new datum with constant time and space complexity.

Examples of incremental statistics are:

  • min
  • max
  • sum
  • moving average

One of the main properties of incremental statistics is that no raw input data have to be stored. This can be a disadvantage if you would like to retrieve from the input data some special statistics that are not provided by the handystats library, but this approach allows us to keep memory consumption low.
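As an illustration of this constant-space property, here is a minimal C++ sketch (hypothetical code, not the handystats implementation) of an object that maintains min, max, sum, count and mean with an O(1) update per datum:

    #include <algorithm>
    #include <cstdint>
    #include <limits>

    // Hypothetical illustration: the state has constant size regardless of how
    // many values have been observed, and each update takes constant time.
    struct incremental_stats {
        double min = std::numeric_limits<double>::infinity();
        double max = -std::numeric_limits<double>::infinity();
        double sum = 0.0;
        uint64_t count = 0;

        void update(const double value) {
            min = std::min(min, value);
            max = std::max(max, value);
            sum += value;
            ++count;
        }

        double mean() const {
            return count ? sum / count : 0.0;
        }
    };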

Interval Statistics

Some simple incremental statistics, such as min, max, sum, count and mean, act on the complete dataset. While knowing the total count or sum is important, it can become less meaningful in long-running programs where statistics on recent data (e.g., the last N values or the last second) are needed. Examples of such statistics are interval count, interval sum and interval mean.

A precise implementation of interval statistics requires maintaining a window of recent data, which can lead to memory consumption problems. At this point, approximating the interval statistics is the solution. But there's no such thing as a free lunch: by approximating interval statistics we lose precision.

We base our interval statistics implementation on a modification of the exponential smoothing technique.

Exponential Moving Average

Exponential moving average is an example of both exponential smoothing and interval statistics.

The basic formula of exponential moving average is:

\[EMA_t = \alpha \cdot x_t + (1 - \alpha) \cdot EMA_{t - 1}\]

Here

  • \(EMA_t\) is the exponential moving average at time \(t\)
  • \(x_t\) is the datum at time \(t\)
  • \(\alpha\) is the smoothing factor, \(0 < \alpha < 1\)

The only configurable parameter of the exponential moving average is the smoothing factor \(\alpha\). Higher \(\alpha\) values discount older data faster.

If you are interested in roughly the last \(N\) data values, common choices for \(\alpha\) are \(\frac{2}{N + 1}\) and \(\frac{1}{N}\). With such a choice the exponential moving average can be roughly viewed as an interval statistic over the last \(N\) data values (see the sketch below).
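As a rough sketch (the type and member names below are illustrative, not part of the handystats API), the update rule with \(\alpha = \frac{2}{N + 1}\) looks as follows:

    #include <cstddef>

    // Illustrative exponential moving average with alpha = 2 / (N + 1),
    // which roughly tracks the last N data values.
    struct exponential_moving_average {
        double alpha;
        double value = 0.0;
        bool initialized = false;

        explicit exponential_moving_average(const std::size_t n)
            : alpha(2.0 / (n + 1))
        {}

        void update(const double x) {
            if (!initialized) {
                value = x;    // seed with the first datum
                initialized = true;
            }
            else {
                value = alpha * x + (1.0 - alpha) * value;
            }
        }
    };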

Exponential Smoothing Technique For Time Intervals

Exponential smoothing can be enhanced by adjusting the smoothing factor at runtime. Moreover, to extend exponential smoothing to time intervals, the smoothing factor should be based on data timestamps and the elapsed time between datum points.

For example, the interval sum (over the last time interval \(T\)) can be maintained in the following general form (corner cases are not considered):

\[S_k = x_k + (1 - E) \cdot S_{k - 1}\]

Here

  • \(S_k\) is the approximate sum of data values over the last time interval \(T\) that ends at the timestamp \(t_k\) of \(x_k\)
  • \(x_k\) is the datum with timestamp \(t_k\)
  • \(E\) is the ratio of the time elapsed between timestamps \(t_{k - 1}\) and \(t_k\) to the time interval \(T\)

The “physical” interpretation of the interval sum formula is the following.

Let \(S_{k - 1}\) be the interval sum of the \(x\) data values over the last time interval \(T\) that ends at timestamp \(t_{k - 1}\). Also assume that the sum of the considered \(x\) values is evenly distributed over the time interval \(T\).

When a new datum value \(x_k\) arrives at timestamp \(t_k\), the time interval window is shifted by \(\Delta T_k = t_k - t_{k - 1}\), and only the \(\frac{T - \Delta T_k}{T}\) part of the previous time interval window overlaps with the new one.

Thus, as the sum of values is evenly distributed over the previous time interval, only the \(\frac{T - \Delta T_k}{T}\) part of the previous interval sum should be preserved. The new datum value \(x_k\) contributes to the new interval sum without modification, as it is evenly distributed over the last \(\Delta T_k\) time interval.
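Combining the two contributions and noting that \(E = \frac{\Delta T_k}{T}\), we recover the update formula:

\[S_k = x_k + \frac{T - \Delta T_k}{T} \cdot S_{k - 1} = x_k + (1 - E) \cdot S_{k - 1}\]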

Still, this approach is an approximation of precise interval statistics and might behave poorly on irregular and sparse data series, but in general its behaviour is acceptable.

The interval count statistic is easily obtained from the interval sum by replacing the \(x\) values with the constant 1.

And the interval mean is just the ratio of the interval sum to the interval count.
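A minimal sketch of these three interval statistics (illustrative code only, not the handystats implementation; clamping \(E\) to 1 for data gaps longer than \(T\) is an assumption about how one corner case could be handled):

    #include <algorithm>
    #include <chrono>

    // Illustrative approximate interval statistics over a sliding time window T.
    // On each datum the previous values are decayed by the fraction of the
    // window that has slid past since the last update.
    struct interval_stats {
        std::chrono::duration<double> window;          // time interval T
        std::chrono::steady_clock::time_point last;    // timestamp of the previous datum
        double interval_sum = 0.0;
        double interval_count = 0.0;
        bool initialized = false;

        explicit interval_stats(const std::chrono::duration<double> T)
            : window(T)
        {}

        void update(const double x, const std::chrono::steady_clock::time_point t) {
            if (initialized) {
                const double elapsed = std::chrono::duration<double>(t - last).count();
                // E -- elapsed time as a fraction of the window, clamped to [0, 1]
                const double E = std::min(1.0, elapsed / window.count());
                interval_sum = x + (1.0 - E) * interval_sum;
                // interval count: the same decay rule with the datum replaced by 1
                interval_count = 1.0 + (1.0 - E) * interval_count;
            }
            else {
                interval_sum = x;
                interval_count = 1.0;
                initialized = true;
            }
            last = t;
        }

        // interval mean: ratio of interval sum to interval count
        double interval_mean() const {
            return interval_count > 0.0 ? interval_sum / interval_count : 0.0;
        }
    };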

List of Handystats’ Incremental Statistics

Here is the full list of incremental statistics supported by the handystats library:

  • min
  • max
  • sum
  • count
  • mean
  • moving average (exponential)
  • interval count
  • interval sum
  • interval mean

Incremental Statistics Implementation

We base our incremental statistics implementation on Boost.Accumulators with additional moving average and interval statistics.
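For reference, here is a minimal Boost.Accumulators example (plain Boost usage, not handystats code) that accumulates the basic incremental statistics:

    #include <iostream>
    #include <initializer_list>

    #include <boost/accumulators/accumulators.hpp>
    #include <boost/accumulators/statistics/stats.hpp>
    #include <boost/accumulators/statistics/min.hpp>
    #include <boost/accumulators/statistics/max.hpp>
    #include <boost/accumulators/statistics/sum.hpp>
    #include <boost/accumulators/statistics/count.hpp>
    #include <boost/accumulators/statistics/mean.hpp>

    namespace ba = boost::accumulators;

    int main() {
        // accumulator set parameterized with the basic incremental statistics
        ba::accumulator_set<
            double,
            ba::stats<ba::tag::min, ba::tag::max, ba::tag::sum, ba::tag::count, ba::tag::mean>
        > acc;

        for (const double x : {3.0, 1.0, 4.0, 1.0, 5.0}) {
            acc(x);    // O(1) update per datum
        }

        std::cout << "min = " << ba::min(acc)
                  << ", max = " << ba::max(acc)
                  << ", sum = " << ba::sum(acc)
                  << ", count = " << ba::count(acc)
                  << ", mean = " << ba::mean(acc) << std::endl;
    }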