Standard Deviation and simple time series outlier detection using Table Calculations

I recently used this technique as a quick and dirty way to find anomalies in time series data and thought I would share.

Let’s say you have simple time series data, event counts grouped by day:

SELECT 
	TO_CHAR(DATE_TRUNC('week', updates.date),'YYYY-MM-DD') AS "updates.date_week",
	COUNT(*) AS "updates.count"
FROM public.updates AS updates

To find big ‘jumps’ in this data, you could use the 3 sigma rule. Basically any observations that fall outside of three standard deviations from the mean is considered an outlier.

mu  = mean of the data
std = standard deviation of the data
IF abs(x-mu) > 3*std  THEN  x is outlier

To model this in a Look, I used table calculations. The first thing we need is the Standard Deviation of the count field. This is calculated by taking the square root of the average of the squared deviations of the values from their mean value.

This is pretty easy to define as a table calculation:

round(sqrt(mean(power(${updates.count} - mean(${updates.count}), 2))),2)

Once we have that, we can pretty easily define our 3-sigma rule:

if(abs(${updates.count} - mean(${updates.count})) >  ${standard_deviation} * 3, "Yes", "No")  

And that’s all you need to detect outliers!

4 Likes

Sweet pattern, Michael! Wanted to touch base and let you know about a few other features that might be simplifications/additions to this pattern.

  1. We’ve built out the ability to hide table calculated no values from a visualization, i.e. you could hide your outliers.

  2. There are stddev_pop and stddev_samp functions in table calculations.

2 Likes

This is a really cool use of table calculations! A quick note on syntax:

if(5 > 4, "Yes", "No")

isn’t strictly necessary, simply

5 > 4

will evaluate to Yes.

1 Like

The data points you consider for the Mean , STD calculations matters… I feel baseline average should be calculated by averaging the hourly rollups (Example - 9:00 AM to 10:00 AM hourly rollups ) for the three Mondays within the three-week period.

Similarly, The standard deviation is calculated from the baseline average for rollups

The standard deviation provides a statistical indicator of how much variability exists in the population data that factored into the baseline average calculations.

•For hourly reporting, the standard deviation is calculated for the polled values.

•For daily reporting, the standard deviation is calculated for hourly averages.

•For weekly reporting and beyond, the standard deviation is calculated for the daily averages.