Standard Deviation and simple time series outlier detection using Table Calculations

(Michael Erasmus) #1

I recently used this technique as a quick and dirty way to find anomalies in time series data and thought I would share.

Let’s say you have simple time series data, event counts grouped by day:

SELECT 
	TO_CHAR(DATE_TRUNC('week', updates.date),'YYYY-MM-DD') AS "updates.date_week",
	COUNT(*) AS "updates.count"
FROM public.updates AS updates

To find big ‘jumps’ in this data, you could use the 3 sigma rule. Basically any observations that fall outside of three standard deviations from the mean is considered an outlier.

mu  = mean of the data
std = standard deviation of the data
IF abs(x-mu) > 3*std  THEN  x is outlier

To model this in a Look, I used table calculations. The first thing we need is the Standard Deviation of the count field. This is calculated by taking the square root of the average of the squared deviations of the values from their mean value.

This is pretty easy to define as a table calculation:

round(sqrt(mean(power(${updates.count} - mean(${updates.count}), 2))),2)

Once we have that, we can pretty easily define our 3-sigma rule:

if(abs(${updates.count} - mean(${updates.count})) >  ${standard_deviation} * 3, "Yes", "No")  

And that’s all you need to detect outliers!

4 Likes

(Zachary Michel) #2

Sweet pattern, Michael! Wanted to touch base and let you know about a few other features that might be simplifications/additions to this pattern.

  1. We’ve built out the ability to hide table calculated no values from a visualization, i.e. you could hide your outliers.

  2. There are stddev_pop and stddev_samp functions in table calculations.

2 Likes

(Wil Gieseler) #3

This is a really cool use of table calculations! A quick note on syntax:

if(5 > 4, "Yes", "No")

isn’t strictly necessary, simply

5 > 4

will evaluate to Yes.

1 Like