Stats table calcs: comparing rates over time (Bonus: anomaly detection!)

statistics
tablecalcs

(fabio) #1

If you haven’t already read my post on using the beta distribution for an A/B test, that might be a good warm-up read. (The disclaimers from that post still apply :wink: )

In that example, the data took the form of a proportion - a number of successes out of a number of trials. What do you do if you want to instead compare rates of events over intervals? For example, user signups, error events, or web traffic over time.

Well, in the A/B testing example, the data points can be thought of as being taken from a Bernoulli distribution: either yes or no with a given probability. That probability is in turn modeled by the conjugate prior of the Bernoulli distributions, the beta distributions, so we used the beta_inv table calc.

In the present scenario, the data points can be thought of as being taken from a Poisson distribution: its possible values are whole positive numbers representing a number of events happening in an interval assuming those events occur independently at a given rate. That rate, which is what we are actually interested in quantifying, can be thought of as being taken from the Poissons’ conjugate priors: Gamma distributions.

For the sake of illustration, lets say you had data like this:

Month   | Days    | Errors
------------------------------------------
Jan     | 31      | 12
Feb     | 28      | 13
Mar     | 31      | 19

Is the rate of errors in March higher than in February? On the surface it looks that way, since the observed rate was 19/31=0.61 events per day, as compared to the observed rate of 13/28 = 0.46 events per day in February.

But with any observation, there is variability and we are interested in the “true” underlying rate that generated this data, moreso than the observed rate in our sample. The conjugate prior distribution allows us to understand exactly that - given a set of observations, what the true underlying rate may be.

With our example, when you use the conjugate prior (gamma distribution), you see there is quite a bit of range, including lots of overlap between what the rate of errors may be for March and for February:

Just like in the A/B testing example, you could use the inverse distribution function to come up with credibility intervals and then visualize them using the timeline viz:

The upper/lower bound formulas for 90% credibility intervals would be

gamma_inv(
    0.5 +/- 0.45,
    ${...events} + 0.001, 
    1/( ${...days} + 0.001)
)

Bonus: Anomaly Detection

Let’s try to detect when there is a drop in a rate of events. We’ll continue with the gamma distribution, but these concepts can just as easily be applied to detect a change in a proportion, with the beta distribution.

We’ll make a look that is designed to be run every day. We’ll have it pull some trailing activity by day, for example 60 days’ worth. Then we’ll add some formulas to detect when there is a significant drop:


(In this example, the threshold happens to be on a day with 0 events, but it is highlighting a difference in data before and after the line, not simply highlighting that day for being 0)

A practical consideration is choosing the windows of time to compare. There are many ways to do this, with different levels of finesse such as dynamically choosing baseline periods, normalizing for weekly or seasonal trends, or pushing the window functions into SQL instead of table calcs, but for demonstrative purposes, I’ll simply use table calcs to consider windows of increasing size looking back from the current day, and to compare those windows to similarly sized prior windows. Then I’ll highlight in red the first day (if any) with a significant, sizable drop. Here are the specific formulas I used:

  • N Periods: row()
  • 2 N rows available?: row() <= count(${accidents.event_date}) / 2
    ^ We will use this to “Hide No’s” since we need half of the rows for comparison’s sake only
  • Trailing N Total: running_total(${accidents.count})
  • Prior N Total: sum(offset_list(${accidents.count},1,row()))
  • Trailing N Upper Bound: gamma_inv(0.95, ${trailing_n_total} + 0.001, 1 / (${n_periods} + 0.001))
  • Prior N Lower Bound: gamma_inv(0.05, ${prior_n_total} + 0.001, 1 / (${n_periods} + 0.001))
  • Is practical difference?: ${trailing_n_upper_bound} < ${prior_n_lower_bound} * 0.667
    ^ Normally you want to specify a ratio of “practical equivalence”, here a drop of over 1/3rd
  • As 0/1: if( ${is_practical_difference}, 1, 0)
  • Difference running total: running_total(${as_01})
  • Most Recent Deviation: if( ${difference_running_total}=1 AND offset(${difference_running_total},-1)=0, max(${accidents.count}), null)
    ^This will be used to highlight the deviation on the chart

Setting that last measure to a column allows us to get a “highlight”:

In conclusion, we can say (with high probability) that to the right of this line, the true underlying rate of events has dropped by at least a third, as compared to the same sized period immediately to the left of this line. And because we’re using Bayesian statistics, these formulas are robust against small datasets, meaning no manual exception handling or awkward sample size declarations up front.


Filtering in Visualizations using Table Calculations (3.28+)
#2

@fabio – very cool- thanks. How are you producing those overlaid distributions in the events per day view?


(fabio) #3

Hi @AbhiSivasailam!

I’ll admit I got a bit creative there :slight_smile: I just made a result set with 100 numbers as the dimension, used table calcs to translate them over the domain I wanted to plot (min_number + ${num_from_0_to_100}/100*(max_number-min_number), and then used the gamma_dist table calc (cumulative=no) to get the PDF value:

I just hardcoded three measures for my visual, but I think you should be able to do this with pivots to dynamically pass in different alpha and beta values for each of N series


(Alexander McMurray) #4

This would be useful for my use case - I have a simpler model using the rolling standard deviation and rolling mean to detect anomalies (I guess it would be like using a normal distribution for the underlying process but honestly I just chose it as it was simpler and seemed to work well enough).


(fabio) #5

You can now condition scheduled to be sent or not based on table calculations (using the “Hide No’s” feature on the calc’s column header), so you can now do the alerting described in the bonus section!