Cumulative Density, a Percentile Pattern

(Daniel Nelson) #1

How Much and Where

How much Lifetime Value (LTV) am I getting from my top 5% of customers? How many times have the top ⅓ active users visited our website vs. the top ½? What percentage of our partners had at least $10k in sales last month?

All of these questions follow a similar form: “How many/much- insert noun here- did the top “x%” contribute? Fortunately in Looker, there is an easy way for end users to flesh out this type of analysis.

The Setup

Using the CDC’s 2013 Natality dataset, we are going to build a Cumulative Density Function, or CDF. These functions are perfect for questions like the ones posed above. To motivate this example, we start with the question, “What percentage of newborns weigh 7 lbs or less? Does a mother’s personal habits during pregnancy, such as cigarette smoking, affect birth weight?”

The Solution

It’s important to understand the math behind this question. At a high level, we want to answer questions like what percentage of newborns are under 7 lbs, or how many lbs do at least 50% of newborns weigh? We want a percentage of the total, so there is going to be a sum in the denominator. And we don’t want the percent of total for each row, we want a percent of total for the sum of all the preceding rows, inclusive. In English, we want a running total or running sum in the numerator, and the column sum in the denominator. The table calculation for this looks like:

You’ll notice two things about the graphic below. For “% of Newborns under ‘x’ lbs”, the last row is 100. That is because the total is always 100% of the total. The other thing you will notice is that each row in “% of Newborns under ‘x’ lbs” is at least as large as the preceding row. These are two properties of CDF’s and gives us visual validation of the table calc and the way we built this plot.

If we were to look at the right column, “% of Newborns under ‘x’ lbs” would have the following interpretation: 6.49% of all infants born in 2013 to mothers who smoked were under 5 lbs, 18.2% of all infants born in 2013 to mothers who smoked were under 6 lbs, 47.42% of all infants born in 2013 to mothers who smoked were under 7 lbs, etc. Comparing these to infants born to non-smoking mothers might even move our prior closer to the notion that smoking during pregnancy affects birth weight.

Use Cases

This pattern is great for comparing cohorts or “tiered” dimensions. Imagine there is an ecommerce shop that wants to understand how their customer base’s age affects their lifetime order count. We build a look that attempts to answer the question, how does the ordering behavior of the top 50,75,90 percent of customers change by age group? We start by pivoting out “Age Group”, and counting lifetime orders (ie. how many customers have EXACTLY 1 order, how many have 2, how many have 3, etc…).

Here, the x axis, Lifetime Orders, is sorted in reverse order. Now we can make statements like, “40% of customers aged 40-50 order at least 5 times, while only 33% of customers aged 20-30 order at least 5 times.” In this example, we see that the typical user (the median user) for 20-30 year olds order about 5 times, while the typical user for 40-50 year olds order 6 times. This difference is even more pronounced at the 80 percentile level. The top 80% of 40-50 year olds ordered 10 times, while the top 80% of 20-30 year olds ordered only 7 times.

Percentiles in pivot tables
(Dave Kim) #2

This is a great example, but do you know if there is a way to be able to run this with additional dimensions? So if in your example, there was a column to the left of birth weight that maybe showed what part of the country they were from. So the running total would be done at the region level, does that make sense?

(Daniel Nelson) #3

Hi David, great question! To add another dimension, filter on the values you care about and pivot out. This will give you a PDF of each combination of values from the two pivoted dimensions. In our example, if I added another yes/no dimension (have had unpasteurized cheese during pregnancy, for example), then I would get a PDF of the following 4 series:

  1. Those who have smoked and who have eaten unpasteurized cheese (Yes/Yes)
  2. Those who have smoked and who have not eaten unpasteurized cheese (Yes/No)
  3. Those who have not smoked and have not eaten unpasteurized cheese (No/No)
  4. Those who have not smoked but who have enjoyed unpasteurized cheese (No/Yes)

Hope this helped!

(Dave Kim) #4

Thanks Daniel, that is a way to do it for dimensions with limited values like a yes/no. However what if instead of another yes/no, you added a ‘State’ dimension? You’d result in a table with 100 columns!

(Daniel Nelson) #5

Hi David, you are absolutely correct, that would make a CDF for each unique value, so for a dimension like state, which has 50 values, you’d get 50 separate CDF’s. What are you trying to do by adding an “extra dimension”? If you only wanted to return values for certain states or regions, just filter on those values.