How Much and Where
How much Lifetime Value (LTV) am I getting from my top 5% of customers? How many times have the top ⅓ active users visited our website vs. the top ½? What percentage of our partners had at least $10k in sales last month?
All of these questions follow a similar form: “How many/much- insert noun here- did the top “x%” contribute? Fortunately in Looker, there is an easy way for end users to flesh out this type of analysis.
Using the CDC’s 2013 Natality dataset, we are going to build a Cumulative Density Function, or CDF. These functions are perfect for questions like the ones posed above. To motivate this example, we start with the question, “What percentage of newborns weigh 7 lbs or less? Does a mother’s personal habits during pregnancy, such as cigarette smoking, affect birth weight?”
It’s important to understand the math behind this question. At a high level, we want to answer questions like what percentage of newborns are under 7 lbs, or how many lbs do at least 50% of newborns weigh? We want a percentage of the total, so there is going to be a sum in the denominator. And we don’t want the percent of total for each row, we want a percent of total for the sum of all the preceding rows, inclusive. In English, we want a running total or running sum in the numerator, and the column sum in the denominator. The table calculation for this looks like:
You’ll notice two things about the graphic below. For “% of Newborns under ‘x’ lbs”, the last row is 100. That is because the total is always 100% of the total. The other thing you will notice is that each row in “% of Newborns under ‘x’ lbs” is at least as large as the preceding row. These are two properties of CDF’s and gives us visual validation of the table calc and the way we built this plot.
If we were to look at the right column, “% of Newborns under ‘x’ lbs” would have the following interpretation: 6.49% of all infants born in 2013 to mothers who smoked were under 5 lbs, 18.2% of all infants born in 2013 to mothers who smoked were under 6 lbs, 47.42% of all infants born in 2013 to mothers who smoked were under 7 lbs, etc. Comparing these to infants born to non-smoking mothers might even move our prior closer to the notion that smoking during pregnancy affects birth weight.
This pattern is great for comparing cohorts or “tiered” dimensions. Imagine there is an ecommerce shop that wants to understand how their customer base’s age affects their lifetime order count. We build a look that attempts to answer the question, how does the ordering behavior of the top 50,75,90 percent of customers change by age group? We start by pivoting out “Age Group”, and counting lifetime orders (ie. how many customers have EXACTLY 1 order, how many have 2, how many have 3, etc…).
Here, the x axis, Lifetime Orders, is sorted in reverse order. Now we can make statements like, “40% of customers aged 40-50 order at least 5 times, while only 33% of customers aged 20-30 order at least 5 times.” In this example, we see that the typical user (the median user) for 20-30 year olds order about 5 times, while the typical user for 40-50 year olds order 6 times. This difference is even more pronounced at the 80 percentile level. The top 80% of 40-50 year olds ordered 10 times, while the top 80% of 20-30 year olds ordered only 7 times.