[Analytic Block] Retention Analysis

About this Block

One of the most common requests we hear at Looker is for better cohort analysis. Charts like these have become commonplace in modern data applications:

Explore Data in Full Screen

From gaming apps to retail stores, countless businesses want to understand how well they are retaining various cohorts of customers, and what to do about it. This means that countless businesses have also felt the pain of writing some really gross queries.

Ideal Data Types

Retention analysis will work with plenty of different data structures. The pattern below utilizes data that contains a users table and a transactions table of some sort.

Try it Yourself

Hacking it together

Cohort-based retention analysis is hard in traditional SQL because you want to simultaneously see the number of users in a cohort (time-invariant) and the number of users in that cohort with activity in a given month (time-variant). So we want to group by two different months (signup month and activity month), but when we group by activity month, we want to make sure we don’t exclude users without any activity that month. Nasty stuff.

Traditionally, the solution is to calculate each number separately with multiple queries, and then stitch the results together in Excel. But that’s not flexible - you don’t get to slice and dice by all the other users attributes in my database (among other applications), and you certainly don’t get to drill into the individual users. Seems like a waste of all that data!

Lookering for a better way

The trick here is going to be using derived tables to do a pretty nasty join. Remember, a derived table is not designed to actually give you an answer to a business question; instead, it transforms the underlying data, so that when users start exploring, they can get a more meaningful result.

Working backwards, I want my users to be able to say something like “For users acquired from paid search vs. organic traffic, what percent were still active after 3 months, and have those percentages been getting better or worse over time?” To do this, we need the underlying data to contain a row for each user-month combination, telling me about each user’s behavior in each month - including months with no user activity at all - and has the ability to join in all relevant user attributes.

To visualize this, let’s imagine that a gaming company has users in an app. Here’s a simple schema illustrating what the raw data might look like:

Table 1: Users

user_id | signup_date | acqsource
--------+-------------+-----------
1       | 1/20/15     | paid_search
2       | 2/15/2015   | paid_search
3       | 3/2/2015    | organic

Table 2 - App Logins

event_id | user_id | event_date
---------+---------+-----------
1        | 1       | 1/21/15
2        | 1       | 3/4/15
3        | 1       | 4/1/15
4        | 2       | 5/1/15
5        | 2       | 2/15/15
6        | 2       | 2/16/15
7        | 2       | 2/17/15
8        | 3       | 3/5/15
9        | 3       | 4/1/15
10       | 3       | 5/1/15

Now, we can use the following LookML pseudocode to transform the data into a derived table which gives us one row per each user-month combination, and tells us about each user’s activity in that month, including months with no activity:

view: monthly_activity {
  derived_table: {
    sql_trigger_value: select current_date ;;
    sortkeys: ["signup_month"]
    distribution: "user_id"
    sql: SELECT

          users.id as user_id
        , date_trunc ('month', users.created_at) as signup_month
        , month_list.purchase_month as purchase_month
        , COALESCE(data.monthly_purchases, 0) as monthly_purchases
        , COALESCE(data.total_purchase_amount, 0) as monthly_spend
        , row_number() over() AS key
      FROM
        users

      LEFT JOIN

        (
          SELECT
            DISTINCT(date_trunc('month', order_items.created_at)) as purchase_month
          FROM order_items
        ) as month_list
      ON month_list.purchase_month >= date_trunc ('month', users.created_at)  -- your dialect will vary

      LEFT JOIN

        (
          SELECT
                o.user_id
              , date_trunc('month', o.created_at) as purchase_month
              , COUNT(distinct o.id) AS monthly_purchases
              , sum(o.sale_price) AS total_purchase_amount

          FROM order_items o
          GROUP BY 1,2
        ) as data
      ON data.purchase_month = month_list.purchase_month
      AND data.user_id = users.id
       ;;
  }

  dimension: user_id {
    type: number
    sql: ${TABLE}.user_id ;;
  }

  dimension_group: signup {
    type: time
    timeframes: [month]
    sql: ${TABLE}.signup_month ;;
  }

  dimension_group: purchase {
    type: time
    timeframes: [month]
    sql: ${TABLE}.purchase_month ;;
  }

  dimension: months_since_signup {
    type: number
    sql: datediff('month', ${TABLE}.signup_month, ${TABLE}.purchase_month) ;;
  }

  dimension: monthly_purchases {
    type: number
    sql: ${TABLE}.monthly_purchases ;;
  }

  dimension: monthly_spend {
    type: number
    sql: ${TABLE}.monthly_spend ;;
  }

  measure: total_users {
    type: count_distinct
    sql: ${user_id} ;;
    drill_fields: [users.id, users.age, users.name, user_order_facts.lifetime_orders]
  }

  measure: total_active_users {
    type: count_distinct
    sql: ${user_id} ;;
    drill_fields: [users.id, users.age, users.name, user_order_facts.lifetime_orders]

    filters: {
      field: monthly_purchases
      value: ">0"
    }
  }

  measure: percent_of_cohort_active {
    type: number
    value_format_name: percent_1
    sql: 1.0 * ${total_active_users} / nullif(${total_users},0) ;;
    drill_fields: [user_id, monthly_purchases, total_amount_spent]
  }

  measure: total_amount_spent {
    type: sum
    value_format_name: usd
    sql: ${monthly_spend} ;;
    drill_fields: [detail*]
  }

  measure: spend_per_user {
    type: number
    value_format_name: usd
    sql: ${total_amount_spent} / nullif(${total_users},0) ;;
    drill_fields: [user_id, monthly_purchases, total_amount_spent]
  }

  measure: spend_per_active_user {
    type: number
    value_format_name: usd
    sql: ${total_amount_spent} / nullif(${total_active_users},0) ;;
    drill_fields: [user_id, total_amount_spent]
  }

  dimension: key {
    type: number
    primary_key: yes
    hidden: yes
    sql: ${TABLE}.key ;;
  }

  set: detail {
    fields: [user_id, signup_month, monthly_purchases, monthly_spend]
  }
}

The resulting table would show:

user_id | first_active_month | activity_month | monthly_visits
--------+--------------------+----------------+---------------
1       | 1/15               | 1/15           | 1
1       | 1/15               | 2/15           | 0
1       | 1/15               | 3/15           | 1
1       | 1/15               | 4/15           | 1
1       | 1/15               | 5/15           | 1
2       | 2/15               | 2/15           | 3
2       | 2/15               | 3/15           | 0
2       | 2/15               | 4/15           | 0
2       | 2/15               | 5/15           | 0
3       | 3/15               | 3/15           | 1
3       | 3/15               | 3/15           | 1
3       | 3/15               | 3/15           | 1

Our last step is to expose this derived table as an explore which is joined to our users table:

explore: user_monthly_activity {
    join: users
    sql_on: ${user_monthly_activity.user_id} = ${users.id}
    relationship: many_to_one
}

Now, when our users select activity_month, signup_month, and percent_of_cohort_still_active, they’ll see percent user retention by cohort all from a single query. Then, they can filter or pivot by user acquisition source, drill into specific users, and find trends that might actually impact how their marketing team chooses to spend user acquisition dollars:

Explore Data in Full Screen

Isn’t that better?

11 Likes

There are lots of extension to this model as well. One view that lots of folks like with cohorts is something like cumulative spend:

By simply swapping our measure from percent_active to average_spend, we have cohorted spend data. This quick PDT using inequality joins allows us to do cumulative spend:

- view: cohort_cumulative
  derived_table:
    sql: |
      SELECT 
        utm1.user_id as user_id
        , utm1.signup_month
        , utm1.activity_month as activity_month
        , SUM(utm2.monthly_orders) AS cumulative_monthly_orders
        , SUM(utm2.monthly_items) as cumulative_monthly_items
        , SUM(utm2.monthly_spend) as cumulative_monthly_spend
      FROM ${cohort.SQL_TABLE_NAME} as utm1
      LEFT JOIN ${cohort.SQL_TABLE_NAME} as utm2
      ON STR_TO_DATE(CONCAT(utm2.activity_month, "-01"), "%Y-%m-%d") 
            <= STR_TO_DATE(CONCAT(utm1.activity_month, "-01"), "%Y-%m-%d")
            AND utm1.user_id = utm2.user_id
      GROUP BY 1,2,3
      ORDER BY 1,2,3
    indexes: [user_id]
    sql_trigger_value: SELECT CURRENT_DATE()
3 Likes

Is there something missing from this discussion about how to pivot on "months since signup’? I am guessing you would need another dimension that would calculate the month delta between the signup month and activity month. Or is there a way to compute this on the fly?

@Oliver That’s exactly right - it would look something like this:

  - dimension: months_since_signup
    type: number
    sql: datediff('month',${users.created_raw},${order_items.created_raw})

By the way - it’s worth noting that a slightly simpler version of this analysis can be done without the complex PDT pattern I described above. The pattern demonstrated here is useful when you have users who may belong to a cohort but never be considered ‘active’, which requires the independent calculations of cohort size and active user counts.

If, however, you only want to include users who were active in the cohort, the following is a bit simpler: Cohort Exploration Normalized by Cohort Size (no modeling required)

2 Likes

Is it possible to enrich the display of a retention cohort chart with the size of the cohort?

Hey @ross - there’s not a super easy way to do that at the moment, but stay tuned for some upcoming chart improvements which will offer more flexibility for things like this :slight_smile:

Glad to hear it. Period-to-period changes in cohort composition is generally a major driver of changes to performance, so being able to display that information in a cohort display is quite important.

One other question! When doing this in pure SQL, it would be prudent (for performance purposes) to perform the “fan join” after compressing on the relevant dimensions, rather than before. For example, if I’m interested in calculating retention for daily cohorts over weekly intervals by platform, the query would be much more performative if I first grouped by registration date, activity week, and platform, followed by ‘fanning’ the results by those dimensions to ensure that there is a record for each relevant combination of characteristics.

Is there a solution of this sort in Looker, or is it necessary to fan the data out at the atomic level?

Yup, you’re right - I’ve actually been meaning to replace the SQL here with something a bit more efficient and clean. Performance isn’t a huge issue here, since the big nasty query can happen overnight, but still a good practice

1 Like

@savage Thanks for this writeup! I was messing around with this and ran into an error caused by a little typo that I thought I’d mention. In your final ‘Select’ statement, you call a column “monthy_events” and then refer to it later as “monthly_events”. Notice the missing “L” in the first reference. Keep up the awesome work, I <3 Looker.

1 Like

Any update on this?

1 Like

I’d also consider trimming off any time periods with incomplete measures (or at least giving users a toggle filter so they can do so). The “hooks” at the ends of the various trend lines can be misleading.

Hi - I recreated this but am noticing in my version the nulls show up on the graph rather than the lines ending on the last active month. I’ve tried doing a filter on nulls but that didn’t work. What can I do to make it so the lines cut off like in the example?

There’s an option in the visualization settings that you can un-tick called “Plot Null Values”. You can see it in the screenshot here: https://docs.looker.com/exploring-data/visualizing-query-results/line-options#plot_menu_options

Turning that off will work!