[Analytic Block] - Cohorts as Filters using Templated Filters

cohort
open
low_priority
watch
filter

(Aron Clymer) #1

Hey all you good lookers,

This is a nice LookML design pattern that I hope many of you can leverage!

Anyone doing a deep dive analysis on customer behavior will want to easily look at a cohort and see what kinds of interesting insights can be discovered about that cohort. For instance, one simple but powerful question like “Of the customers that purchased Product A, what other products did they purchase?” can help sales target upsell or cross-sell opportunities.

In the world of SQL, this kind of question requires a sub-query to define the cohort (customers who purchased product A) and a main query to answer a question (what other products did they purchase). In the spirit of creating a friction-free, self-service analytics environment, the question is: how can we give end users the capability of dynamically creating ad-hoc cohorts at run-time and then asking other questions about the behavior of those cohorts? All without having to develop any LookML or write custom queries!

I’ve seen this functionality in other products for which the schema is already known and the scope of the tool is more narrow, like Amplitude, but Looker is more open ended and thus doesn’t (yet) provide this feature out of the box.

In the meantime, I realized that templated filters combined with a derived view could solve the problem in many situations.

The idea is to create a derived view that returns a particular “grain” (level of detail) of data that represents a cohort. The view should expose templated filters so end users can manipulate the WHERE clause and define the cohort at query-time. The view should then be INNER joined to the appropriate Explore. The inner join effectively turns the cohort results into a filter on the main query.

The limitations of this approach are that the LookML developer must know in advance 1) the “grain” of the cohorts that end users want to use (i.e. a cohort of users, or customers, or accounts, or products, etc.), and 2) the filter fields to expose for defining the cohort.

Let’s walk through an example that I built in my personal Looker instance with a fabricated e-commerce dataset. Some of you will be familiar with this data model. Using this dataset, I wanted to be able to build cohorts of users and then apply the cohort as a filter. I wanted to answer quesions such as, “Of all of the users that purchased “Hanes” brand items, what other brands did these users purchase?”

Step 1: Create a derived view to define the cohort

Create a derived view that returns a distinct list of user IDs (the cohort “grain”). Include templated filter fields for the end users to manipulate the WHERE clause and thus define the cohort at runtime. In my example I’ve created 3 templated filters: sku, item_name, and brand_name. Also notice that I only define a single dimension, user_id, which I’ve hidden because the end user does not need to see anything from this derived view - it is ultimately only used as a filter. I’ve also used “suggestions” to help Looker auto-complete values for convenience.

view: test_derived_user_cohort {

  derived_table: {
    sql: SELECT users.id  AS user_id
          FROM order_items
          LEFT JOIN orders ON order_items.order_id = orders.id
          LEFT JOIN inventory_items ON order_items.inventory_item_id = inventory_items.id
          LEFT JOIN products ON inventory_items.product_id = products.id
          LEFT JOIN users ON orders.user_id = users.id
          WHERE ({% condition cohort_filter_item_name %} products.item_name {% endcondition %})
            AND ({% condition cohort_filter_sku %} products.sku {% endcondition %} )
            AND ({% condition cohort_filter_brand_name %} products.brand {% endcondition %} )
          GROUP BY 1;;
  }

  dimension: user_id {
    hidden: yes
    description: "Unique ID for each user that has ordered"
    type: number
    sql: ${TABLE}.user_id ;;
  }

  filter: cohort_filter_sku {
    description: "SKU to filter cohort - filter on all users that purchased this sku"
    type: string
    suggest_explore: products
    suggest_dimension: products.sku
  }

  filter: cohort_filter_item_name {
    description: "Item Name to filter cohort - filter on all users that purchased this item"
    type: string
    suggest_explore: products
    suggest_dimension: products.item_name
  }

  filter: cohort_filter_brand_name {
    description: "Brand Name to filter cohort - filter on all users that purchased this brand"
    type: string
    suggest_explore: products
    suggest_dimension: products.brand_name
  }

}

Step 2: Join to an Explore

Join the cohort derived view to the appropriate explore using an inner join. In the example below, I used the “order_items” explore that has already been created in “thelook” model. I created another join called “test_derived_user_cohort”. Notice I named the view_label starting with an “X” so it sorts last in the field picker, and I gave it an intuitive name of “User Cohort Filters”

explore: order_items {
  join: orders {
    type: left_outer
    sql_on: ${order_items.order_id} = ${orders.id} ;;
    relationship: many_to_one
  }

  join: products {
    type: left_outer
    sql_on: ${inventory_items.product_id} = ${products.id} ;;
    relationship: many_to_one
  }

  join: users {
    type: left_outer
    sql_on: ${orders.user_id} = ${users.id} ;;
    relationship: many_to_one
  }

  join: users_orders_facts {
    type: left_outer
    sql_on: ${users.id} = ${users_orders_facts.user_id} ;;
    relationship: one_to_one
  }

  join: inventory_items {
    type: left_outer
    sql_on: ${order_items.inventory_item_id} = ${inventory_items.id} ;;
    relationship: many_to_one
    # don't import all of the fields, just the fields in this set.
    fields: [inventory_items.export*]
  }

  join: test_derived_user_cohort {
    view_label: "XX - User Cohort Filters"
    type: inner
    relationship: many_to_one
    sql_on: ${users.id} = ${test_derived_user_cohort.user_id} ;;
  }
}

End User Experience:

So, end users will see three filter fields for defining a user cohort (screen shot below).

They can now answer a business question like “Of all of the users that purchased items of brand “Hanes”, what at the top other brands they purchased?” They simply select the cohort brand filter, then select the regular report fields needed (Brand Name, Count [of products[), sort by count, and run:

Looking at the SQL tab, you can see that that the user cohort is now a sub-query inner joined to the driving SELECT statement. The templated filters define which user IDs are included in the cohort, but don’t affect main query. Also note that Looker inserts "WHERE 1=1 in the places where no defined template filter is actually used by the end user, which can make the SQL a bit harder to read.

-- generate derived table test_derived_user_cohort
-- Building thelook::test_derived_user_cohort in dev mode on instance 77939eb8292180e55264a950fb751649
CREATE TEMPORARY TABLE partner_scratch.test_derived_user_cohort SELECT
            users.id  AS user_id
          FROM order_items
          LEFT JOIN orders ON order_items.order_id = orders.id
          LEFT JOIN inventory_items ON order_items.inventory_item_id = inventory_items.id
          LEFT JOIN products ON inventory_items.product_id = products.id
          LEFT JOIN users ON orders.user_id = users.id

          WHERE (1=1 -- no filter on 'test_derived_user_cohort.cohort_filter_item_name' 
)
            AND (1=1 -- no filter on 'test_derived_user_cohort.cohort_filter_sku' 
 )
            AND ((products.brand = 'Hanes') )
          GROUP BY 1
-- finished test_derived_user_cohort => partner_scratch.test_derived_user_cohort
SELECT 
	products.brand  AS `products.brand_name`,
	COUNT(DISTINCT products.id ) AS `products.count`
FROM order_items
LEFT JOIN orders ON order_items.order_id = orders.id 
LEFT JOIN inventory_items ON order_items.inventory_item_id = inventory_items.id 
LEFT JOIN products ON inventory_items.product_id = products.id 
LEFT JOIN users ON orders.user_id = users.id 
INNER JOIN partner_scratch.test_derived_user_cohort AS test_derived_user_cohort ON users.id = test_derived_user_cohort.user_id 

GROUP BY 1
ORDER BY COUNT(DISTINCT products.id ) DESC
LIMIT 500

Voila!

Summary

Problem: How can end users analyze patterns of customer cohorts to answer questions like “Of the customers that purchased Product A, what other products did they purchase?” by dynamically defining a cohort and then applying it as a filter on other queries, all without developing LookML or any custom queries?

Solution: Create a derived view that returns a particular “grain” (level of detail) of data that represents a cohort. Expose templated filters to end users so they can target a cohort at run-time. Inner join the derived view to an explore to effectively turn the cohort results into a filter on the main query.

Limitations: The “grain” of the cohorts, and the templated filters, must be pre-defined in the derived view. There is also some end user training required.


(Dillon Morrison) #2

This is awesome, Aron! Super flexible. Great approach to conduct second-level analysis once you find an interesting cohort.


(Fabiana Tamburrini) #3

Great article! Thanks a lot


(Aron Clymer) #4

Update: there is currently a Looker bug preventing the use of YesNo templated filters in a subquery like this. The bug inserts the filtering logic both inside and outside the subquery, whereas we only want it applied to the subquery. As a workaround, I was able to use multiple parameters to insert the appropriate filtering logic.

Example:

Step 1: add a parameter for each boolean value

parameter: yes_flag {
type: unquoted
suggestable: no
default_value: "Yes"
allowed_value: {
  label: "Yes"
  value: "Yes"
  }
}

parameter: no_flag {
type: unquoted
suggestable: no
default_value: "No"
allowed_value: {
   label: "No"
   value: "No"
   }
}

Step 2: add logic to the WHERE clause that will insert the appropriate SQL

{% if yes_flag._in_query %}
     AND <appropriate SQL here>
{% endif %}

{% if no_flag._in_query  %}
     AND NOT <appropriate SQL here>
{% endif %}

Note: I originally wanted a single parameter that could take 2 values: Yes (value 1) and No (value 0), but it does not appear possible to extract the value of a parameter within a larger logic liquid block. For instance, this doesn’t cause a syntax error but won’t ever evaluate to true: if yes_flag._value == 1. Thus, I had to rely on the “._in_query” liquid variable instead. If anyone can tell me how to use the liquid “value” variable in a larger logical block, please let me know as I have many other use cases that this ability would solve.


(Max Corbin) #5

@aronclymer That’s a pretty clever workaround! I think that you should be able to use the raw value of the parameter by using the pattern {% parameter <variable_name> %}. I tested this on my end to make sure I had it right and it appears to work. Here’s a portable example that you can paste into any project to test yourself:

view: test {
  derived_table: {
    sql:  SELECT * FROM 
          (SELECT TRUE AS "value"
          UNION
          SELECT FALSE AS "value") AS t
          {% if boolean._in_query %}
          WHERE t.value IS {% parameter boolean %}
          {% endif %} ;;
  }
  
  dimension: value {
    type: string
    sql: ${TABLE}.value ;;
  }
  
  parameter: boolean {
    type: unquoted
    allowed_value: {
      label: "True"
      value: "TRUE"
    }
    allowed_value: {
      label: "False"
      value: "FALSE"
    }
  }
}

explore: test {hidden:yes}

The result is a boolean switch that inserts a filter when it is in the query, and listens to the end user’s choice. You still need to use the _in_query element because, unlike templated filters, parameters will insert null values when they are left out of the query, which can cause some SQL syntax errors.

While it doesn’t work around the yesno templated filter issue, it does at least allow you to use the same parameter for both yes and no values of your workaround. Let me know if you have any questions!


(Ronak Bhatt) #6

Hi all - how is the performance of this block? Could someone share the details of,

  1. Data Volume that cohort analysis looks at
  2. Underlying data platform / database
  3. Typical query time for your setup

thanks, ronak