Bucketing in Looker

#1

Bucketing can be very useful for creating custom grouping dimensions in Looker.

There are three ways to create buckets in Looker: using the tier dimension type, using sql_case, and using a SQL CASE WHEN statement.

Using tier for Bucketing

To create integer buckets, we can simply define a type: tier dimension:

- dimension: users_lifetime_orders_tier
  type: tier
  tiers: [0,1,2,5,10]
  sql: ${users_lifetime_orders}

You can use the style parameter to style how your tier appears when exploring. The three options for style are: integer, interval, and relational. For example:

- dimension: age_tier
  type: tier
  tiers: [0,10,20,30,40,50,60,70,80]
  style: integer
  sql: ${age} 

These three styles will display like so:

Find out more about Looker tiers in our documentation here.

##Using sql_case for Bucketing

sql_case (for Old LookML) or case can be used to create custom-named buckets with custom sorting. It is recommended for a fixed set of buckets, because it makes filtering and visualizations easier in Looker.

To create buckets with sql_case, we simply define a buckets dimension like so:

Old LookML

- dimension: order_amount_bucket
  sql_case:
    "Small": ${order_amount} <= 50
    "Medium": ${order_amount} > 50 AND ${order_amount} <= 150
    "Large": ${order_amount} > 150
    "Unknown": true

New LookML

dimension: order_amount_bucket {
    case: {
        when: {
            sql: ${order_amount} <= 50;;
            label: "Small"
        }
        when: {
            sql:  ${order_amount} > 50 AND ${order_amount} <= 150;;
            label: "Medium"
        }
        when: {
            sql: ${order_amount} > 150;;
            label: "Large"
        }
        else:"Unknown"
        }
    }

sql_case will sort in the order the buckets are defined. This would look something like:

If you would like to sort alphanumerically, add alpha_sort: true to the dimension, like so:

- dimension: order_amount_bucket
  alpha_sort: true
  sql_case:
    "Small": ${order_amount} <= 50
    "Medium": ${order_amount} > 50 AND ${order_amount} <= 150
    "Large": ${order_amount} > 150
    "Unknown": true

Read more about sql_case in our documentation here.

##Using CASE WHEN for Bucketing

A CASE WHEN statement is recommended for more complex bucketing, or for a complex ELSE statement.

For example, you may want to use different bucketing methods depending on the destination of an order. A CASE WHEN statement could be used to create a compound bucket dimension, where the THEN statement returns dimensions rather than strings:

- dimension: compound_buckets
  sql: |
    CASE
    WHEN ${orders.destination} = 'US' THEN ${us_buckets}
    WHEN ${orders.destination} = 'CA' THEN ${canada_buckets}
    ELSE ${intl_buckets} END

Note the use of a pipe | after sql: in the example above. This allows you to use multiple rows of SQL.

6 Likes

Histograms?
(Caitlin) #2

We love type: tier for filtering, but we can’t figure out what to do when the value of a tier is NULL. We’d like to apply a label in that case which is not the default ‘Undefined’ label. Any suggestions?

0 Likes

#3

Hey Caitlin - in order to create custom buckets, I would recommend creating a dimension using sql_case. With sql_case you can create custom buckets that will sort in the order they’re declared and have a filter dropdown just like with dimensions of type: tier. We have more about using sql_case here.

0 Likes

(Caitlin) #4

Thanks Lindsey, we actually do use case statements in some scenarios but we prefer the type: tier with the relational format for filtering. Good to confirm that we can’t actually use that to cover all of our use cases.

0 Likes

(Akhill Chopra) #5

I think I had a similar use case @caitiewrites; @ian pointed me to an elegant solution: COALESCE the sql in the tier dimension (assuming you want the currently-‘Undefined’ cases to be tiered as ‘0’).

For example, in a customers explore where I’ve joined a customer_order_facts persistent derived table I was getting ‘Undefined’ for customers who had no orders with the following:

  - dimension: lifetime_orders_tier
    type: tier
    tiers: [0,1,2,3,4,5,10,15,25]
    style: integer
    sql: ${lifetime_orders}

This was owed to customers who had no orders and were not represented by a row in my derived facts table - thus, NULL values or ‘Undefined’ bucket.

Replacing that dimension with the following tiers those customers in the ‘0’ bucket.

  - dimension: lifetime_orders_tier
    type: tier
    tiers: [0,1,2,3,4,5,10,15,25]
    style: integer
    sql: COALESCE(${lifetime_orders},0)

Hope that helps! Let me know if the above isn’t clear as I just went through our LookML applying this update and it’s working as needed (e.g. no more ‘Undefined’).

2 Likes

#6

A nice extension of this feature would be to be able to place bounds on a number, with everything within the bounds evaluating to its own value, and everything outside of the bounds evaluating to a grouped field, such as “> 15” or “< -10”. This can be done with SQL CASE statements, but that requires casting everything to a TEXT field, which can interfere with sorting, so there’s definitely an opportunity for Looker here.

0 Likes

(Max Corbin) #7

Hey Ross, you can use the sql_case for this, rather than putting the CASE statements in the sql parameter. This will actually index the sort for that dimension on the order that you defined the labels, rather than sorting alphabetically.

For example, in this field definition, when the status field is sorted in an Explore, the order returned will be pending, complete, returned, and unknown in that order, rather than the alphabetical sort.

- dimension: status
  sql_case:
    pending: ${TABLE}.status = 0
    complete: ${TABLE}.status = 1
    returned: ${TABLE}.status = 2
    else: unknown                  # Option 1 for an “unknown” overflow bucket

Let me know if this works for your use case!

0 Likes

#8

Thanks for the suggestion! The specific situation I was thinking about dealt with a data set where I would want the values to resolve to themselves when within the bounds. For example:

  • Value is less than -15 : “< 15”
  • Value is between -15 and 30: ${value}
  • Value is greater than 30: “> 30”

This sort of solution didn’t seem to be feasible to me, but I might be mistaken.

0 Likes

(Mike Ritchie) #9

Feature suggestion: A function to auto generate tiers, something like Python’s range(0, 200, 10)

1 Like

(Chris Seymour) #10

Thanks for the suggestion; I’ve passed it along to our product team.

0 Likes

(Ben Corwin) #11

Instead of hard-coding the buckets, is there a way to do n-tiles? For example, have the tiers automatically split the values into deciles, quartiles, etc.?

0 Likes

(lloyd tabb) #12

In order to do this, you will need to build a derived table against and join it back in.

view:user_age_quartile{
  derived_table: {
    sql: SELECT user_id, NTILE(4) OVER (ORDER BY age) as quartile FROM users ;;
  }
  dimension: user_id {}
  dimension: quartile {type: number}
}

2 Likes

(Swapnil Pimple) #13

Hello,I have successfully created bucket and in my data each bucket having more than 1000 values but I want to show top 10 values from each bucket is it possible ?

0 Likes

(sami.rubenfeld) #14

If you wanted to created a bar chart or bar h visualization, you could select the bucket (or quartile, if you follow Lloyd’s example with NTILE( )) in the filter of the explore, set the desired value, and then a dimension and the associated counts. You would set Row Limit to 10 and then the top 10 values of that bucket would render.

Here is an example of how it would look in the visualization. In my case, I created a derived table with NTILE( ) to create the table :

0 Likes

(Bhavna) #15

I tried bucketing logic using OR but AND in above example and its not working for me as expected. All I want to create bucket is value in one dimension (in db column) to be roll up. Like value " name1, name2 and name3 should be bucked as team1. name4 and name 5 to be bucketed as team2.

0 Likes

(will.adams) #16

Hey @bshukla, this should be possible with the case: parameter on the dimension like so:

case: {
        when: {
            sql: ${TABLE}.name in ('name1', 'name2', 'name3') ;;
            label: "team1"
        }
        when: {
            sql:  ${TABLE}.name in ('name4', 'name5') ;;
            label: "team2"
        }
        else: "Unknown Team"
        }
    }

If this is not working for you, it may have to do with the way your DB handles the in operator, but regardless, you can just use raw SQL as well:

dimension: the_dimension {
  type: string
  sql: CASE WHEN
        ${TABLE} = 'name1' OR ${TABLE}.name = 'name2' OR ${TABLE}.name = 'name3'
        THEN 'Team 1'
        WHEN ${TABLE} = 'name4' OR ${TABLE}.name = 'name5'
        THEN 'Team 2'
         ELSE 'Unknown Team'
         END ;;
}

Hope this helps :slight_smile:

0 Likes

(Bhavna) #17

Hello Will
Second option as I mentioned with OR operator already tried and it was not working correct for me.
I will try first option with ‘in’ parameter and let you know!

Thank you much for your prompt response on this
Best
-Bhavna

0 Likes