Using Looker with partitioned columns

hadoop
(Todd Nemet) #1

This article contains some tips and tricks for using Looker with partitioned columns. Looker is working to improve the way it works with partitioned columns, but in the meantime here are some best practices and workarounds.

Note: As of version 3.50, Looker contains the ability to directly reference the start_date and end_date in templated filters. See this discourse article for an example that shows how to use this feature to model a date-partitioned table. (Search for “Presto example”)

Background

Hive documentation on partitioned tables

A common design pattern in Hadoop tables is to use one or more partitioned columns. This creates “virtual columns” that are actually directories, one for each unique value.

The advantage of using partitioned columns is that it will greatly speed up searching a table because when encountering a predicate containing WHERE dt='20151209' or WHERE state='OH' the query engine only needs to search the database files in one directory and can ignore the rest of the table.

Commonly partitioned fields are dates (or the year, month, and day separately) and columns that have a finite set of unique values that are usually filtered by or grouped by, such as country, state, marketing channel, device type, etc.

Date-partitioned columns

The most common partitioned column is the date (usually named dt), though there is no standardized format. I have seen YYYYMMDD, YYYY-MM-DD, and YYYY MM DDseparately. One customer who gets TB of data per day partitions by the hour with the format YYYYMMDDHH.

Using the standard Looker timeframes doesn’t work well with partitioned columns because of the way that Looker calculates the date ranges in the predicate. For example, filtering a date on the range 2015-11-10 and 2015-12-10 results in this SQL (on Hive, for example), which causes all rows in the table to be scanned and defeating the purpose of partitioned columns:

WHERE
(users.dt) >= (TIMESTAMP('2015-11-10 00:00:00.000000000'))
  AND 
(users.dt) < (TIMESTAMP('2015-12-10 00:00:00.000000000'))

Suggested workarounds:

  • If the format can be compared as an integer, like YYYYMMDD, then define the dimension as type int, like this:
  - dimension: yyyymmdd_part
    type: int
    sql: ${TABLE}.yyyymmdd_at

The filter will show up in Looker’s explore window like this:

  • Use a derived table and {% parameter %} to define the start and end dates, as shown below. This has the advantage of allowing the definition of default values.
- view: users_partitioned_date
  derived_table:
    sql: |
      SELECT * FROM USERS
      WHERE
      users.yyyymmdd_at > {% parameter start_date %}
      AND
      users.yyyymmdd_at < {% parameter end_date %}

  fields:
  - filter: start_date
    label: 'FILTER Start Date'
    default_value: 20100101
    suggestions: [20150101, 20150901]

  - filter: end_date
    label: 'FILTER End Date'
    default_value: 20161231
    suggestions: [20151231, 20160331]

The filters will show up in Looker’s explore window like this:

For non-integer date partitions, this is the most flexible workaround.

Auto suggestions when filtering

When filtering on a dimension in Looker, typing in the box can generate SELECT DISTINCT ... queries for that column. On a cluster this isn’t always desired functionality. Also for partitioned columns there are usually a pre-defined set of values that never change and can be defined in LookML.

Disabling suggestions for a view

To disable auto-suggestions for all dimensions in a view, at the top of the view file include suggestions: false like this:

- view: logs
  suggestions: false

  fields:
  ...

This will override the suggestion settings for all dimensions in the view.

Disabling suggestions for a dimension

To disable suggestions for a dimension, in particular one with many distinct values, add suggestions: false to the dimension like this:

  - dimension: dt
    suggestions: false
    type: int
    sql: ${TABLE}.dt

Hard coding suggested values

For partitioned fields with a pre-defined set of values, use suggestions: [...] to hard code values into LookML like this:

  - dimension: device_type
    suggestions: [mobile, desktop, app]
    sql: ${TABLE}.device_type

See this article for more detail.

Other techniques

  1. suggest_dimension and suggest_explore can be used to force the look up on other tables.

  2. suggest_persist_for can change the default time that the value lookups are cached (6 hours by default)

Additional documentation

See this article for more detail.

Docs on suggestion-related keywords

0 Likes