Using date_start and date_end with date filters

hadoop
lookml
bigquery

(Todd Nemet) #1

Looker version 3.50 adds the ability to refer to the beginning and end dates in a date filter with {% date_start date_filter %} and {% date_end date_filter %} respectively.

Use Cases

Note that this syntax should work with all dialects, but certain dialects have specific use cases.

For BigQuery this allows fine-grained control when working with table wildcard functions like TABLE_DATE_RANGE and TABLE_QUERY, for cases where {% table_date_range prefix date_filter %} isn’t enough.

For Hadoop this allows working with date-partitioned columns, regardless of the column’s type (string, date) or format (such as YYYY-MM-DD).

Usage Notes

When there is no value specified for date_filter, both {% date_start date_filter %} and {% date_end date_filter %} will evaluate to NULL.

In the case of an open-ended date_filter (like before 2016-01-01 or after 2016-01-01), one of {% date_start date_filter %} or {% date_end date_filter %} will be NULL.

To make sure that the above two cases never result in invalid SQL, it’s necessary to use IFNULL or COALESCE in the LookML. (See the examples below.)
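The NULL-handling rule can be sketched in Python. This is a toy model of the Liquid substitution, not Looker’s actual implementation; the function names are illustrative:

```python
from datetime import date

def date_start(date_filter):
    """Toy stand-in for {% date_start date_filter %}: returns None (NULL)
    when the filter has no lower bound or no value at all."""
    return date_filter.get("start")

def date_end(date_filter):
    """Toy stand-in for {% date_end date_filter %}."""
    return date_filter.get("end")

def coalesce(value, default):
    """What IFNULL/COALESCE does in the SQL: substitute a default for NULL."""
    return value if value is not None else default

# Open-ended filter "after 2016-01-01": the end macro yields NULL,
# so the SQL must supply a fallback to stay valid.
f = {"start": date(2016, 1, 1), "end": None}
lower = coalesce(date_start(f), date(2015, 1, 1))
upper = coalesce(date_end(f), date.today())
```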

Examples

Monthly Partitioned Columns (in BigQuery)

In some BigQuery datasets, tables are organized by month and the table id has the year/month combination as a suffix. An example of this is in the public dataset [fh-bigquery:wikipedia], which has tables named pagecounts_201601, pagecounts_201602, pagecounts_201603, …

Example 1: LookML that depends on always_filter

The derived table below uses TABLE_QUERY([dataset], [expr]) to get the right set of tables to query. Some notes on the code in the expression:

  • length(table_id) = 17 makes sure that it ignores the other tables with names like pagecounts_201407_en_top64k
  • STRFTIME_UTC_USEC({% date_start date_filter %},'%Y%m') will output just the YYYYmm part of the beginning date
  • table_id refers to the name of the table in the dataset
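As a sketch of what the TABLE_QUERY expression computes, here is the table-name arithmetic in Python (strftime stands in for STRFTIME_UTC_USEC; the function name is illustrative):

```python
from datetime import date

def pagecounts_bounds(start, end):
    """Reduce each filter bound to its YYYYmm part (what
    STRFTIME_UTC_USEC(..., '%Y%m') does) and prepend the table prefix,
    so table ids can be compared lexically against the bounds."""
    return ("pagecounts_" + start.strftime("%Y%m"),
            "pagecounts_" + end.strftime("%Y%m"))

lo, hi = pagecounts_bounds(date(2016, 1, 15), date(2016, 3, 2))
# A table id like 'pagecounts_201602' (length 17) falls inside
# the lo..hi string range, so it is included in the scan.
```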
Old LookML
- view: pagecounts
  derived_table:
    sql: |
      SELECT * FROM
      TABLE_QUERY([fh-bigquery:wikipedia], 
                  "length(table_id) = 17 AND 
                  table_id >= CONCAT( 'pagecounts_' , STRFTIME_UTC_USEC({% date_start date_filter %},'%Y%m') ) AND 
                  table_id <= CONCAT( 'pagecounts_' , STRFTIME_UTC_USEC({% date_end date_filter %},'%Y%m') )"
                 )

  fields:
  - filter: date_filter
    type: date
New LookML
view: pagecounts {
  derived_table: {
    sql: SELECT * FROM
      TABLE_QUERY([fh-bigquery:wikipedia],
                  "length(table_id) = 17 AND
                  table_id >= CONCAT( 'pagecounts_' , STRFTIME_UTC_USEC({% date_start date_filter %},'%Y%m') ) AND
                  table_id <= CONCAT( 'pagecounts_' , STRFTIME_UTC_USEC({% date_end date_filter %},'%Y%m') )"
                 )
       ;;
  }

  filter: date_filter {
    type: date
  }
}

This query will fail if there is no date filter because NULL will be substituted for the date_filter parts. To get around this, add an always_filter to the explore like this:

Old LookML
 - explore: pagecounts
   always_filter:
     date_filter: 2 months ago
New LookML
explore: pagecounts {
  always_filter: {
    filters: {
      field: date_filter
      value: "2 months ago"
    }
  }
}

Note that this will still fail for filters like before 2016-01-01 because {% date_start date_filter %} will evaluate to NULL.

Example 2: LookML that does not depend on always_filter

It is also possible to use COALESCE or IFNULL to encode a default set of tables to query over. In the example below, the past two months are used.

The lower bound: COALESCE({% date_start date_filter %},DATE_ADD(CURRENT_TIMESTAMP(),-2,'MONTH'))
The upper bound: COALESCE({% date_end date_filter %},CURRENT_TIMESTAMP())

Old LookML
- view: pagecounts
  derived_table:
    sql: |
      SELECT * FROM
      TABLE_QUERY([fh-bigquery:wikipedia], 
                  "length(table_id) = 17 AND 
                  table_id >= CONCAT( 'pagecounts_' , STRFTIME_UTC_USEC(COALESCE({% date_start date_filter %},DATE_ADD(CURRENT_TIMESTAMP(),-2,'MONTH')),'%Y%m') ) AND 
                  table_id <= CONCAT( 'pagecounts_' , STRFTIME_UTC_USEC(COALESCE({% date_end date_filter %},CURRENT_TIMESTAMP()),'%Y%m') )";
                 )

  fields:
  - filter: date_filter
    type: date
New LookML
view: pagecounts {
  derived_table: {
    sql: SELECT * FROM
      TABLE_QUERY([fh-bigquery:wikipedia],
                  "length(table_id) = 17 AND
                  table_id >= CONCAT( 'pagecounts_' , STRFTIME_UTC_USEC(COALESCE({% date_start date_filter %},DATE_ADD(CURRENT_TIMESTAMP(),-2,'MONTH')),'%Y%m') ) AND
                  table_id <= CONCAT( 'pagecounts_' , STRFTIME_UTC_USEC(COALESCE({% date_end date_filter %},CURRENT_TIMESTAMP()),'%Y%m') )"
                 )
       ;;
  }

  filter: date_filter {
    type: date
  }
}

Log Files are in UTC, When Querying in American Time Zones (in BigQuery)

Several customers have reported issues where log files are stored in UTC, even though they are querying in Eastern or Pacific time zones. This can cause a problem where the log files have already rolled to “tomorrow’s date” in the local time zone of the query, resulting in some missed data.

The solution is to add an extra day to the end date of the date filter, so that log entries written after midnight UTC (but still “today” in the local time zone) are picked up.
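In Python terms, the padding amounts to this (a toy illustration of the DATE_ADD call, not BigQuery code):

```python
from datetime import date, timedelta

def utc_safe_range(start, end):
    """Mirror DATE_ADD({% date_end date_filter %}, 1, "DAY"): pad the
    upper bound by one day so rows already written to "tomorrow's" UTC
    table are still scanned when the viewer queries in an American
    time zone."""
    return start, end + timedelta(days=1)

s, e = utc_safe_range(date(2016, 8, 1), date(2016, 8, 1))
```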

The examples below use the public dataset [githubarchive:day] which has a daily partition of github information.

Example 1: LookML that depends on always_filter

Old LookML
- view: githubarchive
  derived_table:
    sql: |
      SELECT * FROM
      TABLE_DATE_RANGE([githubarchive:day.],
      {% date_start date_filter %},
      DATE_ADD({% date_end date_filter %},1,"DAY")
      )

  fields:
  - filter: date_filter
    type: date
New LookML
view: githubarchive {
  derived_table: {
    sql: SELECT * FROM
      TABLE_DATE_RANGE([githubarchive:day.],
      {% date_start date_filter %},
      DATE_ADD({% date_end date_filter %},1,"DAY")
      )
       ;;
  }

  filter: date_filter {
    type: date
  }
}

Because this SQL will fail if NULL is substituted for the dates, it is necessary to add an always_filter to the explore like this:

Old LookML
 - explore: githubarchive
   always_filter:
     date_filter: 2 days ago
New LookML
explore: githubarchive {
  always_filter: {
    filters: {
      field: date_filter
      value: "2 days ago"
    }
  }
}

Note that this will still fail for filters like before 2016-01-01 because {% date_start date_filter %} will evaluate to NULL.

Example 2: LookML that does not depend on always_filter

In this example, the default date range is encoded in the LookML. Because COALESCE was returning an unknown type, I ultimately had to use IFNULL to make the SQL work.

The lower bound: IFNULL({% date_start date_filter %},CURRENT_DATE())
The upper bound: IFNULL({% date_end date_filter %},CURRENT_DATE()) + 1 day

Old LookML
- view: githubarchive
  derived_table:
    sql: |
      SELECT * FROM
      TABLE_DATE_RANGE([githubarchive:day.],
      IFNULL({% date_start date_filter %},CURRENT_TIMESTAMP()),
      DATE_ADD(IFNULL({% date_end date_filter %},CURRENT_TIMESTAMP()),1,"DAY")
      )

  fields:
  - filter: date_filter
    type: date
New LookML
view: githubarchive {
  derived_table: {
    sql: SELECT * FROM
      TABLE_DATE_RANGE([githubarchive:day.],
      IFNULL({% date_start date_filter %},CURRENT_TIMESTAMP()),
      DATE_ADD(IFNULL({% date_end date_filter %},CURRENT_TIMESTAMP()),1,"DAY")
      )
       ;;
  }

  filter: date_filter {
    type: date
  }
}

Trailing n-day window functions (in BigQuery)

When performing certain analyses, calculations are expected in some aggregate form over a historical timeframe. To perform this operation in SQL, one typically implements a window function that reaches back n rows in a table that is unique by date. However, there is a catch-22 when using date-partitioned tables: one must first dictate the set of tables that the query will run against, yet the query needs extra historical tables for the computation.

The solution is to simply allow the start date to be earlier than the dates provided in the date filter. Below is a solution that will reach back an additional week.
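The pattern can be sketched in Python: scan extra history, compute the trailing aggregate, then trim back to the requested range. This is a simplified stand-in for the SQL window function; names like trailing_counts are illustrative:

```python
from datetime import date, timedelta

def trailing_counts(rows, start, end, window_days=7):
    """Toy version of the derived-table pattern: pull window_days of extra
    history before the filter's start date, compute a trailing-window sum
    per day, then trim back to the user's range (the job of the outer
    WHERE {% condition date_filter %} clause)."""
    scan_from = start - timedelta(days=window_days)  # widened table scan
    scanned = {d: v for d, v in rows.items() if scan_from <= d <= end}
    result = {}
    for d in scanned:
        if start <= d <= end:  # trim: only the dates the user asked for
            lo = d - timedelta(days=window_days)
            result[d] = sum(v for dd, v in scanned.items() if lo <= dd <= d)
    return result

# Twenty days of data, one event per day.
rows = {date(2016, 1, 1) + timedelta(days=i): 1 for i in range(20)}
res = trailing_counts(rows, date(2016, 1, 10), date(2016, 1, 12))
# Each returned day sums an 8-day window (7 preceding days + itself).
```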

Old LookML
- view: githubarchive
  derived_table:
    sql: |
      SELECT 
        y._date
      , y.foo
      , y.bar
      FROM (
        SELECT
          _date
        , SUM(foo) OVER (ORDER BY _date RANGE BETWEEN x PRECEDING AND CURRENT ROW) AS foo
        , COUNT(DISTINCT(bar)) OVER (ORDER BY _date RANGE BETWEEN x PRECEDING AND CURRENT ROW) AS bar
        FROM (
          SELECT
            _date, foo, bar
          FROM TABLE_DATE_RANGE([something:something_else.], DATE_ADD(IFNULL({% date_start date_filter %},CURRENT_TIMESTAMP()), -7, "DAY"), IFNULL({% date_end date_filter %},CURRENT_TIMESTAMP()))
        ) x
      ) y
      WHERE {% condition date_filter %} y._date {% endcondition %}


  fields:
  - filter: date_filter
    type: date
New LookML
view: githubarchive {
  derived_table: {
    sql: SELECT
        y._date
      , y.foo
      , y.bar
      FROM (
        SELECT
          _date
        , SUM(foo) OVER (ORDER BY _date RANGE BETWEEN x PRECEDING AND CURRENT ROW) AS foo
        , COUNT(DISTINCT(bar)) OVER (ORDER BY _date RANGE BETWEEN x PRECEDING AND CURRENT ROW) AS bar
        FROM (
          SELECT
            _date, foo, bar
          FROM TABLE_DATE_RANGE([something:something_else.], DATE_ADD(IFNULL({% date_start date_filter %},CURRENT_TIMESTAMP()), -7, "DAY"), IFNULL({% date_end date_filter %},CURRENT_TIMESTAMP()))
        ) x
      ) y
      WHERE {% condition date_filter %} y._date {% endcondition %}
       ;;
  }

  filter: date_filter {
    type: date
  }
}

The outer SELECT statement is needed because it supplies the WHERE constraint that trims the result set back down to the date range the user originally specified in the query.

Table partitioned by date via string with format 'YYYY-MM-DD' (in Presto)

It is a common pattern in Hadoop tables to use partitioned columns to speed up search times for columns that are commonly searched on, especially dates.

The format of the date columns can be arbitrary, though YYYY-MM-DD and YYYYMMDD are the most common. The column’s type can be string or date (or, for the YYYYMMDD format, even a number).

In this example, a Hive table table_part_by_yyyy_mm_dd has a partitioned column, dt, which is a string formatted YYYY-MM-DD, and it is being searched by Presto.

When the generator is first run, the LookML is like this:

Old LookML
- view: table_part_by_yyyy_mm_dd
  sql_table_name: hive.taxi.table_part_by_yyyy_mm_dd
  suggestions: false

  fields:
  - dimension: dt
    type: string
    sql: ${TABLE}.dt
New LookML
view: table_part_by_yyyy_mm_dd {
  sql_table_name: hive.taxi.table_part_by_yyyy_mm_dd ;;
  suggestions: no

  dimension: dt {
    type: string
    sql: ${TABLE}.dt ;;
  }
}

Example 1: LookML that uses a common table expression to filter the table

The first example uses a derived table to filter the table. Some notes on the code in the expression:

  • The output of date_start and date_end is type timestamp
  • date_format( <expr>, '%Y-%m-%d') is used to convert the timestamp to a string and to the right format
  • The coalesce is to handle the case of NULLs if someone types in a filter like before 2010-01-01 or after 2012-12-31.
  • This is Presto-dialect code. Hive uses a different format string (yyyy-MM-dd), and Hive’s date_format can’t take a NULL value, so the coalesce would have to move inside it with some sort of default value.
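The reason the string comparison is safe can be illustrated in Python (strftime stands in for Presto’s date_format; this is a sketch, not dialect code, and the function names are illustrative):

```python
from datetime import datetime

def fmt(ts):
    """Stand-in for Presto's date_format(ts, '%Y-%m-%d'); Hive would use
    the pattern 'yyyy-MM-dd' instead."""
    return ts.strftime("%Y-%m-%d")

def dt_in_lower_bound(dt, start_ts):
    """coalesce(dt >= date_format(start, ...), TRUE): when there is no
    lower bound (the macro evaluated to NULL), nothing is filtered out."""
    return dt >= fmt(start_ts) if start_ts is not None else True

# Zero-padded YYYY-MM-DD strings sort exactly like the dates they encode,
# so comparing the partition string dt against the formatted bound works.
```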
Old LookML
- view: table_part_by_yyyy_mm_dd
#  sql_table_name: hive.taxi.table_part_by_yyyy_mm_dd
  suggestions: false
  derived_table:
    sql: |
      SELECT * FROM hive.taxi.table_part_by_yyyy_mm_dd
      WHERE ( coalesce( dt >= date_format({% date_start date_filter %}, '%Y-%m-%d'), TRUE) )
        AND ( coalesce( dt <= date_format({% date_end date_filter %}, '%Y-%m-%d'), TRUE) )
  
  fields:
  - filter: date_filter
    type: date
    
  - dimension: dt
    type: string
    sql: ${TABLE}.dt
New LookML
view: table_part_by_yyyy_mm_dd {
  #  sql_table_name: hive.taxi.table_part_by_yyyy_mm_dd
  suggestions: no

  derived_table: {
    sql: SELECT * FROM hive.taxi.table_part_by_yyyy_mm_dd
      WHERE ( coalesce( dt >= date_format({% date_start date_filter %}, '%Y-%m-%d'), TRUE) )
        AND ( coalesce( dt <= date_format({% date_end date_filter %}, '%Y-%m-%d'), TRUE) )
       ;;
  }

  filter: date_filter {
    type: date
  }

  dimension: dt {
    type: string
    sql: ${TABLE}.dt ;;
  }
}

Usually partitioned tables take too long for full table scans (and consume too many cluster resources), so it is a good idea to put a default filter on the explore for this view as well:

Old LookML
- explore: table_part_by_yyyy_mm_dd
  always_filter:
    date_filter: '2013-01'
New LookML
explore: table_part_by_yyyy_mm_dd {
  always_filter: {
    filters: {
      field: date_filter
      value: "2013-01"
    }
  }
}

Example 2: LookML that filters directly in the predicate

The second example does the predicate filtering directly on the table without a subquery or common table expression. The notes from Example 1 apply to this as well.

Old LookML
- view: table_part_by_yyyy_mm_dd
  sql_table_name: hive.taxi.table_part_by_yyyy_mm_dd

  fields:
  - filter: date_filter
    type: date
    sql: |
      ( coalesce( ${dt} >= date_format({% date_start date_filter %}, '%Y-%m-%d'), TRUE) )
      AND ( coalesce( ${dt} <= date_format({% date_end date_filter %}, '%Y-%m-%d'), TRUE) )
    
  - dimension: dt
    type: string
    sql: ${TABLE}.dt
New LookML
view: table_part_by_yyyy_mm_dd {
  sql_table_name: hive.taxi.table_part_by_yyyy_mm_dd ;;

  filter: date_filter {
    type: date
    sql: ( coalesce( ${dt} >= date_format({% date_start date_filter %}, '%Y-%m-%d'), TRUE) )
      AND ( coalesce( ${dt} <= date_format({% date_end date_filter %}, '%Y-%m-%d'), TRUE) )
       ;;
  }

  dimension: dt {
    type: string
    sql: ${TABLE}.dt ;;
  }
}



We can validate that the table partitions are actually being used by checking the output of `EXPLAIN` for a query generated by this LookML like this:

  • Output[table_part_by_yyyy_mm_dd.count] => [count:bigint]
    table_part_by_yyyy_mm_dd.count := count
    • TopN[500 by (count DESC_NULLS_LAST)] => [count:bigint]
      • Aggregate(FINAL) => [count:bigint]
        count := "count"("count_4")
        • RemoteExchange[GATHER] => count_4:bigint
          • Aggregate(PARTIAL) => [count_4:bigint]
            count_4 := "count"(*)
            • Filter[(COALESCE(("dt" >= CAST('2013-04-01' AS VARCHAR)), true) AND COALESCE(("dt" <= CAST('2016-08-01' AS VARCHAR)), true))] => [dt:varchar]
              • TableScan[hive:hive:taxi:table_part_by_yyyy_mm_dd, originalConstraint = (COALESCE(("dt" >= CAST('2013-04-01' AS VARCHAR)), true) AND COALESCE(("dt" <= CAST('2016-08-01' AS VARCHAR)), true))] => [dt:varchar]
                LAYOUT: hive
                dt := HiveColumnHandle{clientId=hive, name=dt, hiveType=string, hiveColumnIndex=-1, partitionKey=true}
                :: [[2013-04-01, 2013-12-31]]

The `partitionKey=true`, along with the range of partition keys listed, indicates that the query is only scanning the partitions in that range.

(Scott Schaen) #2

Hey Todd,

What would be the best way with looker+bigquery to query a collection of arbitrary tables?

For instance…

select count(*) from e.event20160101, e.event20160115, e.event20160201

Thanks + really happy to see these new Looker features for BigQuery!


(Todd Nemet) #3

Hey Scott, just saw your question. Two ideas off the top of my head:

a. If the table collection is constant, then just define them as a derived table

Old LookML
- view: githubarchive
  derived_table:
    sql: |
      SELECT * FROM
      e.event20160101, e.event20160115, e.event20160201
New LookML
view: githubarchive {
  derived_table: {
    sql: SELECT * FROM
      e.event20160101, e.event20160115, e.event20160201
       ;;
  }
}

b. If the table collection is arbitrary, it gets a little bit more tricky. You could use parameters to do something similar to below. (Note that I haven't tested this, but you get the gist.)

Old LookML
- view: githubarchive
  derived_table:
    sql: |
      SELECT * FROM
      TABLE_QUERY([project_id.e],"table_id = {% parameter t1 %} OR table_id = {% parameter t2 %}")

  - parameter: t1
    default: 'event20160101'
  - parameter: t2
    default: 'event20160115'
    suggestions: ['othertable', 'anothertable']



You may also want to put in a regex or table length in the TABLE_QUERY argument to make sure that arbitrary tables aren't being queried.
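Such a guard could look like this in Python terms (the `event` + YYYYMMDD naming pattern is an assumption based on the example tables above, not something TABLE_QUERY itself requires):

```python
import re

# Only table ids that look like 'event' + YYYYMMDD are eligible, so
# arbitrary tables in the dataset can't be pulled in via the parameter.
# The pattern is a hypothetical example for this table-naming scheme.
EVENT_TABLE = re.compile(r"^event\d{8}$")

def allowed(table_id):
    """Return True only for table ids matching the expected shape."""
    return bool(EVENT_TABLE.match(table_id))
```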

c. Experiment with how multiple tables entered into a parameter field work, and get TABLE_QUERY to work with that in a way that you aren't limited to a specific number of tables.

Hope that's helpful. Let me know if you have any questions.

(Josh Siegel) #4

Hey guys - Do you have an example of how this should be used to combat the different timezone (UTC / PST) issue?

For example, many of our daily tables are UTC-based and some are generated nightly but our team is mostly PST based, so using a “yesterday” filter at 8 PM PST will return an error because “yesterday” is actually “today” in PST and that table hasn’t been generated.


(Todd Nemet) #5

Hi Josh, the last LookML example in the article was created specifically for the issue where the logs are in UTC and the queries are in PST or EST. Or are you saying that TABLE_DATE_RANGE is returning a “table not found” error?

I’m not familiar with the structure of your log files, but it might be necessary to add something like this:

    sql: |
      SELECT * FROM
      TABLE_DATE_RANGE([githubarchive:day.],
      DATE_ADD(IFNULL({% date_start date_filter %},CURRENT_TIMESTAMP()),-1,"DAY"),
      DATE_ADD(IFNULL({% date_end date_filter %},CURRENT_TIMESTAMP()),1,"DAY")
      )

to make sure that you always include a table and then apply the same filter on another date field inside the log files.


(Dominic Ricard) #6

Awesome. This is the feature we have been waiting for since we started using Looker to query our Spark (Hive) cluster. We have had to educate our users to use both Partition and TimeZone Specific dates in the filters to ensure Spark was not scanning the entire cluster every time.

Now, with this new macro, I was able to simplify the date selection a lot by splitting the filter from the dimension. The user can now pick the date filter they want, and the LookML automatically generates the partition range to use.

Here’s how our filter/dimension and table are setup:
Table (year, month, day are partitions):

derived_table:
  sql: |
    SELECT * FROM my_table i
    WHERE CONCAT(LPAD(year, 4, 0), LPAD(month, 2, 0), LPAD(day, 2, 0)) >= DATE_FORMAT(COALESCE(CAST(DATE_SUB({% date_start tz_filter %}, 1) as TIMESTAMP), {% date_start event_timestamp_filter %}, CAST(0 as TIMESTAMP)), "yyyyMMdd")
      AND CONCAT(LPAD(year, 4, 0), LPAD(month, 2, 0), LPAD(day, 2, 0)) < DATE_FORMAT(COALESCE(CAST(DATE_ADD({% date_end tz_filter %}, 1) as TIMESTAMP), {% date_end event_timestamp_filter %}, CURRENT_TIMESTAMP()), "yyyyMMdd")
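The partition-key concatenation can be sketched in Python (a toy equivalent of the Hive CONCAT/LPAD expression above; `partition_key` is an illustrative name):

```python
def partition_key(year, month, day):
    """Equivalent of CONCAT(LPAD(year,4,0), LPAD(month,2,0), LPAD(day,2,0)):
    build a zero-padded yyyyMMdd string from separate partition columns so
    it can be range-compared against a DATE_FORMAT-ed filter bound."""
    return f"{int(year):04d}{int(month):02d}{int(day):02d}"
```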

Filters + Dimensions:

- filter: event_timestamp_filter
  type: date
  view_label: "Date/Time"
  label: "UTC"

- filter: tz_filter
  view_label: "Date/Time"
  label: "TZ Normalized"
  type: date
  sql: |
    event_timestamp >= TO_UTC_TIMESTAMP({% date_start tz_filter %}, ${enrichment.timezone}) AND
    event_timestamp < TO_UTC_TIMESTAMP({% date_end tz_filter %}, ${enrichment.timezone})

- dimension_group: event_timestamp
  datatype: timestamp
  type: time
  view_label: "Date/Time"
  label: ""
  group_label: "UTC"
  convert_tz: false
  can_filter: false

- dimension_group: tz
  view_label: "Date/Time"
  label: "TZ Normalized"
  type: time
  datatype: timestamp
  suggestable: false
  convert_tz: false
  can_filter: false
  sql: FROM_UTC_TIMESTAMP(event_timestamp, ${enrichment.timezone})

Model Conditionally Filter (Ensure user input a date range):

conditionally_filter:
  event_timestamp_filter: "yesterday"
  unless: [event_timestamp_filter, tz_filter]

Enjoy!


(Todd Nemet) #7

Dominic, this is great. I was wondering how to do partitioning for this use case with separate partitions. Thanks so much for posting this example!


(Tig Newman) #8

For each old LookML code block in this article and its comments, we just added the New LookML equivalent code. One block was skipped since it was just pseudo-code.