
[Analytic Block] BigQuery Table Date Range (Deprecated)

This method has been deprecated; it works only with BigQuery's legacy SQL. Standard SQL is recommended instead.

## About This Block

This block demonstrates how to effectively query time data in BigQuery, so that analysts can perform crucial time-related analyses (for example, events over time, events per time period, purchases per day, and so forth).

A best practice for time series data in BigQuery is to partition it by date and store the partitions in individual tables. This makes it easy to add, remove, and maintain datasets. The partitioned tables can be unioned together, effectively appearing as a single table, using the table wildcard function TABLE_DATE_RANGE.

For example, click data would be stored in a series of tables named clicks_20140920, clicks_20140921, … clicks_20140926 and they would be queried with SQL like this:

SELECT ...
FROM (TABLE_DATE_RANGE(
	[mydataset.clicks_],  -- dataset name is illustrative
	TIMESTAMP('2014-09-20'),
	TIMESTAMP('2014-09-26')))
WHERE ...

Looker provides a practical and easy way to implement TABLE_DATE_RANGE through the {% table_date_range %} tag.

## Ideal Data Types
This pattern is specifically designed for date partitioned tables in BigQuery.

The {% table_date_range %} tag takes two parameters. The first parameter is the name of the LookML field to use as a filter. The second parameter is the common prefix for the partitioned tables.
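For instance, reusing the clicks example from above, the tag might be used in a view like this (a sketch; the filter field name and the `myproject:mydataset` qualifier are illustrative):

```
  sql_table_name: |
      ( SELECT * FROM {% table_date_range date_filter myproject:mydataset.clicks_ %})
```

Here `date_filter` is the LookML filter field and `myproject:mydataset.clicks_` is the common table prefix.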

## Expected Output
This Block lets you easily query specific time periods in BigQuery, enabling any time-based analysis, even for users who don’t understand how the tables are laid out under the hood.

This example explores the public dataset ‘githubarchive’ and has been set up to show the last 7 days of data. Any user viewing these results can change the date filter (to explore the last two weeks, say), and the query will automatically be rewritten to query the appropriate tables.

### Repositories created in the last 7 days

## Try It Yourself
### How It’s Done:

First, create a date filter in the view:

  - filter: date_filter
    type: date

You can then use it in the {% table_date_range %} tag and in the always_filter of the explore.
### Try It Out:

- explore: githubarchive
  always_filter:
    date_filter: 2 days #This will be the default date range.

- view: githubarchive
  sql_table_name: |
      ( SELECT * FROM {% table_date_range date_filter githubarchive:day.events_ %})

  fields:
  - filter: date_filter # This date filter is used in the table_date_range tag and as an explore filter.
    type: date

# Below, put in measures and dimensions that are relevant for this view
# Example of a measure
  - measure: event_count
    type: count
    drill_fields: [id, created_date, type, actor_name, repo_name]

# Example of dimensions
  - dimension: event_type
    sql: type

  - dimension_group: created_at
    type: time
    timeframes: [time, date, week, month]
    sql: created_at
...

### Further Analysis

There is also a {% table_date_range_last %} tag available, which works just like table_date_range but returns only the last date’s table. This is useful for referencing and joining in snapshot tables that are rebuilt daily.
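As a sketch (the view name and the `user_snapshot_` table prefix are hypothetical), a daily-rebuilt snapshot table could be referenced like this:

```
- view: user_snapshot
  sql_table_name: |
      ( SELECT * FROM {% table_date_range_last date_filter myproject:mydataset.user_snapshot_ %})
```

Because only the most recent date’s table is selected, this view can be joined into an explore without duplicating rows across days.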


One problem we ran into was how BigQuery handles the case where no tables match the user-provided query. For example, if we only have tables for 2016, but the user specifies an “is in the year 2015” filter, you will get an ugly BigQuery error that says:

Failed to retrieve data - FROM clause with table wildcards matches no table

Because the error message includes the generated SQL, and because “FROM clause … table wildcards” means nothing to a non-technical user, we wanted to get rid of this error.

One recommendation was to modify the sql_table_name setting as follows:

- view: githubarchive
  sql_table_name: |
      ( SELECT * FROM (SELECT null LIMIT 0), {% table_date_range date_filter githubarchive:day.events_ %})

(Note the unioned sub-query (SELECT null LIMIT 0))

Unfortunately this resulted in similarly weird error messages when the query referenced any dimensions and didn’t match any date tables because (SELECT null LIMIT 0) has no schema.

To finally work around this, we instead created an empty table, with the appropriate schema, for every table set and unioned that in. In our case we just used null in place of a date suffix, giving you:

- view: githubarchive
  sql_table_name: |
      ( SELECT * FROM githubarchive:day.events_null, {% table_date_range date_filter githubarchive:day.events_ %})

Which is nice and easy to read. The only hassle is having to make sure the “null” table exists.
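One way to create that empty “null” table (a sketch only; the source table date here is hypothetical) is to run a LIMIT 0 query against any existing day table and save the result as a destination table named events_null, e.g. via the BigQuery UI or the bq CLI:

```sql
-- LIMIT 0 copies the schema of an existing day table but returns no rows;
-- save the result as githubarchive:day.events_null (destination table)
SELECT * FROM [githubarchive:day.events_20160101] LIMIT 0
```

The schema must be kept in sync with the real day tables, or the union will fail again when columns diverge.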


Thanks for sharing that @numbsafari! Good catch and seems like a nice fix.

@numbsafari the empty table with an identical schema is an excellent idea. I’m trying to take this a step further and apply it to a “before now” time range (which is used a lot here). Since TABLE_DATE_RANGE does not work with open-ended start dates, we need to use date_start/date_end logic:

table_id >= CONCAT( 'daily_table_name_' , STRFTIME_UTC_USEC({% date_start activity.at_date %},'%Y%m%d') ) 
AND 
table_id < CONCAT( 'daily_table_name_' , STRFTIME_UTC_USEC({% date_end activity.at_date %},'%Y%m%d') )

In the case of “before now”, date_start gets converted to STRFTIME_UTC_USEC(NULL,'%Y%m%d'), which does not compile.

Any ideas? I am thinking of hardcoding a start date, or alternatively just querying the TABLES schema directly. Thoughts?

After brainstorming with the RealSelf guys, @ajit and @Grayson_Williams (FYI, they are hiring), we arrived at the following logic:

    (SELECT * FROM 
      TABLE_QUERY(schema, 
          "(table_id >= CONCAT( 'table_name_' , STRFTIME_UTC_USEC(IFNULL({% date_start some_date_filter %},'0'),'%Y%m%d') ) 
          AND 
          table_id < CONCAT( 'table_name_' , STRFTIME_UTC_USEC(IFNULL({% date_end some_date_filter %},'0'),'%Y%m%d') ))
          OR table_id = 'table_name_0'"
        ))

Some advantages:

  • works for “is before”, “is after”, and “is on” a particular date (does not have to be a strict range)
  • avoids the above “FROM clause with table wildcards matches no table” error
  • lets you UNION this along with other tables as you would do otherwise since the empty table has the same underlying columns as the full tables

Awesome stuff, thanks @segahm!

Is there any way to adjust for time zones with Looker’s {% table_date_range %} tag and the “is in the past x days” filter? The relevant piece translates into TIMESTAMP(DATE_ADD(TIMESTAMP(CONCAT(CURRENT_DATE(), ' 00:00:00')), -1, 'DAY')), which takes the current UTC date at midnight. Thus, if I am currently at 2016-01-01 19:00 EST, it is 2016-01-02 00:00 UTC, and the time frame defined in the filter is shifted by one day.

Great question @bxhu. There are no specific features to change the SQL translation of date filters right now. There’s potentially a workaround by manually typing in the SQL for the filters, though that’s definitely not the most ideal solution given the number of date filter options. Let me loop in some folks with deeper knowledge here:
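As a rough illustration of that manual approach (a sketch only; the fixed 5-hour offset assumes EST and ignores daylight saving time), the generated boundary could be shifted from midnight UTC toward midnight Eastern like this:

```sql
-- "past 1 day" boundary at approximately midnight Eastern (UTC-5)
-- instead of midnight UTC
TIMESTAMP(DATE_ADD(
  DATE_ADD(TIMESTAMP(CONCAT(CURRENT_DATE(), ' 00:00:00')), -1, 'DAY'),
  5, 'HOUR'))
```

Hand-writing this for every date filter option is exactly the maintenance burden mentioned above, so treat it as a stopgap.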

@jswenson @segahm @mikhailxu can either of you provide more insight here?

Hey David,

Google Standard SQL now supports time zones. If you convert your model to standard SQL (instead of legacy SQL), Looker has full time zone support. Standard SQL is pretty awesome.

Also, using _PARTITIONTIME or _TABLE_SUFFIX is much easier than using TABLE_DATE_RANGE.

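For reference, a minimal standard SQL sketch (the wildcard example reuses the public githubarchive data from earlier; `mydataset.mytable` is a hypothetical ingestion-time partitioned table, and the column lists are elided):

```sql
-- Wildcard tables with _TABLE_SUFFIX replace TABLE_DATE_RANGE
SELECT ...
FROM `githubarchive.day.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20140920' AND '20140926'

-- For an ingestion-time partitioned table, filter on _PARTITIONTIME instead
SELECT ...
FROM `mydataset.mytable`
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2014-09-20') AND TIMESTAMP('2014-09-26')
```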

Agree 100% on Standard SQL. So, so much less headache. That said, converting to standard SQL turns into a large project. Unlike Redshift or, say, Vertica, legacy BigQuery required a lot of custom SQL and derived tables.

One way to address the legacy-to-standard migration: move persisted derived tables into database views and then subsequently into ETL. That way the Looker connection can use standard SQL while the ETL uses legacy SQL.

The other approach is to re-use PDTs from the old model/connection.

Anyone try something else?