How to: Simple incremental PDTs using create_process

Something we’ve been wanting to achieve in Looker for a while is the ability to incrementally build PDTs rather than doing a DROP and CREATE nightly.

I think I’ve found a reasonable solution, using create_process. I’m sharing my findings here as I’ve seen a lot of topics trying to solve this issue, so hopefully this can help some people.

The table I’m trying to build incrementally is quite simple: I’m pulling rows that fit certain criteria from one table, modifying them a bit, then inserting them into another table. But I’m sure you can use the same solution for more complex processes.

  1. First you need to create the table you wish to make incremental in your warehouse, outside of the looker_scratch schema. I won’t go into the DDL process, but feel free to comment if you need help.
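
For example, a minimal DDL sketch (the column names mirror the example below; the types are assumptions, so adjust them to match your source data):

-- Illustrative only: pick types that match existing_table
CREATE TABLE schema.incremental_table_name (
  col1   VARCHAR,
  col2   VARCHAR,
  col3   VARCHAR,
  tstamp TIMESTAMP
);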

  2. You then need to grant SELECT and INSERT on this table to your Looker user in your warehouse. You can achieve this with GRANT SELECT, INSERT ON TABLE schema.incremental_table_name TO looker_username;

  3. In LookML, create a new view containing an incremental PDT, like so:

view: incremental_table {
  derived_table: {
    create_process: {
      sql_step:
        -- Stage ONLY the new rows in the looker_scratch table
        CREATE TABLE ${SQL_TABLE_NAME} AS (
          SELECT DISTINCT
            col1,
            col2,
            col3,
            tstamp
          FROM existing_table AS e
          -- Only select rows newer than what's already in the target
          WHERE e.tstamp > (SELECT MAX(tstamp) FROM schema.incremental_table_name)
        ) ;;

      sql_step:
        -- Push the staged rows into the permanent table
        INSERT INTO schema.incremental_table_name
        SELECT * FROM ${SQL_TABLE_NAME}
        -- Optional filtering, assuming col1 is unique
        WHERE ${SQL_TABLE_NAME}.col1 NOT IN (SELECT col1 FROM schema.incremental_table_name) ;;
    }
  }
}
  4. Set the sql_trigger_value to something that indicates the underlying table has been updated, e.g. sql_trigger_value: SELECT MAX(tstamp) FROM existing_table ;; (triggering on the incremental table itself wouldn’t work, since its max tstamp only changes when the PDT builds).
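
For reference, here is roughly where the trigger sits in the view (abridged sketch):

view: incremental_table {
  derived_table: {
    sql_trigger_value: SELECT MAX(tstamp) FROM existing_table ;;
    create_process: {
      # ... the two sql_steps from step 3 ...
    }
  }
}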

  5. You may want to backfill the incremental table with historical rows before you set up the incremental PDT to start running.
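
A one-off backfill can be as simple as running the same SELECT by hand in your warehouse, without the watermark filter (an illustrative sketch using the example tables above):

-- One-time load of all historical rows into the target table
INSERT INTO schema.incremental_table_name
SELECT DISTINCT
  col1,
  col2,
  col3,
  tstamp
FROM existing_table;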

Basically what you’re doing is storing ONLY the new rows in the looker_scratch table; a second sql_step then inserts those new rows into an existing table outside of your looker_scratch schema.

As I mentioned, this is a very simple example, and it will only really work in cases where you want to do simple extraction and processing of rows, then load them into a separate table.

Interested to see what other use cases people can come up with. I’m going to see if we can use this method to make our page_views table incremental from the raw events data we collect with Snowplow.


Clever! We’ve been thinking about how to achieve this for a while, and the idea of generating a temporary table of deltas and pushing them into a different table is very neat.

We’re going to give this a go and experiment with adding additional steps to the create_process to handle cleansing, late-arriving data, and/or deletes from the source table; perhaps something like the sketch below.
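
For example, handling hard deletes in the source might just be one more step at the end, something like this (an untested sketch, assuming col1 is the unique key):

sql_step:
-- Remove rows that no longer exist in the source table
DELETE FROM schema.incremental_table_name
WHERE col1 NOT IN (SELECT col1 FROM existing_table) ;;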


@zckymc, nice trick, thank you for the write-up! We’ve implemented one incremental PDT based on your idea, except that both tables are PDTs (we replaced incremental_table_name with a PDT whose sql_trigger_value is SELECT 1; there’s a sketch after the list below). This had the following advantages:

  1. No need to maintain a table outside of Looker
  2. The ability to easily change the table structure if needed, which is one of the main benefits of PDTs for us
  3. The ability to use two instances of Looker (dev and prod) without having to worry about data being available for the prod instance while we’re still testing
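
Roughly, the setup looks like this (an abridged sketch; the view names are illustrative, and the target’s initial SELECT is an assumption, it effectively doubles as the backfill):

# The persistent target: built once (SELECT 1 never changes), then only INSERTed into
view: incremental_target {
  derived_table: {
    sql_trigger_value: SELECT 1 ;;
    sql:
      SELECT DISTINCT col1, col2, col3, tstamp
      FROM existing_table ;;
  }
}

# The delta PDT: stages new rows, then pushes them into the target
view: incremental_delta {
  derived_table: {
    sql_trigger_value: SELECT MAX(tstamp) FROM existing_table ;;
    create_process: {
      sql_step:
        CREATE TABLE ${SQL_TABLE_NAME} AS (
          SELECT DISTINCT col1, col2, col3, tstamp
          FROM existing_table
          WHERE tstamp > (SELECT MAX(tstamp) FROM ${incremental_target.SQL_TABLE_NAME})
        ) ;;
      sql_step:
        INSERT INTO ${incremental_target.SQL_TABLE_NAME}
        SELECT * FROM ${SQL_TABLE_NAME} ;;
    }
  }
}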

Disadvantages:

  1. Looker has no way to tell when to update the cache, so cached queries would persist even though I’ve updated the incremental table. EDIT: Maybe I can overcome that by using a datagroup and persist_with for the explore (see the sketch after this list)?
  2. The code is less readable, and the SQL is also repeated across the two PDTs
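
If the datagroup idea pans out, it might look something like this (an untested sketch):

# Invalidate the cache whenever the source table gets new rows
datagroup: incremental_table_datagroup {
  sql_trigger: SELECT MAX(tstamp) FROM existing_table ;;
}

explore: incremental_target {
  persist_with: incremental_table_datagroup
}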

I think it would still be of great benefit if Looker supported this syntax natively, i.e. internally maintaining the PDT (or temporary view) that holds the new data and exposing only the incremental table.