Incremental PDTs?

etl

(Dan) #1

So one of my favorite features of Looker is persistent derived tables and the data modeling functionality that comes with them - it allows us to be very nimble. However, one of the downsides of persistent derived tables is that every time one gets regenerated, unless you set a filter to constrain the time period, the generation takes longer and longer as your data set grows.
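
(For context, the kind of time filter I mean is just a date constraint in the derived table’s SQL - something like this, with made-up table and column names and Postgres-style date math:)

-- Illustrative only: constrain the PDT to a trailing 90-day window so rebuilds stay fast
select
  user_id,
  count(*) as recent_orders,
  max(created_at) as latest_order_at
from orders
where created_at >= current_date - interval '90 days'
group by 1;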

However, a lot of the time only a small portion of the PDT really needs to be updated (think a user_order_facts table or something like that).

My question is, is there a way to only re-compute the parts of the derived table which have changed since the last time it was built? Or should I be looking more towards an ETL solution at this point?

Thanks!


(George Fraser) #2

I’ve spent a lot of time talking to people who have implemented incremental “SQL pipelines” using DBT, Airflow, and homemade systems. My main takeaways are:

  1. Building incremental transforms is hard. Don’t do it until you actually have a performance problem.
  2. All of the correct pipelines that I’ve seen worked the same way: partition the data on a date column. Every time the pipeline is run, rebuild all the partitions that have changed.

Here’s a toy example of what I’m talking about: suppose we have the following tables:

  • events (time, kind) - raw table of events
  • hourly_totals (date, hour, kind, count) - aggregated version of events
  • cursor (date) - keeps track of progress in converting events into hourly totals

begin transaction;

-- Identify all new data, with a resolution of 1 day
create temp table new_data as 
select * from events 
where date(time) >= (select date from cursor);

-- Delete updated dates from hourly_totals
delete from hourly_totals 
where date >= (select date from cursor);

-- Re-materialize the updated dates
insert into hourly_totals
select date(time) as date, extract(hour from time) as hour, kind, count(1) as count
from new_data
group by 1, 2, 3;

-- Advance the cursor
update cursor set date = (select max(date(time)) from new_data);
 
commit;

In theory, if your data arrives in order, you could implement this pipeline at single-row resolution. The problem is that real pipelines are much more complicated, and in my experience, when people try to implement “perfect” incremental updates, they inevitably run into correctness problems. Rebuilding the most recent partitions is a simple trick that gives up a little efficiency to gain a lot of robustness.


(jeremy.eckman) #3

Hey @Dan - glad to hear you’re enjoying the functionality of derived tables! Currently, PDTs (persistent derived tables) can’t be updated incrementally; rebuilding one means re-running the whole SQL query. In that sense, PDTs are useful for prototyping potential ETL processes, but once the data gets sizeable enough, these tables need to be maintained by a dedicated ETL process.

I will certainly pass your feedback along to our product team. It’s great that you’re getting so much utility from PDTs, and only updating the parts of the table where changes have occurred could be a powerful feature. Thanks for your feedback!


(George Fraser) #4

Addendum: if you want to really go down a rabbit-hole of the subtleties of computing SQL queries incrementally, the Calcite project has some great resources on this. The author of Calcite, @julian.hyde, is now at Looker! So he might have some interesting things to add to this discussion :grin:


(Julian Hyde) #5

Thanks for the shout-out @George_Fraser!

There are indeed a lot of pitfalls if you try to compute incrementally. I’m going to spend some time understanding how Looker users typically build and load their database models before making any concrete suggestions.

My hunch is that if the source is straightforward - say an append-only table with a “date_added” timestamp column - then this is a solvable problem, but otherwise you are likely to spend a lot of effort avoiding corner cases.
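
To make the easy case concrete, here’s the sort of thing I have in mind (table and column names are made up): when the source is append-only, the target’s own high-water mark is all the bookkeeping you need.

-- Sketch only: incremental copy from an append-only source with a date_added column
-- Assumes date_added never goes backwards; ties in date_added that straddle two runs
-- are exactly the kind of corner case that makes this harder than it looks
begin transaction;

insert into events_copy
select *
from events
where date_added > (select coalesce(max(date_added), timestamp '1970-01-01') from events_copy);

commit;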


(Tristan Handy) #6

Hi @Dan! This is a great question that we run into very frequently. @George_Fraser - thanks for the shout-out and for giving a bit of a preview of my answer :smile:

I’m one of the maintainers of an open source product called dbt (data build tool). dbt is very similar to Looker’s Persistent Derived Table functionality, but is more sophisticated in a couple of key areas. The one that’s relevant to your question is that dbt natively supports incremental table rebuilds: it’s baked into the core of the tool. As a result, it only requires a couple of lines of code to go from full table rebuilds to incremental rebuilds in most cases.

I linked the relevant section of the docs below, but the short version is that you need to do two things to make a table incremental in dbt:

  1. Write a dynamic where condition in the model that performs date limiting. This where condition should limit records to only those that are new or changed since your prior table load.
  2. Specify a table unique key.

That’s it: dbt takes care of the rest. It uses the same conceptual approach that @George_Fraser outlines, where updated records are deleted and new data is inserted, but dbt writes all of the inserts and deletes for you. This keeps your code simple, maintainable, and extremely performant (and very much resembles a PDT!).
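
Here’s roughly what that looks like (model and column names are placeholders, and the exact config syntax can vary between dbt versions):

-- models/order_events.sql
{{ config(materialized='incremental', unique_key='event_id') }}

select
    event_id,
    user_id,
    event_type,
    occurred_at,
    updated_at
from raw_events

{% if is_incremental() %}
  -- on incremental runs, only scan rows that are new or changed since the last build
  where updated_at > (select max(updated_at) from {{ this }})
{% endif %}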

dbt is starting to see quite a lot of usage in the Looker community; there are over 200 companies using dbt in production as of this writing and many of them are Looker customers. dbt models present a very natural transition for when you’re ready to take the next step beyond PDTs, whether that’s to improve performance with incremental rebuilds, build an explicit DAG, or much more. You can transition PDTs over to dbt models 1:1 just using copy/paste, and then refactor from there. EZPZ.

Docs on incremental dbt models are here. If you run into any trouble, just ping us in Slack.


(Jon Brenner) #7

wrong reply!