Adding daily data to PDTs without INSERT

(Brad Gowland) #1

PDTs are a great tool to keep some of our data cleaning and shaping processes as flexible and close to our analysts as possible. We frequently have to adjust our data cleaning requirements, and the simplicity of changing SQL in PDTs is great for this. Like many others I’ve heard from here, however, a major limitation we encounter with PDTs is the lack of an INSERT function to append new daily data. We add millions of rows to our fact table each day, totaling billions of rows in our data warehouse. Processing all of this data from scratch each night isn’t possible, but we love the flexibility of managing our filtering methodology in PDTs.

To address this problem we use a system of what we call Historical and Current PDTs. Each runs with a different datagroup_trigger, and covers a different period of time.

Historical PDTs run on the first Sunday of each month, and cover the full history of our data warehouse up to the processing date. Current PDTs run on a daily trigger, and contain data from the 1st of the month until the present date. A UNION of the Historical and Current PDTs creates our full fact table. In this manner, we process a maximum of 31 days of new data each night, and use UNION as a workaround for INSERT. 31 days is not an insignificant amount of data for us, and the UNION process can still take some time, but the system is far more agile than processing the entirety of our data each night. Having a small Current data component of our larger PDT also allows us to recover more quickly from errors in our nightly ELT, rebuilding only recent days and leaving the Historical PDT untouched.

Has anybody else used a similar tiered system of PDTs as a substitute for inserting new rows? We’ve thought about having an even more long-term tier for historical data, possibly even only rebuilding data from more than a year ago quarterly or annually. Is there a maximum age for PDTs, or any other drawbacks for maintaining very rarely rebuilt PDTs that we should be concerned about?


(Izzy) #2

Regarding your base issue of not being able to use INSERT: Have you tested out create_process:?
The example on that page has a 2-step PDT creation process, where one step is an INSERT. I actually haven’t tested out how a PDT that just inserts will play, but it could be worth a shot since it sounds like it would drastically simplify your workflow.

1 Like

(Fabio) #3

Hi @bradgowland (Brad?),

Yes, I’ve worked on some advanced incremental PDT use cases recently and am planning on writing up an article next week. I’ll make sure to let you know once I publish that.

In the meantime, I have a couple of questions about your setup -

  1. What data warehouse / dialect are you using?

  2. In your union between the historical and the current, are you using a literal UNION, or a UNION ALL. If the former, I would definitely recommend switching to UNION ALL!


(Brad Gowland) #4

Hi Fabio!

Thanks for the response - to your questions:

  1. We use Redshift.
  2. Should have been more clear in my post! We do use UNION ALL. I have a nightly test that runs to check that we don’t have any overlap between max and min dates in the incremental tables instead of using UNION to dedupe.

Will definitely check out your article when that’s up.