PDTs are a great tool to keep some of our data cleaning and shaping processes as flexible and close to our analysts as possible. We frequently have to adjust our data cleaning requirements, and the simplicity of changing SQL in PDTs is great for this. Like many others I’ve heard from here, however, a major limitation we encounter with PDTs is the lack of an
INSERT function to append new daily data. We add millions of rows to our fact table each day, totaling billions of rows in our data warehouse. Processing all of this data from scratch each night isn’t possible, but we love the flexibility of managing our filtering methodology in PDTs.
To address this problem we use a system of what we call Historical and Current PDTs. Each runs with a different
datagroup_trigger, and covers a different period of time.
Historical PDTs run on the first Sunday of each month, and cover the full history of our data warehouse up to the processing date. Current PDTs run on a daily trigger, and contain data from the 1st of the month until the present date. A
UNION of the Historical and Current PDTs creates our full fact table. In this manner, we process a maximum of 31 days of new data each night, and use
UNION as a workaround for
INSERT. 31 days is not an insignificant amount of data for us, and the
UNION process can still take some time, but the system is far more agile than processing the entirety of our data each night. Having a small Current data component of our larger PDT also allows us to recover more quickly from errors in our nightly ELT, rebuilding only recent days and leaving the Historical PDT untouched.
Has anybody else used a similar tiered system of PDTs as a substitute for inserting new rows? We’ve thought about having an even more long-term tier for historical data, possibly even only rebuilding data from more than a year ago quarterly or annually. Is there a maximum age for PDTs, or any other drawbacks for maintaining very rarely rebuilt PDTs that we should be concerned about?