This is a bit long-winded but please bear with me on this one.
I’m not sure our modeling journey has been as straightforward and seamless as we would have hoped.
Our use case is as below.
We have six separate tables that contain information about actions our company takes on a regular basis.
Each table records the date the action took place, along with a session_id to identify the session.
The way our model works, these numbers need to be reported side by side as a reflection of our ‘Point in Time’ performance. To be more specific, the tables don’t share any common identifier other than the date the actions took place, so a typical join (session_id = session_id) won’t work in this instance.
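To make the shape of the problem concrete, here is a rough sketch of two of the six tables. This is illustrative only: data.abandonments and its columns come from the mockup below, while data.sends is an assumed name for one of the other tables.

```sql
-- Illustrative only: shapes inferred from the mockup query below.
-- data.abandonments is named in the post; data.sends is an assumed
-- name for one of the other five tables.
CREATE TABLE data.abandonments (
    session_id          VARCHAR,  -- identifies a session within this table only
    client_number       INTEGER,
    utc_date_abandoned  DATE
);

CREATE TABLE data.sends (
    session_id          VARCHAR,  -- unrelated to abandonments.session_id
    client_number       INTEGER,
    utc_date_sent       DATE
);
```

The only column the tables have in common is the client and the date, which is why the union approach below treats the date as the join axis.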
Our current solution is to combine these tables via UNION ALL into a PDT using standard SQL, giving each table an ‘Event’ type. See the example below (rough mockup):
SELECT *
FROM (
    SELECT
        utc_date_abandoned         AS "Activity Date",
        client_number              AS "Client",
        'Abandoned'                AS "Event",
        COUNT(DISTINCT session_id) AS "Number of Abandonments",
        NULL                       AS "Number of Sends",
        NULL                       AS "Number of Opens"
    FROM data.abandonments
    WHERE utc_date_abandoned >= '2019-10-10'
      AND utc_date_abandoned <  '2019-10-15'
      AND client_number = 12345
    GROUP BY utc_date_abandoned, client_number

    UNION ALL

    SELECT
        utc_date_sent              AS "Activity Date",
        client_number              AS "Client",
        'Sends'                    AS "Event",
        NULL                       AS "Number of Abandonments",
        COUNT(DISTINCT session_id) AS "Number of Sends",
        NULL                       AS "Number of Opens"
    FROM data.sends  -- assumed name for the sends table
    WHERE utc_date_sent >= '2019-10-10'
      AND utc_date_sent <  '2019-10-15'
      AND client_number = 12345
    GROUP BY utc_date_sent, client_number

    -- etc... one branch per remaining table
);
The final result gives us the total number of Abandonments, Sends, and every other metric we decide on, simply by adding another table to the union.
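Once the union is materialized as a PDT, the side-by-side report is just a SUM per event column. A sketch, assuming the PDT is named pdt_events and the column aliases from the mockup above are normalized to snake_case:

```sql
-- Sketch only: pdt_events is an assumed name for the materialized union.
SELECT
    activity_date,
    client,
    SUM(number_of_abandonments) AS total_abandonments,
    SUM(number_of_sends)        AS total_sends,
    SUM(number_of_opens)        AS total_opens
FROM pdt_events
WHERE activity_date >= '2019-10-10'
  AND activity_date <  '2019-10-15'
GROUP BY activity_date, client
ORDER BY activity_date;
```

The NULL placeholders in each union branch mean every SUM only picks up rows from the matching event table, which is what lets the metrics sit side by side despite having no shared key.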
This solution works well for us, but it has a few drawbacks that we have discovered over time.
1.) We first ran the UNION in the PDT without date restrictions overnight and cached the results.
- This resulted in really quick query times, but it was heavy on Snowflake usage and burned far too many credits.
2.) We then added date restrictions to the WHERE clause, which reduced the volume of data but meant the union ran at query time, taking almost two minutes per query.
My initial suspicion is that the route we’ve gone down is not the best solution for either performance or cost.
Has anyone had a similar use case or does anyone have any ideas on how we might be able to do this in a more efficient manner?