Date table

etl

(Kallin Nagelberg) #1

I’ve begun exploring the creation of a transformed database (data warehouse) for use with Looker. Right now we just have Looker hooked up to a replica of our production DB, which has worked great in the short term but is showing some limitations.

In order not to go into this too blindly, I’ve been doing some reading on data warehouse best practices, including ‘The Data Warehouse Toolkit’, which from what I can gather is one of the must-read books in this space.

In this book they are quite adamant about creating a ‘Date’ table instead of using date fields directly in your fact tables. Benefits include being able to attach attributes to dates, such as ‘holiday’, ‘quarter’, ‘weekend’, etc., that aren’t readily accessible in SQL. Of course, Looker does have some helpful date/time tools that allow easy grouping by month/week, etc.

So, as I begin down this road, I wonder how important these classical warehouse approaches are when using a tool like Looker, since much of what Looker does is make a non-warehouse DB act like an easily explored analytical warehouse.

If anyone has any ideas/opinions on the topic, and specifically if you have a preference about creating a date table, I’d love to hear it.


(lloyd tabb) #2

Yeah, I think you’re going down an old road. It isn’t the direction you should be headed.

Traditional data warehouses were designed when machines were slow and databases were very expensive.

With the advent of column stores (Redshift, Vertica, BigQuery, etc.), you can query huge amounts of data without having to precompute aggregates (which is what traditional data warehousing is all about). You might need to roll up if you are looking at huge event logs, but then, and only then, do you need to.

Summary tables in LookML are usually very different: rolled up by entity instead of time. Roll up orders by user_id and compute lifetime value, number of orders, etc.
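
For illustration, a minimal sketch of that kind of entity rollup (table and column names here are hypothetical):

SELECT
    user_id,
    COUNT(*)        AS lifetime_order_count,
    SUM(amount)     AS lifetime_value,
    MIN(created_at) AS first_order_at,
    MAX(created_at) AS latest_order_at
  FROM orders
  GROUP BY user_id

In Looker this kind of rollup would typically live as a persistent derived table keyed on user_id.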

There are lots of articles on this site about different patterns for accomplishing all the things you used to do in a data warehouse, but in a much more efficient, explorable way.


(Kallin Nagelberg) #3

Thanks for the feedback @lloydtabb. I wasn’t planning to do much in the way of pre-computation during the transform. Instead, I will be filtering the production DB through the application code to create a database that is better designed for analytics. I.e., the raw production DB doesn’t lend itself to analytics, and there is a lot of logic in the app required to interpret the data.

Extracting to an analytics DB (probably Redshift eventually) should include designing an appropriate schema. I’m trying to figure out the best way to accomplish this, and I do find people suggesting things like a date table with Redshift, and heavily denormalizing your data.

I imagine a lot of the old warehouse approaches are still valuable today, but anything based purely around performance should probably be taken with a grain of salt.


(Brett Sauve) #4

As another perspective on this, I used Looker as a customer in a previous life. At that time, we did a lot of ETL to take our ugly production data and turn it into nice analytics data. Like you, we weren’t trying to aggregate it, just make it easier to understand. Unfortunately, all we really did was introduce another place for data inconsistencies and misunderstandings to take place. We should have just used our raw data and then modeled out the complexity in LookML.

With that said, our data wasn’t really all that big or complicated. So it’s possible that trying to model away everything in LookML can’t get you the type of performance you want. However, it’s worth considering whether you’ll be opening up a can of worms.


(Kallin Nagelberg) #5

Thanks for the feedback @Brett_Sauve. In our case, there is some data in the production DB that is not even queryable at this point (serialized, encrypted), and it needs to be extracted to be useful for analytics. We also have a considerably complex data model, with lots of helper methods in code to make sense of it. Trying to re-implement that logic in LookML with derived tables or complex SQL seems to be asking for a maintenance headache, as any refactoring or change to the code or data model will bring Looker out of sync with reality.

For now I’m going ahead with writing some ETL code inside our app (to be hosted in a standalone ETL environment) that makes use of all the existing app logic and will allow me to write tests to ensure we catch any changes to the app down the road that could affect the ETL process. Does that seem reasonable? I’m curious what successful Looker customers tend to do in these cases.


(Segah A. Mir) #6

@Kallin_Nagelberg you might find this (https://discourse.looker.com/t/data-warehouse-design/1231) discussion on data warehouse design useful.


(Haleemur Ali) #7

We have a calendar table in our Redshift instance, and we use it for the reasons you state above (holidays, weekends, marketing calendar events, etc.). If your organization maintains a Google Calendar, you can pull events out through its API and store them in your calendar table. Then, three years later, if an analyst wants to know what caused a spike, they can pull up the calendar annotations right next to it.
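
As a rough sketch (table and column names here are hypothetical), a query like the one below lines those annotations up against a daily metric so a spike can be read in context:

SELECT
    c.calendar_date,
    c.event_name,                  -- e.g. a holiday or marketing event
    SUM(s.amount) AS daily_sales
  FROM calendar c
  LEFT JOIN sales s
    ON s.sold_at::date = c.calendar_date
  GROUP BY c.calendar_date, c.event_name
  ORDER BY c.calendar_date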

Some time-series visualizations benefit from showing 0 values that would otherwise disappear with a regular GROUP BY on the fact table (e.g. a line chart showing hourly sales for the last 48 hours). To build such visualizations, you left join from the date-hour table(s) to the data on the time field and aggregate up to the desired granularity. The measures for periods with missing data are then converted to 0, and they can be rolled up as long as they’re additive or composed of additive measures (e.g. sum, average, min, max); median is an example of a non-additive measure.
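
A minimal sketch of that zero-filling pattern, assuming a date-hour spine table called date_hours and a hypothetical sales table:

SELECT
    dh.date_hour,
    COALESCE(SUM(s.amount), 0) AS hourly_sales   -- 0 instead of a missing row
  FROM date_hours dh
  LEFT JOIN sales s
    ON DATE_TRUNC('hour', s.sold_at) = dh.date_hour
  WHERE dh.date_hour >= DATE_TRUNC('hour', NOW()) - INTERVAL '48 hours'   -- NOW() in Postgres; GETDATE() on Redshift
  GROUP BY dh.date_hour
  ORDER BY dh.date_hour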

We also integrate a lot of third-party data, which comes at various levels of time granularity (e.g. the service we use to report store footfall gives us the number of people in the store every hour, not the individual events of people coming in and leaving). So it again becomes necessary to create aggregate tables to compare sales with footfall, and by using the date-hour table I can report on hours with footfall that don’t have any sales.
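
For example (again with hypothetical names, and assuming the footfall feed lands as one row per hour), pre-aggregating sales to the hour before joining both sides to the date-hour spine avoids fan-out and keeps footfall-only hours visible:

SELECT
    dh.date_hour,
    COALESCE(f.people_count, 0) AS footfall,
    COALESCE(s.hourly_sales, 0) AS sales
  FROM date_hours dh
  LEFT JOIN footfall f
    ON f.measured_at = dh.date_hour
  LEFT JOIN (
    SELECT DATE_TRUNC('hour', sold_at) AS sale_hour,
           SUM(amount)                 AS hourly_sales
      FROM sales
     GROUP BY 1
  ) s
    ON s.sale_hour = dh.date_hour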

I would advise going ahead and creating a calendar table: you only do it once, it’s good for the next 10 years or so, and space should not be an issue (10 years ≈ 3,652 rows). With LookML and today’s DB technology, the calendar table doesn’t need to have quarter, weekend, or day of week/month/year columns; those you can add as dimensions through the LookML layer.
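
If it helps, one way to generate the calendar table itself in a PostgreSQL-flavoured database is a sketch like this (the date range is arbitrary; on Redshift, generate_series is a leader-node-only function, so a row-number trick similar to the hour-of-day query below may be needed instead):

-- one row per day
SELECT d::date AS calendar_date
  FROM generate_series('2010-01-01'::date, '2029-12-31'::date, interval '1 day') AS d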

Also, you can derive an hour-of-day table, to be used for more granular aggregations, using the following query in PostgreSQL-flavoured databases. Then you can LEFT JOIN to it from the calendar table as needed (sketched after the query below).

SELECT
    ROW_NUMBER() OVER (ORDER BY any_field) - 1 AS hod  -- 0 through 23; ROW_NUMBER avoids the ties RANK can produce
  FROM table_with_at_least_24_rows
  LIMIT 24
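
For completeness, a sketch of combining the two (PostgreSQL-flavoured; calendar and hour_of_day are the tables built above, and an unconditional join pairs every date with every hour to produce the date-hour spine):

SELECT
    c.calendar_date + h.hod * interval '1 hour' AS date_hour
  FROM calendar c
  CROSS JOIN hour_of_day h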