Designing a Data Warehouse - what would a BI solution recommend?

meetup
performance
redshift
database
etl

(Segah A. Mir) #1

At Looker we are often asked about best practices for designing a new data warehouse. Typically this comes up just as companies are moving to MPP, and often column-oriented, databases, so it is clear from the start that replicating the design of an operational database is not appropriate.

It is also clear that there is no single answer - a search on Google turns up millions of results. That said, here are a few rules of thumb you can apply as you build your analytical data warehouse to work with Looker:

  • simplicity (aka shortest path)
  • performance
  • single copy of data
  • transparent EL process

Shortest Path

You should not need to use mapping tables or entity–attribute–value (EAV) tables to get to a value. The path between any two dimensions in one SELECT query should not involve more than a couple of joins.

Typically, “long path” designs arise from storing the original data in a NoSQL-style format. Because flattening that data in Looker adds very little analytical value (even though Looker makes the transformation more transparent), it is better to perform that logic before the data reaches the warehouse.
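For example, a query that touches two dimensions should look roughly like the sketch below (the table and column names are hypothetical) - each dimension is only a couple of direct joins away from the fact table, with no mapping or EAV hops in between:

SELECT c.region,
       p.category,
       SUM(od.amount) AS revenue
FROM order_details AS od
JOIN orders    AS o ON o.id = od.order_id
JOIN customers AS c ON c.id = o.customer_id
JOIN products  AS p ON p.id = od.product_id
GROUP BY 1, 2;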

Performance

We’ve had a number of posts already on performance: 1) case studies, 2) Redshift performance, and 3) more on Redshift if you want to understand storage usage and its distribution/skew. Performance matters because in the Looker world no one waits in line to get a report - even calculations over billions of rows are expected to be instantaneous.

Because performance often requires redundancy or more complicated EL/ETL processes, it has to be balanced against the goal of keeping a single copy of the data and against the transparency of the EL process. In some cases, date fields might need to be duplicated across tables. In other cases, de-normalization of tables is best done outside of Looker.
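As a concrete (and hypothetical) illustration of de-normalizing outside of Looker, the EL step might pre-join the detail rows and carry the order date onto them so that large scans avoid an extra join:

-- sketch only: table and column names are made up
CREATE TABLE order_detail_flat AS
SELECT od.id,
       od.order_id,
       o.created_at AS order_created_at,  -- date duplicated from orders
       od.product_id,
       od.amount
FROM order_details AS od
JOIN orders AS o ON o.id = od.order_id;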

Single copy of data

Storing the same information in multiple columns across several tables is very confusing. It is often unclear whether there is any difference in how the data is collected.

Typically, the rule of thumb is to store any piece of data that can ever change in only one place. This means that customer information, for example, should not be duplicated across several tables.
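A minimal sketch of the idea (hypothetical tables): other tables carry only the customer key, and customer attributes are read from the one customers table at query time:

SELECT o.id,
       o.created_at,
       c.email,
       c.region
FROM orders AS o
JOIN customers AS c ON c.id = o.customer_id;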

Transparent EL process

High-quality analytics requires an understanding of all the assumptions the model makes about the data. LookML provides that transparency. However, if a complicated transformation, join, or filter happens before the data ends up in the data warehouse, your assumptions about the underlying data might be wrong.
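For example, if the EL step silently dropped cancelled orders, nothing in the model would record that assumption. A minimal sketch of keeping the raw rows and stating the assumption in LookML instead (the status column here is hypothetical):

- dimension: is_completed_order
  type: yesno
  sql: ${TABLE}.status = 'completed'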

In summary

It is easier to visualize these priorities in terms of a 2x2 matrix. For example:

Simplicity vs. Performance

Likewise, it can be important to consider Simplicity vs. Single Copy of Data, Performance vs. Single Copy of Data, and so on.

http://www.slideshare.net/caura_invest_private/designing-a-data-warehouse-what-would-a-bi-solution-recommend


(Jeff Kaplan) #2

Thanks for the great breakdown on rules of thumb for using Looker with an analytical data warehouse, Segah!

I’m curious how one can provide both simplicity of data schemas, via short paths to dimensions, and a transparent EL process. These two seem to be in conflict, since the first suggests removing complexity before the data reaches the warehouse, while the second notes that doing so can reduce transparency.

Are there techniques for managing both of these requirements at the same time, and when those techniques are not applicable, how do you think about choosing one over the other?


(Segah A. Mir) #3

I think I should define what I mean by unnecessary transformations. For this I will use the example from the EAV blog post written by my colleague, Mike Xu.

Say our data is originally in this EAV format:

While we can work with this as is, it is a poor design for a relational database whose primary function is to be read from. Ideally, we would perform the basic transformations outlined in the blog post to get the table into the following final form:
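Roughly, that transformation pivots the attribute/value rows into one column per attribute. A generic SQL sketch of the idea (the table and attribute names here are assumptions, not necessarily those from the post):

SELECT parent_id,
       MAX(CASE WHEN attribute = 'first_name' THEN value END) AS first_name,
       MAX(CASE WHEN attribute = 'last_name'  THEN value END) AS last_name,
       MAX(CASE WHEN attribute = 'gender'     THEN value END) AS gender,
       MAX(CASE WHEN attribute = 'race'       THEN value END) AS race
FROM eav_table
GROUP BY parent_id;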

Now, consider the second version of this table as a starting point. Let’s say we create the following transformations:

- dimension: full_name
  sql: ${first_name} || ' ' || ${last_name}

- dimension: is_human
  type: yesno
  sql: ${gender} = 'male' OR ${gender} = 'female'

Now, both types of transformations could have been created before the data ever gets to our data warehouse. The former (the pivot into a wide table) does not add any analytical value - we are just transposing our table - while in the latter case (deriving full_name and is_human) we are doing a bit more.

It would have been easy to create a new table in the data warehouse with the following columns:

| parent_id | race | is_human | full_name |

But here we are already making some assumptions: who is considered human? Maybe in the future we decide that gender = ‘man’ also counts as human - not just ‘male’. And even more importantly, we now lack the raw columns for first_name, last_name, and gender.

There could be many other examples that do not involve an EAV-type schema - for example, values stored in a JSON-like key/value format.


(Mike DeAngelo (a.k.a. Dr. StrangeLooker)) #4

There is another way to look at this question, and that is whether we want to build and store “transactional tables” in something like third normal form, a star/snowflake schema, or a flat data mart. Each has its advantages.

Storing something similar to the transactional data has the advantage that the ETL process is simple, but it will require more joins to extract value. Consider a case with a set of tables like orders, order_details, products, customers, customer_address, shipping_address, and sales_people. The order_details table is likely the largest and has the “interesting” fact data, so we would naturally like to make it the table from which we start our explore. But all the dimensions, except those in products, are reached by first joining the orders table. So maybe we instead want to start our explore on orders. Since our fact data is now in a joined table rather than the driving table, the SQL might be less efficient.

The solution to this is to pre-join orders and order_details into an order_detail_fact table, and join customer_dim, product_dim, and sales_people_dim to order_detail_fact. This works well overall, but requires more ETL. There is still a potential scaling issue here, however, since the dim tables are all joined to the fact table by different keys. Only one key can serve as the distribution key, so resolving all the joins means lots of internode communication in an MPP database. The dim tables are often relatively small, though, so the ETL system can be instructed to make a complete copy of the dim tables on every node. Maybe the distkey is customer_id, so customer_dim is not copied everywhere, but the other dims are.
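A sketch of what that might look like in Redshift DDL (hypothetical columns, not a prescription): the fact table is distributed on customer_id, and a small dimension is copied to every node with DISTSTYLE ALL:

CREATE TABLE order_detail_fact
DISTKEY (customer_id)
SORTKEY (order_created_at)
AS
SELECT o.customer_id,
       o.created_at AS order_created_at,
       od.product_id,
       od.sales_person_id,
       od.quantity,
       od.price
FROM orders AS o
JOIN order_details AS od ON od.order_id = o.id;

CREATE TABLE product_dim DISTSTYLE ALL AS
SELECT * FROM products;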

Finally, the third option is to make one big flat table and let everyone use it without joins, or with only a very few joins. MPP columnar databases actually process these big, wide, flat structures very efficiently, while row-oriented databases are less capable.

My general opinion is that customers should start with a normalized reporting database and use PDTs to build the other structures as needed. Once those PDTs have demonstrated value, they can alter their ETL to build the star schema and data mart structures.
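For instance, a prototype of the order_detail_fact above could start life as a PDT; a minimal sketch in (old-style) LookML, with made-up table and column names:

- view: order_detail_fact
  derived_table:
    sql: |
      SELECT o.customer_id,
             o.created_at AS order_created_at,
             od.product_id,
             od.quantity,
             od.price
      FROM orders AS o
      JOIN order_details AS od ON od.order_id = o.id
    sql_trigger_value: SELECT CURRENT_DATE
    distkey: customer_id
    sortkeys: [order_created_at]
  fields:
    - dimension: customer_id
      sql: ${TABLE}.customer_id
    - measure: total_price
      type: sum
      sql: ${TABLE}.price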


(Segah A. Mir) #5

+1000 this statement. Use Looker to prototype your EL/ETL workflow. When the project scales (either in data size or modeling complexity), move those PDTs into the ETL pipeline.