[Analytic Block] Derived Tables Pattern

(lloyd tabb) #1

About This Block

Often times our database doesn’t store information optimally for analytics. Rather than writing inefficient dimensions or measures to calculate what we want, derived tables allow us to create temporary tables directly in our database. With derived tables, you can perform analytics that require intermediate aggregations, increase the query performance of complex queries, create patterns to cleanse or normalize data, and many other beneficial functions.

Simply put, a derived table is Looker’s way of creating new tables that don’t exist in your database. You define one by providing a SQL query whose result set becomes the derived table itself.

Looker provides two options for derived tables: ephemeral or persistent. Ephemeral derived tables are not stored in your database; Looker uses common table expressions or creates a temporary table every time this derived table is referenced. Alternatively, persistent derived tables (PDTs) are written to disk and refreshed at a frequency of your choosing. You provide Looker write access to a scratch schema in your database to facilitate this process. PDTs are very similar to materialized views.

Ideal Data Types

Any and all data can benefit from derived tables. Near the end of this article will be links to design patterns that leverage derived tables. Please note that depending on your database, PDTs may not be available; however, all supported dialects in Looker support ephemeral derived tables.

Expected Output

Here is the explore of the derived table outlined in the code below. Notice it is identical in functionality to a normal view.

Explore Data in Full Screen

How it Works

Let’s use a standard e-commerce example and pretend we have a table called “orders” which can have multiple entries of orders place by the same user. If we wanted to perform some analysis on the subset of orders placed by each user, we could leverage a derived table to perform some aggregations on the details of the orders data while grouped by user_id. Here is an example:

- view: user_order_facts
    sql: |
      SELECT orders.user_id AS user_id                           
              , COUNT(DISTINCT order_id) as lifetime_orders
              , COUNT(*) AS lifetime_order_items
              , MIN(NULLIF(orders.created_at,0)) AS first_order
              , MAX(NULLIF(orders.created_at,0)) AS latest_order
              , DATEDIFF(MAX(NULLIF(orders.created_at,0)),MIN(NULLIF(orders.created_at,0))) AS days_as_customer
              , DATEDIFF(CURDATE(),MAX(NULLIF(orders.created_at,0))) AS days_since_purchase
              , COUNT(DISTINCT CONCAT(MONTH(NULLIF(orders.created_at,0))),YEAR(NULLIF(orders.created_at,0))) AS number_of_distinct_months_with_orders
              , SUM(CASE WHEN orders.created_at <= users.created_at + INTERVAL 24 HOUR THEN order_items.sale_price ELSE 0 END) AS 24_hour_revenue
              , SUM(CASE WHEN orders.created_at <= users.created_at + INTERVAL 30 DAY THEN order_items.sale_price ELSE 0 END) AS 30_day_revenue
              , SUM(CASE WHEN orders.created_at <= users.created_at + INTERVAL 90 DAY THEN order_items.sale_price ELSE 0 END) AS 90_day_revenue
              , SUM(CASE WHEN orders.created_at <= users.created_at + INTERVAL 365 DAY THEN order_items.sale_price ELSE 0 END) AS 365_day_revenue
            FROM order_items 
            LEFT JOIN orders ON orders.id = order_items.order_id
            LEFT JOIN users ON users.id = orders.user_id
            GROUP BY user_id
    persist_for: 2 hours
    indexes: [user_id, lifetime_orders]

Lets go some of the fields in the derived table:

  • user_id: the column we are grouping by. Every row of this derived table will be a unique user_id.
  • lifetime_orders: a count(distinct order_id) to count up all unique orders for each user.
  • first_order: the oldest date of any order for each user, if available
  • latest_order: the most recent date of any order for each user, if available
  • days_as_customer: the difference between the most recent and oldest order for each user.
  • days_since_purchase: the difference between today’s date and the most recent order date for each user
  • number_of_distinct_months_with_orders: the count of distinct months each user has placed orders
  • xxx_day_revenue: order revenue over that specified time window

Now that Looker knows how to create the results of this query, you can either reference these fields directly or perform any sort of manipulations to these fields in Looker. Below are two dimensions which reference fields from the derived table - one is creating a tier structure and the other is performing an inequality evaluation.

  - dimension: lifetime_number_of_orders_tier
    type: tier
    style: integer
    tiers: [0,1,2,3,5,10]
    sql: ${lifetime_orders}

  - dimension: repeat_customer 
    type: yesno
    sql: ${lifetime_orders} > 1

These are just two simple examples of manipulating data from a derived table. Complex calculations can now be performed because the results have been obtained initially through the derived table mechanism.

Lastly, lets cover some parameters of persistent derived tables.

  • Data Freshness: Because PDTs are stored in scratch schema on your database, there is a possibility it is not based off of the most recent dataset. You can control this with parameters like persist_for or sql_trigger_value to manage the frequency of requery and the freshness of the data generated by the derived table. In our example above, we are specifying a trigger value of 2 hours, which is telling Looker to resubmit the underlying query every two hours to refresh the data.

  • Data Distribution: Looker can distribute the results of a PDT by user specification of indexes or sort_key/dist_key, depending on your database. In our example above, we are specifying a distribution index on user_id and lifetime_orders. This is a good choice because the query’s GROUP_BY statement establishes uniqueness on user_id and we will be frequently filtering on lifetime_orders. Well indexed PDTs will result in highly performant access.

Try it Yourself!

For the users_orders_facts example, the derived table view definition can be found in this model here. But perhaps a more practical application would be to see a derived table leveraged in another design pattern. Please take a look at these other Looker Blocks which use derived tables:

Additional Information

Here are some key things to remember about derived tables:

  • Because derived tables are defined within LookML as a view, you can utilize them in the model just like any other view. Feel free to join derived tables to other views, restrict columns, and set mandatory filters.
  • Persistent derived tables can be managed from the admin panel, under the category “PDTs”. Here you can see high level information on all PDTs across your instance.
  • Derived tables can leverage templated filters, a feature allowing you to dynamically manipulate a derived table’s underlying query, changing its WHERE clause based on parameters or conditionals.
1 Like