Identifying and Building PDTs for Performance Optimization


(Mike DeAngelo (a.k.a. Dr. StrangeLooker)) #1

Background

A Persistent Derived Table (or PDT for short) is a table that Looker can create and manage in the target database. The table is loaded with data from the SQL statement provided, and then refreshed on a regular basis. In contrast, the data for a Derived Table (DT) is never written to the database; its SQL is instead embedded in queries, much like a database view.

In the view LookML file a DT is defined like this…

Old LookML
- view: customer_facts
  derived_table:
    sql: |
      SELECT ...
      FROM ...
      WHERE ...
New LookML
view: customer_facts {
  derived_table: {
    sql: SELECT ...
      FROM ...
      WHERE ...
       ;;
  }
}

Adding Persistence

Update on Persistence:

Datagroups - Datagroups are a feature, available as of version 4.16, that allows you to group PDT rebuilds and/or tie them to your ETL processes. For more info on using datagroups, check out this doc (https://docs.looker.com/data-modeling/learning-lookml/caching)
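As a sketch of how this fits together (the datagroup name and trigger query here are illustrative assumptions, not prescriptions), a datagroup is declared once in the model file and then referenced from any PDT that should rebuild with it:

```lookml
# In the model file: define the datagroup. Looker polls sql_trigger
# periodically; when its result changes, dependent PDTs are rebuilt.
datagroup: customer_etl {
  sql_trigger: SELECT MAX(last_update_ts) FROM customers ;;
  max_cache_age: "24 hours"
}

# In the view file: persist the PDT against the datagroup instead of
# using persist_for or sql_trigger_value.
view: customer_facts {
  derived_table: {
    sql: SELECT ... ;;
    datagroup_trigger: customer_etl
  }
}
```

A single datagroup can drive many PDTs (as well as query caching), which keeps all of them in step with one ETL cycle.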

In order to use the PDT feature, the box to enable “Persistent Derived Tables” must be checked and the “Temp Database” parameter must be set to a database or schema on the Database Instance. These settings are part of the Database Connection settings available through the Admin panel.

The user id that Looker uses to connect to the database must have the ability to create and drop tables and other database objects in that schema.

In the view LookML file a PDT must have either ‘persist_for:’ or ‘sql_trigger_value:’ but not both.

  • Use persist_for to specify a “time to live” for the PDT
    Example: persist_for: 8 hours
  • Use sql_trigger_value to specify a query that returns a single value; when that value changes, the PDT is regenerated.
    Example: sql_trigger_value: SELECT MAX(last_update_ts) FROM customers

In the view LookML file a PDT might look like this:

Old LookML
- view: customer_facts
  derived_table:
    sql: |
      SELECT ...
      FROM ...
      WHERE ...
    sql_trigger_value: SELECT MAX(last_update_ts) FROM customers

  fields:
    # field definitions would continue here.
New LookML
view: customer_facts {
  derived_table: {
    sql: SELECT ...
      FROM ...
      WHERE ...
       ;;
    sql_trigger_value: SELECT MAX(last_update_ts) FROM customers ;;
  }

  # field definitions would continue here.
}

So, a PDT must have either a persist_for: duration or sql_trigger_value: sql_statement attribute defined.

Using persist_for

When using persist_for: the PDT will be built for the first time when a user query attempts to use it. If other queries reference the PDT as well, they will use the cached PDT data as long as the duration specified as part of the persist_for: has not elapsed since the PDT was last built. If the duration has passed, then the PDT will be rebuilt when it is next needed.

So an attribute like persist_for: 20 minutes means that the query results returned from a PDT will always be less than 20 minutes old; if the cached data is older, the PDT will be rebuilt before a query’s results are returned. A PDT is only rebuilt in response to a query that attempts to use it, so if the system is idle the PDT may not be rebuilt for a long time. As a result, some user queries that hit an expired PDT must wait while it is rebuilt, and those users see slower response times.

Using sql_trigger_value

When using sql_trigger_value: sql_statement, Looker runs the specified sql_statement on a regular basis (typically every 5 minutes) and records the result. If the result differs from the previous run, the PDT is rebuilt. This means the PDT can be loaded before a user actually needs the data. Typical sql_statements look like one of these examples:

SELECT COUNT(*) FROM source_table — Reload the data whenever the number of rows in some other table increases.

SELECT MAX(update_timestamp) FROM source_table — Reload the table whenever an update occurs to the underlying data.

SELECT CURRENT_DATE — Reload the table at midnight.

SELECT CAST(DATEADD(hrs, -3, CURRENT_TIMESTAMP) AS DATE) — Reload the table at 3 AM (Redshift-style DATEADD syntax; adjust for your dialect).

Recommended Settings For Sorting and Indexing

The following settings are optional but highly recommended. They control how the data in the PDT is sorted and indexed.

For Redshift and Aster, ‘distribution:’ specifies the column whose value is used to spread the data around the cluster. When two tables are joined on the column specified in ‘distribution:’, the database can find the joined rows on the same node, so internode I/O is minimized.
Example: distribution: customer_id

For Redshift, setting ‘distribution_style:’ to ALL instructs the database to keep a complete copy of the data on each node. This is often used to minimize internode I/O when joining relatively small tables. Setting this value to EVEN instructs the database to spread the data evenly across the cluster without using a distribution column. EVEN can only be specified when ‘distribution:’ is not specified.
Example: distribution_style: ALL

The parameter ‘sortkeys:’ is used on Redshift only. The values specify which columns of the PDT are used to sort the data on disk to make searching easier. On Redshift, ‘sortkeys:’ or ‘indexes:’ may be used, but not both.
Example: sortkeys: [customer_name, first_purchase_date, last_purchase_date]

The parameter ‘indexes:’ is used on most other databases. The values specify which columns of the PDT are indexed. (On Redshift, ‘indexes:’ is used to generate interleaved sort keys.)
Example: indexes: [column_name, column_name, …]

So, the overall definition of a PDT could be:

Old LookML
- view: customer_facts
  derived_table:
    sql: |
      SELECT ...
      FROM ...
      WHERE ...
    sql_trigger_value: SELECT MAX(last_update_ts) FROM customers
    distribution: customer_id
    sortkeys: [customer_name, first_purchase_date, last_purchase_date]

  fields:
    # field definitions would continue here.
New LookML
view: customer_facts {
  derived_table: {
    sql: SELECT ...
      FROM ...
      WHERE ...
       ;;
    sql_trigger_value: SELECT MAX(last_update_ts) FROM customers ;;
    distribution: customer_id
    sortkeys: [customer_name, first_purchase_date, last_purchase_date]
  }

  # field definitions would continue here.
}

Understanding PDT Names

When Looker determines that the PDT should be created, Looker generates a PDT name that is made up of the “scratch schema” + “table status code” + “hash value” + “view name” by:

  • Generating a new unique code for the name of the table. It will be a string of letters and numbers.
    For example: 4DEM41PGKN2KFC63Y8RTF.
  • Prepending the unique code with a string like LC$ to indicate the current status of the table.
    For example: LC$4DEM41PGKN2KFC63Y8RTF
  • Appending the name of the view to create a table name for the generated table.
    For example: LC$4DEM41PGKN2KFC63Y8RTF_customer_fact.
  • Prepending the scratch schema.
    For example: tmp.LC$4DEM41PGKN2KFC63Y8RTF_customer_fact.

Now that the name is created, Looker:

  1. Uses the derived_table SQL to fashion a CREATE TABLE … AS SELECT (CTAS) statement and executes it.
    For example: CREATE TABLE tmp.LC$4DEM41PGKN2KFC63Y8RTF_customer_fact AS SELECT ... FROM ... WHERE ...
  2. When the table is built, issues the statements to create the indexes.
  3. Renames the table from LC$… to LR$… to indicate that the table is ready to use.
  4. Drops any old version of the table that should not be in use anymore.

There are a few important implications of this:

  • The SQL that forms the derived table must be valid inside a CTAS statement.
  • The column aliases on the result set of the SELECT statement must be valid column names.
  • The names used when specifying ‘distribution:’, ‘sortkeys:’ and ‘indexes:’ must be the column names in the generated table, not the field names defined in the LookML.
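As an illustration of the last point (the table, column, and field names here are hypothetical), suppose the SELECT aliases a column as customer_name while the LookML dimension is called full_name; the persistence parameters must use the SQL alias:

```lookml
view: customer_facts {
  derived_table: {
    sql: SELECT customer_id, name AS customer_name
      FROM warehouse.customer ;;
    sql_trigger_value: SELECT CURRENT_DATE ;;
    # Correct: customer_name is a column alias in the generated table.
    # Wrong: full_name, which exists only as a LookML field name.
    sortkeys: [customer_name]
  }

  dimension: full_name {
    sql: ${TABLE}.customer_name ;;
  }
}
```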

When to Use a PDT

First, Try Not to Use a PDT

Often a PDT is very useful for a certain analysis, but will also make the resulting Explore less flexible. A trap that new Looker developers often fall into is taking the SQL they previously used for their reports and blindly inserting it into Looker. This “dumbs down” Looker into being used like the tools it replaces. Instead, take advantage of Looker’s more flexible and powerful capabilities.

As a general rule of thumb, a developer should try to model without using PDTs. Wait until an issue is actually identified before building a PDT. Even when an issue is identified, in some cases it can be resolved through other means. Make sure to analyze the execution plans of slow queries. The addition of an index or the change of a column data type might resolve an issue without the need to build a PDT.

Using PDTs to Test Optimizations

The PDT functionality makes it very simple to test different indexing, distributions, and other options without needing a large amount of support from your DBA or ETL developers.

Consider a case where you have a table but wish to test different indexes. Your initial LookML for the view may look like this…

Old LookML
- view: customer
  sql_table_name: warehouse.customer
  fields:
    ...

New LookML
view: customer {
  sql_table_name: warehouse.customer ;;

}

Simply change the view LookML like this…

Old LookML
- view: customer
  # sql_table_name: warehouse.customer
  derived_table:
    sql: SELECT * FROM warehouse.customer
    persist_for: 8 hours
    indexes: [customer_id, customer_name, salesperson_id]
  fields:
    ...

New LookML
view: customer {
  # sql_table_name: warehouse.customer
  derived_table: {
    sql: SELECT * FROM warehouse.customer ;;
    persist_for: "8 hours"
    indexes: [customer_id, customer_name, salesperson_id]
  }

}

Query this once to trigger the generation of the PDT, then run your test queries and compare the results. If the results are good, you can ask your DBA or ETL team to add those indexes to the original table. Remember to change your view code back afterward to remove the PDT.

Pre-Joining/Aggregating Data

Sometimes the volume or type of data is such that pre-joining and/or aggregating some of the data is very useful.

For example, suppose we want to report on customers by cohort based on when they made their first order. This query might be expensive to run in real time, but with the following PDT it can be calculated once and reused.

Old LookML
- view: customer_order_facts
  derived_table:
    sql: |
      SELECT
        c.customer_id,
        MIN(o.order_date) OVER (PARTITION BY c.customer_id) AS first_order_date,
        MAX(o.order_date) OVER (PARTITION BY c.customer_id) AS most_recent_order_date,
        COUNT(o.order_id) OVER (PARTITION BY c.customer_id) AS lifetime_orders,
        SUM(o.order_value) OVER (PARTITION BY c.customer_id) AS lifetime_value,
        RANK() OVER (PARTITION BY c.customer_id ORDER BY o.order_date ASC) AS order_sequence,
        o.order_id
      FROM warehouse.customer c LEFT JOIN warehouse.order o ON c.customer_id = o.customer_id
    sql_trigger_value: SELECT CURRENT_DATE
    indexes: [customer_id, order_id, order_sequence, first_order_date]

  fields:
    ...
New LookML
view: customer_order_facts {
  derived_table: {
    sql: SELECT
        c.customer_id,
        MIN(o.order_date) OVER (PARTITION BY c.customer_id) AS first_order_date,
        MAX(o.order_date) OVER (PARTITION BY c.customer_id) AS most_recent_order_date,
        COUNT(o.order_id) OVER (PARTITION BY c.customer_id) AS lifetime_orders,
        SUM(o.order_value) OVER (PARTITION BY c.customer_id) AS lifetime_value,
        RANK() OVER (PARTITION BY c.customer_id ORDER BY o.order_date ASC) AS order_sequence,
        o.order_id
      FROM warehouse.customer c LEFT JOIN warehouse.order o ON c.customer_id = o.customer_id
       ;;
    sql_trigger_value: SELECT CURRENT_DATE ;;
    indexes: [customer_id, order_id, order_sequence, first_order_date]
  }

}

Add a Primary Key Using a PDT

Primary keys can be critical when using symmetric aggregates. Most tables will have some column or expression that has a primary key. Every now and then, however, a table might not have any reasonable way to uniquely identify a row. In this case, a PDT can be used to generate a unique id.

Old LookML
- view: transaction
  # sql_table_name: warehouse.transaction
  derived_table:
    sql: SELECT ROW_NUMBER() OVER () AS transaction_pk, * FROM warehouse.transaction
    sql_trigger_value: SELECT CURRENT_DATE
    indexes: [customer_id, product_id, salesperson_id, transaction_date]
  fields:
    - dimension: transaction_pk
      type: number
      primary_key: true
      hidden: true
      sql: ${TABLE}.transaction_pk
    ...
New LookML
view: transaction {
  # sql_table_name: warehouse.transaction
  derived_table: {
    sql: SELECT ROW_NUMBER() OVER () AS transaction_pk, * FROM warehouse.transaction ;;
    sql_trigger_value: SELECT CURRENT_DATE ;;
    indexes: [customer_id, product_id, salesperson_id, transaction_date]
  }

  dimension: transaction_pk {
    type: number
    primary_key: yes
    hidden: yes
    sql: ${TABLE}.transaction_pk ;;
  }

  # field definitions would continue here.
}

PDT Anti-Patterns

There are several situations where a PDT might not be the best choice:

  • The most important issue occurs when the underlying data is changing frequently and a “snapshot in time” is not sufficient.
    For example, consider the customer_order_facts PDT above. If the data is fed from the transactional system on a nightly basis, then this pattern makes sense. If the data is fed continuously or nearly continuously, the PDT will quickly become out of date. Does it matter if the Looks and dashboards that rely on this view are not up to date? Maybe, maybe not; that depends on the use case. Do your non-technical Looker users understand that this data is not constantly updating? Consider addressing that in training, through naming choices, or by using description parameters.

  • When the cost and time involved in rebuilding PDTs frequently enough is too high
    In the prior situation, the ‘sql_trigger_value:’ could be set so that the table rebuilds more often. However, the rebuild process recreates the entire table rather than modifying just a few rows of data. Beyond some rebuild frequency, the cost is too high and the database spends more time rebuilding PDTs than serving useful queries. In these cases, instead of a PDT in Looker, consider having the ETL processes create this table using an incremental update strategy.


Related topics:

  • JOIN 2017 - Deep Dive - To Use or Not Use PDT's
  • PDT Naming: Development vs Production
(sharon) #2

For each old LookML code block in this article and its comments, we just added the New LookML equivalent code.
