[Analytic Block] Redshift Performance Optimization

 

Looker has created this block to make analyzing data easier and more efficient. This Data Block is made available on an “as is” basis, meaning there will not be updates moving forward.

Database performance is key to getting the best experience out of Looker. By leveraging the performance optimization features in the Redshift Block and the knowledge below, you’ll be able to quickly identify, diagnose, and optimize performance on your Redshift cluster.

1. Implementing the Redshift Block

Implementation instructions are detailed in the block itself, but implementation can be as easy as copying the files into your environment and setting the connection.

 

Note: URL paths are provided throughout this post. You can append these to your Looker domain to navigate to the referenced content. If you choose to rename the model while implementing the block, you will need to replace “redshift_model” in these paths accordingly.

2. How to Identify Performance Problems

The first point of investigation or periodic review should typically be the performance dashboard:

/dashboards/redshift_model::redshift_performance

[Image: redshift dashboard perf 1]

By starting at the dashboard, you can focus your performance optimization efforts in areas where they will have the most impact:


  • Individual queries that are taking the most time

  • Suboptimal join patterns that, in aggregate, are taking the most time

  • Capacity limitations that would not be revealed at the per-query level

Note: All query history data is limited to the past 1 day directly in the model. If desired, this window can be adjusted in the redshift_queries view definition.

Identifying Opportunities from Individual Queries

The top section of the dashboard gives an overview of all queries run yesterday, with a histogram by run time, and a list of the top 10 longest running queries. You can drill into a list of queries by clicking on any bar in the histogram, and from either that list or the top 10 list, you can inspect queries you think might be problematic:

[Image: redshift dashboard perf 2]
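If you want to sanity-check this list outside of Looker, a roughly equivalent query against the STL_QUERY system table might look like the following. This is a sketch only, not the block's exact SQL; the one-day filter mirrors the model's default window and can be widened here or in the redshift_queries view.

-- Approximate the "top 10 longest running queries" tile directly in Redshift.
SELECT query,
       TRIM(querytxt) AS query_text,
       DATEDIFF(seconds, starttime, endtime) AS seconds_to_run
FROM stl_query
WHERE starttime >= DATEADD(day, -1, GETDATE())  -- widen (e.g., -7) for more history
ORDER BY seconds_to_run DESC
LIMIT 10;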

Identifying Opportunities from Network Activity Patterns

The next section of the dashboard deals with the network activity caused by joins. Network activity has a large impact on execution time and often follows consistent patterns, which makes it ripe for optimization; even so, most Redshift implementations neglect to properly analyze and optimize it.

[Image: redshift dashboard perf 3]

The pie chart on the left gives the approximate share for each type of network activity. Although there will always be some suboptimal redistribution of data for some share of queries in the system, when red and yellow types account for more than 50% of the activity, this is a good indication that a different configuration would yield better system-wide average performance.

The list on the right shows individual opportunities, with all queries performing a particular join pattern grouped into a row and then sorted by aggregate time running among those queries, so you can focus on adjusting join patterns that will have a significant impact on your end users.

Once you have identified a candidate join pattern for optimization based on this table, click the query count to see a drill-down of all the matching queries, and then select ones that appear representative or are particularly slow to run to investigate further.
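If you would like to cross-check these join patterns against the raw system tables, per-query broadcast and distribution volumes live in STL_BCAST and STL_DIST. A hedged sketch, assuming the same one-day window used elsewhere in the block:

-- Bytes broadcast vs. distributed per query over the last day.
SELECT q.query,
       SUM(CASE WHEN n.activity = 'broadcast'  THEN n.bytes ELSE 0 END) AS bytes_broadcast,
       SUM(CASE WHEN n.activity = 'distribute' THEN n.bytes ELSE 0 END) AS bytes_distributed
FROM stl_query q
JOIN (
  SELECT query, bytes, 'broadcast' AS activity FROM stl_bcast
  UNION ALL
  SELECT query, bytes, 'distribute' AS activity FROM stl_dist
) n ON n.query = q.query
WHERE q.starttime >= DATEADD(day, -1, GETDATE())
GROUP BY q.query
ORDER BY SUM(n.bytes) DESC
LIMIT 50;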

Note: Nested loops are another problem sometimes caused by joins. Not only do they always result in DS_BCAST_INNER, but they can also cause excessive CPU load and disk-based operations. To check for these, use:

/explore/redshift_model/redshift_plan_steps?fields=redshift_plan_steps.operation_argument,redshift_queries.count&f[redshift_plan_steps.operation]=Nested+Loop&limit=50
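Redshift also raises its own alerts for this pattern in STL_ALERT_EVENT_LOG, so a quick cross-check straight against the system tables looks like the sketch below (the exact event text can vary by Redshift version):

-- Queries flagged with a nested loop join alert in the last day.
SELECT query,
       TRIM(event)    AS event,
       TRIM(solution) AS suggested_solution,
       COUNT(*)       AS alert_count
FROM stl_alert_event_log
WHERE event LIKE 'Nested Loop Join%'
  AND event_time >= DATEADD(day, -1, GETDATE())
GROUP BY query, TRIM(event), TRIM(solution)
ORDER BY alert_count DESC;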

Identifying Capacity Issues

In addition to slow-running queries, you might be experiencing slow response times simply because Redshift is queueing queries as a result of excessive demand or insufficient capacity. The line graph at the bottom of the dashboard will quickly reveal whether, and during which hours of the day, queries were queued. The blue line represents all queries received each hour, and the red line represents queries queued each hour. You can also click on the disabled "minutes queued" series to get an estimate of how much, in aggregate, the queued queries were delayed by waiting in the queue.

[Image: redshift dashboard perf 4]
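The same queue information is available in the STL_WLM_QUERY system table if you want to look at it outside the dashboard. A sketch (Redshift reports these times in microseconds):

-- Queries received vs. queued per hour over the last day, with total queue time.
SELECT DATE_TRUNC('hour', queue_start_time)                  AS hour,
       COUNT(*)                                              AS queries_received,
       SUM(CASE WHEN total_queue_time > 0 THEN 1 ELSE 0 END) AS queries_queued,
       SUM(total_queue_time) / 1000000.0 / 60.0              AS minutes_queued
FROM stl_wlm_query
WHERE queue_start_time >= DATEADD(day, -1, GETDATE())
GROUP BY 1
ORDER BY 1;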

If you do find an issue here, you can (of course) increase capacity, or you could manage demand by adjusting or cleaning out your PDT build schedules and scheduled looks/dashboards:

PDTs:

/admin/pdts

Scheduled content by hour:

/explore/i__looker/history?fields=history.created_hour,history.query_run_count&f[history.source]=Scheduled+Task&sorts=history.query_run_count+desc&limit=50&dynamic_fields=%5B%5D

In addition to this capacity issue that directly affects query response time, you can also run into disk-capacity issues.

If your Redshift connection is a superuser connection, you can use the admin elements of the block to check this.
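With such a connection, overall disk usage can also be checked directly against STV_PARTITIONS, which reports capacity and usage in 1 MB blocks. A sketch (the part_begin = 0 filter is a common convention to avoid double-counting disk partitions):

-- Cluster-wide disk usage.
SELECT SUM(capacity) / 1024                               AS capacity_gb,
       SUM(used) / 1024                                   AS used_gb,
       ROUND(SUM(used)::numeric / SUM(capacity) * 100, 1) AS pct_used
FROM stv_partitions
WHERE part_begin = 0;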

3. How to Interpret Diagnostic Query Information

When you click Inspect from any query ID, you’ll be taken to the Query Inspection dashboard:

[Image: redshift dashboard perf 5]

The dashboard components can be interpreted as follows:



  • Seconds To Run: The number of seconds a query took to run, and a good starting point for deciding whether to spend time optimizing this query.

  • Mb Scanned: How many megabytes of data the query scanned from disk.

    • This is commonly a significant contributing factor to query slowness. It’s affected by things such as the underlying size of your dataset, whether a sort key could be leveraged to limit which blocks had to be read from disk, the average size of values in the referenced columns, the compression applied to your data on disk, and whether a table is set to dist-style all.

       


  • Mb Distributed, Mb Broadcast: Network activity is another very common cause for slow queries, and distributing and broadcasting are the two main categories of network activity caused by a query.

    • They occur in order to bring the rows being joined onto the same node. Distribution means that for each pair of datapoints to be joined (e.g., a user and their order), Redshift chooses a node for that pair and each node sends its datapoints there for joining. Broadcasting, on the other hand, occurs either when Redshift cannot determine ahead of time how datapoints will be matched up (e.g., the join predicate requires a nested loop instead of a hash join) or when Redshift estimates that one side of the join is small enough that broadcasting it will be less costly than distributing both sides.

       


  • Rows Sorted: Less frequently, queries will trigger large sorts. Sorts on large numbers of records can be expensive, so a number greater than 100k here should be looked into.

    • Of course, these sorts are sometimes required, but many times an extraneous order by is left in a derived table view, resulting in a large, unnecessary, and frequently executed sort.

       


  • Disk-Based: Indicates whether any step on any node resulted in an operation that exceeded the available memory and caused the operation to be completed by storing some data on disk.

    • If yes, check whether there is very high skew in the underlying data or in any step in the query execution section; if not, consider whether additional capacity is needed based on trends across all queries in the system (see the example queries after this list).

       


  • Query Text: Provides the first 4,000 characters of the query.

  • Table Details: Shows the tables that participated in the query, some key information about them, and metrics about the scans of these tables during the execution of this query.

    • Note that these table dimensions are current and may be different from what they were when the query ran.

       


  • Query Plan: Shows the steps that the query planner produced. This is where the diagnostic heavy lifting gets done. When exploring from here, you can join the parent/child steps together, for example to see how many rows each side of a join contributed.

  • Query Execution: At times, the query planner plans things that don’t work out as expected. While the execution report is less structured and harder to tie back to the query, it is a good way to check the assumptions made by the planner.

    • For example, you can see how many rows or bytes were scanned, broadcast, or executed, and get a general sense of how the workload is balanced or skewed across nodes in the cluster (see the sketch below).
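If you want to dig into the per-step numbers behind the Disk-Based and Query Execution elements yourself, SVL_QUERY_SUMMARY and SVV_TABLE_INFO expose most of them. A sketch (123456 is a hypothetical query ID; substitute the one you are inspecting):

-- Per-step execution summary for one query: rows, bytes, and disk-based steps.
SELECT stm, seg, step,
       LEFT(label, 40)   AS label,
       rows,
       bytes,
       is_diskbased,
       maxtime - avgtime AS slice_time_gap  -- a large gap across slices hints at skew
FROM svl_query_summary
WHERE query = 123456  -- hypothetical query ID
ORDER BY stm, seg, step;

-- Table-level skew and unsorted percentage for candidate tables.
SELECT "table", diststyle, skew_rows, unsorted, tbl_rows
FROM svv_table_info
ORDER BY skew_rows DESC NULLS LAST
LIMIT 20;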
     

4. Common Problems and Corrective Actions

This was part of the JOIN 2017 conference. You can find the deep-dive presentation recording here.

Situation: A join pattern causes a nested loop that is unintentional or on large tables
  • Refactor the join into an equijoin (or an equijoin plus a small, fixed-size nested loop).
  • Build a relationship table as a PDT, so the nested loop only needs to be done once per ETL.

Situation: Overall join patterns result in frequent broadcasts of inner tables, distribution of large outer tables, or distribution of both tables
  • Adjust the dist style and distkey of the broadcast table, or of the receiving table, based on overall join patterns in your system.
  • Add denormalized column(s) to your ETL to enable better dist keys. E.g., in events -> users -> accounts, you could add account_id to the events table. (Consideration: don’t forget to add the account_id as an additional condition in the events -> users join.)
  • Build a PDT to pre-join or redistribute the table. (Consideration: not usually needed, though this may be worth the higher disk usage, and can be more efficient than distribution style “all”.)

Situation: Queries result in large amounts of scanned data
  • Set your first sort key to the most frequently filtered-on or joined-on column.
  • Check whether any distribution style “all” tables should be distributed instead (and possibly duplicated and re-distributed). (Consideration: with distribution style “all”, each node must scan the entire table vs. just scanning its slice.)
  • Adjust table compression.
  • Check for unsorted data in tables and schedule vacuums, or leverage sorted loading for append-only datasets.
  • For large tables, set an always_filter declaration on your sort key to guide users.

Situation: Queries have large steps with high skew and/or disk-based operations
  • Check table skew and the skew of scan operations, and potentially adjust the relevant distribution keys to better distribute the query processing. (Consideration: for small queries, higher skew can be OK.)

Situation: The query planner incorrectly underestimates the resulting rows from a filter, leading to a broadcast of a large number of rows
  • Check how far off the statistics are for the relevant table and schedule ANALYZE runs.
  • Adjust your filter condition.

Situation: Users frequently run full-historical queries when recent data would do just as well
  • Use always_filter so users are required to specify a filter value. (Consideration: the filtered field is ideally the sort key of a significant table, e.g., the created date field in an event table.)

For example:

\${[a-zA-Z0-9_]+\.field_id}\s*=|=\s*\${[a-zA-Z0-9_]+\.field_id}

This would let you search for where a given field is involved in an equality or join condition, if you are using the same field name as the underlying database column name.
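Several of the corrective actions above come down to plain Redshift DDL and maintenance commands. A hedged sketch using hypothetical table and column names (events, account_id, created_at); note that altering dist and sort keys in place requires a reasonably recent Redshift release, and older clusters may need a deep copy instead:

-- Collocate a frequently used join key (hypothetical table and column).
ALTER TABLE events ALTER DISTSTYLE KEY DISTKEY account_id;

-- Put the most frequently filtered or joined column first in the sort key.
ALTER TABLE events ALTER SORTKEY (created_at);

-- Keep planner statistics and sort order current.
ANALYZE events;
VACUUM events;

The always_filter and PDT suggestions, by contrast, are LookML changes rather than database DDL.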

Replies

Rui_Zhang1
Participant I

Love this a lot! Thank you for putting this together. So in order to view the queries from other users, we have to create a superuser connection, right? That’s just a little scary to do, but we wish to use Looker to monitor all the queries on our Redshift Database.

Edit: Outdated answer. See below for current details

Unfortunately, yes, it seems a superuser connection is the only way to see other users’ data. If you have one or two specific users in mind, you may be able to simply connect Looker as those other users.

ross2
Participant III

We were able to work around this restriction by running an ETL job through which a superuser copies the contents of updated system table records into a publicly accessible schema. Far from perfect, but I’ve found it to be a manageable solution.

Rui_Zhang1
Participant I

That’s a great idea! Thank you for sharing!

Hi, if you only want to view other users’ queries, then you don’t need to use a superuser. Instead, create a user with the option SYSLOG ACCESS UNRESTRICTED. They will then be able to see other users’ details in the system tables. You will probably also need to grant select on certain system tables/views that are normally accessible to superusers only, e.g., STV_LOCKS and STL_TR_CONFLICT. Perhaps you could provide a list of those used, Fabio?
Please note that although the user would not have superuser privileges, it should still be strongly secured, as giving a user unrestricted access to system tables gives them visibility into data generated by other users, which might contain sensitive data. For example, STL_QUERY and STL_QUERYTEXT contain the full text of INSERT, UPDATE, and DELETE statements.
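For reference, a minimal sketch of that setup (the user name and password are placeholders; the exact GRANT statements for superuser-only views are listed in the block's README):

-- Create a monitoring user that can see other users' rows in the system tables
-- without full superuser privileges (placeholder name and password).
CREATE USER looker_monitor PASSWORD 'ChangeMe1234' SYSLOG ACCESS UNRESTRICTED;

-- Or adjust an existing user:
ALTER USER looker_monitor SYSLOG ACCESS UNRESTRICTED;

-- Plus the GRANT SELECT statements from the block's README for any
-- superuser-only system views (e.g., STV_LOCKS, STL_TR_CONFLICT).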

Yes, the tables to grant select to (as well as the info about syslog access) are noted in the block’s README file. Specifically, here