Identify & Fix a Non-Performant Join in an Explore

(sam) #1

When your Explore is running slowly, one of the first things you should think about is whether the culprit could be a join. Looker makes it easy to join in several tables at once, and this can be very illuminating for data exploration. However, it can also make things slower if you accidentally use more joins than you need or if a particular join is especially non-performant. This article goes over a few tips to keep your joining from getting out of hand.

#Keep Your Explores Tidy

It can be tempting and sometimes useful to make a large explore that joins in every view in the model. However, this gives the end user the cause possibly unnecessary joins. We recommend breaking up explores into manageable, intuitive segments based on how the data is related.

When exploring the data and building a Look for consumption, consider whether the data is comparing two things from different views that don’t actually require a join – they are possibly they are separate information that just is interesting to see at the same time. If so, break up that one Look into two Looks and put both on a dashboard. That way you see the data side-by-side without requiring a join.

#Reduce the Number of Joins at Runtime

Writing good LookML can reduce the number of joins needed at runtime. When referencing fields in other fields, like with filtered measures, try to keep the fields in the same view so that you don’t have to write a join every time you use that measure.

If you know that you will often be joining two tables together, you might consider writing a PDT that joins them together so that the join can be done in the PDT build instead of at every query.

#Take Advantage of the Join Parameters

While defining joins at the model level, you can help Looker write the most performant join by specifying the correct relationship between your data with the relationship parameter and defining a primary_key for each of your views.

If you’re using a derived table, specifying the proper indexes is a good idea as well.

The type parameter can also help speed up a join — for instance, Looker defaults to a left outer join if you don’t specify a type. However, an inner join is actually faster — just be sure that it is appropriate for your data.

Finally, using the sql_on parameter, you can specify clauses that would normally go in the WHERE clause of your SQL in order to filter the result set brought in by a join preemptively. An example might look something like this:
sql_on: ${view_a.field} = ${view_b.field} AND ${view_a.field} < 626
#Play With SQL Runner
You can test the effect of all of these guidelines using SQL Runner. Play with your query — adding and removing parts to see how the changes affect the speed and the result set. You can use the results you find there to write a slick derived table or update your LookML parameters appropriately.

(Mike DeAngelo (a.k.a. Dr. StrangeLooker)) #2

A few other things can be important…

First, beware of implicit type conversions. If you join columns of different data types, there needs to be a type conversion in order to process the join. The type conversion function can disable an index.

Second, on RedShift be aware of distribution keys. If you join two large tables in RedShift without a common distribution key, it will result in large amounts of intra-node communication. Make sure that your larger tables share a distribution key and your smaller tables have a distribution style of ALL.

(Dirty Looker) #3

I would also say check for indexes for transactional databases. If computation needs to be done to do a JOIN, precalculate that in a PDT or part of the ETL and create a map table.