"Order by" hurts redshift performance

performance
redshift
sql

(Thomas Ellis) #1

Looker frequently (almost always) inserts an order by clause into the auto-generated sql, which can have nasty performance implications for redshift. Here’s an example of a looker-generated query (table names/fields changed for security):

SELECT
ad_impressions.ad_type AS “ad_impressions.ad_type”,
COUNT(DISTINCT ad_clicks.id) AS “ad_clicks.count”,
COUNT(DISTINCT ad_impressions.id) AS “ad_impressions.count”
FROM public.ad_impressions AS ad_impressions
LEFT JOIN public.ad_clicks AS ad_clicks
ON ad_impressions.id = ad_clicks.ad_impression_id
AND ad_impressions.user_id = ad_clicks.user_id
AND ((( ad_clicks.created_ts ) >= ((TIMESTAMP ‘2016-10-12’))
AND ( ad_clicks.created_ts ) < ((DATEADD(day,1, TIMESTAMP ‘2016-10-12’ )))))
WHERE
ad_impressions.page_type = ‘home’
GROUP BY 1
ORDER BY 2 DESC
LIMIT 500

The tables are distributed on user_id and sorted on created_ts & ad_impressions.id - this is a very performant set up for a our normal usage. However, the query cost for the above comes back as

XN Limit (cost=1000025567867.99…1000025567868.26 rows=106 width=43)

With the ‘order by’ removed, the query cost is only:
XN Limit (cost=25567863.63…25567864.43 rows=106 width=43)

This is a general issue with redshift - using an order by that doesn’t match sortkeys can increase costs by several orders of magnitude. The problem is that I can’t find any way to prevent looker from using an order by clause. Is that possible?

Thanks!


(Segah A. Mir) #2

Generally…and it’s a bit of a pain, but you could use order_by_field for all of these unsorted fields.

This does not work for all cases, however, because you won’t be able to sort on that field normally if its sort order is different from the sortkey.

But my advice based on your query:

  • forget the above
  • set sortkey on time
  • force a sortkey using always_filter

And it won’t matter as much which field you are sorting.

Why?
It looks like you are interested only in clicks and impressions from October (from your time join condition). Correct me if I am wrong, but that join forces impressions to also come from October.

However, the join is not limiting the search to a subset of data because your WHERE clause does not have any time logic in it. So while you are only interested in October, you are scanning the full history.

Once you limit the history to a subset, the sort will be on a much smaller subset => main Redshift’s efficiency (i.e. perform most operations on small subsets)


(Morgan Imel) #3

Hi @tme! One thing to note is that if the first column selected in the explore is defined as a primary key, Looker will not apply an ORDER BY clause to the SQL. It sounds like this isn’t a viable workaround for your situation, so I’d be happy to pass the feedback along.


(sam) #4

An update here: Sorts can now be turned off when downloading!