BigQuery: Support JOIN EACH (in join types or by detection)

BigQuery requires that the right-side table in a JOIN contains less than 8MB of compressed data. If it contains more, you need to use JOIN EACH. (See https://cloud.google.com/bigquery/query-reference#joins for details).

Looker doesn’t support these at the moment. In most cases, you can workaround this using the “sql:” clause, e.g.

- join: test
      sql: LEFT OUTER JOIN EACH test ON test.a = master.a

However, I’ve run into a case where I’m joining to a derived table. It doesn’t appear to be possible to have Looker substitute the derived table SQL into the “sql:” clause (please advise if it is and I’m missing something), so you have to reproduce the entire derived table’s SQL in the clause, e.g.

- join: test
      sql: LEFT OUTER JOIN EACH (SELECT * FROM mytable) test ON test.a = master.a

This obviously gets very messy for non-trivial derived tables.

I think there are two approaches to solving for this:

  1. Automatically detect when BigQuery returns a resources exceeded error and retry the query with a JOIN EACH. (Note this would also work for cases where a GROUP EACH BY is needed). Tableau handles EACH seamlessly and so I believe they may have taken this approach
  2. Add additional join types for joins that include EACH modifiers (http://www.looker.com/docs/reference/explore-params/type-for-join) – or include a “manual” one where the user can specify the exact text for the “LEFT OUTER JOIN EACH” bit of the JOIN clause

We’re on this. Should have something for you soon.

We’ve implemented some new join types to suport EACH for bigquery. For all the normal join types there is an additional each type, so left_outer_each, inner_each, full_outer_each, and cross_each

Here is an example of it in use.

https://learnbeta.looker.com/projects/bqlook/files/bqlook.model.lookml

- explore: order_items
  joins:
    - join: orders
      foreign_key: order_id
      type: left_outer_each

    - join: inventory_items
      foreign_key: inventory_item_id
      type: left_outer_each

    - join: users
      foreign_key: orders.user_id
      type: left_outer_each

    - join: products
      foreign_key: inventory_items.product_id
      type: left_outer_each

The GROUP EACH BY is more troubling. We’re looking at some solutions. There really is no way to know in advance if the query will fail. Rerunning upon failure seems like such a amateur solution.

Great, thanks. I tested and it works apart from the CROSS JOIN:

  1. BigQuery doesn’t actually support CROSS JOIN EACH so you can remove that one
  2. I think I uncovered a bug with how Looker handles CROSS JOIN for BigQuery

I also have some thoughts on GROUP EACH BY, at the end of the response.

More detail on (2), the CROSS JOIN bug:
If I don’t include a sql_on clause, I get a Looker error:

- explore: inventory_items
  joins:
    - join: products
      type: cross

=> join[products] in inventory_items, missing or mis-spelled join method

If I do include sql_on, Looker generates invalid SQL:

- explore: inventory_items
  joins:
    - join: products
      type: cross
      sql_on: 1=1

Generates:

SELECT 
	inventory_items.cost AS inventory_items_cost,
	products.brand AS products_brand
FROM thelook.inventory_items AS inventory_items
CROSS JOIN thelook.products AS products ON 1=1

GROUP BY 1,2
ORDER BY 1 
LIMIT 500

=> Failed to retrieve data - CROSS JOIN cannot have ON clause

Thoughts on GROUP EACH BY
On the GROUP EACH BY, one solution would be to have an explore level setting that species that all queries from the explore should use GROUP EACH BY (or even better, allow it to be put at the JOIN level too, so that it is only triggered if a certain join which blows out the dataset is used). There’s a performance hit from using EACH when you don’t need to but I suspect that other options will get a lot more complex.

In case it helps, here’s an excerpt from the BigQuery book written by the Google PM (which I’d highly recommend if you are looking to better understand BigQuery’s nuances or how it works behind the scenes).

JOIN EACH and GROUP EACH If you run a query that has a GROUP BY operation where the number of resulting rows is high, the query may fail with a Resources Exceeded error. Likewise, if you try to do a JOIN operation on two large tables, you may get the same error. In both of these cases, the EACH keyword can come to the rescue. EACH is a hint to the BigQuery query optimizer that instructs it to perform a shuffle operation. Shuffle is described in detail in Chapter 9, but for now it can be described as sorting the data to process more of the data in parallel. At some time in the future, BigQuery may be smart enough to infer the table size so that you won’t need to use a qualifier, but as of this writing, the EACH keyword is required when performing “big” JOINs or certain GROUP BYs. It isn’t recommended to just add EACH to all your queries; for many types of queries (JOIN of one large table against a smaller one, or GROUP BY with only a few distinct values), using EACH may be significantly slower. A reasonable rule of thumb is that if the smaller table in the JOIN is more than 1 million rows, or the GROUP BY has more than a million distinct values, you’re probably better off using EACH.

Tigani, Jordan; Naidu, Siddartha (2014-05-21). Google BigQuery Analytics (Kindle Locations 6219-6233). Wiley. Kindle Edition.

Thanks. guess I’m leaning toward the rerun on failure. Ugly, but I don’t see a better way.