In some cases you will get an “Resources exceeded during query execution” error when trying to build a large derived table in BigQuery.
Why this happens
BigQuery is a distributed database. This means most queries execute on multiple nodes. But there are certain operations that require all the data exist on a single node. When that data can no longer fit on that node, you get the “Resources exceeded during query execution” error and the whole query fails.
Which Operations Operate on a Single Node?
Operations that need to see all the data in the resulting table at once have to operate on a single node. Un-partitioned window functions like
RANK() OVER() or
ROW_NUMBER() OVER() will operate on a single node.
When will it matter?
Often you will want to generate a primary key for a derived table. A common way to do this is to use
ROW_NUMBER() OVER() as id to generate a sequence number for a each row in your table. If the table is big we fail with “Resources exceeded” because this function tries to operate on a single node.
Luckily, there’s an easy way around this. You can
PARTITION the the window function by date and build a string as the primary key. For example:
CONCAT(CAST(ROW_NUMBER() OVER(PARTITION BY event_date) AS STRING), '|',(CAST(event_date AS STRING)) as id
If your query contains an
ORDER BY clause, all the data is still going to be passed to a single node and then sorted. If it doesn’t fit on the node then it fails. Since BigQuery is a scanning database (it scans the whole column for each query) sorting the data is unnecessary. So don’t use ORDER BY in ordering derived table results.