BigQuery daily snapshots

etl

(Aaron Bostick) #1

Continuing the discussion from ETL Tool Recommendations:

Hi @lloydtabb, thanks for the info!

I saw the post Incremental PDTs on BigQuery - or how to produce Daily Snapshots on SCD Type 1 Schemas as well, and the TABLE_DATE_RANGE feature makes me wonder what the best approach is.

My first thought was simply to append each day's snapshot to the same table with a snapshot-date column added, but your post here seems to suggest it is better to create a new table each day with the snapshot date in the table name?
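
Concretely, I mean something like this (table and column names are made up):

Option A, one table with a snapshot_date column appended each day:

SELECT * FROM [mydataset.daily_snapshot] WHERE snapshot_date = '2016-01-15'

Option B, one table per day with the date in the table name:

SELECT * FROM [mydataset.daily_snapshot_20160115]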

Being a BigQuery novice still, is one clearly better than the other?

For context, I am currently cross joining a simple calendar date table against my table of interest and deriving my historical numbers with date logic. This works but, as you can guess, is not fast, since I am effectively multiplying my table's row count by the number of dates I care to process…
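
Roughly, the pattern is (again with made-up names):

SELECT d.calendar_date, COUNT(i.id) AS active_items
FROM [mydataset.calendar] d
CROSS JOIN [mydataset.items] i
WHERE i.created_at <= d.calendar_date
  AND (i.closed_at IS NULL OR i.closed_at > d.calendar_date)
GROUP BY d.calendar_date

so every row of items gets evaluated once per calendar date.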


(lloyd tabb) #2

Aaron, it depends on the size of the table. If you have a transactional table, it is best to break it into one table per day, but only if it is somewhat big (say more than a few gigabytes).

There are limits on the number of tables you can union together, so if you are looking at really long timeframes, you might not want to use days.
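
For example, if the per-day tables are named like events_YYYYMMDD (a made-up prefix), TABLE_DATE_RANGE unions just the days you ask for:

SELECT COUNT(*)
FROM (TABLE_DATE_RANGE([mydataset.events_],
                       TIMESTAMP('2016-01-01'),
                       TIMESTAMP('2016-01-07')))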

BigQuery does a full table scan for each query (believe it or not).

best,
lloyd


(Josh Siegel) #3

The only way to avoid a full table scan (on the columns you are querying) is to use table decorators: https://cloud.google.com/bigquery/table-decorators

If you have a 10 TB table for one day of data and only want to query the last 30 minutes to save time and cost, this is really, really helpful (though not supported in Looker today).

I think BQ might be building other features to help solve this too.

e.g. get table data added between one hour and half an hour ago:

SELECT COUNT(*) FROM [data-sensing-lab:gartner.seattle@-3600000--1800000]
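
A snapshot decorator is the single-offset variant, e.g. the same table as it was one hour ago:

SELECT COUNT(*) FROM [data-sensing-lab:gartner.seattle@-3600000]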


(Sergei Kharchenko) #4

Loading data into Google BigQuery is much faster and simpler if you use a third-party service like Skyvia Data Integration. Its main advantages:

  • no need to write any code
  • setup takes just a few clicks
  • flexible update frequency (from once a day to once a minute)
  • a Change Data Capture feature, which uploads only new data

(jim) #5

We used the Rivery service to split our data into tables by Logic Date rather than by daily snapshot, and it was great and super cost-effective.