A Framework for Benchmarking Redshift Table Architectures

redshift
etl

Scott Hoover

It’s often useful to benchmark different table architectures when initially setting up Redshift or adding new data. While there’s plenty of literature on best practices, every data set is unique; when it comes to performance, everything depends. In my experience, it’s best to treat the rules of thumb as a loose guide and to test a few different ideas empirically to see which table architecture is ideal for your data. With that in mind, I’ve added a few scripts that we use at Looker to benchmark new tables to a dedicated repo on GitHub. I encourage people to use them, comment, and contribute.
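The rules of thumb mostly come down to distribution style and sort keys. As a rough illustration of what two competing table variants might look like, here is a minimal DDL sketch; the table and column names are hypothetical and not taken from the repo:

```sql
-- Variant 1: key distribution on the join column, compound sort key.
-- (All names here are made up for illustration.)
CREATE TABLE events_distkey_user (
    user_id    BIGINT,
    event_type VARCHAR(64),
    created_at TIMESTAMP
)
DISTSTYLE KEY
DISTKEY (user_id)
SORTKEY (created_at, user_id);

-- Variant 2: even distribution, interleaved sort key for queries that
-- filter on either column with similar frequency.
CREATE TABLE events_even_interleaved (
    user_id    BIGINT,
    event_type VARCHAR(64),
    created_at TIMESTAMP
)
DISTSTYLE EVEN
INTERLEAVED SORTKEY (created_at, event_type);
```

Which of these wins depends entirely on your data volume, skew, and query patterns, which is exactly why it pays to measure rather than guess.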

The primary script is run.sh. First, it creates your table variants. Then, for each variant, it inserts rows from an existing table in the database (you can optionally alter the script to copy from S3, etc.). Next, it executes the pre-defined benchmark queries against each variant and logs the output both to a unique file per table and to a benchmark_summary.txt file. Finally, the temporary tables are dropped.
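In SQL terms, each pass through a variant boils down to something like the sketch below. The actual DDL, source table, and benchmark queries live in the repo and will differ for your data; all names here are hypothetical:

```sql
-- Assuming a variant such as events_distkey_user (see the DDL sketch
-- above) has already been created by the script.

-- Populate the variant from an existing table in the database
-- (or swap this step for a COPY from S3).
INSERT INTO events_distkey_user
SELECT user_id, event_type, created_at
FROM events_source;

-- Execute each pre-defined benchmark query against the variant;
-- run.sh records the execution time of each run in the log files.
SELECT event_type, COUNT(*)
FROM events_distkey_user
WHERE created_at >= '2015-01-01'
GROUP BY event_type;

-- Drop the temporary variant once its benchmarks finish.
DROP TABLE events_distkey_user;
```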

You can also run an R script that summarizes the benchmark results by variant (as recorded in benchmark_summary.txt). Below is example output of the R script. The y-axis captures execution time in milliseconds, and the x-axis lists the n table variants. Each group (denoted 1, 2, …, 5 in the plot) indicates a benchmark query. Within each box plot, the line in the box is the mean; the box’s top and bottom denote the 75th and 25th percentiles, respectively; the whiskers represent the spread beyond the interquartile range; and points above or below them denote Tukey outliers.