You'll find that Redshift, and most analytical databases for that matter, aren't optimized for microbatch data loads. So production-to-analytical-database latency might remain a problem for some folks, irrespective of the analytical database you choose. If you're able to make upserts work with a deep copy, you may find yourself in a situation where the Redshift cluster is almost continually backing itself up, depending on the magnitude of these incremental loads. Just a little word of caution there.
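To make that concrete, here's a rough sketch of what an upsert-via-deep-copy can look like on Redshift: load the increment into a staging table, rebuild the target from the untouched rows plus the increment, and swap names. The table, key column, S3 path, and IAM role below are placeholders, and it's written in Python rather than our actual Ruby/shell jobs.

```python
# Hypothetical deep-copy upsert against Redshift; every identifier and path
# below is a placeholder. The whole batch runs in one transaction.
import psycopg2

DEEP_COPY_UPSERT = """
CREATE TEMP TABLE events_increment (LIKE events);

COPY events_increment
FROM 's3://my-bucket/incremental/events/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
FORMAT AS JSON 'auto'
GZIP;

-- Rebuild the table: keep rows that weren't updated, then add the increment.
CREATE TABLE events_new (LIKE events);

INSERT INTO events_new
SELECT * FROM events
WHERE event_id NOT IN (SELECT event_id FROM events_increment);

INSERT INTO events_new
SELECT * FROM events_increment;

ALTER TABLE events RENAME TO events_old;
ALTER TABLE events_new RENAME TO events;
DROP TABLE events_old;
"""

conn = psycopg2.connect(host="my-cluster.example.us-east-1.redshift.amazonaws.com",
                        port=5439, dbname="analytics", user="loader", password="...")
with conn, conn.cursor() as cur:
    cur.execute(DEEP_COPY_UPSERT)  # committed when the with-block exits
conn.close()
```

Run something like that every few minutes on a large table and you can see why the cluster ends up spending most of its time copying itself.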
Obviously, where analytical databases shine is with historical analyses for which MySQL or PostgreSQL are underpowered and ill-suited. With that said, I'm a huge fan of Redshift as well as Spark (see our pipelines below), and our CTO and Founder, Lloyd (@lloydtabb), is all about BigQuery (he might be able to enumerate its strengths).
Personally, I find Spark to be the most intriguing because it works very well as an analytical data warehouse (Spark SQL), it has built-in machine-learning capabilities (MLlib), and it has a microbatch component for near-real-time analysis (Spark Streaming). All of the above make Spark a very well-rounded suite of tools that address many use cases at an organization. Redshift, once one gets past its idiosyncrasies and fully understands its subtleties, is really quite powerful and will remain our data warehouse technology for the foreseeable future.
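For a flavor of what I mean about Spark doubling as an analytical warehouse, here's a minimal PySpark sketch (our real jobs are in Scala, and the S3 path, fields, and query are all invented) that reads raw JSON events and queries them with Spark SQL:

```python
# Minimal PySpark example: load raw JSON events and query them with SQL.
# Everything here (path, schema, query) is illustrative only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("event-exploration").getOrCreate()

# Read raw JSON events from S3 and expose them as a SQL view.
events = spark.read.json("s3a://my-bucket/events/2016/")
events.createOrReplaceTempView("events")

daily_counts = spark.sql("""
    SELECT to_date(event_at) AS day, event_type, count(*) AS n
    FROM events
    GROUP BY to_date(event_at), event_type
    ORDER BY day, event_type
""")
daily_counts.show()
```

From the same session you can hand those DataFrames to MLlib or run equivalent logic over a stream, which is what makes the suite so well-rounded.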
I'm happy to kick things off by explaining how we do things at Looker! We have a number of data sources, all of which eventually flow into Redshift and reside in their own schema:
- Event data: events are tracked; an Elastic Beanstalk collector pipes access logs to S3; a parallel job written in Ruby crawls S3 to process the logs, outputting JSON back to S3; lastly, an S3-to-Redshift loader, also written in Ruby, copies the JSON objects into Redshift, relying on JSON schemas (a rough sketch of that COPY step follows this list). (We also have an alternate version written in Spark/Scala for large-scale batch jobs, if/when needed.)
- Runs on MySQL; shell script copies to Redshift.
- Salesforce: Copystorm pipes to MySQL for operational microbatch data; Fivetran pipes to Redshift for historical reporting.
- GitHub: an in-house job pipes data from the GitHub API to PostgreSQL for operational microbatch data; a shell script copies from PostgreSQL to Redshift for historical reporting.
- Fivetran pipes to Redshift. A Spark/Scala job pulls from Redshift, splits the chat text blob into conversational events, and pipes the results back to S3 for copying into Redshift.
- Google Calendar: an in-house job pipes data from the Google Calendar API to PostgreSQL for operational microbatch data; a shell script copies from PostgreSQL to Redshift for historical reporting.
- Fivetran pipes to Redshift.
- Marketo: native integration with Salesforce; we get Marketo data when we extract and load Salesforce data.
- Discourse: an in-house job pipes data from the Discourse API to PostgreSQL for operational microbatch data; a shell script copies from PostgreSQL to Redshift for historical reporting.
- Fivetran pipes to Redshift.
- Harvest: an in-house Ruby job pipes data from the Harvest API to Redshift for reporting.
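Since the event-data bullet above glosses over the final hop, here's a rough sketch of what an S3-to-Redshift COPY with a JSONPaths file looks like (ours is a Ruby job; the bucket, table, columns, and role below are placeholders):

```python
# Hypothetical S3-to-Redshift loader step. A JSONPaths file tells COPY how to
# map JSON fields to table columns; all names and paths here are placeholders.
import json
import psycopg2

# Write the JSONPaths file locally; in practice it gets uploaded to S3
# where COPY can read it.
jsonpaths = {"jsonpaths": ["$.event_id", "$.user_id", "$.event_type", "$.event_at"]}
with open("events_jsonpaths.json", "w") as f:
    json.dump(jsonpaths, f)

COPY_SQL = """
COPY events (event_id, user_id, event_type, event_at)
FROM 's3://my-bucket/processed/events/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
FORMAT AS JSON 's3://my-bucket/schemas/events_jsonpaths.json'
TIMEFORMAT 'auto'
GZIP;
"""

conn = psycopg2.connect(host="my-cluster.example.us-east-1.redshift.amazonaws.com",
                        port=5439, dbname="analytics", user="loader", password="...")
with conn, conn.cursor() as cur:
    cur.execute(COPY_SQL)
conn.close()
```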
Sitting on top of all of this is, of course, Looker. We don't use anything sophisticated for orchestration, though we're considering it. At present, we mostly use cron jobs and Ansible on various worker EC2 instances.
You can see that for complex, ever-changing data sources, we've opted to rely on Fivetran to handle incremental upserts. We've been using them for about a year and, in my opinion, they've really matured into a great EL tool.
To your point on the format of the resulting data from Fivetran et al.: for us, PDTs handle most transformations on moderately sized data sources (e.g., Salesforce, Google, Marketo). For larger, more complex data, we use Spark, or we take the PDT logic out of Looker and turn it into scheduled unload jobs that dump to S3 and then copy back into a table with proper compression.
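To make that concrete, here's a hedged sketch of what one of those scheduled jobs might look like; the query, table, bucket, and role are placeholders rather than our actual logic. The idea is simply: UNLOAD the PDT-style query's result set to S3 as compressed files, then COPY it back into a purpose-built table.

```python
# Hypothetical "PDT turned into an unload job": UNLOAD a transform's results
# to S3, then reload them into a dedicated table. All names are placeholders.
import psycopg2

UNLOAD_SQL = """
UNLOAD ('SELECT user_id, min(event_at) AS first_seen_at FROM events GROUP BY user_id')
TO 's3://my-bucket/unloads/user_first_seen/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-etl'
DELIMITER '|' GZIP ALLOWOVERWRITE;
"""
# Note: any single quotes inside the UNLOAD query would need to be escaped.

RELOAD_SQL = """
TRUNCATE user_first_seen;  -- TRUNCATE commits immediately in Redshift

COPY user_first_seen
FROM 's3://my-bucket/unloads/user_first_seen/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-etl'
DELIMITER '|' GZIP;
"""

conn = psycopg2.connect(host="my-cluster.example.us-east-1.redshift.amazonaws.com",
                        port=5439, dbname="analytics", user="etl", password="...")
with conn, conn.cursor() as cur:
    cur.execute(UNLOAD_SQL)
    cur.execute(RELOAD_SQL)
conn.close()
```

A job like this just gets a cron entry on one of the worker instances mentioned above.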
Consistent with the Looker philosophy, we like to start with the data in as raw a format as possible, then let the modeling layer act as our means of defining transformations. Only when scale becomes an issue do we take it out of Looker and move it into the ETL process.
I know that Alooma lets you define in-flight transformations in Python as it moves data from source (typically an API) to destination (Redshift). A Data Scientist on our Services team, Segah Meer (@segahm), has worked quite a bit with the people at Alooma and could provide more context if that sounds compelling.
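For anyone curious, this is roughly what those in-flight transformations look like as I understand Alooma's code engine; treat the exact interface as an assumption on my part, and the field names are made up. You supply a Python function that receives each event as a dict and returns the (possibly modified) dict, or None to drop it, before the event lands in Redshift.

```python
# Sketch of an Alooma-style in-flight transformation (interface assumed,
# field names invented): mutate or filter each event before it hits Redshift.
def transform(event):
    # Drop internal test traffic before it ever reaches the warehouse.
    if event.get("email", "").endswith("@example-internal.com"):
        return None

    # Normalize a field so downstream modeling doesn't have to.
    if "country" in event:
        event["country"] = event["country"].strip().upper()

    return event
```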
Hope this was helpful.