Just curious what other companies are using for their data stack and some of the pros/cons? My company is thinking about transitioning from what we currently use so wanted to see what others have tried.
Your company's data team organization
JOIN 2017 - Deep Dive - To Use or Not Use PDT's
2016 January Discourse Digest
Out of curiosity, what are you currently using, and what are the current shortcomings that you’re experiencing?
Production DBs: Postgres
Analytic DB: SQL Server
Analytic workbench: RStudio Server
Some of the issues include integrations with third-party vendors and data latency between the analytic db and the production db. It’s been challenging to setup every integration in SQL server. We’ve tried Segment/Fivetran but the way they ETL data isn’t always in the format or detail we’d like.
We’re exploring Redshift right now.
You’ll find that Redshift, and most analytical databases for that matter, aren’t optimized for microbatch data loads. So production-to-analytical-database latency might remain a problem for some folks, irrespective of the analytical database you choose. If you’re able to make upserts work with a deep copy, you may find yourself in a situation where the Redshift cluster is almost continually backing itself up, depending on the magnitude of these incremental loads. Just a little word of caution there.
Obviously, where analytical databases shine is with historical analyses for which MySQL or PostgreSQL are underpowered and ill-suited. With that said, I’m a huge fan of Redshift as well as Spark (see our pipelines below), and our CTO and Founder, Lloyd (@lloydtabb), is all about BigQuery (he might be able to enumerate its strengths).
Personally, I find Spark to be the most intriguing because it works very well as an analytical data warehouse (SparkSQL), it has built-in machine-learning capabilities (MLib), and it has a microbatch component for near real-time analysis (Spark Streaming). All of the above make Spark a very well-rounded suite of tools that address many use cases at an organization. Redshift—once one gets past its idiosyncrasies and once one fully understand its subtleties—is really quite powerful and will remain our data warehouse technology for the foreseeable future.
I’m happy to kick things off by explaining how we do things at Looker! We have a number of data sources, all of which eventually flow into Redshift and reside in their own schema:
- Event data:
- Events are tracked; an Elastic Beanstalk collector pipes access logs to S3; a parallel job written in Ruby crawls S3 to process logs, outputting JSON back to S3; lastly, an S3-to-Redshift loader, also written in Ruby, copies JSON objects into Redshift relying on JSON schemas. (We also have an alternate version written in Spark/Scala for large-scale batch jobs, if/when needed.)
- License data:
- Runs on MySQL; shell script copies to Redshift.
- Copystorm pipes to MySQL for operational microbatch data; Fivetran pipes to Redshift for historical reporting.
- In-house job pipes to PostgreSQL for operational microbatch data from Github API; shell script copies from PostgreSQL to Redshift for historical reporting.
- Fivetran pipes to Redshift. Spark/Scala job pulls from Redshift, splits chat text/blob into chat/conversational events, and pipes back to S3 for copying.
- Google Calendar:
- In-house job pipes to PostgreSQL for operational microbatch data from Google Calendar API; shell script copies from PostgreSQL to Redshift for historical reporting.
- Google AdWords:
- Fivetran pipes to Redshift.
- Native integration with Salesforce; we get Marketo data when we extract and load Salesforce data.
- In-house job pipes to PostgreSQL for operational microbatch data from Discourse API; shell script pipes from PostgreSQL to Redshift for historical reporting.
- Fivetran pipes to Redshift.
- In-house Ruby job pipes from Harvest API to Redshift for reporting.
Sitting on top of all of this is, of course, Looker. We don’t use anything sophisticated for orchestration, though we’re considering it. At present, we mostly use cron jobs and Ansible on various worker EC2 instances.
You can see that for complex, ever-changing data sources, we’ve opted to rely on FiveTran to handle incremental upserts. We’ve been using them for about a year and, in my opinion, they’ve really matured into a great EL tool.
To your point on the format of resulting data from FiveTran et al: for us, PDTs handle most transformations on moderately sized data sources (e.g., Salesforce, Google, Marketo). For larger, more complex data, we use Spark or take PDT logic out of Looker and turn them into scheduled unload jobs that dump to S3 then copy back into a table with proper compression.
Consistent with the Looker philosophy, we like to start with the data in as raw a format as possible, then let the modeling layer act as our means of defining transformations. Only when scale becomes an issue do we take it out of Looker and move it into the ETL process.
I know that Alooma provides one the opportunity to define in-flight transformations in Python as it moves data from source (typically an API) to destination (Redshift). A Data Scientist on our Services team, Segah Meer (@segahm), has worked quite a bit with the people at Alooma, and could provide more context if that sounds compelling.
Hope this was helpful.
RDS - Postgres 9.4.5
We use postgres as our data warehouse. Our data volume isn’t large enough to warrant the need for a MPP solution such as Redshift, and we like functions that aren’t available in redshift (DISTINCT ON clause, jsonb, regex, materialized views, etc).
Custom batch script that runs on cron every 5 minutes that replicates our data using the SOAP SOQL Api. We manually generate the SOQL statements and load into Postgres.
Same thing, custom batch script that queries the tables we need and loads the data. Makes sure to go back 60-90 days every day so that we don’t miss data that gets retroactively populated.
Our production store is in mongo, so we have a pipeline which in realtime tails the Mongo oplogs and replicates the data to dw. We are working on version 2 where we were separate the producer and consumer and put a streaming queue (Kafka, or Kenesis) in the middle so we can better manage schema changes. This is the most intensive process, very similar to Alooma.
Series of Materialized Views
We have a series of materialized views that get run every 5 minutes to denormalize data and make it more efficient to load into looker.
we did a few posts on the asana eng blog about this topic -
overall, things are scaling reasonably well - although it’s very hard in our B2C2B “consumer-y” freemium model - we get millions of email addresses, and we have to lean heavily on automation, much of it inhouse. i think a key learning for us is that when you reach “big enough data” it is very hard to keep any third party data repositories - like email automation systems, or even salesforce - reliably in sync. tools like Looker that just connect to our first-party owned data are much more leveraged
Scott, I’m curious what data you pull from Google Calendar? In particular if this has to do with areas like CSMs and the onboarding process. .
@Bridge While we grab all calendar events, our Google Calendar data is used primarily as a system of record for milestones in engagements with prospects and customers. That way we can start analyzing how critical events, like co-dev sessions and check-in meetings, affect various outcomes.
Love this conversation I spent probably a week compiling all of what I could at the time find on this topic around the internet. The results are here. Asana’s infrastructure write-ups are definitely referenced, as are those at Braintree, Netflix, Spotify, Zulily, and more. I profiled a total of 11 companies.
Would definitely recommend folks check this out, and I’m happy to answer any questions from what I learned.
Wow. Very nice, Tristan!
If you’re comparing data warehouses, check out this benchmark. Cost & performance comparison for Amazon Redshift, Google Bigquery, and Snowflake. https://blog.fivetran.com/warehouse-benchmark-dce9f4c529c1
Our data stack looks like:
Application and Data: SQL Server: Amazon S3; Amazon Redshift; Node.js; Firebase.
Utilities: Google Analytics; Discourse; Dropbox; Shopify.
Business Tools: Jira; MailChimp; Zendesk; Salesforce Sales Cloud; Skyvia (for loading data to Redshift to analyze it with Looker)