ETL Tool Recommendations


(Aaron Bostick) #21

Most of the tools listed here only support Redshift as the destination warehouse.

I'm a Google customer - can you recommend a vendor that supports BigQuery as the warehouse?

I’m sourcing from MySQL and am interested in “snapshotting” certain data in daily intervals.


(lloyd tabb) #22

Hey @aaron_bostick, good to see you.

I’ve been spending lots of time with BigQuery and it is pretty awesome. Here is a simple script to move files from MySQL to BigQuery.

https://gist.github.com/shantanuo/4466298

Data load in BigQuery is pretty easy and straightforward.

Loading Data into BigQuery
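And if you want the daily snapshots you mentioned, here's a rough sketch using the google-cloud-bigquery Python client with a day-partitioned table. The dataset, table, and query names are placeholders (not from the gist above), so treat this as an outline rather than a drop-in script.

```python
# Sketch: export a MySQL query to CSV, then load it into a day-partitioned
# BigQuery table so each run lands in its own partition.
# Host, dataset, table, and query names are placeholders.
import csv
import datetime

import pymysql                      # pip install pymysql
from google.cloud import bigquery   # pip install google-cloud-bigquery

SNAPSHOT_DATE = datetime.date.today().isoformat()
CSV_PATH = f"/tmp/orders_{SNAPSHOT_DATE}.csv"

# 1. Dump the rows we want to snapshot out of MySQL.
conn = pymysql.connect(host="mysql-host", user="reader", password="...", db="app")
with conn.cursor() as cur, open(CSV_PATH, "w", newline="") as f:
    cur.execute("SELECT id, status, total, updated_at FROM orders")
    csv.writer(f).writerows(cur.fetchall())
conn.close()

# 2. Append the CSV to a day-partitioned BigQuery table.
client = bigquery.Client(project="my-project")
table_id = "my-project.analytics.orders_snapshots"
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    write_disposition="WRITE_APPEND",
    time_partitioning=bigquery.TimePartitioning(type_=bigquery.TimePartitioningType.DAY),
    schema=[
        bigquery.SchemaField("id", "INTEGER"),
        bigquery.SchemaField("status", "STRING"),
        bigquery.SchemaField("total", "NUMERIC"),
        bigquery.SchemaField("updated_at", "TIMESTAMP"),
    ],
)
with open(CSV_PATH, "rb") as f:
    client.load_table_from_file(f, table_id, job_config=job_config).result()
```

Run it once a day (cron, or whatever scheduler you already have) and each snapshot lands in that day's partition.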


BigQuery daily snapshots
(jim ) #23

We also faced the same issue… and we chose Rivery.
It's very easy to use for loading and transforming data on BigQuery.


#24

Hey guys! @segahm is right when he says that there’s a lot more to consider when it comes to choosing an ETL solution.

Some other thoughts: you also want to make sure that you find a tool with some flexibility. A big problem with many of the options available today is that the user can’t customize their data delivery schedule, extract specific relevant data or connect all of their sources.

In other words, they can’t choose what data they want to move, where they want it to go, or how they want it to look, and this can be a really frustrating limitation.

@zhill @kevin, Xplenty lets you easily integrate, transform and process data from more than 100 data stores and SaaS applications. Their platform is also very straightforward and user-friendly - no code and no developers required. www.xplenty.com


(George Fraser) #25

Appreciate the kind words in this thread about Fivetran (of which I am CEO); I just want to chime in to say we do now support Oracle as a source and BigQuery as a destination, to agree with @segahm’s general formulation, and to share a couple of general observations we’ve made after seeing many companies’ projects:

First, the key questions you should ask about any ETL connector are:

  • What happens when a column is added in the source?
  • What happens when the TYPE of a column is changed in the source?
  • What happens when a row is deleted in the source?

Second, nearly everyone should use SOME commercial ETL tool rather than build it themselves. It’s not that hard to write some scripts that sync your data; the hard part is keeping up with schema changes, and the REALLY hard part is dealing with turnover on your data team. For whatever reason, data engineering teams experience a lot of turnover. In my experience, there are only two scenarios where it CAN make sense to write your own ETL:

  1. Your source schema is extremely stable so you rarely have to update your scripts.
  2. ETL is part of your core business and you have a dedicated team that will maintain and monitor your ETL.
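To make the schema-change point concrete, here's a rough illustration (not Fivetran code, just a sketch of what a hand-rolled sync script ends up needing) of detecting columns added in the source and adding them to the destination before each load. Table names, connections, and the type map are placeholders.

```python
# Illustration only: one of the "hard parts" of hand-rolled ETL is noticing
# that a column was added in the source and propagating it to the destination.
import pymysql      # source: MySQL
import psycopg2     # destination: Postgres (Redshift speaks the same protocol)

def source_columns(mysql_conn, table):
    with mysql_conn.cursor() as cur:
        cur.execute(
            "SELECT column_name, data_type FROM information_schema.columns "
            "WHERE table_name = %s", (table,))
        return dict(cur.fetchall())

def dest_columns(pg_conn, table):
    with pg_conn.cursor() as cur:
        cur.execute(
            "SELECT column_name, data_type FROM information_schema.columns "
            "WHERE table_name = %s", (table,))
        return dict(cur.fetchall())

def sync_schema(mysql_conn, pg_conn, table, type_map):
    """Add columns that exist in the source but not yet in the destination."""
    src = source_columns(mysql_conn, table)
    dst = dest_columns(pg_conn, table)
    with pg_conn.cursor() as cur:
        for col in set(src) - set(dst):
            cur.execute(
                f'ALTER TABLE {table} ADD COLUMN "{col}" {type_map.get(src[col], "TEXT")}'
            )
    pg_conn.commit()
    # Type changes and deletes are harder still: you have to decide whether to
    # rewrite history, soft-delete, or keep "deleted" rows around forever.
```

And that's just added columns; handling type changes and deletes is where most home-grown pipelines quietly rot.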

Lastly, as has been pointed out, Fivetran does not support transformation, because we believe this to be an anti-feature: in the long run, it’s a lot easier to maintain Looker derived tables. But not everyone agrees, and this is an important difference between the no-transformation vendors (us, Stitch) and our pro-transformation friends (Alooma, ETLeap, Boomi, …).


(John Norman) #26

For Redshift, Matillion has been pretty sweet . . .


(Jake Stein) #27

This is a great discussion. Here’s one more aspect that may be helpful as you’re evaluating options:

The trend we’re seeing at Stitch (where I’m co-founder & CEO) is that, in addition to popular data sources like MySQL and Salesforce, more and more people have a few data sources that are much less common. There’s been an explosion in the number of tools that companies use to run their business, with a 2017 survey of the marketing technology landscape finding over 5,000 different products (up from 150 in 2011).

This is obviously important if your current ETL needs include a data source that no commercial ETL vendors support. However, it’s also important to choose a tool that will enable you to incorporate whatever new sources you will need in the future.

To do this, make sure that you choose a vendor that has a great API for importing data (which is true of many of the folks mentioned on this thread). With that, you can always write a script that pushes data to their API to be consolidated with your other data sources.
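As a flavor of what such a push script might look like, here's a hypothetical sketch; the endpoint, token, and payload shape are made up, so check your vendor's API docs for the real contract.

```python
# Hypothetical sketch of pushing rows from an uncommon source into an ETL
# vendor's import API. The URL, token, and payload format are invented;
# substitute your vendor's actual contract.
import json
import urllib.request

IMPORT_URL = "https://api.example-etl-vendor.com/v1/import/my_custom_source"
API_TOKEN = "..."  # from the vendor's dashboard

def push_records(records):
    body = json.dumps({"table": "my_custom_source", "records": records}).encode()
    req = urllib.request.Request(
        IMPORT_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_TOKEN}",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

if __name__ == "__main__":
    push_records([{"id": 1, "plan": "pro", "signed_up": "2017-06-01"}])
```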

There are some drawbacks to this approach that you should keep in mind:

  • If you’re reading this thread, you’re trying to get away from building ETL scripts in the first place
  • You’re on the hook for maintaining these scripts, which is a much bigger job than the initial build
  • You then have to do the operational work to make sure that these scripts run indefinitely and notify the right people when there is a problem with authorization, data volume, API quotas, etc. This is the biggest job of all.

This is why we launched Singer.io, an open source project for ETL integrations. There are more than 20 open source integrations to data sources (or “taps” as we call them), and more are being built all of the time. If you’re building a new one, you can reuse code from the existing taps and helper utilities. We also have partners who can build custom taps.

These taps are maintained by Stitch and the community, so you don’t have to bear that burden on your own. Finally, we can incorporate any integration written to the Singer specification directly into our system, so you can run Singer taps on your own hardware or from within Stitch.
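For a flavor of what a tap looks like, here's a minimal sketch using the singer-python helper library; the stream name, schema, and rows are placeholders.

```python
# Minimal Singer tap sketch: emit a SCHEMA message, some RECORD messages,
# and a STATE bookmark to stdout, where a Singer target consumes them.
# Stream name and rows are placeholders.
import singer  # pip install singer-python

schema = {
    "properties": {
        "id": {"type": "integer"},
        "email": {"type": "string"},
        "updated_at": {"type": "string", "format": "date-time"},
    }
}

def main():
    singer.write_schema("users", schema, key_properties=["id"])
    rows = [
        {"id": 1, "email": "a@example.com", "updated_at": "2018-01-01T00:00:00Z"},
        {"id": 2, "email": "b@example.com", "updated_at": "2018-01-02T00:00:00Z"},
    ]
    singer.write_records("users", rows)
    # Bookmark so the next run can pick up incrementally.
    singer.write_state({"users": {"updated_at": "2018-01-02T00:00:00Z"}})

if __name__ == "__main__":
    main()
```

You then pipe its output into a target, e.g. `python tap_users.py | target-csv`, or run it within Stitch.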


(Katie Chin) #29

To see how easy it is to use Fivetran with Looker, check out this video: https://www.youtube.com/watch?v=fWfyVtAQ6WI&feature=youtu.be


(Greg Kogan) #30

Etleap does exactly that: ETL data from Salesforce, Facebook, NetSuite, S3, Marketo, MySQL, and many other sources into a Redshift data warehouse.

A common use case we hear at Etleap (I’m on the team) is teams wanting to give analysts the ability to connect and manage the data pipelines they need to get more done in Looker. Otherwise they would need to depend on the engineering team to set up and maintain these pipelines, which can take weeks and break often. Meanwhile, engineering is happy to free up their time from creating and maintaining these pipelines.

Since you can do transformations within Looker, for some companies it’s enough to have a data pipeline service that just ingests data and loads it into the warehouse (that is, ETL without the T, or ELT), such as Segment, Stitch, and Fivetran.

Load first, transform later.

However, as the complexity of your data goes up, so does the need for transformations outside the warehouse (and Looker).

What’s more, if your company has strict security requirements around data—think health, financial services, enterprise software providers, etc—loading everything sight-unseen into a warehouse is a non-starter. For cases like that, being able to transform data before it hits the warehouse (and Looker) can be the difference between passing and failing a security review.

All that to say…

It depends on the specific use case and requirements.

Hopefully this helps you think through that and points you to a few good options.


(Osvald Markus) #31

In my opinion, there are quite a lot of cool and interesting tools out there. I personally use Skyvia and am completely satisfied with it.
In my company we use Skyvia to load data from Salesforce to Redshift on a daily basis, which is enough for our analytics department.


(David Ames) #32

I’m looking for some ETL (or ELT) /DW Automation tool recommendations too - but from a slightly different perspective.

Source is SQL Server, destination is PostgreSQL, with the option of Redshift or Snowflake at a later date. I'm after something that plays well with Looker, maybe even generating some base LookML from its internal models.

The really tricky parts for us are

  1. Easy to use - we want our autonomous development teams to be able to manage their own destination schema and ETL. They are not ETL developers, but they know SQL very well.
  2. We are in a multi-tenant environment - around 400+ source databases to a similar number of destination databases (but all with the same schema).

(George Fraser) #33

Multi-tenant is really tricky. We (Fivetran) still don’t have a great out-of-the-box solution for sharded DBs, though we have some things coming out later this year that will drastically improve the situation. You should look at all the vendors, and I recommend talking to DAS42, a consultancy we’ve worked with on sharded DBs. Some key questions are:

  1. Is the set of 400 source databases relatively fixed, or is it adjusting dynamically all the time?
  2. Do all 400 source databases have the exact same schema, or is it possible for the schemas to differ?
  3. What’s the total data size?

DO NOT try to put 400 SQL Servers into a single Postgres destination. It will die. Really, your only option here is going to be Snowflake. Postgres will get killed by the data volume, Redshift will get killed by the concurrency. BigQuery might be a workable option, though you didn’t mention it.


(David Ames) #34

@George_Fraser it’s 400 and increasing at a small rate each year.

There is no requirement to aggregate data across the 400 source DBs into a single DW - i.e., we could have 400 DWs and 400 ETL processes, with each customer reporting off their own DW using Looker.

Obviously automation is key here - we don’t want to manually manage 400 schemas, ETL definitions, etc.

The 400 source DBs have the exact same schema; however, the destination DWs’ schemas will be ~90% the same. There are some attribute-value pair tables and other flexible data structures that will be loaded into customer-specific dimensions. These transforms can all be rule/code driven (the ability for us to hook custom code into the pipeline is a must).
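As an illustration of the kind of rule-driven pivot I mean (table, attribute, and column names are just placeholders):

```python
# Illustration: rule-driven pivot of an attribute-value pair table into
# customer-specific dimension columns. Names are placeholders.
from collections import defaultdict

# Rows as they come out of the source: (entity_id, attribute, value)
eav_rows = [
    (1, "industry", "retail"),
    (1, "region", "ANZ"),
    (2, "industry", "health"),
]

# Per-tenant rule: which attributes become dimension columns.
tenant_rules = {"tenant_001": ["industry", "region"]}

def pivot(rows, attributes):
    """Turn EAV rows into one dict per entity with the chosen attributes as columns."""
    out = defaultdict(dict)
    for entity_id, attr, value in rows:
        if attr in attributes:
            out[entity_id][attr] = value
    return [{"entity_id": k, **v} for k, v in out.items()]

print(pivot(eav_rows, tenant_rules["tenant_001"]))
# [{'entity_id': 1, 'industry': 'retail', 'region': 'ANZ'}, {'entity_id': 2, 'industry': 'health'}]
```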

Most source data is <10 GB per tenant, with a few up around 150 GB - nothing over 250 GB.

Total size in the largest region is <2 TB, and this is spread across 6 regions.

Current thinking is a bunch of Postgres RDS instances. Snowflake would be REALLY nice for this use case, however data sovereignty/control adds another layer of complexity - still doable if it’s the best option.


(Lee Schlesinger) #35

Most of the tools mentioned in this thread should be able to go from SQL Server to PostgreSQL (or Redshift or Snowflake).

Some of Looker’s partners have generated Looker Blocks for popular sources (for example, Stitch has contributed ones for Facebook Ads, Salesforce, and others), but I’m not sure you’ll find a solution from an ETL vendor that generates LookML based on your custom DB schema.

For easy-to-use and SQL-based, you want to go ELT rather than ETL, and then use dbt for transformation. ELT tools eliminate a lot of the complexity of traditional ETL by separating out the SQL, and dbt lets you manage SQL-based models and transforms.

Multi-tenant: Do you actually want to go to 400 destinations? If you’re going to be using Looker on top of the data, it would probably make your life easier if you consolidated the data to a single destination, or as few as possible.

Either way, if you have 400+ source DBs (and possibly another 400+ destinations), you may want to set up these connections programmatically rather than via a UI.
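As a sketch of what programmatic setup could look like, assuming your vendor exposes a connection-creation API - the endpoint and field names below are hypothetical:

```python
# Hypothetical sketch: create one pipeline per tenant via a vendor's REST API
# instead of clicking through a UI 400 times. Endpoint and field names are
# made up; substitute your vendor's actual API.
import requests  # pip install requests

API = "https://api.example-etl-vendor.com/v1/connections"
TOKEN = "..."

tenants = [
    {"name": "tenant_001", "host": "sql01.internal", "database": "tenant_001"},
    {"name": "tenant_002", "host": "sql01.internal", "database": "tenant_002"},
    # ... generated from your tenant registry, not hand-written
]

def create_connection(tenant):
    payload = {
        "type": "sqlserver",
        "name": tenant["name"],
        "config": {"host": tenant["host"], "database": tenant["database"]},
        "destination_schema": tenant["name"],  # one schema per tenant
    }
    resp = requests.post(
        API,
        json=payload,
        headers={"Authorization": f"Bearer {TOKEN}"},
    )
    resp.raise_for_status()
    return resp.json()

for t in tenants:
    create_connection(t)
```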


(David Ames) #36

Automation is an absolute must - which is why we love Looker so much (we can generate LookML models in code where required)

dbt’s model-as-code & transforms-as-code aligns well to our philosophy.

Looking forward to getting my hands dirty with dbt.

Thank you for the suggestion.


(Sergey Dovganich) #37

Renta is my favorite ETL tool - give it a try: https://renta.im/


#38

I also just finished an evaluation for my company between three of the tools mentioned above.

The one we chose was Rivery (Rivery.io). We found it the easiest to implement, especially when working with Snowflake.

We got access to the variety of APIs we need; the only one we were missing they promised to deploy within a week, and in the meantime they helped us build a temporary integration using their Custom API feature.

The main value for us was the logic layer, where we can run our transformations & algorithms.

The POC & trial were really good - I hope to have the same experience in production :slight_smile:


#39

We originally gave Talend a shot, but have since settled comfortably on Apache Airflow.
The latter would require more development skills, though.
I’ve found this list to be quite helpful in searching for big data tools.


(Jeremy Martin) #40

Good discussion here.

Full disclosure: I work for Bedrock Data. Our newest product, Fusion, could help folks here by getting data from applications (NetSuite, Salesforce, Marketo, HubSpot, Jira, Shopify, etc.) into a MySQL warehouse you can connect to Looker. It's good if you’re worried about hitting API limits and want to get up and running with LookML dashboards lickety-split.