We often recommend ELT solutions like Matillion and FiveTran to our customers as powerful tools for moving data into their warehouse from many sources and transforming it to find useful insights. These tools have worked well for BI use cases with customers willing to invest in data engineering and pipelining. Recently, we took a closer look at Keboola as an alternative solution, and we came away impressed. Let’s just say: it’s pretty awesome.
Harnessing the Power of Cloud Computing for Transformations
The conventional wisdom has been that ETL solutions, tools that Extract and Transform data and then Load it into a warehouse, are going out of fashion in favor of ELT tools that Extract and Load into the warehouse first and then transform the data once it is there. It’s easy to understand why people say this. In ETL solutions, data is extracted and then transformed by some middle entity that needs major processing power (thought to be cost-prohibitive and non-scalable given the growth of data) to perform the transformations. ELT offers an appealing alternative: just load all of your data into your warehouse in raw format, then use the power of the warehouse to join, combine, summarize, and so on. It makes a lot of sense, but it can be complex and requires data engineering to support the data analysts and scientists.
Enter Keboola. They have turned this whole concept on its head, and it’s spectacular. Their approach is to extract data into the cloud first (not into a warehouse), storing it in AWS S3, letting you transform it in the cloud, and then writing it back to your database and elsewhere. If the conventional wisdom is that ETL is no longer feasible because it requires major processing power in the middle, Keboola has come back and said, “no problem, we’ll use the cloud’s infinitely scalable infrastructure for that.” And so they have: they use a serverless architecture that provides as much compute as they need for transformations.
But the secret sauce is really the way they use modern cloud data warehouse technologies in the middle to perform transformations, regardless of where your data comes from. Your data can come from a transactional database, one that would typically be very slow at the aggregate, roll-up operations needed for analytics (OLAP) reporting. And yet, just in time, Keboola moves your data to a modern cloud warehouse (Snowflake by default, though Keboola also supports Redshift) to perform the analytics transformations.
By harnessing the power of an OLAP data warehouse for all SQL transformations, users see lightning-fast aggregate operations and can run really complex analytics queries with ease. While Keboola has an ETL design at a high level, it actually employs ELT semantics for transformations: it loads your data into a warehouse first and transforms it there. In other words, Keboola lets you benefit from both ETL and ELT by having distinct extracting and loading steps while also using a modern cloud data warehouse for transformations, regardless of where you are extracting from or loading to. You can think of it as ETL with just-in-time ELT in the middle.
If you want to extract from PostgreSQL and send analytics data to Looker, Keboola will use a modern cloud data warehouse for that (again, typically Snowflake). Do you want to extract your Google AdWords data and your HubSpot data and combine them? Keboola will use a modern cloud data warehouse for those transformations as well. What this means is that any Keboola customer can very quickly start to benefit from using a modern cloud data warehouse without having to worry about how to migrate and manage their data there. This concept of using Snowflake or Redshift as an offload engine, a co-processor for their own cloud ETL solution, is genius. To us, this is a real differentiator for Keboola and a game changer.
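The combine-two-sources pattern can be sketched in a few lines, with Python’s built-in sqlite3 standing in for the cloud warehouse; the table and column names below are made up for illustration, not Keboola’s or either API’s actual schemas.

```python
import sqlite3

# sqlite3 stands in here for the warehouse (Snowflake or Redshift);
# the two tables mimic extracted AdWords and HubSpot data landed raw.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE adwords_spend (campaign TEXT, clicks INTEGER, cost REAL);
    INSERT INTO adwords_spend VALUES ('spring_sale', 500, 125.0),
                                     ('brand', 200, 40.0);
    CREATE TABLE hubspot_leads (campaign TEXT, leads INTEGER);
    INSERT INTO hubspot_leads VALUES ('spring_sale', 25), ('brand', 4);
""")

# The transformation: join the two sources and compute cost per lead,
# the kind of aggregate query an OLAP warehouse handles with ease.
rows = con.execute("""
    SELECT a.campaign,
           a.cost / h.leads AS cost_per_lead
    FROM adwords_spend a
    JOIN hubspot_leads h USING (campaign)
    ORDER BY cost_per_lead
""").fetchall()
# rows -> [('spring_sale', 5.0), ('brand', 10.0)]
```

In Keboola the SQL would run in the warehouse it provisions for you; the shape of the transformation, raw tables in, a joined summary table out, is the same.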
What Keboola Can Do With an ETL Architecture
Because Keboola invested in having an ETL architecture, with powerful cloud-based automation in the middle, they can do way more than just SQL transforms. Perhaps the most exciting features of Keboola are the ones that enable data scientists to do their jobs without having to worry about how to get the data they need.
Keboola also allows for Python and R transformations in the cloud. Like their SQL cousins, these transformations take one or more tables as inputs, run your code in the middle, and produce one or more tables as outputs. The output tables can be written back to your database or warehouse, or consumed downstream by other data scientists to keep on crunching. Most importantly, you can share and move data easily without having to ask a data engineer to write extract-and-load logic for you.
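A minimal sketch of the tables-in, tables-out shape of such a transformation, assuming input tables arrive as CSV files in one directory and output CSVs are collected from another (the `in/tables` and `out/tables` paths and the orders data here are placeholders so the example runs standalone):

```python
import csv
import os
from collections import defaultdict

# Placeholder directories standing in for where the platform mounts
# input tables and collects output tables.
IN_DIR, OUT_DIR = "in/tables", "out/tables"
os.makedirs(IN_DIR, exist_ok=True)
os.makedirs(OUT_DIR, exist_ok=True)

# Stand-in for an extracted input table.
with open(os.path.join(IN_DIR, "orders.csv"), "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["customer", "amount"])
    w.writerows([["alice", "120"], ["bob", "80"], ["alice", "40"]])

# The transformation itself: aggregate revenue per customer.
totals = defaultdict(float)
with open(os.path.join(IN_DIR, "orders.csv"), newline="") as f:
    for row in csv.DictReader(f):
        totals[row["customer"]] += float(row["amount"])

# Write the output table for downstream consumers.
with open(os.path.join(OUT_DIR, "revenue_by_customer.csv"), "w",
          newline="") as f:
    w = csv.writer(f)
    w.writerow(["customer", "revenue"])
    for customer, revenue in sorted(totals.items()):
        w.writerow([customer, revenue])
```

The point is the contract, not the code: your script only reads and writes tables, and the platform handles getting data in and out of them.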
And then there are the sandboxes. For SQL transformations, Keboola automates the creation of a temporary Snowflake workspace on its own Snowflake account, essentially a dedicated user, database, and schema for you to play around with your datasets and get your transformations right. For Python, Keboola launches a Jupyter notebook automatically and secures it using Docker for isolation. The magic is that the data is just there. Keboola takes care of getting your extracted data from its own storage into Snowflake or Jupyter, so you can get to work right away on your queries or your Python or R code.
Even if you don’t end up choosing Keboola, it’s worth taking a look and admiring the architecture and cloud automation. We will certainly be recommending it to our clients going forward.