
Building a distributed ETL pipeline for a monolithic data warehouse

Written by Matthieu Caneill · Mar 17, 2021 · 7 min read

The Parkinson’s Law of Data states that





Data expands to fill the space available for storage.





Now, this poses serious challenges when storage in the cloud is virtually infinite. Let’s dive into our data-capturing processes, and how we are keeping control of the flows!





Python and SQL walk into a bar. They sit at a table and actually enjoy their time together.




The Data Warehouse





All operational data generated by our systems gets captured in the Data Warehouse: a giant database used for detailed analyses and improvements across the whole supply chain. To enter the Data Warehouse, the data hops through many transformation and cleansing steps, to ensure it is fully integrated with the data model and can later be queried without producing wrong answers.





This process is known as ETL (extract-transform-load) or ELT (extract-load-transform). We currently have dozens of processes, constantly reading from the operational systems and feeding the Data Warehouse. They are all carefully crafted and configured to work smoothly using various tools and programming languages.
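
As an illustration, a stripped-down extract-and-load step could look like the sketch below. The endpoint, table, and column names are invented for the example, and we assume a psycopg2-style connection to the Data Warehouse; the real processes are of course more elaborate.

    # Minimal sketch of an extract-and-load step: pull records from an
    # operational API and write them into a staging table. All names are
    # placeholders for the example.
    import requests


    def extract_orders(api_url: str) -> list[dict]:
        """Pull raw records from an operational system."""
        response = requests.get(f"{api_url}/orders", timeout=30)
        response.raise_for_status()
        return response.json()


    def load_orders(connection, orders: list[dict]) -> None:
        """Write the raw records into a staging table of the Data Warehouse."""
        with connection:  # one transaction: commit on success, roll back on error
            with connection.cursor() as cursor:
                cursor.executemany(
                    "INSERT INTO staging.orders (order_id, ordered_at, amount)"
                    " VALUES (%s, %s, %s)",
                    [(o["id"], o["ordered_at"], o["amount"]) for o in orders],
                )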





From Pentaho to Python





Historically, our ETL/ELT processes were written in Pentaho. While this had many advantages, such as an easy-to-understand graphical interface and many integrated connectors to read data from various places, it also posed some problems. The main issues were:





  • Difficult code reviews. Pentaho stores everything in XML documents; while this is convenient for computers, it is much harder for humans to parse an XML diff. Pull requests became very lengthy due to the verbosity of XML, and it's almost impossible to understand the differences by just looking at the code changes. Of course you can load the new version of a process in Pentaho and visually compare both versions, but that requires a lot more effort.
  • Implementation of code logic. While Pentaho has ways to implement logic (do this if that happens, do this for every item of that), the only way to extend its possibilities is through JavaScript snippets embedded in Pentaho components. This doesn't scale to a solid codebase; moreover, the JavaScript snippets had to be somehow encoded within the XML files (to properly handle whitespace), leading to complex code reviews. Another way to extend Pentaho is by writing custom components in Java. This worked to some extent, but requires specific Java knowledge that is not typically part of the data engineering toolset.
  • Software tooling. Compared to full-fledged programming languages, Pentaho lacks some very important features: the ability to write unit tests, file formatting and linting, and ease of deployment. These are missed opportunities for solid software engineering principles, which are required to build strong foundations for scaling and to maintain a fresh codebase while reducing technical debt (an example of what this unlocks follows below).




This led us to migrate our Pentaho processes to a solution that better suits our needs. We decided to go with Python, for multiple reasons. The most important one is its rich ecosystem, especially in the data landscape: solid, mature, well-performing libraries exist for almost everything. Another good reason is its ubiquity: many systems at Picnic were already written in Python, while Pentaho skills are scarcer.
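
Coming back to the tooling point above: once a transformation is a plain Python function, it can be unit-tested like any other code. The function and test below are a made-up example (not one of our actual processes), runnable with pytest.

    # transform.py -- a transformation step as a plain, testable function
    def normalize_country_code(record: dict) -> dict:
        """Upper-case the country code and strip surrounding whitespace."""
        cleaned = dict(record)
        cleaned["country_code"] = record["country_code"].strip().upper()
        return cleaned


    # test_transform.py -- a pytest unit test for the step above
    from transform import normalize_country_code


    def test_normalize_country_code():
        record = {"order_id": 42, "country_code": " nl "}
        assert normalize_country_code(record)["country_code"] == "NL"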





Python and SQL: a love story ❤️





I haven't mentioned SQL yet, but it is at the core of all our processes. Our Data Warehouse can be programmed with SQL, and this language is a perfect match for many of the operations our processes perform, typically cleaning, transformations, and injections. SQL, being designed for data manipulation, is hard to beat. You might even wonder why we use Pentaho or Python at all, but there are some things we can't do with SQL:





  • Communication with the external world. The “extract” part of ETL/ELT can't be done with SQL; the data has to come from somewhere! We use Python to connect to API endpoints and to various databases. Similarly, we develop jobs that read from the Data Warehouse and write data to other places, Slack for example. Naturally, error handling and backoff-retry mechanisms are also handled in Python.
  • Complex and parallel scheduling. While we can schedule “pure-SQL” operations, we can't schedule complex orchestrations of SQL scripts. We have nightly builds, where every layer of the Data Warehouse is constructed from data in the inner layers: from raw data to aggregations and presentation layers. The SQL scripts taking care of building those layers sometimes depend on each other. In order not to run them one by one, which would take considerable time, we have built a Python executor able to run scripts in parallel, layer after layer (a minimal sketch of the idea follows this list).
  • Automation outside of the Data Warehouse. We love our Data Warehouse, but sometimes we write processes not directly related to data manipulation. We automate many tasks in the systems we interact with, such as Salesforce or Tableau. Those processes are also written in Python.
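
The executor mentioned above boils down to something like the following sketch: run all scripts of a layer in parallel, and only move on to the next layer once the previous one has fully succeeded. The layer contents are placeholders, and we assume psycopg2-style connections produced by a connect callable (for example functools.partial(psycopg2.connect, dsn)).

    # Minimal sketch of a layered, parallel SQL executor: scripts within a
    # layer are independent and run concurrently; layers run sequentially.
    from concurrent.futures import ThreadPoolExecutor

    # Placeholder layers: each inner list holds SQL scripts that do not depend
    # on each other; a layer only depends on the layers before it.
    LAYERS = [
        ["load_raw_orders.sql", "load_raw_deliveries.sql"],
        ["build_aggregations.sql"],
        ["build_presentation_views.sql"],
    ]


    def run_sql_script(connect, script_path: str) -> None:
        """Execute one SQL script on its own Data Warehouse connection."""
        with open(script_path) as f:
            sql = f.read()
        connection = connect()  # one connection per script, so scripts
        try:                    # can really run in parallel
            with connection.cursor() as cursor:
                cursor.execute(sql)
            connection.commit()
        finally:
            connection.close()


    def run_layers(connect, layers=LAYERS) -> None:
        """Run each layer's scripts in parallel; stop at the first failure."""
        for layer in layers:
            with ThreadPoolExecutor(max_workers=len(layer)) as pool:
                futures = [pool.submit(run_sql_script, connect, s) for s in layer]
                for future in futures:
                    future.result()  # re-raises a script failure before the next layer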




All in all, we're developing and maintaining a growing codebase of SQL and Python scripts: SQL is used for data manipulation within the Data Warehouse, and Python for everything that revolves around it. The Data Warehouse being the unique source of truth, we consider it the “state” of all our applications. By carefully developing idempotent and self-contained ETL/ELT processes, we avoid many of the complexities of distributed systems, such as invalid state or race conditions. This is how we can scale our infrastructure to a distributed pipeline of flowing data.
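
Idempotency often comes down to making each process replace the exact slice of data it owns, inside a single transaction, so that re-running it after a failure leads to the same end state. A minimal sketch with invented table names, again assuming a psycopg2-style connection:

    # Idempotent load sketch: the process "owns" one day of data in the target
    # table and replaces that whole slice atomically. Running it twice for the
    # same day produces exactly the same result.
    import datetime


    def load_daily_deliveries(connection, day: datetime.date, rows: list[tuple]) -> None:
        with connection:  # one transaction: commit on success, roll back on error
            with connection.cursor() as cursor:
                cursor.execute(
                    "DELETE FROM dwh.daily_deliveries WHERE delivery_date = %s",
                    (day,),
                )
                cursor.executemany(
                    "INSERT INTO dwh.daily_deliveries (delivery_date, trip_id, stops)"
                    " VALUES (%s, %s, %s)",
                    rows,
                )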





To the cloud, and beyond





Naturally, these SQL and Python processes don't run on our laptops in production. Depending on the volume and freshness of the data, they are scheduled to run once or multiple times per day, alongside the other Picnic systems in Kubernetes.





And, naturally, processes will start failing sooner or later, for a plethora of reasons. Fear not! This is expected. While we do our best to develop stable software, unforeseen things will happen: wrong data coming in, external APIs not responding, memory lacking (or leaking), and, believe it or not, even bugs in our own code!





Before things fail, we try to anticipate them. We have developed automated processes that run sanity checks on the Data Warehouse. Even though every row of data entering it is carefully cleaned, discrepancies can still happen. Regularly challenging the data allows us to anticipate process failures and gives us a semi-automated opportunity to fix things beforehand.
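
A sanity check can be as small as a query that must return zero rows; anything else gets reported before a downstream process trips over it. The check below is an invented example of what such a query looks like.

    # Sketch of a Data Warehouse sanity check: orders referencing a customer
    # that is missing from the customers table should not exist.
    SANITY_CHECKS = {
        "orders_without_customer": """
            SELECT o.order_id
            FROM dwh.orders o
            LEFT JOIN dwh.customers c ON c.customer_id = o.customer_id
            WHERE c.customer_id IS NULL
        """,
    }


    def run_sanity_checks(connection) -> dict[str, int]:
        """Return the number of offending rows per check; all zeros means healthy."""
        results = {}
        with connection.cursor() as cursor:
            for name, query in SANITY_CHECKS.items():
                cursor.execute(query)
                results[name] = len(cursor.fetchall())
        return results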





When things do fail, first and foremost it's important to be alerted. We ensure alerts are always sent when something goes wrong, and that verbose logs are correctly dispatched, in order to, well, find out what's wrong.
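
In practice this is a thin wrapper around every process: log verbosely, and push a message to a Slack channel through an incoming webhook when an exception escapes. The webhook URL below is a placeholder.

    # Sketch of a "run and alert" wrapper around a process.
    import logging

    import requests

    logger = logging.getLogger(__name__)
    SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder


    def run_with_alerting(process_name: str, process) -> None:
        """Run a zero-argument process, logging and alerting on failure."""
        try:
            logger.info("Starting process %s", process_name)
            process()
            logger.info("Process %s finished successfully", process_name)
        except Exception:
            logger.exception("Process %s failed", process_name)
            requests.post(
                SLACK_WEBHOOK_URL,
                json={"text": f":warning: ETL process `{process_name}` failed."},
                timeout=10,
            )
            raise  # let the scheduler see the failure as well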





Finally, after things have failed, the next important thing is not to run dependent processes: we certainly don't want to compute aggregations when part of the data is missing, for example. This is achieved both with Argo Workflows, which allow us to precisely pinpoint what can run in parallel and which process depends on what, and through special tables in the Data Warehouse where process execution results are saved: having a centralized state proves useful once more. And, last but not least, the failing process has to be fixed. Having idempotent processes allows us to easily re-trigger them with the guarantee of not losing any data.
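
Those special tables boil down to a process run log that dependent processes consult before starting. A minimal sketch, with an invented table layout and assuming a PostgreSQL-flavoured warehouse accessed through a psycopg2-style connection:

    # Sketch of the centralized process state: every run records its outcome,
    # and dependent processes refuse to start if the upstream run is missing
    # or has failed today.
    import datetime


    def record_run(connection, process_name: str, succeeded: bool) -> None:
        with connection, connection.cursor() as cursor:
            cursor.execute(
                "INSERT INTO dwh.process_runs (process_name, run_at, succeeded)"
                " VALUES (%s, %s, %s)",
                (process_name, datetime.datetime.now(datetime.timezone.utc), succeeded),
            )


    def upstream_succeeded_today(connection, upstream: str) -> bool:
        """True only if the upstream process has a successful run today."""
        with connection.cursor() as cursor:
            cursor.execute(
                "SELECT 1 FROM dwh.process_runs"
                " WHERE process_name = %s AND succeeded AND run_at::date = CURRENT_DATE"
                " LIMIT 1",
                (upstream,),
            )
            return cursor.fetchone() is not None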





Lessons learned





We've come a long way towards building a robust, distributed, and efficient infrastructure for our Data Warehouse and its processes. Data engineering and software development are constantly moving targets, so we will never declare the job finished. Nonetheless, we have learnt many valuable lessons while converging towards our SQL and Python ecosystem:





  • Prepare for failures. Nothing will ever go right all the time; designing for failures and repairs proves handy every day.
  • Carefully choose your tools. SQL is perfect for data manipulation, Python is perfect for automation (among other things!).
  • Document the non-obvious. We have hundreds of times more SQL and Python scripts than data engineers. Nobody knows everything, and those who do will forget sooner or later. Documenting as much as possible raises the shared comprehension of our systems and lowers the bar for newcomers diving into them. Documentation happens in many places: we use inline documentation in SQL and Python for implementation details, version-controlled markdown files next to the code for higher-level overviews, and a wiki for everything else.
