
Picnic’s Lakeless Data Warehouse

Written by Iliana Iankoulova · Dec 9, 2020, 16:13 · 19 min read

At Picnic, we are proud of having a Data Warehouse (DWH), which covers all aspects of our business and is widely adopted in the organization. We achieve this by continuously delivering on three promises:





  1. Single source of truth with a centralized logic for Key Performance Indicators (KPIs).
  2. Trusted data source with high data quality and intuitive structure.
  3. Root cause analysis support in a single place for any business issue.




Note: This is just the first in a series of five blog posts about Picnic’s Data Engineering. It is about the current state of technology and architecture. In the future, we will also explore topics like business challenges solved with data, our tech stack evolution, and the principles behind an effective Data team. Click the follow button to stay in the loop.





Some context to set the scene





Picnic is the fastest-growing online supermarket in Europe. We are sustainable, we strive to minimize food waste, and we deeply care about our customers. At our core, we’re a data-driven tech company, and we’re using this expertise to solve the challenges of online grocery shopping.





As the Data Engineering team, we build the tech foundations for decision making based on data. Since day one of Picnic, we’ve been developing the Data Warehouse — our team’s main product. Essentially, this is a massive library containing all of our data and knowledge, which helps the business to make smarter, more informed decisions.





My name is Iliana. I am Bulgarian-Canadian, and I joined Picnic as the first Data Engineer before any deliveries had been made. Now, as we celebrate Picnic’s fifth birthday and thousands of deliveries per day, I am excited to share my team’s journey in five blog posts. It’s a wild, adventurous, and action-packed ride!





The Data Warehouse Revealed





Our Business Analysts and Data Scientists do not have to worry about technicalities such as sourcing, cleaning, deciphering, and storing the data. Data discoverability is easy when there is a clear pattern to how data is stored, and good documentation of tables and columns ensures that the sources are used correctly.





We rely heavily on Data Vault and Dimensional Kimball modeling to structure our Data Warehouse. To ensure the highest quality, we have consumer-driven contracts in place with the source systems. This technical setup is complemented by an onboarding program in SQL for every new DWH explorer. At the same time, we follow best practices in Data Governance and security access control.





Data Engineers source data from all systems across the supply chain and enable Business Analysts and Data Scientists to focus on the value-added activities of creating insights and machine learning algorithms.




Here are some numbers to put our current scale into perspective:





  • The DWH has 120 (micro)service sources.
  • Some 1000 ETL/ELT jobs run per day to capture and aggregate data.
  • About 50% of HQ employees outside of tech roles have SQL skills.
  • A Business Analyst runs, on average, 300 SQL queries per day.
  • Our DWH contains 15,000 tables, 247,000 fields, and 121 billion rows.




Just nine Data Engineers support all of this! Managing such a scale with so few people is only possible with significant automation in our data pipelines and a relentless focus on structure, patterns, and quality.





The Data Engineering Tech Stack





We intentionally keep the Data Engineering tech stack lean. As a rapidly growing organization, we need to master lightweight yet powerful technologies to keep up with the business. And those tools must be versatile enough to adapt to the changing business needs.





Data Engineering Tech Stack at Picnic: (1) Python, Snowplow, Stitch, RMQ; (2) Snowflake; (3) SQL; (4) Tableau, Slack; (5) Kubernetes, Travis, Python, Argo, GitHub.




Python: Data Integration and ETL/ELT Processes





Extract Transform Load (ETL), or the more modern variant, Extract Load Transform (ELT), refers to the jobs responsible for capturing historical data in the DWH and applying business logic for analytical consumption. At Picnic, the ELT is implemented in Python and uses a high level of automation to load data into the DWH. Alongside the automated pipelines, we also run anomaly detection jobs that identify structural and semantic issues.
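To make this concrete, here is a minimal sketch of what one of these ELT jobs could look like, assuming the snowflake-connector-python package; the service endpoint, credential handling, and table names are illustrative rather than Picnic's actual code.

```python
import json
import os

import requests
import snowflake.connector


def extract(source_url: str) -> list[dict]:
    """Pull raw records from a (micro)service REST endpoint."""
    response = requests.get(source_url, timeout=60)
    response.raise_for_status()
    return response.json()


def load(records: list[dict], table: str) -> None:
    """Land raw JSON payloads in a staging table; transformation happens later in SQL."""
    conn = snowflake.connector.connect(
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
        warehouse="ETL_WH",
        database="DWH_NL",
        schema="ODS",
    )
    cur = conn.cursor()
    try:
        for record in records:
            cur.execute(
                f"INSERT INTO {table} (payload) SELECT PARSE_JSON(%s)",
                (json.dumps(record),),
            )
    finally:
        cur.close()
        conn.close()


if __name__ == "__main__":
    load(extract("https://orders-service.internal/api/orders"), "ORDERS_RAW")
```

In practice such jobs also need incremental extraction windows, retries, and logging; the point here is only that the heavy lifting happens in SQL once the data has landed.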





Across the different processes, we use both public and in-house libraries to build our jobs. Fewer than 0.5% of the runs finish with a failure status (triggering a PagerDuty alert), but we handle every alert as soon as possible and take corrective actions. Most often, the cause of an alert is a deployment of backward-incompatible features from a source system. In those cases, we collaborate directly with the respective development team to avoid any future incidents.
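As an illustration of the structural checks, a stripped-down anomaly detection job can compare the fields observed in freshly landed payloads against the set the DWH expects; the field list and the alerting hook below are hypothetical.

```python
import json

# Fields the DWH expects in an order payload (illustrative).
EXPECTED_ORDER_FIELDS = {"order_id", "customer_id", "delivery_slot", "total_amount"}


def field_drift(raw_payloads: list[str]) -> set[str]:
    """Return the fields that appeared or disappeared compared to the expectation."""
    observed: set[str] = set()
    for payload in raw_payloads:
        observed.update(json.loads(payload).keys())
    return observed.symmetric_difference(EXPECTED_ORDER_FIELDS)


def alert_if_drifted(raw_payloads: list[str]) -> None:
    drift = field_drift(raw_payloads)
    if drift:
        # In production this would page the on-call Data Engineer; here we just raise.
        raise RuntimeError(f"Schema drift detected in order payloads: {sorted(drift)}")
```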





Snowflake: Data Warehouse made for the cloud





At Picnic, Snowflake is the cloud data warehouse of choice. Snowflake supports the most common standardized version of SQL, ANSI, for relational database querying. It also comes with many extensions, including robust JSON parsing, auto-ingestion via Snowpipe, and other built-in functions.





One compelling Snowflake feature is the separation of storage and computing resources. This has been beneficial for us because we have a large volume of data that we want to store in a single place and use in various ways. Our DWH is architected in separate databases — one per country, and multiple schemas — one per data modeling technique. Another feature we appreciate is the auto-ingestion of data from S3 via Snowpipe — a native data pipeline in Snowflake — achieving near-real-time data ingestion and reduction of the ETL/ELT code.
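For a rough idea of how such auto-ingestion is wired up (bucket, stage, and table names below are invented, and the storage integration is left out), the setup boils down to a stage pointing at S3 and a pipe with auto-ingestion enabled:

```python
import os

import snowflake.connector

# Illustrative DDL: a stage over the S3 landing location and a pipe that
# auto-ingests new JSON files into a raw table with a single VARIANT column.
STAGE_DDL = """
CREATE STAGE IF NOT EXISTS DWH_NL.ODS.EVENTS_STAGE
  URL = 's3://picnic-example-bucket/events/'
  FILE_FORMAT = (TYPE = 'JSON')
"""

PIPE_DDL = """
CREATE PIPE IF NOT EXISTS DWH_NL.ODS.EVENTS_PIPE
  AUTO_INGEST = TRUE
AS
  COPY INTO DWH_NL.ODS.EVENTS_RAW
  FROM @DWH_NL.ODS.EVENTS_STAGE
"""

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
)
cur = conn.cursor()
cur.execute(STAGE_DDL)
cur.execute(PIPE_DDL)
conn.close()
```

In practice, auto-ingestion also requires pointing S3 event notifications at the queue Snowflake creates for the pipe; we leave that part out here.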





Every Business Analyst at Picnic (and there are about 150 of them!) undergoes SQL training. After passing a few challenges, they obtain read access to Snowflake via SQL. This way of opening up the DWH for broad SQL access has been hugely successful at Picnic. One year after we started it, we did a survey to assess the program. Over 95% of the analysts reported increased productivity.





Tableau: dashboarding and data visualization





Tableau is used as our Business Intelligence (BI) visualization and dashboarding software. It is one of the tools we adopted at the very beginning of Picnic's operations five years ago, and it is still widely used to this day.





Most of the tables from the DWH’s reporting layer are available as sources in Tableau, and we name them consistently with the fact tables in Snowflake. We use the managed service Tableau Online, with separate sites per market. This multi-site setup is part of our enterprise architecture, isolating the countries from a performance and security perspective.





Maintaining multiple sites with over 90% equivalent data structures and workbooks poses some unique challenges. We’ve built an automation process using the Tableau API to get an overview of the access permissions and also to be able to manipulate the data sources automatically. Another example of automation is a process that creates a backup of all Tableau objects, enabling recovery of any data sources or workbooks that have been accidentally deleted.
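We do not prescribe a particular client here, but as a sketch of what that backup automation can look like with the open-source tableauserverclient package (the site id, token variables, and backup location are illustrative), the job is essentially a loop over all workbooks and data sources of a site:

```python
import os

import tableauserverclient as TSC


def backup_site(site_id: str, backup_dir: str = "tableau_backup") -> None:
    """Download every workbook and data source on one Tableau Online site."""
    auth = TSC.PersonalAccessTokenAuth(
        token_name=os.environ["TABLEAU_TOKEN_NAME"],
        personal_access_token=os.environ["TABLEAU_TOKEN_SECRET"],
        site_id=site_id,
    )
    server = TSC.Server("https://dub01.online.tableau.com", use_server_version=True)
    os.makedirs(backup_dir, exist_ok=True)
    with server.auth.sign_in(auth):
        for workbook in TSC.Pager(server.workbooks):
            server.workbooks.download(workbook.id, filepath=f"{backup_dir}/{workbook.name}.twbx")
        for datasource in TSC.Pager(server.datasources):
            server.datasources.download(datasource.id, filepath=f"{backup_dir}/{datasource.name}.tdsx")


if __name__ == "__main__":
    backup_site("nl")  # one isolated site per market, e.g. a hypothetical "nl" site
```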





Since all analysts have SQL access, we’ve seen increased usage of Tableau purely for visualization. This is a clear shift away from the ‘blending’ and ‘level of detail’ calculations used before. The heavy logic is now done in SQL, utilizing Snowflake to run complex calculations in a performant way. Direct SQL access to the DWH has resulted in more performant Tableau dashboards and genuine happiness across the Picnic organization.





Argo on Kubernetes: orchestration and deployment





All of our jobs are deployed in Kubernetes on Amazon AWS, running in isolated pods. This ensures that jobs don’t block each other, and they use a customized amount of computational resources. Hence, we enable parallel execution of the Data Vault data capturing processes. This is a secure deployment since all the secrets are safely kept in Vault (not to be confused with ‘Data Vault’ :P).





Some of the ELT jobs run a few times per day, capturing incremental changes. Others run a few times per hour. This is only possible thanks to the containerization of the jobs. Argo allows us to create complex schedules using configurations of environmental variables and dependencies.
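For a flavour of what such a schedule looks like (the job name, image, cron expression, environment variables, and resource requests below are invented for illustration, not Picnic's actual configuration), an Argo CronWorkflow wraps one containerized job together with its configuration:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: dv-load-orders
spec:
  schedule: "15 */2 * * *"          # run a few times per day
  concurrencyPolicy: Forbid         # never start while the previous run is still active
  workflowSpec:
    entrypoint: load
    templates:
      - name: load
        container:
          image: registry.example.com/dwh/elt-orders:latest
          command: ["python", "-m", "elt.orders"]
          env:
            - name: TARGET_DATABASE
              value: DWH_NL
          resources:
            requests:
              cpu: 500m
              memory: 1Gi
```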





Snowplow: a framework for events processing





For near-real-time and batch event processing, we use the Snowplow framework. Snowplow is an event data collection platform, which we utilize for in-app and back-end internal events. To learn more about our use case, I recommend the blog post ‘(Consistent) App Analytics with Snowplow’ by D. Nedev.





We store the schemas for the events in an Iglu repository. The events originate from our back-end systems and pass through RabbitMQ, an open-source message broker. Besides routing events to Snowplow for analytics, RabbitMQ is also used at Picnic for communication between services.





Once collected on S3, the events are further processed via AWS Kinesis and loaded in Snowflake. All internal back-end events have schemas validated by the producer (owner) system — significantly reducing the risk of breaking changes.
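As an illustration of that producer-side validation (the event name, schema, and fields are invented, and the check is shown with the generic jsonschema package rather than Picnic's actual tooling), a self-describing event carries an Iglu schema reference alongside its data and is validated before it is published:

```python
from jsonschema import validate

# Hypothetical schema for an internal back-end event.
DELIVERY_COMPLETED_SCHEMA = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {
        "delivery_id": {"type": "string"},
        "completed_at": {"type": "string", "format": "date-time"},
        "totes_returned": {"type": "integer", "minimum": 0},
    },
    "required": ["delivery_id", "completed_at"],
    "additionalProperties": False,
}

# Self-describing envelope: an Iglu schema reference plus the payload.
event = {
    "schema": "iglu:com.picnic/delivery_completed/jsonschema/1-0-0",
    "data": {
        "delivery_id": "d-42",
        "completed_at": "2020-12-09T16:13:00Z",
        "totes_returned": 3,
    },
}

# The producing service validates before publishing to RabbitMQ, so breaking
# payloads never reach the analytics pipeline.
validate(instance=event["data"], schema=DELIVERY_COMPLETED_SCHEMA)
```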





GitHub: versioning and documentation





We don’t just use Git for versioning our code and managing our release and Pull Request (PR) process. For us, GitHub is much more: it is where anyone in the company can see the business logic behind every KPI we calculate.





Since all the logic is in code — either SQL or Python — we can provide transparency and full lineage of data from the source system to the aggregated KPIs. In our experience, this is the only place where documentation stays up-to-date and complete.





The applied logic lives in the same space as the README files and comments, providing context for decisions. It also makes it easy to identify the most knowledgeable Data Engineer in any given area.





Data Modeling Frameworks for Organizing our Data Warehouse





The tech stack is only one side of the story. The other is how we use those tools. Let’s do a deep dive into the architecture of the Data Warehouse.





Just like a functioning library needs a classification system, a usable and intuitive Data Warehouse needs data models. In fact, data modeling is one of the most critical, yet often overlooked Data Engineering skills. It allows us to create and maintain order amidst the chaos.





A library’s classification system makes sure that readers can find the book they want in a logical place, and with all the pages intact. Similarly, a data model provides a map to the Business Analysts and Data Scientists, so they can find exactly what they need — and be confident that it’s correct.





We architected our DWH with schemas that represent data in different data modeling techniques, and we keep a strict separation between the back-end and front-end (reporting) DWH layers. The back-end schemas are used only by Data Engineers and Data Scientists, while the front-end acts as an API that is also available to Business Analysts and Developers, depending on their role. Data Vault is the primary modeling technique for the back-end DWH, while Dimensional Kimball is used for the front-end.





Our Data Warehouse consists of two layers: front-end and back-end. These are implemented as multiple schemas, split based on data modeling technique and function.




Dimensional Kimball Model





Our preferred way to expose analytics data to the business is through a Dimensional Kimball model. We call this our presentation, reporting, or front-end DWH layer. We use those terms interchangeably.





The DIMENSIONAL KIMBALL model is perfect for organizing data around business entities and enforcing conformed dimensions. Therefore, whenever we talk about entities such as products, customers, fulfillment centers, hubs, and electric vehicles, we always use the same master data in the respective dimension. Our fact tables span from low-level transactional or event data to highly-aggregated multidimensional cubes. Regardless of the aggregation level, we maintain the KPIs’ logic in a single place and allow for a root cause analysis.





We have also made Picnic-flavoured customization to the dimensional model. Besides the basic data types, we expose some JSON fields in the facts and dimensions. This is a handy extension these days, in the age of unstructured data. It helps us expose very detailed information in a controlled way before we have an analytical use case.





Often, such fields are a source for new business requirements and rapid prototyping. Take, for example, a fact table containing all actions that occurred in a given fulfillment center throughout a single day. This transactional fact accumulates hundreds of thousands of events, and each event can be one of a dozen types, such as picking, receiving, moving of stock, counting, or shipping.





New types of events are emitted regularly, and instead of blocking the reporting, we pass them through to the fact table by exposing the raw JSON payload in a VARIANT-type column named ‘event_raw’. We calculate foreign keys only for the most important entities; creating keys for every possible entity involved would be overkill, and it would never end. So we stay selective, decoupling new source-system features from DWH development as much as possible.
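For example, an analyst prototyping against a brand-new event type can query the raw payload directly with Snowflake's VARIANT path syntax, long before dedicated columns or foreign keys exist; the table and field names below are illustrative:

```python
import os

import snowflake.connector

# Illustrative prototype query against the raw JSON payload of picking events.
PICKING_DURATION_SQL = """
SELECT
    event_raw:article_id::string       AS article_id,
    event_raw:picker_id::string        AS picker_id,
    event_raw:duration_seconds::number AS duration_seconds
FROM fact_fulfillment_events
WHERE event_type = 'picking'
  AND event_date = CURRENT_DATE - 1
"""

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    database="DWH_NL",
    schema="REPORTING",
)
for row in conn.cursor().execute(PICKING_DURATION_SQL):
    print(row)
conn.close()
```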





Besides the Dimensional modeling, in the presentation layer, we also have a SPECIALIZED MART schema that contains tables and views accessible by the business. This schema contains a limited number of flat tables, which are not suitable for Kimball Dimensional modeling. Here, for example, we have a few near-real-time views that might not have dimensional data available (yet) and tables that do not link to anything else.





Data Vault





Data Vault (DV) is a modern data modeling technique for capturing historical data. Its philosophy is “all the data, all of the time” — in contrast to the Dimensional Kimball model, which is “a single version of the truth”. Those two are a powerful duo when deployed correctly. We use the DV only as a back-end DWH layer.





Combining the best of both worlds has proven to be extremely valuable for Picnic. We use DV v2.0 with key hashes, enabling us to easily capture the state of entities (hubs) and relationships (links) over time in satellite tables.
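To illustrate the mechanics (the column names and values are invented, and in our framework the hashing is generated as dynamic SQL rather than hand-written Python), a DV 2.0 load derives a hash key from the business key and a hashdiff from the descriptive attributes:

```python
import hashlib


def dv_hash(*parts: str) -> str:
    """Deterministic MD5 over normalized key parts, a common DV 2.0 convention."""
    normalized = "||".join(p.strip().upper() for p in parts)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


# Hub: one row per business key, identified by its hash key.
customer_hash_key = dv_hash("customer-1042")

# Satellite: descriptive attributes plus a hashdiff used for change data capture.
satellite_row = {
    "customer_hash_key": customer_hash_key,
    "load_dts": "2020-12-09T16:13:00Z",
    "record_source": "customer-service",
    "city": "Amsterdam",
    "delivery_area": "AMS-03",
    "hashdiff": dv_hash("Amsterdam", "AMS-03"),
}

# A new satellite version is inserted only when the incoming hashdiff differs
# from the latest one stored for this hash key.
```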





We have more than 1000 tables in the DATA VAULT schema, and it’s incredibly adaptable to changes on the business side. For example, when relationship cardinality changes, we can capture this easily in the DV.





And the same goes for changes on the tech side. Our back-end system evolved from a colossal monolith into a constellation of microservices. With DV modeling, we could quickly adapt to the new, more specialized sources and only needed to spin off new satellites.





The Benefits of Data Vault





The main benefits of the DV are the traceability capabilities, the performance gained from parallel incremental loads, and the rich time-series data. This is useful to our Data Science team, allowing them to build reliable predictive algorithms. It has also enabled us to be selective on the Dimensional Kimball modeling side, with the possibility to rebuild the complete presentation layer whenever we need.





We’ve used DV since the beginning of Picnic, and we’ve gained a lot of experience in building and maintaining it. Although it can be a challenge to get started, and the learning curve is steep, we are glad that we could persevere through the early stages.





Addressing some common criticisms of Data Vault





There are a couple of arguments against DV that we hear from other Data teams and would like to briefly address.





“DV is too verbose in data modeling.” The high number of tables is the cost of getting extreme flexibility in capturing and using the data for historical analysis. It also allows for easy separation of sensitive data and complete idempotent Change Data Capture (CDC) by design. In my opinion, it is a small price to pay.





Since the recipe is so straightforward, the model usually saves considerable time on open-ended discussion about structuring the data tables. The magnitude of those benefits might not be immediately evident but becomes apparent after years of collected data. In a few instances, we’ve even used DV as a source to recover data for operational use. While I highly recommend proper backup and archival procedures, DV indeed comes in handy in extreme cases.





“DV requires too much time to get a new job running.” I consider this to simply be an implementation detail, rather than a fundamental flaw. Without an automation loading framework, it becomes a challenge to maintain dozens of tables — never mind hundreds.





At Picnic, we’ve built a homegrown solution in Python, enabling us to implement new DV domains in hours rather than days. The framework is highly automated using dynamic SQL and only needs four ingredients for configuration:





  1. Python implementation of the extractions, since it is custom.
  2. YAML file with the configuration of the target DV structures.
  3. SQL script for the hard business logic transformations.
  4. One-time executed SQL script with the creation of the DV tables.




Our DV loading framework handles the population of the target DV tables, the calculation of the CDC hashes, and the atomicity of the transaction. Implementing the initial version took no less than four months of full-time development by a single Data Engineer. Still, this paid off with years of stability and high velocity for new pipelines.
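A heavily simplified sketch of the idea (the real framework reads YAML, handles links and many more cases, and all names below are invented): a small configuration describing the target hub and satellite is enough to generate the loading SQL dynamically.

```python
# Illustrative target-structure configuration; at Picnic this lives in a YAML file.
HUB_CONFIG = {
    "hub": "hub_customer",
    "business_key": "customer_id",
    "hash_key": "customer_hash_key",
    "satellite": "sat_customer_details",
    "attributes": ["city", "delivery_area"],
    "staging_table": "stg_customers",
}


def generate_hub_load(cfg: dict) -> str:
    """Insert business keys that are not yet present in the hub."""
    hub, hk, bk, stg = cfg["hub"], cfg["hash_key"], cfg["business_key"], cfg["staging_table"]
    return f"""
    INSERT INTO {hub} ({hk}, {bk}, load_dts, record_source)
    SELECT s.{hk}, s.{bk}, CURRENT_TIMESTAMP, s.record_source
    FROM {stg} s
    LEFT JOIN {hub} h ON h.{hk} = s.{hk}
    WHERE h.{hk} IS NULL
    """


def generate_satellite_load(cfg: dict) -> str:
    """Insert a new satellite version only when the incoming hashdiff differs."""
    sat, hk, stg = cfg["satellite"], cfg["hash_key"], cfg["staging_table"]
    attr_list = ", ".join(cfg["attributes"])
    staged_attrs = ", ".join("s." + a for a in cfg["attributes"])
    return f"""
    INSERT INTO {sat} ({hk}, load_dts, hashdiff, {attr_list})
    SELECT s.{hk}, CURRENT_TIMESTAMP, s.hashdiff, {staged_attrs}
    FROM {stg} s
    LEFT JOIN (
        SELECT {hk}, hashdiff
        FROM {sat}
        QUALIFY ROW_NUMBER() OVER (PARTITION BY {hk} ORDER BY load_dts DESC) = 1
    ) cur ON cur.{hk} = s.{hk}
    WHERE cur.{hk} IS NULL OR cur.hashdiff <> s.hashdiff
    """


if __name__ == "__main__":
    print(generate_hub_load(HUB_CONFIG))
    print(generate_satellite_load(HUB_CONFIG))
```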





Sandboxing and Prototyping With Temporary Tables





Another home-grown element of our DWH architecture gives users controlled write access to the DWH. We have a schema called TEMP, which we purge every 24 hours. Analysts have the right to create tables and views there and to load external data.





This feature enables the DWH to be the central place for analysis, where data comes in from Excel files and Google Sheets — and not so much the other way around.





It’s simply impossible, even for a complete DWH, to contain all the (fresh) data that the business needs. Therefore, it is valuable to have provisioning in the architecture to allow for imports that can be joined with existing data.





We purge this schema every 24 hours to avoid a swamp of stale data without an owner. If an Analyst finds themselves in the daily process of running the same SQL statements, as a next step, a Data Engineer will create those structures in a more persistent area in a SANDBOX schema. It’s then the Data Engineer’s responsibility to manage permissions and ensure it gets to production after a proper peer-reviewed development process.
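A minimal version of such a purge job, assuming the snowflake-connector-python package and illustrative connection details, simply enumerates the objects in TEMP and drops them:

```python
import os

import snowflake.connector

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    database="DWH_NL",
)
cur = conn.cursor()
cur.execute(
    """
    SELECT table_name, table_type
    FROM information_schema.tables
    WHERE table_schema = 'TEMP'
    """
)
for table_name, table_type in cur.fetchall():
    object_type = "VIEW" if table_type == "VIEW" else "TABLE"
    # Drop everything: TEMP is meant for throwaway analysis, not long-lived data.
    cur.execute(f'DROP {object_type} IF EXISTS TEMP."{table_name}"')
conn.close()
```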





Other Back-End Data Capturing Schemas





We tailor the back-end DWH to the mechanisms by which we source data. Besides the DATA VAULT, we also have an ATOMIC schema used exclusively for the Snowplow events. Most of this data is unstructured JSON objects, and before it is exposed, we parse it according to the agreed event schemas.





Similarly, the Operational Data Store (ODS) schema holds data in Third Normal Form as it arrives from the source systems. There is often no change data capture, just a complete replication of the source. The ODS schema is mostly used for business mappings and configurations loaded via Snowpipe or from third-party systems. STITCH is a database containing all sorts of ODS exports extracted by a third-party ETL solution. Where possible, we prefer to use it instead of writing our own integration.





An example here is the weather data we fetch from public APIs. For every delivery area, we query the weather forecasts and the actual weather measurements. This data has many applications, such as knowing drop time durations in rain and darkness, or planning extra ice packs on warm days. The effort to model this in DV outweighs the benefits, so we’ve settled on storing it as is.
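A rough sketch of that pattern (the weather API, response shape, delivery areas, and target table are hypothetical stand-ins): fetch the forecast per delivery area and store the payload as-is in the ODS schema.

```python
import json
import os

import requests
import snowflake.connector


def fetch_forecast(delivery_area: str) -> dict:
    """Query a public weather API for one delivery area (endpoint is illustrative)."""
    response = requests.get(
        "https://api.example-weather.com/v1/forecast",
        params={"area": delivery_area},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()


conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    database="DWH_NL",
    schema="ODS",
)
cur = conn.cursor()
for area in ["AMS-01", "UTR-02"]:  # illustrative delivery areas
    cur.execute(
        "INSERT INTO weather_forecast_raw (delivery_area, payload) SELECT %s, PARSE_JSON(%s)",
        (area, json.dumps(fetch_forecast(area))),
    )
conn.close()
```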





Our Data Lakeless approach





Although we capture large amounts of data in unstructured JSON format, we do not call it a Data Lake. We strongly believe that even if data is in JSON format, it should still have a strict schema it adheres to, and this schema is known up-front.





The process so common in Data Lakes of capturing data first and thinking about its structure later doesn’t fit our vision of a highly trusted DWH. In our experience, Data Lakes often turn into swamps within their first years of existence and come to resemble an archival data system more than a long-term trusted source for analytics.





We believe that data quality is a hard requirement in every data project. Without it, any machine learning effort is useless, and business decisions are questionable at best. Only 3% of companies meet basic data quality standards, and at Picnic, we strive to be one of them.





Final Takeaways





Data modeling is key to keeping a high-quality Data Warehouse that the business trusts and values as an easily accessible library for meaningful insights. The structured data and strict Data Governance guarantee good quality. To realize its full potential, users are encouraged to combine DWH data with their own files and mappings in a dedicated area of the DWH. This controlled write access ensures that sensitive data remains secure while encouraging wide DWH adoption within the organization.





Besides the importance of Data Modeling in the DWH architecture, it has been beneficial to maintain a very lean tech stack. There are many good tools out there, but to move fast, we need to stay lightweight. Our versatile toolbox consists of Python for ELT, Snowflake for DWH, Tableau for visualization, Argo for orchestration of jobs, Kubernetes for deployments, Snowplow for events processing, and GitHub for version control and code documentation.





We take a strong stand against un-governed Data Lakes. Instead, we support the whole company with a single-source-of-truth Data Warehouse and see the Data Engineering team as an accelerator and enabler in a data-driven organization. At Picnic, we live by Sherlock Holmes’ eloquent words: “It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.” In the next blog post of the series, I will share some of the mysteries we solve using data.

