Breaking a Supply Chain Monolith — A Learning Journey on the Ship of Theseus

Written by Chris Dekker · Aug 5, 2025, 06:46 · 13 min read

At Picnic we have grown in many different directions over a relatively short time span of a few years. On the Inbound Supply Chain side our mission has remained the same: determine what to order from our suppliers, when, and in what quantity, to satisfy the needs of our customers today and tomorrow. The complexity of finding the answers to these questions has grown significantly, though. As we grow in volume, ordering just-in-time (JIT), reducing waste and accurately predicting demand have an increasing impact on our bottom line.

At the heart of our domain is the Purchase Order (PO), a grouping of articles and quantities to be ordered from one of our hundreds of suppliers. Each PO is preceded by a Purchase Order Proposal (or POP), a predictive order generated in advance containing suggested order quantities. These POPs are either converted into a placed PO sent to the supplier, or used to schedule our physical warehouse layouts and available workforce more efficiently. About one in a hundred of the generated POPs is placed as a PO and sent to the supplier; the rest are used for planning or other analytical purposes.

Introducing the Monolith

Enter the Purchase Order Management (POM) service, a 200K lines-of-code Java Spring monolith responsible for generating these POPs, creating POs and managing supplier communication. It started out as little more than an advanced CRUD service, but as ever more fine-grained ordering logic was tacked on over the years to support a diverse set of ordering conditions and suppliers, it became harder and harder to reason about the produced results and to answer simple, often recurring questions such as “why are we proposing to order 1500 bananas today?”.

Generating a POP is a synchronous process taking seconds to minutes, running through all configured ordering logic from start to finish before producing the end result: the articles and their proposed quantities. This big-black-box behavior led to a lack of runtime visibility, performance concerns, poor maintainability and, as a result, declining developer velocity. Something needed to be done!

The Goal

We decided to focus on breaking off the predictive ordering part of the monolith in such a way that it:

  • Increases runtime visibility — by explaining the output.
  • Increases code maintainability — by reducing complexity.
  • Increases performance — by scaling with the growing business.

On top of that, the migration did not mean we could stop shipping features. We therefore preferred an approach that would put newly written code to use directly, while avoiding situations where we had to duplicate logic in the codebase, with the synchronization issues and long cut-overs that come with that.

The Plan

To solve this puzzle, we settled on a three-step plan:

  • Make an inventory of all necessary functionality and split it into logical modules with their dependencies.
  • Implement each module independently and wire it up inside the existing infrastructure.
  • Create a new service architecture built on top of an asynchronous execution framework to make the step modules independently scalable.

Generating a POP: The 17-Step Program

After careful archaeological expeditions into the code base, we settled on 17 distinct steps, and their dependencies, needed to arrive at an ordering proposal. Each step is an isolated unit of logic with a single input and a single output. These steps include retrieving information from our own databases or configuration, consulting the predictive AI models, and applying our own ordering business logic to the previously retrieved input.

17 steps required to generate a POP with their dependencies
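
The figure above shows each step declaring which other steps’ outputs it consumes. As a minimal, hypothetical illustration of how such a graph can be modeled (none of these step names come from the actual codebase):

import java.util.Map;
import java.util.Set;

/** Hypothetical subset of the 17 steps, modeled as an explicit dependency graph. */
enum PopStep { FETCH_ASSORTMENT, FETCH_STOCK, PREDICT_DEMAND, APPLY_ORDERING_RULES }

final class StepGraph {

  /** Each step declares which other steps' output it consumes. */
  static final Map<PopStep, Set<PopStep>> DEPENDENCIES = Map.of(
      PopStep.FETCH_ASSORTMENT, Set.of(),
      PopStep.FETCH_STOCK, Set.of(),
      PopStep.PREDICT_DEMAND, Set.of(PopStep.FETCH_ASSORTMENT, PopStep.FETCH_STOCK),
      PopStep.APPLY_ORDERING_RULES, Set.of(PopStep.PREDICT_DEMAND, PopStep.FETCH_STOCK));
}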

Additionally, this was a great opportunity to challenge existing functionality, allowing us to simplify flows and drop no-longer-relevant edge cases that would otherwise have needed to be supported ad infinitum.

Modular Implementation: Ship of Theseus

These steps are implemented as separate Maven modules exposing the following interface, built on Project Reactor. Each module has its own unique input and output types containing the relevant fields and collections.

Mono<O> process(I input);
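
As an illustration, a single step module could look roughly like the sketch below. Only the process signature comes from the actual interface; the Step container, the types and the ordering rule are hypothetical stand-ins.

import java.util.Map;
import java.util.stream.Collectors;

import reactor.core.publisher.Mono;

/** The shared step contract; the surrounding interface name is an assumption. */
interface Step<I, O> {
  Mono<O> process(I input);
}

/** Hypothetical input/output types; each real module defines its own. */
record OrderingInput(Map<String, Integer> demandPerArticle, int caseSize) {}

record OrderingOutput(Map<String, Integer> proposedQuantityPerArticle) {}

/** Sketch of one step module: pure business logic, unaware of the surrounding pipeline. */
final class OrderingStep implements Step<OrderingInput, OrderingOutput> {

  @Override
  public Mono<OrderingOutput> process(OrderingInput input) {
    return Mono.fromCallable(() -> new OrderingOutput(roundUpToCaseSize(input)));
  }

  /** Rounds each demanded quantity up to a full supplier case. */
  private static Map<String, Integer> roundUpToCaseSize(OrderingInput in) {
    return in.demandPerArticle().entrySet().stream()
        .collect(Collectors.toMap(
            Map.Entry::getKey,
            e -> (e.getValue() + in.caseSize() - 1) / in.caseSize() * in.caseSize()));
  }
}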

This approach lets us do two important things: develop these modules independently in parallel and, once completed, integrate each module into the existing infrastructure. This means we can immediately battle-test and mature the new modules inside the existing infrastructure, in production. It lets us spot and fix any issues early as we gradually roll out small changes, and it already allows us to introduce new features in the migrated modules, reducing the need for a feature freeze.

Gradually replacing old code with new in the same legacy execution

The only additional custom code we needed to write was the mapping of the relevant information in the legacy flows to each module’s I input type. As more and more modules were migrated, we introduced mappers converting the O outputs of multiple modules into a single I input. While this converter ‘glue’ between the step modules can be quite verbose, it is not complex code, and the benefit of abstracting it away from each module’s core business logic outweighed any potential duplication.
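
A hypothetical piece of such glue, reusing the stand-in types from the sketch above, could look like this:

import java.util.Map;

/** Hypothetical predecessor outputs feeding into the ordering step. */
record FetchOutput(int caseSize) {}

record DemandOutput(Map<String, Integer> demandPerArticle) {}

final class OrderingInputMapper {

  /** Verbose but simple field-by-field mapping; no business logic lives here. */
  static OrderingInput map(FetchOutput fetch, DemandOutput demand) {
    return new OrderingInput(demand.demandPerArticle(), fetch.caseSize());
  }
}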

This ultimately lets us gradually modularize and replace the entire code base without interrupting service at any point. A modern-day ship of Theseus charging full steam ahead!

Don’t Let the Ship Sync

At this point, the goal of increasing code maintainability has been achieved. However, as long as we’re still executing all steps synchronously inside the legacy application, neither the performance and scalability nor the explainability of our results has improved.

To achieve the remaining goals, we want to execute each step asynchronously, wrapped in a new service. Luckily, at Picnic we already have a tried and true asynchronous Java framework built on top of MongoDB. This framework stores each step’s input and output and orchestrates the dependencies between them. In theory, we can distribute the steps across multiple deployments and scale them individually and horizontally. Additionally, we can restart from any step or inspect the output for debugging purposes.
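
Conceptually, a worker in such a polling framework claims pending work roughly like the sketch below. This is a simplification using the plain MongoDB Java driver, not Picnic’s actual framework code.

import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Updates;

import org.bson.Document;

/** Simplified sketch of how a polling worker could claim a pending step job. */
final class PollingWorker {

  private final MongoCollection<Document> jobs;

  PollingWorker(MongoCollection<Document> jobs) {
    this.jobs = jobs;
  }

  /** Called on a fixed interval; the idle time between polls is pure overhead. */
  void pollOnce() {
    // Atomically claim a single pending job, or get null if there is none.
    Document job = jobs.findOneAndUpdate(
        Filters.eq("status", "PENDING"),
        Updates.set("status", "RUNNING"));
    if (job != null) {
      execute(job);
    }
  }

  private void execute(Document job) {
    // Run the step and persist its output back onto the job document (elided).
  }
}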

Clouds on the Horizon

Once we adopted the asynchronous framework, we immediately encountered some challenges. Due to the polling nature of the framework, the idle time between step executions compounded significantly, hurting the execution time of a single job. Even though we could now easily scale horizontally, the added time for a single execution was a step back. Similarly, the persisted input and output of each step grew prohibitively large, sometimes exceeding MongoDB’s maximum document size of 16 MB. Back to the drawing board!

We solved these challenges by gathering the 17 steps into 4 distinct groups that are still executed asynchronously and remain independently scalable; within each group, however, the steps are executed in sequence. We decided on the following logical grouping:

  • FETCH — Retrieving information from various sources (databases, configuration, etc.)
  • DEMAND — Consulting the predictive AI models.
  • ORDERING — Applying Picnic’s business logic magic on the input to come to the desired order quantities.
  • PERSISTENCE — Assembling the final end result document from all previous output.

To overcome MongoDB’s storage limitations for the intermediate steps, we instead leveraged AWS S3 to store each step group’s output in full.
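
As a sketch of the idea, using the AWS SDK for Java v2 (the bucket name and key scheme are hypothetical):

import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

/** Sketch: spill each step group's full output to S3, keyed by job ID and group. */
final class StepOutputStore {

  private final S3Client s3 = S3Client.create();
  private final String bucket = "pop-step-outputs"; // hypothetical bucket name

  void store(String jobId, String group, byte[] serializedOutput) {
    s3.putObject(
        PutObjectRequest.builder().bucket(bucket).key(jobId + "/" + group).build(),
        RequestBody.fromBytes(serializedOutput));
  }

  /** Loading FETCH or DEMAND output later is what enables deterministic local replay. */
  byte[] load(String jobId, String group) {
    return s3.getObjectAsBytes(
            GetObjectRequest.builder().bucket(bucket).key(jobId + "/" + group).build())
        .asByteArray();
  }
}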

This has let us accomplish yet another one of our goals: explainability! For each executed job, we can now easily inspect the input gathered by the FETCH or DEMAND steps at the moment of execution, even letting us re-run and debug the ORDERING steps locally in a deterministic fashion. This has cut down support effort significantly.

This leaves the last goal to accomplish: performance! It has now become apparent that the polling nature of the MongoDB framework is not sufficient for our workload.

The Tortoise and the… Rabbit(MQ)

While the current storage solution works great for our use cases, it is now clear that a database-polling approach does not suffice for orchestrating, delegating and executing the work. RabbitMQ, another staple in Picnic’s tech stack, could prove to be the right tool for the job here. Transitioning from polling to an immediate push approach using messages would still let us orchestrate the work in a distributed fashion, allowing us to scale each part individually. Thanks to the separation between the step implementations and the asynchronous framework, we could swap out one framework for the other with minimal effort. After only two weeks of work, we ended up with the following framework implementation:

Async framework implementation backed by RabbitMQ

Each dashed line is an asynchronous RabbitMQ message, and each vertical system can be deployed and scaled independently. That means we only need to look at the sizes of the various ‘trigger step’ queues and scale up that part of the application to efficiently eliminate a bottleneck. When the system is otherwise idle, each message is handled near-instantly, greatly outpacing the database-polling approach.

Each step is responsible for storing its own results, keeping the communication with the Orchestrator fast and lean. Subsequent steps retrieve the produced results directly from their predecessors, further offloading the Orchestrator and allowing it to handle many jobs in parallel.
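
In Spring AMQP terms, a step worker could look roughly like the sketch below. The queue and exchange names are hypothetical, and the records are simplified stand-ins for the Immutables-generated message types shown after this sketch.

import org.springframework.amqp.rabbit.annotation.RabbitListener;
import org.springframework.amqp.rabbit.core.RabbitTemplate;
import org.springframework.stereotype.Component;

/** Simplified stand-ins for the Immutables-generated message types shown below. */
record StepRequestMessage(String pipelineId, String stepJobId, String stepType) {}

record StepStatusMessage(String stepJobId, JobStatus status) {}

enum JobStatus { SUCCESS, FAILED } // hypothetical terminal statuses

/** Sketch of a step worker: consume 'trigger step' messages, report completion. */
@Component
class OrderingStepWorker {

  private final RabbitTemplate rabbitTemplate;

  OrderingStepWorker(RabbitTemplate rabbitTemplate) {
    this.rabbitTemplate = rabbitTemplate;
  }

  @RabbitListener(queues = "trigger-step.ordering")
  void onTriggerStep(StepRequestMessage request) {
    // Retrieve predecessor outputs directly from the steps' own storage (e.g. S3),
    // run the business logic, and persist this step's result the same way.
    JobStatus status = runStepAndStoreResult(request);
    // Report back to the Orchestrator with a small 'step completed' message.
    rabbitTemplate.convertAndSend(
        "step-status", "", new StepStatusMessage(request.stepJobId(), status));
  }

  private JobStatus runStepAndStoreResult(StepRequestMessage request) {
    throw new UnsupportedOperationException("step execution elided");
  }
}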

@Value.Immutable
public interface StepRequestMessageInterface<P extends PipelinePayload, T extends StepType> {

  /** Job ID of the pipeline triggering this step. */
  String getPipelineId();

  /** Requested job ID of the step to be started. */
  String getStepJobId();

  /** Type of the step to be started. Is part of the routing key as well. */
  T getStepType();

  /** Pairs of step types and step job IDs referencing completed predecessors. */
  ImmutableList<StepKey<T>> getPredecessors();

  /** Pipeline job request. */
  P getRequest();
}

Contents of the ‘trigger step’ message between the Orchestrator and the steps. Note the generic types P and T, which make no assumptions about the kind of work being scheduled and executed.

@Value.Immutable
public interface StepStatusMessageInterface {

  /** Step job ID of the finished step. */
  String getStepJobId();

  /** Terminal status of the finished step job. */
  JobStatus getStatus();
}

Contents of the ‘step completed’ message between steps and Orchestrator.
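
Putting the two messages together, the Orchestrator’s reaction to a completed step could be sketched as follows. PipelineState is a hypothetical bookkeeping abstraction, and the stand-in message records from the worker sketch above are reused.

import java.util.List;

import org.springframework.amqp.rabbit.annotation.RabbitListener;
import org.springframework.amqp.rabbit.core.RabbitTemplate;
import org.springframework.stereotype.Component;

/** Hypothetical bookkeeping abstraction tracking each pipeline's progress. */
interface PipelineState {

  void markCompleted(String stepJobId, JobStatus status);

  /** Successors of the finished step whose predecessors have now all completed. */
  List<StepRequestMessage> readySuccessors(String stepJobId);
}

/** Sketch of the Orchestrator side: a completed step may unlock its successors. */
@Component
class OrchestratorListener {

  private final PipelineState pipelineState;
  private final RabbitTemplate rabbitTemplate;

  OrchestratorListener(PipelineState pipelineState, RabbitTemplate rabbitTemplate) {
    this.pipelineState = pipelineState;
    this.rabbitTemplate = rabbitTemplate;
  }

  @RabbitListener(queues = "step-status")
  void onStepCompleted(StepStatusMessage status) {
    pipelineState.markCompleted(status.stepJobId(), status.status());
    // The step type doubles as the routing key, so each 'trigger step' message
    // lands on the queue of the deployment responsible for that step.
    for (StepRequestMessage next : pipelineState.readySuccessors(status.stepJobId())) {
      rabbitTemplate.convertAndSend("trigger-step-exchange", next.stepType(), next);
    }
  }
}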

The framework itself is abstract and can be applied to any kind of sequential or even parallel workload. The POP-specific workload is a thin layer of configuration and implementation on top of it, tying it into the previously described steps.
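
For instance, the POP layer could plug into the framework’s generic parameters along these lines (a sketch; only StepType and PipelinePayload appear in the actual interfaces above, and they are assumed here to be plain marker-style interfaces):

import java.time.LocalDate;

/** Hypothetical POP-specific step types, plugged into the generic T parameter. */
enum PopStepGroup implements StepType { FETCH, DEMAND, ORDERING, PERSISTENCE }

/** Hypothetical POP request, plugged into the generic P parameter. */
record PopPipelineRequest(String supplierId, LocalDate deliveryDate) implements PipelinePayload {}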

Lessons Learned

This whole journey took a little over nine months and stayed largely on target, both in terms of time and results. Looking back, these are the things that helped tremendously in making the project go smoothly, even given the challenges we faced:

  • Splitting the work early into many small subtasks matching the 17 steps. This let us estimate the work for each step more precisely, with some added overhead for the framework.
  • Given the total amount of time needed, replacing the codebase part by part and immediately using the result in the existing application worked great: it battle-tested new code right away and allowed continuous development of new features in the new codebase.
  • Strictly separating the framework from the business logic let us quickly replace the whole framework at the eleventh hour, when we realized the originally planned solution would not suffice.

Given these valuable lessons learned, this project was not only a success within its own scope; the approach has also become a blueprint for similar projects at Picnic in the future!

