Adding Eyes to Picnic’s Automated Warehouses Part 2

Written by Sven Arends · Oct 9, 2025, 06:42 · 15 min read

Our automated fulfillment center in Utrecht is a busy place. Every day, thousands of totes travel across conveyor belts, carrying everything from apples and avocados to chocolate bars and shampoo bottles. More than 99% of these products end up with families, but achieving this requires highly accurate stock counting. In principle, knowing our stock should be straightforward: the stock we have, plus what we order, minus what we sell, should equal what remains. In practice, however, shipments may arrive incomplete, and items can break during handling or expire before being sold. Maintaining precise stock levels is therefore not trivial, but it is crucial for keeping our promise to our customers.

In part 1, we shared why we installed cameras in our automated warehouse, the trade-offs that led us to a 2D camera setup, and why we bet on multimodal LLMs to solve most of our challenges. Over the past months, we’ve explored how to effectively productionize these multimodal LLMs. In this blog post, we dive deeper into what we’ve built, how it works, and the lessons we learned along the way.

Say Cheese!

To keep track of stock using cameras and AI, we need to capture almost a million images each day. While that may sound simple, the speed of the moving totes and the limited space for lighting make it a real challenge. In our initial attempt, many images turned out unusable: blurry, too dark, out of frame, or a combination of those.

Unusable tote images: too dark, out of frame (left), blurry (right)

Through a couple of iterations with our hardware partners, we came up with a versatile solution that reliably captures high-quality images like the one shown below.

The tote and accompanying product image

Eval-driven Development

Traditional development assumes deterministic code: specify behavior upfront, assert it with tests, and keep regressions out through CI. Evaluation-driven (eval-driven) development, on the other hand, treats AI like a living system: ship small, measure consistently, and let real tasks and metrics steer the roadmap. To measure anything, we first need to collect the correct ground truth labels.

Creating these by hand is tedious, as someone would need to look at each photo and label it manually. Since we built our Warehouse Control Software (WCS) in-house and already capture stock levels as part of our everyday process, we could simply collect the ground truth labels by tapping into that existing data, along with images of our products.

Surprisingly, while iterating on our prompts and system logic, we quickly found that the largest errors came from mislabelled data! It’s not unexpected that humans make mistakes too, but it did lead to an extra focus on ensuring our ‘golden dataset’ was of high quality.

We collected 15k+ images from totes. Then we designed metrics to measure success:

  • Primary metric: Accuracy
  • Secondary metrics: MAPE, RMSE, MSD

As a company focused on high-quality deliveries to our customers, we obviously care about accuracy. In this case, that means how often we predict the exact number of products correctly. Additionally, we use secondary metrics to add context to the system’s performance. MAPE (Mean Absolute Percentage Error) penalizes mistakes on almost empty totes more than on fully stocked totes. RMSE (Root Mean Square Error) quantifies the extreme errors we make (predictions that are way off the expected number). Meanwhile, MSD (Mean Signed Difference) offers insight into whether there is a consistent bias in the predictions, pointing out underestimation or overestimation risks.
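For illustration, here is a minimal sketch of how these four metrics can be computed for predicted versus ground-truth counts; the function and variable names are ours and not part of Picnic’s actual codebase.

```python
import numpy as np

def evaluate_counts(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """Compute the primary and secondary counting metrics over a golden dataset."""
    errors = y_pred - y_true

    return {
        # Primary metric: fraction of totes where the exact count was predicted.
        "accuracy": float(np.mean(y_pred == y_true)),
        # MAPE: relative error, so a miss of 1 on a near-empty tote weighs heavily.
        # (The max(., 1) guard avoids division by zero for empty totes.)
        "mape": float(np.mean(np.abs(errors) / np.maximum(y_true, 1)) * 100),
        # RMSE: emphasizes predictions that are far off the expected number.
        "rmse": float(np.sqrt(np.mean(errors ** 2))),
        # MSD: average signed error, revealing systematic over- or underestimation.
        "msd": float(np.mean(errors)),
    }

# Example: three totes, one miscount of a single unit.
print(evaluate_counts(np.array([12, 3, 40]), np.array([12, 2, 40])))
```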

Together, these metrics form our evaluation framework that not only highlights the strengths and areas needing improvement in the system, but also underlines the iterative nature of evaluation-driven development.

Data split

“We recommend an unusual split compared to deep neural networks: 20% for training, 80% for validation. This reverse allocation emphasizes stable validation, since prompt-based optimizers often overfit to small training sets.” (source)

At this stage, we didn’t yet know whether we’d go for a more classical DNN approach or an LLM approach. With thousands of images, we opted to split the data into three equal parts for training, validation and testing. That way we can use a ⅔, ⅓ split if we go for the traditional approach, or a ⅓, ⅔ split if we focus on prompt optimization. Note that in both cases, the test set remains untouched.
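In code, such a split can be as simple as shuffling once with a fixed seed and cutting the labelled images into three equal parts, keeping one part aside as the untouched test set. A sketch with illustrative names:

```python
import random

# Stand-in for the ~15k labelled tote images collected earlier.
image_ids = list(range(15_000))

random.seed(42)
random.shuffle(image_ids)

third = len(image_ids) // 3
part_1 = image_ids[:third]
part_2 = image_ids[third:2 * third]
test_set = image_ids[2 * third:]

# The test set stays untouched in every scenario. part_1 and part_2 are assigned
# to training and validation depending on the approach: more data for training
# when fitting a classical DNN, more data for validation when optimizing prompts.
```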

Solution directions

Before diving into a PoC, we first created a brief qualitative overview of the different solution directions we identified. Traditional (deep learning) computer vision models are widely applied in industry to solve a large variety of tasks, so it only made sense to include them as a possible direction. They excel at latency and ease of self-hosting, and, once trained on high-quality data, they power plenty of impressive industry use cases. One limitation is their poor out-of-the-box generalization. To address this, we would need to collect many images and annotate them with bounding boxes, since the strongest models rely on such data for effective training.

An alternative approach is to rely on State-of-the-Art (SotA) multimodal LLMs (MLLMs), spurred on by the impressive demos presented by the top foundational AI companies (OpenAI, Anthropic, Google). While we didn’t know for sure it would work, seeing these models solve tasks cognitively harder than counting in our qualitative overview, we assessed that their out-of-the-box and expected performance should fit our needs. As drawbacks, we saw potential challenges in latency and the inability to self-host these models.

We visualized the comparison using a traffic light approach: no solution is perfect, but MLLMs were a clear contender!

Comparing standard Deep Learning models vs Multimodal LLMs for our use case.

Unfortunately, this immediately raises a new question: which MLLM (family) should we pick for our use case? Each week, new models, improvements, and benchmarks are released, but re-running our dataset on every potential candidate would take too much time and be too costly. Hence, we needed a way to quickly assess whether a new model has potential, both on the capability and on the cost side. We defined that a good benchmark should be sufficiently hard, require visual understanding, and be updated regularly.

We settled on the MMMU benchmark (Massive Multi-discipline Multimodal Understanding). It’s a relatively hard benchmark (harder than simply counting stock), contains a variety of image types that are required to solve its tasks, and is often reported directly on model release by the foundational AI companies. It also provides reference scores for human experts and contains a good amount of data points for open-source models.

While current open-source models are not able to solve our task with high precision, we’re keeping a close eye on them. Relying on open-source models would give us various benefits, such as full control over deployment and the ability to make architectural changes to tailor the solution even further. And they are close competitors! Open-source models are only roughly nine months behind in performance, as shown by the trendlines below.

Performance trendlines of proprietary vs. open-source models over time (source)

Real pictures, real edge cases

We now have high-quality pictures, a way to measure success, and a model picked using the strategy explained above. Yet some pictures are impossible to count reliably, even for a human. As shown below, cartons, packaging and even stacked products can block the view.

To solve this, we added a countability check step to ensure we only make predictions for ‘countable’ pictures. To measure the success of this new classifier, we used the following metrics:

Precision: “Of all the images the model said were countable, how many were actually countable?”

Recall: “Of all the images that truly were countable, how many did the model successfully identify?”

Having seen how well the MLLMs were performing on the actual counting task, we naturally applied them to this task as well, and the results were… lackluster. With some prompt optimization we reached a fair recall score but continued to attain a weak precision score, a clear signal that the model is ‘overconfident’ in predicting the countability of images. After investigating the erroneous pictures, we came up with a few straightforward rules (e.g. excluding certain types of packaging from counting). These rules blew the MLLM-based performance out of the water, a good reminder that when working with AI systems, a few simple rules as guardrails can often solve most of the problems.

Simple rule-based methods outperform MLLM-based classification.
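To make this concrete, here is a hypothetical sketch of what such rule-based guardrails can look like; the specific rules and field names are illustrative and not our actual rule set.

```python
from dataclasses import dataclass

# Packaging types that are rarely countable from a top-down photo.
# These categories are illustrative examples only.
UNCOUNTABLE_PACKAGING = {"closed_carton", "opaque_bag", "shrink_wrapped_multipack"}

@dataclass
class ToteContext:
    packaging_type: str   # taken from the product master data
    is_stackable: bool    # stacked products hide items underneath

def is_countable(ctx: ToteContext) -> bool:
    """Cheap rule-based guardrail applied before asking the MLLM to count."""
    if ctx.packaging_type in UNCOUNTABLE_PACKAGING:
        return False
    if ctx.is_stackable:
        return False
    return True
```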

While this rule-based countability check was a pragmatic way forward, we as humans can intuitively outperform it, and with more granularity: we can quite easily tell whether we’re ‘sure’, ‘not too sure’ or ‘unable’ to count. Similarly, many traditional computer vision models naturally provide confidence scores out of the box. An approach we’re now exploring is making the MLLM output confidence scores by giving detailed instructions on what we think makes a picture ‘harder’ or ‘easier’ to count. This requires more time investment than the rules mentioned above, but it would allow us to, for example, still record ‘not too sure’ predictions in cases where a small error is acceptable (e.g. for larger counts), while only allowing high-confidence predictions when a tote is almost out of stock.
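One way to sketch this idea is to ask the model for a structured answer containing both a count and a confidence tier, and to decide downstream whether to record it. The schema and decision rule below are our own illustration of the approach, not a finished design.

```python
from enum import Enum
from pydantic import BaseModel

class Confidence(str, Enum):
    SURE = "sure"
    NOT_TOO_SURE = "not_too_sure"
    UNABLE = "unable"

class CountPrediction(BaseModel):
    count: int
    confidence: Confidence

def should_record(pred: CountPrediction, near_out_of_stock: bool) -> bool:
    """Accept lower-confidence counts only when a small error is acceptable."""
    if pred.confidence == Confidence.UNABLE:
        return False
    if near_out_of_stock:
        # Almost-empty totes need an exact count, so demand high confidence.
        return pred.confidence == Confidence.SURE
    return True
```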

Finetuning

Until this point, our experiments relied on the base models from the Gemini 2 family using only a text prompt. We saw a major jump in performance once we began adding richer, high-quality context to the prompt. In particular, including the product image gave the model a clear visual reference for what it should look for. This made the model significantly more robust to variations in product orientation and packaging, helping it make more reliable predictions across different scenarios.
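For illustration, here is a minimal sketch of what such a richer prompt can look like, using the OpenAI-style multimodal message format that gateways such as LiteLLM accept; the prompt wording and file paths are placeholders rather than the exact production prompt.

```python
import base64

def image_part(path: str) -> dict:
    """Encode a local image as an OpenAI-style image_url content part."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}

messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": (
            "The first image is a reference photo of the product. "
            "The second image is a top-down photo of a tote. "
            "Count how many units of the product are visible in the tote "
            "and answer with a single integer."
        )},
        image_part("reference_product.jpg"),  # illustrative paths
        image_part("tote_photo.jpg"),
    ],
}]
```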

While we were able to solve the problem using Gemini 2.5 Pro, the main drawback was cost: at the time of writing, Pro is in our case three to four times more expensive than Gemini 2.5 Flash. Although Flash performed reasonably well, it still fell short of fully replacing the manual work.

This is where supervised fine-tuning (SFT) comes in. With SFT, we adapt the model for our domain-specific task. The advantages include stricter adherence to the task at hand, ignoring irrelevant observations (such as stickers, loose or rotated packaging), and producing outputs that better follow the desired format with less ‘yapping’ (unnecessary text).

SFT can be performed by updating all model weights, known as full fine-tuning, or by training additional lightweight components on top of the existing model, an approach known as parameter-efficient fine-tuning (PEFT) (Houlsby et al., 2019). We opted to use Gemini’s built-in PEFT based on LoRA (Hu et al., 2021), as it provides a pragmatic way to adapt the model using relatively little data. The flexible data-splitting strategy we described earlier proved particularly valuable here, since this approach needs relatively little training data compared to training a classical DNN model.
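For intuition, LoRA keeps the pretrained weight matrix frozen and learns a small low-rank update on top of it, so only a fraction of the parameters are trained. A toy PyTorch sketch of the idea follows; Gemini’s built-in tuning handles this internally, so this is purely conceptual.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: y = base(x) + x A^T B^T * scale."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # original weights stay frozen
        # Low-rank factors: A projects down to `rank`, B projects back up.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```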

The results were encouraging as well: Gemini 2.5 Flash Fine-Tuned (FT) outperformed the base Pro model, giving us strong performance at a much more competitive cost.

Normalized performance of Gemini models on Picnic’s stock counting dataset

MLLMs in production

As with any mature ML system, we need to monitor model performance on a continuous basis. We didn’t build a new counting system from scratch; instead, we pragmatically repurposed the existing one. While the old system used to count stock for all totes, we now apply it only to a small subset we call “validation totes.” Every day, these totes are counted manually through the old flow, providing ground truth data to continuously monitor and track our model’s performance and accuracy over time.

In this production setup, we’ve installed 16 cameras, each capturing an image every second. Simultaneously, the tote’s barcode is scanned and, together with the corresponding image, sent to our service. Working closely with our hardware partner, we optimized image compression and encoding so that a FastAPI web service running on a standard Kubernetes node can process everything efficiently. The Vision ML service then prepares all necessary input for the prompt and retrieves the relevant product image from S3. Inference is performed using our fine-tuned model through our LiteLLM-powered AI gateway. Since model capabilities and costs evolve rapidly, as discussed before, LiteLLM allows us to easily switch between providers and maintain model flexibility from the start. The prediction is then pushed to our warehouse control software and finally to our data warehouse to enable later evaluation.
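A highly simplified sketch of what such a service can look like is shown below; the endpoint, the helper functions (lookup_product, build_messages, publish_to_wcs) and the model identifier are illustrative placeholders standing in for existing WCS and prompt-building code, not our actual implementation.

```python
import boto3
import litellm
from fastapi import FastAPI, File, Form, UploadFile

app = FastAPI()
s3 = boto3.client("s3")

@app.post("/count")
async def count_tote(barcode: str = Form(...), tote_image: UploadFile = File(...)):
    # Resolve which product the tote holds and fetch its reference image from S3.
    product_id = lookup_product(barcode)  # hypothetical WCS lookup
    ref_image = s3.get_object(Bucket="product-images", Key=f"{product_id}.jpg")["Body"].read()

    # Build the multimodal prompt (tote photo + reference image) and run inference
    # on the fine-tuned model via the LiteLLM gateway, keeping providers swappable.
    messages = build_messages(await tote_image.read(), ref_image)  # as sketched earlier
    response = litellm.completion(model="vertex_ai/<your-fine-tuned-model>", messages=messages)

    count = int(response.choices[0].message.content.strip())
    publish_to_wcs(barcode, count)  # hypothetical push to the warehouse control software
    return {"barcode": barcode, "count": count}
```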

What We Learned

Looking back, six lessons stand out:

  1. Boost LLM performance on your domain task by leveraging the unique context you have.
  2. Fine-tuned ‘mini’ models can outperform the ‘pro’ models.
  3. Proprietary LLMs have made significant advances in visual tasks over the past year.
  4. Open-source models remain roughly nine months behind the top proprietary ones but are now reaching practical performance levels for real-world visual tasks.
  5. Plan for LLM interchangeability from the start.
  6. Analyze the LLM mistakes and see if simple rules can fix most of the problems.

These improvements don’t just make our warehouses more efficient. They also help us deliver on Picnic’s promise: keeping grocery shopping simple, fun, and affordable for everyone.

Sounds cool? We’re hiring!

Machine Learning Engineer | Engineering | Picnic Careers


