
How we broke customer support language barriers without breaking production

Written by Carlos Arcenas · May 28, 2024, 07:07 · 13 min read


Ever since we first started with a single vehicle in one city nine years ago, Picnic has always taken our end-to-end user journey very seriously, and our increasing attention to detail is borne out by how we’ve been voted the most customer-friendly supermarket in the Netherlands for three years running. This watchful eye becomes even more critical when that journey takes some unexpected turns. At Picnic, our Customer Success team stands ever ready to help our customers when these turns happen, backed by our comprehensive and battle-tested tech stack and further enhanced with machine learning techniques.


While we always ensure that we engage our customers with a thoughtful, human touch, we’re no strangers to using artificial intelligence and machine learning techniques to improve our customer support processes. In fact, we’ve been writing about this use-case on this very blog since 2018. However, finding something that works and bringing it to production doesn’t mean that we rest on our laurels. As the years pass, and our customer base expands, we continuously find ourselves reevaluating our existing solutions to understand whether they still meet our needs, and are constantly on the lookout for new technologies and pathways to enhance our customer experience.


In this blog post, we’ll discuss how we combined a forward-looking eye for emerging ML techniques with evolving business needs to chart out a successful migration strategy for one of our most critical processes. We’ll also dive into our technology selection process, and how we validate new solutions to existing problems, all without breaking operational continuity.


The brief and the status quo


At Customer Success, we receive thousands of questions and messages from our customers on a daily basis, spanning a broad range of topics and complexity. This poses challenges for how we allocate all of these across our agent pool — a complex inquiry about an unprocessed payment requires more experience and effort than a quick question about delivery times. To ensure that we don’t overwhelm a newcomer, or underutilize a seasoned agent, we rely on several ML models to help us classify incoming customer support requests in order to route them to the agent best equipped to handle the challenge at hand.


We have models running for our email, WhatsApp, and in-app feedback channels that identify customer intent, alongside another model that infers customer sentiment. We decided on deploying a dedicated model for each communication channel as we noticed that the length and tone of speech differs quite a bit between them — the emails we receive tend to be longer and written in a formal style; WhatsApp plays host to shorter, more casual messages; and in-app feedback is the shortest of all, with nary a sentence or two of text we can use to classify.


In addition, our model architecture at the time involved using two different tokenizers to account for the two languages we supported when the project was first conceptualized: Dutch and German. The use of two distinct tokenizers (which fed into a logistic regression for classification) naturally led to the need for country-specific models to be deployed.
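
To make the old set-up concrete, here is a minimal, purely illustrative sketch of what one such per-language pipeline might look like, assuming a scikit-learn-style vectorizer feeding a logistic regression. The function and variable names are hypothetical, not our actual production code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def build_channel_classifier() -> Pipeline:
    """One artifact per (channel, language) pair: a language-specific
    vectorizer feeding a logistic regression classifier."""
    return Pipeline([
        ("vectorizer", TfidfVectorizer(lowercase=True)),
        ("classifier", LogisticRegression(max_iter=1000)),
    ])

# Dutch and German each needed their own artifact, trained only on their own data,
# which is how 4 model types x 2 languages ended up as 8 deployed models.
nl_email_model = build_channel_classifier()  # fit on Dutch emails only
de_email_model = build_channel_classifier()  # fit on German emails only
```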


With four different model types and two languages to support, we were in effect running eight different models for our CS routing! While this worked for our original requirements (and has been serving Customer Success well for more than half a decade), we ran into some significant issues when we wanted to broaden our use of the models:



  1. We weren’t able to immediately roll out issue classification in France, our newest country, due to the need to stand up new models for each of our message origins. With our current set-up, this would require a lot of effort to replicate all of our existing processes — not to mention cause a maintenance headache further down the line.

  2. Even if we were willing to bear the start-up and maintenance overhead, we also had to deal with the cold start problem — given that we had just begun operations, we did not have sufficient data to properly train a French language model.

  3. We also wanted a solution that could continue growing with the company. While we are currently operating in the Netherlands, Germany, and France, we did not want to have to spin up and train new models tailored to a specific geography if we could avoid it.

  4. Given our language-specific tokenizer approach, we could only classify messages sent in the country’s language with confidence. This poses problems if somebody chooses to contact us in a different language, such as an expat asking us in English why their groceries arrived too late.


We could have overlooked these issues and pushed forward with standing up a brand-new, separate model in France if we had a consistent, high-performing model. However, even that was not the case: our existing solution was unable to arrive at a confident classification for a concerningly high proportion of cases.


With all of that in mind, we concluded that it was high time to finally investigate and eventually migrate to more modern ML architectures. We also wanted to investigate whether pursuing a unified, multilingual approach would result in significant performance gains.


Enter an LLM?


We at Picnic are always cognizant of the latest developments in the tech space, and that naturally extends to keeping a close eye on the rapidly changing field of machine learning, and of course the emergence of large language models (LLMs).


While issue and intent classification is a very natural use case for LLMs, we didn’t immediately reach for that as our tool of choice. Through a hackathon on this very topic, our colleagues confirmed that not everything has to be solved using an LLM — especially when it comes to dealing with sensitive customer information and ensuring operational continuity.


While open-source models such as Llama and Mixtral open up pathways to isolated, self-hosted LLMs, we felt that this would be overkill for what we wanted to achieve: in essence, a solution in search of a problem. In our view, there is still significant value in developing and deploying solutions that aren’t always on the bleeding edge of technology. Not only do they end up being simpler and more explainable to our stakeholders across the company, they also give us confidence that we’re working with systems that have been tried, tested, and verified, and that continually give our customers the high level of service they’ve come to expect from Picnic.


But that doesn’t mean we’ve closed the door on LLMs altogether. If anything, we’re eagerly uncovering new use cases within Picnic that are indeed best served by the predictive capabilities of such powerful models. You can check out this post from my colleague Maarten on how we’re using LLMs to power our improved product search experience.


Our next step


In the hackathon post I mentioned earlier, we discussed how Picnic benchmarks new technologies for customer feedback classification, and how we put those shiny new tools through their paces. From that same hackathon where we determined that an LLM may not be the best solution for this problem, we also identified how a multilingual BERT model could deliver the best of both worlds: an innovative and highly performant system based on the groundbreaking Transformer architecture, yet small, adaptable, and easily maintainable by our dedicated machine learning team here at Picnic. We believed this was the best approach, as it balanced our desire to work with state-of-the-art systems against delivering value to our internal users. However, given how critical the existing issue classification model was for our operations, we did not want to jump the gun and commit to this approach without first validating that it would work for us. We decided to first implement this architecture through a proof of concept for another, related problem in Customer Success.


Proving the concept


Luckily, given the vast problem space inherent to the customer support domain, we didn’t have to look far for a suitable candidate. Our operations team noticed that their agents were mistakenly routed messages that did not require any action from them, such as when a customer says “thank you” at the end of a conversation. Given this well-scoped, well-defined phenomenon, which we had validated was happening across all of our operating countries, we thought that filtering out these messages would be the perfect test case for our new architectural approach. While the application isn’t an apples-to-apples comparison with intent and sentiment analysis, it was still a suitable classification problem that addressed the same questions, and it provided us with sufficient data to make a final decision on a full-scale migration to BERT.


In contrast to our existing model architecture, which relied on setting up separate pipelines for each market, we were able to combine data from all of our operating countries and use it to fine-tune a single BERT-based model. Given our multilingual needs, and wanting to balance model performance with resource needs, we settled on using the distilbert-base-multilingual-cased model from HuggingFace.
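
Loading that backbone is straightforward with the HuggingFace transformers library. The snippet below is a minimal sketch; the two-label head reflects the binary “action needed vs. no action needed” task of this proof of concept.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "distilbert-base-multilingual-cased"

# One multilingual tokenizer and backbone, shared across all markets and languages.
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=2,  # binary head for the proof of concept: actionable vs. not actionable
)
```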


We then fine-tuned the model using tens of thousands of messages annotated by human experts. In line with our multilingual approach, we combined the messages tagged across all our markets and languages into a single unified dataset. We also decided to experiment with using only the message text as input to the model, leaving out all other derived features, to understand the predictive power of the new architecture on its own.
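
To give a feel for what that looks like in code, here is a hedged sketch of the fine-tuning step, reusing the tokenizer and model from the snippet above. The in-line example messages, label meanings, and hyperparameters are illustrative stand-ins for our real annotated dataset and training configuration.

```python
import pandas as pd
from datasets import Dataset
from transformers import TrainingArguments, Trainer

# Illustrative pooled data: Dutch, German, and French messages in one dataset
# (the real training set was tens of thousands of expert-annotated messages).
messages = pd.DataFrame({
    "text": [
        "Mijn bestelling is te laat bezorgd.",
        "Vielen Dank, alles ist angekommen!",
        "Merci beaucoup, bonne journée !",
        "Ik heb een vraag over mijn betaling.",
    ],
    "label": [1, 0, 0, 1],  # 1 = action needed, 0 = no action needed
})

dataset = Dataset.from_pandas(messages).train_test_split(test_size=0.25, seed=42)

def tokenize(batch):
    # Only the raw message text goes in; no derived features.
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="cs-distilbert-poc",   # illustrative output path
    num_train_epochs=3,
    per_device_train_batch_size=32,
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,                       # from the previous snippet
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    tokenizer=tokenizer,               # enables dynamic padding via the default collator
)
trainer.train()
```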


Through this end-to-end approach, we came up with a single model artifact that could be used in all countries — a stark contrast to how we needed to spend considerable time and resources to create country-and-language specific artifacts.
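
To make that concrete: the same fine-tuned model can be wrapped in a single inference pipeline and handle messages in any of our languages. A small illustrative sketch, continuing from the snippets above:

```python
from transformers import pipeline

# One artifact for every market: wrap the fine-tuned model and tokenizer
# from the snippets above in an inference pipeline.
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

print(classifier("Bedankt, alles is goed aangekomen!"))     # Dutch
print(classifier("Meine Lieferung ist nicht angekommen.")) # German
print(classifier("Merci, tout est arrivé à temps !"))       # French
```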


We rolled this out in one market to test, and after just a week it was crystal clear that the proof of concept was a roaring success. We were able to identify and deflect 50% of messages that would otherwise have been routed to our agents despite being unactionable, with an error rate of less than 4%. Translated into more concrete terms, this saved the effort equivalent of one full-time CS agent: not bad for a small proof of concept!


Scaling up the solution


With the experience gained from the proof-of-concept in our toolbox, we could now confidently tackle the total revamp of our case classification models.


We followed the same approach as before, using the tried-and-tested distilbert-base-multilingual-cased model, but now with millions of rows of training data, and tasking the model with predicting one of several predefined customer intents instead of just a binary classification. Just like in the proof of concept, we combined all of the tagged messages into a single unified dataset for training, testing, and evaluation. Emboldened by how well it worked there, we also took the courageous step of dropping almost all other derived features from our data, using only the text input from our customers along with the labels filled in by our customer service agents.
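
Structurally, the only change from the proof of concept is the classification head: instead of two labels, the model predicts one of the predefined intents. A brief sketch, with a purely illustrative label set standing in for our real intent taxonomy:

```python
from transformers import AutoModelForSequenceClassification

# Hypothetical intent labels for illustration; the real taxonomy comes from our CS domain.
intent_labels = ["delivery", "payment", "product_quality", "account", "other"]
label2id = {label: i for i, label in enumerate(intent_labels)}
id2label = {i: label for label, i in label2id.items()}

intent_model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-multilingual-cased",
    num_labels=len(intent_labels),
    id2label=id2label,
    label2id=label2id,
)
```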


The key metric we used for evaluating the new model was the F-beta score with beta = 2, which reflects our business need to prioritize recall (capturing as many relevant messages per category as possible so that they reach the right set of agents) over precision (ensuring that the routed messages themselves are highly relevant), especially considering the high cost of messages that the model can’t classify at all. The results were immediately apparent: compared to our existing logistic regression-based models, we saw F-beta scores jump by more than 10%. Business-wise, this led to a massive reduction in cases that would otherwise have gone unclassified by the model: an improvement of at least 50% for all origins, with our email model showing an astonishing improvement of 85%.
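
For reference, the F-beta score with beta = 2 weights recall roughly four times as heavily as precision in the harmonic mean, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R). The toy example below, using scikit-learn’s fbeta_score on made-up labels, shows how such a score can be computed; it is not our actual evaluation code.

```python
from sklearn.metrics import fbeta_score

# Made-up true and predicted intent labels, purely for illustration.
y_true = [0, 1, 1, 2, 2, 2, 0, 1]
y_pred = [0, 1, 2, 2, 2, 1, 0, 1]

# beta=2 favours recall over precision, matching our routing priorities.
score = fbeta_score(y_true, y_pred, beta=2, average="macro")
print(f"macro F2 score: {score:.3f}")
```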


Given the undeniable performance improvements, we agreed with our business stakeholders to an aggressive roll-out schedule, fully sunsetting the eight models developed using the old architecture across all case origins and markets (and onboarding a new market) and migrating to the new set of four global models in just eight weeks.


The outcomes


We were able to roll out a comprehensive, high-performing solution that can scale with our ambitious plans — all hosted on our own cloud infrastructure, and maintained at a fraction of the cost of what a third-party hosted equivalent would charge.


The migration to a full BERT end-to-end solution addressed the major challenges we detailed at the beginning of this article:



  1. Instead of having to maintain separate models for every use case and geography, we’re now able to have one global model per use case, with all of them trained using a generic pipeline, greatly reducing our future maintenance costs.

  2. We were able to use existing case data from other markets to set up an immediate high baseline quality of predictions for France, thereby minimizing the impact of the cold start problem.

  3. With the ability to train our models using unified, multilingual datasets, supporting new markets and languages becomes a trivial matter.

  4. We’re now able to start classifying messages in languages outside of Dutch, French, and German, meaning that we can better serve our expat customers and be ready for expansions into other markets and languages.


The future


While we’re incredibly proud of the progress we’ve made with this architecture migration, and how we’ve used this to address a whole other need identified by our Customer Success operations team, we’re not stopping here.


In fact, our internal users have expressed that this kind of classification, where messages from our customers are matched against historically and manually identified topics, may not be the best way to manage a sprawling customer support operation. Instead, they are starting to point towards wanting a classification system that’s much more dynamic, able to respond to trends as they emerge from moment to moment, and ready to rise to the unique challenges posed by a rapidly growing international operation.


Maybe that’ll be the best opportunity to really consider the power of LLMs, and finally introduce them to our CS experience. But that’ll be a blog post for another time :)


Are you interested in working with the latest technologies and the cleanest data? We are actively seeking talented individuals for a variety of machine-learning engineering roles. Join us in shaping the future of customer success — find out more about these opportunities here!


