
Incident Management at Picnic

Written by Steven van de Ridder · Jan 27, 2020 · 11 min read

Nearly everyone has experienced the struggle of making an appointment with a service technician. Once you have waited for a good 20 minutes, listening to a royalty-free acoustic version of a ’90s pop song, the call-center agent on the other end of the line grumpily informs you the service technician is “delighted to support you on Tuesday, two weeks from now, between 10:00 AM and 4:00 PM” 🤦‍♂





For that day, you cancel your work meetings, plan as few calls as possible and make sure you are equipped with 6 hours of offline work. At 5:17 PM, the technician calls to inform you that he is unfortunately not able to make it, and proceeds to give you a laundry list of reasons why. He could not pick up the right equipment, faced a roadblock on his way and one of his previous appointments took more time than expected. He assures you he will be there the next day between 8:00 AM and 1:00 PM 😠





From a customer perspective, you don’t really care what happened along the way. All you want is predictability and confidence: that the phone gets answered as soon as possible; that you spend as little time on the topic as possible; and that whatever needs to get fixed gets fixed, asap. It’s infrastructure, it’s a commodity. And as with many commodities, we tend to forget that a lot of dependencies and intertwined processes need to work together flawlessly to make it all happen.





Where we came from





💩 happens… what really matters is how you deal with the mess




At Picnic, we offer our customers a one-hour delivery window when ordering, which we refine to a 20-minute window on the day of delivery. This requires smooth operational processes in every part of the supply chain, intelligent tools to support these processes and extremely tight cooperation between teams.





This holds even more so when the proverbial 💩 hits the fan. What happens if one of our suppliers conducts a product recall after the discovery of a potential safety risk? What happens if WiFi goes down in one of our fulfillment centers? What if truck drivers get stuck in traffic? What if farmers protest in the delivery area and our delivery drivers cannot get to our customers?





When we were still a small start-up, not much was needed to get everyone to drop what they were doing and help to fix any operational issues. Any incident could be fixed with a few smart people or an extra pair of hands. We had a Slack channel (creatively named #emergency) as our go-to spot for reporting, triaging and resolving incidents. No matter how big (the entire store going offline) or small (one extra pair of hands needed for a delivery trip) everything went through #emergency. And it was running like clockwork.





For a true ops decision-maker, dealing with these kinds of incidents is what makes you tick. Yet when your operations grow by double digits every week, they become more of a hindrance than a fun challenge. We realized that our operation had become more complex, and so had the incidents that occurred. It had become more difficult for new people to know who to reach out to for help, communication about incidents had become more dispersed, cross-team dependencies had grown larger and there was little transparency about the incidents that did occur.





Operation Snowball





Our exact sentiment during ‘Operation Snowball’




Forever in our memory: the 22nd of January. In the morning, we were facing the first snowflakes of 2019. Snowflakes turned into centimeters of snow in the afternoon, and it did not take long before the Royal Dutch Meteorological Institute issued a “code orange”, warning everyone to stay safely inside and away from snowy, slippery roads.





Early that morning, some Picnic colleagues on their way to our fulfillment centers were thinking of raising their concerns about the situation on the road. But where? And to whom?





It was not until 09:23 AM that one of the founders un-archived and brushed up another creatively named Slack channel (#sc-emergency), where a random group of supply chain people was still present from a previous incident. However, this was a private channel not known to anyone else in Picnic, so it took even longer to “collect” all the relevant people. This group of “random supply chain people” had found itself in Picnic incidents multiple times before, but this snowball was rolling down the hill and quickly growing in size.





As this group was starting to investigate the safety on the roads around the hubs (you can imagine we put the health and safety of both our Runners and our customers first), the fulfillment centers had already prepared the majority of the orders our customers were expecting that day: baskets of groceries ready to be shipped to our local hubs. Only then did we start to think: if it’s unsafe to deliver, we cannot ask customers to come pick up their groceries either. And if we can neither deliver nor ask customers to pick up, what should we do with the hundreds of thousands of items already in baskets that we would need again the next day? How should we approach the situation? Who should be in charge of such an incident? Who should be involved? We were not well prepared. At all. The laws of the small start-up did not apply to this snowball.





We realized what we needed, but there was not enough time to go and do it. We needed a central emergency coordinator, one representative of each impacted team, scheduled check-in calls, local input on conditions, models to simulate the impact of our decisions, clear deadlines for process-dependent decisions, contingency plans for impact and so on and so forth.





Eventually, we decided not to deliver that day because we considered it unsafe. We decided to postpone all our deliveries to the next day: we “copied” all planning, customer deliveries, baskets, and items to the next day, and closed the store for any regular orders. That is already a difficult decision to make when you know whether it is actually possible and what the impact would be; at that point, we still had to figure all of that out too. The real issue, however, was that we made that decision only after 5:00 PM, which meant that our customers and our colleagues did not know what was about to happen until that moment.





Because we did not want to let a good crisis go to waste, we wanted to learn as much as possible and make sure we were prepared for the next incident. There was a lot we could cover with procedures, calculations, and emergency scripts, yet we also needed a tool. We needed “something” that would have enabled that colleague at the fulfillment center to report the issue at 6:00 AM, alerted a few key people directly and pulled everyone together in one place to coordinate.





Now, we wouldn’t be a tech company if we didn’t think this should get a tech solution.





Enter Mr. Murphy





Knowing this tool should be usable by thousands of employees in multiple countries, and able to deal with any and every kind of incident, we decided to tackle the hardest question first: what should we name our new service?









“I keep telling people I’ll make movies until I’m fifty and then I’ll go and do something else. I’m going to be a̵ ̵p̵r̵o̵f̵e̵s̵s̵i̵o̵n̵a̵l̵ ̵g̵e̵n̵t̵l̵e̵m̵a̵n̵ ̵o̵f̵ ̵l̵e̵i̵s̵u̵r̵e̵ an incident reporting bot” — Eddie Murphy (age 58)





Mr. Murphy runs on GAS





Tasked with developing a proper incident reporting tool quickly, we decided to go through very fast PoC iterations with a small group of stakeholders. The tool of choice here was Google Apps Script (GAS). Initially developed by Google as a way of scripting extensions to its core apps (Gmail, Docs, Calendar), it is essentially a serverless model in which Google runs the code you provide in their cloud. Built on JavaScript, and offering a quick and easy way to expose endpoints, it is one of the fastest ways of getting a service up and running.
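
To give an idea of how little is needed to get going: a published GAS web app only has to implement a doPost function that answers within Slack’s three-second window. The sketch below is illustrative rather than our actual code; the field names are the ones Slack sends with a slash command, and the types come from the @types/google-apps-script package.

```typescript
// Minimal sketch of a GAS web-app endpoint for a Slack slash command.
// Illustrative only; not Picnic's actual implementation.
function doPost(e: GoogleAppsScript.Events.DoPost): GoogleAppsScript.Content.TextOutput {
  // Slack sends slash-command payloads as form-encoded fields.
  const userId = e.parameter['user_id'];
  const text = e.parameter['text']; // free text typed after /report_incident

  // Reply within Slack's 3-second window; "ephemeral" keeps the response
  // visible only to the person who ran the command.
  const reply = {
    response_type: 'ephemeral',
    text: `Thanks <@${userId}>, starting an incident report for: ${text}`,
  };
  return ContentService.createTextOutput(JSON.stringify(reply))
    .setMimeType(ContentService.MimeType.JSON);
}
```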





Design principles and Tech stack





As time was of the essence (a growing company continually increases in complexity, which in turn increases the need for proper incident management), we settled on the following design principles and resulting technology choices for this project:





  • Very rapid design iterations, based as much as possible on live production issues. As mentioned before, we opted to build our proof of concept in Google Apps Script, allowing us to push changes to production in a matter of minutes whenever needed.
  • Using proper software design principles. Of course, we do not abandon proper software development practices or code quality in our pursuit of rapid design iterations and testing. We developed our service locally in TypeScript, pushed it to GAS using Google’s clasp command-line tool and kept it in our VCS of choice (GitHub).
  • Broad availability to all Picnic employees. The technology choice here was straightforward: Picnic runs on Slack and we have extensive experience in integrating with Slack’s APIs as well as building custom Apps for the Slack ecosystem.
  • The bot should be configurable by business owners. Changes to incident reporting and resolution paths should not depend on software developers deploying new code; they should instead be configurable by a select group of business users. Our choice of Google Apps Script greatly facilitated this, as it integrates seamlessly with Google Sheets, a tool many business users know well (see the first sketch after this list).
  • The service should scale with Picnic’s growth, across all countries. Google’s serverless GAS model serves us well here, allowing up to 30 concurrent executions of our service. Although that might seem like a low limit, our service takes an average of 400 ms to process an incident, so 30 concurrent executions give us headroom for roughly 30 / 0.4 s ≈ 75 incidents per second. If there is ever a need to report more than 75 incidents per second, our biggest concern likely isn’t the scalability of our incident reporting bot…
  • All relevant incident data should be captured. Mr. Murphy has been designed both to streamline incident management whenever an incident occurs and to capture all the data needed to analyze its root cause. This allows us to identify areas of improvement in our processes and thereby reduce the number of incidents occurring in the first place: a simple “prevention is better than cure” approach. Technology-wise, data storage was our biggest headache. Although natively available as a data source in GAS, a Google Sheet is not a viable store here: Slack expects a response from any service it calls within 3000 ms, and reading Google Sheet data from an Apps Script is slow :( As we were still shaping the incident data model while building the service, we opted for a document store rather than a relational database and chose MongoDB. Since MongoDB is not a supported database in GAS, we implemented a simple Stitch HTTP endpoint in MongoDB Atlas and call it over HTTPS (a second sketch follows this list).
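
To illustrate the business-owner configuration mentioned above, here is a minimal sketch of loading incident types from a Google Sheet with the standard SpreadsheetApp API. The sheet name, column order and question encoding are hypothetical, not our actual configuration.

```typescript
// Sketch: loading business-owned configuration from a Google Sheet.
// Sheet layout (type, severity, alert channel, questions) is hypothetical.
interface IncidentConfig {
  incidentType: string;
  severity: string;
  alertChannel: string;
  questions: string[];
}

function loadIncidentConfig(spreadsheetId: string): IncidentConfig[] {
  const sheet = SpreadsheetApp.openById(spreadsheetId).getSheetByName('IncidentTypes');
  if (!sheet) {
    return [];
  }
  // First row is assumed to be a header; every other row describes one incident type.
  const rows = sheet.getDataRange().getValues().slice(1);
  return rows.map(([incidentType, severity, alertChannel, questions]) => ({
    incidentType: String(incidentType),
    severity: String(severity),
    alertChannel: String(alertChannel),
    // Questions kept as a semicolon-separated cell so business users can edit them in place.
    questions: String(questions).split(';').map(q => q.trim()).filter(Boolean),
  }));
}
```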
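
And for the data-storage point: because GAS has no MongoDB driver, writes go out over HTTPS with UrlFetchApp to an HTTP endpoint exposed from MongoDB Atlas (Stitch). The sketch below uses a placeholder endpoint and document shape rather than our real ones.

```typescript
// Sketch: persisting an incident document to MongoDB via an HTTP endpoint
// exposed in MongoDB Atlas (Stitch), since GAS has no native MongoDB driver.
// The endpoint URL (script property) and document shape are placeholders.
function storeIncident(incident: Record<string, unknown>): void {
  const endpointUrl = PropertiesService.getScriptProperties()
    .getProperty('STITCH_WEBHOOK_URL'); // placeholder script property
  if (!endpointUrl) {
    throw new Error('STITCH_WEBHOOK_URL is not configured');
  }
  const response = UrlFetchApp.fetch(endpointUrl, {
    method: 'post',
    contentType: 'application/json',
    payload: JSON.stringify({ ...incident, reportedAt: new Date().toISOString() }),
    muteHttpExceptions: true, // inspect failures instead of throwing mid-request
  });
  if (response.getResponseCode() >= 300) {
    // Failing to persist should not break the Slack flow; log it for follow-up.
    console.error(`Failed to store incident: ${response.getContentText()}`);
  }
}
```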




Mr. Murphy’s toolkit





Now, what exactly does Mr. Murphy do for us?









  • Incidents can be reported by anyone, in any Slack channel, using a simple and always accessible /report_incident slash command. The flow is fully ephemeral: visible only to the user making the report.
  • Fully dynamic question path for incident triage, allowing a customized set of questions for every unique type of incident.
  • Incident reporting in dedicated Slack channels, with user alerting depending on incident severity (a sketch of this step follows the list).
  • Functional elements to manage incidents: (de-)escalate reported incidents, (re-)assign ownership, create dedicated incident channels and resolve incidents.
  • Incident data available for analysis through a Google Sheet connector.
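
As an illustration of the channel reporting and alerting step, the sketch below posts a reported incident to a dedicated channel via Slack’s chat.postMessage Web API and mentions an on-call user group for high-severity incidents. The channel name, user-group ID and severity scale are placeholders, not our actual setup.

```typescript
// Sketch: posting a reported incident to a dedicated Slack channel and
// alerting an on-call group for high-severity incidents.
// Channel, user-group ID and severity scale are placeholders.
interface ReportedIncident {
  title: string;
  severity: 'low' | 'medium' | 'high';
  channel: string; // e.g. "#incidents-supply-chain" (placeholder)
}

function postIncidentToChannel(incident: ReportedIncident): void {
  const token = PropertiesService.getScriptProperties().getProperty('SLACK_BOT_TOKEN');
  // Mention the on-call user group only when the severity warrants it.
  const mention = incident.severity === 'high' ? '<!subteam^S0123ABCD> ' : ''; // placeholder group ID
  UrlFetchApp.fetch('https://slack.com/api/chat.postMessage', {
    method: 'post',
    contentType: 'application/json',
    headers: { Authorization: `Bearer ${token}` },
    payload: JSON.stringify({
      channel: incident.channel,
      text: `${mention}:rotating_light: ${incident.severity.toUpperCase()} incident reported: ${incident.title}`,
    }),
  });
}
```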




Our first six months with Mr. Murphy





As we now process all incident reports (including the smaller ones that don’t require a lot of people or coordination to be fixed) through Mr. Murphy, we are also collecting much more data. And this helps us improve the resilience of our operation.





Nowadays, when we review the performance of our operation, which we do on a weekly basis, Mr. Murphy is part of the discussion and provides input on where we can further improve to deliver on our promise to our customers. We’ve learned a lot and continue learning, facilitated by the rapid update cycle we’ve adopted. We hope this helps you understand your incident flows better and manage them accordingly.





This post was written in collaboration with Eduard Posthumus Meyjes and Nikki Oude Elferink.

