It is amazing to see how far we have come on the solid foundation of our infrastructure as code (IaC) building blocks, having developed various systems and services to support our business operations at Picnic. Looking at just one of them reveals how many moving pieces are involved in making everything interoperate efficiently.
For example, one service might require a conventional relational database while another requires object storage for a data processing pipeline, and both might need to run within the same network and send events to a central log collection system. We use cloud providers to get these resources up and running as fast as possible, while the providers take care of the upkeep of the physical resources.
While each provider has its own management portal for applying modifications and inspecting current state, the growing number of resources and providers, each with its own interface, has made manual configuration and replication across environments increasingly difficult. To support further scaling of operations, Picnic adopted Infrastructure as Code (IaC) solutions early on to manage the complexity, focusing our efforts on Terraform.
Terraform is a tool that uses the HashiCorp Configuration Language (HCL) to describe, in code, the resources to be provisioned and modified along with their attributes. This allows us to use GitHub for version control, where we can keep a record of the changes made, enforce approvals and code analysis checks, and have an up-to-date view of our resources.
With a small enough setup, as in Picnic’s early stages, deploying new resources, ensuring consistency and keeping track of changes can still be manageable without custom standards, abstractions and automation. Picnic’s pace of growth has only sped up, however, and the management of our IaC started to show pain points. The number of changes to be reviewed and applied across thousands of Terraform states surpassed our infrastructure team’s capacity: the team spent a lot of time on support operations, and there were still delays in provisioning the required resources.
We began improving things by continuously developing Terraform modules for storage, observability and compute, with enough customization and built-in best practices to match this growth. Still, one big time-consuming task was simply running the provisioning commands in all modified places, something that could be easily automated. Other tasks were harder to abstract and automate, like detecting deviations between the code configuration and the actual resources. Additionally, the privileged access needed to make the changes requires compliance with security standards, which complicates things further. A greater effort was needed to allow our tech operations to scale across the Netherlands and beyond.
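To make the idea of detecting deviations concrete, here is a minimal sketch. It relies on the `-detailed-exitcode` flag of `terraform plan`, which returns 0 when there is nothing to change, 1 on error and 2 when the plan contains changes. The function names and the returned labels are illustrative assumptions, not our actual tooling, and a non-empty plan only indicates drift when the code itself has not changed.

```python
import subprocess


def classify_plan_exit_code(code: int) -> str:
    """Map the exit code of `terraform plan -detailed-exitcode` to a label.

    The labels are hypothetical names chosen for this sketch.
    """
    if code == 0:
        return "in_sync"  # plan succeeded and found nothing to change
    if code == 2:
        return "drift"    # plan succeeded and found changes to apply
    return "error"        # exit code 1 (or anything unexpected)


def check_drift(workdir: str) -> str:
    """Run a plan in `workdir` and report whether code and resources differ."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit_code(result.returncode)
```

Running something like `check_drift` on a schedule for every state is easy to write but hard to operate at scale, which is part of what pushed us towards a dedicated platform.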
Looking for a solution
With slow and error-prone manual processes, it was clear that automation would further empower our teams by bringing the resources they need closer to them. Such automation can range from a simple script that runs Terraform commands in different working directories to a full-fledged continuous deployment platform with dedicated service agents scheduling and executing IaC tasks. We wanted a solution that was capable enough and required as little upkeep from our side as possible, while giving us enough ownership to customise it to fulfil our current and future needs.
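The simple end of that range could look roughly like the sketch below. It assumes that every directory containing `.tf` files is its own Terraform working directory, which will not hold for every repository layout, and it uses Terraform's global `-chdir` option to switch directories. The function names are hypothetical.

```python
import subprocess
from pathlib import Path


def find_terraform_dirs(root: Path) -> list[Path]:
    """Return every directory under `root` that directly contains .tf files.

    Anything inside a `.terraform` directory (downloaded modules) is skipped.
    """
    dirs = {
        path.parent
        for path in root.rglob("*.tf")
        if ".terraform" not in path.parts
    }
    return sorted(dirs)


def build_command(workdir: Path, subcommand: str = "plan") -> list[str]:
    """Build the Terraform invocation for a single working directory."""
    return ["terraform", f"-chdir={workdir}", subcommand, "-input=false"]


def run_everywhere(root: Path, subcommand: str = "plan") -> dict[str, int]:
    """Run the subcommand in each working directory and collect exit codes."""
    results = {}
    for workdir in find_terraform_dirs(root):
        completed = subprocess.run(build_command(workdir, subcommand))
        results[str(workdir)] = completed.returncode
    return results
```

A script like this runs everything sequentially from one machine with one set of credentials, so it leaves access control, reporting and concurrency unsolved.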
Let’s dive into the different options and our selection criteria. We consulted documentation, published stories and product teams to gather information about many services, including Terraform Cloud, Spacelift, Atlantis, Env0 and Scalr.
An indispensable part of our workflow is GitHub, and we have worked to optimise our tooling around it. Like other common CI tools, our selected solution should be able to run a Terraform “plan and apply” triggered by an update event in a pull request, execute it only against the affected Terraform states, and report the results.
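Limiting a run to the affected states boils down to mapping the files changed in a pull request onto state directories. The sketch below shows one way to express that, assuming each state lives in its own directory and that the list of changed files is already available. The function name and the example paths are made up for illustration.

```python
from pathlib import PurePosixPath


def affected_states(changed_files: list[str], state_dirs: list[str]) -> list[str]:
    """Return the state directories that contain at least one changed file."""
    affected = set()
    for changed in changed_files:
        parents = PurePosixPath(changed).parents
        for state in state_dirs:
            # A state is affected when the changed file sits somewhere below it.
            if PurePosixPath(state) in parents:
                affected.add(state)
    return sorted(affected)


# Hypothetical repository layout: one state per service and environment.
states = ["services/orders/dev", "services/orders/prod", "services/payments/prod"]
changed = [
    "services/orders/prod/main.tf",
    "services/orders/prod/variables.tf",
    "README.md",
]
print(affected_states(changed, states))
```

In this example only `services/orders/prod` needs a plan, even though two files changed.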
Additionally, the tool should not prevent an operator from taking over and manually running plan/apply from their local machine to handle edge cases that deviate from regular scenarios. At the same time, it should offer access management capabilities to limit access to actions and resources. Being able to detect or define dependencies between states was also a sought-after feature, to ease the deployment process.
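To illustrate why defined dependencies ease deployment: once they are written down, a valid apply order follows from a topological sort. This is a minimal sketch using Python's standard `graphlib` module. The state names are invented, and it does not reflect how any of the evaluated tools implement the feature.

```python
from graphlib import TopologicalSorter


def apply_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return an order in which every state comes after the states it depends on.

    `dependencies` maps a state to the set of states that must be applied first.
    graphlib raises CycleError if the dependencies form a cycle.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical example: the service needs the database, which needs the network.
order = apply_order(
    {
        "network": set(),
        "database": {"network"},
        "service": {"network", "database"},
    }
)
print(order)
```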
Our Terraform layout, while simplified to encapsulate resources per service and environment, comes with tradeoffs: a high and growing number of Terraform states, changes that trigger multiple jobs, and access required for all code owners. These elements are usually pricing drivers, so we had to weigh them to find the most cost-effective solution.
In one way or another, we could have made our case work with most of the available options. Nevertheless, after running a PoC with a couple of them, we came to the following conclusions:
- Terraform Cloud is part of the HashiCorp product suite. It is a remote operations platform that allows organisations to securely manage and decentralise Terraform code at scale. We decided not to move forward with Terraform Cloud because it had an opaque pricing model, which turned out to be much more expensive than its competitors. At the time of the assessment, it only supported the HashiCorp Sentinel policy language, which is only usable within HashiCorp’s ecosystem. There was also no way to keep the Terraform state on the Picnic side; it all had to be hosted at HashiCorp.
- Scalr is a remote operations platform for Terraform. Positioned as an alternative to Terraform Cloud, it aims to centralise administration and policy enforcement while decentralising Terraform operations. We decided not to move forward with Scalr because there was no way to keep the Terraform state on the Picnic side, and its pay-per-run pricing model scaled poorly with Picnic’s deployment topology.
- Spacelift is an automation platform for Terraform, Pulumi and AWS CloudFormation (and a bit of Kubernetes). It allows the flows and interactions between Git, Plan, Review, Apply and Triggers to be fully customised through Open Policy Agent. We decided to adopt Spacelift. While it is the youngest of the platforms evaluated, we feel it is the one that tries hardest to answer the problem we, as the infrastructure team, are trying to solve: improving collaboration between development teams and operations. Additionally, from a cost perspective, it is among the cheapest options evaluated while meeting all the requirements we had identified.
Setting up Spacelift
Before executing any stack, we had to decide where to host the workers and Spacelift itself. Spacelift resources are managed via the site as SaaS, while the workers that execute our Terraform tasks are hosted in our K8s clusters in AWS.
The first challenge was the authentication setup. Using Spacelift’s AWS integration, a stack can assume a role in AWS. For that role we configured an authentication backend in Vault that dynamically creates an expiring auth token.
We rely heavily on the AWS and Vault providers. Once a Spacelift run is authenticated in both, we can extend its access to additional providers, as access credentials are retrieved from Vault or are linked to IAM roles in AWS.
Workflow using Spacelift
State management
Spacelift also allows us to simplify backend management for our Terraform states: we delegate state management to Spacelift for the general case, while still being able to create stacks whose state we manage ourselves for specific scenarios.
GitHub integration
The feedback cycle in GitHub used to be a manual process in which the infrastructure operator would run the Terraform commands for each affected state and comment the results in the PR. Even with a locally scripted workaround, it would still take several minutes before we got an overview of the changes. With Spacelift it is now amazingly fast: the results of all the triggered runs arrive in a matter of seconds as checks when PRs are opened and modified.
Private Terraform module registry
We have benefited greatly from having a private registry to publish our custom Terraform modules. We have tailored the usage of providers and community modules to the specific needs of Picnic, and having our modules available like any other module keeps development simple.
Access management
We use login policies to write access logic that evaluates the requests managed by Spacelift. This gives us proper isolation between the spaces and resources used by different teams in the organisation.
Statistics
We have never had such broad and granular visibility into important metrics about our Terraform development practices. Health checks like drift detection are easy to enable, track and act upon with Spacelift’s built-in mechanisms. We have also configured an observability stack using the Datadog integration and the Prometheus exporter, both reporting on worker status, load and availability. In addition, we follow our trends in stack creation and run executions to proactively make the changes needed to keep the workflow efficient in performance and cost.
Future vision
Our biggest ambition in using Spacelift is to empower developers in the organisation to own and customise their infrastructure in an easy, compliant and cost-effective manner. We are working towards this with Spacelift offerings like spaces, to scope the reach of teams’ actions, and policy enforcement, to ensure the quality of their solutions and the proper course of action. All of this is baked into custom modules that abstract this work and make it easy to manage.
Our current IaC landscape consists of over 200 services, which account for more than 1000 Terraform states. These have been benefiting from Spacelift for well over a year now, and there are still gains to be had. So far, development teams at Picnic can autonomously deploy resources related to observability for their services, like monitors and alerts. Furthermore, the Spacelift team has offered unparalleled support and assistance during our collaboration, whether in the form of a scheduled meeting or a Slack thread. We are looking forward to future offerings by Spacelift, like their extended RBAC permission layout, which aligns quite well with our future vision of CI at Picnic.