EXPEDIA GROUP TECHNOLOGY — ENGINEERING

Chaos Engineering at Expedia Group

Building platforms requires connecting building blocks

Nikos Katirtzis
Expedia Group Technology
10 min read · Mar 29, 2022


Many Lego blocks of different sizes, shapes, and colors
Photo by Xavi Cabrera on Unsplash.

Background

In one of our previous blog posts we talked about our efforts to improve the reliability of our new runtime platform using Chaos Engineering. In this post, we present the framework we built internally to run chaos experiments at scale: how it started, the challenges we faced, and the current offering.

How it Started

A small project team was formed, initially focusing on consolidating the various Chaos Engineering tools used across the company. As part of this effort, the team created a vision for new tooling and for scaling Chaos Engineering.

The decision was to invest in and integrate with our new runtime platform and with the on-road set of tools. For context, our new platform is based on Kubernetes and uses Kubernetes Cluster Federation (KubeFed). In terms of on-road tools, our main integration points would be our logging platform (Splunk), our monitoring platform (DataDog), and our Continuous Delivery platform (Spinnaker).

A 30,000-foot view

A high-level overview of our chaos engineering framework is visualised below:

A diagram showing the flow of running chaos experiments. On the left-hand side we have the user. An arrow connects the user with the Spinnaker Continuous Delivery platform with the aim of running an experiment that will terminate all containers of a Pod in a Kubernetes Cluster. On the right, the experiment gets executed in the cluster. This happens through a component that is installed inside the cluster, which injects the failure to the containers of the application’s Pod.
30,000-foot view of Expedia Group’s Chaos Engineering Framework.

The proposed architecture relies on a custom Kubernetes controller, installed in all clusters, to facilitate chaos experiments. The controller receives “instructions” on the type and target of the failure and then injects it.

One of the key lessons we learned from past attempts at scaling chaos engineering was that, in order to attract customers, we needed to provide a great developer experience. For this reason, and because access to worker clusters would be restricted, we decided to enable the execution of experiments through Spinnaker.

The Controller

Collaborating in the open to tailor DataDog’s chaos controller to our needs

A black wireless controller.
Photo by Ugur Akdemir on Unsplash.

Building custom controllers requires engineering effort and collaboration with platform teams. This was clear from our proof-of-concept, but also from other internal projects with a similar design.

Building a custom controller for Chaos Engineering with support for a wide range of experiments multiplies the effort. First, implementing failure injection on different targets and at different layers requires extensive knowledge of certain areas: the Linux kernel, networking, and different runtimes, but also of internal abstractions such as our chosen Service Mesh (Istio) and our control plane (Kubefed). The controller would need a solid design with a strong focus on safety; it is crucial to ensure experiments are cleaned up once they finish, and to support an immediate rollback in case things go south. Additionally, it would need to be battle-tested and integrated with our internal platform.

Fortunately, we were already in touch with other companies to evaluate their Chaos Engineering tools. Just a few weeks after we began this work, DataDog open-sourced their Chaos Controller, a product with similar foundations to the one we envisioned. Since then, we have worked with DataDog in the open and found this collaboration refreshing.

What our team liked about the design of that controller is that it is generic enough to work on most Kubernetes clusters. It is also not tied to any cloud provider or Service Mesh, which keeps it agnostic to constantly evolving tech. The downside is that the implementation is low-level: most experiments leverage Linux kernel modules rather than higher-level abstractions, which makes debugging slightly harder and requires expertise in that area.

Architecture

The architecture of the controller is visualised below:

On the left, the content of a YAML file describing a chaos experiment that will terminate containers of a Pod. On the right, a Kubernetes cluster where the experiment runs. The experiment gets deployed as a Custom Resource, through the API Server. It then gets picked up by the chaos controller which spins up injector Pods to inject the failure to the application Pod’s containers. The controller also integrates with DataDog and Slack.
The architecture of the Chaos Controller, the core component behind our Chaos Engineering Framework.

The controller follows the Operator pattern, which extends Kubernetes through the use of Custom Resources. Custom Resources provide the contract between the server (the controller) and the client (the user) on Kubernetes.

This architecture has two main components: the controller/agent and a number of injector Pods. Behind the scenes, the controller:

  • watches for the creation of experiments, in the form of Custom Resources
  • injects the requested failure into the described target by spinning up injector Pods with the required permissions
  • cleans up the experiment and brings the system back to its steady state

In reality, there is a great deal of hidden detail behind these three bullet points, and behind the other features the controller supports. Let’s dig into that a bit more.

Experiment definition

As mentioned before, experiments are defined in the form of Custom Resources. An experiment that will non-gracefully terminate all of a Pod’s containers looks as follows:
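
A minimal sketch of such a manifest, based on the examples in the open-source chaos-controller repository (the namespace and label values here are illustrative, and field names may differ between controller versions):

```yaml
# Sketch: a Disruption Custom Resource that forcefully terminates the
# containers of one targeted Pod (non-graceful termination).
apiVersion: chaos.datadoghq.com/v1beta1
kind: Disruption
metadata:
  name: container-failure
  namespace: chaos-demo        # illustrative namespace
spec:
  level: pod                   # target Pods rather than Nodes
  selector:
    app: chaos-demo-app        # illustrative label selector
  count: 1                     # number of matching Pods to disrupt
  containerFailure:
    forced: true               # kill containers instead of a graceful stop
```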

The selector uses labels to select targets, which is handy for limiting the blast radius of the experiment. You can find more example experiments in the open-source repository.

In contrast to other chaos engineering frameworks, this controller has a single contract. This has pros and cons: validating different combinations becomes harder, but the reduced development and maintenance effort outweighs the disadvantages.

Failure injection

Socio-technical systems have many failure modes

Broken phone screen showing “Error 404” on screen.
Photo by Kostiantyn Li on Unsplash.

Our team created a classification of fault injection types, with indicative examples of each, ranging from application-level failures down to infrastructure-level ones.

As you can imagine, there are many failure points: different targets, different layers, different runtimes, and platform specifics. The latter may not sound like a problem since our engineering teams are moving to a common platform; however, even a single platform is dynamic: Kubernetes evolves, service meshes such as Istio evolve, and, in rare cases, the runtime may also change.
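
To make the idea of different layers concrete, here is a sketch of a network-layer experiment with the open-source controller, in contrast to the container-level one shown earlier (values are illustrative and field names may differ between versions):

```yaml
# Sketch: add network latency to traffic for the targeted Pod.
apiVersion: chaos.datadoghq.com/v1beta1
kind: Disruption
metadata:
  name: network-delay
  namespace: chaos-demo        # illustrative namespace
spec:
  level: pod
  selector:
    app: chaos-demo-app        # illustrative label selector
  count: 1                     # keep the blast radius to a single Pod
  network:
    delay: 1000                # added latency, in milliseconds
```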

There are also different types of customers with different needs when it comes to injecting failures. Application owners are more interested in application failures, while Site Reliability Engineers (SREs), infrastructure engineers and cluster operators would also like to understand the cluster-wide impact of infrastructure failures.

The open-source controller supports a large number of failures across all of the types presented above. However, our team is evaluating them on a case-by-case basis before documenting them and opening them up internally. This way we ensure we can provide the best possible support.

Integrations

Integrating new products with internal tools is key for a smooth experience

Close up of woven, interlocking material.
Photo by JJ Ying on Unsplash.

Getting something working locally or in a test cluster is relatively straightforward. By contrast, fully integrating with internal platforms requires significant effort.

As we mentioned earlier, our main integration points were our new runtime platform (through Kubefed), DataDog, and Spinnaker. Integrating with each of them had its own complexity.

Integrating with our new runtime platform

The controller integrates with the runtime platform by being deployed as a federated resource. This was a key enabler of the following (see the sketch after this list):

  • installation of the controller in multiple clusters at once, thanks to Kubefed.
  • introduction of an opt-in model where we onboard customers to the chaos engineering framework on demand (per cluster).
  • execution of experiments through Kubefed, without applying them directly to the worker clusters. This was also critical for the Spinnaker integration.
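
As a rough illustration of that opt-in model, the sketch below shows how a Kubefed federated resource can drive placement with a per-cluster label. This is a hypothetical example, not our actual manifest; names, namespaces, and labels are made up:

```yaml
# Sketch: the chaos controller packaged as a federated resource so that
# Kubefed pushes it only to member clusters that have opted in.
apiVersion: types.kubefed.io/v1beta1
kind: FederatedDeployment
metadata:
  name: chaos-controller
  namespace: chaos-engineering             # illustrative namespace
spec:
  template:
    metadata:
      labels:
        app: chaos-controller
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: chaos-controller
      template:
        metadata:
          labels:
            app: chaos-controller
        spec:
          containers:
            - name: chaos-controller
              image: chaos-controller:latest   # illustrative image reference
  placement:
    clusterSelector:
      matchLabels:
        chaos-engineering: enabled             # clusters opt in via this label
```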

Since the controller is not an in-house solution, we had to be prepared for integration issues that were unique to us. Examples include networking issues with our setup, which uses Istio as its Service Mesh, and integration challenges with our control plane (Kubefed). Surprisingly, you can end up with nasty issues when messing around with multiple control loops!

In reality, the integration challenges with the platform could fill an entire blog post.

Integrating with our monitoring platform

A set of monitoring dashboards. The graphs visualise metrics for the last 7 days including the load time vs bounce rate, start render vs bounce rate, page views vs onload, and sessions.
Photo by Luke Chesser on Unsplash

As the company moves to a domain and island model with hundreds of clusters where the chaos engineering framework will be running, it is crucial to have visibility.

Having metrics is useful not only for operational purposes but also for measuring adoption. We wanted to ensure we had a way to measure adoption from day 0.

This would allow us to answer questions such as: Are there any issues with the controller or the injector Pods, and does our team need to get involved? How many experiments are currently running per cluster? What kinds of experiments are popular among our customers?

In addition to the metrics we get from default DataDog integrations, the controller and the injector Pods report custom metrics. This allowed us to create a single-pane-of-glass view, which is visualised below:

Dashboard for experiments running using our chaos framework. Graphs include metrics for running/finished experiments, duration of experiments, disruptions by type and target, and experiments breakdown per domain and cluster.
Single Pane of Glass Dashboard for Chaos Engineering at Expedia Group.

The initial audience for this dashboard was our team. However, application owners and SREs may also need a view of the experiments currently running in their cluster, which is where this dashboard comes in handy.

For verifying the impact of experiments, we recommend that our customers use their own observability methods (i.e. their Service Level Objectives and application dashboards), but we also provide guides on what to look for, and where, for every type of experiment.

Integrating with our Continuous Delivery platform

Integrating the Chaos Engineering framework with our Continuous Delivery platform

Many sailing ships on the horizon.
Photo by Daniel Stenholm on Unsplash.

Integrating with our Continuous Delivery platform was considered critical for the reasons explained below:

  • Permissions — Applying experiments directly to the clusters would not be possible long-term due to user access restrictions.
  • Developer Experience — We should not expect the users of the Chaos Engineering Framework to know about framework internals, such as Kubefed and Custom Resources.
  • Safety — Applying experiments directly to the worker clusters, rather than through Kubefed, may have unintended consequences.
  • Auditing — There needs to be visibility into who executes chaos experiments on on-road platforms, and when.

The approach we chose was to build a Spinnaker plugin, which would allow us to make customisations and enhance the user experience. The end-to-end architecture after integrating with Spinnaker is visualised below:

End-to-end flow for executing chaos experiments. On the left, the users initiate experiments through the Spinnaker UI. They input information for the environment where the experiment will run and the type of the experiment. The request “flows” through the Spinnaker plugin consisting of 3 components; Gate (API), Orca (Orchestrator), and CloudDriver (Cloud Ops). A Custom Resource gets created through these components and gets deployed to the worker cluster, through the Control Plane (Kubefed).
Architecture of the end-to-end flow for executing chaos experiments through Spinnaker.

Spinnaker is composed of a number of independent micro-services. It follows a pluggable architecture which allows users to extend its components through extension points and build plugins with custom logic.
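
For context on how a plugin like this gets wired in, Spinnaker plugins are typically enabled through each service’s configuration profile using the standard spinnaker.extensibility block. The sketch below is hypothetical; the plugin ID, repository name, and URL are made up for illustration:

```yaml
# Sketch: enabling a (hypothetical) chaos plugin in a Spinnaker service
# profile, e.g. orca-local.yml or gate-local.yml.
spinnaker:
  extensibility:
    plugins:
      Expedia.ChaosExperimentPlugin:   # illustrative plugin ID
        enabled: true
        version: 1.0.0
    repositories:
      chaosPluginRepo:                 # illustrative repository name
        url: https://example.internal/spinnaker-plugins/repositories.json
```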

Building Spinnaker plugins is outside the scope of this blog post and has its own nuances (if you are interested, you can check my blog post on the topic). However, past experience from teams that have built similar plugins, such as our internal Progressive Deployment capability, helped enormously.

As for the need for this plugin, in addition to the reasons mentioned before, it enables a custom workflow that aims to improve the developer experience. Users don’t need to know the internals of the framework; they execute experiments through a user interface. Behind the scenes, their requests flow through the internal components of the plugin (the API, the orchestrator, and eventually the micro-service responsible for interacting with Kubefed) and “trigger” chaos experiments in the worker clusters.

Chaos Engineering Product

Internal documentation site for the Chaos Engineering framework.
Chaos Engineering at Expedia Group

As it stands now, the Chaos Engineering product offering is composed of the framework we have built and a central space with documentation.

The framework includes the controller, which gets installed in all clusters; the Spinnaker plugin, which allows customers to run experiments through the Continuous Delivery platform; and a set of demo resources and applications to test, debug, and demonstrate the capabilities of the framework.

The documentation space comes with an introduction to Chaos Engineering, the capabilities which are currently available internally, and past experiments from teams practicing Chaos Engineering.

What is Next?

There are several improvements to make to the framework to ensure a seamless developer experience for our customers. Having connected the building blocks of the platform, the focus is now on improving the developer experience on Spinnaker.

Meanwhile, a few ideas we are exploring:

  • Running automated experiments, on a regular basis, without intervention from our customers.
  • Canary analysis, by targeting canary deployments and measuring the impact on key metrics. A sophisticated version of this is described in Netflix’s paper on Automating chaos experiments in production, where Ali Basiri et al. introduce their ChAP chaos experimentation platform.

Closing Notes

While writing this blog post, I came across Spotify’s post “Product Lessons from ML Home: Spotify’s One-Stop Shop for Machine Learning”. Chaos Engineering platforms have many similarities with Machine Learning platforms. There are several building blocks that need to work together and, as a consequence, many failure modes. Adoption is also critical and requires a strong focus on developer experience. But most importantly, there needs to be a balance between vision and strategy, as in any product. Our vision is to scale Chaos Engineering by making it as easy as possible to run experiments. Even better, to have experiments running and verifying failure modes without involving users at all. But to get there, we needed the building blocks.

Building this framework wouldn’t have been possible without the help of the wider Reliability Engineering and Operations organisation, the Runtime Orchestration Platform team, and, finally, the brilliant DataDog Chaos Engineering team.

Special thanks to the following individuals who have been involved in one way or another in this work: Rashidat Balogun, David Fowler, Rahul Gupta, Nitin Mistry, Kaushik Patel, Sasidhar Sekar, Andrew Tseng, Christina Wui Yan Ho.
