EXPEDIA GROUP TECHNOLOGY — SOFTWARE

Using Fault-Injection to Improve our new Runtime Platform’s Reliability

Exercising Availability Zone Disruptions

Nikos Katirtzis
Expedia Group Technology
6 min read · Jul 6, 2021


The Parthenon; a symbol of Athenian resilience — Photo by Patrick on Unsplash

At Expedia Group™, our platform teams are building the next generation of our runtime platform, which is based on Kubernetes and uses Kubernetes Cluster Federation (KubeFed).

One of the areas the Site Reliability Engineering team looks after is fault injection, also known as Chaos Engineering. The goal is to improve the reliability of services and platforms by injecting failures in a controlled way and then assessing and mitigating their impact. The reliability of our new platform is critical, so we saw an opportunity to work with our platform teams to exercise its reliability in the event of AWS Availability Zone (AZ) disruptions.

This blog post documents the process we followed and our findings from injecting failures into the platform so far.

The Hypothesis

Chaos experiments are built around a hypothesis, which is then verified using fault injection. The high-level hypothesis in our experiments was the following:

No noticeable impact to customers of the new platform during an Availability Zone (AZ) disruption.

Since many components and behaviours need to be verified, we defined more granular hypotheses, summarised below:

  • The cluster detects and responds to an AZ failure.
  • There is no impact on the applications running on the cluster, based on a set of metrics (also see our blog post on Creating Monitoring Dashboards).
  • The cluster has even load distribution across AZs (one way to check this is sketched after this list).
  • Observability on the cluster is not compromised.
  • Key platform components are fully functional.
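
As an illustration, checking the load-distribution hypothesis can be as simple as counting running pods per AZ. The sketch below uses the official Kubernetes Python client and assumes worker nodes carry the standard topology.kubernetes.io/zone label; it is not our internal tooling.

```python
from collections import Counter

from kubernetes import client, config

# Load credentials from the local kubeconfig (use load_incluster_config()
# when running inside the cluster).
config.load_kube_config()
v1 = client.CoreV1Api()

# Map each worker node to its Availability Zone via the well-known zone label.
node_zone = {
    node.metadata.name: node.metadata.labels.get("topology.kubernetes.io/zone", "unknown")
    for node in v1.list_node().items
}

# Count running pods per Availability Zone.
pods_per_zone = Counter()
for pod in v1.list_pod_for_all_namespaces().items:
    if pod.status.phase == "Running" and pod.spec.node_name:
        pods_per_zone[node_zone.get(pod.spec.node_name, "unknown")] += 1

for zone, count in sorted(pods_per_zone.items()):
    print(f"{zone}: {count} running pods")
```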

Based on the above, an example of a chaos experiment with the various stages involved is illustrated below:

The anatomy of a Chaos Experiment disrupting an Availability Zone: we first define the steady state and the hypothesis, then inject the failure, and finally verify the hypothesis, in this case by checking the health of platform components.

How AZ Disruptions are exercised

Although we cannot literally “bring an Availability Zone down”, we can simulate an AZ disruption by blocking traffic to and from a randomly chosen Availability Zone in the target Virtual Private Cloud (VPC). This is achieved using Network ACLs.
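
A simplified sketch of this approach using boto3 is shown below. It is not our internal capability: the VPC ID and target AZ are placeholders, and a freshly created Network ACL is used because it denies all traffic by default.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")
vpc_id = "vpc-0123456789abcdef0"   # placeholder: target VPC
target_az = "us-west-2a"           # placeholder: AZ to disrupt

# A newly created NACL has no allow rules, so it denies all inbound and
# outbound traffic.
deny_all_acl = ec2.create_network_acl(VpcId=vpc_id)["NetworkAcl"]["NetworkAclId"]

# Find the subnets that live in the target AZ.
subnets = ec2.describe_subnets(
    Filters=[
        {"Name": "vpc-id", "Values": [vpc_id]},
        {"Name": "availability-zone", "Values": [target_az]},
    ]
)["Subnets"]
subnet_ids = {s["SubnetId"] for s in subnets}

# Swap each subnet's current NACL association for the deny-all NACL.
reverts = []
acls = ec2.describe_network_acls(Filters=[{"Name": "vpc-id", "Values": [vpc_id]}])
for acl in acls["NetworkAcls"]:
    for assoc in acl["Associations"]:
        if assoc["SubnetId"] in subnet_ids:
            resp = ec2.replace_network_acl_association(
                AssociationId=assoc["NetworkAclAssociationId"],
                NetworkAclId=deny_all_acl,
            )
            # Keep (new association id, original NACL id) so the swap can be
            # reverted once the experiment is over.
            reverts.append((resp["NewAssociationId"], acl["NetworkAclId"]))
```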

The existing capability we have internally targets AWS accounts and is not tied to Kubernetes. The architecture of our multi-AZ cluster setup is illustrated below: each business domain gets its own cluster, which provides a clear separation of concerns and responsibilities.

Availability Zone disruption in multi-AZ Kubernetes (EKS) clusters: each business domain cluster spans three AZs within its AWS account and region, so disrupting one AZ disrupts any pods running on worker nodes in that AZ.

Experiment 1

In the first execution, we disrupted the us-west-2a Availability Zone. Since this was the first time we were injecting an AZ failure into these clusters, we focused on a single AZ.

Pre-experiment work

A significant amount of time is spent before the execution of a chaos experiment. In our case this included, among other things:

  • Communicating the mini GameDay to stakeholders through a Slack channel.
  • Verifying the health of the various workloads in the cluster. This included platform components such as Kubefed and Istio, infrastructure for observability, and a set of applications deployed to the cluster which are used for testing purposes.
  • Generating load for the test applications (a simplified sketch follows this list).
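
For illustration, load generation can be as simple as the sketch below; the endpoint, rate, and duration are placeholders and this is not the tool we actually use.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

TARGET_URL = "https://test-app.example.internal/health"  # placeholder endpoint
DURATION_SECONDS = 1800     # keep load on for the whole experiment window
REQUESTS_PER_SECOND = 20

def hit_endpoint(_):
    try:
        return requests.get(TARGET_URL, timeout=2).status_code
    except requests.RequestException:
        return "error"

results = []
deadline = time.time() + DURATION_SECONDS
with ThreadPoolExecutor(max_workers=REQUESTS_PER_SECOND) as pool:
    while time.time() < deadline:
        # Fire a small burst of concurrent requests roughly once per second.
        results.extend(pool.map(hit_endpoint, range(REQUESTS_PER_SECOND)))
        time.sleep(1)

# A crude steady-state signal: the proportion of successful responses.
success_rate = results.count(200) / max(len(results), 1)
print(f"success rate: {success_rate:.2%}")
```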

Lessons Learned

✓ The cluster was able to successfully detect the AZ failure and move the pods to the available AZs. The time to recover was less than 5 minutes.

✓ All pods in the cluster were evenly distributed across the available AZs.

✓ The cluster’s capability to release new deployments was not impacted.

✓ The cluster’s observability was not impacted; Datadog and Splunk continued to receive metrics and logs.

✗ Not all components had multiple replicas distributed across multiple AZs; some critical components were available only in a single AZ (one possible mitigation is sketched after this list).

✗ Limits on Auto Scaling groups can block the Cluster Autoscaler from scaling up in the available AZs (to accommodate pods rescheduled from the failed AZ).

✗ The Cluster Autoscaler could not scale out the worker nodes. Hence, the Horizontal Pod Autoscaler (HPA) failed to scale out to 100% of the desired replicas.

✗ A fourth Availability Zone was unexpectedly configured in the AWS account!
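
For the single-AZ components, one possible mitigation is to run multiple replicas with a topology spread constraint on the zone label. The sketch below uses the Kubernetes Python client with a hypothetical Deployment name, namespace, and label selector; it is not the exact change we made.

```python
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

patch = {
    "spec": {
        "replicas": 3,  # one replica per AZ in a three-AZ cluster
        "template": {
            "spec": {
                "topologySpreadConstraints": [
                    {
                        # Keep the replica count per zone within 1 of each other,
                        # and refuse to schedule if that cannot be satisfied.
                        "maxSkew": 1,
                        "topologyKey": "topology.kubernetes.io/zone",
                        "whenUnsatisfiable": "DoNotSchedule",
                        "labelSelector": {"matchLabels": {"app": "critical-component"}},
                    }
                ]
            }
        },
    }
}

# Strategic merge patch against a hypothetical critical platform component.
apps.patch_namespaced_deployment(
    name="critical-component", namespace="platform", body=patch
)
```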

Experiment 2

The goal of the second execution was to verify the impact of the same failure in the same AZ, but also to assess the impact of another AZ, us-west-2b, failing (individually rather than on top of the us-west-2a failure).

Pre-experiment work

In addition to the work mentioned for the previous experiment, we created CloudWatch alerts that should fire when an AZ disruption is detected; the disruption is detected through EKS log events.
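
The sketch below illustrates one way to wire this up with boto3: a metric filter turns matching EKS log events into a custom metric, and an alarm fires when any such events appear. The log group name and filter pattern are placeholders, not our actual configuration.

```python
import boto3

logs = boto3.client("logs", region_name="us-west-2")
cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")

LOG_GROUP = "/aws/eks/platform-cluster/cluster"       # placeholder EKS log group
FILTER_PATTERN = '"failed to connect" "us-west-2"'    # placeholder pattern for AZ-related errors

# Turn matching log events into a custom metric...
logs.put_metric_filter(
    logGroupName=LOG_GROUP,
    filterName="az-disruption-events",
    filterPattern=FILTER_PATTERN,
    metricTransformations=[
        {
            "metricName": "AZDisruptionEvents",
            "metricNamespace": "ChaosExperiments",
            "metricValue": "1",
        }
    ],
)

# ...and alarm as soon as any such events are seen within a one-minute period.
cloudwatch.put_metric_alarm(
    AlarmName="az-disruption-detected",
    Namespace="ChaosExperiments",
    MetricName="AZDisruptionEvents",
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",
)
```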

Lessons Learned

us-west-2a disruption

✓ Applications recovered in less than 5 minutes.

✓ The Cluster Autoscaler scaled out additional nodes in the available AZs.

✓ New pods were scheduled to run on the available AZs.

us-west-2b disruption

✗ The test applications did not recover for the entire duration of the experiment.

✗ The Horizontal Pod Autoscaler could not scale out due to the metrics server being unavailable.

✗ Spinnaker, the Continuous Delivery platform we are using, could no longer deploy to the target cluster via KubeFed.

✗ Critical components such as the Open Policy Agent (OPA) Gatekeeper and Vault were all rendered unavailable.

✗ Datadog and Splunk stopped receiving metrics and logs from the cluster.

✗ CloudWatch alerts configured to fire on an AZ failure did not fire.

As you can see, the us-west-2b disruption severely impacted the cluster’s health. The reasons are still under investigation, but this shows that disrupting different AZs does not lead to the same learnings, mostly because different workloads run in each of them.

Towards continuous verification

Wide-reaching experiments such as those targeting an Availability Zone require many behaviours to be verified. Moreover, these exercises become much more valuable when run regularly.

Our approach towards eliminating manual work and continuously executing Availability Zone disruptions has been the following:

  • In the first execution, we did not use any automation to verify the hypothesis and relied on manual verification.
  • To avoid repeating this manual work, we leveraged the tests that many platform components already run when they are deployed. In fact, we identified a couple of issues just by running these tests.
  • As these exercises evolve, we would like to have better observability and more automation in place. For instance, as mentioned earlier, alerts that will fire on an AZ disruption are now in place.
  • Long-term, we would like to provide the ability to schedule such experiments and run them in an automated way. This includes not only the fault-injection stage but the verification stage as well. For this, we have been exploring open-source tools but are also building internal ones that fit our needs (a minimal sketch of such a runner follows this list).
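
To make the long-term goal concrete, here is a minimal, hypothetical sketch of what such a runner could look like. The helper functions are placeholders standing in for the NACL-based disruption and the health checks described earlier; none of them are names from an actual internal tool.

```python
import time

def disrupt_az(az: str) -> None:
    print(f"swapping in deny-all NACL for subnets in {az}")   # placeholder

def restore_az(az: str) -> None:
    print(f"restoring original NACL associations in {az}")    # placeholder

def verify_platform_health() -> bool:
    print("checking HPA, KubeFed, Istio, observability ...")  # placeholder
    return True

def run_az_experiment(target_az: str, soak_seconds: int = 900) -> bool:
    disrupt_az(target_az)
    try:
        time.sleep(soak_seconds)         # give the cluster time to detect and react
        return verify_platform_health()  # the hypothesis-verification stage
    finally:
        restore_az(target_az)            # always revert, even if verification fails

if __name__ == "__main__":
    passed = run_az_experiment("us-west-2a")
    print("hypothesis verified" if passed else "hypothesis violated")
```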

Failing Over without Falling Over

In his talk “Failing Over without Falling Over”, Adrian Cockcroft from AWS identifies AZ failover as a prerequisite for going multi-region, noting that the failover will not be smooth the first few times.

This aligns with past exercises we have run on our platforms, and it was one of our concerns with the new platform; hence we invested in these exercises early, while the platform is still being built.

Closing Thoughts

Usually, chaos engineering is applied to already mature platforms and services. However, making reliability an afterthought instead of a non-functional requirement of early-stage products magnifies the problems and delays their resolution. Injecting failures while a platform is being built increases confidence and leads to a smoother onboarding of customers.

Note: Thanks to all the individuals involved in this exercise (Nitin Mistry, Kiichiro Okano, Kaushik Patel, Sasidhar Sekar) and to Daniel Albuquerque for reviewing the blog post.

Learn more about technology at Expedia Group
