When you were a child, did you deliberately break or dismantle things in order to understand how they work? We all did – although some people have a greater urge to destroy than others. Today we call it Chaos Engineering.
As developers, one of our primary goals is to develop stable, secure and bug-free software that neither robs us of sleep nor keeps us from new and exciting topics. To accomplish these and other goals, we write unit and integration tests, which alert us to unexpected behavior and ensure that the patterns we test do not lead to errors. Today’s architectures contain many components that can’t all be fully covered with unit and integration tests. Servers and components we don’t even know exist can still drag our entire system into the abyss.
You don’t choose the moment, the moment chooses you!
You only choose how prepared you are when it does.
Fire Chief Mike Burtch
In recent years, Netflix has been one of the drivers behind Chaos Engineering and has contributed significantly to its growing importance in distributed systems. Kyle Kingsbury, a security researcher, takes a slightly different approach: he verifies the promises made by manufacturers of distributed databases, queues and other distributed systems. With his tool Jepsen, he probes the behavior of these systems and occasionally comes to frightening conclusions. You can find a very impressive talk on this topic on YouTube.
In this article I hope to give you a simple and descriptive introduction to the world of Chaos Engineering. As a neat side-effect, Chaos Engineering will allow you to personally meet all of your colleagues within a short time – whether you want to or not! (But only if you do it wrong.)
Do not underestimate the social aspect of Chaos Engineering; it is not just about destroying something, but also about bringing the right people together and jointly pursuing the goal of creating stable and fault-tolerant software.
When we develop new or existing software, we toughen our implementation through various forms of tests. We are often pointed to the test pyramid, which illustrates what kinds of tests we should write and to what extent.
The test pyramid illustrates the following dilemma: the higher up the pyramid a test scenario sits, the more effort, time and cost it incurs.
When creating unit tests, we write test cases to check the expected behavior. The component we are testing is isolated from all its dependencies, and we keep their behavior under control with the help of mocks. These types of tests cannot guarantee that the code is free of errors. If the developer of the module made a logic error in the implementation of the component, this error will also occur in the tests, regardless of whether the developer wrote the tests first and then the code. One way to address this is Extreme Programming, in which the developers continuously alternate between writing the tests and implementing the functionality.
In order to allow the developers and stakeholders to spend more free time and relaxed weekends with family and friends, we write integration tests after the unit tests. These test the interaction of individual components. Integration tests are ideally run automatically after the unit tests have passed, and they test interdependent components.
Thanks to high test coverage and automation, we achieve a very stable state of our application, but who does not know this unpleasant feeling on the way to the most beautiful place in the world? What I mean is production, where our software has to show how good it really is. Only under real conditions can we see how all the individual components of the overall architecture behave. Modern microservice architectures have only reinforced this unpleasant feeling.
Erosion of the software architecture
In the age of loosely coupled microservices, we arrive at software architectures that can be summarized under the umbrella term ‘distributed systems’. It’s easy to understand individual systems – and they can be deployed and scaled quickly – but in most cases this leads to an architecture like the following one:
I love architecture diagrams, since they give us a clear and abstract view of the software being developed. However, they are also able to hide all the evil pitfalls and mistakes. They are especially good at obscuring the underlying layers and hardware. In real-life production environments, the following architectural diagram is closer to the status quo:
The load balancer doesn’t know all instances of the gateway or cannot reach them in the network due to a firewall rule. Several applications are dead, but service discovery doesn’t notice the failure. Additionally, service discovery can’t synchronize and delivers varying results. The load cannot be distributed due to missing instances, and this leads to an increased load on the individual nodes. And anyway – why does the twelve-hour batch have to run during the day and why does it need twelve hours?!
I’m sure you’ve had similar experiences. You probably know what it means to deal with defective hardware, faulty virtualization, incorrectly configured firewalls or tedious coordination within your company.
At times, statements like the following come up during discussion: “There is no chaos here, everything runs its usual course.” It may be hard to believe, but an entire industry lives off selling us ticket systems so we can control and document the chaos that exists. The following quote from a movie that my children enjoy describes our everyday life well:
Chaos is the engine that drives the world
API heaven vs backend hell
APIs suggest an intact world where we get exactly what we need via simple API calls, with well-defined inputs and outputs. We use myriad APIs to protect us from direct interaction with hell (backend/API implementation). Countless layers of abstractions are implemented in which Hades, god of the underworld, is doing his mischief. He makes sure that not a single API call ever comes back unscathed. Okay, I exaggerate, but you know what I’m getting at.
Netflix shows what this can lead to, so let’s take a quick look at their architecture to better understand the potential complexity of modern microservice architectures. The following picture is from 2013:
Reprinted from A. Tseitlin, “Resiliency through failure”, QCon NY, 2013
This makes it all the more impressive how well Netflix’s architecture works and how it can react to all the possible errors. If you watch a talk by a Netflix developer, you will find the statement “nobody knows how and why this works”. This insight brought Chaos Engineering to life at Netflix.
Don’t try this at home
Before you start your first chaos experiments, make sure that your services already implement resilience patterns and can deal with the errors that may occur.
Chaos Engineering doesn’t cause problems. It reveals them.
Nora Jones, Senior Chaos Engineer at Netflix
As Nora Jones from Netflix rightfully points out, Chaos Engineering is not about creating chaos, but about preventing it. So, if you want to begin your chaos experiments, start small and ask yourself the following questions in advance:
Decision Helper – inspired by Russ Miles
Once again, it makes no sense to commence engineering chaos if your infrastructure – and especially your services – are not prepared for it. Heed this very important instruction, and we’ll now begin our journey into Chaos Engineering.
Rules of Chaos Engineering
- Talk to your colleagues about the planned chaos experiments in advance!
- If you know your chaos experiment will fail, don’t do it!
- Chaos shouldn’t come as a surprise; your aim is to prove the hypothesis.
- Chaos Engineering helps you understand your distributed systems better.
- Limit the blast radius of your chaos experiments.
- Always be in control of the situation during the chaos experiment!
Principles of Chaos Engineering
In Chaos Engineering, you should go through the five phases described below and keep control of your experiment at all times. Start small, and keep the potential blast radius of your experiments small as well. Simply pulling a plug somewhere and seeing what happens has absolutely nothing to do with Chaos Engineering! We don’t cause uncontrolled chaos; we actively fight to prevent it.
Phases of Chaos Engineering
It is essential to define metrics which give you a reliable statement about the overall state of your system. These metrics must be continuously monitored during the chaos experiments. As a nice side effect, you can also monitor these metrics outside of your experiments.
Metrics can be either technical or business metrics – I’d say that business metrics outweigh technical ones. During a chaos experiment, Netflix monitors the number of successful clicks to start a video; this is their core metric and it comes from the business domain. Customers not being able to start videos has a direct effect on customer satisfaction. If you run an online shop, for example, the number of successful orders or the number of articles placed in the shopping basket would be important business metrics.
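To make the idea of a monitored business metric a little more concrete, here is a minimal Java sketch. It assumes an online shop that counts attempted and successful orders; the class name `SteadyStateMetric` and the 95% threshold are illustrative assumptions, not taken from Netflix or from any particular monitoring tool.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Tracks a business metric (successful vs. attempted orders) as a steady-state signal. */
class SteadyStateMetric {
    private final AtomicLong attempts = new AtomicLong();
    private final AtomicLong successes = new AtomicLong();
    private final double minSuccessRate; // assumed tolerance, e.g. 0.95

    SteadyStateMetric(double minSuccessRate) {
        this.minSuccessRate = minSuccessRate;
    }

    /** Record the outcome of one order attempt. */
    void record(boolean success) {
        attempts.incrementAndGet();
        if (success) {
            successes.incrementAndGet();
        }
    }

    /** Share of successful orders; treated as 1.0 while nothing has been recorded. */
    double successRate() {
        long total = attempts.get();
        return total == 0 ? 1.0 : (double) successes.get() / total;
    }

    /** True while the metric stays at or above the agreed threshold. */
    boolean withinSteadyState() {
        return successRate() >= minSuccessRate;
    }

    public static void main(String[] args) {
        SteadyStateMetric orders = new SteadyStateMetric(0.95);
        for (int i = 0; i < 100; i++) {
            orders.record(i % 10 != 0); // simulate 10 failed orders out of 100
        }
        System.out.println("success rate: " + orders.successRate());
        System.out.println("within steady state: " + orders.withinSteadyState());
    }
}
```

In a real setup the counters would be fed by the order endpoint and exported to whatever monitoring system you already use; the point is only that the experiment is judged against a threshold agreed on beforehand.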
Think in advance about what should happen, and then prove it through your experiment. If your hypothesis is invalidated, you must locate the error based on the findings and bring it up with your team or company. This is sometimes the hardest part – definitely avoid finger-pointing and scapegoating! As a chaos engineer, your goal is to understand how the system behaves and to present this knowledge to the developers. This is why it’s important to get everyone on board early and to let them participate in your experiments.
What awaits us in real life? What new mistakes could happen, and which ones already ruined our previous weekends? These and other questions must be asked and tested for in a controlled experiment.
Potential examples include:
- Failure of a node in a Kafka cluster
- Dropped network packets
- Hardware errors
- Insufficient max-heap-size for the JVM
- Increased latency
- Malformed responses
You can extend the list however you like and it will always be closely linked to the chosen architecture. Even if your application is not hosted at one of the well-known cloud providers, things will go wrong in your own company’s data center. I strongly suspect you could tell me a thing or two about it!
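Some of the failure modes listed above, such as increased latency or failing calls, can be simulated inside the application itself. The following Java sketch wraps an arbitrary call and disturbs a configurable share of invocations. `FaultInjector` and its parameters are hypothetical names chosen for illustration; this is not how any particular chaos tool is implemented.

```java
import java.util.Random;
import java.util.function.Supplier;

/** Wraps a call and injects latency or an exception with a configurable probability. */
class FaultInjector<T> implements Supplier<T> {
    private final Supplier<T> delegate;
    private final double faultProbability; // share of calls to disturb, 0.0 to 1.0
    private final long latencyMillis;      // added delay; 0 means "throw instead of delaying"
    private final Random random;

    FaultInjector(Supplier<T> delegate, double faultProbability, long latencyMillis, Random random) {
        this.delegate = delegate;
        this.faultProbability = faultProbability;
        this.latencyMillis = latencyMillis;
        this.random = random;
    }

    @Override
    public T get() {
        if (random.nextDouble() < faultProbability) {
            if (latencyMillis > 0) {
                sleep(latencyMillis); // simulate increased latency, then continue normally
            } else {
                throw new IllegalStateException("injected failure");
            }
        }
        return delegate.get();
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
        }
    }
}
```

Keeping the probability low at first, for example `new FaultInjector<>(call, 0.05, 1000, new Random())`, is one simple way to limit the blast radius of such an experiment.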
Chaos experiment example
Say we have several microservices connected via a REST API and using service discovery. The ‘Product Service’ maintains a local cache of the stock from the ‘Warehouse Service’ as a safeguard. Data from the cache should always be delivered if the Warehouse Service has not replied within 500ms. We can achieve this behavior in the Java environment, for example with Hystrix or resilience4j. Thanks to these libraries, we can implement fallbacks and other resilience patterns very easily and effectively.
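The fallback behavior described here can be sketched without any library. The example below deliberately uses plain `CompletableFuture` (Java 9 or later, for `orTimeout`) instead of Hystrix or resilience4j, so it only illustrates the pattern. The class `ProductService`, the method `stockFor` and the `warehouseClient` function are assumptions made for this sketch.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

/** Product Service that answers from a local cache if the Warehouse Service is too slow. */
class ProductService {
    private static final long TIMEOUT_MILLIS = 500; // limit taken from the scenario above

    private final Function<String, Integer> warehouseClient; // hypothetical remote call
    private final Map<String, Integer> stockCache = new ConcurrentHashMap<>();

    ProductService(Function<String, Integer> warehouseClient) {
        this.warehouseClient = warehouseClient;
    }

    /** Returns the live stock, or the cached stock if the call fails or exceeds the timeout. */
    int stockFor(String productId) {
        return CompletableFuture
                .supplyAsync(() -> warehouseClient.apply(productId))
                .orTimeout(TIMEOUT_MILLIS, TimeUnit.MILLISECONDS)
                .thenApply(stock -> {
                    stockCache.put(productId, stock); // refresh the cache on every live answer
                    return stock;
                })
                .exceptionally(error -> cachedStock(productId))
                .join();
    }

    /** Fallback: fails if the product was never cached, so the fallback itself can break. */
    private Integer cachedStock(String productId) {
        Integer cached = stockCache.get(productId);
        if (cached == null) {
            throw new IllegalStateException("no cached stock for " + productId);
        }
        return cached;
    }
}
```

Note that the fallback can only answer for products that have been requested successfully before. A cache miss surfaces as a `CompletionException` from `join()`, which is exactly the kind of gap a chaos experiment is meant to expose.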
Below you will find the information required for a successful experiment.
Due to the increased latency when the Warehouse Service is called, 30% of the requests are delivered using the local cache of the Product Service.
Product-Service & Warehouse Service
Status – Before
Status – After
Product Service fallback failed and led to exceptions, since the cache could not answer all requests.
As you can see from the result, we used resilience patterns but still had errors. These errors must be eliminated and tested for by running the experiment again.
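Whether the hypothesis held can be checked mechanically once the requests of a run have been counted. The following Java sketch assumes three counters (answers from the Warehouse Service, answers from the cache, errors) and a tolerance around the expected 30% cache share; the class `ExperimentResult`, the counters and the tolerance are all illustrative assumptions.

```java
/** Outcome of one experiment run: how each request was answered. */
class ExperimentResult {
    final int fromWarehouse;
    final int fromCache;
    final int errors;

    ExperimentResult(int fromWarehouse, int fromCache, int errors) {
        this.fromWarehouse = fromWarehouse;
        this.fromCache = fromCache;
        this.errors = errors;
    }

    int total() {
        return fromWarehouse + fromCache + errors;
    }

    /** Share of all requests that were answered from the local cache. */
    double cacheShare() {
        int total = total();
        return total == 0 ? 0.0 : (double) fromCache / total;
    }

    /** The hypothesis holds if no request failed and the cache share is close to the expectation. */
    boolean hypothesisHolds(double expectedCacheShare, double tolerance) {
        return errors == 0 && Math.abs(cacheShare() - expectedCacheShare) <= tolerance;
    }
}
```

With this sketch, `new ExperimentResult(70, 30, 0).hypothesisHolds(0.30, 0.05)` holds, while a run in which some fallbacks ended in exceptions, such as `new ExperimentResult(70, 22, 8)`, does not.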
Chaos Engineering must be practiced continuously, because your systems are constantly changing: new versions go into production, hardware is replaced, firewall rules are adapted and servers are restarted. The elegant way is to establish the culture of Chaos Engineering in your company and in the minds of its people. Netflix achieved this by letting the Simian Army loose on the products of their developers – in production! At some point they decided to do this only during working hours, but they still do it.
For your first chaos experiments, please choose an environment that is identical to the one in production. You will not gain any meaningful insights in a test environment where nothing is going on and that does not correspond to the setup of the platform in production. Once you’ve gained initial insights and made improvements, you can move on to production and carry out your experiments there as well. The aim of Chaos Engineering is to run it in production, always being in control of the situation and keeping customers unaffected.
First steps into Chaos Engineering
At first, it was difficult for me to internalize and explore the ideas of Chaos Engineering in everyday life. I am not a senior chaos engineer working at Netflix, Google, Facebook or Uber. The customers I support with my expertise are just beginning to grasp and implement the principles and designs of microservices. However, what I almost always found was Spring Boot! Sometimes stand-alone, sometimes packed in a Docker container. My demos, which I use in my talks at conferences, always included at least one Spring Boot application. This led to the birth of the Chaos Monkey for Spring Boot, which makes it possible to attack existing Spring Boot applications without modifying a single line of code.
Chaos Monkey for Spring Boot
All relevant information about how Chaos Monkey for Spring Boot can help you on your way to a stable Spring Boot infrastructure can be found in the project’s repository on GitHub.
I hope I was able to give you a basic understanding of the ideas and principles behind Chaos Engineering. This topic is very important and we have a lot of catching up to do, even though we don’t work at Netflix and Co. I love my job, but even more, I love my personal life with family and friends. I don’t want to spend endless evenings and weekends fixing Prio 1 tickets; Chaos Engineering gives us confidence in a system’s ability to withstand the turbulent conditions in production.