Chaos Engineering – withstanding turbulent conditions in production

4.7.2018 | 11 minutes of reading time

When you were a child, did you deliberately break or dismantle things in order to understand how they work? We all did – although some people have a greater urge to destroy than others. Today we call it Chaos Engineering.

As developers, one of our primary goals is to develop stable, secure and bug-free software that will not deprive us of sleep or new and exciting topics. To accomplish these and other goals, we write unit and integration tests, which alert us to unexpected behavior and ensure that the patterns we test do not lead to errors. Today’s architectures contain many components that can’t all be fully covered with unit and integration tests. Servers and components of whose existence we’re not aware still manage to drag our entire system into the abyss.

You don’t choose the moment, the moment chooses you!
You only choose how prepared you are when it does.
Fire Chief Mike Burtch

Introduction

In recent years, Netflix has been one of the drivers behind Chaos Engineering and has contributed significantly to the growing importance of Chaos Engineering in distributed systems. Kyle Kingsbury, a security researcher, takes a slightly different approach, verifying the promises of manufacturers of distributed databases, queues and other distributed systems. With his tool Jepsen, he probes the behavior of the aforementioned systems and occasionally comes to frightening conclusions. You can find a very impressive talk on this topic on YouTube.

In this article I hope to give you a simple and descriptive introduction to the world of Chaos Engineering. As a neat side-effect, Chaos Engineering will allow you to personally meet all of your colleagues within a short time – whether you want to or not! (But only if you do it wrong.)

Do not underestimate the social aspect of Chaos Engineering; it is not just about destroying something, but also about bringing the right people together and jointly pursuing the goal of creating stable and fault-tolerant software.

Basics

When we develop new or existing software, we toughen our implementation through various forms of tests. We are often referred to a test pyramid that illustrates what kinds of tests we should write and to what extent.

The test pyramid illustrates the following dilemma: the higher a test scenario sits in the pyramid, the more effort, time and cost it requires.

Unit tests

When creating unit tests, we write test cases to check the expected behavior. The component under test is isolated from all its dependencies, whose behavior we keep under control with the help of mocks. Tests of this kind cannot guarantee that the component is free of errors. If the developer of the module made a logic error in the implementation, the same error will also appear in the tests, regardless of whether the tests or the code were written first. One way to address this is Extreme Programming, in which the developers continuously alternate between writing the tests and implementing the functionality.

Integration tests

In order to allow the developers and stakeholders to spend more free time and relaxed weekends with family and friends, we write integration tests after the unit tests. These test the interaction of individual components. Ideally, integration tests run automatically once the unit tests have passed, and they exercise interdependent components.

Thanks to high test coverage and automation, we achieve a very stable state of our application, but who does not know this unpleasant feeling on the way to the most beautiful place in the world? What I mean is production, where our software has to show how good it really is. Only under real conditions can we see how all the individual components of the overall architecture behave. This unpleasant feeling has even been reinforced by the use of modern microservice architectures.

Erosion of the software architecture

In the age of loosely coupled microservices, we arrive at software architectures that can be summarized under the umbrella term ‘distributed systems’. It’s easy to understand individual systems – and they can be deployed and scaled quickly – but in most cases this leads to an architecture like the following one:

I love architecture diagrams, since they give us a clear and abstract view of the software being developed. However, they are also able to hide all the evil pitfalls and mistakes. They are especially good at obscuring the underlying layers and hardware. In real-life production environments, the following architectural diagram is closer to the status quo:

The load balancer doesn’t know all instances of the gateway or cannot reach them in the network due to a firewall rule. Several applications are dead, but service discovery doesn’t notice the failure. Additionally, service discovery can’t synchronize and delivers varying results. The load cannot be distributed due to missing instances, and this leads to an increased load on the individual nodes. And anyway – why does the twelve-hour batch have to run during the day and why does it need twelve hours?!

I’m sure you’ve had similar experiences. You probably know what it means to deal with defective hardware, faulty virtualization, incorrectly configured firewalls or tedious coordination within your company.

At times, statements like the following come up during discussion: “There is no chaos here, everything runs its usual course.” It may be hard to believe, but an entire industry lives from selling us ticket systems so we can control and document the chaos that exists. The following quote from a movie that my children enjoy describes our everyday life well:

Chaos is the engine that drives the world

API heaven vs backend hell

APIs suggest an intact world where we get exactly what we need via simple API calls, with well-defined inputs and outputs. We use myriad APIs to protect us from direct interaction with hell (backend/API implementation). Countless layers of abstractions are implemented in which Hades, god of the underworld, is doing his mischief. He makes sure that not a single API call ever comes back unscathed. Okay, I exaggerate, but you know what I’m getting at.

Netflix shows what this can lead to, so let’s take a quick look at their architecture to better understand the potential complexity of modern microservice architectures. The following picture is from 2013:

Reprinted from A. Tseitlin, “Resiliency through failure”, QCon NY, 2013

This makes it all the more impressive how well Netflix’s architecture works and how it can react to all the possible errors. If you watch a talk by a Netflix developer, you will find the statement “nobody knows how and why this works”. This insight has brought Chaos Engineering to life at Netflix.

Don’t try this at home

Before you start your first chaos experiments, make sure that your services can already apply a resilience pattern and deal with the possible errors.

Chaos Engineering doesn’t cause problems. It reveals them.
Nora Jones
Senior Chaos Engineer at Netflix

As Nora Jones from Netflix rightfully points out, Chaos Engineering is not about creating chaos, but about preventing it. So, if you want to begin your chaos experiments, start small and ask yourself the following questions in advance:

Decision Helper – inspired by Russ Miles

Once again, it makes no sense to commence engineering chaos if your infrastructure – and especially your services – are not prepared for it. Heed this very important instruction, and we’ll now begin our journey into Chaos Engineering.

Rules of Chaos Engineering

Talk to your colleagues about the planned chaos experiments in advance!
If you know your chaos experiment will fail, don’t do it!
Chaos shouldn’t come as a surprise; your aim is to prove the hypothesis.
Chaos Engineering helps you understand your distributed systems better.
Limit the blast radius of your chaos experiments.
Always be in control of the situation during the chaos experiment!

Principles of Chaos Engineering

In Chaos Engineering, you should go through the five phases described below and keep control of your experiment at all times. Start small, and keep the potential blast radius of your experiments small as well. Simply pulling a plug somewhere and seeing what happens has absolutely nothing to do with Chaos Engineering! We don’t cause uncontrolled chaos; we actively fight to prevent it.

I warmly recommend the site PrinciplesOfChaos.org and the free ebook “Chaos Engineering” by Casey Rosenthal, Lorin Hochstein, Aaron Blohowiak, Nora Jones and Ali Basiri.

Phases of Chaos Engineering

Steady state

It is essential to define metrics which give you a reliable statement about the overall state of your system. These metrics must be continuously monitored during the chaos experiments. As a nice side effect, you can also monitor these metrics outside of your experiments.

Metrics can be either technical or business metrics, and I’d say that business metrics outweigh technical ones. Netflix monitors the number of successful clicks to start a video during a chaos experiment; this is their core metric and it comes from the business domain. Customers being unable to start videos has a direct effect on customer satisfaction. For example, if you run an online shop, the number of successful orders or the number of articles placed in the shopping basket would be important business metrics.
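To make the idea of a steady state a bit more tangible, here is a minimal sketch of a guard over a single business metric. The class name SteadyState, the chosen metric (successful orders out of attempted orders) and the tolerance are my own assumptions for illustration; they do not come from any particular monitoring tool.

```java
/** Hypothetical steady-state guard over one business metric (assumed here: order success rate). */
final class SteadyState {
    private final double baselineRate; // success rate observed before the experiment
    private final double tolerance;    // allowed drop below the baseline, e.g. 0.05

    SteadyState(double baselineRate, double tolerance) {
        this.baselineRate = baselineRate;
        this.tolerance = tolerance;
    }

    /** Success rate for one measurement window; an empty window is treated as healthy (an assumption). */
    static double successRate(long successful, long attempted) {
        if (successful < 0 || attempted < 0 || successful > attempted) {
            throw new IllegalArgumentException("need 0 <= successful <= attempted");
        }
        return attempted == 0 ? 1.0 : (double) successful / attempted;
    }

    /** True while the observed rate has not dropped more than the tolerance below the baseline. */
    boolean holds(long successful, long attempted) {
        return successRate(successful, attempted) >= baselineRate - tolerance;
    }
}
```

With a baseline of 0.98 and a tolerance of 0.05, a window with 95 successful orders out of 100 still counts as steady, while 90 out of 100 does not. In a real setup the numbers would come from your monitoring system, and a violated steady state would be the signal to stop the experiment.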

Hypothesis

Think about what should happen in advance, and then prove it through your experiment. If your hypothesis is invalidated, you must locate the error based on the findings and bring it up with your team or company. This is sometimes the hardest part – definitely avoid finger-pointing and scapegoating! As a chaos engineer, your goal is to understand how the system behaves and to present this knowledge to the developers. This is why it’s important to get everyone on board early and to let them participate in your experiments.

Real-world events

What awaits us in real life? What new mistakes could happen, and which ones already ruined our previous weekends? These and other questions must be asked and tested for in a controlled experiment.

Potential examples include:

Failure of a node in a Kafka cluster
Dropped network packets
Hardware errors
Insufficient max-heap-size for the JVM
Increased latency
Malformed responses

You can extend the list however you like and it will always be closely linked to the chosen architecture. Even if your application is not hosted at one of the well-known cloud providers, things will go wrong in your own company’s data center. I strongly suspect you could tell me a thing or two about it!
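One of the events from the list, increased latency, is easy to simulate in code. The following is a rough sketch using only the JDK; the class LatencyInjector and all of its parameters are hypothetical names of mine, not the API of an existing tool. It delays a configurable share of calls and stays switched off until it is explicitly enabled.

```java
import java.util.Random;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

/** Hypothetical fault injector that delays a share of calls to simulate increased latency. */
final class LatencyInjector {
    private final AtomicBoolean enabled = new AtomicBoolean(false); // off until explicitly enabled
    private final AtomicLong delayedCalls = new AtomicLong();       // how many calls were actually delayed
    private final double affectedShare; // share of calls to delay, between 0.0 and 1.0
    private final long delayMillis;     // artificial delay added to an affected call
    private final Random random;

    LatencyInjector(double affectedShare, long delayMillis, Random random) {
        if (affectedShare < 0.0 || affectedShare > 1.0 || delayMillis < 0) {
            throw new IllegalArgumentException("share must be in [0, 1] and delay must not be negative");
        }
        this.affectedShare = affectedShare;
        this.delayMillis = delayMillis;
        this.random = random;
    }

    void enable() { enabled.set(true); }

    /** Kill switch: from now on, calls pass through untouched. */
    void disable() { enabled.set(false); }

    long delayedCalls() { return delayedCalls.get(); }

    /** Runs the target call, delaying it first if the injector is enabled and the call is selected. */
    <T> T call(Supplier<T> target) {
        if (enabled.get() && random.nextDouble() < affectedShare) {
            delayedCalls.incrementAndGet();
            sleep(delayMillis);
        }
        return target.get();
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        }
    }
}
```

Wrapping a client call would then look like injector.call(() -> client.fetch()), where client stands for whatever you use to reach the other service. The share of affected calls and the kill switch are there on purpose: they keep the blast radius small and let you stop the experiment at any time.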

Chaos experiment example

Say we have several microservices connected via a REST API and using service discovery. The ‘Product Service’ maintains a local cache of the stock from the ‘Warehouse Service’ as a safeguard. Data from the cache should always be delivered if the Warehouse Service has not replied within 500 ms. We can achieve this behavior in the Java environment, for example with Hystrix or resilience4j. Thanks to these libraries, we can implement fallbacks and other resilience patterns very easily and effectively.
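To show the idea without committing to one of these libraries, here is a plain-JDK sketch of the timeout with a cache fallback. It is not how Hystrix or resilience4j implement it; the class StockLookup, the warehouseCall function standing in for the REST call and the in-memory cache are all assumptions of mine.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.ToIntFunction;

/** Plain-JDK sketch of a timeout with cache fallback; all names here are hypothetical. */
final class StockLookup {
    private final ToIntFunction<String> warehouseCall; // stands in for the REST call to the Warehouse Service
    private final Map<String, Integer> cache = new ConcurrentHashMap<>();
    private final long timeoutMillis;

    StockLookup(ToIntFunction<String> warehouseCall, long timeoutMillis) {
        this.warehouseCall = warehouseCall;
        this.timeoutMillis = timeoutMillis;
    }

    /** Returns the live stock if it arrives in time, otherwise the cached value, otherwise empty. */
    Optional<Integer> stockFor(String productId) {
        CompletableFuture<Integer> live =
                CompletableFuture.supplyAsync(() -> warehouseCall.applyAsInt(productId));
        try {
            Integer stock = live.get(timeoutMillis, TimeUnit.MILLISECONDS);
            cache.put(productId, stock); // refresh the cache with every successful answer
            return Optional.of(stock);
        } catch (TimeoutException | ExecutionException e) {
            // Too slow or failed: fall back to the cache, which may not know this product.
            return Optional.ofNullable(cache.get(productId));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return Optional.ofNullable(cache.get(productId));
        }
    }
}
```

A call such as new StockLookup(restCall, 500).stockFor("4711") would mirror the 500 ms rule from the text, with restCall and the product id being placeholders. Note the empty Optional: the cache can only answer for products it has seen before, so the caller still has to decide what to do when neither source has an answer.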

Environment

Below you will find the information required for a successful experiment.

Target: Warehouse Service
Experiment type: Latency
Hypothesis: Due to the increased latency when the Warehouse Service is called, 30% of the requests are delivered using the local cache of the Product Service.
Blast radius: Product Service & Warehouse Service
Status before: OK
Status after: ERROR
Finding: Product Service fallback failed and led to exceptions, since the cache could not answer all requests.

As you can see from the result, we used resilience patterns but still had errors. These errors must be eliminated and tested for by running the experiment again.
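Checking a hypothesis like the one above can itself be automated. The sketch below tallies how each request was answered and compares the result with the expectation. The class ExperimentResult and its outcome names are hypothetical, and treating any error as a refutation of the hypothesis is my reading of the experiment, not a fixed rule.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical tally of how requests were answered during one experiment run. */
final class ExperimentResult {
    enum Outcome { LIVE, CACHE, ERROR }

    private final Map<Outcome, Integer> counts = new EnumMap<>(Outcome.class);

    void record(Outcome outcome) {
        counts.merge(outcome, 1, Integer::sum);
    }

    int count(Outcome outcome) {
        return counts.getOrDefault(outcome, 0);
    }

    int total() {
        return counts.values().stream().mapToInt(Integer::intValue).sum();
    }

    /** Share of all requests that were answered from the cache; 0.0 if nothing was recorded. */
    double cacheShare() {
        int total = total();
        return total == 0 ? 0.0 : (double) count(Outcome.CACHE) / total;
    }

    /** Holds only if no request failed and the cache share is close to the expected share. */
    boolean hypothesisHolds(double expectedCacheShare, double tolerance) {
        return count(Outcome.ERROR) == 0
                && Math.abs(cacheShare() - expectedCacheShare) <= tolerance;
    }
}
```

For a run with 70 live answers and 30 cached answers, hypothesisHolds(0.30, 0.05) is true. As soon as a single request ends in an error, as in the finding above, the hypothesis is refuted and the run is worth repeating after the fix.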

Automated experiments

Chaos Engineering must be operated continuously. Your systems are also constantly changing: new versions are going into production, hardware is being replaced, firewall rules are adapted and servers are restarted. The elegant way is to establish the culture of Chaos Engineering in your company and in the minds of its people. Netflix achieved this by letting the Simian Army loose on the products of their developers – in production! They decided at some point to only do this during working hours, but still they do it.
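If you automate your experiments, a simple guard can make sure they only run while people are around to react. The following sketch is my own illustration of such a working-hours rule, not a description of how Netflix implemented it; the class ExperimentWindow and its weekday rule are assumptions.

```java
import java.time.DayOfWeek;
import java.time.LocalDateTime;
import java.time.LocalTime;

/** Hypothetical guard that only allows automated experiments on weekdays within office hours. */
final class ExperimentWindow {
    private final LocalTime opens;  // inclusive
    private final LocalTime closes; // exclusive

    ExperimentWindow(LocalTime opens, LocalTime closes) {
        if (!opens.isBefore(closes)) {
            throw new IllegalArgumentException("window must open before it closes");
        }
        this.opens = opens;
        this.closes = closes;
    }

    /** True if the given moment is Monday to Friday and inside the configured time window. */
    boolean isOpen(LocalDateTime now) {
        DayOfWeek day = now.getDayOfWeek();
        boolean weekday = day != DayOfWeek.SATURDAY && day != DayOfWeek.SUNDAY;
        LocalTime time = now.toLocalTime();
        return weekday && !time.isBefore(opens) && time.isBefore(closes);
    }
}
```

A scheduler would ask isOpen(LocalDateTime.now()) before starting an experiment and simply skip the run otherwise. Public holidays and time zones are left out here and would need their own handling.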

Environment

For your first chaos experiments, please choose an environment that is identical to the one in production. You will not gain any meaningful insights in a test environment where nothing is going on and that does not correspond to the setup of the platform in production. Once you’ve gained initial insights and made improvements, you can move on to production and carry out your experiments there as well. The aim of Chaos Engineering is to run it in production, always being in control of the situation and keeping customers unaffected.

First steps into Chaos Engineering

At first, it was difficult for me to internalize and explore the ideas of Chaos Engineering in everyday life. I am not a senior chaos engineer working at Netflix, Google, Facebook or Uber. The customers I support with my expertise are just beginning to grasp and implement the principles and designs of microservices. However, what I almost always found was Spring Boot! Sometimes stand-alone, sometimes packed in a Docker container. My demos, which I use in my talks at conferences, always included at least one Spring Boot application. This led to the birth of the Chaos Monkey for Spring Boot, which makes it possible to attack existing Spring Boot applications without modifying a single line of code.

Chaos Monkey for Spring Boot

All relevant information about how Chaos Monkey for Spring Boot can help you on your way to a stable Spring Boot infrastructure can be found on GitHub.

Conclusion

I hope I could provide you with a basic understanding of the ideas and principles behind Chaos Engineering. This topic is very important and we have a lot of catching up to do, even though we don’t work at Netflix and Co. I love my job, but even more, I love my personal life with family and friends. I don’t want to spend endless evenings and weekends fixing Prio 1 tickets; Chaos Engineering gives us confidence in a system’s ability to withstand the turbulent conditions in production.

Blog author

Benjamin Wilms
