In a recent blog post, Benjamin Wilms gave an introduction to “Chaos Engineering”. But how can Chaos Engineering experiments and their results be applied in a specific company context?
I would like to introduce the Chaos Engineering GameDay as a platform for this purpose. The idea behind it is that we learn how to “survive” a production outage in a secure environment (in terms of space and time). A GameDay is basically a fire drill. As with any drill, the first question is what to practice. Do we want the database system to fail? Or rather a major component of the system?
To get an impression of how a GameDay works, the initial radius of the experiment should be very limited. It is also advisable to start by trying out the GameDay within a test environment. After that, you should be all set for a proper GameDay in production, so you can start planning it from beginning to end.
The most important thing when carrying out a Chaos Engineering GameDay is communication. First, we set up a Command Center. This can be either a virtual chat or a conference room. It is important that everyone in this room has access to the success criteria, the termination criteria, and any other overviews relevant to the GameDay. This is the only way to ensure that potential sources of error are discovered quickly. In case something goes wrong, participants should always have a recovery plan and the agreed criteria at hand.
Before we delve into GameDay details, here is a recap of the rules of Chaos Engineering, as already cited by Benjamin Wilms:
- Talk to your colleagues about the planned chaos experiments in advance!
- If you know your chaos experiment will fail, don’t do it!
- Chaos shouldn’t come as a surprise; your aim is to prove the hypothesis.
- Chaos Engineering helps you understand your distributed systems better.
- Limit the blast radius of your chaos experiments.
- Always be in control of the situation during the chaos experiment!
A more detailed look at Chaos Engineering GameDay
A GameDay lasts between two and four hours. Ideally, in addition to developers and operations, the “customers” behind the application should also participate in the planned event. But what else needs to be prepared apart from the actual experiment?
Every GameDay needs a clear goal to which the experiment can be applied. The participants then define their roles based on this goal, i.e. based on how much they identify with it. It’s up to them whether they want to participate; everyone can decide for themselves. In addition to the defined goal, it is important to prove the hypothesis with test cases.
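To make the goal, the hypothesis, and the criteria visible to everyone in the Command Center, it can help to write each experiment down in a structured form. The following Python sketch is only one possible way to do this. The class `GameDayExperiment`, its field names, and the example values are assumptions made for illustration and do not come from any particular tool.

```python
from dataclasses import dataclass, field


@dataclass
class GameDayExperiment:
    """One planned experiment; all field names are illustrative assumptions."""

    goal: str
    hypothesis: str
    blast_radius: str
    success_criteria: list[str] = field(default_factory=list)
    termination_criteria: list[str] = field(default_factory=list)
    recovery_plan: str = ""

    def missing_items(self) -> list[str]:
        """Return the names of planning items that are still empty."""
        missing = []
        if not self.success_criteria:
            missing.append("success_criteria")
        if not self.termination_criteria:
            missing.append("termination_criteria")
        if not self.recovery_plan.strip():
            missing.append("recovery_plan")
        return missing


# Example values are invented for this sketch.
experiment = GameDayExperiment(
    goal="Survive the loss of the primary database",
    hypothesis="Read requests keep working from the replica",
    blast_radius="test environment, one service instance",
    success_criteria=["read requests succeed"],
)
print(experiment.missing_items())  # prints ['termination_criteria', 'recovery_plan']
```

A check like `missing_items()` can serve as a simple gate: as long as it returns anything, the experiment is not ready to run.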
But before we define our objective, we should do two things. After opening the GameDay, we spend a little time presenting the current system architecture, preferably on a whiteboard in the center of the room. Above all, this should ensure that all participants keep the current status of the system in mind as they make further considerations.
Next, the group tries to come up with potential test cases. For some tests, it might be crucial not to look at the actual time. We already know from experience that some applications do not run into errors within a certain time interval. Particularly if the tests are run in production, it is important to define a termination criterion.
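One way to express such a criterion, together with a time limit for a test, is a small watchdog loop. This is a minimal sketch under stated assumptions: the function `run_with_termination`, its parameters, and the idea of using an error rate as the observed value are hypothetical choices, and the `probe` callable stands in for whatever monitoring you actually have.

```python
import time
from typing import Callable


def run_with_termination(
    probe: Callable[[], float],
    max_error_rate: float,
    max_duration_s: float,
    interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll `probe` until the time budget is used up or the limit is exceeded.

    Returns "terminated" as soon as the observed error rate exceeds
    `max_error_rate`, otherwise "completed" once `max_duration_s` has passed.
    """
    start = clock()
    while clock() - start < max_duration_s:
        if probe() > max_error_rate:
            return "terminated"  # this is where the recovery plan would start
        sleep(interval_s)
    return "completed"


# Invented readings: the third one breaches the 5 % limit.
readings = iter([0.01, 0.02, 0.20])
result = run_with_termination(
    probe=lambda: next(readings),
    max_error_rate=0.05,
    max_duration_s=10,
    interval_s=0,
    sleep=lambda _: None,
)
print(result)  # prints terminated
```

Passing `clock` and `sleep` as parameters keeps the loop testable without waiting for real time to pass.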
The tests are now defined, so we can get started. While we run the test, some questions arise. If these can be answered adequately, we continue with the next test. Here are some examples:
- Do we have enough information?
- Did we expect exactly this behavior?
- What does the customer see while this happens?
- What happens right before and after the component under consideration?
As soon as all the tests have been completed, we can briefly sum up the results. This summary is also a conclusion of the GameDay, though it can be contemplated for a few more days in order to allow everyone involved to take a deeper look at the test events and results with some distance. It is best to initially focus on the most striking tests and corresponding results. Following the discussion, the detected errors are passed via a ticket system to the right people for clarification and fixing. However, all other tests should also be looked at more thoroughly, e.g. to find out if anything negative has been noticed before. It is also worth considering whether these positive tests can be automated in any way to protect future releases against regression.
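As an illustration of what such automation might look like, the following sketch turns one imagined GameDay finding into a repeatable test. Everything in it is hypothetical: `FlakyInventory` is a stand-in for a real dependency, `availability_label` is invented application code, and the expected fallback text is an assumption.

```python
class FlakyInventory:
    """Stand-in dependency whose failure can be switched on (hypothetical)."""

    def __init__(self) -> None:
        self.fail = False

    def stock(self, item: str) -> int:
        if self.fail:
            raise ConnectionError("inventory unavailable")
        return 5


def availability_label(inventory: FlakyInventory, item: str) -> str:
    """Invented application code: degrade gracefully if the dependency fails."""
    try:
        return "in stock" if inventory.stock(item) > 0 else "sold out"
    except ConnectionError:
        return "availability unknown"


def test_outage_shows_fallback() -> None:
    """Regression test: an inventory outage must not break the page."""
    inventory = FlakyInventory()
    inventory.fail = True  # inject the failure observed during the GameDay
    assert availability_label(inventory, "book") == "availability unknown"


test_outage_shows_fallback()
```

A test runner such as pytest would pick up the `test_` function automatically, so the behavior confirmed during the GameDay is checked again with every release.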
To conclude: while a single GameDay can already provide some insights into the behavior of an application or component, continuous implementation will help increase security in the operation of distributed applications and components in the long run. If you like the idea of Chaos Engineering GameDays, you can get started even without having an experiment at hand. Often enough, it is sufficient to get the right people to be in the same room together and start asking questions. Ensuing discussions and reflections will already give rise to quite a few significant insights at this early stage.
I hope this blog post has given you a first glimpse into planning a Chaos Engineering GameDay. If you have any questions, suggestions, or wishes related to the field of Chaos Engineering, feel free to drop a comment.