Site Reliability Engineering: Running software in production


Lately, Site Reliability Engineering (SRE) has been getting a lot of attention. With SRE came metrics such as Service-Level Objective (SLO), Service-Level Indicator (SLI), and error budget. The SRE discipline also details a lot about running software in production. But the above buzzwords are more or less only what enables Site Reliability Engineers to do their job.
There is another buzzword: “production-ready.” This one is more about what an SRE or software developer can do to improve the software behind the metrics. This blog post will take a look at how these buzzwords work together and whether they are only buzzwords or there’s more to them.

The above-mentioned topics are not only buzzwords. Books have been written on them. There are the Google SRE books and there are also books on production-ready software:
“Production-Ready Microservices” by Susan Fowler and
“Release It! Design and Deploy Production-Ready Software” by Michael Nygard.

Besides being buzzwords and filling books, they have real impact. Although much of this material is centered around microservices, it is also valid for software running on servers, as functions, or at the edge. So, let’s have a closer look at what it is all about. To do so, let’s start with two examples which should be valid for most applications: logging and retries.

Logging

As an example, let’s assume this error message somewhere in a log:

Connection timed out.

If your software connects to only one service, this message might be acceptable: not nice, but acceptable. As soon as you use two or more backend services in your software, questions arise. Which host did you try to reach? Which port did you use? After how long did the timeout occur?
This log message is more helpful:

Connection to host hello.from.the.other.side on port 12345 timed out after 100ms.

At first, the 100 ms timeout might seem strange. It could be a valid timeout if the consumed service runs inside the same datacenter. It’s definitely not valid when the service is located on the other side of the globe. With this example, other questions arise. Did this timeout happen because of a misconfiguration? Is a firewall blocking the connection? Or is the backend not available? The first log message would have led you to these questions as well, but only after you had worked out the host, port, and timeout yourself. So, the additional information in the log message saves precious time.
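As a minimal sketch of how such a message could be produced, the following Python snippet builds the log line from the same values that are used to open the connection. The function names `describe_timeout` and `connect` and the logger name are assumptions made for this illustration; they do not come from any particular library.

```python
import logging
import socket
from typing import Optional

logger = logging.getLogger("backend_client")  # hypothetical logger name


def describe_timeout(host: str, port: int, timeout_ms: int) -> str:
    """Build a log message that names the target and the timeout used."""
    return f"Connection to host {host} on port {port} timed out after {timeout_ms}ms."


def connect(host: str, port: int, timeout_ms: int = 100) -> Optional[socket.socket]:
    """Open a TCP connection; log a descriptive message if it times out.

    Other errors (DNS failure, connection refused) are not handled here
    and propagate to the caller.
    """
    try:
        return socket.create_connection((host, port), timeout=timeout_ms / 1000)
    except socket.timeout:
        logger.error(describe_timeout(host, port, timeout_ms))
        return None
```

Because the message is built from the actual connection parameters, the log cannot drift away from what the code really did.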

This log message directly leads to our next example.

Retries

If a backend isn’t available, the software shouldn’t give up after the first try. Perhaps a load balancer switched backends or a switch somewhere along the way rebooted. So, give it another try. Not immediately, but after a short break. Combined with the production-ready log message of the previous section, the result might look like this:

Connection to host hello... on port 12345 timed out after 100ms. Retries: 2/5

In most cases, you do not have to implement retries on your own. Google, for example, implemented them in an HTTP client for Java. It is also useful to log the number of retries, or the fact that the client gave up. This will help during debugging.
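If no library is at hand, the idea can be sketched in a few lines of Python. This is an illustration under assumptions, not the approach of any specific client: the helper `call_with_retries`, its parameters, and the linearly growing pause are all invented for this example.

```python
import logging
import time
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger("backend_client")  # hypothetical logger name


def call_with_retries(
    operation: Callable[[], T],
    message: str,
    max_retries: int = 5,
    pause_seconds: float = 0.2,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation; on a timeout, pause and retry up to max_retries times."""
    retries = 0
    while True:
        try:
            return operation()
        except TimeoutError:
            if retries == max_retries:
                # Make it visible in the logs that the client gave up.
                logger.error("%s Gave up after %d retries.", message, max_retries)
                raise
            retries += 1
            logger.warning("%s Retries: %d/%d", message, retries, max_retries)
            sleep(pause_seconds * retries)  # wait a little longer before each retry
```

The `sleep` parameter exists only so that the pause can be replaced in tests. In real code you would typically also add some randomness to the pause so that many clients do not retry at the same moment.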

These are two common examples. As you might guess from the books mentioned above, there is more to making software production-ready. These examples should get you started rather than educate you on the entire topic.

Even these two small examples will help your software, and even more the people running your software in production. More on that later.

But not all chapters of the production-ready or SRE books will apply to your software. For example, back-office software at an insurance company does not have to scale up from 100 to 1000 users in a matter of seconds. It also does not need to scale down some minutes later. But it’s good to know that this is not necessary, so it is worth documenting the factors which aren’t applicable to your software as well.

There is more to running software in production

But this article is about running software in production, not only production-ready software. So, there is more.

“Production-ready” mostly describes functionality built into the software. This functionality will help cope with situations occurring during production lifetime. To run software in production, you also have to take care of some infrastructure, systems, and processes around the software.

Let’s again take a look at two examples: backup/restore and certificates.

Backup/restore

Stateless is a buzzword I didn’t mention above. And although you should aim for stateless services, you will have some sort of state somewhere in your application. And this state will need backups.
The questions which will arise here are:

  • How often do you need backups?
  • How long do you have to keep them?

This functionality will not be part of the software, but these questions arise sooner or later. If they remain unanswered, an outage might lead to complete data loss. This happened, for example, to GitLab quite some time ago. This is not to blame GitLab. We should instead be thankful that they shared a postmortem from which we can learn.
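Once the two questions are answered, the answers can be checked automatically. The following Python sketch assumes that you can list the timestamps of existing backups; the function names and the example values for interval and retention are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Sequence


def backup_overdue(backup_times: Sequence[datetime], now: datetime, interval: timedelta) -> bool:
    """True if there is no backup or the newest one is older than the agreed interval."""
    return not backup_times or now - max(backup_times) > interval


def expired_backups(backup_times: Sequence[datetime], now: datetime, keep_for: timedelta) -> List[datetime]:
    """Return the backups that are older than the agreed retention period."""
    return [t for t in backup_times if now - t > keep_for]
```

A scheduled job could, for example, call `backup_overdue(times, now, timedelta(days=1))` to raise an alert and `expired_backups(times, now, timedelta(days=30))` to decide what to delete. The values of one day and 30 days are placeholders, not recommendations.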

Certificates

Your service will most likely be accessible via HTTPS. Or it will access other services via some sort of secured communication protocol. This raises some topics which aren’t covered by the software itself, but which might cause major downtimes when not handled:

  • Does the software need a client certificate?
  • Are certificates renewed automatically?
  • Are any self-signed certificates involved?

Depending on the answers to these questions, you might need additional processes around the software to keep it running smoothly. Otherwise, things like this might happen:
[Screenshot: not so production-ready certificate]

The Jenkins project was kind enough to discuss this in a publicly accessible ticket. As you can see in the screenshot, the browser doesn’t even allow you to proceed. So, this broken certificate is actually a service outage. Thanks to Jenkins for discussing this publicly and giving us the opportunity to learn.
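One simple process around the software is a regular check of how long a certificate remains valid. The following Python sketch uses only the standard library; the function names are assumptions for this illustration, and the threshold at which you alert is up to you.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Days left before a certificate's notAfter date, e.g. 'Jun  1 12:00:00 2030 GMT'."""
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - now).days


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Read the notAfter field from the certificate a server presents."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A monitoring job could call `days_until_expiry(fetch_not_after(host), datetime.now(timezone.utc))` and alert when the result drops below a chosen number of days. Note that with the default context, an already expired or self-signed certificate makes the TLS handshake fail with an `ssl.SSLCertVerificationError`, which is a signal worth alerting on by itself.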

But why?

Let’s take the opposite perspective. What could possibly go wrong?

But first some definitions up front.
Some people have very specific definitions of what a distributed system is. For this article, we use a not so strict definition:

A distributed system is a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another from any system. The components interact with one another in order to achieve a common goal.

https://en.wikipedia.org/wiki/Distributed_computing

The fallacies

So, by the definition in the previous paragraph, nearly every system you will develop nowadays is a distributed system. In this context, let’s take a look at the fallacies of distributed systems:

The network is reliable;
Latency is zero;
Bandwidth is infinite;
The network is secure;
Topology doesn’t change;
There is one administrator;
Transport cost is zero;
The network is homogeneous.

https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing

For a more detailed explanation of each fallacy, take a look at these blog posts.

Some seem obvious, some are inevitable. But you can circumvent some of these fallacies with production-ready software. That’s one of the reasons why “production-ready” makes your application more robust.

But you cannot circumvent all fallacies. The more complex your system gets, the more the following quote applies:

… complex systems run as broken systems. The system continues to function because it contains so many redundancies and because people can make it function, despite the presence of many flaws.

https://how.complexsystems.fail/#5

Short detour

That’s where observability comes into play. Since there is always something broken, you want to know:

  • What broke?
  • Where is the broken part in your system?
  • How does it affect your users?

But that’s an entirely different topic with more than a handful of books.

The fallacies and public clouds

What’s important is that you are aware of these fallacies and prepare your software accordingly. But not all the parts are under your control. In case your software is running in a public cloud, the cloud providers offer insight into how to make your services more robust on their platforms:

The Amazon Builders’ Library
Azure Application Architecture Guide
Google Cloud — Cloud Architecture Center

As already mentioned above, this article cannot cover the entire topic. It can only give you some hints to get you started.

Conclusion

Based on our experience running software in production, we offer Production Readiness Review (another term coined by the SRE discipline) workshops covering more of these topics. But these topics aren’t specific to us as the ones running the software. No matter who runs the software in production, they will be interested in the answers to questions like the ones mentioned above.

Christian Zunker is creating and operating distributed software systems and infrastructures. In his current role as a Senior Consultant Cloud Technologies, he builds and runs cloud-based systems for cc cloud GmbH and its customers.
