Your Hardware will Fail – Just not the Way You Expect

13.11.2013 | 5 minutes of reading time

Why do people decide to move their services into the cloud? For one thing, they wish to store large amounts of data. But they also wish for response times and reliability that classical server installations cannot offer. Therefore, clusters of commodity server nodes are built that deliver services as a distributed system. In such a distributed system, node failures are inevitable, and it is important to understand the mechanics of hardware failures in order to prepare for such events. In this blog article, I present four surprising facts that shed new light on common beliefs about RAM and hard disk failures.

These facts are based on the two Google research papers “Failure Trends in a Large Disk Drive Population” [1] and “DRAM Errors in the Wild: A Large-Scale Field Study” [2], which cover data collected from hundreds of thousands of nodes throughout Google’s data centers.

1. Lower Temperatures Lead to Higher Availability

Having managed servers for almost 20 years, I have always been told that it is crucial for servers to be appropriately cooled in order to allow for 24/7 operation. Interestingly, the Google researchers concluded that temperature affects the reliability of neither RAM modules nor disk drives in the way one would expect.

In particular, the authors of [1] analyzed the correlation between disk drive temperature and the “annualized failure rate” (AFR), the rate at which disks fail per year (cf. figure 1).

Figure 1 – cf. [1]

Surprisingly, there is a clear trend that contradicts common belief: up to a temperature of about 40°C, higher temperatures go along with a lower AFR. It seems hard disks prefer a cosy, warm environment.

Similar results have been found for RAM: correctable errors are largely independent of the surrounding temperature in a server and rather depend on utilization (cf. figure 2). Correctable errors are errors that the hardware can correct transparently by exploiting parity information and error correction codes (ECC).
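To make the AFR a bit more tangible, here is a minimal sketch of how such a rate can be computed from fleet data. It uses a common convention (failures divided by accumulated drive-years of operation); I am not claiming this is exactly how the authors of [1] computed their numbers. The function name and the sample figures are made up for illustration.

```python
def annualized_failure_rate(failures: int, drive_days: float) -> float:
    """Return failures per drive-year of operation, as a fraction."""
    if drive_days <= 0:
        raise ValueError("drive_days must be positive")
    drive_years = drive_days / 365.0
    return failures / drive_years


# Invented example: 1,000 drives observed for 90 days each,
# 5 of which failed during that window.
afr = annualized_failure_rate(failures=5, drive_days=1000 * 90)
print(f"AFR: {afr:.2%}")
```

Counting drive-days instead of drives keeps the rate comparable between groups that were observed for different lengths of time, for example different temperature ranges.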

Figure 2 – cf. [2]

In this figure, the rate of correctable errors correlates with the CPU utilization of a node. With high CPU utilization, the error rate is generally higher and increases with temperature. With low CPU utilization, the error rate does not depend on the temperature.
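What "correctable" means can be illustrated with a toy error correction code. The sketch below uses a Hamming(7,4) code, which protects 4 data bits with 3 parity bits and can repair any single flipped bit. This is only an illustration of the principle: real ECC memory protects much wider words with stronger codes, and the function names here are my own.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits into a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_correct(codeword):
    """Return (data bits after correction, 1-based error position or 0)."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_pos = s1 + 2 * s2 + 4 * s4  # syndrome points at the flipped bit
    if error_pos:
        c[error_pos - 1] ^= 1  # repair the single-bit error
    return [c[2], c[4], c[5], c[6]], error_pos


word = hamming74_encode([1, 0, 1, 1])
word[5] ^= 1  # simulate a single bit flip at position 6
data, position = hamming74_correct(word)
print(data, position)
```

Two flipped bits in the same word would exceed what this small code can repair, which is the toy equivalent of an uncorrectable error.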

A conclusion from these findings is that datacenter and cluster designers may be more flexible about cooling parameters and limits.

2. Analog Components Degrade with Age

Most hard disks still use magnetic disks and very sophisticated mechanics to store data. Just like your jeans wear out with time, the mechanics of a hard disk are said to wear out and to become more error-prone over the disk’s lifetime. Figure 3 shows the AFR of hard disks as a function of their age.

Figure 3 – cf. [1]

It is interesting to see that once a disk survives the first three months, its AFR decreases, then peaks after three years, and then decreases again. In numbers, disk drives are more likely to fail at an age of three months or three years than at an age of one year or five years, respectively. This observation can be explained by “survival of the fittest”. Disks that have a manufacturing problem are likely to fail early. Conversely, disks that keep running for a long time prove to be of especially high manufacturing quality.

It seems to be clear that age is not a usable indicator for the likelihood of disk drive failure.

3. Digital Components do not Age

In contrast to the aging behavior of hard disks, there is a correlation between age and failure rates for RAM modules across different hardware platforms and manufacturers (cf. figure 4).

Figure 4 – cf. [2]

This figure shows that the older a RAM module gets, the more likely it is to fail. This degradation starts after around 10 months. In general, one third of all observed machines encountered at least one correctable error per year, and on average over 22,000 of them.

At first sight, this finding does not look too bad since it only covers correctable errors, which have no implications for the running system. But the study found that an uncorrectable error in a given month is 27 to 400 times more likely if a correctable error occurred in the month before. Unlike a correctable error, an uncorrectable error in RAM usually leads to a shutdown of the affected node.

In addition, the research results indicate that, contrary to common belief, failures are due to hard errors, i.e., permanent hardware defects, rather than soft errors, i.e., transient bit flips.
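The "times more likely" figure is a ratio of two conditional probabilities. The following sketch shows how such a ratio could be derived from your own monitoring data. The data layout, the function name, and the sample history are invented for illustration; the resulting number has nothing to do with the values reported in [2].

```python
def uncorrectable_risk_ratio(history):
    """Compare the chance of an uncorrectable error (UE) in a month that
    follows a month with correctable errors (CE) against a month that
    follows a clean month.

    history maps a machine name to a list of monthly tuples
    (had_correctable_error, had_uncorrectable_error), oldest first.
    """
    after_ce = [0, 0]     # [months observed, months with a UE]
    after_clean = [0, 0]
    for months in history.values():
        # Pair every month with its predecessor.
        for (prev_ce, _), (_, ue) in zip(months, months[1:]):
            bucket = after_ce if prev_ce else after_clean
            bucket[0] += 1
            bucket[1] += int(ue)
    if after_ce[0] == 0 or after_clean[0] == 0 or after_clean[1] == 0:
        raise ValueError("not enough data to compute a ratio")
    return (after_ce[1] / after_ce[0]) / (after_clean[1] / after_clean[0])


# Invented sample data: three months of history for four machines.
history = {
    "node-a": [(True, False), (False, True), (False, False)],
    "node-b": [(False, False), (False, False), (False, False)],
    "node-c": [(True, False), (True, False), (False, True)],
    "node-d": [(False, False), (False, True), (False, False)],
}
print(f"risk ratio: {uncorrectable_risk_ratio(history):.2f}")
```

A ratio well above 1 in your own fleet would support treating correctable errors as an early warning rather than as harmless noise.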

The conclusion from these findings is that it is imperative to use ECC-protected memory to protect the system from a high probability (30%) of a memory failure. In addition, correctable memory errors must be taken seriously because they might indicate upcoming fatal uncorrectable errors.

4. Hard Disks Fail Gradually

“Self-Monitoring, Analysis and Reporting Technology” (SMART) was introduced by hard disk manufacturers to give users access to vital information about the health of their disks. All major monitoring systems use the SMART values of a node’s hard disks to derive the health state of these disks and to warn of an incipient failure.

According to the research findings, the four SMART values “Scan Errors”, “Reallocation Count”, “Offline Reallocation Count”, and “Probational Count” may indicate upcoming disk failures if they are monitored over time. This means that while it is not possible to assess a disk’s health by looking at the SMART status once, the development of these four values over time may indicate a future failure.

Unfortunately, for 56% of all failed disks, the Google researchers did not find any relevant indication of degrading service quality. In other words, more than one out of two disks fails abruptly, without any reliable warning from SMART.

This makes data redundancy (hardware- or logic-based) and preparations for a quick replacement of failing hard disks imperative.
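Monitoring the development over time boils down to comparing successive readings instead of checking a single one. Here is a minimal sketch of that idea. It assumes that some collector already delivers periodic snapshots as dictionaries; the key names are hypothetical labels for the four values and are not the attribute names reported by any particular SMART tool.

```python
# Hypothetical keys for the four values named above.
WATCHED = (
    "scan_errors",
    "reallocation_count",
    "offline_reallocation_count",
    "probational_count",
)


def growing_counters(snapshots):
    """Return the watched counters that increased between any two
    consecutive snapshots. snapshots is ordered oldest to newest."""
    grown = []
    for name in WATCHED:
        values = [snap.get(name, 0) for snap in snapshots]
        if any(later > earlier for earlier, later in zip(values, values[1:])):
            grown.append(name)
    return grown


# Invented readings of one disk, taken on three consecutive days.
readings = [
    {"scan_errors": 0, "reallocation_count": 2},
    {"scan_errors": 0, "reallocation_count": 2},
    {"scan_errors": 1, "reallocation_count": 5},
]
print(growing_counters(readings))
```

A single snapshot yields an empty result here by design: without a second reading there is no trend to judge.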

Conclusion

This article summarizes large-scale analyses of the causes of RAM and hard disk failures. The general conclusion is that the likelihood of those failures as well as their development contradict common beliefs. Therefore, it is imperative to include these findings in your data center design and emergency response plans in order to deliver a reliable service to your customers.

[1] http://research.google.com/pubs/pub32774.html
[2] http://research.google.com/pubs/pub35162.html
All figures are excerpts from the corresponding papers.

Blog author

Lukas Pustina
