Why do people decide to move their services into the cloud? For one thing, they wish to store large amounts of data. But they also want response times and reliability that classical server installations cannot offer. Therefore, clusters of commodity server nodes are built that deliver services as a distributed system. In such a distributed system, node failures are inevitable, and it is important to understand the mechanics of hardware failures in order to prepare for such events. In this blog article, I present four surprising facts that shed new light on common beliefs about RAM and hard disk failures.
These facts are based on the two Google research papers “Failure Trends in a Large Disk Drive Population” and “DRAM Errors in the Wild: A Large-Scale Field Study”, which cover data collected from hundreds of thousands of nodes throughout Google’s data centers.
1. Lower Temperatures Lead to Higher Availability
Having managed servers for almost 20 years, I have always been told that it is crucial for servers to be appropriately cooled in order to allow for 24/7 operation. Interestingly, the Google researchers concluded that temperature does not affect the reliability of either RAM modules or disk drives in the way one would expect.
In particular, the authors of the disk drive study analyzed the correlation between disk drive temperature and the “annualized failure rate” (AFR), the rate at which disks fail per year (cf. figure 1). Surprisingly, there is a clear trend that contradicts the common belief: higher temperatures go along with a lower AFR until a temperature of 40°C is reached. It seems hard disks prefer a cosy, warm environment.
Similar results have been found for RAM: correctable errors are largely independent of the surrounding temperature in a server and depend on utilization instead (cf. figure 2). Correctable errors are errors that the individual RAM modules can correct transparently by exploiting parity information and error correction codes (ECC). In this figure, the rate of correctable errors correlates with the CPU utilization of a node. Under high CPU utilization, the error rate is generally higher and increases with temperature. Under low CPU utilization, the error rate does not depend on the temperature.
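To make the AFR metric concrete, here is a minimal sketch in Python. It assumes the common definition of AFR as the number of failures divided by the accumulated drive-years of operation. The function name and the example numbers are hypothetical, and the exact computation used in the paper may differ.

```python
def annualized_failure_rate(failures: int, drive_days: float) -> float:
    """Return the AFR as a fraction (0.03 means 3% per year).

    Assumption: AFR = failures / accumulated drive-years, where
    drive_days is the sum of the days each drive was in service.
    """
    if drive_days <= 0:
        raise ValueError("drive_days must be positive")
    drive_years = drive_days / 365.0
    return failures / drive_years


# Hypothetical example: 365 drives observed for 100 days each
# (36,500 drive-days, i.e. 100 drive-years) with 3 failures.
print(annualized_failure_rate(3, 36500))  # 0.03
```

Normalizing by drive-years rather than by the number of drives makes populations with different observation periods comparable.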
A conclusion from these findings is that datacenter and cluster designers may be more flexible about cooling parameters and limits.
2. Analog Components Degrade with Age
Most hard disks still use magnetic platters and very sophisticated mechanics to store data. Just like your jeans wear out with time, the mechanics of a hard disk are said to wear out and become more error-prone over its lifetime. Figure 3 shows the results of the analysis of failed hard disks and their AFR over time. It is interesting to see that once a disk survives the first three months, its AFR decreases, then peaks after three years, and then decreases again. In other words, disk drives are more likely to fail after three months or three years than after one year or five years, respectively. This observation can be explained by “survival of the fittest”: disks that have a manufacturing problem are likely to fail early. The same might hold for long-running disks, which prove to have an especially high manufacturing quality.
It seems to be clear that age is not a usable indicator for the likelihood of disk drive failure.
3. Digital Components Do Not Age
In contrast to the aging behavior of hard disks, there is a correlation between age and failure rates for RAM modules across different hardware platforms and manufacturers (cf. figure 4). This figure shows that the older a RAM module gets, the more likely it is to fail. This degradation starts after around 10 months. In general, one third of all observed machines encountered at least one, and on average over 22,000, correctable errors per year.
At first sight, this finding does not look too bad, since it only covers correctable errors, which have no implications for the running system. But the studies found that the probability of an uncorrectable error in a given month is 27 to 400 times higher if it was preceded by a correctable error in the month before. Unlike a correctable error, an uncorrectable error in RAM usually leads to a shutdown of the affected node.
In addition, the research results indicate that, contrary to common belief, failures are more often due to hard errors, i.e., permanent hardware defects, than to soft errors caused by randomly flipped bits.
The conclusion from these findings is that it is imperative to use ECC-protected memory to protect the system from a high probability (30%) of a memory failure. In addition, correctable memory errors must be taken seriously because they might indicate upcoming fatal uncorrectable errors.
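As a minimal sketch of taking correctable errors seriously, the following Python snippet flags nodes that reported correctable errors in the previous month, so that they can be inspected before an uncorrectable error occurs. The function name, the node names, and the threshold are hypothetical. How the per-node counts are collected is platform-specific and not shown here.

```python
from typing import Dict, List


def nodes_at_risk(correctable_last_month: Dict[str, int],
                  min_errors: int = 1) -> List[str]:
    """Return the nodes whose correctable-error count reached min_errors.

    Assumption: the input maps a node name to the number of correctable
    RAM errors that node reported during the previous month.
    """
    return sorted(
        node
        for node, count in correctable_last_month.items()
        if count >= min_errors
    )


# Hypothetical monthly counts for three nodes.
counts = {"node-a": 0, "node-b": 12, "node-c": 1}
print(nodes_at_risk(counts))
```

The default threshold of one error follows the observation above that even a single correctable error in one month raises the probability of an uncorrectable error in the next.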
4. Hard Disks Fail Gradually
“Self-Monitoring, Analysis and Reporting Technology” (SMART) was introduced by hard disk manufacturers to give users access to vital information about the health of their disks. All major system monitoring tools use the SMART values from a node’s hard disks to derive the health state of these disks and to warn of an impending failure.
According to the research findings, the four SMART values “Scan Errors”, “Reallocation Count”, “Offline Reallocation Count”, and “Probational Count” may indicate upcoming disk failures if they are monitored over time. This means that while it is not possible to assess a disk’s health by looking at the SMART status once, the development of these four values over time may indicate a future failure.
Unfortunately, for 56% of all failed disks, the Google researchers did not find any relevant indication of degrading service quality. In other words, more than one out of two disks fails without any reliable information from SMART. This means that more than half of all disk failures occur abruptly, without any warning.
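Monitoring these values over time can be sketched as follows in Python. The counter names are hypothetical labels for the four SMART values named above, since actual attribute names differ between vendors and tools. Reading the values from a disk is not shown.

```python
from typing import Dict, List

# Hypothetical labels for the four SMART values discussed above.
WATCHED = (
    "scan_errors",
    "reallocation_count",
    "offline_reallocation_count",
    "probational_count",
)


def rising_counters(samples: List[Dict[str, int]]) -> List[str]:
    """Return the watched counters that grew between the oldest and newest sample.

    Assumption: samples is ordered from oldest to newest, and a missing
    counter is treated as zero.
    """
    if len(samples) < 2:
        return []  # a single snapshot says nothing about the trend
    first, last = samples[0], samples[-1]
    return [name for name in WATCHED if last.get(name, 0) > first.get(name, 0)]


# Hypothetical history of one disk: two snapshots taken a week apart.
history = [
    {"scan_errors": 0, "reallocation_count": 2},
    {"scan_errors": 1, "reallocation_count": 2},
]
print(rising_counters(history))
```

Comparing snapshots rather than checking absolute values reflects the point that a single look at the SMART status is not enough to judge a disk.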
This makes redundancy of data (hardware or logic based) and preparations for a quick exchange of failing hard disks imperative.
This article summarizes large-scale analyses of the causes of RAM and hard disk failures. The general conclusion is that the likelihood of those failures, as well as the way they develop, contradicts common beliefs. Therefore, it is imperative to include these findings in your data center design and emergency response plans in order to deliver a reliable service to your customers.