NoSQL databases typically run on virtual machines in the cloud. But if the machines they run on are virtual, how can persistence be ensured?
Enterprise relational database management systems typically run on expensive, robust, and highly reliable hardware. Frequently, large sums are invested to make the hardware as fail-safe as possible. And a typical DB admin would insist on taking such measures.
In the cloud, we rarely find this situation. Cloud computing hardware is typically commodity hardware. Certainly none of the cloud computing providers use cheap hardware from “your dealer around the corner”, simply because such hardware is far too expensive to maintain. But on the other hand, servers are typically not equipped with a redundant power supply. And disks are not combined into RAID arrays or other fault-tolerant systems.
As a consequence, service providers inform tenants that they cannot rely on all virtual nodes to work flawlessly. In fact, one should expect nodes to fail every once in a while. This has interesting consequences. If your DB server runs on a virtual machine, the hard disk the server writes to is virtual, too. Of course there must be a physical disk behind it, but it is not accessible. If the node the virtual machine runs on goes down, then so does the virtual machine. What happens to the data on the virtual hard disk? It is lost. Even if you restart the virtual machine on the same node, there is no way to access the data the DB server previously wrote to the virtual disk. If you wanted to prevent this from happening, you would have to take snapshots of the virtual machine (including the disk). If the DB receives many write operations, the snapshots would have to be taken continuously at short intervals. This is obviously not viable.
In other words, there is no persistence available for DB servers running on virtual machines.
So, if virtual machines are transient, how can persistence be achieved? The answer is pretty similar to the one when using dedicated hardware: data replication. Each individual datum to be stored in the DB should be stored on several different virtual machines on several different nodes. Threefold replication seems to be some kind of a standard here. The idea behind this approach is that while we cannot rely on the individual machines, it is regarded as very unlikely that all three machines storing a datum go down at the same time. If just one or two machines go down, the datum is still available. Several NoSQL DB servers contain built-in mechanisms that automatically restore the number of replicas if a node goes down. Others leave this task to the application developers.
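The replication idea can be sketched in a few lines of Python. This is a toy model only, not how any particular NoSQL server works: the `Cluster` class, its method names, and the use of in-memory dictionaries as stand-ins for virtual disks are all assumptions made for illustration.

```python
import random


class Cluster:
    """Toy model: each node is an in-memory dict standing in for a VM's virtual disk."""

    def __init__(self, node_names, replication_factor=3):
        self.nodes = {name: {} for name in node_names}
        self.replication_factor = replication_factor

    def put(self, key, value):
        # Store the datum on `replication_factor` distinct nodes chosen at random.
        targets = random.sample(sorted(self.nodes), self.replication_factor)
        for name in targets:
            self.nodes[name][key] = value

    def replicas(self, key):
        # Names of all live nodes that currently hold the datum.
        return [name for name, store in self.nodes.items() if key in store]

    def get(self, key):
        # Any surviving replica can answer the read.
        for store in self.nodes.values():
            if key in store:
                return store[key]
        raise KeyError(key)

    def fail(self, name):
        # The node goes down and everything on its virtual disk is gone.
        lost = self.nodes.pop(name)
        # Restore the replica count for every datum the failed node held.
        for key in lost:
            holders = self.replicas(key)
            if not holders:
                continue  # every replica is gone: the datum is lost
            value = self.nodes[holders[0]][key]
            candidates = [n for n in self.nodes if n not in holders]
            missing = self.replication_factor - len(holders)
            for target in candidates[:missing]:
                self.nodes[target][key] = value


cluster = Cluster(["n1", "n2", "n3", "n4", "n5"])
cluster.put("order:42", "3 books")
cluster.fail(cluster.replicas("order:42")[0])  # one replica holder dies
print(cluster.get("order:42"), len(cluster.replicas("order:42")))
```

As long as at least one holder survives and spare nodes exist, `fail` brings the datum back to three replicas. Under the additional assumption that nodes fail independently with probability p in some time window, the chance of losing all three replicas in that window is p³, which is the reason threefold replication is usually considered sufficient within a data center.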
Is this the end of the story? Unfortunately not really. In August last year, Amazon’s European EC2 center suffered an outage as a consequence of lightning striking the transformers close to their site (see, e.g., here or here). The lightning also hit the secondary power supply, with the consequence that the data center suffered a full power outage. I don’t actually care whether this is the right explanation for that particular incident. It is enough to see that a power outage of a whole data center is in fact possible and should not be dismissed as too unlikely to happen.
The obvious solution is to introduce data replication across data centers. But this is where the problems start. Replication within a single data center is relatively simple because all nodes are connected by high-bandwidth links. Thus lots of communication traffic between nodes is relatively unproblematic. Such bandwidth is obviously no longer available between two different data centers, perhaps even located on different continents. Bandwidth over the Internet is clearly the limiting factor in data replication. Full real-time replication of huge data sets that change rapidly is therefore not feasible.
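A back-of-the-envelope calculation makes the bandwidth gap concrete. The figures below (100 GB of changes, a 10 Gbit/s link inside the data center, a 100 Mbit/s link between data centers) are purely hypothetical values chosen for illustration, not measurements of any provider, and the formula ignores latency and protocol overhead.

```python
def transfer_seconds(size_bytes, link_bits_per_second):
    """Time to push `size_bytes` over a link, ignoring latency and protocol overhead."""
    return size_bytes * 8 / link_bits_per_second


CHANGED_DATA = 100e9   # hypothetical: 100 GB of changes to replicate
INTRA_DC_LINK = 10e9   # hypothetical: 10 Gbit/s within a data center
INTER_DC_LINK = 100e6  # hypothetical: 100 Mbit/s between data centers

print(transfer_seconds(CHANGED_DATA, INTRA_DC_LINK))  # 80.0 seconds
print(transfer_seconds(CHANGED_DATA, INTER_DC_LINK))  # 8000.0 seconds, over two hours
```

With these assumed numbers, the same batch of changes that replicates locally in little more than a minute takes over two hours across data centers. If the application produces changes faster than the slower link can carry them, the remote copy never catches up.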
There is yet another effect to be considered. If a data center remains down or unreachable for a prolonged period, tenants of the data center will start moving their applications to other data centers of the same provider. This may in effect render these data centers unreachable, too, due to overload.
In such a situation there is no uniform solution to data replication across data centers that fits everybody’s needs. It is rather the individual requirements of applications that drive potential solutions. There are basically three types of data to be distinguished:
- data that does not require cross-data-center replication,
- data that should be replicated across data centers eventually,
- data that requires immediate cross-data-center replication.
Let us try to explain this by means of an example. Think of a web shop and a new customer trying to place an order. The items in the shopping cart are transient data anyway. There is no need to replicate them across data centers. As long as the customer has not completed the order, losing the order data in the unlikely event of a data center outage is economically sustainable. The customer just has to re-enter what he wants to order. And if needed, the customer’s browser can be used as a backup by storing the cart content in a cookie.
Customer address and payment information are data that should be replicated across data centers. After all, one wouldn’t want to lose all customers or their data due to a data center outage. But it is unlikely that immediate replication is required. I’d rather propose to replicate eventually. If a data center outage happens, only a small amount of customer data is affected, namely the changes that were made after the last replication.
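A minimal sketch of such eventual replication is a change log with a marker for what has already been shipped to the other data center. The `ChangeLog` class and its methods are hypothetical names, and the remote data center is represented by a plain dictionary; a real system would send the batch over the network on a schedule.

```python
class ChangeLog:
    """Records local writes and ships them to a remote copy in batches."""

    def __init__(self):
        self.entries = []          # ordered (key, value) changes
        self.replicated_upto = 0   # index of the first change not yet shipped

    def record(self, key, value):
        self.entries.append((key, value))

    def at_risk(self):
        # Changes that would be lost if the local data center went down now.
        return self.entries[self.replicated_upto:]

    def ship(self, remote):
        # Apply all pending changes to the remote copy, then advance the marker.
        pending = self.at_risk()
        for key, value in pending:
            remote[key] = value
        self.replicated_upto = len(self.entries)
        return len(pending)


remote_dc = {}  # stand-in for the store in the second data center
log = ChangeLog()
log.record("customer:1:address", "Main Street 1")
log.ship(remote_dc)
log.record("customer:1:address", "Side Street 2")
print(log.at_risk())  # only the change made after the last replication run
```

The shorter the interval between `ship` calls, the smaller the set returned by `at_risk`, at the cost of more traffic between data centers.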
It is difficult to find any type of data that requires immediate replication in the given example. A potential candidate might be payment-related data indicating that a customer has lost his status as a reliable payer and thus may no longer place any orders, for example as a consequence of some fraud detection. In such a situation the importance of this information may be so high that immediate replication is the action of choice.
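The web shop example can be condensed into a small policy table that routes each write according to its type of data. Everything here is an illustrative assumption: the data kind names, the `write` function, and the use of dictionaries and a list as stand-ins for the local store, the remote data center, and the queue of changes awaiting eventual replication.

```python
from enum import Enum


class Replication(Enum):
    NONE = "none"            # never leaves the local data center
    EVENTUAL = "eventual"    # queued and shipped later in batches
    IMMEDIATE = "immediate"  # written to the remote data center right away


# Hypothetical classification following the web shop example.
POLICY = {
    "shopping_cart": Replication.NONE,
    "customer_address": Replication.EVENTUAL,
    "payment_info": Replication.EVENTUAL,
    "fraud_flag": Replication.IMMEDIATE,
}


def write(kind, key, value, local, remote, pending):
    """Store a datum locally and replicate it according to its policy."""
    local[key] = value
    policy = POLICY[kind]
    if policy is Replication.IMMEDIATE:
        remote[key] = value
    elif policy is Replication.EVENTUAL:
        pending.append((key, value))


local, remote, pending = {}, {}, []
write("shopping_cart", "cart:7", ["book"], local, remote, pending)
write("customer_address", "customer:7:address", "Main Street 1", local, remote, pending)
write("fraud_flag", "customer:7:blocked", True, local, remote, pending)
print(remote)   # only the fraud flag has crossed data centers so far
print(pending)  # the address waits for the next replication run
```

Keeping the table small and explicit makes the business decision visible: only the kinds marked `IMMEDIATE` consume scarce inter-data-center bandwidth on every write.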
An analysis along these lines has to be performed for each individual application. Applications that only require cross-data-center replication to happen eventually are still viable. If the analysis shows that large amounts of data require immediate replication, the situation is really difficult. There is no ready-made solution at hand, so these cases have to be considered carefully. Why is it the case that large amounts of data need to be replicated immediately? Is it really necessary to replicate such an amount of data? Answers to questions of this type are likely not to be technical in nature but rather business-driven.
To sum up, if persistence is taken seriously, then data replication across data centers is required. But bandwidths over the Internet, which are known to be orders of magnitude smaller than the ones available within data centers, prohibit immediate replication of large amounts of data. It is therefore necessary to identify and downsize the amount of data that really requires immediate replication. For all other data, replication across data centers that takes place eventually should suffice. It is also advisable to detect the outage of a data center automatically in order to stop all futile communication efforts.
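One common way to stop futile communication efforts is a circuit breaker: after a number of consecutive failures the remote data center is treated as down, and no further attempts are made until a waiting period has passed. The sketch below is one possible shape of such a detector; the class name, the threshold of three failures, and the 30-second retry interval are all assumptions, not recommendations.

```python
import time


class OutageDetector:
    """Circuit breaker that suppresses calls to a data center presumed down."""

    def __init__(self, failure_threshold=3, retry_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.retry_after = retry_after  # seconds to wait before probing again
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self):
        # Below the threshold the data center is considered reachable.
        if self.failures < self.failure_threshold:
            return True
        # Presumed down: permit a probe only after the waiting period.
        return self.clock() - self.opened_at >= self.retry_after

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            # (Re)start the waiting period from the latest failure.
            self.opened_at = self.clock()


detector = OutageDetector()
for _ in range(3):
    detector.record_failure()  # three replication attempts in a row failed
print(detector.allow())        # False: stop trying for the next 30 seconds
```

A replication loop would call `allow()` before each attempt and report the outcome with `record_success()` or `record_failure()`. Changes that cannot be shipped in the meantime simply stay queued for eventual replication.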