Persistence without Persistence

17.6.2012 | 6 minutes of reading time

NoSQL databases typically run on virtual machines in the cloud. But if the machines they run on are virtual, how can persistence be ensured?

Enterprise relational database management systems typically run on expensive, robust, and highly reliable hardware. Frequently, large sums are invested to make the hardware as fail-safe as possible. And a typical DB admin would insist on taking such measures.

In the cloud, we rarely find this situation. Cloud computing hardware is typically commodity hardware. Certainly none of the cloud computing providers use cheap hardware from “your dealer around the corner”, simply because it is far too expensive to maintain. But on the other hand, servers are typically not equipped with redundant power supplies, and disks are not combined into RAID arrays or other fault-tolerant systems.

As a consequence, service providers inform tenants that they cannot rely on all virtual nodes to work flawlessly. In fact, one should expect nodes to fail every once in a while. This has interesting consequences. If your DB server runs on a virtual machine, the hard disk the server writes to is virtual, too. Of course there must be a physical disk behind it, but it is not directly accessible. If the node the virtual machine runs on goes down, then so does the virtual machine. What happens to the data on the virtual hard disk? It is lost. Even if you restart the virtual machine on the same node, there is no way to access the data the DB server previously wrote to the virtual disk. To prevent such a situation, you would have to take snapshots of the virtual machine (including the disk). If the DB receives many write operations, the snapshots would have to be taken continuously at short intervals. This is obviously not viable.
In other words, there is no persistence available for DB servers running on virtual machines.

So, if virtual machines are transient, how can persistence be achieved? The answer is pretty similar to the one for dedicated hardware: data replication. Each individual datum to be stored in the DB should be stored on several different virtual machines on several different nodes. Threefold replication seems to be something of a standard here. The idea behind this approach is that while we cannot rely on the individual machines, it is regarded as very unlikely that all three machines storing a datum will go down at the same time. If just one or two machines go down, the datum is still available. Several NoSQL DB servers contain built-in mechanisms to automatically restore the number of replicas if a node goes down. Others leave this task to the application developers.
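To make the idea concrete, here is a minimal sketch of threefold replication with a repair step. It is a toy in-memory model, not the mechanism of any particular NoSQL product; the class `Cluster`, its methods, and the node names are all made up for illustration.

```python
import random

REPLICATION_FACTOR = 3  # threefold replication, as described above


class Cluster:
    """Toy in-memory model of a cluster of unreliable virtual machines."""

    def __init__(self, node_names):
        # Each node maps keys to values; a failed node is simply removed.
        self.nodes = {name: {} for name in node_names}

    def put(self, key, value):
        """Store the value on REPLICATION_FACTOR distinct, randomly chosen nodes."""
        # Raises ValueError if fewer than REPLICATION_FACTOR nodes are alive.
        for name in random.sample(sorted(self.nodes), REPLICATION_FACTOR):
            self.nodes[name][key] = value

    def get(self, key):
        """Return the value from any surviving replica, or None if all are lost."""
        for store in self.nodes.values():
            if key in store:
                return store[key]
        return None

    def replica_count(self, key):
        return sum(1 for store in self.nodes.values() if key in store)

    def fail(self, name):
        """Simulate a node going down: its virtual disk is lost with it."""
        del self.nodes[name]

    def repair(self):
        """Copy under-replicated keys to surviving nodes that lack them."""
        all_keys = {k for store in self.nodes.values() for k in store}
        for key in all_keys:
            value = self.get(key)
            spare = [n for n in sorted(self.nodes) if key not in self.nodes[n]]
            missing = max(REPLICATION_FACTOR - self.replica_count(key), 0)
            for name in spare[:missing]:
                self.nodes[name][key] = value


cluster = Cluster(["vm1", "vm2", "vm3", "vm4", "vm5"])
cluster.put("order:42", "3 x widget")
victim = next(n for n, s in cluster.nodes.items() if "order:42" in s)
cluster.fail(victim)                      # one replica is gone
print(cluster.get("order:42"), cluster.replica_count("order:42"))
cluster.repair()                          # restore the replica count
print(cluster.replica_count("order:42"))
```

After the simulated failure the datum is still readable from the two remaining replicas, and the repair step brings the count back to three. A real system would also have to deal with concurrent writes and conflicting versions, which this sketch ignores.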

Is this the end of the story? Unfortunately not. In August 2011, Amazon’s European EC2 data center suffered an outage, reportedly as a consequence of lightning hitting a transformer close to the site (see, e.g., here or here). The lightning also took out the secondary power supply, with the consequence that the data center suffered a full power outage. I don’t actually care whether this is the right explanation for that particular incident. It is enough to see that a power outage of an entire data center is in fact possible and should not be dismissed as too unlikely to happen.

The obvious solution is to introduce data replication across data centers. But this is where the problems start. Replication within a single data center is relatively simple because all nodes are connected by high-bandwidth links, so heavy communication traffic between nodes is relatively unproblematic. Such bandwidth is no longer available between two different data centers, which may even be located on different continents. Bandwidth over the Internet is clearly the limiting factor in data replication. Full real-time replication of huge data sets with many rapid changes is impossible.
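A little arithmetic shows why. The sketch below computes how much unreplicated data piles up when changes are produced faster than the link between data centers can carry them. The function name and all figures are purely hypothetical and chosen only to illustrate the effect.

```python
def backlog_mbit(write_rate_mbit_s, link_mbit_s, seconds):
    """Unreplicated data in Mbit after `seconds` of steady writes."""
    # Only the part of the write rate that exceeds the link capacity piles up.
    surplus = max(write_rate_mbit_s - link_mbit_s, 0.0)
    return surplus * seconds


# Hypothetical figures, not measurements:
writes = 800.0    # Mbit/s of changes produced inside the data center
wan_link = 100.0  # Mbit/s usable between the two data centers
one_hour = 3600   # seconds

backlog = backlog_mbit(writes, wan_link, one_hour)
print(backlog / 8 / 1000, "GB not yet replicated after one hour")
```

With these made-up numbers the second data center is already 315 GB behind after a single hour, and the gap keeps growing as long as the write rate stays above the link capacity.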

There is yet another effect to be considered. If a data center remains down or unreachable for a prolonged period, its tenants will start moving their applications to other data centers of the same provider. This may in turn overload those data centers and make them unreachable, too.

In such a situation there is no uniform solution to data replication across data centers that fits everybody’s needs. Rather, it is the individual requirements of applications that drive potential solutions. There are basically three types of data to be distinguished:

data that does not require cross data center replication,
data that should be replicated across data centers sometime,
data that requires immediate cross data center replication.

Let us try to explain this by means of an example. Think of a web shop and a new customer trying to place an order. The items in the shopping cart are transient data anyway. There is no need to replicate them across data centers. As long as the customer has not completed the order, losing the order data in the unlikely event of a data center outage is economically acceptable. The customer just has to re-enter the order. And if needed, the customer’s browser can be used as a backup by storing the cart content in a cookie.

Customer address and payment information are data that should be replicated across data centers. After all, one wouldn’t want to lose all customers or their data due to a data center outage. But it is unlikely that immediate replication is required. I would rather propose replicating eventually. If a data center outage happens, only a small amount of customer data is affected, namely the changes that happened after the last replication.

It is difficult to find any type of data that requires immediate replication in the given example. A potential example might be payment-related data indicating that a customer has lost their status as a reliable payer and thus may no longer place any orders, for example as a consequence of some fraud detection. In such a situation the importance of this information may be so high that immediate replication is the action of choice.
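The three classes from the web shop example can be expressed as a per-record-type replication policy. The following is only a sketch under assumptions: the record type names, the `ShopStore` class, and the two dictionaries standing in for the local and the remote data center are all hypothetical.

```python
from collections import deque
from enum import Enum


class Replication(Enum):
    NONE = "no cross data center replication"
    EVENTUAL = "replicate across data centers sometime"
    IMMEDIATE = "replicate across data centers right away"


# Hypothetical mapping for the web shop example; the type names are made up.
POLICY = {
    "shopping_cart": Replication.NONE,
    "customer_address": Replication.EVENTUAL,
    "payment_info": Replication.EVENTUAL,
    "payer_blocked": Replication.IMMEDIATE,
}


class ShopStore:
    """Writes locally and replicates according to the record type's policy."""

    def __init__(self):
        self.local = {}         # stands in for the local data center
        self.remote = {}        # stands in for the second data center
        self.pending = deque()  # changes waiting for the next replication run

    def write(self, record_type, key, value):
        self.local[key] = value
        policy = POLICY[record_type]
        if policy is Replication.IMMEDIATE:
            self.remote[key] = value           # pay the WAN cost right now
        elif policy is Replication.EVENTUAL:
            self.pending.append((key, value))  # ship later, in a batch
        # Replication.NONE: nothing leaves the local data center

    def flush(self):
        """Periodic replication run for all eventually replicated changes."""
        while self.pending:
            key, value = self.pending.popleft()
            self.remote[key] = value


store = ShopStore()
store.write("shopping_cart", "cart:1", ["book"])
store.write("customer_address", "addr:1", "Main St 1")
store.write("payer_blocked", "cust:1", True)
print(sorted(store.remote))  # only the immediately replicated record so far
store.flush()
print(sorted(store.remote))  # the address has arrived, the cart never will
```

If the local data center fails before `flush` runs, only the entries still in `pending` are lost, which matches the "changes after the last replication" argument above.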

An analysis along these lines has to be performed for each individual application. Applications that only require cross data center replication to happen eventually remain viable. If the analysis results in a requirement for immediate replication of large amounts of data, the situation is really difficult. There is no ready-made solution at hand, and these cases have to be considered carefully. Why do large amounts of data need to be replicated immediately? Is it really necessary to replicate such an amount of data? Answers to questions of this type are likely to be business-driven rather than technical in nature.

To sum up, if persistence is taken seriously, then data replication across data centers is required. But bandwidths over the Internet, which are known to be orders of magnitude smaller than those available within data centers, prohibit immediate replication of large amounts of data. It is therefore necessary to identify and downsize the amount of data that really requires immediate replication. For all other data, replication across data centers that takes place eventually should suffice. It is also advisable to detect the outage of a data center automatically in order to stop all futile communication attempts.
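The last point, detecting an outage and stopping futile communication, could be sketched as a simple failure counter with a retry delay, along the lines of a circuit breaker. The class name, thresholds, and method names below are assumptions for illustration and not part of any specific product.

```python
import time


class OutageDetector:
    """Stops traffic to a remote data center after repeated failures."""

    def __init__(self, max_failures=3, retry_after=60.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures tolerated
        self.retry_after = retry_after    # seconds to wait before probing again
        self.clock = clock                # injectable for testing
        self.failures = 0
        self.down_since = None

    def allow_request(self):
        """True if it is worth trying to talk to the remote data center."""
        if self.failures < self.max_failures:
            return True
        # Considered down: allow a single probe once the retry delay has passed.
        return self.clock() - self.down_since >= self.retry_after

    def record_success(self):
        self.failures = 0
        self.down_since = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            # (Re)start the waiting period, also after a failed probe.
            self.down_since = self.clock()
```

A replication job would call `allow_request()` before each transfer and report the result via `record_success()` or `record_failure()`. While the remote data center is considered down, changes simply stay queued locally instead of timing out one by one.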

Blog author

Stephan Kepser

Senior IT Consultant, Cloud and Data Architect
