Elasticsearch Zero Downtime Reindexing – Problems and Solutions

Reindexing Elasticsearch could be so easy. Ideally, we wouldn’t have to reindex at all. Why should you, when there is dynamic mapping? In this post I will explain why dynamic mapping won’t do you much good, how you can deal with inevitable errors in your static mapping, what zero downtime reindexing is, and finally how you can deal with the drawbacks of this approach.

Basics: In the end, everyone maps statically anyway.

So what happens when you throw a random JSON document at Elasticsearch and call it a day? After finding that the given index does not provide a mapping for that kind of data, Elasticsearch will try to derive a new mapping from the supplied data.

So if we throw a new document at a fresh Elasticsearch instance with dynamic mapping enabled:

POST blog/articles/1
{
  "author": "Chris",
  "title": "useful Cat facts (III)"
}

Elasticsearch will index it without complaint, because these are obviously string fields:

GET blog/articles/_mapping
{
  "blog": {
    "mappings": {
      "articles": {
        "properties": {
          "author": {
            "type": "string"
          },
          "title": {
            "type": "string"
          }
        }
      }
    }
  }
}

But we are in the era of Big Data, where input arrives chaotically, without much normalization. Let’s imagine someone comes along and posts a new blog post:

POST blog/articles/2
{
  "Author": "The Dude",
  "Title": "thats just like, your opinion man!"
}

This will be indexed just fine, but our new mapping will look like this:

{
  "blog": {
    "mappings": {
      "articles": {
        "properties": {
          "Author": {
            "type": "string"
          },
          "Title": {
            "type": "string"
          },
          "author": {
            "type": "string"
          },
          "title": {
            "type": "string"
          }
        }
      }
    }
  }
}

Yikes! That’s not what we wanted. Elasticsearch can’t determine whether this data is “legitimately” different or whether we’ve just been sloppy. So sooner or later (and hopefully sooner) you will start to define a mapping for your data.

Your Mapping is most likely wrong

Okay, now that we’ve got the basics out of the way, we get to the more sophisticated problems: what happens when your mapping is wrong? Generally, mapping changes are additive: when you add a field, the newly created Lucene segments will simply be bigger from now on, and the old segments are left as they were. Searches for the new field will still be applied to old segments, but will not produce hits there. Since Lucene never edits a segment once it has been written, this bubbles up to Elasticsearch: we cannot change a field’s type after data has been indexed.

We all know that our first guesses when setting things up are most likely not final, but will need to be revised later on. The very same happens when your Elasticsearch cluster is already in production.

The simplest way to tackle this would be to drop your current index, apply the new mapping, and index everything again. This approach is fine while you’re still in your dev (or maybe staging) environment. But in production, a reindex can easily take a couple of hours, maybe days. Good luck telling your customers you’re offline during that period. Also, this only works if your old data is available somewhere else to feed the reindex; otherwise you need to figure out how to do this without downtime.

Zero Downtime Reindexing

There is already a great entry in the Elasticsearch Guide, derived from a post on the official blog, that you should read, too. A short TL;DR:

Elasticsearch provides us with the fantastic and helpful concept of aliases. So to get to a seamless reindexing you do the following:

  • create an alias that points to the index with the old mapping
  • point your application to your alias instead of your index
  • create a new index with the updated mapping
  • move data from old to new
  • atomically move the alias from the old to the new index
  • delete the old stuff

The result: the cluster stays fully operational during the whole procedure and you experience no downtime!
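In terms of concrete API calls, the list above could look roughly like this (the index names blog_v1 and blog_v2, the alias name blog, and the sample mapping are assumptions for illustration; the _aliases endpoint applies all of its actions atomically):

POST /_aliases
{
  "actions": [
    { "add": { "index": "blog_v1", "alias": "blog" } }
  ]
}

PUT blog_v2
{
  "mappings": {
    "articles": {
      "properties": {
        "author": { "type": "string", "index": "not_analyzed" },
        "title":  { "type": "string" }
      }
    }
  }
}

POST /_aliases
{
  "actions": [
    { "remove": { "index": "blog_v1", "alias": "blog" } },
    { "add":    { "index": "blog_v2", "alias": "blog" } }
  ]
}

DELETE blog_v1

The data transfer between the two _aliases calls is up to you, typically a scan/scroll over the old index combined with bulk indexing into the new one.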

1. Where do WRITE operations go in the meantime?

Unfortunately, the official documentation does not discuss how to handle writes arriving at your cluster during the reindexing period. Such an operation might take a lot of time, depending on factors like your hardware, the size of your dataset, your analyzers, and so on. An alias does not allow us to write to both the old and the new index at the same time, so we need to take care of that ourselves. Currently I’d suggest two approaches:

1a) Duplicate writes yourself

The most straightforward solution is to change your application so that it writes the same data to both of your indices simultaneously.

[Figure: zero_downtime_reindex_double_write]

Obviously, duplicated writes have a performance impact when both indices live on the same machines. But it might be worth it: if your reindex process dies in the middle and you have no recovery mechanism implemented, your old index is still in a valid state.
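On the wire, a duplicated write is simply the same request issued against both concrete indices rather than the alias (index names and document are examples):

POST blog_v1/articles/3
{
  "author": "Chris",
  "title": "useful Cat facts (IV)"
}

POST blog_v2/articles/3
{
  "author": "Chris",
  "title": "useful Cat facts (IV)"
}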

1b) Write to the new index and read from both

The Guide states:

A search request can target multiple indices, so having the search alias point to tweets_1 and tweets_2 is perfectly valid. However, indexing requests can only target a single index. For this reason, we have to switch the index alias to point only to the new index.

If you are not in control of the software writing to your cluster, or the first approach is not feasible because of other environmental constraints, you can alternatively switch the write alias to the new index and read from both at the same time. Please note that you will get duplicates in your query results, so it is your application’s responsibility to deal with them. Concepts like pagination will also present additional hurdles.
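A sketch of the alias setup for this approach, assuming separate read and write aliases (the names blog_read and blog_write are examples): the read alias spans both indices, while the write alias points only to the new one.

POST /_aliases
{
  "actions": [
    { "add": { "index": "blog_v1", "alias": "blog_read"  } },
    { "add": { "index": "blog_v2", "alias": "blog_read"  } },
    { "add": { "index": "blog_v2", "alias": "blog_write" } }
  ]
}

The application then searches against blog_read (and must expect duplicates) and indexes against blog_write.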

[Figure: zero_downtime_reindex]

In conclusion, your application has to be aware of the reindexing process and behave according to your chosen strategy: either write to both indices or deal with duplicated results. Which way is acceptable depends on your application. But besides this point, the concept has another weakness:

2. Lost Updates and Deletes!

When we’re in the middle of a lengthy reindexing process, all incoming writes go to the new index. This is unproblematic for new documents: they are simply added to the new index and have no relation to the old one.

But what about an UPDATE or DELETE of an existing document? If the document has already been transferred to the new index, there is no problem. But if it has not, the operation will fail with an error, and later on the reindexing job will copy the document into the new index in its outdated version.

[Figure: lost_delete]

This outcome is not desirable and should be avoided! If your application supports updates and deletions, we have to include additional steps in our reindexing process. The basic idea is that you do not delete documents, but mark them as “deleted” instead and exclude them from queries. Here are some proposals to get you started:
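As a sketch, a logical delete could be an update that sets a flag (the field name deleted is an assumption, not part of Elasticsearch), combined with a filter that excludes flagged documents from queries:

POST blog/articles/1/_update
{
  "doc": { "deleted": true }
}

GET blog/articles/_search
{
  "query": {
    "filtered": {
      "query":  { "match_all": {} },
      "filter": { "not": { "term": { "deleted": true } } }
    }
  }
}

The "not" filter also matches documents that never received the flag, so existing documents do not need to be backfilled with "deleted": false.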

2a) Incremental Reindexing

For this approach to work, your whole infrastructure needs to adopt the following two concepts:

  • Every modification updates a timestamp field of the document.
    Instead of writing critical updates and deletions to the new index, we still apply them to the old one. The reindexing job then moves all documents that are older than its own start timestamp to the new index. Every update that happens during this time refreshes the document’s timestamp. Note that Elasticsearch already provides a _timestamp field that can be activated in the mapping.
  • When the reindexing job has terminated successfully, it starts again and transfers all modifications made during its last run. When it reaches an iteration in which there is nothing left to do, we consider it done and continue the wrap-up as in the regular process.
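A sketch of the moving parts, assuming the built-in _timestamp field is used: enable it in the mapping (ideally when the index is created, since it cannot necessarily be switched on after the fact), then let each iteration of the reindexing job select everything modified before its own start time (index name and timestamp value are examples):

PUT blog_v1
{
  "mappings": {
    "articles": {
      "_timestamp": { "enabled": true, "store": true }
    }
  }
}

GET blog_v1/articles/_search
{
  "query": {
    "filtered": {
      "filter": {
        "range": { "_timestamp": { "lt": "2014-09-18T12:00:00" } }
      }
    }
  }
}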

Drawbacks:

  • If you have a lot of deletions, you will artificially bloat your index. This can be mitigated by purging all marked-as-deleted documents after the reindex. Still, since a DELETE in Elasticsearch is just a mark-as-deleted in Lucene, there will be some bloat either way.
  • A logical delete implemented as an UPDATE is more expensive than a regular DELETE, so watch out for performance hits.
  • After the last reindexing iteration, there must be a “stop the world” phase to prevent any modifications from sneaking in. Our suggestion would be to include that in your deployment process if you can.

2b) Modification Buffering

If your reindexing is expected to take only a short amount of time, there is another solution to consider:

Elasticsearch offers simple optimistic concurrency control via the special _version field. If your application keeps this information during the GET -> modify -> UPDATE / DELETE cycle and sends it back, Elasticsearch will check whether the version still matches.

Example: if your document has version 1 and you send the UPDATE to Elasticsearch with this version as a parameter, and the document has not been transferred yet, you will get a VersionConflictEngineException. In this case, hold the update in your application and retry later (how much “later” is acceptable depends on your application and can ultimately only be answered by you).
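For illustration, such a versioned write looks like this (document and version number are examples):

PUT blog/articles/1?version=1
{
  "author": "Chris",
  "title": "useful Cat facts (III), revised"
}

If the document in the target index does not hold version 1 (or does not exist there yet), Elasticsearch answers with HTTP 409 and a VersionConflictEngineException, which is the application’s cue to buffer the modification and retry.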

The same drawback as in 2a applies: you cannot truly delete your documents anymore, but have to mark them as deleted as well.

Conclusion

It’s not so important which solution you take from this article; the most important point is to be aware of the drawbacks of the “official” reindexing procedure. How you work around these limitations depends on your business needs.

Comments

  • 18. September 2014 von Jayson Minard

    This is not a problem for updates, you have at least two easy options:

    option 1: updates written to new index block reindex operations that insert using optype=create so newer update always wins. why? because the reindex operation will fail if the record was written by an update first. the updates should be written as optype index (default) so they always win.

    option 2: optimistic concurrency with version as date of transaction, latest date wins regardless of order executed. if the update is newer, it beats the reindex operation. and for deletes they are tombstoned for you by elasticsearch. the window that deletes hang around waiting to block updates/adds is configurable. set it to some value higher than your total reindex time. they clean up on their own…

    • Patrick Peschlow

      Thanks for your very helpful suggestions. Just to clarify: You mentioned the index.gc_deletes setting for deletes only within option 2, but it should work for option 1 just the same, right? I like option 1 a bit better because it is applicable also when you are not already using the version field for some other application-related version information.

  • Firstly, thanks for this post. It’s great to have all these options cataloged in one place.

    I would personally consider it best practice to throw your new writes into a message queue, preferably something like Apache Kafka. You have a “blue” subscriber that listens to the topic and writes to the soon-to-be-retired index. Then you wait for the reindex job to finish and replay the queue as a “green” subscriber. One of the most interesting properties of Kafka is that you can replay messages that have already been consumed. When you connect to Kafka as a consumer, you define a topic, partition and offset, and this offset can be the tail position from 10 minutes ago or whenever you started the index job. After the “green” consumer has consumed everything up to the current tail, you flip the alias and retire the blue index. This avoids the accidental re-write scenario entirely.

    • Patrick Peschlow

      That’s a really good practice. Of course, the extra queuing delay and resulting async should not conflict with some functional or performance requirement of the application (let’s say the ability to real-time get indexed documents after the original call returns).

      Recently we had to migrate an index to a new Elasticsearch cluster and having such a queue in place would have greatly helped. In fact, we actually considered adding such indirection to the system before executing the reindexing job, but given the planned time frame it was not to be. I plan to report on this particular reindexing scenario in a future post on this blog.
