Elasticsearch tips: inserting vs. updating your index

12.12.2014 | 6 minutes of reading time

Transforming an update-heavy Elasticsearch use case into an insert-heavy one.

I recently had the opportunity to set up an Elasticsearch installation for a customer with a rather unusual use case, and I’d like to share my approach with you. This post shows why update-heavy use of Elasticsearch is a bad idea and how you can transform it into insert-heavy use, which is much faster.

Prerequisites

The requirements involved tracking the lifecycle of documents that enter the company via various input channels and are processed by a number of automated systems. Occasionally one of these documents gets lost between steps, or is misanalyzed and therefore goes missing in the system. If someone inquired about the status of such a lost document, no one could give a good answer or attempt to fix the problem. That’s not a desirable state.

Fortunately, the “metadata” of such a document contains the OCR full text, so what is needed is a “storage engine” with full-text search capabilities, and that really sounds like a job for Elasticsearch! It’s especially easy because we were able to hook custom code into each of these processing steps. Another lucky coincidence is that we can print a barcode on each document, so every process step can be truly independent of the others. This will influence my conclusion later on.

As for the general usage of this system, I would expect a lot of write operations (lots of documents processed, most of them without errors) and only a few read operations (you only check when something went wrong, if at all). This leads to some conclusions you would not expect in a more traditional use case.

The ‘naive’ NoSQL approach

As with every Elasticsearch project I’m involved in, I like to step back first and give the data model a good thought. Sure, Elasticsearch is schemaless, but that does not mean you can skip thinking about your data, especially not if you want acceptable performance later on. Naturally, I was inclined to think of a document as a flat structure that contains its various events and their respective results and timestamps. These could be thought of as relations, sure, but since they are tightly bound to each other (in classical terms a 1-1 relationship, if you will), a flat structure saves you the awkwardness of joining things together.

Implemented, that would mean the first operation on a document creates it (or upserts it) and each following step updates the document accordingly. I’m not quite happy with this approach, since updating lots of documents all the time has the following drawbacks:

Downsides to frequent updates

Cost of get_then_update

Any updates you do during the lifecycle would mostly be “partial updates”, where you only send the things that have changed to the Elasticsearch cluster. In fact, the independent software systems should be unaware of the state updates made by the other systems, to avoid coupling them. Elasticsearch allows us to do partial updates, but internally these are “get_then_update” operations: the whole document is fetched, the changes are applied, and then the complete document is indexed again. Even without disk hits, one can imagine the performance implications if this is your main use case.
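To make that cost concrete, here is a minimal Python sketch of what a partial update amounts to. It is a toy in-memory model, not Elasticsearch code: the `ToyIndex` class, its method names, and the field names are all made up for illustration.

```python
import copy


class ToyIndex:
    """Toy in-memory stand-in for an index (hypothetical, not the Elasticsearch API)."""

    def __init__(self):
        self.docs = {}        # doc_id -> (version, source)
        self.index_calls = 0  # how often a complete document was (re)indexed

    def index(self, doc_id, source):
        """Index a whole document, replacing any previous version."""
        version = self.docs[doc_id][0] + 1 if doc_id in self.docs else 1
        self.docs[doc_id] = (version, copy.deepcopy(source))
        self.index_calls += 1
        return version

    def partial_update(self, doc_id, changes):
        """Fetch the full source, merge the changes, index the whole thing again."""
        _, source = self.docs[doc_id]      # 1. get
        merged = {**source, **changes}     # 2. apply the changes
        return self.index(doc_id, merged)  # 3. reindex the complete document


idx = ToyIndex()
idx.index("4711", {"ocr_text": "Dear Sir or Madam ...", "scanned_at": "2014-12-01T08:00:00"})
idx.partial_update("4711", {"classified_at": "2014-12-01T08:05:00"})
idx.partial_update("4711", {"archived_at": "2014-12-01T08:09:00"})
```

Although each step only sent one small field, the full document, OCR text included, was written three times.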

Potential version conflicts

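The “get_then_update” operations are not atomic, and Elasticsearch versions its documents implicitly, so version conflicts are to be expected when several systems update the same document. Elasticsearch detects them through optimistic concurrency control: a conflicting update is rejected with a version conflict error unless you set the retry_on_conflict parameter, in which case the whole get-and-reindex cycle is repeated. Either way, it’s another performance impact you have to be aware of.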
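The following toy model sketches how such a conflict arises between the get and the reindex, and what a retry costs. It is plain Python, not Elasticsearch code; `VersionedStore`, `update_with_retry`, and the `interfere` hook are hypothetical names. In Elasticsearch itself, the update API's retry_on_conflict parameter controls how often the cycle is retried.

```python
class VersionConflict(Exception):
    """Raised when a document changed between our get and our write."""


class VersionedStore:
    """Toy store with optimistic concurrency control (hypothetical)."""

    def __init__(self):
        self.docs = {}  # doc_id -> (version, source)

    def get(self, doc_id):
        return self.docs[doc_id]

    def index_if_version(self, doc_id, source, expected_version):
        current_version, _ = self.docs[doc_id]
        if current_version != expected_version:
            raise VersionConflict(doc_id)
        self.docs[doc_id] = (current_version + 1, source)


def update_with_retry(store, doc_id, changes, retries, interfere=None):
    """Get, merge, write; on a conflict start over, at most `retries` extra times."""
    attempts = 0
    while True:
        attempts += 1
        version, source = store.get(doc_id)
        if interfere is not None:
            interfere(attempts)  # simulate a concurrent writer between get and write
        try:
            store.index_if_version(doc_id, {**source, **changes}, version)
            return attempts
        except VersionConflict:
            if attempts > retries:
                raise


store = VersionedStore()
store.docs["4711"] = (1, {"scanned": True})


def other_writer(attempt):
    # Another system sneaks in a write, but only during our first attempt.
    if attempt == 1:
        version, source = store.get("4711")
        store.index_if_version("4711", {**source, "classified": True}, version)


attempts = update_with_retry(store, "4711", {"archived": True}, retries=3, interfere=other_writer)
```

Here the update needs two full get-merge-write cycles before it succeeds, and neither writer's change is lost.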

Need to store _source

Another peculiarity of the “get_then_update” operation is that Elasticsearch cannot use the indexed document itself but needs the original instead. This forces you to keep the _source field enabled. In my case that was not an issue, but it’s something to be aware of.

Lucene “soft deletes” and merging cost

On the Lucene layer, an update is actually not an update but an (atomic) “insert and delete” operation. But alas, this is still not the full truth: deletes are soft, meaning the old document is only marked with a tombstone flag and stays in its segment. Only a later merge operation cleans it up. Like garbage collection for objects that were instantiated and then dereferenced, this can put additional pressure on your system.
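A rough way to picture this is the toy model below. It is a deliberately simplified Python sketch, not how Lucene is implemented: real segments are immutable files and deletions are tracked separately, and the `ToySegments` class is hypothetical.

```python
class ToySegments:
    """Toy model of soft deletes: updates append, old copies are only flagged."""

    def __init__(self):
        self.entries = []  # each entry: [doc_id, source, deleted_flag]

    def update(self, doc_id, source):
        # "Delete": flag every live copy of the document as a tombstone.
        for entry in self.entries:
            if entry[0] == doc_id and not entry[2]:
                entry[2] = True
        # "Insert": append the new copy.
        self.entries.append([doc_id, source, False])

    def live_count(self):
        return sum(1 for entry in self.entries if not entry[2])

    def merge(self):
        """Rewrite the data without tombstones and return how many were dropped."""
        before = len(self.entries)
        self.entries = [entry for entry in self.entries if not entry[2]]
        return before - len(self.entries)


segments = ToySegments()
for step in ("scan", "classify", "route", "archive"):
    segments.update("4711", {"last_step": step})

stored_before_merge = len(segments.entries)
live_before_merge = segments.live_count()
dropped = segments.merge()
```

Four updates to one document leave four stored copies, of which only one is live, until a merge pays the cost of cleaning up the other three.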

In conclusion, update operations can be considered rather expensive. Since our application will ultimately consist of (almost) nothing but update operations, this seems like a bad idea. Let’s try changing that.

Index instead of Update

To get a different kind of operation, we need to split the document into its individual events, which gives us a relation between a document and its events. To keep things a little denormalized, we can reduce the events to a single type that contains all the possible fields, and only fill the fields relevant to the event at hand.
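As an illustration of such a single event type, consider the Python sketch below. The field names, step names, and the `make_event` helper are assumptions made up for this example; the actual fields used in the project are not listed here.

```python
# Hypothetical superset of optional event fields (assumed for illustration).
EVENT_FIELDS = {"channel", "ocr_text", "classification", "error"}


def make_event(step, timestamp, **data):
    """Build one event document, keeping only the fields this step actually filled."""
    unknown = set(data) - EVENT_FIELDS
    if unknown:
        raise ValueError(f"unknown event fields: {sorted(unknown)}")
    event = {"step": step, "timestamp": timestamp}
    event.update({key: value for key, value in data.items() if value is not None})
    return event


scan = make_event("scan", "2014-12-01T08:00:00", channel="mail", ocr_text="Dear Sir or Madam ...")
classify = make_event("classify", "2014-12-01T08:05:00", classification="invoice", error=None)
```

Every processing step produces its own small, self-contained event and never needs to see what the other steps wrote.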

Relations in Elasticsearch

To handle relations, Elasticsearch provides two different mechanisms, each with its own pros and cons: nested documents and parent-child relations. For an in-depth introduction to both concepts, I’d recommend reading the Elasticsearch Guide’s chapter on modeling your data.

Without poorly replicating that description: in a nutshell, nested documents live inside the original document, while parent-child documents live separately in their own type and are joined at query time. You need to be aware that parents and their children necessarily have to live on the same shard, and that the parent-child ID map is held in memory.

For our specific use case (where the many updates are our performance concern, and search performance is practically negligible), we chose a parent-child relation as the better fit: we can truly insert a new event without touching the original document or any of the other events. This is possible because every step in the process chain already knows the ID of the document without asking Elasticsearch. It’s the ID printed on the document, which we can reuse as the ID for our Dokument type.
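As a rough sketch of what inserting such a child event could look like, the Python snippet below only builds the HTTP request and sends nothing. It assumes the parent-child syntax of the Elasticsearch 1.x line that was current when this post was written (a `_parent` entry in the child type's mapping and a `parent` query parameter when indexing); later major versions changed how parent-child relations are declared. The index name `tracking`, the child type name `event`, and the event ID scheme are assumptions for illustration.

```python
import json
from urllib.parse import quote

# Mapping for the (assumed) child type "event", pointing at the parent type "Dokument".
EVENT_MAPPING = {"mappings": {"event": {"_parent": {"type": "Dokument"}}}}


def index_event_request(index, document_id, event_id, event):
    """Build method, path and body for indexing one event as a child of a document.

    Only the printed document ID is needed; the parent itself is never read or modified.
    """
    path = "/{}/event/{}?parent={}".format(
        quote(index, safe=""), quote(event_id, safe=""), quote(document_id, safe="")
    )
    return "PUT", path, json.dumps(event, sort_keys=True)


method, path, body = index_event_request(
    "tracking", "4711", "4711-classify", {"step": "classify", "classification": "invoice"}
)
```

The `parent` value also decides which shard the event is routed to, which is how the child ends up on the same shard as its parent.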

In the end, the performance numbers on the hardware we had at our disposal proved us right: we are able to process a day’s worth of data in about 2 seconds!

Conclusion

While this was a relatively rare use case that you probably won’t encounter in the wild, it carries an interesting lesson: sometimes the “natural” or obvious data model goes against the inner workings of Elasticsearch, and it’s useful to remodel your data to better fit your system. Afterwards (and by that I mean the rest of the week, which says something about how insanely fast one can get results with Elasticsearch) we were able to develop a small web app where users can search the generated data, and we were pleasantly surprised that search operations are still much faster than we anticipated!

Blog author

Christian Uhl