Elasticsearch Indexing Performance Cheatsheet

8.5.2014 | 8 minutes of reading time

Do you plan to index large amounts of data in Elasticsearch? Or are you already trying to do so, only to find that throughput is too low? Here is a collection of tips and ideas for increasing indexing throughput with Elasticsearch. Some of them I have successfully tried myself; others I have only read about and found reasonable. In any case, I hope you will find them useful.

In order to fit all this into a single article, I have kept the suggestions rather brief. For some of them, you may feel that you need to learn more before putting them into practice. To ease your task a little, I have included links to the relevant sections of the Elasticsearch documentation which you may use as a starting point for further research.

General Performance

Before doing anything more specific, it makes sense to follow the advice given in the Elasticsearch documentation on configuration. In a nutshell:

Set the maximum number of open file descriptors for the user running Elasticsearch to at least 32k, or better 64k.
If possible, consider disabling swapping for the Elasticsearch process memory. Note, however, that in a virtualized environment this may not behave as expected.
Set -Xms to the same value as -Xmx (the same result can be achieved by setting the ES_HEAP_SIZE environment variable).
Leave some amount of physical memory unassigned so that the OS file system cache is free to use it for Lucene’s benefit. A rule of thumb is to have the Elasticsearch JVM use no more than half of the available memory.
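
The heap rule of thumb above can be turned into a small calculation. The following is only a sketch: the function names are made up for illustration, and the input is assumed to be the physical memory of the machine in megabytes.

```python
def suggest_heap_mb(total_memory_mb: int) -> int:
    """Return a heap size (in MB) that leaves half of the memory to the OS cache."""
    if total_memory_mb <= 0:
        raise ValueError("total_memory_mb must be positive")
    return total_memory_mb // 2


def heap_settings(total_memory_mb: int) -> dict:
    """Derive equal -Xms/-Xmx flags and the matching ES_HEAP_SIZE value."""
    heap_mb = suggest_heap_mb(total_memory_mb)
    return {
        # Setting ES_HEAP_SIZE has the same effect as setting both flags.
        "ES_HEAP_SIZE": f"{heap_mb}m",
        "jvm_flags": [f"-Xms{heap_mb}m", f"-Xmx{heap_mb}m"],
    }


# Example: a machine with 16 GB of physical memory.
print(heap_settings(16384))
```

Treat the result as a starting point only. The right heap size still depends on what else runs on the machine.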

Mapping

If your search requirements allow it, there is some room for optimization in the mapping definition of your index:

By default, Elasticsearch stores the original data in a special _source field. If you do not need it, disable it.
By default, Elasticsearch analyzes the input data of all fields in a special _all field. If you do not need it, disable it.
If you are using the _source field, there is no additional value in marking any other field as stored.
If you are not using the _source field, mark only those fields as stored that you really need. Note, however, that using _source brings certain advantages, such as the ability to use the update API.
For analyzed fields, do you need norms? If not, disable them by setting norms.enabled to false.
Do you need to store term frequencies and positions, as is done by default, or can you do with less – maybe only doc numbers? Set index_options to what you really need, as outlined in the string core type description.
For analyzed fields, use the simplest analyzer that satisfies the requirements for the field. Or maybe you can even go with not_analyzed?
Do not analyze, store, or even send data to Elasticsearch that you do not need for answering search requests. In particular, double-check the content of mappings that you do not define yourself (e.g., because a tool like Logstash generates them for you).
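
To make the mapping options above more concrete, here is a sketch of a deliberately lean mapping, built as a Python dictionary and printed as JSON. The field names "message" and "status" and the type name "logline" are invented for illustration. The option names follow the mapping syntax of the Elasticsearch version this article was written for and may differ in other versions.

```python
import json


def build_lean_mapping(doc_type: str) -> dict:
    """Build a mapping that drops everything not needed for searching."""
    return {
        doc_type: {
            # Only disable these if you need neither the original document
            # (e.g., for the update API) nor searches across all fields.
            "_source": {"enabled": False},
            "_all": {"enabled": False},
            "properties": {
                # Hypothetical analyzed field with minimal index data.
                "message": {
                    "type": "string",
                    "analyzer": "simple",
                    "norms": {"enabled": False},
                    # "docs" keeps only doc numbers, so phrase queries
                    # on this field will no longer work.
                    "index_options": "docs",
                },
                # Hypothetical field that is only matched exactly.
                "status": {"type": "string", "index": "not_analyzed"},
            },
        }
    }


print(json.dumps(build_lean_mapping("logline"), indent=2))
```

Whether each of these options is acceptable depends entirely on the queries you need to support, so verify them against your search requirements first.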

Requests and Clients

You can also gain a lot from optimizing the way in which you transfer indexing requests to Elasticsearch:

Do you have to send a separate request for each document? Or can you buffer documents in order to use the bulk API for indexing multiple documents with a single request?
When using bulk requests, optimize the bulk size, i.e., how many documents you bundle in a single request. Usually an appropriate bulk size has to be discovered empirically by trying out different sizes under realistic load conditions.
If your business can afford it, you can even consider trading some reliability for performance using the bulk UDP API for certain data. This is particularly interesting if the client and server participating in the request reside on the same host.
If you are using an HTTP client, consider using long-lived HTTP connections. Also, make sure that HTTP chunking is not hampering throughput.
Consider using one of the various existing clients as they may contain performance advantages over using plain HTTP.
If your client speaks Java, consider using the NodeClient. A NodeClient joins the cluster and knows which nodes to address for certain requests, possibly saving one hop when compared to other clients. If you cannot use the NodeClient, e.g., due to security restrictions, see if you can use TransportClient before considering something else.
Can you parallelize indexing by using multiple clients? It may well be that a single client turns out to be the indexing bottleneck and that the Elasticsearch server is able to handle a much higher load.
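
The buffering and bulk-size advice above can be sketched as follows. The code only builds the newline-delimited request bodies for the bulk API and does not send anything. The helper names, the index name "logs" and the type name "event" are assumptions made for this example.

```python
import json
from typing import Iterable, Iterator, List


def chunked(docs: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield successive batches of at most `size` documents."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: List[dict] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def build_bulk_body(index: str, doc_type: str, docs: Iterable[dict]) -> str:
    """Build a bulk request body: one action line plus one source line per doc."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


documents = [{"n": i} for i in range(5)]
for batch in chunked(documents, size=2):
    print(build_bulk_body("logs", "event", batch), end="")
```

The batch size of 2 is only there to keep the output short. As noted above, a suitable bulk size has to be found by measuring under realistic load.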

Sharding and Replication

Elasticsearch provides sharding and replication as the recommended way for scaling and increasing availability of an index. There are a few things to consider:

If a single Elasticsearch server is not enough to provide your desired indexing throughput, you may need to scale out. Multiple cluster nodes enable parallel work on an index by sharding it. Note: The number of shards of an index needs to be set on index creation and cannot be changed later. In case you do not know exactly how much data to expect, you may consider overallocating a few shards (but not too many, they are not free!) to have some spare capacity available. Other than that, index aliases may provide a way (albeit with limitations) of scaling out an index at a later point in time.
Replication is an important feature for being able to cope with failure, but the more replicas you have the longer indexing will take. Thus, for raw indexing throughput it would be best to have no replicas at all. Luckily, in contrast to the number of shards, you may change the number of replicas of an index at any time, which gives us some additional options. In certain situations, such as populating a new index initially, or migrating data from one index to another, it may prove beneficial to start without replication and only add replicas later, once the time-critical initial indexing has been completed.
Consider separating data nodes (that actually store and index data) from “aggregator nodes” (used only for querying). When aggregator nodes handle search queries and only contact data nodes as needed, they take load off the data nodes which will then have more capacity for handling indexing requests.
By default, an indexing request is completed once the data has been safely received (i.e., stored in the transaction log) by all replicas. By setting the query parameter replication to async, the request will already complete when the data has been acknowledged on the primary shard.
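
The idea of starting without replicas and adding them after the initial load can be sketched as two requests. The functions below only describe the requests as (method, path, body) tuples. The function names and the index name are hypothetical, and sending the requests is left to whatever client you use.

```python
import json
from typing import Tuple

Request = Tuple[str, str, dict]


def create_index_request(index: str, shards: int) -> Request:
    """Create the index with a fixed shard count and no replicas."""
    body = {
        "settings": {
            # The number of shards cannot be changed after creation.
            "number_of_shards": shards,
            "number_of_replicas": 0,
        }
    }
    return ("PUT", f"/{index}", body)


def add_replicas_request(index: str, replicas: int) -> Request:
    """Raise the replica count once the initial indexing run is done."""
    return ("PUT", f"/{index}/_settings", {"index": {"number_of_replicas": replicas}})


for method, path, body in (
    create_index_request("products_v1", shards=4),
    add_replicas_request("products_v1", replicas=1),
):
    print(method, path, json.dumps(body))
```

Keep in mind that the index has no redundancy until the second request has been applied and the replicas have been built.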

Index Settings

There are several index level settings that you may tune to improve indexing throughput:

By default, an index shard uses a refresh interval of one second, i.e., new documents become available for search after one second. Even though refreshing is a more lightweight operation than one may think, it comes at a cost. Thus, depending on your search requirements, you may consider setting the refresh interval to something higher than one second. It can even make sense to temporarily turn off refreshing completely for an index (by setting the interval to -1), e.g., during a bulk indexing run, and trigger it manually at the end.
Compared to refreshing an index shard, the really expensive operation is flushing its transaction log (which involves a Lucene commit). Elasticsearch performs flushes based on a number of triggers that may be changed at run time. By delaying flushes, or disabling them completely, you can increase indexing throughput. Just be aware that nothing comes for free, and the delayed flush will of course take longer when it eventually happens.
The default segment merge policy, “tiered”, supports a compound format where data is stored in fewer files to reduce the number of open file handles needed. However, the compound format comes along with a performance penalty. There are two settings, index.compound_on_flush and index.compound_format, that specify whether the compound format should be used for new segments and merged segments, respectively. Making sure that both are set to false may improve indexing performance, at the cost of more file handles.
Segment merging is done in the background but requires I/O from which indexing performance may suffer. Therefore, it is possible to throttle merging to a maximum number of bytes per second, on the node or index level. Note that throttling is already done by default, but maybe you want to adjust the predefined limit according to your needs.
The setting indices.memory.index_buffer_size defines the percentage of available heap memory that may be used for indexing operations (the remaining heap memory will mainly be used for search operations). The default of 10% may be too low if you have lots of data to index, and it may make sense to set it to a higher value.
Index warmup is a useful concept to speed up search queries, but when indexing large amounts of data (in particular, bulk indexing) it may make sense to temporarily disable it.
Consider increasing the node level thread pool size for indexing and bulk operations (and measure if it really brings an improvement).
The setting index.index_concurrency limits the number of threads that may concurrently perform indexing operations on a single shard. Consider increasing the value, especially when there are no other shards on the node (and measure if it pays off).
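
As an illustration of the refresh interval tip, the following sketch lists the requests one might issue around a bulk indexing run: turn refreshing off, index, restore the interval and refresh once manually. The function names and the index name are assumptions, and the requests are again only described as (method, path, body) tuples rather than sent.

```python
import json
from typing import List, Optional, Tuple

Request = Tuple[str, str, Optional[dict]]


def before_bulk_run(index: str) -> List[Request]:
    """Disable periodic refreshing for the duration of the bulk run."""
    return [("PUT", f"/{index}/_settings", {"index": {"refresh_interval": "-1"}})]


def after_bulk_run(index: str, interval: str = "1s") -> List[Request]:
    """Restore the refresh interval and make the new documents searchable."""
    return [
        ("PUT", f"/{index}/_settings", {"index": {"refresh_interval": interval}}),
        # Trigger one manual refresh at the end of the run.
        ("POST", f"/{index}/_refresh", None),
    ]


for method, path, body in before_bulk_run("logs") + after_bulk_run("logs"):
    print(method, path, "" if body is None else json.dumps(body))
```

Make sure the restore step also runs when the bulk run fails, otherwise the index stays without periodic refreshes.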

Conclusion

I hope some of these suggestions will help you resolve any indexing performance problems you might have. Keep in mind, however, that the most important aspect of a search engine is, well, the search. Do not make the mistake of tuning your search engine to maximum indexing throughput only to discover that all of a sudden its query performance suffers or it no longer fulfills the functional requirements. Always make sure that your users get a quality search experience and really find what they are looking for.

Blog author

Patrick Peschlow
