Validating Topic Configurations in Apache Kafka

7.12.2017 | 8 minutes of reading time

Messages in Apache Kafka are appended to (partitions of) a topic. Topics have a partition count, a replication factor and various other configuration values. Why do those matter and what could possibly go wrong?

Why does Kafka topic configuration matter?

There are three main parts that define the configuration of a Kafka topic:

Partition count
Replication factor
Technical configuration

The partition count defines the level of parallelism of the topic. For example, a partition count of 50 means that up to 50 consumer instances in a consumer group can process messages in parallel. The replication factor specifies how many copies of a partition are held in the cluster to enable failover in case of broker failure. And in the technical configuration, one can define the cleanup policy (deletion or log compaction), flushing of data to disk, the maximum message size, whether unclean leader elections are permitted and so on. For a complete list, see https://kafka.apache.org/documentation/#topicconfigs. Some of these properties are quite easy to change at runtime. Others are a lot harder to change.

Let’s take the partition count. Increasing it is easy: just run

bin/kafka-topics.sh --alter --zookeeper zk:2181 --topic mytopic --partitions 42

This might be sufficient for you. Or it might open the fiery gates of hell and break your application. The latter is the case if you depend on all messages for a given key landing on the same partition (to be handled by the same consumer in a group): once the partition count changes, new messages for a key can be assigned to a different partition than older messages with the same key. It is also the case if you run a Kafka Streams application. If that application uses joins, the involved topics need to be copartitioned, meaning that they need to have the same partition count (and producers using the same partitioner, but that is hard to enforce). Even without joins, you don’t want messages with the same key to end up in different KTables.
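To see why a changed partition count can break key-based ordering, here is a minimal sketch of a hash-based partitioner. It is an illustration under assumptions: the class and method names are made up, and String.hashCode() stands in for the murmur2 hash that Kafka's default partitioner applies to the serialized key. The modulo step is the part that matters.

```java
import java.util.Arrays;
import java.util.List;

class KeyPartitioningSketch {

    // Simplified stand-in for a hash-based partitioner: hash the key and take the
    // result modulo the partition count. Kafka's default partitioner uses murmur2
    // on the serialized key instead of String.hashCode().
    static int partitionFor(String key, int partitionCount) {
        return Math.floorMod(key.hashCode(), partitionCount);
    }

    // Counts how many of the given keys map to a different partition after the change.
    static long movedKeys(List<String> keys, int oldCount, int newCount) {
        return keys.stream()
                .filter(k -> partitionFor(k, oldCount) != partitionFor(k, newCount))
                .count();
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "b", "c");
        for (String key : keys) {
            System.out.println(key + ": partition " + partitionFor(key, 32)
                    + " before, partition " + partitionFor(key, 42) + " after");
        }
        System.out.println("keys that moved: " + movedKeys(keys, 32, 42));
    }
}
```

Messages that were written before the change stay where they are, so a consumer that relies on seeing all messages for a key on one partition can no longer do so.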

Changing the replication factor is serious business. It is not a case of simply saying “please increase the replication factor to x” as it is with the partition count. You need to completely reassign partitions to brokers, specifying the preferred leader and n replicas for each partition. It is your task to distribute those well across your cluster. This is no fun for anyone involved. Practical experience with this has actually led to this blog post.
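To give an idea of what such a reassignment involves, the following sketch builds the JSON document that kafka-reassign-partitions.sh accepts via --reassignment-json-file. The simple round-robin placement and all class and method names are assumptions made for illustration. A real plan should also take racks, disk usage and current leaders into account.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ReassignmentPlanSketch {

    // Replica list for one partition: start at a broker picked by partition number
    // and take the following brokers in order, wrapping around at the end.
    // The first entry of the list is the preferred leader.
    static List<Integer> replicasFor(int partition, List<Integer> brokerIds, int replicationFactor) {
        if (replicationFactor > brokerIds.size()) {
            throw new IllegalArgumentException("replication factor exceeds number of brokers");
        }
        List<Integer> replicas = new ArrayList<>();
        for (int i = 0; i < replicationFactor; i++) {
            replicas.add(brokerIds.get((partition + i) % brokerIds.size()));
        }
        return replicas;
    }

    // Renders the plan as JSON. Topic names are not escaped here, which is fine
    // because Kafka only allows letters, digits, '.', '_' and '-' in topic names.
    static String toJson(String topic, int partitionCount, List<Integer> brokerIds, int replicationFactor) {
        StringBuilder json = new StringBuilder("{\"version\":1,\"partitions\":[");
        for (int p = 0; p < partitionCount; p++) {
            if (p > 0) {
                json.append(',');
            }
            json.append("{\"topic\":\"").append(topic)
                .append("\",\"partition\":").append(p)
                .append(",\"replicas\":")
                .append(replicasFor(p, brokerIds, replicationFactor).toString().replace(" ", ""))
                .append('}');
        }
        return json.append("]}").toString();
    }

    public static void main(String[] args) {
        System.out.println(toJson("mytopic", 4, Arrays.asList(1, 2, 3), 3));
    }
}
```

Every partition needs its own entry, which is why fixing the replication factor of a topic with many partitions by hand is tedious and error-prone.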
The technical configuration has an impact as well. For example, it can be essential that a topic uses compaction instead of deletion if an application depends on that. You also might find the retention time too short or too long.

The Evils of Automatic Topic Creation

In a recent project, a central team managed the Kafka cluster. This team kept a lot of default values in the broker configuration. This is mostly sensible as Kafka comes with pretty good defaults. However, one thing they kept was auto.create.topics.enable=true. This property means that whenever a client tries to write to or read from a non-existing topic, Kafka will automatically create it. Defaults for partition count and replication factor were kept at 1.

This led to a situation in which the team forgot to set up a new topic manually before running producers and consumers. Kafka created that topic with the default configuration. Once this was noticed, all applications were stopped and the topic was deleted, only to be created again automatically seconds later, presumably because the team didn’t find all clients. “Ok”, they thought, “let’s fix it manually”. They increased the partition count to 32, only to realize that they had to provide the complete partition assignment map to fix the replication factor. Even with tool support from Kafka Manager, this didn’t give the team members a great feeling. Luckily, this was only a development cluster, so nothing really bad happened. But it was easy to see that this could also happen in production, as there are no safeguards.

Another danger of automatic topic creation is the sensitivity to typos. Let’s face it: sometimes we all suffer from butterfingers. Even if you took all necessary care to correctly create a topic called “parameters”, a producer might still end up writing to a slightly misspelled variant of that name.

Automatic topic creation means that your producer thinks everything is fine, and you’ll scratch your head as to why your consumers don’t receive any data.

Another conceivable issue is that a developer who is not yet familiar with the Producer API might mix up the String parameters when sending a message.

So while our developer meant to assign a random value to the message key, he accidentally set a random topic name. Every time a message is produced, Kafka creates a new topic.
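The following sketch shows how easily this happens. The Record class is a hypothetical stand-in so that the example runs without the Kafka client library. Kafka's ProducerRecord has a constructor with the same parameter order (topic, key, value), and with String keys and values all three arguments have the same type.

```java
import java.util.UUID;

class SwappedArgumentsSketch {

    // Hypothetical stand-in for a producer record with String key and value.
    static final class Record {
        final String topic;
        final String key;
        final String value;

        Record(String topic, String key, String value) {
            this.topic = topic;
            this.key = key;
            this.value = value;
        }
    }

    // What the developer meant: fixed topic, random key.
    static Record intended(String randomKey, String value) {
        return new Record("parameters", randomKey, value);
    }

    // What the developer wrote: the first two arguments are swapped. The compiler
    // cannot help, because both are Strings.
    static Record mistaken(String randomKey, String value) {
        return new Record(randomKey, "parameters", value);
    }

    public static void main(String[] args) {
        String randomKey = UUID.randomUUID().toString();
        System.out.println("intended topic: " + intended(randomKey, "some value").topic);
        System.out.println("mistaken topic: " + mistaken(randomKey, "some value").topic);
    }
}
```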

So why don’t we just switch automatic topic creation off? Well, if you can: do it. Do it now! Sadly, the team didn’t have that option. But an idea was born – what would be the easiest way to at least fail fast at application startup when something is different than expected?

How to automatically check your topic configuration

In older versions of Kafka, we basically used the code called by the kafka-topics.sh script to programmatically work with topics. To create a topic for example we looked at how to use kafka.admin.CreateTopicCommand. This was definitely better than writing straight to Zookeeper because there is no need to replicate the logic of “which ZNode goes where”, but it always felt like a hack. And of course we got a dependency on the Kafka broker in our code – definitely not great.

Kafka 0.11 implemented KIP-117, thus providing a new type of Kafka client: org.apache.kafka.clients.admin.AdminClient. This client enables users to programmatically execute admin tasks without relying on those old internal classes or even Zookeeper, because all Zookeeper tasks are executed by brokers.

With AdminClient, it’s fairly easy to query the cluster for the current configuration of a topic. For example, a single describeTopics call is enough to find out whether a topic exists and what its partition count and replication factor are.

The DescribeTopicsResult contains all the info required to find out if the topic exists and how partition count and replication factor are set. It’s asynchronous, so be prepared to work with Futures to get your info.
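The sketch below shows how partition count and replication factor can be derived once the Future completes. To keep it runnable without a broker, it uses plain JDK types as stand-ins: a CompletableFuture instead of the KafkaFuture returned by the client, and lists of broker ids instead of TopicDescription.partitions() and TopicPartitionInfo.replicas(). The class and method names are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

class TopicShapeSketch {

    // One inner list per partition, holding the broker ids of that partition's replicas.
    static int partitionCount(List<List<Integer>> partitions) {
        return partitions.size();
    }

    // Reports the smallest replica count, so a single under-replicated partition shows up.
    static int replicationFactor(List<List<Integer>> partitions) {
        return partitions.stream().mapToInt(List::size).min().orElse(0);
    }

    // Blocks until the description is available and returns {partitionCount, replicationFactor}.
    // With the real client, a failed lookup (for example a missing topic) is expected
    // to surface as an ExecutionException from get().
    static int[] describe(CompletableFuture<List<List<Integer>>> pending) {
        try {
            List<List<Integer>> partitions = pending.get();
            return new int[] { partitionCount(partitions), replicationFactor(partitions) };
        } catch (ExecutionException e) {
            throw new IllegalStateException("could not describe topic", e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while describing topic", e);
        }
    }

    public static void main(String[] args) {
        List<List<Integer>> partitions = Arrays.asList(
                Arrays.asList(1, 2, 3),
                Arrays.asList(2, 3, 1));
        int[] shape = describe(CompletableFuture.completedFuture(partitions));
        System.out.println("partitions=" + shape[0] + ", replicationFactor=" + shape[1]);
    }
}
```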

Getting configs like cleanup.policy works similarly, but uses a different method, describeConfigs.

Under the hood there is the same Future-based mechanism.

A first implementation attempt

If you are in a situation where your application depends on a certain configuration for the Kafka topics you use, it might make sense to fail early when something is not right. You get instant feedback and have a chance to fix the problem. Or you might at least want to emit a warning in your log. In any case, as nice as the AdminClient is, this check is not something you should have to implement yourself in every project.

Thus, the idea for a small library was born. And since naming things is hard, it’s called “Club Topicana”.

With Club Topicana, you can check your topic configuration every time you create a Kafka Producer, Consumer or Streams client.

Expectations can be expressed programmatically or through configuration. Programmatically, the library offers a builder.

An expectation built this way might say: “I expect the topic test_topic to exist. It should also have 32 partitions and a replication factor of 3. I also expect the cleanup policy to be delete. Kafka should retain messages for at least 30 seconds.”

Another option to specify an expected configuration is YAML (a parser is included).

What do you do with those expectations? The library provides factories for all Kafka clients that mirror their public constructors and additionally expect a collection of expected topic configurations. A producer, for example, is created by calling the producer factory with the usual producer configuration plus those expectations.

That call throws a MismatchedTopicConfigException if the actual configuration does not meet expectations. The message of that exception lists the differences. It also provides access to the computed result so users can react to it in any way they want.
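The comparison behind such a check can be sketched with plain JDK types. This is not Club Topicana's actual API: TopicConfig, differences and failFast are hypothetical names, and properties are compared for exact equality, whereas a real check might treat a value like the retention time as a lower bound.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

class TopicExpectationSketch {

    // Hypothetical value object for this sketch; not a Club Topicana class.
    static final class TopicConfig {
        final int partitionCount;
        final int replicationFactor;
        final Map<String, String> properties;

        TopicConfig(int partitionCount, int replicationFactor, Map<String, String> properties) {
            this.partitionCount = partitionCount;
            this.replicationFactor = replicationFactor;
            this.properties = properties;
        }
    }

    // Lists every mismatch. A null "actual" means the topic does not exist.
    static List<String> differences(String topic, TopicConfig expected, TopicConfig actual) {
        List<String> diffs = new ArrayList<>();
        if (actual == null) {
            diffs.add(topic + ": topic does not exist");
            return diffs;
        }
        if (expected.partitionCount != actual.partitionCount) {
            diffs.add(topic + ": partition count expected " + expected.partitionCount
                    + " but was " + actual.partitionCount);
        }
        if (expected.replicationFactor != actual.replicationFactor) {
            diffs.add(topic + ": replication factor expected " + expected.replicationFactor
                    + " but was " + actual.replicationFactor);
        }
        for (Map.Entry<String, String> entry : expected.properties.entrySet()) {
            String actualValue = actual.properties.get(entry.getKey());
            if (!entry.getValue().equals(actualValue)) {
                diffs.add(topic + ": " + entry.getKey() + " expected " + entry.getValue()
                        + " but was " + actualValue);
            }
        }
        return diffs;
    }

    // Fails fast with a message that lists all differences.
    static void failFast(String topic, TopicConfig expected, TopicConfig actual) {
        List<String> diffs = differences(topic, expected, actual);
        if (!diffs.isEmpty()) {
            throw new IllegalStateException(String.join("; ", diffs));
        }
    }

    public static void main(String[] args) {
        TopicConfig expected = new TopicConfig(32, 3, Collections.singletonMap("cleanup.policy", "delete"));
        TopicConfig autoCreated = new TopicConfig(1, 1, Collections.singletonMap("cleanup.policy", "delete"));
        differences("test_topic", expected, autoCreated).forEach(System.out::println);
    }
}
```

Collecting all differences before failing means one startup attempt reports everything that is wrong, instead of one problem per restart.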

The code for consumers and streams clients looks similar. Examples are available on GitHub. If all standard clients are created using Club Topicana, an exception will prevent creation of a client and thus auto creation of a topic. Even if auto creation is disabled, it might be valuable to ensure that topics have the correct configuration.

There is also a Spring client. The @EnableClubTopicana annotation triggers Club Topicana to read YAML configuration and execute the checks. You can configure whether you want to just log any mismatches or let the creation of the application context fail.

This is all on GitHub and available on Maven Central.

Caveats

Club Topicana will not notice when someone changes the configuration of a topic after your application has successfully started. It also of course cannot guard against other clients doing whatever on Kafka.

Summary

The configuration of your Kafka topics is an essential part of running your Kafka applications. Wrong partition count? You might not get the parallelism you need or your streams application might not even start. Wrong replication factor? Data loss is a real possibility. Wrong cleanup policy? You might lose messages that you depend on later. Sometimes, your topics might be auto-generated and come with bad defaults that you have to fix manually. With the AdminClient introduced in Kafka 0.11, it’s simple to write a library that compares actual and desired topic configurations at application startup.

Blog author

Florian Troßbach

Senior IT Consultant
