Building a distributed Runtime for Interactive Queries in Apache Kafka with Vert.x

20.3.2017 | 9 minutes of reading time

Interactive Queries are a fairly new feature of Apache Kafka Streams that provides programmatic access to the internal state held by a streaming application. However, the Kafka API only provides access to the state that is held locally by an instance of the application – there is no global state. Source topic partitions are distributed among instances, and while each instance can provide cluster metadata that tells a caller which instances are responsible for a given key or store, developers must provide a custom RPC layer that glues it all together. While playing around with the API in preparation for a blog post on Interactive Queries, I wondered how such a layer could be written in a generic way. This post describes how I ended up with KIQR (Kafka Interactive Query Runtime).

Disclaimer: This truly is a hobby project and has not been extensively tested at runtime.

First steps

After looking at the default APIs on the KafkaStreams client class, I realized I had to account for two types of queries:

- key-based queries that would only be routed to one instance in the cluster based on the key
- scatter-gather queries that would be routed to all instances that held data for a given store (by name) and aggregate the results
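
To make the difference between the two query types concrete, here is a minimal sketch that uses neither Kafka Streams nor KIQR. The Instance class, the hash-based ownerOf partitioner and the in-memory maps are hypothetical stand-ins for application instances, Kafka's partition assignment and the local state stores.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;
import java.util.concurrent.CompletableFuture;

final class QueryRoutingSketch {

    /** Hypothetical stand-in for one application instance and its local slice of a store. */
    static final class Instance {
        final String id;
        final Map<String, Long> localStore = new TreeMap<>();

        Instance(String id) {
            this.id = id;
        }
    }

    private final List<Instance> instances;

    QueryRoutingSketch(List<Instance> instances) {
        this.instances = instances;
    }

    /** Toy partitioner (assumption); Kafka Streams derives ownership from topic partitions. */
    Instance ownerOf(String key) {
        return instances.get(Math.floorMod(key.hashCode(), instances.size()));
    }

    void put(String key, long value) {
        ownerOf(key).localStore.put(key, value);
    }

    /** Key-based query: exactly one instance is asked. */
    CompletableFuture<Optional<Long>> get(String key) {
        Instance owner = ownerOf(key);
        return CompletableFuture.supplyAsync(() -> Optional.ofNullable(owner.localStore.get(key)));
    }

    /** Asks a single instance for a copy of everything it holds locally. */
    private static CompletableFuture<Map<String, Long>> queryLocal(Instance instance) {
        return CompletableFuture.supplyAsync(() -> new TreeMap<>(instance.localStore));
    }

    /** Scatter-gather query: every instance is asked and the partial results are merged. */
    CompletableFuture<Map<String, Long>> all() {
        List<CompletableFuture<Map<String, Long>>> partials = new ArrayList<>();
        for (Instance instance : instances) {
            partials.add(queryLocal(instance));
        }
        return CompletableFuture.allOf(partials.toArray(new CompletableFuture[0]))
                .thenApply(done -> {
                    Map<String, Long> merged = new TreeMap<>();
                    // join() does not block here, all partial futures are already complete
                    partials.forEach(partial -> merged.putAll(partial.join()));
                    return merged;
                });
    }
}
```

The key-based path touches a single instance, while the scatter-gather path fans out to all of them and only completes once every partial result has arrived.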

Both types involve querying at least one instance. Any instance of a Kafka Streams application can be used to obtain cluster-wide metadata that tells us which instance holds what information. But once we know the “where”, how do we get there? Of course we could just communicate via HTTP, but that doesn’t sound that appealing for “internal” queries.
After having heard a lot about Eclipse Vert.x from my colleague and Vert.x committer Jochen Mader, I thought it might be a good fit. I started reading the Vert.x documentation, and I really liked what I saw.

What is Vert.x

Vert.x is an event-driven non-blocking application platform. It enables you to write concurrent code without having to think too much about concurrency itself, so you can focus on your business logic instead of threads and synchronization. A key abstraction is the Verticle, which works similarly to actors in the actor model (it’s not a perfect match, but close enough). As I was familiar with Akka already, making the leap to Vert.x was actually quite easy. There are some other nice features as well – Vert.x is polyglot, so you can write your components in different languages. It also integrates very well with OSGi. And the list is even longer – by now I’m really excited about Vert.x!

Components in a Vert.x application communicate via simple String addresses on an event bus, and this is the killer feature for KIQR’s use case. It is very simple to run Vert.x in cluster mode, turning the event bus into a distributed event bus without having to change any code. After trying it out with a very simple hello world example, this looked capable of handling KIQR’s requirements for internal communication. There are actually four libraries that can be used to run Vert.x in cluster mode (as of Vert.x 3.4.0). The two stable ones are Hazelcast and Apache Ignite. Infinispan and Apache ZooKeeper are in technical preview. I settled on Hazelcast as it was the only stable option at the time when I started.
Perfect – transparent communication between the instances is delegated to Vert.x.

Componentizing the runtime

The event bus sits in the middle, that much is clear. Now what kinds of components do we attach to the bus? I settled on these logical components:

- query verticles for the low-level query operations directly on the KafkaStreams client
  - one for each query operation, potentially multiple ones per store type
- query facades that first find out which instances need to be queried, asynchronously execute the query and aggregate the results if necessary
  - also one for each query operation

We definitely need to run the query verticles on every instance that we want to query, so they’ll have to listen to messages on the event bus. But how can we make the correlation between event bus addresses and KafkaStreams metadata? Since Kafka 0.10.1, the Streams API contains a new parameter called application.server that is published among all instances of a streaming application via the Kafka protocol.
As the Vert.x event bus only uses Strings as addresses, I had the idea that I could use that field not to publish a host:port pair as intended, but to use a unique identifier as the host and listen to that identifier on the event bus. UUIDs make good identifiers in this case.
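
The addressing idea can be sketched with plain Java. This is not KIQR's actual code: the dummy port, the helper names and the map that stands in for the event bus are assumptions made for illustration; only the application.server setting itself comes from Kafka Streams.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import java.util.UUID;
import java.util.function.Function;

final class InstanceAddressingSketch {

    /** Streams properties whose application.server "host" is a UUID instead of a host name. */
    static Properties streamsProperties(UUID instanceId) {
        Properties props = new Properties();
        // The setting expects host:port, so a dummy port is appended (assumption).
        props.setProperty("application.server", instanceId + ":1");
        return props;
    }

    /** The host part of application.server doubles as the event bus address. */
    static String eventBusAddress(String applicationServer) {
        return applicationServer.substring(0, applicationServer.lastIndexOf(':'));
    }

    public static void main(String[] args) {
        // Hypothetical in-memory stand-in for the event bus: address -> handler.
        Map<String, Function<String, String>> bus = new HashMap<>();

        UUID instanceId = UUID.randomUUID();
        Properties props = streamsProperties(instanceId);

        // The query verticle of this instance listens on its own UUID.
        bus.put(instanceId.toString(), key -> "value for " + key);

        // A facade that learned the application.server value from the cluster
        // metadata can derive the address and reach the right instance.
        String address = eventBusAddress(props.getProperty("application.server"));
        System.out.println(bus.get(address).apply("some-key"));
    }
}
```

Because the identifier travels with the regular Kafka Streams metadata, no separate registry is needed to map instances to event bus addresses.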

The query facades do not actually need to be deployed on every instance, as they’ll delegate queries to the responsible query verticle, but for simplicity, better load distribution and reduced latency, it won’t hurt to have them run on each instance as well. Facades for the same query type will share the same static address across instances as it really doesn’t matter which instance serves a request. Vert.x will prefer a local one. A query facade asks the KafkaStreams client for metadata, infers the id of the query verticle and issues a request for that verticle on the event bus. The following diagram shows the setup:

That covers the basic blocks. What’s still missing is a component that opens an interface to the outside world. While other options are conceivable, HTTP is a good start. Vert.x makes it very easy to start an HTTP server and provide a REST API. That API of course only allows GET requests because Interactive Queries are read-only. Let’s look at the communication flow for a key-value query. All communication between components uses the event bus:

As the diagram indicates, this is all as non-blocking as it can be on the server side.

The following diagram shows an overview of all the verticles that are running in a single KIQR instance:

Serialization

As we’re definitely going to have communication between JVMs and wire transfers both within the Vert.x cluster and in communication with clients, we need to think about serialization.
In Kafka, messages are little more than key-value pairs of byte arrays. Producers and consumers need to have a contract on the serialization format. This is informal – Kafka Brokers simply do not care about message contents. That’s why the Producer/Consumer API relies heavily on Serdes (Serializer/Deserializers). As we need those anyway to run Kafka Producers and Streams, we can just go on and use them for all other wire transfers as well – no need to reinvent the wheel. KIQR’s runtime will directly serialize any key or value it reads from an interactive query. The resulting bytes will then be encoded as a Base64 string. KIQR itself remains as agnostic to message contents as Kafka itself is.
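
The serialize-then-Base64 step can be illustrated without any Kafka dependency. In this sketch, plain java.util.function.Function instances stand in for the serializer and deserializer halves of a Serde; the class and method names are made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.function.Function;

final class Base64TransportSketch {

    /** Serializes a value with the given serializer and wraps the bytes in Base64. */
    static <T> String toWire(T value, Function<T, byte[]> serializer) {
        return Base64.getEncoder().encodeToString(serializer.apply(value));
    }

    /** Reverses toWire: Base64-decodes the string and hands the bytes to the deserializer. */
    static <T> T fromWire(String base64, Function<byte[], T> deserializer) {
        return deserializer.apply(Base64.getDecoder().decode(base64));
    }

    public static void main(String[] args) {
        // Plain UTF-8 conversion stands in for a String Serde here.
        String wire = toWire("key1", s -> s.getBytes(StandardCharsets.UTF_8));
        String back = fromWire(wire, bytes -> new String(bytes, StandardCharsets.UTF_8));
        System.out.println(wire + " -> " + back);
    }
}
```

The runtime in the middle only ever sees the Base64 string, so it never needs to know the actual key or value types.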

Serialization on the Vert.x event bus is a different topic altogether. For each message sent over the event bus, Vert.x must be aware of a message codec for that type – even if the message is transmitted within the same JVM. This is a safeguard as the sender is not aware if the receiver is running on the same or a different node. If it is JVM internal, it will not be serialized, but if it needs to be serialized after all, Vert.x knows what to do. KIQR uses simple POJOs that can be easily converted to JSON. Problem solved. This probably could be more efficient, but hey, early days.

Server-side example

So how can we deploy a Kafka Streams application with KIQR? The first thing you need is a Vertx object. In the simplest case without distribution, this is created by a simple Vertx vertx = Vertx.vertx();. The distributed case involves setting up a cluster manager as per the following example using Hazelcast:

By default, this uses UDP multicast as the cluster discovery mechanism. If that is not available in your environment (e.g. AWS), please check the docs.

Once we have a Vertx object, we can deploy the KIQR verticles. A streaming topology can be started like this:

This starts the streaming application with an HTTP server listening on port 4711.

REST API

KIQR supports all standard store operations available in the High Level Streams DSL as of Kafka 0.10.2.0. This is the mapping of endpoints to methods:

Key-Value queries:
- /api/v1/kv/{store}/values/{b64 encoded serialized key}?keySerde=&valueSerde=
  - Maps to org.apache.kafka.streams.state.ReadOnlyKeyValueStore#get
- /api/v1/kv/{store}?keySerde=&valueSerde=
  - Maps to org.apache.kafka.streams.state.ReadOnlyKeyValueStore#all
- /api/v1/kv/{store}?keySerde=&valueSerde=&from=&to=
  - Maps to org.apache.kafka.streams.state.ReadOnlyKeyValueStore#range
- /api/v1/kv/{store}/count
  - Maps to org.apache.kafka.streams.state.ReadOnlyKeyValueStore#approximateNumEntries
Window queries:
- /api/v1/window/{store}/{b64 encoded serialized key}?keySerde=&valueSerde=&from=&to=
  - Maps to org.apache.kafka.streams.state.ReadOnlyWindowStore#fetch
Session queries:
- /api/v1/session/{store}/{b64 encoded serialized key}?keySerde=&valueSerde=
  - Maps to org.apache.kafka.streams.state.ReadOnlySessionStore#fetch
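
As an illustration of how a caller might assemble the URL for the single-key endpoint, here is a sketch using only the JDK. Two things are assumptions that the endpoint list does not state: that the serde parameters take fully qualified class names, and that the Base64 key is percent-encoded before it is placed in the path. The host, store name and helper names are made up.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class KeyValueQueryUrlSketch {

    /** Percent-encodes a URL component (assumption: the server decodes it again). */
    static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    /** Builds the URL for /api/v1/kv/{store}/values/{key} from an already serialized key. */
    static String scalarQueryUrl(String baseUrl, String store, byte[] serializedKey,
                                 String keySerde, String valueSerde) {
        String b64Key = Base64.getEncoder().encodeToString(serializedKey);
        return baseUrl + "/api/v1/kv/" + enc(store) + "/values/" + enc(b64Key)
                + "?keySerde=" + enc(keySerde) + "&valueSerde=" + enc(valueSerde);
    }

    public static void main(String[] args) {
        String url = scalarQueryUrl(
                "http://localhost:4711",   // port taken from the server-side example
                "counts",                  // hypothetical store name
                "key1".getBytes(StandardCharsets.UTF_8),
                "org.apache.kafka.common.serialization.Serdes$StringSerde",
                "org.apache.kafka.common.serialization.Serdes$LongSerde");
        System.out.println(url);
    }
}
```

Building these URLs by hand gets tedious quickly, which is the motivation for the dedicated clients described in the next section.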

Clients

You can use the REST API with any client of course, but its URIs contain Base64 encoded serialized keys and the responses also contain serialized values, so a client that handles all that serialization and deserialization sounded like a good idea. The first draft of KIQR contains a REST client based on Apache HttpComponents. The list of dependencies is intentionally kept simple and is restricted to

- Fluent-HC from HttpComponents
- Jackson for a bit of JSON handling
- Kafka Streams (for the Serde interface and the default Serdes)

Plus transitive dependencies, of course. The clients are blocking for the moment, which marks a bit of a step back from all this non-blocking Vert.x code. But non-blocking clients are definitely on the road map. The clients are written in a way that lets you use the actual types of your keys and values. They use the provided Serdes to handle wire transfers.

There is a generic client whose parameters map closely to the REST API:

There is also a specific client that lets you set types, serdes and store name once in the constructor so you don’t have to bother with them each time:

This API is probably more enjoyable to use.

Caveats and restrictions

As mentioned earlier, KIQR is a hobby project. It has not been used in any real-world scenario so far. Some other caveats and restrictions are:

- not very well integration-tested yet, especially not for high volumes
- not highly available in the sense that when the streams app is rebalancing, we cannot execute queries
- no streaming of large results – if you query too much data, you’ll get large results and might run into timeouts
- highly unstable API and implementation, things will change
- you are responsible for knowing the names of the state stores and the types of your keys and values in Kafka; there is no way to infer them at runtime
- Java 8 and Kafka Streams 0.10.2 required

Conclusion & resources

I had a lot of fun building this proof of concept and learned a lot about Vert.x and Interactive Queries on the way. I’d be very happy for feedback.

- Confluent’s introductory blog for interactive queries
- KIQR source code
- Confluent’s reference implementation
- My article on Interactive Queries

Blog author

Florian Troßbach

Senior IT Consultant
