Spark Streaming, Flink, Storm, Kafka Streams: these are only the most popular candidates in an ever-growing range of frameworks for processing streaming data at scale. This article covers the main concepts behind these frameworks and then briefly classifies three Apache projects: Spark Streaming, Flink, and Kafka Streams.
Why Stream Processing?
The processing of streaming data is gaining in importance due to the steadily growing number of data sources that continuously produce and offer data. In addition to the omnipresent Internet of Things, these include, for example, click streams, data in the advertising business, as well as device and server logs.
Infinite, continuous data is not a new phenomenon. Much existing data already fits this pattern. For example, changes to master data occur continuously, but only at low frequency. Master data is processed according to the classic request/response pattern. For changes that are not time-critical, or for larger volumes, the data is often collected first and then processed periodically by batch jobs. These run, for example, every night or at shorter intervals.
However, daily intervals are often not sufficient. Speed is needed: analyses and evaluations are expected promptly, not minutes or even hours later. This is where stream processing comes into play: data is processed as soon as it becomes known to the system. This started with the lambda architecture, in which stream and batch processing run in parallel because stream processing alone could not guarantee consistent results. With today’s systems, it is also possible to achieve consistent results in near real time with stream processing only.
An important aspect of streaming is time. Essentially, three notions of time can be distinguished:
- Event time: Time at which the event actually occurred
- Ingestion time: Time at which the event was observed in the system
- Processing time: Time at which the event was processed by the system
Figure 1: Example of event time versus processing time, with late events (yellow, green, red) and out-of-order events (blue).
In practice, event time is usually more interesting than ingestion or processing time. The difference between event time and processing time can vary greatly, for many reasons: network latency, distributed systems, hardware failures, or irregular data delivery. When processing by processing time, this difference does not matter. The data is analyzed based on the system time of the processor: if an event arrives at 12 o’clock, it is irrelevant that it actually occurred at 11 o’clock.
But this is not the typical use case: if an event occurred at 11 o’clock, I would like to treat it according to the time at which it occurred. The questions then are: When do I know that I have received all events up to 11 o’clock? How long do I wait for late events? There are several strategies and concepts for solving these problems. One of them is the Dataflow/Beam Model, in which concepts such as watermarks, triggers, and accumulation help:
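The difference between the two notions of time can be made concrete with a minimal sketch. It is plain Python rather than any framework's API, and the `Event` class, the `count_per_hour` helper, and the sample timestamps are all hypothetical. One event occurs at 11:50 but is only processed at 12:10, so it lands in a different hourly bucket depending on which timestamp is used.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    name: str
    event_time: datetime       # when the event actually occurred
    processing_time: datetime  # when the system processed it


def count_per_hour(events, time_field):
    """Count events per hour, bucketed by the chosen timestamp field."""
    counts = Counter()
    for event in events:
        counts[getattr(event, time_field).hour] += 1
    return dict(counts)


events = [
    Event("click-a", datetime(2017, 1, 1, 11, 5), datetime(2017, 1, 1, 11, 5)),
    # Occurred at 11:50 but was delayed and processed only at 12:10.
    Event("click-b", datetime(2017, 1, 1, 11, 50), datetime(2017, 1, 1, 12, 10)),
    Event("click-c", datetime(2017, 1, 1, 12, 20), datetime(2017, 1, 1, 12, 21)),
]

by_event_time = count_per_hour(events, "event_time")            # {11: 2, 12: 1}
by_processing_time = count_per_hour(events, "processing_time")  # {11: 1, 12: 2}
```

Bucketing by processing time is trivial because the system clock only moves forward, but it attributes the delayed click to the wrong hour.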
- Watermarks: When have I collected all the data?
- Triggers: When should I trigger the calculation?
- Accumulation: How do I merge individual calculations, for example when data arrives after a result has already been emitted?
Each of these three concepts could easily fill an article of its own. Tyler Akidau, one of the leading minds behind streaming at Google, has already summarized them, so his article is recommended for details.
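As a rough illustration of how these pieces interact, the following sketch uses a simple heuristic watermark: the largest event time seen so far minus a fixed allowance for out-of-order data. It is plain Python, not the Beam or Flink API. The constants, the `process` function, and the sample data are assumptions made for this example. A window fires (the trigger) once the watermark passes its end. Events that arrive after their window has fired are only collected here; a real system would also need an accumulation strategy to decide whether to discard them or emit a corrected result.

```python
from collections import defaultdict

WINDOW_SIZE = 10      # seconds per window (hypothetical value)
MAX_OUT_OF_ORDER = 5  # seconds the watermark lags behind the newest event


def process(events):
    """Sum values per event-time window and fire windows by watermark.

    events: iterable of (event_time, value) pairs in arrival order.
    Returns (emitted, late): emitted holds (window_start, sum) in firing
    order, late holds events whose window had already fired.
    """
    open_windows = defaultdict(int)
    fired = set()
    emitted, late = [], []
    max_event_time = float("-inf")

    for event_time, value in events:
        window_start = event_time - event_time % WINDOW_SIZE
        if window_start in fired:
            late.append((event_time, value))
            continue
        open_windows[window_start] += value
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - MAX_OUT_OF_ORDER
        # Trigger: fire every window that ends at or before the watermark.
        for start in sorted(open_windows):
            if start + WINDOW_SIZE <= watermark:
                emitted.append((start, open_windows.pop(start)))
                fired.add(start)
    return emitted, late


arrivals = [(1, 1), (4, 2), (12, 3), (8, 4), (16, 5), (3, 6), (27, 7)]
emitted, late = process(arrivals)
# emitted == [(0, 7), (10, 8)], late == [(3, 6)]
```

The event at time 8 arrives out of order but still before the watermark passes 10, so it is counted. The event at time 3 arrives after window 0 has fired and is therefore late.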
State & Window
Any non-trivial application will correlate incoming events with each other. This requires state in which previous events are stored temporarily. This state can be kept indefinitely or explicitly limited in time. An example of indefinitely kept state is a lookup table with metadata. An example of time-limited state is a window.
A window is used to aggregate and analyze data for a specific period of time. This is necessary in almost every application, since the data stream never ends. There are different types of windows:
- Tumbling Window: Non-overlapping, fixed time segments
- Sliding Window: Overlapping, fixed time segments
- Session Window: Non-overlapping time segments of varying length, delimited by certain events or by a minimum period of inactivity between two events
Figure 2: Tumbling and sliding windows with a window size of 4 seconds; the sliding window additionally uses a slide interval of 2 seconds. Within each window the values are summed.
Figure 3: Session windows with an inactivity gap of at least two minutes between two events for a key.
When defining windows, the distinction between event time and processing time is important: windows based on processing time are very simple to implement, whereas windows based on event time need the event-time strategies described above so that they do not stay open, and grow, indefinitely.
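The two window types with these parameters can be sketched in a few lines. This is plain Python over a finite list, not a framework API; the function names and the sample events are hypothetical. Timestamps are whole seconds and each window covers the half-open range from its start up to, but not including, start plus size.

```python
from collections import defaultdict


def tumbling_sums(events, size):
    """Sum values per non-overlapping window [start, start + size)."""
    sums = defaultdict(int)
    for timestamp, value in events:
        sums[timestamp - timestamp % size] += value
    return dict(sorted(sums.items()))


def sliding_sums(events, size, slide):
    """Sum values per overlapping window; one window starts every `slide` seconds."""
    sums = defaultdict(int)
    for timestamp, value in events:
        # Begin with the latest window start <= timestamp, then step back
        # for as long as the earlier window still covers the timestamp.
        start = timestamp - timestamp % slide
        while start > timestamp - size:
            sums[start] += value
            start -= slide
    return dict(sorted(sums.items()))


events = [(4, 1), (5, 2), (7, 3), (8, 4), (10, 5), (11, 6)]  # (second, value)

tumbling = tumbling_sums(events, size=4)
# {4: 6, 8: 15}
sliding = sliding_sums(events, size=4, slide=2)
# {2: 3, 4: 6, 6: 7, 8: 15, 10: 11}
```

With a size of 4 and a slide of 2, every event contributes to exactly two sliding windows, which is why the sliding sums add up to twice the tumbling sums.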
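Session windows can be sketched in a similar way. The example below is a simplified batch-style illustration in plain Python: it sorts all timestamps per key first, which a real streaming system cannot do. The `session_windows` function and the sample events are hypothetical; the two-minute gap follows the figure.

```python
from collections import defaultdict

SESSION_GAP = 120  # seconds of inactivity that close a session


def session_windows(events, gap=SESSION_GAP):
    """Group (key, timestamp) events into per-key sessions.

    Returns {key: [(session_start, session_end, event_count), ...]}.
    """
    by_key = defaultdict(list)
    for key, timestamp in events:
        by_key[key].append(timestamp)

    sessions = {}
    for key, timestamps in by_key.items():
        timestamps.sort()
        result = []
        start = end = timestamps[0]
        count = 1
        for ts in timestamps[1:]:
            if ts - end >= gap:
                # Inactivity of at least `gap` seconds: close the session.
                result.append((start, end, count))
                start, count = ts, 0
            end = ts
            count += 1
        result.append((start, end, count))
        sessions[key] = result
    return sessions


events = [("a", 0), ("a", 60), ("b", 30), ("a", 200), ("a", 250), ("b", 150)]
sessions = session_windows(events)
# {"a": [(0, 60, 2), (200, 250, 2)], "b": [(30, 30, 1), (150, 150, 1)]}
```

Unlike tumbling and sliding windows, the boundaries are not known in advance: they depend on the data for each key.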
API & Runtime Environment
The first differences between the frameworks can be found in the API and the general processing model. Here one differentiates between a native streaming approach and microbatching. In native streaming, incoming data is processed directly, while microbatching collects the incoming data for a certain time (typically 1 to 30 seconds) and then processes it together. The next microbatch can then be started either directly after the completion of the previous batch, or only after the fixed interval has elapsed. In both cases, microbatching increases latency, but error handling is somewhat easier. The very high throughput that is frequently cited as an advantage of microbatching can now also be achieved by native streaming frameworks, which additionally offer more flexibility for windows and state.
What the developer mainly sees is the API. Here, too, a distinction can be made between two variants: a component-based API and a declarative, high-level API. With the former, the flow is described by several components (source -> processing 1 -> processing 2 -> sink); the latter describes operations on the data (map, filter, reduce), similar to Scala collections or Java 8 streams. Describing components provides more flexibility in the distribution of data streams, while a declarative API often already provides higher-order functions and automatic optimization.
Finally, the question is: Where do the applications run? One can distinguish between two (surprise 🙂) basic alternatives. Some frameworks need a dedicated cluster consisting of master nodes and worker nodes. These clusters then also deal with resource management and error handling, but can also hand this over to other tools (for example, YARN or Mesos). Other frameworks come as a simple library that can be integrated into your own application. Running and scaling the application must then be handled by other tools. Here you have full flexibility, from running a plain JAR file, through Docker, up to Mesos or YARN.
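The difference between the two processing models can be sketched as follows. This is a simplified simulation in plain Python over a finite list of events, not how any of the frameworks is implemented; the function names, the `total` handler, and the five-second interval are assumptions chosen for illustration.

```python
def native_streaming(events, handle):
    """Process every event on its own, as soon as it arrives."""
    return [handle([event]) for event in events]


def micro_batching(events, handle, interval):
    """Collect events per fixed interval and process each batch together.

    events: iterable of (arrival_time, value) pairs, ordered by arrival time.
    """
    results, batch, batch_end = [], [], None
    for arrival_time, value in events:
        if batch_end is None:
            batch_end = arrival_time - arrival_time % interval + interval
        # Close the current batch (and skip empty intervals) before
        # accepting an event that belongs to a later interval.
        while arrival_time >= batch_end:
            if batch:
                results.append(handle(batch))
                batch = []
            batch_end += interval
        batch.append((arrival_time, value))
    if batch:
        results.append(handle(batch))
    return results


def total(batch):
    """Hypothetical handler: sum the values of all events in the batch."""
    return sum(value for _, value in batch)


events = [(0, 1), (1, 2), (5, 3), (6, 4), (12, 5)]  # (arrival second, value)

per_event = native_streaming(events, total)
# [1, 2, 3, 4, 5]
per_batch = micro_batching(events, total, interval=5)
# [3, 7, 5]
```

In the microbatch variant, the event arriving at second 0 is not handled until the batch closes at second 5, which is where the additional latency comes from.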
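To make the contrast tangible, here is the same small computation written in both styles. It is a toy sketch in plain Python with hypothetical class names, not the API of any of the frameworks discussed: the declarative variant chains filter, map, and reduce, while the component variant wires explicit processing steps into a topology.

```python
from functools import reduce

readings = [3, 8, 15, 4, 23]

# Declarative style: describe the operations on the data.
declarative_total = reduce(
    lambda acc, x: acc + x,
    map(lambda x: x * 2, filter(lambda x: x > 5, readings)),
    0,
)


# Component style: wire named processing steps into a topology.
class Component:
    def __init__(self, downstream=None):
        self.downstream = downstream

    def emit(self, value):
        if self.downstream is not None:
            self.downstream.receive(value)


class FilterLarge(Component):
    def receive(self, value):
        if value > 5:
            self.emit(value)


class Double(Component):
    def receive(self, value):
        self.emit(value * 2)


class SumSink(Component):
    def __init__(self):
        super().__init__()
        self.total = 0

    def receive(self, value):
        self.total += value


sink = SumSink()
topology = FilterLarge(Double(sink))  # source -> filter -> double -> sink
for reading in readings:
    topology.receive(reading)

# Both variants compute 92.
```

The component variant is more verbose, but each step is an object that could be placed, scaled, or rewired independently. The declarative variant leaves those decisions to the framework.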
Distributed systems are unreliable!
All three frameworks specialize in processing large amounts of data and achieve this through horizontal scaling. Such distributed systems are inherently unreliable: single nodes can fail, the network is unreliable, or the database to which the results are to be written is unavailable.
For this reason, each framework has different mechanisms for achieving certain guarantees. These range from microbatching, in which small batches are repeated, through acknowledgments for individual records, to transactional updates on source and sink. The guarantees achieved are then usually at-least-once or exactly-once. Since exactly-once is often difficult to achieve, at-least-once guarantees combined with idempotent operations are often sufficient in terms of both speed and fault tolerance.
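Why idempotent operations make at-least-once delivery acceptable can be shown with a small simulation. It is plain Python, and the message format, the `deliver_at_least_once` generator, and the simulated lost acknowledgment are all hypothetical. A message whose acknowledgment is lost is delivered a second time: a plain running total counts it twice, while an upsert keyed by message ID is unaffected.

```python
def deliver_at_least_once(messages, lost_acks):
    """Yield messages; those whose acknowledgment is 'lost' are delivered twice."""
    for message in messages:
        yield message
        if message["id"] in lost_acks:
            yield message  # redelivery after the missing acknowledgment


messages = [
    {"id": "m1", "amount": 10},
    {"id": "m2", "amount": 20},
    {"id": "m3", "amount": 30},
]

naive_total = 0  # non-idempotent: every delivery is added again
store = {}       # idempotent: writing the same key twice changes nothing

for message in deliver_at_least_once(messages, lost_acks={"m2"}):
    naive_total += message["amount"]
    store[message["id"]] = message["amount"]

idempotent_total = sum(store.values())
# naive_total == 80 (m2 counted twice), idempotent_total == 60
```

The idempotent variant requires a stable identifier for each record and a sink that supports keyed writes, which is an assumption that does not hold for every system.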
Isn’t there something that can help us?
Time handling, state and windows, a runtime environment, and all of it distributed: streaming applications are complex. There are a number of projects that help with these problems. Three of them are briefly presented here:
Apache Spark (Streaming)
Apache Spark is currently one of the most popular projects in the streaming field. It started as a better MapReduce, and support for streaming data was added later. Spark Streaming relies on microbatching with a declarative API. At the moment, only processing time is fully supported, but with the new Structured Streaming API, support for event-time processing has been gradually expanded since version 2.0. The same is true for window support. State is stored locally in memory or on disk and is regularly backed up by checkpointing. Since Spark now ships with every major Hadoop distribution, it is very widely deployed. There is also a large ecosystem with many tools and connectors.
Apache Flink
When it comes to event-time processing, Apache Flink is currently the first choice. Watermarks and triggers are supported, as are different window operations. Flink pursues a native streaming approach and thus achieves low latencies. Like Spark Streaming, it offers a declarative API, with the option of using so-called rich functions in which, for example, state can be used. Unlike in Spark, the state backend can be chosen from different implementations: in-memory, hard disk, or RocksDB. Flink is slightly younger than Spark but is gaining in popularity. Likewise, the community and the ecosystem are growing steadily, but are not yet as big as Spark's.
Apache Kafka Streams
The streaming framework from the Kafka ecosystem is the newest entrant in this overview. It builds on many concepts already contained in Kafka, such as scaling by partitioning topics. Partly for this reason, it comes as a lightweight library that can be integrated into an application. The application can then be operated as desired: standalone, in an application server, as a Docker container, or via a resource manager such as Mesos. Flink and Spark, on the other hand, always need a cluster, built either with the frameworks' own tooling or with YARN or Mesos. Kafka Streams, however, is limited to Kafka as both source and sink. But a Kafka topic can be connected to other systems through Kafka Connect, with over 60 available connectors. Apart from a declarative API, Kafka Streams also has a component-oriented API, rudimentary support for event time, and RocksDB as a state implementation. While Kafka itself is already very mature and often used in combination with Flink and Spark, the streaming component is still quite young, so its community and adoption are rather small. It is to be expected, however, that both will grow rapidly.
It should be noted that Kafka Streams does not use the concepts of the Beam Model to tackle the challenges of event-time processing. Kafka Streams is built on the concepts of KStreams and KTables, which it uses to provide event-time processing.
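The idea behind the two abstractions can be illustrated without the Kafka Streams API itself. In the following plain-Python sketch, with hypothetical function names and sample records, the same sequence of keyed records is read once as a record stream, where every record is an independent event, and once as a table, where a later record for a key replaces the earlier one.

```python
records = [("alice", 1), ("bob", 5), ("alice", 3)]  # (key, value) in arrival order


def sum_as_stream(records):
    """Stream view: every record is an independent event, so values add up."""
    totals = {}
    for key, value in records:
        totals[key] = totals.get(key, 0) + value
    return totals


def latest_as_table(records):
    """Table view: every record is an update that replaces the previous value."""
    table = {}
    for key, value in records:
        table[key] = value
    return table


stream_view = sum_as_stream(records)
# {"alice": 4, "bob": 5}
table_view = latest_as_table(records)
# {"alice": 3, "bob": 5}
```

Under the table view, a record that arrives late is simply one more update to the current value for its key, which hints at why this model handles corrections without explicit watermarks. This is a conceptual sketch only and says nothing about how Kafka Streams implements either abstraction.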
And what suits me?
Finally, the question is: Which framework suits me? If event-time processing is required and you do not mind working with the concepts of the Beam Model, you could go with Apache Flink. Another advantage is the low latency. The most important systems (Kafka, Cassandra, Elasticsearch, SQL databases) can be integrated relatively easily.
The low latency and easy-to-use event-time support also apply to Kafka Streams. So if Kafka is already in use and the processing is rather simple, without complex requirements for event processing (although Kafka Streams can also be used for more complex stream processing), Kafka Streams is a good alternative. In that case, other systems such as databases have to be connected via Kafka Connect, and you have to take care of the runtime environment yourself. This can also be an advantage if existing tools can be used, for example from the Docker ecosystem.
And Spark? If event time is not relevant and latencies in the range of seconds are acceptable, Spark is the first choice. It is stable, and almost any type of system can be integrated easily. In addition, it comes with every Hadoop distribution. Furthermore, the code used for batch applications can be reused for streaming applications because the API is the same.
Spark can cause problems only with very large state. Support for event time is being expanded in Spark 2.1.
Stream processing frameworks significantly simplify the processing of large amounts of data. The frameworks presented here primarily solve problems of distributed processing, which makes it possible to develop solutions that scale easily. Equally important are the different aspects of time handling, which all of the frameworks support in some way.
That is what distinguishes these systems from libraries such as Akka Streams, RxJava, or Vert.x. The frameworks presented here are mainly located in the Big Data and Fast Data area, while the libraries can also be used to build smaller, more reactive applications, but usually without native support for event time and clustering.
It remains to be noted that the frameworks presented here can all help with current challenges in the Fast Data area and also support new architectures beyond the well-known lambda architecture. However, the complexity of these distributed systems should not be underestimated. Nevertheless, it can be assumed that both the adoption and the functionality of these systems will continue to grow.