I have been building streaming data platforms for the past decade. Streaming data platforms have many benefits [13, 9, 10, 8], and they are often the most natural choice for organizing data pipelines.
I will describe the basic structure of a streaming data platform, the technology choices, and common challenges. I am only writing down short opinions to explain my personal preferences; please refer to the original documentation for more details.
A streaming data platform has two basic parts: the data pipe and the processor. The data pipe may be called a message queue, an event bus, a distributed log, etc. It is the part of the system that stores the event data. Processors sit between different instances of the data pipe, connecting the pipes to form the topology of the streaming platform.
Data Pipe
The data pipe has to be rock solid. Any issue with this component triggers alerts across the whole system and can lead to irreversible data loss. In my experience, interacting with the data pipe's APIs is straightforward; the most common challenges center on cluster administration. Here are some questions to ask:
- How do I maintain configurations of my cluster?
- How do I scale up and down the cluster?
- How do I upgrade my cluster?
- How do I migrate my cluster to a new environment?
- How do I mitigate data center failures?
Here are some common technology choices:
- Kafka: It is by far the most popular and most mature solution.
- Pulsar: It could be an alternative. One of its most compelling features is a Presto connector that works with Pulsar's storage backend, so one can set up Presto to query the streaming data directly [4].
- Kinesis: It is a fully managed solution on AWS.
It should be noted that once the clusters and their usage grow, data center failures become a legitimate concern. Federation is one way to add flexibility to cluster administration; see [11] for an example. However, adding that capability is a substantial effort, likely requiring a whole team to build and maintain it.
Deploying data pipe clusters on Kubernetes can have many operational benefits. It can unify the configuration of data-heavy clusters with application deployments, reducing the complexity of the team's overall devops process. The operator approach on k8s is increasingly popular, and I have had good experiences using [1] for medium-sized Kafka clusters. But the hesitation about going all-in with Kafka on k8s is, again, administration! One would have to think through the possibilities of cluster migration and cluster federation, and k8s operators might not be flexible enough.
Processor
The core of a streaming data platform is its data processors. Processors roughly fall into three categories: sources, internal processors, and sinks. The hardest processor features to design properly are exactly-once processing, state management, and system-level monitoring.
Achieving exactly-once semantics is hard because the applications have to be able to recover from a variety of failure scenarios. Kafka features such as the idempotent producer and transactions have made building exactly-once semantics a lot easier (see [12]). When processors have state, offset management and state management have to work together: usually, the state has to be backed up whenever the offset is committed.
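To make this concrete, here is a minimal consume-transform-produce sketch on top of Kafka's transaction APIs. The broker address, topic names, and group/transactional ids are assumptions for illustration; the point is that the output records and the input offsets are committed atomically in one transaction.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;

public class ExactlyOnceLoop {
    public static void main(String[] args) {
        Properties cp = new Properties();
        cp.put("bootstrap.servers", "localhost:9092");  // assumed broker address
        cp.put("group.id", "demo-processor");           // hypothetical group id
        cp.put("enable.auto.commit", "false");          // offsets are committed inside the transaction
        cp.put("isolation.level", "read_committed");    // only read output of committed transactions
        cp.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        cp.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        Properties pp = new Properties();
        pp.put("bootstrap.servers", "localhost:9092");
        pp.put("transactional.id", "demo-processor-0"); // stable id fences off zombie instances
        pp.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        pp.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cp);
             KafkaProducer<String, String> producer = new KafkaProducer<>(pp)) {
            consumer.subscribe(List.of("input-topic"));
            producer.initTransactions();
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                if (records.isEmpty()) continue;
                producer.beginTransaction();
                try {
                    Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                    for (ConsumerRecord<String, String> r : records) {
                        // The transform step; here it just upper-cases the value.
                        producer.send(new ProducerRecord<>("output-topic", r.key(), r.value().toUpperCase()));
                        offsets.put(new TopicPartition(r.topic(), r.partition()),
                                    new OffsetAndMetadata(r.offset() + 1));
                    }
                    // Input offsets and output records become visible atomically, or not at all.
                    producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
                    producer.commitTransaction();
                } catch (Exception e) {
                    producer.abortTransaction();
                    // A real processor would also rewind the consumer to the last committed
                    // offsets (and restore any backed-up state) before retrying.
                }
            }
        }
    }
}
```

When the processor also keeps local state, the state snapshot has to be tied to the same commit point, which is exactly the coupling described above.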
There are a few approaches I would consider for building these processors on Kafka (the sketch above is an instance of the first):
- Rely on Kafka primitives
- Build on top of Kafka Streams; see [3] and the sketch after this list
- Build on top of Flink; see [6] and [7]
- Build on top of Spark Streaming; see [5]
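For comparison, here is roughly the same transform as a minimal Kafka Streams sketch; the topic names and application id are again assumptions. A single configuration flag turns on the transactional machinery that the hand-rolled loop managed explicitly:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class StreamsLoop {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "demo-streams");      // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        // One flag enables idempotent producers, transactions, and offset/state coordination.
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);

        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("input-topic")
               .mapValues(v -> v.toUpperCase()) // the transform step
               .to("output-topic");

        new KafkaStreams(builder.build(), props).start();
    }
}
```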
Monitoring is often overlooked, and it is hard to generalize. The challenge is to have a watermark system in place that reports the progress of the whole processing pipeline, and this requirement varies from product to product. For example, suppose a streaming system takes stock ticker prices as input events, there are 5 processors, and the last processor outputs a real-time buy or sell signal. The monitoring question we would ask could be: how timely is this signal? The definition of "timely" can vary a lot depending on how the signal is calculated.

Watermark is a generic term for whatever information is needed to allow downstream applications to make timely decisions. In many cases, this information cannot be burned into the messages themselves; examples of such information are the failure rates of internal processors, the number of unprocessed events, late-arriving events, etc. I have not seen any framework that addresses this need. One common pattern for building a system monitor is to use yet another data streaming pipeline. This monitoring pipeline has to be dead simple, and its failures must be independent from the system it is monitoring.
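As one small building block for such a monitor, the sketch below reads the number of unprocessed events (consumer lag) per partition via Kafka's AdminClient; the broker address and the monitored group id are assumptions:

```java
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class LagProbe {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "lag-probe");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        try (AdminClient admin = AdminClient.create(props);
             KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            // Offsets the monitored processor group has committed so far.
            Map<TopicPartition, OffsetAndMetadata> committed =
                admin.listConsumerGroupOffsets("demo-processor") // hypothetical group id
                     .partitionsToOffsetAndMetadata().get();
            // Latest offsets on the same partitions.
            Map<TopicPartition, Long> end = consumer.endOffsets(committed.keySet());
            for (TopicPartition tp : committed.keySet()) {
                long lag = end.get(tp) - committed.get(tp).offset();
                System.out.printf("%s unprocessed events: %d%n", tp, lag);
            }
        }
    }
}
```

Lag alone is not a watermark, but feeding numbers like these into a dead-simple side pipeline is a workable starting point.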
Source Processor
A source processor receives external traffic or processes data from an object store (e.g., S3), and then writes the data into the pipe. A source processor differs from internal and sink processors in that it cannot use the offset marker in the data pipe to track its progress and accomplish exactly-once semantics. For example, if the data pipe is Kafka, the hard problem for a periodic batch job producing into Kafka is defining job uniqueness and recovering from job failures, and the hard problem for edge servers persisting events into Kafka is coordinating the timing of Kafka's receipt acknowledgement with the response to the client request.
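To make the edge-server timing problem concrete, here is a minimal sketch using the JDK's built-in HTTP server, an assumed local broker, and a hypothetical events topic; the client gets its response only after Kafka acknowledges the write:

```java
import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EdgeIngest {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("acks", "all");                         // wait for full replication before acking
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);

        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/events", exchange -> {
            String body = new String(exchange.getRequestBody().readAllBytes(), StandardCharsets.UTF_8);
            // Answer the client only in the producer callback: acking before Kafka has the
            // event risks losing data the client believes was accepted.
            producer.send(new ProducerRecord<>("events", body), (metadata, e) -> {
                try {
                    exchange.sendResponseHeaders(e == null ? 204 : 500, -1);
                } catch (Exception ignored) {
                } finally {
                    exchange.close();
                }
            });
        });
        server.start();
    }
}
```

A production version would respond off the producer's I/O thread, but the ordering constraint is the same.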
Sink Processor
A sink processor takes data from the data pipe and loads it into a storage engine. This component is not difficult to write, but it invariably feels like a painful step in building a data platform. That is partly because it is a common feature that every system needs, and yet we have to build it from scratch by hand. It is also partly because we have to write this component many times to target different storage engines, depending on how the data is used later on.
In my experience, this component is usually handcrafted on top of the same framework as the other internal processors. For example, if the other internal processors use Kafka primitives directly, the sinks are built similarly; if the platform already uses Flink, the sinks are written in Flink.
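The typical shape of such a handcrafted sink, sketched on Kafka primitives with a stubbed-out storage client: the offset is committed only after the storage write succeeds, giving at-least-once delivery, and an idempotent upsert makes the inevitable replays safe.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SimpleSink {
    // Hypothetical storage client; swap in the target engine's batch-write API.
    interface Store { void upsertBatch(List<String> rows); }

    public static void main(String[] args) {
        Store store = rows -> System.out.println("wrote " + rows.size() + " rows"); // stub
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "sink");
        props.put("enable.auto.commit", "false");         // commit only after the write succeeds
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("output-topic")); // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                if (records.isEmpty()) continue;
                List<String> rows = new ArrayList<>();
                for (ConsumerRecord<String, String> r : records) rows.add(r.value());
                store.upsertBatch(rows); // if this throws, the offset stays put and we replay
                consumer.commitSync();   // a crash before this line causes a (safe) replay
            }
        }
    }
}
```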
I would definitely hope to see a mature solution targeting the most common storage engines. Kafka Connect is a possibility [2], but it is tied into the Confluent platform, is not sufficiently flexible to be composable, and cannot be easily integrated into custom-built processors.
Materializing the Data Pipe
One of the pain points of a streaming platform is getting insight into in-flight data. One common approach is to build many sink processors targeting the intermediate stages, loading the data into an analytics database (see my post on analytics databases). This approach is tedious and leads to more applications to maintain.
Another way is to write a processor that exposes materialized views. In the past, we had to write the data models and the processing logic for these views ourselves, but in recent years we are lucky to have frameworks that help us write this component.
Additionally, this strategy is a great alternative to using a real-time query engine to serve user-facing queries.
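As a sketch of what such a processor can look like with Kafka Streams: the prices topic, the store name, and the application id are assumptions. The local state store is the materialized view, and user-facing reads can be served from it directly.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class PriceView {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "price-view");        // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Materialize the latest value per key from the topic into a local, queryable store.
        builder.table("prices",
            Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("prices-store"));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Thread.sleep(5_000); // crude wait for RUNNING; use a state listener in practice

        ReadOnlyKeyValueStore<String, String> view = streams.store(
            StoreQueryParameters.fromNameAndType("prices-store", QueryableStoreTypes.keyValueStore()));
        System.out.println("latest AAPL price: " + view.get("AAPL"));
    }
}
```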
Citations
1. Banzai Cloud Kafka Operator. 2020. URL: https://github.com/banzaicloud/kafka-operator.
2. Kafka Connect. 2020. URL: https://docs.confluent.io/platform/current/connect/index.htm.
3. Kafka Streams Architecture. 2020. URL: https://kafka.apache.org/27/documentation/streams/architecture.
4. Pulsar SQL Overview. 2020. URL: https://pulsar.apache.org/docs/en/sql-overview/.
5. Spark Streaming. 2020. URL: https://spark.apache.org/docs/latest/streaming-programming-guide.html.
6. Stateful Stream Processing. 2020. URL: https://ci.apache.org/projects/flink/flink-docs-stable/concepts/stateful-stream-processing.html.
7. Timely Stream Processing. 2020. URL: https://ci.apache.org/projects/flink/flink-docs-stable/concepts/timely-stream-processing.html.
8. Akidau, Tyler. Streaming 101: The world beyond batch. 2015. URL: https://www.oreilly.com/radar/the-world-beyond-batch-streaming-101.
9. Fowler, Martin. Focusing on Events. 2006. URL: https://www.martinfowler.com/eaaDev/EventNarrative.html.
10. Fowler, Martin. What do you mean by "Event-Driven"? 2017. URL: https://martinfowler.com/articles/201701-event-driven.html.
11. Fu, Yupeng and Dong, Xiaoman. Kafka Cluster Federation at Uber. 2019. URL: https://www.confluent.io/kafka-summit-san-francisco-2019/kafka-cluster-federation-at-uber/.
12. Gustafson, Jason, Junqueira, Flavio Paiva, Mehta, Apurva, Subramanian, Sriram, and Wang, Guozhang. KIP-98 - Exactly Once Delivery and Transactional Messaging. 2019. URL: https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging.
13. Kreps, Jay. The Log: What every software engineer should know about real-time data's unifying abstraction. 2013. URL: https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying.