Skip to content

Kafka Streams vs Flink — When to Use Each

TL;DR

Kafka Streams is a library you embed in your app for simple Kafka-to-Kafka processing; Flink is a distributed framework you deploy as a separate cluster for complex, multi-source stream processing -- pick based on operational appetite, not features.


What It Is

Kafka Connect

Stream processing means transforming data as it arrives, rather than batching it and processing later. Two tools dominate this space, and they solve different problems.

Kafka Streams is a Java library. You add it as a dependency. It runs inside your application process. No separate cluster. No resource manager. No job scheduler. Just a JAR file that reads from Kafka, processes data, and writes back to Kafka.

Apache Flink is a distributed stream processing framework. It runs on its own cluster with a JobManager, TaskManagers, and state backends. It processes data from Kafka, databases, files, sockets -- anything. It has its own runtime, memory management, and fault tolerance.

The fundamental difference: Kafka Streams is a library you add to your code. Flink is infrastructure you operate.


Kafka Streams: The Library Approach

Architecture

Your Application (JVM process)
  ├── Your business logic
  ├── Kafka Streams library
  │     ├── Stream threads (configurable count)
  │     ├── State stores (local RocksDB)
  │     └── Changelog topics (for state recovery)
  └── Standard deployment (Kubernetes, ECS, bare metal)

No external dependencies beyond Kafka itself. Deploy it like any other microservice. Scale it by running more instances. Each instance gets assigned a subset of partitions.

Code Example

StreamsBuilder builder = new StreamsBuilder();

KStream<String, Order> orders = builder.stream("orders");

KTable<String, Long> orderCounts = orders
    .filter((key, order) -> order.getStatus().equals("COMPLETED"))
    .groupByKey()
    .count();

orderCounts.toStream().to("order-counts");

KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();

That's it. No cluster setup. No YARN or Kubernetes operator. The KafkaStreams object manages threads, state, and partition assignment internally.

KTable: The Stateful Primitive

A KTable is a changelog stream materialized as a table. Each new message with the same key replaces the previous value. Think of it as a continuously updated database table.

// KStream: every event matters (append-only)
// user-123: {"action": "login"}
// user-123: {"action": "view_page"}
// user-123: {"action": "logout"}
// → 3 records

// KTable: only latest value per key matters
// user-123: {"name": "Alice", "status": "active"}
// user-123: {"name": "Alice", "status": "inactive"}
// → 1 record (latest value)

KTable enables stream-table joins. Enrich a stream of orders with the latest user profile data. The user profile is a KTable (updated when profiles change). The order stream is a KStream (every order matters).

KStream<String, Order> orders = builder.stream("orders");
KTable<String, UserProfile> users = builder.table("user-profiles");

KStream<String, EnrichedOrder> enriched = orders.join(
    users,
    (order, profile) -> new EnrichedOrder(order, profile)
);

This join runs continuously. When a new order arrives, it's enriched with the latest user profile. When a profile updates, future orders get the updated data.

State Stores

Stateful operations (count, aggregate, join) need local state. Kafka Streams uses RocksDB by default. State is stored on the local disk of the application instance.

For fault tolerance, state changes are written to a changelog topic in Kafka. If the instance crashes, a new instance replays the changelog topic to rebuild state. Recovery time depends on state size -- seconds for small state, minutes for large.

Pinterest uses Kafka Streams for their real-time ads pipeline. Simple transformations, all data already in Kafka, deployed as standard Kubernetes pods. No need for a separate Flink cluster.


Architecture

Flink Cluster
  ├── JobManager (coordinator)
  │     ├── Job scheduling
  │     ├── Checkpoint coordination
  │     └── Failure recovery
  ├── TaskManager 1
  │     ├── Task slots
  │     └── Managed memory (off-heap)
  ├── TaskManager 2
  │     ├── Task slots
  │     └── Managed memory (off-heap)
  └── State Backend (RocksDB / heap / S3)

Flink manages its own memory, its own scheduling, its own fault tolerance. It's a full distributed system that happens to be good at stream processing.

Code Example

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

DataStream<Order> orders = env
    .addSource(new FlinkKafkaConsumer<>("orders", new OrderSchema(), kafkaProps));

DataStream<OrderCount> counts = orders
    .filter(order -> order.getStatus().equals("COMPLETED"))
    .keyBy(Order::getUserId)
    .window(TumblingEventTimeWindows.of(Time.minutes(5)))
    .aggregate(new OrderCountAggregator());

counts.addSink(new FlinkKafkaProducer<>("order-counts", new CountSchema(), kafkaProps));

env.execute("Order Counter");

Notice the explicit window definition. Flink's windowing is its killer feature.

Event-Time Processing with Watermarks

This is where Flink pulls ahead of everything else.

Processing time = when the system processes the event. Simple but inaccurate. A delayed event counted in the wrong window.

Event time = when the event actually occurred. Accurate but complex. How do you know when all events for a window have arrived?

Watermarks solve this. A watermark is a timestamp that says: "I believe all events with timestamps up to this point have arrived." Flink uses watermarks to decide when to close a window and emit results.

DataStream<Event> events = source
    .assignTimestampsAndWatermarks(
        WatermarkStrategy
            .<Event>forBoundedOutOfOrderness(Duration.ofSeconds(10))
            .withTimestampAssigner((event, timestamp) -> event.getEventTime())
    );

This says: "Events can arrive up to 10 seconds late. Close the window 10 seconds after the last expected event." Late events beyond 10 seconds can be handled by side outputs.

Kafka Streams has limited event-time support through TimestampExtractor, but no watermark concept. If your events arrive out of order (and they will), Kafka Streams can't handle it as gracefully.

Complex Event Processing (CEP)

Flink has a CEP library for pattern matching over streams:

Pattern<LoginEvent, ?> pattern = Pattern.<LoginEvent>begin("first")
    .where(event -> event.getType().equals("FAILED_LOGIN"))
    .timesOrMore(3)
    .within(Time.minutes(5));

PatternStream<LoginEvent> patternStream = CEP.pattern(loginStream, pattern);

patternStream.select(matches -> {
    return new Alert("3+ failed logins in 5 minutes for user " + matches.getUserId());
});

This detects three or more failed login attempts within five minutes. Try doing this with Kafka Streams. You'd need manual state management, timers, and custom logic. Flink CEP handles it declaratively.


Head-to-Head Comparison

Deployment

Factor Kafka Streams Flink
Infrastructure None (runs in your app) Dedicated cluster (JobManager + TaskManagers)
Deployment Deploy like any microservice Submit jobs to Flink cluster
Scaling Add more app instances Add more TaskManagers / adjust parallelism
Kubernetes Standard pods Flink Kubernetes Operator
Ops burden Low (just your app) High (cluster management, checkpointing, upgrades)

Spicy take: If you can solve your problem with Kafka Streams, you should. The operational cost of running a Flink cluster is real -- upgrades, monitoring, checkpoint failures, state migration. Kafka Streams gives you 70% of the capability at 20% of the operational overhead.

Data Sources

Factor Kafka Streams Flink
Input sources Kafka only Kafka, Kinesis, files, databases, sockets, custom
Output sinks Kafka only Kafka, databases, files, Elasticsearch, custom
Multi-source joins Not possible Native support

If all your data is in Kafka and your output goes to Kafka, Kafka Streams wins on simplicity. If you need to join a Kafka stream with a MySQL table and write to Elasticsearch, you need Flink.

Processing Capabilities

Capability Kafka Streams Flink
Map/Filter/FlatMap Yes Yes
Aggregations Yes (KTable) Yes (windows, keyed state)
Joins Stream-stream, stream-table Stream-stream, stream-table, temporal joins
Windowing Tumbling, hopping, sliding, session All of the above + custom windows
Event-time processing Basic (timestamp extractor) Advanced (watermarks, late data handling)
CEP No Yes (pattern matching library)
Exactly-once Yes (with Kafka transactions) Yes (checkpoints + Kafka transactions)
State management RocksDB, in-memory RocksDB, heap, pluggable backends
Queryable state Interactive queries (limited) Queryable state (limited)

Performance

Metric Kafka Streams Flink
Throughput 100K-500K events/sec per instance Millions of events/sec per cluster
Latency Low (ms) Very low (sub-ms possible)
State size Limited by local disk Can spill to remote state backends (S3)
Checkpointing Changelog topics in Kafka Async checkpoints to S3/HDFS
Recovery time Minutes (replay changelog) Seconds (restore from checkpoint)

Alibaba runs Flink at billions of events per second during Singles' Day. That's not a typo. They've pushed Flink harder than anyone. You won't hit those numbers with Kafka Streams.


Decision Framework

Use Kafka Streams When

  1. All data is in Kafka. Input from Kafka, output to Kafka.
  2. Transformations are simple. Filter, map, aggregate, join with one other stream/table.
  3. Your team doesn't want to run another cluster. Embed it in your microservice.
  4. State is small to medium. Fits on local disk (tens of GB).
  5. Event-time processing is not critical. Processing-time is acceptable.
  1. Multiple data sources. Join Kafka with a database changelog or file stream.
  2. Complex windowing. Event-time windows with watermarks and late data handling.
  3. Pattern matching. CEP for fraud detection, anomaly detection.
  4. Massive state. Terabytes of state that needs remote checkpointing.
  5. Very high throughput. Millions of events per second with sub-millisecond latency.

Decision Table

Scenario Kafka Streams Flink
Filter + transform Kafka events Best Overkill
Real-time dashboard from Kafka Good Good
Join Kafka stream with MySQL CDC Can't Best
Session window analytics Possible Better
Fraud detection with pattern matching Hard Best
Simple event counting Best Overkill
Multi-DC stream processing Limited Better
Team has no JVM experience Neither (consider ksqlDB) Neither

Kafka Connect: The Plumbing Layer

Before you write custom producers and consumers, check if a Kafka Connect connector already exists.

What It Is

Kafka Connect is a framework for moving data between Kafka and external systems. No code required -- just configuration.

Source Connectors (External -> Kafka)

{
  "name": "postgres-source",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "db.example.com",
    "database.dbname": "orders",
    "table.include.list": "public.orders,public.users",
    "topic.prefix": "cdc"
  }
}

This captures every INSERT, UPDATE, DELETE from PostgreSQL and publishes them to Kafka topics. Debezium (by Red Hat) is the most popular CDC connector. It reads the PostgreSQL WAL, so it captures changes without polling the database.

Sink Connectors (Kafka -> External)

{
  "name": "elasticsearch-sink",
  "config": {
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "topics": "orders",
    "connection.url": "http://elasticsearch:9200",
    "type.name": "_doc",
    "key.ignore": "false"
  }
}

This reads from the orders topic and writes to Elasticsearch. Updates are idempotent (upsert by key). If Elasticsearch goes down, the connector pauses and resumes when it's back.

Common Connectors

Connector Direction Use Case
Debezium PostgreSQL Source CDC from PostgreSQL
Debezium MySQL Source CDC from MySQL
JDBC Source Source Poll-based DB ingestion
Elasticsearch Sink Sink Search index sync
S3 Sink Sink Data lake archival
HDFS Sink Sink Hadoop ingestion
MongoDB Source/Sink Both MongoDB CDC + writes

Why it matters for interviews: "We don't write custom Kafka producers for database changes. We use Debezium with Kafka Connect. It reads the WAL, captures every change, and publishes to Kafka. Zero application code. If the database schema changes, the connector handles it."


Schema Registry

When producers and consumers evolve independently, message schemas drift. The Schema Registry prevents this.

What It Does

  1. Producers register their schema (Avro, Protobuf, or JSON Schema) with the registry.
  2. The registry assigns a schema ID.
  3. Producers send messages with the schema ID embedded.
  4. Consumers fetch the schema from the registry using the ID.
  5. The registry enforces compatibility rules on schema evolution.

Compatibility Modes

Mode What You Can Do What You Can't Do
BACKWARD Add optional fields, remove fields Add required fields
FORWARD Add fields, remove optional fields Remove required fields
FULL Add/remove optional fields only Add/remove required fields
NONE Anything Nothing is checked

BACKWARD is the default and most practical. New consumers can read old data. Old consumers can't read new data, but they should be updated anyway.

Why Avro Over JSON

  • Avro: Schema stored separately. Messages are compact binary. Schema evolution with compatibility checks. Smaller payloads.
  • JSON: Schema is implicit. Messages include field names. No compatibility enforcement. Larger payloads.
  • Protobuf: Similar to Avro but with better cross-language support. Google's standard.

Confluent (the company behind Kafka) recommends Avro. Google shops use Protobuf. JSON works for small-scale systems but becomes a liability at scale because of payload size and schema drift.


When NOT to Use Kafka

Kafka is not a universal solution. Here's when other tools are better.

Scenario Use Instead Why
< 1K messages/sec SQS or RabbitMQ Zero ops overhead. Kafka's complexity isn't justified.
Request-reply HTTP/gRPC Kafka is async. Request-reply with Kafka is an anti-pattern.
Message priority RabbitMQ Kafka has no priority queue concept.
Exactly-once to external DB Outbox pattern + polling Simpler than Kafka transactions for DB consistency.
Temporary task queues Celery + RabbitMQ/Redis Better worker semantics (retry, dead-letter, priorities).
Small team, simple needs AWS SNS + SQS Fully managed, no cluster to run.

Spicy take: If your startup has 5 engineers and you're debating Kafka vs SQS, use SQS. You don't have the headcount to operate Kafka properly. You'll spend more time debugging broker issues than building features. Kafka is for organizations that have outgrown managed queues. Not everyone has.


Patterns for System Design Interviews

Pattern 1: CDC Pipeline

"Debezium captures PostgreSQL changes via Kafka Connect. Changes flow to Kafka topics. Flink reads from Kafka, joins the user stream with the order stream, and writes enriched records to Elasticsearch. Schema Registry enforces backward compatibility. If we need to add a field to orders, we update the Avro schema, register it, and consumers handle the new field gracefully."

Pattern 2: Simple Event Processing

"We use Kafka Streams embedded in our notification service. It reads from the user-events topic, filters for events that require notifications, enriches them with user preferences from a KTable, and writes notification requests to the notifications topic. No separate cluster. Deployed as a standard Kubernetes deployment."

Pattern 3: Complex Analytics Pipeline

"Flink reads click events from Kafka with event-time processing. Watermarks handle out-of-order events (bounded by 30 seconds). Session windows group clicks by user session. CEP detects suspicious patterns (bot clicks). Results go to both Kafka (for real-time dashboards) and S3 (for batch analytics)."


Streams Vs Flink

Trade-Offs Table

Factor Kafka Streams Flink Kafka Connect
Purpose Stream processing (Kafka-only) General stream processing Data integration
Ops overhead None High (separate cluster) Low (runs on Connect workers)
Data sources Kafka only Any 100+ pre-built connectors
Custom logic Full Java/Scala Full Java/Scala/Python/SQL Configuration only
Exactly-once Yes (Kafka transactions) Yes (checkpoints) Depends on connector
Scaling Add app instances Adjust parallelism Add Connect workers
State Local (RocksDB) Distributed (pluggable) Offsets only
Learning curve Medium High Low
Best for Simple transforms Complex analytics Moving data in/out of Kafka

Interview Gotchas

"When all data is in Kafka, transformations are straightforward, and we don't want to operate a separate cluster. Kafka Streams runs inside our application -- same deployment, same monitoring, same CI/CD. Flink is worth the operational overhead when we need event-time processing with watermarks, multi-source joins, or CEP."

"What's ksqlDB?"

"SQL interface for Kafka Streams. Write SQL queries instead of Java code. Good for prototyping and simple transformations. Not suitable for complex logic -- you'll eventually need Java. Think of it as the training wheels version of Kafka Streams."

"How do you handle schema evolution in Kafka?"

"Schema Registry with Avro serialization. BACKWARD compatibility mode -- new consumers can read old messages. When we add a field, it has a default value so old messages without the field still deserialize correctly. Breaking changes (renaming a field, changing a type) require a new topic."

"What's the relationship between Kafka Connect and Kafka Streams?"

"Kafka Connect moves data into and out of Kafka. Kafka Streams processes data already in Kafka. In a typical pipeline: Connect ingests data from PostgreSQL into Kafka, Streams transforms it, Connect writes the output to Elasticsearch. They're complementary, not competing."

"No. Flink is a processing engine. Kafka is a storage and messaging system. Flink reads from Kafka, processes events, and writes results somewhere. You can use Flink without Kafka (reading from files or databases), but in most architectures they work together."


Summary

Kafka Streams and Flink solve the same problem differently. Kafka Streams is a library -- simple to deploy, limited to Kafka sources, good enough for most use cases. Flink is a framework -- powerful, multi-source, complex to operate, needed for event-time processing and advanced analytics. Kafka Connect handles data integration without code. Schema Registry prevents schema drift. Know which tool fits which scenario, and more importantly, know when Kafka itself is overkill. Most systems don't need Flink's power. Many don't even need Kafka. The best architecture uses the simplest tool that solves the problem.