Kafka Streams vs Flink — When to Use Each
TL;DR
Kafka Streams is a library you embed in your app for simple Kafka-to-Kafka processing; Flink is a distributed framework you deploy as a separate cluster for complex, multi-source stream processing -- pick based on operational appetite, not features.
What It Is

Stream processing means transforming data as it arrives, rather than batching it and processing later. Two tools dominate this space, and they solve different problems.
Kafka Streams is a Java library. You add it as a dependency. It runs inside your application process. No separate cluster. No resource manager. No job scheduler. Just a JAR file that reads from Kafka, processes data, and writes back to Kafka.
Apache Flink is a distributed stream processing framework. It runs on its own cluster with a JobManager, TaskManagers, and state backends. It processes data from Kafka, databases, files, sockets -- anything. It has its own runtime, memory management, and fault tolerance.
The fundamental difference: Kafka Streams is a library you add to your code. Flink is infrastructure you operate.
Kafka Streams: The Library Approach
Architecture
Your Application (JVM process)
├── Your business logic
├── Kafka Streams library
│ ├── Stream threads (configurable count)
│ ├── State stores (local RocksDB)
│ └── Changelog topics (for state recovery)
└── Standard deployment (Kubernetes, ECS, bare metal)
No external dependencies beyond Kafka itself. Deploy it like any other microservice. Scale it by running more instances. Each instance gets assigned a subset of partitions.
Code Example
StreamsBuilder builder = new StreamsBuilder();
KStream<String, Order> orders = builder.stream("orders");
KTable<String, Long> orderCounts = orders
.filter((key, order) -> order.getStatus().equals("COMPLETED"))
.groupByKey()
.count();
orderCounts.toStream().to("order-counts");
KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();
That's it. No cluster setup. No YARN or Kubernetes operator. The KafkaStreams object manages threads, state, and partition assignment internally.
KTable: The Stateful Primitive
A KTable is a changelog stream materialized as a table. Each new message with the same key replaces the previous value. Think of it as a continuously updated database table.
// KStream: every event matters (append-only)
// user-123: {"action": "login"}
// user-123: {"action": "view_page"}
// user-123: {"action": "logout"}
// → 3 records
// KTable: only latest value per key matters
// user-123: {"name": "Alice", "status": "active"}
// user-123: {"name": "Alice", "status": "inactive"}
// → 1 record (latest value)
KTable enables stream-table joins. Enrich a stream of orders with the latest user profile data. The user profile is a KTable (updated when profiles change). The order stream is a KStream (every order matters).
KStream<String, Order> orders = builder.stream("orders");
KTable<String, UserProfile> users = builder.table("user-profiles");
KStream<String, EnrichedOrder> enriched = orders.join(
users,
(order, profile) -> new EnrichedOrder(order, profile)
);
This join runs continuously. When a new order arrives, it's enriched with the latest user profile. When a profile updates, future orders get the updated data.
State Stores
Stateful operations (count, aggregate, join) need local state. Kafka Streams uses RocksDB by default. State is stored on the local disk of the application instance.
For fault tolerance, state changes are written to a changelog topic in Kafka. If the instance crashes, a new instance replays the changelog topic to rebuild state. Recovery time depends on state size -- seconds for small state, minutes for large.
Pinterest uses Kafka Streams for their real-time ads pipeline. Simple transformations, all data already in Kafka, deployed as standard Kubernetes pods. No need for a separate Flink cluster.
Apache Flink: The Framework Approach
Architecture
Flink Cluster
├── JobManager (coordinator)
│ ├── Job scheduling
│ ├── Checkpoint coordination
│ └── Failure recovery
├── TaskManager 1
│ ├── Task slots
│ └── Managed memory (off-heap)
├── TaskManager 2
│ ├── Task slots
│ └── Managed memory (off-heap)
└── State Backend (RocksDB / heap / S3)
Flink manages its own memory, its own scheduling, its own fault tolerance. It's a full distributed system that happens to be good at stream processing.
Code Example
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<Order> orders = env
.addSource(new FlinkKafkaConsumer<>("orders", new OrderSchema(), kafkaProps));
DataStream<OrderCount> counts = orders
.filter(order -> order.getStatus().equals("COMPLETED"))
.keyBy(Order::getUserId)
.window(TumblingEventTimeWindows.of(Time.minutes(5)))
.aggregate(new OrderCountAggregator());
counts.addSink(new FlinkKafkaProducer<>("order-counts", new CountSchema(), kafkaProps));
env.execute("Order Counter");
Notice the explicit window definition. Flink's windowing is its killer feature.
Event-Time Processing with Watermarks
This is where Flink pulls ahead of everything else.
Processing time = when the system processes the event. Simple but inaccurate. A delayed event counted in the wrong window.
Event time = when the event actually occurred. Accurate but complex. How do you know when all events for a window have arrived?
Watermarks solve this. A watermark is a timestamp that says: "I believe all events with timestamps up to this point have arrived." Flink uses watermarks to decide when to close a window and emit results.
DataStream<Event> events = source
.assignTimestampsAndWatermarks(
WatermarkStrategy
.<Event>forBoundedOutOfOrderness(Duration.ofSeconds(10))
.withTimestampAssigner((event, timestamp) -> event.getEventTime())
);
This says: "Events can arrive up to 10 seconds late. Close the window 10 seconds after the last expected event." Late events beyond 10 seconds can be handled by side outputs.
Kafka Streams has limited event-time support through TimestampExtractor, but no watermark concept. If your events arrive out of order (and they will), Kafka Streams can't handle it as gracefully.
Complex Event Processing (CEP)
Flink has a CEP library for pattern matching over streams:
Pattern<LoginEvent, ?> pattern = Pattern.<LoginEvent>begin("first")
.where(event -> event.getType().equals("FAILED_LOGIN"))
.timesOrMore(3)
.within(Time.minutes(5));
PatternStream<LoginEvent> patternStream = CEP.pattern(loginStream, pattern);
patternStream.select(matches -> {
return new Alert("3+ failed logins in 5 minutes for user " + matches.getUserId());
});
This detects three or more failed login attempts within five minutes. Try doing this with Kafka Streams. You'd need manual state management, timers, and custom logic. Flink CEP handles it declaratively.
Head-to-Head Comparison
Deployment
| Factor | Kafka Streams | Flink |
|---|---|---|
| Infrastructure | None (runs in your app) | Dedicated cluster (JobManager + TaskManagers) |
| Deployment | Deploy like any microservice | Submit jobs to Flink cluster |
| Scaling | Add more app instances | Add more TaskManagers / adjust parallelism |
| Kubernetes | Standard pods | Flink Kubernetes Operator |
| Ops burden | Low (just your app) | High (cluster management, checkpointing, upgrades) |
Spicy take: If you can solve your problem with Kafka Streams, you should. The operational cost of running a Flink cluster is real -- upgrades, monitoring, checkpoint failures, state migration. Kafka Streams gives you 70% of the capability at 20% of the operational overhead.
Data Sources
| Factor | Kafka Streams | Flink |
|---|---|---|
| Input sources | Kafka only | Kafka, Kinesis, files, databases, sockets, custom |
| Output sinks | Kafka only | Kafka, databases, files, Elasticsearch, custom |
| Multi-source joins | Not possible | Native support |
If all your data is in Kafka and your output goes to Kafka, Kafka Streams wins on simplicity. If you need to join a Kafka stream with a MySQL table and write to Elasticsearch, you need Flink.
Processing Capabilities
| Capability | Kafka Streams | Flink |
|---|---|---|
| Map/Filter/FlatMap | Yes | Yes |
| Aggregations | Yes (KTable) | Yes (windows, keyed state) |
| Joins | Stream-stream, stream-table | Stream-stream, stream-table, temporal joins |
| Windowing | Tumbling, hopping, sliding, session | All of the above + custom windows |
| Event-time processing | Basic (timestamp extractor) | Advanced (watermarks, late data handling) |
| CEP | No | Yes (pattern matching library) |
| Exactly-once | Yes (with Kafka transactions) | Yes (checkpoints + Kafka transactions) |
| State management | RocksDB, in-memory | RocksDB, heap, pluggable backends |
| Queryable state | Interactive queries (limited) | Queryable state (limited) |
Performance
| Metric | Kafka Streams | Flink |
|---|---|---|
| Throughput | 100K-500K events/sec per instance | Millions of events/sec per cluster |
| Latency | Low (ms) | Very low (sub-ms possible) |
| State size | Limited by local disk | Can spill to remote state backends (S3) |
| Checkpointing | Changelog topics in Kafka | Async checkpoints to S3/HDFS |
| Recovery time | Minutes (replay changelog) | Seconds (restore from checkpoint) |
Alibaba runs Flink at billions of events per second during Singles' Day. That's not a typo. They've pushed Flink harder than anyone. You won't hit those numbers with Kafka Streams.
Decision Framework
Use Kafka Streams When
- All data is in Kafka. Input from Kafka, output to Kafka.
- Transformations are simple. Filter, map, aggregate, join with one other stream/table.
- Your team doesn't want to run another cluster. Embed it in your microservice.
- State is small to medium. Fits on local disk (tens of GB).
- Event-time processing is not critical. Processing-time is acceptable.
Use Flink When
- Multiple data sources. Join Kafka with a database changelog or file stream.
- Complex windowing. Event-time windows with watermarks and late data handling.
- Pattern matching. CEP for fraud detection, anomaly detection.
- Massive state. Terabytes of state that needs remote checkpointing.
- Very high throughput. Millions of events per second with sub-millisecond latency.
Decision Table
| Scenario | Kafka Streams | Flink |
|---|---|---|
| Filter + transform Kafka events | Best | Overkill |
| Real-time dashboard from Kafka | Good | Good |
| Join Kafka stream with MySQL CDC | Can't | Best |
| Session window analytics | Possible | Better |
| Fraud detection with pattern matching | Hard | Best |
| Simple event counting | Best | Overkill |
| Multi-DC stream processing | Limited | Better |
| Team has no JVM experience | Neither (consider ksqlDB) | Neither |
Kafka Connect: The Plumbing Layer
Before you write custom producers and consumers, check if a Kafka Connect connector already exists.
What It Is
Kafka Connect is a framework for moving data between Kafka and external systems. No code required -- just configuration.
Source Connectors (External -> Kafka)
{
"name": "postgres-source",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"database.hostname": "db.example.com",
"database.dbname": "orders",
"table.include.list": "public.orders,public.users",
"topic.prefix": "cdc"
}
}
This captures every INSERT, UPDATE, DELETE from PostgreSQL and publishes them to Kafka topics. Debezium (by Red Hat) is the most popular CDC connector. It reads the PostgreSQL WAL, so it captures changes without polling the database.
Sink Connectors (Kafka -> External)
{
"name": "elasticsearch-sink",
"config": {
"connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
"topics": "orders",
"connection.url": "http://elasticsearch:9200",
"type.name": "_doc",
"key.ignore": "false"
}
}
This reads from the orders topic and writes to Elasticsearch. Updates are idempotent (upsert by key). If Elasticsearch goes down, the connector pauses and resumes when it's back.
Common Connectors
| Connector | Direction | Use Case |
|---|---|---|
| Debezium PostgreSQL | Source | CDC from PostgreSQL |
| Debezium MySQL | Source | CDC from MySQL |
| JDBC Source | Source | Poll-based DB ingestion |
| Elasticsearch Sink | Sink | Search index sync |
| S3 Sink | Sink | Data lake archival |
| HDFS Sink | Sink | Hadoop ingestion |
| MongoDB Source/Sink | Both | MongoDB CDC + writes |
Why it matters for interviews: "We don't write custom Kafka producers for database changes. We use Debezium with Kafka Connect. It reads the WAL, captures every change, and publishes to Kafka. Zero application code. If the database schema changes, the connector handles it."
Schema Registry
When producers and consumers evolve independently, message schemas drift. The Schema Registry prevents this.
What It Does
- Producers register their schema (Avro, Protobuf, or JSON Schema) with the registry.
- The registry assigns a schema ID.
- Producers send messages with the schema ID embedded.
- Consumers fetch the schema from the registry using the ID.
- The registry enforces compatibility rules on schema evolution.
Compatibility Modes
| Mode | What You Can Do | What You Can't Do |
|---|---|---|
| BACKWARD | Add optional fields, remove fields | Add required fields |
| FORWARD | Add fields, remove optional fields | Remove required fields |
| FULL | Add/remove optional fields only | Add/remove required fields |
| NONE | Anything | Nothing is checked |
BACKWARD is the default and most practical. New consumers can read old data. Old consumers can't read new data, but they should be updated anyway.
Why Avro Over JSON
- Avro: Schema stored separately. Messages are compact binary. Schema evolution with compatibility checks. Smaller payloads.
- JSON: Schema is implicit. Messages include field names. No compatibility enforcement. Larger payloads.
- Protobuf: Similar to Avro but with better cross-language support. Google's standard.
Confluent (the company behind Kafka) recommends Avro. Google shops use Protobuf. JSON works for small-scale systems but becomes a liability at scale because of payload size and schema drift.
When NOT to Use Kafka
Kafka is not a universal solution. Here's when other tools are better.
| Scenario | Use Instead | Why |
|---|---|---|
| < 1K messages/sec | SQS or RabbitMQ | Zero ops overhead. Kafka's complexity isn't justified. |
| Request-reply | HTTP/gRPC | Kafka is async. Request-reply with Kafka is an anti-pattern. |
| Message priority | RabbitMQ | Kafka has no priority queue concept. |
| Exactly-once to external DB | Outbox pattern + polling | Simpler than Kafka transactions for DB consistency. |
| Temporary task queues | Celery + RabbitMQ/Redis | Better worker semantics (retry, dead-letter, priorities). |
| Small team, simple needs | AWS SNS + SQS | Fully managed, no cluster to run. |
Spicy take: If your startup has 5 engineers and you're debating Kafka vs SQS, use SQS. You don't have the headcount to operate Kafka properly. You'll spend more time debugging broker issues than building features. Kafka is for organizations that have outgrown managed queues. Not everyone has.
Patterns for System Design Interviews
Pattern 1: CDC Pipeline
"Debezium captures PostgreSQL changes via Kafka Connect. Changes flow to Kafka topics. Flink reads from Kafka, joins the user stream with the order stream, and writes enriched records to Elasticsearch. Schema Registry enforces backward compatibility. If we need to add a field to orders, we update the Avro schema, register it, and consumers handle the new field gracefully."
Pattern 2: Simple Event Processing
"We use Kafka Streams embedded in our notification service. It reads from the user-events topic, filters for events that require notifications, enriches them with user preferences from a KTable, and writes notification requests to the notifications topic. No separate cluster. Deployed as a standard Kubernetes deployment."
Pattern 3: Complex Analytics Pipeline
"Flink reads click events from Kafka with event-time processing. Watermarks handle out-of-order events (bounded by 30 seconds). Session windows group clicks by user session. CEP detects suspicious patterns (bot clicks). Results go to both Kafka (for real-time dashboards) and S3 (for batch analytics)."

Trade-Offs Table
| Factor | Kafka Streams | Flink | Kafka Connect |
|---|---|---|---|
| Purpose | Stream processing (Kafka-only) | General stream processing | Data integration |
| Ops overhead | None | High (separate cluster) | Low (runs on Connect workers) |
| Data sources | Kafka only | Any | 100+ pre-built connectors |
| Custom logic | Full Java/Scala | Full Java/Scala/Python/SQL | Configuration only |
| Exactly-once | Yes (Kafka transactions) | Yes (checkpoints) | Depends on connector |
| Scaling | Add app instances | Adjust parallelism | Add Connect workers |
| State | Local (RocksDB) | Distributed (pluggable) | Offsets only |
| Learning curve | Medium | High | Low |
| Best for | Simple transforms | Complex analytics | Moving data in/out of Kafka |
Interview Gotchas
"When would you use Kafka Streams over Flink?"
"When all data is in Kafka, transformations are straightforward, and we don't want to operate a separate cluster. Kafka Streams runs inside our application -- same deployment, same monitoring, same CI/CD. Flink is worth the operational overhead when we need event-time processing with watermarks, multi-source joins, or CEP."
"What's ksqlDB?"
"SQL interface for Kafka Streams. Write SQL queries instead of Java code. Good for prototyping and simple transformations. Not suitable for complex logic -- you'll eventually need Java. Think of it as the training wheels version of Kafka Streams."
"How do you handle schema evolution in Kafka?"
"Schema Registry with Avro serialization. BACKWARD compatibility mode -- new consumers can read old messages. When we add a field, it has a default value so old messages without the field still deserialize correctly. Breaking changes (renaming a field, changing a type) require a new topic."
"What's the relationship between Kafka Connect and Kafka Streams?"
"Kafka Connect moves data into and out of Kafka. Kafka Streams processes data already in Kafka. In a typical pipeline: Connect ingests data from PostgreSQL into Kafka, Streams transforms it, Connect writes the output to Elasticsearch. They're complementary, not competing."
"Can Flink replace Kafka?"
"No. Flink is a processing engine. Kafka is a storage and messaging system. Flink reads from Kafka, processes events, and writes results somewhere. You can use Flink without Kafka (reading from files or databases), but in most architectures they work together."
Summary
Kafka Streams and Flink solve the same problem differently. Kafka Streams is a library -- simple to deploy, limited to Kafka sources, good enough for most use cases. Flink is a framework -- powerful, multi-source, complex to operate, needed for event-time processing and advanced analytics. Kafka Connect handles data integration without code. Schema Registry prevents schema drift. Know which tool fits which scenario, and more importantly, know when Kafka itself is overkill. Most systems don't need Flink's power. Many don't even need Kafka. The best architecture uses the simplest tool that solves the problem.