Kafka Streams vs Flink — When to Use Each

TL;DR

Kafka Streams is a library you embed in your app for simple Kafka-to-Kafka processing; Flink is a distributed framework you deploy as a separate cluster for complex, multi-source stream processing -- pick based on operational appetite, not features.

What It Is


Stream processing means transforming data as it arrives, rather than batching it and processing later. Kafka Streams and Apache Flink are two of the most widely used tools for it, and they take very different approaches.

Kafka Streams is a Java library. You add it as a dependency. It runs inside your application process. No separate cluster. No resource manager. No job scheduler. Just a JAR file that reads from Kafka, processes data, and writes back to Kafka.

Apache Flink is a distributed stream processing framework. It runs on its own cluster with a JobManager, TaskManagers, and state backends. It processes data from Kafka, databases, files, sockets -- anything. It has its own runtime, memory management, and fault tolerance.

The fundamental difference: Kafka Streams is a library you add to your code. Flink is infrastructure you operate.

Kafka Streams: The Library Approach

Architecture

Your Application (JVM process)
  ├── Your business logic
  ├── Kafka Streams library
  │     ├── Stream threads (configurable count)
  │     ├── State stores (local RocksDB)
  │     └── Changelog topics (for state recovery)
  └── Standard deployment (Kubernetes, ECS, bare metal)

No external dependencies beyond Kafka itself. Deploy it like any other microservice. Scale it by running more instances. Each instance gets assigned a subset of partitions.
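To make the scaling model concrete, here is a minimal sketch of spreading partitions across instances. It uses plain round-robin, which is an assumption for illustration only: the real Kafka Streams assignor works on tasks and also considers where state already lives. `PartitionAssignmentSketch` and `assign` are hypothetical names, not part of the library.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: round-robin partition assignment.
// The real Kafka Streams assignor is more involved (tasks, stickiness, standby replicas).
final class PartitionAssignmentSketch {

    // Returns instance name -> partitions assigned to that instance.
    static Map<String, List<Integer>> assign(List<String> instances, int partitionCount) {
        if (instances.isEmpty()) {
            throw new IllegalArgumentException("need at least one instance");
        }
        Map<String, List<Integer>> assignment = new LinkedHashMap<>();
        for (String instance : instances) {
            assignment.put(instance, new ArrayList<>());
        }
        for (int partition = 0; partition < partitionCount; partition++) {
            String owner = instances.get(partition % instances.size());
            assignment.get(owner).add(partition);
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Six partitions on two instances: three each.
        System.out.println(assign(List.of("app-1", "app-2"), 6));
        // A third instance brings it down to two each.
        System.out.println(assign(List.of("app-1", "app-2", "app-3"), 6));
        // More instances than partitions: the extra instance gets nothing.
        System.out.println(assign(List.of("a", "b", "c"), 2));
    }
}
```

The last case is the practical limit: running more instances than the input topic has partitions adds no throughput, because the extra instances have nothing to read.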

Code Example

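// Imports assumed: org.apache.kafka.streams.*, org.apache.kafka.streams.kstream.*,
// and org.apache.kafka.common.serialization.Serdes.
StreamsBuilder builder = new StreamsBuilder();

// Reads with the default serdes from props; a serde for Order must be configured there.
KStream<String, Order> orders = builder.stream("orders");

KTable<String, Long> orderCounts = orders
    // Constant-first equals avoids a NullPointerException when the status is null.
    .filter((key, order) -> "COMPLETED".equals(order.getStatus()))
    .groupByKey()
    .count();

// count() produces Long values; name the serdes explicitly rather than relying on defaults.
orderCounts.toStream().to("order-counts", Produced.with(Serdes.String(), Serdes.Long()));

KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();
// Close on shutdown so state is flushed and the instance leaves the consumer group cleanly.
Runtime.getRuntime().addShutdownHook(new Thread(streams::close));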

That's it. No cluster setup. No YARN or Kubernetes operator. The KafkaStreams object manages threads, state, and partition assignment internally.

KTable: The Stateful Primitive

A KTable is a changelog stream materialized as a table. Each new message with the same key replaces the previous value. Think of it as a continuously updated database table.

// KStream: every event matters (append-only)
// user-123: {"action": "login"}
// user-123: {"action": "view_page"}
// user-123: {"action": "logout"}
// → 3 records

// KTable: only latest value per key matters
// user-123: {"name": "Alice", "status": "active"}
// user-123: {"name": "Alice", "status": "inactive"}
// → 1 record (latest value)
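The same distinction can be modeled in a few lines of plain Java. This is a minimal in-memory sketch, not the Kafka Streams API: `StreamVsTableSketch`, `asStream`, and `asTable` are hypothetical names, and records are simplified to key/value string pairs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of the two abstractions; not the Kafka Streams API.
final class StreamVsTableSketch {

    // Stream view: every record is kept, in arrival order.
    static List<String> asStream(List<String[]> records) {
        List<String> log = new ArrayList<>();
        for (String[] r : records) {
            log.add(r[0] + "=" + r[1]);
        }
        return log;
    }

    // Table view: a later record with the same key overwrites the earlier one.
    // A null value is treated as a delete (a tombstone).
    static Map<String, String> asTable(List<String[]> records) {
        Map<String, String> table = new LinkedHashMap<>();
        for (String[] r : records) {
            if (r[1] == null) {
                table.remove(r[0]);
            } else {
                table.put(r[0], r[1]);
            }
        }
        return table;
    }

    public static void main(String[] args) {
        List<String[]> records = List.of(
            new String[] {"user-123", "active"},
            new String[] {"user-456", "active"},
            new String[] {"user-123", "inactive"});
        System.out.println(asStream(records)); // three entries
        System.out.println(asTable(records));  // two entries; user-123 is "inactive"
    }
}
```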

KTable enables stream-table joins. Enrich a stream of orders with the latest user profile data. The user profile is a KTable (updated when profiles change). The order stream is a KStream (every order matters).

KStream<String, Order> orders = builder.stream("orders");
KTable<String, UserProfile> users = builder.table("user-profiles");

KStream<String, EnrichedOrder> enriched = orders.join(
    users,
    (order, profile) -> new EnrichedOrder(order, profile)
);

This join runs continuously. When a new order arrives, it's enriched with the latest user profile. When a profile updates, future orders get the updated data.
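The timing behavior described above is worth seeing step by step. Below is a minimal model of inner stream-table join semantics in plain Java; `StreamTableJoinSketch` and its methods are hypothetical and only imitate what the library does with real topics and state stores.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of stream-table (inner) join semantics; not the Kafka Streams API.
final class StreamTableJoinSketch {

    private final Map<String, String> profiles = new HashMap<>(); // table side: latest value per key
    private final List<String> output = new ArrayList<>();

    // A table update only changes state; it emits nothing.
    void onProfileUpdate(String userId, String profile) {
        profiles.put(userId, profile);
    }

    // A stream record looks up the table as it is at that moment.
    void onOrder(String userId, String orderId) {
        String profile = profiles.get(userId);
        if (profile != null) { // inner join: orders without a profile are dropped
            output.add(orderId + " enriched with " + profile);
        }
    }

    List<String> output() {
        return output;
    }

    public static void main(String[] args) {
        StreamTableJoinSketch join = new StreamTableJoinSketch();
        join.onOrder("user-1", "order-A");            // no profile yet: dropped
        join.onProfileUpdate("user-1", "tier=basic");
        join.onOrder("user-1", "order-B");            // sees tier=basic
        join.onProfileUpdate("user-1", "tier=gold");
        join.onOrder("user-1", "order-C");            // sees tier=gold
        join.output().forEach(System.out::println);
    }
}
```

Two consequences follow. An order that was already emitted is not re-enriched when the profile changes later, and an order that arrives before its profile is lost under an inner join. In Kafka Streams, leftJoin keeps such orders with a null profile instead.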

State Stores

Stateful operations (count, aggregate, join) need local state. Kafka Streams uses RocksDB by default. State is stored on the local disk of the application instance.

For fault tolerance, state changes are written to a changelog topic in Kafka. If the instance crashes, a new instance replays the changelog topic to rebuild state. Recovery time depends on state size -- seconds for small state, minutes for large.
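The recovery path can be sketched without any Kafka dependency. In this minimal model, a HashMap stands in for the local RocksDB store and a list stands in for the changelog topic; `ChangelogRestoreSketch` and its methods are hypothetical names chosen for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of changelog-based state recovery; not the Kafka Streams API.
final class ChangelogRestoreSketch {

    private final Map<String, Long> store = new HashMap<>();                    // stands in for RocksDB
    private final List<Map.Entry<String, Long>> changelog = new ArrayList<>();  // stands in for the topic

    // Every state change is applied locally and appended to the changelog.
    void increment(String key) {
        long next = store.getOrDefault(key, 0L) + 1;
        store.put(key, next);
        changelog.add(new SimpleEntry<String, Long>(key, next));
    }

    List<Map.Entry<String, Long>> changelog() {
        return changelog;
    }

    Map<String, Long> store() {
        return store;
    }

    // A replacement instance rebuilds the store by replaying the changelog in order.
    static Map<String, Long> restore(List<Map.Entry<String, Long>> changelog) {
        Map<String, Long> rebuilt = new HashMap<>();
        for (Map.Entry<String, Long> change : changelog) {
            if (change.getValue() == null) {
                rebuilt.remove(change.getKey()); // tombstone: the key was deleted
            } else {
                rebuilt.put(change.getKey(), change.getValue());
            }
        }
        return rebuilt;
    }

    public static void main(String[] args) {
        ChangelogRestoreSketch original = new ChangelogRestoreSketch();
        original.increment("user-1");
        original.increment("user-1");
        original.increment("user-2");
        Map<String, Long> rebuilt = restore(original.changelog());
        System.out.println(rebuilt.equals(original.store())); // state matches after replay
    }
}
```

Replay time grows with the number of changelog records, which is why changelog topics are log-compacted: compaction keeps roughly one record per key, so recovery cost tracks state size rather than total update count.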

Pinterest uses Kafka Streams for their real-time ads pipeline. Simple transformations, all data already in Kafka, deployed as standard Kubernetes pods. No need for a separate Flink cluster.

Apache Flink: The Framework Approach

Architecture

Flink Cluster
  ├── JobManager (coordinator)
  │     ├── Job scheduling
  │     ├── Checkpoint coordination
  │     └── Failure recovery
  ├── TaskManager 1
  │     ├── Task slots
  │     └── Managed memory (off-heap)
  ├── TaskManager 2
  │     ├── Task slots
  │     └── Managed memory (off-heap)
  ├── State backend (heap / RocksDB)
  └── Checkpoint storage (S3 / HDFS / local filesystem)

Flink manages its own memory, its own scheduling, its own fault tolerance. It's a full distributed system that happens to be good at stream processing.

Code Example

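StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// FlinkKafkaConsumer and FlinkKafkaProducer are deprecated in recent Flink releases;
// new code should prefer KafkaSource and KafkaSink.
DataStream<Order> orders = env
    .addSource(new FlinkKafkaConsumer<>("orders", new OrderSchema(), kafkaProps))
    // Event-time windows fire only when watermarks advance, so a strategy is required.
    // With no timestamp assigner, the Kafka record timestamp is used as event time.
    .assignTimestampsAndWatermarks(
        WatermarkStrategy.<Order>forBoundedOutOfOrderness(Duration.ofSeconds(10)));

DataStream<OrderCount> counts = orders
    // Constant-first equals avoids a NullPointerException when the status is null.
    .filter(order -> "COMPLETED".equals(order.getStatus()))
    .keyBy(Order::getUserId)
    .window(TumblingEventTimeWindows.of(Time.minutes(5)))
    .aggregate(new OrderCountAggregator());

counts.addSink(new FlinkKafkaProducer<>("order-counts", new CountSchema(), kafkaProps));

env.execute("Order Counter");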

Notice the explicit window definition. Flink's windowing is its killer feature.

Event-Time Processing with Watermarks

This is where Flink pulls ahead of everything else.

Processing time = when the system processes the event. Simple but inaccurate: a delayed event gets counted in the wrong window.

Event time = when the event actually occurred. Accurate but complex. How do you know when all events for a window have arrived?
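The difference between the two clocks is easiest to see with a concrete timestamp. A minimal sketch, assuming five-minute tumbling windows aligned to the epoch; `WindowAssignmentSketch` and `windowStart` are hypothetical names, not part of Flink or Kafka Streams.

```java
// Hypothetical helper showing tumbling-window assignment; names are illustrative.
final class WindowAssignmentSketch {

    // Start of the tumbling window that contains the timestamp.
    // Assumes non-negative timestamps and windows aligned to the epoch.
    static long windowStart(long timestampMillis, long windowSizeMillis) {
        return timestampMillis - (timestampMillis % windowSizeMillis);
    }

    public static void main(String[] args) {
        long fiveMinutes = 5 * 60 * 1000L;
        long eventTime = 4 * 60 * 1000L + 59_000L; // the event occurred at 00:04:59
        long processingTime = eventTime + 3_000L;  // it was processed at 00:05:02

        // By event time the record belongs to the window starting at 00:00.
        System.out.println(windowStart(eventTime, fiveMinutes));
        // By processing time it lands in the window starting at 00:05.
        System.out.println(windowStart(processingTime, fiveMinutes));
    }
}
```

A three-second delay is enough to move the event into the next window, which is the inaccuracy that event-time processing removes.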

Watermarks solve this. A watermark is a timestamp that says: "I believe all events with timestamps up to this point have arrived." Flink uses watermarks to decide when to close a window and emit results.

DataStream<Event> events = source
    .assignTimestampsAndWatermarks(
        WatermarkStrategy
            .<Event>forBoundedOutOfOrderness(Duration.ofSeconds(10))
            .withTimestampAssigner((event, timestamp) -> event.getEventTime())
    );

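This says: "Events may arrive up to 10 seconds out of order." The watermark trails the highest timestamp seen so far by 10 seconds, and a window closes once the watermark passes the end of that window. Events that arrive after their window has closed are late: Flink drops them by default, or you can route them to a side output.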
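The arithmetic behind a bounded-out-of-orderness watermark is small enough to sketch in plain Java. This is a simplified model written for illustration, not Flink's implementation: `WatermarkSketch` and its methods are hypothetical, and a real job emits watermarks periodically rather than on every event.

```java
// Hypothetical model of a bounded-out-of-orderness watermark; not the Flink API.
final class WatermarkSketch {

    private final long maxOutOfOrdernessMillis;
    private long maxTimestampSeen = Long.MIN_VALUE;

    WatermarkSketch(long maxOutOfOrdernessMillis) {
        this.maxOutOfOrdernessMillis = maxOutOfOrdernessMillis;
    }

    // Records the event and returns the watermark after seeing it.
    long onEvent(long eventTimestampMillis) {
        maxTimestampSeen = Math.max(maxTimestampSeen, eventTimestampMillis);
        return currentWatermark();
    }

    // The watermark trails the highest timestamp seen by the allowed out-of-orderness.
    long currentWatermark() {
        if (maxTimestampSeen == Long.MIN_VALUE) {
            return Long.MIN_VALUE; // nothing seen yet
        }
        return maxTimestampSeen - maxOutOfOrdernessMillis - 1;
    }

    // A window [start, end) may fire once the watermark reaches its last timestamp, end - 1.
    boolean canFire(long windowEndExclusiveMillis) {
        return currentWatermark() >= windowEndExclusiveMillis - 1;
    }

    public static void main(String[] args) {
        WatermarkSketch watermark = new WatermarkSketch(10_000L);
        long windowEnd = 60_000L; // the window covers [0s, 60s)

        watermark.onEvent(55_000L);
        System.out.println(watermark.canFire(windowEnd)); // false: watermark is below the window end
        watermark.onEvent(65_000L);
        watermark.onEvent(58_000L);                       // out of order, but its window is still open
        System.out.println(watermark.canFire(windowEnd)); // false: watermark is only near 55s
        watermark.onEvent(70_000L);
        System.out.println(watermark.canFire(windowEnd)); // true: watermark reached the window end
    }
}
```

The trade-off is visible in the numbers: a larger bound tolerates more disorder but delays every result by that same amount.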

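Kafka Streams also works in event time: a TimestampExtractor supplies each record's timestamp, and windowed operations accept out-of-order records within a configurable grace period. It has no watermarks, though. Progress is tracked by "stream time" (the highest timestamp seen so far), and records that arrive after the grace period are dropped rather than routed elsewhere. If your events arrive badly out of order (and some will), Flink gives you more control over what happens to them.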

Complex Event Processing (CEP)

Flink has a CEP library for pattern matching over streams:

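Pattern<LoginEvent, ?> pattern = Pattern.<LoginEvent>begin("failures")
    // where() takes an IterativeCondition; SimpleCondition.of wraps a lambda (Flink 1.16+).
    .where(SimpleCondition.of(event -> "FAILED_LOGIN".equals(event.getType())))
    .timesOrMore(3)
    .within(Time.minutes(5));

// Key by user so the pattern is matched per user, not across all logins.
PatternStream<LoginEvent> patternStream =
    CEP.pattern(loginStream.keyBy(LoginEvent::getUserId), pattern);

DataStream<Alert> alerts = patternStream.select(
    new PatternSelectFunction<LoginEvent, Alert>() {
        @Override
        public Alert select(Map<String, List<LoginEvent>> match) {
            // Each match maps a pattern name to the events that matched it.
            String userId = match.get("failures").get(0).getUserId();
            return new Alert("3+ failed logins in 5 minutes for user " + userId);
        }
    });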

This detects three or more failed login attempts within five minutes. Try doing this with Kafka Streams. You'd need manual state management, timers, and custom logic. Flink CEP handles it declaratively.
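To show what "manual state management" means in practice, here is a minimal hand-rolled version of the same rule in plain Java. It is a sketch under stated assumptions: events arrive in timestamp order per user, state lives in memory, and nothing expires idle users. `FailedLoginDetectorSketch` is a hypothetical class, not Kafka Streams or Flink code.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical hand-rolled detector: the per-key state and expiry logic
// that a CEP library would otherwise manage.
final class FailedLoginDetectorSketch {

    private final int threshold;
    private final long windowMillis;
    private final Map<String, Deque<Long>> failuresByUser = new HashMap<>();

    FailedLoginDetectorSketch(int threshold, long windowMillis) {
        this.threshold = threshold;
        this.windowMillis = windowMillis;
    }

    // Returns true when this failure brings the user to `threshold` or more
    // failures within the window. Assumes in-order timestamps per user.
    boolean onFailedLogin(String userId, long timestampMillis) {
        Deque<Long> failures = failuresByUser.computeIfAbsent(userId, id -> new ArrayDeque<>());
        failures.addLast(timestampMillis);
        // Expire failures that are a full window or more older than the newest one.
        while (!failures.isEmpty() && failures.peekFirst() <= timestampMillis - windowMillis) {
            failures.removeFirst();
        }
        return failures.size() >= threshold;
    }

    public static void main(String[] args) {
        FailedLoginDetectorSketch detector = new FailedLoginDetectorSketch(3, 5 * 60 * 1000L);
        System.out.println(detector.onFailedLogin("alice", 0L));       // false
        System.out.println(detector.onFailedLogin("alice", 60_000L));  // false
        System.out.println(detector.onFailedLogin("alice", 120_000L)); // true: three within five minutes
    }
}
```

Even this small version leaves out fault-tolerant state, out-of-order events, and cleanup of idle keys, which is the work the CEP library takes on.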

Head-to-Head Comparison

Deployment

Factor	Kafka Streams	Flink
Infrastructure	None (runs in your app)	Dedicated cluster (JobManager + TaskManagers)
Deployment	Deploy like any microservice	Submit jobs to Flink cluster
Scaling	Add more app instances	Add more TaskManagers / adjust parallelism
Kubernetes	Standard pods	Flink Kubernetes Operator
Ops burden	Low (just your app)	High (cluster management, checkpointing, upgrades)

Spicy take: If you can solve your problem with Kafka Streams, you should. The operational cost of running a Flink cluster is real -- upgrades, monitoring, checkpoint failures, state migration. Kafka Streams gives you 70% of the capability at 20% of the operational overhead.

Data Sources

Factor	Kafka Streams	Flink
Input sources	Kafka only	Kafka, Kinesis, files, databases, sockets, custom
Output sinks	Kafka only	Kafka, databases, files, Elasticsearch, custom
Multi-source joins	Only after landing the data in Kafka (e.g. via Kafka Connect)	Native support

If all your data is in Kafka and your output goes to Kafka, Kafka Streams wins on simplicity. If you need to join a Kafka stream with a MySQL table and write to Elasticsearch, either use Flink or put Kafka Connect on both sides of a Kafka Streams job.

Processing Capabilities

Capability	Kafka Streams	Flink
Map/Filter/FlatMap	Yes	Yes
Aggregations	Yes (KTable)	Yes (windows, keyed state)
Joins	Stream-stream, stream-table	Stream-stream, stream-table, temporal joins
Windowing	Tumbling, hopping, sliding, session	All of the above + custom windows
Event-time processing	Basic (timestamp extractor)	Advanced (watermarks, late data handling)
CEP	No	Yes (pattern matching library)
Exactly-once	Yes (with Kafka transactions)	Yes (checkpoints + Kafka transactions)
State management	RocksDB, in-memory	RocksDB, heap, pluggable backends
Queryable state	Interactive queries (limited)	Queryable state (limited)
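Of the window types in the table, session windows are the least obvious, because their boundaries depend on the data rather than the clock. A minimal sketch of gap-based sessionization for a single key, assuming timestamps arrive sorted; `SessionWindowSketch` is a hypothetical helper and leaves out the merging of sessions that both engines perform when records arrive out of order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of gap-based session windows for one key; not a library API.
final class SessionWindowSketch {

    // Groups ascending timestamps into sessions. A new session starts when the
    // gap to the previous event is larger than gapMillis.
    static List<List<Long>> sessionize(List<Long> sortedTimestamps, long gapMillis) {
        List<List<Long>> sessions = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        for (long timestamp : sortedTimestamps) {
            boolean gapExceeded = !current.isEmpty()
                && timestamp - current.get(current.size() - 1) > gapMillis;
            if (gapExceeded) {
                sessions.add(current);
                current = new ArrayList<>();
            }
            current.add(timestamp);
        }
        if (!current.isEmpty()) {
            sessions.add(current);
        }
        return sessions;
    }

    public static void main(String[] args) {
        long thirtyMinutes = 30 * 60 * 1000L;
        List<Long> clicks = List.of(0L, 10_000L, 20_000L, 2_000_000L, 2_005_000L);
        // Three clicks close together, a pause longer than 30 minutes, then two more clicks.
        System.out.println(sessionize(clicks, thirtyMinutes));
    }
}
```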

Performance

Metric	Kafka Streams	Flink
Throughput	100K-500K events/sec per instance (workload-dependent)	Millions of events/sec per cluster (workload-dependent)
Latency	Low (ms)	Low (ms; tunable via buffer timeout)
State size	Limited by local disk	Limited by local disk per TaskManager; checkpoints go to remote storage (S3)
Checkpointing	Changelog topics in Kafka	Async checkpoints to S3/HDFS
Recovery time	Minutes (replay changelog)	Seconds (restore from checkpoint)

Alibaba runs Flink at billions of events per second during Singles' Day. That's not a typo. They've pushed Flink harder than anyone. You won't hit those numbers with Kafka Streams.

Decision Framework

Use Kafka Streams When

All data is in Kafka. Input from Kafka, output to Kafka.
Transformations are simple. Filter, map, aggregate, join with one other stream/table.
Your team doesn't want to run another cluster. Embed it in your microservice.
State is small to medium. Fits on local disk (tens of GB).
Event-time processing is not critical. Processing-time is acceptable.

Use Flink When

Multiple data sources. Join Kafka with a database changelog or file stream.
Complex windowing. Event-time windows with watermarks and late data handling.
Pattern matching. CEP for fraud detection, anomaly detection.
Massive state. Terabytes of state that needs remote checkpointing.
Very high throughput. Millions of events per second across a cluster, at low latency.

Decision Table

Scenario	Kafka Streams	Flink
Filter + transform Kafka events	Best	Overkill
Real-time dashboard from Kafka	Good	Good
Join Kafka stream with MySQL CDC	Possible (CDC must land in Kafka via Connect first)	Best
Session window analytics	Possible	Better
Fraud detection with pattern matching	Hard	Best
Simple event counting	Best	Overkill
Multi-DC stream processing	Limited	Better
Team has no JVM experience	Neither (consider ksqlDB)	Possible (Flink SQL, PyFlink)

Kafka Connect: The Plumbing Layer

Before you write custom producers and consumers, check if a Kafka Connect connector already exists.

What It Is

Kafka Connect is a framework for moving data between Kafka and external systems. No code required -- just configuration.

Source Connectors (External -> Kafka)

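{
  "name": "postgres-source",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "plugin.name": "pgoutput",
    "database.hostname": "db.example.com",
    "database.port": "5432",
    "database.user": "<replication-user>",
    "database.password": "<password>",
    "database.dbname": "orders",
    "table.include.list": "public.orders,public.users",
    "topic.prefix": "cdc"
  }
}

This captures every INSERT, UPDATE, and DELETE on the listed tables and publishes them to the Kafka topics cdc.public.orders and cdc.public.users. The user and password values are placeholders. Debezium (by Red Hat) is the most popular CDC connector. It reads the PostgreSQL WAL through logical replication, so it captures changes without polling the database.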

Sink Connectors (Kafka -> External)

{
  "name": "elasticsearch-sink",
  "config": {
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "topics": "orders",
    "connection.url": "http://elasticsearch:9200",
    "type.name": "_doc",
    "key.ignore": "false"
  }
}

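This reads from the orders topic and writes to Elasticsearch. With key.ignore set to false, the record key becomes the document ID, so a redelivered record overwrites the same document and writes are idempotent. If Elasticsearch goes down, the connector retries with backoff (max.retries, retry.backoff.ms); once retries are exhausted the task fails and has to be restarted.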

Common Connectors

Connector	Direction	Use Case
Debezium PostgreSQL	Source	CDC from PostgreSQL
Debezium MySQL	Source	CDC from MySQL
JDBC Source	Source	Poll-based DB ingestion
Elasticsearch Sink	Sink	Search index sync
S3 Sink	Sink	Data lake archival
HDFS Sink	Sink	Hadoop ingestion
MongoDB Source/Sink	Both	MongoDB CDC + writes

Why it matters for interviews: "We don't write custom Kafka producers for database changes. We use Debezium with Kafka Connect. It reads the WAL, captures every change, and publishes to Kafka. Zero application code. When the database schema changes, the connector reflects the change in the event schema, and Schema Registry compatibility rules keep downstream consumers safe."

Schema Registry

When producers and consumers evolve independently, message schemas drift. The Schema Registry prevents this.

What It Does

Producers register their schema (Avro, Protobuf, or JSON Schema) with the registry.
The registry assigns a schema ID.
Producers send messages with the schema ID embedded.
Consumers fetch the schema from the registry using the ID.
The registry enforces compatibility rules on schema evolution.
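The "schema ID embedded" step can be made concrete. Confluent's serializers prefix each message with a magic byte (0) and the 4-byte big-endian schema ID, followed by the serialized payload. The sketch below shows only that framing; `SchemaIdFramingSketch` is a hypothetical class, the payload bytes are arbitrary, and real code would use the Confluent serializers instead.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical sketch of schema-ID framing: magic byte, 4-byte schema ID, payload.
final class SchemaIdFramingSketch {

    private static final byte MAGIC_BYTE = 0x0;
    private static final int HEADER_LENGTH = 5;

    // Producer side: prepend the header to an already serialized payload.
    static byte[] frame(int schemaId, byte[] payload) {
        return ByteBuffer.allocate(HEADER_LENGTH + payload.length)
            .put(MAGIC_BYTE)
            .putInt(schemaId) // ByteBuffer writes big-endian by default
            .put(payload)
            .array();
    }

    // Consumer side: read the ID so the schema can be fetched from the registry.
    static int schemaIdOf(byte[] message) {
        ByteBuffer buffer = ByteBuffer.wrap(message);
        if (buffer.remaining() < HEADER_LENGTH || buffer.get() != MAGIC_BYTE) {
            throw new IllegalArgumentException("not a schema-ID framed message");
        }
        return buffer.getInt();
    }

    // Consumer side: the bytes to hand to the deserializer once the schema is known.
    static byte[] payloadOf(byte[] message) {
        schemaIdOf(message); // validates the header
        return Arrays.copyOfRange(message, HEADER_LENGTH, message.length);
    }

    public static void main(String[] args) {
        byte[] framed = frame(42, new byte[] {1, 2, 3});
        System.out.println(framed.length);      // header plus payload
        System.out.println(schemaIdOf(framed)); // the ID that was written
    }
}
```

Because only the ID travels with each message, the five-byte header is the whole per-message cost of carrying schema information.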

Compatibility Modes

Mode	What You Can Do	What You Can't Do
BACKWARD	Add optional fields, remove fields	Add required fields
FORWARD	Add fields, remove optional fields	Remove required fields
FULL	Add/remove optional fields only	Add/remove required fields
NONE	Anything	Nothing is checked

BACKWARD is the default and the most practical. Consumers on the new schema can read data written with the old one. Consumers still on the old schema are not guaranteed to read new data, so upgrade consumers before producers.
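The BACKWARD rule reduces to one check: every field the new schema expects but old data lacks must have a default. The sketch below models a schema as a map from field name to "has a default" and ignores type changes, so it is a deliberately simplified illustration; `BackwardCompatibilitySketch` is hypothetical and far less thorough than the registry's real checks.

```java
import java.util.Map;

// Hypothetical, simplified BACKWARD check: can a reader on newSchema
// decode data written with oldSchema? Type changes are ignored.
final class BackwardCompatibilitySketch {

    // Schema model: field name -> whether the field declares a default value.
    static boolean isBackwardCompatible(Map<String, Boolean> oldSchema,
                                        Map<String, Boolean> newSchema) {
        for (Map.Entry<String, Boolean> field : newSchema.entrySet()) {
            boolean missingInOldData = !oldSchema.containsKey(field.getKey());
            boolean hasDefault = field.getValue();
            if (missingInOldData && !hasDefault) {
                return false; // the reader needs a value that old records cannot supply
            }
        }
        return true; // fields removed in newSchema are simply ignored by the reader
    }

    public static void main(String[] args) {
        Map<String, Boolean> v1 = Map.of("id", false, "amount", false);

        // Adding an optional field (with a default) is allowed.
        System.out.println(isBackwardCompatible(v1,
            Map.of("id", false, "amount", false, "currency", true)));
        // Adding a required field (no default) is rejected.
        System.out.println(isBackwardCompatible(v1,
            Map.of("id", false, "amount", false, "currency", false)));
        // Removing a field is allowed.
        System.out.println(isBackwardCompatible(v1, Map.of("id", false)));
    }
}
```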

Why Avro Over JSON

Avro: Schema stored separately. Messages are compact binary. Schema evolution with compatibility checks. Smaller payloads.
JSON: Schema is implicit. Messages include field names. No compatibility enforcement. Larger payloads.
Protobuf: Similar to Avro but with better cross-language support. Google's standard.

Confluent (the company founded by Kafka's original creators) recommends Avro. Google shops use Protobuf. JSON works for small-scale systems but becomes a liability at scale because of payload size and schema drift.

When NOT to Use Kafka

Kafka is not a universal solution. Here's when other tools are better.

Scenario	Use Instead	Why
< 1K messages/sec	SQS or RabbitMQ	Zero ops overhead. Kafka's complexity isn't justified.
Request-reply	HTTP/gRPC	Kafka is async. Request-reply with Kafka is an anti-pattern.
Message priority	RabbitMQ	Kafka has no priority queue concept.
Exactly-once to external DB	Outbox pattern + polling	Simpler than Kafka transactions for DB consistency.
Temporary task queues	Celery + RabbitMQ/Redis	Better worker semantics (retry, dead-letter, priorities).
Small team, simple needs	AWS SNS + SQS	Fully managed, no cluster to run.

Spicy take: If your startup has 5 engineers and you're debating Kafka vs SQS, use SQS. You don't have the headcount to operate Kafka properly. You'll spend more time debugging broker issues than building features. Kafka is for organizations that have outgrown managed queues. Not everyone has.

Patterns for System Design Interviews

Pattern 1: CDC Pipeline

"Debezium captures PostgreSQL changes via Kafka Connect. Changes flow to Kafka topics. Flink reads from Kafka, joins the user stream with the order stream, and writes enriched records to Elasticsearch. Schema Registry enforces backward compatibility. If we need to add a field to orders, we update the Avro schema, register it, and consumers handle the new field gracefully."

Pattern 2: Simple Event Processing

"We use Kafka Streams embedded in our notification service. It reads from the user-events topic, filters for events that require notifications, enriches them with user preferences from a KTable, and writes notification requests to the notifications topic. No separate cluster. Deployed as a standard Kubernetes deployment."

Pattern 3: Complex Analytics Pipeline

"Flink reads click events from Kafka with event-time processing. Watermarks handle out-of-order events (bounded by 30 seconds). Session windows group clicks by user session. CEP detects suspicious patterns (bot clicks). Results go to both Kafka (for real-time dashboards) and S3 (for batch analytics)."


Trade-Offs Table

Factor	Kafka Streams	Flink	Kafka Connect
Purpose	Stream processing (Kafka-only)	General stream processing	Data integration
Ops overhead	Low (runs in your app)	High (separate cluster)	Low (runs on Connect workers)
Data sources	Kafka only	Any	100+ pre-built connectors
Custom logic	Full Java/Scala	Full Java/Scala/Python/SQL	Configuration only
Exactly-once	Yes (Kafka transactions)	Yes (checkpoints)	Depends on connector
Scaling	Add app instances	Adjust parallelism	Add Connect workers
State	Local (RocksDB)	Distributed (pluggable)	Offsets only
Learning curve	Medium	High	Low
Best for	Simple transforms	Complex analytics	Moving data in/out of Kafka

Interview Gotchas

"When would you use Kafka Streams over Flink?"

"When all data is in Kafka, transformations are straightforward, and we don't want to operate a separate cluster. Kafka Streams runs inside our application -- same deployment, same monitoring, same CI/CD. Flink is worth the operational overhead when we need event-time processing with watermarks, multi-source joins, or CEP."

"What's ksqlDB?"

"SQL interface for Kafka Streams. Write SQL queries instead of Java code. Good for prototyping and simple transformations. Not suitable for complex logic -- you'll eventually need Java. Think of it as the training wheels version of Kafka Streams."

"How do you handle schema evolution in Kafka?"

"Schema Registry with Avro serialization. BACKWARD compatibility mode -- new consumers can read old messages. When we add a field, it has a default value so old messages without the field still deserialize correctly. Breaking changes (renaming a field, changing a type) require a new topic."

"What's the relationship between Kafka Connect and Kafka Streams?"

"Kafka Connect moves data into and out of Kafka. Kafka Streams processes data already in Kafka. In a typical pipeline: Connect ingests data from PostgreSQL into Kafka, Streams transforms it, Connect writes the output to Elasticsearch. They're complementary, not competing."

"Can Flink replace Kafka?"

"No. Flink is a processing engine. Kafka is a storage and messaging system. Flink reads from Kafka, processes events, and writes results somewhere. You can use Flink without Kafka (reading from files or databases), but in most architectures they work together."

Summary

Kafka Streams and Flink solve the same problem differently. Kafka Streams is a library -- simple to deploy, limited to Kafka sources, good enough for most use cases. Flink is a framework -- powerful, multi-source, complex to operate, needed for event-time processing and advanced analytics. Kafka Connect handles data integration without code. Schema Registry prevents schema drift. Know which tool fits which scenario, and more importantly, know when Kafka itself is overkill. Most systems don't need Flink's power. Many don't even need Kafka. The best architecture uses the simplest tool that solves the problem.