
Real-Time Data Pipelines with Apache Kafka

A practical guide to designing, building, and operating production-grade Kafka pipelines for enterprise data streaming.

Marcus Rodriguez

Data Engineer

November 1, 2024

Apache Kafka has become the central nervous system of modern data architectures. But building reliable, production-grade Kafka pipelines requires more than understanding the API; it demands careful attention to partitioning strategies, consumer group design, exactly-once semantics, and operational excellence. This guide covers the patterns and practices that separate toy implementations from enterprise-grade streaming platforms.

Why Kafka for Enterprise Data Pipelines?

Kafka's architecture provides unique advantages for real-time data processing:

Durable commit log: Messages are persisted to disk and replicated, providing durability without sacrificing throughput
Horizontal scalability: Add partitions and brokers to scale throughput close to linearly
Replay capability: Consumers can reprocess messages from any retained point in the log
Decoupled producers and consumers: Teams produce and consume data independently through topics
Ecosystem maturity: Connect, Streams, and Schema Registry form a complete platform

Architecture Patterns

Pattern 1: Event Sourcing with CQRS

Use Kafka as the source of truth for your domain events:

All state changes are published as events to Kafka topics
Consumers build read models (projections) from the event stream
The event log serves as both the integration layer and the audit trail
Replay events to rebuild any projection at any time

This pattern works particularly well for financial systems, inventory management, and order processing where auditability is critical.
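
The replay idea above can be sketched without a broker. The following is a minimal illustration, assuming a hypothetical inventory domain; the event names and fields (StockAdded, StockRemoved, sku, qty) are invented for the example, and a plain list stands in for the Kafka topic.

```python
from collections import defaultdict

# Hypothetical inventory events in log order; a list stands in for the topic.
EVENT_LOG = [
    {"type": "StockAdded", "sku": "A1", "qty": 10},
    {"type": "StockRemoved", "sku": "A1", "qty": 3},
    {"type": "StockAdded", "sku": "B2", "qty": 5},
]

def apply_event(stock, event):
    """Fold one event into the stock-level read model."""
    if event["type"] == "StockAdded":
        stock[event["sku"]] += event["qty"]
    elif event["type"] == "StockRemoved":
        stock[event["sku"]] -= event["qty"]
    return stock

def rebuild_projection(events):
    """Replay the log from the beginning to rebuild the projection."""
    stock = defaultdict(int)
    for event in events:
        apply_event(stock, event)
    return dict(stock)

projection = rebuild_projection(EVENT_LOG)
```

Because the projection is a pure fold over the log, discarding it and replaying from offset zero yields the same read model, which is what makes the log usable as an audit trail.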

Pattern 2: Change Data Capture (CDC)

Stream database changes in real-time:

Use Debezium connectors to capture row-level changes from databases
Publish change events to Kafka with before/after values
Consumers react to changes: update search indexes, invalidate caches, trigger workflows
Enables real-time analytics on operational data without query impact on the source database
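
As a rough sketch of the consumer side, the snippet below applies change events to an in-memory index. It assumes a simplified version of the Debezium envelope (the op, before, and after fields) and a hypothetical table with an id primary key; real events carry additional metadata such as source and ts_ms.

```python
def apply_change(index, change):
    """Apply one Debezium-style change event to an index keyed by primary key."""
    op = change["op"]
    if op in ("c", "r", "u"):          # create, snapshot read, update
        row = change["after"]
        index[row["id"]] = row
    elif op == "d":                    # delete: only "before" is populated
        index.pop(change["before"]["id"], None)
    return index

# Hypothetical change stream for a products table.
changes = [
    {"op": "c", "before": None, "after": {"id": 1, "name": "widget", "price": 10}},
    {"op": "u", "before": {"id": 1, "name": "widget", "price": 10},
     "after": {"id": 1, "name": "widget", "price": 12}},
    {"op": "c", "before": None, "after": {"id": 2, "name": "gadget", "price": 7}},
    {"op": "d", "before": {"id": 2, "name": "gadget", "price": 7}, "after": None},
]

search_index = {}
for change in changes:
    apply_change(search_index, change)
```

The same handler shape works for cache invalidation: replace the dictionary write with an eviction of the affected key.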

Pattern 3: Stream Processing with Kafka Streams

Build stateful stream processing applications:

Use Kafka Streams DSL for common operations: filter, map, aggregate, join
Leverage interactive queries for real-time lookups into materialized state stores
Implement windowed aggregations for time-based metrics (5-minute averages, hourly totals)
Use split/branch operations to route events to different output topics based on content (newer Kafka Streams versions favor split() over the deprecated branch())
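
To make the windowed aggregation concrete, here is a plain-Python sketch of a 5-minute tumbling-window average. It is not the Kafka Streams API; it only illustrates how a timestamp is aligned to a window, and the (key, timestamp, value) event shape is an assumption for the example.

```python
from collections import defaultdict

WINDOW_MS = 5 * 60 * 1000  # 5-minute tumbling windows

def window_start(timestamp_ms, size_ms=WINDOW_MS):
    """Align an event timestamp to the start of its tumbling window."""
    return timestamp_ms - (timestamp_ms % size_ms)

def windowed_average(events):
    """Return {(key, window_start_ms): average} for (key, ts_ms, value) events."""
    sums = defaultdict(lambda: [0.0, 0])   # running [sum, count] per window
    for key, ts_ms, value in events:
        bucket = sums[(key, window_start(ts_ms))]
        bucket[0] += value
        bucket[1] += 1
    return {k: total / count for k, (total, count) in sums.items()}

events = [
    ("sensor-1", 0, 10.0),
    ("sensor-1", 299_999, 20.0),   # last millisecond of the first window
    ("sensor-1", 300_000, 40.0),   # first millisecond of the second window
]
averages = windowed_average(events)
```

A stream processor keeps the same running sum and count in a state store rather than in process memory, so the aggregate survives restarts.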

Pattern 4: Lambda Architecture Simplified

Combine batch and streaming with a unified Kafka layer:

Raw events land in Kafka topics
A fast path processes events in real-time via Kafka Streams
A batch path processes the same events periodically via Spark or Flink
A serving layer merges both views for queries

With Kafka as the single source of truth, both paths read the same topics, so you avoid maintaining separate ingestion pipelines for batch and streaming.
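
The serving-layer merge can be sketched as follows. This assumes both views hold additive counts and that the fast-path view only covers events newer than the last batch run; the page-view keys are hypothetical.

```python
def merge_views(batch_view, speed_view):
    """Add fast-path increments on top of the totals from the last batch run."""
    merged = dict(batch_view)
    for key, delta in speed_view.items():
        merged[key] = merged.get(key, 0) + delta
    return merged

batch_view = {"page:/home": 1000, "page:/pricing": 250}   # from the batch path
speed_view = {"page:/home": 12, "page:/signup": 3}        # since the batch cutoff
serving_view = merge_views(batch_view, speed_view)
```

When a new batch run completes, the fast-path view is reset to cover only events after the new cutoff, otherwise the overlap would be counted twice.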

Partitioning Strategy

Partitioning is the most critical design decision in any Kafka deployment:

Partition Key Selection

Your partition key determines message ordering and distribution:

Entity ID: Guarantees ordering for a single entity (user_id, order_id)
Composite key: Combines entity and event type for more granular ordering
Round-robin: Maximum distribution but no ordering guarantees
Custom partitioner: Business logic for complex distribution requirements
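
The link between key and ordering comes from hashing: the same key always lands on the same partition. The sketch below uses CRC32 as a stand-in hash so it runs with the standard library alone; Kafka's Java client hashes the serialized key with murmur2, so the partition numbers here will not match a real cluster.

```python
import zlib

def partition_for(key, num_partitions):
    """Map a key to a partition with a stable hash (CRC32 as a stand-in for murmur2)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def composite_key(entity_id, event_type):
    """Build a composite key; ordering then holds per (entity, event type) pair."""
    return f"{entity_id}:{event_type}"

first = partition_for("user-42", 12)
second = partition_for("user-42", 12)      # same key, same partition
composite = partition_for(composite_key("user-42", "login"), 12)
```

Note the trade-off with composite keys: events of different types for the same entity may land on different partitions, so ordering across event types is no longer guaranteed.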

Partition Count Sizing

Each partition provides:

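Throughput of roughly 10 MB/s for produce and consume as a rule of thumb; the real figure depends on hardware, message size, compression, and replication, so benchmark your own cluster
At most one active consumer per partition within a consumer group, which caps consumer parallelism at the partition count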
Overhead for replication, compaction, and broker recovery

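A practical approach: start with enough partitions for twice your current throughput, plan for growth, and remember that you can add partitions but cannot remove them. Adding partitions also changes which partition a given key maps to, so per-key ordering is only preserved for messages produced after the change.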
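
The sizing rule can be written as a small calculation. This is a sketch under the article's rule-of-thumb numbers (about 10 MB/s per partition, 2x growth); the function name and the max_consumers parameter are illustrative, not part of any Kafka tooling.

```python
import math

def partitions_needed(target_mb_per_s, per_partition_mb_per_s=10.0,
                      growth_factor=2.0, max_consumers=1):
    """Partitions needed for throughput with headroom, and for consumer parallelism."""
    by_throughput = math.ceil(target_mb_per_s * growth_factor / per_partition_mb_per_s)
    # A group cannot have more active consumers than partitions.
    return max(by_throughput, max_consumers)

throughput_bound = partitions_needed(target_mb_per_s=120, max_consumers=16)
parallelism_bound = partitions_needed(target_mb_per_s=5, max_consumers=8)
```

In the first case throughput drives the count; in the second the topic is small, but eight consumers still need eight partitions to all receive work.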

Hot Partition Mitigation

When some keys generate disproportionate traffic:

Salting: Append a random suffix to the key for distribution, then aggregate across salted partitions
Splitting: Give hot entities their own dedicated partitions
Compensation: Use a two-stage approach where events first go to a distribution topic, then a dedicated consumer re-keys and routes them
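
Salting can be illustrated with a two-stage count. The sketch assumes a hypothetical fan-out of 8 salt buckets and a "#" separator that does not otherwise appear at the end of keys.

```python
import random
from collections import Counter

SALT_BUCKETS = 8  # hypothetical fan-out for a hot key

def salted_key(key, rng=random):
    """Spread one hot key across SALT_BUCKETS distinct keys."""
    return f"{key}#{rng.randrange(SALT_BUCKETS)}"

def unsalt(key):
    """Recover the original key by dropping the salt suffix."""
    return key.rsplit("#", 1)[0]

def aggregate(salted_counts):
    """Second stage: combine partial counts from all salted keys."""
    totals = Counter()
    for key, count in salted_counts.items():
        totals[unsalt(key)] += count
    return dict(totals)

rng = random.Random(7)  # seeded only to make the example repeatable
stage_one = Counter(salted_key("tenant-big", rng) for _ in range(1000))
totals = aggregate(stage_one)
```

The cost of salting is ordering: events for the hot key are no longer in a single partition, so it suits commutative aggregations such as counts and sums better than order-sensitive processing.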

Exactly-Once Semantics

Achieving exactly-once processing in Kafka requires coordination across three levels:

Producer Level

Enable idempotent producer (enable.idempotence=true)
Use transactional producer for multi-partition writes
Set acks=all for maximum durability

Consumer Level

Use read_committed isolation level with transactional producers
Store consumer offsets in the same transaction as the output
Implement idempotent consumer logic as a safety net
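
The "safety net" in the last point can be as simple as remembering which events have already been applied. The sketch below assumes each message carries a unique event_id (a hypothetical field) and keeps the seen set in memory; a real consumer would persist it alongside the state it protects.

```python
class IdempotentHandler:
    """Apply each message at most once by remembering processed event IDs."""

    def __init__(self):
        self.seen = set()   # in production this would live in a durable store
        self.balance = 0

    def handle(self, message):
        if message["event_id"] in self.seen:
            return False    # duplicate delivery: skip the side effect
        self.balance += message["amount"]
        self.seen.add(message["event_id"])
        return True

handler = IdempotentHandler()
deliveries = [
    {"event_id": "e1", "amount": 50},
    {"event_id": "e2", "amount": 25},
    {"event_id": "e1", "amount": 50},   # redelivered, e.g. after a rebalance
]
results = [handler.handle(m) for m in deliveries]
```

For the deduplication to be reliable, the seen-ID record and the state change need to be committed together, for example in the same database transaction.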

Processing Level

Kafka Streams supports exactly-once via processing.guarantee=exactly_once_v2
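For external systems, use the outbox pattern: write the business change and an outbox row in the same database transaction, then have a relay (for example a CDC connector) publish the outbox rows to Kafka. This avoids a dual write to the database and Kafka, which cannot share a transaction.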
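
A minimal sketch of the outbox pattern, using SQLite so it runs standalone. The table layout, the place_order and relay names, and the publish callback are all assumptions for illustration; in practice the relay is often a CDC connector reading the outbox table rather than a polling loop.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, total INTEGER)")
db.execute("CREATE TABLE outbox (seq INTEGER PRIMARY KEY AUTOINCREMENT, "
           "topic TEXT, key TEXT, payload TEXT, published INTEGER DEFAULT 0)")

def place_order(order_id, total):
    """Write the business row and the outbox row in one local transaction."""
    with db:  # commits on success, rolls back if either insert fails
        db.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total))
        db.execute("INSERT INTO outbox (topic, key, payload) VALUES (?, ?, ?)",
                   ("orders", order_id, json.dumps({"id": order_id, "total": total})))

def relay(publish):
    """Publish pending outbox rows in order, marking each one after it is sent."""
    rows = db.execute("SELECT seq, topic, key, payload FROM outbox "
                      "WHERE published = 0 ORDER BY seq").fetchall()
    for seq, topic, key, payload in rows:
        publish(topic, key, payload)   # stand-in for a real producer send
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
    return len(rows)

sent = []
place_order("o-1", 99)
relayed = relay(lambda topic, key, payload: sent.append((topic, key, payload)))
```

If the relay crashes after publishing but before marking the row, the row is sent again on restart, so delivery is at-least-once and downstream consumers still need to be idempotent.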

Schema Management

Schema evolution is critical for long-running Kafka deployments:

Schema Registry: Centralize schema storage and enforce compatibility rules
Backward compatibility: New schemas can read data written by old schemas (default and recommended)
Evolution strategies: Add optional fields with defaults; never remove or rename required fields
Schema IDs in messages: Each message carries its schema ID; consumers fetch the schema at deserialization time
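
To show how a schema ID travels with a message, the sketch below frames a payload the way the Confluent serializers do: one magic byte (0), a 4-byte big-endian schema ID, then the encoded payload. The SchemaCache class and its fetch callback are hypothetical helpers standing in for a registry client.

```python
import struct

MAGIC_BYTE = 0

def frame(schema_id, payload):
    """Prefix the encoded payload with the magic byte and schema ID."""
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + payload

def unframe(message):
    """Split a framed message into (schema_id, payload)."""
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte {magic}")
    return schema_id, message[5:]

class SchemaCache:
    """Fetch each schema once by ID and reuse it for later messages."""

    def __init__(self, fetch):
        self._fetch = fetch        # stand-in for a registry lookup
        self._schemas = {}

    def get(self, schema_id):
        if schema_id not in self._schemas:
            self._schemas[schema_id] = self._fetch(schema_id)
        return self._schemas[schema_id]

lookups = []
cache = SchemaCache(lambda sid: lookups.append(sid) or {"id": sid})
for raw in (frame(42, b"a"), frame(42, b"b")):
    schema_id, body = unframe(raw)
    cache.get(schema_id)
```

Caching by ID matters because the lookup is a network call; with the cache, the registry is contacted once per schema rather than once per message.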

Operational Excellence

Monitoring

Track these critical metrics:

Under-replicated partitions: Indicates broker health issues
Consumer lag: How far behind the consumer group is from the latest offset
Produce/consume throughput: Bytes and records per second per topic
Request latency: P99 produce and fetch latencies
Disk usage: Kafka is disk-heavy; monitor and plan capacity
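
Consumer lag is simple arithmetic per partition: the log end offset minus the group's committed offset. A small sketch, assuming the offsets have already been collected from the cluster and treating a partition with no committed offset as starting from 0 (a simplification; the real starting point depends on the consumer's offset reset policy):

```python
def consumer_lag(end_offsets, committed_offsets):
    """Per-partition lag: log end offset minus the group's committed offset."""
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in end_offsets.items()
    }

# Hypothetical snapshot for a three-partition topic.
end_offsets = {0: 1_020, 1: 980, 2: 500}
committed = {0: 1_000, 1: 980}          # partition 2 has no commit yet
lag = consumer_lag(end_offsets, committed)
total_lag = sum(lag.values())
```

Alerting on the trend of total lag, rather than a single threshold, helps distinguish a brief burst from a consumer that is falling steadily behind.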

Capacity Planning

Throughput: Plan for peak, not average; add 50% headroom
Storage: Calculate based on retention period, throughput, and replication factor
Network: Replication and consumer traffic can double your network requirements
Broker count: Minimum 3 brokers for production; more for higher throughput and availability
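
The storage calculation can be worked through as below. It is a back-of-the-envelope sketch: it uses average throughput (retention integrates over time), ignores compression and index overhead, and applies the 50% headroom to storage as an assumption.

```python
def storage_needed_gb(avg_mb_per_s, retention_days, replication_factor, headroom=0.5):
    """Estimate cluster-wide disk needed for one topic's retained data, in GB."""
    seconds = retention_days * 24 * 60 * 60
    raw_gb = avg_mb_per_s * seconds / 1024        # one copy of the data
    return raw_gb * replication_factor * (1 + headroom)

# 20 MB/s average, 7-day retention, replication factor 3.
estimate = storage_needed_gb(avg_mb_per_s=20, retention_days=7, replication_factor=3)
```

For this example, one copy is about 11.8 TB, three replicas about 35.4 TB, and with headroom the cluster needs roughly 53 TB spread across its brokers.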

Failure Scenarios

Broker failure: Partition leadership transfers automatically; set min.insync.replicas below the replication factor (commonly 2 with a replication factor of 3) so that acks=all writes survive the loss of one broker
Consumer failure: Rebalance triggers; use the cooperative sticky assignor to minimize disruption
ZooKeeper/KRaft failure: Monitor controller health; KRaft mode eliminates the ZooKeeper dependency entirely
Disk failure: Replace the disk and let Kafka rebuild the replica from the leader
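
The interaction between replication factor and min.insync.replicas during a broker failure can be expressed in two lines of arithmetic. The helper names are illustrative; the rule they encode is that with acks=all, a partition accepts writes only while the in-sync replica count is at least min.insync.replicas.

```python
def can_accept_writes(in_sync_replicas, min_insync_replicas):
    """With acks=all, writes succeed only while enough replicas are in sync."""
    return in_sync_replicas >= min_insync_replicas

def tolerated_broker_failures(replication_factor, min_insync_replicas):
    """How many replicas can be lost before acks=all writes are rejected."""
    return replication_factor - min_insync_replicas

tolerated = tolerated_broker_failures(replication_factor=3, min_insync_replicas=2)
after_one_failure = can_accept_writes(in_sync_replicas=2, min_insync_replicas=2)
after_two_failures = can_accept_writes(in_sync_replicas=1, min_insync_replicas=2)
```

Setting min.insync.replicas equal to the replication factor leaves no tolerance at all: a single broker restart would block producers that use acks=all.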

Anti-Patterns to Avoid

Using Kafka as a database: It's an event log, not a query engine; use it with a database for state
Large messages: Keep messages under 1 MB; use a reference pattern for large payloads
Too many topics: Group related events; use headers or a type field within a topic
Ignoring consumer lag: Lag is a leading indicator of system stress; alert on it early
Tight coupling: Use schema contracts and API versioning between producer and consumer teams
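
The reference pattern for large payloads can be sketched as follows. A dictionary stands in for object storage, and the 1,000,000-byte threshold, the message fields, and the use of a SHA-256 digest as the reference are all assumptions made for the example.

```python
import hashlib

MAX_INLINE_BYTES = 1_000_000   # hypothetical cutoff, just under 1 MB

blob_store = {}                # stand-in for object storage such as a bucket

def to_message(payload):
    """Send small payloads inline; store large ones and send a reference."""
    if len(payload) <= MAX_INLINE_BYTES:
        return {"inline": True, "body": payload}
    ref = hashlib.sha256(payload).hexdigest()
    blob_store[ref] = payload
    return {"inline": False, "ref": ref, "size": len(payload)}

def resolve(message):
    """Consumer side: return the payload, fetching it by reference if needed."""
    return message["body"] if message["inline"] else blob_store[message["ref"]]

small = to_message(b"x" * 10)
large = to_message(b"y" * 2_000_000)
```

The Kafka message stays small and cheap to replicate, while consumers that need the full payload pay for the fetch only when they resolve the reference.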

Conclusion

Kafka is powerful but demands respect for its design principles. By choosing appropriate partitioning strategies, implementing exactly-once semantics, managing schemas carefully, and investing in operational tooling, you can build data pipelines that are reliable, scalable, and maintainable. The key is to start simple, measure everything, and iterate based on real production data.