Engineering

Real-Time Data Pipelines with Apache Kafka

A practical guide to designing, building, and operating production-grade Kafka pipelines for enterprise data streaming.

Marcus Rodriguez

Data Engineer

13 min read

Apache Kafka has become the central nervous system of modern data architectures. But building reliable, production-grade Kafka pipelines requires more than understanding the API — it demands careful attention to partitioning strategies, consumer group design, exactly-once semantics, and operational excellence. This guide covers the patterns and practices that separate toy implementations from enterprise-grade streaming platforms.

Why Kafka for Enterprise Data Pipelines?

Kafka's architecture provides unique advantages for real-time data processing:

  • Durable commit log — Messages are persisted to disk and replicated, providing durability without sacrificing throughput
  • Horizontal scalability — Add partitions and brokers to scale throughput close to linearly
  • Replay capability — Consumers can reprocess messages from any point in the log
  • Decoupled producers and consumers — Teams produce and consume data independently through topics
  • Ecosystem maturity — Connect, Streams, and Schema Registry form a complete platform

Architecture Patterns

Pattern 1: Event Sourcing with CQRS

Use Kafka as the source of truth for your domain events:

  1. All state changes are published as events to Kafka topics
  2. Consumers build read models (projections) from the event stream
  3. The event log serves as both the integration layer and the audit trail
  4. Replay events to rebuild any projection at any time

This pattern works particularly well for financial systems, inventory management, and order processing where auditability is critical.
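
As a minimal sketch of steps 2 and 4, the following plain-Python example rebuilds a balance projection by replaying an event list from the beginning. The event shape (`account`, `type`, `amount`) and the in-memory list standing in for a Kafka topic are assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical event log; in a real system these would be records read from a Kafka topic.
EVENTS = [
    {"account": "A", "type": "deposited", "amount": 100},
    {"account": "B", "type": "deposited", "amount": 50},
    {"account": "A", "type": "withdrawn", "amount": 30},
]

def apply_event(balances, event):
    """Apply one event to the read model in place."""
    if event["type"] == "deposited":
        balances[event["account"]] += event["amount"]
    elif event["type"] == "withdrawn":
        balances[event["account"]] -= event["amount"]
    # Unknown event types are ignored so the projection tolerates new event kinds.

def rebuild_projection(events):
    """Replay the log from the start to rebuild the balances projection."""
    balances = defaultdict(int)
    for event in events:
        apply_event(balances, event)
    return dict(balances)

balances = rebuild_projection(EVENTS)
```

Because the projection is a pure function of the log, replaying the same events always yields the same read model, which is what makes rebuilds safe.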

Pattern 2: Change Data Capture (CDC)

Stream database changes in real time:

  1. Use Debezium connectors to capture row-level changes from databases
  2. Publish change events to Kafka with before/after values
  3. Consumers react to changes: update search indexes, invalidate caches, trigger workflows
  4. Run real-time analytics on operational data without adding query load to the source database
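
A consumer reacting to change events might look roughly like the sketch below. It assumes a simplified Debezium-style envelope with `op`, `before`, and `after` fields (real events also carry source metadata and timestamps), and it uses a plain dict as a stand-in for a search index or cache.

```python
def apply_change(index, event, key_field="id"):
    """Apply one change event to an in-memory index (stand-in for a search index or cache)."""
    op = event["op"]
    if op in ("c", "r", "u"):      # create, snapshot read, update
        row = event["after"]
        index[row[key_field]] = row
    elif op == "d":                # delete: only "before" is populated
        index.pop(event["before"][key_field], None)
    return index

index = {}
changes = [
    {"op": "c", "before": None, "after": {"id": 1, "name": "widget", "price": 10}},
    {"op": "u", "before": {"id": 1, "name": "widget", "price": 10},
                "after": {"id": 1, "name": "widget", "price": 12}},
    {"op": "c", "before": None, "after": {"id": 2, "name": "gadget", "price": 5}},
    {"op": "d", "before": {"id": 2, "name": "gadget", "price": 5}, "after": None},
]
for change in changes:
    apply_change(index, change)
```

Writing the whole `after` row keyed by primary key makes the handler safe to re-run when an event is delivered more than once.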

Pattern 3: Stream Processing with Kafka Streams

Build stateful stream processing applications:

  1. Use Kafka Streams DSL for common operations: filter, map, aggregate, join
  2. Leverage interactive queries for real-time lookups into materialized state stores
  3. Implement windowed aggregations for time-based metrics (5-minute averages, hourly totals)
  4. Use branch operations to route events to different output topics based on content
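
To make the windowed aggregation in step 3 concrete, here is a plain-Python illustration of a 5-minute tumbling-window average. It is not the Kafka Streams API; it only shows the bucketing logic such a topology performs, and the record shape `(key, timestamp_ms, value)` is an assumption.

```python
from collections import defaultdict

WINDOW_MS = 5 * 60 * 1000  # 5-minute tumbling windows

def window_start(timestamp_ms, size_ms=WINDOW_MS):
    """Align a timestamp to the start of its tumbling window."""
    return timestamp_ms - (timestamp_ms % size_ms)

def windowed_average(records, size_ms=WINDOW_MS):
    """records: iterable of (key, timestamp_ms, value). Returns {(key, window_start): average}."""
    sums = defaultdict(lambda: [0.0, 0])  # running [total, count] per (key, window)
    for key, ts, value in records:
        bucket = sums[(key, window_start(ts, size_ms))]
        bucket[0] += value
        bucket[1] += 1
    return {k: total / count for k, (total, count) in sums.items()}

records = [
    ("sensor-1", 0, 10.0),
    ("sensor-1", 60_000, 20.0),
    ("sensor-1", 300_000, 40.0),   # falls into the second window
    ("sensor-2", 299_999, 5.0),    # last millisecond of the first window
]
averages = windowed_average(records)
```

A real stream processor additionally has to decide how long to wait for late records before closing a window, which this sketch leaves out.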

Pattern 4: Lambda Architecture Simplified

Combine batch and streaming with a unified Kafka layer:

  1. Raw events land in Kafka topics
  2. A fast path processes events in real time via Kafka Streams
  3. A batch path processes the same events periodically via Spark or Flink
  4. A serving layer merges both views for queries

Because both paths read from the same Kafka topics, you avoid maintaining separate ingestion pipelines for batch and streaming, although the two processing paths still have to be kept consistent with each other.

Partitioning Strategy

Partitioning is the most critical design decision in any Kafka deployment:

Partition Key Selection

Your partition key determines message ordering and distribution:

  • Entity ID — Guarantees ordering for a single entity (user_id, order_id)
  • Composite key — Combines entity and event type for more granular ordering
  • Round-robin — Maximum distribution but no ordering guarantees
  • Custom partitioner — Business logic for complex distribution requirements
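
The effect of keying by entity ID can be sketched as follows. The example uses `zlib.crc32` as a stable stand-in hash; Kafka's Java client uses a murmur2-based hash for keyed records, so the partition numbers here will not match a real cluster. The keys and partition count are made up for illustration.

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    """Stable hash partitioning; crc32 is a stand-in for Kafka's murmur2-based default."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

events = [
    ("order-1", "created"),
    ("order-2", "created"),
    ("order-1", "paid"),
    ("order-1", "shipped"),
]

# Simulate appending each event to the partition its key hashes to.
partitions = {}
for key, event_type in events:
    partitions.setdefault(partition_for(key, 6), []).append((key, event_type))

# All events for one key land in one partition, in the order they were produced.
order_1_events = [e for k, e in partitions[partition_for("order-1", 6)] if k == "order-1"]
```

Ordering holds only within a partition, so two different keys have no ordering guarantee relative to each other even when they happen to share one.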

Partition Count Sizing

Each partition provides:

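  • Throughput on the order of 10 MB/s per partition for produce and consume, as a rough rule of thumb; the real figure depends on hardware, message size, compression, and replication settings, so benchmark your own cluster
  • At most one active consumer per partition within a consumer group, so the partition count caps consumer parallelism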
  • Overhead for replication, compaction, and broker recovery

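A practical approach: start with enough partitions for twice your current throughput and plan for growth. You can add partitions later but cannot remove them, and adding partitions changes the key-to-partition mapping, so per-key ordering is not preserved across the change.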
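
The sizing rule can be turned into a small calculation. This is a sketch under stated assumptions: the per-partition rates are numbers you would measure on your own cluster, and the function name and parameters are hypothetical.

```python
import math

def partitions_needed(target_mb_s, produce_mb_s_per_partition,
                      consume_mb_s_per_partition, growth_factor=2.0):
    """Partitions required so both producers and consumers can sustain the planned load."""
    planned = target_mb_s * growth_factor  # size for twice the current throughput by default
    for_produce = math.ceil(planned / produce_mb_s_per_partition)
    for_consume = math.ceil(planned / consume_mb_s_per_partition)
    return max(1, for_produce, for_consume)

# 120 MB/s today, 10 MB/s per partition on produce, 5 MB/s per partition on consume.
count = partitions_needed(120, 10, 5)  # 48: the slower consume side dominates
```

The slower of the two sides determines the count, which is why a slow consumer often forces more partitions than the producers alone would need.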

Hot Partition Mitigation

When some keys generate disproportionate traffic:

  • Salting — Append a random suffix to the key for distribution, then aggregate across salted partitions
  • Splitting — Give hot entities their own dedicated partitions
  • Compensation — Use a two-stage approach where events first go to a distribution topic, then a dedicated consumer re-keys and routes them
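
Salting can be illustrated with the sketch below, which spreads one hot key across a fixed number of salted keys and then re-aggregates the partial results. The `#` separator, the bucket count, and the counting workload are assumptions chosen for the example.

```python
import random
from collections import Counter

SALT_BUCKETS = 4  # hypothetical number of salted variants per hot key

def salted_key(key, rng, buckets=SALT_BUCKETS):
    """Append a random suffix so one hot key spreads over several partitions."""
    return f"{key}#{rng.randrange(buckets)}"

def original_key(salted):
    """Strip the salt suffix to recover the entity key."""
    return salted.rsplit("#", 1)[0]

rng = random.Random(42)

# Stage 1: each salted key is counted independently (as separate partitions would).
partial_counts = Counter(salted_key("hot-customer", rng) for _ in range(1000))

# Stage 2: aggregate the partial results back to the original key.
totals = Counter()
for salted, count in partial_counts.items():
    totals[original_key(salted)] += count
```

The trade-off is that events for the hot key are no longer totally ordered, so salting fits aggregations better than workflows that depend on per-entity ordering.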

Exactly-Once Semantics

Achieving exactly-once processing in Kafka requires coordination across three levels:

Producer Level

  • Enable idempotent producer (enable.idempotence=true)
  • Use transactional producer for multi-partition writes
  • Set acks=all for maximum durability
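
As a configuration sketch, the settings above map onto producer properties like the following. The property names are standard Kafka producer settings, written here as a dict in the form that dict-configured clients accept; the broker address and the `transactional.id` value are placeholders.

```python
# Producer settings for idempotent, transactional writes.
producer_config = {
    "bootstrap.servers": "localhost:9092",    # placeholder broker address
    "enable.idempotence": True,               # broker de-duplicates producer retries
    "acks": "all",                            # wait for all in-sync replicas
    "transactional.id": "orders-pipeline-1",  # hypothetical; must be stable per producer instance
}
```

The transactional ID should stay the same across restarts of the same logical producer so the broker can fence off a stale instance.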

Consumer Level

  • Set isolation.level=read_committed when consuming topics written by transactional producers, so records from aborted transactions are never delivered
  • Store consumer offsets in the same transaction as the output
  • Implement idempotent consumer logic as a safety net
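
The idempotent-consumer safety net can be sketched as a handler that remembers which event IDs it has already applied. The `event_id` field and the in-memory set are assumptions; in production the set of seen IDs would live in a durable store updated atomically with the side effect.

```python
class IdempotentHandler:
    """Applies each event at most once, keyed by a unique event ID."""

    def __init__(self):
        self.seen = set()  # stand-in for a durable de-duplication store
        self.total = 0

    def handle(self, message):
        """Return True if the message was applied, False if it was a duplicate."""
        if message["event_id"] in self.seen:
            return False
        self.total += message["amount"]
        self.seen.add(message["event_id"])
        return True

handler = IdempotentHandler()
deliveries = [
    {"event_id": "e1", "amount": 10},
    {"event_id": "e2", "amount": 5},
    {"event_id": "e1", "amount": 10},  # redelivery after a rebalance or retry
]
results = [handler.handle(m) for m in deliveries]
```

With this in place, an occasional redelivery changes nothing, so at-least-once delivery upstream still produces correct totals.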

Processing Level

  • Kafka Streams supports exactly-once via processing.guarantee=exactly_once_v2
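  • For external systems, use the outbox pattern: write the state change and the outgoing event to the database in a single transaction, then relay the outbox rows to Kafka (for example via CDC). This avoids a dual write to the database and Kafka, which cannot be made atomic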
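
One common form of the outbox pattern is sketched below using an in-memory SQLite database: the business row and the event row are committed in one database transaction, and a separate relay publishes outbox rows afterwards. The table names, the `publish` callback, and the polling relay are assumptions; a CDC connector could replace the relay.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute(
    "CREATE TABLE outbox (seq INTEGER PRIMARY KEY AUTOINCREMENT, "
    "topic TEXT, payload TEXT, sent INTEGER DEFAULT 0)"
)

def place_order(order_id):
    """Write the state change and its event in ONE database transaction."""
    with conn:  # commits both inserts together, or rolls both back on error
        conn.execute("INSERT INTO orders (id, status) VALUES (?, ?)", (order_id, "placed"))
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("orders", json.dumps({"order_id": order_id, "status": "placed"})),
        )

def relay(publish):
    """Publish unsent outbox rows in order; returns how many rows were relayed."""
    rows = conn.execute(
        "SELECT seq, topic, payload FROM outbox WHERE sent = 0 ORDER BY seq"
    ).fetchall()
    for seq, topic, payload in rows:
        publish(topic, payload)  # at-least-once: a crash after this line causes a re-publish
        with conn:
            conn.execute("UPDATE outbox SET sent = 1 WHERE seq = ?", (seq,))
    return len(rows)

published = []  # stand-in for a Kafka producer
place_order(1)
relayed = relay(lambda topic, payload: published.append((topic, json.loads(payload))))
```

The relay is at-least-once by design, so downstream consumers still need to tolerate duplicates.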

Schema Management

Schema evolution is critical for long-running Kafka deployments:

  1. Schema Registry — Centralize schema storage and enforce compatibility rules
  2. Backward compatibility — New schemas can read data written by old schemas (default and recommended)
  3. Evolution strategies — Add optional fields with defaults; never remove or rename required fields
  4. Schema IDs in messages — Each message carries its schema ID; consumers look up the schema by ID at deserialization time and typically cache it
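
How a schema ID travels with a message can be shown with a small framing sketch. It assumes the Confluent Schema Registry wire format (a zero magic byte followed by a 4-byte big-endian schema ID, then the serialized payload); the helper names are hypothetical and the payload is treated as opaque bytes.

```python
import struct

MAGIC_BYTE = 0  # marks the Confluent wire format

def frame(schema_id: int, payload: bytes) -> bytes:
    """Prefix a serialized payload with the magic byte and schema ID."""
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + payload

def unframe(message: bytes):
    """Split a framed message into (schema_id, payload)."""
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError("unknown magic byte")
    return schema_id, message[5:]

framed = frame(42, b"abc")
schema_id, payload = unframe(framed)
```

A consumer would use the extracted ID to fetch the writer's schema from the registry before decoding the payload.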

Operational Excellence

Monitoring

Track these critical metrics:

  • Under-replicated partitions — Indicates broker health issues
  • Consumer lag — How far behind the consumer group is from the latest offset
  • Produce/consume throughput — Bytes and records per second per topic
  • Request latency — P99 produce and fetch latencies
  • Disk usage — Kafka is disk-heavy; monitor and plan capacity
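
Consumer lag is simple arithmetic per partition: the log end offset minus the group's committed offset. The sketch below assumes the two offset maps have already been fetched from the cluster, and it treats a partition with no committed offset as starting from 0, which is an assumption (the real behavior depends on the consumer's offset reset policy).

```python
def consumer_lag(end_offsets, committed_offsets):
    """Per-partition lag: log end offset minus committed offset, never negative."""
    return {
        partition: max(0, end - committed_offsets.get(partition, 0))
        for partition, end in end_offsets.items()
    }

end_offsets = {0: 1500, 1: 1200, 2: 900}  # latest offset per partition
committed = {0: 1500, 1: 1100}            # partition 2 has no committed offset yet
lag = consumer_lag(end_offsets, committed)
total_lag = sum(lag.values())
```

Alerting on the trend of total lag, rather than on a single reading, helps separate a brief burst from a consumer that is falling behind.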

Capacity Planning

  • Throughput — Plan for peak, not average; add 50% headroom
  • Storage — Calculate based on retention period, throughput, and replication factor
  • Network — Replication and consumer traffic can double your network requirements
  • Broker count — Minimum 3 brokers for production; more for higher throughput and availability
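
The storage bullet reduces to a back-of-the-envelope formula: throughput times retention times replication factor, plus headroom. The sketch below applies the 50% headroom figure to storage as well, which is an assumption, and it ignores compression and index overhead.

```python
def storage_tb(throughput_mb_s, retention_days, replication_factor, headroom=0.5):
    """Estimated cluster-wide disk need in decimal terabytes."""
    seconds = retention_days * 24 * 60 * 60
    raw_mb = throughput_mb_s * seconds  # data written once over the retention period
    return raw_mb * replication_factor * (1 + headroom) / 1_000_000

# 20 MB/s sustained, 7 days retention, replication factor 3: roughly 54 TB.
estimate = storage_tb(throughput_mb_s=20, retention_days=7, replication_factor=3)
```

Dividing the estimate by the broker count gives a first approximation of the disk each broker needs.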

Failure Scenarios

  • Broker failure — Partition leadership transfers automatically; set min.insync.replicas (commonly 2 with a replication factor of 3) and use acks=all so acknowledged writes survive the loss of a broker
  • Consumer failure — A rebalance is triggered; use the cooperative sticky assignor to minimize disruption
  • ZooKeeper/KRaft failure — Monitor controller health; KRaft mode eliminates the ZooKeeper dependency entirely
  • Disk failure — Replace the disk and let Kafka rebuild the replica from the leader

Anti-Patterns to Avoid

  1. Using Kafka as a database — It's an event log, not a query engine; use it with a database for state
  2. Large messages — Keep messages under 1 MB (the broker's default limit is about 1 MB); for larger payloads, store the payload externally and send a reference (the claim-check pattern)
  3. Too many topics — Group related events; use headers or a type field within a topic
  4. Ignoring consumer lag — Lag is a leading indicator of system stress; alert on it early
  5. Tight coupling — Use schema contracts and API versioning between producer and consumer teams
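
The reference pattern for large payloads (item 2) can be sketched as follows. The dict standing in for object storage, the size threshold, and the message shape are all assumptions for illustration.

```python
import uuid

MAX_INLINE_BYTES = 1_000_000  # assumed threshold, kept below the broker's message limit
blob_store = {}               # stand-in for object storage such as a bucket

def to_message(payload: bytes) -> dict:
    """Send small payloads inline; store large ones externally and send a reference."""
    if len(payload) <= MAX_INLINE_BYTES:
        return {"inline": payload}
    ref = str(uuid.uuid4())
    blob_store[ref] = payload
    return {"ref": ref, "size": len(payload)}

def resolve(message: dict) -> bytes:
    """Consumer side: return the payload, fetching it from the store if needed."""
    if "inline" in message:
        return message["inline"]
    return blob_store[message["ref"]]

small = to_message(b"x" * 10)
large = to_message(b"y" * 2_000_000)
```

The external store's retention has to be at least as long as the topic's, otherwise a replayed message can point at a payload that no longer exists.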

Conclusion

Kafka is powerful but demands respect for its design principles. By choosing appropriate partitioning strategies, implementing exactly-once semantics, managing schemas carefully, and investing in operational tooling, you can build data pipelines that are reliable, scalable, and maintainable. The key is to start simple, measure everything, and iterate based on real production data.

Expert in engineering at Albos Technologies Pvt Ltd. Sharing insights from years of building enterprise solutions at scale.