AI DevOps Korea

Implementing Event-Driven Architecture with Apache Kafka

Updated Apr 22

Figure: how event contracts, Kafka topics, consumer recovery, and operational signals fit together in a production event-driven system.

Kafka does not create good event-driven architecture on its own. The hard part is not standing up the broker. The hard part is deciding which business facts become events, under which contracts, with which ordering meaning, and how failures, duplicates, and replay are handled over time.

That is why strong Kafka usage is more about contract discipline than broker mechanics.

Events are system contracts

The most important design decision is whether an event is treated as an internal implementation detail or as a durable contract that other systems can depend on.

In most serious event-driven systems, events should be treated as contracts:

  • names should reflect business facts
  • payloads should carry enough meaning for downstream consumers
  • versioning should be planned before multiple consumers appear
  • schema changes should be rolled out deliberately

If teams publish vague events such as UserUpdated or RowChanged, consumers have to reconstruct the business meaning by guesswork, and each consumer may guess differently.
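
As a minimal sketch of what "event as contract" can look like in code, the example below defines one explicitly named, versioned event. The event name OrderPlaced, its fields, and the helpers new_order_placed and to_wire are all hypothetical, chosen only for illustration. A real system would more likely manage this with a schema registry and a format such as Avro or Protobuf.

```python
import json
import uuid
from dataclasses import asdict, dataclass
from datetime import datetime, timezone

# Stable, domain-specific event name (hypothetical).
EVENT_TYPE = "order.placed"


@dataclass(frozen=True)
class OrderPlaced:
    """Hypothetical event: a business fact, named in the past tense."""

    # Business payload: enough meaning for consumers to act without guessing.
    order_id: str
    customer_id: str
    total_cents: int
    currency: str
    # Envelope fields that every event type would carry.
    event_id: str      # unique identity, used later for deduplication
    occurred_at: str   # ISO 8601 timestamp in UTC
    schema_version: int = 1


def new_order_placed(order_id: str, customer_id: str,
                     total_cents: int, currency: str) -> OrderPlaced:
    """Build the event with a fresh identity and timestamp."""
    return OrderPlaced(
        order_id=order_id,
        customer_id=customer_id,
        total_cents=total_cents,
        currency=currency,
        event_id=str(uuid.uuid4()),
        occurred_at=datetime.now(timezone.utc).isoformat(),
    )


def to_wire(event: OrderPlaced) -> bytes:
    """Serialize to JSON bytes, including the event type and schema version."""
    body = {"type": EVENT_TYPE, **asdict(event)}
    return json.dumps(body, sort_keys=True).encode("utf-8")
```

Carrying schema_version and event_id in every message from day one is cheap, and it is much harder to retrofit once several consumers depend on the payload.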

Topics and partitions must have business meaning

Kafka topics are not just transport channels. Each topic sets the boundary for an event type and its retention policy. Partitions are not just a scaling detail either. They are the unit of ordering: Kafka guarantees record order only within a single partition.

That means partition-key design should answer:

  • what entity needs ordered processing?
  • what concurrency level is required?
  • what skew risk exists for hot keys?

Good partition design preserves ordering where the business needs it and parallelism where the system can use it.
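
The trade-off can be sketched without a broker. The example below maps keys to partitions and measures skew. It is an illustration under assumptions: the function names are hypothetical, and zlib.crc32 is only a stand-in hash (Kafka's Java client hashes the key bytes with murmur2 by default, so real partition numbers will differ).

```python
import zlib
from collections import Counter


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition. Same key, same partition, so order is kept per key.

    crc32 is a stand-in hash for illustration, not Kafka's actual partitioner.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def skew_ratio(keys, num_partitions: int) -> float:
    """Return max partition load divided by mean load (1.0 means perfectly even)."""
    counts = Counter(partition_for(k, num_partitions) for k in keys)
    loads = [counts.get(p, 0) for p in range(num_partitions)]
    mean = sum(loads) / num_partitions
    return max(loads) / mean if mean else 0.0
```

Keying by order ID keeps every event for one order in sequence. Keying by a coarse value such as a tenant ID can push most traffic to one partition, which skew_ratio would show as a value well above 1.0.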

Database-write and event-publish boundaries need an explicit solution

One of the most dangerous assumptions in Kafka systems is that writing to the database and publishing to Kafka will “usually succeed together.”

The practical fix is usually the Outbox pattern:

  • commit business state and outbox event in one database transaction
  • relay the outbox event to Kafka asynchronously
  • monitor relay lag and failures

Without this, teams eventually discover silent divergence between service state and published events.
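
The three steps above can be sketched with SQLite standing in for the service database. The table layout, the topic name orders.placed, and the publish callable are assumptions made for the example. In practice publish would wrap a Kafka producer, and many teams use change data capture rather than a polling relay.

```python
import json
import sqlite3
import uuid


def init_schema(conn: sqlite3.Connection) -> None:
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS orders (
            id TEXT PRIMARY KEY,
            status TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS outbox (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            event_id TEXT NOT NULL UNIQUE,
            topic TEXT NOT NULL,
            msg_key TEXT NOT NULL,
            payload TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
    """)


def place_order(conn: sqlite3.Connection, order_id: str) -> None:
    """Write business state and the outbox row in ONE transaction."""
    event_id = str(uuid.uuid4())
    payload = json.dumps({"event_id": event_id, "order_id": order_id})
    # "with conn" commits on success and rolls back both inserts on any error.
    with conn:
        conn.execute(
            "INSERT INTO orders (id, status) VALUES (?, 'PLACED')", (order_id,)
        )
        conn.execute(
            "INSERT INTO outbox (event_id, topic, msg_key, payload) "
            "VALUES (?, ?, ?, ?)",
            (event_id, "orders.placed", order_id, payload),
        )


def relay_once(conn: sqlite3.Connection, publish, batch_size: int = 100) -> int:
    """Publish pending outbox rows in insertion order; return how many were sent."""
    rows = conn.execute(
        "SELECT id, topic, msg_key, payload FROM outbox "
        "WHERE published = 0 ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()
    sent = 0
    for row_id, topic, msg_key, payload in rows:
        # If publish raises, the row stays unpublished and is retried next run.
        publish(topic, msg_key, payload)
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
        sent += 1
    return sent
```

A crash between publish and the UPDATE means the same event is sent again on the next run. The relay is therefore at-least-once, and the count of rows with published = 0 is a natural relay-lag metric.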

Consumers must be idempotent

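At-least-once delivery means duplicates are normal in real systems. Producer retries, consumer rebalances, and restarts before an offset commit can all redeliver a message that was already processed.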

Consumers should therefore be designed to:

  • detect duplicate message identity
  • apply side effects safely once
  • distinguish transient failure from business rejection
  • survive replay without corrupting downstream state

A consumer that cannot tolerate duplicates is not production-ready, no matter how clean the broker setup looks.
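
One common way to get this behavior is to record the message identity in the same transaction as the side effect. The sketch below assumes each event carries a unique event_id and uses SQLite as the consumer's store. The payment example, table names, and handle_payment function are hypothetical.

```python
import sqlite3


def init_schema(conn: sqlite3.Connection) -> None:
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS processed_events (
            event_id TEXT PRIMARY KEY
        );
        CREATE TABLE IF NOT EXISTS balances (
            account_id TEXT PRIMARY KEY,
            cents INTEGER NOT NULL
        );
    """)


def handle_payment(conn: sqlite3.Connection, event: dict) -> bool:
    """Apply the event once. Return False if it was already processed."""
    # Dedup marker and side effect commit together, or not at all.
    with conn:
        cur = conn.execute(
            "INSERT OR IGNORE INTO processed_events (event_id) VALUES (?)",
            (event["event_id"],),
        )
        if cur.rowcount == 0:
            return False  # duplicate delivery: skip the side effect
        conn.execute(
            "INSERT OR IGNORE INTO balances (account_id, cents) VALUES (?, 0)",
            (event["account_id"],),
        )
        conn.execute(
            "UPDATE balances SET cents = cents + ? WHERE account_id = ?",
            (event["amount_cents"], event["account_id"]),
        )
    return True
```

This only covers side effects that live in the same database as the dedup table. Calls to external systems need their own idempotency key or a naturally idempotent operation.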

Replay is a feature, not an accident

Kafka is powerful partly because consumers can rebuild state by replaying retained events. But replay only works well if the team plans for it.

Replay-friendly systems usually have:

  • deterministic consumer logic
  • idempotent side effects
  • clear versioning rules
  • tooling for backfill and offset management
  • monitoring for replay lag and error spikes

If replay is treated as a rare emergency-only task, it usually fails when it is needed most.
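
The "deterministic consumer logic" requirement can be illustrated with a pure projection function. In this sketch the event log is a plain Python list standing in for a retained topic partition, and the event types and function names are hypothetical.

```python
def apply(state: dict, event: dict) -> dict:
    """Pure projection step: no clock, no randomness, no I/O."""
    kind = event["type"]
    if kind == "order.placed":
        return {**state, event["order_id"]: "PLACED"}
    if kind == "order.cancelled":
        return {**state, event["order_id"]: "CANCELLED"}
    # Unknown event types are skipped so newer events do not break old code.
    return state


def rebuild(log: list, from_offset: int = 0, initial: dict = None) -> dict:
    """Fold the log into state, starting at from_offset on top of initial."""
    state = dict(initial or {})
    for offset in range(from_offset, len(log)):
        state = apply(state, log[offset])
    return state
```

Because apply depends only on its inputs, replaying from offset 0 and resuming from a saved snapshot produce the same state. That property is worth asserting in a test before anyone needs a real backfill.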

DLT is an operational control, not a trash can

Dead-letter topics (DLTs) are useful when consumers repeatedly fail on certain messages, but they should not become a place where unresolved business problems go to disappear.

A healthy DLT practice includes:

  • classifying why the message failed
  • separating malformed payloads from transient infrastructure issues
  • defining who investigates and how replay happens
  • preserving correlation IDs and original metadata

Without that discipline, DLT volume becomes an ignored consistency backlog.
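
A minimal sketch of that practice, assuming the handler signals failure type through two hypothetical exception classes and that records are plain dicts with topic, partition, offset, key, value, and headers. The send_to_dlt callable stands in for a producer writing to the dead-letter topic. Backoff between attempts is omitted for brevity.

```python
class TransientError(Exception):
    """Infrastructure problem that may succeed on retry (hypothetical)."""


class PermanentError(Exception):
    """Malformed payload or business rejection; retrying will not help."""


def process_with_dlt(record: dict, handler, send_to_dlt,
                     max_attempts: int = 3) -> str:
    """Run handler with bounded retries; dead-letter the record on failure.

    Exceptions other than the two classes above propagate to the caller.
    """
    reason, detail, attempts = "", "", 0
    for attempts in range(1, max_attempts + 1):
        try:
            handler(record["value"])
            return "ok"
        except PermanentError as exc:
            reason, detail = "permanent", str(exc)
            break  # do not retry something that cannot succeed
        except TransientError as exc:
            reason, detail = "retries_exhausted", str(exc)
    send_to_dlt({
        # Original coordinates, so the message can be found and replayed.
        "source_topic": record["topic"],
        "source_partition": record["partition"],
        "source_offset": record["offset"],
        "key": record["key"],
        "value": record["value"],
        # Original headers, including any correlation ID, are preserved.
        "headers": dict(record.get("headers", {})),
        # Classification that tells the investigator where to look first.
        "reason": reason,
        "error": detail,
        "attempts": attempts,
    })
    return "dead_lettered"
```

Splitting the reason field into permanent and retries_exhausted makes DLT volume actionable: the first usually points at a producer or contract bug, the second at infrastructure.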

Metrics that actually matter

Kafka success is often misread through cluster health alone. Broker health matters, but application correctness depends on more than broker uptime.

Watch:

  • consumer lag
  • rebalance frequency
  • producer error rate
  • retry and dead-letter volume
  • processing latency per consumer group
  • hot partition skew

These metrics tell you whether the event-driven design is staying healthy under load and change.
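
Consumer lag, the first metric in the list, is the gap between a partition's log end offset and the consumer group's committed offset. The sketch below computes it from two plain dicts keyed by partition number. The function names are hypothetical, and treating a partition with no committed offset as starting from 0 is an assumption.

```python
def consumer_lag(end_offsets: dict, committed: dict) -> dict:
    """Per-partition lag: log end offset minus committed offset, floored at 0.

    Assumption: a partition with no committed offset counts from offset 0.
    """
    return {
        partition: max(end - committed.get(partition, 0), 0)
        for partition, end in end_offsets.items()
    }


def partitions_over(lag: dict, threshold: int) -> list:
    """Return the partitions whose lag exceeds the alert threshold, sorted."""
    return sorted(p for p, n in lag.items() if n > threshold)
```

Looking at lag per partition rather than as a single total helps separate a slow consumer group, where every partition lags, from a hot key, where one partition lags.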

Common architecture mistakes

Be careful with these patterns:

  • publishing events that mirror table changes instead of domain facts
  • assuming ordering across topics or across partitions, when Kafka only orders records within a single partition
  • using Kafka without solving database/event atomicity
  • making consumers depend on undeclared field meanings
  • treating replay and dead-letter handling as manual hero work

None of these are broker failures. They are design failures.

Decision checklist

Before calling the design mature, confirm:

  • event names are domain-specific and stable
  • partition keys reflect ordering requirements
  • outbox or an equivalent boundary solution exists
  • consumers are idempotent
  • replay procedures are tested
  • lag, rebalance, DLT, and skew are observable

Wrap-up

The differentiator in Kafka systems is not the broker itself. It is the discipline to treat events as contracts, ordering as a deliberate choice, and replay plus duplication as normal operating conditions.

That is what makes event-driven architecture reliable instead of merely asynchronous.
