Designing Distributed Transactions with Outbox, Inbox, and Idempotency

Updated Apr 22
[Diagram: how outbox publication, inbox deduplication, and idempotent state transitions work together to make distributed recovery operationally safe.]

Distributed failures are partial by nature. A database write can succeed while message publication fails. A consumer can finish business work and crash before acknowledging the message. A retry can reprocess the same event after the first attempt already changed state.

That is why distributed transaction design is less about preserving one giant all-or-nothing illusion and more about making duplication, retries, and recovery operationally safe.

Why 2PC is rarely the default answer

Traditional two-phase commit (2PC) promises strong atomicity, but many modern systems avoid it. Every participant has to support the protocol, a coordinator failure can leave participants blocked while holding locks, and the resulting tight coupling makes operations fragile across services.

In practice, most teams need:

  • local correctness inside one service boundary
  • reliable delivery to downstream consumers
  • safe replay and retry behavior
  • visibility into what is stuck, duplicated, or delayed

Outbox, Inbox, and idempotent handlers are the practical baseline for achieving that.

The Outbox pattern solves the write-publish gap

The classic failure window is simple:

  1. business data is committed
  2. event publish fails

Without an Outbox, downstream systems never learn that the state changed.

The practical solution is:

  • write business state and an outbox record in the same local transaction
  • publish outbox records asynchronously
  • mark each record as published only after the broker accepts it, in a step separate from the business transaction

This ensures that if the transaction commits, there is durable evidence that the event still needs to be published.
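
The write side of this pattern can be sketched in a few lines. The example below is a minimal illustration rather than a reference implementation: it assumes SQLite as the local database, and the `orders` and `outbox` tables, their columns, the `OrderPlaced` event type, and the `place_order` function are all hypothetical names chosen for the example.

```python
import json
import sqlite3
import uuid


def place_order(conn: sqlite3.Connection, order_id: str, amount: int) -> str:
    """Insert the order and its outbox event in one local transaction."""
    event_id = str(uuid.uuid4())
    payload = json.dumps({"order_id": order_id, "amount": amount})
    with conn:  # commits if the block succeeds, rolls back if it raises
        conn.execute(
            "INSERT INTO orders (id, amount) VALUES (?, ?)", (order_id, amount)
        )
        conn.execute(
            "INSERT INTO outbox (event_id, event_type, payload, published) "
            "VALUES (?, 'OrderPlaced', ?, 0)",
            (event_id, payload),
        )
    return event_id


# Hypothetical schema, kept in memory so the sketch runs as-is.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, amount INTEGER NOT NULL)")
conn.execute(
    "CREATE TABLE outbox (event_id TEXT PRIMARY KEY, event_type TEXT NOT NULL, "
    "payload TEXT NOT NULL, published INTEGER NOT NULL DEFAULT 0)"
)
event_id = place_order(conn, "order-1", 4200)
```

Because both inserts share one transaction, a failure in either leaves neither row behind, and nothing is sent to the broker inside the transaction itself.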

The Inbox pattern protects the consumer side

Outbox alone is not enough because consumers can still see duplicates.

Common causes include:

  • producer retries
  • broker redelivery
  • consumer crash after side effects but before acknowledgment
  • replay or backfill operations

The Inbox pattern gives the consumer a durable record of processed message identity so it can decide whether a message is new, duplicate, or partially completed.
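
A minimal sketch of the deduplication step follows. It assumes the consumer's business data and its inbox live in the same SQLite database, so one transaction can cover both. The `inbox` and `balances` tables and the `consume_once` and `credit_ten` functions are hypothetical. It only distinguishes new from duplicate messages; tracking partially completed work would need an extra status column.

```python
import sqlite3


def consume_once(conn, message_id, apply_effect):
    """Apply the effect only if message_id has not been recorded yet.

    Returns True when the message was processed now, False for a duplicate.
    """
    with conn:  # inbox insert and business effect commit or roll back together
        cur = conn.execute(
            "INSERT OR IGNORE INTO inbox (message_id) VALUES (?)", (message_id,)
        )
        if cur.rowcount == 0:  # row already existed: duplicate delivery
            return False
        apply_effect(conn)
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inbox (message_id TEXT PRIMARY KEY)")
conn.execute(
    "CREATE TABLE balances (account TEXT PRIMARY KEY, amount INTEGER NOT NULL)"
)
conn.execute("INSERT INTO balances VALUES ('acct-1', 0)")
conn.commit()


def credit_ten(c):
    c.execute("UPDATE balances SET amount = amount + 10 WHERE account = 'acct-1'")


first = consume_once(conn, "msg-1", credit_ten)
second = consume_once(conn, "msg-1", credit_ten)  # simulated redelivery
```

If the effect raises, the inbox row is rolled back with it, so a later redelivery is treated as new rather than silently dropped.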

Idempotency is not optional

In distributed systems, duplicate delivery is normal. Treating it as an edge case creates fragile systems.

A consumer should be able to receive the same message more than once without corrupting state. That usually means:

  • using a stable message identity
  • storing processing results keyed by that identity
  • making side effects conditional on first successful processing
  • separating “already processed” from “currently failed”

If a handler cannot safely process duplicates, retries become dangerous instead of helpful.
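
One way to read the four points above is as a small wrapper around the handler. The sketch below is illustrative only: `IdempotentHandler`, `ProcessingRecord`, and `charge` are hypothetical names, and the in-memory dictionary stands in for what would need to be a durable store in a real service.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class ProcessingRecord:
    status: str  # "processed" or "failed"
    result: Optional[str] = None
    error: Optional[str] = None


class IdempotentHandler:
    def __init__(self, effect: Callable[[dict], str]) -> None:
        self._effect = effect
        # Assumption: in-memory for illustration; production needs durable storage.
        self._records: Dict[str, ProcessingRecord] = {}

    def status_of(self, message_id: str) -> str:
        record = self._records.get(message_id)
        return record.status if record else "new"

    def handle(self, message_id: str, payload: dict) -> Optional[str]:
        record = self._records.get(message_id)
        if record is not None and record.status == "processed":
            return record.result  # duplicate: replay stored result, skip the effect
        try:
            result = self._effect(payload)
        except Exception as exc:
            # Record the failure so it stays distinguishable from "already processed".
            self._records[message_id] = ProcessingRecord("failed", error=str(exc))
            raise
        self._records[message_id] = ProcessingRecord("processed", result=result)
        return result


charges = []


def charge(payload: dict) -> str:
    charges.append(payload["amount"])
    return f"charged {payload['amount']}"


handler = IdempotentHandler(charge)
handler.handle("msg-1", {"amount": 25})
handler.handle("msg-1", {"amount": 25})  # duplicate: charge() is not called again
```

A message in the "failed" state is retried on the next delivery, while a "processed" one only returns its stored result.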

Ordering must be defined, not assumed

Many distributed designs fail because they assume global ordering where only local ordering exists.

Teams should decide explicitly:

  • which entity or business key requires ordering
  • whether ordering is required per aggregate, per account, per order, or globally
  • how reordering is detected and handled

In most systems, total ordering is too expensive and unnecessary. What matters is preserving meaningful ordering for the business key that drives consistency.
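
Per-key ordering can be checked cheaply if the producer attaches a sequence number to each event. The sketch below assumes exactly that: a hypothetical per-key counter starting at 1 and increasing by 1 per event. `PerKeySequencer` and its return labels are illustrative names, not part of any library.

```python
from typing import Dict


class PerKeySequencer:
    """Track the last applied sequence number for each business key."""

    def __init__(self) -> None:
        self._last_applied: Dict[str, int] = {}

    def classify(self, key: str, sequence: int) -> str:
        last = self._last_applied.get(key, 0)
        if sequence <= last:
            return "stale"  # duplicate or older event: safe to skip
        if sequence > last + 1:
            return "gap"  # an earlier event is missing: park or retry later
        self._last_applied[key] = sequence
        return "in_order"


sequencer = PerKeySequencer()
sequencer.classify("order-1", 1)  # "in_order"
sequencer.classify("order-1", 3)  # "gap": event 2 has not arrived yet
sequencer.classify("order-2", 1)  # "in_order": other keys are independent
```

What to do on "gap" (buffer, requeue, or fetch current state) is a business decision the sketch deliberately leaves open.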

Retries need policy, not hope

Retries are useful only when they are bounded and observable.

A practical retry policy usually includes:

  • exponential backoff
  • maximum retry count
  • dead-letter routing after repeated failure
  • separate handling for transient and permanent errors
  • correlation IDs to trace the message through retries

If teams only “retry until it works,” they often create hidden backlog growth and cascading failures.
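
The policy above could look roughly like the following sketch. The exception classes, `backoff_delay`, `process_with_retries`, the default limits, and the `correlation_id` field on the message are all assumptions made for illustration. The `sleep` parameter is injectable so the loop can be exercised without waiting.

```python
import time
from typing import Callable, List, Tuple


class TransientError(Exception):
    """Failure that may succeed on retry, for example a timeout."""


class PermanentError(Exception):
    """Failure that will never succeed, for example a validation error."""


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff: base, 2*base, 4*base, ... capped at `cap` seconds."""
    return min(cap, base * (2 ** (attempt - 1)))


def process_with_retries(
    message: dict,
    handler: Callable[[dict], None],
    dead_letters: List[Tuple[str, int, str]],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    attempt = 0
    reason = ""
    for attempt in range(1, max_attempts + 1):
        try:
            handler(message)
            return "processed"
        except PermanentError as exc:
            reason = f"permanent: {exc}"
            break  # retrying cannot help, go straight to the dead-letter list
        except TransientError as exc:
            reason = f"retries exhausted: {exc}"
            if attempt < max_attempts:
                sleep(backoff_delay(attempt))
    # Keep the correlation ID so the message can be traced and replayed later.
    dead_letters.append((message["correlation_id"], attempt, reason))
    return "dead_lettered"
```

A production version would usually add random jitter to the delay so that many consumers do not retry in lockstep.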

A realistic processing flow

One practical design flow looks like this:

  1. Service A updates its business table and inserts an outbox event in one transaction.
  2. An outbox relay publishes the event to the broker.
  3. Service B receives the event.
  4. Service B checks the inbox table for the message ID.
  5. If it is new, Service B runs the business logic and records successful consumption, ideally in the same local transaction so the two cannot diverge.
  6. If it is already processed, Service B acknowledges and exits safely.

This flow does not eliminate failure. It makes failure recoverable.
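
Step 2, the relay, is where at-least-once delivery comes from, so it is worth sketching. The example assumes an `outbox` table in SQLite with a `published` flag and a `publish` callable standing in for a broker client; `relay_once` and all column names are hypothetical.

```python
import sqlite3
from typing import Callable


def relay_once(
    conn: sqlite3.Connection,
    publish: Callable[[str, str], None],
    batch_size: int = 100,
) -> int:
    """Publish pending outbox rows in insertion order; return how many were sent."""
    rows = conn.execute(
        "SELECT event_id, payload FROM outbox WHERE published = 0 "
        "ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()
    sent = 0
    for event_id, payload in rows:
        try:
            publish(event_id, payload)
        except Exception:
            break  # leave this and later rows pending; the next run retries them
        # A crash between publish() and this update causes a re-publish later,
        # which is why the consumer still needs its inbox check.
        with conn:
            conn.execute(
                "UPDATE outbox SET published = 1 WHERE event_id = ?", (event_id,)
            )
        sent += 1
    return sent


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "event_id TEXT UNIQUE NOT NULL, payload TEXT NOT NULL, "
    "published INTEGER NOT NULL DEFAULT 0)"
)
with conn:
    conn.executemany(
        "INSERT INTO outbox (event_id, payload) VALUES (?, ?)",
        [("evt-1", "{}"), ("evt-2", "{}")],
    )
delivered = []
sent = relay_once(conn, lambda event_id, payload: delivered.append(event_id))
```

Stopping at the first publish failure keeps the pending rows in their original order for the next run.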

Observability is part of correctness

A distributed transaction design is incomplete if the team cannot see where messages are getting stuck.

At minimum, observe:

  • outbox backlog size and age
  • relay publish failures
  • inbox duplicate rate
  • retry counts
  • dead-letter volume
  • end-to-end latency from original write to final consumption

If these metrics do not exist, many data consistency incidents will be discovered too late.
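
The first metric on that list, outbox backlog size and age, can often be derived from a single query. The sketch below assumes an `outbox` table with a `published` flag and a `created_at` column holding epoch seconds; the table layout and the `outbox_backlog` function are hypothetical.

```python
import sqlite3
import time
from typing import Optional


def outbox_backlog(conn: sqlite3.Connection, now: Optional[float] = None) -> dict:
    """Return the number of unpublished rows and the age of the oldest one."""
    now = time.time() if now is None else now
    pending, oldest = conn.execute(
        "SELECT COUNT(*), MIN(created_at) FROM outbox WHERE published = 0"
    ).fetchone()
    oldest_age = 0.0 if oldest is None else now - oldest
    return {"pending": pending, "oldest_age_seconds": oldest_age}


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE outbox (event_id TEXT PRIMARY KEY, created_at REAL NOT NULL, "
    "published INTEGER NOT NULL DEFAULT 0)"
)
with conn:
    conn.executemany(
        "INSERT INTO outbox (event_id, created_at, published) VALUES (?, ?, ?)",
        [("evt-1", 10.0, 1), ("evt-2", 100.0, 0), ("evt-3", 160.0, 0)],
    )
metrics = outbox_backlog(conn, now=200.0)
```

Alerting on the age of the oldest pending row tends to catch a stuck relay earlier than the row count alone, since a small backlog can still be very old.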

Common mistakes

Watch for these failure patterns:

  • using outbox but not making consumers idempotent
  • storing message IDs without recording processing status
  • assuming message brokers guarantee exactly-once business effects
  • mixing permanent validation failures with transient infrastructure failures
  • lacking a replay procedure for dead letters and partial outages

The strongest architecture patterns still fail if operational behavior is left ambiguous.

Decision checklist

Before calling the design production-ready, confirm the team can answer:

  • What is the idempotency key for each message type?
  • How do we distinguish processed, failed, and retrying states?
  • What ordering matters to the business?
  • How do we replay dead letters safely?
  • How do we detect stuck outbox records?
  • Can support teams trace one business action across services?

Wrap-up

A strong distributed transaction design is not one that pretends duplicates and retries will disappear. It is one where duplicates, retries, ordering assumptions, and observability have all been designed to remain safe under failure.

That is the practical standard for consistency in distributed systems.
