Resilient Payment Platform — Architecture Study

01 Problem frame

Payment systems operate across boundaries that fail independently: client networks retry, provider APIs time out, webhooks arrive late or out of order, and settlement occurs well after an initial authorisation.

The design goal is therefore stronger than “process a request.” Every accepted instruction must produce one traceable business outcome, and every ambiguous state must be recoverable without creating or losing money.

02 Core invariants

Three invariants shape the architecture: the same client instruction cannot create a second payment; financial history is append-only; and external provider state must be continuously reconciled with the platform’s own records.

An idempotency key is bound to the authenticated merchant and request fingerprint. Reuse returns the original result, while a mismatched payload is rejected instead of being interpreted as a new instruction.
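The key-binding rule above can be sketched as follows. This is a minimal in-memory illustration, not the platform's implementation: `IdempotencyStore`, `IdempotencyConflict`, and `fingerprint` are hypothetical names, and the fingerprint is assumed to be a SHA-256 hash of canonical JSON. A production store would need a durable table with a unique constraint on (merchant, key) to be safe under concurrency.

```python
import hashlib
import json


class IdempotencyConflict(Exception):
    """Raised when a key is reused with a different payload."""


def fingerprint(payload: dict) -> str:
    # Canonical JSON (sorted keys) so field order does not change the hash.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class IdempotencyStore:
    """Hypothetical in-memory store; not safe for concurrent writers."""

    def __init__(self):
        # (merchant_id, idempotency_key) -> (fingerprint, stored result)
        self._records = {}

    def execute(self, merchant_id: str, key: str, payload: dict, handler):
        scope = (merchant_id, key)  # the key is scoped to the merchant
        fp = fingerprint(payload)
        existing = self._records.get(scope)
        if existing is not None:
            stored_fp, result = existing
            if stored_fp != fp:
                # Same key, different payload: reject, never reinterpret.
                raise IdempotencyConflict(f"payload mismatch for key {key!r}")
            return result  # replay: return the original result
        result = handler(payload)
        self._records[scope] = (fp, result)
        return result
```

Scoping the record to the merchant means two merchants can use the same key string without colliding, while a replay from the same merchant never reaches the handler a second time.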

03 Request path

The synchronous path validates intent, runs risk policy, persists a payment state transition, and calls the selected provider through an isolated adapter. Provider-specific models never leak into the core domain.

Request path components, in diagram order:

Client: idempotency key
API Gateway: auth + limits
Payment API: state machine
Risk Engine: policy checks
Provider Adapter: isolated PSP
Ledger: immutable entries
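One way to picture the adapter boundary and the payment state machine is the sketch below. The state names, the allowed-transition table, the `ExamplePspAdapter` class, and its `status_code` values are all assumptions made for illustration; the study does not specify them. Risk policy and persistence are omitted.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Protocol


class PaymentState(Enum):
    CREATED = "created"
    PENDING = "pending"
    AUTHORISED = "authorised"
    DECLINED = "declined"


class ProviderOutcome(Enum):
    APPROVED = "approved"
    DECLINED = "declined"
    UNKNOWN = "unknown"


# Assumed transition table: terminal states allow no further moves.
ALLOWED = {
    PaymentState.CREATED: {PaymentState.PENDING, PaymentState.AUTHORISED,
                           PaymentState.DECLINED},
    PaymentState.PENDING: {PaymentState.AUTHORISED, PaymentState.DECLINED},
    PaymentState.AUTHORISED: set(),
    PaymentState.DECLINED: set(),
}


class ProviderAdapter(Protocol):
    def authorise(self, amount_minor: int, currency: str) -> ProviderOutcome: ...


class ExamplePspAdapter:
    """Hypothetical adapter: maps a provider-shaped response to a domain outcome."""

    def __init__(self, raw_response: dict):
        self._raw = raw_response  # stands in for a real provider call

    def authorise(self, amount_minor: int, currency: str) -> ProviderOutcome:
        code = self._raw.get("status_code")  # assumed provider field
        if code == "00":
            return ProviderOutcome.APPROVED
        if code == "05":
            return ProviderOutcome.DECLINED
        return ProviderOutcome.UNKNOWN  # anything else is not a failure


@dataclass
class Payment:
    payment_id: str
    state: PaymentState = PaymentState.CREATED

    def transition(self, new_state: PaymentState) -> None:
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state


def process(payment: Payment, adapter: ProviderAdapter,
            amount_minor: int, currency: str) -> PaymentState:
    outcome = adapter.authorise(amount_minor, currency)
    target = {
        ProviderOutcome.APPROVED: PaymentState.AUTHORISED,
        ProviderOutcome.DECLINED: PaymentState.DECLINED,
        ProviderOutcome.UNKNOWN: PaymentState.PENDING,
    }[outcome]
    payment.transition(target)
    return payment.state
```

The core only ever sees `ProviderOutcome`; the provider's own codes stay inside the adapter.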

04 Ledger & events

A double-entry ledger records balanced debit and credit entries rather than mutating a single balance. Payment status is useful for workflow, but the ledger remains the financial source of truth.
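A minimal sketch of the balanced, append-only posting rule follows. The `Entry` and `Ledger` names, the account names in the usage note, and the convention that a balance is debits minus credits are assumptions for illustration; amounts are integer minor units to avoid floating-point rounding.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)  # entries are immutable once created
class Entry:
    transaction_id: str
    account: str
    debit_minor: int = 0
    credit_minor: int = 0


class Ledger:
    """Hypothetical in-memory ledger: entries are appended, never updated."""

    def __init__(self):
        self._entries: List[Entry] = []

    def post(self, entries: Iterable[Entry]) -> None:
        batch = list(entries)
        if not batch:
            raise ValueError("empty transaction")
        if any(e.debit_minor < 0 or e.credit_minor < 0 for e in batch):
            raise ValueError("amounts must be non-negative")
        debits = sum(e.debit_minor for e in batch)
        credits = sum(e.credit_minor for e in batch)
        if debits != credits:
            # Reject the whole batch so the ledger never becomes unbalanced.
            raise ValueError(f"unbalanced: debits={debits} credits={credits}")
        self._entries.extend(batch)

    def balance(self, account: str) -> int:
        # Derived from history rather than stored as a mutable field.
        return sum(e.debit_minor - e.credit_minor
                   for e in self._entries if e.account == account)

    def imbalance(self) -> int:
        # Expected to stay at zero; a non-zero value signals a ledger fault.
        return sum(e.debit_minor - e.credit_minor for e in self._entries)
```

For example, posting `Entry("tx_1", "psp_receivable", debit_minor=1000)` together with `Entry("tx_1", "merchant_payable", credit_minor=1000)` is accepted, while posting either entry alone raises `ValueError`.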

State changes and outbound events are committed together through a transactional outbox. Kafka consumers are idempotent, use stable event identifiers, and move repeatedly failing messages to a quarantined stream with operational context.
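The outbox pattern and an idempotent consumer can be sketched with SQLite standing in for the platform database and a plain callable standing in for the Kafka producer. The table names, column names, and function names below are hypothetical, and the quarantined stream for repeatedly failing messages is not shown.

```python
import json
import sqlite3
import uuid


def open_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE payments (id TEXT PRIMARY KEY, state TEXT NOT NULL);
        CREATE TABLE outbox (event_id TEXT PRIMARY KEY,
                             payload TEXT NOT NULL,
                             published INTEGER NOT NULL DEFAULT 0);
        CREATE TABLE processed_events (event_id TEXT PRIMARY KEY);
    """)
    return db


def set_state(db: sqlite3.Connection, payment_id: str, state: str) -> str:
    event_id = str(uuid.uuid4())  # stable identifier carried with the event
    payload = json.dumps({"payment_id": payment_id, "state": state})
    with db:  # one transaction: both rows commit, or neither does
        db.execute("INSERT OR REPLACE INTO payments (id, state) VALUES (?, ?)",
                   (payment_id, state))
        db.execute("INSERT INTO outbox (event_id, payload) VALUES (?, ?)",
                   (event_id, payload))
    return event_id


def relay(db: sqlite3.Connection, publish) -> int:
    """Publish unpublished outbox rows; delivery is at-least-once."""
    rows = db.execute("SELECT event_id, payload FROM outbox "
                      "WHERE published = 0 ORDER BY rowid").fetchall()
    for event_id, payload in rows:
        publish(event_id, payload)  # a crash here means a later re-publish
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE event_id = ?",
                       (event_id,))
    return len(rows)


def consume(db: sqlite3.Connection, event_id: str, payload: str,
            applied: list) -> bool:
    """Apply an event once; duplicates are detected by primary key."""
    with db:
        try:
            db.execute("INSERT INTO processed_events (event_id) VALUES (?)",
                       (event_id,))
        except sqlite3.IntegrityError:
            return False  # already processed: skip the side effect
        applied.append(json.loads(payload))  # stands in for the real effect
    return True
```

Because the relay may publish the same row twice after a crash, the consumer's deduplication on `event_id` is what turns at-least-once delivery into a single applied effect.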

05 Failure handling

Timeouts are treated as unknown outcomes, not automatic failures. The payment enters a pending state until provider lookup or a signed webhook resolves it. Retries use bounded exponential backoff and never bypass the original idempotency boundary.
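The retry rule can be illustrated as below, assuming a "full jitter" variant of exponential backoff and a hypothetical `lookup` callable that queries the provider by the original idempotency key. The base, cap, and attempt count are placeholder values, not figures from the study.

```python
import random
from enum import Enum


class Outcome(Enum):
    APPROVED = "approved"
    DECLINED = "declined"
    UNKNOWN = "unknown"


def backoff_delays(base_s: float = 0.5, cap_s: float = 30.0,
                   max_attempts: int = 5, rng=random.random) -> list:
    # Bounded twice: the delay is capped at cap_s, and the number of
    # attempts is capped at max_attempts. rng() scales each delay (jitter).
    return [rng() * min(cap_s, base_s * (2 ** n)) for n in range(max_attempts)]


def resolve_pending(lookup, idempotency_key: str, delays, sleep) -> Outcome:
    """Poll the provider until it reports a definite outcome or delays run out."""
    for delay in delays:
        try:
            # Every attempt reuses the same key, so a retry can never
            # create a second payment at the provider.
            outcome = lookup(idempotency_key)
        except TimeoutError:
            outcome = Outcome.UNKNOWN  # a timeout is not a decline
        if outcome is not Outcome.UNKNOWN:
            return outcome
        sleep(delay)
    # Still unknown: the payment stays pending for a webhook or reconciliation.
    return Outcome.UNKNOWN
```

Passing `sleep` and `rng` as parameters keeps the sketch testable; in a service they would typically be `time.sleep` and `random.random`.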

Circuit breakers isolate unhealthy providers. Routing can shift new eligible traffic to another provider, while in-flight transactions remain attached to their original processor for consistent capture, refund, and reconciliation behaviour.
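A simplified circuit breaker with provider pinning might look like the sketch below. The thresholds, the `provider` field on the payment, and the provider names in the usage note are assumptions; the half-open phase is reduced to "allow traffic again after a cool-down", where a fuller version would admit only a limited number of probes.

```python
import time


class CircuitBreaker:
    """Hypothetical breaker: opens after consecutive failures, retries after a cool-down."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def allow(self) -> bool:
        if self._opened_at is None:
            return True
        # After the cool-down, let traffic probe the provider again.
        return self._clock() - self._opened_at >= self._reset_after_s

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self._threshold:
            self._opened_at = self._clock()  # open, or re-open after a failed probe


def choose_provider(payment: dict, providers: list, breakers: dict) -> str:
    # In-flight payments stay with their original processor, even if its
    # breaker is open, so capture and refund hit the same provider.
    if payment.get("provider"):
        return payment["provider"]
    for name in providers:  # list order expresses routing preference
        if breakers[name].allow():
            return name
    raise RuntimeError("no healthy provider available")
```

With providers `["psp_a", "psp_b"]`, a new payment moves to `psp_b` once the breaker for `psp_a` opens, while a payment already carrying `{"provider": "psp_a"}` is still routed to `psp_a`.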

06 Security controls

The platform minimises sensitive data through tokenisation and strict service boundaries. Secrets are short-lived, encrypted at rest, and never written to application logs. Webhooks require signature verification, timestamp validation, and replay protection.
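The three webhook checks can be sketched with Python's standard `hmac` module. The signing scheme here (HMAC-SHA256 over `timestamp.event_id.body`), the five-minute tolerance, and the in-memory replay set are assumptions for illustration; real providers each define their own signature format, and a replay cache would need to be shared and expiring.

```python
import hashlib
import hmac


def sign(secret: bytes, timestamp: int, event_id: str, body: bytes) -> str:
    # Assumed scheme: the timestamp and event id are covered by the signature,
    # so neither can be altered without invalidating it.
    message = f"{timestamp}.{event_id}.".encode("utf-8") + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


class WebhookVerifier:
    """Hypothetical verifier; the replay set lives in memory only."""

    def __init__(self, secret: bytes, tolerance_s: int = 300):
        self._secret = secret
        self._tolerance_s = tolerance_s
        self._seen_event_ids = set()

    def verify(self, body: bytes, timestamp: int, event_id: str,
               signature_hex: str, now: int) -> bool:
        # 1. Timestamp validation: reject stale or far-future deliveries.
        if abs(now - timestamp) > self._tolerance_s:
            return False
        # 2. Signature verification, using a constant-time comparison.
        expected = sign(self._secret, timestamp, event_id, body)
        if not hmac.compare_digest(expected, signature_hex):
            return False
        # 3. Replay protection: each event id is accepted once.
        if event_id in self._seen_event_ids:
            return False
        self._seen_event_ids.add(event_id)
        return True
```

The event id is recorded only after the signature passes, so an attacker cannot block a genuine event by sending a forged one with the same id first.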

Merchant operations are authorised by scoped permissions; high-risk administrative actions use dual control. Audit records capture the actor, intent, policy decision, state transition, and correlation identifiers without exposing payment credentials.

07 Operations

Technical metrics are paired with business signals: authorisation rate by provider, unknown-state age, ledger imbalance, webhook delay, reconciliation breaks, and duplicate-attempt suppression. Traces use the payment identifier across synchronous calls and asynchronous events.

Once: one outcome per accepted intent

Balanced: double-entry ledger invariant

Traceable: end-to-end audit history

08 Takeaway

Payment reliability comes from explicit state, durable invariants, and reconciliation—not optimistic assumptions about networks or providers. The architecture keeps uncertainty visible until evidence safely resolves it.
