Design a Payment Processing System

TL;DR

Build a system that moves money from a buyer's card to a seller's bank account without losing a cent, double-charging anyone, or violating PCI DSS. The flow is Authorize, Capture, Settle -- three separate operations that can span hours or days. The hardest problem is not the happy path; it is the failure path. Stripe's PaymentIntent object has NO terminal "failed" state -- a declined payment can always be retried. Airbnb's Orpheus system uses two-level idempotency (client-level and service-level) to guarantee exactly-once payments even through network partitions and retries. The cardinal rule: "default to non-retryable." If you are unsure whether a charge went through, do NOT retry automatically -- you risk double-charging the customer. Instead, reconcile offline. Double-entry bookkeeping, 3D Secure authentication, and PCI DSS scope reduction through tokenization round out the design. Stripe runs this kind of system at roughly $1 trillion in annual payment volume.

The System

Think Stripe. A user on an e-commerce site clicks "Pay $99.99." The frontend sends the payment details to your system. Your system charges their card, holds the funds, and eventually settles the money to the merchant's bank account. The user sees "Payment successful" within 3 seconds. Behind that confirmation, your system has: tokenized the card (PCI DSS), run fraud checks, sent an authorization request through the card network (Visa/Mastercard) to the issuing bank, received approval, recorded the authorization, and scheduled capture and settlement.

Why is this the hardest system design question? Because money has no "undo" button. If your messaging system drops a message, the user resends. If your social media post fails to save, the user reposts. If your payment system charges someone twice, you have taken money that is not yours, and you face chargebacks, regulatory scrutiny, and lost customer trust. The entire architecture is built around three rules: never lose money, never duplicate money, and when in doubt, stop and reconcile.

Requirements

Functional Requirements

Requirement	Details
Payment initiation	Merchant creates a payment with amount, currency, payment method, and idempotency key.
Authorization	Request hold on customer's funds from issuing bank.
Capture	Convert authorization hold to actual charge. Can be immediate or delayed (up to 7 days).
Settlement	Transfer captured funds to merchant's bank account (typically T+2).
Refunds	Full or partial refund. Return funds to customer's original payment method.
Webhooks	Notify merchant of payment status changes asynchronously.

Non-Functional Requirements

Requirement	Target
Authorization latency	< 3 seconds (including network hop to issuing bank)
Consistency	Exactly-once payment semantics. Zero double charges. Zero lost payments.
Availability	99.999% (five nines -- money cannot be "down")
Durability	Every transaction persisted before returning success. Zero data loss.
Scale	1M payments/day, ~12 payments/sec avg, 500/sec peak (Black Friday)
Compliance	PCI DSS Level 1

Back-of-Envelope Math

Payments per day:            1 million (mid-size processor; Stripe does ~50M/day)
Payments per second:         ~12 avg, ~500 peak (Black Friday, flash sales)
Average payment amount:      $65
Daily volume:                $65 million
Annual volume:               $23.7 billion

Storage:
  Payment record:            ~1 KB (amount, currency, status, timestamps, metadata)
  Payments per year:         365 million
  Annual storage:            365 GB (trivial)
  7-year retention (PCI):    2.5 TB

Ledger entries:
  Each payment = 2 entries (debit + credit) = 2 KB
  Annual ledger:             730 GB
  7-year ledger total:       ~5 TB

Card network latency:
  Tokenization:              100-200 ms
  Authorization:             500-1500 ms (round trip to issuing bank via Visa/MC network)
  3D Secure (if required):   10-30 seconds (user interaction)

The key number: 500-1500 ms for authorization. This is NOT your system's latency -- it is the card network's latency. You are at the mercy of Visa's network and the issuing bank's processing speed. Your system must be resilient to this variability.
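The averages and storage figures above are easy to re-derive. The short Python sketch below recomputes them from the stated inputs (1 million payments/day, $65 average, ~1 KB per record, 7-year retention); the constant and variable names are illustrative only.

```python
# Recompute the back-of-envelope figures from the stated inputs.
PAYMENTS_PER_DAY = 1_000_000
AVG_AMOUNT_USD = 65
RECORD_BYTES = 1_000          # ~1 KB per payment record
SECONDS_PER_DAY = 86_400
RETENTION_YEARS = 7

avg_per_sec = PAYMENTS_PER_DAY / SECONDS_PER_DAY                 # ~11.6
daily_volume_usd = PAYMENTS_PER_DAY * AVG_AMOUNT_USD             # 65 million
annual_volume_usd = daily_volume_usd * 365                       # ~23.7 billion
payments_per_year = PAYMENTS_PER_DAY * 365                       # 365 million
annual_storage_gb = payments_per_year * RECORD_BYTES / 1e9       # 365 GB
retention_storage_tb = annual_storage_gb * RETENTION_YEARS / 1e3  # ~2.6 TB

print(f"avg payments/sec:  {avg_per_sec:.1f}")
print(f"annual volume:     ${annual_volume_usd / 1e9:.1f}B")
print(f"7-year storage:    {retention_storage_tb:.2f} TB")
```

The average rate works out to roughly 12 payments per second, so the 500/sec peak is about 40 times the average; capacity planning should be driven by the peak.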

Naive Design

PostgreSQL database, synchronous REST API.

Schema:

CREATE TABLE payments (
    payment_id UUID PRIMARY KEY,
    merchant_id UUID,
    amount DECIMAL(10,2),
    currency VARCHAR(3),
    status VARCHAR(20),  -- PENDING, AUTHORIZED, CAPTURED, SETTLED, REFUNDED, FAILED
    card_number VARCHAR(16),  -- PCI VIOLATION
    created_at TIMESTAMP,
    updated_at TIMESTAMP
);

Flow:

1. Merchant calls POST /payments with card details.
2. Server stores payment with status=PENDING.
3. Server calls Visa API to authorize.
4. If authorized, update status=AUTHORIZED.
5. Later, server calls Visa API to capture.
6. Update status=CAPTURED.
7. Settlement happens via batch file to acquiring bank.

Three problems jump out immediately. First, storing raw card numbers in your database is a PCI DSS violation that can end your business. Second, a status column that is overwritten in place loses history -- you cannot tell when the authorization happened vs. when capture happened. Third, what happens when step 3 returns a timeout? Is the card charged or not? You do not know.

Where It Breaks

Problem 1: The Ambiguous Failure

You send an authorization request to Visa. The network times out after 30 seconds. Did the authorization go through? Three possibilities:

1. Visa never received the request. No charge. Safe to retry.
2. Visa received and approved the request, but the response was lost. Card is charged. Retrying = double charge.
3. Visa received and declined the request, but the response was lost. No charge. Safe to retry but pointless.

You cannot distinguish these cases in real time. This is the fundamental problem of payment systems, and the reason the entire architecture revolves around idempotency.

Problem 2: Concurrent Payment Retries

The client sends a payment request. No response after 5 seconds. The client retries. Now two requests are in-flight. If both reach Visa, the customer is charged twice. If you deduplicate at your server, what if the first request already reached Visa before you receive the retry?

Problem 3: Partial Failures in Multi-Step Flow

Authorize succeeds. Capture fails. Now the customer's funds are held (authorization) but not charged (no capture). The hold expires in 7 days (Visa's standard auth hold duration). But the merchant thinks the order is confirmed. The merchant ships the product. The capture fails again. The merchant has shipped for free.

Problem 4: Storing Card Numbers

PCI DSS Level 1 requires annual on-site audits, quarterly network scans, penetration testing, and strict access controls for any system that stores, processes, or transmits cardholder data. Storing card_number VARCHAR(16) in PostgreSQL means your ENTIRE infrastructure is in PCI scope. The audit costs $200K+/year. The alternative: tokenize immediately and keep card data out of your system entirely.

Problem 5: No Double-Entry Bookkeeping

A status column tells you the current state but not the money flow. Where did the $65 come from? Where did it go? If a refund is processed, does the ledger balance? Without double-entry bookkeeping, you cannot reconcile and you cannot pass a financial audit.

Real Design

[Figure: Payment Processing High-Level Design]

Architecture Overview

┌──────────────┐
│   Merchant   │ ── POST /payments (idempotency_key in header)
└──────┬───────┘
       │
┌──────┴───────┐
│  API Gateway │ ── idempotency check (Level 1)
└──────┬───────┘
       │
┌──────┴───────┐
│  Payment     │ ── create PaymentIntent, orchestrate steps
│  Service     │
└──────┬───────┘
       │
┌──────┴───────────────────────────────────┐
│  Payment State Machine                    │
│  CREATED -> AUTHORIZED -> CAPTURED ->     │
│  SETTLED -> COMPLETED                     │
│  (each transition is an immutable event)  │
└──────┬───────────────────────────────────┘
       │
┌──────┴───────┐     ┌──────────────┐     ┌──────────────┐
│  Card Vault  │     │  Acquirer    │     │  Ledger      │
│  (tokenize)  │     │  Gateway     │     │  Service     │
│  PCI scope   │     │  (Visa/MC)   │     │  (double     │
│  isolated    │     │              │     │   entry)     │
└──────────────┘     └──────────────┘     └──────────────┘

Component 1: Idempotency -- The Foundation

Every payment API call includes an Idempotency-Key header. This key is the client's promise: "If I send this key again, I mean the same payment, not a new one."

Level 1: API-level idempotency

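On request:
  key = request.header("Idempotency-Key")

  // Atomically claim the key. SET ... NX succeeds only for the first request,
  // so two concurrent requests with the same key cannot both proceed.
  claimed = redis.SET("idempotency:" + key, IN_PROGRESS, NX, EX=86400)  // 24-hour TTL

  if not claimed:
    existing = redis.GET("idempotency:" + key)
    if existing == IN_PROGRESS:
      return 409 Conflict  // the original request is still in flight
    return existing.response  // return the same response as before

  // Process the payment...
  result = process_payment(request)

  redis.SET("idempotency:" + key, result, EX=86400)  // replace the marker with the result
  return result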
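
Whatever store backs the idempotency check, the claim on a key has to be atomic; otherwise two concurrent requests with the same key can both miss the cache and both reach process_payment. The following is a minimal in-memory sketch of that idea, not a production implementation. IdempotencyStore, claim, complete, and handle_payment are hypothetical names, and a threading.Lock stands in for an atomic cache operation such as Redis SET with the NX option.

```python
import threading


class IdempotencyStore:
    """In-memory stand-in for an idempotency cache (hypothetical helper)."""

    _IN_PROGRESS = object()

    def __init__(self):
        self._lock = threading.Lock()
        self._entries = {}

    def claim(self, key):
        """Return (True, None) if the caller now owns the key, else (False, stored)."""
        with self._lock:
            if key in self._entries:
                return False, self._entries[key]
            self._entries[key] = self._IN_PROGRESS
            return True, None

    def complete(self, key, response):
        """Replace the in-progress marker with the final response."""
        with self._lock:
            self._entries[key] = response

    def is_in_progress(self, value):
        return value is self._IN_PROGRESS


def handle_payment(store, key, process_payment):
    claimed, stored = store.claim(key)
    if not claimed:
        if store.is_in_progress(stored):
            # The first request has not finished; do not start a second charge.
            return {"status": 409, "error": "request with this key is in progress"}
        return stored  # replay the original response
    # A real system would also handle a crash here, e.g. by expiring the claim.
    response = process_payment()
    store.complete(key, response)
    return response
```

The in-progress marker is the important design choice: a retry that arrives while the first request is still waiting on the card network is told to wait rather than being allowed to start a second charge.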

Level 2: Service-level idempotency (Airbnb's Orpheus)

Airbnb discovered that API-level idempotency is not enough. What if the payment service internally retries a call to the acquirer? The acquirer sees two requests from the same service, both without the original idempotency key.

Orpheus assigns a unique idempotency token to each internal step:

PaymentIntent ID:           pi_12345
  Authorization attempt 1:  auth_pi_12345_001
  Authorization attempt 2:  auth_pi_12345_002 (a new attempt; a retry of attempt 1 reuses auth_pi_12345_001)
  Capture attempt 1:        cap_pi_12345_001

Each acquirer call includes the step-level idempotency token. If the acquirer receives auth_pi_12345_001 twice, it returns the same result. This prevents double-charging even when internal retries happen.
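
To make the step-level behavior concrete, here is a small sketch assuming an acquirer that deduplicates on the token it receives. FakeAcquirer and step_token are hypothetical names, the token format simply follows the example above, and nothing here is taken from Airbnb's actual implementation.

```python
class FakeAcquirer:
    """Hypothetical acquirer that deduplicates on a step-level token."""

    def __init__(self):
        self._results = {}
        self.holds_placed = 0

    def authorize(self, token, amount_cents):
        if token in self._results:
            # Replay of a token already seen: return the stored result.
            return self._results[token]
        self.holds_placed += 1  # only the first call places a hold
        result = {"token": token, "status": "authorized", "amount_cents": amount_cents}
        self._results[token] = result
        return result


def step_token(step, payment_intent_id, attempt):
    """Build a token such as auth_pi_12345_001."""
    return f"{step}_{payment_intent_id}_{attempt:03d}"
```

Retrying with the same token is harmless, while moving to a new attempt number is a deliberate decision to try again:

```python
acquirer = FakeAcquirer()
token = step_token("auth", "pi_12345", 1)
acquirer.authorize(token, 6500)
acquirer.authorize(token, 6500)   # internal retry, same token
print(acquirer.holds_placed)
```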

The Stripe guarantee: Stripe stores the result of the first request made with a given idempotency key and returns that same result for any later request with the same key; keys are retained for at least 24 hours. This means a client can safely retry a timed-out request within that window without risking a double charge.

Component 2: The Authorize-Capture-Settle Flow

This is the actual money movement, and it happens in three distinct phases.

Phase 1: Authorization (roughly 0.5-1.5 seconds)

Merchant sends: "Charge $65 to card tok_abc123."
Payment service sends authorization request to the acquiring bank (the merchant's bank), which forwards it through the card network (Visa) to the issuing bank (the customer's bank).
Issuing bank checks: sufficient funds? Card not frozen? Not flagged for fraud?
If approved: issuing bank places a hold for $65 on the customer's account. Returns an authorization code.
Payment status: AUTHORIZED. Customer's available balance decreases by $65, but no money has moved yet.

Phase 2: Capture (hours to days later)

When the merchant ships the product, they "capture" the authorized amount.
Payment service sends a capture request to the acquirer with the authorization code.
This converts the hold into an actual charge. The customer's statement now shows a pending charge.
Payment status: CAPTURED.

Why separate authorize and capture? Hotels authorize for 5 nights but capture for 3 if you check out early. Restaurants authorize the meal amount and capture the meal + tip. E-commerce authorizes at order time and captures at ship time (in case the item is out of stock and the order is cancelled).

Phase 3: Settlement (T+2 typically)

At end of business day, the payment processor batches all captured payments.
The batch is submitted to the card network for clearing.
The card network orchestrates the actual money transfer: issuing bank sends funds to acquiring bank.
Acquiring bank deposits into merchant's account, minus processing fees (typically 2.9% + $0.30).
Payment status: SETTLED.
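
The three phases map onto a small state machine in which each transition is appended to an event list rather than overwriting a status column. The sketch below is one possible shape for it, using the state names from the architecture diagram; the Payment class and ALLOWED table are hypothetical.

```python
from dataclasses import dataclass, field

# Legal next states for each state, following the diagram above.
ALLOWED = {
    "CREATED": {"AUTHORIZED"},
    "AUTHORIZED": {"CAPTURED"},
    "CAPTURED": {"SETTLED"},
    "SETTLED": {"COMPLETED"},
    "COMPLETED": set(),
}


@dataclass
class Payment:
    payment_id: str
    events: list = field(default_factory=lambda: ["CREATED"])

    @property
    def status(self):
        # The current state is derived from the event log, never stored separately.
        return self.events[-1]

    def transition(self, new_status):
        if new_status not in ALLOWED[self.status]:
            raise ValueError(f"illegal transition {self.status} -> {new_status}")
        self.events.append(new_status)  # append only; history is preserved
```

Because the log is append-only, questions such as "when was this payment authorized?" can be answered by storing a timestamp alongside each event, which the naive overwrite-in-place design cannot do.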

Component 3: The "No Terminal Failed State" Principle

Stripe's PaymentIntent object does NOT have a terminal "failed" state. Why? Because a "failed" payment might succeed on retry. The card might have been declined due to insufficient funds, but the customer adds money an hour later and retries.

Stripe's state machine:

requires_payment_method -> requires_confirmation -> requires_action (3D Secure) ->
processing -> requires_capture -> succeeded
              OR
processing -> requires_payment_method (declined, try again)

Notice: there is no "failed" state that prevents further action. After a decline, the PaymentIntent goes back to requires_payment_method for another attempt. The only terminal states are succeeded and canceled (set explicitly by the merchant).

System design implication: Your payment records must support multiple attempts. A PaymentIntent with 3 failed attempts and 1 successful attempt is a success. The data model is:

PaymentIntent:
  id: pi_12345
  amount: $65
  status: succeeded
  attempts: [
    { id: att_001, status: declined, reason: "insufficient_funds", at: T1 },
    { id: att_002, status: declined, reason: "card_expired", at: T2 },
    { id: att_003, status: authorized, auth_code: "ABC123", at: T3 }
  ]
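
A minimal sketch of that data model follows, assuming the intent's status is derived from its attempts rather than stored independently. The class and field names are hypothetical, and treating an authorized attempt as "succeeded" is a simplification that matches the example record above.

```python
from dataclasses import dataclass, field


@dataclass
class Attempt:
    attempt_id: str
    status: str          # "declined" or "authorized"
    reason: str = ""


@dataclass
class PaymentIntent:
    intent_id: str
    amount_cents: int
    attempts: list = field(default_factory=list)
    canceled: bool = False

    @property
    def status(self):
        if self.canceled:
            return "canceled"
        if any(a.status == "authorized" for a in self.attempts):
            return "succeeded"
        # Declined attempts never close the intent; it simply waits for another try.
        return "requires_payment_method"
```

With this shape, an intent that has only declined attempts stays open, and the first authorized attempt flips it to succeeded without discarding the earlier history.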

Component 4: Double-Entry Bookkeeping

Every money movement is recorded as two ledger entries: a debit (money leaving an account) and a credit (money entering an account). The sum of all debits must equal the sum of all credits. Always. If it does not, money has been lost or created, and you have a bug.

Ledger entries for a $65 payment:

Authorization:
  No ledger entry (no money has moved -- just a hold)

Capture:
  Debit:  Customer Account    $65.00
  Credit: Merchant Escrow     $65.00

Settlement:
  Debit:  Merchant Escrow     $65.00
  Credit: Merchant Account    $62.81  (after 2.9% + $0.30 fee)
  Credit: Fee Revenue         $2.19   ($65 * 0.029 + $0.30 = $2.185, rounded to $2.19)

Refund (if it happens):
  Debit:  Merchant Account    $62.81
  Debit:  Fee Revenue         $2.19
  Credit: Customer Account    $65.00

Why double-entry?

Reconciliation: At any point, sum all debits and credits. If they do not balance, something is wrong. Run this check nightly.
Audit trail: Every dollar is accounted for. Where did the $2.19 fee go? It is in the Fee Revenue account.
Reversibility: A refund is just the inverse entries. The ledger remains balanced.

Implementation: The ledger is an append-only table. No updates. No deletes. Ever.

CREATE TABLE ledger_entries (
    entry_id BIGSERIAL PRIMARY KEY,
    payment_id UUID,
    account_id UUID,
    type VARCHAR(10),  -- DEBIT or CREDIT
    amount DECIMAL(10,2),
    currency VARCHAR(3),
    created_at TIMESTAMP
    -- NO updated_at. NO delete. Append only.
);
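
The settlement arithmetic and the balance check can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function names are hypothetical, the fee schedule is the 2.9% + $0.30 used above, and rounding half up to the cent is assumed because it reproduces the $2.19 figure. Decimal is used instead of float so that cents are exact.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def processing_fee(amount, rate=Decimal("0.029"), fixed=Decimal("0.30")):
    """Fee rounded half-up to the cent (rounding mode is an assumption)."""
    return (amount * rate + fixed).quantize(CENT, rounding=ROUND_HALF_UP)


def settlement_entries(amount):
    """Ledger rows for settlement as (type, account, amount) tuples."""
    fee = processing_fee(amount)
    return [
        ("DEBIT", "merchant_escrow", amount),
        ("CREDIT", "merchant_account", amount - fee),
        ("CREDIT", "fee_revenue", fee),
    ]


def is_balanced(entries):
    """True when total debits equal total credits."""
    debits = sum((a for kind, _, a in entries if kind == "DEBIT"), Decimal("0"))
    credits = sum((a for kind, _, a in entries if kind == "CREDIT"), Decimal("0"))
    return debits == credits


for row in settlement_entries(Decimal("65.00")):
    print(row)
```

Computing the merchant's share as the amount minus the rounded fee, rather than rounding both sides independently, guarantees the two credits always add up to the debit.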

Component 5: PCI DSS Scope Reduction via Tokenization

PCI DSS says: any system that stores, processes, or transmits cardholder data (card number, CVV, expiry) must comply. Compliance is expensive ($200K+/year for Level 1 audits). The solution: minimize the number of systems that touch card data.

Tokenization flow:

Customer enters card details in the browser.
JavaScript SDK (Stripe.js, Braintree Drop-in) sends card details directly to the card vault (a PCI-compliant service) -- your server NEVER sees the card number.
The card vault returns a token: tok_abc123.
Your server receives only the token. All subsequent operations use the token.
Only the card vault is in PCI scope. Your API servers, databases, logs -- all out of scope.
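
The separation of duties in this flow can be illustrated with a toy sketch. Everything here is hypothetical: CardVault is an in-memory stand-in (a real vault encrypts card numbers with HSM-managed keys and runs in its own network segment), and create_payment represents the merchant-side handler that only ever receives a token.

```python
import secrets


class CardVault:
    """Toy in-memory vault; the only component that ever holds a card number."""

    def __init__(self):
        self._pan_by_token = {}

    def tokenize(self, pan):
        # The token is random, so it reveals nothing about the card number.
        token = "tok_" + secrets.token_hex(8)
        self._pan_by_token[token] = pan
        return token

    def detokenize(self, token):
        # Called only inside the vault boundary when talking to the card network.
        return self._pan_by_token[token]


def create_payment(token, amount_cents):
    """Merchant-side handler: accepts a vault token, never raw card data."""
    if not token.startswith("tok_"):
        raise ValueError("expected a vault token, not card data")
    return {"payment_method": token, "amount_cents": amount_cents}
```

Because the payment record contains only the token, it can be logged, replicated, and stored in ordinary databases without pulling those systems into scope.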

The card vault is a hardened, isolated service with:

Hardware Security Modules (HSMs) for encryption key management
No outbound network access except to card networks
Separate network segment with strict firewall rules
Encrypted at rest with per-tenant keys
Access logging and anomaly detection

Result: Instead of your entire infrastructure being in PCI scope (30+ servers, networking, databases), only the card vault cluster (3-5 servers) is in scope. Audit cost drops from $200K to $50K.

Component 6: 3D Secure Authentication

3D Secure (3DS) adds cardholder authentication for high-risk transactions. The customer is redirected to their bank's authentication page (password, OTP, biometric).

When to trigger 3DS:

Transactions over a threshold (e.g., > $100)
Card-not-present transactions (all e-commerce)
First-time use of a card with a merchant
Fraud score above threshold

System design impact: 3DS introduces a multi-step asynchronous flow:

1. Payment service requests 3DS enrollment check from card network.
2. If enrolled, return a redirect URL to the merchant.
3. Merchant redirects customer to their bank's 3DS page.
4. Customer authenticates (10-30 seconds of user interaction).
5. Bank returns authentication result to the merchant's return URL.
6. Merchant's server sends the authentication result to the payment service.
7. Payment service proceeds with authorization (now with 3DS proof).

This flow changes the payment from synchronous (3 seconds) to asynchronous (30+ seconds). The PaymentIntent must persist across this gap. The requires_action state in Stripe's state machine handles this.
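
A rough sketch of how the intent survives that gap, assuming a durable store keyed by intent ID. The INTENTS dictionary stands in for that store, the function names are hypothetical, and the redirect URL is a placeholder.

```python
INTENTS = {}  # stand-in for a durable store keyed by intent ID


def start_payment(intent_id, needs_3ds, redirect_url="https://bank.example/3ds"):
    """Persist the intent; park it in requires_action when 3DS is needed."""
    if needs_3ds:
        INTENTS[intent_id] = {"status": "requires_action", "redirect_url": redirect_url}
    else:
        INTENTS[intent_id] = {"status": "processing"}
    return INTENTS[intent_id]


def complete_3ds(intent_id, authenticated):
    """Called when the merchant relays the bank's authentication result."""
    intent = INTENTS[intent_id]
    if intent["status"] != "requires_action":
        raise ValueError("intent is not waiting on 3DS")
    # Success resumes authorization; failure returns to collecting a payment method.
    intent["status"] = "processing" if authenticated else "requires_payment_method"
    intent.pop("redirect_url", None)
    return intent
```

The request that starts the payment and the request that completes 3DS may land on different servers many seconds apart, which is why the state has to live in the store rather than in process memory.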

Liability shift: With 3DS, fraud liability shifts from the merchant to the issuing bank. If a 3DS-authenticated transaction turns out to be fraudulent, the bank eats the loss, not the merchant. This is the main incentive for merchants to implement 3DS despite the friction.

Deep Dives

[Figure: Payment Processing State Machine]

Deep Dive 1: "Default to Non-Retryable"

This is the most important engineering principle in payment systems. When a call to the acquirer/card network fails, should you retry?

Default answer: NO.

Here is why. Retrying means sending the same authorization request again. If the first request succeeded but the response was lost, retrying creates a double charge. The damage from a double charge (customer complaint, chargeback, regulatory violation) is far worse than the damage from a missed charge (customer re-initiates, you try again with explicit user action).

Retry classification:

HTTP 400 (Bad Request):     Non-retryable. Fix the request.
HTTP 401 (Unauthorized):    Non-retryable. Fix credentials.
HTTP 404 (Not Found):       Non-retryable.
HTTP 409 (Conflict):        Non-retryable. Already processed.
HTTP 429 (Rate Limited):    Retryable. Back off and retry.
HTTP 500 (Server Error):    AMBIGUOUS. Default to non-retryable.
HTTP 502/503 (Gateway):     AMBIGUOUS. Default to non-retryable.
Network Timeout:            AMBIGUOUS. Default to non-retryable.

For ambiguous failures: Do not retry automatically. Instead:

1. Record the attempt as status: unknown.
2. Schedule a background reconciliation job for 60 seconds later.
3. The reconciliation job queries the acquirer for the transaction status using the idempotency token.
4. If the acquirer says "authorized," update status to authorized. If "not found," update to declined and allow the customer to retry.
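
The classification table and the reconciliation step translate fairly directly into code. The sketch below is illustrative only: the names are hypothetical, None is used to model a network timeout, and lookup_status stands in for whatever status-query call the acquirer actually offers.

```python
RETRYABLE = "retryable"
NON_RETRYABLE = "non_retryable"
AMBIGUOUS = "ambiguous"

# Mirrors the retry classification table above.
CLASSIFICATION = {
    400: NON_RETRYABLE,
    401: NON_RETRYABLE,
    404: NON_RETRYABLE,
    409: NON_RETRYABLE,
    429: RETRYABLE,
    500: AMBIGUOUS,
    502: AMBIGUOUS,
    503: AMBIGUOUS,
}


def classify(http_status=None):
    """None models a network timeout; unknown codes default to AMBIGUOUS."""
    return CLASSIFICATION.get(http_status, AMBIGUOUS)


def may_auto_retry(http_status=None):
    """Only explicitly retryable failures are retried automatically."""
    return classify(http_status) == RETRYABLE


def reconcile_unknown(attempt, lookup_status):
    """Resolve an 'unknown' attempt by asking the acquirer what happened.

    lookup_status(token) is a hypothetical query that returns
    'authorized' or 'not_found'.
    """
    acquirer_status = lookup_status(attempt["token"])
    attempt["status"] = "authorized" if acquirer_status == "authorized" else "declined"
    return attempt
```

Defaulting unknown status codes to AMBIGUOUS keeps the system on the safe side: a code nobody anticipated is treated as "the charge may exist" rather than as permission to retry.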

Airbnb's rule: "We'd rather fail a payment and ask the customer to try again than double-charge them and deal with the chargeback."

Deep Dive 2: Reconciliation Pipeline

No matter how good your idempotency layer is, discrepancies happen. Network glitches, bugs, timing issues. The reconciliation pipeline catches them.

Daily reconciliation:

1. Extract: At end of day, acquirer sends a settlement file listing all authorized and captured transactions.
2. Compare: Your ledger entries are compared against the acquirer's file.
3. Classify discrepancies:
   a. "In our ledger, not in acquirer's file" -- we think we charged, they do not. Investigate immediately. Possible lost authorization.
   b. "In acquirer's file, not in our ledger" -- they charged, we did not record. Customer may be double-charged. Initiate reversal.
   c. "Amount mismatch" -- possible partial capture or currency conversion error.
4. Resolve: Each discrepancy is assigned to a team for investigation. Most resolve within 24 hours.

Implementation: The reconciliation pipeline is a batch job (Spark or a simple Python script) that runs nightly. It ingests the acquirer's settlement file (typically a CSV delivered over SFTP), joins it with the ledger database, and outputs a discrepancy report.
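
The comparison step itself is a keyed join with three discrepancy buckets. A minimal sketch, assuming both sides have already been loaded into dictionaries that map payment_id to an amount in cents; the function and bucket names are hypothetical.

```python
def reconcile(ledger, acquirer):
    """Compare two payment_id -> amount_cents mappings and bucket the differences."""
    report = {
        "missing_at_acquirer": [],  # in our ledger, not in the acquirer's file
        "missing_in_ledger": [],    # in the acquirer's file, not in our ledger
        "amount_mismatch": [],      # present on both sides with different amounts
    }
    for payment_id in sorted(ledger.keys() | acquirer.keys()):
        if payment_id not in acquirer:
            report["missing_at_acquirer"].append(payment_id)
        elif payment_id not in ledger:
            report["missing_in_ledger"].append(payment_id)
        elif ledger[payment_id] != acquirer[payment_id]:
            report["amount_mismatch"].append(payment_id)
    return report


ledger = {"pi_1": 6500, "pi_2": 1000, "pi_3": 2000}
acquirer_file = {"pi_1": 6500, "pi_3": 2500, "pi_4": 900}
print(reconcile(ledger, acquirer_file))
```

At 1M payments/day both dictionaries fit comfortably in memory on one machine, which is consistent with treating this as a simple nightly script rather than a distributed job.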

Volume: 1M payments/day generates ~50,000 line items in the settlement file (after netting). Typical discrepancy rate: 0.01-0.1% of payments (100-1,000 items to investigate). If the discrepancy rate exceeds 1%, something is seriously wrong and the pipeline triggers a P0 alert.

Deep Dive 3: Handling Currency Conversion

Multi-currency payments add complexity. A Japanese customer pays in JPY for a US merchant who receives USD.

Conversion points:

At authorization: The payment is authorized in the customer's currency (JPY). The issuing bank holds JPY.
At settlement: The card network converts JPY to USD at the day's exchange rate. The merchant receives USD.
Exchange rate risk: Between authorization (Day 0) and settlement (Day 2), the exchange rate may change. Who bears the risk? Typically the merchant or the payment processor, depending on the contract.

System design impact: The ledger must store both currencies:

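-- IDs are shown as readable placeholders; the real columns hold UUIDs.
INSERT INTO ledger_entries (payment_id, account_id, type, amount, currency)
VALUES
  ('pi_12345', 'customer_acct',   'DEBIT',  10000, 'JPY'),
  ('pi_12345', 'fx_clearing',     'CREDIT', 10000, 'JPY'),
  ('pi_12345', 'fx_clearing',     'DEBIT',  65.00, 'USD'),
  ('pi_12345', 'merchant_escrow', 'CREDIT', 65.00, 'USD');

The conversion rate is recorded as metadata on the payment. Double-entry still balances, but you need currency-aware balancing (sum of JPY debits = sum of JPY credits, sum of USD debits = sum of USD credits). That per-currency check only holds if the conversion passes through an intermediate account (fx_clearing above, an illustrative name) that is credited in JPY and debited in USD; a lone JPY debit paired with a lone USD credit would leave both currencies unbalanced.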
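
One way to make the per-currency check hold is to route the conversion through an intermediate account, so that each currency has its own matched debit and credit. The sketch below assumes such an account (called fx_clearing here, a hypothetical name) and shows the check passing with it and failing without it.

```python
from collections import defaultdict
from decimal import Decimal


def net_by_currency(entries):
    """Return debits minus credits per currency for (type, account, amount, currency) rows."""
    net = defaultdict(Decimal)  # Decimal() is zero
    for kind, _account, amount, currency in entries:
        net[currency] += amount if kind == "DEBIT" else -amount
    return dict(net)


def is_balanced_per_currency(entries):
    """True when every currency nets to zero on its own."""
    return all(value == 0 for value in net_by_currency(entries).values())


with_clearing = [
    ("DEBIT", "customer_acct", Decimal("10000"), "JPY"),
    ("CREDIT", "fx_clearing", Decimal("10000"), "JPY"),
    ("DEBIT", "fx_clearing", Decimal("65.00"), "USD"),
    ("CREDIT", "merchant_escrow", Decimal("65.00"), "USD"),
]
without_clearing = [with_clearing[0], with_clearing[3]]

print(is_balanced_per_currency(with_clearing))
print(net_by_currency(without_clearing))
```

The clearing account ends each day holding offsetting JPY and USD positions; valuing those positions at the recorded conversion rate is where exchange-rate gains or losses become visible.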

Alternative Designs

Approach	Pros	Cons	When to Use
Stripe-style PaymentIntent (described above)	Well-documented state machine. No terminal failed state. Two-level idempotency.	Complex. Many states. Requires careful orchestration.	Any serious payment processor.
Simple charge API (Stripe's old API)	One API call: charge the card. Simpler.	No separate authorize/capture. Cannot handle delayed capture (hotels, restaurants). Deprecated by Stripe.	Simple e-commerce with instant fulfillment.
PayPal model (redirect-based)	All payment handling on PayPal's side. Zero PCI scope.	User leaves your site. Higher abandonment. PayPal's cut is 3.49% + $0.49.	When PCI compliance is too expensive.
Cryptocurrency	No intermediaries. Low fees. Instant settlement.	Price volatility. Irreversible transactions. No chargeback protection. Regulatory uncertainty.	Niche. Not suitable for mainstream e-commerce.
Bank transfer (ACH/SEPA)	Low fees (0.5-1%). Direct bank-to-bank.	Slow (1-3 business days). No instant confirmation. No chargeback protection.	Recurring billing. B2B payments. Large transactions.

Scaling Math Verification

Payment Processing

Peak payments/sec:           500 (Black Friday)
Authorization round trip:    1-2 seconds (card network latency)
Concurrent authorizations:   500 * 1.5 = 750 in-flight
Payment service threads:     750 (one per in-flight auth) + buffer = 1,000
Servers (50 threads each):   20 servers

Idempotency cache (Redis):
  Keys per day:              1 million
  Key size:                  ~200 bytes (key + response hash)
  TTL:                       24 hours
  Redis memory:              1M * 200B = 200 MB (trivial)
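
The in-flight figure is an application of Little's law (concurrent requests = arrival rate x time in system). A short sketch that recomputes the numbers above, using the midpoint of the 1-2 second round trip; the 1,000-thread budget is taken from the text rather than derived.

```python
import math

PEAK_PER_SEC = 500
AVG_AUTH_SECONDS = 1.5       # midpoint of the 1-2 second round trip
THREAD_BUDGET = 1_000        # 750 in flight plus buffer, as stated above
THREADS_PER_SERVER = 50
KEYS_PER_DAY = 1_000_000
KEY_BYTES = 200

in_flight = PEAK_PER_SEC * AVG_AUTH_SECONDS               # Little's law: L = lambda * W
servers = math.ceil(THREAD_BUDGET / THREADS_PER_SERVER)
redis_mb = KEYS_PER_DAY * KEY_BYTES / 1e6

print(f"in-flight authorizations: {in_flight:.0f}")
print(f"servers needed:           {servers}")
print(f"idempotency cache:        {redis_mb:.0f} MB")
```

Note how sensitive the result is to card network latency: if the average round trip doubles to 3 seconds during an issuer slowdown, in-flight requests double to 1,500 and exceed the thread budget.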

Ledger Database

Entries per payment:         2 (debit + credit) for capture
                            + 2 for settlement
                            + 2 for fee calculation
                            = 6 entries per payment
Entries per day:             6M
Entry size:                  ~200 bytes
Daily ledger volume:         1.2 GB
Annual:                      438 GB
7-year retention:            3 TB

PostgreSQL with monthly partitions:
  12 partitions/year * 7 years = 84 partitions
  Largest partition:          ~37 GB (easily fits in memory)

Reconciliation Pipeline

Acquirer settlement file:    50,000 records/day (after netting)
Ledger records to compare:   1M payments/day
Join operation:              Hash join on payment_id
Processing time:             ~5 minutes on a single machine
Discrepancy rate:            0.05% = 500 items for human review

Failure Analysis

Failure	Impact	Mitigation
Authorization timeout (ambiguous)	Unknown if customer is charged. Risk of double charge on retry.	Default to non-retryable. Reconcile via background job querying acquirer. Never retry ambiguous auth failures.
Capture fails after authorization	Customer's funds held but not charged. Merchant ships product for free.	Retry capture with exponential backoff (safe to retry -- capture is idempotent with auth code). Alert if capture fails 3x. Manual intervention after 24 hours. Auth hold expires in 7 days.
Settlement file missing	Cannot reconcile today's transactions. Money movement delayed.	Alert immediately. Contact acquirer for file. Hold settlement until reconciliation completes.
Card vault (HSM) goes down	Cannot tokenize new cards. Cannot process new payments with new cards. Existing tokens still work.	HSM cluster with N+1 redundancy. Geographically distributed. If all HSMs fail, reject payments that need a new card; payments that use existing tokens continue to work.
Ledger database goes down	Cannot record new ledger entries. Money movement stops.	Synchronous replication to standby. Automatic failover. Buffer ledger entries in Kafka during outage; replay on recovery. NEVER process payment without recording it.
Double charge detected	Customer charged twice. Chargeback risk. Regulatory violation.	Automatic detection: two authorized transactions with same idempotency key. Auto-refund the duplicate within 1 hour. Notify customer proactively.
Webhook delivery fails	Merchant does not know payment succeeded. May not fulfill order.	Retry webhook with exponential backoff (1s, 10s, 100s, 1000s). Store webhook delivery status. Merchant can poll GET /payments/:id as fallback.
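
The webhook retry schedule in the last row is a plain exponential backoff with a factor of 10. A minimal sketch, with hypothetical function names; send and sleep are injected so the logic can be exercised without real network calls or waiting.

```python
def webhook_delays(max_retries=4, base_seconds=1, factor=10):
    """Delays before each retry: 1s, 10s, 100s, 1000s by default."""
    return [base_seconds * factor ** n for n in range(max_retries)]


def deliver_with_retries(send, sleep, max_retries=4):
    """Try once, then retry after each delay. send() returns True on success."""
    if send():
        return True
    for delay in webhook_delays(max_retries):
        sleep(delay)
        if send():
            return True
    # Give up; the delivery is recorded as failed and the merchant can still poll.
    return False
```

In production the delays would normally be scheduled through a queue rather than a blocking sleep, and the receiving merchant must treat webhooks as at-least-once, since a retry can follow a delivery whose acknowledgement was lost.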

Level Expectations

Level	What the Interviewer Expects
Mid (L4)	Basic charge flow. Database with payment status. Knows about idempotency conceptually. Mentions PCI as a concern.
Senior (L5)	Authorize-Capture-Settle as separate phases. Idempotency key implementation at API level. Tokenization for PCI scope reduction. Double-entry bookkeeping. Payment state machine with multiple attempts. Webhook delivery with retries. "Default to non-retryable" principle.
Staff+ (L6)	Airbnb Orpheus two-level idempotency. Stripe's "no terminal failed state" design philosophy. Daily reconciliation pipeline with discrepancy classification. 3D Secure flow and liability shift. Currency conversion in the ledger. Card vault HSM architecture. Quantified PCI audit cost savings from tokenization. Reference to actual Stripe PaymentIntent state machine.
