diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md index ab5a105..41d5911 100644 --- a/ARCHITECTURE.md +++ b/ARCHITECTURE.md @@ -1,209 +1,155 @@ # Architecture -## Component Overview +## System Overview -```mermaid -graph TD - Client["Client
(Browser / App)"] - API["API Layer
chi router + handlers"] - WS["WebSocket Hub
live position updates"] - Auth["Auth
JWT + argon2id OTP"] - - CS["ClaimService"] - QS["QueueService"] - PS["PaymentService"] - ES["EventService"] - - AW["Admission Worker
every 5s"] - EW["Expiry Worker
every 30s"] - RW["Reconciliation Worker
every 30s"] - - PG[("PostgreSQL
source of truth")] - RD[("Redis
performance layer")] - GW["Paystack Gateway"] - - Client -->|HTTP| API - Client -->|WS| WS - API --> Auth - API --> CS - API --> QS - API --> PS - API --> ES - - AW -->|admit batch| QS - AW -->|push token| WS - EW -->|release claims| CS - RW -->|heal divergence| RD - RW -->|reconcile payments| PS - - CS --> PG - CS --> RD - QS --> PG - QS --> RD - PS --> PG - PS --> GW - ES --> PG +FairQueue is built in three horizontal layers. Each layer depends only on the layers below it. No upward dependencies exist. + +``` +┌─────────────────────────────────────────────────────┐ +│ API Layer (HTTP handlers, WebSocket hub) │ +├─────────────────────────────────────────────────────┤ +│ Workers (admission, expiry, reconciliation) │ +├─────────────────────────────────────────────────────┤ +│ Services (claims, queue, payments, events) │ +├─────────────────────────────────────────────────────┤ +│ Stores (postgres/, redis/) │ +├─────────────────────────────────────────────────────┤ +│ Domain (state machines, pure business rules) │ +└─────────────────────────────────────────────────────┘ ``` -## Request Flow: Customer Buys a Ticket +## Component Diagram ```mermaid -sequenceDiagram - actor Customer - participant API - participant Queue as Queue Service - participant Redis - participant Postgres - participant Admission as Admission Worker - participant Claim as Claim Service - participant Payment as Payment Service - participant Paystack - - Customer->>API: POST /events/{id}/queue - API->>Queue: Join(customerID, eventID) - Queue->>Postgres: INSERT queue_entry (WAITING) - Queue->>Redis: ZADD waiting:{eventID} score=joinedAt - API-->>Customer: 201 { position: 1547 } - - Note over Admission: Every 5 seconds - Admission->>Redis: ZPOPMIN waiting:{eventID} batch - Admission->>Redis: ZADD admitted:{eventID} - Admission->>Postgres: UPDATE queue_entries SET status=ADMITTED - Admission->>Customer: WebSocket push { type: admitted, token: ... } - - Customer->>API: POST /events/{id}/claims { admission_token } - API->>Claim: Claim(token, eventID) - Claim->>Redis: SET NX lock (concurrency shield layer 1) - Claim->>Redis: DECRBY inventory (atomic Lua script) - Claim->>Postgres: INSERT claims (unique constraint = layer 2) - API-->>Customer: 201 { claim_id, expires_at } - - Customer->>API: POST /claims/{id}/payments - API->>Payment: Initialize(claimID, customerID) - Payment->>Postgres: INSERT payments (status=INITIALIZING) - Payment->>Paystack: InitializeTransaction - Payment->>Postgres: UPDATE payments (status=PENDING) - API-->>Customer: 201 { authorization_url } - - Customer->>Paystack: Complete payment on hosted page - Paystack->>API: POST /webhooks/paystack (charge.success) - API->>Payment: HandleWebhook(rawBody, signature) - Payment->>Postgres: UPDATE payments (PENDING→CONFIRMED) - Payment->>Postgres: UPDATE claims (CLAIMED→CONFIRMED) - API-->>Paystack: 200 OK -``` +graph TD + subgraph Clients + Browser["Browser / Mobile App"] + end -## Inventory Consistency Model + subgraph API["API Layer (chi router)"] + Handlers["HTTP Handlers"] + Middleware["Auth Middleware\nOrganizer JWT · Customer JWT"] + Hub["WebSocket Hub\nlive position updates"] + end -Redis is a performance layer over Postgres. It is never the source of truth. + subgraph Services + EventSvc["EventService\ncreate · activate · end"] + QueueSvc["QueueService\njoin · position · abandon"] + ClaimSvc["ClaimService\nclaim · release"] + PaymentSvc["PaymentService\ninitialize · webhook · reconcile"] + end -```mermaid -flowchart LR - subgraph Write Path - direction TB - W1["Postgres INSERT claim
(atomic, durable)"] - W2["Redis DECRBY inventory
(fast, best-effort)"] - W1 --> W2 + subgraph Coordinators["Service Coordinators"] + QueueCoord["QueueCoordinator\nPostgres + Redis queue ops"] + InvCoord["InventoryCoordinator\nlock + decrement + rollback"] end - subgraph Failure Recovery - direction TB - F1["Redis wiped / crash"] - F2["Reconciliation Worker
derives count from Postgres
every 30s"] - F3["Startup Recovery
rebuilds ZSETs from Postgres"] - F1 --> F2 - F1 --> F3 + subgraph Workers["Background Workers (Scheduler)"] + AdmWorker["Admission Worker\nevery 5s — admit next batch"] + ExpWorker["Expiry Worker\nevery 30s — release stale claims"] + RecWorker["Reconciliation Worker\nevery 30s — heal Redis divergence"] + Recovery["Startup Recovery\nonce at boot — rebuild Redis from PG"] end - subgraph Read Path - direction TB - R1["Redis read (fast)"] - R2{"cache miss?"} - R3["Postgres read (fallback)"] - R1 --> R2 - R2 -->|yes| R3 + subgraph Storage + PG[("PostgreSQL\nsource of truth")] + RD[("Redis\nperformance layer only")] end + + Paystack["Paystack Gateway\nHTTP + webhook"] + + Browser -->|HTTP REST| Handlers + Browser -->|WebSocket ?token=| Hub + Handlers --> Middleware + Handlers --> EventSvc + Handlers --> QueueSvc + Handlers --> ClaimSvc + Handlers --> PaymentSvc + + QueueSvc --> QueueCoord + ClaimSvc --> QueueCoord + ClaimSvc --> InvCoord + QueueCoord --> PG + QueueCoord --> RD + InvCoord --> RD + + EventSvc --> PG + PaymentSvc --> PG + PaymentSvc --> Paystack + PaymentSvc --> InvCoord + + AdmWorker --> QueueCoord + AdmWorker --> InvCoord + AdmWorker -->|push admission token| Hub + ExpWorker --> PG + ExpWorker --> InvCoord + RecWorker --> PG + RecWorker --> InvCoord + Recovery --> PG + Recovery --> RD ``` -## Payment State Machine +## Core Flows -```mermaid -stateDiagram-v2 - [*] --> INITIALIZING: INSERT before gateway call - INITIALIZING --> PENDING: Paystack responds OK - INITIALIZING --> FAILED: Permanent gateway error - PENDING --> CONFIRMED: charge.success webhook - PENDING --> FAILED: charge.failed webhook - CONFIRMED --> [*] - FAILED --> [*] - - note right of INITIALIZING - Outbox record exists before - any external call is made. - Crash here = reconciliation - worker retries the gateway call. - end note -``` +### A customer buys a ticket -## Claim State Machine +1. **Join queue** — `POST /events/{id}/queue` writes to Postgres and adds the customer to the Redis waiting ZSET with their join timestamp as score. This is an O(log N) Redis write — it absorbs any volume. -```mermaid -stateDiagram-v2 - [*] --> CLAIMED: Customer claims with admission token - CLAIMED --> CONFIRMED: Payment confirmed - CLAIMED --> RELEASED: Payment failed or explicit release - CLAIMED --> RELEASED: Expiry worker (10min TTL) - CONFIRMED --> [*] - RELEASED --> [*] - - note right of RELEASED - Inventory restored in Redis. - Next person in queue can claim. - end note -``` +2. **Get admitted** — The admission worker runs every 5 seconds. It atomically moves the next batch from the waiting ZSET to the admitted ZSET (`ZPOPMIN` + `ZADD` in a Lua script), updates Postgres, generates a signed admission token per customer, and pushes it via WebSocket. Customers who miss the push poll `GET /events/{id}/queue/position` to retrieve their token. -## Concurrency Shield +3. **Claim** — `POST /events/{id}/claims` verifies the admission token, then runs through two concurrency layers: a Redis `SET NX` lock (layer 1) and an atomic Lua script that decrements the inventory counter only if it is above zero. If both pass, a claim row is inserted in Postgres. A unique constraint on `(event_id, customer_id)` is the final correctness guarantee. -Two independent layers prevent double-booking. Both must fail for an oversell to occur. +4. **Pay** — `POST /claims/{id}/payments` writes a `Payment` row in `INITIALIZING` state before touching Paystack (the outbox pattern). It then calls the Paystack initialize API. On success the row moves to `PENDING`. When Paystack fires a `charge.success` webhook, both the payment and the claim are confirmed in a single Postgres transaction. -```mermaid -flowchart TD - Request["Claim Request"] +### What happens when Redis is wiped - L1{"Redis SET NX lock
acquired?"} - Reject1["Return ErrAlreadyClaimed
(concurrent request)"] +Redis holds only reconstructible state. On startup, `RecoverRedisState` reads all active events from Postgres, derives the authoritative inventory count (`total_inventory - active_claims`), and re-adds it to Redis with `SET NX` (no-op if the key already exists). It also re-adds all `WAITING` queue entries to the ZSET using their original `joined_at` timestamp, preserving FIFO order exactly. Within 30 seconds, the reconciliation worker will also force-sync any remaining divergence. - L2["Atomic Lua script
DECRBY inventory"] - SoldOut["Return ErrEventSoldOut"] +## Inventory Consistency - L3{"Postgres INSERT
unique constraint passes?"} - Reject3["Rollback Redis decrement
Return ErrAlreadyClaimed"] +Redis is a write-through cache over Postgres. The ordering is always: Postgres first, Redis second. + +``` +Claim request arrives + │ + ▼ +Redis SET NX lock acquired? + No → return ErrAlreadyClaimed + Yes → continue + │ + ▼ +Redis Lua: DECRBY inventory if > 0 + -2 (sold out) → return ErrEventSoldOut + -1 (cache miss) → fall back to Postgres count, then retry + ≥ 0 (success) → continue + │ + ▼ +Postgres INSERT claim + unique violation → rollback Redis decrement, return ErrAlreadyClaimed + success → claim created ✓ + │ + ▼ +Release lock +``` - Success["Claim created ✓"] +If the server crashes after the Postgres commit but before the Redis decrement, Redis shows more inventory than actually exists. The reconciliation worker heals this within 30 seconds. The reverse — Redis showing less inventory than exists — would incorrectly turn away valid customers and is never permitted to happen. - Request --> L1 - L1 -->|no| Reject1 - L1 -->|yes| L2 - L2 -->|count ≤ 0| SoldOut - L2 -->|decremented| L3 - L3 -->|violation| Reject3 - L3 -->|inserted| Success +## Payment State Machine + +``` +INITIALIZING → PENDING → CONFIRMED + │ │ + └── FAILED ←────┘ ``` -## Layer Dependencies +A record is always written in `INITIALIZING` before the gateway is called. A crash at any point leaves a recoverable record. The reconciliation worker finds any record older than the configured stale threshold and either retries the initialization (for `INITIALIZING`) or polls Paystack for the current status (for `PENDING`). -Each layer depends only on the layers below it. No upward dependencies. +## Claim State Machine ``` -┌─────────────────────────────────┐ -│ API (handlers, middleware) │ ← HTTP boundary -├─────────────────────────────────┤ -│ Workers (scheduler, 3 workers) │ ← background processing -├─────────────────────────────────┤ -│ Services (4 services) │ ← business logic -├─────────────────────────────────┤ -│ Stores (postgres, redis) │ ← infrastructure -├─────────────────────────────────┤ -│ Domain (state machines) │ ← pure business rules -└─────────────────────────────────┘ -``` \ No newline at end of file +CLAIMED → CONFIRMED (payment confirmed) + │ + └──→ RELEASED (payment failed, explicit release, or expiry worker) +``` + +When a claim moves to `RELEASED`, the Redis inventory counter is incremented and the next customer in the queue can claim. Postgres is always written first; the Redis increment is best-effort and healed by the reconciliation worker if it fails. \ No newline at end of file diff --git a/README.md b/README.md index 4faed29..38f433a 100644 --- a/README.md +++ b/README.md @@ -1,109 +1,93 @@ # FairQueue -A high-throughput inventory allocation system for high-demand live events in Nigeria. Handles the thundering herd problem — when 50,000 people try to buy 5,000 tickets at the same time — without overselling, without crashes, and without bots taking everything before real fans get a chance. +A virtual queue and inventory allocation system for high-demand live events in Nigeria. Built to handle the moment 50,000 people try to buy 5,000 tickets at exactly the same time — without overselling, without crashes, and without bots. ## The Problem -When a high-demand event goes on sale: +When a high-demand event goes on sale, three things happen simultaneously: -- Websites crash under the simultaneous load -- Bots claim inventory in milliseconds -- Payment failures silently lose claimed tickets -- Users have no idea where they stand +- The website crashes under the sudden spike in traffic +- Bots grab the inventory in milliseconds before real fans get a chance +- Payment failures silently lose tickets that were already claimed -FairQueue solves this with a virtual queue that absorbs the spike, admits customers at a controlled rate, and guarantees atomic inventory allocation backed by real distributed systems correctness. +FairQueue solves all three. It absorbs the traffic spike into a virtual queue, admits customers at a controlled rate, and guarantees that inventory allocation is atomic — two people can never get the same ticket. ## Quick Start ```bash git clone https://github.com/DanielPopoola/fairqueue cd fairqueue -cp .env.example .env # fill in Paystack keys; defaults work for local dev +cp .env.example .env # fill in your Paystack keys; everything else works out of the box docker compose up --build ``` -The API is live at `http://localhost:8080`. -Swagger UI is at `http://localhost:8080/swagger/index.html`. +API is available at `http://localhost:8080`. Swagger UI is at `http://localhost:8080/swagger/index.html`. ## How It Works -``` -50,000 users hit the sale at 10:00:00 AM - │ - ▼ - Virtual Queue (Redis ZSET) - Assigns position instantly — cheap O(log N) operation - │ - ▼ (admission worker, every 5s) - Admitted in batches sized to available inventory - Each admitted user receives a short-lived signed token - │ - ▼ - Claim (atomic Redis Lua script + Postgres) - Token verified → inventory decremented → claim created - │ - ▼ - Payment (Paystack) - Outbox record written before gateway call - Webhook confirms → claim confirmed - Failure → claim released → inventory restored -``` +A customer's journey through the system has four stages: + +**1. Queue** — When the sale opens, everyone hits `POST /events/{id}/queue` at once. This is a cheap Redis write (O(log N)), so it absorbs any traffic volume without touching the database. Each customer gets a queue position. + +**2. Admission** — A background worker runs every 5 seconds. It pops the next batch of customers from the waiting queue, moves them to the admitted set, and pushes them a signed admission token via WebSocket. Customers who miss the push can poll `GET /events/{id}/queue/position` to retrieve their token. + +**3. Claim** — An admitted customer presents their token to `POST /events/{id}/claims`. The system atomically checks and decrements the Redis inventory counter. If that succeeds, it inserts a claim row in Postgres. A unique constraint on `(event_id, customer_id)` is the last line of defence against any race condition. -The queue absorbs the thundering herd. Only admitted users reach the claim layer. Only claimed users reach the payment layer. Each transition is a rate-limiting gate. +**4. Payment** — The customer calls `POST /claims/{id}/payments`, which writes a payment record to Postgres *before* calling Paystack. This means a crash can never produce a charge with no record. The reconciliation worker finds and heals any payments stuck in intermediate states. ## Key Design Decisions -**PostgreSQL is the source of truth.** Redis holds no state that cannot be reconstructed from Postgres. If Redis is wiped, the reconciliation worker heals inventory counts and the recovery function rebuilds the queue ZSETs from Postgres on the next startup. See [TRADEOFFS.md](./TRADEOFFS.md) for the full reasoning. +**PostgreSQL is the only source of truth.** Redis holds nothing that cannot be reconstructed from Postgres. If Redis is wiped, the startup recovery function rebuilds the queue and the reconciliation worker heals the inventory count. Redis makes things fast; Postgres makes things correct. -**Two-layer concurrency shield.** A Redis `SET NX` lock prevents concurrent claim attempts from reaching the database simultaneously. A Postgres unique constraint on `(event_id, customer_id)` is the inviolable last line of defence. The lock is a performance optimisation; the constraint is the correctness guarantee. +**Two-layer concurrency shield.** A Redis `SET NX` lock stops concurrent claim attempts before they reach the database. A Postgres unique constraint on `(event_id, customer_id)` is the inviolable correctness guarantee that holds even if the lock is unavailable. Both layers must fail for an oversell to occur. -**Outbox pattern for payment safety.** A `Payment` row is written in `INITIALIZING` state before the Paystack API is called. A crash between the write and the gateway call leaves a recoverable record. The reconciliation worker finds stale `INITIALIZING` records and retries the gateway call. +**Outbox pattern for payment safety.** A `Payment` row is always written in `INITIALIZING` state before the Paystack API is called. A crash at any point leaves a recoverable record. The reconciliation worker finds stale `INITIALIZING` records and retries the gateway call. -**100 goroutines, 1 ticket, exactly 1 claim.** The concurrency guarantee is a tested invariant, not just a design intent. See `internal/service/claims_test.go`. +**Postgres-first writes.** The Redis inventory counter is only decremented *after* the Postgres insert commits. If the server crashes between the commit and the Redis write, Redis shows more tickets than exist — the reconciliation worker heals this within 30 seconds. The alternative (Redis first) risks permanently locking out valid users. Temporary inflation is the only acceptable failure mode. + +For the full reasoning behind every decision, see [TRADEOFFS.md](./TRADEOFFS.md). ## Architecture -See [ARCHITECTURE.md](./ARCHITECTURE.md) for component diagrams and flow diagrams. +See [ARCHITECTURE.md](./ARCHITECTURE.md) for a component diagram showing how all the pieces fit together. ## Project Structure ``` -cmd/api/ Entry point, dependency wiring +cmd/api/ Entry point and dependency wiring internal/ - domain/ State machines and business rules (no infrastructure) - service/ Business logic orchestration + domain/ State machines and domain errors — no infrastructure dependencies + service/ Business logic: claims, queue, payments, events store/ - postgres/ PostgreSQL store implementations - redis/ Redis store implementations - worker/ Background workers (admission, expiry, reconciliation) - api/ HTTP handlers, middleware, WebSocket hub - gateway/ Paystack payment gateway - auth/ JWT tokenizers, argon2id password hashing - config/ Environment-based configuration + postgres/ PostgreSQL store implementations + redis/ Redis store implementations (inventory, queue, lock) + worker/ Background workers: admission, expiry, reconciliation, recovery + api/ HTTP handlers, middleware, WebSocket hub + gateway/paystack/ Paystack payment gateway adapter + auth/ JWT tokenizers (organizer + customer), argon2id password hashing + config/ Environment-based configuration loading and validation infra/ - migrate/ Embedded SQL migrations - retry/ Generic retry with exponential backoff + migrate/ Embedded SQL migrations + retry/ Generic retry with exponential backoff ``` -## Testing +## Running Tests ```bash -# Unit tests — domain logic, no infrastructure +# Domain logic only — fast, no infrastructure required make test-unit -# Integration tests — real Postgres + Redis via testcontainers +# Service and worker tests — spins up real Postgres and Redis via testcontainers make test-integration -# All tests +# Full end-to-end flow tests +make test-e2e + +# Everything make test ``` -Integration tests run against real infrastructure using testcontainers. No mocks except the Paystack gateway — because mocking the database tells you nothing about whether the unique constraint fires. - -The test suite caught two production bugs during development: - -- `MarkFailed` silently no-oping because the payment was still `INITIALIZING` not `PENDING` when a permanent gateway error occurred -- Missing unique index on `payments(claim_id)` allowing duplicate gateway calls under concurrency +Integration tests run against real infrastructure. The only mock in the codebase is the Paystack gateway — because mocking the database tells you nothing about whether the unique constraint fires. ## Stack @@ -113,6 +97,6 @@ The test suite caught two production bugs during development: | Database | PostgreSQL 16 | | Cache / Queue | Redis 7 | | Payment | Paystack | -| Container | Docker + Compose | -| HTTP | chi | -| WebSocket | coder/websocket | \ No newline at end of file +| HTTP router | Chi | +| WebSocket | coder/websocket | +| Container | Docker + Compose | \ No newline at end of file