Add outage integration test: HMA + Redis (phase 1 of #343)#646
Add outage integration test: HMA + Redis (phase 1 of #343)#646julietshen wants to merge 2 commits into
Conversation
`JSON.stringify(filename)` produced double-quoted args that collapsed against the outer `bash -c "..."` wrapper, so the shell parsed the command as `cd <pkg> && .../eslint --fix` with no file args. eslint then linted the entire package directory and surfaced every pre-existing warning/error in the repo, blocking any commit whose staged files were clean but whose tree wasn't. Switch to single-quoted args (with the standard `'\''` escape for any embedded single quote) so the file paths survive the outer double-quoted `bash -c` intact. Same fix as julietshen/user-strikes-signal and background-jobs-integ-test — applying here too since neither has merged to main yet and this PR needs to commit through the same hook. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First phase of #343 per the planning comment on that issue. Adds the docker-compose service-control harness and the HMA + Redis outage scenarios. Scylla + Postgres + slow-variants are deferred to follow-up PRs. server/test/integ/dockerCompose.ts ---------------------------------- Thin wrapper around `docker compose stop/start/pause/unpause` plus two `withService*` helpers that auto-restore in a `finally` so an assertion inside the callback can't leak a stopped/paused service into the next test — which would cascade-fail the rest of the integ suite. `stop` simulates a "dependency is unreachable" outage (TCP rejected immediately). `pause` (SIGSTOP) simulates a slow / hung dependency (TCP hangs instead of rejects), close enough to a network timeout that we don't need to stand up toxiproxy for the assertions we care about. server/test/integ/outage.integ.test.ts -------------------------------------- Four scenarios, all green: 1. **HMA stopped**, report submission with image attached. Hits the outer try/catch in `submitReport`'s HMA block (the inner `withRetries`-wrapped `hashContentFromUrl` exhausts in ~3s), the report succeeds, no hashes attached. Graceful degradation. 2. **HMA paused**, same observable outcome — TCP-hung instead of reject, retries time out the same way. 3. **Redis stopped**, item submission via `/api/v1/items/async`. The API returns 202 immediately (the `itemSubmissionQueueBulkWrite` DataLoader batch resolves through ioredis's `enableOfflineQueue`). After Redis is restored, ioredis flushes the offline buffer, BullMQ delivers to the worker, and the item lands in Scylla as normal. **No data lost as long as the API process stays up.** 4. **Redis paused**, same shape — TCP-hung instead of reject, same recovery path. Note on the Redis path: today's 202-while-Redis-is-unreachable is *not* loss-of-data in the durable sense (offline buffer flushes on reconnect), but it *would* be lossy on simultaneous Redis outage + API process restart, because the offline buffer is in-process. Worth a follow-up (pre-flight Redis health check before claiming 202, or `enableOfflineQueue: false` on the enqueue connection) but doesn't belong in this test PR. Test must run with `--runInBand` (docker-compose state is shared global mutable state). The existing `npm run test:integ` config sets this implicitly via `--no-cache --forceExit` working sequentially in the integ suite. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Warning Review limit reached
More reviews will be available in 27 minutes and 54 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (3)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Pins today's known-lossy contract for #649: `ItemProcessingWorker` swallows Scylla insert errors, so a Scylla outage during processing drops the `item_submission_by_thread` row without retry. The job completes "successfully" from BullMQ's POV, the worker logs to ClickHouse, and the API has long since returned 202 to the client. Nothing surfaces the loss. The test: - Stops Scylla via `withServiceDown` (from phase 1 / #646). - Submits an item via `/api/v1/items/async`, expects 202. - Waits for the `CONTENT_API_REQUESTS` row in ClickHouse to confirm the worker reached past `insertWithRetries` — its presence proves the swallow happened and execution continued. If the worker were re-throwing on insert failure, this gate would time out instead. We deliberately don't assert "row absent in Scylla" after restore. The in-harness Cassandra driver doesn't recover cleanly within a reasonable window after a Scylla container restart (stale connections, prepared- statement re-prepare); a direct query there ends up testing the driver's reconnect timing rather than the worker's behaviour. The CH row is sufficient proof of the contract. When #649 escalates to path 1 (worker re-throws → BullMQ retries), the assertion shape flips to "row lands in Scylla after recovery" — same shape as the Redis-recovery test in `outage.integ.test.ts` — and the driver-recovery timing problem either dissolves into a presence poll or gets solved alongside. Lives in its own file rather than appending to `outage.integ.test.ts` because phase 1 (#646) is still under review at the time of writing and a parallel-branch PR would conflict. Once both land we can collapse the two files together. Branch stacks on `outage-integ-test` (PR #646) because it depends on the `dockerCompose.ts` harness shipped there. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First phase of #343 per the planning comment on that issue.
What's in
server/test/integ/dockerCompose.ts— thin wrapper arounddocker compose stop/start/pause/unpausepluswithServiceDown/withServicePausedhelpers that auto-restore in afinallyso an assertion failure inside the callback can't leak a stopped/paused service into the next scenario (which would cascade-fail the rest of the suite).server/test/integ/outage.integ.test.ts— 4 scenarios, all green:submitReport)./api/v1/items/async→ 202 immediately (ioredis offline-queue path), item lands in Scylla after Redis is restored. No data loss as long as the API process stays up.Findings worth a follow-up (not in this PR)
enableOfflineQueue: falseon the enqueue connection. I'll file this as a separate issue.What's deferred
Per the plan on #343, this PR covers HMA + Redis only. Scylla, Postgres, and explicit "slow" variants ship in follow-up PRs:
docker pausedoesn't faithfully cover a needed "slow" scenario; current scenarios don't need it.Test plan
(cd server && npm run test:integ -- outage.integ.test.ts)against the real stack (npm run up && npm run db:update) — 4 passed, 4 totalAGENTS.md callouts
--runInBandis required because docker-compose state is shared global mutable state. The existingnpm run test:integscript does not pass--runInBand; if CI runs scenarios concurrently this PR's tests will stomp each other. Worth a short follow-up to the jest config to enforce serial execution for the integ suite.🤖 Generated with Claude Code