Add outage integration test: HMA + Redis (phase 1 of #343) by julietshen · Pull Request #646 · roostorg/coop

julietshen · 2026-05-29T17:11:33Z

First phase of #343 per the planning comment on that issue.

What's in

server/test/integ/dockerCompose.ts — thin wrapper around docker compose stop/start/pause/unpause plus withServiceDown / withServicePaused helpers that auto-restore in a finally so an assertion failure inside the callback can't leak a stopped/paused service into the next scenario (which would cascade-fail the rest of the suite).
server/test/integ/outage.integ.test.ts — 4 scenarios, all green:
1. HMA stopped during a report with images → 201, no hashes attached (graceful degradation via the outer try/catch in submitReport).
2. HMA paused — same outcome, TCP-hung instead of rejected.
3. Redis stopped during /api/v1/items/async → 202 immediately (ioredis offline-queue path), item lands in Scylla after Redis is restored. No data loss as long as the API process stays up.
4. Redis paused — same shape.

Findings worth a follow-up (not in this PR)

The Redis path returns 202 while Redis is unreachable. Today's offline-queue buffering keeps the data durable until the buffer flushes on reconnect, but the buffer lives in the API process — a simultaneous Redis outage + API restart drops the submission. Worth either a pre-flight Redis health check before claiming 202, or enableOfflineQueue: false on the enqueue connection. I'll file this as a separate issue.

What's deferred

Per the plan on #343, this PR covers HMA + Redis only. Scylla, Postgres, and explicit "slow" variants ship in follow-up PRs:

Phase 2 (Scylla) ships either a "loud-error" code change or a documented swallow-error contract — design question, not just a test.
Phase 3 (Postgres) builds on server: Survive Postgres idle-client errors without crashing the process #542 for the active-request case.
Phase 4 (toxiproxy) only if docker pause doesn't faithfully cover a needed "slow" scenario; current scenarios don't need it.

Test plan

(cd server && npm run test:integ -- outage.integ.test.ts) against the real stack (npm run up && npm run db:update) — 4 passed, 4 total
CI runs the full integ suite — confirm the docker-control primitives work in that environment (test process must have access to the host's docker socket; verify the CI runner config exposes it).

AGENTS.md callouts

The new tests use docker-compose service control via the host socket. CI may need the docker socket mounted into the test container or the test runner moved out-of-container. Flagging here so reviewers can confirm the CI runner exposes docker.
--runInBand is required because docker-compose state is shared global mutable state. The existing npm run test:integ script does not pass --runInBand; if CI runs scenarios concurrently this PR's tests will stomp each other. Worth a short follow-up to the jest config to enforce serial execution for the integ suite.

🤖 Generated with Claude Code

`JSON.stringify(filename)` produced double-quoted args that collapsed against the outer `bash -c "..."` wrapper, so the shell parsed the command as `cd <pkg> && .../eslint --fix` with no file args. eslint then linted the entire package directory and surfaced every pre-existing warning/error in the repo, blocking any commit whose staged files were clean but whose tree wasn't. Switch to single-quoted args (with the standard `'\''` escape for any embedded single quote) so the file paths survive the outer double-quoted `bash -c` intact. Same fix as julietshen/user-strikes-signal and background-jobs-integ-test — applying here too since neither has merged to main yet and this PR needs to commit through the same hook. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

First phase of #343 per the planning comment on that issue. Adds the docker-compose service-control harness and the HMA + Redis outage scenarios. Scylla + Postgres + slow-variants are deferred to follow-up PRs. server/test/integ/dockerCompose.ts ---------------------------------- Thin wrapper around `docker compose stop/start/pause/unpause` plus two `withService*` helpers that auto-restore in a `finally` so an assertion inside the callback can't leak a stopped/paused service into the next test — which would cascade-fail the rest of the integ suite. `stop` simulates a "dependency is unreachable" outage (TCP rejected immediately). `pause` (SIGSTOP) simulates a slow / hung dependency (TCP hangs instead of rejects), close enough to a network timeout that we don't need to stand up toxiproxy for the assertions we care about. server/test/integ/outage.integ.test.ts -------------------------------------- Four scenarios, all green: 1. **HMA stopped**, report submission with image attached. Hits the outer try/catch in `submitReport`'s HMA block (the inner `withRetries`-wrapped `hashContentFromUrl` exhausts in ~3s), the report succeeds, no hashes attached. Graceful degradation. 2. **HMA paused**, same observable outcome — TCP-hung instead of reject, retries time out the same way. 3. **Redis stopped**, item submission via `/api/v1/items/async`. The API returns 202 immediately (the `itemSubmissionQueueBulkWrite` DataLoader batch resolves through ioredis's `enableOfflineQueue`). After Redis is restored, ioredis flushes the offline buffer, BullMQ delivers to the worker, and the item lands in Scylla as normal. **No data lost as long as the API process stays up.** 4. **Redis paused**, same shape — TCP-hung instead of reject, same recovery path. Note on the Redis path: today's 202-while-Redis-is-unreachable is *not* loss-of-data in the durable sense (offline buffer flushes on reconnect), but it *would* be lossy on simultaneous Redis outage + API process restart, because the offline buffer is in-process. Worth a follow-up (pre-flight Redis health check before claiming 202, or `enableOfflineQueue: false` on the enqueue connection) but doesn't belong in this test PR. Test must run with `--runInBand` (docker-compose state is shared global mutable state). The existing `npm run test:integ` config sets this implicitly via `--no-cache --forceExit` working sequentially in the integ suite. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-05-29T17:11:44Z

Warning

Review limit reached

@julietshen, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 27 minutes and 54 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: c14ad888-91a0-47e7-a1e3-ea07f13603a8

📥 Commits

Reviewing files that changed from the base of the PR and between 58e5831 and 8f8e08a.

📒 Files selected for processing (3)

lint-staged.config.mjs
server/test/integ/dockerCompose.ts
server/test/integ/outage.integ.test.ts

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch outage-integ-test

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Pins today's known-lossy contract for #649: `ItemProcessingWorker` swallows Scylla insert errors, so a Scylla outage during processing drops the `item_submission_by_thread` row without retry. The job completes "successfully" from BullMQ's POV, the worker logs to ClickHouse, and the API has long since returned 202 to the client. Nothing surfaces the loss. The test: - Stops Scylla via `withServiceDown` (from phase 1 / #646). - Submits an item via `/api/v1/items/async`, expects 202. - Waits for the `CONTENT_API_REQUESTS` row in ClickHouse to confirm the worker reached past `insertWithRetries` — its presence proves the swallow happened and execution continued. If the worker were re-throwing on insert failure, this gate would time out instead. We deliberately don't assert "row absent in Scylla" after restore. The in-harness Cassandra driver doesn't recover cleanly within a reasonable window after a Scylla container restart (stale connections, prepared- statement re-prepare); a direct query there ends up testing the driver's reconnect timing rather than the worker's behaviour. The CH row is sufficient proof of the contract. When #649 escalates to path 1 (worker re-throws → BullMQ retries), the assertion shape flips to "row lands in Scylla after recovery" — same shape as the Redis-recovery test in `outage.integ.test.ts` — and the driver-recovery timing problem either dissolves into a presence poll or gets solved alongside. Lives in its own file rather than appending to `outage.integ.test.ts` because phase 1 (#646) is still under review at the time of writing and a parallel-branch PR would conflict. Once both land we can collapse the two files together. Branch stacks on `outage-integ-test` (PR #646) because it depends on the `dockerCompose.ts` harness shipped there. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

julietshen and others added 2 commits May 29, 2026 13:10

julietshen requested review from cassidyjames, dom-notion, juanmrad and vinaysrao1 as code owners May 29, 2026 17:11

This was referenced May 29, 2026

items/async returns 202 while Redis is unreachable (narrow data-loss on API restart) #647

Open

Integ test suite needs --runInBand for outage scenarios #648

Open

Phase 2 of #343: Scylla outage scenario (decide swallow-vs-loud first) #649

Open

This was referenced May 29, 2026

Add Scylla outage integ test (phase 2 of #343) #652

Open

items/async: fail fast on Redis outage (closes #647) #653

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add outage integration test: HMA + Redis (phase 1 of #343)#646

Add outage integration test: HMA + Redis (phase 1 of #343)#646
julietshen wants to merge 2 commits into
mainfrom
outage-integ-test

julietshen commented May 29, 2026

Uh oh!

coderabbitai Bot commented May 29, 2026

Review limit reached

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

julietshen commented May 29, 2026

What's in

Findings worth a follow-up (not in this PR)

What's deferred

Test plan

AGENTS.md callouts

Uh oh!

coderabbitai Bot commented May 29, 2026

Review limit reached

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant