Skip to content

Add outage integration test: HMA + Redis (phase 1 of #343)#646

Open
julietshen wants to merge 2 commits into
mainfrom
outage-integ-test
Open

Add outage integration test: HMA + Redis (phase 1 of #343)#646
julietshen wants to merge 2 commits into
mainfrom
outage-integ-test

Conversation

@julietshen
Copy link
Copy Markdown
Member

First phase of #343 per the planning comment on that issue.

What's in

  • server/test/integ/dockerCompose.ts — thin wrapper around docker compose stop/start/pause/unpause plus withServiceDown / withServicePaused helpers that auto-restore in a finally so an assertion failure inside the callback can't leak a stopped/paused service into the next scenario (which would cascade-fail the rest of the suite).
  • server/test/integ/outage.integ.test.ts — 4 scenarios, all green:
    1. HMA stopped during a report with images → 201, no hashes attached (graceful degradation via the outer try/catch in submitReport).
    2. HMA paused — same outcome, TCP-hung instead of rejected.
    3. Redis stopped during /api/v1/items/async → 202 immediately (ioredis offline-queue path), item lands in Scylla after Redis is restored. No data loss as long as the API process stays up.
    4. Redis paused — same shape.

Findings worth a follow-up (not in this PR)

  • The Redis path returns 202 while Redis is unreachable. Today's offline-queue buffering keeps the data durable until the buffer flushes on reconnect, but the buffer lives in the API process — a simultaneous Redis outage + API restart drops the submission. Worth either a pre-flight Redis health check before claiming 202, or enableOfflineQueue: false on the enqueue connection. I'll file this as a separate issue.

What's deferred

Per the plan on #343, this PR covers HMA + Redis only. Scylla, Postgres, and explicit "slow" variants ship in follow-up PRs:

  • Phase 2 (Scylla) ships either a "loud-error" code change or a documented swallow-error contract — design question, not just a test.
  • Phase 3 (Postgres) builds on server: Survive Postgres idle-client errors without crashing the process #542 for the active-request case.
  • Phase 4 (toxiproxy) only if docker pause doesn't faithfully cover a needed "slow" scenario; current scenarios don't need it.

Test plan

  • (cd server && npm run test:integ -- outage.integ.test.ts) against the real stack (npm run up && npm run db:update) — 4 passed, 4 total
  • CI runs the full integ suite — confirm the docker-control primitives work in that environment (test process must have access to the host's docker socket; verify the CI runner config exposes it).

AGENTS.md callouts

  • The new tests use docker-compose service control via the host socket. CI may need the docker socket mounted into the test container or the test runner moved out-of-container. Flagging here so reviewers can confirm the CI runner exposes docker.
  • --runInBand is required because docker-compose state is shared global mutable state. The existing npm run test:integ script does not pass --runInBand; if CI runs scenarios concurrently this PR's tests will stomp each other. Worth a short follow-up to the jest config to enforce serial execution for the integ suite.

🤖 Generated with Claude Code

julietshen and others added 2 commits May 29, 2026 13:10
`JSON.stringify(filename)` produced double-quoted args that collapsed
against the outer `bash -c "..."` wrapper, so the shell parsed the
command as `cd <pkg> && .../eslint --fix` with no file args. eslint
then linted the entire package directory and surfaced every pre-existing
warning/error in the repo, blocking any commit whose staged files were
clean but whose tree wasn't.

Switch to single-quoted args (with the standard `'\''` escape for any
embedded single quote) so the file paths survive the outer
double-quoted `bash -c` intact.

Same fix as julietshen/user-strikes-signal and background-jobs-integ-test
— applying here too since neither has merged to main yet and this PR
needs to commit through the same hook.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First phase of #343 per the planning comment on that issue. Adds the
docker-compose service-control harness and the HMA + Redis outage
scenarios. Scylla + Postgres + slow-variants are deferred to follow-up
PRs.

server/test/integ/dockerCompose.ts
----------------------------------
Thin wrapper around `docker compose stop/start/pause/unpause` plus two
`withService*` helpers that auto-restore in a `finally` so an assertion
inside the callback can't leak a stopped/paused service into the next
test — which would cascade-fail the rest of the integ suite.

`stop` simulates a "dependency is unreachable" outage (TCP rejected
immediately). `pause` (SIGSTOP) simulates a slow / hung dependency
(TCP hangs instead of rejects), close enough to a network timeout that
we don't need to stand up toxiproxy for the assertions we care about.

server/test/integ/outage.integ.test.ts
--------------------------------------
Four scenarios, all green:

1. **HMA stopped**, report submission with image attached. Hits the
   outer try/catch in `submitReport`'s HMA block (the inner
   `withRetries`-wrapped `hashContentFromUrl` exhausts in ~3s), the
   report succeeds, no hashes attached. Graceful degradation.
2. **HMA paused**, same observable outcome — TCP-hung instead of
   reject, retries time out the same way.
3. **Redis stopped**, item submission via `/api/v1/items/async`. The
   API returns 202 immediately (the `itemSubmissionQueueBulkWrite`
   DataLoader batch resolves through ioredis's `enableOfflineQueue`).
   After Redis is restored, ioredis flushes the offline buffer, BullMQ
   delivers to the worker, and the item lands in Scylla as normal.
   **No data lost as long as the API process stays up.**
4. **Redis paused**, same shape — TCP-hung instead of reject, same
   recovery path.

Note on the Redis path: today's 202-while-Redis-is-unreachable is *not*
loss-of-data in the durable sense (offline buffer flushes on reconnect),
but it *would* be lossy on simultaneous Redis outage + API process
restart, because the offline buffer is in-process. Worth a follow-up
(pre-flight Redis health check before claiming 202, or `enableOfflineQueue:
false` on the enqueue connection) but doesn't belong in this test PR.

Test must run with `--runInBand` (docker-compose state is shared
global mutable state). The existing `npm run test:integ` config sets
this implicitly via `--no-cache --forceExit` working sequentially in
the integ suite.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 29, 2026

Warning

Review limit reached

@julietshen, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 27 minutes and 54 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: c14ad888-91a0-47e7-a1e3-ea07f13603a8

📥 Commits

Reviewing files that changed from the base of the PR and between 58e5831 and 8f8e08a.

📒 Files selected for processing (3)
  • lint-staged.config.mjs
  • server/test/integ/dockerCompose.ts
  • server/test/integ/outage.integ.test.ts
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch outage-integ-test

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

julietshen added a commit that referenced this pull request May 29, 2026
Pins today's known-lossy contract for #649: `ItemProcessingWorker`
swallows Scylla insert errors, so a Scylla outage during processing
drops the `item_submission_by_thread` row without retry. The job
completes "successfully" from BullMQ's POV, the worker logs to
ClickHouse, and the API has long since returned 202 to the client.
Nothing surfaces the loss.

The test:
- Stops Scylla via `withServiceDown` (from phase 1 / #646).
- Submits an item via `/api/v1/items/async`, expects 202.
- Waits for the `CONTENT_API_REQUESTS` row in ClickHouse to confirm the
  worker reached past `insertWithRetries` — its presence proves the
  swallow happened and execution continued. If the worker were
  re-throwing on insert failure, this gate would time out instead.

We deliberately don't assert "row absent in Scylla" after restore. The
in-harness Cassandra driver doesn't recover cleanly within a reasonable
window after a Scylla container restart (stale connections, prepared-
statement re-prepare); a direct query there ends up testing the
driver's reconnect timing rather than the worker's behaviour. The CH
row is sufficient proof of the contract.

When #649 escalates to path 1 (worker re-throws → BullMQ retries), the
assertion shape flips to "row lands in Scylla after recovery" — same
shape as the Redis-recovery test in `outage.integ.test.ts` — and the
driver-recovery timing problem either dissolves into a presence poll or
gets solved alongside.

Lives in its own file rather than appending to `outage.integ.test.ts`
because phase 1 (#646) is still under review at the time of writing and
a parallel-branch PR would conflict. Once both land we can collapse
the two files together.

Branch stacks on `outage-integ-test` (PR #646) because it depends on
the `dockerCompose.ts` harness shipped there.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant