Add Scylla outage integ test (phase 2 of #343) by julietshen · Pull Request #652 · roostorg/coop

julietshen · 2026-05-29T20:46:04Z

Phase 2 of #343. Stacks on #646 (phase 1: HMA + Redis) for the dockerCompose.ts harness.

Summary

Pins today's known-lossy contract for #649: ItemProcessingWorker swallows Scylla insert errors, so a Scylla outage during processing drops the item_submission_by_thread row without retry. The job completes "successfully" from BullMQ's POV, the worker logs to ClickHouse, and the API has long since returned 202 to the client. Nothing surfaces the loss.

Test design

Stop Scylla via withServiceDown (from phase 1).
POST item via /api/v1/items/async — expect 202.
Wait for CONTENT_API_REQUESTS row in ClickHouse — its presence proves the worker reached past insertWithRetries. If the worker were re-throwing on insert failure (the desired contract), this gate would time out instead.

What's deliberately not asserted

"Row absent in Scylla after restore." The in-harness Cassandra driver doesn't recover cleanly within a reasonable window after a Scylla container restart — stale connections, prepared-statement re-prepare, etc. A direct Scylla.select after withServiceDown ends up testing the driver's reconnect timing rather than the worker's behaviour. The CH-row assertion alone is sufficient proof of the contract: that row can only get written if the worker reached past the swallowed insert.

When #649 flips the worker to re-throw-and-retry, the assertion shape becomes "row lands in Scylla after recovery" (same shape as the Redis-recovery test in outage.integ.test.ts) — and at that point the driver-recovery timing problem either dissolves into a presence poll (which tolerates transient errors as "not yet") or gets solved alongside.

File placement

Lives in a separate file (outage-scylla.integ.test.ts) rather than appending to outage.integ.test.ts from #646 because #646 is still under review and merging this in the same file would require constant rebases. Once both land we can collapse the two files together — happy to do that follow-up.

Test plan

(cd server && npm run test:integ -- outage-scylla.integ.test.ts) against the real stack (npm run up && npm run db:update) — 1 passed, 1 total
Reviewer: confirm "CH-row landing as sufficient proof of swallow" framing is the right contract pin for Phase 2 of #343: Scylla outage scenario (decide swallow-vs-loud first) #649. Alternative is to escalate to Phase 2 of #343: Scylla outage scenario (decide swallow-vs-loud first) #649 path 1 (worker code change) in this same PR — happy to do that if you'd rather.

AGENTS.md callouts

Stacks on PR Add outage integration test: HMA + Redis (phase 1 of #343) #646. If Add outage integration test: HMA + Redis (phase 1 of #343) #646 changes base, this PR's base needs to follow.
Same --runInBand requirement as phase 1 (see Integ test suite needs --runInBand for outage scenarios #648).

🤖 Generated with Claude Code

Summary by CodeRabbit

Tests
- Added integration tests to validate system resilience during service outages, ensuring the API continues accepting requests and background workers persist with processing despite temporary service unavailability.

Pins today's known-lossy contract for #649: `ItemProcessingWorker` swallows Scylla insert errors, so a Scylla outage during processing drops the `item_submission_by_thread` row without retry. The job completes "successfully" from BullMQ's POV, the worker logs to ClickHouse, and the API has long since returned 202 to the client. Nothing surfaces the loss. The test: - Stops Scylla via `withServiceDown` (from phase 1 / #646). - Submits an item via `/api/v1/items/async`, expects 202. - Waits for the `CONTENT_API_REQUESTS` row in ClickHouse to confirm the worker reached past `insertWithRetries` — its presence proves the swallow happened and execution continued. If the worker were re-throwing on insert failure, this gate would time out instead. We deliberately don't assert "row absent in Scylla" after restore. The in-harness Cassandra driver doesn't recover cleanly within a reasonable window after a Scylla container restart (stale connections, prepared- statement re-prepare); a direct query there ends up testing the driver's reconnect timing rather than the worker's behaviour. The CH row is sufficient proof of the contract. When #649 escalates to path 1 (worker re-throws → BullMQ retries), the assertion shape flips to "row lands in Scylla after recovery" — same shape as the Redis-recovery test in `outage.integ.test.ts` — and the driver-recovery timing problem either dissolves into a presence poll or gets solved alongside. Lives in its own file rather than appending to `outage.integ.test.ts` because phase 1 (#646) is still under review at the time of writing and a parallel-branch PR would conflict. Once both land we can collapse the two files together. Branch stacks on `outage-integ-test` (PR #646) because it depends on the `dockerCompose.ts` harness shipped there. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-05-29T20:46:10Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

🗂️ Base branches to auto review (1)

main

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 7a6b4653-6cd5-4613-a1fa-2da46008dc96

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

✅ Review completed - (🔄 Check again to review again)

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch outage-integ-test-scylla-rebased

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

🧹 Nitpick comments (2)

server/test/integ/outage-scylla.integ.test.ts (2)
52-56: ⚡ Quick win

Tighten outage-scylla cleanup: unpauseService is exported; waitFor already throws on timeout

unpauseService is exported from server/test/integ/dockerCompose.ts and used by other integ outage tests; in outage-scylla.integ.test.ts it’s belt-and-suspenders cleanup after withServiceDown (which only does stop/start), not part of the outage flow.

expect(chRow).toBeDefined() is redundant: waitForItemInClickHouse() returns only when the row is present, and waitFor() throws on timeout instead of returning null. Consider removing the assertion or replacing it with a field-level check.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/test/integ/outage-scylla.integ.test.ts` around lines 52 - 56, The test
is doing redundant cleanup and an unnecessary existence assertion: remove the
call to unpauseService from outage-scylla.integ.test.ts since unpauseService is
exported and other tests use it and withServiceDown only stops/starts the
service (so leave withServiceDown as the outage flow), and remove the redundant
expect(chRow).toBeDefined() because waitForItemInClickHouse()/waitFor already
throws on timeout; if you need stronger verification, replace that assertion
with a field-level check against the returned object (e.g., assert specific
property values from the value returned by waitForItemInClickHouse) to validate
contents rather than presence.
195-195: ⚡ Quick win

Tighten the ClickHouse-row assertion.

waitForItemInClickHouse polls via waitFor, which throws once timeoutMs elapses and only returns when the query result is non-null (rows[0]). So expect(chRow).toBeDefined() won’t pass on a timeout/null—it's already safe. If you want stronger contract coverage, assert specific fields on chRow (e.g., item_id, item_type_id, and/or event) rather than just presence.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/test/integ/outage-scylla.integ.test.ts` at line 195, The current
assertion expect(chRow).toBeDefined() is weak given waitForItemInClickHouse
already ensures a non-null row; strengthen the test by asserting specific fields
on the returned row (e.g., chRow.item_id, chRow.item_type_id, chRow.event) to
validate content and schema instead of only presence—locate the call to
waitForItemInClickHouse and the variable chRow in the test and add explicit
expects (or toMatchObject checks) for those fields to ensure correct
values/types are present.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@server/test/integ/outage-scylla.integ.test.ts`:
- Around line 52-56: The test is doing redundant cleanup and an unnecessary
existence assertion: remove the call to unpauseService from
outage-scylla.integ.test.ts since unpauseService is exported and other tests use
it and withServiceDown only stops/starts the service (so leave withServiceDown
as the outage flow), and remove the redundant expect(chRow).toBeDefined()
because waitForItemInClickHouse()/waitFor already throws on timeout; if you need
stronger verification, replace that assertion with a field-level check against
the returned object (e.g., assert specific property values from the value
returned by waitForItemInClickHouse) to validate contents rather than presence.
- Line 195: The current assertion expect(chRow).toBeDefined() is weak given
waitForItemInClickHouse already ensures a non-null row; strengthen the test by
asserting specific fields on the returned row (e.g., chRow.item_id,
chRow.item_type_id, chRow.event) to validate content and schema instead of only
presence—locate the call to waitForItemInClickHouse and the variable chRow in
the test and add explicit expects (or toMatchObject checks) for those fields to
ensure correct values/types are present.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 890faa3f-58d0-4e85-a163-af4b60021a14

📥 Commits

Reviewing files that changed from the base of the PR and between 8f8e08a and 728a44d.

📒 Files selected for processing (1)

server/test/integ/outage-scylla.integ.test.ts

julietshen requested review from cassidyjames, dom-notion, juanmrad and vinaysrao1 as code owners May 29, 2026 20:46

coderabbitai Bot reviewed May 29, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Scylla outage integ test (phase 2 of #343)#652

Add Scylla outage integ test (phase 2 of #343)#652
julietshen wants to merge 1 commit into
outage-integ-testfrom
outage-integ-test-scylla-rebased

julietshen commented May 29, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 29, 2026 •

edited

Loading

Review skipped

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

julietshen commented May 29, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test design

What's deliberately not asserted

File placement

Test plan

AGENTS.md callouts

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

julietshen commented May 29, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 29, 2026 •

edited

Loading