Add Scylla outage integ test (phase 2 of #343)#652
Conversation
Pins today's known-lossy contract for #649: `ItemProcessingWorker` swallows Scylla insert errors, so a Scylla outage during processing drops the `item_submission_by_thread` row without retry. The job completes "successfully" from BullMQ's POV, the worker logs to ClickHouse, and the API has long since returned 202 to the client. Nothing surfaces the loss. The test: - Stops Scylla via `withServiceDown` (from phase 1 / #646). - Submits an item via `/api/v1/items/async`, expects 202. - Waits for the `CONTENT_API_REQUESTS` row in ClickHouse to confirm the worker reached past `insertWithRetries` — its presence proves the swallow happened and execution continued. If the worker were re-throwing on insert failure, this gate would time out instead. We deliberately don't assert "row absent in Scylla" after restore. The in-harness Cassandra driver doesn't recover cleanly within a reasonable window after a Scylla container restart (stale connections, prepared- statement re-prepare); a direct query there ends up testing the driver's reconnect timing rather than the worker's behaviour. The CH row is sufficient proof of the contract. When #649 escalates to path 1 (worker re-throws → BullMQ retries), the assertion shape flips to "row lands in Scylla after recovery" — same shape as the Redis-recovery test in `outage.integ.test.ts` — and the driver-recovery timing problem either dissolves into a presence poll or gets solved alongside. Lives in its own file rather than appending to `outage.integ.test.ts` because phase 1 (#646) is still under review at the time of writing and a parallel-branch PR would conflict. Once both land we can collapse the two files together. Branch stacks on `outage-integ-test` (PR #646) because it depends on the `dockerCompose.ts` harness shipped there. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. 🗂️ Base branches to auto review (1)
Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Plus Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
server/test/integ/outage-scylla.integ.test.ts (2)
52-56: ⚡ Quick winTighten outage-scylla cleanup:
unpauseServiceis exported;waitForalready throws on timeout
unpauseServiceis exported fromserver/test/integ/dockerCompose.tsand used by other integ outage tests; inoutage-scylla.integ.test.tsit’s belt-and-suspenders cleanup afterwithServiceDown(which only does stop/start), not part of the outage flow.expect(chRow).toBeDefined()is redundant:waitForItemInClickHouse()returns only when the row is present, andwaitFor()throws on timeout instead of returningnull. Consider removing the assertion or replacing it with a field-level check.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@server/test/integ/outage-scylla.integ.test.ts` around lines 52 - 56, The test is doing redundant cleanup and an unnecessary existence assertion: remove the call to unpauseService from outage-scylla.integ.test.ts since unpauseService is exported and other tests use it and withServiceDown only stops/starts the service (so leave withServiceDown as the outage flow), and remove the redundant expect(chRow).toBeDefined() because waitForItemInClickHouse()/waitFor already throws on timeout; if you need stronger verification, replace that assertion with a field-level check against the returned object (e.g., assert specific property values from the value returned by waitForItemInClickHouse) to validate contents rather than presence.
195-195: ⚡ Quick winTighten the ClickHouse-row assertion.
waitForItemInClickHousepolls viawaitFor, which throws oncetimeoutMselapses and only returns when the query result is non-null(rows[0]). Soexpect(chRow).toBeDefined()won’t pass on a timeout/null—it's already safe. If you want stronger contract coverage, assert specific fields onchRow(e.g.,item_id,item_type_id, and/orevent) rather than just presence.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@server/test/integ/outage-scylla.integ.test.ts` at line 195, The current assertion expect(chRow).toBeDefined() is weak given waitForItemInClickHouse already ensures a non-null row; strengthen the test by asserting specific fields on the returned row (e.g., chRow.item_id, chRow.item_type_id, chRow.event) to validate content and schema instead of only presence—locate the call to waitForItemInClickHouse and the variable chRow in the test and add explicit expects (or toMatchObject checks) for those fields to ensure correct values/types are present.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@server/test/integ/outage-scylla.integ.test.ts`:
- Around line 52-56: The test is doing redundant cleanup and an unnecessary
existence assertion: remove the call to unpauseService from
outage-scylla.integ.test.ts since unpauseService is exported and other tests use
it and withServiceDown only stops/starts the service (so leave withServiceDown
as the outage flow), and remove the redundant expect(chRow).toBeDefined()
because waitForItemInClickHouse()/waitFor already throws on timeout; if you need
stronger verification, replace that assertion with a field-level check against
the returned object (e.g., assert specific property values from the value
returned by waitForItemInClickHouse) to validate contents rather than presence.
- Line 195: The current assertion expect(chRow).toBeDefined() is weak given
waitForItemInClickHouse already ensures a non-null row; strengthen the test by
asserting specific fields on the returned row (e.g., chRow.item_id,
chRow.item_type_id, chRow.event) to validate content and schema instead of only
presence—locate the call to waitForItemInClickHouse and the variable chRow in
the test and add explicit expects (or toMatchObject checks) for those fields to
ensure correct values/types are present.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 890faa3f-58d0-4e85-a163-af4b60021a14
📒 Files selected for processing (1)
server/test/integ/outage-scylla.integ.test.ts
Phase 2 of #343. Stacks on #646 (phase 1: HMA + Redis) for the
dockerCompose.tsharness.Summary
Pins today's known-lossy contract for #649:
ItemProcessingWorkerswallows Scylla insert errors, so a Scylla outage during processing drops theitem_submission_by_threadrow without retry. The job completes "successfully" from BullMQ's POV, the worker logs to ClickHouse, and the API has long since returned 202 to the client. Nothing surfaces the loss.Test design
withServiceDown(from phase 1)./api/v1/items/async— expect 202.CONTENT_API_REQUESTSrow in ClickHouse — its presence proves the worker reached pastinsertWithRetries. If the worker were re-throwing on insert failure (the desired contract), this gate would time out instead.What's deliberately not asserted
"Row absent in Scylla after restore." The in-harness Cassandra driver doesn't recover cleanly within a reasonable window after a Scylla container restart — stale connections, prepared-statement re-prepare, etc. A direct
Scylla.selectafterwithServiceDownends up testing the driver's reconnect timing rather than the worker's behaviour. The CH-row assertion alone is sufficient proof of the contract: that row can only get written if the worker reached past the swallowed insert.When #649 flips the worker to re-throw-and-retry, the assertion shape becomes "row lands in Scylla after recovery" (same shape as the Redis-recovery test in
outage.integ.test.ts) — and at that point the driver-recovery timing problem either dissolves into a presence poll (which tolerates transient errors as "not yet") or gets solved alongside.File placement
Lives in a separate file (
outage-scylla.integ.test.ts) rather than appending tooutage.integ.test.tsfrom #646 because #646 is still under review and merging this in the same file would require constant rebases. Once both land we can collapse the two files together — happy to do that follow-up.Test plan
(cd server && npm run test:integ -- outage-scylla.integ.test.ts)against the real stack (npm run up && npm run db:update) — 1 passed, 1 totalAGENTS.md callouts
--runInBandrequirement as phase 1 (see Integ test suite needs --runInBand for outage scenarios #648).🤖 Generated with Claude Code
Summary by CodeRabbit