Skip to content

Add Scylla outage integ test (phase 2 of #343)#652

Open
julietshen wants to merge 1 commit into
outage-integ-testfrom
outage-integ-test-scylla-rebased
Open

Add Scylla outage integ test (phase 2 of #343)#652
julietshen wants to merge 1 commit into
outage-integ-testfrom
outage-integ-test-scylla-rebased

Conversation

@julietshen
Copy link
Copy Markdown
Member

@julietshen julietshen commented May 29, 2026

Phase 2 of #343. Stacks on #646 (phase 1: HMA + Redis) for the dockerCompose.ts harness.

Summary

Pins today's known-lossy contract for #649: ItemProcessingWorker swallows Scylla insert errors, so a Scylla outage during processing drops the item_submission_by_thread row without retry. The job completes "successfully" from BullMQ's POV, the worker logs to ClickHouse, and the API has long since returned 202 to the client. Nothing surfaces the loss.

Test design

  1. Stop Scylla via withServiceDown (from phase 1).
  2. POST item via /api/v1/items/async — expect 202.
  3. Wait for CONTENT_API_REQUESTS row in ClickHouse — its presence proves the worker reached past insertWithRetries. If the worker were re-throwing on insert failure (the desired contract), this gate would time out instead.

What's deliberately not asserted

"Row absent in Scylla after restore." The in-harness Cassandra driver doesn't recover cleanly within a reasonable window after a Scylla container restart — stale connections, prepared-statement re-prepare, etc. A direct Scylla.select after withServiceDown ends up testing the driver's reconnect timing rather than the worker's behaviour. The CH-row assertion alone is sufficient proof of the contract: that row can only get written if the worker reached past the swallowed insert.

When #649 flips the worker to re-throw-and-retry, the assertion shape becomes "row lands in Scylla after recovery" (same shape as the Redis-recovery test in outage.integ.test.ts) — and at that point the driver-recovery timing problem either dissolves into a presence poll (which tolerates transient errors as "not yet") or gets solved alongside.

File placement

Lives in a separate file (outage-scylla.integ.test.ts) rather than appending to outage.integ.test.ts from #646 because #646 is still under review and merging this in the same file would require constant rebases. Once both land we can collapse the two files together — happy to do that follow-up.

Test plan

AGENTS.md callouts

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Tests
    • Added integration tests to validate system resilience during service outages, ensuring the API continues accepting requests and background workers persist with processing despite temporary service unavailability.

Review Change Stack

Pins today's known-lossy contract for #649: `ItemProcessingWorker`
swallows Scylla insert errors, so a Scylla outage during processing
drops the `item_submission_by_thread` row without retry. The job
completes "successfully" from BullMQ's POV, the worker logs to
ClickHouse, and the API has long since returned 202 to the client.
Nothing surfaces the loss.

The test:
- Stops Scylla via `withServiceDown` (from phase 1 / #646).
- Submits an item via `/api/v1/items/async`, expects 202.
- Waits for the `CONTENT_API_REQUESTS` row in ClickHouse to confirm the
  worker reached past `insertWithRetries` — its presence proves the
  swallow happened and execution continued. If the worker were
  re-throwing on insert failure, this gate would time out instead.

We deliberately don't assert "row absent in Scylla" after restore. The
in-harness Cassandra driver doesn't recover cleanly within a reasonable
window after a Scylla container restart (stale connections, prepared-
statement re-prepare); a direct query there ends up testing the
driver's reconnect timing rather than the worker's behaviour. The CH
row is sufficient proof of the contract.

When #649 escalates to path 1 (worker re-throws → BullMQ retries), the
assertion shape flips to "row lands in Scylla after recovery" — same
shape as the Redis-recovery test in `outage.integ.test.ts` — and the
driver-recovery timing problem either dissolves into a presence poll or
gets solved alongside.

Lives in its own file rather than appending to `outage.integ.test.ts`
because phase 1 (#646) is still under review at the time of writing and
a parallel-branch PR would conflict. Once both land we can collapse
the two files together.

Branch stacks on `outage-integ-test` (PR #646) because it depends on
the `dockerCompose.ts` harness shipped there.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 29, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

🗂️ Base branches to auto review (1)
  • main

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 7a6b4653-6cd5-4613-a1fa-2da46008dc96

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • ✅ Review completed - (🔄 Check again to review again)
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch outage-integ-test-scylla-rebased

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
server/test/integ/outage-scylla.integ.test.ts (2)

52-56: ⚡ Quick win

Tighten outage-scylla cleanup: unpauseService is exported; waitFor already throws on timeout

  • unpauseService is exported from server/test/integ/dockerCompose.ts and used by other integ outage tests; in outage-scylla.integ.test.ts it’s belt-and-suspenders cleanup after withServiceDown (which only does stop/start), not part of the outage flow.
  • expect(chRow).toBeDefined() is redundant: waitForItemInClickHouse() returns only when the row is present, and waitFor() throws on timeout instead of returning null. Consider removing the assertion or replacing it with a field-level check.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/test/integ/outage-scylla.integ.test.ts` around lines 52 - 56, The test
is doing redundant cleanup and an unnecessary existence assertion: remove the
call to unpauseService from outage-scylla.integ.test.ts since unpauseService is
exported and other tests use it and withServiceDown only stops/starts the
service (so leave withServiceDown as the outage flow), and remove the redundant
expect(chRow).toBeDefined() because waitForItemInClickHouse()/waitFor already
throws on timeout; if you need stronger verification, replace that assertion
with a field-level check against the returned object (e.g., assert specific
property values from the value returned by waitForItemInClickHouse) to validate
contents rather than presence.

195-195: ⚡ Quick win

Tighten the ClickHouse-row assertion.

waitForItemInClickHouse polls via waitFor, which throws once timeoutMs elapses and only returns when the query result is non-null (rows[0]). So expect(chRow).toBeDefined() won’t pass on a timeout/null—it's already safe. If you want stronger contract coverage, assert specific fields on chRow (e.g., item_id, item_type_id, and/or event) rather than just presence.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/test/integ/outage-scylla.integ.test.ts` at line 195, The current
assertion expect(chRow).toBeDefined() is weak given waitForItemInClickHouse
already ensures a non-null row; strengthen the test by asserting specific fields
on the returned row (e.g., chRow.item_id, chRow.item_type_id, chRow.event) to
validate content and schema instead of only presence—locate the call to
waitForItemInClickHouse and the variable chRow in the test and add explicit
expects (or toMatchObject checks) for those fields to ensure correct
values/types are present.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@server/test/integ/outage-scylla.integ.test.ts`:
- Around line 52-56: The test is doing redundant cleanup and an unnecessary
existence assertion: remove the call to unpauseService from
outage-scylla.integ.test.ts since unpauseService is exported and other tests use
it and withServiceDown only stops/starts the service (so leave withServiceDown
as the outage flow), and remove the redundant expect(chRow).toBeDefined()
because waitForItemInClickHouse()/waitFor already throws on timeout; if you need
stronger verification, replace that assertion with a field-level check against
the returned object (e.g., assert specific property values from the value
returned by waitForItemInClickHouse) to validate contents rather than presence.
- Line 195: The current assertion expect(chRow).toBeDefined() is weak given
waitForItemInClickHouse already ensures a non-null row; strengthen the test by
asserting specific fields on the returned row (e.g., chRow.item_id,
chRow.item_type_id, chRow.event) to validate content and schema instead of only
presence—locate the call to waitForItemInClickHouse and the variable chRow in
the test and add explicit expects (or toMatchObject checks) for those fields to
ensure correct values/types are present.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 890faa3f-58d0-4e85-a163-af4b60021a14

📥 Commits

Reviewing files that changed from the base of the PR and between 8f8e08a and 728a44d.

📒 Files selected for processing (1)
  • server/test/integ/outage-scylla.integ.test.ts

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant