Skip to content

items/async: fail fast on Redis outage (closes #647)#653

Open
julietshen wants to merge 1 commit into
mainfrom
fix-redis-offline-queue-data-loss
Open

items/async: fail fast on Redis outage (closes #647)#653
julietshen wants to merge 1 commit into
mainfrom
fix-redis-offline-queue-data-loss

Conversation

@julietshen
Copy link
Copy Markdown
Member

Closes #647.

Summary

POST /api/v1/items/async used to return 202 even when Redis was unreachable. ioredis's default enableOfflineQueue: true buffers commands client-side, so queue.addBulk resolved cleanly while Redis was down and the DataLoader-batched write reported success. The submission survived as long as the API process stayed up and Redis came back — but the buffer is in-process, so a simultaneous Redis outage + API restart silently dropped every submission acknowledged with 202.

Fix

Three small changes:

  1. server/iocContainer/index.ts — extract a makeRedis(extraOptions) helper so the existing IORedis connection wiring isn't copy-pasted, then add a second factory IORedisEnqueueNoBuffer that builds the same client with enableOfflineQueue: false. Wire itemSubmissionQueueBulkWrite to the new client. The shared IORedis keeps its default buffer for BullMQ workers (which legitimately need to ride through transient reconnects). Register the new client in closeSharedResourcesForShutdown.

  2. server/routes/items/submitItems.ts — tighten the response flow. On bulkWriteResponse.error, short-circuit with 5xx and return. Before this fix the handler sent 500 and then fell through to a second res.status(202).end() call. Express swallowed the second one as "headers already sent" so clients got 500 in practice, but the code claimed both intents.

Verification

Verified locally with a standalone smoke test against the dev docker compose stack:

initial PING: PONG
stopping redis...
PING-while-down REJECTED: Stream isn't writeable and enableOfflineQueue options is false (GOOD)
starting redis...
PING-after-restore: PONG

The DLQ path (itemSubmissionRetryQueueBulkWrite) stays on the shared IORedis — it's worker-internal and buffering through brief reconnects is fine there.

What changed in observable behaviour

  • Redis up: identical to before (202 on success).
  • Redis briefly unreachable (stopped or paused): API now returns 5xx, telling the client to retry. Previously returned 202 against a buffer.
  • Redis intermittent (very brief disconnect during a single addBulk call): the new client rejects rather than buffering. Adopters who relied on the implicit buffering to ride through ms-scale blips will see more 5xx, but the trade is "honest 5xx" vs "false 202 + possible loss on restart." This is the intended contract per items/async returns 202 while Redis is unreachable (narrow data-loss on API restart) #647.

Follow-ups

Test plan

  • Standalone smoke test confirms enableOfflineQueue: false client rejects when Redis is stopped and recovers cleanly when restarted
  • items-submission.integ.test.ts (the happy-path integ test) still passes — confirms the new wiring doesn't regress submissions when Redis is healthy
  • Reviewer: confirm the DLQ path staying on the shared IORedis is the right call (rationale: worker-internal, transient reconnects acceptable)

🤖 Generated with Claude Code

`POST /api/v1/items/async` used to return 202 even when Redis was
unreachable. ioredis's default `enableOfflineQueue: true` buffers
commands client-side, so `queue.addBulk` resolved cleanly while Redis
was down and the DataLoader-batched write reported success. The
submission survived as long as the API process stayed up and Redis
came back — but the buffer is in-process, so a simultaneous Redis
outage + API restart silently dropped every submission that had been
acknowledged with 202.

Fix:

- Add a dedicated `IORedisEnqueueNoBuffer` client built with
  `enableOfflineQueue: false`. Same Redis as the shared `IORedis`,
  same connection wiring extracted into a `makeRedis(extraOptions)`
  helper to avoid copy-pasting.
- Wire `itemSubmissionQueueBulkWrite` to the new client. The retry/
  DLQ path stays on the shared `IORedis` — it's worker-internal and
  buffering is fine there.
- Tighten the handler in `submitItems.ts`: on `bulkWriteResponse.error`
  short-circuit with 5xx and `return`. Previously it sent 500 *and*
  then fell through to a second `res.status(202).end()` call. Express
  swallowed the second one as "headers already sent" so the client got
  500 in practice, but the code claimed both intents.
- Register the new client in `closeSharedResourcesForShutdown` so its
  connection is closed on shutdown.

With this change, `queue.addBulk` rejects immediately when Redis is
unreachable ("Stream isn't writeable and enableOfflineQueue options is
false") — verified locally against a stopped Redis container — and the
existing 5xx branch fires. The shared `IORedis` keeps its default
buffer for the BullMQ worker side, which legitimately needs to ride
through transient reconnects.

Follow-ups:
- Phase 1 of the outage integ suite (#646) currently pins the
  pre-fix lossy contract for items/async. Once both #646 and this
  PR land, the Redis scenarios in `outage.integ.test.ts` flip from
  "lands after recovery" to "5xx + no Scylla row". Filed as part of
  #646's review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 29, 2026

Warning

Review limit reached

@julietshen, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 34 minutes and 23 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 606802cc-9a32-45f8-a4d3-84c318dca0fc

📥 Commits

Reviewing files that changed from the base of the PR and between b98d11b and 8b75963.

📒 Files selected for processing (2)
  • server/iocContainer/index.ts
  • server/routes/items/submitItems.ts
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix-redis-offline-queue-data-loss

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

items/async returns 202 while Redis is unreachable (narrow data-loss on API restart)

1 participant