fix(embedding): not-found metric classification + backfill pagination#91
Conversation
When GetFeedbackRecord returns not-found at the start of Work — a benign race where the record was deleted or its tenant purged between enqueue and processing — the worker recorded terminal-failure metrics (a get_record_failed worker error plus a failed_final outcome) before the not-found short-circuit. That inflates the failure metrics and can trip false alerts, and it was inconsistent with the same worker's not-found-on-write path, which already records skipped. Check not-found first and record it as skipped (no worker error); record the failure metrics only for genuine errors.
ListFeedbackRecordIDsForBackfill and BackfillEmbeddings materialized every eligible record id in one slice; on a large deployment that is a memory spike when the CLI runs. They now keyset-paginate (id > cursor LIMIT n), enqueuing page by page so the full set is never held in memory at once.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (5)
Walkthrough
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
What does this PR do?
Two independent hardening fixes to the embedding enrichment pipeline, both pre-existing on
mainand surfaced while reviewing the translation pipeline in #90. One commit each.fix: a not-found record is no longer recorded as a terminal failure. InFeedbackEmbeddingWorker.Work, a not-foundGetFeedbackRecord— a benign race where the record was deleted or its tenant purged between enqueue and processing — recordedfailed_finalplus aget_record_failedworker error before the not-found short-circuit returned nil. That inflateshub_embedding_outcomes_total{status="failed_final"}andhub_embedding_worker_errors_totaland can trip false alerts; it was also inconsistent with the same worker's not-found-on-write path, which already recordsskipped. It now recordsskipped(no worker error), and reserves the failure metrics for genuine errors.perf: the embedding backfill streams in keyset pages.BackfillEmbeddings/ListFeedbackRecordIDsForBackfillmaterialized every eligible record id in one slice; on a large deployment that is a memory spike when the CLI runs. They now keyset-paginate (fr.id > cursor ORDER BY fr.id LIMIT n), enqueuing page by page so the full set is never held in memory at once.No API or behavior change; the
cmd/backfill-embeddingsinterface is unchanged (it now streams internally).How should this be tested?
make buildmake fmt && make lint→ 0 issuesmake tests(integration tests intests/,DATABASE_URL→ a test DB with migrations + River applied), including:TestFeedbackEmbeddingWorker_GetNotFoundRecordsSkipped— a not-found GET recordsskipped, notfailed_finaland not a worker errortests/embedding_backfill_test.go— keyset pagination (bounded pages, strictly-ascending ids, no id on two pages, completeness) and the service backfill streaming every eligible recordcmd/backfill-embeddingsover a DB with un-embedded records → enqueues them in keyset pages (no behavior or CLI change).Checklist
Required
make buildmake tests(integration tests intests/)make fmtandmake lint; no new warningsmain)Appreciated
docs/if changes were necessary: n/amake tests-coveragefor meaningful logic changes