fix(face-clusters): run global reclustering as async job; remove 10s axios timeout cap (#1345)#1349
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
WalkthroughGlobal face reclustering now starts as a background task and is checked through a status endpoint. The backend adds task tracking, cleanup, and async execution; the frontend adds new API calls, polling, timeout updates, tests, and revised settings-page wiring. ChangesAsync Global Reclustering Flow
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
frontend/src/hooks/useGlobalRecluster.tsx (1)
40-116: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winAdd automated coverage for the polling lifecycle.
This hook now owns the core reclustering UX, but the PR does not include tests for success, error, unmount cleanup, or repeated-trigger behavior. Those are the paths most likely to regress loaders, dialogs, and interval cleanup. As per path instructions, "Ensure that test code is automated, comprehensive, and follows testing best practices" and "Verify that all critical functionality is covered by tests."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@frontend/src/hooks/useGlobalRecluster.tsx` around lines 40 - 116, Add automated tests for useGlobalRecluster to cover the full polling lifecycle: successful completion, error handling from both startGlobalReclustering and getGlobalReclusterStatus, cleanup via stopPolling on unmount, and repeated trigger calls preventing duplicate intervals. Use the useGlobalRecluster hook and its trigger/stopPolling behavior as the primary targets, and assert queryClient.invalidateQueries is called on completion and that state transitions to isSuccess/isError are set correctly. Ensure the tests mock the interval/timer behavior and verify polling is cleared in all termination paths.Source: Path instructions
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@backend/app/routes/face_clusters.py`:
- Around line 56-57: The task lifecycle in face_clusters should track when the
job finished, not when the tracker object was created, because cleanup uses that
timestamp to age out results too early. Update the tracker state used by the
recluster flow (the class/fields around created_at, task, and the cleanup logic
in the polling/cleanup path) so terminal results are retained from completion
time, and make the cleanup code compare against a completion timestamp rather
than the creation timestamp.
- Around line 60-67: The global recluster tracking in face_clusters.py is
worker-local, so `recluster_tasks` and `_active_recluster_task_id` can diverge
across multiple backend workers and allow duplicate jobs or broken polling.
Update the recluster flow around `recluster_tasks`, `_active_recluster_task_id`,
and the task creation/polling handlers to use a shared coordination mechanism
such as a distributed lock or shared persistence layer, or otherwise enforce a
single-worker deployment in the runtime startup scripts/config.
In `@frontend/src/hooks/useGlobalRecluster.tsx`:
- Around line 54-113: The polling logic in useGlobalRecluster’s trigger is
vulnerable to overlapping status requests and stale runs because
setInterval(async ...) can stack calls and an older trigger() can keep running
after a newer one starts. Replace the interval-based polling with a
self-scheduling timeout loop, and add a monotonically increasing run id/ref
inside trigger, stopPolling, and the polling callback so only the latest run can
update state or clear polling. Make sure getGlobalReclusterStatus, setState, and
queryClient.invalidateQueries only execute for the active request.
---
Nitpick comments:
In `@frontend/src/hooks/useGlobalRecluster.tsx`:
- Around line 40-116: Add automated tests for useGlobalRecluster to cover the
full polling lifecycle: successful completion, error handling from both
startGlobalReclustering and getGlobalReclusterStatus, cleanup via stopPolling on
unmount, and repeated trigger calls preventing duplicate intervals. Use the
useGlobalRecluster hook and its trigger/stopPolling behavior as the primary
targets, and assert queryClient.invalidateQueries is called on completion and
that state transitions to isSuccess/isError are set correctly. Ensure the tests
mock the interval/timer behavior and verify polling is cleared in all
termination paths.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 0aa7d84e-f074-4372-a926-3c8fca0f1a52
📒 Files selected for processing (8)
backend/app/routes/face_clusters.pybackend/app/schemas/face_clusters.pybackend/main.pyfrontend/src/api/api-functions/face_clusters.tsfrontend/src/api/apiEndpoints.tsfrontend/src/api/axiosConfig.tsfrontend/src/hooks/useGlobalRecluster.tsxfrontend/src/pages/SettingsPage/components/ApplicationControlsCard.tsx
…axios cap (AOSSIE-Org#1345) The shared axios client applied a hard 10s timeout to every backend request with no per-call override, so synchronous endpoints that scale with library size (notably global face reclustering) aborted in the UI after ~10s while the backend kept running to completion, leaving the UI and DB inconsistent. Backend - POST /face-clusters/global-recluster now starts the full DBSCAN pass as a background task and returns a task_id immediately (202 Accepted), running the blocking work via asyncio.to_thread so the event loop is not blocked. - New GET /face-clusters/global-recluster/{task_id} to poll the job status (running | complete | error). - Concurrency guard: a second trigger while one is running rejoins the in-flight task instead of starting a second pass that would race on the cluster tables. - Cleanup loop reaps finished task results after a TTL (registered in the app lifespan); running tasks are never reaped. Terminal results are not deleted on first poll, so repeated polls (multiple tabs/retries) stay consistent. Frontend - Raise the default axios timeout 10s -> 30s and add LONG_REQUEST_TIMEOUT_MS (120s) applied per-call to the face-search / multi-person-search endpoints. - Replace the one-shot reclustering mutation with a useGlobalRecluster polling hook wired into the Settings recluster button (same loader/toast UX). Scoped to the face-clustering side of the issue: memories endpoints are left untouched (handled separately), and a live progress bar / SSE is deferred to a follow-up.
cd74151 to
2f1554f
Compare
…l loop Addresses review feedback on the async reclustering flow. - ReclusterTask now records finished_at, and the cleanup loop ages terminal results from completion time instead of creation time. Previously a job that ran close to the TTL could be reaped almost immediately after finishing, making polling clients see a 404 instead of the result. - useGlobalRecluster now polls with a self-scheduling setTimeout (the next tick is queued only after the current request resolves, so status requests can't stack/overlap) and guards every async callback with a monotonically increasing run id. A newer trigger() — or unmount — invalidates older runs so they cannot keep polling or overwrite state, fixing the orphaned-interval leak on repeated triggers.
…er setup - Add unit tests for the useGlobalRecluster polling hook covering the success lifecycle, error from start, error reported by the status poll, polling cleanup on unmount, and that a rapid second trigger does not leave a second poll loop running. - Warn at startup when WORKERS > 1, since model-download and global-reclustering job tracking is in-memory and per-worker; a job started in one worker is invisible to others, so deployments should keep a single worker. This is a non-breaking safeguard (no behaviour change to the default single-worker run).
- main.py already defines a module-level logger; remove the duplicate one added for the worker warning (the lifespan resolves the existing logger at runtime). - Remove the now-unused GlobalReclusterData interface from the frontend API module (superseded by the start/status data interfaces).
…king Model-download and global-reclustering job state lives in per-worker memory, so running multiple workers would let a job start in one worker while status polls hit another (404s, duplicate jobs). Hardcode the server to a single worker in run.sh and run-server.ps1 (removing the WORKERS override) and document the constraint at the in-memory state. Replaces the previous best-effort startup warning, which could not observe the real worker count.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@frontend/src/hooks/__tests__/useGlobalRecluster.test.tsx`:
- Around line 101-127: The `useGlobalRecluster` test suite is missing coverage
for the rejected-status path from `getGlobalReclusterStatus`. Add a test
alongside the existing `sets an error when the job reports failure` and `sets an
error when starting the job fails` cases that makes `mockStatus` reject (for
example, a missing/aged-out task), then assert the hook sets `isError`, surfaces
the failure message, and stops polling/invalidation behavior as expected via
`trigger()` and `flush()`. Keep the new case aligned with the existing `setup`,
`mockStart`, and `mockStatus` helpers so the cleanup/TTL contract is covered.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 7d92f44c-69f1-47db-843d-a7914af5c785
📒 Files selected for processing (7)
backend/app/routes/face_clusters.pybackend/main.pybackend/run-server.ps1backend/run.shfrontend/src/api/api-functions/face_clusters.tsfrontend/src/hooks/__tests__/useGlobalRecluster.test.tsxfrontend/src/hooks/useGlobalRecluster.tsx
💤 Files with no reviewable changes (1)
- frontend/src/api/api-functions/face_clusters.ts
🚧 Files skipped from review as they are similar to previous changes (3)
- frontend/src/hooks/useGlobalRecluster.tsx
- backend/main.py
- backend/app/routes/face_clusters.py
…ster
Addresses a CodeRabbit review comment: the suite only covered the
{success: false, status: 'error'} payload path, not a rejected status
request (e.g. 404 for a missing/aged-out task ID).
Addressed Issues:
Fixes #1345
Screenshots/Recordings:
No visual UI change. The fix is behavioural: previously the Recluster Faces action in Settings (and other long calls) failed in the UI after about 10s on a large library while the backend kept running to completion. Now reclustering runs as a background job and the UI waits for it via polling, showing the existing loader and a success/failure toast on completion.
Additional Notes:
The shared axios client applied a hard 10s timeout to every backend request (
frontend/src/api/axiosConfig.ts) with no per-call override. Synchronous endpoints that scale with library size, most notably global face reclustering, run past 10s on a realistic library, so the client aborted with a timeout error while the backend kept running to completion, leaving the UI and DB inconsistent.Backend
POST /face-clusters/global-reclusternow starts the full DBSCAN pass as a background task and returns atask_idimmediately (202 Accepted), running the blocking work viaasyncio.to_threadso the event loop is not blocked.GET /face-clusters/global-recluster/{task_id}to poll job status (running,complete, orerror).task_idinstead of starting a second pass that would race on the cluster tables (a full recluster deletes and rebuilds every cluster).models.py). Running tasks are never reaped, and terminal results are not deleted on first poll, so repeated polls (multiple tabs or retries) stay consistent.Frontend
LONG_REQUEST_TIMEOUT_MS(120s) applied per-call to the face-search and multi-person-search endpoints.useGlobalReclusterpolling hook, wired into the Settings recluster button (same loader/toast UX as before). The hook polls with a self-scheduling timeout and a run-id guard so overlapping requests and stale runs cannot update state or leak timers.Scope and deliberate deferrals
memories/generate,timelineandlocationsare not modified in this PR; they still benefit from the global 10s to 30s default change, and any deeper async/persistence work on them is left to the GSoC effort.ReclusterTask,recluster_tasks,_cleanup_stale_recluster_tasks) duplicates the structure already inmodels.py. It works correctly with no runtime cost, but could be unified into a shared utility in a later refactor; it is deferred here to avoid touching the working model-download code.Testing
tsc --noEmit, ESLint, Prettier, and the full Jest suite (133/133) pass.black(24.4.2) andruff(0.4.10) clean on changed files; the concurrency-guard, cleanup, and repeat-poll logic verified with a standalone asyncio simulation. No new unit tests were added. The backendpytestsuite was not run locally (Python version mismatch with the project's pinned deps), so please confirm via CI.AI Usage Disclosure:
We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact. AI slop is strongly discouraged and may lead to banning and blocking. Do not spam our repos with AI slop.
Check one of the checkboxes below:
I have used the following AI models and tools: Codex
Checklist
Summary by CodeRabbit
task_id.