Skip to content

fix(face-clusters): run global reclustering as async job; remove 10s axios timeout cap (#1345)#1349

Open
VanshajPoonia wants to merge 6 commits into
AOSSIE-Org:mainfrom
VanshajPoonia:fix/1345-recluster-timeout
Open

fix(face-clusters): run global reclustering as async job; remove 10s axios timeout cap (#1345)#1349
VanshajPoonia wants to merge 6 commits into
AOSSIE-Org:mainfrom
VanshajPoonia:fix/1345-recluster-timeout

Conversation

@VanshajPoonia

@VanshajPoonia VanshajPoonia commented Jun 28, 2026

Copy link
Copy Markdown
Contributor

Addressed Issues:

Fixes #1345

Screenshots/Recordings:

No visual UI change. The fix is behavioural: previously the Recluster Faces action in Settings (and other long calls) failed in the UI after about 10s on a large library while the backend kept running to completion. Now reclustering runs as a background job and the UI waits for it via polling, showing the existing loader and a success/failure toast on completion.

Additional Notes:

The shared axios client applied a hard 10s timeout to every backend request (frontend/src/api/axiosConfig.ts) with no per-call override. Synchronous endpoints that scale with library size, most notably global face reclustering, run past 10s on a realistic library, so the client aborted with a timeout error while the backend kept running to completion, leaving the UI and DB inconsistent.

Backend

  • POST /face-clusters/global-recluster now starts the full DBSCAN pass as a background task and returns a task_id immediately (202 Accepted), running the blocking work via asyncio.to_thread so the event loop is not blocked.
  • New GET /face-clusters/global-recluster/{task_id} to poll job status (running, complete, or error).
  • Concurrency guard: a second trigger while a job is running rejoins the in-flight task_id instead of starting a second pass that would race on the cluster tables (a full recluster deletes and rebuilds every cluster).
  • Cleanup loop reaps finished task results after a TTL, measured from completion time so a long job's result is not reaped right after it finishes (registered in the app lifespan, mirroring the model-download cleanup in models.py). Running tasks are never reaped, and terminal results are not deleted on first poll, so repeated polls (multiple tabs or retries) stay consistent.

Frontend

  • Raise the default axios timeout from 10s to 30s, and add LONG_REQUEST_TIMEOUT_MS (120s) applied per-call to the face-search and multi-person-search endpoints.
  • Replace the one-shot reclustering mutation with a useGlobalRecluster polling hook, wired into the Settings recluster button (same loader/toast UX as before). The hook polls with a self-scheduling timeout and a run-id guard so overlapping requests and stale runs cannot update state or leak timers.

Scope and deliberate deferrals

  • Polling, not SSE. The issue suggested SSE; the clustering call emits no progress to stream, so SSE would carry the same running-then-done information as polling with more machinery. A real progress bar (which also needs progress reporting added inside the clustering function) is left for a separate issue.
  • Memories endpoints are intentionally left untouched because that work is owned by a GSoC project. memories/generate, timeline and locations are not modified in this PR; they still benefit from the global 10s to 30s default change, and any deeper async/persistence work on them is left to the GSoC effort.
  • Known follow-up: the background-job machinery added here (ReclusterTask, recluster_tasks, _cleanup_stale_recluster_tasks) duplicates the structure already in models.py. It works correctly with no runtime cost, but could be unified into a shared utility in a later refactor; it is deferred here to avoid touching the working model-download code.

Testing

  • Frontend: tsc --noEmit, ESLint, Prettier, and the full Jest suite (133/133) pass.
  • Backend: black (24.4.2) and ruff (0.4.10) clean on changed files; the concurrency-guard, cleanup, and repeat-poll logic verified with a standalone asyncio simulation. No new unit tests were added. The backend pytest suite was not run locally (Python version mismatch with the project's pinned deps), so please confirm via CI.

AI Usage Disclosure:

We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact. AI slop is strongly discouraged and may lead to banning and blocking. Do not spam our repos with AI slop.

Check one of the checkboxes below:

  • This PR does not contain AI-generated code at all.
  • This PR contains AI-generated code. I have read the AI Usage Policy and this PR complies with this policy. I have tested the code locally and I am responsible for it.

I have used the following AI models and tools: Codex

Checklist

  • My PR addresses a single issue, fixes a single bug or makes a single improvement.
  • My code follows the project's code style and conventions
  • If applicable, I have made corresponding changes or additions to the documentation
  • If applicable, I have made corresponding changes or additions to tests
  • My changes generate no new warnings or errors
  • I have joined the Discord server and I will share a link to this PR with the project maintainers there
  • I have read the Contribution Guidelines
  • Once I submit my PR, CodeRabbit AI will automatically review it and I will address CodeRabbit's comments.
  • I have filled this PR template completely and carefully, and I understand that my PR may be closed without review otherwise.

Summary by CodeRabbit

  • New Features
    • Global face reclustering now starts as a background job and returns a task_id.
    • Added a status endpoint/UI flow to poll reclustering progress and completion (running/complete/error), including clusters created and faces skipped.
  • Bug Fixes
    • Improved reliability for long-running face operations with longer request timeouts.
    • Prevented overlapping reclustering runs and ensured completed/failed results remain available briefly for polling.
  • Tests
    • Added coverage for the global reclustering React hook, including success, error, cancellation, and rapid re-triggers.

@coderabbitai

coderabbitai Bot commented Jun 28, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: dfe17b1f-1e47-4423-afb1-649e1328359c

📥 Commits

Reviewing files that changed from the base of the PR and between a422cc2 and 689fc83.

📒 Files selected for processing (1)
  • frontend/src/hooks/__tests__/useGlobalRecluster.test.tsx

Walkthrough

Global face reclustering now starts as a background task and is checked through a status endpoint. The backend adds task tracking, cleanup, and async execution; the frontend adds new API calls, polling, timeout updates, tests, and revised settings-page wiring.

Changes

Async Global Reclustering Flow

Layer / File(s) Summary
Backend schema contracts
backend/app/schemas/face_clusters.py
Removes GlobalReclusterData/Response and adds separate start and status models with task_id and constrained terminal states.
Backend task infrastructure and endpoints
backend/app/routes/face_clusters.py, backend/main.py, backend/run-server.ps1, backend/run.sh
Adds the reclustering task registry, concurrency guard, async worker, TTL cleanup helper, start/status route handlers, lifespan cleanup wiring, and single-worker startup settings.
Frontend HTTP timeouts and reclustering API
frontend/src/api/axiosConfig.ts, frontend/src/api/apiEndpoints.ts, frontend/src/api/api-functions/face_clusters.ts
Introduces shared timeout constants, updates axios defaults, applies the longer timeout to face-search requests, and replaces the synchronous reclustering API with start and status calls.
Polling hook and settings wiring
frontend/src/hooks/useGlobalRecluster.tsx, frontend/src/hooks/__tests__/useGlobalRecluster.test.tsx, frontend/src/pages/SettingsPage/components/ApplicationControlsCard.tsx
Adds the polling hook and its tests, then rewires the Settings page to trigger the hook, surface success and error feedback, and show skipped-face alerts.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • AOSSIE-Org/PictoPy#560: This PR changes the same global reclustering endpoint and frontend trigger flow that are converted to async polling here.
  • AOSSIE-Org/PictoPy#1290: This PR also touches reclustering response fields such as faces_skipped, which are surfaced in the new status payload here.

Suggested labels

Python, TypeScript/JavaScript

Suggested reviewers

  • rahulharpal1603

Poem

🐇 Hop, hop — the task is on its way,
No more long waits in the HTTP fray.
A task_id blooms, then polls with grace,
Till clusters settle in their place.
Faces skipped? We whisper, “seen!”
Async carrots keep things clean.

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main changes: async global reclustering and removal of the 10s axios cap.
Linked Issues check ✅ Passed The PR addresses #1345's main scope by making global reclustering asynchronous, raising the default timeout, and adding per-request timeouts for long face search calls.
Out of Scope Changes check ✅ Passed All changes support the reclustering/timeout fix; no unrelated functionality appears to be added.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@github-actions github-actions Bot added backend bug Something isn't working labels Jun 28, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (1)
frontend/src/hooks/useGlobalRecluster.tsx (1)

40-116: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add automated coverage for the polling lifecycle.

This hook now owns the core reclustering UX, but the PR does not include tests for success, error, unmount cleanup, or repeated-trigger behavior. Those are the paths most likely to regress loaders, dialogs, and interval cleanup. As per path instructions, "Ensure that test code is automated, comprehensive, and follows testing best practices" and "Verify that all critical functionality is covered by tests."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@frontend/src/hooks/useGlobalRecluster.tsx` around lines 40 - 116, Add
automated tests for useGlobalRecluster to cover the full polling lifecycle:
successful completion, error handling from both startGlobalReclustering and
getGlobalReclusterStatus, cleanup via stopPolling on unmount, and repeated
trigger calls preventing duplicate intervals. Use the useGlobalRecluster hook
and its trigger/stopPolling behavior as the primary targets, and assert
queryClient.invalidateQueries is called on completion and that state transitions
to isSuccess/isError are set correctly. Ensure the tests mock the interval/timer
behavior and verify polling is cleared in all termination paths.

Source: Path instructions

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@backend/app/routes/face_clusters.py`:
- Around line 56-57: The task lifecycle in face_clusters should track when the
job finished, not when the tracker object was created, because cleanup uses that
timestamp to age out results too early. Update the tracker state used by the
recluster flow (the class/fields around created_at, task, and the cleanup logic
in the polling/cleanup path) so terminal results are retained from completion
time, and make the cleanup code compare against a completion timestamp rather
than the creation timestamp.
- Around line 60-67: The global recluster tracking in face_clusters.py is
worker-local, so `recluster_tasks` and `_active_recluster_task_id` can diverge
across multiple backend workers and allow duplicate jobs or broken polling.
Update the recluster flow around `recluster_tasks`, `_active_recluster_task_id`,
and the task creation/polling handlers to use a shared coordination mechanism
such as a distributed lock or shared persistence layer, or otherwise enforce a
single-worker deployment in the runtime startup scripts/config.

In `@frontend/src/hooks/useGlobalRecluster.tsx`:
- Around line 54-113: The polling logic in useGlobalRecluster’s trigger is
vulnerable to overlapping status requests and stale runs because
setInterval(async ...) can stack calls and an older trigger() can keep running
after a newer one starts. Replace the interval-based polling with a
self-scheduling timeout loop, and add a monotonically increasing run id/ref
inside trigger, stopPolling, and the polling callback so only the latest run can
update state or clear polling. Make sure getGlobalReclusterStatus, setState, and
queryClient.invalidateQueries only execute for the active request.

---

Nitpick comments:
In `@frontend/src/hooks/useGlobalRecluster.tsx`:
- Around line 40-116: Add automated tests for useGlobalRecluster to cover the
full polling lifecycle: successful completion, error handling from both
startGlobalReclustering and getGlobalReclusterStatus, cleanup via stopPolling on
unmount, and repeated trigger calls preventing duplicate intervals. Use the
useGlobalRecluster hook and its trigger/stopPolling behavior as the primary
targets, and assert queryClient.invalidateQueries is called on completion and
that state transitions to isSuccess/isError are set correctly. Ensure the tests
mock the interval/timer behavior and verify polling is cleared in all
termination paths.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0aa7d84e-f074-4372-a926-3c8fca0f1a52

📥 Commits

Reviewing files that changed from the base of the PR and between b3eee2c and cd74151.

📒 Files selected for processing (8)
  • backend/app/routes/face_clusters.py
  • backend/app/schemas/face_clusters.py
  • backend/main.py
  • frontend/src/api/api-functions/face_clusters.ts
  • frontend/src/api/apiEndpoints.ts
  • frontend/src/api/axiosConfig.ts
  • frontend/src/hooks/useGlobalRecluster.tsx
  • frontend/src/pages/SettingsPage/components/ApplicationControlsCard.tsx

Comment thread backend/app/routes/face_clusters.py
Comment thread backend/app/routes/face_clusters.py
Comment thread frontend/src/hooks/useGlobalRecluster.tsx
…axios cap (AOSSIE-Org#1345)

The shared axios client applied a hard 10s timeout to every backend request
with no per-call override, so synchronous endpoints that scale with library
size (notably global face reclustering) aborted in the UI after ~10s while the
backend kept running to completion, leaving the UI and DB inconsistent.

Backend
- POST /face-clusters/global-recluster now starts the full DBSCAN pass as a
  background task and returns a task_id immediately (202 Accepted), running the
  blocking work via asyncio.to_thread so the event loop is not blocked.
- New GET /face-clusters/global-recluster/{task_id} to poll the job status
  (running | complete | error).
- Concurrency guard: a second trigger while one is running rejoins the in-flight
  task instead of starting a second pass that would race on the cluster tables.
- Cleanup loop reaps finished task results after a TTL (registered in the app
  lifespan); running tasks are never reaped. Terminal results are not deleted on
  first poll, so repeated polls (multiple tabs/retries) stay consistent.

Frontend
- Raise the default axios timeout 10s -> 30s and add LONG_REQUEST_TIMEOUT_MS
  (120s) applied per-call to the face-search / multi-person-search endpoints.
- Replace the one-shot reclustering mutation with a useGlobalRecluster polling
  hook wired into the Settings recluster button (same loader/toast UX).

Scoped to the face-clustering side of the issue: memories endpoints are left
untouched (handled separately), and a live progress bar / SSE is deferred to a
follow-up.
@VanshajPoonia VanshajPoonia force-pushed the fix/1345-recluster-timeout branch from cd74151 to 2f1554f Compare June 28, 2026 11:09
…l loop

Addresses review feedback on the async reclustering flow.

- ReclusterTask now records finished_at, and the cleanup loop ages terminal
  results from completion time instead of creation time. Previously a job that
  ran close to the TTL could be reaped almost immediately after finishing,
  making polling clients see a 404 instead of the result.

- useGlobalRecluster now polls with a self-scheduling setTimeout (the next tick
  is queued only after the current request resolves, so status requests can't
  stack/overlap) and guards every async callback with a monotonically
  increasing run id. A newer trigger() — or unmount — invalidates older runs so
  they cannot keep polling or overwrite state, fixing the orphaned-interval leak
  on repeated triggers.
…er setup

- Add unit tests for the useGlobalRecluster polling hook covering the success
  lifecycle, error from start, error reported by the status poll, polling
  cleanup on unmount, and that a rapid second trigger does not leave a second
  poll loop running.

- Warn at startup when WORKERS > 1, since model-download and global-reclustering
  job tracking is in-memory and per-worker; a job started in one worker is
  invisible to others, so deployments should keep a single worker. This is a
  non-breaking safeguard (no behaviour change to the default single-worker run).
- main.py already defines a module-level logger; remove the duplicate one
  added for the worker warning (the lifespan resolves the existing logger at
  runtime).
- Remove the now-unused GlobalReclusterData interface from the frontend API
  module (superseded by the start/status data interfaces).
…king

Model-download and global-reclustering job state lives in per-worker memory, so
running multiple workers would let a job start in one worker while status polls
hit another (404s, duplicate jobs). Hardcode the server to a single worker in
run.sh and run-server.ps1 (removing the WORKERS override) and document the
constraint at the in-memory state. Replaces the previous best-effort startup
warning, which could not observe the real worker count.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@frontend/src/hooks/__tests__/useGlobalRecluster.test.tsx`:
- Around line 101-127: The `useGlobalRecluster` test suite is missing coverage
for the rejected-status path from `getGlobalReclusterStatus`. Add a test
alongside the existing `sets an error when the job reports failure` and `sets an
error when starting the job fails` cases that makes `mockStatus` reject (for
example, a missing/aged-out task), then assert the hook sets `isError`, surfaces
the failure message, and stops polling/invalidation behavior as expected via
`trigger()` and `flush()`. Keep the new case aligned with the existing `setup`,
`mockStart`, and `mockStatus` helpers so the cleanup/TTL contract is covered.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7d92f44c-69f1-47db-843d-a7914af5c785

📥 Commits

Reviewing files that changed from the base of the PR and between 2f1554f and a422cc2.

📒 Files selected for processing (7)
  • backend/app/routes/face_clusters.py
  • backend/main.py
  • backend/run-server.ps1
  • backend/run.sh
  • frontend/src/api/api-functions/face_clusters.ts
  • frontend/src/hooks/__tests__/useGlobalRecluster.test.tsx
  • frontend/src/hooks/useGlobalRecluster.tsx
💤 Files with no reviewable changes (1)
  • frontend/src/api/api-functions/face_clusters.ts
🚧 Files skipped from review as they are similar to previous changes (3)
  • frontend/src/hooks/useGlobalRecluster.tsx
  • backend/main.py
  • backend/app/routes/face_clusters.py

Comment thread frontend/src/hooks/__tests__/useGlobalRecluster.test.tsx
…ster

Addresses a CodeRabbit review comment: the suite only covered the
{success: false, status: 'error'} payload path, not a rejected status
request (e.g. 404 for a missing/aged-out task ID).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

BUG: Frontend aborts long operations after 10s (global axios timeout) - reclustering, memory generation & search fail on large libraries

1 participant