fix(face-clusters): run global reclustering as async job; remove 10s axios timeout cap (#1345) by VanshajPoonia · Pull Request #1349 · AOSSIE-Org/PictoPy

VanshajPoonia · 2026-06-28T10:55:38Z

Addressed Issues:

Screenshots/Recordings:

No visual UI change. The fix is behavioural: previously the Recluster Faces action in Settings (and other long calls) failed in the UI after about 10s on a large library while the backend kept running to completion. Now reclustering runs as a background job and the UI waits for it via polling, showing the existing loader and a success/failure toast on completion.

Additional Notes:

The shared axios client applied a hard 10s timeout to every backend request (frontend/src/api/axiosConfig.ts) with no per-call override. Synchronous endpoints that scale with library size, most notably global face reclustering, run past 10s on a realistic library, so the client aborted with a timeout error while the backend kept running to completion, leaving the UI and DB inconsistent.

Backend

POST /face-clusters/global-recluster now starts the full DBSCAN pass as a background task and returns a task_id immediately (202 Accepted), running the blocking work via asyncio.to_thread so the event loop is not blocked.
New GET /face-clusters/global-recluster/{task_id} to poll job status (running, complete, or error).
Concurrency guard: a second trigger while a job is running rejoins the in-flight task_id instead of starting a second pass that would race on the cluster tables (a full recluster deletes and rebuilds every cluster).
Cleanup loop reaps finished task results after a TTL, measured from completion time so a long job's result is not reaped right after it finishes (registered in the app lifespan, mirroring the model-download cleanup in models.py). Running tasks are never reaped, and terminal results are not deleted on first poll, so repeated polls (multiple tabs or retries) stay consistent.

Frontend

Raise the default axios timeout from 10s to 30s, and add LONG_REQUEST_TIMEOUT_MS (120s) applied per-call to the face-search and multi-person-search endpoints.
Replace the one-shot reclustering mutation with a useGlobalRecluster polling hook, wired into the Settings recluster button (same loader/toast UX as before). The hook polls with a self-scheduling timeout and a run-id guard so overlapping requests and stale runs cannot update state or leak timers.

Scope and deliberate deferrals

Polling, not SSE. The issue suggested SSE; the clustering call emits no progress to stream, so SSE would carry the same running-then-done information as polling with more machinery. A real progress bar (which also needs progress reporting added inside the clustering function) is left for a separate issue.
Memories endpoints are intentionally left untouched because that work is owned by a GSoC project. memories/generate, timeline and locations are not modified in this PR; they still benefit from the global 10s to 30s default change, and any deeper async/persistence work on them is left to the GSoC effort.
Known follow-up: the background-job machinery added here (ReclusterTask, recluster_tasks, _cleanup_stale_recluster_tasks) duplicates the structure already in models.py. It works correctly with no runtime cost, but could be unified into a shared utility in a later refactor; it is deferred here to avoid touching the working model-download code.

Testing

Frontend: tsc --noEmit, ESLint, Prettier, and the full Jest suite (133/133) pass.
Backend: black (24.4.2) and ruff (0.4.10) clean on changed files; the concurrency-guard, cleanup, and repeat-poll logic verified with a standalone asyncio simulation. No new unit tests were added. The backend pytest suite was not run locally (Python version mismatch with the project's pinned deps), so please confirm via CI.

AI Usage Disclosure:

We encourage contributors to use AI tools responsibly when creating Pull Requests. While AI can be a valuable aid, it is essential to ensure that your contributions meet the task requirements, build successfully, include relevant tests, and pass all linters. Submissions that do not meet these standards may be closed without warning to maintain the quality and integrity of the project. Please take the time to understand the changes you are proposing and their impact. AI slop is strongly discouraged and may lead to banning and blocking. Do not spam our repos with AI slop.

Check one of the checkboxes below:

This PR does not contain AI-generated code at all.
This PR contains AI-generated code. I have read the AI Usage Policy and this PR complies with this policy. I have tested the code locally and I am responsible for it.

I have used the following AI models and tools: Codex

Checklist

My PR addresses a single issue, fixes a single bug or makes a single improvement.
My code follows the project's code style and conventions
If applicable, I have made corresponding changes or additions to the documentation
If applicable, I have made corresponding changes or additions to tests
My changes generate no new warnings or errors
I have joined the Discord server and I will share a link to this PR with the project maintainers there
I have read the Contribution Guidelines
Once I submit my PR, CodeRabbit AI will automatically review it and I will address CodeRabbit's comments.
I have filled this PR template completely and carefully, and I understand that my PR may be closed without review otherwise.

Summary by CodeRabbit

New Features
- Global face reclustering now starts as a background job and returns a task_id.
- Added a status endpoint/UI flow to poll reclustering progress and completion (running/complete/error), including clusters created and faces skipped.
Bug Fixes
- Improved reliability for long-running face operations with longer request timeouts.
- Prevented overlapping reclustering runs and ensured completed/failed results remain available briefly for polling.
Tests
- Added coverage for the global reclustering React hook, including success, error, cancellation, and rapid re-triggers.

coderabbitai · 2026-06-28T10:55:50Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: dfe17b1f-1e47-4423-afb1-649e1328359c

📥 Commits

Reviewing files that changed from the base of the PR and between a422cc2 and 689fc83.

📒 Files selected for processing (1)

frontend/src/hooks/__tests__/useGlobalRecluster.test.tsx

Walkthrough

Global face reclustering now starts as a background task and is checked through a status endpoint. The backend adds task tracking, cleanup, and async execution; the frontend adds new API calls, polling, timeout updates, tests, and revised settings-page wiring.

Changes

Async Global Reclustering Flow

Layer / File(s)	Summary
Backend schema contracts `backend/app/schemas/face_clusters.py`	Removes `GlobalReclusterData/Response` and adds separate start and status models with `task_id` and constrained terminal states.
Backend task infrastructure and endpoints `backend/app/routes/face_clusters.py`, `backend/main.py`, `backend/run-server.ps1`, `backend/run.sh`	Adds the reclustering task registry, concurrency guard, async worker, TTL cleanup helper, start/status route handlers, lifespan cleanup wiring, and single-worker startup settings.
Frontend HTTP timeouts and reclustering API `frontend/src/api/axiosConfig.ts`, `frontend/src/api/apiEndpoints.ts`, `frontend/src/api/api-functions/face_clusters.ts`	Introduces shared timeout constants, updates axios defaults, applies the longer timeout to face-search requests, and replaces the synchronous reclustering API with start and status calls.
Polling hook and settings wiring `frontend/src/hooks/useGlobalRecluster.tsx`, `frontend/src/hooks/__tests__/useGlobalRecluster.test.tsx`, `frontend/src/pages/SettingsPage/components/ApplicationControlsCard.tsx`	Adds the polling hook and its tests, then rewires the Settings page to trigger the hook, surface success and error feedback, and show skipped-face alerts.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

AOSSIE-Org/PictoPy#560: This PR changes the same global reclustering endpoint and frontend trigger flow that are converted to async polling here.
AOSSIE-Org/PictoPy#1290: This PR also touches reclustering response fields such as faces_skipped, which are surfaced in the new status payload here.

Suggested labels

Python, TypeScript/JavaScript

Suggested reviewers

rahulharpal1603

Poem

🐇 Hop, hop — the task is on its way,
No more long waits in the HTTP fray.
A task_id blooms, then polls with grace,
Till clusters settle in their place.
Faces skipped? We whisper, “seen!”
Async carrots keep things clean.

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly summarizes the main changes: async global reclustering and removal of the 10s axios cap.
Linked Issues check	✅ Passed	The PR addresses `#1345`'s main scope by making global reclustering asynchronous, raising the default timeout, and adding per-request timeouts for long face search calls.
Out of Scope Changes check	✅ Passed	All changes support the reclustering/timeout fix; no unrelated functionality appears to be added.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (1)

frontend/src/hooks/useGlobalRecluster.tsx (1)
40-116: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add automated coverage for the polling lifecycle.

This hook now owns the core reclustering UX, but the PR does not include tests for success, error, unmount cleanup, or repeated-trigger behavior. Those are the paths most likely to regress loaders, dialogs, and interval cleanup. As per path instructions, "Ensure that test code is automated, comprehensive, and follows testing best practices" and "Verify that all critical functionality is covered by tests."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@frontend/src/hooks/useGlobalRecluster.tsx` around lines 40 - 116, Add
automated tests for useGlobalRecluster to cover the full polling lifecycle:
successful completion, error handling from both startGlobalReclustering and
getGlobalReclusterStatus, cleanup via stopPolling on unmount, and repeated
trigger calls preventing duplicate intervals. Use the useGlobalRecluster hook
and its trigger/stopPolling behavior as the primary targets, and assert
queryClient.invalidateQueries is called on completion and that state transitions
to isSuccess/isError are set correctly. Ensure the tests mock the interval/timer
behavior and verify polling is cleared in all termination paths.
Source: Path instructions

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@backend/app/routes/face_clusters.py`:
- Around line 56-57: The task lifecycle in face_clusters should track when the
job finished, not when the tracker object was created, because cleanup uses that
timestamp to age out results too early. Update the tracker state used by the
recluster flow (the class/fields around created_at, task, and the cleanup logic
in the polling/cleanup path) so terminal results are retained from completion
time, and make the cleanup code compare against a completion timestamp rather
than the creation timestamp.
- Around line 60-67: The global recluster tracking in face_clusters.py is
worker-local, so `recluster_tasks` and `_active_recluster_task_id` can diverge
across multiple backend workers and allow duplicate jobs or broken polling.
Update the recluster flow around `recluster_tasks`, `_active_recluster_task_id`,
and the task creation/polling handlers to use a shared coordination mechanism
such as a distributed lock or shared persistence layer, or otherwise enforce a
single-worker deployment in the runtime startup scripts/config.

In `@frontend/src/hooks/useGlobalRecluster.tsx`:
- Around line 54-113: The polling logic in useGlobalRecluster’s trigger is
vulnerable to overlapping status requests and stale runs because
setInterval(async ...) can stack calls and an older trigger() can keep running
after a newer one starts. Replace the interval-based polling with a
self-scheduling timeout loop, and add a monotonically increasing run id/ref
inside trigger, stopPolling, and the polling callback so only the latest run can
update state or clear polling. Make sure getGlobalReclusterStatus, setState, and
queryClient.invalidateQueries only execute for the active request.

---

Nitpick comments:
In `@frontend/src/hooks/useGlobalRecluster.tsx`:
- Around line 40-116: Add automated tests for useGlobalRecluster to cover the
full polling lifecycle: successful completion, error handling from both
startGlobalReclustering and getGlobalReclusterStatus, cleanup via stopPolling on
unmount, and repeated trigger calls preventing duplicate intervals. Use the
useGlobalRecluster hook and its trigger/stopPolling behavior as the primary
targets, and assert queryClient.invalidateQueries is called on completion and
that state transitions to isSuccess/isError are set correctly. Ensure the tests
mock the interval/timer behavior and verify polling is cleared in all
termination paths.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0aa7d84e-f074-4372-a926-3c8fca0f1a52

📥 Commits

Reviewing files that changed from the base of the PR and between b3eee2c and cd74151.

📒 Files selected for processing (8)

backend/app/routes/face_clusters.py
backend/app/schemas/face_clusters.py
backend/main.py
frontend/src/api/api-functions/face_clusters.ts
frontend/src/api/apiEndpoints.ts
frontend/src/api/axiosConfig.ts
frontend/src/hooks/useGlobalRecluster.tsx
frontend/src/pages/SettingsPage/components/ApplicationControlsCard.tsx

…axios cap (AOSSIE-Org#1345) The shared axios client applied a hard 10s timeout to every backend request with no per-call override, so synchronous endpoints that scale with library size (notably global face reclustering) aborted in the UI after ~10s while the backend kept running to completion, leaving the UI and DB inconsistent. Backend - POST /face-clusters/global-recluster now starts the full DBSCAN pass as a background task and returns a task_id immediately (202 Accepted), running the blocking work via asyncio.to_thread so the event loop is not blocked. - New GET /face-clusters/global-recluster/{task_id} to poll the job status (running | complete | error). - Concurrency guard: a second trigger while one is running rejoins the in-flight task instead of starting a second pass that would race on the cluster tables. - Cleanup loop reaps finished task results after a TTL (registered in the app lifespan); running tasks are never reaped. Terminal results are not deleted on first poll, so repeated polls (multiple tabs/retries) stay consistent. Frontend - Raise the default axios timeout 10s -> 30s and add LONG_REQUEST_TIMEOUT_MS (120s) applied per-call to the face-search / multi-person-search endpoints. - Replace the one-shot reclustering mutation with a useGlobalRecluster polling hook wired into the Settings recluster button (same loader/toast UX). Scoped to the face-clustering side of the issue: memories endpoints are left untouched (handled separately), and a live progress bar / SSE is deferred to a follow-up.

…l loop Addresses review feedback on the async reclustering flow. - ReclusterTask now records finished_at, and the cleanup loop ages terminal results from completion time instead of creation time. Previously a job that ran close to the TTL could be reaped almost immediately after finishing, making polling clients see a 404 instead of the result. - useGlobalRecluster now polls with a self-scheduling setTimeout (the next tick is queued only after the current request resolves, so status requests can't stack/overlap) and guards every async callback with a monotonically increasing run id. A newer trigger() — or unmount — invalidates older runs so they cannot keep polling or overwrite state, fixing the orphaned-interval leak on repeated triggers.

…er setup - Add unit tests for the useGlobalRecluster polling hook covering the success lifecycle, error from start, error reported by the status poll, polling cleanup on unmount, and that a rapid second trigger does not leave a second poll loop running. - Warn at startup when WORKERS > 1, since model-download and global-reclustering job tracking is in-memory and per-worker; a job started in one worker is invisible to others, so deployments should keep a single worker. This is a non-breaking safeguard (no behaviour change to the default single-worker run).

- main.py already defines a module-level logger; remove the duplicate one added for the worker warning (the lifespan resolves the existing logger at runtime). - Remove the now-unused GlobalReclusterData interface from the frontend API module (superseded by the start/status data interfaces).

…king Model-download and global-reclustering job state lives in per-worker memory, so running multiple workers would let a job start in one worker while status polls hit another (404s, duplicate jobs). Hardcode the server to a single worker in run.sh and run-server.ps1 (removing the WORKERS override) and document the constraint at the in-memory state. Replaces the previous best-effort startup warning, which could not observe the real worker count.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@frontend/src/hooks/__tests__/useGlobalRecluster.test.tsx`:
- Around line 101-127: The `useGlobalRecluster` test suite is missing coverage
for the rejected-status path from `getGlobalReclusterStatus`. Add a test
alongside the existing `sets an error when the job reports failure` and `sets an
error when starting the job fails` cases that makes `mockStatus` reject (for
example, a missing/aged-out task), then assert the hook sets `isError`, surfaces
the failure message, and stops polling/invalidation behavior as expected via
`trigger()` and `flush()`. Keep the new case aligned with the existing `setup`,
`mockStart`, and `mockStatus` helpers so the cleanup/TTL contract is covered.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7d92f44c-69f1-47db-843d-a7914af5c785

📥 Commits

Reviewing files that changed from the base of the PR and between 2f1554f and a422cc2.

📒 Files selected for processing (7)

backend/app/routes/face_clusters.py
backend/main.py
backend/run-server.ps1
backend/run.sh
frontend/src/api/api-functions/face_clusters.ts
frontend/src/hooks/__tests__/useGlobalRecluster.test.tsx
frontend/src/hooks/useGlobalRecluster.tsx

💤 Files with no reviewable changes (1)

frontend/src/api/api-functions/face_clusters.ts

🚧 Files skipped from review as they are similar to previous changes (3)

frontend/src/hooks/useGlobalRecluster.tsx
backend/main.py
backend/app/routes/face_clusters.py

…ster Addresses a CodeRabbit review comment: the suite only covered the {success: false, status: 'error'} payload path, not a rejected status request (e.g. 404 for a missing/aged-out task ID).

github-actions Bot added backend bug Something isn't working labels Jun 28, 2026

coderabbitai Bot requested changes Jun 28, 2026

View reviewed changes

Comment thread backend/app/routes/face_clusters.py

Comment thread backend/app/routes/face_clusters.py

Comment thread frontend/src/hooks/useGlobalRecluster.tsx

VanshajPoonia force-pushed the fix/1345-recluster-timeout branch from cd74151 to 2f1554f Compare June 28, 2026 11:09

VanshajPoonia added 4 commits June 28, 2026 16:44

coderabbitai Bot requested changes Jun 28, 2026

View reviewed changes

Comment thread frontend/src/hooks/__tests__/useGlobalRecluster.test.tsx

test(face-clusters): cover status-request rejection in useGlobalReclu…

689fc83

…ster Addresses a CodeRabbit review comment: the suite only covered the {success: false, status: 'error'} payload path, not a rejected status request (e.g. 404 for a missing/aged-out task ID).

coderabbitai Bot approved these changes Jun 28, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

fix(face-clusters): run global reclustering as async job; remove 10s axios timeout cap (#1345)#1349

fix(face-clusters): run global reclustering as async job; remove 10s axios timeout cap (#1345)#1349
VanshajPoonia wants to merge 6 commits into
AOSSIE-Org:mainfrom
VanshajPoonia:fix/1345-recluster-timeout

VanshajPoonia commented Jun 28, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 28, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Uh oh!

Conversation

VanshajPoonia commented Jun 28, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Addressed Issues:

Screenshots/Recordings:

Additional Notes:

AI Usage Disclosure:

Checklist

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

VanshajPoonia commented Jun 28, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 28, 2026 •

edited

Loading