ENG-3925: Add Celery on_failure handler for worker-level DSR task deaths #8252
eastandwestwind wants to merge 3 commits into
When a privacy request task dies at the worker level (OOM kill, hard timeout, broker disconnect), no execution log is created. The on_failure callback on DatabaseTask catches these failures, writes an error execution log with the failure reason, and marks the request as errored. It skips requests whose error was already handled by the in-task BaseException handler.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
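The flow described in the commit message can be sketched without Celery or a database. This is only an illustration, not the code in this PR: `DatabaseTaskSketch`, the in-memory `REQUESTS` store, and the simplified `PrivacyRequest` dataclass are hypothetical stand-ins, and the early return for tasks without a `privacy_request_id` kwarg is an assumption. The `on_failure(exc, task_id, args, kwargs, einfo)` signature is the one Celery's `Task` class defines.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, List, Tuple


class PrivacyRequestStatus(Enum):
    """Reduced stand-in for the real status enum (only two states shown)."""
    in_processing = "in_processing"
    error = "error"


@dataclass
class PrivacyRequest:
    """Hypothetical stand-in for the real model."""
    id: str
    status: PrivacyRequestStatus = PrivacyRequestStatus.in_processing
    execution_logs: List[str] = field(default_factory=list)


# Hypothetical in-memory replacement for a database lookup.
REQUESTS: Dict[str, PrivacyRequest] = {}


class DatabaseTaskSketch:
    """Mirrors the celery.Task.on_failure signature without importing Celery."""

    def on_failure(
        self,
        exc: BaseException,
        task_id: str,
        args: Tuple[Any, ...],
        kwargs: Dict[str, Any],
        einfo: Any,
    ) -> None:
        # Assumption: tasks unrelated to privacy requests carry no such kwarg.
        privacy_request_id = kwargs.get("privacy_request_id")
        if not privacy_request_id:
            return

        privacy_request = REQUESTS.get(privacy_request_id)
        if not privacy_request:
            return

        # The in-task handler already recorded this failure; avoid double-logging.
        if privacy_request.status == PrivacyRequestStatus.error:
            return

        privacy_request.execution_logs.append(
            f"Task failed at worker level: {type(exc).__name__}: {exc}"
        )
        privacy_request.status = PrivacyRequestStatus.error
```

Calling `on_failure` twice for the same request leaves a single log entry, because the second call hits the status guard and returns early.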
    if not privacy_request:
        return

    if privacy_request.status == PrivacyRequestStatus.error:
These are already handled by the in-task handler
    from fides.api.models.privacy_request import PrivacyRequest
    from fides.api.schemas.privacy_request import PrivacyRequestStatus
Inline to avoid circular deps 😢
Codecov Report
❌ Your patch check has failed because the patch coverage (90.47%) is below the target coverage (100.00%). You can increase the patch coverage or adjust the target coverage.
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8252      +/-   ##
==========================================
+ Coverage   85.10%   85.15%   +0.04%
==========================================
  Files         669      670       +1
  Lines       43370    43518     +148
  Branches     5080     5096      +16
==========================================
+ Hits        36911    37056     +145
- Misses       5351     5352       +1
- Partials     1108     1110       +2
    connection_key=None,
    dataset_name="Worker task failure",
    collection_name=None,
    message=f"Task failed at worker level: {type(exc).__name__}: {exc}",
Some example scenarios with corresponding err message:
OOM / Hard time limit (Celery kills the worker):
Task failed at worker level: TimeLimitExceeded: TimeLimitExceeded(3600,)
Task failed at worker level: WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL) Job: 42.
Broker disconnect:
Task failed at worker level: ConnectionError: Error while reading from socket: Connection reset by peer
DB connection lost mid-task (if it escapes the catch-all):
Task failed at worker level: OperationalError: (psycopg2.OperationalError) server closed the connection unexpectedly
Memory watchdog (if enabled):
Task failed at worker level: MemoryLimitExceeded: Memory usage at 94.2% exceeds threshold of 90%
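To show how the message template produces lines like the ones above, here is a small sketch using only the f-string from the diff. `format_worker_failure` is a hypothetical helper name, and `MemoryLimitExceeded` is defined locally as a stand-in for whatever exception the memory watchdog raises.

```python
class MemoryLimitExceeded(Exception):
    """Local stand-in; the real watchdog exception is not shown in this PR page."""


def format_worker_failure(exc: BaseException) -> str:
    """Render the execution log message using the template from the diff."""
    return f"Task failed at worker level: {type(exc).__name__}: {exc}"


broker_error = ConnectionError(
    "Error while reading from socket: Connection reset by peer"
)
memory_error = MemoryLimitExceeded("Memory usage at 94.2% exceeds threshold of 90%")

broker_message = format_worker_failure(broker_error)
memory_message = format_worker_failure(memory_error)
```

The exception class name comes from `type(exc).__name__` and the remainder from `str(exc)`, so the readability of the log depends on how each exception type formats itself.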
    _task_engine = None
    _sessionmaker = None

    def on_failure(
This is not going to help if the worker process is killed by the OS; it only covers "softer" failures where Celery is still alive.
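The general point behind this comment can be illustrated outside Celery: code inside a process that receives SIGKILL gets no chance to run, so any failure record has to be written by a process that survives. The sketch below is POSIX-only and uses only the standard library; `CHILD_SOURCE` and `run_hard_killed_child` are illustrative names, and nothing here models Celery's own worker pool behavior.

```python
import signal
import subprocess
import sys

# Child program: tries to run cleanup in a `finally` block, then is hard-killed.
CHILD_SOURCE = """
import os, signal
try:
    os.kill(os.getpid(), signal.SIGKILL)  # simulate an OS-level kill (POSIX only)
finally:
    print("cleanup ran", flush=True)  # never reached after SIGKILL
"""


def run_hard_killed_child():
    """Run the child and return (return code, captured stdout)."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_SOURCE],
        capture_output=True,
        text=True,
        timeout=30,
    )
    return result.returncode, result.stdout
```

On POSIX systems the parent sees a negative return code equal to the signal number and no cleanup output, which is why only a surviving process can report this kind of failure.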
Ticket ENG-3925
Description Of Changes
When a privacy request task dies at the worker level (OOM kill, hard timeout, broker disconnect), the task's Python exception handler never runs, so no execution log is created. The request gets stuck in in_processing with zero diagnostic info until the stuck task reaper eventually finds it, and even then only a generic "stuck without running task" message is logged.
Added an on_failure callback on DatabaseTask (the Celery base class for all DB tasks) that:
- Reads privacy_request_id from task kwargs
- Skips if the in-task BaseException handler already marked the request as errored (avoids double-logging)
- Writes an error execution log with the failure reason
- Marks the request as errored
Code Changes
- src/fides/api/tasks/__init__.py: Added on_failure method to DatabaseTask
- tests/fides/ops/tasks/test_database_task.py: 4 new tests
Steps to Confirm
pytest tests/fides/ops/tasks/test_database_task.py::TestDatabaseTaskOnFailure -v
Pre-Merge Checklist
- CHANGELOG.md updated