Skip to content

ENG-3925: Add Celery on_failure handler for worker-level DSR task deaths#8252

Draft
eastandwestwind wants to merge 3 commits into
mainfrom
ENG-3925
Draft

ENG-3925: Add Celery on_failure handler for worker-level DSR task deaths#8252
eastandwestwind wants to merge 3 commits into
mainfrom
ENG-3925

Conversation

@eastandwestwind
Copy link
Copy Markdown
Contributor

@eastandwestwind eastandwestwind commented May 21, 2026

Ticket ENG-3925

Description Of Changes

When a privacy request task dies at the worker level (OOM kill, hard timeout, broker disconnect), the task's Python exception handler never runs, so no execution log is created. The request gets stuck in in_processing with zero diagnostic info until the stuck task reaper eventually finds it — but even then, only a generic "stuck without running task" message is logged.

Added an on_failure callback on DatabaseTask (the Celery base class for all DB tasks) that:

  • Extracts privacy_request_id from task kwargs
  • Checks if the in-task BaseException handler already marked the request as errored (avoids double-logging)
  • If not, writes an error execution log with the failure reason and marks the request as errored
  • Gracefully handles DB failures in the callback itself (logs and swallows)

Code Changes

  • src/fides/api/tasks/__init__.py — Added on_failure method to DatabaseTask
  • tests/fides/ops/tasks/test_database_task.py — 4 new tests:
    • Skips non-privacy-request tasks
    • Creates error log for worker death
    • Skips already-errored requests (no double-logging)
    • Handles DB errors gracefully

Steps to Confirm

  1. Run the new unit tests: pytest tests/fides/ops/tasks/test_database_task.py::TestDatabaseTaskOnFailure -v
  2. To test end-to-end: configure a hard time limit on the task and create a long-running privacy request that exceeds it — verify the execution log contains "Task failed at worker level"

Pre-Merge Checklist

  • Issue requirements met
  • All CI pipelines succeeded
  • CHANGELOG.md updated
    • Add a db-migration This indicates that a change includes a database migration label
    • Add a high-risk This issue suggests changes that have a high-probability of breaking existing code label
    • Updates unreleased work already in Changelog, no new entry necessary
  • UX feedback:
    • No UX review needed
  • Followup issues:
    • No followup issues
  • Database migrations:
    • No migrations
  • Documentation:
    • No documentation updates required

eastandwestwind and others added 3 commits May 21, 2026 14:20
When a privacy request task dies at the worker level (OOM kill, hard
timeout, broker disconnect), no execution log is created. The on_failure
callback on DatabaseTask catches these failures, writes an error
execution log with the failure reason, and marks the request as errored.
Skips if the in-task BaseException handler already handled the error.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@vercel
Copy link
Copy Markdown
Contributor

vercel Bot commented May 21, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

2 Skipped Deployments
Project Deployment Actions Updated (UTC)
fides-plus-nightly Ignored Ignored May 21, 2026 12:22pm
fides-privacy-center Ignored Ignored May 21, 2026 12:22pm

Request Review

if not privacy_request:
return

if privacy_request.status == PrivacyRequestStatus.error:
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are already handled by the in-task handler

Comment on lines +78 to +79
from fides.api.models.privacy_request import PrivacyRequest
from fides.api.schemas.privacy_request import PrivacyRequestStatus
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inline to avoid circular deps 😢

@codecov
Copy link
Copy Markdown

codecov Bot commented May 21, 2026

Codecov Report

❌ Patch coverage is 90.47619% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.15%. Comparing base (86ff003) to head (1f814cd).
⚠️ Report is 16 commits behind head on main.

Files with missing lines Patch % Lines
src/fides/api/tasks/__init__.py 90.47% 1 Missing and 1 partial ⚠️

❌ Your patch check has failed because the patch coverage (90.47%) is below the target coverage (100.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8252      +/-   ##
==========================================
+ Coverage   85.10%   85.15%   +0.04%     
==========================================
  Files         669      670       +1     
  Lines       43370    43518     +148     
  Branches     5080     5096      +16     
==========================================
+ Hits        36911    37056     +145     
- Misses       5351     5352       +1     
- Partials     1108     1110       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

connection_key=None,
dataset_name="Worker task failure",
collection_name=None,
message=f"Task failed at worker level: {type(exc).__name__}: {exc}",
Copy link
Copy Markdown
Contributor Author

@eastandwestwind eastandwestwind May 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some example scenarios with corresponding err message:

OOM / Hard time limit (Celery kills the worker):

  • Task failed at worker level: TimeLimitExceeded: TimeLimitExceeded(3600,)
  • Task failed at worker level: WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL) Job: 42.

Broker disconnect:

  • Task failed at worker level: ConnectionError: Error while reading from socket: Connection reset by peer

DB connection lost mid-task (if it escapes the catch-all):

  • Task failed at worker level: OperationalError: (psycopg2.OperationalError) server closed the connection unexpectedly

Memory watchdog (if enabled):

  • Task failed at worker level: MemoryLimitExceeded: Memory usage at 94.2% exceeds threshold of 90%

_task_engine = None
_sessionmaker = None

def on_failure(
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not going to help if the worker process is killed by the OS, only "softer" failures where Celery is still alive

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant