[superlog] Downgrade stalled-job event from ERROR to WARN in uptime worker #479

Merged
izadoesdev merged 1 commit into staging from superlog/downgrade-stalled-job-to-warn
Jun 30, 2026
Conversation

@superlog-app superlog-app Bot commented Jun 15, 2026

Summary

The uptime worker's BullMQ stalled event handler called captureError (which logs at ERROR severity) for every stalled job, including recoverable ones. On any worker restart, all in-flight jobs stall simultaneously when the lock-renewal thread disappears, producing a burst of 50+ ERROR log entries at a single timestamp — triggering false-positive incident alerts each time.

BullMQ v5 (with its default maxStalledCount: 1) moves a first-time stalled job back to the wait queue, so the uptime check still runs (just delayed by one stalledInterval ≈ 2 min). The ERROR level was therefore misleading: the operation hadn't actually failed. The failed event handler already catches jobs that are truly lost (stall count exceeds maxStalledCount, or normal job failures).
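
The requeue-versus-fail rule described above can be sketched as a small pure function. This is a simplified model written for illustration, not BullMQ's own code; `outcomeAfterStall` and `StallOutcome` are hypothetical names, and the only behavior assumed is what this summary states (a job fails once its stall count exceeds `maxStalledCount`).

```typescript
// Simplified model of the stall handling described in this PR; not BullMQ's code.
type StallOutcome = "wait" | "failed";

// Hypothetical helper: decides where a job goes once a new stall is detected.
// `previousStalls` is how many times the job had already stalled before this one.
function outcomeAfterStall(
  previousStalls: number,
  maxStalledCount: number = 1, // default named in the PR summary
): StallOutcome {
  const stalledCount = previousStalls + 1;
  // A job is only lost when its stall count exceeds the configured limit.
  return stalledCount > maxStalledCount ? "failed" : "wait";
}

const firstStall = outcomeAfterStall(0); // requeued: the uptime check still runs
const secondStall = outcomeAfterStall(1); // exceeds the limit: reported via "failed"
```

Under this model, a restart that stalls each in-flight job once sends every job back to the wait queue, which is why the `stalled` event alone does not indicate a lost check.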

The fix replaces captureError(...) in the stalled handler with log.warn(...), keeping the event visible in logs while preventing it from generating ERROR-level incidents during normal restarts. No other behavior changes.
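
A minimal sketch of the handler change follows. It is not the contents of `apps/uptime/src/worker.ts`: `FakeWorker` is a hypothetical stand-in for the BullMQ worker, `log` and `captureError` are stubs that only record severity, and the field values (`"uptime"`, `"job_stalled"`, `"job_failed"`) are assumptions chosen for illustration.

```typescript
type Fields = Record<string, unknown>;
type LogEntry = { level: "warn" | "error"; fields: Fields };
const entries: LogEntry[] = [];

// Hypothetical stubs standing in for the project's log.warn and captureError.
const log = {
  warn: (fields: Fields): void => {
    entries.push({ level: "warn", fields });
  },
};
function captureError(error: Error, fields: Fields = {}): void {
  entries.push({ level: "error", fields: { ...fields, error_message: error.message } });
}

// Minimal event emitter standing in for the worker (assumption, not BullMQ).
type Handler = (...args: any[]) => void;
class FakeWorker {
  private handlers = new Map<string, Handler[]>();
  on(event: string, handler: Handler): this {
    this.handlers.set(event, [...(this.handlers.get(event) ?? []), handler]);
    return this;
  }
  emit(event: string, ...args: unknown[]): void {
    for (const handler of this.handlers.get(event) ?? []) handler(...args);
  }
}

const worker = new FakeWorker();

// After the change: a stall is recorded at WARN instead of going through captureError.
worker.on("stalled", (jobId: string) => {
  log.warn({
    service: "uptime",
    error_step: "job_stalled",
    error_message: `Job ${jobId} stalled`,
    job_id: jobId,
  });
});

// Unchanged: jobs that are truly lost still reach ERROR through the failed handler.
worker.on("failed", (job: { id?: string } | undefined, error: Error) => {
  captureError(error, { error_step: "job_failed", job_id: job?.id });
});

// Simulate a restart that stalls 50 in-flight jobs, one of which is later lost.
for (let i = 0; i < 50; i++) worker.emit("stalled", `job-${i}`);
worker.emit("failed", { id: "job-0" }, new Error("job lost"));

const warnCount = entries.filter((e) => e.level === "warn").length;
const errorCount = entries.filter((e) => e.level === "error").length;
```

In this simulation the restart burst produces only WARN entries, and the single ERROR entry comes from the job that actually failed.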

Alternative: The stalled handler could also be removed entirely and the failed event relied on exclusively; however, keeping a WARN log preserves operational visibility into stall frequency without the alert noise.

Incident on Superlog


Summary by cubic

Downgraded BullMQ stalled-job logs in the uptime worker from ERROR to WARN to prevent false-positive incident alerts during worker restarts. BullMQ v5 automatically retries a first stall, so WARN keeps visibility without treating recoverable stalls as failures. No other behavior changes.

Written for commit 9466d8a. Summary will update on new commits.

vercel Bot commented Jun 15, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| databuddy-status | Ready | Preview, Comment | Jun 15, 2026 9:06am |

2 Skipped Deployments

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| dashboard | Skipped | | Jun 15, 2026 9:06am |
| documentation | Skipped | | Jun 15, 2026 9:06am |

unkey-deploy Bot commented Jun 15, 2026

The latest updates on your projects. Learn more about Unkey Deploy

| Name | Status | Preview | Inspect | Updated (UTC) |
| --- | --- | --- | --- | --- |
| api (preview) | Ready | Visit Preview | Inspect | Jun 15, 2026 9:06am |

greptile-apps Bot commented Jun 15, 2026

Contributor

Greptile Summary

This PR changes the BullMQ stalled event handler in the uptime worker to log at WARN instead of ERROR, preventing false-positive incident alerts during worker restarts when in-flight jobs stall simultaneously and BullMQ automatically requeues them.

  • The stalled handler now calls log.warn(...) (with a structured payload including error_message, service, and job_id) instead of captureError(new Error(...)), eliminating burst ERROR noise on restarts.
  • The failed and error event handlers remain unchanged at ERROR severity, correctly capturing jobs that are truly lost or encounter worker-level failures.

Confidence Score: 5/5

Safe to merge — the change is tightly scoped to a single event handler and does not alter job processing logic, retry behaviour, or any other event handler.

The only change is replacing captureError with log.warn in the stalled handler. The failed and error handlers still capture errors at the appropriate level, so no real failure goes undetected. The import of log from evlog is a minor, low-risk addition alongside the existing createLogger import. The reasoning matches BullMQ's documented behaviour for maxStalledCount.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| apps/uptime/src/worker.ts | Stalled-job BullMQ event handler downgraded from captureError (ERROR) to log.warn; log import extended to include the top-level log export from evlog; all other event handlers (failed, error) are unchanged. |

Sequence Diagram

```mermaid
sequenceDiagram
    participant BullMQ
    participant UptimeWorker
    participant Logger as log (evlog)
    participant ErrorCapture as captureError

    Note over BullMQ,UptimeWorker: Worker restart / lock-renewal thread gone

    BullMQ->>UptimeWorker: emit "stalled" (jobId)
    UptimeWorker->>Logger: "log.warn({ service, error_step, error_message, job_id })"
    Note over Logger: WARN — job moved back to wait queue

    BullMQ->>UptimeWorker: job re-queued in wait queue
    BullMQ->>UptimeWorker: job executes again (delayed ~stalledInterval)

    alt maxStalledCount exceeded
        BullMQ->>UptimeWorker: emit "failed" (job, error)
        UptimeWorker->>ErrorCapture: "captureError(error, { error_step, ... })"
        Note over ErrorCapture: ERROR — truly lost job
    end

    alt worker-level error
        BullMQ->>UptimeWorker: emit "error" (error)
        UptimeWorker->>ErrorCapture: "captureError(error, { error_step })"
        Note over ErrorCapture: ERROR — worker error
    end
```
Reviews (1): Last reviewed commit: "[superlog] Downgrade stalled-job event f..."

@izadoesdev izadoesdev left a comment

Member

Reviewed against staging with a subagent pass. This is a focused one-file uptime worker change that replaces captureError/ERROR alerting for recoverable BullMQ stalled-job events with log.warn while preserving visibility. The PR targets staging, has no merge conflict, and lint/type/uptime tests are clean; the old red Test check is in @databuddy/ai and appears unrelated to this uptime diff, while recent staging CI is green. Prefer this focused survivor over #501, which is conflicting and broader.

@izadoesdev izadoesdev merged commit cdbdfdd into staging Jun 30, 2026
12 of 13 checks passed
@izadoesdev izadoesdev deleted the superlog/downgrade-stalled-job-to-warn branch June 30, 2026 15:09