[superlog] Downgrade stalled-job event from ERROR to WARN in uptime worker #479
Greptile Summary
This PR changes the BullMQ
Confidence Score: 5/5
Safe to merge: the change is tightly scoped to a single event handler and does not alter job processing logic, retry behaviour, or any other event handler.
The only change is replacing captureError with log.warn in the stalled handler. The failed and error handlers still capture errors at the appropriate level, so no real failure goes undetected. The import of log from evlog is a minor, low-risk addition alongside the existing createLogger import. The reasoning matches BullMQ's documented behaviour for maxStalledCount.
No files require special attention.
Sequence Diagram
sequenceDiagram
participant BullMQ
participant UptimeWorker
participant Logger as log (evlog)
participant ErrorCapture as captureError
Note over BullMQ,UptimeWorker: Worker restart / lock-renewal thread gone
BullMQ->>UptimeWorker: emit "stalled" (jobId)
UptimeWorker->>Logger: "log.warn({ service, error_step, error_message, job_id })"
Note over Logger: WARN — job moved back to wait queue
BullMQ->>UptimeWorker: job re-queued in wait queue
BullMQ->>UptimeWorker: job executes again (delayed ~stalledInterval)
alt maxStalledCount exceeded
BullMQ->>UptimeWorker: emit "failed" (job, error)
UptimeWorker->>ErrorCapture: "captureError(error, { error_step, ... })"
Note over ErrorCapture: ERROR — truly lost job
end
alt worker-level error
BullMQ->>UptimeWorker: emit "error" (error)
UptimeWorker->>ErrorCapture: "captureError(error, { error_step })"
Note over ErrorCapture: ERROR — worker error
end
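The "maxStalledCount exceeded" branch in the diagram can be read as a simple threshold on the number of times a job has stalled. The following is a minimal sketch of that decision only. It is not BullMQ code: the function name `outcomeAfterStall` and the `StallOutcome` type are hypothetical, and the default of 1 is taken from the PR description.

```typescript
// Hypothetical model of the recovery rule described in this PR.
// It only illustrates the threshold; BullMQ's real implementation lives in Redis scripts.
type StallOutcome = "wait" | "failed";

function outcomeAfterStall(stallCount: number, maxStalledCount = 1): StallOutcome {
  // A job is recovered (moved back to the wait queue) until its stall count
  // exceeds maxStalledCount; after that it is moved to failed.
  return stallCount > maxStalledCount ? "failed" : "wait";
}

// First stall: recoverable, so WARN is appropriate.
const firstStall = outcomeAfterStall(1);
// Second stall with the default limit: the job is lost and surfaces via "failed".
const secondStall = outcomeAfterStall(2);
```

Under this reading, only the second outcome represents a lost job, which is why the "stalled" event alone does not need ERROR severity.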
Reviews (1): Last reviewed commit: "[superlog] Downgrade stalled-job event f..."
izadoesdev left a comment
Reviewed against staging with a subagent pass. This is a focused one-file uptime worker change that replaces captureError/ERROR alerting for recoverable BullMQ stalled-job events with log.warn while preserving visibility. The PR targets staging, has no merge conflict, and lint/type/uptime tests are clean; the old red Test check is in @databuddy/ai and appears unrelated to this uptime diff, while recent staging CI is green. Prefer this focused survivor over #501, which is conflicting and broader.
Summary
The uptime worker's BullMQ `stalled` event handler called `captureError` (which logs at ERROR severity) for every stalled job, including recoverable ones. On any worker restart, all in-flight jobs stall simultaneously when the lock-renewal thread disappears, producing a burst of 50+ ERROR log entries at a single timestamp and triggering false-positive incident alerts each time.

BullMQ v5 (with its default `maxStalledCount: 1`) moves a first-time stalled job back to the `wait` queue, so the uptime check still runs (just delayed by one `stalledInterval` ≈ 2 min). The ERROR level was therefore misleading: the operation hadn't actually failed. The `failed` event handler already catches jobs that are truly lost (stall count exceeds `maxStalledCount`, or normal job failures).

The fix replaces `captureError(...)` in the `stalled` handler with `log.warn(...)`, keeping the event visible in logs while preventing it from generating ERROR-level incidents during normal restarts. No other behavior changes.

Alternative: The stalled handler could also be removed entirely and the `failed` event relied on exclusively; however, keeping a WARN log preserves operational visibility into stall frequency without the alert noise.
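The severity split described above can be sketched as follows. This is an illustration under stated assumptions, not the PR's actual diff: a Node `EventEmitter` stands in for the BullMQ worker so the example runs without Redis, `log` and `captureError` are hypothetical in-memory stubs for the project's real helpers, and the field values (`"uptime"`, `"bullmq_stalled"`, and so on) are made up. The event names and the `service`, `error_step`, `error_message`, and `job_id` fields follow the PR discussion.

```typescript
import { EventEmitter } from "node:events";

type LogEntry = { level: "warn" | "error"; fields: Record<string, unknown> };
const entries: LogEntry[] = [];

// Hypothetical stand-ins for the project's log (evlog) and captureError helpers.
const log = {
  warn(fields: Record<string, unknown>): void {
    entries.push({ level: "warn", fields });
  },
};

function captureError(error: unknown, fields: Record<string, unknown> = {}): void {
  const message = error instanceof Error ? error.message : String(error);
  entries.push({ level: "error", fields: { ...fields, error_message: message } });
}

// `worker` is a plain EventEmitter here; in the real service it would be the BullMQ worker.
function attachHandlers(worker: EventEmitter): void {
  // Recoverable: the job goes back to the wait queue, so record a WARN only.
  worker.on("stalled", (jobId: string) => {
    log.warn({
      service: "uptime", // assumed value
      error_step: "bullmq_stalled", // assumed value
      error_message: `Job ${jobId} stalled`,
      job_id: jobId,
    });
  });

  // Truly lost or failed jobs still go through captureError at ERROR severity.
  worker.on("failed", (job: { id?: string } | undefined, error: Error) => {
    captureError(error, { error_step: "bullmq_failed", job_id: job?.id });
  });

  // Worker-level errors are unchanged.
  worker.on("error", (error: Error) => {
    captureError(error, { error_step: "bullmq_worker_error" });
  });
}

const worker = new EventEmitter();
attachHandlers(worker);
worker.emit("stalled", "job-1");
worker.emit("failed", { id: "job-2" }, new Error("boom"));
```

After the two emits, `entries` holds one WARN entry for the stall and one ERROR entry for the failure, which mirrors the intent of the change: a restart-induced burst of stalls stays visible but no longer looks like an incident.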
Summary by cubic
Downgraded `bullmq` stalled-job logs in the uptime worker from ERROR to WARN to prevent false-positive incident alerts during worker restarts. First stalls are auto-retried by `bullmq` v5, so we keep visibility with WARN without treating recoverable stalls as failures; no behavior change.
Written for commit 9466d8a.