
[superlog] Downgrade worker.job_stalled log from ERROR to WARN #501

Open
superlog-app[bot] wants to merge 1 commit into staging from superlog/downgrade-stalled-job-log-to-warn

superlog-app Bot commented Jun 27, 2026


Summary

The insights worker's stalled BullMQ event handler logged at ERROR severity, creating false-positive incidents in Superlog every time the insights worker restarted or was redeployed while an AI insight generation job was in flight.

BullMQ's stalled event is part of its built-in resilience mechanism: when a worker loses its heartbeat (e.g., during a rolling deploy on Railway), the locks on its active jobs expire and the jobs are automatically moved back to waiting for retry. This is expected behavior, not an operational failure. Terminal failures (after all retry attempts are exhausted) are already captured at ERROR by the existing worker.on('failed', ...) handler.
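
To make the severity split concrete, here is a minimal TypeScript sketch of the two handlers. `createHandlers`, `LogEntry`, and the `emit` callback are hypothetical stand-ins for illustration, not the code in apps/insights/src/worker.ts; only the event names worker.job_stalled and worker.job_failed are taken from this PR's conversation.

```typescript
// Sketch only: createHandlers, LogEntry, and emit are hypothetical names,
// not the actual implementation in apps/insights/src/worker.ts.
type Level = "info" | "warn" | "error";

interface LogEntry {
  level: Level;
  event: string;
  fields: Record<string, unknown>;
}

function createHandlers(emit: (entry: LogEntry) => void) {
  return {
    // A stall means the job lock expired and the job will be retried,
    // so it is recorded as a warning rather than an error.
    stalled(jobId: string): void {
      emit({
        level: "warn",
        event: "worker.job_stalled",
        fields: { job_id: jobId },
      });
    },
    // A failure reported after retries are exhausted stays at error.
    failed(job: { id?: string } | undefined, err: Error): void {
      emit({
        level: "error",
        event: "worker.job_failed",
        fields: { job_id: job?.id, error: err.message },
      });
    },
  };
}
```

With a BullMQ Worker, these would presumably be attached as `worker.on("stalled", handlers.stalled)` and `worker.on("failed", handlers.failed)`; the sketch leaves the queue out so it can run without Redis.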

This patch:

  1. Changes worker.job_stalled from ERROR → WARN so normal recovery during deploys no longer pages.
  2. Adds job_name inference from the job ID prefix (e.g. insights-website-* → insights-generate-website) so stalled-event logs are easier to diagnose without querying the job object.
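
The prefix-based inference in item 2 could look roughly like the sketch below. Only the insights-website-* → insights-generate-website pairing is stated in this PR; the other three prefixes and job names are placeholder assumptions, and the table name `JOB_NAME_BY_ID_PREFIX` is hypothetical. The actual diff may be structured differently.

```typescript
// Sketch only. Each entry maps a job ID prefix to a job name.
// The first pairing comes from the PR description; the remaining three
// are ASSUMED placeholders, not values confirmed by the diff.
const JOB_NAME_BY_ID_PREFIX: ReadonlyArray<readonly [string, string]> = [
  ["insights-website-", "insights-generate-website"],
  ["insights-rollup-", "insights-rollup"], // assumed
  ["insights-dispatch-", "insights-dispatch"], // assumed
  ["insights-maintenance-", "insights-maintenance"], // assumed
];

// Best-effort lookup: returns the first job name whose prefix matches,
// or "unknown" when no prefix is recognised.
function inferJobNameFromId(jobId: string): string {
  for (const [prefix, jobName] of JOB_NAME_BY_ID_PREFIX) {
    if (jobId.startsWith(prefix)) {
      return jobName;
    }
  }
  return "unknown";
}
```

Because the stalled event carries only a job ID, a pure string lookup like this avoids a round trip to fetch the job, at the cost of needing an update whenever a new job ID prefix is introduced.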

An alternative approach would be to suppress the stalled event entirely (just not log it), but keeping it at WARN preserves visibility for cases where stalls happen outside of deploys (e.g. OOM kills or runaway jobs).

Incident on Superlog




Summary by cubic

Downgraded worker.job_stalled logs from error to warn in the insights worker to stop false Superlog incidents during normal BullMQ recovery. Added job_name inference from job ID prefixes (dispatch, maintenance, rollup, website) to make stalled logs easier to diagnose. Terminal failures remain logged as errors via the existing failed handler.

Written for commit 3368aa2. Summary will update on new commits.


vercel Bot commented Jun 27, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| databuddy-status | Ready | Preview, Comment | Jun 27, 2026 6:39am |

2 Skipped Deployments

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| dashboard | Skipped | | Jun 27, 2026 6:39am |
| documentation | Skipped | | Jun 27, 2026 6:39am |

unkey-deploy Bot commented Jun 27, 2026


The latest updates on your projects. Learn more about Unkey Deploy

| Name | Status | Preview | Inspect | Updated (UTC) |
| --- | --- | --- | --- | --- |
| api (preview) | Ready | Visit Preview | Inspect | Jun 27, 2026 6:39am |

greptile-apps Bot commented Jun 27, 2026


Greptile Summary

This PR downgrades the worker.job_stalled log event from ERROR to WARN in the insights worker and adds an inferJobNameFromId helper so stalled-event logs include a human-readable job name without needing to fetch the full job object.

  • Log level change (apps/insights/src/worker.ts): emitInsightsEvent("error", ...) → emitInsightsEvent("warn", ...) for the stalled event; this correctly reflects that BullMQ stalls are a normal part of the lock-expiry / retry cycle and are not terminal failures (which remain at ERROR via the failed handler).
  • inferJobNameFromId helper: Covers all four currently known job name constants (INSIGHTS_GENERATE_WEBSITE_JOB_NAME, INSIGHTS_ROLLUP_JOB_NAME, INSIGHTS_DISPATCH_JOB_NAME, INSIGHTS_MAINTENANCE_JOB_NAME) and returns "unknown" for anything unrecognised, which is a safe, best-effort fallback.

Confidence Score: 5/5

Safe to merge — the change is isolated to one event handler and reduces alert noise without dropping observability.

The only runtime behaviour that changes is the log level emitted on a stalled job, and the new inferJobNameFromId helper covers every job name constant currently imported by the file. Terminal failures remain at ERROR through the existing failed handler. There is no data-path or control-flow change.

No files require special attention. apps/insights/src/worker.ts is the only changed file and the diff is straightforward.

Important Files Changed

| Filename | Overview |
| --- | --- |
| apps/insights/src/worker.ts | Downgrades worker.job_stalled from ERROR to WARN and adds inferJobNameFromId helper; all four known job types are covered, fallback returns "unknown" |

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant BullMQ
    participant Worker as Insights Worker
    participant Logger as emitInsightsEvent

    BullMQ->>Worker: job active (lock acquired)
    Note over Worker: Worker restarts / deploy rolling
    BullMQ->>Worker: stalled event (jobId)
    Worker->>Worker: inferJobNameFromId(jobId)
    Worker->>Logger: "warn("worker.job_stalled", {job_id, job_name})"
    BullMQ->>BullMQ: move job back to "waiting"
    BullMQ->>Worker: job retried (normal retry path)
    alt all retries exhausted
        BullMQ->>Worker: failed event (job, error)
        Worker->>Logger: "error("worker.job_failed", {...})"
    else job succeeds
        BullMQ->>Worker: completed event (job)
        Worker->>Logger: "info("worker.job_completed", {...})"
    end

Reviews (1): Last reviewed commit: "[superlog] Downgrade worker.job_stalled ..."
