[Bug]: worker last_seen frozen for ~12 active workers — heartbeat-ingest appears to stop while workers stay functional #58

@tjluyao

Component: Host (orchestrator)

Describe the bug

On the live kv.run:8000/flowmesh host (cluster lumid0, namespace lumid), every worker except three long-stale rows reports a last_seen clustered tightly between 2026-05-27T18:18:31 and 2026-05-27T18:19:01 UTC: a ~30-second window that 12 workers fell into and none have left. As of 2026-05-28T10:52Z that is a ~16.6h freeze across 12 disparate workers on 8 different nodes (including CPU, RTX 5080, RTX 6000 Ada, GB10, and RTX PRO 4000 boxes). The workers themselves are clearly still functional: three of them (wkr-24, wkr-21, wkr-12) report status: BUSY against the same stale last_seen, so they are actively executing tasks. Only the host's heartbeat-ingest side seems to have stopped writing at that point.

Expected: last_seen should advance per worker on each heartbeat regardless of execution state.
Actual: 12 workers stalled at near-identical timestamps within a ~30s window, despite continuing to pick up jobs.
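
The expected behavior can be checked mechanically by polling the workers endpoint twice and comparing last_seen per worker. The following is a minimal sketch, assuming each row carries `id` and `last_seen` as in the output below; the two snapshots are made-up illustrative values, not captured data.

```python
from datetime import datetime


def parse_ts(value):
    """Parse an ISO-8601 UTC timestamp such as '2026-05-27T18:18:34Z'."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def frozen_workers(earlier, later):
    """Return ids of workers whose last_seen did not advance between two polls."""
    before = {w["id"]: parse_ts(w["last_seen"]) for w in earlier}
    stuck = []
    for w in later:
        prev = before.get(w["id"])
        if prev is not None and parse_ts(w["last_seen"]) <= prev:
            stuck.append(w["id"])
    return sorted(stuck)


# Hypothetical snapshots taken a few minutes apart (illustrative values only).
poll_1 = [
    {"id": "wkr-a", "last_seen": "2026-01-01T00:00:10Z"},
    {"id": "wkr-b", "last_seen": "2026-01-01T00:00:12Z"},
]
poll_2 = [
    {"id": "wkr-a", "last_seen": "2026-01-01T00:05:10Z"},  # advanced
    {"id": "wkr-b", "last_seen": "2026-01-01T00:00:12Z"},  # unchanged
]

stuck = frozen_workers(poll_1, poll_2)
print(stuck)
```

The polling interval should be longer than the heartbeat period, which this report does not state.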

Reproduction (read-only probe):

curl -sS -H "Authorization: Bearer $PAT" \
  https://kv.run:8000/flowmesh/api/v1/workers \
  | jq '.[] | {id, node_id, status, last_seen}' | head -40

Output (representative subset, taken at 2026-05-28T10:52Z):

{"id":"wkr-15","node_id":"nde-10","status":"IDLE","last_seen":"2026-05-27T18:18:54.…Z"}
{"id":"wkr-10","node_id":"nde-13","status":"IDLE","last_seen":"2026-05-27T18:18:34.…Z"}
{"id":"wkr-16","node_id":"nde-14","status":"IDLE","last_seen":"2026-05-27T18:18:35.…Z"}
{"id":"wkr-24","node_id":"nde-20","status":"BUSY","last_seen":"2026-05-27T18:18:34.…Z"}
{"id":"wkr-21","node_id":"nde-6", "status":"BUSY","last_seen":"2026-05-27T18:18:37.…Z"}
{"id":"wkr-12","node_id":"nde-7", "status":"BUSY","last_seen":"2026-05-27T18:18:58.…Z"}
... (12 rows total, all between 18:18:31 and 18:19:01)
{"id":"wkr-23","node_id":"nde-18","status":"IDLE","last_seen":"2026-05-26T00:21:19.…Z"}  // 58h, predates the freeze
{"id":"wkr-2", "node_id":"nde-2", "status":"IDLE","last_seen":"2026-05-19T08:21:11.…Z"}  // genuine zombie (nde-2 offline)
{"id":"wkr-3", "node_id":"nde-2", "status":"IDLE","last_seen":"2026-05-19T08:21:11.…Z"}  // genuine zombie

That tight clustering across 8 nodes, together with the BUSY-with-stale-heartbeat combination, points to a host-side ingest issue (Redis stream consumer / worker-state writer stalled, channel desync, etc.) rather than 12 simultaneous worker failures.
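
The two signals above (tight clustering, and BUSY with a stale heartbeat) can be detected from the same API response. This is a rough sketch using the nine rows shown in the output. The fractional seconds are elided in the report, so the timestamps here are truncated to whole seconds. The 60-second window and 5-minute staleness cutoff are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta


def parse_ts(value):
    """Parse an ISO-8601 UTC timestamp such as '2026-05-27T18:18:34Z'."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def largest_cluster(workers, window=timedelta(seconds=60)):
    """Return the largest group of rows whose last_seen values fit inside `window`."""
    rows = sorted(workers, key=lambda w: parse_ts(w["last_seen"]))
    best, start = [], 0
    for end in range(len(rows)):
        # Shrink the sliding window from the left until it spans <= `window`.
        while parse_ts(rows[end]["last_seen"]) - parse_ts(rows[start]["last_seen"]) > window:
            start += 1
        if end - start + 1 > len(best):
            best = rows[start:end + 1]
    return best


def busy_but_stale(workers, now, max_age=timedelta(minutes=5)):
    """Return ids of workers reporting BUSY whose last_seen is older than `max_age`."""
    return sorted(
        w["id"] for w in workers
        if w["status"] == "BUSY" and now - parse_ts(w["last_seen"]) > max_age
    )


# Rows from the report, fractional seconds dropped.
rows = [
    {"id": "wkr-15", "status": "IDLE", "last_seen": "2026-05-27T18:18:54Z"},
    {"id": "wkr-10", "status": "IDLE", "last_seen": "2026-05-27T18:18:34Z"},
    {"id": "wkr-16", "status": "IDLE", "last_seen": "2026-05-27T18:18:35Z"},
    {"id": "wkr-24", "status": "BUSY", "last_seen": "2026-05-27T18:18:34Z"},
    {"id": "wkr-21", "status": "BUSY", "last_seen": "2026-05-27T18:18:37Z"},
    {"id": "wkr-12", "status": "BUSY", "last_seen": "2026-05-27T18:18:58Z"},
    {"id": "wkr-23", "status": "IDLE", "last_seen": "2026-05-26T00:21:19Z"},
    {"id": "wkr-2", "status": "IDLE", "last_seen": "2026-05-19T08:21:11Z"},
    {"id": "wkr-3", "status": "IDLE", "last_seen": "2026-05-19T08:21:11Z"},
]
observed_at = parse_ts("2026-05-28T10:52:00Z")

cluster_ids = sorted(w["id"] for w in largest_cluster(rows))
stale_busy = busy_but_stale(rows, observed_at)
print(cluster_ids)
print(stale_busy)
```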

Impact:

  • Registry can no longer distinguish live workers from dead ones (every monitoring threshold built on last_seen flags healthy workers as stale).
  • Pre-existing zombies (wkr-2, wkr-3 on offline node nde-2/luyao1) become indistinguishable from the rest of the fleet.
  • Downstream dashboards / alerting (e.g. lum.id cluster registry) lose ground truth.

Suggested investigation:

  1. Grep the host logs around 2026-05-27T18:18:31Z ± 60s — look for an exception, a Redis reconnect, or a deploy/restart.
  2. Check whether the worker-heartbeat consumer group is stuck on a pending entry (Redis XPENDING).
  3. Confirm whether last_seen is updated only on heartbeat ingest, or also on result-submission / task-claim paths (status: BUSY updates without last_seen updates suggests the two paths diverged).
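
For step 2, the summary form of XPENDING reports the smallest pending entry ID, and auto-generated Redis stream IDs begin with a millisecond timestamp, so the age of the oldest unacknowledged heartbeat can be derived from it. The sketch below assumes the summary is available as a dict with `pending` and `min` keys (the shape redis-py returns from `xpending`, as far as I know) and that the stream uses auto-generated IDs. The stream and group names in the comment are hypothetical; the real ones are not known from this report.

```python
def oldest_pending_age_ms(summary, now_ms):
    """Age in milliseconds of the oldest pending entry, or None if nothing is pending.

    `summary` is the XPENDING summary as a dict with 'pending' (count) and
    'min' (smallest pending entry ID, str or bytes).
    """
    if not summary or not summary.get("pending"):
        return None
    min_id = summary["min"]
    if isinstance(min_id, bytes):
        min_id = min_id.decode()
    # Auto-generated stream IDs look like '<unix-ms>-<sequence>'.
    entry_ms = int(min_id.split("-")[0])
    return now_ms - entry_ms


# Against a live host this might look like the following (names are assumptions):
#   import redis, time
#   r = redis.Redis()
#   summary = r.xpending("worker-heartbeats", "host-ingest")
#   age = oldest_pending_age_ms(summary, int(time.time() * 1000))

# Illustrative input only, not captured from the affected host.
example = {"pending": 3, "min": b"1000-0", "max": b"5000-2", "consumers": []}
age = oldest_pending_age_ms(example, now_ms=61000)
print(age)
```

A pending age close to the length of the freeze would suggest the consumer stopped acknowledging at that point; an empty pending list would point elsewhere, for example at the writer that persists last_seen.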

Environment

  • FlowMesh version: deployed image as of 2026-05-13T18:06:47Z (worker started_at); pre-v0.1.2 redeploy.
  • Cluster: lumid0, namespace lumid, multi-node (8 nodes, 15 workers — 12 GPU + 3 CPU).
  • Observation captured at: 2026-05-28T10:52Z.

Additional context

This was surfaced while validating the upcoming lumid-flowmesh-plugin v0.2.0 rollout. The plugin work is unrelated; the heartbeat-ingest behavior reproduces against the current v0.1.1 stack.

Separately, an admin endpoint such as POST /api/v1/admin/workers/prune (or equivalent) that drops workers with last_seen older than a configurable threshold would also help. Currently the host exposes only GET on /api/v1/workers/{id} and /api/v1/nodes/{id} (confirmed via OPTIONS), so stale rows can't be cleaned out via the API.
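
As a sketch of the selection logic such a prune endpoint might apply: the function name, the 7-day threshold, and the choice to skip BUSY workers are all assumptions for illustration, not existing FlowMesh behavior. Skipping BUSY rows is a guard against a freeze like this one, where a stale last_seen does not imply a dead worker.

```python
from datetime import datetime, timedelta


def parse_ts(value):
    """Parse an ISO-8601 UTC timestamp such as '2026-05-19T08:21:11Z'."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def select_prunable(workers, now, threshold):
    """Return ids (input order) of non-BUSY workers with last_seen older than `threshold`."""
    return [
        w["id"] for w in workers
        if w["status"] != "BUSY" and now - parse_ts(w["last_seen"]) > threshold
    ]


# Rows from the report, fractional seconds dropped (they are elided in the source).
fleet = [
    {"id": "wkr-24", "status": "BUSY", "last_seen": "2026-05-27T18:18:34Z"},
    {"id": "wkr-15", "status": "IDLE", "last_seen": "2026-05-27T18:18:54Z"},
    {"id": "wkr-23", "status": "IDLE", "last_seen": "2026-05-26T00:21:19Z"},
    {"id": "wkr-2", "status": "IDLE", "last_seen": "2026-05-19T08:21:11Z"},
    {"id": "wkr-3", "status": "IDLE", "last_seen": "2026-05-19T08:21:11Z"},
]
observed_at = parse_ts("2026-05-28T10:52:00Z")

prunable = select_prunable(fleet, observed_at, threshold=timedelta(days=7))
print(prunable)
```

With a 7-day threshold only the two rows on the offline node are selected; the workers caught in the freeze and the 58h-stale row are left alone.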
