# [Bug]: worker `last_seen` frozen for ~12 active workers — heartbeat-ingest appears to stop while workers stay functional

**Component:** Host (orchestrator)

## Describe the bug

On the live `kv.run:8000/flowmesh` host (cluster `lumid0`, namespace `lumid`), every worker except three long-stale rows reports a `last_seen` clustered tightly between **`2026-05-27T18:18:31`** and **`2026-05-27T18:19:01`** UTC: a ~30-second window that 12 workers fell into and none has left. As of `2026-05-28T10:52Z` that is a **~16.6h freeze** across 12 disparate workers on 8 different nodes (including CPU, RTX 5080, RTX 6000 Ada, GB10, and RTX PRO 4000 boxes). The workers themselves are clearly still functional: three of them (wkr-24, wkr-21, wkr-12) report `status: BUSY` against the same stale `last_seen`, so they are actively executing tasks. Only the host's heartbeat-ingest side appears to have stopped writing `last_seen` at that point.

**Expected:** `last_seen` should advance per worker on each heartbeat regardless of execution state.
**Actual:** 12 of the 15 workers stalled at near-identical timestamps within a ~30s window, despite continuing to pick up jobs.
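
A single read of the workers endpoint shows stale timestamps but not whether they are still moving. One way to confirm the freeze is to take two snapshots some time apart and list the workers whose `last_seen` did not advance. The sketch below is a minimal illustration, assuming each snapshot is a JSON array of objects with `id` and `last_seen` fields; `parse_ts` and `frozen_workers` are hypothetical helper names, and `wkr-99` is an invented row added only for contrast.

```python
from __future__ import annotations

from datetime import datetime


def parse_ts(value: str) -> datetime:
    """Parse an ISO-8601 timestamp, treating a trailing 'Z' as UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def frozen_workers(before: list[dict], after: list[dict]) -> list[str]:
    """Return ids of workers whose last_seen did not advance between snapshots."""
    earlier = {w["id"]: parse_ts(w["last_seen"]) for w in before}
    frozen = []
    for worker in after:
        previous = earlier.get(worker["id"])
        # Workers missing from the first snapshot are skipped: nothing to compare.
        if previous is not None and parse_ts(worker["last_seen"]) <= previous:
            frozen.append(worker["id"])
    return sorted(frozen)


# Illustrative snapshots taken one minute apart (whole seconds only).
snapshot_a = [
    {"id": "wkr-24", "last_seen": "2026-05-27T18:18:34Z"},
    {"id": "wkr-99", "last_seen": "2026-05-28T10:51:00Z"},  # hypothetical healthy worker
]
snapshot_b = [
    {"id": "wkr-24", "last_seen": "2026-05-27T18:18:34Z"},  # unchanged, so frozen
    {"id": "wkr-99", "last_seen": "2026-05-28T10:52:00Z"},  # advanced
]
print(frozen_workers(snapshot_a, snapshot_b))
```

The interval between snapshots should exceed the worker heartbeat period, which this report does not state.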

**Reproduction (read-only probe):**

```bash
curl -sS -H "Authorization: Bearer $PAT" \
  https://kv.run:8000/flowmesh/api/v1/workers \
  | jq '.[] | {id, node_id, status, last_seen}' | head -40
```

Output (representative subset, taken at `2026-05-28T10:52Z`):

```
{"id":"wkr-15","node_id":"nde-10","status":"IDLE","last_seen":"2026-05-27T18:18:54.…Z"}
{"id":"wkr-10","node_id":"nde-13","status":"IDLE","last_seen":"2026-05-27T18:18:34.…Z"}
{"id":"wkr-16","node_id":"nde-14","status":"IDLE","last_seen":"2026-05-27T18:18:35.…Z"}
{"id":"wkr-24","node_id":"nde-20","status":"BUSY","last_seen":"2026-05-27T18:18:34.…Z"}
{"id":"wkr-21","node_id":"nde-6", "status":"BUSY","last_seen":"2026-05-27T18:18:37.…Z"}
{"id":"wkr-12","node_id":"nde-7", "status":"BUSY","last_seen":"2026-05-27T18:18:58.…Z"}
... (12 rows total, all between 18:18:31 and 18:19:01)
{"id":"wkr-23","node_id":"nde-18","status":"IDLE","last_seen":"2026-05-26T00:21:19.…Z"}  // 58h, predates the freeze
{"id":"wkr-2", "node_id":"nde-2", "status":"IDLE","last_seen":"2026-05-19T08:21:11.…Z"}  // genuine zombie (nde-2 offline)
{"id":"wkr-3", "node_id":"nde-2", "status":"IDLE","last_seen":"2026-05-19T08:21:11.…Z"}  // genuine zombie
```

The tight clustering across 8 nodes, combined with `BUSY` workers carrying stale heartbeats, points to a host-side ingest issue (a stalled Redis stream consumer or worker-state writer, a channel desync, etc.) rather than 12 simultaneous worker failures.
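
The two signals above can be checked mechanically. The following is a rough sketch, not host code: it finds the largest group of workers whose `last_seen` values fit inside one short window, and lists workers that are `BUSY` while their heartbeat is older than a threshold. The sample rows are the nine shown in the probe output, with the elided fractional seconds dropped; the function names, the 60-second window, and the 5-minute threshold are assumptions chosen for illustration.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone


def parse_ts(value: str) -> datetime:
    """Parse an ISO-8601 timestamp, treating a trailing 'Z' as UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def largest_cluster(workers: list[dict], window: timedelta) -> list[str]:
    """Ids of the largest set of workers whose last_seen span is <= window."""
    rows = sorted(workers, key=lambda w: parse_ts(w["last_seen"]))
    best: list[dict] = []
    start = 0
    for end in range(len(rows)):
        # Shrink from the left until the span fits inside the window again.
        while parse_ts(rows[end]["last_seen"]) - parse_ts(rows[start]["last_seen"]) > window:
            start += 1
        if end - start + 1 > len(best):
            best = rows[start:end + 1]
    return [w["id"] for w in best]


def busy_but_stale(workers: list[dict], now: datetime, max_age: timedelta) -> list[str]:
    """Ids of workers reporting BUSY whose last_seen is older than max_age."""
    return [
        w["id"]
        for w in workers
        if w["status"] == "BUSY" and now - parse_ts(w["last_seen"]) > max_age
    ]


# Rows from the probe output, truncated to whole seconds.
SAMPLE = [
    {"id": "wkr-15", "status": "IDLE", "last_seen": "2026-05-27T18:18:54Z"},
    {"id": "wkr-10", "status": "IDLE", "last_seen": "2026-05-27T18:18:34Z"},
    {"id": "wkr-16", "status": "IDLE", "last_seen": "2026-05-27T18:18:35Z"},
    {"id": "wkr-24", "status": "BUSY", "last_seen": "2026-05-27T18:18:34Z"},
    {"id": "wkr-21", "status": "BUSY", "last_seen": "2026-05-27T18:18:37Z"},
    {"id": "wkr-12", "status": "BUSY", "last_seen": "2026-05-27T18:18:58Z"},
    {"id": "wkr-23", "status": "IDLE", "last_seen": "2026-05-26T00:21:19Z"},
    {"id": "wkr-2", "status": "IDLE", "last_seen": "2026-05-19T08:21:11Z"},
    {"id": "wkr-3", "status": "IDLE", "last_seen": "2026-05-19T08:21:11Z"},
]
NOW = datetime(2026, 5, 28, 10, 52, tzinfo=timezone.utc)  # observation time

print(largest_cluster(SAMPLE, timedelta(seconds=60)))
print(busy_but_stale(SAMPLE, NOW, timedelta(minutes=5)))
```

On the sample rows, the cluster contains the six workers from the freeze window and excludes wkr-23, wkr-2, and wkr-3, while the second check returns the three `BUSY` workers.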

**Impact:**
- Registry can no longer distinguish live workers from dead ones (every monitoring threshold built on `last_seen` flags healthy workers as stale).
- Pre-existing zombies (wkr-2, wkr-3 on offline node `nde-2`/`luyao1`) become indistinguishable from the rest of the fleet.
- Downstream dashboards / alerting (e.g. lum.id cluster registry) lose ground truth.

**Suggested investigation:**
1. Grep the host logs around `2026-05-27T18:18:31Z` ± 60s — look for an exception, a Redis reconnect, or a deploy/restart.
2. Check whether the worker-heartbeat consumer group is stuck on a pending entry (Redis `XPENDING`).
3. Confirm whether `last_seen` is updated *only* on heartbeat ingest, or also on result-submission / task-claim paths (`status: BUSY` updates without `last_seen` updates suggests the two paths diverged).
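
Step 2 could be scripted along the following lines. This is a sketch under several assumptions: that the host ingests heartbeats through a Redis stream with a consumer group, that the stream uses auto-generated entry IDs (which encode a millisecond Unix timestamp), and that the client is redis-py, whose `xpending(name, groupname)` returns a summary dict with `pending`, `min`, `max`, and `consumers`. The stream and group names in the usage comment are placeholders; the real names are not known from this report.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone


def _text(value) -> str:
    """redis-py returns bytes unless decode_responses=True; normalise to str."""
    return value.decode() if isinstance(value, bytes) else str(value)


def entry_id_to_datetime(entry_id) -> datetime:
    """Decode the '<unix-ms>-<seq>' form of an auto-generated stream ID."""
    millis = int(_text(entry_id).split("-", 1)[0])
    base = datetime.fromtimestamp(millis // 1000, tz=timezone.utc)
    return base + timedelta(milliseconds=millis % 1000)


def summarize_pending(client, stream: str, group: str) -> dict:
    """Summarise XPENDING for one consumer group.

    `client` is any object exposing redis-py's xpending(name, groupname).
    """
    summary = client.xpending(stream, group)
    if not summary["pending"]:
        return {"pending": 0, "oldest": None, "per_consumer": {}}
    return {
        "pending": summary["pending"],
        # Timestamp of the oldest delivered-but-unacknowledged entry.
        "oldest": entry_id_to_datetime(summary["min"]),
        "per_consumer": {_text(c["name"]): c["pending"] for c in summary["consumers"]},
    }


# Usage against a real server (requires the redis package; names are placeholders):
#   import redis
#   client = redis.Redis(host="localhost", port=6379)
#   print(summarize_pending(client, "worker-heartbeats", "host-ingest"))
```

An `oldest` value near `2026-05-27T18:18:31Z` would be consistent with a consumer stuck on one entry. Note that `XPENDING` only counts entries that were delivered and not acknowledged; if the consumer stopped reading altogether, comparing the group's `last-delivered-id` from `XINFO GROUPS` with the newest entry in the stream would cover that case.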

## Environment

- FlowMesh version: deployed image as of `2026-05-13T18:06:47Z` (worker `started_at`); pre-v0.1.2 redeploy.
- Cluster: `lumid0`, namespace `lumid`, multi-node (8 nodes, 15 workers — 12 GPU + 3 CPU).
- Observation captured at: `2026-05-28T10:52Z`.

## Additional context

This surfaced while validating the upcoming `lumid-flowmesh-plugin v0.2.0` rollout. The plugin work is unrelated; the heartbeat-ingest behavior reproduces against the current v0.1.1 stack. A `POST /api/v1/admin/workers/prune` (or equivalent) endpoint that drops workers whose `last_seen` is older than a configurable threshold would also help. Currently the host exposes only `GET` on `/api/v1/workers/{id}` and `/api/v1/nodes/{id}` (confirmed via `OPTIONS`), so stale rows cannot be cleaned out via the API.
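
To make the prune request concrete, the selection logic behind such an endpoint might look like the sketch below. It is only an illustration of the threshold idea: `select_prunable`, its parameters, and the 24-hour threshold are hypothetical, and skipping `BUSY` rows is an added safeguard motivated by this bug, where workers with a stale `last_seen` are still executing tasks.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone


def parse_ts(value: str) -> datetime:
    """Parse an ISO-8601 timestamp, treating a trailing 'Z' as UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def select_prunable(
    workers: list[dict],
    now: datetime,
    max_age: timedelta,
    protect_statuses: tuple[str, ...] = ("BUSY",),
) -> list[str]:
    """Ids of workers older than max_age, excluding protected statuses."""
    return [
        w["id"]
        for w in workers
        if w["status"] not in protect_statuses
        and now - parse_ts(w["last_seen"]) > max_age
    ]


# A few rows from the probe output, truncated to whole seconds.
ROWS = [
    {"id": "wkr-15", "status": "IDLE", "last_seen": "2026-05-27T18:18:54Z"},
    {"id": "wkr-24", "status": "BUSY", "last_seen": "2026-05-27T18:18:34Z"},
    {"id": "wkr-23", "status": "IDLE", "last_seen": "2026-05-26T00:21:19Z"},
    {"id": "wkr-2", "status": "IDLE", "last_seen": "2026-05-19T08:21:11Z"},
    {"id": "wkr-3", "status": "IDLE", "last_seen": "2026-05-19T08:21:11Z"},
]
NOW = datetime(2026, 5, 28, 10, 52, tzinfo=timezone.utc)  # observation time

print(select_prunable(ROWS, NOW, timedelta(hours=24)))
```

With a 24-hour threshold at the observation time, only wkr-23, wkr-2, and wkr-3 are selected; the workers caught in the 16.6h freeze are left alone. A shorter threshold would sweep up frozen-but-healthy `IDLE` workers as well, so pruning is only safe once the ingest issue itself is fixed.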

