Skip to content

fix(phlower): collapse hourly_workers Counter to worker group#28

Merged
webjunkie merged 1 commit into
mainfrom
fix/phlower-worker-group-hourly-leak
May 18, 2026
Merged

fix(phlower): collapse hourly_workers Counter to worker group#28
webjunkie merged 1 commit into
mainfrom
fix/phlower-worker-group-hourly-leak

Conversation

@webjunkie
Copy link
Copy Markdown
Contributor

TaskAggregate.hourly_workers was keyed on the full Celery hostname, which under K8s embeds a per-pod hash. With rolling deploys churning pod names, each hourly Counter accumulated hundreds of unique string keys that never collapsed back down. With aggregate_retention_hours=168 (7d), the structure grew linearly for ~5 days until the pod hit its memory limit and got OOMKilled — then started over.

Fix: aggregate by worker_group (the stable name derived in events.py and already stored on InvocationRecord) instead of the raw hostname. The Counter now has a tiny bounded set of keys. SQLite invocation rows still carry the full hostname so per-invocation drill-in is unchanged.

The recovery path was leaking the same way — _flush_counts was replaying the SQLite worker column verbatim. Group resolution now happens in _load_counts while building the batch, so the regex runs outside the store lock instead of blocking live ingest.

SNAPSHOT_VERSION bumped to 2 so existing on-disk hourly_workers blobs are discarded and aggregates rebuild clean from row replay.

Test plan

  • Build image, deploy, watch RSS curve over 24–48h — should plateau instead of climbing ~1 GB/day
  • Confirm task detail page still shows top worker groups (now collapsed by group, no longer per-pod hash)

…ng memory leak

TaskAggregate.hourly_workers was keyed on the full Celery hostname,
which in K8s embeds a per-pod hash. Rolling deploys and karpenter
scaling churn pod names constantly, so each hourly Counter accumulated
hundreds of unique string keys that never collapsed back down. With
aggregate_retention_hours=168 the structure grew linearly for ~5 days
before tripping the 6 GB pod limit — matches the observed OOM cadence.

Switch the aggregate keys to worker_group (already derived in events.py
and stored on InvocationRecord) so the Counter has a tiny, stable set
of keys instead. SQLite invocation rows still carry the full hostname
for per-invocation drill-in; only the aggregate collapses.

Recovery path was leaking the same way — _flush_counts replays the
SQLite worker column verbatim. Apply extract_worker_group there too,
otherwise every restart would rehydrate the leak from history.

Bump SNAPSHOT_VERSION so existing on-disk hourly_workers blobs are
discarded and aggregates get rebuilt clean from row replay.

Pyroscope memory profiling was discussed alongside this fix but the
pyroscope-io Python client only ships CPU sampling — not memory.
Leaving it out; tracemalloc/memray would be a separate change.
@webjunkie webjunkie merged commit a09db05 into main May 18, 2026
10 checks passed
@webjunkie webjunkie deleted the fix/phlower-worker-group-hourly-leak branch May 18, 2026 12:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant