fix(phlower): collapse hourly_workers Counter to worker group by webjunkie · Pull Request #28 · PostHog/phlower

webjunkie · 2026-05-18T12:20:33Z

TaskAggregate.hourly_workers was keyed on the full Celery hostname, which under K8s embeds a per-pod hash. With rolling deploys churning pod names, each hourly Counter accumulated hundreds of unique string keys that never collapsed back down. With aggregate_retention_hours=168 (7d), the structure grew linearly for ~5 days until the pod hit its memory limit and got OOMKilled — then started over.

Fix: aggregate by worker_group (the stable name derived in events.py and already stored on InvocationRecord) instead of the raw hostname. The Counter now has a tiny bounded set of keys. SQLite invocation rows still carry the full hostname so per-invocation drill-in is unchanged.

The recovery path was leaking the same way — _flush_counts was replaying the SQLite worker column verbatim. Group resolution now happens in _load_counts while building the batch, so the regex runs outside the store lock instead of blocking live ingest.

SNAPSHOT_VERSION bumped to 2 so existing on-disk hourly_workers blobs are discarded and aggregates rebuild clean from row replay.

Test plan

Build image, deploy, watch RSS curve over 24–48h — should plateau instead of climbing ~1 GB/day
Confirm task detail page still shows top worker groups (now collapsed by group, no longer per-pod hash)

…ng memory leak TaskAggregate.hourly_workers was keyed on the full Celery hostname, which in K8s embeds a per-pod hash. Rolling deploys and karpenter scaling churn pod names constantly, so each hourly Counter accumulated hundreds of unique string keys that never collapsed back down. With aggregate_retention_hours=168 the structure grew linearly for ~5 days before tripping the 6 GB pod limit — matches the observed OOM cadence. Switch the aggregate keys to worker_group (already derived in events.py and stored on InvocationRecord) so the Counter has a tiny, stable set of keys instead. SQLite invocation rows still carry the full hostname for per-invocation drill-in; only the aggregate collapses. Recovery path was leaking the same way — _flush_counts replays the SQLite worker column verbatim. Apply extract_worker_group there too, otherwise every restart would rehydrate the leak from history. Bump SNAPSHOT_VERSION so existing on-disk hourly_workers blobs are discarded and aggregates get rebuilt clean from row replay. Pyroscope memory profiling was discussed alongside this fix but the pyroscope-io Python client only ships CPU sampling — not memory. Leaving it out; tracemalloc/memray would be a separate change.

webjunkie merged commit a09db05 into main May 18, 2026
10 checks passed

webjunkie deleted the fix/phlower-worker-group-hourly-leak branch May 18, 2026 12:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(phlower): collapse hourly_workers Counter to worker group#28

fix(phlower): collapse hourly_workers Counter to worker group#28
webjunkie merged 1 commit into
mainfrom
fix/phlower-worker-group-hourly-leak

webjunkie commented May 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

webjunkie commented May 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant