From ac91077a84a3fce04a3ad367ca8e0b9d9a31048a Mon Sep 17 00:00:00 2001
From: Sanjay Singh
Date: Tue, 30 Jun 2026 16:09:41 -0700
Subject: [PATCH] Draft blog post: Why Your Autoscaler Is Always Late: HPA, KEDA, and Metric Lag

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 ...26-06-18-autoscaler-metric-lag-hpa-keda.md | 161 ++++++++++++++++++
 assets/img/posts/autoscaler-lag/hero.svg | 93 ++++++++++
 2 files changed, 254 insertions(+)
 create mode 100644 _posts/2026-06-18-autoscaler-metric-lag-hpa-keda.md
 create mode 100644 assets/img/posts/autoscaler-lag/hero.svg

diff --git a/_posts/2026-06-18-autoscaler-metric-lag-hpa-keda.md b/_posts/2026-06-18-autoscaler-metric-lag-hpa-keda.md
new file mode 100644
index 0000000..9082a4d
--- /dev/null
+++ b/_posts/2026-06-18-autoscaler-metric-lag-hpa-keda.md
@@ -0,0 +1,161 @@
+---
+title: "Why Your Autoscaler Is Always Late: HPA, KEDA, and Metric Lag"
+description: "Autoscaling is a feedback control loop, and every loop has delay: scrape interval, sync period, averaging window, scheduling, image pull, and readiness. By the time you add capacity, the spike may be over. Here is where the lag hides and how to scale on a leading signal instead."
+date: 2026-06-18 12:00:00 +0000
+categories: [Distributed Systems, Kubernetes]
+tags: [kubernetes, autoscaling, hpa, keda, scaling, distributed-systems]
+image:
+  path: /assets/img/posts/autoscaler-lag/hero.svg
+  alt: "A traffic spike hitting a service while the autoscaling chain (metric scrape, HPA decision, pod schedule, readiness) lags behind, leaving a gap where requests are dropped before new capacity is ready"
+---
+
+There is a comforting story we tell about autoscaling: traffic goes up, the autoscaler notices, more pods appear, the system absorbs the load. It is a good story. It is also, at the moment that matters most, a lie.
+**The autoscaler is a feedback control loop, and every feedback loop has delay.** By the time your Horizontal Pod Autoscaler has observed the spike, decided to act, scheduled new pods, pulled images, and waited for readiness probes to pass, the burst that triggered it may already be over. You paid for the capacity, your users ate the latency, and the graph looks like the system "handled it" only because you are reading it after the fact.
+
+This is the autoscaling version of a theme I keep returning to: the failure is not in the component, it is in the **time gap between observing reality and reacting to it**. The same gap shows up in [health checking](/2026/01/12/health-checks-client-vs-server-side-lb.html), where a backend is dead for several check intervals before anyone routes around it. Here it shows up as capacity that always arrives a little too late.
+
+## Autoscaling Is a Control Loop, and Control Loops Lag
+
+Start with the mental model that the HPA documentation buries: this is a closed control loop running on a timer. It does not react to events. It wakes up periodically, reads a metric, compares it to a target, and adjusts. Every stage between "load changed" and "new pod serving traffic" adds latency.
+
+Walk the chain end to end:
+
+- **Scrape interval.** The metrics pipeline (typically the metrics-server or a Prometheus adapter) samples pod metrics on an interval, often 15 to 60 seconds. Your spike is invisible until the next scrape.
+- **HPA sync period.** The controller reconciles on its own loop, `--horizontal-pod-autoscaler-sync-period`, 15 seconds by default. It can only act on what the last scrape told it.
+- **Averaging across pods.** Utilization is averaged across all ready pods, so a sharp spike on a few pods is diluted by the calm majority.
+- **Scheduling time.** Once the HPA raises the replica count, the scheduler has to find a node with room.
+- **Image pull and start.** A cold image on a fresh node can take tens of seconds to pull and start.
+- **Readiness.** The pod does not receive traffic until its readiness probe passes, and that cannot happen before the application has started and the probe's initial delay and check period have elapsed.
+
+Add those up and "instant autoscaling" is routinely 60 to 120 seconds from spike to served traffic, on a good day. For a burst that lasts 30 seconds, the capacity shows up after the worst is over.
+
+## HPA Basics and Why CPU Lags
+
+The HPA computes a desired replica count from a strikingly simple formula. For a target utilization, it scales the current replicas by the ratio of current metric to target.
+
+
+
+ + the HPA scaling formula +
+
# desired replicas, rounded up
+desired = ceil( current * ( currentMetric / targetMetric ) )
+
+# example: 4 pods at 90% CPU, target is 50%
+desired = ceil( 4 * ( 90 / 50 ) ) = ceil(7.2) = 8
+
+
+The math is fine. The problem is the **choice of metric**. CPU utilization is a lagging signal: CPU only climbs after the work has already arrived and started competing for the core. For a bursty or IO-bound workload, this is doubly bad. A service that spends most of its time waiting on a database or a downstream call can be completely saturated on concurrency while its CPU sits at 30 percent, so a CPU-targeted HPA never scales it at all. You are measuring the wrong thing, late.
+
+A typical CPU-based HPA looks innocent enough.
+
+
+
+ + hpa-cpu.yaml +
+
apiVersion: autoscaling/v2
+kind: HorizontalPodAutoscaler
+metadata: { name: checkout, namespace: shop }
+spec:
+  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: checkout }
+  minReplicas: 4
+  maxReplicas: 40
+  metrics:
+    - type: Resource
+      resource:
+        name: cpu
+        target: { type: Utilization, averageUtilization: 50 }
+# CPU is lagging and averaged: a 5s burst is diluted and arrives late
+
+
+## KEDA and Scaling on the Queue, Not the Symptom
+
+The fix for a lagging signal is to scale on a **leading** one: something that rises before your pods feel pain. For most asynchronous and event-driven systems, that signal is queue depth. If the backlog in Kafka or a message queue is growing, you need more consumers, full stop, and you knew it the instant the producer outran the consumer, long before CPU moved.
+
+[KEDA](https://keda.sh/) (Kubernetes Event-Driven Autoscaling) exists for exactly this. It is an operator that drives an HPA from external event sources: queue length, stream lag, the rate of HTTP requests, a Prometheus query, dozens of scalers. You point it at the backlog and it scales on the cause rather than the symptom.
+
+
+
+ + keda-scaledobject.yaml +
+
apiVersion: keda.sh/v1alpha1
+kind: ScaledObject
+metadata: { name: order-worker, namespace: shop }
+spec:
+  scaleTargetRef: { name: order-worker }
+  minReplicaCount: 0      # scale to zero when the queue is empty
+  maxReplicaCount: 100
+  triggers:
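+    - type: kafka
+      metadata:
+        bootstrapServers: kafka.example.internal:9092   # placeholder address; the Kafka scaler requires one
+        topic: orders
+        consumerGroup: order-workers
+        lagThreshold: "50"   # target of 50 messages of lag per pod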
+
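KEDA is itself a polling loop, so it contributes its own slice of lag between the backlog growing and the replica count changing. The sketch below is the same ScaledObject with the timing knobs written out. It assumes the Kafka scaler; the broker address is a placeholder, and the interval values are illustrative rather than recommendations.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata: { name: order-worker, namespace: shop }
spec:
  scaleTargetRef: { name: order-worker }
  pollingInterval: 10     # seconds between trigger checks (KEDA defaults to 30)
  cooldownPeriod: 300     # seconds with no activity before scaling back to zero (default 300)
  minReplicaCount: 0
  maxReplicaCount: 100
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka.example.internal:9092   # placeholder address (assumption)
        topic: orders
        consumerGroup: order-workers
        lagThreshold: "50"
```

Once at least one replica is running, the HPA that KEDA manages does the scaling on its usual sync period, so `pollingInterval` matters most for the wake-up from zero.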
+
+KEDA's other headline feature is **scale to zero**: when the queue is empty, it removes every replica and you pay nothing. The catch is the mirror image of the lag problem. The first message after an idle period hits an empty deployment, and now the full cold-start chain (schedule, pull, start, ready) sits directly in that request's latency. **Scale to zero trades steady-state cost for tail latency on the first request.** It is excellent for batch and background work, and a trap for anything latency-sensitive without a warm pool kept aside.
+
+## VPA, the Cluster Autoscaler, and Layered Lag
+
+So far the lag has been one layer deep: scale pods on an existing node. Two other autoscalers change the picture, and one of them makes the lag dramatically worse.
+
+**The Vertical Pod Autoscaler** adjusts the CPU and memory *requests* of a pod rather than the replica count. It is the right tool for a singleton that cannot be sharded, but note that, in its common mode, changing requests means recreating the pod, which is itself a disruption. VPA and HPA on the same CPU metric also fight each other, so they are not a free combination.
+
+**The Cluster Autoscaler** is the one that hurts. When the HPA asks for more pods and no node has room, the new pods sit `Pending` until the cluster autoscaler notices, requests a node from the cloud provider, waits for it to boot and join, and only then can the scheduler place the pods. You have now stacked a second control loop on top of the first.
+
+
+
+ + the layered lag, worst case +
+
spike            t+0s
+metric scraped   t+0..15s    # wait for next sample
+hpa decides      t+15..30s   # wait for sync period
+pods Pending     t+30s       # no room on existing nodes
+node requested   t+30..60s   # cluster autoscaler reacts
+node Ready       t+60..150s  # cloud provisions and joins
+pods scheduled   t+150s      # scheduler places them
+image pulled     t+150..180s
+pod Ready        t+180..210s # readiness probe passes -> serving
+
+
+Three minutes from spike to served traffic when you have to add nodes is not an exotic worst case. It is the ordinary case for a cluster running near capacity. Anyone who has lived through [the silent failure modes of distributed systems](/2026/02/18/dns-the-silent-killer-of-distributed-systems.html) will recognize the shape: each layer is individually reasonable, and stacked together they produce a window where the system simply cannot keep up.
+
+## Tactics for the Gap
+
+You cannot remove the lag entirely, because the physics of scheduling and readiness are real. What you can do is shrink it, scale on better signals, and have a plan for the window that remains.
+
+**Scale on a leading signal.** Prefer queue depth, in-flight requests, or RPS over CPU. These rise with the cause, not after the symptom. For request-driven services this means a custom or external metric (RPS per pod via the Prometheus adapter, or KEDA's HTTP and Prometheus scalers) rather than the default CPU target.
+
+**Tune the stabilization windows deliberately.** The HPA's behavior block lets you scale up fast and scale down slow. A short scale-up window reacts to bursts; a long scale-down window prevents flapping when the burst passes.
+
+
+
+ + hpa behavior: up fast, down slow +
+
  behavior:
+    scaleUp:
+      stabilizationWindowSeconds: 0     # react immediately to bursts
+      policies:
+        - { type: Percent, value: 100, periodSeconds: 15 }  # double per 15s
+    scaleDown:
+      stabilizationWindowSeconds: 300   # wait 5m before shrinking
+
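The behavior fragment sits under `spec`, next to `metrics`. Combined with the leading-signal tactic, an HPA for a request-driven service might look like the sketch below. It assumes a metrics adapter (for example the Prometheus adapter) already exposes a per-pod request rate. The metric name `http_requests_per_second` and the target of 100 are hypothetical; the real values depend on your adapter rules and load tests.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: checkout, namespace: shop }
spec:
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: checkout }
  minReplicas: 4
  maxReplicas: 40
  metrics:
    - type: Pods
      pods:
        metric: { name: http_requests_per_second }            # hypothetical metric name
        target: { type: AverageValue, averageValue: "100" }   # assumed per-pod RPS target
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0     # act on the first reading above target
      policies:
        - { type: Percent, value: 100, periodSeconds: 15 }
    scaleDown:
      stabilizationWindowSeconds: 300   # hold capacity for 5m after the burst
```

Request rate moves as soon as traffic arrives, so the ratio in the HPA formula changes on the first scrape after the spike instead of waiting for CPU to climb.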
+
+**Keep headroom and over-provision.** The cheapest way to beat node-provisioning lag is to never hit the node-provisioning path. Run low-priority "pause" pods that hold spare capacity and get evicted the instant a real pod needs the room, so the scheduler always has a warm node waiting. You are buying back the worst slice of that three-minute chain.
+
+**Pre-scale for known patterns.** Most spikes are not surprises. A daily traffic curve, a marketing email, a scheduled batch job: scale on a cron or a calendar ahead of the event, not on the metric during it. The autoscaler is for the unexpected; the predictable load should already have capacity in place.
+
+**Pair autoscaling with load shedding.** This is the one most teams skip, and it is the only thing that protects you *during* the gap. No matter how well tuned, there is a window where demand exceeds capacity and new pods are not ready yet. In that window you must shed: return a fast 429, drop the lowest-priority requests, and protect the requests you can actually serve. Autoscaling closes the gap over the next minute; load shedding keeps the service alive in the meantime. The two are partners, not alternatives. Which requests you shed and how you balance the survivors connects directly to the [load balancing algorithm](/2026/03/02/load-balancing-algorithms.html) sitting in front of them, and to the [health-aware routing](/2026/01/12/health-checks-client-vs-server-side-lb.html) that should already be steering traffic away from the pods still warming up.
+
+It is worth remembering where those new pods even become reachable. A freshly scheduled replica is invisible until readiness gates it into the [service's live endpoint registry](/2026/06/30/service-discovery-in-kubernetes.html), which is the same readiness delay that sits at the tail of every lag calculation above. Capacity is not capacity until discovery and load balancing agree it can take traffic.
+
+
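The headroom tactic above is usually built from two objects: a PriorityClass below the default priority of zero, and a Deployment of pause containers sized to the spare capacity you want. A minimal sketch follows. The names `overprovisioning` and `capacity-reservation`, the replica count, and the resource requests are all assumptions; size them to the largest burst you expect to absorb without a new node.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata: { name: overprovisioning }        # hypothetical name
value: -10                                  # below the default pod priority of 0
globalDefault: false
description: "Placeholder pods that any real workload may preempt."
---
apiVersion: apps/v1
kind: Deployment
metadata: { name: capacity-reservation, namespace: shop }   # hypothetical name
spec:
  replicas: 3                               # illustrative amount of headroom
  selector:
    matchLabels: { app: capacity-reservation }
  template:
    metadata:
      labels: { app: capacity-reservation }
    spec:
      priorityClassName: overprovisioning
      terminationGracePeriodSeconds: 0      # give the room back immediately
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests: { cpu: "1", memory: 2Gi }   # reserved per placeholder (assumed)
```

When a real pod needs the room, the scheduler preempts a placeholder. The placeholder then goes `Pending`, which prompts the cluster autoscaler to add a node in the background while the real pod is already starting.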

An autoscaler does not prevent overload. It recovers from one, on a delay you do not control. Design for the delay: scale on the cause, keep a node warm, and shed what you cannot yet serve.

+
+---
+
+*This continues my writing on the time gaps inside distributed systems, alongside [health checking in client vs server-side load balancing](/2026/01/12/health-checks-client-vs-server-side-lb.html), [load balancing algorithms](/2026/03/02/load-balancing-algorithms.html), and [service discovery in Kubernetes](/2026/06/30/service-discovery-in-kubernetes.html).*
+
+*Tuning autoscaling against real traffic and tired of capacity that arrives a minute late? I am on [LinkedIn](https://www.linkedin.com/in/singhsanjay12) or reachable by [email](mailto:hello@singh-sanjay.com).*
diff --git a/assets/img/posts/autoscaler-lag/hero.svg b/assets/img/posts/autoscaler-lag/hero.svg
new file mode 100644
index 0000000..f484f46
--- /dev/null
+++ b/assets/img/posts/autoscaler-lag/hero.svg
@@ -0,0 +1,93 @@
+[SVG markup not recoverable; only text nodes survive. Title: "The autoscaler is always late". Description: "A traffic spike outruns the scale-up chain, and capacity arrives after the gap". Panel "demand vs ready capacity": labels time, gap, demand, ready capacity. Panel "the lag chain": scrape interval; HPA sync + averaging; schedule + image pull; readiness probe passes. Stage labels: metric scrape, HPA decision, pod schedule, pod ready, shed the gap (429).]