Add blog on llm-d benchmarking on heterogeneous cluster #326
Signed-off-by: Pravein Govindan Kannan <pravein.govindan.kannan@ibm.com>
chcost left a comment
Technical review: checked graphs against text claims, cross-referenced performance numbers with public benchmarks, and inspected for consistency.
| ### Single-vendor pools — granite-4.1-8b | ||
| We start with the per-vendor baselines so the heterogeneous results below have a reference point. All runs use ~7.2K ISL + 1K OSL. |
Text-graph discrepancy. The graph for AMD MI325X sarvam-30b shows llm-d peaking at approximately 25K output tok/s, not 29K. The 85% improvement figure also doesn't hold if the baseline is 17K (25/17 ≈ 1.47, i.e. +47%). Please reconcile the numbers in this row with what the graph actually shows: either the graph axis labels are off, or the text overstates the result by ~15%.
Same issue appears in the body text on line 93.
Actually, the peak is about 28.6K, to be precise. I think it's confusing because the graph's axis stops at 25K. I will extend it to 30K so that it's clear, and keep the comparison precise instead of rounding it.
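As a quick sanity check on the figures discussed in this thread, the relative gains can be recomputed directly. This is a minimal sketch, not taken from the PR: the 17K baseline comes from the table row under review, the three candidate peaks (25K read off the graph, 28.6K from the reply, 29K from the table) come from the comments above, and the helper name `pct_gain` is hypothetical.

```python
def pct_gain(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new / baseline - 1.0) * 100.0


# Assumed baseline: k8s round-robin output tok/s (thousands), from the table row.
BASELINE_K = 17.0

# Candidate llm-d peaks (thousands of output tok/s) mentioned in this thread.
candidates = {
    "graph reading": 25.0,
    "author's precise value": 28.6,
    "table value": 29.0,
}

gains = {
    label: round(pct_gain(peak, BASELINE_K), 1)
    for label, peak in candidates.items()
}

for label, gain in gains.items():
    print(f"{label}: +{gain}% over {BASELINE_K}K")
```

If 17K really is the baseline, none of the three peaks yields +85%, so the percentage and the baseline are worth rechecking together when the row is updated.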
| | NVIDIA-only | 4 H100-NVL | sarvam-30b | 2× | 22× | | ||
| | AMD-only | 8 MI325X | granite-4.1-8b | +79% | 21× | | ||
| | AMD-only | 8 MI325X | sarvam-30b | +85% (29 K vs 17 K out tok/s) | 5× | | ||
| | Gaudi-only | 8 Gaudi3 | granite-4.1-8b | +34% | 18× | |
Inconsistent routing modes across experiments. The text here says "precise prefix-cache-aware routing" was used universally, but the graph legends tell a different story:
- Granite on NVIDIA/AMD: `llm-d (precise-prefix-cache)`
- Sarvam on NVIDIA/AMD: `llm-d (approximate prefix-cache)` ← different
- Granite on Gaudi3 and 3-vendor: `llm-d (prefix-cache-aware EPP)` ← also different
Three different labels across eight graphs. Were different routing algorithms actually used? If so, the text should explain why (e.g., approximate matching for MoE models, different EPP config for Gaudi). If not, the graph legends should be unified. Readers will notice.
For Sarvam, we used approximate prefix-cache routing, because the precise-prefix-cache config did not run: trust-remote-code is not enabled in llm-d-router v0.8.0. I will state that clearly and fix the labels accordingly.
| ### Single-vendor pools — sarvam-30b (multilingual MoE) | ||
| **4× NVIDIA H100-NVL.** llm-d delivers 2× the throughput and 22× better TTFT. k8s saturates around rate 25–30; llm-d keeps scaling. | ||
k8s saturation point seems earlier than stated. The graph shows k8s TTFT jumping sharply between rate 15–20 (from ~0.2s to ~1.3s), suggesting saturation starts closer to rate 15–20, not 25–30 as stated here. Worth double-checking against the raw data.
| **NVIDIA + AMD + Gaudi (20 pods, granite-4.1-8b).** The 20-pod 3-vendor pool delivers **14.2 K out tok/s peak with llm-d vs 9.6 K with k8s round-robin**. k8s saturates at rate 25 and *declines* to 7.5 K at rate 85 (queue depth dominates) — llm-d delivers **+91% throughput at the same load**. TTFT at rate 85: llm-d 6.8 s, k8s 36.4 s (**5.4× better**). | ||
|  |
3-vendor pool likely not fully saturated. Individual pools saturate at ~5–6 QPS/pod, so 20 pods would need ~100–120 QPS to reach peak throughput. This test only went to 85 QPS. The 14.2K peak is real, but it's probably not the pool's ceiling — meaning the actual throughput advantage of llm-d on a 3-vendor fleet could be even larger. Consider either (a) noting this explicitly, or (b) testing at higher QPS to find the true peak.
Also: the additive theoretical max from individual pools is ~25.7K (5.7K + 15K + 5K). At 85 QPS, llm-d achieves 55% of that — reasonable for an under-loaded pool, but the gap might raise questions without context.
We stopped at 85 QPS because we started seeing more failures. The cause is the standard EPP configuration (18 CPUs) that we used throughout the tests. A higher CPU limit or horizontal scaling would have helped scale further toward the theoretical limit. Unfortunately, the testbed is no longer available, so I will mention this limitation.
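The saturation argument in this thread is plain arithmetic, so it can be written down as a short check. This is a sketch under the reviewer's stated assumptions (5 to 6 QPS per pod at saturation, per-pool peaks of 5.7K, 15K and 5K); all constant names are hypothetical and the inputs are only the numbers quoted above.

```python
PODS = 20
PER_POD_SATURATION_QPS = (5, 6)          # reviewer's estimate from single-vendor pools
TESTED_MAX_QPS = 85                      # highest rate reached in the 3-vendor run
SINGLE_POOL_PEAKS_K = (5.7, 15.0, 5.0)   # per-pool peaks quoted in the review (K out tok/s)
LLMD_PEAK_K = 14.2                       # 3-vendor llm-d peak from the post

# Load needed to saturate all 20 pods, as a (low, high) range.
qps_needed = tuple(PODS * q for q in PER_POD_SATURATION_QPS)

# Naive additive ceiling and the share of it reached at the tested load.
additive_max_k = round(sum(SINGLE_POOL_PEAKS_K), 1)
fraction_of_max_pct = round(LLMD_PEAK_K / additive_max_k * 100)

# TTFT ratio at rate 85 (k8s 36.4 s vs llm-d 6.8 s), as quoted in the post.
ttft_ratio = round(36.4 / 6.8, 1)

print(f"QPS needed to saturate: {qps_needed[0]}-{qps_needed[1]} (tested up to {TESTED_MAX_QPS})")
print(f"Additive ceiling: {additive_max_k}K, reached: {fraction_of_max_pct}%")
print(f"TTFT ratio at rate 85: {ttft_ratio}x")
```

The additive ceiling is only a rough upper bound: it assumes the three pools scale independently and that the router itself is not the bottleneck, which the reply above suggests was not the case at 18 CPUs.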
| ## Prefix-aware caching | ||
| We deployed llm-d v0.0.7 with [precise prefix-cache-aware routing](https://github.com/llm-d/llm-d/tree/main/guides/precise-prefix-cache-routing). Each vendor's pods are deployed as a separate Helm release in the same namespace; only the `nodeSelector` and a small set of vendor-specific tuning flags (e.g. Gaudi's `--block-size 128`, `--max-num-seqs 256`, `VLLM_BUILD` pin) vary between releases. All pods carry the same selector labels and register with a single InferencePool maintained by llm-d's router. For the baseline, we use a ClusterIP service over the same set of pods to drive plain Kubernetes round-robin scheduling — **same pods, same vLLM, same flags; only the routing layer differs.** | ||
Fair characterization, but it is worth being more explicit that the shared_prefix_synthetic workload represents the best case for prefix-cache routing: all requests share the same long prefix, so the cache hit rate approaches 100% under llm-d. The TTFT improvements (16–22×) are specific to this high-cache-hit regime. Production workloads with diverse prefixes or no shared system prompt would see significantly smaller gains.
A sentence like "Results are strongest for workloads with high prefix reuse; gains vary with prefix diversity" would preempt the obvious follow-up question.
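To illustrate why gains depend on prefix diversity, here is a toy simulation. It is a deliberately simplified sketch and not llm-d's scheduler or inference-perf's generator: each pod holds a small LRU cache of whole prefixes, "prefix-aware" routing is modelled as a fixed prefix-to-pod mapping, and the pod count, cache size and group counts are all made-up parameters.

```python
import random
from collections import OrderedDict


def hit_rate(num_groups: int, prefix_aware: bool, num_pods: int = 4,
             cache_slots: int = 4, num_requests: int = 2000, seed: int = 0) -> float:
    """Fraction of requests whose prefix is already cached on the chosen pod."""
    rng = random.Random(seed)
    caches = [OrderedDict() for _ in range(num_pods)]  # one LRU cache per pod
    hits = 0
    for i in range(num_requests):
        group = rng.randrange(num_groups)  # which shared prefix this request uses
        # Prefix-aware: same prefix always lands on the same pod.
        # Round-robin: pod depends only on arrival order.
        pod = group % num_pods if prefix_aware else i % num_pods
        cache = caches[pod]
        if group in cache:
            hits += 1
            cache.move_to_end(group)       # mark as most recently used
        else:
            cache[group] = None
            if len(cache) > cache_slots:
                cache.popitem(last=False)  # evict least recently used prefix
    return hits / num_requests


for groups in (16, 64, 400):
    rr = hit_rate(groups, prefix_aware=False)
    aware = hit_rate(groups, prefix_aware=True)
    print(f"{groups:>3} prefix groups: round-robin {rr:.2f}, prefix-aware {aware:.2f}")
```

In this toy model the prefix-aware hit rate is near 1.0 while the distinct prefixes fit in the fleet's combined cache, and it falls off as the number of prefixes grows, which is the qualitative point the suggested sentence makes.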
| ## Prefix-aware caching | ||
| We deployed llm-d v0.0.7 with [precise prefix-cache-aware routing](https://github.com/llm-d/llm-d/tree/main/guides/precise-prefix-cache-routing). Each vendor's pods are deployed as a separate Helm release in the same namespace; only the `nodeSelector` and a small set of vendor-specific tuning flags (e.g. Gaudi's `--block-size 128`, `--max-num-seqs 256`, `VLLM_BUILD` pin) vary between releases. All pods carry the same selector labels and register with a single InferencePool maintained by llm-d's router. For the baseline, we use a ClusterIP service over the same set of pods to drive plain Kubernetes round-robin scheduling — **same pods, same vLLM, same flags; only the routing layer differs.** |
Granite-4.1-8b is a hybrid-Mamba transformer. Mamba layers use state-space models, not attention, so they don't have traditional KV caches. How does prefix caching interact with the Mamba components? This is a technically interesting detail that readers familiar with the architecture will wonder about. Even a brief note (e.g., "prefix caching applies to the transformer layers; Mamba state is recomputed") would add credibility.
Good catch @chcost. Granite-4.1-8b is not a hybrid-Mamba transformer; it's a dense model. I think the description was left over from its predecessor, which was a hybrid-Mamba model (the model we started benchmarking with). I will change it accordingly.
| * `ibm-granite/granite-4.1-8b` — 8 B parameter, hybrid-Mamba transformer | ||
| * `sarvamai/sarvam-30b` — 30 B MoE, Indic-multilingual model with custom vLLM kernels | ||
| The workload is the prefill-heavy `shared_prefix_synthetic` from [inference-perf](https://github.com/kubernetes-sigs/inference-perf): a long shared system prompt + short question + decode-tolerant output (~7.2K input tokens + 1K output tokens). This matches production RAG, chat, and citizen-services traffic profiles where prefix-cache routing has the most room to win. |
Nit: "2 nodes × 2 GPUs" is slightly ambiguous. Does this mean 2 servers each with one H100-NVL pair (which is a 2-GPU package with an NVLink bridge), or 2 servers each with 2 discrete H100 GPUs? Since TP=1 is used, the distinction is mainly about memory per GPU: NVL halves have 94 GB HBM3 each, vs 80 GB for standard H100 SXM. Clarifying helps reproducibility.
|  | ||
| We were unable to run sarvam-30b on Intel Gaudi3 due to software compatibility issues, but plan to work with the llm-d community to bridge this gap in the future. | ||
Transparent and appreciated. Would be even stronger to briefly note what the incompatibility was (e.g., vLLM-HPU fork missing support for sarvam's custom kernels? Memory constraints? Kernel mismatch?) — helps readers evaluate whether this is a temporary gap or a deeper limitation.
I have now stated clearly that this was due to a vLLM version mismatch.
| tags: [blog, inference, scheduling, kv-cache, sig-benchmarking] | ||
| --- | ||
| # High-volume inference on a three-vendor sovereign cluster | ||
Title feedback: "High-volume" might set expectations for a large-scale deployment, but the actual setup is 20 GPUs across 4 nodes. The core contribution is proving heterogeneous multi-vendor inference works with intelligent routing, not raw scale. A title that foregrounds the multi-vendor angle might land better:
- "Intelligent inference routing across a three-vendor sovereign cluster"
- "Heterogeneous inference serving across three GPU vendors with llm-d"
Or if "high-volume" refers to the workload (long-context, high-throughput), make that clearer.
Makes sense, Carlos. I will change it to "Heterogeneous inference serving across three GPU vendors with llm-d" to highlight the heterogeneity angle.
| @@ -0,0 +1,122 @@ | |||
| --- | |||
| title: "High-volume inference on a three-vendor sovereign cluster" | |||
| description: "Benchmarking llm-d's prefix-cache-aware routing and prefill/decode disaggregation across NVIDIA H100-NVL, AMD MI325X, and Intel Gaudi3 pools on the NxtGen sovereign cloud — single-vendor and heterogeneous, with up to +91% throughput and 5.4× better TTFT vs plain Kubernetes round-robin." | |||
Description mentions P/D disaggregation, but the post does not cover it. The description says "prefix-cache-aware routing and prefill/decode disaggregation" but the blog only benchmarks prefix-cache routing. P/D disaggregation appears only in the "What's next" section as future work. There is also an unused pd-sarvam.png image in the PR assets that is not referenced in the markdown — looks like a P/D section was drafted and then removed, but the description and image were not cleaned up.
Either add the P/D results back, or update the description to match what the post actually covers.
I will remove the reference to P/D. I think it was left in after we removed the P/D results.
What does this PR do?
Add blog contents for "High-volume inference on a three-vendor sovereign cluster"