Add blog on llm-d benchmarking on heterogenous cluster by praveingk · Pull Request #326 · llm-d/llm-d.github.io

praveingk · 2026-05-28T05:12:04Z

What does this PR do?

Add blog contents for "High-volume inference on a three-vendor sovereign cluster"

A few things to note:

The date in the blog file is set to 05/29; it should be updated to the actual publication date once the post is ready to publish.

Why is this change needed?

How was this tested?

Tests added/updated (npm test)
Site builds successfully (npm run build)
Manual testing performed (npm start)

Checklist

Commits are signed off (git commit -s) per DCO
Code follows project contributing guidelines
Tests pass locally (npm test)
Site builds without errors (npm run build)
Documentation updated (if applicable)

Related Issues

Signed-off-by: Pravein Govindan Kannan <pravein.govindan.kannan@ibm.com>

netlify · 2026-05-28T05:12:09Z

✅ Deploy Preview for elaborate-kangaroo-25e1ee ready!

Name	Link
🔨 Latest commit	`696d574`
🔍 Latest deploy log	https://app.netlify.com/projects/elaborate-kangaroo-25e1ee/deploys/6a21125fb6254e00085a83ca
😎 Deploy Preview	https://deploy-preview-326--elaborate-kangaroo-25e1ee.netlify.app

chcost

Technical review: checked graphs against text claims, cross-referenced performance numbers with public benchmarks, and inspected for consistency.

chcost · 2026-06-03T19:30:20Z

+
+### Single-vendor pools — granite-4.1-8b
+
+We start with the per-vendor baselines so the heterogeneous results below have a reference point. All runs use ~7.2K ISL + 1K OSL.


Text-graph discrepancy. The graph for AMD MI325X sarvam-30b shows llm-d peaking at approximately 25K output tok/s, not 29K. The 85% improvement figure also doesn't hold if the baseline is 17K (25/17 ≈ 1.47, i.e. about +47%). Please reconcile the numbers in this row with what the graph actually shows: either the graph axis labels are off, or the text overstates the result by ~15%.

Same issue appears in the body text on line 93.

Actually, the peak is about 28.6K, to be precise. I think the confusion arises because the graph's axis stops at 25K. I will extend it to 30K so that it's clear, and keep the comparison precise instead of rounding it.
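For reference, the percentage check in this thread can be reproduced with a few lines of Python. This is only a sketch: the function name `relative_gain_pct` is made up for illustration, and the 17K baseline is the rounded value quoted in the table, which may differ from the raw measurement.

```python
def relative_gain_pct(new: float, baseline: float) -> float:
    """Return the percentage improvement of `new` over `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (new / baseline - 1.0) * 100.0


# Figures quoted in this thread, in thousands of output tok/s.
# The 17K baseline is the rounded table value (an assumption, not raw data).
BASELINE_K = 17.0
PEAKS_K = {
    "peak read off the graph": 25.0,
    "peak stated in the table": 29.0,
    "peak per the author's reply": 28.6,
}

for label, peak_k in PEAKS_K.items():
    gain = relative_gain_pct(peak_k, BASELINE_K)
    print(f"{label}: {peak_k}K vs {BASELINE_K}K -> +{gain:.0f}%")
```

With the rounded 17K baseline, none of the three peaks yields +85%, so the unrounded baseline may be lower than 17K, or the percentage may need rechecking against the raw data.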

chcost · 2026-06-03T19:30:20Z

+| NVIDIA-only | 4 H100-NVL | sarvam-30b | 2× | 22× |
+| AMD-only | 8 MI325X | granite-4.1-8b | +79% | 21× |
+| AMD-only | 8 MI325X | sarvam-30b | +85% (29 K vs 17 K out tok/s) | 5× |
+| Gaudi-only | 8 Gaudi3 | granite-4.1-8b | +34% | 18× |


Inconsistent routing modes across experiments. The text here says "precise prefix-cache-aware routing" was used universally, but the graph legends tell a different story:

Granite on NVIDIA/AMD: llm-d (precise-prefix-cache)

Sarvam on NVIDIA/AMD: llm-d (approximate prefix-cache) ← different

Granite on Gaudi3 and 3-vendor: llm-d (prefix-cache-aware EPP) ← also different

Three different labels across eight graphs. Were different routing algorithms actually used? If so, the text should explain why (e.g., approximate matching for MoE models, different EPP config for Gaudi). If not, the graph legends should be unified. Readers will notice.

For Sarvam, we used approximate prefix-cache routing, because the precise-prefix-cache config did not run: trust-remote-code is not enabled in llm-d-router v0.8.0. I will state that clearly and fix the labels accordingly.

chcost · 2026-06-03T19:30:20Z

+### Single-vendor pools — sarvam-30b (multilingual MoE)
+
+**4× NVIDIA H100-NVL.** llm-d delivers 2× the throughput and 22× better TTFT. k8s saturates around rate 25–30; llm-d keeps scaling.
+


k8s saturation point seems earlier than stated. The graph shows k8s TTFT jumping sharply between rate 15–20 (from ~0.2s to ~1.3s), suggesting saturation starts closer to rate 15–20, not 25–30 as stated here. Worth double-checking against the raw data.

chcost · 2026-06-03T19:30:20Z

+
+**NVIDIA + AMD + Gaudi (20 pods, granite-4.1-8b).** The 20-pod 3-vendor pool delivers **14.2 K out tok/s peak with llm-d vs 9.6 K with k8s round-robin**. k8s saturates at rate 25 and *declines* to 7.5 K at rate 85 (queue depth dominates) — llm-d delivers **+91% throughput at the same load**. TTFT at rate 85: llm-d 6.8 s, k8s 36.4 s (**5.4× better**).
+
+![3-vendor (NVIDIA + AMD + Gaudi) granite-4.1-8b: llm-d vs k8s round-robin](/img/blogs/heterogeneous-3vendor/3vendor-granite.png)


3-vendor pool likely not fully saturated. Individual pools saturate at ~5–6 QPS/pod, so 20 pods would need ~100–120 QPS to reach peak throughput. This test only went to 85 QPS. The 14.2K peak is real, but it's probably not the pool's ceiling — meaning the actual throughput advantage of llm-d on a 3-vendor fleet could be even larger. Consider either (a) noting this explicitly, or (b) testing at higher QPS to find the true peak.

Also: the additive theoretical max from individual pools is ~25.7K (5.7K + 15K + 5K). At 85 QPS, llm-d achieves 55% of that — reasonable for an under-loaded pool, but the gap might raise questions without context.

We stopped at 85 QPS because we started seeing more failures. This is due to the standard EPP configuration (18 CPUs) that we used throughout all of the tests. A higher CPU limit or horizontal scaling would have helped scale further toward the theoretical limit. Unfortunately, the testbed is no longer available, so I will mention this limitation.
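The back-of-the-envelope estimate in this thread can be written out as follows. This is a minimal sketch, assuming the per-pod saturation range (5 to 6 QPS) and the per-pool peaks (5.7K, 15K and 5K out tok/s) quoted above are representative; the function names are illustrative and not part of any tool used in the benchmark.

```python
def saturating_load_qps(pods: int, qps_per_pod: tuple[int, int]) -> tuple[int, int]:
    """Offered-load range (QPS) expected to saturate `pods` identical pods."""
    low, high = qps_per_pod
    return pods * low, pods * high


def fraction_of_additive_peak(measured_k: float, pool_peaks_k: list[float]) -> float:
    """Measured throughput as a fraction of the summed single-pool peaks."""
    return measured_k / sum(pool_peaks_k)


# 20 pods, each assumed to saturate at roughly 5 to 6 QPS (reviewer's estimate).
low, high = saturating_load_qps(pods=20, qps_per_pod=(5, 6))

# Single-pool peaks quoted in the review, in thousands of output tok/s.
share = fraction_of_additive_peak(14.2, [5.7, 15.0, 5.0])

print(f"load needed to saturate: {low}-{high} QPS (test stopped at 85)")
print(f"llm-d at 85 QPS reaches {share:.0%} of the additive peak")
```

The additive peak is only an upper bound: it assumes each pool contributes its standalone maximum when mixed, which the EPP CPU limit mentioned above may prevent.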

chcost · 2026-06-03T19:30:20Z

+## Prefix-aware caching
+
+We deployed llm-d v0.0.7 with [precise prefix-cache-aware routing](https://github.com/llm-d/llm-d/tree/main/guides/precise-prefix-cache-routing). Each vendor's pods are deployed as a separate Helm release in the same namespace; only the `nodeSelector` and a small set of vendor-specific tuning flags (e.g. Gaudi's `--block-size 128`, `--max-num-seqs 256`, `VLLM_BUILD` pin) vary between releases. All pods carry the same selector labels and register with a single InferencePool maintained by llm-d's router. For the baseline, we use a ClusterIP service over the same set of pods to drive plain Kubernetes round-robin scheduling — **same pods, same vLLM, same flags; only the routing layer differs.**
+


Fair characterization, but worth being more explicit that the shared_prefix_synthetic workload represents the best case for prefix-cache routing: all requests share the same long prefix, so cache hit rate approaches 100% under llm-d. The TTFT improvements (16–22×) are specific to this high-cache-hit regime. Production workloads with diverse prefixes or no shared system prompt would see significantly smaller gains.

A sentence like "Results are strongest for workloads with high prefix reuse; gains vary with prefix diversity" would preempt the obvious follow-up question.

chcost · 2026-06-03T19:30:20Z

+
+## Prefix-aware caching
+
+We deployed llm-d v0.0.7 with [precise prefix-cache-aware routing](https://github.com/llm-d/llm-d/tree/main/guides/precise-prefix-cache-routing). Each vendor's pods are deployed as a separate Helm release in the same namespace; only the `nodeSelector` and a small set of vendor-specific tuning flags (e.g. Gaudi's `--block-size 128`, `--max-num-seqs 256`, `VLLM_BUILD` pin) vary between releases. All pods carry the same selector labels and register with a single InferencePool maintained by llm-d's router. For the baseline, we use a ClusterIP service over the same set of pods to drive plain Kubernetes round-robin scheduling — **same pods, same vLLM, same flags; only the routing layer differs.**


Granite-4.1-8b is a hybrid-Mamba transformer. Mamba layers use state-space models, not attention, so they don't have traditional KV caches. How does prefix caching interact with the Mamba components? This is a technically interesting detail that readers familiar with the architecture will wonder about. Even a brief note (e.g., "prefix caching applies to the transformer layers; Mamba state is recomputed") would add credibility.

Good catch @chcost. Granite-4.1-8b is not a hybrid-Mamba transformer; it's a dense model. I think the description was left over from its predecessor, which was a hybrid-Mamba model (the model we started benchmarking with). I will change it accordingly.

chcost · 2026-06-03T19:30:20Z

+* `ibm-granite/granite-4.1-8b` — 8 B parameter, hybrid-Mamba transformer
+* `sarvamai/sarvam-30b` — 30 B MoE, Indic-multilingual model with custom vLLM kernels
+
+The workload is the prefill-heavy `shared_prefix_synthetic` from [inference-perf](https://github.com/kubernetes-sigs/inference-perf): a long shared system prompt + short question + decode-tolerant output (~7.2K input tokens + 1K output tokens). This matches production RAG, chat, and citizen-services traffic profiles where prefix-cache routing has the most room to win.


Nit: "2 nodes × 2 GPUs" is slightly ambiguous. Does this mean 2 servers each with one H100-NVL pair (which is a 2-GPU package with NVLink bridge), or 2 servers each with 2 discrete H100 GPUs? Since TP=1 is used, the distinction is mainly about memory per GPU: NVL halves have 94 GB HBM3 each, vs 80 GB for standard H100 SXM. Clarifying helps reproducibility.

chcost · 2026-06-03T19:30:20Z

+![AMD MI325X sarvam-30b: llm-d vs k8s round-robin](/img/blogs/heterogeneous-3vendor/amd-sarvam.png)
+
+We were unable to run sarvam-30b on Intel Gaudi3 due to software compatibility issues, but plan to work with the llm-d community to bridge this gap in the future.
+


Transparent and appreciated. Would be even stronger to briefly note what the incompatibility was (e.g., vLLM-HPU fork missing support for sarvam's custom kernels? Memory constraints? Kernel mismatch?) — helps readers evaluate whether this is a temporary gap or a deeper limitation.

I have now stated clearly that this was due to a vLLM version mismatch.

chcost · 2026-06-03T19:30:20Z

+tags: [blog, inference, scheduling, kv-cache, sig-benchmarking]
+---
+# High-volume inference on a three-vendor sovereign cluster
+


Title feedback: "High-volume" might set expectations for a large-scale deployment, but the actual setup is 20 GPUs across 4 nodes. The core contribution is proving heterogeneous multi-vendor inference works with intelligent routing, not raw scale. A title that foregrounds the multi-vendor angle might land better:

"Intelligent inference routing across a three-vendor sovereign cluster"

"Heterogeneous inference serving across three GPU vendors with llm-d"

Or if "high-volume" refers to the workload (long-context, high-throughput), make that clearer.

Makes sense, Carlos. I will change it to "Heterogeneous inference serving across three GPU vendors with llm-d" to highlight the heterogeneity angle.

chcost · 2026-06-03T19:33:39Z

@@ -0,0 +1,122 @@
+---
+title: "High-volume inference on a three-vendor sovereign cluster"
+description: "Benchmarking llm-d's prefix-cache-aware routing and prefill/decode disaggregation across NVIDIA H100-NVL, AMD MI325X, and Intel Gaudi3 pools on the NxtGen sovereign cloud — single-vendor and heterogeneous, with up to +91% throughput and 5.4× better TTFT vs plain Kubernetes round-robin."


Description mentions P/D disaggregation, but the post does not cover it. The description says "prefix-cache-aware routing and prefill/decode disaggregation" but the blog only benchmarks prefix-cache routing. P/D disaggregation appears only in the "What's next" section as future work. There is also an unused pd-sarvam.png image in the PR assets that is not referenced in the markdown — looks like a P/D section was drafted and then removed, but the description and image were not cleaned up.

Either add the P/D results back, or update the description to match what the post actually covers.

I will remove the reference to P/D. I think it was left behind after we removed the P/D results.

praveingk · 2026-06-04T06:05:52Z

@chcost Thanks for the review. I have addressed the comments in this commit. Please check.

praveingk added 5 commits May 27, 2026 16:27

Add multi-vendor inference blog contents

080a24c

Signed-off-by: Pravein Govindan Kannan <pravein.govindan.kannan@ibm.com>

Add changes to socials

a031762

Signed-off-by: Pravein Govindan Kannan <pravein.govindan.kannan@ibm.com>

Fix author titles

b8d4504

Signed-off-by: Pravein Govindan Kannan <pravein.govindan.kannan@ibm.com>

Remove PD

c19f15d

Signed-off-by: Pravein Govindan Kannan <pravein.govindan.kannan@ibm.com>

Add changes to future

7d48396

Signed-off-by: Pravein Govindan Kannan <pravein.govindan.kannan@ibm.com>

praveingk requested review from Gregory-Pereira, chcost, clubanderson, davidgs, jjasghar, petecheslock, robertgshaw2-redhat and smarterclayton as code owners May 28, 2026 05:12

davidgs added the "Blog Post" label (This PR is a blog post) May 28, 2026

Update designations of authors

23f274f

Signed-off-by: Pravein Govindan Kannan <pravein.govindan.kannan@ibm.com>

chcost reviewed Jun 3, 2026

Address comments

696d574

Signed-off-by: Pravein Govindan Kannan <pravein.govindan.kannan@ibm.com>

