
Add blog on llm-d benchmarking on heterogenous cluster#326

Open
praveingk wants to merge 7 commits into
llm-d:mainfrom
praveingk:main

Conversation

@praveingk

What does this PR do?

Add blog contents for "High-volume inference on a three-vendor sovereign cluster"

A few things to note:

  1. The date in the blog file is set to 05/29; it will need to be updated once the post is ready to publish.

Why is this change needed?

How was this tested?

  • Tests added/updated (npm test)
  • Site builds successfully (npm run build)
  • Manual testing performed (npm start)

Checklist

  • Commits are signed off (git commit -s) per DCO
  • Code follows project contributing guidelines
  • Tests pass locally (npm test)
  • Site builds without errors (npm run build)
  • Documentation updated (if applicable)

Related Issues

praveingk added 5 commits May 27, 2026 16:27
Signed-off-by: Pravein Govindan Kannan <pravein.govindan.kannan@ibm.com>
Signed-off-by: Pravein Govindan Kannan <pravein.govindan.kannan@ibm.com>
Signed-off-by: Pravein Govindan Kannan <pravein.govindan.kannan@ibm.com>
Signed-off-by: Pravein Govindan Kannan <pravein.govindan.kannan@ibm.com>
Signed-off-by: Pravein Govindan Kannan <pravein.govindan.kannan@ibm.com>
@netlify

netlify Bot commented May 28, 2026

Deploy Preview for elaborate-kangaroo-25e1ee ready!

Name Link
🔨 Latest commit 696d574
🔍 Latest deploy log https://app.netlify.com/projects/elaborate-kangaroo-25e1ee/deploys/6a21125fb6254e00085a83ca
😎 Deploy Preview https://deploy-preview-326--elaborate-kangaroo-25e1ee.netlify.app

@davidgs davidgs added the Blog Post This PR is a blog post label May 28, 2026
Signed-off-by: Pravein Govindan Kannan <pravein.govindan.kannan@ibm.com>
Contributor

@chcost chcost left a comment

Technical review: checked graphs against text claims, cross-referenced performance numbers with public benchmarks, and inspected for consistency.


### Single-vendor pools — granite-4.1-8b

We start with the per-vendor baselines so the heterogeneous results below have a reference point. All runs use ~7.2K ISL + 1K OSL.
Contributor

@chcost chcost Jun 3, 2026

Text-graph discrepancy. The graph for AMD MI325X sarvam-30b shows llm-d peaking at approximately 25K output tok/s, not 29K. The 85% improvement figure also doesn't hold if the baseline is 17K (25/17 ≈ 1.47, i.e. +47%). Please reconcile the numbers in this row with what the graph actually shows: either the graph axis labels are off, or the text overstates the result by ~15%.

Same issue appears in the body text on line 93.

Author

@praveingk praveingk Jun 4, 2026

Actually, the peak is about 28.6K, to be precise. I think the confusion comes from the graph's axis stopping at 25K. I will extend it to 30K so that it's clear, and keep the comparison precise instead of rounding it.
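
To make the percentages in this thread easy to re-check, here is a minimal sketch of the arithmetic. It assumes the rounded 17K baseline from the table row; `pct_gain` and the variable names are illustrative and not part of the benchmark tooling.

```python
def pct_gain(new: float, base: float) -> float:
    """Relative improvement of `new` over `base`, in percent."""
    return (new / base - 1.0) * 100.0


# k8s round-robin peak as quoted in the table row (rounded, in K out tok/s).
BASELINE_K = 17.0

# Candidate llm-d peaks mentioned in this thread (K out tok/s).
PEAKS_K = {
    "graph reading": 25.0,
    "table row": 29.0,
    "author's figure": 28.6,
}

for label, peak_k in PEAKS_K.items():
    gain = pct_gain(peak_k, BASELINE_K)
    print(f"{label}: {peak_k}K vs {BASELINE_K}K -> +{gain:.0f}%")

# Baseline that a +85% gain would imply for a 28.6K peak.
implied_base_k = 28.6 / 1.85
print(f"baseline implied by +85% at 28.6K: {implied_base_k:.1f}K")
```

With the rounded 17K baseline, none of the three peak values yields +85%, and a +85% gain at 28.6K would imply a baseline near 15.5K. Quoting the unrounded baseline next to the peak would let readers reproduce the figure.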

| NVIDIA-only | 4 H100-NVL | sarvam-30b | 2× | 22× |
| AMD-only | 8 MI325X | granite-4.1-8b | +79% | 21× |
| AMD-only | 8 MI325X | sarvam-30b | +85% (29 K vs 17 K out tok/s) | 5× |
| Gaudi-only | 8 Gaudi3 | granite-4.1-8b | +34% | 18× |
Contributor

@chcost chcost Jun 3, 2026

Inconsistent routing modes across experiments. The text here says "precise prefix-cache-aware routing" was used universally, but the graph legends tell a different story:

  • Granite on NVIDIA/AMD: llm-d (precise-prefix-cache)
  • Sarvam on NVIDIA/AMD: llm-d (approximate prefix-cache) ← different
  • Granite on Gaudi3 and 3-vendor: llm-d (prefix-cache-aware EPP) ← also different

Three different labels across eight graphs. Were different routing algorithms actually used? If so, the text should explain why (e.g., approximate matching for MoE models, different EPP config for Gaudi). If not, the graph legends should be unified. Readers will notice.

Author

For Sarvam, we used approximate prefix-cache routing because the precise-prefix-cache config did not run: trust-remote-code is not enabled in v.0.8.0 llm-d-router. I will state that clearly and fix the labels accordingly.

### Single-vendor pools — sarvam-30b (multilingual MoE)

**4× NVIDIA H100-NVL.** llm-d delivers 2× the throughput and 22× better TTFT. k8s saturates around rate 25–30; llm-d keeps scaling.

Contributor

@chcost chcost Jun 3, 2026

k8s saturation point seems earlier than stated. The graph shows k8s TTFT jumping sharply between rate 15–20 (from ~0.2s to ~1.3s), suggesting saturation starts closer to rate 15–20, not 25–30 as stated here. Worth double-checking against the raw data.


**NVIDIA + AMD + Gaudi (20 pods, granite-4.1-8b).** The 20-pod 3-vendor pool delivers **14.2 K out tok/s peak with llm-d vs 9.6 K with k8s round-robin**. k8s saturates at rate 25 and *declines* to 7.5 K at rate 85 (queue depth dominates) — llm-d delivers **+91% throughput at the same load**. TTFT at rate 85: llm-d 6.8 s, k8s 36.4 s (**5.4× better**).

![3-vendor (NVIDIA + AMD + Gaudi) granite-4.1-8b: llm-d vs k8s round-robin](/img/blogs/heterogeneous-3vendor/3vendor-granite.png)
Contributor

@chcost chcost Jun 3, 2026

3-vendor pool likely not fully saturated. Individual pools saturate at ~5–6 QPS/pod, so 20 pods would need ~100–120 QPS to reach peak throughput. This test only went to 85 QPS. The 14.2K peak is real, but it's probably not the pool's ceiling — meaning the actual throughput advantage of llm-d on a 3-vendor fleet could be even larger. Consider either (a) noting this explicitly, or (b) testing at higher QPS to find the true peak.

Also: the additive theoretical max from individual pools is ~25.7K (5.7K + 15K + 5K). At 85 QPS, llm-d achieves 55% of that — reasonable for an under-loaded pool, but the gap might raise questions without context.
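
The back-of-the-envelope numbers above can be reproduced with a short sketch. All inputs are the estimates quoted in this comment (5–6 QPS/pod saturation, single-pool peaks of 5.7K, 15K, and 5K); the variable names are illustrative.

```python
PODS = 20
PER_POD_SAT_QPS = (5, 6)              # estimated per-pod saturation range
TESTED_QPS = 85                       # highest rate in the 3-vendor run
SINGLE_POOL_PEAKS_K = [5.7, 15.0, 5.0]  # per-pool peaks, K out tok/s
MEASURED_PEAK_K = 14.2                # llm-d peak on the 20-pod pool

# Load needed to saturate all 20 pods, assuming pods saturate independently.
sat_low, sat_high = (PODS * q for q in PER_POD_SAT_QPS)

# Naive additive ceiling: sum of the single-vendor peaks.
additive_max_k = sum(SINGLE_POOL_PEAKS_K)
fraction = MEASURED_PEAK_K / additive_max_k

print(f"estimated saturation load: {sat_low}-{sat_high} QPS")
print(f"tested load as share of lower bound: {TESTED_QPS / sat_low:.0%}")
print(f"additive ceiling: {additive_max_k:.1f}K out tok/s")
print(f"measured peak as share of ceiling: {fraction:.0%}")
```

The additive ceiling is only a rough upper bound, since it assumes the pools scale independently and ignores router overhead.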

Author

We stopped at 85 QPS because we started seeing more failures. This is because we kept our standard EPP configuration (18 CPUs) throughout the tests. A higher CPU limit or horizontal scaling would have helped us scale further toward the theoretical limit. Unfortunately, the testbed is no longer available, so I will mention this limitation.

## Prefix-aware caching

We deployed llm-d v0.0.7 with [precise prefix-cache-aware routing](https://github.com/llm-d/llm-d/tree/main/guides/precise-prefix-cache-routing). Each vendor's pods are deployed as a separate Helm release in the same namespace; only the `nodeSelector` and a small set of vendor-specific tuning flags (e.g. Gaudi's `--block-size 128`, `--max-num-seqs 256`, `VLLM_BUILD` pin) vary between releases. All pods carry the same selector labels and register with a single InferencePool maintained by llm-d's router. For the baseline, we use a ClusterIP service over the same set of pods to drive plain Kubernetes round-robin scheduling — **same pods, same vLLM, same flags; only the routing layer differs.**
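
As a rough illustration of the "one release per vendor, shared selector labels" layout described above, the sketch below models per-vendor Helm values as plain Python dicts. The key names (`labels`, `vllmArgs`), the label `app: vllm-pool`, and the `accelerator` node-label values are hypothetical placeholders, not the chart's actual schema; only `nodeSelector` and the Gaudi flags come from the text.

```python
from copy import deepcopy

# Values shared by every release. Identical selector labels let all pods
# register with the same InferencePool. Label name/value are hypothetical.
BASE = {
    "labels": {"app": "vllm-pool"},
    "nodeSelector": {},
    "vllmArgs": [],
}

# Per-vendor overrides. Node-label keys and values are hypothetical.
OVERRIDES = {
    "nvidia": {"nodeSelector": {"accelerator": "nvidia-h100-nvl"}},
    "amd": {"nodeSelector": {"accelerator": "amd-mi325x"}},
    "gaudi": {
        "nodeSelector": {"accelerator": "intel-gaudi3"},
        # Gaudi tuning flags quoted in the post.
        "vllmArgs": ["--block-size", "128", "--max-num-seqs", "256"],
    },
}


def render(vendor: str) -> dict:
    """Merge the shared values with one vendor's overrides."""
    values = deepcopy(BASE)
    values.update(deepcopy(OVERRIDES[vendor]))
    return values


releases = {vendor: render(vendor) for vendor in OVERRIDES}

# Keys that differ from the shared base in at least one release.
differing = {
    key
    for values in releases.values()
    for key in values
    if values[key] != BASE[key]
}
print(sorted(differing))
```

The check at the end mirrors the claim in the paragraph: across releases only the node selector and the vendor-specific flags change, while the selector labels stay the same.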

Contributor

@chcost chcost Jun 3, 2026

Fair characterization, but it is worth being more explicit that the shared_prefix_synthetic workload represents the best case for prefix-cache routing: all requests share the same long prefix, so the cache hit rate approaches 100% under llm-d. The TTFT improvements (16–22×) are specific to this high-cache-hit regime. Production workloads with diverse prefixes or no shared system prompt would see significantly smaller gains.

A sentence like "Results are strongest for workloads with high prefix reuse; gains vary with prefix diversity" would preempt the obvious follow-up question.
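
To illustrate how hit rate depends on prefix diversity, here is a toy model, not llm-d's scorer. It assumes each pod holds a small LRU cache of whole prefixes, that "prefix-aware" routing is a simple `prefix % num_pods` mapping, and that requests draw prefixes uniformly at random. All names and parameters are illustrative.

```python
import random
from collections import OrderedDict


def hit_rate(num_pods, num_prefixes, cache_slots, requests, prefix_aware, seed=0):
    """Fraction of requests whose prefix is already cached on the chosen pod."""
    rng = random.Random(seed)
    caches = [OrderedDict() for _ in range(num_pods)]  # one LRU cache per pod
    hits = 0
    for i in range(requests):
        prefix = rng.randrange(num_prefixes)
        # Prefix-aware: same prefix always lands on the same pod.
        # Round-robin: pods are cycled regardless of prefix.
        pod = prefix % num_pods if prefix_aware else i % num_pods
        cache = caches[pod]
        if prefix in cache:
            hits += 1
            cache.move_to_end(prefix)      # mark as most recently used
        else:
            cache[prefix] = True
            if len(cache) > cache_slots:
                cache.popitem(last=False)  # evict least recently used
    return hits / requests


for n in (1, 16, 1000):
    rr = hit_rate(4, n, 4, 2000, prefix_aware=False)
    pa = hit_rate(4, n, 4, 2000, prefix_aware=True)
    print(f"{n:>4} prefixes: round-robin {rr:.1%}, prefix-aware {pa:.1%}")
```

In this toy model the gap between the two policies is widest when the set of prefixes fits in the pool's aggregate cache but not in any single pod's cache, and it narrows when prefixes are far more numerous than the total cache capacity.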


Contributor

@chcost chcost Jun 3, 2026

Granite-4.1-8b is a hybrid-Mamba transformer. Mamba layers use state-space models, not attention, so they don't have traditional KV caches. How does prefix caching interact with the Mamba components? This is a technically interesting detail that readers familiar with the architecture will wonder about. Even a brief note (e.g., "prefix caching applies to the transformer layers; Mamba state is recomputed") would add credibility.

Author

Good catch @chcost. Granite-4.1-8b is not a hybrid-Mamba transformer; it's a dense model. I think the description was left over from its predecessor, which was a hybrid-Mamba model (the model we started benchmarking with). I will change it accordingly.

* `ibm-granite/granite-4.1-8b` — 8 B parameter, hybrid-Mamba transformer
* `sarvamai/sarvam-30b` — 30 B MoE, Indic-multilingual model with custom vLLM kernels

The workload is the prefill-heavy `shared_prefix_synthetic` from [inference-perf](https://github.com/kubernetes-sigs/inference-perf): a long shared system prompt + short question + decode-tolerant output (~7.2K input tokens + 1K output tokens). This matches production RAG, chat, and citizen-services traffic profiles where prefix-cache routing has the most room to win.
Contributor

@chcost chcost Jun 3, 2026

Nit: "2 nodes × 2 GPUs" is slightly ambiguous. Does this mean 2 servers, each with one H100-NVL pair (a 2-GPU package with an NVLink bridge), or 2 servers, each with 2 discrete H100 GPUs? Since TP=1 is used, the distinction is mainly about memory per GPU: NVL halves have 94 GB HBM3 each, vs 80 GB for standard H100 SXM. Clarifying helps reproducibility.

![AMD MI325X sarvam-30b: llm-d vs k8s round-robin](/img/blogs/heterogeneous-3vendor/amd-sarvam.png)

We were unable to run sarvam-30b on Intel Gaudi3 due to software compatibility issues, but plan to work with the llm-d community to bridge this gap in the future.

Contributor

Transparent and appreciated. Would be even stronger to briefly note what the incompatibility was (e.g., vLLM-HPU fork missing support for sarvam's custom kernels? Memory constraints? Kernel mismatch?) — helps readers evaluate whether this is a temporary gap or a deeper limitation.

Author

I have now stated clearly that this was due to a vLLM version mismatch.

tags: [blog, inference, scheduling, kv-cache, sig-benchmarking]
---
# High-volume inference on a three-vendor sovereign cluster

Contributor

@chcost chcost Jun 3, 2026

Title feedback: "High-volume" might set expectations for a large-scale deployment, but the actual setup is 20 GPUs across 4 nodes. The core contribution is proving heterogeneous multi-vendor inference works with intelligent routing, not raw scale. A title that foregrounds the multi-vendor angle might land better:

  • "Intelligent inference routing across a three-vendor sovereign cluster"
  • "Heterogeneous inference serving across three GPU vendors with llm-d"

Or if "high-volume" refers to the workload (long-context, high-throughput), make that clearer.

Author

Makes sense Carlos. I will change to "Heterogeneous inference serving across three GPU vendors with llm-d" to highlight the heterogeneity angle.

@@ -0,0 +1,122 @@
---
title: "High-volume inference on a three-vendor sovereign cluster"
description: "Benchmarking llm-d's prefix-cache-aware routing and prefill/decode disaggregation across NVIDIA H100-NVL, AMD MI325X, and Intel Gaudi3 pools on the NxtGen sovereign cloud — single-vendor and heterogeneous, with up to +91% throughput and 5.4× better TTFT vs plain Kubernetes round-robin."
Contributor

Description mentions P/D disaggregation, but the post does not cover it. The description says "prefix-cache-aware routing and prefill/decode disaggregation" but the blog only benchmarks prefix-cache routing. P/D disaggregation appears only in the "What's next" section as future work. There is also an unused pd-sarvam.png image in the PR assets that is not referenced in the markdown — looks like a P/D section was drafted and then removed, but the description and image were not cleaned up.

Either add the P/D results back, or update the description to match what the post actually covers.

Author

I will remove the reference to P/D. I think it was left behind after we removed the P/D results.

Signed-off-by: Pravein Govindan Kannan <pravein.govindan.kannan@ibm.com>
@praveingk
Author

@chcost Thanks for the review. I have addressed the comments in this commit.
Please check.
