@@ -0,0 +1,132 @@
---
title: "Heterogeneous inference serving across three GPU vendors with llm-d"
description: "Benchmarking llm-d's prefix-cache-aware routing across NVIDIA H100-NVL, AMD MI325X, and Intel Gaudi3 pools on the NxtGen sovereign cloud — single-vendor and heterogeneous, with up to +91% throughput and 5.4× better TTFT vs plain Kubernetes round-robin."
slug: heterogeneous-inference-3-vendor-sovereign-cluster
date: 2026-05-29T09:00

authors:
- praveinkannan
- praveenjayachandran
- jaikhari
- varunraste
- prasadmukhedkar
- vinodpathangay
- jayanthreddy
- abhisyant

tags: [blog, inference, scheduling, kv-cache, sig-benchmarking]
---
# High-volume inference on a three-vendor sovereign cluster

Most production inference clusters today are single-vendor — not because it's an optimal design, but because it's the simplest way to configure a cluster.

That is starting to change. Procurement cycles bring new generations alongside older ones, supply constraints push teams across vendors, and the cost gap between accelerators makes a one-size-fits-all fleet increasingly expensive to defend. Real production fleets are accumulating heterogeneity whether or not the architecture planned for it.

This is an opportunity to unlock real value: lower-cost accelerators can absorb low-priority workloads while premium hardware handles latency-sensitive paths, stranded capacity gets reclaimed, and the organization is no longer held hostage to one supplier's roadmap or pricing. The case is stronger still for sovereign and on-premise deployments, where data residency, regulatory alignment, and the long-term economics of high-volume inference are pushing AI workloads off centralized hyperscaler stacks.

But making it work in practice is hard. Divergent driver stacks, firmware versions, container images, hardware-specific attention kernels, and the absence of standardized performance comparisons across accelerators all combine to make a coherent serving layer over a heterogeneous fleet a non-trivial systems problem.

<!-- truncate -->

## Setup

To evaluate llm-d in a heterogeneous environment, we ran experiments on the **NxtGen sovereign cloud's** mixed-accelerator environment, which hosts the following accelerator pools within a single OpenShift AI cluster:

| Pool | Hardware | Accelerator count |
| :---- | :---- | :---- |
| NVIDIA | 2 nodes, each with 1× H100-NVL pair (2 NVLink-bridged GPUs, 94 GB HBM3 each) | 4 |
| AMD | 1 node with 8× MI325X | 8 |
| Intel | 1 node with 8× Gaudi3 | 8 |

All nodes are connected over a shared **100 G RoCE** network. We pinned each vLLM replica to a single accelerator card (TP = 1) to maximize the number of independent serving instances and exercise the routing layer.

Models served:

* `ibm-granite/granite-4.1-8b` — 8 B parameter, decoder-only dense transformer model
* `sarvamai/sarvam-30b` — 30 B MoE, Indic-multilingual model

The workload is the prefill-heavy `shared_prefix_synthetic` from [inference-perf](https://github.com/kubernetes-sigs/inference-perf): a long shared system prompt + short question + decode-tolerant output (~7.2K input tokens + 1K output tokens). This matches production RAG, chat, and citizen-services traffic profiles where prefix-cache routing has the most room to win.

## Prefix-aware caching

We deployed llm-d v0.0.7 with prefix-cache-aware routing in two flavours, chosen per model rather than per pool: **`granite-4.1-8b` runs against the [precise prefix-cache scorer](https://github.com/llm-d/llm-d/tree/main/guides/precise-prefix-cache-routing)** (tokenizer-backed, exact prefix matching), and **`sarvam-30b` runs against the approximate prefix-cache scorer** (xxhash over raw prompt bytes, no tokenizer required). The precise scorer needs a tokenizer, but sarvam's custom HF code requires `trust_remote_code=True`, a flag that the v0.8.0 llm-d-router image does not pass through to the UDS tokenizer sidecar. The hash-based approximate scorer bridges that gap.
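
To illustrate how a hash-based scorer can match prefixes without a tokenizer, here is a minimal sketch. It is not llm-d's implementation: the block size, the function names `prefix_block_hashes` and `matched_blocks`, and the use of `hashlib.blake2b` (standing in for xxhash so the example stays within the Python standard library) are all assumptions made for illustration.

```python
import hashlib

BLOCK_BYTES = 64  # hypothetical block size, not llm-d's actual setting


def prefix_block_hashes(prompt: bytes, block_bytes: int = BLOCK_BYTES) -> list[str]:
    """Hash each full block of raw prompt bytes, chaining in the previous hash.

    Chaining makes hash i depend on blocks 0..i, so equal hashes at position i
    imply (up to collisions) an identical prefix up to that block.
    """
    hashes = []
    prev = b""
    full_blocks = len(prompt) // block_bytes
    for i in range(full_blocks):
        block = prompt[i * block_bytes:(i + 1) * block_bytes]
        digest = hashlib.blake2b(prev + block, digest_size=8).digest()
        hashes.append(digest.hex())
        prev = digest
    return hashes


def matched_blocks(request_hashes: list[str], pod_hashes: set[str]) -> int:
    """Count the leading blocks of a request that a pod is believed to hold."""
    count = 0
    for h in request_hashes:
        if h not in pod_hashes:
            break
        count += 1
    return count


# Two requests that share a long system prompt but ask different questions.
system_prompt = b"You are a citizen-services assistant. " * 20
req_a = prefix_block_hashes(system_prompt + b"How do I renew a passport?")
req_b = prefix_block_hashes(system_prompt + b"Where is the nearest clinic?")

warm_pod = set(req_a)  # a pod that already served request A
cold_pod = set()       # a pod with no cached prefix

print("warm pod match:", matched_blocks(req_b, warm_pod))
print("cold pod match:", matched_blocks(req_b, cold_pod))
```

A router could then prefer the pod with the longest match. Because the hashes are computed over raw bytes, nothing in this path needs the model's tokenizer, which is the property that matters for `sarvam-30b` here.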

Each vendor's pods are deployed as a separate Helm release in the same namespace; only the `nodeSelector` and a small set of vendor-specific tuning flags (e.g. Gaudi's `--block-size 128`, `--max-num-seqs 256`, `VLLM_BUILD` pin) vary between releases. All pods carry the same selector labels and register with a single InferencePool maintained by llm-d's router. For the baseline, we use a ClusterIP service over the same set of pods to drive plain Kubernetes round-robin scheduling — **same pods, same vLLM, same flags; only the routing layer differs.**

Across every pool we tested — single-vendor (NVIDIA-only / AMD-only / Gaudi-only) and heterogeneous (NVIDIA+AMD, NVIDIA+AMD+Gaudi) — **llm-d's prefix-cache-aware routing consistently wins over plain k8s round-robin** on both throughput and time-to-first-token (TTFT). The advantage grows with pool size and heterogeneity.

| Pool | Pods | Model | Throughput edge (llm-d vs k8s) | TTFT edge |
| :---- | :---- | :---- | :---- | :---- |
| NVIDIA-only | 4 H100-NVL | granite-4.1-8b | +25–36% | 16× |
| NVIDIA-only | 4 H100-NVL | sarvam-30b | 2× | 22× |
| AMD-only | 8 MI325X | granite-4.1-8b | +79% | 21× |
| AMD-only | 8 MI325X | sarvam-30b | +83% @ 200 qps (28.6 K vs 15.6 K out tok/s) | 5× |
| Gaudi-only | 8 Gaudi3 | granite-4.1-8b | +34% | 18× |
| NVIDIA + AMD | 12 | granite-4.1-8b | +85% (19.4 K vs 10–11 K) | 3.4–5.6× |
| NVIDIA + AMD | 12 | sarvam-30b | ~3× @ 200 qps | 2.85–4.54× |
| **NVIDIA + AMD + Gaudi** | **20** | **granite-4.1-8b** | **+91% @ 85 qps** | **5.4×** |

**Why llm-d wins biggest on heterogeneous pools:** k8s round-robin spreads requests evenly regardless of pod speed, so a single slow accelerator becomes a queueing sink that drags total throughput down. llm-d's prefix-cache-aware EPP routes around saturated pods and concentrates cache hits on warm ones, so heterogeneity is no longer a penalty.
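
The queueing-sink effect can be reproduced with a toy model. The sketch below is a deliberate simplification and not llm-d's EPP logic: arrivals are evenly spaced, each pod serves one request at a time, and the `least_backlog` policy is a hypothetical stand-in for load awareness only (it does not model prefix caching). The pod speeds are made-up values.

```python
def simulate(service_times, arrival_rate, n_requests, policy):
    """Return the mean time-to-completion for a fixed-rate arrival stream.

    service_times: seconds each pod needs per request (one request at a time).
    policy: "round_robin" or "least_backlog".
    """
    free_at = [0.0] * len(service_times)  # time at which each pod's queue drains
    total_latency = 0.0
    for i in range(n_requests):
        now = i / arrival_rate
        if policy == "round_robin":
            pod = i % len(service_times)
        else:
            # Pick the pod that would finish this request earliest.
            pod = min(
                range(len(service_times)),
                key=lambda p: max(free_at[p], now) + service_times[p],
            )
        start = max(free_at[pod], now)
        free_at[pod] = start + service_times[pod]
        total_latency += free_at[pod] - now
    return total_latency / n_requests


# Hypothetical pool: three fast pods and one slow pod (seconds per request).
pods = [0.1, 0.1, 0.1, 0.5]
rr = simulate(pods, arrival_rate=20, n_requests=2000, policy="round_robin")
lb = simulate(pods, arrival_rate=20, n_requests=2000, policy="least_backlog")
print(f"round-robin mean latency:   {rr:.2f} s")
print(f"least-backlog mean latency: {lb:.2f} s")
```

The pool's combined capacity (32 requests/s) exceeds the offered load (20 requests/s), yet round-robin still hands the slow pod more than it can drain, so its queue grows without bound and dominates the mean. Any policy that accounts for per-pod backlog avoids this.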

:::note Workload caveat

`shared_prefix_synthetic` is a favourable regime for prefix-cache routing: all requests share the same long system prompt, so cache hit rate on llm-d approaches 100% once a prefix is warm on a pod. Results are strongest for workloads with high prefix reuse; gains can vary with prefix diversity.

:::

### Single-vendor pools — granite-4.1-8b

We start with the per-vendor baselines so the heterogeneous results below have a reference point. All runs use ~7.2K ISL + 1K OSL.

**4× NVIDIA H100-NVL.** llm-d improves TTFT by up to 16× compared to k8s, and output throughput by 25–36%.

![NVIDIA H100-NVL granite-4.1-8b: llm-d vs k8s round-robin](/img/blogs/heterogeneous-3vendor/nvidia-granite.png)

**8× AMD MI325X.** llm-d delivers up to 21× better TTFT and +79% throughput vs k8s round-robin on this AMD-only granite deployment.

![AMD MI325X granite-4.1-8b: llm-d vs k8s round-robin](/img/blogs/heterogeneous-3vendor/amd-granite.png)

**8× Intel Gaudi3.** At saturation (rate 25), llm-d delivers +34% throughput and ~18× better TTFT vs plain k8s round-robin.

![Intel Gaudi3 granite-4.1-8b: llm-d vs k8s round-robin](/img/blogs/heterogeneous-3vendor/gaudi-granite.png)

### Single-vendor pools — sarvam-30b (multilingual MoE)

**4× NVIDIA H100-NVL.** llm-d delivers 2× the throughput and 22× better TTFT. k8s TTFT degrades sharply between rates 15 and 20 and is fully saturated by rate 25; llm-d keeps scaling.

![NVIDIA H100-NVL sarvam-30b: llm-d vs k8s round-robin](/img/blogs/heterogeneous-3vendor/nvidia-sarvam.png)

**8× AMD MI325X.** k8s throughput plateaus at ~17 K out tok/s (peak at rate 175) and declines to 15.6 K at rate 200, while llm-d keeps scaling to **28.6 K out tok/s at rate 200 — +83% over k8s at the same rate** (or +65% peak-vs-peak). TTFT-wise llm-d is up to 5× faster at lower rates.

![AMD MI325X sarvam-30b: llm-d vs k8s round-robin](/img/blogs/heterogeneous-3vendor/amd-sarvam.png)
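
The difference between the same-rate and peak-vs-peak comparisons is easy to check by hand. The helper below (`pct_gain` is a name introduced here for illustration) uses only the figures quoted above. The k8s peak is entered as the rounded ~17 K value, so the peak-vs-peak result lands a few points away from the reported +65%, which presumably reflects the unrounded measurement.

```python
def pct_gain(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new / baseline - 1.0) * 100.0


llmd_at_200 = 28.6  # K out tok/s, llm-d at rate 200
k8s_at_200 = 15.6   # K out tok/s, k8s at rate 200
k8s_peak = 17.0     # K out tok/s, rounded k8s peak at rate 175 (approximate)

same_rate = pct_gain(llmd_at_200, k8s_at_200)
peak_vs_peak = pct_gain(llmd_at_200, k8s_peak)
print(f"same-rate gain:    +{same_rate:.0f}%")
print(f"peak-vs-peak gain: +{peak_vs_peak:.0f}%")
```

The same-rate figure is the larger of the two because k8s throughput has already declined from its peak by rate 200, while llm-d is still scaling.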

We were unable to run sarvam-30b on Intel Gaudi3: sarvam's `hotpatch_vllm.py` pins `vllm==0.15.0` and the upstream `sarvam.py` is written against vllm 0.15.x core APIs, but `llm-d-hpu:v0.7.0` (the version we tested) ships `vllm-gaudi 0.16`. We plan to revisit with the llm-d community once the version mismatch is resolved.

### Heterogeneous pools — where llm-d wins biggest

**NVIDIA + AMD (12 pods, granite-4.1-8b).** While k8s throughput plateaus at 10–11 K tok/s, llm-d goes up to 19.4 K — 85% higher throughput. TTFT-wise llm-d is 3.4–5.6× faster at higher rates.

![NVIDIA + AMD mixed pool granite-4.1-8b: llm-d vs k8s round-robin](/img/blogs/heterogeneous-3vendor/nvidia-amd-granite.png)

**NVIDIA + AMD (12 pods, sarvam-30b).** llm-d brings down TTFT by 2.85–4.54× and increases throughput by close to 3× at rate 200. llm-d wins biggest in this mixed pool — round-robin is most punished by heterogeneous capacity, and llm-d's prefix-aware routing avoids this trap.

![NVIDIA + AMD mixed pool sarvam-30b: llm-d vs k8s round-robin](/img/blogs/heterogeneous-3vendor/nvidia-amd-sarvam.png)

**NVIDIA + AMD + Gaudi (20 pods, granite-4.1-8b).** The 20-pod 3-vendor pool delivers **14.2 K out tok/s peak with llm-d vs 9.6 K with k8s round-robin**. k8s saturates at rate 25 and *declines* to 7.5 K at rate 85 (queue depth dominates) — llm-d delivers **+91% throughput at the same load**. TTFT at rate 85: llm-d 6.8 s, k8s 36.4 s (**5.4× better**).

![3-vendor (NVIDIA + AMD + Gaudi) granite-4.1-8b: llm-d vs k8s round-robin](/img/blogs/heterogeneous-3vendor/3vendor-granite.png)

The rate ladder stopped at 85 QPS because the single-replica EPP became CPU-bound, not because the pool saturated. EPP work scales with `pod_count × QPS`, so the 20-pod 3-vendor pool hits the CPU ceiling well before a smaller single-vendor pool at the same QPS. Per-pod throughput here sums to ~25 K, so the true pool ceiling is likely higher; EPP horizontal scaling (replicas: 1 → 2) or a higher CPU limit would unblock it further.
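
As a back-of-the-envelope check on the `pod_count × QPS` relationship, the sketch below compares the scoring load of the 20-pod pool with an 8-pod pool. It assumes the simple linear model stated above with no constant factors; `scoring_ops_per_second` and `equivalent_qps` are hypothetical helpers, not part of llm-d.

```python
def scoring_ops_per_second(pod_count: int, qps: float) -> float:
    """Candidate-pod evaluations per second if every request scores every pod."""
    return pod_count * qps


def equivalent_qps(ops_per_second: float, pod_count: int) -> float:
    """QPS at which a pool of `pod_count` pods reaches the given scoring load."""
    return ops_per_second / pod_count


three_vendor = scoring_ops_per_second(pod_count=20, qps=85)
single_vendor = scoring_ops_per_second(pod_count=8, qps=85)
print("20-pod pool at 85 QPS:", three_vendor, "pod evaluations/s")
print("8-pod pool at 85 QPS: ", single_vendor, "pod evaluations/s")
print("8-pod QPS for the same EPP load:", equivalent_qps(three_vendor, 8))
```

Under this model, an 8-pod pool would reach the same scoring load only at a bit over 200 QPS, which is at least consistent with the single-vendor AMD runs reaching rate 200. The actual per-pod scoring cost differs between scorer configurations, so this is an order-of-magnitude guide rather than a prediction.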

## What's next

**Cross-accelerator P/D disaggregation.** We plan to take heterogeneous inference to the next level by enabling prefill and decode to run on mixed accelerator types within the same cluster — for example, routing compute-heavy prefill to NVIDIA H100 nodes and memory-bandwidth-intensive decode to AMD MI325X nodes (or vice versa), based on where each phase runs most efficiently. This requires the KV cache transfer library (e.g. NIXL) to work across different GPU vendors on each end, an active area of development in the llm-d community.

<!-- ## Get involved with llm-d

The llm-d project thrives on community contributions, and there are many ways to get involved:

* Explore the llm-d Community Quickstart Guide → [Start here](/community) to learn more about getting involved in the llm-d project.
* Join our Slack → [Get your invite](/slack) and connect with maintainers and contributors.
* Explore the code → Browse our [GitHub organization](https://github.com/llm-d) and find issues that interest you.
* Attend meetings → All meetings are open. Add our [public calendar](/community#public-meeting-calendar) and join discussions. -->
65 changes: 65 additions & 0 deletions blog/authors.yml
@@ -198,3 +198,68 @@ abdullahgharaibeh:
  url: https://github.com/ahg-g
  image_url: https://avatars.githubusercontent.com/u/40361897?v=4

praveinkannan:
  name: Pravein Govindan Kannan
  title: Staff Research Scientist, IBM
  email: pravein.govindan.kannan@ibm.com
  image_url: https://avatars.githubusercontent.com/u/9940568?v=4
  socials:
    linkedin: praveingk
    github: https://github.com/praveingk
    twitter: praveingk

praveenjayachandran:
  name: Praveen Jayachandran
  title: Senior Technical Staff Member, IBM
  email: praveen.j@in.ibm.com
  image_url: /img/blogs/praveenjayachandran.webp
  socials:
    linkedin: praveen-jayachandran-97831113a

jaikhari:
  name: Jaikrishnan Hari
  title: Research Partnerships & BD Executive, IBM
  email: jaikhari@in.ibm.com
  image_url: /img/blogs/jaikhari.webp
  socials:
    linkedin: jaikrishnan-hari-a137972

varunraste:
  name: Varun Raste
  title: Solution Architect, IBM
  email: varun.raste@ibm.com
  image_url: /img/blogs/varunraste.webp
  socials:
    linkedin: varun-raste-0783b3184

prasadmukhedkar:
  name: Prasad Mukhedkar
  title: Associate Principal AI Architect, Red Hat
  email: pmukhedk@redhat.com
  image_url: /img/blogs/prasadmukhedkar.webp
  socials:
    linkedin: prasadmukhedkar1

vinodpathangay:
  name: Vinod Pathangay
  title: Chief Architect, Field CTO Organization, Red Hat
  email: vpathang@redhat.com
  image_url: /img/blogs/vinodpathangay.webp
  socials:
    linkedin: vinod-pathangay

jayanthreddy:
  name: Jayanth Babu Reddy
  title: Principal Architect, NxtGen Cloud Technologies
  email: jayanth.reddy@nxtgen.ai
  image_url: /img/blogs/jayanthreddy.webp
  socials:
    linkedin: jayanth-babu-reddy-498bb0207

abhisyant:
  name: Abhisyant Anasapurapu
  title: VP, NxtGen Cloud Technologies
  email: abhisyant@nxtgen.ai
  image_url: /img/blogs/abhisyant.webp
  socials:
    linkedin: abhisyant
Binary file added static/img/blogs/abhisyant.webp
Binary file added static/img/blogs/jaikhari.webp
Binary file added static/img/blogs/jayanthreddy.webp
Binary file added static/img/blogs/prasadmukhedkar.webp
Binary file added static/img/blogs/praveenjayachandran.webp
Binary file added static/img/blogs/varunraste.webp
Binary file added static/img/blogs/vinodpathangay.webp