Blog: BLIS — Evolving llm-d at simulation speed#344
Signed-off-by: Mert Toslali <toslali@ibm.com>
| ### Capacity planning |
| Before you deploy any LLMs for any purpose, you need answers: |
This is the most technically novel result in the post, and it deserves substantially more depth. The novel part is applying Drift Plus Penalty (Lyapunov optimization) to P/D disaggregation decisions. A few things I think we can add:
- Why does DPP work here? What queue stability / penalty tradeoff is it optimizing?
- What model, hardware, workload, and QPS were used? The "2-20x" range is too wide without explaining what drives the variance.
- You mention "regimes where each policy stands out". That is the interesting finding. When does always-local win? When does EDPP dominate?
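To make the first question concrete, here is a minimal sketch of the generic, textbook drift-plus-penalty rule. It is not the EDPP policy from the post: the `Action` type, the queue names, and every number are hypothetical, and it assumes service rates do not depend on the chosen action, so only the arrival terms appear in the score. Each decision picks the action that minimizes `V * penalty + sum(Q_i * arrivals_i)`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """One routing option for a request (hypothetical illustration)."""
    name: str
    penalty: float               # cost to minimize on average, e.g. added latency
    arrivals: dict[str, float]   # work this action adds to each queue


def dpp_choose(queues: dict[str, float], actions: list[Action], v: float) -> Action:
    """Return the action minimizing V * penalty + sum_i Q_i * arrivals_i.

    A larger V weights the penalty more heavily; a smaller V weights
    queue backlog (stability) more heavily.
    """
    def score(action: Action) -> float:
        backlog_term = sum(queues[q] * work for q, work in action.arrivals.items())
        return v * action.penalty + backlog_term

    return min(actions, key=score)


# Hypothetical options: run prefill locally on the decode instance,
# or send it to a separate prefill instance and pay a transfer penalty.
local = Action("local", penalty=1.0, arrivals={"decode": 5.0})
remote = Action("remote", penalty=3.0, arrivals={"prefill": 4.0, "decode": 1.0})

if __name__ == "__main__":
    busy_prefill = {"prefill": 10.0, "decode": 2.0}
    busy_decode = {"prefill": 0.0, "decode": 20.0}
    print(dpp_choose(busy_prefill, [local, remote], v=1.0).name)
    print(dpp_choose(busy_decode, [local, remote], v=1.0).name)
```

In this toy setup the choice flips with the backlog: a congested decode queue pushes the rule toward the remote option, while a large `V` keeps it on the low-penalty option regardless. Documenting where that crossover happens for the real policy would answer the "regimes" question.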
| BLIS has two jobs. It helps llm-d evolve faster, and it helps users plan deployments before spending GPU time. Let's start with the bigger one. |
| ### AI-native evolution of llm-d |
"High-fidelity" needs a number. Vidur reports <9% error. AIConfigurator reports 6-12% MAPE. What is BLIS's fidelity vs real hardware? Without this, the claim is an unsubstantiated adjective. Even a single validation benchmark (e.g., BLIS predicted X TTFT at Y QPS, real cluster measured Z, error was N%) would ground this.
| - The router picks which vLLM instance handles each request. It looks at prefix cache hits, queue depth, and KV use. |
| - Each instance decides which requests to batch together right now. |
| - The KV cache has to find room. Old blocks may need to make space. |
| - The autoscaler watches load and brings new instances up if needed. |
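The routing step in the first bullet can be sketched as a weighted score over the three signals it names. This is an illustration only: the `Instance` fields, the weights, and `pick_instance` are hypothetical and are not taken from llm-d or BLIS.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """Snapshot of one serving instance (hypothetical fields)."""
    name: str
    prefix_hit_ratio: float   # fraction of the prompt already cached, 0..1
    queue_depth: int          # requests waiting on this instance
    kv_utilization: float     # fraction of KV cache blocks in use, 0..1


def pick_instance(instances, w_prefix=1.0, w_queue=0.1, w_kv=1.0):
    """Return the instance with the highest score.

    Prefix cache hits raise the score; queue depth and KV use lower it.
    The weights are arbitrary assumptions chosen for the example.
    """
    def score(inst: Instance) -> float:
        return (w_prefix * inst.prefix_hit_ratio
                - w_queue * inst.queue_depth
                - w_kv * inst.kv_utilization)

    return max(instances, key=score)


if __name__ == "__main__":
    warm_but_busy = Instance("a", prefix_hit_ratio=0.9, queue_depth=8, kv_utilization=0.5)
    cold_but_idle = Instance("b", prefix_hit_ratio=0.0, queue_depth=1, kv_utilization=0.2)
    print(pick_instance([warm_but_busy, cold_but_idle]).name)
```

Even this toy version shows why the knobs interact: a good cache hit can lose to a shorter queue, and the winner changes as the weights change.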
A note on tone: our published blogs (predicted-latency, v0.5) establish technical credibility through precise, declarative prose, stating problems with data and letting the evidence carry the argument. This section uses hypothetical framing ("Imagine 500 requests...") and colloquial language ("Things look fine, then suddenly they don't", "scan a hundred settings before lunch") that reads differently from the rest of our blogs. I'd recommend adopting the same engineering-report tone: direct, confident, evidence-first. Also, the llm-d audience doesn't need to be walked through what distributed serving is.
Consider cutting this section entirely and weaving the essential points (many interacting knobs, hard to predict on paper) into the intro.
| For the full story, see [our earlier post on the admission controller loop](https://ai-native-systems-research.github.io/ai-native-systems-research/blog/2026/05/13/from-simulation-to-production-how-an-ai-native-pipeline-discovered-a-better-admission-controller-for-llm-d/). |
| #### When to disaggregate prefill and decode |
"TTFT p90 was about 30x faster in the tail" on what model? What hardware? What workload shape? What QPS? "About 30x" is imprecise.
The predicted-latency blog gives a table with exact numbers on Qwen3-480B across 13 servers with 8xH200s. Every quantitative claim in an llm-d blog should be accompanied by the configuration that produced it. This is especially important for a simulator blog: if the claim comes from simulation, the reader needs to know the simulation parameters to judge the result.
| - [Why simulate before you scale](https://inference-sim.github.io/inference-sim/latest/blog/2026/03/05/why-simulate-before-you-scale/) |
| - [The physics of high-fidelity distributed inference platform simulation](https://medium.com/modeling-distributed-inference/the-physics-of-high-fidelity-distributed-inference-platform-simulation-28fe27b59da2) |
| - **The admission controller story in full:** [From simulation to production](https://ai-native-systems-research.github.io/ai-native-systems-research/blog/2026/05/13/from-simulation-to-production-how-an-ai-native-pipeline-discovered-a-better-admission-controller-for-llm-d/) |
| - **The upcoming BLIS proposal for llm-d** |
The post should acknowledge limitations. What can't BLIS model? Where does the performance model break down? What workloads give poor fidelity?
The predicted-latency blog explicitly acknowledges Scenario C, where its approach only matches (not beats) the baseline, and explains why. This kind of honesty builds trust with a technical audience. A post that presents everything as a win reads as advocacy rather than engineering.
Summary
Authors
Mert Toslali, Dipanwita Guhathakurta, Srinivasan Parthasarathy, Jing Chen, Nick Masluk, Vishakha Ramani, Michael Kalantar, Asser Tantawi, Fabio Oliveira, Carlos Costa