Blog: BLIS — Evolving llm-d at simulation speed #344

Open
mtoslalibu wants to merge 1 commit into llm-d:main from mtoslalibu:blog/blis-simulator

@mtoslalibu

Summary

  • Adds blog post introducing BLIS, the llm-d simulator
  • Covers AI-native evolution of llm-d policies, prefill/decode disaggregation results, and capacity planning via simulation
  • Includes 6 figures and author entries

Authors

Mert Toslali, Dipanwita Guhathakurta, Srinivasan Parthasarathy, Jing Chen, Nick Masluk, Vishakha Ramani, Michael Kalantar, Asser Tantawi, Fabio Oliveira, Carlos Costa

@netlify

netlify Bot commented Jun 5, 2026

Deploy Preview for elaborate-kangaroo-25e1ee ready!

Name Link
🔨 Latest commit f89b95e
🔍 Latest deploy log https://app.netlify.com/projects/elaborate-kangaroo-25e1ee/deploys/6a2330094982e30007515e44
😎 Deploy Preview https://deploy-preview-344--elaborate-kangaroo-25e1ee.netlify.app

@mtoslalibu force-pushed the blog/blis-simulator branch from 9ca5adc to 87b56fe, June 5, 2026 20:14
Signed-off-by: Mert Toslali <toslali@ibm.com>
@mtoslalibu force-pushed the blog/blis-simulator branch from 87b56fe to f89b95e, June 5, 2026 20:22
Contributor

@chcost chcost left a comment

Strong core thesis, but the post needs to align better with the technical tone of the other published llm-d blogs. See the inline comments for specific suggestions.


### Capacity planning

Before you deploy any LLMs for any purpose, you need answers:
Contributor

@chcost chcost Jun 5, 2026

This is the most technically novel result in the post, and it deserves substantially more depth. Applying Drift Plus Penalty (Lyapunov optimization) to P/D disaggregation decisions is the novel part. A few things I think we can add:
- Why does DPP work here? What queue stability / penalty tradeoff is it optimizing?
- What model, hardware, workload, and QPS were used? The "2-20x" range is too wide without an explanation of what drives the variance.
- You mention "regimes where each policy stands out". That is the interesting finding. When does always-local win? When does EDPP dominate?
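
To make the first question concrete, here is a minimal sketch of the generic drift-plus-penalty rule from Lyapunov optimization: in each slot, pick the action that minimizes `V * penalty + sum_i Q_i * (arrivals_i - service_i)`. This is not BLIS's EDPP policy. The two-queue setup, the `Action` fields, and every number are hypothetical and chosen only to show how `V` trades backlog against penalty.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """One candidate decision (hypothetical model, not BLIS's)."""
    name: str
    penalty: float   # cost to minimize, e.g. an assumed KV-transfer overhead
    arrivals: tuple  # work this action adds to each queue in the slot
    service: tuple   # work drained from each queue in the slot


def dpp_choose(queues, actions, v):
    """Return the action minimizing V*penalty + sum_i Q_i*(arrivals_i - service_i)."""
    def score(a):
        drift = sum(q * (arr - srv)
                    for q, arr, srv in zip(queues, a.arrivals, a.service))
        return v * a.penalty + drift
    return min(actions, key=score)


def step_queues(queues, action):
    """Queue update: Q_i <- max(Q_i + arrivals_i - service_i, 0)."""
    return [max(q + arr - srv, 0.0)
            for q, arr, srv in zip(queues, action.arrivals, action.service)]


# Queue 0: backlog on the local instance. Queue 1: backlog on a remote prefill pool.
LOCAL = Action("local", penalty=0.0, arrivals=(10.0, 0.0), service=(4.0, 4.0))
DISAGG = Action("disaggregate", penalty=5.0, arrivals=(0.0, 10.0), service=(4.0, 4.0))
ACTIONS = [LOCAL, DISAGG]

# Empty queues: the penalty term decides, so the request stays local.
idle_choice = dpp_choose([0.0, 0.0], ACTIONS, v=1.0)
# Heavy local backlog with small V: the drift term dominates, so disaggregate.
busy_choice = dpp_choose([100.0, 0.0], ACTIONS, v=1.0)
# Same backlog with large V: the penalty dominates again, so stay local.
frugal_choice = dpp_choose([100.0, 0.0], ACTIONS, v=1000.0)
next_queues = step_queues([100.0, 0.0], busy_choice)
```

In this toy setup, a small `V` favors queue stability and a large `V` favors low penalty. An explanation in the post of which queues and which penalty EDPP uses, at that level of detail, would answer the question.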


BLIS has two jobs. It helps llm-d evolve faster, and it helps users plan deployments before spending GPU time. Let's start with the bigger one.

### AI-native evolution of llm-d
Contributor

@chcost chcost Jun 5, 2026

"High-fidelity" needs a number. Vidur reports <9% error. AIConfigurator reports 6-12% MAPE. What is BLIS's fidelity vs real hardware? Without this, the claim is an unsubstantiated adjective. Even a single validation benchmark (e.g., BLIS predicted X TTFT at Y QPS, real cluster measured Z, error was N%) would ground this.
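
For reference, a fidelity number of this kind can be as simple as mean absolute percentage error between simulated and measured latencies. The sketch below assumes paired TTFT samples in milliseconds; the function name and the sample values are placeholders for illustration, not BLIS results.

```python
def mape(predicted, measured):
    """Mean absolute percentage error, in percent, of predicted vs. measured values."""
    if not measured or len(predicted) != len(measured):
        raise ValueError("need two non-empty sequences of equal length")
    if any(m == 0 for m in measured):
        raise ValueError("measured values must be non-zero")
    total = sum(abs(p - m) / abs(m) for p, m in zip(predicted, measured))
    return 100.0 * total / len(measured)


# Placeholder numbers only: simulated vs. measured TTFT (ms) at two load points.
simulated_ttft_ms = [110.0, 190.0]
measured_ttft_ms = [100.0, 200.0]
error_pct = mape(simulated_ttft_ms, measured_ttft_ms)  # roughly 7.5 for these placeholders
```

Reporting this per metric (TTFT, ITL, throughput) together with the model, hardware, and QPS of the validation run would make the "high-fidelity" claim checkable.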

- The router picks which vLLM instance handles each request. It looks at prefix cache hits, queue depth, and KV use.
- Each instance decides which requests to batch together right now.
- The KV cache has to find room. Old blocks may need to make space.
- The autoscaler watches load and brings new instances up if needed.
Contributor

@chcost chcost Jun 5, 2026

A note on tone: our published blogs (predicted-latency, v0.5) establish technical credibility through precise, declarative prose, stating problems with data and letting the evidence carry the argument. This section uses hypothetical framing ("Imagine 500 requests...") and colloquial language ("Things look fine, then suddenly they don't", "scan a hundred settings before lunch") that reads differently from the rest of our blogs. I'd recommend adopting the same engineering-report tone: direct, confident, evidence-first. Also, the llm-d audience doesn't need to be walked through what distributed serving is.

Consider cutting this section entirely and weaving the essential points (many interacting knobs, hard to predict on paper) into the intro.


For the full story, see [our earlier post on the admission controller loop](https://ai-native-systems-research.github.io/ai-native-systems-research/blog/2026/05/13/from-simulation-to-production-how-an-ai-native-pipeline-discovered-a-better-admission-controller-for-llm-d/).

#### When to disaggregate prefill and decode
Contributor

@chcost chcost Jun 5, 2026

"TTFT p90 was about 30x faster in the tail": on what model, what hardware, what workload shape, and what QPS? "About 30x" is imprecise.

The predicted-latency blog gives a table with exact numbers on Qwen3-480B across 13 servers with 8xH200s. Every quantitative claim in an llm-d blog should be accompanied by the configuration that produced it. This is especially important for a simulator blog: if the claim comes from simulation, the reader needs to know the simulation parameters to judge the result.

- [Why simulate before you scale](https://inference-sim.github.io/inference-sim/latest/blog/2026/03/05/why-simulate-before-you-scale/)
- [The physics of high-fidelity distributed inference platform simulation](https://medium.com/modeling-distributed-inference/the-physics-of-high-fidelity-distributed-inference-platform-simulation-28fe27b59da2)
- **The admission controller story in full:** [From simulation to production](https://ai-native-systems-research.github.io/ai-native-systems-research/blog/2026/05/13/from-simulation-to-production-how-an-ai-native-pipeline-discovered-a-better-admission-controller-for-llm-d/)
- **The upcoming BLIS proposal for llm-d**
Contributor

@chcost chcost Jun 5, 2026

The post should acknowledge limitations. What can't BLIS model? Where does the performance model break down? What workloads give poor fidelity?

The predicted-latency blog explicitly acknowledges Scenario C where its approach only matches (not beats) the baseline, and explains why. This kind of honesty builds trust with a technical audience. A post that presents everything as a win reads as advocacy rather than engineering.

3 participants