Adding reference architectures for optimized-baseline, precise prefix cache routing, predicted latency routing llm-d well-lit paths #461

Open
gushob21 wants to merge 7 commits into main from gushob-refactor-llmd

gushob21 commented Jun 8, 2026

Collaborator

Adding reference architectures for the optimized-baseline, precise prefix cache routing, and predicted latency routing llm-d well-lit paths.

gushob21 and others added 2 commits June 8, 2026 14:59
Adding Kubernetes manifests for llmd deployment

Adding Kubernetes manifests for llmd deployment

Adding Kubernetes manifests for llmd deployment

Adding Kubernetes manifests for llmd deployment

fixing CI failures and removing an unwanted directory

adding deploy and teardown scripts for llmd optimized-baseline

fixing cspell CI failure

Refining README for llmd baseline optimization architecture

Refactoring optimized-baseline reference architecture and adding reference architecture for precise-prefix-cache

Fixing cspell CI pipeline

I can't sort the file on the first attempt. Fixing it now.

minor fix in README

moving llmd_model_id and llmd_accelerator_type to llmd shared variable files

editing README for llmd optimized baseline and precise prefix cache

updating runtime.env for gemma4 on g4 and update README

Adding reference architecture for llm-d predicted latency routing

fixing typo

refining README files for llm architectures

Minor README fixes

fixing typos in readme

Adding reference architectures for optimized-baseline, precise prefix
cache routing, predicted latency routing llm-d well-lit paths
syeda-anjum self-requested a review June 8, 2026 16:58

fernandorubbo left a comment

Member

Please fix the precise prefix cache configurations that are using 2 chips.

The rest of the comments are minor. Let's discuss them offline.


Valid values for `ACCELERATOR` are:

- `h100`

Member

what about TPU v6e?

Member

I also notice that the list includes h100, but it looks like the h100 implementation is still pending.

Collaborator Author

I have not been able to test them due to a stockout, which is why I left the h100 manifests out in the first place. I have added them now. TPU support will follow once we have implemented benchmarking and confirmed that the architectures work as expected.

## Architecture

This guide is an implementation of
[llm-d optimized baseline well-lit path](https://github.com/llm-d/llm-d/tree/main/guides/optimized-baseline).

Member

Should we mention that this deployment uses the Approximated Prefix Cache, and link to where the user can learn the difference between approximated, precise, and predicted?

I also think we could add links to the other options here.


## Prerequisite

This architecture and workflow assumes that the reader is familiar with the

Member

All 3 files are 90% identical. Keeping this note here so we can discuss whether we should have only one, to avoid the maintenance burden.

Collaborator Author

I intentionally kept them separate, for two reasons:

  1. If one component changes later, we don't want to impact the other two.
  2. One could spin up all three reference architectures from accelerated-platform in parallel if we have separate code for them.

- runtime.env
name: runtime

nameSuffix: -rtx-pro-6000-qwen3-32b

Member

The only difference I see from the other kustomization.yaml files (except the base one) is the name of the model. That being said, couldn't this entire file live in the base, with only the suffix here? Or even better, could the suffix be built using variables, like you do in other parts?

Collaborator Author

The issue is that if we move this section to ../base/kustomization.yaml, we will have to provide the relative paths to runtime.env and patch-resource.yaml there, because of how overlays work. These two files hold different configuration for different model-accelerator combinations, so they should live in their specific folders. But that means Kustomize will scan and source all the files even when we want to run it for just one model-accelerator combination. The trade-off is not worth it IMO.

# See the License for the specific language governing permissions and
# limitations under the License.
---
apiVersion: apps/v1

Member

Same here. This is the same file as the previous one.

Should we create a hierarchy like

base/chip/model

Today you have a base that only points to the main llm-d

and then chip-models,

which causes a lot of the code to be repeated.

In particular, I would move everything possible into base, then everything possible into chip, and leave only the model-specific bits to the model, to mitigate this issue.

gushob21 Jun 9, 2026

Collaborator Author

I am following how the accelerated platform repo currently handles kustomization deployments and keeping this architecture in line with that. If we change it in one place, we should change it everywhere; otherwise we will end up with inconsistent patterns across the repo.

Collaborator Author

Looking at the current files, only patch-nodeselector.yaml may remain consistent for a given accelerator, so it may seem duplicated. We could place this YAML under base/chip as per your suggestion. The problem occurs when we use a model that cannot be accommodated on the default CCC specified in patch-nodeselector.yaml, e.g. the default is gpu-rtx-pro-6000-96gb-x1, but bigger models may need gpu-rtx-pro-6000-96gb-x2. In that case, we will have to apply another nodeselector patch inside the model directory, which I think will make the architecture hard to follow. IMO, keeping separate model-accelerator combinations is more user friendly and a good trade-off for some duplicate code.
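
To make the trade-off concrete, here is a rough sketch of how a base/chip/model layering would resolve the compute class. It models the layers as plain dictionaries rather than real Kustomize overlays; the `merge` helper, the `compute_class` and `model` keys, and the two model names are hypothetical, and only the two compute class values come from this thread.

```python
def merge(*layers):
    """Shallow-merge configuration layers; later layers override earlier ones."""
    result = {}
    for layer in layers:
        result.update(layer)
    return result

# Hypothetical layers; only the compute class names come from the discussion.
base = {"runtime": "llm-d"}
chip = {"compute_class": "gpu-rtx-pro-6000-96gb-x1"}  # chip-level default
small_model = {"model": "small-model"}                 # fits the default
large_model = {                                        # must re-patch the default
    "model": "large-model",
    "compute_class": "gpu-rtx-pro-6000-96gb-x2",
}

small = merge(base, chip, small_model)
large = merge(base, chip, large_model)
print(small["compute_class"])
print(large["compute_class"])
```

In this sketch the larger model has to override a value that the chip layer already set, so a reader needs to inspect two layers to know which compute class is in effect. That is the readability cost being weighed against the duplicated files.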

limits:
cpu: "10"
memory: 128G
nvidia.com/gpu: "2"

Member

why 2?

Collaborator Author

Precise prefix cache routing makes vLLM pin more memory than optimized-baseline and predicted latency routing, so we get this error when we try to spin up precise prefix cache on 1 accelerator:

ValueError: To serve at least one request with the models's max seq len (32768), (27.5 GiB KV cache is needed, which is larger than the available KV cache memory (27.19 GiB). Based on the available memory, the estimated maximum model length is 32384. Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine. See https://docs.vllm.ai/en/latest/configuration/conserving_memory/ for more details."

The workaround is to reduce max_model_len; it works when I change it from 32768 to 30000. Do you want to strictly keep it on one accelerator by changing max_model_len instead of using 2 accelerators?
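
The numbers in the error are roughly consistent with the KV cache requirement scaling linearly with sequence length. A minimal sketch of that arithmetic follows, assuming linear scaling and a KV block size of 16 tokens. Both are assumptions for illustration and not taken from the vLLM source, and the function names are hypothetical.

```python
import math

def estimated_max_model_len(required_gib, available_gib, max_model_len, block_size=16):
    """Scale max_model_len by the available/required KV cache ratio,
    then round down to a whole number of KV blocks (assumed block size)."""
    tokens = available_gib / required_gib * max_model_len
    return math.floor(tokens / block_size) * block_size

def required_kv_gib(required_gib, max_model_len, new_max_model_len):
    """KV cache needed for one request at a different max length (linear assumption)."""
    return required_gib * new_max_model_len / max_model_len

# Values reported in the error message above.
est = estimated_max_model_len(27.5, 27.19, 32768)
need_30000 = required_kv_gib(27.5, 32768, 30000)
print(est)                   # 32384, matching the estimate in the error
print(round(need_30000, 2))  # 25.18 GiB, below the 27.19 GiB available
```

Under these assumptions, a max_model_len of 30000 leaves about 2 GiB of KV cache headroom on a single accelerator.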

- "--max-model-len=$(MAX_MODEL_LEN)"
- "--gpu-memory-utilization=$(GPU_MEMORY_UTILIZATION)"
- "--kv-events-config"
- '{"enable_kv_cache_events":true,"publisher":"zmq","endpoint":"$(KV_EVENTS_ENDPOINT)","topic":"kv@$(POD_IP):$(POD_PORT)@$(MODEL_ID)"}'

Member

why is this needed?

Collaborator Author

to overlay model id; upstream has it hardcoded
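
For reference, here is a small sketch of what the `--kv-events-config` value looks like once the `$(VAR)` references are resolved. It assumes the references are expanded from the container's environment at start-up, as Kubernetes does for `args`, with unresolved references left unchanged. The `expand` helper and the endpoint, IP, and port values are hypothetical; the model ID is the one from the runtime.env shown in this review.

```python
import json
import re

# The flag value from the manifest under review.
TEMPLATE = (
    '{"enable_kv_cache_events":true,"publisher":"zmq",'
    '"endpoint":"$(KV_EVENTS_ENDPOINT)",'
    '"topic":"kv@$(POD_IP):$(POD_PORT)@$(MODEL_ID)"}'
)

def expand(template, env):
    """Replace each $(NAME) with env[NAME]; leave unknown references untouched."""
    return re.sub(
        r"\$\(([A-Za-z_][A-Za-z0-9_]*)\)",
        lambda m: env.get(m.group(1), m.group(0)),
        template,
    )

env = {
    "KV_EVENTS_ENDPOINT": "tcp://kv-events.example:5557",  # hypothetical value
    "POD_IP": "10.0.0.12",                                 # hypothetical value
    "POD_PORT": "8000",                                    # hypothetical value
    "MODEL_ID": "google/gemma-4-31b-it",                   # from runtime.env
}

config = json.loads(expand(TEMPLATE, env))
print(config["topic"])  # kv@10.0.0.12:8000@google/gemma-4-31b-it
```

With MODEL_ID supplied per overlay, the topic changes with the model, which a hardcoded upstream value would not allow.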

MAX_MODEL_LEN=32768
MODEL_ID=google/gemma-4-31b-it
MODEL_NAME=Gemma-4-31B-it
TENSOR_PARALLEL_SIZE=2

Member

Why are 2 chips needed if the approximated prefix cache you used needs only one?

Collaborator Author

Precise prefix cache routing makes vLLM pin more memory than optimized-baseline and predicted latency routing, so we get this error when we try to spin up precise prefix cache on 1 accelerator:

ValueError: To serve at least one request with the models's max seq len (32768), (27.5 GiB KV cache is needed, which is larger than the available KV cache memory (27.19 GiB). Based on the available memory, the estimated maximum model length is 32384. Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine. See https://docs.vllm.ai/en/latest/configuration/conserving_memory/ for more details."

The workaround is to reduce max_model_len; it works when I change it from 32768 to 30000. Do you want to strictly keep it on one accelerator by changing max_model_len instead of using 2 accelerators?
