Adding reference architectures for optimized-baseline, precise prefix cache routing, predicted latency routing llm-d well-lit paths by gushob21 · Pull Request #461 · GoogleCloudPlatform/accelerated-platforms

gushob21 · 2026-06-08T15:05:13Z

Adding reference architectures for optimized-baseline, precise prefix cache routing, predicted latency routing llm-d well-lit paths

- Adding Kubernetes manifests for llmd deployment
- Adding Kubernetes manifests for llmd deployment
- Adding Kubernetes manifests for llmd deployment
- Adding Kubernetes manifests for llmd deployment
- fixing CI failures and removing a unwanted directory
- adding deploy and teardown scripts for llmd optimized-baseline
- fixing cspell CI failure
- Refining READE for llmd baseline optimization architecture
- Refactoring optimized-baseline reference architecture and adding reference architecture for precise-prefix-cache
- Fixing cspell CI pipeline
- I can sort the file in first attempt. Fixing it now.
- minor fix in README
- moving llmd_model_id and llmd_accelerator_type to llmd shared variable files
- editing README for llmd optimized baseline and precise prefix cache
- updating runtime.ev for gemma4 on g4 and update README
- Adding reference architecture for llm-d predicted latency routing
- fixing typo
- refining READE files for llm architectures
- Minute readme fixes
- fixing typos in readme
- Adding reference architectures for optimized-baseline, precise prefix cache routing, predicted latency routing llm-d well-lit paths

fernandorubbo

Please fix the precise prefix cache configurations that use 2 chips.

The rest of the comments are minor. Let's discuss them offline.

fernandorubbo · 2026-06-09T00:10:25Z

+
+  Valid values for `ACCELERATOR` are:
+
+  - `h100`


What about TPU v6e?

I also noticed that the list includes `h100`, but it looks like the h100 implementation is still pending.

I have not been able to test them due to a stockout, which is why I left the h100 manifests out in the first place. I have added them now. TPU support will follow once we confirm that the architectures work as expected, after we have implemented benchmarking.

fernandorubbo · 2026-06-09T00:17:12Z

+## Architecture
+
+This guide is an implementation of
+[llm-d optimized baseline well-lit path](https://github.com/llm-d/llm-d/tree/main/guides/optimized-baseline).


Should we mention that this deployment uses the approximated prefix cache, and link to where the user can learn the difference between approximated, precise, and predicted?

I also think we could add links to the other options here.

fernandorubbo · 2026-06-09T00:23:07Z

+
+## Prerequisite
+
+This architecture and workflow assumes that the reader is familiar with the


All 3 files are 90% identical. Keeping this note here so we can discuss whether we should have only one, to avoid the maintenance burden.

I intentionally kept them separate, for these reasons:

- If one component changes later, we don't want to impact the other two.
- One could spin up all three reference architectures from accelerated-platforms in parallel if we have separate code for them.

fernandorubbo · 2026-06-09T00:35:38Z

+      - runtime.env
+    name: runtime
+
+nameSuffix: -rtx-pro-6000-qwen3-32b


The only difference I see from the other kustomization.yaml files (except the base one) is the name of the model. That being said, couldn't this entire file live in the base, with only the suffix here? Or, even better, couldn't the suffix be built from variables, as you do in other parts?

The issue is that if we move this section to ../base/kustomization.yaml, we would have to provide relative paths to runtime.env and patch-resource.yaml there, because of how overlays work. These two files hold configuration that differs per model-accelerator combination, so they should live in their specific folders. Moving the section would also mean that Kustomize scans and sources all of those files even when we want to run it for just one model-accelerator combination. The trade-off is not worth it, IMO.

fernandorubbo · 2026-06-09T00:39:00Z

+# See the License for the specific language governing permissions and
+# limitations under the License.
+---
+apiVersion: apps/v1


Same here. This is the same file as the previous one.

Should we create a hierarchy like

base/chip/model

Today you have a base that only points to the main llm-d, and then chip-models, which causes a lot of the code to be repeated.

In particular, I would move everything possible into base, then everything possible into chip, and leave only the model-specific bits to the model to mitigate this issue.

I am following how the accelerated-platforms repo currently handles kustomization deployments and keeping this architecture in line with that. If we change it in one place, we should change it everywhere; otherwise we will end up with inconsistent patterns across the repo.

Looking at the current files, only patch-nodeselector.yaml may remain the same for a given accelerator, so it may look duplicated. We could place this YAML under base/chip as you suggest. The problem occurs when we use a model that cannot be accommodated on the default CCC specified in patch-nodeselector.yaml. For example, the default is gpu-rtx-pro-6000-96gb-x1, but bigger models may need gpu-rtx-pro-6000-96gb-x2. In that case, we would have to apply another nodeselector patch inside the model directory, which I think makes the architecture hard to follow. IMO, keeping a separate directory per model-accelerator combination is more user friendly and is a good trade-off for some duplicate code.

fernandorubbo · 2026-06-09T00:40:08Z

+            limits:
+              cpu: "10"
+              memory: 128G
+              nvidia.com/gpu: "2"


Precise prefix cache routing makes vLLM pin more memory than optimized-baseline and predicted latency routing, so we get this error when we try to spin up precise prefix cache on 1 accelerator:

ValueError: To serve at least one request with the models's max seq len (32768), (27.5 GiB KV cache is needed, which is larger than the available KV cache memory (27.19 GiB). Based on the available memory, the estimated maximum model length is 32384. Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine. See https://docs.vllm.ai/en/latest/configuration/conserving_memory/ for more details."

The workaround is to reduce max_model_len, for example from 32768 to 30000. Do you want to strictly keep it on one accelerator by changing max_model_len instead of using 2 accelerators?
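To make the numbers in the error above easier to reason about, here is a minimal Python sketch of the arithmetic. It assumes the KV cache requirement scales linearly with the sequence length and that the estimate is rounded down to a multiple of a 16-token block; both are simplifying assumptions for illustration, not a description of how vLLM computes its estimate. The function names are hypothetical.

```python
# Values taken from the error message in this thread.
NEEDED_GIB = 27.5       # KV cache needed for one request at max_model_len
AVAILABLE_GIB = 27.19   # KV cache memory available on one accelerator
MAX_MODEL_LEN = 32768


def estimate_max_model_len(needed_gib, available_gib, max_model_len, block_size=16):
    """Scale max_model_len by available/needed KV cache (assumed linear),
    then round down to a whole number of blocks (block_size is an assumption)."""
    scaled = int(max_model_len * available_gib / needed_gib)
    return scaled // block_size * block_size


def kv_cache_needed_gib(needed_gib_at_max, max_model_len, new_len):
    """KV cache needed for one request of new_len tokens, assuming linear scaling."""
    return needed_gib_at_max * new_len / max_model_len


estimated_len = estimate_max_model_len(NEEDED_GIB, AVAILABLE_GIB, MAX_MODEL_LEN)
needed_at_30000 = kv_cache_needed_gib(NEEDED_GIB, MAX_MODEL_LEN, 30000)

print(estimated_len)
print(round(needed_at_30000, 2), needed_at_30000 <= AVAILABLE_GIB)
```

Under these assumptions, a max_model_len of 30000 leaves roughly 2 GiB of KV cache headroom on a single accelerator, which is consistent with the workaround described above.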

fernandorubbo · 2026-06-09T00:41:09Z

+            - "--max-model-len=$(MAX_MODEL_LEN)"
+            - "--gpu-memory-utilization=$(GPU_MEMORY_UTILIZATION)"
+            - "--kv-events-config"
+            - '{"enable_kv_cache_events":true,"publisher":"zmq","endpoint":"$(KV_EVENTS_ENDPOINT)","topic":"kv@$(POD_IP):$(POD_PORT)@$(MODEL_ID)"}'


Why is this needed?

To overlay the model ID; upstream has it hardcoded.
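For readers unfamiliar with the flag, the following Python sketch shows what the `--kv-events-config` value from the diff looks like once the `$(...)` variables are substituted, and why a parameterized `MODEL_ID` matters: the model ID is part of the topic string. The endpoint, pod IP, and port below are made-up placeholders, and `build_kv_events_config` is a hypothetical helper; in the manifest itself the substitution is done through the container's environment variables, not by code like this.

```python
import json


def build_kv_events_config(endpoint, pod_ip, pod_port, model_id):
    """Build the compact JSON string passed to --kv-events-config.
    The keys mirror the ones in the diff; the values are supplied by the caller."""
    config = {
        "enable_kv_cache_events": True,
        "publisher": "zmq",
        "endpoint": endpoint,
        # The topic embeds the model ID, so a hardcoded ID would be wrong
        # for any overlay that serves a different model.
        "topic": f"kv@{pod_ip}:{pod_port}@{model_id}",
    }
    return json.dumps(config, separators=(",", ":"))


# Placeholder values for illustration only.
kv_events_config = build_kv_events_config(
    endpoint="tcp://example-indexer:5557",
    pod_ip="10.0.0.12",
    pod_port=8000,
    model_id="google/gemma-4-31b-it",
)
print(kv_events_config)
```

Swapping the `model_id` argument changes only the topic suffix, which is the part the overlay needs to control per model.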

fernandorubbo · 2026-06-09T00:42:16Z

+MAX_MODEL_LEN=32768
+MODEL_ID=google/gemma-4-31b-it
+MODEL_NAME=Gemma-4-31B-it
+TENSOR_PARALLEL_SIZE=2


Why are 2 chips needed if the approximated prefix setup used only one?

Precise prefix cache routing makes vLLM pin more memory than optimized-baseline and predicted latency routing, so we get this error when we try to spin up precise prefix cache on 1 accelerator:

ValueError: To serve at least one request with the models's max seq len (32768), (27.5 GiB KV cache is needed, which is larger than the available KV cache memory (27.19 GiB). Based on the available memory, the estimated maximum model length is 32384. Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine. See https://docs.vllm.ai/en/latest/configuration/conserving_memory/ for more details."

The workaround is to reduce max_model_len, for example from 32768 to 30000. Do you want to strictly keep it on one accelerator by changing max_model_len instead of using 2 accelerators?

gushob21 and others added 2 commits June 8, 2026 14:59

Merge branch 'main' into gushob-refactor-llmd

e051f05

syeda-anjum self-requested a review June 8, 2026 16:58

fernandorubbo reviewed Jun 9, 2026

gushob21 added 5 commits June 10, 2026 15:04

Adding kustomize manifests for h100 for llmd architectures

9e08c7b

Adding manifests to H200 and TPU v6e

feb895f

fixing overlays for tpu architectures for llmd

5eb01d6

Refactoring llmd deploy scripts for TPU guide

3444a7c

Updating code to fix llmd deployment on tpu

97d2f75

