Adding reference architectures for optimized-baseline, precise prefix cache routing, predicted latency routing llm-d well-lit paths#461
Adding Kubernetes manifests for llmd deployment
Adding Kubernetes manifests for llmd deployment
Adding Kubernetes manifests for llmd deployment
Adding Kubernetes manifests for llmd deployment
fixing CI failures and removing a unwanted directory
adding deploy and teardown scripts for llmd optimized-baseline
fixing cspell CI failure
Refining READE for llmd baseline optimization architecture
Refactoring optimized-baseline reference architecture and adding reference architecture for precise-prefix-cache
Fixing cspell CI pipeline
I can sort the file in first attempt. Fixing it now.
minor fix in README
moving llmd_model_id and llmd_accelerator_type to llmd shared variable files
editing README for llmd optimized baseline and precise prefix cache
updating runtime.ev for gemma4 on g4 and update README
Adding reference architecture for llm-d predicted latency routing
fixing typo
refining READE files for llm architectures
Minute readme fixes
fixing typos in readme
Adding reference architectures for optimized-baseline, precise prefix cache routing, predicted latency routing llm-d well-lit paths
fernandorubbo left a comment

Please fix the precise prefix cache configurations that use 2 chips.
The rest of the comments are minor. Let's discuss them offline.
| Valid values for `ACCELERATOR` are:
|
| - `h100`
I also noticed that the list includes h100, but the h100 implementation still appears to be pending.

I have not been able to test on h100 due to a stockout, which is why I left the h100 manifests out in the first place. I have added them now. TPU will follow once we have implemented benchmarking and confirmed that the architectures work as expected.
| ## Architecture
|
| This guide is an implementation of
| [llm-d optimized baseline well-lit path](https://github.com/llm-d/llm-d/tree/main/guides/optimized-baseline).
Should we mention that this deployment uses the Approximated Prefix Cache, and link to where the user can learn the difference between approximated, precise, and predicted?
I also think we could add links to the other options here.
| ## Prerequisite
|
| This architecture and workflow assumes that the reader is familiar with the
All 3 files are 90% identical. Keeping this note here so we can discuss whether we should have only one, to avoid the maintenance burden.

I intentionally kept them separate, for these reasons:
- If one component changes later, we don't want to impact the other two.
- With separate code for each, one could spin up all three reference architectures from accelerated-platform in parallel.
| - runtime.env
| name: runtime
|
| nameSuffix: -rtx-pro-6000-qwen3-32b
The only difference I see from the other kustomization.yaml files (except the base one) is the name of the model. That being said, couldn't this entire file live in the base, with only the suffix here? Or even better, could the suffix be built from variables, as you do in other parts?

The issue is that if we move this section to ../base/kustomization.yaml, we will have to provide the relative paths to runtime.env and patch-resource.yaml there, because of how overlays work. These two files hold different configuration for different model-accelerator combinations, so they should live in their specific folders. But that means Kustomize will scan and source all the files even when we want to run it for just one model-accelerator combination. The trade-off is not worth it, IMO.
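For reference, the "suffix built from variables" idea raised in this thread could be sketched as follows. This is only an illustration: the helper name `build_name_suffix`, the normalization rules, and the model ID `Qwen/Qwen3-32B` are assumptions and do not come from this PR; only the resulting suffix `-rtx-pro-6000-qwen3-32b` appears in the diff.

```python
import re


def build_name_suffix(accelerator: str, model_id: str) -> str:
    """Build a nameSuffix-style string such as '-rtx-pro-6000-qwen3-32b'.

    Hypothetical helper: the PR hardcodes the suffix in each overlay.
    """
    # Keep only the last path segment of the model ID (drop the org prefix).
    model = model_id.rsplit("/", 1)[-1]
    parts = []
    for raw in (accelerator, model):
        # Lowercase, then collapse every run of non-alphanumerics into one '-'.
        slug = re.sub(r"[^a-z0-9]+", "-", raw.lower()).strip("-")
        parts.append(slug)
    return "-" + "-".join(parts)


# Assumed inputs; the model ID is a guess that matches the suffix in the diff.
print(build_name_suffix("rtx-pro-6000", "Qwen/Qwen3-32B"))
```

A script like this would sit in the deploy tooling rather than in Kustomize itself, so it does not change the relative-path concern described in the reply above.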
| # See the License for the specific language governing permissions and
| # limitations under the License.
| ---
| apiVersion: apps/v1
Same here. This is the same file as the previous one.
Should we create a hierarchy like
base/chip/model
Today you have a base that only points to the main llm-d, and then chip-models, which causes a lot of the code to be repeated.
In particular, I would move everything possible into base, then everything possible into chip, and leave only the model-specific bits to the model, to mitigate this issue.

I am following how the accelerated platform repo currently handles Kustomize deployments and keeping this architecture in line with that. If we change it in one place, we should change it everywhere; otherwise we will end up with inconsistent patterns across the repo.

Looking at the current files, only patch-nodeselector.yaml may remain consistent for a given accelerator, so it may seem duplicated. We could place this YAML under base/chip as you suggest. The problem occurs when we use a model that cannot be accommodated on the default CCC specified in patch-nodeselector.yaml: e.g., the default is gpu-rtx-pro-6000-96gb-x1, but bigger models may need gpu-rtx-pro-6000-96gb-x2. In that case, we would have to apply another nodeselector patch inside the model directory, which I think would make the architecture hard to follow. IMO, keeping separate model-accelerator combinations is more user-friendly and is a good trade-off for some duplicated code.
| limits:
|   cpu: "10"
|   memory: 128G
|   nvidia.com/gpu: "2"
Precise prefix cache routing makes vLLM pin more memory than optimized-baseline and predicted latency routing, so we get this error when we try to spin up precise prefix cache on 1 accelerator:
ValueError: To serve at least one request with the models's max seq len (32768), (27.5 GiB KV cache is needed, which is larger than the available KV cache memory (27.19 GiB). Based on the available memory, the estimated maximum model length is 32384. Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine. See https://docs.vllm.ai/en/latest/configuration/conserving_memory/ for more details."
The workaround is to reduce max_model_len, for example from 32768 to 30000. Do you want to strictly keep it on one accelerator by changing max_model_len instead of using 2 accelerators?
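The numbers in the error can be sanity-checked with a small back-of-the-envelope calculation. The sketch below assumes that the KV cache needed for one request scales linearly with `max_model_len`, and that the estimate is rounded down to a block size of 16 tokens; both are assumptions for illustration, not a description of vLLM's internal logic. Because the logged GiB values are rounded, agreement with the log should be read as approximate.

```python
import math

# Values copied from the error message above.
GIB_NEEDED = 27.5       # KV cache needed at max_model_len=32768
GIB_AVAILABLE = 27.19   # KV cache memory available on one accelerator
MAX_MODEL_LEN = 32768

# Assumption: tokens are allocated in blocks of 16.
BLOCK_SIZE = 16


def kv_cache_gib(model_len: int) -> float:
    """KV cache for one request of model_len tokens, assuming linear scaling."""
    return GIB_NEEDED * model_len / MAX_MODEL_LEN


def estimated_max_model_len(available_gib: float) -> int:
    """Largest block-aligned length whose KV cache fits in available_gib."""
    tokens = available_gib / GIB_NEEDED * MAX_MODEL_LEN
    return math.floor(tokens / BLOCK_SIZE) * BLOCK_SIZE


print(estimated_max_model_len(GIB_AVAILABLE))  # 32384
print(round(kv_cache_gib(30000), 2))           # 25.18
```

Under these assumptions, a `max_model_len` of 30000 needs roughly 25.18 GiB, which leaves about 2 GiB of headroom on a single accelerator.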
| - "--max-model-len=$(MAX_MODEL_LEN)"
| - "--gpu-memory-utilization=$(GPU_MEMORY_UTILIZATION)"
| - "--kv-events-config"
| - '{"enable_kv_cache_events":true,"publisher":"zmq","endpoint":"$(KV_EVENTS_ENDPOINT)","topic":"kv@$(POD_IP):$(POD_PORT)@$(MODEL_ID)"}'
This is to overlay the model ID; upstream has it hardcoded.
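To make the effect of the overlay concrete, the sketch below mimics the `$(VAR)` expansion that Kubernetes performs on container args and shows the resulting `--kv-events-config` value. The function name, endpoint, pod IP, and port are hypothetical placeholders; the model ID is the one from the runtime.env in this PR.

```python
import json


def kv_events_config(endpoint: str, pod_ip: str, pod_port: int, model_id: str) -> str:
    """Render the JSON passed to --kv-events-config for one pod.

    Illustrative only: in the manifest, the values come from environment
    variables expanded by Kubernetes, not from Python.
    """
    config = {
        "enable_kv_cache_events": True,
        "publisher": "zmq",
        "endpoint": endpoint,
        # The topic embeds the pod address and the model ID, so the model ID
        # must track the overlay rather than a hardcoded upstream value.
        "topic": f"kv@{pod_ip}:{pod_port}@{model_id}",
    }
    # Compact separators match the single-line form used in the manifest.
    return json.dumps(config, separators=(",", ":"))


# Placeholder endpoint and pod address; MODEL_ID taken from runtime.env.
print(kv_events_config("tcp://example-endpoint:5557", "10.0.0.12", 8000,
                       "google/gemma-4-31b-it"))
```

With a hardcoded model ID, an overlay for a different model would publish events under a topic naming the wrong model, which is what the parameterized `$(MODEL_ID)` avoids.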
| MAX_MODEL_LEN=32768
| MODEL_ID=google/gemma-4-31b-it
| MODEL_NAME=Gemma-4-31B-it
| TENSOR_PARALLEL_SIZE=2
Why are 2 chips needed if the approximated prefix deployment used only one?

Precise prefix cache routing makes vLLM pin more memory than optimized-baseline and predicted latency routing, so we get this error when we try to spin up precise prefix cache on 1 accelerator:
ValueError: To serve at least one request with the models's max seq len (32768), (27.5 GiB KV cache is needed, which is larger than the available KV cache memory (27.19 GiB). Based on the available memory, the estimated maximum model length is 32384. Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine. See https://docs.vllm.ai/en/latest/configuration/conserving_memory/ for more details."
The workaround is to reduce max_model_len, for example from 32768 to 30000. Do you want to strictly keep it on one accelerator by changing max_model_len instead of using 2 accelerators?