Add nvidia.com/gpu toleration to Ray Serve GPU workloads #18
nebari-infrastructure-core auto-taints AWS GPU node groups with nvidia.com/gpu=true:NoSchedule, and EKS has no admission controller to inject a matching toleration. Without one, GPU-requesting Ray Serve head/worker pods stop scheduling onto GPU nodes once the taint lands. Inject the nvidia.com/gpu toleration (operator: Exists, so it matches any taint value) on a Ray group's pod template when its resources request nvidia.com/gpu, via a new nebari-rayserve.tolerations helper. Pods that don't request a GPU render no tolerations and are unchanged. Explicit head.tolerations / worker.tolerations are appended. Closes #14
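To make the "operator: Exists matches any taint value" point concrete, here is a minimal Python sketch of how a toleration is matched against a taint. It is a simplified model of the Kubernetes matching rules, not the scheduler's code. The names Taint, Toleration, and tolerates are illustrative, and the effect: NoSchedule on the injected toleration is an assumption based on the taint's effect.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str


@dataclass(frozen=True)
class Toleration:
    key: str = ""
    operator: str = "Equal"  # Kubernetes defaults to Equal when the operator is omitted
    value: str = ""
    effect: str = ""  # an empty effect matches every taint effect


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Simplified model of Kubernetes toleration/taint matching."""
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if toleration.key and toleration.key != taint.key:
        return False
    if toleration.operator == "Exists":
        return True  # the taint value is ignored
    return toleration.value == taint.value


# Taint applied by nebari-infrastructure-core, as described in the PR.
nic_taint = Taint("nvidia.com/gpu", "true", "NoSchedule")

# Assumed shape of the injected toleration (effect is a guess).
exists = Toleration(key="nvidia.com/gpu", operator="Exists", effect="NoSchedule")

# A value-pinned toleration that stops matching if the taint value differs.
equal_wrong_value = Toleration(
    key="nvidia.com/gpu", operator="Equal", value="present", effect="NoSchedule"
)

print(tolerates(exists, nic_taint))             # True
print(tolerates(equal_wrong_value, nic_taint))  # False
```

With Exists, the toleration keeps matching even if the taint value later changes from "true" to something else, which is why it is the safer default here.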
marcelovilla
left a comment
LGTM @tylerpotts 🚀 !
Tested this on a live EKS cluster deployed via nebari-dev/nebari-infrastructure-core#370.
I deployed the Helm chart from this branch and added a GPU worker. I can confirm that the worker (which requests GPU resources) was scheduled on a GPU node, while the head (which does not request GPU resources) was not.
I left a small comment, but I'm approving. Up to you whether you want to address it.
Only inject the nvidia.com/gpu toleration when the group config does not already define a toleration for that key, so a user-provided toleration acts as an intentional override rather than producing a duplicate entry.
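The suggestion above can be sketched in Python as a model of the helper's decision logic. This is not the Helm template from the PR. The function names, the respect_user_override flag, the effect: NoSchedule field, and the assumption that explicit tolerations are rendered even for non-GPU groups are all illustrative.

```python
GPU_KEY = "nvidia.com/gpu"

# Assumed shape of the injected toleration; the effect is a guess based on the taint.
GPU_TOLERATION = {"key": GPU_KEY, "operator": "Exists", "effect": "NoSchedule"}


def requests_gpu(group: dict) -> bool:
    """True when resources.limits or resources.requests mentions nvidia.com/gpu."""
    resources = group.get("resources") or {}
    return any(GPU_KEY in (resources.get(kind) or {}) for kind in ("limits", "requests"))


def render_tolerations(group: dict, respect_user_override: bool = True) -> list:
    """Return the tolerations a group (head or worker) would render."""
    user = list(group.get("tolerations") or [])
    if not requests_gpu(group):
        return user  # nothing injected for non-GPU groups
    user_covers_gpu = any(t.get("key") == GPU_KEY for t in user)
    if respect_user_override and user_covers_gpu:
        return user  # reviewer's suggestion: the user's entry wins
    return [dict(GPU_TOLERATION)] + user  # PR behavior: inject, then append explicit ones


worker = {
    "resources": {"limits": {GPU_KEY: 1}},
    "tolerations": [
        {"key": GPU_KEY, "operator": "Equal", "value": "true", "effect": "NoSchedule"}
    ],
}

print(len(render_tolerations(worker, respect_user_override=False)))  # 2
print(len(render_tolerations(worker)))                               # 1
```

With the flag off the worker ends up with two entries for the same key. With it on, only the user's toleration is rendered, so it acts as an override.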
Late drive-by feedback after verifying #14 against an EKS 1.34 cluster running [...]. Two observations worth recording for context:

The premise about EKS turns out to be incorrect

The PR body says "EKS has no admission controller to inject a matching toleration". That isn't what I observed. Reproduction on stock EKS 1.34:

    # kubectl apply -- raw pod, no kubespawner, no manual toleration
    apiVersion: v1
    kind: Pod
    metadata: {name: gpu-no-toleration, namespace: default}
    spec:
      containers:
      - name: c
        image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
        command: ["nvidia-smi"]
        resources: {limits: {nvidia.com/gpu: 1}}

Read back from the apiserver, the persisted pod has [...]

So on managed EKS / GKE / AKS, the chart-side auto-injection here is technically redundant: ERT would do the same job at admission time. On vanilla / kubeadm / on-prem clusters where ERT isn't enabled, this PR is the only thing keeping GPU Ray workloads scheduling.

Why this PR is still correct on EKS

The implementation cleanly composes with ERT for two reasons: [...]

The combination means that on EKS the chart-injected toleration matches what ERT would inject, ERT's idempotency skips the duplicate during admission, and the net pod spec is identical with or without this PR. On non-managed clusters it's load-bearing. No false-positive injection, no override of user intent: exactly the shape one wants.

Pattern worth replicating

The analogous data-science-pack docs (PR nebari-dev/data-science-pack#139) recommend a [...]

Nice work on the helper logic. Leaving the issue verified.
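The composition argument can be illustrated with a toy admission step in Python. This assumes ERT refers to Kubernetes' ExtendedResourceToleration admission plugin and is only a rough model of it: the real plugin's resource-name check and merge logic are more involved, and the admit and is_extended_resource names are hypothetical.

```python
GPU_KEY = "nvidia.com/gpu"


def is_extended_resource(name: str) -> bool:
    # Rough approximation: domain-prefixed names outside kubernetes.io.
    return "/" in name and not name.startswith("kubernetes.io/")


def admit(pod: dict) -> dict:
    """Toy admission step: one Exists/NoSchedule toleration per extended resource."""
    tolerations = list(pod.get("tolerations") or [])
    for container in pod.get("containers", []):
        resources = container.get("resources") or {}
        names = set(resources.get("limits") or {}) | set(resources.get("requests") or {})
        for name in sorted(names):
            if not is_extended_resource(name):
                continue
            wanted = {"key": name, "operator": "Exists", "effect": "NoSchedule"}
            if wanted not in tolerations:  # skip duplicates, so admit is idempotent
                tolerations.append(wanted)
    return {**pod, "tolerations": tolerations}


containers = [{"name": "c", "resources": {"limits": {GPU_KEY: 1}}}]
bare = {"containers": containers}
chart_rendered = {
    "containers": containers,
    # Assumed shape of the chart-injected toleration.
    "tolerations": [{"key": GPU_KEY, "operator": "Exists", "effect": "NoSchedule"}],
}

print(admit(bare) == admit(chart_rendered))  # True
print(admit(admit(bare)) == admit(bare))     # True
```

Under these assumptions the admitted pod is the same whether or not the chart already rendered the toleration, which is the "identical net pod spec" claim above.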
Summary

nebari-infrastructure-core auto-taints AWS GPU node groups with nvidia.com/gpu=true:NoSchedule (nebari-dev/nebari-infrastructure-core#370), and EKS has no admission controller to inject a matching toleration. Without one, GPU-requesting Ray Serve head/worker pods stop scheduling onto GPU nodes once the taint lands.

This injects the toleration onto a Ray group's pod template when its resources request nvidia.com/gpu: [...]

operator: Exists matches any taint value, so it works with NIC's value: "true" and any other.

Changes

- chart/templates/_helpers.tpl: new nebari-rayserve.tolerations helper. Given a group config (.Values.head / .Values.worker), it appends the nvidia.com/gpu toleration when resources.limits / resources.requests contains nvidia.com/gpu, plus any explicit tolerations from the group config. Renders nothing otherwise.
- chart/templates/rayservice.yaml: wired the helper into the head and worker pod templates (only emits a tolerations: block when non-empty).
- chart/values.yaml: added documented tolerations: [] to head and worker.

Acceptance criteria

[...] helm template).

Testing

helm lint passes. helm template checked across three cases (output parsed with PyYAML to confirm valid YAML): [...]

The acceptance criterion that a GPU pod actually schedules onto a tainted node and serves can't be exercised by the GPU-less CI kind cluster. It can be validated on kind without real GPU hardware by tainting a node nvidia.com/gpu=true:NoSchedule and advertising fake nvidia.com/gpu capacity on it, or end-to-end on real GPU infra. Worth confirming before the NIC taint rolls out.

Closes #14
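For the kind-based validation, one possible approach is to taint the node with kubectl and then advertise the fake capacity by sending a JSON Patch to the node's status subresource (for example through kubectl proxy, with Content-Type application/json-patch+json). The Python sketch below only builds the pieces and does not talk to a cluster. The node name kind-worker and the helper names are hypothetical.

```python
import json


def escape_json_pointer(token: str) -> str:
    """Escape one JSON Pointer reference token (RFC 6901): '~' -> '~0', '/' -> '~1'."""
    return token.replace("~", "~0").replace("/", "~1")


def fake_capacity_patch(resource: str, count: int) -> str:
    """Build a JSON Patch body that adds an extended resource to node capacity."""
    path = "/status/capacity/" + escape_json_pointer(resource)
    return json.dumps([{"op": "add", "path": path, "value": str(count)}])


def node_status_path(node_name: str) -> str:
    """API path of the node status subresource that the patch is sent to."""
    return f"/api/v1/nodes/{node_name}/status"


NODE = "kind-worker"  # hypothetical node name; replace with a real kind node

# Step 1: taint the node the same way NIC taints GPU node groups.
print(f"kubectl taint nodes {NODE} nvidia.com/gpu=true:NoSchedule")

# Step 2: PATCH the node status with the body below.
print("PATCH", node_status_path(NODE))
print(fake_capacity_patch("nvidia.com/gpu", 1))
```

The slash in nvidia.com/gpu has to be escaped as ~1 in the patch path, otherwise it is read as a path separator.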