docs: GPU profile toleration for tainted EKS nodes by aktech · Pull Request #139 · nebari-dev/data-science-pack

aktech · 2026-06-19T11:29:54Z

NIC taints EKS GPU nodes with `nvidia.com/gpu=true:NoSchedule`, and EKS does not add a matching toleration automatically, so GPU servers stay Pending. This PR documents adding the toleration to a profile's `kubespawner_override` (already supported, no code change). Workaround for #117.

A follow-up would be to add the toleration automatically whenever a profile uses a GPU image.
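
For readers who have not seen the pattern, here is a minimal sketch of such a profile written as a KubeSpawner `profile_list` entry in Python. The `display_name`, `slug`, and GPU limit are illustrative placeholders and are not taken from this PR's doc; only the toleration's key and effect come from the taint described above.

```python
# Toleration matching the nvidia.com/gpu=true:NoSchedule taint.
# "Exists" tolerates the taint regardless of its value.
gpu_toleration = {
    "key": "nvidia.com/gpu",
    "operator": "Exists",
    "effect": "NoSchedule",
}

# Hypothetical GPU profile: display_name, slug, and the GPU count are placeholders.
gpu_profile = {
    "display_name": "GPU server (example)",
    "slug": "gpu-example",
    "kubespawner_override": {
        # Request one GPU so the pod is only placed on a GPU node.
        "extra_resource_limits": {"nvidia.com/gpu": "1"},
        # Allow the pod onto nodes carrying the GPU taint.
        "tolerations": [gpu_toleration],
    },
}

# In a jupyterhub_config.py this would typically be wired up as:
# c.KubeSpawner.profile_list = [gpu_profile]
```

How the profile list is actually supplied in this pack (Helm values versus Python config) may differ; the structure of the override is the part that matters here.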

oren-openteams · 2026-06-26T13:10:51Z

Drive-by from validating this on an EKS 1.34 cluster with NIC's nvidia.com/gpu=true:NoSchedule taint in place. The toleration recipe in the doc works exactly as written, but the framing has two issues worth fixing before this lands.

1. ExtendedResourceToleration IS available on EKS

The doc says "EKS cannot run the ExtendedResourceToleration admission controller (it is not available on the managed control plane)". That isn't what I observed. On a stock EKS cluster:

- A raw `kubectl apply`-ed pod (not via kubespawner) with `resources.limits.nvidia.com/gpu: 1` and no explicit toleration came back from the apiserver with this toleration auto-injected:
  ```
  {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
  ```
- I listed every mutating webhook on the cluster. None of aws-load-balancer-webhook, pod-identity-webhook, vpc-resource-mutating-webhook, envoy-gateway-topology-injector, cert-manager-webhook, or longhorn-webhook-mutator would inject a GPU toleration. By elimination, the injection has to come from the apiserver's built-in admission chain.
- The behavior matches ERT's documented semantics exactly: an extended-resource request yields a matching toleration.
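
To make the observed mutation concrete, here is a small Python sketch that mimics the documented ERT behavior on a pod manifest held as a dict. It illustrates the semantics only and is not the apiserver's implementation. The `is_extended_resource` check is a deliberately simplified assumption (any domain-prefixed resource outside `kubernetes.io/`), and the function names are hypothetical.

```python
import copy


def is_extended_resource(name: str) -> bool:
    # Simplified assumption: domain-prefixed names outside kubernetes.io
    # (for example "nvidia.com/gpu") count as extended resources.
    return "/" in name and "kubernetes.io/" not in name


def inject_extended_resource_tolerations(pod: dict) -> dict:
    """Return a copy of the pod with one toleration per extended resource."""
    pod = copy.deepcopy(pod)
    spec = pod["spec"]
    resources = set()
    for container in spec.get("containers", []) + spec.get("initContainers", []):
        res = container.get("resources", {})
        # The sketch checks both sections; treat that as a simplification.
        for section in ("requests", "limits"):
            for name in res.get(section, {}):
                if is_extended_resource(name):
                    resources.add(name)
    tolerations = spec.setdefault("tolerations", [])
    for name in sorted(resources):
        toleration = {"key": name, "operator": "Exists", "effect": "NoSchedule"}
        if toleration not in tolerations:
            tolerations.append(toleration)
    return pod


pod = {
    "spec": {
        "containers": [
            {"name": "cuda", "resources": {"limits": {"nvidia.com/gpu": 1}}}
        ]
    }
}
mutated = inject_extended_resource_tolerations(pod)
print(mutated["spec"]["tolerations"])
```

The printed toleration has the same key, operator, and effect as the one returned by the apiserver in the test above.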

So on managed EKS / GKE / AKS the workaround is redundant — GPU profiles will schedule without it. It's still correct and necessary for vanilla / kubeadm / bare-metal clusters where ERT isn't enabled by the operator.

Suggest reframing the doc as "required on clusters where ExtendedResourceToleration is not enabled in the apiserver admission chain (typical for kubeadm-on-bare-metal, kops, on-prem installers); redundant on EKS/GKE/AKS where it's enabled by default."

2. `kubespawner_override.tolerations` REPLACES rather than appends

Side effect worth surfacing: setting tolerations in kubespawner_override calls setattr(spawner, 'tolerations', [...]), which overwrites c.KubeSpawner.tolerations for that profile rather than appending to it. Any global tolerations the chart populated at hub startup (from scheduling.userPods.tolerations) are wiped for the affected profile.

Concretely, on my test cluster the spawned GPU pod's tolerations went from:

hub.jupyter.org/dedicated=user:NoSchedule       (chart, global)
hub.jupyter.org_dedicated=user:NoSchedule       (chart, global)
node.kubernetes.io/not-ready                    (default)
node.kubernetes.io/unreachable                  (default)
nvidia.com/gpu Exists NoSchedule                (ERT-injected)

to:

nvidia.com/gpu Exists NoSchedule                (this PR's explicit)
node.kubernetes.io/not-ready                    (default)
node.kubernetes.io/unreachable                  (default)

Both hub.jupyter.org/dedicated tolerations were silently dropped. On a cluster that uses z2jh's dedicated-user-nodes feature with that taint, applying this workaround would block the GPU profile from scheduling onto user-dedicated nodes — exactly the opposite of what the recipe intends.
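
The overwrite can be reproduced in isolation. The sketch below uses a stand-in `FakeSpawner` class and a hypothetical `apply_override` helper, neither of which is kubespawner code; it only demonstrates that assigning a list through `setattr` discards whatever the attribute held before.

```python
class FakeSpawner:
    """Stand-in for a spawner object; not the real KubeSpawner class."""

    def __init__(self, tolerations):
        self.tolerations = list(tolerations)


def apply_override(spawner, override):
    # Each key is assigned wholesale, so list-valued settings are replaced,
    # not extended.
    for name, value in override.items():
        setattr(spawner, name, value)


# Global tolerations as the chart would populate them at hub startup.
global_tolerations = [
    {"key": "hub.jupyter.org/dedicated", "operator": "Equal",
     "value": "user", "effect": "NoSchedule"},
    {"key": "hub.jupyter.org_dedicated", "operator": "Equal",
     "value": "user", "effect": "NoSchedule"},
]

spawner = FakeSpawner(global_tolerations)
apply_override(
    spawner,
    {"tolerations": [
        {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
    ]},
)

remaining_keys = [t["key"] for t in spawner.tolerations]
print(remaining_keys)  # ['nvidia.com/gpu']
```

Only the GPU toleration survives, which mirrors the before and after listings from the test cluster.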

Mitigations to consider documenting:

- Note in the doc that any `scheduling.userPods.tolerations` need to be repeated in the per-profile block when using this pattern.
- Or recommend `extra_pod_config.tolerations` (which appends at the pod-spec level instead of via the kubespawner trait, though that has its own quirks).
- Or close #117 (Auto-add nvidia.com/gpu toleration to GPU JupyterHub profiles) with a chart-side merge implementation that combines global and per-profile tolerations rather than overriding.
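
As a rough illustration of the first and third options, a merge step could combine the global list with the per-profile list before the override is applied. This is a sketch under assumptions: `merge_tolerations` is a hypothetical helper, not an existing kubespawner or chart function, and where such a merge would live (hub config or chart templates) is left open.

```python
def merge_tolerations(global_tolerations, profile_tolerations):
    """Combine two toleration lists, keeping order and dropping exact duplicates."""
    merged = []
    for toleration in list(global_tolerations) + list(profile_tolerations):
        if toleration not in merged:
            merged.append(toleration)
    return merged


# Global tolerations, as the chart would set them from scheduling.userPods.tolerations.
global_tolerations = [
    {"key": "hub.jupyter.org/dedicated", "operator": "Equal",
     "value": "user", "effect": "NoSchedule"},
    {"key": "hub.jupyter.org_dedicated", "operator": "Equal",
     "value": "user", "effect": "NoSchedule"},
]

# Per-profile addition for GPU nodes.
gpu_tolerations = [
    {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"},
]

# The override now carries the global entries as well as the GPU one.
profile_override = {
    "tolerations": merge_tolerations(global_tolerations, gpu_tolerations),
}
print([t["key"] for t in profile_override["tolerations"]])
```

With the merged list in the override, replacing the spawner attribute no longer drops the dedicated-node tolerations.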

Net

The PR is useful as a workaround for the non-EKS case, but the current framing risks operators applying it on EKS (where it's redundant and silently changes node-affinity behavior). Worth a short revision before merge.

aktech and others added 2 commits June 19, 2026 12:29

docs: document GPU profile toleration for tainted EKS nodes

297c9d8

Merge branch 'main' into docs/gpu-profile-toleration

1c5bb7e

This was referenced Jun 26, 2026

Auto-add nvidia.com/gpu toleration to GPU JupyterHub profiles #117

Open

Add nvidia.com/gpu toleration to Ray Serve GPU workloads nebari-dev/rayserve-pack#18

Merged

aktech wants to merge 2 commits into main from docs/gpu-profile-toleration
