docs: GPU profile toleration for tainted EKS nodes #139

Open
aktech wants to merge 2 commits into main from docs/gpu-profile-toleration

aktech commented Jun 19, 2026

Member

NIC taints EKS GPU nodes with nvidia.com/gpu=true:NoSchedule, and EKS does not add a matching toleration automatically, so GPU servers stay Pending. This PR documents adding the toleration to a profile's kubespawner_override (already supported, no code change). Workaround for #117.
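For illustration, a minimal sketch of what such a profile entry might look like in Python hub configuration. The display name, slug, and GPU limit are hypothetical, not taken from this PR; `tolerations` and `extra_resource_limits` are KubeSpawner options, and the toleration mirrors the taint key quoted above.

```python
# Hypothetical GPU profile. display_name, slug and the GPU limit are
# illustrative assumptions, not values from this PR.
GPU_TOLERATION = {
    "key": "nvidia.com/gpu",
    "operator": "Exists",
    "effect": "NoSchedule",
}

gpu_profile = {
    "display_name": "GPU server",
    "slug": "gpu",
    "kubespawner_override": {
        # Request one GPU so the pod lands on a GPU node.
        "extra_resource_limits": {"nvidia.com/gpu": "1"},
        # Tolerate the taint that keeps ordinary pods off GPU nodes.
        "tolerations": [GPU_TOLERATION],
    },
}

# In a real jupyterhub_config.py this would be registered with something like:
#   c.KubeSpawner.profile_list = [gpu_profile]
```

How the profile is actually wired in depends on the deployment (chart values versus raw hub config), so treat the dict shape as the only part that carries over.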

A follow-up to automate this would be to add the toleration whenever a profile uses a GPU image.

@oren-openteams

Contributor

Drive-by from validating this on an EKS 1.34 cluster with NIC's nvidia.com/gpu=true:NoSchedule taint in place. The toleration recipe in the doc works exactly as written, but the framing has two issues worth fixing before this lands.

1. ExtendedResourceToleration IS available on EKS

The doc says "EKS cannot run the ExtendedResourceToleration admission controller (it is not available on the managed control plane)". That isn't what I observed. On a stock EKS cluster:

  • A pod created directly with kubectl apply (not via kubespawner), with resources.limits.nvidia.com/gpu: 1 and no explicit toleration, came back from the apiserver with this toleration auto-injected:
    {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
  • I listed every mutating webhook on the cluster; none of aws-load-balancer-webhook, pod-identity-webhook, vpc-resource-mutating-webhook, envoy-gateway-topology-injector, cert-manager-webhook, or longhorn-webhook-mutator would inject a GPU toleration. By elimination, the injection has to come from the apiserver's built-in admission chain.
  • Behavior matches ERT's documented semantics exactly: extended-resource request → matching toleration.
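The semantics in the last bullet can be sketched as a small pure-Python emulation. This is not the admission plugin's code. The function names are hypothetical, the extended-resource check is a simplification (a domain-qualified name outside kubernetes.io), and the sketch reads both requests and limits because the apiserver defaults extended-resource requests from limits.

```python
def is_extended_resource(name: str) -> bool:
    # Simplified rule: domain-qualified and not in the kubernetes.io namespace.
    return (
        "/" in name
        and "kubernetes.io/" not in name
        and not name.startswith("requests.")
    )


def inject_extended_resource_tolerations(pod: dict) -> dict:
    """Return a copy of the pod with one toleration per extended resource."""
    spec = pod.get("spec", {})
    names = set()
    for container in spec.get("initContainers", []) + spec.get("containers", []):
        resources = container.get("resources", {})
        for section in ("requests", "limits"):
            for name in resources.get(section, {}):
                if is_extended_resource(name):
                    names.add(name)

    tolerations = list(spec.get("tolerations", []))
    for name in sorted(names):
        toleration = {"key": name, "operator": "Exists", "effect": "NoSchedule"}
        if toleration not in tolerations:  # avoid duplicates
            tolerations.append(toleration)
    return {**pod, "spec": {**spec, "tolerations": tolerations}}


gpu_pod = {
    "spec": {
        "containers": [
            {"name": "main", "resources": {"limits": {"nvidia.com/gpu": 1}}}
        ]
    }
}
mutated = inject_extended_resource_tolerations(gpu_pod)
print(mutated["spec"]["tolerations"])
```

Run against the pod above, the emulation yields the same single toleration that the apiserver returned in the test described in the first bullet.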

So on managed EKS / GKE / AKS the workaround is redundant — GPU profiles will schedule without it. It's still correct and necessary for vanilla / kubeadm / bare-metal clusters where ERT isn't enabled by the operator.

Suggest reframing the doc as "required on clusters where ExtendedResourceToleration is not enabled in the apiserver admission chain (typical for kubeadm-on-bare-metal, kops, on-prem installers); redundant on EKS/GKE/AKS where it's enabled by default."

2. kubespawner_override.tolerations REPLACES rather than appends

Side effect worth surfacing: setting tolerations in kubespawner_override calls setattr(spawner, 'tolerations', [...]), which overwrites c.KubeSpawner.tolerations for that profile rather than appending to it. Any global tolerations the chart populated at hub startup (from scheduling.userPods.tolerations) are wiped for the affected profile.
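A minimal sketch of why this replaces rather than appends, using a stand-in class instead of the real KubeSpawner. `FakeSpawner` and `apply_override` are hypothetical names; they only mirror the setattr-per-key behaviour described above, not kubespawner's actual implementation.

```python
class FakeSpawner:
    """Stand-in for KubeSpawner that holds only the attribute under discussion."""

    def __init__(self, tolerations):
        self.tolerations = list(tolerations)


def apply_override(spawner, override):
    # One setattr per key: whatever the spawner held before is discarded.
    for key, value in override.items():
        setattr(spawner, key, value)
    return spawner


global_tolerations = [
    {"key": "hub.jupyter.org/dedicated", "operator": "Equal",
     "value": "user", "effect": "NoSchedule"},
]
gpu_toleration = {"key": "nvidia.com/gpu", "operator": "Exists",
                  "effect": "NoSchedule"}

spawner = FakeSpawner(global_tolerations)
apply_override(spawner, {"tolerations": [gpu_toleration]})

# Only the per-profile toleration survives; the global one is gone.
print(spawner.tolerations)
```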

Concretely, on my test cluster the spawned GPU pod's tolerations went from:

hub.jupyter.org/dedicated=user:NoSchedule       (chart, global)
hub.jupyter.org_dedicated=user:NoSchedule       (chart, global)
node.kubernetes.io/not-ready                    (default)
node.kubernetes.io/unreachable                  (default)
nvidia.com/gpu Exists NoSchedule                (ERT-injected)

to:

nvidia.com/gpu Exists NoSchedule                (this PR's explicit)
node.kubernetes.io/not-ready                    (default)
node.kubernetes.io/unreachable                  (default)

Both hub.jupyter.org/dedicated tolerations were silently dropped. On a cluster that uses z2jh's dedicated-user-nodes feature with that taint, applying this workaround would block the GPU profile from scheduling onto user-dedicated nodes — exactly the opposite of what the recipe intends.
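To make the scheduling consequence concrete, here is a rough sketch of the standard taint/toleration matching rule applied to the two toleration sets above. The function names are hypothetical, the node's taints are an assumed example of a GPU node that is also dedicated to user pods, and the default not-ready/unreachable tolerations are left out because they do not affect the result.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    # An empty effect or key on the toleration acts as a wildcard.
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False
    key = toleration.get("key", "")
    if key and key != taint["key"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def schedulable(tolerations: list, taints: list) -> bool:
    # Every hard taint must be tolerated by at least one toleration.
    hard = [t for t in taints if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, taint) for tol in tolerations) for taint in hard)


# Assumed node: tainted both for GPUs and as a dedicated user node.
node_taints = [
    {"key": "nvidia.com/gpu", "value": "true", "effect": "NoSchedule"},
    {"key": "hub.jupyter.org/dedicated", "value": "user", "effect": "NoSchedule"},
]

gpu = {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
dedicated = {"key": "hub.jupyter.org/dedicated", "operator": "Equal",
             "value": "user", "effect": "NoSchedule"}

before = [dedicated, gpu]  # global tolerations plus the injected one
after = [gpu]              # only the per-profile override survives

print(schedulable(before, node_taints))  # True
print(schedulable(after, node_taints))   # False
```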

Mitigations to consider documenting:

  • Note in the doc that any scheduling.userPods.tolerations need to be repeated in the per-profile block when using this pattern.
  • Or recommend extra_pod_config.tolerations (which appends at the pod-spec level instead of via the kubespawner trait — though that has its own quirks).
  • Or close #117 (Auto-add nvidia.com/gpu toleration to GPU JupyterHub profiles) with a chart-side merge implementation that combines global and per-profile tolerations rather than overriding.
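The first and third mitigations both come down to combining the two lists before they reach the spawner. A minimal sketch of such a merge follows; `merge_tolerations` is a hypothetical helper, not an existing chart or kubespawner function, and where it would be called from is left open.

```python
def merge_tolerations(global_tolerations, profile_tolerations):
    """Combine both lists, keeping order and dropping exact duplicates."""
    merged = []
    for toleration in [*global_tolerations, *profile_tolerations]:
        if toleration not in merged:
            merged.append(toleration)
    return merged


global_tolerations = [
    {"key": "hub.jupyter.org/dedicated", "operator": "Equal",
     "value": "user", "effect": "NoSchedule"},
    {"key": "hub.jupyter.org_dedicated", "operator": "Equal",
     "value": "user", "effect": "NoSchedule"},
]
gpu_toleration = {"key": "nvidia.com/gpu", "operator": "Exists",
                  "effect": "NoSchedule"}

# The per-profile block then carries the globals as well as the GPU entry.
profile_override = {
    "tolerations": merge_tolerations(global_tolerations, [gpu_toleration]),
}
print(len(profile_override["tolerations"]))  # 3
```

Until a chart-side merge exists, the same result can be had by writing the merged list out by hand in the per-profile block, at the cost of keeping it in sync with scheduling.userPods.tolerations.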

Net

The PR is useful as a workaround for the non-EKS case, but the current framing risks operators applying it on EKS, where it is redundant and silently drops the chart's global tolerations for the affected profile. Worth a short revision before merge.
