NIC taints EKS GPU nodes with nvidia.com/gpu=true:NoSchedule and EKS has no auto-toleration, so GPU servers stay Pending. Documents adding the toleration to a profile's kubespawner_override (already supported, no code change). Workaround for #117.
A follow-up to automate this would be to add the toleration whenever a GPU image is used.
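For readers who have not seen the pattern, here is a minimal sketch of a per-profile toleration expressed as KubeSpawner profile data. The profile name, slug, and GPU count are hypothetical, and the exact recipe in the PR's doc is not reproduced here; the toleration fields simply mirror the taint nvidia.com/gpu=true:NoSchedule.

```python
# Toleration matching the taint nvidia.com/gpu=true:NoSchedule.
gpu_toleration = {
    "key": "nvidia.com/gpu",
    "operator": "Equal",
    "value": "true",
    "effect": "NoSchedule",
}

# Hypothetical profile (display_name, slug, and GPU count are assumptions).
# In a jupyterhub_config.py this list would be assigned to
# c.KubeSpawner.profile_list.
profile_list = [
    {
        "display_name": "GPU server (hypothetical)",
        "slug": "gpu",
        "kubespawner_override": {
            # Request one GPU so the pod lands on a GPU node.
            "extra_resource_limits": {"nvidia.com/gpu": "1"},
            # Allow scheduling onto nodes carrying the GPU taint.
            "tolerations": [gpu_toleration],
        },
    },
]
```

Deployments that configure profiles through Helm values would express the same structure in YAML.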
Drive-by from validating this on an EKS 1.34 cluster with NIC's nvidia.com/gpu=true:NoSchedule taint in place. The toleration recipe in the doc works exactly as written, but the framing has two issues worth fixing before this lands.
1. ExtendedResourceToleration IS available on EKS
The doc says "EKS cannot run the ExtendedResourceToleration admission controller (it is not available on the managed control plane)". That isn't what I observed. On a stock EKS cluster:
A raw kubectl apply-ed pod (not via kubespawner) with resources.limits.nvidia.com/gpu: 1 and no explicit toleration came back from the apiserver with this toleration auto-injected:
Listed every mutating webhook on the cluster — none of aws-load-balancer-webhook, pod-identity-webhook, vpc-resource-mutating-webhook, envoy-gateway-topology-injector, cert-manager-webhook, longhorn-webhook-mutator would inject a GPU toleration. By elimination, the injection has to come from the apiserver's built-in admission chain.
So on managed EKS / GKE / AKS the workaround is redundant — GPU profiles will schedule without it. It's still correct and necessary for vanilla / kubeadm / bare-metal clusters where ERT isn't enabled by the operator.
Suggest reframing the doc as "required on clusters where ExtendedResourceToleration is not enabled in the apiserver admission chain (typical for kubeadm-on-bare-metal, kops, on-prem installers); redundant on EKS/GKE/AKS where it's enabled by default."
2. kubespawner_override.tolerations REPLACES rather than appends
Side effect worth surfacing: setting tolerations in kubespawner_override calls setattr(spawner, 'tolerations', [...]), which overwrites c.KubeSpawner.tolerations for that profile rather than appending to it. Any global tolerations the chart populated at hub startup (from scheduling.userPods.tolerations) are wiped for the affected profile.
Concretely, on my test cluster the spawned GPU pod's tolerations went from:
Both hub.jupyter.org/dedicated tolerations were silently dropped. On a cluster that uses z2jh's dedicated-user-nodes feature with that taint, applying this workaround would block the GPU profile from scheduling onto user-dedicated nodes — exactly the opposite of what the recipe intends.
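The replace-versus-append behaviour can be reproduced without a cluster. The sketch below uses a stand-in class, not the real KubeSpawner, and assumes overrides are applied with a plain setattr as described above; the global toleration values are illustrative, not copied from the test cluster.

```python
class FakeSpawner:
    """Stand-in for a spawner object; NOT the real KubeSpawner."""

    def __init__(self, tolerations):
        self.tolerations = list(tolerations)


def apply_override(spawner, override):
    # Assumed behaviour: each override key is assigned with setattr,
    # so list-valued attributes are replaced, not extended.
    for key, value in override.items():
        setattr(spawner, key, value)


# Illustrative global tolerations, as a chart might set at hub startup.
global_tolerations = [
    {"key": "hub.jupyter.org/dedicated", "operator": "Equal",
     "value": "user", "effect": "NoSchedule"},
    {"key": "hub.jupyter.org_dedicated", "operator": "Equal",
     "value": "user", "effect": "NoSchedule"},
]
gpu_toleration = {"key": "nvidia.com/gpu", "operator": "Equal",
                  "value": "true", "effect": "NoSchedule"}

spawner = FakeSpawner(global_tolerations)
apply_override(spawner, {"tolerations": [gpu_toleration]})

# Only the GPU toleration is left; the global entries are gone.
remaining_keys = [t["key"] for t in spawner.tolerations]
print(remaining_keys)
```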
Mitigations to consider documenting:
- Note in the doc that any scheduling.userPods.tolerations need to be repeated in the per-profile block when using this pattern.
- Or recommend extra_pod_config.tolerations (which appends at the pod-spec level instead of via the kubespawner trait, though that has its own quirks).
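As a rough illustration of the first mitigation, the per-profile list can be built by merging the global tolerations with the GPU one, so nothing is lost when the override replaces the trait. The helper name merged_tolerations and the global values are hypothetical; in a real config the global list would have to match whatever scheduling.userPods.tolerations contains.

```python
def merged_tolerations(global_tolerations, extra):
    """Return the global tolerations plus any extra ones, without duplicates."""
    merged = list(global_tolerations)
    for toleration in extra:
        if toleration not in merged:
            merged.append(toleration)
    return merged


# Illustrative copy of the chart-level tolerations (assumed values).
global_tolerations = [
    {"key": "hub.jupyter.org/dedicated", "operator": "Equal",
     "value": "user", "effect": "NoSchedule"},
    {"key": "hub.jupyter.org_dedicated", "operator": "Equal",
     "value": "user", "effect": "NoSchedule"},
]
gpu_toleration = {"key": "nvidia.com/gpu", "operator": "Equal",
                  "value": "true", "effect": "NoSchedule"}

# Override for a hypothetical GPU profile: repeats the globals explicitly.
gpu_override = {
    "tolerations": merged_tolerations(global_tolerations, [gpu_toleration]),
}
```

The cost of this approach is that the per-profile block has to be kept in sync by hand whenever the global tolerations change.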
The PR is useful as a workaround for the non-EKS case, but the current framing risks operators applying it on EKS (where it's redundant and silently changes node-affinity behavior). Worth a short revision before merge.