fix: fall through to cookie when Authorization Bearer is not a jhub-apps wrapper #676
Draft
aktech wants to merge 4 commits into
…pps wrapper

When Envoy Gateway is configured with SecurityPolicy.oidc.forwardAccessToken=true, it injects the upstream user's Keycloak access token (RS256) in the same Authorization: Bearer header that jhub-apps inspects for its own HS256 wrapper. The previous decoder treated any decode failure as a 401, producing an infinite redirect loop in the browser between /jhub-login and the env-listing endpoint.

_get_jhub_token_from_jwt_token now returns None when the input is not our wrapper, and get_current_user iterates over the credential sources (param, header, cookie) and uses the first that decodes. This preserves backwards-compatible behaviour and lets deployments behind Envoy forwardAccessToken authenticate via the still-present jhub-apps cookie.
aktech added a commit to nebari-dev/data-science-pack that referenced this pull request on May 8, 2026
Bundle the head of nebari-dev/jhub-apps#676 in the jupyterhub image so the chart can ship with forwardAccessToken=true by default. The upstream patch makes jhub-apps' get_current_user fall through to the cookie when the Authorization header is not its own HS256 wrapper, removing the OAuth redirect-loop that previously forced this default to false.

- images/jupyterhub/pixi.toml: pin jhub-apps via git rev (PR #676 head), bump pyjwt to >=2.10 in conda deps to satisfy that branch's requirement.
- images/jupyterhub/pixi.lock: regenerated.
- values.yaml: nebariapp.auth.forwardAccessToken defaults to true.
Starlette 1.0 removed the deprecated (name, context) positional form. The
new signature is (request, name, context, ...); calling the old shape now
resolves to name=<context dict> and crashes in get_template with
'TypeError: unhashable type: dict' on /services/japps/{create-app,edit-app,
server-types,success}.
Switch to the new positional form, which works on every Starlette release
since 0.30 (the version that introduced the new signature with a backward-
compat dispatcher) through 1.0+.
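Why the old call shape ends in an unhashable-dict error can be shown with a toy stand-in. The functions below are hypothetical and are not Starlette code; they only mimic a (request, name, context) parameter order and a template lookup keyed by name, as the message above describes.

```python
from typing import Any, Dict, Optional

# Toy template registry keyed by template name (illustrative only).
_TEMPLATE_CACHE: Dict[str, str] = {"create-app.html": "<h1>Create app</h1>"}


def get_template(name: str) -> str:
    # A dict lookup hashes its key, so a dict passed as `name` raises TypeError.
    return _TEMPLATE_CACHE[name]


def template_response(request: Any, name: str, context: Optional[dict] = None) -> str:
    """Toy function with the new-style (request, name, context) order."""
    return get_template(name)


request = object()

# New positional form: every argument lands in the intended slot.
print(template_response(request, "create-app.html", {"title": "Create"}))

# Old (name, context) form: the name binds to `request` and the context dict
# binds to `name`, so the lookup fails before anything is rendered.
try:
    template_response("create-app.html", {"request": request})
except TypeError as exc:
    print("old call shape failed:", exc)
```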
When SecurityPolicy.oidc.forwardAccessToken=true, Envoy injects the user's freshly-refreshed Keycloak access token in Authorization: Bearer on every upstream request. Pre-existing JAppsConfig.conda_envs callables (and any similar custom callables) read auth_state via the hub admin API; that store goes stale on gateways that don't rotate the AccessToken-* cookie, so downstream token-exchange fails ~5 min after login.

Capture the gateway-forwarded RS256 token in get_current_user, surface it on User.access_token, and forward it onto the user dict the conda_envs callable receives. Downstream code can then drive token exchange with a fresh token without depending on the hub's stored auth_state. Token is treated opaquely here; consumers validate against their IdP.
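A rough sketch of that capture-and-forward flow follows. The `User` dataclass, `bearer_token`, and `user_dict_for_callable` are hypothetical stand-ins; only the `access_token` attribute name comes from the message above, and the token is passed through without inspection, matching the "treated opaquely" note.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class User:
    """Hypothetical stand-in for the jhub-apps user model."""

    name: str
    access_token: Optional[str] = None  # gateway-forwarded token, kept opaque


def bearer_token(headers: Mapping[str, str]) -> Optional[str]:
    """Return the token from `Authorization: Bearer <token>`, or None."""
    value = headers.get("Authorization") or headers.get("authorization") or ""
    scheme, _, token = value.partition(" ")
    if scheme.lower() != "bearer" or not token.strip():
        return None
    return token.strip()


def user_dict_for_callable(user: User) -> dict:
    """Build the dict a conda_envs-style callable would receive."""
    data = {"name": user.name}
    if user.access_token:
        # Forwarded as-is; consumers validate it against their own IdP.
        data["access_token"] = user.access_token
    return data


headers = {"Authorization": "Bearer header.payload.signature"}
user = User(name="alice", access_token=bearer_token(headers))
print(user_dict_for_callable(user))
```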
aktech added a commit to nebari-dev/data-science-pack that referenced this pull request on May 14, 2026
The temporary nebari-dev/jhub-apps#676 git pin (bearer fall-through + Starlette 1.0 TemplateResponse fix) plus pyjwt>=2.10 floor were only needed while Envoy was the OAuth client and could inject an RS256 Bearer that confused jhub-apps. With hub doing its own OAuth, Envoy no longer injects to /services/japps/* and neither workaround is needed.

- jhub-apps: git@5d86277 -> ==2025.11.1 (conda-forge release)
- pyjwt: >=2.10,<3 -> >=2.9,<2.10 (matches jhub-apps 2025.11.1 constraint)

Lock regenerated via pixi 0.68.1 in a linux/arm64 container.
aktech added a commit to nebari-dev/data-science-pack that referenced this pull request on May 15, 2026
…etch (#53)

* feat: add shared group directories and NSS wrapper

- Add shared-storage RWX PVC volume support with per-group subPaths mounted at /shared/<group> inside user pods
- Add init container that creates group directories with chmod 2775 (setgid bit) so new files inherit the group and are group-writable
- Add libnss_wrapper.so configuration so whoami/id report the real username instead of 'jovyan', with NB_UMASK=0002
- Refactor pre_spawn_hook into focused single-responsibility functions: _get_user_groups, _setup_shared_storage, _setup_nss_wrapper
- Orchestrator _pre_spawn_hook chains Nebi auth, shared storage, and NSS wrapper; always registered (NSS runs even without shared storage)
- Add sharedStorage.groups allowlist and mountPathPrefix values
- Add jupyterhub.custom.shared-storage-* config keys

* fix: correct NSS GID to 1000 (jovyan) and always create ~/shared dir

- gid default was 100 but z2jh sets pod securityContext GID to 1000; add jovyan:x:1000: group entry so 'groups' command resolves the name
- when shared PVC is disabled, mkdir -p ~/shared instead of removing it so users always see the directory regardless of storage configuration

* fix: store groups in auth_state and always create ~/shared/<group> dirs

- EnvoyOIDCAuthenticator now stores parsed groups in auth_state so the spawner can read them at spawn time (JupyterHub groups table is empty when manage_groups is not enabled)
- refresh_user also re-parses groups from the refreshed IdToken to keep auth_state current
- _pre_spawn_hook always resolves user groups, not only when shared PVC is enabled
- _setup_nss_wrapper creates local ~/shared/<group> dirs per group when no shared PVC is configured, so users always see their group dirs

* feat: add in-cluster NFS server for shared storage on RWO-only clusters

Deploys quay.io/nebari/volume-nfs backed by a single RWO PVC and re-exports it as RWX NFS, enabling shared group directories on providers like Hetzner that only provide ReadWriteOnce storage (hcloud-volumes).

- templates/nfs-server.yaml: NFS Deployment, Service, backend RWO PVC
- templates/shared-pvc.yaml: StorageClass + PV (NFS path) + PVC when nfsServer.enabled; falls back to external RWX PVC otherwise
- values.yaml: sharedStorage.nfsServer.{enabled,storageClass,image} fields

* fix: add DaemonSet to install nfs-common on k3s worker nodes

k3s worker nodes on minimal OS images (Hetzner) ship without nfs-common, causing NFS PV mounts to fail with 'bad option'. The DaemonSet uses nsenter to install nfs-common on the host via apt-get, skipping if already present. Gated on sharedStorage.nfsServer.installClient (default false).

* fix: use alpine:3 sleep for DaemonSet pause container

* fix: NFS PV path /exports not / (overlayfs cannot be exported)

* fix: remove spawner.user.groups (DetachedInstanceError in async); add try-except

_get_user_groups accessed spawner.user.groups (SQLAlchemy lazy-loaded relationship) from an async pre_spawn_hook, causing DetachedInstanceError which silently aborted _setup_shared_storage and _setup_nss_wrapper. Groups are now read only from auth_state (stored by EnvoyOIDCAuthenticator). Each step is individually wrapped in try-except so failures are logged and don't prevent subsequent steps from running.
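The "each step individually wrapped in try-except" pattern from the last commit above might look roughly like the sketch below. It is simplified to synchronous callables (the real hook is async), and `run_steps` plus the two toy steps are hypothetical names reusing the function names mentioned in the message.

```python
import logging
from typing import Callable, List, Sequence

log = logging.getLogger("pre_spawn_hook")


def run_steps(steps: Sequence[Callable[[], None]]) -> List[str]:
    """Run every step; log failures and keep going. Returns names of failed steps."""
    failed: List[str] = []
    for step in steps:
        try:
            step()
        except Exception:
            # exc_info=True keeps the traceback so the failure is not silent.
            log.warning("pre-spawn step %s failed", step.__name__, exc_info=True)
            failed.append(step.__name__)
    return failed


def _setup_shared_storage() -> None:
    raise RuntimeError("shared PVC not configured")  # simulated failure


def _setup_nss_wrapper() -> None:
    print("NSS wrapper configured")


# The NSS step still runs even though the shared-storage step raised.
print(run_steps([_setup_shared_storage, _setup_nss_wrapper]))
```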
* fix: address code review findings (I1-I7, C1-C3, M1, N3)

I1: set c.KubeSpawner.fs_gid=100 explicitly so shared dir file ownership is deterministic (GID 100 = users group) rather than relying on z2jh default
I2: add Helm validation in _helpers.tpl that fails at template time if sharedStorage.enabled and jupyterhub.custom.shared-storage-enabled diverge
I3: use Path(g).name like classic Nebari so /projects/myproj -> myproj, not projects/myproj; deduplicate groups to prevent duplicate mountPaths
I4: add nodeSelector/nodeAffinity support to NFS server Deployment so deployers can pin it to worker nodes and avoid slow RWO PVC reattachment
I7: add argocd.argoproj.io/sync-options: Prune=false to StorageClass and PersistentVolume to prevent accidental deletion during ArgoCD force sync
C1: add chown 0:100 before chmod 2775 in initialize-shared-mounts init container so shared dirs are explicitly owned by GID 100 (users)
C2: use printf instead of echo '...' for NSS file writes to safely handle special characters in usernames without shell quoting issues
C3: deduplicate groups in _get_user_groups (via Path.name already handles most cases; added explicit dedup set for belt-and-suspenders)
M1: log exception with exc_info=True in refresh_user JWT parse failure
N3: merge into existing lifecycle_hooks instead of replacing; warn if a postStart hook already exists before overwriting
Logging: added comprehensive info/debug/warning logging throughout all pre-spawn hook functions for both happy and failure paths

* docs: add JupyterLab profiles design spec

* Revert "docs: add JupyterLab profiles design spec"

This reverts commit 56c22cf.

* feat: add JupyterLab profiles for CPU/RAM resource sizing (closes #31)

Exposes a profile selector in JupyterHub matching the classic Nebari experience. Profiles are defined under jupyterhub.custom.profiles in values.yaml and passed directly to c.KubeSpawner.profile_list via get_config().

Default profiles:
- Small: 1 CPU / 2 GB RAM (default)
- Medium: 4 CPU / 8 GB RAM

kubespawner_override accepts any KubeSpawner trait so GPU profiles, custom images, and node selectors work without code changes in the future. When profiles list is empty, no selector is shown (single-instance mode).

* fix: add descriptive names to default server profiles

Update default profile display_name and description to be more user-friendly (e.g. "Small Instance" with "Stable environment with 1 CPU / 2 GB RAM" instead of just "Small" / "1 CPU / 2 GB RAM").

* split: move JupyterLab profiles to separate branch

Profiles feature (#31) is out of scope for this PR. Moved to local branch feat/jupyterlab-profiles for a follow-up PR.

* test: add k3d-based e2e smoke test

Replaces the inline-bash test workflow with a pytest-based suite that manages the k3d cluster, helm install, and pod-wait lifecycle. Conftest exposes a 'cluster' session fixture and a 'hub_url' fixture that port-forwards proxy-public. CI runs uvx pytest tests/e2e -v with PYTHONUNBUFFERED=1 so live logs stream into the workflow output.

Locally:
  uvx pytest tests/e2e -v # fresh cluster
  K3D_CLUSTER=k3d-nebari-dev uvx pytest tests/e2e -v # reuse

* test: add NFS-backed shared-storage e2e tests

Switches the e2e harness to kind (k3d's busybox-on-scratch nodes lack a package manager, so the chart's nfs-common installer DaemonSet can't provision NFS client tools).

New tests in tests/e2e/test_shared_storage.py exercise the full PR #30 spawn path against a real cluster:
- test_user_in_group_can_write: alice-data writes /shared/data/...
- test_shared_dir_is_group_owned: dir mode is 2775 (setgid)

The DummyAuth shim in tests/e2e/fixtures/test-values.yaml maps the login username to user+groups (alice-data -> alice in [data]) so we can fake auth_state without running Keycloak. Everything else (spawner hook, init container, NSS wrapper, NFS mount) is real.

Chart changes:
- sharedStorage.nfsServer.mountOptions added (default []). Tests pass [nfsvers=3] because kind nodes use overlayfs which fails the volume-nfs image's NFSv4 root export. Production unchanged.

Conftest infrastructure:
- kind cluster fixture with KIND_KEEP=1 reuse
- hosts-entry workaround so kubelet's host mount.nfs can resolve the cluster-internal NFS service FQDN (kind nodes have no cluster DNS in their host resolv.conf)
- structured logging + step counter + per-cycle pod state, events from kubectl describe, and node-level kubelet journal lines
- autouse failure-dump fixture (kubectl get pods/events + hub and singleuser logs)

* refactor(tests/e2e): split conftest into deep modules

Conftest had grown to 577 lines mixing five concerns. Extracted into focused modules each with a small interface and large hidden impl:
  _process.py       subprocess + kubectl helpers + step counter
  _hub.py           HubClient (cookie/login/spawn/stop session)
  _pod_observer.py  wait_for_pod_ready + dedup'd pod-state polling
  _cluster.py       kind lifecycle + helm install + NFS hosts workaround

conftest.py shrinks to 218 lines holding only fixtures that compose the modules. Eliminates duplication of:
- cookie-jar login flow (was repeated in _login_and_spawn + _stop_server)
- two parallel subprocess wrappers (_run + _kctl)
- inline pod-state polling loop in the spawn flow

Tests still pass locally (3 passed in 35s, cluster reused).

* ci: speed up e2e - disable z2jh prePuller + cache kindest/node

prePuller hooks pre-pull singleuser images on every node before helm install completes. On a single-node test cluster this is pure overhead (~30s of blocking wait). Disable in test-values.yaml.

kindest/node image pull was the largest variable cost in CI: 9s on a fast runner, 130s on a slow one. Cache it as a docker tarball keyed on the kind version so subsequent runs are deterministic and fast.

Expected: total CI time drops from variable 3-5min to ~90-120s.
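The group-name handling from review findings I3 and C3 above (leaf name via `Path(g).name`, then deduplication) can be sketched as follows. `normalize_groups` is a hypothetical name, and `PurePosixPath` is used here instead of `Path` only so the result does not depend on the host platform.

```python
from pathlib import PurePosixPath
from typing import Iterable, List


def normalize_groups(raw_groups: Iterable[str]) -> List[str]:
    """Map group paths to leaf names and drop duplicates, keeping first-seen order."""
    seen = set()
    result: List[str] = []
    for group in raw_groups:
        name = PurePosixPath(group).name  # "/projects/myproj" -> "myproj"
        if name and name not in seen:
            seen.add(name)
            result.append(name)
    return result


# Two spellings of the same group collapse into one mount path candidate.
print(normalize_groups(["/projects/myproj", "myproj", "/data"]))
```

Deduplicating after taking the leaf name matters because two distinct group paths can share a leaf, which would otherwise yield duplicate mountPaths.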
* ci: fix kindest/node cache-save step under set -e

Previous heuristic used `[ -n "$img" ] && docker save` which exits 1 when grep finds no image, killing the whole step. Hardcode the v1.32.2 tag (fixed by kind v0.27.0) and use plain commands so set -e only triggers on real failures.

* ci: drop kindest/node cache attempt

GH Actions ubuntu-latest runners come with kindest/node preinstalled (tagged "<none>"). The actual image fetch on cache miss was only ~12s because docker just verifies the digest. The cache step was earning ~5s in the best case and breaking the workflow when docker save couldn't find the v1.32.2 tag (image is referenced by digest, not tag).

prePuller-disable change is keeping its ~30s saving - sufficient win without the cache complexity.

* test(e2e): expand shared-storage suite to full permission contract

9 tests (was 2) covering the per-group /shared/<group> contract end-to-end:
- dir is root:users 2775 (parametrized over groups)
- pod is member of users group; NB_UMASK=0002 in env
- new files inherit gid=100 mode 0664; new subdirs gid=100 mode 02775 (setgid propagation)
- multi-group user sees + writes every group dir
- user does not see groups they don't belong to (mount-time isolation)
- file written by one user is readable + appendable by a groupmate from a separate pod (cross-user collaboration)

Conftest adds PathStat + SpawnedUser.stat()/path_exists() so tests assert against typed fields (mode/uid/gid) instead of parsing stat strings - keeps tests short and behavior-focused.

* ci: cache singleuser image across runs to skip ~73s cold pull

Singleuser image (multi-GB) is currently pulled by kubelet inside the kind node on first user spawn, costing ~73s of every CI run. Pull it once on the runner host, save as tar, cache it (key = image ref so a values.yaml bump auto-invalidates), and side-load with `kind load image-archive`.

Pre-create the cluster in the workflow so the side-load happens before any pod is scheduled - the pytest fixture's ensure_cluster() reuses the existing cluster.

Cache hit: skips the ~90s registry pull entirely; only kind-load (~20s) remains.
Cache miss: pull + save once (~120s), then every subsequent run benefits.

* docs(shared-storage): position external RWX as primary, mark in-cluster NFS as transitional

Addresses comment on issue #29: the bundled `nfsServer.enabled=true` path relies on `quay.io/nebari/volume-nfs:0.8-repack`, a manifest-schema repack of an abandoned upstream image (nebari-dev/nebari-docker-images#230). We should not be carrying that workaround image as the recommended path for a greenfield chart.

The chart already supported bringing your own RWX StorageClass; this change makes that path the documented primary:
- values.yaml: reframe the sharedStorage block. Recommend an external RWX class with provider-specific examples (Longhorn, EFS, Filestore, Azure Files, nfs-subdir-external-provisioner). Add a deprecation note on the nfsServer.image block linking to issue #29.
- README: add a "Shared Storage" section with the same matrix and an explicit pointer to the issue tracking removal of the in-cluster NFS path.

No template changes - the external-RWX path was already rendered when nfsServer.enabled=false. Verified via `helm template` that setting only `sharedStorage.storageClass=longhorn` produces a single RWX PVC and no nfs-server pod.

* docs(shared-storage): correct provider list - only longhorn is NIC-provisioned

Previous commit listed EFS/Filestore/Azure Files as recommended RWX backends. NIC does not provision those - they are separate cloud-managed services no one in NIC has wired up.

NIC's actual storage reality:
  hetzner  : longhorn (longhorn.Install in pkg/provider/hetzner)
  aws      : longhorn (longhorn.Install in pkg/provider/aws)
  existing : longhorn (longhorn.Install in pkg/provider/existing)
  gcp      : standard-rwo (no RWX provisioned)
  azure    : managed-csi (no RWX provisioned)
  local    : (no storage layer)

So the accurate recommendation is just longhorn. Updated values.yaml and README to say so directly. The in-cluster NFS fallback stays - it covers the providers where NIC has not yet wired up an RWX class - with a pointer to issue #29 for tracking removal once that lands everywhere.

* fix(nebi-envs): re-fetch auth_state when access token is stale

EnvoyOIDCAuthenticator stores no refresh_token (Envoy keeps only access_token + id_token in cookies) and the access_token lifetime is ~5 minutes. jhub-apps calls the env-listing callable on every Create App page render, often well after the token captured at login has expired, producing `token-exchange step 2 FAILED: HTTP 400 invalid_request "Invalid token"` and a silent empty selector.

Mirror 01-spawner.py: when access_token has <30s remaining, re-fetch auth_state via the hub API (which refresh_user keeps current with fresh Envoy cookies on browser activity) before exchanging.

* feat: forward access token via Authorization Bearer header

Envoy Gateway v1.6 stores the OAuth2 access token in the AccessToken cookie at OIDC login but does not rotate the cookie content when its internal refresh token rotates the access token. Result: hub reads a frozen-at-login access token from the cookie and downstream calls that need a fresh JWT (jhub-apps env selector → Keycloak token exchange → Nebi) fail with `400 invalid_request "Invalid token"` ~5 min after login.

Three changes:
1. values.yaml - set nebariapp.auth.enforceAtGateway=true and forwardAccessToken=true by default. Envoy Gateway then injects the user's freshly-refreshed access token as `Authorization: Bearer <token>` on every upstream request.
2. templates/nebariapp.yaml - pass enforceAtGateway, forwardAccessToken, and tokenExchange through to the NebariApp CRD so the chart can drive the operator-managed SecurityPolicy.
3. config/jupyterhub/00-gateway-auth.py - `_extract_envoy_cookies` now prefers `Authorization: Bearer` over the `AccessToken-*` cookie. The header is the only always-current source; cookie fallback retained for deployments without forwardAccessToken.

The stale-token re-fetch in 03-nebi-envs.py (commit 5e95e8d) becomes defensive: with this fix, refresh_user captures a fresh access_token on each browser request and the env-listing callable rarely needs to fall back to it.

* fix(values): default forwardAccessToken=false to avoid jhub-apps loop

forwardAccessToken=true makes Envoy inject the user's Keycloak access token as Authorization: Bearer on every upstream request. jhub-apps's get_current_user reads the Authorization header before its own cookie and unconditionally tries to HS256-decode it as the jhub-apps JWT. The Keycloak token is RS256, decode raises InvalidAlgorithmError, authentication fails, browser is redirected to /jhub-login, OAuth round-trips, new cookie is set, next request hits the same path - infinite loop in the UI.

Until jhub-apps is patched to ignore non-HS256 Authorization tokens (or until a different transport delivers the user's fresh access token to the env-listing callable), default to off. The chart still exposes both fields for explicit opt-in.

* feat: pin jhub-apps to PR #676 + default forwardAccessToken=true

Bundle the head of nebari-dev/jhub-apps#676 in the jupyterhub image so the chart can ship with forwardAccessToken=true by default. The upstream patch makes jhub-apps' get_current_user fall through to the cookie when the Authorization header is not its own HS256 wrapper, removing the OAuth redirect-loop that previously forced this default to false.

- images/jupyterhub/pixi.toml: pin jhub-apps via git rev (PR #676 head), bump pyjwt to >=2.10 in conda deps to satisfy that branch's requirement.
- images/jupyterhub/pixi.lock: regenerated.
- values.yaml: nebariapp.auth.forwardAccessToken defaults to true.

* chore: bump hub image to sha-e906f78 (carries patched jhub-apps)

* chore: pin hub image to PR-53 merge-commit sha-9381aab

* chore: bump jhub-apps git pin to include Starlette 1.0 TemplateResponse fix

* chore: bump hub image to sha-ab37dda (Starlette 1.0 TemplateResponse fix)

* test(unit): add pytest harness for jupyterhub config modules

Loads jupyterhub_config.d files (hyphenated, digit-prefixed) by path via importlib spec. FakeConfig records traitlets-style attribute assignments so tests can assert on what each config module wires onto JupyterHub's `c` global without needing a running hub.

Also seeds the harness with .venv-unit/ ignored by git/helm so 'uv venv' can install jupyterhub + oauthenticator for tests without leaking into helm package output.

* feat(auth): switch hub to KeyCloakOAuthenticator (GenericOAuthenticator)

Hub now does its own OAuth dance with Keycloak instead of reading cookies that Envoy Gateway sets at the OIDC filter. JupyterHub's built-in refresh_user uses the stored refresh_token to keep auth_state fresh - no browser hit, no gateway-injected Bearer, no per-caller plumbing.

Fixes the stale-token bug: /services/japps/conda-environments/ returned [] ~5 min after login because Envoy v1.6 doesn't rotate AccessToken-* cookie contents on every request, jhub-apps paths bypass hub, and the env-listing callable read user.auth_state which was frozen at OAuth callback time.

Reads issuer-url / client-id / client-secret from a Secret mounted at /etc/oauth/ (overridable via OAUTH_SECRET_DIR); OAUTH_CALLBACK_URL and OAUTH_EXTERNAL_URL come from the deployment's env. Production wiring is gated on OAUTH_CALLBACK_URL being set, so plain kind deploys keep the chart-default authenticator (dummy).

Logout points at KC's end_session_endpoint with a percent-encoded post_logout_redirect_uri so the upstream session is terminated, not just hub's local cookie.

Replaces the 222-line EnvoyOIDCAuthenticator with a 100-line module behind a single configure() entry point. 12 unit tests cover URL derivation, auth_state/refresh wiring, admin/groups claims, logout URL encoding, and the env-gate.

* chore(chart): flip NebariApp to enforceAtGateway=false, hub OAuth callback

With hub doing its own OAuth (see KeyCloakOAuthenticator), Envoy SecurityPolicy on the hub host adds nothing - its cookie rotation lag was the original cause of the env-list stale-token bug.

- enforceAtGateway: true -> false (operator drops the SecurityPolicy)
- forwardAccessToken: true -> false (no longer relevant; avoid dual-token paths)
- redirectURI: /oauth2/callback -> /hub/oauth_callback (JupyterHub default)

Operator's provisionClient: true is independent of enforceAtGateway, so the KC client + Secret stay provisioned. The redirectURI change drives the operator to update the client's allowed redirect URI.

* revert(deps): drop jhub-apps git pin + relax pyjwt - hub owns OAuth now

The temporary nebari-dev/jhub-apps#676 git pin (bearer fall-through + Starlette 1.0 TemplateResponse fix) plus pyjwt>=2.10 floor were only needed while Envoy was the OAuth client and could inject an RS256 Bearer that confused jhub-apps. With hub doing its own OAuth, Envoy no longer injects to /services/japps/* and neither workaround is needed.

- jhub-apps: git@5d86277 -> ==2025.11.1 (conda-forge release)
- pyjwt: >=2.10,<3 -> >=2.9,<2.10 (matches jhub-apps 2025.11.1 constraint)

Lock regenerated via pixi 0.68.1 in a linux/arm64 container.

* chore: bump hub image to sha-ae0969a (GenericOAuthenticator + jhub-apps 2025.11.1)

Carries the GenericOAuthenticator switch (config/jupyterhub/00-gateway-auth.py) and jhub-apps 2025.11.1 release from conda-forge.

Digest: sha256:e9b481657f34c16b367d402ca1cce79ac64b177dc9eba48f85f35be363958126

* fix(auth): request openid + profile + email + groups scopes

Without explicit scope, GenericOAuthenticator sends no scope= param in the authorize redirect; KC then issues a token that lacks the openid scope, and /userinfo returns 403 at token_to_user. Symptom: 500 on /hub/oauth_callback after the user signs in at KC.

Add a unit test that fails if openid drops out of the scope list.

* chore: bump hub image to sha-ffb035a (openid scope fix)

Digest: sha256:4be08f31306c4da35ceccc390688e02947f02d7eab5fbd1efddca90af8bd00fb

* fix(deps): cap starlette<1 - jhub-apps 2025.11.1 uses legacy TemplateResponse

Starlette 1.0 reordered TemplateResponse positional args to (request, name, ...); jhub-apps 2025.11.1 still calls the 2-arg form, which makes /services/japps/create-app 500 at handle_apps. Pin to the last 0.x release until jhub-apps ships a fix.

* chore: bump hub image to sha-8676046 (starlette<1 cap)

Digest: sha256:8e007b6dc55ffe5f451d016610f0733462de323df5bb2af3235a6f17b22e5ddf

* docs: mark HANDOFF-stale-token.md resolved (GenericOAuthenticator switch)

Tested end-to-end on hetzner via Playwright headless:
- /hub/oauth_callback returns 302 to /hub/home (no 500)
- /services/japps/create-app renders (starlette<1 cap)
- /services/japps/conda-environments/ returns 200
- After 6-min idle, refresh_token grant fires, token stays fresh
- 3-step KC -> Nebi token exchange succeeds end-to-end

* feat(auth): auto-login + KC end-session with id_token_hint

Two UX fixes:
1. auto_login=True
   Hub now 302s /hub/login directly to KC instead of rendering the local form with a 'Sign in with OAuth 2.0' link. Single IdP - no point making the user click through.
2. KeyCloakLogoutHandler
   KC v18+ rejects /protocol/openid-connect/logout when post_logout_redirect_uri is present without id_token_hint. The static logout_redirect_url can't include it (per-user), so install a handler that reads auth_state.id_token at request time and builds the URL. Falls back to no-hint URL if auth_state is missing (legacy session).

* chore: bump hub image to sha-0e393cb (auto_login + KC end-session)

Digest: sha256:93f7139b8775b7a22ac4db313583c83cded449ac77302229e1060f27bce3d6c1

* fix(auth): override LogoutHandler.get to inject id_token_hint

Base LogoutHandler.get() short-circuits to authenticator.logout_redirect_url when auto_login=True, so the prior override of render_logout_page never fired. Move the per-user URL building into get() itself, with default_handle_logout + handle_logout still called so hub's local session state is cleared.

* chore: bump hub image to sha-38305c6 (logout id_token_hint fix)

* fix(auth): monkey-patch LogoutHandler.get so /hub/logout uses our handler

Authenticator-supplied handlers are appended after jupyterhub's defaults in init_handlers, so tornado's first-match routing picks the default LogoutHandler at /logout - our override via get_handlers is a dead route. Monkey-patch the base LogoutHandler.get instead.

* chore: bump hub image to sha-8ff8d0a (logout monkey-patch)

* fix(auth): use OAuthenticator.logout_handler hook (no monkey-patch)

OAuthenticator.get_handlers reads the class-level logout_handler attribute when registering the /logout route. Swap it to our subclass (class attr on KeyCloakOAuthenticator) instead of monkey-patching LogoutHandler.get or duplicating the /logout entry - the latter just appends a second tuple after oauthenticator's own (r'/logout', OAuthLogoutHandler), and tornado's first-match keeps picking the base class.

KeyCloakLogoutHandler subclasses OAuthLogoutHandler and overrides render_logout_page (not get) so the inherited LogoutHandler.get still runs default_handle_logout + handle_logout (token revocation, cookie clear). For that to happen, authenticator.logout_redirect_url is left empty - otherwise LogoutHandler.get short-circuits when auto_login is True and never calls render_logout_page.

* chore: bump hub image to sha-239effb (logout_handler hook fix)

* fix(auth): stash logout pieces on class attr (not via c. traitlets)

Traitlets' config loader rejects unknown attribute names with a warning and never sets the value, so c.KeyCloakOAuthenticator._kc_end_session_url was a no-op - _kc_end_session_url stayed empty on the class default, making the logout URL relative and causing a redirect loop.

* chore: bump hub image to sha-2c816a2 (logout class-attr fix)

* fix(auth): override LogoutHandler.get to capture id_token before user cleared

LogoutHandler.get sets self._jupyterhub_user = None BEFORE calling render_logout_page (jupyterhub/handlers/login.py:89), so reading auth_state from render_logout_page always sees current_user=None. Move the id_token capture into get() before the cleanup runs.

* chore: bump hub image to sha-67880ee (logout id_token capture in get)

* fix(spawner): pin pvc_name_template to claim-{username}

KubeSpawner's default `pvc_name_template` for a *named* server is `claim-{username}--{servername}`, but the chart's home volume mount is hardcoded to `claim-{username}` (so all of a user's servers share a single RWO PVC, co-located on one node via the pod-affinity rule). Without an explicit override the names diverge: KubeSpawner creates a fresh per-server PVC and the pod tries to mount a different per-user PVC.

Users who'd previously launched the default JupyterLab server still had the per-user PVC sitting around and survived; fresh users (e.g. anyone who first interacts with the platform via jhub-apps Create App) hit FailedScheduling: 'persistentvolumeclaim claim-<user> not found' and the pod sits Pending until the hub spawn-timeout (5 min) fires.

Lock the template to `claim-{username}` so ensure + mount converge. Test in tests/unit/test_spawner_storage.py.
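The name divergence behind the pvc_name_template fix above can be illustrated with plain string formatting. This is a toy: `expand` is a hypothetical helper, and real KubeSpawner template expansion also escapes names, which is ignored here. The two template strings are the ones quoted in the commit message.

```python
def expand(template: str, username: str, servername: str = "") -> str:
    """Toy template expansion; real KubeSpawner also escapes/slugifies names."""
    return template.format(username=username, servername=servername)


DEFAULT_NAMED_SERVER_TEMPLATE = "claim-{username}--{servername}"  # quoted in the message above
CHART_MOUNT_TEMPLATE = "claim-{username}"  # what the home volume mount expects

created = expand(DEFAULT_NAMED_SERVER_TEMPLATE, "alice", "my-app")
mounted = expand(CHART_MOUNT_TEMPLATE, "alice", "my-app")
print(created, mounted, created == mounted)  # names differ, so the mount cannot bind

# Pinning the template makes the created and the mounted claim names converge.
pinned = expand(CHART_MOUNT_TEMPLATE, "alice", "my-app")
print(pinned == mounted)
```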
* fix(auth): implement refresh_user to rotate KC refresh_token in auth_state The earlier switch to KeyCloakOAuthenticator (GenericOAuthenticator) set `auth_refresh_age = 240`, expecting JupyterHub to keep auth_state fresh via its built-in refresh_user. But JupyterHub's Authenticator.refresh_user is a no-op stub (returns True) and oauthenticator's GenericOAuthenticator does not override it. So auth_state.refresh_token stays frozen at OAuth-callback time and expires after KC's SSO idle timeout (~30 min by default), at which point nebi-envs's 3-step token exchange fails at step 1 with: invalid_grant: Token is not active and the jhub-apps Create-App "Software Environment" dropdown silently disappears (env list is empty when the exchange fails). Implement refresh_user on KeyCloakOAuthenticator: POST grant_type= refresh_token to KC's token endpoint, persist the rotated tokens back to auth_state via the {"auth_state": ...} return shape, return False on invalid_grant to force re-login, and return True (no-op) on transient HTTP errors. Tests in tests/unit/test_refresh_user.py cover the four return-shape contracts: success, invalid_grant, transient error, no-refresh-token. * fix(e2e): drop japps service_workers 4 -> 1 to unstick hub bootup z2jh's hub waits ~10s for each managed service's HTTP port to bind. The default jhub-apps service_workers is 4 and four uvicorn workers take ~12s to bind on CI runners, so hub crashes with Cannot connect to managed service japps at http://hub:10202 restarts, hits the same timeout, restarts, etc. By the time the e2e fixture's port-forward starts polling /hub/login, hub is still in this crash-loop. The first urlopen with timeout=15s eventually raises TimeoutError unwrapped through HubClient._request (which only catches HTTPError), aborting every test fixture in setup. Mirror the production overlay (gitops/apps/data-science-pack.yaml in openteams-ai/nebari-hetzner) which pins service_workers: 1. 
The hub boots cleanly within seconds and the e2e suite proceeds.

* chore: remove EnvoyOIDC-era stale-token fallbacks made dead by refresh_user

The `if access_token and not refresh_token` branch in 01-spawner.py's _nebi_pre_spawn_hook and 03-nebi-envs.py's get_nebi_environments was a fallback for EnvoyOIDCAuthenticator's auth_state, which never carried a refresh_token (Envoy only stored access_token + id_token in cookies). With KeyCloakOAuthenticator + the new refresh_user() override, auth_state always has a rotating refresh_token, so the branch never fires.

Also remove the now-orphan `_fetch_fresh_auth_state` helper it called, and update two docstrings/comments that still referenced EnvoyOIDCAuthenticator as the source of the groups claim or the hub's OAuth client.

Reword values.yaml comment on `forwardAccessToken: false` to drop the "avoids confusing dual-token paths" framing: there is no Envoy-injected Bearer in the current architecture.

* refactor(auth): bundle KC strings into KeyCloakConfig dataclass

The KC migration left two related concerns scattered:

  * endpoint URLs (authorize/token/userdata/end_session) derived from the issuer were assigned individually to traitlets in configure(), via a small `_kc_urls` dict helper.
  * the per-user logout URL had its inputs spread across the free function `_build_logout_url(end_session_url=, id_token=, post_logout_redirect_uri=)` and TWO stray class attributes (`KeyCloakOAuthenticator._kc_end_session_url`, `KeyCloakOAuthenticator._kc_post_logout_redirect_uri`) that configure() stashed at startup so the logout handler could read them at request time.

Replace both with a single `KeyCloakConfig` frozen dataclass that holds every KC string the chart needs and owns the logout-URL composition as a method.

  * `KeyCloakConfig.build(issuer=..., post_logout_redirect_uri=...)` derives every endpoint URL from the realm issuer.
  * `cfg.build_logout_url(id_token)` composes the end-session URL, omitting `id_token_hint` for legacy sessions (KC v18+ rejects logout without it when `post_logout_redirect_uri` is set).
  * configure() builds one KeyCloakConfig and stashes it on `KeyCloakOAuthenticator.kc_config`; the logout handler reads it via `self.authenticator.kc_config`.

Net effect: one cohesive object instead of two stray class attrs + one free function + one dict helper. The endpoint derivation becomes trivially testable in isolation.

Tests updated in test_keycloak_authenticator.py:

  * `test_configure_attaches_kc_config_to_authenticator_class` replaces the pair of stray-attribute assertions.
  * `test_kc_config_build_logout_url_*` cover the method directly.
  * `test_kc_config_from_issuer_is_pure_and_doesnt_need_configure` pins the classmethod's pure-function semantics.

* chore: untrack HANDOFF-stale-token.md and ignore future handoff notes

HANDOFF*.md files are in-flight working notes between agent sessions; they should never have been checked in. Remove the one that snuck in during the auth saga and add a .gitignore rule so the next one doesn't.

---------

Co-authored-by: Amit Kumar <aktech@users.noreply.github.com>
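The `KeyCloakConfig` refactor described in the commit message above might look roughly like the sketch below. This is not the chart's actual code: the field names, the Keycloak endpoint paths under `/protocol/openid-connect`, and the choice to drop both query parameters when no `id_token` is available are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlencode


@dataclass(frozen=True)
class KeyCloakConfig:
    """Every Keycloak URL the hub needs, derived once from the realm issuer."""

    authorize_url: str
    token_url: str
    userdata_url: str
    end_session_url: str
    post_logout_redirect_uri: str

    @classmethod
    def build(cls, issuer: str, post_logout_redirect_uri: str) -> "KeyCloakConfig":
        # Assumption: the conventional Keycloak OIDC paths under the realm issuer.
        base = issuer.rstrip("/") + "/protocol/openid-connect"
        return cls(
            authorize_url=f"{base}/auth",
            token_url=f"{base}/token",
            userdata_url=f"{base}/userinfo",
            end_session_url=f"{base}/logout",
            post_logout_redirect_uri=post_logout_redirect_uri,
        )

    def build_logout_url(self, id_token: Optional[str]) -> str:
        # Assumption: a legacy session has no id_token, so neither
        # id_token_hint nor post_logout_redirect_uri is sent, because the
        # redirect parameter is rejected without the hint.
        if not id_token:
            return self.end_session_url
        query = urlencode(
            {
                "id_token_hint": id_token,
                "post_logout_redirect_uri": self.post_logout_redirect_uri,
            }
        )
        return f"{self.end_session_url}?{query}"
```

Because the dataclass is frozen and `build` touches no global state, the endpoint derivation can be exercised in a unit test without calling `configure()`.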
Problem
When jhub-apps runs behind Envoy Gateway with `SecurityPolicy.oidc.forwardAccessToken=true`, Envoy injects the user's Keycloak RS256 access token in `Authorization: Bearer …`. The current decoder assumes that header always carries a jhub-apps HS256 wrapper JWT and returns 401 on any decode failure, causing an infinite redirect loop between `/jhub-login` and protected endpoints.

Fix

`_get_jhub_token_from_jwt_token` returns `None` for tokens that aren't a jhub-apps wrapper (instead of raising). `get_current_user` iterates `param → header → cookie` and uses the first source that decodes as our wrapper. The KC RS256 token in the Bearer header is harmlessly skipped; the still-present jhub-apps cookie authenticates the user.

Tests

`tests/tests_unit/test_security.py` covers:

* `None` from the decoder.
* `get_current_user` with RS256 Bearer + valid cookie → authenticates via cookie.
* `get_current_user` with HS256 wrapper Bearer → uses Bearer (preserved).
* `get_current_user` with RS256 Bearer + no cookie → 401.
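
The fall-through described under Fix can be illustrated with a small standard-library sketch. It is not the jhub-apps implementation: the real code presumably uses a JWT library, and `SECRET`, `make_wrapper`, `decode_wrapper`, `strip_bearer`, `resolve_hub_token`, and the use of a `sub` claim to carry the hub token are all hypothetical names and assumptions chosen for this example.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional

SECRET = b"demo-secret"  # hypothetical signing key, for illustration only


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def make_wrapper(hub_token: str) -> str:
    """Build an HS256-signed wrapper carrying the hub token in `sub` (assumed claim)."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps({"sub": hub_token}).encode())
    sig = hmac.new(SECRET, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url(sig)}"


def decode_wrapper(token: Optional[str]) -> Optional[str]:
    """Return the wrapped hub token, or None when this is not our HS256 wrapper."""
    if not token:
        return None
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        if header.get("alg") != "HS256":
            return None  # e.g. a Keycloak RS256 access token
        expected = hmac.new(
            SECRET, f"{header_b64}.{payload_b64}".encode(), hashlib.sha256
        ).digest()
        if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
            return None
        return json.loads(_b64url_decode(payload_b64)).get("sub")
    except (ValueError, TypeError, AttributeError):
        return None  # malformed input is "not ours", not an error


def strip_bearer(header_value: Optional[str]) -> Optional[str]:
    if header_value and header_value.startswith("Bearer "):
        return header_value[len("Bearer "):]
    return None


def resolve_hub_token(
    param: Optional[str] = None,
    authorization: Optional[str] = None,
    cookie: Optional[str] = None,
) -> Optional[str]:
    """Try param, then header, then cookie; the first that decodes wins."""
    for candidate in (param, strip_bearer(authorization), cookie):
        hub_token = decode_wrapper(candidate)
        if hub_token is not None:
            return hub_token
    return None  # the caller would answer 401 here


# A stand-in for a foreign RS256 token: right shape, wrong algorithm.
foreign = ".".join(
    _b64url(part)
    for part in (b'{"alg":"RS256"}', b'{"sub":"kc-user"}', b"not-a-real-signature")
)
cookie = make_wrapper("hub-token-123")
print(resolve_hub_token(authorization=f"Bearer {foreign}", cookie=cookie))
```

The key design point is that the decoder reports "not a wrapper" as `None` instead of raising, so a foreign token in one source never prevents a valid wrapper in a later source from being used.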