fix: fall through to cookie when Authorization Bearer is not a jhub-apps wrapper#676

Draft
aktech wants to merge 4 commits into main from fix/bearer-fallthrough-rs256

aktech (Member) commented May 8, 2026

Problem

When jhub-apps runs behind Envoy Gateway with SecurityPolicy.oidc.forwardAccessToken=true, Envoy injects the user's Keycloak RS256 access token in Authorization: Bearer …. The current decoder assumes that header always carries a jhub-apps HS256 wrapper JWT and returns 401 on any decode failure, causing an infinite redirect loop between /jhub-login and protected endpoints.

Fix

  • _get_jhub_token_from_jwt_token returns None for tokens that aren't a jhub-apps wrapper (instead of raising).
  • get_current_user iterates param → header → cookie and uses the first source that decodes as our wrapper. The KC RS256 token in the Bearer header is harmlessly skipped; the still-present jhub-apps cookie authenticates the user.

Tests

tests/tests_unit/test_security.py covers:

  • Valid HS256 wrapper round-trips inner Hub OAuth token.
  • RS256 / garbage / wrong-secret tokens return None from the decoder.
  • get_current_user with RS256 Bearer + valid cookie → authenticates via cookie.
  • get_current_user with HS256 wrapper Bearer → uses Bearer (preserved).
  • get_current_user with RS256 Bearer + no cookie → 401.

…pps wrapper

When Envoy Gateway is configured with SecurityPolicy.oidc.forwardAccessToken=true
it injects the upstream user's Keycloak access token (RS256) in the same
Authorization: Bearer header that jhub-apps inspects for its own HS256 wrapper.
The previous decoder treated any decode failure as a 401, producing an
infinite redirect loop in the browser between /jhub-login and the env-listing
endpoint.

_get_jhub_token_from_jwt_token now returns None when the input is not our
wrapper, and get_current_user iterates the credential sources (param, header,
cookie) and uses the first that decodes.  This preserves backwards-compatible
behaviour and lets deployments behind Envoy forwardAccessToken authenticate
via the still-present jhub-apps cookie.

aktech added a commit to nebari-dev/data-science-pack that referenced this pull request May 8, 2026
Bundle the head of nebari-dev/jhub-apps#676 in the jupyterhub image so the
chart can ship with forwardAccessToken=true by default. The upstream patch
makes jhub-apps' get_current_user fall through to the cookie when the
Authorization header is not its own HS256 wrapper, removing the OAuth
redirect-loop that previously forced this default to false.

- images/jupyterhub/pixi.toml: pin jhub-apps via git rev (PR #676 head),
  bump pyjwt to >=2.10 in conda deps to satisfy that branch's requirement.
- images/jupyterhub/pixi.lock: regenerated.
- values.yaml: nebariapp.auth.forwardAccessToken defaults to true.

Starlette 1.0 removed the deprecated TemplateResponse(name, context) positional form. The
new signature is (request, name, context, ...); calling the old shape now
resolves to name=<context dict> and crashes in get_template with
'TypeError: unhashable type: dict' on /services/japps/{create-app,edit-app,
server-types,success}.

Switch to the new positional form, which works on every Starlette release
since 0.30 (the version that introduced the new signature with a backward-
compat dispatcher) through 1.0+.
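
The failure mode can be reproduced without Starlette. The sketch below uses a hypothetical `FakeTemplates` class that only mimics the new (request, name, context) argument order and a name-keyed template lookup; it is not Starlette's implementation, and the template name and context are made up for the example.

```python
class FakeTemplates:
    """Stand-in for a template registry; NOT Starlette's implementation."""

    def __init__(self, templates):
        self._templates = templates  # maps template name -> source text

    def get_template(self, name):
        return self._templates[name]  # the dict lookup must hash `name`

    def TemplateResponse(self, request, name, context=None):
        # New positional order: request first, then the template name.
        template = self.get_template(name)
        return template.format(**(context or {}))


templates = FakeTemplates({"create-app.html": "Hello {user}"})
request = object()  # placeholder for the incoming request


def render_new_style():
    return templates.TemplateResponse(request, "create-app.html", {"user": "alice"})


def render_old_style():
    # Old (name, context) call: the context dict lands in the `name` slot,
    # so the lookup tries to hash a dict and raises TypeError.
    return templates.TemplateResponse(
        "create-app.html", {"request": request, "user": "alice"}
    )
```

Calling `render_old_style()` raises a TypeError about an unhashable dict, which mirrors the crash described above.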
When SecurityPolicy.oidc.forwardAccessToken=true, Envoy injects the user's
freshly-refreshed Keycloak access token in Authorization: Bearer on every
upstream request. Pre-existing JAppsConfig.conda_envs callables (and any
similar custom callables) read auth_state via the hub admin API; that store
goes stale on gateways that don't rotate the AccessToken-* cookie, so
downstream token-exchange fails ~5 min after login.

Capture the gateway-forwarded RS256 token in get_current_user, surface it
on User.access_token, and forward it onto the user dict the conda_envs
callable receives. Downstream code can then drive token exchange with a
fresh token without depending on the hub's stored auth_state.

The token is treated as opaque here; consumers validate it against their IdP.

aktech added a commit to nebari-dev/data-science-pack that referenced this pull request May 14, 2026
The temporary nebari-dev/jhub-apps#676 git pin (bearer fall-through +
Starlette 1.0 TemplateResponse fix) plus pyjwt>=2.10 floor were only
needed while Envoy was the OAuth client and could inject an RS256
Bearer that confused jhub-apps. With hub doing its own OAuth, Envoy no
longer injects to /services/japps/* and neither workaround is needed.

- jhub-apps: git@5d86277 -> ==2025.11.1 (conda-forge release)
- pyjwt: >=2.10,<3 -> >=2.9,<2.10 (matches jhub-apps 2025.11.1 constraint)

Lock regenerated via pixi 0.68.1 in a linux/arm64 container.
aktech added a commit to nebari-dev/data-science-pack that referenced this pull request May 15, 2026
…etch (#53)

* feat: add shared group directories and NSS wrapper

- Add shared-storage RWX PVC volume support with per-group subPaths
  mounted at /shared/<group> inside user pods
- Add init container that creates group directories with chmod 2775
  (setgid bit) so new files inherit the group and are group-writable
- Add libnss_wrapper.so configuration so whoami/id report the real
  username instead of 'jovyan', with NB_UMASK=0002
- Refactor pre_spawn_hook into focused single-responsibility functions:
  _get_user_groups, _setup_shared_storage, _setup_nss_wrapper
- Orchestrator _pre_spawn_hook chains Nebi auth, shared storage, and
  NSS wrapper; always registered (NSS runs even without shared storage)
- Add sharedStorage.groups allowlist and mountPathPrefix values
- Add jupyterhub.custom.shared-storage-* config keys

* fix: correct NSS GID to 1000 (jovyan) and always create ~/shared dir

- gid default was 100 but z2jh sets pod securityContext GID to 1000;
  add jovyan:x:1000: group entry so 'groups' command resolves the name
- when shared PVC is disabled, mkdir -p ~/shared instead of removing it
  so users always see the directory regardless of storage configuration
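
A rough sketch of the files and environment that nss_wrapper consumes is shown below. The function names, the UID of 1000, the home directory, and the shell are assumptions for illustration; only the GID of 1000, the jovyan group entry, and NB_UMASK=0002 come from the commit text. `LD_PRELOAD`, `NSS_WRAPPER_PASSWD`, and `NSS_WRAPPER_GROUP` are the variables nss_wrapper reads.

```python
def nss_wrapper_entries(username, uid=1000, gid=1000, home="/home/jovyan"):
    """Return (passwd_line, group_line) for nss_wrapper's passwd and group files."""
    # passwd format: name:password:UID:GID:GECOS:home:shell
    passwd_line = f"{username}:x:{uid}:{gid}:{username}:{home}:/bin/bash"
    # group format: name:password:GID:members
    group_line = f"jovyan:x:{gid}:"
    return passwd_line, group_line


def nss_wrapper_env(passwd_path, group_path, lib="libnss_wrapper.so"):
    """Environment that makes whoami/id consult the generated files."""
    return {
        "LD_PRELOAD": lib,
        "NSS_WRAPPER_PASSWD": passwd_path,
        "NSS_WRAPPER_GROUP": group_path,
        "NB_UMASK": "0002",
    }
```

With the group line present, a lookup of GID 1000 resolves to the name jovyan instead of a bare number.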

* fix: store groups in auth_state and always create ~/shared/<group> dirs

- EnvoyOIDCAuthenticator now stores parsed groups in auth_state so the
  spawner can read them at spawn time (JupyterHub groups table is empty
  when manage_groups is not enabled)
- refresh_user also re-parses groups from the refreshed IdToken to keep
  auth_state current
- _pre_spawn_hook always resolves user groups, not only when shared PVC
  is enabled
- _setup_nss_wrapper creates local ~/shared/<group> dirs per group when
  no shared PVC is configured, so users always see their group dirs

* feat: add in-cluster NFS server for shared storage on RWO-only clusters

Deploys quay.io/nebari/volume-nfs backed by a single RWO PVC and re-exports
it as RWX NFS, enabling shared group directories on providers like Hetzner
that only provide ReadWriteOnce storage (hcloud-volumes).

- templates/nfs-server.yaml: NFS Deployment, Service, backend RWO PVC
- templates/shared-pvc.yaml: StorageClass + PV (NFS path) + PVC when
  nfsServer.enabled; falls back to external RWX PVC otherwise
- values.yaml: sharedStorage.nfsServer.{enabled,storageClass,image} fields

* fix: add DaemonSet to install nfs-common on k3s worker nodes

k3s worker nodes on minimal OS images (Hetzner) ship without nfs-common,
causing NFS PV mounts to fail with 'bad option'. The DaemonSet uses nsenter
to install nfs-common on the host via apt-get, skipping if already present.
Gated on sharedStorage.nfsServer.installClient (default false).

* fix: use alpine:3 sleep for DaemonSet pause container

* fix: NFS PV path /exports not / (overlayfs cannot be exported)

* fix: remove spawner.user.groups (DetachedInstanceError in async); add try-except

_get_user_groups accessed spawner.user.groups (SQLAlchemy lazy-loaded
relationship) from an async pre_spawn_hook, causing DetachedInstanceError
which silently aborted _setup_shared_storage and _setup_nss_wrapper.

Groups are now read only from auth_state (stored by EnvoyOIDCAuthenticator).
Each step is individually wrapped in try-except so failures are logged and
don't prevent subsequent steps from running.
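
The per-step isolation can be sketched as follows. This is a synchronous simplification (the real hook is async), and `run_steps` plus the logger name are hypothetical names chosen for the example.

```python
import logging

log = logging.getLogger("pre_spawn_hook")


def run_steps(spawner, steps):
    """Run every step in order; log failures and keep going.

    Returns the names of the steps that raised, so callers can inspect them.
    """
    failed = []
    for step in steps:
        try:
            step(spawner)
        except Exception:
            # log.exception records the traceback alongside the message.
            log.exception("pre-spawn step %s failed", step.__name__)
            failed.append(step.__name__)
    return failed
```

A failure in one step is logged with its traceback, and the remaining steps still run.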

* fix: address code review findings (I1-I7, C1-C3, M1, N3)

I1: set c.KubeSpawner.fs_gid=100 explicitly so shared dir file ownership
    is deterministic (GID 100 = users group) rather than relying on z2jh default

I2: add Helm validation in _helpers.tpl that fails at template time if
    sharedStorage.enabled and jupyterhub.custom.shared-storage-enabled diverge

I3: use Path(g).name like classic Nebari so /projects/myproj -> myproj,
    not projects/myproj; deduplicate groups to prevent duplicate mountPaths

I4: add nodeSelector/nodeAffinity support to NFS server Deployment so
    deployers can pin it to worker nodes and avoid slow RWO PVC reattachment

I7: add argocd.argoproj.io/sync-options: Prune=false to StorageClass and
    PersistentVolume to prevent accidental deletion during ArgoCD force sync

C1: add chown 0:100 before chmod 2775 in initialize-shared-mounts init
    container so shared dirs are explicitly owned by GID 100 (users)

C2: use printf instead of echo '...' for NSS file writes to safely handle
    special characters in usernames without shell quoting issues

C3: deduplicate groups in _get_user_groups (Path.name already handles
    most cases; an explicit dedup set was added as belt-and-suspenders)

M1: log exception with exc_info=True in refresh_user JWT parse failure

N3: merge into existing lifecycle_hooks instead of replacing; warn if
    a postStart hook already exists before overwriting

Logging: added comprehensive info/debug/warning logging throughout all
    pre-spawn hook functions for both happy and failure paths
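
Items I3 and C3 above amount to the following normalization. This is a minimal sketch; `normalize_groups` is a hypothetical name, and `PurePosixPath` is used so the result does not depend on the host OS.

```python
from pathlib import PurePosixPath


def normalize_groups(raw_groups):
    """Reduce group paths to their last component and drop duplicates.

    Order of first appearance is preserved so mount paths stay stable.
    """
    seen = set()
    names = []
    for raw in raw_groups:
        name = PurePosixPath(raw).name  # "/projects/myproj" -> "myproj"
        if name and name not in seen:
            seen.add(name)
            names.append(name)
    return names
```

Empty names, such as the one produced by a bare "/", are skipped so they never become a mount path.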

* docs: add JupyterLab profiles design spec

* Revert "docs: add JupyterLab profiles design spec"

This reverts commit 56c22cf.

* feat: add JupyterLab profiles for CPU/RAM resource sizing (closes #31)

Exposes a profile selector in JupyterHub matching the classic Nebari experience.
Profiles are defined under jupyterhub.custom.profiles in values.yaml and passed
directly to c.KubeSpawner.profile_list via get_config().

Default profiles:
  - Small: 1 CPU / 2 GB RAM (default)
  - Medium: 4 CPU / 8 GB RAM

kubespawner_override accepts any KubeSpawner trait so GPU profiles, custom
images, and node selectors work without code changes in the future.
When profiles list is empty, no selector is shown (single-instance mode).

* fix: add descriptive names to default server profiles

Update default profile display_name and description to be more
user-friendly (e.g. "Small Instance" with "Stable environment with
1 CPU / 2 GB RAM" instead of just "Small" / "1 CPU / 2 GB RAM").

* split: move JupyterLab profiles to separate branch

Profiles feature (#31) is out of scope for this PR. Moved to local
branch feat/jupyterlab-profiles for a follow-up PR.

* test: add k3d-based e2e smoke test

Replaces the inline-bash test workflow with a pytest-based suite that
manages the k3d cluster, helm install, and pod-wait lifecycle.
Conftest exposes a 'cluster' session fixture and a 'hub_url' fixture
that port-forwards proxy-public.

CI runs uvx pytest tests/e2e -v with PYTHONUNBUFFERED=1 so live logs
stream into the workflow output.

Locally:
  uvx pytest tests/e2e -v                       # fresh cluster
  K3D_CLUSTER=k3d-nebari-dev uvx pytest tests/e2e -v   # reuse

* test: add NFS-backed shared-storage e2e tests

Switches the e2e harness to kind (k3d's busybox-on-scratch nodes lack a
package manager, so the chart's nfs-common installer DaemonSet can't
provision NFS client tools).

New tests in tests/e2e/test_shared_storage.py exercise the full PR #30
spawn path against a real cluster:
  - test_user_in_group_can_write: alice-data writes /shared/data/...
  - test_shared_dir_is_group_owned: dir mode is 2775 (setgid)

The DummyAuth shim in tests/e2e/fixtures/test-values.yaml maps the
login username to user+groups (alice-data -> alice in [data]) so we
can fake auth_state without running Keycloak. Everything else
(spawner hook, init container, NSS wrapper, NFS mount) is real.

Chart changes:
  - sharedStorage.nfsServer.mountOptions added (default []). Tests
    pass [nfsvers=3] because kind nodes use overlayfs which fails
    the volume-nfs image's NFSv4 root export. Production unchanged.

Conftest infrastructure:
  - kind cluster fixture with KIND_KEEP=1 reuse
  - hosts-entry workaround so kubelet's host mount.nfs can resolve
    the cluster-internal NFS service FQDN (kind nodes have no
    cluster DNS in their host resolv.conf)
  - structured logging + step counter + per-cycle pod state, events
    from kubectl describe, and node-level kubelet journal lines
  - autouse failure-dump fixture (kubectl get pods/events + hub
    and singleuser logs)

* refactor(tests/e2e): split conftest into deep modules

Conftest had grown to 577 lines mixing five concerns. Extracted into
focused modules each with a small interface and large hidden impl:

  _process.py     subprocess + kubectl helpers + step counter
  _hub.py         HubClient (cookie/login/spawn/stop session)
  _pod_observer.py wait_for_pod_ready + dedup'd pod-state polling
  _cluster.py     kind lifecycle + helm install + NFS hosts workaround

conftest.py shrinks to 218 lines holding only fixtures that compose the
modules. Eliminates duplication of:
  - cookie-jar login flow (was repeated in _login_and_spawn + _stop_server)
  - two parallel subprocess wrappers (_run + _kctl)
  - inline pod-state polling loop in the spawn flow

Tests still pass locally (3 passed in 35s, cluster reused).

* ci: speed up e2e — disable z2jh prePuller + cache kindest/node

prePuller hooks pre-pull singleuser images on every node before helm
install completes. On a single-node test cluster this is pure overhead
(~30s of blocking wait). Disable in test-values.yaml.

kindest/node image pull was the largest variable cost in CI: 9s on a
fast runner, 130s on a slow one. Cache it as a docker tarball keyed on
the kind version so subsequent runs are deterministic and fast.

Expected: total CI time drops from variable 3-5min to ~90-120s.

* ci: fix kindest/node cache-save step under set -e

The previous heuristic used `[ -n "$img" ] && docker save`, which exits 1
when grep finds no image and kills the whole step. Hardcode the
v1.32.2 tag (pinned by kind v0.27.0) and use plain commands so set -e
only triggers on real failures.

* ci: drop kindest/node cache attempt

GH Actions ubuntu-latest runners come with kindest/node preinstalled
(tagged "<none>"). The actual image fetch on cache miss was only ~12s
because docker just verifies the digest. The cache step was earning ~5s
in the best case and breaking the workflow when docker save couldn't
find the v1.32.2 tag (image is referenced by digest, not tag).

The prePuller-disable change keeps its ~30s saving, which is a sufficient win
without the cache complexity.

* test(e2e): expand shared-storage suite to full permission contract

9 tests (was 2) covering the per-group /shared/<group> contract end-to-end:

  - dir is root:users 2775 (parametrized over groups)
  - pod is member of users group; NB_UMASK=0002 in env
  - new files inherit gid=100 mode 0664; new subdirs gid=100 mode 02775
    (setgid propagation)
  - multi-group user sees + writes every group dir
  - user does not see groups they don't belong to (mount-time isolation)
  - file written by one user is readable + appendable by a groupmate
    from a separate pod (cross-user collaboration)

Conftest adds PathStat + SpawnedUser.stat()/path_exists() so tests assert
against typed fields (mode/uid/gid) instead of parsing stat strings —
keeps tests short and behavior-focused.
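
One way to get typed fields out of a stat call is sketched below. `PathStat` and its mode/uid/gid fields are named in the commit; the `parse_stat` helper, the `setgid` property, and the assumption that the pod runs GNU `stat -c '%a %u %g' PATH` are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PathStat:
    mode: int  # permission bits, including setuid/setgid/sticky
    uid: int
    gid: int

    @property
    def setgid(self) -> bool:
        return bool(self.mode & 0o2000)


def parse_stat(output: str) -> PathStat:
    """Parse the output of `stat -c '%a %u %g' PATH` (octal mode, uid, gid)."""
    mode, uid, gid = output.split()
    return PathStat(mode=int(mode, 8), uid=int(uid), gid=int(gid))
```

A test can then assert `parse_stat(out).mode == 0o2775` instead of matching on a string.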

* ci: cache singleuser image across runs to skip ~73s cold pull

Singleuser image (multi-GB) is currently pulled by kubelet inside the
kind node on first user spawn, costing ~73s of every CI run. Pull it
once on the runner host, save as tar, cache it (key = image ref so a
values.yaml bump auto-invalidates), and side-load with `kind load
image-archive`. Pre-create the cluster in the workflow so the side-load
happens before any pod is scheduled — the pytest fixture's
ensure_cluster() reuses the existing cluster.

Cache hit: skips the ~90s registry pull entirely; only kind-load (~20s)
remains. Cache miss: pull + save once (~120s), then every subsequent run
benefits.

* docs(shared-storage): position external RWX as primary, mark in-cluster NFS as transitional

Addresses comment on issue #29: the bundled `nfsServer.enabled=true` path
relies on `quay.io/nebari/volume-nfs:0.8-repack`, a manifest-schema repack
of an abandoned upstream image (nebari-dev/nebari-docker-images#230). We
should not be carrying that workaround image as the recommended path for
a greenfield chart.

The chart already supported bringing your own RWX StorageClass; this
change makes that path the documented primary:

  - values.yaml: reframe the sharedStorage block. Recommend an external
    RWX class with provider-specific examples (Longhorn, EFS, Filestore,
    Azure Files, nfs-subdir-external-provisioner). Add a deprecation
    note on the nfsServer.image block linking to issue #29.

  - README: add a "Shared Storage" section with the same matrix and an
    explicit pointer to the issue tracking removal of the in-cluster
    NFS path.

No template changes — the external-RWX path was already rendered when
nfsServer.enabled=false. Verified via `helm template` that setting only
`sharedStorage.storageClass=longhorn` produces a single RWX PVC and no
nfs-server pod.

* docs(shared-storage): correct provider list — only longhorn is NIC-provisioned

Previous commit listed EFS/Filestore/Azure Files as recommended RWX
backends. NIC does not provision those — they are separate cloud-managed
services no one in NIC has wired up. NIC's actual storage reality:

  hetzner   : longhorn (longhorn.Install in pkg/provider/hetzner)
  aws       : longhorn (longhorn.Install in pkg/provider/aws)
  existing  : longhorn (longhorn.Install in pkg/provider/existing)
  gcp       : standard-rwo (no RWX provisioned)
  azure     : managed-csi (no RWX provisioned)
  local     : (no storage layer)

So the accurate recommendation is just longhorn. Updated values.yaml and
README to say so directly. The in-cluster NFS fallback stays — it covers
the providers where NIC has not yet wired up an RWX class — with a
pointer to issue #29 for tracking removal once that lands everywhere.

* fix(nebi-envs): re-fetch auth_state when access token is stale

EnvoyOIDCAuthenticator stores no refresh_token (Envoy keeps only
access_token + id_token in cookies) and the access_token lifetime is
~5 minutes. jhub-apps calls the env-listing callable on every Create
App page render, often well after the token captured at login has
expired, producing
`token-exchange step 2 FAILED: HTTP 400 invalid_request "Invalid token"`
and a silent empty selector.

Mirror 01-spawner.py: when access_token has <30s remaining, re-fetch
auth_state via the hub API (which refresh_user keeps current with
fresh Envoy cookies on browser activity) before exchanging.
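
The staleness check can be sketched as below. It reads the `exp` claim without verifying the signature, which is only acceptable for deciding whether to re-fetch; `seconds_remaining` and `needs_refetch` are hypothetical names, and treating an unreadable token as stale is an assumption of this sketch.

```python
import base64
import json
import time


def seconds_remaining(access_token, now=None):
    """Seconds until the token's `exp` claim; 0.0 if it cannot be read.

    No signature verification happens here.
    """
    try:
        payload_b64 = access_token.split(".")[1]
        padded = payload_b64 + "=" * (-len(payload_b64) % 4)
        exp = json.loads(base64.urlsafe_b64decode(padded))["exp"]
        return float(exp) - (time.time() if now is None else now)
    except (AttributeError, IndexError, KeyError, TypeError, ValueError):
        return 0.0


def needs_refetch(access_token, min_remaining=30.0, now=None):
    """True when auth_state should be re-fetched before token exchange."""
    return seconds_remaining(access_token, now) < min_remaining
```

The `now` parameter exists so tests can pin the clock instead of sleeping.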

* feat: forward access token via Authorization Bearer header

Envoy Gateway v1.6 stores the OAuth2 access token in the AccessToken cookie
at OIDC login but does not rotate the cookie content when its internal
refresh token rotates the access token. Result: hub reads a frozen-at-login
access token from the cookie and downstream calls that need a fresh JWT
(jhub-apps env selector → Keycloak token exchange → Nebi) fail with
`400 invalid_request "Invalid token"` ~5 min after login.

Three changes:

1. values.yaml — set nebariapp.auth.enforceAtGateway=true and
   forwardAccessToken=true by default. Envoy Gateway then injects the
   user's freshly-refreshed access token as `Authorization: Bearer <token>`
   on every upstream request.

2. templates/nebariapp.yaml — pass enforceAtGateway, forwardAccessToken,
   and tokenExchange through to the NebariApp CRD so the chart can drive
   the operator-managed SecurityPolicy.

3. config/jupyterhub/00-gateway-auth.py — `_extract_envoy_cookies` now
   prefers `Authorization: Bearer` over the `AccessToken-*` cookie. The
   header is the only always-current source; cookie fallback retained for
   deployments without forwardAccessToken.

The stale-token re-fetch in 03-nebi-envs.py (commit 5e95e8d) becomes
defensive: with this fix, refresh_user captures a fresh access_token on
each browser request and the env-listing callable rarely needs to fall
back to it.
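
The header-over-cookie preference from item 3 could look roughly like this. `pick_access_token` is a hypothetical name, and the headers and cookies are modelled as plain dicts, whereas a real request object exposes case-insensitive headers.

```python
def pick_access_token(headers, cookies):
    """Prefer `Authorization: Bearer`, else fall back to an AccessToken-* cookie."""
    scheme, _, value = headers.get("Authorization", "").partition(" ")
    if scheme.lower() == "bearer" and value.strip():
        return value.strip()  # the gateway-forwarded, always-current token
    for name, cookie_value in cookies.items():
        if name.startswith("AccessToken-"):
            return cookie_value  # may be stale on gateways that do not rotate it
    return None
```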

* fix(values): default forwardAccessToken=false to avoid jhub-apps loop

forwardAccessToken=true makes Envoy inject the user's Keycloak access
token as Authorization: Bearer on every upstream request. jhub-apps's
get_current_user reads the Authorization header before its own cookie
and unconditionally tries to HS256-decode it as the jhub-apps JWT.
The Keycloak token is RS256, decode raises InvalidAlgorithmError,
authentication fails, browser is redirected to /jhub-login, OAuth
round-trips, new cookie is set, next request hits the same path —
infinite loop in the UI.

Until jhub-apps is patched to ignore non-HS256 Authorization tokens
(or until a different transport delivers the user's fresh access
token to the env-listing callable), default to off. The chart still
exposes both fields for explicit opt-in.

* feat: pin jhub-apps to PR #676 + default forwardAccessToken=true

Bundle the head of nebari-dev/jhub-apps#676 in the jupyterhub image so the
chart can ship with forwardAccessToken=true by default. The upstream patch
makes jhub-apps' get_current_user fall through to the cookie when the
Authorization header is not its own HS256 wrapper, removing the OAuth
redirect-loop that previously forced this default to false.

- images/jupyterhub/pixi.toml: pin jhub-apps via git rev (PR #676 head),
  bump pyjwt to >=2.10 in conda deps to satisfy that branch's requirement.
- images/jupyterhub/pixi.lock: regenerated.
- values.yaml: nebariapp.auth.forwardAccessToken defaults to true.

* chore: bump hub image to sha-e906f78 (carries patched jhub-apps)

* chore: pin hub image to PR-53 merge-commit sha-9381aab

* chore: bump jhub-apps git pin to include Starlette 1.0 TemplateResponse fix

* chore: bump hub image to sha-ab37dda (Starlette 1.0 TemplateResponse fix)

* test(unit): add pytest harness for jupyterhub config modules

Loads jupyterhub_config.d files (hyphenated, digit-prefixed) by path via
importlib spec. FakeConfig records traitlets-style attribute assignments
so tests can assert on what each config module wires onto JupyterHub's
`c` global without needing a running hub.

Also seeds the harness with .venv-unit/ ignored by git/helm so 'uv venv'
can install jupyterhub + oauthenticator for tests without leaking into
helm package output.
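
A minimal sketch of such a harness follows. `FakeConfig` is named in the commit, but this implementation, the `load_config_module` helper, and the assumption that config files expect a global `c` are illustrative rather than the repository's code.

```python
import importlib.util
import types


class FakeConfig:
    """Records `c.Section.trait = value` assignments like a traitlets config."""

    def __init__(self):
        object.__setattr__(self, "_sections", {})

    def __getattr__(self, name):
        # Only called for attributes not found normally, i.e. section names.
        if name.startswith("_"):
            raise AttributeError(name)
        return self._sections.setdefault(name, types.SimpleNamespace())


def load_config_module(path, config):
    """Execute a config file by path, exposing `config` as the global `c`."""
    # Loading by path sidesteps file names like "01-spawner.py" that are
    # not valid Python module names.
    spec = importlib.util.spec_from_file_location("hub_config_under_test", path)
    module = importlib.util.module_from_spec(spec)
    module.c = config
    spec.loader.exec_module(module)
    return module
```

After loading, a test can assert on values such as `config.KubeSpawner.fs_gid` without starting a hub.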

* feat(auth): switch hub to KeyCloakOAuthenticator (GenericOAuthenticator)

Hub now does its own OAuth dance with Keycloak instead of reading
cookies that Envoy Gateway sets at the OIDC filter. JupyterHub's
built-in refresh_user uses the stored refresh_token to keep auth_state
fresh — no browser hit, no gateway-injected Bearer, no per-caller
plumbing.

Fixes the stale-token bug: /services/japps/conda-environments/ returned
[] ~5 min after login because Envoy v1.6 doesn't rotate AccessToken-*
cookie contents on every request, jhub-apps paths bypass hub, and the
env-listing callable read user.auth_state which was frozen at OAuth
callback time.

Reads issuer-url / client-id / client-secret from a Secret mounted at
/etc/oauth/ (overridable via OAUTH_SECRET_DIR); OAUTH_CALLBACK_URL and
OAUTH_EXTERNAL_URL come from the deployment's env. Production wiring is
gated on OAUTH_CALLBACK_URL being set, so plain kind deploys keep the
chart-default authenticator (dummy).

Logout points at KC's end_session_endpoint with a percent-encoded
post_logout_redirect_uri so the upstream session is terminated, not
just hub's local cookie.

Replaces the 222-line EnvoyOIDCAuthenticator with a 100-line module
behind a single configure() entry point. 12 unit tests cover URL
derivation, auth_state/refresh wiring, admin/groups claims, logout URL
encoding, and the env-gate.

* chore(chart): flip NebariApp to enforceAtGateway=false, hub OAuth callback

With hub doing its own OAuth (see KeyCloakOAuthenticator), Envoy
SecurityPolicy on the hub host adds nothing — its cookie rotation lag
was the original cause of the env-list stale-token bug.

- enforceAtGateway: true -> false   (operator drops the SecurityPolicy)
- forwardAccessToken: true -> false (no longer relevant; avoid dual-token paths)
- redirectURI: /oauth2/callback -> /hub/oauth_callback (JupyterHub default)

Operator's provisionClient: true is independent of enforceAtGateway, so
the KC client + Secret stay provisioned. The redirectURI change drives
the operator to update the client's allowed redirect URI.

* revert(deps): drop jhub-apps git pin + relax pyjwt — hub owns OAuth now

The temporary nebari-dev/jhub-apps#676 git pin (bearer fall-through +
Starlette 1.0 TemplateResponse fix) plus pyjwt>=2.10 floor were only
needed while Envoy was the OAuth client and could inject an RS256
Bearer that confused jhub-apps. With hub doing its own OAuth, Envoy no
longer injects to /services/japps/* and neither workaround is needed.

- jhub-apps: git@5d86277 -> ==2025.11.1 (conda-forge release)
- pyjwt: >=2.10,<3 -> >=2.9,<2.10 (matches jhub-apps 2025.11.1 constraint)

Lock regenerated via pixi 0.68.1 in a linux/arm64 container.

* chore: bump hub image to sha-ae0969a (GenericOAuthenticator + jhub-apps 2025.11.1)

Carries the GenericOAuthenticator switch (config/jupyterhub/00-gateway-auth.py)
and jhub-apps 2025.11.1 release from conda-forge.

Digest: sha256:e9b481657f34c16b367d402ca1cce79ac64b177dc9eba48f85f35be363958126

* fix(auth): request openid + profile + email + groups scopes

Without explicit scope, GenericOAuthenticator sends no scope= param in
the authorize redirect; KC then issues a token that lacks the openid
scope, and /userinfo returns 403 at token_to_user. Symptom: 500 on
/hub/oauth_callback after the user signs in at KC.

Add a unit test that fails if openid drops out of the scope list.

* chore: bump hub image to sha-ffb035a (openid scope fix)

Digest: sha256:4be08f31306c4da35ceccc390688e02947f02d7eab5fbd1efddca90af8bd00fb

* fix(deps): cap starlette<1 — jhub-apps 2025.11.1 uses legacy TemplateResponse

Starlette 1.0 reordered TemplateResponse positional args to
(request, name, ...); jhub-apps 2025.11.1 still calls the 2-arg form,
which makes /services/japps/create-app 500 at handle_apps. Pin to the
last 0.x release until jhub-apps ships a fix.

* chore: bump hub image to sha-8676046 (starlette<1 cap)

Digest: sha256:8e007b6dc55ffe5f451d016610f0733462de323df5bb2af3235a6f17b22e5ddf

* docs: mark HANDOFF-stale-token.md resolved (GenericOAuthenticator switch)

Tested end-to-end on hetzner via Playwright headless:
- /hub/oauth_callback returns 302 to /hub/home (no 500)
- /services/japps/create-app renders (starlette<1 cap)
- /services/japps/conda-environments/ returns 200
- After 6-min idle, refresh_token grant fires, token stays fresh
- 3-step KC -> Nebi token exchange succeeds end-to-end

* feat(auth): auto-login + KC end-session with id_token_hint

Two UX fixes:

1. auto_login=True
   Hub now 302s /hub/login directly to KC instead of rendering the
   local form with a 'Sign in with OAuth 2.0' link. Single IdP — no
   point making the user click through.

2. KeyCloakLogoutHandler
   KC v18+ rejects /protocol/openid-connect/logout when
   post_logout_redirect_uri is present without id_token_hint. The static
   logout_redirect_url can't include it (per-user), so install a handler
   that reads auth_state.id_token at request time and builds the URL.
   Falls back to no-hint URL if auth_state is missing (legacy session).
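
Building the per-user logout URL can be sketched like this. `keycloak_logout_url` is a hypothetical helper, and the choice to fall back to the bare end-session URL (dropping the redirect as well as the hint) is an assumption of this sketch; the commit only says the fallback has no hint.

```python
from urllib.parse import urlencode


def keycloak_logout_url(end_session_url, post_logout_redirect_uri, id_token=None):
    """Return the end-session URL, adding id_token_hint when a token is known."""
    if not id_token:
        # Legacy session without auth_state: no hint is available.
        return end_session_url
    query = urlencode(
        {
            "id_token_hint": id_token,
            "post_logout_redirect_uri": post_logout_redirect_uri,
        }
    )
    return f"{end_session_url}?{query}"
```

`urlencode` percent-encodes the redirect URI, so it survives as a single query parameter.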

* chore: bump hub image to sha-0e393cb (auto_login + KC end-session)

Digest: sha256:93f7139b8775b7a22ac4db313583c83cded449ac77302229e1060f27bce3d6c1

* fix(auth): override LogoutHandler.get to inject id_token_hint

Base LogoutHandler.get() short-circuits to authenticator.logout_redirect_url
when auto_login=True, so the prior override of render_logout_page never
fired. Move the per-user URL building into get() itself, with default_handle_logout
+ handle_logout still called so hub's local session state is cleared.

* chore: bump hub image to sha-38305c6 (logout id_token_hint fix)

* fix(auth): monkey-patch LogoutHandler.get so /hub/logout uses our handler

Authenticator-supplied handlers are appended after jupyterhub's defaults
in init_handlers, so tornado's first-match routing picks the default
LogoutHandler at /logout — our override via get_handlers is a dead
route. Monkey-patch the base LogoutHandler.get instead.

* chore: bump hub image to sha-8ff8d0a (logout monkey-patch)

* fix(auth): use OAuthenticator.logout_handler hook (no monkey-patch)

OAuthenticator.get_handlers reads the class-level logout_handler
attribute when registering the /logout route. Swap it to our subclass
(class attr on KeyCloakOAuthenticator) instead of monkey-patching
LogoutHandler.get or duplicating the /logout entry — the latter just
appends a second tuple after oauthenticator's own (r'/logout',
OAuthLogoutHandler), and tornado's first-match keeps picking the base
class.

KeyCloakLogoutHandler subclasses OAuthLogoutHandler and overrides
render_logout_page (not get) so the inherited LogoutHandler.get still
runs default_handle_logout + handle_logout (token revocation, cookie
clear). For that to happen, authenticator.logout_redirect_url is left
empty — otherwise LogoutHandler.get short-circuits when auto_login is
True and never calls render_logout_page.

* chore: bump hub image to sha-239effb (logout_handler hook fix)

* fix(auth): stash logout pieces on class attr (not via c. traitlets)

Traitlets' config loader rejects unknown attribute names with a warning
and never sets the value, so c.KeyCloakOAuthenticator._kc_end_session_url
was a no-op — _kc_end_session_url stayed empty on the class default,
making the logout URL relative and causing a redirect loop.

* chore: bump hub image to sha-2c816a2 (logout class-attr fix)

* fix(auth): override LogoutHandler.get to capture id_token before user cleared

LogoutHandler.get sets self._jupyterhub_user = None BEFORE calling
render_logout_page (jupyterhub/handlers/login.py:89), so reading
auth_state from render_logout_page always sees current_user=None.
Move the id_token capture into get() before the cleanup runs.

* chore: bump hub image to sha-67880ee (logout id_token capture in get)

* fix(spawner): pin pvc_name_template to claim-{username}

KubeSpawner's default `pvc_name_template` for a *named* server is
`claim-{username}--{servername}`, but the chart's home volume mount is
hardcoded to `claim-{username}` (so all of a user's servers share a single
RWO PVC, co-located on one node via the pod-affinity rule).

Without an explicit override the names diverge: KubeSpawner creates a fresh
per-server PVC and the pod tries to mount a different per-user PVC. Users
who'd previously launched the default JupyterLab server still had the
per-user PVC sitting around and survived; fresh users (e.g. anyone who
first interacts with the platform via jhub-apps Create App) hit
FailedScheduling: 'persistentvolumeclaim claim-<user> not found' and the
pod sits Pending until the hub spawn-timeout (5 min) fires.

Lock the template to `claim-{username}` so the PVC that KubeSpawner
ensures and the PVC the pod mounts share one name.

Test in tests/unit/test_spawner_storage.py.
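
The name divergence can be reproduced with plain string formatting. This is a simplified sketch: `expand` is a hypothetical helper, and real KubeSpawner template expansion also escapes names, which is ignored here.

```python
def expand(template, username, servername=""):
    """Expand a PVC name template (simplified: no escaping of names)."""
    return template.format(username=username, servername=servername)


DEFAULT_NAMED_TEMPLATE = "claim-{username}--{servername}"  # named-server default
MOUNT_CLAIM_TEMPLATE = "claim-{username}"                  # chart's home mount
PINNED_TEMPLATE = "claim-{username}"                       # value set by this fix

# Before the fix: the created PVC and the mounted PVC have different names.
created_before = expand(DEFAULT_NAMED_TEMPLATE, "alice", "my-app")
mounted = expand(MOUNT_CLAIM_TEMPLATE, "alice", "my-app")

# After the fix: both sides resolve to the same per-user claim.
created_after = expand(PINNED_TEMPLATE, "alice", "my-app")
```

For a hypothetical user `alice` with a named server `my-app`, the unpinned template yields a per-server claim that the pod never mounts, while the pinned template yields the claim the pod expects.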

* fix(auth): implement refresh_user to rotate KC refresh_token in auth_state

The earlier switch to KeyCloakOAuthenticator (GenericOAuthenticator) set
`auth_refresh_age = 240`, expecting JupyterHub to keep auth_state fresh
via its built-in refresh_user. But JupyterHub's Authenticator.refresh_user
is a no-op stub (returns True) and oauthenticator's GenericOAuthenticator
does not override it. So auth_state.refresh_token stays frozen at
OAuth-callback time and expires after KC's SSO idle timeout (~30 min by
default), at which point nebi-envs's 3-step token exchange fails at
step 1 with:

  invalid_grant: Token is not active

and the jhub-apps Create-App "Software Environment" dropdown silently
disappears (env list is empty when the exchange fails).

Implement refresh_user on KeyCloakOAuthenticator: POST grant_type=
refresh_token to KC's token endpoint, persist the rotated tokens back
to auth_state via the {"auth_state": ...} return shape, return False on
invalid_grant to force re-login, and return True (no-op) on transient
HTTP errors.

Tests in tests/unit/test_refresh_user.py cover the four return-shape
contracts: success, invalid_grant, transient error, no-refresh-token.
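
The return-shape contract can be sketched as a pure function over the token endpoint's response. `refresh_result` is a hypothetical helper, not the code in this change; `status=None` models a transport failure, and the no-refresh-token case is left out because its return value is not stated above.

```python
from typing import Optional, Union


def refresh_result(
    auth_state: dict, status: Optional[int], body: Optional[dict]
) -> Union[bool, dict]:
    """Map a token-endpoint response to a refresh_user-style return value."""
    body = body or {}
    if status == 200:
        # Success: persist the rotated tokens back into auth_state.
        new_state = dict(auth_state)
        new_state["access_token"] = body["access_token"]
        new_state["refresh_token"] = body.get(
            "refresh_token", auth_state["refresh_token"]
        )
        if "id_token" in body:
            new_state["id_token"] = body["id_token"]
        return {"auth_state": new_state}
    if body.get("error") == "invalid_grant":
        # The refresh token is dead: force a re-login.
        return False
    # Transient failure (5xx, timeout, ...): keep the current session.
    return True


old_state = {"access_token": "a1", "refresh_token": "r1"}
ok = refresh_result(old_state, 200, {"access_token": "a2", "refresh_token": "r2"})
expired = refresh_result(old_state, 400, {"error": "invalid_grant"})
flaky = refresh_result(old_state, 503, None)
```

Keeping the decision separate from the HTTP call makes each branch checkable without a running KC instance.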

* fix(e2e): drop japps service_workers 4 -> 1 to unstick hub bootup

z2jh's hub waits ~10s for each managed service's HTTP port to bind. The
default jhub-apps service_workers is 4 and four uvicorn workers take
~12s to bind on CI runners, so hub crashes with

    Cannot connect to managed service japps at http://hub:10202

restarts, hits the same timeout, restarts, etc. By the time the e2e
fixture's port-forward starts polling /hub/login, hub is still in this
crash-loop. The first urlopen with timeout=15s eventually raises
TimeoutError, which propagates uncaught through HubClient._request (it
only catches HTTPError) and aborts every test fixture in setup.

Mirror the production overlay (gitops/apps/data-science-pack.yaml in
openteams-ai/nebari-hetzner) which pins service_workers: 1. The hub
boots cleanly within seconds and the e2e suite proceeds.

* chore: remove EnvoyOIDC-era stale-token fallbacks made dead by refresh_user

The `if access_token and not refresh_token` branch in 01-spawner.py's
_nebi_pre_spawn_hook and 03-nebi-envs.py's get_nebi_environments was a
fallback for EnvoyOIDCAuthenticator's auth_state, which never carried a
refresh_token (Envoy only stored access_token + id_token in cookies).
With KeyCloakOAuthenticator + the new refresh_user() override, auth_state
always has a rotating refresh_token, so the branch never fires.

Also remove the now-orphan `_fetch_fresh_auth_state` helper it called,
and update two docstrings/comments that still referenced
EnvoyOIDCAuthenticator as the source of the groups claim or the hub's
OAuth client.

Reword values.yaml comment on `forwardAccessToken: false` to drop the
"avoids confusing dual-token paths" framing — there is no
Envoy-injected Bearer in the current architecture.

* refactor(auth): bundle KC strings into KeyCloakConfig dataclass

The KC migration left two related concerns scattered:

  * endpoint URLs (authorize/token/userdata/end_session) derived from
    the issuer were assigned individually to traitlets in configure(),
    via a small `_kc_urls` dict helper.
  * the per-user logout URL had its inputs spread across the free
    function `_build_logout_url(end_session_url=, id_token=,
    post_logout_redirect_uri=)` and TWO stray class attributes
    (`KeyCloakOAuthenticator._kc_end_session_url`,
    `KeyCloakOAuthenticator._kc_post_logout_redirect_uri`) that
    configure() stashed at startup so the logout handler could read
    them at request time.

Replace both with a single `KeyCloakConfig` frozen dataclass that
holds every KC string the chart needs and owns the logout-URL
composition as a method.

  * `KeyCloakConfig.build(issuer=..., post_logout_redirect_uri=...)`
    derives every endpoint URL from the realm issuer.
  * `cfg.build_logout_url(id_token)` composes the end-session URL,
    omitting `id_token_hint` for legacy sessions (KC v18+ rejects
    logout without it when `post_logout_redirect_uri` is set).
  * configure() builds one KeyCloakConfig and stashes it on
    `KeyCloakOAuthenticator.kc_config`; the logout handler reads it
    via `self.authenticator.kc_config`.

Net effect: one cohesive object instead of two stray class attrs +
one free function + one dict helper. The endpoint derivation becomes
trivially testable in isolation.
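
A minimal sketch of what such a dataclass could look like. The field names and the choice to drop `post_logout_redirect_uri` together with `id_token_hint` for legacy sessions are assumptions for illustration; the endpoint paths follow KC's standard `protocol/openid-connect` layout.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlencode


@dataclass(frozen=True)
class KeyCloakConfig:
    authorize_url: str
    token_url: str
    userdata_url: str
    end_session_url: str
    post_logout_redirect_uri: str

    @classmethod
    def build(cls, issuer: str, post_logout_redirect_uri: str) -> "KeyCloakConfig":
        """Derive every endpoint URL from the realm issuer."""
        base = issuer.rstrip("/") + "/protocol/openid-connect"
        return cls(
            authorize_url=f"{base}/auth",
            token_url=f"{base}/token",
            userdata_url=f"{base}/userinfo",
            end_session_url=f"{base}/logout",
            post_logout_redirect_uri=post_logout_redirect_uri,
        )

    def build_logout_url(self, id_token: Optional[str]) -> str:
        """Compose the end-session URL for one user."""
        if not id_token:
            # Assumption: without an id_token, send neither parameter.
            return self.end_session_url
        query = urlencode(
            {
                "id_token_hint": id_token,
                "post_logout_redirect_uri": self.post_logout_redirect_uri,
            }
        )
        return f"{self.end_session_url}?{query}"


cfg = KeyCloakConfig.build(
    issuer="https://kc.example.com/realms/demo",
    post_logout_redirect_uri="https://hub.example.com/",
)
```

Because `build` is a pure classmethod, the derivation can be exercised without calling configure() or constructing an authenticator.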

Tests updated in test_keycloak_authenticator.py:
  * `test_configure_attaches_kc_config_to_authenticator_class`
    replaces the pair of stray-attribute assertions.
  * `test_kc_config_build_logout_url_*` cover the method directly.
  * `test_kc_config_from_issuer_is_pure_and_doesnt_need_configure`
    pins the classmethod's pure-function semantics.

* chore: untrack HANDOFF-stale-token.md and ignore future handoff notes

HANDOFF*.md files are in-flight working notes between agent sessions;
they should never have been checked in. Remove the one that snuck in
during the auth saga and add a .gitignore rule so the next one doesn't.

---------

Co-authored-by: Amit Kumar <aktech@users.noreply.github.com>