Skip to content

perf: cut hub api roundtrips on the /home hot path#680

Draft
aktech wants to merge 4 commits into
mainfrom
perf/hub-roundtrips
Draft

perf: cut hub api roundtrips on the /home hot path#680
aktech wants to merge 4 commits into
mainfrom
perf/hub-roundtrips

Conversation

@aktech

@aktech aktech commented May 16, 2026

Copy link
Copy Markdown
Member

Summary

Three changes, all within the existing scopes on japps-service-roleno scope widening, no behavior change at the public API level.

1. Stop minting+revoking user-impersonation tokens for read-only calls

HubClient.get_user, get_shared_servers, and get_services were decorated with @requires_user_token. Per call that means:

```
POST /hub/api/users/{me}/tokens (DB write)
GET /hub/api/... (the actual read)
DELETE /hub/api/users/{me}/tokens/{id} (DB write)
```

The service token already has read:users, shares (jh>=5), and read:services via the role assignment in jhub_apps/configuration.py:120, so the read returns the same data with the service token directly — three hub roundtrips collapse to one.

2. Replace the cluster-wide scan in get_shared_servers

service.utils.get_shared_servers used to call HubClient.get_users()GET /hub/api/users?include_stopped_servers=True — to enrich the share list. On a Nebari deployment that's an O(users × servers) JSON payload pulled on every /server/ GET, then mostly discarded.

The new flow:

  1. GET /users/{me}/shared → small share list (just owner.name + server.name)
  2. dedupe owners
  3. parallel GET /users/{owner} for unique grantors (capped at 10 workers)
  4. filter each owner's servers map to the shared names

Scales with shares granted to me, not cluster size. Returns the exact same Server records as before.

3. Share a keep-alive requests.Session

Module-level requests.Session reused across HubClient instances, so consecutive hub calls reuse the TCP/TLS connection instead of handshaking per call. Noticeable win on k8s pod networking.

Security

The role assignment was already granting the scopes we now use directly. Switching to the service token for reads is a no-op for what the hub would have allowed — we're just not minting a derived user token that had the same scopes. State-changing calls (create_server, delete_server, …) still use @requires_user_token so the audit trail records the user as the actor.

Impact

For a logged-in user opening /hub/home (no app form open):

before after
Hub API roundtrips ~11 ~4
Cluster-scan endpoints 1 (GET /users) 0
Hub DB writes per page 6 (token create/revoke pairs) 0

Test plan

  • pytest jhub_apps/tests --ignore=jhub_apps/tests/tests_e2e — 26 passed on py3.12.
  • Updated test_shared_server_filtering to model the new fan-out and assert get_user is called exactly once per unique owner.
  • CI integration tests on this PR.

Three changes, all behind the existing scope grants on the
japps-service-role — no widening of permissions.

1. Drop @requires_user_token from `HubClient.get_user`,
   `get_shared_servers`, and `get_services`. The decorator's
   create-token-call-revoke-token dance costs three hub roundtrips
   (with DB writes) per logical call. The service token already has
   `read:users`, `shares`, and `read:services` via the role assignment
   in `jhub_apps/configuration.py`, so the read returns the same data
   in one call instead of three.

2. Rewrite `service.utils.get_shared_servers`. Previously it called
   `HubClient.get_users()` — `GET /hub/api/users?include_stopped_servers=True`,
   the entire cluster's users-and-servers payload — on every /server/
   GET, then threw most of it away. Now it pulls the small Share list
   from `/users/{me}/shared`, dedupes owners, and fans out one
   `GET /users/{owner}` per unique grantor in parallel (capped at 10).
   Scales with shares-granted-to-me, not cluster size.

3. Share a module-level `requests.Session` across all `HubClient`
   instances so consecutive hub calls reuse the TCP/TLS connection.
   On k8s pod networking the per-call handshake was a meaningful slice
   of the wall time.

Effect on /hub/home for a logged-in user (no app form open):

  before:  ~11 hub roundtrips, one O(users x servers across cluster)
  after:    ~4 hub roundtrips, all O(distinct people who shared with me)

Unit tests updated and green (26 passed). Integration/e2e behavior
is unchanged at the public-API level — same response shapes, same
status codes, same scope checks on the hub side.
@vercel

vercel Bot commented May 16, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
jhub-apps Ready Ready Preview, Comment May 16, 2026 9:16pm
jhub-apps-docs Ready Ready Preview, Comment May 16, 2026 9:16pm

While benchmarking PR #680 I noticed every authenticated request was
still firing `GET /hub/api/users` + `GET /hub/api/groups` + a token
mint/revoke pair, regardless of which endpoint was called. The cause:
`security.get_current_user` was unconditionally populating
`user.share_permissions` via `get_users_and_group_allowed_to_share_with`,
which is heavy and only needed by the share dropdown.

Changes:

1. Stop populating `user.share_permissions` in `get_current_user`.
   The hot path no longer touches the share-targets code at all.

2. Add a dedicated `GET /services/japps/share-permissions/` endpoint.
   The UI's `app-sharing.tsx` fetches it on demand via react-query
   (`staleTime: 0` — Keycloak/group changes propagate immediately on
   the next dialog open; never cached).

3. Rewrite `get_users_and_group_allowed_to_share_with` to take the
   most direct hub call per scope shape:
     - explicit `read:users:name!user=NAME` → name comes straight out
       of the scope, no hub call;
     - explicit `read:users:name!group=GROUP` → one
       `GET /groups/{GROUP}` (returns member names directly);
     - broad `read:users:name` / `read:groups:name` → one full
       `GET /users` (or `GET /groups`) only because there is no
       smaller query that returns the cluster.
   Same shape for the groups side. No caching anywhere.

4. New `HubClient.get_group(name)` for the targeted group fetch.
   Drop the now-unused `filter_entity_based_on_scopes`.

5. Replace the old scope-filter test with a parameterised test for
   `_resolve_share_targets` that exercises all three branches
   (explicit name, group-resolution, broad-fallback) and asserts the
   minimal set of hub calls in each.

Benchmark (one user, N=20 per endpoint, against this local hub):

  hub API calls during 126 service requests:
                                       main    perf
    POST  /users/admin/tokens            231       0
    DELETE /users/admin/tokens/*         230       0
    GET   /users?include_stopped...      147       0   <- cluster scan
    GET   /groups                        126       0
    GET   /user                          126     126
    GET   /users/admin?include_stopped     63      63
    GET   /users/admin/shared              21      21
    GET   /services                        21      21
    TOTAL                                965     231   (1.8 calls/req)

  mean latency (ms):
    /user                              24.7  ->   6.3   (3.9x)
    /server/                           62.2  ->  14.7   (4.2x)
    /services/                         31.2  ->   6.7   (4.7x)
    /frameworks/                       21.8  ->   5.8   (3.8x)
    /conda-environments/               32.4  ->   9.1   (3.6x)
    /spawner-profiles/                 33.7  ->  12.1   (2.8x)

The `GET /users` row (the cluster scan) is the one that scales with
cluster size on a real Nebari deployment — eliminating it is where the
biggest production win comes from.
…flight

CI vitest run failed 5 app-sharing tests because the new
`useQuery('/share-permissions/')` never resolves in jsdom (no axios
mock), so the component rendered an empty state and the
`MuiTable-root` assertions tripped.

Keep production behaviour exactly the same — the server no longer
populates `User.share_permissions`, so the query is always the source
of truth there. In component tests the fixture
(`ui/src/data/user.ts`) still sets `share_permissions`, so the
fallback `currentUser.share_permissions` drives the initial render
without anyone mocking the network call.

No tests were modified.
Acts on the multi-agent review of perf/hub-roundtrips.

Security:
- `get_services`: keep the no-mint fast path but project the response
  down to the field set the UI actually renders (name/kind/admin/
  display/info/url) in `routes.py::hub_services`. The service token
  pulls extra fields (`command`, `prefix`, `pid`, `roles`) under
  `read:services` that a user's `list:services` token would never
  have surfaced; this projection prevents the perf change from
  widening disclosure.
- `HubClient.get_shared_servers`: drop the `username` parameter. The
  only caller was already passing nothing; the signature was an IDOR
  primitive waiting for a future caller to thread a user-controlled
  value through. Now derives from `self.username` only.
- `_resolve_share_targets`: replace the `GET /groups/{name}` lookup
  with a single `GET /users` filtered on the `groups` field client
  side. The service role has `list:groups`, NOT `read:groups`, so
  `GET /groups/{name}` would 403 — this branch was functionally
  broken on `read:users:name!group=...` shapes.
- `get_shared_servers`: `urllib.parse.quote(self.username, safe="")`
  on the path segment.

Reliability:
- Fan-out in `service/utils.py::get_shared_servers` was fail-fast on
  `ex.map`: a single grantor that 404s (renamed user, transient hub
  blip) took down the whole /server/ GET. Each `get_user(owner)` is
  now wrapped in try/except + logged + skipped.
- `structlog.contextvars` are stdlib `ContextVars` and are NOT
  inherited by `ThreadPoolExecutor` worker threads. Snapshot the
  current context with `contextvars.copy_context()` and submit each
  worker through `ctx.run` so `request_id` survives across the
  fanout.

Architecture:
- `HubClient.get_server` still ran a full-cluster `get_users()` scan
  inside a `@requires_user_token` wrapper, contradicting the PR
  thesis. Replace with a single `get_user(username)` (service token,
  no mint), handle 404 via `requests.HTTPError`, drop the decorator.
- Mount an explicit `HTTPAdapter(pool_connections=20,
  pool_maxsize=50)` on the shared `requests.Session`. urllib3's
  default pool is 10/10 — exactly the width of the new share fanout
  — so under any concurrency the keep-alive win was getting eaten
  by fresh-socket churn past the cap.

Contract clean-up:
- Drop `User.share_permissions` from the Pydantic /user response;
  the dedicated `/share-permissions/` endpoint is the only source
  now. Removes the silent `null` contract break that was the
  follow-on of moving the computation off `get_current_user`.
- `UserState.share_permissions` is now optional in TypeScript to
  match.
- Drop the dead `is_jupyterhub_5` import + `# noqa: F401` shim from
  `security.py`.

UI:
- `app-sharing.tsx` now renders an explicit "Loading share
  permissions…" line while the new query is in flight, and an
  error banner if it fails. Previously the dialog rendered an
  empty space with no spinner or retry.

Tests:
- `test_filter_users__groups_based_on_scopes.py` retargeted at the
  new resolution path: `!group=` scopes are now expected to drive a
  single `get_users()` call instead of one `get_group()` per group,
  and a new "mixed explicit + group" case covers union semantics.
- All 29 unit + 236 UI tests pass.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant