perf: cut hub api roundtrips on the /home hot path by aktech · Pull Request #680 · nebari-dev/jhub-apps

aktech · 2026-05-16T20:30:29Z

Summary

Three changes, all within the existing scopes on japps-service-role — no scope widening, no behavior change at the public API level.

1. Stop minting+revoking user-impersonation tokens for read-only calls

HubClient.get_user, get_shared_servers, and get_services were decorated with @requires_user_token. Per call that means:

```
POST /hub/api/users/{me}/tokens (DB write)
GET /hub/api/... (the actual read)
DELETE /hub/api/users/{me}/tokens/{id} (DB write)
```

The service token already has read:users, shares (jh>=5), and read:services via the role assignment in jhub_apps/configuration.py:120, so the read returns the same data with the service token directly — three hub roundtrips collapse to one.

2. Replace the cluster-wide scan in `get_shared_servers`

service.utils.get_shared_servers used to call HubClient.get_users() — GET /hub/api/users?include_stopped_servers=True — to enrich the share list. On a Nebari deployment that's an O(users × servers) JSON payload pulled on every /server/ GET, then mostly discarded.

The new flow:

GET /users/{me}/shared → small share list (just owner.name + server.name)
dedupe owners
parallel GET /users/{owner} for unique grantors (capped at 10 workers)
filter each owner's servers map to the shared names

Scales with shares granted to me, not cluster size. Returns the exact same Server records as before.

3. Share a keep-alive `requests.Session`

Module-level requests.Session reused across HubClient instances, so consecutive hub calls reuse the TCP/TLS connection instead of handshaking per call. Noticeable win on k8s pod networking.

Security

The role assignment was already granting the scopes we now use directly. Switching to the service token for reads is a no-op for what the hub would have allowed — we're just not minting a derived user token that had the same scopes. State-changing calls (create_server, delete_server, …) still use @requires_user_token so the audit trail records the user as the actor.

Impact

For a logged-in user opening /hub/home (no app form open):

	before	after
Hub API roundtrips	~11	~4
Cluster-scan endpoints	1 (`GET /users`)	0
Hub DB writes per page	6 (token create/revoke pairs)	0

Test plan

pytest jhub_apps/tests --ignore=jhub_apps/tests/tests_e2e — 26 passed on py3.12.
Updated test_shared_server_filtering to model the new fan-out and assert get_user is called exactly once per unique owner.
CI integration tests on this PR.

Three changes, all behind the existing scope grants on the japps-service-role — no widening of permissions. 1. Drop @requires_user_token from `HubClient.get_user`, `get_shared_servers`, and `get_services`. The decorator's create-token-call-revoke-token dance costs three hub roundtrips (with DB writes) per logical call. The service token already has `read:users`, `shares`, and `read:services` via the role assignment in `jhub_apps/configuration.py`, so the read returns the same data in one call instead of three. 2. Rewrite `service.utils.get_shared_servers`. Previously it called `HubClient.get_users()` — `GET /hub/api/users?include_stopped_servers=True`, the entire cluster's users-and-servers payload — on every /server/ GET, then threw most of it away. Now it pulls the small Share list from `/users/{me}/shared`, dedupes owners, and fans out one `GET /users/{owner}` per unique grantor in parallel (capped at 10). Scales with shares-granted-to-me, not cluster size. 3. Share a module-level `requests.Session` across all `HubClient` instances so consecutive hub calls reuse the TCP/TLS connection. On k8s pod networking the per-call handshake was a meaningful slice of the wall time. Effect on /hub/home for a logged-in user (no app form open): before: ~11 hub roundtrips, one O(users x servers across cluster) after: ~4 hub roundtrips, all O(distinct people who shared with me) Unit tests updated and green (26 passed). Integration/e2e behavior is unchanged at the public-API level — same response shapes, same status codes, same scope checks on the hub side.

vercel · 2026-05-16T20:30:34Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
jhub-apps	Ready	Preview, Comment	May 16, 2026 9:16pm
jhub-apps-docs	Ready	Preview, Comment	May 16, 2026 9:16pm

While benchmarking PR #680 I noticed every authenticated request was still firing `GET /hub/api/users` + `GET /hub/api/groups` + a token mint/revoke pair, regardless of which endpoint was called. The cause: `security.get_current_user` was unconditionally populating `user.share_permissions` via `get_users_and_group_allowed_to_share_with`, which is heavy and only needed by the share dropdown. Changes: 1. Stop populating `user.share_permissions` in `get_current_user`. The hot path no longer touches the share-targets code at all. 2. Add a dedicated `GET /services/japps/share-permissions/` endpoint. The UI's `app-sharing.tsx` fetches it on demand via react-query (`staleTime: 0` — Keycloak/group changes propagate immediately on the next dialog open; never cached). 3. Rewrite `get_users_and_group_allowed_to_share_with` to take the most direct hub call per scope shape: - explicit `read:users:name!user=NAME` → name comes straight out of the scope, no hub call; - explicit `read:users:name!group=GROUP` → one `GET /groups/{GROUP}` (returns member names directly); - broad `read:users:name` / `read:groups:name` → one full `GET /users` (or `GET /groups`) only because there is no smaller query that returns the cluster. Same shape for the groups side. No caching anywhere. 4. New `HubClient.get_group(name)` for the targeted group fetch. Drop the now-unused `filter_entity_based_on_scopes`. 5. Replace the old scope-filter test with a parameterised test for `_resolve_share_targets` that exercises all three branches (explicit name, group-resolution, broad-fallback) and asserts the minimal set of hub calls in each. Benchmark (one user, N=20 per endpoint, against this local hub): hub API calls during 126 service requests: main perf POST /users/admin/tokens 231 0 DELETE /users/admin/tokens/* 230 0 GET /users?include_stopped... 147 0 <- cluster scan GET /groups 126 0 GET /user 126 126 GET /users/admin?include_stopped 63 63 GET /users/admin/shared 21 21 GET /services 21 21 TOTAL 965 231 (1.8 calls/req) mean latency (ms): /user 24.7 -> 6.3 (3.9x) /server/ 62.2 -> 14.7 (4.2x) /services/ 31.2 -> 6.7 (4.7x) /frameworks/ 21.8 -> 5.8 (3.8x) /conda-environments/ 32.4 -> 9.1 (3.6x) /spawner-profiles/ 33.7 -> 12.1 (2.8x) The `GET /users` row (the cluster scan) is the one that scales with cluster size on a real Nebari deployment — eliminating it is where the biggest production win comes from.

…flight CI vitest run failed 5 app-sharing tests because the new `useQuery('/share-permissions/')` never resolves in jsdom (no axios mock), so the component rendered an empty state and the `MuiTable-root` assertions tripped. Keep production behaviour exactly the same — the server no longer populates `User.share_permissions`, so the query is always the source of truth there. In component tests the fixture (`ui/src/data/user.ts`) still sets `share_permissions`, so the fallback `currentUser.share_permissions` drives the initial render without anyone mocking the network call. No tests were modified.

Acts on the multi-agent review of perf/hub-roundtrips. Security: - `get_services`: keep the no-mint fast path but project the response down to the field set the UI actually renders (name/kind/admin/ display/info/url) in `routes.py::hub_services`. The service token pulls extra fields (`command`, `prefix`, `pid`, `roles`) under `read:services` that a user's `list:services` token would never have surfaced; this projection prevents the perf change from widening disclosure. - `HubClient.get_shared_servers`: drop the `username` parameter. The only caller was already passing nothing; the signature was an IDOR primitive waiting for a future caller to thread a user-controlled value through. Now derives from `self.username` only. - `_resolve_share_targets`: replace the `GET /groups/{name}` lookup with a single `GET /users` filtered on the `groups` field client side. The service role has `list:groups`, NOT `read:groups`, so `GET /groups/{name}` would 403 — this branch was functionally broken on `read:users:name!group=...` shapes. - `get_shared_servers`: `urllib.parse.quote(self.username, safe="")` on the path segment. Reliability: - Fan-out in `service/utils.py::get_shared_servers` was fail-fast on `ex.map`: a single grantor that 404s (renamed user, transient hub blip) took down the whole /server/ GET. Each `get_user(owner)` is now wrapped in try/except + logged + skipped. - `structlog.contextvars` are stdlib `ContextVars` and are NOT inherited by `ThreadPoolExecutor` worker threads. Snapshot the current context with `contextvars.copy_context()` and submit each worker through `ctx.run` so `request_id` survives across the fanout. Architecture: - `HubClient.get_server` still ran a full-cluster `get_users()` scan inside a `@requires_user_token` wrapper, contradicting the PR thesis. Replace with a single `get_user(username)` (service token, no mint), handle 404 via `requests.HTTPError`, drop the decorator. - Mount an explicit `HTTPAdapter(pool_connections=20, pool_maxsize=50)` on the shared `requests.Session`. urllib3's default pool is 10/10 — exactly the width of the new share fanout — so under any concurrency the keep-alive win was getting eaten by fresh-socket churn past the cap. Contract clean-up: - Drop `User.share_permissions` from the Pydantic /user response; the dedicated `/share-permissions/` endpoint is the only source now. Removes the silent `null` contract break that was the follow-on of moving the computation off `get_current_user`. - `UserState.share_permissions` is now optional in TypeScript to match. - Drop the dead `is_jupyterhub_5` import + `# noqa: F401` shim from `security.py`. UI: - `app-sharing.tsx` now renders an explicit "Loading share permissions…" line while the new query is in flight, and an error banner if it fails. Previously the dialog rendered an empty space with no spinner or retry. Tests: - `test_filter_users__groups_based_on_scopes.py` retargeted at the new resolution path: `!group=` scopes are now expected to drive a single `get_users()` call instead of one `get_group()` per group, and a new "mixed explicit + group" case covers union semantics. - All 29 unit + 236 UI tests pass.

vercel Bot deployed to Preview – jhub-apps-docs May 16, 2026 20:31 View deployment

vercel Bot deployed to Preview – jhub-apps May 16, 2026 20:31 View deployment

vercel Bot deployed to Preview – jhub-apps-docs May 16, 2026 20:56 View deployment

vercel Bot deployed to Preview – jhub-apps May 16, 2026 20:56 View deployment

vercel Bot deployed to Preview – jhub-apps-docs May 16, 2026 21:02 View deployment

vercel Bot deployed to Preview – jhub-apps May 16, 2026 21:02 View deployment

vercel Bot deployed to Preview – jhub-apps May 16, 2026 21:16 View deployment

vercel Bot deployed to Preview – jhub-apps-docs May 16, 2026 21:16 View deployment

aktech mentioned this pull request May 16, 2026

feat(hub): pin jhub-apps to PR #680 (perf/hub-roundtrips) nebari-dev/data-science-pack#80

Draft

2 tasks

aktech marked this pull request as draft May 16, 2026 21:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

perf: cut hub api roundtrips on the /home hot path#680

perf: cut hub api roundtrips on the /home hot path#680
aktech wants to merge 4 commits into
mainfrom
perf/hub-roundtrips

aktech commented May 16, 2026

Uh oh!

vercel Bot commented May 16, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

aktech commented May 16, 2026

Summary

1. Stop minting+revoking user-impersonation tokens for read-only calls

2. Replace the cluster-wide scan in get_shared_servers

3. Share a keep-alive requests.Session

Security

Impact

Test plan

Uh oh!

vercel Bot commented May 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

2. Replace the cluster-wide scan in `get_shared_servers`

3. Share a keep-alive `requests.Session`

vercel Bot commented May 16, 2026 •

edited

Loading