perf: cut hub api roundtrips on the /home hot path#680
Draft
aktech wants to merge 4 commits into
Draft
Conversation
Three changes, all behind the existing scope grants on the
japps-service-role — no widening of permissions.
1. Drop @requires_user_token from `HubClient.get_user`,
`get_shared_servers`, and `get_services`. The decorator's
create-token-call-revoke-token dance costs three hub roundtrips
(with DB writes) per logical call. The service token already has
`read:users`, `shares`, and `read:services` via the role assignment
in `jhub_apps/configuration.py`, so the read returns the same data
in one call instead of three.
2. Rewrite `service.utils.get_shared_servers`. Previously it called
`HubClient.get_users()` — `GET /hub/api/users?include_stopped_servers=True`,
the entire cluster's users-and-servers payload — on every /server/
GET, then threw most of it away. Now it pulls the small Share list
from `/users/{me}/shared`, dedupes owners, and fans out one
`GET /users/{owner}` per unique grantor in parallel (capped at 10).
Scales with shares-granted-to-me, not cluster size.
3. Share a module-level `requests.Session` across all `HubClient`
instances so consecutive hub calls reuse the TCP/TLS connection.
On k8s pod networking the per-call handshake was a meaningful slice
of the wall time.
Effect on /hub/home for a logged-in user (no app form open):
before: ~11 hub roundtrips, one O(users x servers across cluster)
after: ~4 hub roundtrips, all O(distinct people who shared with me)
Unit tests updated and green (26 passed). Integration/e2e behavior
is unchanged at the public-API level — same response shapes, same
status codes, same scope checks on the hub side.
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
While benchmarking PR #680 I noticed every authenticated request was still firing `GET /hub/api/users` + `GET /hub/api/groups` + a token mint/revoke pair, regardless of which endpoint was called. The cause: `security.get_current_user` was unconditionally populating `user.share_permissions` via `get_users_and_group_allowed_to_share_with`, which is heavy and only needed by the share dropdown. Changes: 1. Stop populating `user.share_permissions` in `get_current_user`. The hot path no longer touches the share-targets code at all. 2. Add a dedicated `GET /services/japps/share-permissions/` endpoint. The UI's `app-sharing.tsx` fetches it on demand via react-query (`staleTime: 0` — Keycloak/group changes propagate immediately on the next dialog open; never cached). 3. Rewrite `get_users_and_group_allowed_to_share_with` to take the most direct hub call per scope shape: - explicit `read:users:name!user=NAME` → name comes straight out of the scope, no hub call; - explicit `read:users:name!group=GROUP` → one `GET /groups/{GROUP}` (returns member names directly); - broad `read:users:name` / `read:groups:name` → one full `GET /users` (or `GET /groups`) only because there is no smaller query that returns the cluster. Same shape for the groups side. No caching anywhere. 4. New `HubClient.get_group(name)` for the targeted group fetch. Drop the now-unused `filter_entity_based_on_scopes`. 5. Replace the old scope-filter test with a parameterised test for `_resolve_share_targets` that exercises all three branches (explicit name, group-resolution, broad-fallback) and asserts the minimal set of hub calls in each. Benchmark (one user, N=20 per endpoint, against this local hub): hub API calls during 126 service requests: main perf POST /users/admin/tokens 231 0 DELETE /users/admin/tokens/* 230 0 GET /users?include_stopped... 147 0 <- cluster scan GET /groups 126 0 GET /user 126 126 GET /users/admin?include_stopped 63 63 GET /users/admin/shared 21 21 GET /services 21 21 TOTAL 965 231 (1.8 calls/req) mean latency (ms): /user 24.7 -> 6.3 (3.9x) /server/ 62.2 -> 14.7 (4.2x) /services/ 31.2 -> 6.7 (4.7x) /frameworks/ 21.8 -> 5.8 (3.8x) /conda-environments/ 32.4 -> 9.1 (3.6x) /spawner-profiles/ 33.7 -> 12.1 (2.8x) The `GET /users` row (the cluster scan) is the one that scales with cluster size on a real Nebari deployment — eliminating it is where the biggest production win comes from.
…flight
CI vitest run failed 5 app-sharing tests because the new
`useQuery('/share-permissions/')` never resolves in jsdom (no axios
mock), so the component rendered an empty state and the
`MuiTable-root` assertions tripped.
Keep production behaviour exactly the same — the server no longer
populates `User.share_permissions`, so the query is always the source
of truth there. In component tests the fixture
(`ui/src/data/user.ts`) still sets `share_permissions`, so the
fallback `currentUser.share_permissions` drives the initial render
without anyone mocking the network call.
No tests were modified.
Acts on the multi-agent review of perf/hub-roundtrips.
Security:
- `get_services`: keep the no-mint fast path but project the response
down to the field set the UI actually renders (name/kind/admin/
display/info/url) in `routes.py::hub_services`. The service token
pulls extra fields (`command`, `prefix`, `pid`, `roles`) under
`read:services` that a user's `list:services` token would never
have surfaced; this projection prevents the perf change from
widening disclosure.
- `HubClient.get_shared_servers`: drop the `username` parameter. The
only caller was already passing nothing; the signature was an IDOR
primitive waiting for a future caller to thread a user-controlled
value through. Now derives from `self.username` only.
- `_resolve_share_targets`: replace the `GET /groups/{name}` lookup
with a single `GET /users` filtered on the `groups` field client
side. The service role has `list:groups`, NOT `read:groups`, so
`GET /groups/{name}` would 403 — this branch was functionally
broken on `read:users:name!group=...` shapes.
- `get_shared_servers`: `urllib.parse.quote(self.username, safe="")`
on the path segment.
Reliability:
- Fan-out in `service/utils.py::get_shared_servers` was fail-fast on
`ex.map`: a single grantor that 404s (renamed user, transient hub
blip) took down the whole /server/ GET. Each `get_user(owner)` is
now wrapped in try/except + logged + skipped.
- `structlog.contextvars` are stdlib `ContextVars` and are NOT
inherited by `ThreadPoolExecutor` worker threads. Snapshot the
current context with `contextvars.copy_context()` and submit each
worker through `ctx.run` so `request_id` survives across the
fanout.
Architecture:
- `HubClient.get_server` still ran a full-cluster `get_users()` scan
inside a `@requires_user_token` wrapper, contradicting the PR
thesis. Replace with a single `get_user(username)` (service token,
no mint), handle 404 via `requests.HTTPError`, drop the decorator.
- Mount an explicit `HTTPAdapter(pool_connections=20,
pool_maxsize=50)` on the shared `requests.Session`. urllib3's
default pool is 10/10 — exactly the width of the new share fanout
— so under any concurrency the keep-alive win was getting eaten
by fresh-socket churn past the cap.
Contract clean-up:
- Drop `User.share_permissions` from the Pydantic /user response;
the dedicated `/share-permissions/` endpoint is the only source
now. Removes the silent `null` contract break that was the
follow-on of moving the computation off `get_current_user`.
- `UserState.share_permissions` is now optional in TypeScript to
match.
- Drop the dead `is_jupyterhub_5` import + `# noqa: F401` shim from
`security.py`.
UI:
- `app-sharing.tsx` now renders an explicit "Loading share
permissions…" line while the new query is in flight, and an
error banner if it fails. Previously the dialog rendered an
empty space with no spinner or retry.
Tests:
- `test_filter_users__groups_based_on_scopes.py` retargeted at the
new resolution path: `!group=` scopes are now expected to drive a
single `get_users()` call instead of one `get_group()` per group,
and a new "mixed explicit + group" case covers union semantics.
- All 29 unit + 236 UI tests pass.
2 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Three changes, all within the existing scopes on
japps-service-role— no scope widening, no behavior change at the public API level.1. Stop minting+revoking user-impersonation tokens for read-only calls
HubClient.get_user,get_shared_servers, andget_serviceswere decorated with@requires_user_token. Per call that means:```
POST /hub/api/users/{me}/tokens (DB write)
GET /hub/api/... (the actual read)
DELETE /hub/api/users/{me}/tokens/{id} (DB write)
```
The service token already has
read:users,shares(jh>=5), andread:servicesvia the role assignment injhub_apps/configuration.py:120, so the read returns the same data with the service token directly — three hub roundtrips collapse to one.2. Replace the cluster-wide scan in
get_shared_serversservice.utils.get_shared_serversused to callHubClient.get_users()—GET /hub/api/users?include_stopped_servers=True— to enrich the share list. On a Nebari deployment that's an O(users × servers) JSON payload pulled on every/server/GET, then mostly discarded.The new flow:
GET /users/{me}/shared→ small share list (just owner.name + server.name)GET /users/{owner}for unique grantors (capped at 10 workers)serversmap to the shared namesScales with shares granted to me, not cluster size. Returns the exact same
Serverrecords as before.3. Share a keep-alive
requests.SessionModule-level
requests.Sessionreused acrossHubClientinstances, so consecutive hub calls reuse the TCP/TLS connection instead of handshaking per call. Noticeable win on k8s pod networking.Security
The role assignment was already granting the scopes we now use directly. Switching to the service token for reads is a no-op for what the hub would have allowed — we're just not minting a derived user token that had the same scopes. State-changing calls (
create_server,delete_server, …) still use@requires_user_tokenso the audit trail records the user as the actor.Impact
For a logged-in user opening
/hub/home(no app form open):GET /users)Test plan
pytest jhub_apps/tests --ignore=jhub_apps/tests/tests_e2e— 26 passed on py3.12.test_shared_server_filteringto model the new fan-out and assertget_useris called exactly once per unique owner.