Conversation
…ap-fill) + Pages workflow
…fill pin) Align scaffold with brief spec: proper title casing on all stub pages, TODO comments in go.mod and hugo.toml for re-pinning after PR #5 merges, concurrency group set to "pages", and _index.md content per spec.
Full sections 1-13 of docs/install-production.md migrated verbatim into docs/site/content/installation.md. Updated stale chart targetRevision from v0.1.0-alpha.8 to v0.1.0-alpha.9. Rewrote links to getting-started.md and troubleshooting into root-relative /local-development/ and /troubleshooting/. Added Beta documentation gate callouts for Keycloak prereqs (sec 7) and dual-endpoint auth explanation (sec 10.9).
Covers HuggingFace PVC-backed model caching, OCI/modelcar emptyDir loading, the concurrent-download lock mechanism, PVC sizing guidance, and per-model storage class override. All technical claims confirmed against operator/internal/controller/reconcilers/storage.go and docs/design.md.
…-serving-operator)
Add full configuration.md replacing the stub. Covers Helm values reference (27 user-customizable keys with exact defaults from values.yaml), LLMModel CRD field reference (spec.model, spec.model.storage, spec.resources, spec.access, spec.serving, spec.endpoints, spec.advanced, status fields), and NebariApp fields used by the key-manager template. All tables reconciled against the real source files; no README/design discrepancies found that affected content.
Replaces the stub with the full architecture narrative migrated from docs/design.md: components, reconciliation flow, dual endpoint auth (external apiKeyAuth + internal JWT), key manager design and data model, Envoy AI Gateway, cluster-singleton TLS reconciler, single- namespace deployment model, and security model. Values reference and CRD spec table are deliberately left in /configuration/; cross-links added throughout.
Documents all five GitHub Actions workflows (test, lint, build-images, docs, add-to-project), the three GHCR images they produce, chart versioning (version/appVersion kept in sync), and the current manual release process with explicit note of gaps (no automated Helm publish, draft alpha.9 release).
…grated flat docs - Add ## Known Limitations section to README.md drawn from design.md non-goals, key manager limitations, and single-namespace constraint - Add Documentation link near the top of README.md pointing to the published docs site - Update README.md links: docs/install-production.md -> installation/, docs/getting-started.md -> local-development/ - Update pack-metadata.yaml links.docs to published docs site /installation/ - git rm docs/design.md, docs/getting-started.md, docs/install-production.md
Adds scripts/check-links.sh which builds the Hugo site and verifies: - every internal href/src in public/**/*.html resolves to an existing file under public/ (handles the /nebari-llm-serving-pack subpath prefix) - every .edit-link href maps to an existing source file under content/ Pure bash + grep/sed; no Node or external tools beyond Hugo + coreutils.
- repoint stale docs/design.md + docs/getting-started.md refs in operator source comments to the published site (architecture / local-development) - wire scripts/check-links.sh into the docs CI build job - check-links.sh: hard-error on missing editBase; strip #fragment before resolving - fix Quickstart key-manager URL to the configured nebariApp hostname - scope Pages write/id-token permissions to the deploy job (least privilege) - pack-metadata docs link -> site root; hugo mod tidy cleans stale go.sum hash
…building in check-links
tylerpotts
left a comment
There was a problem hiding this comment.
Review — Hugo documentation site
Nice consolidation. Architecturally this is sound: the three flat docs (design.md, getting-started.md, install-production.md) are removed and migrated into the site with README.md and the Go comments rewired to published URLs, so there's no flat-vs-site divergence risk. The design.md → architecture.md shrink is redistribution (CRD/Helm/model-loading moved into configuration.md / installation.md / shared-storage.md), not content loss. Theme-as-Hugo-module pinned to v0.1.1, the path-filtered build/deploy split, and least-privilege Pages permissions are all the right calls. PR builds do meaningfully run the link check (SKIP_BUILD=1 against the just-built public/).
I'm requesting changes for one reason: four doc claims contradict the actual operator/key-manager behavior, two of them security-relevant. Since this is about to become the public source of truth, those should be corrected first. Details inline.
Blockers (content correctness)
shared-storage.md— describes a.locked/noclobber/stale-lock mechanism that does not exist in the code.local-development.md:146,153— wrong key-manager API paths (/api/v1/models,/api/v1/keys); actual routes are/api/models//api/keys(key-manager/internal/api/handler.go:48-52). They 404 and contradictinstallation.md. (These lines were carried over unchanged from the oldgetting-started.mdin the rename, so no inline anchor — flagging here.)architecture.md— claims API keys are stripped before vLLM, butsanitize/forwardClientIDHeaderare an explicit TODO (security-relevant).architecture.md— presents the key audit as always-on; it's off by default (oidcUserinfoURLships empty), so keys aren't revoked on group change out of the box (security-relevant).
Non-blocking (script/workflow robustness)
A few check-links.sh portability/false-clean nits and a go-version drift note inline.
Questions
cicd-and-releasing.mddescribesdocs.ymlbut omits theCheck internal linksstep — intentional?- Want the link checker to hard-fail when zero edit-links are found (closing a silent false-clean gap)?
Everything else I checked — build/deploy gating, permissions, concurrency pattern, baseURL fallback, resolve_path logic, unsafe = true, go.mod/go.sum pin+verify, and the two comment-only Go changes — is correct.
| 3. If the lock is held by another pod: read the timestamp in `.locked`, wait, and retry | ||
| 4. If the lock is older than 1 hour: treat it as stale (pod died holding the lock) and take over | ||
|
|
||
| Because `huggingface-cli download` is idempotent, the second and subsequent replicas that acquire the lock after the first replica finishes will simply checksum files and exit quickly. |
There was a problem hiding this comment.
Blocker — this lock mechanism does not exist. There is no .locked file, noclobber, or 1-hour stale-lock takeover in the code. model-downloader/entrypoint.py just forwards argv to hf, and operator/internal/controller/reconcilers/storage.go:140 builds plain init-container args ([<model>, --local-dir, /model-cache, (--revision X)]). A user relying on this for multi-replica download safety is being promised a guarantee the operator doesn't provide. Either implement the serialization or remove/reword this section to describe the actual behavior (idempotent re-download per replica).
| - `AIGatewayRoute` for token counting, rate limiting, protocol normalization | ||
| - `SecurityPolicy` with `apiKeyAuth` attached to the generated `HTTPRoute` (same name as the `AIGatewayRoute`), per model | ||
| - `sanitize: true` strips the API key before forwarding to vLLM | ||
| - `forwardClientIDHeader: X-Client-ID` passes the authenticated client ID downstream for logging and flow control |
There was a problem hiding this comment.
Blocker (security-relevant) — key stripping is not implemented yet. sanitize: true / forwardClientIDHeader are an explicit TODO in operator/internal/controller/reconcilers/auth.go:123-125, gated on Envoy Gateway v1.7+. The security-model section (line 349) also asserts this as current behavior. As written the docs claim the API key is removed before it reaches vLLM, which is not true today. Reword to mark it as planned/not-yet-enabled.
|
|
||
| 1. Lists all API key Secrets in the operator namespace | ||
| 2. For each key entry, looks up the creator's current groups via the OIDC userinfo endpoint | ||
| 3. If the creator no longer belongs to a group that matches the model's `access.groups`, revokes the key |
There was a problem hiding this comment.
Blocker (security-relevant) — the audit is off by default. The auditor only starts when oidcUserinfoURL != "" (key-manager/cmd/main.go:96), and values.yaml ships it empty. So out of the box, keys are never revoked on group change — the eventual-consistency guarantee stated here (and at line 251) doesn't hold in a default install. Please state the precondition (audit requires oidcUserinfoURL to be configured) so operators don't assume automatic revocation.
| BROKEN_EDIT+=("BROKEN edit-link: $html_file -> $edit_href (expected: $src_file)") | ||
| fi | ||
| done < <( | ||
| grep -oE 'class=edit-link href=[^[:space:]>]+|class="edit-link"[[:space:]]+href="[^"]*"' "$html_file" \ |
There was a problem hiding this comment.
Edit-link extraction can false-clean silently. This regex matches only class=edit-link href=… (unquoted) or exactly class="edit-link" href="…". If the theme ever emits href before class, an intervening attribute (target=_blank), or a multi-class value (class="edit-link btn"), it matches nothing and the edit-link check passes vacuously. You hard-error on a missing editBase (line 119) precisely to avoid false-cleans — consider also failing (or warning loudly) when zero edit-links are found across all pages.
| # --------------------------------------------------------------------------- | ||
| BASE_URL=$(grep -m1 'baseURL' "$SITE_DIR/hugo.toml" | sed 's/.*= *"\(.*\)"/\1/') | ||
| # Strip scheme+host: https://nebari-dev.github.io/nebari-llm-serving-pack/ | ||
| SUBPATH_PREFIX=$(echo "$BASE_URL" | sed 's|^https\?://[^/]*||' | sed 's|/$||') |
There was a problem hiding this comment.
\? here is GNU-sed-only. Fine on CI (ubuntu/GNU sed), but on macOS/BSD sed \? is literal, so SUBPATH_PREFIX won't be stripped and a local run would mass-flag every link. Portable form: sed -E 's#^https?://[^/]+##'.
| esac | ||
|
|
||
| # Strip any #fragment before resolving to a file (e.g. /page/#section) | ||
| url="${url%%#*}" |
There was a problem hiding this comment.
Minor / latent: #fragment is stripped before resolving, so anchor targets aren't validated — a renamed heading (broken #anchor) passes silently. Not a current break; worth a comment noting the checker validates pages but not fragments.
| - uses: actions/checkout@v4 | ||
| with: { fetch-depth: 0 } | ||
| - uses: actions/setup-go@v5 | ||
| with: { go-version: "stable", cache: false } |
There was a problem hiding this comment.
go-version: "stable" vs docs/site/go.mod's go 1.26.3: works today via GOTOOLCHAIN=auto, but it's brittle (fails if toolchain downloads are ever restricted or "stable" lags). This go.mod only fetches the Hugo theme — relax it to go 1.26, or pin the workflow with go-version-file: docs/site/go.mod so the two can't drift.
Address review feedback on the docs site: - shared-storage: remove the described .locked/noclobber/stale-lock mechanism, which the operator does not implement; describe the actual per-replica idempotent download behavior. - local-development: fix key-manager API paths (/api/models, /api/keys; the /api/v1/* routes do not exist). - architecture + README + configuration: state that key stripping (sanitize/forwardClientIDHeader) is planned, not yet enabled (needs Envoy Gateway v1.7+), and that the group-change audit is off by default (requires keyManager.oidcUserinfoURL, which ships empty). - check-links.sh: portable sed -E; robust edit-link extraction with a hard-fail when zero edit-links are found; note that fragments are not validated. - docs.yml: pin Go via go-version-file so it can't drift from go.mod; document the link-check step in cicd-and-releasing.
|
Thanks @tylerpotts, this was a sharp review. You were right on all four content blockers; I verified each against the code before changing anything. Pushed in ef9c02e. Blockers
Your two questions
Non-blocking script/workflow nits
Site builds clean at root and subpath, and |
|
📄 Docs preview for |
# Conflicts: # docs/design.md
…sign into architecture page - Shared TLS section: explains bring-your-own-certificate mode (platform.tls.secretName), no cert-manager required for air-gapped/private-CA clusters - Spec updates section: retitled "Spec updates and restarts", explains Recreate default and why RollingUpdate deadlocks on clusters without spare GPU capacity - New /dev/shm subsection: memory-backed emptyDir for tensor parallelism > 1 - New PassthroughModel CRD section: routes shared endpoints to external OpenAI-compatible providers, YAML example, generated-resources table, and semantics notes - configuration.md: adds platform.tls.secretName and spec.serving.updateStrategy rows
- docs/site/go.mod + go.sum: bump nebari-hugo-theme v0.1.1 -> v0.2.0 - docs/site/hugo.toml: add logoLink = "https://packs.nebari.dev/" as first key under [params] so the header logo returns to the portal root
…q, key-manager name/hostname, workflow count + cleanup workflow)
tylerpotts
left a comment
There was a problem hiding this comment.
Re-review of blocker #4 (audit-based key revocation).
The off-by-default correction is right, and propagating it into the README Known Limitations and the configuration.md values table was good follow-through — the default state is now consistent across all three docs.
There's one layer below this they all still get wrong, though. Setting oidcUserinfoURL doesn't actually enable revocation in the version we ship. makeUserinfoLookup is a stub that always returns an error (key-manager/cmd/main.go:164-173, identical on main):
// For v0.1 this is a stub that returns an error so the auditor skips revocation
// safely until token exchange is implemented.
return nil, fmt.Errorf("userinfo lookup not yet implemented (url: %s, user: %s)", ...)So when oidcUserinfoURL is set the audit loop starts (main.go:96-99) but every group lookup fails and the auditor skips revocation (fail-safe). Group-change revocation isn't implemented at all in 0.1.0-alpha.x — configuring the URL doesn't turn it on.
That means the "enabled → revokes" framing is still inaccurate in the spots flagged inline. Could we reword them the way the sanitize/forwardClientIDHeader fix handled it — revocation is planned but not yet implemented (the lookup is a stub pending OIDC token exchange), rather than a toggle you switch on with oidcUserinfoURL? As written, an operator who sets the URL will reasonably believe keys get revoked on group change, and they won't. Since this is the public source of truth and it's security-relevant, I'd like it accurate before merge.
| 2. For each key entry, looks up the creator's current groups via the OIDC userinfo endpoint | ||
| 3. If the creator no longer belongs to a group that matches the model's `access.groups`, revokes the key | ||
|
|
||
| This audit is a precondition for the eventual-consistency revocation described above, and it is **disabled unless `keyManager.oidcUserinfoURL` is set**. The default `values.yaml` ships it empty, so a default install performs no group-change revocation. Configure the userinfo endpoint (for Keycloak, `https://<keycloak>/realms/<realm>/protocol/openid-connect/userinfo`) to enable it. |
There was a problem hiding this comment.
Blocker (security-relevant), still open. "Configure the userinfo endpoint ... to enable it" over-promises: even with oidcUserinfoURL set, the audit loop starts but makeUserinfoLookup (key-manager/cmd/main.go:164-173) always returns an error, so the auditor skips revocation. Group-change revocation is not implemented in 0.1.0-alpha.x — setting the URL does not turn it on. Please reframe as planned/not-yet-implemented (the lookup is a stub pending OIDC token exchange), same as the sanitize fix, rather than a config toggle. Same applies to line 344 ("looks up the creator's current groups via the OIDC userinfo endpoint").
| - Usage billing or cost chargeback | ||
| - Scale-to-zero autoscaling | ||
| - Per-key rate limiting or token quotas | ||
| - API key expiration (periodic audit revokes keys when group membership changes) |
There was a problem hiding this comment.
This feature-list line still asserts revocation as a current capability ("periodic audit revokes keys when group membership changes"). It isn't — the userinfo lookup is a stub that errors, so no key is ever revoked, even when the audit is enabled. Mark it planned/not-yet-implemented.
| This pack is at alpha maturity (`v0.1.0-alpha.x`). The following limitations apply: | ||
|
|
||
| - **Single-namespace model:** All LLMModels must be in the operator's own namespace. Per-team isolation requires running separate pack installs (one namespace per team). This is a hard constraint imposed by Envoy Gateway's `apiKeyAuth`, which does not support cross-namespace Secret references. | ||
| - **API keys are not continuously tied to group membership.** Keys are issued based on the user's OIDC groups at creation time. If a user later loses group access, existing keys continue to work. The periodic audit that revokes such keys (default interval: 5 minutes) is **off by default** - it only runs when `keyManager.oidcUserinfoURL` is configured, which ships empty. Without it, keys are never revoked on group change. Even when enabled, this is eventual consistency, not real-time revocation. |
There was a problem hiding this comment.
Good — the off-by-default state is now correct here. But "Even when enabled, this is eventual consistency" still implies enabling the audit makes it revoke. It doesn't: the userinfo lookup (key-manager/cmd/main.go:164-173) is a stub that errors, so revocation never fires even when oidcUserinfoURL is set. Reword to planned/not-yet-implemented.
| | `keyManager.image.tag` | `""` | Image tag. Defaults to `.Chart.AppVersion` when empty, keeping chart and image versions in sync. Override only when testing a specific build. | | ||
| | `keyManager.image.pullPolicy` | `Always` | Image pull policy. | | ||
| | `keyManager.auditInterval` | `5m` | How often the audit loop checks key owners' current groups (only runs when `oidcUserinfoURL` is set). | | ||
| | `keyManager.oidcUserinfoURL` | `""` | OIDC userinfo endpoint used by the audit loop to re-check a key owner's groups. **Empty (the default) disables the audit entirely**, so keys are never revoked on group change. For Keycloak: `https://<keycloak>/realms/<realm>/protocol/openid-connect/userinfo`. | |
There was a problem hiding this comment.
"OIDC userinfo endpoint used by the audit loop to re-check a key owner's groups" implies it works once set. The lookup is a stub that always errors (key-manager/cmd/main.go:164-173), so setting this URL starts the audit loop but never revokes anything. Note that group-change revocation is not yet implemented in the current release.
# Conflicts: # docs/design.md
The userinfo lookup (key-manager/cmd/main.go) is a stub that always errors, so the audit loop never revokes keys even when oidcUserinfoURL is set. Reword architecture.md, README, and configuration.md to state group-change revocation is planned but not implemented in v0.1, rather than a toggle enabled by configuring the userinfo endpoint. Also bump the stale defaults.serving.image reference to v0.7.0 to match values.yaml.
|
Thanks for catching the layer underneath, Tyler - you're right, and I verified it against Reframed all the spots you flagged from "config toggle" to "planned but not implemented in v0.1," same treatment as the
Also resolved the merge conflict with Builds clean at root and subpath, |
tylerpotts
left a comment
There was a problem hiding this comment.
Review — Hugo docs site + Cloudflare Pages
Scope: New Hugo site at docs/site/, Cloudflare Pages deploy workflows, a link-checker script, pack-metadata.yaml rename, and two comment-only Go edits. ~1970 LOC, mostly Markdown.
Solid, careful PR. One blocker that directly contradicts the PR's own purpose, plus some consistency cleanups and questions.
Cross-cutting
No design-gating blockers. The PR establishes one canonical docs host consistently in hugo.toml (baseURL = https://packs.nebari.dev/llm-serving-pack/), pack-metadata.yaml, and README.md — moving off GitHub Pages onto Cloudflare Pages. That intent is consistent everywhere except the two Go comments (Blocker #1).
Blockers
1. The two Go comment edits point at a host this PR decommissions — they will 404.
operator/internal/controller/reconcilers/auth.go:52→https://nebari-dev.github.io/nebari-llm-serving-pack/local-development/operator/internal/controller/reconcilers/clustertls.go:22→https://nebari-dev.github.io/nebari-llm-serving-pack/architecture/
These point to GitHub Pages (*.github.io), which is exactly what this PR migrates away from — there is no GitHub Pages deploy here, so these URLs are dead on arrival (not even a redirect, unlike the github.com repo-rename alias). They should match the canonical host:
https://packs.nebari.dev/llm-serving-pack/local-development/https://packs.nebari.dev/llm-serving-pack/architecture/
For a PR whose entire purpose is correct doc links, shipping two guaranteed-broken ones is worth fixing. (check-links.sh won't catch these — it only validates internal links inside the built Hugo site, not source-comment URLs.)
Stylistic / should-fix
2. hugo.toml repo/editBase use the stale nebari-llm-serving-pack repo name. The repo was renamed to llm-serving-pack (the old name is now only a redirect alias), and this PR renames the pack accordingly in pack-metadata.yaml. The editBase/repo links work today only via GitHub's 302 redirect. For consistency with the rename — and so "Edit this page" links don't depend on a redirect — update both to llm-serving-pack.
3. docs/design.md is deleted, but docs/getting-started.md still exists and links to it. Only design.md is removed here, yet the description says "Old flat docs deleted" (migrating install-production.md and getting-started.md). Those two flat files are not removed, and docs/getting-started.md:178 still links to the now-deleted design.md — a broken intra-doc link the Hugo link-checker won't catch (it's outside docs/site/). Either finish the deletion or fix the dangling link. See Question 1.
4. Preview baseURL is computed independently from Cloudflare's actual alias (docs.yml). The ALIAS=... tr/sed step builds a preview baseURL, but the real preview URL comes from steps.deploy.outputs.pages-deployment-alias-url. If Cloudflare's own alias generation diverges (e.g. the 63-char DNS-label cap on a long branch name, or different sanitization), the built site's absolute links point at the wrong host in previews. Production is unaffected — preview-only cosmetic risk. Worth a comment acknowledging the two can drift.
Questions
-
Are the remaining flat docs (
docs/getting-started.md,docs/install-production.md) intentionally kept? README andpack-metadata.yamlalready repoint away from them, so they're now orphaned from the main entry points but still referenced byexamples/*.yamlcomments. If they're meant to be gone, delete them and fix the example references; if kept, fix item 3's broken link. -
Does anything key off the old pack name
nebari-llm-serving-pack? The rename tollm-serving-packdrives the portal's/llm-serving-pack/route viadocs_site: true. Confirm thesoftware-pack-dashboardroute table is fine. (Separately,charts/nebari-llm-serving/values.yamlstill usesghcr.io/nebari-dev/nebari-llm-serving-pack/...image repos — image paths, separate concern, but worth confirming those GHCR paths weren't also renamed.)
What's good
- Fork-PR handling is correct: deploy is gated on same-repo, so no secrets leak to forks, while build + link-check still gate every PR.
check-links.shis thoughtfully defensive: missingeditBaseand zero-edit-links both hard-fail (avoiding vacuous passes), and the fragment-validation limitation is documented honestly.- Preview cleanup workflow correctly scopes to same-repo and uses
cancel-in-progress: falseso a cleanup isn't interrupted. - README "Known Limitations" is genuinely honest (stubbed revocation, single-namespace constraint, ~1 MiB storage ceiling) rather than marketing copy.
Bottom line: Fix the two dead github.io URLs in the Go comments (Blocker #1) before merge — they directly contradict the PR's own purpose. The rest are consistency cleanups and clarifying questions.
🤖 Reviewed with Claude Code
Adds a Hugo documentation site at
docs/site/, deployed to Cloudflare Pages as part of the newpacks.nebari.devsoftware-packs portal, consumingnebari-hugo-theme(pinned tov0.1.1). Migrates the existing flat docs into the agreed sidebar layout, authors the Beta-gate content, and removes the old flat docs.Layout (matches the design mockup)
What moved / what's new
install-production.md-> Installation (runbook + Keycloak auth setup),getting-started.md-> Local Development,design.md-> Architecture. Old flat docs deleted; README +pack-metadata.yamllinks repointed to the portal (packs.nebari.dev/llm-serving-pack/)..github/workflows/docs.yml(build + link-check on PRs, deploy on every push) andscripts/check-links.sh(internal + edit-link checker).Deployment: Cloudflare Pages + per-PR previews
This PR now deploys to Cloudflare Pages instead of GitHub Pages, so the docs site lands directly on the
packs.nebari.devportal rather than on a URL we would immediately retire.maindeploys to production atpacks.nebari.dev/llm-serving-pack/; every other branch / PR gets its own preview URL, posted as a sticky comment on the PR.baseURL(production subpath vs preview alias) and keeps thecheck-links.shgate. Fork PRs still build and link-check but skip the deploy (no secrets exposed to forks).pack-metadata.yamlsetsdocs_site: true, which is what tells the portal (thesoftware-pack-dashboardrouting Worker) to route/llm-serving-pack/here.CLOUDFLARE_API_TOKEN/CLOUDFLARE_ACCOUNT_IDsecrets (set). The production route appears once this merges and the dashboard regenerates its route table.Reviewers who looked at the earlier GitHub Pages version: the deploy mechanism is the main change since then; content is unchanged.
Accuracy
Content was cross-checked against the real
charts/, the LLMModel CRD, the operator (storage.go), and the workflows during review - fixing drift along the way (e.g. thedownload-model->model-downloadercontainer name, the chart version, the semver image-tagvprefix). Builds clean at root and at the portal subpath;check-links.shpasses.Note
Targets Beta documentation maturity. The pack itself remains alpha (
v0.1.0-alpha.x) - GA-only docs (CHANGELOG, sizing guidance, upgrade path) are intentionally out of scope.Screenshots
How to test
Requires Hugo extended (CI pins 0.159.0) and a Go toolchain (Hugo resolves the theme module). From the repo root:
Then open http://localhost:1313/. The
nebari-hugo-themeis pulled automatically as a Hugo Module (pinned tov0.1.1) - no extra setup. If a theme bump ever shows stale styling, runhugo mod cleanand rebuild.Worth checking:
Validate internal + edit links:
./scripts/check-links.sh(also runs in the Docs CI build).