Skip to content

Add Hugo documentation site (Beta docs) consuming nebari-hugo-theme#108

Open
dcmcand wants to merge 42 commits into
mainfrom
docs-site
Open

Add Hugo documentation site (Beta docs) consuming nebari-hugo-theme#108
dcmcand wants to merge 42 commits into
mainfrom
docs-site

Conversation

@dcmcand

@dcmcand dcmcand commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

Adds a Hugo documentation site at docs/site/, deployed to Cloudflare Pages as part of the new packs.nebari.dev software-packs portal, consuming nebari-hugo-theme (pinned to v0.1.1). Migrates the existing flat docs into the agreed sidebar layout, authors the Beta-gate content, and removes the old flat docs.

Layout (matches the design mockup)

  • Documentation: Quickstart, Installation, Local Development, Shared Storage, Troubleshooting
  • Reference: Configuration, Architecture, CI/CD and Releasing

What moved / what's new

  • Migrated: install-production.md -> Installation (runbook + Keycloak auth setup), getting-started.md -> Local Development, design.md -> Architecture. Old flat docs deleted; README + pack-metadata.yaml links repointed to the portal (packs.nebari.dev/llm-serving-pack/).
  • New (Beta gates): a Known Limitations section in the README, a Troubleshooting page (7 real failure modes with diagnostics), and a Configuration values + LLMModel CRD + NebariApp reference.
  • New (contributor): CI/CD and Releasing page documenting the real workflows, images, and release process (including honest gaps: no automated Helm publish yet).
  • .github/workflows/docs.yml (build + link-check on PRs, deploy on every push) and scripts/check-links.sh (internal + edit-link checker).

Deployment: Cloudflare Pages + per-PR previews

This PR now deploys to Cloudflare Pages instead of GitHub Pages, so the docs site lands directly on the packs.nebari.dev portal rather than on a URL we would immediately retire.

  • main deploys to production at packs.nebari.dev/llm-serving-pack/; every other branch / PR gets its own preview URL, posted as a sticky comment on the PR.
  • Builds with a branch-aware baseURL (production subpath vs preview alias) and keeps the check-links.sh gate. Fork PRs still build and link-check but skip the deploy (no secrets exposed to forks).
  • pack-metadata.yaml sets docs_site: true, which is what tells the portal (the software-pack-dashboard routing Worker) to route /llm-serving-pack/ here.
  • Requires the org-level CLOUDFLARE_API_TOKEN / CLOUDFLARE_ACCOUNT_ID secrets (set). The production route appears once this merges and the dashboard regenerates its route table.

Reviewers who looked at the earlier GitHub Pages version: the deploy mechanism is the main change since then; content is unchanged.

Accuracy

Content was cross-checked against the real charts/, the LLMModel CRD, the operator (storage.go), and the workflows during review - fixing drift along the way (e.g. the download-model -> model-downloader container name, the chart version, the semver image-tag v prefix). Builds clean at root and at the portal subpath; check-links.sh passes.

Note

Targets Beta documentation maturity. The pack itself remains alpha (v0.1.0-alpha.x) - GA-only docs (CHANGELOG, sizing guidance, upgrade path) are intentionally out of scope.

Screenshots

Home Docs page (Architecture)
Home overview Docs page with sidebar + ToC
Quickstart (dark) Mobile
Quickstart, dark mode Responsive mobile layout

How to test

Requires Hugo extended (CI pins 0.159.0) and a Go toolchain (Hugo resolves the theme module). From the repo root:

cd docs/site
hugo server --baseURL "http://localhost:1313/"

Then open http://localhost:1313/. The nebari-hugo-theme is pulled automatically as a Hugo Module (pinned to v0.1.1) - no extra setup. If a theme bump ever shows stale styling, run hugo mod clean and rebuild.

Worth checking:

  • Home overview; each Quickstart prerequisite links into its matching Installation section
  • Sidebar (Documentation / Reference groups, collapsible, active highlight) matches the design mockup
  • Right-hand "On this page" ToC + scroll-spy on a long page (Architecture / Installation)
  • Dark/light toggle, search, and the mobile layout (narrow the window)

Validate internal + edit links: ./scripts/check-links.sh (also runs in the Docs CI build).

dcmcand added 28 commits June 18, 2026 16:07
…fill pin)

Align scaffold with brief spec: proper title casing on all stub pages,
TODO comments in go.mod and hugo.toml for re-pinning after PR #5 merges,
concurrency group set to "pages", and _index.md content per spec.
Full sections 1-13 of docs/install-production.md migrated verbatim into
docs/site/content/installation.md. Updated stale chart targetRevision
from v0.1.0-alpha.8 to v0.1.0-alpha.9. Rewrote links to getting-started.md
and troubleshooting into root-relative /local-development/ and /troubleshooting/.
Added Beta documentation gate callouts for Keycloak prereqs (sec 7) and
dual-endpoint auth explanation (sec 10.9).
Covers HuggingFace PVC-backed model caching, OCI/modelcar emptyDir
loading, the concurrent-download lock mechanism, PVC sizing guidance,
and per-model storage class override. All technical claims confirmed
against operator/internal/controller/reconcilers/storage.go and
docs/design.md.
Add full configuration.md replacing the stub. Covers Helm values
reference (27 user-customizable keys with exact defaults from
values.yaml), LLMModel CRD field reference (spec.model, spec.model.storage,
spec.resources, spec.access, spec.serving, spec.endpoints, spec.advanced,
status fields), and NebariApp fields used by the key-manager template.
All tables reconciled against the real source files; no README/design
discrepancies found that affected content.
Replaces the stub with the full architecture narrative migrated from
docs/design.md: components, reconciliation flow, dual endpoint auth
(external apiKeyAuth + internal JWT), key manager design and data
model, Envoy AI Gateway, cluster-singleton TLS reconciler, single-
namespace deployment model, and security model.

Values reference and CRD spec table are deliberately left in
/configuration/; cross-links added throughout.
Documents all five GitHub Actions workflows (test, lint, build-images,
docs, add-to-project), the three GHCR images they produce, chart
versioning (version/appVersion kept in sync), and the current manual
release process with explicit note of gaps (no automated Helm publish,
draft alpha.9 release).
…grated flat docs

- Add ## Known Limitations section to README.md drawn from design.md
  non-goals, key manager limitations, and single-namespace constraint
- Add Documentation link near the top of README.md pointing to the
  published docs site
- Update README.md links: docs/install-production.md -> installation/,
  docs/getting-started.md -> local-development/
- Update pack-metadata.yaml links.docs to published docs site /installation/
- git rm docs/design.md, docs/getting-started.md, docs/install-production.md
Adds scripts/check-links.sh which builds the Hugo site and verifies:
- every internal href/src in public/**/*.html resolves to an existing
  file under public/ (handles the /nebari-llm-serving-pack subpath prefix)
- every .edit-link href maps to an existing source file under content/

Pure bash + grep/sed; no Node or external tools beyond Hugo + coreutils.
- repoint stale docs/design.md + docs/getting-started.md refs in operator
  source comments to the published site (architecture / local-development)
- wire scripts/check-links.sh into the docs CI build job
- check-links.sh: hard-error on missing editBase; strip #fragment before resolving
- fix Quickstart key-manager URL to the configured nebariApp hostname
- scope Pages write/id-token permissions to the deploy job (least privilege)
- pack-metadata docs link -> site root; hugo mod tidy cleans stale go.sum hash

@tylerpotts tylerpotts left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — Hugo documentation site

Nice consolidation. Architecturally this is sound: the three flat docs (design.md, getting-started.md, install-production.md) are removed and migrated into the site with README.md and the Go comments rewired to published URLs, so there's no flat-vs-site divergence risk. The design.mdarchitecture.md shrink is redistribution (CRD/Helm/model-loading moved into configuration.md / installation.md / shared-storage.md), not content loss. Theme-as-Hugo-module pinned to v0.1.1, the path-filtered build/deploy split, and least-privilege Pages permissions are all the right calls. PR builds do meaningfully run the link check (SKIP_BUILD=1 against the just-built public/).

I'm requesting changes for one reason: four doc claims contradict the actual operator/key-manager behavior, two of them security-relevant. Since this is about to become the public source of truth, those should be corrected first. Details inline.

Blockers (content correctness)

  1. shared-storage.md — describes a .locked/noclobber/stale-lock mechanism that does not exist in the code.
  2. local-development.md:146,153 — wrong key-manager API paths (/api/v1/models, /api/v1/keys); actual routes are /api/models / /api/keys (key-manager/internal/api/handler.go:48-52). They 404 and contradict installation.md. (These lines were carried over unchanged from the old getting-started.md in the rename, so no inline anchor — flagging here.)
  3. architecture.md — claims API keys are stripped before vLLM, but sanitize/forwardClientIDHeader are an explicit TODO (security-relevant).
  4. architecture.md — presents the key audit as always-on; it's off by default (oidcUserinfoURL ships empty), so keys aren't revoked on group change out of the box (security-relevant).

Non-blocking (script/workflow robustness)

A few check-links.sh portability/false-clean nits and a go-version drift note inline.

Questions

  • cicd-and-releasing.md describes docs.yml but omits the Check internal links step — intentional?
  • Want the link checker to hard-fail when zero edit-links are found (closing a silent false-clean gap)?

Everything else I checked — build/deploy gating, permissions, concurrency pattern, baseURL fallback, resolve_path logic, unsafe = true, go.mod/go.sum pin+verify, and the two comment-only Go changes — is correct.

Comment thread docs/site/content/shared-storage.md Outdated
3. If the lock is held by another pod: read the timestamp in `.locked`, wait, and retry
4. If the lock is older than 1 hour: treat it as stale (pod died holding the lock) and take over

Because `huggingface-cli download` is idempotent, the second and subsequent replicas that acquire the lock after the first replica finishes will simply checksum files and exit quickly.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocker — this lock mechanism does not exist. There is no .locked file, noclobber, or 1-hour stale-lock takeover in the code. model-downloader/entrypoint.py just forwards argv to hf, and operator/internal/controller/reconcilers/storage.go:140 builds plain init-container args ([<model>, --local-dir, /model-cache, (--revision X)]). A user relying on this for multi-replica download safety is being promised a guarantee the operator doesn't provide. Either implement the serialization or remove/reword this section to describe the actual behavior (idempotent re-download per replica).

Comment thread docs/site/content/architecture.md Outdated
- `AIGatewayRoute` for token counting, rate limiting, protocol normalization
- `SecurityPolicy` with `apiKeyAuth` attached to the generated `HTTPRoute` (same name as the `AIGatewayRoute`), per model
- `sanitize: true` strips the API key before forwarding to vLLM
- `forwardClientIDHeader: X-Client-ID` passes the authenticated client ID downstream for logging and flow control

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocker (security-relevant) — key stripping is not implemented yet. sanitize: true / forwardClientIDHeader are an explicit TODO in operator/internal/controller/reconcilers/auth.go:123-125, gated on Envoy Gateway v1.7+. The security-model section (line 349) also asserts this as current behavior. As written the docs claim the API key is removed before it reaches vLLM, which is not true today. Reword to mark it as planned/not-yet-enabled.

Comment thread docs/site/content/architecture.md Outdated

1. Lists all API key Secrets in the operator namespace
2. For each key entry, looks up the creator's current groups via the OIDC userinfo endpoint
3. If the creator no longer belongs to a group that matches the model's `access.groups`, revokes the key

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocker (security-relevant) — the audit is off by default. The auditor only starts when oidcUserinfoURL != "" (key-manager/cmd/main.go:96), and values.yaml ships it empty. So out of the box, keys are never revoked on group change — the eventual-consistency guarantee stated here (and at line 251) doesn't hold in a default install. Please state the precondition (audit requires oidcUserinfoURL to be configured) so operators don't assume automatic revocation.

Comment thread scripts/check-links.sh Outdated
BROKEN_EDIT+=("BROKEN edit-link: $html_file -> $edit_href (expected: $src_file)")
fi
done < <(
grep -oE 'class=edit-link href=[^[:space:]>]+|class="edit-link"[[:space:]]+href="[^"]*"' "$html_file" \

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Edit-link extraction can false-clean silently. This regex matches only class=edit-link href=… (unquoted) or exactly class="edit-link" href="…". If the theme ever emits href before class, an intervening attribute (target=_blank), or a multi-class value (class="edit-link btn"), it matches nothing and the edit-link check passes vacuously. You hard-error on a missing editBase (line 119) precisely to avoid false-cleans — consider also failing (or warning loudly) when zero edit-links are found across all pages.

Comment thread scripts/check-links.sh Outdated
# ---------------------------------------------------------------------------
BASE_URL=$(grep -m1 'baseURL' "$SITE_DIR/hugo.toml" | sed 's/.*= *"\(.*\)"/\1/')
# Strip scheme+host: https://nebari-dev.github.io/nebari-llm-serving-pack/
SUBPATH_PREFIX=$(echo "$BASE_URL" | sed 's|^https\?://[^/]*||' | sed 's|/$||')

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

\? here is GNU-sed-only. Fine on CI (ubuntu/GNU sed), but on macOS/BSD sed \? is literal, so SUBPATH_PREFIX won't be stripped and a local run would mass-flag every link. Portable form: sed -E 's#^https?://[^/]+##'.

Comment thread scripts/check-links.sh
esac

# Strip any #fragment before resolving to a file (e.g. /page/#section)
url="${url%%#*}"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor / latent: #fragment is stripped before resolving, so anchor targets aren't validated — a renamed heading (broken #anchor) passes silently. Not a current break; worth a comment noting the checker validates pages but not fragments.

Comment thread .github/workflows/docs.yml Outdated
- uses: actions/checkout@v4
with: { fetch-depth: 0 }
- uses: actions/setup-go@v5
with: { go-version: "stable", cache: false }

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

go-version: "stable" vs docs/site/go.mod's go 1.26.3: works today via GOTOOLCHAIN=auto, but it's brittle (fails if toolchain downloads are ever restricted or "stable" lags). This go.mod only fetches the Hugo theme — relax it to go 1.26, or pin the workflow with go-version-file: docs/site/go.mod so the two can't drift.

Address review feedback on the docs site:

- shared-storage: remove the described .locked/noclobber/stale-lock
  mechanism, which the operator does not implement; describe the actual
  per-replica idempotent download behavior.
- local-development: fix key-manager API paths (/api/models, /api/keys;
  the /api/v1/* routes do not exist).
- architecture + README + configuration: state that key stripping
  (sanitize/forwardClientIDHeader) is planned, not yet enabled (needs
  Envoy Gateway v1.7+), and that the group-change audit is off by default
  (requires keyManager.oidcUserinfoURL, which ships empty).
- check-links.sh: portable sed -E; robust edit-link extraction with a
  hard-fail when zero edit-links are found; note that fragments are not
  validated.
- docs.yml: pin Go via go-version-file so it can't drift from go.mod;
  document the link-check step in cicd-and-releasing.
@dcmcand

dcmcand commented Jun 23, 2026

Copy link
Copy Markdown
Contributor Author

Thanks @tylerpotts, this was a sharp review. You were right on all four content blockers; I verified each against the code before changing anything. Pushed in ef9c02e.

Blockers

  1. shared-storage - confirmed there's no .locked/noclobber/stale-lock anywhere; entrypoint.py just forwards to hf and storage.go builds plain init-container args. Removed the invented lock section and described the actual behavior: independent per-replica init containers, idempotent hf download, no operator-level coordination.
  2. local-development - fixed the key-manager paths to /api/models and /api/keys (the /api/v1/* routes don't exist in handler.go), so they now match installation.md.
  3. architecture - sanitize/forwardClientIDHeader are now marked planned-not-enabled (Envoy Gateway v1.7+), in both the endpoint list and the security-model section.
  4. architecture - the audit is now stated as off by default and gated on keyManager.oidcUserinfoURL. I also carried the same correction into the README "Known Limitations" and the configuration.md values table, since they asserted the same always-on behavior.

Your two questions

  • docs.yml link-check step omission - not intentional, just a gap. The build-job description now includes the check-links.sh (SKIP_BUILD=1) step and the Go pin.
  • Hard-fail on zero edit-links - done. The checker now counts extracted edit-links and exits non-zero if none are found, same rationale as the missing-editBase guard. I also broadened the extraction regex to match the <a ...edit-link...> tag regardless of attribute order/quoting/multi-class, so it won't silently match nothing if the theme markup shifts.

Non-blocking script/workflow nits

  • SUBPATH_PREFIX now uses sed -E (portable on BSD/macOS, not just GNU).
  • Added a comment noting the checker validates pages but not #fragments.
  • setup-go now pins go-version-file: docs/site/go.mod so the workflow and module versions can't drift.

Site builds clean at root and subpath, and check-links.sh passes. Ready for another look when you have a moment.

@dcmcand dcmcand requested a review from tylerpotts June 23, 2026 07:25
@dcmcand

dcmcand commented Jun 24, 2026

Copy link
Copy Markdown
Contributor Author

📄 Docs preview for docs-site:
https://docs-site.llm-serving-pack.pages.dev

dcmcand added 6 commits June 24, 2026 14:52
…sign into architecture page

- Shared TLS section: explains bring-your-own-certificate mode (platform.tls.secretName),
  no cert-manager required for air-gapped/private-CA clusters
- Spec updates section: retitled "Spec updates and restarts", explains Recreate default
  and why RollingUpdate deadlocks on clusters without spare GPU capacity
- New /dev/shm subsection: memory-backed emptyDir for tensor parallelism > 1
- New PassthroughModel CRD section: routes shared endpoints to external OpenAI-compatible
  providers, YAML example, generated-resources table, and semantics notes
- configuration.md: adds platform.tls.secretName and spec.serving.updateStrategy rows
- docs/site/go.mod + go.sum: bump nebari-hugo-theme v0.1.1 -> v0.2.0
- docs/site/hugo.toml: add logoLink = "https://packs.nebari.dev/" as
  first key under [params] so the header logo returns to the portal root
…q, key-manager name/hostname, workflow count + cleanup workflow)

@tylerpotts tylerpotts left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review of blocker #4 (audit-based key revocation).

The off-by-default correction is right, and propagating it into the README Known Limitations and the configuration.md values table was good follow-through — the default state is now consistent across all three docs.

There's one layer below this they all still get wrong, though. Setting oidcUserinfoURL doesn't actually enable revocation in the version we ship. makeUserinfoLookup is a stub that always returns an error (key-manager/cmd/main.go:164-173, identical on main):

// For v0.1 this is a stub that returns an error so the auditor skips revocation
// safely until token exchange is implemented.
return nil, fmt.Errorf("userinfo lookup not yet implemented (url: %s, user: %s)", ...)

So when oidcUserinfoURL is set the audit loop starts (main.go:96-99) but every group lookup fails and the auditor skips revocation (fail-safe). Group-change revocation isn't implemented at all in 0.1.0-alpha.x — configuring the URL doesn't turn it on.

That means the "enabled → revokes" framing is still inaccurate in the spots flagged inline. Could we reword them the way the sanitize/forwardClientIDHeader fix handled it — revocation is planned but not yet implemented (the lookup is a stub pending OIDC token exchange), rather than a toggle you switch on with oidcUserinfoURL? As written, an operator who sets the URL will reasonably believe keys get revoked on group change, and they won't. Since this is the public source of truth and it's security-relevant, I'd like it accurate before merge.

Comment thread docs/site/content/architecture.md Outdated
2. For each key entry, looks up the creator's current groups via the OIDC userinfo endpoint
3. If the creator no longer belongs to a group that matches the model's `access.groups`, revokes the key

This audit is a precondition for the eventual-consistency revocation described above, and it is **disabled unless `keyManager.oidcUserinfoURL` is set**. The default `values.yaml` ships it empty, so a default install performs no group-change revocation. Configure the userinfo endpoint (for Keycloak, `https://<keycloak>/realms/<realm>/protocol/openid-connect/userinfo`) to enable it.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocker (security-relevant), still open. "Configure the userinfo endpoint ... to enable it" over-promises: even with oidcUserinfoURL set, the audit loop starts but makeUserinfoLookup (key-manager/cmd/main.go:164-173) always returns an error, so the auditor skips revocation. Group-change revocation is not implemented in 0.1.0-alpha.x — setting the URL does not turn it on. Please reframe as planned/not-yet-implemented (the lookup is a stub pending OIDC token exchange), same as the sanitize fix, rather than a config toggle. Same applies to line 344 ("looks up the creator's current groups via the OIDC userinfo endpoint").

Comment thread docs/site/content/architecture.md Outdated
- Usage billing or cost chargeback
- Scale-to-zero autoscaling
- Per-key rate limiting or token quotas
- API key expiration (periodic audit revokes keys when group membership changes)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feature-list line still asserts revocation as a current capability ("periodic audit revokes keys when group membership changes"). It isn't — the userinfo lookup is a stub that errors, so no key is ever revoked, even when the audit is enabled. Mark it planned/not-yet-implemented.

Comment thread README.md Outdated
This pack is at alpha maturity (`v0.1.0-alpha.x`). The following limitations apply:

- **Single-namespace model:** All LLMModels must be in the operator's own namespace. Per-team isolation requires running separate pack installs (one namespace per team). This is a hard constraint imposed by Envoy Gateway's `apiKeyAuth`, which does not support cross-namespace Secret references.
- **API keys are not continuously tied to group membership.** Keys are issued based on the user's OIDC groups at creation time. If a user later loses group access, existing keys continue to work. The periodic audit that revokes such keys (default interval: 5 minutes) is **off by default** - it only runs when `keyManager.oidcUserinfoURL` is configured, which ships empty. Without it, keys are never revoked on group change. Even when enabled, this is eventual consistency, not real-time revocation.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good — the off-by-default state is now correct here. But "Even when enabled, this is eventual consistency" still implies enabling the audit makes it revoke. It doesn't: the userinfo lookup (key-manager/cmd/main.go:164-173) is a stub that errors, so revocation never fires even when oidcUserinfoURL is set. Reword to planned/not-yet-implemented.

Comment thread docs/site/content/configuration.md Outdated
| `keyManager.image.tag` | `""` | Image tag. Defaults to `.Chart.AppVersion` when empty, keeping chart and image versions in sync. Override only when testing a specific build. |
| `keyManager.image.pullPolicy` | `Always` | Image pull policy. |
| `keyManager.auditInterval` | `5m` | How often the audit loop checks key owners' current groups (only runs when `oidcUserinfoURL` is set). |
| `keyManager.oidcUserinfoURL` | `""` | OIDC userinfo endpoint used by the audit loop to re-check a key owner's groups. **Empty (the default) disables the audit entirely**, so keys are never revoked on group change. For Keycloak: `https://<keycloak>/realms/<realm>/protocol/openid-connect/userinfo`. |

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"OIDC userinfo endpoint used by the audit loop to re-check a key owner's groups" implies it works once set. The lookup is a stub that always errors (key-manager/cmd/main.go:164-173), so setting this URL starts the audit loop but never revokes anything. Note that group-change revocation is not yet implemented in the current release.

dcmcand added 2 commits June 26, 2026 10:17
The userinfo lookup (key-manager/cmd/main.go) is a stub that always
errors, so the audit loop never revokes keys even when oidcUserinfoURL
is set. Reword architecture.md, README, and configuration.md to state
group-change revocation is planned but not implemented in v0.1, rather
than a toggle enabled by configuring the userinfo endpoint. Also bump
the stale defaults.serving.image reference to v0.7.0 to match values.yaml.
@dcmcand

dcmcand commented Jun 26, 2026

Copy link
Copy Markdown
Contributor Author

Thanks for catching the layer underneath, Tyler - you're right, and I verified it against main before changing anything. makeUserinfoLookup (key-manager/cmd/main.go:164-173) is a stub that always returns an error, so the auditor skips revocation even when oidcUserinfoURL is set. The audit loop starting (main.go:96) was masking the fact that revocation never actually fires.

Reframed all the spots you flagged from "config toggle" to "planned but not implemented in v0.1," same treatment as the sanitize/forwardClientIDHeader fix:

  • architecture.md - non-goals line (40), the "Known limitation" section (renamed to "group-change revocation not yet implemented"), and the "API key audit" section now lead with "This is not functional in v0.1" and explain the stub.
  • README.md - the "not tied to ongoing group membership" and "No API key expiration" limitations.
  • configuration.md - the auditInterval and oidcUserinfoURL rows: setting the URL starts the loop but does not enable revocation.

Also resolved the merge conflict with main: docs/design.md stays deleted (migrated into the site), and I carried forward the v0.7.0 bump from #100 - one stale defaults.serving.image: v0.6.0 reference in configuration.md was still on the old version.

Builds clean at root and subpath, check-links.sh passes. Ready for another look.

@dcmcand dcmcand requested a review from tylerpotts June 26, 2026 08:39

@tylerpotts tylerpotts left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — Hugo docs site + Cloudflare Pages

Scope: New Hugo site at docs/site/, Cloudflare Pages deploy workflows, a link-checker script, pack-metadata.yaml rename, and two comment-only Go edits. ~1970 LOC, mostly Markdown.

Solid, careful PR. One blocker that directly contradicts the PR's own purpose, plus some consistency cleanups and questions.

Cross-cutting

No design-gating blockers. The PR establishes one canonical docs host consistently in hugo.toml (baseURL = https://packs.nebari.dev/llm-serving-pack/), pack-metadata.yaml, and README.md — moving off GitHub Pages onto Cloudflare Pages. That intent is consistent everywhere except the two Go comments (Blocker #1).

Blockers

1. The two Go comment edits point at a host this PR decommissions — they will 404.

  • operator/internal/controller/reconcilers/auth.go:52https://nebari-dev.github.io/nebari-llm-serving-pack/local-development/
  • operator/internal/controller/reconcilers/clustertls.go:22https://nebari-dev.github.io/nebari-llm-serving-pack/architecture/

These point to GitHub Pages (*.github.io), which is exactly what this PR migrates away from — there is no GitHub Pages deploy here, so these URLs are dead on arrival (not even a redirect, unlike the github.com repo-rename alias). They should match the canonical host:

  • https://packs.nebari.dev/llm-serving-pack/local-development/
  • https://packs.nebari.dev/llm-serving-pack/architecture/

For a PR whose entire purpose is correct doc links, shipping two guaranteed-broken ones is worth fixing. (check-links.sh won't catch these — it only validates internal links inside the built Hugo site, not source-comment URLs.)

Stylistic / should-fix

2. hugo.toml repo/editBase use the stale nebari-llm-serving-pack repo name. The repo was renamed to llm-serving-pack (the old name is now only a redirect alias), and this PR renames the pack accordingly in pack-metadata.yaml. The editBase/repo links work today only via GitHub's 302 redirect. For consistency with the rename — and so "Edit this page" links don't depend on a redirect — update both to llm-serving-pack.

3. docs/design.md is deleted, but docs/getting-started.md still exists and links to it. Only design.md is removed here, yet the description says "Old flat docs deleted" (migrating install-production.md and getting-started.md). Those two flat files are not removed, and docs/getting-started.md:178 still links to the now-deleted design.md — a broken intra-doc link the Hugo link-checker won't catch (it's outside docs/site/). Either finish the deletion or fix the dangling link. See Question 1.

4. Preview baseURL is computed independently from Cloudflare's actual alias (docs.yml). The ALIAS=... tr/sed step builds a preview baseURL, but the real preview URL comes from steps.deploy.outputs.pages-deployment-alias-url. If Cloudflare's own alias generation diverges (e.g. the 63-char DNS-label cap on a long branch name, or different sanitization), the built site's absolute links point at the wrong host in previews. Production is unaffected — preview-only cosmetic risk. Worth a comment acknowledging the two can drift.

Questions

  1. Are the remaining flat docs (docs/getting-started.md, docs/install-production.md) intentionally kept? README and pack-metadata.yaml already repoint away from them, so they're now orphaned from the main entry points but still referenced by examples/*.yaml comments. If they're meant to be gone, delete them and fix the example references; if kept, fix item 3's broken link.

  2. Does anything key off the old pack name nebari-llm-serving-pack? The rename to llm-serving-pack drives the portal's /llm-serving-pack/ route via docs_site: true. Confirm the software-pack-dashboard route table is fine. (Separately, charts/nebari-llm-serving/values.yaml still uses ghcr.io/nebari-dev/nebari-llm-serving-pack/... image repos — image paths, separate concern, but worth confirming those GHCR paths weren't also renamed.)

What's good

  • Fork-PR handling is correct: deploy is gated on same-repo, so no secrets leak to forks, while build + link-check still gate every PR.
  • check-links.sh is thoughtfully defensive: missing editBase and zero-edit-links both hard-fail (avoiding vacuous passes), and the fragment-validation limitation is documented honestly.
  • Preview cleanup workflow correctly scopes to same-repo and uses cancel-in-progress: false so a cleanup isn't interrupted.
  • README "Known Limitations" is genuinely honest (stubbed revocation, single-namespace constraint, ~1 MiB storage ceiling) rather than marketing copy.

Bottom line: Fix the two dead github.io URLs in the Go comments (Blocker #1) before merge — they directly contradict the PR's own purpose. The rest are consistency cleanups and clarifying questions.

🤖 Reviewed with Claude Code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants