Skip to content

Opt 12 v1beta1 CRDs into storage-version migration + CI guard#5391

Merged
ChrisJBurns merged 6 commits into
mainfrom
chris/svm-opt-in-labels
May 29, 2026
Merged

Opt 12 v1beta1 CRDs into storage-version migration + CI guard#5391
ChrisJBurns merged 6 commits into
mainfrom
chris/svm-opt-in-labels

Conversation

@ChrisJBurns
Copy link
Copy Markdown
Collaborator

@ChrisJBurns ChrisJBurns commented May 28, 2026

Summary

Adds the opt-in label that the StorageVersionMigrator controller (shipped in #5362) requires to act on a CRD, and a CI guard that prevents future CRD additions from silently skipping migration.

This is PR-B of the three-PR sequence for #4969. PR-A (#5362) shipped the controller behind a default-off env-var flag. PR-B (this one) labels every storage-version root type and adds the marker-coverage CI guard. PR-C (next) flips the default to on and ships the helm chart surface + user docs.

Part of #4969. Depends on #5362 (merged).

Medium level
  • Adds +kubebuilder:metadata:labels=toolhive.stacklok.dev/auto-migrate-storage-version=true to every storage-version root type under cmd/thv-operator/api/. That's the 12 root types under api/v1beta1/ plus MCPWebhookConfig under api/v1alpha1/types.go (the only CRD whose storage version is still v1alpha1). Regenerated CRD YAML so the label appears in metadata.labels on every CRD the chart ships.
  • New CI test TestStorageVersionRootMarkerCoverage in cmd/thv-operator/controllers/marker_coverage_test.go. Walks every cmd/thv-operator/api/v*/ directory and scans types.go and *_types.go. For every root type carrying both +kubebuilder:object:root=true AND +kubebuilder:storageversion, the migrate marker is required. No escape hatch — if a CRD legitimately should not be auto-migrated, an entry is added to a hardcoded allowlist (excludedRootTypes in the same file, currently empty) via PR review, not via a self-serve marker.
  • Marker match is whole-token strict, not substring. A typo such as =truee would pass a substring match but fail the controller's runtime check (labels[key] == "true"), silently excluding the CRD. The strict match catches it at PR time. TestContainsMarkerStrictBoundary defends the boundary contract against future regressions.
  • No code logic changes, no chart-side changes, no default flip. The controller in main is still default-off; PR-B's only behavioural effect is that if an admin opts in via the env var, the migrator will now actually find labeled CRDs to act on.
Low level
File Change
cmd/thv-operator/api/v1beta1/*_types.go (×12) +kubebuilder:metadata:labels=...auto-migrate-storage-version=true marker on each root type
cmd/thv-operator/api/v1alpha1/types.go Same marker added to MCPWebhookConfig (the only never-graduated storage-version root)
cmd/thv-operator/controllers/marker_coverage_test.go New static-source CI test — scans all api/v*/ directories, requires migrate marker on every storage-version root, supports an excludedRootTypes allowlist for documented exclusions
deploy/charts/operator-crds/files/crds/*.yaml (×13) Regenerated — label in metadata.labels
deploy/charts/operator-crds/templates/*.yaml (×13) Regenerated — same

Type of change

  • New feature (additive — extends an already-shipped feature flag)
  • Bug fix
  • Breaking change
  • Refactoring
  • Documentation
  • Other

Test plan

  • TestStorageVersionRootMarkerCoverage passes (scans every root type across all API version dirs, asserts marker presence)
  • TestContainsMarkerStrictBoundary passes (7 subtests covering exact match, whitespace boundary, value-typo rejection — defends against future regression to permissive substring matching)
  • task operator-manifests produces no diff after the markers — labels land in CRD YAML cleanly
  • Full operator test suite green: go test -race ./cmd/thv-operator/...
  • envtest suites green including the StorageVersionMigrator suite from PR-A
  • task lint-fix — known to be blocked locally by a pre-existing toolchain mismatch (golangci-lint built against go1.25 vs repo's go1.26 target); not introduced by this PR

Why the CI guard matters

Without TestStorageVersionRootMarkerCoverage:

  1. Someone adds a new CRD root type and forgets the marker.
  2. task operator-manifests regenerates without the opt-in label.
  3. The controller's isManagedCRD check returns false for that new CRD; CRs of that kind never get re-encoded.
  4. The bug stays invisible until a future release tries to drop a deprecated version from spec.versions — at which point storedVersions still contains the deprecated version and the apiserver rejects the CRD update.
  5. Diagnosis trail is long cold; the missed marker was on a PR that landed months earlier.

With the test, the PR adding the new CRD fails CI with a clear pointer to the missing marker.

The test is forward-compatible: when a future v1beta2 graduation moves +kubebuilder:storageversion from v1beta1 root types to v1beta2 root types, the scanner picks up the new directory automatically without code changes here. The previous version's root types lose the storage marker and are naturally filtered out.

Design choice: no escape hatch

An earlier iteration of this test accepted either the migrate marker or an explicit +thv:storage-version-migrator:exclude sibling. That permissive contract opens a graduation-time hole: a single-version CRD created with exclude passes CI, but at v1→v2 graduation time the developer has to remember to swap exclude→migrate. Forgetting the swap leaves the CRD silently un-migrated — exactly the failure mode the test is meant to prevent.

Replaced with: every storage-version root must have the migrate marker, full stop. If a CRD legitimately needs to be excluded, the path is to add its <version>/<file>.<TypeName> key to the excludedRootTypes map in the test file as a deliberate, reviewed decision. The allowlist is currently empty and has an example comment showing the format.

Why every storage-version root (not just multi-version CRDs)

Single-version CRDs today have nothing to migrate — storedVersions == [storageVersion], the controller's isMigrationNeeded check returns false, no-op forever. But when that CRD eventually graduates to a second version, storedVersions accumulates the old entry and the migrator suddenly needs to know whether to act. Requiring the label from day one means graduation PRs don't need to remember to add it later. The cost of the label on a single-version CRD is zero (a string in metadata.labels); the cost of forgetting at graduation is silent data orphaning.

Special notes for reviewers

  • The label key toolhive.stacklok.dev/auto-migrate-storage-version is a stable API surface once shipped — users may write Kyverno rules or GitOps filters against it, so renaming becomes a breaking change.
  • 13 type-file additions (one line each). The 26 generated CRD YAMLs each get the same 2 label lines. These are auto-generated; manually editing them would be wiped on the next task operator-manifests.
  • Marker match is now whole-token-strict, not substring. A value typo like =truee was a real silent-failure class — controller runtime checks for exact string "true", but the CI's substring Contains would accept =truee as a match. The new containsMarker requires the marker to be followed by end-of-line or whitespace, catching the typo at PR time. TestContainsMarkerStrictBoundary is the regression guard if anyone "improves" it back to substring semantics.
  • No behavioural change for any user until both (a) they opt in via TOOLHIVE_ENABLE_STORAGE_VERSION_MIGRATOR=true AND (b) they have CRs whose etcd-stored apiVersion differs from the current storage version. Today no toolhive cluster has the former and the latter is rare.

Generated with Claude Code

ChrisJBurns and others added 2 commits May 28, 2026 17:59
Adds the kubebuilder marker

    +kubebuilder:metadata:labels=toolhive.stacklok.dev/auto-migrate-storage-version=true

to every v1beta1 root type and regenerates the CRD manifests. The label
is the opt-in signal for an upcoming StorageVersionMigrator controller
that will reconcile each CRD's status.storedVersions down to the current
storage version. List types are intentionally not labelled — the label
applies to the CRD itself, which is keyed on the root type.

Part of #4969.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Guards against the main footgun of the opt-in-label design for
storage-version migration: if a new CRD root type is added to
cmd/thv-operator/api/v1beta1/ without the migrate-marker, the
StorageVersionMigrator controller silently excludes it. The problem
surfaces only when a future release tries to drop a deprecated version
— at which point it is far too late to fix.

The test scans every root type (marker block contains both
+kubebuilder:object:root=true and +kubebuilder:storageversion) and
fails unless the block also contains either the migrate marker or an
explicit +thv:storage-version-migrator:exclude sibling marker.

Part of #4969.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the size/S Small PR: 100-299 lines changed label May 28, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented May 28, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.81%. Comparing base (8932f93) to head (6d03996).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #5391   +/-   ##
=======================================
  Coverage   68.81%   68.81%           
=======================================
  Files         628      628           
  Lines       63900    63900           
=======================================
+ Hits        43971    43974    +3     
+ Misses      16669    16667    -2     
+ Partials     3260     3259    -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

The marker-coverage test was only scanning cmd/thv-operator/api/v1beta1/
and filtering for files matching *_types.go. This had two gaps:

1. Future graduations (v1beta1 → v1beta2) would silently break the test:
   when +kubebuilder:storageversion moves to v1beta2/, the v1beta1 root
   types lose the storage marker and become invisible to the scanner.
   The test would either pass vacuously (other version dirs unscanned)
   or hard-fail on the "no roots found" guard with no clear path forward.

2. MCPWebhookConfig lives in api/v1alpha1/types.go (not _types.go) and
   carries +kubebuilder:storageversion as the only never-graduated CRD.
   The original PR-B commit skipped it entirely; the migrator could not
   act on its CRs even when explicitly opted in.

Changes:

- Walk cmd/thv-operator/api/v*/ instead of hardcoding v1beta1.
- File filter accepts both exact "types.go" and "*_types.go" suffix.
- Rename TestV1beta1TypesMarkerCoverage → TestStorageVersionRootMarkerCoverage.
- Update failure message + docstring to reflect the broader scope.
- Add the migrate marker to MCPWebhookConfig in v1alpha1/types.go and
  regenerate its two CRD YAML files under deploy/charts/operator-crds/.

Future-proof: when v1beta2 lands and +kubebuilder:storageversion moves
to its root types, the scanner picks them up without code changes here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added size/S Small PR: 100-299 lines changed and removed size/S Small PR: 100-299 lines changed labels May 28, 2026
The previous test accepted either the migrate marker or an explicit
+thv:storage-version-migrator:exclude sibling. That permissive
contract opens a graduation-time hole: a single-version CRD created
with exclude passes CI, but at v1→v2 graduation time the developer
has to remember to swap exclude→migrate. Forgetting the swap leaves
the CRD silently un-migrated — exactly the failure mode this test
is meant to prevent.

No CRD in the codebase uses the exclude marker today. Removing it
costs nothing and closes the hole. If a legitimate exclusion ever
comes up, handle it as a one-off PR conversation and a hardcoded
allowlist in this test — not a self-serve marker that future
contributors can misuse.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added size/S Small PR: 100-299 lines changed and removed size/S Small PR: 100-299 lines changed labels May 28, 2026
Two defensive additions identified during stress-testing the
opt-in-label design:

1. Strict marker matching. containsMarker previously used
   strings.Contains, which would accept "=truee" (or any extra
   character) as a substring match for "=true". The CI test would
   pass, the generated CRD YAML would carry label value "truee",
   and the controller's runtime check (labels[key] == "true")
   would silently exclude the CRD. The bug only surfaces at
   version-drop time months later. containsMarker now requires
   the marker to be followed by end-of-line or whitespace.
   TestContainsMarkerStrictBoundary defends the contract against
   future regressions to substring semantics.

2. excludedRootTypes allowlist stub. The previous commit removed
   the self-serve +thv:storage-version-migrator:exclude marker.
   This adds the documented alternative — a hardcoded allowlist
   in the test file — as an empty map with an example entry and
   a comment explaining the deliberate, reviewed-decision policy
   for adding entries. If a future contributor needs to exclude a
   CRD, the path is "add an entry, justify it in PR review";
   they don't have to invent the mechanism from scratch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added size/M Medium PR: 300-599 lines changed and removed size/S Small PR: 100-299 lines changed size/M Medium PR: 300-599 lines changed labels May 28, 2026
- v1alpha1/types.go MCPWebhookConfig: move the migrate marker to
  AFTER +kubebuilder:subresource:status, matching the v1beta1
  convention. controller-gen ignores order so behaviour is
  unchanged; this is purely consistency for greppability and
  future readers.
- marker_coverage_test.go: drop `tc := tc` shadowing in
  TestContainsMarkerStrictBoundary. go.mod is on go 1.26; loop
  variables have been per-iteration since 1.22, so the shadow is
  dead code from older-Go muscle memory.

go/ast or controller-tools/pkg/markers refactor (raised separately)
deliberately deferred: the bufio scanner is correct for every CRD
in the codebase today; switching to AST is a ~40-line restructure
better landed as a follow-up if/when we actually need generics or
grouped type decl support in a CRD.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels May 28, 2026
@ChrisJBurns ChrisJBurns merged commit 0578af6 into main May 29, 2026
46 checks passed
@ChrisJBurns ChrisJBurns deleted the chris/svm-opt-in-labels branch May 29, 2026 09:52
ChrisJBurns added a commit that referenced this pull request Jun 1, 2026
Adds an opt-in chart value (`operator.features.storageVersionMigrator`,
default `false`) and wires `TOOLHIVE_ENABLE_STORAGE_VERSION_MIGRATOR`
into the operator deployment. helm-docs README regenerated.

The controller (PR-A, #5362) and the opt-in labels + CI guard
(PR-B, #5391) are on main. This PR is the chart-side surface that
lets operators opt in via `helm install --set` or values.yaml.

Reference and upgrade-guide docs are intentionally separate — they
land in a follow-up PR.

Closes #4969.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/M Medium PR: 300-599 lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants