Add MCPAuthzConfig CRD for backend-agnostic authorization#4777
Conversation
There was a problem hiding this comment.
Large PR Detected
This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.
How to unblock this PR:
Add a section to your PR description with the following format:
## Large PR Justification
[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformationAlternative:
Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.
See our Contributing Guidelines for more details.
This review will be automatically dismissed once you add the justification section.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4777 +/- ##
==========================================
- Coverage 68.80% 68.77% -0.04%
==========================================
Files 629 631 +2
Lines 63914 64150 +236
==========================================
+ Hits 43978 44121 +143
- Misses 16678 16749 +71
- Partials 3258 3280 +22 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| - message: inline must be set when type is 'inline', and must not | ||
| be set otherwise | ||
| rule: 'self.type == ''inline'' ? has(self.inline) : !has(self.inline)' | ||
| authzConfigRef: | ||
| description: |- | ||
| AuthzConfigRef references a shared MCPAuthzConfig resource for authorization. | ||
| The referenced MCPAuthzConfig must exist in the same namespace as this MCPServer. | ||
| Mutually exclusive with authzConfig. | ||
| properties: | ||
| name: | ||
| description: Name is the name of the MCPAuthzConfig resource in | ||
| the same namespace. | ||
| minLength: 1 | ||
| type: string | ||
| required: | ||
| - name | ||
| type: object | ||
| backendReplicas: | ||
| description: |- | ||
| BackendReplicas is the desired number of MCP server backend pod replicas. |
There was a problem hiding this comment.
🔴 The new authzConfig/authzConfigRef fields are documented as mutually exclusive across MCPServer, MCPRemoteProxy, and VirtualMCPServer's IncomingAuthConfig, but no CEL x-kubernetes-validations rule enforces this at the API level. A user can submit a resource with both fields set and the Kubernetes API server will accept it without error. The fix is to add +kubebuilder:validation:XValidation markers (and the corresponding x-kubernetes-validations entries in the generated CRD YAMLs) mirroring the existing oidcConfig/oidcConfigRef precedent for all three affected types.
Extended reasoning...
What the bug is and how it manifests
The PR introduces authzConfigRef fields to MCPServerSpec, MCPRemoteProxySpec, and IncomingAuthConfig (VirtualMCPServer), explicitly documenting them as mutually exclusive with the existing authzConfig field. However, unlike the analogous oidcConfig/oidcConfigRef pair, no CEL validation rule is added to enforce this constraint at admission time. The Kubernetes API server therefore accepts resources with both fields populated simultaneously.
The specific code path that triggers it
In mcpremoteproxy_types.go, the struct-level kubebuilder marker reads:
// +kubebuilder:validation:XValidation:rule="\!(has(self.oidcConfig) && has(self.oidcConfigRef))",message="oidcConfig and oidcConfigRef are mutually exclusive..."
type MCPRemoteProxySpec struct {
No equivalent marker exists for authzConfig/authzConfigRef in MCPRemoteProxySpec, MCPServerSpec, or IncomingAuthConfig. The generated CRD YAML (e.g. toolhive.stacklok.dev_mcpremoteproxies.yaml) reflects this: the spec-level x-kubernetes-validations block only contains the OIDC rule, not an authz rule.
Why existing code doesn't prevent it
The Go field comments say "Mutually exclusive with authzConfig" but comments are not enforced. The kubebuilder XValidation marker is what generates the CEL rule into the CRD schema. That marker was added for oidcConfig/oidcConfigRef but was omitted for the new authzConfig/authzConfigRef pair in all three types.
Impact
Any user who sets both authzConfig and authzConfigRef will get a resource that is silently accepted by the API server. When Part 2 wires these fields into the workload controllers, the controller will face ambiguity: it will need to pick one field, likely silently ignoring the other, leading to unexpected authorization behavior that is hard to diagnose.
How to fix it
Add the following marker to MCPServerSpec, MCPRemoteProxySpec, and IncomingAuthConfig (adjacent to the existing oidcConfig/oidcConfigRef marker):
// +kubebuilder:validation:XValidation:rule="\!(has(self.authzConfig) && has(self.authzConfigRef))",message="authzConfig and authzConfigRef are mutually exclusive"Then regenerate the CRD manifests so that the three affected YAML files each gain a corresponding x-kubernetes-validations entry at the spec level.
Step-by-step proof
- Create an MCPServer manifest with both fields:
apiVersion: toolhive.stacklok.dev/v1alpha1
kind: MCPServer
metadata:
name: test
spec:
image: example/mcp:latest
authzConfig:
type: inline
inline:
policies: ["permit(principal, action, resource);"]
authzConfigRef:
name: my-authzconfig- Apply it to a cluster with this PR's CRDs installed.
- Observe:
kubectl applysucceeds with no validation error. - Compare with the equivalent test using
oidcConfig+oidcConfigRef: the API server rejects it with "oidcConfig and oidcConfigRef are mutually exclusive" because the CEL rule exists for OIDC but not for authz.
| if hashChanged { | ||
| logger.Info("MCPAuthzConfig configuration changed", | ||
| "oldHash", authzConfig.Status.ConfigHash, | ||
| "newHash", configHash) | ||
|
|
||
| authzConfig.Status.ConfigHash = configHash | ||
| authzConfig.Status.ObservedGeneration = authzConfig.Generation | ||
|
|
||
| if err := r.Status().Update(ctx, authzConfig); err != nil { | ||
| logger.Error(err, "Failed to update MCPAuthzConfig status") | ||
| return ctrl.Result{}, err | ||
| } | ||
| return ctrl.Result{}, nil | ||
| } | ||
|
|
||
| // Refresh ReferencingWorkloads list | ||
| referencingWorkloads, err := r.findReferencingWorkloads(ctx, authzConfig) | ||
| if err != nil { | ||
| logger.Error(err, "Failed to find referencing workloads") | ||
| } else if !ctrlutil.WorkloadRefsEqual(authzConfig.Status.ReferencingWorkloads, referencingWorkloads) { | ||
| authzConfig.Status.ReferencingWorkloads = referencingWorkloads | ||
| conditionChanged = true | ||
| } |
There was a problem hiding this comment.
🔴 The new MCPAuthzConfig controller has two related issues that leave ReferencingWorkloads status stale. First, when findReferencingWorkloads returns an error in the normal reconcile path (lines 134-141), the error is only logged and execution continues without requeueing — unlike handleDeletion which correctly returns the error for automatic retry. Second, when hashChanged==true, the controller returns early at line 131 before ever reaching the findReferencingWorkloads call at line 135, so workload reference changes concurrent with spec changes are silently dropped for that reconcile cycle. The fix is to refresh workload refs before the hash-changed early return, and to return the error (with requeue) when findReferencingWorkloads fails, consistent with the deletion path.
Extended reasoning...
Bug 1 — Swallowed error in normal reconcile path (lines 134-141)
In the normal reconcile path, when findReferencingWorkloads returns an error, the code at lines 134-141 logs the error and falls through without returning it:
referencingWorkloads, err := r.findReferencingWorkloads(ctx, authzConfig)
if err != nil {
logger.Error(err, "Failed to find referencing workloads")
} else if ...The deletion path (handleDeletion) correctly propagates the error, which causes controller-runtime to schedule a retry with exponential backoff:
referencingWorkloads, err := r.findReferencingWorkloads(ctx, authzConfig)
if err != nil {
logger.Error(err, "Failed to check referencing workloads during deletion")
return ctrl.Result{}, err // requeues
}On a transient Kubernetes API error (e.g., brief etcd unavailability), the normal reconcile path silently leaves ReferencingWorkloads stale with no automatic retry. The status is only corrected when an unrelated event (e.g., a workload change) happens to re-trigger reconciliation. This asymmetry violates controller-runtime best practices and is the identical pattern that exists in toolconfig_controller.go, meaning the PR copies a known sub-optimal approach rather than improving on it.
Bug 2 — Hash-changed early return skips ReferencingWorkloads refresh (lines 119-132)
When hashChanged==true, the reconciler updates ConfigHash and ObservedGeneration, calls r.Status().Update(), and returns at line 131:
if hashChanged {
authzConfig.Status.ConfigHash = configHash
authzConfig.Status.ObservedGeneration = authzConfig.Generation
if err := r.Status().Update(ctx, authzConfig); err != nil { ... }
return ctrl.Result{}, nil // <-- findReferencingWorkloads is never called
}
// This is unreachable when hashChanged==true:
referencingWorkloads, err := r.findReferencingWorkloads(ctx, authzConfig)Concrete proof for Bug 2: Consider MCPAuthzConfig my-authz with no referencing workloads. A user simultaneously (a) modifies the spec, and (b) adds authzConfigRef: {name: my-authz} to an MCPServer. Controller-runtime coalesces both watch events into a single reconcile. Inside that reconcile: hashChanged==true → controller updates ConfigHash and returns. ReferencingWorkloads is never refreshed. The watch handler for MCPServer fired to trigger this reconcile, so no second reconcile is enqueued. The status shows referencingWorkloads: [] even though the MCPServer now references the config, until either another spec change or another workload event occurs.
Addressing the refutation for Bug 2: One verifier argues that the watch-based reconciliation mechanism compensates for the early return, because a workload change fires a watch event that enqueues a new reconcile. This is true in the separate event case. However, it fails in the coalesced event case described above: if controller-runtime batches the spec-change event and the workload-change event into a single reconcile call, the controller sees hashChanged==true, returns early, and consumes the single reconcile that would have refreshed the refs. No additional reconcile is triggered because the watch event for the workload has already been processed. This is a real (if narrow) window of inconsistency.
Impact and fix: The actual deletion protection in handleDeletion queries live state and is unaffected. The impact is stale ReferencingWorkloads status visible to users and operators. The fix is to call findReferencingWorkloads before the hash-change early return and include its result in the combined status update (as the MCPTelemetryConfig controller already does correctly), and to return the error rather than swallowing it when the lookup fails.
ChrisJBurns
left a comment
There was a problem hiding this comment.
Multi-Agent Consensus Review
Agents consulted: kubernetes-expert, go-security-reviewer, code-reviewer, toolhive-expert
Consensus Summary
| # | Finding | Consensus | Severity | Action |
|---|---|---|---|---|
| 1 | Missing CEL XValidation for authzConfig/authzConfigRef mutual exclusivity | 10/10 | HIGH | Fix |
| 2 | Missing RBAC markers for watched workload resources | 10/10 | MEDIUM | Fix |
| 3 | mustMarshalJSON uses panic in controller | 10/10 | MEDIUM | Fix |
| 4 | No nil-check on spec.Config.Raw before use | 8/10 | MEDIUM | Fix |
| 5 | Test coverage gaps in watch handlers and deletion paths | 7/10 | MEDIUM | Fix |
| 6 | No Validate() method on MCPAuthzConfig type (pattern deviation) | 7/10 | LOW | Discuss |
| 7 | ReferencingWorkloads listMapKey=name insufficient for uniqueness | 7/10 | LOW | Discuss (pre-existing) |
| 8 | Controller blank-imports authorizer backend packages | 7/10 | LOW | Discuss |
Overall
This is a well-structured PR that closely follows the established MCPOIDCConfig controller pattern. The CRD design is clean, the ConfigKey() interface addition is minimal, and the findStaleRefs helper is a nice DRY improvement over the inline stale-ref scanning in existing controllers. The part 1/part 2 split is a sensible approach for incremental delivery.
The main blocker is the missing CEL XValidation rules (Finding #1). The existing OIDC pattern enforces mutual exclusivity between oidcConfig and oidcConfigRef via CEL rules, but no equivalent rules exist for authzConfig/authzConfigRef. Without these, the API server silently accepts resources with both fields set, contradicting the documented contract. The remaining findings are medium-severity improvements: RBAC marker completeness, replacing panic with error return, and adding a nil-check on Config.Raw.
Generated with Claude Code
| // The referenced MCPAuthzConfig must exist in the same namespace as this MCPServer. | ||
| // Mutually exclusive with authzConfig. | ||
| // +optional | ||
| AuthzConfigRef *MCPAuthzConfigReference `json:"authzConfigRef,omitempty"` |
There was a problem hiding this comment.
[HIGH] Missing CEL XValidation for authzConfig/authzConfigRef mutual exclusivity (Consensus: 10/10)
The AuthzConfig and AuthzConfigRef fields are documented as "mutually exclusive" but no CEL validation rule enforces this at the API server level. The existing OIDC pattern has:
// +kubebuilder:validation:XValidation:rule="!(has(self.oidcConfig) && has(self.oidcConfigRef))",message="oidcConfig and oidcConfigRef are mutually exclusive..."The same rule is needed here on MCPServerSpec, MCPRemoteProxySpec, and IncomingAuthConfig:
// +kubebuilder:validation:XValidation:rule="!(has(self.authzConfig) && has(self.authzConfigRef))",message="authzConfig and authzConfigRef are mutually exclusive; use authzConfigRef to reference a shared MCPAuthzConfig"Without this, users can set both fields simultaneously and the controller must decide which takes precedence.
Raised by: kubernetes-expert, code-reviewer, toolhive-expert
|
|
||
| // +kubebuilder:rbac:groups=toolhive.stacklok.dev,resources=mcpauthzconfigs,verbs=get;list;watch;create;update;patch;delete | ||
| // +kubebuilder:rbac:groups=toolhive.stacklok.dev,resources=mcpauthzconfigs/status,verbs=get;update;patch | ||
| // +kubebuilder:rbac:groups=toolhive.stacklok.dev,resources=mcpauthzconfigs/finalizers,verbs=update |
There was a problem hiding this comment.
[MEDIUM] Missing RBAC markers for watched workload resources (Consensus: 10/10)
The controller watches MCPServer, VirtualMCPServer, and MCPRemoteProxy (via SetupWithManager) and lists them (in findReferencingWorkloads), but no +kubebuilder:rbac markers declare these permissions. The MCPOIDCConfig controller declares them:
| // +kubebuilder:rbac:groups=toolhive.stacklok.dev,resources=mcpauthzconfigs/finalizers,verbs=update | |
| // +kubebuilder:rbac:groups=toolhive.stacklok.dev,resources=mcpauthzconfigs/finalizers,verbs=update | |
| // +kubebuilder:rbac:groups=toolhive.stacklok.dev,resources=mcpservers,verbs=get;list;watch | |
| // +kubebuilder:rbac:groups=toolhive.stacklok.dev,resources=virtualmcpservers,verbs=get;list;watch | |
| // +kubebuilder:rbac:groups=toolhive.stacklok.dev,resources=mcpremoteproxies,verbs=get;list;watch |
While these permissions are currently aggregated from other controllers, they should be self-declared for completeness and to prevent breakage if another controller is removed.
Raised by: kubernetes-expert, code-reviewer, toolhive-expert
| fullConfig := map[string]json.RawMessage{ | ||
| "version": mustMarshalJSON(authzConfigVersion), | ||
| "type": mustMarshalJSON(spec.Type), | ||
| configKey: spec.Config.Raw, |
There was a problem hiding this comment.
[MEDIUM] No nil-check on spec.Config.Raw before use (Consensus: 8/10)
If spec.Config.Raw is nil (bypassed validation, manual testing, or future code changes), this produces "cedar": null in the JSON output, leading to confusing downstream errors. Add a defensive check before use:
if len(spec.Config.Raw) == 0 {
return nil, fmt.Errorf("config field is empty")
}The config field is required in the CRD schema, but defense-in-depth is warranted for RawExtension fields.
Raised by: code-reviewer, kubernetes-expert
| @@ -0,0 +1,616 @@ | |||
| // SPDX-FileCopyrightText: Copyright 2025 Stacklok, Inc. | |||
There was a problem hiding this comment.
[MEDIUM] Test coverage gaps in watch handlers and deletion paths (Consensus: 7/10)
The following methods have zero test coverage (Codecov reports 43.5% patch / 122 uncovered lines):
findReferencingWorkloads(covers MCPServer, VirtualMCPServer, MCPRemoteProxy listing)findStaleRefs(stale reference cleanup)- All three
map*ToAuthzConfigwatch handlers SetupWithManager
Additionally, the handleDeletion test only covers the MCPServer reference path -- VirtualMCPServer and MCPRemoteProxy deletion-blocking cases are missing.
At minimum, consider adding tests for findReferencingWorkloads covering all three workload types, and extending the deletion test with VirtualMCPServer and MCPRemoteProxy cases.
Raised by: code-reviewer
| // +optional | ||
| ReferencingWorkloads []WorkloadReference `json:"referencingWorkloads,omitempty"` | ||
| } | ||
|
|
There was a problem hiding this comment.
[LOW] ReferencingWorkloads listMapKey=name may be insufficient (Consensus: 7/10)
Using only name as the list map key could collide if an MCPServer and a VirtualMCPServer share the same name in the same namespace. The composite key should be kind + name:
// +listMapKey=kind
// +listMapKey=nameThis is a pre-existing pattern copied from MCPOIDCConfig, MCPTelemetryConfig, etc. -- consider fixing across all config CRDs in a separate PR.
Raised by: kubernetes-expert
| // Type identifies the authorizer backend (e.g., "cedarv1", "httpv1"). | ||
| // Must match a registered authorizer type in the factory registry. | ||
| // +kubebuilder:validation:Required | ||
| // +kubebuilder:validation:MinLength=1 |
There was a problem hiding this comment.
[LOW] No Validate() method on MCPAuthzConfig type (Consensus: 7/10)
Other shared config CRDs (MCPOIDCConfig, MCPTelemetryConfig, MCPExternalAuthConfig) define a Validate() method on the type itself, called from the controller as config.Validate(). This CRD uses a standalone validateAuthzConfigSpec() function in the controller package instead.
The deviation is understandable (validation requires the factory registry, an external dependency), but a comment in the controller explaining why would help maintainability.
Raised by: toolhive-expert, code-reviewer
734e8db to
d571e89
Compare
Introduce a namespace-scoped MCPAuthzConfig CRD so authorization policy can be defined once and shared across MCPServer, MCPRemoteProxy, and VirtualMCPServer workloads, mirroring the existing MCPOIDCConfig sharing pattern. The spec carries a backend Type plus an opaque Config RawExtension. A ConfigKey() method on AuthorizerFactory (cedar->"cedar", http->"pdp") lets the controller reconstruct the full authorizer config and validate it via the factory registry, keeping the controller backend-agnostic; the backends are registered through blank imports in the operator entrypoint. The controller validates the spec, computes a config hash, manages a finalizer, tracks referencing workloads, and blocks deletion while referenced. AuthzConfigRef fields are added to the three workload specs with CEL XValidation enforcing mutual exclusivity against the existing inline authzConfig, matching the oidcConfig/oidcConfigRef pattern. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
d571e89 to
321c747
Compare
Large PR justification has been provided. Thank you!
|
✅ Large PR justification has been provided. The size review has been dismissed and this PR can now proceed with normal review. |
|
Thanks for the thorough multi-agent review. Addressed all findings. Note: since this was last touched,
Relates to #3157 (still open; this implements its "Solution option 3"). Part 2 (#4778) is stacked on top and will be rebased once this lands. |
The chainsaw operator RBAC assertions compare the live toolhive-operator-manager-role ClusterRole against a hardcoded expected manifest. Adding the MCPAuthzConfig CRD introduced new mcpauthzconfigs rules in the generated ClusterRole, making the fixtures stale and failing E2E on all kind versions. Add the mcpauthzconfigs, mcpauthzconfigs/finalizers, and mcpauthzconfigs/status entries in the same alphabetical position the generated role uses (after embeddingservers*, before mcpexternalauthconfigs*) to both the multi-tenancy and single-tenancy setup fixtures. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| // ─── MCPAuthzConfig ────────────────────────────────────────────────────────── | ||
|
|
||
| //+kubebuilder:object:root=true | ||
| //+kubebuilder:deprecatedversion:warning="toolhive.stacklok.dev/v1alpha1 is deprecated; use v1beta1" |
There was a problem hiding this comment.
We don't need this? As MCPAuthzConfig is a new CRD?
| //+kubebuilder:printcolumn:name="References",type=integer,JSONPath=`.status.referenceCount` | ||
| //+kubebuilder:printcolumn:name="Age",type=date,JSONPath=`.metadata.creationTimestamp` | ||
|
|
||
| // MCPAuthzConfig is the deprecated v1alpha1 version of the MCPAuthzConfig resource. |
ChrisJBurns
left a comment
There was a problem hiding this comment.
Multi-agent review summary
Five specialist reviewers (operator, Go, authz/security, tests, general) examined this PR; codex cross-review was skipped (CLI not installed). The design — a backend-agnostic MCPAuthzConfig CRD with a RawExtension payload, factory-driven validation, and a stacked AuthzConfigRef on workload specs — is the right shape, and the multi-version CRD plumbing (v1alpha1 compat shell embedding v1beta1, generated YAML / deepcopy / helm / RBAC / docs / chainsaw asserts) is internally consistent.
The implementation correctness is mostly good but inherits the same pre-existing anti-patterns the codebase rules flag — direct r.Status().Update and r.Update rather than MutateAndPatchStatus / MutateAndPatchSpec. There are also a handful of defense-in-depth gaps worth tightening before #4778 wires AuthzConfigRef into the workload controllers, since those gaps become user-visible at that point.
After consensus scoring there are 0 HIGH, 6 MEDIUM, 10 LOW findings. No blocking issues, but several MEDIUMs cluster around operator-rule compliance, hash determinism, and CEL test coverage.
Consensus summary
| # | Finding | Severity | Action |
|---|---|---|---|
| F1 | AuthzConfigRef field added but no workload controller consumes it (intentional per PR description, but Deprecated: annotation pre-announces capability) |
MEDIUM | Fix/confirm |
| F2 | Status writes use r.Status().Update instead of MutateAndPatchStatus |
MEDIUM | Fix |
| F3 | Finalizer writes use r.Update instead of MutateAndPatchSpec |
MEDIUM | Fix |
| F4 | ReferencingWorkloads refresh skipped on hash-change branch |
LOW | Fix |
| F5 | Hash over RawExtension flaps on whitespace / key-order |
MEDIUM | Fix |
| F6 | BuildFullAuthzConfigJSON lets factory key shadow reserved version/type keys |
MEDIUM | Fix |
| F7 | CEL mutual-exclusion rules on three CRD types have no envtest coverage | MEDIUM | Fix |
| F8 | DeletionBlocked condition never cleared on success path |
LOW | Fix |
| F9 | findReferencingWorkloads error swallowed; reconcile returns success |
LOW | Fix |
| F10 | BuildFullAuthzConfigJSON exported but only used inside controllers package |
LOW | Fix |
| F11 | No regression test for Status.Conditions wholesale-overwrite race |
MEDIUM | Fix |
| F12 | No test for finalizer-removed-after-last-reference transition | LOW | Fix |
| F13 | MCPAuthzConfig.Validate() doesn't check Type against registry |
LOW | Fix |
| F14 | kubebuilder RBAC marker omits mcpservers which the controller watches |
LOW | Fix |
| F15 | No K8s events emitted on validation failure / deletion-block | LOW | Optional |
| F16 | Migration path from inline authzConfig to authzConfigRef undocumented |
LOW | Optional |
Detailed comments are inline. Pros, tradeoffs and a suggested test plan are below.
Pros
- Backend-agnostic envelope (
ConfigKey()) is a clean way to support multiple authorizers without hardcoding Cedar specifics. findStaleRefsextraction is a notable improvement over the duplicated loops inMCPOIDCConfig— worth back-porting.- Replacing
mustMarshalJSON's panic with an error-returning helper is the right direction (even if the helper is over-engineered, see review comments). - Multi-version CRD plumbing is correctly applied; deepcopy, CRD YAML, helm templates, RBAC and chainsaw asserts all stay in sync.
- Stale-ref enqueue + sorting/dedup keep
Status.ReferencingWorkloadsstable across reconciles. - Large-PR justification is accurate — the majority of additions are generated.
Risks worth weighing
- The PR adds
AuthzConfigRefto the workload specs and annotates the existing inline field asDeprecated:to point at it, but no workload controller resolves the ref into a runtime authz config in this PR. Until #4778 lands, a user who follows the deprecation guidance gets a workload running with no authorization enforced. The PR description discloses this; the field-level docs do not. - Status / finalizer writes use raw
Updatecalls, which the codebase's own operator rules (.claude/rules/operator.md) explicitly call out as the anti-pattern they exist to prevent. SiblingMCPOIDCConfigcontroller has the same issue, so this isn't a regression, but new code is the natural place to stop the bleed.
Suggested manual verification
# Generated-artifact drift check
task operator-generate
task operator-manifests
task -d cmd/thv-operator crdref-gen
git diff --exit-code # must be clean
# Smoke: valid + invalid MCPAuthzConfig
kubectl apply -f - <<EOF
apiVersion: toolhive.stacklok.dev/v1beta1
kind: MCPAuthzConfig
metadata: { name: cedar-test, namespace: default }
spec:
type: cedarv1
config:
version: "1.0"
policies: ["permit(principal, action, resource);"]
entitiesJson: "[]"
EOF
kubectl get mcpauthzconfig cedar-test -o yaml # expect Valid=True, referenceCount=0
kubectl apply -f - <<EOF
apiVersion: toolhive.stacklok.dev/v1beta1
kind: MCPAuthzConfig
metadata: { name: bad, namespace: default }
spec: { type: nope, config: {} }
EOF
kubectl get mcpauthzconfig bad -o jsonpath='{.status.conditions}' # expect Valid=FalseTo confirm the silent-no-op claim under F1: apply only the authzConfigRef form (no inline authzConfig) on an MCPServer, then exec into the resulting pod and look for the authorization middleware in the runner config — at the current PR head it will be absent.
Automated multi-agent review (5 specialists, no consensus disputes). The skill flagged 16 findings; 13 below the consensus threshold were dropped (see local review log).
| // The referenced MCPAuthzConfig must exist in the same namespace as this MCPServer. | ||
| // Mutually exclusive with authzConfig. | ||
| // +optional | ||
| AuthzConfigRef *MCPAuthzConfigReference `json:"authzConfigRef,omitempty"` |
There was a problem hiding this comment.
F1 — MEDIUM: AuthzConfigRef is added here (and to MCPRemoteProxy/VirtualMCPServer) but no workload controller resolves it into a runtime authz config in this PR. Grepping the repo shows Spec.AuthzConfigRef only read by the new MCPAuthzConfigReconciler for bookkeeping (finalizer / ReferenceCount). A user who follows the Deprecated: Use AuthzConfigRef ... guidance on the sibling inline field will get a successfully reconciled MCPServer with no authorization enforced. The PR description discloses this is part 1/2 with #4778 stacked, but the field-level docs and the Deprecated: annotation get ahead of the actual capability.
Consider one of:
- Hold the
Deprecated:annotation on the inlineAuthzConfigfield (and the equivalent on MCPRemoteProxy/VirtualMCPServer) until Wire MCPAuthzConfig references into workload controllers #4778 lands. - Add a controller-side validation that emits a Warning event when
AuthzConfigRefis set on a workload until the consumer ships, so users aren't silently mis-configured. - Note the staging gap in the field's godoc (e.g. "Consumed by workload controllers in Wire MCPAuthzConfig references into workload controllers #4778; until then this field is reference-tracking only and does not apply authorization.").
|
|
||
| // MCPServerSpec defines the desired state of MCPServer | ||
| // | ||
| // +kubebuilder:validation:XValidation:rule="!(has(self.authzConfig) && has(self.authzConfigRef))",message="authzConfig and authzConfigRef are mutually exclusive; use authzConfigRef to reference a shared MCPAuthzConfig" |
There was a problem hiding this comment.
F7 — MEDIUM: This XValidation rule (and the parallel rules on MCPRemoteProxySpec and IncomingAuthConfig) is admission-time only, so unit tests with the fake client can't exercise it. There are no envtest cases that apply each CR with (a) only authzConfig, (b) only authzConfigRef, (c) both set — only (c) should be rejected.
Please add envtest coverage for all three new mutual-exclusion rules. This is the exact scenario the .claude/rules/operator.md testing guidance points at envtest for (over chainsaw).
|
|
||
| // AuthzConfig defines authorization policy configuration for the MCP server | ||
| // AuthzConfig defines authorization policy configuration for the MCP server. | ||
| // Deprecated: Use AuthzConfigRef to reference a shared MCPAuthzConfig resource instead. |
There was a problem hiding this comment.
F16 — LOW: Marking the inline field Deprecated: is correct in intent, but there is no migration guide for users today: no conversion table (inline.policies → MCPAuthzConfig.spec.config.policies), no operator-side auto-conversion, and no docs link from this comment. Combined with F1 (no workload controller consumes AuthzConfigRef yet), a user attempting migration in the wrong order ends up with no enforcement.
Either defer the Deprecated: annotation until #4778 lands, or add a short pointer to a migration doc (e.g. docs/operator/migration-authzconfig.md).
| // Backend-specific validation (delegating to the authorizer factory registry) lives | ||
| // in the controller rather than here, because the API types package must not import | ||
| // the authorizer backends. | ||
| func (r *MCPAuthzConfig) Validate() error { |
There was a problem hiding this comment.
F13 — LOW: Validate() only checks Type != "" and len(Config.Raw) != 0. Whether the type is actually registered is checked only inside the controller via authorizers.GetFactory(cfg.Type). Other callers of Validate() (a future webhook, CLI tooling, an admission-style helper) would happily accept an unknown type as "valid".
Consider consulting the authorizers registry from Validate() directly so all callers fail-closed:
| func (r *MCPAuthzConfig) Validate() error { | |
| func (r *MCPAuthzConfig) Validate() error { | |
| if r.Spec.Type == "" { | |
| return fmt.Errorf("spec.type is required") | |
| } | |
| if !authorizers.IsRegistered(r.Spec.Type) { | |
| return fmt.Errorf("spec.type %q is not a registered authorizer backend (registered: %v)", r.Spec.Type, authorizers.RegisteredTypes()) | |
| } | |
| if len(r.Spec.Config.Raw) == 0 { |
(Adjust the helper name to whatever the registry actually exposes.)
| // This controller manages the lifecycle of MCPAuthzConfig resources: validation | ||
| // via the authorizer factory registry, config hash computation, finalizer management, | ||
| // reference tracking, and deletion protection when workloads reference this config. | ||
| type MCPAuthzConfigReconciler struct { |
There was a problem hiding this comment.
F15 — LOW (optional): The reconciler doesn't take an EventRecorder. For a policy resource this is meaningful: cluster operators investigating a missing/invalid authz policy will not see kubectl describe mcpauthzconfig events explaining "Cedar policy rejected" or "deletion blocked by N workloads". Sibling controllers in this repo wire a recorder; adopting the same pattern here would improve the audit trail and discoverability.
Consider adding a Recorder record.EventRecorder field and wiring it in SetupWithManager via mgr.GetEventRecorderFor("mcpauthzconfig-controller"). Emit Warning events on ConfigInvalid / DeletionBlocked, Normal events on recovery. Avoid putting raw policy contents in event messages (the HTTP PDP URL in particular is topology-sensitive).
| // BuildFullAuthzConfigJSON reconstructs the full authorizer config JSON from a | ||
| // MCPAuthzConfig spec. The result is the same format accepted by authorizers.Config | ||
| // and used in ConfigMaps: {"version": "1.0", "type": "<type>", "<configKey>": {<config>}}. | ||
| func BuildFullAuthzConfigJSON(spec mcpv1beta1.MCPAuthzConfigSpec) ([]byte, error) { |
There was a problem hiding this comment.
F10 — LOW: BuildFullAuthzConfigJSON is exported but its only production caller lives in the same controllers package. Per .claude/rules/go-style.md ("Package API surface"), exporting locks the signature for callers that don't yet exist. The doc comment hints this will be used by workload controllers in the future — if so, it'll likely want to live in cmd/thv-operator/pkg/controllerutil/ next to BuildInlineCedarAuthzConfig to avoid downstream packages importing the controllers package.
Suggestion: unexport (buildFullAuthzConfigJSON) for now; promote and relocate when #4778 needs an actual cross-package caller, at which point the consumer's needs may also reshape the signature.
| return nil, fmt.Errorf("failed to marshal type: %w", err) | ||
| } | ||
|
|
||
| fullConfig := map[string]json.RawMessage{ |
There was a problem hiding this comment.
F6 — MEDIUM: This map literal lets the factory-supplied configKey collide with the reserved envelope keys. Go map literals with duplicate keys silently take the last assignment, so a factory whose ConfigKey() returns "version" or "type" (or "") silently overwrites the envelope metadata with attacker-controlled JSON. Today only Cedar ("cedar") and HTTP ("pdp") register, but the AuthorizerFactory interface is exported and the registration path is open to test stubs and future / out-of-tree backends, so this is a latent defense-in-depth gap.
Add a guard in Register() (pkg/authz/authorizers/registry.go) that rejects ConfigKey() values in a reserved set ("version", "type", "") — panicking at process startup makes a misconfigured factory visible immediately. A short unit test that registers a factory with ConfigKey() == "version" and expects a panic locks the contract in.
| "referencingWorkloads", referencingWorkloads) | ||
|
|
||
| meta.SetStatusCondition(&authzConfig.Status.Conditions, metav1.Condition{ | ||
| Type: mcpv1beta1.ConditionTypeDeletionBlocked, |
There was a problem hiding this comment.
F8 — LOW: handleDeletion sets ConditionTypeDeletionBlocked = True here, but nothing on the non-deletion reconcile path clears it back to False (or removes the condition). Outside the deletion flow, the main Reconcile only touches ConditionTypeAuthzConfigValid. If a user attempts deletion, then cancels (e.g. removes the deletionTimestamp via finalizer removal followed by re-apply), the stale DeletionBlocked: True condition can persist and mislead operators.
On the success path of Reconcile (after validation passes), call meta.SetStatusCondition to set DeletionBlocked = False (reason e.g. NoReferences) or meta.RemoveStatusCondition if DeletionTimestamp.IsZero(). Same pattern is used by sibling config controllers.
| assert.Equal(t, initialRV, afterSteady.ResourceVersion, "ResourceVersion should not change (no writes)") | ||
| } | ||
|
|
||
| func TestMCPAuthzConfigReconciler_ValidationFailureSetsCondition(t *testing.T) { |
There was a problem hiding this comment.
F11 — MEDIUM: .claude/rules/operator.md §"Status Writes" warns that direct r.Status().Update (used by this controller — see F2) replaces Status.Conditions wholesale on CRDs via JSON merge-patch. There is no regression test that interleaves two condition writers (e.g. a validation-failure reconcile concurrent with a hash-change reconcile, or an externally-injected condition + a reconcile) and asserts that all set conditions survive.
Without such a test, a future change that re-introduces the race (or that converts the controller to MutateAndPatchStatus and gets it wrong) will go undetected. Suggested approach: pre-populate a fake MCPAuthzConfig with an externally-injected condition type the controller doesn't manage, reconcile once, then assert that condition is still present alongside the controller-managed Valid condition. This is the cheapest possible canary against the wholesale-overwrite class of bug.
| assert.True(t, foundCondition, "Should have a Valid condition") | ||
| } | ||
|
|
||
| func TestMCPAuthzConfigReconciler_handleDeletion(t *testing.T) { |
There was a problem hiding this comment.
F12 — LOW: TestMCPAuthzConfigReconciler_handleDeletion covers two static starting states (no workloads → finalizer removed; one workload → blocked/requeue) but never exercises the transition that matters in practice: a config blocked by N references, then the last referencing workload is deleted, then handleDeletion is invoked again and the finalizer is removed.
Add a sub-test that (1) creates a deleting config plus one referencing workload, invokes handleDeletion (assert requeue + finalizer still present), (2) deletes the referencing workload, invokes handleDeletion again (assert finalizer removed). This is the only path through the deletion-protection logic that actually flips state, and it is currently uncovered.
Summary
authzConfig) is Cedar-specific and doesn't support alternative authorization backends like HTTP PDP. This adds a newMCPAuthzConfigCRD that decouples authorization configuration from the workload CRDs, enabling any registered authorizer backend to be used.ConfigKey() stringto theAuthorizerFactoryinterface so the MCPAuthzConfig controller can reconstruct full config JSON without hardcoding backend-specific keys.MCPAuthzConfigCRD withtype+config(RawExtension) spec, validation via factory registry, hash-based change detection, referencing workload tracking, and deletion protection.AuthzConfigReffield to MCPServer, MCPRemoteProxy, and VirtualMCPServer specs for referencing shared MCPAuthzConfig resources.Relates to #3157
Type of change
Test plan
task test)task lint-fix)Changes
pkg/authz/authorizers/registry.goConfigKey()toAuthorizerFactoryinterfacepkg/authz/authorizers/cedar/core.goConfigKey()returning"cedar"pkg/authz/authorizers/http/core.goConfigKey()returning"http"cmd/thv-operator/api/v1beta1/mcpauthzconfig_types.goValidate():MCPAuthzConfig,MCPAuthzConfigSpec,MCPAuthzConfigStatus,MCPAuthzConfigReferencecmd/thv-operator/api/v1beta1/{mcpserver,mcpremoteproxy,virtualmcpserver}_types.goAuthzConfigRef+ CEL mutual-exclusion XValidationcmd/thv-operator/api/v1alpha1/*cmd/thv-operator/controllers/mcpauthzconfig_controller.gocmd/thv-operator/app/app.godeploy/charts/operator*Special notes for reviewers
BuildFullAuthzConfigJSON()reconstructs the full config envelope ({"version","type","<configKey>": ...}) using the factory'sConfigKey()— this avoids hardcoding Cedar-specific structure.mustMarshalJSON's panic was replaced with an error-returningmarshalJSONString.ReferencingWorkloadsstatus and enforce deletion protection. RBAC markers for the watched resources are present and rendered into the generated clusterrole.XValidationenforcesauthzConfig/authzConfigRefmutual exclusivity on all three workload specs, mirroring theoidcConfig/oidcConfigRefrules, and renders into both served CRD versions.Large PR Justification
This PR is a single cohesive, atomic addition: a new
MCPAuthzConfigCRD plus its controller. The pieces cannot be split without leaving the tree in a non-functional intermediate state:verify-genruns.SetupWithManagerregistration inapp.go, and theConfigKey()interface method are mutually dependent and won't compile apart.PR #4778 is stacked on top of this and wires these refs into the workload controllers, so the CRD + controller must land first as a unit.
Generated with Claude Code