Add MCPAuthzConfig CRD for backend-agnostic authorization by JAORMX · Pull Request #4777 · stacklok/toolhive

JAORMX · 2026-04-13T13:57:33Z

Summary

The current inline authz configuration (authzConfig) is Cedar-specific and doesn't support alternative authorization backends like HTTP PDP. This adds a new MCPAuthzConfig CRD that decouples authorization configuration from the workload CRDs, enabling any registered authorizer backend to be used.
Adds ConfigKey() string to the AuthorizerFactory interface so the MCPAuthzConfig controller can reconstruct full config JSON without hardcoding backend-specific keys.
Creates the MCPAuthzConfig CRD with type + config (RawExtension) spec, validation via factory registry, hash-based change detection, referencing workload tracking, and deletion protection.
Adds AuthzConfigRef field to MCPServer, MCPRemoteProxy, and VirtualMCPServer specs for referencing shared MCPAuthzConfig resources.

Relates to #3157

Type of change

New feature

Test plan

Unit tests (task test)
Linting (task lint-fix)

Changes

Note: this PR was re-implemented against the v1beta1 API after main graduated v1beta1 to the storage version. The new CRD types live in cmd/thv-operator/api/v1beta1/, with a deprecated compat shell in v1alpha1, mirroring the established MCPOIDCConfig pattern. Generated CRD YAML, deepcopy, helm templates, RBAC, and CRD reference docs are regenerated accordingly.

Area	Change
`pkg/authz/authorizers/registry.go`	Add `ConfigKey()` to `AuthorizerFactory` interface
`pkg/authz/authorizers/cedar/core.go`	Implement `ConfigKey()` returning `"cedar"`
`pkg/authz/authorizers/http/core.go`	Implement `ConfigKey()` returning `"http"`
`cmd/thv-operator/api/v1beta1/mcpauthzconfig_types.go`	New CRD types + `Validate()`: `MCPAuthzConfig`, `MCPAuthzConfigSpec`, `MCPAuthzConfigStatus`, `MCPAuthzConfigReference`
`cmd/thv-operator/api/v1beta1/{mcpserver,mcpremoteproxy,virtualmcpserver}_types.go`	Add `AuthzConfigRef` + CEL mutual-exclusion XValidation
`cmd/thv-operator/api/v1alpha1/*`	Deprecated compat shims embedding v1beta1
`cmd/thv-operator/controllers/mcpauthzconfig_controller.go`	MCPAuthzConfig reconciler: validation, hash, finalizer, watches
`cmd/thv-operator/app/app.go`	Register MCPAuthzConfig controller; backend blank-imports
`deploy/charts/operator*`	Regenerated CRD YAML, RBAC clusterrole, helm templates, CRD docs

Special notes for reviewers

This is part 1 of 2: CRD + controller only. Part 2 (Wire MCPAuthzConfig references into workload controllers #4778, wiring references into workload controllers) follows on top.
BuildFullAuthzConfigJSON() reconstructs the full config envelope ({"version","type","<configKey>": ...}) using the factory's ConfigKey() — this avoids hardcoding Cedar-specific structure. mustMarshalJSON's panic was replaced with an error-returning marshalJSONString.
The controller watches MCPServer, VirtualMCPServer, and MCPRemoteProxy to maintain accurate ReferencingWorkloads status and enforce deletion protection. RBAC markers for the watched resources are present and rendered into the generated clusterrole.
CEL XValidation enforces authzConfig/authzConfigRef mutual exclusivity on all three workload specs, mirroring the oidcConfig/oidcConfigRef rules, and renders into both served CRD versions.

Large PR Justification

This PR is a single cohesive, atomic addition: a new MCPAuthzConfig CRD plus its controller. The pieces cannot be split without leaving the tree in a non-functional intermediate state:

The API types, deepcopy, CRD YAML, helm templates, RBAC, and CRD reference docs are all machine-generated from the same type definitions — splitting them produces failing verify-gen runs.
The controller, its SetupWithManager registration in app.go, and the ConfigKey() interface method are mutually dependent and won't compile apart.
Of the additions, the large majority is generated YAML/docs/deepcopy and tests. Hand-written production Go is a small fraction.

PR #4778 is stacked on top of this and wires these refs into the workload controllers, so the CRD + controller must land first as a unit.

Generated with Claude Code

github-actions

Large PR Detected

This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.

How to unblock this PR:

Add a section to your PR description with the following format:

## Large PR Justification

[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformation

Alternative:

Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.

See our Contributing Guidelines for more details.

This review will be automatically dismissed once you add the justification section.

codecov · 2026-04-13T14:04:42Z

Codecov Report

❌ Patch coverage is 68.64407% with 74 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.77%. Comparing base (35e66bb) to head (1e95d5c).

Files with missing lines	Patch %	Lines
...-operator/controllers/mcpauthzconfig_controller.go	68.63%	46 Missing and 23 partials ⚠️
cmd/thv-operator/app/app.go	0.00%	5 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #4777      +/-   ##
==========================================
- Coverage   68.80%   68.77%   -0.04%     
==========================================
  Files         629      631       +2     
  Lines       63914    64150     +236     
==========================================
+ Hits        43978    44121     +143     
- Misses      16678    16749      +71     
- Partials     3258     3280      +22

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

claude · 2026-04-13T14:11:25Z

                - message: inline must be set when type is 'inline', and must not
                    be set otherwise
                  rule: 'self.type == ''inline'' ? has(self.inline) : !has(self.inline)'
+              authzConfigRef:
+                description: |-
+                  AuthzConfigRef references a shared MCPAuthzConfig resource for authorization.
+                  The referenced MCPAuthzConfig must exist in the same namespace as this MCPServer.
+                  Mutually exclusive with authzConfig.
+                properties:
+                  name:
+                    description: Name is the name of the MCPAuthzConfig resource in
+                      the same namespace.
+                    minLength: 1
+                    type: string
+                required:
+                - name
+                type: object
              backendReplicas:
                description: |-
                  BackendReplicas is the desired number of MCP server backend pod replicas.


🔴 The new authzConfig/authzConfigRef fields are documented as mutually exclusive across MCPServer, MCPRemoteProxy, and VirtualMCPServer's IncomingAuthConfig, but no CEL x-kubernetes-validations rule enforces this at the API level. A user can submit a resource with both fields set and the Kubernetes API server will accept it without error. The fix is to add +kubebuilder:validation:XValidation markers (and the corresponding x-kubernetes-validations entries in the generated CRD YAMLs) mirroring the existing oidcConfig/oidcConfigRef precedent for all three affected types.

Extended reasoning...

What the bug is and how it manifests

The PR introduces authzConfigRef fields to MCPServerSpec, MCPRemoteProxySpec, and IncomingAuthConfig (VirtualMCPServer), explicitly documenting them as mutually exclusive with the existing authzConfig field. However, unlike the analogous oidcConfig/oidcConfigRef pair, no CEL validation rule is added to enforce this constraint at admission time. The Kubernetes API server therefore accepts resources with both fields populated simultaneously.

The specific code path that triggers it

In mcpremoteproxy_types.go, the struct-level kubebuilder marker reads:

// +kubebuilder:validation:XValidation:rule="\!(has(self.oidcConfig) && has(self.oidcConfigRef))",message="oidcConfig and oidcConfigRef are mutually exclusive..." type MCPRemoteProxySpec struct {

No equivalent marker exists for authzConfig/authzConfigRef in MCPRemoteProxySpec, MCPServerSpec, or IncomingAuthConfig. The generated CRD YAML (e.g. toolhive.stacklok.dev_mcpremoteproxies.yaml) reflects this: the spec-level x-kubernetes-validations block only contains the OIDC rule, not an authz rule.

Why existing code doesn't prevent it

The Go field comments say "Mutually exclusive with authzConfig" but comments are not enforced. The kubebuilder XValidation marker is what generates the CEL rule into the CRD schema. That marker was added for oidcConfig/oidcConfigRef but was omitted for the new authzConfig/authzConfigRef pair in all three types.

Impact

Any user who sets both authzConfig and authzConfigRef will get a resource that is silently accepted by the API server. When Part 2 wires these fields into the workload controllers, the controller will face ambiguity: it will need to pick one field, likely silently ignoring the other, leading to unexpected authorization behavior that is hard to diagnose.

How to fix it

Add the following marker to MCPServerSpec, MCPRemoteProxySpec, and IncomingAuthConfig (adjacent to the existing oidcConfig/oidcConfigRef marker):

// +kubebuilder:validation:XValidation:rule="\!(has(self.authzConfig) && has(self.authzConfigRef))",message="authzConfig and authzConfigRef are mutually exclusive"

Then regenerate the CRD manifests so that the three affected YAML files each gain a corresponding x-kubernetes-validations entry at the spec level.

Step-by-step proof

Create an MCPServer manifest with both fields:

apiVersion: toolhive.stacklok.dev/v1alpha1 kind: MCPServer metadata: name: test spec: image: example/mcp:latest authzConfig: type: inline inline: policies: ["permit(principal, action, resource);"] authzConfigRef: name: my-authzconfig

Apply it to a cluster with this PR's CRDs installed.

Observe: kubectl apply succeeds with no validation error.

Compare with the equivalent test using oidcConfig + oidcConfigRef: the API server rejects it with "oidcConfig and oidcConfigRef are mutually exclusive" because the CEL rule exists for OIDC but not for authz.

claude · 2026-04-13T14:11:25Z

+	if hashChanged {
+		logger.Info("MCPAuthzConfig configuration changed",
+			"oldHash", authzConfig.Status.ConfigHash,
+			"newHash", configHash)
+
+		authzConfig.Status.ConfigHash = configHash
+		authzConfig.Status.ObservedGeneration = authzConfig.Generation
+
+		if err := r.Status().Update(ctx, authzConfig); err != nil {
+			logger.Error(err, "Failed to update MCPAuthzConfig status")
+			return ctrl.Result{}, err
+		}
+		return ctrl.Result{}, nil
+	}
+
+	// Refresh ReferencingWorkloads list
+	referencingWorkloads, err := r.findReferencingWorkloads(ctx, authzConfig)
+	if err != nil {
+		logger.Error(err, "Failed to find referencing workloads")
+	} else if !ctrlutil.WorkloadRefsEqual(authzConfig.Status.ReferencingWorkloads, referencingWorkloads) {
+		authzConfig.Status.ReferencingWorkloads = referencingWorkloads
+		conditionChanged = true
+	}


🔴 The new MCPAuthzConfig controller has two related issues that leave ReferencingWorkloads status stale. First, when findReferencingWorkloads returns an error in the normal reconcile path (lines 134-141), the error is only logged and execution continues without requeueing — unlike handleDeletion which correctly returns the error for automatic retry. Second, when hashChanged==true, the controller returns early at line 131 before ever reaching the findReferencingWorkloads call at line 135, so workload reference changes concurrent with spec changes are silently dropped for that reconcile cycle. The fix is to refresh workload refs before the hash-changed early return, and to return the error (with requeue) when findReferencingWorkloads fails, consistent with the deletion path.

Extended reasoning...

Bug 1 — Swallowed error in normal reconcile path (lines 134-141)

In the normal reconcile path, when findReferencingWorkloads returns an error, the code at lines 134-141 logs the error and falls through without returning it:

referencingWorkloads, err := r.findReferencingWorkloads(ctx, authzConfig) if err != nil { logger.Error(err, "Failed to find referencing workloads") } else if ...

The deletion path (handleDeletion) correctly propagates the error, which causes controller-runtime to schedule a retry with exponential backoff:

referencingWorkloads, err := r.findReferencingWorkloads(ctx, authzConfig) if err != nil { logger.Error(err, "Failed to check referencing workloads during deletion") return ctrl.Result{}, err // requeues }

On a transient Kubernetes API error (e.g., brief etcd unavailability), the normal reconcile path silently leaves ReferencingWorkloads stale with no automatic retry. The status is only corrected when an unrelated event (e.g., a workload change) happens to re-trigger reconciliation. This asymmetry violates controller-runtime best practices and is the identical pattern that exists in toolconfig_controller.go, meaning the PR copies a known sub-optimal approach rather than improving on it.

Bug 2 — Hash-changed early return skips ReferencingWorkloads refresh (lines 119-132)

When hashChanged==true, the reconciler updates ConfigHash and ObservedGeneration, calls r.Status().Update(), and returns at line 131:

if hashChanged { authzConfig.Status.ConfigHash = configHash authzConfig.Status.ObservedGeneration = authzConfig.Generation if err := r.Status().Update(ctx, authzConfig); err != nil { ... } return ctrl.Result{}, nil // <-- findReferencingWorkloads is never called } // This is unreachable when hashChanged==true: referencingWorkloads, err := r.findReferencingWorkloads(ctx, authzConfig)

Concrete proof for Bug 2: Consider MCPAuthzConfig my-authz with no referencing workloads. A user simultaneously (a) modifies the spec, and (b) adds authzConfigRef: {name: my-authz} to an MCPServer. Controller-runtime coalesces both watch events into a single reconcile. Inside that reconcile: hashChanged==true → controller updates ConfigHash and returns. ReferencingWorkloads is never refreshed. The watch handler for MCPServer fired to trigger this reconcile, so no second reconcile is enqueued. The status shows referencingWorkloads: [] even though the MCPServer now references the config, until either another spec change or another workload event occurs.

Addressing the refutation for Bug 2: One verifier argues that the watch-based reconciliation mechanism compensates for the early return, because a workload change fires a watch event that enqueues a new reconcile. This is true in the separate event case. However, it fails in the coalesced event case described above: if controller-runtime batches the spec-change event and the workload-change event into a single reconcile call, the controller sees hashChanged==true, returns early, and consumes the single reconcile that would have refreshed the refs. No additional reconcile is triggered because the watch event for the workload has already been processed. This is a real (if narrow) window of inconsistency.

Impact and fix: The actual deletion protection in handleDeletion queries live state and is unaffected. The impact is stale ReferencingWorkloads status visible to users and operators. The fix is to call findReferencingWorkloads before the hash-change early return and include its result in the combined status update (as the MCPTelemetryConfig controller already does correctly), and to return the error rather than swallowing it when the lookup fails.

ChrisJBurns

Multi-Agent Consensus Review

Agents consulted: kubernetes-expert, go-security-reviewer, code-reviewer, toolhive-expert

Consensus Summary

#	Finding	Consensus	Severity	Action
1	Missing CEL XValidation for authzConfig/authzConfigRef mutual exclusivity	10/10	HIGH	Fix
2	Missing RBAC markers for watched workload resources	10/10	MEDIUM	Fix
3	mustMarshalJSON uses panic in controller	10/10	MEDIUM	Fix
4	No nil-check on spec.Config.Raw before use	8/10	MEDIUM	Fix
5	Test coverage gaps in watch handlers and deletion paths	7/10	MEDIUM	Fix
6	No Validate() method on MCPAuthzConfig type (pattern deviation)	7/10	LOW	Discuss
7	ReferencingWorkloads listMapKey=name insufficient for uniqueness	7/10	LOW	Discuss (pre-existing)
8	Controller blank-imports authorizer backend packages	7/10	LOW	Discuss

Overall

This is a well-structured PR that closely follows the established MCPOIDCConfig controller pattern. The CRD design is clean, the ConfigKey() interface addition is minimal, and the findStaleRefs helper is a nice DRY improvement over the inline stale-ref scanning in existing controllers. The part 1/part 2 split is a sensible approach for incremental delivery.

The main blocker is the missing CEL XValidation rules (Finding #1). The existing OIDC pattern enforces mutual exclusivity between oidcConfig and oidcConfigRef via CEL rules, but no equivalent rules exist for authzConfig/authzConfigRef. Without these, the API server silently accepts resources with both fields set, contradicting the documented contract. The remaining findings are medium-severity improvements: RBAC marker completeness, replacing panic with error return, and adding a nil-check on Config.Raw.

Generated with Claude Code

ChrisJBurns · 2026-04-13T15:21:40Z

+	// The referenced MCPAuthzConfig must exist in the same namespace as this MCPServer.
+	// Mutually exclusive with authzConfig.
+	// +optional
+	AuthzConfigRef *MCPAuthzConfigReference `json:"authzConfigRef,omitempty"`


[HIGH] Missing CEL XValidation for authzConfig/authzConfigRef mutual exclusivity (Consensus: 10/10)

The AuthzConfig and AuthzConfigRef fields are documented as "mutually exclusive" but no CEL validation rule enforces this at the API server level. The existing OIDC pattern has:

// +kubebuilder:validation:XValidation:rule="!(has(self.oidcConfig) && has(self.oidcConfigRef))",message="oidcConfig and oidcConfigRef are mutually exclusive..."

The same rule is needed here on MCPServerSpec, MCPRemoteProxySpec, and IncomingAuthConfig:

// +kubebuilder:validation:XValidation:rule="!(has(self.authzConfig) && has(self.authzConfigRef))",message="authzConfig and authzConfigRef are mutually exclusive; use authzConfigRef to reference a shared MCPAuthzConfig"

Without this, users can set both fields simultaneously and the controller must decide which takes precedence.

Raised by: kubernetes-expert, code-reviewer, toolhive-expert

ChrisJBurns · 2026-04-13T15:21:40Z

+
+// +kubebuilder:rbac:groups=toolhive.stacklok.dev,resources=mcpauthzconfigs,verbs=get;list;watch;create;update;patch;delete
+// +kubebuilder:rbac:groups=toolhive.stacklok.dev,resources=mcpauthzconfigs/status,verbs=get;update;patch
+// +kubebuilder:rbac:groups=toolhive.stacklok.dev,resources=mcpauthzconfigs/finalizers,verbs=update


[MEDIUM] Missing RBAC markers for watched workload resources (Consensus: 10/10)

The controller watches MCPServer, VirtualMCPServer, and MCPRemoteProxy (via SetupWithManager) and lists them (in findReferencingWorkloads), but no +kubebuilder:rbac markers declare these permissions. The MCPOIDCConfig controller declares them:

Suggested change

// +kubebuilder:rbac:groups=toolhive.stacklok.dev,resources=mcpauthzconfigs/finalizers,verbs=update

// +kubebuilder:rbac:groups=toolhive.stacklok.dev,resources=mcpauthzconfigs/finalizers,verbs=update

// +kubebuilder:rbac:groups=toolhive.stacklok.dev,resources=mcpservers,verbs=get;list;watch

// +kubebuilder:rbac:groups=toolhive.stacklok.dev,resources=virtualmcpservers,verbs=get;list;watch

// +kubebuilder:rbac:groups=toolhive.stacklok.dev,resources=mcpremoteproxies,verbs=get;list;watch

While these permissions are currently aggregated from other controllers, they should be self-declared for completeness and to prevent breakage if another controller is removed.

Raised by: kubernetes-expert, code-reviewer, toolhive-expert

ChrisJBurns · 2026-04-13T15:21:41Z

+	fullConfig := map[string]json.RawMessage{
+		"version": mustMarshalJSON(authzConfigVersion),
+		"type":    mustMarshalJSON(spec.Type),
+		configKey: spec.Config.Raw,


[MEDIUM] No nil-check on spec.Config.Raw before use (Consensus: 8/10)

If spec.Config.Raw is nil (bypassed validation, manual testing, or future code changes), this produces "cedar": null in the JSON output, leading to confusing downstream errors. Add a defensive check before use:

if len(spec.Config.Raw) == 0 { return nil, fmt.Errorf("config field is empty") }

The config field is required in the CRD schema, but defense-in-depth is warranted for RawExtension fields.

Raised by: code-reviewer, kubernetes-expert

ChrisJBurns · 2026-04-13T15:21:41Z

@@ -0,0 +1,616 @@
+// SPDX-FileCopyrightText: Copyright 2025 Stacklok, Inc.


[MEDIUM] Test coverage gaps in watch handlers and deletion paths (Consensus: 7/10)

The following methods have zero test coverage (Codecov reports 43.5% patch / 122 uncovered lines):

findReferencingWorkloads (covers MCPServer, VirtualMCPServer, MCPRemoteProxy listing)

findStaleRefs (stale reference cleanup)

All three map*ToAuthzConfig watch handlers

SetupWithManager

Additionally, the handleDeletion test only covers the MCPServer reference path -- VirtualMCPServer and MCPRemoteProxy deletion-blocking cases are missing.

At minimum, consider adding tests for findReferencingWorkloads covering all three workload types, and extending the deletion test with VirtualMCPServer and MCPRemoteProxy cases.

Raised by: code-reviewer

ChrisJBurns · 2026-04-13T15:21:41Z

+	// +optional
+	ReferencingWorkloads []WorkloadReference `json:"referencingWorkloads,omitempty"`
+}
+


[LOW] ReferencingWorkloads listMapKey=name may be insufficient (Consensus: 7/10)

Using only name as the list map key could collide if an MCPServer and a VirtualMCPServer share the same name in the same namespace. The composite key should be kind + name:

// +listMapKey=kind // +listMapKey=name

This is a pre-existing pattern copied from MCPOIDCConfig, MCPTelemetryConfig, etc. -- consider fixing across all config CRDs in a separate PR.

Raised by: kubernetes-expert

ChrisJBurns · 2026-04-13T15:21:41Z

+	// Type identifies the authorizer backend (e.g., "cedarv1", "httpv1").
+	// Must match a registered authorizer type in the factory registry.
+	// +kubebuilder:validation:Required
+	// +kubebuilder:validation:MinLength=1


[LOW] No Validate() method on MCPAuthzConfig type (Consensus: 7/10)

Other shared config CRDs (MCPOIDCConfig, MCPTelemetryConfig, MCPExternalAuthConfig) define a Validate() method on the type itself, called from the controller as config.Validate(). This CRD uses a standalone validateAuthzConfigSpec() function in the controller package instead.

The deviation is understandable (validation requires the factory registry, an external dependency), but a comment in the controller explaining why would help maintainability.

Raised by: toolhive-expert, code-reviewer

Introduce a namespace-scoped MCPAuthzConfig CRD so authorization policy can be defined once and shared across MCPServer, MCPRemoteProxy, and VirtualMCPServer workloads, mirroring the existing MCPOIDCConfig sharing pattern. The spec carries a backend Type plus an opaque Config RawExtension. A ConfigKey() method on AuthorizerFactory (cedar->"cedar", http->"pdp") lets the controller reconstruct the full authorizer config and validate it via the factory registry, keeping the controller backend-agnostic; the backends are registered through blank imports in the operator entrypoint. The controller validates the spec, computes a config hash, manages a finalizer, tracks referencing workloads, and blocks deletion while referenced. AuthzConfigRef fields are added to the three workload specs with CEL XValidation enforcing mutual exclusivity against the existing inline authzConfig, matching the oidcConfig/oidcConfigRef pattern. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Large PR justification has been provided. Thank you!

github-actions · 2026-05-29T11:29:21Z

✅ Large PR justification has been provided. The size review has been dismissed and this PR can now proceed with normal review.

JAORMX · 2026-05-29T11:30:01Z

Thanks for the thorough multi-agent review. Addressed all findings. Note: since this was last touched, main graduated the operator API to v1beta1 as the storage version, so the PR has been re-implemented against v1beta1 (with a deprecated v1alpha1 compat shell), mirroring the established MCPOIDCConfig pattern. Per-finding:

HIGH — CEL XValidation mutual exclusivity: Done. !(has(self.authzConfig) && has(self.authzConfigRef)) is applied to MCPServerSpec, MCPRemoteProxySpec, and IncomingAuthConfig (vMCP), matching the oidcConfig/oidcConfigRef pattern, and renders into both served CRD versions.
MEDIUM — RBAC markers: Added; the regenerated clusterrole grants get/list/watch on the watched workload types plus the mcpauthzconfigs status/finalizers subresources. mcpservers RBAC is inherited from the MCPServer controller, consistent with MCPOIDCConfig.
MEDIUM — panic in mustMarshalJSON: Replaced with marshalJSONString returning an error, propagated through BuildFullAuthzConfigJSON.
MEDIUM — nil-check on spec.Config.Raw: Guarded (len(spec.Config.Raw) == 0 → explicit error), plus a Validate() method on the type as defense-in-depth.
MEDIUM — test coverage: Added watch-handler tests (current + stale-ref enqueue for all three workload kinds, plus wrong-type and no-ref cases) and deletion-path tests (finalizer add, deletion-without-finalizer, reference-blocking per workload type).
LOW — Validate() method: Added, mirroring MCPOIDCConfig. Structural checks only; backend-specific validation stays in the controller because it needs the authorizer registry the API package must not import (documented inline).
LOW — ReferencingWorkloads listMapKey: Aligned to +listMapKey=name, matching the shared WorkloadReference convention used by every other config CRD. The "name not unique across kinds" concern is pre-existing and repo-wide — suggest addressing it uniformly in a separate change.
LOW — controller blank-imports backends: Kept out of the controller; they live in the operator entrypoint (app.go) so the controller stays backend-agnostic, with a comment explaining why.

Relates to #3157 (still open; this implements its "Solution option 3"). Part 2 (#4778) is stacked on top and will be rebased once this lands.

The chainsaw operator RBAC assertions compare the live toolhive-operator-manager-role ClusterRole against a hardcoded expected manifest. Adding the MCPAuthzConfig CRD introduced new mcpauthzconfigs rules in the generated ClusterRole, making the fixtures stale and failing E2E on all kind versions. Add the mcpauthzconfigs, mcpauthzconfigs/finalizers, and mcpauthzconfigs/status entries in the same alphabetical position the generated role uses (after embeddingservers*, before mcpexternalauthconfigs*) to both the multi-tenancy and single-tenancy setup fixtures. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

ChrisJBurns · 2026-06-01T15:37:48Z

+// ─── MCPAuthzConfig ──────────────────────────────────────────────────────────
+
+//+kubebuilder:object:root=true
+//+kubebuilder:deprecatedversion:warning="toolhive.stacklok.dev/v1alpha1 is deprecated; use v1beta1"


We don't need this? As MCPAuthzConfig is a new CRD?

ChrisJBurns · 2026-06-01T15:38:19Z

+//+kubebuilder:printcolumn:name="References",type=integer,JSONPath=`.status.referenceCount`
+//+kubebuilder:printcolumn:name="Age",type=date,JSONPath=`.metadata.creationTimestamp`
+
+// MCPAuthzConfig is the deprecated v1alpha1 version of the MCPAuthzConfig resource.


This isn't true?

ChrisJBurns

Multi-agent review summary

Five specialist reviewers (operator, Go, authz/security, tests, general) examined this PR; codex cross-review was skipped (CLI not installed). The design — a backend-agnostic MCPAuthzConfig CRD with a RawExtension payload, factory-driven validation, and a stacked AuthzConfigRef on workload specs — is the right shape, and the multi-version CRD plumbing (v1alpha1 compat shell embedding v1beta1, generated YAML / deepcopy / helm / RBAC / docs / chainsaw asserts) is internally consistent.

The implementation correctness is mostly good but inherits the same pre-existing anti-patterns the codebase rules flag — direct r.Status().Update and r.Update rather than MutateAndPatchStatus / MutateAndPatchSpec. There are also a handful of defense-in-depth gaps worth tightening before #4778 wires AuthzConfigRef into the workload controllers, since those gaps become user-visible at that point.

After consensus scoring there are 0 HIGH, 6 MEDIUM, 10 LOW findings. No blocking issues, but several MEDIUMs cluster around operator-rule compliance, hash determinism, and CEL test coverage.

Consensus summary

#	Finding	Severity	Action
F1	AuthzConfigRef field added but no workload controller consumes it (intentional per PR description, but `Deprecated:` annotation pre-announces capability)	MEDIUM	Fix/confirm
F2	Status writes use `r.Status().Update` instead of `MutateAndPatchStatus`	MEDIUM	Fix
F3	Finalizer writes use `r.Update` instead of `MutateAndPatchSpec`	MEDIUM	Fix
F4	`ReferencingWorkloads` refresh skipped on hash-change branch	LOW	Fix
F5	Hash over `RawExtension` flaps on whitespace / key-order	MEDIUM	Fix
F6	`BuildFullAuthzConfigJSON` lets factory key shadow reserved `version`/`type` keys	MEDIUM	Fix
F7	CEL mutual-exclusion rules on three CRD types have no envtest coverage	MEDIUM	Fix
F8	`DeletionBlocked` condition never cleared on success path	LOW	Fix
F9	`findReferencingWorkloads` error swallowed; reconcile returns success	LOW	Fix
F10	`BuildFullAuthzConfigJSON` exported but only used inside `controllers` package	LOW	Fix
F11	No regression test for `Status.Conditions` wholesale-overwrite race	MEDIUM	Fix
F12	No test for finalizer-removed-after-last-reference transition	LOW	Fix
F13	`MCPAuthzConfig.Validate()` doesn't check Type against registry	LOW	Fix
F14	kubebuilder RBAC marker omits `mcpservers` which the controller watches	LOW	Fix
F15	No K8s events emitted on validation failure / deletion-block	LOW	Optional
F16	Migration path from inline `authzConfig` to `authzConfigRef` undocumented	LOW	Optional

Detailed comments are inline. Pros, tradeoffs and a suggested test plan are below.

Pros

Backend-agnostic envelope (ConfigKey()) is a clean way to support multiple authorizers without hardcoding Cedar specifics.
findStaleRefs extraction is a notable improvement over the duplicated loops in MCPOIDCConfig — worth back-porting.
Replacing mustMarshalJSON's panic with an error-returning helper is the right direction (even if the helper is over-engineered, see review comments).
Multi-version CRD plumbing is correctly applied; deepcopy, CRD YAML, helm templates, RBAC and chainsaw asserts all stay in sync.
Stale-ref enqueue + sorting/dedup keep Status.ReferencingWorkloads stable across reconciles.
Large-PR justification is accurate — the majority of additions are generated.

Risks worth weighing

The PR adds AuthzConfigRef to the workload specs and annotates the existing inline field as Deprecated: to point at it, but no workload controller resolves the ref into a runtime authz config in this PR. Until #4778 lands, a user who follows the deprecation guidance gets a workload running with no authorization enforced. The PR description discloses this; the field-level docs do not.
Status / finalizer writes use raw Update calls, which the codebase's own operator rules (.claude/rules/operator.md) explicitly call out as the anti-pattern they exist to prevent. Sibling MCPOIDCConfig controller has the same issue, so this isn't a regression, but new code is the natural place to stop the bleed.

Suggested manual verification

# Generated-artifact drift check
task operator-generate
task operator-manifests
task -d cmd/thv-operator crdref-gen
git diff --exit-code   # must be clean

# Smoke: valid + invalid MCPAuthzConfig
kubectl apply -f - <<EOF
apiVersion: toolhive.stacklok.dev/v1beta1
kind: MCPAuthzConfig
metadata: { name: cedar-test, namespace: default }
spec:
  type: cedarv1
  config:
    version: "1.0"
    policies: ["permit(principal, action, resource);"]
    entitiesJson: "[]"
EOF
kubectl get mcpauthzconfig cedar-test -o yaml   # expect Valid=True, referenceCount=0

kubectl apply -f - <<EOF
apiVersion: toolhive.stacklok.dev/v1beta1
kind: MCPAuthzConfig
metadata: { name: bad, namespace: default }
spec: { type: nope, config: {} }
EOF
kubectl get mcpauthzconfig bad -o jsonpath='{.status.conditions}'   # expect Valid=False

To confirm the silent-no-op claim under F1: apply only the authzConfigRef form (no inline authzConfig) on an MCPServer, then exec into the resulting pod and look for the authorization middleware in the runner config — at the current PR head it will be absent.

Automated multi-agent review (5 specialists, no consensus disputes). The skill flagged 16 findings; 13 below the consensus threshold were dropped (see local review log).

ChrisJBurns · 2026-06-01T15:40:46Z

+	// The referenced MCPAuthzConfig must exist in the same namespace as this MCPServer.
+	// Mutually exclusive with authzConfig.
+	// +optional
+	AuthzConfigRef *MCPAuthzConfigReference `json:"authzConfigRef,omitempty"`


F1 — MEDIUM: AuthzConfigRef is added here (and to MCPRemoteProxy/VirtualMCPServer) but no workload controller resolves it into a runtime authz config in this PR. Grepping the repo shows Spec.AuthzConfigRef only read by the new MCPAuthzConfigReconciler for bookkeeping (finalizer / ReferenceCount). A user who follows the Deprecated: Use AuthzConfigRef ... guidance on the sibling inline field will get a successfully reconciled MCPServer with no authorization enforced. The PR description discloses this is part 1/2 with #4778 stacked, but the field-level docs and the Deprecated: annotation get ahead of the actual capability.

Consider one of:

Hold the Deprecated: annotation on the inline AuthzConfig field (and the equivalent on MCPRemoteProxy/VirtualMCPServer) until Wire MCPAuthzConfig references into workload controllers #4778 lands.

Add a controller-side validation that emits a Warning event when AuthzConfigRef is set on a workload until the consumer ships, so users aren't silently mis-configured.

Note the staging gap in the field's godoc (e.g. "Consumed by workload controllers in Wire MCPAuthzConfig references into workload controllers #4778; until then this field is reference-tracking only and does not apply authorization.").

ChrisJBurns · 2026-06-01T15:40:46Z


 // MCPServerSpec defines the desired state of MCPServer
 //
+// +kubebuilder:validation:XValidation:rule="!(has(self.authzConfig) && has(self.authzConfigRef))",message="authzConfig and authzConfigRef are mutually exclusive; use authzConfigRef to reference a shared MCPAuthzConfig"


F7 — MEDIUM: This XValidation rule (and the parallel rules on MCPRemoteProxySpec and IncomingAuthConfig) is admission-time only, so unit tests with the fake client can't exercise it. There are no envtest cases that apply each CR with (a) only authzConfig, (b) only authzConfigRef, (c) both set — only (c) should be rejected.

Please add envtest coverage for all three new mutual-exclusion rules. This is the exact scenario the .claude/rules/operator.md testing guidance points at envtest for (over chainsaw).

ChrisJBurns · 2026-06-01T15:40:46Z


-	// AuthzConfig defines authorization policy configuration for the MCP server
+	// AuthzConfig defines authorization policy configuration for the MCP server.
+	// Deprecated: Use AuthzConfigRef to reference a shared MCPAuthzConfig resource instead.


F16 — LOW: Marking the inline field Deprecated: is correct in intent, but there is no migration guide for users today: no conversion table (inline.policies → MCPAuthzConfig.spec.config.policies), no operator-side auto-conversion, and no docs link from this comment. Combined with F1 (no workload controller consumes AuthzConfigRef yet), a user attempting migration in the wrong order ends up with no enforcement.

Either defer the Deprecated: annotation until #4778 lands, or add a short pointer to a migration doc (e.g. docs/operator/migration-authzconfig.md).

ChrisJBurns · 2026-06-01T15:40:46Z

+// Backend-specific validation (delegating to the authorizer factory registry) lives
+// in the controller rather than here, because the API types package must not import
+// the authorizer backends.
+func (r *MCPAuthzConfig) Validate() error {


F13 — LOW: Validate() only checks Type != "" and len(Config.Raw) != 0. Whether the type is actually registered is checked only inside the controller via authorizers.GetFactory(cfg.Type). Other callers of Validate() (a future webhook, CLI tooling, an admission-style helper) would happily accept an unknown type as "valid".

Consider consulting the authorizers registry from Validate() directly so all callers fail-closed:

Suggested change

func (r *MCPAuthzConfig) Validate() error {

func (r *MCPAuthzConfig) Validate() error {

if r.Spec.Type == "" {

return fmt.Errorf("spec.type is required")

}

if !authorizers.IsRegistered(r.Spec.Type) {

return fmt.Errorf("spec.type %q is not a registered authorizer backend (registered: %v)", r.Spec.Type, authorizers.RegisteredTypes())

}

if len(r.Spec.Config.Raw) == 0 {

(Adjust the helper name to whatever the registry actually exposes.)

ChrisJBurns · 2026-06-01T15:40:47Z

+// This controller manages the lifecycle of MCPAuthzConfig resources: validation
+// via the authorizer factory registry, config hash computation, finalizer management,
+// reference tracking, and deletion protection when workloads reference this config.
+type MCPAuthzConfigReconciler struct {


F15 — LOW (optional): The reconciler doesn't take an EventRecorder. For a policy resource this is meaningful: cluster operators investigating a missing/invalid authz policy will not see kubectl describe mcpauthzconfig events explaining "Cedar policy rejected" or "deletion blocked by N workloads". Sibling controllers in this repo wire a recorder; adopting the same pattern here would improve the audit trail and discoverability.

Consider adding a Recorder record.EventRecorder field and wiring it in SetupWithManager via mgr.GetEventRecorderFor("mcpauthzconfig-controller"). Emit Warning events on ConfigInvalid / DeletionBlocked, Normal events on recovery. Avoid putting raw policy contents in event messages (the HTTP PDP URL in particular is topology-sensitive).

ChrisJBurns · 2026-06-01T15:40:47Z

+// BuildFullAuthzConfigJSON reconstructs the full authorizer config JSON from a
+// MCPAuthzConfig spec. The result is the same format accepted by authorizers.Config
+// and used in ConfigMaps: {"version": "1.0", "type": "<type>", "<configKey>": {<config>}}.
+func BuildFullAuthzConfigJSON(spec mcpv1beta1.MCPAuthzConfigSpec) ([]byte, error) {


F10 — LOW: BuildFullAuthzConfigJSON is exported but its only production caller lives in the same controllers package. Per .claude/rules/go-style.md ("Package API surface"), exporting locks the signature for callers that don't yet exist. The doc comment hints this will be used by workload controllers in the future — if so, it'll likely want to live in cmd/thv-operator/pkg/controllerutil/ next to BuildInlineCedarAuthzConfig to avoid downstream packages importing the controllers package.

Suggestion: unexport (buildFullAuthzConfigJSON) for now; promote and relocate when #4778 needs an actual cross-package caller, at which point the consumer's needs may also reshape the signature.

ChrisJBurns · 2026-06-01T15:40:47Z

+		return nil, fmt.Errorf("failed to marshal type: %w", err)
+	}
+
+	fullConfig := map[string]json.RawMessage{


F6 — MEDIUM: This map literal lets the factory-supplied configKey collide with the reserved envelope keys. Go map literals with duplicate keys silently take the last assignment, so a factory whose ConfigKey() returns "version" or "type" (or "") silently overwrites the envelope metadata with attacker-controlled JSON. Today only Cedar ("cedar") and HTTP ("pdp") register, but the AuthorizerFactory interface is exported and the registration path is open to test stubs and future / out-of-tree backends, so this is a latent defense-in-depth gap.

Add a guard in Register() (pkg/authz/authorizers/registry.go) that rejects ConfigKey() values in a reserved set ("version", "type", "") — panicking at process startup makes a misconfigured factory visible immediately. A short unit test that registers a factory with ConfigKey() == "version" and expects a panic locks the contract in.

ChrisJBurns · 2026-06-01T15:40:47Z

+				"referencingWorkloads", referencingWorkloads)
+
+			meta.SetStatusCondition(&authzConfig.Status.Conditions, metav1.Condition{
+				Type:               mcpv1beta1.ConditionTypeDeletionBlocked,


F8 — LOW: handleDeletion sets ConditionTypeDeletionBlocked = True here, but nothing on the non-deletion reconcile path clears it back to False (or removes the condition). Outside the deletion flow, the main Reconcile only touches ConditionTypeAuthzConfigValid. If a user attempts deletion, then cancels (e.g. removes the deletionTimestamp via finalizer removal followed by re-apply), the stale DeletionBlocked: True condition can persist and mislead operators.

On the success path of Reconcile (after validation passes), call meta.SetStatusCondition to set DeletionBlocked = False (reason e.g. NoReferences) or meta.RemoveStatusCondition if DeletionTimestamp.IsZero(). Same pattern is used by sibling config controllers.

ChrisJBurns · 2026-06-01T15:40:47Z

+	assert.Equal(t, initialRV, afterSteady.ResourceVersion, "ResourceVersion should not change (no writes)")
+}
+
+func TestMCPAuthzConfigReconciler_ValidationFailureSetsCondition(t *testing.T) {


F11 — MEDIUM: .claude/rules/operator.md §"Status Writes" warns that direct r.Status().Update (used by this controller — see F2) replaces Status.Conditions wholesale on CRDs via JSON merge-patch. There is no regression test that interleaves two condition writers (e.g. a validation-failure reconcile concurrent with a hash-change reconcile, or an externally-injected condition + a reconcile) and asserts that all set conditions survive.

Without such a test, a future change that re-introduces the race (or that converts the controller to MutateAndPatchStatus and gets it wrong) will go undetected. Suggested approach: pre-populate a fake MCPAuthzConfig with an externally-injected condition type the controller doesn't manage, reconcile once, then assert that condition is still present alongside the controller-managed Valid condition. This is the cheapest possible canary against the wholesale-overwrite class of bug.

ChrisJBurns · 2026-06-01T15:40:47Z

+	assert.True(t, foundCondition, "Should have a Valid condition")
+}
+
+func TestMCPAuthzConfigReconciler_handleDeletion(t *testing.T) {


F12 — LOW: TestMCPAuthzConfigReconciler_handleDeletion covers two static starting states (no workloads → finalizer removed; one workload → blocked/requeue) but never exercises the transition that matters in practice: a config blocked by N references, then the last referencing workload is deleted, then handleDeletion is invoked again and the finalizer is removed.

Add a sub-test that (1) creates a deleting config plus one referencing workload, invokes handleDeletion (assert requeue + finalizer still present), (2) deletes the referencing workload, invokes handleDeletion again (assert finalizer removed). This is the only path through the deletion-protection logic that actually flips state, and it is currently uncovered.

JAORMX requested review from ChrisJBurns, jerm-dro, jhrozek and yrobla as code owners April 13, 2026 13:57

github-actions Bot added the size/XL Extra large PR: 1000+ lines changed label Apr 13, 2026

github-actions Bot previously requested changes Apr 13, 2026

View reviewed changes

claude Bot reviewed Apr 13, 2026

View reviewed changes

JAORMX mentioned this pull request Apr 13, 2026

Wire MCPAuthzConfig references into workload controllers #4778

Open

3 tasks

ChrisJBurns requested changes Apr 13, 2026

View reviewed changes

JAORMX force-pushed the jaosorior/mcpauthzconfig-crd branch from 734e8db to d571e89 Compare April 14, 2026 06:19

github-actions Bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Apr 14, 2026

JAORMX force-pushed the jaosorior/mcpauthzconfig-crd branch from d571e89 to 321c747 Compare May 29, 2026 11:19

github-actions Bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels May 29, 2026

This was referenced May 29, 2026

Tighten ReferencingWorkloads uniqueness to kind+name across config CRDs #5401

Open

Migrate config controllers to MutateAndPatch helpers #5402

Open

github-actions Bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels May 29, 2026

Merge branch 'main' into jaosorior/mcpauthzconfig-crd

1e95d5c

github-actions Bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Jun 1, 2026

ChrisJBurns reviewed Jun 1, 2026

View reviewed changes

-// +kubebuilder:rbac:groups=toolhive.stacklok.dev,resources=mcpauthzconfigs/finalizers,verbs=update
+// +kubebuilder:rbac:groups=toolhive.stacklok.dev,resources=mcpauthzconfigs/finalizers,verbs=update
+// +kubebuilder:rbac:groups=toolhive.stacklok.dev,resources=mcpservers,verbs=get;list;watch
+// +kubebuilder:rbac:groups=toolhive.stacklok.dev,resources=virtualmcpservers,verbs=get;list;watch
+// +kubebuilder:rbac:groups=toolhive.stacklok.dev,resources=mcpremoteproxies,verbs=get;list;watch

		@@ -0,0 +1,616 @@
		// SPDX-FileCopyrightText: Copyright 2025 Stacklok, Inc.

-func (r *MCPAuthzConfig) Validate() error {
+func (r *MCPAuthzConfig) Validate() error {
+	if r.Spec.Type == "" {
+		return fmt.Errorf("spec.type is required")
+	}
+	if !authorizers.IsRegistered(r.Spec.Type) {
+		return fmt.Errorf("spec.type %q is not a registered authorizer backend (registered: %v)", r.Spec.Type, authorizers.RegisteredTypes())
+	}
+	if len(r.Spec.Config.Raw) == 0 {

Conversation

JAORMX commented Apr 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Type of change

Test plan

Changes

Special notes for reviewers

Large PR Justification

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Large PR Detected

How to unblock this PR:

Alternative:

Uh oh!

codecov Bot commented Apr 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

claude Bot Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Apr 13, 2026

Choose a reason for hiding this comment

Uh oh!

ChrisJBurns left a comment

Choose a reason for hiding this comment

Multi-Agent Consensus Review

Consensus Summary

Overall

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented May 29, 2026

Uh oh!

JAORMX commented May 29, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ChrisJBurns left a comment

Choose a reason for hiding this comment

Multi-agent review summary

Consensus summary

Pros

Risks worth weighing

Suggested manual verification

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

JAORMX commented Apr 13, 2026 •

edited

Loading

codecov Bot commented Apr 13, 2026 •

edited

Loading