Skip to content

Add MCPAuthzConfig CRD for backend-agnostic authorization#4777

Open
JAORMX wants to merge 3 commits into
mainfrom
jaosorior/mcpauthzconfig-crd
Open

Add MCPAuthzConfig CRD for backend-agnostic authorization#4777
JAORMX wants to merge 3 commits into
mainfrom
jaosorior/mcpauthzconfig-crd

Conversation

@JAORMX
Copy link
Copy Markdown
Collaborator

@JAORMX JAORMX commented Apr 13, 2026

Summary

  • The current inline authz configuration (authzConfig) is Cedar-specific and doesn't support alternative authorization backends like HTTP PDP. This adds a new MCPAuthzConfig CRD that decouples authorization configuration from the workload CRDs, enabling any registered authorizer backend to be used.
  • Adds ConfigKey() string to the AuthorizerFactory interface so the MCPAuthzConfig controller can reconstruct full config JSON without hardcoding backend-specific keys.
  • Creates the MCPAuthzConfig CRD with type + config (RawExtension) spec, validation via factory registry, hash-based change detection, referencing workload tracking, and deletion protection.
  • Adds AuthzConfigRef field to MCPServer, MCPRemoteProxy, and VirtualMCPServer specs for referencing shared MCPAuthzConfig resources.

Relates to #3157

Type of change

  • New feature

Test plan

  • Unit tests (task test)
  • Linting (task lint-fix)

Changes

Note: this PR was re-implemented against the v1beta1 API after main graduated v1beta1 to the storage version. The new CRD types live in cmd/thv-operator/api/v1beta1/, with a deprecated compat shell in v1alpha1, mirroring the established MCPOIDCConfig pattern. Generated CRD YAML, deepcopy, helm templates, RBAC, and CRD reference docs are regenerated accordingly.

Area Change
pkg/authz/authorizers/registry.go Add ConfigKey() to AuthorizerFactory interface
pkg/authz/authorizers/cedar/core.go Implement ConfigKey() returning "cedar"
pkg/authz/authorizers/http/core.go Implement ConfigKey() returning "http"
cmd/thv-operator/api/v1beta1/mcpauthzconfig_types.go New CRD types + Validate(): MCPAuthzConfig, MCPAuthzConfigSpec, MCPAuthzConfigStatus, MCPAuthzConfigReference
cmd/thv-operator/api/v1beta1/{mcpserver,mcpremoteproxy,virtualmcpserver}_types.go Add AuthzConfigRef + CEL mutual-exclusion XValidation
cmd/thv-operator/api/v1alpha1/* Deprecated compat shims embedding v1beta1
cmd/thv-operator/controllers/mcpauthzconfig_controller.go MCPAuthzConfig reconciler: validation, hash, finalizer, watches
cmd/thv-operator/app/app.go Register MCPAuthzConfig controller; backend blank-imports
deploy/charts/operator* Regenerated CRD YAML, RBAC clusterrole, helm templates, CRD docs

Special notes for reviewers

  • This is part 1 of 2: CRD + controller only. Part 2 (Wire MCPAuthzConfig references into workload controllers #4778, wiring references into workload controllers) follows on top.
  • BuildFullAuthzConfigJSON() reconstructs the full config envelope ({"version","type","<configKey>": ...}) using the factory's ConfigKey() — this avoids hardcoding Cedar-specific structure. mustMarshalJSON's panic was replaced with an error-returning marshalJSONString.
  • The controller watches MCPServer, VirtualMCPServer, and MCPRemoteProxy to maintain accurate ReferencingWorkloads status and enforce deletion protection. RBAC markers for the watched resources are present and rendered into the generated clusterrole.
  • CEL XValidation enforces authzConfig/authzConfigRef mutual exclusivity on all three workload specs, mirroring the oidcConfig/oidcConfigRef rules, and renders into both served CRD versions.

Large PR Justification

This PR is a single cohesive, atomic addition: a new MCPAuthzConfig CRD plus its controller. The pieces cannot be split without leaving the tree in a non-functional intermediate state:

  • The API types, deepcopy, CRD YAML, helm templates, RBAC, and CRD reference docs are all machine-generated from the same type definitions — splitting them produces failing verify-gen runs.
  • The controller, its SetupWithManager registration in app.go, and the ConfigKey() interface method are mutually dependent and won't compile apart.
  • Of the additions, the large majority is generated YAML/docs/deepcopy and tests. Hand-written production Go is a small fraction.

PR #4778 is stacked on top of this and wires these refs into the workload controllers, so the CRD + controller must land first as a unit.

Generated with Claude Code

@github-actions github-actions Bot added the size/XL Extra large PR: 1000+ lines changed label Apr 13, 2026
Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Large PR Detected

This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.

How to unblock this PR:

Add a section to your PR description with the following format:

## Large PR Justification

[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformation

Alternative:

Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.

See our Contributing Guidelines for more details.


This review will be automatically dismissed once you add the justification section.

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 13, 2026

Codecov Report

❌ Patch coverage is 68.64407% with 74 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.77%. Comparing base (35e66bb) to head (1e95d5c).

Files with missing lines Patch % Lines
...-operator/controllers/mcpauthzconfig_controller.go 68.63% 46 Missing and 23 partials ⚠️
cmd/thv-operator/app/app.go 0.00% 5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4777      +/-   ##
==========================================
- Coverage   68.80%   68.77%   -0.04%     
==========================================
  Files         629      631       +2     
  Lines       63914    64150     +236     
==========================================
+ Hits        43978    44121     +143     
- Misses      16678    16749      +71     
- Partials     3258     3280      +22     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Comment on lines 157 to 176
- message: inline must be set when type is 'inline', and must not
be set otherwise
rule: 'self.type == ''inline'' ? has(self.inline) : !has(self.inline)'
authzConfigRef:
description: |-
AuthzConfigRef references a shared MCPAuthzConfig resource for authorization.
The referenced MCPAuthzConfig must exist in the same namespace as this MCPServer.
Mutually exclusive with authzConfig.
properties:
name:
description: Name is the name of the MCPAuthzConfig resource in
the same namespace.
minLength: 1
type: string
required:
- name
type: object
backendReplicas:
description: |-
BackendReplicas is the desired number of MCP server backend pod replicas.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 The new authzConfig/authzConfigRef fields are documented as mutually exclusive across MCPServer, MCPRemoteProxy, and VirtualMCPServer's IncomingAuthConfig, but no CEL x-kubernetes-validations rule enforces this at the API level. A user can submit a resource with both fields set and the Kubernetes API server will accept it without error. The fix is to add +kubebuilder:validation:XValidation markers (and the corresponding x-kubernetes-validations entries in the generated CRD YAMLs) mirroring the existing oidcConfig/oidcConfigRef precedent for all three affected types.

Extended reasoning...

What the bug is and how it manifests

The PR introduces authzConfigRef fields to MCPServerSpec, MCPRemoteProxySpec, and IncomingAuthConfig (VirtualMCPServer), explicitly documenting them as mutually exclusive with the existing authzConfig field. However, unlike the analogous oidcConfig/oidcConfigRef pair, no CEL validation rule is added to enforce this constraint at admission time. The Kubernetes API server therefore accepts resources with both fields populated simultaneously.

The specific code path that triggers it

In mcpremoteproxy_types.go, the struct-level kubebuilder marker reads:

// +kubebuilder:validation:XValidation:rule="\!(has(self.oidcConfig) && has(self.oidcConfigRef))",message="oidcConfig and oidcConfigRef are mutually exclusive..."
type MCPRemoteProxySpec struct {

No equivalent marker exists for authzConfig/authzConfigRef in MCPRemoteProxySpec, MCPServerSpec, or IncomingAuthConfig. The generated CRD YAML (e.g. toolhive.stacklok.dev_mcpremoteproxies.yaml) reflects this: the spec-level x-kubernetes-validations block only contains the OIDC rule, not an authz rule.

Why existing code doesn't prevent it

The Go field comments say "Mutually exclusive with authzConfig" but comments are not enforced. The kubebuilder XValidation marker is what generates the CEL rule into the CRD schema. That marker was added for oidcConfig/oidcConfigRef but was omitted for the new authzConfig/authzConfigRef pair in all three types.

Impact

Any user who sets both authzConfig and authzConfigRef will get a resource that is silently accepted by the API server. When Part 2 wires these fields into the workload controllers, the controller will face ambiguity: it will need to pick one field, likely silently ignoring the other, leading to unexpected authorization behavior that is hard to diagnose.

How to fix it

Add the following marker to MCPServerSpec, MCPRemoteProxySpec, and IncomingAuthConfig (adjacent to the existing oidcConfig/oidcConfigRef marker):

// +kubebuilder:validation:XValidation:rule="\!(has(self.authzConfig) && has(self.authzConfigRef))",message="authzConfig and authzConfigRef are mutually exclusive"

Then regenerate the CRD manifests so that the three affected YAML files each gain a corresponding x-kubernetes-validations entry at the spec level.

Step-by-step proof

  1. Create an MCPServer manifest with both fields:
apiVersion: toolhive.stacklok.dev/v1alpha1
kind: MCPServer
metadata:
  name: test
spec:
  image: example/mcp:latest
  authzConfig:
    type: inline
    inline:
      policies: ["permit(principal, action, resource);"]
  authzConfigRef:
    name: my-authzconfig
  1. Apply it to a cluster with this PR's CRDs installed.
  2. Observe: kubectl apply succeeds with no validation error.
  3. Compare with the equivalent test using oidcConfig + oidcConfigRef: the API server rejects it with "oidcConfig and oidcConfigRef are mutually exclusive" because the CEL rule exists for OIDC but not for authz.

Comment on lines +119 to +141
if hashChanged {
logger.Info("MCPAuthzConfig configuration changed",
"oldHash", authzConfig.Status.ConfigHash,
"newHash", configHash)

authzConfig.Status.ConfigHash = configHash
authzConfig.Status.ObservedGeneration = authzConfig.Generation

if err := r.Status().Update(ctx, authzConfig); err != nil {
logger.Error(err, "Failed to update MCPAuthzConfig status")
return ctrl.Result{}, err
}
return ctrl.Result{}, nil
}

// Refresh ReferencingWorkloads list
referencingWorkloads, err := r.findReferencingWorkloads(ctx, authzConfig)
if err != nil {
logger.Error(err, "Failed to find referencing workloads")
} else if !ctrlutil.WorkloadRefsEqual(authzConfig.Status.ReferencingWorkloads, referencingWorkloads) {
authzConfig.Status.ReferencingWorkloads = referencingWorkloads
conditionChanged = true
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 The new MCPAuthzConfig controller has two related issues that leave ReferencingWorkloads status stale. First, when findReferencingWorkloads returns an error in the normal reconcile path (lines 134-141), the error is only logged and execution continues without requeueing — unlike handleDeletion which correctly returns the error for automatic retry. Second, when hashChanged==true, the controller returns early at line 131 before ever reaching the findReferencingWorkloads call at line 135, so workload reference changes concurrent with spec changes are silently dropped for that reconcile cycle. The fix is to refresh workload refs before the hash-changed early return, and to return the error (with requeue) when findReferencingWorkloads fails, consistent with the deletion path.

Extended reasoning...

Bug 1 — Swallowed error in normal reconcile path (lines 134-141)

In the normal reconcile path, when findReferencingWorkloads returns an error, the code at lines 134-141 logs the error and falls through without returning it:

referencingWorkloads, err := r.findReferencingWorkloads(ctx, authzConfig)
if err != nil {
    logger.Error(err, "Failed to find referencing workloads")
} else if ...

The deletion path (handleDeletion) correctly propagates the error, which causes controller-runtime to schedule a retry with exponential backoff:

referencingWorkloads, err := r.findReferencingWorkloads(ctx, authzConfig)
if err != nil {
    logger.Error(err, "Failed to check referencing workloads during deletion")
    return ctrl.Result{}, err  // requeues
}

On a transient Kubernetes API error (e.g., brief etcd unavailability), the normal reconcile path silently leaves ReferencingWorkloads stale with no automatic retry. The status is only corrected when an unrelated event (e.g., a workload change) happens to re-trigger reconciliation. This asymmetry violates controller-runtime best practices and is the identical pattern that exists in toolconfig_controller.go, meaning the PR copies a known sub-optimal approach rather than improving on it.

Bug 2 — Hash-changed early return skips ReferencingWorkloads refresh (lines 119-132)

When hashChanged==true, the reconciler updates ConfigHash and ObservedGeneration, calls r.Status().Update(), and returns at line 131:

if hashChanged {
    authzConfig.Status.ConfigHash = configHash
    authzConfig.Status.ObservedGeneration = authzConfig.Generation
    if err := r.Status().Update(ctx, authzConfig); err != nil { ... }
    return ctrl.Result{}, nil  // <-- findReferencingWorkloads is never called
}
// This is unreachable when hashChanged==true:
referencingWorkloads, err := r.findReferencingWorkloads(ctx, authzConfig)

Concrete proof for Bug 2: Consider MCPAuthzConfig my-authz with no referencing workloads. A user simultaneously (a) modifies the spec, and (b) adds authzConfigRef: {name: my-authz} to an MCPServer. Controller-runtime coalesces both watch events into a single reconcile. Inside that reconcile: hashChanged==true → controller updates ConfigHash and returns. ReferencingWorkloads is never refreshed. The watch handler for MCPServer fired to trigger this reconcile, so no second reconcile is enqueued. The status shows referencingWorkloads: [] even though the MCPServer now references the config, until either another spec change or another workload event occurs.

Addressing the refutation for Bug 2: One verifier argues that the watch-based reconciliation mechanism compensates for the early return, because a workload change fires a watch event that enqueues a new reconcile. This is true in the separate event case. However, it fails in the coalesced event case described above: if controller-runtime batches the spec-change event and the workload-change event into a single reconcile call, the controller sees hashChanged==true, returns early, and consumes the single reconcile that would have refreshed the refs. No additional reconcile is triggered because the watch event for the workload has already been processed. This is a real (if narrow) window of inconsistency.

Impact and fix: The actual deletion protection in handleDeletion queries live state and is unaffected. The impact is stale ReferencingWorkloads status visible to users and operators. The fix is to call findReferencingWorkloads before the hash-change early return and include its result in the combined status update (as the MCPTelemetryConfig controller already does correctly), and to return the error rather than swallowing it when the lookup fails.

Copy link
Copy Markdown
Collaborator

@ChrisJBurns ChrisJBurns left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multi-Agent Consensus Review

Agents consulted: kubernetes-expert, go-security-reviewer, code-reviewer, toolhive-expert

Consensus Summary

# Finding Consensus Severity Action
1 Missing CEL XValidation for authzConfig/authzConfigRef mutual exclusivity 10/10 HIGH Fix
2 Missing RBAC markers for watched workload resources 10/10 MEDIUM Fix
3 mustMarshalJSON uses panic in controller 10/10 MEDIUM Fix
4 No nil-check on spec.Config.Raw before use 8/10 MEDIUM Fix
5 Test coverage gaps in watch handlers and deletion paths 7/10 MEDIUM Fix
6 No Validate() method on MCPAuthzConfig type (pattern deviation) 7/10 LOW Discuss
7 ReferencingWorkloads listMapKey=name insufficient for uniqueness 7/10 LOW Discuss (pre-existing)
8 Controller blank-imports authorizer backend packages 7/10 LOW Discuss

Overall

This is a well-structured PR that closely follows the established MCPOIDCConfig controller pattern. The CRD design is clean, the ConfigKey() interface addition is minimal, and the findStaleRefs helper is a nice DRY improvement over the inline stale-ref scanning in existing controllers. The part 1/part 2 split is a sensible approach for incremental delivery.

The main blocker is the missing CEL XValidation rules (Finding #1). The existing OIDC pattern enforces mutual exclusivity between oidcConfig and oidcConfigRef via CEL rules, but no equivalent rules exist for authzConfig/authzConfigRef. Without these, the API server silently accepts resources with both fields set, contradicting the documented contract. The remaining findings are medium-severity improvements: RBAC marker completeness, replacing panic with error return, and adding a nil-check on Config.Raw.


Generated with Claude Code

// The referenced MCPAuthzConfig must exist in the same namespace as this MCPServer.
// Mutually exclusive with authzConfig.
// +optional
AuthzConfigRef *MCPAuthzConfigReference `json:"authzConfigRef,omitempty"`
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[HIGH] Missing CEL XValidation for authzConfig/authzConfigRef mutual exclusivity (Consensus: 10/10)

The AuthzConfig and AuthzConfigRef fields are documented as "mutually exclusive" but no CEL validation rule enforces this at the API server level. The existing OIDC pattern has:

// +kubebuilder:validation:XValidation:rule="!(has(self.oidcConfig) && has(self.oidcConfigRef))",message="oidcConfig and oidcConfigRef are mutually exclusive..."

The same rule is needed here on MCPServerSpec, MCPRemoteProxySpec, and IncomingAuthConfig:

// +kubebuilder:validation:XValidation:rule="!(has(self.authzConfig) && has(self.authzConfigRef))",message="authzConfig and authzConfigRef are mutually exclusive; use authzConfigRef to reference a shared MCPAuthzConfig"

Without this, users can set both fields simultaneously and the controller must decide which takes precedence.

Raised by: kubernetes-expert, code-reviewer, toolhive-expert


// +kubebuilder:rbac:groups=toolhive.stacklok.dev,resources=mcpauthzconfigs,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=toolhive.stacklok.dev,resources=mcpauthzconfigs/status,verbs=get;update;patch
// +kubebuilder:rbac:groups=toolhive.stacklok.dev,resources=mcpauthzconfigs/finalizers,verbs=update
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MEDIUM] Missing RBAC markers for watched workload resources (Consensus: 10/10)

The controller watches MCPServer, VirtualMCPServer, and MCPRemoteProxy (via SetupWithManager) and lists them (in findReferencingWorkloads), but no +kubebuilder:rbac markers declare these permissions. The MCPOIDCConfig controller declares them:

Suggested change
// +kubebuilder:rbac:groups=toolhive.stacklok.dev,resources=mcpauthzconfigs/finalizers,verbs=update
// +kubebuilder:rbac:groups=toolhive.stacklok.dev,resources=mcpauthzconfigs/finalizers,verbs=update
// +kubebuilder:rbac:groups=toolhive.stacklok.dev,resources=mcpservers,verbs=get;list;watch
// +kubebuilder:rbac:groups=toolhive.stacklok.dev,resources=virtualmcpservers,verbs=get;list;watch
// +kubebuilder:rbac:groups=toolhive.stacklok.dev,resources=mcpremoteproxies,verbs=get;list;watch

While these permissions are currently aggregated from other controllers, they should be self-declared for completeness and to prevent breakage if another controller is removed.

Raised by: kubernetes-expert, code-reviewer, toolhive-expert

Comment thread cmd/thv-operator/controllers/mcpauthzconfig_controller.go Outdated
fullConfig := map[string]json.RawMessage{
"version": mustMarshalJSON(authzConfigVersion),
"type": mustMarshalJSON(spec.Type),
configKey: spec.Config.Raw,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MEDIUM] No nil-check on spec.Config.Raw before use (Consensus: 8/10)

If spec.Config.Raw is nil (bypassed validation, manual testing, or future code changes), this produces "cedar": null in the JSON output, leading to confusing downstream errors. Add a defensive check before use:

if len(spec.Config.Raw) == 0 {
    return nil, fmt.Errorf("config field is empty")
}

The config field is required in the CRD schema, but defense-in-depth is warranted for RawExtension fields.

Raised by: code-reviewer, kubernetes-expert

@@ -0,0 +1,616 @@
// SPDX-FileCopyrightText: Copyright 2025 Stacklok, Inc.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MEDIUM] Test coverage gaps in watch handlers and deletion paths (Consensus: 7/10)

The following methods have zero test coverage (Codecov reports 43.5% patch / 122 uncovered lines):

  • findReferencingWorkloads (covers MCPServer, VirtualMCPServer, MCPRemoteProxy listing)
  • findStaleRefs (stale reference cleanup)
  • All three map*ToAuthzConfig watch handlers
  • SetupWithManager

Additionally, the handleDeletion test only covers the MCPServer reference path -- VirtualMCPServer and MCPRemoteProxy deletion-blocking cases are missing.

At minimum, consider adding tests for findReferencingWorkloads covering all three workload types, and extending the deletion test with VirtualMCPServer and MCPRemoteProxy cases.

Raised by: code-reviewer

// +optional
ReferencingWorkloads []WorkloadReference `json:"referencingWorkloads,omitempty"`
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[LOW] ReferencingWorkloads listMapKey=name may be insufficient (Consensus: 7/10)

Using only name as the list map key could collide if an MCPServer and a VirtualMCPServer share the same name in the same namespace. The composite key should be kind + name:

// +listMapKey=kind
// +listMapKey=name

This is a pre-existing pattern copied from MCPOIDCConfig, MCPTelemetryConfig, etc. -- consider fixing across all config CRDs in a separate PR.

Raised by: kubernetes-expert

Comment thread cmd/thv-operator/controllers/mcpauthzconfig_controller.go Outdated
// Type identifies the authorizer backend (e.g., "cedarv1", "httpv1").
// Must match a registered authorizer type in the factory registry.
// +kubebuilder:validation:Required
// +kubebuilder:validation:MinLength=1
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[LOW] No Validate() method on MCPAuthzConfig type (Consensus: 7/10)

Other shared config CRDs (MCPOIDCConfig, MCPTelemetryConfig, MCPExternalAuthConfig) define a Validate() method on the type itself, called from the controller as config.Validate(). This CRD uses a standalone validateAuthzConfigSpec() function in the controller package instead.

The deviation is understandable (validation requires the factory registry, an external dependency), but a comment in the controller explaining why would help maintainability.

Raised by: toolhive-expert, code-reviewer

@JAORMX JAORMX force-pushed the jaosorior/mcpauthzconfig-crd branch from 734e8db to d571e89 Compare April 14, 2026 06:19
@github-actions github-actions Bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Apr 14, 2026
Introduce a namespace-scoped MCPAuthzConfig CRD so authorization
policy can be defined once and shared across MCPServer,
MCPRemoteProxy, and VirtualMCPServer workloads, mirroring the
existing MCPOIDCConfig sharing pattern.

The spec carries a backend Type plus an opaque Config RawExtension.
A ConfigKey() method on AuthorizerFactory (cedar->"cedar",
http->"pdp") lets the controller reconstruct the full authorizer
config and validate it via the factory registry, keeping the
controller backend-agnostic; the backends are registered through
blank imports in the operator entrypoint.

The controller validates the spec, computes a config hash, manages a
finalizer, tracks referencing workloads, and blocks deletion while
referenced. AuthzConfigRef fields are added to the three workload
specs with CEL XValidation enforcing mutual exclusivity against the
existing inline authzConfig, matching the oidcConfig/oidcConfigRef
pattern.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@JAORMX JAORMX force-pushed the jaosorior/mcpauthzconfig-crd branch from d571e89 to 321c747 Compare May 29, 2026 11:19
@github-actions github-actions Bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels May 29, 2026
@github-actions github-actions Bot dismissed their stale review May 29, 2026 11:29

Large PR justification has been provided. Thank you!

@github-actions
Copy link
Copy Markdown
Contributor

✅ Large PR justification has been provided. The size review has been dismissed and this PR can now proceed with normal review.

@github-actions github-actions Bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels May 29, 2026
@JAORMX
Copy link
Copy Markdown
Collaborator Author

JAORMX commented May 29, 2026

Thanks for the thorough multi-agent review. Addressed all findings. Note: since this was last touched, main graduated the operator API to v1beta1 as the storage version, so the PR has been re-implemented against v1beta1 (with a deprecated v1alpha1 compat shell), mirroring the established MCPOIDCConfig pattern. Per-finding:

  1. HIGH — CEL XValidation mutual exclusivity: Done. !(has(self.authzConfig) && has(self.authzConfigRef)) is applied to MCPServerSpec, MCPRemoteProxySpec, and IncomingAuthConfig (vMCP), matching the oidcConfig/oidcConfigRef pattern, and renders into both served CRD versions.
  2. MEDIUM — RBAC markers: Added; the regenerated clusterrole grants get/list/watch on the watched workload types plus the mcpauthzconfigs status/finalizers subresources. mcpservers RBAC is inherited from the MCPServer controller, consistent with MCPOIDCConfig.
  3. MEDIUM — panic in mustMarshalJSON: Replaced with marshalJSONString returning an error, propagated through BuildFullAuthzConfigJSON.
  4. MEDIUM — nil-check on spec.Config.Raw: Guarded (len(spec.Config.Raw) == 0 → explicit error), plus a Validate() method on the type as defense-in-depth.
  5. MEDIUM — test coverage: Added watch-handler tests (current + stale-ref enqueue for all three workload kinds, plus wrong-type and no-ref cases) and deletion-path tests (finalizer add, deletion-without-finalizer, reference-blocking per workload type).
  6. LOW — Validate() method: Added, mirroring MCPOIDCConfig. Structural checks only; backend-specific validation stays in the controller because it needs the authorizer registry the API package must not import (documented inline).
  7. LOW — ReferencingWorkloads listMapKey: Aligned to +listMapKey=name, matching the shared WorkloadReference convention used by every other config CRD. The "name not unique across kinds" concern is pre-existing and repo-wide — suggest addressing it uniformly in a separate change.
  8. LOW — controller blank-imports backends: Kept out of the controller; they live in the operator entrypoint (app.go) so the controller stays backend-agnostic, with a comment explaining why.

Relates to #3157 (still open; this implements its "Solution option 3"). Part 2 (#4778) is stacked on top and will be rebased once this lands.

The chainsaw operator RBAC assertions compare the live
toolhive-operator-manager-role ClusterRole against a hardcoded
expected manifest. Adding the MCPAuthzConfig CRD introduced new
mcpauthzconfigs rules in the generated ClusterRole, making the
fixtures stale and failing E2E on all kind versions.

Add the mcpauthzconfigs, mcpauthzconfigs/finalizers, and
mcpauthzconfigs/status entries in the same alphabetical position the
generated role uses (after embeddingservers*, before
mcpexternalauthconfigs*) to both the multi-tenancy and single-tenancy
setup fixtures.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels May 29, 2026
@github-actions github-actions Bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Jun 1, 2026
// ─── MCPAuthzConfig ──────────────────────────────────────────────────────────

//+kubebuilder:object:root=true
//+kubebuilder:deprecatedversion:warning="toolhive.stacklok.dev/v1alpha1 is deprecated; use v1beta1"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need this? As MCPAuthzConfig is a new CRD?

//+kubebuilder:printcolumn:name="References",type=integer,JSONPath=`.status.referenceCount`
//+kubebuilder:printcolumn:name="Age",type=date,JSONPath=`.metadata.creationTimestamp`

// MCPAuthzConfig is the deprecated v1alpha1 version of the MCPAuthzConfig resource.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't true?

Copy link
Copy Markdown
Collaborator

@ChrisJBurns ChrisJBurns left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multi-agent review summary

Five specialist reviewers (operator, Go, authz/security, tests, general) examined this PR; codex cross-review was skipped (CLI not installed). The design — a backend-agnostic MCPAuthzConfig CRD with a RawExtension payload, factory-driven validation, and a stacked AuthzConfigRef on workload specs — is the right shape, and the multi-version CRD plumbing (v1alpha1 compat shell embedding v1beta1, generated YAML / deepcopy / helm / RBAC / docs / chainsaw asserts) is internally consistent.

The implementation correctness is mostly good but inherits the same pre-existing anti-patterns the codebase rules flag — direct r.Status().Update and r.Update rather than MutateAndPatchStatus / MutateAndPatchSpec. There are also a handful of defense-in-depth gaps worth tightening before #4778 wires AuthzConfigRef into the workload controllers, since those gaps become user-visible at that point.

After consensus scoring there are 0 HIGH, 6 MEDIUM, 10 LOW findings. No blocking issues, but several MEDIUMs cluster around operator-rule compliance, hash determinism, and CEL test coverage.

Consensus summary

# Finding Severity Action
F1 AuthzConfigRef field added but no workload controller consumes it (intentional per PR description, but Deprecated: annotation pre-announces capability) MEDIUM Fix/confirm
F2 Status writes use r.Status().Update instead of MutateAndPatchStatus MEDIUM Fix
F3 Finalizer writes use r.Update instead of MutateAndPatchSpec MEDIUM Fix
F4 ReferencingWorkloads refresh skipped on hash-change branch LOW Fix
F5 Hash over RawExtension flaps on whitespace / key-order MEDIUM Fix
F6 BuildFullAuthzConfigJSON lets factory key shadow reserved version/type keys MEDIUM Fix
F7 CEL mutual-exclusion rules on three CRD types have no envtest coverage MEDIUM Fix
F8 DeletionBlocked condition never cleared on success path LOW Fix
F9 findReferencingWorkloads error swallowed; reconcile returns success LOW Fix
F10 BuildFullAuthzConfigJSON exported but only used inside controllers package LOW Fix
F11 No regression test for Status.Conditions wholesale-overwrite race MEDIUM Fix
F12 No test for finalizer-removed-after-last-reference transition LOW Fix
F13 MCPAuthzConfig.Validate() doesn't check Type against registry LOW Fix
F14 kubebuilder RBAC marker omits mcpservers which the controller watches LOW Fix
F15 No K8s events emitted on validation failure / deletion-block LOW Optional
F16 Migration path from inline authzConfig to authzConfigRef undocumented LOW Optional

Detailed comments are inline. Pros, tradeoffs and a suggested test plan are below.

Pros

  • Backend-agnostic envelope (ConfigKey()) is a clean way to support multiple authorizers without hardcoding Cedar specifics.
  • findStaleRefs extraction is a notable improvement over the duplicated loops in MCPOIDCConfig — worth back-porting.
  • Replacing mustMarshalJSON's panic with an error-returning helper is the right direction (even if the helper is over-engineered, see review comments).
  • Multi-version CRD plumbing is correctly applied; deepcopy, CRD YAML, helm templates, RBAC and chainsaw asserts all stay in sync.
  • Stale-ref enqueue + sorting/dedup keep Status.ReferencingWorkloads stable across reconciles.
  • Large-PR justification is accurate — the majority of additions are generated.

Risks worth weighing

  • The PR adds AuthzConfigRef to the workload specs and annotates the existing inline field as Deprecated: to point at it, but no workload controller resolves the ref into a runtime authz config in this PR. Until #4778 lands, a user who follows the deprecation guidance gets a workload running with no authorization enforced. The PR description discloses this; the field-level docs do not.
  • Status / finalizer writes use raw Update calls, which the codebase's own operator rules (.claude/rules/operator.md) explicitly call out as the anti-pattern they exist to prevent. Sibling MCPOIDCConfig controller has the same issue, so this isn't a regression, but new code is the natural place to stop the bleed.

Suggested manual verification

# Generated-artifact drift check
task operator-generate
task operator-manifests
task -d cmd/thv-operator crdref-gen
git diff --exit-code   # must be clean

# Smoke: valid + invalid MCPAuthzConfig
kubectl apply -f - <<EOF
apiVersion: toolhive.stacklok.dev/v1beta1
kind: MCPAuthzConfig
metadata: { name: cedar-test, namespace: default }
spec:
  type: cedarv1
  config:
    version: "1.0"
    policies: ["permit(principal, action, resource);"]
    entitiesJson: "[]"
EOF
kubectl get mcpauthzconfig cedar-test -o yaml   # expect Valid=True, referenceCount=0

kubectl apply -f - <<EOF
apiVersion: toolhive.stacklok.dev/v1beta1
kind: MCPAuthzConfig
metadata: { name: bad, namespace: default }
spec: { type: nope, config: {} }
EOF
kubectl get mcpauthzconfig bad -o jsonpath='{.status.conditions}'   # expect Valid=False

To confirm the silent-no-op claim under F1: apply only the authzConfigRef form (no inline authzConfig) on an MCPServer, then exec into the resulting pod and look for the authorization middleware in the runner config — at the current PR head it will be absent.


Automated multi-agent review (5 specialists, no consensus disputes). The skill flagged 16 findings; 13 below the consensus threshold were dropped (see local review log).

// The referenced MCPAuthzConfig must exist in the same namespace as this MCPServer.
// Mutually exclusive with authzConfig.
// +optional
AuthzConfigRef *MCPAuthzConfigReference `json:"authzConfigRef,omitempty"`
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

F1 — MEDIUM: AuthzConfigRef is added here (and to MCPRemoteProxy/VirtualMCPServer) but no workload controller resolves it into a runtime authz config in this PR. Grepping the repo shows Spec.AuthzConfigRef only read by the new MCPAuthzConfigReconciler for bookkeeping (finalizer / ReferenceCount). A user who follows the Deprecated: Use AuthzConfigRef ... guidance on the sibling inline field will get a successfully reconciled MCPServer with no authorization enforced. The PR description discloses this is part 1/2 with #4778 stacked, but the field-level docs and the Deprecated: annotation get ahead of the actual capability.

Consider one of:

  1. Hold the Deprecated: annotation on the inline AuthzConfig field (and the equivalent on MCPRemoteProxy/VirtualMCPServer) until Wire MCPAuthzConfig references into workload controllers #4778 lands.
  2. Add a controller-side validation that emits a Warning event when AuthzConfigRef is set on a workload until the consumer ships, so users aren't silently mis-configured.
  3. Note the staging gap in the field's godoc (e.g. "Consumed by workload controllers in Wire MCPAuthzConfig references into workload controllers #4778; until then this field is reference-tracking only and does not apply authorization.").


// MCPServerSpec defines the desired state of MCPServer
//
// +kubebuilder:validation:XValidation:rule="!(has(self.authzConfig) && has(self.authzConfigRef))",message="authzConfig and authzConfigRef are mutually exclusive; use authzConfigRef to reference a shared MCPAuthzConfig"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

F7 — MEDIUM: This XValidation rule (and the parallel rules on MCPRemoteProxySpec and IncomingAuthConfig) is admission-time only, so unit tests with the fake client can't exercise it. There are no envtest cases that apply each CR with (a) only authzConfig, (b) only authzConfigRef, (c) both set — only (c) should be rejected.

Please add envtest coverage for all three new mutual-exclusion rules. This is the exact scenario the .claude/rules/operator.md testing guidance points at envtest for (over chainsaw).


// AuthzConfig defines authorization policy configuration for the MCP server
// AuthzConfig defines authorization policy configuration for the MCP server.
// Deprecated: Use AuthzConfigRef to reference a shared MCPAuthzConfig resource instead.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

F16 — LOW: Marking the inline field Deprecated: is correct in intent, but there is no migration guide for users today: no conversion table (inline.policiesMCPAuthzConfig.spec.config.policies), no operator-side auto-conversion, and no docs link from this comment. Combined with F1 (no workload controller consumes AuthzConfigRef yet), a user attempting migration in the wrong order ends up with no enforcement.

Either defer the Deprecated: annotation until #4778 lands, or add a short pointer to a migration doc (e.g. docs/operator/migration-authzconfig.md).

// Backend-specific validation (delegating to the authorizer factory registry) lives
// in the controller rather than here, because the API types package must not import
// the authorizer backends.
func (r *MCPAuthzConfig) Validate() error {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

F13 — LOW: Validate() only checks Type != "" and len(Config.Raw) != 0. Whether the type is actually registered is checked only inside the controller via authorizers.GetFactory(cfg.Type). Other callers of Validate() (a future webhook, CLI tooling, an admission-style helper) would happily accept an unknown type as "valid".

Consider consulting the authorizers registry from Validate() directly so all callers fail-closed:

Suggested change
func (r *MCPAuthzConfig) Validate() error {
func (r *MCPAuthzConfig) Validate() error {
if r.Spec.Type == "" {
return fmt.Errorf("spec.type is required")
}
if !authorizers.IsRegistered(r.Spec.Type) {
return fmt.Errorf("spec.type %q is not a registered authorizer backend (registered: %v)", r.Spec.Type, authorizers.RegisteredTypes())
}
if len(r.Spec.Config.Raw) == 0 {

(Adjust the helper name to whatever the registry actually exposes.)

// This controller manages the lifecycle of MCPAuthzConfig resources: validation
// via the authorizer factory registry, config hash computation, finalizer management,
// reference tracking, and deletion protection when workloads reference this config.
type MCPAuthzConfigReconciler struct {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

F15 — LOW (optional): The reconciler doesn't take an EventRecorder. For a policy resource this is meaningful: cluster operators investigating a missing/invalid authz policy will not see kubectl describe mcpauthzconfig events explaining "Cedar policy rejected" or "deletion blocked by N workloads". Sibling controllers in this repo wire a recorder; adopting the same pattern here would improve the audit trail and discoverability.

Consider adding a Recorder record.EventRecorder field and wiring it in SetupWithManager via mgr.GetEventRecorderFor("mcpauthzconfig-controller"). Emit Warning events on ConfigInvalid / DeletionBlocked, Normal events on recovery. Avoid putting raw policy contents in event messages (the HTTP PDP URL in particular is topology-sensitive).

// BuildFullAuthzConfigJSON reconstructs the full authorizer config JSON from a
// MCPAuthzConfig spec. The result is the same format accepted by authorizers.Config
// and used in ConfigMaps: {"version": "1.0", "type": "<type>", "<configKey>": {<config>}}.
func BuildFullAuthzConfigJSON(spec mcpv1beta1.MCPAuthzConfigSpec) ([]byte, error) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

F10 — LOW: BuildFullAuthzConfigJSON is exported but its only production caller lives in the same controllers package. Per .claude/rules/go-style.md ("Package API surface"), exporting locks the signature for callers that don't yet exist. The doc comment hints this will be used by workload controllers in the future — if so, it'll likely want to live in cmd/thv-operator/pkg/controllerutil/ next to BuildInlineCedarAuthzConfig to avoid downstream packages importing the controllers package.

Suggestion: unexport (buildFullAuthzConfigJSON) for now; promote and relocate when #4778 needs an actual cross-package caller, at which point the consumer's needs may also reshape the signature.

return nil, fmt.Errorf("failed to marshal type: %w", err)
}

fullConfig := map[string]json.RawMessage{
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

F6 — MEDIUM: This map literal lets the factory-supplied configKey collide with the reserved envelope keys. Go map literals with duplicate keys silently take the last assignment, so a factory whose ConfigKey() returns "version" or "type" (or "") silently overwrites the envelope metadata with attacker-controlled JSON. Today only Cedar ("cedar") and HTTP ("pdp") register, but the AuthorizerFactory interface is exported and the registration path is open to test stubs and future / out-of-tree backends, so this is a latent defense-in-depth gap.

Add a guard in Register() (pkg/authz/authorizers/registry.go) that rejects ConfigKey() values in a reserved set ("version", "type", "") — panicking at process startup makes a misconfigured factory visible immediately. A short unit test that registers a factory with ConfigKey() == "version" and expects a panic locks the contract in.

"referencingWorkloads", referencingWorkloads)

meta.SetStatusCondition(&authzConfig.Status.Conditions, metav1.Condition{
Type: mcpv1beta1.ConditionTypeDeletionBlocked,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

F8 — LOW: handleDeletion sets ConditionTypeDeletionBlocked = True here, but nothing on the non-deletion reconcile path clears it back to False (or removes the condition). Outside the deletion flow, the main Reconcile only touches ConditionTypeAuthzConfigValid. If a user attempts deletion, then cancels (e.g. removes the deletionTimestamp via finalizer removal followed by re-apply), the stale DeletionBlocked: True condition can persist and mislead operators.

On the success path of Reconcile (after validation passes), call meta.SetStatusCondition to set DeletionBlocked = False (reason e.g. NoReferences) or meta.RemoveStatusCondition if DeletionTimestamp.IsZero(). Same pattern is used by sibling config controllers.

assert.Equal(t, initialRV, afterSteady.ResourceVersion, "ResourceVersion should not change (no writes)")
}

func TestMCPAuthzConfigReconciler_ValidationFailureSetsCondition(t *testing.T) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

F11 — MEDIUM: .claude/rules/operator.md §"Status Writes" warns that direct r.Status().Update (used by this controller — see F2) replaces Status.Conditions wholesale on CRDs via JSON merge-patch. There is no regression test that interleaves two condition writers (e.g. a validation-failure reconcile concurrent with a hash-change reconcile, or an externally-injected condition + a reconcile) and asserts that all set conditions survive.

Without such a test, a future change that re-introduces the race (or that converts the controller to MutateAndPatchStatus and gets it wrong) will go undetected. Suggested approach: pre-populate a fake MCPAuthzConfig with an externally-injected condition type the controller doesn't manage, reconcile once, then assert that condition is still present alongside the controller-managed Valid condition. This is the cheapest possible canary against the wholesale-overwrite class of bug.

assert.True(t, foundCondition, "Should have a Valid condition")
}

func TestMCPAuthzConfigReconciler_handleDeletion(t *testing.T) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

F12 — LOW: TestMCPAuthzConfigReconciler_handleDeletion covers two static starting states (no workloads → finalizer removed; one workload → blocked/requeue) but never exercises the transition that matters in practice: a config blocked by N references, then the last referencing workload is deleted, then handleDeletion is invoked again and the finalizer is removed.

Add a sub-test that (1) creates a deleting config plus one referencing workload, invokes handleDeletion (assert requeue + finalizer still present), (2) deletes the referencing workload, invokes handleDeletion again (assert finalizer removed). This is the only path through the deletion-protection logic that actually flips state, and it is currently uncovered.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/XL Extra large PR: 1000+ lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants