[BUG] Failure of one RemoteMCPServer to register prevents subsequent registrations #1785

@karlnewell

Description

📋 Prerequisites

  • I have searched the existing issues to avoid creating a duplicate
  • By submitting this issue, you agree to follow our Code of Conduct
  • I am using the latest version of the software
  • I have tried to clear cache/cookies or used incognito mode (if UI-related)
  • I can consistently reproduce this issue

🎯 Affected Service(s)

Controller Service

🚦 Impact/Severity

Blocker

🐛 Bug Description

When a RemoteMCPServer resource fails to register — in this case, due to a suspected connection hang on an IPv6 address resolved from a dual-stack hostname in an IPv4-only cluster — the kagent controller enters a stuck state that prevents registration of any RemoteMCPServer resources applied after the failing one. The stuck state persists until the failing resource is deleted and the controller pod is restarted. Deletion of the failing resource alone is insufficient to unblock subsequent registrations.

Two distinct issues are present:

  1. A connection hang on failed RemoteMCPServer registration blocks all subsequent RemoteMCPServer registrations — the controller does not time out, isolate the failure, or continue processing other resources.
  2. No logging or status information is emitted for RemoteMCPServer registration attempts, successes, or failures, even at debug log level, making it impossible to confirm the root cause or observe controller behavior during reconciliation.

🔄 Steps To Reproduce

  1. Deploy kagent 0.9.0 on an IPv4-only Kubernetes cluster.

  2. Apply a RemoteMCPServer whose hostname resolves to both IPv4 and IPv6 addresses. In an IPv4-only cluster, the controller is suspected to resolve the IPv6 (AAAA) address and hang on the connection attempt with no timeout.

    apiVersion: kagent.dev/v1alpha2
    kind: RemoteMCPServer
    metadata:
      name: looking-glass-mcp
    spec:
      description: "Internet2 Looking Glass MCP server — traceroute, BGP, ping across I2 backbone"
      url: "https://example.internet2.edu/mcp"
      protocol: STREAMABLE_HTTP
  3. Apply one or more additional RemoteMCPServer resources (via Kustomize or kubectl apply) after the failing resource is applied.

  4. Observe that none of the subsequently applied RemoteMCPServer resources complete registration.

  5. Inspect the Status of the blocked resources — no ACCEPTED detail and no conditions are present.

  6. Delete the failing looking-glass-mcp resource. Observe that the other resources remain blocked.

  7. Restart the kagent controller pod. Observe that the remaining RemoteMCPServer resources now register successfully (assuming the failing resource is absent).

🤔 Expected Behavior

  • Each RemoteMCPServer reconciliation should be independent. A failure or hang on one resource must not affect reconciliation of any other resource.
  • Connection attempts during registration should have a configurable or sensible default timeout, after which the controller marks the resource as failed and moves on.
  • Failed registration should be reflected in the RemoteMCPServer resource's Status with a descriptive condition (e.g., Ready: False, reason, and message).
  • The controller should log registration attempts, outcomes, and errors at an appropriate level (at minimum at debug, ideally at info for successes and warn/error for failures) to provide operator visibility.
  • Deleting a failed RemoteMCPServer resource should be sufficient to unblock the controller — a pod restart should not be required to recover from a failed registration.

📱 Actual Behavior

  • A single hung RemoteMCPServer registration blocks all subsequent registrations for the lifetime of the controller process.
  • Deleting the failing resource does not unblock the controller; a pod restart is required in addition to deletion.
  • Blocked RemoteMCPServer resources show no Status conditions and no status detail, even after an extended wait.
  • No log output related to RemoteMCPServer registration is emitted at any log level, including debug, preventing diagnosis without external tooling.

💻 Environment

Field                      Value
kagent Helm chart version  0.9.0
Kubernetes cluster         K3s v1.32.12+k3s1, IPv4-only dev cluster
RemoteMCPServer protocol   STREAMABLE_HTTP
Failing server URL         dual-stack hostname (resolves to both A and AAAA records)
Resource management        Kustomize (resources applied in lexicographic order)

🔧 CLI Bug Report

No response

🔍 Additional Context

Did this work before?
This is a new deployment under active initial setup; there is no prior working baseline to compare against.

When did it start failing?
Observed from initial deployment on kagent Helm chart 0.9.0 (confirmed broken in 0.8.0 as well).

Suspected root cause:
The controller is likely resolving the AAAA (IPv6) record for the dual-stack hostname and attempting to connect to it. On an IPv4-only cluster, this connection hangs indefinitely rather than failing fast and falling back to the A (IPv4) record. The core bugs — no timeout, no isolation between reconciliations, no status reporting, no logging — are independent of the IPv6 trigger and would surface with any sufficiently slow or unreachable remote endpoint.

Workarounds tried:

  • Delete the failing RemoteMCPServer and re-apply: no change; other registrations remain blocked.
  • Apply other RemoteMCPServer resources before the failing one: successful, confirming that only resources processed after the failing one are blocked.
  • Delete the failing resource and restart the controller pod: unblocks the remaining registrations. ✓

Current mitigation:
Exclude the dual-stack RemoteMCPServer from the Kustomize overlay until IPv6 DNS handling is addressed at the cluster level, or until a fix is available in kagent.

Suggested fixes (in priority order):

  1. Add a connection timeout to RemoteMCPServer registration attempts with a reasonable default (e.g., 10–30 seconds).
  2. Isolate each RemoteMCPServer reconciliation so that a failure or timeout on one resource does not block others (e.g., per-resource goroutine or error boundary).
  3. Emit structured log entries for registration attempts, success, and failure at debug/info/error levels respectively.
  4. Reflect registration failure in the RemoteMCPServer .status with a Ready: False condition, reason, and human-readable message.
  5. Ensure that deleting a failing RemoteMCPServer is sufficient to unblock the controller without requiring a pod restart.

📋 Logs

📷 Screenshots

Screenshot 1: looking-glass-mcp blocking registration of tsds-mcp
Screenshot 2: after deleting looking-glass-mcp and restarting the kagent controller pod

🙋 Are you willing to contribute?

  • I am willing to submit a PR to fix this issue
