📋 Prerequisites
🎯 Affected Service(s)
Controller Service
🚦 Impact/Severity
Blocker
🐛 Bug Description
When a RemoteMCPServer resource fails to register — in this case, due to a suspected connection hang on an IPv6 address resolved from a dual-stack hostname in an IPv4-only cluster — the kagent controller enters a stuck state that prevents registration of any RemoteMCPServer resources applied after the failing one. The stuck state persists until the failing resource is deleted and the controller pod is restarted. Deletion of the failing resource alone is insufficient to unblock subsequent registrations.
Two distinct issues are present:
- A connection hang on a failed `RemoteMCPServer` registration blocks all subsequent `RemoteMCPServer` registrations — the controller does not time out, isolate the failure, or continue processing other resources.
- No logging or status information is emitted for `RemoteMCPServer` registration attempts, successes, or failures, even at `debug` log level, making it impossible to confirm the root cause or observe controller behavior during reconciliation.
🔄 Steps To Reproduce
- Deploy kagent `0.9.0` on an IPv4-only Kubernetes cluster.
- Apply a `RemoteMCPServer` whose hostname resolves to both IPv4 and IPv6 addresses. In an IPv4-only cluster, the controller is suspected to resolve the IPv6 (AAAA) address and hang on the connection attempt with no timeout.

```yaml
apiVersion: kagent.dev/v1alpha2
kind: RemoteMCPServer
metadata:
  name: looking-glass-mcp
spec:
  description: "Internet2 Looking Glass MCP server — traceroute, BGP, ping across I2 backbone"
  url: "https://example.internet2.edu/mcp"
  protocol: STREAMABLE_HTTP
```

- Apply one or more additional `RemoteMCPServer` resources (via Kustomize or `kubectl apply`) after the failing resource is applied.
- Observe that none of the subsequently applied `RemoteMCPServer` resources complete registration.
- Inspect the `Status` of the blocked resources — no `ACCEPTED` detail and no conditions are present.
- Delete the failing `looking-glass-mcp` resource. Observe that the other resources remain blocked.
- Restart the kagent controller pod. Observe that the remaining `RemoteMCPServer` resources now register successfully (assuming the failing resource is absent).
🤔 Expected Behavior
- Each `RemoteMCPServer` reconciliation should be independent. A failure or hang on one resource must not affect reconciliation of any other resource.
- Connection attempts during registration should have a configurable or sensible default timeout, after which the controller marks the resource as failed and moves on (a minimal timeout sketch follows this list).
- Failed registration should be reflected in the `RemoteMCPServer` resource's `Status` with a descriptive condition (e.g., `Ready: False` with a reason and message).
- The controller should log registration attempts, outcomes, and errors at an appropriate level (at minimum at `debug`, ideally at `info` for successes and `warn`/`error` for failures) to provide operator visibility.
- Deleting a failed `RemoteMCPServer` resource should be sufficient to unblock the controller — a pod restart should not be required to recover from a failed registration.
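
As a self-contained illustration of the timeout expectation, here is a minimal Go sketch. This is not kagent's actual code: `registerServer` is a hypothetical stand-in that simulates the observed hang, and the timeout value is illustrative.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// registerServer is a hypothetical stand-in for kagent's real registration
// call; here it simulates a connection attempt that never completes.
func registerServer(ctx context.Context, url string) error {
	select {
	case <-time.After(time.Hour): // the hung dial never wins
		return nil
	case <-ctx.Done(): // the timeout fires first
		return ctx.Err()
	}
}

func main() {
	// Bound each registration attempt; a hang becomes a deadline error for
	// this one resource instead of stalling the controller's work queue.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	err := registerServer(ctx, "https://example.internet2.edu/mcp")
	if errors.Is(err, context.DeadlineExceeded) {
		fmt.Println("registration timed out; mark status Ready=False and continue")
	}
}
```

With a bound like this in place, a hung endpoint costs one reconcile attempt with backoff rather than the lifetime of the controller process.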
📱 Actual Behavior
- A single hung `RemoteMCPServer` registration blocks all subsequent registrations for the lifetime of the controller process.
- Deleting the failing resource does not unblock the controller; a pod restart is required in addition to deletion.
- Blocked `RemoteMCPServer` resources show no `Status` conditions and no detail in their resource status, even after an extended wait.
- No log output related to `RemoteMCPServer` registration is emitted at any log level, including `debug`, preventing diagnosis without external tooling.
💻 Environment
| Field | Value |
| --- | --- |
| kagent Helm chart version | `0.9.0` |
| Kubernetes cluster | K3s `v1.32.12+k3s1` IPv4-only dev cluster |
| `RemoteMCPServer` protocol | `STREAMABLE_HTTP` |
| Failing server URL | Dual-stack hostname (resolves both A and AAAA records) |
| Resource management | Kustomize (resources applied in lexicographic order) |
🔧 CLI Bug Report
No response
🔍 Additional Context
**Did this work before?**
This is a new deployment under active initial setup; there is no prior working baseline to compare against.
**When did it start failing?**
Observed from initial deployment on kagent Helm chart `0.9.0` (confirmed broken in `0.8.0` as well).
**Suspected root cause:**
The controller is likely resolving the AAAA (IPv6) record for the dual-stack hostname and attempting to connect to it. On an IPv4-only cluster, this connection hangs indefinitely rather than failing fast and falling back to the A (IPv4) record. The core bugs — no timeout, no isolation between reconciliations, no status reporting, no logging — are independent of the IPv6 trigger and would surface with any sufficiently slow or unreachable remote endpoint.
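
If the root cause is confirmed, Go's standard dialer already supports the fail-fast-and-fall-back behavior described above via RFC 6555 ("Happy Eyeballs"). A minimal sketch, assuming registration goes through a plain `net/http` client — the function name and timeout values are illustrative, not kagent's actual configuration:

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

// newRegistrationClient builds an HTTP client whose dialer both times out
// and performs RFC 6555 fast fallback, so a dual-stack hostname on an
// IPv4-only network fails over to the A record quickly instead of hanging
// on an unreachable AAAA address.
func newRegistrationClient() *http.Client {
	dialer := &net.Dialer{
		Timeout:       10 * time.Second,       // hard cap on any single dial
		FallbackDelay: 300 * time.Millisecond, // start the other address family early
	}
	return &http.Client{
		Timeout: 30 * time.Second, // overall request deadline
		Transport: &http.Transport{
			DialContext: dialer.DialContext,
		},
	}
}

func main() {
	client := newRegistrationClient()
	resp, err := client.Get("https://example.internet2.edu/mcp") // placeholder URL from this report
	if err != nil {
		fmt.Println("registration connect failed fast:", err) // loggable instead of hanging
		return
	}
	resp.Body.Close()
}
```

By contrast, a dialer with no deadline and fallback disabled would sit on the dead AAAA path indefinitely, which matches the observed behavior.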
**Workarounds tried:**
| Workaround | Result |
| --- | --- |
| Delete failing `RemoteMCPServer`, re-apply | No change; other registrations remain blocked |
| Apply other `RemoteMCPServer` resources first (before the failing one) | Successful — confirms the block applies to resources registered after the failing one in controller processing order |
| Delete failing resource + restart controller pod | Unblocks remaining registrations ✓ |
**Current mitigation:**
Exclude the dual-stack `RemoteMCPServer` from the Kustomize overlay until IPv6 DNS handling is fixed at the cluster level, or until a fix is available in kagent.
**Suggested fixes (in priority order):**
- Add a connection timeout to `RemoteMCPServer` registration attempts with a reasonable default (e.g., 10–30 seconds).
- Isolate each `RemoteMCPServer` reconciliation so that a failure or timeout on one resource does not block others (e.g., per-resource goroutine or error boundary).
- Emit structured log entries for registration attempts, successes, and failures at `debug`/`info`/`error` levels respectively.
- Reflect registration failure in the `RemoteMCPServer` `.status` with a `Ready: False` condition, reason, and human-readable message (see the sketch after this list).
- Ensure that deleting a failing `RemoteMCPServer` is sufficient to unblock the controller without requiring a pod restart.
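
For the status-reporting fix, the upstream `apimachinery` helpers already cover most of the work. A minimal sketch using `meta.SetStatusCondition`; the `Reason` string and message are hypothetical, and in the real controller the slice would be the resource's `.status.conditions`:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// In the real controller these conditions would live on the
	// RemoteMCPServer's status; a plain slice keeps the sketch self-contained.
	var conditions []metav1.Condition

	// Record the failure so `kubectl describe` shows why registration failed.
	meta.SetStatusCondition(&conditions, metav1.Condition{
		Type:    "Ready",
		Status:  metav1.ConditionFalse,
		Reason:  "RegistrationTimeout", // hypothetical reason string
		Message: "connection to https://example.internet2.edu/mcp timed out after 15s",
	})

	fmt.Printf("%+v\n", conditions[0])
}
```

`SetStatusCondition` also manages `LastTransitionTime`, so the same call path can be reused for the success case with `Status: metav1.ConditionTrue`.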
📋 Logs
📷 Screenshots
looking-glass-mcp blocking registration of tsds-mcp

after deleting looking-glass-mcp and restarting kagent controller pod

🙋 Are you willing to contribute?