[BUG] Failure of one RemoteMCPServer to register prevents subsequent registrations #1785

@karlnewell

Description

📋 Prerequisites

  • I have searched the existing issues to avoid creating a duplicate
  • By submitting this issue, you agree to follow our Code of Conduct
  • I am using the latest version of the software
  • I have tried to clear cache/cookies or used incognito mode (if UI-related)
  • I can consistently reproduce this issue

🎯 Affected Service(s)

Controller Service

🚦 Impact/Severity

Blocker

🐛 Bug Description

When a RemoteMCPServer resource fails to register — in this case, due to a suspected connection hang on an IPv6 address resolved from a dual-stack hostname in an IPv4-only cluster — the kagent controller enters a stuck state that prevents registration of any RemoteMCPServer resources applied after the failing one. The stuck state persists until the failing resource is deleted and the controller pod is restarted. Deletion of the failing resource alone is insufficient to unblock subsequent registrations.

Two distinct issues are present:

  1. A connection hang on failed RemoteMCPServer registration blocks all subsequent RemoteMCPServer registrations — the controller does not time out, isolate the failure, or continue processing other resources.
  2. No logging or status information is emitted for RemoteMCPServer registration attempts, successes, or failures, even at debug log level, making it impossible to confirm the root cause or observe controller behavior during reconciliation.

🔄 Steps To Reproduce

  1. Deploy kagent 0.9.0 on an IPv4-only Kubernetes cluster.

  2. Apply a RemoteMCPServer whose hostname resolves to both IPv4 and IPv6 addresses. In an IPv4-only cluster, the controller is suspected to resolve the IPv6 (AAAA) address and hang on the connection attempt with no timeout.

    apiVersion: kagent.dev/v1alpha2
    kind: RemoteMCPServer
    metadata:
      name: looking-glass-mcp
    spec:
      description: "Internet2 Looking Glass MCP server — traceroute, BGP, ping across I2 backbone"
      url: "https://example.internet2.edu/mcp"
      protocol: STREAMABLE_HTTP
  3. Apply one or more additional RemoteMCPServer resources (via Kustomize or kubectl apply) after the failing resource is applied.

  4. Observe that none of the subsequently applied RemoteMCPServer resources complete registration.

  5. Inspect the Status of the blocked resources — no ACCEPTED detail and no conditions are present.

  6. Delete the failing looking-glass-mcp resource. Observe that the other resources remain blocked.

  7. Restart the kagent controller pod. Observe that the remaining RemoteMCPServer resources now register successfully (assuming the failing resource is absent).

🤔 Expected Behavior

  • Each RemoteMCPServer reconciliation should be independent. A failure or hang on one resource must not affect reconciliation of any other resource.
  • Connection attempts during registration should have a configurable or sensible default timeout, after which the controller marks the resource as failed and moves on.
  • Failed registration should be reflected in the RemoteMCPServer resource's Status with a descriptive condition (e.g., Ready: False, reason, and message).
  • The controller should log registration attempts, outcomes, and errors at an appropriate level (at minimum at debug, ideally at info for successes and warn/error for failures) to provide operator visibility.
  • Deleting a failed RemoteMCPServer resource should be sufficient to unblock the controller — a pod restart should not be required to recover from a failed registration.

📱 Actual Behavior

  • A single hung RemoteMCPServer registration blocks all subsequent registrations for the lifetime of the controller process.
  • Deleting the failing resource does not unblock the controller; a pod restart is required in addition to deletion.
  • Blocked RemoteMCPServer resources show no Status conditions and no status detail, even after an extended wait.
  • No log output related to RemoteMCPServer registration is emitted at any log level, including debug, preventing diagnosis without external tooling.

💻 Environment

Field                      Value
kagent Helm chart version  0.9.0
Kubernetes cluster         K3s v1.32.12+k3s1, IPv4-only dev cluster
RemoteMCPServer protocol   STREAMABLE_HTTP
Failing server URL         dual-stack hostname (resolves to both A and AAAA records)
Resource management        Kustomize (resources applied in lexicographic order)

🔧 CLI Bug Report

No response

🔍 Additional Context

Did this work before?
This is a new deployment under active initial setup; there is no prior working baseline to compare against.

When did it start failing?
Observed from initial deployment on kagent Helm chart 0.9.0 (confirmed broken in 0.8.0 as well).

Suspected root cause:
The controller is likely resolving the AAAA (IPv6) record for the dual-stack hostname and attempting to connect to it. On an IPv4-only cluster, this connection hangs indefinitely rather than failing fast and falling back to the A (IPv4) record. The core bugs — no timeout, no isolation between reconciliations, no status reporting, no logging — are independent of the IPv6 trigger and would surface with any sufficiently slow or unreachable remote endpoint.

Workarounds tried:

  • Delete the failing RemoteMCPServer and re-apply: no change; other registrations remain blocked.
  • Apply other RemoteMCPServer resources before the failing one: successful, confirming that only resources processed after the failing one are blocked.
  • Delete the failing resource and restart the controller pod: unblocks the remaining registrations. ✓

Current mitigation:
Exclude the dual-stack RemoteMCPServer from the Kustomize overlay until IPv6 DNS handling is addressed at the cluster level, or until a fix is available in kagent.

Suggested fixes (in priority order):

  1. Add a connection timeout to RemoteMCPServer registration attempts with a reasonable default (e.g., 10–30 seconds).
  2. Isolate each RemoteMCPServer reconciliation so that a failure or timeout on one resource does not block others (e.g., per-resource goroutine or error boundary).
  3. Emit structured log entries for registration attempts, success, and failure at debug/info/error levels respectively.
  4. Reflect registration failure in the RemoteMCPServer .status with a Ready: False condition, reason, and human-readable message.
  5. Ensure that deleting a failing RemoteMCPServer is sufficient to unblock the controller without requiring a pod restart.

📋 Logs

📷 Screenshots

Screenshot 1: looking-glass-mcp blocking registration of tsds-mcp
Screenshot 2: after deleting looking-glass-mcp and restarting the kagent controller pod

🙋 Are you willing to contribute?

  • I am willing to submit a PR to fix this issue
