fix: remove hardcoded cluster.local and add configurable clusterDomai…#1774
fix: remove hardcoded cluster.local and add configurable clusterDomai…#1774RustyCoderX wants to merge 3 commits intokagent-dev:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
Removes hardcoded cluster.local from various internal Kubernetes service URLs and introduces a Helm-level clusterDomain option to optionally render fully-qualified service DNS names, defaulting safely to .svc.
Changes:
- Replaced hardcoded
.svc.cluster.localURLs with.svcacross UI, Go, Python samples, and golden/test fixtures. - Added Helm
clusterDomainvalue pluskagent.clusterDomainSuffixhelper to render.svc.<clusterDomain>when configured (otherwise.svc). - Updated internal-URL detection logic used to decide when to apply the proxy for RemoteMCPServer URLs.
Reviewed changes
Copilot reviewed 27 out of 27 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| ui/src/lib/utils.ts | Production fallback backend URL now uses .svc instead of .svc.cluster.local. |
| ui/src/lib/tests/utils.test.ts | Updated test expectation for the new UI fallback URL. |
| python/samples/langgraph/currency/agent.yaml | Updated Jaeger/OTLP endpoints to use .svc form. |
| python/packages/kagent-adk/tests/unittests/test_proxy_integration.py | Updated proxy URL expectations to .svc. |
| helm/kagent/values.yaml | Introduces clusterDomain value + updates examples/default docs away from hardcoded cluster.local. |
| helm/kagent/tests/controller-deployment_test.yaml | Updated expected computed URLs to .svc form. |
| helm/kagent/templates/ui-deployment.yaml | Uses kagent.clusterDomainSuffix when setting NEXT_PUBLIC_BACKEND_URL. |
| helm/kagent/templates/controller-deployment.yaml | Injects CLUSTER_DOMAIN env var and uses kagent.clusterDomainSuffix in Postgres URL. |
| helm/kagent/templates/_helpers.tpl | Adds kagent.clusterDomainSuffix helper and uses it for default A2A base URL. |
| go/core/pkg/app/app_test.go | Updates proxy URL env var test data to .svc. |
| go/core/pkg/app/app.go | Updates default Postgres URL/proxy help text; adds cluster-domain flag/config field. |
| go/core/internal/httpserver/auth/proxy_authn_test.go | Updates test JWT issuer string to .svc form. |
| go/core/internal/controller/translator/agent/testdata/outputs/agent_with_proxy_service.json | Golden output URLs updated to .svc. |
| go/core/internal/controller/translator/agent/testdata/outputs/agent_with_proxy_mcpserver_custom_timeout.json | Golden output URLs updated to .svc. |
| go/core/internal/controller/translator/agent/testdata/outputs/agent_with_proxy_mcpserver.json | Golden output URLs updated to .svc. |
| go/core/internal/controller/translator/agent/testdata/outputs/agent_with_proxy.json | Golden output URLs updated to .svc. |
| go/core/internal/controller/translator/agent/testdata/inputs/agent_with_proxy_service.yaml | Test input proxy URL updated to .svc. |
| go/core/internal/controller/translator/agent/testdata/inputs/agent_with_proxy_mcpserver_custom_timeout.yaml | Test input proxy URL updated to .svc. |
| go/core/internal/controller/translator/agent/testdata/inputs/agent_with_proxy_mcpserver.yaml | Test input proxy URL updated to .svc. |
| go/core/internal/controller/translator/agent/testdata/inputs/agent_with_proxy_external_remotemcp.yaml | Test input proxy URL updated to .svc. |
| go/core/internal/controller/translator/agent/testdata/inputs/agent_with_proxy.yaml | Test input proxy URL updated to .svc. |
| go/core/internal/controller/translator/agent/proxy_test.go | Updates proxy URL assertions to .svc. |
| go/core/internal/controller/translator/agent/adk_translator_golden_test.go | Updates commentary around service URL patterns. |
| go/core/internal/controller/translator/agent/adk_api_translator.go | Updates internal k8s URL detection logic from .svc.cluster.local to .svc-based heuristics. |
| contrib/tools/mcp-grafana/tests/toolserver_test.yaml | Updates expected toolserver URL to .svc. |
| contrib/tools/mcp-grafana/templates/toolserver.yaml | Updates rendered toolserver URL to .svc. |
| contrib/addons/grafana.yaml | Changes Grafana dashboard default service variable from .svc.cluster.local to .svc. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| @@ -165,7 +166,7 @@ func (cfg *Config) SetFlags(commandLine *flag.FlagSet) { | |||
| commandLine.StringVar(&cfg.DefaultModelConfig.Namespace, "default-model-config-namespace", kagentNamespace, "The namespace of the default model config.") | |||
| commandLine.StringVar(&cfg.HttpServerAddr, "http-server-address", ":8083", "The address the HTTP server binds to.") | |||
| commandLine.StringVar(&cfg.A2ABaseUrl, "a2a-base-url", "http://127.0.0.1:8083", "The base URL of the A2A Server endpoint, as advertised to clients.") | |||
| commandLine.StringVar(&cfg.Database.Url, "postgres-database-url", "postgres://postgres:kagent@kagent-postgresql.kagent.svc.cluster.local:5432/postgres", "The URL of the PostgreSQL database.") | |||
| commandLine.StringVar(&cfg.Database.Url, "postgres-database-url", "postgres://postgres:kagent@kagent-postgresql.kagent.svc:5432/postgres", "The URL of the PostgreSQL database.") | |||
| commandLine.StringVar(&cfg.Database.UrlFile, "postgres-database-url-file", "", "Path to a file containing the PostgreSQL database URL. Takes precedence over --postgres-database-url.") | |||
| commandLine.BoolVar(&cfg.Database.VectorEnabled, "database-vector-enabled", true, "Enable pgvector extension and memory table. Requires pgvector to be installed on the PostgreSQL server.") | |||
|
|
|||
| @@ -175,7 +176,7 @@ func (cfg *Config) SetFlags(commandLine *flag.FlagSet) { | |||
| commandLine.Var(&cfg.Streaming.InitialBufSize, "streaming-initial-buf-size", "The initial size of the streaming buffer.") | |||
| commandLine.DurationVar(&cfg.Streaming.Timeout, "streaming-timeout", 60*time.Second, "The timeout for the streaming connection.") | |||
|
|
|||
| commandLine.StringVar(&cfg.Proxy.URL, "proxy-url", "", "Proxy URL for internally-built k8s URLs (e.g., http://proxy.kagent.svc.cluster.local:8080)") | |||
| commandLine.StringVar(&cfg.Proxy.URL, "proxy-url", "", "Proxy URL for internally-built k8s URLs (e.g., http://proxy.kagent.svc:8080)") | |||
|
|
|||
| commandLine.StringVar(&cfg.Auth.Mode, "auth-mode", "unsecure", "Authentication mode: unsecure or trusted-proxy") | |||
| commandLine.StringVar(&cfg.Auth.UserIDClaim, "auth-user-id-claim", "sub", "JWT claim name for user identity") | |||
| @@ -195,6 +196,7 @@ func (cfg *Config) SetFlags(commandLine *flag.FlagSet) { | |||
| commandLine.Var(&MapValue{Target: &agent_translator.DefaultAgentPodLabels}, "default-agent-pod-labels", "Comma-separated key=value pairs of labels to apply to all agent pod templates (e.g. 'team=platform,env=prod'). Per-agent labels take precedence.") | |||
|
|
|||
| commandLine.StringVar(&agent_translator.DefaultAgentBindHost, "default-agent-bind-host", agent_translator.DefaultAgentBindHost, "Default host address for agent pods to bind to. Use '0.0.0.0' for IPv4 only or '::' for dual-stack (IPv4+IPv6).") | |||
| commandLine.StringVar(&cfg.ClusterDomain, "cluster-domain", "", "Optional cluster domain. If provided, internal service URLs will append it (e.g., .svc.cluster.local). If omitted, URLs will use .svc.") | |||
| } | |||
There was a problem hiding this comment.
ClusterDomain is added to the Config and exposed as a --cluster-domain/CLUSTER_DOMAIN setting, but it is never referenced anywhere else in the codebase (search shows only the definition and flag registration). Either wire this into the places that construct internal service URLs (so it actually affects behavior) or remove the config/flag to avoid a dead, misleading setting.
There was a problem hiding this comment.
I agree, as I mentioned I don't think we really need this right now
| {{- if .Values.clusterDomain }} | ||
| - name: CLUSTER_DOMAIN | ||
| value: {{ .Values.clusterDomain | quote }} | ||
| {{- end }} |
There was a problem hiding this comment.
This chart sets CLUSTER_DOMAIN when .Values.clusterDomain is provided, but the controller binary currently does not use cluster-domain/CLUSTER_DOMAIN for anything (it’s parsed into config but never read). Either remove this env var until it’s consumed, or implement the runtime behavior so the setting has an effect.
| {{- if .Values.clusterDomain }} | |
| - name: CLUSTER_DOMAIN | |
| value: {{ .Values.clusterDomain | quote }} | |
| {{- end }} |
There was a problem hiding this comment.
This needs addressing. Feels like we don't need this anyway.
| @@ -953,8 +953,8 @@ func (a *adkApiTranslator) isInternalK8sURL(ctx context.Context, urlStr, namespa | |||
| return false | |||
| } | |||
|
|
|||
| // Check if it ends with .svc.cluster.local (definitely internal) | |||
| if strings.HasSuffix(hostname, ".svc.cluster.local") { | |||
| // Check if it's an internal FQDN by looking for .svc or .svc. | |||
| if strings.HasSuffix(hostname, ".svc") || strings.Contains(hostname, ".svc.") { | |||
| return true | |||
| } | |||
There was a problem hiding this comment.
isInternalK8sURL now returns true for any hostname that ends with .svc or merely contains .svc. (e.g., foo.svc.example.com), which can misclassify external URLs and apply the proxy unexpectedly. Consider parsing hostname labels and requiring the canonical k8s pattern service.namespace.svc[.<clusterDomain>] (i.e., parts[2] == "svc" with len(parts) >= 3) instead of substring checks; also update the function comment since it no longer always relies on namespace existence.
| // Split hostname by dots: service.namespace or service.namespace.svc | ||
| hostname := parsed.Hostname() | ||
| parts := strings.Split(hostname, ".") | ||
|
|
||
| // Valid patterns: | ||
| // - service.namespace (2 parts) | ||
| // - service.namespace.svc (3 parts) | ||
| // - service.namespace.svc.cluster.local (5 parts) | ||
| // - service.namespace.svc (5 parts) | ||
| if len(parts) >= 2 { |
There was a problem hiding this comment.
The comment for valid hostname patterns is incorrect: it lists service.namespace.svc twice and claims the latter is “5 parts”. If this is meant to cover custom cluster domains, it should describe service.namespace.svc.<clusterDomain> (which can be 4+ parts depending on the domain).
| destination_workload_namespace), 0.001)\",\"format\":\"time_series\",\"intervalFactor\":1,\"legendFormat\":\"{{ | ||
| destination_workload }}.{{destination_workload_namespace }}\",\"refId\":\"B\",\"step\":2}],\"title\":\"Bytes | ||
| Sent to Incoming TCP Connection\",\"type\":\"timeseries\"}],\"refresh\":\"1m\",\"schemaVersion\":38,\"style\":\"dark\",\"tags\":[],\"templating\":{\"list\":[{\"hide\":0,\"includeAll\":false,\"multi\":false,\"name\":\"datasource\",\"options\":[],\"query\":\"prometheus\",\"queryValue\":\"\",\"refresh\":1,\"regex\":\"\",\"skipUrlSync\":false,\"type\":\"datasource\"},{\"current\":{\"selected\":false,\"text\":\"details.default.svc.cluster.local\",\"value\":\"details.default.svc.cluster.local\"},\"datasource\":{\"type\":\"prometheus\",\"uid\":\"${datasource}\"},\"definition\":\"\",\"hide\":0,\"includeAll\":false,\"label\":\"Service\",\"multi\":false,\"name\":\"service\",\"options\":[],\"query\":\"query_result(sum(istio_requests_total{}) | ||
| Sent to Incoming TCP Connection\",\"type\":\"timeseries\"}],\"refresh\":\"1m\",\"schemaVersion\":38,\"style\":\"dark\",\"tags\":[],\"templating\":{\"list\":[{\"hide\":0,\"includeAll\":false,\"multi\":false,\"name\":\"datasource\",\"options\":[],\"query\":\"prometheus\",\"queryValue\":\"\",\"refresh\":1,\"regex\":\"\",\"skipUrlSync\":false,\"type\":\"datasource\"},{\"current\":{\"selected\":false,\"text\":\"details.default.svc\",\"value\":\"details.default.svc\"},\"datasource\":{\"type\":\"prometheus\",\"uid\":\"${datasource}\"},\"definition\":\"\",\"hide\":0,\"includeAll\":false,\"label\":\"Service\",\"multi\":false,\"name\":\"service\",\"options\":[],\"query\":\"query_result(sum(istio_requests_total{}) |
There was a problem hiding this comment.
In this dashboard JSON, the service template variable’s current default was changed to details.default.svc, but the variable options are populated from the destination_service label values returned by Prometheus. Those label values are typically fully-qualified (e.g., details.default.svc.<clusterDomain>), so this default may not match any option and the dashboard will load with no data until the user changes it. Consider leaving the default empty/first-option, or adjust the variable regex/query so the option values match the shortened form.
| Sent to Incoming TCP Connection\",\"type\":\"timeseries\"}],\"refresh\":\"1m\",\"schemaVersion\":38,\"style\":\"dark\",\"tags\":[],\"templating\":{\"list\":[{\"hide\":0,\"includeAll\":false,\"multi\":false,\"name\":\"datasource\",\"options\":[],\"query\":\"prometheus\",\"queryValue\":\"\",\"refresh\":1,\"regex\":\"\",\"skipUrlSync\":false,\"type\":\"datasource\"},{\"current\":{\"selected\":false,\"text\":\"details.default.svc\",\"value\":\"details.default.svc\"},\"datasource\":{\"type\":\"prometheus\",\"uid\":\"${datasource}\"},\"definition\":\"\",\"hide\":0,\"includeAll\":false,\"label\":\"Service\",\"multi\":false,\"name\":\"service\",\"options\":[],\"query\":\"query_result(sum(istio_requests_total{}) | |
| Sent to Incoming TCP Connection\",\"type\":\"timeseries\"}],\"refresh\":\"1m\",\"schemaVersion\":38,\"style\":\"dark\",\"tags\":[],\"templating\":{\"list\":[{\"hide\":0,\"includeAll\":false,\"multi\":false,\"name\":\"datasource\",\"options\":[],\"query\":\"prometheus\",\"queryValue\":\"\",\"refresh\":1,\"regex\":\"\",\"skipUrlSync\":false,\"type\":\"datasource\"},{\"current\":{\"selected\":false,\"text\":\"\",\"value\":\"\"},\"datasource\":{\"type\":\"prometheus\",\"uid\":\"${datasource}\"},\"definition\":\"\",\"hide\":0,\"includeAll\":false,\"label\":\"Service\",\"multi\":false,\"name\":\"service\",\"options\":[],\"query\":\"query_result(sum(istio_requests_total{}) |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Eitan Yarmush <eitan.yarmush@solo.io>
Removed hardcoded cluster.local and added support for custom cluster domains with a safe .svc fallback