Skip to content

fix: remove hardcoded cluster.local and add configurable clusterDomai…#1774

Open
RustyCoderX wants to merge 3 commits intokagent-dev:mainfrom
RustyCoderX:fix/cluster-domain-support
Open

fix: remove hardcoded cluster.local and add configurable clusterDomai…#1774
RustyCoderX wants to merge 3 commits intokagent-dev:mainfrom
RustyCoderX:fix/cluster-domain-support

Conversation

@RustyCoderX
Copy link
Copy Markdown

Removed hardcoded cluster.local and added support for custom cluster domains with a safe .svc fallback

Copilot AI review requested due to automatic review settings April 28, 2026 21:38
@github-actions github-actions Bot added the bug Something isn't working label Apr 28, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Removes hardcoded cluster.local from various internal Kubernetes service URLs and introduces a Helm-level clusterDomain option to optionally render fully-qualified service DNS names, defaulting safely to .svc.

Changes:

  • Replaced hardcoded .svc.cluster.local URLs with .svc across UI, Go, Python samples, and golden/test fixtures.
  • Added Helm clusterDomain value plus kagent.clusterDomainSuffix helper to render .svc.<clusterDomain> when configured (otherwise .svc).
  • Updated internal-URL detection logic used to decide when to apply the proxy for RemoteMCPServer URLs.

Reviewed changes

Copilot reviewed 27 out of 27 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
ui/src/lib/utils.ts Production fallback backend URL now uses .svc instead of .svc.cluster.local.
ui/src/lib/tests/utils.test.ts Updated test expectation for the new UI fallback URL.
python/samples/langgraph/currency/agent.yaml Updated Jaeger/OTLP endpoints to use .svc form.
python/packages/kagent-adk/tests/unittests/test_proxy_integration.py Updated proxy URL expectations to .svc.
helm/kagent/values.yaml Introduces clusterDomain value + updates examples/default docs away from hardcoded cluster.local.
helm/kagent/tests/controller-deployment_test.yaml Updated expected computed URLs to .svc form.
helm/kagent/templates/ui-deployment.yaml Uses kagent.clusterDomainSuffix when setting NEXT_PUBLIC_BACKEND_URL.
helm/kagent/templates/controller-deployment.yaml Injects CLUSTER_DOMAIN env var and uses kagent.clusterDomainSuffix in Postgres URL.
helm/kagent/templates/_helpers.tpl Adds kagent.clusterDomainSuffix helper and uses it for default A2A base URL.
go/core/pkg/app/app_test.go Updates proxy URL env var test data to .svc.
go/core/pkg/app/app.go Updates default Postgres URL/proxy help text; adds cluster-domain flag/config field.
go/core/internal/httpserver/auth/proxy_authn_test.go Updates test JWT issuer string to .svc form.
go/core/internal/controller/translator/agent/testdata/outputs/agent_with_proxy_service.json Golden output URLs updated to .svc.
go/core/internal/controller/translator/agent/testdata/outputs/agent_with_proxy_mcpserver_custom_timeout.json Golden output URLs updated to .svc.
go/core/internal/controller/translator/agent/testdata/outputs/agent_with_proxy_mcpserver.json Golden output URLs updated to .svc.
go/core/internal/controller/translator/agent/testdata/outputs/agent_with_proxy.json Golden output URLs updated to .svc.
go/core/internal/controller/translator/agent/testdata/inputs/agent_with_proxy_service.yaml Test input proxy URL updated to .svc.
go/core/internal/controller/translator/agent/testdata/inputs/agent_with_proxy_mcpserver_custom_timeout.yaml Test input proxy URL updated to .svc.
go/core/internal/controller/translator/agent/testdata/inputs/agent_with_proxy_mcpserver.yaml Test input proxy URL updated to .svc.
go/core/internal/controller/translator/agent/testdata/inputs/agent_with_proxy_external_remotemcp.yaml Test input proxy URL updated to .svc.
go/core/internal/controller/translator/agent/testdata/inputs/agent_with_proxy.yaml Test input proxy URL updated to .svc.
go/core/internal/controller/translator/agent/proxy_test.go Updates proxy URL assertions to .svc.
go/core/internal/controller/translator/agent/adk_translator_golden_test.go Updates commentary around service URL patterns.
go/core/internal/controller/translator/agent/adk_api_translator.go Updates internal k8s URL detection logic from .svc.cluster.local to .svc-based heuristics.
contrib/tools/mcp-grafana/tests/toolserver_test.yaml Updates expected toolserver URL to .svc.
contrib/tools/mcp-grafana/templates/toolserver.yaml Updates rendered toolserver URL to .svc.
contrib/addons/grafana.yaml Changes Grafana dashboard default service variable from .svc.cluster.local to .svc.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread go/core/pkg/app/app.go
Comment on lines 142 to 200
@@ -165,7 +166,7 @@ func (cfg *Config) SetFlags(commandLine *flag.FlagSet) {
commandLine.StringVar(&cfg.DefaultModelConfig.Namespace, "default-model-config-namespace", kagentNamespace, "The namespace of the default model config.")
commandLine.StringVar(&cfg.HttpServerAddr, "http-server-address", ":8083", "The address the HTTP server binds to.")
commandLine.StringVar(&cfg.A2ABaseUrl, "a2a-base-url", "http://127.0.0.1:8083", "The base URL of the A2A Server endpoint, as advertised to clients.")
commandLine.StringVar(&cfg.Database.Url, "postgres-database-url", "postgres://postgres:kagent@kagent-postgresql.kagent.svc.cluster.local:5432/postgres", "The URL of the PostgreSQL database.")
commandLine.StringVar(&cfg.Database.Url, "postgres-database-url", "postgres://postgres:kagent@kagent-postgresql.kagent.svc:5432/postgres", "The URL of the PostgreSQL database.")
commandLine.StringVar(&cfg.Database.UrlFile, "postgres-database-url-file", "", "Path to a file containing the PostgreSQL database URL. Takes precedence over --postgres-database-url.")
commandLine.BoolVar(&cfg.Database.VectorEnabled, "database-vector-enabled", true, "Enable pgvector extension and memory table. Requires pgvector to be installed on the PostgreSQL server.")

@@ -175,7 +176,7 @@ func (cfg *Config) SetFlags(commandLine *flag.FlagSet) {
commandLine.Var(&cfg.Streaming.InitialBufSize, "streaming-initial-buf-size", "The initial size of the streaming buffer.")
commandLine.DurationVar(&cfg.Streaming.Timeout, "streaming-timeout", 60*time.Second, "The timeout for the streaming connection.")

commandLine.StringVar(&cfg.Proxy.URL, "proxy-url", "", "Proxy URL for internally-built k8s URLs (e.g., http://proxy.kagent.svc.cluster.local:8080)")
commandLine.StringVar(&cfg.Proxy.URL, "proxy-url", "", "Proxy URL for internally-built k8s URLs (e.g., http://proxy.kagent.svc:8080)")

commandLine.StringVar(&cfg.Auth.Mode, "auth-mode", "unsecure", "Authentication mode: unsecure or trusted-proxy")
commandLine.StringVar(&cfg.Auth.UserIDClaim, "auth-user-id-claim", "sub", "JWT claim name for user identity")
@@ -195,6 +196,7 @@ func (cfg *Config) SetFlags(commandLine *flag.FlagSet) {
commandLine.Var(&MapValue{Target: &agent_translator.DefaultAgentPodLabels}, "default-agent-pod-labels", "Comma-separated key=value pairs of labels to apply to all agent pod templates (e.g. 'team=platform,env=prod'). Per-agent labels take precedence.")

commandLine.StringVar(&agent_translator.DefaultAgentBindHost, "default-agent-bind-host", agent_translator.DefaultAgentBindHost, "Default host address for agent pods to bind to. Use '0.0.0.0' for IPv4 only or '::' for dual-stack (IPv4+IPv6).")
commandLine.StringVar(&cfg.ClusterDomain, "cluster-domain", "", "Optional cluster domain. If provided, internal service URLs will append it (e.g., .svc.cluster.local). If omitted, URLs will use .svc.")
}
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ClusterDomain is added to the Config and exposed as a --cluster-domain/CLUSTER_DOMAIN setting, but it is never referenced anywhere else in the codebase (search shows only the definition and flag registration). Either wire this into the places that construct internal service URLs (so it actually affects behavior) or remove the config/flag to avoid a dead, misleading setting.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, as I mentioned I don't think we really need this right now

Comment on lines +64 to +67
{{- if .Values.clusterDomain }}
- name: CLUSTER_DOMAIN
value: {{ .Values.clusterDomain | quote }}
{{- end }}
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This chart sets CLUSTER_DOMAIN when .Values.clusterDomain is provided, but the controller binary currently does not use cluster-domain/CLUSTER_DOMAIN for anything (it’s parsed into config but never read). Either remove this env var until it’s consumed, or implement the runtime behavior so the setting has an effect.

Suggested change
{{- if .Values.clusterDomain }}
- name: CLUSTER_DOMAIN
value: {{ .Values.clusterDomain | quote }}
{{- end }}

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs addressing. Feels like we don't need this anyway.

Comment on lines 941 to 959
@@ -953,8 +953,8 @@ func (a *adkApiTranslator) isInternalK8sURL(ctx context.Context, urlStr, namespa
return false
}

// Check if it ends with .svc.cluster.local (definitely internal)
if strings.HasSuffix(hostname, ".svc.cluster.local") {
// Check if it's an internal FQDN by looking for .svc or .svc.
if strings.HasSuffix(hostname, ".svc") || strings.Contains(hostname, ".svc.") {
return true
}
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isInternalK8sURL now returns true for any hostname that ends with .svc or merely contains .svc. (e.g., foo.svc.example.com), which can misclassify external URLs and apply the proxy unexpectedly. Consider parsing hostname labels and requiring the canonical k8s pattern service.namespace.svc[.<clusterDomain>] (i.e., parts[2] == "svc" with len(parts) >= 3) instead of substring checks; also update the function comment since it no longer always relies on namespace existence.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs addressing

Comment on lines +287 to 295
// Split hostname by dots: service.namespace or service.namespace.svc
hostname := parsed.Hostname()
parts := strings.Split(hostname, ".")

// Valid patterns:
// - service.namespace (2 parts)
// - service.namespace.svc (3 parts)
// - service.namespace.svc.cluster.local (5 parts)
// - service.namespace.svc (5 parts)
if len(parts) >= 2 {
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment for valid hostname patterns is incorrect: it lists service.namespace.svc twice and claims the latter is “5 parts”. If this is meant to cover custom cluster domains, it should describe service.namespace.svc.<clusterDomain> (which can be 4+ parts depending on the domain).

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs addressing

Comment thread helm/kagent/tests/controller-deployment_test.yaml
destination_workload_namespace), 0.001)\",\"format\":\"time_series\",\"intervalFactor\":1,\"legendFormat\":\"{{
destination_workload }}.{{destination_workload_namespace }}\",\"refId\":\"B\",\"step\":2}],\"title\":\"Bytes
Sent to Incoming TCP Connection\",\"type\":\"timeseries\"}],\"refresh\":\"1m\",\"schemaVersion\":38,\"style\":\"dark\",\"tags\":[],\"templating\":{\"list\":[{\"hide\":0,\"includeAll\":false,\"multi\":false,\"name\":\"datasource\",\"options\":[],\"query\":\"prometheus\",\"queryValue\":\"\",\"refresh\":1,\"regex\":\"\",\"skipUrlSync\":false,\"type\":\"datasource\"},{\"current\":{\"selected\":false,\"text\":\"details.default.svc.cluster.local\",\"value\":\"details.default.svc.cluster.local\"},\"datasource\":{\"type\":\"prometheus\",\"uid\":\"${datasource}\"},\"definition\":\"\",\"hide\":0,\"includeAll\":false,\"label\":\"Service\",\"multi\":false,\"name\":\"service\",\"options\":[],\"query\":\"query_result(sum(istio_requests_total{})
Sent to Incoming TCP Connection\",\"type\":\"timeseries\"}],\"refresh\":\"1m\",\"schemaVersion\":38,\"style\":\"dark\",\"tags\":[],\"templating\":{\"list\":[{\"hide\":0,\"includeAll\":false,\"multi\":false,\"name\":\"datasource\",\"options\":[],\"query\":\"prometheus\",\"queryValue\":\"\",\"refresh\":1,\"regex\":\"\",\"skipUrlSync\":false,\"type\":\"datasource\"},{\"current\":{\"selected\":false,\"text\":\"details.default.svc\",\"value\":\"details.default.svc\"},\"datasource\":{\"type\":\"prometheus\",\"uid\":\"${datasource}\"},\"definition\":\"\",\"hide\":0,\"includeAll\":false,\"label\":\"Service\",\"multi\":false,\"name\":\"service\",\"options\":[],\"query\":\"query_result(sum(istio_requests_total{})
Copy link

Copilot AI Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this dashboard JSON, the service template variable’s current default was changed to details.default.svc, but the variable options are populated from the destination_service label values returned by Prometheus. Those label values are typically fully-qualified (e.g., details.default.svc.<clusterDomain>), so this default may not match any option and the dashboard will load with no data until the user changes it. Consider leaving the default empty/first-option, or adjust the variable regex/query so the option values match the shortened form.

Suggested change
Sent to Incoming TCP Connection\",\"type\":\"timeseries\"}],\"refresh\":\"1m\",\"schemaVersion\":38,\"style\":\"dark\",\"tags\":[],\"templating\":{\"list\":[{\"hide\":0,\"includeAll\":false,\"multi\":false,\"name\":\"datasource\",\"options\":[],\"query\":\"prometheus\",\"queryValue\":\"\",\"refresh\":1,\"regex\":\"\",\"skipUrlSync\":false,\"type\":\"datasource\"},{\"current\":{\"selected\":false,\"text\":\"details.default.svc\",\"value\":\"details.default.svc\"},\"datasource\":{\"type\":\"prometheus\",\"uid\":\"${datasource}\"},\"definition\":\"\",\"hide\":0,\"includeAll\":false,\"label\":\"Service\",\"multi\":false,\"name\":\"service\",\"options\":[],\"query\":\"query_result(sum(istio_requests_total{})
Sent to Incoming TCP Connection\",\"type\":\"timeseries\"}],\"refresh\":\"1m\",\"schemaVersion\":38,\"style\":\"dark\",\"tags\":[],\"templating\":{\"list\":[{\"hide\":0,\"includeAll\":false,\"multi\":false,\"name\":\"datasource\",\"options\":[],\"query\":\"prometheus\",\"queryValue\":\"\",\"refresh\":1,\"regex\":\"\",\"skipUrlSync\":false,\"type\":\"datasource\"},{\"current\":{\"selected\":false,\"text\":\"\",\"value\":\"\"},\"datasource\":{\"type\":\"prometheus\",\"uid\":\"${datasource}\"},\"definition\":\"\",\"hide\":0,\"includeAll\":false,\"label\":\"Service\",\"multi\":false,\"name\":\"service\",\"options\":[],\"query\":\"query_result(sum(istio_requests_total{})

Copilot uses AI. Check for mistakes.
EItanya and others added 2 commits April 29, 2026 13:07
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Eitan Yarmush <eitan.yarmush@solo.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants