Post-mission DumpData/GetRawMetrics crashes when cluster does not resolve .local TLD #399

Description

@tomerweller

Observed behavior

When running Supercluster missions on a Namespace k3s cluster, the mission logic completes successfully but the teardown metrics dump crashes with:

System.Net.WebException: Name or service not known (ssc-xxxxx-yyyyyy.local:80)

This happens during DumpPeerMetrics → GetRawMetrics → Peer.fetch(), which resolves the per-pod ingress hostname (e.g. ssc-1840z-d05de1.local) to scrape admin HTTP metrics before tearing down Kubernetes resources. The .local TLD does not resolve in non-standard DNS environments like Namespace k3s clusters.

Impact

Missions appear to fail (SSC exits non-zero) even when the actual consensus/loadgen logic succeeded. This prevents automated pass/fail gating for mixed-image or custom-image missions on non-SDF Kubernetes clusters.

Suggested fix

Use the per-pod Kubernetes service DNS name instead of the ingress .local hostname for the post-mission metrics scrape. The svc.cluster.local names already work and are used elsewhere in the SSC config generation.
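
A minimal sketch of the first option, written in Python purely for illustration (SSC itself is F#, and none of these helper names exist in the codebase). It assumes the peers run as StatefulSet pods behind a headless service, in which case Kubernetes gives each pod a name of the form `<pod>.<service>.<namespace>.svc.<cluster-domain>`. The pod, service, and namespace values, the port, and the `metrics` path are placeholders, not values taken from SSC.

```python
def pod_service_dns_name(pod_name: str, service_name: str, namespace: str,
                         cluster_domain: str = "cluster.local") -> str:
    """Build the in-cluster DNS name of one pod behind a headless service.

    Assumption: the pod belongs to a StatefulSet whose governing service is
    `service_name`; cluster_domain defaults to the common "cluster.local".
    """
    return f"{pod_name}.{service_name}.{namespace}.svc.{cluster_domain}"


def metrics_url(host: str, port: int, path: str = "metrics") -> str:
    """Build the admin HTTP URL to scrape; port and path are caller-supplied."""
    return f"http://{host}:{port}/{path.lstrip('/')}"


# Hypothetical names, modeled on the hostname shape quoted in this issue.
host = pod_service_dns_name("ssc-1840z-d05de1-0", "ssc-1840z-d05de1", "sandbox")
url = metrics_url(host, 8080)
```

Because these names are served by the cluster DNS, they only resolve from inside the cluster, so this approach presumes the scrape runs from a pod or through a tunnel rather than from outside via the ingress.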

Alternatively, make the ingress domain suffix configurable (currently hardcoded to .local via StellarCoreCfg.fs / StellarKubeSpecs.fs).
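
A rough sketch of the configurable-suffix option, again in Python for illustration only. The environment variable `SSC_INGRESS_DOMAIN_SUFFIX` and both function names are hypothetical; in SSC this would more likely be a command-line option threaded through to the code that currently appends `.local`.

```python
import os
from typing import Mapping, Optional

DEFAULT_INGRESS_SUFFIX = ".local"  # current hardcoded behavior per this issue


def ingress_suffix(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the ingress domain suffix, normalized to start with a dot.

    Reads the hypothetical SSC_INGRESS_DOMAIN_SUFFIX setting and falls back
    to ".local" when it is unset or blank.
    """
    env = os.environ if env is None else env
    raw = env.get("SSC_INGRESS_DOMAIN_SUFFIX", "").strip()
    if not raw:
        return DEFAULT_INGRESS_SUFFIX
    return raw if raw.startswith(".") else "." + raw


def ingress_hostname(peer_name: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Append the configured suffix to a per-pod ingress name."""
    return peer_name + ingress_suffix(env)
```

With no override, `ingress_hostname("ssc-1840z-d05de1", {})` yields the same `ssc-1840z-d05de1.local` name seen in the error above, so existing deployments would be unaffected.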
