clusterd: derive CLUSTERD_PROCESS from pod hostname for distroless#36872
clusterd: derive CLUSTERD_PROCESS from pod hostname for distroless#36872jasonhernandez wants to merge 1 commit into
Conversation
| ..Default::default() | ||
| }, | ||
| EnvVar { | ||
| name: "CLUSTERD_GRPC_HOST".to_string(), |
There was a problem hiding this comment.
What does the CLUSTERD_GRPC_HOST env symbol actually do? We only use GRPC for persist pubsub now, and I'm not sure it uses this.
There was a problem hiding this comment.
Thanks - digging into it - I don't think it is used. I reworked this draft PR to just add the SIGTERM handler (which should unblock switching clusterd to distroless). #36876 removes some more cruft.
Distroless containers run the binary directly as PID 1 (no tini/shell), so two things from the removed entrypoint.sh are handled here: 1. PID 1 ignores signals with a SIG_DFL disposition, so SIGTERM from Kubernetes is silently dropped. Add an explicit termination-signal handler in clusterd (environmentd already has one). The StatefulSet ordinal (CLUSTERD_PROCESS) is derived in-process from the hostname. 2. entrypoint.sh set CLUSTERD_GRPC_HOST via `hostname --fqdn`. That value fed the CTP handshake's optional `server_fqdn` check: clusterd advertised its FQDN and the controller compared it against the address it dialed (transport.rs::handshake). The check only fired when the value was set, is unrelated to gRPC despite the name, and guards only against reaching a misrouted replica. Rather than re-plumb it for distroless, remove it: drop `--grpc-host`/`CLUSTERD_GRPC_HOST` and the `server_fqdn` field from the CTP `Hello`, along with the now-unused `host_from_address` helper and the `test_handshake_fqdn_mismatch` test. test_metrics byte-count bounds loosened since the handshake shrank. This is the "rip it out" alternative to the FQDN-handling PRs (#36100 in-process resolve, #36872 orchestrator-injected). Pick one. Part of SEC-236 distroless migration. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
0907409 to
2efe3f1
Compare
Stacks on #36872 (SIGTERM handler). Removes the optional CTP `server_fqdn` handshake check: clusterd advertised its FQDN (via CLUSTERD_GRPC_HOST, set by the now-removed entrypoint.sh) and the controller compared it to the address it dialed. The check only fired when the value was set, is unrelated to gRPC despite the name, and guards only against reaching a misrouted replica. Drops `--grpc-host`/`CLUSTERD_GRPC_HOST`, the `server_fqdn` field from the CTP `Hello`, the `host_from_address` helper, and the `test_handshake_fqdn_mismatch` test. CTP version-gates the handshake, so dropping the field is fine across a release boundary. Part of SEC-236 distroless migration. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
antiguru
left a comment
There was a problem hiding this comment.
Looks fine. Do we need the SIGTERM handler in other binaries, too? Then I'd recommend extracting common logic into mz-ore.
2efe3f1 to
ec0fd3c
Compare
🔗 Distroless migration — coordinationPart of a set that ships together: #36099 (distroless image), #36101 (UID/GID gating), #36876 (removes the unused CTP FQDN check, stacks on this). Merge order: land before or with #36099. Without the in-process No signal handling is included: |
Stacks on #36872 (SIGTERM handler). Removes the optional CTP `server_fqdn` handshake check: clusterd advertised its FQDN (via CLUSTERD_GRPC_HOST, set by the now-removed entrypoint.sh) and the controller compared it to the address it dialed. The check only fired when the value was set, is unrelated to gRPC despite the name, and guards only against reaching a misrouted replica. Drops `--grpc-host`/`CLUSTERD_GRPC_HOST`, the `server_fqdn` field from the CTP `Hello`, the `host_from_address` helper, and the `test_handshake_fqdn_mismatch` test. CTP version-gates the handshake, so dropping the field is fine across a release boundary. Part of SEC-236 distroless migration. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Stacks on #36872 (SIGTERM handler). Removes the optional CTP `server_fqdn` handshake check: clusterd advertised its FQDN (via CLUSTERD_GRPC_HOST, set by the now-removed entrypoint.sh) and the controller compared it to the address it dialed. The check only fired when the value was set, is unrelated to gRPC despite the name, and guards only against reaching a misrouted replica. Drops `--grpc-host`/`CLUSTERD_GRPC_HOST`, the `server_fqdn` field from the CTP `Hello`, the `host_from_address` helper, and the `test_handshake_fqdn_mismatch` test. CTP version-gates the handshake, so dropping the field is fine across a release boundary. Part of SEC-236 distroless migration. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Distroless clusterd has no shell entrypoint to set CLUSTERD_PROCESS (the process ordinal) from the pod hostname. Derive it in clusterd itself when running under Kubernetes and the value is unset, via `process_ordinal_from_hostname`, which takes the trailing `-`-delimited segment of the StatefulSet pod name and returns it only when it parses as a process index. This mirrors how orchestrator-kubernetes recovers the process id from pod names. No in-process signal handler is needed: tini (added to the distroless image in #36099) runs as PID 1 and forwards signals, so clusterd is a child process and SIGTERM takes its default action rather than being ignored by PID 1. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ec0fd3c to
0931afb
Compare
Stacks on #36872 (SIGTERM handler). Removes the optional CTP `server_fqdn` handshake check: clusterd advertised its FQDN (via CLUSTERD_GRPC_HOST, set by the now-removed entrypoint.sh) and the controller compared it to the address it dialed. The check only fired when the value was set, is unrelated to gRPC despite the name, and guards only against reaching a misrouted replica. Drops `--grpc-host`/`CLUSTERD_GRPC_HOST`, the `server_fqdn` field from the CTP `Hello`, the `host_from_address` helper, and the `test_handshake_fqdn_mismatch` test. CTP version-gates the handshake, so dropping the field is fine across a release boundary. Part of SEC-236 distroless migration. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
Distroless clusterd has no shell entrypoint to set
CLUSTERD_PROCESS(the process ordinal), whichentrypoint.shderived from the pod hostname before the distroless migration (#36099) removes it. clusterd now derives it itself when running under Kubernetes and the value is unset: the ordinal is the last--delimited segment of the StatefulSet pod name, matching how orchestrator-kubernetes parses the process id back out of pod names.No signal handling is added. tini (added to the distroless image in #36099) runs as PID 1 and forwards signals, so clusterd is a child process and SIGTERM takes its default action rather than being ignored by PID 1.
Part of the distroless migration: #36099 (image), #36101 (UID/GID gating), #36876 (removes the unused CTP FQDN check, stacks on this). Supersedes the closed #36100.
Merge order
Land before or with #36099. Without it, distroless clusterd has no process ordinal and won't start. Safe to land first: on the current Ubuntu images
entrypoint.shstill setsCLUSTERD_PROCESS, and the in-process derivation only runs when it is unset.Test plan
cargo check -p mz-clusterd