Skip to content

clusterd: derive CLUSTERD_PROCESS from pod hostname for distroless#36872

Open
jasonhernandez wants to merge 1 commit into
mainfrom
jason/distroless-orchestrator-fqdn
Open

clusterd: derive CLUSTERD_PROCESS from pod hostname for distroless#36872
jasonhernandez wants to merge 1 commit into
mainfrom
jason/distroless-orchestrator-fqdn

Conversation

@jasonhernandez

@jasonhernandez jasonhernandez commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Summary

Distroless clusterd has no shell entrypoint to set CLUSTERD_PROCESS (the process ordinal), which entrypoint.sh derived from the pod hostname before the distroless migration (#36099) removes it. clusterd now derives it itself when running under Kubernetes and the value is unset: the ordinal is the last --delimited segment of the StatefulSet pod name, matching how orchestrator-kubernetes parses the process id back out of pod names.

No signal handling is added. tini (added to the distroless image in #36099) runs as PID 1 and forwards signals, so clusterd is a child process and SIGTERM takes its default action rather than being ignored by PID 1.

Part of the distroless migration: #36099 (image), #36101 (UID/GID gating), #36876 (removes the unused CTP FQDN check, stacks on this). Supersedes the closed #36100.

Merge order

Land before or with #36099. Without it, distroless clusterd has no process ordinal and won't start. Safe to land first: on the current Ubuntu images entrypoint.sh still sets CLUSTERD_PROCESS, and the in-process derivation only runs when it is unset.

Test plan

  • cargo check -p mz-clusterd
  • Confirm a distroless clusterd pod in k8s picks up its ordinal and joins the cluster

Comment thread src/orchestrator-kubernetes/src/lib.rs Outdated
..Default::default()
},
EnvVar {
name: "CLUSTERD_GRPC_HOST".to_string(),

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does the CLUSTERD_GRPC_HOST env symbol actually do? We only use GRPC for persist pubsub now, and I'm not sure it uses this.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - digging into it - I don't think it is used. I reworked this draft PR to just add the SIGTERM handler (which should unblock switching clusterd to distroless). #36876 removes some more cruft.

jasonhernandez added a commit that referenced this pull request Jun 2, 2026
Distroless containers run the binary directly as PID 1 (no tini/shell), so
two things from the removed entrypoint.sh are handled here:

1. PID 1 ignores signals with a SIG_DFL disposition, so SIGTERM from
   Kubernetes is silently dropped. Add an explicit termination-signal
   handler in clusterd (environmentd already has one). The StatefulSet
   ordinal (CLUSTERD_PROCESS) is derived in-process from the hostname.

2. entrypoint.sh set CLUSTERD_GRPC_HOST via `hostname --fqdn`. That value
   fed the CTP handshake's optional `server_fqdn` check: clusterd advertised
   its FQDN and the controller compared it against the address it dialed
   (transport.rs::handshake). The check only fired when the value was set,
   is unrelated to gRPC despite the name, and guards only against reaching a
   misrouted replica. Rather than re-plumb it for distroless, remove it:
   drop `--grpc-host`/`CLUSTERD_GRPC_HOST` and the `server_fqdn` field from
   the CTP `Hello`, along with the now-unused `host_from_address` helper and
   the `test_handshake_fqdn_mismatch` test. test_metrics byte-count bounds
   loosened since the handshake shrank.

This is the "rip it out" alternative to the FQDN-handling PRs (#36100
in-process resolve, #36872 orchestrator-injected). Pick one.

Part of SEC-236 distroless migration.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jasonhernandez jasonhernandez force-pushed the jason/distroless-orchestrator-fqdn branch from 0907409 to 2efe3f1 Compare June 2, 2026 20:33
jasonhernandez added a commit that referenced this pull request Jun 2, 2026
Stacks on #36872 (SIGTERM handler). Removes the optional CTP `server_fqdn`
handshake check: clusterd advertised its FQDN (via CLUSTERD_GRPC_HOST, set by
the now-removed entrypoint.sh) and the controller compared it to the address
it dialed. The check only fired when the value was set, is unrelated to gRPC
despite the name, and guards only against reaching a misrouted replica.

Drops `--grpc-host`/`CLUSTERD_GRPC_HOST`, the `server_fqdn` field from the CTP
`Hello`, the `host_from_address` helper, and the `test_handshake_fqdn_mismatch`
test. CTP version-gates the handshake, so dropping the field is fine across a
release boundary.

Part of SEC-236 distroless migration.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jasonhernandez jasonhernandez changed the title clusterd: SIGTERM handler + orchestrator-injected gRPC FQDN (Variant B) clusterd: SIGTERM handler for distroless containers Jun 2, 2026
@jasonhernandez jasonhernandez marked this pull request as ready for review June 4, 2026 21:40
@jasonhernandez jasonhernandez requested a review from a team as a code owner June 4, 2026 21:40

@antiguru antiguru left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine. Do we need the SIGTERM handler in other binaries, too? Then I'd recommend extracting common logic into mz-ore.

Comment thread src/clusterd/src/lib.rs Outdated
Comment thread src/clusterd/src/lib.rs Outdated
@jasonhernandez jasonhernandez force-pushed the jason/distroless-orchestrator-fqdn branch from 2efe3f1 to ec0fd3c Compare July 1, 2026 18:40
@jasonhernandez jasonhernandez changed the title clusterd: SIGTERM handler for distroless containers clusterd: derive CLUSTERD_PROCESS from pod hostname for distroless Jul 1, 2026
@jasonhernandez

Copy link
Copy Markdown
Contributor Author

🔗 Distroless migration — coordination

Part of a set that ships together: #36099 (distroless image), #36101 (UID/GID gating), #36876 (removes the unused CTP FQDN check, stacks on this).

Merge order: land before or with #36099. Without the in-process CLUSTERD_PROCESS derivation here, distroless clusterd has no process ordinal and won't start. Safe to land first: current Ubuntu images still set CLUSTERD_PROCESS via entrypoint.sh, and the in-process path only runs when it is unset.

No signal handling is included: tini (added to the distroless image in #36099) runs as PID 1 and forwards signals, so clusterd is a child process and SIGTERM takes its default action. This supersedes the earlier SIGTERM-handler framing, which assumed clusterd would be PID 1.

jasonhernandez added a commit that referenced this pull request Jul 1, 2026
Stacks on #36872 (SIGTERM handler). Removes the optional CTP `server_fqdn`
handshake check: clusterd advertised its FQDN (via CLUSTERD_GRPC_HOST, set by
the now-removed entrypoint.sh) and the controller compared it to the address
it dialed. The check only fired when the value was set, is unrelated to gRPC
despite the name, and guards only against reaching a misrouted replica.

Drops `--grpc-host`/`CLUSTERD_GRPC_HOST`, the `server_fqdn` field from the CTP
`Hello`, the `host_from_address` helper, and the `test_handshake_fqdn_mismatch`
test. CTP version-gates the handshake, so dropping the field is fine across a
release boundary.

Part of SEC-236 distroless migration.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
jasonhernandez added a commit that referenced this pull request Jul 1, 2026
Stacks on #36872 (SIGTERM handler). Removes the optional CTP `server_fqdn`
handshake check: clusterd advertised its FQDN (via CLUSTERD_GRPC_HOST, set by
the now-removed entrypoint.sh) and the controller compared it to the address
it dialed. The check only fired when the value was set, is unrelated to gRPC
despite the name, and guards only against reaching a misrouted replica.

Drops `--grpc-host`/`CLUSTERD_GRPC_HOST`, the `server_fqdn` field from the CTP
`Hello`, the `host_from_address` helper, and the `test_handshake_fqdn_mismatch`
test. CTP version-gates the handshake, so dropping the field is fine across a
release boundary.

Part of SEC-236 distroless migration.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Distroless clusterd has no shell entrypoint to set CLUSTERD_PROCESS (the
process ordinal) from the pod hostname. Derive it in clusterd itself when
running under Kubernetes and the value is unset, via
`process_ordinal_from_hostname`, which takes the trailing `-`-delimited
segment of the StatefulSet pod name and returns it only when it parses as
a process index. This mirrors how orchestrator-kubernetes recovers the
process id from pod names.

No in-process signal handler is needed: tini (added to the distroless
image in #36099) runs as PID 1 and forwards signals, so clusterd is a
child process and SIGTERM takes its default action rather than being
ignored by PID 1.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jasonhernandez jasonhernandez force-pushed the jason/distroless-orchestrator-fqdn branch from ec0fd3c to 0931afb Compare July 1, 2026 22:26
jasonhernandez added a commit that referenced this pull request Jul 1, 2026
Stacks on #36872 (SIGTERM handler). Removes the optional CTP `server_fqdn`
handshake check: clusterd advertised its FQDN (via CLUSTERD_GRPC_HOST, set by
the now-removed entrypoint.sh) and the controller compared it to the address
it dialed. The check only fired when the value was set, is unrelated to gRPC
despite the name, and guards only against reaching a misrouted replica.

Drops `--grpc-host`/`CLUSTERD_GRPC_HOST`, the `server_fqdn` field from the CTP
`Hello`, the `host_from_address` helper, and the `test_handshake_fqdn_mismatch`
test. CTP version-gates the handshake, so dropping the field is fine across a
release boundary.

Part of SEC-236 distroless migration.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants