test(qa): add cluster-e2e nightly job exercising real SSH path by phil-opp · Pull Request #1982 · dora-rs/dora

phil-opp · 2026-05-31T15:50:57Z

Adds a new nightly job that stands up a real openssh-server on a random loopback port and runs dora cluster up/status/down against a 3-machine cluster.yml (all host: localhost, distinct ssh port + daemon_port per machine — the two schema fields landed in # and #). cluster-smoke (already in nightly) covers status/down against a locally-spawned daemon but never exercises the SSH path of cluster up; this fills that gap.

Wired in the standard places: scripts/qa/ci-nightly-jobs.sh (function + dispatch + usage), .github/workflows/nightly.yml (matrix entry + failure-issue mappings), Makefile (qa-cluster-e2e target), and the CLAUDE.md nightly inventory.

Validated end-to-end in the dev container: 3 SSH-spawned daemons register, status lists all three, down tears everything down with no leftovers. Linux-only; hard-fails (per agreed policy) if openssh-server is missing. GHA workflow apt-get installs it before running the script.

Stacks on #1979; rebase on main once that lands.

Partially resolves #1704

trunk-io · 2026-05-31T16:20:39Z

Merging to main in this repository is managed by Trunk.

To merge this pull request, check the box to the left or comment /trunk merge below.

After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here.

…discovery without multicast

#1778 introduced `DORA_ZENOH_CONNECT` so spawned nodes can bootstrap Zenoh peer discovery without multicast, but only for the daemon↔node hop. The daemon↔daemon hop still relies on multicast scouting, which silently breaks cross-daemon dataflows in dev containers and many CI environments (the source produces, but messages never reach peers on other daemons because they can't discover each other).

This adds a shared rendezvous knob:

1. `open_zenoh_session_with_listen` gains an `inter_daemon_peer: Option<&str>` parameter (libraries/core/src/topics.rs). When set, it is added to both `listen/endpoints` and `connect/endpoints` and multicast scouting is disabled. The existing `listen/exit_on_failure: false` semantics let the all-daemons-bind-same-port pattern degrade cleanly: the first daemon wins the bind and acts as the gossip hub, the rest fall through to connect-only.
2. `dora daemon --zenoh-peer <endpoint>` exposes it on the CLI (e.g. `--zenoh-peer tcp/192.168.1.1:5456`). Threaded through `Daemon::run`, `run_with_builds`, and `run_general`. The `run_dataflow` path (local-only dataflows) explicitly passes `None`.
3. `ClusterConfig.zenoh_peer: Option<String>` (top-level, not per-machine; it MUST be the same on every daemon to function as a rendezvous). `dora cluster up` reads it and appends `--zenoh-peer <ep>` to every daemon spawn command. `dora cluster install` does the same for the generated systemd unit's ExecStart.

Backward compatible: every change is opt-in. Existing cluster.yml files without `zenoh_peer` parse unchanged and daemons continue using multicast scouting (which works fine for production hosts on real networks).

Two new unit tests in cluster::config; existing 11 still pass. fmt + clippy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
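The opt-in behavior can be sketched in shell: a daemon spawn command gains `--zenoh-peer <endpoint>` only when a peer is configured. This is an illustration, not the Rust code in `dora cluster up`; the function name `daemon_spawn_cmd` is hypothetical, and the real spawn command presumably carries further arguments that are omitted here. Only the `--zenoh-peer` flag and the example endpoint come from the commit message.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: build the daemon spawn command line.
# $1 (optional) is the shared rendezvous endpoint, e.g. tcp/192.168.1.1:5456.
daemon_spawn_cmd() {
  local peer="${1:-}"
  local cmd=(dora daemon)
  if [ -n "$peer" ]; then
    # The same endpoint must go to every daemon, otherwise it cannot
    # act as a rendezvous point.
    cmd+=(--zenoh-peer "$peer")
  fi
  printf '%s\n' "${cmd[*]}"
}

daemon_spawn_cmd                          # no zenoh_peer configured
daemon_spawn_cmd "tcp/192.168.1.1:5456"   # zenoh_peer configured
```

With no argument the command is left untouched, which mirrors the backward-compatibility claim: a cluster.yml without `zenoh_peer` yields the same spawn command as before.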

cluster-smoke (already in nightly) covers cluster status/down against a locally-spawned coordinator+daemon but does NOT exercise `dora cluster up` or any SSH-based daemon spawn, the most failure-prone part of the feature.

cluster-e2e fills that gap: it stands up a real openssh-server on a random loopback port, generates a per-run keypair and ssh wrapper on PATH (so the test never touches the user's real ~/.ssh), then runs `dora cluster up/status/down` against a cluster.yml with 3 machines all pointing at localhost (distinct ssh + daemon_port per machine, the two schema fields added by #PR1 and #PR2 in this series).

Wired everywhere cluster-smoke is: scripts/qa/ci-nightly-jobs.sh (function + known_job() + usage + dispatch), .github/workflows/nightly.yml (matrix job + file-issue-on-failure mappings), Makefile (qa-cluster-e2e target), and the CLAUDE.md nightly-job inventory (count bumped 18→19, 14→15).

Validated end-to-end in the dev container: all three SSH-spawned daemons register, `cluster status` lists them, `cluster down` tears everything down with no leftover processes.

Linux-only: macOS sshd defaults and Windows OpenSSH would each need distinct setup that isn't worth the maintenance. Hard-fails (per the agreed policy) if openssh-server is missing; the GHA job installs it before invoking the script.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
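A minimal sketch of the "ssh wrapper on PATH" idea, assuming the wrapper simply forces a per-run identity file and delegates to the real client. The function name `write_ssh_wrapper`, its arguments, and the exact option set are assumptions for illustration; the actual wrapper in scripts/qa/ci-nightly-jobs.sh may differ.

```bash
#!/usr/bin/env bash
# Hypothetical helper: write an `ssh` wrapper into directory $1 that always
# uses the per-run private key $2 and delegates to the real client at $3.
write_ssh_wrapper() {
  local bin_dir="$1" key="$2" real_ssh="$3"
  mkdir -p "$bin_dir"
  # $real_ssh and $key are expanded now; \$@ is left for the wrapper itself.
  # The real client is referenced by absolute path so the wrapper never
  # resolves back to itself through PATH.
  cat >"$bin_dir/ssh" <<EOF
#!/usr/bin/env bash
exec "$real_ssh" -i "$key" -o IdentitiesOnly=yes -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null "\$@"
EOF
  chmod +x "$bin_dir/ssh"
}
```

A caller would generate a throwaway key (for example with `ssh-keygen -q -t ed25519 -N '' -f "$tmp/id_test"`), call `write_ssh_wrapper "$tmp/bin" "$tmp/id_test" "$(command -v ssh)"`, and then prepend `$tmp/bin` to PATH so that any `ssh` invocation made by the tool under test picks up the wrapper instead of the user's configuration.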

1. Leftover-detection: SSH-spawned daemons have argv[0]="dora" (PATH resolved), not "$CLI_ROOT/bin/dora", so `pgrep -f "$CLI_ROOT/bin/dora"` would miss them; a failed cluster down could leak daemons silently while the leftover-process check still passed. Add a `find_our_dora_pids` helper that combines pgrep (argv) with `/proc/<pid>/exe` (binary), and route the shared `terminate_our_dora_children`, `cleanup_all_managed`, and the cluster-e2e leftover check through it. /proc is Linux-only; on macOS we fall back to pgrep alone, same as before. CLI_ROOT is a unique per-run mktemp path so neither method has false positives.
2. CI redundancy: the nightly workflow downloads the shared `dora-cli` artifact but the helper rebuilds via `cargo install` regardless, wasting 3-5 min on cold cache and risking the 15-min timeout. Add a `DORA_QA_CLI_BIN` env var hook in `ensure_cli_installed`: when set to an executable, copy it into CLI_ROOT instead of cargo-installing. The cluster-e2e GHA job sets it to the downloaded artifact path. Local dev leaves the var unset and pays the cargo cost as before. Other jobs are unaffected.
3. Stale counts: bump `scripts/qa/ci-nightly-jobs.sh` header and `Makefile` qa-nightly comment from 18/14 to 19/15, matching the CLAUDE.md update in the prior commit.

Validated locally: cluster-e2e PASS via the cargo-install path (artifact-reuse path is a small early-return that copies the binary into CLI_ROOT; downstream test logic is identical).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
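The combined argv and executable lookup from item 1 can be sketched as follows. This is not the script's actual `find_our_dora_pids`; the name `find_pids_under_root` and its single directory argument are assumptions chosen to keep the example self-contained.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print PIDs whose command line or executable path
# lies under directory $1 (a unique per-run root such as a mktemp dir).
find_pids_under_root() {
  local root="$1" dir exe
  {
    # argv match: catches processes launched via their full path.
    if command -v pgrep >/dev/null 2>&1; then
      pgrep -f "$root/" || true
    fi
    # exe match (Linux only): catches processes launched via PATH lookup,
    # whose argv[0] is just the bare command name.
    if [ -d /proc/self ]; then
      for dir in /proc/[0-9]*; do
        exe="$(readlink "$dir/exe" 2>/dev/null)" || continue
        case "$exe" in
          "$root"/*) echo "${dir#/proc/}" ;;
        esac
      done
    fi
  } | sort -un
}
```

Both sources feed one `sort -un`, so a process found by argv and by executable is reported once. On systems without /proc the function degrades to the pgrep result alone, matching the macOS fallback described in the commit.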

Extend the cluster-e2e job to start, run, verify, and stop a real dataflow through the SSH-spawned cluster, in addition to the existing up/status/down lifecycle.

Builds the three rust-dataflow example nodes (same set cluster-smoke uses), generates an inline dataflow.yml with all nodes pinned to m1 via `_unstable_deploy.machine`, runs `dora start --detach`, gives it 5s to spawn and process, asserts the dataflow appears in `dora list`, then `dora stop`s it before `cluster down`.

All nodes pinned to m1 (single-machine within the SSH-spawned cluster) rather than spread across m1/m2/m3 because the daemon currently opens its Zenoh session with coordinator_addr=None (libraries/core/src/topics.rs:817), so cross-daemon peer discovery relies on multicast scouting, which does not work in nested-container environments like devcontainers. The schema has no per-daemon hook for `DORA_ZENOH_CONNECT` today.

Follow-up to teach `cluster up` to plumb a shared peer endpoint into the per-daemon env (or run a zenohd router in the fixture) would let us cover the cross-daemon Zenoh data plane too; for now the single-machine pin still validates the real-execution path through an SSH-spawned daemon, which is the core gap cluster-smoke leaves open.

Validated locally: dataflow runs, reaches `Finished` status in dora list, cluster down cleans up with no leftover processes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

1. cluster up/install/uninstall/upgrade now exit non-zero on partial failures (binaries/cli/src/command/cluster/{up,install,uninstall,upgrade}.rs). Previously they printed "Cluster partially up" or a `n/m succeeded` summary and still returned `Ok(())`, so scripts and CI couldn't tell apart "fully up" from "two daemons unreachable + one ssh timeout." `up` additionally tracks daemons that didn't register within the 30s window and counts them toward the failure total.

   **NOTE for the reviewer:** these four files are also touched by the open `cluster-daemon-port` PR. If you'd prefer this fix to land in that parent PR, this commit can be cherry-picked across and dropped here on rebase.
2. Critical script commands now have explicit exit-status checks (scripts/qa/ci-nightly-jobs.sh). `run_job` invokes jobs via `if "$fn"; then`, which disables `set -e` inside the function per POSIX/bash inheritance rules, so a failing `cargo build`, `ssh-keygen`, or `dora cluster down` would silently fall through to the next step. Wrapped each of those in `if ! ...; then return 1; fi`, and added a comment to the function header documenting the gotcha so future contributors don't reintroduce silent failures. Pure local-FS ops (mkdir/chmod on a fresh mktemp dir) are left unchecked; they only fail under disk pressure, which surfaces later.
3. SSH wrapper now passes `-F /dev/null` and `-o GlobalKnownHostsFile=/dev/null` so a local `~/.ssh/config` `Host localhost` rule (HostName, ProxyCommand, etc.) and any `/etc/ssh/ssh_known_hosts` entry can't influence the loopback test. Same change in the scp wrapper.

Validated: cluster::config tests still 11/11 pass, fmt + clippy clean, cluster-e2e PASS end-to-end (3 SSH-spawned daemons, dataflow runs to Finished status on m1, cluster down clean).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
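The `set -e` gotcha from item 2 is easy to reproduce in isolation. In the sketch below, `false` stands in for a failing `cargo build` or `ssh-keygen`; `job_unchecked`, `job_checked`, and `demo` are hypothetical names, and `run_job` is a simplified stand-in for the script's function of the same name.

```bash
#!/usr/bin/env bash
# A job that relies on `set -e` to stop at the first failing command.
job_unchecked() {
  false
  echo "unchecked: continued past failure"
}

# The same job with an explicit exit-status check.
job_checked() {
  if ! false; then
    echo "checked: stopping at failure"
    return 1
  fi
  echo "checked: continued past failure"
}

# Calling a function as an `if` condition suspends `set -e` for the whole
# call, including every command inside the function body.
run_job() {
  local fn="$1"
  if "$fn"; then echo "PASS $fn"; else echo "FAIL $fn"; fi
}

# Run in a subshell so `set -e` does not leak into the caller.
demo() (
  set -e
  run_job job_unchecked
  run_job job_checked
)

demo
```

`job_unchecked` is reported as PASS even though its first command failed, because the function's exit status is that of the final `echo`. Only the explicitly checked variant surfaces the failure to `run_job`.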

Now that the parent PR (cluster-zenoh-peer) provides the shared inter-daemon Zenoh rendezvous, switch the cluster-e2e job from the single-machine pin (all nodes on m1) to a real distributed dataflow:

- rust-node on m1 (source)
- rust-status-node on m2 (filter)
- rust-sink on m3 (sink)

`zenoh_peer: tcp/127.0.0.1:$ZENOH_PORT` in the generated cluster.yml makes daemons find each other via the explicit rendezvous instead of multicast scouting (which doesn't work on loopback in nested containers). Cross-daemon messages now flow over the inter-daemon Zenoh data plane (the actual distributed code path) rather than the within-daemon shared-memory shortcut.

Also adds `dora cluster restart cluster.yml cluster-e2e` between the start-and-verify and stop phases:

- Polls until the dataflow reaches Running (RestartByName only matches running_dataflows, not archived; coordinator/lib.rs::restart_dataflow).
- Asserts restart prints "dataflow restarted: <old> -> <new>" and extracts the new UUID.
- Verifies the new UUID is visible in `dora list` afterward.

Validated locally end-to-end: 3 SSH-spawned daemons, dataflow runs across m1/m2/m3, restart produces a new UUID, cluster down clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
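The "polls until the dataflow reaches Running" step follows a common bounded-retry pattern, sketched here. The helper name `poll_until` and the fixed one-second interval are assumptions; the job's actual polling code is not shown in this PR text.

```bash
#!/usr/bin/env bash
# Hypothetical helper: poll_until <timeout_secs> <command> [args...]
# Re-runs the command once per second until it succeeds.
# Returns 0 on success, 1 if the timeout elapses first.
poll_until() {
  local timeout="$1"
  shift
  # SECONDS is a bash builtin counting seconds since the shell started.
  local deadline=$(( SECONDS + timeout ))
  while ! "$@"; do
    if [ "$SECONDS" -ge "$deadline" ]; then
      return 1
    fi
    sleep 1
  done
}
```

A caller might wrap the status check in a small function, for example one that greps `dora list` output for the dataflow name and the expected status, and pass it as the command. The exact output format of `dora list` is an assumption here and would need to be checked against the CLI.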

…erification

Temporarily duplicates the nightly cluster-e2e job into PR CI so the job runs on a real ubuntu-latest GHA runner before merge. Drop this commit once green; the canonical home is nightly.yml::cluster-e2e.

Unlike the nightly job, this one doesn't depend on the build-cli artifact (PR CI doesn't produce one): ensure_cli_installed cargo-installs the CLI from source inside the script. Adds ~3-5 min on cold cache; the 25-min job timeout has headroom.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

1. cluster-e2e: assert cross-daemon messages actually flowed. Replaced "sleep 5 + grep new UUID in dora list" with a poll for the restarted dataflow to reach `Finished` status (max 30s). Both downstream nodes (rust-status-node, rust-sink) only exit when their upstream input closes, and those closures propagate over inter-daemon Zenoh. A broken data plane leaves both nodes blocked in `events.recv()` and the dataflow stuck Running, which a Running-only assertion would silently accept. Finished status proves all three cross-daemon hops (rust-node@m1 -> status@m2 -> sink@m3) actually carried data.
2. cluster-e2e: stop leaking PATH and DORA_COORDINATOR_PORT into subsequent ci-nightly-jobs.sh jobs. Wrapped the dora-calling test body in a subshell so the env exports go out of scope when the subshell exits. Sshd termination + tempdir rm move outside the subshell so they run regardless of pass/fail. Replaced `return 1` with `exit 1` inside the subshell; the outer function captures the subshell's exit code and returns it. Verified locally: after the job exits, $PATH and $DORA_COORDINATOR_PORT match their pre-job values.
3. cluster restart docs were stale: `dora cluster restart` takes `<PATH> <NAME>`, not bare `<NAME>` (docs claimed "no YAML path needed"). Also dropped the "name or UUID" claim; the request is sent as `RestartByName`, which the coordinator's `resolve_name` matches against the dataflow's `name` field only (handlers.rs). Fixed both the markdown docs and the clap `///` doc-string on `Restart.dataflow` so `--help` is also correct.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
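The subshell scoping from item 2 can be sketched as below. The function name `run_isolated_job`, the placeholder path and port, and the `CLEANUP_RAN` marker are hypothetical; in the real job the cleanup step would terminate sshd and remove the temp directory.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: run a test body ("$@") with job-specific exports
# that must not leak into later jobs in the same shell.
run_isolated_job() {
  local rc=0
  (
    # Exports made here go out of scope when the subshell exits.
    export PATH="/tmp/hypothetical-cli-root/bin:$PATH"
    export DORA_COORDINATOR_PORT=54321
    # Inside a subshell, failures use `exit`, not `return`.
    "$@" || exit 1
  ) || rc=$?
  # Cleanup sits outside the subshell so it runs on pass and on fail.
  CLEANUP_RAN=yes
  return "$rc"
}
```

Because the exports happen in a child process, the caller's `PATH` and `DORA_COORDINATOR_PORT` are untouched afterwards, and the `|| rc=$?` capture lets the outer function return the subshell's status after cleanup has run.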

Base automatically changed from cluster-daemon-port to main May 31, 2026 16:20

phil-opp force-pushed the cluster-e2e-test branch from 3267ec3 to 0747769 Compare May 31, 2026 18:05

phil-opp and others added 8 commits May 31, 2026 18:29

fixup(qa-cluster-e2e): typos CI: misfire (one word)

4f85063

phil-opp force-pushed the cluster-e2e-test branch from 0747769 to 4f85063 Compare May 31, 2026 18:33

phil-opp requested a review from heyong4725 May 31, 2026 18:36

phil-opp force-pushed the cluster-e2e-test branch from 4da1cf3 to 1f33351 Compare May 31, 2026 19:07

phil-opp wants to merge 9 commits into main from cluster-e2e-test
