test(qa): add cluster-e2e nightly job exercising real SSH path #1982

Open
phil-opp wants to merge 9 commits into main from cluster-e2e-test

Conversation

Collaborator

@phil-opp phil-opp commented May 31, 2026

Adds a new nightly job that stands up a real openssh-server on a random loopback port and runs dora cluster up/status/down against a 3-machine cluster.yml (all host: localhost, distinct ssh port + daemon_port per machine — the two schema fields landed in # and #). cluster-smoke (already in nightly) covers status/down against a locally-spawned daemon but never exercises the SSH path of cluster up; this fills that gap.

Wired in the standard places: scripts/qa/ci-nightly-jobs.sh (function + dispatch + usage), .github/workflows/nightly.yml (matrix entry + failure-issue mappings), Makefile (qa-cluster-e2e target), and the CLAUDE.md nightly inventory.

Validated end-to-end in the dev container: 3 SSH-spawned daemons register, status lists all three, and down tears everything down with no leftovers. Linux-only; hard-fails (per agreed policy) if openssh-server is missing. The GHA workflow installs it with apt-get before running the script.

Stacks on #1979; rebase on main once that lands.

Partially resolves #1704

Base automatically changed from cluster-daemon-port to main May 31, 2026 16:20
Contributor

trunk-io Bot commented May 31, 2026

Merging to main in this repository is managed by Trunk.

  • To merge this pull request, check the box to the left or comment /trunk merge below.

After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here.

@phil-opp phil-opp force-pushed the cluster-e2e-test branch from 3267ec3 to 0747769 Compare May 31, 2026 18:05
phil-opp and others added 8 commits May 31, 2026 18:29
…discovery without multicast

#1778 introduced `DORA_ZENOH_CONNECT` so spawned nodes can bootstrap
Zenoh peer discovery without multicast — but only for the daemon↔node
hop. The daemon↔daemon hop still relies on multicast scouting, which
silently breaks cross-daemon dataflows in dev containers and many CI
environments (the source produces, but messages never reach peers on
other daemons because they can't discover each other).

This adds a shared rendezvous knob:

1. `open_zenoh_session_with_listen` gains an `inter_daemon_peer:
   Option<&str>` parameter (libraries/core/src/topics.rs). When set, it
   is added to both `listen/endpoints` and `connect/endpoints` and
   multicast scouting is disabled. The existing `listen/exit_on_failure:
   false` semantics let the all-daemons-bind-same-port pattern degrade
   cleanly — the first daemon wins the bind and acts as the gossip hub,
   the rest fall through to connect-only.

2. `dora daemon --zenoh-peer <endpoint>` exposes it on the CLI (e.g.
   `--zenoh-peer tcp/192.168.1.1:5456`). Threaded through `Daemon::run`,
   `run_with_builds`, and `run_general`. The `run_dataflow` path
   (local-only dataflows) explicitly passes `None`.

3. `ClusterConfig.zenoh_peer: Option<String>` (top-level, not per-
   machine — it MUST be the same on every daemon to function as a
   rendezvous). `dora cluster up` reads it and appends
   `--zenoh-peer <ep>` to every daemon spawn command. `dora cluster
   install` does the same for the generated systemd unit's ExecStart.
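
The effect of item 3 on the spawn command can be sketched in shell. This is an illustration only: the real logic lives in the Rust CLI, and `build_daemon_cmd` is a hypothetical helper. Only `dora daemon --zenoh-peer <endpoint>` is taken from the description above.

```sh
# Hypothetical helper (not part of the dora CLI): print the daemon
# command line for one machine.
# $1 is the optional cluster-wide zenoh_peer endpoint (empty = not set).
build_daemon_cmd() {
    zenoh_peer=$1
    cmd="dora daemon"
    if [ -n "$zenoh_peer" ]; then
        # The same endpoint goes to every daemon so they share one rendezvous.
        cmd="$cmd --zenoh-peer $zenoh_peer"
    fi
    printf '%s\n' "$cmd"
}

build_daemon_cmd ""                        # no zenoh_peer: command unchanged
build_daemon_cmd "tcp/192.168.1.1:5456"    # zenoh_peer set: flag appended
```

Because the flag is only appended when the field is present, a cluster.yml without `zenoh_peer` yields the same command as before, which is the backward-compatibility property the commit relies on.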

Backward compatible: every change is opt-in. Existing cluster.yml
files without `zenoh_peer` parse unchanged and daemons continue using
multicast scouting (which works fine for production hosts on real
networks).

Two new unit tests in cluster::config; existing 11 still pass. fmt +
clippy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cluster-smoke (already in nightly) covers cluster status/down against a
locally-spawned coordinator+daemon but does NOT exercise `dora cluster up`
or any SSH-based daemon spawn — the most failure-prone part of the
feature. cluster-e2e fills that gap: it stands up a real openssh-server
on a random loopback port, generates a per-run keypair and ssh wrapper
on PATH (so the test never touches the user's real ~/.ssh), then runs
`dora cluster up/status/down` against a cluster.yml with 3 machines all
pointing at localhost (distinct ssh + daemon_port per machine, the two
schema fields added by #PR1 and #PR2 in this series).
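
The wrapper idea can be sketched as follows. This is a minimal sketch, not the script's actual code: `write_ssh_wrapper` and its arguments are hypothetical, the `/usr/bin/ssh` default is an assumed location, and `UserKnownHostsFile`, `StrictHostKeyChecking`, and `-i` are standard OpenSSH client options chosen here for illustration. `-F /dev/null` and `-o GlobalKnownHostsFile=/dev/null` are the options a later commit in this PR adds to the wrapper.

```sh
# Hypothetical helper: write a per-run ssh wrapper into a private bin dir.
# Usage: write_ssh_wrapper <bin_dir> <key_file> [real_ssh]
write_ssh_wrapper() {
    bin_dir=$1
    key_file=$2
    real_ssh=${3:-/usr/bin/ssh}   # assumed location of the real client
    mkdir -p "$bin_dir"
    # Unquoted heredoc: $real_ssh and $key_file expand now,
    # while \$@ is written literally and expands when the wrapper runs.
    cat > "$bin_dir/ssh" <<EOF
#!/bin/sh
exec "$real_ssh" -F /dev/null -o GlobalKnownHostsFile=/dev/null -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -i "$key_file" "\$@"
EOF
    chmod +x "$bin_dir/ssh"
}
```

A job would call this with a fresh `mktemp -d` directory and a key generated by `ssh-keygen` into the same directory, then prepend the bin dir to `PATH` so that every `ssh` invocation made by `dora cluster up` resolves to the wrapper instead of reading the user's real `~/.ssh`.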

Wired everywhere cluster-smoke is: scripts/qa/ci-nightly-jobs.sh
(function + known_job() + usage + dispatch), .github/workflows/nightly.yml
(matrix job + file-issue-on-failure mappings), Makefile (qa-cluster-e2e
target), and the CLAUDE.md nightly-job inventory (count bumped 18→19,
14→15). Validated end-to-end in the dev container: all three SSH-spawned
daemons register, `cluster status` lists them, `cluster down` tears
everything down with no leftover processes.

Linux-only — macOS sshd defaults and Windows OpenSSH would each need
distinct setup that isn't worth the maintenance. Hard-fails (per the
agreed policy) if openssh-server is missing; the GHA job installs it
before invoking the script.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1. Leftover-detection: SSH-spawned daemons have argv[0]="dora" (PATH
   resolved), not "$CLI_ROOT/bin/dora", so `pgrep -f "$CLI_ROOT/bin/dora"`
   would miss them — a failed cluster down could leak daemons silently
   while the leftover-process check still passed. Add a `find_our_dora_pids`
   helper that combines pgrep (argv) with `/proc/<pid>/exe` (binary), and
   route the shared `terminate_our_dora_children`, `cleanup_all_managed`,
   and the cluster-e2e leftover check through it. /proc is Linux-only; on
   macOS we fall back to pgrep alone, same as before. CLI_ROOT is a unique
   per-run mktemp path so neither method has false positives.

2. CI redundancy: the nightly workflow downloads the shared `dora-cli`
   artifact but the helper rebuilds via `cargo install` regardless,
   wasting 3-5 min on cold cache and risking the 15-min timeout. Add a
   `DORA_QA_CLI_BIN` env var hook in `ensure_cli_installed`: when set
   to an executable, copy it into CLI_ROOT instead of cargo-installing.
   The cluster-e2e GHA job sets it to the downloaded artifact path.
   Local dev leaves the var unset and pays the cargo cost as before.
   Other jobs are unaffected.

3. Stale counts: bump `scripts/qa/ci-nightly-jobs.sh` header and
   `Makefile` qa-nightly comment from 18/14 to 19/15, matching the
   CLAUDE.md update in the prior commit.
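
The two-pronged lookup from item 1 might look roughly like this. It is a sketch under assumptions, not the actual `find_our_dora_pids`: the function name `find_pids_for_root` and its single-argument interface are hypothetical, and the pgrep pattern is treated as a regex, which is harmless for mktemp-style paths.

```sh
# Hypothetical sketch of a find_our_dora_pids-style helper.
# Usage: find_pids_for_root <cli_root>
# Prints PIDs whose argv or executable points into <cli_root>,
# one per line, de-duplicated.
find_pids_for_root() {
    root=$1
    {
        # 1) argv match: catches processes started as "$root/bin/dora ...".
        if command -v pgrep >/dev/null 2>&1; then
            pgrep -f "$root/bin/dora" || true
        fi
        # 2) binary match (Linux only): catches processes whose argv[0] is
        #    just "dora" because it was resolved through PATH.
        if [ -d /proc/self ]; then
            for exe in /proc/[0-9]*/exe; do
                # Unreadable or vanished entries are skipped.
                target=$(readlink "$exe" 2>/dev/null) || continue
                case $target in
                    "$root"/*)
                        pid=${exe#/proc/}
                        printf '%s\n' "${pid%/exe}"
                        ;;
                esac
            done
        fi
    } | sort -un
}
```

On a system without `/proc` the second branch is skipped and only the pgrep results are returned, which mirrors the macOS fallback described above.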

Validated locally: cluster-e2e PASS via the cargo-install path
(artifact-reuse path is a small early-return that copies the binary
into CLI_ROOT; downstream test logic is identical).
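
The artifact-reuse early return could be sketched like this. The function body is an assumption about the shape of `ensure_cli_installed`, not a copy of it: the `cli_root` argument and the `binaries/cli` package path passed to `cargo install` are illustrative.

```sh
# Sketch of the DORA_QA_CLI_BIN hook (shape assumed, not the real helper).
# Usage: ensure_cli_installed <cli_root>
ensure_cli_installed() {
    cli_root=$1
    mkdir -p "$cli_root/bin"
    if [ -n "${DORA_QA_CLI_BIN:-}" ] && [ -f "$DORA_QA_CLI_BIN" ] \
        && [ -x "$DORA_QA_CLI_BIN" ]; then
        # Artifact-reuse path: copy the prebuilt binary and skip the build.
        cp "$DORA_QA_CLI_BIN" "$cli_root/bin/dora" || return 1
        chmod +x "$cli_root/bin/dora"
        return 0
    fi
    # Fallback: build from source (package path is an assumption).
    cargo install --path binaries/cli --root "$cli_root" --locked
}
```

Either branch leaves the binary at `$cli_root/bin/dora`, so the rest of the job does not need to know which path was taken.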

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extend the cluster-e2e job to start, run, verify, and stop a real
dataflow through the SSH-spawned cluster, in addition to the existing
up/status/down lifecycle. Builds the three rust-dataflow example nodes
(same set cluster-smoke uses), generates an inline dataflow.yml with
all nodes pinned to m1 via `_unstable_deploy.machine`, runs `dora
start --detach`, gives it 5s to spawn and process, asserts the
dataflow appears in `dora list`, then `dora stop`s it before
`cluster down`.

All nodes pinned to m1 (single-machine within the SSH-spawned cluster)
rather than spread across m1/m2/m3 because the daemon currently opens
its Zenoh session with coordinator_addr=None
(libraries/core/src/topics.rs:817), so cross-daemon peer discovery
relies on multicast scouting — which does not work in nested-container
environments like devcontainers. The schema has no per-daemon hook for
`DORA_ZENOH_CONNECT` today. Follow-up to teach `cluster up` to plumb
a shared peer endpoint into the per-daemon env (or run a zenohd
router in the fixture) would let us cover the cross-daemon Zenoh
data plane too; for now the single-machine pin still validates the
real-execution path through an SSH-spawned daemon, which is the
core gap cluster-smoke leaves open.

Validated locally: dataflow runs, reaches `Finished` status in
dora list, cluster down cleans up with no leftover processes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1. cluster up/install/uninstall/upgrade now exit non-zero on partial
   failures (binaries/cli/src/command/cluster/{up,install,uninstall,
   upgrade}.rs). Previously they printed "Cluster partially up" or a
   `n/m succeeded` summary and still returned `Ok(())`, so scripts and
   CI couldn't distinguish "fully up" from "two daemons unreachable +
   one ssh timeout." `up` additionally tracks daemons that didn't
   register within the 30s window and counts them toward the failure
   total. **NOTE for the reviewer:** these four files are also touched
   by the open `cluster-daemon-port` PR. If you'd prefer this fix to
   land in that parent PR, this commit can be cherry-picked across and
   dropped here on rebase.

2. Critical script commands now have explicit exit-status checks
   (scripts/qa/ci-nightly-jobs.sh). `run_job` invokes jobs via
   `if "$fn"; then`, and the shell ignores `set -e` for every command
   executed as part of an `if` condition, including the entire body of
   a function called there. A failing `cargo build`, `ssh-keygen`, or
   `dora cluster down` would therefore silently fall through to the
   next step. Wrapped each of those in `if ! ...; then return 1; fi`,
   and added a comment to the function header documenting the gotcha
   so future contributors don't reintroduce silent failures.
   Pure local-FS ops (mkdir/chmod on a fresh mktemp dir) are left
   unchecked — they only fail under disk pressure, which surfaces
   later.

3. SSH wrapper now passes `-F /dev/null` and
   `-o GlobalKnownHostsFile=/dev/null` so a local `~/.ssh/config`
   `Host localhost` rule (HostName, ProxyCommand, etc.) and any
   `/etc/ssh/ssh_known_hosts` entry can't influence the loopback test.
   Same change in the scp wrapper.
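
The `set -e` gotcha from item 2 is easy to reproduce in isolation. The sketch below uses a simplified `run_job` and two hypothetical job functions, with `false` standing in for a failing `cargo build`. It is not the script's code, only a demonstration of the shell rule.

```sh
# Simplified stand-in for the dispatcher: the job runs as an `if` condition.
run_job() {
    fn=$1
    if "$fn"; then
        echo "PASS $fn"
    else
        echo "FAIL $fn"
    fi
}

# Relies on errexit, which is ignored inside an `if` condition.
job_unchecked() {
    false    # stands in for a failing build step
    echo "job_unchecked: continued past the failure"
}

# Checks the exit status explicitly, as the commit does.
job_checked() {
    if ! false; then
        return 1
    fi
    echo "job_checked: continued past the failure"
}

# Each run gets its own subshell with errexit enabled.
unchecked_out=$(set -e; run_job job_unchecked)
checked_out=$(set -e; run_job job_checked)
printf '%s\n' "$unchecked_out" "$checked_out"
```

Even with `set -e` active, `job_unchecked` runs past the failing command and is reported as PASS, while `job_checked` is reported as FAIL.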

Validated: cluster::config tests still pass (11/11), fmt + clippy clean,
cluster-e2e PASS end-to-end (3 SSH-spawned daemons, dataflow runs to
Finished status on m1, cluster down clean).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Now that the parent PR (cluster-zenoh-peer) provides the shared
inter-daemon Zenoh rendezvous, switch the cluster-e2e job from the
single-machine pin (all nodes on m1) to a real distributed dataflow:

  - rust-node        on m1 (source)
  - rust-status-node on m2 (filter)
  - rust-sink        on m3 (sink)

`zenoh_peer: tcp/127.0.0.1:$ZENOH_PORT` in the generated cluster.yml
makes daemons find each other via the explicit rendezvous instead of
multicast scouting (which doesn't work on loopback in nested
containers). Cross-daemon messages now flow over the inter-daemon
Zenoh data plane — the actual distributed code path — rather than the
within-daemon shared-memory shortcut.
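
A generated dataflow with one node per machine might be written like this. It is a sketch, not the job's actual generator: `write_distributed_dataflow` is a hypothetical helper, and the binary paths and input/output names are assumptions modeled on the upstream rust-dataflow example. Only the node-to-machine mapping and the `_unstable_deploy.machine` key come from this PR.

```sh
# Hypothetical generator for the three-node dataflow.
# Usage: write_distributed_dataflow <out_file> <source_m> <filter_m> <sink_m>
write_distributed_dataflow() {
    out=$1
    # Unquoted heredoc: $2, $3, $4 expand to the machine ids.
    cat > "$out" <<EOF
nodes:
  - id: rust-node
    _unstable_deploy:
      machine: $2
    path: target/debug/rust-dataflow-example-node
    inputs:
      tick: dora/timer/millis/10
    outputs:
      - random
  - id: rust-status-node
    _unstable_deploy:
      machine: $3
    path: target/debug/rust-dataflow-example-status-node
    inputs:
      tick: dora/timer/millis/100
      random: rust-node/random
    outputs:
      - status
  - id: rust-sink
    _unstable_deploy:
      machine: $4
    path: target/debug/rust-dataflow-example-sink
    inputs:
      message: rust-status-node/status
EOF
}
```

Calling it with `m1 m2 m3` gives the distributed layout, while `m1 m1 m1` reproduces the earlier single-machine pin, so the same generator covers both variants.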

Also adds `dora cluster restart cluster.yml cluster-e2e` between the
start-and-verify and stop phases:
  - Polls until the dataflow reaches Running (RestartByName only
    matches running_dataflows, not archived — coordinator/lib.rs::
    restart_dataflow).
  - Asserts restart prints "dataflow restarted: <old> -> <new>" and
    extracts the new UUID.
  - Verifies the new UUID is visible in `dora list` afterward.
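
The polling step can be expressed with a small generic helper. This is a minimal sketch: `poll_until` is a hypothetical name, and the predicate that inspects `dora list` is deliberately left to the caller because its output format is not shown here.

```sh
# Hypothetical helper.
# Usage: poll_until <timeout_seconds> <command> [args...]
# Re-runs the command about once per second until it succeeds
# or the timeout expires.
poll_until() {
    timeout_s=$1
    shift
    elapsed=0
    until "$@"; do
        if [ "$elapsed" -ge "$timeout_s" ]; then
            return 1    # gave up: the condition never became true
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}
```

A job would pass its own predicate, for example a function that greps `dora list` for the dataflow name and the expected status, and fail the job when `poll_until` returns non-zero.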

Validated locally end-to-end: 3 SSH-spawned daemons, dataflow runs
across m1/m2/m3, restart produces a new UUID, cluster down clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…erification

Temporarily duplicates the nightly cluster-e2e job into PR CI so the
job runs on a real ubuntu-latest GHA runner before merge. Drop this
commit once green; the canonical home is nightly.yml::cluster-e2e.

Unlike the nightly job, this one doesn't depend on the build-cli
artifact (PR CI doesn't produce one) — ensure_cli_installed cargo-
installs the CLI from source inside the script. Adds ~3-5 min on
cold cache; the 25-min job timeout has headroom.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@phil-opp phil-opp force-pushed the cluster-e2e-test branch from 0747769 to 4f85063 Compare May 31, 2026 18:33
@phil-opp phil-opp requested a review from heyong4725 May 31, 2026 18:36
1. cluster-e2e: assert cross-daemon messages actually flowed.
   Replaced "sleep 5 + grep new UUID in dora list" with a poll for the
   restarted dataflow to reach `Finished` status (max 30s). Both
   downstream nodes (rust-status-node, rust-sink) only exit when their
   upstream input closes, and those closures propagate over inter-
   daemon Zenoh. A broken data plane leaves both nodes blocked in
   `events.recv()` and the dataflow stuck Running — which a Running-
   only assertion would silently accept. Finished status proves all
   three cross-daemon hops (rust-node@m1 -> status@m2 -> sink@m3)
   actually carried data.

2. cluster-e2e: stop leaking PATH and DORA_COORDINATOR_PORT into
   subsequent ci-nightly-jobs.sh jobs. Wrapped the dora-calling test
   body in a subshell so the env exports go out of scope when the
   subshell exits. Sshd termination + tempdir rm move outside the
   subshell so they run regardless of pass/fail. Replaced `return 1`
   with `exit 1` inside the subshell; the outer function captures the
   subshell's exit code and returns it. Verified locally: after the
   job exits, $PATH and $DORA_COORDINATOR_PORT match their pre-job
   values.

3. cluster restart docs were stale: `dora cluster restart` takes
   `<PATH> <NAME>`, not bare `<NAME>` (docs claimed "no YAML path
   needed"). Also dropped the "name or UUID" claim — the request is
   sent as `RestartByName`, which the coordinator's `resolve_name`
   matches against the dataflow's `name` field only (handlers.rs).
   Fixed both the markdown docs and the clap `///` doc-string on
   `Restart.dataflow` so `--help` is also correct.
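
The subshell scoping from item 2 can be sketched as below. `run_scoped_job` and its port argument are hypothetical, and the one-line check stands in for the real test body. The structure (exports inside the subshell, cleanup outside, `exit` instead of `return`) follows the description above.

```sh
# Hypothetical job skeleton showing env scoping via a subshell.
# Usage: run_scoped_job <coordinator_port>
run_scoped_job() {
    scratch=$(mktemp -d)
    (
        # Exports made here disappear when the subshell exits.
        PATH="$scratch/bin:$PATH"
        DORA_COORDINATOR_PORT=$1
        export PATH DORA_COORDINATOR_PORT
        # Test body goes here; use `exit`, not `return`, to fail the job.
        [ -n "$DORA_COORDINATOR_PORT" ] || exit 1
    )
    rc=$?
    # Cleanup sits outside the subshell so it runs on pass and on fail.
    rm -rf "$scratch"
    return "$rc"
}
```

After the function returns, the caller's `PATH` is unchanged and `DORA_COORDINATOR_PORT` is not set by the job, so later jobs in the same script start from the original environment.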

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@phil-opp phil-opp force-pushed the cluster-e2e-test branch from 4da1cf3 to 1f33351 Compare May 31, 2026 19:07


Development

Successfully merging this pull request may close these issues.

E2E coverage gap: dora cluster install/up/down (SSH + systemd lifecycle)

1 participant