test(qa): add cluster-e2e nightly job exercising real SSH path #1982
Open
phil-opp wants to merge 9 commits into
3267ec3 to 0747769
…discovery without multicast
#1778 introduced `DORA_ZENOH_CONNECT` so spawned nodes can bootstrap Zenoh peer discovery without multicast, but only for the daemon↔node hop. The daemon↔daemon hop still relies on multicast scouting, which silently breaks cross-daemon dataflows in dev containers and many CI environments (the source produces, but messages never reach peers on other daemons because they can't discover each other).
This adds a shared rendezvous knob:
1. `open_zenoh_session_with_listen` gains an `inter_daemon_peer: Option<&str>` parameter (libraries/core/src/topics.rs). When set, it is added to both `listen/endpoints` and `connect/endpoints` and multicast scouting is disabled. The existing `listen/exit_on_failure: false` semantics let the all-daemons-bind-same-port pattern degrade cleanly: the first daemon wins the bind and acts as the gossip hub, the rest fall through to connect-only.
2. `dora daemon --zenoh-peer <endpoint>` exposes it on the CLI (e.g. `--zenoh-peer tcp/192.168.1.1:5456`). Threaded through `Daemon::run`, `run_with_builds`, and `run_general`. The `run_dataflow` path (local-only dataflows) explicitly passes `None`.
3. `ClusterConfig.zenoh_peer: Option<String>` (top-level, not per-machine: it MUST be the same on every daemon to function as a rendezvous). `dora cluster up` reads it and appends `--zenoh-peer <ep>` to every daemon spawn command. `dora cluster install` does the same for the generated systemd unit's ExecStart.
Backward compatible: every change is opt-in. Existing cluster.yml files without `zenoh_peer` parse unchanged and daemons continue using multicast scouting (which works fine for production hosts on real networks).
Two new unit tests in cluster::config; existing 11 still pass. fmt + clippy clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cluster-smoke (already in nightly) covers cluster status/down against a locally-spawned coordinator+daemon but does NOT exercise `dora cluster up` or any SSH-based daemon spawn, the most failure-prone part of the feature.
cluster-e2e fills that gap: it stands up a real openssh-server on a random loopback port, generates a per-run keypair and ssh wrapper on PATH (so the test never touches the user's real ~/.ssh), then runs `dora cluster up/status/down` against a cluster.yml with 3 machines all pointing at localhost (distinct ssh + daemon_port per machine, the two schema fields added by #PR1 and #PR2 in this series).
Wired everywhere cluster-smoke is: scripts/qa/ci-nightly-jobs.sh (function + known_job() + usage + dispatch), .github/workflows/nightly.yml (matrix job + file-issue-on-failure mappings), Makefile (qa-cluster-e2e target), and the CLAUDE.md nightly-job inventory (count bumped 18→19, 14→15).
Validated end-to-end in the dev container: all three SSH-spawned daemons register, `cluster status` lists them, `cluster down` tears everything down with no leftover processes.
Linux-only: macOS sshd defaults and Windows OpenSSH would each need distinct setup that isn't worth the maintenance. Hard-fails (per the agreed policy) if openssh-server is missing; the GHA job installs it before invoking the script.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
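The per-run keypair and PATH wrapper described above could look roughly like the sketch below. It is not the script from this PR: the directory layout, the key file name, and the `IdentitiesOnly`, `StrictHostKeyChecking`, and `UserKnownHostsFile` options are assumptions chosen for illustration. The `-F /dev/null` and `GlobalKnownHostsFile=/dev/null` options are the ones a later commit in this PR says it adds.

```bash
#!/usr/bin/env bash
# Illustrative per-run SSH fixture; all paths and names are hypothetical.
fixture_dir="$(mktemp -d)"
mkdir -p "$fixture_dir/bin"
key_file="$fixture_dir/id_ed25519"

# Resolve the real client before the wrapper shadows it on PATH.
real_ssh="$(command -v ssh || echo /usr/bin/ssh)"

# Generate a throwaway keypair when ssh-keygen is available.
if command -v ssh-keygen >/dev/null 2>&1; then
  ssh-keygen -q -t ed25519 -N '' -f "$key_file"
fi

# Write a wrapper that ignores user and system SSH configuration,
# so nothing under ~/.ssh or /etc/ssh can influence the loopback test.
cat > "$fixture_dir/bin/ssh" <<EOF
#!/usr/bin/env bash
exec "$real_ssh" -F /dev/null \\
  -i "$key_file" \\
  -o IdentitiesOnly=yes \\
  -o StrictHostKeyChecking=no \\
  -o UserKnownHostsFile=/dev/null \\
  -o GlobalKnownHostsFile=/dev/null \\
  "\$@"
EOF
chmod +x "$fixture_dir/bin/ssh"

# Anything spawned from here on resolves `ssh` to the wrapper first.
export PATH="$fixture_dir/bin:$PATH"
```

Putting the wrapper on PATH, rather than passing flags at each call site, means a tool that shells out to plain `ssh` picks up the isolated configuration without any change to that tool.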
1. Leftover-detection: SSH-spawned daemons have argv[0]="dora" (PATH resolved), not "$CLI_ROOT/bin/dora", so `pgrep -f "$CLI_ROOT/bin/dora"` would miss them; a failed cluster down could leak daemons silently while the leftover-process check still passed. Add a `find_our_dora_pids` helper that combines pgrep (argv) with `/proc/<pid>/exe` (binary), and route the shared `terminate_our_dora_children`, `cleanup_all_managed`, and the cluster-e2e leftover check through it. /proc is Linux-only; on macOS we fall back to pgrep alone, same as before. CLI_ROOT is a unique per-run mktemp path so neither method has false positives.
2. CI redundancy: the nightly workflow downloads the shared `dora-cli` artifact but the helper rebuilds via `cargo install` regardless, wasting 3-5 min on cold cache and risking the 15-min timeout. Add a `DORA_QA_CLI_BIN` env var hook in `ensure_cli_installed`: when set to an executable, copy it into CLI_ROOT instead of cargo-installing. The cluster-e2e GHA job sets it to the downloaded artifact path. Local dev leaves the var unset and pays the cargo cost as before. Other jobs are unaffected.
3. Stale counts: bump `scripts/qa/ci-nightly-jobs.sh` header and `Makefile` qa-nightly comment from 18/14 to 19/15, matching the CLAUDE.md update in the prior commit.
Validated locally: cluster-e2e PASS via the cargo-install path (artifact-reuse path is a small early-return that copies the binary into CLI_ROOT; downstream test logic is identical).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
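To make the leftover-detection idea concrete, here is a minimal sketch of a helper that combines an argv match with a `/proc/<pid>/exe` match. It is not the `find_our_dora_pids` from this PR: the function name `find_pids_under_root` and its single-argument interface are hypothetical, and it assumes a Linux `/proc`, falling back to pgrep alone elsewhere.

```bash
#!/usr/bin/env bash
# find_pids_under_root <root>: print PIDs whose command line mentions <root>
# or whose executable lives under <root>. Hypothetical helper, for illustration.
find_pids_under_root() {
  local root="$1" dir exe
  {
    # Method 1: argv match. Misses processes started as plain `dora` via PATH.
    if command -v pgrep >/dev/null 2>&1; then
      pgrep -f "$root/" || true
    fi
    # Method 2: binary match through /proc/<pid>/exe (Linux only).
    if [ -d /proc/self ]; then
      for dir in /proc/[0-9]*; do
        # readlink fails for processes we cannot inspect; skip those.
        exe="$(readlink "$dir/exe" 2>/dev/null)" || continue
        case "$exe" in
          "$root"/*) echo "${dir#/proc/}" ;;
        esac
      done
    fi
  } | sort -un   # de-duplicate PIDs reported by both methods
}
```

With a unique per-run root such as a `mktemp -d` directory, an empty result from `find_pids_under_root "$CLI_ROOT"` after teardown would indicate no leftover processes from that run.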
Extend the cluster-e2e job to start, run, verify, and stop a real dataflow through the SSH-spawned cluster, in addition to the existing up/status/down lifecycle.
Builds the three rust-dataflow example nodes (same set cluster-smoke uses), generates an inline dataflow.yml with all nodes pinned to m1 via `_unstable_deploy.machine`, runs `dora start --detach`, gives it 5s to spawn and process, asserts the dataflow appears in `dora list`, then `dora stop`s it before `cluster down`.
All nodes are pinned to m1 (single-machine within the SSH-spawned cluster) rather than spread across m1/m2/m3 because the daemon currently opens its Zenoh session with coordinator_addr=None (libraries/core/src/topics.rs:817), so cross-daemon peer discovery relies on multicast scouting, which does not work in nested-container environments like devcontainers. The schema has no per-daemon hook for `DORA_ZENOH_CONNECT` today. A follow-up that teaches `cluster up` to plumb a shared peer endpoint into the per-daemon env (or runs a zenohd router in the fixture) would let us cover the cross-daemon Zenoh data plane too; for now the single-machine pin still validates the real-execution path through an SSH-spawned daemon, which is the core gap cluster-smoke leaves open.
Validated locally: dataflow runs, reaches `Finished` status in dora list, cluster down cleans up with no leftover processes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1. cluster up/install/uninstall/upgrade now exit non-zero on partial
failures (binaries/cli/src/command/cluster/{up,install,uninstall,
upgrade}.rs). Previously they printed "Cluster partially up" or a
`n/m succeeded` summary and still returned `Ok(())`, so scripts and
CI couldn't tell apart "fully up" from "two daemons unreachable +
one ssh timeout." `up` additionally tracks daemons that didn't
register within the 30s window and counts them toward the failure
total. **NOTE for the reviewer:** these four files are also touched
by the open `cluster-daemon-port` PR. If you'd prefer this fix to
land in that parent PR, this commit can be cherry-picked across and
dropped here on rebase.
2. Critical script commands now have explicit exit-status checks
(scripts/qa/ci-nightly-jobs.sh). `run_job` invokes jobs via
`if "$fn"; then`, which disables `set -e` inside the function per
POSIX/bash inheritance rules — so a failing `cargo build`,
`ssh-keygen`, or `dora cluster down` would silently fall through to
the next step. Wrapped each of those in `if ! ...; then return 1;
fi`, and added a comment to the function header documenting the
gotcha so future contributors don't reintroduce silent failures.
Pure local-FS ops (mkdir/chmod on a fresh mktemp dir) are left
unchecked — they only fail under disk pressure, which surfaces
later.
3. SSH wrapper now passes `-F /dev/null` and
`-o GlobalKnownHostsFile=/dev/null` so a local `~/.ssh/config`
`Host localhost` rule (HostName, ProxyCommand, etc.) and any
`/etc/ssh/ssh_known_hosts` entry can't influence the loopback test.
Same change in the scp wrapper.
Validated: cluster::config tests still 11/11 pass, fmt + clippy clean,
cluster-e2e PASS end-to-end (3 SSH-spawned daemons, dataflow runs to
Finished status on m1, cluster down clean).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
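The `set -e` gotcha from item 2 can be reproduced in isolation. The sketch below uses `false` as a stand-in for a failing `cargo build` or `ssh-keygen`; the function names are hypothetical and only illustrate the pattern, not the actual job bodies.

```bash
#!/usr/bin/env bash
set -e

unchecked_reached=no
checked_reached=no

# Without an explicit check: when this function is called as an `if`
# condition, errexit is ignored inside it, so execution continues past `false`.
job_unchecked() {
  false                      # stand-in for a failing critical command
  unchecked_reached=yes      # still runs, and the job reports success
}

# With an explicit check: the failure is turned into an early return.
job_checked() {
  if ! false; then
    return 1
  fi
  checked_reached=yes        # never runs
}

# run_job-style invocation: the function is the condition of an `if`.
if job_unchecked; then unchecked_status=0; else unchecked_status=$?; fi
if job_checked; then checked_status=0; else checked_status=$?; fi

echo "unchecked: status=$unchecked_status reached=$unchecked_reached"
echo "checked:   status=$checked_status reached=$checked_reached"
```

The unchecked job reports status 0 even though its critical command failed, which is the silent fall-through described above; the checked job returns 1 and skips the remaining steps.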
Now that the parent PR (cluster-zenoh-peer) provides the shared
inter-daemon Zenoh rendezvous, switch the cluster-e2e job from the
single-machine pin (all nodes on m1) to a real distributed dataflow:
- rust-node on m1 (source)
- rust-status-node on m2 (filter)
- rust-sink on m3 (sink)
`zenoh_peer: tcp/127.0.0.1:$ZENOH_PORT` in the generated cluster.yml
makes daemons find each other via the explicit rendezvous instead of
multicast scouting (which doesn't work on loopback in nested
containers). Cross-daemon messages now flow over the inter-daemon
Zenoh data plane — the actual distributed code path — rather than the
within-daemon shared-memory shortcut.
Also adds `dora cluster restart cluster.yml cluster-e2e` between the
start-and-verify and stop phases:
- Polls until the dataflow reaches Running (RestartByName only
matches running_dataflows, not archived — coordinator/lib.rs::
restart_dataflow).
- Asserts restart prints "dataflow restarted: <old> -> <new>" and
extracts the new UUID.
- Verifies the new UUID is visible in `dora list` afterward.
Validated locally end-to-end: 3 SSH-spawned daemons, dataflow runs
across m1/m2/m3, restart produces a new UUID, cluster down clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
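Polling until a dataflow reaches a given status, as described above, can be sketched with a small generic helper. Both function names are hypothetical, and `dataflow_has_status` assumes that `dora list` prints the dataflow name and its status on the same line, which is an assumption about the output format rather than something this PR states.

```bash
#!/usr/bin/env bash
# poll_until <timeout_secs> <command...>: retry the command once per second
# until it succeeds or the timeout elapses. Returns 1 on timeout.
poll_until() {
  local timeout="$1"
  shift
  local deadline
  deadline=$(( $(date +%s) + timeout ))
  while ! "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1
    fi
    sleep 1
  done
  return 0
}

# dataflow_has_status <name> <status>: hypothetical predicate that assumes
# name and status appear on one line of `dora list` output.
dataflow_has_status() {
  dora list 2>/dev/null | grep -F -- "$1" | grep -qF -- "$2"
}
```

A call such as `poll_until 30 dataflow_has_status cluster-e2e Running || return 1` would then replace a fixed sleep, failing the job only if the status is not observed within the window.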
…erification
Temporarily duplicates the nightly cluster-e2e job into PR CI so the job runs on a real ubuntu-latest GHA runner before merge. Drop this commit once green; the canonical home is nightly.yml::cluster-e2e.
Unlike the nightly job, this one doesn't depend on the build-cli artifact (PR CI doesn't produce one): ensure_cli_installed cargo-installs the CLI from source inside the script. Adds ~3-5 min on cold cache; the 25-min job timeout has headroom.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
0747769 to 4f85063
1. cluster-e2e: assert cross-daemon messages actually flowed. Replaced "sleep 5 + grep new UUID in dora list" with a poll for the restarted dataflow to reach `Finished` status (max 30s). Both downstream nodes (rust-status-node, rust-sink) only exit when their upstream input closes, and those closures propagate over inter-daemon Zenoh. A broken data plane leaves both nodes blocked in `events.recv()` and the dataflow stuck Running, which a Running-only assertion would silently accept. Finished status proves both cross-daemon hops (rust-node@m1 -> status@m2 -> sink@m3) actually carried data.
2. cluster-e2e: stop leaking PATH and DORA_COORDINATOR_PORT into subsequent ci-nightly-jobs.sh jobs. Wrapped the dora-calling test body in a subshell so the env exports go out of scope when the subshell exits. Sshd termination + tempdir rm move outside the subshell so they run regardless of pass/fail. Replaced `return 1` with `exit 1` inside the subshell; the outer function captures the subshell's exit code and returns it. Verified locally: after the job exits, $PATH and $DORA_COORDINATOR_PORT match their pre-job values.
3. cluster restart docs were stale: `dora cluster restart` takes `<PATH> <NAME>`, not bare `<NAME>` (docs claimed "no YAML path needed"). Also dropped the "name or UUID" claim: the request is sent as `RestartByName`, which the coordinator's `resolve_name` matches against the dataflow's `name` field only (handlers.rs). Fixed both the markdown docs and the clap `///` doc-string on `Restart.dataflow` so `--help` is also correct.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
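The subshell isolation from item 2 follows a common shell pattern, sketched below. The function name, the port number, and the directory path are hypothetical placeholders; the point is only that exports made inside `( ... )` vanish when it exits, while cleanup placed after it runs on both pass and fail.

```bash
#!/usr/bin/env bash
# run_isolated_job <exit_code>: hypothetical job whose test body runs in a
# subshell and finishes with the given exit code.
run_isolated_job() {
  local status=0
  (
    # These exports are visible only inside the subshell.
    export PATH="/tmp/hypothetical-cli-root/bin:$PATH"
    export DORA_COORDINATOR_PORT=16012
    # Test body goes here. Use `exit`, not `return`, to leave the subshell.
    exit "${1:-0}"
  ) || status=$?

  # Cleanup lives outside the subshell, so it runs regardless of pass/fail.
  cleanup_ran=yes
  return "$status"
}
```

Because the subshell's exit code is captured into `status`, the outer function can still report failure to its caller after cleanup has finished.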
4da1cf3 to 1f33351
Adds a new nightly job that stands up a real openssh-server on a random loopback port and runs `dora cluster up/status/down` against a 3-machine cluster.yml (all `host: localhost`, distinct ssh `port` + `daemon_port` per machine; these are the two schema fields landed in # and #). cluster-smoke (already in nightly) covers `status/down` against a locally-spawned daemon but never exercises the SSH path of `cluster up`; this fills that gap.
Wired in the standard places: `scripts/qa/ci-nightly-jobs.sh` (function + dispatch + usage), `.github/workflows/nightly.yml` (matrix entry + failure-issue mappings), `Makefile` (`qa-cluster-e2e` target), and the CLAUDE.md nightly inventory.
Validated end-to-end in the dev container: 3 SSH-spawned daemons register, status lists all three, down tears everything down with no leftovers. Linux-only; hard-fails (per agreed policy) if `openssh-server` is missing. GHA workflow `apt-get install`s it before running the script.
Stacks on #1979; rebase on main once that lands.
Partially resolves #1704