Add netsim scenarios #352
Open
Rekseto wants to merge 48 commits into
Install the astral-agent skill into the Qwen Code operator. The netsim host owns a deploy key (SATFORGE_SKILLS_DEPLOY_KEY); run.sh injects it into the VM, which clones the private satforgedev/skills repo, builds the satforge-skills linker (Go already present from install-astrald), and runs `link astral-agent --target qwen` -> ~/.qwen/skills/astral-agent. Folded into lab.story after install-qwen-code; documented in the task README (one-time deploy-key setup) and netsim/README. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The astral-agent skill is installed as a tree of symlinks, but the verify file count used plain `find`, which does not traverse symlinked directories and therefore undercounted. Use `find -L` to follow symlinks (and silence transient errors) so the count reflects the materialized tree. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
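The undercount described above can be reproduced outside the VM. The following is a minimal Python sketch, not the task's verify code: `os.walk` plays the role of `find` (it lists a symlinked directory but does not descend into it unless `followlinks=True`), and the directory and file names are hypothetical. It is an analogue rather than an exact equivalent of `find` versus `find -L`.

```python
import os
import tempfile


def count_files(root, follow_symlinks):
    """Count directory entries that os.walk reports as files under root.

    With follow_symlinks=False, a symlinked directory is listed but never
    entered, so the files behind it are not counted.
    """
    total = 0
    for _dirpath, _dirnames, filenames in os.walk(root, followlinks=follow_symlinks):
        total += len(filenames)
    return total


def demo():
    """Build a tiny skill tree that is reachable only through a symlink."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical layout: the real files live in one place ...
        real = os.path.join(tmp, "checkout", "astral-agent")
        os.makedirs(real)
        for name in ("SKILL.md", "playbook.md"):
            open(os.path.join(real, name), "w").close()
        # ... and the skills directory only holds a symlink to them.
        skills = os.path.join(tmp, "skills")
        os.makedirs(skills)
        os.symlink(real, os.path.join(skills, "astral-agent"))
        return count_files(skills, False), count_files(skills, True)
```

Calling `demo()` returns the two counts side by side: the walk that does not follow symlinks sees no files, while the walk that follows them sees both.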
Thin-prompt, skill-driven swarm task: a two-sentence prompt tells the in-VM Qwen operator to make node1 a User-controlled node by following its astral-agent skill's node-setup playbook (software-User path), without restating the procedure. run.sh base64-ships the prompt over one `netsim ssh` argv and runs `qwen -y` as tester; verify.sh independently reads the persisted User token and asserts apphost.whoami = User id and user.info returns the active contract. Standalone (not in lab.story): `netsim task --stage astrald-lab --save astrald-user bootstrap-user`. Validated end-to-end on a live astrald-lab. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Thin-prompt, skill-driven swarm task chained onto bootstrap-user: a two-sentence prompt drives the Qwen operator to claim node2 into the User's swarm via its astral-agent skill's node-claiming playbook (`user.claim`, with nearby handling reachability). verify.sh is an independent both-ends check -- both nodes hold a contract from the same User, node1 lists node2 as a Linked sibling, and a mutual link exists -- parsing the astral-query JSON object-stream line-by-line. Standalone: `netsim task --stage astrald-user --save astrald-swarm link-swarm`. Validated end-to-end (two nodes in one User Swarm). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
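Parsing a JSON object-stream line-by-line, as the verifier above is described as doing, could look roughly like the sketch below. It assumes one JSON object per non-empty line; the function name and the sample field names are hypothetical and are not taken from astral-query's actual output.

```python
import json


def parse_object_stream(text):
    """Parse newline-delimited JSON: one object per non-empty line.

    Raises ValueError naming the offending line, so a malformed stream
    fails the check loudly instead of being silently skipped.
    """
    objects = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank separator lines
        try:
            objects.append(json.loads(line))
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {lineno}: not valid JSON: {exc}") from exc
    return objects


# Hypothetical sample; real astral-query output will have different fields.
SAMPLE = '{"alias": "node1", "linked": true}\n\n{"alias": "node2", "linked": true}\n'
```

Failing on the first unparsable line, rather than ignoring it, keeps a truncated or interleaved stream from passing a check by accident.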
First scenario past swarm formation: store an astral object on node1 and prove sibling node2 can obtain it by Object ID across the swarm. The thin prompt drives the Qwen operator (acting as its User) to objects.store a text payload and record the id. The cross-swarm fetch lives in verify.sh, not the prompt: from node2 it tries a ladder -- explicit-target <node1>:objects.load, transparent objects.load, then objects.find -- and asserts the bytes match, distinguishing a routing failure from an auth rejection. Standalone: `netsim task --stage astrald-swarm --save astrald-shared share-object`. Drafted; not yet run end-to-end (the cross-swarm read hop is inferred from the docs). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The post-install probe only waited ~10s for astrald to come up, but a fresh
astrald's first start (node-key generation + SQLite init), coming right after
a CPU-heavy go build that still loads the VM, can take longer -- it flaked on
an otherwise-clean lab build ("astrald did not come up"). Wait up to ~90s, and
on failure dump `systemctl status` + `journalctl -u astrald` so the failure
message carries a real diagnosis instead of being opaque. Validated: the lab
build passed with the wider window on both nodes.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
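The wait-then-diagnose pattern from this commit can be sketched as a small polling helper. This is an illustration in Python, not the probe shipped in install-astrald; `wait_until` and its parameters are hypothetical names, and the clock and sleep functions are injectable only so the helper can be exercised without real delays.

```python
import time


def wait_until(probe, timeout_s, interval_s=1.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll probe() until it returns truthy or timeout_s elapses.

    Returns True on success and False on timeout. The probe always runs
    at least once, even with a zero timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A caller would pass a probe that checks whether astrald answers, with `timeout_s=90`; on a `False` result it would then collect `systemctl status` and `journalctl -u astrald` output before failing, as the commit message describes.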
…ect) The link-swarm and share-object verifiers were shell scripts that gathered astral-query JSON and parsed it with embedded python heredocs -- awkward, and the parsing couldn't be unit-tested without booting a VM. Move all logic into a real verify.py per task (calls `netsim ssh ... astral-query` via subprocess, parses the JSON streams, asserts); verify.sh becomes a thin shim: `exec python3 "$NETSIM_TASK_DIR/verify.py" "$@"`. netsim sets $NETSIM_TASK_DIR to the task dir and only auto-runs run.sh/verify.sh, so verify.py sits alongside and is found cleanly. Behavior-preserving: parsers golden-file tested against captured JSON, and the full pipeline re-run fresh on NFS -- link-swarm verify PASSES, share-object verify reproduces the cross-swarm-fetch diagnostic, both via the new shim. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Rename swarm wording to match astrald master (PR #350) and the updated astral-agent skill: user.claim -> user.adopt, the node-claiming playbook -> node-adoption, and mod.user.swarm_access_action -> mod.user.swarm_membership_action. Docs/comments/prompt wording only; no verifier logic changes.
Add a fourth verifier check: node2 must list node1 as a Linked sibling (user.swarm_status, which derives from node2's own active contract, so no token). This is a direct regression guard for astrald #348 (roster sync to a newly adopted node) and the precondition share-object's write direction relies on. Offline golden test: post-#348 passes, pre-#348 (roster={node2}) correctly fails.
Pivot from the (blocked) cross-swarm read to the now-unblocked write direction: the agent stores an object ON node2 (<node2>:objects.store) and reads it back; verify.py independently proves node2 physically holds it via repo-pinned, ungated objects.load/contains -repo local. Unblocked by #348 (node2 now recognizes node1 -> AuthorizeRelayFor permits the relayed store, which reaches the ungated op_store). Caveat documented: op-level write is unauthenticated (CreateObjectAction still unwired).
Move lab.story into netsim/stories/ and add one story per tested flow (bootstrap-user, link-swarm, share-object), each a thin task list with a start/save stage header so a story doubles as a pass/fail integration test. Refresh netsim/README.md (full task list, swarm pipeline via stories) and reconcile running-as-a-service.md snapshot guidance (disk image: stop; live RAM snapshot: leave running).
The link-swarm and share-object verifiers carried long module docstrings restating rationale already in their README.md. Cut to a one/two-line summary; no logic change.
Cut the per-task READMEs down to a short paragraph (what the task does + the stage it produces); dropped the execution-model, build-facts, verify-internals, deploy-key setup, and security-note sections. No behavior change.
Replace the scattered ~/.netsim/{user.id,user.token,object.*} files with a
single $HOME/info.json (/home/tester/info.json) holding user_id, user_token,
object_id, object_payload, object_readback, object_target. bootstrap-user writes
user_*; share-object merges object_* (keeping user_*); verifiers and smoke-checks
read the JSON (python3 in-VM for shell, host-side json for verify.py). Transient
prompt/log files stay under ~/.netsim.
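The "merge object_* while keeping user_*" behaviour amounts to a read-update-write on one JSON file. Below is a minimal sketch under that assumption; `merge_info` is a hypothetical helper, not code from the tasks, and the temp-file-plus-rename step is an added precaution rather than something the commit states.

```python
import json
import os


def merge_info(path, updates):
    """Merge updates into the JSON object stored at path, keeping other keys.

    A missing file is treated as an empty object. The result is written to
    a temporary file and renamed into place so a crash cannot leave a
    half-written info.json behind.
    """
    try:
        with open(path, encoding="utf-8") as fh:
            info = json.load(fh)
    except FileNotFoundError:
        info = {}
    info.update(updates)
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as fh:
        json.dump(info, fh, indent=2)
    os.replace(tmp_path, path)
    return info
```

With this shape, bootstrap-user would call it with the `user_*` keys and share-object with the `object_*` keys, and the second call leaves the first call's entries intact.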
A drop-in alternative to bootstrap-user: instead of minting fresh entropy, the agent derives the User key from a provided BIP-39 mnemonic (ASTRAL_USER_MNEMONIC) and installs node1's active contract under that existing software User. verify.sh asserts node1 is a User node and, if ASTRAL_USER_ID is set, that the derived id matches exactly (proof the existing key was used). Produces stage astrald-user.
Rename the two first-node User-setup tasks to spell out the key variant: bootstrap-user -> bootstrap-user-software-key (new soft key), import-user -> import-user-software-key (existing soft key, known mnemonic). Renames the task dirs + story files and updates every reference (internal messages, prompt/log basenames, cross-references in link-swarm/share-object, README layout/pipeline). Leaves room for hardware-key variants later.
The previous commit swept in py_compile byte-cache via 'git add -A'. Remove the .pyc artifacts and add netsim/.gitignore for __pycache__/*.pyc.
Bake a valid BIP-39 mnemonic (the canonical all-zero-entropy test vector) into prompt.md instead of an __MNEMONIC__ placeholder; the task is now self-contained and reproducible. run.sh ships prompt.md verbatim (drops the ASTRAL_USER_MNEMONIC requirement and sed substitution). verify.sh's optional ASTRAL_USER_ID assertion is unchanged.
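For readers unfamiliar with why the all-zero-entropy mnemonic is reproducible, the sketch below derives BIP-39 word indices from raw entropy using only the standard library. It assumes the 128-bit (12-word) variant, which the commit does not state explicitly, and it stops at indices rather than embedding the 2048-word list.

```python
import hashlib


def bip39_word_indices(entropy):
    """Return the BIP-39 word-list indices for the given entropy bytes.

    BIP-39 appends the first ENT/32 bits of SHA-256(entropy) as a checksum
    and splits the result into 11-bit groups, one per word.
    """
    ent_bits = len(entropy) * 8
    if ent_bits not in (128, 160, 192, 224, 256):
        raise ValueError("entropy must be 128, 160, 192, 224 or 256 bits")
    cs_bits = ent_bits // 32
    checksum_byte = hashlib.sha256(entropy).digest()[0]
    # Entropy bits followed by the top cs_bits bits of the checksum.
    bits = (int.from_bytes(entropy, "big") << cs_bits) | (checksum_byte >> (8 - cs_bits))
    total_bits = ent_bits + cs_bits
    return [
        (bits >> (total_bits - 11 * (i + 1))) & 0x7FF
        for i in range(total_bits // 11)
    ]


# Assumption: 16 zero bytes, i.e. the 12-word all-zero test vector.
ZERO_ENTROPY_INDICES = bip39_word_indices(bytes(16))
```

For 16 zero bytes this yields eleven indices of 0 followed by a final index of 3; in the English word list those are "abandon" repeated eleven times and then "about". Only the last word depends on the checksum, which is why a hand-typed all-"abandon" phrase is not a valid mnemonic.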
Skills moved off GitHub to ssh://git@git.satforge.dev/satforge/skills.git. Update the default SATFORGE_SKILLS_REPO and the comments/README; drop the GitHub-specific 443 fallback note. Host-key handling (StrictHostKeyChecking=accept-new) already covers the new host, and the deploy-key flow is unchanged (the key must now be registered on git.satforge.dev).
Rename the second-node task to match the swarm vocabulary (user.adopt): link-swarm -> adopt-node. Renames the task dir + story file and updates every reference (internal messages, prompt/log basenames, README layout/pipeline, share-object cross-reference). Stage names unchanged (astrald-user -> astrald-swarm).
…gle-node Both bootstrap-user-software-key and import-user-software-key produce the same single-node stage (a node set up as a User; they differ only in the User key — random vs the embedded mnemonic). Name it astrald-single-node and point adopt-node at it. Stage-name change only (set via --save/--stage); no script logic depends on it.
Split the object lifecycle into two focused scenarios and drop the combined (write-direction) share-object:
- object-store (0006): node1 stores an object in its OWN local repo and reads it back; agent-driven, verify re-loads -repo local. astrald-swarm -> astrald-stored.
- read-remote-object (0007): node2 reads node1's stored object OVER ASTRAL; host-driven (node2 has no operator), verify runs the <node1>:objects.load ladder and asserts the bytes. astrald-stored -> astrald-read. This is the peer-reads-node1 direction that failed pre-#348, re-probed on current master.
Rewire README pipeline accordingly.
astrald-single-node -> one-node, astrald-swarm -> two-nodes, astrald-stored -> two-nodes-data, astrald-read -> two-nodes-data-read. Stage-name change only (set via --save/--stage in story headers + docs); no script logic depends on it. astrald-lab kept as the base build fixture.
object-store now stores either in node1's own local repo (--target self, default) or on the sibling node2 (--target peer, via <node2>:objects.store) — so one story tests local storage and another tests storing on a peer. run.sh selects the prompt (prompt.md / prompt-peer.md); verify.py checks the holder's local repo (node1 for self, node2 for peer). Adds object-store-peer.story (two-nodes -> two-nodes-data-peer); object-store.story stays self -> two-nodes-data and feeds read-remote-object.
…ters node aliases Replace object-store's abstract --target self|peer with a real astral query target (--target, default localnode; e.g. node2). One prompt template (drops prompt-peer.md): the agent stores on / reads back from <target>, forming the right query itself. verify.py maps target -> holder (localnode/node1 -> node1, node2 -> node2). adopt-node now registers node1/node2 directory aliases (dir.set_alias) on both nodes when the swarm forms, so tasks can address nodes by name. Also fixes adopt-node's stale soft-check (read the User token from info.json, not the removed user.token). object-store-peer.story now passes --target node2.
Strip astral-agent/playbook/skill-location references and harness meta ('the skill
won't mention this'); the operator already has the skill auto-loaded. Prompts now
read like a person's request (still naming astral/astrald), keeping only the task
plus a terse 'save results to ~/info.json' the automated check needs.
The User on node1 permanently bans node2 from the swarm via user.expel, driven by the Qwen operator through its astral-agent skill. verify.py confirms the ban from both ends: node2 lands in user.list_expelled, drops out of user.swarm_status (OpSwarmStatus lists ActiveNodes, which filters the expelledSet), and the node1<->node2 link is torn down. README registers the new task/story and the two-nodes -> two-nodes-expel branch.
The old read-remote-object was host-driven and read node2->node1 anonymously, which can't route (network zone stripped) -- it tested the wrong, unroutable direction. Now it's agent-driven on node1: the agent reads the object (id from ~/info.json, written by object-store --target node2) FROM the peer as the User -- the authenticated, routable direction -- and records what it read. verify.py independently re-reads <peer>:objects.load as the User and asserts the bytes. New read-remote-peer.story chains object-store --target node2 (store on the peer) then read-remote-object (read it back from node1). Drops the old read-remote-object.story; README pipeline updated.
Minimizing the prompts dropped the 'keep existing keys' hint, so object-store's agent overwrote ~/info.json with object_* and wiped the user_token bootstrap wrote -- breaking read-remote-object's verify (which reads the peer as the User). Restore a natural 'leaving the existing entries in place' instruction in object-store and read-remote-object.
Each task writes its own file (no shared accumulator, no merge, no clobbering):
- bootstrap/import -> ~/user.json (user_id, user_token)
- object-store -> ~/object.json (object_id, object_payload, object_readback)
- read-remote-object -> ~/read.json (object_remote)
Readers reference the specific file(s) they need: adopt-node + expel-node read user.json; object-store verify reads object.json; read-remote-object verify reads user.json + object.json + read.json. Prompts drop the 'keep existing entries' hint (own file, overwrite is fine). Updates expel-node's reads to user.json too.
object-store now ships a fixed payload.txt to the operator and tells the agent to store that file's contents (deterministic id/bytes) instead of inventing 'distinctive text'; verify.py (object-store and read-remote-object) uses the shipped file as ground truth. Simplify every task prompt to precise, minimal wording and name __TARGET__/__PEER__ as astral nodes.
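Using the shipped file as ground truth comes down to a byte-for-byte comparison between payload.txt and whatever was read back. The helper below is a hypothetical sketch of such a check, not the verifier's actual code; reporting the first differing offset is an added convenience for diagnosis.

```python
def first_mismatch(expected, actual):
    """Return None if the byte strings are equal, else the first differing offset.

    If one input is a strict prefix of the other, the offset is the length
    of the shorter input.
    """
    for offset, (want, got) in enumerate(zip(expected, actual)):
        if want != got:
            return offset
    if len(expected) != len(actual):
        return min(len(expected), len(actual))
    return None


def assert_payload_matches(payload_path, readback):
    """Compare read-back bytes against the shipped payload file."""
    with open(payload_path, "rb") as fh:
        expected = fh.read()
    offset = first_mismatch(expected, readback)
    if offset is not None:
        raise AssertionError(
            f"payload mismatch at byte {offset}: "
            f"expected {len(expected)} bytes, got {len(readback)}"
        )
```

Comparing raw bytes rather than decoded text avoids false passes or failures caused by newline or encoding normalization on either side.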
…ls ref, minimized READMEs
- enable-tor: new host task, bring up a node with a Tor endpoint and save it to /root/tor.json (validated live: real onion published + saved).
- object-store: agent only stores + records object_id; verify owns the read-back and byte match against the shipped payload.txt.
- adopt-node: agent records swarm siblings to ~/siblings.json (sibling_ids); verify asserts it includes the adopted node.
- configure-astral-agent: SATFORGE_SKILLS_REF builds the lab against a skills branch (fails loudly if the ref can't be fetched).
- prompts: terser, human-style wording (adopt/bootstrap/import/expel/object-store).
- READMEs: minimized to astral-docs voice across all tasks.
… scenario 0004)
Restore the two parked tasks for Tor scenario 0004 ("a node leaves the LAN and
links over Tor"), completing the scenario alongside the already-committed
enable-tor building block. Sequenced by tor-link.story (two-nodes -> two-nodes-tor):
enable-tor -> leave-lan -> link-over-tor.
- leave-lan (host): seed node1 with node2's onion while the LAN is up, then
nftables-drop the LAN path between them (WAN/Tor egress stays up). verify.py
asserts node2 can no longer TCP-connect to node1:1791 (only a timeout passes).
- link-over-tor (agent): thin prompt drives the Qwen operator to force the swarm
link over Tor (nodes.new_link -strategies tor) per the astral-agent skill's
linking-over-tor playbook; verify.py independently asserts a nodes.links entry
with Network=tor.
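The leave-lan rule that "only a timeout passes" reflects how a dropped path behaves: an nftables drop discards packets silently, so a connect attempt hangs until it times out, whereas a refusal means a reply made it back and the path is still open. A minimal Python sketch of that classification follows; the function names are hypothetical and the `connect` parameter exists only so the logic can be tested without a network.

```python
import socket


def classify_connect(host, port, timeout_s=3.0, connect=socket.create_connection):
    """Attempt a TCP connect and classify the outcome as a short string."""
    try:
        conn = connect((host, port), timeout=timeout_s)
    except (socket.timeout, TimeoutError):
        # socket.timeout is an alias of TimeoutError on Python 3.10+.
        return "timeout"
    except ConnectionRefusedError:
        return "refused"
    except OSError:
        return "error"
    conn.close()
    return "connected"


def lan_path_is_dropped(host, port, timeout_s=3.0, connect=socket.create_connection):
    """True only when the attempt times out; refusals and errors do not count."""
    return classify_connect(host, port, timeout_s, connect) == "timeout"
```

Treating "refused" and other errors as failures keeps the check from passing when astrald is merely stopped or the address is wrong, neither of which proves the LAN path was cut.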
The linking-over-tor playbook is now on skills main (PR #4), so the lab builds
against main with no SATFORGE_SKILLS_REF override.
Checkpoint: not yet validated against the real Tor network (the original parking
gate -- VM WAN NAT -> Tor reachability + the agent's per-turn tool-call cap).
A resumed stage runs astrald + all userspace live; netsim's sync_clock corrects the stale snapshot clock with a ~day forward jump, which makes systemd's Persistent apt-daily/apt-daily-upgrade timers fire and unattended-upgrades saturate the 1-vCPU VMs for minutes -- breaking every resumed scenario (node unreachable, QMP save timeouts). astrald itself tolerates the jump. Fix it once in the image (standard ephemeral-VM hygiene): install-astrald masks apt-daily/apt-daily-upgrade/unattended-upgrades on the fresh build boot, so every stage is born quiet. The per-task quiescing in enable-tor/leave-lan is now redundant and removed (DPkg::Lock::Timeout kept). Validated by a full rebuild: the resumed scenarios no longer saturate (object-store-peer/read-remote/tor-link green, no QMP/ssh-banner timeouts); tor-link clears the real-Tor path end to end.
Expulsion is a membership change, not a disconnect -- a lingering link is permitted -- so verify no longer checks nodes.links. It asserts node2 is in user.list_expelled and gone from user.swarm_status. node2's identity now comes from node1's siblings.json (recorded by adopt-node), not from node2: once expelled, node2 rejects user.info (`query rejected (2)` when untokened, `auth_failed` with the User token), so it can't identify itself. Verified live against a post-expel stage.
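The two assertions that remain after this change can be expressed as a pure function over already-parsed id lists. The sketch below is illustrative only: `check_expelled` and its arguments are hypothetical, and it assumes the caller has already extracted node ids from user.list_expelled, user.swarm_status and siblings.json.

```python
def check_expelled(node_id, expelled_ids, swarm_member_ids):
    """Return a list of failure messages; an empty list means the check passed.

    Deliberately ignores link state: expulsion is a membership change, so a
    lingering link is not treated as a failure.
    """
    failures = []
    if node_id not in expelled_ids:
        failures.append(f"{node_id} is missing from the expelled list")
    if node_id in swarm_member_ids:
        failures.append(f"{node_id} is still listed in the swarm status")
    return failures
```

Collecting every failure instead of stopping at the first makes a single run show whether one side of the expulsion took effect and the other did not.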
The import/bootstrap prompts said "set up user ... save id + token", which the agent could satisfy by importing the key + minting a token without installing the node's active contract (user.info then rejects). State the end goal -- make this a User node, install the active contract -- so the agent runs the full node-setup flow. Validated: import-user now passes.
…allback, offline tests)
…nonymous WS sessions)
…op dead code, intent comments)
…ternal-checkout path)
No description provided.