Add netsim scenarios #352

Open
Rekseto wants to merge 48 commits into master from intern0/dev/netsim-scenarios

Conversation

@Rekseto Rekseto commented Jun 22, 2026

Member

No description provided.

Rekseto and others added 30 commits June 21, 2026 23:50
Install the astral-agent skill into the Qwen Code operator. The netsim
host owns a deploy key (SATFORGE_SKILLS_DEPLOY_KEY); run.sh injects it
into the VM, which clones the private satforgedev/skills repo, builds the
satforge-skills linker (Go already present from install-astrald), and
runs `link astral-agent --target qwen` -> ~/.qwen/skills/astral-agent.
Folded into lab.story after install-qwen-code; documented in the task
README (one-time deploy-key setup) and netsim/README.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The astral-agent skill is installed as a tree of symlinks, but the verify
file count used plain `find`, which does not descend into symlinked
directories and so undercounted. Use `find -L` to follow symlinks (and
silence transient errors) so the count reflects the materialized tree.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
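
For readers who want the same symlink-following count outside the shell, here is a minimal Python sketch. It is not the task's verify script; `count_files` is a hypothetical helper, and `os.walk(..., followlinks=True)` plays roughly the role of `find -L`.

```python
import os


def count_files(root: str, follow_symlinks: bool = True) -> int:
    """Count regular files under root (hypothetical helper, not from the PR).

    With follow_symlinks=True the walk descends into symlinked directories,
    which is what `find -L` does; with False it behaves like plain `find`.
    """
    total = 0
    # os.walk ignores unreadable directories by default (onerror=None),
    # which loosely mirrors "silence transient errors".
    for dirpath, _dirnames, filenames in os.walk(root, followlinks=follow_symlinks):
        for name in filenames:
            # isfile() follows symlinks, so a link to a regular file counts.
            if os.path.isfile(os.path.join(dirpath, name)):
                total += 1
    return total
```

Note that following links can loop forever if a symlink points back at one of its own parents; the sketch assumes the skill tree has no such cycle.
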
Thin-prompt, skill-driven swarm task: a two-sentence prompt tells the
in-VM Qwen operator to make node1 a User-controlled node by following its
astral-agent skill's node-setup playbook (software-User path), without
restating the procedure. run.sh base64-ships the prompt over one
`netsim ssh` argv and runs `qwen -y` as tester; verify.sh independently
reads the persisted User token and asserts apphost.whoami = User id and
user.info returns the active contract. Standalone (not in lab.story):
`netsim task --stage astrald-lab --save astrald-user bootstrap-user`.
Validated end-to-end on a live astrald-lab.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
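
The base64 shipping step can be sketched as follows. This is an illustration only: `remote_prompt_command` and the destination path are hypothetical, and how the resulting string is handed to `netsim ssh` is left out because the PR does not show it. The point is that base64 output contains no quotes, spaces, or newlines, so a multi-line prompt survives as one argument.

```python
import base64
import shlex


def remote_prompt_command(prompt: str, dest: str = "/tmp/prompt.md") -> str:
    """Build one shell command that recreates `prompt` at `dest` on the remote side.

    `dest` is an assumed path chosen for illustration.
    """
    encoded = base64.b64encode(prompt.encode("utf-8")).decode("ascii")
    # The base64 alphabet is shell-safe; shlex.quote is kept as a guard anyway.
    return f"printf %s {shlex.quote(encoded)} | base64 -d > {shlex.quote(dest)}"
```

The command depends only on `printf` and `base64 -d` being available in the VM, which is an assumption about the guest image.
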
Thin-prompt, skill-driven swarm task chained onto bootstrap-user: a
two-sentence prompt drives the Qwen operator to claim node2 into the
User's swarm via its astral-agent skill's node-claiming playbook
(`user.claim`, with nearby handling reachability). verify.sh is an
independent both-ends check -- both nodes hold a contract from the same
User, node1 lists node2 as a Linked sibling, and a mutual link exists --
parsing the astral-query JSON object-stream line-by-line. Standalone:
`netsim task --stage astrald-user --save astrald-swarm link-swarm`.
Validated end-to-end (two nodes in one User Swarm).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
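
A line-by-line parse of a JSON object stream might look like the sketch below. It assumes one JSON object per line, which is how the commit describes the astral-query output; `iter_objects` is a hypothetical name and the field names in the usage note are invented for illustration.

```python
import json
from typing import Iterator


def iter_objects(stream: str) -> Iterator[dict]:
    """Yield each JSON object found on its own line; skip blank lines."""
    for lineno, line in enumerate(stream.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError as exc:
            # Report the offending line so a failed verify is diagnosable.
            raise ValueError(f"line {lineno}: not valid JSON: {exc}") from exc
        if isinstance(obj, dict):
            yield obj
```

A caller would then filter the yielded objects, for example `[o for o in iter_objects(out) if o.get("linked")]`, where `linked` stands in for whatever field the real output carries.
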
First scenario past swarm formation: store an astral object on node1 and
prove sibling node2 can obtain it by Object ID across the swarm. The thin
prompt drives the Qwen operator (acting as its User) to objects.store a
text payload and record the id; the cross-swarm fetch lives in verify.sh,
not the prompt, which from node2 tries a ladder -- explicit-target
<node1>:objects.load, transparent objects.load, then objects.find -- and
asserts the bytes match, distinguishing a routing failure from an auth
rejection. Standalone:
`netsim task --stage astrald-swarm --save astrald-shared share-object`.
Drafted; not yet run end-to-end (the cross-swarm read hop is inferred
from the docs).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
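
The fetch ladder can be sketched as an ordered list of attempts that stops at the first byte-for-byte match and otherwise reports why each rung failed. Everything here is hypothetical scaffolding: the two exception classes stand in for however the real verifier tells a routing failure from an auth rejection, and each rung is just a callable returning bytes.

```python
from typing import Callable, List, Sequence, Tuple


class RoutingError(Exception):
    """Hypothetical: the query could not be routed to the target node."""


class AuthRejected(Exception):
    """Hypothetical: the target was reached but refused the query."""


Attempt = Tuple[str, Callable[[], bytes]]


def fetch_via_ladder(attempts: Sequence[Attempt], expected: bytes) -> str:
    """Return the name of the first rung whose bytes equal `expected`."""
    failures: List[str] = []
    for name, fetch in attempts:
        try:
            data = fetch()
        except AuthRejected as exc:
            failures.append(f"{name}: auth rejection ({exc})")
            continue
        except RoutingError as exc:
            failures.append(f"{name}: routing failure ({exc})")
            continue
        if data == expected:
            return name
        failures.append(f"{name}: bytes differ")
    # Every rung failed: surface the per-rung diagnosis in one message.
    raise AssertionError("; ".join(failures))
```

Keeping the per-rung reason in the final message is what lets a failed run say whether the hop was unroutable or merely unauthorized.
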
The post-install probe waited only ~10s for astrald to come up, but a fresh
astrald's first start (node-key generation + SQLite init) can take longer when
it runs right after a CPU-heavy go build that still loads the VM. It flaked on
an otherwise-clean lab build ("astrald did not come up"). Wait up to ~90s, and
on failure dump `systemctl status` + `journalctl -u astrald` so the message
is a real diagnosis instead of an opaque failure. Validated: the lab build
passed with the wider window on both nodes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
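
The wait-then-diagnose pattern described above can be sketched as a small polling helper. This is not the task's shell probe: `wait_until`, its parameters, and the 2s interval are assumptions, and the clock and sleep functions are injectable only so the logic can be tested without real waiting.

```python
import time
from typing import Callable, Optional


def wait_until(
    probe: Callable[[], bool],
    timeout: float = 90.0,
    interval: float = 2.0,
    on_timeout: Optional[Callable[[], None]] = None,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` until it returns True or `timeout` seconds have passed."""
    deadline = clock() + timeout
    while True:
        if probe():
            return True
        if clock() >= deadline:
            break
        sleep(interval)
    # Timed out: let the caller dump diagnostics (for example service status
    # and recent journal lines) before the failure is reported.
    if on_timeout is not None:
        on_timeout()
    return False
```

In the real task the diagnostics hook would be where `systemctl status` and `journalctl -u astrald` get captured.
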
…ect)

The link-swarm and share-object verifiers were shell scripts that gathered
astral-query JSON and parsed it with embedded python heredocs -- awkward, and the
parsing couldn't be unit-tested without booting a VM. Move all logic into a real
verify.py per task (calls `netsim ssh ... astral-query` via subprocess, parses the
JSON streams, asserts); verify.sh becomes a thin shim:
  exec python3 "$NETSIM_TASK_DIR/verify.py" "$@"
netsim sets $NETSIM_TASK_DIR to the task dir and only auto-runs run.sh/verify.sh,
so verify.py sits alongside and is found cleanly.

Behavior-preserving: parsers golden-file tested against captured JSON, and the
full pipeline re-run fresh on NFS -- link-swarm verify PASSES, share-object verify
reproduces the cross-swarm-fetch diagnostic, both via the new shim.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
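
One way to get the testability this commit is after is to keep the subprocess call behind an injectable runner, so the checks can be fed captured output without a VM. The sketch below shows that shape only; `run_command`, `verify`, and the `check` callback are hypothetical names, and the actual command line is supplied by the caller rather than guessed here.

```python
import subprocess
import sys
from typing import Callable, List

Runner = Callable[[List[str]], str]


def run_command(argv: List[str]) -> str:
    """Run argv and return stdout; raises CalledProcessError on non-zero exit."""
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout


def verify(argv: List[str], check: Callable[[str], None], run: Runner = run_command) -> int:
    """Gather output via `run`, apply `check`, and return a process exit code."""
    try:
        check(run(argv))
    except (AssertionError, subprocess.CalledProcessError) as exc:
        print(f"FAIL: {exc}", file=sys.stderr)
        return 1
    print("PASS")
    return 0
```

A golden-file test then passes `run=lambda argv: captured_text`, while the live path uses the default runner.
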
Rename swarm wording to match astrald master (PR #350) and the updated
astral-agent skill: user.claim -> user.adopt, the node-claiming playbook
-> node-adoption, and mod.user.swarm_access_action ->
mod.user.swarm_membership_action. Docs/comments/prompt wording only; no
verifier logic changes.

Add a fourth verifier check: node2 must list node1 as a Linked sibling
(user.swarm_status, which derives from node2's own active contract, so no
token). This is a direct regression guard for astrald #348 (roster sync to
a newly adopted node) and the precondition share-object's write direction
relies on. Offline golden test: post-#348 passes, pre-#348 (roster={node2})
correctly fails.
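
The shape of that offline golden test can be sketched as below. The roster representation is an assumption (a list of entries with `node` and `linked` keys); the real user.swarm_status output may look different, and `lists_linked_sibling` is a hypothetical helper.

```python
from typing import Iterable, Mapping


def lists_linked_sibling(roster: Iterable[Mapping], sibling: str) -> bool:
    """True if the roster contains `sibling` marked as linked."""
    return any(
        entry.get("node") == sibling and entry.get("linked") is True
        for entry in roster
    )


# Golden inputs, invented for illustration:
# after #348 node2's roster includes node1; before it, only node2 itself.
POST_348 = [{"node": "node1", "linked": True}, {"node": "node2", "linked": False}]
PRE_348 = [{"node": "node2", "linked": False}]
```

With these inputs the check passes for `POST_348` and fails for `PRE_348`, which is the regression the commit wants guarded.
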
Pivot from the (blocked) cross-swarm read to the now-unblocked write
direction: the agent stores an object ON node2 (<node2>:objects.store) and
reads it back; verify.py independently proves node2 physically holds it via
repo-pinned, ungated objects.load/contains -repo local. Unblocked by #348
(node2 now recognizes node1 -> AuthorizeRelayFor permits the relayed store,
which reaches the ungated op_store). Caveat documented: op-level write is
unauthenticated (CreateObjectAction still unwired).

Move lab.story into netsim/stories/ and add one story per tested flow
(bootstrap-user, link-swarm, share-object), each a thin task list with a
start/save stage header so a story doubles as a pass/fail integration test.
Refresh netsim/README.md (full task list, swarm pipeline via stories) and
reconcile running-as-a-service.md snapshot guidance (disk image: stop;
live RAM snapshot: leave running).

The link-swarm and share-object verifiers carried long module docstrings
restating rationale already in their README.md. Cut each to a one- or two-line
summary; no logic change.

Cut the per-task READMEs down to a short paragraph (what the task does + the
stage it produces); dropped the execution-model, build-facts, verify-internals,
deploy-key setup, and security-note sections. No behavior change.

Replace the scattered ~/.netsim/{user.id,user.token,object.*} files with a
single $HOME/info.json (/home/tester/info.json) holding user_id, user_token,
object_id, object_payload, object_readback, object_target. bootstrap-user writes
user_*; share-object merges object_* (keeping user_*); verifiers and smoke-checks
read the JSON (python3 in-VM for shell, host-side json for verify.py). Transient
prompt/log files stay under ~/.netsim.
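
The merge step (add object_* while keeping user_*) can be sketched as a read-update-write on the JSON file. `merge_info` is a hypothetical helper, not code from the tasks, and the write-to-temp-then-rename step is an added precaution rather than something the commit describes.

```python
import json
import os
import tempfile


def merge_info(path: str, updates: dict) -> dict:
    """Merge `updates` into the JSON object at `path`, keeping other keys."""
    try:
        with open(path, encoding="utf-8") as fh:
            info = json.load(fh)
    except FileNotFoundError:
        info = {}
    info.update(updates)
    # Write to a temp file in the same directory, then rename over the target,
    # so a crash mid-write cannot leave a truncated info.json behind.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(info, fh, indent=2)
    os.replace(tmp, path)
    return info
```

Writing `updates` straight over the file instead of merging would drop the user_* keys that later verifiers read.
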
A drop-in alternative to bootstrap-user: instead of minting fresh entropy, the
agent derives the User key from a provided BIP-39 mnemonic (ASTRAL_USER_MNEMONIC)
and installs node1's active contract under that existing software User. verify.sh
asserts node1 is a User node and, if ASTRAL_USER_ID is set, that the derived id
matches exactly (proof the existing key was used). Produces stage astrald-user.

Rename the two first-node User-setup tasks to spell out the key variant:
bootstrap-user -> bootstrap-user-software-key (new soft key),
import-user -> import-user-software-key (existing soft key, known mnemonic).
Renames the task dirs + story files and updates every reference (internal
messages, prompt/log basenames, cross-references in link-swarm/share-object,
README layout/pipeline). Leaves room for hardware-key variants later.

The previous commit swept in py_compile byte-cache via 'git add -A'. Remove the
.pyc artifacts and add netsim/.gitignore for __pycache__/*.pyc.

Bake a valid BIP-39 mnemonic (the canonical all-zero-entropy test vector) into
prompt.md instead of an __MNEMONIC__ placeholder; the task is now self-contained
and reproducible. run.sh ships prompt.md verbatim (drops the ASTRAL_USER_MNEMONIC
requirement and sed substitution). verify.sh's optional ASTRAL_USER_ID assertion
is unchanged.
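
For context on why the all-zero-entropy vector is a valid mnemonic, the sketch below computes BIP-39 word indices from raw entropy: the entropy is followed by the leading bits of its SHA-256 as a checksum, and the result is cut into 11-bit wordlist indices. The commit does not say which length it uses; the usage note assumes 128-bit entropy (12 words). `bip39_word_indices` is a hypothetical helper and deliberately stops short of embedding the 2048-word list.

```python
import hashlib
from typing import List


def bip39_word_indices(entropy: bytes) -> List[int]:
    """Return BIP-39 wordlist indices for `entropy` (16 to 32 bytes, step 4)."""
    if len(entropy) not in (16, 20, 24, 28, 32):
        raise ValueError("entropy must be 128-256 bits in 32-bit steps")
    ent_bits = len(entropy) * 8
    cs_bits = ent_bits // 32  # checksum length: one bit per 32 bits of entropy
    checksum = hashlib.sha256(entropy).digest()[0] >> (8 - cs_bits)
    value = (int.from_bytes(entropy, "big") << cs_bits) | checksum
    total_bits = ent_bits + cs_bits
    # Slice the combined bit string into 11-bit groups, most significant first.
    return [
        (value >> (total_bits - 11 * (i + 1))) & 0x7FF
        for i in range(total_bits // 11)
    ]
```

For 16 zero bytes this gives eleven zeros followed by 3; in the English wordlist index 0 is "abandon" and index 3 is "about", so only the last word differs because it carries the checksum bits.
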
Skills moved off GitHub to ssh://git@git.satforge.dev/satforge/skills.git.
Update the default SATFORGE_SKILLS_REPO and the comments/README; drop the
GitHub-specific 443 fallback note. Host-key handling (StrictHostKeyChecking=
accept-new) already covers the new host, and the deploy-key flow is unchanged
(the key must now be registered on git.satforge.dev).

Rename the second-node task to match the swarm vocabulary (user.adopt):
link-swarm -> adopt-node. Renames the task dir + story file and updates every
reference (internal messages, prompt/log basenames, README layout/pipeline,
share-object cross-reference). Stage names unchanged (astrald-user -> astrald-swarm).

…gle-node

Both bootstrap-user-software-key and import-user-software-key produce the same
single-node stage (a node set up as a User; they differ only in the User key —
random vs the embedded mnemonic). Name it astrald-single-node and point adopt-node
at it. Stage-name change only (set via --save/--stage); no script logic depends
on it.

Split the object lifecycle into two focused scenarios and drop the combined
(write-direction) share-object:
- object-store (0006): node1 stores an object in its OWN local repo and reads it
  back; agent-driven, verify re-loads -repo local. astrald-swarm -> astrald-stored.
- read-remote-object (0007): node2 reads node1's stored object OVER ASTRAL;
  host-driven (node2 has no operator), verify runs the <node1>:objects.load ladder
  and asserts the bytes. astrald-stored -> astrald-read. This is the peer-reads-node1
  direction that failed pre-#348 — re-probed on current master.
Rewire README pipeline accordingly.

astrald-single-node -> one-node, astrald-swarm -> two-nodes,
astrald-stored -> two-nodes-data, astrald-read -> two-nodes-data-read.
Stage-name change only (set via --save/--stage in story headers + docs); no
script logic depends on it. astrald-lab kept as the base build fixture.

object-store now stores either in node1's own local repo (--target self, default)
or on the sibling node2 (--target peer, via <node2>:objects.store) — so one story
tests local storage and another tests storing on a peer. run.sh selects the prompt
(prompt.md / prompt-peer.md); verify.py checks the holder's local repo (node1 for
self, node2 for peer). Adds object-store-peer.story (two-nodes -> two-nodes-data-peer);
object-store.story stays self -> two-nodes-data and feeds read-remote-object.

…ters node aliases

Replace object-store's abstract --target self|peer with a real astral query target
(--target, default localnode; e.g. node2). One prompt template (drops prompt-peer.md):
the agent stores on / reads back from <target>, forming the right query itself.
verify.py maps target -> holder (localnode/node1 -> node1, node2 -> node2).
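
The target-to-holder mapping stated above is small enough to write as a lookup table. A sketch, with `holder_for` as a hypothetical name; rejecting unknown targets is an added choice, not something the commit specifies.

```python
# Which node's local repo should hold the object for a given --target.
HOLDER_BY_TARGET = {
    "localnode": "node1",  # default: the operator's own node
    "node1": "node1",
    "node2": "node2",
}


def holder_for(target: str = "localnode") -> str:
    """Return the node expected to hold the stored object."""
    try:
        return HOLDER_BY_TARGET[target]
    except KeyError:
        raise ValueError(f"unknown target {target!r}") from None
```

Failing on an unknown target keeps a typo in a story file from silently checking the wrong node.
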

adopt-node now registers node1/node2 directory aliases (dir.set_alias) on both nodes
when the swarm forms, so tasks can address nodes by name. Also fixes adopt-node's
stale soft-check (read the User token from info.json, not the removed user.token).

object-store-peer.story now passes --target node2.

Strip astral-agent/playbook/skill-location references and harness meta ('the skill
won't mention this'); the operator already has the skill auto-loaded. Prompts now
read like a person's request (still naming astral/astrald), keeping only the task
plus a terse 'save results to ~/info.json' the automated check needs.

The User on node1 permanently bans node2 from the swarm via user.expel, driven
by the Qwen operator through its astral-agent skill. verify.py confirms the ban
from both ends: node2 lands in user.list_expelled, drops out of
user.swarm_status (OpSwarmStatus lists ActiveNodes, which filters the
expelledSet), and the node1<->node2 link is torn down. README registers the new
task/story and the two-nodes -> two-nodes-expel branch.

The old read-remote-object was host-driven and read node2->node1 anonymously, which
can't route (network zone stripped) -- it tested the wrong, unroutable direction.

Now it's agent-driven on node1: the agent reads the object (id from ~/info.json,
written by object-store --target node2) FROM the peer as the User -- the
authenticated, routable direction -- and records what it read. verify.py
independently re-reads <peer>:objects.load as the User and asserts the bytes.

New read-remote-peer.story chains object-store --target node2 (store on the peer)
then read-remote-object (read it back from node1). Drops the old
read-remote-object.story; README pipeline updated.

Minimizing the prompts dropped the 'keep existing keys' hint, so object-store's
agent overwrote ~/info.json with object_* and wiped the user_token bootstrap wrote
-- breaking read-remote-object's verify (which reads the peer as the User). Restore
a natural 'leaving the existing entries in place' instruction in object-store and
read-remote-object.

Each task writes its own file (no shared accumulator, no merge, no clobbering):
- bootstrap/import -> ~/user.json   (user_id, user_token)
- object-store     -> ~/object.json (object_id, object_payload, object_readback)
- read-remote-object -> ~/read.json (object_remote)
Readers reference the specific file(s) they need: adopt-node + expel-node read
user.json; object-store verify reads object.json; read-remote-object verify reads
user.json + object.json + read.json. Prompts drop the 'keep existing entries' hint
(own file, overwrite is fine). Updates expel-node's reads to user.json too.
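
The reader side of this layout can be sketched as a table of which files each verifier needs, plus a loader that fails loudly when one is missing. The table copies the file lists from this commit message; `load_inputs` itself is a hypothetical helper.

```python
import json
from pathlib import Path

# Files each reader needs, as listed in the commit message above.
READS = {
    "adopt-node": ["user.json"],
    "expel-node": ["user.json"],
    "object-store": ["object.json"],
    "read-remote-object": ["user.json", "object.json", "read.json"],
}


def load_inputs(task: str, home: Path) -> dict:
    """Load and combine the JSON files `task` reads from `home`."""
    combined: dict = {}
    for name in READS[task]:
        path = Path(home) / name
        if not path.is_file():
            raise FileNotFoundError(f"{task}: missing input {path}")
        combined.update(json.loads(path.read_text(encoding="utf-8")))
    return combined
```

Because each file has a single writer, the combined dict has no key collisions as long as the key names stay distinct, as they are in the commit's list.
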
object-store now ships a fixed payload.txt to the operator and tells the agent
to store that file's contents (deterministic id/bytes) instead of inventing
'distinctive text'; verify.py (object-store and read-remote-object) uses the
shipped file as ground truth. Simplify every task prompt to precise, minimal
wording and name __TARGET__/__PEER__ as astral nodes.

…ls ref, minimized READMEs

- enable-tor: new host task — bring up a node with a Tor endpoint and save it to
  /root/tor.json (validated live: real onion published + saved).
- object-store: agent only stores + records object_id; verify owns the read-back
  and byte match against the shipped payload.txt.
- adopt-node: agent records swarm siblings to ~/siblings.json (sibling_ids);
  verify asserts it includes the adopted node.
- configure-astral-agent: SATFORGE_SKILLS_REF builds the lab against a skills
  branch (fails loudly if the ref can't be fetched).
- prompts: terser, human-style wording (adopt/bootstrap/import/expel/object-store).
- READMEs: minimized to astral-docs voice across all tasks.

… scenario 0004)

Restore the two parked tasks for Tor scenario 0004 ("a node leaves the LAN and
links over Tor"), completing the scenario alongside the already-committed
enable-tor building block. Sequenced by tor-link.story (two-nodes -> two-nodes-tor):
enable-tor -> leave-lan -> link-over-tor.

- leave-lan (host): seed node1 with node2's onion while the LAN is up, then
  nftables-drop the LAN path between them (WAN/Tor egress stays up). verify.py
  asserts node2 can no longer TCP-connect to node1:1791 (only a timeout passes).
- link-over-tor (agent): thin prompt drives the Qwen operator to force the swarm
  link over Tor (nodes.new_link -strategies tor) per the astral-agent skill's
  linking-over-tor playbook; verify.py independently asserts a nodes.links entry
  with Network=tor.
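
The two verifier checks above can be sketched as follows. Both helpers are hypothetical: `classify_connect` separates a timeout (packets dropped, the only outcome leave-lan accepts) from a refusal or a successful connect, and `has_tor_link` assumes the link listing has already been parsed into dicts carrying a `Network` key. The connector is injectable so the classification can be tested without a network.

```python
import socket
from typing import Callable, Iterable, Mapping


def classify_connect(
    host: str,
    port: int,
    timeout: float = 5.0,
    connect: Callable = socket.create_connection,
) -> str:
    """Return 'connected', 'timeout', 'refused', or 'error' for a TCP attempt."""
    try:
        conn = connect((host, port), timeout=timeout)
    except socket.timeout:
        return "timeout"  # nothing answered: consistent with a dropped path
    except ConnectionRefusedError:
        return "refused"  # host reachable, port closed: the LAN path is still up
    except OSError:
        return "error"
    conn.close()
    return "connected"


def has_tor_link(links: Iterable[Mapping]) -> bool:
    """True if any parsed link entry reports Network == 'tor'."""
    return any(link.get("Network") == "tor" for link in links)
```

Treating "refused" as a failure matters: a refusal proves the packet reached the peer, so the drop rule is not in effect.
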

The linking-over-tor playbook is now on skills main (PR #4), so the lab builds
against main with no SATFORGE_SKILLS_REF override.

Checkpoint: not yet validated against the real Tor network (the original parking
gate -- VM WAN NAT -> Tor reachability + the agent's per-turn tool-call cap).

A resumed stage runs astrald + all userspace live; netsim's sync_clock corrects the
stale snapshot clock with a ~day forward jump, which makes systemd's Persistent
apt-daily/apt-daily-upgrade timers fire and unattended-upgrades saturate the 1-vCPU
VMs for minutes -- breaking every resumed scenario (node unreachable, QMP save
timeouts). astrald itself tolerates the jump.

Fix it once in the image (standard ephemeral-VM hygiene): install-astrald masks
apt-daily/apt-daily-upgrade/unattended-upgrades on the fresh build boot, so every
stage is born quiet. The per-task quiescing in enable-tor/leave-lan is now redundant
and removed (DPkg::Lock::Timeout kept). Validated by a full rebuild: the resumed
scenarios no longer saturate (object-store-peer/read-remote/tor-link green, no
QMP/ssh-banner timeouts); tor-link clears the real-Tor path end to end.

Expulsion is a membership change, not a disconnect -- a lingering link is permitted --
so verify no longer checks nodes.links. It asserts node2 is in user.list_expelled and
gone from user.swarm_status. node2's identity now comes from node1's siblings.json
(recorded by adopt-node), not from node2: once expelled, node2 rejects user.info
(query rejected (2) untokened, auth_failed with the User token), so it can't identify
itself. Verified live against a post-expel stage.
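
The revised expel assertion can be sketched as a pure function over already-gathered data. It assumes a two-node swarm, so every id recorded in siblings.json (key `sibling_ids`, as written by adopt-node) is the expelled node; the expelled and active id lists are taken as plain strings, which is an assumption about how user.list_expelled and user.swarm_status get parsed. `expulsion_problems` is a hypothetical name.

```python
import json
from typing import Iterable, List


def expulsion_problems(
    siblings_json: str, expelled: Iterable[str], active: Iterable[str]
) -> List[str]:
    """Return a list of reasons the expulsion does not hold (empty means pass)."""
    sibling_ids = json.loads(siblings_json)["sibling_ids"]
    expelled_set, active_set = set(expelled), set(active)
    problems: List[str] = []
    for node_id in sibling_ids:
        if node_id not in expelled_set:
            problems.append(f"{node_id} missing from expelled list")
        if node_id in active_set:
            problems.append(f"{node_id} still listed in swarm status")
    return problems
```

No link check appears here, matching the commit's point that a lingering link is allowed after expulsion.
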
The import/bootstrap prompts said "set up user ... save id + token", which the agent
could satisfy by importing the key + minting a token without installing the node's
active contract (user.info then rejects). State the end goal -- make this a User node,
install the active contract -- so the agent runs the full node-setup flow. Validated:
import-user now passes.
Labels

None yet

Projects

None yet


2 participants