A discovery-based robot/instrument orchestration framework. Every skill — hardware-bound (MoveIt2 arm motion, pylabrobot instrument calls) or software-only (LLM-authored scripts) — is exposed as a self-advertising ROS 2 action. The orchestrator never hardcodes endpoints; it subscribes to latched <node>/skills manifests and dispatches goals over DDS.
Built on ROS 2 Jazzy + MoveIt2 + a Python BehaviorTree.CPP-v4-compatible executor. All action servers are Python; arm atoms drive MoveIt2 over its native action/service surface (no per-language MoveGroupInterface shim). See docs/adr/0001-action-server-language.md for the rationale.
Three views of the same system.
Runtime architecture (docs/architecture.svg · docs/architecture.excalidraw)
What runs where: browser + agent host on the outside, orchestrator PC, per-provider PCs, the DDS plumbing between them.
Abstraction architecture (docs/abstraction.svg · docs/abstraction.excalidraw)
The Python class hierarchy in lib/robot_skills_py/: two parallel bases (SkillNode / InstrumentMultiActionNode), plus the @action decorator that every multi-action server is built from.
Dispatch chain (docs/dispatch.svg · docs/dispatch.excalidraw)
How a BT XML tag actually reaches a working skill: RosActionNode (BT-side adapter) ↔ SkillManifest (the contract published over the latched <node>/skills topic) ↔ SkillNode / InstrumentMultiActionNode (server-side base, hosts the ActionServer, dispatches its own ActionClient to MoveIt2 / pylabrobot / vendor SDKs). Includes an explicit "does MoveIt need a SkillNode?" answer.
Full text writeup: docs/architecture.md. Open the editable sources at excalidraw.com.
Core principles
- A ROS action is the only skill API. Arm atoms (@action-decorated methods on a RobotSkillNode), instrument atoms (InstrumentMultiActionNodes wrapping pylabrobot / vendor SDKs), and agent-authored scripts all advertise the same way and are dispatched the same way.
- Discovery, not registration. Every node hosting skills publishes a latched <node>/skills manifest with TRANSIENT_LOCAL durability and LIVELINESS_AUTOMATIC QoS. Restart and late-join are handled by DDS, not heartbeats.
- Topology = trust. Code runs on the host that owns its execution context. The orchestrator never executes agent-authored code; it dispatches to ROS endpoints.
- One process per provider host. Each robot/instrument PC runs a single multi-action node hosting all of that host's atoms: RobotArmActionServer for arms, InstrumentMultiActionNode subclasses for instruments.
- Each top-level directory maps to one deployment role.
Class hierarchy (in lib/robot_skills_py/) — two parallel bases keyed off shape, not vendor:
| Base class | Use for | Examples |
|---|---|---|
| SkillNode (single-action) | One action per process | FrankaGripperSkillNode, ImagingSimNode |
| RobotSkillNode (extends SkillNode) | MoveIt-coupled single-action atom | held in reserve; single-action arm atoms are rare today |
| InstrumentSkillNode (extends SkillNode) | Single-action atom with a sim/real backend hook | held in reserve |
| InstrumentMultiActionNode (parallel base) | Many actions on one shared device / FSM | RobotArmActionServer (12 arm atoms · Meca500/FR3), MockRobotArmActionServer (13 mock atoms), RosbagSkillsNode, LiconicActionServer, HamiltonStarActionServer (14 actions on a 12-state FSM) |
@action(action_type, action_name, …) decorates each method on a multi-action subclass; the base class wires ActionServer constructors, manifest publication via SkillAdvertiser, parameter declarations, deferred-init, and (optionally) an AsyncioBridge for async-native backends. Architecture diagram: docs/abstraction.png (editable: docs/abstraction.excalidraw).
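The decorator-plus-base-class pattern described above can be sketched in plain Python. This is a minimal illustration, not the code in lib/robot_skills_py/: `MultiActionBase`, `DemoIncubator`, the `_action_spec` attribute, and the `demo_msgs` action types are all hypothetical, and the real base class wires ROS ActionServers and a SkillAdvertiser where this sketch only fills a dictionary and a list.

```python
from typing import Callable, Dict, List


def action(action_type: str, action_name: str) -> Callable:
    """Tag a method as an action handler; the method itself is left unchanged."""
    def decorate(func: Callable) -> Callable:
        func._action_spec = (action_type, action_name)  # hypothetical attribute
        return func
    return decorate


class MultiActionBase:
    """Hypothetical stand-in for the multi-action base class."""

    def __init__(self) -> None:
        self.handlers: Dict[str, Callable] = {}
        self.manifest: List[Dict[str, str]] = []
        # Walk the class once and pick up every method tagged by @action.
        for attr in sorted(dir(type(self))):
            spec = getattr(getattr(type(self), attr, None), "_action_spec", None)
            if spec is None:
                continue
            action_type, action_name = spec
            # The real base would construct an ActionServer here.
            self.handlers[action_name] = getattr(self, attr)
            self.manifest.append({"name": action_name, "type": action_type})


class DemoIncubator(MultiActionBase):
    """Illustrative two-action instrument; the action types are made up."""

    @action("demo_msgs/action/TakeIn", "take_in")
    def take_in(self, plate: str) -> str:
        return f"stored {plate}"

    @action("demo_msgs/action/Fetch", "fetch")
    def fetch(self, plate: str) -> str:
        return f"fetched {plate}"
```

Instantiating `DemoIncubator()` yields one handler and one manifest entry per decorated method, which is the property that lets a subclass add an atom by adding a single method.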
| Top-level dir | Role | Runs on |
|---|---|---|
| lib/ | Shared libraries — host-agnostic, no runtime | Consumed by every host |
| src/ | Orchestrator processes + assets | Orchestrator PC |
| agent/ | MCP-host processes + agent-local services | Agent's machine (anywhere on DDS) |
| providers/ | Per-robot/instrument code | The robot/instrument PC |
| Provider | Code | Skills |
|---|---|---|
| Mecademic Meca500 6-DoF arm | providers/meca500/ | All 12 arm atoms hosted in one RobotArmActionServer Python process talking to MoveIt2 and ros2_control over their native ROS interfaces (MoveToNamed/Joint/Cartesian, MoveCartesianLinear, Gripper, SetDIO, RobotEnable, CheckSystemReady, CheckCollision, UpdatePlanningScene, DetectObject, CapturePointCloud) — plus RecordRosbag / StopRecording from a sibling rosbag_skills_node. |
| Franka FR3 7-DoF arm + Franka Hand | providers/fr3/ | Same 12 arm atoms via the unified RobotArmActionServer (re-parameterized for FR3 / panda_arm planning group). franka_gripper_skill adds a FrankaGripperControl atom that bridges robot_skills_msgs/FrankaGripperControl onto the upstream franka_gripper Move/Grasp/Homing action set. |
| Arm mock sim (Meca500 + FR3) | providers/arm_mock_sim/ | Director-managed mock arm atoms. Single MockRobotArmActionServer (InstrumentMultiActionNode) hosts all 13 mock atoms — same action interface as the production server, no MoveIt2. Launched per-robot under /meca500 or /fr3 namespace by the Director's meca500-mock-sim-run / fr3-mock-sim-run tasks. |
| PBI Liconic STX44 incubator | providers/pbi_liconic/ | TakeIn, Fetch (pylabrobot) — single InstrumentMultiActionNode. |
| Hamilton STAR liquid handler | providers/pbi_liconic/ | 14 actions on one InstrumentMultiActionNode (12-state FSM gated through gate_goal): MoveResource, HandoffTransfer, PickUpCoreGripper, ReturnCoreGripper, Aspirate / Dispense / PickUpTips / DropTips (single-channel and 96-channel), JogChannel, Transfer. |
| Imaging station (sim) | providers/imaging_station/ | ImagePlate (idempotent) — single-action SkillNode. Sim backend writes placeholder PNGs; real driver (BMG / Tecan Spark / BioTek Cytation / microscope) swaps in via the same action. |
Both Liconic and Hamilton are pulled from the guyEIT/pbi_liconic upstream as a git subtree. See CLAUDE.md for the subtree pull/push commands.
The imaging-station provider is a fresh in-tree provider, not a subtree — it owns the robot_skills_msgs/action/ImagePlate interface and ships a sim backend so the campaign behaviour tree can run a full Liconic ↔ Hamilton ↔ Imager ↔ Hamilton ↔ Liconic loop without imager hardware. Real driver backends (BMG / Tecan / BioTek / microscope) plug into the same action interface.
- Pixi (Linux desktop). The Docker Compose path is deprecated — keep it only for legacy bring-up; new work uses pixi natively.
None — pixi install -e <env> resolves the four interfaces/ packages and the three framework helpers in lib/ via path-deps. No conda channel preload required.
pixi run channel-pack
Builds every workspace ROS package and harvests the resulting .conda artifacts into ~/channel/. Run on the Director PC once, then pixi run channel-serve-up exposes the channel over HTTP so worker PCs can pull pre-built artifacts instead of rebuilding from source. See Distributed launch.
pixi run -e director director-up
# In the dashboard's Topology panel: pick "lab-sim" → Launch
Brings up every provider sim (mock arm atoms for Meca500 + FR3, sim backends for Liconic + Hamilton) as independent director-managed processes, then the orch_lab_sim orchestrator. Each can be killed and restarted from the dashboard independently.
pixi run -e director meca500-mock-sim-run # mock Meca500 atoms under /meca500 (no MoveIt2)
pixi run -e director fr3-mock-sim-run # mock FR3 atoms under /fr3 (no MoveIt2)
pixi run hamilton-sim-test # STAR sim backend + skill_server
pixi run liconic-sim-test # Liconic sim backend + skill_server
Submit the matching test tree from src/robot_behaviors/trees/.
| Mode | Command | What runs |
|---|---|---|
| real-native | pixi run real-native-up | full real-robot stack on one PC |
Per-PC pixi envs install only what each box needs. The four ROS interface packages and three framework helpers are path-deps inside the workspace (built locally on each host), or — for fleet deploys — fetched as pre-built .conda artifacts from the Director's HTTP channel (see "Distributed launch" below).
| Env | Target PC | Tasks |
|---|---|---|
| orchestrator | control PC (legacy single-instance) | orchestrator-up |
| ★ director | control PC (fleet-wide, Tier 1 dashboard, central pixi channel) | director-up, channel-serve-up, meca500-mock-sim-run, fr3-mock-sim-run |
| ★ launch-agent | every PC the Director should manage | launch-agent-up |
| meca500-host | Meca500 robot PC | meca500-real-run |
| fr3-host | FR3 robot PC | fr3-real-run |
| liconic-host | Liconic / Hamilton PC | liconic-up, liconic-sim-up, hamilton-up, hamilton-sim-up |
| real-native | single-box real robot | real-native-up |
Three control layers move the "which pixi run task lives on which PC" knowledge out of operators' shells and onto a single dashboard / MCP / YAML surface:
- Topology Director (src/topology_director/) on the control PC reads config/topology.yaml, supervises N skill-server orchestrator instances (each in its own ROS namespace so multiple BTs can run in parallel), and fans launch / stop / kill commands out to per-host launch agents.
- Launch agents (src/launch_agent/), one per PC, own subprocess spawn + SIGINT → SIGTERM → SIGKILL escalation + kill_node for arbitrary ROS nodes.
- Central pixi channel: python -m http.server over ~/channel on the Director PC. Worker hosts pull pre-built binary packages (ros-jazzy-robot-skills-msgs, ros-jazzy-franka-msgs, ros-jazzy-liconic-msgs, ros-jazzy-hamilton-star-msgs, ros-jazzy-launch-agent, …) instead of each rebuilding from source.
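The SIGINT → SIGTERM → SIGKILL escalation that launch agents apply can be sketched with the standard library alone. This is a POSIX-only illustration under stated assumptions: `stop_with_escalation` and the 5-second default grace period are hypothetical, and the sketch signals a single child process rather than a whole process tree.

```python
import signal
import subprocess
import sys
from typing import Optional

# Order of escalation: polite interrupt first, unblockable kill last.
ESCALATION = (signal.SIGINT, signal.SIGTERM, signal.SIGKILL)


def stop_with_escalation(proc: subprocess.Popen,
                         grace_s: float = 5.0) -> Optional[signal.Signals]:
    """Stop a child, escalating through ESCALATION.

    Returns the signal after which the child exited, or None if it had
    already exited before any signal was sent.
    """
    if proc.poll() is not None:
        return None
    for sig in ESCALATION:
        proc.send_signal(sig)
        try:
            proc.wait(timeout=grace_s)
            return sig
        except subprocess.TimeoutExpired:
            continue  # child ignored this signal; try the next one
    proc.wait()  # SIGKILL cannot be ignored, so this only reaps the child
    return signal.SIGKILL


if __name__ == "__main__":
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print(stop_with_escalation(child, grace_s=2.0))
```

Returning the signal that finally worked makes it easy to log which children needed a hard kill, which is usually a hint that a node is not handling shutdown cleanly.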
All Pixi environments include the shared ros-network activation feature. The same defaults also live in scripts/ros-network-env.sh, which daemon startup, Director/launch-agent foreground startup, self-restart helpers, and launch-agent supervised service launches source before starting ROS processes:
RMW_IMPLEMENTATION=rmw_fastrtps_cpp
ROS_DOMAIN_ID=0
ROS_AUTOMATIC_DISCOVERY_RANGE=SUBNET
ROS_STATIC_PEERS=10.6.104.87 # Director (callus): every host unicasts SPDP here
Only ROS-framework env vars: no DDS vendor XML, no per-host interface pinning. Fast DDS (the Jazzy default, REP-2000 Tier-1) listens on every NIC out of the box.
Why static peers instead of pure multicast: the lab's 10.6.104.0/24 and 10.6.105.0/24 subnets are L3-routed via a common gateway. ROS 2 SPDP multicast uses TTL=1 so packets don't survive the gateway hop. Setting ROBOT_BEHAVIOURS_ROS_PEERS to the Director's IP means every worker unicasts its SPDP announcement directly to the Director; the Director learns each worker's address from the received packet and can reply. No per-worker IP config needed on the Director. When the network admin enables multicast routing between the two subnets (see the polite request below), static peers can be removed and SUBNET multicast will handle everything automatically.
Per-host overrides: create scripts/ros-peers.local (gitignored, sourced by ros-network-env.sh) to add further peers without touching git — useful for worker-to-worker traffic or temporary IPs:
# scripts/ros-peers.local (on any host that needs extra peers)
ROBOT_BEHAVIOURS_ROS_PEERS="10.6.104.87 10.6.105.23"
Verifying multicast forwarding (to confirm when the network admin has made the change):
# Run simultaneously — listener on Director, sender on a worker
pixi run -e director multicast-smoke -- --listen --seconds 20
pixi run -e launch-agent multicast-smoke -- --send --count 10
multicast_received=yes means SUBNET multicast works across the subnets and static peers are no longer needed.
General DDS smoke tests:
# Inspect whether a worker launch-agent is visible from the Director
pixi run -e director ros-dds-check -- --host meca500-control
# Cross-host raw DDS beacon test
pixi run -e launch-agent ros-dds-beacon
pixi run -e director ros-dds-check -- --listen meca500-control
Note: meca500-control as a hostname resolves to its Tailscale IP via MagicDNS; that's irrelevant to lab discovery. DDS uses the host's 10.6.10x.* interface; the Tailscale address is never involved. pixi run ros-network-setup prints the effective values on any host; daemon logs record them at start.
git pull
pixi install -e director # resolves Director + launch-agent + dashboard envs (path-deps; no channel preload required)
pixi run channel-pack # OPTIONAL: harvests every workspace .conda into ~/channel for worker PCs
pixi run channel-serve-up # background daemon: serves ~/channel over HTTP on :8082
pixi run launch-agent-up # background daemon: lets the Director manage tasks on this PC too
pixi run director-up # background daemon: Director + rosbridge + dashboard
Open http://<director-host>:8081/ for the Tier 1 home (topology + instance + kill controls). Tier 2 per-instance dashboards open from any instance row at /instance/<name>.
Each worker only needs the launch agent — the Director will tell it what to spawn at runtime via /launch_agents/<hostname>/launch_task.
git clone <repo> # or git pull on an existing clone
cd robot_behaviours
# Get the prebuilt msgs + launch_agent packages from the Director's HTTP channel
# (or just `pixi install -e launch-agent` locally if you'd rather build from source on the worker).
echo 'http://<director-host>:8082' | pixi config append --workspace channels -
pixi install -e launch-agent # resolves the agent + msgs from the Director's channel
pixi run -e launch-agent launch-agent-up # background daemon: agent listens at /launch_agents/<hostname>/
That's it. The agent's heartbeat /launch_agents/<hostname>/info (latched JSON) auto-registers the worker with the Director. Subsequent pixi run -e launch-agent launch-agent-status confirms it's alive, …-logs tails the daemon log, …-down stops it.
The agent shells out to pixi run -e <env> <task> on demand, so the worker also needs whichever per-host env hosts the actual workload (meca500-host, fr3-host, liconic-host, …) installed alongside launch-agent. Adding the Director's HTTP channel means those envs solve from prebuilt binaries too — no C++ toolchain required on the worker.
Fleet machines follow the production git branch for operator-triggered updates. Development can continue on main; promote a known-good commit by fast-forwarding production to that commit and pushing it:
git checkout main
git pull --ff-only
git checkout production
git merge --ff-only main
git push origin production
Each launch agent owns updates for its local checkout. The Director only relays the request; the target host runs the git and pixi work itself:
- /director/update_launch_agent receives a host name, or JSON such as {"host":"meca500-control","branch":"production","strategy":"ff-only","git":true}.
- The Director forwards that JSON to /launch_agents/<host>/update_self.
- The launch agent fetches origin/<branch>, refuses to continue if the checkout is dirty, fast-forwards the branch, runs pixi install -e launch-agent, sources scripts/ros-network-env.sh, then restarts its own daemon from a detached helper process.
- The host briefly disappears from the dashboard while the daemon restarts, then re-advertises /launch_agents/<host>/info.
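The update flow above can be sketched as a request parser plus a pure planning function. Everything here is an assumption made for illustration: the function names, the exact git commands, and the reading of `"git": false` as "skip the git steps" are not taken from the launch agent's source. The sketch also stops before the environment sourcing and daemon restart.

```python
import json
from typing import Any, Dict, List

# Defaults mirror the documented launch-agent parameters.
DEFAULTS: Dict[str, Any] = {"branch": "production", "strategy": "ff-only", "git": True}


def parse_update_request(payload: str) -> Dict[str, Any]:
    """Accept either a bare host name or a JSON object."""
    text = payload.strip()
    if text.startswith("{"):
        request = {**DEFAULTS, **json.loads(text)}
    else:
        request = {**DEFAULTS, "host": text}
    if not request.get("host"):
        raise ValueError("update request needs a host")
    return request


def plan_update(request: Dict[str, Any], checkout_dirty: bool,
                remote: str = "origin") -> List[List[str]]:
    """Return the commands a host would run, in order (assumed commands)."""
    steps: List[List[str]] = []
    if request["git"]:
        ref = f"{remote}/{request['branch']}"
        steps.append(["git", "fetch", remote, request["branch"]])
        if request["strategy"] == "ff-only":
            if checkout_dirty:
                raise RuntimeError("checkout is dirty; refusing ff-only update")
            steps.append(["git", "merge", "--ff-only", ref])
        elif request["strategy"] == "reset-hard":
            steps.append(["git", "reset", "--hard", ref])  # discards local edits
        else:
            raise ValueError(f"unknown strategy: {request['strategy']}")
    steps.append(["pixi", "install", "-e", "launch-agent"])
    return steps
```

Keeping the plan separate from its execution means the refusal on a dirty checkout can be checked without touching a real repository.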
The default launch-agent parameters are self_update_remote:=origin, self_update_branch:=production, and self_update_strategy:=ff-only. reset-hard exists for locked-down production checkouts where local edits must be discarded, but ff-only is the normal safe mode. The heartbeat includes git branch, short SHA, and dirty state so the Tier 1 dashboard can show which revision each host is running.
Restart without git/pixi is separate: /director/restart_launch_agent relays to /launch_agents/<host>/restart_self, which only bounces the daemon.
| Foreground | Background daemon | Stop | Status | Logs |
|---|---|---|---|---|
| channel-serve | channel-serve-up | channel-serve-down | channel-serve-status | channel-serve-logs |
| launch-agent-run | launch-agent-up | launch-agent-down | launch-agent-status | launch-agent-logs |
| director-run | director-up | director-down | director-status | director-logs |
Cross-platform (macOS bash 3 + Linux bash 5), no systemd / launchctl dependency. PIDs and logs under ~/.local/state/robot_behaviours/daemons/<name>/. -up is idempotent; -down does TERM → 5 s grace → KILL of the whole subprocess tree. scripts/daemon.sh start ... sources scripts/ros-network-env.sh and writes the effective ROS network values to the daemon log before detaching.
# CLI
ros2 service call /director/launch_profile robot_skills_msgs/srv/LaunchProfile \
'{profile_name: "real-parallel", continue_on_failure: false}'
# Headline capability: real-parallel spawns two skill_server instances under
# /orch_meca and /orch_fr3 so two BT campaigns run concurrently.
ros2 action send_goal /orch_meca/skill_server/execute_behavior_tree ... &
ros2 action send_goal /orch_fr3/skill_server/execute_behavior_tree ... &
# Selective kill — only that one orchestrator dies, everything else keeps running.
ros2 service call /director/kill_node robot_skills_msgs/srv/KillRosNode \
'{node_name: "/orch_meca/skill_server"}'
# Hard kill the whole fleet (escape hatch)
ros2 service call /director/kill_all robot_skills_msgs/srv/CancelActiveTask '{reason: ""}'
Every service is also exposed via the MCP server (launch_profile, kill_node, spawn_orch_instance, terminate_orch_instance, list_orch_instances, get_topology, …) and via the Tier 1 dashboard.
config/topology.yaml ships five reference profiles: lab-sim, real-full-serial, real-parallel (the headline parallel-BT capability), real-meca-only, and parallel-sim. Edit the file and call /director/reload_spec to pick up changes without restarting.
Multi-instance namespacing caveat.
orchestrator.launch.py accepts namespace:=… and pushes the orchestrator nodes under it, but the skill_server still publishes/subscribes on absolute /skill_server/... paths in many places. Running multiple instances at the root namespace works today (legacy single-orchestrator path); running disjoint instances under /orch_meca / /orch_fr3 requires the absolute-topic conversion tracked in src/robot_skill_server/NAMESPACE_AUDIT.md. Tier 2 dashboards display a banner when they're viewing a non-root instance.
# Rebuild after editing a .msg/.srv/.action — pixi-build-ros tracks the
# extra-input-globs in interfaces/<pkg>/pixi.toml and rebuilds the path-dep'd
# .conda automatically on the next install. Just re-run the env install:
pixi install -e <env>
# Open a sourced ROS shell
pixi run lite-native-shell # or real-native-shell, meca500-host shell, etc.
# Inspect the running ROS graph
pixi run status
# Run the test suite
pixi run test
# List the skill registry (merged view of every */skills topic)
ros2 service call /skill_server/get_skill_descriptions \
robot_skills_msgs/srv/GetSkillDescriptions \
'{include_compounds: true, include_pddl: false}'
# Execute a behavior tree XML
ros2 action send_goal /skill_server/execute_behavior_tree \
robot_skills_msgs/action/ExecuteBehaviorTree \
"$(python3 -c 'import yaml,sys; xml=open(sys.argv[1]).read(); print(yaml.safe_dump({"tree_xml": xml, "tree_name": "demo", "target_mode": 0}, default_style="|"))' src/robot_behaviors/trees/move_to_home.xml)"
# Compose a tree from skill steps
ros2 service call /skill_server/compose_task \
robot_skills_msgs/srv/ComposeTask \
'{
task_name: "my_task",
sequential: true,
steps: [
{skill_name: "move_to_named_config", parameters_json: "{\"config_name\": \"home\"}"},
{skill_name: "gripper_control", parameters_json: "{\"command\": \"open\"}"}
]
}'
# Sanity-check a plan against PDDL preconditions / effects before running it
ros2 service call /skill_server/validate_plan \
robot_skills_msgs/srv/ValidatePlan \
'{
initial_state: ["robot_initialized", "gripper_open"],
steps: [
{skill_name: "move_to_named_config", parameters_json: "{\"config_name\": \"home\"}"},
{skill_name: "pick_object", parameters_json: "{}"}
]
}'
# Returns valid=true + final_state, OR valid=false + first_failing_step
# + missing_preconditions[] pointing at the offending entry.
target_mode on ExecuteBehaviorTree: 0 = MODE_REAL (default, back-compat), 1 = MODE_SIM (one-shot dry-run), 2 = MODE_SIM_THEN_REAL (sim → operator approval gate → real).
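The precondition/effect check that validate_plan performs can be illustrated with a small set-based simulator. This is a sketch of the general technique, not the service's implementation: `SkillModel`, `ValidationResult`, and the facts attached to each skill in the usage note (for example `object_detected`) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Optional, Sequence


@dataclass(frozen=True)
class SkillModel:
    """PDDL-style description of one skill (hypothetical shape)."""
    preconditions: FrozenSet[str]
    add_effects: FrozenSet[str] = frozenset()
    del_effects: FrozenSet[str] = frozenset()


@dataclass
class ValidationResult:
    valid: bool
    final_state: List[str] = field(default_factory=list)
    first_failing_step: Optional[int] = None
    missing_preconditions: List[str] = field(default_factory=list)


def validate_plan(initial_state: Sequence[str], steps: Sequence[str],
                  models: Dict[str, SkillModel]) -> ValidationResult:
    """Simulate the plan symbolically and stop at the first unmet precondition."""
    state = set(initial_state)
    for index, skill_name in enumerate(steps):
        model = models[skill_name]
        missing = sorted(model.preconditions - state)
        if missing:
            return ValidationResult(False, first_failing_step=index,
                                    missing_preconditions=missing)
        # Apply the step's effects before checking the next one.
        state = (state - model.del_effects) | model.add_effects
    return ValidationResult(True, final_state=sorted(state))
```

With made-up models in which `pick_object` requires `gripper_open` and `object_detected`, the two-step plan from the CLI example fails at step index 1 with `object_detected` reported as missing, because nothing earlier in the plan establishes that fact.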
Long-running plans (multi-step assays, hours-long incubations) get a fast pre-flight against a paired /sim/* action surface, then a human approval gate before the real phase runs.
- Each provider launch (make_robot_skill_server_launch) accepts namespace_prefix:=/sim and wraps its action servers in a PushRosNamespace group.
- SkillDiscovery filters out /sim/* manifests so the registry is single-source-of-truth on real entries.
- BtExecutor reads a sim_namespace_prefix parameter (default /sim) and prepends it to every server name during the SIM phase: same XML, both phases.
- sim_lab.launch.py overrides this to "" because lab-sim has no separate real backend; MODE_SIM and MODE_REAL then both resolve to the bare-path atoms.
- lab-up keeps the default and brings up paired sim+real action servers on one box for end-to-end approval-gate testing.
- On a successful sim phase, BtExecutor latches a DryRunStatus on /skill_server/dryrun_status and waits on /skill_server/approve_dry_run (ApproveDryRun.srv). The dashboard surfaces an approve/reject modal.
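The way one tree XML serves both phases comes down to a name rewrite at dispatch time. The sketch below shows that rewrite under stated assumptions: `phases_for`, `resolve_server_name`, and the example server name `/meca500/move_to_named` are hypothetical, while the mode values and the `/sim` default come from the text above.

```python
from enum import IntEnum
from typing import List


class TargetMode(IntEnum):
    MODE_REAL = 0
    MODE_SIM = 1
    MODE_SIM_THEN_REAL = 2


def phases_for(mode: TargetMode) -> List[str]:
    """Phases a tree runs through; the approval gate sits between sim and real."""
    if mode == TargetMode.MODE_REAL:
        return ["real"]
    if mode == TargetMode.MODE_SIM:
        return ["sim"]
    return ["sim", "real"]


def resolve_server_name(server_name: str, phase: str,
                        sim_namespace_prefix: str = "/sim") -> str:
    """Prepend the sim prefix during the sim phase; leave real names untouched."""
    if phase != "sim" or not sim_namespace_prefix:
        return server_name  # an empty prefix collapses sim onto the bare path
    return sim_namespace_prefix.rstrip("/") + "/" + server_name.lstrip("/")
```

An empty prefix is the lab-sim case: both phases resolve to the same bare-path atoms, so no second set of servers is needed.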
For trees that run for weeks — operators trickling plates in and out of the Liconic, cycling each through the Hamilton-iSWAP to the imaging station and back — the framework persists tree state to SQLite and resumes after a skill_server crash without losing progress.
What survives a restart:
- A SQLite-backed persistent blackboard: any key prefixed persistent. is mirrored to ~/.local/state/skill_server/tasks/{task_id}/state.db (WAL mode). Type-checked at write: only JSON-serialisable values land on disk.
- Per-node tick checkpoints on every control / decorator / loop node (Sequence index, Repeat iteration, RetryUntilSuccessful attempt, WaitUntil deadline). On resume, the executor re-ticks from the root and each Checkpointable node hydrates its index, so no work is repeated past the last successful child.
- Action-inflight reconciliation: every RosActionNode records (node_path, server_name, goal_uuid, idempotent) before submitting. If a goal was in flight at crash time, the resume path inspects the row: idempotent skills auto-resubmit; non-idempotent skills refuse and surface an alert for the operator to resolve via OperatorDecision.
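A persistent blackboard of the kind described in the first item can be sketched with the standard sqlite3 and json modules. This is an illustration only: the `PersistentBlackboard` class, its table layout, and its method names are assumptions, not the executor's actual schema.

```python
import json
import sqlite3
from pathlib import Path
from typing import Any, Dict

PREFIX = "persistent."


class PersistentBlackboard:
    """In-memory blackboard that mirrors 'persistent.' keys to SQLite."""

    def __init__(self, db_path: Path) -> None:
        self._memory: Dict[str, Any] = {}
        self._db = sqlite3.connect(str(db_path))
        self._db.execute("PRAGMA journal_mode=WAL")
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS blackboard "
            "(key TEXT PRIMARY KEY, value TEXT NOT NULL)"
        )
        self._db.commit()
        # Hydrate whatever an earlier run left on disk.
        for key, value in self._db.execute("SELECT key, value FROM blackboard"):
            self._memory[key] = json.loads(value)

    def set(self, key: str, value: Any) -> None:
        if key.startswith(PREFIX):
            # json.dumps raises TypeError for non-JSON values, so a bad
            # write is rejected before it reaches memory or disk.
            encoded = json.dumps(value)
            with self._db:  # commits on success, rolls back on error
                self._db.execute(
                    "INSERT OR REPLACE INTO blackboard (key, value) VALUES (?, ?)",
                    (key, encoded),
                )
        self._memory[key] = value

    def get(self, key: str, default: Any = None) -> Any:
        return self._memory.get(key, default)

    def close(self) -> None:
        self._db.close()
```

Reopening the same database file after a crash restores every `persistent.` key, while unprefixed scratch keys start empty again.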
New control / utility nodes in tree_executor.py:
- KeepRunningUntilFailure, Repeat num_cycles="N", WhileDoElse
- WaitUntil timestamp="{...}": wall-clock-aware sleep (deadline-preserving across restart)
- BlackboardCondition key="..." expected="...": gate a subtree on a persistent flag
- PopFromQueue / PushToQueue: list-valued blackboard queues for operator-driven work
- AdvancePlate: post-cycle bookkeeping (increments cycle, recomputes next_due_at, retires when target reached)
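The bookkeeping attributed to AdvancePlate can be sketched as a pure function over a plate record. The field names (`cycle`, `target_cycles`, `cadence_s`, `status`) and the choice to schedule the next cycle relative to the current time are assumptions for illustration; only `next_due_at` and the three behaviours (increment, recompute, retire) come from the description above.

```python
from typing import Any, Dict


def advance_plate(plate: Dict[str, Any], now: float) -> Dict[str, Any]:
    """Return an updated copy of the plate record after one finished cycle."""
    updated = dict(plate)
    updated["cycle"] = plate["cycle"] + 1
    if updated["cycle"] >= plate["target_cycles"]:
        # Target reached: the plate leaves the schedule.
        updated["status"] = "retired"
        updated["next_due_at"] = None
    else:
        updated["status"] = "waiting"
        updated["next_due_at"] = now + plate["cadence_s"]
    return updated
```

Returning a new dict instead of mutating the input keeps the step easy to persist as a single blackboard write.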
Operator services — split between bb_operator (campaign-level state) and skill_server (framework-level execution control):
| Service | Owner | Purpose |
|---|---|---|
| /bb_operator/add_plate | sidecar | append a plate dict to persistent.plate_queue (trickle-in) |
| /bb_operator/retire_plate | sidecar | flag plates.{name}.retiring = true so the in-flight cycle finishes naturally and isn't re-queued |
| /bb_operator/pause_campaign | sidecar | toggle persistent.paused; the BlackboardCondition gate halts the campaign at the next iteration boundary |
| /bb_operator/operator_decision | sidecar | resolve a stuck non-idempotent action (retry / skip-as-success / skip-as-failure / abort-tree) |
| /skill_server/pause_execution | bt_executor | framework-level pause that sets ctx.paused, honoured at step boundaries by every control-flow node; works for any tree, not just campaigns |
| /skill_server/cancel_active_task | bt_executor | session-independent hard cancel; walks _current_ctx, sets cancelled, and tears down in-flight goals. Reachable by any client (the action-cancel handshake requires the original goal id, which a restarted dashboard doesn't have) |
Live dashboard view. The dashboard's Campaign panel (added 2026-04-27) subscribes to /skill_server/persistent_state — a latched JSON snapshot of the active task's persistent blackboard, republished by bb_operator after every service handler + a 1 Hz timer. Renders the plate queue + per-name index as a table with cycle / cadence / next-due / status; surfaces Add Plate (modal dialog), Pause after step / Resume, Cancel (hard halt via /skill_server/cancel_active_task), and per-row Retire trash icons. The Campaign preset layout (Layouts → CAMPAIGN) tiles it alongside Task Monitor + BT Tree + Executor + Logs. Validated end-to-end via Playwright + chromium-headless: empty-state → submit campaign tree → AddPlate → Pause → Resume → Cancel.
The campaign is defined by a campaign file (src/robot_behaviors/campaigns/plate_imaging_standard.campaign.xml) whose phases name behavior trees; the orchestrator (campaign_manager + bb_operator) owns the scheduling / dispatch / advancement loop. The per-plate cycle plate_imaging_cycle.xml is LiconicFetch → ObsImagingSequence (ObsBot PTZ, 3 presets) → LiconicTakeIn — the plate is imaged in place on the Liconic transfer tray (the Liconic shovel presents it; the ObsBot camera images it). The Hamilton iSWAP is not in the imaging leg — it only loads/unloads the incubator at campaign start/end.
Skill idempotency is declared per atom in SkillDescription.idempotent (defaults to false). The resume path uses it to decide whether to auto-resubmit on goal-gone or to halt and ask the operator.
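The resume decision driven by that flag can be written as a small pure function. This sketch assumes the in-flight goal is already known to be gone; `ResumeAction` and `reconcile` are hypothetical names, and the record fields follow the (node_path, server_name, goal_uuid, idempotent) tuple that each RosActionNode stores.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ResumeAction(Enum):
    CONTINUE = "continue"          # nothing was in flight at crash time
    RESUBMIT = "resubmit"          # safe to send the same goal again
    ASK_OPERATOR = "ask_operator"  # halt and raise an alert


@dataclass(frozen=True)
class InflightRecord:
    node_path: str
    server_name: str
    goal_uuid: str
    idempotent: bool


def reconcile(record: Optional[InflightRecord]) -> ResumeAction:
    """Decide what the resume path does with a goal that vanished in a crash."""
    if record is None:
        return ResumeAction.CONTINUE
    if record.idempotent:
        return ResumeAction.RESUBMIT
    return ResumeAction.ASK_OPERATOR
```

Defaulting `idempotent` to false at the skill level means an undeclared atom always lands in the operator branch, which is the conservative outcome for hardware that may have moved.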
Agents (LLMs, MCP clients) drive the lab through agent/robot_skill_mcp/ — a FastMCP stdio server bridged into ROS:
| Tool | Purpose |
|---|---|
| list_skills, list_trees | introspect the runtime registry |
| compose_task | build a BT XML from steps |
| execute_tree | dispatch /skill_server/execute_behavior_tree |
| register_script, list_scripts, delete_script | session-scoped agent-authored skills |
| get_dryrun_status, approve_dry_run | drive the sim-then-real gate |
| get_pending_agent_prompts, submit_agent_response | answer in-tree Agent* leaves (yes/no, choice, freeform, image analysis) |
| read_image_snapshot | fetch a PNG written by AgentAnalyzeImage (path-traversal-guarded to ~/.local/state/skill_server/agent_snapshots/) |
| ★ get_topology, list_orch_instances | inspect fleet state: hosts, tracked tasks, running orchestrator instances, recent events |
| ★ launch_profile, stop_profile, stop_all, kill_all | bring up / tear down a named profile from config/topology.yaml |
| ★ kill_node | hard-kill a ROS node by FQN (Director fans out to every reachable launch agent) |
| ★ spawn_orch_instance, terminate_orch_instance | stand up an extra orchestrator instance ad-hoc for a parallel BT |
| ★ refresh_host_env, reload_topology_spec | trigger pixi install on a worker; re-read config/topology.yaml |
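A path-traversal guard like the one read_image_snapshot applies can be sketched with pathlib. This shows the general technique under stated assumptions: `resolve_snapshot` is a hypothetical helper, and the MCP server's actual check may differ.

```python
from pathlib import Path


def resolve_snapshot(root: Path, requested: str) -> Path:
    """Resolve a client-supplied name and refuse anything outside root."""
    root = root.resolve()
    # resolve() collapses '..' segments and follows symlinks, so the
    # containment check below runs against the real target path.
    candidate = (root / requested).resolve()
    if root not in candidate.parents:
        raise PermissionError(f"{requested!r} escapes the snapshot directory")
    return candidate


if __name__ == "__main__":
    snapshots = Path.home() / ".local/state/skill_server/agent_snapshots"
    print(resolve_snapshot(snapshots, "plate_01.png"))
```

Absolute inputs are rejected by the same check, because joining an absolute path onto `root` discards `root` entirely.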
Agent-authored scripts run on the agent's host via agent/robot_script_server/ — the orchestrator never executes agent code; from its point of view a registered script is just another RunScript-typed action.
In-tree agent decision steps (AgentConfirm, AgentInput, AgentDecide, AgentAnalyzeImage) flow the other direction: the BT publishes AgentPrompt on /skill_server/agent_prompts, an MCP client pulls it via get_pending_agent_prompts, and the response goes back through submit_agent_response. Sample tree at src/robot_behaviors/trees/test_agent_inspect.xml.
pixi install -e <env> uses symlink-install for Python and XML share files; edits are live on save.
| Change type | Action |
|---|---|
| Behavior tree XML | save the file — BtExecutor polls every 2 s, dashboard updates over the latched topic |
| Python skill / orchestrator code | save the file — symlinked. Restart the host node if it caches state at startup |
| interfaces/<msgs>/*.msg/.srv/.action | pixi install -e <env> rebuilds the path-dep'd .conda and consumers, then restart the node |
| lib/robot_skills_py/ base classes | pixi install -e <env> (pixi-build-ros rebuilds the path-dep'd .conda and consumers) + node restart |
| Frontend | vite build + pixi install -e director (or pixi run dashboard-dev for hot reload) |
See docs/adding-skills.md for the three authoring patterns (A: an @action method on RobotArmActionServer — most arm-shaped atoms land here; B: a standalone SkillNode subclass for one-off single-action atoms; C: an InstrumentMultiActionNode subclass for multi-action instruments). Plus the Claude Code commands:
- /new-skill-atom: generic primitive (lib/robot_arm_skills) or provider-specific atom
- /new-compound-skill: vetted, persisted compound skill
- /new-behavior-tree: XML tree
- /debug-skill: diagnose a failing skill or tree
- docs/architecture.md — full architecture writeup
- CLAUDE.md — development environment, validation status, known follow-ups
- docs/adding-skills.md — three skill-authoring patterns
- docs/adr/0001-action-server-language.md — why all action servers are Python and what the C++→Python port left behind


