Skip to content

docs(planning): HEADLESS-PERSONA-HOST-LOOP — slice 13 design#1510

Merged
joelteply merged 3 commits into
canaryfrom
docs/headless-host-loop-slice-13-design
Jun 7, 2026
Merged

docs(planning): HEADLESS-PERSONA-HOST-LOOP — slice 13 design#1510
joelteply merged 3 commits into
canaryfrom
docs/headless-host-loop-slice-13-design

Conversation

@joelteply

Copy link
Copy Markdown
Contributor

Summary

Design doc for #133 slice 13 — the IPC-boot rewire that finally makes continuum-core host personas headlessly without the demo binary in the inner ring.

Net-additive: no code changes. Captures the boot wire-up plan + the cleanup model verified during PR #1508's slice-12 review, before touching the load-bearing crate::ipc::start_server flow.

What's in the doc

  • Before/after of the IPC boot loop showing the exact gap (current loop logs welcome, never starts the service loop).
  • Cleanup model traced through airc_lib::EventStream::droptokio::sync::broadcast::Receiver auto-decrement; wire subscriber ref-counted across local subscribers. No leak.
  • Five open questions with recommendations:
    1. Arc<Registry> for build_profile — add model_registry::global_arc()
    2. HwCapabilityTier source — HostCapabilityProbe::detect_at_boot()
    3. hosted_handles ownership — new PersonaSupervisor module
    4. ResumeOrMintProvider role-mapping — deferred to slice 14
    5. BootSummary event for per-slot failures
  • Test plan: PersonaBootstrapper trait split for stubbing.
  • Explicit non-goals + reference docs + memories worth refreshing.

Why a design PR before the implementation PR

The boot loop is load-bearing. Reviewers should weigh in on the architectural choices (open questions 1-5) before I write 200 lines of wire-up that lock the answers in.

Reference docs

Test plan

  • Doc compiles (markdown — N/A)
  • CI standard checks (lint, validate) pass
  • Joel reviews open questions 1-5 before slice 13 implementation begins

🤖 Generated with Claude Code

Captures the boot wire-up plan for #133 slice 13 before touching
the load-bearing `ipc::start_server` flow.

Documents:
- The "moment-of-truth" gap: composition pieces from slices 5-12
exist; nothing in continuum-core actually calls them at boot.
- Before/after of the IPC boot loop (~ipc/mod.rs:1024-1089).
- The cleanup model verified during PR #1508's slice-12 review:
.abort() → JoinHandle drop → conversation drop → EventStream drop
→ broadcast::Receiver drop auto-decrements subscriber count. Wire
subscriber is ref-counted across local subscribers; tears down
when runtime Arc reaches 0. No leak; no manual unsubscribe.
- Five open questions with recommendations:
  1. Arc<Registry> for build_profile — add model_registry::global_arc()
  2. HwCapabilityTier source — HostCapabilityProbe::detect_at_boot()
  3. hosted_handles ownership — new PersonaSupervisor module
  4. ResumeOrMintProvider role-mapping — slice 14 territory
  5. BootSummary event for per-slot failures
- Test plan: PersonaBootstrapper trait split for stubbing.
- Explicit non-goals (shared-base / cross-grid / LoRA / role-aware).
- Doctrine memories worth refreshing on implementation.

Net-additive doc — no code changes. Slice 13 implementation lands
as a follow-up PR.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@joelteply

Copy link
Copy Markdown
Contributor Author

Adversarial review: HEADLESS-PERSONA-HOST-LOOP slice 13 design

Verdict: REQUEST_CHANGES. The doc captures the right architectural ambition but contains a load-bearing factual error in the cleanup-model section, gets the role/identity invariant in Q4 wrong (position-pairing is already broken in the LCD case from boot 2 onward, not just "at higher tiers"), and skips an obvious reuse opportunity (bootstrap_planned + materialize_adapters already exist; the "after" code open-codes a third copy of the same loop). Two prerequisites for the implementation are missing entirely: a Runtime::shutdown caller in ipc::start_server, and back-pressure on N concurrent persona task spawns.

Below: findings ordered by severity. Each one names file + line + the doctrine it tripped (where applicable).


1. BLOCKER — Cleanup model claim is wrong in daemon mode (doc lines 104–116)

The doc claims: "Wire-level subscription (messaging.rs:186 ensure_wire_subscriber): this is the per-room daemon-side subscription. It is ref-counted across all local subscribers to the same room..."

That is not what airc-lib does. There is no function ensure_wire_subscriber anywhere in the airc-lib worktree (/Users/joel/.airc/worktrees/c9b28925/crates/airc-lib/src/). The actual subscribe path:

  • messaging.rs:204 fn subscribe()if is_daemon_attached() return self.daemon_subscribe(vec![room.channel])
  • daemon.rs:358 fn daemon_subscribe() opens a brand-new client.attach(...) per channel, per call, spawning a per-subscriber tokio task that owns the IPC connection.
  • stream.rs:58 struct DaemonAttachGuard aborts THAT subscriber's tasks on drop.

The Arc handle is shared. The daemon subscribe is per-call, NOT ref-counted. In production (daemon attached, which is the prod path per ipc/mod.rs:954), every AircPersonaConversation::next_message first call spawns its OWN per-channel attach loop. There is no shared "wire subscriber" being ref-counted across local subscribers — that only describes the in-process broadcast mode (stream.rs:36 EventStream::from_broadcast), which production never hits when there's a daemon.

This invalidates the doc's correctness conclusion at line 115: "dropping the JoinHandle alone is sufficient cleanup. No leak." For ONE persona owning ONE conversation that's still true (the JoinHandle abort drops the conversation, which drops the EventStream, which drops the DaemonAttachGuard, which aborts the per-channel attach tasks). But the reason is wrong, and any future reviewer who reads this doc and concludes "Arc semantics + ref-counted wire" will design wrong code (e.g., expecting that spawning a second conversation on the same room reuses the wire attach).

Required fix: rewrite section 104–116 against the actual daemon_subscribe path. Cite airc-lib/src/daemon.rs:358-450 and airc-lib/src/stream.rs:42-68 (DaemonAttachGuard). Note that each subscribe call is its own attach + abort cycle, and the right cleanup invariant is "DaemonAttachGuard::drop guarantees the per-channel reader tasks abort within one poll" — not "Arc refcount reaches 0."

This is the BLOCKER finding. Defaulting to REQUEST_CHANGES on this alone.


2. BLOCKER — Position-pairing of intents to plan rows (Q4) is broken from boot 2, even on LCD

Doc line 157: "Slice 13 pairs the Nth intent with the Nth plan entry by position. That's reasonable for the LCD case (every persona at this tier is helper/coder)..."

It's not reasonable. Trace:

  1. First boot: ResumeOrMintProvider::new(root, 2) returns 0 resumed + mints 2 fresh. mint_fresh_intent() (resume_or_mint_provider.rs:134-142) creates a UUIDv4 + derives the agent_name from agent_name_from_identity. So name is random per UUID.
  2. Plan order = [Helper, Coder]. Persona "Maya" (random) is yielded slot 0 as Helper, persona "Bart" (random) yielded slot 1 as Coder. Both bootstrapped, both write seed.json in their personas/<name>/ dirs.
  3. Restart: ResumeOrMintProvider::new(root, 2) calls scan_personas_dir. At resume_or_mint_provider.rs:200, dir_entries.sort() sorts alphabetically. Disk order: [Bart, Maya] (alphabetic). Provider yields Bart first, Maya second.
  4. Position-pair against [Helper, Coder]: Bart=Helper (WAS Coder), Maya=Coder (WAS Helper). Role identity flipped.

The mint order on boot 1 is plan-order. The reload order on boot 2 is alpha-order-of-randomly-derived-names. They almost never match. This breaks even the 2-persona LCD case the doc declares "reasonable."

Existing tests don't catch this because resumes_one_plus_mints_to_floor (resume_or_mint_provider.rs:280) uses one fixed seed name "Pax" and one random mint, doesn't check role.

Compounding problem: bootstrap_planned (spawner_module.rs:316) ALREADY does position-pairing today (slice 8). So this bug isn't new in slice 13 — but slice 13 is the first place where the (persona_id, role) pair becomes load-bearing for cognition. It will turn from a latent bug into an observable one the moment slice 13 ships.

Required fix: either

  • (a) write role into the seed.json on mint, read it back on resume, fail loud when missing; OR
  • (b) defer slice 13 to land AFTER Build(deps-dev): Bump @typescript-eslint/parser from 8.29.1 to 8.44.1 #133 slice 14's RoleAwareProvider so role identity is durable; OR
  • (c) constrain slice 13 to single-persona-per-tier-plan only (1-row plan) and add a debug_assert that plan.len() == 1.

(b) is cleanest and the doc already acknowledges slice 14 is the right home for this — but then slice 13 cannot ship before slice 14. The doc currently asserts the opposite at line 157.


3. MAJOR — The "after" code at lines 50–100 duplicates two existing functions that slice 13 should compose, not reimplement

The doc presents an open-coded loop calling bootstrap_one + build_profile + factory.build_adapter + spawn_persona_service per slot. But:

  • spawner_module.rs:316 bootstrap_planned already does steps 1–3 (provider iteration + bootstrap_one + profile materialization), returning Vec<MaterializedPersonaPlan> with structured per-slot errors.
  • supervisor.rs:150 materialize_adapters already does step 4 (adapter factory call), returning Vec<Result<HostedPersona, SupervisorError>>.

The doc's "after" code reimplements both, ignoring the structured BootstrapPlannedError + SupervisorError paths that slices 8 + 9 specifically introduced. Per [[organization-purity-as-we-migrate]] this is the duplicate-of-(A) trap the doc itself names in Q2.

Required fix: rewrite "after" to be roughly:

let plans = bootstrap_planned(&spawner, &instance_manager, &mut provider,
                              hw_capability_id_str, &model_registry).await?;
let hosted = materialize_adapters(plans, &*factory).await;
for (slot, result) in hosted.into_iter().enumerate() {
    match result { ... spawn_persona_service ... }
}

This drops slice 13's net-new code from ~50 lines to ~15 lines, eliminates two error-shape duplications, and forces the implementer to confront the per-slot error variants that slices 8 + 9 already typed.


4. MAJOR — Q3's PersonaSupervisor module duplicates a concern that PersonaAircRuntimeRegistry already owns

airc_runtime_registry.rs:50-78 already holds "who's currently online in The Grid" keyed by persona_id → Arc<PersonaAircRuntime>. The proposed PersonaSupervisor would hold "who's currently being served by the substrate" keyed by persona_id → JoinHandle<...>. These are the same keyspace and (modulo timing windows) the same set.

Per [[organization-purity-as-we-migrate]] and the compression principle in CLAUDE.md: two modules owning per-persona-lifetime info keyed by persona_id is a compression failure. Two candidates:

  • (a) Extend PersonaAircRuntimeRegistry::register to take Option<JoinHandle> and own the supervision handle alongside the runtime Arc. Single registry, both concerns.
  • (b) Inverted: make PersonaSupervisor own the JoinHandles AND surface a runtime(persona_id) method that proxies the existing registry — but then the registry becomes redundant. Delete it.

The doc's Q3 doesn't grapple with this overlap. It just proposes a new module. Either reuse the existing registry, or justify why two modules with the same keyspace + lifetime is correct here.

Required fix: the design must address whether PersonaAircRuntimeRegistry merges into PersonaSupervisor, becomes its substrate, or stays orthogonal — with the reasoning visible.


5. MAJOR — Missing prerequisite: no Runtime::shutdown caller exists in ipc/mod.rs

Runtime::shutdown exists at runtime/runtime.rs:354 — it iterates registered modules and calls each one's shutdown().await. But grep across src/workers/continuum-core/src/{ipc,bin} for runtime.shutdown / shutdown() returns zero callers. The server has no Ctrl-C → graceful shutdown wiring today.

Slice 13's Q3 talks about "the supervisor's .abort() on server stop" as the abort hook — but no server stop event fires Runtime::shutdown to invoke the supervisor's shutdown() impl. The wire is missing one stage upstream of where the doc draws its boundary.

Required fix: either slice 13 must wire tokio::signal::ctrl_c()runtime.shutdown() itself, OR the doc must add a "depends on issue X for shutdown signal wiring" prerequisite that lands before slice 13. Without one of those, the JoinHandle abort path documented in Q3 is dead code — handles drop on process exit, but the daemon-attach tasks they own get reaped by tokio's runtime drop, not by the orderly path Q3 describes.


6. MAJOR — Missing prerequisite: per-persona resource budgeting / back-pressure

The doc spawns N JoinHandles unconditionally in a for loop. On Compat tier plan = 2 personas × LCD model = ~1 GiB (per the doc's own line 183). On M5UmaProMax the plan likely grows. The substrate has PressureBroker (modules/pressure_broker_module.rs) + ResourceBroker (modules/resource_broker.rs) for exactly this kind of admission control, but slice 13's "after" code doesn't consult either.

Per [[substrate-is-a-good-citizen-on-the-host]] (memory the doc itself cites) the substrate must not RAM-hoard. Spawning N adapters in a tight loop with no admission check is the same anti-pattern the substrate built brokers to prevent.

Required fix: the "after" code MUST go through ResourceBroker.acquire(memory_budget) before each factory.build_adapter. Per-slot rejection → mark slot as RejectedByBroker in the per-slot error report. Tie the resource lease into the JoinHandle drop chain (lease releases on conversation drop).


7. MODERATE — Q2 missed that detect_host_capability already exists as a free function

The doc proposes HostCapabilityProbe::detect_at_boot() as if it's a new struct. The existing code at cognition/host_capability_probe.rs:87 fn detect_host_capability(gpu_monitor, system_info) is a free function that already does the work. The doc's Q2(B) ("synthesize from gpu_manager") is wrong shape — GpuMemoryManager is not a GpuMonitor. The right options are:

  • (A') Call detect_host_capability(&gpu_monitor, &system_info) from the IPC boot path. NO new struct. ~3 lines.
  • (B') If a struct wrapper is desired for testability, name it consistently with the existing module (it's the host_capability_probe module already).

The doc presents this as a choice between two unbuilt options, when there's an obvious third option that uses the prior art. Also, task #52 (Mac Intel misclassification) is named at the top of the doc but its impact on this probe call is not addressed — does slice 13 ship blocked on #52, or does it knowingly inherit the misclassification?

Required fix: restate Q2 in terms of the existing free function. Add explicit decision on #52: does slice 13 wait? If not, what's the user-visible behavior on Mac Intel?


8. MODERATE — Q1's global_arc() proposal requires changing the singleton storage

Doc Q1 line 127: "Add model_registry::global_arc() → Arc<Registry> (clones the inner state once). Tiny change, idiomatic."

The current singleton is static GLOBAL: OnceLock<Registry> (model_registry/singleton.rs:23). To return an Arc<Registry> without cloning the whole Registry every call, you'd need to change storage to OnceLock<Arc<Registry>>. That's not "tiny" — it touches every existing caller of global() → &'static Registry. Every callsite that takes &'static Registry now takes &Arc<Registry>, and the lifetime semantics shift.

The doc's stated alternative (B) — refactor build_profile to take &Registry — is a one-callsite change inside slice 5's signature. That's the smaller change. The doc inverted the cost analysis.

Required fix: either change Q1's recommendation to (B), OR keep (A) and explicitly walk the singleton-storage migration + all current global() callers' signature updates. Don't claim "tiny change, idiomatic" when neither is true.


9. MINOR — Q5's BootSummary event needs a venue

MessageBus::publish (runtime/message_bus.rs) exists and takes BusEvent { name, payload }. So technically the event is publishable. But:

  • Who subscribes? The doc doesn't say.
  • What event class does it declare via events/declare-class (the L1-1 events module)?
  • Per [[observability-is-half-the-architecture]] which the doc cites: if the answer is "nobody subscribes today, the operator scrapes recent_events later," that's fine — but the doc should say so explicitly.

The recommendation is fine in spirit; the venue specification is missing.


10. MINOR — Hot-reload semantics for seed.json are silently undefined

If an operator adds a new personas/Quinn/seed.json while the server is running, does the boot loop re-fire? Today the loop runs once and exits. Slice 13 inherits that one-shot shape. Is that intentional? The doc doesn't say. If not, who triggers re-scan? The existing persona/instances/bootstrap IPC command (ipc/mod.rs:947) is the manual path; should slice 13 add a filesystem-watcher path?

This is a "non-goal" candidate but isn't explicitly listed in the non-goals section (lines 181–186). Either declare it out of scope or address it.


11. MINOR — Wire-subscription failure mid-boot is not handled

If AircPersonaConversation::next_message's first call (lazy subscribe per airc_persona_conversation.rs:60-63) fails (daemon goes down mid-boot), the loop returns an error, the serve_persona_loop returns Err, the JoinHandle resolves with Err — but slice 13's boot code at line 95 just pushes the handle and moves on. By the time the operator looks at handles, the persona is silently dead.

Q5's BootSummary covers bootstrap-phase failures (line 161-163), but NOT post-bootstrap, in-service-loop failures. Slice 13 should clarify that the JoinHandle's terminal Result is part of the supervision surface — who's reading those? When?


12. MINOR — Slice boundary is reasonable but should add an explicit test gate

The slice 13 vs slice 14 boundary (line 159) is the right cut once Q4 is resolved per finding #2 (i.e., slice 14 lands first OR slice 13 is constrained to 1-row plans). The test plan at lines 169–177 is fine as a sketch, but the "integration test happens through the IPC server itself" comment at line 177 leaves the slice unvalidatable end-to-end without a stub airc daemon. The substrate's existing mock GPU monitor pattern + a stub airc daemon (slice 12 already showed how to do this for PersonaAdapterFactory) should be made an explicit prerequisite test artifact.


What's right in the doc


Required path to APPROVE

  1. Fix finding Feature: CLI Implementation with Testing #1 (cleanup model — rewrite against actual daemon_subscribe path).
  2. Resolve finding Feature: Add CI/CD Configuration #2 (Q4 role identity) — pick one of (a)/(b)/(c) and update the doc.
  3. Rewrite the "after" code per finding Build(deps): Bump actions/stale from 8 to 9 #3 to use bootstrap_planned + materialize_adapters.
  4. Address finding Build(deps): Bump actions/checkout from 3 to 4 #4 (Q3 duplication with PersonaAircRuntimeRegistry).
  5. Acknowledge findings Build(deps): Bump actions/labeler from 4 to 5 #5 and Build(deps): Bump actions/setup-node from 3 to 4 #6 as prerequisites (either land them first, or surface them as blockers in the test plan).
  6. Restate Q2 per finding Build(deps-dev): Bump lerna from 6.6.2 to 8.2.1 #7, Q1 per finding Build(deps): Bump commander from 10.0.1 to 13.1.0 #8.
  7. Specify venue + subscriber for the BootSummary event (finding Build(deps-dev): Bump @typescript-eslint/eslint-plugin from 5.62.0 to 8.29.1 #9).
  8. Either resolve or explicitly non-goal findings Build(deps-dev): Bump @typescript-eslint/parser from 5.62.0 to 8.29.1 #10 and Build(deps): Bump inquirer from 9.3.7 to 12.5.2 #11.

When this happens the doc becomes the load-bearing contract slice 13's implementation can be written against. Today it isn't.

PR #1510's adversarial reviewer caught 9 findings, 3 blocking.
Substantial revision addresses each:

Blockers:
- Finding #1 (cleanup model wrong): rewritten "Cleanup model" section.
  `.abort()` alone is INSUFFICIENT — wire subscriber stays in
  Arc<Airc>.inner.subscribers until registry-remove drops the Arc.
  Supervisor shutdown path MUST: abort → await → registry.remove.
- Finding #2 (position-pairing broken): elevated to hard prereq P2.
  scan_personas_dir sorts alphabetically; boot 2 yields persona
  order != plan order on random-derived names. Slice 13 ships with
  plan.len() <= 1 (Helper only); slice 14 lands role-in-seed.json
  + RoleAwareProvider before re-enabling Coder.
- Finding #3 (after-code reimplements existing fns): "after" rewrite
  uses bootstrap_planned (slice 8) + materialize_adapters (slice 9)
  as composition, not open-coded copies. Net-new code ~25 lines, not
  ~50.

Major:
- Finding #4 (PersonaSupervisor dup with PersonaAircRuntimeRegistry):
  Q3 revised to EXTEND the registry — HostedPersonaRuntime owns both
  airc_runtime + service_loop JoinHandle. registry.remove becomes
  the natural shutdown path. One keyspace, one cleanup chain.
- Finding #5 (no Runtime::shutdown caller): elevated to hard prereq
  P1. Slice 13 wires tokio::signal::ctrl_c -> runtime.shutdown.
- Finding #6 (no broker admission for N adapter spawns): elevated to
  hard prereq P3. ResourceBroker.acquire before each
  factory.build_adapter.

Moderate:
- Finding #7 (detect_host_capability already exists): Q2 restated to
  call the existing free function at host_capability_probe.rs:87.
- Finding #8 (global_arc() cost inverted): Q1 recommendation flipped
  to (B) — refactor bootstrap_planned to take &Registry. Singleton
  storage migration would have touched every callsite of global().

Minor:
- Finding #9 (BootSummary venue): Q5 specifies
  MessageBus::publish("persona:boot:summary", ...) with operator
  scraping via events/recent; no declared subscribers in slice 13.
- Finding #10 (hot-reload undefined): added as Q6, declared
  out-of-scope.
- Finding #11 (wire-subscription failure mid-boot): added as Q7,
  supervisor polls JoinHandle::is_finished every 5s.

Plus added implementation checklist at end so slice 13 PR has a
sign-off surface.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@joelteply

Copy link
Copy Markdown
Contributor Author

Addressed all 11 findings in commit f81de21. Summary:

Blockers fixed

  • Feature: CLI Implementation with Testing #1 (cleanup model): Rewrote "Cleanup model" section. The right invariant is abort → await → registry.remove(persona_id) (drops Arc → drops Arc → drops inner.subscribers map → wire subscriber tasks abort). .abort() alone is insufficient; the wire subscriber stays in Arc<Airc>.inner.subscribers until the Arc itself drops. This is now load-bearing for the supervisor design.

    • Note: ensure_wire_subscriber does exist in my airc checkout (crates/airc-lib/src/transport.rs:65 in git checkouts/airc-ff7f2c64d403af02/428f928). You may have been reading worktrees/c9b28925/ which is stale. Either way, your underlying concern about the cleanup path was valid — the original doc only mentioned JoinHandle.abort(), not registry-removal.
  • Feature: Add CI/CD Configuration #2 (position-pairing): Elevated to hard prereq P2. Confirmed dir_entries.sort() at resume_or_mint_provider.rs:200 is alphabetical. Slice 13 ships with plan.len() <= 1 (Helper only) and a debug_assert. Slice 14 lands role-in-seed.json + RoleAwareProvider before re-enabling Coder. The regression test in the test plan (Build(deps): Bump actions/labeler from 4 to 5 #5) goes red until slice 14.

  • Build(deps): Bump actions/stale from 8 to 9 #3 (after-code reimplements existing fns): "After" code rewritten to compose bootstrap_planned (slice 8) + materialize_adapters (slice 9) directly. Net-new code dropped from ~50 to ~25 lines.

Major fixed

Moderate fixed

Minor fixed

Plus added an implementation checklist at the end so slice 13's PR has a sign-off surface. Ready for re-review.

@joelteply

Copy link
Copy Markdown
Contributor Author

Re-review of revision f81de2190 (slice 13 design doc, after first-pass review)

I read the revised doc end-to-end against the implementation in PR #1511 (commits 71429daae, 3f843aada, f940fa47d, 8611bae56, 2abd59684) and against the airc-lib rev the workspace currently pins (f6ed190, per src/workers/Cargo.toml:44-48).

What the revision got right

What needs another pass

1. The cleanup-model rewrite is still wrong against the currently-pinned airc-lib rev. (Same finding, different root cause.)

The author's reply asserts ensure_wire_subscriber exists at transport.rs:65 in the production airc checkout at git checkouts/airc-ff7f2c64d403af02/428f928/, and accuses my prior review of reading a stale worktree.

Both halves are mistaken. The workspace Cargo.toml:44-48 pins airc-{ipc,core,lib,protocol,wire} to rev f6ed190, not 428f928 (428f928 is the previous pin per the bump comment at line 27). At the pinned rev:

  • crates/airc-lib/src/transport.rs does not contain ensure_wire_subscriber (grep -rn "ensure_wire_subscriber" /Users/joel/.cargo/git/checkouts/airc-ff7f2c64d403af02/f6ed190/crates/ → 0 hits).
  • AircInner does not contain a subscribers: HashMap<PathBuf, WireSubscriber> field (crates/airc-lib/src/airc.rs:121-173 shows only per-transport lan_subscriber/relay_subscriber/udp_subscriber/webrtc_subscribers).
  • The actual live-subscription path is Airc::subscribe()daemon_subscribe(channels) (crates/airc-lib/src/daemon.rs:358-515), which spawns N per-channel attach tasks per call. The handles are owned by DaemonAttachGuard inside the returned EventStream (crates/airc-lib/src/stream.rs:42-68). DaemonAttachGuard::drop aborts every attach task when the EventStream drops.

The c9b28925 worktree I cited the first time is consistent with f6ed190 on this point — it also lacks ensure_wire_subscriber. The 428f928 the author pointed me at is a stale rev no longer in the dependency graph.

Architectural implication: against f6ed190, dropping the EventStream IS sufficient to release the daemon attach (each subscribe call gets fresh per-channel tasks; no shared idempotent registry across subscribers). The doc's lines 156-160 still describe the old 428f928 architecture; the registry's docstring at airc_runtime_registry.rs:32-37 and shutdown_slot's docstring at :207-212 embed the same wrong rationale.

The .abort() + .await + registry.remove sequence in the impl is still safe (it's strictly stronger than necessary), but the stated technical justification for needing all three steps is incorrect. Risk: if airc-lib's architecture continues to evolve, the next reviewer will check this doc against the new transport layer and reach the wrong conclusion about what the cleanup invariants are.

Required: re-trace the cleanup chain against f6ed190 (or whatever rev is pinned at merge time), correct the doc, correct the docstrings in airc_runtime_registry.rs.

2. "Hard prerequisites" section is now factually misleading vs. the impl.

Lines 28-30 of the doc declare:

Hard prerequisites (must land BEFORE or AS PART OF slice 13)

These are not optional. Without them the slice-13 wire-up is dead code.

But PR #1511's boot composition explicitly defers P1 (ctrl_cRuntime::shutdown) and P3 (ResourceBroker.acquire) with TODO markers (ipc/mod.rs:1049-1063). Q5 (MessageBus::publish("persona:boot:summary", ...)) is also not emitted — the impl uses tracing::info! with hosted_count/failed_count instead. Q7 (supervisor poller on JoinHandle::is_finished()) is referenced only in registry docstrings; no poller is wired anywhere in src/persona/.

Either:

The author's PR #1511 body argues for (a) with concrete justification (single-persona LCD is within all tiers' headroom; per-slot shutdown is exposed via IPC commands). That's a reasonable position — but then the design doc should say that, not the opposite. Right now the doc and the impl disagree about what slice 13 is. The implementation checklist at the end of the doc (lines 257-268) lists P1/P3/Q5/Q7 as required checkboxes that the implementation will not tick.

Required: realign the doc with the impl's actual scope. I lean (a) since the impl's argument is defensible, but the doc must reflect it.

3. Minor: boot composition assertion drift.

The doc's "after" snippet (lines 93-142) shows bus.publish(BusEvent::new("persona:boot:summary", summary.into())) at line 140; the impl has no such publish. The "after" snippet also shows BootSummary accumulating failed: Vec<(slot, role, error)>; the impl tracks only hosted_count / failed_count counters with per-slot details going to tracing. Same root cause as #2 — pick one and ship it.

Verdict

REQUEST_CHANGES on PR #1510, with the recommendation that the doc be revised to (a) correct the cleanup-model section against the f6ed190 architecture and (b) realign the "Hard prerequisites" framing with PR #1511's actual scope. The impl-side fixups in #1511 (BLOCKER 1 leaked-task drain, BLOCKER 2 orphan drain, debug_assert tripwire) are well-judged — those are the load-bearing correctness fixes and they're solid.

The doc divergences are corrigible without architectural rework. Once they're aligned with the impl, this is a clean approve.

— prior reviewer

joelteply added a commit that referenced this pull request Jun 2, 2026
* refactor(persona): bootstrap_planned/derive_spawn_plan/build_profile take &Registry

Per HEADLESS-PERSONA-HOST-LOOP design doc Q1 (PR #1510 review found
the original recommendation was inverted): the substrate boot path
holds `&'static Registry` from `model_registry::global()`. Migrating
the singleton to `OnceLock<Arc<Registry>>` would touch every callsite
of `global()` and change the lifetime semantics throughout the crate.

Smaller change: drop the Arc requirement from the three functions
that took `&Arc<Registry>` and accept `&Registry` instead. Rust's
Deref coercion at the test call sites handles `Arc<Registry>` ↦
`&Registry` transparently — no test changes needed.

Functions updated:
- profile_builder::build_profile (slice 5)
- spawner::derive_spawn_plan (slice 6)
- spawner_module::bootstrap_planned (slice 8)

All slice 5-9 tests still pass:
  persona::profile_builder — 4 passed
  persona::spawner — 4 passed
  persona::spawner_module — 5 passed

Unblocks the slice-13 boot composition at `ipc::start_server` where
the registry is `&'static Registry`, not an Arc.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(persona): extend PersonaAircRuntimeRegistry → PersonaSlot with service_loop

Per HEADLESS-PERSONA-HOST-LOOP design doc Q3 (PR #1510 review):
a PersonaSupervisor module storing JoinHandles keyed by persona_id
would duplicate the existing PersonaAircRuntimeRegistry keyspace.
Both modules would own per-persona lifetime info. Compression
failure per [[organization-purity-as-we-migrate]].

Resolution: extend the registry. Each slot becomes:

    PersonaSlot {
        runtime: Arc<PersonaAircRuntime>,
        service_loop: Mutex<Option<JoinHandle<Result<ServeOutcome, _>>>>,
    }

New methods:
- attach_service_loop(persona_id, handle) — supervisor wires the
  per-persona serve loop into the slot. Refuses silent overwrites.
- is_service_loop_finished(persona_id) — Q7's periodic crash poll.
- shutdown_slot(persona_id) — the orderly path: take JoinHandle →
  abort → await → remove slot. The slot drop cascades:
    Arc<PersonaSlot> → Arc<PersonaAircRuntime> → Arc<Airc> →
    inner.subscribers map drop → daemon-attach wire tasks abort.
  Per the cleanup-model section of the design doc, BOTH steps
  (abort + slot remove) are required — abort alone leaves the
  wire subscriber alive until the Arc drops via registry removal.
- ids() — Vec<Uuid> snapshot for the supervisor's poller without
  cloning N runtime Arcs.

Existing surface preserved for back-compat:
- register, get, get_by_agent_name, remove (sync), iter, len,
  is_empty all return runtime Arcs (not slot Arcs). The slot is
  internal.

Tests cover the failure modes:
- attach_service_loop_errors_when_no_slot
- is_service_loop_finished_returns_none_for_missing_slot
- shutdown_slot_returns_none_for_missing_persona

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(persona): constrain plan_for_tier to single Helper until slice 14

Per HEADLESS-PERSONA-HOST-LOOP design doc P2 (PR #1510 review
finding #2 — position-pairing broken from boot 2):

ResumeOrMintProvider::scan_personas_dir sorts directory entries
alphabetically (resume_or_mint_provider.rs:200). On boot 1 the
substrate mints personas in plan order [Helper, Coder] with
random-derived names (e.g. Maya for Helper, Bart for Coder). On
boot 2, scan yields them in alphabetic order [Bart, Maya] —
position-pairing against [Helper, Coder] flips the roles. Bart
becomes Helper when he was Coder. Role identity flipped silently.

The hazard exists in slice 8's bootstrap_planned today but doesn't
manifest because nothing depends on (persona_id, role) yet. Slice
13 IS that consumer (cognition + supervisor both observe the role).
Without a fix, slice 13 ships with a latent boot-2 regression.

Fix shape:
- plan_for_tier returns ONE Helper for all tiers until slice 14.
- TODO marker names slice 14 as the load-bearing fix
  (role-in-seed.json + RoleAwareProvider).
- Existing test `compat_tier_plans_helper_and_coder_on_lcd` renamed
  to `compat_tier_plans_single_helper_on_lcd` with updated invariant.
- New `slice_14_restores_helper_plus_coder_for_compat` test pinned
  `#[ignore]` until slice 14 — it's the spec slice 14 has to
  satisfy. Going red on the ignore-removal date is the design's
  reminder.
- bootstrap_planned_exhausted_provider_errors_with_slot_info updated:
  `required` field now 1, not 2.

Net result: slice 13's substrate hosts ONE Helper per tier through
the managed path. Same coverage the demo binary currently provides,
but composed via the substrate. Slice 14 reopens the multi-role
case once role identity is durable across boots.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat(ipc): substrate hosts personas via composed slice 7-12 pipeline (#133 slice 13)

The moment-of-truth. Before this commit, the IPC boot loop at
`ipc/mod.rs:1024-1089` called `bootstrap_one(&intent)` per
ResumeOrMintProvider output and only LOGGED a welcome. The persona
was reachable via `airc peers` but never responded — a mute citizen.

After this commit, the boot path composes:

  PersonaSpawnerModule::plan_for_tier (slice 7)
    → bootstrap_planned (slice 8): mints/resumes airc identities
    → materialize_adapters (slice 9): builds inference adapters
    → spawn_persona_service (slice 12): runs serve_persona_loop
    → PersonaAircRuntimeRegistry::attach_service_loop (slice 13 Q3):
      parks the JoinHandle in the slot alongside the runtime

Each planned persona ends up with a tokio task running her cognition
path. The substrate hosts personas headlessly — no `airc_chat_demo`
in the inner ring.

Status against the design doc HEADLESS-PERSONA-HOST-LOOP.md:

APPLIED:
- P2: plan_for_tier returns single Helper (separate commit f940fa4).
- Q1: bootstrap_planned takes &Registry (separate commit 71429da).
- Q3: registry slot owns runtime + service_loop (commit 3f843aa).
- Boot composition collapses ~65 lines of inline bootstrap-only loop
  into ~115 lines of substrate composition using the existing slice
  primitives. Per [[organization-purity-as-we-migrate]], the old
  welcome-log-only path is DELETED, not kept alongside.

DEFERRED with TODO markers:
- Q2 (detect_host_capability wiring): the existing free function at
  cognition/host_capability_probe.rs:87 takes &dyn GpuMonitor +
  &System. No production code constructs a GpuMonitor today — only
  tests do. Slice 13 uses HwCapabilityTier::CpuOnly + HwTierCategory::
  Compat as the safe floor (the LCD Helper Qwen2.5-0.5B works for
  all tiers). TODO #52 cited for when GpuMonitor construction lands.
- P1 (tokio::signal::ctrl_c → Runtime::shutdown): the per-slot
  shutdown is available via `PersonaAircRuntimeRegistry::shutdown_slot`
  and exercised by persona/instances/* IPC commands. The server-
  level signal handler is its own sub-slice.
- P3 (ResourceBroker.acquire admission): current LCD case is 1
  persona × ~500 MiB GGUF, well within all tiers. Becomes load-
  bearing when multi-persona returns in slice 14.

Tests:
- 31 tests across slices 5-13 all green (registry, service_loop,
  supervisor, spawner, spawner_module, profile_builder, host).
- No new tests in this commit — the boot composition is the
  integration point; the integration test requires a stub
  PersonaInstanceManagerModule (slice 13 follow-up).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fixup: PR #1511 review — drain leaked task + orderly-drain orphan personas

Two blockers from the adversarial review:

BLOCKER 1 — JoinHandle leaked on attach_service_loop failure.
  `JoinHandle::drop` detaches rather than aborts. When
  `attach_service_loop` returned an error and the boot loop did
  `continue`, the spawned `serve_persona_loop` kept running
  untracked. The boot log lied "persona will not respond on the
  grid" while in fact the loop did respond, just outside the
  registry's view (so `shutdown_slot` couldn't find it). Worse on
  `"already attached"`: two loops competed for the same persona.
  Fix: `attach_service_loop` signature changed to
  `Result<(), (JoinHandle, &'static str)>` so the caller can
  orderly-drain (abort + await). Boot loop updated. Existing test
  updated to assert the handle comes back live (proves no implicit
  detach) before the test drains it.

BLOCKER 2 — Partial-bootstrap orphans on bootstrap_planned error.
  `bootstrap_planned` registers each persona via `bootstrap_one`
  BEFORE the next slot's mint runs. If slot k fails, slots 0..k-1
  are already in the registry but with no service loop attached —
  mute citizens. The boot loop early-returned with "no personas
  hosted" but they were. Fix: on `bootstrap_planned` error, the
  boot loop calls `registry.ids()` to get the partially-registered
  set and `shutdown_slot`s each. `shutdown_slot` handles "no
  service loop attached" gracefully (handle_opt is None) and drops
  the Arc cleanly — same orderly cleanup path as the normal
  shutdown, just no loop to abort. Error log updated to report
  `orphans_drained` count honestly.

Advisory 3 — `debug_assert!(plan.len() <= 1)` at the producer.
  P2 invariant was named in the commit body + tested in
  `compat_tier_plans_single_helper_on_lcd` but had no runtime
  tripwire. Added the debug_assert at `plan_for_tier`'s producer
  site with a TODO marker tying it to slice 14 (when the assert
  comes out alongside RoleAwareProvider + role-in-seed.json).

Verification:
  cargo test persona::airc_runtime_registry → 5 passed
  cargo test persona::spawner_module → 5 passed + 1 ignored
  cargo build --lib clean

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(persona): join by room_name, not UUID-as-string — substrate hosts in correct channel

PR #1511 integration test on Joel's Intel Mac revealed:
PersonaAircRuntime::bootstrap was calling
`airc.join(&default_room.as_uuid().to_string())`, which passes the
UUID's string representation into airc-lib's `ChannelName::new(name)`
— which DERIVES a channel UUID from the string. The persona landed
in derived channel `5d33e2a7…` while the operator's `airc room`
points at canonical `11c1a7ac…`. Same room, two channels, never see
each other.

The demo binary worked around this in slice 11 by using
`from_attached` (joining by name manually first), but the
substrate-managed path through PersonaInstanceManagerModule still
called the broken bootstrap.

Fix threads through 4 layers:
- airc/discovery.rs: new `discover_default_room_name()` parses
  `room: <name>` line from `airc room` stdout. Mirrors the existing
  `discover_default_channel()` shape; env override
  `AIRC_DEFAULT_ROOM_NAME` for tests/operators.
- airc/mod.rs: re-export the new function.
- modules/airc.rs: AircModule stores `attach_room_name: Option<String>`;
  `default_room_name() -> Option<&str>` getter. Loud warn if discovery
  fails — names the failure mode so operators see what's broken.
- modules/persona_instance_manager.rs:
  PersonaInstanceManagerModule::new takes Option<String> room name;
  bootstrap_one passes it to PersonaAircRuntime::bootstrap.
- persona/airc_runtime.rs::bootstrap: joins by name if Some,
  falls back to UUID-as-string + WARN if None.
- ipc/mod.rs: discovers + threads through.

Integration trace confirmed (slice13-server.log line 1078ish):
  joined_room=11c1a7ac-cb85-5ca0-a5b4-2847280ea3fa
  room_name=continuum

Test sites updated to pass `None` (4 in persona_instance_manager.rs
tests, 1 in spawner_module.rs).

Status after this fix:
✅ Substrate boot composition fires
✅ Persona hosted as substrate-managed Helper
✅ Joins canonical airc channel
✅ Receives operator messages via subscribe
✅ Service loop invokes inspect_persona_rag_with_inference
❌ Inference fails with `llama_decode returned -1` on mac-cpu-only —
   separate inference-layer bug, tracked as task #131-adjacent.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* refactor(persona): PersonaContext is the substrate's &ctx — Android-Context analog

Per Joel 2026-06-02: "design got out of control due to you failing to
use a shared object for all state info required for a persona OR
user. This is the airc user. Or base user with airc props." And
"make this pattern regular, ubiquitous &ctx, store references, make
it elegant."

What changed:
- HostedPersona is now PersonaContext (with `pub type HostedPersona =
  PersonaContext` for slice-9-era callers). The struct holds the
  persona's full context — identity (airc citizen facts), role,
  inference profile, adapter, runtime — and is passed by reference
  (`&ctx`) to every persona-scoped function.
- HostedPersona.instance renamed to `.identity`: it's the airc user
  identity (peer_id, agent_name, home, default_room, source).
- HostedPersona.profile (new) carries the PersonaInferenceProfile
  directly — single source of truth for inference shape. Replaces
  the prior context_window-only field. Downstream code reads
  `ctx.profile.context_length` etc. — no copied fields, no derived
  constants outside the named derivation site.
- HostedPersona.runtime (new) holds `Option<Arc<PersonaAircRuntime>>`.
  Production always Some (filled by materialize_adapters via the
  registry_lookup closure). Tests construct with None — the proper
  AircHandle trait abstraction lands as part of task #142.
- spawn_persona_service signature simplified — no separate runtime
  arg (`ctx.runtime` carries it).
- materialize_adapters takes a `runtime_lookup` closure so the
  supervisor folds the live runtime into each context at the
  composition seam.
- RagInspectionRequest::for_persona(persona_id, name, now_ms,
  &profile) is the single derivation site. The old `defaults_for`
  (32k hardcoded budget) stays for back-compat but is documented as
  legacy; service_loop uses `for_persona` exclusively.

Why this matters (the bug it fixes):
- PR #1511 integration trace caught `llama_decode returned -1` on
  Intel Mac mac-cpu-only: the LCD model was loaded at n_ctx=2048
  (Compat tier per profile_builder), but RagInspectionRequest::
  defaults_for was setting context_window=32_768. The RAG layer
  built a 32k-budget prompt that overflowed the 2k KV cache.
- The structural fix is the &ctx doctrine: profile is the single
  source, derivation happens in one named function.

Task #142 (BaseUser hierarchy) is the natural follow-up: extract
the airc props (identity + runtime) into a `BaseUser` trait that
persona/human/web actor contexts all derive from. Same shape per
[[personas-are-citizens-airc-is-identity-provider]].

Verification:
- cargo build --bin continuum-core-server clean
- cargo build --lib --tests clean
- Substrate boot composition still hosts Paige in correct channel
  (continuum, 11c1a7ac…)
- Service loop fires inference (slow on CPU; iteration target)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
…+ correct cleanup model

PR #1510 re-review (after the first revision) caught:

1. **Cleanup model still wrong** — claimed `ensure_wire_subscriber`
   / `inner.subscribers` (was the 428f928 worktree I was reading,
   not the pinned f6ed190). Real mechanism in f6ed190 is
   `EventStream::EventStreamInner::Daemon` holding
   `DaemonAttachGuard` (`stream.rs:25-68`). On `EventStream` drop,
   guard drops, per-channel attach `JoinHandle`s abort.
   **`.abort()` ALONE is sufficient** against the pinned rev. The
   `shutdown_slot` registry-remove is for in-substrate Arc state
   hygiene, not for daemon-side teardown. Section rewritten with
   the correct mechanism, file:line cited.

2. **"Hard prerequisites" misrepresented impl scope** — listed
   P1 (ctrl_c→shutdown), P3 (broker), Q5 (BootSummary publish),
   Q7 (5s poller) as "MUST land" but the implementation explicitly
   deferred all four. Restructured to "Slice 13 scope vs deferred
   follow-ups" with honest splits: shipped vs deferred + why each
   deferral is acceptable (single-persona LCD is in-budget;
   `shutdown_slot` available via IPC commands; cleanup model now
   shows `.abort()` is sufficient).

3. **"After" snippet didn't match impl** — used
   `BootSummary::default()` / `bus.publish` / `summary.failed.push`,
   but the impl uses `tracing::info!(hosted=N, failed=N)` counters.
   Rewrote the snippet to match what shipped. Notes Q5
   (BootSummary publish) as deferred.

Plus the implementation status section is now accurate:
- Q1, Q3, P2, Q2-partial: shipped ✅
- P1, P3, Q5, Q7, Q2-full: deferred to slice 13.5+ ❌
- Integration polish in #1511: room-name discovery + LCD model
  registry entry + PersonaContext rename + RagInspectionRequest::
  for_persona derivation site (the `&ctx` doctrine)
- Integration validation: Paige replied in continuum room

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@joelteply

Copy link
Copy Markdown
Contributor Author

Rev2 addresses all three findings from the re-review (commit 18460c6):

Finding #1 (cleanup model wrong): rewrote against pinned f6ed190. The actual mechanism is EventStream::EventStreamInner::Daemon holding DaemonAttachGuard (stream.rs:25-68). On EventStream drop, the guard's Vec<JoinHandle<()>> aborts. .abort() ALONE is sufficientshutdown_slot's registry-remove is for in-substrate Arc state hygiene, not for daemon-side teardown. Section now cites the right file:lines and explains both layers (daemon-side via DaemonAttachGuard, in-substrate via registry-remove).

Finding #2 ("Hard prerequisites" misrepresents impl scope): replaced the section with "Slice 13 scope vs deferred follow-ups" that honestly splits:

  • Shipped: Q1, Q3, P2, Q2-partial, boot composition replacement.
  • Deferred (slice 13.5+): P1, P3, Q5, Q7, Q2-full. Each one has its rationale — single-persona LCD is in-budget; shutdown_slot available via IPC commands; cleanup model shows .abort() is sufficient.
  • "Why this divergence is acceptable" subsection makes the substrate-doesn't-break-without-them argument explicit.

Finding #3 ("after" snippet doesn't match impl): rewrote the snippet to use tracing::info!(hosted=N, failed=N) counters + per-slot tracing::warn! exactly as the merged impl does. Notes Q5 (structured BootSummary publish) as deferred follow-up.

Plus updated the implementation status section to reflect what actually shipped in #1511 (merged at 9cc0571d3), including integration polish (room-name discovery, LCD model registry entry, PersonaContext rename, RagInspectionRequest::for_persona(&profile) single derivation site) and the integration validation — Paige replied in Joel's continuum room.

Ready for re-review.

@joelteply joelteply enabled auto-merge (squash) June 7, 2026 16:18
@joelteply joelteply merged commit 9ad2ea7 into canary Jun 7, 2026
2 checks passed
@joelteply joelteply deleted the docs/headless-host-loop-slice-13-design branch June 7, 2026 16:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant