docs(planning): HEADLESS-PERSONA-HOST-LOOP — slice 13 design#1510
Conversation
Captures the boot wire-up plan for #133 slice 13 before touching the load-bearing `ipc::start_server` flow. Documents: - The "moment-of-truth" gap: composition pieces from slices 5-12 exist; nothing in continuum-core actually calls them at boot. - Before/after of the IPC boot loop (~ipc/mod.rs:1024-1089). - The cleanup model verified during PR #1508's slice-12 review: .abort() → JoinHandle drop → conversation drop → EventStream drop → broadcast::Receiver drop auto-decrements subscriber count. Wire subscriber is ref-counted across local subscribers; tears down when runtime Arc reaches 0. No leak; no manual unsubscribe. - Five open questions with recommendations: 1. Arc<Registry> for build_profile — add model_registry::global_arc() 2. HwCapabilityTier source — HostCapabilityProbe::detect_at_boot() 3. hosted_handles ownership — new PersonaSupervisor module 4. ResumeOrMintProvider role-mapping — slice 14 territory 5. BootSummary event for per-slot failures - Test plan: PersonaBootstrapper trait split for stubbing. - Explicit non-goals (shared-base / cross-grid / LoRA / role-aware). - Doctrine memories worth refreshing on implementation. Net-additive doc — no code changes. Slice 13 implementation lands as a follow-up PR. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adversarial review: HEADLESS-PERSONA-HOST-LOOP slice 13 designVerdict: REQUEST_CHANGES. The doc captures the right architectural ambition but contains a load-bearing factual error in the cleanup-model section, gets the role/identity invariant in Q4 wrong (position-pairing is already broken in the LCD case from boot 2 onward, not just "at higher tiers"), and skips an obvious reuse opportunity ( Below: findings ordered by severity. Each one names file + line + the doctrine it tripped (where applicable). 1. BLOCKER — Cleanup model claim is wrong in daemon mode (doc lines 104–116)The doc claims: "Wire-level subscription (messaging.rs:186 ensure_wire_subscriber): this is the per-room daemon-side subscription. It is ref-counted across all local subscribers to the same room..." That is not what airc-lib does. There is no function
The Arc handle is shared. The daemon subscribe is per-call, NOT ref-counted. In production (daemon attached, which is the prod path per This invalidates the doc's correctness conclusion at line 115: "dropping the JoinHandle alone is sufficient cleanup. No leak." For ONE persona owning ONE conversation that's still true (the JoinHandle abort drops the conversation, which drops the EventStream, which drops the DaemonAttachGuard, which aborts the per-channel attach tasks). But the reason is wrong, and any future reviewer who reads this doc and concludes "Arc semantics + ref-counted wire" will design wrong code (e.g., expecting that spawning a second conversation on the same room reuses the wire attach). Required fix: rewrite section 104–116 against the actual daemon_subscribe path. Cite This is the BLOCKER finding. Defaulting to REQUEST_CHANGES on this alone. 2. BLOCKER — Position-pairing of intents to plan rows (Q4) is broken from boot 2, even on LCDDoc line 157: "Slice 13 pairs the Nth intent with the Nth plan entry by position. That's reasonable for the LCD case (every persona at this tier is helper/coder)..." It's not reasonable. Trace:
The mint order on boot 1 is plan-order. The reload order on boot 2 is alpha-order-of-randomly-derived-names. They almost never match. This breaks even the 2-persona LCD case the doc declares "reasonable." Existing tests don't catch this because Compounding problem: Required fix: either
(b) is cleanest and the doc already acknowledges slice 14 is the right home for this — but then slice 13 cannot ship before slice 14. The doc currently asserts the opposite at line 157. 3. MAJOR — The "after" code at lines 50–100 duplicates two existing functions that slice 13 should compose, not reimplementThe doc presents an open-coded loop calling
The doc's "after" code reimplements both, ignoring the structured Required fix: rewrite "after" to be roughly: let plans = bootstrap_planned(&spawner, &instance_manager, &mut provider,
hw_capability_id_str, &model_registry).await?;
let hosted = materialize_adapters(plans, &*factory).await;
for (slot, result) in hosted.into_iter().enumerate() {
match result { ... spawn_persona_service ... }
}This drops slice 13's net-new code from ~50 lines to ~15 lines, eliminates two error-shape duplications, and forces the implementer to confront the per-slot error variants that slices 8 + 9 already typed. 4. MAJOR — Q3's PersonaSupervisor module duplicates a concern that
|
PR #1510's adversarial reviewer caught 9 findings, 3 blocking. Substantial revision addresses each: Blockers: - Finding #1 (cleanup model wrong): rewritten "Cleanup model" section. `.abort()` alone is INSUFFICIENT — wire subscriber stays in Arc<Airc>.inner.subscribers until registry-remove drops the Arc. Supervisor shutdown path MUST: abort → await → registry.remove. - Finding #2 (position-pairing broken): elevated to hard prereq P2. scan_personas_dir sorts alphabetically; boot 2 yields persona order != plan order on random-derived names. Slice 13 ships with plan.len() <= 1 (Helper only); slice 14 lands role-in-seed.json + RoleAwareProvider before re-enabling Coder. - Finding #3 (after-code reimplements existing fns): "after" rewrite uses bootstrap_planned (slice 8) + materialize_adapters (slice 9) as composition, not open-coded copies. Net-new code ~25 lines, not ~50. Major: - Finding #4 (PersonaSupervisor dup with PersonaAircRuntimeRegistry): Q3 revised to EXTEND the registry — HostedPersonaRuntime owns both airc_runtime + service_loop JoinHandle. registry.remove becomes the natural shutdown path. One keyspace, one cleanup chain. - Finding #5 (no Runtime::shutdown caller): elevated to hard prereq P1. Slice 13 wires tokio::signal::ctrl_c -> runtime.shutdown. - Finding #6 (no broker admission for N adapter spawns): elevated to hard prereq P3. ResourceBroker.acquire before each factory.build_adapter. Moderate: - Finding #7 (detect_host_capability already exists): Q2 restated to call the existing free function at host_capability_probe.rs:87. - Finding #8 (global_arc() cost inverted): Q1 recommendation flipped to (B) — refactor bootstrap_planned to take &Registry. Singleton storage migration would have touched every callsite of global(). Minor: - Finding #9 (BootSummary venue): Q5 specifies MessageBus::publish("persona:boot:summary", ...) with operator scraping via events/recent; no declared subscribers in slice 13. - Finding #10 (hot-reload undefined): added as Q6, declared out-of-scope. - Finding #11 (wire-subscription failure mid-boot): added as Q7, supervisor polls JoinHandle::is_finished every 5s. Plus added implementation checklist at end so slice 13 PR has a sign-off surface. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
Addressed all 11 findings in commit f81de21. Summary: Blockers fixed
Major fixed
Moderate fixed
Minor fixed
Plus added an implementation checklist at the end so slice 13's PR has a sign-off surface. Ready for re-review. |
Re-review of revision
|
* refactor(persona): bootstrap_planned/derive_spawn_plan/build_profile take &Registry Per HEADLESS-PERSONA-HOST-LOOP design doc Q1 (PR #1510 review found the original recommendation was inverted): the substrate boot path holds `&'static Registry` from `model_registry::global()`. Migrating the singleton to `OnceLock<Arc<Registry>>` would touch every callsite of `global()` and change the lifetime semantics throughout the crate. Smaller change: drop the Arc requirement from the three functions that took `&Arc<Registry>` and accept `&Registry` instead. Rust's Deref coercion at the test call sites handles `Arc<Registry>` ↦ `&Registry` transparently — no test changes needed. Functions updated: - profile_builder::build_profile (slice 5) - spawner::derive_spawn_plan (slice 6) - spawner_module::bootstrap_planned (slice 8) All slice 5-9 tests still pass: persona::profile_builder — 4 passed persona::spawner — 4 passed persona::spawner_module — 5 passed Unblocks the slice-13 boot composition at `ipc::start_server` where the registry is `&'static Registry`, not an Arc. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(persona): extend PersonaAircRuntimeRegistry → PersonaSlot with service_loop Per HEADLESS-PERSONA-HOST-LOOP design doc Q3 (PR #1510 review): a PersonaSupervisor module storing JoinHandles keyed by persona_id would duplicate the existing PersonaAircRuntimeRegistry keyspace. Both modules would own per-persona lifetime info. Compression failure per [[organization-purity-as-we-migrate]]. Resolution: extend the registry. Each slot becomes: PersonaSlot { runtime: Arc<PersonaAircRuntime>, service_loop: Mutex<Option<JoinHandle<Result<ServeOutcome, _>>>>, } New methods: - attach_service_loop(persona_id, handle) — supervisor wires the per-persona serve loop into the slot. Refuses silent overwrites. - is_service_loop_finished(persona_id) — Q7's periodic crash poll. - shutdown_slot(persona_id) — the orderly path: take JoinHandle → abort → await → remove slot. The slot drop cascades: Arc<PersonaSlot> → Arc<PersonaAircRuntime> → Arc<Airc> → inner.subscribers map drop → daemon-attach wire tasks abort. Per the cleanup-model section of the design doc, BOTH steps (abort + slot remove) are required — abort alone leaves the wire subscriber alive until the Arc drops via registry removal. - ids() — Vec<Uuid> snapshot for the supervisor's poller without cloning N runtime Arcs. Existing surface preserved for back-compat: - register, get, get_by_agent_name, remove (sync), iter, len, is_empty all return runtime Arcs (not slot Arcs). The slot is internal. Tests cover the failure modes: - attach_service_loop_errors_when_no_slot - is_service_loop_finished_returns_none_for_missing_slot - shutdown_slot_returns_none_for_missing_persona Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(persona): constrain plan_for_tier to single Helper until slice 14 Per HEADLESS-PERSONA-HOST-LOOP design doc P2 (PR #1510 review finding #2 — position-pairing broken from boot 2): ResumeOrMintProvider::scan_personas_dir sorts directory entries alphabetically (resume_or_mint_provider.rs:200). On boot 1 the substrate mints personas in plan order [Helper, Coder] with random-derived names (e.g. Maya for Helper, Bart for Coder). On boot 2, scan yields them in alphabetic order [Bart, Maya] — position-pairing against [Helper, Coder] flips the roles. Bart becomes Helper when he was Coder. Role identity flipped silently. The hazard exists in slice 8's bootstrap_planned today but doesn't manifest because nothing depends on (persona_id, role) yet. Slice 13 IS that consumer (cognition + supervisor both observe the role). Without a fix, slice 13 ships with a latent boot-2 regression. Fix shape: - plan_for_tier returns ONE Helper for all tiers until slice 14. - TODO marker names slice 14 as the load-bearing fix (role-in-seed.json + RoleAwareProvider). - Existing test `compat_tier_plans_helper_and_coder_on_lcd` renamed to `compat_tier_plans_single_helper_on_lcd` with updated invariant. - New `slice_14_restores_helper_plus_coder_for_compat` test pinned `#[ignore]` until slice 14 — it's the spec slice 14 has to satisfy. Going red on the ignore-removal date is the design's reminder. - bootstrap_planned_exhausted_provider_errors_with_slot_info updated: `required` field now 1, not 2. Net result: slice 13's substrate hosts ONE Helper per tier through the managed path. Same coverage the demo binary currently provides, but composed via the substrate. Slice 14 reopens the multi-role case once role identity is durable across boots. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat(ipc): substrate hosts personas via composed slice 7-12 pipeline (#133 slice 13) The moment-of-truth. Before this commit, the IPC boot loop at `ipc/mod.rs:1024-1089` called `bootstrap_one(&intent)` per ResumeOrMintProvider output and only LOGGED a welcome. The persona was reachable via `airc peers` but never responded — a mute citizen. After this commit, the boot path composes: PersonaSpawnerModule::plan_for_tier (slice 7) → bootstrap_planned (slice 8): mints/resumes airc identities → materialize_adapters (slice 9): builds inference adapters → spawn_persona_service (slice 12): runs serve_persona_loop → PersonaAircRuntimeRegistry::attach_service_loop (slice 13 Q3): parks the JoinHandle in the slot alongside the runtime Each planned persona ends up with a tokio task running her cognition path. The substrate hosts personas headlessly — no `airc_chat_demo` in the inner ring. Status against the design doc HEADLESS-PERSONA-HOST-LOOP.md: APPLIED: - P2: plan_for_tier returns single Helper (separate commit f940fa4). - Q1: bootstrap_planned takes &Registry (separate commit 71429da). - Q3: registry slot owns runtime + service_loop (commit 3f843aa). - Boot composition collapses ~65 lines of inline bootstrap-only loop into ~115 lines of substrate composition using the existing slice primitives. Per [[organization-purity-as-we-migrate]], the old welcome-log-only path is DELETED, not kept alongside. DEFERRED with TODO markers: - Q2 (detect_host_capability wiring): the existing free function at cognition/host_capability_probe.rs:87 takes &dyn GpuMonitor + &System. No production code constructs a GpuMonitor today — only tests do. Slice 13 uses HwCapabilityTier::CpuOnly + HwTierCategory:: Compat as the safe floor (the LCD Helper Qwen2.5-0.5B works for all tiers). TODO #52 cited for when GpuMonitor construction lands. - P1 (tokio::signal::ctrl_c → Runtime::shutdown): the per-slot shutdown is available via `PersonaAircRuntimeRegistry::shutdown_slot` and exercised by persona/instances/* IPC commands. The server- level signal handler is its own sub-slice. - P3 (ResourceBroker.acquire admission): current LCD case is 1 persona × ~500 MiB GGUF, well within all tiers. Becomes load- bearing when multi-persona returns in slice 14. Tests: - 31 tests across slices 5-13 all green (registry, service_loop, supervisor, spawner, spawner_module, profile_builder, host). - No new tests in this commit — the boot composition is the integration point; the integration test requires a stub PersonaInstanceManagerModule (slice 13 follow-up). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fixup: PR #1511 review — drain leaked task + orderly-drain orphan personas Two blockers from the adversarial review: BLOCKER 1 — JoinHandle leaked on attach_service_loop failure. `JoinHandle::drop` detaches rather than aborts. When `attach_service_loop` returned an error and the boot loop did `continue`, the spawned `serve_persona_loop` kept running untracked. The boot log lied "persona will not respond on the grid" while in fact the loop did respond, just outside the registry's view (so `shutdown_slot` couldn't find it). Worse on `"already attached"`: two loops competed for the same persona. Fix: `attach_service_loop` signature changed to `Result<(), (JoinHandle, &'static str)>` so the caller can orderly-drain (abort + await). Boot loop updated. Existing test updated to assert the handle comes back live (proves no implicit detach) before the test drains it. BLOCKER 2 — Partial-bootstrap orphans on bootstrap_planned error. `bootstrap_planned` registers each persona via `bootstrap_one` BEFORE the next slot's mint runs. If slot k fails, slots 0..k-1 are already in the registry but with no service loop attached — mute citizens. The boot loop early-returned with "no personas hosted" but they were. Fix: on `bootstrap_planned` error, the boot loop calls `registry.ids()` to get the partially-registered set and `shutdown_slot`s each. `shutdown_slot` handles "no service loop attached" gracefully (handle_opt is None) and drops the Arc cleanly — same orderly cleanup path as the normal shutdown, just no loop to abort. Error log updated to report `orphans_drained` count honestly. Advisory 3 — `debug_assert!(plan.len() <= 1)` at the producer. P2 invariant was named in the commit body + tested in `compat_tier_plans_single_helper_on_lcd` but had no runtime tripwire. Added the debug_assert at `plan_for_tier`'s producer site with a TODO marker tying it to slice 14 (when the assert comes out alongside RoleAwareProvider + role-in-seed.json). Verification: cargo test persona::airc_runtime_registry → 5 passed cargo test persona::spawner_module → 5 passed + 1 ignored cargo build --lib clean Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(persona): join by room_name, not UUID-as-string — substrate hosts in correct channel PR #1511 integration test on Joel's Intel Mac revealed: PersonaAircRuntime::bootstrap was calling `airc.join(&default_room.as_uuid().to_string())`, which passes the UUID's string representation into airc-lib's `ChannelName::new(name)` — which DERIVES a channel UUID from the string. The persona landed in derived channel `5d33e2a7…` while the operator's `airc room` points at canonical `11c1a7ac…`. Same room, two channels, never see each other. The demo binary worked around this in slice 11 by using `from_attached` (joining by name manually first), but the substrate-managed path through PersonaInstanceManagerModule still called the broken bootstrap. Fix threads through 4 layers: - airc/discovery.rs: new `discover_default_room_name()` parses `room: <name>` line from `airc room` stdout. Mirrors the existing `discover_default_channel()` shape; env override `AIRC_DEFAULT_ROOM_NAME` for tests/operators. - airc/mod.rs: re-export the new function. - modules/airc.rs: AircModule stores `attach_room_name: Option<String>`; `default_room_name() -> Option<&str>` getter. Loud warn if discovery fails — names the failure mode so operators see what's broken. - modules/persona_instance_manager.rs: PersonaInstanceManagerModule::new takes Option<String> room name; bootstrap_one passes it to PersonaAircRuntime::bootstrap. - persona/airc_runtime.rs::bootstrap: joins by name if Some, falls back to UUID-as-string + WARN if None. - ipc/mod.rs: discovers + threads through. Integration trace confirmed (slice13-server.log line 1078ish): joined_room=11c1a7ac-cb85-5ca0-a5b4-2847280ea3fa room_name=continuum Test sites updated to pass `None` (4 in persona_instance_manager.rs tests, 1 in spawner_module.rs). Status after this fix: ✅ Substrate boot composition fires ✅ Persona hosted as substrate-managed Helper ✅ Joins canonical airc channel ✅ Receives operator messages via subscribe ✅ Service loop invokes inspect_persona_rag_with_inference ❌ Inference fails with `llama_decode returned -1` on mac-cpu-only — separate inference-layer bug, tracked as task #131-adjacent. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * refactor(persona): PersonaContext is the substrate's &ctx — Android-Context analog Per Joel 2026-06-02: "design got out of control due to you failing to use a shared object for all state info required for a persona OR user. This is the airc user. Or base user with airc props." And "make this pattern regular, ubiquitous &ctx, store references, make it elegant." What changed: - HostedPersona is now PersonaContext (with `pub type HostedPersona = PersonaContext` for slice-9-era callers). The struct holds the persona's full context — identity (airc citizen facts), role, inference profile, adapter, runtime — and is passed by reference (`&ctx`) to every persona-scoped function. - HostedPersona.instance renamed to `.identity`: it's the airc user identity (peer_id, agent_name, home, default_room, source). - HostedPersona.profile (new) carries the PersonaInferenceProfile directly — single source of truth for inference shape. Replaces the prior context_window-only field. Downstream code reads `ctx.profile.context_length` etc. — no copied fields, no derived constants outside the named derivation site. - HostedPersona.runtime (new) holds `Option<Arc<PersonaAircRuntime>>`. Production always Some (filled by materialize_adapters via the registry_lookup closure). Tests construct with None — the proper AircHandle trait abstraction lands as part of task #142. - spawn_persona_service signature simplified — no separate runtime arg (`ctx.runtime` carries it). - materialize_adapters takes a `runtime_lookup` closure so the supervisor folds the live runtime into each context at the composition seam. - RagInspectionRequest::for_persona(persona_id, name, now_ms, &profile) is the single derivation site. The old `defaults_for` (32k hardcoded budget) stays for back-compat but is documented as legacy; service_loop uses `for_persona` exclusively. Why this matters (the bug it fixes): - PR #1511 integration trace caught `llama_decode returned -1` on Intel Mac mac-cpu-only: the LCD model was loaded at n_ctx=2048 (Compat tier per profile_builder), but RagInspectionRequest:: defaults_for was setting context_window=32_768. The RAG layer built a 32k-budget prompt that overflowed the 2k KV cache. - The structural fix is the &ctx doctrine: profile is the single source, derivation happens in one named function. Task #142 (BaseUser hierarchy) is the natural follow-up: extract the airc props (identity + runtime) into a `BaseUser` trait that persona/human/web actor contexts all derive from. Same shape per [[personas-are-citizens-airc-is-identity-provider]]. Verification: - cargo build --bin continuum-core-server clean - cargo build --lib --tests clean - Substrate boot composition still hosts Paige in correct channel (continuum, 11c1a7ac…) - Service loop fires inference (slow on CPU; iteration target) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
…+ correct cleanup model PR #1510 re-review (after the first revision) caught: 1. **Cleanup model still wrong** — claimed `ensure_wire_subscriber` / `inner.subscribers` (was the 428f928 worktree I was reading, not the pinned f6ed190). Real mechanism in f6ed190 is `EventStream::EventStreamInner::Daemon` holding `DaemonAttachGuard` (`stream.rs:25-68`). On `EventStream` drop, guard drops, per-channel attach `JoinHandle`s abort. **`.abort()` ALONE is sufficient** against the pinned rev. The `shutdown_slot` registry-remove is for in-substrate Arc state hygiene, not for daemon-side teardown. Section rewritten with the correct mechanism, file:line cited. 2. **"Hard prerequisites" misrepresented impl scope** — listed P1 (ctrl_c→shutdown), P3 (broker), Q5 (BootSummary publish), Q7 (5s poller) as "MUST land" but the implementation explicitly deferred all four. Restructured to "Slice 13 scope vs deferred follow-ups" with honest splits: shipped vs deferred + why each deferral is acceptable (single-persona LCD is in-budget; `shutdown_slot` available via IPC commands; cleanup model now shows `.abort()` is sufficient). 3. **"After" snippet didn't match impl** — used `BootSummary::default()` / `bus.publish` / `summary.failed.push`, but the impl uses `tracing::info!(hosted=N, failed=N)` counters. Rewrote the snippet to match what shipped. Notes Q5 (BootSummary publish) as deferred. Plus the implementation status section is now accurate: - Q1, Q3, P2, Q2-partial: shipped ✅ - P1, P3, Q5, Q7, Q2-full: deferred to slice 13.5+ ❌ - Integration polish in #1511: room-name discovery + LCD model registry entry + PersonaContext rename + RagInspectionRequest:: for_persona derivation site (the `&ctx` doctrine) - Integration validation: Paige replied in continuum room Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
Rev2 addresses all three findings from the re-review (commit 18460c6): Finding #1 (cleanup model wrong): rewrote against pinned Finding #2 ("Hard prerequisites" misrepresents impl scope): replaced the section with "Slice 13 scope vs deferred follow-ups" that honestly splits:
Finding #3 ("after" snippet doesn't match impl): rewrote the snippet to use Plus updated the implementation status section to reflect what actually shipped in #1511 (merged at Ready for re-review. |
Summary
Design doc for #133 slice 13 — the IPC-boot rewire that finally makes
continuum-corehost personas headlessly without the demo binary in the inner ring.Net-additive: no code changes. Captures the boot wire-up plan + the cleanup model verified during PR #1508's slice-12 review, before touching the load-bearing
crate::ipc::start_serverflow.What's in the doc
airc_lib::EventStream::drop→tokio::sync::broadcast::Receiverauto-decrement; wire subscriber ref-counted across local subscribers. No leak.Arc<Registry>forbuild_profile— addmodel_registry::global_arc()HwCapabilityTiersource —HostCapabilityProbe::detect_at_boot()hosted_handlesownership — newPersonaSupervisormodulePersonaBootstrappertrait split for stubbing.Why a design PR before the implementation PR
The boot loop is load-bearing. Reviewers should weigh in on the architectural choices (open questions 1-5) before I write 200 lines of wire-up that lock the answers in.
Reference docs
Test plan
🤖 Generated with Claude Code