test(persona): multi-persona stress baseline — substrate adds 1-3ms; LLM dominates by joelteply · Pull Request #1518 · CambrianTech/continuum

joelteply · 2026-06-03T01:09:59Z

Summary

Per Joel's "leaving it organic" directive: the substrate's measurement IS the organic signal that redirects the work. Stress test using the system primitives shipped in PR #1517 exercises the real materialize_adapters + serve_persona_loop pipeline with N=2/4/8/12 personas concurrent.

Headline finding: substrate adds 1-3 ms per turn under contention; the LLM call dominates per-turn latency.

| N × M  | Serve wall | Mean turn | Max turn |
|--------|------------|-----------|----------|
| 2 × 10 | 521 ms     | 51.6 ms   | 53 ms    |
| 4 × 10 | 521 ms     | 51.6 ms   | 53 ms    |
| 8 × 5  | 270 ms     | 51.5 ms   | 61 ms    |
| 12 × 5 | 270 ms     | 51.7 ms   | 61 ms    |

Adapter delay was 50ms injected. Substrate cost is 0.02-0.3% of real LLM-bound per-turn wall-clock (~5-15s in live Qwen 0.5B traces).
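The overhead figures follow from simple subtraction and division, sketched below. The function names are illustrative and not part of the codebase. Pairing 1 ms with a 5 s turn and 3 ms with a 1 s turn is an assumption that happens to reproduce the quoted 0.02-0.3% range; the 1 s lower bound comes from the commit message, not from this summary.

```rust
/// Substrate overhead per turn: measured turn latency minus the injected adapter delay.
fn substrate_overhead_ms(turn_ms: f64, injected_delay_ms: f64) -> f64 {
    turn_ms - injected_delay_ms
}

/// Overhead expressed as a percentage of an LLM-bound turn.
fn overhead_percent(overhead_ms: f64, llm_turn_ms: f64) -> f64 {
    100.0 * overhead_ms / llm_turn_ms
}

fn main() {
    let injected_ms = 50.0;
    // (label, mean turn ms, max turn ms), copied from the table in this summary.
    let rows = [
        ("2 x 10", 51.6, 53.0),
        ("4 x 10", 51.6, 53.0),
        ("8 x 5", 51.5, 61.0),
        ("12 x 5", 51.7, 61.0),
    ];
    for (label, mean, max) in rows {
        println!(
            "{label}: mean overhead {:.1} ms, worst-case overhead {:.1} ms",
            substrate_overhead_ms(mean, injected_ms),
            substrate_overhead_ms(max, injected_ms),
        );
    }
    // Assumed endpoint pairing for the quoted percentage range.
    println!("low end:  {:.2}%", overhead_percent(1.0, 5_000.0));
    println!("high end: {:.2}%", overhead_percent(3.0, 1_000.0));
}
```

The mean-turn column gives roughly 1.5-1.7 ms of overhead; the 61 ms max turn is 11 ms above the 50 ms floor.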

Implications

Captured in [[substrate-overhead-is-1to3ms-LLM-dominates-latency]]:

Substrate is NOT the bottleneck. The #149 (system prompt pre-tokenize) and #148 (RAG source pre-bind) latency vectors would save microseconds on a millisecond substrate.
For M5 + 12 personas: substrate is ready. Real enabler is #122 shared-base + LoRA paging.
What's actually blocking "functional + intelligent" is accuracy bugs: #151 greeting-loop, #152 identity hallucination, #153 service_loop bypasses evaluator, #113 should_respond via inference command.

Pivot

Pause latency-vector grinding. Pivot to accuracy + M5 readiness.

Test plan

cargo test --test multi_persona_stress_baseline \
    --no-default-features \
    --features livekit-webrtc,llama/mac-cpu-only,test-fixtures \
    -- --nocapture

Stacked on

PR #1517 (refactor/system-test-primitives) — uses the ubiquitous system primitives shipped there.

Closes #156.

🤖 Generated with Claude Code

…LLM dominates (#156)

Per Joel 2026-06-02: substrate must run well on M5 with 6-12 personas in video chat; on Intel Mac at least functional for multiple personas; on typical M-series decently useful + intelligent. Need DATA before guessing at latency vectors. Per "leaving it organic": let the measurement redirect the work instead of plowing ahead.

Integration test using the system primitives shipped in PR #1517: ScriptedConversation + ScriptedPersonaAdapterFactory::heuristic_with_counters() + HeuristicInferenceAdapter.with_delay_ms(50). Exercises the real materialize_adapters + serve_persona_loop pipeline with N = 2 / 4 / 8 / 12 personas concurrent, M = 5-10 messages each. tokio multi-thread runtime, 4 worker threads.

## Measured (Intel Mac, 2026-06-02)

| N x M  | Materialize | Serve wall | Mean turn | Max turn |
|--------|-------------|------------|-----------|----------|
| 2 x 10 | 0 ms        | 521 ms     | 51.6 ms   | 53 ms    |
| 4 x 10 | 0 ms        | 521 ms     | 51.6 ms   | 53 ms    |
| 8 x 5  | 0 ms        | 270 ms     | 51.5 ms   | 61 ms    |
| 12 x 5 | 0 ms        | 270 ms     | 51.7 ms   | 61 ms    |

Adapter delay was 50ms (injected). Substrate adds 1.5-3 ms per turn under contention. Throughput scales linearly with persona count. p100 tail latency is 61ms (only 11ms above floor).

## Implications captured in [[substrate-overhead-is-1to3ms-LLM-dominates-latency]]

1. The substrate IS NOT the bottleneck. Real Qwen 0.5B inference is 1000-15000 ms per turn (live trace). Substrate is 0.02-0.3% of total.
2. #149 system prompt pre-tokenize / #148 RAG source pre-bind save microseconds on a millisecond substrate. Not worth grinding until LLM gen shrinks.
3. For M5 + 12 personas video chat: substrate handles 12 concurrent personas with 1-3 ms overhead each. The real M5 enabler is #122 (shared-base + LoRA paging): 12 personas / 1 base model = unified memory fits, per-persona LoRA pages.
4. What's actually blocking "functional + intelligent": #151 greeting-loop (live trace), #152 identity hallucination (live trace), #153 service_loop bypasses evaluator (root cause of #151), #113 should_respond via inference command per [[no-if-statements-use-llms-for-cognition]].

## Pivot

Pause latency-vector grinding (#149, #148). Pivot to:

- #113 should_respond via inference command (fixes greeting-loop)
- #152 identity grounding via chat template
- #122 shared-base + LoRA paging (M5 enabler)

## How to run

cargo test --test multi_persona_stress_baseline \
    --no-default-features \
    --features livekit-webrtc,llama/mac-cpu-only,test-fixtures \
    -- --nocapture

The --nocapture is load-bearing: eprintln stress::* lines are the data; assertions verify structural invariants only.

Closes #156.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
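The shape of the measurement can be sketched without the project's primitives. The sketch below is an assumption-laden stand-in: it uses OS threads and `thread::sleep` instead of the tokio runtime, and `run_stress`, `StressOutcome`, and `fake_inference` are hypothetical names, not the real ScriptedConversation or serve_persona_loop APIs. It prints the data lines to stderr and asserts only structural invariants, mirroring the split described in the commit message.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Result of one stress run. Hypothetical type, not the project's ServeOutcome.
struct StressOutcome {
    serve_wall: Duration,
    turn_latencies: Vec<Duration>,
}

/// Stand-in for an inference call with an injected delay.
fn fake_inference(delay: Duration) {
    thread::sleep(delay);
}

/// Run `personas` workers concurrently; each handles `messages` turns in sequence.
fn run_stress(personas: usize, messages: usize, delay: Duration) -> StressOutcome {
    let start = Instant::now();
    let handles: Vec<_> = (0..personas)
        .map(|_| {
            thread::spawn(move || {
                let mut latencies = Vec::with_capacity(messages);
                for _ in 0..messages {
                    let turn_start = Instant::now();
                    fake_inference(delay);
                    latencies.push(turn_start.elapsed());
                }
                latencies
            })
        })
        .collect();
    let mut turn_latencies = Vec::new();
    for handle in handles {
        turn_latencies.extend(handle.join().expect("persona thread panicked"));
    }
    StressOutcome { serve_wall: start.elapsed(), turn_latencies }
}

/// Mean latency in milliseconds.
fn mean_ms(latencies: &[Duration]) -> f64 {
    let total: f64 = latencies.iter().map(|d| d.as_secs_f64() * 1000.0).sum();
    total / latencies.len() as f64
}

fn main() {
    let delay = Duration::from_millis(50);
    for (n, m) in [(2, 10), (4, 10), (8, 5), (12, 5)] {
        let out = run_stress(n, m, delay);
        let max = out.turn_latencies.iter().max().expect("at least one turn");
        // The printed lines are the data.
        eprintln!(
            "stress::{n}x{m} serve_wall={:?} mean_turn={:.1}ms max_turn={:?}",
            out.serve_wall,
            mean_ms(&out.turn_latencies),
            max
        );
        // Structural invariants only: every turn ran, none beat the injected floor.
        assert_eq!(out.turn_latencies.len(), n * m);
        assert!(out.turn_latencies.iter().all(|d| *d >= delay));
    }
}
```

Because each persona runs its turns concurrently with the others, serve wall tracks M × delay rather than N × M × delay, which matches the 521 ms and 270 ms rows in the table.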

joelteply · 2026-06-07T15:51:46Z

Superseded — the semantic content of this PR landed via the cognition pipeline rewires on canary.

Verified on canary (75b21d08a):

AircCitizen trait → src/workers/continuum-core/src/persona/airc_citizen.rs:73
PersonaConversation::prime → src/workers/continuum-core/src/persona/airc_persona_conversation.rs:105 + service_loop.rs:100
LatencyAggregate + ServeOutcome.turn_latency → service_loop.rs:167,247
AIProviderAdapter::warmup → ai/adapter.rs:319, called at supervisor.rs:447
Test scaffolding primitives (scripted_conversation, scripted_adapter_factory) → persona/scripted_*.rs
PersonaSpawnSupervisor + BootSummary → persona/host.rs:119
.instrument(ctx.span()) (&ctx-pure tracing) → service_loop.rs:309
multi_persona_stress_baseline.rs → already on canary (via feat(persona): persona decides + responds via LLM in ONE structured call #1519)

The stack pre-dated #1519's "persona decides + responds via LLM in ONE structured call" rewire and #1539's cognition pipeline integration. Those PRs reshaped serve_persona_loop so heavily that rebasing this stack would mean replaying the architectural intent against code that already implements it. Closing as superseded; task system marks #144/#146/#147/#150/#154/#156 completed.

No work lost — every architectural insight from this stack is in the substrate today.

#1545)

Iteration 1 of the inference-latency campaign (#195). Sprinkles probes at the load-bearing seams the substrate's RTOS debugger manual marked unchecked. With #196's on_close fix already on canary, `time_sync!` / `time_probe!` spans now persist to the JSONL sink, so the timing data this commit produces is capturable end-to-end for the first time.

## What's instrumented

**`inference/llamacpp_adapter.rs::generate_text`**: the dominant cost on LCD tier (95%+ wall-clock per 2026-06-06 baseline). The function was effectively a black box from the operator's POV: the existing `runtime::logger("llamacpp")` lines describe shape but not duration, and `tok_per_sec` was kept in a private `RwLock<f64>` (last-throughput-only).

Probes added:

- `inference.generate.enter`: request fingerprint at entry (model, persona_id, msg_count, max_tokens, has_system_prompt, parts_image, parts_audio). Pairs with `.exit` via span ancestry.
- `time_sync!("inference.render_chat", ...)`: chat-template rendering. Synchronous + small, but cumulative across many turns. Bracketing it lets the operator subtract it from `forward.*` cleanly.
- `time_probe!("inference.forward.text", ...)`: pure-text scheduler-managed path. The actual LLM decode.
- `time_probe!("inference.forward.multimodal", ...)`: mtmd path (image / audio). Distinct seam because it bypasses the scheduler and runs single-flight.
- `inference.generate.exit`: pairs with `.enter`. Carries the campaign's headline metric `tok_per_sec` plus duration_ms, tokens_out, text_len, model. A `jq` filter on `class == "inference.generate.exit"` is the latency dashboard in JSONL form.

**`persona/prompt_assembly.rs::assemble`**: the leading indicator for "why is prefill slow." When engrams / social signals / matched-angle grow unbounded, `system_message_len` shadows tok/s in the timing breakdown. Probe at the function tail carries the composition shape: system_message_len, message_count, estimated_tokens, matched_angle_present, engrams_count, social_signals_present, voice_mode, multi_party_strategy.

## Doctrine alignment

Per [[jtag-probes-are-rtos-debugger]] (Joel 2026-06-06): every probe site names the surrounding vars the way a breakpoint inspector would show locals. Easy one-liners; the macros do the plumbing. `class` strings follow the canonical taxonomy in `docs/architecture/RTOS-DEBUGGER-PROBES.md` (updated in this commit per the "When you add a probe, update this manual" rule).

Per [[no-rust-gates-around-cognition]]: probes observe, they DO NOT decide. None of these emits changes control flow. The existing `runtime::logger` and `last_throughput_tok_s` paths remain untouched; probes are additive.

Per [[init-once-handle-then-lease-zero-copy-refs]]: the macros expand to `tracing::event!` / `tracing::info_span!` calls that inherit `tracing`'s `release_max_level_*` compile-time gates. Zero cost when off; auditable per task #198 if a hot loop later needs the visitor allocation reviewed.

## Manual update

`docs/architecture/RTOS-DEBUGGER-PROBES.md`:

- Added the new classes to the taxonomy (`persona.prompt.assemble` with full field list; `inference.generate.{enter,exit}`; the three new `timing` seams).
- Marked the prompt-assembly checklist item DONE.
- Marked the llamacpp-adapter checklist item DONE with the specific call-site list and the campaign cross-reference.

## Validation

- `cargo check --features metal,accelerate`: clean
- `cargo test --lib persona::prompt_assembly`: 12/12 pass
- `cargo test --lib inference::llamacpp`: 12/12 pass
- 24/24 green across the two affected modules

## Next iteration

Iteration 2 (separate slice): run a real continuum boot with CONTINUUM_PROBE_FILE set, exercise the persona service loop against the multi-persona stress fixture (#1518's baseline), `jq` the JSONL to identify the dominant bottleneck. Optimize THAT. Iterate until tok_per_sec on the LCD tier hits the M5-class target.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
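To make the exit-probe fields concrete, here is a minimal std-only sketch of timing a call and deriving `tok_per_sec` from `tokens_out` and `duration_ms`. It is not the project's `time_sync!` / `time_probe!` macros, whose signatures are not shown here. `timed`, `ExitRecord`, and `to_jsonl` are hypothetical helpers, and both the JSON field layout and the tokens-divided-by-seconds definition are assumptions.

```rust
use std::time::{Duration, Instant};

/// One exit record. Hypothetical struct; field names follow the exit-probe list above.
struct ExitRecord {
    class: &'static str,
    duration_ms: f64,
    tokens_out: u64,
    tok_per_sec: f64,
}

/// Tokens per second from a token count and elapsed time (assumed definition).
fn tok_per_sec(tokens_out: u64, elapsed: Duration) -> f64 {
    let secs = elapsed.as_secs_f64();
    if secs > 0.0 { tokens_out as f64 / secs } else { 0.0 }
}

/// Time a closure; return its value together with the elapsed duration.
fn timed<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let value = f();
    (value, start.elapsed())
}

/// Render the record as one JSON line (hand-rolled; assumed schema).
fn to_jsonl(r: &ExitRecord) -> String {
    format!(
        "{{\"class\":\"{}\",\"duration_ms\":{:.3},\"tokens_out\":{},\"tok_per_sec\":{:.2}}}",
        r.class, r.duration_ms, r.tokens_out, r.tok_per_sec
    )
}

fn main() {
    // Stand-in for the decode step: sleep briefly and pretend 128 tokens came out.
    let (tokens_out, elapsed) = timed(|| {
        std::thread::sleep(Duration::from_millis(10));
        128u64
    });
    let record = ExitRecord {
        class: "inference.generate.exit",
        duration_ms: elapsed.as_secs_f64() * 1000.0,
        tokens_out,
        tok_per_sec: tok_per_sec(tokens_out, elapsed),
    };
    println!("{}", to_jsonl(&record));
}
```

Emitting one self-describing line per call is what lets a later filter on `class` pull every exit record out of the sink without joining against the matching enter event.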

github-actions Bot added the size: L label Jun 3, 2026

joelteply mentioned this pull request Jun 3, 2026

feat(persona): persona decides + responds via LLM in ONE structured call #1519

Merged

3 tasks

joelteply deleted the branch refactor/system-test-primitives June 7, 2026 15:51

joelteply closed this Jun 7, 2026

joelteply deleted the feat/multi-persona-stress-baseline branch June 7, 2026 15:51

joelteply wants to merge 1 commit into refactor/system-test-primitives from feat/multi-persona-stress-baseline
