test(persona): multi-persona stress baseline - substrate adds 1-3ms; LLM dominates #1518
Closed
joelteply wants to merge 1 commit into
…LLM dominates (#156)

Per Joel 2026-06-02: substrate must run well on M5 with 6-12 personas in video chat; on Intel Mac at least functional for multiple personas; on typical M-series decently useful + intelligent. Need DATA before guessing at latency vectors. Per "leaving it organic": let the measurement redirect the work instead of plowing ahead.

Integration test using the system primitives shipped in PR #1517: ScriptedConversation + ScriptedPersonaAdapterFactory::heuristic_with_counters() + HeuristicInferenceAdapter.with_delay_ms(50). Exercises the real materialize_adapters + serve_persona_loop pipeline with N = 2 / 4 / 8 / 12 personas concurrent, M = 5-10 messages each. tokio multi-thread runtime, 4 worker threads.

## Measured (Intel Mac, 2026-06-02)

| N x M  | Materialize | Serve wall | Mean turn | Max turn |
|--------|-------------|------------|-----------|----------|
| 2 x 10 | 0 ms        | 521 ms     | 51.6 ms   | 53 ms    |
| 4 x 10 | 0 ms        | 521 ms     | 51.6 ms   | 53 ms    |
| 8 x 5  | 0 ms        | 270 ms     | 51.5 ms   | 61 ms    |
| 12 x 5 | 0 ms        | 270 ms     | 51.7 ms   | 61 ms    |

Adapter delay was 50ms (injected). Substrate adds 1.5-3 ms per turn under contention. Throughput scales linearly with persona count. p100 tail latency is 61ms (only 11ms above floor).

## Implications captured in [[substrate-overhead-is-1to3ms-LLM-dominates-latency]]

1. The substrate IS NOT the bottleneck. Real Qwen 0.5B inference is 1000-15000 ms per turn (live trace). Substrate is 0.02-0.3% of total.
2. #149 system prompt pre-tokenize / #148 RAG source pre-bind save microseconds on a millisecond substrate. Not worth grinding until LLM gen shrinks.
3. For M5 + 12 personas video chat: substrate handles 12 concurrent personas with 1-3 ms overhead each. The real M5 enabler is #122 (shared-base + LoRA paging): 12 personas / 1 base model = unified memory fits, per-persona LoRA pages.
4. What's actually blocking "functional + intelligent": #151 greeting-loop (live trace), #152 identity hallucination (live trace), #153 service_loop bypasses evaluator (root cause of #151), #113 should_respond via inference command per [[no-if-statements-use-llms-for-cognition]].

## Pivot

Pause latency-vector grinding (#149, #148). Pivot to:

- #113 should_respond via inference command (fixes greeting-loop)
- #152 identity grounding via chat template
- #122 shared-base + LoRA paging (M5 enabler)

## How to run

    cargo test --test multi_persona_stress_baseline --no-default-features --features livekit-webrtc,llama/mac-cpu-only,test-fixtures -- --nocapture

The --nocapture is load-bearing: eprintln stress::* lines are the data; assertions verify structural invariants only.

Closes #156.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
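The measurement approach described in this commit message (N concurrent personas, M turns each, a fixed injected delay, overhead read as time above that floor) can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the test in this PR: `fake_turn`, `run_baseline`, and `RunStats` are hypothetical names, OS threads stand in for the tokio runtime, and a plain sleep stands in for the scripted adapter. Numbers it prints will differ from the table above.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Per-run timing summary, loosely mirroring the columns of the table.
#[derive(Debug)]
struct RunStats {
    serve_wall_ms: f64,
    mean_turn_ms: f64,
    max_turn_ms: f64,
}

/// Hypothetical stand-in for one persona turn: sleeps for the injected
/// delay, the way an adapter with a fixed artificial delay might.
fn fake_turn(delay: Duration) {
    thread::sleep(delay);
}

/// Runs `personas` threads, each serving `messages` turns, and records
/// how long every turn took end to end.
fn run_baseline(personas: usize, messages: usize, delay: Duration) -> RunStats {
    let wall = Instant::now();
    let handles: Vec<_> = (0..personas)
        .map(|_| {
            thread::spawn(move || {
                let mut turns = Vec::with_capacity(messages);
                for _ in 0..messages {
                    let start = Instant::now();
                    fake_turn(delay);
                    turns.push(start.elapsed().as_secs_f64() * 1000.0);
                }
                turns
            })
        })
        .collect();

    // Gather every per-turn duration from every persona thread.
    let mut all = Vec::new();
    for handle in handles {
        all.extend(handle.join().expect("persona thread panicked"));
    }

    let serve_wall_ms = wall.elapsed().as_secs_f64() * 1000.0;
    let mean_turn_ms = all.iter().sum::<f64>() / all.len() as f64;
    let max_turn_ms = all.iter().cloned().fold(0.0_f64, f64::max);
    RunStats { serve_wall_ms, mean_turn_ms, max_turn_ms }
}

fn main() {
    let delay_ms = 50.0;
    let delay = Duration::from_millis(50);
    for (n, m) in [(2, 10), (4, 10), (8, 5), (12, 5)] {
        let s = run_baseline(n, m, delay);
        // Overhead is whatever the mean turn costs above the injected floor.
        let overhead = s.mean_turn_ms - delay_ms;
        eprintln!(
            "stress::baseline n={} m={} wall={:.0}ms mean={:.1}ms max={:.0}ms overhead={:.1}ms",
            n, m, s.serve_wall_ms, s.mean_turn_ms, s.max_turn_ms, overhead
        );
    }
}
```

Because the data goes to stderr, the same caveat applies as for the real test: the printed lines are the result, and any assertions should only check structural properties such as "every turn took at least the injected delay".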
Superseded: the semantic content of this PR landed via the cognition pipeline rewires on canary. Verified on …

The stack pre-dated #1519's "persona decides + responds via LLM in ONE structured call" rewire and #1539's cognition pipeline integration. Those PRs reshaped …

No work lost: every architectural insight from this stack is in the substrate today.
joelteply added a commit that referenced this pull request Jun 7, 2026
#1545)

Iteration 1 of the inference-latency campaign (#195). Sprinkles probes at the load-bearing seams the substrate's RTOS debugger manual marked unchecked. With #196's on_close fix already on canary, `time_sync!` / `time_probe!` spans now persist to the JSONL sink, so the timing data this commit produces is captureable end-to-end for the first time.

## What's instrumented

**`inference/llamacpp_adapter.rs::generate_text`**: the dominant cost on LCD tier (95%+ wall-clock per 2026-06-06 baseline). The function was effectively a black box from the operator's POV: the existing `runtime::logger("llamacpp")` lines describe shape but not duration, and `tok_per_sec` was kept in a private `RwLock<f64>` (last-throughput-only). Probes added:

- `inference.generate.enter`: request fingerprint at entry (model, persona_id, msg_count, max_tokens, has_system_prompt, parts_image, parts_audio). Pairs with `.exit` via span ancestry.
- `time_sync!("inference.render_chat", ...)`: chat-template rendering. Synchronous + small, but cumulative across many turns. Bracketing it lets the operator subtract it from `forward.*` cleanly.
- `time_probe!("inference.forward.text", ...)`: pure-text scheduler-managed path. The actual LLM decode.
- `time_probe!("inference.forward.multimodal", ...)`: mtmd path (image / audio). Distinct seam because it bypasses the scheduler and runs single-flight.
- `inference.generate.exit`: pairs with `.enter`. Carries the campaign's headline metric `tok_per_sec` plus duration_ms, tokens_out, text_len, model. A `jq` filter on `class == "inference.generate.exit"` is the latency dashboard in JSONL form.

**`persona/prompt_assembly.rs::assemble`**: the leading indicator for "why is prefill slow." When engrams / social signals / matched-angle grow unbounded, `system_message_len` shadows tok/s in the timing breakdown. Probe at the function tail carries the composition shape: system_message_len, message_count, estimated_tokens, matched_angle_present, engrams_count, social_signals_present, voice_mode, multi_party_strategy.

## Doctrine alignment

Per [[jtag-probes-are-rtos-debugger]] (Joel 2026-06-06): every probe site names the surrounding vars the way a breakpoint inspector would show locals. Easy one-liners; the macros do the plumbing. `class` strings follow the canonical taxonomy in `docs/architecture/RTOS-DEBUGGER-PROBES.md` (updated in this commit per the "When you add a probe, update this manual" rule).

Per [[no-rust-gates-around-cognition]]: probes observe, they DO NOT decide. None of these emits changes control flow. The existing `runtime::logger` and `last_throughput_tok_s` paths remain untouched; probes are additive.

Per [[init-once-handle-then-lease-zero-copy-refs]]: the macros expand to `tracing::event!` / `tracing::info_span!` calls that inherit `tracing`'s `release_max_level_*` compile-time gates. Zero cost when off; auditable per task #198 if a hot loop later needs the visitor allocation reviewed.

## Manual update

`docs/architecture/RTOS-DEBUGGER-PROBES.md`:

- Added the new classes to the taxonomy (`persona.prompt.assemble` with full field list; `inference.generate.{enter,exit}`; the three new `timing` seams).
- Marked the prompt-assembly checklist item DONE.
- Marked the llamacpp-adapter checklist item DONE with the specific call-site list and the campaign cross-reference.

## Validation

- `cargo check --features metal,accelerate`: clean
- `cargo test --lib persona::prompt_assembly`: 12/12 pass
- `cargo test --lib inference::llamacpp`: 12/12 pass
- 24/24 green across the two affected modules

## Next iteration

Iteration 2 (separate slice): run a real continuum boot with CONTINUUM_PROBE_FILE set, exercise the persona service loop against the multi-persona stress fixture (#1518's baseline), `jq` the JSONL to identify the dominant bottleneck. Optimize THAT. Iterate. Until tok_per_sec on the LCD tier hits the M5-class target.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
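The bracketing idea behind the timing probes, and the shape of an exit record carrying `tok_per_sec`, can be illustrated with a small standard-library sketch. Everything here is an assumption made for illustration: `time_block!` is a hypothetical macro (not the project's `time_sync!` / `time_probe!`, which the commit says expand to `tracing` calls), the JSON layout is guessed from the field names listed above, and `tok_per_sec` is assumed to be tokens out divided by elapsed seconds.

```rust
use std::time::Instant;

/// Hypothetical stand-in for a timing probe: evaluates `$body`, then pushes
/// the class name and elapsed milliseconds onto `$sink`, returning the
/// body's value unchanged so the probe never alters control flow.
macro_rules! time_block {
    ($class:expr, $sink:expr, $body:block) => {{
        let start = Instant::now();
        let out = $body;
        let ms = start.elapsed().as_secs_f64() * 1000.0;
        $sink.push(($class, ms));
        out
    }};
}

/// Tokens per second from an output token count and a duration in ms.
/// Returns 0.0 for a non-positive duration instead of dividing by zero.
fn tok_per_sec(tokens_out: u64, duration_ms: f64) -> f64 {
    if duration_ms <= 0.0 {
        0.0
    } else {
        tokens_out as f64 / (duration_ms / 1000.0)
    }
}

/// One JSONL line shaped like an exit probe. The wire format is an
/// assumption; no string escaping is done, so this is demo-only.
fn exit_line(model: &str, tokens_out: u64, duration_ms: f64) -> String {
    format!(
        "{{\"class\":\"inference.generate.exit\",\"model\":\"{}\",\"tokens_out\":{},\"duration_ms\":{:.1},\"tok_per_sec\":{:.1}}}",
        model,
        tokens_out,
        duration_ms,
        tok_per_sec(tokens_out, duration_ms)
    )
}

fn main() {
    let mut spans: Vec<(&str, f64)> = Vec::new();

    // Bracket a cheap "render" step separately from the "forward" step,
    // so its cost can be subtracted from the total afterwards.
    let rendered = time_block!("inference.render_chat", spans, {
        format!("<system>{}</system>", "you are a persona")
    });
    let tokens = time_block!("inference.forward.text", spans, {
        rendered.split_whitespace().count() as u64
    });

    let total_ms: f64 = spans.iter().map(|(_, ms)| ms).sum();
    for (class, ms) in &spans {
        eprintln!("{class} {ms:.3}ms");
    }
    eprintln!("{}", exit_line("demo-model", tokens, total_ms));
}
```

Keeping each seam as its own span is what makes the later subtraction possible: the render cost and the forward cost are recorded independently, and the exit record only summarizes them.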
Summary
Per Joel's "leaving it organic" directive: the substrate's measurement IS the organic signal that redirects the work. Stress test using the system primitives shipped in PR #1517 exercises the real `materialize_adapters` + `serve_persona_loop` pipeline with N=2/4/8/12 personas concurrent.

Headline finding: substrate adds 1-3 ms per turn under contention; the LLM call dominates per-turn latency.
Adapter delay was 50ms injected. Substrate cost is 0.02-0.3% of real LLM-bound per-turn wall-clock (~5-15s in live Qwen 0.5B traces).
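The percentage is a plain ratio of substrate overhead to per-turn wall-clock. A minimal sketch of that arithmetic follows; `substrate_share_pct` is a hypothetical helper, and the inputs are the figures quoted in this PR (up to 3 ms overhead, turns between 1000 ms and 15000 ms). With 3 ms of overhead the endpoints come out to roughly 0.3% and 0.02%, and a 5000 ms turn lands at about 0.06%.

```rust
/// Share of per-turn wall-clock spent in the substrate, as a percentage.
/// Returns 0.0 when the turn duration is not positive.
fn substrate_share_pct(overhead_ms: f64, turn_wall_ms: f64) -> f64 {
    if turn_wall_ms <= 0.0 {
        0.0
    } else {
        overhead_ms / turn_wall_ms * 100.0
    }
}

fn main() {
    let overhead_ms = 3.0; // upper end of the measured 1-3 ms range
    for turn_ms in [1000.0, 5000.0, 15000.0] {
        println!(
            "overhead {:.0} ms over a {:.0} ms turn = {:.2}%",
            overhead_ms,
            turn_ms,
            substrate_share_pct(overhead_ms, turn_ms)
        );
    }
}
```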
Implications
Captured in [[substrate-overhead-is-1to3ms-LLM-dominates-latency]]:
Pivot
Pause latency-vector grinding. Pivot to accuracy + M5 readiness.
Test plan
    cargo test --test multi_persona_stress_baseline \
      --no-default-features \
      --features livekit-webrtc,llama/mac-cpu-only,test-fixtures \
      -- --nocapture

Stacked on
PR #1517 (`refactor/system-test-primitives`): uses the ubiquitous system primitives shipped there.

Closes #156.
🤖 Generated with Claude Code