
test(persona): multi-persona stress baseline — substrate adds 1-3ms; LLM dominates#1518

Closed. joelteply wants to merge 1 commit into refactor/system-test-primitives from feat/multi-persona-stress-baseline

Conversation

@joelteply
Contributor

Summary

Per Joel's "leaving it organic" directive: the substrate's measurement IS the organic signal that redirects the work. Stress test using the system primitives shipped in PR #1517 exercises the real materialize_adapters + serve_persona_loop pipeline with N=2/4/8/12 personas concurrent.

Headline finding: substrate adds 1-3 ms per turn under contention; the LLM call dominates per-turn latency.

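| N × M  | Materialize | Serve wall | Mean turn | Max turn |
|--------|-------------|------------|-----------|----------|
| 2 × 10 | 0 ms        | 521 ms     | 51.6 ms   | 53 ms    |
| 4 × 10 | 0 ms        | 521 ms     | 51.6 ms   | 53 ms    |
| 8 × 5  | 0 ms        | 270 ms     | 51.5 ms   | 61 ms    |
| 12 × 5 | 0 ms        | 270 ms     | 51.7 ms   | 61 ms    |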

The injected adapter delay was 50 ms. Substrate cost is 0.02-0.3% of real LLM-bound per-turn wall-clock time (~5-15 s in live Qwen 0.5B traces).

Implications

Captured in [[substrate-overhead-is-1to3ms-LLM-dominates-latency]]:

  1. Substrate is NOT the bottleneck. The #149 / #148 latency vectors would save microseconds on a millisecond substrate.
  2. For M5 + 12 personas: substrate is ready. The real enabler is #122 shared-base + LoRA paging.
  3. What's actually blocking "functional + intelligent": accuracy bugs, namely #151 greeting-loop, #152 identity hallucination, #153 service_loop bypasses evaluator, and #113 should_respond via inference command.

Pivot

Pause latency-vector grinding. Pivot to accuracy + M5 readiness.

Test plan

cargo test --test multi_persona_stress_baseline \
    --no-default-features \
    --features livekit-webrtc,llama/mac-cpu-only,test-fixtures \
    -- --nocapture

Stacked on

PR #1517 (refactor/system-test-primitives) — uses the ubiquitous system primitives shipped there.

Closes #156.

🤖 Generated with Claude Code

…LLM dominates (#156)

Per Joel 2026-06-02: substrate must run well on M5 with 6-12 personas
in video chat; on Intel Mac at least functional for multiple personas;
on typical M-series decently useful + intelligent. Need DATA before
guessing at latency vectors. Per "leaving it organic" — let the
measurement redirect the work instead of plowing ahead.

Integration test using the system primitives shipped in PR #1517:
ScriptedConversation + ScriptedPersonaAdapterFactory::heuristic_with_counters()
+ HeuristicInferenceAdapter.with_delay_ms(50). Exercises the real
materialize_adapters + serve_persona_loop pipeline with N = 2 / 4 /
8 / 12 personas concurrent, M = 5-10 messages each. tokio multi-thread
runtime, 4 worker threads.
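
A minimal sketch of the harness shape described above, under loud assumptions: plain `std::thread` stands in for the tokio multi-thread runtime, a `sleep` stands in for `HeuristicInferenceAdapter.with_delay_ms(50)`, and `run_stress` / `StressReport` are hypothetical names, not the test's actual API.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Aggregate per-turn latencies for one N x M run (hypothetical type).
#[derive(Debug)]
pub struct StressReport {
    pub turns: usize,
    pub serve_wall: Duration,
    pub mean_turn: Duration,
    pub max_turn: Duration,
}

/// Spawn `n` concurrent "personas", each serving `m` turns.
/// Each turn sleeps for `delay` to stand in for the injected adapter delay.
pub fn run_stress(n: usize, m: usize, delay: Duration) -> StressReport {
    let start = Instant::now();
    let handles: Vec<_> = (0..n)
        .map(|_| {
            thread::spawn(move || {
                let mut latencies = Vec::with_capacity(m);
                for _ in 0..m {
                    let turn_start = Instant::now();
                    thread::sleep(delay); // stand-in for the LLM call
                    latencies.push(turn_start.elapsed());
                }
                latencies
            })
        })
        .collect();

    // Collect every persona's per-turn latencies into one flat list.
    let all: Vec<Duration> = handles
        .into_iter()
        .flat_map(|h| h.join().expect("persona thread panicked"))
        .collect();
    let serve_wall = start.elapsed();

    let total: Duration = all.iter().sum();
    let mean_turn = if all.is_empty() {
        Duration::ZERO
    } else {
        total / all.len() as u32
    };
    let max_turn = all.iter().copied().max().unwrap_or(Duration::ZERO);
    StressReport { turns: all.len(), serve_wall, mean_turn, max_turn }
}
```

Calling `run_stress(12, 5, Duration::from_millis(50))` mirrors the 12 x 5 row in shape only; the numbers it produces depend on the host and are not the measurements reported here.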

## Measured (Intel Mac, 2026-06-02)

| N x M     | Materialize | Serve wall | Mean turn | Max turn |
|-----------|-------------|------------|-----------|----------|
| 2 x 10    | 0 ms        | 521 ms     | 51.6 ms   | 53 ms    |
| 4 x 10    | 0 ms        | 521 ms     | 51.6 ms   | 53 ms    |
| 8 x 5     | 0 ms        | 270 ms     | 51.5 ms   | 61 ms    |
| 12 x 5    | 0 ms        | 270 ms     | 51.7 ms   | 61 ms    |

Adapter delay was 50ms (injected). Substrate adds 1.5-3 ms per turn
under contention. Throughput scales linearly with persona count.
p100 tail latency is 61ms (only 11ms above floor).
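
The overhead figures follow from simple arithmetic on the table: measured mean turn minus the injected delay, then that overhead as a share of a real LLM-bound turn. A small sketch (function names are hypothetical, chosen for illustration):

```rust
/// Substrate overhead per turn: measured mean minus the injected adapter delay.
pub fn overhead_ms(mean_turn_ms: f64, injected_delay_ms: f64) -> f64 {
    mean_turn_ms - injected_delay_ms
}

/// Overhead as a percentage of a full LLM-bound turn.
pub fn overhead_share_pct(overhead_ms: f64, llm_turn_ms: f64) -> f64 {
    100.0 * overhead_ms / llm_turn_ms
}
```

For the 12 x 5 row, `overhead_ms(51.7, 50.0)` is about 1.7 ms. A 3 ms overhead is 0.3% of a 1000 ms turn and 0.02% of a 15000 ms turn.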

## Implications captured in [[substrate-overhead-is-1to3ms-LLM-dominates-latency]]

1. The substrate IS NOT the bottleneck. Real Qwen 0.5B inference is
   1000-15000 ms per turn (live trace). Substrate is 0.02-0.3% of
   total.

2. #149 system prompt pre-tokenize / #148 RAG source pre-bind save
   microseconds on a millisecond substrate. Not worth grinding until
   LLM gen shrinks.

3. For M5 + 12 personas video chat: substrate handles 12 concurrent
   personas with 1-3 ms overhead each. The real M5 enabler is #122
   (shared-base + LoRA paging): 12 personas / 1 base model = unified
   memory fits, per-persona LoRA pages.

4. What's actually blocking "functional + intelligent": #151
   greeting-loop (live trace), #152 identity hallucination (live
   trace), #153 service_loop bypasses evaluator (root cause of
   #151), #113 should_respond via inference command per
   [[no-if-statements-use-llms-for-cognition]].
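
To illustrate the shared-base idea in item 3: one base model held behind an `Arc`, with each persona owning only a small adapter delta. This is a sketch of the ownership shape only. `BaseModel`, `PersonaModel`, and `spawn_personas` are hypothetical types, and paging of LoRA weights is not modeled.

```rust
use std::sync::Arc;

/// Stand-in for base model weights, loaded once (hypothetical type).
pub struct BaseModel {
    pub weights: Vec<f32>,
}

/// Per-persona view: a shared handle to the base plus a small LoRA delta.
pub struct PersonaModel {
    pub persona_id: usize,
    pub base: Arc<BaseModel>,
    pub lora_delta: Vec<f32>,
}

/// Create `n` personas that all point at the same base allocation.
pub fn spawn_personas(base: Arc<BaseModel>, n: usize, lora_len: usize) -> Vec<PersonaModel> {
    (0..n)
        .map(|persona_id| PersonaModel {
            persona_id,
            base: Arc::clone(&base), // bumps a refcount, copies no weights
            lora_delta: vec![0.0; lora_len],
        })
        .collect()
}
```

With 12 personas the base weights exist once in memory; only the per-persona deltas scale with persona count.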

## Pivot

Pause latency-vector grinding (#149, #148). Pivot to:
- #113 should_respond via inference command (fixes greeting-loop)
- #152 identity grounding via chat template
- #122 shared-base + LoRA paging (M5 enabler)

## How to run

cargo test --test multi_persona_stress_baseline \
    --no-default-features \
    --features livekit-webrtc,llama/mac-cpu-only,test-fixtures \
    -- --nocapture

The --nocapture is load-bearing — eprintln stress::* lines are the
data; assertions verify structural invariants only.
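
A sketch of that split between data lines and assertions. The `stress::` prefix comes from the note above; the field layout and both function names are assumptions made for illustration, not the test's real output format.

```rust
/// Format one data line for stderr. The field layout here is an assumption.
pub fn stress_line(n: usize, m: usize, mean_ms: f64, max_ms: f64) -> String {
    format!("stress::baseline n={n} m={m} mean_turn_ms={mean_ms:.1} max_turn_ms={max_ms:.1}")
}

/// Structural invariants only: every turn was served and none beat the floor.
/// No latency budget is asserted, so the test stays stable across hosts.
pub fn invariants_hold(latencies_ms: &[f64], n: usize, m: usize, floor_ms: f64) -> bool {
    latencies_ms.len() == n * m && latencies_ms.iter().all(|&l| l >= floor_ms)
}
```

Inside a test, `eprintln!("{}", stress_line(2, 10, 51.6, 53.0))` is only visible for a passing test when the run includes `-- --nocapture`, since the test harness otherwise captures output.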

Closes #156.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@joelteply
Contributor Author

Superseded — the semantic content of this PR landed via the cognition pipeline rewires on canary.

Verified on canary (75b21d08a):

  • AircCitizen trait → src/workers/continuum-core/src/persona/airc_citizen.rs:73
  • PersonaConversation::prime → src/workers/continuum-core/src/persona/airc_persona_conversation.rs:105 + service_loop.rs:100
  • LatencyAggregate + ServeOutcome.turn_latency → service_loop.rs:167,247
  • AIProviderAdapter::warmup → ai/adapter.rs:319, called at supervisor.rs:447
  • Test scaffolding primitives (scripted_conversation, scripted_adapter_factory) → persona/scripted_*.rs
  • PersonaSpawnSupervisor + BootSummary → persona/host.rs:119
  • .instrument(ctx.span()) (&ctx-pure tracing) → service_loop.rs:309
  • multi_persona_stress_baseline.rs → already on canary (via feat(persona): persona decides + responds via LLM in ONE structured call #1519)

The stack pre-dated #1519's "persona decides + responds via LLM in ONE structured call" rewire and #1539's cognition pipeline integration. Those PRs reshaped serve_persona_loop so heavily that rebasing this stack would mean replaying the architectural intent against code that already implements it. Closing as superseded; task system marks #144/#146/#147/#150/#154/#156 completed.

No work lost — every architectural insight from this stack is in the substrate today.

@joelteply joelteply deleted the branch refactor/system-test-primitives June 7, 2026 15:51
@joelteply joelteply closed this Jun 7, 2026
@joelteply joelteply deleted the feat/multi-persona-stress-baseline branch June 7, 2026 15:51
joelteply added a commit that referenced this pull request Jun 7, 2026
#1545)

Iteration 1 of the inference-latency campaign (#195). Sprinkles
probes at the load-bearing seams the substrate's RTOS debugger
manual marked unchecked. With #196's on_close fix already on
canary, `time_sync!` / `time_probe!` spans now persist to the
JSONL sink — the timing data this commit produces is captureable
end-to-end for the first time.

## What's instrumented

**`inference/llamacpp_adapter.rs::generate_text`** — the dominant
cost on LCD tier (95%+ wall-clock per 2026-06-06 baseline). The
function was effectively a black box from the operator's POV: the
existing `runtime::logger("llamacpp")` lines describe shape but
not duration, and `tok_per_sec` was kept in a private
`RwLock<f64>` (last-throughput-only). Probes added:

- `inference.generate.enter` — request fingerprint at entry
  (model, persona_id, msg_count, max_tokens, has_system_prompt,
  parts_image, parts_audio). Pairs with `.exit` via span ancestry.
- `time_sync!("inference.render_chat", ...)` — chat-template
  rendering. Synchronous + small, but cumulative across many
  turns. Bracketing it lets the operator subtract it from
  `forward.*` cleanly.
- `time_probe!("inference.forward.text", ...)` — pure-text
  scheduler-managed path. The actual LLM decode.
- `time_probe!("inference.forward.multimodal", ...)` — mtmd path
  (image / audio). Distinct seam because it bypasses the
  scheduler and runs single-flight.
- `inference.generate.exit` — pairs with `.enter`. Carries the
  campaign's headline metric `tok_per_sec` plus duration_ms,
  tokens_out, text_len, model. A `jq` filter on
  `class == "inference.generate.exit"` is the latency dashboard
  in JSONL form.
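
A rough sketch of what these probes compute, written with the standard library only. It is not the real `time_sync!` / `time_probe!` macros (which the commit says expand to `tracing` calls); `TimingRecord`, `time_sync`, `tok_per_sec`, and `to_jsonl` below are hypothetical stand-ins.

```rust
use std::time::Instant;

/// One timing record, roughly what a JSONL sink line might carry.
#[derive(Debug)]
pub struct TimingRecord {
    pub class: &'static str,
    pub duration_ms: f64,
}

/// Time a synchronous closure and return its result with a record.
/// Plain-function stand-in for a `time_sync!`-style macro.
pub fn time_sync<T>(class: &'static str, f: impl FnOnce() -> T) -> (T, TimingRecord) {
    let start = Instant::now();
    let value = f();
    let duration_ms = start.elapsed().as_secs_f64() * 1000.0;
    (value, TimingRecord { class, duration_ms })
}

/// Throughput for the exit probe: tokens generated per second.
pub fn tok_per_sec(tokens_out: u64, duration_ms: f64) -> f64 {
    if duration_ms <= 0.0 {
        0.0
    } else {
        tokens_out as f64 * 1000.0 / duration_ms
    }
}

/// Render a record as one JSON object per line.
/// No string escaping is done, so `class` must not contain quotes.
pub fn to_jsonl(rec: &TimingRecord) -> String {
    format!("{{\"class\":\"{}\",\"duration_ms\":{:.3}}}", rec.class, rec.duration_ms)
}
```

The probe only observes: the closure's return value passes through unchanged, and the record is a side product.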

**`persona/prompt_assembly.rs::assemble`** — the leading
indicator for "why is prefill slow." When engrams / social
signals / matched-angle grow unbounded, `system_message_len`
shadows tok/s in the timing breakdown. Probe at the function
tail carries the composition shape: system_message_len,
message_count, estimated_tokens, matched_angle_present,
engrams_count, social_signals_present, voice_mode,
multi_party_strategy.

## Doctrine alignment

Per [[jtag-probes-are-rtos-debugger]] (Joel 2026-06-06): every
probe site names the surrounding vars the way a breakpoint
inspector would show locals. Easy one-liners; the macros do the
plumbing. `class` strings follow the canonical taxonomy in
`docs/architecture/RTOS-DEBUGGER-PROBES.md` (updated in this
commit per the "When you add a probe, update this manual" rule).

Per [[no-rust-gates-around-cognition]]: probes observe, they
DO NOT decide. None of these emit changes control flow. The
existing `runtime::logger` and `last_throughput_tok_s` paths
remain untouched — probes are additive.

Per [[init-once-handle-then-lease-zero-copy-refs]]: the macros
expand to `tracing::event!` / `tracing::info_span!` calls that
inherit `tracing`'s `release_max_level_*` compile-time gates.
Zero cost when off; auditable per task #198 if a hot loop
later needs the visitor allocation reviewed.

## Manual update

`docs/architecture/RTOS-DEBUGGER-PROBES.md`:
- Added the new classes to the taxonomy (`persona.prompt.assemble`
  with full field list; `inference.generate.{enter,exit}`; the
  three new `timing` seams).
- Marked the prompt-assembly checklist item DONE.
- Marked the llamacpp-adapter checklist item DONE with the
  specific call-site list and the campaign cross-reference.

## Validation

- `cargo check --features metal,accelerate` — clean
- `cargo test --lib persona::prompt_assembly` — 12/12 pass
- `cargo test --lib inference::llamacpp` — 12/12 pass
- 24/24 green across the two affected modules

## Next iteration

Iteration 2 (separate slice): run a real continuum boot with
CONTINUUM_PROBE_FILE set, exercise the persona service loop
against the multi-persona stress fixture (#1518's baseline),
`jq` the JSONL to identify the dominant bottleneck. Optimize
THAT. Iterate. Until tok_per_sec on the LCD tier hits the
M5-class target.
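
The "identify the dominant bottleneck" step amounts to summing durations per probe class and taking the largest. A sketch in Rust, assuming the JSONL has already been parsed into `(class, duration_ms)` pairs; `dominant_class` is a hypothetical helper, not part of the codebase.

```rust
use std::collections::HashMap;

/// Sum durations per probe class and return the class with the largest total.
/// Returns `None` when there are no records.
pub fn dominant_class<'a>(records: &[(&'a str, f64)]) -> Option<(&'a str, f64)> {
    let mut totals: HashMap<&'a str, f64> = HashMap::new();
    for &(class, duration_ms) in records {
        *totals.entry(class).or_insert(0.0) += duration_ms;
    }
    // `total_cmp` gives a total order on f64, so NaN cannot panic the sort.
    totals.into_iter().max_by(|a, b| a.1.total_cmp(&b.1))
}
```

Whichever class comes back is the one to optimize in the next iteration.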

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>