Stress Testing

mtecnic edited this page May 28, 2026 · 1 revision

/stress opens the load-testing dashboard. Six modes, one engine: stress_tester.StressTester. Every mode runs against the model you're already chatting with — no separate setup.


Mode overview

#  Mode            Function                 What it measures
1  Throughput      run_throughput_test      Concurrent burst (5 – 50 simultaneous requests)
2  Token Stress    run_token_stress_test    Performance vs prompt length (500 – 10,000 tokens)
3  Sustained       run_sustained_load_test  Endurance over time at fixed RPM
4  Consistency     run_consistency_test     Same prompt N times serially; isolates hardware noise
5  Realistic User  run_realistic_user_test  Poisson arrivals + multi-turn growing context
6  Tool Bench      run_tool_bench_test      Agentic tool-calling; see Tool-Calling-Benchmark

Live dashboard streams: per-request status, error log, percentile latencies, variance / drift, and a final summary panel. All runs log to logs/stress_test_*.log.


Shared mechanics

Every mode produces a list of TestResult records:

from dataclasses import dataclass

@dataclass
class TestResult:
    request_id: int
    status: str          # pending | running | success | error
    prompt: str
    response: str = ""
    start_time: float = 0.0
    end_time: float = 0.0
    token_count: int = 0
    tokens_per_sec: float = 0.0
    prompt_tokens: int = 0
    completion_tokens: int = 0
    ttft: float = 0.0
    decode_tps: float = 0.0
    session_id: int = 0
    turn_number: int = 0
    # …tool-bench fields below…

_compute_stats() rolls them into TestStats with percentiles, variance, and first-half/second-half drift. The summary panel renders TestStats.
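To make the percentile part of that roll-up concrete, here is a minimal sketch using the nearest-rank method. The names percentile and summarize are hypothetical, and the real _compute_stats() may use a different percentile definition or interpolation.

```python
import math
from statistics import mean, pvariance


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(values)
    # pct * n / 100 keeps the arithmetic exact for integer percentiles.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies):
    """Hypothetical stand-in for the percentile / variance part of a stats roll-up."""
    return {
        "mean": mean(latencies),
        "variance": pvariance(latencies),
        "p50": percentile(latencies, 50),
        "p95": percentile(latencies, 95),
        "p99": percentile(latencies, 99),
    }
```

Nearest-rank always returns a latency that was actually observed, which keeps p99 easy to trace back to a specific request in the log.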

Connection reuse: the tester holds one httpx.AsyncClient per run rather than opening a new client per request, so pooled connections are reused. Throughput tests then measure your server, not repeated connection setup in your kernel's TCP stack.


1 · Throughput

Bursts N requests at once and measures how the server handles parallelism.

  Concurrency: 20
  Total:       100
  ────────────────────────────────────────────
  Wall clock:  18.4s
  Avg TTFT:    340ms   p95: 720ms   p99: 980ms
  Avg decode:  41.2 t/s
  Errors:      0
  ────────────────────────────────────────────

Use this to find the concurrency knee — where adding more parallel requests stops improving aggregate throughput and starts increasing tail latency.
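A rough sketch of how a capped burst can be driven with asyncio, assuming a pluggable async send callable. run_burst and fake_send are hypothetical names for illustration; this is not the code behind run_throughput_test.

```python
import asyncio
import time


async def run_burst(send, total, concurrency):
    """Fire `total` requests with at most `concurrency` in flight.

    Returns (per-request latencies in seconds, wall-clock seconds).
    """
    gate = asyncio.Semaphore(concurrency)

    async def one(request_id):
        async with gate:  # blocks while `concurrency` requests are already running
            start = time.perf_counter()
            await send(request_id)
            return time.perf_counter() - start

    wall_start = time.perf_counter()
    latencies = await asyncio.gather(*(one(i) for i in range(total)))
    return latencies, time.perf_counter() - wall_start


async def fake_send(request_id):
    # Stand-in for a real HTTP call; sleeps to simulate server work.
    await asyncio.sleep(0.01)


if __name__ == "__main__":
    # Sweep concurrency levels and watch where throughput stops improving.
    for level in (5, 10, 20):
        lat, wall = asyncio.run(run_burst(fake_send, total=40, concurrency=level))
        print(f"concurrency={level:>2}  wall={wall:.3f}s  throughput={len(lat) / wall:.0f} req/s")
```

Sweeping the concurrency level like this and plotting throughput next to p95 latency is one way to locate the knee by hand.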


2 · Token Stress

Same concurrency, varying prompt lengths (500 / 1000 / 2000 / 5000 / 10000 tokens). Reveals how prefill cost scales:

  • Linear scaling → server is compute-bound on prefill.
  • Sub-linear → KV cache / paged attention is working.
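  • Super-linear → mild growth at the longest prompts is expected, because attention cost grows faster than linearly with sequence length; a sharp jump suggests something is recomputing per request (sliding window without cache, bad serving config).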
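One way to put a number on the scaling regime is the slope of log(TTFT) against log(prompt tokens): roughly 1.0 is linear, below is sub-linear, above is super-linear. The sketch below is illustrative only; scaling_exponent, classify, and the 0.15 tolerance band are assumptions, not part of the tool.

```python
import math


def scaling_exponent(prompt_tokens, ttft_seconds):
    """Least-squares slope of log(ttft) against log(tokens); about 1.0 means linear scaling."""
    xs = [math.log(n) for n in prompt_tokens]
    ys = [math.log(t) for t in ttft_seconds]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


def classify(exponent, tolerance=0.15):
    # The tolerance is an arbitrary illustrative band, not a value from the tool.
    if exponent < 1 - tolerance:
        return "sub-linear"
    if exponent > 1 + tolerance:
        return "super-linear"
    return "linear"
```

Feed it the five prompt lengths and the average TTFT measured at each one; a fit over only five points is coarse, so treat the label as a hint rather than a verdict.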

3 · Sustained Load

Fixed RPM over a chosen window (1 min → 24 hrs). Checks for:

  • Memory leaks — TTFT trending up over time.
  • Thermal throttling — decode TPS dropping after ~10 min.
  • Driver / kernel bugs — sporadic 500s that only appear under sustained load.

The drift detector compares first-half mean against second-half mean and flags >10% degradation.
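The first-half versus second-half comparison can be sketched as follows. drift_pct and is_degraded are hypothetical names, and how the real detector splits an odd-length run or handles direction (TTFT rising versus TPS falling) is assumed here.

```python
from statistics import mean


def drift_pct(samples):
    """Relative change of the second-half mean versus the first-half mean, in percent."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    mid = len(samples) // 2  # with an odd count, the extra sample goes to the second half
    first, second = mean(samples[:mid]), mean(samples[mid:])
    return (second - first) / first * 100


def is_degraded(samples, higher_is_better, threshold_pct=10.0):
    """Flag a run whose second half is more than threshold_pct worse than its first half."""
    change = drift_pct(samples)
    # For throughput-style metrics a drop is bad; for latency-style metrics a rise is bad.
    worse_by = -change if higher_is_better else change
    return worse_by > threshold_pct
```

For decode TPS pass higher_is_better=True; for TTFT pass higher_is_better=False, so that a rising latency counts as degradation.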


4 · Consistency

Sends the same prompt N times sequentially, varying nothing. The point isn't to test the model; it's to measure the noise in your stack:

  • Thermal regulation and DVFS scheduling
  • Kernel scheduling jitter
  • Driver-level batching variance

Output:

  Same prompt × 30 sequential runs
  ────────────────────────────────
  Decode TPS:    44.1 ± 0.8  (stddev 1.8%)
  TTFT:          312 ± 45 ms (stddev 14.4%)
  First-half avg:  44.4 t/s
  Second-half avg: 43.8 t/s  (drift -1.3%)

Low decode variance + high TTFT variance usually means a flaky network, not a flaky GPU.
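The percentages in the sample output are consistent with reading the ± value as a standard deviation and dividing it by the mean (0.8 / 44.1 ≈ 1.8%, 45 / 312 ≈ 14.4%). Assuming that interpretation, a small sketch with hypothetical helper names:

```python
from statistics import mean, stdev


def relative_spread(avg, std):
    """Standard deviation expressed as a percentage of the mean."""
    return std / avg * 100


def coefficient_of_variation(samples):
    """Same ratio computed from raw samples, using the sample standard deviation."""
    return relative_spread(mean(samples), stdev(samples))


print(f"decode: {relative_spread(44.1, 0.8):.1f}%")
print(f"ttft:   {relative_spread(312, 45):.1f}%")
```

Comparing the two ratios side by side is what makes the "quiet decode, noisy TTFT" pattern stand out, since the raw units (t/s and ms) are not directly comparable.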


5 · Realistic User

The most useful mode if you're sizing for real traffic. Models bursty multi-turn sessions:

  Session arrivals:  Poisson(λ = 0.5 / s)
  Turns per session: log-normal(μ=2.5, σ=0.6)
  Think time:        log-normal(μ=2.0, σ=0.4)
  Context growth:    cumulative across turns
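A minimal sketch of sampling a session schedule from those distributions with the standard library. sample_sessions is a hypothetical helper; rounding the turn count to an integer, treating think time as seconds, and drawing one think-time gap between consecutive turns are all assumptions, not confirmed behavior of run_realistic_user_test.

```python
import random


def sample_sessions(duration_s, rate_per_s=0.5, turns_mu=2.5, turns_sigma=0.6,
                    think_mu=2.0, think_sigma=0.4, seed=None):
    """Draw session start times, turn counts, and think times for one test window."""
    rng = random.Random(seed)
    sessions, clock = [], 0.0
    while True:
        # Poisson arrivals have exponentially distributed gaps with mean 1 / rate.
        clock += rng.expovariate(rate_per_s)
        if clock >= duration_s:
            break
        # Assumption: round the log-normal draw to a whole number of turns, at least 1.
        turns = max(1, round(rng.lognormvariate(turns_mu, turns_sigma)))
        # Assumption: one think-time gap (seconds) between each pair of consecutive turns.
        think = [rng.lognormvariate(think_mu, think_sigma) for _ in range(turns - 1)]
        sessions.append({"start": clock, "turns": turns, "think_times": think})
    return sessions
```

With λ = 0.5 / s, a 10-minute window yields about 300 sessions on average, and the exponential gaps produce the clumps and lulls that a fixed-RPM test never shows.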

Three depth profiles:

Profile   Median turns  Avg session length
One-shot  1             ~5 s
Short     3 – 5         ~30 s
Long      8 – 15        ~3 min

Stats include: sessions completed, total turns, p50/p95/p99 per-turn latency, and per-turn-position TTFT so you can see prefill cost climb with context size.


6 · Tool Bench

Full agentic benchmark — see Tool-Calling-Benchmark for the long version.


Reading the dashboard

┌─────────────────────────────────────────────────────────────┐
│  Throughput · concurrency=20 · 76/100 complete              │
├─────────────────────────────────────────────────────────────┤
│  [✓] req 0073   1.4s   45.1 t/s   ttft 280ms                │
│  [✓] req 0074   1.6s   42.7 t/s   ttft 310ms                │
│  [●] req 0075   running…                                    │
│  [✗] req 0076   timeout after 60s                           │
│  …                                                          │
├─────────────────────────────────────────────────────────────┤
│  Errors (3):                                                 │
│    req 0029: 503 model_unavailable                          │
│    req 0067: connection reset                               │
│    req 0076: timeout 60s                                    │
└─────────────────────────────────────────────────────────────┘

[●] = in-flight. [✓] = success. [✗] = failure. The on-screen error log is bounded: only the last N errors are kept, but every error lands in logs/stress_test_*.log with a full traceback.
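A bounded on-screen log of this kind is commonly built on collections.deque with maxlen. The sketch below assumes that approach; ErrorPanel is a hypothetical class, not the dashboard's actual implementation.

```python
from collections import deque


class ErrorPanel:
    """Keeps only the most recent errors for display; older entries fall off the front."""

    def __init__(self, max_visible=3):
        self.visible = deque(maxlen=max_visible)
        self.total = 0  # running count, so the header can still show every error seen

    def record(self, request_id, message):
        self.total += 1
        self.visible.append(f"req {request_id:04d}: {message}")


panel = ErrorPanel(max_visible=2)
panel.record(29, "503 model_unavailable")
panel.record(67, "connection reset")
panel.record(76, "timeout 60s")
print(f"Errors ({panel.total}):", list(panel.visible))
```

Keeping a separate total means the count in the panel header stays accurate even after older lines have scrolled out of the deque.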


Logs

All runs write to logs/stress_test_YYYYMMDD_HHMMSS.log via logger.setup_logger():

  • Console handler: INFO level — high-signal events.
  • File handler: DEBUG level — every request, every error trace, every metric.
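The two-handler split maps directly onto the standard logging module. The following is a sketch under assumptions: make_stress_logger is a hypothetical function, and the actual logger.setup_logger() signature, format string, and return value are not shown in this page.

```python
import logging
from datetime import datetime
from pathlib import Path


def make_stress_logger(log_dir="logs", name="stress_test"):
    """Build a logger with an INFO console handler and a DEBUG file handler."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    log_path = Path(log_dir) / f"stress_test_{stamp}.log"

    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)   # pass everything through; handlers do the filtering
    logger.propagate = False         # avoid duplicate lines via the root logger
    logger.handlers.clear()          # sketch-level guard against stacking handlers on re-run

    console = logging.StreamHandler()
    console.setLevel(logging.INFO)   # high-signal events only
    file_handler = logging.FileHandler(log_path, encoding="utf-8")
    file_handler.setLevel(logging.DEBUG)  # every request, error trace, and metric

    fmt = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    for handler in (console, file_handler):
        handler.setFormatter(fmt)
        logger.addHandler(handler)
    return logger, log_path
```

The logger itself must sit at DEBUG; if it were left at the default level, DEBUG records would be dropped before the file handler ever saw them.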

log_request_error() captures the full request payload + response body for failed requests. log_vllm_error() adds vLLM-specific context (queue depth, prefix cache stats) when the server advertises them in headers.


Tips for getting clean numbers

  1. Warm up first. First request loads weights — run a 3-request warmup before measuring. The throughput mode does this automatically; consistency mode does not (by design, to expose cold-start variance).
  2. Pick a representative prompt length. Token Stress shows you the curve; use the closest length to your real workload for other modes.
  3. Disable other tenants. If the server is shared, your numbers are everyone's numbers.
  4. Look at the variance, not just the mean. A model that runs at 42 t/s ± 8 t/s is worse for users than one at 38 t/s ± 1 t/s, even though the mean is higher.
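To see why tip 4 holds, compare the slow tail of the two servers rather than their means. The sketch below assumes decode speed is roughly normally distributed, which real measurements may not be; slow_tail is a hypothetical helper.

```python
from statistics import NormalDist


def slow_tail(mean_tps, stddev_tps, fraction=0.05):
    """Speed that the slowest `fraction` of requests fall below, assuming a normal spread."""
    return NormalDist(mean_tps, stddev_tps).inv_cdf(fraction)


print(f"42 ± 8 t/s: slowest 5% below {slow_tail(42, 8):.1f} t/s")
print(f"38 ± 1 t/s: slowest 5% below {slow_tail(38, 1):.1f} t/s")
```

Under that assumption the noisier server's slowest 5% of requests run below roughly 29 t/s, while the steadier one stays above about 36 t/s, and the slow requests are the ones users notice.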
