Stress Testing
/stress opens the load-testing dashboard. Six modes, one engine: stress_tester.StressTester. Every mode runs against the model you're already chatting with — no separate setup.
| # | Mode | Function | What it measures |
|---|---|---|---|
| 1 | Throughput | run_throughput_test | Concurrent burst (5 – 50 simultaneous requests) |
| 2 | Token Stress | run_token_stress_test | Performance vs prompt length (500 – 10,000 tokens) |
| 3 | Sustained | run_sustained_load_test | Endurance over time at fixed RPM |
| 4 | Consistency | run_consistency_test | Same prompt N times serially; isolates hardware noise |
| 5 | Realistic User | run_realistic_user_test | Poisson arrivals + multi-turn growing context |
| 6 | Tool Bench | run_tool_bench_test | Agentic tool-calling; see Tool-Calling-Benchmark |
Live dashboard streams: per-request status, error log, percentile latencies, variance / drift, and a final summary panel. All runs log to logs/stress_test_*.log.
Every mode produces a list of TestResult records:
@dataclass
class TestResult:
    request_id: int
    status: str  # pending | running | success | error
    prompt: str
    response: str = ""
    start_time: float = 0.0
    end_time: float = 0.0
    token_count: int = 0
    tokens_per_sec: float = 0.0
    prompt_tokens: int = 0
    completion_tokens: int = 0
    ttft: float = 0.0
    decode_tps: float = 0.0
    session_id: int = 0
    turn_number: int = 0
    # …tool-bench fields below…

_compute_stats() rolls them into TestStats with percentiles, variance, and first-half/second-half drift. The summary panel renders TestStats.
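As a rough illustration of the kind of roll-up described here, the sketch below computes the mean, standard deviation, and percentiles for one metric. The `percentile` and `summarize` helpers are hypothetical names, not the tester's code, and the nearest-rank percentile method is an assumption; the real `_compute_stats()` may interpolate or use a different definition.

```python
import math
import statistics


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile (assumed method, not taken from the tester)."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    # Rank is 1-based; clamp to 1 so pct=0 returns the minimum.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(values: list[float]) -> dict[str, float]:
    """Hypothetical stand-in for a TestStats roll-up over a single metric."""
    return {
        "mean": statistics.fmean(values),
        # Sample standard deviation needs at least two points.
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
        "p50": percentile(values, 50),
        "p95": percentile(values, 95),
        "p99": percentile(values, 99),
    }
```

Calling `summarize([r.ttft for r in results if r.status == "success"])` would give the TTFT figures for one run, assuming `results` is the list of TestResult records.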
Connection reuse: the tester holds one httpx.AsyncClient per run rather than opening a new client per request. This makes throughput tests measure your server, not your kernel's TCP stack.
Throughput mode bursts N requests at once and measures how the server handles parallelism.
Concurrency: 20
Total: 100
────────────────────────────────────────────
Wall clock: 18.4s
Avg TTFT: 340ms p95: 720ms p99: 980ms
Avg decode: 41.2 t/s
Errors: 0
────────────────────────────────────────────
Use this to find the concurrency knee — where adding more parallel requests stops improving aggregate throughput and starts increasing tail latency.
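A concurrency-limited burst can be sketched with `asyncio` alone. This is a minimal illustration, not the tester's implementation: `run_burst` and `fake_send` are hypothetical names, and `send` stands in for whatever coroutine actually issues the HTTP request.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def run_burst(
    send: Callable[[int], Awaitable[None]],
    total: int,
    concurrency: int,
) -> dict[str, float]:
    """Issue `total` requests with at most `concurrency` in flight at once."""
    gate = asyncio.Semaphore(concurrency)
    latencies: list[float] = []
    errors = 0

    async def one(request_id: int) -> None:
        nonlocal errors
        async with gate:
            started = time.perf_counter()
            try:
                await send(request_id)
            except Exception:
                errors += 1
                return
            latencies.append(time.perf_counter() - started)

    wall_start = time.perf_counter()
    await asyncio.gather(*(one(i) for i in range(total)))
    wall = time.perf_counter() - wall_start
    return {
        "wall_clock_s": wall,
        "completed": float(len(latencies)),
        "errors": float(errors),
        "requests_per_s": len(latencies) / wall if wall > 0 else 0.0,
    }


async def fake_send(request_id: int) -> None:
    """Placeholder for a real request; sleeps instead of calling a server."""
    await asyncio.sleep(0.01)
```

Running `asyncio.run(run_burst(fake_send, total=100, concurrency=20))` at several concurrency values and comparing `requests_per_s` is one way to locate the knee: past it, the rate stops rising while per-request latency keeps growing.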
Token Stress holds concurrency fixed and varies the prompt length (500 / 1000 / 2000 / 5000 / 10000 tokens), which reveals how prefill cost scales:
- Linear scaling → server is compute-bound on prefill.
- Sub-linear → KV cache / paged attention is working.
- Super-linear → something is recomputing per request (sliding window without cache, bad serving config).
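One way to put a number on these three cases is to fit a power law, TTFT ∝ (prompt length)^k, and read off the exponent k from a log-log least-squares fit. The sketch below is an assumption about how you might analyze the results yourself, not something the tester is documented to do; `scaling_exponent`, `classify_scaling`, and the 0.15 tolerance band are all hypothetical.

```python
import math


def scaling_exponent(prompt_tokens: list[int], ttft_s: list[float]) -> float:
    """Least-squares slope of log(TTFT) against log(prompt length)."""
    if len(prompt_tokens) != len(ttft_s) or len(prompt_tokens) < 2:
        raise ValueError("need at least two paired measurements")
    xs = [math.log(n) for n in prompt_tokens]
    ys = [math.log(t) for t in ttft_s]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    var = sum((x - mean_x) ** 2 for x in xs)
    if var == 0:
        raise ValueError("prompt lengths must not all be identical")
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cov / var


def classify_scaling(exponent: float, tolerance: float = 0.15) -> str:
    """Map the exponent onto the three cases; `tolerance` is an arbitrary band."""
    if exponent > 1 + tolerance:
        return "super-linear"
    if exponent < 1 - tolerance:
        return "sub-linear"
    return "linear"
```

A fixed per-request overhead (network round trip, queueing) flattens the fitted slope at short prompt lengths, so treat a borderline sub-linear result with some caution.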
Sustained mode holds a fixed RPM over a chosen window (1 min → 24 hrs). It checks for:
- Memory leaks — TTFT trending up over time.
- Thermal throttling — decode TPS dropping after ~10 min.
- Driver / kernel bugs — sporadic 500s that only appear under sustained load.
The drift detector compares first-half mean against second-half mean and flags >10% degradation.
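The half-versus-half comparison can be sketched as follows. This is a minimal illustration under stated assumptions: `detect_drift` is a hypothetical name, samples are assumed to be in chronological order, and the `higher_is_better` flag is an addition here so the same function covers both decode TPS (where a drop is bad) and TTFT (where a rise is bad).

```python
from statistics import fmean


def detect_drift(
    samples: list[float],
    higher_is_better: bool,
    threshold: float = 0.10,
) -> tuple[float, bool]:
    """Return (relative change, flagged) for second-half vs first-half mean."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    mid = len(samples) // 2
    first = fmean(samples[:mid])
    second = fmean(samples[mid:])
    if first == 0:
        raise ValueError("first-half mean is zero; relative change is undefined")
    change = (second - first) / first
    # Degradation is a drop for throughput-like metrics, a rise for latency-like ones.
    degradation = -change if higher_is_better else change
    return change, degradation > threshold
```

For example, `detect_drift(decode_tps_samples, higher_is_better=True)` flags a run whose second-half decode rate fell more than 10% below the first half.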
Consistency mode sends the same prompt N times sequentially, varying nothing. The point isn't to test the model; it's to measure the noise in your stack:
- Thermal regulation and DVFS scheduling
- Kernel scheduling jitter
- Driver-level batching variance
Output:
Same prompt × 30 sequential runs
────────────────────────────────
Decode TPS: 44.1 ± 0.8 (stddev 1.8%)
TTFT: 312 ± 45 ms (stddev 14.4%)
First-half avg: 44.4 t/s
Second-half avg: 43.8 t/s (drift -1.3%)
Low decode variance + high TTFT variance usually means a flaky network, not a flaky GPU.
Realistic User is the most useful mode if you're sizing for real traffic. It models bursty multi-turn sessions:
Session arrivals: Poisson(λ = 0.5 / s)
Turns per session: log-normal(μ=2.5, σ=0.6)
Think time: log-normal(μ=2.0, σ=0.4)
Context growth: cumulative across turns
Three depth profiles:
| Profile | Median turns | Avg session length |
|---|---|---|
| One-shot | 1 | ~5 s |
| Short | 3 – 5 | ~30 s |
| Long | 8 – 15 | ~3 min |
Stats include: sessions completed, total turns, p50/p95/p99 per-turn latency, and per-turn-position TTFT so you can see prefill cost climb with context size.
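The arrival model can be sketched with the standard library's `random` module: a Poisson process has exponentially distributed gaps between arrivals, and `lognormvariate` takes the same μ and σ parameters listed above. The names `PlannedSession` and `plan_sessions` are hypothetical, and two details are assumptions not stated in the source: think times are treated as seconds, and the sampled turn count is rounded to an integer of at least 1.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class PlannedSession:
    session_id: int
    arrival_s: float            # seconds since the start of the run
    think_times_s: list[float]  # one pause before each follow-up turn

    @property
    def turns(self) -> int:
        return len(self.think_times_s) + 1


def plan_sessions(
    duration_s: float,
    arrival_rate: float = 0.5,  # λ, sessions per second
    turns_mu: float = 2.5,
    turns_sigma: float = 0.6,
    think_mu: float = 2.0,
    think_sigma: float = 0.4,
    seed: Optional[int] = None,
) -> list[PlannedSession]:
    """Pre-compute a session schedule for a run of `duration_s` seconds."""
    rng = random.Random(seed)
    sessions: list[PlannedSession] = []
    # Exponential inter-arrival gaps yield Poisson arrivals at rate λ.
    clock = rng.expovariate(arrival_rate)
    while clock < duration_s:
        turns = max(1, round(rng.lognormvariate(turns_mu, turns_sigma)))
        think = [rng.lognormvariate(think_mu, think_sigma) for _ in range(turns - 1)]
        sessions.append(PlannedSession(len(sessions), clock, think))
        clock += rng.expovariate(arrival_rate)
    return sessions
```

Planning the schedule up front, with a fixed seed, keeps two runs against different servers comparable because both see the same arrival pattern.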
Tool Bench is the full agentic tool-calling benchmark; see Tool-Calling-Benchmark for the long version.
┌─────────────────────────────────────────────────────────────┐
│ Throughput · concurrency=20 · 76/100 complete │
├─────────────────────────────────────────────────────────────┤
│ [✓] req 0073 1.4s 45.1 t/s ttft 280ms │
│ [✓] req 0074 1.6s 42.7 t/s ttft 310ms │
│ [●] req 0075 running… │
│ [✗] req 0076 timeout after 60s │
│ … │
├─────────────────────────────────────────────────────────────┤
│ Errors (3): │
│ req 0029: 503 model_unavailable │
│ req 0067: connection reset │
│ req 0076: timeout 60s │
└─────────────────────────────────────────────────────────────┘
● = in-flight. ✓ = success. ✗ = failure. The error log is bounded — only the last N are kept on-screen, but all errors land in logs/stress_test_*.log with full tracebacks.
All runs write to logs/stress_test_YYYYMMDD_HHMMSS.log via logger.setup_logger():
- Console handler: INFO level — high-signal events.
- File handler: DEBUG level — every request, every error trace, every metric.
log_request_error() captures the full request payload + response body for failed requests. log_vllm_error() adds vLLM-specific context (queue depth, prefix cache stats) when the server advertises them in headers.
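The two-handler arrangement maps directly onto Python's standard `logging` module. The sketch below is a hypothetical re-creation, named `setup_logger_sketch` to make clear it is not the project's `logger.setup_logger()`; the format string and default directory are assumptions.

```python
import logging
import time
from pathlib import Path


def setup_logger_sketch(log_dir: str = "logs", name: str = "stress_test") -> logging.Logger:
    """INFO and above to the console, DEBUG and above to a timestamped file."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d_%H%M%S")
    log_path = Path(log_dir) / f"stress_test_{stamp}.log"

    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)  # pass everything on; handlers do the filtering
    logger.handlers.clear()         # avoid duplicate handlers on repeated setup

    console = logging.StreamHandler()
    console.setLevel(logging.INFO)

    file_handler = logging.FileHandler(log_path, encoding="utf-8")
    file_handler.setLevel(logging.DEBUG)

    fmt = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    for handler in (console, file_handler):
        handler.setFormatter(fmt)
        logger.addHandler(handler)
    return logger
```

Setting the logger itself to DEBUG and filtering per handler is what lets the file capture every request while the console stays readable.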
- Warm up first. First request loads weights — run a 3-request warmup before measuring. The throughput mode does this automatically; consistency mode does not (by design, to expose cold-start variance).
- Pick a representative prompt length. Token Stress shows you the curve; use the closest length to your real workload for other modes.
- Disable other tenants. If the server is shared, your numbers are everyone's numbers.
- Look at the variance, not just the mean. A model that runs at 42 t/s ± 8 t/s is worse for users than one at 38 t/s ± 1 t/s, even though the mean is higher.