
AccelBench

A self-hosted benchmarking platform for LLM inference on AWS accelerated instances. Deploy any HuggingFace model onto GPU or Neuron instances, run standardized load tests, and compare latency, throughput, GPU utilization, and cost across configurations.

Features

  • Benchmarks catalog — Browse and compare pre-computed results, filterable by model, instance family, and accelerator type. Side-by-side comparison of up to 4 runs.
  • On-demand benchmarks — Run against any HuggingFace model on any supported accelerated instance type. Pick a scenario (chatbot, batch, stress, production, long-context) or customize parameters.
  • Test suites — Run a series of scenarios against one model+instance in a single deployment. The model stays loaded; each scenario runs sequentially with its own load profile.
  • Configuration recommender — Deterministic recommendations for tensor parallelism, quantization, max_model_len, and concurrency based on model architecture and accelerator memory. Surfaces OOM history and memory breakdown explanations.
  • Estimate page — Predicted TTFT, throughput, and cost for a (model × instance × scenario) combination before you run it.
  • Model cache — Pre-cache HuggingFace models to S3. Cached models load on GPU via Run:ai Streamer instead of downloading from HF on every run, cutting deploy time for large models.
  • Seed automation — Matrix-seed the Benchmarks catalog from the Configuration page. The seeder walks model × instance pairs, dedups against existing runs, and dispatches runs in-process (no bash, no ConfigMap).
  • Configuration page — UI for credentials (HF + Docker Hub in AWS Secrets Manager), the seeding matrix, per-scenario inference-perf overrides, registry/pull-through state, capacity reservations (ODCR + Capacity Block), and an audit log of config changes.
  • Capacity reservations — Attach EC2 ODCRs or Capacity Blocks for ML to the GPU/Neuron Karpenter NodeClasses so benchmarks can target reserved capacity when on-demand is tight.
  • Exports — Export a completed run's vLLM Kubernetes manifest; export single-run HTML reports or comparison reports as HTML/CSV.
  • Pricing comparison — Benchmark results joined with on-demand and reserved pricing across 9 AWS regions.
  • Job management — Monitor, cancel, and delete running benchmarks; view rendered inference-perf config and vLLM logs.

Architecture

┌──────────────┐                       ┌───────────────────────┐
│  React SPA   │                       │  Aurora PostgreSQL    │
│  (nginx)     │                       │  runs, metrics,       │
└──────┬───────┘                       │  scenarios, overrides │
       │                               │  cache, audit log     │
       ▼                               └───────────▲───────────┘
┌──────────────┐        client-go      ┌───────────┴───────────┐
│  Go API      │──────────────────────▶│  Orchestrator         │
│  Server      │                       │  (in-process)         │
│              │◀────────────────────  └───────────┬───────────┘
└──┬───┬───────┘                                   │
   │   │                                           │
   │   │    ┌────────────────────────┐      ┌──────▼────────┐
   │   └───▶│ AWS Secrets Manager    │      │ Karpenter +   │
   │        │ HF + Docker Hub tokens │      │ SOCI parallel │
   │        └────────────────────────┘      │ pull on NVMe  │
   │                                        └───────┬───────┘
   │                                                │
   │   ┌─────────────┐   ┌─────────────┐   ┌────────▼────────┐
   │   │ S3: results │   │ S3: weights │   │ GPU / Neuron    │
   └──▶│ (per run)   │   │ (model cache│◀─▶│ nodes running   │
        └─────────────┘   │via Streamer)│   │ vLLM + loadgen  │
                         └─────────────┘   └─────────────────┘

Benchmark lifecycle:

  1. Recommend / submit — User picks model + instance + scenario from the Run form, or POSTs /api/v1/runs directly.
  2. Deploy — Orchestrator renders a Deployment + Service running vLLM (weights from HF or, for cached models, from S3 via Run:ai Streamer).
  3. Ready — Wait for the model to load and pass /health (up to 25 min; the timeout is deliberately long to accommodate 70B models on p5.48xlarge).
  4. Load test — Launch a Job running inference-perf with a per-scenario config rendered into a ConfigMap.
  5. Collect — Load generator uploads JSON results to S3 (accelbench-results-<account>); orchestrator downloads and parses percentiles.
  6. Persist — Metrics written to Postgres; DCGM / OOM events scraped if relevant.
  7. Teardown — Deployment, Service, and Job deleted; Karpenter consolidates the node.
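
As a sketch, the same lifecycle can be driven entirely over the API (addresses assume the kubectl port-forward shown in step 5; the request body fields are illustrative, not the authoritative schema):

# Submit a run (body fields are an assumption; check the Run form's payload)
curl -X POST http://localhost:8080/api/v1/runs \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "instance_type": "g6e.xlarge", "scenario": "chatbot"}'

# Poll status, then fetch metrics once the run completes
curl http://localhost:8080/api/v1/runs/<id>
curl http://localhost:8080/api/v1/runs/<id>/metrics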

Supported instance families

Category                            Families                                             Notes
NVIDIA GPU (Ampere)                 g5, p4d, p4de                                        A10G / A100
NVIDIA GPU (Ada/Hopper/Blackwell)   g6, g6e, g7e, gr6, p5, p5e, p5en, p6-b200, p6-b300   L4, L40S, H100, H200, B200, B300
AWS Neuron                          inf2, trn1, trn1n, trn2                              Inferentia2 / Trainium

Instance selection in the Run form pulls live pricing and filters by accelerator type.
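
The same specs and pricing are exposed over the API, for example:

curl http://localhost:8080/api/v1/instance-types
curl "http://localhost:8080/api/v1/pricing?region=us-east-2"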

Tech stack

Component        Technology
API server       Go 1.24, stdlib net/http, jackc/pgx/v5, k8s client-go (typed + dynamic), AWS SDK v2 (Secrets Manager, EC2, ECR, S3, Pricing)
Frontend         React 18, TypeScript, Tailwind CSS, Vite, Recharts
Load generator   inference-perf (Python 3.12)
Inference        vLLM (GPU), vLLM-Neuron (Inferentia/Trainium)
Database         Aurora PostgreSQL Serverless v2
Infrastructure   Terraform, Helm, Karpenter 1.9 (SOCI parallel-pull, NVMe instance store, reserved-capacity beta)
Cluster          EKS 1.31, AL2023 NVIDIA-optimized AMIs for GPU nodes

Prerequisites

  • AWS account with quota for accelerated instance types
  • Terraform >= 1.5
  • Helm >= 3.0
  • kubectl configured for your cluster
  • Docker for building images (BuildKit required — cache mounts are used)
  • A Docker Hub account with an access token (needed for the ECR pull-through cache that mirrors the vLLM image; settable after install via the Configuration page)

Deployment

1. Infrastructure (Terraform)

cd terraform
terraform init
terraform apply

terraform.tfvars is optional for a barebones install. The default apply creates the EKS cluster, Aurora, ECR repos, and the AWS Load Balancer Controller — but no public URL; you reach the UI via kubectl port-forward (see step 5). Set Docker Hub credentials here (or via the Configuration page after install), and if you want a public HTTPS URL, pick one of the three ingress modes documented in terraform/README.md (PRD-43a). Run cp terraform.tfvars.example terraform.tfvars to see annotated examples.

The Terraform config creates:

  • VPC with public/private subnets across 3 AZs
  • EKS 1.31 cluster with a managed system node group + Karpenter for accelerated workloads
  • Aurora PostgreSQL Serverless v2
  • Karpenter EC2NodeClass + NodePool for gpu and neuron (SOCI parallel-pull + NVMe RAID0 on GPU; capacity-type: [reserved, on-demand] so attached reservations are consumed)
  • ECR repos for all app images + a pull-through cache rule at <account>.dkr.ecr.<region>.amazonaws.com/dockerhub/* pointing at Docker Hub
  • S3 buckets for results and cached model weights
  • IAM roles for API/loadgen/cache-job/model pods via EKS Pod Identity (Secrets Manager, EC2 describe, ECR describe, S3, Pricing)
  • AWS Load Balancer Controller (chart v3.2.2) in kube-system via Pod Identity — provisions ALBs for any Ingress with ingressClassName: alb. Skip via install_alb_controller=false if your cluster already has it.
  • Optionally: an ACM certificate + Route 53 alias for a public HTTPS URL. Off by default. See terraform/README.md for the three modes (acm-route53 / acm-existing / none).
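
A quick way to spot-check a couple of these resources after apply (standard AWS CLI v2 commands):

aws ecr describe-pull-through-cache-rules --region us-east-2
aws s3 ls | grep accelbench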

2. Container images

Six images — five app images plus the tools image used by CLI operations. GPU nodes pull the vLLM image directly from the pull-through cache, not from this registry.

export REGION=us-east-2
export ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
export REGISTRY=${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com

aws ecr get-login-password --region $REGION | docker login --username AWS --password-stdin $REGISTRY

# accelbench-tools is not managed by Terraform — create it first:
aws ecr describe-repositories --repository-names accelbench-tools --region $REGION >/dev/null 2>&1 \
  || aws ecr create-repository --repository-name accelbench-tools --region $REGION --image-tag-mutability MUTABLE

# Build + push
for svc in api web loadgen migration cache-job tools; do
  docker build --platform linux/amd64 \
    -t $REGISTRY/accelbench-${svc}:latest \
    -f docker/Dockerfile.${svc} .
  docker push $REGISTRY/accelbench-${svc}:latest
done

BuildKit cache mounts: Dockerfile.api uses RUN --mount=type=cache for the Go build and module caches. The first build populates the caches (~5 min); subsequent builds compile incrementally (~30-60s).

3. Database secret

Terraform creates the accelbench namespace and the accelbench-db Kubernetes secret (with a URL-encoded DATABASE_URL built from the Aurora master user secret) automatically as part of terraform apply. No manual steps.

If you ever change the Aurora password out of band (RDS-managed rotation, manual reset), re-run terraform apply to refresh the Kubernetes secret.

On an existing cluster where the namespace was created manually (before Terraform managed it), set -var manage_accelbench_namespace=false to avoid conflicts, or take ownership with terraform import kubernetes_namespace.accelbench[0] accelbench and terraform import kubernetes_secret.accelbench_db[0] accelbench/accelbench-db.
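
To verify the secret Terraform created (assuming the key inside the secret is named DATABASE_URL, per the description above):

kubectl get secret accelbench-db -n accelbench \
  -o jsonpath='{.data.DATABASE_URL}' | base64 -d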

4. Helm install

helm install accelbench helm/accelbench \
  --namespace accelbench \
  --set image.api.repository=$REGISTRY/accelbench-api \
  --set image.web.repository=$REGISTRY/accelbench-web \
  --set image.loadgen.repository=$REGISTRY/accelbench-loadgen \
  --set image.migration.repository=$REGISTRY/accelbench-migration \
  --set image.cacheJob.repository=$REGISTRY/accelbench-cache-job \
  --set image.tools.repository=$REGISTRY/accelbench-tools \
  --set database.existingSecret=accelbench-db \
  --set results.s3Bucket=accelbench-results-${ACCOUNT_ID} \
  --set models.s3Bucket=accelbench-models-${ACCOUNT_ID} \
  --set registry.pullThroughEnabled=true \
  --set registry.pullThroughURI=$REGISTRY

The chart deploys:

  • API server (2 replicas) with Secrets Manager + EC2/ECR describe + Karpenter CRD patch permissions
  • Web frontend (2 replicas)
  • Database migration Job (runs as a Helm pre-upgrade hook on every helm upgrade)
  • Pricing refresh CronJob (daily)
  • Catalog refresh CronJob (weekly — curls /api/v1/catalog/seed)

No public Ingress is rendered by default. See terraform/README.md to opt in to a public HTTPS URL.
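
To confirm the release came up, list the chart's workloads:

kubectl get pods,cronjobs -n accelbench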

5. Access the app

Port-forward (default, works immediately):

kubectl port-forward -n accelbench svc/accelbench-web 8080:80
# → http://localhost:8080

Public HTTPS URL (optional): add the ingress variables to terraform.tfvars (see terraform/README.md), run terraform apply, then re-run helm upgrade with the ingress flags shown there. After a second terraform apply the app is reachable at your configured hostname.

The migration Job applies every SQL file in db/migrations/ on startup. Migrations are idempotent (CREATE TABLE IF NOT EXISTS, ON CONFLICT DO NOTHING), so re-running them is safe.

6. Platform configuration (UI)

Once the cluster is up, open the app (via port-forward or your public hostname) and open Configuration (left nav, gear icon). This is where operators set up the runtime knobs that aren't baked into the Helm chart:

Credentials — save an HF token once (for gated models like meta-llama/*) and a Docker Hub access token (the pull-through cache needs it to hydrate new images). If you skipped the Docker Hub tfvars at install time, set the token here first: the secret entry exists but stays empty until someone writes to it. Tokens are stored in AWS Secrets Manager (accelbench/config/hf-token, ecr-pullthroughcache/dockerhub) and injected automatically into every benchmark run, model-cache job, and catalog seed. Values are never shown again after save.

Seeding Matrix — edit the models × instance types the "Seed Benchmarks" button explores. Models use a HuggingFace autocomplete; instance types are a dropdown populated from /api/v1/instance-types. Presence in the list = enabled.

Scenario Overrides — scenarios (chatbot, batch, stress, production, long-context) are code-defined in internal/scenario/builtin.go. This card lets operators override a scenario's num_workers, streaming, input_mean, or output_mean per scenario without a rebuild. Empty = inherit from code. Overriding input_mean or output_mean re-derives std_dev/min/max via the same formula scenarios use.

Registry — read-only view of the Docker Hub pull-through cache. Shows each dockerhub/* repo's size and last-pulled timestamp. When disabled, shows a helm upgrade snippet to turn it on.

Capacity Reservations — attach existing ODCRs or Capacity Blocks for ML to the GPU/Neuron Karpenter EC2NodeClass. Attachments are validated against live EC2 state (AZ match, instance-family match, not cancelled/expired). Capacity Blocks show a drain warning ~40 min before the block ends, which is when Karpenter preempts the node. Karpenter prioritizes reserved capacity and falls back to on-demand once reservations are exhausted.

Audit Log — last 50 write operations under /api/v1/config/*. Only action + short summary are stored; no token values.

7. Pre-cache popular models (recommended)

Cached models (1) skip the HF download on every benchmark and (2) bypass the HF gated-model check entirely since config.json and weights come from S3. Visit the Models page to queue a cache job. The job runs on a system node and uploads the full HF snapshot to accelbench-models-<account>/models/<org>/<model>.

API endpoints

Core workflow:

Method Path Purpose
GET /api/v1/status Component health
GET /api/v1/catalog Benchmarks catalog (filters: model, instance_family, accelerator_type, sort)
GET /api/v1/jobs List runs (alias kept for back-compat)
POST /api/v1/runs Submit a new run
GET /api/v1/runs/{id} Run details + metrics
GET /api/v1/runs/{id}/metrics Metrics only
POST /api/v1/runs/{id}/cancel Cancel
DELETE /api/v1/runs/{id} Delete
GET /api/v1/runs/{id}/export Export rendered K8s manifest
GET /api/v1/runs/{id}/report HTML report

Planning / exploration:

Method Path Purpose
GET /api/v1/instance-types All known accelerated instance types + specs
GET /api/v1/pricing?region=us-east-2 On-demand + reserved pricing
GET /api/v1/recommend?model=...&instance_type=... Config recommendation (TP, quant, concurrency, max_model_len)
GET /api/v1/estimate Predicted TTFT / throughput / cost for a run shape
GET /api/v1/memory-breakdown Per-component memory for a (model × instance × TP × quant)
GET /api/v1/oom-history Past OOM events for a (model × instance) pair
GET /api/v1/compare/report Comparison report (HTML) across up to 4 runs
GET /api/v1/compare/csv Comparison report (CSV)
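
For example (the query parameters for /recommend are shown above; /estimate and /memory-breakdown take a similar run shape, but their exact parameter names aren't listed here):

curl "http://localhost:8080/api/v1/recommend?model=meta-llama/Llama-3.1-70B-Instruct&instance_type=p5.48xlarge"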

Test suites + scenarios:

Method Path Purpose
GET /api/v1/scenarios Built-in scenarios + effective (merged-with-override) values
GET /api/v1/test-suites Built-in test suites
GET /api/v1/suite-runs List suite runs
POST /api/v1/suite-runs Submit a suite run
GET /api/v1/suite-runs/{id} Suite run + per-scenario results
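
A minimal sketch of submitting a suite run (the body fields are an assumption; the actual schema isn't documented here):

curl -X POST http://localhost:8080/api/v1/suite-runs \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "instance_type": "g6e.xlarge", "suite": "<suite-id>"}'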

Model cache:

Method Path Purpose
GET /api/v1/model-cache List cached models
POST /api/v1/model-cache Trigger a cache job (HF → S3)
POST /api/v1/model-cache/register Register a pre-existing S3 model (no job)
GET /api/v1/model-cache/{id} Cache entry status
DELETE /api/v1/model-cache/{id} Remove entry (also deletes S3 prefix)
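
For example, to queue a cache job from the command line (the body field name is an assumption):

curl -X POST http://localhost:8080/api/v1/model-cache \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct"}'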

Catalog seeding:

Method Path Purpose
POST /api/v1/catalog/seed Start a seed run (in-process goroutine). ?dry_run=true to preview
GET /api/v1/catalog/seed Latest seed status
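
For example, preview what a seed run would dispatch, then check its status:

curl -X POST "http://localhost:8080/api/v1/catalog/seed?dry_run=true"
curl http://localhost:8080/api/v1/catalog/seed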

Configuration (all covered by the Configuration page; writes audit-logged):

Method Path Purpose
GET / PUT / DELETE /api/v1/config/credentials/hf-token HF token (write-only; describe only returns set + timestamp)
PUT / DELETE /api/v1/config/credentials/dockerhub-token Docker Hub token
GET /api/v1/config/credentials {hf_token: {set, updated_at}, dockerhub_token: {set, updated_at}}
GET / PUT /api/v1/config/catalog-matrix Seeding matrix (optimistic concurrency via version)
GET /api/v1/config/scenario-overrides Scenarios + effective defaults + overrides
PUT / DELETE /api/v1/config/scenario-overrides/{id} Upsert / clear a scenario override
GET /api/v1/config/registry Pull-through cache enabled + cached repos
GET / POST /api/v1/config/capacity-reservations List / attach
DELETE /api/v1/config/capacity-reservations/{node_class}/{reservation_id} Detach
GET /api/v1/config/audit-log Last 50 config writes
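
A sketch of overriding a scenario (the override knobs come from the Configuration page section above; the JSON shape and the scenario id are assumptions):

curl -X PUT http://localhost:8080/api/v1/config/scenario-overrides/chatbot \
  -H "Content-Type: application/json" \
  -d '{"num_workers": 16, "streaming": true}'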

For gated HF models when the platform token isn't set, pass X-HF-Token on /recommend and /estimate, or use the per-run HF token field. If both are present, the platform token takes precedence.
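
For example:

curl -H "X-HF-Token: hf_xxx" \
  "http://localhost:8080/api/v1/recommend?model=meta-llama/Llama-3.1-70B-Instruct&instance_type=p5.48xlarge"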

Configuration recommender

Given a model and instance, the recommender deterministically outputs tensor parallelism, quantization, max_model_len, and max concurrency.

Inputs:

  • Model metadata — parameter count, num_attention_heads, num_kv_heads, hidden_size, max_position_embeddings, torch_dtype. Fetched from HF, or read from config.json in S3 when the model is cached.
  • Instance specs — GPU count, GPU memory, GPU type (from the database).
  • Memory model — weights (adjusted for quantization) + KV cache (f(context, batch, num_kv_heads)) + 10% CUDA/activation overhead. Cross-checks against the benchmark_run_oom_events table for empirical overrides.
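
As an illustrative walk-through of that memory model (numbers rounded, not taken from the recommender's code): a 70B-parameter model in bf16 needs about 70B × 2 bytes ≈ 140 GB for weights, so it can't fit on one 80 GB GPU unquantized. At TP=2 across two 80 GB GPUs, each GPU holds ~70 GB of weights; after the 10% overhead (~8 GB) there is almost nothing left for KV cache, which is exactly the situation where the recommender proposes quantization or a larger instance.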

Output is either a concrete recommendation or, if the model doesn't fit, alternatives (quantization drops on the current instance; larger instance suggestions).

Metrics collected

Metric                          Description
TTFT p50/p90/p95/p99            Time to first token
E2E latency p50/p90/p95/p99     End-to-end request latency
ITL p50/p90/p95/p99             Inter-token latency
TPOT p50/p90/p99                Time per output token
Throughput per request          Tokens/second per request
Throughput aggregate            Tokens/second across concurrent requests
Requests/second                 Completed requests per second
GPU utilization peak/avg        From DCGM (DCGM_FI_DEV_GPU_UTIL)
SM active peak/avg              From DCGM DCP (DCGM_FI_PROF_SM_ACTIVE)
Tensor pipe active peak/avg     From DCGM DCP (DCGM_FI_PROF_PIPE_TENSOR_ACTIVE)
DRAM active peak/avg            From DCGM DCP (DCGM_FI_PROF_DRAM_ACTIVE)
GPU memory peak/avg             From DCGM (DCGM_FI_DEV_FB_USED)
Waiting requests max            From vLLM scheduler (via DCGM pipe custom counter)

Project structure

.
├── cmd/
│   ├── server/          # API server entrypoint
│   ├── cli/             # CLI tool for headless operation
│   ├── loadgen/         # Python loadgen (inference-perf wrapper)
│   └── pricingrefresh/  # Pricing CronJob binary
├── internal/
│   ├── api/             # HTTP handlers + routing (incl. PRD-30..33 config)
│   ├── database/        # Postgres repo (pgx/v5, dynamic CRD access for Karpenter)
│   ├── manifest/        # K8s YAML templates (deployment, loadgen job, cache job)
│   ├── metrics/         # Loadgen JSON parsing + percentile computation
│   ├── oom/             # OOM event detection (scans pod events + DCGM)
│   ├── orchestrator/    # Benchmark lifecycle state machine
│   ├── pricing/         # EC2 pricing API integration
│   ├── recommend/       # Deterministic configuration recommender
│   ├── report/          # HTML report / comparison report generation
│   ├── scenario/        # Built-in scenarios + Override merger
│   ├── secrets/         # AWS Secrets Manager wrapper (HF + Docker Hub)
│   ├── seed/            # In-process catalog seeder (replaces bash)
│   └── testsuite/       # Built-in test suites
├── frontend/            # React/TypeScript SPA
├── helm/accelbench/     # Helm chart
├── terraform/           # VPC, EKS, Karpenter, Aurora, ECR, pull-through
├── db/migrations/       # SQL migration files (idempotent)
├── docker/              # Dockerfiles (api, web, loadgen, migration, cache-job, tools)
└── scripts/             # Operational scripts + CronJobs (refresh-catalog.yaml)

Development

API server

# Requires DATABASE_URL and a kubeconfig. For AWS-dependent endpoints
# (credentials, registry, capacity reservations) you also need AWS creds.
export DATABASE_URL="postgres://user:pass@localhost:5432/accelbench?sslmode=disable"
go run ./cmd/server

Frontend

cd frontend
npm install
npm run dev    # http://localhost:5173, proxies /api to localhost:8080

Tests

go test ./...        # all packages
cd frontend && npm test

Full validation (matches CI):

go build ./... && go test ./... && \
  terraform -chdir=terraform validate && \
  helm lint helm/accelbench && \
  (cd frontend && npm run build)

License

This project is provided as-is for benchmarking and evaluation purposes.
