Direct TT-Metal bringup of modern open-weight LLMs on Tenstorrent Blackhole, with custom
owned_* compute kernels and a continuous-batching OpenAI-compatible HTTP server.
What: Hand-written TT-Metal graphs (no PJRT, no JAX) for Qwen3.6-27B,
Qwen3.6-35B-A3B MoE, Gemma 4 12B, Nemotron-3 Nano 30B-A3B, plus a model zoo
of single-chip Llama / Qwen2.5 / SmolLM ports.
Why: Squeeze production-grade per-token throughput out of Blackhole P150s
by owning the compute graph end-to-end — kernels, scheduler, HTTP server.
How: Custom fused-op kernels under experiments/owned_ops/, a
continuous-batching engine under experiments/serve/, and a learning-wiki
documenting every design decision.
Cold-start? Read HANDOFF.md first: it is the one-page entry point with current perf, production paths, and what's next.
- Hardware
- Quickstart
- Models brought up
- Chat server
- Repo layout
- Long context
- Troubleshooting
- Related projects
- Top-level docs
Bringups run on Tenstorrent QuietBox workstations
(4× Blackhole P150 per box) with FABRIC_1D inter-chip links. A single
P150 is enough for the legacy 8B-class demos under models/.
27B / 35B / Gemma 4 12B require a (1, 4) mesh.
| Model class | Mesh | Path |
|---|---|---|
| Llama 1B/3B/8B, SmolLM3, Qwen2.5/3 small | single P150 | models/*.py |
| Qwen3.6-27B, Gemma 4 12B | (1, 4) 4×P150 | experiments/serve/server_*.py |
| Qwen3.6-35B-A3B MoE | (1, 4) 4×P150 | experiments/serve/server_35b_*.py |
| Nemotron-3 Nano 30B-A3B | (1, 4) 4×P150 (target) | experiments/owned_ops/nemotron3_mamba2_decode_owned/ |
No local device execution. Every command in this repo assumes you are on the QuietBox; the dev loop expects ssh into a host with TT-Metal built from source.
On a QuietBox with TT-Metal built from source:

    git clone https://github.com/aweditya/tt-model-bringup.git ~/tt-xla
    cd ~/tt-xla
    make setup          # uv sync: Python deps into .venv
    make install-ttnn   # editable ttnn from $TT_METAL_HOME
    make check          # sanity-check setup (no device open)
    make kernels        # build the owned_ops custom kernels
    bash experiments/serve/scripts/serve_cb.sh start   # boot the CB chat server (~6 min)

Talk to it once /health returns 200:

    curl http://localhost:8000/v1/chat/completions \
      -H 'Content-Type: application/json' \
      -d '{"messages":[{"role":"user","content":"Hi!"}],"max_tokens":200}'

Stop gracefully (hard-killing can wedge the fabric):

    bash experiments/serve/scripts/serve_cb.sh stop

The local working directory stays ~/tt-xla for historical reasons (the repo was originally a JAX/XLA PJRT-backend exploration). Renaming breaks too many rsync paths to be worth it.
See REPRODUCE.md for the full reproducible-on-any-QuietBox
recipe, tested-versions matrix, and per-demo expected numbers.
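The quickstart curl call can also be issued from Python. The following is a minimal standard-library sketch, not code from this repo: the helper names (`build_chat_request`, `wait_for_health`, `ask`) are hypothetical, the timeout values are arbitrary, and the response is assumed to follow the usual OpenAI chat-completions shape.

```python
import json
import time
import urllib.error
import urllib.request

BASE_URL = "http://localhost:8000"  # host and port from the quickstart curl example


def build_chat_request(prompt, max_tokens=200):
    """Build the same POST request that the quickstart curl command sends."""
    body = json.dumps(
        {"messages": [{"role": "user", "content": prompt}], "max_tokens": max_tokens}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def wait_for_health(timeout_s=600.0, poll_s=5.0):
    """Poll /health until it returns 200 or the (arbitrary) timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{BASE_URL}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server is probably still booting
        time.sleep(poll_s)
    return False


def ask(prompt):
    """Send one chat request; assumes an OpenAI-style response body."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

Once the server has been started, `wait_for_health()` followed by `ask("Hi!")` mirrors the manual curl flow.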
| Model | Params | Architecture | Status |
|---|---|---|---|
| Qwen3.6-27B | 27 B dense | Hybrid attention + GatedDeltaNet | Production CB + TP, prefix-cache live |
| Qwen3.6-35B-A3B | 35 B / 3 B-A | Hybrid + GatedDeltaNet + MoE | CB shipped; B>1 blocked on slot-poisoning fix |
| Gemma 4 12B (base + IT) | 12 B dense | Dual sliding/global attention | End-to-end CB + HTTP chat |
| Nemotron-3 Nano 30B-A3B | 30 B / 3 B-A | Mamba2-Transformer hybrid MoE | In progress — owned Mamba2 SSD kernel G1 single-core complete (modes 1–5 PASS); G2 multi-core next |
| Llama 1B/3B/8B, SmolLM3, Qwen2.5/3 small | up to 8 B | Decoder-only Transformer | Legacy single-chip demos (see models/) |
Per-token throughput numbers drift across kernel work — see
HANDOFF.md for the current figures.
experiments/serve/cb_api.py is the production
HTTP server: an OpenAI-compatible API hosting a continuous-batching engine
on top of the Orca scheduler and the logits-traced forward.
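To make "continuous batching" concrete, here is a toy, device-free sketch of iteration-level scheduling in the Orca style: a finished request releases its slot after any decode step and a waiting request is admitted before the next step, rather than the whole batch draining first. The names (`Request`, `run_continuous_batching`, `num_slots`) are hypothetical and are not taken from cb_api.py; the decode step is simulated by decrementing a counter.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    rid: str
    tokens_left: int  # decode steps still needed (stand-in for real generation)


def run_continuous_batching(requests, num_slots):
    """Simulate iteration-level scheduling; return (finish order, step count)."""
    waiting = deque(requests)
    slots = [None] * num_slots  # each slot holds at most one in-flight request
    finished, steps = [], 0
    while waiting or any(s is not None for s in slots):
        # Admit waiting requests into any free slot before the next step.
        for i in range(num_slots):
            if slots[i] is None and waiting:
                slots[i] = waiting.popleft()
        # One batched decode step: every occupied slot emits one token.
        steps += 1
        for i, req in enumerate(slots):
            if req is None:
                continue
            req.tokens_left -= 1
            if req.tokens_left <= 0:
                finished.append(req.rid)
                slots[i] = None  # slot is reusable on the very next step
    return finished, steps
```

With two slots and requests needing 3, 1 and 2 tokens, `run_continuous_batching([Request("a", 3), Request("b", 1), Request("c", 2)], 2)` returns `(['b', 'a', 'c'], 3)`: request `c` takes over the slot freed by `b` after the first step, so the run needs 3 steps instead of the 5 a drain-then-refill batch would take.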
- Endpoints: /v1/chat/completions, /v1/completions, /v1/models, /health, /metrics
- Multi-client by design; per-request sampling (temperature / top_p / top_k / seed); slot freed on client disconnect
- Backend selection via TT_BACKEND (27b, 35b, gemma4_12b)
- Tuning knobs: TT_CB_PORT, TT_CB_SLOTS, TT_CB_MAX_NEW, TT_CB_MAX_INFLIGHT, TT_CB_PREFIX_CACHE, TT_CB_CHUNKED_PREFILL (full list in experiments/serve/scripts/serve_cb.sh)
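The backend and tuning variables above can be collected into an environment dict before launching the server. This is only a sketch: `build_server_env` is a hypothetical helper, the numeric defaults are placeholders, and the "1"/"0" encoding of TT_CB_PREFIX_CACHE is an assumption; serve_cb.sh remains the reference for what each variable accepts.

```python
import os

VALID_BACKENDS = ("27b", "35b", "gemma4_12b")  # values listed for TT_BACKEND


def build_server_env(backend, port=8000, slots=4, prefix_cache=True, base_env=None):
    """Return an environment dict carrying the TT_* settings.

    The default slot count and the "1"/"0" boolean encoding are assumptions,
    not values read from serve_cb.sh.
    """
    if backend not in VALID_BACKENDS:
        raise ValueError(
            f"unknown TT_BACKEND {backend!r}; expected one of {VALID_BACKENDS}"
        )
    # Start from the current environment unless the caller supplies one.
    env = dict(os.environ if base_env is None else base_env)
    env.update(
        {
            "TT_BACKEND": backend,
            "TT_CB_PORT": str(port),
            "TT_CB_SLOTS": str(slots),
            "TT_CB_PREFIX_CACHE": "1" if prefix_cache else "0",
        }
    )
    return env
```

The result can be passed as `env=` to `subprocess.run(["bash", "experiments/serve/scripts/serve_cb.sh", "start"], env=env)`, assuming the script reads these variables from its environment. Rejecting unknown backends up front avoids spending the multi-minute boot on a typo.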
A Claude-Code-style chat TUI lives at
scripts/chat.py — see
scripts/CHAT_TUI.md for the slash-command reference.
| Path | Purpose |
|---|---|
| experiments/ | Production servers, owned kernels, CB engine + validators, probes |
| research/ | Design docs and living plans (index: research/README.md) |
| wiki/ | Q&A wiki: learning-by-building notes on JAX/XLA + TT-Metal internals |
| scripts/ | Dev-loop scripts: deploy.sh, run_remote.sh, build_owned_ops.sh, chat TUI |
| models/ | Legacy single-chip multi-model demos (Llama, SmolLM, Qwen2.5/3) |
| archive/ | Retired probes, the founding JAX/PJRT-era sources (legacy/), the 2026-06-04 poster + measurements, and dated cleanup buckets |
The 4-chip TP path is validated to L=4000 with verbatim 8-character
needle-haystack retrieval at all needle positions (0.25 / 0.5 / 0.75
frac), using the B3 HiFi2 SDPA recipe.
Probe: experiments/utils/needle_haystack_qb2_tp.py.
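For readers unfamiliar with the test, a needle-haystack check can be sketched as follows. This is an illustration of the general technique, not the probe in experiments/utils/: the filler sentence, the prompt wording, and the function names are all hypothetical, and position is measured in filler sentences rather than tokens.

```python
import random
import string


def make_needle(rng, length=8):
    """Random alphanumeric needle; 8 characters matches the description above."""
    alphabet = string.ascii_uppercase + string.digits
    return "".join(rng.choice(alphabet) for _ in range(length))


def build_haystack_prompt(needle, frac, num_filler=400):
    """Place a sentence carrying the needle at fraction `frac` of the filler."""
    filler = ["The sky was clear and the road was quiet."] * num_filler
    pos = int(frac * num_filler)
    filler.insert(pos, f"The secret code is {needle}.")
    question = "\nWhat is the secret code? Answer with the code only."
    return " ".join(filler) + question


def retrieved_verbatim(needle, model_output):
    """Pass only if the needle appears character-for-character in the output."""
    return needle in model_output
```

A run would build one prompt per needle position (0.25, 0.5, 0.75), send each to the model, and require `retrieved_verbatim` to hold for all of them; a real probe would also size the haystack in tokens to hit the target context length.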
| Symptom | Fix |
|---|---|
| Fabric wedged after a hard-kill | make reset (tt-smi -r 0,1,2,3) and restart the server; always prefer the serve_*.sh stop path. |
| AutoTokenizer fails with 401 | No HF token configured: run uv run hf auth login or set HF_TOKEN in .env. |
| HF_HOME fills $HOME | Override: export HF_HOME=/path/with/space/.cache/hf. |
| Legacy demos won't open device 0 | Prod server holds it — bash experiments/serve/scripts/serve_cb.sh stop first. |
- tenstorrent/tt-metal: the device runtime + LLK that everything here links against.
- tenstorrent/tt-xla: Tenstorrent's official PJRT backend (different from this repo's archived archive/legacy/pjrt_plugin/).
- tenstorrent/vllm: Tenstorrent's vLLM fork; the continuous-batching design here lines up with its API.
- Corsix's Tenstorrent Wormhole blog series: best architectural reference outside the TT docs.
- HANDOFF.md: current perf, production paths, next bringup target.
- REPRODUCE.md: reproduce the chat server + legacy demos on a fresh QuietBox.
- CONTRIBUTING.md: dev loop, canary gates, code style.
- CLAUDE.md: project non-negotiables (host-specific).
- research/perf_summary_2026-06-05.md: single perf-table summary across every shipped model.