tt-model-bringup

Direct TT-Metal bringup of modern open-weight LLMs on Tenstorrent Blackhole, with custom owned_* compute kernels and a continuous-batching OpenAI-compatible HTTP server.

What: Hand-written TT-Metal graphs (no PJRT, no JAX) for Qwen3.6-27B, Qwen3.6-35B-A3B MoE, Gemma 4 12B, Nemotron-3 Nano 30B-A3B, plus a model zoo of single-chip Llama / Qwen2.5 / SmolLM ports.

Why: Squeeze production-grade per-token throughput out of Blackhole P150s by owning the compute graph end to end: kernels, scheduler, and HTTP server.

How: Custom fused-op kernels under experiments/owned_ops/, a continuous-batching engine under experiments/serve/, and a learning wiki documenting every design decision.

Cold-start? Read HANDOFF.md first — the one-page entry point with current perf, production paths, and what's next.

Hardware

Bringups run on Tenstorrent QuietBox workstations (4× Blackhole P150 per box) with FABRIC_1D inter-chip links. A single P150 is enough for the legacy 8B-class demos under models/. Qwen3.6-27B, Qwen3.6-35B-A3B, and Gemma 4 12B require a (1, 4) mesh.

| Model class | Mesh | Path |
| --- | --- | --- |
| Llama 1B/3B/8B, SmolLM3, Qwen2.5/3 small | single P150 | `models/*.py` |
| Qwen3.6-27B, Gemma 4 12B | `(1, 4)` 4×P150 | `experiments/serve/server_*.py` |
| Qwen3.6-35B-A3B MoE | `(1, 4)` 4×P150 | `experiments/serve/server_35b_*.py` |
| Nemotron-3 Nano 30B-A3B | `(1, 4)` 4×P150 (target) | `experiments/owned_ops/nemotron3_mamba2_decode_owned/` |

No local device execution. Every command in this repo assumes you are on the QuietBox; the dev loop expects ssh into a host with TT-Metal built from source.

Quickstart

On a QuietBox with TT-Metal built from source:

```bash
git clone https://github.com/aweditya/tt-model-bringup.git ~/tt-xla
cd ~/tt-xla

make setup                                          # uv sync: Python deps into .venv
make install-ttnn                                   # editable ttnn from $TT_METAL_HOME
make check                                          # sanity-check setup (no device open)
make kernels                                        # build the owned_ops custom kernels
bash experiments/serve/scripts/serve_cb.sh start    # boot the CB chat server (~6 min)
```

Talk to it once /health returns 200:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"Hi!"}],"max_tokens":200}'
```
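The same wait-then-ask flow can be scripted. The following is a minimal standard-library sketch, not code from this repo: the helper names (`is_healthy`, `wait_until_healthy`, `build_chat_request`, `chat`) and the polling interval and deadline are illustrative assumptions, and the reply parsing assumes the server returns the standard OpenAI chat-completions response shape.

```python
import json
import time
import urllib.error
import urllib.request

BASE_URL = "http://localhost:8000"  # host and port taken from the curl example


def is_healthy(base_url: str = BASE_URL, timeout: float = 5.0) -> bool:
    """Return True when GET /health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_healthy(probe=is_healthy, deadline_s: float = 600.0,
                       interval_s: float = 10.0, sleep=time.sleep) -> bool:
    """Poll `probe` until it returns True or roughly `deadline_s` has been slept."""
    waited = 0.0
    while True:
        if probe():
            return True
        if waited >= deadline_s:
            return False
        sleep(interval_s)
        waited += interval_s


def build_chat_request(prompt: str, max_tokens: int = 200,
                       base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the same POST the curl example sends."""
    body = {"messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, max_tokens: int = 200) -> str:
    """Send one prompt and return the assistant text."""
    with urllib.request.urlopen(build_chat_request(prompt, max_tokens),
                                timeout=300) as resp:
        reply = json.load(resp)
    # Assumption: standard OpenAI response layout.
    return reply["choices"][0]["message"]["content"]
```

On the QuietBox, calling `wait_until_healthy()` after `serve_cb.sh start` and then `chat("Hi!")` mirrors the manual steps. The 600-second default deadline is only a guess chosen to sit above the roughly 6-minute boot time.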

Stop gracefully (hard-killing can wedge the fabric):

```bash
bash experiments/serve/scripts/serve_cb.sh stop
```

The local working directory stays ~/tt-xla for historical reasons (the repo was originally a JAX/XLA PJRT-backend exploration). Renaming breaks too many rsync paths to be worth it.

See REPRODUCE.md for the full reproducible-on-any-QuietBox recipe, tested-versions matrix, and per-demo expected numbers.

Models brought up

| Model | Params | Architecture | Status |
| --- | --- | --- | --- |
| Qwen3.6-27B | 27 B dense | Hybrid attention + GatedDeltaNet | Production CB + TP, prefix-cache live |
| Qwen3.6-35B-A3B | 35 B / 3 B-A | Hybrid + GatedDeltaNet + MoE | CB shipped; B>1 blocked on slot-poisoning fix |
| Gemma 4 12B (base + IT) | 12 B dense | Dual sliding/global attention | End-to-end CB + HTTP chat |
| Nemotron-3 Nano 30B-A3B | 30 B / 3 B-A | Mamba2-Transformer hybrid MoE | In progress: owned Mamba2 SSD kernel G1 single-core complete (modes 1–5 PASS); G2 multi-core next |
| Llama 1B/3B/8B, SmolLM3, Qwen2.5/3 small | up to 8 B | Decoder-only Transformer | Legacy single-chip demos (see `models/`) |

Per-token throughput numbers drift across kernel work — see HANDOFF.md for the current figures.

Chat server

experiments/serve/cb_api.py is the production HTTP server: an OpenAI-compatible API hosting a continuous-batching engine on top of the Orca scheduler and the logits-traced forward.

Endpoints: /v1/chat/completions, /v1/completions, /v1/models, /health, /metrics
Multi-client by design; per-request sampling (temperature / top_p / top_k / seed); slot freed on client disconnect
Backend selection via TT_BACKEND (27b, 35b, gemma4_12b)
Tuning knobs: TT_CB_PORT, TT_CB_SLOTS, TT_CB_MAX_NEW, TT_CB_MAX_INFLIGHT, TT_CB_PREFIX_CACHE, TT_CB_CHUNKED_PREFILL — full list in experiments/serve/scripts/serve_cb.sh
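As an illustration of how the backend knobs and per-request sampling fit together, here is a small sketch. It is not taken from the repo: `server_env` and `sampling_request` are hypothetical helpers, the default port and slot count are placeholders rather than recommended settings, and it assumes `serve_cb.sh` reads these variables from the process environment.

```python
import os

# Backend names listed for TT_BACKEND above.
VALID_BACKENDS = {"27b", "35b", "gemma4_12b"}


def server_env(backend: str, port: int = 8000, slots: int = 4,
               base_env=None) -> dict:
    """Return a copy of the environment with a few server knobs set.

    Defaults here are placeholders for illustration only.
    """
    if backend not in VALID_BACKENDS:
        raise ValueError(f"unknown TT_BACKEND {backend!r}")
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "TT_BACKEND": backend,
        "TT_CB_PORT": str(port),    # environment values must be strings
        "TT_CB_SLOTS": str(slots),
    })
    return env


def sampling_request(prompt: str, *, temperature=None, top_p=None,
                     top_k=None, seed=None, max_tokens: int = 200) -> dict:
    """Build a chat-completions body with optional per-request sampling.

    Unset options are omitted so the server's own defaults apply.
    """
    body = {"messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens}
    options = {"temperature": temperature, "top_p": top_p,
               "top_k": top_k, "seed": seed}
    for name, value in options.items():
        if value is not None:
            body[name] = value
    return body
```

A launch under these assumptions would pass the dictionary to the start script, for example `subprocess.run(["bash", "experiments/serve/scripts/serve_cb.sh", "start"], env=server_env("27b"), check=True)`, and two clients could then send bodies with different `seed` or `temperature` values to the same server.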

A Claude-Code-style chat TUI lives at scripts/chat.py — see scripts/CHAT_TUI.md for the slash-command reference.

Repo layout

| Path | Purpose |
| --- | --- |
| `experiments/` | Production servers, owned kernels, CB engine + validators, probes |
| `research/` | Design docs and living plans (index: `research/README.md`) |
| `wiki/` | Q&A wiki: learning-by-building notes on JAX/XLA + TT-Metal internals |
| `scripts/` | Dev-loop scripts: `deploy.sh`, `run_remote.sh`, `build_owned_ops.sh`, chat TUI |
| `models/` | Legacy single-chip multi-model demos (Llama, SmolLM, Qwen2.5/3) |
| `archive/` | Retired probes, the founding JAX/PJRT-era sources (`legacy/`), the 2026-06-04 poster + measurements, and dated cleanup buckets |

Long context

The 4-chip TP path is validated to a context length of L=4000: a needle-in-a-haystack test retrieves an 8-character needle verbatim at every tested needle position (0.25, 0.5, and 0.75 of the way through the context), using the B3 HiFi2 SDPA recipe.

Probe: experiments/utils/needle_haystack_qb2_tp.py.
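For readers unfamiliar with the test, the idea can be sketched in a few lines. This is not the repo's probe: the helper names, the filler text, and the "secret code" sentence are invented for illustration, and position is measured in words here rather than in model tokens.

```python
import random
import string


def make_needle(rng: random.Random, length: int = 8) -> str:
    """Random alphanumeric needle the model must repeat exactly."""
    alphabet = string.ascii_uppercase + string.digits
    return "".join(rng.choice(alphabet) for _ in range(length))


def build_haystack(filler_words: list[str], needle: str, frac: float) -> str:
    """Insert a sentence carrying `needle` at fraction `frac` of the filler."""
    if not 0.0 <= frac <= 1.0:
        raise ValueError("frac must be between 0 and 1")
    cut = int(len(filler_words) * frac)
    sentence = f"The secret code is {needle}."
    return " ".join(filler_words[:cut] + [sentence] + filler_words[cut:])


def retrieved_verbatim(model_output: str, needle: str) -> bool:
    """Verbatim check: the exact needle string must appear in the output."""
    return needle in model_output


# Build one prompt per needle position; the model call itself is omitted.
rng = random.Random(0)
needle = make_needle(rng)
prompts = {frac: build_haystack(["lorem"] * 1000, needle, frac)
           for frac in (0.25, 0.5, 0.75)}
```

Each prompt would be followed by a question asking for the code, and `retrieved_verbatim` applied to the generated answer. A substring check is deliberately strict: a near miss with one wrong character counts as a failure.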

Troubleshooting

| Symptom | Fix |
| --- | --- |
| Fabric wedged after a hard-kill | `make reset` (`tt-smi -r 0,1,2,3`) and restart the server; always prefer the `serve_*.sh stop` path. |
| `AutoTokenizer` fails with 401 | No HF token configured: run `uv run hf auth login` or set `HF_TOKEN` in `.env`. |
| `HF_HOME` fills `$HOME` | Override: `export HF_HOME=/path/with/space/.cache/hf`. |
| Legacy demos won't open device 0 | Prod server holds it: run `bash experiments/serve/scripts/serve_cb.sh stop` first. |
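The two Hugging Face rows can be caught before a long model download with a small preflight check. The sketch below is an assumption-laden illustration, not part of the repo: `hf_preflight` and its 100 GB threshold are hypothetical, it only inspects the environment it is given (a token saved by `hf auth login` or a value sitting in an unloaded `.env` file is not detected), and it falls back to the usual `~/.cache/huggingface` default when `HF_HOME` is unset.

```python
import os
import shutil
from pathlib import Path


def hf_preflight(env=None, min_free_gb: float = 100.0) -> list[str]:
    """Return a list of human-readable problems; empty means nothing found."""
    env = os.environ if env is None else env
    problems = []

    # Only the environment variable is checked here (see caveat above).
    if not env.get("HF_TOKEN"):
        problems.append("HF_TOKEN is not set; gated downloads may fail with 401")

    hf_home = Path(env.get("HF_HOME", Path.home() / ".cache" / "huggingface"))
    # The cache directory may not exist yet, so measure the nearest
    # existing ancestor, which lives on the same filesystem.
    probe = hf_home
    while not probe.exists() and probe != probe.parent:
        probe = probe.parent
    free_gb = shutil.disk_usage(probe).free / 1e9
    if free_gb < min_free_gb:
        problems.append(
            f"only {free_gb:.0f} GB free under {probe}; "
            "consider pointing HF_HOME at a larger disk"
        )
    return problems
```

Running `print(hf_preflight())` before `make setup` or the first server start would list any findings; the free-space threshold should be adjusted to the checkpoints actually being pulled.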

Related projects

tenstorrent/tt-metal: the device runtime + LLK that everything here links against.
tenstorrent/tt-xla: Tenstorrent's official PJRT backend (distinct from this repo's retired plugin under archive/legacy/pjrt_plugin/).
tenstorrent/vllm: Tenstorrent's vLLM fork; the continuous-batching design here lines up with its API.
Corsix's Tenstorrent Wormhole blog series: best architectural reference outside the TT docs.

Top-level docs

HANDOFF.md — current perf, production paths, next bringup target.
REPRODUCE.md — reproduce the chat server + legacy demos on a fresh QuietBox.
CONTRIBUTING.md — dev loop, canary gates, code style.
CLAUDE.md — project non-negotiables (host-specific).
research/perf_summary_2026-06-05.md — single perf-table summary across every shipped model.
