FajarQuant — Quantization Research for Compiler-Verified LLM Systems

Two research arms, one umbrella. Phase D IntLLM trains a 1.58-bit MatMul-Free LLM family (Mini/Base/Medium 21M-74M params), validates a 3-scale calibrated training-gate chain with monotonically widening margins (0.12 → 0.21 → 0.28 nat), and deploys end-to-end inside the OS kernel via FajarOS Nova IntLLM kernel-path. v3.1 KV Cache Quant (mature, paper artifact) profiles each KV head and routes to the optimal strategy, matching or beating the best fixed method in 7 of 9 evaluation cells with zero catastrophic failures.

FajarQuant is a Rust + Python research repository housing two distinct research lines in LLM quantization, both with compile-time @kernel/@device safety guarantees through the Fajar Lang compiler. The repo started as a KV cache quantization library (v0.1.0–v0.3.0, paper v3.1) and expanded with the Phase D IntLLM training-quantization research line in v0.4.0 (this release).

Two Research Arms

Arm A — Phase D IntLLM (training-time quant, v0.4.0, primary going forward)

Train ternary {-1, 0, +1} weights end-to-end via MatMul-Free LLM architecture (HGRNBitForCausalLM), validate scaling chain across 3 calibrated gates, deploy entirely inside @kernel context with no heap allocation.

Models: intllm-mini (21.5M), intllm-base (46.4M), intllm-medium (74.5M)
Training: 491M / 982M / 1.819B tokens (22.8 / 21.2 / 24.4 tokens per parameter, close to the Chinchilla-optimal ratio of roughly 20)
3 gates PASS with monotonically widening margins:
- Mini v2 val_loss 4.38 PPL 80.0 (gate < 4.5, margin 0.12 nat)
- Base c.1 val_loss 3.99 PPL 54.1 (gate < 4.2, margin 0.21 nat)
- Medium c.1 val_loss 3.72 PPL 41.3 (gate < 4.0, margin 0.28 nat)
Track B 5+1-layer interruption-safety: ckpt_every / --resume / StepWatchdog / HF timeout+retry / regression gate / nohup line-buffering. Validated end-to-end during a real laptop-shutdown event mid-Medium training.
In-kernel deployment via FajarOS Nova v3.9.0 IntLLM Kernel Path.
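The gate figures listed above can be cross-checked with a few lines of arithmetic. The sketch below is not part of the repository's tooling; it only recomputes tokens per parameter, perplexity as exp(val_loss), and the gate margin from the numbers quoted in this README. The names `runs` and `summarize` are illustrative.

```python
import math

# (name, parameters, training tokens, val_loss, gate threshold),
# values copied from the list above.
runs = [
    ("mini",   21.5e6, 491e6,   4.38, 4.5),
    ("base",   46.4e6, 982e6,   3.99, 4.2),
    ("medium", 74.5e6, 1.819e9, 3.72, 4.0),
]

def summarize(params, tokens, val_loss, gate):
    """Derive the quantities quoted in the README for one run."""
    return {
        "tokens_per_param": tokens / params,
        "ppl": math.exp(val_loss),      # perplexity from loss in nats
        "margin": gate - val_loss,      # positive means the gate passed
        "passed": val_loss < gate,
    }

report = {name: summarize(*rest) for name, *rest in runs}
margins = [report[n]["margin"] for n in ("mini", "base", "medium")]
widening = all(a < b for a, b in zip(margins, margins[1:]))

for name, r in report.items():
    print(f"{name:6s} tok/p={r['tokens_per_param']:.2f} "
          f"ppl={r['ppl']:.1f} margin={r['margin']:.2f} pass={r['passed']}")
print("margins widen monotonically:", widening)
```

The recomputed perplexities agree with the listed ones to within rounding of the reported val_loss values.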

Arm B — v3.1 KV Cache Quantization (mature, paper artifact)

Profile each KV head's statistical properties at calibration time, route to optimal quantization strategy. First systematic cross-architecture perplexity evaluation (3 models × 3 bit widths = 9 cells) using the canonical R-α.1 model-surgery protocol. Paper at paper/fajarquant.pdf (MLSys 2027 target).

See Arm B section below for full results.

Phase D IntLLM — Scaling Chain Results (Arm A)

Monotonic LM-modeling improvement across the three scales on 8-task lm-eval v0.4.11 (real bench, no limit, RTX 4090 Laptop); selected metrics shown:

Metric	Mini 21M	Base 46M	Medium 74M	Δ Mini→Med
wikitext word_PPL	342.98	201.09	138.36	−60%
wikitext bits_per_byte	1.575	1.431	1.330	−16%
lambada_openai PPL	51,121	16,729	5,277	−90%
lambada_openai acc	0.001	0.007	0.023	16×
arc_easy acc	0.306	0.319	0.341	+0.035
openbookqa acc	0.110	0.128	0.130	+0.020

Pure LM modeling (wikitext, lambada): clean monotonic scaling. Lambada PPL drops about 3× per scale step (51,121 → 16,729 → 5,277), roughly an order of magnitude from Mini to Medium. Lambada accuracy scales 16× from Mini to Medium.
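As a quick sanity check, the Δ column and the per-step Lambada PPL ratios can be recomputed from the table values. This is a small illustrative script, not a repository target; `rows`, `pct_change`, and `step_ratios` are names chosen here.

```python
# Mini / Base / Medium values copied from the table above.
rows = {
    "wikitext word_PPL":      (342.98, 201.09, 138.36),
    "wikitext bits_per_byte": (1.575, 1.431, 1.330),
    "lambada_openai PPL":     (51121.0, 16729.0, 5277.0),
}

def pct_change(first, last):
    """Relative change from the first to the last scale, in percent."""
    return 100.0 * (last - first) / first

def step_ratios(values):
    """Ratio between consecutive scales (Mini/Base, Base/Medium)."""
    return [a / b for a, b in zip(values, values[1:])]

for name, values in rows.items():
    ratios = ", ".join(f"{r:.2f}x" for r in step_ratios(values))
    print(f"{name:24s} {pct_change(values[0], values[-1]):+.1f}%  steps: {ratios}")
```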

Multi-choice reasoning (hellaswag, piqa, winogrande, arc_*, openbookqa): noisy at sub-100M scale, with differences mostly within ±1-2 stderr. Per the Chinchilla scaling literature, models below 100M parameters are not expected to meaningfully beat random on these tasks, so this result matches expectations. Phase D's contribution is NOT "win on benchmark X"; the models are too small for that.

Phase D Contribution

The actual contribution of Phase D is three-fold:

Compiler/kernel-path enabling LLM inference inside @kernel context with no heap allocation (FajarOS Nova v3.9.0 IntLLM Kernel Path).
Track B 5+1 layer interruption-safety hardening validated end-to-end during a real laptop-shutdown event mid-Medium training (no progress lost beyond the worst-case 36-min checkpoint window).
Calibrated training-gate methodology (Mini < 4.5 / Base < 4.2 / Medium < 4.0 / Stretch < 3.7) that all three calibrated scales pass with monotonically widening margins (0.12 → 0.21 → 0.28 nat).

Bench numbers above verify the scaling validation but are not the headline claim.

Phase D Reproducibility

make verify-intllm-tables          # 12/13 paper claims verified (--strict)
make bench-canonical-real TAG=mini # 8-task lm-eval on mini_final.pt (~10 min)
make bench-canonical-real TAG=base
make bench-canonical-real TAG=medium
make test-train-watchdog            # Track B 5+1 layer gate (24 tests + signal delivery)
make test-intllm-fp16-parity       # fp16-vs-ternary parity (37 hooks, IntLLM differentiator)

See docs/FJQ_PHASE_D_PRODUCTION_PLAN.md for the 9-week plan + docs/FJQ_PHASE_D_GATE_CALIBRATION.md for evidence-backed calibrated gate thresholds.

v3.1 KV Cache Quant — Headline Results (Arm B)

Model	Arch	Bits	FP16	FQ v3.1	KIVI	TQ outlier	Strategy
Gemma 4 E2B	MQA (1 KV head)	2	28.11	39.91	480.66	39.91	PPL-guided → Path C
Gemma 4 E2B	MQA (1 KV head)	3	28.11	16.51	21.69	26.40	17A + 13B
Gemma 4 E2B	MQA (1 KV head)	4	28.11	28.13	35.11	27.52	17A + 13B
Mistral 7B	GQA (8 KV heads)	2	5.67	24.95	24.95	163.96	PPL-guided → all-A
Mistral 7B	GQA (8 KV heads)	3	5.67	6.32	6.00	9.44	278A + 234B
Mistral 7B	GQA (8 KV heads)	4	5.67	5.73	5.73	5.88	all-A
Qwen2-7B	GQA (4 KV heads)	2	7.69	18.44	28.53	75.15	PPL-guided
Qwen2-7B	GQA (4 KV heads)	3	7.69	8.15	8.14	8.38	222A + 2B
Qwen2-7B	GQA (4 KV heads)	4	7.69	7.78	7.78	7.82	all-A

Score: 2 wins, 5 ties, 2 losses (7 of 9 match-or-beat best fixed method); the Qwen2-7B 3-bit cell (8.15 vs 8.14) counts as a tie. Protocol: R-α.1 canonical model surgery, WikiText-2 test set. See paper/fajarquant.pdf §6.
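The win/tie/loss tally can be reproduced from the table with a short script. This is a sketch under one stated assumption: a cell counts as a tie when FQ v3.1 is within 0.015 PPL of the better of KIVI and TQ outlier. That tolerance (`TIE_TOL`) is chosen here so the tally matches the stated score; the paper's exact tie rule is not given in this README.

```python
from collections import Counter

# (model, bits, FQ v3.1, KIVI, TQ outlier), copied from the table above.
cells = [
    ("Gemma 4 E2B", 2, 39.91, 480.66, 39.91),
    ("Gemma 4 E2B", 3, 16.51, 21.69, 26.40),
    ("Gemma 4 E2B", 4, 28.13, 35.11, 27.52),
    ("Mistral 7B",  2, 24.95, 24.95, 163.96),
    ("Mistral 7B",  3, 6.32, 6.00, 9.44),
    ("Mistral 7B",  4, 5.73, 5.73, 5.88),
    ("Qwen2-7B",    2, 18.44, 28.53, 75.15),
    ("Qwen2-7B",    3, 8.15, 8.14, 8.38),
    ("Qwen2-7B",    4, 7.78, 7.78, 7.82),
]

TIE_TOL = 0.015  # assumed tolerance in PPL, not taken from the paper

def outcome(fq, kivi, tq, tol=TIE_TOL):
    """Compare FQ v3.1 against the best fixed method in one cell."""
    best_fixed = min(kivi, tq)
    if abs(fq - best_fixed) <= tol:
        return "tie"
    return "win" if fq < best_fixed else "loss"

def pct_vs_kivi(fq, kivi):
    """Relative PPL change of FQ v3.1 versus KIVI, in percent."""
    return 100.0 * (fq - kivi) / kivi

score = Counter(outcome(fq, kivi, tq) for _, _, fq, kivi, tq in cells)
print(dict(score))
```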

Key Findings

Architecture dependence is the dominant signal. KIVI wins 6/9 cells on its own but fails catastrophically on Gemma MQA at 2-bit (PPL 480 vs FP16 28). No fixed method is safe across architectures.
Two architecture-specific optima that no fixed method finds:
- MQA at 3-bit: PCA rotation yields 24% lower PPL than KIVI on Gemma (16.51 vs 21.69). A single KV head concentrates information, making PCA's decorrelation highly effective.
- GQA at 2-bit: PPL-guided selection yields 35% lower PPL than KIVI on Qwen2 (18.44 vs 28.53) via a novel per-head mixture.
Zero catastrophic failures. v3.0's MSE-only selection produced 2 failures (Gemma 2-bit PPL 171, Mistral 2-bit PPL 79). v3.1's PPL-guided fallback eliminates both.

FajarQuant is part of the Fajar Lang ecosystem: it ships as both a standalone Rust crate (this repo) and a native implementation in Fajar Lang with @device context, SE023 type safety, and AVX2 SIMD (1.9× Hadamard, 1.6× fused kernel, 5× vs Python reference).

Four Innovations

1. Adaptive Per-Head Method Selection — NEW IN v3.1 (`adaptive.rs`)

Profiles each KV head's statistical properties (variance per channel, kurtosis, SVD ratio) at calibration time and routes to the optimal strategy: Path A (KIVI per-channel), Path B (per-head PCA rotation), or Path C (outlier-aware TQ). At 2-bit, falls back to PPL-guided selection when MSE and PPL disagree on the best strategy.

Result: Discovers two architecture-specific optima: PCA rotation on MQA 3-bit (−24% vs KIVI), PPL-guided mixture on GQA 2-bit (−35% vs KIVI).
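To make the profile-then-route idea concrete, here is a minimal pure-Python sketch. It is not the logic in `adaptive.rs`: the routing rule, the thresholds `kurtosis_cut` and `spread_cut`, and the function names are all hypothetical, and the SVD-ratio statistic is omitted to keep the example dependency-free. It only shows how per-channel variance and kurtosis could drive a choice between paths A, B, and C.

```python
from statistics import fmean

def channel_moments(head):
    """Per-channel variance and kurtosis for one KV head.

    head: list of token vectors, each a list of floats of equal length.
    """
    variances, kurtoses = [], []
    for channel in zip(*head):  # iterate over columns (channels)
        mu = fmean(channel)
        m2 = fmean([(x - mu) ** 2 for x in channel])
        m4 = fmean([(x - mu) ** 4 for x in channel])
        variances.append(m2)
        kurtoses.append(m4 / (m2 * m2) if m2 > 0.0 else 0.0)
    return variances, kurtoses

def select_path(head, n_kv_heads, kurtosis_cut=6.0, spread_cut=10.0):
    """Hypothetical routing rule; thresholds are placeholders."""
    variances, kurtoses = channel_moments(head)
    if max(kurtoses) > kurtosis_cut:
        return "C"  # heavy tails: outlier-aware path
    spread = max(variances) / max(min(variances), 1e-12)
    if n_kv_heads == 1 and spread > spread_cut:
        return "B"  # single head, uneven channels: PCA rotation
    return "A"      # default: per-channel quantization

flat = [[1.0, 2.0], [-1.0, -2.0]] * 4
print(select_path(flat, n_kv_heads=8))
```

In the real selector the decision is made per head at calibration time and, at 2-bit, is overridden by PPL-guided selection when MSE and PPL disagree, which this sketch does not model.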

2. Outlier-Aware Calibrated PCA (`turboquant.rs` v2)

Replaces TurboQuant's random rotation with per-head PCA, calibrated once on representative data. Concentrates variance in fewer dimensions and handles outliers explicitly (top-K channel preservation).

Result: 63–81% MSE improvement over v1 at 2-bit; 4–6% over random rotation on Gemma 4 E2B; peak 88% on synthetic d=128, b=3.
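The effect of a calibrated rotation can be illustrated in two dimensions, where the principal axis has a closed form. The following is a toy sketch, not the per-head PCA in `turboquant.rs`: it rotates correlated 2-D points onto their principal axes and shows that the variance concentrates in the first coordinate. Top-K outlier channel preservation is not modeled, and all names are illustrative.

```python
import math

def principal_rotation(points):
    """Rotate centered 2-D points onto their principal axes."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    cxx = sum((x - mx) ** 2 for x, _ in points) / n
    cyy = sum((y - my) ** 2 for _, y in points) / n
    cxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Angle of the leading eigenvector of the 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2.0 * cxy, cxx - cyy)
    c, s = math.cos(theta), math.sin(theta)
    return [((x - mx) * c + (y - my) * s,
             -(x - mx) * s + (y - my) * c) for x, y in points]

def axis_variances(points):
    """Variance along each coordinate axis."""
    n = len(points)
    out = []
    for axis in zip(*points):
        mu = sum(axis) / n
        out.append(sum((v - mu) ** 2 for v in axis) / n)
    return out

data = [(2.0, 1.0), (-2.0, -1.0), (4.0, 2.0), (-4.0, -2.0)]
print("before:", axis_variances(data))
print("after: ", axis_variances(principal_rotation(data)))
```

Before rotation the variance is split across both coordinates; after rotation nearly all of it sits in the first one, so a low-bit quantizer can spend its range where the signal is.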

3. Fused Quantized Attention (`fused_attention.rs`)

Computes attention directly on quantized KV vectors via codebook dot products, skipping the dequantize buffer entirely.

Result: 524,288× memory reduction at 16K context (33.5 MB → 64 B per head). Zero allocation in the hot path.
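A minimal sketch of the codebook-dot-product idea, assuming a single scalar codebook shared across channels (an assumption made here for brevity; the layout in `fused_attention.rs` may differ). For each query, a small table of query-times-codeword products is built once, and each key's score is then a sum of table lookups indexed by its codes, so no dequantized key buffer is materialized. Function names are illustrative.

```python
def fused_scores(query, codes, codebook):
    """Attention logits computed directly from quantized key codes."""
    # table[d][k] = query[d] * codebook[k], built once per query
    table = [[q * c for c in codebook] for q in query]
    return [sum(table[d][k] for d, k in enumerate(key)) for key in codes]

def reference_scores(query, codes, codebook):
    """Baseline: dequantize each key, then take the dot product."""
    scores = []
    for key in codes:
        dequantized = [codebook[k] for k in key]  # the buffer fusion avoids
        scores.append(sum(q * v for q, v in zip(query, dequantized)))
    return scores

codebook = [-1.5, -0.5, 0.5, 1.5]   # 2-bit codebook (4 levels)
query = [1.0, 2.0, -1.0]
codes = [[0, 3, 1], [2, 2, 2]]      # two keys, three channels each
print(fused_scores(query, codes, codebook))
```

The lookup table has size (head dimension × codebook size) and is independent of context length, which is why the working memory stays constant as the cache grows.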

4. Hierarchical Multi-Resolution Bit Allocation (`hierarchical.rs`)

Allocates more bits to recent tokens (which dominate attention scores) and fewer to distant ones via exponential decay.

Result: 48.7% bit savings at 10K context, 55.7% at 16K, with negligible perplexity loss.
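One way to picture an exponential-decay bit schedule is the sketch below. It assumes a half-life parameterization (`half_life`, in tokens), which is a guess for illustration and not necessarily how `hierarchical.rs` defines decay; it therefore does not reproduce the 48.7% and 55.7% figures above.

```python
def bits_for_age(age, max_bits=8, min_bits=2, half_life=1024):
    """Bit width for a token that is `age` positions behind the newest one."""
    decayed = min_bits + (max_bits - min_bits) * 0.5 ** (age / half_life)
    return max(min_bits, min(max_bits, round(decayed)))

def bit_savings(context_len, max_bits=8, min_bits=2, half_life=1024):
    """Fraction of bits saved versus storing every token at max_bits."""
    total = sum(bits_for_age(a, max_bits, min_bits, half_life)
                for a in range(context_len))
    return 1.0 - total / (max_bits * context_len)

for n in (1024, 10_000, 16_384):
    print(n, f"{bit_savings(n):.3f}")
```

Because older tokens never receive more bits than newer ones, the average bit width can only fall as the context grows, so savings increase with context length.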

Quick Start

// profile_head is called below; its module path (adaptive) is assumed here
use fajarquant::adaptive::{profile_head, select_strategy, StrategyPath};
use fajarquant::fused_attention::QuantizedKVCache;
use fajarquant::hierarchical::BitSchedule;

// v3.1 adaptive: profile head stats, select optimal path
// (kv_head_data is the caller's calibration data for one KV head)
let stats = profile_head(&kv_head_data);
// Rust has no named arguments; positional order is (stats, bits, n_kv_heads)
let path: StrategyPath = select_strategy(stats, 3, 8);
// path ∈ { A = KIVI per-channel, B = per-head PCA, C = TQ outlier }

// Quantized KV cache with fused attention
let mut kv_cache = QuantizedKVCache::new(128, 2);
// ... insert keys/values, attention computes on quantized form

// Hierarchical bit allocation (8 → 2 bits as tokens age)
let schedule = BitSchedule::exponential_decay(8, 2, 1024);

Repo Layout

fajarquant/
├── src/
│   ├── lib.rs                 # Public API
│   ├── turboquant.rs          # v2 outlier-aware calibrated PCA baseline
│   ├── adaptive.rs            # v3.1 adaptive per-head selector
│   ├── fused_attention.rs     # Fused attention on quantized KV
│   ├── hierarchical.rs        # Multi-resolution bit allocation
│   └── kivi.rs                # KIVI baseline (for comparison)
├── benches/
│   ├── noise_floor.rs         # Noise floor measurement
│   └── quant_latency.rs       # Quantization latency benchmark
├── examples/                  # 6 .fj demos (run via Fajar Lang)
│   ├── adaptive_demo.fj
│   ├── benchmark.fj
│   ├── fused_demo.fj
│   ├── hierarchical_demo.fj
│   ├── kv_cache.fj
│   └── paper_benchmark.fj
├── paper/                     # MLSys 2027 paper artifacts
│   ├── fajarquant.tex         # arXiv version (11 pages)
│   ├── fajarquant.pdf
│   ├── fajarquant_mlsys.tex   # MLSys 2027 formatted (10 pages)
│   ├── fajarquant_mlsys.pdf
│   ├── SUBMISSION.md          # venue decision (arXiv → MLSys 2027)
│   ├── references.bib
│   └── Makefile
├── data/
│   └── kv_cache/              # 50 prompts from WikiText-2
│       ├── prompt_000/...prompt_049/
│       ├── perplexity_v3.1_*.json       # v3.1 primary results
│       ├── comparison_results.json
│       ├── ablation_results.json
│       └── metadata.json
├── scripts/                   # Reproducibility (Python)
│   ├── extract_kv_cache.py    # Extract Gemma 4 E2B / Mistral / Qwen2 KV cache
│   ├── eval_perplexity_v3.py  # WikiText-2 perplexity eval (R-α.1 protocol)
│   ├── calibrate_fq_v3.py     # Per-head PCA calibration
│   ├── ppl_guided_select.py   # PPL-guided strategy selection at 2-bit
│   ├── profile_kv_heads.py    # Statistical profiling of KV heads
│   └── strategy_selector.py   # Architecture-agnostic decision tree
└── reproduce.sh               # One-script reproduction (4 modes)

Reproducing the Paper

One script, four modes:

./reproduce.sh --verify    # Quick verify (28 claims, ~2 minutes)
./reproduce.sh --smoke     # Smoke test (1 model, ~10 min on RTX 4090)
./reproduce.sh --full      # Full reproduction (3 models × 3 bits, ~3 hours)
./reproduce.sh --fallback  # Public-model fallback (SmolLM-135M, no HF login)

Prerequisites:

Rust 1.87+
Python 3.10+ with transformers, torch, numpy
Hugging Face access to google/gemma-4-e2b, mistralai/Mistral-7B-v0.1, Qwen/Qwen2-7B (--fallback mode avoids this)
~16 GB GPU VRAM (RTX 4090 verified)

Results land in data/kv_cache/perplexity_v3.1_*.json. Paper tables are auto-regenerated via paper/Makefile. CI verifies all 28 paper claims on every push (.github/workflows/paper-reproduce-smoke.yml).

Compile-Time `@kernel` Safety

FajarQuant is implemented in pure Rust, but its sister implementation in FajarOS Nova runs in @kernel context with the Fajar Lang compiler enforcing:

No heap allocation in hot paths (codebook lookups use stack-allocated buffers)
No tensor ops that would require @device context
No external function calls that could touch userspace state
Bounds-checked indexing at compile time via const generics

To the authors' knowledge, this is the first quantization library with these guarantees. PyTorch's quantization tooling enforces none of them at compile time.

Test verification:

cd ../fajaros-x86
head -50 kernel/compute/fajarquant.fj
# all functions annotated @kernel, verified by `fj check`

Citation

If you use FajarQuant in academic work, please cite:

@misc{putranto2026fajarquant,
  title={FajarQuant v3.1: Adaptive Per-Head KV Cache Quantization with Compile-Time Safety Guarantees},
  author={Putranto, Muhamad Fajar},
  year={2026},
  publisher={PrimeCore.id},
  url={https://github.com/fajarkraton/fajarquant}
}

Status

Arm A — Phase D IntLLM (v0.4.0)

Component	Status
HGRNBitForCausalLM 1.58-bit ternary architecture	Production
Mistral v3 32K tokenizer + SlimPajama-6B streaming loader	Production
Calibrated gate methodology (Mini/Base/Medium/Stretch)	Production
Mini c.1 training (491M tokens)	PASS gate by 0.12 nat margin
Base c.1 training (982M Chinchilla-optimal tokens)	PASS gate by 0.21 nat margin
Medium c.1 training (1.819B tokens, ~Chinchilla)	PASS gate by 0.28 nat margin
Stretch c.1 training	Deferred to V32 post-Phase-E2 paper
Track B 5+1-layer interruption-safety (V31.C.P6.1-P6.6)	Production, validated end-to-end
`make test-train-watchdog` regression gate (24 tests + signal delivery)	Green
Bench canonical real (8 lm-eval tasks × Mini/Base/Medium)	Complete
`make verify-intllm-tables --strict` (12/13 claims PASS)	1 pending: kernel E2E Mini tok/s (FajarOS-side artifact)
Bench knowledge real (mmlu, triviaqa, boolq)	Deferred to V32 post-Phase-E2 paper
BitNet 2B4T baseline comparison	Deferred to V32 post-Phase-E2 paper
In-kernel deployment via FajarOS Nova IntLLM kernel-path	Production (FajarOS v3.9.0)
Phase D paper (Table 2 wikitext + hellaswag rows)	Real numbers populated
Phase D paper LaTeX writeup (full §4)	Deferred to Phase E paper (one combined MLSys submission)

Arm C — Phase E Bilingual Kernel-LLM (Path A submission-ready)

Phase D extends to Indonesian + English bilingual ternary LLM in kernel context. Plan v1.10 — Path A scope decision committed 2026-04-27: STOP Phase E2 after 2 honest negative results, CUT E2.2/E2.3/E2.5 (deferred to Phase F.7/F.8/F.9), pivot to paper. Manuscript drafted, compiled, and submission-ready (founder actions remain — see docs/ARXIV_SUBMISSION.md).

Component	Status
Phase E1 Bilingual corpus v1.0	✅ CLOSED. 25.67 B tokens at 60:40 ID:EN (15.40 B ID + 10.27 B EN). 0% synthetic, 0.0254% exact-hash dedup. `make verify-bilingual-corpus` 8/8 invariants.
Phase E2.0 pre-flight + Q5 baseline	✅ CLOSED. Q5 = 24K-step Mini bilingual training: val_loss(ID)=2.68, val_loss(EN)=4.73, ratio=1.77×.
Phase E2.4 balanced_calib	✅ CLOSED with HONEST NEGATIVE RESULT. `outlier_global_reduction = −82.13` (gate ≥ 0.10) — calibrated quantizer 83× WORSE than upstream baseline. See `docs/FJQ_PHASE_E_E2_BILINGUAL_CALIB_DECISION.md`.
Phase E2.1 Hadamard rotation	✅ CLOSED with HONEST NEGATIVE RESULT. val_loss(EN) = 4.852 vs Q5 4.732, regression +0.12 nat (gate required ≥+0.05 nat improvement). See `docs/FJQ_PHASE_E_E2_HADAMARD_DECISION.md`.
Path A scope decision	✅ DONE. Cut E2.2/E2.3/E2.5 to Phase F; pivot to paper. See `docs/FJQ_PHASE_E_PATH_A_PAPER_OUTLINE.md` v1.0.
Paper §1-§10 + abstract LaTeX	✅ DRAFTED with REAL data. `paper/intllm/intllm.tex` 2435 LOC; `intllm.pdf` 20 pages 787 KB; 5 figures via `scripts/generate_paper_figures.py`; refs.bib v1.0 with 15 verified citations; `verify_intllm_tables.py --strict` 32/32 PASS.
arXiv submission tarball	✅ BUILT + standalone-verified. `make arxiv-tarball` produces `paper/intllm/intllm-arxiv.tar.gz` (77 KB); verification cycle (extract to /tmp + pdflatex×3 + bibtex) green.
Founder actions for arXiv upload	⏳ PENDING — see `docs/ARXIV_SUBMISSION.md`: ORCID register, Zenodo DOI mint, arxiv.org account+endorsement, first-draft review, upload.
Phase E2.2 / E2.3 / E2.5 / E2.6	Deferred to Phase F.7 / F.8 / F.9 under Path A scope decision; original sub-task tables preserved in plan v1.10 §3 PHASE E2 for Phase F lift.
Phase E3 Bilingual pretrain (Base/Medium scale)	Pending (post-paper-acceptance; ~30-50h GPU)
Phase F (Tier 3 tax-vertical TaxPrime + F.5/F.6 post-hoc PTQ/QuaRot follow-ups)	Pending (gated on Phase E paper acceptance + 2 weeks; see `docs/FJQ_PHASE_F_TAX_VERTICAL_ROADMAP.md` v1.3)

Arm B — v3.1 KV Cache Quant (mature paper artifact)

Component	Status
Four innovations (adaptive, outlier-aware PCA, fused, hierarchical)	Production
Multi-model perplexity evaluation (Gemma 4 E2B + Mistral 7B + Qwen2-7B)	Complete
9-cell cross-architecture comparison (3 models × 3 bit widths)	Complete
Ablation: v3.0 → v3.1 (PPL-guided selection + per-head PCA)	Complete
LaTeX paper (11 pages arXiv + 10 pages MLSys 2027 template)	Complete
Venue decision committed (`paper/SUBMISSION.md`)	Complete
CI: 28 paper claim checks on every push	Complete
6 Fajar Lang demos (`examples/*.fj`)	Complete
Kernel port (FajarOS Nova)	Phase 1+2 complete
ORCID + Zenodo DOI wire-up into paper	Pending (C3.6)
MLSys 2027 paper submission	Pending (external deadline)
Wall-clock latency benchmarks (ms/token vs KIVI/TQ)	Deferred (future work)

Cross-Repo Linkage

This v0.4.0 release pairs with:

Fajar Lang v31.0.0 — compiler dependency; Phase D IntLLM uses @noinline + @inline + @cold (V29.P1) and @no_vectorize (V31.B.P2) attributes.
FajarOS Nova v3.9.0 — runs Phase D medium_final.pt checkpoints inside @kernel context via the IntLLM kernel-path (make test-intllm-kernel-path 4-invariant gate).

All three repos share Apache 2.0 license (relicensed from MIT on 2026-04-24 for fajar-lang + fajaros-x86; fajarquant has been Apache 2.0 since inception).

See paper/SUBMISSION.md and V26 Production Plan Phase C for the full roadmap and gate dates.

License

Apache License 2.0. See LICENSE.

FajarQuant crate v0.4.0 / Phase D IntLLM + KV quant v3.1 — Made in Indonesia by Muhamad Fajar Putranto (PrimeCore.id) — built with Fajar Lang v31.0.0 + Claude Opus 4.7
