Two Gemmas, one laptop — local repo-wide vulnerability auditing.
GemmaForge is a two-tier, fully-local pipeline built on the Gemma 4 family.
Tier 1 — activation-probe scan. A linear probe over Gemma 4's hidden state sweeps a whole repo and emits ranked leads. Two variants under one scoring contract:
- `weak-probe`: Gemma 4 E2B backbone, bounded line/token windows, high recall, laptop-class. Default; what the metrics below validate.
- `strong-probe`: larger Gemma checkpoint as the probe backbone, larger windows, more repo context, better ranking. Workstation-class. Same `gemmaforge.leads/v1` schema.
Tier 2 — pwnkit verifier. Gemma 4 8B via Ollama drives a tool-using agent (pwnkit review --seed-findings --runtime ollama) that consumes the leads as its primary worklist, reads files, greps call sites, and produces tool-grounded audit verdicts. Consumes leads from either probe.
Nothing leaves the laptop. The probe is what the metrics below validate. The agent verifier is shipped end-to-end at the CLI + runtime + dashboard layer in two merged pwnkit PRs: #371 (CLI, Ollama runtime, three-lane dashboard) and #372 (seed-findings worklist injection, streaming Ollama tool-calls, live Hunt lane). Both merged 2026-05-18.
- AUC 0.907 cross-language (Python → JavaScript): the honest OOD headline. Train on Python, eval on JavaScript with a near-disjoint CWE basket (`data/eval/report.md`).
- AUC 0.940 Py → JS with MLP × ensemble stacked (`data/probe_mlp_ensemble_xlang.json`): closes ~80% of the gap to in-distribution; the in-distribution stack saturates at 0.976.
- 22× T3 adversarial robustness via max-pool aggregation: comment+reindent drops last-token AUC by 0.087; max-pool absorbs the same attack at Δ = −0.004 (`data/adversarial_full_suite.json`).
- LoRA vuln-pattern emission: base = 3 hits → LoRA = 0 on the 5-prompt demo eval, after 30 SFT steps in 39s on MPS (`data/lora_eval.json`).
- Per-CWE multi-head probe: macro AUC 0.9726 across CWE-78/338/95/94/328 (`data/probe_per_cwe_card.json`).
- In-distribution baseline AUC 0.963, group-by-repo AUC 0.958, held-out CWE 0.92–0.96. <1 ms per-token overhead on M1/M2/M3.
Full audit trail: RESULTS.md. Submission write-up: WRITEUP.md.
Repository layout and artifact ownership: docs/repo-map.md and data/README.md.
Two demo flows:
- IDE-mode (Tier 1 only): the probe rides every token Gemma 4 emits. Green (<0.4), yellow (0.4–0.7), red (>0.7). The red cluster above, `` db.query(`SELECT * FROM users WHERE id = ${id}`) ``, is a CWE-89 SQL-injection pattern. This is the streaming UI in `src/stream_with_probe.py` and the HuggingFace Space below.
- Repo-audit mode (Tier 1 → leads → Tier 2): `python -m src.scan ./repo --events live.ndjson > leads.jsonl` walks the codebase, emits ranked `gemmaforge.leads/v1` JSONL, streams `gemmaforge.events/v1` to the pwnkit `/live` dashboard (Lane 1 · Probe, Lane 2 · Leads), then hands the leads to `pwnkit review --seed-findings ... --runtime ollama` for Gemma 4 8B agentic verification.
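
The colour bands in IDE-mode amount to a simple threshold map. The sketch below is illustrative only: the function name `risk_band` is hypothetical, and it assumes the boundary values 0.4 and 0.7 fall in the yellow band, which the thresholds above imply but do not state outright.

```python
def risk_band(score: float) -> str:
    """Map a probe score in [0, 1] to a colour band.

    Thresholds come from the README: green (<0.4), yellow (0.4-0.7), red (>0.7).
    Treating exactly 0.4 and 0.7 as yellow is an assumption.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"probe score must be in [0, 1], got {score}")
    if score < 0.4:
        return "green"
    if score <= 0.7:
        return "yellow"
    return "red"


if __name__ == "__main__":
    for s in (0.12, 0.40, 0.55, 0.93):
        print(f"{s:.2f} -> {risk_band(s)}")
```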
Live HF Space: peaktwilight/gemmaforge (HuggingFace Space, ZeroGPU).
In 2026 every IDE has an AI in it. Most of the code those AIs generate ships. Almost none of it is audited for security before it does. Existing SAST tools (Snyk, Semgrep, CodeQL) run after the code is written — and they require shipping the code somewhere. Solo maintainers, air-gapped enterprises, and developers in regions where a Snyk seat costs more than a month's rent are excluded.
GemmaForge asks a sharper question: what if a small model could read your whole repo on your laptop and rank where it smells like a CVE, and a bigger version of the same model could read those spans and confirm — without anything leaving the device?
Tier 1 is inspired by Obeso, Arditi et al. (2025), which showed linear probes on hidden states detect hallucinated entities at AUC > 0.9. We adapt the same technique to vulnerability likelihood on Gemma 4 E2B, trained on a labeled dataset of paired vulnerable/safe code drawn from public CVE disclosures, including CVE-2026-33896 in node-forge, which Doruk disclosed in March 2026.
Every number above is from the sample-level path — one label per snippet, linear probe on the last-token hidden state (src/extract_activations.py + src/train_probe.py). That is what ships in the demo today.
In parallel we are training a token-level path — per-token spans (evidence / sink / source / sanitizer), value head on every position, BCE + span-max loss à la Obeso §3 — via experiments/notebooks/training/colab_train_gemma4_probe.py and scripts/train_value_head.py. Current token-level numbers on SVEN-diff at layer 8: token AUC 0.768 / example AUC 0.700 (data/probe_spanmax_card.json). The token-level path is the principled streaming version; the intent is for it to replace sample-level once the head-to-head eval lands.
| Eval split | AUC | Source |
|---|---|---|
| Random stratified (in-dist) | 0.963 | data/probe_card.json |
| Group-by-repo (honest) | 0.958 | data/eval/report.md |
| Held-out CWE (5-class range) | 0.92–0.96 | data/probe_strict.json |
| Cross-language Py → JS (OOD headline) | 0.907 | data/eval/report.md |
| 4-layer ensemble (random) | 0.976 | data/probe_ensemble.json |
| MLP head (Py → JS) | 0.919 | data/probe_mlp_ablation.json |
| MLP × ensemble (Py → JS, stacked) | 0.940 | data/probe_mlp_ensemble_xlang.json |
| Per-CWE multi-head (macro) | 0.9726 | data/probe_per_cwe_card.json |
| T3 comment+reindent, last-token aggregation | Δ −0.087 | data/adversarial_full_suite.json |
| T3 comment+reindent, max-pool aggregation | Δ −0.004 | data/adversarial_full_suite.json |
| Length-only baseline (length-matched eval) | 0.514 | data/eval_length_matched.json |
| Probe on the same length-matched eval | 0.843 | data/eval_length_matched.json |
- A model-size family as the architecture. `weak-probe` (Gemma 4 E2B) does cheap whole-repo triage on a laptop; `strong-probe` (larger Gemma checkpoint) does the same with bigger windows and richer context on a workstation; the `pwnkit verifier` (Gemma 4 8B via Ollama) does deep agentic auditing with tool use. Same model family, same tokenizer, same function-calling protocol, same license, same lead schema flowing between stages; that combination doesn't exist outside Gemma 4.
- Domain transfer of streaming probes from hallucination → security. Vulnerability likelihood is fuzzier than entity hallucination; a single linear probe still hits AUC 0.907 cross-language on a 2B model with no LoRA, ~1k pairs.
- Author-informed training data. Doruk's CVE-finding work, including the `node-forge` patch (CVE-2026-33896), informed which CWE classes the dataset emphasises and how the probe was evaluated. Canonical training metrics are on CyberSecEval `pairs_v2` (`data/pairs_v2.jsonl`); Doruk's embargoed disclosures are deliberately excluded.
- Runs locally, no network. The probe path runs on laptop-class hardware. The Gemma 4 8B verifier (Q4_K_M via Ollama) fits in ~10GB of VRAM, which laptop GPUs and M-series Macs handle comfortably. Anyone Snyk-priced-out gets the same audit capability as a Fortune 500.
- Honest OOD reporting. We lead with the cross-language number, not the in-distribution one, because Python→JavaScript brings a near-disjoint CWE basket and tests language-portability of the underlying representation.
- Length artefact (refuted, worth disclosing). On the unmatched dataset, a length-only baseline gets AUC 0.883 on Py→JS (vs probe 0.907), which is a real concern. On a length-matched eval (`data/eval_length_matched.json`, 230 pairs balanced to ±3 chars per class), the length baseline collapses to 0.514 while the probe still hits 0.843. The probe-minus-length delta widens from +0.024 to +0.329.
- Probes classify representation, not ground truth. A high score means "looks like code I was trained on as vulnerable," not "is exploitable."
- Train-inference distribution gap. Probe trained on last-token of complete snippets; inference reads intermediate positions. Mitigated with 8-token mean; proper fix is the token-level value head (above).
- Surface-form sensitivity. Identifier renames + SQL case shift mean |Δ| ≤ 0.007 (refutes "probe detects API names"). Only T3 (comment+reindent) dominates, at Δ −0.087 under last-token aggregation. Max-pool flattens all 5 transforms to |Δ| ≤ 0.005: 22× T3 robustness at a cost of −0.031 clean AUC. Streaming UI ships max-pool; offline eval reports last-token. See `data/adversarial_full_suite.json`.
- Coverage. Python/JS well-covered; C/C++ activations not yet extracted; Rust/Go/Ruby/PHP not in the dataset yet (tracked in `docs/coverage.md`).
- Regex still wins on CWE-328. Static patterns hit AUC 0.991 on weak-hash; probe at 0.944 (`data/eval/sample_report.json` → CWE-328 → `baseline_aucs.regex`). Honest: a probe is not always better.
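
To make the aggregation choices in the list above concrete, here is a minimal sketch of last-token, trailing-mean, and max-pool aggregation over per-token probe scores. The function names and the toy score values are hypothetical and chosen only for illustration; they are not taken from `src/stream_with_probe.py` or from the eval data.

```python
from typing import Sequence


def last_token(scores: Sequence[float]) -> float:
    """Use only the final position's score (the offline-eval convention)."""
    return scores[-1]


def trailing_mean(scores: Sequence[float], k: int = 8) -> float:
    """Average the last k positions (k=8 mirrors the '8-token mean' mitigation)."""
    tail = scores[-k:]
    return sum(tail) / len(tail)


def max_pool(scores: Sequence[float]) -> float:
    """Take the highest score anywhere in the snippet."""
    return max(scores)


if __name__ == "__main__":
    # Toy per-token scores: a risky span in the middle of the snippet.
    clean = [0.10, 0.20, 0.90, 0.80]
    # Same snippet with harmless trailing tokens appended (e.g. a comment).
    attacked = clean + [0.20, 0.10]
    for name, fn in (("last", last_token), ("max", max_pool)):
        print(name, fn(clean), "->", fn(attacked))
```

In this toy case the last-token score moves from 0.8 to 0.1 when benign tokens are appended, while the max-pool score stays at 0.9. That is the intuition behind shipping max-pool in the streaming UI, though the real T3 transform (comment+reindent) is more involved than appending tokens.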
| File | Rows | Pos/Neg | Languages | Source |
|---|---|---|---|---|
| `data/pairs_v2.jsonl` (canonical training set) | 1,198 | 599 / 599 | python (700), javascript (498) | CyberSecEval position-paired |
| `data/pairs_sven.jsonl` | 1,560 | 780 / 780 | python (730), c (720), cpp (110) | SVEN-before / SVEN-before-leadup |
| `data/pairs_merged.jsonl` | 2,758 | 1,379 / 1,379 | python (1,430), c (720), javascript (498), cpp (110) | union of v2 + sven |
| `data/dataset.jsonl` (Colab token-level path) | 1,374 | 687 / 687 | python (760), c (538), cpp (76) | SVEN-before / SVEN-after |
| `data/pairs_rich.jsonl` | 100 | token labels | python (58), javascript (42) | derived for token-level probe |
CWE coverage in pairs_v2.jsonl: CWE-22/78/79/89/94/95/185/208/312/319/328/338/345/502/770/798/908/1333 (Python-heavy on CWE-78/94/89/328/502; JS-heavy on CWE-95/345/208/22/185). Only CWE-338 is meaningfully bilingual — this is why the Py→JS number is non-trivial, the held-out language brings a near-disjoint CWE basket with it.
┌───────────────────────── Tier 1 (this repo) ──────────────────────────┐
repo/ ──► src/scan.py ──► Gemma 4 forward pass ──► hidden_state @ layer 17
(line/token ├── weak-probe: E2B (laptop) │
windows, lazy LM) └── strong-probe: larger (workstation)
▼
σ(w·h+b) + per-CWE head
│
▼
ranked leads JSONL (gemmaforge.leads/v1)
+ events stream (gemmaforge.events/v1)
│
┌───────────────────────── Tier 2 (0sec-labs/pwnkit#371) ─────────────────┘
│
pwnkit `/live` dashboard ◄── SSE ──┐ │
(Lane 1 · Probe, Lane 2 · Leads, │
Lane 3 · Hunt — all live) │
▼
pwnkit review --seed-findings leads.jsonl
--seed-only --runtime ollama
│
▼
Gemma 4 8B (Ollama /api/chat + tools[],
streaming tool-calls per pwnkit#372)
read_file · grep · follow-call-site
│
▼
structured audit verdicts
(leads are the primary worklist;
verdicts stream into Lane 3 live)
Tier 1 (this repo) — activation-probe scan, two variants:
| Variant | Backbone | Window strategy | Hardware | Status |
|---|---|---|---|---|
| `weak-probe` | `google/gemma-4-E2B-it` (~2B) | bounded line/token windows, top-k mean aggregation | laptop-class | shipped, what the metrics validate |
| `strong-probe` | larger Gemma checkpoint (`--model-id <…>`) | larger windows, more repo context | workstation-class (24GB+ GPU or 32GB+ M-series) | supported by `src/scan.py`; probe head retrains against the larger backbone's hidden states |
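
The "bounded line windows, top-k mean aggregation" strategy in the table can be sketched roughly as follows. This is an assumption-laden illustration, not the code in `src/scan.py`: the helper names `line_windows` and `top_k_mean`, the window size, and the stride are all hypothetical.

```python
from typing import Iterator, List, Sequence, Tuple


def line_windows(
    lines: Sequence[str], size: int, stride: int
) -> Iterator[Tuple[int, int, str]]:
    """Yield (start_line, end_line, text) windows; line numbers are 1-indexed, inclusive."""
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    n = len(lines)
    start = 0
    while start < n:
        end = min(start + size, n)
        yield start + 1, end, "\n".join(lines[start:end])
        if end == n:  # last window reached the end of the file
            break
        start += stride


def top_k_mean(scores: Sequence[float], k: int) -> float:
    """Mean of the k highest scores (or of all scores if fewer than k)."""
    if not scores or k <= 0:
        raise ValueError("need at least one score and k >= 1")
    top: List[float] = sorted(scores, reverse=True)[:k]
    return sum(top) / len(top)


if __name__ == "__main__":
    src = ["a", "b", "c", "d", "e"]
    for window in line_windows(src, size=3, stride=2):
        print(window)
```

Overlapping windows (stride smaller than size) keep a risky span from being split across a chunk boundary, at the cost of scoring some lines twice.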
- Probe head: linear classifier `σ(w·h + b)` with `w ∈ ℝᵈ`, `b ∈ ℝ`, trained on the chosen backbone's hidden states. `weak-probe` ships as `data/probe.npz` (d = 1536, layer 17, ~10 KB).
- Per-CWE head: 5-way one-vs-rest logits at the same layer for CWE-78/338/95/94/328 (`data/probe_per_cwe.npz`).
- Two entry points:
  - `python -m src.scan <repo>`: repo walk, sliding-window chunks, ranked `gemmaforge.leads/v1` JSONL on stdout, `gemmaforge.events/v1` ND-JSON to `--events <path>`. Schemas in `docs/leads-schema.md` and `docs/events.md`.
  - `python -m src.stream_with_probe --serve`: single-snippet streaming SSE UI in `web/index.html`. Max-pool aggregation by default (T3-robust).
- HF Space (`hfspace_gradio/app.py`, Gradio + ZeroGPU) wraps the streaming demo. The older Docker/dashboard Space lives under `archive/deploy/hfspace-dashboard/`.
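
For readers unfamiliar with linear probes, the probe head described above is a logistic score over one hidden-state vector. The sketch below uses plain Python lists and toy 3-dimensional values so it runs without dependencies; in the real pipeline `w` and `b` would come from `data/probe.npz` (whose array names are not documented here, so none are assumed) and `h` would be a 1536-dimensional layer-17 activation.

```python
import math
from typing import Sequence


def probe_score(h: Sequence[float], w: Sequence[float], b: float) -> float:
    """Return sigmoid(w . h + b) for a single hidden-state vector h."""
    if len(h) != len(w):
        raise ValueError(f"dimension mismatch: len(h)={len(h)}, len(w)={len(w)}")
    z = sum(wi * hi for wi, hi in zip(w, h)) + b
    # Numerically stable sigmoid: never exponentiate a large positive number.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


if __name__ == "__main__":
    # Toy values for illustration only; not real probe weights.
    w = [0.5, -1.0, 2.0]
    b = -0.25
    h = [1.0, 0.5, 0.25]
    print(round(probe_score(h, w, b), 4))
```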
Tier 2 (pwnkit#371, merged 2026-05-18):
- `--seed-findings <path|->`: parses our `gemmaforge.leads/v1` JSONL into typed `SeedFinding`s and threads them through the pipeline. Stable wire format.
- `--runtime ollama`: drives Ollama `/api/chat` with `tools[]`; defaults to `gemma4:latest`. Streaming tool-calls ship in pwnkit#372.
- `/live` dashboard: three lanes (Probe / Leads / Hunt), all live-wired via SSE. The Hunt lane subscribes to `pwnkit.events/v1`; click-to-highlight resolves a verdict back to its originating lead.
- Worklist injection. pwnkit#372 converts each `SeedFinding` into a `SemgrepFinding`-shaped record (severity bucketed from probe confidence, `ruleId` prefixed with the source + CWE for visible provenance, full payload preserved in `metadata`) and prepends it to the agent's worklist so the agent sees probe leads first. `--seed-only` honours that by skipping the static-scan pass entirely.
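
The worklist-injection behaviour can be pictured with a small sketch. This is not pwnkit's implementation (which is not written in Python), and every concrete detail here is an assumption made for illustration: the lead fields `score`, `cwe`, and `path`, the `gemmaforge/` rule prefix, and the severity cut-offs (borrowed from the UI colour bands). The actual lead schema lives in `docs/leads-schema.md`.

```python
from typing import Dict, Iterable, List


def seed_to_finding(lead: Dict) -> Dict:
    """Turn one probe lead into a static-finding-shaped record (hypothetical fields)."""
    score = lead["score"]
    # Hypothetical bucketing of probe confidence into a severity label.
    if score >= 0.7:
        severity = "high"
    elif score >= 0.4:
        severity = "medium"
    else:
        severity = "low"
    return {
        "ruleId": f"gemmaforge/{lead['cwe']}",  # source + CWE for provenance
        "severity": severity,
        "path": lead["path"],
        "metadata": dict(lead),  # keep the full original payload
    }


def build_worklist(
    leads: Iterable[Dict], static_findings: Iterable[Dict], seed_only: bool = False
) -> List[Dict]:
    """Put probe leads first; with seed_only, drop the static-scan findings."""
    seeded = [seed_to_finding(lead) for lead in leads]
    return seeded if seed_only else seeded + list(static_findings)


if __name__ == "__main__":
    leads = [{"path": "app/db.js", "cwe": "CWE-89", "score": 0.91}]
    static = [{"ruleId": "semgrep/example", "severity": "low", "path": "app/x.js"}]
    print(build_worklist(leads, static))
```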
- Main Track: a novel two-tier architecture that uses the Gemma 4 family across model sizes in a single product: `weak-probe` (E2B) on a laptop or `strong-probe` (larger Gemma) on a workstation for whole-repo triage; the `pwnkit verifier` (Gemma 4 8B + tools) for deep agentic corroboration.
- Impact / Safety & Trust: Tier 1 peers literally inside the model (hidden-state introspection); Tier 2 produces auditable, tool-grounded verdicts a human can re-check. Both halves are explainability in two complementary senses.
- Special Tech / Ollama: Tier 2's agentic verifier runs on `gemma4:latest` (Gemma 4 8B Q4_K_M) via Ollama `/api/chat` with `tools[]`. Tier 1's probe variants both run locally. Zero data leaves the laptop.
Doruk Tan Ozturk — @peaktwilight · doruk.ch · security researcher, ex-ETH Zürich, building 0sec.ai. Disclosed CVE-2026-33896 in node-forge and additional CVE-grade vulnerabilities across widely-deployed npm packages; the public patches inform our training set. Built Tier 2 — the pwnkit agentic verifier and its GemmaForge integration PR.
Mehmet Efe Akça — MSc Data Science @ ETH Zürich · TA for Mathematics of Signals, Networks, and Learning. Coursework: Reliable and Trustworthy AI: interpretability & verifiability, Probabilistic AI, Large-Scale AI. Built Tier 1 — the linear-probe path: dataset, activation capture, training, evaluation, the per-CWE multi-head, and the streaming/scan entry points. The interpretability technique sits directly in his area of study.
Doruk alone gives you a scanner; Efe alone gives you a probe. The whole point of GemmaForge is the two together — a probe that knows about danger, handing off to an agent that can prove it.
Setup (one-time, both demos share it):
git clone https://github.com/peaktwilight/gemmaforge && cd gemmaforge
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

Tier 1, single-snippet streaming UI (probe only, no GPU required):
python -m src.stream_with_probe --serve # loads shipped data/probe.npz
# open http://localhost:8787/

Or visit the hosted Space: peaktwilight/gemmaforge.
Tier 1 + 2 — whole-repo audit (requires Ollama with gemma4:latest pulled):
# 1a. weak-probe (default) — laptop-class, shipped probe weights
python -m src.scan ./your-repo \
--top-k 25 --languages py,js \
--events live.ndjson > leads.jsonl
# 1b. strong-probe (workstation-class) — pass a larger Gemma backbone;
# the probe head must be re-trained against that backbone first.
python -m src.scan ./your-repo \
--model-id google/gemma-4-9B-it \
--probe-path data/probe.strong.npz \
--top-k 25 --languages py,js \
--events live.ndjson > leads.jsonl
# 2. hand the leads to pwnkit's agentic auditor driving Gemma 4 8B locally
# (install pwnkit from https://github.com/0sec-labs/pwnkit)
pwnkit review ./your-repo \
--seed-findings leads.jsonl \
--seed-only \
  --runtime ollama

The pwnkit `/live` dashboard renders all three lanes live (Probe / Leads / Hunt). See pwnkit#371 for the CLI + runtime + dashboard scaffold and pwnkit#372 for worklist injection, streaming Ollama tool-calls, and live hunt-lane wiring. With `--seed-only`, the probe's leads are the agent's only worklist; without it, they are prepended to the static-scan findings.
The probe doesn't have to stop at "this looks risky" — it can hand findings to an agentic hunter that actually exploits them. Today GemmaForge ships a first-party integration with pwnkit, an open-source agentic-AI pentest framework:
# 1. scan a repo with the probe → ND-JSON leads (gemmaforge.leads/v1)
python -m src.scan examples/vuln-express-app --top-k 10 > /tmp/leads.jsonl
# 2. hand the leads to pwnkit; agent loop runs on local Gemma 4 8B via Ollama
pwnkit-cli review examples/vuln-express-app \
--seed-findings /tmp/leads.jsonl \
--runtime ollama \
  --model gemma4:latest

Same Gemma family, two roles: small E2B probe finds, Gemma 4 8B agent hunts. Zero cloud spend, zero data leaving the laptop. Schema contracts at `docs/leads-schema.md` and `docs/events.md`. Pre-computed `examples/leads.jsonl` is shipped so judges without local Gemma weights can run step 2 alone. The pwnkit side landed in PR #371 and PR #372.
# 0. one-time: build the merged CyberSecEval + SVEN dataset
python scripts/build_dataset.py # writes data/pairs_v2.jsonl
# 1. cache Gemma 4 hidden states (layers 8/17/26/34), ~10 min M1 Max / ~5 min 3090
python -m src.extract_activations
# 2. AUC 0.907 cross-language, 0.963 in-distribution (~90s, writes data/probe.npz)
python -m src.train_probe
# 3. AUC 0.940 stacked headline — MLP × per-layer ensemble, Py→JS
python scripts/eval_mlp_ensemble.py # data/probe_mlp_ensemble_xlang.json
# 4. 22× T3 robustness via max-pool aggregation
python scripts/eval_maxpool_full.py # data/eval_maxpool_full.json
python scripts/adversarial_full_suite.py # data/adversarial_full_suite.json
# 5. LoRA base=3 → LoRA=0 vuln-pattern delta
python scripts/train_lora.py && python scripts/eval_lora.py   # data/lora_eval.json

Every number in RESULTS.md is regenerated from these scripts; floating-point drift in the 4th decimal is expected.
To train and publish the probe from a fresh GPU Colab runtime, open
experiments/notebooks/training/colab_train_gemma4_probe.ipynb.
The notebook clones this repo, installs the Hugging Face stack, builds the dataset,
trains/evaluates a frozen-model token-level value head on a Gemma 4 decoder layer,
and uploads the resulting bundle to Hugging Face Hub.
Before running it, add a Colab Secret named HF_TOKEN or HF_WRITE_TOKEN with write
access to the destination model repo, then update HF_REPO_ID in the first settings
cell. GITHUB_TOKEN is optional (only needed for private REPO_URL values).
The notebook supports the rebuilt SVEN before/after dataset format from
scripts/build_dataset.py. Leave DATASET_PATH = None to build data/dataset.jsonl,
or point it at an existing JSONL file with rows like:
{"code": "db.query(`SELECT * FROM users WHERE id=${id}`)", "label": 1, "token_labels": {"sink": [[0, 8]], "source": [[38, 40]], "sanitizer": [], "evidence": [[0, 40]], "vulnerable_line": [[0, 40]]}}

Rows with `token_labels` train on the annotated token activations. Positive ranges
come from `evidence`, `vulnerable_line`, `sink`, and `source`; `sanitizer` ranges are
treated as negative. Safe after-patch rows from the rebuilt builder have empty
`token_labels` and train as negative examples.
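
The labelling rule above can be sketched as a small function that expands span ranges into per-character 0/1 labels. Two things are assumptions rather than documented behaviour: that ranges are half-open character offsets `[start, end)`, and that a `sanitizer` range overrides an overlapping positive range. Mapping characters onto model tokens (for example via a tokenizer's offset mapping) is left out.

```python
from typing import Dict, List

POSITIVE_KEYS = ("evidence", "vulnerable_line", "sink", "source")


def char_labels(row: Dict) -> List[int]:
    """Expand a row's token_labels spans into one 0/1 label per character of row['code']."""
    n = len(row["code"])
    labels = [0] * n
    spans = row.get("token_labels") or {}  # empty/missing => all-negative row
    for key in POSITIVE_KEYS:
        for start, end in spans.get(key, []):
            for i in range(max(start, 0), min(end, n)):
                labels[i] = 1
    # Assumption: sanitizer spans win over overlapping positive spans.
    for start, end in spans.get("sanitizer", []):
        for i in range(max(start, 0), min(end, n)):
            labels[i] = 0
    return labels


if __name__ == "__main__":
    row = {
        "code": "abcdefghij",
        "label": 1,
        "token_labels": {"sink": [[0, 3]], "source": [[5, 7]], "sanitizer": [[6, 8]]},
    }
    print(char_labels(row))
```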
The streaming web UI uses the Server-Sent-Events feed at /stream?prompt=…. The CLI mode (--prompt "…") renders the same risk-coloured stream in your terminal with ANSI colour codes.
| Path | What lives here |
|---|---|
| `src/` | Probe core: activation extraction, train (`train_probe.py`, `train_probe_spanmax.py`), streaming inference + SSE server (`stream_with_probe.py`), calibration, attention-probe variant. |
| `data/` | Datasets (`pairs_v2.jsonl`, `pairs_sven.jsonl`, `dataset.jsonl`), cached activations (`activations_v2/`), shipped probe weights (`probe.npz`), every eval JSON cited in RESULTS.md. |
| `scripts/` | Pipeline runners: `build_dataset.py`, `eval_*.py` (maxpool, per-CWE, ensemble, MLP-ensemble, length-matched, LoRA), `train_lora.py`, `train_value_head.py`, `adversarial_full_suite.py`, `make_demo_grid.py`, `record_demo.py`, `deploy_hf.sh`. |
| `experiments/notebooks/` | Colab/Kaggle training notebooks and report/eval companion notebooks, grouped under `training/`, `remote/`, and `reports/`. |
| `docs/` | `coverage.md` (per-language probe coverage), `attention_probe_notes.md`, `findings_writeup_snippet.md`. |
| `web/` | Single-file streaming UI (`index.html`, vanilla JS + SSE), cover art (`cover.svg`), demo screenshots (`screenshot-demo.png`, `demo-grid.png`), result charts (`auc-ladder.png`, `data-efficiency-curve.png`). |
| `hfspace_gradio/` | Current HuggingFace Space deployment (Gradio + ZeroGPU app, requirements, static viewer assets). |
| `archive/` | Submission/video/process history and the older Docker/dashboard HF Space. |
Top-level docs: WRITEUP.md (Kaggle submission, ~1400 words), RESULTS.md (full metrics dashboard). Submission checklist, video script, voiceover, launch post, and recordings live under archive/submission/.
How does AUC scale with training size? Subsample to {5, 10, 20, 30, 50, 75, 100}% × 5 seeds, fit LogisticRegression(C=1.0), eval on the fixed 240-example held-out (data/data_efficiency.json):
- Random AUC crosses 0.9 at ~10–20% of the dataset (n ≈ 100–200).
- Cross-language (Py→JS) AUC peaks at 0.912 at n=525 and ticks back down to 0.907 at full → the curve flattens, so more Python data alone probably won't close the gap to in-distribution. The remaining 0.04 lives in the representation, not the training-set size. This is what motivates the token-level path.
See web/data-efficiency-curve.png (rendered by scripts/plot_data_efficiency.py).
Every credible OOD axis is reported. Splits are constructed by scripts/eval_splits.py and stored at data/eval/report.md:
- `random_stratified`: 80/20 stratified on label. Leaky because positive/negative halves of a pair share `_origin_repo`; reported as the in-distribution ceiling, not the headline.
- `group_repo`: `GroupShuffleSplit` on `_origin_repo`. The same repository never appears in both train and test. Gap to random is only 0.005 AUC → leakage was real but small.
- `heldout_cwe::CWE-N`: train on all CWEs except N, test only on N (with positives and negatives for that CWE both moving to the test side). Tests "can the probe recognise a vulnerability class it has never seen?"
- `heldout_lang::test=javascript`: train on the full Python subset, test on the full JavaScript subset. Languages are disjoint by construction; this is the headline OOD number (0.907).
- `heldout_lang::test=python`: symmetric reverse split (0.929).
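
The idea behind the `group_repo` split is that whole repositories, not individual rows, are assigned to train or test. The project uses scikit-learn's `GroupShuffleSplit` for this; the stdlib-only sketch below is a stand-in that shows the same principle, with a hypothetical helper name `group_split` and toy rows.

```python
import random
from typing import Dict, List, Tuple


def group_split(
    rows: List[Dict], group_key: str, test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[Dict], List[Dict]]:
    """Split rows so that no group value appears on both sides."""
    groups = sorted({row[group_key] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [row for row in rows if row[group_key] not in test_groups]
    test = [row for row in rows if row[group_key] in test_groups]
    return train, test


if __name__ == "__main__":
    # Toy data: five repos, one vulnerable/safe pair each.
    rows = [{"_origin_repo": f"repo{i // 2}", "label": i % 2} for i in range(10)]
    train, test = group_split(rows, "_origin_repo")
    print(len(train), len(test))
```

Because both halves of a vulnerable/safe pair share `_origin_repo`, splitting by group keeps each pair entirely on one side, which is what removes the leakage present in `random_stratified`.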
Confidence intervals are 95% bootstrap on the per-test predictions. R@5%FPR / R@10%FPR / Brier / ECE all reported in data/eval/report.md for every split.
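
As a rough illustration of how a bootstrap interval on AUC is obtained, the sketch below computes AUC by pairwise comparison and resamples the test predictions with replacement. It is a simplified percentile bootstrap under assumed defaults (`n_boot`, `seed`, and the function names are hypothetical); `scripts/eval_splits.py` may differ in its details.

```python
import random
from typing import List, Sequence, Tuple


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_ci(
    labels: Sequence[int], scores: Sequence[float],
    n_boot: int = 1000, alpha: float = 0.05, seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for AUC over resampled test predictions."""
    auc(labels, scores)  # validates that both classes are present
    rng = random.Random(seed)
    n = len(labels)
    stats: List[float] = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:  # resample lost a class; draw again
            continue
        stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


if __name__ == "__main__":
    y = [0, 0, 1, 1, 0, 1]
    s = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]
    print(auc(y, s), bootstrap_ci(y, s, n_boot=200))
```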
Five logits on layer 17, one-vs-rest, same dataset (data/probe_per_cwe_card.json). Top-1 routing accuracy across the five heads is 94.7% — when the binary probe says "vulnerable," the right per-CWE head is the strongest one ~95 times out of 100.
| CWE | AUC | ACC | n_pos |
|---|---|---|---|
| CWE-78 (OS command injection) | 0.979 | 0.949 | 82 |
| CWE-338 (weak PRNG) | 0.990 | 0.971 | 79 |
| CWE-95 (eval injection) | 0.926 | 0.919 | 78 |
| CWE-94 (code injection) | 0.988 | 0.962 | 65 |
| CWE-328 (weak hash) | 0.981 | 0.947 | 55 |
| Macro | 0.9726 | — | — |
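
Top-1 routing accuracy, as reported for the five heads, can be read as "how often is the true CWE's head the largest logit". A minimal sketch follows, assuming each example is a pair of the true CWE and a mapping from head name to logit; the function names and the toy logits are hypothetical.

```python
from typing import Dict, List, Tuple

CWE_HEADS = ("CWE-78", "CWE-338", "CWE-95", "CWE-94", "CWE-328")


def route(logits: Dict[str, float]) -> str:
    """Return the CWE whose one-vs-rest head has the largest logit."""
    return max(CWE_HEADS, key=lambda cwe: logits[cwe])


def routing_accuracy(examples: List[Tuple[str, Dict[str, float]]]) -> float:
    """Fraction of examples where the strongest head matches the true CWE."""
    if not examples:
        raise ValueError("need at least one example")
    hits = sum(1 for true_cwe, logits in examples if route(logits) == true_cwe)
    return hits / len(examples)


if __name__ == "__main__":
    # Toy logits for illustration only.
    examples = [
        ("CWE-78", {"CWE-78": 2.1, "CWE-338": -1.0, "CWE-95": 0.3, "CWE-94": 0.0, "CWE-328": -2.0}),
        ("CWE-95", {"CWE-78": 0.2, "CWE-338": -0.5, "CWE-95": 0.1, "CWE-94": 1.4, "CWE-328": -1.0}),
    ]
    print(routing_accuracy(examples))
```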
Even when the binary probe has never seen any example of the held-out CWE class, it still scores 0.92+ AUC on it across the five most-represented CWEs — the strongest evidence we have that the signal is "vulnerable-shaped representation" rather than memorised lexical surface.
This is the load-bearing argument. No closed model and no open non-Gemma model gives all of:
- Raw hidden states at every layer: required for Tier 1. GPT-5 / Claude don't expose them. Open-weight Llama / Qwen do, but only Gemma 4 gives you this plus the rest below.
- A model-size family: three slots, one architecture. `weak-probe` (E2B), `strong-probe` (larger Gemma), and the `pwnkit verifier` (Gemma 4 8B) share the same vocabulary, the same hidden-state geometry conventions, and the same tool schema. The same `gemmaforge.leads/v1` schema flows between stages. Mixing a Llama probe with a Qwen verifier would mean two licenses, two prompt formats, and no guarantee the small model's notion of "vulnerable-looking" state maps to anything the big model's reasoning prior cares about.
- Native function-calling + 131k context: required for the verifier. Gemma 4 8B reads files, greps call sites, and asks follow-up questions inside a 131k-token window.
- Apache-2.0–licensed project code; Gemma weights under the Gemma Terms of Use (open weights, commercial use permitted). One legal review, not two.
- Laptop-sized at the weak end, workstation-sized at the strong end. E2B (`weak-probe`) comfortably fits on any modern MacBook; the 8B verifier (Q4_K_M via Ollama) fits in ~10GB and runs on most laptop GPUs and M-series Macs; `strong-probe` scales up to a 24GB-class GPU when richer context is desired. The user picks their slot by hardware; the pipeline is the same. No cloud step is forced anywhere.
Tracked as open GitHub issues. Highlights:
- Token-level probe head-to-head vs sample-level. Both trainers are live (`experiments/notebooks/training/colab_train_gemma4_probe.py`, `scripts/train_value_head.py`) against `data/dataset.jsonl` with per-token spans. The eval decides whether token-level replaces sample-level as the production probe.
- `strong-probe` validation run. Train the linear head against a larger Gemma checkpoint's hidden states and publish the same metric table for it.
- Full DPO LoRA on dedicated GPU. `scripts/train_lora.py` already supports `--max_steps 500`; we need a clean CUDA box. Stretch goal: re-train the probe jointly with the LoRA, the way Obeso §3 does: the probe gradient gets a stop-grad on the base, the LoRA gradient flows.
- More languages. Rust + Go via CyberSecEval-2; Ruby via RubySec advisory DB; PHP via SARD/Juliet. The activation-extraction pipeline is the bottleneck, not the probe.
- VS Code extension (scaffolded in `extensions/vscode/`) that reads the same SSE stream.
- Obeso, Arditi et al. (2025). Real-time hallucination detection with linear probes on hidden states. arXiv:2509.03531.
- Bailey et al. (ICLR 2026). Surface-form robustness of activation probes. arXiv:2412.09565.
- RL-Obfuscation (2025). arXiv:2506.14261.
- CyberSecEval-2 (Meta AI, 2024); SVEN (He & Vechev, 2023); CWE Top-25 (MITRE).
- This project's code, probe weights, and training-data manifest: Apache-2.0, all in the repo.
- Gemma 4 model weights are used under the Gemma Terms of Use (open weights, commercial use permitted).
- Training data is derived from publicly disclosed CVEs and the published CyberSecEval / SVEN benchmarks; no private or embargoed vulnerability information was used.
- No user code leaves the laptop at inference time.


