Hert4/aletheia

aletheia (ἀλήθεια)

Python MLX License: PolyForm NC 1.0.0

ἀλήθεια: unconcealment, the unforgotten. Bringing the hidden truth of a RAG answer to light.

Benchmark platform for inference-time hallucination detection methods on Vietnamese RAG. Target: Qwen3-4B-class quantized models (MLX on Mac, vLLM on Linux). Each method is ported from its paper, evaluated against the same Vietnamese hard split, and ranked by a single rubric so we know what actually works.

Status

| Method | Source | Latency | AUROC (hard) | Verdict |
|---|---|---|---|---|
| PCC τ (certainty) | arXiv:2601.02574 | ~0.2s | 0.807 | champion |
| PCC γ (consistency, K=2) | same | ~6s | 0.74 | opt-in escalation |
| NLL | baseline | ~0s | 0.71 | free signal |
| WEPR | arXiv:2509.04492 | ~0.08ms (LR) | 0.69 | SKIP (overfit, n_train=60) |
| BSE K=1 | arXiv:2504.03579 | ~0.3s | 0.65 | SKIP (CSD collapse) |
| Latent_Audit d (v5 probe) | arXiv:2604.05358 | ~free (hook) | 0.49–0.76 | SKIP (intrinsic ≈ chance) |

Tested on ragbench MISA admin-procedures (Vietnamese, Qwen3-4B-4bit). Hard split = subtle 1-fact corruptions in retrieved context.

Pattern observed: every logprob-only method we ported lost to plain NLL (0.71 AUROC): WEPR by 0.02 and BSE by 0.06. On Qwen3-4B-4bit with the Vietnamese admin domain, the logprob signal appears to carry little information beyond NLL. PCC's verdict-prompt approach remains the only inference-time method to materially beat NLL.
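For context on the NLL baseline, here is a minimal sketch of how such a score can be derived from per-token logprobs. It assumes a length-normalized mean; the repository may define NLL differently (for example as a sum), and `mean_nll` is a hypothetical name, not a function from lib/.

```python
def mean_nll(token_logprobs):
    """Length-normalized negative log-likelihood of a generated answer.

    Assumption: the score is the mean over tokens; higher means the model
    was less certain while generating, so the answer is more suspect.
    """
    if not token_logprobs:
        raise ValueError("need at least one token logprob")
    return -sum(token_logprobs) / len(token_logprobs)


# Illustrative (made-up) logprobs for two generated answers.
confident = [-0.05, -0.10, -0.02]
hesitant = [-1.20, -0.90, -2.30]

print(mean_nll(confident) < mean_nll(hesitant))  # the hesitant answer scores higher
```

Because the logprobs are already produced during generation, a score like this adds essentially no latency, which matches the "~0s" entry in the status table.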

Per-method docs: docs/methods/. Detailed results & decisions: notes/.

Layout

.
├── lib/                    importable modules
│   ├── pcc.py              PCC certainty + consistency  (arXiv:2601.02574)
│   ├── bse.py              Bayesian Semantic Entropy   (arXiv:2504.03579)
│   ├── wepr.py             Weighted Entropy Production Rate (arXiv:2509.04492)
│   ├── audit.py            Latent_Audit Mahalanobis    (arXiv:2604.05358)
│   ├── latentaudit_v5.py   v5 probe (paper-faithful, leak-free)
│   ├── clustering.py       Qwen-as-judge entailment clustering
│   ├── fusion_eval.py      multi-signal evaluator
│   └── util.py             shared generation helpers (top-K logprobs, CACHE)
│
├── data/                   datasets + source adapter
│   ├── examples.py         100 VN hand-crafted (faithful + hallucinated)
│   ├── adapter.py          source dispatcher: data | ragbench
│   ├── realistic_hallu.json
│   └── generate_real_hallu.py
│
├── scripts/                runners (entrypoints — invoke from project root)
│   ├── run_pcc.py, run_bse.py, run_wepr.py
│   ├── bootstrap_bse.py    multi-sample cal for BSE CSD
│   ├── run_audit.py, run_v5.py, run_ragbench.py, run_fusion.py
│   └── analyze.py, diagnostic.py, threshold_tune.py
│
├── docs/methods/           per-method user-facing docs
│   ├── pcc.md, bse.md, wepr.md, latent_audit.md
│   └── README.md (index)
│
├── notes/                  design docs + per-method results
│   ├── bse-integration-plan.md, bse-results-2026-06-01.md
│   ├── wepr-results-2026-06-02.md
│   ├── partial-support-bench-plan.md
│   └── TODO-*.md
│
├── linux_deploy/           vLLM + Docker plugin (production handoff)
├── experiments/            one-off probes
├── bse_cache/              runtime artifacts (pickles, eval JSON)
├── quickstart.py           MLX building blocks demo
└── pyproject.toml, uv.lock, ...

Run

uv sync
uv run quickstart.py                        # smoke: model loads, top-K logprobs work
uv run scripts/run_pcc.py                   # PCC champion baseline
uv run scripts/run_bse.py --csd bse_cache/csd_*.pkl    # BSE K=1
uv run scripts/run_wepr.py                  # WEPR train+eval

Requires an Apple Silicon (M-series) Mac and Python ≥ 3.11. The model downloads to ~/.cache/huggingface/.

Methodology

All methods are evaluated under the same rubric, with results reported in notes/{method}-results-*.md:

  1. Generate Qwen's answer for each (context, question) in test split
  2. Score generated answer via the detection method (returns scalar uncertainty / hallu probability)
  3. Label is_hallucination = NOT (Qwen-as-judge confirms answer matches gold)
  4. AUROC over scores + labels → main ranking metric
  5. Decision rubric per method's plan doc: < 0.75 SKIP, 0.75–0.85 COMPLEMENT (fuse with PCC τ), > 0.85 REPLACE champion
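Steps 4 and 5 can be sketched as follows. This is an illustrative, dependency-free version, not the code in scripts/; `auroc` and `verdict` are hypothetical names, and the handling of the exact boundary values follows the thresholds as written above (0.75 and 0.85 both fall in COMPLEMENT).

```python
def auroc(scores, labels):
    """Probability that a random hallucinated answer gets a higher score
    than a random faithful one (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one example of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def verdict(auroc_value):
    """Apply the decision rubric: < 0.75 SKIP, 0.75-0.85 COMPLEMENT, > 0.85 REPLACE."""
    if auroc_value < 0.75:
        return "SKIP"
    if auroc_value <= 0.85:
        return "COMPLEMENT"
    return "REPLACE"


# Made-up scores; label 1 = is_hallucination, 0 = faithful.
scores = [0.9, 0.4, 0.3, 0.2, 0.6]
labels = [1, 1, 0, 0, 0]
print(auroc(scores, labels))  # 5 of 6 (hallucinated, faithful) pairs are ranked correctly
print(verdict(0.69))          # SKIP
```

The pairwise form is quadratic in the number of examples, which is fine at the test-set sizes reported here; a rank-based implementation would be preferable for larger splits.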

Methods are evaluated on real-world Vietnamese data (ragbench MISA admin procedures). The content is novel to Qwen3-4B, so the model cannot recall answers from training and has to ground them in the retrieved context.

Design notes

  • Inference-time only, no fine-tuning — methods that need offline probe-training on labeled activations are disqualified upfront (Latent_Audit's own v5 probe is documented as such; we keep it as a study but it's not the ship target).
  • Black-box preferred — logprob-API methods are more portable across models than white-box hidden-state methods.
  • Sample-size honesty — every report notes n_test (typically 18–57) and flags small-sample noise.
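To make the small-sample caveat concrete, one option is a percentile bootstrap interval around the AUROC. The sketch below is an assumption about how that could be done, not the procedure used in notes/; the function names and the synthetic data (n = 18) are illustrative only.

```python
import random


def auroc(scores, labels):
    """Pairwise AUROC; ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def bootstrap_auroc_ci(scores, labels, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for AUROC."""
    if all(labels) or not any(labels):
        raise ValueError("need both classes to bootstrap AUROC")
    rng = random.Random(seed)
    n = len(scores)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if all(ys) or not any(ys):
            continue  # this resample lost one class, so AUROC is undefined
        stats.append(auroc([scores[i] for i in idx], ys))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


# Synthetic scores and labels (1 = hallucinated), sized like a small test split.
scores = [0.91, 0.85, 0.40, 0.77, 0.30, 0.66, 0.20, 0.55, 0.15,
          0.48, 0.10, 0.35, 0.62, 0.25, 0.70, 0.05, 0.45, 0.58]
labels = [1, 1, 1, 1, 0, 1, 0, 0, 0,
          1, 0, 0, 0, 0, 1, 0, 0, 1]

point = auroc(scores, labels)
lo, hi = bootstrap_auroc_ci(scores, labels)
print(f"AUROC {point:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

With n_test in the 18–57 range the interval is typically wide, so a gap of a few hundredths of AUROC between two methods may not be distinguishable from noise.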

Citation

@software{hert4_aletheia_2026,
  author = {Hert4},
  title  = {aletheia: benchmark platform for inference-time hallucination
            detection on Vietnamese RAG},
  year   = {2026},
  url    = {https://github.com/Hert4/aletheia}
}

If you use a method's algorithm, please also cite its source paper (linked in the status table above and in notes/).

License

Licensed under the PolyForm Noncommercial License 1.0.0 — see LICENSE.

  • Free for noncommercial use — research, education, personal study, non-profit/academic work.
  • 🔒 Commercial / business use is not granted by default — contact the author (ductransa01@gmail.com) for written permission / a commercial license.
  • 📌 Citation required — any use that feeds a publication, model, product, or other public artifact must cite this repo (see Citation above).
