ἀλήθεια: unconcealment, the unforgotten. Bringing the hidden truth of a RAG answer to light.
Benchmark platform for inference-time hallucination detection methods on Vietnamese RAG. Target: Qwen3-4B-class quantized models (MLX on Mac, vLLM on Linux). Each method is ported from its paper, evaluated against the same Vietnamese hard split, and ranked by a single rubric so we know what actually works.
| Method | Source | Latency | AUROC (hard) | Verdict |
|---|---|---|---|---|
| PCC τ (certainty) | arXiv:2601.02574 | ~0.2s | 0.807 | champion |
| PCC γ (consistency, K=2) | same | ~6s | 0.74 | opt-in escalation |
| NLL baseline | — | ~0s | 0.71 | free signal |
| WEPR | arXiv:2509.04492 | ~0.08ms (LR) | 0.69 | SKIP (overfit, n_train=60) |
| BSE K=1 | arXiv:2504.03579 | ~0.3s | 0.65 | SKIP (CSD collapse) |
| Latent_Audit d (v5 probe) | arXiv:2604.05358 | ~free (hook) | 0.49–0.76 | SKIP (intrinsic ≈chance) |
Tested on ragbench MISA admin-procedures (Vietnamese, Qwen3-4B-4bit). Hard split = subtle 1-fact corruptions in retrieved context.
Pattern observed: every logprob-only method we ported (BSE, WEPR) lost to plain NLL, by 0.02 (WEPR) to 0.06 (BSE) AUROC. On Qwen3-4B-4bit and the Vietnamese admin domain, the logprob signal carries little information beyond NLL. PCC's verdict-prompt approach remains the only inference-time method to materially beat NLL.
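For orientation, here is a minimal sketch of what an NLL-style score can look like. This README does not define the baseline precisely, so the sketch assumes it is the length-normalized negative log-likelihood of the generated tokens; `mean_nll` and the sample logprobs are hypothetical, not taken from lib/.

```python
import math
from typing import Sequence


def mean_nll(token_logprobs: Sequence[float]) -> float:
    """Length-normalized negative log-likelihood of one generated answer.

    Assumption: token_logprobs holds the log-probability the model assigned
    to each token it actually generated. Higher return value = less certain.
    """
    if not token_logprobs:
        raise ValueError("need at least one token logprob")
    return -sum(token_logprobs) / len(token_logprobs)


# Hypothetical per-token logprobs for two answers (illustrative numbers only).
confident = [math.log(0.9), math.log(0.8), math.log(0.95)]
hesitant = [math.log(0.4), math.log(0.3), math.log(0.5)]

print(mean_nll(confident) < mean_nll(hesitant))  # the hesitant answer scores higher
```

Because the logprobs are already produced during generation, a score of this kind adds no extra model calls, which is consistent with the "~0s" latency listed for the baseline.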
Per-method docs: docs/methods/. Detailed results & decisions: notes/.
.
├── lib/ importable modules
│ ├── pcc.py PCC certainty + consistency (arXiv:2601.02574)
│ ├── bse.py Bayesian Semantic Entropy (arXiv:2504.03579)
│ ├── wepr.py Weighted Entropy Production Rate (arXiv:2509.04492)
│ ├── audit.py Latent_Audit Mahalanobis (arXiv:2604.05358)
│ ├── latentaudit_v5.py v5 probe (paper-faithful, leak-free)
│ ├── clustering.py Qwen-as-judge entailment clustering
│ ├── fusion_eval.py multi-signal evaluator
│ └── util.py shared generation helpers (top-K logprobs, CACHE)
│
├── data/ datasets + source adapter
│ ├── examples.py 100 VN hand-crafted (faithful + hallucinated)
│ ├── adapter.py source dispatcher: data | ragbench
│ ├── realistic_hallu.json
│ └── generate_real_hallu.py
│
├── scripts/ runners (entrypoints — invoke from project root)
│ ├── run_pcc.py, run_bse.py, run_wepr.py
│ ├── bootstrap_bse.py multi-sample cal for BSE CSD
│ ├── run_audit.py, run_v5.py, run_ragbench.py, run_fusion.py
│ └── analyze.py, diagnostic.py, threshold_tune.py
│
├── docs/methods/ per-method user-facing docs
│ ├── pcc.md, bse.md, wepr.md, latent_audit.md
│ └── README.md (index)
│
├── notes/ design docs + per-method results
│ ├── bse-integration-plan.md, bse-results-2026-06-01.md
│ ├── wepr-results-2026-06-02.md
│ ├── partial-support-bench-plan.md
│ └── TODO-*.md
│
├── linux_deploy/ vLLM + Docker plugin (production handoff)
├── experiments/ one-off probes
├── bse_cache/ runtime artifacts (pickles, eval JSON)
├── quickstart.py MLX building blocks demo
└── pyproject.toml, uv.lock, ...
uv sync
uv run quickstart.py # smoke: model loads, top-K logprobs work
uv run scripts/run_pcc.py # PCC champion baseline
uv run scripts/run_bse.py --csd bse_cache/csd_*.pkl # BSE K=1
uv run scripts/run_wepr.py # WEPR train+eval

Requires Mac M-series, Python ≥ 3.11. Model downloads to ~/.cache/huggingface/.
All methods evaluated under the same rubric, reported in notes/{method}-results-*.md:
- Generate Qwen's answer for each (context, question) in test split
- Score generated answer via the detection method (returns scalar uncertainty / hallu probability)
- Label is_hallucination = NOT (Qwen-as-judge confirms answer matches gold)
- AUROC over scores + labels → main ranking metric
- Decision rubric per method's plan doc: < 0.75 SKIP, 0.75–0.85 COMPLEMENT (fuse with PCC τ), > 0.85 REPLACE champion
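
The scoring and decision steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the code in scripts/: `auroc` and `rubric` are hypothetical names, the scores and labels are made up, and an AUROC of exactly 0.75 or 0.85 is assumed to fall in the COMPLEMENT band.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """AUROC as the probability that a hallucinated sample outscores a
    faithful one (ties count half). labels[i] is True for a hallucination."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def rubric(auc: float) -> str:
    """Map an AUROC to the decision bands listed above."""
    if auc < 0.75:
        return "SKIP"
    if auc <= 0.85:
        return "COMPLEMENT"  # fuse with PCC tau
    return "REPLACE"


# Made-up detector scores; True marks answers the judge rejected.
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [True, True, True, False, False, False]

auc = auroc(scores, labels)
print(f"{auc:.3f}", rubric(auc))  # 0.889 REPLACE
print(rubric(0.69))               # SKIP
```

The pairwise form is quadratic in the number of samples, which is acceptable at the test-set sizes reported here; a rank-based implementation would be the usual choice for larger splits.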
Methods are evaluated on real-world Vietnamese data (ragbench MISA admin procedures). The content is novel to Qwen3-4B (it cannot be recalled from training), which forces real grounding.
- Inference-time only, no fine-tuning — methods that need offline probe-training on labeled activations are disqualified upfront (Latent_Audit's own v5 probe is documented as such; we keep it as a study but it's not the ship target).
- Black-box preferred — logprob-API methods are more portable across models than white-box hidden-state methods.
- Sample-size honesty: every report notes n_test (typically 18–57) and flags small-sample noise.
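
To make the small-sample caveat concrete, one option is a percentile bootstrap interval around the AUROC. The sketch below is an assumption about how such noise could be quantified, not a description of what the reports do; `bootstrap_auroc_ci` and the 18 synthetic samples are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def auroc(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """Pairwise AUROC: P(hallucinated score > faithful score), ties count half."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auroc_ci(
    scores: Sequence[float],
    labels: Sequence[bool],
    n_boot: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for AUROC."""
    rng = random.Random(seed)
    n = len(scores)
    stats: List[float] = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if all(ys) or not any(ys):
            continue  # resample lost one class, so AUROC is undefined
        stats.append(auroc([scores[i] for i in idx], ys))
    if not stats:
        raise ValueError("no usable bootstrap resamples")
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lo, hi


# Synthetic n_test = 18: nine hallucinated, nine faithful (illustrative only).
scores = [0.9, 0.8, 0.75, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3,
          0.65, 0.58, 0.45, 0.42, 0.35, 0.25, 0.2, 0.15, 0.1]
labels = [True] * 9 + [False] * 9

point = auroc(scores, labels)
lo, hi = bootstrap_auroc_ci(scores, labels)
print(f"AUROC={point:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

At this sample size the interval is typically wide enough to straddle more than one rubric band, which is the kind of noise the reports flag.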
@software{hert4_aletheia_2026,
author = {Hert4},
title = {aletheia: benchmark platform for inference-time hallucination
detection on Vietnamese RAG},
year = {2026},
url = {https://github.com/Hert4/aletheia}
}

If you use a method's algorithm, please also cite its source paper (linked in the status table above and in notes/).
Licensed under the PolyForm Noncommercial License 1.0.0 — see LICENSE.
- ✅ Free for noncommercial use — research, education, personal study, non-profit/academic work.
- 🔒 Commercial / business use is not granted by default — contact the author (ductransa01@gmail.com) for written permission / a commercial license.
- 📌 Citation required — any use that feeds a publication, model, product, or other public artifact must cite this repo (see Citation above).