Replies: 4 comments
-
Hi @terrywerk! I've converted this issue into a Discussion, as it's a great topic for community input. If you are looking at how to measure hallucination rates, we have several resources that talk about this:
Let me know if you have questions about any of these resources!
-
From my point of view, one global hallucination rate is usually less useful than a small set of failure classes. Unsupported claims, retrieval misses, bad transformations, and missing abstentions all look similar from far away but need very different fixes. What tends to work in production is a human-labeled evaluation slice for calibration plus online sampling that measures grounding, retrieval coverage, and abstention behavior separately.
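To make the per-class idea concrete, here's a minimal sketch of computing separate rates from a labeled eval slice. The class names and labels are illustrative, not a standard taxonomy:

```python
from collections import Counter

# Hypothetical labeled eval slice: each sample tagged with one failure
# class, or "ok" if the output was fine. Labels are illustrative.
labels = [
    "ok", "unsupported_claim", "ok", "retrieval_miss", "ok",
    "unsupported_claim", "bad_transformation", "ok", "missing_abstention", "ok",
]

def failure_rates(labels):
    """Per-class failure rates instead of one global hallucination rate."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items() if cls != "ok"}

rates = failure_rates(labels)
# Each class gets its own rate, so each fix can be targeted and tracked
# separately instead of chasing a single blended number.
```

The payoff is in the trend lines: a retrieval fix should move `retrieval_miss` without touching `missing_abstention`, which a single global rate would hide.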
-
Measuring hallucination is the detection layer; the prevention layer is about prompt structure. A lot of hallucinations come from the model inferring what it's supposed to do rather than being told explicitly. Vague prompts leave room for confabulation: when role, context, constraints, and output format are blended into flat prose, the model has fuzzy task boundaries and fills the gaps with guesses.

Named semantic blocks tighten this. Explicit sections for each part of the instruction give the model a clearer scope for what counts as a valid response. I've been building flompt for exactly this: a visual prompt builder that decomposes prompts into 12 semantic blocks and compiles them to Claude-optimized XML. It pairs naturally with FaithfulnessEvaluator: structured input on the prevention side, measurement on the detection side.

It's open source at github.com/Nyrok/flompt. It's a solo project, and a star is the best way to support it.
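For readers who haven't seen the pattern, here's a hand-rolled sketch of the named-block idea. The block names and the compile step are my own illustration, not flompt's actual schema:

```python
# Minimal sketch of named semantic blocks (illustrative block names, not
# flompt's actual 12-block schema): each instruction part becomes an
# explicit XML section, so task boundaries are stated rather than inferred.
blocks = {
    "role": "You are a support assistant for the ACME billing API.",
    "context": "The user is on the Pro plan; a docs excerpt is attached.",
    "constraints": ("Answer only from the attached docs. If the docs do not "
                    "cover the question, say so instead of guessing."),
    "output_format": "A short answer followed by a bullet list of doc citations.",
}

def compile_prompt(blocks):
    """Compile named blocks into XML sections, one per instruction part."""
    parts = [f"<{name}>\n{text}\n</{name}>" for name, text in blocks.items()]
    return "\n\n".join(parts)

prompt = compile_prompt(blocks)
```

The explicit `constraints` block is doing the anti-hallucination work here: it names the abstention behavior instead of leaving it implied by tone.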
-
Measuring hallucination rates in production is harder than it sounds: you need a ground truth to compare against, and for open-ended generation you often don't have one. A few approaches that are practical to deploy:

- **Faithfulness scoring with an LLM judge.** For RAG-based systems, check whether the model's output is entailed by the retrieved documents. Prompt: "Given this context: [documents], does this statement: [output] follow from the context? Answer yes/no and explain." This catches "creative extrapolation" from retrieved docs.
- **Factual consistency checking.** Use NLI (Natural Language Inference) models like cross-encoder/nli-deberta to score premise-hypothesis pairs. Not as accurate as LLM-based approaches, but much cheaper for high-volume production traffic.
- **Entity extraction comparison.** Extract named entities (people, organizations, dates, numbers) from the input documents and the model output. Flag entities that appear in the output but not in the input; these are prime hallucination candidates.
- **Self-consistency sampling.** Generate N answers to the same question with temperature > 0. If the answers contradict each other on factual claims, that's a signal the model is hallucinating rather than retrieving.
- **Human feedback annotation loop.** For the queries where you can get feedback ("was this helpful? was this accurate?"), use thumbs-down signals to build a calibration dataset. Train a lightweight classifier to predict thumbs-down from inputs, then use it as a real-time hallucination proxy.

The challenge is that all of these add latency or cost. Most production systems use sampling: check for hallucinations on 1-5% of traffic, not 100%.

What kind of content are you generating: factual Q&A, document summarization, or something else? The right technique varies significantly.
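The entity-comparison approach is cheap enough to sketch in full. This version uses regexes for numbers and capitalized names so it stays self-contained; a production system would swap in a real NER model:

```python
import re

# Crude extractors kept regex-only for the sketch; a real pipeline would
# use an NER model (e.g. spaCy) instead of capitalization heuristics.
NUM = re.compile(r"\d[\d,.]*\d|\d")          # numbers, without trailing punctuation
NAME = re.compile(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*")  # capitalized name runs

def extract_entities(text):
    return set(NUM.findall(text)) | set(NAME.findall(text))

def hallucinated_entities(source, output):
    """Entities in the output but absent from the source: prime
    hallucination candidates to flag for review."""
    return extract_entities(output) - extract_entities(source)

source = "Acme Corp reported revenue of 12.4 million in Q3."
output = "Acme Corp reported revenue of 15.8 million in Q3, led by CEO Jane Doe."
flags = hallucinated_entities(source, output)
# Flags "15.8" and "Jane Doe"; "Acme Corp" and the quarter are grounded.
```

Numbers are the highest-value target here: a changed figure is both the most damaging hallucination and the easiest to detect mechanically.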
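Self-consistency sampling is also easy to wrap as a reusable check. The `generate` callable below is a hypothetical stand-in for whatever model API you use; the agreement metric here is exact string match after normalization, which is the simplest variant:

```python
from collections import Counter

def self_consistency(question, generate, n=5):
    """Sample n answers at temperature > 0 and measure agreement on the
    normalized answer string. Low agreement suggests the model is guessing
    rather than retrieving. `generate` is a hypothetical callable that
    wraps your model API and returns one answer string per call."""
    answers = [generate(question).strip().lower() for _ in range(n)]
    top, count = Counter(answers).most_common(1)[0]
    return top, count / n  # majority answer and its agreement rate
```

In practice you would flag answers below some agreement threshold (say 0.6) for review, and use semantic comparison (embeddings or an NLI model) instead of string equality for free-form outputs.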
-
We've been experimenting with stress testing LLM systems for hallucinations and prompt injection. Curious how people here measure hallucination rates in production systems?

Thanks!
Terry