Replies: 4 comments
-
Hi @terrywerk! I've converted this issue into a Discussion, as it's a great topic for community input. If you are looking at how to measure hallucination rates, we have several resources that talk about this:
Let me know if you have questions about any of these resources!
-
From my point of view, one global hallucination rate is usually less useful than a small set of failure classes. Unsupported claims, retrieval misses, bad transformations, and missing abstentions all look similar from far away but need very different fixes. What tends to work in production is a human-labeled evaluation slice for calibration plus online sampling that measures grounding, retrieval coverage, and abstention behavior separately.
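To make the per-class idea concrete, here's a minimal sketch of computing separate rates from a labeled eval slice. The class names and labels are illustrative, not a standard taxonomy:

```python
from collections import Counter

# Hypothetical labeled eval slice: each sample tagged with one failure
# class, or "ok" if the output was fine. Labels are illustrative.
labels = [
    "ok", "unsupported_claim", "ok", "retrieval_miss", "ok",
    "unsupported_claim", "bad_transformation", "ok", "missing_abstention", "ok",
]

def failure_rates(labels):
    """Per-class failure rates instead of one global hallucination rate."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items() if cls != "ok"}

rates = failure_rates(labels)
# Each class gets its own rate, so each fix can be targeted and tracked
# separately instead of chasing a single blended number.
```

The payoff is in the trend lines: a retrieval fix should move `retrieval_miss` without touching `missing_abstention`, which a single global rate would hide.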
-
Measuring hallucination is the detection layer; the prevention layer is about prompt structure. A lot of hallucinations come from the model inferring what it's supposed to do rather than being told explicitly. Vague prompts leave room for confabulation: when role, context, constraints, and output format are blended into flat prose, the model has fuzzy task boundaries and fills the gaps with guesses.

Named semantic blocks tighten this. Explicit sections for each part of the instruction give the model a clearer scope for what counts as a valid response. I've been building flompt for exactly this: a visual prompt builder that decomposes prompts into 12 semantic blocks and compiles them to Claude-optimized XML. It pairs naturally with FaithfulnessEvaluator: structured input on the prevention side, measurement on the detection side.

It's open source at github.com/Nyrok/flompt. It's a solo project, and a star is the best way to support it.
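For readers who haven't seen the pattern, here's a hand-rolled sketch of the named-block idea. The block names and the compile step are my own illustration, not flompt's actual schema:

```python
# Minimal sketch of named semantic blocks (illustrative block names, not
# flompt's actual 12-block schema): each instruction part becomes an
# explicit XML section, so task boundaries are stated rather than inferred.
blocks = {
    "role": "You are a support assistant for the ACME billing API.",
    "context": "The user is on the Pro plan; a docs excerpt is attached.",
    "constraints": ("Answer only from the attached docs. If the docs do not "
                    "cover the question, say so instead of guessing."),
    "output_format": "A short answer followed by a bullet list of doc citations.",
}

def compile_prompt(blocks):
    """Compile named blocks into XML sections, one per instruction part."""
    parts = [f"<{name}>\n{text}\n</{name}>" for name, text in blocks.items()]
    return "\n\n".join(parts)

prompt = compile_prompt(blocks)
```

The explicit `constraints` block is doing the anti-hallucination work here: it names the abstention behavior instead of leaving it implied by tone.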
-
Measuring hallucination rates in production is harder than it sounds: you need a ground truth to compare against, and for open-ended generation you often don't have one. A few approaches that are practical to deploy:

- **Faithfulness scoring with an LLM judge.** For RAG-based systems, check whether the model's output is entailed by the retrieved documents. Prompt: "Given this context: [documents], does this statement: [output] follow from the context? Answer yes/no and explain." This catches "creative extrapolation" from retrieved docs.
- **Factual consistency checking.** Use NLI (Natural Language Inference) models like cross-encoder/nli-deberta to score premise-hypothesis pairs. Not as accurate as LLM-based approaches, but much cheaper for high-volume production traffic.
- **Entity extraction comparison.** Extract named entities (people, organizations, dates, numbers) from the input documents and the model output. Flag entities that appear in the output but not in the input; these are prime hallucination candidates.
- **Self-consistency sampling.** Generate N answers to the same question with temperature > 0. If the answers contradict each other on factual claims, that's a signal the model is hallucinating rather than retrieving.
- **Human feedback annotation loop.** For the queries where you can get feedback ("was this helpful? was this accurate?"), use thumbs-down signals to build a calibration dataset. Train a lightweight classifier to predict thumbs-down from inputs, then use it as a real-time hallucination proxy.

The challenge is that all of these add latency or cost. Most production systems use sampling: check for hallucinations on 1-5% of traffic, not 100%.

What kind of content are you generating: factual Q&A, document summarization, or something else? The right technique varies significantly.
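The entity-comparison approach is cheap enough to sketch in full. This version uses regexes for numbers and capitalized names so it stays self-contained; a production system would swap in a real NER model:

```python
import re

# Crude extractors kept regex-only for the sketch; a real pipeline would
# use an NER model (e.g. spaCy) instead of capitalization heuristics.
NUM = re.compile(r"\d[\d,.]*\d|\d")          # numbers, without trailing punctuation
NAME = re.compile(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*")  # capitalized name runs

def extract_entities(text):
    return set(NUM.findall(text)) | set(NAME.findall(text))

def hallucinated_entities(source, output):
    """Entities in the output but absent from the source: prime
    hallucination candidates to flag for review."""
    return extract_entities(output) - extract_entities(source)

source = "Acme Corp reported revenue of 12.4 million in Q3."
output = "Acme Corp reported revenue of 15.8 million in Q3, led by CEO Jane Doe."
flags = hallucinated_entities(source, output)
# Flags "15.8" and "Jane Doe"; "Acme Corp" and the quarter are grounded.
```

Numbers are the highest-value target here: a changed figure is both the most damaging hallucination and the easiest to detect mechanically.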
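Self-consistency sampling is also easy to wrap as a reusable check. The `generate` callable below is a hypothetical stand-in for whatever model API you use; the agreement metric here is exact string match after normalization, which is the simplest variant:

```python
from collections import Counter

def self_consistency(question, generate, n=5):
    """Sample n answers at temperature > 0 and measure agreement on the
    normalized answer string. Low agreement suggests the model is guessing
    rather than retrieving. `generate` is a hypothetical callable that
    wraps your model API and returns one answer string per call."""
    answers = [generate(question).strip().lower() for _ in range(n)]
    top, count = Counter(answers).most_common(1)[0]
    return top, count / n  # majority answer and its agreement rate
```

In practice you would flag answers below some agreement threshold (say 0.6) for review, and use semantic comparison (embeddings or an NLI model) instead of string equality for free-form outputs.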
-
We've been experimenting with stress testing LLM systems for hallucinations and prompt injection. Curious how people here measure hallucination rates in production systems?

Thanks!
Terry