
[Evaluation] Normalize evaluator validation errors to EvaluationException with USER_ERROR blame #47735

Draft
m7md7sien wants to merge 1 commit into Azure:main from m7md7sien:mohessie/normalize-evaluator-exceptions


Summary

Normalizes evaluation/validation error handling in azure-ai-evaluation so that user input and configuration errors are consistently raised as EvaluationException with blame=ErrorBlame.USER_ERROR (plus an appropriate category and target).

Previously several evaluators raised bare ValueError/TypeError for input/threshold validation, and a few existing EvaluationException raises did not set blame, so they defaulted to Unknown/InternalError even though they were caused by user input.

Changes

Raw ValueError/TypeError → EvaluationException(USER_ERROR)

  • ContentSafetyEvaluator — threshold type check
  • QAEvaluator — threshold type check
  • RougeScoreEvaluator — threshold type check
  • DocumentRetrievalEvaluator — ground-truth label and input-record validation
  • Task navigation efficiency evaluator — matching_mode and ground_truth validation

Existing EvaluationException missing USER_ERROR

  • Evaluator base (_base_eval.py) — conversation message mismatch, malformed tool-call parsing, and threshold-not-a-number checks now set blame=USER_ERROR (one category=UNKNOWN corrected to INVALID_VALUE)
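To illustrate why an unset blame matters, here is a small sketch under the assumption that the exception's `blame` argument defaults to an "unknown" value, as the summary describes. The classes are local stand-ins and `blame_of` is a hypothetical helper, not part of the library.

```python
from enum import Enum


# Stand-ins for illustration only; names mirror this PR description.
class ErrorBlame(Enum):
    USER_ERROR = "UserError"
    UNKNOWN = "Unknown"


class EvaluationException(Exception):
    def __init__(self, message, *, blame=ErrorBlame.UNKNOWN):
        super().__init__(message)
        self.blame = blame


def raise_without_blame():
    # Shape of the old raises: blame is omitted, so the default applies.
    raise EvaluationException("Threshold must be a number.")


def raise_with_blame():
    # Shape of the updated raises: blame is set explicitly.
    raise EvaluationException(
        "Threshold must be a number.", blame=ErrorBlame.USER_ERROR
    )


def blame_of(raise_fn):
    """Hypothetical helper: run raise_fn and report the blame it carried."""
    try:
        raise_fn()
    except EvaluationException as exc:
        return exc.blame
    return None
```

With these stand-ins, `blame_of(raise_without_blame)` yields `ErrorBlame.UNKNOWN`, while `blame_of(raise_with_blame)` yields `ErrorBlame.USER_ERROR`, which is the distinction the base evaluator changes are meant to restore.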

Supporting

  • Added QA_EVALUATOR, ROUGE_EVALUATOR, and DOCUMENT_RETRIEVAL_EVALUATOR members to ErrorTarget
  • Updated the task navigation test to expect EvaluationException for an invalid matching_mode
  • CHANGELOG entry under 1.17.1 (Unreleased)
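The updated task navigation test might look something like the sketch below. Everything here is assumed for illustration: `check_matching_mode`, the `ALLOWED_MODES` set, and the stand-in exception classes are hypothetical and do not reproduce the repository's test or evaluator code.

```python
import unittest
from enum import Enum


# Stand-ins for illustration only; names mirror this PR description.
class ErrorBlame(Enum):
    USER_ERROR = "UserError"
    UNKNOWN = "Unknown"


class EvaluationException(Exception):
    def __init__(self, message, *, blame=ErrorBlame.UNKNOWN):
        super().__init__(message)
        self.blame = blame


# Hypothetical set of accepted modes, assumed for this sketch.
ALLOWED_MODES = {"exact_match", "in_order_match", "any_order_match"}


def check_matching_mode(mode):
    """Hypothetical validator: reject unknown modes as a user error."""
    if mode not in ALLOWED_MODES:
        raise EvaluationException(
            f"Invalid matching_mode: {mode!r}", blame=ErrorBlame.USER_ERROR
        )
    return mode


class MatchingModeTest(unittest.TestCase):
    def test_invalid_matching_mode_raises_user_error(self):
        # The test now expects EvaluationException rather than ValueError.
        with self.assertRaises(EvaluationException) as ctx:
            check_matching_mode("fuzzy")
        self.assertIs(ctx.exception.blame, ErrorBlame.USER_ERROR)

    def test_valid_matching_mode_passes(self):
        self.assertEqual(check_matching_mode("exact_match"), "exact_match")
```

Asserting on the blame as well as the exception type guards against a later raise that silently falls back to the default blame.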

Intentionally left unchanged (not user errors)

  • "Evaluator returned invalid output" / "Invalid score value" across the prompty and tool evaluators remain SYSTEM_ERROR (malformed LLM output, not user input).
  • Internal/defensive checks (_conversation_aggregators.py UNKNOWN, _base_rai_svc_eval.py "Not implemented") are unchanged.

Validation

  • All affected unit tests pass (document retrieval, task navigation, threshold behavior, common validators, built-in & agent evaluators).
  • black (pinned 24.4.0, repo config) passes on all modified files.

…R_ERROR blame

Convert raw ValueError/TypeError input and configuration validation failures in ContentSafety, QA, Rouge, DocumentRetrieval and TaskNavigationEfficiency evaluators to EvaluationException, and ensure user-validation errors across the evaluator base consistently set blame=ErrorBlame.USER_ERROR with appropriate category/target. Adds QA/Rouge/DocumentRetrieval ErrorTarget enum members and updates the task navigation test.

Labels

Evaluation Issues related to the client library for Azure AI Evaluation
