add: eval tests using new deterministic-evaluator format (3-4 per skill)#70

Open
saurabhrb wants to merge 5 commits into main from users/saurabhrb/eval-tests-deterministic

saurabhrb commented Jun 2, 2026

Contributor

Summary

Adds 3-4 eval tests per skill to evals/tests/ using the new generic deterministic-evaluator format introduced in LER PR #393 (settings passthrough to BICEP), plus the per-assertion LMChecklist evaluator recommended by the LER author so natural-language assertions actually get graded.

Each test file enables three evaluators:

| Evaluator | What it grades | Passing score |
| --- | --- | --- |
| CortexConfigurations:Common/DeterministicAssertionEvaluator (with settings.supported_verbs = "CONTAINS,NOT_CONTAINS,SKILL_LOADED") | Verb-prefixed assertions only: fast, deterministic, no LLM | 3 |
| CortexConfigurations:Common/Skills/correctness.prompty | Whole-response correctness vs expected_response | 3 |
| CortexConfigurations:Common/SEVAL/LMChecklist.prompty | Each individual assertion as a 0/1 LLM checklist item | 1 (matches the prompty's intrinsic 0/1 range) |
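
The split between verb-prefixed and natural-language assertions can be sketched roughly as below. This is an illustration only, not the DeterministicAssertionEvaluator implementation: the `VERB: argument` syntax, the `Result` type, and the `grade` function are hypothetical names chosen for the example. It shows why an assertion without a supported verb prefix comes back ungraded and needs an LLM-based evaluator such as LMChecklist.

```python
from dataclasses import dataclass
from typing import Optional

# Verbs taken from settings.supported_verbs in the PR description.
SUPPORTED_VERBS = ("CONTAINS", "NOT_CONTAINS", "SKILL_LOADED")


@dataclass
class Result:
    assertion: str
    passed: Optional[bool]  # None means "not graded by this evaluator"


def grade(assertion: str, response: str, loaded_skills: set[str]) -> Result:
    """Grade one assertion deterministically (hypothetical 'VERB: argument' syntax)."""
    verb, sep, arg = assertion.partition(":")
    verb, arg = verb.strip(), arg.strip()
    if not sep or verb not in SUPPORTED_VERBS:
        # Natural-language assertion: nothing deterministic to check.
        return Result(assertion, None)
    if verb == "CONTAINS":
        return Result(assertion, arg in response)
    if verb == "NOT_CONTAINS":
        return Result(assertion, arg not in response)
    # Remaining supported verb is SKILL_LOADED.
    return Result(assertion, arg in loaded_skills)
```

Under these assumptions, `grade("Explains why secrets belong in .env", response, skills)` returns `passed=None`, which is the gap the LMChecklist evaluator is meant to cover.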

Files

| File | Tests |
| --- | --- |
| dv_connect.biceval.json | 4: auth helper, .env contract, no-hardcoded-secrets safety, end-to-end smoke |
| dv_data.biceval.json | 3: single-record SDK create, bulk-create with CreateMultiple, skill-contract |
| dv_metadata.biceval.json | 3: SavedQuery Web API gap, SDK table+column create, lookup-relationship trap |
| dv_overview.biceval.json | 4: multi-skill routing (read+create, solution, bulk data, tool hierarchy) |
| dv_query.biceval.json | 4: bulk read, filtered read, FetchXML aggregation gap, single-record by id |
| dv_solution.biceval.json | 4: export, unpack, import, hallucinated-SDK routing trap |

Source

Ported from a prior baseline test set (one baseline test per skill) and extended with 2-3 natural follow-up tests per skill covering common regressions (skill-contract drift, routing-violation traps, antipattern phrasing).

Cross-repo dependencies (CI gate context)

The full PR-gate pipeline (DVSkillsPlugin-Evals-PR) requires two LER changes to ship before it can grade these files against the real BIC Evaluation Service:

  1. bic/LocalEvalRunner#416 — adds DeterministicAssertionEvaluator to the default ServiceOnlyEvaluatorNames so LER routes it service-side instead of crashing at local-file lookup. Without this, the gate dies before any test runs.
  2. A pre-existing LER HTTP-headers bug (Request headers must contain only ASCII characters) on the POST /offlineEvaluation/async path, which needs a separate LER PR. It surfaces only after #416 lands.

Until both ship, the DVSkillsPlugin-Evals-PR check on this PR is expected to be red — that's the cross-repo dependency, not a content issue. Static checks (python .github/evals/static_checks.py) pass on every commit.

End-to-end validation via a separate harness — proven signal

To prove the test format actually grades regressions correctly now — without waiting for the LER fix chain — a throwaway harness PR pinned the pipeline to (a) build LER from the PR #416 branch and (b) pass --feature DisableEvaluationService so LER grades locally via CAPI/AOAI instead of via the broken HTTP path. Three bad PRs branched off that harness, each introducing one isolated SKILL.md regression.

| PR | Regression | Pipeline run | Result | Tests pass | P1 assertions pass | Findings |
| --- | --- | --- | --- | --- | --- | --- |
| #72 baseline | none (unchanged plugin) | build 20294895 | PASS | 3/3 | 34/34 | Unchanged plugin grades clean. Establishes noise floor for delta comparison. |
| #73 BAD-1 | regress dv-data bulk-create (replace CreateMultiple + adaptive chunking with per-record loop + "no batch API" phrasing) | build 20294904 | FAIL | 2/3 | 33/34 | data_003_skill_contract fails (LMChecklist 4/5, Score 2.80 < threshold 3), caught at skill-contract level. data_002_bulk_create still passed because the agent fell back on model prior knowledge for CreateMultiple even though SKILL no longer mentions it, which is exactly why the skill-contract test exists alongside the behavior test. |
| #74 BAD-2 | regress dv-solution (replace pac solution export with fictitious client.solutions.export() SDK call) | build 20294910 | FAIL | 2/4 | 26/38 | solution_001_export (LMChecklist 0/5, Score 1.30) and solution_004_routing_violation_trap (LMChecklist 3/6, Score 1.58) both fail. Agent followed the bad skill and used the hallucinated SDK call, exactly the routing-tier regression the trap test was designed to catch. 12 P1 assertion failures. |
| #75 BAD-3 | regress dv-connect (inject "Quick start" snippet with hardcoded client_secret="abc123…") | build 20294936 | FAIL | 1/4 | 27/36 | connect_003_no_hardcoded_secrets fails as designed (LMChecklist 3/5, Score 1.30). Bleed-over into connect_001 (Score 2.50) and connect_002_env_file (Score 2.75): agent picked up the bad pattern from the top of the skill and propagated it into general connect requests too. 9 P1 assertion failures. |
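
As a quick cross-check of the table, the P1 failure counts quoted in the Findings column can be recomputed from the "P1 assertions pass" ratios. The snippet below is only a worked arithmetic example using the numbers reported above; the `RUNS` mapping and `p1_failures` helper are names made up for the illustration.

```python
# (passed, total) P1 assertions per harness run, copied from the table above.
RUNS = {
    "#72 baseline": (34, 34),
    "#73 BAD-1": (33, 34),
    "#74 BAD-2": (26, 38),
    "#75 BAD-3": (27, 36),
}


def p1_failures(runs: dict[str, tuple[int, int]]) -> dict[str, int]:
    """Return the number of failed P1 assertions for each run."""
    return {name: total - passed for name, (passed, total) in runs.items()}


for name, failed in p1_failures(RUNS).items():
    print(f"{name}: {failed} P1 assertion failure(s)")
```

The results (0, 1, 12, and 9) agree with the counts stated in the Findings column for BAD-2 and BAD-3.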

Interpretation:

  • Baseline goes green → harness, test files, and local-eval grading stack all work end-to-end.
  • Each bad PR fails the gate on the specific assertions the regression should trip, while unrelated tests in the same file stay green → assertions discriminate real regressions and aren't just noise.
  • 4 PRs × ~50 LLM grading calls each ≈ 200 grading calls; results were stable enough to read without noise filtering (no flaky-test false positives observed).

The harness PR and the three bad-PR drafts will be closed unmerged once this evidence section is reviewed; they exist only to prove the gate has signal before the real LER chain ships.

What the empirical run also caught — local fix included in this PR

Initial harness baseline (build 20293660) failed 0/3 even on the unchanged plugin. Root cause: LMChecklist scores assertions on a 0/1 scale, but the entry started with passing_score: 3 (matching the other two evaluators). With threshold 3 and max score 1, every LMChecklist assertion was marked failed regardless of agent quality.

Lowered to passing_score: 1 (commit 1796ac7); next baseline run (build 20294895) went 3/3 green and the bad-PR runs above produced clean signal. Fix mirrored into all six files in this PR.
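
The threshold mismatch is easy to see in isolation. The following is a minimal sketch that assumes an item passes when its score is greater than or equal to `passing_score` (the comparison operator is an assumption; the 0/1 score range comes from the description above). The function names are hypothetical.

```python
# Binary score range LMChecklist reports per assertion, per the PR description.
LMCHECKLIST_SCORES = (0, 1)


def passes(score: float, passing_score: float) -> bool:
    """Assumed pass rule: score must meet or exceed the threshold."""
    return score >= passing_score


def threshold_is_reachable(passing_score: float) -> bool:
    """True if at least one possible LMChecklist score can pass the threshold."""
    return any(passes(score, passing_score) for score in LMCHECKLIST_SCORES)


print(threshold_is_reachable(3))  # no 0/1 score can reach 3
print(threshold_is_reachable(1))  # a score of 1 passes
```

A check like this could run as a static lint over evaluator entries, flagging any `passing_score` above the evaluator's maximum score before a pipeline run is spent on it.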

Validation summary

  • All six JSON files parse cleanly
  • python .github/evals/static_checks.py → PASSED (8 skill files, 45 Python blocks, 8 categories)
  • No skill-file changes, so no version bump required
  • End-to-end signal proven via local-eval harness (above); full BICEP-side parity will follow once LER #416 + the ASCII-headers fix ship
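
The "parse cleanly" check can be reproduced with a few lines of standard-library Python. This is a sketch, not the contents of static_checks.py: the `find_invalid_json` helper is hypothetical, and the `evals/tests` directory and `*.biceval.json` pattern are taken from the file names mentioned in this PR.

```python
import json
from pathlib import Path


def find_invalid_json(test_dir: str, pattern: str = "*.biceval.json") -> dict[str, str]:
    """Return {file name: parse error} for every matching file that is not valid JSON."""
    errors: dict[str, str] = {}
    for path in sorted(Path(test_dir).glob(pattern)):
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            errors[path.name] = str(exc)
    return errors


if __name__ == "__main__":
    problems = find_invalid_json("evals/tests")
    for name, message in problems.items():
        print(f"{name}: {message}")
    print("OK" if not problems else f"{len(problems)} file(s) failed to parse")
```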

…/query/solution; convert dv_data to new format

Adds 3-4 tests per skill using the new generic deterministic-evaluator format introduced in LocalEvalRunner: each test file now enables CortexConfigurations:Common/DeterministicAssertionEvaluator with settings.supported_verbs=CONTAINS,NOT_CONTAINS,SKILL_LOADED alongside the correctness.prompty semantic judge. Ports the dev/evalsV0 baseline tests (connect_001, metadata_001, overview_001, query_001, solution_001) and adds 2-3 natural follow-up tests per skill covering env-file contract, no-hardcoded-secrets, schema create + lookup relationship, filtered/aggregate reads, and solution unpack/import/routing-trap. dv_data.biceval.json is updated in place to add the deterministic evaluator.
…t graded

Per LER PR-393 author guidance: DeterministicAssertionEvaluator only grades verb-prefixed assertions (CONTAINS/NOT_CONTAINS/SKILL_LOADED), and correctness.prompty scores against expected_response without seeing individual assertions. Without LMChecklist, natural-language assertions (those without a verb prefix) are unscored. Adds LMChecklist (Common/SEVAL/LMChecklist.prompty) to all six test files using the exact name + passing_score=3 + priority=1 shape the LER author published. Loader registration is already proven against the new format in Dataverse-skills PR #71's draft validation runs (LocalEvalRunner builds 20289122 and 20290419).
…range

Empirically verified across builds 20293660 (passing_score=3, 0/3 baseline pass) vs 20294895 (passing_score=1, 3/3 baseline pass) on the validation harness PR #72. LMChecklist returns binary 0/1 scores so a threshold of 3 marks every assertion failed regardless of agent quality.
BUILD_BUILDNUMBER flows into LER as the runId and from there into the x-ms-correlation-id HTTP header on every LER -> BICEP request. Non-ASCII codepoints in the build name trigger System.Net.Http.HttpRequestException: Request headers must contain only ASCII characters in .NET HttpClient.WriteHeaderCollection, crashing the pipeline before any test is graded. Switches the visual separator from U+2022 BULLET to ASCII hyphen. Defense-in-depth tracked LER-side at https://microsoft.ghe.com/bic/LocalEvalRunner/issues/418.
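
A consumer-side guard for this failure mode might look like the sketch below. It is written in Python for illustration rather than in the pipeline's own tooling; the helper names and the sample build name are hypothetical. It locates non-ASCII code points in a value destined for an HTTP header and applies the same U+2022 to hyphen substitution described above.

```python
def non_ascii_chars(value: str) -> list[tuple[int, str]]:
    """Return (index, code point) for each character outside the ASCII range."""
    return [(i, f"U+{ord(ch):04X}") for i, ch in enumerate(value) if ord(ch) > 127]


def to_ascii_build_name(name: str) -> str:
    """Replace the U+2022 BULLET separator with an ASCII hyphen."""
    return name.replace("\u2022", "-")


build_name = "Evals \u2022 PR"  # hypothetical build name containing a bullet
print(non_ascii_chars(build_name))
fixed = to_ascii_build_name(build_name)
print(fixed, fixed.isascii())
```

Only the bullet is replaced here; any other non-ASCII character would still be reported by `non_ascii_chars`, so the check is worth keeping even after the separator is fixed.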
saurabhrb force-pushed the users/saurabhrb/eval-tests-deterministic branch from 9b800d5 to 8d81f51 on June 3, 2026 16:48
saurabhrb pushed a commit that referenced this pull request Jun 3, 2026
…d + BICEP HTTP path enabled

Empirical test for LER issue #418. Strips U+2022 bullets from pipeline name (matches PR #70 fix), restores DeterministicAssertionEvaluator entry in dv_data.biceval.json, and removes the DisableEvaluationService feature flag so the LER -> BICEP POST happens for real. If pipeline passes -> consumer-side bullets fix is sufficient and LER #418 becomes defense-in-depth. If pipeline fails with the same ASCII headers exception -> the non-ASCII byte is coming from a header other than the correlation ID and LER #418 must ship.