add: eval tests using new deterministic-evaluator format (3-4 per skill) by saurabhrb · Pull Request #70 · microsoft/Dataverse-skills

saurabhrb · 2026-06-02T05:39:33Z

Summary

Adds 3-4 eval tests per skill to evals/tests/ using the new generic deterministic-evaluator format introduced in LER PR #393 (settings passthrough to BICEP), plus the per-assertion LMChecklist evaluator recommended by the LER author so natural-language assertions actually get graded.

Each test file enables three evaluators:

| Evaluator | What it grades | Passing score |
| --- | --- | --- |
| `CortexConfigurations:Common/DeterministicAssertionEvaluator` (with `settings.supported_verbs = "CONTAINS,NOT_CONTAINS,SKILL_LOADED"`) | Verb-prefixed assertions only; fast, deterministic, no LLM | 3 |
| `CortexConfigurations:Common/Skills/correctness.prompty` | Whole-response correctness vs `expected_response` | 3 |
| `CortexConfigurations:Common/SEVAL/LMChecklist.prompty` | Each individual assertion as a 0/1 LLM checklist item | 1 (matches the prompty's intrinsic 0/1 range) |
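
To illustrate how assertions divide between the deterministic evaluator and LMChecklist, here is a minimal sketch. The `VERB: argument` assertion syntax, the function names, and the `loaded_skills` parameter are assumptions made for illustration; the real DeterministicAssertionEvaluator runs service-side and its grading rules are not shown in this PR.

```python
# Value taken from settings.supported_verbs in the table above.
SUPPORTED_VERBS = "CONTAINS,NOT_CONTAINS,SKILL_LOADED"


def split_assertion(assertion, supported_verbs=SUPPORTED_VERBS):
    """Return (verb, argument) for a verb-prefixed assertion, else (None, assertion).

    Assumes a hypothetical "VERB: argument" syntax.
    """
    verbs = {v.strip() for v in supported_verbs.split(",")}
    head, sep, tail = assertion.partition(":")
    if sep and head.strip() in verbs:
        return head.strip(), tail.strip()
    return None, assertion


def grade_deterministic(assertion, response, loaded_skills):
    """Grade one assertion without an LLM.

    Returns True/False for verb-prefixed assertions and None for
    natural-language ones, which would be left to LMChecklist.
    """
    verb, arg = split_assertion(assertion)
    if verb == "CONTAINS":
        return arg in response
    if verb == "NOT_CONTAINS":
        return arg not in response
    if verb == "SKILL_LOADED":
        return arg in loaded_skills
    return None
```

Under these assumptions, an assertion such as `NOT_CONTAINS: client_secret=` is graded deterministically, while a free-text assertion returns `None` and would go unscored without a checklist-style evaluator.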

Files

| File | Tests | Coverage |
| --- | --- | --- |
| `dv_connect.biceval.json` | 4 | auth helper, .env contract, no-hardcoded-secrets safety, end-to-end smoke |
| `dv_data.biceval.json` | 3 | single-record SDK create, bulk-create with CreateMultiple, skill-contract |
| `dv_metadata.biceval.json` | 3 | SavedQuery Web API gap, SDK table+column create, lookup-relationship trap |
| `dv_overview.biceval.json` | 4 | multi-skill routing (read+create, solution, bulk data, tool hierarchy) |
| `dv_query.biceval.json` | 4 | bulk read, filtered read, FetchXML aggregation gap, single-record by id |
| `dv_solution.biceval.json` | 4 | export, unpack, import, hallucinated-SDK routing trap |

Source

Ported from a prior baseline test set (one baseline test per skill) and extended with 2-3 natural follow-up tests per skill covering common regressions (skill-contract drift, routing-violation traps, antipattern phrasing).

Cross-repo dependencies (CI gate context)

The full PR-gate pipeline (DVSkillsPlugin-Evals-PR) requires two LER changes to ship before it can grade these files against the real BIC Evaluation Service:

- bic/LocalEvalRunner#416 adds DeterministicAssertionEvaluator to the default ServiceOnlyEvaluatorNames so LER routes it service-side instead of crashing at local-file lookup. Without this, the gate dies before any test runs.
- A pre-existing LER HTTP-headers bug (`Request headers must contain only ASCII characters`) on the `POST /offlineEvaluation/async` path needs a separate LER PR. It surfaces only after #416 lands.

Until both ship, the DVSkillsPlugin-Evals-PR check on this PR is expected to be red — that's the cross-repo dependency, not a content issue. Static checks (python .github/evals/static_checks.py) pass on every commit.

End-to-end validation via a separate harness — proven signal

To prove the test format actually grades regressions correctly now — without waiting for the LER fix chain — a throwaway harness PR pinned the pipeline to (a) build LER from the PR #416 branch and (b) pass --feature DisableEvaluationService so LER grades locally via CAPI/AOAI instead of via the broken HTTP path. Three bad PRs branched off that harness, each introducing one isolated SKILL.md regression.

| PR | Regression | Pipeline run | Result | Tests pass | P1 assertions pass | Findings |
| --- | --- | --- | --- | --- | --- | --- |
| #72 baseline | none (unchanged plugin) | build 20294895 | ✅ PASS | 3/3 | 34/34 | Unchanged plugin grades clean. Establishes noise floor for delta comparison. |
| #73 BAD-1 | regress `dv-data` bulk-create (replace `CreateMultiple` + adaptive chunking with per-record loop + "no batch API" phrasing) | build 20294904 | ❌ FAIL | 2/3 | 33/34 | `data_003_skill_contract` fails (LMChecklist 4/5, Score 2.80 < threshold 3), caught at skill-contract level. `data_002_bulk_create` still passed because the agent fell back on model prior knowledge for `CreateMultiple` even though SKILL no longer mentions it, which is exactly why the skill-contract test exists alongside the behavior test. |
| #74 BAD-2 | regress `dv-solution` (replace `pac solution export` with fictitious `client.solutions.export()` SDK call) | build 20294910 | ❌ FAIL | 2/4 | 26/38 | `solution_001_export` (LMChecklist 0/5, Score 1.30) and `solution_004_routing_violation_trap` (LMChecklist 3/6, Score 1.58) both fail. Agent followed the bad skill and used the hallucinated SDK call, exactly the routing-tier regression the trap test was designed to catch. 12 P1 assertion failures. |
| #75 BAD-3 | regress `dv-connect` (inject "Quick start" snippet with hardcoded `client_secret="abc123…"`) | build 20294936 | ❌ FAIL | 1/4 | 27/36 | `connect_003_no_hardcoded_secrets` fails as designed (LMChecklist 3/5, Score 1.30). Bleed-over into `connect_001` (Score 2.50) and `connect_002_env_file` (Score 2.75): agent picked up the bad pattern from the top of the skill and propagated it into general connect requests too. 9 P1 assertion failures. |

Interpretation:

- Baseline goes green → harness, test files, and local-eval grading stack all work end-to-end.
- Each bad PR fails the gate on the specific assertions the regression should trip, while unrelated tests in the same file stay green → assertions discriminate real regressions and aren't just noise.
- 4 PRs × ~50 LLM grading calls each ≈ 200 grading calls; results were stable enough to read without noise filtering (no flaky-test false positives observed).
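
As a quick cross-check of the failure counts quoted in the findings, the P1 assertion figures from the results table can be subtracted directly. The dictionary and function names below are illustrative only; the numbers are copied from the table.

```python
# (passed, total) P1 assertion counts copied from the results table.
RESULTS = {
    "#72 baseline": (34, 34),
    "#73 BAD-1": (33, 34),
    "#74 BAD-2": (26, 38),
    "#75 BAD-3": (27, 36),
}


def p1_failures(results):
    """Return the number of failed P1 assertions per PR (total minus passed)."""
    return {pr: total - passed for pr, (passed, total) in results.items()}


print(p1_failures(RESULTS))
# {'#72 baseline': 0, '#73 BAD-1': 1, '#74 BAD-2': 12, '#75 BAD-3': 9}
```

The 12 and 9 failures reported for BAD-2 and BAD-3 match these differences, and the baseline shows zero.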

The harness PR and the three bad-PR drafts will be closed unmerged once this evidence section is reviewed; they exist only to prove the gate has signal before the real LER chain ships.

What the empirical run also caught — local fix included in this PR

Initial harness baseline (build 20293660) failed 0/3 even on the unchanged plugin. Root cause: LMChecklist scores assertions on a 0/1 scale, but the entry started with passing_score: 3 (matching the other two evaluators). With threshold 3 and max score 1, every LMChecklist assertion was marked failed regardless of agent quality.

Lowered to passing_score: 1 (commit 1796ac7); next baseline run (build 20294895) went 3/3 green and the bad-PR runs above produced clean signal. Fix mirrored into all six files in this PR.
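
The threshold mismatch described above is the kind of error a small static check could catch before a pipeline run. The following is only a sketch: the `name` and `passing_score` keys and the `MAX_SCORE` table are assumptions for illustration, not the actual biceval.json schema, and the only score range stated in this PR is LMChecklist's 0/1.

```python
LM_CHECKLIST = "CortexConfigurations:Common/SEVAL/LMChecklist.prompty"

# Assumed score ceilings per evaluator; only LMChecklist's 0/1 range is stated in this PR.
MAX_SCORE = {LM_CHECKLIST: 1}


def unreachable_thresholds(evaluators, max_score=MAX_SCORE):
    """Return names of evaluators whose passing_score exceeds their maximum possible score."""
    return [
        e["name"]
        for e in evaluators
        if e["name"] in max_score and e["passing_score"] > max_score[e["name"]]
    ]


before_fix = [{"name": LM_CHECKLIST, "passing_score": 3}]
after_fix = [{"name": LM_CHECKLIST, "passing_score": 1}]
print(unreachable_thresholds(before_fix))  # flags the LMChecklist entry
print(unreachable_thresholds(after_fix))   # []
```

With a threshold of 3 and a ceiling of 1, no response can ever pass, which is why the first baseline failed regardless of agent quality.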

Validation summary

- All six JSON files parse cleanly
- python .github/evals/static_checks.py → PASSED (8 skill files, 45 Python blocks, 8 categories)
- No skill-file changes, so no version bump required
- End-to-end signal proven via local-eval harness (above); full BICEP-side parity will follow once LER #416 + the ASCII-headers fix ship
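
The "parse cleanly" check can be reproduced locally with a few lines of Python. This is a sketch and not the contents of `static_checks.py`; the function name is hypothetical, and the directory argument is assumed to be `evals/tests/` as named in the Summary.

```python
import json
from pathlib import Path


def unparseable_files(test_dir, pattern="*.biceval.json"):
    """Return {filename: error message} for every matching file that is not valid JSON."""
    failures = {}
    for path in sorted(Path(test_dir).glob(pattern)):
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            failures[path.name] = str(exc)
    return failures
```

Calling `unparseable_files("evals/tests")` from the repository root would return an empty dictionary when every test file parses.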

…/query/solution; convert dv_data to new format

Adds 3-4 tests per skill using the new generic deterministic-evaluator format introduced in LocalEvalRunner: each test file now enables CortexConfigurations:Common/DeterministicAssertionEvaluator with settings.supported_verbs=CONTAINS,NOT_CONTAINS,SKILL_LOADED alongside the correctness.prompty semantic judge. Ports the dev/evalsV0 baseline tests (connect_001, metadata_001, overview_001, query_001, solution_001) and adds 2-3 natural follow-up tests per skill covering env-file contract, no-hardcoded-secrets, schema create + lookup relationship, filtered/aggregate reads, and solution unpack/import/routing-trap. dv_data.biceval.json is updated in place to add the deterministic evaluator.

…t graded

Per LER PR-393 author guidance: DeterministicAssertionEvaluator only grades verb-prefixed assertions (CONTAINS/NOT_CONTAINS/SKILL_LOADED), and correctness.prompty scores against expected_response without seeing individual assertions. Without LMChecklist, natural-language assertions (those without a verb prefix) are unscored. Adds LMChecklist (Common/SEVAL/LMChecklist.prompty) to all six test files using the exact name + passing_score=3 + priority=1 shape the LER author published. Loader registration is already proven against the new format in Dataverse-skills PR #71's draft validation runs (LocalEvalRunner builds 20289122 and 20290419).

…range

Empirically verified across builds 20293660 (passing_score=3, 0/3 baseline pass) vs 20294895 (passing_score=1, 3/3 baseline pass) on the validation harness PR #72. LMChecklist returns binary 0/1 scores so a threshold of 3 marks every assertion failed regardless of agent quality.

BUILD_BUILDNUMBER flows into LER as the runId and from there into the x-ms-correlation-id HTTP header on every LER -> BICEP request. Non-ASCII codepoints in the build name trigger System.Net.Http.HttpRequestException: Request headers must contain only ASCII characters in .NET HttpClient.WriteHeaderCollection, crashing the pipeline before any test is graded. Switches the visual separator from U+2022 BULLET to ASCII hyphen. Defense-in-depth tracked LER-side at https://microsoft.ghe.com/bic/LocalEvalRunner/issues/418.
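
The failure mode above can be reproduced in isolation. The sketch below locates the first non-ASCII codepoint in a would-be header value and applies the same bullet-to-hyphen substitution; the function names and the sample build name are hypothetical, and the actual fix in this PR changes the pipeline name rather than sanitizing at runtime.

```python
def first_non_ascii(value):
    """Return (index, codepoint) of the first non-ASCII character, or None if all ASCII."""
    for i, ch in enumerate(value):
        if ord(ch) > 127:
            return i, f"U+{ord(ch):04X}"
    return None


def replace_bullets(value):
    """Swap U+2022 BULLET for an ASCII hyphen, as the pipeline-name fix does."""
    return value.replace("\u2022", "-")


build_name = "DVSkills \u2022 20294895"  # hypothetical build name containing a bullet
print(first_non_ascii(build_name))       # (9, 'U+2022')
print(replace_bullets(build_name).isascii())  # True
```

A check like `first_non_ascii` only reports the offending character; it does not tell which header carried it, which is the open question the LER-side issue is meant to cover.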

…d + BICEP HTTP path enabled

Empirical test for LER issue #418. Strips U+2022 bullets from pipeline name (matches PR #70 fix), restores DeterministicAssertionEvaluator entry in dv_data.biceval.json, and removes the DisableEvaluationService feature flag so the LER -> BICEP POST happens for real. If pipeline passes -> consumer-side bullets fix is sufficient and LER #418 becomes defense-in-depth. If pipeline fails with the same ASCII headers exception -> the non-ASCII byte is coming from a header other than the correlation ID and LER #418 must ship.

saurabhrb requested a review from a team June 2, 2026 05:39

saurabhrb mentioned this pull request Jun 2, 2026

[Draft – DO NOT MERGE] validation: build LER from PR #416 to verify DeterministicAssertionEvaluator fix #71

Closed

saurabhrb mentioned this pull request Jun 2, 2026

[Draft - DO NOT MERGE] bad-PR validation harness (local-eval mode) #72

Closed

saurabhrb force-pushed the users/saurabhrb/eval-tests-deterministic branch from 9b800d5 to 8d81f51 Compare June 3, 2026 16:48

saurabhrb mentioned this pull request Jun 3, 2026

[Draft - DO NOT MERGE] validate bullets-fix sufficiency for LER ASCII-headers bug (#418) #79

Closed

Merge branch 'main' into users/saurabhrb/eval-tests-deterministic

9067127

saurabhrb wants to merge 5 commits into main from users/saurabhrb/eval-tests-deterministic
