add: eval tests using new deterministic-evaluator format (3-4 per skill) #70
Open
saurabhrb wants to merge 5 commits into
…/query/solution; convert dv_data to new format Adds 3-4 tests per skill using the new generic deterministic-evaluator format introduced in LocalEvalRunner: each test file now enables CortexConfigurations:Common/DeterministicAssertionEvaluator with settings.supported_verbs=CONTAINS,NOT_CONTAINS,SKILL_LOADED alongside the correctness.prompty semantic judge. Ports the dev/evalsV0 baseline tests (connect_001, metadata_001, overview_001, query_001, solution_001) and adds 2-3 natural follow-up tests per skill covering env-file contract, no-hardcoded-secrets, schema create + lookup relationship, filtered/aggregate reads, and solution unpack/import/routing-trap. dv_data.biceval.json is updated in place to add the deterministic evaluator.
…t graded Per LER PR-393 author guidance: DeterministicAssertionEvaluator only grades verb-prefixed assertions (CONTAINS/NOT_CONTAINS/SKILL_LOADED), and correctness.prompty scores against expected_response without seeing individual assertions. Without LMChecklist, natural-language assertions (those without a verb prefix) are unscored. Adds LMChecklist (Common/SEVAL/LMChecklist.prompty) to all six test files using the exact name + passing_score=3 + priority=1 shape the LER author published. Loader registration is already proven against the new format in Dataverse-skills PR #71's draft validation runs (LocalEvalRunner builds 20289122 and 20290419).
…range Empirically verified across builds 20293660 (passing_score=3, 0/3 baseline pass) vs 20294895 (passing_score=1, 3/3 baseline pass) on the validation harness PR #72. LMChecklist returns binary 0/1 scores so a threshold of 3 marks every assertion failed regardless of agent quality.
This was referenced Jun 3, 2026
BUILD_BUILDNUMBER flows into LER as the runId and from there into the x-ms-correlation-id HTTP header on every LER -> BICEP request. Non-ASCII codepoints in the build name trigger System.Net.Http.HttpRequestException: Request headers must contain only ASCII characters in .NET HttpClient.WriteHeaderCollection, crashing the pipeline before any test is graded. Switches the visual separator from U+2022 BULLET to ASCII hyphen. Defense-in-depth tracked LER-side at https://microsoft.ghe.com/bic/LocalEvalRunner/issues/418.
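The failure mode in this commit message can be checked outside the pipeline. The sketch below is in Python rather than the .NET stack the commit refers to. It tests whether a build name is pure ASCII and swaps U+2022 for an ASCII hyphen. The helper names and the sample build name are hypothetical and are not part of LER or the pipeline definition.

```python
BULLET = "\u2022"  # U+2022 BULLET, the separator this commit removes


def is_header_safe(value: str) -> bool:
    """Return True when every character is ASCII, the constraint the exception enforces."""
    return value.isascii()


def sanitize_build_name(name: str) -> str:
    """Hypothetical helper: replace the bullet separator with an ASCII hyphen."""
    return name.replace(BULLET, "-")


# Illustrative build name only; the real pipeline name is not shown in this PR.
build_name = f"DVSkills {BULLET} PR-70 {BULLET} 20293660"

assert not is_header_safe(build_name)   # would be rejected as a header value
clean = sanitize_build_name(build_name)
assert is_header_safe(clean)            # safe to forward as a correlation ID
print(clean)                            # DVSkills - PR-70 - 20293660
```

A stricter variant could reject any non-ASCII character instead of replacing a single codepoint; which behavior LER should adopt is the open question tracked in the linked issue.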
9b800d5 to
8d81f51
Compare
saurabhrb pushed a commit that referenced this pull request Jun 3, 2026
…d + BICEP HTTP path enabled Empirical test for LER issue #418. Strips U+2022 bullets from pipeline name (matches PR #70 fix), restores DeterministicAssertionEvaluator entry in dv_data.biceval.json, and removes the DisableEvaluationService feature flag so the LER -> BICEP POST happens for real. If pipeline passes -> consumer-side bullets fix is sufficient and LER #418 becomes defense-in-depth. If pipeline fails with the same ASCII headers exception -> the non-ASCII byte is coming from a header other than the correlation ID and LER #418 must ship.
Summary
Adds 3-4 eval tests per skill to evals/tests/ using the new generic deterministic-evaluator format introduced in LER PR #393 (settings passthrough to BICEP), plus the per-assertion LMChecklist evaluator recommended by the LER author so natural-language assertions actually get graded.
Each test file enables three evaluators:
CortexConfigurations:Common/DeterministicAssertionEvaluator (with settings.supported_verbs = "CONTAINS,NOT_CONTAINS,SKILL_LOADED")
CortexConfigurations:Common/Skills/correctness.prompty (semantic judge, scored against expected_response)
CortexConfigurations:Common/SEVAL/LMChecklist.prompty
Files
dv_connect.biceval.json
dv_data.biceval.json
dv_metadata.biceval.json
dv_overview.biceval.json
dv_query.biceval.json
dv_solution.biceval.json
Source
Ported from a prior baseline test set (one baseline test per skill) and extended with 2-3 natural follow-up tests per skill covering common regressions (skill-contract drift, routing-violation traps, antipattern phrasing).
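To illustrate how the three evaluators divide the work, the sketch below separates verb-prefixed assertions from natural-language ones and grades only the former. It assumes a "VERB: argument" assertion syntax and simple substring semantics; the real DeterministicAssertionEvaluator format and matching rules are not shown in this PR, so every function name here is hypothetical.

```python
SUPPORTED_VERBS = ("CONTAINS", "NOT_CONTAINS", "SKILL_LOADED")


def parse(assertion):
    """Return (verb, argument) for a verb-prefixed assertion, else None.

    Assumes a 'VERB: argument' shape, which is an illustration only.
    """
    verb, sep, arg = assertion.partition(":")
    verb = verb.strip()
    if sep and verb in SUPPORTED_VERBS:
        return verb, arg.strip()
    return None


def grade(assertion, response, loaded_skills):
    """Grade deterministically, or return None to leave it for an LM judge."""
    parsed = parse(assertion)
    if parsed is None:
        return None  # natural language: unscored without a per-assertion judge
    verb, arg = parsed
    if verb == "CONTAINS":
        return arg in response
    if verb == "NOT_CONTAINS":
        return arg not in response
    return arg in loaded_skills  # SKILL_LOADED


response = "Run pac solution export to export the solution."
assertions = [
    "CONTAINS: pac solution export",
    "NOT_CONTAINS: client.solutions.export",
    "SKILL_LOADED: dv-solution",
    "Explains why the CLI is preferred over an SDK call",
]
results = [grade(a, response, {"dv-solution"}) for a in assertions]
print(results)  # [True, True, True, None]
```

The `None` result is the case the LMChecklist evaluator exists to cover: without it, that assertion would contribute nothing to the grade.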
Cross-repo dependencies (CI gate context)
The full PR-gate pipeline (DVSkillsPlugin-Evals-PR) requires two LER changes to ship before it can grade these files against the real BIC Evaluation Service:
Adding DeterministicAssertionEvaluator to the default ServiceOnlyEvaluatorNames so LER routes it service-side instead of crashing at local-file lookup. Without this, the gate dies before any test runs.
Resolving the HTTP header exception (Request headers must contain only ASCII characters) on the POST /offlineEvaluation/async path, which needs a separate LER PR. The exception surfaces after #416 lands.
Until both ship, the DVSkillsPlugin-Evals-PR check on this PR is expected to be red; that is the cross-repo dependency, not a content issue. Static checks (python .github/evals/static_checks.py) pass on every commit.
End-to-end validation via a separate harness: proven signal
To prove the test format actually grades regressions correctly now, without waiting for the LER fix chain, a throwaway harness PR pinned the pipeline to (a) build LER from the PR #416 branch and (b) pass --feature DisableEvaluationService so LER grades locally via CAPI/AOAI instead of via the broken HTTP path. Three bad PRs branched off that harness, each introducing one isolated SKILL.md regression.
dv-data bulk-create (replace CreateMultiple + adaptive chunking with per-record loop + "no batch API" phrasing): data_003_skill_contract fails (LMChecklist 4/5, Score 2.80 < threshold 3), caught at skill-contract level. data_002_bulk_create still passed because the agent fell back on model prior knowledge for CreateMultiple even though SKILL no longer mentions it, which is exactly why the skill-contract test exists alongside the behavior test.
dv-solution (replace pac solution export with fictitious client.solutions.export() SDK call): solution_001_export (LMChecklist 0/5, Score 1.30) and solution_004_routing_violation_trap (LMChecklist 3/6, Score 1.58) both fail. Agent followed the bad skill and used the hallucinated SDK call, exactly the routing-tier regression the trap test was designed to catch. 12 P1 assertion failures.
dv-connect (inject "Quick start" snippet with hardcoded client_secret="abc123…"): connect_003_no_hardcoded_secrets fails as designed (LMChecklist 3/5, Score 1.30). Bleed-over into connect_001 (Score 2.50) and connect_002_env_file (Score 2.75): agent picked up the bad pattern from the top of the skill and propagated it into general connect requests too. 9 P1 assertion failures.
Interpretation:
The harness PR and the three bad-PR drafts will be closed unmerged once this evidence section is reviewed; they exist only to prove the gate has signal before the real LER chain ships.
What the empirical run also caught — local fix included in this PR
Initial harness baseline (build 20293660) failed 0/3 even on the unchanged plugin. Root cause: LMChecklist scores assertions on a 0/1 scale, but the entry started with passing_score: 3 (matching the other two evaluators). With threshold 3 and max score 1, every LMChecklist assertion was marked failed regardless of agent quality.
Lowered to passing_score: 1 (commit 1796ac7); next baseline run (build 20294895) went 3/3 green and the bad-PR runs above produced clean signal. Fix mirrored into all six files in this PR.
Validation summary
python .github/evals/static_checks.py → PASSED (8 skill files, 45 Python blocks, 8 categories)
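The passing_score mismatch described under "What the empirical run also caught" comes down to simple arithmetic. The sketch below assumes an assertion passes when its score is greater than or equal to passing_score; that comparison rule is an assumption for illustration, and the function name and sample scores are hypothetical.

```python
def assertion_passes(score: float, passing_score: float) -> bool:
    """Assumed rule: an assertion passes when score >= passing_score."""
    return score >= passing_score


# Hypothetical binary scores of the kind a 0/1 checklist evaluator returns.
binary_scores = [1, 1, 0, 1, 1]

# With passing_score=3 a 0/1 score can never reach the threshold.
passed_at_3 = sum(assertion_passes(s, 3) for s in binary_scores)

# With passing_score=1 every assertion scored 1 passes.
passed_at_1 = sum(assertion_passes(s, 1) for s in binary_scores)

print(passed_at_3, passed_at_1)  # 0 4
```

Under this assumption a threshold of 3 fails every assertion no matter how good the agent output is, which matches the 0/3 baseline the harness observed.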