[Draft – DO NOT MERGE] validation: build LER from PR #416 to verify DeterministicAssertionEvaluator fix by saurabhrb · Pull Request #71 · microsoft/Dataverse-skills

saurabhrb · 2026-06-02T15:53:23Z

[Draft – DO NOT MERGE] Validate LocalEvalRunner PR #416 fix end-to-end

This is a throwaway validation PR. It exists solely to run DVSkillsPlugin-Evals-PR against LER built from bic/LocalEvalRunner#416, so we can confirm that the DeterministicAssertionEvaluator service-only routing fix unblocks the new test format end-to-end. It will be closed unmerged once the validation run finishes.

Sibling PRs

microsoft/Dataverse-skills#70 — the actual test-files PR (untouched by this validation work)
bic/LocalEvalRunner#416 — the LER fix being validated

What this PR changes

.azdo/DataversePluginEvals_PR.yml
- resources.repositories.LocalEvalRunner.ref → refs/heads/users/sbadenkal/service-only-deterministic-assertion (PR #416 branch).
- Adds a "checkout: LocalEvalRunner" step so the LER source is on the agent.
- Passes useBuildFromSource: true and sourceRepoPath: '$(Build.SourcesDirectory)/LocalEvalRunner' to BicEval-SetupTemplate so the agent builds LER from the PR #416 branch instead of installing the published NuGet (which doesn't have the fix).
- The sourceRepoPath parameter itself is added by PR #416 — its default is $(Build.SourcesDirectory) so existing callers (LER-as-self) are unaffected.
evals/tests/dv_data.biceval.json: copied verbatim from PR #70 ("add: eval tests using new deterministic-evaluator format (3-4 per skill)") so the run exercises the CortexConfigurations:Common/DeterministicAssertionEvaluator + settings.supported_verbs path that crashed build 20283404.

Expected outcome

❌ Before #416: BadEvalRunnerInputException: Local CortexConfigurations evaluator not found at evaluator load — 0/0 tests run.
✅ With #416: LER routes DeterministicAssertionEvaluator as service-only, forwards the request to BICEP, and the pipeline grades the dv_data tests normally.
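
The before/after behaviour can be pictured with a small Python sketch. Everything in it is hypothetical: SERVICE_ONLY_EVALUATORS, LOCAL_EVALUATORS, evaluator_name, resolve_evaluator, and the use of ValueError in place of BadEvalRunnerInputException are illustrative stand-ins, not LER's actual implementation. It only shows the idea that a service-only evaluator skips the local lookup that failed at evaluator load.

```python
# Hypothetical stand-ins; none of these names come from LocalEvalRunner.
SERVICE_ONLY_EVALUATORS = {"DeterministicAssertionEvaluator"}
LOCAL_EVALUATORS = {"SomeLocalEvaluator"}


def evaluator_name(reference: str) -> str:
    """Return the last segment of a reference such as
    'CortexConfigurations:Common/DeterministicAssertionEvaluator'."""
    return reference.rsplit("/", 1)[-1].rsplit(":", 1)[-1]


def resolve_evaluator(reference: str, service_only_routing: bool) -> str:
    """Decide where an evaluator runs.

    With service_only_routing=False (the assumed pre-fix behaviour) every
    evaluator must exist locally, so a service-only one fails at load time.
    With service_only_routing=True it is forwarded to the service instead.
    """
    name = evaluator_name(reference)
    if service_only_routing and name in SERVICE_ONLY_EVALUATORS:
        return "service"
    if name in LOCAL_EVALUATORS:
        return "local"
    # Stand-in for the load-time failure described above.
    raise ValueError(f"Local evaluator not found: {reference}")
```

Calling resolve_evaluator with the DeterministicAssertionEvaluator reference and service_only_routing=False raises, which mirrors the 0/0 tests run outcome; with True it returns "service".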

After validation

If the run goes green → comment evidence on PR #416 and PR #70 ("add: eval tests using new deterministic-evaluator format (3-4 per skill)"), then close this PR unmerged.
If the run still fails → diagnose, push fixes to PR #416, re-run here.
The Dataverse-skills main branch never sees these .azdo changes.

…rtionEvaluator fix end-to-end

DO NOT MERGE. This is a throwaway validation PR. Points the pipeline's LocalEvalRunner repo ref to https://microsoft.ghe.com/bic/LocalEvalRunner/pull/416 (users/sbadenkal/service-only-deterministic-assertion), checks out LER source explicitly, and enables useBuildFromSource so the pipeline runs LER built from the PR-416 branch instead of the published NuGet.

Also copies the new-format dv_data.biceval.json from PR #70 so the run actually exercises the DeterministicAssertionEvaluator code path. PR #70 is untouched.

Expected outcome: pipeline succeeds where build 20283404 crashed at evaluator load.

saurabhrb · 2026-06-02T15:53:43Z

Switching to local repro (Option 2) instead of pipeline-based validation. Closing without merge as planned.

saurabhrb · 2026-06-02T15:56:02Z

Reopening — switching back to pipeline-grounded validation for 100% confidence (vs local-only repro). LER PR #416 still has the template sourceRepoPath param this PR needs.

…actually get graded

Per Lekina's feedback on the PR 393 thread: DeterministicAssertionEvaluator only grades verb-prefixed assertions (CONTAINS/NOT_CONTAINS/SKILL_LOADED), and correctness.prompty scores against expected_response without seeing individual assertions. Natural-language assertions were therefore silently unscored. Adding LMChecklist.prompty (score 0/1 per assertion, threshold 1) closes that gap. passing_score=1 matches the prompty's documented range.

…t graded

Per LER PR-393 author guidance: DeterministicAssertionEvaluator only grades verb-prefixed assertions (CONTAINS/NOT_CONTAINS/SKILL_LOADED), and correctness.prompty scores against expected_response without seeing individual assertions. Without LMChecklist, natural-language assertions (those without a verb prefix) are unscored.

Adds LMChecklist (Common/SEVAL/LMChecklist.prompty) to all six test files, using the exact name, passing_score=3, and priority=1 shape the LER author published. Loader registration is already proven against the new format in the draft validation runs of Dataverse-skills PR #71 (LocalEvalRunner builds 20289122 and 20290419).
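
To illustrate the gap described in the two commit messages above, here is a minimal Python sketch of how assertions would split between the two graders. It is not LER code: has_verb_prefix, partition_assertions, and the sample strings are made up for illustration, and the matching rule (verb as first token, followed by whitespace, a colon, or end of string) is an assumption. Only the three verbs come from the feedback quoted above.

```python
from typing import List, Tuple

# Verbs named in the PR-393 feedback; anything else counts as natural language.
DETERMINISTIC_VERBS = ("CONTAINS", "NOT_CONTAINS", "SKILL_LOADED")


def has_verb_prefix(assertion: str) -> bool:
    """Return True if the assertion starts with one of the known verbs.

    Assumption: the verb is the first token and is followed by whitespace,
    a colon, or the end of the string. The real LER syntax may differ.
    """
    text = assertion.strip()
    for verb in DETERMINISTIC_VERBS:
        if text == verb:
            return True
        if text.startswith(verb) and text[len(verb)] in " \t:":
            return True
    return False


def partition_assertions(assertions: List[str]) -> Tuple[List[str], List[str]]:
    """Split assertions into (verb-prefixed, natural-language) lists.

    The first list is what a deterministic grader could score; the second
    is what would go unscored without a per-assertion LLM checklist.
    """
    deterministic: List[str] = []
    natural: List[str] = []
    for assertion in assertions:
        if has_verb_prefix(assertion):
            deterministic.append(assertion)
        else:
            natural.append(assertion)
    return deterministic, natural
```

Any test whose second list is non-empty is one where a checklist-style evaluator is needed for those assertions to be graded at all.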

saurabhrb · 2026-06-02T18:36:56Z

Validation complete. Pushed the LMChecklist addition to PR #70 across all six test files (commit 06d5524). Both LER PR #416 fixes are proven via builds 20289122 and 20290419 (full details on https://microsoft.ghe.com/bic/LocalEvalRunner/pull/416). A separate, pre-existing LER HTTP-headers bug surfaced once the load phase passed and will be filed independently. Closing this harness PR as planned.

saurabhrb closed this Jun 2, 2026

saurabhrb reopened this Jun 2, 2026

saurabhrb closed this Jun 2, 2026

saurabhrb mentioned this pull request Jun 2, 2026

[Draft - DO NOT MERGE] bad-PR validation harness (local-eval mode) #72

Closed

saurabhrb wants to merge 2 commits into main from users/saurabhrb/validate-ler-pr416
