[Draft – DO NOT MERGE] validation: build LER from PR #416 to verify DeterministicAssertionEvaluator fix #71
Closed
saurabhrb wants to merge 2 commits into
…rtionEvaluator fix end-to-end

DO NOT MERGE. This is a throwaway validation PR. Points the pipeline's LocalEvalRunner repo ref to https://microsoft.ghe.com/bic/LocalEvalRunner/pull/416 (users/sbadenkal/service-only-deterministic-assertion), checks out LER source explicitly, and enables useBuildFromSource so the pipeline runs LER built from the PR-416 branch instead of the published NuGet. Also copies the new-format dv_data.biceval.json from PR #70 so the run actually exercises the DeterministicAssertionEvaluator code path. PR #70 is untouched. Expected outcome: pipeline succeeds where build 20283404 crashed at evaluator load.
Switching to local repro (Option 2) instead of pipeline-based validation. Closing without merge as planned.
Reopening: switching back to pipeline-grounded validation for 100% confidence (vs local-only repro). LER PR #416 still has the template sourceRepoPath param this PR needs.
…actually get graded

Per Lekina's feedback on PR 393 thread: DeterministicAssertionEvaluator only grades verb-prefixed assertions (CONTAINS/NOT_CONTAINS/SKILL_LOADED); correctness.prompty scores against expected_response without seeing individual assertions; therefore natural-language assertions were silently unscored. Adding LMChecklist.prompty (score 0/1 per-assertion, threshold 1) closes that gap. passing_score=1 matches the prompty's documented range.
saurabhrb pushed a commit that referenced this pull request Jun 2, 2026
…t graded

Per LER PR-393 author guidance: DeterministicAssertionEvaluator only grades verb-prefixed assertions (CONTAINS/NOT_CONTAINS/SKILL_LOADED), and correctness.prompty scores against expected_response without seeing individual assertions. Without LMChecklist, natural-language assertions (those without a verb prefix) are unscored. Adds LMChecklist (Common/SEVAL/LMChecklist.prompty) to all six test files using the exact name + passing_score=3 + priority=1 shape the LER author published. Loader registration is already proven against the new format in Dataverse-skills PR #71's draft validation runs (LocalEvalRunner builds 20289122 and 20290419).
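The gap described in the commit messages above can be illustrated with a minimal sketch. It is not LER's implementation: the three verbs come from the commit text, while the function names, the case-sensitive match, and the "verb followed by whitespace or a colon" rule are assumptions made for illustration only.

```python
# Verbs named in the commit message; everything else here is hypothetical.
DETERMINISTIC_VERBS = ("CONTAINS", "NOT_CONTAINS", "SKILL_LOADED")


def is_verb_prefixed(assertion: str) -> bool:
    """Return True if the assertion starts with one of the known verbs.

    Assumption: the verb is the first whitespace-delimited token,
    optionally followed by a colon, and matching is case-sensitive.
    """
    text = assertion.strip()
    if not text:
        return False
    first_token = text.split(maxsplit=1)[0]
    return first_token.rstrip(":") in DETERMINISTIC_VERBS


def split_assertions(assertions):
    """Partition assertions into (deterministic, natural_language).

    The first list is what a verb-based evaluator could grade; the second
    is what would stay unscored without a checklist-style evaluator.
    """
    deterministic, natural_language = [], []
    for assertion in assertions:
        if is_verb_prefixed(assertion):
            deterministic.append(assertion)
        else:
            natural_language.append(assertion)
    return deterministic, natural_language
```

Running `split_assertions` over a test file's assertions would show how many fall into the second list, which is the set that LMChecklist is meant to cover.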
Validation complete. Pushed LMChecklist addition to PR #70 across all six test files (commit 06d5524). Both LER PR #416 fixes proven via builds 20289122 and 20290419 (full details on https://microsoft.ghe.com/bic/LocalEvalRunner/pull/416). A separate, pre-existing LER HTTP-headers bug surfaced after the load phase passes and will be filed independently. Closing this harness PR as planned.
[Draft – DO NOT MERGE] Validate LocalEvalRunner PR #416 fix end-to-end
This is a throwaway validation PR. It exists solely to run `DVSkillsPlugin-Evals-PR` against LER built from bic/LocalEvalRunner#416 so we can confirm the `DeterministicAssertionEvaluator` service-only routing fix unblocks the new test format end-to-end. Will be closed unmerged once the validation run finishes.

Sibling PRs
What this PR changes
`.azdo/DataversePluginEvals_PR.yml`
- `resources.repositories.LocalEvalRunner.ref` → `refs/heads/users/sbadenkal/service-only-deterministic-assertion` (PR #416 branch).
- Adds `- checkout: LocalEvalRunner` so the LER source is on the agent.
- Passes `useBuildFromSource: true` and `sourceRepoPath: '$(Build.SourcesDirectory)/LocalEvalRunner'` to `BicEval-SetupTemplate` so the agent builds LER from the PR #416 branch instead of installing the published NuGet (which doesn't have the fix).
- The `sourceRepoPath` parameter itself is added by PR #416; its default is `$(Build.SourcesDirectory)`, so existing callers (LER-as-self) are unaffected.

`evals/tests/dv_data.biceval.json`: copied verbatim from PR #70 (add: eval tests using new deterministic-evaluator format (3-4 per skill)) so the run exercises the `CortexConfigurations:Common/DeterministicAssertionEvaluator` + `settings.supported_verbs` path that crashed build 20283404.

Expected outcome
- Without the fix (build 20283404): `BadEvalRunnerInputException: Local CortexConfigurations evaluator not found` at evaluator load; 0/0 tests run.
- With the fix: LER treats `DeterministicAssertionEvaluator` as service-only, forwards the request to BICEP, and the pipeline grades the dv_data tests normally.

After validation
The `main` branch never sees these `.azdo` changes.