[Draft – DO NOT MERGE] validation: build LER from PR #416 to verify DeterministicAssertionEvaluator fix #71

Closed
saurabhrb wants to merge 2 commits into main from users/saurabhrb/validate-ler-pr416


@saurabhrb

Contributor

[Draft – DO NOT MERGE] Validate LocalEvalRunner PR #416 fix end-to-end

This is a throwaway validation PR. It exists solely to run DVSkillsPlugin-Evals-PR against LER built from bic/LocalEvalRunner#416 so we can confirm the DeterministicAssertionEvaluator service-only routing fix unblocks the new test format end-to-end. Will be closed unmerged once the validation run finishes.

Sibling PRs

What this PR changes

  1. .azdo/DataversePluginEvals_PR.yml
    • resources.repositories.LocalEvalRunner.ref → refs/heads/users/sbadenkal/service-only-deterministic-assertion (PR #416 branch).
    • Adds - checkout: LocalEvalRunner so the LER source is on the agent.
    • Passes useBuildFromSource: true and sourceRepoPath: '$(Build.SourcesDirectory)/LocalEvalRunner' to BicEval-SetupTemplate so the agent builds LER from the PR #416 branch instead of installing the published NuGet (which doesn't have the fix).
    • The sourceRepoPath parameter itself is added by PR #416 — its default is $(Build.SourcesDirectory) so existing callers (LER-as-self) are unaffected.
  2. evals/tests/dv_data.biceval.json: copied verbatim from PR #70 ("add: eval tests using new deterministic-evaluator format (3-4 per skill)") so the run exercises the CortexConfigurations:Common/DeterministicAssertionEvaluator + settings.supported_verbs path that crashed build 20283404.
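
The setup choice in item 1 can be sketched roughly as below. This is an illustration in Python, not the BicEval-SetupTemplate itself: `plan_ler_setup` and the dictionary it returns are hypothetical, and only the two parameter names and the default value come from the description above.

```python
# Default taken from the PR description; everything else here is illustrative.
DEFAULT_SOURCE_REPO_PATH = "$(Build.SourcesDirectory)"


def plan_ler_setup(use_build_from_source=False,
                   source_repo_path=DEFAULT_SOURCE_REPO_PATH):
    """Hypothetical helper: decide how the agent obtains LER."""
    if use_build_from_source:
        # Build LER from the checked-out source tree at source_repo_path.
        return {"mode": "build-from-source", "path": source_repo_path}
    # Otherwise fall back to the published NuGet package.
    return {"mode": "published-nuget", "path": None}


# Existing caller (LER building itself): the default path is used unchanged.
self_build = plan_ler_setup(use_build_from_source=True)

# This PR: LER is checked out as a secondary repository, so the path is overridden.
harness = plan_ler_setup(
    use_build_from_source=True,
    source_repo_path="$(Build.SourcesDirectory)/LocalEvalRunner",
)
```

Because the new parameter defaults to the old behaviour, callers that never pass `sourceRepoPath` should see no change.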

Expected outcome

  • ❌ Before #416: BadEvalRunnerInputException: Local CortexConfigurations evaluator not found at evaluator load — 0/0 tests run.
  • ✅ With #416: LER routes DeterministicAssertionEvaluator as service-only, forwards the request to BICEP, and the pipeline grades the dv_data tests normally.
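
A minimal sketch of the before/after behaviour, assuming the fix amounts to checking a service-only list before looking for a local evaluator. The function `route_evaluator`, both sets, and the `Correctness` entry are hypothetical; only the evaluator name and the exception name appear in the build output quoted above.

```python
class BadEvalRunnerInputException(Exception):
    """Stand-in for the exception reported by the failing build."""


# Assumption: evaluators that have a local implementation (hypothetical contents).
LOCAL_EVALUATORS = {"Correctness"}
# Assumption: evaluators that must be forwarded to the service.
SERVICE_ONLY_EVALUATORS = {"DeterministicAssertionEvaluator"}


def route_evaluator(name, service_only_routing):
    """Return 'service' or 'local', or raise if the evaluator cannot be loaded."""
    if service_only_routing and name in SERVICE_ONLY_EVALUATORS:
        # With the fix: skip the local lookup and forward the request.
        return "service"
    if name in LOCAL_EVALUATORS:
        return "local"
    # Without the fix, a service-only evaluator falls through to this error.
    raise BadEvalRunnerInputException(
        f"Local CortexConfigurations evaluator not found: {name}"
    )
```

With `service_only_routing=False` the call for `DeterministicAssertionEvaluator` raises at load time, which matches the 0/0 result; with `True` it returns `"service"`.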

After validation

…rtionEvaluator fix end-to-end

DO NOT MERGE. This is a throwaway validation PR. Points the pipeline's LocalEvalRunner repo ref to https://microsoft.ghe.com/bic/LocalEvalRunner/pull/416 (users/sbadenkal/service-only-deterministic-assertion), checks out LER source explicitly, and enables useBuildFromSource so the pipeline runs LER built from the PR-416 branch instead of the published NuGet. Also copies the new-format dv_data.biceval.json from PR #70 so the run actually exercises the DeterministicAssertionEvaluator code path. PR #70 is untouched. Expected outcome: pipeline succeeds where build 20283404 crashed at evaluator load.
@saurabhrb

Contributor Author

Switching to local repro (Option 2) instead of pipeline-based validation. Closing without merge as planned.

@saurabhrb saurabhrb closed this Jun 2, 2026
@saurabhrb

Contributor Author

Reopening — switching back to pipeline-grounded validation for 100% confidence (vs local-only repro). LER PR #416 still has the template sourceRepoPath param this PR needs.

@saurabhrb saurabhrb reopened this Jun 2, 2026
…actually get graded

Per Lekina's feedback on PR 393 thread: DeterministicAssertionEvaluator only grades verb-prefixed assertions (CONTAINS/NOT_CONTAINS/SKILL_LOADED); correctness.prompty scores against expected_response without seeing individual assertions; therefore natural-language assertions were silently unscored. Adding LMChecklist.prompty (score 0/1 per-assertion, threshold 1) closes that gap. passing_score=1 matches the prompty's documented range.
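
The gap described above can be illustrated with a small sketch that splits a test's assertions into those a deterministic evaluator would grade and those left for an LM-based checklist. The three verbs come from the commit message; the helper names and the assumed syntax (verb as the first whitespace-delimited token, with an optional trailing colon) are hypothetical and may not match the real test format.

```python
# Verbs named in the commit message.
VERBS = ("CONTAINS", "NOT_CONTAINS", "SKILL_LOADED")


def verb_of(assertion):
    """Return the leading verb, or None for a natural-language assertion."""
    tokens = assertion.split(None, 1)
    if not tokens:
        return None
    head = tokens[0].rstrip(":")  # assumption: an optional colon follows the verb
    return head if head in VERBS else None


def split_assertions(assertions):
    """Partition assertions into (deterministic, natural_language) lists."""
    deterministic, natural = [], []
    for assertion in assertions:
        target = deterministic if verb_of(assertion) else natural
        target.append(assertion)
    return deterministic, natural


# Example strings are made up for illustration.
graded, ungraded = split_assertions([
    "CONTAINS: account",
    "The reply explains the error clearly",
    "SKILL_LOADED dv_data",
])
```

Anything that lands in the second list is what would go unscored without a per-assertion evaluator such as LMChecklist.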
saurabhrb pushed a commit that referenced this pull request Jun 2, 2026
…t graded

Per LER PR-393 author guidance: DeterministicAssertionEvaluator only grades verb-prefixed assertions (CONTAINS/NOT_CONTAINS/SKILL_LOADED), and correctness.prompty scores against expected_response without seeing individual assertions. Without LMChecklist, natural-language assertions (those without a verb prefix) are unscored. Adds LMChecklist (Common/SEVAL/LMChecklist.prompty) to all six test files using the exact name + passing_score=3 + priority=1 shape the LER author published. Loader registration is already proven against the new format in Dataverse-skills PR #71's draft validation runs (LocalEvalRunner builds 20289122 and 20290419).
@saurabhrb

Contributor Author

Validation complete. Pushed LMChecklist addition to PR #70 across all six test files (commit 06d5524). Both LER PR #416 fixes proven via builds 20289122 and 20290419 (full details on https://microsoft.ghe.com/bic/LocalEvalRunner/pull/416). A separate, pre-existing LER HTTP-headers bug surfaced after the load phase passes and will be filed independently. Closing this harness PR as planned.
