BAD PR 2 (DO NOT MERGE): regress dv-solution SKILL - hallucinated SDK call by saurabhrb · Pull Request #74 · microsoft/Dataverse-skills

saurabhrb · 2026-06-02T20:55:48Z

Intentional regression, branched off PR #72 (harness). Replaces pac solution export with a fictitious client.solutions.export() Python SDK call. The pipeline testFile is flipped to dv_solution.biceval.json. Expected to FAIL: solution_004_routing_violation_trap plus most PRIORITY_1 CONTAINS:pac assertions. Closes unmerged after evidence is captured.

…/query/solution; convert dv_data to new format

Adds 3-4 tests per skill using the new generic deterministic-evaluator format introduced in LocalEvalRunner. Each test file now enables CortexConfigurations:Common/DeterministicAssertionEvaluator with settings.supported_verbs=CONTAINS,NOT_CONTAINS,SKILL_LOADED alongside the correctness.prompty semantic judge.

Ports the dev/evalsV0 baseline tests (connect_001, metadata_001, overview_001, query_001, solution_001) and adds 2-3 natural follow-up tests per skill covering the env-file contract, no-hardcoded-secrets, schema create + lookup relationship, filtered/aggregate reads, and solution unpack/import/routing-trap. dv_data.biceval.json is updated in place to add the deterministic evaluator.
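For readers unfamiliar with verb-prefixed assertions, here is a minimal sketch of how the three supported verbs could be graded. It is not LocalEvalRunner's implementation. The `VERB:argument` string shape follows the assertions quoted in this PR (for example `CONTAINS:pac`), while case-sensitive substring matching, the `Verdict` type, and the `loaded_skills` set are assumptions made for illustration.

```python
from dataclasses import dataclass

# Verbs listed in settings.supported_verbs in the commit message.
SUPPORTED_VERBS = ("CONTAINS", "NOT_CONTAINS", "SKILL_LOADED")


@dataclass
class Verdict:
    """Hypothetical result record: the assertion text and whether it passed."""
    assertion: str
    passed: bool


def evaluate(assertion: str, response: str, loaded_skills: set[str]) -> Verdict:
    """Grade one assertion of the form 'VERB:argument' (assumed format)."""
    verb, sep, arg = assertion.partition(":")
    if not sep or verb not in SUPPORTED_VERBS:
        raise ValueError(f"not a supported verb-prefixed assertion: {assertion!r}")
    if verb == "CONTAINS":
        passed = arg in response          # assumed: case-sensitive substring match
    elif verb == "NOT_CONTAINS":
        passed = arg not in response
    else:  # SKILL_LOADED
        passed = arg in loaded_skills     # assumed: skills tracked as a set of names
    return Verdict(assertion, passed)


if __name__ == "__main__":
    response = "Run pac solution export to export the solution."
    for a in ("CONTAINS:pac", "NOT_CONTAINS:client.solutions.export", "SKILL_LOADED:dv-solution"):
        print(evaluate(a, response, {"dv-solution"}))
```

Because every check is a plain string or set lookup, the same response always yields the same verdict, which is what lets these assertions sit alongside the semantic judge as a deterministic gate.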

…t graded

Per LER PR-393 author guidance: DeterministicAssertionEvaluator only grades verb-prefixed assertions (CONTAINS/NOT_CONTAINS/SKILL_LOADED), and correctness.prompty scores against expected_response without seeing individual assertions. Without LMChecklist, natural-language assertions (those without a verb prefix) are unscored.

Adds LMChecklist (Common/SEVAL/LMChecklist.prompty) to all six test files using the exact name + passing_score=3 + priority=1 shape the LER author published. Loader registration is already proven against the new format in Dataverse-skills PR #71's draft validation runs (LocalEvalRunner builds 20289122 and 20290419).
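The grading gap described above can be sketched as a simple split: assertions with a known verb prefix go to the deterministic evaluator, and everything else needs a model-graded checklist. The function names and the sample natural-language assertion below are hypothetical, and the prefix test is an assumption about how such a split might be made, not LER's actual loader logic.

```python
# Verbs that the deterministic evaluator grades, per the commit message.
VERBS = ("CONTAINS", "NOT_CONTAINS", "SKILL_LOADED")


def is_verb_prefixed(assertion: str) -> bool:
    """True when the text before the first ':' is one of the supported verbs."""
    verb, sep, _ = assertion.partition(":")
    return bool(sep) and verb in VERBS


def route(assertions: list[str]) -> tuple[list[str], list[str]]:
    """Split assertions into (deterministic, natural-language) lists."""
    deterministic = [a for a in assertions if is_verb_prefixed(a)]
    natural = [a for a in assertions if not is_verb_prefixed(a)]
    return deterministic, natural


if __name__ == "__main__":
    sample = [
        "CONTAINS:pac",
        "NOT_CONTAINS:client.solutions.export",
        "Routes solution export to the PAC CLI",  # hypothetical natural-language assertion
    ]
    deterministic, natural = route(sample)
    print("deterministic:", deterministic)
    print("needs LMChecklist:", natural)
```

Without a grader for the second list, those assertions would be silently skipped, which is the gap LMChecklist is added to close.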

…ionService for local-eval mode

Bad-PR validation harness. Routes around the pre-existing LER->BICEP ASCII-headers bug by forcing local evaluation: LMChecklist.prompty grades each assertion (verb-prefixed and natural-language alike) via CAPI/AOAI locally.

Drops the DeterministicAssertionEvaluator entry because in DisableEvaluationService mode it would fall through to a local file lookup and crash (same root issue LER PR #416 fixes for the service-enabled path). Harness PR is draft and closes unmerged; PR #70 (test files) and PR #416 (LER fixes) are untouched.

…iles for local-eval mode

…ing range

… with hallucinated SDK call

Intentional regression for bad-PR gate validation. Replaces 'pac solution export' guidance with a fictitious 'client.solutions.export()' SDK call. Also points the pipeline default testFile to dv_solution.biceval.json so the relevant tests run.

Expected to fail: solution_004_routing_violation_trap (PRIORITY_1 routes-to-PAC + NOT_CONTAINS:client.solutions.export). Other solution tests likely also fail their CONTAINS:pac assertions.
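To make the expected failure concrete, the sketch below runs the two assertions named in this PR against a baseline answer and a regressed answer. Both response strings are made up for illustration, and plain substring matching is an assumption about how CONTAINS and NOT_CONTAINS behave.

```python
# The two assertions quoted in the commit message.
ASSERTIONS = ["CONTAINS:pac", "NOT_CONTAINS:client.solutions.export"]


def failed_assertions(response: str) -> list[str]:
    """Return the assertions that the response does not satisfy."""
    failures = []
    for assertion in ASSERTIONS:
        verb, _, needle = assertion.partition(":")
        present = needle in response  # assumed: case-sensitive substring match
        ok = present if verb == "CONTAINS" else not present
        if not ok:
            failures.append(assertion)
    return failures


# Illustrative agent answers, not captured from a real run.
baseline = "Export the solution with: pac solution export"
regressed = "Export the solution with: client.solutions.export()"

if __name__ == "__main__":
    print(failed_assertions(baseline))   # no failures
    print(failed_assertions(regressed))  # both assertions fail
```

The regressed answer fails twice: it never mentions pac, and it contains the fictitious SDK call that the routing trap forbids.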

saurabhrb · 2026-06-03T03:23:57Z

Closing -- validation harness / demo PR, no longer needed. Real coverage now lives in PR #70 (deterministic eval tests) and PR #68 (Cursor + auth unification).

Saurabh Badenkal added 6 commits June 1, 2026 22:38

harness: drop DeterministicAssertionEvaluator from remaining 5 test f…

44c07c0

…iles for local-eval mode

harness: lower LMChecklist passing_score 3 to 1 to match its 0/1 scor…

b705f6f

…ing range
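The reasoning behind lowering passing_score can be shown in a few lines. This assumes an assertion passes when its score is greater than or equal to passing_score, which the commit title implies but does not state. The `passes` helper is hypothetical.

```python
def passes(score: int, passing_score: int) -> bool:
    """Assumed pass rule: the score must reach the threshold."""
    return score >= passing_score


# LMChecklist scores are 0 or 1 according to the commit title.
POSSIBLE_SCORES = (0, 1)

if __name__ == "__main__":
    # With passing_score=3 no score in the 0/1 range can pass.
    print([passes(s, 3) for s in POSSIBLE_SCORES])
    # With passing_score=1 a score of 1 passes and 0 fails.
    print([passes(s, 1) for s in POSSIBLE_SCORES])
```

Under that rule, a threshold of 3 would mark every assertion as failed regardless of the skill's behavior, so the gate could not distinguish a good PR from a bad one.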

saurabhrb force-pushed the users/saurabhrb/badpr-2-dv-solution-regression branch from f8f1bed to 4952cf8 Compare June 2, 2026 22:09

saurabhrb mentioned this pull request Jun 3, 2026

add: eval tests using new deterministic-evaluator format (3-4 per skill) #70

Open

saurabhrb closed this Jun 3, 2026

saurabhrb deleted the users/saurabhrb/badpr-2-dv-solution-regression branch June 3, 2026 03:24

saurabhrb wants to merge 6 commits into main from users/saurabhrb/badpr-2-dv-solution-regression
