feat: improve dv-query skill with advanced OData filters and limits #67

Open
aadharshkannan wants to merge 2 commits into microsoft:main from aadharshkannan:skillopt/dv-query-odata-improvements

@aadharshkannan aadharshkannan commented Jun 1, 2026

TL;DR

A strong coding model (gpt-5.5), given only the current dv-query skill, consistently writes two Dataverse OData queries that fail at runtime. This PR documents the correct syntax in dv-query/SKILL.md and ships the eval set that demonstrates both the gap and the fix. Measured lift on the new "hard" eval set: hard 0.75 → 0.875, with no regression on the existing dv-query suite.

This was not hand-authored guesswork — every line added here is backed by a reproducible eval that the production skill fails and the edited skill passes.


1. Why this fix? (the evidence)

I built 8 harder eval items targeting supported-but-undocumented OData operations and ran them against the current production dv-query skill. Baseline: hard = 0.75 (6/8) — deliberately in the "Goldilocks zone" (not saturated, so there's signal). Two items failed, and both failures are genuine runtime bugs, not stylistic nitpicks:

Gap A — In / NotIn / ContainValues require parameter aliases

Prompt (paraphrased): filter a table where a column is in a large list of values, server-side, single query.

What the model produced (3/3 runs, production skill):

filter="Microsoft.Dynamics.CRM.In(PropertyName='accountid',PropertyValues=[...])"

Judge verdict (P1 claim failed, score 0.00):

The response uses literal PropertyName='accountid' and PropertyValues=[...], not parameter aliases like @p1 and @p2.

Inline arrays inside these functions return HTTP 400. The correct form requires parameter aliases passed as separate query parameters:

$filter=Microsoft.Dynamics.CRM.In(PropertyName=@p1,PropertyValues=@p2)&@p1='statuscode'&@p2=[1,2,3]

The current skill never says this, so the model has no way to know. Stable failure: 3/3 runs.
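
As a minimal sketch of the alias form, the helper below builds the query string shown above. `build_in_filter` is a hypothetical name, not part of the skill or of any Dataverse SDK. It handles integer values only, and a real request would still need URL encoding and authentication.

```python
import json


def build_in_filter(column: str, values: list[int]) -> str:
    """Build a query string that passes In() arguments as parameter aliases.

    Hypothetical helper for illustration; handles integer values only.
    """
    # The function call references aliases instead of inline literals.
    filter_expr = "Microsoft.Dynamics.CRM.In(PropertyName=@p1,PropertyValues=@p2)"
    # @p1 carries the column name as a single-quoted string literal.
    p1 = f"'{column}'"
    # @p2 carries the value list serialized as a compact array, e.g. [1,2,3].
    p2 = json.dumps(values, separators=(",", ":"))
    return f"$filter={filter_expr}&@p1={p1}&@p2={p2}"


print(build_in_filter("statuscode", [1, 2, 3]))
```

Called with `"statuscode"` and `[1, 2, 3]`, this reproduces the correct form quoted above. Keeping the aliases as separate query parameters is the point; the function body itself never contains the literal values.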

Gap B — cannot $orderby parent rows by a related/expanded field

Prompt (paraphrased): list opportunities ordered by their parent account's name.

What the model produced (3/3 runs, production skill):

orderby=["parentaccountid/name asc", "name asc"]

Judge verdict (P1 claim failed, score 0.00):

The response places the related field directly in $orderby instead of sorting client-side.

Dataverse does not support ordering parent rows by a related/expanded field. The correct approach is to $expand the field and sort client-side. Stable failure: 3/3 runs on the original skill.

I A/B-tested this item 3× on the original skill and 3× on the edited skill to rule out the possibility that my edit introduced the behavior. The original skill passes 0/3 (it fails all three runs) independent of my change, which confirms a pre-existing gap, not a regression.
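
A minimal sketch of the client-side sort for Gap B, assuming the rows were already fetched with the parent account expanded so that each row carries a nested `parentaccountid` object with a `name` field. The row shape and the `sort_by_parent_name` helper are illustrative assumptions, not code from the skill.

```python
def sort_by_parent_name(rows: list[dict]) -> list[dict]:
    """Sort rows by the expanded parent account name, then by the row's own name.

    Assumes each row may carry an expanded "parentaccountid" dict (hypothetical shape).
    """

    def sort_key(row: dict) -> tuple[str, str]:
        # Rows without a parent (missing or null expand) sort first via an empty string.
        parent = row.get("parentaccountid") or {}
        return (
            (parent.get("name") or "").casefold(),
            (row.get("name") or "").casefold(),
        )

    return sorted(rows, key=sort_key)


rows = [
    {"name": "Deal B", "parentaccountid": {"name": "Contoso"}},
    {"name": "Deal A", "parentaccountid": {"name": "Contoso"}},
    {"name": "Deal C", "parentaccountid": {"name": "Adventure Works"}},
]
for row in sort_by_parent_name(rows):
    print(row["parentaccountid"]["name"], "-", row["name"])
```

Where parentless rows should land is a design choice; this sketch puts them first, and a different key would be needed to put them last.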


2. How SkillOpt derived this

SkillOpt evaluates a Copilot plugin holistically and isolates the contribution of a single skill:

  1. Materialize the full plugin (all skills present, real auth.py/scripts) so the agent behaves exactly as in production.
  2. Swap in one candidate SKILL.md (here, dv-query) — everything else is held constant.
  3. Run each eval item through the target model (gpt-5.5) end-to-end: it writes and reasons about real Dataverse code.
  4. Score with an LLM judge (gpt-5.4-mini) against a hidden answer-key of prioritized claims, plus deterministic must_contain / must_load_skill checks. hard = 1 only if all P1 claims ≥ 0.7 AND deterministic checks pass.
  5. Hill-climb: edit only the offending skill, re-run, and require fail → pass on the target item with no regression elsewhere.
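
The pass rule in step 4 can be sketched as follows. This is an illustration of the stated rule only, not SkillOpt's actual code: the function names are hypothetical, and treating the suite score as the mean of per-item scores is an assumption inferred from the "0.75 (6/8)" figure.

```python
def hard_score(
    p1_claim_scores: list[float],
    deterministic_checks: list[bool],
    threshold: float = 0.7,
) -> int:
    """Return 1 only if every P1 claim meets the threshold and every check passes."""
    claims_ok = all(score >= threshold for score in p1_claim_scores)
    checks_ok = all(deterministic_checks)
    return 1 if claims_ok and checks_ok else 0


def suite_score(item_scores: list[int]) -> float:
    """Assumed aggregation: fraction of items whose hard score is 1."""
    return sum(item_scores) / len(item_scores)


# A P1 claim scored 0.00 fails the item even when deterministic checks pass.
print(hard_score([0.0], [True, True]))
# Six passing items out of eight.
print(suite_score([1, 1, 1, 1, 1, 1, 0, 0]))
```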

Crucially, the prompts never name the OData function — they describe the user's intent ("at least one related row matching…", "in a large list of values"). This tests whether the skill teaches the model the right tool, not whether the model can pattern-match a keyword.


3. Result (before → after)

| Eval set | Production skill | Edited skill |
| --- | --- | --- |
| Advanced OData (8 items) | 0.75 (6/8) | 0.875 → 8/8 |
| qa_in_function_large_list (Gap A) | fail 3/3 | pass 3/3 |
| probeH1_orderby_related (Gap B) | fail 3/3 | pass 3/3 |
| Existing dv-query suite (11 items) | 10/11 | 10/11 (no regression) |

The lone miss in the existing suite (probeA1_lookup_custom) also fails on the original skill (A/B: 0/3 original vs 1/3 edited), so it reflects pre-existing model variance and is not caused by this PR.


4. What changed

.github/plugins/dataverse/skills/dv-query/SKILL.md — new "Advanced OData filters & limits" section. Beyond the two proven gaps above, it consolidates correct, portable syntax for adjacent operations the skill omitted:

  • lambda any() / all() over collection nav properties (with the empty-collection vacuous-truth caveat)
  • ContainValues / DoesNotContainValues for MultiSelect choice columns
  • relative-date functions (Today, LastXDays, …) and their alias form
  • single-level $expand options ($select/$filter/$orderby/$top) and the nested-expand restriction
  • the $count 5,000 cap → RetrieveTotalRecordCount / paging
  • aggregate 50k limit → segment-and-combine workaround
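
The empty-collection caveat for all() follows the same logic as Python's built-in `all()`, so a plain-Python analogy may help. This is not Dataverse code; the row shape, field name, and helper names are made up for illustration.

```python
from typing import Callable


def matches_all(related_rows: list[dict], predicate: Callable[[dict], bool]) -> bool:
    """True when every related row satisfies the predicate (vacuously true if empty)."""
    return all(predicate(row) for row in related_rows)


def matches_any(related_rows: list[dict], predicate: Callable[[dict], bool]) -> bool:
    """True when at least one related row satisfies the predicate (false if empty)."""
    return any(predicate(row) for row in related_rows)


def is_closed(row: dict) -> bool:
    # Hypothetical predicate on a hypothetical "status" field.
    return row.get("status") == "closed"


no_related_rows: list[dict] = []
print(matches_all(no_related_rows, is_closed))  # vacuously true
print(matches_any(no_related_rows, is_closed))  # false: nothing matches

# Requiring both excludes parents that have no related rows at all.
print(matches_any(no_related_rows, is_closed) and matches_all(no_related_rows, is_closed))
```

A parent with no related rows therefore satisfies an all() condition, which is usually not what "every related row matches" is meant to return.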

evals/skillopt/dv_query_advanced.jsonl — the 9 eval items (8 advanced + the orderby probe) in the same SkillOpt format as the existing dv_query.jsonl, so the gap and the fix are independently reproducible.


5. How to reproduce

The SkillOpt config + generator are in SkillOpt PR #1. Run:

python scripts/eval_only.py --config configs/dataverse/dv_query_advanced.yaml \
  --skill <path-to-candidate-dv-query/SKILL.md> --split test \
  --split_dir data/dataverse/dv_query_advanced --out_root outputs/run

Point --skill at the production skill to see 0.75; at this PR's skill to see 0.875/8-of-8.

Adds an "Advanced OData filters & limits" section to dv-query/SKILL.md
documenting supported-but-undocumented OData operations the model got
wrong against the production skill:

- In/NotIn/ContainValues REQUIRE parameter aliases (@p1/@p2); inline
  arrays return 400. (Model stably used unsupported inline arrays 3/3.)
- Cannot $orderby parent rows by a related/expanded field; sort
  client-side. (Model stably used unsupported related-field orderby 3/3.)
- lambda any()/all(), MultiSelect ContainValues, relative-date functions,
  $expand nested options (with the nested-expand restriction), $count 5000 cap
  (RetrieveTotalRecordCount/paging), aggregate 50k segment-and-combine.

Includes 9 harder SkillOpt eval items (evals/skillopt/dv_query_advanced.jsonl)
in the Goldilocks zone (baseline hard=0.75). The two genuine gaps above each
recover fail->pass with the documentation edit, with no regression on the
existing dv-query eval set.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@aadharshkannan aadharshkannan requested a review from a team June 1, 2026 17:51
aadharshkannan added a commit to aadharshkannan/Dataverse-skills that referenced this pull request Jun 2, 2026
Move evals/skillopt/dv_query_advanced.jsonl into the consolidated eval-set
PR so all SkillOpt eval data lives in one place; PR microsoft#67 keeps only the
dv-query SKILL.md change it ships.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ft#66

Drop evals/skillopt/dv_query_advanced.jsonl here; it now lives in PR microsoft#66
alongside the other SkillOpt eval sets. This PR ships only the dv-query
SKILL.md change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@aadharshkannan aadharshkannan changed the title Improve dv-query skill: advanced OData filters & limits feat: improve dv-query skill with advanced OData filters & limits Jun 2, 2026
@aadharshkannan aadharshkannan changed the title feat: improve dv-query skill with advanced OData filters & limits feat: improve dv-query skill with advanced OData filters and limits Jun 2, 2026