
fix: Make judge runners non-multi-turn #185

Merged
jsonbailey merged 1 commit into main from jb/aic-2544/judge-stateless-runner on May 14, 2026

Conversation

@jsonbailey (Contributor) commented May 14, 2026

Summary

Restores judge correctness lost in #166, which moved conversation history into the provider model runners.

A Judge keeps one runner across successive evaluate(...) calls. After #166, every successful run() mutated the runner's internal _history / _chat_history, so each evaluation observed the previous one's prompt and response in its history, and concurrent evaluations raced on that mutable state.

Fix

  • Add a multi_turn: bool = True parameter on OpenAIModelRunner.__init__ and LangChainModelRunner.__init__. The success-path history mutation is now guarded by self._multi_turn. Default is True, so existing chat behavior is preserved.
  • Thread multi_turn through RunnerFactory.create_model and through both provider create_model implementations.
  • In LDAIClient._create_judge_instance, construct the judge's runner with multi_turn=False so judges share a stateless runner.
  • Fix Judge.evaluate_messages to render the conversation as "{role}: {content}" on newlines (was '\r\n'.join([msg.content ...])), preserving who said what so the judge can distinguish user vs. assistant turns.
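The guarded success path can be sketched as follows. The `_multi_turn` and `_history` names come from the PR description, but the class itself is a simplified stand-in for the OpenAI/LangChain runners, not the SDK's actual code:

```python
# Simplified stand-in for a provider model runner; only the multi_turn
# guard mirrors the PR -- everything else here is illustrative.
class SketchModelRunner:
    def __init__(self, multi_turn: bool = True):
        self._multi_turn = multi_turn
        self._history: list[dict] = []

    def run(self, prompt: str) -> str:
        messages = self._history + [{"role": "user", "content": prompt}]
        response = self._call_model(messages)
        if self._multi_turn:
            # The success-path mutation is now guarded: a stateless (judge)
            # runner skips this, so every run() starts from the same baseline.
            self._history.append({"role": "user", "content": prompt})
            self._history.append({"role": "assistant", "content": response})
        return response

    def _call_model(self, messages: list[dict]) -> str:
        # Placeholder for the real provider call.
        return f"echo: {messages[-1]['content']}"
```

With `multi_turn=False`, two `run()` calls leave `_history` empty; with the default, each call appends the prompt/response pair as before.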

Because the new parameter is additive (default True), no caller's behavior changes outside the judge path.

Tests

  • OpenAI runner: multi_turn=False keeps _history length pinned across two run() calls; default behavior still accumulates (regression guard for #166, "feat: Support conversation history directly in AI Provider model runners").
  • LangChain runner: same shape against _chat_history.messages.
  • RunnerFactory.create_model forwards multi_turn to the provider and the constructed runner reports _multi_turn == False.
  • Judge integration test: a fake runner snapshots its history at call time; two evaluate() calls see the same baseline.
  • evaluate_messages test: [user/hi, assistant/hello] is forwarded as "user: hi\nassistant: hello".
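The judge integration test's shape can be sketched like this; `FakeRunner` and the `evaluate` helper are illustrative stand-ins, not the repo's actual test code:

```python
class FakeRunner:
    """Test double that snapshots its history at call time."""

    def __init__(self):
        self._history: list[dict] = []
        self.snapshots: list[list[dict]] = []

    def run(self, prompt: str) -> str:
        # Record what history this call observed.
        self.snapshots.append(list(self._history))
        # Stateless: a non-multi-turn runner never appends to _history.
        return "PASS"


def evaluate(runner: FakeRunner, output: str) -> str:
    # Minimal stand-in for Judge.evaluate(): one runner call per evaluation.
    return runner.run(f"Evaluate this output:\n{output}")


runner = FakeRunner()
evaluate(runner, "first answer")
evaluate(runner, "second answer")
# Both evaluations saw the same (empty) baseline history.
assert runner.snapshots[0] == runner.snapshots[1] == []
```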

Both make test and make lint pass cleanly.

Refs: AIC-2544
Relates: https://github.com/launchdarkly/sdk-specs/pull/166

Test plan

  • make test
  • make lint
  • e2e: run the chat-judge example against this branch and verify two successive judge evaluations do not see each other's history

Note

Medium Risk
Changes core runner construction and statefulness semantics by threading a new multi_turn flag through factories and runners; incorrect wiring could alter conversation behavior or judge correctness under concurrency.

Overview
Makes model runners optionally stateless by adding a multi_turn flag (default True) to the OpenAI and LangChain model runners and their create_model factory methods, and only appending to in-memory history when multi_turn is enabled.

Threads multi_turn through AIProvider.create_model and RunnerFactory.create_model, and updates judge creation (LDAIClient._create_judge_instance) to construct runners with multi_turn=False so repeated judge evaluations don’t contaminate each other’s prompts.

Updates Judge.evaluate_messages to include message roles when building the judge input ("{role}: {content}" joined by newlines), and adds targeted tests covering non-multi-turn behavior, flag forwarding, and role-preserving rendering.
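The role-preserving rendering is straightforward; `render_conversation` is a hypothetical name for the logic inside `Judge.evaluate_messages`:

```python
def render_conversation(messages: list[dict]) -> str:
    # Keep roles so the judge can distinguish user from assistant turns
    # (the old code joined bare msg.content values with '\r\n').
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)


msgs = [
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "hello"},
]
assert render_conversation(msgs) == "user: hi\nassistant: hello"
```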

Reviewed by Cursor Bugbot for commit 67dda58.

PR #166 moved conversation history into the provider model runners. That
broke judges: a Judge shares a single runner across successive
evaluate() calls, so each evaluation polluted the next one's history
and concurrent evaluations raced on the mutable state.

Add a ``multi_turn`` flag (default True) to the OpenAI and LangChain
model runners and thread it through ``RunnerFactory.create_model`` and
the provider factories. The judge factory now constructs its runner with
``multi_turn=False`` so each evaluation starts from the same baseline.

Also fix ``Judge.evaluate_messages`` to preserve message roles when
rendering the conversation for the judge, so user and assistant turns
remain distinguishable.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@jsonbailey jsonbailey marked this pull request as ready for review May 14, 2026 20:04
@jsonbailey jsonbailey requested a review from a team as a code owner May 14, 2026 20:04
@jsonbailey jsonbailey merged commit 5c21bd0 into main May 14, 2026
48 checks passed
@jsonbailey jsonbailey deleted the jb/aic-2544/judge-stateless-runner branch May 14, 2026 22:12
github-actions bot mentioned this pull request May 14, 2026