fix: Make judge runners non-multi-turn (#185)
Merged
Conversation
PR #166 moved conversation history into the provider model runners. That broke judges: a `Judge` shares a single runner across successive `evaluate()` calls, so each evaluation polluted the next one's history, and concurrent evaluations raced on the mutable state.

Add a `multi_turn` flag (default `True`) to the OpenAI and LangChain model runners and thread it through `RunnerFactory.create_model` and the provider factories. The judge factory now constructs its runner with `multi_turn=False`, so each evaluation starts from the same baseline. Also fix `Judge.evaluate_messages` to preserve message roles when rendering the conversation for the judge, so user and assistant turns remain distinguishable.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
andrewklatzke
approved these changes
May 14, 2026
Summary
Restores judge correctness lost in #166, which moved conversation history into the provider model runners.
A `Judge` keeps one runner across successive `evaluate(...)` calls. After #166, every successful `run()` mutated the runner's internal `_history`/`_chat_history`, so each evaluation observed the previous one's prompt and response in its history, and concurrent evaluations raced on that mutable state.

Fix
- Add a `multi_turn: bool = True` parameter to `OpenAIModelRunner.__init__` and `LangChainModelRunner.__init__`. The success-path history mutation is now guarded by `self._multi_turn`; the default of `True` preserves existing chat behavior.
- Thread `multi_turn` through `RunnerFactory.create_model` and through both provider `create_model` implementations.
- In `LDAIClient._create_judge_instance`, construct the judge's runner with `multi_turn=False` so judges share a stateless runner.
- Update `Judge.evaluate_messages` to render the conversation as `"{role}: {content}"` on newlines (was `'\r\n'.join([msg.content ...])`), preserving who said what so the judge can distinguish user and assistant turns.

Because the new parameter is additive (default `True`), no caller's behavior changes outside the judge path.
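As a rough illustration of the guard, the hand-rolled runner below is a stand-in, not the SDK's actual `OpenAIModelRunner` (its real constructor and provider call differ):

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class StubRunner:
    """Illustrative stand-in for a provider model runner."""
    complete: Callable[[List[Dict[str, str]]], str]  # provider call, stubbed
    multi_turn: bool = True
    _history: List[Dict[str, str]] = field(default_factory=list)

    def run(self, prompt: str) -> str:
        # Build the request from any accumulated history plus the new prompt.
        messages = [*self._history, {"role": "user", "content": prompt}]
        response = self.complete(messages)
        # The fix: persist the turn only when the runner is multi-turn.
        if self.multi_turn:
            self._history.append({"role": "user", "content": prompt})
            self._history.append({"role": "assistant", "content": response})
        return response
```

A runner constructed with `multi_turn=False` therefore sees an empty `_history` on every call, while chat callers keep the accumulating behavior they had before.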
Tests
- `multi_turn=False` keeps `_history` length pinned across two `run()` calls; the default behavior still accumulates (regression guard for #166, "feat: Support conversation history directly in AI Provider model runners").
- The same guard is covered for the LangChain runner's `_chat_history.messages`.
- `RunnerFactory.create_model` forwards `multi_turn` to the provider, and the constructed runner reports `_multi_turn == False`.
- Judge-level test: successive `evaluate()` calls see the same baseline.
- `evaluate_messages` test: `[user/hi, assistant/hello]` is forwarded as `"user: hi\nassistant: hello"`.
- `make test` and `make lint` pass clean.

Refs: AIC-2544
Relates: https://github.com/launchdarkly/sdk-specs/pull/166
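The role-preserving rendering test could be sketched roughly as follows; the `Message` type and `render_conversation` helper are illustrative names, not the SDK's real API:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Message:
    role: str
    content: str


def render_conversation(messages: List[Message]) -> str:
    # Keep who said what: one "{role}: {content}" line per message.
    return "\n".join(f"{m.role}: {m.content}" for m in messages)


def test_rendering_preserves_roles():
    rendered = render_conversation(
        [Message("user", "hi"), Message("assistant", "hello")]
    )
    assert rendered == "user: hi\nassistant: hello"
```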
Test plan
- `make test`
- `make lint`

Note
Medium Risk
Changes core runner construction and statefulness semantics by threading a new `multi_turn` flag through factories and runners; incorrect wiring could alter conversation behavior or judge correctness under concurrency.

Overview
Makes model runners optionally stateless by adding a `multi_turn` flag (default `True`) to the OpenAI and LangChain model runners and their `create_model` factory methods, and only appending to in-memory history when `multi_turn` is enabled.

Threads `multi_turn` through `AIProvider.create_model` and `RunnerFactory.create_model`, and updates judge creation (`LDAIClient._create_judge_instance`) to construct runners with `multi_turn=False` so repeated judge evaluations don't contaminate each other's prompts.

Updates `Judge.evaluate_messages` to include message roles when building the judge input (`"{role}: {content}"` joined by newlines), and adds targeted tests covering non-multi-turn behavior, flag forwarding, and role-preserving rendering.

Reviewed by Cursor Bugbot for commit 67dda58.
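The flag-forwarding path described in the overview might be sketched like this; every class name and signature below is an illustrative stand-in, not the SDK's real `AIProvider`/`RunnerFactory` API:

```python
class FakeRunner:
    """Illustrative runner; records whether it was built multi-turn."""
    def __init__(self, model_name: str, multi_turn: bool = True):
        self.model_name = model_name
        self._multi_turn = multi_turn


class FakeProvider:
    """Illustrative provider exposing a create_model hook."""
    def create_model(self, model_name: str, multi_turn: bool = True) -> FakeRunner:
        return FakeRunner(model_name, multi_turn=multi_turn)


class FakeRunnerFactory:
    """Forwards multi_turn so the judge path can request a stateless runner."""
    def __init__(self, provider: FakeProvider):
        self._provider = provider

    def create_model(self, model_name: str, multi_turn: bool = True) -> FakeRunner:
        # Additive parameter: the default of True keeps existing callers unchanged.
        return self._provider.create_model(model_name, multi_turn=multi_turn)


# Judge path: request a stateless (non-multi-turn) runner.
judge_runner = FakeRunnerFactory(FakeProvider()).create_model(
    "judge-model", multi_turn=False
)
assert judge_runner._multi_turn is False
```

Because the flag defaults to `True` at every layer, only the judge construction site needs to pass it explicitly.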