
fix: Make judge runners non-multi-turn #185

Merged
jsonbailey merged 1 commit into main from jb/aic-2544/judge-stateless-runner on May 14, 2026

Conversation

@jsonbailey (Contributor) commented May 14, 2026

Summary

Restores judge correctness lost in #166, which moved conversation history into the provider model runners.

A Judge keeps one runner across successive evaluate(...) calls. After #166, every successful run() mutated the runner's internal _history / _chat_history, so each evaluation observed the previous one's prompt and response in its history, and concurrent evaluations raced on that mutable state.

Fix

  • Add a multi_turn: bool = True parameter on OpenAIModelRunner.__init__ and LangChainModelRunner.__init__. The success-path history mutation is now guarded by self._multi_turn. Default is True, so existing chat behavior is preserved.
  • Thread multi_turn through RunnerFactory.create_model and through both provider create_model implementations.
  • In LDAIClient._create_judge_instance, construct the judge's runner with multi_turn=False so judges share a stateless runner.
  • Fix Judge.evaluate_messages to render the conversation as "{role}: {content}" on newlines (was '\r\n'.join([msg.content ...])), preserving who said what so the judge can distinguish user vs. assistant turns.
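The guarded success path can be sketched as follows. The `_multi_turn` and `_history` names come from the PR description, but the class itself is a simplified stand-in for the OpenAI/LangChain runners, not the SDK's actual code:

```python
# Simplified stand-in for a provider model runner; only the multi_turn
# guard mirrors the PR -- everything else here is illustrative.
class SketchModelRunner:
    def __init__(self, multi_turn: bool = True):
        self._multi_turn = multi_turn
        self._history: list[dict] = []

    def run(self, prompt: str) -> str:
        messages = self._history + [{"role": "user", "content": prompt}]
        response = self._call_model(messages)
        if self._multi_turn:
            # The success-path mutation is now guarded: a stateless (judge)
            # runner skips this, so every run() starts from the same baseline.
            self._history.append({"role": "user", "content": prompt})
            self._history.append({"role": "assistant", "content": response})
        return response

    def _call_model(self, messages: list[dict]) -> str:
        # Placeholder for the real provider call.
        return f"echo: {messages[-1]['content']}"
```

With `multi_turn=False`, two `run()` calls leave `_history` empty; with the default, each call appends the prompt/response pair as before.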

Because the new parameter is additive (default True), no caller's behavior changes outside the judge path.

Tests

  • OpenAI runner: multi_turn=False keeps _history length pinned across two run() calls; default behavior still accumulates (regression guard for #166, "feat: Support conversation history directly in AI Provider model runners").
  • LangChain runner: same shape against _chat_history.messages.
  • RunnerFactory.create_model forwards multi_turn to the provider and the constructed runner reports _multi_turn == False.
  • Judge integration test: a fake runner snapshots its history at call time; two evaluate() calls see the same baseline.
  • evaluate_messages test: [user/hi, assistant/hello] is forwarded as "user: hi\nassistant: hello".
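The judge integration test's shape can be sketched like this; `FakeRunner` and the `evaluate` helper are illustrative stand-ins, not the repo's actual test code:

```python
class FakeRunner:
    """Test double that snapshots its history at call time."""

    def __init__(self):
        self._history: list[dict] = []
        self.snapshots: list[list[dict]] = []

    def run(self, prompt: str) -> str:
        # Record what history this call observed.
        self.snapshots.append(list(self._history))
        # Stateless: a non-multi-turn runner never appends to _history.
        return "PASS"


def evaluate(runner: FakeRunner, output: str) -> str:
    # Minimal stand-in for Judge.evaluate(): one runner call per evaluation.
    return runner.run(f"Evaluate this output:\n{output}")


runner = FakeRunner()
evaluate(runner, "first answer")
evaluate(runner, "second answer")
# Both evaluations saw the same (empty) baseline history.
assert runner.snapshots[0] == runner.snapshots[1] == []
```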

Both make test and make lint pass cleanly.

Refs: AIC-2544
Relates: https://github.com/launchdarkly/sdk-specs/pull/166

Test plan

  • make test
  • make lint
  • e2e: run the chat-judge example against this branch and verify two successive judge evaluations do not see each other's history

Note

Medium Risk
Changes core runner construction and statefulness semantics by threading a new multi_turn flag through factories and runners; incorrect wiring could alter conversation behavior or judge correctness under concurrency.

Overview
Makes model runners optionally stateless by adding a multi_turn flag (default True) to the OpenAI and LangChain model runners and their create_model factory methods, and only appending to in-memory history when multi_turn is enabled.

Threads multi_turn through AIProvider.create_model and RunnerFactory.create_model, and updates judge creation (LDAIClient._create_judge_instance) to construct runners with multi_turn=False so repeated judge evaluations don’t contaminate each other’s prompts.

Updates Judge.evaluate_messages to include message roles when building the judge input ("{role}: {content}" joined by newlines), and adds targeted tests covering non-multi-turn behavior, flag forwarding, and role-preserving rendering.
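The role-preserving rendering is straightforward; `render_conversation` is a hypothetical name for the logic inside `Judge.evaluate_messages`:

```python
def render_conversation(messages: list[dict]) -> str:
    # Keep roles so the judge can distinguish user from assistant turns
    # (the old code joined bare msg.content values with '\r\n').
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)


msgs = [
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "hello"},
]
assert render_conversation(msgs) == "user: hi\nassistant: hello"
```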

Reviewed by Cursor Bugbot for commit 67dda58.

PR #166 moved conversation history into the provider model runners. That
broke judges: a Judge shares a single runner across successive
evaluate() calls, so each evaluation polluted the next one's history
and concurrent evaluations raced on the mutable state.

Add a ``multi_turn`` flag (default True) to the OpenAI and LangChain
model runners and thread it through ``RunnerFactory.create_model`` and
the provider factories. The judge factory now constructs its runner with
``multi_turn=False`` so each evaluation starts from the same baseline.

Also fix ``Judge.evaluate_messages`` to preserve message roles when
rendering the conversation for the judge, so user and assistant turns
remain distinguishable.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@jsonbailey jsonbailey marked this pull request as ready for review May 14, 2026 20:04
@jsonbailey jsonbailey requested a review from a team as a code owner May 14, 2026 20:04
@jsonbailey jsonbailey merged commit 5c21bd0 into main May 14, 2026
48 checks passed
@jsonbailey jsonbailey deleted the jb/aic-2544/judge-stateless-runner branch May 14, 2026 22:12
github-actions bot mentioned this pull request May 14, 2026