feat: add fleet.judge module for LLM-as-a-judge grading (ENG-1527)#92

Closed
jarredFleetSo wants to merge 92 commits into main from feat/judge-module-041026-1

Conversation

@jarredFleetSo

Contributor

Summary

  • Creates the `fleet/judge/` package, which provides LLM-as-a-judge grading so that verifiers can use `from fleet.judge import Rubric, Criterion` as designed
  • Implements standalone judge grading via configurable LLM endpoints (Anthropic or OpenAI-compatible), enabling offline/airgapped Docker Worlds evaluation
  • Mirrors the orchestrator's prompt construction logic so judge grading works entirely client-side without calling the Fleet orchestrator

New Files

| File | Purpose |
| --- | --- |
| `fleet/judge/__init__.py` | Public exports: `Rubric`, `Criterion`, `JudgeClient`, `JudgeEndpointConfig`, `JudgeResult` |
| `fleet/judge/models.py` | Data classes: `Criterion(name, max, levels)`, `Rubric(criteria)`, `CriterionResult`, `JudgeResult` (with `__float__`), `JudgeEndpointConfig` |
| `fleet/judge/client.py` | `JudgeClient` with async `grade()` + sync `grade_sync()`, Anthropic/OpenAI backends via raw httpx, JSON extraction with trailing comma repair |
| `fleet/judge/prompt.py` | `build_structured_rubric_prompt()` mirroring orchestrator logic |
| `tests/test_judge.py` | 36 tests covering all models, JSON extraction, prompt construction, env var config, response parsing |
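
The "JSON extraction with trailing comma repair" step is not shown on this page, so the following is only a rough sketch of what such a helper might look like. The function name `extract_json` and the regex-based repair are assumptions, not the code in `fleet/judge/client.py`.

```python
import json
import re


def extract_json(text: str) -> dict:
    """Parse the outermost {...} object found in an LLM reply (hypothetical helper)."""
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in judge response")
    candidate = text[start:end + 1]
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        # Naive repair: drop commas that directly precede a closing brace
        # or bracket. This can also touch commas inside string values, so
        # it is only attempted after strict parsing has failed.
        repaired = re.sub(r",\s*([}\]])", r"\1", candidate)
        return json.loads(repaired)


reply = 'Here is my grade: {"scores": [3, 4,], "total": 7,} Thanks!'
print(extract_json(reply))
```

Trying strict parsing first keeps well-formed replies untouched; the repair only runs on replies that would otherwise be rejected.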

Configuration

# Docker Worlds / airgapped — configure via env vars
JUDGE_ENDPOINT=https://customer-llm.internal/v1
JUDGE_API_KEY=sk-...
JUDGE_API_FORMAT=openai
JUDGE_MODEL=gpt-4o
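
For illustration, the resolution order this PR describes (explicit config, then `JUDGE_*` env vars, then `ANTHROPIC_API_KEY`, then an error) could be sketched roughly as below. `EndpointConfig` and `resolve_config` are hypothetical stand-ins for `JudgeEndpointConfig` and its loader, and the fallback values for format and model are assumptions.

```python
import os
from typing import Mapping, Optional


class EndpointConfig:
    """Hypothetical stand-in for JudgeEndpointConfig."""

    def __init__(self, endpoint: str, api_key: str, api_format: str, model: str):
        self.endpoint = endpoint
        self.api_key = api_key
        self.api_format = api_format
        self.model = model


def resolve_config(
    explicit: Optional[EndpointConfig] = None,
    env: Optional[Mapping[str, str]] = None,
) -> EndpointConfig:
    env = os.environ if env is None else env
    # 1. An explicitly passed config always wins.
    if explicit is not None:
        return explicit
    # 2. JUDGE_* variables (the airgapped / Docker Worlds path).
    if env.get("JUDGE_ENDPOINT") and env.get("JUDGE_API_KEY"):
        return EndpointConfig(
            endpoint=env["JUDGE_ENDPOINT"],
            api_key=env["JUDGE_API_KEY"],
            api_format=env.get("JUDGE_API_FORMAT", "openai"),  # assumed default
            model=env.get("JUDGE_MODEL", ""),  # assumed: caller supplies a model
        )
    # 3. Fall back to a plain Anthropic key.
    if env.get("ANTHROPIC_API_KEY"):
        return EndpointConfig(
            endpoint="https://api.anthropic.com",
            api_key=env["ANTHROPIC_API_KEY"],
            api_format="anthropic",
            model=env.get("JUDGE_MODEL", ""),
        )
    # 4. Nothing configured: fail with an actionable message.
    raise RuntimeError(
        "No judge endpoint configured: pass a config, set JUDGE_ENDPOINT and "
        "JUDGE_API_KEY, or set ANTHROPIC_API_KEY."
    )
```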

Key Design Decisions

  • Plain classes (not Pydantic) — Criterion("name", max=10, levels={...}) matches verifier usage exactly
  • No new dependencies — uses only httpx (already a dep) for HTTP calls
  • Config priority: explicit config → JUDGE_* env vars → ANTHROPIC_API_KEY → helpful error
  • float(result) returns normalized score (0.0-1.0) for verifier compatibility
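
As a rough illustration of the "plain classes" and `float(result)` decisions, here is a minimal sketch. The `Criterion` and `Rubric` constructor shapes follow the description above; the `JudgeResult` constructor arguments are hypothetical, since `models.py` is not shown in full on this page.

```python
class Criterion:
    def __init__(self, name, max, levels=None):
        self.name = name
        self.max = max              # highest score this criterion can award
        self.levels = levels or {}  # e.g. {10: "fully correct", 0: "wrong"}


class Rubric:
    def __init__(self, criteria):
        self.criteria = list(criteria)


class JudgeResult:
    def __init__(self, total_score, max_score):  # hypothetical signature
        self.total_score = total_score
        self.max_score = max_score

    @property
    def normalized_score(self):
        # Guard against an empty rubric rather than dividing by zero.
        return self.total_score / self.max_score if self.max_score else 0.0

    def __float__(self):
        return float(self.normalized_score)


rubric = Rubric([
    Criterion("accuracy", max=10, levels={10: "fully correct", 0: "wrong"}),
    Criterion("clarity", max=5),
])
result = JudgeResult(total_score=12, max_score=sum(c.max for c in rubric.criteria))
print(float(result))
```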

Related

Test plan

  • 36 unit tests passing
  • Verify `from fleet.judge import Rubric, Criterion` works in verifier context
  • Integration test with actual LLM endpoint

🤖 Generated with Claude Code

mikesklar and others added 30 commits January 13, 2026 12:21
When verifier code contains multiple functions (e.g., a main verifier
function and helper functions), the helper functions were not accessible
from the main function due to namespace isolation.

The exec() call created functions in local_namespace, but the main
function's __globals__ pointed to exec_globals which didn't contain
the helper functions. This caused NameError when the main function
tried to call helpers, which was silently caught and returned 0.0.

Fix: Merge local_namespace into exec_globals after exec() so all
defined functions are accessible when the verifier is called.
…mespace

fix: allow verifier helper functions to be called from main verifier
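
The namespace problem described in the commit above can be reproduced with plain `exec()`. This is a standalone sketch, not the SDK's code; `load_verifier` and the sample verifier source are made up for illustration.

```python
VERIFIER_SOURCE = '''
def helper(x):
    return x * 2

def verify(x):
    return helper(x) + 1
'''


def load_verifier(source, merge_namespaces):
    exec_globals = {}
    local_namespace = {}
    # With separate dicts, the `def` statements bind into local_namespace,
    # but each function's __globals__ is exec_globals.
    exec(source, exec_globals, local_namespace)
    if merge_namespaces:
        # The fix: make every defined name visible as a global.
        exec_globals.update(local_namespace)
    return local_namespace["verify"]


broken = load_verifier(VERIFIER_SOURCE, merge_namespaces=False)
try:
    broken(3)
except NameError as exc:
    print("without merge:", exc)

fixed = load_verifier(VERIFIER_SOURCE, merge_namespaces=True)
print("with merge:", fixed(3))  # 7
```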
InstanceRequest changes:
- Add: profile_id, async_provision, instance_mode, ssh_public_keys, snapshot_interval_minutes, version (deprecated)
- Fix: region default changed from 'us-west-1' to None (server decides)
- Fix: created_from default changed from None to 'api'

TaskRequest changes:
- Add: verifier_func, project_key, data_id, data_version, writer_metadata
- Add: model_config with extra='ignore' and populate_by_name=True
- Add: alias='env_id' for environment_id field
- Remove: metadata (doesn't exist in orchestrator TaskRequest, only in TaskResponse)
…odels

Add factual_answer field to support research/factual tasks:
- Task model: stores expected answer for verification
- TaskRequest: accept factual_answer when creating tasks
- TaskResponse: return factual_answer from API

Part of: https://linear.app/fleet-ai/issue/ENG-843/import-script-needs-to-support-output-json-schemas

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
feat: add factual_answer field to Task and API models
Add task_modality field to Task and TaskResponse models to support
copying task modality (computer_use, tool_use, browser) when importing
tasks via the SDK.

Changes:
- Add task_modality to TaskResponse model (API response)
- Add task_modality to Task model (SDK model)
- Pass task_modality from TaskResponse to Task in load_tasks

Co-authored-by: Cursor <cursoragent@cursor.com>
Addresses Bugbot comment: load_task_from_json wasn't extracting
task_modality from JSON data, causing tasks loaded from JSON files
to have task_modality=None even when the JSON contains this field.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Add task_modality field to async Task model, TaskResponse model,
and update load_task_from_json and load_tasks to preserve task_modality.

Co-authored-by: Cursor <cursoragent@cursor.com>
andrew-stelmach-fleet and others added 27 commits February 19, 2026 16:44
fix: remove stale output_json_schema warning from import_tasks
Simple scripts for task authors to download existing tasks, edit them
locally, and upload as new tasks. Uses raw requests (no SDK dependency)
with auto-resolved team ID from API key.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
feat: add task bundle editing scripts
* fix(download): don't pass auto-resolved team_id to task GET

The team_id query param on GET /v1/tasks/{key} requires admin
privileges. Previously the script always passed it (from auto-resolve),
causing 403 errors for non-admin API keys. Now only passes team_id
when explicitly provided via --team-id flag.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: skip resolve_team_id when --team-id is explicitly provided

Avoids an unnecessary API call that could fail and block the download.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
)

* feat(upload): add job launching and auto-generated unique task keys

- Make --key optional; auto-generates {original_key}_{uuid[:8]} when omitted
- Replace local key comparison with server-side existence check (GET /v1/tasks/{key})
- Launch job by default after upload (POST /v1/jobs) with --no-launch-job to skip
- Add --models, --pass-k flags for job configuration
- Default models: gemini-3.1-pro-preview, claude-opus-4.6, gpt-5.2

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(upload): raise on unexpected status in task existence check

Previously any non-200 (including 500, 403, 429) was treated as
"key available", silently skipping the guard. Now only 404 means
available; other errors are surfaced.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
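
A small sketch of the two upload behaviours described above: the auto-generated `{original_key}_{uuid[:8]}` key, and treating only 404 as "key available". The function names are hypothetical and the HTTP call itself is left out.

```python
import uuid


def generate_task_key(original_key: str) -> str:
    # Append the first 8 hex characters of a random UUID.
    return f"{original_key}_{uuid.uuid4().hex[:8]}"


def key_is_available(status_code: int) -> bool:
    """Interpret the status of GET /v1/tasks/{key}."""
    if status_code == 404:
        return True   # no such task, so the key is free
    if status_code == 200:
        return False  # task already exists
    # 403, 429, 500, ... say nothing about availability, so surface them.
    raise RuntimeError(f"unexpected status {status_code} while checking task key")


print(generate_task_key("my_task"))
```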
The file-set key may differ from the task key (e.g., without a version
suffix). Pull the key from the task's env_variables.TASK_KEY when
available, falling back to the CLI --task-key argument.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…validation (#73)

The jobs API returns `job_id` not `id` — fix extraction so the job ID and
dashboard URL are printed after launch. Also add validation that data files
are under files/notebooks/ (the path unpacked into the agent workspace) and
that the prompt's list_workspace_files() pattern matches actual files.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* fix(upload): extract job_id from API response and add workspace file validation

The jobs API returns `job_id` not `id` — fix extraction so the job ID and
dashboard URL are printed after launch. Also add validation that data files
are under files/notebooks/ (the path unpacked into the agent workspace) and
that the prompt's list_workspace_files() pattern matches actual files.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add standalone launch_job script for existing tasks

Provides a simple way to launch jobs for tasks that already exist on the
server, without the upload/create flow.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
The web page now redirects the browser to http://127.0.0.1:PORT/callback
with tokens as query params instead of POSTing JSON. Replace do_POST +
do_OPTIONS (CORS preflight) with a do_GET handler that reads query params,
validates the state nonce, and returns a plain HTML confirmation page.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
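
The query-parameter handling that a `do_GET` callback needs could look roughly like this. It is a sketch only: the parameter names `state` and `access_token` and the helper `parse_callback` are assumptions, and the HTTP server wiring is omitted.

```python
import hmac
from typing import Dict, Optional
from urllib.parse import parse_qs, urlsplit


def parse_callback(request_path: str, expected_state: str) -> Optional[Dict[str, str]]:
    """Return the callback's query params, or None if the request is not valid."""
    parts = urlsplit(request_path)
    if parts.path != "/callback":
        return None
    # parse_qs returns lists; keep the first value of each parameter.
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    state = params.pop("state", "")
    # Constant-time comparison of the state nonce.
    if not hmac.compare_digest(state.encode(), expected_state.encode()):
        return None
    return params


print(parse_callback("/callback?state=abc123&access_token=tok", "abc123"))
```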
… URL

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
feat: add flt login browser auth flow (ENG-1192)
…#79)

* feat(task-bundle-editing): add Makefile, templates, CLAUDE.md, and workflow docs

Add task authoring toolkit for creating, editing, and deploying Fleet
evaluation tasks. Includes Makefile wrapping existing Python scripts,
verifier templates for analysis and plot tasks, CLAUDE.md for Claude Code
integration, and extended README with end-to-end workflow documentation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: clarify placeholder notation in template schema notes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* feat(track): add passive AI session sync daemon

Adds `flt track` — a background daemon that watches local AI coding
session files (Claude Code, Cursor, Codex) and syncs them to S3 using
a Merkle tree-based dedup engine.

- Merkle sync: flat {path: sha256} map, O(1) root hash change detection
- WAL-backed SQLite upload queue with exponential backoff
- Presigned S3 PUT URLs — daemon never holds AWS creds
- FSEvents/inotify watcher with polling fallback
- launchd (macOS) + systemd (Linux) service install via `flt track enable`
- Manifest written only after uploads confirmed (_confirmed_map pattern)
- 10-minute reconcile interval

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
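
To make the "flat {path: sha256} map" idea concrete, here is an illustrative sketch of root-hash change detection. The hashing layout (sorted paths, NUL separator) is an assumption and is not taken from the daemon's code.

```python
import hashlib
from typing import Dict, List


def root_hash(file_map: Dict[str, str]) -> str:
    """Hash a flat {path: sha256} map in sorted order so the result is stable."""
    h = hashlib.sha256()
    for path in sorted(file_map):
        h.update(path.encode("utf-8"))
        h.update(b"\0")
        h.update(file_map[path].encode("ascii"))
        h.update(b"\n")
    return h.hexdigest()


def changed_paths(local: Dict[str, str], confirmed: Dict[str, str]) -> List[str]:
    # Once both roots are known, comparing them is a single string check.
    if root_hash(local) == root_hash(confirmed):
        return []
    # Roots differ: fall back to a per-path diff to find what to upload.
    return sorted(p for p, digest in local.items() if confirmed.get(p) != digest)


confirmed = {"a.jsonl": "11", "b.jsonl": "22"}
local = {"a.jsonl": "11", "b.jsonl": "33", "c.jsonl": "44"}
print(changed_paths(local, confirmed))  # ['b.jsonl', 'c.jsonl']
```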

* fix(track): add status = 'pending' filter to claim_batch UPDATE

The UPDATE was missing a status filter, so it could revert done/failed/
in_flight rows back to in_flight when the same path appeared in the
SELECT results. Added WHERE status = 'pending' to match the SELECT.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(track): prevent double-close of fd in write_status

If os.replace() raised after os.close(fd) succeeded, the except handler
would call os.close(fd) again on the already-closed descriptor, masking
the original exception and skipping os.unlink(tmp). Set fd = -1 after
close to guard against the double-close.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
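
The `fd = -1` guard can be shown with a generic atomic-write helper. This is a sketch of the pattern rather than the daemon's actual `write_status`; the signature is assumed.

```python
import json
import os
import tempfile


def write_status(path: str, status: dict) -> None:
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        # Fine for small payloads; large writes would need a loop.
        os.write(fd, json.dumps(status).encode("utf-8"))
        os.close(fd)
        fd = -1  # mark as closed so the handler below does not close it again
        os.replace(tmp, path)  # atomic rename over the destination
    except BaseException:
        if fd != -1:
            os.close(fd)
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise
```

If `os.replace()` fails, the handler now skips the second `os.close()`, removes the temp file, and re-raises the original exception.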

* fix(track): serialise _drain_queue with a lock to prevent duplicate uploads

_drain_queue was called from both the main loop and the watcher debounce
thread without synchronisation. The non-atomic SELECT+UPDATE in
claim_batch meant both threads could select the same pending items before
either UPDATE executed, causing duplicate presigned-URL requests and
duplicate pool submissions. A threading.Lock now serialises the
claim_batch call so only one thread can hold claimed items at a time.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(track): propagate FLEET_TRACK_BASE_URL into systemd unit on Linux

The launchd plist conditionally passed FLEET_TRACK_BASE_URL through but
the systemd unit only set PATH. On Linux, any custom base URL would be
silently ignored and the daemon would fall back to the production server.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: bump version to 0.2.119

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(track): use GLOBAL_BASE_URL (orchestrator.fleetai.com) as default

Replaced the hardcoded regional URL with the shared GLOBAL_BASE_URL
constant from fleet/config.py, consistent with how the rest of the SDK
resolves the default base URL.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(auth): use urlsafe_b64decode and correct padding for JWT expiry check

JWTs use base64url encoding (- and _ instead of + and /). Using
b64decode silently corrupted payloads containing those chars, causing
json.loads to fail and the except handler to return True (expired),
forcing a Supabase refresh on every get_valid_token() call.

Also fixed the padding: (4 - n % 4) % 4 → -n % 4 so a string already
aligned to 4 bytes gets 0 padding chars instead of 4.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
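
A minimal sketch of the decoding this fix describes: base64url decoding with `-n % 4` padding. The helper names are illustrative, not the SDK's, and no signature verification is attempted.

```python
import base64
import json
import time


def decode_jwt_payload(token: str) -> dict:
    payload_b64 = token.split(".")[1]
    # -n % 4 is 0 when the segment is already a multiple of 4 characters.
    padding = "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64 + padding))


def is_expired(token: str, now=None) -> bool:
    now = time.time() if now is None else now
    try:
        return decode_jwt_payload(token)["exp"] <= now
    except (IndexError, KeyError, ValueError):
        # Malformed tokens are treated as expired so that a refresh happens.
        return True
```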

* fix(auth): move Supabase credentials out of source into GitHub secrets

Hardcoded SUPABASE_URL and SUPABASE_ANON_KEY are replaced by a
placeholder fleet/_supabase.py that gets sed-replaced with real values
from GitHub secrets (SUPABASE_URL, SUPABASE_ANON_KEY) before the PyPI
build runs. Local dev can override via env vars.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor(track): remove S3 details from provision response

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(track): key claim_batch UPDATE and mark_done/failed on (path, sha256) not path alone

Path-only matching in claim_batch could orphan rows when two entries share a path
but differ in sha256 — the UPDATE would set both to in_flight even though only one
was returned by the SELECT. That extra row would never be uploaded or resolved.

Fix: use (path, sha256) IN (...) for the UPDATE, and thread sha256 through
pool.submit → on_done/on_failed → mark_done/mark_failed so every callback
targets the exact row that was claimed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
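
The combined effect of the two `claim_batch` fixes (the `status = 'pending'` filter and keying on `(path, sha256)`) can be sketched against a made-up schema. The table layout below is hypothetical, and a per-row `UPDATE` via `executemany` is used here in place of the `(path, sha256) IN (...)` form the commit mentions.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS queue (
    path TEXT NOT NULL,
    sha256 TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending',
    enqueued_at REAL NOT NULL,
    PRIMARY KEY (path, sha256)
)
"""


def claim_batch(conn: sqlite3.Connection, limit: int):
    rows = conn.execute(
        "SELECT path, sha256 FROM queue "
        "WHERE status = 'pending' ORDER BY enqueued_at LIMIT ?",
        (limit,),
    ).fetchall()
    # Match the exact (path, sha256) rows that were selected, and only while
    # they are still pending, so done/failed/in_flight rows are never touched.
    conn.executemany(
        "UPDATE queue SET status = 'in_flight' "
        "WHERE path = ? AND sha256 = ? AND status = 'pending'",
        rows,
    )
    conn.commit()
    return rows


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.executemany(
    "INSERT INTO queue (path, sha256, status, enqueued_at) VALUES (?, ?, ?, ?)",
    [("a", "1", "pending", 1.0), ("a", "2", "done", 2.0), ("b", "3", "pending", 3.0)],
)
print(claim_batch(conn, 1))  # [('a', '1')]
```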

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat(track): add passive AI session sync daemon

Adds `flt track` — a background daemon that watches local AI coding
session files (Claude Code, Cursor, Codex) and syncs them to S3 using
a Merkle tree-based dedup engine.

- Merkle sync: flat {path: sha256} map, O(1) root hash change detection
- WAL-backed SQLite upload queue with exponential backoff
- Presigned S3 PUT URLs — daemon never holds AWS creds
- FSEvents/inotify watcher with polling fallback
- launchd (macOS) + systemd (Linux) service install via `flt track enable`
- Manifest written only after uploads confirmed (_confirmed_map pattern)
- 10-minute reconcile interval

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(track): add status = 'pending' filter to claim_batch UPDATE

The UPDATE was missing a status filter, so it could revert done/failed/
in_flight rows back to in_flight when the same path appeared in the
SELECT results. Added WHERE status = 'pending' to match the SELECT.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(track): prevent double-close of fd in write_status

If os.replace() raised after os.close(fd) succeeded, the except handler
would call os.close(fd) again on the already-closed descriptor, masking
the original exception and skipping os.unlink(tmp). Set fd = -1 after
close to guard against the double-close.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(track): serialise _drain_queue with a lock to prevent duplicate uploads

_drain_queue was called from both the main loop and the watcher debounce
thread without synchronisation. The non-atomic SELECT+UPDATE in
claim_batch meant both threads could select the same pending items before
either UPDATE executed, causing duplicate presigned-URL requests and
duplicate pool submissions. A threading.Lock now serialises the
claim_batch call so only one thread can hold claimed items at a time.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(track): propagate FLEET_TRACK_BASE_URL into systemd unit on Linux

The launchd plist conditionally passed FLEET_TRACK_BASE_URL through but
the systemd unit only set PATH. On Linux, any custom base URL would be
silently ignored and the daemon would fall back to the production server.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: bump version to 0.2.119

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(track): use GLOBAL_BASE_URL (orchestrator.fleetai.com) as default

Replaced the hardcoded regional URL with the shared GLOBAL_BASE_URL
constant from fleet/config.py, consistent with how the rest of the SDK
resolves the default base URL.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(auth): use urlsafe_b64decode and correct padding for JWT expiry check

JWTs use base64url encoding (- and _ instead of + and /). Using
b64decode silently corrupted payloads containing those chars, causing
json.loads to fail and the except handler to return True (expired),
forcing a Supabase refresh on every get_valid_token() call.

Also fixed the padding: (4 - n % 4) % 4 → -n % 4 so a string already
aligned to 4 bytes gets 0 padding chars instead of 4.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(auth): move Supabase credentials out of source into GitHub secrets

Hardcoded SUPABASE_URL and SUPABASE_ANON_KEY are replaced by a
placeholder fleet/_supabase.py that gets sed-replaced with real values
from GitHub secrets (SUPABASE_URL, SUPABASE_ANON_KEY) before the PyPI
build runs. Local dev can override via env vars.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor(track): remove S3 details from provision response

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(track): key claim_batch UPDATE and mark_done/failed on (path, sha256) not path alone

Path-only matching in claim_batch could orphan rows when two entries share a path
but differ in sha256 — the UPDATE would set both to in_flight even though only one
was returned by the SELECT. That extra row would never be uploaded or resolved.

Fix: use (path, sha256) IN (...) for the UPDATE, and thread sha256 through
pool.submit → on_done/on_failed → mark_done/mark_failed so every callback
targets the exact row that was claimed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(track): add mutex to UploadQueue to prevent concurrent SQLite access

check_same_thread=False only suppresses SQLite's thread check — it doesn't make
the connection safe. Eight uploader worker threads calling mark_done/mark_failed
concurrently on the same connection caused sqlite3.InterfaceError in production.

Added threading.Lock() in UploadQueue wrapping every execute+commit. Also removed
the now-redundant _drain_lock in daemon (claim_batch is protected by the queue lock).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(track): detect and recover from corrupt SQLite queue on startup

Runs PRAGMA integrity_check after opening the connection. If the database is
corrupt (residue from the pre-lock threading bug), logs a warning, deletes the
.db/.db-shm/.db-wal files, and reinitialises. Queue items are lost but the next
reconcile re-enqueues anything not yet confirmed on S3.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
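
A rough sketch of the startup recovery described above, using `PRAGMA integrity_check`. The function name `open_queue` and the table layout are assumptions; a file that is not a database at all raises `sqlite3.DatabaseError`, so that case is handled as corruption too.

```python
import logging
import os
import sqlite3

log = logging.getLogger(__name__)


def open_queue(db_path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    try:
        row = conn.execute("PRAGMA integrity_check").fetchone()
        healthy = row is not None and row[0] == "ok"
    except sqlite3.DatabaseError:
        healthy = False
    if not healthy:
        log.warning("upload queue at %s is corrupt; recreating it", db_path)
        conn.close()
        # Remove the database together with its WAL and shared-memory files.
        for suffix in ("", "-shm", "-wal"):
            try:
                os.remove(db_path + suffix)
            except FileNotFoundError:
                pass
        conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS queue ("
        "path TEXT, sha256 TEXT, status TEXT, enqueued_at REAL)"
    )
    conn.commit()
    return conn
```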

* chore: bump version to 0.2.121

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Gautham Elango <gautham.gg@gmail.com>
When a file changes sha256 over time, old (path, sha256) rows accumulate in
failed/pending state. Once a newer sha256 uploads successfully, those rows are
superseded and will never produce useful uploads. Delete them in mark_done to
keep the queue clean and status counts meaningful.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(track): prune stale queue rows for same path on successful upload

When a file changes sha256 over time, old (path, sha256) rows accumulate in
failed/pending state. Once a newer sha256 uploads successfully, those rows are
superseded and will never produce useful uploads. Delete them in mark_done to
keep the queue clean and status counts meaningful.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(track): use per-phase httpx timeouts for S3 uploads

A single 120s blanket timeout was too short for large JSONL files on slow
connections — the write phase alone could exceed it. Split into per-phase
timeouts: 10s connect, 30s read, 300s write.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(track): only prune queue rows enqueued before the completing upload

The previous DELETE removed all other pending/failed rows for the same path,
including newer versions enqueued by the watcher while an older upload was
in-flight. For rapidly-growing JSONL files this race was likely: older version
completes, newer pending row gets deleted, newer version doesn't upload until
the next reconcile (~10 min).

Fix: filter the DELETE to only rows with enqueued_at <= the completing row's
enqueued_at, preserving any newer versions for immediate upload.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
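
The pruning rule from this fix (delete only superseded rows that were enqueued no later than the completing upload) might look roughly like this. The schema and the `mark_done` signature are assumed for illustration.

```python
import sqlite3


def mark_done(conn: sqlite3.Connection, path: str, sha256: str) -> None:
    row = conn.execute(
        "SELECT enqueued_at FROM queue WHERE path = ? AND sha256 = ?",
        (path, sha256),
    ).fetchone()
    if row is None:
        return
    conn.execute(
        "UPDATE queue SET status = 'done' WHERE path = ? AND sha256 = ?",
        (path, sha256),
    )
    # Prune older versions of the same path, but keep anything enqueued
    # after this upload started: those are newer and still need uploading.
    conn.execute(
        "DELETE FROM queue WHERE path = ? AND sha256 != ? "
        "AND status IN ('pending', 'failed') AND enqueued_at <= ?",
        (path, sha256, row[0]),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE queue (path TEXT, sha256 TEXT, status TEXT, enqueued_at REAL)")
conn.executemany(
    "INSERT INTO queue VALUES (?, ?, ?, ?)",
    [("a", "old", "failed", 1.0), ("a", "cur", "in_flight", 2.0), ("a", "new", "pending", 3.0)],
)
mark_done(conn, "a", "cur")
print(conn.execute("SELECT sha256, status FROM queue ORDER BY enqueued_at").fetchall())
```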

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Stores email, user_id, team_id, team_name, hostname, platform in the config
at provision time (pulled from credentials.json + provision response), then
writes them into every manifest.json upload.

This lets you associate any device/S3 prefix with a real user without a
Supabase lookup.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ute convention

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
feat(track): embed user identity in manifest.json
* feat: add verifier structure checks to bundle validator

Checks for: wrong import module (fleet.verifier), old function signature
(submission_dir), wrong Criterion API (weight=), missing env.judge.grade(),
env.read_file() usage, missing File.from_env for prompted output files,
empty files={} dict, solutions/ without .s3() references.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: switch upload_task.py to tar-based seed upload

Replace the old file-sets upload flow with the new /v1/seeds API:
- Package files/ as seed.tar.zst via piped tar|zstd
- Get presigned S3 PUT URL from POST /v1/seeds/{data_key}/{env_key}/upload-url
- Upload tar directly to S3
- Create task with data_id/data_version and FLEET__FLEXIBLE_SEED env vars

The data_key is always the task key (unique per task). The --env-version
flag controls both the seed registration and the task's environment version.

Depends on: fleet-ai/theseus#3935

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: update download_task.py for tar seed bundles

Download tasks in the bundle format that upload_task.py expects:
- Fetch task from API, save minimal upload-compatible task.json
- Download seed tar from S3 via aws cli, extract to files/
- Fall back to legacy file-sets API for old tasks
- Print re-upload command for easy round-tripping

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address bugbot findings

- Read tar stderr to prevent pipe buffer deadlock
- Fix has_files check to filter directories (only count actual files)
- Update Makefile: remove --allow-overwrite and --team-id flags,
  add --env-version via VERSION= variable

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address bugbot findings in upload/download scripts

- Remove carlisle default for environment_id; fail early if missing
- Return False on download failures instead of sys.exit(1), enabling
  fallback to legacy file-sets API
- Fix docstring to match actual behavior (aws cli, not API endpoint)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ator (#90)

Data files no longer need to live under files/notebooks/ specifically,
and prompts don't need to include list_workspace_files(). The validator
still checks that files/ exists and has content (check 5).

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The default models were accidentally narrowed to just claude-sonnet-4.6.
Restore the original cross-model set (Gemini, Opus, GPT-5.2) so uploads
get a proper multi-model sweep by default.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implements the fleet/judge/ package with Rubric, Criterion, JudgeClient,
JudgeEndpointConfig, and JudgeResult classes. Supports both Anthropic and
OpenAI-compatible endpoints via raw httpx, with env var configuration for
airgapped/Docker Worlds environments. Includes prompt construction that
mirrors the orchestrator's judge_service.py logic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Reviewed by Cursor Bugbot for commit 6bdb38e.

Comment thread fleet/judge/models.py
return (
f"JudgeResult(score={self.normalized_score:.4f}, "
f"criteria={len(self.criteria)})"
)

JudgeResult lacks .score attribute for verifier compatibility

High Severity

JudgeResult is designed for "verifier compatibility" via __float__, but the verifier framework's local execution path checks hasattr(result, "score") before ever calling float(). Since JudgeResult only has normalized_score, total_score, and max_score — but no .score attribute — the verifier raises a ValueError, which is silently caught, returning 0.0. A verifier returning a JudgeResult directly will always silently produce a zero score in local execution.
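
One possible shape of a fix is to expose a `score` attribute alongside `__float__`. The sketch below is illustrative only: `coerce_verifier_result` is a stand-in for the check described in this comment rather than the framework's real code, and aliasing `score` to the normalized value is an assumption.

```python
class JudgeResult:
    def __init__(self, total_score, max_score):  # hypothetical signature
        self.total_score = total_score
        self.max_score = max_score

    @property
    def normalized_score(self):
        return self.total_score / self.max_score if self.max_score else 0.0

    @property
    def score(self):
        # Alias so that hasattr(result, "score") succeeds.
        return self.normalized_score

    def __float__(self):
        return float(self.normalized_score)


def coerce_verifier_result(result):
    """Stand-in for the hasattr(result, "score") check described above."""
    if isinstance(result, (int, float)):
        return float(result)
    if hasattr(result, "score"):
        return float(result.score)
    raise ValueError(f"unsupported verifier result: {type(result).__name__}")


print(coerce_verifier_result(JudgeResult(total_score=6, max_score=8)))  # 0.75
```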

Comment thread fleet/judge/prompt.py

parts.append(f"## Submission\n\n{submission}")

return "\n\n".join(parts)

Unused rubric parameter in _build_user_message

Low Severity

_build_user_message accepts a rubric parameter that is never referenced in the function body. The caller build_structured_rubric_prompt explicitly passes rubric=rubric, but the function only uses submission, ground_truth, problem, context, and conversation. This is either dead code or indicates missing logic that was intended to include rubric information in the user message.


9 participants