feat: add fleet.judge module for LLM-as-a-judge grading (ENG-1527) #92
jarredFleetSo wants to merge 92 commits into
When verifier code contains multiple functions (e.g., a main verifier function and helper functions), the helper functions were not accessible from the main function due to namespace isolation. The exec() call created functions in local_namespace, but the main function's __globals__ pointed to exec_globals which didn't contain the helper functions. This caused NameError when the main function tried to call helpers, which was silently caught and returned 0.0. Fix: Merge local_namespace into exec_globals after exec() so all defined functions are accessible when the verifier is called.
…mespace fix: allow verifier helper functions to be called from main verifier
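The namespace problem described above can be reproduced in a few lines. This is a minimal sketch, not the SDK's actual loader: `VERIFIER_SRC`, `load_verifier`, and the `merge` flag are hypothetical names used only for illustration.

```python
# Hypothetical verifier source with one helper and one main function.
VERIFIER_SRC = '''
def helper(x):
    return x * 2

def verify(x):
    return helper(x) / 10
'''


def load_verifier(src: str, merge: bool):
    exec_globals = {}
    local_namespace = {}
    # With separate dicts, `def` statements bind names in local_namespace,
    # but each function's __globals__ is exec_globals.
    exec(src, exec_globals, local_namespace)
    if merge:
        # The fix described above: expose helpers to global name lookups.
        exec_globals.update(local_namespace)
    return local_namespace["verify"]


broken = load_verifier(VERIFIER_SRC, merge=False)
fixed = load_verifier(VERIFIER_SRC, merge=True)

try:
    broken(5)
    broken_error = None
except NameError as exc:  # helper is not visible from verify's globals
    broken_error = exc

fixed_score = fixed(5)  # helper(5) / 10
```

Without the merge, the `NameError` is what a caller would otherwise swallow and turn into a 0.0 score.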
InstanceRequest changes:
- Add: profile_id, async_provision, instance_mode, ssh_public_keys, snapshot_interval_minutes, version (deprecated)
- Fix: region default changed from 'us-west-1' to None (server decides)
- Fix: created_from default changed from None to 'api'
TaskRequest changes:
- Add: verifier_func, project_key, data_id, data_version, writer_metadata
- Add: model_config with extra='ignore' and populate_by_name=True
- Add: alias='env_id' for environment_id field
- Remove: metadata (doesn't exist in orchestrator TaskRequest, only in TaskResponse)
…API" This reverts commit 9a0af14.
add metadata to tasks in SDK
bump version
…odels
Add factual_answer field to support research/factual tasks:
- Task model: stores expected answer for verification
- TaskRequest: accept factual_answer when creating tasks
- TaskResponse: return factual_answer from API
Part of: https://linear.app/fleet-ai/issue/ENG-843/import-script-needs-to-support-output-json-schemas
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
feat: add factual_answer field to Task and API models
Add task_modality field to Task and TaskResponse models to support copying task modality (computer_use, tool_use, browser) when importing tasks via the SDK.
Changes:
- Add task_modality to TaskResponse model (API response)
- Add task_modality to Task model (SDK model)
- Pass task_modality from TaskResponse to Task in load_tasks
Co-authored-by: Cursor <cursoragent@cursor.com>
Addresses Bugbot comment: load_task_from_json wasn't extracting task_modality from JSON data, causing tasks loaded from JSON files to have task_modality=None even when the JSON contains this field. Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Add task_modality field to async Task model, TaskResponse model, and update load_task_from_json and load_tasks to preserve task_modality. Co-authored-by: Cursor <cursoragent@cursor.com>
fix: remove stale output_json_schema warning from import_tasks
Simple scripts for task authors to download existing tasks, edit them locally, and upload as new tasks. Uses raw requests (no SDK dependency) with auto-resolved team ID from API key. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
feat: add task bundle editing scripts
* fix(download): don't pass auto-resolved team_id to task GET
The team_id query param on GET /v1/tasks/{key} requires admin
privileges. Previously the script always passed it (from auto-resolve),
causing 403 errors for non-admin API keys. Now only passes team_id
when explicitly provided via --team-id flag.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: skip resolve_team_id when --team-id is explicitly provided
Avoids an unnecessary API call that could fail and block the download.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
)
* feat(upload): add job launching and auto-generated unique task keys
- Make --key optional; auto-generates {original_key}_{uuid[:8]} when omitted
- Replace local key comparison with server-side existence check (GET /v1/tasks/{key})
- Launch job by default after upload (POST /v1/jobs) with --no-launch-job to skip
- Add --models, --pass-k flags for job configuration
- Default models: gemini-3.1-pro-preview, claude-opus-4.6, gpt-5.2
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix(upload): raise on unexpected status in task existence check
Previously any non-200 (including 500, 403, 429) was treated as "key available", silently skipping the guard. Now only 404 means available; other errors are surfaced.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
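The key generation and existence-check rules above could look roughly like this. It is a sketch under assumptions: `generate_task_key` and `key_is_available` are hypothetical helper names, and the HTTP call itself is left out so that only the status-code handling is shown.

```python
import uuid


def generate_task_key(original_key: str) -> str:
    # Mirrors the {original_key}_{uuid[:8]} pattern from the commit message.
    return f"{original_key}_{uuid.uuid4().hex[:8]}"


def key_is_available(status_code: int) -> bool:
    """Interpret the status of GET /v1/tasks/{key} for the existence guard."""
    if status_code == 404:
        return True   # no such task: the key can be used
    if status_code == 200:
        return False  # task already exists
    # 403, 429, 500, ... must not be mistaken for "key available".
    raise RuntimeError(f"unexpected status {status_code} while checking task key")
```

Raising on anything other than 200 or 404 keeps a transient server error from silently bypassing the guard.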
The file-set key may differ from the task key (e.g., without a version suffix). Pull the key from the task's env_variables.TASK_KEY when available, falling back to the CLI --task-key argument. Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…validation (#73) The jobs API returns `job_id` not `id` — fix extraction so the job ID and dashboard URL are printed after launch. Also add validation that data files are under files/notebooks/ (the path unpacked into the agent workspace) and that the prompt's list_workspace_files() pattern matches actual files. Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* fix(upload): extract job_id from API response and add workspace file validation
The jobs API returns `job_id` not `id`; fix extraction so the job ID and dashboard URL are printed after launch. Also add validation that data files are under files/notebooks/ (the path unpacked into the agent workspace) and that the prompt's list_workspace_files() pattern matches actual files.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* feat: add standalone launch_job script for existing tasks
Provides a simple way to launch jobs for tasks that already exist on the server, without the upload/create flow.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
The web page now redirects the browser to http://127.0.0.1:PORT/callback with tokens as query params instead of POSTing JSON. Replace do_POST + do_OPTIONS (CORS preflight) with a do_GET handler that reads query params, validates the state nonce, and returns a plain HTML confirmation page. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… URL Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
feat: add flt login browser auth flow (ENG-1192)
…#79)
* feat(task-bundle-editing): add Makefile, templates, CLAUDE.md, and workflow docs
Add task authoring toolkit for creating, editing, and deploying Fleet evaluation tasks. Includes Makefile wrapping existing Python scripts, verifier templates for analysis and plot tasks, CLAUDE.md for Claude Code integration, and extended README with end-to-end workflow documentation.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* fix: clarify placeholder notation in template schema notes
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* feat(track): add passive AI session sync daemon
Adds `flt track` — a background daemon that watches local AI coding
session files (Claude Code, Cursor, Codex) and syncs them to S3 using
a Merkle tree-based dedup engine.
- Merkle sync: flat {path: sha256} map, O(1) root hash change detection
- WAL-backed SQLite upload queue with exponential backoff
- Presigned S3 PUT URLs — daemon never holds AWS creds
- FSEvents/inotify watcher with polling fallback
- launchd (macOS) + systemd (Linux) service install via `flt track enable`
- Manifest written only after uploads confirmed (_confirmed_map pattern)
- 10-minute reconcile interval
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(track): add status = 'pending' filter to claim_batch UPDATE
The UPDATE was missing a status filter, so it could revert done/failed/
in_flight rows back to in_flight when the same path appeared in the
SELECT results. Added WHERE status = 'pending' to match the SELECT.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(track): prevent double-close of fd in write_status
If os.replace() raised after os.close(fd) succeeded, the except handler
would call os.close(fd) again on the already-closed descriptor, masking
the original exception and skipping os.unlink(tmp). Set fd = -1 after
close to guard against the double-close.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(track): serialise _drain_queue with a lock to prevent duplicate uploads
_drain_queue was called from both the main loop and the watcher debounce
thread without synchronisation. The non-atomic SELECT+UPDATE in
claim_batch meant both threads could select the same pending items before
either UPDATE executed, causing duplicate presigned-URL requests and
duplicate pool submissions. A threading.Lock now serialises the
claim_batch call so only one thread can hold claimed items at a time.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(track): propagate FLEET_TRACK_BASE_URL into systemd unit on Linux
The launchd plist conditionally passed FLEET_TRACK_BASE_URL through but
the systemd unit only set PATH. On Linux, any custom base URL would be
silently ignored and the daemon would fall back to the production server.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* chore: bump version to 0.2.119
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(track): use GLOBAL_BASE_URL (orchestrator.fleetai.com) as default
Replaced the hardcoded regional URL with the shared GLOBAL_BASE_URL
constant from fleet/config.py, consistent with how the rest of the SDK
resolves the default base URL.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(auth): use urlsafe_b64decode and correct padding for JWT expiry check
JWTs use base64url encoding (- and _ instead of + and /). Using
b64decode silently corrupted payloads containing those chars, causing
json.loads to fail and the except handler to return True (expired),
forcing a Supabase refresh on every get_valid_token() call.
Also fixed the padding: (4 - n % 4) % 4 → -n % 4 so a string already
aligned to 4 bytes gets 0 padding chars instead of 4.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(auth): move Supabase credentials out of source into GitHub secrets
Hardcoded SUPABASE_URL and SUPABASE_ANON_KEY are replaced by a
placeholder fleet/_supabase.py that gets sed-replaced with real values
from GitHub secrets (SUPABASE_URL, SUPABASE_ANON_KEY) before the PyPI
build runs. Local dev can override via env vars.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* refactor(track): remove S3 details from provision response
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(track): key claim_batch UPDATE and mark_done/failed on (path, sha256) not path alone
Path-only matching in claim_batch could orphan rows when two entries share a path
but differ in sha256 — the UPDATE would set both to in_flight even though only one
was returned by the SELECT. That extra row would never be uploaded or resolved.
Fix: use (path, sha256) IN (...) for the UPDATE, and thread sha256 through
pool.submit → on_done/on_failed → mark_done/mark_failed so every callback
targets the exact row that was claimed.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
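The JWT expiry fix in the list above comes down to two details: decode with the URL-safe alphabet, and pad with `-n % 4` characters (which is 0 for an already aligned string, whereas a bare `4 - n % 4` would give 4). A minimal sketch, assuming hypothetical function names `jwt_payload` and `is_expired` and no signature verification:

```python
import base64
import json
import time
from typing import Optional


def jwt_payload(token: str) -> dict:
    payload_b64 = token.split(".")[1]
    # base64url omits padding; restore exactly as many '=' as needed.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    # urlsafe_b64decode understands '-' and '_' in place of '+' and '/'.
    return json.loads(base64.urlsafe_b64decode(payload_b64))


def is_expired(token: str, now: Optional[float] = None) -> bool:
    now = time.time() if now is None else now
    try:
        return jwt_payload(token)["exp"] <= now
    except Exception:
        # An unreadable token is treated as expired, as in the commit message.
        return True
```

This only inspects the `exp` claim locally; it does not validate the token.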
* fix(track): add mutex to UploadQueue to prevent concurrent SQLite access
check_same_thread=False only suppresses SQLite's thread check — it doesn't make
the connection safe. Eight uploader worker threads calling mark_done/mark_failed
concurrently on the same connection caused sqlite3.InterfaceError in production.
Added threading.Lock() in UploadQueue wrapping every execute+commit. Also removed
the now-redundant _drain_lock in daemon (claim_batch is protected by the queue lock).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(track): detect and recover from corrupt SQLite queue on startup
Runs PRAGMA integrity_check after opening the connection. If the database is
corrupt (residue from the pre-lock threading bug), logs a warning, deletes the
.db/.db-shm/.db-wal files, and reinitialises. Queue items are lost but the next
reconcile re-enqueues anything not yet confirmed on S3.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* chore: bump version to 0.2.121
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Gautham Elango <gautham.gg@gmail.com>
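The queue fixes above (a `status = 'pending'` filter, matching on `(path, sha256)`, and a mutex around every execute+commit) can be combined in one small sketch. The class below is hypothetical: the schema, method signatures, and the per-row `executemany` update are assumptions made for illustration, not the daemon's actual code.

```python
import sqlite3
import threading


class UploadQueue:
    """Illustrative queue sharing one SQLite connection across threads."""

    def __init__(self, path: str = ":memory:"):
        # check_same_thread=False only disables the thread check;
        # the lock is what makes shared use of the connection safe.
        self._conn = sqlite3.connect(path, check_same_thread=False)
        self._lock = threading.Lock()
        with self._lock:
            self._conn.execute(
                "CREATE TABLE IF NOT EXISTS queue ("
                "path TEXT, sha256 TEXT, status TEXT DEFAULT 'pending', "
                "PRIMARY KEY (path, sha256))"
            )
            self._conn.commit()

    def enqueue(self, path: str, sha256: str) -> None:
        with self._lock:
            self._conn.execute(
                "INSERT OR IGNORE INTO queue (path, sha256) VALUES (?, ?)",
                (path, sha256),
            )
            self._conn.commit()

    def claim_batch(self, limit: int = 10) -> list:
        with self._lock:  # SELECT + UPDATE happen as one critical section
            rows = self._conn.execute(
                "SELECT path, sha256 FROM queue WHERE status = 'pending' LIMIT ?",
                (limit,),
            ).fetchall()
            # Update exactly the claimed rows, and only if still pending.
            self._conn.executemany(
                "UPDATE queue SET status = 'in_flight' "
                "WHERE path = ? AND sha256 = ? AND status = 'pending'",
                rows,
            )
            self._conn.commit()
            return rows

    def mark_done(self, path: str, sha256: str) -> None:
        with self._lock:
            self._conn.execute(
                "UPDATE queue SET status = 'done' WHERE path = ? AND sha256 = ?",
                (path, sha256),
            )
            self._conn.commit()
```

Because `claim_batch` holds the queue lock for both statements, a separate drain lock in the caller is not needed for correctness.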
When a file changes sha256 over time, old (path, sha256) rows accumulate in failed/pending state. Once a newer sha256 uploads successfully, those rows are superseded and will never produce useful uploads. Delete them in mark_done to keep the queue clean and status counts meaningful. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…l upload" This reverts commit 2f6891f.
* fix(track): prune stale queue rows for same path on successful upload
When a file changes sha256 over time, old (path, sha256) rows accumulate in failed/pending state. Once a newer sha256 uploads successfully, those rows are superseded and will never produce useful uploads. Delete them in mark_done to keep the queue clean and status counts meaningful.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(track): use per-phase httpx timeouts for S3 uploads
A single 120s blanket timeout was too short for large JSONL files on slow connections: the write phase alone could exceed it. Split into per-phase timeouts: 10s connect, 30s read, 300s write.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix(track): only prune queue rows enqueued before the completing upload
The previous DELETE removed all other pending/failed rows for the same path, including newer versions enqueued by the watcher while an older upload was in-flight. For rapidly-growing JSONL files this race was likely: older version completes, newer pending row gets deleted, newer version doesn't upload until the next reconcile (~10 min).
Fix: filter the DELETE to only rows with enqueued_at <= the completing row's enqueued_at, preserving any newer versions for immediate upload.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
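The pruning rule from the last fix can be illustrated with a small in-memory table. This is a sketch only: the `queue` schema, the `make_queue` helper, and this `mark_done` signature are assumptions, and locking is omitted for brevity.

```python
import sqlite3


def make_queue() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE queue (path TEXT, sha256 TEXT, status TEXT, "
        "enqueued_at REAL, PRIMARY KEY (path, sha256))"
    )
    return conn


def mark_done(conn: sqlite3.Connection, path: str, sha256: str) -> None:
    row = conn.execute(
        "SELECT enqueued_at FROM queue WHERE path = ? AND sha256 = ?",
        (path, sha256),
    ).fetchone()
    if row is None:
        return
    conn.execute(
        "UPDATE queue SET status = 'done' WHERE path = ? AND sha256 = ?",
        (path, sha256),
    )
    # Prune only superseded rows: same path, pending/failed, and enqueued
    # no later than the upload that just completed.
    conn.execute(
        "DELETE FROM queue WHERE path = ? AND sha256 != ? "
        "AND status IN ('pending', 'failed') AND enqueued_at <= ?",
        (path, sha256, row[0]),
    )
    conn.commit()


conn = make_queue()
conn.executemany(
    "INSERT INTO queue VALUES (?, ?, ?, ?)",
    [
        ("s.jsonl", "v1", "failed", 1.0),     # older version, superseded
        ("s.jsonl", "v2", "in_flight", 2.0),  # the upload that completes
        ("s.jsonl", "v3", "pending", 3.0),    # newer version, must survive
    ],
)
mark_done(conn, "s.jsonl", "v2")
remaining = conn.execute(
    "SELECT sha256, status FROM queue ORDER BY enqueued_at"
).fetchall()
```

In this scenario the `v1` row is deleted while the newer `v3` row stays pending and can upload without waiting for the next reconcile.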
Stores email, user_id, team_id, team_name, hostname, platform in the config at provision time (pulled from credentials.json + provision response), then writes them into every manifest.json upload. This lets you associate any device/S3 prefix with a real user without a Supabase lookup. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ute convention Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
feat(track): embed user identity in manifest.json
* feat: add verifier structure checks to bundle validator
Checks for: wrong import module (fleet.verifier), old function signature
(submission_dir), wrong Criterion API (weight=), missing env.judge.grade(),
env.read_file() usage, missing File.from_env for prompted output files,
empty files={} dict, solutions/ without .s3() references.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: switch upload_task.py to tar-based seed upload
Replace the old file-sets upload flow with the new /v1/seeds API:
- Package files/ as seed.tar.zst via piped tar|zstd
- Get presigned S3 PUT URL from POST /v1/seeds/{data_key}/{env_key}/upload-url
- Upload tar directly to S3
- Create task with data_id/data_version and FLEET__FLEXIBLE_SEED env vars
The data_key is always the task key (unique per task). The --env-version
flag controls both the seed registration and the task's environment version.
Depends on: fleet-ai/theseus#3935
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: update download_task.py for tar seed bundles
Download tasks in the bundle format that upload_task.py expects:
- Fetch task from API, save minimal upload-compatible task.json
- Download seed tar from S3 via aws cli, extract to files/
- Fall back to legacy file-sets API for old tasks
- Print re-upload command for easy round-tripping
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: address bugbot findings
- Read tar stderr to prevent pipe buffer deadlock
- Fix has_files check to filter directories (only count actual files)
- Update Makefile: remove --allow-overwrite and --team-id flags,
add --env-version via VERSION= variable
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: address bugbot findings in upload/download scripts
- Remove carlisle default for environment_id; fail early if missing
- Return False on download failures instead of sys.exit(1), enabling
fallback to legacy file-sets API
- Fix docstring to match actual behavior (aws cli, not API endpoint)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ator (#90) Data files no longer need to live under files/notebooks/ specifically, and prompts don't need to include list_workspace_files(). The validator still checks that files/ exists and has content (check 5). Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The default models were accidentally narrowed to just claude-sonnet-4.6. Restore the original cross-model set (Gemini, Opus, GPT-5.2) so uploads get a proper multi-model sweep by default. Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Implements the fleet/judge/ package with Rubric, Criterion, JudgeClient, JudgeEndpointConfig, and JudgeResult classes. Supports both Anthropic and OpenAI-compatible endpoints via raw httpx, with env var configuration for airgapped/Docker Worlds environments. Includes prompt construction that mirrors the orchestrator's judge_service.py logic. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Reviewed by Cursor Bugbot for commit 6bdb38e.
    return (
        f"JudgeResult(score={self.normalized_score:.4f}, "
        f"criteria={len(self.criteria)})"
    )
JudgeResult lacks .score attribute for verifier compatibility
High Severity
JudgeResult is designed for "verifier compatibility" via __float__, but the verifier framework's local execution path checks hasattr(result, "score") before ever calling float(). Since JudgeResult only has normalized_score, total_score, and max_score — but no .score attribute — the verifier raises a ValueError, which is silently caught, returning 0.0. A verifier returning a JudgeResult directly will always silently produce a zero score in local execution.
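One possible way to address this finding is to expose a `.score` property alongside `__float__`. The sketch below is an assumption, not the PR's code: `JudgeResultSketch` stands in for `JudgeResult`, and `extract_score` is a hypothetical stand-in for the verifier check described above.

```python
from dataclasses import dataclass


@dataclass
class JudgeResultSketch:
    total_score: float
    max_score: float

    @property
    def normalized_score(self) -> float:
        return self.total_score / self.max_score if self.max_score else 0.0

    @property
    def score(self) -> float:
        # Alias so hasattr(result, "score") succeeds.
        return self.normalized_score

    def __float__(self) -> float:
        return self.normalized_score


def extract_score(result) -> float:
    """Hypothetical stand-in for the framework's local execution check."""
    if hasattr(result, "score"):
        return float(result.score)
    raise ValueError("result has no .score attribute")
```

With the alias in place, both `float(result)` and attribute-based checks yield the same normalized value.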
    parts.append(f"## Submission\n\n{submission}")

    return "\n\n".join(parts)
Unused rubric parameter in _build_user_message
Low Severity
_build_user_message accepts a rubric parameter that is never referenced in the function body. The caller build_structured_rubric_prompt explicitly passes rubric=rubric, but the function only uses submission, ground_truth, problem, context, and conversation. This is either dead code or indicates missing logic that was intended to include rubric information in the user message.


Summary
- `fleet/judge/` package that provides LLM-as-a-judge grading capability, enabling verifiers to use `from fleet.judge import Rubric, Criterion` as designed

New Files
| File | Contents |
|---|---|
| `fleet/judge/__init__.py` | `Rubric`, `Criterion`, `JudgeClient`, `JudgeEndpointConfig`, `JudgeResult` |
| `fleet/judge/models.py` | `Criterion(name, max, levels)`, `Rubric(criteria)`, `CriterionResult`, `JudgeResult` (with `__float__`), `JudgeEndpointConfig` |
| `fleet/judge/client.py` | `JudgeClient` with async `grade()` + sync `grade_sync()`, Anthropic/OpenAI backends via raw httpx, JSON extraction with trailing comma repair |
| `fleet/judge/prompt.py` | `build_structured_rubric_prompt()` mirroring orchestrator logic |
| `tests/test_judge.py` | |

Configuration

    # Docker Worlds / airgapped: configure via env vars
    JUDGE_ENDPOINT=https://customer-llm.internal/v1
    JUDGE_API_KEY=sk-...
    JUDGE_API_FORMAT=openai
    JUDGE_MODEL=gpt-4o

Key Design Decisions
- `Criterion("name", max=10, levels={...})` matches verifier usage exactly
- `httpx` (already a dep) for HTTP calls
- `JUDGE_*` env vars → `ANTHROPIC_API_KEY` → helpful error
- `float(result)` returns normalized score (0.0-1.0) for verifier compatibility

Related

Test plan
- `from fleet.judge import Rubric, Criterion` works in verifier context

🤖 Generated with Claude Code
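The configuration fallback order named in the description (`JUDGE_*` env vars, then `ANTHROPIC_API_KEY`, then a helpful error) might be resolved along these lines. This is a sketch under assumptions: `resolve_judge_config` and the returned dictionary keys are hypothetical and are not the actual `JudgeEndpointConfig` API.

```python
import os
from typing import Mapping, Optional


def resolve_judge_config(env: Optional[Mapping[str, str]] = None) -> dict:
    env = os.environ if env is None else env
    # 1. Explicit judge endpoint (e.g. airgapped / Docker Worlds setups).
    if env.get("JUDGE_ENDPOINT") and env.get("JUDGE_API_KEY"):
        return {
            "endpoint": env["JUDGE_ENDPOINT"],
            "api_key": env["JUDGE_API_KEY"],
            "api_format": env.get("JUDGE_API_FORMAT"),
            "model": env.get("JUDGE_MODEL"),
        }
    # 2. Fall back to an Anthropic key if one is present.
    if env.get("ANTHROPIC_API_KEY"):
        return {
            "endpoint": None,  # provider default; left unspecified here
            "api_key": env["ANTHROPIC_API_KEY"],
            "api_format": "anthropic",
            "model": env.get("JUDGE_MODEL"),
        }
    # 3. Nothing configured: fail with an actionable message.
    raise RuntimeError(
        "No judge endpoint configured: set JUDGE_ENDPOINT and JUDGE_API_KEY, "
        "or set ANTHROPIC_API_KEY."
    )
```

Passing `env` explicitly keeps the function easy to exercise in tests without touching the process environment.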