feat: add fleet.judge module for LLM-as-a-judge grading (ENG-1527) by jarredFleetSo · Pull Request #92 · fleet-ai/fleet-sdk

jarredFleetSo · 2026-04-10T16:07:03Z

Summary

Creates the `fleet/judge/` package that provides LLM-as-a-judge grading capability, enabling verifiers to use `from fleet.judge import Rubric, Criterion` as designed
Implements standalone judge grading via configurable LLM endpoints (Anthropic or OpenAI-compatible), enabling offline/airgapped Docker Worlds evaluation
Mirrors the orchestrator's prompt construction logic so judge grading works entirely client-side without calling the Fleet orchestrator

New Files

| File | Purpose |
|---|---|
| `fleet/judge/__init__.py` | Public exports: `Rubric`, `Criterion`, `JudgeClient`, `JudgeEndpointConfig`, `JudgeResult` |
| `fleet/judge/models.py` | Data classes: `Criterion(name, max, levels)`, `Rubric(criteria)`, `CriterionResult`, `JudgeResult` (with `__float__`), `JudgeEndpointConfig` |
| `fleet/judge/client.py` | `JudgeClient` with async `grade()` + sync `grade_sync()`, Anthropic/OpenAI backends via raw httpx, JSON extraction with trailing comma repair |
| `fleet/judge/prompt.py` | `build_structured_rubric_prompt()` mirroring orchestrator logic |
| `tests/test_judge.py` | 36 tests covering all models, JSON extraction, prompt construction, env var config, response parsing |
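
The "JSON extraction with trailing comma repair" mentioned for `fleet/judge/client.py` could look roughly like the sketch below. The helper name `extract_json` and its exact behavior are assumptions for illustration; the PR's actual implementation is not shown on this page.

```python
import json
import re


def extract_json(text: str) -> dict:
    """Pull the first JSON object out of an LLM reply and parse it.

    Hypothetical helper: the real code in client.py may differ.
    """
    # Take everything from the first "{" to the last "}" so that prose or
    # Markdown fences around the object are ignored.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in judge response")
    candidate = text[start : end + 1]
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        # Repair attempt: drop commas that directly precede "}" or "]".
        # This is naive and would also alter such sequences inside strings.
        repaired = re.sub(r",\s*([}\]])", r"\1", candidate)
        return json.loads(repaired)
```

The repair step only runs after a strict parse fails, so well-formed responses are never modified.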

Configuration

# Docker Worlds / airgapped — configure via env vars
JUDGE_ENDPOINT=https://customer-llm.internal/v1
JUDGE_API_KEY=sk-...
JUDGE_API_FORMAT=openai
JUDGE_MODEL=gpt-4o

Key Design Decisions

Plain classes (not Pydantic): `Criterion("name", max=10, levels={...})` matches verifier usage exactly
No new dependencies: uses only `httpx` (already a dep) for HTTP calls
Config priority: explicit config → `JUDGE_*` env vars → `ANTHROPIC_API_KEY` → helpful error
`float(result)` returns normalized score (0.0-1.0) for verifier compatibility
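
The config priority listed above can be sketched as follows. This is only an illustration: `EndpointConfig` stands in for `JudgeEndpointConfig`, and its field names, the default `api_format`, the Anthropic base URL, and the rule that both `JUDGE_ENDPOINT` and `JUDGE_API_KEY` must be set are all assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class EndpointConfig:
    """Stand-in for JudgeEndpointConfig; field names are assumptions."""
    endpoint: str
    api_key: str
    api_format: str
    model: str


def resolve_config(
    explicit: Optional[EndpointConfig] = None,
    env: Optional[Mapping[str, str]] = None,
) -> EndpointConfig:
    env = os.environ if env is None else env
    # 1. An explicitly passed config always wins.
    if explicit is not None:
        return explicit
    # 2. JUDGE_* environment variables (Docker Worlds / airgapped setups).
    if env.get("JUDGE_ENDPOINT") and env.get("JUDGE_API_KEY"):
        return EndpointConfig(
            endpoint=env["JUDGE_ENDPOINT"],
            api_key=env["JUDGE_API_KEY"],
            api_format=env.get("JUDGE_API_FORMAT", "openai"),  # assumed default
            model=env.get("JUDGE_MODEL", ""),
        )
    # 3. Fall back to a plain Anthropic key.
    if env.get("ANTHROPIC_API_KEY"):
        return EndpointConfig(
            endpoint="https://api.anthropic.com",  # assumed default endpoint
            api_key=env["ANTHROPIC_API_KEY"],
            api_format="anthropic",
            model=env.get("JUDGE_MODEL", ""),
        )
    # 4. Nothing usable: fail with an actionable message.
    raise RuntimeError(
        "No judge endpoint configured: pass a config, set JUDGE_ENDPOINT and "
        "JUDGE_API_KEY, or set ANTHROPIC_API_KEY."
    )
```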

Test plan

36 unit tests passing
Verify `from fleet.judge import Rubric, Criterion` works in verifier context
Integration test with actual LLM endpoint

🤖 Generated with Claude Code

When verifier code contains multiple functions (e.g., a main verifier function and helper functions), the helper functions were not accessible from the main function due to namespace isolation. The exec() call created functions in local_namespace, but the main function's __globals__ pointed to exec_globals which didn't contain the helper functions. This caused NameError when the main function tried to call helpers, which was silently caught and returned 0.0. Fix: Merge local_namespace into exec_globals after exec() so all defined functions are accessible when the verifier is called.
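
The namespace problem described above can be reproduced with plain `exec()`. The sketch below is not the SDK's code; `load_verifier` and the sample verifier source are made up for illustration, but the `exec()` behavior it demonstrates is standard Python.

```python
VERIFIER_SOURCE = '''
def helper(x):
    return x * 2

def verify(x):
    return helper(x)
'''


def load_verifier(source: str, merge_namespaces: bool):
    exec_globals = {}
    local_namespace = {}
    # With separate dicts, the defs are stored in local_namespace, but each
    # function's __globals__ is exec_globals, which lacks "helper".
    exec(source, exec_globals, local_namespace)
    if merge_namespaces:
        # The fix: make everything defined by exec() visible as globals.
        exec_globals.update(local_namespace)
    return local_namespace["verify"]


broken = load_verifier(VERIFIER_SOURCE, merge_namespaces=False)
try:
    broken(21)
except NameError as exc:
    print("without merge:", exc)

fixed = load_verifier(VERIFIER_SOURCE, merge_namespaces=True)
print("with merge:", fixed(21))
```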

…mespace fix: allow verifier helper functions to be called from main verifier

InstanceRequest changes:
- Add: profile_id, async_provision, instance_mode, ssh_public_keys, snapshot_interval_minutes, version (deprecated)
- Fix: region default changed from 'us-west-1' to None (server decides)
- Fix: created_from default changed from None to 'api'

TaskRequest changes:
- Add: verifier_func, project_key, data_id, data_version, writer_metadata
- Add: model_config with extra='ignore' and populate_by_name=True
- Add: alias='env_id' for environment_id field
- Remove: metadata (doesn't exist in orchestrator TaskRequest, only in TaskResponse)

…API" This reverts commit 9a0af14.

add metadata to tasks in SDK

bump version

consolidate

…odels

Add factual_answer field to support research/factual tasks:
- Task model: stores expected answer for verification
- TaskRequest: accept factual_answer when creating tasks
- TaskResponse: return factual_answer from API

Part of: https://linear.app/fleet-ai/issue/ENG-843/import-script-needs-to-support-output-json-schemas

Co-authored-by: Cursor <cursoragent@cursor.com>

Co-authored-by: Cursor <cursoragent@cursor.com>

feat: add factual_answer field to Task and API models

Add task_modality field to Task and TaskResponse models to support copying task modality (computer_use, tool_use, browser) when importing tasks via the SDK.

Changes:
- Add task_modality to TaskResponse model (API response)
- Add task_modality to Task model (SDK model)
- Pass task_modality from TaskResponse to Task in load_tasks

Co-authored-by: Cursor <cursoragent@cursor.com>

Addresses Bugbot comment: load_task_from_json wasn't extracting task_modality from JSON data, causing tasks loaded from JSON files to have task_modality=None even when the JSON contains this field. Co-authored-by: Cursor <cursoragent@cursor.com>

Co-authored-by: Cursor <cursoragent@cursor.com>

Add task_modality field to async Task model, TaskResponse model, and update load_task_from_json and load_tasks to preserve task_modality. Co-authored-by: Cursor <cursoragent@cursor.com>

fix: remove stale output_json_schema warning from import_tasks

Simple scripts for task authors to download existing tasks, edit them locally, and upload as new tasks. Uses raw requests (no SDK dependency) with auto-resolved team ID from API key. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

feat: add task bundle editing scripts

* fix(download): don't pass auto-resolved team_id to task GET

  The team_id query param on GET /v1/tasks/{key} requires admin privileges. Previously the script always passed it (from auto-resolve), causing 403 errors for non-admin API keys. Now only passes team_id when explicitly provided via --team-id flag.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: skip resolve_team_id when --team-id is explicitly provided

  Avoids an unnecessary API call that could fail and block the download.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

)

* feat(upload): add job launching and auto-generated unique task keys

  - Make --key optional; auto-generates {original_key}_{uuid[:8]} when omitted
  - Replace local key comparison with server-side existence check (GET /v1/tasks/{key})
  - Launch job by default after upload (POST /v1/jobs) with --no-launch-job to skip
  - Add --models, --pass-k flags for job configuration
  - Default models: gemini-3.1-pro-preview, claude-opus-4.6, gpt-5.2

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(upload): raise on unexpected status in task existence check

  Previously any non-200 (including 500, 403, 429) was treated as "key available", silently skipping the guard. Now only 404 means available; other errors are surfaced.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
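
A minimal sketch of the two upload behaviors described above: the auto-generated key and the stricter existence check. The function names are hypothetical, and the status code is passed in directly so the example does not depend on any HTTP client.

```python
import uuid


def make_task_key(original_key: str) -> str:
    """Build {original_key}_{uuid[:8]} as described in the commit message."""
    return f"{original_key}_{str(uuid.uuid4())[:8]}"


def key_is_available(status_code: int) -> bool:
    """Interpret the status of GET /v1/tasks/{key}.

    Only 404 means the key is free; 200 means it is taken. Anything else
    (403, 429, 500, ...) is surfaced instead of being read as "available".
    """
    if status_code == 404:
        return True
    if status_code == 200:
        return False
    raise RuntimeError(f"unexpected status {status_code} while checking task key")
```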

The file-set key may differ from the task key (e.g., without a version suffix). Pull the key from the task's env_variables.TASK_KEY when available, falling back to the CLI --task-key argument. Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

…validation (#73)

The jobs API returns `job_id`, not `id`: fix extraction so the job ID and dashboard URL are printed after launch. Also add validation that data files are under files/notebooks/ (the path unpacked into the agent workspace) and that the prompt's list_workspace_files() pattern matches actual files.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* fix(upload): extract job_id from API response and add workspace file validation

  The jobs API returns `job_id`, not `id`: fix extraction so the job ID and dashboard URL are printed after launch. Also add validation that data files are under files/notebooks/ (the path unpacked into the agent workspace) and that the prompt's list_workspace_files() pattern matches actual files.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add standalone launch_job script for existing tasks

  Provides a simple way to launch jobs for tasks that already exist on the server, without the upload/create flow.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

The web page now redirects the browser to http://127.0.0.1:PORT/callback with tokens as query params instead of POSTing JSON. Replace do_POST + do_OPTIONS (CORS preflight) with a do_GET handler that reads query params, validates the state nonce, and returns a plain HTML confirmation page. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
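
The GET-based callback described above might be structured like this sketch, using only the standard library. The query parameter names (`state`, `access_token`), the `EXPECTED_STATE` constant, and the way tokens are handed back to the server object are assumptions, not the SDK's actual code.

```python
import hmac
from http.server import BaseHTTPRequestHandler
from urllib.parse import parse_qs, urlparse

# Hypothetical: in practice the nonce is generated before opening the browser.
EXPECTED_STATE = "nonce-generated-before-opening-the-browser"


def parse_callback(path: str, expected_state: str) -> dict:
    """Read query params from /callback and validate the state nonce."""
    parsed = urlparse(path)
    if parsed.path != "/callback":
        raise ValueError("unexpected path")
    params = {k: v[0] for k, v in parse_qs(parsed.query).items()}
    # Constant-time comparison of the nonce.
    if not hmac.compare_digest(
        params.get("state", "").encode(), expected_state.encode()
    ):
        raise ValueError("state nonce mismatch")
    return params


class CallbackHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        try:
            params = parse_callback(self.path, EXPECTED_STATE)
        except ValueError:
            self.send_error(400, "Invalid login callback")
            return
        self.server.tokens = params  # hand the tokens back to the CLI
        body = b"<html><body>Login complete. You can close this tab.</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

Because the browser is redirected with a plain GET, no CORS preflight handling is needed, which is why `do_OPTIONS` can be dropped.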

… URL Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat: add flt login browser auth flow (ENG-1192)

…#79)

* feat(task-bundle-editing): add Makefile, templates, CLAUDE.md, and workflow docs

  Add task authoring toolkit for creating, editing, and deploying Fleet evaluation tasks. Includes Makefile wrapping existing Python scripts, verifier templates for analysis and plot tasks, CLAUDE.md for Claude Code integration, and extended README with end-to-end workflow documentation.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: clarify placeholder notation in template schema notes

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* feat(track): add passive AI session sync daemon

  Adds `flt track`, a background daemon that watches local AI coding session files (Claude Code, Cursor, Codex) and syncs them to S3 using a Merkle tree-based dedup engine.

  - Merkle sync: flat {path: sha256} map, O(1) root hash change detection
  - WAL-backed SQLite upload queue with exponential backoff
  - Presigned S3 PUT URLs: daemon never holds AWS creds
  - FSEvents/inotify watcher with polling fallback
  - launchd (macOS) + systemd (Linux) service install via `flt track enable`
  - Manifest written only after uploads confirmed (_confirmed_map pattern)
  - 10-minute reconcile interval

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(track): add status = 'pending' filter to claim_batch UPDATE

  The UPDATE was missing a status filter, so it could revert done/failed/in_flight rows back to in_flight when the same path appeared in the SELECT results. Added WHERE status = 'pending' to match the SELECT.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(track): prevent double-close of fd in write_status

  If os.replace() raised after os.close(fd) succeeded, the except handler would call os.close(fd) again on the already-closed descriptor, masking the original exception and skipping os.unlink(tmp). Set fd = -1 after close to guard against the double-close.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(track): serialise _drain_queue with a lock to prevent duplicate uploads

  _drain_queue was called from both the main loop and the watcher debounce thread without synchronisation. The non-atomic SELECT+UPDATE in claim_batch meant both threads could select the same pending items before either UPDATE executed, causing duplicate presigned-URL requests and duplicate pool submissions. A threading.Lock now serialises the claim_batch call so only one thread can hold claimed items at a time.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(track): propagate FLEET_TRACK_BASE_URL into systemd unit on Linux

  The launchd plist conditionally passed FLEET_TRACK_BASE_URL through but the systemd unit only set PATH. On Linux, any custom base URL would be silently ignored and the daemon would fall back to the production server.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: bump version to 0.2.119

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(track): use GLOBAL_BASE_URL (orchestrator.fleetai.com) as default

  Replaced the hardcoded regional URL with the shared GLOBAL_BASE_URL constant from fleet/config.py, consistent with how the rest of the SDK resolves the default base URL.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(auth): use urlsafe_b64decode and correct padding for JWT expiry check

  JWTs use base64url encoding (- and _ instead of + and /). Using b64decode silently corrupted payloads containing those chars, causing json.loads to fail and the except handler to return True (expired), forcing a Supabase refresh on every get_valid_token() call. Also fixed the padding: (4 - n % 4) % 4 → -n % 4 so a string already aligned to 4 bytes gets 0 padding chars instead of 4.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(auth): move Supabase credentials out of source into GitHub secrets

  Hardcoded SUPABASE_URL and SUPABASE_ANON_KEY are replaced by a placeholder fleet/_supabase.py that gets sed-replaced with real values from GitHub secrets (SUPABASE_URL, SUPABASE_ANON_KEY) before the PyPI build runs. Local dev can override via env vars.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor(track): remove S3 details from provision response

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(track): key claim_batch UPDATE and mark_done/failed on (path, sha256) not path alone

  Path-only matching in claim_batch could orphan rows when two entries share a path but differ in sha256: the UPDATE would set both to in_flight even though only one was returned by the SELECT. That extra row would never be uploaded or resolved. Fix: use (path, sha256) IN (...) for the UPDATE, and thread sha256 through pool.submit → on_done/on_failed → mark_done/mark_failed so every callback targets the exact row that was claimed.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
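
The JWT expiry fix in the list above (base64url decoding plus correct padding) can be illustrated with the standard library. The function names here are hypothetical and signature verification is deliberately out of scope; only the decoding and the "treat undecodable tokens as expired" behavior from the commit message are sketched.

```python
import base64
import json
import time
from typing import Optional


def decode_jwt_payload(token: str) -> dict:
    """Decode the (unverified) payload segment of a JWT."""
    payload_b64 = token.split(".")[1]
    # base64url omits padding; -n % 4 adds 0 chars when already aligned.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    # urlsafe_b64decode understands "-" and "_" used by base64url.
    return json.loads(base64.urlsafe_b64decode(payload_b64))


def is_expired(token: str, now: Optional[float] = None) -> bool:
    now = time.time() if now is None else now
    try:
        return decode_jwt_payload(token)["exp"] <= now
    except (IndexError, KeyError, TypeError, ValueError):
        # Undecodable or malformed tokens are treated as expired.
        return True
```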

* fix(track): add mutex to UploadQueue to prevent concurrent SQLite access

  check_same_thread=False only suppresses SQLite's thread check; it doesn't make the connection safe. Eight uploader worker threads calling mark_done/mark_failed concurrently on the same connection caused sqlite3.InterfaceError in production. Added threading.Lock() in UploadQueue wrapping every execute+commit. Also removed the now-redundant _drain_lock in daemon (claim_batch is protected by the queue lock).

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(track): detect and recover from corrupt SQLite queue on startup

  Runs PRAGMA integrity_check after opening the connection. If the database is corrupt (residue from the pre-lock threading bug), logs a warning, deletes the .db/.db-shm/.db-wal files, and reinitialises. Queue items are lost but the next reconcile re-enqueues anything not yet confirmed on S3.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* chore: bump version to 0.2.121

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Gautham Elango <gautham.gg@gmail.com>
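
A lock-guarded queue keyed on (path, sha256), as the track fixes describe, might be sketched like this. The schema and method bodies are assumptions; the commit uses a `(path, sha256) IN (...)` UPDATE, whereas this sketch swaps in a per-row `executemany` for simplicity.

```python
import sqlite3
import threading


class UploadQueue:
    """Sketch of a lock-guarded SQLite queue; schema and names are assumed."""

    def __init__(self, db_path: str = ":memory:"):
        # check_same_thread=False only disables the thread check, so every
        # execute+commit below is wrapped in this lock.
        self._lock = threading.Lock()
        self._conn = sqlite3.connect(db_path, check_same_thread=False)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS queue ("
            " path TEXT, sha256 TEXT, status TEXT DEFAULT 'pending',"
            " PRIMARY KEY (path, sha256))"
        )
        self._conn.commit()

    def enqueue(self, path: str, sha256: str) -> None:
        with self._lock:
            self._conn.execute(
                "INSERT OR IGNORE INTO queue (path, sha256) VALUES (?, ?)",
                (path, sha256),
            )
            self._conn.commit()

    def claim_batch(self, limit: int = 10) -> list:
        with self._lock:
            rows = self._conn.execute(
                "SELECT path, sha256 FROM queue WHERE status = 'pending' LIMIT ?",
                (limit,),
            ).fetchall()
            # Claim exactly the selected rows: match on (path, sha256) and
            # keep the status filter so done/failed rows are never reverted.
            self._conn.executemany(
                "UPDATE queue SET status = 'in_flight' "
                "WHERE path = ? AND sha256 = ? AND status = 'pending'",
                rows,
            )
            self._conn.commit()
            return rows

    def mark_done(self, path: str, sha256: str) -> None:
        with self._lock:
            self._conn.execute(
                "UPDATE queue SET status = 'done' WHERE path = ? AND sha256 = ?",
                (path, sha256),
            )
            self._conn.commit()
```

Holding the lock across the SELECT and the UPDATE makes the claim atomic with respect to other threads sharing the connection.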

When a file changes sha256 over time, old (path, sha256) rows accumulate in failed/pending state. Once a newer sha256 uploads successfully, those rows are superseded and will never produce useful uploads. Delete them in mark_done to keep the queue clean and status counts meaningful. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…l upload" This reverts commit 2f6891f.

* fix(track): prune stale queue rows for same path on successful upload

  When a file changes sha256 over time, old (path, sha256) rows accumulate in failed/pending state. Once a newer sha256 uploads successfully, those rows are superseded and will never produce useful uploads. Delete them in mark_done to keep the queue clean and status counts meaningful.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(track): use per-phase httpx timeouts for S3 uploads

  A single 120s blanket timeout was too short for large JSONL files on slow connections: the write phase alone could exceed it. Split into per-phase timeouts: 10s connect, 30s read, 300s write.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix(track): only prune queue rows enqueued before the completing upload

  The previous DELETE removed all other pending/failed rows for the same path, including newer versions enqueued by the watcher while an older upload was in-flight. For rapidly-growing JSONL files this race was likely: older version completes, newer pending row gets deleted, newer version doesn't upload until the next reconcile (~10 min). Fix: filter the DELETE to only rows with enqueued_at <= the completing row's enqueued_at, preserving any newer versions for immediate upload.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
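
The pruning rule from the last fix above (delete only superseded rows that were enqueued no later than the completing upload) can be expressed as a single DELETE. The table layout and the function name `prune_superseded` are assumptions for illustration.

```python
import sqlite3


def prune_superseded(conn: sqlite3.Connection, path: str, sha256: str) -> None:
    """Drop stale pending/failed rows for `path` once (path, sha256) is done.

    Rows enqueued after the completing row are kept so newer versions of
    the file still upload immediately.
    """
    conn.execute(
        "DELETE FROM queue "
        "WHERE path = ? AND sha256 != ? "
        "AND status IN ('pending', 'failed') "
        "AND enqueued_at <= ("
        "  SELECT enqueued_at FROM queue WHERE path = ? AND sha256 = ?"
        ")",
        (path, sha256, path, sha256),
    )
    conn.commit()
```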

Stores email, user_id, team_id, team_name, hostname, platform in the config at provision time (pulled from credentials.json + provision response), then writes them into every manifest.json upload. This lets you associate any device/S3 prefix with a real user without a Supabase lookup. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ute convention Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(track): embed user identity in manifest.json

* feat: add verifier structure checks to bundle validator

  Checks for: wrong import module (fleet.verifier), old function signature (submission_dir), wrong Criterion API (weight=), missing env.judge.grade(), env.read_file() usage, missing File.from_env for prompted output files, empty files={} dict, solutions/ without .s3() references.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: switch upload_task.py to tar-based seed upload

  Replace the old file-sets upload flow with the new /v1/seeds API:
  - Package files/ as seed.tar.zst via piped tar|zstd
  - Get presigned S3 PUT URL from POST /v1/seeds/{data_key}/{env_key}/upload-url
  - Upload tar directly to S3
  - Create task with data_id/data_version and FLEET__FLEXIBLE_SEED env vars

  The data_key is always the task key (unique per task). The --env-version flag controls both the seed registration and the task's environment version.

  Depends on: fleet-ai/theseus#3935

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: update download_task.py for tar seed bundles

  Download tasks in the bundle format that upload_task.py expects:
  - Fetch task from API, save minimal upload-compatible task.json
  - Download seed tar from S3 via aws cli, extract to files/
  - Fall back to legacy file-sets API for old tasks
  - Print re-upload command for easy round-tripping

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address bugbot findings

  - Read tar stderr to prevent pipe buffer deadlock
  - Fix has_files check to filter directories (only count actual files)
  - Update Makefile: remove --allow-overwrite and --team-id flags, add --env-version via VERSION= variable

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: address bugbot findings in upload/download scripts

  - Remove carlisle default for environment_id; fail early if missing
  - Return False on download failures instead of sys.exit(1), enabling fallback to legacy file-sets API
  - Fix docstring to match actual behavior (aws cli, not API endpoint)

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ator (#90)

Data files no longer need to live under files/notebooks/ specifically, and prompts don't need to include list_workspace_files(). The validator still checks that files/ exists and has content (check 5).

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The default models were accidentally narrowed to just claude-sonnet-4.6. Restore the original cross-model set (Gemini, Opus, GPT-5.2) so uploads get a proper multi-model sweep by default. Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Implements the fleet/judge/ package with Rubric, Criterion, JudgeClient, JudgeEndpointConfig, and JudgeResult classes. Supports both Anthropic and OpenAI-compatible endpoints via raw httpx, with env var configuration for airgapped/Docker Worlds environments. Includes prompt construction that mirrors the orchestrator's judge_service.py logic. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Reviewed by Cursor Bugbot for commit 6bdb38e.

cursor · 2026-04-10T16:14:25Z

+        return (
+            f"JudgeResult(score={self.normalized_score:.4f}, "
+            f"criteria={len(self.criteria)})"
+        )


JudgeResult lacks .score attribute for verifier compatibility

High Severity

JudgeResult is designed for "verifier compatibility" via __float__, but the verifier framework's local execution path checks hasattr(result, "score") before ever calling float(). Since JudgeResult only has normalized_score, total_score, and max_score — but no .score attribute — the verifier raises a ValueError, which is silently caught, returning 0.0. A verifier returning a JudgeResult directly will always silently produce a zero score in local execution.
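
One way to address this finding is to expose a `.score` alias alongside `__float__`. The sketch below is an assumption-heavy illustration: the attribute names come from the review comment, but the constructor, the `extract_score` stand-in for the verifier framework's check, and the choice to alias `score` to the normalized value are hypothetical.

```python
class JudgeResult:
    """Sketch only: attribute names follow the review comment above."""

    def __init__(self, total_score: float, max_score: float):
        self.total_score = total_score
        self.max_score = max_score

    @property
    def normalized_score(self) -> float:
        return self.total_score / self.max_score if self.max_score else 0.0

    @property
    def score(self) -> float:
        # Alias so that hasattr(result, "score") succeeds.
        return self.normalized_score

    def __float__(self) -> float:
        return self.normalized_score


def extract_score(result) -> float:
    """Hypothetical stand-in for the framework check Bugbot describes."""
    if hasattr(result, "score"):
        return float(result.score)
    raise ValueError("verifier result has no .score attribute")
```

With the alias in place, both `float(result)` and the attribute-based path yield the same normalized value.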


cursor · 2026-04-10T16:14:25Z

+
+    parts.append(f"## Submission\n\n{submission}")
+
+    return "\n\n".join(parts)


Unused rubric parameter in _build_user_message

Low Severity

_build_user_message accepts a rubric parameter that is never referenced in the function body. The caller build_structured_rubric_prompt explicitly passes rubric=rubric, but the function only uses submission, ground_truth, problem, context, and conversation. This is either dead code or indicates missing logic that was intended to include rubric information in the user message.


mikesklar and others added 30 commits January 13, 2026 12:21

Update agent.py

b301d67

update gemini cua agent with latest updates

32fa85f

update name

8852db9

Merge pull request #38 from fleet-ai/fix/verifier-helper-functions-na…

5f89234

…mespace fix: allow verifier helper functions to be called from main verifier

Bump version to 0.2.104

54feffd

add metadata to tasks

f895cb6

fixes

fef862d

Revert "fix: align InstanceRequest and TaskRequest with orchestrator …

9cdd7e6

…API" This reverts commit 9a0af14.

Merge pull request #40 from fleet-ai/zz/add-metadata-0121

acc58ed

add metadata to tasks in SDK

bump version

58f8faf

Merge pull request #41 from fleet-ai/zz/2.105

5dbb76e

bump version

consolidate

ee6a8a5

Consolidate all metadata into "metadata" in TaskResponse

05112f7

consolidate

README.md

afb516c

README.md

c08b76b

Update README.md

e303638

Update README.md

0c4d149

Delete export_tasks_filtered.py

f0737e5

Update README.md

53609ae

Update README.md

8b92120

chore: bump version to 0.2.107

33459b2

Co-authored-by: Cursor <cursoragent@cursor.com>

chore: update lockfile for 0.2.107

7dbbe39

Co-authored-by: Cursor <cursoragent@cursor.com>

Merge pull request #47 from fleet-ai/add-factual-answer-support

1bcfb21

feat: add factual_answer field to Task and API models

chore: bump version to 0.2.108

875a297

Co-authored-by: Cursor <cursoragent@cursor.com>

feat: add task_modality support to async SDK

aa03cd0

Add task_modality field to async Task model, TaskResponse model, and update load_task_from_json and load_tasks to preserve task_modality. Co-authored-by: Cursor <cursoragent@cursor.com>

andrew-stelmach-fleet and others added 27 commits February 19, 2026 16:44

Merge pull request #65 from fleet-ai/fix/remove-output-schema-warning

1e6a928

fix: remove stale output_json_schema warning from import_tasks

Merge pull request #68 from fleet-ai/feat/task-bundle-editing

62d2251

feat: add task bundle editing scripts

fix: use fleetai.com instead of invented app. subdomain for cli-login…

4a496c8

… URL Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Merge pull request #76 from fleet-ai/nico/eng-1192-add-auth-to-our-sdk

5ae2ddf

feat: add flt login browser auth flow (ENG-1192)

v0.2.120

656efc4

Revert "fix(track): prune stale queue rows for same path on successfu…

242ff5e

…l upload" This reverts commit 2f6891f.

v0.2.122

34b06bc

chore: bump version to 0.2.123

4e8abb2

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix(track): initialize _identity in __init__ to match existing attrib…

cae219a

…ute convention Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Merge pull request #87 from fleet-ai/nico/track-manifest-identity

24b2852

feat(track): embed user identity in manifest.json

cursor Bot reviewed Apr 10, 2026


gg2001 force-pushed the main branch from 51131ab to e3c5571 Compare April 14, 2026 22:16

gg2001 closed this May 1, 2026

jarredFleetSo wants to merge 92 commits into main from feat/judge-module-041026-1


9 participants

