fix(validate): error when verifier S3 solutions path doesn't match task key #75

Open
mikesklar wants to merge 75 commits into main from fix/validate-s3-solutions-path

Conversation

@mikesklar

Collaborator

Summary

  • Verifiers using Image.s3() embed the task key as a path segment in S3 URLs (e.g. .../<TASK_KEY>/solutions/gold_plot.png)
  • When uploading under a new key, forgetting to update these paths causes the verifier to silently load the wrong gold-reference images
  • The validator now checks that the expected TASK_KEY appears as a path segment in every S3 URL containing /solutions/
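A minimal sketch of the path-segment check described above, not the PR's actual code. The function name, the URL regex, and the bucket layout in the usage note are hypothetical; only the rule (the expected key must appear as a whole path segment in every S3 URL containing /solutions/) comes from the summary.

```python
import re
from typing import List

# Hypothetical pattern: s3:// or http(s):// URLs found in verifier source text.
_URL_RE = re.compile(r"""(?:s3|https?)://[^\s'"\)]+""")


def find_stale_solution_urls(verifier_src: str, expected_key: str) -> List[str]:
    """Return solutions URLs that lack expected_key as a full path segment."""
    stale = []
    for url in _URL_RE.findall(verifier_src):
        if "/solutions/" not in url:
            continue  # only solutions paths are checked
        # Compare whole segments so a key that is merely a substring of
        # another key (e.g. "plot" vs "plot_v2") does not count as a match.
        segments = url.split("://", 1)[1].split("/")
        if expected_key not in segments:
            stale.append(url)
    return stale
```

With a verifier containing `Image.s3("s3://bucket/tasks/old_key/solutions/gold_plot.png")` (an illustrative URL), checking against `new_key` reports that URL as stale, while checking against `old_key` reports nothing.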

Test plan

  • Mismatched key → errors with clear message pointing at the stale URL
  • Matching key → passes cleanly

🤖 Generated with Claude Code

mikesklar and others added 30 commits January 13, 2026 12:21
When verifier code contains multiple functions (e.g., a main verifier
function and helper functions), the helper functions were not accessible
from the main function due to namespace isolation.

The exec() call created functions in local_namespace, but the main
function's __globals__ pointed to exec_globals which didn't contain
the helper functions. This caused NameError when the main function
tried to call helpers, which was silently caught and returned 0.0.

Fix: Merge local_namespace into exec_globals after exec() so all
defined functions are accessible when the verifier is called.
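The namespace problem and the fix can be reproduced in a few lines. This is a sketch under assumptions: `load_verifier` and the sample source are hypothetical, and only the variable names `exec_globals` and `local_namespace` come from the commit message.

```python
VERIFIER_SRC = '''
def helper(x):
    return x * 2

def verify(x):
    return helper(x)
'''


def load_verifier(src: str, name: str = "verify"):
    """Exec verifier source and return the named function with helpers visible."""
    exec_globals = {}
    local_namespace = {}
    exec(src, exec_globals, local_namespace)
    # Functions defined by exec() are stored in local_namespace, but each
    # function's __globals__ is exec_globals. Without this merge, verify()
    # cannot resolve helper() and raises NameError when called.
    exec_globals.update(local_namespace)
    return local_namespace[name]


verify = load_verifier(VERIFIER_SRC)
result = verify(3)  # helper() now resolves through exec_globals
```

Omitting the `update()` call reproduces the original failure: the lookup of `helper` happens at call time in the function's globals, where it was never defined.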
…mespace

fix: allow verifier helper functions to be called from main verifier
InstanceRequest changes:
- Add: profile_id, async_provision, instance_mode, ssh_public_keys, snapshot_interval_minutes, version (deprecated)
- Fix: region default changed from 'us-west-1' to None (server decides)
- Fix: created_from default changed from None to 'api'

TaskRequest changes:
- Add: verifier_func, project_key, data_id, data_version, writer_metadata
- Add: model_config with extra='ignore' and populate_by_name=True
- Add: alias='env_id' for environment_id field
- Remove: metadata (doesn't exist in orchestrator TaskRequest, only in TaskResponse)
…odels

Add factual_answer field to support research/factual tasks:
- Task model: stores expected answer for verification
- TaskRequest: accept factual_answer when creating tasks
- TaskResponse: return factual_answer from API

Part of: https://linear.app/fleet-ai/issue/ENG-843/import-script-needs-to-support-output-json-schemas

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
feat: add factual_answer field to Task and API models
Add task_modality field to Task and TaskResponse models to support
copying task modality (computer_use, tool_use, browser) when importing
tasks via the SDK.

Changes:
- Add task_modality to TaskResponse model (API response)
- Add task_modality to Task model (SDK model)
- Pass task_modality from TaskResponse to Task in load_tasks

Co-authored-by: Cursor <cursoragent@cursor.com>
Addresses Bugbot comment: load_task_from_json wasn't extracting
task_modality from JSON data, causing tasks loaded from JSON files
to have task_modality=None even when the JSON contains this field.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Add task_modality field to async Task model, TaskResponse model,
and update load_task_from_json and load_tasks to preserve task_modality.

Co-authored-by: Cursor <cursoragent@cursor.com>
andrew-stelmach-fleet and others added 28 commits February 5, 2026 11:52
The task_lifecycle_status field was added to the Task model but was
missing from:
- TaskResponse model (sync and async) - needed to parse API response
- load_tasks method - needed to pass the field to Task constructor

This completes the task_lifecycle_status support in the SDK.

Co-authored-by: Cursor <cursoragent@cursor.com>
The field was renamed to env_key but there was already a property with
the same name, causing infinite recursion. Renamed the property to
get_env_key() method.

Also restored fallback for env_key in load_task_from_json to support
JSON files that use env_key field.

Co-authored-by: Cursor <cursoragent@cursor.com>
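The recursion described in the commit above can be illustrated without Pydantic. This is a simplified sketch with hypothetical classes: a property named `env_key` that reads `self.env_key` calls itself, whereas a separately named `get_env_key()` method reads the plain field.

```python
class BrokenTask:
    """A property that shares its name with the value it tries to read."""

    version = "v1"

    @property
    def env_key(self):
        # self.env_key resolves to this same property, so the call never ends.
        return f"{self.env_key}:{self.version}"


class FixedTask:
    """Plain field plus a distinctly named method, as in the fix."""

    def __init__(self, env_key, version=None):
        self.env_key = env_key
        self.version = version

    def get_env_key(self):
        # Append the version only when one is set.
        if self.version:
            return f"{self.env_key}:{self.version}"
        return self.env_key
```

Accessing `BrokenTask().env_key` raises `RecursionError`, while `FixedTask("my_env", "v2").get_env_key()` builds the versioned key from the raw field.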
The field was renamed to env_key but there was already a property with
the same name, causing infinite recursion. Renamed the property to
get_env_key() method.

Also restored env_id fallback in load_task_from_json for backward
compatibility with existing JSON files.

Co-authored-by: Cursor <cursoragent@cursor.com>
The make() method was using self.env_key (raw field) instead of
self.get_env_key() (computed method with version). This would cause
environments to be created without the version suffix.

Co-authored-by: Cursor <cursoragent@cursor.com>
The API returns env_id but TaskInfo was renamed to use env_key.
Added alias="env_id" so Pydantic accepts both field names during
deserialization of API responses.

Co-authored-by: Cursor <cursoragent@cursor.com>
When export_tasks serializes tasks, it outputs env_key. The loading
function needs to check for env_key first (canonical name), then
fallback to environment_id (API) and env_id (legacy).

Co-authored-by: Cursor <cursoragent@cursor.com>
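The lookup order described in the commit above could be sketched as follows. `resolve_env_key` is a hypothetical helper, not the SDK's `load_task_from_json`; only the three field names and their priority come from the commit message.

```python
from typing import Any, Dict, Optional

# Priority order: canonical name first, then the API name, then the legacy name.
_ENV_KEY_FIELDS = ("env_key", "environment_id", "env_id")


def resolve_env_key(task_json: Dict[str, Any]) -> Optional[str]:
    """Return the first non-empty environment key found in task_json."""
    for field in _ENV_KEY_FIELDS:
        value = task_json.get(field)
        if value:
            return value
    return None
```

Checking `env_key` first means a file written by `export_tasks` round-trips unchanged, while older files that only carry `environment_id` or `env_id` still load.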
- TaskResponse: rename environment_id -> env_key (alias="environment_id")
- TaskRequest: rename environment_id -> env_key (alias="environment_id")
- Add ConfigDict(populate_by_name=True) for alias support
- Add Task.env_spec property for env_key:version string
- Use task.env_spec in Task.make() and make_for_task()
- Clean up load_tasks to use task_response.env_key directly
- Remove scattered inline env_key:version string building

Co-authored-by: Cursor <cursoragent@cursor.com>
- data_spec: renamed from data_key (data_key kept as alias)
- has_verifier: whether task has verifier_func or verifier
- is_research_based: whether task has a factual_answer
- is_action_based: inverse of is_research_based

Co-authored-by: Cursor <cursoragent@cursor.com>
TaskInfo has alias="env_id" on env_key field but was missing
model_config = ConfigDict(populate_by_name=True). Without this,
creating TaskInfo(env_key="...") would fail since only the alias
name was accepted.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
feat: Add task_lifecycle_status field to Task model
The PUT /v1/tasks/{task_key} endpoint can return environment_id: null,
which caused a Pydantic validation error since env_key was required.
This made update_task crash instead of returning a TaskResponse.

- TaskResponse.env_key: str -> Optional[str]
- Task.env_key: str -> Optional[str]
- Task.env_spec now returns None when env_key is absent

Co-authored-by: Cursor <cursoragent@cursor.com>
When a task has env_key=None, make_for_task would pass None to make()
causing a TypeError at ":" in env_key. Now raises a clear ValueError
matching the guard in Task.make().

Co-authored-by: Cursor <cursoragent@cursor.com>
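The guard described in the commit above amounts to checking for `None` before any string operation on the key. A sketch with a hypothetical `build_env_spec` helper (the real `make_for_task` signature is not shown in this PR):

```python
from typing import Optional


def build_env_spec(env_key: Optional[str], version: Optional[str] = None) -> str:
    """Build an 'env_key:version' string, failing early when env_key is missing."""
    if env_key is None:
        # Without this guard, the membership test below raises
        # "TypeError: argument of type 'NoneType' is not iterable".
        raise ValueError("Task has no env_key; cannot create an environment")
    if ":" in env_key or not version:
        return env_key  # already versioned, or no version to append
    return f"{env_key}:{version}"
```

Raising `ValueError` with an explicit message points the caller at the missing field instead of at an unrelated-looking string operation.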
fix: make TaskResponse.env_key optional to handle null API responses
…al-env-key"

This reverts commit 3a4f711, reversing
changes made to 7ec526b.
…v-key

revert: restore env_key as required in TaskResponse and Task
The SDK now correctly imports output_json_schema automatically via the
API, so the manual-copy warning is no longer accurate.

Co-authored-by: Cursor <cursoragent@cursor.com>
fix: remove stale output_json_schema warning from import_tasks
Simple scripts for task authors to download existing tasks, edit them
locally, and upload as new tasks. Uses raw requests (no SDK dependency)
with auto-resolved team ID from API key.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
feat: add task bundle editing scripts
* fix(download): don't pass auto-resolved team_id to task GET

The team_id query param on GET /v1/tasks/{key} requires admin
privileges. Previously the script always passed it (from auto-resolve),
causing 403 errors for non-admin API keys. Now only passes team_id
when explicitly provided via --team-id flag.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: skip resolve_team_id when --team-id is explicitly provided

Avoids an unnecessary API call that could fail and block the download.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
)

* feat(upload): add job launching and auto-generated unique task keys

- Make --key optional; auto-generates {original_key}_{uuid[:8]} when omitted
- Replace local key comparison with server-side existence check (GET /v1/tasks/{key})
- Launch job by default after upload (POST /v1/jobs) with --no-launch-job to skip
- Add --models, --pass-k flags for job configuration
- Default models: gemini-3.1-pro-preview, claude-opus-4.6, gpt-5.2

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
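The auto-generated key format `{original_key}_{uuid[:8]}` from the commit above might look like this. The function name is hypothetical, and using `uuid4` as the UUID source is an assumption; the commit only specifies the format.

```python
import uuid


def auto_task_key(original_key: str) -> str:
    """Append the first 8 hex characters of a random UUID to the original key."""
    suffix = uuid.uuid4().hex[:8]  # assumption: a random (version 4) UUID
    return f"{original_key}_{suffix}"
```

An 8-character hex suffix makes collisions unlikely, and the server-side existence check from the same commit remains the actual safeguard against reusing a key.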

* fix(upload): raise on unexpected status in task existence check

Previously any non-200 (including 500, 403, 429) was treated as
"key available", silently skipping the guard. Now only 404 means
available; other errors are surfaced.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
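The stricter status handling described in the commit above can be sketched as a pure function over the HTTP status code, so it runs without a network call. The function name and the choice of `RuntimeError` are assumptions.

```python
def key_is_available(status_code: int) -> bool:
    """Interpret the status of GET /v1/tasks/{key} for an existence check."""
    if status_code == 404:
        return True  # no such task: the key is free
    if status_code == 200:
        return False  # the task already exists
    # 403, 429, 500, etc. say nothing about availability, so surface them
    # instead of silently treating the key as free.
    raise RuntimeError(
        f"Unexpected status {status_code} while checking task key"
    )
```

Treating only 404 as "available" keeps a transient server error or a permissions problem from bypassing the duplicate-key guard.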

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
The file-set key may differ from the task key (e.g., without a version
suffix). Pull the key from the task's env_variables.TASK_KEY when
available, falling back to the CLI --task-key argument.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…validation (#73)

The jobs API returns `job_id` not `id` — fix extraction so the job ID and
dashboard URL are printed after launch. Also add validation that data files
are under files/notebooks/ (the path unpacked into the agent workspace) and
that the prompt's list_workspace_files() pattern matches actual files.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
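The two fixes in the commit above can be sketched as follows. Both helpers are hypothetical, and treating the prompt's `list_workspace_files()` pattern as a shell-style glob relative to `files/notebooks/` is an assumption; the commit message does not specify the pattern syntax.

```python
from fnmatch import fnmatchcase
from typing import List

WORKSPACE_PREFIX = "files/notebooks/"  # path named in the commit message


def extract_job_id(response_json: dict) -> str:
    """The jobs API returns the identifier under 'job_id', not 'id'."""
    return response_json["job_id"]


def validate_workspace_files(data_paths: List[str], prompt_pattern: str) -> List[str]:
    """Return error messages for misplaced data files or an unmatched pattern."""
    errors = []
    for path in data_paths:
        if not path.startswith(WORKSPACE_PREFIX):
            errors.append(f"Data file is outside {WORKSPACE_PREFIX}: {path}")
    # Names as the agent would see them, relative to the workspace directory.
    visible = [
        p[len(WORKSPACE_PREFIX):]
        for p in data_paths
        if p.startswith(WORKSPACE_PREFIX)
    ]
    if not any(fnmatchcase(name, prompt_pattern) for name in visible):
        errors.append(f"Prompt pattern '{prompt_pattern}' matches no workspace file")
    return errors
```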
* fix(upload): extract job_id from API response and add workspace file validation

The jobs API returns `job_id` not `id` — fix extraction so the job ID and
dashboard URL are printed after launch. Also add validation that data files
are under files/notebooks/ (the path unpacked into the agent workspace) and
that the prompt's list_workspace_files() pattern matches actual files.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add standalone launch_job script for existing tasks

Provides a simple way to launch jobs for tasks that already exist on the
server, without the upload/create flow.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…sk key

Verifiers that use Image.s3() to load gold-reference images embed the
task key as a path segment. When uploading under a new key, forgetting
to update these paths causes the verifier to silently load the wrong
solutions. The validator now checks that the expected TASK_KEY appears
as a path segment in every S3 URL containing /solutions/.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Drop fallback chain — only check S3 solutions paths against the
TASK_KEY env variable, which is the actual file-set key used in S3.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

f"Verifier S3 solutions path does not contain "
f"expected key '{expected_key}' as a path segment: "
f"{url}"
)

S3 path check ignores new_key argument

High Severity

The new S3 solutions-path validation uses task_key_var (the TASK_KEY from env_variables) as expected_key, but ignores new_key when it is supplied via --new-key. This means that when a user uploads under a new key and forgets to update the verifier's S3 URLs, both task_key_var and the stale URLs still contain the old key — the check finds them "consistent" and emits no error. The exact bug the PR is designed to catch is silently missed. expected_key should be new_key or task_key_var so that the check validates URLs against the intended upload key when one is provided.
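The reviewer's suggestion can be sketched as follows. This illustrates the proposed `new_key or task_key_var` selection, not necessarily the behavior the PR ends up with; the helper names and the example URL are hypothetical.

```python
from typing import Optional


def pick_expected_key(new_key: Optional[str], task_key_var: str) -> str:
    """Prefer the intended upload key; fall back to TASK_KEY from env_variables."""
    return new_key or task_key_var


def url_has_key_segment(url: str, key: str) -> bool:
    """True when key appears as a whole '/'-separated segment of url."""
    return key in url.split("/")


# A verifier URL that still points at the old key (illustrative path).
stale_url = "s3://bucket/tasks/old_key/solutions/gold_plot.png"

# Without --new-key, the stale URL matches TASK_KEY and looks consistent.
consistent_without_new_key = url_has_key_segment(
    stale_url, pick_expected_key(None, "old_key")
)
# With --new-key, the same URL no longer matches and would be flagged.
consistent_with_new_key = url_has_key_segment(
    stale_url, pick_expected_key("new_key", "old_key")
)
```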


6 participants