
feat(task-bundle-editing): add skills, generic template, fix Makefile #80

Open
mnarayan wants to merge 78 commits into fleet-ai:main from mnarayan:feat/task-authoring-toolkit


@mnarayan

Contributor

Summary

  • Add Claude Code skills for task authoring (task-authoring) and job monitoring (fleet-status)
  • Add generic verifier_template.json; replace DS-specific templates (analysis, plot) with a single task-agnostic scaffold
  • Fix Makefile: validate and upload now accept DIR= without requiring an active task
  • Improve README: expand edit workflow docs, clarify templates as design scaffolds, add from-scratch orientation guide
  • Update CLAUDE.md skills section

Addresses review feedback from #79

  • Templates clarified as design scaffolds, not directly valid task.json files
  • Makefile DIR=-only workflow unblocked for validate and upload targets

Test plan

  • make validate DIR=<bundle> works without .task set
  • Skills load correctly in Claude Code when copied to .claude/skills/
  • verifier_template.json placeholders are clear and task-agnostic

🤖 Generated with Claude Code

mikesklar and others added 30 commits January 13, 2026 12:21
When verifier code contains multiple functions (e.g., a main verifier
function and helper functions), the helper functions were not accessible
from the main function due to namespace isolation.

The exec() call created functions in local_namespace, but the main
function's __globals__ pointed to exec_globals which didn't contain
the helper functions. This caused NameError when the main function
tried to call helpers, which was silently caught and returned 0.0.

Fix: Merge local_namespace into exec_globals after exec() so all
defined functions are accessible when the verifier is called.
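
The namespace issue above can be reproduced in a few lines. This is a minimal sketch, not the SDK's actual loader: the verifier source, the `load_verifier` helper, and the `merge_namespaces` flag are hypothetical names used only for illustration.

```python
# Hypothetical verifier source: a main function that calls a helper.
VERIFIER_SRC = '''
def helper(x):
    return x * 2

def verify(x):
    return helper(x) + 1
'''


def load_verifier(src, merge_namespaces):
    """Exec the source and return its `verify` function (illustrative only)."""
    exec_globals = {}
    local_namespace = {}
    # With separate dicts, `def` binds names in local_namespace, while each
    # function's __globals__ still points at exec_globals.
    exec(src, exec_globals, local_namespace)
    if merge_namespaces:
        # The fix described above: expose the defined functions as globals.
        exec_globals.update(local_namespace)
    return local_namespace["verify"]


# Without the merge, the helper lookup fails at call time.
broken = load_verifier(VERIFIER_SRC, merge_namespaces=False)
try:
    broken(1)
    broken_error = None
except NameError as exc:
    broken_error = exc

# With the merge, verify() can resolve helper().
fixed = load_verifier(VERIFIER_SRC, merge_namespaces=True)
fixed_result = fixed(1)
```

In the unmerged case the call raises NameError for `helper`, which is the failure that was previously swallowed and reported as a 0.0 score.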
…tions-namespace

fix: allow verifier helper functions to be called from main verifier
InstanceRequest changes:
- Add: profile_id, async_provision, instance_mode, ssh_public_keys, snapshot_interval_minutes, version (deprecated)
- Fix: region default changed from 'us-west-1' to None (server decides)
- Fix: created_from default changed from None to 'api'

TaskRequest changes:
- Add: verifier_func, project_key, data_id, data_version, writer_metadata
- Add: model_config with extra='ignore' and populate_by_name=True
- Add: alias='env_id' for environment_id field
- Remove: metadata (doesn't exist in orchestrator TaskRequest, only in TaskResponse)
…odels

Add factual_answer field to support research/factual tasks:
- Task model: stores expected answer for verification
- TaskRequest: accept factual_answer when creating tasks
- TaskResponse: return factual_answer from API

Part of: https://linear.app/fleet-ai/issue/ENG-843/import-script-needs-to-support-output-json-schemas

Co-authored-by: Cursor <cursoragent@cursor.com>
feat: add factual_answer field to Task and API models
Add task_modality field to Task and TaskResponse models to support
copying task modality (computer_use, tool_use, browser) when importing
tasks via the SDK.

Changes:
- Add task_modality to TaskResponse model (API response)
- Add task_modality to Task model (SDK model)
- Pass task_modality from TaskResponse to Task in load_tasks

Co-authored-by: Cursor <cursoragent@cursor.com>
Addresses Bugbot comment: load_task_from_json wasn't extracting
task_modality from JSON data, causing tasks loaded from JSON files
to have task_modality=None even when the JSON contains this field.

Co-authored-by: Cursor <cursoragent@cursor.com>
Add task_modality field to async Task model, TaskResponse model,
and update load_task_from_json and load_tasks to preserve task_modality.

Co-authored-by: Cursor <cursoragent@cursor.com>
andrew-stelmach-fleet and others added 29 commits February 5, 2026 12:01
The field was renamed to env_key, but a property with the same name
already existed, causing infinite recursion. Replaced the property with
a get_env_key() method.

Also restored env_id fallback in load_task_from_json for backward
compatibility with existing JSON files.

Co-authored-by: Cursor <cursoragent@cursor.com>
The make() method was using self.env_key (raw field) instead of
self.get_env_key() (computed method with version). This would cause
environments to be created without the version suffix.

Co-authored-by: Cursor <cursoragent@cursor.com>
The API returns env_id but TaskInfo was renamed to use env_key.
Added alias="env_id" so Pydantic accepts both field names during
deserialization of API responses.

Co-authored-by: Cursor <cursoragent@cursor.com>
When export_tasks serializes tasks, it outputs env_key. The loading
function needs to check for env_key first (canonical name), then
fall back to environment_id (API) and env_id (legacy).
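
The lookup order described above might look like the following. This is a sketch under assumptions: `resolve_env_key` is a hypothetical helper, and the real loader may treat empty strings or other edge cases differently.

```python
from typing import Any, Mapping, Optional


def resolve_env_key(data: Mapping[str, Any]) -> Optional[str]:
    """Return the first environment key found, in priority order."""
    # Canonical name first, then the API name, then the legacy name.
    for name in ("env_key", "environment_id", "env_id"):
        value = data.get(name)
        if value is not None:
            return value
    return None
```

For example, `resolve_env_key({"env_id": "legacy", "env_key": "canonical"})` picks the canonical field even though both are present.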

Co-authored-by: Cursor <cursoragent@cursor.com>
- TaskResponse: rename environment_id -> env_key (alias="environment_id")
- TaskRequest: rename environment_id -> env_key (alias="environment_id")
- Add ConfigDict(populate_by_name=True) for alias support
- Add Task.env_spec property for env_key:version string
- Use task.env_spec in Task.make() and make_for_task()
- Clean up load_tasks to use task_response.env_key directly
- Remove scattered inline env_key:version string building
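
A centralized env_spec property of the kind listed above could be sketched as follows. `TaskSketch` is a hypothetical stand-in for the SDK's Task model, and treating `version` as optional is an assumption; the commit only states that env_spec yields the env_key:version string.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskSketch:
    """Hypothetical stand-in for the SDK Task model."""

    env_key: str
    version: Optional[str] = None  # assumption: a task may have no version

    @property
    def env_spec(self) -> str:
        # Build the "env_key:version" string in one place instead of inline.
        if self.version:
            return f"{self.env_key}:{self.version}"
        return self.env_key
```

Callers such as make() can then pass `task.env_spec` through without repeating the string formatting.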

Co-authored-by: Cursor <cursoragent@cursor.com>
- data_spec: renamed from data_key (data_key kept as alias)
- has_verifier: whether task has verifier_func or verifier
- is_research_based: whether task has a factual_answer
- is_action_based: inverse of is_research_based
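
The derived flags listed above can be illustrated with plain properties. This is a sketch, not the SDK model: `TaskFlags` is a hypothetical class, and using `is not None` as the presence test is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class TaskFlags:
    """Hypothetical subset of task fields used by the derived properties."""

    verifier_func: Optional[str] = None
    verifier: Optional[Any] = None
    factual_answer: Optional[str] = None

    @property
    def has_verifier(self) -> bool:
        # True when either form of verifier is attached to the task.
        return self.verifier_func is not None or self.verifier is not None

    @property
    def is_research_based(self) -> bool:
        # Research tasks carry an expected factual answer.
        return self.factual_answer is not None

    @property
    def is_action_based(self) -> bool:
        # Defined as the inverse of is_research_based.
        return not self.is_research_based
```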

Co-authored-by: Cursor <cursoragent@cursor.com>
TaskInfo has alias="env_id" on env_key field but was missing
model_config = ConfigDict(populate_by_name=True). Without this,
creating TaskInfo(env_key="...") would fail since only the alias
name was accepted.
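
The effect of populate_by_name can be seen by comparing two models. This sketch assumes Pydantic v2; `StrictTaskInfo` and `TaskInfoSketch` are hypothetical models, not the SDK's TaskInfo.

```python
from pydantic import BaseModel, ConfigDict, Field, ValidationError


class StrictTaskInfo(BaseModel):
    # Alias only: the field can be populated as env_id, not env_key.
    env_key: str = Field(alias="env_id")


class TaskInfoSketch(BaseModel):
    # populate_by_name lets callers use either the field name or the alias.
    model_config = ConfigDict(populate_by_name=True)
    env_key: str = Field(alias="env_id")


# Without populate_by_name, constructing by field name is rejected.
try:
    StrictTaskInfo(env_key="demo-env")
    strict_rejected = False
except ValidationError:
    strict_rejected = True

# With it, both spellings populate the same field.
by_name = TaskInfoSketch(env_key="demo-env")
by_alias = TaskInfoSketch(env_id="demo-env")
```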

Co-authored-by: Cursor <cursoragent@cursor.com>
…status

feat: Add task_lifecycle_status field to Task model
The PUT /v1/tasks/{task_key} endpoint can return environment_id: null,
which caused a Pydantic validation error since env_key was required.
This made update_task crash instead of returning a TaskResponse.

- TaskResponse.env_key: str -> Optional[str]
- Task.env_key: str -> Optional[str]
- Task.env_spec now returns None when env_key is absent

Co-authored-by: Cursor <cursoragent@cursor.com>
When a task has env_key=None, make_for_task would pass None to make(),
causing a TypeError when evaluating ":" in env_key. It now raises a
clear ValueError matching the guard in Task.make().
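
The difference between the raw TypeError and an explicit guard can be sketched like this. `make_for_task_sketch` and its error message are hypothetical; only the None check and the ":" membership test come from the commit message.

```python
from typing import Optional, Tuple


def make_for_task_sketch(env_key: Optional[str]) -> Tuple[str, Optional[str]]:
    """Split an env spec into (name, version), guarding against None."""
    if env_key is None:
        # Fail early with a readable message instead of a TypeError later.
        raise ValueError("task has no env_key; cannot create an environment")
    name, _, version = env_key.partition(":")
    return name, version or None


# What happens without the guard: membership tests on None raise TypeError.
try:
    ":" in None  # type: ignore[operator]
    unguarded_error = None
except TypeError as exc:
    unguarded_error = exc
```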

Co-authored-by: Cursor <cursoragent@cursor.com>
…al-env-key

fix: make TaskResponse.env_key optional to handle null API responses
…e-optional-env-key"

This reverts commit 3a4f711, reversing
changes made to 7ec526b.
…ional-env-key

revert: restore env_key as required in TaskResponse and Task
The SDK now correctly imports output_json_schema automatically via the
API, so the manual-copy warning is no longer accurate.

Co-authored-by: Cursor <cursoragent@cursor.com>
…-warning

fix: remove stale output_json_schema warning from import_tasks
Simple scripts for task authors to download existing tasks, edit them
locally, and upload as new tasks. Uses raw requests (no SDK dependency)
with auto-resolved team ID from API key.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
)

* fix(download): don't pass auto-resolved team_id to task GET

The team_id query param on GET /v1/tasks/{key} requires admin
privileges. Previously the script always passed it (from auto-resolve),
causing 403 errors for non-admin API keys. Now only passes team_id
when explicitly provided via --team-id flag.
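
One way to express this rule is to build the query parameters from the CLI flag alone. A minimal sketch: `task_get_params` is a hypothetical helper, and the real script may assemble its request differently.

```python
from typing import Dict, Optional


def task_get_params(cli_team_id: Optional[str]) -> Dict[str, str]:
    """Query params for GET /v1/tasks/{key}.

    Only an explicitly supplied --team-id is forwarded; an auto-resolved
    team ID is never passed to this function.
    """
    return {"team_id": cli_team_id} if cli_team_id else {}
```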

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: skip resolve_team_id when --team-id is explicitly provided

Avoids an unnecessary API call that could fail and block the download.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…leet-ai#71)

* feat(upload): add job launching and auto-generated unique task keys

- Make --key optional; auto-generates {original_key}_{uuid[:8]} when omitted
- Replace local key comparison with server-side existence check (GET /v1/tasks/{key})
- Launch job by default after upload (POST /v1/jobs) with --no-launch-job to skip
- Add --models, --pass-k flags for job configuration
- Default models: gemini-3.1-pro-preview, claude-opus-4.6, gpt-5.2
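
The auto-generated key format above could be produced as follows. This is a sketch: `generate_task_key` is a hypothetical name, and taking the first eight hex characters of a uuid4 is an assumption about what `uuid[:8]` refers to.

```python
import uuid
from typing import Optional


def generate_task_key(original_key: str, explicit_key: Optional[str] = None) -> str:
    """Return the explicit --key if given, else {original_key}_{uuid[:8]}."""
    if explicit_key:
        return explicit_key
    # Assumption: the suffix is the first 8 hex characters of a random UUID.
    return f"{original_key}_{uuid.uuid4().hex[:8]}"
```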

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(upload): raise on unexpected status in task existence check

Previously any non-200 (including 500, 403, 429) was treated as
"key available", silently skipping the guard. Now only 404 means
available; other errors are surfaced.
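
The stricter status handling can be sketched as a small function over the HTTP status code. `key_is_available` and the RuntimeError are illustrative choices; the actual script may raise a different exception type.

```python
def key_is_available(status_code: int) -> bool:
    """Interpret the status of GET /v1/tasks/{key} for an existence check."""
    if status_code == 404:
        return True  # no such task, so the key is free
    if status_code == 200:
        return False  # a task with this key already exists
    # Anything else (403, 429, 500, ...) is surfaced rather than ignored.
    raise RuntimeError(f"unexpected status {status_code} while checking task key")
```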

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
The file-set key may differ from the task key (e.g., without a version
suffix). Pull the key from the task's env_variables.TASK_KEY when
available, falling back to the CLI --task-key argument.
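
The fallback described above might be written as follows. A sketch only: `resolve_file_set_key` is a hypothetical helper, and it assumes the task is a plain dict with an optional `env_variables` mapping.

```python
from typing import Any, Mapping


def resolve_file_set_key(task: Mapping[str, Any], cli_task_key: str) -> str:
    """Prefer env_variables.TASK_KEY, else the CLI --task-key value."""
    env_vars = task.get("env_variables") or {}
    return env_vars.get("TASK_KEY") or cli_task_key
```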

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…validation (fleet-ai#73)

The jobs API returns `job_id` not `id` — fix extraction so the job ID and
dashboard URL are printed after launch. Also add validation that data files
are under files/notebooks/ (the path unpacked into the agent workspace) and
that the prompt's list_workspace_files() pattern matches actual files.
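
Both changes can be sketched with plain dictionaries and paths. The helper names are hypothetical, and the sketch assumes bundle file paths are POSIX-style and relative to the bundle root; the list_workspace_files() pattern check is not covered here.

```python
from pathlib import PurePosixPath
from typing import Any, Iterable, List, Mapping

# Directory that is unpacked into the agent workspace, per the commit message.
REQUIRED_PREFIX = PurePosixPath("files/notebooks")


def extract_job_id(response: Mapping[str, Any]) -> str:
    """Read the job identifier from the jobs API response (`job_id`, not `id`)."""
    return response["job_id"]


def misplaced_files(paths: Iterable[str]) -> List[str]:
    """Return data files that are not located under files/notebooks/."""
    return [
        p for p in paths
        if REQUIRED_PREFIX not in PurePosixPath(p).parents
    ]
```

An empty result from `misplaced_files` means every data file sits somewhere beneath files/notebooks/.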

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* fix(upload): extract job_id from API response and add workspace file validation

The jobs API returns `job_id` not `id` — fix extraction so the job ID and
dashboard URL are printed after launch. Also add validation that data files
are under files/notebooks/ (the path unpacked into the agent workspace) and
that the prompt's list_workspace_files() pattern matches actual files.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add standalone launch_job script for existing tasks

Provides a simple way to launch jobs for tasks that already exist on the
server, without the upload/create flow.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
The web page now redirects the browser to http://127.0.0.1:PORT/callback
with tokens as query params instead of POSTing JSON. Replace do_POST +
do_OPTIONS (CORS preflight) with a do_GET handler that reads query params,
validates the state nonce, and returns a plain HTML confirmation page.
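
A do_GET handler along these lines could be sketched with the standard library. This is not the SDK's implementation: the `state` parameter name, the `parse_callback` helper, the class attributes, and the page text are all assumptions made for illustration.

```python
from http.server import BaseHTTPRequestHandler
from typing import Dict, Optional
from urllib.parse import parse_qs, urlparse
import hmac


def parse_callback(path: str, expected_state: str) -> Optional[Dict[str, str]]:
    """Return the callback's query params (minus state), or None if invalid."""
    parsed = urlparse(path)
    if parsed.path != "/callback":
        return None
    params = {k: v[0] for k, v in parse_qs(parsed.query).items()}
    state = params.pop("state", "")
    # Constant-time comparison of the state nonce.
    if not hmac.compare_digest(state.encode(), expected_state.encode()):
        return None
    return params


class CallbackHandler(BaseHTTPRequestHandler):
    expected_state = ""  # set by the caller before starting the server
    tokens: Optional[Dict[str, str]] = None

    def do_GET(self) -> None:
        tokens = parse_callback(self.path, self.expected_state)
        if tokens is None:
            self.send_error(400, "Invalid callback")
            return
        type(self).tokens = tokens  # hand the tokens back to the login flow
        body = b"<html><body>Login complete. You can close this tab.</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

Because the browser is redirected with a plain GET, no CORS preflight handling is needed in this design.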

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… URL

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…o-our-sdk

feat: add flt login browser auth flow (ENG-1192)
…fleet-ai#79)

* feat(task-bundle-editing): add Makefile, templates, CLAUDE.md, and workflow docs

Add task authoring toolkit for creating, editing, and deploying Fleet
evaluation tasks. Includes Makefile wrapping existing Python scripts,
verifier templates for analysis and plot tasks, CLAUDE.md for Claude Code
integration, and extended README with end-to-end workflow documentation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: clarify placeholder notation in template schema notes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
… guards

- Add Claude Code skills: task-authoring (generic), fleet-status
- Add generic verifier_template.json; remove DS-specific templates
  (analysis, plot) for a future domain-specific PR
- Fix Makefile: validate/upload accept DIR= without requiring TASK
- README: expand edit workflow, clarify templates as design scaffolds
- CLAUDE.md: update skills section

Addresses review comments on PR fleet-ai#79:
- Templates clarified as design scaffolds, not direct task.json
- Makefile DIR-only workflow unblocked

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

8 participants