# Implement assistant job model and lifecycle

Status: blocked
Tags: `enhancement`, `assistant`, `podcast`, `work-engine`, `backend`, `frontend`, `portal`, `data`, `infra`, `P0`
Depends on: #29
Blocks: None

## Current State

All agent-verifiable work (assistant job model, API, UI, export, SAM, and deterministic dry-run) is complete on current `main`. This issue remains open only for the `[HUMAN]` decision on, and smoke test of, real external assistant execution.

Evidence:

- Current human-gate audit: https://github.com/DataTalksClub/dataops/issues/30#issuecomment-4827156773
- Parent cutover tracker: #71

Remaining gate:

- [ ] [HUMAN] Verify the real Telegram/Heru/Codex/Claude execution path with production credentials, or explicitly defer that external execution path from V1 launch acceptance.

## Scope

Implement the DataOps V1 assistant job model as part of the unified workflow-first portal. Assistant runs must be durable workflow entities in `work-engine`, not local files under `assistants/podcast/inbox/`, `assistants/podcast/documents/`, or `assistants/podcast/heru_runs/`.

This issue covers the model, lifecycle rules, API, operator UI, portable export behavior, and the first safe Podcast Assistant integration boundary. The goal is that an operator can request assistant help from a task or bundle, watch the job lifecycle, review logs and failures, approve or reject outputs, retry failed work, and see generated artifacts attached back to the workflow.

Implement in the DataOps repo only. Do not modify source repos such as `../podcast-assistant`, `../dtc-operations`, or `../datatasks`.

### Product Behavior

- Add assistant jobs as first-class runtime records linked to a task, bundle, or both.
- Add a stable assistant lifecycle with explicit statuses:
  - `draft`: intake exists but is not submitted.
  - `queued`: submitted and waiting to run.
  - `running`: worker or operator-triggered runner is active.
  - `waiting_approval`: outputs exist and require operator review.
  - `approved`: operator accepted outputs.
  - `rejected`: operator rejected outputs with a reason.
  - `retrying`: retry was requested after a failure or rejected output.
  - `succeeded`: completed without further approval required, or completed after approval.
  - `failed`: terminal failure after allowed attempts or explicit failure.
  - `canceled`: operator canceled the job before completion.
- Persist job lifecycle changes, retry attempts, approval decisions, errors, and output attachment events as append-only job log or audit entries. Raw assistant transcript/log text must be stored as artifact/log references, not as unbounded DynamoDB blobs.
- Link assistant outputs to the artifact model from #29 through `artifactRefs`/`outputArtifactIds`. If #29 changes exact artifact fields, use #29 as the source of truth.
- Preserve the existing workflow-first model: tasks and bundles remain the primary operator screens; assistant jobs are contextual workflow support, not a separate disconnected app.
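
The lifecycle above can be read as a small state machine. The sketch below is one plausible reading, not the implemented contract: the status names come from the list above, while `ALLOWED_TRANSITIONS`, `canTransition`, and the specific edges (for example `approved -> succeeded` and `failed -> retrying`) are assumptions made for illustration.

```typescript
// Status names are taken from the lifecycle list; everything else is illustrative.
type AssistantJobStatus =
  | "draft" | "queued" | "running" | "waiting_approval" | "approved"
  | "rejected" | "retrying" | "succeeded" | "failed" | "canceled";

// Assumed transition table. The issue does not fix every edge, so treat
// this as a starting point to adjust against the real lifecycle contract.
const ALLOWED_TRANSITIONS: Record<AssistantJobStatus, readonly AssistantJobStatus[]> = {
  draft: ["queued", "canceled"],
  queued: ["running", "canceled"],
  running: ["waiting_approval", "succeeded", "failed", "canceled"],
  waiting_approval: ["approved", "rejected", "canceled"],
  approved: ["succeeded"],
  rejected: ["retrying", "canceled"],
  retrying: ["queued", "canceled"],
  succeeded: [],
  failed: ["retrying"],
  canceled: [],
};

// Returns true when moving from `from` to `to` is allowed under the assumed table.
function canTransition(
  from: AssistantJobStatus,
  to: AssistantJobStatus,
  approvalRequired: boolean,
): boolean {
  // Approval-required jobs must pass through waiting_approval before succeeding.
  if (from === "running" && to === "succeeded" && approvalRequired) {
    return false;
  }
  return ALLOWED_TRANSITIONS[from].includes(to);
}
```

Keeping the table as plain data makes it easy to reuse the same rules in API validation and in the UI when deciding which actions to offer.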

### Data Model

Add durable assistant job records in `work-engine` with migration-safe stable IDs. Minimum fields:

- `id` / export `assistant_job_id`
- `assistantType` / `assistant_type` such as `podcast`
- `title`
- `status`
- `taskId`
- `bundleId`
- `requestedBy`
- `inputRefs`: array of typed references to source messages, files, URLs, docs, tasks, or bundle links
- `outputArtifactIds`: stable artifact IDs produced by the job
- `logRefs`: references to stored run logs or transcript artifacts
- `approvalRequired`
- `approval`: `{ status, decidedBy, decidedAt, reason }` when applicable
- `attemptCount`
- `maxAttempts`
- `retryOfJobId` when a retry creates a new job record, or equivalent attempt history if retries stay on one job
- `lastError`: sanitized error summary and code, no secrets
- `createdAt`, `queuedAt`, `startedAt`, `completedAt`, `updatedAt`
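
As a rough illustration of the field list above, the following sketch types a job record and builds a draft. The field names come from the list; the nested shapes (`InputRef`, `Approval`, `lastError`), the `createDraftJob` helper, and the defaults for `approvalRequired` and `maxAttempts` are assumptions, and #29 remains the source of truth for artifact fields.

```typescript
type AssistantJobStatus =
  | "draft" | "queued" | "running" | "waiting_approval" | "approved"
  | "rejected" | "retrying" | "succeeded" | "failed" | "canceled";

// Assumed shape for a typed input reference.
interface InputRef {
  kind: "message" | "file" | "url" | "doc" | "task" | "bundle";
  ref: string;
}

// Assumed shape for the approval decision.
interface Approval {
  status: "approved" | "rejected";
  decidedBy: string;
  decidedAt: string;
  reason?: string;
}

interface AssistantJob {
  id: string;
  assistantType: string;
  title: string;
  status: AssistantJobStatus;
  taskId?: string;
  bundleId?: string;
  requestedBy: string;
  inputRefs: InputRef[];
  outputArtifactIds: string[];
  logRefs: string[];
  approvalRequired: boolean;
  approval?: Approval;
  attemptCount: number;
  maxAttempts: number;
  retryOfJobId?: string;
  lastError?: { code: string; summary: string };
  createdAt: string;
  queuedAt?: string;
  startedAt?: string;
  completedAt?: string;
  updatedAt: string;
}

interface DraftInput {
  id: string;
  assistantType: string;
  title: string;
  requestedBy: string;
  taskId?: string;
  bundleId?: string;
  inputRefs?: InputRef[];
  approvalRequired?: boolean;
  maxAttempts?: number;
}

// Hypothetical helper: builds a draft record with assumed defaults.
function createDraftJob(input: DraftInput, now: Date = new Date()): AssistantJob {
  const timestamp = now.toISOString();
  return {
    ...input,
    status: "draft",
    inputRefs: input.inputRefs ?? [],
    outputArtifactIds: [],
    logRefs: [],
    approvalRequired: input.approvalRequired ?? true, // assumed default
    attemptCount: 0,
    maxAttempts: input.maxAttempts ?? 3, // assumed default
    createdAt: timestamp,
    updatedAt: timestamp,
  };
}
```

Passing `now` explicitly keeps the helper deterministic in tests.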

Add append-only job event/log records or audit events with stable IDs and these minimum fields:

- `id` / export `audit_event_id` or equivalent job-event ID
- `assistantJobId`
- `actorId`
- `action`: `created`, `queued`, `started`, `log-appended`, `artifact-attached`, `approval-requested`, `approved`, `rejected`, `retry-requested`, `failed`, `canceled`, `succeeded`
- `summary`
- `metadata`: bounded JSON payload for attempt number, artifact IDs, error code, runner name, or status transition
- `createdAt`
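
A minimal sketch of how an append-only event with a bounded `metadata` payload might be built. The action names and field names come from the lists above; `buildJobEvent`, `appendEvent`, and the `MAX_METADATA_CHARS` limit are hypothetical, and the real bound is whatever the implementation chooses.

```typescript
type JobEventAction =
  | "created" | "queued" | "started" | "log-appended" | "artifact-attached"
  | "approval-requested" | "approved" | "rejected" | "retry-requested"
  | "failed" | "canceled" | "succeeded";

interface JobEvent {
  id: string;
  assistantJobId: string;
  actorId: string;
  action: JobEventAction;
  summary: string;
  metadata: Record<string, unknown>;
  createdAt: string;
}

// Assumed bound on the serialized metadata, measured in characters.
const MAX_METADATA_CHARS = 2048;

// Builds a shallowly frozen event; throws when the metadata payload is too large.
function buildJobEvent(
  fields: Omit<JobEvent, "createdAt">,
  now: Date = new Date(),
): Readonly<JobEvent> {
  const serialized = JSON.stringify(fields.metadata);
  if (serialized.length > MAX_METADATA_CHARS) {
    throw new Error(
      `metadata exceeds ${MAX_METADATA_CHARS} characters; store large payloads as log artifacts`,
    );
  }
  return Object.freeze({ ...fields, createdAt: now.toISOString() });
}

// Append-only: returns a new array and never mutates or reorders existing entries.
function appendEvent(
  log: readonly Readonly<JobEvent>[],
  event: Readonly<JobEvent>,
): readonly Readonly<JobEvent>[] {
  return [...log, event];
}
```

Rejecting oversized payloads at build time pushes raw transcript text toward `logRefs` rather than into the event record itself.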

DynamoDB requirements:

- Production tables must be declared in `lambda-functions/template.full.yaml`; production code must not create unmanaged tables on cold start.
- Add environment variables for any new tables, including `DATAOPS_ASSISTANT_JOBS_TABLE` and, if implemented separately, `DATAOPS_AUDIT_EVENTS_TABLE`.
- Use stack-scoped table names, `PAY_PER_REQUEST`, point-in-time recovery, retain policy, DataOps tags, and least-privilege IAM for `WorkEngineFunction`.
- Local/test mode may auto-create local tables through `work-engine/src/db/setup.ts`.

### API

Add authenticated work-engine API routes. Exact route names can follow repo conventions, but they must support:

- Create/update draft job intake with `assistantType`, `taskId`, `bundleId`, `inputRefs`, `approvalRequired`, and retry policy.
- Submit a draft job to `queued`.
- List jobs with filters for `status`, `assistantType`, `taskId`, `bundleId`, and jobs needing approval.
- Fetch one job with its related artifacts and bounded event/log timeline.
- Append bounded job log/event entries from trusted backend code.
- Transition `queued -> running -> waiting_approval|succeeded|failed` with validation.
- Retry failed or rejected jobs without losing the original attempt history.
- Approve or reject jobs that are in `waiting_approval`; rejection requires a reason.
- Cancel jobs that are not terminal.
- Attach output artifacts to jobs and reflect references on the linked task/bundle.

Validation requirements:

- Reject unknown statuses and invalid state transitions.
- Reject jobs without at least one of `taskId` or `bundleId`.
- Reject approval/rejection on jobs not in `waiting_approval`.
- Reject retry when `attemptCount >= maxAttempts` unless the API explicitly records a privileged override.
- Redact or reject secrets in logs and errors where detectable.
- Return consistent 400/401/404 responses matching existing work-engine API style.
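
The validation rules above could be expressed as small pure functions that the route handlers call before persisting anything. This is a sketch under assumptions: the function names, the `ValidationResult` shape, and the error messages are hypothetical, and the real responses should match the existing work-engine API style.

```typescript
type ValidationResult =
  | { ok: true }
  | { ok: false; status: 400; error: string };

const OK: ValidationResult = { ok: true };
const fail = (error: string): ValidationResult => ({ ok: false, status: 400, error });

// A job must be linked to a task, a bundle, or both.
function validateIntake(job: { taskId?: string; bundleId?: string }): ValidationResult {
  if (!job.taskId && !job.bundleId) {
    return fail("at least one of taskId or bundleId is required");
  }
  return OK;
}

// Approval and rejection are only valid in waiting_approval; rejection needs a reason.
function validateDecision(
  job: { status: string },
  decision: "approve" | "reject",
  reason?: string,
): ValidationResult {
  if (job.status !== "waiting_approval") {
    return fail(`cannot ${decision} a job in status ${job.status}`);
  }
  if (decision === "reject" && (!reason || reason.trim() === "")) {
    return fail("rejection requires a reason");
  }
  return OK;
}

// Retry is limited by maxAttempts unless a privileged override is recorded.
function validateRetry(
  job: { status: string; attemptCount: number; maxAttempts: number },
  privilegedOverride: boolean = false,
): ValidationResult {
  if (job.status !== "failed" && job.status !== "rejected") {
    return fail(`cannot retry a job in status ${job.status}`);
  }
  if (job.attemptCount >= job.maxAttempts && !privilegedOverride) {
    return fail("retry limit reached");
  }
  return OK;
}
```

Authentication (401) and lookup (404) checks are left out here because they depend on the surrounding request handling.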

### Portal UI

Integrate assistant jobs into the existing unified portal UI:

- Add operator visibility for assistant jobs from linked task rows and bundle detail pages.
- Add an Assistants view or panel that lists recent jobs, running jobs, failed jobs, and jobs waiting for approval.
- Show assistant type, linked workflow/task, status, attempt count, last update time, and next available action.
- Allow the operator to create a Podcast Assistant job from a podcast task or bundle by selecting existing task/bundle context and adding input references.
- Allow retry, cancel, approve, and reject actions from the UI when state allows them.
- Show a bounded timeline/log preview with links to full log artifacts where applicable.
- Show output artifacts on the linked task/bundle after they are attached.
- Failed assistant jobs must surface as operator-visible risk in the Assistants view, the workflow detail, or both, and should produce an `automation-failure` notification where appropriate.
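
To show the "next available action" per job, the UI could derive its buttons from the job state instead of hard-coding them per screen. The sketch below is illustrative only: `availableActions`, the `OperatorAction` names, and the choice of terminal statuses are assumptions, not the implemented UI contract.

```typescript
type AssistantJobStatus =
  | "draft" | "queued" | "running" | "waiting_approval" | "approved"
  | "rejected" | "retrying" | "succeeded" | "failed" | "canceled";

type OperatorAction = "submit" | "approve" | "reject" | "retry" | "cancel";

// Assumed set of terminal statuses; cancel is offered for everything else.
const TERMINAL_STATUSES: readonly AssistantJobStatus[] = ["succeeded", "failed", "canceled"];

interface JobSummary {
  status: AssistantJobStatus;
  attemptCount: number;
  maxAttempts: number;
}

// Returns the actions the operator may take, in a stable display order.
function availableActions(job: JobSummary): OperatorAction[] {
  const actions: OperatorAction[] = [];
  if (job.status === "draft") {
    actions.push("submit");
  }
  if (job.status === "waiting_approval") {
    actions.push("approve", "reject");
  }
  const retryable = job.status === "failed" || job.status === "rejected";
  if (retryable && job.attemptCount < job.maxAttempts) {
    actions.push("retry");
  }
  if (!TERMINAL_STATUSES.includes(job.status)) {
    actions.push("cancel");
  }
  return actions;
}
```

The server must still validate every action; this function only decides what to render.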

### Podcast Assistant Boundary

Use `assistants/podcast` as the first assistant type, but do not require production Telegram, Groq, Heru, Codex, or Claude credentials for normal tests.

- Preserve the current safe unit-test behavior of `uv run --project assistants/podcast pytest`.
- Add a dry-run or fake-runner boundary that can turn a podcast job payload into deterministic output metadata for tests.
- Do not treat `assistants/podcast/inbox/`, `assistants/podcast/documents/`, or `assistants/podcast/heru_runs/` as production storage.
- Real Telegram/Heru processing may remain opt-in and must not be required for CI.
- [HUMAN] Real production assistant execution with Telegram/Heru credentials, if connected in this issue, requires manual verification before closing any human-only rollout checklist.
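
A deterministic fake runner can be as simple as a pure function from the job payload to output metadata. The sketch below assumes a hypothetical payload shape and artifact naming scheme, and uses an FNV-1a hash only so that the same input always yields the same artifact ID. It makes no network calls and needs no credentials.

```typescript
// Hypothetical payload and result shapes for the dry-run boundary.
interface PodcastJobPayload {
  jobId: string;
  assistantType: "podcast";
  inputRefs: string[];
  approvalRequired: boolean;
}

interface DryRunResult {
  runner: string;
  status: "waiting_approval" | "succeeded";
  outputArtifacts: { artifactId: string; kind: string; title: string }[];
}

// 32-bit FNV-1a hash, rendered as 8 hex characters. Not cryptographic.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

// Pure function: same payload in, same metadata out.
function runPodcastDryRun(payload: PodcastJobPayload): DryRunResult {
  const canonical = `${payload.assistantType}|${payload.jobId}|${payload.inputRefs.join(",")}`;
  const digest = fnv1a(canonical);
  return {
    runner: "podcast-dry-run", // assumed runner name
    status: payload.approvalRequired ? "waiting_approval" : "succeeded",
    outputArtifacts: [
      {
        artifactId: `artifact-dryrun-${digest}`, // assumed ID scheme
        kind: "podcast-notes", // assumed artifact kind
        title: `Dry-run output for ${payload.jobId}`,
      },
    ],
  };
}
```

Because the result depends only on the payload, tests can assert on exact equality between runs without snapshotting timestamps or random IDs.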

### Export And Data Safety

Portable exports must include assistant job data safely:

- Add `assistant_jobs.jsonl` to normal exports once the entity is implemented.
- Include job IDs, relationships, statuses, input/output/log references, retry/approval metadata, sanitized errors, and timestamps.
- Include job log/audit entities in `audit_events.jsonl` or a documented exported job-event entity. Prefer the existing `audit_events.jsonl` contract from `docs/v1-execution-data-safety.md` unless implementation constraints require a documented alternative.
- Remove `assistant_jobs` from `manifest.omitted_entities` after export support is implemented.
- Keep `artifacts` omitted only if #29 is not merged into the implementation branch; if #29 is merged, relationship validation must verify output artifact IDs.
- Validate relationship integrity for job `task_id`, `bundle_id`, `requested_by`, `output_artifact_ids`, and log/artifact references.
- Normal exports must exclude secrets, session tokens, raw credentials, signed URLs, and unbounded raw runner transcripts.
- Dry-run import validation must report assistant job inserts/updates without writing production data.
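
As an illustration of the export rules above, the sketch below maps a job record to a snake_case row and writes one JSON object per line. The `assistant_job_id`, `assistant_type`, `task_id`, `bundle_id`, `requested_by`, and `output_artifact_ids` names appear in this issue; `last_error`, the helper names, and the redaction pattern are assumptions, and a real exporter would need broader secret detection than a single regular expression.

```typescript
interface ExportableJob {
  id: string;
  assistantType: string;
  status: string;
  taskId?: string;
  bundleId?: string;
  requestedBy: string;
  outputArtifactIds: string[];
  lastError?: { code: string; summary: string };
}

// Assumed pattern: catches simple `key=value` or `key: value` secrets only.
const SECRET_PATTERN = /\b(token|secret|password|api[_-]?key)\s*[=:]\s*\S+/gi;

function redact(text: string): string {
  return text.replace(SECRET_PATTERN, "$1=[REDACTED]");
}

// Maps the runtime record to the export row, redacting the error summary.
function toExportRow(job: ExportableJob) {
  return {
    assistant_job_id: job.id,
    assistant_type: job.assistantType,
    status: job.status,
    task_id: job.taskId ?? null,
    bundle_id: job.bundleId ?? null,
    requested_by: job.requestedBy,
    output_artifact_ids: job.outputArtifactIds,
    last_error: job.lastError
      ? { code: job.lastError.code, summary: redact(job.lastError.summary) }
      : null,
  };
}

// JSON Lines: one serialized row per line, with a trailing newline.
function toJsonl(jobs: ExportableJob[]): string {
  return jobs.map((job) => JSON.stringify(toExportRow(job))).join("\n") + "\n";
}
```

Redaction at export time is a second line of defense; errors should already be sanitized when `lastError` is written.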


- [x] `work-engine` defines typed assistant job status, job record, input reference, approval, retry, and job event/log contracts that align with `docs/v1-workflow-data-model.md` and `docs/v1-execution-state-schema.md`.
- [x] Production DynamoDB/SAM configuration owns the new durable assistant-job storage and any audit/log storage; local/test auto-create remains available only for local/test mode.
- [x] Authenticated APIs support job intake, submit, list/filter, detail, lifecycle transitions, bounded logs/events, retry, cancel, approval, rejection, and artifact attachment.
- [x] API validation rejects invalid state transitions, missing task/bundle context, invalid retries, invalid approvals, unknown statuses, and malformed references.
- [x] Task and bundle records can show assistant job references without embedding job internals or raw logs.
- [x] Portal UI exposes assistant jobs from workflow context and provides an operator-visible queue for running, failed, and approval-needed jobs.
- [x] Podcast Assistant has a deterministic dry-run/fake-runner integration path that can create or update a podcast assistant job and output artifact metadata without external credentials.
- [x] Assistant failures and exhausted retries are visible to operators and produce enough context to retry, cancel, or file a follow-up issue.
- [x] Approval-required jobs cannot become `succeeded` until approved; rejected jobs preserve the rejection reason and can be retried according to retry policy.
- [x] Portable export writes and validates `assistant_jobs.jsonl`; manifest omissions are updated accurately; relationship checks cover assistant job links and exported job events/logs.
- [x] Normal export redacts secrets and does not include raw Telegram tokens, API keys, session tokens, or unbounded assistant transcript/log bodies.
- [x] Existing work-engine task, bundle, template, notification, file, recurring, export, and auth tests still pass.
- [x] Existing Podcast Assistant unit tests still pass without Telegram, Groq, Heru, Codex, or Claude credentials.
- [x] Tester captures screenshots for changed assistant/job UI surfaces and verifies they are not 404s, blank, broken, or overlapping.
- [ ] [HUMAN] Any real Telegram/Heru/Codex/Claude production execution path is manually verified with real credentials before being treated as production-ready.


### Scenario: Create and submit assistant job from workflow context

Given: an authenticated operator, an active podcast bundle, and at least one task in that bundle
When: the operator creates a Podcast Assistant job with source input references and submits it
Then: the job is stored as `queued`, references the bundle/task, appears in the assistant queue, and appears from the linked workflow context.

### Scenario: Job runs and waits for approval

Given: a queued podcast assistant job with `approvalRequired=true`
When: the deterministic runner records `running`, attaches output artifact metadata, and finishes
Then: the job becomes `waiting_approval`, output artifact IDs are visible, and the linked task/bundle shows the assistant job reference.

### Scenario: Approve output

Given: a job in `waiting_approval` with output artifacts
When: the operator approves it
Then: the job records the approver and timestamp, transitions to `approved` or `succeeded` according to the implemented lifecycle contract, and appends an immutable approval event.

### Scenario: Reject and retry output

Given: a job in `waiting_approval`
When: the operator rejects it with a reason and then retries it
Then: the rejection reason remains in history, retry attempt count increments or a linked retry job is created, and the new attempt can progress without overwriting the original history.

### Scenario: Failure and exhausted retries

Given: a job with `maxAttempts=2`
When: the runner fails twice with sanitized errors
Then: the job becomes `failed`, no further non-override retry is allowed, and the UI shows the failure with retry/cancel/follow-up context as appropriate.

### Scenario: Invalid transitions are blocked

Given: jobs in terminal and non-approval states
When: API clients try to approve a non-approval job, cancel a succeeded job, retry beyond max attempts, or mark an approval-required job as succeeded without approval
Then: the API returns 400 and persists no invalid transition event.

### Scenario: Export and validate assistant job data

Given: users, tasks, bundles, assistant jobs, output artifact references, and job events/logs exist in local test data
When: `export:data`, `validate:export`, and dry-run import are run
Then: `assistant_jobs.jsonl` is present, manifest counts/checksums are correct, omitted entities are accurate, references validate, and secrets/raw unbounded logs are absent.

### Scenario: Existing workflow behavior remains intact

Given: existing tasks, bundles, templates, recurring configs, files, notifications, and auth flows
When: the full relevant work-engine unit/typecheck/build and UI tests run
Then: existing behavior still passes and assistant job additions do not regress task completion, proof requirements, waiting tasks, or bundle detail rendering.

## Out Of Scope

- Grooming or implementing the artifact storage policy itself; that belongs to #29.
- Replacing the current vanilla work-engine frontend framework.
- Building a general external queue service, EventBridge worker fleet, or background orchestration platform beyond what is needed for V1 assistant job lifecycle state.
- Requiring real Telegram, Groq, Heru, Codex, Claude, Google Drive, Dropbox, or S3 credentials in normal automated tests.
- Moving production binaries or large generated files into DynamoDB.
- Modifying `../podcast-assistant`, `../dtc-operations`, `../datatasks`, or other source repos.
- Automatically pushing assistant outputs to public websites, Slack, Telegram, email, Google Docs, or other external systems.
- Full generalized audit-event product UX beyond assistant job lifecycle history needed here.

## Dependencies

- #29 should define the artifact model and storage policy used by `outputArtifactIds`, `artifactRefs`, log artifacts, storage URIs, artifact statuses, and artifact export validation.
- The existing V1 runtime architecture in `docs/v1-runtime-architecture.md` remains the deployment boundary: public Python portal brokers authenticated `/work/api/*` calls to private `WorkEngineFunction`.
- SAM/CloudFormation must remain the owner of production DynamoDB tables.
- Existing assistant docs and code under `assistants/podcast` remain the local Podcast Assistant reference and test suite.

## Verification Commands

Run from repo root unless noted otherwise:

```bash
npm --prefix work-engine test
npm --prefix work-engine run typecheck
npm --prefix work-engine run build
npm --prefix work-engine run test:e2e
uv run --project assistants/podcast pytest
npm --prefix work-engine run export:data -- .tmp/exports/assistant-jobs
npm --prefix work-engine run validate:export -- .tmp/exports/assistant-jobs
npm --prefix work-engine run dry-run:import -- .tmp/exports/assistant-jobs
```

If `lambda-functions/template.full.yaml` or deployment workflow files change, also run:

```bash
sam validate --template-file lambda-functions/template.full.yaml
```

Tester must include screenshot evidence for the Assistants queue/panel and any changed task or bundle detail assistant-job surfaces.



