diff --git a/proposals/0001-multi-turn-conversation-eval.md b/proposals/0001-multi-turn-conversation-eval.md
new file mode 100644
index 0000000..4447f76
--- /dev/null
+++ b/proposals/0001-multi-turn-conversation-eval.md
@@ -0,0 +1,1984 @@
+---
+title: Multi-Turn Conversation Evaluation Support
+authors:
+  - "kongtang"
+creation-date: 2026-05-19
+last-updated: 2026-05-19
+status: provisional
+---
+
+# SUP-0001: Multi-Turn Conversation Evaluation Support
+
+<!-- toc -->
+- [Summary](#summary)
+- [Motivation](#motivation)
+  - [Goals](#goals)
+  - [Non-Goals](#non-goals)
+- [Requirements](#requirements)
+- [Proposal](#proposal)
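+  - [User Scenario Quick Reference](#user-scenario-quick-reference)
+  - [Core Approach](#core-approach)
+  - [Agent Session Resume Mechanism](#agent-session-resume-mechanism)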
+  - [Notes/Constraints/Caveats](#notesconstraintscaveats)
+  - [Risks and Mitigations](#risks-and-mitigations)
+- [Design Details](#design-details)
+  - [Schema Changes](#schema-changes)
+  - [Evaluator Multi-Turn Execution Engine](#evaluator-multi-turn-execution-engine)
+  - [Agent Interface Extension](#agent-interface-extension)
+  - [Judge Per-Turn Assertions](#judge-per-turn-assertions)
+  - [Reliability Mechanisms](#reliability-mechanisms)
+- [Test Plan](#test-plan)
+- [Drawbacks](#drawbacks)
+- [Alternatives](#alternatives)
+- [Infrastructure Needed](#infrastructure-needed)
+- [Upgrade & Migration Strategy](#upgrade--migration-strategy)
+<!-- /toc -->
+
+## Summary
+
+Although skill-up already defines `input.turns` and `Turn` (including `PostCondition`) at the schema level, the evaluator currently concatenates all turns into a single instruction and sends it to the Agent Engine in one shot. **There is no actual turn-by-turn interaction, no intermediate assertion, and no way to stop early based on an intermediate response, so the core mechanisms of multi-turn conversation evaluation are missing.** This proposal designs and implements turn-by-turn evaluation so that skill-up can verify Skill behaviors that only appear across turns: phase gating, double confirmation, information clarification, iterative refinement, and cross-turn state reference.
+
+## Motivation
+
+Many Agent Skills' core value can only be demonstrated through multi-turn interactions; single-turn tests cannot cover them. Specific problems include:
+
+1. **Phase gating cannot be verified**: For workflow Skills like SDD-RIPER, one needs to first start a task normally and then attempt to skip a phase, verifying whether the Skill's "guardrails" are effective
+2. **Double confirmation flows are untestable**: Dangerous operations (file deletion, production deployment) have "ask → confirm → execute" and "ask → reject → cancel" paths that require at least two turns of interaction
+3. **Information clarification behavior is missing**: When parameters are incomplete, the Skill should ask clarifying questions rather than guess. This requires "clarify → provide → execute" multi-turn verification
+4. **Iterative refinement cannot be evaluated**: Code generation Skills need incremental modifications based on previous output, which single-turn tests cannot simulate
+5. **Cross-turn state reference is missing**: Creating a resource in the first turn and operating on it in the second turn requires verifying that the Skill correctly maintains context
+
+**Specific problems in the current codebase**:
+- In `internal/evaluator/evaluator.go`, `buildCaseMessages()` builds all turns into messages and passes them to `agent.Run()` in one shot
+- In `internal/agent/agent.go`, `BuildInstructionFromMessages()` concatenates all user messages into a single string
+- All Agent implementations (claude_code, codex, qodercli) call `BuildInstructionFromMessages()` for one-shot execution
+- `PostCondition` is defined in the schema but has no checking logic in the evaluator
+- The `rule_based` Judge only supports global assertions, not per-turn assertions
+
+### Goals
+
+1. **Turn-by-turn execution**: The evaluator invokes the Agent for each turn, checks `post_condition` after each turn completes, then decides whether to proceed to the next turn
+2. **Session continuity**: Multi-turn interactions within the same eval case share the Agent session context, rather than starting a new session for each turn
+3. **Intermediate assertions**: `post_condition` executes after each turn completes, supporting `skip_remaining` (skip subsequent turns) and `fail` (immediate failure)
+4. **Per-turn Judge assertions**: The rule_based Judge adds `turn_response_contains` / `turn_response_not_contains` rules
+5. **Dynamic value capture**: `capture` extracts values from a turn's output for use in subsequent turns' prompts via template variables
+6. **Backward compatibility**: The single-turn `input.prompt` mode is unaffected; existing cases require no modifications
+
+### Non-Goals
+
+1. **Agent Engine protocol modification**: No changes to the underlying communication protocols of claude_code / codex / qodercli; multi-turn is achieved through existing session resume mechanisms like `--resume`
+2. **Parallel turn execution**: Turns are strictly sequential; parallelism is not supported
+3. **Automated conversation tree/branch testing**: This phase only supports linear multi-turn sequences, not conditional branches forming conversation trees
+4. **Agent-side streaming real-time assertions**: Assertions execute only after each turn completes, not during streaming output
+5. **Dynamic content generation**: All turns' content must be pre-defined in YAML; runtime generation by LLM or script is not supported (`capture` + `{{variable}}` template variables provide limited dynamic value referencing, but the prompt structure itself is deterministic)
+
+## Requirements
+
+### Must Have
+
+| ID  | Requirement               | Acceptance Criteria                                                                                                               |
+| --- | ------------------------- | --------------------------------------------------------------------------------------------------------------------------------- |
+| R1  | Turn-by-turn execution    | The evaluator invokes the Agent separately for each turn in `input.turns`, collecting the response after each turn                |
+| R2  | Session continuity        | Multi-turn interactions within the same case share the Agent session context; the Agent can reference content from previous turns |
+| R3  | post_condition check      | `post_condition` is evaluated after each turn; `on_fail: skip_remaining` skips subsequent turns                                   |
+| R4  | Per-turn Judge assertions | `turn_response_contains` / `turn_response_not_contains` assertions support specifying the turn number                             |
+| R5  | Backward compatibility    | Existing `input.prompt` single-turn cases are unaffected and require no modifications                                             |
+| R6  | Transcript completeness   | The complete transcript of multi-turn interactions records all turns, with each message annotated with its turn number            |
+
+### Should Have
+
+| ID  | Requirement                    | Acceptance Criteria                                                                                                       |
+| --- | ------------------------------ | ------------------------------------------------------------------------------------------------------------------------- |
+| S1  | Capture value extraction       | Extract values from a turn's response via regex or JSONPath, available as template variables in subsequent turns' prompts |
+| S2  | Per-turn timeout               | Each turn can have its own timeout, independent of the case-level timeout                                                 |
+| S3  | Per-turn tool_called assertion | `tool_called_in_turn` assertion can specify which turn to check for tool invocations                                      |
+
+### Nice to Have
+
+| ID  | Requirement               | Acceptance Criteria                                   |
+| --- | ------------------------- | ----------------------------------------------------- |
+| N1  | Retry mechanism extension | `retry_on` adds a new `turn_precondition_fail` option |
+| N2  | Per-turn agent_judge      | Use LLM-as-Judge to evaluate a specific turn's output |
+
+## Proposal
+
+### User Scenario Quick Reference
+
+Before the technical design, three typical scenarios show how a multi-turn evaluation case is configured in practice.
+
+---
+
+#### Scenario 1: Phase Gating — Skill Should Reject When User Attempts to Skip a Phase
+
+**Test objective**: The SDD-RIPER workflow Skill should reject and guide the user to follow the correct order when the user requests to skip the Research phase.
+
+```yaml
+# cases/phase-gate.yaml
+id: phase-gate-enforcement
+title: Skill should reject when user attempts to skip a phase
+
+input:
+  turns:
+    # Turn 1: Start the task normally; Agent should enter the Research phase
+    - role: user
+      content: "sdd_bootstrap: task=implement user login"
+      post_condition:
+        must_contain_any: ["Research", "analyze", "understand requirements"]
+        on_fail: skip_remaining   # Agent didn't enter Research → skip subsequent turns (scenario doesn't apply)
+
+    # Turn 2: Attempt to skip; Agent should reject
+    - role: user
+      content: "Skip the Research phase and write the code directly"
+
+judge:
+  type: rule_based
+  success:
+    - turn_response_contains:      # Assert turn 2 response contains rejection keywords
+        turn: 2
+        contains_any: ["need to complete first", "cannot skip", "execute in order"]
+  failure:
+    - turn_response_contains:      # Code appears in turn 2 → gating failed
+        turn: 2
+        contains_any: ["```python", "```java", "def ", "class "]
+```
+
+**Key points**:
+- `post_condition` checks after turn 1 whether the Agent entered the expected phase; if not, subsequent turns are skipped
+- `turn_response_contains` asserts specifically on the Agent's turn 2 response
+
+---
+
+#### Scenario 2: Double Confirmation — Confirm/Reject Paths for Dangerous Operations
+
+**Test objective**: The file deletion Skill should ask for confirmation before executing; only execute after user confirms.
+
+```yaml
+# cases/delete-confirm.yaml
+id: delete-with-confirmation
+title: File deletion requires double confirmation
+
+input:
+  turns:
+    # Turn 1: Issue a delete request
+    - role: user
+      content: "Delete all log files under /tmp/data/"
+      post_condition:
+        must_contain_any: ["confirm", "sure", "proceed"]
+        on_fail: fail              # Agent deleted without asking → test fails
+
+    # Turn 2: User confirms
+    - role: user
+      content: "Yes, confirm deletion"
+
+judge:
+  type: rule_based
+  success:
+    - turn_response_contains:
+        turn: 1
+        contains_any: ["confirm", "sure", "proceed"]   # Turn 1 should ask for confirmation
+    - turn_response_contains:
+        turn: 2
+        contains_any: ["deleted", "done", "removed"]   # Turn 2 should execute deletion
+```
+
+**Key points**:
+- `on_fail: fail` means if the Agent doesn't ask for confirmation in turn 1, the entire case is immediately marked as failed
+- The two `turn_response_contains` rules target different turns: turn 1 must ask for confirmation, and turn 2 must report the deletion
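+
+The rejection path ("ask → reject → cancel") from the Motivation section can be covered by a sibling case. The following is a minimal sketch that reuses only the schema proposed in this document; the case `id`, the file name, and every keyword list are illustrative assumptions rather than part of an existing suite:
+
+```yaml
+# cases/delete-reject.yaml (hypothetical file name)
+id: delete-rejected-by-user
+title: File deletion is cancelled when the user declines
+
+input:
+  turns:
+    # Turn 1: Same delete request as the confirm path
+    - role: user
+      content: "Delete all log files under /tmp/data/"
+      post_condition:
+        must_contain_any: ["confirm", "sure", "proceed"]
+        on_fail: fail              # Agent deleted without asking → test fails
+
+    # Turn 2: User declines
+    - role: user
+      content: "No, do not delete anything"
+
+judge:
+  type: rule_based
+  success:
+    - turn_response_contains:
+        turn: 2
+        contains_any: ["cancel", "aborted", "no files were"]
+  failure:
+    - turn_response_contains:      # Deletion reported after the user declined → confirmation gate failed
+        turn: 2
+        contains_any: ["successfully deleted", "have been removed"]
+```
+
+Keyword matching is approximate here: a phrase such as "nothing was deleted" would still contain "deleted", which is why the failure keywords are longer phrases.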
+
+---
+
+#### Scenario 3: Cross-Turn State Reference — Resource ID Created in Turn 1 Used in Turn 2
+
+**Test objective**: After the Agent creates a database table, the user references that table name to insert data; the Agent should correctly reference it.
+
+```yaml
+# cases/cross-turn-reference.yaml
+id: cross-turn-table-reference
+title: Cross-turn reference — operate using the table name created in the previous turn
+
+input:
+  turns:
+    # Turn 1: Create table
+    - role: user
+      content: "Create a users table with id, name, and email fields"
+      post_condition:
+        must_contain_any: ["CREATE TABLE", "create table"]
+        on_fail: fail
+      capture:
+        - variable: table_name              # Extract table name from Agent response
+          pattern: "(?i)CREATE TABLE\\s+(?P<value>\\w+)"
+
+    # Turn 2: Use {{table_name}} to reference the extracted table name from the previous turn
+    - role: user
+      content: "Insert a test record into the {{table_name}} table"
+
+judge:
+  type: rule_based
+  success:
+    - turn_response_contains:
+        turn: 2
+        contains_any: ["INSERT INTO"]
+```
+
+**Key points**:
+- `capture` extracts a value from the Agent response via regex and stores it in the variable `table_name`
+- Turn 2's `content` references this value with the `{{table_name}}` template syntax; at runtime the placeholder is replaced with the extracted table name
+- This means the eval case doesn't need to know in advance what name the Agent will give the table
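+
+The pattern above captures the first word after `CREATE TABLE`, so a response such as `CREATE TABLE IF NOT EXISTS users` would yield `IF`. A slightly more tolerant rule is sketched below; it assumes patterns are compiled with Go's `regexp` (RE2) syntax, which this proposal does not state explicitly, and it still does not handle quoted or schema-qualified identifiers:
+
+```yaml
+capture:
+  - variable: table_name
+    # Optional non-capturing group skips "IF NOT EXISTS" before the named group
+    pattern: "(?i)CREATE TABLE\\s+(?:IF NOT EXISTS\\s+)?(?P<value>\\w+)"
+```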
+
+---
+
+> **Summary**: The core configuration pattern of multi-turn conversation evaluation is `input.turns` (define prompts per turn) + `post_condition` (inter-turn assertions) + `capture`/`{{variable}}` (cross-turn value passing) + `turn_response_contains` (per-turn Judge assertions). All turn content is pre-defined static text (with template variable substitution), so the inputs of an evaluation are deterministic even though Agent responses may still vary between runs.
+
+### Core Approach
+
+Change the evaluator's case execution mode from "send all messages in one shot" to "iterative turn-by-turn execution." For each turn:
+
+1. Build the current turn's user message (`content` field + `{{variable}}` template substitution)
+2. Invoke the Agent (using session resume to maintain context)
+3. Collect the Agent response
+4. Execute `post_condition` check
+5. If passed, optionally execute `capture` to extract values
+6. Inject extracted values into the next turn's prompt template
+7. Proceed to the next turn or terminate
+
+```
+┌─────────────────────────────────────────────────┐
+│                  Case Execution                  │
+│                                                  │
+│   Turn 1          Turn 2          Turn N         │
+│  ┌──────┐       ┌──────┐       ┌──────┐         │
+│  │Prompt│──────▶│Prompt│──────▶│Prompt│         │
+│  └──┬───┘       └──┬───┘       └──┬───┘         │
+│     │              │              │              │
+│     ▼              ▼              ▼              │
+│  ┌──────┐       ┌──────┐       ┌──────┐         │
+│  │Agent │       │Agent │       │Agent │         │
+│  │ Run  │       │Resume│       │Resume│         │
+│  └──┬───┘       └──┬───┘       └──┬───┘         │
+│     │              │              │              │
+│     ▼              ▼              ▼              │
+│  ┌──────┐       ┌──────┐       ┌──────┐         │
+│  │Post  │       │Post  │       │ (no  │         │
+│  │Cond  │       │Cond  │       │check)│         │
+│  └──┬───┘       └──┬───┘       └──┬───┘         │
+│     │              │              │              │
+│     ▼              ▼              ▼              │
+│  Capture?       Capture?       ──────────┐      │
+│     │              │                      │      │
+│     ▼              ▼                      ▼      │
+│  ┌───────────────────────────────────────────┐   │
+│  │       Judge (global + per-turn assertions) │   │
+│  └───────────────────────────────────────────┘   │
+└─────────────────────────────────────────────────┘
+```
+
+### Agent Session Resume Mechanism
+
+The key challenge of multi-turn evaluation is maintaining the Agent's session context across multiple invocations. The table below surveys the session resume capabilities of each Agent Engine:
+
+| Engine      | Resume Method                                | Programmatic Command                    | Verification Status                                                                                                                       |
+| ----------- | -------------------------------------------- | --------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------- |
+| claude_code | `--resume <session-id>` + `-p`               | `claude --resume <id> -p "follow-up"`   | ✅ Confirmed by [official docs](https://code.claude.com/docs/en/cli-reference)                                                             |
+| codex       | `codex exec resume <session-id>`             | `codex exec resume <id> "follow-up"`    | ✅ Confirmed by [official docs](https://developers.openai.com/codex/cli/features) for non-interactive mode                                 |
+| qodercli    | `--session-id` flag (Aone internal protocol) | API alignment with qodercli team needed | Not implemented in first version; falls back to single-shot concatenation (automatic fallback when `SessionResumer` type assertion fails) |
+
+**Session ID Sources**:
+
+| Engine      | Session ID Generation                  | Session ID Storage                                               |
+| ----------- | -------------------------------------- | ---------------------------------------------------------------- |
+| claude_code | `uuid.New()` passed via `--session-id` | `claudePrintJSONResult.SessionID` field, parsed from JSON output |
+| codex       | Auto-generated by codex CLI            | Most recent session filename under `~/.codex/sessions/`          |
+
+**Key Design Decisions**:
+
+1. **Session ID retrieval**: claude_code extracts from the `session_id` field in JSON output; codex parses from the most recent session filename under `~/.codex/sessions/`
+2. **Agent interface extension**: A new optional interface `SessionResumer` (with `RunTurn` method) is added; the evaluator checks capability via type assertion. Engines that don't support it fall back to single-shot concatenation mode
+3. **Priority**: Phase 1 implements claude_code, Phase 2 implements codex, qodercli follows after API confirmation
+
+### Notes/Constraints/Caveats
+
+1. **Agent Engine dependency**: Session resume depends on each Agent CLI's `--resume` / `--session-id` capability. If an Engine doesn't support session resume, multi-turn tests on that Engine fall back to the existing "concatenate all turns and send in one shot" behavior, and the report notes the fallback
+2. **Model randomness**: Agent responses to the same prompt may vary, so `post_condition` should match loosely (`must_contain_any` with several alternative keywords rather than an exact match)
+3. **Cost control**: Multi-turn interactions consume significantly more tokens than single-turn ones. Cases should keep the number of turns small (2-5 turns recommended)
+
+### Risks and Mitigations
+
+| Risk                                                          | Impact                                                | Probability | Mitigation                                                                                             |
+| ------------------------------------------------------------- | ----------------------------------------------------- | ----------- | ------------------------------------------------------------------------------------------------------ |
+| Agent Engine doesn't support session resume                   | Multi-turn tests degrade to single-shot concatenation | Low         | Detect Engine capability before execution; explicitly annotate execution mode in reports               |
+| Overly strict post_condition matching causing excessive SKIPs | Low evaluation effectiveness                          | Medium      | Provide `must_contain_any` (OR semantics); support regex matching; guide users to use loose conditions |
+| Multi-turn token consumption triggering rate limits           | Evaluation gets throttled                             | Medium      | Add configurable `turn_delay` between turns; documentation recommends limiting turn count              |
+| Session resume failure causing context loss                   | Semantic discontinuity in subsequent turns            | Low         | Mark as ERROR on resume failure with diagnostic info; no silent fallback                               |
+
+## Design Details
+
+### Schema Changes
+
+#### 1. Turn Struct Extension
+
+Existing `Turn` definition (`internal/config/schema.go`):
+
+```go
+type Turn struct {
+    Role          string         `yaml:"role"`
+    Content       string         `yaml:"content"`
+    PostCondition *PostCondition `yaml:"post_condition,omitempty"`
+}
+
+type PostCondition struct {
+    MustContainAny []string `yaml:"must_contain_any,omitempty"`
+    OnFail         string   `yaml:"on_fail,omitempty"`
+}
+```
+
+Extended version:
+
+```go
+// Turn is a single conversation turn in a multi-turn evaluation case.
+type Turn struct {
+    Role           string         `yaml:"role"`                       // user (required)
+    Content        string         `yaml:"content"`                    // prompt text, supports {{variable}} template
+    PostCondition  *PostCondition `yaml:"post_condition,omitempty"`
+    Capture        []CaptureRule  `yaml:"capture,omitempty"`
+    TimeoutSeconds int            `yaml:"timeout_seconds,omitempty"`  // per-turn timeout override
+}
+
+// PostCondition checks the agent response after a turn completes.
+type PostCondition struct {
+    MustContainAny []string `yaml:"must_contain_any,omitempty"` // OR: at least one must match
+    MustContainAll []string `yaml:"must_contain_all,omitempty"` // AND: all must match
+    MustNotContain []string `yaml:"must_not_contain,omitempty"` // NONE: none should match
+    OnFail         string   `yaml:"on_fail,omitempty"`          // skip_remaining | fail (default: fail)
+}
+
+// CaptureRule extracts a value from the agent response for use in subsequent turns.
+type CaptureRule struct {
+    Variable string `yaml:"variable"`           // template variable name (e.g. "plan_id")
+    Pattern  string `yaml:"pattern,omitempty"`   // regex with named group (?P<value>...)
+    JSONPath string `yaml:"jsonpath,omitempty"`  // JSONPath expression (e.g. "$.tool_results[0].id")
+}
+```
+
+#### 2. Rule Extension (Per-Turn Assertions)
+
+Extending the existing `Rule` definition with new assertion types:
+
+```go
+type Rule struct {
+    // Existing fields
+    OutputContains *OutputContainsRule `json:"output_contains,omitempty" yaml:"output_contains,omitempty"`
+    ExitCode       *int                `json:"exit_code,omitempty"       yaml:"exit_code,omitempty"`
+    ToolCalled     *ToolCalledRule     `json:"tool_called,omitempty"     yaml:"tool_called,omitempty"`
+    FilesExist     []string            `json:"files_exist,omitempty"     yaml:"files_exist,omitempty"`
+    FilesNotExist  []string            `json:"files_not_exist,omitempty" yaml:"files_not_exist,omitempty"`
+
+    // New: per-turn assertions
+    TurnResponseContains    *TurnResponseContainsRule    `json:"turn_response_contains,omitempty"     yaml:"turn_response_contains,omitempty"`
+    TurnResponseNotContains *TurnResponseNotContainsRule `json:"turn_response_not_contains,omitempty" yaml:"turn_response_not_contains,omitempty"`
+    ToolCalledInTurn        *ToolCalledInTurnRule        `json:"tool_called_in_turn,omitempty"        yaml:"tool_called_in_turn,omitempty"`
+    ToolNotCalledInTurn     *ToolNotCalledInTurnRule     `json:"tool_not_called_in_turn,omitempty"    yaml:"tool_not_called_in_turn,omitempty"`
+}
+
+// TurnResponseContainsRule checks if a specific turn's response contains expected text.
+type TurnResponseContainsRule struct {
+    Turn        int      `yaml:"turn"`                    // 1-indexed turn number
+    ContainsAll []string `yaml:"contains_all,omitempty"`  // AND semantics
+    ContainsAny []string `yaml:"contains_any,omitempty"`  // OR semantics
+}
+
+// TurnResponseNotContainsRule checks if a specific turn's response does NOT contain text.
+type TurnResponseNotContainsRule struct {
+    Turn        int      `yaml:"turn"`         // 1-indexed turn number
+    NotContains []string `yaml:"not_contains"` // none should match
+}
+
+// ToolCalledInTurnRule checks if a tool was called in a specific turn.
+type ToolCalledInTurnRule struct {
+    Turn int            `yaml:"turn"`
+    Name string         `yaml:"name"`
+    Args map[string]any `yaml:"args,omitempty"`
+}
+
+// ToolNotCalledInTurnRule checks that a tool was NOT called in a specific turn.
+type ToolNotCalledInTurnRule struct {
+    Turn int    `yaml:"turn"`
+    Name string `yaml:"name"`
+}
+```
+
+#### 3. YAML Configuration Example
+
+> For more complete user scenarios, see the [User Scenario Quick Reference](#user-scenario-quick-reference) section above. This example combines most of the new schema fields; `must_not_contain`, `jsonpath` capture, and `tool_not_called_in_turn` are not shown.
+
+```yaml
+id: clarification-and-execute
+title: Skill should ask for clarification when parameters are incomplete
+
+input:
+  turns:
+    # Turn 1: Deliberately provide incomplete parameters; expect Agent to ask for clarification
+    - role: user
+      content: "Deploy the service for me"
+      post_condition:
+        must_contain_any: ["which environment", "which service", "please specify", "need to know"]
+        on_fail: fail
+      capture:
+        - variable: clarification_question
+          pattern: "(?P<value>[^.?]+[?])"
+
+    # Turn 2: After providing parameters, Agent should execute deployment
+    - role: user
+      content: "Deploy order-service to staging"
+      post_condition:
+        must_contain_all: ["order-service", "staging"]
+        on_fail: fail
+      timeout_seconds: 120
+
+    # Turn 3: Confirm deployment result
+    - role: user
+      content: "What's the deployment result?"
+
+judge:
+  type: rule_based
+  success:
+    - turn_response_contains:
+        turn: 1
+        contains_any: ["which", "please specify", "need"]
+    - turn_response_contains:
+        turn: 2
+        contains_any: ["deploy", "staging"]
+    - turn_response_not_contains:
+        turn: 1
+        not_contains: ["deployed", "deploy completed"]
+    - tool_called_in_turn:
+        turn: 2
+        name: deploy
+```
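+
+The fields that the example above leaves out (`must_not_contain`, a `jsonpath` capture, `tool_not_called_in_turn`, and `args` on `tool_called_in_turn`) could be combined as in the sketch below. The case `id`, the `plan_id` variable, and the `environment` argument key are hypothetical, and the sketch assumes the JSONPath expression is evaluated against a JSON view of the turn's session result, which this section does not specify:
+
+```yaml
+id: plan-then-deploy            # hypothetical case
+title: Deployment tool must not run before the plan is approved
+
+input:
+  turns:
+    # Turn 1: Ask for a plan only; the Agent must not deploy yet
+    - role: user
+      content: "Draft a rollout plan for order-service"
+      post_condition:
+        must_contain_all: ["plan", "order-service"]
+        must_not_contain: ["deploy completed"]
+        on_fail: fail
+      capture:
+        - variable: plan_id
+          jsonpath: "$.tool_results[0].id"
+
+    # Turn 2: Approve the captured plan
+    - role: user
+      content: "Approve plan {{plan_id}} and deploy it to staging"
+      timeout_seconds: 180
+
+judge:
+  type: rule_based
+  success:
+    - tool_not_called_in_turn:
+        turn: 1
+        name: deploy
+    - tool_called_in_turn:
+        turn: 2
+        name: deploy
+        args:
+          environment: staging
+```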
+
+### Evaluator Multi-Turn Execution Engine
+
+#### Core Change: `executeCaseOnce` Branching
+
+In `internal/evaluator/evaluator.go`, the `executeCaseOnce` method branches on the input type. Its signature is unchanged; the branch is inserted before the `agent.Run` call:
+
+```go
+func (e *defaultEvaluator) executeCaseOnce(ctx context.Context, caseCfg *config.CaseConfig,
+    configName string, overrideRT runtime.Runtime, overrideAgent agent.Agent) EvalResult {
+
+    // ── Existing code below, unchanged ──
+    // startTime, prompt/turnsTotal, result initialization, runtime preparation, judge config merging ...
+
+    // ── New branch: multi-turn execution path ──
+    if len(caseCfg.Input.Turns) > 1 {
+        return e.executeMultiTurn(ctx, caseCfg, configName, rt, runAgent, judgeCfg, startTime)
+    }
+
+    // ── Existing single-turn execution logic below, completely unchanged ──
+    // messages := buildCaseMessages(caseCfg)
+    // sessionResult, execErr := runAgent.Run(...)
+    // return e.evaluateCaseSession(...)
+}
+```
+
+**Key note**: The modification strategy here is **minimally invasive** — inserting an `if` branch before the `runAgent.Run()` call in the existing `executeCaseOnce` method, taking the multi-turn path only when `input.turns` has more than one element. All existing single-turn logic (environment setup, artifact collection, expect pre-check, judge evaluation, etc.) remains completely unmodified.
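+
+Under the `len(caseCfg.Input.Turns) > 1` condition shown above, the input shape alone decides which path a case takes. The fragments below sketch the three shapes; the prompts are illustrative assumptions:
+
+```yaml
+# Single-turn path: existing `input.prompt` case, unchanged
+input:
+  prompt: "Delete all log files under /tmp/data/"
+---
+# Single-turn path as well: `turns` holds only one element, so the new branch is not taken
+input:
+  turns:
+    - role: user
+      content: "Delete all log files under /tmp/data/"
+---
+# Multi-turn path: two or more turns
+input:
+  turns:
+    - role: user
+      content: "Delete all log files under /tmp/data/"
+    - role: user
+      content: "Yes, confirm deletion"
+```
+
+One consequence worth checking during review: as the snippet is written, a `post_condition` or `capture` attached to a one-element `turns` list would not pass through `executeMultiTurn`.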
+
+#### Multi-Turn Execution Core Logic
+
+```go
+// TurnResult holds the result of a single turn execution.
+type TurnResult struct {
+    TurnNumber    int                       // 1-indexed
+    Content       string                    // the user message sent in this turn
+    Response      string                    // agent response text
+    Transcript    transcript.Transcript     // this turn's transcript
+    SessionResult *agent.SessionResult      // full session result
+    Status        TurnStatus                // completed, skipped, failed, error
+    SkipReason    string                    // populated when status is skipped
+    CapturedVars  map[string]string         // variables captured from this turn
+}
+
+type TurnStatus string
+
+const (
+    TurnCompleted TurnStatus = "completed"
+    TurnSkipped   TurnStatus = "skipped"
+    TurnFailed    TurnStatus = "failed"
+    TurnError     TurnStatus = "error"
+)
+
+func (e *defaultEvaluator) executeMultiTurn(
+    ctx context.Context,
+    caseCfg *config.CaseConfig,
+    configName string,
+    rt runtime.Runtime,
+    runAgent agent.Agent,
+    judgeCfg config.JudgeConfig,
+    startTime time.Time,
+) EvalResult {
+    // Check whether the Agent supports session resume
+    resumer, supportsResume := runAgent.(agent.SessionResumer)
+    if !supportsResume {
+        logging.WarnContextf(ctx, "Agent %s does not implement SessionResumer; "+
+            "falling back to single-shot execution for multi-turn case %s", runAgent.Name(), caseCfg.ID)
+        return e.executeMultiTurnFallback(ctx, caseCfg, configName, rt, runAgent)
+    }
+
+    turnResults := e.executeTurnsSequentially(ctx, caseCfg, rt, runAgent, resumer)
+    return e.finalizeMultiTurnResult(ctx, caseCfg, configName, rt, judgeCfg, turnResults, startTime)
+}
+
+// executeTurnsSequentially runs each turn in sequence, checking post-conditions
+// and capturing values between turns.
+func (e *defaultEvaluator) executeTurnsSequentially(
+    ctx context.Context,
+    caseCfg *config.CaseConfig,
+    rt runtime.Runtime,
+    runAgent agent.Agent,
+    resumer agent.SessionResumer,
+) []TurnResult {
+    turnsTotal := len(caseCfg.Input.Turns)
+    capturedVars := make(map[string]string)
+    turnResults := make([]TurnResult, 0, turnsTotal)
+    var sessionID string
+
+    for i, turn := range caseCfg.Input.Turns {
+        turnNum := i + 1
+
+        // 1. Template variable substitution
+        content := renderTemplate(turn.Content, capturedVars)
+
+        // 2. Build this turn's message
+        message := transcript.Message{
+            Role:    transcript.RoleUser,
+            Content: content,
+            Turn:    turnNum,
+        }
+
+        // 3. Apply the per-turn timeout, if one is set
+        sessionResult, execErr := func() (*agent.SessionResult, error) {
+            turnCtx := ctx
+            if turn.TimeoutSeconds > 0 {
+                var cancel context.CancelFunc
+                turnCtx, cancel = context.WithTimeout(ctx, time.Duration(turn.TimeoutSeconds)*time.Second)
+                defer cancel() // cancel executes when the closure returns, not when the outer function ends
+            }
+
+            // 4. Invoke the Agent: the first turn uses Run to start a new session; subsequent turns use RunTurn to resume
+            if turnNum == 1 {
+                sr, err := runAgent.Run(turnCtx, rt, agent.ExecOptions{},
+                    []transcript.Message{message})
+                if sr != nil {
+                    sessionID = extractSessionID(turnCtx, rt, runAgent, sr)
+                }
+                return sr, err
+            }
+            return resumer.RunTurn(turnCtx, rt, agent.ExecOptions{},
+                message, sessionID)
+        }()
+
+        // 5. Collect this turn's result
+        turnResult := TurnResult{
+            TurnNumber:   turnNum,
+            Content:      content, // record the actual content sent
+            CapturedVars: make(map[string]string),
+        }
+        if sessionResult != nil {
+            turnResult.Response = sessionResult.FinalMessage
+            turnResult.Transcript = sessionResult.Transcript
+            turnResult.SessionResult = sessionResult
+        }
+        if execErr != nil {
+            turnResult.Status = TurnError
+            turnResult.SkipReason = execErr.Error()
+            turnResults = append(turnResults, turnResult)
+            return turnResults // Execution error, terminate subsequent turns
+        }
+        turnResult.Status = TurnCompleted
+
+        // 6. Execute post_condition check
+        if turn.PostCondition != nil {
+            passed, reason := checkPostCondition(turn.PostCondition, turnResult.Response)
+            if !passed {
+                if turn.PostCondition.OnFail == "skip_remaining" {
+                    turnResult.Status = TurnSkipped
+                    turnResult.SkipReason = reason
+                    turnResults = append(turnResults, turnResult)
+                    // Mark subsequent turns as skipped
+                    for j := turnNum; j < turnsTotal; j++ {
+                        turnResults = append(turnResults, TurnResult{
+                            TurnNumber: j + 1,
+                            Status:     TurnSkipped,
+                            SkipReason: fmt.Sprintf("skipped: turn %d post_condition failed", turnNum),
+                        })
+                    }
+                    return turnResults
+                }
+                // default: "fail"
+                turnResult.Status = TurnFailed
+                turnResult.SkipReason = reason
+                turnResults = append(turnResults, turnResult)
+                return turnResults
+            }
+        }
+
+        // 7. Execute capture
+        // The loop variable is named rule (not cap) to avoid shadowing the builtin cap.
+        for _, rule := range turn.Capture {
+            value := extractCapturedValue(rule, turnResult.Response, sessionResult)
+            if value != "" {
+                capturedVars[rule.Variable] = value
+                turnResult.CapturedVars[rule.Variable] = value
+            }
+        }
+
+        turnResults = append(turnResults, turnResult)
+    }
+    return turnResults
+}
+
+// finalizeMultiTurnResult constructs the EvalResult from turn results and runs the judge.
+func (e *defaultEvaluator) finalizeMultiTurnResult(
+    ctx context.Context,
+    caseCfg *config.CaseConfig,
+    configName string,
+    rt runtime.Runtime,
+    judgeCfg config.JudgeConfig,
+    turnResults []TurnResult,
+    startTime time.Time,
+) EvalResult {
+    turnsTotal := len(caseCfg.Input.Turns)
+    turnsExecuted := countExecutedTurns(turnResults)
+
+    // Merge transcripts from all turns
+    var fullTranscript transcript.Transcript
+    var lastSessionResult *agent.SessionResult
+    for _, tr := range turnResults {
+        fullTranscript = append(fullTranscript, tr.Transcript...)
+        if tr.SessionResult != nil {
+            lastSessionResult = tr.SessionResult
+        }
+    }
+
+    result := EvalResult{
+        CaseID:        caseCfg.ID,
+        CaseName:      caseCfg.Title,
+        Prompt:        caseCfg.Input.Turns[0].Content,
+        SessionResult: lastSessionResult,
+        TurnsTotal:    turnsTotal,
+        Configuration: configName,
+    }
+    if result.SessionResult == nil {
+        result.SessionResult = &agent.SessionResult{}
+    }
+    result.SessionResult.Transcript = fullTranscript
+    result.SessionResult.Turns = turnsExecuted
+
+    // Check if any turn failed or all were skipped
+    if hasFailedTurn(turnResults) {
+        result.Status = judge.StatusFail
+        return result
+    }
+    if allSkipped(turnResults) {
+        result.Status = judge.StatusSkip
+        return result
+    }
+
+    // Execute Judge evaluation (reuses the existing evaluateCaseSession flow)
+    //
+    // The only difference between multi-turn and single-turn execution is that
+    // judgeInput carries TurnResults, enabling per-turn assertions
+    // (turn_response_contains, etc.) to work.
+    // The rest of the flow (expect pre-check → judge → grading) is identical.
+    judgeInput := judge.Input{
+        CaseID:         caseCfg.ID,
+        Transcript:     fullTranscript,
+        FinalMessage:   lastFinalMessage(turnResults),
+        ExitCode:       lastExitCode(turnResults),
+        WorkspacePath:  rt.Workspace(),
+        SkillDir:       e.skillDir,
+        TurnsExecuted:  turnsExecuted,
+        TurnsTotal:     turnsTotal,
+        TurnResults:    toJudgeTurnResults(turnResults),
+        WorkspaceDiff:  sessionWorkspaceDiff(lastSessionResult),
+        GeneratedFiles: sessionGeneratedFiles(lastSessionResult),
+        SessionResult:  lastSessionResult,
+    }
+
+    if failed := e.runExpectPreCheck(ctx, caseCfg, configName, judgeInput, turnsTotal, &result); failed {
+        return result
+    }
+
+    var expectAssertions []judge.AssertionResult
+    if result.ExpectResult != nil {
+        expectAssertions = result.ExpectResult.ToAssertionResults()
+    }
+
+    finalResult := e.runJudgePhase(ctx, rt, caseCfg, configName, judgeCfg, turnsTotal, nil, judgeInput, &result)
+    if len(expectAssertions) > 0 && finalResult.Grading != nil {
+        finalResult.Grading.AssertionResults = append(expectAssertions, finalResult.Grading.AssertionResults...)
+        finalResult.Grading.Summary.Passed += len(expectAssertions)
+        finalResult.Grading.Summary.Total += len(expectAssertions)
+        if finalResult.Grading.Summary.Total > 0 {
+            finalResult.Grading.Summary.PassRate = float64(finalResult.Grading.Summary.Passed) / float64(finalResult.Grading.Summary.Total)
+        }
+    }
+
+    return finalResult
+}
+```
+
+#### post_condition Check Implementation
+
+```go
+// checkPostCondition evaluates a post-condition against the agent response.
+// Returns (passed bool, reason string).
+func checkPostCondition(pc *config.PostCondition, response string) (bool, string) {
+    lower := strings.ToLower(response)
+
+    // must_contain_all: all must match
+    for _, keyword := range pc.MustContainAll {
+        if !strings.Contains(lower, strings.ToLower(keyword)) {
+            return false, fmt.Sprintf("response missing required keyword: %q", keyword)
+        }
+    }
+
+    // must_contain_any: at least one must match
+    if len(pc.MustContainAny) > 0 {
+        found := false
+        for _, keyword := range pc.MustContainAny {
+            if strings.Contains(lower, strings.ToLower(keyword)) {
+                found = true
+                break
+            }
+        }
+        if !found {
+            return false, fmt.Sprintf("response missing any of: %v", pc.MustContainAny)
+        }
+    }
+
+    // must_not_contain: none should match
+    for _, keyword := range pc.MustNotContain {
+        if strings.Contains(lower, strings.ToLower(keyword)) {
+            return false, fmt.Sprintf("response unexpectedly contains: %q", keyword)
+        }
+    }
+
+    return true, ""
+}
+```
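+
+The matching semantics above can be exercised with a small standalone program. This is a minimal sketch rather than the project's test suite: `PostCondition` is a local stand-in that assumes `config.PostCondition` exposes the three keyword lists used above, `check` and `containsFold` are hypothetical names, and the sample responses are invented for illustration.
+
+```go
+package main
+
+import (
+    "fmt"
+    "strings"
+)
+
+// PostCondition is a local stand-in for config.PostCondition (assumed shape).
+type PostCondition struct {
+    MustContainAll []string
+    MustContainAny []string
+    MustNotContain []string
+}
+
+// containsFold reports whether s contains substr, ignoring case.
+func containsFold(s, substr string) bool {
+    return strings.Contains(strings.ToLower(s), strings.ToLower(substr))
+}
+
+// check mirrors checkPostCondition: all / any / none, case-insensitive.
+func check(pc PostCondition, response string) (bool, string) {
+    for _, kw := range pc.MustContainAll {
+        if !containsFold(response, kw) {
+            return false, fmt.Sprintf("response missing required keyword: %q", kw)
+        }
+    }
+    if len(pc.MustContainAny) > 0 {
+        found := false
+        for _, kw := range pc.MustContainAny {
+            if containsFold(response, kw) {
+                found = true
+                break
+            }
+        }
+        if !found {
+            return false, fmt.Sprintf("response missing any of: %v", pc.MustContainAny)
+        }
+    }
+    for _, kw := range pc.MustNotContain {
+        if containsFold(response, kw) {
+            return false, fmt.Sprintf("response unexpectedly contains: %q", kw)
+        }
+    }
+    return true, ""
+}
+
+func main() {
+    // A confirmation gate: the agent must ask before deleting anything.
+    gate := PostCondition{
+        MustContainAny: []string{"confirm", "are you sure"},
+        MustNotContain: []string{"deleted"},
+    }
+    for _, resp := range []string{
+        "Please CONFIRM that you want to remove 3 files.",
+        "Done, I deleted 3 files.",
+    } {
+        ok, reason := check(gate, resp)
+        fmt.Printf("passed=%v reason=%q\n", ok, reason)
+    }
+}
+```
+
+Because matching is a case-insensitive substring test, a keyword such as `deleted` also matches inside a longer word such as `undeleted`, so keywords should be chosen with that in mind.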
+
+#### Template Rendering Implementation
+
+```go
+// renderTemplate replaces {{variable}} placeholders in content with captured values.
+// Uses simple string replacement rather than text/template to avoid complexity
+// and security risks (no function calls, no control flow).
+func renderTemplate(content string, vars map[string]string) string {
+    result := content
+    for name, value := range vars {
+        result = strings.ReplaceAll(result, "{{"+name+"}}", value)
+    }
+    return result
+}
+```
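+
+One property of the loop above is worth noting: it rescans the content once per variable, so when a captured value itself contains text that looks like `{{other}}`, whether that text gets expanded depends on Go's unspecified map iteration order. The sketch below shows one possible way to avoid this, using `strings.NewReplacer` from the standard library, which substitutes in a single pass and never rescans replaced text. `renderTemplateSinglePass` is a hypothetical name and the variable names are invented; this is an illustration under those assumptions, not the proposal's chosen implementation.
+
+```go
+package main
+
+import (
+    "fmt"
+    "strings"
+)
+
+// renderTemplateSinglePass is a hypothetical variant of renderTemplate.
+// strings.Replacer scans the input once and never rescans replaced text,
+// so a captured value that itself looks like {{name}} is inserted verbatim.
+func renderTemplateSinglePass(content string, vars map[string]string) string {
+    pairs := make([]string, 0, len(vars)*2)
+    for name, value := range vars {
+        pairs = append(pairs, "{{"+name+"}}", value)
+    }
+    return strings.NewReplacer(pairs...).Replace(content)
+}
+
+func main() {
+    vars := map[string]string{
+        "order_id": "A-1042",
+        "note":     "see {{order_id}}", // placeholder-like text inside a captured value
+    }
+    // Unknown placeholders such as {{customer}} are left untouched.
+    fmt.Println(renderTemplateSinglePass("Cancel {{order_id}} ({{note}}) for {{customer}}.", vars))
+}
+```
+
+Since captured values originate from agent output, inserting them verbatim also keeps a response from injecting placeholders into later turns.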
+
+#### Capture Value Extraction Implementation
+
+```go
+// extractCapturedValue extracts a value from the agent response using the configured method.
+// Returns the extracted value, or empty string if extraction fails.
+func extractCapturedValue(rule config.CaptureRule, response string, sr *agent.SessionResult) string {
+    // Prefer regex extraction
+    if rule.Pattern != "" {
+        return extractByRegex(rule.Pattern, response)
+    }
+    // JSONPath extraction: structure TurnResult as JSON then query
+    if rule.JSONPath != "" && sr != nil {
+        return extractByJSONPath(rule.JSONPath, response, sr)
+    }
+    return ""
+}
+
+// extractByRegex extracts a value using a regex with a named group (?P<value>...).
+func extractByRegex(pattern, text string) string {
+    re, err := regexp.Compile(pattern)
+    if err != nil {
+        return ""
+    }
+    match := re.FindStringSubmatch(text)
+    if match == nil {
+        return ""
+    }
+    // Look for the named group "value"
+    for i, name := range re.SubexpNames() {
+        if name == "value" && i < len(match) {
+            return match[i]
+        }
+    }
+    // Fallback: return the first capturing group
+    if len(match) > 1 {
+        return match[1]
+    }
+    return ""
+}
+
+// extractByJSONPath extracts a value from the session result using a JSONPath expression.
+// The root object $ is a JSON representation of the turn result:
+//   {
+//     "response": "...",
+//     "transcript": { "tool_calls": [...], "tool_results": [...] }
+//   }
+func extractByJSONPath(path, response string, sr *agent.SessionResult) string {
+    // Build a queryable JSON object from the turn data
+    turnData := map[string]any{
+        "response": response,
+        "transcript": map[string]any{
+            "tool_calls":   transcriptToolCalls(sr.Transcript),
+            "tool_results": transcriptToolResults(sr.Transcript),
+        },
+    }
+    jsonBytes, err := json.Marshal(turnData)
+    if err != nil {
+        return ""
+    }
+
+    var data any
+    if err := json.Unmarshal(jsonBytes, &data); err != nil {
+        return ""
+    }
+
+    // Uses the github.com/PaesslerAG/jsonpath library to query JSON data.
+    // Requires adding a new dependency in go.mod: go get github.com/PaesslerAG/jsonpath
+    // import "github.com/PaesslerAG/jsonpath"
+    result, err := jsonpath.Get(path, data)
+    if err != nil {
+        return ""
+    }
+    return fmt.Sprintf("%v", result)
+}
+
+// transcriptToolCalls extracts tool call info from a transcript.
+func transcriptToolCalls(tr transcript.Transcript) []map[string]any {
+    var calls []map[string]any
+    for _, msg := range tr {
+        if msg.Role == transcript.RoleToolCall && msg.ToolCall != nil {
+            calls = append(calls, map[string]any{
+                "id":        msg.ToolCall.ID,
+                "name":      msg.ToolCall.Name,
+                "arguments": msg.ToolCall.Arguments,
+            })
+        }
+    }
+    return calls
+}
+
+// transcriptToolResults extracts tool result info from a transcript.
+func transcriptToolResults(tr transcript.Transcript) []map[string]any {
+    var results []map[string]any
+    for _, msg := range tr {
+        if msg.Role == transcript.RoleToolResult && msg.ToolResult != nil {
+            results = append(results, map[string]any{
+                "call_id": msg.ToolResult.CallID,
+                "status":  msg.ToolResult.Status,
+                "content": msg.ToolResult.Content,
+            })
+        }
+    }
+    return results
+}
+```
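+
+To make the regex capture rules concrete, the following minimal sketch reproduces the lookup order of `extractByRegex` (named group `value` first, then the first capturing group) on an invented response. `captureValue` is a hypothetical name, and `Regexp.SubexpIndex` requires Go 1.15 or later. Patterns use Go's RE2 syntax, so lookarounds and backreferences are not available.
+
+```go
+package main
+
+import (
+    "fmt"
+    "regexp"
+)
+
+// captureValue follows the same lookup order as extractByRegex:
+// the named group "value" first, then the first capturing group.
+func captureValue(pattern, text string) string {
+    re, err := regexp.Compile(pattern)
+    if err != nil {
+        return "" // invalid pattern: treated as "nothing captured"
+    }
+    match := re.FindStringSubmatch(text)
+    if match == nil {
+        return ""
+    }
+    if i := re.SubexpIndex("value"); i >= 0 {
+        return match[i]
+    }
+    if len(match) > 1 {
+        return match[1]
+    }
+    return ""
+}
+
+func main() {
+    response := "Created order A-1042 for customer 77."
+
+    fmt.Println(captureValue(`order (?P<value>[A-Z]-\d+)`, response)) // named group
+    fmt.Println(captureValue(`customer (\d+)`, response))              // first-group fallback
+    fmt.Println(captureValue(`ticket (\d+)`, response) == "")          // no match yields ""
+}
+```
+
+An empty result means the variable is not set, so a later `{{variable}}` reference stays unrendered in the next turn's content.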
+
+#### Helper Function Implementation
+
+```go
+// countExecutedTurns counts the number of turns that were actually executed (not skipped).
+func countExecutedTurns(turnResults []TurnResult) int {
+    count := 0
+    for _, tr := range turnResults {
+        if tr.Status == TurnCompleted || tr.Status == TurnFailed || tr.Status == TurnError {
+            count++
+        }
+    }
+    return count
+}
+
+// hasFailedTurn returns true if any turn has TurnFailed status.
+func hasFailedTurn(turnResults []TurnResult) bool {
+    for _, tr := range turnResults {
+        if tr.Status == TurnFailed {
+            return true
+        }
+    }
+    return false
+}
+
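+// allSkipped reports whether no turn completed. Callers check hasFailedTurn first,
+// so this covers runs in which every turn was skipped or execution stopped on an error.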
+func allSkipped(turnResults []TurnResult) bool {
+    for _, tr := range turnResults {
+        if tr.Status == TurnCompleted {
+            return false
+        }
+    }
+    return true
+}
+
+// lastFinalMessage returns the FinalMessage from the last completed turn.
+func lastFinalMessage(turnResults []TurnResult) string {
+    for i := len(turnResults) - 1; i >= 0; i-- {
+        if turnResults[i].Status == TurnCompleted && turnResults[i].Response != "" {
+            return turnResults[i].Response
+        }
+    }
+    return ""
+}
+
+// lastExitCode returns the ExitCode from the last turn that has a SessionResult.
+func lastExitCode(turnResults []TurnResult) int {
+    for i := len(turnResults) - 1; i >= 0; i-- {
+        if turnResults[i].SessionResult != nil {
+            return turnResults[i].SessionResult.ExitCode
+        }
+    }
+    return 0
+}
+
+// toJudgeTurnResults converts evaluator TurnResults to judge-visible TurnResults.
+func toJudgeTurnResults(turnResults []TurnResult) []judge.TurnResult {
+    results := make([]judge.TurnResult, len(turnResults))
+    for i, tr := range turnResults {
+        results[i] = judge.TurnResult{
+            TurnNumber: tr.TurnNumber,
+            Content:    tr.Content, // user message sent in this turn
+            Response:   tr.Response,
+            Transcript: tr.Transcript,
+            Status:     string(tr.Status),
+        }
+    }
+    return results
+}
+```
+
+### Agent Interface Extension
+
+#### New Optional `SessionResumer` Interface
+
+In `internal/agent/agent.go`, a new optional interface is added **without modifying the existing `Agent` interface**. Because Go interfaces are satisfied implicitly, an Agent implementation opts in simply by defining `RunTurn`; the evaluator detects the capability with a type assertion:
+
+```go
+// SessionResumer is an optional interface that Agent implementations may satisfy
+// to support multi-turn session resume. The evaluator checks for this interface
+// via type assertion before attempting multi-turn execution.
+type SessionResumer interface {
+    // RunTurn resumes an existing session and sends a follow-up message.
+    // sessionID is the session identifier returned by the initial Run call.
+    RunTurn(ctx context.Context, rt Runtime, opts ExecOptions, message transcript.Message, sessionID string) (*SessionResult, error)
+}
+```
+
+Capability check in the evaluator:
+
+```go
+resumer, supportsResume := runAgent.(agent.SessionResumer)
+if !supportsResume && len(caseCfg.Input.Turns) > 1 {
+    // Fall back to single-shot concatenation mode
+    logging.WarnContextf(ctx, "Agent %s does not implement SessionResumer; "+
+        "falling back to single-shot execution for multi-turn case %s", runAgent.Name(), caseCfg.ID)
+}
+```
+
+Advantages of this design:
+- **Zero breakage**: The existing `Agent` interface is unchanged; all existing implementations compile without modification
+- **Incremental adoption**: Only Agents that implement `SessionResumer` take the multi-turn path
+- **Idiomatic Go**: Consistent with optional interface patterns in the standard library (e.g., `io.WriterTo`, `io.ReaderFrom`, `http.Flusher`)
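+
+The opt-in behavior described above can be seen in isolation with a stripped-down sketch. The `Agent` and `SessionResumer` types here are simplified stand-ins (the real `RunTurn` takes a context, runtime, options, and a `transcript.Message`), and `oneShotAgent`, `resumableAgent`, `executionMode`, and the `multi_turn` label are hypothetical names used only for illustration.
+
+```go
+package main
+
+import "fmt"
+
+// Agent is a trimmed-down stand-in for the existing interface.
+type Agent interface {
+    Name() string
+}
+
+// SessionResumer is a trimmed-down stand-in for the optional interface.
+type SessionResumer interface {
+    RunTurn(message, sessionID string) (string, error)
+}
+
+// oneShotAgent implements only Agent, like an existing implementation.
+type oneShotAgent struct{}
+
+func (oneShotAgent) Name() string { return "one-shot" }
+
+// resumableAgent opts in by defining RunTurn; no declaration is needed.
+type resumableAgent struct{}
+
+func (resumableAgent) Name() string { return "resumable" }
+
+func (resumableAgent) RunTurn(message, sessionID string) (string, error) {
+    return fmt.Sprintf("[%s] %s", sessionID, message), nil
+}
+
+// executionMode reports which path the evaluator would take for a multi-turn case.
+func executionMode(a Agent) string {
+    if _, ok := a.(SessionResumer); ok {
+        return "multi_turn"
+    }
+    return "single_shot_fallback"
+}
+
+func main() {
+    for _, a := range []Agent{oneShotAgent{}, resumableAgent{}} {
+        fmt.Printf("%s -> %s\n", a.Name(), executionMode(a))
+    }
+}
+```
+
+Neither stand-in agent references `SessionResumer` by name, which is why existing Agent implementations keep compiling unchanged.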
+
+#### Claude Code Implementation
+
+The Claude Code CLI already supports `--resume <session-id>` combined with `-p` (print mode) for programmatic session resume. In the current codebase, `buildClaudePrintCmd` already receives a `sessionID` parameter (generated via `uuid.New()`), and the `claudePrintJSONResult` struct that models the JSON output contains a `SessionID` field.
+
+```go
+// ClaudeCodeAgent implements the SessionResumer interface.
+
+// RunTurn resumes a claude-code session with a follow-up message.
+func (a *ClaudeCodeAgent) RunTurn(ctx context.Context, rt Runtime, opts ExecOptions,
+    message transcript.Message, sessionID string) (*SessionResult, error) {
+    start := time.Now()
+
+    envVars := a.credentialEnvVars(credential.EnvAnthropicAPIKey, credential.EnvAnthropicBaseURL)
+    envVars["CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC"] = "1"
+    envVars["IS_SANDBOX"] = "1"
+    opts = a.mergeExecOptionsEnv(ctx, opts, envVars, nil)
+
+    instruction := message.Content
+    cmd := nodeRuntimeCommandWithGuard("claude",
+        buildClaudeResumePrintCmd(sessionID, a.effectiveModelName(ctx), instruction))
+
+    result, err := rt.Exec(ctx, cmd, opts)
+    sessionResult := a.buildSessionResult(ctx, rt, opts, instruction, start, result)
+
+    // Auth failure check (same logic as in the Run method)
+    if authMsg, ok := providerAuthFailureSignal(result, sessionResult); ok {
+        if sessionResult != nil && sessionResult.ExitCode == 0 {
+            sessionResult.ExitCode = 1
+        }
+        return sessionResult, fmt.Errorf("claude-code authentication failed: %s", authMsg)
+    }
+    // Rate limit check (same logic as in the Run method)
+    if rateLimitMsg, ok := providerRateLimitSignal(result, sessionResult); ok {
+        if sessionResult != nil && sessionResult.ExitCode == 0 {
+            sessionResult.ExitCode = 1
+        }
+        return sessionResult, fmt.Errorf("claude-code provider rate limit: %s", rateLimitMsg)
+    }
+    if err != nil {
+        if sessionResult == nil {
+            sessionResult = &SessionResult{
+                Engine:     a.Name(),
+                ExitCode:   1,
+                DurationMs: time.Since(start).Milliseconds(),
+                Stderr:     result.Stderr,
+                Artifacts:  &SessionArtifacts{},
+            }
+        }
+        return sessionResult, fmt.Errorf("claude-code resume failed: %w", err)
+    }
+    if result.ExitCode != 0 {
+        return sessionResult, fmt.Errorf("claude-code resume failed (exit %d): %s", result.ExitCode, result.Stderr)
+    }
+    return sessionResult, nil
+}
+
+// buildClaudeResumePrintCmd builds the claude command with --resume flag.
+// The claude code CLI's --resume parameter directly takes the session ID value (no --session-id needed).
+func buildClaudeResumePrintCmd(sessionID, model, instruction string) string {
+    cmd := "claude --settings " + shellQuote(`{"disableAllHooks":true}`) +
+        " --resume " + shellQuote(sessionID) +
+        " -p --permission-mode=bypassPermissions"
+    if model != "" {
+        cmd += " --model " + shellQuote(model)
+    }
+    cmd += " " + shellQuote(instruction)
+    return cmd
+}
+```
+
+**Compile-time assertion** (ensures `ClaudeCodeAgent` implements `SessionResumer`):
+
+```go
+var _ SessionResumer = (*ClaudeCodeAgent)(nil)
+```
+
+#### Codex Implementation
+
+The Codex CLI supports `codex exec resume <SESSION_ID>` for non-interactive session resume (confirmed by the [official docs](https://developers.openai.com/codex/cli/features)). Sessions are stored under the `~/.codex/sessions/` directory.
+
+```go
+// CodexAgent implements the SessionResumer interface.
+
+// RunTurn resumes a codex session with a follow-up message.
+func (a *CodexAgent) RunTurn(ctx context.Context, rt Runtime, opts ExecOptions,
+    message transcript.Message, sessionID string) (*SessionResult, error) {
+    start := time.Now()
+
+    instruction := message.Content
+    sandboxFlag := codexBypassSandbox
+    if rt.RequiresProcessSandbox() {
+        sandboxFlag = codexProcessSandbox
+    }
+    lastMessagePath := filepath.Join(rt.Workspace(), ".skill-up", "codex-last-message.txt")
+
+    // codex exec resume <SESSION_ID> continues an existing session
+    cmd := "mkdir -p " + shellQuote(filepath.Dir(lastMessagePath)) + "\n" +
+        nodeRuntimeCommandWithGuard("codex",
+            buildCodexResumeCmd(sessionID, instruction, a.effectiveModelName(ctx),
+                a.runProviderConfig(ctx), sandboxFlag, lastMessagePath))
+
+    envVars := a.credentialEnvVars(credential.EnvOpenAIAPIKey, credential.EnvOpenAIBaseURL)
+    opts = a.mergeExecOptionsEnv(ctx, opts, envVars, a.buildAgentObservabilityAttrs(nil))
+    ctx = observability.ContextWithConfiguredAgentSpanAttributes(ctx, opts.Env)
+
+    result, err := rt.Exec(ctx, cmd, opts)
+    sessionResult := a.buildSessionResult(ctx, rt, opts, instruction, start, result, lastMessagePath)
+    if err != nil {
+        if sessionResult == nil {
+            sessionResult = &SessionResult{
+                Engine:     a.Name(),
+                ExitCode:   1,
+                DurationMs: time.Since(start).Milliseconds(),
+                Stderr:     result.Stderr,
+                Artifacts:  &SessionArtifacts{},
+            }
+        }
+        return sessionResult, fmt.Errorf("codex resume failed: %w", err)
+    }
+    return sessionResult, nil
+}
+
+// buildCodexResumeCmd builds the codex command for resuming a session.
+func buildCodexResumeCmd(sessionID, instruction, model string, provider codexProviderConfig,
+    sandboxFlag, lastMessagePath string) string {
+    cmd := "codex exec resume " + shellQuote(sessionID) + " --json --skip-git-repo-check"
+    if sandboxFlag != "" {
+        cmd += " " + sandboxFlag
+    }
+    cmd += codexProviderFlags(provider)
+    if model != "" {
+        cmd += " -m " + shellQuote(model)
+    }
+    if lastMessagePath != "" {
+        cmd += " --output-last-message " + shellQuote(lastMessagePath)
+    }
+    cmd += " " + shellQuote(instruction)
+    return cmd
+}
+
+var _ SessionResumer = (*CodexAgent)(nil)
+```
+
+**Codex Session ID Extraction**: Unlike claude_code, codex does not return a session_id in its JSON output. Instead, the session ID is read from the filename of the most recently modified session file under `~/.codex/sessions/`:
+
+```go
+// extractCodexSessionID extracts the session ID from a codex SessionResult.
+// Codex stores sessions under ~/.codex/sessions/<session-id>.jsonl.
+// We find the most recently modified session file after the initial Run call.
+func extractCodexSessionID(ctx context.Context, rt Runtime) string {
+    cmd := "ls -t ~/.codex/sessions/*.jsonl 2>/dev/null | head -1 | xargs -I{} basename {} .jsonl"
+    result, err := rt.Exec(ctx, cmd, ExecOptions{})
+    if err != nil || result.ExitCode != 0 {
+        return ""
+    }
+    return strings.TrimSpace(result.Stdout)
+}
+```
+
+The evaluator's `extractSessionID` helper therefore dispatches on the Agent type:
+
+```go
+func extractSessionID(ctx context.Context, rt runtime.Runtime, runAgent agent.Agent, sr *agent.SessionResult) string {
+    if sr == nil {
+        return ""
+    }
+    // claude_code: session ID is stored in SessionResult.SessionID
+    if sr.SessionID != "" {
+        return sr.SessionID
+    }
+    // codex: extract from session filesystem
+    if runAgent.Name() == "codex" {
+        return extractCodexSessionID(ctx, rt)
+    }
+    return ""
+}
+```
+
+#### Session ID Extraction
+
+In the current claude_code implementation, the `claudePrintJSONResult` struct already contains a `SessionID` field (`json:"session_id"`), but this value is not stored in `SessionResult`. A new field needs to be added to `SessionResult`:
+
+```go
+// SessionResult with new SessionID field:
+type SessionResult struct {
+    // Existing fields
+    Engine       string                `json:"engine,omitempty"`
+    Model        string                `json:"model,omitempty"`
+    ExitCode     int                   `json:"exit_code"`
+    DurationMs   int64                 `json:"duration_ms"`
+    Turns        int                   `json:"turns"`
+    InputTokens  int                   `json:"input_tokens,omitempty"`
+    OutputTokens int                   `json:"output_tokens,omitempty"`
+    FinalMessage string                `json:"final_message,omitempty"`
+    Stderr       string                `json:"stderr,omitempty"`
+    Transcript   transcript.Transcript `json:"transcript,omitempty"`
+    Artifacts    *SessionArtifacts     `json:"artifacts,omitempty"`
+
+    // New field
+    // SessionID is the agent session identifier, used for multi-turn resume.
+    // Populated by agents whose output reports a session ID (e.g. claude_code);
+    // for codex the evaluator resolves the ID from the session files instead.
+    SessionID string `json:"session_id,omitempty"`
+}
+```
+
+In claude_code's `buildClaudePrintJSONSessionResult` and stream-json parsing logic, assign `payload.SessionID` to `SessionResult.SessionID`:
+
+```go
+// Add in buildClaudePrintJSONSessionResult:
+sessionResult.SessionID = payload.SessionID
+
+// Add in parseStreamOutput's result event handler:
+if payload.SessionID != "" {
+    sessionResult.SessionID = payload.SessionID
+}
+```
+
+The evaluator obtains the session ID through a single `extractSessionID` helper that dispatches on the Agent type (see the `extractSessionID` definition in the Codex implementation section above).
+
+#### Fallback Strategy
+
+The fallback logic is built into `executeMultiTurn` (via the `agent.SessionResumer` type assertion); see the `executeMultiTurnFallback` method below.
+
+**Fallback mode behavior**:
+- Concatenates all turns into a single instruction and sends it to the Agent in one shot (existing behavior)
+- Annotates the evaluation result with `execution_mode: "single_shot_fallback"`
+- Neither `post_condition` nor `capture` is executed (since there are no per-turn results)
+- Per-turn Judge assertions (e.g., `turn_response_contains`) return FAIL because `TurnResults` is empty
+- The report includes a warning suggesting that the user switch to an Agent that supports session resume
+
+```go
+// executeMultiTurnFallback is called when the Agent does not implement SessionResumer.
+// It concatenates all turns into a single instruction for one-shot execution (the existing behavior).
+func (e *defaultEvaluator) executeMultiTurnFallback(
+    ctx context.Context,
+    caseCfg *config.CaseConfig,
+    configName string,
+    rt runtime.Runtime,
+    runAgent agent.Agent,
+) EvalResult {
+    // Directly reuse the existing executeCaseOnce flow.
+    //
+    // caseCfg.Input.Turns already exists; buildCaseMessages() concatenates them into messages,
+    // then BuildInstructionFromMessages() merges them into a single instruction — this is the existing behavior.
+    // executeCaseOnce internally contains the complete:
+    //   - tracing span management (agentSpan.End())
+    //   - artifact collection (finalizeArtifacts, ensureArtifactsInOutputDir)
+    //   - session result normalization (normalizeSessionResult)
+    //   - execution error handling (handleExecutionResult: timeout, non-zero exit code, etc.)
+    //   - expect pre-check + judge evaluation
+    //
+    // Note: In fallback mode, neither post_condition nor capture is executed (no per-turn results).
+    // Per-turn Judge assertions (turn_response_contains, etc.) return FAIL because TurnResults is empty.
+    logging.WarnContextf(ctx, "Evaluator: multi-turn case %s running in single-shot fallback mode", caseCfg.ID)
+    return e.executeCaseOnce(ctx, caseCfg, configName, rt, runAgent)
+}
+```
+
+### Judge Per-Turn Assertions
+
+#### Passing TurnResults to the Judge
+
+Add the following to `Input` in `internal/judge/judge.go`:
+
+```go
+type Input struct {
+    // Existing fields
+    CaseID         string
+    Transcript     transcript.Transcript
+    FinalMessage   string
+    ExitCode       int
+    WorkspacePath  string
+    SkillDir       string
+    WorkspaceDiff  string
+    GeneratedFiles []string
+    ArtifactDir    string
+    SessionResult  *agent.SessionResult
+    TurnsExecuted  int
+    TurnsTotal     int
+
+    // New field
+    // TurnResults holds per-turn execution results for multi-turn cases.
+    // Empty for single-turn cases.
+    TurnResults []TurnResult `json:"turn_results,omitempty"`
+}
+
+// TurnResult is the judge-visible representation of a single turn's execution.
+type TurnResult struct {
+    TurnNumber int                   `json:"turn_number"` // 1-indexed
+    Content    string                `json:"content"`     // the user message sent in this turn
+    Response   string                `json:"response"`
+    Transcript transcript.Transcript `json:"transcript"`
+    Status     string                `json:"status"`      // completed, skipped, failed, error
+}
+```
+
+#### rule_based Assertion Implementation
+
+Add new case branches in `evaluateAssertion` in `internal/judge/rule_based.go`:
+
+```go
+func evaluateAssertion(rule config.Rule, in Input) AssertionResult {
+    switch {
+    // Existing cases
+    case rule.OutputContains != nil:
+        return evalOutputContains(rule.OutputContains, in.FinalMessage)
+    case rule.ExitCode != nil:
+        return evalExitCode(*rule.ExitCode, in.ExitCode)
+    case rule.ToolCalled != nil:
+        return evalToolCalled(rule.ToolCalled, in.Transcript)
+    case len(rule.FilesExist) > 0:
+        return evalFilesExist(rule.FilesExist, in.WorkspacePath)
+    case len(rule.FilesNotExist) > 0:
+        return evalFilesNotExist(rule.FilesNotExist, in.WorkspacePath)
+
+    // New: per-turn assertions
+    case rule.TurnResponseContains != nil:
+        return evalTurnResponseContains(rule.TurnResponseContains, in.TurnResults)
+
+    case rule.TurnResponseNotContains != nil:
+        return evalTurnResponseNotContains(rule.TurnResponseNotContains, in.TurnResults)
+
+    case rule.ToolCalledInTurn != nil:
+        return evalToolCalledInTurn(rule.ToolCalledInTurn, in.TurnResults)
+
+    case rule.ToolNotCalledInTurn != nil:
+        return evalToolNotCalledInTurn(rule.ToolNotCalledInTurn, in.TurnResults)
+
+    default:
+        return AssertionResult{Text: "unknown rule", Passed: false, Evidence: "unrecognized assertion type"}
+    }
+}
+
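+// evalTurnResponseContains checks that a specific turn's response contains the required text
+// (contains_all: every keyword; contains_any: at least one), ignoring case.
+func evalTurnResponseContains(rule *config.TurnResponseContainsRule, turnResults []TurnResult) AssertionResult {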
+    turnIdx := rule.Turn - 1
+    if turnIdx < 0 || turnIdx >= len(turnResults) {
+        return AssertionResult{
+            Text:     fmt.Sprintf("turn_response_contains(turn=%d)", rule.Turn),
+            Passed:   false,
+            Evidence: fmt.Sprintf("turn %d not found (total turns: %d)", rule.Turn, len(turnResults)),
+        }
+    }
+
+    tr := turnResults[turnIdx]
+    if tr.Status != "completed" {
+        return AssertionResult{
+            Text:     fmt.Sprintf("turn_response_contains(turn=%d)", rule.Turn),
+            Passed:   false,
+            Evidence: fmt.Sprintf("turn %d was %s, not completed", rule.Turn, tr.Status),
+        }
+    }
+
+    response := strings.ToLower(tr.Response)
+
+    // contains_all: AND semantics
+    for _, keyword := range rule.ContainsAll {
+        if !strings.Contains(response, strings.ToLower(keyword)) {
+            return AssertionResult{
+                Text:     fmt.Sprintf("turn_response_contains(turn=%d, all)", rule.Turn),
+                Passed:   false,
+                Evidence: fmt.Sprintf("turn %d response missing: %q", rule.Turn, keyword),
+            }
+        }
+    }
+
+    // contains_any: OR semantics
+    if len(rule.ContainsAny) > 0 {
+        found := false
+        for _, keyword := range rule.ContainsAny {
+            if strings.Contains(response, strings.ToLower(keyword)) {
+                found = true
+                break
+            }
+        }
+        if !found {
+            return AssertionResult{
+                Text:     fmt.Sprintf("turn_response_contains(turn=%d, any)", rule.Turn),
+                Passed:   false,
+                Evidence: fmt.Sprintf("turn %d response missing any of: %v", rule.Turn, rule.ContainsAny),
+            }
+        }
+    }
+
+    return AssertionResult{
+        Text:     fmt.Sprintf("turn_response_contains(turn=%d)", rule.Turn),
+        Passed:   true,
+        Evidence: fmt.Sprintf("turn %d response matched", rule.Turn),
+    }
+}
+
+// evalTurnResponseNotContains checks that a specific turn's response does NOT contain forbidden text.
+func evalTurnResponseNotContains(rule *config.TurnResponseNotContainsRule, turnResults []TurnResult) AssertionResult {
+    turnIdx := rule.Turn - 1
+    if turnIdx < 0 || turnIdx >= len(turnResults) {
+        return AssertionResult{
+            Text:     fmt.Sprintf("turn_response_not_contains(turn=%d)", rule.Turn),
+            Passed:   false,
+            Evidence: fmt.Sprintf("turn %d not found (total turns: %d)", rule.Turn, len(turnResults)),
+        }
+    }
+
+    tr := turnResults[turnIdx]
+    if tr.Status != "completed" {
+        return AssertionResult{
+            Text:     fmt.Sprintf("turn_response_not_contains(turn=%d)", rule.Turn),
+            Passed:   false,
+            Evidence: fmt.Sprintf("turn %d was %s, not completed", rule.Turn, tr.Status),
+        }
+    }
+
+    response := strings.ToLower(tr.Response)
+    for _, keyword := range rule.NotContains {
+        if strings.Contains(response, strings.ToLower(keyword)) {
+            return AssertionResult{
+                Text:     fmt.Sprintf("turn_response_not_contains(turn=%d)", rule.Turn),
+                Passed:   false,
+                Evidence: fmt.Sprintf("turn %d response contains forbidden keyword: %q", rule.Turn, keyword),
+            }
+        }
+    }
+
+    return AssertionResult{
+        Text:     fmt.Sprintf("turn_response_not_contains(turn=%d)", rule.Turn),
+        Passed:   true,
+        Evidence: fmt.Sprintf("turn %d response does not contain any forbidden keywords", rule.Turn),
+    }
+}
+
+// evalToolCalledInTurn checks that a specific tool was called in a specific turn.
+func evalToolCalledInTurn(rule *config.ToolCalledInTurnRule, turnResults []TurnResult) AssertionResult {
+    turnIdx := rule.Turn - 1
+    if turnIdx < 0 || turnIdx >= len(turnResults) {
+        return AssertionResult{
+            Text:     fmt.Sprintf("tool_called_in_turn(turn=%d, tool=%s)", rule.Turn, rule.Name),
+            Passed:   false,
+            Evidence: fmt.Sprintf("turn %d not found (total turns: %d)", rule.Turn, len(turnResults)),
+        }
+    }
+
+    tr := turnResults[turnIdx]
+    if tr.Status != "completed" {
+        return AssertionResult{
+            Text:     fmt.Sprintf("tool_called_in_turn(turn=%d, tool=%s)", rule.Turn, rule.Name),
+            Passed:   false,
+            Evidence: fmt.Sprintf("turn %d was %s, not completed", rule.Turn, tr.Status),
+        }
+    }
+
+    // Search for the tool call within this turn's transcript
+    for _, msg := range tr.Transcript {
+        if msg.Role != transcript.RoleToolCall || msg.ToolCall == nil {
+            continue
+        }
+        if msg.ToolCall.Name != rule.Name {
+            continue
+        }
+        // Name matched; check args if specified (partial match)
+        if len(rule.Args) == 0 {
+            return AssertionResult{
+                Text:     fmt.Sprintf("tool_called_in_turn(turn=%d, tool=%s)", rule.Turn, rule.Name),
+                Passed:   true,
+                Evidence: fmt.Sprintf("tool %q was called in turn %d", rule.Name, rule.Turn),
+            }
+        }
+        if argsMatch(rule.Args, msg.ToolCall.Arguments) {
+            return AssertionResult{
+                Text:     fmt.Sprintf("tool_called_in_turn(turn=%d, tool=%s, with args)", rule.Turn, rule.Name),
+                Passed:   true,
+                Evidence: fmt.Sprintf("tool %q was called in turn %d with matching args", rule.Name, rule.Turn),
+            }
+        }
+    }
+
+    return AssertionResult{
+        Text:     fmt.Sprintf("tool_called_in_turn(turn=%d, tool=%s)", rule.Turn, rule.Name),
+        Passed:   false,
+        Evidence: fmt.Sprintf("tool %q was not called in turn %d", rule.Name, rule.Turn),
+    }
+}
+
+// evalToolNotCalledInTurn checks that a specific tool was NOT called in a specific turn.
+func evalToolNotCalledInTurn(rule *config.ToolNotCalledInTurnRule, turnResults []TurnResult) AssertionResult {
+    turnIdx := rule.Turn - 1
+    if turnIdx < 0 || turnIdx >= len(turnResults) {
+        return AssertionResult{
+            Text:     fmt.Sprintf("tool_not_called_in_turn(turn=%d, tool=%s)", rule.Turn, rule.Name),
+            Passed:   true, // Turn doesn't exist → tool was not called → pass
+            Evidence: fmt.Sprintf("turn %d not found, so tool %q was not called", rule.Turn, rule.Name),
+        }
+    }
+
+    tr := turnResults[turnIdx]
+    for _, msg := range tr.Transcript {
+        if msg.Role == transcript.RoleToolCall && msg.ToolCall != nil && msg.ToolCall.Name == rule.Name {
+            return AssertionResult{
+                Text:     fmt.Sprintf("tool_not_called_in_turn(turn=%d, tool=%s)", rule.Turn, rule.Name),
+                Passed:   false,
+                Evidence: fmt.Sprintf("tool %q was unexpectedly called in turn %d", rule.Name, rule.Turn),
+            }
+        }
+    }
+
+    return AssertionResult{
+        Text:     fmt.Sprintf("tool_not_called_in_turn(turn=%d, tool=%s)", rule.Turn, rule.Name),
+        Passed:   true,
+        Evidence: fmt.Sprintf("tool %q was not called in turn %d as expected", rule.Name, rule.Turn),
+    }
+}
+```
+
+### Validator Changes
+
+Multiple new fields have been added to the schema; corresponding validation rules need to be added in `ValidateCaseConfig` in `internal/config/validator.go`.
+
+#### New Validation Rules
+
+```go
+// New validation logic in ValidateCaseConfig:
+
+// 1. Validate post_condition for each turn in input.turns
+for i, turn := range cfg.Input.Turns {
+    if turn.PostCondition != nil {
+        if turn.PostCondition.OnFail != "" &&
+            turn.PostCondition.OnFail != "skip_remaining" &&
+            turn.PostCondition.OnFail != "fail" {
+            errs = append(errs, fmt.Sprintf(
+                "input.turns[%d].post_condition.on_fail must be 'skip_remaining' or 'fail', got %q", i, turn.PostCondition.OnFail))
+        }
+        // post_condition requires at least one matching condition
+        hasCondition := len(turn.PostCondition.MustContainAny) > 0 ||
+            len(turn.PostCondition.MustContainAll) > 0 ||
+            len(turn.PostCondition.MustNotContain) > 0
+        if !hasCondition {
+            errs = append(errs, fmt.Sprintf(
+                "input.turns[%d].post_condition must specify at least one of: must_contain_any, must_contain_all, must_not_contain", i))
+        }
+    }
+
+    // 2. Capture rule validation
+    for j, cap := range turn.Capture {
+        if cap.Variable == "" {
+            errs = append(errs, fmt.Sprintf(
+                "input.turns[%d].capture[%d].variable is required", i, j))
+        }
+        if cap.Pattern == "" && cap.JSONPath == "" {
+            errs = append(errs, fmt.Sprintf(
+                "input.turns[%d].capture[%d] must specify either pattern or jsonpath", i, j))
+        }
+        if cap.Pattern != "" && cap.JSONPath != "" {
+            errs = append(errs, fmt.Sprintf(
+                "input.turns[%d].capture[%d] must specify only one of pattern or jsonpath, not both", i, j))
+        }
+        // Verify regex compiles
+        if cap.Pattern != "" {
+            if _, err := regexp.Compile(cap.Pattern); err != nil {
+                errs = append(errs, fmt.Sprintf(
+                    "input.turns[%d].capture[%d].pattern is invalid regex: %v", i, j, err))
+            }
+        }
+    }
+
+    // 3. Per-turn timeout validation
+    if turn.TimeoutSeconds < 0 {
+        errs = append(errs, fmt.Sprintf(
+            "input.turns[%d].timeout_seconds must be non-negative", i))
+    }
+
+    // 4. content required validation
+    if turn.Content == "" {
+        errs = append(errs, fmt.Sprintf(
+            "input.turns[%d].content is required", i))
+    }
+}
+
+// 5. Validate per-turn assertions in judge.success / judge.failure
+for i, rule := range cfg.Judge.Success {
+    errs = append(errs, validateTurnRule(fmt.Sprintf("judge.success[%d]", i), rule, len(cfg.Input.Turns))...)
+}
+for i, rule := range cfg.Judge.Failure {
+    errs = append(errs, validateTurnRule(fmt.Sprintf("judge.failure[%d]", i), rule, len(cfg.Input.Turns))...)
+}
+```
+
+#### Per-Turn Assertion Validation Function
+
+```go
+// validateTurnRule validates turn-specific rule fields.
+func validateTurnRule(prefix string, rule Rule, totalTurns int) []string {
+    var errs []string
+
+    if rule.TurnResponseContains != nil {
+        r := rule.TurnResponseContains
+        if r.Turn < 1 {
+            errs = append(errs, fmt.Sprintf("%s.turn_response_contains.turn must be >= 1", prefix))
+        }
+        if totalTurns > 0 && r.Turn > totalTurns {
+            errs = append(errs, fmt.Sprintf(
+                "%s.turn_response_contains.turn (%d) exceeds total turns (%d)", prefix, r.Turn, totalTurns))
+        }
+        if len(r.ContainsAll) == 0 && len(r.ContainsAny) == 0 {
+            errs = append(errs, fmt.Sprintf(
+                "%s.turn_response_contains must specify contains_all or contains_any", prefix))
+        }
+    }
+
+    if rule.TurnResponseNotContains != nil {
+        r := rule.TurnResponseNotContains
+        if r.Turn < 1 {
+            errs = append(errs, fmt.Sprintf("%s.turn_response_not_contains.turn must be >= 1", prefix))
+        }
+        if totalTurns > 0 && r.Turn > totalTurns {
+            errs = append(errs, fmt.Sprintf(
+                "%s.turn_response_not_contains.turn (%d) exceeds total turns (%d)", prefix, r.Turn, totalTurns))
+        }
+        if len(r.NotContains) == 0 {
+            errs = append(errs, fmt.Sprintf(
+                "%s.turn_response_not_contains.not_contains is required", prefix))
+        }
+    }
+
+    if rule.ToolCalledInTurn != nil {
+        r := rule.ToolCalledInTurn
+        if r.Turn < 1 {
+            errs = append(errs, fmt.Sprintf("%s.tool_called_in_turn.turn must be >= 1", prefix))
+        }
+        if r.Name == "" {
+            errs = append(errs, fmt.Sprintf("%s.tool_called_in_turn.name is required", prefix))
+        }
+    }
+
+    if rule.ToolNotCalledInTurn != nil {
+        r := rule.ToolNotCalledInTurn
+        if r.Turn < 1 {
+            errs = append(errs, fmt.Sprintf("%s.tool_not_called_in_turn.turn must be >= 1", prefix))
+        }
+        if r.Name == "" {
+            errs = append(errs, fmt.Sprintf("%s.tool_not_called_in_turn.name is required", prefix))
+        }
+    }
+
+    return errs
+}
+```
+
+### Reliability Mechanisms
+
+#### 1. post_condition Pre-Assertions
+
+After each turn executes, `post_condition` is checked; when not met, it is handled according to the `on_fail` strategy:
+
+| on_fail Value    | Behavior                       | Evaluation Status                   |
+| ---------------- | ------------------------------ | ----------------------------------- |
+| `skip_remaining` | Skip all subsequent turns      | `SKIP` (reason annotated in report) |
+| `fail` (default) | Immediately terminate the case | `FAIL`                              |
+
+Representation in reports:
+
+```json
+{
+  "case_id": "confirm-then-execute",
+  "status": "SKIP",
+  "skip_reason": "Turn 1 post_condition not met: response missing any of: [confirm, OK, continue?]",
+  "turns_executed": 1,
+  "turns_total": 2,
+  "turn_results": [
+    {
+      "turn_number": 1,
+      "status": "completed",
+      "post_condition_passed": false
+    },
+    {
+      "turn_number": 2,
+      "status": "skipped",
+      "skip_reason": "skipped due to turn 1 post_condition failure"
+    }
+  ]
+}
+```
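+
+The `on_fail` handling in the table above can be sketched as follows. This is a minimal illustration, not the proposal's engine: `applyOnFail` and `turnOutcome` are hypothetical names introduced here, and the real logic would live in `executeTurnsSequentially`.
+
+```go
+package main
+
+import "fmt"
+
+// turnOutcome is a hypothetical, simplified stand-in for the proposal's TurnResult.
+type turnOutcome struct {
+    TurnNumber int
+    Status     string // "completed" or "skipped"
+    SkipReason string
+}
+
+// applyOnFail returns the case status and the entries for the turns that never
+// ran after turn failedTurn missed its post_condition.
+// An empty onFail is treated as "fail", matching the documented default.
+func applyOnFail(onFail string, failedTurn, turnsTotal int) (string, []turnOutcome) {
+    if onFail != "skip_remaining" {
+        return "FAIL", nil
+    }
+    var skipped []turnOutcome
+    for n := failedTurn + 1; n <= turnsTotal; n++ {
+        skipped = append(skipped, turnOutcome{
+            TurnNumber: n,
+            Status:     "skipped",
+            SkipReason: fmt.Sprintf("skipped due to turn %d post_condition failure", failedTurn),
+        })
+    }
+    return "SKIP", skipped
+}
+
+func main() {
+    status, skipped := applyOnFail("skip_remaining", 1, 2)
+    fmt.Println(status, len(skipped), skipped[0].SkipReason)
+}
+```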
+
+#### 2. Capture Value Extraction
+
+Two extraction methods are supported:
+
+**Regex extraction**:
+```yaml
+capture:
+  - variable: plan_name
+    pattern: "created plan[「\"'](?P<value>[^」\"']+)[」\"']"
+```
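+
+A minimal sketch of how a named-group pattern like the one above might be evaluated with Go's standard `regexp` package. The function name `captureByPattern` and the fallback to the first capture group are assumptions made for illustration; the proposal's own helper is `extractCapturedValue`.
+
+```go
+package main
+
+import (
+    "fmt"
+    "regexp"
+)
+
+// captureByPattern returns the text matched by the named group "value" or,
+// if the pattern has no such group, by the first capture group (an assumed
+// fallback). The second return value is false when nothing was captured.
+func captureByPattern(pattern, response string) (string, bool) {
+    re, err := regexp.Compile(pattern)
+    if err != nil {
+        return "", false
+    }
+    m := re.FindStringSubmatch(response)
+    if m == nil {
+        return "", false
+    }
+    if idx := re.SubexpIndex("value"); idx >= 0 {
+        return m[idx], true
+    }
+    if len(m) > 1 {
+        return m[1], true
+    }
+    return "", false
+}
+
+func main() {
+    v, ok := captureByPattern(`created plan "(?P<value>[^"]+)"`, `I created plan "release-q3" for you.`)
+    fmt.Println(v, ok)
+}
+```
+
+Returning an explicit `ok` flag lets the caller tell "no match" apart from a legitimately empty capture.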
+
+**JSONPath extraction** (extracting from ToolResult messages in the turn's transcript):
+```yaml
+capture:
+  - variable: plan_id
+    jsonpath: "$.transcript.tool_results[-1].content.id"
+```
+
+> **Data source note**: The JSONPath root object `$` is a structured turn result JSON containing `response` (Agent text response) and `transcript` (the turn's transcript, with `tool_calls` and `tool_results` arrays). The implementation serializes `TurnResult` to JSON and queries it with the JSONPath library.
+
+Extracted values are referenced in subsequent turns via `{{variable_name}}`:
+```yaml
+- role: user
+  content: "Add an approval node to {{plan_id}}"
+```
+
+#### 3. retry_on Extension
+
+```yaml
+cases:
+  retry_policy:
+    max_retries: 2
+    retry_on:
+      - timeout
+      - error
+      - turn_precondition_fail  # New: retry the entire case when post_condition fails
+```
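+
+The retry decision implied by this policy could look like the sketch below. The `retryPolicy` type and `shouldRetry` function are hypothetical and only mirror the YAML fields shown above.
+
+```go
+package main
+
+import "fmt"
+
+// retryPolicy mirrors the YAML fields shown above; the Go type is hypothetical.
+type retryPolicy struct {
+    MaxRetries int
+    RetryOn    []string
+}
+
+// shouldRetry reports whether a case that ended with the given failure reason
+// should run again, given how many retries have already been made.
+func shouldRetry(p retryPolicy, reason string, retriesSoFar int) bool {
+    if retriesSoFar >= p.MaxRetries {
+        return false
+    }
+    for _, r := range p.RetryOn {
+        if r == reason {
+            return true
+        }
+    }
+    return false
+}
+
+func main() {
+    p := retryPolicy{MaxRetries: 2, RetryOn: []string{"timeout", "error", "turn_precondition_fail"}}
+    fmt.Println(shouldRetry(p, "turn_precondition_fail", 0))
+    fmt.Println(shouldRetry(p, "turn_precondition_fail", 2))
+}
+```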
+
+#### 4. Multi-Turn Transcript Format
+
+The complete multi-turn transcript preserves messages from each turn, annotated with turn numbers:
+
+```json
+[
+  {"role": "user", "content": "sdd_bootstrap: task=implement login", "turn": 1},
+  {"role": "assistant", "content": "Entering Research phase...", "turn": 1},
+  {"role": "user", "content": "Skip Research, write code directly", "turn": 2},
+  {"role": "assistant", "content": "Need to complete Research phase first...", "turn": 2}
+]
+```
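+
+One way to produce this format is to flatten the per-turn transcripts and stamp each message with its turn number. The sketch below assumes simplified, hypothetical types (`message`, `turnMessage`) in place of `transcript.Message`.
+
+```go
+package main
+
+import (
+    "encoding/json"
+    "fmt"
+)
+
+// message is a hypothetical stand-in for transcript.Message.
+type message struct {
+    Role    string
+    Content string
+}
+
+// turnMessage is a hypothetical flattened message carrying its turn number.
+type turnMessage struct {
+    Role    string `json:"role"`
+    Content string `json:"content"`
+    Turn    int    `json:"turn"`
+}
+
+// mergeTranscripts flattens per-turn transcripts (index 0 is turn 1) into one
+// list, stamping each message with its 1-based turn number.
+func mergeTranscripts(perTurn [][]message) []turnMessage {
+    merged := []turnMessage{}
+    for i, msgs := range perTurn {
+        for _, m := range msgs {
+            merged = append(merged, turnMessage{Role: m.Role, Content: m.Content, Turn: i + 1})
+        }
+    }
+    return merged
+}
+
+func main() {
+    merged := mergeTranscripts([][]message{
+        {{Role: "user", Content: "sdd_bootstrap: task=implement login"}, {Role: "assistant", Content: "Entering Research phase..."}},
+        {{Role: "user", Content: "Skip Research, write code directly"}},
+    })
+    out, err := json.MarshalIndent(merged, "", "  ")
+    if err != nil {
+        panic(err)
+    }
+    fmt.Println(string(out))
+}
+```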
+
+## Test Plan
+
+### Unit Tests
+
+| Test Scenario              | Package     | Description                                                                 |
+| -------------------------- | ----------- | --------------------------------------------------------------------------- |
+| Schema parsing             | `config`    | Verify YAML parsing of Turn.Capture, PostCondition new fields               |
+| Validator                  | `config`    | Verify turns validation rules (empty content, invalid on_fail values, etc.) |
+| post_condition             | `evaluator` | Verify `checkPostCondition` AND/OR/NOT logic                                |
+| capture extraction         | `evaluator` | Verify both regex and JSONPath extraction methods                           |
+| Template rendering         | `evaluator` | Verify `{{variable}}` substitution logic                                    |
+| turn_response_contains     | `judge`     | Verify per-turn assertion matching logic                                    |
+| turn_response_not_contains | `judge`     | Verify per-turn negative assertions                                         |
+| tool_called_in_turn        | `judge`     | Verify per-turn tool call checks                                            |
+| Turn out of bounds         | `judge`     | Verify FAIL for a non-existent turn (`tool_not_called_in_turn` passes)      |
+
+### Integration Tests
+
+| Test Scenario                | Description                                                           |
+| ---------------------------- | --------------------------------------------------------------------- |
+| Two-turn normal execution    | Both turns succeed, Judge passes                                      |
+| post_condition skip          | Turn 1 post_condition fails, turn 2 is skipped                        |
+| post_condition fail          | Turn 1 post_condition fails, entire case FAILs                        |
+| capture + template reference | Turn 1 capture value correctly substituted in turn 2                  |
+| session resume fallback      | Falls back to single-shot execution when Agent doesn't support resume |
+| Single-turn compatibility    | Existing `input.prompt` cases behave unchanged                        |
+
+### E2E Tests
+
+| Test Scenario              | Description                                         |
+| -------------------------- | --------------------------------------------------- |
+| Full multi-turn evaluation | Execute 2-3 turn multi-turn cases with a real Agent |
+| Report format verification | Verify JSON/HTML reports contain turn_results       |
+
+## Drawbacks
+
+1. **Increased complexity**: The multi-turn execution path is significantly more complex than single-turn, increasing evaluator maintenance cost
+2. **Longer execution time**: Multi-turn interaction time and token consumption are several times those of single-turn
+3. **Agent dependency**: Session resume depends on the Agent CLI's `--resume` capability, subject to upstream API changes
+4. **Debugging difficulty**: When multi-turn cases fail, each turn's input/output needs to be analyzed, increasing debugging complexity
+5. **Model randomness**: Model randomness is amplified in multi-turn interactions, potentially requiring looser matching strategies or more retries
+
+## Alternatives
+
+### Alternative A: Pure Concatenation Mode (Existing Behavior Optimization)
+
+**Approach**: Concatenate all turns into a single large prompt that simulates conversation history, and send it to the Agent in one shot.
+
+```
+[Simulated conversation history]
+User: sdd_bootstrap: task=implement login
+Assistant: [expected response placeholder]
+User: Skip Research, write code directly
+
+Please respond to the last user message based on the conversation history above.
+```
+
+**Pros**: Simple implementation, no Agent interface changes needed.
+
+**Cons**:
+- The Agent cannot distinguish between "real prior interactions" and "simulated conversation history"
+- Cannot verify the actual output from previous turns
+- Does not support intermediate checks like post_condition, capture
+- Cannot test the Agent's actual session state management capability
+
+**Conclusion**: Cannot meet core requirements, **not adopted**.
+
+### Alternative B: Standalone Multi-Turn Test Framework
+
+**Approach**: Without modifying the skill-up core, build a dedicated multi-turn testing tool separately.
+
+**Pros**: Does not affect existing code, can evolve independently.
+
+**Cons**:
+- Duplicate infrastructure (runtime, agent adapter, judge, report all need reimplementation)
+- Users need to learn and maintain two toolsets
+- Cannot share skill-up's infrastructure (credential management, sandbox, reporting)
+
+**Conclusion**: Cost too high, **not adopted**.
+
+### Alternative C: Add RunTurn to Agent Interface (This Proposal)
+
+**Approach**: Within the existing skill-up framework, implement via Agent interface extension `RunTurn` + evaluator multi-turn execution engine.
+
+**Pros**:
+- Minimizes changes, reuses existing infrastructure
+- Backward compatible, single-turn cases unaffected
+- Leverages Agent CLI's native session resume capability
+
+**Cons**:
+- Requires each Agent to implement `RunTurn`
+- Constrained by Agent CLI's session resume capabilities
+
+**Conclusion**: **This proposal is adopted**.
+
+## Infrastructure Needed
+
+- **No new external dependencies for regex capture**: Regex extraction uses Go's standard library `regexp`. JSONPath extraction is the one exception and adds a library (see the last item in this list)
+- **No new services**: All changes are internal to the skill-up CLI
+- **Agent CLI requirements**:
+  - claude_code: Must support `--resume <session-id>` + `-p` parameters (verified, confirmed by [official docs](https://code.claude.com/docs/en/cli-reference))
+  - codex: Must support `codex exec resume <SESSION_ID>` non-interactive mode (verified, confirmed by [official docs](https://developers.openai.com/codex/cli/features))
+  - qodercli: Session resume not implemented in first version, falls back to single-shot concatenation mode (requires aligning `--session-id` API with qodercli team, to be implemented in Phase 4)
+- **JSONPath library**: JSONPath extraction for capture requires adding a new dependency, `github.com/PaesslerAG/jsonpath`, to `go.mod` via `go get github.com/PaesslerAG/jsonpath`. The library builds on `github.com/PaesslerAG/gval`, so it does bring a transitive dependency; confirm the license of both modules when the dependency is added
+
+## Upgrade & Migration Strategy
+
+### Backward Compatibility
+
+| Scenario                                              | Impact             | Handling                                                                                      |
+| ----------------------------------------------------- | ------------------ | --------------------------------------------------------------------------------------------- |
+| Existing `input.prompt` cases                         | No impact          | Takes existing single-turn execution path                                                     |
+| Existing `input.turns` cases (without post_condition) | Behavior change    | Changes from "concatenate and send once" to "execute turn by turn"; results are more accurate |
+| Existing Judge rules                                  | No impact          | Global assertions continue to apply to the complete transcript                                |
+| Schema version                                        | Remains `v1alpha1` | All new fields are optional                                                                   |
+
+### Migration Steps
+
+1. **Phase 1**: Implement evaluator multi-turn execution engine + Agent `RunTurn` interface (claude_code first)
+2. **Phase 2**: Implement per-turn Judge assertions (`turn_response_contains`, etc.)
+3. **Phase 3**: Implement capture + template variables + retry extension
+4. **Phase 4**: codex and qodercli `RunTurn` implementation
+
+Each Phase can be released independently without blocking subsequent Phases.
+
+## Design Self-Review and Implementation Notes
+
+### Identified Technical Risks and Mitigations
+
+#### 1. `executeMultiTurnFallback` Implementation Strategy (Resolved)
+
+**Original problem**: Earlier versions of `executeMultiTurnFallback` called `evaluateCaseSession` directly, missing critical intermediate steps in `executeCaseOnce` such as tracing span, artifact collection, `normalizeSessionResult`, and `handleExecutionResult`.
+
+**Adopted solution**: The `executeMultiTurnFallback` in the main text has been changed to call `e.executeCaseOnce(ctx, caseCfg, configName, rt, runAgent)` directly, fully reusing all intermediate steps of the existing single-turn flow with no risk of omission.
+
+> **Risk level**: ✅ Eliminated.
+
+#### 2. Codex Session ID Extraction Race Condition
+
+**Problem**: `extractCodexSessionID` retrieves the most recent session file via `ls -t ~/.codex/sessions/*.jsonl | head -1`. In parallel evaluation scenarios (multiple cases running simultaneously), it might pick up a session created by another case.
+
+**Mitigations**:
+- Option 1 (recommended): Record the session directory's last modification time or file count before the `Run` call, then pick up newly added files after the call
+- Option 2: Parse session ID from codex CLI's JSON output (if codex supports outputting it in stdout/stderr)
+- Option 3: Use a `--session-id` flag to proactively specify the session ID (if codex CLI supports it)
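+
+Option 1 could be sketched as a before/after snapshot of the session directory. The helper names below (`snapshotSessions`, `newSessionFiles`) are hypothetical, and the glob pattern is passed in rather than hard-coded because the actual session layout is whatever the codex CLI produces.
+
+```go
+package main
+
+import (
+    "fmt"
+    "os"
+    "path/filepath"
+)
+
+// snapshotSessions returns the set of session files currently matching pattern.
+func snapshotSessions(pattern string) (map[string]struct{}, error) {
+    paths, err := filepath.Glob(pattern)
+    if err != nil {
+        return nil, err
+    }
+    seen := make(map[string]struct{}, len(paths))
+    for _, p := range paths {
+        seen[p] = struct{}{}
+    }
+    return seen, nil
+}
+
+// newSessionFiles returns the files matching pattern that are absent from before.
+// Exactly one result identifies this run's session; zero or several results are
+// ambiguous and should be treated by the caller as an extraction failure.
+func newSessionFiles(pattern string, before map[string]struct{}) ([]string, error) {
+    paths, err := filepath.Glob(pattern)
+    if err != nil {
+        return nil, err
+    }
+    var added []string
+    for _, p := range paths {
+        if _, ok := before[p]; !ok {
+            added = append(added, p)
+        }
+    }
+    return added, nil
+}
+
+func main() {
+    dir, err := os.MkdirTemp("", "sessions")
+    if err != nil {
+        panic(err)
+    }
+    defer os.RemoveAll(dir)
+    pattern := filepath.Join(dir, "*.jsonl")
+
+    before, err := snapshotSessions(pattern)
+    if err != nil {
+        panic(err)
+    }
+    // Stand-in for the agent run that creates a session file.
+    if err := os.WriteFile(filepath.Join(dir, "abc123.jsonl"), []byte("{}\n"), 0o600); err != nil {
+        panic(err)
+    }
+    added, err := newSessionFiles(pattern, before)
+    if err != nil {
+        panic(err)
+    }
+    fmt.Println(added)
+}
+```
+
+This narrows the race window but does not close it: if several cases share one sessions directory, another case can still create a file between the snapshot and the diff, which is why more than one new file is treated as ambiguous.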
+
+> **Risk level**: 🟡 Medium. Currently skill-up's case execution is serial (each case in an independent runtime), but parallel execution may be supported in the future.
+
+#### 3. Behavior When `capture` Extraction Fails
+
+**Problem**: When `extractCapturedValue` returns an empty string, the `{{variable}}` placeholder in subsequent turns will remain as-is (not substituted), potentially causing confusion when sent to the Agent.
+
+**Mitigations**:
+- Detect unsubstituted `{{...}}` placeholders in `renderTemplate` and output a warning in the log
+- Mark the turn as `TurnError` (strict mode) or just warn (lenient mode) when `extractCapturedValue` fails
+- The recommendation is to default to lenient mode (warn + keep as-is) during implementation, since the Agent may still understand prompts with placeholders
+
+```go
+func renderTemplate(content string, vars map[string]string) string {
+    result := content
+    for name, value := range vars {
+        result = strings.ReplaceAll(result, "{{"+name+"}}", value)
+    }
+    // Detect unsubstituted placeholders
+    if strings.Contains(result, "{{") {
+        logging.Warnf("renderTemplate: unresolved placeholders in content: %s", result)
+    }
+    return result
+}
+```
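+
+The `strings.Contains(result, "{{")` check above only says that something is left over. A slightly more informative variant, sketched here under the assumption that variable names are simple identifiers, reports which placeholders are unresolved. `unresolvedPlaceholders` and `placeholderRe` are hypothetical names.
+
+```go
+package main
+
+import (
+    "fmt"
+    "regexp"
+)
+
+// placeholderRe matches {{name}} where name is a simple identifier
+// (an assumption; the proposal does not define the variable-name grammar).
+var placeholderRe = regexp.MustCompile(`\{\{([A-Za-z_][A-Za-z0-9_]*)\}\}`)
+
+// unresolvedPlaceholders returns the distinct variable names still present in
+// rendered content, in order of first appearance.
+func unresolvedPlaceholders(rendered string) []string {
+    var names []string
+    seen := map[string]bool{}
+    for _, m := range placeholderRe.FindAllStringSubmatch(rendered, -1) {
+        if !seen[m[1]] {
+            seen[m[1]] = true
+            names = append(names, m[1])
+        }
+    }
+    return names
+}
+
+func main() {
+    fmt.Println(unresolvedPlaceholders("Add an approval node to {{plan_id}}"))
+}
+```
+
+Naming the missing variables in the warning makes it easier to trace a failure back to the `capture` rule that produced no value.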
+
+> **Risk level**: 🟢 Low. A failed regex match returns an empty value rather than an error, so the run continues; the only visible effect is the unsubstituted placeholder.
+
+#### 4. `RunTurn` Behavior When `sessionID` is Empty
+
+**Problem**: If the first turn's `Run` succeeds but `extractSessionID` returns an empty string (e.g., due to abnormal Agent output format), subsequent `RunTurn` calls with an empty `sessionID` will cause the CLI command to error.
+
+**Mitigations**:
+- In `executeTurnsSequentially`, check whether `sessionID` is empty after the first turn executes
+- If empty, mark subsequent turns as `TurnError` and terminate, rather than passing an empty sessionID causing CLI errors
+
+After the first turn's execution closure returns in `executeTurnsSequentially` (that is, after step 3's closure call completes in the code above), add a check for an empty `sessionID`. The complete snippet is as follows:
+
+```go
+sessionResult, execErr := func() (*agent.SessionResult, error) {
+    turnCtx := ctx
+    if turn.TimeoutSeconds > 0 {
+        var cancel context.CancelFunc
+        turnCtx, cancel = context.WithTimeout(ctx, time.Duration(turn.TimeoutSeconds)*time.Second)
+        defer cancel()
+    }
+    if turnNum == 1 {
+        sr, err := runAgent.Run(turnCtx, rt, agent.ExecOptions{}, []transcript.Message{message})
+        if sr != nil {
+            sessionID = extractSessionID(turnCtx, rt, runAgent, sr)
+        }
+        return sr, err
+    }
+    return resumer.RunTurn(turnCtx, rt, agent.ExecOptions{}, message, sessionID)
+}()
+
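+// Empty sessionID check (only for the first turn when there are subsequent turns).
+// The nil guard avoids dereferencing sessionResult if Run returned (nil, nil).
+if turnNum == 1 && sessionID == "" && turnsTotal > 1 && execErr == nil && sessionResult != nil {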
+    turnResult.Response = sessionResult.FinalMessage
+    turnResult.Transcript = sessionResult.Transcript
+    turnResult.SessionResult = sessionResult
+    turnResult.Status = TurnCompleted
+    turnResults = append(turnResults, turnResult)
+    for j := turnNum; j < turnsTotal; j++ {
+        turnResults = append(turnResults, TurnResult{
+            TurnNumber: j + 1,
+            Status:     TurnError,
+            SkipReason: "failed to extract session ID from initial run; cannot resume session",
+        })
+    }
+    return turnResults
+}
+```
+
+> **Risk level**: 🟡 Medium. Session ID extraction failure will prevent the entire multi-turn case from executing.
+
+#### 5. JSONPath Library Dependency
+
+**Problem**: The `extractByJSONPath` in this proposal uses a `jsonpath.Get(path, data)` call, requiring an external JSONPath library (e.g., `github.com/PaesslerAG/jsonpath`).
+
+**Mitigations**:
+- Phase 1 only supports regex capture (covers most scenarios); JSONPath capture is implemented in Phase 3
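+- Phase 3 implementation introduces the dependency via `go get github.com/PaesslerAG/jsonpath` (API is `jsonpath.Get(path, data)`; the library builds on `github.com/PaesslerAG/gval`, so it is not free of transitive dependencies, and its license should be confirmed at that point)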
+
+> **Risk level**: 🟢 Low. Regex capture can satisfy most scenarios; JSONPath is an incremental capability.
+
+### Suggested Implementation Priority
+
+| Priority | Module                                              | Rationale                                                |
+| -------- | --------------------------------------------------- | -------------------------------------------------------- |
+| P0       | `executeCaseOnce` branching + `executeMultiTurn`    | Critical path, must be implemented first                 |
+| P0       | `SessionResumer` interface + claude_code `RunTurn`  | Foundation for multi-turn execution                      |
+| P0       | `executeTurnsSequentially`                          | Turn-by-turn execution engine                            |
+| P0       | `checkPostCondition`                                | Per-turn assertions, core value of multi-turn evaluation |
+| P1       | `SessionResult.SessionID` field + extraction logic  | Prerequisite for session resume                          |
+| P1       | `finalizeMultiTurnResult`                           | Result aggregation and Judge execution                   |
+| P1       | `turn_response_contains` and other Judge assertions | Per-turn evaluation                                      |
+| P1       | New Validator rules                                 | Prevent invalid configurations                           |
+| P2       | codex `RunTurn` implementation                      | Second batch of Agent support                            |
+| P2       | `capture` + `renderTemplate`                        | Dynamic value passing                                    |
+| P3       | JSONPath capture                                    | Only needed when regex is insufficient                   |
+| P3       | `retry_on: turn_precondition_fail`                  | Nice to Have                                             |
+
+### Design Completeness Self-Assessment
+
+| Dimension                  | Rating     | Description                                                                                                                                                          |
+| -------------------------- | ---------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| Schema design              | ✅ Complete | Turn, PostCondition, CaptureRule, Rule extensions all have complete definitions                                                                                      |
+| Evaluator execution engine | ✅ Complete | `executeCaseOnce` branching, `executeMultiTurn`, `executeTurnsSequentially`, `finalizeMultiTurnResult`, `executeMultiTurnFallback` all have complete implementations |
+| Agent interface            | ✅ Complete | `SessionResumer` interface definition, claude_code and codex `RunTurn` implementations, `extractSessionID` dispatch logic all provided                               |
+| Judge assertions           | ✅ Complete | `evalTurnResponseContains`, `evalTurnResponseNotContains`, `evalToolCalledInTurn`, `evalToolNotCalledInTurn` all have complete implementations                       |
+| Validator                  | ✅ Complete | Validation rules for post_condition, capture, per-turn assertions, and content required are all provided                                                             |
+| Helper functions           | ✅ Complete | `renderTemplate`, `extractCapturedValue`, `checkPostCondition`, and 6 helper functions all have complete implementations                                             |
+| Backward compatibility     | ✅ Verified | Branch condition `len(caseCfg.Input.Turns) > 1` ensures single-turn cases are unaffected                                                                             |
+| Edge cases                 | ✅ Complete | Edge cases like empty sessionID and capture failure are all addressed with complete handling code in the main text and self-review sections                          |
+| Executability              | ✅ Feasible | All code blocks have complete implementations and can be used directly as implementation references                                                                  |
diff --git a/proposals/CONTRIBUTING.md b/proposals/CONTRIBUTING.md
new file mode 100644
index 0000000..4c2929d
--- /dev/null
+++ b/proposals/CONTRIBUTING.md
@@ -0,0 +1,81 @@
+# Skill-up Enhancement Proposals
+
+Use this directory to draft, review, and store enhancement proposals before they
+undergo broader discussion.
+
+> [!NOTE]
+> The proposal process and template structure are inspired by
+> [Tekton Enhancement Proposals (TEPs)](https://github.com/tektoncd/community/tree/main/teps).
+
+> [!IMPORTANT]
+> **When is a proposal required?**
+>
+> Use the proposal process for changes that:
+> - Introduce new features or major enhancements to skill-up
+> - Modify the evaluation pipeline, Agent interface, or Judge behavior
+> - Affect the configuration schema or CLI contract
+> - Add new Agent Engine integrations
+>
+> Small bug fixes, documentation updates, and minor refactors can be submitted
+> directly as Pull Requests without a proposal.
+
+## Getting started
+
+1. Run the init script to create a new proposal:
+
+   ```bash
+   proposals/init-proposal.sh "Proposal Title"
+   ```
+
+   This copies the template, fills in metadata, and creates a sequentially
+   numbered `0001-proposal-title.md` draft.
+
+2. Fill in each section from the template (`Summary`, `Motivation`, …).
+3. Once ready, submit the resulting file in a PR for community review.
+
+**Available options:**
+
+```bash
+proposals/init-proposal.sh --help
+proposals/init-proposal.sh --status provisional --author "@username" "My Feature"
+```
+
+## Template
+
+The template used for new proposals lives at `proposals/proposal-template.md.template`
+and mirrors the standard enhancement proposal structure while capturing the key
+sections needed for skill-up planning. Each generated file starts with YAML
+front matter followed by the title and TOC:
+
+```yaml
+---
+title: My First Proposal
+authors:
+  - "@your-github-handle"
+creation-date: 2025-12-21
+last-updated: 2025-12-21
+status: draft
+---
+
+# Proposal-0001: My First Proposal
+
+<!-- toc -->
+- [Summary](#summary)
+...
+<!-- /toc -->
+```
+
+This YAML front matter renders as a table on GitHub and keeps the proposal
+metadata (status, authors, dates) visible at the top of the document.
+
+## Status lifecycle
+
+| Status | Description |
+|--------|-------------|
+| `draft` | Work in progress; not yet under formal review. |
+| `provisional` | Maintainers agree with the direction; design details still pending. |
+| `implementable` | Design approved and compliance checks passed; ready for implementation. |
+| `implementing` | Code is being merged and changes are being integrated. |
+| `implemented` | Feature has reached stable status with complete documentation. |
+| `withdrawn` | Author has withdrawn the proposal. |
+| `rejected` | Maintainers have declined the proposal. |
diff --git a/proposals/README.md b/proposals/README.md
new file mode 100644
index 0000000..a1885f8
--- /dev/null
+++ b/proposals/README.md
@@ -0,0 +1,9 @@
+# Skill-up Enhancement Proposals
+
+See the [proposal contributing guide](CONTRIBUTING.md) for information on proposals and how to create and review them.
+
+This is the complete list of skill-up Enhancement Proposals:
+
+|                       Proposal                        |               Title                | Status | Last Updated |
+| :---------------------------------------------------: | :--------------------------------: | :----: | :----------: |
+| [Proposal-0001](0001-multi-turn-conversation-eval.md) | Multi-Turn Conversation Evaluation | draft  |  2026-05-21  |
\ No newline at end of file
diff --git a/proposals/init-proposal.sh b/proposals/init-proposal.sh
new file mode 100755
index 0000000..166e070
--- /dev/null
+++ b/proposals/init-proposal.sh
@@ -0,0 +1,197 @@
+#!/usr/bin/env bash
+
+# Copyright 2025 Alibaba Group Holding Ltd.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Helper to bootstrap a new skill-up Enhancement Proposal.
+
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+TEMPLATE="$SCRIPT_DIR/proposal-template.md.template"
+
+# Valid status values
+VALID_STATUSES="draft provisional implementable implementing implemented withdrawn rejected"
+
+usage() {
+    cat <<EOF
+Usage: $(basename "$0") [OPTIONS] <title>
+
+Create a new skill-up Enhancement Proposal
+
+Arguments:
+  title                 Proposal title (will appear in the document header)
+
+Options:
+  -s, --status STATUS   Initial status of the proposal (default: draft)
+                        Valid: draft, provisional, implementable, implementing,
+                               implemented, withdrawn, rejected
+  -a, --author AUTHOR   Author(s) to attribute in the new proposal
+  -o, --output PATH     Explicit path to write the new proposal
+  -h, --help            Show this help message
+
+Examples:
+  $(basename "$0") "Multi-Turn Conversation Eval"
+  $(basename "$0") --status provisional --author "@user" "New Feature"
+EOF
+}
+
+slugify() {
+    local title="$1"
+    echo "$title" \
+        | tr '[:upper:]' '[:lower:]' \
+        | sed -E 's/[^a-z0-9 _-]//g' \
+        | sed -E 's/[ _-]+/-/g' \
+        | sed -E 's/^-+|-+$//g'
+}
+
+default_author() {
+    local author
+    author=$(git config user.name 2>/dev/null || true)
+    if [[ -z "$author" ]]; then
+        author=$(git config user.email 2>/dev/null || true)
+    fi
+    if [[ -z "$author" ]]; then
+        author="${USER:-Unknown Author}"
+    fi
+    echo "$author"
+}
+
+next_sequence() {
+    local highest=0
+    for file in "$SCRIPT_DIR"/[0-9][0-9][0-9][0-9]-*.md; do
+        [[ -f "$file" ]] || continue
+        local name
+        name=$(basename "$file")
+        local num="${name%%-*}"
+        # Force base 10 so leading zeros are not parsed as octal
+        num=$((10#$num))
+        if (( num > highest )); then
+            highest=$num
+        fi
+    done
+    echo $((highest + 1))
+}
+
+validate_status() {
+    local status="$1"
+    for valid in $VALID_STATUSES; do
+        if [[ "$status" == "$valid" ]]; then
+            return 0
+        fi
+    done
+    echo "Error: Invalid status '$status'" >&2
+    echo "Valid statuses: $VALID_STATUSES" >&2
+    exit 1
+}
+
+# Parse arguments
+TITLE=""
+STATUS="draft"
+AUTHOR=""
+OUTPUT=""
+
+while [[ $# -gt 0 ]]; do
+    case "$1" in
+        -h|--help)
+            usage
+            exit 0
+            ;;
+        -s|--status)
+            STATUS=$(printf '%s' "${2?--status requires a value}" | tr '[:upper:]' '[:lower:]')
+            shift 2
+            ;;
+        -a|--author)
+            AUTHOR="${2?--author requires a value}"
+            shift 2
+            ;;
+        -o|--output)
+            OUTPUT="${2?--output requires a value}"
+            shift 2
+            ;;
+        -*)
+            echo "Error: Unknown option $1" >&2
+            usage >&2
+            exit 1
+            ;;
+        *)
+            if [[ -z "$TITLE" ]]; then
+                TITLE="$1"
+            else
+                echo "Error: Unexpected argument '$1'" >&2
+                usage >&2
+                exit 1
+            fi
+            shift
+            ;;
+    esac
+done
+
+# Validate required arguments
+if [[ -z "$TITLE" ]]; then
+    echo "Error: title is required" >&2
+    usage >&2
+    exit 1
+fi
+
+# Validate status
+validate_status "$STATUS"
+
+# Set defaults
+if [[ -z "$AUTHOR" ]]; then
+    AUTHOR=$(default_author)
+fi
+
+DATE=$(date +%Y-%m-%d)
+SLUG=$(slugify "$TITLE")
+
+# Determine destination
+if [[ -z "$OUTPUT" ]]; then
+    SEQ=$(next_sequence)
+    PROPOSAL_ID=$(printf "%04d" "$SEQ")
+    DESTINATION="$SCRIPT_DIR/${PROPOSAL_ID}-${SLUG}.md"
+
+    # Ensure unique filename
+    while [[ -f "$DESTINATION" ]]; do
+        SEQ=$((SEQ + 1))
+        PROPOSAL_ID=$(printf "%04d" "$SEQ")
+        DESTINATION="$SCRIPT_DIR/${PROPOSAL_ID}-${SLUG}.md"
+    done
+else
+    DESTINATION="$OUTPUT"
+    PROPOSAL_ID=$(basename "$DESTINATION" | sed -E 's/^([0-9]+)-.*/\1/')
+fi
+
+# Check if destination exists
+if [[ -f "$DESTINATION" ]]; then
+    echo "Refusing to overwrite existing proposal at $DESTINATION" >&2
+    exit 1
+fi
+
+# Verify template exists
+if [[ ! -f "$TEMPLATE" ]]; then
+    echo "Error: Proposal template not found at $TEMPLATE" >&2
+    exit 1
+fi
+
+# Render via pure bash substitution (no sed escaping); keep '&' literal on bash 5.2+.
+shopt -u patsub_replacement 2>/dev/null || true; content=$(<"$TEMPLATE")
+content="${content//\{\{title\}\}/$TITLE}"
+content="${content//\{\{author\}\}/$AUTHOR}"
+content="${content//\{\{status_metadata\}\}/$STATUS}"
+content="${content//\{\{date\}\}/$DATE}"
+content="${content//\{\{proposal_id\}\}/$PROPOSAL_ID}"
+printf '%s\n' "$content" > "$DESTINATION"
+
+echo "Created $DESTINATION"
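
The naming scheme used by the script above can be sketched in isolation. This is a minimal illustration that reuses the `slugify` pipeline from `init-proposal.sh` and its four-digit zero-padded prefix; the sequence number `2` and the sample title are stand-in values chosen for the example, whereas the real script derives the number by scanning existing `NNNN-*.md` files.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Same pipeline as slugify() in init-proposal.sh: lowercase the title, drop
# every character other than a-z, 0-9, space, underscore and hyphen, collapse
# runs of separators into a single hyphen, then trim leading/trailing hyphens.
slugify() {
    local title="$1"
    echo "$title" \
        | tr '[:upper:]' '[:lower:]' \
        | sed -E 's/[^a-z0-9 _-]//g' \
        | sed -E 's/[ _-]+/-/g' \
        | sed -E 's/^-+|-+$//g'
}

# Illustrative sequence number (assumption); the script computes this itself.
seq=2
proposal_id=$(printf "%04d" "$seq")
filename="${proposal_id}-$(slugify "Multi-Turn Conversation Eval!").md"
echo "$filename"   # prints 0002-multi-turn-conversation-eval.md
```

Punctuation such as `!` is dropped rather than replaced, so titles that differ only in punctuation map to the same slug; the script's "ensure unique filename" loop then bumps the sequence number instead of overwriting.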
diff --git a/proposals/proposal-template.md.template b/proposals/proposal-template.md.template
new file mode 100644
index 0000000..745d6a0
--- /dev/null
+++ b/proposals/proposal-template.md.template
@@ -0,0 +1,132 @@
+---
+title: {{title}}
+authors:
+  - "{{author}}"
+creation-date: {{date}}
+last-updated: {{date}}
+status: {{status_metadata}}
+---
+
+# Proposal-{{proposal_id}}: {{title}}
+
+<!-- toc -->
+- [Summary](#summary)
+- [Motivation](#motivation)
+  - [Goals](#goals)
+  - [Non-Goals](#non-goals)
+- [Requirements](#requirements)
+- [Proposal](#proposal)
+  - [Notes/Constraints/Caveats](#notesconstraintscaveats)
+  - [Risks and Mitigations](#risks-and-mitigations)
+- [Design Details](#design-details)
+- [Test Plan](#test-plan)
+- [Drawbacks](#drawbacks)
+- [Alternatives](#alternatives)
+- [Infrastructure Needed](#infrastructure-needed)
+- [Upgrade & Migration Strategy](#upgrade--migration-strategy)
+<!-- /toc -->
+
+## Summary
+
+<!--
+Brief summary of the proposal. Describe the feature/change and why it matters.
+Aim for 2-3 sentences that explain the problem and proposed solution.
+-->
+
+## Motivation
+
+<!--
+Explain why this work matters and the problem it solves.
+What is the current pain point? Why should skill-up care?
+-->
+
+### Goals
+
+<!--
+Specific, measurable objectives. What does success look like?
+Examples:
+- Reduce evaluation time by X%
+- Enable feature Y for skill developers
+- Improve reliability of Z
+-->
+
+### Non-Goals
+
+<!--
+Clarify what is intentionally out of scope.
+What will NOT be addressed by this proposal?
+-->
+
+## Requirements
+
+<!--
+List any constraints that must be satisfied.
+What are the hard requirements vs nice-to-haves?
+-->
+
+## Proposal
+
+<!--
+High-level description of the proposed approach.
+Focus on what, not how. Avoid implementation details.
+Include diagrams or examples if helpful.
+-->
+
+### Notes/Constraints/Caveats
+
+<!--
+(Optional) Any additional context that reviewers should know up front.
+-->
+
+### Risks and Mitigations
+
+<!--
+Highlight critical risks and how they will be managed.
+Consider: performance, security, compatibility, operational aspects.
+-->
+
+## Design Details
+
+<!--
+Detailed implementation specifics:
+- APIs and interfaces
+- Data models and schema changes
+- Algorithm or logic flow
+- Configuration changes
+-->
+
+## Test Plan
+
+<!--
+Outline how the change will be verified and tested.
+Include: unit tests, integration tests, QA/E2E tests, manual testing.
+What scenarios must be covered?
+-->
+
+## Drawbacks
+
+<!--
+What arguments exist against this direction?
+What are the trade-offs?
+-->
+
+## Alternatives
+
+<!--
+Summarize other approaches that were evaluated.
+Why was this proposal chosen over alternatives?
+-->
+
+## Infrastructure Needed
+
+<!--
+(Optional) List any new tooling, repos, or environments required.
+Do we need new services, storage, or third-party dependencies?
+-->
+
+## Upgrade & Migration Strategy
+
+<!--
+(Optional) Describe the migration path for users/operators if needed.
+How will existing setups be upgraded? Are there breaking changes?
+-->