alibaba · zpzjzj · May 26, 2026 · May 26, 2026 · May 27, 2026
@@ -50,7 +50,7 @@
 ## Features
 
 - **Declarative Eval Config**: Define evaluation environment, engine, model, and cases through YAML (`eval.yaml` + `cases/*.yaml`).
-- **Multi-Engine Support**: Works with Qoder CLI, Claude Code, and Codex as Agent Engines.
+- **Multi-Engine Support**: Works with Qoder CLI, Claude Code, and Codex as built-in Agent Engines, plus user-defined agents via `engine.custom` (local transport — see [docs/design/custom-engine.md](docs/design/custom-engine.md)).
 - **Flexible Judging**: Supports `rule_based`, `script`, and `agent_judge` evaluation strategies.
 - **Structured Reports**: Outputs Anthropic-compatible `grading.json`, `benchmark.json`, `benchmark.md`, plus `result.json`, JUnit XML, and HTML reports.
 - **Anthropic Compatible**: Import `evals.json` via `skill-up import`, or auto-detect with `--auto`.

@@ -50,7 +50,7 @@
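 ## Features
 
 - **Declarative Eval Config**: Define the evaluation environment, engine, model, and cases through YAML (`eval.yaml` + `cases/*.yaml`).
-- **Multi-Engine Support**: Supports Agent engines such as Qoder CLI, Claude Code, and Codex.
+- **Multi-Engine Support**: Built-in support for Qoder CLI, Claude Code, and Codex; user-defined Agents can also be connected via `engine.custom` (local transport, see [docs/design/custom-engine.md](docs/design/custom-engine.md)).
 - **Flexible Judging**: Supports three evaluation strategies: `rule_based` (rule matching), `script` (script scoring), and `agent_judge` (Agent scoring).
 - **Structured Reports**: Outputs Anthropic-compatible `grading.json`, `benchmark.json`, and `benchmark.md`, as well as `result.json`, JUnit XML, and HTML reports.
 - **Anthropic Compatible**: Import `evals.json` via `skill-up import`, or auto-detect with `--auto`.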

@@ -112,6 +112,46 @@ report:
 
 `cases.parallelism` is the file-level default. To override it for a single run, use `skill-up run --parallelism N` without modifying `eval.yaml`. Allowed range: **1 to 256**.
 
+### Custom Engine
+
+When `engine.name` is not one of the built-ins (`claude_code`, `codex`, `qodercli`), declare an `engine.custom` block so skill-up knows how to invoke your agent. Only `transport: local` is implemented today; `transport: http` is reserved and currently fails validation with "not yet implemented".
+
+```yaml
+engine:
+  name: my-agent
+  model:
+    provider: anthropic
+    name: claude-sonnet-4-6
+  custom:
+    transport: local             # local (implemented) | http (planned)
+    response_format: session_result   # session_result (default) | text
+    timeout_seconds: 300
+    env:                         # credentials and secrets — NEVER reference these in command/args
+      MY_AGENT_TOKEN: ${MY_AGENT_TOKEN}
+    kwargs:                      # non-secret knobs exposed as ${kwargs.<key>}
+      profile: production
+    local:
+      command: /opt/my-agent/bin/run
+      args:
+        - --input
+        - ${input_file}          # path to the SessionInput JSON skill-up writes
+        - --output
+        - ${output_file}         # path your agent should write its SessionResult JSON to
+      cwd: ${workspace}          # optional; confined to the runtime workspace
+      input_file: inputs/messages.json   # optional override (relative to workspace)
+      output_file: outputs/session-result.json   # optional override
+```
+
+Template variables available in `command` / `args` / `cwd` / `env` / `input_file` / `output_file`:
+`${workspace}`, `${input_file}`, `${output_file}`, `${model}`, `${model_provider}`, `${model_name}`, `${case_id}`, `${variant}`, `${max_turns}`, `${timeout_seconds}`, `${kwargs.<key>}`, plus environment variables via `${VAR}` / `${VAR:-default}` / `${VAR?error message}`.
+
+Secret-handling rules (enforced at config load):
+
+- `${api_key}` and any kwarg whose key looks like a credential (`token`, `secret`, `api_key`, `apiKey`, `bearerToken`, …) cannot be referenced from `command` / `args` / `cwd` / `input_file` / `output_file`. Pass them through `engine.custom.env`, where they reach your agent as process environment variables instead of leaking into process listings.
+- `${SOMEVAR:-...}` defaults that contain recognizable credential shapes (`sk-...`, `sk-ant-...`, `ghp_...`, `AIza...`, `AKIA...`, JWTs) are likewise rejected in command-line contexts.
+
+See `docs/design/custom-engine.md` for the full SessionInput / SessionResult schema your agent must conform to.
+
 ### Engine kwargs (agent-specific switches)
 
 `engine.kwargs` is a free-form string map. Each agent reads only the keys it recognises; unknown keys are ignored. Unrecognised keys (typos like `bypas_sandbox`) emit a DEBUG log line — run with `-v` to surface them. CLI override: `--engine-kwarg key=value` (alias `--ek`), repeatable. Precedence: `--engine-kwarg` > `engine.kwargs` > default.
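The `${VAR}` / `${VAR:-default}` / `${VAR?error message}` forms listed in the Custom Engine section mirror shell-style parameter expansion. The following is a minimal sketch of how such references could be resolved, not skill-up's actual resolver. The function name `expandEnv` and the rule that only upper-case names count as environment variables (so that built-ins like `${workspace}` are left for a later pass) are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"os"
	"regexp"
)

// envRef matches ${NAME}, ${NAME:-default} and ${NAME?message}.
// Assumption: only upper-case names are treated as environment variables, so
// built-in template variables such as ${workspace} are left untouched.
var envRef = regexp.MustCompile(`\$\{([A-Z_][A-Z0-9_]*)(?:(:-|\?)([^}]*))?\}`)

// expandEnv resolves environment references in s. lookup reports whether a
// variable is set, with the same contract as os.LookupEnv.
func expandEnv(s string, lookup func(string) (string, bool)) (string, error) {
	var firstErr error
	out := envRef.ReplaceAllStringFunc(s, func(ref string) string {
		m := envRef.FindStringSubmatch(ref)
		name, op, arg := m[1], m[2], m[3]
		val, ok := lookup(name)
		switch {
		case op == ":-" && (!ok || val == ""):
			return arg // unset or empty: fall back to the default
		case op == "?" && !ok:
			if firstErr == nil {
				firstErr = fmt.Errorf("%s: %s", name, arg)
			}
		case op == "" && !ok:
			if firstErr == nil {
				firstErr = fmt.Errorf("environment variable %s is not set", name)
			}
		}
		return val
	})
	if firstErr != nil {
		return "", firstErr
	}
	return out, nil
}

func main() {
	for _, tpl := range []string{
		"${CUSTOM_AGENT_BIN:-/opt/my-agent/bin/run}",
		"${CUSTOM_AGENT_BIN?set CUSTOM_AGENT_BIN to the agent path}",
		"--input ${input_file}",
	} {
		out, err := expandEnv(tpl, os.LookupEnv)
		fmt.Printf("%q -> %q, err=%v\n", tpl, out, err)
	}
}
```

Returning an error that names the missing variable matches what the e2e test for an unset `${CUSTOM_AGENT_BIN}` expects from the real loader. The exact error wording in skill-up may differ.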
@@ -127,7 +167,7 @@ report:
 skill-up run evals/eval.yaml --engine-kwarg bypass_sandbox=true
 ```
 
 ### MCP configuration
 
 MCP supports `mode: real` and `mode: mocked`. `real` installs a real MCP server into Agents such as `claude_code`, `qodercli`, or `codex`; `mocked` makes `internal/mcp` generate a local stdio mock server that is then installed into the Agent like any other MCP server.
 

@@ -111,6 +111,46 @@ report:
 
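 `cases.parallelism` is the default case parallelism in the config file. For a one-off run, override it with `skill-up run --parallelism N` without modifying `eval.yaml`. The command-line override must be between 1 and 256.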
 
+### Custom Engine
+
+When `engine.name` is not a built-in engine (`claude_code` / `codex` / `qodercli`), you must also write an `engine.custom` block to tell skill-up how to invoke your Agent. Only `transport: local` is implemented at present; `transport: http` has been designed but not yet implemented, and validation fails with "not yet implemented".
+
+```yaml
+engine:
+  name: my-agent
+  model:
+    provider: anthropic
+    name: claude-sonnet-4-6
+  custom:
+    transport: local              # local (implemented) | http (planned)
+    response_format: session_result  # session_result (default) | text
+    timeout_seconds: 300
+    env:                          # credentials / sensitive parameters: do not reference these in command/args
+      MY_AGENT_TOKEN: ${MY_AGENT_TOKEN}
+    kwargs:                       # non-sensitive switches, exposed in templates as ${kwargs.<key>}
+      profile: production
+    local:
+      command: /opt/my-agent/bin/run
+      args:
+        - --input
+        - ${input_file}           # path of the SessionInput JSON written by skill-up
+        - --output
+        - ${output_file}          # path your Agent should write its SessionResult JSON to
+      cwd: ${workspace}           # optional; confined to the runtime workspace
+      input_file: inputs/messages.json     # optional override (relative to workspace)
+      output_file: outputs/session-result.json
+```
+
+Template variables available in `command` / `args` / `cwd` / `env` / `input_file` / `output_file`:
+`${workspace}`, `${input_file}`, `${output_file}`, `${model}`, `${model_provider}`, `${model_name}`, `${case_id}`, `${variant}`, `${max_turns}`, `${timeout_seconds}`, `${kwargs.<key>}`, as well as the environment-variable forms `${VAR}` / `${VAR:-default}` / `${VAR?error message}`.
+
+Credential containment rules (strictly validated at config load):
+
+- `${api_key}` and any kwarg key that looks like a credential (`token` / `secret` / `api_key` / `apiKey` / `bearerToken`, etc.) may not appear in `command` / `args` / `cwd` / `input_file` / `output_file`; they must be injected into the child process's environment variables through `engine.custom.env`.
+- `${SOMEVAR:-...}` defaults that match common credential patterns (`sk-...`, `sk-ant-...`, `ghp_...`, `AIza...`, `AKIA...`, JWT, etc.) are likewise rejected in command-line contexts.
+
+The full JSON contract for SessionInput / SessionResult is in `docs/design/custom-engine.md`.
+
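 ### MCP Configuration Notes
 
 MCP supports `mode: real` and `mode: mocked`. `real` installs a real MCP Server into Agents such as `claude_code`, `qodercli`, or `codex`; `mocked` has `internal/mcp` generate a local stdio Mock Server, which is then installed into the Agent like an ordinary MCP configuration.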
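The credential rules in the Custom Engine section reject `${api_key}` and credential-looking `${kwargs.<key>}` references in command-line contexts. The check could look roughly like the sketch below. This is not the project's validator: `checkCommandLineValue`, `looksLikeCredentialKey`, and the substring list are hypothetical names and heuristics chosen to cover only the examples the README gives (`token`, `secret`, `api_key`, `apiKey`, `bearerToken`).

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// kwargRef matches ${kwargs.<key>} template references.
var kwargRef = regexp.MustCompile(`\$\{kwargs\.([A-Za-z0-9_-]+)\}`)

// credentialHints are substrings that mark a kwarg key as credential-like
// after normalisation. Assumption: derived only from the README examples.
var credentialHints = []string{"token", "secret", "apikey"}

// looksLikeCredentialKey lower-cases the key and strips "_" and "-" so that
// api_key, apiKey and API-KEY are all treated the same way.
func looksLikeCredentialKey(key string) bool {
	k := strings.ToLower(strings.NewReplacer("_", "", "-", "").Replace(key))
	for _, hint := range credentialHints {
		if strings.Contains(k, hint) {
			return true
		}
	}
	return false
}

// checkCommandLineValue rejects secret references in a value destined for
// command, args, cwd, input_file or output_file.
func checkCommandLineValue(field, value string) error {
	if strings.Contains(value, "${api_key}") {
		return fmt.Errorf("%s: ${api_key} must be passed through engine.custom.env", field)
	}
	for _, m := range kwargRef.FindAllStringSubmatch(value, -1) {
		if looksLikeCredentialKey(m[1]) {
			return fmt.Errorf("%s: ${kwargs.%s} looks like a credential; pass it through engine.custom.env", field, m[1])
		}
	}
	return nil
}

func main() {
	for _, arg := range []string{
		"--profile=${kwargs.profile}",
		"--token=${kwargs.bearerToken}",
		"--key=${api_key}",
	} {
		fmt.Printf("%-32s -> %v\n", arg, checkCommandLineValue("args", arg))
	}
}
```

A substring heuristic like this one errs on the side of rejection: a key such as `max_tokens` would also be flagged. Whether skill-up matches keys this loosely is not stated in the README.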

@@ -0,0 +1,133 @@
+//go:build e2e
+
+package e2e
+
+import (
+	"os"
+	"path/filepath"
+	"runtime"
+	"strings"
+	"testing"
+	"time"
+)
+
+// getCustomEngineTestdataDir returns the path to e2e/testdata/custom-engine.
+func getCustomEngineTestdataDir() string {
+	_, testFile, _, _ := runtime.Caller(0)
+	projectRoot := filepath.Dir(filepath.Dir(testFile))
+	return filepath.Join(projectRoot, "e2e", "testdata", "custom-engine")
+}
+
+// skipIfNoPOSIXShell skips the test on platforms where the agent.sh fixture
+// cannot run. agent.sh uses bash with `set -euo pipefail` and is not portable
+// to Windows; the test infrastructure for cross-platform fixtures is tracked
+// in issue #54.
+func skipIfNoPOSIXShell(t *testing.T) {
+	t.Helper()
+	if runtime.GOOS == "windows" {
+		t.Skip("agent.sh fixture is POSIX-only; see issue #54")
+	}
+}
+
+// customEngineEnv returns an env slice that points the custom engine's
+// ${CUSTOM_AGENT_BIN} placeholder at the fixture's agent.sh, while
+// preserving PATH and HOME so the test still resolves system tools.
+func customEngineEnv(agentBin string) []string {
+	return []string{
+		"PATH=" + os.Getenv("PATH"),
+		"HOME=" + os.Getenv("HOME"),
+		"CUSTOM_AGENT_BIN=" + agentBin,
+	}
+}
+
+// TestPipeline_CustomEngine_LocalTransport runs the entire evaluation pipeline
+// against a user-defined Custom Engine (transport: local). It verifies that:
+//  1. The eval loads and the custom engine block is env-resolved and validated.
+//  2. The framework writes the SessionInput JSON to ${input_file}.
+//  3. The agent's stdout is parsed back into a SessionResult.
+//  4. final_message reaches the report and an expect.must_contain rule passes.
+func TestPipeline_CustomEngine_LocalTransport(t *testing.T) {
+	skipIfNoPOSIXShell(t)
+	t.Parallel()
+
+	dir := getCustomEngineTestdataDir()
+	evalPath := filepath.Join(dir, "evals", "eval.yaml")
+	agentBin := filepath.Join(dir, "agent.sh")
+
+	outputDir := t.TempDir()
+	result := Run(t, RunConfig{
+		Env:     customEngineEnv(agentBin),
+		WorkDir: dir,
+		Timeout: 60 * time.Second,
+	}, "run", evalPath, "--no-delete", "--output-dir", outputDir)
+
+	if result.ExitCode != 0 {
+		t.Fatalf("custom engine run failed: exit=%d\nstdout:\n%s\nstderr:\n%s",
+			result.ExitCode, result.Stdout, result.Stderr)
+	}
+	if !strings.Contains(result.Stdout, "Running evaluation") {
+		t.Errorf("expected runner stage log in output, got:\n%s", result.Stdout)
+	}
+	if !strings.Contains(result.Stdout, "PASS") {
+		t.Errorf("expected at least one PASS case, got:\n%s", result.Stdout)
+	}
+	if !strings.Contains(result.Stdout, "1 passed") {
+		t.Errorf("expected the summary line to report 1 passed, got:\n%s", result.Stdout)
+	}
+}
+
+// TestPipeline_CustomEngine_Validate runs `skill-up validate` against the
+// custom engine eval.yaml. This exercises the post-load
+// config.ResolveCustomEngineConfig path that validate.go invokes, including
+// env-variable resolution of ${CUSTOM_AGENT_BIN}.
+func TestPipeline_CustomEngine_Validate(t *testing.T) {
+	skipIfNoPOSIXShell(t)
+	t.Parallel()
+
+	dir := getCustomEngineTestdataDir()
+	evalPath := filepath.Join(dir, "evals", "eval.yaml")
+	agentBin := filepath.Join(dir, "agent.sh")
+
+	result := Run(t, RunConfig{
+		Env:     customEngineEnv(agentBin),
+		WorkDir: dir,
+		Timeout: 30 * time.Second,
+	}, "validate", evalPath)
+
+	if result.ExitCode != 0 {
+		t.Fatalf("validate failed: exit=%d\nstdout:\n%s\nstderr:\n%s",
+			result.ExitCode, result.Stdout, result.Stderr)
+	}
+	if !strings.Contains(result.Stdout, "is valid") {
+		t.Errorf("expected validation success message, got:\n%s", result.Stdout)
+	}
+}
+
+// TestPipeline_CustomEngine_RejectsMissingEnvVar verifies the load-time env
+// resolution: when ${CUSTOM_AGENT_BIN} is unset, `run` must fail with a
+// configuration error before exec'ing anything.
+func TestPipeline_CustomEngine_RejectsMissingEnvVar(t *testing.T) {
+	skipIfNoPOSIXShell(t)
+	t.Parallel()
+
+	dir := getCustomEngineTestdataDir()
+	evalPath := filepath.Join(dir, "evals", "eval.yaml")
+
+	// Note: CUSTOM_AGENT_BIN intentionally unset.
+	result := Run(t, RunConfig{
+		Env: []string{
+			"PATH=" + os.Getenv("PATH"),
+			"HOME=" + os.Getenv("HOME"),
+		},
+		WorkDir: dir,
+		Timeout: 30 * time.Second,
+	}, "run", evalPath, "--no-delete", "--output-dir", t.TempDir())
+
+	if result.ExitCode == 0 {
+		t.Fatalf("expected run to fail with missing env var, got exit 0\nstdout:\n%s", result.Stdout)
+	}
+	combined := result.Stdout + result.Stderr
+	if !strings.Contains(combined, "CUSTOM_AGENT_BIN") {
+		t.Errorf("expected the error to name the missing env var, got:\n%s", combined)
+	}
+}
@@ -0,0 +1,9 @@
+# Custom Engine e2e fixture
+
+This directory is a minimal skill-up fixture that exercises the **Custom
+Engine** local transport end-to-end. It is consumed by `e2e/custom_engine_test.go`.
+
+`agent.sh` is a deterministic stand-in for a real custom agent: it checks that
+the `SessionInput` JSON the framework writes to `${input_file}` exists, ignores
+its content, and emits a fixed `SessionResult` on stdout. The case asserts that
+`final_message` flows from the agent's stdout into the report.
@@ -0,0 +1,28 @@
+#!/bin/bash
+# custom-engine/agent.sh — a deterministic mock for the Custom Engine local
+# transport. The agent receives the SessionInput JSON path as its first arg
+# (per the eval.yaml in this directory) and emits a SessionResult on stdout.
+#
+# The script intentionally ignores the input content and returns a fixed
+# response so the e2e test is deterministic. It still verifies that the
+# framework actually wrote the input file at the configured path, which is the
+# main piece of the transport contract this test exercises.
+set -euo pipefail
+
+INPUT_FILE="${1:-}"
+if [[ -z "$INPUT_FILE" || ! -f "$INPUT_FILE" ]]; then
+  echo "agent.sh: SessionInput file not provided or missing (got: '${INPUT_FILE:-}')" >&2
+  exit 1
+fi
+
+cat <<'JSON'
+{
+  "exit_code": 0,
+  "final_message": "custom-engine-handled",
+  "turns": 1,
+  "transcript": [
+    {"role": "user", "content": "ping"},
+    {"role": "assistant", "content": "custom-engine-handled"}
+  ]
+}
+JSON
@@ -0,0 +1,7 @@
+id: hello
+title: Custom engine emits expected final_message
+input:
+  prompt: ping the custom engine
+expect:
+  must_contain:
+    - "custom-engine-handled"
@@ -0,0 +1,28 @@
+schema_version: v1alpha1
+
+environment:
+  type: none
+
+mcp:
+  servers: []
+
+engine:
+  name: e2e-custom-agent
+  custom:
+    transport: local
+    response_format: session_result
+    local:
+      command: ${CUSTOM_AGENT_BIN}
+      args:
+        - ${input_file}
+
+cases:
+  files:
+    - evals/cases/hello.yaml
+  defaults:
+    timeout_seconds: 60
+    max_turns: 3
+  parallelism: 1
+
+report:
+  formats: [json]
@@ -11,6 +11,7 @@ import (
 	"path/filepath"
 	"strings"
 
+	"github.com/alibaba/skill-up/internal/config"
 	"github.com/alibaba/skill-up/internal/credential"
 	"github.com/alibaba/skill-up/internal/logging"
 	"github.com/alibaba/skill-up/internal/observability"
@@ -41,9 +42,21 @@ type SessionResult struct {
 
 // SessionArtifacts holds artifacts produced during an agent session.
 type SessionArtifacts struct {
-	WorkspaceDiff  string   `json:"workspace_diff,omitempty"`
-	GeneratedFiles []string `json:"generated_files,omitempty"` // Runtime file paths, e.g. ["outputs/stdout.json", "outputs/transcript.jsonl"]
-	Logs           string   `json:"logs,omitempty"`
+	WorkspaceDiff  string         `json:"workspace_diff,omitempty"`
+	GeneratedFiles []string       `json:"generated_files,omitempty"` // Runtime file paths, e.g. ["outputs/stdout.json", "outputs/transcript.jsonl"]
+	Files          []ArtifactFile `json:"files,omitempty"`           // Structured artifact declarations (Custom Engine).
+	Logs           string         `json:"logs,omitempty"`
+}
+
+// ArtifactFile is a structured artifact declaration returned by an agent.
+// Exactly one of Path, URL, Content, ContentBase64 should be set.
+type ArtifactFile struct {
+	Name          string `json:"name"`
+	Path          string `json:"path,omitempty"`
+	URL           string `json:"url,omitempty"`
+	Content       string `json:"content,omitempty"`
+	ContentBase64 string `json:"content_base64,omitempty"`
+	ContentType   string `json:"content_type,omitempty"`
 }
 
 // Config configures the agent.
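The `ArtifactFile` comment says exactly one of `Path`, `URL`, `Content`, and `ContentBase64` should be set. The diff does not show whether or where that is enforced, so the following is only a sketch of what such a check might look like. `validateArtifactFile` is a hypothetical helper, and the struct is copied here so the example is self-contained.

```go
package main

import "fmt"

// ArtifactFile mirrors the struct added in this change.
type ArtifactFile struct {
	Name          string `json:"name"`
	Path          string `json:"path,omitempty"`
	URL           string `json:"url,omitempty"`
	Content       string `json:"content,omitempty"`
	ContentBase64 string `json:"content_base64,omitempty"`
	ContentType   string `json:"content_type,omitempty"`
}

// validateArtifactFile reports an error unless exactly one content source is
// set. Assumption: an empty string means "not set".
func validateArtifactFile(f ArtifactFile) error {
	set := 0
	for _, v := range []string{f.Path, f.URL, f.Content, f.ContentBase64} {
		if v != "" {
			set++
		}
	}
	if set != 1 {
		return fmt.Errorf("artifact %q: expected exactly one of path, url, content, content_base64; got %d", f.Name, set)
	}
	return nil
}

func main() {
	files := []ArtifactFile{
		{Name: "report", Path: "outputs/report.md"},
		{Name: "both", Path: "outputs/a.txt", URL: "https://example.com/a.txt"},
		{Name: "none"},
	}
	for _, f := range files {
		fmt.Printf("%s: %v\n", f.Name, validateArtifactFile(f))
	}
}
```

Treating the empty string as unset means an intentionally empty inline `Content` cannot be declared. A pointer field or an explicit kind discriminator would avoid that, at the cost of a more verbose schema.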
@@ -65,6 +78,9 @@ type Config struct {
 	ModelProvider string
 	APIKey        string
 	BaseURL       string
+	// Custom carries the custom engine configuration when Name does not match
+	// a built-in agent. It is nil for built-in agents.
+	Custom *config.CustomEngineConfig
 	// Kwargs carries agent-specific key/value options forwarded from
 	// EngineConfig.Kwargs. Each agent reads only the keys it understands;
 	// unknown keys are ignored. See agent kwargs helpers in kwargs.go.
@@ -190,6 +206,19 @@ func (a *BaseAgent) credentialEnvVars(apiKeyEnv, baseURLEnv string) map[string]s
 	return envVars
 }
 
+// installSkillDefault is the fallback skill installer used when an agent does
+// not configure its own InstallSkillCmd template. It installs the skill source
+// at skillCfg.Target when set; otherwise under a.Cfg.SkillPath, in a directory
+// named after the source's base name. Defined on BaseAgent so both CLIAgent and
+// CustomAgent share it via embedding, without CustomAgent inheriting CLIAgent.
+func (a *BaseAgent) installSkillDefault(ctx context.Context, rt Runtime, skillCfg runtime.SkillConfig) error {
+	target := skillCfg.Target
+	if target == "" && a.Cfg.SkillPath != "" {
+		target = filepath.Join(a.Cfg.SkillPath, filepath.Base(skillCfg.Source))
+	}
+	return installSkill(ctx, rt, skillCfg.Source, target)
+}
+
 func persistRuntimeArtifact(ctx context.Context, rt Runtime, targetPath, content string) error {
 	f, err := os.CreateTemp("", "skill-up-artifact-*")
 	if err != nil {