2 changes: 1 addition & 1 deletion README.md
@@ -50,7 +50,7 @@
## Features

- **Declarative Eval Config**: Define evaluation environment, engine, model, and cases through YAML (`eval.yaml` + `cases/*.yaml`).
-- **Multi-Engine Support**: Works with Qoder CLI, Claude Code, and Codex as Agent Engines.
+- **Multi-Engine Support**: Works with Qoder CLI, Claude Code, and Codex as built-in Agent Engines, plus user-defined agents via `engine.custom` (local transport — see [docs/design/custom-engine.md](docs/design/custom-engine.md)).
- **Flexible Judging**: Supports `rule_based`, `script`, and `agent_judge` evaluation strategies.
- **Structured Reports**: Outputs Anthropic-compatible `grading.json`, `benchmark.json`, `benchmark.md`, plus `result.json`, JUnit XML, and HTML reports.
- **Anthropic Compatible**: Import `evals.json` via `skill-up import`, or auto-detect with `--auto`.
2 changes: 1 addition & 1 deletion README.zh.md
@@ -50,7 +50,7 @@
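## Features

- **Declarative Eval Config**: Define the evaluation environment, engine, model, and cases through YAML (`eval.yaml` + `cases/*.yaml`).
-- **Multi-Engine Support**: Supports the Qoder CLI, Claude Code, and Codex Agent engines
+- **Multi-Engine Support**: Built-in support for Qoder CLI, Claude Code, and Codex; user-defined Agents can also be plugged in via `engine.custom` (local transport; see [docs/design/custom-engine.md](docs/design/custom-engine.md))
- **Flexible Judging**: Supports three evaluation strategies: `rule_based` (rule matching), `script` (script scoring), and `agent_judge` (Agent scoring).
- **Structured Reports**: Outputs Anthropic-compatible `grading.json`, `benchmark.json`, and `benchmark.md`, plus `result.json`, JUnit XML, and HTML reports.
- **Anthropic Compatible**: Import `evals.json` via `skill-up import`, or auto-detect with `--auto`.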
847 changes: 847 additions & 0 deletions docs/design/custom-engine.md

Large diffs are not rendered by default.

42 changes: 41 additions & 1 deletion docs/guide/writing-evals.md
@@ -112,6 +112,46 @@ report:

`cases.parallelism` is the file-level default. To override it for a single run, use `skill-up run --parallelism N` without modifying `eval.yaml`. Allowed range: **1 to 256**.
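The override rule above can be sketched in Go. This is only an illustration, not skill-up's implementation: `resolveParallelism` and the pointer-based "flag not passed" representation are hypothetical, and the range check is applied to the CLI override only, since that is the case the guide spells out.

```go
package main

import "fmt"

const (
	minParallelism = 1
	maxParallelism = 256
)

// resolveParallelism returns the effective case parallelism for one run.
// fileDefault stands for cases.parallelism from eval.yaml. override is the
// value of --parallelism, or nil when the flag was not passed (a hypothetical
// representation chosen for this sketch).
func resolveParallelism(fileDefault int, override *int) (int, error) {
	if override == nil {
		return fileDefault, nil
	}
	if *override < minParallelism || *override > maxParallelism {
		return 0, fmt.Errorf("--parallelism %d is outside the allowed range %d to %d",
			*override, minParallelism, maxParallelism)
	}
	return *override, nil
}

func main() {
	n := 8
	got, err := resolveParallelism(4, &n)
	fmt.Println(got, err) // the CLI value wins over the file default

	tooBig := 512
	_, err = resolveParallelism(4, &tooBig)
	fmt.Println(err) // rejected: outside 1 to 256
}
```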

### Custom Engine

When `engine.name` is not one of the built-ins (`claude_code`, `codex`, `qodercli`), declare an `engine.custom` block so skill-up knows how to invoke your agent. Only `transport: local` is implemented today; `transport: http` is reserved and currently fails validation with "not yet implemented".

```yaml
engine:
  name: my-agent
  model:
    provider: anthropic
    name: claude-sonnet-4-6
  custom:
    transport: local                 # local (implemented) | http (planned)
    response_format: session_result  # session_result (default) | text
    timeout_seconds: 300
    env:                             # credentials and secrets; NEVER reference these in command/args
      MY_AGENT_TOKEN: ${MY_AGENT_TOKEN}
    kwargs:                          # non-secret knobs exposed as ${kwargs.<key>}
      profile: production
    local:
      command: /opt/my-agent/bin/run
      args:
        - --input
        - ${input_file}              # path to the SessionInput JSON skill-up writes
        - --output
        - ${output_file}             # path your agent should write its SessionResult JSON to
      cwd: ${workspace}              # optional; confined to the runtime workspace
      input_file: inputs/messages.json            # optional override (relative to workspace)
      output_file: outputs/session-result.json    # optional override
```

Template variables available in `command` / `args` / `cwd` / `env` / `input_file` / `output_file`:
`${workspace}`, `${input_file}`, `${output_file}`, `${model}`, `${model_provider}`, `${model_name}`, `${case_id}`, `${variant}`, `${max_turns}`, `${timeout_seconds}`, `${kwargs.<key>}`, plus environment variables via `${VAR}` / `${VAR:-default}` / `${VAR?error message}`.
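To make the three placeholder forms concrete, here is a minimal Go sketch of an expander. It is not skill-up's resolver: `expand`, the `lookup` callback, and the exact fallback behaviour (for example, whether an empty value triggers the `:-` default) are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// placeholder matches ${name}, ${name:-default} and ${name?message}.
// Dots are allowed in names so that ${kwargs.<key>} can be looked up too.
var placeholder = regexp.MustCompile(`\$\{([A-Za-z_][A-Za-z0-9_.]*)(?:(:-|\?)([^}]*))?\}`)

// expand substitutes every placeholder in tmpl using lookup (hypothetical helper).
// An unset variable falls back to its default, or produces an error.
func expand(tmpl string, lookup func(string) (string, bool)) (string, error) {
	var firstErr error
	out := placeholder.ReplaceAllStringFunc(tmpl, func(m string) string {
		g := placeholder.FindStringSubmatch(m)
		name, op, arg := g[1], g[2], g[3]
		if v, ok := lookup(name); ok {
			return v
		}
		switch op {
		case ":-":
			return arg // use the inline default
		case "?":
			if firstErr == nil {
				firstErr = fmt.Errorf("%s: %s", name, arg)
			}
		default:
			if firstErr == nil {
				firstErr = fmt.Errorf("%s: variable is not set", name)
			}
		}
		return ""
	})
	if firstErr != nil {
		return "", firstErr
	}
	return out, nil
}

func main() {
	vars := map[string]string{
		"input_file":     "inputs/messages.json",
		"kwargs.profile": "production",
	}
	lookup := func(k string) (string, bool) { v, ok := vars[k]; return v, ok }

	for _, t := range []string{
		"--input=${input_file}",
		"--profile=${kwargs.profile}",
		"${REGION:-us-east-1}",
		"${MY_AGENT_TOKEN?set MY_AGENT_TOKEN first}",
	} {
		s, err := expand(t, lookup)
		fmt.Printf("%q -> %q, err=%v\n", t, s, err)
	}
}
```

`REGION` and the values in `vars` are made up for the demonstration; only the placeholder syntax comes from the guide.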

Secret-handling rules (enforced at config load):

- `${api_key}` and any kwarg whose key looks like a credential (`token`, `secret`, `api_key`, `apiKey`, `bearerToken`, …) cannot be referenced from `command` / `args` / `cwd` / `input_file` / `output_file`. Pass them through `engine.custom.env`, where they reach your agent as process environment variables instead of leaking into process listings.
- `${SOMEVAR:-...}` defaults that contain recognizable credential shapes (`sk-...`, `sk-ant-...`, `ghp_...`, `AIza...`, `AKIA...`, JWTs) are likewise rejected in command-line contexts.
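The two rules above amount to a key check and a value-shape check. The Go sketch below shows one way such checks could look; the hint list, the regular expression, and the minimum lengths are assumptions for illustration and are not taken from skill-up's source.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// credentialKeyHints is a hypothetical hint list based on the examples in the
// guide (token, secret, api_key / apiKey, bearerToken).
var credentialKeyHints = []string{"token", "secret", "apikey"}

// looksLikeCredentialKey reports whether a kwarg key resembles a credential.
// Underscores, dashes and letter case are ignored before matching.
func looksLikeCredentialKey(key string) bool {
	normalized := strings.ToLower(strings.NewReplacer("_", "", "-", "").Replace(key))
	for _, hint := range credentialKeyHints {
		if strings.Contains(normalized, hint) {
			return true
		}
	}
	return false
}

// credentialShape approximates the shapes listed in the guide: sk-..., sk-ant-...,
// ghp_..., AIza..., AKIA..., and three-segment JWTs. Lengths are assumed.
var credentialShape = regexp.MustCompile(
	`\b(sk-[A-Za-z0-9_-]{8,}|ghp_[A-Za-z0-9]{8,}|AIza[A-Za-z0-9_-]{8,}|AKIA[A-Z0-9]{8,}|eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+)`)

// looksLikeCredentialValue reports whether a default value contains a
// recognizable credential shape.
func looksLikeCredentialValue(v string) bool {
	return credentialShape.MatchString(v)
}

func main() {
	for _, k := range []string{"profile", "bearerToken", "api_key"} {
		fmt.Println(k, looksLikeCredentialKey(k))
	}
	for _, v := range []string{"production", "sk-ant-abc12345", "ghp_abcdefgh1234"} {
		fmt.Println(v, looksLikeCredentialValue(v))
	}
}
```

A loader built along these lines would reject the offending reference in `command` / `args` and point the user at `engine.custom.env` instead.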

See `docs/design/custom-engine.md` for the full SessionInput / SessionResult schema your agent must conform to.

### Engine kwargs (agent-specific switches)

`engine.kwargs` is a free-form string map. Each agent reads only the keys it recognises; unknown keys are ignored. Unrecognised keys (typos like `bypas_sandbox`) emit a DEBUG log line — run with `-v` to surface them. CLI override: `--engine-kwarg key=value` (alias `--ek`), repeatable. Precedence: `--engine-kwarg` > `engine.kwargs` > default.
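The precedence chain (`--engine-kwarg` > `engine.kwargs` > default) is a layered map merge. The Go sketch below illustrates it; `parseEngineKwarg` and `mergeKwargs` are hypothetical helper names, and the `profile` key is invented for the demonstration.

```go
package main

import (
	"fmt"
	"strings"
)

// parseEngineKwarg splits one "key=value" argument as passed to --engine-kwarg.
func parseEngineKwarg(arg string) (string, string, error) {
	key, value, ok := strings.Cut(arg, "=")
	if !ok || key == "" {
		return "", "", fmt.Errorf("engine kwarg %q must look like key=value", arg)
	}
	return key, value, nil
}

// mergeKwargs merges layers from lowest to highest precedence:
// later layers override earlier ones key by key.
func mergeKwargs(layers ...map[string]string) map[string]string {
	merged := make(map[string]string)
	for _, layer := range layers {
		for k, v := range layer {
			merged[k] = v
		}
	}
	return merged
}

func main() {
	defaults := map[string]string{"bypass_sandbox": "false"}
	fromFile := map[string]string{"profile": "ci"} // engine.kwargs in eval.yaml
	fromCLI := map[string]string{}

	k, v, err := parseEngineKwarg("bypass_sandbox=true")
	if err != nil {
		panic(err)
	}
	fromCLI[k] = v

	fmt.Println(mergeKwargs(defaults, fromFile, fromCLI))
}
```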
@@ -127,7 +167,7 @@ report:
skill-up run evals/eval.yaml --engine-kwarg bypass_sandbox=true
```

-### MCP configuration
+>### MCP configuration

MCP supports `mode: real` and `mode: mocked`. `real` installs a real MCP server into Agents such as `claude_code`, `qodercli`, or `codex`; `mocked` makes `internal/mcp` generate a local stdio mock server that is then installed into the Agent like any other MCP server.

40 changes: 40 additions & 0 deletions docs/zh/guide/writing-evals.md
@@ -111,6 +111,46 @@ report:

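`cases.parallelism` is the default case parallelism set in the config file. For a one-off run you can override it with `skill-up run --parallelism N` without editing `eval.yaml`. The command-line override must be between 1 and 256.

### Custom Engine

When `engine.name` is not a built-in engine (`claude_code` / `codex` / `qodercli`), you must also add an `engine.custom` block that tells skill-up how to invoke your Agent. Only `transport: local` is implemented today; `transport: http` is designed but not yet implemented, and validation fails with "not yet implemented".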

```yaml
engine:
  name: my-agent
  model:
    provider: anthropic
    name: claude-sonnet-4-6
  custom:
    transport: local                 # local (implemented) | http (planned)
    response_format: session_result  # session_result (default) | text
    timeout_seconds: 300
    env:                             # credentials / sensitive parameters; do not reference these in command/args
      MY_AGENT_TOKEN: ${MY_AGENT_TOKEN}
    kwargs:                          # non-sensitive switches, exposed in templates as ${kwargs.<key>}
      profile: production
    local:
      command: /opt/my-agent/bin/run
      args:
        - --input
        - ${input_file}              # path to the SessionInput JSON that skill-up writes
        - --output
        - ${output_file}             # path your Agent should write its SessionResult JSON to
      cwd: ${workspace}              # optional; confined to the runtime workspace
      input_file: inputs/messages.json            # optional override (relative to workspace)
      output_file: outputs/session-result.json
```

Template variables available in `command` / `args` / `cwd` / `env` / `input_file` / `output_file`:
`${workspace}`, `${input_file}`, `${output_file}`, `${model}`, `${model_provider}`, `${model_name}`, `${case_id}`, `${variant}`, `${max_turns}`, `${timeout_seconds}`, `${kwargs.<key>}`, plus the environment-variable forms `${VAR}` / `${VAR:-default}` / `${VAR?error message}`.

Credential-handling rules (strictly enforced at config load):

- `${api_key}` and any kwarg key that looks like a credential (`token` / `secret` / `api_key` / `apiKey` / `bearerToken`, and so on) may not appear in `command` / `args` / `cwd` / `input_file` / `output_file`. They must be injected into the child process environment through `engine.custom.env`.
- `${SOMEVAR:-...}` defaults that match common credential shapes (`sk-...`, `sk-ant-...`, `ghp_...`, `AIza...`, `AKIA...`, JWTs, and so on) are likewise rejected in command-line contexts.

See `docs/design/custom-engine.md` for the full SessionInput / SessionResult JSON contract.

### MCP configuration notes

MCP supports `mode: real` and `mode: mocked`. `real` installs a real MCP server into Agents such as `claude_code`, `qodercli`, or `codex`; with `mocked`, `internal/mcp` generates a local stdio mock server, which is then installed into the Agent like any other MCP configuration.
133 changes: 133 additions & 0 deletions e2e/custom_engine_test.go
@@ -0,0 +1,133 @@
//go:build e2e

package e2e

import (
	"os"
	"path/filepath"
	"runtime"
	"strings"
	"testing"
	"time"
)

// getCustomEngineTestdataDir returns the path to e2e/testdata/custom-engine.
func getCustomEngineTestdataDir() string {
	_, testFile, _, _ := runtime.Caller(0)
	projectRoot := filepath.Dir(filepath.Dir(testFile))
	return filepath.Join(projectRoot, "e2e", "testdata", "custom-engine")
}

// skipIfNoPOSIXShell skips the test on platforms where the agent.sh fixture
// cannot run. agent.sh uses bash with `set -euo pipefail` and is not portable
// to Windows; the test infrastructure for cross-platform fixtures is tracked
// in issue #54.
func skipIfNoPOSIXShell(t *testing.T) {
	t.Helper()
	if runtime.GOOS == "windows" {
		t.Skip("agent.sh fixture is POSIX-only; see issue #54")
	}
}

// customEngineEnv returns an env slice that points the custom engine's
// ${CUSTOM_AGENT_BIN} placeholder at the fixture's agent.sh, while
// preserving PATH and HOME so the test still resolves system tools.
func customEngineEnv(agentBin string) []string {
	return []string{
		"PATH=" + os.Getenv("PATH"),
		"HOME=" + os.Getenv("HOME"),
		"CUSTOM_AGENT_BIN=" + agentBin,
	}
}

// TestPipeline_CustomEngine_LocalTransport runs the entire evaluation pipeline
// against a user-defined Custom Engine (transport: local). It verifies that:
// 1. The eval loads and the custom engine block is env-resolved and validated.
// 2. The framework writes the SessionInput JSON to ${input_file}.
// 3. The agent's stdout is parsed back into a SessionResult.
// 4. final_message reaches the report and an expect.must_contain rule passes.
func TestPipeline_CustomEngine_LocalTransport(t *testing.T) {
	skipIfNoPOSIXShell(t)
	t.Parallel()

	dir := getCustomEngineTestdataDir()
	evalPath := filepath.Join(dir, "evals", "eval.yaml")
	agentBin := filepath.Join(dir, "agent.sh")

	outputDir := t.TempDir()
	result := Run(t, RunConfig{
		Env:     customEngineEnv(agentBin),
		WorkDir: dir,
		Timeout: 60 * time.Second,
	}, "run", evalPath, "--no-delete", "--output-dir", outputDir)

	if result.ExitCode != 0 {
		t.Fatalf("custom engine run failed: exit=%d\nstdout:\n%s\nstderr:\n%s",
			result.ExitCode, result.Stdout, result.Stderr)
	}
	if !strings.Contains(result.Stdout, "Running evaluation") {
		t.Errorf("expected runner stage log in output, got:\n%s", result.Stdout)
	}
	if !strings.Contains(result.Stdout, "PASS") {
		t.Errorf("expected at least one PASS case, got:\n%s", result.Stdout)
	}
	if !strings.Contains(result.Stdout, "1 passed") {
		t.Errorf("expected the summary line to report 1 passed, got:\n%s", result.Stdout)
	}
}

// TestPipeline_CustomEngine_Validate runs `skill-up validate` against the
// custom engine eval.yaml. This exercises the post-load
// config.ResolveCustomEngineConfig path that validate.go invokes, including
// env-variable resolution of ${CUSTOM_AGENT_BIN}.
func TestPipeline_CustomEngine_Validate(t *testing.T) {
	skipIfNoPOSIXShell(t)
	t.Parallel()

	dir := getCustomEngineTestdataDir()
	evalPath := filepath.Join(dir, "evals", "eval.yaml")
	agentBin := filepath.Join(dir, "agent.sh")

	result := Run(t, RunConfig{
		Env:     customEngineEnv(agentBin),
		WorkDir: dir,
		Timeout: 30 * time.Second,
	}, "validate", evalPath)

	if result.ExitCode != 0 {
		t.Fatalf("validate failed: exit=%d\nstdout:\n%s\nstderr:\n%s",
			result.ExitCode, result.Stdout, result.Stderr)
	}
	if !strings.Contains(result.Stdout, "is valid") {
		t.Errorf("expected validation success message, got:\n%s", result.Stdout)
	}
}

// TestPipeline_CustomEngine_RejectsMissingEnvVar verifies the load-time env
// resolution: when ${CUSTOM_AGENT_BIN} is unset, `run` must fail with a
// configuration error before exec'ing anything.
func TestPipeline_CustomEngine_RejectsMissingEnvVar(t *testing.T) {
	skipIfNoPOSIXShell(t)
	t.Parallel()

	dir := getCustomEngineTestdataDir()
	evalPath := filepath.Join(dir, "evals", "eval.yaml")

	// Note: CUSTOM_AGENT_BIN intentionally unset.
	result := Run(t, RunConfig{
		Env: []string{
			"PATH=" + os.Getenv("PATH"),
			"HOME=" + os.Getenv("HOME"),
		},
		WorkDir: dir,
		Timeout: 30 * time.Second,
	}, "run", evalPath, "--no-delete", "--output-dir", t.TempDir())

	if result.ExitCode == 0 {
		t.Fatalf("expected run to fail with missing env var, got exit 0\nstdout:\n%s", result.Stdout)
	}
	combined := result.Stdout + result.Stderr
	if !strings.Contains(combined, "CUSTOM_AGENT_BIN") {
		t.Errorf("expected the error to name the missing env var, got:\n%s", combined)
	}
}
9 changes: 9 additions & 0 deletions e2e/testdata/custom-engine/SKILL.md
@@ -0,0 +1,9 @@
# Custom Engine e2e fixture

This directory is a minimal skill-up fixture that exercises the **Custom
Engine** local transport end-to-end. It is consumed by `e2e/custom_engine_test.go`.

`agent.sh` is a deterministic stand-in for a real custom agent: it reads the
`SessionInput` JSON the framework writes to `${input_file}` and emits a fixed
`SessionResult` on stdout. The case asserts that `final_message` flows from
the agent's stdout into the report.
28 changes: 28 additions & 0 deletions e2e/testdata/custom-engine/agent.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,28 @@
#!/bin/bash
# custom-engine/agent.sh — a deterministic mock for the Custom Engine local
# transport. The agent receives the SessionInput JSON path as its first arg
# (per the eval.yaml in this directory) and emits a SessionResult on stdout.
#
# The script intentionally ignores the input content and returns a fixed
# response so the e2e test is deterministic. It still verifies that the
# framework actually wrote the input file at the configured path, which is the
# main piece of the transport contract this test exercises.
set -euo pipefail

INPUT_FILE="${1:-}"
if [[ -z "$INPUT_FILE" || ! -f "$INPUT_FILE" ]]; then
  echo "agent.sh: SessionInput file not provided or missing (got: '${INPUT_FILE:-}')" >&2
  exit 1
fi

cat <<'JSON'
{
  "exit_code": 0,
  "final_message": "custom-engine-handled",
  "turns": 1,
  "transcript": [
    {"role": "user", "content": "ping"},
    {"role": "assistant", "content": "custom-engine-handled"}
  ]
}
JSON
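Review note: the JSON emitted above is what the framework parses back from the agent's stdout. A rough Go sketch of that decoding step follows. The `sessionResult` and `turn` types here are hypothetical and cover only the four fields this fixture emits; they are not the real `agent.SessionResult` definition.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// turn and sessionResult mirror only the fields present in agent.sh's output
// (hypothetical types for this sketch).
type turn struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type sessionResult struct {
	ExitCode     int    `json:"exit_code"`
	FinalMessage string `json:"final_message"`
	Turns        int    `json:"turns"`
	Transcript   []turn `json:"transcript"`
}

// parseSessionResult decodes the agent's stdout into a sessionResult.
func parseSessionResult(stdout []byte) (sessionResult, error) {
	var res sessionResult
	if err := json.Unmarshal(stdout, &res); err != nil {
		return sessionResult{}, fmt.Errorf("decode SessionResult: %w", err)
	}
	return res, nil
}

func main() {
	stdout := []byte(`{
	  "exit_code": 0,
	  "final_message": "custom-engine-handled",
	  "turns": 1,
	  "transcript": [
	    {"role": "user", "content": "ping"},
	    {"role": "assistant", "content": "custom-engine-handled"}
	  ]
	}`)
	res, err := parseSessionResult(stdout)
	if err != nil {
		panic(err)
	}
	fmt.Println(res.FinalMessage, res.Turns, len(res.Transcript))
}
```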
7 changes: 7 additions & 0 deletions e2e/testdata/custom-engine/evals/cases/hello.yaml
@@ -0,0 +1,7 @@
id: hello
title: Custom engine emits expected final_message
input:
  prompt: ping the custom engine
expect:
  must_contain:
    - "custom-engine-handled"
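Review note: the `expect.must_contain` rule in this case boils down to a substring check on the agent's output. The Go sketch below shows the idea; `missingSubstrings` is a hypothetical helper, and details such as case sensitivity or which text the real `rule_based` judge searches are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// missingSubstrings returns every entry of mustContain that does not occur in
// text. An empty result means the rule passes.
func missingSubstrings(text string, mustContain []string) []string {
	var missing []string
	for _, want := range mustContain {
		if !strings.Contains(text, want) {
			missing = append(missing, want)
		}
	}
	return missing
}

func main() {
	finalMessage := "custom-engine-handled"
	missing := missingSubstrings(finalMessage, []string{"custom-engine-handled"})
	fmt.Println(len(missing) == 0) // the case passes when nothing is missing
}
```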
28 changes: 28 additions & 0 deletions e2e/testdata/custom-engine/evals/eval.yaml
@@ -0,0 +1,28 @@
schema_version: v1alpha1

environment:
  type: none

mcp:
  servers: []

engine:
  name: e2e-custom-agent
  custom:
    transport: local
    response_format: session_result
    local:
      command: ${CUSTOM_AGENT_BIN}
      args:
        - ${input_file}

cases:
  files:
    - evals/cases/hello.yaml
  defaults:
    timeout_seconds: 60
    max_turns: 3
  parallelism: 1

report:
  formats: [json]
35 changes: 32 additions & 3 deletions internal/agent/agent.go
@@ -11,6 +11,7 @@ import (
	"path/filepath"
	"strings"

	"github.com/alibaba/skill-up/internal/config"
	"github.com/alibaba/skill-up/internal/credential"
	"github.com/alibaba/skill-up/internal/logging"
	"github.com/alibaba/skill-up/internal/observability"
@@ -41,9 +42,21 @@ type SessionResult struct {

// SessionArtifacts holds artifacts produced during an agent session.
type SessionArtifacts struct {
-	WorkspaceDiff  string   `json:"workspace_diff,omitempty"`
-	GeneratedFiles []string `json:"generated_files,omitempty"` // Runtime file paths, e.g. ["outputs/stdout.json", "outputs/transcript.jsonl"]
-	Logs           string   `json:"logs,omitempty"`
+	WorkspaceDiff  string         `json:"workspace_diff,omitempty"`
+	GeneratedFiles []string       `json:"generated_files,omitempty"` // Runtime file paths, e.g. ["outputs/stdout.json", "outputs/transcript.jsonl"]
+	Files          []ArtifactFile `json:"files,omitempty"`           // Structured artifact declarations (Custom Engine).
+	Logs           string         `json:"logs,omitempty"`
}

// ArtifactFile is a structured artifact declaration returned by an agent.
// Exactly one of Path, URL, Content, ContentBase64 should be set.
type ArtifactFile struct {
	Name          string `json:"name"`
	Path          string `json:"path,omitempty"`
	URL           string `json:"url,omitempty"`
	Content       string `json:"content,omitempty"`
	ContentBase64 string `json:"content_base64,omitempty"`
	ContentType   string `json:"content_type,omitempty"`
}
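Review note: the "exactly one of Path, URL, Content, ContentBase64" rule is stated only in the doc comment. A validation helper along the following lines could enforce it. This is a sketch, not code from this PR: `validateArtifactFile` is a hypothetical name, and the struct is repeated only so the example is self-contained.

```go
package main

import "fmt"

// ArtifactFile is copied from the diff above so the sketch compiles on its own.
type ArtifactFile struct {
	Name          string `json:"name"`
	Path          string `json:"path,omitempty"`
	URL           string `json:"url,omitempty"`
	Content       string `json:"content,omitempty"`
	ContentBase64 string `json:"content_base64,omitempty"`
	ContentType   string `json:"content_type,omitempty"`
}

// validateArtifactFile (hypothetical) checks that exactly one content source
// is set among Path, URL, Content and ContentBase64.
func validateArtifactFile(f ArtifactFile) error {
	set := 0
	for _, v := range []string{f.Path, f.URL, f.Content, f.ContentBase64} {
		if v != "" {
			set++
		}
	}
	if set != 1 {
		return fmt.Errorf("artifact %q: exactly one of path, url, content, content_base64 must be set, got %d", f.Name, set)
	}
	return nil
}

func main() {
	ok := ArtifactFile{Name: "report", Path: "outputs/report.md"}
	bad := ArtifactFile{Name: "report", Path: "outputs/report.md", URL: "https://example.com/report.md"}
	fmt.Println(validateArtifactFile(ok))
	fmt.Println(validateArtifactFile(bad))
}
```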

// Config configures the agent.
@@ -65,6 +78,9 @@ type Config struct {
	ModelProvider string
	APIKey        string
	BaseURL       string
	// Custom carries the custom engine configuration when Name does not match
	// a built-in agent. It is nil for built-in agents.
	Custom *config.CustomEngineConfig
	// Kwargs carries agent-specific key/value options forwarded from
	// EngineConfig.Kwargs. Each agent reads only the keys it understands;
	// unknown keys are ignored. See agent kwargs helpers in kwargs.go.
@@ -190,6 +206,19 @@ func (a *BaseAgent) credentialEnvVars(apiKeyEnv, baseURLEnv string) map[string]s
	return envVars
}

// installSkillDefault is the fallback skill installer used when an agent does
// not configure its own InstallSkillCmd template. It installs the skill source
// under a.Cfg.SkillPath (or the caller-supplied target). Defined on BaseAgent
// so both CLIAgent and CustomAgent share it via embedding, without making
// CustomAgent inherit the rest of CLIAgent.
func (a *BaseAgent) installSkillDefault(ctx context.Context, rt Runtime, skillCfg runtime.SkillConfig) error {
	target := skillCfg.Target
	if target == "" && a.Cfg.SkillPath != "" {
		target = filepath.Join(a.Cfg.SkillPath, filepath.Base(skillCfg.Source))
	}
	return installSkill(ctx, rt, skillCfg.Source, target)
}

func persistRuntimeArtifact(ctx context.Context, rt Runtime, targetPath, content string) error {
	f, err := os.CreateTemp("", "skill-up-artifact-*")
	if err != nil {