Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
29 changes: 28 additions & 1 deletion references/prompts-integration.md
Original file line number Diff line number Diff line change
Expand Up @@ -169,6 +169,30 @@ The prompt library composes cleanly with the existing primitive contract:

---

## The evaluation engine (Phase E) — from telemetry to judgment

The 5-step mandate above writes **telemetry**: *what ran* (model, tokens, cost, latency) and *how the human felt* (thumbs). Phase E adds the layer above it — **judgment**: *how good the output actually was*, measured by an LLM-as-judge against versioned rubrics. Design spec: `docs/superpowers/specs/2026-06-02-prompts-evals-engine-design.md`.

What a bstack-driven agent needs to know:

- **Completing an invocation is what triggers an eval.** Step 4 of the mandate (`broomva prompts complete`) is the ingest point. The server applies an adaptive sampling decision (per-rubric YAML config) and, if sampled, enqueues a judge job. The judge — the Life **`krisis`** crate (κρίσις, *judgment*; sibling of `vigil`) — scores the output against a **G-Eval** rubric (each dimension: Definition + Evaluation Steps + Score range + anchors) and writes a `PromptEvaluation` row.
- **Capture the output the cheap way: run the Vigil sidecar.** `broomva vigil install` writes a SessionStart hook and an on-demand local OTLP proxy; point your agent at it via `ANTHROPIC_BASE_URL`/`OPENAI_BASE_URL` and every LLM call is captured as a GenAI span and joined to the invocation by session-id — no extra step in the loop. Fallback if you don't run the sidecar: `broomva prompts complete --output-file <path>` hands the output over explicitly. No output available → the eval is `skipped` with a reason (a measured coverage gap, never silent).
- **The privacy line is load-bearing: store judgments, never content.** By default the captured output is read transiently to score it, then **dropped** — only scores + reasoning persist. Retention is opt-in (first-party or `sudo` mode) and isolated. Do not design any agent flow that assumes the platform retains the output; it does not, unless capture is explicitly enabled. Entity: `concept/store-judgments-not-content`.
- **Inspect evals from the terminal:**

```bash
broomva evals show <invocation_id> --wait # block until the judge finishes; print the breakdown
broomva evals list --slug <slug> --failing # what's scoring poorly
broomva evals tail # live feed of completing evals
broomva evals rubric validate <path> # local YAML lint, no server hit
```

- **`evals show <id> --wait` right after `prompts complete` is the killer loop** — the "did this run actually go well?" answer in the terminal where the work happened.

This deepens the P6/P11/P13 composition above: the eval score is the crystallized quality signal (P6), the judgment is empirical validation of the prompt itself (P11), and the evolving per-prompt aggregate is the dream-tier substrate (P13).

---

## Common traps

- **Pulling a prompt and not completing it.** The most common silent failure. The CLI emits the invocation id loudly; capture it. Set a TodoWrite reminder if the work is long-running.
Expand Down Expand Up @@ -196,6 +220,9 @@ LIST: broomva prompts list --metrics --sort skill_invokes
## See also

- `broomva-cli` skill SKILL.md — the source-of-truth mandate, mirrored here
- `docs/superpowers/specs/2026-05-09-prompts-eval-engine-design.md` — full design spec for the eval engine
- `docs/superpowers/specs/2026-06-02-prompts-evals-engine-design.md` — **Phase E** (evaluation/judgment) design spec: rubrics, `krisis`, Vigil capture, privacy posture
- `docs/superpowers/plans/2026-05-11-prompts-eval-engine-phase2-cli.md` — Phase 2 (CLI telemetry) plan
- `research/entities/concept/store-judgments-not-content.md` — the governing privacy invariant
- `research/entities/project/prompts-eval-engine.md` — project tracking node (11 decisions, topology, phasing)
- broomva.tech/prompts — the browsable web surface
- `bstack-control-harness-bootstrap` (slug) — the prompt to use when standing up a fresh agent harness
16 changes: 9 additions & 7 deletions references/skills-roster.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,20 +2,22 @@

28 curated skills across 7 layers. The Broomva Stack.

> **Canonical *installable* roster:** [`references/companion-skills.yaml`](companion-skills.yaml) — the 56 bstack-native skills, all in the **broomva/skills** monorepo (validated by `tests/roster-monorepo-sync.test.sh`). Install path-independently: `npx skills add broomva/skills --skill <name>`. **This doc is a broader human catalog**: it also documents ecosystem *products* (e.g. symphony, autoany, next-forge) that are not installable bstack skills, so it intentionally lists more than the YAML. For what is installable, the YAML is authoritative.
> **Canonical *installable* roster:** [`references/companion-skills.yaml`](companion-skills.yaml) — the 61 bstack-native skills, all in the **broomva/skills** monorepo (validated by `tests/roster-monorepo-sync.test.sh`). Install path-independently: `npx skills add broomva/skills --skill <name>`. **This doc is a broader human catalog**: it also documents ecosystem *products* (e.g. symphony, autoany, next-forge) that are not installable bstack skills, so it intentionally lists more than the YAML. For what is installable, the YAML is authoritative.

## Publishing standard (skill layout) — required for installability
## Publishing standard (skill layout) — installability + consolidation

A skill is a **folder**, per the [Agent Skills standard](https://agentskills.io) — `SKILL.md` + optional `scripts/` / `references/` / `assets/`. **Publish every multi-file skill in a `skills/<name>/` subdirectory of its repo — never as a bare top-level `SKILL.md` at the repo root.**
A skill is a **folder**, per the [Agent Skills standard](https://agentskills.io) — `SKILL.md` + optional `scripts/` / `references/` / `assets/`. A top-level `SKILL.md` at a repo root is **standard-valid** (both the spec and the skills.sh README list the repo root as a discovery location). The defect is *not* the layout — it's an **open upstream CLI bug**.

| Layout | Remote `npx skills add` | |
|---|---|---|
| `skills/<name>/{SKILL.md, scripts/…}` | installs the **full** skill | ✅ standard |
| top-level `SKILL.md` + sibling `scripts/` at repo root | installs **only `SKILL.md`** — scripts dropped → non-functional | 🔴 broken (BRO-1561) |
| `skills/<name>/{SKILL.md, scripts/…}` (subdir) | installs the **full** skill | ✅ installs clean |
| top-level `SKILL.md` + sibling `scripts/` at repo root | installs **only `SKILL.md`** — siblings dropped → non-functional | ⚠️ standard-valid layout, but hits [skills#1523](https://github.com/vercel-labs/skills/issues/1523) |

**Why:** the vercel-labs/skills CLI special-cases a repo-root `SKILL.md` and copies only that file (a repo root carries `README`/`LICENSE`/`.github` — not a clean skill folder). A `skills/<name>/` subdir is a clean folder and is fully copied. A repo can keep its name and still be standard: `git mv SKILL.md scripts/ → skills/<name>/`; `npx skills add broomva/<name>` still works (the CLI's `skills/` priority-search finds the subdir).
**Why:** the vercel-labs/skills CLI special-cases a repo-root `SKILL.md` and copies only that file ([#1523](https://github.com/vercel-labs/skills/issues/1523) / #1517 — **open, unfixed**). *Local-path* install copies everything; only *remote* repo-root install drops siblings. A `skills/<name>/` subdir is a clean folder and is fully copied, so it sidesteps the bug.

**Validation:** `--list` is **necessary but not sufficient** — it only parses frontmatter, never the file-copy path, so it passes even when the install drops `scripts/`. The real gate is a **clean-room install that yields a *runnable* skill** (bundled files land + the skill's own test passes). The `skillify` skill machine-enforces this (`skillify_check.py` step 1 FAILs a repo-root-with-bundled-dirs layout). Full proof + root cause: `research/entities/tool/skills-sh.md`.
**Canonical fix — consolidate into the `broomva/skills` monorepo (BRO-1561):** rather than push each standalone repo into an awkward `skills/<name>/`-inside-a-same-named-repo workaround, skills vendor into **`broomva/skills/skills/<name>/`** and install via `npx skills add broomva/skills --skill <name>`. The subdir is non-redundant there, the monorepo has generic lint + per-skill tests, and standalone repos become **deprecated redirect stubs** (6-month window) pointing at the monorepo command. Tier-2 graduation in `broomva/skills` CONTRIBUTING is the path.

**Validation:** `--list` is **necessary but not sufficient** — it only parses frontmatter, never the file-copy path, so it passes even when the install drops `scripts/`. The real gate is a **clean-room install that yields a *runnable* skill** (bundled files land + the skill's own test passes). The `skillify` skill enforces frontmatter/skills.sh-parseability at step 1 (deterministic, required) and **advises** (step 1b, WARN) when a repo-root-with-bundled-dirs layout would hit #1523 — it no longer *rejects* a standard-valid layout. Full proof + root cause: `research/entities/tool/skills-sh.md`.

## Foundation — Control & Governance

Expand Down
Loading