From dac3efa935331d78b9d2f1d9907fa941923c380c Mon Sep 17 00:00:00 2001 From: "Carlos D. Escobar-Valbuena" Date: Sun, 28 Jun 2026 22:42:33 -0500 Subject: [PATCH] docs(refs): land prompts-integration Phase-E + skills-roster publishing reframe (BRO-1591) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Finishes the two doc edits stashed during the BRO-1588 pull. - prompts-integration.md: add "Phase E — evaluation engine" section (krisis judge crate, Vigil OTLP capture, store-judgments-not-content privacy invariant, `broomva evals` CLI, P6/P11/P13 composition); refresh See-also to the 2026-06-02 design spec + project/entity nodes. Purely additive. - skills-roster.md: reframe the publishing standard — a repo-root SKILL.md is standard-valid; the defect is the open upstream CLI bug vercel-labs/skills#1523 (verified OPEN), not the layout. Add the monorepo-consolidation canonical fix (BRO-1561) + the skillify advise-not-reject note (BRO-1581). Fix the stale banner count 56 -> 61. Origin's row annotations kept intact. No changes to companion-skills.yaml. Full tests/*.test.sh: 24/24. Co-Authored-By: Claude Opus 4.8 (1M context) --- references/prompts-integration.md | 29 ++++++++++++++++++++++++++++- references/skills-roster.md | 16 +++++++++------- 2 files changed, 37 insertions(+), 8 deletions(-) diff --git a/references/prompts-integration.md b/references/prompts-integration.md index 87d1916..16b59d2 100644 --- a/references/prompts-integration.md +++ b/references/prompts-integration.md @@ -169,6 +169,30 @@ The prompt library composes cleanly with the existing primitive contract: --- +## The evaluation engine (Phase E) — from telemetry to judgment + +The 5-step mandate above writes **telemetry**: *what ran* (model, tokens, cost, latency) and *how the human felt* (thumbs). Phase E adds the layer above it — **judgment**: *how good the output actually was*, measured by an LLM-as-judge against versioned rubrics. Design spec: `docs/superpowers/specs/2026-06-02-prompts-evals-engine-design.md`. + +What a bstack-driven agent needs to know: + +- **Completing an invocation is what triggers an eval.** Step 4 of the mandate (`broomva prompts complete`) is the ingest point. The server applies an adaptive sampling decision (per-rubric YAML config) and, if sampled, enqueues a judge job. The judge — the Life **`krisis`** crate (κρίσις, *judgment*; sibling of `vigil`) — scores the output against a **G-Eval** rubric (each dimension: Definition + Evaluation Steps + Score range + anchors) and writes a `PromptEvaluation` row. +- **Capture the output the cheap way: run the Vigil sidecar.** `broomva vigil install` writes a SessionStart hook and an on-demand local OTLP proxy; point your agent at it via `ANTHROPIC_BASE_URL`/`OPENAI_BASE_URL` and every LLM call is captured as a GenAI span and joined to the invocation by session-id — no extra step in the loop. Fallback if you don't run the sidecar: `broomva prompts complete --output-file ` hands the output over explicitly. No output available → the eval is `skipped` with a reason (a measured coverage gap, never silent). +- **The privacy line is load-bearing: store judgments, never content.** By default the captured output is read transiently to score it, then **dropped** — only scores + reasoning persist. Retention is opt-in (first-party or `sudo` mode) and isolated. Do not design any agent flow that assumes the platform retains the output; it does not, unless capture is explicitly enabled. Entity: `concept/store-judgments-not-content`. +- **Inspect evals from the terminal:** + +```bash +broomva evals show --wait # block until the judge finishes; print the breakdown +broomva evals list --slug --failing # what's scoring poorly +broomva evals tail # live feed of completing evals +broomva evals rubric validate # local YAML lint, no server hit +``` + +- **`evals show --wait` right after `prompts complete` is the killer loop** — the "did this run actually go well?" answer in the terminal where the work happened. + +This deepens the P6/P11/P13 composition above: the eval score is the crystallized quality signal (P6), the judgment is empirical validation of the prompt itself (P11), and the evolving per-prompt aggregate is the dream-tier substrate (P13). + +--- + ## Common traps - **Pulling a prompt and not completing it.** The most common silent failure. The CLI emits the invocation id loudly; capture it. Set a TodoWrite reminder if the work is long-running. @@ -196,6 +220,9 @@ LIST: broomva prompts list --metrics --sort skill_invokes ## See also - `broomva-cli` skill SKILL.md — the source-of-truth mandate, mirrored here -- `docs/superpowers/specs/2026-05-09-prompts-eval-engine-design.md` — full design spec for the eval engine +- `docs/superpowers/specs/2026-06-02-prompts-evals-engine-design.md` — **Phase E** (evaluation/judgment) design spec: rubrics, `krisis`, Vigil capture, privacy posture +- `docs/superpowers/plans/2026-05-11-prompts-eval-engine-phase2-cli.md` — Phase 2 (CLI telemetry) plan +- `research/entities/concept/store-judgments-not-content.md` — the governing privacy invariant +- `research/entities/project/prompts-eval-engine.md` — project tracking node (11 decisions, topology, phasing) - broomva.tech/prompts — the browsable web surface - `bstack-control-harness-bootstrap` (slug) — the prompt to use when standing up a fresh agent harness diff --git a/references/skills-roster.md b/references/skills-roster.md index 5cf71c5..b415f93 100644 --- a/references/skills-roster.md +++ b/references/skills-roster.md @@ -2,20 +2,22 @@ 28 curated skills across 7 layers. The Broomva Stack. -> **Canonical *installable* roster:** [`references/companion-skills.yaml`](companion-skills.yaml) — the 56 bstack-native skills, all in the **broomva/skills** monorepo (validated by `tests/roster-monorepo-sync.test.sh`). Install path-independently: `npx skills add broomva/skills --skill `. **This doc is a broader human catalog**: it also documents ecosystem *products* (e.g. symphony, autoany, next-forge) that are not installable bstack skills, so it intentionally lists more than the YAML. For what is installable, the YAML is authoritative. +> **Canonical *installable* roster:** [`references/companion-skills.yaml`](companion-skills.yaml) — the 61 bstack-native skills, all in the **broomva/skills** monorepo (validated by `tests/roster-monorepo-sync.test.sh`). Install path-independently: `npx skills add broomva/skills --skill `. **This doc is a broader human catalog**: it also documents ecosystem *products* (e.g. symphony, autoany, next-forge) that are not installable bstack skills, so it intentionally lists more than the YAML. For what is installable, the YAML is authoritative. -## Publishing standard (skill layout) — required for installability +## Publishing standard (skill layout) — installability + consolidation -A skill is a **folder**, per the [Agent Skills standard](https://agentskills.io) — `SKILL.md` + optional `scripts/` / `references/` / `assets/`. **Publish every multi-file skill in a `skills//` subdirectory of its repo — never as a bare top-level `SKILL.md` at the repo root.** +A skill is a **folder**, per the [Agent Skills standard](https://agentskills.io) — `SKILL.md` + optional `scripts/` / `references/` / `assets/`. A top-level `SKILL.md` at a repo root is **standard-valid** (both the spec and the skills.sh README list the repo root as a discovery location). The defect is *not* the layout — it's an **open upstream CLI bug**. | Layout | Remote `npx skills add` | | |---|---|---| -| `skills//{SKILL.md, scripts/…}` | installs the **full** skill | ✅ standard | -| top-level `SKILL.md` + sibling `scripts/` at repo root | installs **only `SKILL.md`** — scripts dropped → non-functional | 🔴 broken (BRO-1561) | +| `skills//{SKILL.md, scripts/…}` (subdir) | installs the **full** skill | ✅ installs clean | +| top-level `SKILL.md` + sibling `scripts/` at repo root | installs **only `SKILL.md`** — siblings dropped → non-functional | ⚠️ standard-valid layout, but hits [skills#1523](https://github.com/vercel-labs/skills/issues/1523) | -**Why:** the vercel-labs/skills CLI special-cases a repo-root `SKILL.md` and copies only that file (a repo root carries `README`/`LICENSE`/`.github` — not a clean skill folder). A `skills//` subdir is a clean folder and is fully copied. A repo can keep its name and still be standard: `git mv SKILL.md scripts/ → skills//`; `npx skills add broomva/` still works (the CLI's `skills/` priority-search finds the subdir). +**Why:** the vercel-labs/skills CLI special-cases a repo-root `SKILL.md` and copies only that file ([#1523](https://github.com/vercel-labs/skills/issues/1523) / #1517 — **open, unfixed**). *Local-path* install copies everything; only *remote* repo-root install drops siblings. A `skills//` subdir is a clean folder and is fully copied, so it sidesteps the bug. -**Validation:** `--list` is **necessary but not sufficient** — it only parses frontmatter, never the file-copy path, so it passes even when the install drops `scripts/`. The real gate is a **clean-room install that yields a *runnable* skill** (bundled files land + the skill's own test passes). The `skillify` skill machine-enforces this (`skillify_check.py` step 1 FAILs a repo-root-with-bundled-dirs layout). Full proof + root cause: `research/entities/tool/skills-sh.md`. +**Canonical fix — consolidate into the `broomva/skills` monorepo (BRO-1561):** rather than push each standalone repo into an awkward `skills//`-inside-a-same-named-repo workaround, skills vendor into **`broomva/skills/skills//`** and install via `npx skills add broomva/skills --skill `. The subdir is non-redundant there, the monorepo has generic lint + per-skill tests, and standalone repos become **deprecated redirect stubs** (6-month window) pointing at the monorepo command. Tier-2 graduation in `broomva/skills` CONTRIBUTING is the path. + +**Validation:** `--list` is **necessary but not sufficient** — it only parses frontmatter, never the file-copy path, so it passes even when the install drops `scripts/`. The real gate is a **clean-room install that yields a *runnable* skill** (bundled files land + the skill's own test passes). The `skillify` skill enforces frontmatter/skills.sh-parseability at step 1 (deterministic, required) and **advises** (step 1b, WARN) when a repo-root-with-bundled-dirs layout would hit #1523 — it no longer *rejects* a standard-valid layout. Full proof + root cause: `research/entities/tool/skills-sh.md`. ## Foundation — Control & Governance