From dac3efa935331d78b9d2f1d9907fa941923c380c Mon Sep 17 00:00:00 2001
From: "Carlos D. Escobar-Valbuena" <devteam@getstimulus.ai>
Date: Sun, 28 Jun 2026 22:42:33 -0500
Subject: [PATCH] docs(refs): land prompts-integration Phase-E + skills-roster
 publishing reframe (BRO-1591)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Finishes the two doc edits stashed during the BRO-1588 pull.

- prompts-integration.md: add "Phase E — evaluation engine" section (krisis judge
  crate, Vigil OTLP capture, store-judgments-not-content privacy invariant,
  `broomva evals` CLI, P6/P11/P13 composition); refresh See-also to the 2026-06-02
  design spec + project/entity nodes. Purely additive.
- skills-roster.md: reframe the publishing standard — a repo-root SKILL.md is
  standard-valid; the defect is the open upstream CLI bug vercel-labs/skills#1523
  (verified OPEN), not the layout. Add the monorepo-consolidation canonical fix
  (BRO-1561) + the skillify advise-not-reject note (BRO-1581). Fix the stale banner
  count 56 -> 61. Origin's row annotations kept intact.

No changes to companion-skills.yaml. Full tests/*.test.sh: 24/24.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 references/prompts-integration.md | 29 ++++++++++++++++++++++++++++-
 references/skills-roster.md       | 16 +++++++++-------
 2 files changed, 37 insertions(+), 8 deletions(-)
diff --git a/references/prompts-integration.md b/references/prompts-integration.md
index 87d1916..16b59d2 100644
--- a/references/prompts-integration.md
+++ b/references/prompts-integration.md
@@ -169,6 +169,30 @@ The prompt library composes cleanly with the existing primitive contract:
 
 ---
 
+## The evaluation engine (Phase E) — from telemetry to judgment
+
+The 5-step mandate above writes **telemetry**: *what ran* (model, tokens, cost, latency) and *how the human felt* (thumbs). Phase E adds the layer above it — **judgment**: *how good the output actually was*, measured by an LLM-as-judge against versioned rubrics. Design spec: `docs/superpowers/specs/2026-06-02-prompts-evals-engine-design.md`.
+
+What a bstack-driven agent needs to know:
+
+- **Completing an invocation is what triggers an eval.** Step 4 of the mandate (`broomva prompts complete`) is the ingest point. The server applies an adaptive sampling decision (per-rubric YAML config) and, if sampled, enqueues a judge job. The judge — the Life **`krisis`** crate (κρίσις, *judgment*; sibling of `vigil`) — scores the output against a **G-Eval** rubric (each dimension: Definition + Evaluation Steps + Score range + anchors) and writes a `PromptEvaluation` row.
+- **Capture the output the cheap way: run the Vigil sidecar.** `broomva vigil install` writes a SessionStart hook and an on-demand local OTLP proxy; point your agent at it via `ANTHROPIC_BASE_URL`/`OPENAI_BASE_URL` and every LLM call is captured as a GenAI span and joined to the invocation by session-id — no extra step in the loop. Fallback if you don't run the sidecar: `broomva prompts complete --output-file <path>` hands the output over explicitly. No output available → the eval is `skipped` with a reason (a measured coverage gap, never silent).
+- **The privacy line is load-bearing: store judgments, never content.** By default the captured output is read transiently to score it, then **dropped** — only scores + reasoning persist. Retention is opt-in (first-party or `sudo` mode) and isolated. Do not design any agent flow that assumes the platform retains the output; it does not, unless capture is explicitly enabled. Entity: `concept/store-judgments-not-content`.
+- **Inspect evals from the terminal:**
+
+```bash
+broomva evals show <invocation_id> --wait    # block until the judge finishes; print the breakdown
+broomva evals list --slug <slug> --failing   # what's scoring poorly
+broomva evals tail                           # live feed of completing evals
+broomva evals rubric validate <path>         # local YAML lint, no server hit
+```
+
+- **`evals show <id> --wait` right after `prompts complete` is the killer loop** — the "did this run actually go well?" answer in the terminal where the work happened.
+
+This deepens the P6/P11/P13 composition above: the eval score is the crystallized quality signal (P6), the judgment is empirical validation of the prompt itself (P11), and the evolving per-prompt aggregate is the dream-tier substrate (P13).
+
+---
+
 ## Common traps
 
 - **Pulling a prompt and not completing it.** The most common silent failure. The CLI emits the invocation id loudly; capture it. Set a TodoWrite reminder if the work is long-running.
@@ -196,6 +220,9 @@ LIST:     broomva prompts list --metrics --sort skill_invokes
 ## See also
 
 - `broomva-cli` skill SKILL.md — the source-of-truth mandate, mirrored here
-- `docs/superpowers/specs/2026-05-09-prompts-eval-engine-design.md` — full design spec for the eval engine
+- `docs/superpowers/specs/2026-06-02-prompts-evals-engine-design.md` — **Phase E** (evaluation/judgment) design spec: rubrics, `krisis`, Vigil capture, privacy posture
+- `docs/superpowers/plans/2026-05-11-prompts-eval-engine-phase2-cli.md` — Phase 2 (CLI telemetry) plan
+- `research/entities/concept/store-judgments-not-content.md` — the governing privacy invariant
+- `research/entities/project/prompts-eval-engine.md` — project tracking node (11 decisions, topology, phasing)
 - broomva.tech/prompts — the browsable web surface
 - `bstack-control-harness-bootstrap` (slug) — the prompt to use when standing up a fresh agent harness
diff --git a/references/skills-roster.md b/references/skills-roster.md
index 5cf71c5..b415f93 100644
--- a/references/skills-roster.md
+++ b/references/skills-roster.md
@@ -2,20 +2,22 @@
 
 28 curated skills across 7 layers. The Broomva Stack.
 
-> **Canonical *installable* roster:** [`references/companion-skills.yaml`](companion-skills.yaml) — the 56 bstack-native skills, all in the **broomva/skills** monorepo (validated by `tests/roster-monorepo-sync.test.sh`). Install path-independently: `npx skills add broomva/skills --skill <name>`. **This doc is a broader human catalog**: it also documents ecosystem *products* (e.g. symphony, autoany, next-forge) that are not installable bstack skills, so it intentionally lists more than the YAML. For what is installable, the YAML is authoritative.
+> **Canonical *installable* roster:** [`references/companion-skills.yaml`](companion-skills.yaml) — the 61 bstack-native skills, all in the **broomva/skills** monorepo (validated by `tests/roster-monorepo-sync.test.sh`). Install path-independently: `npx skills add broomva/skills --skill <name>`. **This doc is a broader human catalog**: it also documents ecosystem *products* (e.g. symphony, autoany, next-forge) that are not installable bstack skills, so it intentionally lists more than the YAML. For what is installable, the YAML is authoritative.
 
-## Publishing standard (skill layout) — required for installability
+## Publishing standard (skill layout) — installability + consolidation
 
-A skill is a **folder**, per the [Agent Skills standard](https://agentskills.io) — `SKILL.md` + optional `scripts/` / `references/` / `assets/`. **Publish every multi-file skill in a `skills/<name>/` subdirectory of its repo — never as a bare top-level `SKILL.md` at the repo root.**
+A skill is a **folder**, per the [Agent Skills standard](https://agentskills.io) — `SKILL.md` + optional `scripts/` / `references/` / `assets/`. A top-level `SKILL.md` at a repo root is **standard-valid** (both the spec and the skills.sh README list the repo root as a discovery location). The defect is *not* the layout — it's an **open upstream CLI bug**.
 
 | Layout | Remote `npx skills add` | |
 |---|---|---|
-| `skills/<name>/{SKILL.md, scripts/…}` | installs the **full** skill | ✅ standard |
-| top-level `SKILL.md` + sibling `scripts/` at repo root | installs **only `SKILL.md`** — scripts dropped → non-functional | 🔴 broken (BRO-1561) |
+| `skills/<name>/{SKILL.md, scripts/…}` (subdir) | installs the **full** skill | ✅ installs clean |
+| top-level `SKILL.md` + sibling `scripts/` at repo root | installs **only `SKILL.md`** — siblings dropped → non-functional | ⚠️ standard-valid layout, but hits [skills#1523](https://github.com/vercel-labs/skills/issues/1523) |
 
-**Why:** the vercel-labs/skills CLI special-cases a repo-root `SKILL.md` and copies only that file (a repo root carries `README`/`LICENSE`/`.github` — not a clean skill folder). A `skills/<name>/` subdir is a clean folder and is fully copied. A repo can keep its name and still be standard: `git mv SKILL.md scripts/ → skills/<name>/`; `npx skills add broomva/<name>` still works (the CLI's `skills/` priority-search finds the subdir).
+**Why:** the vercel-labs/skills CLI special-cases a repo-root `SKILL.md` and copies only that file ([#1523](https://github.com/vercel-labs/skills/issues/1523) / #1517 — **open, unfixed**). *Local-path* install copies everything; only *remote* repo-root install drops siblings. A `skills/<name>/` subdir is a clean folder and is fully copied, so it sidesteps the bug.
 
-**Validation:** `--list` is **necessary but not sufficient** — it only parses frontmatter, never the file-copy path, so it passes even when the install drops `scripts/`. The real gate is a **clean-room install that yields a *runnable* skill** (bundled files land + the skill's own test passes). The `skillify` skill machine-enforces this (`skillify_check.py` step 1 FAILs a repo-root-with-bundled-dirs layout). Full proof + root cause: `research/entities/tool/skills-sh.md`.
+**Canonical fix — consolidate into the `broomva/skills` monorepo (BRO-1561):** rather than push each standalone repo into an awkward `skills/<name>/`-inside-a-same-named-repo workaround, skills vendor into **`broomva/skills/skills/<name>/`** and install via `npx skills add broomva/skills --skill <name>`. The subdir is non-redundant there, the monorepo has generic lint + per-skill tests, and standalone repos become **deprecated redirect stubs** (6-month window) pointing at the monorepo command. Tier-2 graduation in `broomva/skills` CONTRIBUTING is the path.
+
+**Validation:** `--list` is **necessary but not sufficient** — it only parses frontmatter, never the file-copy path, so it passes even when the install drops `scripts/`. The real gate is a **clean-room install that yields a *runnable* skill** (bundled files land + the skill's own test passes). The `skillify` skill enforces frontmatter/skills.sh-parseability at step 1 (deterministic, required) and **advises** (step 1b, WARN) when a repo-root-with-bundled-dirs layout would hit #1523 — it no longer *rejects* a standard-valid layout. Full proof + root cause: `research/entities/tool/skills-sh.md`.
 
 ## Foundation — Control & Governance