feat(overlay): add generate-metadata workflow for Package entity creation #22
davidfestal wants to merge 1 commit into
…tion

Add a 6-phase workflow that generates missing Package metadata files and audits existing ones for consistency.

Key capabilities:
- Scans workspaces for plugins missing metadata files
- Derives deterministic fields (name, OCI URL, supportedVersions) from source.json, plugins-list.yaml, and upstream package.json
- Fetches config.d.ts from upstream to generate appConfigExamples
- Audits supportedVersions consistency and empty appConfigExamples
- Updates smoke-tests/test.env with placeholder variables
- Delegates from onboard-plugin Phase 4

Also adds path_resolution and shell_permissions directives to SKILL.md for reliable script invocation across all workflows, and rewrites the metadata-format reference with real examples and correct paths.

Co-authored-by: Cursor <cursoragent@cursor.com>
Pull request overview
This PR extends the overlay skill with a new workflow and helper script to generate/audit Backstage Package metadata for overlay workspaces, and updates related documentation and onboarding guidance.
Changes:
- Add `workflows/generate-metadata.md` to define a phased process for scanning, generating, and auditing metadata.
- Add `scripts/derive-metadata.py` plus new unit tests to support deterministic metadata derivation and audit checks.
- Update overlay skill docs (`SKILL.md`, `references/metadata-format.md`, and `onboard-plugin.md`) to route Phase 4 to the new workflow and document the metadata format.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| tests/unit/test_derive_metadata.py | Adds unit tests for the new derive-metadata helper script. |
| skills/overlay/workflows/onboard-plugin.md | Delegates Phase 4 metadata work to the new generate-metadata workflow. |
| skills/overlay/workflows/generate-metadata.md | New workflow describing scan/derive/audit steps and expected outputs. |
| skills/overlay/SKILL.md | Adds path resolution + shell permission guidance; adds routing entry for metadata workflow. |
| skills/overlay/scripts/derive-metadata.py | New CLI script to scan workspaces, derive fields, and perform audits. |
| skills/overlay/references/metadata-format.md | Rewrites metadata format reference with updated paths and examples. |

    import importlib.util
    import json
    import textwrap

      python scripts/derive-metadata.py --workspace argocd
      python scripts/derive-metadata.py --workspace argocd --package-json '{"name":"@backstage-community/plugin-argocd","version":"2.8.0","backstage":{"role":"frontend-plugin"}}'
      python scripts/derive-metadata.py --extract-env-vars metadata-file.yaml

    def shorten_name(name: str) -> str:
        """Apply shortening rules only if name exceeds K8S_NAME_LIMIT."""
        if len(name) <= K8S_NAME_LIMIT:
            return name
        shortened = name
        for old, new in SHORTEN_RULES:
            shortened = shortened.replace(old, new)
        return shortened

    def find_missing_metadata(workspace_dir: Path, plugins: list[dict]) -> list[dict]:
        """Identify plugins that lack metadata files.

        Uses a heuristic: for each plugin path, check if any existing metadata file's
        packageName corresponds to that path. Falls back to filename pattern matching.
        """
        metadata_dir = workspace_dir / "metadata"
        existing_files = list(metadata_dir.glob("*.yaml")) if metadata_dir.exists() else []
        existing_names = {f.stem for f in existing_files}

        missing = []
        for plugin in plugins:
            path = plugin["path"]
            path_suffix = path.rstrip("/").split("/")[-1] if path != "." else ""
            found = any(path_suffix and path_suffix in name for name in existing_names)
            if not found and path != ".":
                missing.append(plugin)
            elif path == "." and not existing_names:
                missing.append(plugin)
        return missing

        if args.package_json:
            pkg = json.loads(args.package_json)
            fields = derive_plugin_fields(
                pkg, args.workspace, args.plugin_path, source,
                supported_versions, existing,
            )
            print(json.dumps(fields, indent=2 if is_tty else None))

    </path_resolution>

    <shell_permissions>
    Prefer running `gh api` and `gh search code` as **direct shell commands** rather than via Python subprocess. Direct `gh` calls go through the user's command allowlist without triggering permission prompts. Python scripts that only do local work (file I/O, JSON processing, field derivation) also need no extra permissions. Only request `full_network` for Python scripts that internally spawn `gh` as a subprocess — the sandbox blocks network access from child processes.

Run the scan command from the overlay repo root (no network needed):

```bash
python3 scripts/derive-metadata.py scan --workspace <workspace>
```

      version: 10.17.0
      backstage:
        role: frontend-plugin
        supportedVersions: 1.45.3

      version: 1.4.0
      backstage:
        role: backend-plugin
        supportedVersions: 1.48.3

        is_tty = os.isatty(sys.stdout.fileno())

        if args.command == "extract-env-vars":
            content = Path(args.file).read_text()
            env_vars = extract_env_vars(content)
            output = {"env_vars": env_vars}
            print(json.dumps(output, indent=2 if is_tty else None))

durandom left a comment
Review: Script Architecture & Quality
Nice work on the overall concept — extracting deterministic field derivation (name shortening, OCI URLs, version consistency) into a script is exactly the right pattern for this project. The workflow phases are logical and the delegation from onboard-plugin Phase 4 makes sense. The metadata-format.md rewrite with real-world examples is a big improvement.
However, the script has some architectural issues and quality gaps compared to the existing scripts (analyze-pr.py, triage-prs.py). See inline comments for specifics.
Architectural
1. fetch-and-derive mixes concerns — remove or split it
The existing scripts in this project call gh to fetch data, but keep fetching and transformation cleanly separated. fetch-and-derive does both: it shells out to gh api via subprocess AND runs the derive logic. This creates the <shell_permissions> contradiction (see inline comment) and makes the subcommand harder to test (requires mocking subprocess).
The workflow already describes the gh api calls as direct shell commands in Phase 3.1. The script should only do local/deterministic work: scan (read local files) and derive (pure computation). Let the agent or the workflow handle the gh api calls — that's what the existing scripts' patterns do.
2. extract-env-vars doesn't need to be a subcommand
This is a one-liner: grep -oP '\$\{[A-Z_][A-Z0-9_]*\}' file | sort -u. Adding it as a Python subcommand adds code to maintain with no benefit.
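If the extraction is ever needed from Python code rather than the shell, the same pattern is only a few lines. The following is a minimal sketch: the function name mirrors the `extract_env_vars` call visible in the diff, but the body and the sample input are assumptions, not the PR's implementation.

```python
import re

# Same pattern as the grep one-liner: ${NAME} made of upper-case letters, digits, underscores.
ENV_VAR_PATTERN = re.compile(r"\$\{[A-Z_][A-Z0-9_]*\}")


def extract_env_vars(content: str) -> list[str]:
    """Return the sorted, de-duplicated ${VAR} placeholders found in content."""
    return sorted(set(ENV_VAR_PATTERN.findall(content)))


# Hypothetical metadata snippet used only for illustration.
sample = "url: ${ARGOCD_URL}\ntoken: ${ARGOCD_TOKEN}\nfallback: ${ARGOCD_URL}\n"
print(extract_env_vars(sample))
```

Like `grep -o ... | sort -u`, this returns each placeholder once, including the surrounding `${...}`.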
Functional Bugs
3. shorten_name has no safety net (see inline)
4. find_missing_metadata uses substring matching (see inline)
Quality — Match Existing Patterns
5. run_gh is inconsistent with existing scripts
Both analyze-pr.py and triage-prs.py have run_gh that:
- Returns parsed JSON (not raw strings)
- Catches
FileNotFoundErrorfor missingghCLI - Catches
json.JSONDecodeError
This script's run_gh does none of those. If you keep run_gh, follow the established pattern.
6. Use --json flag instead of TTY auto-detection (see inline)
Minor
7. Module docstring shows old usage — The usage examples don't show subcommands (scan, derive, etc.) which is confusing.
8. Numbering gap in intake menu — Options jump from 4 to 8 (see inline).
9. supportedVersions mismatch in reference examples — Both examples in metadata-format.md show inconsistent versions (see inline).
TL;DR: The derive logic (name shortening, OCI URLs, consistency checks) is genuinely valuable as a script. The main asks: (1) remove fetch-and-derive — let the workflow handle gh api calls, keep the script doing only local/deterministic work, (2) fix the shorten_name safety net and find_missing_metadata matching, (3) align run_gh and output format with existing scripts.

    def shorten_name(name: str) -> str:
        """Apply shortening rules only if name exceeds K8S_NAME_LIMIT."""
        if len(name) <= K8S_NAME_LIMIT:
            return name
If the shortening rules don't reduce the name below 63 chars, this returns a name that exceeds the Kubernetes limit. The upstream shorten-component-name.sh likely has a final truncation step.
Add a safety net — e.g., truncate to 55 chars + - + 7-char hash suffix:

    if len(shortened) > K8S_NAME_LIMIT:
        import hashlib
        h = hashlib.sha256(name.encode()).hexdigest()[:7]
        shortened = shortened[:K8S_NAME_LIMIT - 8] + '-' + h
    return shortened

The test test_long_name_shortened only passes because the specific test string happens to get short enough; it doesn't cover names where the rules aren't sufficient.
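Put together, the function with the safety net could look like the sketch below. `K8S_NAME_LIMIT = 63` follows the limit named in this comment; the entries in `SHORTEN_RULES` are hypothetical placeholders, since the PR's real rules are not shown here. The final lines exercise the case the existing test misses: a name that no rule can shorten.

```python
import hashlib

K8S_NAME_LIMIT = 63  # Kubernetes name limit cited in the review

# Hypothetical rules for illustration only; the PR's real SHORTEN_RULES are not shown.
SHORTEN_RULES = [("backstage-community-plugin-", "bcp-"), ("-backend", "-be")]


def shorten_name(name: str) -> str:
    """Apply SHORTEN_RULES, then truncate with a hash suffix if still too long."""
    if len(name) <= K8S_NAME_LIMIT:
        return name
    shortened = name
    for old, new in SHORTEN_RULES:
        shortened = shortened.replace(old, new)
    if len(shortened) > K8S_NAME_LIMIT:
        # Hash the original name so long names sharing a prefix stay distinct.
        digest = hashlib.sha256(name.encode()).hexdigest()[:7]
        shortened = shortened[: K8S_NAME_LIMIT - 8] + "-" + digest
    return shortened


long_name = "x" * 100  # no rule applies, so only the safety net can shorten it
print(len(shorten_name(long_name)))
```

The result is 55 characters of the shortened name, a hyphen, and a 7-character digest, which is exactly 63 characters.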

            path = plugin["path"]
            path_suffix = path.rstrip("/").split("/")[-1] if path != "." else ""
            found = any(path_suffix and path_suffix in name for name in existing_names)
            if not found and path != ".":
Substring matching: "argocd" in "something-argocd-backend" is True. If the workspace has both plugins/argocd and plugins/argocd-backend, and a metadata file x-argocd-backend.yaml exists, the check for plugins/argocd falsely matches.
Either:
- Use exact suffix matching with a separator: `name.endswith(path_suffix)` or `name == expected_metadata_name`
- Or better: derive the expected metadata name via `package_name_to_metadata_name` and check for exact file stem match
The Copilot comment (#4) flagged the same issue.
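A sketch of the exact-match variant follows. It makes several assumptions: `package_name_to_metadata_name` is reimplemented here as a hypothetical stand-in (the real helper's behavior is not shown in this review), each plugin dict is assumed to carry a `name` key with the npm package name, and the function takes the set of existing file stems directly instead of a workspace directory so it can run without files on disk.

```python
def package_name_to_metadata_name(package_name: str) -> str:
    """Hypothetical stand-in for the PR's helper: '@scope/plugin-x' -> 'scope-plugin-x'."""
    return package_name.lstrip("@").replace("/", "-")


def find_missing_metadata(existing_stems: set[str], plugins: list[dict]) -> list[dict]:
    """Return plugins whose expected metadata file stem is absent (exact match only)."""
    return [
        plugin
        for plugin in plugins
        if package_name_to_metadata_name(plugin["name"]) not in existing_stems
    ]


plugins = [
    {"path": "plugins/argocd", "name": "@backstage-community/plugin-argocd"},
    {"path": "plugins/argocd-backend", "name": "@backstage-community/plugin-argocd-backend"},
]
# Only the backend plugin has a metadata file on disk.
existing_stems = {"backstage-community-plugin-argocd-backend"}

missing = find_missing_metadata(existing_stems, plugins)
print([p["path"] for p in missing])  # the frontend plugin is reported as missing
```

With substring matching, `"argocd"` would be found inside the backend file's stem and the frontend plugin would be skipped; the set lookup avoids that.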

    def run_gh(args, check=True):
        """Run a gh CLI command and return stdout as string."""
This run_gh doesn't match the pattern from analyze-pr.py and triage-prs.py:
- Missing `FileNotFoundError` handler (crash if `gh` isn't installed)
- Returns raw stdout string instead of parsed JSON
- No `json.JSONDecodeError` handling
If you keep run_gh (see my comment about removing fetch-and-derive), align with the existing pattern.
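For illustration, a `run_gh` covering those three points might look like the sketch below. It is not copied from `analyze-pr.py` or `triage-prs.py`, whose code is not shown here; the `gh_bin` parameter and the choice to return `None` on failure are assumptions made so the error paths can be exercised without the GitHub CLI.

```python
import json
import subprocess
import sys


def run_gh(args: list[str], gh_bin: str = "gh"):
    """Run a gh command and return its parsed JSON output, or None on failure."""
    try:
        result = subprocess.run(
            [gh_bin, *args], capture_output=True, text=True, check=False
        )
    except FileNotFoundError:
        # Raised when the executable itself cannot be found.
        print(f"error: {gh_bin!r} not found; is the GitHub CLI installed?", file=sys.stderr)
        return None
    if result.returncode != 0:
        print(f"error: command exited with {result.returncode}: {result.stderr.strip()}", file=sys.stderr)
        return None
    try:
        return json.loads(result.stdout)
    except json.JSONDecodeError as exc:
        print(f"error: could not parse output as JSON: {exc}", file=sys.stderr)
        return None


# Exercise the missing-binary path with a name that should not exist on PATH.
print(run_gh(["api", "user"], gh_bin="gh-binary-that-does-not-exist"))
```

Callers then only need to check for `None` instead of handling three different exception types.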

    def fetch_and_derive_all(
        workspace_dir: Path, workspace: str, source: dict, missing_paths: list[str]
This subcommand mixes network I/O (gh api via subprocess) with pure computation (field derivation). The workflow already describes the gh api calls as shell commands in Phase 3.1 — having the script also do them via subprocess is redundant.
This is also what creates the <shell_permissions> contradiction in SKILL.md: the directive says "prefer direct gh calls over Python subprocess" but this function does exactly the opposite.
Recommendation: remove fetch-and-derive. The workflow should call gh api directly (agent-friendly, no sandbox issues), then pipe the results to derive for each plugin. The scan + derive subcommands are sufficient.
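One way to support piping is to let `derive` read the package.json from standard input. The sketch below assumes a convention that does not exist in the PR: passing `-` to `--package-json` means "read from stdin". The helper name `load_package_json` is likewise hypothetical.

```python
import io
import json
import sys


def load_package_json(arg: str, stdin=sys.stdin) -> dict:
    """Parse package.json from an inline JSON string, or from stdin when arg is '-'."""
    raw = stdin.read() if arg == "-" else arg
    return json.loads(raw)


# Simulates the agent piping fetched package.json content into the derive step.
piped = io.StringIO('{"name": "@backstage-community/plugin-argocd", "version": "2.8.0"}')
pkg = load_package_json("-", stdin=piped)
print(pkg["name"], pkg["version"])
```

The script stays free of network and subprocess calls, so it can be tested with plain strings.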

            sys.exit(1)

        is_tty = os.isatty(sys.stdout.fileno())
os.isatty(sys.stdout.fileno()) raises io.UnsupportedOperation in environments without a real fd (some test runners, CI wrappers). Use sys.stdout.isatty() instead.
Also: the existing scripts use an explicit --json flag rather than TTY auto-detection. That's more predictable — consider matching that pattern.
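A minimal sketch of the explicit-flag approach, using `argparse`. How the existing scripts interpret `--json` is not shown in this review, so the behavior here (compact JSON when the flag is set, indented JSON otherwise) is an assumption.

```python
import argparse
import json


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Sketch of explicit output-format selection")
    parser.add_argument(
        "--json", action="store_true", help="emit compact JSON for machine consumption"
    )
    return parser


def render(data: dict, as_json: bool) -> str:
    """Compact JSON when --json is set; indented, human-readable JSON otherwise."""
    return json.dumps(data) if as_json else json.dumps(data, indent=2)


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["--json"])
print(render({"env_vars": ["${ARGOCD_URL}"]}, args.json))
```

The output format then depends only on the command line, not on whether stdout happens to be a terminal.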

          title: Bugs
        - title: Source Code
          url: https://github.com/backstage/community-plugins/tree/main/workspaces/dynatrace/plugins/dynatrace
      annotations:
supportedVersions: 1.45.3 but the dynamicArtifact tag above uses bs_1.49.4. These should match — this is the reference doc that agents will follow when generating metadata.
Same issue with the backend example below (supportedVersions: 1.48.3 vs bs_1.49.4 in the tag).
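A check like this is easy to automate. The sketch below assumes the Backstage version appears in the artifact tag as `bs_<major>.<minor>.<patch>`, as in the `bs_1.49.4` tag quoted above; the full artifact string and the function name are hypothetical.

```python
import re

# Matches the Backstage version embedded in a tag such as "bs_1.49.4".
BS_TAG_PATTERN = re.compile(r"bs_(\d+\.\d+\.\d+)")


def tag_matches_supported_version(dynamic_artifact: str, supported_versions: str) -> bool:
    """True when the bs_<version> in the artifact tag equals supportedVersions."""
    match = BS_TAG_PATTERN.search(dynamic_artifact)
    return match is not None and match.group(1) == supported_versions


# Hypothetical artifact reference; only the bs_1.49.4 part comes from the review.
artifact = "oci://registry.example.invalid/plugin-dynatrace:bs_1.49.4__10.17.0"
print(tag_matches_supported_version(artifact, "1.45.3"))  # the mismatch flagged above
print(tag_matches_supported_version(artifact, "1.49.4"))
```

An artifact string with no `bs_` tag is treated as a mismatch rather than silently passing.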

    <shell_permissions>
    Prefer running `gh api` and `gh search code` as **direct shell commands** rather than via Python subprocess. Direct `gh` calls go through the user's command allowlist without triggering permission prompts. Python scripts that only do local work (file I/O, JSON processing, field derivation) also need no extra permissions. Only request `full_network` for Python scripts that internally spawn `gh` as a subprocess — the sandbox blocks network access from child processes.
    </shell_permissions>
This directive says "prefer direct gh shell commands over Python subprocess" but derive-metadata.py's fetch-and-derive subcommand does exactly the opposite — it calls gh api via subprocess.run.
Either:
- Remove `fetch-and-derive` from the script (recommended; see my comment on the script), or
- Rewrite this to explain that the script handles `gh` calls internally and agents should use the script's subcommands rather than calling `gh` directly for metadata tasks

    2. **Update plugin version** — Bump to newer upstream commit/tag
    3. **Check plugin status** — Verify health and compatibility
    4. **Fix build failure** — Debug CI/publish issues
    8. **Generate or audit metadata** — Add missing Package metadata or fix inconsistencies in existing metadata
Nit: jumps from option 4 to option 8 in the user-facing menu. Should be 5 (or renumber the Core Team section to leave room).
Summary
- `workflows/generate-metadata.md`: a 6-phase workflow that generates missing Package metadata YAML files and audits existing metadata for consistency within overlay workspaces
- `scripts/derive-metadata.py`: a Python helper script for deterministic metadata derivation (name shortening, OCI URL, supportedVersions, plugins-list parsing, env var extraction)
- `<path_resolution>` and `<shell_permissions>` directives to `SKILL.md` for reliable script invocation and sandbox-safe GitHub API calls across all workflows
- `references/metadata-format.md` with real-world examples, correct `catalog-entities/extensions/` paths, and comprehensive field documentation
- `tests/unit/test_derive_metadata.py` with 47 tests covering all public functions

Workflow Phases
`gh api` (agent-direct, no subprocess)

Test plan
Made with Cursor