Commit 5013eec

chore: clean up internal phase markers and reorganize scripts
The repo had grown two ingestion pipelines (PDF prose corpus and XSD schema graph) without making the duality obvious. Header comments referenced internal "Phase N" planning vocabulary that doesn't help public readers, and a tool-output line shipped a forward reference to a future phase to every MCP caller.

Reorganization:

  scripts/ingest/ -> scripts/ingest-pdf/ (was ambiguous)
  scripts/ingest-xsd/ stays
  scripts/fetch-xsd.ts -> scripts/ingest-xsd/fetch.ts (sibling layout)
  scripts/sync-sources.ts -> scripts/sources-sync.ts (verb-style name)
  scripts/ingest-pdf/extract-pdf.py -> extract.py
  db/migrations/0003_phase3_metadata.sql -> 0003_xsd_metadata.sql
  scripts/ingest-xsd/smoke.ts removed (debug-only, low value)

Renamed npm scripts to match the new directory layout:

  ingest -> pdf:ingest
  ingest:chunk -> pdf:chunk
  ingest:embed -> pdf:embed
  ingest:upload -> pdf:upload
  ingest:setup -> pdf:setup
  db:sync-sources -> sources:sync
  xsd:smoke removed

Strip "Phase N" markers from migration headers, source-file headers, test-file headers, and inline comments. None of those references were load-bearing; they were artifacts of the planning doc.

Drop the user-facing "_behavior notes: none yet (Phase 5)._" line that shipped in every children/attributes/enum tool response. The line gave no information when notes are absent and exposed an internal phase label to the public.

Replace the lone PLAN.md reference in scripts/ingest-xsd/ingest.ts with self-contained context. PLAN.md is gitignored; pointing at it was a broken link for anyone reading the public repo.

Add scripts/ingest-pdf/README.md and scripts/ingest-xsd/README.md so each pipeline is documented at the level that contributors land at, and refresh CLAUDE.md to make the two corpora explicit and surface both flavors of MCP tools.

41 / 0 across db / ingest / mcp-server.
1 parent cb3e16d commit 5013eec

30 files changed

Lines changed: 275 additions & 202 deletions

CLAUDE.md

Lines changed: 48 additions & 11 deletions
@@ -29,15 +29,23 @@ apps/
       src/data/docs.ts    ← All doc pages live here (single source of truth)
       src/components/     UI components (Sidebar, SuperDocPreview, etc.)
       src/pages/          Route pages (Home, Docs, SpecExplorer, Mcp)
-  mcp-server/           Cloudflare Worker MCP server for AI spec search
+  mcp-server/           Cloudflare Worker - MCP server (semantic + structural tools)
 packages/
   shared/               Database client, embedding client, types
 scripts/
-  ingest/               PDF → chunks → embeddings → database pipeline
+  ingest-pdf/           ECMA PDF -> spec_content (semantic search corpus)
+  ingest-xsd/           ECMA XSDs -> schema graph (structural query corpus)
+  sources-sync.ts       data/sources.json -> reference_sources
+  db-migrate.ts         Apply db/migrations/*.sql in order
+  ooxml-call.ts         Local CLI harness for the structural MCP tools
 db/
-  schema.sql            PostgreSQL + pgvector schema
+  schema.sql            PostgreSQL + pgvector + XSD schema graph
+  migrations/           Numbered, idempotent SQL migrations
+data/
+  sources.json          Source manifest (artifact URLs, sha256, license notes)
+  xsd-cache/            Local-only XSD download cache (gitignored)
 dev/
-  data/                 Extracted/chunked/embedded spec content
+  data/                 Extracted/chunked/embedded PDF content
 ```

 ## Commands
@@ -97,23 +105,52 @@ The XML you provide is wrapped in a minimal `w:document > w:body` structure auto

 ## MCP Server

-Cloudflare Worker exposing three MCP tools for semantic spec search:
+Cloudflare Worker exposing two flavors of MCP tools backed by the same database:

-- `search_ecma_spec` — semantic vector search across 18,000+ spec chunks
-- `get_section` — fetch a specific section by ID (e.g., "17.3.1.24")
-- `list_parts` — browse the spec structure
+Always-on (semantic search over the spec PDF):
+
+- `search_ecma_spec` - semantic vector search across 18,000+ spec chunks
+- `get_section` - fetch a specific section by ID (e.g., "17.3.1.24")
+- `list_parts` - browse the spec structure
+
+Behind `ENABLE_OOXML_TOOLS` (structural queries over the XSD schema graph):
+
+- `ooxml_lookup_element` / `ooxml_lookup_type` - canonical symbol info
+- `ooxml_children` - legal children of an element/type/group, in document order
+- `ooxml_attributes` - attributes including those inherited and unfolded from attributeGroup refs
+- `ooxml_enum` - simpleType enumeration values
+- `ooxml_namespace_info` - vocabularies and per-profile symbol counts for a namespace URI

 Uses PostgreSQL with pgvector (Neon serverless in production, Docker locally).

-## Data Pipeline
+## Data Pipelines
+
+Two ingest paths feed the same database. Both are reproducible from `data/sources.json`.

-Ingests ECMA-376 PDFs into the vector database:
+**PDF (semantic corpus, into `spec_content`)**:

 ```
 PDF → extract (Python) → chunk (6KB) → embed (Voyage) → upload (PostgreSQL)
 ```

-Run the full pipeline: `bun scripts/ingest/pipeline.ts`
+```bash
+bun run pdf:ingest 1 ./pdfs/ECMA-376-Part1.pdf   # full pipeline for one part
+```
+
+See `scripts/ingest-pdf/README.md`.
+
+**XSD (structural corpus, into `xsd_*` tables)**:
+
+```
+ECMA Part 4 zip → fetch+verify (sha256) → parse → ingest (single transaction)
+```
+
+```bash
+bun run xsd:fetch --url <part4-zip-url> --expected-sha256 <hex>
+bun run xsd:ingest
+```
+
+See `scripts/ingest-xsd/README.md`.

 ## Database
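
Reviewer note: the "fetch+verify (sha256)" step in the XSD pipeline above amounts to hashing the downloaded archive and refusing to continue when the digest differs from the one pinned on the command line or in `data/sources.json`. The sketch below illustrates that check only. It is not the contents of `scripts/ingest-xsd/fetch.ts`, and the helper names `sha256Hex` and `verifyArtifact` are hypothetical.

```typescript
import { createHash } from "node:crypto";

/** Hex-encoded SHA-256 digest of a byte buffer. */
function sha256Hex(bytes: Uint8Array): string {
  return createHash("sha256").update(bytes).digest("hex");
}

/**
 * Hypothetical helper (assumption, not taken from the repo): throw when the
 * downloaded artifact does not match the expected digest. The expected value
 * is trimmed and lowercased so that manifest formatting does not matter.
 */
function verifyArtifact(bytes: Uint8Array, expectedSha256: string): void {
  const actual = sha256Hex(bytes);
  const expected = expectedSha256.trim().toLowerCase();
  if (actual !== expected) {
    throw new Error(`sha256 mismatch: expected ${expected}, got ${actual}`);
  }
}
```

Failing closed before the parse step keeps a corrupted or substituted archive from reaching the single-transaction ingest.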

apps/mcp-server/src/index.ts

Lines changed: 4 additions & 3 deletions
@@ -17,9 +17,10 @@ export interface Env {
   DATABASE_URL: string;
   VOYAGE_API_KEY: string;
   /**
-   * Phase 4 feature flag. Set to "true" to expose ooxml_lookup_element /
-   * ooxml_lookup_type / ooxml_children / ooxml_attributes / ooxml_enum /
-   * ooxml_namespace_info via tools/list and tools/call. Default off.
+   * Feature flag for the OOXML structural tools. Set to "true" to expose
+   * ooxml_lookup_element / ooxml_lookup_type / ooxml_children /
+   * ooxml_attributes / ooxml_enum / ooxml_namespace_info via tools/list
+   * and tools/call. Default off.
    */
   ENABLE_OOXML_TOOLS?: string;
 }
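
Reviewer note: the flag documented above gates two places, discovery (`tools/list`) and dispatch (`tools/call`). A minimal sketch of that double gate follows. The names `ooxmlToolsEnabled` and `isOoxmlTool` appear in the `mcp.ts` hunk of this commit, but their bodies are not shown there, so the implementations here are assumptions, including the strict comparison against the string "true". `visibleTools` and `canCall` are hypothetical helpers.

```typescript
// Subset of the Worker's Env that matters for the flag.
interface Env {
  ENABLE_OOXML_TOOLS?: string;
}

// The six structural tool names listed in the header comment.
const OOXML_TOOL_NAMES = new Set<string>([
  "ooxml_lookup_element",
  "ooxml_lookup_type",
  "ooxml_children",
  "ooxml_attributes",
  "ooxml_enum",
  "ooxml_namespace_info",
]);

// Assumption: only the exact string "true" enables the tools (default off).
function ooxmlToolsEnabled(env: Env): boolean {
  return env.ENABLE_OOXML_TOOLS === "true";
}

function isOoxmlTool(name: string): boolean {
  return OOXML_TOOL_NAMES.has(name);
}

// Hypothetical: what tools/list would advertise for a given environment.
function visibleTools(allTools: string[], env: Env): string[] {
  return ooxmlToolsEnabled(env) ? allTools : allTools.filter((n) => !isOoxmlTool(n));
}

// Hypothetical: the defensive check tools/call would apply to hand-crafted requests.
function canCall(name: string, env: Env): boolean {
  return !isOoxmlTool(name) || ooxmlToolsEnabled(env);
}
```

Checking the flag at both points means a caller who guesses a tool name still cannot reach it while the flag is off.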

apps/mcp-server/src/mcp.ts

Lines changed: 3 additions & 3 deletions
@@ -162,9 +162,9 @@ async function handleToolsCall(
   try {
     let resultText: string;

-    // Phase 4 OOXML tools, feature-flagged. tools/list also gates on the same flag,
-    // so callers should not see these tool names unless the flag is on. Defensive
-    // check here in case a caller hand-crafts a request.
+    // OOXML tools are feature-flagged; tools/list filters them out when the flag
+    // is off, so callers should not see these tool names. Defensive check here in
+    // case a caller hand-crafts a request.
     if (isOoxmlTool(name)) {
       if (!ooxmlToolsEnabled(env)) {
         return {

apps/mcp-server/src/ooxml-queries.ts

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 /**
- * Read-only schema-graph queries powering the Phase 4 MCP tools:
+ * Read-only schema-graph queries powering the OOXML MCP tools:
  *   ooxml_lookup_element, ooxml_lookup_type, ooxml_children,
  *   ooxml_attributes, ooxml_enum, ooxml_namespace_info.
  *

apps/mcp-server/src/ooxml-tools.ts

Lines changed: 5 additions & 17 deletions
@@ -1,14 +1,14 @@
 /**
- * Phase 4 read-only structural MCP tools. Behind ENABLE_OOXML_TOOLS env flag,
- * which gates both tools/list discovery and tools/call dispatch so the public
- * surface stays unchanged until the feature is intentionally enabled.
+ * Read-only structural MCP tools backed by the OOXML schema graph. Gated by
+ * ENABLE_OOXML_TOOLS, which filters both tools/list discovery and tools/call
+ * dispatch so the public surface stays unchanged until the flag is set.
  *
  * Tools:
  *   ooxml_lookup_element, ooxml_lookup_type, ooxml_children,
  *   ooxml_attributes, ooxml_enum, ooxml_namespace_info.
  *
- * Default profile is `transitional` until word-compatible-docx is composed
- * in Phase 6.
+ * Default profile is `transitional`. Future profiles (e.g. word-compatible-docx)
+ * will compose Transitional with Office extension schemas.
  */

 import { neon } from "@neondatabase/serverless";
@@ -315,8 +315,6 @@ function formatSymbolReport(label: string, hit: SymbolHit, profile: string): str
   lines.push(`- namespace: ${hit.namespaceUri}`);
   if (hit.typeRef) lines.push(`- type_ref: ${hit.typeRef}`);
   if (hit.sourceName) lines.push(`- source: ${hit.sourceName}`);
-  lines.push("");
-  lines.push("_behavior notes: none yet (Phase 5)._");
   return lines.join("\n");
 }

@@ -340,8 +338,6 @@ function formatChildrenReport(

   if (children.length === 0) {
     lines.push("_no direct or inherited children._");
-    lines.push("");
-    lines.push("_behavior notes: none yet (Phase 5)._");
     return lines.join("\n");
   }

@@ -359,8 +355,6 @@ function formatChildrenReport(
   lines.push(
     "_group entries are returned as-is; call `ooxml_children` on the group qname to expand them._",
   );
-  lines.push("");
-  lines.push("_behavior notes: none yet (Phase 5)._");
   return lines.join("\n");
 }

@@ -383,8 +377,6 @@ function formatAttributesReport(

   if (attrs.length === 0) {
     lines.push("_no attributes._");
-    lines.push("");
-    lines.push("_behavior notes: none yet (Phase 5)._");
     return lines.join("\n");
   }

@@ -401,8 +393,6 @@ function formatAttributesReport(
       `| ${a.localName} | ${a.attrUse} | ${a.typeRef ?? "-"} | ${a.defaultValue ?? "-"} | ${a.fixedValue ?? "-"} | ${from} |`,
     );
   }
-  lines.push("");
-  lines.push("_behavior notes: none yet (Phase 5)._");
   return lines.join("\n");
 }

@@ -420,8 +410,6 @@ function formatEnumReport(sym: SymbolHit, enums: EnumEntry[], profile: string):
   } else {
     for (const e of enums) lines.push(`- ${e.value}`);
   }
-  lines.push("");
-  lines.push("_behavior notes: none yet (Phase 5)._");
   return lines.join("\n");
 }
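
Reviewer note: every removal above drops the same two-line placeholder, so the report formatters now emit nothing about behavior notes. If curated notes are added later, one way to avoid reintroducing an empty placeholder is a shared helper that appends a section only when there is something to show. The sketch below is purely illustrative: `appendBehaviorNotes` and its output format are hypothetical and not part of this commit.

```typescript
/**
 * Hypothetical helper (not in the repo): append a behavior-notes section to a
 * report only when notes exist. With an empty list it leaves `lines` untouched,
 * so callers never ship a "none yet" placeholder.
 */
function appendBehaviorNotes(lines: string[], notes: string[]): void {
  if (notes.length === 0) return;
  lines.push("");
  lines.push("_behavior notes:_");
  for (const note of notes) lines.push(`- ${note}`);
}
```

A single helper would also keep the six formatters from drifting apart, which is how the duplicated placeholder ended up in every report in the first place.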

data/sources.json

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 {
-  "$comment": "Source manifest. Human-edited; scripts/sync-sources.ts upserts these rows into reference_sources.",
+  "$comment": "Source manifest. Human-edited; scripts/sources-sync.ts upserts these rows into reference_sources.",
   "sources": [
     {
       "name": "ecma-376",

db/migrations/0001_reference_sources.sql

Lines changed: 1 addition & 2 deletions
@@ -1,5 +1,4 @@
--- Phase 1: Provenance foundation
--- Adds reference_sources and source_id FK on spec_content.
+-- Provenance foundation: reference_sources catalog + source_id FK on spec_content.
 -- Idempotent: safe to run against fresh installs (matches db/schema.sql) or existing DBs.

 CREATE EXTENSION IF NOT EXISTS vector;

db/migrations/0002_xsd_schema.sql

Lines changed: 4 additions & 3 deletions
@@ -1,5 +1,5 @@
--- Phase 2: XSD schema tables (empty)
--- Profile-scoped symbol graph. All tables empty after this migration; data lands in Phase 3+.
+-- Profile-scoped XSD schema graph. All tables empty after this migration; data
+-- lands when scripts/ingest-xsd/ingest.ts runs against a populated cache.
 -- Idempotent: safe to run against fresh installs (matches db/schema.sql) or existing DBs.

 CREATE TABLE IF NOT EXISTS xsd_profiles (
@@ -116,7 +116,8 @@ CREATE TABLE IF NOT EXISTS xsd_enums (
 );

 -- Curated Word/Office behavior claims keyed to symbols.
--- claim_type enum is locked now (Phase 5 will populate).
+-- claim_type enum is locked now; the table stays empty until curated behavior
+-- notes start landing.
 CREATE TABLE IF NOT EXISTS behavior_notes (
   id SERIAL PRIMARY KEY,
   symbol_id INT REFERENCES xsd_symbols(id) ON DELETE CASCADE,

db/migrations/0003_phase3_metadata.sql → db/migrations/0003_xsd_metadata.sql

Lines changed: 3 additions & 1 deletion
@@ -1,4 +1,6 @@
--- Phase 3 review fix: preserve element/attribute @type and group-ref compositor context.
+-- Preserve element/attribute @type and group-ref compositor context so the
+-- structural lookup tools can resolve element-to-type chains and attach refs
+-- to their enclosing compositor.
 -- Idempotent.

 ALTER TABLE xsd_symbols

db/migrations/0004_local_element_scoping.sql

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
--- Phase 4 review fix: scope local element symbols by their owner.
+-- Scope local element symbols by their owner.
 --
 -- Before this migration, an inline <xsd:element name="X" type="..."/> declared
 -- inside two different complexTypes/groups collapsed to a single symbol keyed
