Commit 38b6457

Merge pull request #3 from superdoc-dev/caio/ooxml-reference-phase-4-mcp-tools
feat(mcp): OOXML reference engine
2 parents fbe64f1 + 07f1086 commit 38b6457

41 files changed

Lines changed: 5158 additions & 53 deletions

.gitignore

Lines changed: 7 additions & 1 deletion
@@ -5,4 +5,10 @@ dev/
 .wrangler/
 .env
 .mcp.json
-.vscode/
+.vscode/
+
+# Local-only planning doc (public repo)
+PLAN.md
+
+# XSD/spec artifacts: pulled by scripts/fetch-xsd.ts; never committed.
+data/xsd-cache/

CLAUDE.md

Lines changed: 47 additions & 11 deletions
@@ -29,15 +29,22 @@ apps/
     src/data/docs.ts ← All doc pages live here (single source of truth)
     src/components/ UI components (Sidebar, SuperDocPreview, etc.)
     src/pages/ Route pages (Home, Docs, SpecExplorer, Mcp)
-  mcp-server/ Cloudflare Worker MCP server for AI spec search
+  mcp-server/ Cloudflare Worker - MCP server (semantic + structural tools)
 packages/
   shared/ Database client, embedding client, types
 scripts/
-  ingest/ PDF → chunks → embeddings → database pipeline
+  ingest-pdf/ ECMA PDF -> spec_content (semantic search corpus)
+  ingest-xsd/ ECMA XSDs -> schema graph (structural query corpus)
+  sources-sync.ts data/sources.json -> reference_sources
+  db-migrate.ts Apply db/migrations/*.sql in order
 db/
-  schema.sql PostgreSQL + pgvector schema
+  schema.sql PostgreSQL + pgvector + XSD schema graph
+  migrations/ Numbered, idempotent SQL migrations
+data/
+  sources.json Source manifest (artifact URLs, sha256, license notes)
+  xsd-cache/ Local-only XSD download cache (gitignored)
 dev/
-  data/ Extracted/chunked/embedded spec content
+  data/ Extracted/chunked/embedded PDF content
 ```
 
 ## Commands
@@ -97,23 +104,52 @@ The XML you provide is wrapped in a minimal `w:document > w:body` structure auto
 
 ## MCP Server
 
-Cloudflare Worker exposing three MCP tools for semantic spec search:
+Cloudflare Worker exposing two flavors of MCP tools backed by the same database.
 
-- `search_ecma_spec` — semantic vector search across 18,000+ spec chunks
-- `get_section` — fetch a specific section by ID (e.g., "17.3.1.24")
-- `list_parts` — browse the spec structure
+Semantic search over the spec PDF (powered by `spec_content`):
+
+- `search_ecma_spec` - semantic vector search across 18,000+ spec chunks
+- `get_section` - fetch a specific section by ID (e.g., "17.3.1.24")
+- `list_parts` - browse the spec structure
+
+Structural queries over the XSD schema graph (powered by `xsd_*` tables):
+
+- `ooxml_lookup_element` / `ooxml_lookup_type` - canonical symbol info
+- `ooxml_children` - legal children of an element/type/group, in document order
+- `ooxml_attributes` - attributes including those inherited and unfolded from attributeGroup refs
+- `ooxml_enum` - simpleType enumeration values
+- `ooxml_namespace_info` - vocabularies and per-profile symbol counts for a namespace URI
 
 Uses PostgreSQL with pgvector (Neon serverless in production, Docker locally).
 
-## Data Pipeline
+## Data Pipelines
+
+Two ingest paths feed the same database. Both are reproducible from `data/sources.json`.
 
-Ingests ECMA-376 PDFs into the vector database:
+**PDF (semantic corpus, into `spec_content`)**:
 
 ```
 PDF → extract (Python) → chunk (6KB) → embed (Voyage) → upload (PostgreSQL)
 ```
 
-Run the full pipeline: `bun scripts/ingest/pipeline.ts`
+```bash
+bun run pdf:ingest 1 ./pdfs/ECMA-376-Part1.pdf # full pipeline for one part
+```
+
+See `scripts/ingest-pdf/README.md`.
+
+**XSD (structural corpus, into `xsd_*` tables)**:
+
+```
+ECMA Part 4 zip → fetch+verify (sha256) → parse → ingest (single transaction)
+```
+
+```bash
+bun run xsd:fetch # URL + sha256 from data/sources.json
+bun run xsd:ingest
+```
+
+See `scripts/ingest-xsd/README.md`.
 
 ## Database
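
The `fetch+verify (sha256)` step of the XSD pipeline in the CLAUDE.md changes above can be pictured with a short sketch. This is not the code from `scripts/fetch-xsd.ts`, which is not shown in this view; the `SourceEntry` shape and both helper functions are hypothetical, and only the idea of pinning a URL and a digest in `data/sources.json` comes from the commit.

```typescript
import { createHash } from "node:crypto";

// Hypothetical manifest entry; the real data/sources.json layout is not shown here.
interface SourceEntry {
  url: string;
  sha256: string; // hex digest the downloaded artifact is expected to have
}

// Hex-encoded SHA-256 of a byte buffer.
function sha256Hex(bytes: Uint8Array): string {
  return createHash("sha256").update(bytes).digest("hex");
}

// Throws when the downloaded bytes do not match the digest pinned in the manifest.
function verifySha256(entry: SourceEntry, bytes: Uint8Array): void {
  const actual = sha256Hex(bytes);
  if (actual !== entry.sha256.toLowerCase()) {
    throw new Error(
      `sha256 mismatch for ${entry.url}: expected ${entry.sha256}, got ${actual}`,
    );
  }
}
```

Failing before parsing means a changed upstream archive cannot reach the ingest transaction unnoticed, which is presumably why the manifest carries the digest next to the URL.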

README.md

Lines changed: 9 additions & 5 deletions
@@ -10,9 +10,10 @@ The OOXML spec, explained by people who actually implemented it.
 
 An interactive reference for ECMA-376 (Office Open XML) built by the [SuperDoc — DOCX editing and tooling](https://superdoc.dev) team. Every page combines XML structure, live rendered previews, and implementation notes that tell you what the spec doesn't.
 
-- **Live previews** — Edit XML and see it render in real-time. Every example is a working document.
-- **Implementation notes** — Where Word diverges from the spec, what will break your code, and what to do about it.
-- **Semantic spec search** — 18,000+ spec chunks searchable by meaning via MCP server.
+- **Live previews** - Edit XML and see it render in real-time. Every example is a working document.
+- **Implementation notes** - Where Word diverges from the spec, what will break your code, and what to do about it.
+- **Semantic spec search** - 18,000+ spec chunks searchable by meaning via MCP server.
+- **Structural schema lookup** - Element children, attributes, types, enums, namespaces. Same MCP server, deterministic answers from the parsed XSDs.
 
 ## Why?
 
@@ -22,13 +23,16 @@ We faced this at SuperDoc — building a document engine on native OOXML with no
 
 ## MCP Server
 
-Search the ECMA-376 spec with AI. Ask questions in natural language, get answers grounded in the actual specification.
+Ask questions in natural language and get answers grounded in the spec, or query the schema graph for precise structural answers.
 
 ```bash
 claude mcp add --transport http ecma-spec https://api.ooxml.dev/mcp
 ```
 
-Works with Claude Code, Cursor, and any MCP-compatible client. Three tools: `search_ecma_spec` (semantic search), `get_section` (by ID), and `list_parts` (browse structure).
+Works with Claude Code, Cursor, and any MCP-compatible client. Two flavors of tools share one server:
+
+- **Semantic** (over the spec PDF): `search_ecma_spec`, `get_section`, `list_parts`
+- **Structural** (over the parsed XSDs): `ooxml_lookup_element`, `ooxml_lookup_type`, `ooxml_children`, `ooxml_attributes`, `ooxml_enum`, `ooxml_namespace_info`
 
 ## Development
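
For a client that talks to the server without an MCP SDK, a structural query from the README changes above is an ordinary JSON-RPC `tools/call` request. The sketch below builds one for `ooxml_children`. The `buildToolCall` helper and the `qname` argument name are assumptions for illustration; the tools' input schemas are not shown in this view.

```typescript
// Shape of an MCP `tools/call` request carried over JSON-RPC 2.0.
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number | string;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

// Hypothetical helper: wraps a tool name and its arguments in the request envelope.
function buildToolCall(
  id: number | string,
  name: string,
  args: Record<string, unknown>,
): ToolCallRequest {
  return { jsonrpc: "2.0", id, method: "tools/call", params: { name, arguments: args } };
}

// `qname` is an assumed argument name, not taken from the server's schema.
const request = buildToolCall(1, "ooxml_children", { qname: "w:p" });
const body = JSON.stringify(request);
```

The serialized `body` would be sent as an HTTP POST with a JSON content type to the MCP endpoint. The actual argument names can be read from the server's `tools/list` response.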

apps/mcp-server/src/mcp.ts

Lines changed: 13 additions & 3 deletions
@@ -7,6 +7,7 @@
 import { createDb } from "./db";
 import { embedQuery } from "./embeddings";
 import type { Env } from "./index";
+import { callOoxmlTool, isOoxmlTool, OOXML_TOOL_DEFS } from "./ooxml-tools";
 
 // JSON-RPC types
 interface JsonRpcRequest {
@@ -136,9 +137,7 @@ function handleToolsList(id: number | string | null): JsonRpcResponse {
   return {
     jsonrpc: "2.0",
     id,
-    result: {
-      tools: TOOLS,
-    },
+    result: { tools: [...TOOLS, ...OOXML_TOOL_DEFS] },
   };
 }
 
@@ -162,6 +161,17 @@ async function handleToolsCall(
   try {
     let resultText: string;
 
+    // Structural OOXML tools share the dispatch with the existing semantic
+    // tools below.
+    if (isOoxmlTool(name)) {
+      resultText = await callOoxmlTool(name, args ?? {}, env);
+      return {
+        jsonrpc: "2.0",
+        id,
+        result: { content: [{ type: "text", text: resultText }] },
+      };
+    }
+
     switch (name) {
       case "search_ecma_spec": {
         const query = args?.query as string;
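
The diff above imports `callOoxmlTool`, `isOoxmlTool`, and `OOXML_TOOL_DEFS` from `./ooxml-tools`, but that file is not shown in this view. One plausible shape for such a module, assuming a static name-to-handler map, is sketched below. Everything other than the three exported names and the six tool names is hypothetical, including the `ToolDef` shape, the simplified `Env`, and the stub handlers, which echo their input instead of querying the `xsd_*` tables.

```typescript
// Simplified stand-ins; the real Env and tool schemas live elsewhere in the repo.
type Env = Record<string, unknown>;
type Args = Record<string, unknown>;
type Handler = (args: Args, env: Env) => Promise<string>;

interface ToolDef {
  name: string;
  description: string;
  inputSchema: { type: "object"; properties: Record<string, unknown> };
}

const OOXML_TOOL_NAMES = [
  "ooxml_lookup_element",
  "ooxml_lookup_type",
  "ooxml_children",
  "ooxml_attributes",
  "ooxml_enum",
  "ooxml_namespace_info",
] as const;

type OoxmlToolName = (typeof OOXML_TOOL_NAMES)[number];

// Type guard: lets the dispatcher route structural tools before the semantic switch.
export function isOoxmlTool(name: string): name is OoxmlToolName {
  return (OOXML_TOOL_NAMES as readonly string[]).includes(name);
}

// Definitions merged into the tools/list response; descriptions are placeholders.
export const OOXML_TOOL_DEFS: ToolDef[] = OOXML_TOOL_NAMES.map(
  (name): ToolDef => ({
    name,
    description: `Structural query: ${name}`,
    inputSchema: { type: "object", properties: {} },
  }),
);

// Stub handler factory: returns the tool name and arguments as JSON text.
const stub =
  (tool: OoxmlToolName): Handler =>
  async (args) =>
    JSON.stringify({ tool, args });

const HANDLERS: Record<OoxmlToolName, Handler> = {
  ooxml_lookup_element: stub("ooxml_lookup_element"),
  ooxml_lookup_type: stub("ooxml_lookup_type"),
  ooxml_children: stub("ooxml_children"),
  ooxml_attributes: stub("ooxml_attributes"),
  ooxml_enum: stub("ooxml_enum"),
  ooxml_namespace_info: stub("ooxml_namespace_info"),
};

// Looks up the handler for an already-validated tool name and runs it.
export async function callOoxmlTool(name: OoxmlToolName, args: Args, env: Env): Promise<string> {
  return HANDLERS[name](args, env);
}
```

Typing the handler map as `Record<OoxmlToolName, Handler>` makes the compiler reject a tool that is listed but has no handler, so `tools/list` and `tools/call` cannot drift apart in this sketch.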
