fix(docx): round-trip TOC fields and point comments#837
Conversation
Two Tier-1 open->save fidelity fixes found via a docx round-trip bake-off: 1. Cross-paragraph complex fields (TOC, INDEX, multi-paragraph TA) were dropped: the per-paragraph field state machine only finalizes a field when its w:fldChar end is in the same paragraph, so an unterminated field's begin/instrText/separate runs were swallowed. Buffer those runs and, if the paragraph ends with the field still open, splice them back at the field's start index (preserving order vs. in-paragraph non-run content such as the first TOC entry's hyperlink). Restores the documented fields.toc roundTrip:'full' guarantee. 2. Range-less 'point' comments (bare w:commentReference, no commentRangeStart/End) were dropped: the model had no commentReference node, so the reference run parsed empty and was consolidated away, and references were only emitted as a side-effect of commentRangeEnd. Add a first-class CommentReference node (parse + serialize) as the single source of truth for every reference. Adds toc-field-roundtrip + comment-reference-roundtrip tests. Full packages/core docx suite green (325 tests), typecheck clean. Pre-PR: run 'bun run api:extract' to refresh the API report for the new CommentReference export. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The new public CommentReference type (point-comment round-trip fix) adds a member to ParagraphContent; regenerate the API Extractor report so api:check passes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
|
All contributors have signed the CLA ✍️ ✅ Posted by the CLA bot. |
|
I have read the CLA Document and I hereby sign the CLA |
jedrazb
left a comment
There was a problem hiding this comment.
Thanks for this — the TOC / cross-paragraph field fix (part 1) is solid: I traced the buffer-and-splice and the begin-run-before-hyperlink ordering and it round-trips correctly. Requesting changes on the point-comment half (part 2) because, as written, it regresses comment fidelity on the editor's own save path.
The regression
This PR removes the <w:commentReference> run from the commentRangeEnd serializer case and makes it depend on a commentReference content node. That node is produced only by the DOCX parser — never by the editor's save pipeline:
- Editor comments are ProseMirror marks (
commentOps.tsaddMark(... schema.marks.comment ...)), not content nodes. - On save,
insertCommentRanges(packages/core/src/prosemirror/conversion/fromProseDoc/paragraph.ts) emits onlycommentRangeStart/commentRangeEnd— there is zerocommentReferenceproducer anywhere underpackages/core/src/prosemirror/. The reply / tracked-change paths (injectReplyRangeMarkers.ts) emit only range markers too. - The import side drops it as well:
toProseDoc/paragraph.tshas nocommentReferencecase, so a parsed reference node is silently discarded before it reaches PM.
Net effect: any document opened and saved through the editor — and every comment created in the editor — loses all <w:commentReference> runs. Imported range-less point comments disappear entirely on re-save. Per ECMA-376 the reference run is the comment anchor, so this breaks the documented fields/comments roundTrip: 'full' guarantee (docs/site/data/word-features.ts).
Verified empirically, not just by reading: building a PM doc with a comment mark and running the real fromProseDoc + serializer with this diff applied yields commentRangeStart + commentRangeEnd and zero commentReference runs (pre-PR: one). A full toProseDoc → fromProseDoc → serialize round-trip of a 2-comment doc drops both references and loses the point comment.
What's needed before merge
- Emit the node on the PM-save path: in
insertCommentRanges(andinjectReplyRangeMarkers/ the TC variant), push{ type: 'commentReference', id }immediately after eachcommentRangeEnd. - Decide
commentReferencehandling intoProseDoc/paragraph.tsexplicitly (round-trip it or intentionally drop it — right now it's an unhandled silent drop). - Add a regression test that goes through ProseMirror (
toProseDoc → fromProseDoc → serialize), not justparseDocx → repackDocx— that's why the current suite stays green.
The TOC fix is independent and mergeable on its own; splitting it out (as you offered) and holding the comment half for the PM-path fix is the clean path. (Also still blocked on CLA.)
| case 'commentRangeEnd': | ||
| return ( | ||
| `<w:commentRangeEnd w:id="${content.id}"/>` + | ||
| `<w:r><w:rPr><w:rStyle w:val="CommentReference"/></w:rPr><w:commentReference w:id="${content.id}"/></w:r>` | ||
| ); | ||
| // Emit only the end marker. The reference run is emitted from its own | ||
| // 'commentReference' node (preserved at parse time), so every comment — | ||
| // ranged or point — round-trips its reference exactly once. | ||
| return `<w:commentRangeEnd w:id="${content.id}"/>`; | ||
| case 'commentReference': | ||
| return `<w:r><w:rPr><w:rStyle w:val="CommentReference"/></w:rPr><w:commentReference w:id="${content.id}"/></w:r>`; |
There was a problem hiding this comment.
This split is the root of the regression. Moving the <w:commentReference> run out of the commentRangeEnd case is only safe if every code path that produces a commentRangeEnd also produces a commentReference node. The DOCX parser now does — but the editor's PM-save path does not.
insertCommentRanges in fromProseDoc/paragraph.ts emits commentRangeStart/commentRangeEnd from comment marks and never a commentReference node (grep confirms no commentReference producer anywhere under prosemirror/). So after this change, any comment saved through the editor serializes with no reference run at all.
Empirically reproduced: PM doc with a comment mark → fromProseDoc → serialize with this diff = commentRangeStart + commentRangeEnd, zero commentReference (was 1 pre-PR).
Fix: push { type: 'commentReference', id } right after each commentRangeEnd in insertCommentRanges and in injectReplyRangeMarkers / the TC variant, and add a commentReference case to toProseDoc/paragraph.ts.
| const doc = await parseDocx(await buildCommentsDocx(), { preloadFonts: false }); | ||
| const out = new Uint8Array(await repackDocx(doc)); |
There was a problem hiding this comment.
This test exercises parseDocx → repackDocx (raw Document model → serializer), which preserves the parser-produced commentReference node — so it passes even though the editor save path regresses. The bug lives in the ProseMirror round-trip, which this never touches. Please add a sibling test that goes toProseDoc → fromProseDoc → serialize and asserts exactly one <w:commentReference> per comment (and that a range-less point comment survives).
Making `commentReference` the single source of truth for the reference run regressed comment fidelity on the editor's own save path: editor comments are ProseMirror `comment` marks, and `insertCommentRanges` (fromProseDoc) only synthesized `commentRangeStart`/`End` from them — nothing produced the `commentReference` node, so the serializer (which now emits the run only from that node) dropped every `<w:commentReference>` on save. Any doc opened+saved through the editor, and every editor-created comment, lost its anchor. Fix, per review: - fromProseDoc `insertCommentRanges`: emit a `commentReference` node right after each `commentRangeEnd` (both the mid-paragraph close and the end-of-paragraph close), so the editor save path round-trips the anchor. - docx `injectReplyRangeMarkers` + `injectTCReplyRangeMarkers`: emit a parallel `commentReference` for every reply / tracked-change reply too. - toProseDoc: handle `commentReference` explicitly — intentionally NOT converted to a PM node (the mark is the source of truth; it's regenerated on save, so keeping it would double-emit). Range-less "point" comments can't be a zero-width mark and are dropped on the editor path (still lossless via parseDocx→repackDocx). - New regression test goes through ProseMirror (toProseDoc → fromProseDoc → serialize), not just parseDocx → repackDocx, and asserts exactly one reference survives a save→load→save cycle (no loss, no accumulation). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Thanks for the careful review — you're right, and the empirical repro nailed it. I fixed the PM-save-path regression rather than splitting the PR:
Kept the two halves together since the comment path is correct now, but happy to split the TOC fix out if you'd still prefer. CLA's signed too. |
What this fixes
Two Tier-1
.docxopen→save round-trip losses, each with a regression test. Found while building a round-trip fidelity bake-off against@eigenpal/docx-editor-core.1. Cross-paragraph complex fields (Table of Contents) were dropped
parseParagraphContentsruns the complex-field state machine per paragraph and only finalizes a field when it sees the matchingw:fldCharend in the same paragraph. A real Word TOC (andINDEX, a multi-paragraphTA) has itsbegin/instrText/separatein the first entry paragraph and itsendin a later one, so the field never closed and its consumed runs were swallowed — thebegin+ instruction were silently lost. This regressed the documentedfields.tocroundTrip: 'full'guarantee (docs/site/data/word-features.ts).Fix: buffer the runs an open field consumes and, if the paragraph ends with the field still open, splice them back at the field's start index — preserving order vs. in-paragraph non-run content (e.g. the first TOC entry's hyperlink). Each run already carries its
fieldChar/instrTextcontent, so the begin + instruction round-trip verbatim and Word re-stitches the field across paragraphs. Touches onlyparagraphParser/content.ts.2. Range-less "point" comments were dropped
A comment can be anchored by a bare
w:commentReferencewith nocommentRangeStart/End. The model had nocommentReferencenode, so the reference run parsed to empty content and was dropped by the run consolidator; references were only emitted as a side-effect ofcommentRangeEnd. A point comment therefore had no representation, and a two-comment document round-tripped to one.Fix: add a first-class
CommentReferenceparagraph-content node — parse everyw:commentReference(wrapped in a run or bare) and serialize it explicitly, making it the single source of truth for every reference (and stop the double-emit fromcommentRangeEnd). Ranged and point comments now both survive with exactly one reference each.Tests & verification
toc-field-roundtrip.test.ts(incl. the begin-runs-before-first-entry-hyperlink ordering),comment-reference-roundtrip.test.ts.packages/coredocx suite green (325 tests, 0 fail);bun run typecheckclean.CommentReferenceexport.Notes
content.tsbut are independent.🤖 Generated with Claude Code
Need help on this PR? Tag
/codesmithwith what you need. Autofix is disabled.