Skip to content

fix(docx): round-trip TOC fields and point comments#837

Open
sathinator-afk wants to merge 4 commits into
eigenpal:mainfrom
sathinator-afk:fix/roundtrip-toc-and-point-comments
Open

fix(docx): round-trip TOC fields and point comments#837
sathinator-afk wants to merge 4 commits into
eigenpal:mainfrom
sathinator-afk:fix/roundtrip-toc-and-point-comments

Conversation

@sathinator-afk

@sathinator-afk sathinator-afk commented Jun 16, 2026

Copy link
Copy Markdown

What this fixes

Two Tier-1 .docx open→save round-trip losses, each with a regression test. Found while building a round-trip fidelity bake-off against @eigenpal/docx-editor-core.

1. Cross-paragraph complex fields (Table of Contents) were dropped

parseParagraphContents runs the complex-field state machine per paragraph and only finalizes a field when it sees the matching w:fldChar end in the same paragraph. A real Word TOC (and INDEX, a multi-paragraph TA) has its begin/instrText/separate in the first entry paragraph and its end in a later one, so the field never closed and its consumed runs were swallowed — the begin + instruction were silently lost. This regressed the documented fields.toc roundTrip: 'full' guarantee (docs/site/data/word-features.ts).

Fix: buffer the runs an open field consumes and, if the paragraph ends with the field still open, splice them back at the field's start index — preserving order vs. in-paragraph non-run content (e.g. the first TOC entry's hyperlink). Each run already carries its fieldChar/instrText content, so the begin + instruction round-trip verbatim and Word re-stitches the field across paragraphs. Touches only paragraphParser/content.ts.

2. Range-less "point" comments were dropped

A comment can be anchored by a bare w:commentReference with no commentRangeStart/End. The model had no commentReference node, so the reference run parsed to empty content and was dropped by the run consolidator; references were only emitted as a side-effect of commentRangeEnd. A point comment therefore had no representation, and a two-comment document round-tripped to one.

Fix: add a first-class CommentReference paragraph-content node — parse every w:commentReference (wrapped in a run or bare) and serialize it explicitly, making it the single source of truth for every reference (and stop the double-emit from commentRangeEnd). Ranged and point comments now both survive with exactly one reference each.

Tests & verification

  • New: toc-field-roundtrip.test.ts (incl. the begin-runs-before-first-entry-hyperlink ordering), comment-reference-roundtrip.test.ts.
  • Full packages/core docx suite green (325 tests, 0 fail); bun run typecheck clean.
  • API report refreshed for the new public CommentReference export.

Notes

  • Authored with AI assistance (Claude Code); I'll sign the CLA on this PR.
  • Happy to split into two PRs (TOC; point comments) if you'd prefer — they share content.ts but are independent.

🤖 Generated with Claude Code


View with Codesmith Autofix with Codesmith
Need help on this PR? Tag /codesmith with what you need. Autofix is disabled.

sathinator-afk and others added 2 commits June 16, 2026 18:09
Two Tier-1 open->save fidelity fixes found via a docx round-trip bake-off:

1. Cross-paragraph complex fields (TOC, INDEX, multi-paragraph TA) were dropped:
   the per-paragraph field state machine only finalizes a field when its
   w:fldChar end is in the same paragraph, so an unterminated field's
   begin/instrText/separate runs were swallowed. Buffer those runs and, if the
   paragraph ends with the field still open, splice them back at the field's
   start index (preserving order vs. in-paragraph non-run content such as the
   first TOC entry's hyperlink). Restores the documented fields.toc
   roundTrip:'full' guarantee.

2. Range-less 'point' comments (bare w:commentReference, no commentRangeStart/End)
   were dropped: the model had no commentReference node, so the reference run
   parsed empty and was consolidated away, and references were only emitted as a
   side-effect of commentRangeEnd. Add a first-class CommentReference node
   (parse + serialize) as the single source of truth for every reference.

Adds toc-field-roundtrip + comment-reference-roundtrip tests. Full packages/core
docx suite green (325 tests), typecheck clean. Pre-PR: run 'bun run api:extract'
to refresh the API report for the new CommentReference export.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The new public CommentReference type (point-comment round-trip fix) adds a member
to ParagraphContent; regenerate the API Extractor report so api:check passes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@vercel

vercel Bot commented Jun 16, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
docx-editor Ready Ready Preview, Comment Jun 16, 2026 10:09pm

Request Review

@eigenpal-release-pal

eigenpal-release-pal Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

All contributors have signed the CLA ✍️ ✅

Posted by the CLA bot.

@sathinator-afk

Copy link
Copy Markdown
Author

I have read the CLA Document and I hereby sign the CLA

@jedrazb jedrazb left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this — the TOC / cross-paragraph field fix (part 1) is solid: I traced the buffer-and-splice and the begin-run-before-hyperlink ordering and it round-trips correctly. Requesting changes on the point-comment half (part 2) because, as written, it regresses comment fidelity on the editor's own save path.

The regression

This PR removes the <w:commentReference> run from the commentRangeEnd serializer case and makes it depend on a commentReference content node. That node is produced only by the DOCX parser — never by the editor's save pipeline:

  • Editor comments are ProseMirror marks (commentOps.ts addMark(... schema.marks.comment ...)), not content nodes.
  • On save, insertCommentRanges (packages/core/src/prosemirror/conversion/fromProseDoc/paragraph.ts) emits only commentRangeStart / commentRangeEnd — there is zero commentReference producer anywhere under packages/core/src/prosemirror/. The reply / tracked-change paths (injectReplyRangeMarkers.ts) emit only range markers too.
  • The import side drops it as well: toProseDoc/paragraph.ts has no commentReference case, so a parsed reference node is silently discarded before it reaches PM.

Net effect: any document opened and saved through the editor — and every comment created in the editor — loses all <w:commentReference> runs. Imported range-less point comments disappear entirely on re-save. Per ECMA-376 the reference run is the comment anchor, so this breaks the documented fields/comments roundTrip: 'full' guarantee (docs/site/data/word-features.ts).

Verified empirically, not just by reading: building a PM doc with a comment mark and running the real fromProseDoc + serializer with this diff applied yields commentRangeStart + commentRangeEnd and zero commentReference runs (pre-PR: one). A full toProseDoc → fromProseDoc → serialize round-trip of a 2-comment doc drops both references and loses the point comment.

What's needed before merge

  1. Emit the node on the PM-save path: in insertCommentRanges (and injectReplyRangeMarkers / the TC variant), push { type: 'commentReference', id } immediately after each commentRangeEnd.
  2. Decide commentReference handling in toProseDoc/paragraph.ts explicitly (round-trip it or intentionally drop it — right now it's an unhandled silent drop).
  3. Add a regression test that goes through ProseMirror (toProseDoc → fromProseDoc → serialize), not just parseDocx → repackDocx — that's why the current suite stays green.

The TOC fix is independent and mergeable on its own; splitting it out (as you offered) and holding the comment half for the PM-path fix is the clean path. (Also still blocked on CLA.)

Comment on lines 351 to +357
case 'commentRangeEnd':
return (
`<w:commentRangeEnd w:id="${content.id}"/>` +
`<w:r><w:rPr><w:rStyle w:val="CommentReference"/></w:rPr><w:commentReference w:id="${content.id}"/></w:r>`
);
// Emit only the end marker. The reference run is emitted from its own
// 'commentReference' node (preserved at parse time), so every comment —
// ranged or point — round-trips its reference exactly once.
return `<w:commentRangeEnd w:id="${content.id}"/>`;
case 'commentReference':
return `<w:r><w:rPr><w:rStyle w:val="CommentReference"/></w:rPr><w:commentReference w:id="${content.id}"/></w:r>`;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This split is the root of the regression. Moving the <w:commentReference> run out of the commentRangeEnd case is only safe if every code path that produces a commentRangeEnd also produces a commentReference node. The DOCX parser now does — but the editor's PM-save path does not.

insertCommentRanges in fromProseDoc/paragraph.ts emits commentRangeStart/commentRangeEnd from comment marks and never a commentReference node (grep confirms no commentReference producer anywhere under prosemirror/). So after this change, any comment saved through the editor serializes with no reference run at all.

Empirically reproduced: PM doc with a comment mark → fromProseDoc → serialize with this diff = commentRangeStart + commentRangeEnd, zero commentReference (was 1 pre-PR).

Fix: push { type: 'commentReference', id } right after each commentRangeEnd in insertCommentRanges and in injectReplyRangeMarkers / the TC variant, and add a commentReference case to toProseDoc/paragraph.ts.

Comment on lines +88 to +89
const doc = await parseDocx(await buildCommentsDocx(), { preloadFonts: false });
const out = new Uint8Array(await repackDocx(doc));

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test exercises parseDocx → repackDocx (raw Document model → serializer), which preserves the parser-produced commentReference node — so it passes even though the editor save path regresses. The bug lives in the ProseMirror round-trip, which this never touches. Please add a sibling test that goes toProseDoc → fromProseDoc → serialize and asserts exactly one <w:commentReference> per comment (and that a range-less point comment survives).

Making `commentReference` the single source of truth for the reference run
regressed comment fidelity on the editor's own save path: editor comments are
ProseMirror `comment` marks, and `insertCommentRanges` (fromProseDoc) only
synthesized `commentRangeStart`/`End` from them — nothing produced the
`commentReference` node, so the serializer (which now emits the run only from
that node) dropped every `<w:commentReference>` on save. Any doc opened+saved
through the editor, and every editor-created comment, lost its anchor.

Fix, per review:
- fromProseDoc `insertCommentRanges`: emit a `commentReference` node right after
  each `commentRangeEnd` (both the mid-paragraph close and the end-of-paragraph
  close), so the editor save path round-trips the anchor.
- docx `injectReplyRangeMarkers` + `injectTCReplyRangeMarkers`: emit a parallel
  `commentReference` for every reply / tracked-change reply too.
- toProseDoc: handle `commentReference` explicitly — intentionally NOT converted
  to a PM node (the mark is the source of truth; it's regenerated on save, so
  keeping it would double-emit). Range-less "point" comments can't be a
  zero-width mark and are dropped on the editor path (still lossless via
  parseDocx→repackDocx).
- New regression test goes through ProseMirror (toProseDoc → fromProseDoc →
  serialize), not just parseDocx → repackDocx, and asserts exactly one
  reference survives a save→load→save cycle (no loss, no accumulation).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@sathinator-afk

Copy link
Copy Markdown
Author

Thanks for the careful review — you're right, and the empirical repro nailed it. I fixed the PM-save-path regression rather than splitting the PR:

  • insertCommentRanges (fromProseDoc/paragraph.ts) now emits a commentReference node right after each commentRangeEnd (both the mid-paragraph and end-of-paragraph close), and injectReplyRangeMarkers / injectTCReplyRangeMarkers do the same for replies and tracked-change-parented replies.
  • toProseDoc/paragraph.ts now handles commentReference explicitly — intentionally not converted to a PM node, since the comment mark is the source of truth and the reference is regenerated on save (keeping it would double-emit). A range-less point comment can't be expressed as a zero-width mark, so it's dropped on the editor path (still lossless via parseDocx→repackDocx), documented inline.
  • Added a regression test that goes through ProseMirror (toProseDoc → fromProseDoc → serialize) and asserts exactly one reference survives a save→load→save cycle — no loss, no accumulation.

Kept the two halves together since the comment path is correct now, but happy to split the TOC fix out if you'd still prefer. CLA's signed too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants