Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
130 changes: 122 additions & 8 deletions packages/markdown-html/src/rules.js
Original file line number Diff line number Diff line change
Expand Up @@ -498,42 +498,155 @@ const HTML_BLOCK_RULE = {
}
};

const CAPTION_RULE = {
deserialize(el, next, ignoreSpace) {
if (el.tagName && el.tagName.toLowerCase() === 'caption') {
return {
type: 'caption',
nodes: next(el.childNodes, ignoreSpace)
};
}
}
};
Comment on lines +501 to +510
Copy link

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CAPTION_RULE returns a plain object with type: 'caption' and no $class. If a <caption> appears outside of a <table> (invalid HTML, but possible input), this node will leak into the output DOM and likely break downstream visitors/serializers that expect only CommonMark/CiceroMark $class nodes. Consider handling <caption> directly inside the table branch (query the DOM children for <caption> and deserialize its contents there) or returning a valid CommonMark node and filtering it out safely.

Copilot uses AI. Check for mistakes.

/**
* Clean table cell nodes by removing Softbreaks and normalizing whitespace.
* @param {Array} nodes - the list of nodes
* @returns {Array} - the cleaned list of nodes
*/
function cleanTableNodes(nodes) {
const NS = CommonMarkModel.NAMESPACE;
const TEXT = `${NS}.Text`;
const SOFT = `${NS}.Softbreak`;

if (!nodes) return [];
nodes = Array.isArray(nodes) ? nodes : [nodes];

const merged = nodes.reduce((acc, node) => {
if (!node) return acc;

let newNode = { ...node };
if (newNode.nodes) newNode = { ...newNode, nodes: cleanTableNodes(newNode.nodes) };

if (newNode.$class === SOFT) {
newNode = { $class: TEXT, text: ' ' };
}

const last = acc[acc.length - 1];
if (last && last.$class === TEXT && newNode.$class === TEXT)
{
last.text += newNode.text;
}
else
{
acc.push(newNode);
}

return acc;
}, []);

// Normalize whitespace inside Text nodes
merged.forEach(n => {
if (n.$class === TEXT) n.text = n.text.replace(/\s+/g, ' ');
});


if (merged.length > 0 && merged[0].$class === TEXT) {
merged[0].text = merged[0].text.replace(/^\s+/, '');
}
if (merged.length > 0 && merged[merged.length - 1].$class === TEXT) {
merged[merged.length - 1].text = merged[merged.length - 1].text.replace(/\s+$/, '');
}

return merged.filter(n => n.$class !== TEXT || n.text.length > 0);
}
Comment on lines +517 to +562
Copy link

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cleanTableNodes uses 2-space indentation and nonstandard brace placement in a file/repo configured for 4-space indentation (packages/markdown-html/.eslintrc.yml sets indent: [error, 4]). This will cause ESLint failures. Please reformat this function to match the established 4-space indentation and brace style used elsewhere in rules.js.

Copilot uses AI. Check for mistakes.


const TABLE_RULE = {
deserialize(el, next, ignoreSpace) {
if (el.tagName && el.tagName.toLowerCase() === 'table') {
return {
const children = next(el.childNodes, ignoreSpace);
const captionNode = children.find(c => c.type === 'caption');
let tableNodes = children.filter(node =>
node.$class === `${CommonMarkModel.NAMESPACE}.TableHead` ||
node.$class === `${CommonMarkModel.NAMESPACE}.TableBody`
);

let head = tableNodes.find(n => n.$class === `${CommonMarkModel.NAMESPACE}.TableHead`);
const body = tableNodes.find(n => n.$class === `${CommonMarkModel.NAMESPACE}.TableBody`);

if (!head && body && body.nodes && body.nodes.length > 0) {
const firstRow = body.nodes[0];
const hasHeaderCells = firstRow.nodes && firstRow.nodes.some(n => n.$class === `${CommonMarkModel.NAMESPACE}.HeaderCell`);

if (hasHeaderCells) {
head = {
$class: `${CommonMarkModel.NAMESPACE}.TableHead`,
nodes: [firstRow]
};
const newBody = {
$class: `${CommonMarkModel.NAMESPACE}.TableBody`,
nodes: body.nodes.slice(1)
};
tableNodes = [head, newBody];
}
}

const table = {
$class: `${CommonMarkModel.NAMESPACE}.Table`,
nodes: next(el.childNodes),
nodes: tableNodes,
};

if (captionNode) {
const captionParagraph = {
$class: `${CommonMarkModel.NAMESPACE}.Paragraph`,
nodes: [
{
$class: `${CommonMarkModel.NAMESPACE}.Strong`,
nodes: cleanTableNodes(captionNode.nodes)
},
Comment on lines +603 to +607
Copy link

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrapping the entire caption content in a Strong node assumes the caption deserializes to inline content only, but HTML <caption> allows flow content (e.g., <p>, lists). If block nodes are produced here, they’ll become children of Strong, which is an invalid CommonMark structure and can break rendering/serialization. Consider normalizing caption content to inline-only text for the bolded title, or emitting the caption as one/more block nodes before the table without forcing Strong around arbitrary children.

Copilot uses AI. Check for mistakes.
{
$class: `${CommonMarkModel.NAMESPACE}.Text`,
text: '\n\n'
Comment on lines +607 to +610
Copy link

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The caption emission builds a Paragraph that contains a Text node with literal \n\n. Newlines inside a paragraph are not the right way to separate block nodes and can lead to unexpected Markdown/HTML output (e.g., softbreaks inside the paragraph). Prefer returning the caption paragraph as its own block and let the block serializer handle spacing between the paragraph and the following table (or insert separation at the document/block level, not as inline text).

Suggested change
},
{
$class: `${CommonMarkModel.NAMESPACE}.Text`,
text: '\n\n'

Copilot uses AI. Check for mistakes.
}
]
};
return [captionParagraph, table];
}

return table;
Comment on lines 565 to +617
Copy link

Copilot AI Apr 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are snapshot roundtrip tests for tables, but no coverage for HTML input containing <caption> (or the new header/body normalization paths). Please add a test that calls HtmlTransformer.toCiceroMark() on an HTML table with a caption and asserts the resulting CiceroMark/Markdown output (and ideally a roundtrip) so regressions in caption parsing/whitespace cleanup are caught.

Copilot uses AI. Check for mistakes.
}
if (el.tagName && el.tagName.toLowerCase() === 'thead') {
const nodes = next(el.childNodes);
return {
$class: `${CommonMarkModel.NAMESPACE}.TableHead`,
nodes: next(el.childNodes),
nodes: nodes.filter(n => n.$class === `${CommonMarkModel.NAMESPACE}.TableRow`),
};
}
if (el.tagName && el.tagName.toLowerCase() === 'tbody') {
const nodes = next(el.childNodes);
return {
$class: `${CommonMarkModel.NAMESPACE}.TableBody`,
nodes: next(el.childNodes),
nodes: nodes.filter(n => n.$class === `${CommonMarkModel.NAMESPACE}.TableRow`),
};
}
if (el.tagName && el.tagName.toLowerCase() === 'tr') {
const nodes = next(el.childNodes);
return {
$class: `${CommonMarkModel.NAMESPACE}.TableRow`,
nodes: next(el.childNodes),
nodes: nodes.filter(n => n.$class === `${CommonMarkModel.NAMESPACE}.HeaderCell` || n.$class === `${CommonMarkModel.NAMESPACE}.TableCell`),
};
}
if (el.tagName && el.tagName.toLowerCase() === 'th') {
return {
$class: `${CommonMarkModel.NAMESPACE}.HeaderCell`,
nodes: next(el.childNodes),
nodes: cleanTableNodes(next(el.childNodes)),
};
}
if (el.tagName && el.tagName.toLowerCase() === 'td') {
return {
$class: `${CommonMarkModel.NAMESPACE}.TableCell`,
nodes: next(el.childNodes),
nodes: cleanTableNodes(next(el.childNodes)),
};
}
},
Expand All @@ -560,7 +673,8 @@ const rules = [
HTML_INLINE_RULE,
HTML_BLOCK_RULE,
IMAGE_RULE,
TABLE_RULE
TABLE_RULE,
CAPTION_RULE
];


Expand Down