Problem
The generated Markdown twins (index.md) have three quality issues that degrade RAG retrieval and extraction across products. All three were observed on Telegraf Controller pages, but the causes are in the twin-generation pipeline and affect any product with the same content patterns.
Shared callout boilerplate dominates the first chunk of every page in a section. Every Telegraf Controller twin opens with ~250 tokens of identical beta-notice content (production-use warning, Slack links, release-schedule caveats) before the page's own content. Retrieval effect: the first chunk of every page in the section embeds near-identically, and the most extractable claim on every page is the warning, not the answer. The same problem applies to any product-wide banner or repeated callout (sunset notices, beta notices). Consider compressing repeated section-wide callouts, or moving them below the page lead, during twin generation.
Converter drops whitespace around links and code spans. Observed in production telegraf/controller/tokens/use/index.md:
    please[submit an issue]
    `YOUR_TC_API_TOKEN`with`
    plugin](/telegraf/v1/output-plugins/heartbeat/)to send
Missing whitespace degrades tokenization and extraction fidelity on link-dense pages. This is a markdown-converter bug, not a content bug.
Verify code-tabs conversion in twins. Tabbed code blocks (code-tabs-wrapper) are the structure most likely to convert badly. PR previews don't build twins, so tab-heavy pages go unverified until production. Spot-check a tab-heavy page after deploy (for example, /telegraf/controller/reference/api/index.md, once #7314, "docs(controller): add Telegraf Controller API reference page", lands) and add a conversion test or Cypress validation case for tabbed code blocks if it converts badly.
Acceptance
Repeated section-wide callouts no longer consume the lead chunk of every twin in a section.
Link/code-span spacing bugs fixed in the converter, with a regression test.
Code-tabs conversion verified, with a test case covering tabbed code blocks.
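A possible starting point for the link/code-span spacing regression test: a minimal Python sketch that scans generated twin text for inline links or code spans glued directly to a neighboring word. The function name `find_spacing_issues`, both regular expressions, and the sample text are hypothetical and not part of the existing pipeline. The patterns are heuristics, not a full CommonMark parser, so intentional cases such as a pluralized code span would also be flagged.

```python
import re

# Heuristic patterns (assumptions, not a CommonMark parser).
INLINE_LINK = re.compile(r"\[[^\]\n]+\]\([^)\s]*\)")
CODE_SPAN = re.compile(r"`[^`\n]+`")


def find_spacing_issues(markdown: str) -> list[tuple[int, str]]:
    """Return (line_number, snippet) for links or code spans that touch a word."""
    issues = []
    in_fence = False
    for line_no, line in enumerate(markdown.splitlines(), start=1):
        # Skip fenced code blocks, where backticks and brackets are literal.
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        for pattern in (INLINE_LINK, CODE_SPAN):
            # finditer scans left to right without overlap, so the closing
            # backtick of one span is never reused as the opener of the next.
            for match in pattern.finditer(line):
                before = line[match.start() - 1] if match.start() > 0 else " "
                after = line[match.end()] if match.end() < len(line) else " "
                if before.isalnum() or after.isalnum():
                    lo = max(match.start() - 1, 0)
                    hi = min(match.end() + 1, len(line))
                    issues.append((line_no, line[lo:hi]))
    return issues


if __name__ == "__main__":
    # Hypothetical sample modeled on the fragments quoted in this issue.
    sample = (
        "If it fails, please[submit an issue](https://example.com/issues).\n"
        "Replace `YOUR_TC_API_TOKEN`with`abc123`.\n"
        "Use the [heartbeat plugin](/telegraf/v1/output-plugins/heartbeat/)to send data.\n"
        "A clean line with `code` and a [link](/ok/).\n"
    )
    for line_no, snippet in find_spacing_issues(sample):
        print(f"line {line_no}: {snippet}")
```

A check like this could run against a few link-dense twins after the converter fix, failing the build when the returned list is not empty.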
Related: #7314, #7320, #7321, #7322
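For the repeated-callout problem, one way to move shared boilerplate below the page lead is sketched here in Python. It assumes twins can be compared as blank-line-separated blocks, and that a block appearing verbatim in every page of a section is boilerplate. The names `split_blocks` and `move_shared_callouts` and the `lead_blocks` parameter are hypothetical. The real pipeline may be better served by handling callout elements before Markdown conversion, and this sketch does not protect fenced code blocks that contain blank lines.

```python
import re


def split_blocks(markdown: str) -> list[str]:
    """Split Markdown into blank-line-separated blocks."""
    return [b for b in re.split(r"\n\s*\n", markdown.strip()) if b]


def move_shared_callouts(section: dict[str, str], lead_blocks: int = 2) -> dict[str, str]:
    """Defer blocks shared by every page until after each page's own lead.

    lead_blocks counts page-specific blocks to keep on top (an assumed
    default of 2 covers a title plus the first paragraph).
    """
    split = {path: split_blocks(text) for path, text in section.items()}
    if len(split) < 2:
        # With a single page, every block would look "shared".
        return dict(section)
    shared = set.intersection(*(set(blocks) for blocks in split.values()))
    result = {}
    for path, blocks in split.items():
        lead, deferred, rest = [], [], []
        for i, block in enumerate(blocks):
            if len(lead) >= lead_blocks:
                rest = blocks[i:]  # everything after the lead keeps its order
                break
            (deferred if block in shared else lead).append(block)
        result[path] = "\n\n".join(lead + deferred + rest) + "\n"
    return result


if __name__ == "__main__":
    # Hypothetical two-page section with an identical beta notice.
    section = {
        "a/index.md": "# A\n\n> Beta notice\n\nA lead.\n\nA body.\n",
        "b/index.md": "# B\n\n> Beta notice\n\nB lead.\n\nB body.\n",
    }
    for path, text in move_shared_callouts(section).items():
        print(path)
        print(text)
```

Only shared blocks that appear before the lead are moved, so repeated footers or later notices stay where they are.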