Markdown twin hygiene: repeated callout boilerplate, link spacing, code-tabs conversion #7323

@jstirnaman

Problem

The generated Markdown twins (index.md) have three quality issues that degrade RAG retrieval and extraction across products. All three were observed on Telegraf Controller pages, but the causes are in the twin-generation pipeline and affect any product with the same content patterns.

  1. Shared callout boilerplate dominates the first chunk of every page in a section. Every Telegraf Controller twin opens with ~250 tokens of identical beta-notice content (production-use warning, Slack links, release-schedule caveats) before the page's own content. Retrieval effect: the first chunk of every page in the section embeds near-identically, and the most extractable claim on every page is the warning, not the answer. The same problem applies to any product-wide banner or repeated callout (sunset notices, beta notices). Consider compressing repeated section-wide callouts, or moving them below the page lead, during twin generation.

  2. Converter drops whitespace around links and code spans. Observed in production telegraf/controller/tokens/use/index.md:

    • please[submit an issue]
    • `YOUR_TC_API_TOKEN`with`
    • plugin](/telegraf/v1/output-plugins/heartbeat/)to send

    Missing whitespace degrades tokenization and extraction fidelity on link-dense pages. This is a markdown-converter bug, not a content bug.

  3. Verify code-tabs conversion in twins. Tabbed code blocks (code-tabs-wrapper) are the structure most likely to convert badly. PR previews don't build twins, so tab-heavy pages go unverified until production. Spot-check a tab-heavy page after deploy (for example, /telegraf/controller/reference/api/index.md once #7314, "docs(controller): add Telegraf Controller API reference page", lands). If it converts badly, add a conversion test or Cypress validation case for tabbed code blocks.
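
To make the suggestion in item 1 concrete, here is a minimal Python sketch of moving a leading callout below the page lead. It is an illustration under stated assumptions, not the twin pipeline's actual code: it assumes callouts reach the twin as blockquote blocks (every line starting with `>`), and the function names and the `[!Warning]` marker in the usage note are hypothetical.

```python
import re


def is_callout(block: str) -> bool:
    # Assumption: a callout is rendered as a blockquote, so every line starts with ">".
    return all(line.lstrip().startswith(">") for line in block.splitlines())


def is_heading(block: str) -> bool:
    return block.lstrip().startswith("#")


def move_leading_callouts(markdown: str) -> str:
    """Move callouts that precede the first prose paragraph to just after it."""
    # Split into blocks on blank lines. This normalizes runs of blank lines to one.
    blocks = [b for b in re.split(r"\n\s*\n", markdown.strip()) if b.strip()]
    leading = []   # callouts seen before the lead paragraph
    output = []
    lead_seen = False
    for block in blocks:
        if not lead_seen and is_callout(block):
            leading.append(block)
            continue
        output.append(block)
        if not lead_seen and not is_heading(block):
            # First non-heading, non-callout block is treated as the page lead.
            lead_seen = True
            output.extend(leading)
            leading = []
    output.extend(leading)  # no prose found: keep the callouts at the end
    return "\n\n".join(output) + "\n"
```

Given a page that starts with a title, then a `> [!Warning]` block, then its first paragraph, the function returns the title, the paragraph, and then the callout, so the first chunk carries the page's own content. Compressing the callout to a one-line notice would be an alternative to moving it.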

Acceptance

  • Repeated section-wide callouts no longer consume the lead chunk of every twin in a section.
  • Link/code-span spacing bugs fixed in the converter, with a regression test.
  • Code-tabs conversion verified, with a test case covering tabbed code blocks.
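
The regression test for link and code-span spacing could start from a check like the following Python sketch. It is a heuristic, not the converter's own logic: it flags any inline link or single-backtick code span that directly touches a letter or digit on either side, and skips fenced code blocks. The function name `find_spacing_issues` and both regular expressions are assumptions made for illustration.

```python
import re

# Inline link: [text](target), with no whitespace or ")" inside the target.
LINK = re.compile(r"\[[^\]\n]+\]\([^)\s]+\)")
# Single-backtick code span on one line.
CODE_SPAN = re.compile(r"`[^`\n]+`")


def find_spacing_issues(markdown: str) -> list[tuple[int, str]]:
    """Return (line number, snippet) for links or code spans glued to a word."""
    issues = []
    in_fence = False
    for number, line in enumerate(markdown.splitlines(), start=1):
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # toggle on opening and closing fences
            continue
        if in_fence:
            continue
        for pattern in (LINK, CODE_SPAN):
            # finditer scans left to right, so backticks are paired in order.
            for match in pattern.finditer(line):
                before = line[match.start() - 1] if match.start() > 0 else " "
                after = line[match.end()] if match.end() < len(line) else " "
                if before.isalnum() or after.isalnum():
                    snippet = line[max(match.start() - 1, 0):match.end() + 1]
                    issues.append((number, snippet))
    return issues
```

Run against a twin, a line such as "Use the [heartbeat plugin](/telegraf/v1/output-plugins/heartbeat/)to send data" is reported, while the same line with a space before "to" is not. Legitimate cases such as a plural "s" after a code span would also be flagged, so the output is best treated as a review list, or as a strict assertion only on fixture pages.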

Related: #7314, #7320, #7321, #7322
