Commit b4794e5

Add section on document format conversion for LLMs
Added a section discussing the efficacy of converting documents to Markdown format for LLM processing, including advantages over HTML.
1 parent 2d20034 commit b4794e5

1 file changed

Lines changed: 12 additions & 0 deletions

0_Azure/3_AzureAI/0_AISearch/demos/2_Index_Increase/README.md

@@ -56,6 +56,7 @@ Last updated: 2025-07-17
 - [Examples](#examples)
 - [How to integrate vector embedding](#how-to-integrate-vector-embedding)
 - [Chunking Techniques](#chunking-techniques)
+- [Document Format Conversion HTML vs. Markdown](#document-format-conversion-html-vs-markdown)

 </details>

@@ -228,6 +229,17 @@ From [Integrated vector embedding in Azure AI Search](https://learn.microsoft.co
 | **Custom combinations** | Mixes fixed and variable approaches, e.g., adding titles or metadata to chunks. | Flexible, balances context and efficiency. | Requires tuning and experimentation. | Enterprise apps needing both precision and recall. |
 | **Document parsing** | Indexers parse large files (Markdown, JSON, PDFs) into smaller search documents. | Automated, efficient for structured files. | Less control over chunk boundaries. | Technical documents, structured datasets. |

+## Document Format Conversion HTML vs. Markdown
+
+`Efficacy of converting documents to Markdown format for LLM processing`
+
+> When the conversion preserves the original meaning and structure, Markdown is usually the preferred input format for LLMs.
+> - Markdown keeps structure clean and lightweight, so models can interpret headings, lists, tables, and emphasis without the noise of HTML tags. It also reduces the risk of formatting artifacts that can confuse the model. This advantage holds only if the conversion quality is good.
+> - HTML can still work, but it often carries extra markup, inline styles, or boilerplate that adds no semantic value. That clutter makes the input harder for an LLM to parse and can obscure the document's logical structure.
+
+For example, see [Chunk and vectorize by document layout or structure](https://learn.microsoft.com/en-us/azure/search/search-how-to-semantic-chunking).
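
As a rough illustration of why the converted form is lighter, the sketch below turns a small HTML fragment into Markdown using only Python's standard-library `html.parser`. The `MarkdownSketch` class, the `html_to_markdown` function, and the `SAMPLE_HTML` fragment are hypothetical names made up for this example. It handles only a handful of tags and is not how Azure AI Search or any production converter works.

```python
import re
from html.parser import HTMLParser


class MarkdownSketch(HTMLParser):
    """Hypothetical, minimal HTML-to-Markdown converter (a sketch, not a full converter)."""

    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}
    INLINE = {"strong": "**", "b": "**", "em": "*", "i": "*"}
    SKIP = {"script", "style"}  # content with no semantic value for an LLM

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # Attributes (class, style, id) are deliberately ignored.
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.HEADINGS:
            self.parts.append("\n\n" + self.HEADINGS[tag])
        elif tag in ("p", "ul", "ol"):
            self.parts.append("\n\n")
        elif tag == "li":
            self.parts.append("\n- ")
        elif tag in self.INLINE:
            self.parts.append(self.INLINE[tag])

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in self.INLINE:
            self.parts.append(self.INLINE[tag])
        elif tag in ("ul", "ol"):
            self.parts.append("\n")

    def handle_data(self, data):
        # Keep text outside <script>/<style>, collapsing runs of whitespace.
        if self.skip_depth == 0:
            self.parts.append(re.sub(r"\s+", " ", data))


def html_to_markdown(html: str) -> str:
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    text = re.sub(r" *\n *", "\n", text)    # trim spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)  # at most one blank line in a row
    return text.strip()


SAMPLE_HTML = """
<div class="content" style="margin:0">
  <h2 id="intro">Chunking</h2>
  <p>Markdown keeps <strong>structure</strong> clean.</p>
  <ul>
    <li>Headings</li>
    <li>Lists</li>
  </ul>
  <script>track();</script>
</div>
"""

if __name__ == "__main__":
    markdown = html_to_markdown(SAMPLE_HTML)
    print(markdown)
    print(f"HTML: {len(SAMPLE_HTML)} chars, Markdown: {len(markdown)} chars")
```

On this sample, the wrapper `<div>`, its inline style, and the `<script>` body disappear, while the heading, paragraph, emphasis, and list survive as plain Markdown. Real documents with tables, links, or nested lists would need a more complete converter, so treat this only as a way to see the size and noise difference.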
+
+
 <!-- START BADGE -->
 <div align="center">
 <img src="https://img.shields.io/badge/Total%20views-1633-limegreen" alt="Total views">
