Add section on document format conversion for LLMs

brown9804 · web-flow · commit b4794e59849c · 2025-12-16T15:50:38.000-06:00
Added a section discussing the efficacy of converting documents to Markdown format for LLM processing, including advantages over HTML.
diff --git a/0_Azure/3_AzureAI/0_AISearch/demos/2_Index_Increase/README.md b/0_Azure/3_AzureAI/0_AISearch/demos/2_Index_Increase/README.md
@@ -56,6 +56,7 @@ Last updated: 2025-07-17
 - [Examples](#examples)
 - [How to integrate vector embedding](#how-to-integrate-vector-embedding)
 - [Chunking Techniques](#chunking-techniques)
+- [Document Format Conversion HTML vs. Markdown](#document-format-conversion-html-vs-markdown)
 
 </details>
 
@@ -228,6 +229,17 @@ From [Integrated vector embedding in Azure AI Search](https://learn.microsoft.co
 | **Custom combinations** | Mixes fixed and variable approaches, e.g., adding titles or metadata to chunks. | Flexible, balances context and efficiency. | Requires tuning and experimentation. | Enterprise apps needing both precision and recall. |
 | **Document parsing** | Indexers parse large files (Markdown, JSON, PDFs) into smaller search documents. | Automated, efficient for structured files. | Less control over chunk boundaries. | Technical documents, structured datasets. |
 
+## Document Format Conversion HTML vs. Markdown
+
+`Efficacy of converting documents to Markdown format for LLM processing`
+
+> When the conversion preserves the original meaning and structure, Markdown is usually the preferred format for LLM processing.
+> - Markdown keeps the structure clean and lightweight, which helps models interpret headings, lists, tables, and emphasis without the noise of HTML tags. It also reduces the risk of formatting artifacts that can confuse the model.
+> - HTML can still work, but it often carries extra markup, inline styles, or boilerplate that adds no semantic value. That clutter makes the input harder for an LLM to parse and can obscure the document’s logical structure.
+
+For example, see [Chunk and vectorize by document layout or structure](https://learn.microsoft.com/en-us/azure/search/search-how-to-semantic-chunking).
+
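To make the markup-overhead point concrete, here is a minimal sketch that is not part of the commit. It assumes a small hypothetical converter, `TinyMarkdownConverter`, built on Python's standard `html.parser`. It handles only headings, paragraphs, list items, and bold text, and it uses character count as a rough stand-in for token count. A real pipeline would rely on a full conversion tool or service instead.

```python
from html.parser import HTMLParser


class TinyMarkdownConverter(HTMLParser):
    """Hypothetical, minimal HTML-to-Markdown converter (illustration only)."""

    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Attributes such as class or style are dropped on purpose:
        # they add markup without adding meaning for the model.
        if tag in self.HEADINGS:
            self.parts.append(self.HEADINGS[tag])
        elif tag == "li":
            self.parts.append("- ")
        elif tag in ("strong", "b"):
            self.parts.append("**")

    def handle_endtag(self, tag):
        if tag in self.HEADINGS or tag == "p":
            self.parts.append("\n\n")
        elif tag in ("li", "ul"):
            self.parts.append("\n")
        elif tag in ("strong", "b"):
            self.parts.append("**")

    def handle_data(self, data):
        # Skip whitespace-only text nodes between tags.
        if data.strip():
            self.parts.append(data)


def html_to_markdown(html: str) -> str:
    """Convert a small HTML fragment to Markdown using the sketch above."""
    converter = TinyMarkdownConverter()
    converter.feed(html)
    converter.close()
    return "".join(converter.parts).strip()


# Made-up sample input: the same content wrapped in typical HTML clutter.
sample_html = (
    '<div class="content"><h2 style="color:#333">Chunking</h2>'
    "<p>Split documents by <strong>layout</strong>.</p>"
    "<ul><li>Headings</li><li>Tables</li></ul></div>"
)

markdown = html_to_markdown(sample_html)
print(markdown)
print(len(sample_html), "characters of HTML ->", len(markdown), "characters of Markdown")
```

The Markdown output keeps the heading, emphasis, and list structure while dropping the wrapper `div`, the `class` attribute, and the inline style, which is the kind of clutter the section describes.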
+
 <!-- START BADGE -->
 <div align="center">
   <img src="https://img.shields.io/badge/Total%20views-1633-limegreen" alt="Total views">