0_Azure/3_AzureAI/0_AISearch/demos/2_Index_Increase/README.md
23 additions & 0 deletions
@@ -60,6 +60,7 @@ Last updated: 2025-07-17
- [Scaling approaches](#scaling-approaches)
- [Examples](#examples)
- [How to integrate vector embedding](#how-to-integrate-vector-embedding)
+- [Chunking Techniques](#chunking-techniques)

</details>

@@ -192,6 +193,28 @@ flowchart LR
From [Integrated vector embedding in Azure AI Search](https://learn.microsoft.com/en-us/azure/search/vector-search-integrated-vectorization#component-diagram)

+## Chunking Techniques
+> Document chunking should balance token limits, semantic coherence, and retrieval accuracy. `The goal is to split large documents into manageable pieces that are small enough for embeddings and LLMs, but still meaningful enough to preserve context.`
+| Technique | Description | Advantages | Limitations | Best suited for |
+| --- | --- | --- | --- | --- |
+| **Fixed-size chunks** | Splits text into blocks of a set length (e.g., 512 tokens or 2,000 characters). Often includes overlap (10–25%). | Simple, predictable, easy to implement. | May break sentences or context mid-way. | Large narrative documents, technical manuals. |
+| **Variable-sized chunks** | Breaks text based on natural boundaries like sentences, paragraphs, or headings. | Preserves readability and logical flow. | Chunk sizes can vary widely, affecting retrieval consistency. | Articles, structured documents, Markdown/HTML files. |
+| **Semantic chunks** | Uses NLP or AI to split text into meaningful units that preserve semantic relationships. | High-quality context preservation, better search relevance. | More complex, requires AI processing. | Conversational text, FAQs, knowledge bases. |
+| **Custom combinations** | Mixes fixed and variable approaches, e.g., adding titles or metadata to chunks. | Flexible, balances context and efficiency. | Requires tuning and experimentation. | Enterprise apps needing both precision and recall. |
+| **Document parsing** | Indexers parse large files (Markdown, JSON, PDFs) into smaller search documents. | Automated, efficient for structured files. | Less control over chunk boundaries. | Technical documents, structured datasets. |
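
The fixed-size and variable-sized rows above can be illustrated with a minimal Python sketch. It is not part of the README change and does not use any Azure AI Search API. The function names `fixed_size_chunks` and `boundary_chunks`, the default sizes, and measuring length in characters rather than tokens are all assumptions made for illustration.

```python
import re


def fixed_size_chunks(text: str, size: int = 2000, overlap_ratio: float = 0.15) -> list[str]:
    """Split text into windows of `size` characters (hypothetical helper).

    Consecutive windows share roughly `overlap_ratio * size` characters,
    mirroring the 10-25% overlap mentioned in the table.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    # Distance between the start of one window and the start of the next.
    step = max(1, size - int(size * overlap_ratio))
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


def boundary_chunks(text: str, max_chars: int = 2000) -> list[str]:
    """Pack whole paragraphs into chunks of at most `max_chars` (hypothetical helper).

    Paragraphs are separated by blank lines. A single paragraph longer than
    `max_chars` is kept intact as its own oversized chunk.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            chunks.append(current)  # close the current chunk at a paragraph boundary
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

The first function trades context for predictability: every chunk has a known maximum length, but a window can start or end mid-sentence. The second keeps paragraphs whole, so chunk lengths vary with the source text, which is the consistency trade-off the table describes.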