Commit 9cfac4c

Add Chunking Techniques section to README
Added a section on Chunking Techniques to provide guidance on document chunking for Azure AI Search.
1 parent 5bc0670 commit 9cfac4c

1 file changed

Lines changed: 23 additions & 0 deletions

File tree

  • 0_Azure/3_AzureAI/0_AISearch/demos/2_Index_Increase

0_Azure/3_AzureAI/0_AISearch/demos/2_Index_Increase/README.md

@@ -60,6 +60,7 @@ Last updated: 2025-07-17
- [Scaling approaches](#scaling-approaches)
- [Examples](#examples)
- [How to integrate vector embedding](#how-to-integrate-vector-embedding)
- [Chunking Techniques](#chunking-techniques)

</details>

@@ -192,6 +193,28 @@ flowchart LR
From [Integrated vector embedding in Azure AI Search](https://learn.microsoft.com/en-us/azure/search/vector-search-integrated-vectorization#component-diagram)

## Chunking Techniques
> Document chunking balances token limits, semantic coherence, and retrieval accuracy: split large documents into pieces small enough for embedding models and LLMs, yet large enough to preserve meaningful context.
> [!TIP]
> - **Chunk size**: Recommended ~512 tokens (~2,000 characters).
> - **Overlap**: ~25% (~128 tokens) to preserve context.
> - **Choice depends on document type**: Narrative vs structured vs conversational.
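The tip above can be sketched as a small Python helper; `chunk_fixed` is a hypothetical name, and the defaults (2,000 characters ≈ 512 tokens, 25% overlap) simply restate the recommendation. A production pipeline would count tokens with the embedding model's tokenizer rather than characters.

```python
def chunk_fixed(text: str, chunk_size: int = 2000, overlap: int = 500) -> list[str]:
    """Split text into fixed-size character chunks with ~25% overlap.

    A chunk_size of ~2,000 characters approximates ~512 tokens; the
    overlap carries context across chunk boundaries.
    """
    if not text:
        return []
    step = chunk_size - overlap  # advance by chunk_size minus the overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]
```

Each chunk shares its first `overlap` characters with the tail of the previous chunk, so a sentence cut at one boundary still appears whole in at least one chunk.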
**Why it matters:**
- Prevents exceeding embedding/LLM limits.
- Improves **retrieval precision** (finding the right chunk).
- Enhances **answer grounding** with citations.
- Enables **agentic retrieval** pipelines to work effectively with multi-query decomposition.
| **Technique** | **How it works** | **Pros** | **Cons** | **Best use cases** |
|---------------|------------------|----------|----------|--------------------|
| **Fixed-size chunks** | Splits text into blocks of a set length (e.g., 512 tokens or 2,000 characters), often with 10–25% overlap. | Simple, predictable, easy to implement. | May break sentences or context mid-way. | Large narrative documents, technical manuals. |
| **Variable-sized chunks** | Breaks text at natural boundaries such as sentences, paragraphs, or headings. | Preserves readability and logical flow. | Chunk sizes can vary widely, affecting retrieval consistency. | Articles, structured documents, Markdown/HTML files. |
| **Semantic chunks** | Uses NLP or AI to split text into meaningful units that preserve semantic relationships. | High-quality context preservation, better search relevance. | More complex; requires AI processing. | Conversational text, FAQs, knowledge bases. |
| **Custom combinations** | Mixes fixed and variable approaches, e.g., adding titles or metadata to chunks. | Flexible; balances context and efficiency. | Requires tuning and experimentation. | Enterprise apps needing both precision and recall. |
| **Document parsing** | Indexers parse large files (Markdown, JSON, PDFs) into smaller search documents. | Automated; efficient for structured files. | Less control over chunk boundaries. | Technical documents, structured datasets. |
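For variable-sized chunks, a minimal sketch splits on paragraph boundaries and greedily packs whole paragraphs up to a character budget. The function name and `max_chars` default are illustrative, not part of Azure AI Search:

```python
def chunk_by_paragraph(text: str, max_chars: int = 2000) -> list[str]:
    """Variable-sized chunking: split on blank lines, then pack whole
    paragraphs into chunks of at most max_chars characters."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for p in paragraphs:
        # Start a new chunk if adding this paragraph would exceed the budget.
        if current and len(current) + len(p) + 2 > max_chars:
            chunks.append(current)
            current = p
        else:
            current = f"{current}\n\n{p}" if current else p
    if current:
        chunks.append(current)
    return chunks
```

Because paragraphs are never split, chunk sizes vary from chunk to chunk, which is exactly the retrieval-consistency trade-off the table notes.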
<!-- START BADGE -->
<div align="center">
<img src="https://img.shields.io/badge/Total%20views-1633-limegreen" alt="Total views">
