RAG Chunking Strategies: What’s the Optimal Chunk Size?

A deep dive into text chunking for Retrieval-Augmented Generation systems

Introduction: Why Chunking Matters in RAG

Retrieval-Augmented Generation (RAG) has become the go-to pattern for building LLM applications that need access to external knowledge. The concept is elegantly simple: instead of fine-tuning a model on your data, you retrieve relevant context at query time and inject it into your prompt.

But here’s the thing most tutorials gloss over: the quality of your RAG system lives or dies by how you chunk your documents.

Chunking-the process of splitting documents into smaller pieces for embedding and retrieval-sits at the critical junction between your raw data and your vector store. Get it wrong, and you’ll either retrieve fragments too small to be useful or chunks so large they bury the relevant information in noise.

Consider this scenario: A user asks, “What’s the refund policy for enterprise customers?” Your document contains the answer, but it’s embedded in a 50-page terms of service. How you chunk that document determines whether you retrieve:

• A useless fragment: “…enterprise customers are entitled to…”
• An overwhelming wall of text containing 15 different policies
• A perfectly-scoped chunk with exactly the refund policy details

This post explores the major chunking strategies and their tradeoffs, and provides working code with benchmarks to help you make informed decisions for your use case.

Chunking Strategies: The Four Horsemen

1. Fixed-Size Chunking

The simplest approach: split text into chunks of exactly N characters (or tokens), with optional overlap.

How it works:

Document: "The quick brown fox jumps over the lazy dog. It was a sunny day."
Chunk size: 30 chars, Overlap: 10

Chunk 1: "The quick brown fox jumps over"
Chunk 2: "jumps over the lazy dog. It wa"
Chunk 3: "dog. It was a sunny day."
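A minimal sketch of this splitter: each new chunk starts chunk_size minus overlap characters after the previous one.

```python
def fixed_size_chunks(text: str, chunk_size: int = 30, overlap: int = 10) -> list[str]:
    """Split text into fixed-size character chunks with the given overlap."""
    step = chunk_size - overlap
    chunks = []
    for i in range(0, len(text), step):
        chunks.append(text[i:i + chunk_size])
        if i + chunk_size >= len(text):
            break  # The remainder is already covered by this chunk.
    return chunks

doc = "The quick brown fox jumps over the lazy dog. It was a sunny day."
for chunk in fixed_size_chunks(doc):
    print(repr(chunk))
```

Note the mid-word cut in the second chunk ("It wa"): exactly the semantic fragmentation listed under Cons below.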

Pros:

• Predictable chunk sizes (great for token budget management)
• Fast and simple to implement
• Uniform chunk lengths simplify batching and storage

Cons:

• Cuts mid-sentence, mid-word, even mid-concept
• No respect for document structure
• Overlap helps but doesn’t solve semantic fragmentation

Best for: Uniform text without strong structure (logs, transcripts, simple prose)

2. Recursive Character Splitting

LangChain’s default approach. Attempts to split on natural boundaries in order of preference: paragraphs → sentences → words → characters.

How it works:

1. Try to split on \n\n (paragraphs)
2. If chunks are still too large, split on \n (lines)
3. Then on . (sentences)
4. Then on spaces (words)
5. Finally, on individual characters
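The steps above can be sketched in a few lines. This is a simplified illustration, not LangChain's actual RecursiveCharacterTextSplitter, which also merges adjacent small pieces back up toward the target size:

```python
def recursive_split(text: str, chunk_size: int,
                    seps: tuple = ("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split on the coarsest separator first, recursing with finer ones as needed."""
    if len(text) <= chunk_size:
        return [text]
    if not seps:
        # No separators left: fall back to hard character cuts.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    pieces = []
    for piece in text.split(seps[0]):
        if len(piece) <= chunk_size:
            pieces.append(piece)
        else:
            pieces.extend(recursive_split(piece, chunk_size, seps[1:]))
    return [p for p in pieces if p.strip()]
```

Without the merge step, short paragraphs stay as tiny chunks, which is one reason the real splitter's sizes still vary (see Cons below).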

Pros:

• Respects document structure where possible
• More semantically coherent chunks
• Configurable separator hierarchy

Cons:

• Chunk sizes vary significantly
• Can still produce awkward splits
• Doesn’t understand actual semantic meaning

Best for: General-purpose documents, articles, documentation

3. Sentence-Based Chunking

Splits text at sentence boundaries, then groups sentences until reaching a size threshold.

How it works:

1. Parse document into sentences using NLP (spaCy, NLTK, or regex)
2. Group consecutive sentences until the chunk size limit is reached
3. Optionally overlap by N sentences
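A regex-based sketch of steps 1 and 2 (a production system would use spaCy or NLTK for sentence detection; the optional sentence overlap is omitted here):

```python
import re

def sentence_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Group whole sentences into chunks of at most max_chars characters."""
    # Naive sentence boundary: split after ., !, or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sent in sentences:
        if current and len(current) + 1 + len(sent) > max_chars:
            chunks.append(current)
            current = sent
        else:
            current = f"{current} {sent}".strip() if current else sent
    if current:
        chunks.append(current)
    return chunks
```

A single sentence longer than max_chars still becomes one oversized chunk, which is exactly the "long sentences create huge ones" weakness noted below.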

Pros:

• Never breaks mid-sentence
• Natural reading flow preserved
• Predictable semantic units

Cons:

• Requires NLP dependency for accurate sentence detection
• Short sentences create tiny chunks; long sentences create huge ones
• Doesn’t consider paragraph or section boundaries

Best for: Well-written prose, articles, books, documentation

4. Semantic Chunking

The most sophisticated approach: use embeddings to find natural breakpoints where the topic shifts.

How it works:

1. Split into sentences
2. Embed each sentence
3. Compute similarity between consecutive sentences
4. Split where similarity drops below a threshold (topic change detected)
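A sketch of steps 2–4, with the embedding function left as a parameter. In practice you might pass something like SentenceTransformer("all-MiniLM-L6-v2").encode; the 0.5 threshold here is an arbitrary illustration, and tuning it is part of the work:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def semantic_chunks(sentences: list[str], embed_fn, threshold: float = 0.5) -> list[str]:
    """Start a new chunk wherever consecutive-sentence similarity drops below threshold."""
    if not sentences:
        return []
    vectors = [embed_fn(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev_vec, vec, sent in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev_vec, vec) < threshold:
            chunks.append(" ".join(current))  # Topic shift: close the current chunk.
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```

Because every sentence must be embedded before any boundary can be found, the cost scales with sentence count, not chunk count.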

Pros:

• Chunks are semantically coherent
• Adapts to content structure dynamically
• Topic changes become chunk boundaries

Cons:

• Computationally expensive (requires embedding every sentence)
• Chunk sizes highly variable
• Threshold tuning required

Best for: Long-form content with multiple topics, research papers, complex documents

The Chunk Size Tradeoff: Too Small vs. Too Large

This is where most practitioners struggle. Let’s break down the tradeoffs:

Too Small (< 200 tokens)

Problems:

• Lost context: “The company was founded in 2015” means nothing without knowing which company
• Fragmented answers: The answer to a question spans multiple chunks, but you only retrieve one
• Embedding noise: Short text has less semantic signal, leading to worse similarity matching
• Retrieval overhead: More chunks = more storage, more comparisons, higher latency

Symptoms: Answers feel incomplete. Users ask follow-ups that should have been covered.

Too Large (> 1000 tokens)

Problems:

• Diluted relevance: The chunk contains the answer plus 20 unrelated paragraphs
• Token budget waste: You blow your context window on irrelevant text
• Lower precision: Many chunks “kind of” match, making ranking harder
• LLM confusion: More context isn’t always better-models can get distracted

Symptoms: Retrieved context feels bloated. LLM outputs mention irrelevant details.

The Sweet Spot

For most use cases, 256–512 tokens hits the balance. But here’s the real answer: it depends on your data and queries.

• Q&A over technical docs? Smaller chunks (256) work well-answers are often localized
• Summarization of reports? Larger chunks (512–1024) preserve narrative flow
• Legal document search? Semantic chunking-clause boundaries matter more than size

Benchmark Methodology

To produce meaningful comparisons, our benchmark follows these principles:

Corpus Selection

We use the same corpus for all strategies-a technical document on machine learning fundamentals. In production, use a representative sample of your actual documents.

Evaluation Metrics

• Precision: What fraction of retrieved chunks contain relevant information?
• Recall: What fraction of the expected information was retrieved?
• F1 Score: Harmonic mean of precision and recall

Test Protocol

1. Chunk the corpus with each strategy × size combination
2. Embed and index all chunks using the same embedding model
3. Run identical queries against each index
4. Retrieve top-k (k=3) chunks per query
5. Score against expected keywords
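One plausible implementation of step 5, where relevance is judged by keyword presence. This is a hypothetical helper for illustration; the benchmark's exact scoring function may differ (strict F1 is bounded by 1.0, so the scores above 1.0 in the results imply a looser formula):

```python
def keyword_scores(retrieved_chunks: list[str],
                   expected_keywords: list[str]) -> tuple[float, float, float]:
    """Score retrieved chunks against expected keywords.

    Precision: fraction of retrieved chunks containing at least one keyword.
    Recall: fraction of expected keywords found anywhere in the retrieved text.
    """
    if not retrieved_chunks or not expected_keywords:
        return 0.0, 0.0, 0.0
    text = " ".join(retrieved_chunks).lower()
    found = [kw for kw in expected_keywords if kw.lower() in text]
    recall = len(found) / len(expected_keywords)
    relevant = [c for c in retrieved_chunks
                if any(kw.lower() in c.lower() for kw in expected_keywords)]
    precision = len(relevant) / len(retrieved_chunks)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Keyword matching is a cheap proxy for relevance; for a stricter evaluation you would use human-labeled relevance judgments per query.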

Test Environment

Python Version: 3.11.14
Platform: Linux 6.14.0-37-generic
Processor: x86_64
CPU Cores: 4 physical, 8 logical
RAM: 15.3 GB
SentenceTransformers: 5.2.2
ChromaDB: 1.5.0

Variables Controlled

Embedding model: all-MiniLM-L6-v2
Vector store: ChromaDB (cosine similarity)
Top-k: 3
Chunk overlap: 50 characters

Results and Analysis

Running the benchmark on our ML corpus produces the following results:

Key Findings

1. Larger chunks (1024) achieve the highest F1 score (1.236) across fixed, recursive, and sentence-based strategies, with perfect recall (1.0). This suggests that for this corpus, preserving more context improves retrieval quality.
2. Sentence-based chunking offers the best speed-to-quality ratio. At 256 characters, it achieves an F1 of 1.201 with fast indexing (104.73ms) and retrieval (8.62ms) times.
3. Semantic chunking underperforms contrary to expectations, achieving only an F1 of 1.022. The algorithm created many small chunks (avg 125.9 chars) regardless of target size, fragmenting the content too aggressively.
4. Recursive chunking is faster than fixed-size with identical quality scores, making it the better default choice. It respects natural boundaries while maintaining performance.
5. The 1024 sweet spot emerges for this corpus. All strategies converge to similar performance at 1024 characters, suggesting this size captures complete semantic units.

Performance Considerations

Semantic chunking requires embedding every sentence, which becomes expensive on a large corpus. Consider batch processing or using a faster embedding model.

Practical Recommendations by Use Case

📚 Technical Documentation / Knowledge Bases

• Strategy: Recursive or Sentence-based
• Chunk size: 256–512 tokens
• Why: Answers are typically contained in single sections. Smaller chunks = higher precision.

📄 Legal Documents / Contracts

• Strategy: Semantic or custom section-aware
• Chunk size: 512–768 tokens
• Why: Clause boundaries matter. Semantic chunking detects topic shifts.

💬 Customer Support / FAQ

• Strategy: Sentence-based
• Chunk size: 128–256 tokens
• Why: Q&A pairs are short. Smaller chunks prevent mixing unrelated questions.

📖 Books / Long-form Content

• Strategy: Semantic with larger thresholds
• Chunk size: 512–1024 tokens
• Why: Narrative flow matters. Larger chunks preserve context.

📊 Structured Data (Tables, Lists)

• Strategy: Custom structure-aware splitting
• Chunk size: Keep logical units intact
• Why: Standard splitters mangle tables. Build custom logic to keep rows/items together.

🔬 Research Papers

• Strategy: Section-aware + semantic
• Chunk size: 512 tokens within sections
• Why: Papers have a clear structure (Abstract, Methods, Results). Chunk within sections.

Conclusion

Chunking is not a solved problem-it’s a tuning problem. The optimal strategy depends on:

1. Your document types — Structured vs. unstructured, short vs. long
2. Your query patterns — Factoid questions vs. exploratory search
3. Your constraints — Latency requirements, embedding costs, token budgets

Start here:

1. Use RecursiveCharacterTextSplitter at 512 characters with a 50-character overlap
2. Benchmark with your actual documents and queries
3. If precision matters more, try smaller chunks or semantic chunking
4. If context matters more, try larger chunks or sentence-based chunking

The code in this post gives you everything needed to run your own experiments. Clone it, swap in your corpus, and find your optimal configuration.

Remember: A 10% improvement in chunking can translate to a dramatically better user experience. It’s worth the investment to get it right.

Code Repository: All benchmark code and examples are available at github.com/Devparihar5/chunking-strategies-comparison

References

1. LangChain Text Splitters Documentation
2. ChromaDB Documentation
3. Sentence Transformers Documentation
4. “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” — Lewis et al., 2020
5. “Lost in the Middle: How Language Models Use Long Contexts” — Liu et al., 2023
6. NLTK Tokenization Documentation

Have questions or want to share your chunking results? Feel free to reach out or open an issue in the repository.

