What is chunking in RAG?

Chunking is the process of splitting documents into smaller pieces before embedding and storing them in a vector database. Good chunks preserve meaning while fitting within the embedding model's context window.

What chunk size should I use for RAG?

A common starting point is 200 to 400 tokens with 10 to 20 percent overlap. Smaller chunks improve precision; larger chunks preserve context. The best size depends on your document structure and query patterns.

What is semantic chunking?

Semantic chunking splits documents at natural boundaries such as paragraphs, headings, or topic shifts instead of fixed token counts. It keeps related sentences together and reduces boundary noise.

How do I chunk code for RAG?

Split code by function, class, or logical block. Include imports, docstrings, and function signatures in each chunk so the embedding captures what the code does and how to call it.

RAG Chunking Strategies: Size, Overlap, and Semantic Splitting

Bottom line: Chunking is one of the highest-leverage decisions in a RAG pipeline. A good strategy keeps related ideas together. A bad strategy splits them apart and degrades retrieval.

Why chunking matters

Language models and embedding models both have context limits, and even text that fits within them can be too broad to embed well. If you embed an entire whitepaper as one vector, the embedding averages away the details. If you cut text at arbitrary boundaries, a sentence may lose the subject it refers to.

Chunking balances two goals: each chunk should be small enough to be specific, but large enough to be self-contained.

Fixed-size chunking

Fixed-size chunking splits text into chunks of N tokens with an overlap of M tokens. It is simple, fast, and predictable.

Pros: easy to implement, consistent chunk length, works with any text.
Cons: can split sentences, paragraphs, and ideas across boundaries.

Start with 300 tokens and a 50-token overlap, then measure retrieval quality before changing anything.
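As a rough illustration of the sliding window described above, here is a minimal sketch. It assumes whitespace-separated words stand in for tokens; a real pipeline would count tokens with the embedding model's own tokenizer. The function name chunk_fixed is hypothetical.

```python
def chunk_fixed(tokens, size=300, overlap=50):
    """Split a token list into windows of `size` tokens that share `overlap` tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# Assumption: whitespace words approximate tokens for this demo.
words = ("word " * 700).split()
chunks = chunk_fixed(words, size=300, overlap=50)
print([len(c) for c in chunks])  # prints [300, 300, 200]
```

Each window starts 250 tokens after the previous one, so the last 50 tokens of one chunk are the first 50 of the next.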

Semantic chunking

Semantic chunking splits at natural boundaries. For documents, split on headings and paragraphs. For conversations, split on speaker turns. For transcripts, split on pauses or topic shifts.

The result is chunks that read like mini-documents. They are easier for embeddings to represent accurately and easier for models to reason over.
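One simple way to approximate this for documents is to treat blank lines as paragraph boundaries and pack whole paragraphs into chunks up to a word budget. The sketch below assumes paragraphs are separated by blank lines and that words approximate tokens; chunk_by_paragraph is a hypothetical helper, not a library function.

```python
import re


def chunk_by_paragraph(text, max_words=300):
    """Group whole paragraphs into chunks of at most `max_words` words."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current, count = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        # Close the current chunk if adding this paragraph would exceed the budget.
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


sample = "a b c\n\nd e\n\nf g h i"
print(chunk_by_paragraph(sample, max_words=5))
```

A paragraph is never cut in half here, so a single paragraph longer than the budget is kept whole as its own oversized chunk.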

Recursive and hierarchical chunking

Recursive chunking tries larger boundaries first, then smaller ones. For example, split by Markdown heading level 1, then by paragraph, then by sentence if a section is still too long.
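The heading, paragraph, sentence fallback can be sketched as a short recursive function. This is an illustration under assumptions: headings are Markdown lines starting with "# ", paragraphs are separated by blank lines, sentences end in ., !, or ?, and words approximate tokens. The names SEPARATORS and chunk_recursive are hypothetical.

```python
import re

# Boundaries from coarsest to finest: level-1 headings, blank lines, sentence ends.
SEPARATORS = [r"\n(?=# )", r"\n\s*\n", r"(?<=[.!?])\s+"]


def chunk_recursive(text, max_words=300, level=0):
    """Split at the coarsest boundary first; recurse only into pieces that are still too long."""
    text = text.strip()
    if not text:
        return []
    # Stop when the piece fits, or when no finer separator is left to try.
    if len(text.split()) <= max_words or level >= len(SEPARATORS):
        return [text]
    chunks = []
    for piece in re.split(SEPARATORS[level], text):
        chunks.extend(chunk_recursive(piece, max_words, level + 1))
    return chunks


doc = "# A\nshort intro.\n# B\nOne two three four. Five six seven eight.\n\nNine ten."
for chunk in chunk_recursive(doc, max_words=6):
    print(repr(chunk))
```

This version does not merge small neighbouring pieces back together, and a single sentence longer than the budget is returned as is. Both are simplifications a production splitter would likely handle.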

Hierarchical chunking creates parent-child relationships. A parent chunk summarizes a section while child chunks hold the details. At retrieval time you can return the parent for context and the child for specificity.
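A parent-child store can be sketched with a small data class. Everything here is hypothetical naming (Chunk, build_hierarchy, with_parent), and the parent text is only a stand-in: it uses the section title plus the first paragraph instead of a real summary, which would normally come from a summarization step.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    chunk_id: str
    text: str
    parent_id: Optional[str] = None  # None marks a parent (section-level) chunk


def build_hierarchy(sections):
    """sections: dict mapping a section title to its list of paragraphs."""
    store = {}
    for i, (title, paragraphs) in enumerate(sections.items()):
        parent_id = f"sec{i}"
        # Stand-in for a summary: the title plus the first paragraph, if any.
        lead = paragraphs[0] if paragraphs else ""
        store[parent_id] = Chunk(parent_id, f"{title}: {lead}")
        for j, para in enumerate(paragraphs):
            child_id = f"{parent_id}-p{j}"
            store[child_id] = Chunk(child_id, para, parent_id)
    return store


def with_parent(store, child_id):
    """Return (parent, child): the parent for context, the child for specificity."""
    child = store[child_id]
    return store[child.parent_id], child


store = build_hierarchy({"Overlap": ["Overlap repeats tokens.", "Ten percent is typical."]})
parent, child = with_parent(store, "sec0-p1")
print(parent.text)
print(child.text)
```

In this sketch only the child chunks would be embedded and searched; the parent is looked up by id after a child matches.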

Chunking by content type

Markdown and docs

Split on headings and paragraphs. Keep front matter separate.
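To keep front matter out of the embedded text, it can be peeled off before splitting. The sketch below assumes the common convention of a metadata block delimited by lines containing only "---" at the very top of the file; split_front_matter is a hypothetical helper and does not parse the metadata itself.

```python
def split_front_matter(markdown):
    """Return (front_matter, body); front_matter is '' when the file has none."""
    lines = markdown.splitlines()
    if lines and lines[0].strip() == "---":
        for i in range(1, len(lines)):
            if lines[i].strip() == "---":
                front = "\n".join(lines[1:i])
                body = "\n".join(lines[i + 1:]).lstrip("\n")
                return front, body
    # No opening delimiter, or no closing one: treat everything as body.
    return "", markdown


doc = "---\ntitle: Guide\n---\n\n# Intro\nText"
front, body = split_front_matter(doc)
print(front)  # prints title: Guide
print(body)
```

The front matter can then be stored as chunk metadata (for filtering or display) while only the body goes through the heading and paragraph splitter.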

Code

Split by function, class, or module. Include signatures and docstrings.
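For Python source specifically, the standard library's ast module can find function and class boundaries. The sketch below is one possible approach, not a general solution: it handles only top-level definitions, prepends the module's top-level imports to every chunk, and leaves out decorators. chunk_python_source is a hypothetical name.

```python
import ast


def chunk_python_source(source):
    """Return one chunk per top-level function or class, prefixed with the module's imports."""
    tree = ast.parse(source)
    imports = [
        ast.get_source_segment(source, node)
        for node in tree.body
        if isinstance(node, (ast.Import, ast.ImportFrom))
    ]
    header = "\n".join(imports)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # The segment includes the signature, docstring, and body.
            body = ast.get_source_segment(source, node)
            text = f"{header}\n\n{body}" if header else body
            chunks.append({"name": node.name, "text": text})
    return chunks


source = '''import math

def area(r):
    """Return the area of a circle."""
    return math.pi * r * r

class Shape:
    pass
'''
for chunk in chunk_python_source(source):
    print(chunk["name"])
    print(chunk["text"])
    print("-" * 20)
```

Other languages need their own parser; the idea of attaching imports and signatures to each chunk carries over.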

Transcripts

Split by speaker turn or time window. Preserve question-answer pairs.
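Splitting by speaker turn while keeping questions with their answers might look like the sketch below. It assumes each turn starts with a "Speaker: text" line and treats any turn ending in a question mark as a question; both are simplifications, and a continuation line that happens to contain a colon would be misread as a new turn. chunk_transcript is a hypothetical helper.

```python
import re

# Assumed line format: "Speaker: what they said"
TURN = re.compile(r"^(?P<speaker>[^:\n]+):\s*(?P<text>.*)$")


def chunk_transcript(lines):
    """Group lines into speaker turns, then keep each question with the turn after it."""
    turns = []
    for line in lines:
        line = line.strip()
        match = TURN.match(line)
        if match:
            turns.append([match["speaker"], match["text"]])
        elif turns and line:
            turns[-1][1] += " " + line  # continuation of the previous turn
    chunks, i = [], 0
    while i < len(turns):
        group = [turns[i]]
        # A turn ending in "?" is chunked together with the reply that follows.
        if turns[i][1].rstrip().endswith("?") and i + 1 < len(turns):
            group.append(turns[i + 1])
        chunks.append("\n".join(f"{speaker}: {text}" for speaker, text in group))
        i += len(group)
    return chunks


transcript = [
    "Ana: What chunk size do you use?",
    "Ben: About 300 tokens.",
    "Ana: Thanks, that helps.",
]
for chunk in chunk_transcript(transcript):
    print(chunk)
    print("-" * 20)
```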

Structured data

One row or record per chunk. Flatten nested JSON into sentences.
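Flattening a nested record into sentences can be done by walking the structure and writing each leaf as "path is value". The sketch below is one hedged way to do it; flatten and record_to_chunk are hypothetical names, and the "is" phrasing is just a readable default, not a standard.

```python
import json


def flatten(record, prefix=""):
    """Yield (dotted_path, value) pairs for every leaf in a nested dict or list."""
    if isinstance(record, dict):
        for key, value in record.items():
            yield from flatten(value, f"{prefix}{key}.")
    elif isinstance(record, list):
        for index, value in enumerate(record):
            yield from flatten(value, f"{prefix}{index}.")
    else:
        yield prefix.rstrip("."), record


def record_to_chunk(record):
    """Turn one record into one chunk of short sentences."""
    return " ".join(
        f"{path.replace('.', ' ')} is {value}." for path, value in flatten(record)
    )


row = json.loads('{"name": "Widget", "price": {"amount": 9.5, "currency": "USD"}}')
print(record_to_chunk(row))
# prints: name is Widget. price amount is 9.5. price currency is USD.
```

Keeping the field path next to each value lets the embedding see what a number or code means, which a bare row of values would not convey.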

Overlap and context windows

Overlap reduces the chance that an important idea is cut in half. A 10 to 20 percent overlap is usually enough. Too much overlap increases storage cost and adds retrieval noise, because near-duplicate chunks compete for the same top results, without improving quality.
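The storage cost of overlap is easy to estimate. With a sliding window, each new chunk advances by size minus overlap tokens, so the chunk count is roughly the document length divided by that step. The sketch below works this out; chunk_count is a hypothetical helper and the 10,000-token document is an illustrative figure.

```python
import math


def chunk_count(total_tokens, size, overlap):
    """Number of sliding-window chunks needed to cover `total_tokens` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    if total_tokens <= 0:
        return 0
    step = size - overlap  # tokens of new text each chunk contributes
    return max(1, math.ceil((total_tokens - overlap) / step))


for pct in (0, 10, 20, 50):
    overlap = 300 * pct // 100
    print(f"{pct}% overlap -> {chunk_count(10_000, 300, overlap)} chunks")
```

For a 10,000-token document and 300-token chunks this gives 34, 37, 42, and 66 chunks at 0, 10, 20, and 50 percent overlap. Going from 20 to 50 percent increases the number of vectors to store and search by more than half.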

Testing your chunks

Sample 20 chunks and read them. Do they make sense on their own?
Run a small evaluation set and compare hit rate across chunk sizes.
Check that no chunk contains page headers, footers, or navigation junk.
Verify chunks fit inside your embedding model's maximum input length.
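Two of these checks are simple to automate. The sketch below assumes an evaluation set of (query, expected chunk id) pairs and a retrieve callable that returns ranked chunk ids; both are hypothetical stand-ins for whatever your pipeline exposes. The default token counter splits on whitespace, which only approximates a real tokenizer.

```python
def hit_rate(eval_set, retrieve, k=5):
    """Fraction of queries whose expected chunk id appears in the top-k results."""
    if not eval_set:
        return 0.0
    hits = sum(1 for query, expected_id in eval_set if expected_id in retrieve(query)[:k])
    return hits / len(eval_set)


def oversized(chunks, max_tokens, count_tokens=lambda text: len(text.split())):
    """Return the chunks that exceed the embedding model's input limit."""
    return [chunk for chunk in chunks if count_tokens(chunk) > max_tokens]


# Toy retriever for illustration: a fixed ranking per query.
rankings = {"q1": ["a", "c"], "q2": ["c", "d"]}
eval_set = [("q1", "a"), ("q2", "b")]
print(hit_rate(eval_set, lambda q: rankings[q], k=2))  # prints 0.5
print(oversized(["one two three", "one"], max_tokens=2))  # prints ['one two three']
```

Running hit_rate once per candidate chunk size, with the index rebuilt each time, gives a like-for-like comparison on the same queries.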

Published 2026-06-12
