RAG from Scratch
Bottom line: Retrieval-Augmented Generation (RAG) lets a language model answer questions using your own documents. This guide walks through building a RAG pipeline step by step, from raw documents to a working system.
What RAG does
Large language models are trained on a fixed snapshot of mostly public data, so they do not know your private documents, recent releases, or internal knowledge bases. RAG solves this by retrieving relevant text snippets at query time and including them in the prompt. The model then reasons over the retrieved context instead of relying only on its training data.
The RAG pipeline
A minimal RAG system has five stages:
- Load: read documents from files, APIs, databases, or web pages.
- Chunk: split documents into smaller pieces that fit in the embedding model's input window.
- Embed: convert each chunk into a dense vector that captures its meaning.
- Store: save vectors and their text in a vector database for fast similarity search.
- Retrieve and generate: embed the user query, find the closest chunks, and pass them to a language model.
1. Loading documents
Use loaders that match your source. For PDFs, tools like PyPDF or Unstructured extract text while preserving some structure. For websites, scrape the main content and strip navigation and ads. For databases, export rows as text or JSON. The cleaner your source text, the better your retrieval quality.
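As an illustration of the "scrape the main content and strip navigation" step, here is a minimal sketch that uses only Python's standard library. The SKIP_TAGS set and the extract_main_text helper are assumptions made for this example, not part of any particular loader; real pages usually need site-specific rules or a dedicated extraction library.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as boilerplate (an assumption; tune per site).
SKIP_TAGS = {"nav", "header", "footer", "aside", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collect visible text, ignoring anything nested inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while we are inside a skipped tag
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.parts.append(text)


def extract_main_text(html):
    """Return the page text with navigation, footers, and scripts removed."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


page = (
    "<html><nav>Home | About</nav>"
    "<main><h1>Refund policy</h1><p>Refunds take 5 days.</p></main>"
    "<footer>(c) Example</footer></html>"
)
print(extract_main_text(page))
```

The sketch keeps heading and paragraph text and drops the navigation bar and footer, which is the kind of cleanup that tends to help retrieval later.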
2. Chunking strategy
Chunk size and overlap are among the most influential knobs in a RAG pipeline. Small chunks (100-300 tokens) give precise retrieval but may lose surrounding context. Large chunks (500-1000 tokens) preserve context but add text unrelated to the query, which dilutes the match. A common starting point is 300 tokens with a 50-token overlap, so that text near a boundary appears in both neighboring chunks.
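A fixed-size chunker with overlap can be sketched as a sliding window. The function name chunk_tokens is hypothetical, and the example treats whitespace-separated words as tokens, which is only an approximation; in practice you would count tokens with the tokenizer of your embedding model.

```python
def chunk_tokens(tokens, chunk_size=300, overlap=50):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks


# Whitespace words stand in for model tokens here (an assumption).
words = "the quick brown fox jumps over the lazy dog today".split()
for chunk in chunk_tokens(words, chunk_size=4, overlap=1):
    print(" ".join(chunk))
```

With chunk_size=4 and overlap=1, each chunk starts with the last word of the previous one, so a phrase that straddles a boundary is still retrievable from at least one chunk.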
Also consider semantic chunking. Instead of fixed sizes, split on headings, paragraphs, or natural boundaries. For code, split by function or class. For dialogue, split by speaker turn.
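Splitting on headings might look like the following sketch. It assumes the source text is Markdown with "#"-style headings, which is an assumption of this example; split_by_heading is a hypothetical helper, and other formats would need a different boundary rule.

```python
import re

# A Markdown heading: one to six '#' characters, a space, then some text.
HEADING = re.compile(r"^#{1,6}\s+\S")


def split_by_heading(markdown_text):
    """Return (heading, body) pairs; text before the first heading gets heading ''."""
    sections = []
    heading, body = "", []

    def flush():
        # Only emit a section if it has a heading or some non-blank body text.
        if heading or any(line.strip() for line in body):
            sections.append((heading, "\n".join(body).strip()))

    for line in markdown_text.splitlines():
        if HEADING.match(line):
            flush()
            heading, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    flush()
    return sections


doc = "# Install\nRun the installer.\n\n# Usage\nStart the server.\nOpen the UI."
for title, text in split_by_heading(doc):
    print(repr(title), "->", repr(text))
```

Sections that are still too long for the embedding model can then be passed through a fixed-size splitter, so the two approaches combine rather than compete.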
3. Choosing an embedding model
The embedding model turns text into vectors. OpenAI's text-embedding-3-small and text-embedding-3-large are strong defaults. Cohere Embed, Voyage, and Jina offer competitive alternatives. Open models like nomic-embed-text and e5 are good for self-hosted or offline pipelines.
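Set the vector dimension of your database index to the embedding model's output dimension, and use the same model at index and query time. Vectors from different models live in different spaces, so similarity scores between them are meaningless and retrieval quality collapses.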
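One way to guard against mixed models is to record the model name and dimension with the index and check every vector against them. The VectorIndex class and the model names below are hypothetical and exist only for this sketch; many vector databases enforce the dimension themselves, but few know which model produced a vector.

```python
from dataclasses import dataclass, field


@dataclass
class VectorIndex:
    """Toy index that remembers which embedding model it was built with."""

    model_name: str
    dimension: int
    vectors: list = field(default_factory=list)

    def check(self, vector, model_name):
        """Raise ValueError if the vector does not belong in this index."""
        if model_name != self.model_name:
            raise ValueError(
                f"index built with {self.model_name!r}, got vector from {model_name!r}"
            )
        if len(vector) != self.dimension:
            raise ValueError(
                f"expected {self.dimension} dimensions, got {len(vector)}"
            )

    def add(self, vector, model_name):
        self.check(vector, model_name)
        self.vectors.append(list(vector))


# "embed-model-a" and "embed-model-b" are placeholder names, not real models.
index = VectorIndex(model_name="embed-model-a", dimension=3)
index.add([0.1, 0.2, 0.3], model_name="embed-model-a")

try:
    index.check([0.1, 0.2, 0.3], model_name="embed-model-b")
except ValueError as err:
    print("rejected query:", err)
```

Calling check on the query vector before searching turns a silent quality problem into an immediate error.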
4. Vector database options
- Managed: Pinecone, Weaviate Cloud, and Zilliz handle scaling and operations for you.
- Self-hosted: Chroma, Qdrant, and Milvus run locally or in your own infrastructure.
- Postgres: pgvector adds vector search to PostgreSQL, great if you already use it.
- In-memory: FAISS, Annoy, and Chroma in-memory are perfect for prototypes and small datasets.
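To make concrete what any of these stores does at its core, here is a brute-force in-memory sketch in pure Python. InMemoryVectorStore is a hypothetical class written for this example and does not mirror the API of FAISS, Chroma, or any other library; real stores add approximate indexes, persistence, and filtering on top of the same idea.

```python
import math


class InMemoryVectorStore:
    """Brute-force cosine-similarity search over (vector, text) rows."""

    def __init__(self):
        self._rows = []  # list of (unit_vector, text)

    @staticmethod
    def _normalize(vector):
        norm = math.sqrt(sum(x * x for x in vector))
        if norm == 0:
            raise ValueError("cannot normalize a zero vector")
        return [x / norm for x in vector]

    def add(self, vector, text):
        # Store unit vectors so cosine similarity reduces to a dot product.
        self._rows.append((self._normalize(vector), text))

    def search(self, query_vector, k=5):
        """Return the k most similar rows as (score, text), best first."""
        q = self._normalize(query_vector)
        scored = [
            (sum(a * b for a, b in zip(q, v)), text) for v, text in self._rows
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


store = InMemoryVectorStore()
store.add([1.0, 0.0], "cats")
store.add([0.0, 1.0], "dogs")
store.add([1.0, 1.0], "pets")

for score, text in store.search([1.0, 0.1], k=2):
    print(f"{score:.3f}  {text}")
```

A linear scan like this is fine for a few thousand chunks; beyond that, the approximate indexes in the options listed above become worth their setup cost.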
5. Retrieval and re-ranking
Retrieve the top-k chunks using cosine or dot-product similarity. A value of k between 5 and 10 is common. If your chunks are small, retrieve more and then re-rank with a cross-encoder or a dedicated re-ranker like Cohere Rerank. A re-ranker reads the query and each chunk together, which is slower than vector similarity but more precise, so it is applied only to the retrieved candidates.
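The second stage of this retrieve-then-re-rank flow can be sketched as below. The lexical_overlap scorer is a deliberately crude stand-in for a cross-encoder, chosen so the example runs without any model; rerank and its score_pair parameter are hypothetical names. In a real pipeline, score_pair would call your re-ranking model.

```python
import re


def lexical_overlap(query, chunk):
    """Toy relevance score: fraction of query words that appear in the chunk."""
    query_words = set(re.findall(r"\w+", query.lower()))
    chunk_words = set(re.findall(r"\w+", chunk.lower()))
    if not query_words:
        return 0.0
    return len(query_words & chunk_words) / len(query_words)


def rerank(query, candidates, score_pair=lexical_overlap, top_n=3):
    """Re-order first-stage candidates by a pairwise (query, chunk) score."""
    scored = [(score_pair(query, chunk), chunk) for chunk in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_n]]


# Pretend these came back from the vector search, in similarity order.
candidates = [
    "Billing emails are sent monthly.",
    "To reset your password, request a reset email.",
    "Passwords must be 12 characters.",
]
best = rerank("reset password email", candidates, top_n=2)
print(best[0])
```

The pattern is to over-fetch cheaply (for example 30 candidates) and let the more expensive scorer pick the handful that actually go into the prompt.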
6. Prompt engineering for RAG
Format the retrieved chunks clearly so the model knows what is context and what is the question. A simple template:
Use the following context to answer the question.
If the answer is not in the context, say "I don't know".
Context:
{#each chunks as chunk}
- {chunk}
{/each}
Question: {question}
Answer:
7. Evaluation
Measure retrieval with hit rate, mean reciprocal rank, and nDCG. Measure generation with faithfulness, answer relevance, and hallucination checks. Build a small labeled dataset of questions and expected answers, then iterate on chunk size, embedding model, and re-ranking.
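Two of the retrieval metrics named above, hit rate and mean reciprocal rank, are simple enough to compute by hand. The sketch below assumes a small labeled set in which each question maps to the chunk IDs that should be retrieved; the IDs and function names are made up for the example, and nDCG is left out to keep it short.

```python
def hit_rate(results, relevant, k=5):
    """Fraction of queries with at least one relevant chunk in the top k."""
    hits = sum(
        1 for qid, ranked in results.items() if set(ranked[:k]) & relevant[qid]
    )
    return hits / len(results)


def mean_reciprocal_rank(results, relevant):
    """Average of 1/rank of the first relevant chunk (0 if none is retrieved)."""
    total = 0.0
    for qid, ranked in results.items():
        for rank, chunk_id in enumerate(ranked, start=1):
            if chunk_id in relevant[qid]:
                total += 1.0 / rank
                break
    return total / len(results)


# Ranked chunk IDs returned by the retriever for each labeled question.
results = {
    "q1": ["c3", "c1", "c7"],  # relevant chunk at rank 2
    "q2": ["c2", "c9", "c4"],  # relevant chunk at rank 1
    "q3": ["c5", "c6", "c8"],  # relevant chunk missing
}
relevant = {"q1": {"c1"}, "q2": {"c2"}, "q3": {"c0"}}

print("hit rate@3:", hit_rate(results, relevant, k=3))
print("MRR:", mean_reciprocal_rank(results, relevant))
```

Re-running these two numbers after each change to chunk size, embedding model, or re-ranking gives a direct comparison between configurations.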
Common mistakes
- Chunks that are too large and dilute relevance.
- Not cleaning headers, footers, and duplicate text from crawled pages.
- Using different embedding models for indexing and querying.
- Skipping evaluation and tuning only by intuition.
- Showing users the raw retrieved chunks instead of an answer grounded in them.
Next steps
- Embedding Cost Calculator - estimate RAG embedding costs
- Context Window Comparison - pick a model that fits your chunks
- How to Choose an Embedding Model - compare models by task and budget
Published 2026-06-12