VePrompts

RAG from Scratch

Bottom line: Retrieval-Augmented Generation (RAG) lets a language model answer questions using your own documents. This guide walks through building a RAG pipeline step by step, from raw documents to a working system.

What RAG does

Large language models are trained on public data, so they do not know your private documents, recent releases, or internal knowledge bases. RAG solves this by retrieving relevant text snippets at query time and including them in the prompt. The model then reasons over the retrieved context instead of relying only on its training data.

The RAG pipeline

A minimal RAG system has five stages:

  1. Load: read documents from files, APIs, databases, or web pages.
  2. Chunk: split documents into smaller pieces that fit in the embedding model's input window.
  3. Embed: convert each chunk into a dense vector that captures its meaning.
  4. Store: save vectors and their text in a vector database for fast similarity search.
  5. Retrieve and generate: embed the user query, find the closest chunks, and pass them to a language model.

1. Loading documents

Use loaders that match your source. For PDFs, tools like PyPDF or Unstructured extract text while preserving some structure. For websites, scrape the main content and strip navigation and ads. For databases, export rows as text or JSON. The cleaner your source text, the better your retrieval quality.
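For the website case, a minimal sketch using only Python's standard library might look like the following. The `SKIP_TAGS` set, the `MainTextExtractor` class, and the `html_to_text` helper are illustrative names, not part of any loader library, and the tag list is an assumption about what counts as page chrome on your site.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as page chrome rather than article text.
# This list is an assumption; adjust it for the sites you crawl.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class MainTextExtractor(HTMLParser):
    """Collects visible text, ignoring anything nested inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self._parts.append(text)

    def text(self):
        return "\n".join(self._parts)


def html_to_text(html):
    """Return the text of an HTML page with navigation, scripts, and footers removed."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = """
<html><body>
<nav><a href="/">Home</a></nav>
<h1>Refund policy</h1>
<p>Refunds are issued within 14 days.</p>
<footer>Copyright Example Corp</footer>
<script>track()</script>
</body></html>
"""
print(html_to_text(page))
```

Real pages often need site-specific rules on top of this, which is where dedicated extraction tools earn their keep.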

2. Chunking strategy

Chunk size and overlap are two of the most influential knobs in a RAG pipeline. Small chunks (100-300 tokens) give precise retrieval but may lose surrounding context. Large chunks (500-1000 tokens) preserve context but mix several topics into one vector, which adds noise to the match. A common starting point is 300 tokens with a 50-token overlap, so that a short sentence cut at one chunk boundary is likely to appear whole in the neighboring chunk.
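Fixed-size chunking with overlap can be sketched as a sliding window. The example below is a simplification: it counts whitespace-separated words instead of real tokens, so the sizes only approximate what a tokenizer would report. The function names `chunk_tokens` and `chunk_text` are illustrative.

```python
def chunk_tokens(tokens, chunk_size=300, overlap=50):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(tokens):
            break
    return chunks


def chunk_text(text, chunk_size=300, overlap=50):
    """Word-based approximation: whitespace words stand in for model tokens."""
    words = text.split()
    return [" ".join(chunk) for chunk in chunk_tokens(words, chunk_size, overlap)]


document = " ".join(f"w{i}" for i in range(700))
chunks = chunk_text(document)
print(len(chunks))                    # number of chunks produced
print(len(chunks[0].split()))         # size of the first chunk in words
```

For production use, swapping `text.split()` for the tokenizer that matches your embedding model keeps chunk sizes within the model's actual input limit.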

Also consider semantic chunking. Instead of fixed sizes, split on headings, paragraphs, or natural boundaries. For code, split by function or class. For dialogue, split by speaker turn.
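Two of these boundary-based strategies can be sketched with the standard library. The first packs whole paragraphs into chunks up to a word budget; the second splits Python source into top-level functions and classes with the `ast` module. Both helper names and the `max_words` default are assumptions chosen for illustration.

```python
import ast
import re


def chunk_by_paragraph(text, max_words=300):
    """Pack whole paragraphs into chunks without exceeding max_words where possible."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current, count = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        # Start a new chunk rather than splitting a paragraph in the middle.
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def chunk_python_source(source):
    """Return one chunk per top-level function or class definition."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append(ast.get_source_segment(source, node))
    return chunks


text = "one two three\n\nfour five\n\nsix seven eight nine"
print(chunk_by_paragraph(text, max_words=5))

source = "import os\n\ndef a():\n    return 1\n\nclass B:\n    def m(self):\n        return 2\n"
for piece in chunk_python_source(source):
    print(piece)
    print("---")
```

A single paragraph longer than `max_words` still becomes its own oversized chunk here; a fuller version would fall back to fixed-size splitting for that case.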

3. Choosing an embedding model

The embedding model turns text into vectors. OpenAI's text-embedding-3-small and text-embedding-3-large are strong defaults. Cohere Embed, Voyage, and Jina offer competitive alternatives. Open models like nomic-embed-text and e5 are good for self-hosted or offline pipelines.

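Configure your vector index with the same number of dimensions that the embedding model outputs, and use the same model at index and query time. Vectors from different models live in unrelated spaces, so similarity scores between them are meaningless and retrieval quality collapses.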
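One way to enforce this rule is to record the model name and dimension alongside the index and reject anything that does not match. The sketch below uses `toy_embed`, a hashed bag-of-words function that stands in for a real embedding model purely so the example runs offline; the `EmbeddingIndex` class and the model name `toy-hash-8` are hypothetical.

```python
import hashlib
import math


def toy_embed(text, dim=8):
    """Stand-in for a real embedding model: hashed bag of words, L2-normalized."""
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class EmbeddingIndex:
    """Stores vectors together with the model name and dimension that produced them."""

    def __init__(self, model_name, dim):
        self.model_name = model_name
        self.dim = dim
        self.rows = []

    def check(self, model_name, vector):
        """Raise ValueError if a vector comes from another model or has the wrong size."""
        if model_name != self.model_name:
            raise ValueError(f"index built with {self.model_name!r}, got {model_name!r}")
        if len(vector) != self.dim:
            raise ValueError(f"index expects {self.dim} dimensions, got {len(vector)}")

    def add(self, model_name, vector, text):
        self.check(model_name, vector)
        self.rows.append((vector, text))


index = EmbeddingIndex("toy-hash-8", dim=8)
index.add("toy-hash-8", toy_embed("refunds take 14 days"), "refunds take 14 days")

try:
    # A query embedded with a different model is rejected before any search runs.
    index.check("toy-hash-16", toy_embed("refund window", dim=16))
except ValueError as err:
    print("rejected:", err)
```

Failing loudly at this point is cheaper than debugging silently poor search results later.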

4. Vector database options

  • Managed: Pinecone, Weaviate Cloud, and Zilliz handle scaling and operations for you.
  • Self-hosted: Chroma, Qdrant, and Milvus run locally or in your own infrastructure.
  • Postgres: pgvector adds vector search to PostgreSQL, great if you already use it.
  • In-memory: FAISS, Annoy, and Chroma in-memory are perfect for prototypes and small datasets.
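To see what any of these stores does at its core, here is a brute-force in-memory version in plain Python. It is a teaching sketch, not the API of FAISS, Chroma, or any other product: the `InMemoryVectorStore` class is hypothetical, and it scans every vector on each query, whereas real databases use approximate indexes to stay fast at scale.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class InMemoryVectorStore:
    """Keeps (vector, text, metadata) rows in a list and searches them exhaustively."""

    def __init__(self):
        self._items = []

    def add(self, vector, text, metadata=None):
        self._items.append((list(vector), text, metadata or {}))

    def search(self, query_vector, k=5):
        """Return the k rows most similar to query_vector as (score, text, metadata)."""
        scored = [(cosine(query_vector, vec), text, meta) for vec, text, meta in self._items]
        scored.sort(key=lambda row: row[0], reverse=True)
        return scored[:k]


store = InMemoryVectorStore()
# Hand-written 2-D vectors keep the example readable; real embeddings have hundreds of dimensions.
store.add([1.0, 0.0], "cats", {"source": "pets.md"})
store.add([0.0, 1.0], "taxes", {"source": "finance.md"})
store.add([0.9, 0.1], "kittens", {"source": "pets.md"})

for score, text, meta in store.search([1.0, 0.0], k=2):
    print(round(score, 3), text, meta["source"])
```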

5. Retrieval and re-ranking

Retrieve the top-k chunks using cosine or dot-product similarity. A value of k between 5 and 10 is common. If your chunks are small, retrieve more and then re-rank with a cross-encoder or a dedicated re-ranker like Cohere Rerank. Re-ranking improves precision because the re-ranker reads the query and each retrieved chunk together, which is more accurate than comparing two independently computed vectors but too slow to run over the whole index.
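The retrieve-then-re-rank pattern can be sketched as a second scoring pass over the first-stage candidates. In the example below, `overlap_score` is a deliberately crude placeholder (the fraction of query words found in the chunk) standing in for a cross-encoder or hosted re-ranker; in practice you would pass a function that calls your actual re-ranking model. Both function names are illustrative.

```python
def overlap_score(query, chunk):
    """Placeholder relevance score: share of query words that appear in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & chunk_words) / len(query_words)


def rerank(query, candidates, score_fn=overlap_score, top_n=5):
    """Score each (query, chunk) pair and keep the top_n highest-scoring chunks."""
    scored = [(score_fn(query, chunk), chunk) for chunk in candidates]
    # sorted() is stable, so ties keep their first-stage order.
    scored = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_n]]


query = "reset password email"
# Pretend these came back from a wide first-stage vector search, in that order.
candidates = [
    "billing questions go to finance",
    "to reset your password request a reset email",
    "password rules require twelve characters",
]
for chunk in rerank(query, candidates, top_n=2):
    print(chunk)
```

Because `score_fn` is a parameter, the retrieval code does not change when the placeholder is replaced with a real model.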

6. Prompt engineering for RAG

Format the retrieved chunks clearly so the model knows what is context and what is the question. A simple template:

Use the following context to answer the question.
If the answer is not in the context, say "I don't know".

Context:
{#each chunks as chunk}
- {chunk}
{/each}

Question: {question}
Answer:
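The template's loop placeholder is not tied to any particular language. A minimal Python rendering, assuming a hypothetical `build_prompt` helper, might look like this:

```python
def build_prompt(question, chunks):
    """Fill the RAG template with retrieved chunks and the user's question."""
    # Collapse internal whitespace so each chunk stays on one bullet line.
    context = "\n".join(f"- {' '.join(chunk.split())}" for chunk in chunks)
    return (
        "Use the following context to answer the question.\n"
        'If the answer is not in the context, say "I don\'t know".\n'
        "\n"
        "Context:\n"
        f"{context}\n"
        "\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_prompt(
    "How long do refunds take?",
    ["Refunds are issued within 14 days.", "Store credit is issued immediately."],
)
print(prompt)
```

The resulting string is what you send to the language model as the user message or completion prompt.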

7. Evaluation

Measure retrieval with hit rate, mean reciprocal rank, and nDCG. Measure generation with faithfulness, answer relevance, and hallucination checks. Build a small labeled dataset of questions and expected answers, then iterate on chunk size, embedding model, and re-ranking.
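The three retrieval metrics are short enough to compute by hand. The sketch below assumes binary relevance (a chunk is either relevant or not) and a labeled set where each question maps to the IDs of its relevant chunks; the chunk IDs and the `evaluate` helper are made up for illustration.

```python
import math


def hit_at_k(retrieved, relevant, k):
    """1.0 if any relevant chunk appears in the top k results, else 0.0."""
    return 1.0 if any(doc in relevant for doc in retrieved[:k]) else 0.0


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance nDCG: discounted gain divided by the best achievable gain."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(retrieved[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


def evaluate(runs, k=5):
    """Average each metric over (retrieved_ids, relevant_ids) pairs."""
    n = len(runs)
    return {
        "hit_rate": sum(hit_at_k(r, rel, k) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
        "ndcg": sum(ndcg_at_k(r, rel, k) for r, rel in runs) / n,
    }


# Each pair: ranked chunk IDs the retriever returned, and the labeled relevant IDs.
runs = [
    (["c1", "c2", "c3"], {"c2"}),   # relevant chunk found at rank 2
    (["c4", "c5", "c6"], {"c9"}),   # relevant chunk missed entirely
]
print(evaluate(runs, k=3))
```

Re-running `evaluate` after each change to chunk size, embedding model, or re-ranking gives a like-for-like comparison instead of an impression.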

Common mistakes

  • Chunks that are too large and dilute relevance.
  • Not cleaning headers, footers, and duplicate text from crawled pages.
  • Using different embedding models for indexing and querying.
  • Skipping evaluation and tuning only by intuition.
  • Returning raw retrieved chunks to users instead of an answer grounded in them.
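The header and footer problem in the list above can often be reduced with a simple frequency rule: a line that appears on most pages of a crawl is probably site chrome. The sketch below is one such heuristic, not a complete cleaner; the `strip_repeated_lines` name and the `max_fraction` and `min_pages` thresholds are assumptions to tune on your own data.

```python
from collections import Counter


def strip_repeated_lines(pages, max_fraction=0.5, min_pages=3):
    """Drop lines that occur on more than max_fraction of the pages in a crawl."""
    # With very few pages there is not enough evidence to call anything boilerplate.
    if len(pages) < min_pages:
        return list(pages)
    counts = Counter()
    for page in pages:
        # Count each distinct line once per page.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    limit = max_fraction * len(pages)
    cleaned = []
    for page in pages:
        kept = [
            line.strip()
            for line in page.splitlines()
            if line.strip() and counts[line.strip()] <= limit
        ]
        cleaned.append("\n".join(kept))
    return cleaned


pages = [
    "Acme Docs\nInstall guide\nCopyright Acme",
    "Acme Docs\nAPI reference\nCopyright Acme",
    "Acme Docs\nFAQ\nCopyright Acme",
]
print(strip_repeated_lines(pages))
```

Running this before chunking keeps repeated chrome out of every chunk's embedding.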

Published 2026-06-12
