RAG Evaluation Metrics That Actually Matter
Bottom line: You cannot improve what you do not measure. Split RAG evaluation into retrieval metrics and generation metrics, then iterate on each part independently.
The two halves of RAG quality
A RAG system fails if either retrieval or generation is bad. If retrieval returns irrelevant chunks, the model has no chance. If retrieval is perfect but the model ignores the context, the answer is still wrong.
Measure both halves. Start with retrieval metrics to make sure the right chunks are available, then add generation metrics to make sure the model uses them.
Retrieval metrics
Hit rate
The percentage of queries where the correct chunk appears anywhere in the top k results. Easy to understand and a good starting point.
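As a minimal sketch, hit rate can be computed from two mappings: ranked chunk IDs per query and the annotated relevant chunk IDs per query. The names `retrieved`, `relevant`, and `hit_rate` are illustrative assumptions, not part of any particular framework.

```python
def hit_rate(retrieved, relevant, k):
    """Fraction of queries with at least one relevant chunk in the top k.

    retrieved: dict of query id -> list of chunk ids, best match first (assumed shape)
    relevant:  dict of query id -> set of chunk ids annotated as correct (assumed shape)
    """
    if not retrieved:
        return 0.0
    hits = 0
    for query_id, ranked_ids in retrieved.items():
        expected = relevant.get(query_id, set())
        # A query counts as a hit if any of its top-k chunks is annotated as relevant.
        if any(chunk_id in expected for chunk_id in ranked_ids[:k]):
            hits += 1
    return hits / len(retrieved)


# Hypothetical example data: q1 has its relevant chunk at rank 2, q2 has none.
retrieved = {"q1": ["c3", "c1", "c7"], "q2": ["c9", "c4", "c2"]}
relevant = {"q1": {"c1"}, "q2": {"c5"}}
print(hit_rate(retrieved, relevant, k=3))
```

Reporting the value at several cutoffs (for example k = 1, 3, 10) shows whether the correct chunk is missing entirely or just ranked too low.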
Mean reciprocal rank
Rewards systems that place the correct chunk near the top. For each query, take 1 divided by the rank of the first relevant chunk (0 if no relevant chunk is retrieved), then average across queries.
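The calculation can be sketched as follows. The input shapes and the function name `mean_reciprocal_rank` are assumptions for illustration; a query whose results contain no relevant chunk contributes 0 to the average.

```python
def mean_reciprocal_rank(retrieved, relevant):
    """Average of 1/rank of the first relevant chunk across queries.

    retrieved: dict of query id -> list of chunk ids, best match first (assumed shape)
    relevant:  dict of query id -> set of chunk ids annotated as correct (assumed shape)
    """
    if not retrieved:
        return 0.0
    total = 0.0
    for query_id, ranked_ids in retrieved.items():
        expected = relevant.get(query_id, set())
        for rank, chunk_id in enumerate(ranked_ids, start=1):
            if chunk_id in expected:
                total += 1.0 / rank
                break  # only the first relevant chunk counts
    return total / len(retrieved)


# Hypothetical example: first relevant chunk at rank 2, rank 1, and not retrieved.
retrieved = {"q1": ["c3", "c1"], "q2": ["c4", "c9"], "q3": ["c8", "c6"]}
relevant = {"q1": {"c1"}, "q2": {"c4"}, "q3": {"c5"}}
print(mean_reciprocal_rank(retrieved, relevant))
```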
Normalized discounted cumulative gain
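nDCG accounts for every relevant chunk and its position: chunks lower in the ranking contribute less, and the total is normalized against the best possible ordering so scores fall between 0 and 1. Best when answers are spread across several chunks.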
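A minimal sketch of nDCG at k for a single query, assuming graded relevance labels stored in a dict and the linear-gain form of DCG (some implementations use an exponential gain of 2^rel - 1 instead). The function names are illustrative.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: each gain is divided by log2(position + 1)."""
    return sum(gain / math.log2(index + 2) for index, gain in enumerate(gains))


def ndcg_at_k(ranked_ids, relevance, k):
    """nDCG@k for one query.

    ranked_ids: list of chunk ids, best match first (assumed shape)
    relevance:  dict of chunk id -> graded relevance, 0 or missing means irrelevant
    """
    gains = [relevance.get(chunk_id, 0) for chunk_id in ranked_ids[:k]]
    # The ideal ranking places the most relevant chunks first.
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    ideal = dcg(ideal_gains)
    return dcg(gains) / ideal if ideal > 0 else 0.0


# Hypothetical example: two relevant chunks, one of them ranked third.
print(ndcg_at_k(["c1", "c9", "c2"], {"c1": 1, "c2": 1}, k=3))
```

Averaging `ndcg_at_k` over all test queries gives the system-level score.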
Precision at k
The share of top k retrieved chunks that are relevant. Useful when you want to minimize noise in the context window.
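For a single query, precision at k can be sketched like this. The function name and argument shapes are assumptions; the denominator here is k itself, so returning fewer than k chunks lowers the score.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Share of the top-k retrieved chunks that are annotated as relevant."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_k = ranked_ids[:k]
    relevant_in_top_k = sum(1 for chunk_id in top_k if chunk_id in relevant_ids)
    return relevant_in_top_k / k


# Hypothetical example: two of the four retrieved chunks are relevant.
print(precision_at_k(["c1", "c2", "c3", "c4"], {"c1", "c4", "c9"}, k=4))
```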
Generation metrics
Faithfulness
Checks whether every claim in the generated answer is supported by the retrieved context. A low score means the answer contains claims the context does not back up, which usually indicates hallucination.
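One common way to score this is the fraction of claims that a judge marks as supported. The sketch below assumes the answer has already been split into claims and takes the judge as a plain callable; in practice that callable would wrap an LLM call. `toy_judge` is a hypothetical stand-in that only does a case-insensitive substring match, so the example runs without any model.

```python
from typing import Callable, List


def faithfulness(claims: List[str], context: str,
                 is_supported: Callable[[str, str], bool]) -> float:
    """Fraction of claims that the judge considers supported by the context."""
    if not claims:
        # Nothing to verify; this sketch reports 0.0 rather than guessing.
        return 0.0
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims)


def toy_judge(claim: str, context: str) -> bool:
    """Hypothetical stand-in for an LLM judge: verbatim, case-insensitive match."""
    return claim.lower() in context.lower()


context = "The warranty lasts two years. Returns are free."
claims = ["the warranty lasts two years", "shipping is free worldwide"]
print(faithfulness(claims, context, toy_judge))
```

Keeping the judge as a parameter makes it easy to swap in a different model or prompt without touching the scoring logic.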
Answer relevance
Measures how well the answer addresses the original question, ignoring whether it is factually correct.
Context relevance
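Measures whether the retrieved chunks contain the information needed to answer the question. It is usually scored with the same tooling as the generation metrics, but a low score points to a retrieval problem.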
Building a labeled test set
- Collect questions: use real user queries or generate synthetic ones from your documents.
- Annotate chunks: mark the chunks that contain the answer to each question.
- Write reference answers: write concise answers based only on the annotated chunks.
- Version your data: keep the test set separate from training data and track changes over time.
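The steps above can be sketched as a small record type plus a content hash that serves as a version identifier. The field names, the `EvalCase` class, and the hashing scheme are illustrative assumptions, not a required format.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class EvalCase:
    question: str                  # real or synthetic user query
    relevant_chunk_ids: List[str]  # chunks annotated as containing the answer
    reference_answer: str          # concise answer based only on those chunks


def dataset_version(cases: List[EvalCase]) -> str:
    """Short content hash that changes whenever any test case changes."""
    payload = json.dumps([asdict(case) for case in cases], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


# Hypothetical example case.
cases = [
    EvalCase(
        question="How long is the warranty?",
        relevant_chunk_ids=["doc7-chunk2"],
        reference_answer="Two years.",
    )
]
print(dataset_version(cases))
```

Logging this identifier next to each evaluation run makes it clear which results were measured against the same test set.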
Tools and frameworks
Ragas is a widely used open-source framework for RAG evaluation. It provides both reference-free metrics, which need no labeled answers, and reference-based metrics, which compare against them. You can also use Arize Phoenix, TruLens, or write your own evaluators with an LLM-as-a-judge pattern.
Iterating with metrics
Change one variable at a time. Test chunk size, embedding model, re-ranker, and prompt separately. If retrieval metrics improve but generation metrics do not, the problem is likely in the prompt or the model. If both are low, focus on chunking and embeddings first.
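The two rules of thumb above can be written down as a small helper that compares a baseline run with a run where one variable changed. The dictionary keys, the `low` threshold, and the `min_gain` tolerance are assumptions chosen for illustration; pick values that fit your own score ranges.

```python
def next_focus(baseline, candidate, low=0.5, min_gain=0.01):
    """Suggest where to look next after changing one variable.

    baseline, candidate: dicts with aggregate "retrieval" and "generation"
    scores in [0, 1] (assumed shape).
    """
    retrieval_gain = candidate["retrieval"] - baseline["retrieval"]
    generation_gain = candidate["generation"] - baseline["generation"]

    # Both halves are weak: fix what the retriever can find first.
    if candidate["retrieval"] < low and candidate["generation"] < low:
        return "chunking and embeddings"
    # Retrieval improved but the answers did not follow.
    if retrieval_gain >= min_gain and generation_gain < min_gain:
        return "prompt or model"
    return "no rule applies; keep testing one variable at a time"


# Hypothetical scores before and after adding a re-ranker.
baseline = {"retrieval": 0.6, "generation": 0.7}
candidate = {"retrieval": 0.8, "generation": 0.7}
print(next_focus(baseline, candidate))
```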
Published 2026-06-12