RAG Evaluation Guide
Learn how to evaluate and improve the quality of your Retrieval-Augmented Generation (RAG) pipeline.
What is RAG Evaluation?
RAG evaluation measures how well your AI agent:
- Retrieves relevant documents for the user’s query
- Grounds its response in the retrieved context
- Attributes information to the correct sources
Key Metrics
Groundedness
Definition: The degree to which the response is supported by the retrieved context.
Formula: supported_claims / total_claims
Score Range: 0-1 (higher is better)
Example:
Context: "Our return policy is 30 days for all products."
Response: "You can return items within 30 days." ✅ Grounded
Response: "You can return items within 60 days." ❌ Not grounded
Faithfulness
Definition: How accurately the response reflects the meaning of the source material.
Checks for:
- Paraphrasing accuracy
- No meaning distortion
- Preserved nuance
Example:
Context: "Premium users can access all features."
Response: "Premium users have full feature access." ✅ Faithful
Response: "All users can access premium features." ❌ Not faithful
Citation Accuracy
Definition: Whether sources are correctly attributed.
Checks for:
- Correct source references
- Accurate page/section numbers
- No fabricated citations
Context Relevance
Definition: How well the retrieved documents match the user’s query.
Formula: relevant_docs / retrieved_docs
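The two ratio metrics above translate directly into code. A minimal sketch, assuming the supported/relevant counts come from an evaluator model or human annotation (the helper names are illustrative, not part of the SDK):

```typescript
// Groundedness: fraction of response claims supported by the context.
// With no claims at all, there is nothing unsupported, so we return 1.
function groundedness(supportedClaims: number, totalClaims: number): number {
  return totalClaims === 0 ? 1 : supportedClaims / totalClaims;
}

// Context relevance: fraction of retrieved documents relevant to the query.
function contextRelevance(relevantDocs: number, retrievedDocs: number): number {
  return retrievedDocs === 0 ? 0 : relevantDocs / retrievedDocs;
}

console.log(groundedness(17, 20));    // 0.85
console.log(contextRelevance(4, 5));  // 0.8
```

The hard part in practice is producing the counts: claims must be extracted from the response and checked against the context, usually by a separate evaluator model.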
Setting Up RAG Evaluation
Trace your RAG pipeline
import { traceRetrieval, traceLLM } from '@thinkhive/sdk';

async function ragPipeline(query: string) {
  // Trace retrieval
  const docs = await traceRetrieval({
    name: 'document-search',
    query: query,
    topK: 5,
  }, async () => {
    return await vectorDB.search(query);
  });

  // Trace generation
  const response = await traceLLM({
    name: 'answer-generation',
    modelName: 'gpt-4',
    provider: 'openai',
    metadata: {
      retrievedDocs: docs.length,
    },
  }, async () => {
    return await generateAnswer(query, docs);
  });

  return response;
}
Enable RAG evaluation
import { runs } from '@thinkhive/sdk';

const run = await runs.create({
  name: 'rag-query',
  input: userQuery,
  output: response,
  metadata: {
    retrievedDocuments: docs,
  },
});

// Trigger analysis with RAG evaluation
const analysis = await runs.analyze(run.id, {
  includeRagEvaluation: true,
});

console.log(analysis.ragEvaluation);
// {
//   groundedness: 0.85,
//   faithfulness: 0.92,
//   citationAccuracy: 0.78,
//   contextRelevance: 0.88
// }
Set quality thresholds
const QUALITY_THRESHOLDS = {
  groundedness: 0.8,
  faithfulness: 0.85,
  citationAccuracy: 0.7,
  contextRelevance: 0.75,
};

function checkQuality(metrics: Record<string, number>): string[] {
  const issues = [];
  if (metrics.groundedness < QUALITY_THRESHOLDS.groundedness) {
    issues.push('Low groundedness - response may contain unsupported claims');
  }
  if (metrics.faithfulness < QUALITY_THRESHOLDS.faithfulness) {
    issues.push('Low faithfulness - response may distort source meaning');
  }
  return issues;
}
Improving RAG Quality
Low Groundedness
Symptoms: Response contains claims not in the retrieved context.
Solutions:
- Improve retrieval relevance (better embeddings, more documents)
- Add explicit grounding instructions to the prompt
- Use a lower temperature setting
- Add citation requirements
Prompt Example:
Answer ONLY based on the provided context. If the context doesn't contain
the information, say "I don't have information about that."
Context: {context}
Question: {question}
Low Faithfulness
Symptoms: Response distorts or misrepresents source information.
Solutions:
- Use chain-of-thought reasoning
- Add fact-checking step
- Reduce creative generation settings
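One way to sketch the fact-checking step is a post-generation pass that flags response sentences with little support in the retrieved context. The word-overlap heuristic below is a naive stand-in for illustration only; a production pipeline would use an evaluator model to verify each claim:

```typescript
// Flag response sentences whose word overlap with the context falls
// below a threshold. This is a crude proxy for claim verification:
// flagged sentences are candidates for a real fact-checking pass.
function factCheck(response: string, context: string, minOverlap = 0.5): string[] {
  const contextWords = new Set(context.toLowerCase().match(/\w+/g) ?? []);
  const sentences = response.split(/(?<=[.!?])\s+/).filter(s => s.trim());
  return sentences.filter(sentence => {
    const words = sentence.toLowerCase().match(/\w+/g) ?? [];
    const hits = words.filter(w => contextWords.has(w)).length;
    return words.length > 0 && hits / words.length < minOverlap;
  });
}

const flagged = factCheck(
  'Our return policy is 30 days. Refunds arrive instantly.',
  'Our return policy is 30 days for all products.'
);
console.log(flagged); // ['Refunds arrive instantly.'] — unsupported, needs review
```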
Low Citation Accuracy
Symptoms: Wrong sources cited or fabricated references.
Solutions:
- Provide structured source metadata
- Add explicit citation format requirements
- Validate citations in post-processing
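The post-processing check above can be sketched as follows, assuming a `[source-id]` citation format and a simple retrieved-document shape (both are illustrative, not an SDK contract):

```typescript
// Reject any [source-id] citation that does not match a document
// actually retrieved for this query — fabricated references are the
// most common citation-accuracy failure.
interface RetrievedDoc { id: string; title: string; }

function invalidCitations(response: string, docs: RetrievedDoc[]): string[] {
  const known = new Set(docs.map(d => d.id));
  const cited = [...response.matchAll(/\[([^\]]+)\]/g)].map(m => m[1]);
  return cited.filter(id => !known.has(id));
}

const retrieved = [{ id: 'doc-1', title: 'Return Policy' }];
console.log(invalidCitations('Returns take 30 days [doc-1], fees apply [doc-9].', retrieved));
// ['doc-9'] — a fabricated reference to strip or reject
```

Running this before the response reaches the user lets you strip bad citations or retry generation, rather than only measuring the problem after the fact.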
Low Context Relevance
Symptoms: Retrieved documents don’t match the query.
Solutions:
- Improve embedding model
- Add query expansion
- Use hybrid search (keyword + semantic)
- Implement re-ranking
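Hybrid search needs a way to merge the keyword and semantic result lists. Reciprocal rank fusion is one common, model-free way to do it; this sketch assumes each retriever returns document ids in rank order:

```typescript
// Reciprocal rank fusion: each document scores 1 / (k + rank) in every
// ranking that contains it, and scores are summed across rankings.
// k = 60 is the conventional smoothing constant from the RRF literature.
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()].sort((a, b) => b[1] - a[1]).map(([id]) => id);
}

const keyword = ['doc-3', 'doc-1', 'doc-7'];
const semantic = ['doc-1', 'doc-4', 'doc-3'];
console.log(reciprocalRankFusion([keyword, semantic]));
// ['doc-1', 'doc-3', 'doc-4', 'doc-7'] — docs ranked well by both rise to the top
```

Documents that appear high in both lists (`doc-1`, `doc-3`) beat documents that appear in only one, which is exactly the behavior you want from hybrid retrieval.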
Monitoring RAG Quality
Set up alerts
// Configure webhook for low-quality responses
await webhooks.create({
  url: 'https://your-app.com/alerts',
  events: ['trace.failure'],
  filters: {
    'ragEvaluation.groundedness': { lt: 0.7 }
  }
});
Track trends
// Get quality trends over time
const stats = await api.get('/api/v1/evaluation/trends', {
  params: {
    agentId: 'agent_123',
    metric: 'groundedness',
    period: '30d'
  }
});
Best Practices
Quality Benchmarks by Use Case
| Use Case | Groundedness | Faithfulness |
|---|---|---|
| Customer Support | 0.85+ | 0.90+ |
| Technical Docs | 0.90+ | 0.95+ |
| Legal/Compliance | 0.95+ | 0.98+ |
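The benchmark table can be encoded directly as per-use-case thresholds. The use-case keys and the pass/fail helper below are illustrative, not part of the SDK:

```typescript
// Benchmarks from the table above, encoded as configuration so the
// same check can gate different agents at different strictness levels.
const BENCHMARKS = {
  customerSupport: { groundedness: 0.85, faithfulness: 0.90 },
  technicalDocs:   { groundedness: 0.90, faithfulness: 0.95 },
  legalCompliance: { groundedness: 0.95, faithfulness: 0.98 },
} as const;

type UseCase = keyof typeof BENCHMARKS;

function meetsBenchmark(
  useCase: UseCase,
  metrics: { groundedness: number; faithfulness: number }
): boolean {
  const bench = BENCHMARKS[useCase];
  return metrics.groundedness >= bench.groundedness &&
         metrics.faithfulness >= bench.faithfulness;
}

console.log(meetsBenchmark('customerSupport', { groundedness: 0.88, faithfulness: 0.92 })); // true
console.log(meetsBenchmark('legalCompliance', { groundedness: 0.88, faithfulness: 0.92 })); // false
```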
- Include source documents in your traces for accurate evaluation
- Set appropriate thresholds based on your use case criticality
- Monitor trends, not just individual traces
- Investigate clusters of low-quality responses
- Test fixes with shadow testing before deploying
Next Steps
- Hallucination Detection - Detect fabricated information
- Shadow Testing - Validate improvements
- API Reference - Evaluation API details