
RAG Evaluation Guide

Learn how to evaluate and improve the quality of your Retrieval-Augmented Generation (RAG) pipeline.

What is RAG Evaluation?

RAG evaluation measures how well your AI agent:

  1. Retrieves relevant documents for the user’s query
  2. Grounds its response in the retrieved context
  3. Attributes information to the correct sources

Key Metrics

Groundedness

Definition: The degree to which the response is supported by the retrieved context.

Formula: supported_claims / total_claims

Score Range: 0-1 (higher is better)

Example:

Context: "Our return policy is 30 days for all products."
Response: "You can return items within 30 days." ✅ Grounded
Response: "You can return items within 60 days." ❌ Not grounded
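The formula above can be sketched as a small helper. Note that extracting claims from a response and judging whether each one is supported is typically done by an LLM judge; the `claims` and `supported` inputs below are hypothetical pre-computed values, not part of the SDK.

```typescript
// Hypothetical sketch: compute groundedness from pre-extracted claims.
// In practice, claim extraction and support judgments come from an LLM judge.
function groundednessScore(claims: string[], supported: Set<string>): number {
  if (claims.length === 0) return 1; // no claims means nothing unsupported
  const supportedCount = claims.filter((c) => supported.has(c)).length;
  return supportedCount / claims.length;
}

// Example: one of two claims is backed by the retrieved context.
const score = groundednessScore(
  ['Returns accepted within 30 days', 'Refunds take 60 days'],
  new Set(['Returns accepted within 30 days'])
);
```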

Faithfulness

Definition: How accurately the response reflects the meaning of the source material.

Checks for:

  • Paraphrasing accuracy
  • No meaning distortion
  • Preserved nuance

Example:

Context: "Premium users can access all features."
Response: "Premium users have full feature access." ✅ Faithful
Response: "All users can access premium features." ❌ Not faithful

Citation Accuracy

Definition: Whether sources are correctly attributed.

Checks for:

  • Correct source references
  • Accurate page/section numbers
  • No fabricated citations

Context Relevance

Definition: How well the retrieved documents match the user’s query.

Formula: relevant_docs / retrieved_docs
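The context-relevance formula can be expressed the same way. The per-document relevance judgment would come from an LLM judge or human labels; the `RetrievedDoc` shape here is an assumption for illustration.

```typescript
// Hypothetical sketch: context relevance over a retrieved batch.
// The `relevant` flag is assumed to come from an LLM judge or human label.
interface RetrievedDoc {
  id: string;
  relevant: boolean;
}

function contextRelevance(docs: RetrievedDoc[]): number {
  if (docs.length === 0) return 0; // nothing retrieved, nothing relevant
  return docs.filter((d) => d.relevant).length / docs.length;
}
```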

Setting Up RAG Evaluation

Trace your RAG pipeline

import { traceRetrieval, traceLLM } from '@thinkhive/sdk';
 
async function ragPipeline(query: string) {
  // Trace retrieval
  const docs = await traceRetrieval({
    name: 'document-search',
    query: query,
    topK: 5,
  }, async () => {
    return await vectorDB.search(query);
  });
 
  // Trace generation
  const response = await traceLLM({
    name: 'answer-generation',
    modelName: 'gpt-4',
    provider: 'openai',
    metadata: {
      retrievedDocs: docs.length,
    },
  }, async () => {
    return await generateAnswer(query, docs);
  });
 
  return response;
}

Enable RAG evaluation

import { runs } from '@thinkhive/sdk';
 
const run = await runs.create({
  name: 'rag-query',
  input: userQuery,
  output: response,
  metadata: {
    retrievedDocuments: docs,
  },
});
 
// Trigger analysis with RAG evaluation
const analysis = await runs.analyze(run.id, {
  includeRagEvaluation: true,
});
 
console.log(analysis.ragEvaluation);
// {
//   groundedness: 0.85,
//   faithfulness: 0.92,
//   citationAccuracy: 0.78,
//   contextRelevance: 0.88
// }

Set quality thresholds

const QUALITY_THRESHOLDS = {
  groundedness: 0.8,
  faithfulness: 0.85,
  citationAccuracy: 0.7,
  contextRelevance: 0.75,
};
 
function checkQuality(metrics: Record<string, number>): string[] {
  const issues: string[] = [];
 
  if (metrics.groundedness < QUALITY_THRESHOLDS.groundedness) {
    issues.push('Low groundedness - response may contain unsupported claims');
  }
  if (metrics.faithfulness < QUALITY_THRESHOLDS.faithfulness) {
    issues.push('Low faithfulness - response may distort source meaning');
  }
  if (metrics.citationAccuracy < QUALITY_THRESHOLDS.citationAccuracy) {
    issues.push('Low citation accuracy - sources may be misattributed');
  }
  if (metrics.contextRelevance < QUALITY_THRESHOLDS.contextRelevance) {
    issues.push('Low context relevance - retrieved documents may not match the query');
  }
 
  return issues;
}

Improving RAG Quality

Low Groundedness

Symptoms: Response contains claims not in the retrieved context.

Solutions:

  1. Improve retrieval relevance (better embeddings, more documents)
  2. Add explicit grounding instructions to the prompt
  3. Use a lower temperature setting
  4. Add citation requirements

Prompt Example:

Answer ONLY based on the provided context. If the context doesn't contain
the information, say "I don't have information about that."

Context: {context}

Question: {question}
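A minimal sketch of filling this template at generation time; `buildGroundedPrompt` is a hypothetical helper, not an SDK function:

```typescript
// Hypothetical helper: fill the grounding prompt template above
// before passing it to the model.
function buildGroundedPrompt(context: string, question: string): string {
  return `Answer ONLY based on the provided context. If the context doesn't contain
the information, say "I don't have information about that."

Context: ${context}

Question: ${question}`;
}
```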

Low Faithfulness

Symptoms: Response distorts or misrepresents source information.

Solutions:

  1. Use chain-of-thought reasoning
  2. Add fact-checking step
  3. Reduce creative generation settings

Low Citation Accuracy

Symptoms: Wrong sources cited or fabricated references.

Solutions:

  1. Provide structured source metadata
  2. Add explicit citation format requirements
  3. Validate citations in post-processing
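The post-processing check in step 3 can be sketched as follows. This assumes each retrieved document carries a stable id and that the response's citations have already been parsed into a list; both are assumptions for illustration.

```typescript
// Hypothetical post-processing check: any cited source id that does not
// appear among the retrieved documents is flagged as fabricated.
function findFabricatedCitations(
  citedIds: string[],
  retrievedIds: Set<string>
): string[] {
  return citedIds.filter((id) => !retrievedIds.has(id));
}
```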

Low Context Relevance

Symptoms: Retrieved documents don’t match the query.

Solutions:

  1. Improve embedding model
  2. Add query expansion
  3. Use hybrid search (keyword + semantic)
  4. Implement re-ranking
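Solution 3 (hybrid search) is often implemented as a weighted blend of keyword and semantic scores. A minimal sketch, assuming both scores are already normalized to 0-1; the `ScoredDoc` shape and `alpha` weight are illustrative, not SDK types:

```typescript
// Hypothetical sketch of hybrid ranking: blend a keyword (BM25-style) score
// with a semantic (embedding-similarity) score. `alpha` weights the keyword
// signal; both scores are assumed to be normalized to [0, 1].
interface ScoredDoc {
  id: string;
  keywordScore: number;
  semanticScore: number;
}

function hybridRank(docs: ScoredDoc[], alpha = 0.5): ScoredDoc[] {
  const combined = (d: ScoredDoc) =>
    alpha * d.keywordScore + (1 - alpha) * d.semanticScore;
  return [...docs].sort((a, b) => combined(b) - combined(a));
}
```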

Monitoring RAG Quality

Set up alerts

// Configure webhook for low-quality responses
await webhooks.create({
  url: 'https://your-app.com/alerts',
  events: ['trace.failure'],
  filters: {
    'ragEvaluation.groundedness': { lt: 0.7 }
  }
});

Track quality trends

// Get quality trends over time
const stats = await api.get('/api/v1/evaluation/trends', {
  params: {
    agentId: 'agent_123',
    metric: 'groundedness',
    period: '30d'
  }
});

Best Practices

Quality Benchmarks by Use Case

Use Case            Groundedness    Faithfulness
Customer Support    0.85+           0.90+
Technical Docs      0.90+           0.95+
Legal/Compliance    0.95+           0.98+
  1. Include source documents in your traces for accurate evaluation
  2. Set appropriate thresholds based on your use case criticality
  3. Monitor trends, not just individual traces
  4. Investigate clusters of low-quality responses
  5. Test fixes with shadow testing before deploying
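The benchmark table above can be turned into a runtime gate. The use-case keys and `meetsBenchmark` helper are hypothetical; only the threshold values come from the table:

```typescript
// Hypothetical mapping of the benchmark table to runtime gates.
// Threshold values are taken from the table above; names are illustrative.
const BENCHMARKS = {
  customerSupport: { groundedness: 0.85, faithfulness: 0.9 },
  technicalDocs: { groundedness: 0.9, faithfulness: 0.95 },
  legalCompliance: { groundedness: 0.95, faithfulness: 0.98 },
} as const;

function meetsBenchmark(
  useCase: keyof typeof BENCHMARKS,
  metrics: { groundedness: number; faithfulness: number }
): boolean {
  const b = BENCHMARKS[useCase];
  return (
    metrics.groundedness >= b.groundedness &&
    metrics.faithfulness >= b.faithfulness
  );
}
```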

Next Steps