
RAG Evaluation Guide

Learn how to evaluate and improve the quality of your Retrieval-Augmented Generation (RAG) pipeline.

What is RAG Evaluation?

RAG evaluation measures how well your AI agent:

  1. Retrieves relevant documents for the user’s query
  2. Grounds its response in the retrieved context
  3. Attributes information to the correct sources

Key Metrics

Groundedness

Definition: The degree to which the response is supported by the retrieved context.

Formula: supported_claims / total_claims

Score Range: 0-1 (higher is better)

Example:

Context: "Our return policy is 30 days for all products."
Response: "You can return items within 30 days." ✅ Grounded
Response: "You can return items within 60 days." ❌ Not grounded
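The formula above can be sketched as a small helper. Note that extracting claims from a response and judging whether each one is supported is typically done by an LLM judge; the `claims` and `supported` inputs below are hypothetical pre-computed values, not part of the SDK.

```typescript
// Hypothetical sketch: compute groundedness from pre-extracted claims.
// In practice, claim extraction and support judgments come from an LLM judge.
function groundednessScore(claims: string[], supported: Set<string>): number {
  if (claims.length === 0) return 1; // no claims means nothing unsupported
  const supportedCount = claims.filter((c) => supported.has(c)).length;
  return supportedCount / claims.length;
}

// Example: one of two claims is backed by the retrieved context.
const score = groundednessScore(
  ['Returns accepted within 30 days', 'Refunds take 60 days'],
  new Set(['Returns accepted within 30 days'])
);
```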

Faithfulness

Definition: How accurately the response reflects the meaning of the source material.

Checks for:

  • Paraphrasing accuracy
  • No meaning distortion
  • Preserved nuance

Example:

Context: "Premium users can access all features."
Response: "Premium users have full feature access." ✅ Faithful
Response: "All users can access premium features." ❌ Not faithful

Citation Accuracy

Definition: Whether sources are correctly attributed.

Checks for:

  • Correct source references
  • Accurate page/section numbers
  • No fabricated citations

Context Relevance

Definition: How well the retrieved documents match the user’s query.

Formula: relevant_docs / retrieved_docs
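The context-relevance formula can be expressed the same way. The per-document relevance judgment would come from an LLM judge or human labels; the `RetrievedDoc` shape here is an assumption for illustration.

```typescript
// Hypothetical sketch: context relevance over a retrieved batch.
// The `relevant` flag is assumed to come from an LLM judge or human label.
interface RetrievedDoc {
  id: string;
  relevant: boolean;
}

function contextRelevance(docs: RetrievedDoc[]): number {
  if (docs.length === 0) return 0; // nothing retrieved, nothing relevant
  return docs.filter((d) => d.relevant).length / docs.length;
}
```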

Setting Up RAG Evaluation

Trace your RAG pipeline

import { traceRetrieval, traceLLM } from '@thinkhive/sdk';
 
async function ragPipeline(query: string) {
  // Trace retrieval
  const docs = await traceRetrieval({
    name: 'document-search',
    query: query,
    topK: 5,
  }, async () => {
    return await vectorDB.search(query);
  });
 
  // Trace generation
  const response = await traceLLM({
    name: 'answer-generation',
    modelName: 'gpt-4',
    provider: 'openai',
    metadata: {
      retrievedDocs: docs.length,
    },
  }, async () => {
    return await generateAnswer(query, docs);
  });
 
  return response;
}

Enable RAG evaluation

import { runs } from '@thinkhive/sdk';
 
const run = await runs.create({
  name: 'rag-query',
  input: userQuery,
  output: response,
  metadata: {
    retrievedDocuments: docs,
  },
});
 
// Trigger analysis with RAG evaluation
const analysis = await runs.analyze(run.id, {
  includeRagEvaluation: true,
});
 
console.log(analysis.ragEvaluation);
// {
//   groundedness: 0.85,
//   faithfulness: 0.92,
//   citationAccuracy: 0.78,
//   contextRelevance: 0.88
// }

Set quality thresholds

const QUALITY_THRESHOLDS = {
  groundedness: 0.8,
  faithfulness: 0.85,
  citationAccuracy: 0.7,
  contextRelevance: 0.75,
};
 
function checkQuality(metrics: Record<string, number>): string[] {
  const issues: string[] = [];
 
  if (metrics.groundedness < QUALITY_THRESHOLDS.groundedness) {
    issues.push('Low groundedness - response may contain unsupported claims');
  }
  if (metrics.faithfulness < QUALITY_THRESHOLDS.faithfulness) {
    issues.push('Low faithfulness - response may distort source meaning');
  }
  if (metrics.citationAccuracy < QUALITY_THRESHOLDS.citationAccuracy) {
    issues.push('Low citation accuracy - sources may be misattributed');
  }
  if (metrics.contextRelevance < QUALITY_THRESHOLDS.contextRelevance) {
    issues.push('Low context relevance - retrieved documents may not match the query');
  }
 
  return issues;
}

Improving RAG Quality

Low Groundedness

Symptoms: Response contains claims not in the retrieved context.

Solutions:

  1. Improve retrieval relevance (better embeddings, more documents)
  2. Add explicit grounding instructions to the prompt
  3. Use a lower temperature setting
  4. Add citation requirements

Prompt Example:

Answer ONLY based on the provided context. If the context doesn't contain
the information, say "I don't have information about that."

Context: {context}

Question: {question}
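A minimal sketch of filling this template at generation time; `buildGroundedPrompt` is a hypothetical helper, not an SDK function:

```typescript
// Hypothetical helper: fill the grounding prompt template above
// before passing it to the model.
function buildGroundedPrompt(context: string, question: string): string {
  return `Answer ONLY based on the provided context. If the context doesn't contain
the information, say "I don't have information about that."

Context: ${context}

Question: ${question}`;
}
```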

Low Faithfulness

Symptoms: Response distorts or misrepresents source information.

Solutions:

  1. Use chain-of-thought reasoning
  2. Add fact-checking step
  3. Reduce creative generation settings

Low Citation Accuracy

Symptoms: Wrong sources cited or fabricated references.

Solutions:

  1. Provide structured source metadata
  2. Add explicit citation format requirements
  3. Validate citations in post-processing
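The post-processing check in step 3 can be sketched as follows. This assumes each retrieved document carries a stable id and that the response's citations have already been parsed into a list; both are assumptions for illustration.

```typescript
// Hypothetical post-processing check: any cited source id that does not
// appear among the retrieved documents is flagged as fabricated.
function findFabricatedCitations(
  citedIds: string[],
  retrievedIds: Set<string>
): string[] {
  return citedIds.filter((id) => !retrievedIds.has(id));
}
```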

Low Context Relevance

Symptoms: Retrieved documents don’t match the query.

Solutions:

  1. Improve embedding model
  2. Add query expansion
  3. Use hybrid search (keyword + semantic)
  4. Implement re-ranking
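Solution 3 (hybrid search) is often implemented as a weighted blend of keyword and semantic scores. A minimal sketch, assuming both scores are already normalized to 0-1; the `ScoredDoc` shape and `alpha` weight are illustrative, not SDK types:

```typescript
// Hypothetical sketch of hybrid ranking: blend a keyword (BM25-style) score
// with a semantic (embedding-similarity) score. `alpha` weights the keyword
// signal; both scores are assumed to be normalized to [0, 1].
interface ScoredDoc {
  id: string;
  keywordScore: number;
  semanticScore: number;
}

function hybridRank(docs: ScoredDoc[], alpha = 0.5): ScoredDoc[] {
  const combined = (d: ScoredDoc) =>
    alpha * d.keywordScore + (1 - alpha) * d.semanticScore;
  return [...docs].sort((a, b) => combined(b) - combined(a));
}
```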

Monitoring RAG Quality

Set up alerts

// Configure webhook for low-quality responses
await webhooks.create({
  url: 'https://your-app.com/alerts',
  events: ['trace.failure'],
  filters: {
    'ragEvaluation.groundedness': { lt: 0.7 }
  }
});

Track quality trends

// Get quality trends over time
const stats = await api.get('/api/v1/evaluation/trends', {
  params: {
    agentId: 'agent_123',
    metric: 'groundedness',
    period: '30d'
  }
});

Best Practices

Quality Benchmarks by Use Case

Use Case            Groundedness    Faithfulness
Customer Support    0.85+           0.90+
Technical Docs      0.90+           0.95+
Legal/Compliance    0.95+           0.98+
  1. Include source documents in your traces for accurate evaluation
  2. Set appropriate thresholds based on your use case criticality
  3. Monitor trends, not just individual traces
  4. Investigate clusters of low-quality responses
  5. Test fixes with shadow testing before deploying
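The benchmark table above can be turned into a runtime gate. The use-case keys and `meetsBenchmark` helper are hypothetical; only the threshold values come from the table:

```typescript
// Hypothetical mapping of the benchmark table to runtime gates.
// Threshold values are taken from the table above; names are illustrative.
const BENCHMARKS = {
  customerSupport: { groundedness: 0.85, faithfulness: 0.9 },
  technicalDocs: { groundedness: 0.9, faithfulness: 0.95 },
  legalCompliance: { groundedness: 0.95, faithfulness: 0.98 },
} as const;

function meetsBenchmark(
  useCase: keyof typeof BENCHMARKS,
  metrics: { groundedness: number; faithfulness: number }
): boolean {
  const b = BENCHMARKS[useCase];
  return (
    metrics.groundedness >= b.groundedness &&
    metrics.faithfulness >= b.faithfulness
  );
}
```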

Next Steps