
ThinkEval — Evaluation Framework

ThinkEval is ThinkHive’s built-in evaluation engine for systematically measuring and improving AI agent quality.

Overview

ThinkEval lets you define evaluation criteria, run them against your agent’s traces, and track quality over time. It supports deterministic graders, LLM-based judges, and human review workflows.

Creating an Evaluation Suite

An evaluation suite groups related criteria that measure a specific quality dimension of your agent.

Define your criteria

Choose what aspects of quality to measure. ThinkEval supports several criterion types:

| Criterion Type | Description | Example |
| --- | --- | --- |
| Deterministic | Rule-based checks with exact matching | Response length, format validation |
| LLM Judge | AI-powered quality assessment | Helpfulness, accuracy, tone |
| Jury | Multiple LLM judges for consensus | High-stakes evaluations |
| Composite | Weighted combination of criteria | Overall quality score |

Create the suite via API

curl -X POST "https://app.thinkhive.ai/api/v1/evaluation/suites" \
  -H "Authorization: Bearer thk_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Customer Support Quality",
    "description": "Evaluate support agent responses",
    "criteria": [
      {
        "name": "response_relevance",
        "type": "llm_judge",
        "prompt": "Rate how relevant the response is to the customer question on a scale of 1-5.",
        "scale": { "min": 1, "max": 5 }
      },
      {
        "name": "format_check",
        "type": "deterministic",
        "rule": "response_length_between",
        "params": { "min": 50, "max": 2000 }
      }
    ]
  }'
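If you prefer to issue the same request from code, the body can be assembled with a small helper. This is an illustrative sketch that mirrors the curl example above; the `Criterion` type and `buildSuitePayload` helper are not part of an official SDK.

```typescript
// Illustrative types and helper for assembling the suite-creation body
// shown in the curl example. Not an official SDK interface.
type CriterionType = 'deterministic' | 'llm_judge' | 'jury' | 'composite';

interface Criterion {
  name: string;
  type: CriterionType;
  [field: string]: unknown; // rule/params/prompt/scale vary by criterion type
}

function buildSuitePayload(name: string, description: string, criteria: Criterion[]) {
  return { name, description, criteria };
}

const body = buildSuitePayload('Customer Support Quality', 'Evaluate support agent responses', [
  { name: 'response_relevance', type: 'llm_judge', scale: { min: 1, max: 5 } },
  { name: 'format_check', type: 'deterministic', rule: 'response_length_between', params: { min: 50, max: 2000 } },
]);
// POST `body` as JSON to /api/v1/evaluation/suites with your Bearer token.
```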

Run evaluations

Execute the suite against a set of traces:

curl -X POST "https://app.thinkhive.ai/api/v1/evaluation/run" \
  -H "Authorization: Bearer thk_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "suiteId": "suite_abc123",
    "traceIds": ["trace_1", "trace_2", "trace_3"],
    "options": {
      "parallel": true,
      "timeout": 30000
    }
  }'

Review results

curl "https://app.thinkhive.ai/api/v1/evaluation/results?suiteId=suite_abc123" \
  -H "Authorization: Bearer thk_your_api_key"

Criterion Types

Deterministic Graders

Deterministic graders apply rule-based checks that produce consistent, reproducible results.

{
  "name": "json_validity",
  "type": "deterministic",
  "rule": "json_valid",
  "description": "Check if the response contains valid JSON"
}

Available deterministic rules:

| Rule | Description | Parameters |
| --- | --- | --- |
| response_length_between | Check response length | min, max |
| contains_keywords | Required keywords present | keywords[] |
| json_valid | Valid JSON in response | — |
| regex_match | Regex pattern matching | pattern |
| no_pii | No PII detected | — |
| response_time_under | Latency threshold | maxMs |
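To illustrate the semantics of these rules, here are local re-implementations of two of them. These are sketches of the behavior described above (assuming a binary 0/1 score per check), not ThinkEval's actual grader code:

```typescript
// Sketch of two deterministic rules. Each returns 1 (pass) or 0 (fail).
function responseLengthBetween(response: string, params: { min: number; max: number }): number {
  return response.length >= params.min && response.length <= params.max ? 1 : 0;
}

function containsKeywords(response: string, params: { keywords: string[] }): number {
  const text = response.toLowerCase();
  // Pass only if every required keyword appears (case-insensitive).
  return params.keywords.every((k) => text.includes(k.toLowerCase())) ? 1 : 0;
}
```

Because these checks are pure functions of the trace, rerunning them on the same trace always yields the same score.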

LLM Judge

LLM judges use a language model to assess quality based on a custom prompt.

{
  "name": "helpfulness",
  "type": "llm_judge",
  "prompt": "Evaluate how helpful this response is for the user's question. Consider completeness, clarity, and actionability.",
  "scale": { "min": 1, "max": 5 },
  "model": "gpt-4o"
}
⚠️ LLM judge evaluations consume credits. See Billing & Credits for details.

Jury Mode

Jury mode runs multiple LLM judges and aggregates their scores for higher reliability.

{
  "name": "accuracy_jury",
  "type": "jury",
  "judges": [
    { "model": "gpt-4o", "weight": 0.4 },
    { "model": "claude-3-5-sonnet", "weight": 0.4 },
    { "model": "gpt-4o-mini", "weight": 0.2 }
  ],
  "aggregation": "weighted_average",
  "prompt": "Rate the factual accuracy of this response."
}
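With `weighted_average` aggregation, each judge's score is multiplied by its weight and the products are summed. A minimal sketch of that arithmetic (normalizing by total weight, an assumption that keeps the result well-defined if weights don't sum to exactly 1):

```typescript
interface JudgeScore {
  score: number;  // the judge's rating on the criterion's scale
  weight: number; // the judge's weight from the suite config
}

// Weighted-average aggregation: sum of score * weight over total weight.
function weightedAverage(judges: JudgeScore[]): number {
  const totalWeight = judges.reduce((sum, j) => sum + j.weight, 0);
  const weighted = judges.reduce((sum, j) => sum + j.score * j.weight, 0);
  return weighted / totalWeight;
}
```

For example, judges scoring 4, 5, and 3 with weights 0.4, 0.4, and 0.2 aggregate to 4.2.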

Using ThinkEval in the Dashboard

The ThinkEval wizard in the dashboard provides a guided setup:

  1. Navigate to Evaluation in the sidebar
  2. Create Suite — name it and add criteria
  3. Select Traces — choose traces to evaluate (manually or by filter)
  4. Run — execute the evaluation
  5. Review — inspect per-trace scores, distributions, and trends

SDK Integration

import { ThinkHive } from 'thinkhive-js';
 
const th = new ThinkHive({
  apiKey: process.env.THINKHIVE_API_KEY,
  endpoint: 'https://app.thinkhive.ai',
  serviceName: 'my-agent'
});
 
// Run an evaluation suite
const results = await th.evaluate({
  suiteId: 'suite_abc123',
  traceIds: ['trace_1', 'trace_2'],
});
 
console.log(results.summary);
// { averageScore: 4.2, passRate: 0.85, criteriaBreakdown: {...} }

Evaluation Results

Results include per-trace scores, aggregate statistics, and trend data.

{
  "suiteId": "suite_abc123",
  "runId": "eval_run_789",
  "summary": {
    "totalTraces": 100,
    "averageScore": 4.1,
    "passRate": 0.82,
    "criteriaBreakdown": {
      "response_relevance": { "average": 4.3, "passRate": 0.90 },
      "format_check": { "average": 1.0, "passRate": 0.95 }
    }
  },
  "results": [
    {
      "traceId": "trace_1",
      "scores": {
        "response_relevance": 4,
        "format_check": 1
      },
      "passed": true
    }
  ]
}
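The aggregate fields in `summary` can be derived from the per-trace entries. A sketch of that roll-up, simplified to a single overall score per trace (the real response also breaks scores down per criterion):

```typescript
interface TraceResult {
  traceId: string;
  score: number;   // simplified: one overall score per trace
  passed: boolean;
}

// Roll per-trace results up into aggregate statistics.
function summarize(results: TraceResult[]) {
  const totalTraces = results.length;
  const averageScore = results.reduce((sum, r) => sum + r.score, 0) / totalTraces;
  const passRate = results.filter((r) => r.passed).length / totalTraces;
  return { totalTraces, averageScore, passRate };
}
```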

Best Practices

  • Start with deterministic graders for objective checks (format, length, PII), then add LLM judges for subjective quality
  • Use jury mode for high-stakes evaluations where consistency matters
  • Run evaluations regularly against production traces to track quality trends
  • Combine with shadow testing — evaluate fix candidates before deploying them
  • Set pass thresholds to create automated quality gates in your CI/CD pipeline
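The last point can be sketched as a small gate script: read the run summary and fail the build when the pass rate drops below a threshold. The `MIN_PASS_RATE` value and the exit-code convention here are assumptions for illustration, not ThinkHive features:

```typescript
// Sketch of a CI quality gate: fail the build (non-zero exit) when the
// evaluation pass rate falls below a team-chosen threshold. The summary
// would come from the results endpoint shown earlier.
const MIN_PASS_RATE = 0.8; // assumed team-specific threshold

function qualityGate(summary: { passRate: number }, minPassRate = MIN_PASS_RATE): boolean {
  return summary.passRate >= minPassRate;
}

// In a CI step: process.exit(qualityGate(summary) ? 0 : 1);
```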

Next Steps