# ThinkEval — Evaluation Framework
ThinkEval is ThinkHive’s built-in evaluation engine for systematically measuring and improving AI agent quality.
## Overview
ThinkEval lets you define evaluation criteria, run them against your agent’s traces, and track quality over time. It supports deterministic graders, LLM-based judges, and human review workflows.
## Creating an Evaluation Suite
An evaluation suite groups related criteria that measure a specific quality dimension of your agent.
### Define your criteria
Choose what aspects of quality to measure. ThinkEval supports several criterion types:
| Criterion Type | Description | Example |
|---|---|---|
| Deterministic | Rule-based checks with exact matching | Response length, format validation |
| LLM Judge | AI-powered quality assessment | Helpfulness, accuracy, tone |
| Jury | Multiple LLM judges for consensus | High-stakes evaluations |
| Composite | Weighted combination of criteria | Overall quality score |
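To make the Composite type concrete, here is a sketch of a weighted combination: each criterion score is normalized to its scale, then averaged by weight. The `compositeScore` helper and its field names are illustrative, not part of the ThinkEval API.

```typescript
// Illustrative sketch of a "composite" criterion: normalize each
// criterion score to [0, 1] using its scale, then take a weighted
// average. Hypothetical helper, not ThinkEval API surface.
interface CriterionResult {
  score: number;  // raw score from the grader
  min: number;    // scale minimum (e.g. 1 for a 1-5 judge)
  max: number;    // scale maximum
  weight: number; // relative weight in the composite
}

function compositeScore(results: CriterionResult[]): number {
  let weighted = 0;
  let totalWeight = 0;
  for (const r of results) {
    const normalized = (r.score - r.min) / (r.max - r.min); // map to [0, 1]
    weighted += normalized * r.weight;
    totalWeight += r.weight;
  }
  return totalWeight > 0 ? weighted / totalWeight : 0;
}

// A 1-5 relevance judge scoring 4, plus a passing 0/1 format check:
const overall = compositeScore([
  { score: 4, min: 1, max: 5, weight: 0.7 },
  { score: 1, min: 0, max: 1, weight: 0.3 },
]);
```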
### Create the suite via API
```bash
curl -X POST "https://app.thinkhive.ai/api/v1/evaluation/suites" \
  -H "Authorization: Bearer thk_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Customer Support Quality",
    "description": "Evaluate support agent responses",
    "criteria": [
      {
        "name": "response_relevance",
        "type": "llm_judge",
        "prompt": "Rate how relevant the response is to the customer question on a scale of 1-5.",
        "scale": { "min": 1, "max": 5 }
      },
      {
        "name": "format_check",
        "type": "deterministic",
        "rule": "response_length_between",
        "params": { "min": 50, "max": 2000 }
      }
    ]
  }'
```

### Run evaluations
Execute the suite against a set of traces:
```bash
curl -X POST "https://app.thinkhive.ai/api/v1/evaluation/run" \
  -H "Authorization: Bearer thk_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "suiteId": "suite_abc123",
    "traceIds": ["trace_1", "trace_2", "trace_3"],
    "options": {
      "parallel": true,
      "timeout": 30000
    }
  }'
```

### Review results
```bash
curl "https://app.thinkhive.ai/api/v1/evaluation/results?suiteId=suite_abc123" \
  -H "Authorization: Bearer thk_your_api_key"
```

## Criterion Types
### Deterministic Graders
Deterministic graders apply rule-based checks that produce consistent, reproducible results.
```json
{
  "name": "json_validity",
  "type": "deterministic",
  "rule": "json_valid",
  "description": "Check if the response contains valid JSON"
}
```

Available deterministic rules:
| Rule | Description | Parameters |
|---|---|---|
| `response_length_between` | Check response length | `min`, `max` |
| `contains_keywords` | Required keywords present | `keywords[]` |
| `json_valid` | Valid JSON in response | — |
| `regex_match` | Regex pattern matching | `pattern` |
| `no_pii` | No PII detected | — |
| `response_time_under` | Latency threshold | `maxMs` |
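To illustrate the semantics of a few rules in the table above, here are small local sketches. These are illustrative re-implementations of the documented behavior, not ThinkEval's actual grader code:

```typescript
// response_length_between: check that the response length falls
// within [min, max] characters.
function responseLengthBetween(text: string, min: number, max: number): boolean {
  return text.length >= min && text.length <= max;
}

// json_valid: check that the response parses as JSON.
function jsonValid(text: string): boolean {
  try {
    JSON.parse(text);
    return true;
  } catch {
    return false;
  }
}

// regex_match: check that the response matches the given pattern.
function regexMatch(text: string, pattern: string): boolean {
  return new RegExp(pattern).test(text);
}
```

Because these checks are pure functions of the trace, they produce the same result on every run, which is what makes deterministic graders suitable for hard pass/fail gates.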
### LLM Judge
LLM judges use a language model to assess quality based on a custom prompt.
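Conceptually, the judge's reply must be parsed into a score that falls within the configured scale. A hypothetical validation step (illustrative, not ThinkEval internals) might look like:

```typescript
interface Scale {
  min: number;
  max: number;
}

// Parse a raw judge reply (e.g. "4" or "Score: 4") into a number,
// rejecting non-numeric or out-of-range output. Hypothetical helper,
// not part of the ThinkEval SDK.
function parseJudgeScore(reply: string, scale: Scale): number | null {
  const match = reply.match(/-?\d+(\.\d+)?/); // first number in the reply
  if (!match) return null;
  const score = Number(match[0]);
  return score >= scale.min && score <= scale.max ? score : null;
}
```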
```json
{
  "name": "helpfulness",
  "type": "llm_judge",
  "prompt": "Evaluate how helpful this response is for the user's question. Consider completeness, clarity, and actionability.",
  "scale": { "min": 1, "max": 5 },
  "model": "gpt-4o"
}
```

LLM judge evaluations consume credits. See Billing & Credits for details.
### Jury Mode
Jury mode runs multiple LLM judges and aggregates their scores for higher reliability.
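Conceptually, `weighted_average` aggregation multiplies each judge's score by its weight and divides by the total weight. A sketch (the helper is illustrative; the models and weights mirror the config below):

```typescript
// Illustrative sketch of "weighted_average" jury aggregation.
interface JudgeVote {
  model: string;
  score: number;
  weight: number;
}

function weightedAverage(votes: JudgeVote[]): number {
  const totalWeight = votes.reduce((sum, v) => sum + v.weight, 0);
  const weightedSum = votes.reduce((sum, v) => sum + v.score * v.weight, 0);
  return weightedSum / totalWeight;
}

// Three judges disagreeing slightly on a 1-5 accuracy scale:
const juryScore = weightedAverage([
  { model: "gpt-4o", score: 4, weight: 0.4 },
  { model: "claude-3-5-sonnet", score: 5, weight: 0.4 },
  { model: "gpt-4o-mini", score: 4, weight: 0.2 },
]);
```

Averaging across several models smooths out single-judge noise, which is why jury mode trades extra credit cost for reliability.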
```json
{
  "name": "accuracy_jury",
  "type": "jury",
  "judges": [
    { "model": "gpt-4o", "weight": 0.4 },
    { "model": "claude-3-5-sonnet", "weight": 0.4 },
    { "model": "gpt-4o-mini", "weight": 0.2 }
  ],
  "aggregation": "weighted_average",
  "prompt": "Rate the factual accuracy of this response."
}
```

## Using ThinkEval in the Dashboard
The ThinkEval wizard in the dashboard provides a guided setup:
1. Navigate to Evaluation in the sidebar
2. Create Suite — name it and add criteria
3. Select Traces — choose traces to evaluate (manually or by filter)
4. Run — execute the evaluation
5. Review — inspect per-trace scores, distributions, and trends
## SDK Integration
```typescript
import { ThinkHive } from 'thinkhive-js';

const th = new ThinkHive({
  apiKey: process.env.THINKHIVE_API_KEY,
  endpoint: 'https://app.thinkhive.ai',
  serviceName: 'my-agent'
});

// Run an evaluation suite
const results = await th.evaluate({
  suiteId: 'suite_abc123',
  traceIds: ['trace_1', 'trace_2'],
});

console.log(results.summary);
// { averageScore: 4.2, passRate: 0.85, criteriaBreakdown: {...} }
```

## Evaluation Results
Results include per-trace scores, aggregate statistics, and trend data.
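As a sketch of how a summary field like `passRate` relates to the per-trace results, consider this illustrative helper (not the API's implementation):

```typescript
// Illustrative derivation of passRate from per-trace results:
// the fraction of evaluated traces whose overall result passed.
interface TraceResult {
  traceId: string;
  passed: boolean;
}

function passRate(results: TraceResult[]): number {
  if (results.length === 0) return 0;
  return results.filter(r => r.passed).length / results.length;
}

const rate = passRate([
  { traceId: "trace_1", passed: true },
  { traceId: "trace_2", passed: true },
  { traceId: "trace_3", passed: false },
  { traceId: "trace_4", passed: true },
]);
```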
```json
{
  "suiteId": "suite_abc123",
  "runId": "eval_run_789",
  "summary": {
    "totalTraces": 100,
    "averageScore": 4.1,
    "passRate": 0.82,
    "criteriaBreakdown": {
      "response_relevance": { "average": 4.3, "passRate": 0.90 },
      "format_check": { "average": 0.95, "passRate": 0.95 }
    }
  },
  "results": [
    {
      "traceId": "trace_1",
      "scores": {
        "response_relevance": 4,
        "format_check": 1
      },
      "passed": true
    }
  ]
}
```

## Best Practices
- Start with deterministic graders for objective checks (format, length, PII), then add LLM judges for subjective quality
- Use jury mode for high-stakes evaluations where consistency matters
- Run evaluations regularly against production traces to track quality trends
- Combine with shadow testing — evaluate fix candidates before deploying them
- Set pass thresholds to create automated quality gates in your CI/CD pipeline
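The quality-gate idea in the last point can be sketched as a small check on a run's summary. The `0.8` threshold and the commented fetch step are assumptions for illustration, not ThinkEval features:

```typescript
// Illustrative CI quality gate: fail the pipeline when an evaluation
// run's pass rate drops below a chosen threshold.
interface RunSummary {
  averageScore: number;
  passRate: number;
}

function passesGate(summary: RunSummary, minPassRate: number): boolean {
  return summary.passRate >= minPassRate;
}

// In a CI script you might fetch the latest run's summary from the
// results endpoint shown above and exit non-zero on failure:
// const summary = await fetchLatestRunSummary("suite_abc123"); // hypothetical helper
// if (!passesGate(summary, 0.8)) process.exit(1);
```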
## Next Steps
- RAG Evaluation — Specialized metrics for RAG pipelines
- Hallucination Detection — Detect fabricated information
- Cases & Fixes — Act on evaluation failures
- API Reference: Evaluation & Grading — Full endpoint documentation