# ThinkEval — Evaluation Framework
ThinkEval is ThinkHive’s built-in evaluation engine for systematically measuring and improving AI agent quality.
## Overview
ThinkEval lets you define evaluation criteria, run them against your agent’s traces, and track quality over time. It supports deterministic graders, LLM-based judges, and human review workflows.
## Creating an Evaluation Suite
An evaluation suite groups related criteria that measure a specific quality dimension of your agent.
### Define your criteria
Choose what aspects of quality to measure. ThinkEval supports several criterion types:
| Criterion Type | Description | Example |
|---|---|---|
| Deterministic | Rule-based checks with exact matching | Response length, format validation |
| LLM Judge | AI-powered quality assessment | Helpfulness, accuracy, tone |
| Jury | Multiple LLM judges for consensus | High-stakes evaluations |
| Composite | Weighted combination of criteria | Overall quality score |
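To make the Composite type concrete, here is a sketch of a weighted combination: each criterion score is normalized to its scale, then averaged by weight. The `compositeScore` helper and its field names are illustrative, not part of the ThinkEval API.

```typescript
// Illustrative sketch of a "composite" criterion: normalize each
// criterion score to [0, 1] using its scale, then take a weighted
// average. Hypothetical helper, not ThinkEval API surface.
interface CriterionResult {
  score: number;  // raw score from the grader
  min: number;    // scale minimum (e.g. 1 for a 1-5 judge)
  max: number;    // scale maximum
  weight: number; // relative weight in the composite
}

function compositeScore(results: CriterionResult[]): number {
  let weighted = 0;
  let totalWeight = 0;
  for (const r of results) {
    const normalized = (r.score - r.min) / (r.max - r.min); // map to [0, 1]
    weighted += normalized * r.weight;
    totalWeight += r.weight;
  }
  return totalWeight > 0 ? weighted / totalWeight : 0;
}

// A 1-5 relevance judge scoring 4, plus a passing 0/1 format check:
const overall = compositeScore([
  { score: 4, min: 1, max: 5, weight: 0.7 },
  { score: 1, min: 0, max: 1, weight: 0.3 },
]);
```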
### Create the suite via API
```bash
curl -X POST "https://app.thinkhive.ai/api/v1/evaluation/suites" \
  -H "Authorization: Bearer thk_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Customer Support Quality",
    "description": "Evaluate support agent responses",
    "criteria": [
      {
        "name": "response_relevance",
        "type": "llm_judge",
        "prompt": "Rate how relevant the response is to the customer question on a scale of 1-5.",
        "scale": { "min": 1, "max": 5 }
      },
      {
        "name": "format_check",
        "type": "deterministic",
        "rule": "response_length_between",
        "params": { "min": 50, "max": 2000 }
      }
    ]
  }'
```

### Run evaluations
Execute the suite against a set of traces:
```bash
curl -X POST "https://app.thinkhive.ai/api/v1/evaluation/run" \
  -H "Authorization: Bearer thk_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "suiteId": "suite_abc123",
    "traceIds": ["trace_1", "trace_2", "trace_3"],
    "options": {
      "parallel": true,
      "timeout": 30000
    }
  }'
```

### Review results
```bash
curl "https://app.thinkhive.ai/api/v1/evaluation/results?suiteId=suite_abc123" \
  -H "Authorization: Bearer thk_your_api_key"
```

## Criterion Types
### Deterministic Graders
Deterministic graders apply rule-based checks that produce consistent, reproducible results.
```json
{
  "name": "json_validity",
  "type": "deterministic",
  "rule": "json_valid",
  "description": "Check if the response contains valid JSON"
}
```

Available deterministic rules:
| Rule | Description | Parameters |
|---|---|---|
| `response_length_between` | Check response length | `min`, `max` |
| `contains_keywords` | Required keywords present | `keywords[]` |
| `json_valid` | Valid JSON in response | — |
| `regex_match` | Regex pattern matching | `pattern` |
| `no_pii` | No PII detected | — |
| `response_time_under` | Latency threshold | `maxMs` |
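To illustrate the semantics of a few rules in the table above, here are small local sketches. These are illustrative re-implementations of the documented behavior, not ThinkEval's actual grader code:

```typescript
// response_length_between: check that the response length falls
// within [min, max] characters.
function responseLengthBetween(text: string, min: number, max: number): boolean {
  return text.length >= min && text.length <= max;
}

// json_valid: check that the response parses as JSON.
function jsonValid(text: string): boolean {
  try {
    JSON.parse(text);
    return true;
  } catch {
    return false;
  }
}

// regex_match: check that the response matches the given pattern.
function regexMatch(text: string, pattern: string): boolean {
  return new RegExp(pattern).test(text);
}
```

Because these checks are pure functions of the trace, they produce the same result on every run, which is what makes deterministic graders suitable for hard pass/fail gates.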
### LLM Judge
LLM judges use a language model to assess quality based on a custom prompt.
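Conceptually, the judge's reply must be parsed into a score that falls within the configured scale. A hypothetical validation step (illustrative, not ThinkEval internals) might look like:

```typescript
interface Scale {
  min: number;
  max: number;
}

// Parse a raw judge reply (e.g. "4" or "Score: 4") into a number,
// rejecting non-numeric or out-of-range output. Hypothetical helper,
// not part of the ThinkEval SDK.
function parseJudgeScore(reply: string, scale: Scale): number | null {
  const match = reply.match(/-?\d+(\.\d+)?/); // first number in the reply
  if (!match) return null;
  const score = Number(match[0]);
  return score >= scale.min && score <= scale.max ? score : null;
}
```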
```json
{
  "name": "helpfulness",
  "type": "llm_judge",
  "prompt": "Evaluate how helpful this response is for the user's question. Consider completeness, clarity, and actionability.",
  "scale": { "min": 1, "max": 5 },
  "model": "gpt-4o"
}
```

LLM judge evaluations consume credits. See Billing & Credits for details.
### Jury Mode
Jury mode runs multiple LLM judges and aggregates their scores for higher reliability.
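Conceptually, `weighted_average` aggregation multiplies each judge's score by its weight and divides by the total weight. A sketch (the helper is illustrative; the models and weights mirror the config below):

```typescript
// Illustrative sketch of "weighted_average" jury aggregation.
interface JudgeVote {
  model: string;
  score: number;
  weight: number;
}

function weightedAverage(votes: JudgeVote[]): number {
  const totalWeight = votes.reduce((sum, v) => sum + v.weight, 0);
  const weightedSum = votes.reduce((sum, v) => sum + v.score * v.weight, 0);
  return weightedSum / totalWeight;
}

// Three judges disagreeing slightly on a 1-5 accuracy scale:
const juryScore = weightedAverage([
  { model: "gpt-4o", score: 4, weight: 0.4 },
  { model: "claude-3-5-sonnet", score: 5, weight: 0.4 },
  { model: "gpt-4o-mini", score: 4, weight: 0.2 },
]);
```

Averaging across several models smooths out single-judge noise, which is why jury mode trades extra credit cost for reliability.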
```json
{
  "name": "accuracy_jury",
  "type": "jury",
  "judges": [
    { "model": "gpt-4o", "weight": 0.4 },
    { "model": "claude-3-5-sonnet", "weight": 0.4 },
    { "model": "gpt-4o-mini", "weight": 0.2 }
  ],
  "aggregation": "weighted_average",
  "prompt": "Rate the factual accuracy of this response."
}
```

## Using ThinkEval in the Dashboard
The ThinkEval wizard in the dashboard provides a guided setup:
1. Navigate to Evaluation in the sidebar
2. Create Suite — name it and add criteria
3. Select Traces — choose traces to evaluate (manually or by filter)
4. Run — execute the evaluation
5. Review — inspect per-trace scores, distributions, and trends
## SDK Integration
```typescript
import { ThinkHive } from 'thinkhive-js';

const th = new ThinkHive({
  apiKey: process.env.THINKHIVE_API_KEY,
  endpoint: 'https://app.thinkhive.ai',
  serviceName: 'my-agent'
});

// Run an evaluation suite
const results = await th.evaluate({
  suiteId: 'suite_abc123',
  traceIds: ['trace_1', 'trace_2'],
});

console.log(results.summary);
// { averageScore: 4.2, passRate: 0.85, criteriaBreakdown: {...} }
```

## Evaluation Results
Results include per-trace scores, aggregate statistics, and trend data.
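As a sketch of how a summary field like `passRate` relates to the per-trace results, consider this illustrative helper (not the API's implementation):

```typescript
// Illustrative derivation of passRate from per-trace results:
// the fraction of evaluated traces whose overall result passed.
interface TraceResult {
  traceId: string;
  passed: boolean;
}

function passRate(results: TraceResult[]): number {
  if (results.length === 0) return 0;
  return results.filter(r => r.passed).length / results.length;
}

const rate = passRate([
  { traceId: "trace_1", passed: true },
  { traceId: "trace_2", passed: true },
  { traceId: "trace_3", passed: false },
  { traceId: "trace_4", passed: true },
]);
```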
```json
{
  "suiteId": "suite_abc123",
  "runId": "eval_run_789",
  "summary": {
    "totalTraces": 100,
    "averageScore": 4.1,
    "passRate": 0.82,
    "criteriaBreakdown": {
      "response_relevance": { "average": 4.3, "passRate": 0.90 },
      "format_check": { "average": 0.95, "passRate": 0.95 }
    }
  },
  "results": [
    {
      "traceId": "trace_1",
      "scores": {
        "response_relevance": 4,
        "format_check": 1
      },
      "passed": true
    }
  ]
}
```

## Best Practices
- Start with deterministic graders for objective checks (format, length, PII), then add LLM judges for subjective quality
- Use jury mode for high-stakes evaluations where consistency matters
- Run evaluations regularly against production traces to track quality trends
- Combine with shadow testing — evaluate fix candidates before deploying them
- Set pass thresholds to create automated quality gates in your CI/CD pipeline
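The quality-gate idea in the last point can be sketched as a small check on a run's summary. The `0.8` threshold and the commented fetch step are assumptions for illustration, not ThinkEval features:

```typescript
// Illustrative CI quality gate: fail the pipeline when an evaluation
// run's pass rate drops below a chosen threshold.
interface RunSummary {
  averageScore: number;
  passRate: number;
}

function passesGate(summary: RunSummary, minPassRate: number): boolean {
  return summary.passRate >= minPassRate;
}

// In a CI script you might fetch the latest run's summary from the
// results endpoint shown above and exit non-zero on failure:
// const summary = await fetchLatestRunSummary("suite_abc123"); // hypothetical helper
// if (!passesGate(summary, 0.8)) process.exit(1);
```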
## Next Steps
- RAG Evaluation — Specialized metrics for RAG pipelines
- Hallucination Detection — Detect fabricated information
- Cases & Fixes — Act on evaluation failures
- API Reference: Evaluation & Grading — Full endpoint documentation