Evaluation & Grading API
ThinkHive provides a complete evaluation and grading pipeline for AI agents: create evaluation sets, run automated and deterministic graders, route edge cases to human reviewers, monitor evaluation health over time, detect nondeterministic behavior, and evaluate multi-turn conversations.
All endpoints require authentication via an Authorization: Bearer th_your_api_key header. See Authentication for details.
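For example, a minimal request helper in Python might look like the following sketch. The base URL and the THINKHIVE_API_KEY environment variable are illustrative placeholders, not documented values:

```python
import os

# Placeholder host; substitute your ThinkHive deployment's base URL.
BASE_URL = "https://api.thinkhive.example"

def build_request(path, api_key, params=None):
    """Assemble the URL, auth header, and query params for one API call."""
    return {
        "url": BASE_URL + path,
        "headers": {"Authorization": "Bearer " + api_key},
        "params": params or {},
    }

# List evaluation sets for one agent, 50 results per page.
req = build_request(
    "/api/evaluation/sets",
    api_key=os.environ.get("THINKHIVE_API_KEY", "th_your_api_key"),
    params={"agentId": "agent_abc123", "page": 1, "limit": 50},
)
```

Pass the resulting pieces to any HTTP client (requests, httpx, urllib) to issue the call.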
Evaluation Sets & Criteria
Manage golden datasets and run evaluations against your AI agents.
List Evaluation Sets
GET /api/evaluation/sets
Query Parameters:
| Parameter | Type | Description |
|---|---|---|
| agentId | string | Filter sets by agent |
| page | number | Page number (default: 1) |
| limit | number | Results per page (default: 20, max: 100) |
Response:
{
"success": true,
"data": [
{
"id": "eval_set_001",
"name": "Customer Support Golden Set",
"description": "50 curated examples for support agent evaluation",
"exampleCount": 50,
"agentId": "agent_abc123",
"criteria": [
{ "name": "accuracy", "weight": 0.4 },
{ "name": "groundedness", "weight": 0.3 },
{ "name": "completeness", "weight": 0.3 }
],
"createdAt": "2025-01-15T10:00:00Z",
"updatedAt": "2025-03-02T14:22:00Z"
}
],
"pagination": {
"page": 1,
"limit": 20,
"total": 3
}
}
Create Evaluation Set
POST /api/evaluation/sets
Request Body:
{
"name": "Product FAQ Evaluation",
"description": "Test cases for product knowledge questions",
"agentId": "agent_abc123",
"examples": [
{
"input": "What is the return policy?",
"expectedOutput": "Our return policy allows returns within 30 days of purchase with a valid receipt. Items must be in original condition.",
"context": "Returns documentation from help center",
"criteria": ["accuracy", "completeness", "tone"]
},
{
"input": "How do I cancel my subscription?",
"expectedOutput": "You can cancel your subscription from Settings > Billing > Cancel Plan. Your access continues until the end of the billing period.",
"context": "Billing FAQ article",
"criteria": ["accuracy", "helpfulness"]
}
],
"criteria": [
{ "name": "accuracy", "weight": 0.4, "description": "Factual correctness against source material" },
{ "name": "completeness", "weight": 0.3, "description": "Covers all relevant information" },
{ "name": "tone", "weight": 0.15, "description": "Professional and empathetic" },
{ "name": "helpfulness", "weight": 0.15, "description": "Actionable and clear" }
]
}
Response:
{
"success": true,
"data": {
"id": "eval_set_042",
"name": "Product FAQ Evaluation",
"exampleCount": 2,
"createdAt": "2025-03-10T09:00:00Z"
}
}
Run Evaluation
POST /api/evaluation/run
Request Body:
{
"evalSetId": "eval_set_042",
"agentId": "agent_abc123",
"config": {
"metrics": ["accuracy", "groundedness", "faithfulness"],
"threshold": 0.8,
"graderModel": "gpt-4o",
"concurrency": 5
}
}
Response:
{
"success": true,
"data": {
"runId": "eval_run_108",
"status": "running",
"progress": {
"completed": 0,
"total": 50
},
"estimatedCompletionTime": "2025-03-10T09:05:00Z"
}
}
Get Evaluation Results
GET /api/evaluation/runs/:runId
Response:
{
"success": true,
"data": {
"runId": "eval_run_108",
"evalSetId": "eval_set_042",
"agentId": "agent_abc123",
"status": "completed",
"startedAt": "2025-03-10T09:00:00Z",
"completedAt": "2025-03-10T09:04:32Z",
"summary": {
"passed": 42,
"failed": 8,
"passRate": 0.84,
"avgAccuracy": 0.87,
"avgGroundedness": 0.82,
"avgFaithfulness": 0.91
},
"results": [
{
"exampleId": "ex_001",
"input": "What is the return policy?",
"actualOutput": "Our return policy allows returns within 30 days...",
"passed": true,
"scores": {
"accuracy": 0.92,
"groundedness": 0.88,
"faithfulness": 0.95
},
"reasoning": "Response accurately covers the return window and receipt requirement."
},
{
"exampleId": "ex_002",
"input": "How do I cancel my subscription?",
"actualOutput": "Please contact support to cancel.",
"passed": false,
"scores": {
"accuracy": 0.45,
"groundedness": 0.30,
"faithfulness": 0.60
},
"reasoning": "Response is vague and does not include the self-service cancellation path from Settings > Billing."
}
]
}
}
Compare Runs
GET /api/evaluation/compare
Query Parameters:
| Parameter | Type | Description |
|---|---|---|
| runIds | string | Comma-separated run IDs to compare |
Response:
{
"success": true,
"data": {
"comparison": {
"runs": [
{
"id": "eval_run_107",
"passRate": 0.80,
"avgAccuracy": 0.82,
"avgGroundedness": 0.79,
"completedAt": "2025-03-08T15:00:00Z"
},
{
"id": "eval_run_108",
"passRate": 0.84,
"avgAccuracy": 0.87,
"avgGroundedness": 0.82,
"completedAt": "2025-03-10T09:04:32Z"
}
],
"improvement": 0.05,
"significantChanges": [
"Accuracy improved by 6% on FAQ questions",
"Groundedness improved by 4% across all categories",
"2 previously failing examples now pass"
],
"regressions": [
"Example ex_034 regressed from 0.91 to 0.72 on accuracy"
]
}
}
}
Evaluation Runs
Manage evaluation run lifecycle. Use these endpoints to list, create, retrieve, and update individual evaluation runs.
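Runs execute asynchronously, so clients typically poll GET /api/eval-runs/:id until a terminal status is reached. A hedged sketch of that loop, with the fetch function injected so no particular HTTP client is assumed:

```python
import time

def wait_for_run(get_run, run_id, poll_interval=0.0, max_polls=100):
    """Poll a run until it reaches a terminal status.

    get_run: callable returning the run's data dict (e.g. a thin wrapper
    around GET /api/eval-runs/:id); injected here to keep the sketch testable.
    """
    terminal = {"completed", "failed", "cancelled"}
    for _ in range(max_polls):
        run = get_run(run_id)
        if run["status"] in terminal:
            return run
        time.sleep(poll_interval)
    raise TimeoutError(f"run {run_id} did not finish after {max_polls} polls")

# Stubbed responses standing in for successive API calls.
responses = iter([
    {"status": "pending"},
    {"status": "running"},
    {"status": "completed", "passRate": 0.84},
])
result = wait_for_run(lambda run_id: next(responses), "eval_run_109")
```

In production, use a longer poll_interval (a few seconds) and respect 429 responses.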
List Runs
GET /api/eval-runs
Query Parameters:
| Parameter | Type | Description |
|---|---|---|
| agentId | string | Filter by agent ID |
| status | string | Filter by status: pending, running, completed, failed |
| evalSetId | string | Filter by evaluation set |
| page | number | Page number (default: 1) |
| limit | number | Results per page (default: 20) |
Response:
{
"success": true,
"data": [
{
"id": "eval_run_108",
"evalSetId": "eval_set_042",
"agentId": "agent_abc123",
"status": "completed",
"passRate": 0.84,
"totalExamples": 50,
"passed": 42,
"failed": 8,
"startedAt": "2025-03-10T09:00:00Z",
"completedAt": "2025-03-10T09:04:32Z"
},
{
"id": "eval_run_107",
"evalSetId": "eval_set_042",
"agentId": "agent_abc123",
"status": "completed",
"passRate": 0.80,
"totalExamples": 50,
"passed": 40,
"failed": 10,
"startedAt": "2025-03-08T14:30:00Z",
"completedAt": "2025-03-08T15:00:00Z"
}
],
"pagination": {
"page": 1,
"limit": 20,
"total": 12
}
}
Create Run
POST /api/eval-runs
Request Body:
{
"evalSetId": "eval_set_042",
"agentId": "agent_abc123",
"name": "Post-prompt-update regression check",
"config": {
"metrics": ["accuracy", "groundedness"],
"threshold": 0.8,
"graderModel": "gpt-4o",
"concurrency": 10,
"timeout": 30000
},
"tags": ["regression", "prompt-v2"]
}
Response:
{
"success": true,
"data": {
"id": "eval_run_109",
"status": "pending",
"createdAt": "2025-03-10T11:00:00Z"
}
}
Get Run
GET /api/eval-runs/:id
Response:
{
"success": true,
"data": {
"id": "eval_run_109",
"evalSetId": "eval_set_042",
"agentId": "agent_abc123",
"name": "Post-prompt-update regression check",
"status": "running",
"passRate": null,
"progress": {
"completed": 23,
"total": 50
},
"config": {
"metrics": ["accuracy", "groundedness"],
"threshold": 0.8,
"graderModel": "gpt-4o"
},
"tags": ["regression", "prompt-v2"],
"startedAt": "2025-03-10T11:00:05Z",
"completedAt": null
}
}
Update Run
PATCH /api/eval-runs/:id
Use this to cancel a running evaluation or update metadata.
Request Body:
{
"status": "cancelled",
"name": "Updated run name",
"tags": ["regression", "prompt-v2", "cancelled-early"]
}
Response:
{
"success": true,
"data": {
"id": "eval_run_109",
"status": "cancelled",
"name": "Updated run name",
"updatedAt": "2025-03-10T11:02:00Z"
}
}
Deterministic Graders
Apply rule-based grading to agent outputs without LLM calls. Deterministic graders are fast, reproducible, and cost-free. Use them for structural validation, compliance checks, and baseline quality gates.
Because no model call is involved, deterministic graders add negligible latency. Combine them with LLM-based evaluation for comprehensive coverage.
Rule Types
| Rule Type | Description | Example Use Case |
|---|---|---|
| length | Validate output length (min/max characters or tokens) | Ensure responses are concise |
| keywords | Check for required or prohibited keywords | Verify brand terms are included |
| json_valid | Validate output is well-formed JSON | Tool-calling agents |
| regex | Match output against a regular expression | Format validation (dates, IDs) |
| no_pii | Detect personally identifiable information | Compliance enforcement |
| response_time | Assert response latency is within bounds | SLA compliance |
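To illustrate the semantics, here is a rough client-side approximation of three of the rule types. This is a sketch, not the platform's implementation: length is measured in characters only, keyword matching is plain substring search, and unit handling, PII detection, and schema validation are omitted:

```python
import re

def grade(output, rules):
    """Apply a subset of the rule types above, client-side."""
    results = []
    for rule in rules:
        cfg = rule.get("config", {})
        if rule["type"] == "length":
            n = len(output)
            passed = cfg.get("min", 0) <= n <= cfg.get("max", float("inf"))
        elif rule["type"] == "keywords":
            required_ok = all(k in output for k in cfg.get("required", []))
            prohibited_ok = not any(k in output for k in cfg.get("prohibited", []))
            passed = required_ok and prohibited_ok
        elif rule["type"] == "regex":
            matched = re.search(cfg["pattern"], output) is not None
            passed = matched == cfg.get("shouldMatch", True)
        else:
            raise ValueError(f"unsupported rule type: {rule['type']}")
        results.append({"rule": rule["type"], "passed": passed})
    score = sum(r["passed"] for r in results) / len(results) if results else 1.0
    return {"passed": all(r["passed"] for r in results), "score": score, "results": results}

report = grade(
    "Your order ORD-2025-1234 has shipped.",
    [
        {"type": "length", "config": {"min": 10, "max": 100}},
        {"type": "keywords", "config": {"required": ["order"], "prohibited": ["I don't know"]}},
        {"type": "regex", "config": {"pattern": r"ORD-\d{4}-\d{4}", "shouldMatch": True}},
    ],
)
```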
Evaluate with Rules
POST /api/deterministic-graders/evaluate
Request Body:
{
"output": "Thank you for contacting Acme Corp support. Your order #ORD-2025-1234 has been shipped and will arrive by March 15, 2025. You can track it at https://tracking.acme.com/ORD-2025-1234.",
"rules": [
{
"type": "length",
"config": { "min": 50, "max": 500, "unit": "characters" }
},
{
"type": "keywords",
"config": {
"required": ["Acme Corp", "order"],
"prohibited": ["I don't know", "I'm not sure"]
}
},
{
"type": "regex",
"config": {
"pattern": "ORD-\\d{4}-\\d{4}",
"shouldMatch": true
}
},
{
"type": "no_pii",
"config": {
"categories": ["email", "phone", "ssn", "credit_card"]
}
}
],
"metadata": {
"traceId": "trace_abc123",
"agentId": "agent_support_v2"
}
}
Response:
{
"success": true,
"data": {
"passed": true,
"score": 1.0,
"results": [
{
"rule": "length",
"passed": true,
"detail": "Output length 189 characters is within range [50, 500]"
},
{
"rule": "keywords",
"passed": true,
"detail": "All required keywords found. No prohibited keywords detected."
},
{
"rule": "regex",
"passed": true,
"detail": "Pattern 'ORD-\\d{4}-\\d{4}' matched: ORD-2025-1234"
},
{
"rule": "no_pii",
"passed": true,
"detail": "No PII detected in output"
}
],
"metadata": {
"traceId": "trace_abc123",
"agentId": "agent_support_v2",
"evaluatedAt": "2025-03-10T12:00:00Z",
"durationMs": 3
}
}
}
Bulk Evaluate
Evaluate multiple outputs in a single request. Useful for batch processing historical traces or running deterministic checks as part of a CI pipeline.
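The summary fields in the bulk response can be reproduced from per-item results with a simple aggregation. This sketch assumes the same rounding to two decimals shown in the example payloads:

```python
def summarize(results):
    """Roll per-item grader results up into pass counts and averages."""
    passed = sum(1 for r in results if r["passed"])
    return {
        "totalItems": len(results),
        "passed": passed,
        "failed": len(results) - passed,
        "passRate": round(passed / len(results), 2),
        "avgScore": round(sum(r["score"] for r in results) / len(results), 2),
    }

summary = summarize([
    {"id": "trace_001", "passed": False, "score": 0.5},
    {"id": "trace_002", "passed": True, "score": 1.0},
    {"id": "trace_003", "passed": True, "score": 1.0},
])
```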
POST /api/deterministic-graders/bulk-evaluate
Request Body:
{
"items": [
{
"id": "trace_001",
"output": "Your account balance is $1,234.56. Contact us at support@acme.com for questions.",
"rules": [
{ "type": "length", "config": { "min": 20, "max": 300 } },
{ "type": "no_pii", "config": { "categories": ["email", "phone"] } }
]
},
{
"id": "trace_002",
"output": "{\"status\": \"approved\", \"amount\": 500, \"currency\": \"USD\"}",
"rules": [
{ "type": "json_valid", "config": { "schema": "approval_response" } },
{ "type": "length", "config": { "min": 10, "max": 1000 } }
]
},
{
"id": "trace_003",
"output": "Response generated in 245ms. The weather in NYC is 72F.",
"rules": [
{ "type": "response_time", "config": { "maxMs": 500 } },
{ "type": "keywords", "config": { "prohibited": ["error", "failed", "exception"] } }
]
}
]
}
Response:
{
"success": true,
"data": {
"totalItems": 3,
"passed": 2,
"failed": 1,
"results": [
{
"id": "trace_001",
"passed": false,
"score": 0.5,
"results": [
{ "rule": "length", "passed": true, "detail": "Output length 82 characters is within range [20, 300]" },
{ "rule": "no_pii", "passed": false, "detail": "PII detected: email address (support@acme.com)" }
]
},
{
"id": "trace_002",
"passed": true,
"score": 1.0,
"results": [
{ "rule": "json_valid", "passed": true, "detail": "Valid JSON matching schema 'approval_response'" },
{ "rule": "length", "passed": true, "detail": "Output length 58 characters is within range [10, 1000]" }
]
},
{
"id": "trace_003",
"passed": true,
"score": 1.0,
"results": [
{ "rule": "response_time", "passed": true, "detail": "Response time 245ms is within 500ms limit" },
{ "rule": "keywords", "passed": true, "detail": "No prohibited keywords detected" }
]
}
],
"summary": {
"passRate": 0.67,
"avgScore": 0.83,
"durationMs": 8
}
}
}
Human Review Queue
Route borderline or high-stakes evaluations to human reviewers. ThinkHive manages assignment, calibration, conflict resolution, and reviewer analytics.
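The flagReason values in the queue suggest threshold-based routing. Here is one plausible client-side pre-filter; the 0.80 safety threshold and 0.70-0.85 accuracy review band are taken from the example payloads, not documented defaults:

```python
def needs_human_review(scores, safety_threshold=0.80, review_band=(0.70, 0.85)):
    """Return a flag reason if the trace should be queued, else None."""
    safety = scores.get("safety")
    if safety is not None and safety < safety_threshold:
        return f"Safety score below threshold ({safety} < {safety_threshold})"
    accuracy = scores.get("accuracy")
    if accuracy is not None and review_band[0] <= accuracy < review_band[1]:
        return f"Accuracy score in review range ({review_band[0]}-{review_band[1]})"
    return None

# A low safety score trumps the accuracy band.
flag = needs_human_review({"accuracy": 0.72, "safety": 0.65})
```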
Get Review Queue
GET /api/human-review/queue
Query Parameters:
| Parameter | Type | Description |
|---|---|---|
| reviewerId | string | Filter by assigned reviewer |
| status | string | pending, in_progress, completed, skipped |
| priority | string | low, medium, high, critical |
| agentId | string | Filter by agent |
| page | number | Page number (default: 1) |
| limit | number | Results per page (default: 20) |
Response:
{
"success": true,
"data": {
"items": [
{
"id": "review_501",
"traceId": "trace_abc123",
"agentId": "agent_support_v2",
"priority": "high",
"status": "pending",
"input": "I want to delete all my data and close my account permanently.",
"output": "I can help you with that. I'll initiate the account deletion process. This will permanently remove all your data within 30 days as required by our data retention policy.",
"autoGraderScores": {
"accuracy": 0.72,
"safety": 0.65
},
"flagReason": "Safety score below threshold (0.65 < 0.80)",
"assignedTo": null,
"createdAt": "2025-03-10T08:30:00Z"
},
{
"id": "review_502",
"traceId": "trace_def456",
"agentId": "agent_billing_v1",
"priority": "medium",
"status": "in_progress",
"input": "Why was I charged twice this month?",
"output": "I see two charges on your account. The first is your regular subscription and the second appears to be a prorated charge from your plan upgrade on March 3rd.",
"autoGraderScores": {
"accuracy": 0.78,
"helpfulness": 0.80
},
"flagReason": "Accuracy score in review range (0.70-0.85)",
"assignedTo": "reviewer_jane",
"createdAt": "2025-03-10T07:15:00Z"
}
],
"pagination": {
"page": 1,
"limit": 20,
"total": 47
},
"queueStats": {
"pending": 23,
"inProgress": 12,
"completedToday": 34
}
}
}
Assign Reviewer
POST /api/human-review/assign
Request Body:
{
"reviewId": "review_501",
"reviewerId": "reviewer_jane",
"priority": "high",
"dueBy": "2025-03-10T17:00:00Z",
"notes": "Account deletion request - verify compliance language is accurate"
}
Response:
{
"success": true,
"data": {
"reviewId": "review_501",
"assignedTo": "reviewer_jane",
"status": "in_progress",
"dueBy": "2025-03-10T17:00:00Z",
"assignedAt": "2025-03-10T12:00:00Z"
}
}
Calibration Sets
Create calibration sets to measure and align inter-reviewer agreement. Calibration reviews are compared against a known-good answer key.
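The API does not specify how a reviewer's calibrationScore is derived; one plausible definition, sketched here as an assumption, is 1 minus the mean absolute deviation between the reviewer's scores and the answer key:

```python
def calibration_score(reviewer_scores, expected_scores):
    """Assumed metric: 1 - mean absolute deviation from the answer key."""
    diffs = [abs(reviewer_scores[k] - expected_scores[k]) for k in expected_scores]
    return round(1 - sum(diffs) / len(diffs), 2)

score = calibration_score(
    {"accuracy": 0.90, "helpfulness": 0.85, "tone": 0.95},
    {"accuracy": 0.95, "helpfulness": 0.90, "tone": 0.92},
)
```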
POST /api/human-review/calibration
Request Body:
{
"name": "Q1 2025 Support Calibration",
"reviewerIds": ["reviewer_jane", "reviewer_mike", "reviewer_sara"],
"examples": [
{
"input": "I need a refund for my last order",
"output": "I'd be happy to help with your refund. I've processed a full refund of $49.99 to your original payment method. It should appear within 3-5 business days.",
"expectedScores": {
"accuracy": 0.95,
"helpfulness": 0.90,
"tone": 0.92
},
"notes": "Ideal response - proactive, specific amount, clear timeline"
},
{
"input": "Your product is terrible and I want my money back",
"output": "I understand your frustration. Let me look into this for you. Could you share your order number so I can process your refund?",
"expectedScores": {
"accuracy": 0.80,
"helpfulness": 0.85,
"tone": 0.95
},
"notes": "Good de-escalation but should acknowledge specific complaint"
}
]
}
Response:
{
"success": true,
"data": {
"calibrationId": "cal_012",
"name": "Q1 2025 Support Calibration",
"reviewerCount": 3,
"exampleCount": 2,
"status": "pending",
"createdAt": "2025-03-10T10:00:00Z"
}
}
Skip / Reassign Review
POST /api/human-review/skip
Request Body:
{
"reviewId": "review_502",
"reviewerId": "reviewer_jane",
"reason": "conflict_of_interest",
"reassignTo": "reviewer_mike"
}
Response:
{
"success": true,
"data": {
"reviewId": "review_502",
"previousReviewer": "reviewer_jane",
"assignedTo": "reviewer_mike",
"skipReason": "conflict_of_interest",
"status": "in_progress"
}
}
Submit Review
POST /api/human-review/submit
Request Body:
{
"reviewId": "review_501",
"reviewerId": "reviewer_jane",
"verdict": "pass",
"scores": {
"accuracy": 0.90,
"safety": 0.85,
"helpfulness": 0.88,
"tone": 0.92
},
"feedback": "Response correctly describes the deletion process and mentions the 30-day retention window. Could improve by mentioning the user can download their data before deletion.",
"tags": ["account-deletion", "gdpr-related"],
"suggestedOutput": "I can help you with that. Before I initiate the deletion, would you like to download a copy of your data? Once confirmed, I'll permanently remove all your data within 30 days per our data retention policy."
}
Response:
{
"success": true,
"data": {
"reviewId": "review_501",
"status": "completed",
"verdict": "pass",
"reviewerId": "reviewer_jane",
"completedAt": "2025-03-10T14:30:00Z",
"reviewDurationMs": 142000
}
}
Review Stats
GET /api/human-review/stats
Query Parameters:
| Parameter | Type | Description |
|---|---|---|
| reviewerId | string | Stats for a specific reviewer |
| startDate | string | ISO 8601 start date |
| endDate | string | ISO 8601 end date |
| agentId | string | Filter by agent |
Response:
{
"success": true,
"data": {
"period": {
"start": "2025-03-01T00:00:00Z",
"end": "2025-03-10T23:59:59Z"
},
"overview": {
"totalReviews": 156,
"avgReviewTimeMs": 95000,
"passRate": 0.72,
"interReviewerAgreement": 0.88
},
"byReviewer": [
{
"reviewerId": "reviewer_jane",
"name": "Jane Smith",
"reviewsCompleted": 52,
"avgReviewTimeMs": 82000,
"agreementRate": 0.91,
"calibrationScore": 0.94
},
{
"reviewerId": "reviewer_mike",
"name": "Mike Johnson",
"reviewsCompleted": 48,
"avgReviewTimeMs": 105000,
"agreementRate": 0.86,
"calibrationScore": 0.89
}
],
"byVerdict": {
"pass": 112,
"fail": 31,
"borderline": 13
}
}
}
Eval Health Monitoring
Track the health and quality of your evaluation pipeline over time. Detect regressions, identify saturated metrics, and generate snapshots for executive reporting.
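The trend labels in the health report can be approximated as follows; the stability tolerance is an assumption, not a documented value:

```python
def classify_trend(current, previous, tolerance=0.015, lower_is_better=False):
    """Label a metric's movement; changes within the tolerance count as stable."""
    change = round(current - previous, 4)
    if abs(change) <= tolerance:
        trend = "stable"
    else:
        improved = change < 0 if lower_is_better else change > 0
        trend = "improving" if improved else "declining"
    return {"current": current, "previous": previous, "trend": trend, "change": change}

pass_rate = classify_trend(0.84, 0.80)
groundedness = classify_trend(0.82, 0.83)
# For latency, lower is better, and a tolerance in milliseconds makes more sense.
latency = classify_trend(1240, 1380, tolerance=50, lower_is_better=True)
```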
Health Report
GET /api/eval-health/report
Query Parameters:
| Parameter | Type | Description |
|---|---|---|
| agentId | string | Agent to report on (required) |
| period | string | 7d, 30d, 90d (default: 30d) |
Response:
{
"success": true,
"data": {
"agentId": "agent_support_v2",
"period": "30d",
"generatedAt": "2025-03-10T15:00:00Z",
"overallHealth": "good",
"healthScore": 0.87,
"metrics": {
"passRate": {
"current": 0.84,
"previous": 0.80,
"trend": "improving",
"change": 0.04
},
"avgAccuracy": {
"current": 0.87,
"previous": 0.85,
"trend": "improving",
"change": 0.02
},
"avgGroundedness": {
"current": 0.82,
"previous": 0.83,
"trend": "stable",
"change": -0.01
},
"avgResponseTime": {
"current": 1240,
"previous": 1380,
"trend": "improving",
"change": -140,
"unit": "ms"
}
},
"runsInPeriod": 12,
"totalExamplesEvaluated": 600,
"alerts": [
{
"type": "regression",
"severity": "warning",
"message": "Groundedness dropped 3% on billing-related questions in the last 7 days",
"affectedExamples": 8
}
]
}
}
Snapshots
Retrieve point-in-time snapshots of evaluation metrics for historical analysis and reporting.
GET /api/eval-health/snapshots
Query Parameters:
| Parameter | Type | Description |
|---|---|---|
| agentId | string | Agent ID (required) |
| startDate | string | ISO 8601 start date |
| endDate | string | ISO 8601 end date |
| granularity | string | daily, weekly, monthly (default: daily) |
Response:
{
"success": true,
"data": {
"agentId": "agent_support_v2",
"granularity": "weekly",
"snapshots": [
{
"date": "2025-02-24",
"passRate": 0.80,
"avgAccuracy": 0.85,
"avgGroundedness": 0.83,
"runsCount": 3,
"examplesEvaluated": 150
},
{
"date": "2025-03-03",
"passRate": 0.82,
"avgAccuracy": 0.86,
"avgGroundedness": 0.82,
"runsCount": 4,
"examplesEvaluated": 200
},
{
"date": "2025-03-10",
"passRate": 0.84,
"avgAccuracy": 0.87,
"avgGroundedness": 0.82,
"runsCount": 5,
"examplesEvaluated": 250
}
]
}
}
Regressions
Detect significant regressions in evaluation scores across runs.
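The detection rule can be sketched client-side: flag any metric/category pair whose score dropped by at least the threshold. The 0.10 cutoff separating warning from critical severity below is an assumption, not a documented value:

```python
def detect_regressions(previous, current, threshold=0.05, critical_drop=0.10):
    """Flag score drops of at least `threshold` between two runs."""
    regressions = []
    for key, prev_score in previous.items():
        drop = round(prev_score - current.get(key, prev_score), 2)
        if drop >= threshold:
            regressions.append({
                "metric": key,
                "previousScore": prev_score,
                "currentScore": current[key],
                "drop": drop,
                "severity": "critical" if drop >= critical_drop else "warning",
            })
    # Largest drops first, mirroring how the API orders its examples.
    return sorted(regressions, key=lambda r: r["drop"], reverse=True)

found = detect_regressions(
    {"groundedness/billing": 0.89, "accuracy/shipping": 0.91, "tone/all": 0.96},
    {"groundedness/billing": 0.76, "accuracy/shipping": 0.85, "tone/all": 0.95},
)
```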
GET /api/eval-health/regressions
Query Parameters:
| Parameter | Type | Description |
|---|---|---|
| agentId | string | Agent ID (required) |
| threshold | number | Minimum score drop to flag as regression (default: 0.05) |
| period | string | 7d, 30d, 90d (default: 30d) |
Response:
{
"success": true,
"data": {
"agentId": "agent_support_v2",
"regressionsDetected": 2,
"regressions": [
{
"id": "reg_001",
"metric": "groundedness",
"category": "billing",
"previousScore": 0.89,
"currentScore": 0.76,
"drop": 0.13,
"severity": "critical",
"firstDetected": "2025-03-08T12:00:00Z",
"affectedExamples": [
{ "id": "ex_021", "input": "Why was I charged twice?", "scoreDrop": 0.18 },
{ "id": "ex_034", "input": "Can I get a prorated refund?", "scoreDrop": 0.15 }
],
"possibleCause": "Prompt template updated on 2025-03-07, billing context section shortened"
},
{
"id": "reg_002",
"metric": "accuracy",
"category": "shipping",
"previousScore": 0.91,
"currentScore": 0.85,
"drop": 0.06,
"severity": "warning",
"firstDetected": "2025-03-09T09:00:00Z",
"affectedExamples": [
{ "id": "ex_045", "input": "What shipping options do you offer?", "scoreDrop": 0.08 }
],
"possibleCause": "Knowledge base shipping article last updated 2024-12-01"
}
]
}
}
Saturation Analysis
Identify metrics that have plateaued and may no longer differentiate agent quality.
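A minimal sketch of one way to test for saturation, assuming illustrative cutoffs for the score floor and variance ceiling (the API's actual criteria are not specified):

```python
def is_saturated(scores, score_floor=0.95, variance_ceiling=0.005):
    """A metric is treated as saturated when it sits near the ceiling
    with almost no spread; both cutoffs are illustrative assumptions."""
    mean = sum(scores) / len(scores)
    variance = sum((s - mean) ** 2 for s in scores) / len(scores)
    return mean >= score_floor and variance <= variance_ceiling

tone_history = [0.96, 0.95, 0.97, 0.96, 0.96]       # high and flat
accuracy_history = [0.87, 0.82, 0.91, 0.78, 0.89]   # room to improve
```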
GET /api/eval-health/saturation
Query Parameters:
| Parameter | Type | Description |
|---|---|---|
| agentId | string | Agent ID (required) |
| period | string | 30d, 90d, 180d (default: 90d) |
Response:
{
"success": true,
"data": {
"agentId": "agent_support_v2",
"period": "90d",
"analysis": [
{
"metric": "tone",
"currentScore": 0.96,
"variance": 0.002,
"saturated": true,
"recommendation": "Metric 'tone' has been consistently above 0.95 for 90 days with near-zero variance. Consider removing from active evaluation or raising the threshold."
},
{
"metric": "accuracy",
"currentScore": 0.87,
"variance": 0.015,
"saturated": false,
"recommendation": "Metric 'accuracy' shows healthy variance and room for improvement. Continue evaluating."
},
{
"metric": "groundedness",
"currentScore": 0.82,
"variance": 0.028,
"saturated": false,
"recommendation": "Metric 'groundedness' shows the highest variance. Prioritize improving retrieval quality."
}
]
}
}
Nondeterminism Detection
Detect and quantify inconsistent behavior in your AI agents. Run the same inputs multiple times and measure output variance to identify reliability issues.
Nondeterminism detection is essential for agents with low-temperature settings that are expected to produce consistent outputs. High variance on identical inputs signals prompt fragility or retrieval instability.
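Consistency can be approximated as the mean pairwise similarity across repeated outputs for one input. This sketch substitutes token-overlap (Jaccard) similarity for the semantic similarity the API computes with embeddings:

```python
from itertools import combinations

def jaccard(a, b):
    """Token-overlap similarity; a crude stand-in for semantic similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def consistency(outputs):
    """Mean pairwise similarity across repeated generations of one input."""
    pairs = list(combinations(outputs, 2))
    return round(sum(jaccard(a, b) for a, b in pairs) / len(pairs), 2)

identical = consistency(["Returns accepted within 30 days."] * 5)
contradictory = consistency([
    "Yes, students get 20% off.",
    "No student discount is offered.",
])
```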
Create Nondeterminism Run
POST /api/nondeterminism/runs
Request Body:
{
"agentId": "agent_support_v2",
"inputs": [
{ "id": "input_001", "text": "What is the return policy?" },
{ "id": "input_002", "text": "How do I upgrade my plan?" },
{ "id": "input_003", "text": "Is there a student discount?" }
],
"config": {
"repetitions": 5,
"temperature": 0.0,
"similarityMetric": "semantic",
"timeout": 30000
}
}
Response:
{
"success": true,
"data": {
"runId": "nondet_run_007",
"status": "running",
"totalInferences": 15,
"estimatedCompletionTime": "2025-03-10T15:10:00Z"
}
}
Get Nondeterminism Results
GET /api/nondeterminism/runs/:id
Response:
{
"success": true,
"data": {
"runId": "nondet_run_007",
"agentId": "agent_support_v2",
"status": "completed",
"completedAt": "2025-03-10T15:08:42Z",
"summary": {
"avgConsistency": 0.92,
"minConsistency": 0.78,
"maxConsistency": 0.99,
"highVarianceInputs": 1
},
"results": [
{
"inputId": "input_001",
"input": "What is the return policy?",
"consistency": 0.99,
"variance": "low",
"outputs": [
"Our return policy allows returns within 30 days of purchase with a valid receipt.",
"Our return policy allows returns within 30 days of purchase with a valid receipt.",
"Our return policy allows returns within 30 days with a valid receipt. Items must be unused.",
"Our return policy allows returns within 30 days of purchase with a valid receipt.",
"Our return policy allows returns within 30 days of purchase with a valid receipt."
],
"semanticSimilarityMatrix": [[1.0, 1.0, 0.97, 1.0, 1.0]]
},
{
"inputId": "input_002",
"input": "How do I upgrade my plan?",
"consistency": 0.98,
"variance": "low",
"outputs": [
"Go to Settings > Billing > Upgrade Plan to see available options.",
"Navigate to Settings, then Billing, and click Upgrade Plan.",
"You can upgrade from Settings > Billing > Upgrade Plan.",
"Go to Settings > Billing > Upgrade Plan to view options.",
"Head to Settings > Billing > Upgrade Plan to see your options."
]
},
{
"inputId": "input_003",
"input": "Is there a student discount?",
"consistency": 0.78,
"variance": "high",
"outputs": [
"Yes, we offer a 20% student discount. Verify with your .edu email.",
"We don't currently offer student discounts, but check our promotions page.",
"Yes! Students get 20% off with a valid student ID or .edu email.",
"We offer a 20% student discount. You'll need to verify your student status.",
"Currently we don't have a specific student discount program."
],
"alert": "Contradictory outputs detected: 3 responses confirm a discount, 2 deny it. Check knowledge base for conflicting information."
}
]
}
}
Consistency Report
Generate an aggregate consistency report across multiple nondeterminism runs.
GET /api/nondeterminism/consistency
Query Parameters:
| Parameter | Type | Description |
|---|---|---|
| agentId | string | Agent ID (required) |
| period | string | 7d, 30d, 90d (default: 30d) |
| minRepetitions | number | Minimum repetitions per input (default: 3) |
Response:
{
"success": true,
"data": {
"agentId": "agent_support_v2",
"period": "30d",
"overallConsistency": 0.91,
"totalInputsTested": 150,
"totalInferences": 750,
"byCategory": [
{
"category": "billing",
"consistency": 0.94,
"inputsTested": 45,
"highVarianceCount": 2
},
{
"category": "product",
"consistency": 0.92,
"inputsTested": 60,
"highVarianceCount": 4
},
{
"category": "policy",
"consistency": 0.85,
"inputsTested": 45,
"highVarianceCount": 8
}
],
"topVarianceInputs": [
{
"input": "Is there a student discount?",
"consistency": 0.78,
"contradictions": true,
"runId": "nondet_run_007"
},
{
"input": "Can I pause my subscription?",
"consistency": 0.81,
"contradictions": true,
"runId": "nondet_run_005"
}
],
"trend": {
"current": 0.91,
"previous": 0.88,
"direction": "improving"
}
}
}
Calculate pass@k
Estimate the probability that at least one of k sampled outputs passes evaluation, based on the pass@k metric from code-generation research.
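The standard unbiased estimator is pass@k = 1 - C(n-c, k) / C(n, k), where n outputs were sampled and c of them passed. A sketch of the computation (the numbers here are illustrative and are not meant to reproduce the example response):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k: 1 - C(n-c, k) / C(n, k).

    n: samples drawn per example; c: samples that passed; k: attempt budget.
    """
    if n - c < k:
        # Fewer failures than the attempt budget: a pass is guaranteed.
        return 1.0
    return 1 - comb(n - c, k) / comb(n, k)

p1 = pass_at_k(n=20, c=12, k=1)   # fraction of single attempts that pass
p3 = pass_at_k(n=20, c=18, k=3)   # only 2 failures, so 3 attempts always hit a pass
```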
POST /api/nondeterminism/pass-at-k
Request Body:
{
"agentId": "agent_support_v2",
"evalSetId": "eval_set_042",
"k": [1, 3, 5, 10],
"config": {
"n": 20,
"threshold": 0.8,
"metrics": ["accuracy", "groundedness"]
}
}
Response:
{
"success": true,
"data": {
"agentId": "agent_support_v2",
"evalSetId": "eval_set_042",
"n": 20,
"threshold": 0.8,
"results": {
"pass@1": 0.84,
"pass@3": 0.94,
"pass@5": 0.97,
"pass@10": 0.99
},
"byExample": [
{
"exampleId": "ex_001",
"passCount": 18,
"totalSamples": 20,
"pass@1": 0.90,
"pass@3": 0.999
},
{
"exampleId": "ex_002",
"passCount": 12,
"totalSamples": 20,
"pass@1": 0.60,
"pass@3": 0.88
}
],
"insight": "Agent passes 84% of examples on first try but 97% within 5 attempts, suggesting moderate nondeterminism that benefits from retry strategies."
}
}
Conversation Eval
Evaluate multi-turn conversations holistically. Unlike single-turn evaluation, conversation eval assesses coherence, context retention, goal completion, and turn-level quality across an entire dialogue.
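As a rough mental model, the overall score is a weighted combination of the criterion scores. Note this is a naive sketch: the grader may also factor in per-turn results, so it will not necessarily reproduce the API's overallScore exactly:

```python
def overall_score(scores, criteria):
    """Weighted average over enabled criteria; weights are renormalized in
    case disabled criteria leave them summing to less than 1."""
    enabled = {k: v["weight"] for k, v in criteria.items() if v.get("enabled", True)}
    total_weight = sum(enabled.values())
    return round(sum(scores[k] * w for k, w in enabled.items()) / total_weight, 2)

score = overall_score(
    {"coherence": 0.95, "contextRetention": 0.93, "goalCompletion": 0.98,
     "turnQuality": 0.88, "efficiency": 0.80},
    {"coherence": {"enabled": True, "weight": 0.2},
     "contextRetention": {"enabled": True, "weight": 0.25},
     "goalCompletion": {"enabled": True, "weight": 0.3},
     "turnQuality": {"enabled": True, "weight": 0.15},
     "efficiency": {"enabled": True, "weight": 0.1}},
)
```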
Get Conversation Traces
Retrieve multi-turn conversation traces suitable for evaluation.
GET /api/conversation-eval/traces
Query Parameters:
| Parameter | Type | Description |
|---|---|---|
| agentId | string | Filter by agent |
| minTurns | number | Minimum number of turns (default: 2) |
| maxTurns | number | Maximum number of turns |
| status | string | evaluated, pending, all (default: all) |
| startDate | string | ISO 8601 start date |
| endDate | string | ISO 8601 end date |
| page | number | Page number (default: 1) |
| limit | number | Results per page (default: 20) |
Response:
{
"success": true,
"data": {
"traces": [
{
"traceId": "conv_trace_001",
"agentId": "agent_support_v2",
"turnCount": 6,
"duration": 245000,
"status": "pending",
"goalCompleted": null,
"turns": [
{
"turnNumber": 1,
"role": "user",
"content": "I need help with my recent order",
"timestamp": "2025-03-10T14:00:00Z"
},
{
"turnNumber": 2,
"role": "assistant",
"content": "I'd be happy to help! Could you share your order number?",
"timestamp": "2025-03-10T14:00:02Z",
"latencyMs": 1840
},
{
"turnNumber": 3,
"role": "user",
"content": "It's ORD-2025-5678. The item arrived damaged.",
"timestamp": "2025-03-10T14:00:15Z"
},
{
"turnNumber": 4,
"role": "assistant",
"content": "I'm sorry to hear that, I've pulled up order ORD-2025-5678. I can see it was a wireless keyboard delivered on March 7th. I can offer you a full replacement or a refund. Which would you prefer?",
"timestamp": "2025-03-10T14:00:18Z",
"latencyMs": 2100
},
{
"turnNumber": 5,
"role": "user",
"content": "A replacement please",
"timestamp": "2025-03-10T14:00:30Z"
},
{
"turnNumber": 6,
"role": "assistant",
"content": "Done! I've initiated a replacement for your wireless keyboard. You'll receive a shipping confirmation within 24 hours. Is there anything else I can help with?",
"timestamp": "2025-03-10T14:00:33Z",
"latencyMs": 1950
}
],
"metadata": {
"sessionId": "sess_789",
"userSatisfaction": null
}
}
],
"pagination": {
"page": 1,
"limit": 20,
"total": 89
}
}
}
Evaluate Conversation
POST /api/conversation-eval/evaluate
Request Body:
{
"traceId": "conv_trace_001",
"criteria": {
"coherence": {
"enabled": true,
"weight": 0.2,
"description": "Logical flow and consistency across turns"
},
"contextRetention": {
"enabled": true,
"weight": 0.25,
"description": "Agent correctly references earlier context"
},
"goalCompletion": {
"enabled": true,
"weight": 0.3,
"description": "User's goal was identified and resolved"
},
"turnQuality": {
"enabled": true,
"weight": 0.15,
"description": "Individual turn accuracy and helpfulness"
},
"efficiency": {
"enabled": true,
"weight": 0.1,
"description": "Minimal unnecessary turns to reach resolution"
}
},
"config": {
"graderModel": "gpt-4o",
"includePerTurnScores": true
}
}
Response:
{
"success": true,
"data": {
"evaluationId": "conv_eval_023",
"traceId": "conv_trace_001",
"status": "completed",
"overallScore": 0.91,
"passed": true,
"scores": {
"coherence": 0.95,
"contextRetention": 0.93,
"goalCompletion": 0.98,
"turnQuality": 0.88,
"efficiency": 0.80
},
"perTurnScores": [
{
"turnNumber": 2,
"role": "assistant",
"scores": { "relevance": 0.90, "helpfulness": 0.85 },
"feedback": "Good opening but could be more specific about what information is needed."
},
{
"turnNumber": 4,
"role": "assistant",
"scores": { "relevance": 0.95, "helpfulness": 0.92, "accuracy": 0.90 },
"feedback": "Excellent context recall - identified the product and delivery date. Offered clear resolution options."
},
{
"turnNumber": 6,
"role": "assistant",
"scores": { "relevance": 0.90, "helpfulness": 0.88 },
"feedback": "Clean resolution with follow-up offer. Could mention return instructions for the damaged item."
}
],
"summary": "Conversation handled well. Agent correctly identified the issue, retained context throughout, and resolved the customer's goal in 6 turns. Minor improvement: include return instructions for the damaged item.",
"evaluatedAt": "2025-03-10T15:30:00Z"
}
}
Get Results
Retrieve evaluation results for conversation traces.
GET /api/conversation-eval/results
Query Parameters:
| Parameter | Type | Description |
|---|---|---|
| agentId | string | Filter by agent |
| traceId | string | Filter by specific trace |
| minScore | number | Minimum overall score |
| maxScore | number | Maximum overall score |
| startDate | string | ISO 8601 start date |
| endDate | string | ISO 8601 end date |
| page | number | Page number (default: 1) |
| limit | number | Results per page (default: 20) |
Response:
{
"success": true,
"data": {
"results": [
{
"evaluationId": "conv_eval_023",
"traceId": "conv_trace_001",
"agentId": "agent_support_v2",
"overallScore": 0.91,
"passed": true,
"turnCount": 6,
"scores": {
"coherence": 0.95,
"contextRetention": 0.93,
"goalCompletion": 0.98,
"turnQuality": 0.88,
"efficiency": 0.80
},
"evaluatedAt": "2025-03-10T15:30:00Z"
},
{
"evaluationId": "conv_eval_022",
"traceId": "conv_trace_002",
"agentId": "agent_support_v2",
"overallScore": 0.64,
"passed": false,
"turnCount": 12,
"scores": {
"coherence": 0.70,
"contextRetention": 0.55,
"goalCompletion": 0.40,
"turnQuality": 0.75,
"efficiency": 0.50
},
"evaluatedAt": "2025-03-10T15:25:00Z"
}
],
"pagination": {
"page": 1,
"limit": 20,
"total": 89
},
"aggregate": {
"avgOverallScore": 0.82,
"passRate": 0.76,
"avgTurnCount": 5.4,
"topFailureReasons": [
{ "reason": "Goal not completed", "count": 8 },
{ "reason": "Context lost mid-conversation", "count": 5 },
{ "reason": "Excessive turns for simple request", "count": 3 }
]
}
}
}
Error Responses
All endpoints return consistent error responses:
{
"success": false,
"error": {
"code": "VALIDATION_ERROR",
"message": "Invalid evaluation set ID",
"details": {
"field": "evalSetId",
"reason": "Evaluation set 'eval_set_999' not found"
}
}
}
Common Error Codes:
| HTTP Status | Code | Description |
|---|---|---|
| 400 | VALIDATION_ERROR | Invalid request body or parameters |
| 401 | UNAUTHORIZED | Missing or invalid API key |
| 403 | FORBIDDEN | Insufficient permissions |
| 404 | NOT_FOUND | Resource not found |
| 409 | CONFLICT | Resource already exists or state conflict |
| 429 | RATE_LIMITED | Too many requests |
| 500 | INTERNAL_ERROR | Server error |
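A hedged sketch of client-side handling for this envelope; the ThinkHiveError class and retryable flag below are illustrative, not part of any official SDK:

```python
class ThinkHiveError(Exception):
    """Typed wrapper around the API's error envelope."""
    def __init__(self, status, code, message, details=None):
        super().__init__(f"{code} ({status}): {message}")
        self.status, self.code, self.details = status, code, details

# Statuses worth retrying with backoff; 4xx client errors are not.
RETRYABLE = {429, 500}

def raise_for_error(status, body):
    """Return the payload on success; raise a typed error otherwise."""
    if body.get("success"):
        return body["data"]
    err = body.get("error", {})
    exc = ThinkHiveError(status, err.get("code", "UNKNOWN"),
                         err.get("message", ""), err.get("details"))
    exc.retryable = status in RETRYABLE
    raise exc

try:
    raise_for_error(400, {
        "success": False,
        "error": {"code": "VALIDATION_ERROR", "message": "Invalid evaluation set ID",
                  "details": {"field": "evalSetId"}},
    })
except ThinkHiveError as e:
    caught = e
```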
Next Steps
- Traces API - Capture agent traces for evaluation
- Guardrails API - Real-time content scanning
- Explainability & Analysis - Understand agent decisions
- Evaluation Guide - End-to-end evaluation workflows
- Webhooks & Notifications - Get notified on evaluation results