
Evaluation & Grading API

ThinkHive provides a complete evaluation and grading pipeline for AI agents: create evaluation sets, run LLM-based and deterministic graders, route edge cases to human reviewers, monitor evaluation health over time, detect nondeterministic behavior, and evaluate multi-turn conversations.

All endpoints require authentication via an Authorization: Bearer th_your_api_key header. See Authentication for details.
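For example, a request attaches the bearer token like this (a sketch using Python's standard library; the base URL below is a placeholder assumption, substitute your actual API host):

```python
import urllib.request

API_KEY = "th_your_api_key"  # replace with your real key
BASE_URL = "https://api.thinkhive.example"  # placeholder host, not the real endpoint

# Every ThinkHive request carries the API key in the Authorization header.
req = urllib.request.Request(
    f"{BASE_URL}/api/evaluation/sets?page=1&limit=20",
    headers={"Authorization": f"Bearer {API_KEY}"},
)
print(req.get_header("Authorization"))  # Bearer th_your_api_key
```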


Evaluation Sets & Criteria

Manage golden datasets and run evaluations against your AI agents.

List Evaluation Sets

GET /api/evaluation/sets

Query Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| agentId | string | Filter sets by agent |
| page | number | Page number (default: 1) |
| limit | number | Results per page (default: 20, max: 100) |

Response:

{
  "success": true,
  "data": [
    {
      "id": "eval_set_001",
      "name": "Customer Support Golden Set",
      "description": "50 curated examples for support agent evaluation",
      "exampleCount": 50,
      "agentId": "agent_abc123",
      "criteria": [
        { "name": "accuracy", "weight": 0.4 },
        { "name": "groundedness", "weight": 0.3 },
        { "name": "completeness", "weight": 0.3 }
      ],
      "createdAt": "2025-01-15T10:00:00Z",
      "updatedAt": "2025-03-02T14:22:00Z"
    }
  ],
  "pagination": {
    "page": 1,
    "limit": 20,
    "total": 3
  }
}

Create Evaluation Set

POST /api/evaluation/sets

Request Body:

{
  "name": "Product FAQ Evaluation",
  "description": "Test cases for product knowledge questions",
  "agentId": "agent_abc123",
  "examples": [
    {
      "input": "What is the return policy?",
      "expectedOutput": "Our return policy allows returns within 30 days of purchase with a valid receipt. Items must be in original condition.",
      "context": "Returns documentation from help center",
      "criteria": ["accuracy", "completeness", "tone"]
    },
    {
      "input": "How do I cancel my subscription?",
      "expectedOutput": "You can cancel your subscription from Settings > Billing > Cancel Plan. Your access continues until the end of the billing period.",
      "context": "Billing FAQ article",
      "criteria": ["accuracy", "helpfulness"]
    }
  ],
  "criteria": [
    { "name": "accuracy", "weight": 0.4, "description": "Factual correctness against source material" },
    { "name": "completeness", "weight": 0.3, "description": "Covers all relevant information" },
    { "name": "tone", "weight": 0.15, "description": "Professional and empathetic" },
    { "name": "helpfulness", "weight": 0.15, "description": "Actionable and clear" }
  ]
}

Response:

{
  "success": true,
  "data": {
    "id": "eval_set_042",
    "name": "Product FAQ Evaluation",
    "exampleCount": 2,
    "createdAt": "2025-03-10T09:00:00Z"
  }
}
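Criterion weights sum to 1.0, which suggests the overall score combines per-criterion scores by weight. The exact server-side formula isn't documented; this sketch assumes a plain weighted sum:

```python
def overall_score(scores: dict[str, float], criteria: list[dict]) -> float:
    """Combine per-criterion scores using the weights defined on the evaluation set.

    Assumes a weighted sum; ThinkHive's actual aggregation formula is not documented.
    """
    return sum(c["weight"] * scores[c["name"]] for c in criteria)

criteria = [
    {"name": "accuracy", "weight": 0.4},
    {"name": "completeness", "weight": 0.3},
    {"name": "tone", "weight": 0.15},
    {"name": "helpfulness", "weight": 0.15},
]
scores = {"accuracy": 0.9, "completeness": 0.8, "tone": 1.0, "helpfulness": 0.8}
print(round(overall_score(scores, criteria), 3))  # 0.4*0.9 + 0.3*0.8 + 0.15*1.0 + 0.15*0.8 = 0.87
```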

Run Evaluation

POST /api/evaluation/run

Request Body:

{
  "evalSetId": "eval_set_042",
  "agentId": "agent_abc123",
  "config": {
    "metrics": ["accuracy", "groundedness", "faithfulness"],
    "threshold": 0.8,
    "graderModel": "gpt-4o",
    "concurrency": 5
  }
}

Response:

{
  "success": true,
  "data": {
    "runId": "eval_run_108",
    "status": "running",
    "progress": {
      "completed": 0,
      "total": 50
    },
    "estimatedCompletionTime": "2025-03-10T09:05:00Z"
  }
}
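Because runs complete asynchronously, clients typically poll the run endpoint until the status leaves pending/running. A minimal polling sketch, where fetch_status stands in for a hypothetical wrapper around GET /api/evaluation/runs/:runId:

```python
import time

def wait_for_run(fetch_status, poll_interval=2.0, max_polls=150):
    """Poll until an evaluation run leaves the 'pending'/'running' states.

    fetch_status is any callable returning the run's data dict, e.g. a thin
    client wrapper (hypothetical) around GET /api/evaluation/runs/:runId.
    """
    for _ in range(max_polls):
        data = fetch_status()
        if data["status"] not in ("pending", "running"):
            return data
        time.sleep(poll_interval)
    raise TimeoutError("evaluation run did not finish in time")

# Example with a stubbed status sequence instead of real HTTP calls:
statuses = iter([{"status": "running"}, {"status": "running"}, {"status": "completed"}])
result = wait_for_run(lambda: next(statuses), poll_interval=0.0)
print(result["status"])  # completed
```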

Get Evaluation Results

GET /api/evaluation/runs/:runId

Response:

{
  "success": true,
  "data": {
    "runId": "eval_run_108",
    "evalSetId": "eval_set_042",
    "agentId": "agent_abc123",
    "status": "completed",
    "startedAt": "2025-03-10T09:00:00Z",
    "completedAt": "2025-03-10T09:04:32Z",
    "summary": {
      "passed": 42,
      "failed": 8,
      "passRate": 0.84,
      "avgAccuracy": 0.87,
      "avgGroundedness": 0.82,
      "avgFaithfulness": 0.91
    },
    "results": [
      {
        "exampleId": "ex_001",
        "input": "What is the return policy?",
        "actualOutput": "Our return policy allows returns within 30 days...",
        "passed": true,
        "scores": {
          "accuracy": 0.92,
          "groundedness": 0.88,
          "faithfulness": 0.95
        },
        "reasoning": "Response accurately covers the return window and receipt requirement."
      },
      {
        "exampleId": "ex_002",
        "input": "How do I cancel my subscription?",
        "actualOutput": "Please contact support to cancel.",
        "passed": false,
        "scores": {
          "accuracy": 0.45,
          "groundedness": 0.30,
          "faithfulness": 0.60
        },
        "reasoning": "Response is vague and does not include the self-service cancellation path from Settings > Billing."
      }
    ]
  }
}

Compare Runs

GET /api/evaluation/compare

Query Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| runIds | string | Comma-separated run IDs to compare |

Response:

{
  "success": true,
  "data": {
    "comparison": {
      "runs": [
        {
          "id": "eval_run_107",
          "passRate": 0.80,
          "avgAccuracy": 0.82,
          "avgGroundedness": 0.79,
          "completedAt": "2025-03-08T15:00:00Z"
        },
        {
          "id": "eval_run_108",
          "passRate": 0.84,
          "avgAccuracy": 0.87,
          "avgGroundedness": 0.82,
          "completedAt": "2025-03-10T09:04:32Z"
        }
      ],
      "improvement": 0.05,
      "significantChanges": [
        "Accuracy improved by 6% on FAQ questions",
        "Groundedness improved by 4% across all categories",
        "2 previously failing examples now pass"
      ],
      "regressions": [
        "Example ex_034 regressed from 0.91 to 0.72 on accuracy"
      ]
    }
  }
}

Evaluation Runs

Manage evaluation run lifecycle. Use these endpoints to list, create, retrieve, and update individual evaluation runs.

List Runs

GET /api/eval-runs

Query Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| agentId | string | Filter by agent ID |
| status | string | Filter by status: pending, running, completed, failed |
| evalSetId | string | Filter by evaluation set |
| page | number | Page number (default: 1) |
| limit | number | Results per page (default: 20) |

Response:

{
  "success": true,
  "data": [
    {
      "id": "eval_run_108",
      "evalSetId": "eval_set_042",
      "agentId": "agent_abc123",
      "status": "completed",
      "passRate": 0.84,
      "totalExamples": 50,
      "passed": 42,
      "failed": 8,
      "startedAt": "2025-03-10T09:00:00Z",
      "completedAt": "2025-03-10T09:04:32Z"
    },
    {
      "id": "eval_run_107",
      "evalSetId": "eval_set_042",
      "agentId": "agent_abc123",
      "status": "completed",
      "passRate": 0.80,
      "totalExamples": 50,
      "passed": 40,
      "failed": 10,
      "startedAt": "2025-03-08T14:30:00Z",
      "completedAt": "2025-03-08T15:00:00Z"
    }
  ],
  "pagination": {
    "page": 1,
    "limit": 20,
    "total": 12
  }
}

Create Run

POST /api/eval-runs

Request Body:

{
  "evalSetId": "eval_set_042",
  "agentId": "agent_abc123",
  "name": "Post-prompt-update regression check",
  "config": {
    "metrics": ["accuracy", "groundedness"],
    "threshold": 0.8,
    "graderModel": "gpt-4o",
    "concurrency": 10,
    "timeout": 30000
  },
  "tags": ["regression", "prompt-v2"]
}

Response:

{
  "success": true,
  "data": {
    "id": "eval_run_109",
    "status": "pending",
    "createdAt": "2025-03-10T11:00:00Z"
  }
}

Get Run

GET /api/eval-runs/:id

Response:

{
  "success": true,
  "data": {
    "id": "eval_run_109",
    "evalSetId": "eval_set_042",
    "agentId": "agent_abc123",
    "name": "Post-prompt-update regression check",
    "status": "running",
    "passRate": null,
    "progress": {
      "completed": 23,
      "total": 50
    },
    "config": {
      "metrics": ["accuracy", "groundedness"],
      "threshold": 0.8,
      "graderModel": "gpt-4o"
    },
    "tags": ["regression", "prompt-v2"],
    "startedAt": "2025-03-10T11:00:05Z",
    "completedAt": null
  }
}

Update Run

PATCH /api/eval-runs/:id

Use this to cancel a running evaluation or update metadata.

Request Body:

{
  "status": "cancelled",
  "name": "Updated run name",
  "tags": ["regression", "prompt-v2", "cancelled-early"]
}

Response:

{
  "success": true,
  "data": {
    "id": "eval_run_109",
    "status": "cancelled",
    "name": "Updated run name",
    "updatedAt": "2025-03-10T11:02:00Z"
  }
}

Deterministic Graders

Apply rule-based grading to agent outputs without LLM calls. Deterministic graders are fast, reproducible, and cost-free. Use them for structural validation, compliance checks, and baseline quality gates.

Deterministic graders run locally with zero latency overhead. Combine them with LLM-based evaluation for comprehensive coverage.

Rule Types

| Rule Type | Description | Example Use Case |
| --- | --- | --- |
| length | Validate output length (min/max characters or tokens) | Ensure responses are concise |
| keywords | Check for required or prohibited keywords | Verify brand terms are included |
| json_valid | Validate output is well-formed JSON | Tool-calling agents |
| regex | Match output against a regular expression | Format validation (dates, IDs) |
| no_pii | Detect personally identifiable information | Compliance enforcement |
| response_time | Assert response latency is within bounds | SLA compliance |

Evaluate with Rules

POST /api/deterministic-graders/evaluate

Request Body:

{
  "output": "Thank you for contacting Acme Corp support. Your order #ORD-2025-1234 has been shipped and will arrive by March 15, 2025. You can track it at https://tracking.acme.com/ORD-2025-1234.",
  "rules": [
    {
      "type": "length",
      "config": { "min": 50, "max": 500, "unit": "characters" }
    },
    {
      "type": "keywords",
      "config": {
        "required": ["Acme Corp", "order"],
        "prohibited": ["I don't know", "I'm not sure"]
      }
    },
    {
      "type": "regex",
      "config": {
        "pattern": "ORD-\\d{4}-\\d{4}",
        "shouldMatch": true
      }
    },
    {
      "type": "no_pii",
      "config": {
        "categories": ["email", "phone", "ssn", "credit_card"]
      }
    }
  ],
  "metadata": {
    "traceId": "trace_abc123",
    "agentId": "agent_support_v2"
  }
}

Response:

{
  "success": true,
  "data": {
    "passed": true,
    "score": 1.0,
    "results": [
      {
        "rule": "length",
        "passed": true,
        "detail": "Output length 189 characters is within range [50, 500]"
      },
      {
        "rule": "keywords",
        "passed": true,
        "detail": "All required keywords found. No prohibited keywords detected."
      },
      {
        "rule": "regex",
        "passed": true,
        "detail": "Pattern 'ORD-\\d{4}-\\d{4}' matched: ORD-2025-1234"
      },
      {
        "rule": "no_pii",
        "passed": true,
        "detail": "No PII detected in output"
      }
    ],
    "metadata": {
      "traceId": "trace_abc123",
      "agentId": "agent_support_v2",
      "evaluatedAt": "2025-03-10T12:00:00Z",
      "durationMs": 3
    }
  }
}
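The length, keywords, and regex rules are simple enough to mirror client-side, for example as a pre-flight check before submitting. This local sketch follows the semantics described above; the server's implementation may differ in edge cases:

```python
import re

def run_rules(output: str, rules: list[dict]) -> list[dict]:
    """Local re-implementation of three deterministic rule types.

    Mirrors the documented semantics; details (e.g. token-based length,
    case sensitivity of keywords) are assumptions.
    """
    results = []
    for rule in rules:
        cfg = rule["config"]
        if rule["type"] == "length":
            n = len(output)
            passed = cfg.get("min", 0) <= n <= cfg.get("max", float("inf"))
        elif rule["type"] == "keywords":
            passed = all(k in output for k in cfg.get("required", [])) and \
                     not any(k in output for k in cfg.get("prohibited", []))
        elif rule["type"] == "regex":
            passed = bool(re.search(cfg["pattern"], output)) == cfg.get("shouldMatch", True)
        else:
            raise ValueError(f"unsupported rule type: {rule['type']}")
        results.append({"rule": rule["type"], "passed": passed})
    return results

output = "Thank you for contacting Acme Corp support. Your order #ORD-2025-1234 has shipped."
rules = [
    {"type": "length", "config": {"min": 50, "max": 500}},
    {"type": "keywords", "config": {"required": ["Acme Corp", "order"], "prohibited": ["I don't know"]}},
    {"type": "regex", "config": {"pattern": r"ORD-\d{4}-\d{4}", "shouldMatch": True}},
]
print(all(r["passed"] for r in run_rules(output, rules)))  # True
```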

Bulk Evaluate

Evaluate multiple outputs in a single request. Useful for batch processing historical traces or running deterministic checks as part of a CI pipeline.

POST /api/deterministic-graders/bulk-evaluate

Request Body:

{
  "items": [
    {
      "id": "trace_001",
      "output": "Your account balance is $1,234.56. Contact us at support@acme.com for questions.",
      "rules": [
        { "type": "length", "config": { "min": 20, "max": 300 } },
        { "type": "no_pii", "config": { "categories": ["email", "phone"] } }
      ]
    },
    {
      "id": "trace_002",
      "output": "{\"status\": \"approved\", \"amount\": 500, \"currency\": \"USD\"}",
      "rules": [
        { "type": "json_valid", "config": { "schema": "approval_response" } },
        { "type": "length", "config": { "min": 10, "max": 1000 } }
      ]
    },
    {
      "id": "trace_003",
      "output": "Response generated in 245ms. The weather in NYC is 72F.",
      "rules": [
        { "type": "response_time", "config": { "maxMs": 500 } },
        { "type": "keywords", "config": { "prohibited": ["error", "failed", "exception"] } }
      ]
    }
  ]
}

Response:

{
  "success": true,
  "data": {
    "totalItems": 3,
    "passed": 2,
    "failed": 1,
    "results": [
      {
        "id": "trace_001",
        "passed": false,
        "score": 0.5,
        "results": [
          { "rule": "length", "passed": true, "detail": "Output length 82 characters is within range [20, 300]" },
          { "rule": "no_pii", "passed": false, "detail": "PII detected: email address (support@acme.com)" }
        ]
      },
      {
        "id": "trace_002",
        "passed": true,
        "score": 1.0,
        "results": [
          { "rule": "json_valid", "passed": true, "detail": "Valid JSON matching schema 'approval_response'" },
          { "rule": "length", "passed": true, "detail": "Output length 58 characters is within range [10, 1000]" }
        ]
      },
      {
        "id": "trace_003",
        "passed": true,
        "score": 1.0,
        "results": [
          { "rule": "response_time", "passed": true, "detail": "Response time 245ms is within 500ms limit" },
          { "rule": "keywords", "passed": true, "detail": "No prohibited keywords detected" }
        ]
      }
    ],
    "summary": {
      "passRate": 0.67,
      "avgScore": 0.83,
      "durationMs": 8
    }
  }
}

Human Review Queue

Route borderline or high-stakes evaluations to human reviewers. ThinkHive manages assignment, calibration, conflict resolution, and reviewer analytics.

Get Review Queue

GET /api/human-review/queue

Query Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| reviewerId | string | Filter by assigned reviewer |
| status | string | pending, in_progress, completed, skipped |
| priority | string | low, medium, high, critical |
| agentId | string | Filter by agent |
| page | number | Page number (default: 1) |
| limit | number | Results per page (default: 20) |

Response:

{
  "success": true,
  "data": {
    "items": [
      {
        "id": "review_501",
        "traceId": "trace_abc123",
        "agentId": "agent_support_v2",
        "priority": "high",
        "status": "pending",
        "input": "I want to delete all my data and close my account permanently.",
        "output": "I can help you with that. I'll initiate the account deletion process. This will permanently remove all your data within 30 days as required by our data retention policy.",
        "autoGraderScores": {
          "accuracy": 0.72,
          "safety": 0.65
        },
        "flagReason": "Safety score below threshold (0.65 < 0.80)",
        "assignedTo": null,
        "createdAt": "2025-03-10T08:30:00Z"
      },
      {
        "id": "review_502",
        "traceId": "trace_def456",
        "agentId": "agent_billing_v1",
        "priority": "medium",
        "status": "in_progress",
        "input": "Why was I charged twice this month?",
        "output": "I see two charges on your account. The first is your regular subscription and the second appears to be a prorated charge from your plan upgrade on March 3rd.",
        "autoGraderScores": {
          "accuracy": 0.78,
          "helpfulness": 0.80
        },
        "flagReason": "Accuracy score in review range (0.70-0.85)",
        "assignedTo": "reviewer_jane",
        "createdAt": "2025-03-10T07:15:00Z"
      }
    ],
    "pagination": {
      "page": 1,
      "limit": 20,
      "total": 47
    },
    "queueStats": {
      "pending": 23,
      "inProgress": 12,
      "completedToday": 34
    }
  }
}
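The flagReason values above imply a threshold-based router: outputs scoring below a floor fail outright, those in a middle band are routed to humans, and the rest pass automatically. The thresholds in this sketch are inferred from the example payloads, not documented API behavior:

```python
def route(scores: dict[str, float], auto_pass=0.85, review_low=0.70) -> str:
    """Illustrative routing rule matching the flagReason values shown above.

    Thresholds are assumptions inferred from the example payloads
    (accuracy review range 0.70-0.85), not a documented API contract.
    """
    worst = min(scores.values())
    if worst >= auto_pass:
        return "auto_pass"
    if worst >= review_low:
        return "human_review"
    return "auto_fail"

print(route({"accuracy": 0.78, "helpfulness": 0.80}))  # human_review
print(route({"accuracy": 0.95, "safety": 0.91}))       # auto_pass
```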

Assign Reviewer

POST /api/human-review/assign

Request Body:

{
  "reviewId": "review_501",
  "reviewerId": "reviewer_jane",
  "priority": "high",
  "dueBy": "2025-03-10T17:00:00Z",
  "notes": "Account deletion request - verify compliance language is accurate"
}

Response:

{
  "success": true,
  "data": {
    "reviewId": "review_501",
    "assignedTo": "reviewer_jane",
    "status": "in_progress",
    "dueBy": "2025-03-10T17:00:00Z",
    "assignedAt": "2025-03-10T12:00:00Z"
  }
}

Calibration Sets

Create calibration sets to measure and align inter-reviewer agreement. Calibration reviews are compared against a known-good answer key.

POST /api/human-review/calibration

Request Body:

{
  "name": "Q1 2025 Support Calibration",
  "reviewerIds": ["reviewer_jane", "reviewer_mike", "reviewer_sara"],
  "examples": [
    {
      "input": "I need a refund for my last order",
      "output": "I'd be happy to help with your refund. I've processed a full refund of $49.99 to your original payment method. It should appear within 3-5 business days.",
      "expectedScores": {
        "accuracy": 0.95,
        "helpfulness": 0.90,
        "tone": 0.92
      },
      "notes": "Ideal response - proactive, specific amount, clear timeline"
    },
    {
      "input": "Your product is terrible and I want my money back",
      "output": "I understand your frustration. Let me look into this for you. Could you share your order number so I can process your refund?",
      "expectedScores": {
        "accuracy": 0.80,
        "helpfulness": 0.85,
        "tone": 0.95
      },
      "notes": "Good de-escalation but should acknowledge specific complaint"
    }
  ]
}

Response:

{
  "success": true,
  "data": {
    "calibrationId": "cal_012",
    "name": "Q1 2025 Support Calibration",
    "reviewerCount": 3,
    "exampleCount": 2,
    "status": "pending",
    "createdAt": "2025-03-10T10:00:00Z"
  }
}

Skip / Reassign Review

POST /api/human-review/skip

Request Body:

{
  "reviewId": "review_502",
  "reviewerId": "reviewer_jane",
  "reason": "conflict_of_interest",
  "reassignTo": "reviewer_mike"
}

Response:

{
  "success": true,
  "data": {
    "reviewId": "review_502",
    "previousReviewer": "reviewer_jane",
    "assignedTo": "reviewer_mike",
    "skipReason": "conflict_of_interest",
    "status": "in_progress"
  }
}

Submit Review

POST /api/human-review/submit

Request Body:

{
  "reviewId": "review_501",
  "reviewerId": "reviewer_jane",
  "verdict": "pass",
  "scores": {
    "accuracy": 0.90,
    "safety": 0.85,
    "helpfulness": 0.88,
    "tone": 0.92
  },
  "feedback": "Response correctly describes the deletion process and mentions the 30-day retention window. Could improve by mentioning the user can download their data before deletion.",
  "tags": ["account-deletion", "gdpr-related"],
  "suggestedOutput": "I can help you with that. Before I initiate the deletion, would you like to download a copy of your data? Once confirmed, I'll permanently remove all your data within 30 days per our data retention policy."
}

Response:

{
  "success": true,
  "data": {
    "reviewId": "review_501",
    "status": "completed",
    "verdict": "pass",
    "reviewerId": "reviewer_jane",
    "completedAt": "2025-03-10T14:30:00Z",
    "reviewDurationMs": 142000
  }
}

Review Stats

GET /api/human-review/stats

Query Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| reviewerId | string | Stats for a specific reviewer |
| startDate | string | ISO 8601 start date |
| endDate | string | ISO 8601 end date |
| agentId | string | Filter by agent |

Response:

{
  "success": true,
  "data": {
    "period": {
      "start": "2025-03-01T00:00:00Z",
      "end": "2025-03-10T23:59:59Z"
    },
    "overview": {
      "totalReviews": 156,
      "avgReviewTimeMs": 95000,
      "passRate": 0.72,
      "interReviewerAgreement": 0.88
    },
    "byReviewer": [
      {
        "reviewerId": "reviewer_jane",
        "name": "Jane Smith",
        "reviewsCompleted": 52,
        "avgReviewTimeMs": 82000,
        "agreementRate": 0.91,
        "calibrationScore": 0.94
      },
      {
        "reviewerId": "reviewer_mike",
        "name": "Mike Johnson",
        "reviewsCompleted": 48,
        "avgReviewTimeMs": 105000,
        "agreementRate": 0.86,
        "calibrationScore": 0.89
      }
    ],
    "byVerdict": {
      "pass": 112,
      "fail": 31,
      "borderline": 13
    }
  }
}
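The API does not document how calibrationScore is computed. One plausible definition, used in this sketch, is 1 minus the reviewer's mean absolute deviation from the calibration answer key:

```python
def calibration_score(reviewer: dict, answer_key: dict) -> float:
    """Assumed formula: 1 minus the mean absolute deviation from the
    calibration answer key. ThinkHive's actual formula is not documented."""
    diffs = [abs(reviewer[k] - answer_key[k]) for k in answer_key]
    return round(1 - sum(diffs) / len(diffs), 3)

answer_key = {"accuracy": 0.95, "helpfulness": 0.90, "tone": 0.92}
jane = {"accuracy": 0.90, "helpfulness": 0.92, "tone": 0.95}
print(calibration_score(jane, answer_key))  # 0.967
```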

Eval Health Monitoring

Track the health and quality of your evaluation pipeline over time. Detect regressions, identify saturated metrics, and generate snapshots for executive reporting.

Health Report

GET /api/eval-health/report

Query Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| agentId | string | Agent to report on (required) |
| period | string | 7d, 30d, 90d (default: 30d) |

Response:

{
  "success": true,
  "data": {
    "agentId": "agent_support_v2",
    "period": "30d",
    "generatedAt": "2025-03-10T15:00:00Z",
    "overallHealth": "good",
    "healthScore": 0.87,
    "metrics": {
      "passRate": {
        "current": 0.84,
        "previous": 0.80,
        "trend": "improving",
        "change": 0.04
      },
      "avgAccuracy": {
        "current": 0.87,
        "previous": 0.85,
        "trend": "improving",
        "change": 0.02
      },
      "avgGroundedness": {
        "current": 0.82,
        "previous": 0.83,
        "trend": "stable",
        "change": -0.01
      },
      "avgResponseTime": {
        "current": 1240,
        "previous": 1380,
        "trend": "improving",
        "change": -140,
        "unit": "ms"
      }
    },
    "runsInPeriod": 12,
    "totalExamplesEvaluated": 600,
    "alerts": [
      {
        "type": "regression",
        "severity": "warning",
        "message": "Groundedness dropped 3% on billing-related questions in the last 7 days",
        "affectedExamples": 8
      }
    ]
  }
}

Snapshots

Retrieve point-in-time snapshots of evaluation metrics for historical analysis and reporting.

GET /api/eval-health/snapshots

Query Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| agentId | string | Agent ID (required) |
| startDate | string | ISO 8601 start date |
| endDate | string | ISO 8601 end date |
| granularity | string | daily, weekly, monthly (default: daily) |

Response:

{
  "success": true,
  "data": {
    "agentId": "agent_support_v2",
    "granularity": "weekly",
    "snapshots": [
      {
        "date": "2025-02-24",
        "passRate": 0.80,
        "avgAccuracy": 0.85,
        "avgGroundedness": 0.83,
        "runsCount": 3,
        "examplesEvaluated": 150
      },
      {
        "date": "2025-03-03",
        "passRate": 0.82,
        "avgAccuracy": 0.86,
        "avgGroundedness": 0.82,
        "runsCount": 4,
        "examplesEvaluated": 200
      },
      {
        "date": "2025-03-10",
        "passRate": 0.84,
        "avgAccuracy": 0.87,
        "avgGroundedness": 0.82,
        "runsCount": 5,
        "examplesEvaluated": 250
      }
    ]
  }
}

Regressions

Detect significant regressions in evaluation scores across runs.

GET /api/eval-health/regressions

Query Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| agentId | string | Agent ID (required) |
| threshold | number | Minimum score drop to flag as regression (default: 0.05) |
| period | string | 7d, 30d, 90d (default: 30d) |

Response:

{
  "success": true,
  "data": {
    "agentId": "agent_support_v2",
    "regressionsDetected": 2,
    "regressions": [
      {
        "id": "reg_001",
        "metric": "groundedness",
        "category": "billing",
        "previousScore": 0.89,
        "currentScore": 0.76,
        "drop": 0.13,
        "severity": "critical",
        "firstDetected": "2025-03-08T12:00:00Z",
        "affectedExamples": [
          { "id": "ex_021", "input": "Why was I charged twice?", "scoreDrop": 0.18 },
          { "id": "ex_034", "input": "Can I get a prorated refund?", "scoreDrop": 0.15 }
        ],
        "possibleCause": "Prompt template updated on 2025-03-07, billing context section shortened"
      },
      {
        "id": "reg_002",
        "metric": "accuracy",
        "category": "shipping",
        "previousScore": 0.91,
        "currentScore": 0.85,
        "drop": 0.06,
        "severity": "warning",
        "firstDetected": "2025-03-09T09:00:00Z",
        "affectedExamples": [
          { "id": "ex_045", "input": "What shipping options do you offer?", "scoreDrop": 0.08 }
        ],
        "possibleCause": "Knowledge base shipping article last updated 2024-12-01"
      }
    ]
  }
}
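Regression detection reduces to comparing per-metric scores across runs and flagging drops at or above the threshold. The severity cutoff in this sketch (0.10 for critical) is an assumption consistent with the example above:

```python
def detect_regressions(previous: dict, current: dict, threshold=0.05):
    """Flag metrics whose score dropped by at least `threshold` between runs,
    mirroring the default threshold of /api/eval-health/regressions.
    The 0.10 critical cutoff is an assumption, not documented behavior."""
    flagged = []
    for metric, prev_score in previous.items():
        drop = prev_score - current.get(metric, prev_score)
        if drop >= threshold:
            severity = "critical" if drop >= 0.10 else "warning"
            flagged.append({"metric": metric, "drop": round(drop, 2), "severity": severity})
    return flagged

prev = {"groundedness": 0.89, "accuracy": 0.91, "tone": 0.96}
curr = {"groundedness": 0.76, "accuracy": 0.85, "tone": 0.96}
print(detect_regressions(prev, curr))
```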

Saturation Analysis

Identify metrics that have plateaued and may no longer differentiate agent quality.

GET /api/eval-health/saturation

Query Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| agentId | string | Agent ID (required) |
| period | string | 30d, 90d, 180d (default: 90d) |

Response:

{
  "success": true,
  "data": {
    "agentId": "agent_support_v2",
    "period": "90d",
    "analysis": [
      {
        "metric": "tone",
        "currentScore": 0.96,
        "variance": 0.002,
        "saturated": true,
        "recommendation": "Metric 'tone' has been consistently above 0.95 for 90 days with near-zero variance. Consider removing from active evaluation or raising the threshold."
      },
      {
        "metric": "accuracy",
        "currentScore": 0.87,
        "variance": 0.015,
        "saturated": false,
        "recommendation": "Metric 'accuracy' shows healthy variance and room for improvement. Continue evaluating."
      },
      {
        "metric": "groundedness",
        "currentScore": 0.82,
        "variance": 0.028,
        "saturated": false,
        "recommendation": "Metric 'groundedness' shows the highest variance. Prioritize improving retrieval quality."
      }
    ]
  }
}
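A saturation check like the one above can be approximated from a metric's recent score history: a consistently high mean plus near-zero variance. The specific thresholds here are illustrative assumptions:

```python
def is_saturated(scores: list[float], score_floor=0.95, variance_ceiling=0.005) -> bool:
    """Assumed heuristic: a metric is saturated when it stays above a high
    floor with near-zero variance. Thresholds are illustrative, not documented."""
    n = len(scores)
    mean = sum(scores) / n
    variance = sum((s - mean) ** 2 for s in scores) / n
    return mean > score_floor and variance < variance_ceiling

print(is_saturated([0.96, 0.97, 0.96, 0.95, 0.96]))  # True: plateaued near 0.96
print(is_saturated([0.87, 0.82, 0.90, 0.79, 0.85]))  # False: healthy variance
```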

Nondeterminism Detection

Detect and quantify inconsistent behavior in your AI agents. Run the same inputs multiple times and measure output variance to identify reliability issues.

Nondeterminism detection is essential for agents with low-temperature settings that are expected to produce consistent outputs. High variance on identical inputs signals prompt fragility or retrieval instability.

Create Nondeterminism Run

POST /api/nondeterminism/runs

Request Body:

{
  "agentId": "agent_support_v2",
  "inputs": [
    { "id": "input_001", "text": "What is the return policy?" },
    { "id": "input_002", "text": "How do I upgrade my plan?" },
    { "id": "input_003", "text": "Is there a student discount?" }
  ],
  "config": {
    "repetitions": 5,
    "temperature": 0.0,
    "similarityMetric": "semantic",
    "timeout": 30000
  }
}

Response:

{
  "success": true,
  "data": {
    "runId": "nondet_run_007",
    "status": "running",
    "totalInferences": 15,
    "estimatedCompletionTime": "2025-03-10T15:10:00Z"
  }
}

Get Nondeterminism Results

GET /api/nondeterminism/runs/:id

Response:

{
  "success": true,
  "data": {
    "runId": "nondet_run_007",
    "agentId": "agent_support_v2",
    "status": "completed",
    "completedAt": "2025-03-10T15:08:42Z",
    "summary": {
      "avgConsistency": 0.92,
      "minConsistency": 0.78,
      "maxConsistency": 0.99,
      "highVarianceInputs": 1
    },
    "results": [
      {
        "inputId": "input_001",
        "input": "What is the return policy?",
        "consistency": 0.99,
        "variance": "low",
        "outputs": [
          "Our return policy allows returns within 30 days of purchase with a valid receipt.",
          "Our return policy allows returns within 30 days of purchase with a valid receipt.",
          "Our return policy allows returns within 30 days with a valid receipt. Items must be unused.",
          "Our return policy allows returns within 30 days of purchase with a valid receipt.",
          "Our return policy allows returns within 30 days of purchase with a valid receipt."
        ],
        "semanticSimilarityMatrix": [[1.0, 1.0, 0.97, 1.0, 1.0]]
      },
      {
        "inputId": "input_002",
        "input": "How do I upgrade my plan?",
        "consistency": 0.98,
        "variance": "low",
        "outputs": [
          "Go to Settings > Billing > Upgrade Plan to see available options.",
          "Navigate to Settings, then Billing, and click Upgrade Plan.",
          "You can upgrade from Settings > Billing > Upgrade Plan.",
          "Go to Settings > Billing > Upgrade Plan to view options.",
          "Head to Settings > Billing > Upgrade Plan to see your options."
        ]
      },
      {
        "inputId": "input_003",
        "input": "Is there a student discount?",
        "consistency": 0.78,
        "variance": "high",
        "outputs": [
          "Yes, we offer a 20% student discount. Verify with your .edu email.",
          "We don't currently offer student discounts, but check our promotions page.",
          "Yes! Students get 20% off with a valid student ID or .edu email.",
          "We offer a 20% student discount. You'll need to verify your student status.",
          "Currently we don't have a specific student discount program."
        ],
        "alert": "Contradictory outputs detected: 3 responses confirm a discount, 2 deny it. Check knowledge base for conflicting information."
      }
    ]
  }
}
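Consistency is a function of pairwise similarity between an input's repeated outputs. The API uses semantic similarity; this sketch substitutes token-overlap (Jaccard) similarity as a crude, dependency-free stand-in:

```python
def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity, a rough stand-in for the semantic similarity
    the API computes (its embedding model is not documented)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def consistency(outputs: list[str]) -> float:
    """Mean pairwise similarity across repeated outputs for one input."""
    pairs = [(i, j) for i in range(len(outputs)) for j in range(i + 1, len(outputs))]
    return sum(jaccard(outputs[i], outputs[j]) for i, j in pairs) / len(pairs)

same = ["returns within 30 days", "returns within 30 days", "returns within 30 days"]
mixed = ["yes we offer a discount", "no discount is available", "yes we offer a discount"]
print(consistency(same))         # 1.0: identical outputs
print(consistency(mixed) < 0.5)  # True: contradictory outputs score low
```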

Consistency Report

Generate an aggregate consistency report across multiple nondeterminism runs.

GET /api/nondeterminism/consistency

Query Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| agentId | string | Agent ID (required) |
| period | string | 7d, 30d, 90d (default: 30d) |
| minRepetitions | number | Minimum repetitions per input (default: 3) |

Response:

{
  "success": true,
  "data": {
    "agentId": "agent_support_v2",
    "period": "30d",
    "overallConsistency": 0.91,
    "totalInputsTested": 150,
    "totalInferences": 750,
    "byCategory": [
      {
        "category": "billing",
        "consistency": 0.94,
        "inputsTested": 45,
        "highVarianceCount": 2
      },
      {
        "category": "product",
        "consistency": 0.92,
        "inputsTested": 60,
        "highVarianceCount": 4
      },
      {
        "category": "policy",
        "consistency": 0.85,
        "inputsTested": 45,
        "highVarianceCount": 8
      }
    ],
    "topVarianceInputs": [
      {
        "input": "Is there a student discount?",
        "consistency": 0.78,
        "contradictions": true,
        "runId": "nondet_run_007"
      },
      {
        "input": "Can I pause my subscription?",
        "consistency": 0.81,
        "contradictions": true,
        "runId": "nondet_run_005"
      }
    ],
    "trend": {
      "current": 0.91,
      "previous": 0.88,
      "direction": "improving"
    }
  }
}

Calculate pass@k

Estimate the probability that at least one of k sampled outputs passes evaluation. Based on the pass@k metric from code generation research.

POST /api/nondeterminism/pass-at-k

Request Body:

{
  "agentId": "agent_support_v2",
  "evalSetId": "eval_set_042",
  "k": [1, 3, 5, 10],
  "config": {
    "n": 20,
    "threshold": 0.8,
    "metrics": ["accuracy", "groundedness"]
  }
}

Response:

{
  "success": true,
  "data": {
    "agentId": "agent_support_v2",
    "evalSetId": "eval_set_042",
    "n": 20,
    "threshold": 0.8,
    "results": {
      "pass@1": 0.84,
      "pass@3": 0.94,
      "pass@5": 0.97,
      "pass@10": 0.99
    },
    "byExample": [
      {
        "exampleId": "ex_001",
        "passCount": 18,
        "totalSamples": 20,
        "pass@1": 0.90,
        "pass@3": 0.999
      },
      {
        "exampleId": "ex_002",
        "passCount": 12,
        "totalSamples": 20,
        "pass@1": 0.60,
        "pass@3": 0.88
      }
    ],
    "insight": "Agent passes 84% of examples on first try but 97% within 5 attempts, suggesting moderate nondeterminism that benefits from retry strategies."
  }
}
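The API's exact estimator isn't documented, but the standard unbiased pass@k estimator from code-generation research computes the probability that at least one of k samples, drawn from n total samples of which c passed, is a pass:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n = total samples per example, c = samples that passed, k = draws.
    Whether ThinkHive uses exactly this estimator is an assumption.
    """
    if n - c < k:
        return 1.0  # fewer failures than draws: at least one pass is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# ex_001 above: 18 of 20 samples passed
print(round(pass_at_k(20, 18, 1), 2))  # 0.9
# ex_002 above: 12 of 20 samples passed
print(round(pass_at_k(20, 12, 3), 2))
```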

Conversation Eval

Evaluate multi-turn conversations holistically. Unlike single-turn evaluation, conversation eval assesses coherence, context retention, goal completion, and turn-level quality across an entire dialogue.

Get Conversation Traces

Retrieve multi-turn conversation traces suitable for evaluation.

GET /api/conversation-eval/traces

Query Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| agentId | string | Filter by agent |
| minTurns | number | Minimum number of turns (default: 2) |
| maxTurns | number | Maximum number of turns |
| status | string | evaluated, pending, all (default: all) |
| startDate | string | ISO 8601 start date |
| endDate | string | ISO 8601 end date |
| page | number | Page number (default: 1) |
| limit | number | Results per page (default: 20) |

Response:

{
  "success": true,
  "data": {
    "traces": [
      {
        "traceId": "conv_trace_001",
        "agentId": "agent_support_v2",
        "turnCount": 6,
        "duration": 245000,
        "status": "pending",
        "goalCompleted": null,
        "turns": [
          {
            "turnNumber": 1,
            "role": "user",
            "content": "I need help with my recent order",
            "timestamp": "2025-03-10T14:00:00Z"
          },
          {
            "turnNumber": 2,
            "role": "assistant",
            "content": "I'd be happy to help! Could you share your order number?",
            "timestamp": "2025-03-10T14:00:02Z",
            "latencyMs": 1840
          },
          {
            "turnNumber": 3,
            "role": "user",
            "content": "It's ORD-2025-5678. The item arrived damaged.",
            "timestamp": "2025-03-10T14:00:15Z"
          },
          {
            "turnNumber": 4,
            "role": "assistant",
            "content": "I'm sorry to hear that, I've pulled up order ORD-2025-5678. I can see it was a wireless keyboard delivered on March 7th. I can offer you a full replacement or a refund. Which would you prefer?",
            "timestamp": "2025-03-10T14:00:18Z",
            "latencyMs": 2100
          },
          {
            "turnNumber": 5,
            "role": "user",
            "content": "A replacement please",
            "timestamp": "2025-03-10T14:00:30Z"
          },
          {
            "turnNumber": 6,
            "role": "assistant",
            "content": "Done! I've initiated a replacement for your wireless keyboard. You'll receive a shipping confirmation within 24 hours. Is there anything else I can help with?",
            "timestamp": "2025-03-10T14:00:33Z",
            "latencyMs": 1950
          }
        ],
        "metadata": {
          "sessionId": "sess_789",
          "userSatisfaction": null
        }
      }
    ],
    "pagination": {
      "page": 1,
      "limit": 20,
      "total": 89
    }
  }
}
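Trace payloads like the one above can be post-processed client-side before deciding which traces to evaluate, for example to summarize assistant response latencies. The helper below is a hypothetical sketch over the response shape shown.

```python
def assistant_latency_stats(trace: dict) -> dict:
    """Summarize latencyMs across assistant turns in a trace payload."""
    latencies = [t["latencyMs"] for t in trace["turns"]
                 if t.get("role") == "assistant" and "latencyMs" in t]
    return {
        "assistantTurns": len(latencies),
        "avgLatencyMs": sum(latencies) / len(latencies) if latencies else None,
        "maxLatencyMs": max(latencies, default=None),
    }

# Latencies from the sample trace above: 1840, 2100, 1950 ms.
trace = {"turns": [
    {"role": "user"},
    {"role": "assistant", "latencyMs": 1840},
    {"role": "user"},
    {"role": "assistant", "latencyMs": 2100},
    {"role": "user"},
    {"role": "assistant", "latencyMs": 1950},
]}
print(assistant_latency_stats(trace))
```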

Evaluate Conversation

POST /api/conversation-eval/evaluate

Request Body:

{
  "traceId": "conv_trace_001",
  "criteria": {
    "coherence": {
      "enabled": true,
      "weight": 0.2,
      "description": "Logical flow and consistency across turns"
    },
    "contextRetention": {
      "enabled": true,
      "weight": 0.25,
      "description": "Agent correctly references earlier context"
    },
    "goalCompletion": {
      "enabled": true,
      "weight": 0.3,
      "description": "User's goal was identified and resolved"
    },
    "turnQuality": {
      "enabled": true,
      "weight": 0.15,
      "description": "Individual turn accuracy and helpfulness"
    },
    "efficiency": {
      "enabled": true,
      "weight": 0.1,
      "description": "Minimal unnecessary turns to reach resolution"
    }
  },
  "config": {
    "graderModel": "gpt-4o",
    "includePerTurnScores": true
  }
}

Response:

{
  "success": true,
  "data": {
    "evaluationId": "conv_eval_023",
    "traceId": "conv_trace_001",
    "status": "completed",
    "overallScore": 0.91,
    "passed": true,
    "scores": {
      "coherence": 0.95,
      "contextRetention": 0.93,
      "goalCompletion": 0.98,
      "turnQuality": 0.88,
      "efficiency": 0.80
    },
    "perTurnScores": [
      {
        "turnNumber": 2,
        "role": "assistant",
        "scores": { "relevance": 0.90, "helpfulness": 0.85 },
        "feedback": "Good opening but could be more specific about what information is needed."
      },
      {
        "turnNumber": 4,
        "role": "assistant",
        "scores": { "relevance": 0.95, "helpfulness": 0.92, "accuracy": 0.90 },
        "feedback": "Excellent context recall - identified the product and delivery date. Offered clear resolution options."
      },
      {
        "turnNumber": 6,
        "role": "assistant",
        "scores": { "relevance": 0.90, "helpfulness": 0.88 },
        "feedback": "Clean resolution with follow-up offer. Could mention return instructions for the damaged item."
      }
    ],
    "summary": "Conversation handled well. Agent correctly identified the issue, retained context throughout, and resolved the customer's goal in 6 turns. Minor improvement: include return instructions for the damaged item.",
    "evaluatedAt": "2025-03-10T15:30:00Z"
  }
}
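Assuming `overallScore` is a weight-normalized average of the per-criterion scores (an assumption; the grader's exact aggregation is not documented here), it can be approximated from the weights sent in the request. For the sample scores this yields about 0.93, close to but not identical to the reported 0.91, so the grader likely applies additional adjustments.

```python
def weighted_overall(scores: dict, weights: dict) -> float:
    # Weight-normalized average; assumes every weighted criterion has a score.
    total_w = sum(weights.values())
    return sum(scores[c] * w for c, w in weights.items()) / total_w

weights = {"coherence": 0.2, "contextRetention": 0.25, "goalCompletion": 0.3,
           "turnQuality": 0.15, "efficiency": 0.1}
scores = {"coherence": 0.95, "contextRetention": 0.93, "goalCompletion": 0.98,
          "turnQuality": 0.88, "efficiency": 0.80}
print(weighted_overall(scores, weights))  # ≈ 0.9285 (sample response reports 0.91)
```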

Get Results

Retrieve evaluation results for conversation traces.

GET /api/conversation-eval/results

Query Parameters:

| Parameter | Type | Description |
|---|---|---|
| agentId | string | Filter by agent |
| traceId | string | Filter by specific trace |
| minScore | number | Minimum overall score |
| maxScore | number | Maximum overall score |
| startDate | string | ISO 8601 start date |
| endDate | string | ISO 8601 end date |
| page | number | Page number (default: 1) |
| limit | number | Results per page (default: 20) |

Response:

{
  "success": true,
  "data": {
    "results": [
      {
        "evaluationId": "conv_eval_023",
        "traceId": "conv_trace_001",
        "agentId": "agent_support_v2",
        "overallScore": 0.91,
        "passed": true,
        "turnCount": 6,
        "scores": {
          "coherence": 0.95,
          "contextRetention": 0.93,
          "goalCompletion": 0.98,
          "turnQuality": 0.88,
          "efficiency": 0.80
        },
        "evaluatedAt": "2025-03-10T15:30:00Z"
      },
      {
        "evaluationId": "conv_eval_022",
        "traceId": "conv_trace_002",
        "agentId": "agent_support_v2",
        "overallScore": 0.64,
        "passed": false,
        "turnCount": 12,
        "scores": {
          "coherence": 0.70,
          "contextRetention": 0.55,
          "goalCompletion": 0.40,
          "turnQuality": 0.75,
          "efficiency": 0.50
        },
        "evaluatedAt": "2025-03-10T15:25:00Z"
      }
    ],
    "pagination": {
      "page": 1,
      "limit": 20,
      "total": 89
    },
    "aggregate": {
      "avgOverallScore": 0.82,
      "passRate": 0.76,
      "avgTurnCount": 5.4,
      "topFailureReasons": [
        { "reason": "Goal not completed", "count": 8 },
        { "reason": "Context lost mid-conversation", "count": 5 },
        { "reason": "Excessive turns for simple request", "count": 3 }
      ]
    }
  }
}
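The `aggregate` block can be sanity-checked client-side with a sketch like the one below. Note the caveat: this recomputes over a single page of results, whereas the server aggregates across everything matching the filters.

```python
def aggregate_results(results: list[dict]) -> dict:
    """Recompute avg score, pass rate, and avg turn count for one page."""
    n = len(results)
    return {
        "avgOverallScore": round(sum(r["overallScore"] for r in results) / n, 3),
        "passRate": round(sum(r["passed"] for r in results) / n, 3),
        "avgTurnCount": round(sum(r["turnCount"] for r in results) / n, 2),
    }

# The two results shown above:
page = [
    {"overallScore": 0.91, "passed": True, "turnCount": 6},
    {"overallScore": 0.64, "passed": False, "turnCount": 12},
]
print(aggregate_results(page))
# avgOverallScore 0.775, passRate 0.5, avgTurnCount 9.0
```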

Error Responses

All endpoints return consistent error responses:

{
  "success": false,
  "error": {
    "code": "VALIDATION_ERROR",
    "message": "Invalid evaluation set ID",
    "details": {
      "field": "evalSetId",
      "reason": "Evaluation set 'eval_set_999' not found"
    }
  }
}

Common Error Codes:

| HTTP Status | Code | Description |
|---|---|---|
| 400 | VALIDATION_ERROR | Invalid request body or parameters |
| 401 | UNAUTHORIZED | Missing or invalid API key |
| 403 | FORBIDDEN | Insufficient permissions |
| 404 | NOT_FOUND | Resource not found |
| 409 | CONFLICT | Resource already exists or state conflict |
| 429 | RATE_LIMITED | Too many requests |
| 500 | INTERNAL_ERROR | Server error |
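A minimal client-side retry policy for these codes might treat only `429 RATE_LIMITED` and `500 INTERNAL_ERROR` as transient and back off exponentially; the other 4xx codes indicate a request that will fail the same way on retry. This is a sketch, not an SDK feature.

```python
RETRYABLE = {429, 500}  # rate limits and transient server errors

def should_retry(status: int, attempt: int, max_attempts: int = 3) -> bool:
    """Retry only transient failures, up to max_attempts tries."""
    return status in RETRYABLE and attempt < max_attempts

def backoff_seconds(attempt: int, base: float = 0.5) -> float:
    """Exponential backoff: 0.5s, 1s, 2s, ..."""
    return base * (2 ** attempt)

print(should_retry(429, 0))  # True
print(should_retry(400, 0))  # False
```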

Next Steps