
Evaluation & Grading API

ThinkHive provides a complete evaluation and grading pipeline for AI agents: create evaluation sets, run LLM-based and deterministic graders, route edge cases to human reviewers, monitor evaluation health over time, detect nondeterministic behavior, and evaluate multi-turn conversations.

All endpoints require authentication via an Authorization: Bearer th_your_api_key header. See Authentication for details.
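For example, a request attaches the bearer token like this (a sketch using Python's standard library; the base URL below is a placeholder assumption, substitute your actual API host):

```python
import urllib.request

API_KEY = "th_your_api_key"  # replace with your real key
BASE_URL = "https://api.thinkhive.example"  # placeholder host, not the real endpoint

# Every ThinkHive request carries the API key in the Authorization header.
req = urllib.request.Request(
    f"{BASE_URL}/api/evaluation/sets?page=1&limit=20",
    headers={"Authorization": f"Bearer {API_KEY}"},
)
print(req.get_header("Authorization"))  # Bearer th_your_api_key
```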


Evaluation Sets & Criteria

Manage golden datasets and run evaluations against your AI agents.

List Evaluation Sets

GET /api/evaluation/sets

Query Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| agentId | string | Filter sets by agent |
| page | number | Page number (default: 1) |
| limit | number | Results per page (default: 20, max: 100) |

Response:

{
  "success": true,
  "data": [
    {
      "id": "eval_set_001",
      "name": "Customer Support Golden Set",
      "description": "50 curated examples for support agent evaluation",
      "exampleCount": 50,
      "agentId": "agent_abc123",
      "criteria": [
        { "name": "accuracy", "weight": 0.4 },
        { "name": "groundedness", "weight": 0.3 },
        { "name": "completeness", "weight": 0.3 }
      ],
      "createdAt": "2025-01-15T10:00:00Z",
      "updatedAt": "2025-03-02T14:22:00Z"
    }
  ],
  "pagination": {
    "page": 1,
    "limit": 20,
    "total": 3
  }
}

Create Evaluation Set

POST /api/evaluation/sets

Request Body:

{
  "name": "Product FAQ Evaluation",
  "description": "Test cases for product knowledge questions",
  "agentId": "agent_abc123",
  "examples": [
    {
      "input": "What is the return policy?",
      "expectedOutput": "Our return policy allows returns within 30 days of purchase with a valid receipt. Items must be in original condition.",
      "context": "Returns documentation from help center",
      "criteria": ["accuracy", "completeness", "tone"]
    },
    {
      "input": "How do I cancel my subscription?",
      "expectedOutput": "You can cancel your subscription from Settings > Billing > Cancel Plan. Your access continues until the end of the billing period.",
      "context": "Billing FAQ article",
      "criteria": ["accuracy", "helpfulness"]
    }
  ],
  "criteria": [
    { "name": "accuracy", "weight": 0.4, "description": "Factual correctness against source material" },
    { "name": "completeness", "weight": 0.3, "description": "Covers all relevant information" },
    { "name": "tone", "weight": 0.15, "description": "Professional and empathetic" },
    { "name": "helpfulness", "weight": 0.15, "description": "Actionable and clear" }
  ]
}

Response:

{
  "success": true,
  "data": {
    "id": "eval_set_042",
    "name": "Product FAQ Evaluation",
    "exampleCount": 2,
    "createdAt": "2025-03-10T09:00:00Z"
  }
}
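Criterion weights sum to 1.0, which suggests the overall score combines per-criterion scores by weight. The exact server-side formula isn't documented; this sketch assumes a plain weighted sum:

```python
def overall_score(scores: dict[str, float], criteria: list[dict]) -> float:
    """Combine per-criterion scores using the weights defined on the evaluation set.

    Assumes a weighted sum; ThinkHive's actual aggregation formula is not documented.
    """
    return sum(c["weight"] * scores[c["name"]] for c in criteria)

criteria = [
    {"name": "accuracy", "weight": 0.4},
    {"name": "completeness", "weight": 0.3},
    {"name": "tone", "weight": 0.15},
    {"name": "helpfulness", "weight": 0.15},
]
scores = {"accuracy": 0.9, "completeness": 0.8, "tone": 1.0, "helpfulness": 0.8}
print(round(overall_score(scores, criteria), 3))  # 0.4*0.9 + 0.3*0.8 + 0.15*1.0 + 0.15*0.8 = 0.87
```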

Run Evaluation

POST /api/evaluation/run

Request Body:

{
  "evalSetId": "eval_set_042",
  "agentId": "agent_abc123",
  "config": {
    "metrics": ["accuracy", "groundedness", "faithfulness"],
    "threshold": 0.8,
    "graderModel": "gpt-4o",
    "concurrency": 5
  }
}

Response:

{
  "success": true,
  "data": {
    "runId": "eval_run_108",
    "status": "running",
    "progress": {
      "completed": 0,
      "total": 50
    },
    "estimatedCompletionTime": "2025-03-10T09:05:00Z"
  }
}
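Because runs complete asynchronously, clients typically poll the run endpoint until the status leaves pending/running. A minimal polling sketch, where fetch_status stands in for a hypothetical wrapper around GET /api/evaluation/runs/:runId:

```python
import time

def wait_for_run(fetch_status, poll_interval=2.0, max_polls=150):
    """Poll until an evaluation run leaves the 'pending'/'running' states.

    fetch_status is any callable returning the run's data dict, e.g. a thin
    client wrapper (hypothetical) around GET /api/evaluation/runs/:runId.
    """
    for _ in range(max_polls):
        data = fetch_status()
        if data["status"] not in ("pending", "running"):
            return data
        time.sleep(poll_interval)
    raise TimeoutError("evaluation run did not finish in time")

# Example with a stubbed status sequence instead of real HTTP calls:
statuses = iter([{"status": "running"}, {"status": "running"}, {"status": "completed"}])
result = wait_for_run(lambda: next(statuses), poll_interval=0.0)
print(result["status"])  # completed
```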

Get Evaluation Results

GET /api/evaluation/runs/:runId

Response:

{
  "success": true,
  "data": {
    "runId": "eval_run_108",
    "evalSetId": "eval_set_042",
    "agentId": "agent_abc123",
    "status": "completed",
    "startedAt": "2025-03-10T09:00:00Z",
    "completedAt": "2025-03-10T09:04:32Z",
    "summary": {
      "passed": 42,
      "failed": 8,
      "passRate": 0.84,
      "avgAccuracy": 0.87,
      "avgGroundedness": 0.82,
      "avgFaithfulness": 0.91
    },
    "results": [
      {
        "exampleId": "ex_001",
        "input": "What is the return policy?",
        "actualOutput": "Our return policy allows returns within 30 days...",
        "passed": true,
        "scores": {
          "accuracy": 0.92,
          "groundedness": 0.88,
          "faithfulness": 0.95
        },
        "reasoning": "Response accurately covers the return window and receipt requirement."
      },
      {
        "exampleId": "ex_002",
        "input": "How do I cancel my subscription?",
        "actualOutput": "Please contact support to cancel.",
        "passed": false,
        "scores": {
          "accuracy": 0.45,
          "groundedness": 0.30,
          "faithfulness": 0.60
        },
        "reasoning": "Response is vague and does not include the self-service cancellation path from Settings > Billing."
      }
    ]
  }
}

Compare Runs

GET /api/evaluation/compare

Query Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| runIds | string | Comma-separated run IDs to compare |

Response:

{
  "success": true,
  "data": {
    "comparison": {
      "runs": [
        {
          "id": "eval_run_107",
          "passRate": 0.80,
          "avgAccuracy": 0.82,
          "avgGroundedness": 0.79,
          "completedAt": "2025-03-08T15:00:00Z"
        },
        {
          "id": "eval_run_108",
          "passRate": 0.84,
          "avgAccuracy": 0.87,
          "avgGroundedness": 0.82,
          "completedAt": "2025-03-10T09:04:32Z"
        }
      ],
      "improvement": 0.05,
      "significantChanges": [
        "Accuracy improved by 6% on FAQ questions",
        "Groundedness improved by 4% across all categories",
        "2 previously failing examples now pass"
      ],
      "regressions": [
        "Example ex_034 regressed from 0.91 to 0.72 on accuracy"
      ]
    }
  }
}

Evaluation Runs

Manage evaluation run lifecycle. Use these endpoints to list, create, retrieve, and update individual evaluation runs.

List Runs

GET /api/eval-runs

Query Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| agentId | string | Filter by agent ID |
| status | string | Filter by status: pending, running, completed, failed |
| evalSetId | string | Filter by evaluation set |
| page | number | Page number (default: 1) |
| limit | number | Results per page (default: 20) |

Response:

{
  "success": true,
  "data": [
    {
      "id": "eval_run_108",
      "evalSetId": "eval_set_042",
      "agentId": "agent_abc123",
      "status": "completed",
      "passRate": 0.84,
      "totalExamples": 50,
      "passed": 42,
      "failed": 8,
      "startedAt": "2025-03-10T09:00:00Z",
      "completedAt": "2025-03-10T09:04:32Z"
    },
    {
      "id": "eval_run_107",
      "evalSetId": "eval_set_042",
      "agentId": "agent_abc123",
      "status": "completed",
      "passRate": 0.80,
      "totalExamples": 50,
      "passed": 40,
      "failed": 10,
      "startedAt": "2025-03-08T14:30:00Z",
      "completedAt": "2025-03-08T15:00:00Z"
    }
  ],
  "pagination": {
    "page": 1,
    "limit": 20,
    "total": 12
  }
}

Create Run

POST /api/eval-runs

Request Body:

{
  "evalSetId": "eval_set_042",
  "agentId": "agent_abc123",
  "name": "Post-prompt-update regression check",
  "config": {
    "metrics": ["accuracy", "groundedness"],
    "threshold": 0.8,
    "graderModel": "gpt-4o",
    "concurrency": 10,
    "timeout": 30000
  },
  "tags": ["regression", "prompt-v2"]
}

Response:

{
  "success": true,
  "data": {
    "id": "eval_run_109",
    "status": "pending",
    "createdAt": "2025-03-10T11:00:00Z"
  }
}

Get Run

GET /api/eval-runs/:id

Response:

{
  "success": true,
  "data": {
    "id": "eval_run_109",
    "evalSetId": "eval_set_042",
    "agentId": "agent_abc123",
    "name": "Post-prompt-update regression check",
    "status": "running",
    "passRate": null,
    "progress": {
      "completed": 23,
      "total": 50
    },
    "config": {
      "metrics": ["accuracy", "groundedness"],
      "threshold": 0.8,
      "graderModel": "gpt-4o"
    },
    "tags": ["regression", "prompt-v2"],
    "startedAt": "2025-03-10T11:00:05Z",
    "completedAt": null
  }
}

Update Run

PATCH /api/eval-runs/:id

Use this to cancel a running evaluation or update metadata.

Request Body:

{
  "status": "cancelled",
  "name": "Updated run name",
  "tags": ["regression", "prompt-v2", "cancelled-early"]
}

Response:

{
  "success": true,
  "data": {
    "id": "eval_run_109",
    "status": "cancelled",
    "name": "Updated run name",
    "updatedAt": "2025-03-10T11:02:00Z"
  }
}

Deterministic Graders

Apply rule-based grading to agent outputs without LLM calls. Deterministic graders are fast, reproducible, and cost-free. Use them for structural validation, compliance checks, and baseline quality gates.

Deterministic graders run locally with zero latency overhead. Combine them with LLM-based evaluation for comprehensive coverage.

Rule Types

| Rule Type | Description | Example Use Case |
| --- | --- | --- |
| length | Validate output length (min/max characters or tokens) | Ensure responses are concise |
| keywords | Check for required or prohibited keywords | Verify brand terms are included |
| json_valid | Validate output is well-formed JSON | Tool-calling agents |
| regex | Match output against a regular expression | Format validation (dates, IDs) |
| no_pii | Detect personally identifiable information | Compliance enforcement |
| response_time | Assert response latency is within bounds | SLA compliance |

Evaluate with Rules

POST /api/deterministic-graders/evaluate

Request Body:

{
  "output": "Thank you for contacting Acme Corp support. Your order #ORD-2025-1234 has been shipped and will arrive by March 15, 2025. You can track it at https://tracking.acme.com/ORD-2025-1234.",
  "rules": [
    {
      "type": "length",
      "config": { "min": 50, "max": 500, "unit": "characters" }
    },
    {
      "type": "keywords",
      "config": {
        "required": ["Acme Corp", "order"],
        "prohibited": ["I don't know", "I'm not sure"]
      }
    },
    {
      "type": "regex",
      "config": {
        "pattern": "ORD-\\d{4}-\\d{4}",
        "shouldMatch": true
      }
    },
    {
      "type": "no_pii",
      "config": {
        "categories": ["email", "phone", "ssn", "credit_card"]
      }
    }
  ],
  "metadata": {
    "traceId": "trace_abc123",
    "agentId": "agent_support_v2"
  }
}

Response:

{
  "success": true,
  "data": {
    "passed": true,
    "score": 1.0,
    "results": [
      {
        "rule": "length",
        "passed": true,
        "detail": "Output length 189 characters is within range [50, 500]"
      },
      {
        "rule": "keywords",
        "passed": true,
        "detail": "All required keywords found. No prohibited keywords detected."
      },
      {
        "rule": "regex",
        "passed": true,
        "detail": "Pattern 'ORD-\\d{4}-\\d{4}' matched: ORD-2025-1234"
      },
      {
        "rule": "no_pii",
        "passed": true,
        "detail": "No PII detected in output"
      }
    ],
    "metadata": {
      "traceId": "trace_abc123",
      "agentId": "agent_support_v2",
      "evaluatedAt": "2025-03-10T12:00:00Z",
      "durationMs": 3
    }
  }
}
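The length, keywords, and regex rules are simple enough to mirror client-side, for example as a pre-flight check before submitting. This local sketch follows the semantics described above; the server's implementation may differ in edge cases:

```python
import re

def run_rules(output: str, rules: list[dict]) -> list[dict]:
    """Local re-implementation of three deterministic rule types.

    Mirrors the documented semantics; details (e.g. token-based length,
    case sensitivity of keywords) are assumptions.
    """
    results = []
    for rule in rules:
        cfg = rule["config"]
        if rule["type"] == "length":
            n = len(output)
            passed = cfg.get("min", 0) <= n <= cfg.get("max", float("inf"))
        elif rule["type"] == "keywords":
            passed = all(k in output for k in cfg.get("required", [])) and \
                     not any(k in output for k in cfg.get("prohibited", []))
        elif rule["type"] == "regex":
            passed = bool(re.search(cfg["pattern"], output)) == cfg.get("shouldMatch", True)
        else:
            raise ValueError(f"unsupported rule type: {rule['type']}")
        results.append({"rule": rule["type"], "passed": passed})
    return results

output = "Thank you for contacting Acme Corp support. Your order #ORD-2025-1234 has shipped."
rules = [
    {"type": "length", "config": {"min": 50, "max": 500}},
    {"type": "keywords", "config": {"required": ["Acme Corp", "order"], "prohibited": ["I don't know"]}},
    {"type": "regex", "config": {"pattern": r"ORD-\d{4}-\d{4}", "shouldMatch": True}},
]
print(all(r["passed"] for r in run_rules(output, rules)))  # True
```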

Bulk Evaluate

Evaluate multiple outputs in a single request. Useful for batch processing historical traces or running deterministic checks as part of a CI pipeline.

POST /api/deterministic-graders/bulk-evaluate

Request Body:

{
  "items": [
    {
      "id": "trace_001",
      "output": "Your account balance is $1,234.56. Contact us at support@acme.com for questions.",
      "rules": [
        { "type": "length", "config": { "min": 20, "max": 300 } },
        { "type": "no_pii", "config": { "categories": ["email", "phone"] } }
      ]
    },
    {
      "id": "trace_002",
      "output": "{\"status\": \"approved\", \"amount\": 500, \"currency\": \"USD\"}",
      "rules": [
        { "type": "json_valid", "config": { "schema": "approval_response" } },
        { "type": "length", "config": { "min": 10, "max": 1000 } }
      ]
    },
    {
      "id": "trace_003",
      "output": "Response generated in 245ms. The weather in NYC is 72F.",
      "rules": [
        { "type": "response_time", "config": { "maxMs": 500 } },
        { "type": "keywords", "config": { "prohibited": ["error", "failed", "exception"] } }
      ]
    }
  ]
}

Response:

{
  "success": true,
  "data": {
    "totalItems": 3,
    "passed": 2,
    "failed": 1,
    "results": [
      {
        "id": "trace_001",
        "passed": false,
        "score": 0.5,
        "results": [
          { "rule": "length", "passed": true, "detail": "Output length 82 characters is within range [20, 300]" },
          { "rule": "no_pii", "passed": false, "detail": "PII detected: email address (support@acme.com)" }
        ]
      },
      {
        "id": "trace_002",
        "passed": true,
        "score": 1.0,
        "results": [
          { "rule": "json_valid", "passed": true, "detail": "Valid JSON matching schema 'approval_response'" },
          { "rule": "length", "passed": true, "detail": "Output length 58 characters is within range [10, 1000]" }
        ]
      },
      {
        "id": "trace_003",
        "passed": true,
        "score": 1.0,
        "results": [
          { "rule": "response_time", "passed": true, "detail": "Response time 245ms is within 500ms limit" },
          { "rule": "keywords", "passed": true, "detail": "No prohibited keywords detected" }
        ]
      }
    ],
    "summary": {
      "passRate": 0.67,
      "avgScore": 0.83,
      "durationMs": 8
    }
  }
}

Human Review Queue

Route borderline or high-stakes evaluations to human reviewers. ThinkHive manages assignment, calibration, conflict resolution, and reviewer analytics.

Get Review Queue

GET /api/human-review/queue

Query Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| reviewerId | string | Filter by assigned reviewer |
| status | string | pending, in_progress, completed, skipped |
| priority | string | low, medium, high, critical |
| agentId | string | Filter by agent |
| page | number | Page number (default: 1) |
| limit | number | Results per page (default: 20) |

Response:

{
  "success": true,
  "data": {
    "items": [
      {
        "id": "review_501",
        "traceId": "trace_abc123",
        "agentId": "agent_support_v2",
        "priority": "high",
        "status": "pending",
        "input": "I want to delete all my data and close my account permanently.",
        "output": "I can help you with that. I'll initiate the account deletion process. This will permanently remove all your data within 30 days as required by our data retention policy.",
        "autoGraderScores": {
          "accuracy": 0.72,
          "safety": 0.65
        },
        "flagReason": "Safety score below threshold (0.65 < 0.80)",
        "assignedTo": null,
        "createdAt": "2025-03-10T08:30:00Z"
      },
      {
        "id": "review_502",
        "traceId": "trace_def456",
        "agentId": "agent_billing_v1",
        "priority": "medium",
        "status": "in_progress",
        "input": "Why was I charged twice this month?",
        "output": "I see two charges on your account. The first is your regular subscription and the second appears to be a prorated charge from your plan upgrade on March 3rd.",
        "autoGraderScores": {
          "accuracy": 0.78,
          "helpfulness": 0.80
        },
        "flagReason": "Accuracy score in review range (0.70-0.85)",
        "assignedTo": "reviewer_jane",
        "createdAt": "2025-03-10T07:15:00Z"
      }
    ],
    "pagination": {
      "page": 1,
      "limit": 20,
      "total": 47
    },
    "queueStats": {
      "pending": 23,
      "inProgress": 12,
      "completedToday": 34
    }
  }
}
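The flagReason values above imply a threshold-based router: outputs scoring below a floor fail outright, those in a middle band are routed to humans, and the rest pass automatically. The thresholds in this sketch are inferred from the example payloads, not documented API behavior:

```python
def route(scores: dict[str, float], auto_pass=0.85, review_low=0.70) -> str:
    """Illustrative routing rule matching the flagReason values shown above.

    Thresholds are assumptions inferred from the example payloads
    (accuracy review range 0.70-0.85), not a documented API contract.
    """
    worst = min(scores.values())
    if worst >= auto_pass:
        return "auto_pass"
    if worst >= review_low:
        return "human_review"
    return "auto_fail"

print(route({"accuracy": 0.78, "helpfulness": 0.80}))  # human_review
print(route({"accuracy": 0.95, "safety": 0.91}))       # auto_pass
```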

Assign Reviewer

POST /api/human-review/assign

Request Body:

{
  "reviewId": "review_501",
  "reviewerId": "reviewer_jane",
  "priority": "high",
  "dueBy": "2025-03-10T17:00:00Z",
  "notes": "Account deletion request - verify compliance language is accurate"
}

Response:

{
  "success": true,
  "data": {
    "reviewId": "review_501",
    "assignedTo": "reviewer_jane",
    "status": "in_progress",
    "dueBy": "2025-03-10T17:00:00Z",
    "assignedAt": "2025-03-10T12:00:00Z"
  }
}

Calibration Sets

Create calibration sets to measure and align inter-reviewer agreement. Calibration reviews are compared against a known-good answer key.

POST /api/human-review/calibration

Request Body:

{
  "name": "Q1 2025 Support Calibration",
  "reviewerIds": ["reviewer_jane", "reviewer_mike", "reviewer_sara"],
  "examples": [
    {
      "input": "I need a refund for my last order",
      "output": "I'd be happy to help with your refund. I've processed a full refund of $49.99 to your original payment method. It should appear within 3-5 business days.",
      "expectedScores": {
        "accuracy": 0.95,
        "helpfulness": 0.90,
        "tone": 0.92
      },
      "notes": "Ideal response - proactive, specific amount, clear timeline"
    },
    {
      "input": "Your product is terrible and I want my money back",
      "output": "I understand your frustration. Let me look into this for you. Could you share your order number so I can process your refund?",
      "expectedScores": {
        "accuracy": 0.80,
        "helpfulness": 0.85,
        "tone": 0.95
      },
      "notes": "Good de-escalation but should acknowledge specific complaint"
    }
  ]
}

Response:

{
  "success": true,
  "data": {
    "calibrationId": "cal_012",
    "name": "Q1 2025 Support Calibration",
    "reviewerCount": 3,
    "exampleCount": 2,
    "status": "pending",
    "createdAt": "2025-03-10T10:00:00Z"
  }
}

Skip / Reassign Review

POST /api/human-review/skip

Request Body:

{
  "reviewId": "review_502",
  "reviewerId": "reviewer_jane",
  "reason": "conflict_of_interest",
  "reassignTo": "reviewer_mike"
}

Response:

{
  "success": true,
  "data": {
    "reviewId": "review_502",
    "previousReviewer": "reviewer_jane",
    "assignedTo": "reviewer_mike",
    "skipReason": "conflict_of_interest",
    "status": "in_progress"
  }
}

Submit Review

POST /api/human-review/submit

Request Body:

{
  "reviewId": "review_501",
  "reviewerId": "reviewer_jane",
  "verdict": "pass",
  "scores": {
    "accuracy": 0.90,
    "safety": 0.85,
    "helpfulness": 0.88,
    "tone": 0.92
  },
  "feedback": "Response correctly describes the deletion process and mentions the 30-day retention window. Could improve by mentioning the user can download their data before deletion.",
  "tags": ["account-deletion", "gdpr-related"],
  "suggestedOutput": "I can help you with that. Before I initiate the deletion, would you like to download a copy of your data? Once confirmed, I'll permanently remove all your data within 30 days per our data retention policy."
}

Response:

{
  "success": true,
  "data": {
    "reviewId": "review_501",
    "status": "completed",
    "verdict": "pass",
    "reviewerId": "reviewer_jane",
    "completedAt": "2025-03-10T14:30:00Z",
    "reviewDurationMs": 142000
  }
}

Review Stats

GET /api/human-review/stats

Query Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| reviewerId | string | Stats for a specific reviewer |
| startDate | string | ISO 8601 start date |
| endDate | string | ISO 8601 end date |
| agentId | string | Filter by agent |

Response:

{
  "success": true,
  "data": {
    "period": {
      "start": "2025-03-01T00:00:00Z",
      "end": "2025-03-10T23:59:59Z"
    },
    "overview": {
      "totalReviews": 156,
      "avgReviewTimeMs": 95000,
      "passRate": 0.72,
      "interReviewerAgreement": 0.88
    },
    "byReviewer": [
      {
        "reviewerId": "reviewer_jane",
        "name": "Jane Smith",
        "reviewsCompleted": 52,
        "avgReviewTimeMs": 82000,
        "agreementRate": 0.91,
        "calibrationScore": 0.94
      },
      {
        "reviewerId": "reviewer_mike",
        "name": "Mike Johnson",
        "reviewsCompleted": 48,
        "avgReviewTimeMs": 105000,
        "agreementRate": 0.86,
        "calibrationScore": 0.89
      }
    ],
    "byVerdict": {
      "pass": 112,
      "fail": 31,
      "borderline": 13
    }
  }
}
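The API does not document how calibrationScore is computed. One plausible definition, used in this sketch, is 1 minus the reviewer's mean absolute deviation from the calibration answer key:

```python
def calibration_score(reviewer: dict, answer_key: dict) -> float:
    """Assumed formula: 1 minus the mean absolute deviation from the
    calibration answer key. ThinkHive's actual formula is not documented."""
    diffs = [abs(reviewer[k] - answer_key[k]) for k in answer_key]
    return round(1 - sum(diffs) / len(diffs), 3)

answer_key = {"accuracy": 0.95, "helpfulness": 0.90, "tone": 0.92}
jane = {"accuracy": 0.90, "helpfulness": 0.92, "tone": 0.95}
print(calibration_score(jane, answer_key))  # 0.967
```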

Eval Health Monitoring

Track the health and quality of your evaluation pipeline over time. Detect regressions, identify saturated metrics, and generate snapshots for executive reporting.

Health Report

GET /api/eval-health/report

Query Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| agentId | string | Agent to report on (required) |
| period | string | 7d, 30d, 90d (default: 30d) |

Response:

{
  "success": true,
  "data": {
    "agentId": "agent_support_v2",
    "period": "30d",
    "generatedAt": "2025-03-10T15:00:00Z",
    "overallHealth": "good",
    "healthScore": 0.87,
    "metrics": {
      "passRate": {
        "current": 0.84,
        "previous": 0.80,
        "trend": "improving",
        "change": 0.04
      },
      "avgAccuracy": {
        "current": 0.87,
        "previous": 0.85,
        "trend": "improving",
        "change": 0.02
      },
      "avgGroundedness": {
        "current": 0.82,
        "previous": 0.83,
        "trend": "stable",
        "change": -0.01
      },
      "avgResponseTime": {
        "current": 1240,
        "previous": 1380,
        "trend": "improving",
        "change": -140,
        "unit": "ms"
      }
    },
    "runsInPeriod": 12,
    "totalExamplesEvaluated": 600,
    "alerts": [
      {
        "type": "regression",
        "severity": "warning",
        "message": "Groundedness dropped 3% on billing-related questions in the last 7 days",
        "affectedExamples": 8
      }
    ]
  }
}

Snapshots

Retrieve point-in-time snapshots of evaluation metrics for historical analysis and reporting.

GET /api/eval-health/snapshots

Query Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| agentId | string | Agent ID (required) |
| startDate | string | ISO 8601 start date |
| endDate | string | ISO 8601 end date |
| granularity | string | daily, weekly, monthly (default: daily) |

Response:

{
  "success": true,
  "data": {
    "agentId": "agent_support_v2",
    "granularity": "weekly",
    "snapshots": [
      {
        "date": "2025-02-24",
        "passRate": 0.80,
        "avgAccuracy": 0.85,
        "avgGroundedness": 0.83,
        "runsCount": 3,
        "examplesEvaluated": 150
      },
      {
        "date": "2025-03-03",
        "passRate": 0.82,
        "avgAccuracy": 0.86,
        "avgGroundedness": 0.82,
        "runsCount": 4,
        "examplesEvaluated": 200
      },
      {
        "date": "2025-03-10",
        "passRate": 0.84,
        "avgAccuracy": 0.87,
        "avgGroundedness": 0.82,
        "runsCount": 5,
        "examplesEvaluated": 250
      }
    ]
  }
}

Regressions

Detect significant regressions in evaluation scores across runs.

GET /api/eval-health/regressions

Query Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| agentId | string | Agent ID (required) |
| threshold | number | Minimum score drop to flag as regression (default: 0.05) |
| period | string | 7d, 30d, 90d (default: 30d) |

Response:

{
  "success": true,
  "data": {
    "agentId": "agent_support_v2",
    "regressionsDetected": 2,
    "regressions": [
      {
        "id": "reg_001",
        "metric": "groundedness",
        "category": "billing",
        "previousScore": 0.89,
        "currentScore": 0.76,
        "drop": 0.13,
        "severity": "critical",
        "firstDetected": "2025-03-08T12:00:00Z",
        "affectedExamples": [
          { "id": "ex_021", "input": "Why was I charged twice?", "scoreDrop": 0.18 },
          { "id": "ex_034", "input": "Can I get a prorated refund?", "scoreDrop": 0.15 }
        ],
        "possibleCause": "Prompt template updated on 2025-03-07, billing context section shortened"
      },
      {
        "id": "reg_002",
        "metric": "accuracy",
        "category": "shipping",
        "previousScore": 0.91,
        "currentScore": 0.85,
        "drop": 0.06,
        "severity": "warning",
        "firstDetected": "2025-03-09T09:00:00Z",
        "affectedExamples": [
          { "id": "ex_045", "input": "What shipping options do you offer?", "scoreDrop": 0.08 }
        ],
        "possibleCause": "Knowledge base shipping article last updated 2024-12-01"
      }
    ]
  }
}
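Regression detection reduces to comparing per-metric scores across runs and flagging drops at or above the threshold. The severity cutoff in this sketch (0.10 for critical) is an assumption consistent with the example above:

```python
def detect_regressions(previous: dict, current: dict, threshold=0.05):
    """Flag metrics whose score dropped by at least `threshold` between runs,
    mirroring the default threshold of /api/eval-health/regressions.
    The 0.10 critical cutoff is an assumption, not documented behavior."""
    flagged = []
    for metric, prev_score in previous.items():
        drop = prev_score - current.get(metric, prev_score)
        if drop >= threshold:
            severity = "critical" if drop >= 0.10 else "warning"
            flagged.append({"metric": metric, "drop": round(drop, 2), "severity": severity})
    return flagged

prev = {"groundedness": 0.89, "accuracy": 0.91, "tone": 0.96}
curr = {"groundedness": 0.76, "accuracy": 0.85, "tone": 0.96}
print(detect_regressions(prev, curr))
```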

Saturation Analysis

Identify metrics that have plateaued and may no longer differentiate agent quality.

GET /api/eval-health/saturation

Query Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| agentId | string | Agent ID (required) |
| period | string | 30d, 90d, 180d (default: 90d) |

Response:

{
  "success": true,
  "data": {
    "agentId": "agent_support_v2",
    "period": "90d",
    "analysis": [
      {
        "metric": "tone",
        "currentScore": 0.96,
        "variance": 0.002,
        "saturated": true,
        "recommendation": "Metric 'tone' has been consistently above 0.95 for 90 days with near-zero variance. Consider removing from active evaluation or raising the threshold."
      },
      {
        "metric": "accuracy",
        "currentScore": 0.87,
        "variance": 0.015,
        "saturated": false,
        "recommendation": "Metric 'accuracy' shows healthy variance and room for improvement. Continue evaluating."
      },
      {
        "metric": "groundedness",
        "currentScore": 0.82,
        "variance": 0.028,
        "saturated": false,
        "recommendation": "Metric 'groundedness' shows the highest variance. Prioritize improving retrieval quality."
      }
    ]
  }
}
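A saturation check like the one above can be approximated from a metric's recent score history: a consistently high mean plus near-zero variance. The specific thresholds here are illustrative assumptions:

```python
def is_saturated(scores: list[float], score_floor=0.95, variance_ceiling=0.005) -> bool:
    """Assumed heuristic: a metric is saturated when it stays above a high
    floor with near-zero variance. Thresholds are illustrative, not documented."""
    n = len(scores)
    mean = sum(scores) / n
    variance = sum((s - mean) ** 2 for s in scores) / n
    return mean > score_floor and variance < variance_ceiling

print(is_saturated([0.96, 0.97, 0.96, 0.95, 0.96]))  # True: plateaued near 0.96
print(is_saturated([0.87, 0.82, 0.90, 0.79, 0.85]))  # False: healthy variance
```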

Nondeterminism Detection

Detect and quantify inconsistent behavior in your AI agents. Run the same inputs multiple times and measure output variance to identify reliability issues.

Nondeterminism detection is essential for agents with low-temperature settings that are expected to produce consistent outputs. High variance on identical inputs signals prompt fragility or retrieval instability.

Create Nondeterminism Run

POST /api/nondeterminism/runs

Request Body:

{
  "agentId": "agent_support_v2",
  "inputs": [
    { "id": "input_001", "text": "What is the return policy?" },
    { "id": "input_002", "text": "How do I upgrade my plan?" },
    { "id": "input_003", "text": "Is there a student discount?" }
  ],
  "config": {
    "repetitions": 5,
    "temperature": 0.0,
    "similarityMetric": "semantic",
    "timeout": 30000
  }
}

Response:

{
  "success": true,
  "data": {
    "runId": "nondet_run_007",
    "status": "running",
    "totalInferences": 15,
    "estimatedCompletionTime": "2025-03-10T15:10:00Z"
  }
}

Get Nondeterminism Results

GET /api/nondeterminism/runs/:id

Response:

{
  "success": true,
  "data": {
    "runId": "nondet_run_007",
    "agentId": "agent_support_v2",
    "status": "completed",
    "completedAt": "2025-03-10T15:08:42Z",
    "summary": {
      "avgConsistency": 0.92,
      "minConsistency": 0.78,
      "maxConsistency": 0.99,
      "highVarianceInputs": 1
    },
    "results": [
      {
        "inputId": "input_001",
        "input": "What is the return policy?",
        "consistency": 0.99,
        "variance": "low",
        "outputs": [
          "Our return policy allows returns within 30 days of purchase with a valid receipt.",
          "Our return policy allows returns within 30 days of purchase with a valid receipt.",
          "Our return policy allows returns within 30 days with a valid receipt. Items must be unused.",
          "Our return policy allows returns within 30 days of purchase with a valid receipt.",
          "Our return policy allows returns within 30 days of purchase with a valid receipt."
        ],
        "semanticSimilarityMatrix": [[1.0, 1.0, 0.97, 1.0, 1.0]]
      },
      {
        "inputId": "input_002",
        "input": "How do I upgrade my plan?",
        "consistency": 0.98,
        "variance": "low",
        "outputs": [
          "Go to Settings > Billing > Upgrade Plan to see available options.",
          "Navigate to Settings, then Billing, and click Upgrade Plan.",
          "You can upgrade from Settings > Billing > Upgrade Plan.",
          "Go to Settings > Billing > Upgrade Plan to view options.",
          "Head to Settings > Billing > Upgrade Plan to see your options."
        ]
      },
      {
        "inputId": "input_003",
        "input": "Is there a student discount?",
        "consistency": 0.78,
        "variance": "high",
        "outputs": [
          "Yes, we offer a 20% student discount. Verify with your .edu email.",
          "We don't currently offer student discounts, but check our promotions page.",
          "Yes! Students get 20% off with a valid student ID or .edu email.",
          "We offer a 20% student discount. You'll need to verify your student status.",
          "Currently we don't have a specific student discount program."
        ],
        "alert": "Contradictory outputs detected: 3 responses confirm a discount, 2 deny it. Check knowledge base for conflicting information."
      }
    ]
  }
}
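Consistency is a function of pairwise similarity between an input's repeated outputs. The API uses semantic similarity; this sketch substitutes token-overlap (Jaccard) similarity as a crude, dependency-free stand-in:

```python
def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity, a rough stand-in for the semantic similarity
    the API computes (its embedding model is not documented)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def consistency(outputs: list[str]) -> float:
    """Mean pairwise similarity across repeated outputs for one input."""
    pairs = [(i, j) for i in range(len(outputs)) for j in range(i + 1, len(outputs))]
    return sum(jaccard(outputs[i], outputs[j]) for i, j in pairs) / len(pairs)

same = ["returns within 30 days", "returns within 30 days", "returns within 30 days"]
mixed = ["yes we offer a discount", "no discount is available", "yes we offer a discount"]
print(consistency(same))         # 1.0: identical outputs
print(consistency(mixed) < 0.5)  # True: contradictory outputs score low
```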

Consistency Report

Generate an aggregate consistency report across multiple nondeterminism runs.

GET /api/nondeterminism/consistency

Query Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| agentId | string | Agent ID (required) |
| period | string | 7d, 30d, 90d (default: 30d) |
| minRepetitions | number | Minimum repetitions per input (default: 3) |

Response:

{
  "success": true,
  "data": {
    "agentId": "agent_support_v2",
    "period": "30d",
    "overallConsistency": 0.91,
    "totalInputsTested": 150,
    "totalInferences": 750,
    "byCategory": [
      {
        "category": "billing",
        "consistency": 0.94,
        "inputsTested": 45,
        "highVarianceCount": 2
      },
      {
        "category": "product",
        "consistency": 0.92,
        "inputsTested": 60,
        "highVarianceCount": 4
      },
      {
        "category": "policy",
        "consistency": 0.85,
        "inputsTested": 45,
        "highVarianceCount": 8
      }
    ],
    "topVarianceInputs": [
      {
        "input": "Is there a student discount?",
        "consistency": 0.78,
        "contradictions": true,
        "runId": "nondet_run_007"
      },
      {
        "input": "Can I pause my subscription?",
        "consistency": 0.81,
        "contradictions": true,
        "runId": "nondet_run_005"
      }
    ],
    "trend": {
      "current": 0.91,
      "previous": 0.88,
      "direction": "improving"
    }
  }
}

Calculate pass@k

Estimate the probability that at least one of k sampled outputs passes evaluation. Based on the pass@k metric from code generation research.

POST /api/nondeterminism/pass-at-k

Request Body:

{
  "agentId": "agent_support_v2",
  "evalSetId": "eval_set_042",
  "k": [1, 3, 5, 10],
  "config": {
    "n": 20,
    "threshold": 0.8,
    "metrics": ["accuracy", "groundedness"]
  }
}

Response:

{
  "success": true,
  "data": {
    "agentId": "agent_support_v2",
    "evalSetId": "eval_set_042",
    "n": 20,
    "threshold": 0.8,
    "results": {
      "pass@1": 0.84,
      "pass@3": 0.94,
      "pass@5": 0.97,
      "pass@10": 0.99
    },
    "byExample": [
      {
        "exampleId": "ex_001",
        "passCount": 18,
        "totalSamples": 20,
        "pass@1": 0.90,
        "pass@3": 0.999
      },
      {
        "exampleId": "ex_002",
        "passCount": 12,
        "totalSamples": 20,
        "pass@1": 0.60,
        "pass@3": 0.88
      }
    ],
    "insight": "Agent passes 84% of examples on first try but 97% within 5 attempts, suggesting moderate nondeterminism that benefits from retry strategies."
  }
}
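The API's exact estimator isn't documented, but the standard unbiased pass@k estimator from code-generation research computes the probability that at least one of k samples, drawn from n total samples of which c passed, is a pass:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n = total samples per example, c = samples that passed, k = draws.
    Whether ThinkHive uses exactly this estimator is an assumption.
    """
    if n - c < k:
        return 1.0  # fewer failures than draws: at least one pass is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# ex_001 above: 18 of 20 samples passed
print(round(pass_at_k(20, 18, 1), 2))  # 0.9
# ex_002 above: 12 of 20 samples passed
print(round(pass_at_k(20, 12, 3), 2))
```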

Conversation Eval

Evaluate multi-turn conversations holistically. Unlike single-turn evaluation, conversation eval assesses coherence, context retention, goal completion, and turn-level quality across an entire dialogue.

Get Conversation Traces

Retrieve multi-turn conversation traces suitable for evaluation.

GET /api/conversation-eval/traces

Query Parameters:

| Parameter | Type | Description |
| --- | --- | --- |
| agentId | string | Filter by agent |
| minTurns | number | Minimum number of turns (default: 2) |
| maxTurns | number | Maximum number of turns |
| status | string | evaluated, pending, all (default: all) |
| startDate | string | ISO 8601 start date |
| endDate | string | ISO 8601 end date |
| page | number | Page number (default: 1) |
| limit | number | Results per page (default: 20) |

Response:

{
  "success": true,
  "data": {
    "traces": [
      {
        "traceId": "conv_trace_001",
        "agentId": "agent_support_v2",
        "turnCount": 6,
        "duration": 245000,
        "status": "pending",
        "goalCompleted": null,
        "turns": [
          {
            "turnNumber": 1,
            "role": "user",
            "content": "I need help with my recent order",
            "timestamp": "2025-03-10T14:00:00Z"
          },
          {
            "turnNumber": 2,
            "role": "assistant",
            "content": "I'd be happy to help! Could you share your order number?",
            "timestamp": "2025-03-10T14:00:02Z",
            "latencyMs": 1840
          },
          {
            "turnNumber": 3,
            "role": "user",
            "content": "It's ORD-2025-5678. The item arrived damaged.",
            "timestamp": "2025-03-10T14:00:15Z"
          },
          {
            "turnNumber": 4,
            "role": "assistant",
            "content": "I'm sorry to hear that, I've pulled up order ORD-2025-5678. I can see it was a wireless keyboard delivered on March 7th. I can offer you a full replacement or a refund. Which would you prefer?",
            "timestamp": "2025-03-10T14:00:18Z",
            "latencyMs": 2100
          },
          {
            "turnNumber": 5,
            "role": "user",
            "content": "A replacement please",
            "timestamp": "2025-03-10T14:00:30Z"
          },
          {
            "turnNumber": 6,
            "role": "assistant",
            "content": "Done! I've initiated a replacement for your wireless keyboard. You'll receive a shipping confirmation within 24 hours. Is there anything else I can help with?",
            "timestamp": "2025-03-10T14:00:33Z",
            "latencyMs": 1950
          }
        ],
        "metadata": {
          "sessionId": "sess_789",
          "userSatisfaction": null
        }
      }
    ],
    "pagination": {
      "page": 1,
      "limit": 20,
      "total": 89
    }
  }
}
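Trace payloads like the one above can be post-processed client-side before deciding which traces to evaluate, for example to summarize assistant response latencies. The helper below is a hypothetical sketch over the response shape shown.

```python
def assistant_latency_stats(trace: dict) -> dict:
    """Summarize latencyMs across assistant turns in a trace payload."""
    latencies = [t["latencyMs"] for t in trace["turns"]
                 if t.get("role") == "assistant" and "latencyMs" in t]
    return {
        "assistantTurns": len(latencies),
        "avgLatencyMs": sum(latencies) / len(latencies) if latencies else None,
        "maxLatencyMs": max(latencies, default=None),
    }

# Latencies from the sample trace above: 1840, 2100, 1950 ms.
trace = {"turns": [
    {"role": "user"},
    {"role": "assistant", "latencyMs": 1840},
    {"role": "user"},
    {"role": "assistant", "latencyMs": 2100},
    {"role": "user"},
    {"role": "assistant", "latencyMs": 1950},
]}
print(assistant_latency_stats(trace))
```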

Evaluate Conversation

POST /api/conversation-eval/evaluate

Request Body:

{
  "traceId": "conv_trace_001",
  "criteria": {
    "coherence": {
      "enabled": true,
      "weight": 0.2,
      "description": "Logical flow and consistency across turns"
    },
    "contextRetention": {
      "enabled": true,
      "weight": 0.25,
      "description": "Agent correctly references earlier context"
    },
    "goalCompletion": {
      "enabled": true,
      "weight": 0.3,
      "description": "User's goal was identified and resolved"
    },
    "turnQuality": {
      "enabled": true,
      "weight": 0.15,
      "description": "Individual turn accuracy and helpfulness"
    },
    "efficiency": {
      "enabled": true,
      "weight": 0.1,
      "description": "Minimal unnecessary turns to reach resolution"
    }
  },
  "config": {
    "graderModel": "gpt-4o",
    "includePerTurnScores": true
  }
}

Response:

{
  "success": true,
  "data": {
    "evaluationId": "conv_eval_023",
    "traceId": "conv_trace_001",
    "status": "completed",
    "overallScore": 0.91,
    "passed": true,
    "scores": {
      "coherence": 0.95,
      "contextRetention": 0.93,
      "goalCompletion": 0.98,
      "turnQuality": 0.88,
      "efficiency": 0.80
    },
    "perTurnScores": [
      {
        "turnNumber": 2,
        "role": "assistant",
        "scores": { "relevance": 0.90, "helpfulness": 0.85 },
        "feedback": "Good opening but could be more specific about what information is needed."
      },
      {
        "turnNumber": 4,
        "role": "assistant",
        "scores": { "relevance": 0.95, "helpfulness": 0.92, "accuracy": 0.90 },
        "feedback": "Excellent context recall - identified the product and delivery date. Offered clear resolution options."
      },
      {
        "turnNumber": 6,
        "role": "assistant",
        "scores": { "relevance": 0.90, "helpfulness": 0.88 },
        "feedback": "Clean resolution with follow-up offer. Could mention return instructions for the damaged item."
      }
    ],
    "summary": "Conversation handled well. Agent correctly identified the issue, retained context throughout, and resolved the customer's goal in 6 turns. Minor improvement: include return instructions for the damaged item.",
    "evaluatedAt": "2025-03-10T15:30:00Z"
  }
}
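Assuming `overallScore` is a weight-normalized average of the per-criterion scores (an assumption; the grader's exact aggregation is not documented here), it can be approximated from the weights sent in the request. For the sample scores this yields about 0.93, close to but not identical to the reported 0.91, so the grader likely applies additional adjustments.

```python
def weighted_overall(scores: dict, weights: dict) -> float:
    # Weight-normalized average; assumes every weighted criterion has a score.
    total_w = sum(weights.values())
    return sum(scores[c] * w for c, w in weights.items()) / total_w

weights = {"coherence": 0.2, "contextRetention": 0.25, "goalCompletion": 0.3,
           "turnQuality": 0.15, "efficiency": 0.1}
scores = {"coherence": 0.95, "contextRetention": 0.93, "goalCompletion": 0.98,
          "turnQuality": 0.88, "efficiency": 0.80}
print(weighted_overall(scores, weights))  # ≈ 0.9285 (sample response reports 0.91)
```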

Get Results

Retrieve evaluation results for conversation traces.

GET /api/conversation-eval/results

Query Parameters:

| Parameter | Type | Description |
|---|---|---|
| agentId | string | Filter by agent |
| traceId | string | Filter by specific trace |
| minScore | number | Minimum overall score |
| maxScore | number | Maximum overall score |
| startDate | string | ISO 8601 start date |
| endDate | string | ISO 8601 end date |
| page | number | Page number (default: 1) |
| limit | number | Results per page (default: 20) |

Response:

{
  "success": true,
  "data": {
    "results": [
      {
        "evaluationId": "conv_eval_023",
        "traceId": "conv_trace_001",
        "agentId": "agent_support_v2",
        "overallScore": 0.91,
        "passed": true,
        "turnCount": 6,
        "scores": {
          "coherence": 0.95,
          "contextRetention": 0.93,
          "goalCompletion": 0.98,
          "turnQuality": 0.88,
          "efficiency": 0.80
        },
        "evaluatedAt": "2025-03-10T15:30:00Z"
      },
      {
        "evaluationId": "conv_eval_022",
        "traceId": "conv_trace_002",
        "agentId": "agent_support_v2",
        "overallScore": 0.64,
        "passed": false,
        "turnCount": 12,
        "scores": {
          "coherence": 0.70,
          "contextRetention": 0.55,
          "goalCompletion": 0.40,
          "turnQuality": 0.75,
          "efficiency": 0.50
        },
        "evaluatedAt": "2025-03-10T15:25:00Z"
      }
    ],
    "pagination": {
      "page": 1,
      "limit": 20,
      "total": 89
    },
    "aggregate": {
      "avgOverallScore": 0.82,
      "passRate": 0.76,
      "avgTurnCount": 5.4,
      "topFailureReasons": [
        { "reason": "Goal not completed", "count": 8 },
        { "reason": "Context lost mid-conversation", "count": 5 },
        { "reason": "Excessive turns for simple request", "count": 3 }
      ]
    }
  }
}
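The `aggregate` block can be sanity-checked client-side with a sketch like the one below. Note the caveat: this recomputes over a single page of results, whereas the server aggregates across everything matching the filters.

```python
def aggregate_results(results: list[dict]) -> dict:
    """Recompute avg score, pass rate, and avg turn count for one page."""
    n = len(results)
    return {
        "avgOverallScore": round(sum(r["overallScore"] for r in results) / n, 3),
        "passRate": round(sum(r["passed"] for r in results) / n, 3),
        "avgTurnCount": round(sum(r["turnCount"] for r in results) / n, 2),
    }

# The two results shown above:
page = [
    {"overallScore": 0.91, "passed": True, "turnCount": 6},
    {"overallScore": 0.64, "passed": False, "turnCount": 12},
]
print(aggregate_results(page))
# avgOverallScore 0.775, passRate 0.5, avgTurnCount 9.0
```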

Error Responses

All endpoints return consistent error responses:

{
  "success": false,
  "error": {
    "code": "VALIDATION_ERROR",
    "message": "Invalid evaluation set ID",
    "details": {
      "field": "evalSetId",
      "reason": "Evaluation set 'eval_set_999' not found"
    }
  }
}

Common Error Codes:

| HTTP Status | Code | Description |
|---|---|---|
| 400 | VALIDATION_ERROR | Invalid request body or parameters |
| 401 | UNAUTHORIZED | Missing or invalid API key |
| 403 | FORBIDDEN | Insufficient permissions |
| 404 | NOT_FOUND | Resource not found |
| 409 | CONFLICT | Resource already exists or state conflict |
| 429 | RATE_LIMITED | Too many requests |
| 500 | INTERNAL_ERROR | Server error |
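A minimal client-side retry policy for these codes might treat only `429 RATE_LIMITED` and `500 INTERNAL_ERROR` as transient and back off exponentially; the other 4xx codes indicate a request that will fail the same way on retry. This is a sketch, not an SDK feature.

```python
RETRYABLE = {429, 500}  # rate limits and transient server errors

def should_retry(status: int, attempt: int, max_attempts: int = 3) -> bool:
    """Retry only transient failures, up to max_attempts tries."""
    return status in RETRYABLE and attempt < max_attempts

def backoff_seconds(attempt: int, base: float = 0.5) -> float:
    """Exponential backoff: 0.5s, 1s, 2s, ..."""
    return base * (2 ** attempt)

print(should_retry(429, 0))  # True
print(should_retry(400, 0))  # False
```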

Next Steps