# Human Review Guide
Add human oversight to your AI evaluation pipeline with review queues, calibration sets, and reviewer management.
## Why Human Review?
Automated evaluation catches most issues, but some scenarios require human judgment:
- **Ambiguous responses** where correctness depends on nuance
- **High-stakes domains** like legal, medical, or financial advice
- **Calibration** to ensure your automated graders align with human expectations
- **Edge cases** where the AI evaluator is uncertain
Human review complements automated evaluation — it does not replace it. Use it strategically for cases where automated scores fall below a confidence threshold.
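The threshold-plus-sampling routing can be sketched as a small helper. This is illustrative only (the function name and defaults are not SDK exports); it mirrors the `minConfidence` and `sampleRate` criteria used when creating a queue:

```typescript
// Hypothetical helper: decide whether an evaluation should be routed
// to human review. Low-confidence results always go to review; a small
// random sample of the rest is spot-checked.
function shouldRouteToHumanReview(
  automatedScore: number,
  minConfidence: number = 0.7,
  sampleRate: number = 0.05,
  random: () => number = Math.random,
): boolean {
  if (automatedScore < minConfidence) return true; // below threshold: always review
  return random() < sampleRate; // otherwise, spot-check a small sample
}

shouldRouteToHumanReview(0.65);         // true: below the 0.7 confidence threshold
shouldRouteToHumanReview(0.95, 0.7, 0); // false: confident, and sampling disabled
```

Injecting the `random` source keeps the sampling branch deterministic in tests.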
## Setting Up a Review Queue
### Install and configure the SDK

```typescript
import { init, humanReview } from '@thinkhive/sdk';

init({ apiKey: process.env.THINKHIVE_API_KEY });

const queue = await humanReview.createQueue({
  name: 'Support Agent Reviews',
  agentId: 'agent_123',
  criteria: {
    minConfidence: 0.7, // Route low-confidence evals to review
    sampleRate: 0.05,   // Also sample 5% of all evaluations
  },
});
```

### Assign reviewers
```typescript
await humanReview.addReviewers(queue.id, {
  reviewers: [
    { email: 'alice@company.com', role: 'lead' },
    { email: 'bob@company.com', role: 'reviewer' },
    { email: 'carol@company.com', role: 'reviewer' },
  ],
  assignmentStrategy: 'round_robin', // or 'least_busy', 'random'
});
```

### Create a calibration set
Calibration sets train reviewers on expected judgments before they enter the live queue.
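A reviewer's calibration score is, presumably, the fraction of items where their verdict matches the set's expected verdict. A minimal sketch of that agreement computation (the `calibrationScore` helper and `Verdict` type are illustrative, not SDK exports):

```typescript
type Verdict = 'pass' | 'fail' | 'uncertain';

// Illustrative: score a reviewer against a calibration set as the
// fraction of positions where their verdict matches the expected one.
function calibrationScore(expected: Verdict[], actual: Verdict[]): number {
  if (expected.length === 0) return 0;
  const matches = expected.filter((verdict, i) => verdict === actual[i]).length;
  return matches / expected.length;
}

// Agreeing on 4 of 5 items yields 0.8, meeting an 0.8 passing score.
calibrationScore(
  ['fail', 'pass', 'pass', 'fail', 'pass'],
  ['fail', 'pass', 'fail', 'fail', 'pass'],
); // 0.8
```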
```typescript
const calibrationSet = await humanReview.createCalibrationSet({
  queueId: queue.id,
  name: 'Onboarding Calibration',
  items: [
    {
      traceId: 'trace_abc',
      expectedVerdict: 'fail',
      explanation: 'Response fabricated a return policy that does not exist.',
    },
    {
      traceId: 'trace_def',
      expectedVerdict: 'pass',
      explanation: 'Response correctly cited the knowledge base article.',
    },
  ],
  passingScore: 0.8, // Reviewers must agree with 80% of expected verdicts
});
```

### Start reviewing
Reviewers receive items in the ThinkHive dashboard or via the API.
```typescript
// Fetch the next item from the queue
const item = await humanReview.getNextItem(queue.id);
console.log(item);
// {
//   id: 'review_001',
//   traceId: 'trace_xyz',
//   input: 'How do I reset my password?',
//   output: 'You can reset your password by...',
//   automatedScore: 0.65,
//   assignedTo: 'alice@company.com'
// }

// Submit a review verdict
await humanReview.submitVerdict(item.id, {
  verdict: 'fail',
  reason: 'Response omits the required 2FA step.',
  correctedOutput: 'To reset your password, first verify your identity via 2FA...',
});
```

## Queue Management
### Fetch queue status
```typescript
const status = await humanReview.getQueue(queue.id);
console.log(status);
// {
//   id: 'queue_001',
//   name: 'Support Agent Reviews',
//   pending: 23,
//   inProgress: 5,
//   completed: 142,
//   averageReviewTime: '3m 20s',
//   agreementRate: 0.87
// }
```

### Skip or reassign a review
```typescript
// Skip an item (returns it to the queue for another reviewer)
await humanReview.skip(item.id, {
  reason: 'Conflict of interest',
});

// Reassign to a specific reviewer
await humanReview.reassign(item.id, {
  to: 'carol@company.com',
  reason: 'Domain expertise required',
});
```

### Filter and list reviews
```typescript
const reviews = await humanReview.listReviews({
  queueId: queue.id,
  status: 'completed',
  verdict: 'fail',
  dateRange: { from: '2025-01-01', to: '2025-01-31' },
  limit: 50,
});
```

## Review Statistics
```typescript
const stats = await humanReview.getStats(queue.id);
console.log(stats);
// {
//   totalReviewed: 142,
//   verdicts: { pass: 98, fail: 37, uncertain: 7 },
//   averageReviewTime: 200, // seconds
//   interReviewerAgreement: 0.87,
//   calibrationScores: {
//     'alice@company.com': 0.92,
//     'bob@company.com': 0.85,
//     'carol@company.com': 0.88,
//   },
//   automatedVsHumanAgreement: 0.79,
// }
```

> ⚠️ If `automatedVsHumanAgreement` drops below 0.7, your automated graders may need recalibration. Consider updating grader prompts or thresholds based on human reviewer feedback.
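As a sketch, that agreement figure can be computed from paired automated and human verdicts. The helper names below are illustrative, not part of the SDK:

```typescript
type Verdict = 'pass' | 'fail' | 'uncertain';

// Illustrative: agreement is the fraction of paired results where the
// automated grader and the human reviewer reached the same verdict.
function automatedVsHumanAgreement(
  pairs: Array<{ automated: Verdict; human: Verdict }>,
): number {
  if (pairs.length === 0) return 1; // no evidence of disagreement yet
  return pairs.filter((p) => p.automated === p.human).length / pairs.length;
}

// Flag the queue for grader recalibration per the 0.7 guideline above.
function needsRecalibration(
  pairs: Array<{ automated: Verdict; human: Verdict }>,
): boolean {
  return automatedVsHumanAgreement(pairs) < 0.7;
}
```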
## REST API Reference
| Method | Endpoint | Description |
|---|---|---|
| POST | `/api/human-review/queues` | Create a review queue |
| GET | `/api/human-review/queues/:id` | Get queue status |
| POST | `/api/human-review/queues/:id/reviewers` | Add reviewers |
| GET | `/api/human-review/queues/:id/next` | Get next review item |
| POST | `/api/human-review/items/:id/verdict` | Submit a verdict |
| POST | `/api/human-review/items/:id/skip` | Skip a review item |
| POST | `/api/human-review/items/:id/reassign` | Reassign a review item |
| GET | `/api/human-review/queues/:id/stats` | Get review statistics |
| POST | `/api/human-review/calibration-sets` | Create a calibration set |
## Best Practices

### Review Priority by Confidence Score
| Confidence Range | Priority | Action |
|---|---|---|
| 0.0 — 0.5 | Critical | Immediate human review |
| 0.5 — 0.7 | High | Queue for next available reviewer |
| 0.7 — 0.9 | Medium | Sample-based review |
| 0.9 — 1.0 | Low | Automated only |
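The routing table above can be expressed as a small helper. The names are illustrative, and since the table's bands share boundary values (0.5, 0.7, 0.9), this sketch resolves each boundary score to the less urgent band:

```typescript
type Priority = 'critical' | 'high' | 'medium' | 'low';

// Map an automated confidence score to a review priority,
// following the bands in the table above.
function reviewPriority(confidence: number): Priority {
  if (confidence < 0.5) return 'critical'; // immediate human review
  if (confidence < 0.7) return 'high';     // next available reviewer
  if (confidence < 0.9) return 'medium';   // sample-based review
  return 'low';                            // automated only
}

reviewPriority(0.4);  // 'critical'
reviewPriority(0.65); // 'high'
reviewPriority(0.95); // 'low'
```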
- **Start with calibration** — onboard every reviewer through a calibration set before granting queue access
- **Use inter-reviewer agreement** to identify reviewers who need additional training
- **Feed human verdicts back** into your automated graders for continuous improvement
- **Set SLAs** for review turnaround to prevent queue buildup
- **Rotate reviewers** to avoid fatigue and bias
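For the SLA point, one minimal sketch is a backlog check against an agreed limit. The `QueueStatus` fields mirror the shape returned by `getQueue` shown earlier; the function and threshold are illustrative:

```typescript
// Illustrative SLA guardrail: treat the queue as breached when the
// number of outstanding items (pending + in progress) exceeds a limit.
interface QueueStatus {
  pending: number;
  inProgress: number;
}

function slaBreached(status: QueueStatus, maxBacklog: number): boolean {
  return status.pending + status.inProgress > maxBacklog;
}

slaBreached({ pending: 23, inProgress: 5 }, 25); // true: 28 items outstanding
slaBreached({ pending: 23, inProgress: 5 }, 50); // false
```

A cron job could run this against `getQueue` output and page the review lead when it returns true.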
## Next Steps
- Nondeterminism Testing — Measure evaluation reliability
- Evaluation — Set up automated evaluation
- API Reference — Full API documentation