Human Review Guide

Add human oversight to your AI evaluation pipeline with review queues, calibration sets, and reviewer management.

Why Human Review?

Automated evaluation catches most issues, but some scenarios require human judgment:

  • Ambiguous responses where correctness depends on nuance
  • High-stakes domains like legal, medical, or financial advice
  • Calibration to ensure your automated graders align with human expectations
  • Edge cases where the AI evaluator is uncertain

Human review complements automated evaluation — it does not replace it. Use it strategically for cases where automated scores fall below a confidence threshold.
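The threshold-routing idea can be sketched as a small decision function. This is illustrative only (`shouldReview` is not part of the SDK; the 0.7 cutoff and 5% sample rate mirror the queue criteria configured in the next section):

```typescript
interface EvalResult {
  traceId: string;
  confidence: number; // automated grader confidence, 0 to 1
}

// Route to human review when confidence is low; otherwise spot-check
// a random sample so reviewers still see high-confidence outputs.
function shouldReview(
  result: EvalResult,
  sampleRate = 0.05,
  minConfidence = 0.7,
): boolean {
  if (result.confidence < minConfidence) return true; // always review low-confidence evals
  return Math.random() < sampleRate; // sample a fraction of the rest
}
```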

Setting Up a Review Queue

Install and configure the SDK

import { init, humanReview } from '@thinkhive/sdk';
 
init({ apiKey: process.env.THINKHIVE_API_KEY });
 
const queue = await humanReview.createQueue({
  name: 'Support Agent Reviews',
  agentId: 'agent_123',
  criteria: {
    minConfidence: 0.7, // Route low-confidence evals to review
    sampleRate: 0.05,   // Also sample 5% of all evaluations
  },
});

Assign reviewers

await humanReview.addReviewers(queue.id, {
  reviewers: [
    { email: 'alice@company.com', role: 'lead' },
    { email: 'bob@company.com', role: 'reviewer' },
    { email: 'carol@company.com', role: 'reviewer' },
  ],
  assignmentStrategy: 'round_robin', // or 'least_busy', 'random'
});
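The `round_robin` strategy cycles through the reviewer list in order, wrapping around. Assignment happens server-side; this minimal sketch (`nextAssignee` is a hypothetical helper, not an SDK function) just shows the idea:

```typescript
// Round-robin assignment: item N goes to reviewer N modulo the list length.
function nextAssignee(reviewers: string[], itemIndex: number): string {
  return reviewers[itemIndex % reviewers.length];
}
```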

Create a calibration set

Calibration sets train reviewers on expected judgments before they enter the live queue.

const calibrationSet = await humanReview.createCalibrationSet({
  queueId: queue.id,
  name: 'Onboarding Calibration',
  items: [
    {
      traceId: 'trace_abc',
      expectedVerdict: 'fail',
      explanation: 'Response fabricated a return policy that does not exist.',
    },
    {
      traceId: 'trace_def',
      expectedVerdict: 'pass',
      explanation: 'Response correctly cited the knowledge base article.',
    },
  ],
  passingScore: 0.8, // Reviewers must agree with 80% of expected verdicts
});
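A reviewer's calibration score is the fraction of calibration items where their verdict matches the expected verdict; they pass onboarding when it meets `passingScore`. The platform computes this server-side, but the math is simple (sketch only, `calibrationScore` is not an SDK function):

```typescript
interface CalibrationItem {
  traceId: string;
  expectedVerdict: string;
}

// Fraction of calibration items where the reviewer agreed with the
// expected verdict, keyed by traceId.
function calibrationScore(
  items: CalibrationItem[],
  reviewerVerdicts: Record<string, string>,
): number {
  if (items.length === 0) return 0;
  const agreed = items.filter(
    (item) => reviewerVerdicts[item.traceId] === item.expectedVerdict,
  ).length;
  return agreed / items.length;
}
```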

Start reviewing

Reviewers receive items in the ThinkHive dashboard or via the API.

// Fetch the next item from the queue
const item = await humanReview.getNextItem(queue.id);
 
console.log(item);
// {
//   id: 'review_001',
//   traceId: 'trace_xyz',
//   input: 'How do I reset my password?',
//   output: 'You can reset your password by...',
//   automatedScore: 0.65,
//   assignedTo: 'alice@company.com'
// }
 
// Submit a review verdict
await humanReview.submitVerdict(item.id, {
  verdict: 'fail',
  reason: 'Response omits the required 2FA step.',
  correctedOutput: 'To reset your password, first verify your identity via 2FA...',
});

Queue Management

Fetch queue status

const status = await humanReview.getQueue(queue.id);
 
console.log(status);
// {
//   id: 'queue_001',
//   name: 'Support Agent Reviews',
//   pending: 23,
//   inProgress: 5,
//   completed: 142,
//   averageReviewTime: '3m 20s',
//   agreementRate: 0.87
// }

Skip or reassign a review

// Skip an item (returns it to the queue for another reviewer)
await humanReview.skip(item.id, {
  reason: 'Conflict of interest',
});
 
// Reassign to a specific reviewer
await humanReview.reassign(item.id, {
  to: 'carol@company.com',
  reason: 'Domain expertise required',
});

Filter and list reviews

const reviews = await humanReview.listReviews({
  queueId: queue.id,
  status: 'completed',
  verdict: 'fail',
  dateRange: { from: '2025-01-01', to: '2025-01-31' },
  limit: 50,
});

Review Statistics

const stats = await humanReview.getStats(queue.id);
 
console.log(stats);
// {
//   totalReviewed: 142,
//   verdicts: { pass: 98, fail: 37, uncertain: 7 },
//   averageReviewTime: 200, // seconds
//   interReviewerAgreement: 0.87,
//   calibrationScores: {
//     'alice@company.com': 0.92,
//     'bob@company.com': 0.85,
//     'carol@company.com': 0.88,
//   },
//   automatedVsHumanAgreement: 0.79,
// }
⚠️ If `automatedVsHumanAgreement` drops below 0.7, your automated graders may need recalibration. Consider updating grader prompts or thresholds based on human reviewer feedback.
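You can turn this check into a monitoring helper that inspects the stats payload. This is a hypothetical sketch (the 0.7 grader threshold follows the note above; the 0.8 reviewer threshold reuses the calibration `passingScore` from earlier and is an assumption):

```typescript
interface ReviewStats {
  automatedVsHumanAgreement: number;
  calibrationScores: Record<string, number>; // reviewer email -> score
}

// Flags graders that may need recalibration and reviewers whose
// calibration score has fallen below the passing threshold.
function calibrationAlerts(stats: ReviewStats) {
  return {
    graderNeedsRecalibration: stats.automatedVsHumanAgreement < 0.7,
    reviewersNeedingTraining: Object.entries(stats.calibrationScores)
      .filter(([, score]) => score < 0.8)
      .map(([email]) => email),
  };
}
```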

REST API Reference

| Method | Endpoint                                   | Description              |
| ------ | ------------------------------------------ | ------------------------ |
| POST   | `/api/human-review/queues`                 | Create a review queue    |
| GET    | `/api/human-review/queues/:id`             | Get queue status         |
| POST   | `/api/human-review/queues/:id/reviewers`   | Add reviewers            |
| GET    | `/api/human-review/queues/:id/next`        | Get next review item     |
| POST   | `/api/human-review/items/:id/verdict`      | Submit a verdict         |
| POST   | `/api/human-review/items/:id/skip`         | Skip a review item       |
| POST   | `/api/human-review/items/:id/reassign`     | Reassign a review item   |
| GET    | `/api/human-review/queues/:id/stats`       | Get review statistics    |
| POST   | `/api/human-review/calibration-sets`       | Create a calibration set |
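If you call the REST API directly instead of through the SDK, a request might look like the sketch below. The base URL and bearer-token auth scheme are assumptions; check your workspace settings for the actual values:

```typescript
const BASE_URL = "https://api.thinkhive.example"; // assumption, not the documented host

// Builds the queue-status path from the endpoint table above.
function queueStatusPath(queueId: string): string {
  return `/api/human-review/queues/${queueId}`;
}

async function getQueueStatus(queueId: string, apiKey: string) {
  const res = await fetch(`${BASE_URL}${queueStatusPath(queueId)}`, {
    headers: { Authorization: `Bearer ${apiKey}` }, // assumed auth scheme
  });
  if (!res.ok) throw new Error(`Request failed: ${res.status}`);
  return res.json();
}
```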

Best Practices

Review Priority by Confidence Score

| Confidence Range | Priority | Action                            |
| ---------------- | -------- | --------------------------------- |
| 0.0–0.5          | Critical | Immediate human review            |
| 0.5–0.7          | High     | Queue for next available reviewer |
| 0.7–0.9          | Medium   | Sample-based review               |
| 0.9–1.0          | Low      | Automated only                    |
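The priority table maps directly to a small lookup function. This sketch treats each range as half-open (lower bound inclusive), which is an assumption about boundary handling:

```typescript
type Priority = "critical" | "high" | "medium" | "low";

// Maps an automated confidence score to a review priority per the table
// above. Boundaries are half-open: e.g. exactly 0.5 is treated as "high".
function reviewPriority(confidence: number): Priority {
  if (confidence < 0.5) return "critical";
  if (confidence < 0.7) return "high";
  if (confidence < 0.9) return "medium";
  return "low";
}
```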
  1. Start with calibration — onboard every reviewer through a calibration set before granting queue access
  2. Use inter-reviewer agreement to identify reviewers who need additional training
  3. Feed human verdicts back into your automated graders for continuous improvement
  4. Set SLAs for review turnaround to prevent queue buildup
  5. Rotate reviewers to avoid fatigue and bias
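For practice 2, a quick local sanity check is pairwise percent agreement: the fraction of shared items where two reviewers returned the same verdict. The platform's `interReviewerAgreement` metric may use a chance-corrected statistic such as Cohen's kappa instead; this sketch shows only the simplest version:

```typescript
// Percent agreement between two reviewers over the same ordered items.
function percentAgreement(a: string[], b: string[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("verdict lists must be non-empty and the same length");
  }
  const agreed = a.filter((verdict, i) => verdict === b[i]).length;
  return agreed / a.length;
}
```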

Next Steps