# Human Review Guide
Add human oversight to your AI evaluation pipeline with review queues, calibration sets, and reviewer management.
## Why Human Review?
Automated evaluation catches most issues, but some scenarios require human judgment:
- **Ambiguous responses** where correctness depends on nuance
- **High-stakes domains** like legal, medical, or financial advice
- **Calibration** to ensure your automated graders align with human expectations
- **Edge cases** where the AI evaluator is uncertain
Human review complements automated evaluation — it does not replace it. Use it strategically for cases where automated scores fall below a confidence threshold.
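The threshold-plus-sampling routing can be sketched as a small helper. This is illustrative only (the function name and defaults are not SDK exports); it mirrors the `minConfidence` and `sampleRate` criteria used when creating a queue:

```typescript
// Hypothetical helper: decide whether an evaluation should be routed
// to human review. Low-confidence results always go to review; a small
// random sample of the rest is spot-checked.
function shouldRouteToHumanReview(
  automatedScore: number,
  minConfidence: number = 0.7,
  sampleRate: number = 0.05,
  random: () => number = Math.random,
): boolean {
  if (automatedScore < minConfidence) return true; // below threshold: always review
  return random() < sampleRate; // otherwise, spot-check a small sample
}

shouldRouteToHumanReview(0.65);         // true: below the 0.7 confidence threshold
shouldRouteToHumanReview(0.95, 0.7, 0); // false: confident, and sampling disabled
```

Injecting the `random` source keeps the sampling branch deterministic in tests.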
## Setting Up a Review Queue
### Install and configure the SDK

```typescript
import { init, humanReview } from '@thinkhive/sdk';

init({ apiKey: process.env.THINKHIVE_API_KEY });

const queue = await humanReview.createQueue({
  name: 'Support Agent Reviews',
  agentId: 'agent_123',
  criteria: {
    minConfidence: 0.7, // Route low-confidence evals to review
    sampleRate: 0.05,   // Also sample 5% of all evaluations
  },
});
```

### Assign reviewers
```typescript
await humanReview.addReviewers(queue.id, {
  reviewers: [
    { email: 'alice@company.com', role: 'lead' },
    { email: 'bob@company.com', role: 'reviewer' },
    { email: 'carol@company.com', role: 'reviewer' },
  ],
  assignmentStrategy: 'round_robin', // or 'least_busy', 'random'
});
```

### Create a calibration set
Calibration sets train reviewers on expected judgments before they enter the live queue.
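A reviewer's calibration score is, presumably, the fraction of items where their verdict matches the set's expected verdict. A minimal sketch of that agreement computation (the `calibrationScore` helper and `Verdict` type are illustrative, not SDK exports):

```typescript
type Verdict = 'pass' | 'fail' | 'uncertain';

// Illustrative: score a reviewer against a calibration set as the
// fraction of positions where their verdict matches the expected one.
function calibrationScore(expected: Verdict[], actual: Verdict[]): number {
  if (expected.length === 0) return 0;
  const matches = expected.filter((verdict, i) => verdict === actual[i]).length;
  return matches / expected.length;
}

// Agreeing on 4 of 5 items yields 0.8, meeting an 0.8 passing score.
calibrationScore(
  ['fail', 'pass', 'pass', 'fail', 'pass'],
  ['fail', 'pass', 'fail', 'fail', 'pass'],
); // 0.8
```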
```typescript
const calibrationSet = await humanReview.createCalibrationSet({
  queueId: queue.id,
  name: 'Onboarding Calibration',
  items: [
    {
      traceId: 'trace_abc',
      expectedVerdict: 'fail',
      explanation: 'Response fabricated a return policy that does not exist.',
    },
    {
      traceId: 'trace_def',
      expectedVerdict: 'pass',
      explanation: 'Response correctly cited the knowledge base article.',
    },
  ],
  passingScore: 0.8, // Reviewers must agree with 80% of expected verdicts
});
```

### Start reviewing
Reviewers receive items in the ThinkHive dashboard or via the API.
```typescript
// Fetch the next item from the queue
const item = await humanReview.getNextItem(queue.id);
console.log(item);
// {
//   id: 'review_001',
//   traceId: 'trace_xyz',
//   input: 'How do I reset my password?',
//   output: 'You can reset your password by...',
//   automatedScore: 0.65,
//   assignedTo: 'alice@company.com'
// }

// Submit a review verdict
await humanReview.submitVerdict(item.id, {
  verdict: 'fail',
  reason: 'Response omits the required 2FA step.',
  correctedOutput: 'To reset your password, first verify your identity via 2FA...',
});
```

## Queue Management
### Fetch queue status
```typescript
const status = await humanReview.getQueue(queue.id);
console.log(status);
// {
//   id: 'queue_001',
//   name: 'Support Agent Reviews',
//   pending: 23,
//   inProgress: 5,
//   completed: 142,
//   averageReviewTime: '3m 20s',
//   agreementRate: 0.87
// }
```

### Skip or reassign a review
```typescript
// Skip an item (returns it to the queue for another reviewer)
await humanReview.skip(item.id, {
  reason: 'Conflict of interest',
});

// Reassign to a specific reviewer
await humanReview.reassign(item.id, {
  to: 'carol@company.com',
  reason: 'Domain expertise required',
});
```

### Filter and list reviews
```typescript
const reviews = await humanReview.listReviews({
  queueId: queue.id,
  status: 'completed',
  verdict: 'fail',
  dateRange: { from: '2025-01-01', to: '2025-01-31' },
  limit: 50,
});
```

## Review Statistics
```typescript
const stats = await humanReview.getStats(queue.id);
console.log(stats);
// {
//   totalReviewed: 142,
//   verdicts: { pass: 98, fail: 37, uncertain: 7 },
//   averageReviewTime: 200, // seconds
//   interReviewerAgreement: 0.87,
//   calibrationScores: {
//     'alice@company.com': 0.92,
//     'bob@company.com': 0.85,
//     'carol@company.com': 0.88,
//   },
//   automatedVsHumanAgreement: 0.79,
// }
```

> ⚠️ If `automatedVsHumanAgreement` drops below 0.7, your automated graders may need recalibration. Consider updating grader prompts or thresholds based on human reviewer feedback.
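As a sketch, that agreement figure can be computed from paired automated and human verdicts. The helper names below are illustrative, not part of the SDK:

```typescript
type Verdict = 'pass' | 'fail' | 'uncertain';

// Illustrative: agreement is the fraction of paired results where the
// automated grader and the human reviewer reached the same verdict.
function automatedVsHumanAgreement(
  pairs: Array<{ automated: Verdict; human: Verdict }>,
): number {
  if (pairs.length === 0) return 1; // no evidence of disagreement yet
  return pairs.filter((p) => p.automated === p.human).length / pairs.length;
}

// Flag the queue for grader recalibration per the 0.7 guideline above.
function needsRecalibration(
  pairs: Array<{ automated: Verdict; human: Verdict }>,
): boolean {
  return automatedVsHumanAgreement(pairs) < 0.7;
}
```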
## REST API Reference
| Method | Endpoint | Description |
|---|---|---|
| POST | `/api/human-review/queues` | Create a review queue |
| GET | `/api/human-review/queues/:id` | Get queue status |
| POST | `/api/human-review/queues/:id/reviewers` | Add reviewers |
| GET | `/api/human-review/queues/:id/next` | Get next review item |
| POST | `/api/human-review/items/:id/verdict` | Submit a verdict |
| POST | `/api/human-review/items/:id/skip` | Skip a review item |
| POST | `/api/human-review/items/:id/reassign` | Reassign a review item |
| GET | `/api/human-review/queues/:id/stats` | Get review statistics |
| POST | `/api/human-review/calibration-sets` | Create a calibration set |
## Best Practices

### Review Priority by Confidence Score
| Confidence Range | Priority | Action |
|---|---|---|
| 0.0 — 0.5 | Critical | Immediate human review |
| 0.5 — 0.7 | High | Queue for next available reviewer |
| 0.7 — 0.9 | Medium | Sample-based review |
| 0.9 — 1.0 | Low | Automated only |
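The routing table above can be expressed as a small helper. The names are illustrative, and since the table's bands share boundary values (0.5, 0.7, 0.9), this sketch resolves each boundary score to the less urgent band:

```typescript
type Priority = 'critical' | 'high' | 'medium' | 'low';

// Map an automated confidence score to a review priority,
// following the bands in the table above.
function reviewPriority(confidence: number): Priority {
  if (confidence < 0.5) return 'critical'; // immediate human review
  if (confidence < 0.7) return 'high';     // next available reviewer
  if (confidence < 0.9) return 'medium';   // sample-based review
  return 'low';                            // automated only
}

reviewPriority(0.4);  // 'critical'
reviewPriority(0.65); // 'high'
reviewPriority(0.95); // 'low'
```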
- **Start with calibration** — onboard every reviewer through a calibration set before granting queue access
- **Use inter-reviewer agreement** to identify reviewers who need additional training
- **Feed human verdicts back** into your automated graders for continuous improvement
- **Set SLAs** for review turnaround to prevent queue buildup
- **Rotate reviewers** to avoid fatigue and bias
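For the SLA point, one minimal sketch is a backlog check against an agreed limit. The `QueueStatus` fields mirror the shape returned by `getQueue` shown earlier; the function and threshold are illustrative:

```typescript
// Illustrative SLA guardrail: treat the queue as breached when the
// number of outstanding items (pending + in progress) exceeds a limit.
interface QueueStatus {
  pending: number;
  inProgress: number;
}

function slaBreached(status: QueueStatus, maxBacklog: number): boolean {
  return status.pending + status.inProgress > maxBacklog;
}

slaBreached({ pending: 23, inProgress: 5 }, 25); // true: 28 items outstanding
slaBreached({ pending: 23, inProgress: 5 }, 50); // false
```

A cron job could run this against `getQueue` output and page the review lead when it returns true.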
## Next Steps
- Nondeterminism Testing — Measure evaluation reliability
- Evaluation — Set up automated evaluation
- API Reference — Full API documentation