Nondeterminism Testing Guide
LLM outputs are inherently nondeterministic. Learn how to measure reliability and design evaluations you can trust.
Why Nondeterminism Matters
When you evaluate an AI response, the evaluation itself uses an LLM — and that evaluator LLM may return different scores on identical inputs. A single evaluation run can be misleading:
- An evaluation that passes once might fail on a second run
- Borderline scores are especially unreliable
- Aggregated metrics can mask instability in individual cases
A single evaluation run tells you what happened once. Pass@k analysis tells you what happens on average — and whether you can trust the result.
Core Concepts
Pass@k
Pass@k measures the probability that at least one out of k evaluation runs returns a passing result. It answers: “If I run this evaluation k times, what is the chance it passes at least once?”
pass@k = 1 - ((n - c) / n)^k

Where n is total runs, c is passing runs, and k is the sample size. For example, with 10 runs and 8 passing: pass@3 = 1 - (2/10)³ = 0.992.
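The formula translates directly to code. A minimal sketch (an illustrative helper, not part of the `@thinkhive/sdk` API):

```typescript
// pass@k = 1 - ((n - c) / n)^k, where n = total runs, c = passing runs.
// Illustrative helper; not part of the @thinkhive/sdk API.
function passAtK(totalRuns: number, passingRuns: number, k: number): number {
  const failRate = (totalRuns - passingRuns) / totalRuns;
  return 1 - Math.pow(failRate, k);
}

// Matches the worked example: 10 runs, 8 passing, k = 3.
console.log(passAtK(10, 8, 3).toFixed(3)); // "0.992"
```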
Pass-to-k
Pass-to-k (also called “consistent pass”) measures the probability that all k runs pass: pass-to-k = (c / n)^k. It answers: “If I run this evaluation k times, will it pass every time?”
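Assuming the same with-replacement model as the pass@k formula, pass-to-k is the per-run pass rate raised to the k-th power. A sketch (not the SDK's implementation):

```typescript
// pass-to-k = (c / n)^k: probability that all k sampled runs pass.
// Illustrative helper; not part of the @thinkhive/sdk API.
function passToK(totalRuns: number, passingRuns: number, k: number): number {
  return Math.pow(passingRuns / totalRuns, k);
}

// With 10 runs and 8 passing, the chance all 3 sampled runs pass is ~0.512.
console.log(passToK(10, 8, 3).toFixed(3)); // "0.512"
```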
Reliability Score
A composite metric combining pass@k and pass-to-k to give a single reliability rating for an evaluation.
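The SDK does not document the exact weighting, but for intuition, one hypothetical composite simply averages the two metrics (an assumption, purely illustrative):

```typescript
// Hypothetical composite, for intuition only -- the SDK's actual
// reliability formula is not documented here. Equal weighting is an assumption.
function reliabilityScore(passAtK: number, passToK: number): number {
  return 0.5 * passAtK + 0.5 * passToK;
}

// Using the worked example's pass@3 and pass-to-3 values.
console.log(reliabilityScore(0.992, 0.512).toFixed(3)); // "0.752"
```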
Running Pass@k Analysis
Configure nondeterminism testing
import { nondeterminism } from '@thinkhive/sdk';
// Run the same evaluation multiple times
const analysis = await nondeterminism.analyze({
traceId: 'trace_abc',
evaluatorId: 'eval_groundedness',
runs: 10, // Number of repeated evaluations
});
console.log(analysis);
// {
// traceId: 'trace_abc',
// runs: 10,
// passCount: 8,
// failCount: 2,
// scores: [0.82, 0.91, 0.78, 0.85, 0.88, 0.79, 0.92, 0.84, 0.76, 0.87],
// mean: 0.842,
// stdDev: 0.052,
// passAtK: { 1: 0.80, 3: 0.99, 5: 1.0 },
// passToK: { 1: 0.80, 3: 0.51, 5: 0.33 },
// reliability: 'low',
// }

Calculate pass@k manually
You can also compute pass@k on your own data using the SDK helpers.
import { nondeterminism } from '@thinkhive/sdk';
const passAtK = nondeterminism.calculatePassAtK({
totalRuns: 10,
passingRuns: 8,
k: 3,
});
// 0.992 -- 99.2% chance at least 1 of 3 runs passes
const passToK = nondeterminism.calculatePassToK({
totalRuns: 10,
passingRuns: 8,
k: 3,
});
// 0.512 -- 51.2% chance all 3 runs pass

Check evaluation reliability
const reliable = nondeterminism.isReliableEvaluation({
totalRuns: 10,
passingRuns: 8,
threshold: 0.9, // require 90% pass-to-3 for "reliable"
k: 3,
});
console.log(reliable);
// {
// reliable: false,
// passToK: 0.512,
// recommendation: 'Increase evaluation runs or adjust pass threshold'
// }

Interpreting Results
Reliability Ratings
| Rating | Pass-to-3 | What It Means |
|---|---|---|
| High | > 0.90 | Evaluation is consistent. Single-run results are trustworthy. |
| Moderate | 0.70 - 0.90 | Some instability. Use majority voting (best of 3). |
| Low | 0.50 - 0.70 | Significant variance. Use best of 5 or adjust evaluator. |
| Unreliable | < 0.50 | Evaluation is essentially random. Redesign the evaluator. |
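The rating bands above can be sketched as a simple mapping. Boundary handling (inclusive at the lower edge of each band) is an assumption; this is illustrative, not the SDK's code:

```typescript
type Rating = 'high' | 'moderate' | 'low' | 'unreliable';

// Maps a pass-to-3 value onto the rating bands in the table above.
// Treating each band's lower edge as inclusive is an assumption.
function rateReliability(passTo3: number): Rating {
  if (passTo3 > 0.9) return 'high';
  if (passTo3 >= 0.7) return 'moderate';
  if (passTo3 >= 0.5) return 'low';
  return 'unreliable';
}

// 8/10 passing runs gives pass-to-3 of ~0.512, a "low" rating --
// consistent with reliability: 'low' in the analyze() example earlier.
console.log(rateReliability(0.512)); // "low"
```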
Score Distribution
const distribution = await nondeterminism.scoreDistribution({
traceId: 'trace_abc',
evaluatorId: 'eval_groundedness',
runs: 20,
});
console.log(distribution);
// {
// histogram: {
// '0.0-0.2': 0, '0.2-0.4': 0, '0.4-0.6': 2,
// '0.6-0.8': 7, '0.8-1.0': 11
// },
// bimodal: false,
// recommendation: 'Scores cluster in the 0.6-1.0 range. Evaluation is moderately stable.'
// }

Bimodal distributions (scores clustering at both extremes) usually indicate that the evaluator prompt is ambiguous. The LLM is “deciding” differently each time rather than scoring on a gradient.
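A crude way to spot this pattern in your own score samples (a heuristic sketch, not the SDK's detection logic; the 0.25/0.75 cutoffs and the 0.3 share threshold are assumptions):

```typescript
// Flags a score sample as bimodal when a substantial share of the
// mass sits at both extremes of the [0, 1] range.
function looksBimodal(scores: number[]): boolean {
  const lowShare = scores.filter((s) => s < 0.25).length / scores.length;
  const highShare = scores.filter((s) => s > 0.75).length / scores.length;
  return lowShare > 0.3 && highShare > 0.3;
}

// Half the scores near 0, half near 1: the evaluator is "deciding",
// not scoring on a gradient.
console.log(looksBimodal([0.1, 0.15, 0.9, 0.95, 0.88, 0.05])); // true
```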
Designing Reliable Evaluations
1. Use majority voting
import { nondeterminism } from '@thinkhive/sdk';
const result = await nondeterminism.majorityVote({
traceId: 'trace_abc',
evaluatorId: 'eval_groundedness',
runs: 3,
passThreshold: 0.8,
});
console.log(result);
// {
// votes: [{ score: 0.85, pass: true }, { score: 0.72, pass: false }, { score: 0.88, pass: true }],
// majorityVerdict: 'pass',
// confidence: 0.67
// }

2. Lower evaluator temperature
Lower temperature reduces variance but can make the evaluator less nuanced.
const analysis = await nondeterminism.analyze({
traceId: 'trace_abc',
evaluatorId: 'eval_groundedness',
runs: 10,
evaluatorConfig: {
temperature: 0.1, // Lower temperature for more deterministic scoring
},
});

3. Improve evaluator prompts
Vague criteria produce inconsistent scores. Be specific:
- "Rate whether the response is good."
+ "Rate whether the response answers the user's question using only
+ information from the provided context. Score 1.0 if all claims are
+ supported, 0.5 if some claims are unsupported, 0.0 if the response
+ contradicts the context."

Batch Nondeterminism Analysis
Run consistency checks across your entire evaluation suite.
const batchResults = await nondeterminism.batchAnalyze({
agentId: 'agent_123',
evaluatorId: 'eval_groundedness',
sampleSize: 100,
runsPerTrace: 5,
});
console.log(batchResults.summary);
// {
// totalTraces: 100,
// reliableEvaluations: 72,
// moderateEvaluations: 21,
// unreliableEvaluations: 7,
// overallReliability: 0.82,
// recommendation: 'Consider majority voting for 28 unstable traces.'
// }

Best Practices
When to Run Nondeterminism Testing
- When introducing a new evaluator — validate it is consistent before trusting it
- After changing evaluator prompts or thresholds
- After switching LLM providers or models
- Periodically (weekly) as a health check on evaluation stability
- Run at least 5 repetitions for meaningful pass@k estimates
- Use pass-to-3 as your primary reliability metric — it balances cost and confidence
- Investigate bimodal distributions — they indicate evaluator prompt issues
- Set reliability gates — require pass-to-3 > 0.9 before trusting single-run evaluations
- Document your reliability baselines so you can detect evaluator drift
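A reliability gate along these lines can be wired into CI. A sketch with a local pass-to-k computation standing in for the SDK helper; the 0.9 threshold is your own policy, per the practice above:

```typescript
// Gate: refuse to trust single-run evaluations until pass-to-3 > 0.9.
// The local (c / n)^3 computation stands in for the SDK's
// calculatePassToK helper; the 0.9 threshold is a policy choice.
function assertReliabilityGate(totalRuns: number, passingRuns: number): void {
  const passTo3 = Math.pow(passingRuns / totalRuns, 3);
  if (passTo3 <= 0.9) {
    throw new Error(
      `Evaluator below reliability gate: pass-to-3 = ${passTo3.toFixed(3)} (need > 0.9)`
    );
  }
}

assertReliabilityGate(10, 10); // passes: pass-to-3 = 1.0
// assertReliabilityGate(10, 8) would throw: pass-to-3 is only ~0.512
```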
Next Steps
- Human Review — Add human oversight for unreliable evaluations
- Evaluation — Set up automated evaluation pipelines
- API Reference — Full API documentation