Nondeterminism Testing Guide
LLM outputs are inherently nondeterministic. Learn how to measure reliability and design evaluations you can trust.
Why Nondeterminism Matters
When you evaluate an AI response, the evaluation itself uses an LLM — and that evaluator LLM may return different scores on identical inputs. A single evaluation run can be misleading:
- An evaluation that passes once might fail on a second run
- Borderline scores are especially unreliable
- Aggregated metrics can mask instability in individual cases
A single evaluation run tells you what happened once. Pass@k analysis tells you what happens on average — and whether you can trust the result.
Core Concepts
Pass@k
Pass@k measures the probability that at least one out of k evaluation runs returns a passing result. It answers: “If I run this evaluation k times, what is the chance it passes at least once?”
pass@k = 1 - ((n - c) / n)^k

Where n is total runs, c is passing runs, and k is the sample size. For example, with 10 runs and 8 passing: pass@3 = 1 - (2/10)³ = 0.992.
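The formula translates directly to code. A minimal sketch (an illustrative helper, not part of the `@thinkhive/sdk` API):

```typescript
// pass@k = 1 - ((n - c) / n)^k, where n = total runs, c = passing runs.
// Illustrative helper; not part of the @thinkhive/sdk API.
function passAtK(totalRuns: number, passingRuns: number, k: number): number {
  const failRate = (totalRuns - passingRuns) / totalRuns;
  return 1 - Math.pow(failRate, k);
}

// Matches the worked example: 10 runs, 8 passing, k = 3.
console.log(passAtK(10, 8, 3).toFixed(3)); // "0.992"
```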
Pass-to-k
Pass-to-k (also called “consistent pass”) measures the probability that all k runs pass: pass-to-k = (c / n)^k. It answers: “If I run this evaluation k times, will it pass every time?”
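Assuming the same with-replacement model as the pass@k formula, pass-to-k is the per-run pass rate raised to the k-th power. A sketch (not the SDK's implementation):

```typescript
// pass-to-k = (c / n)^k: probability that all k sampled runs pass.
// Illustrative helper; not part of the @thinkhive/sdk API.
function passToK(totalRuns: number, passingRuns: number, k: number): number {
  return Math.pow(passingRuns / totalRuns, k);
}

// With 10 runs and 8 passing, the chance all 3 sampled runs pass is ~0.512.
console.log(passToK(10, 8, 3).toFixed(3)); // "0.512"
```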
Reliability Score
A composite metric combining pass@k and pass-to-k to give a single reliability rating for an evaluation.
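The SDK does not document the exact weighting, but for intuition, one hypothetical composite simply averages the two metrics (an assumption, purely illustrative):

```typescript
// Hypothetical composite, for intuition only -- the SDK's actual
// reliability formula is not documented here. Equal weighting is an assumption.
function reliabilityScore(passAtK: number, passToK: number): number {
  return 0.5 * passAtK + 0.5 * passToK;
}

// Using the worked example's pass@3 and pass-to-3 values.
console.log(reliabilityScore(0.992, 0.512).toFixed(3)); // "0.752"
```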
Running Pass@k Analysis
Configure nondeterminism testing
import { nondeterminism } from '@thinkhive/sdk';
// Run the same evaluation multiple times
const analysis = await nondeterminism.analyze({
traceId: 'trace_abc',
evaluatorId: 'eval_groundedness',
runs: 10, // Number of repeated evaluations
});
console.log(analysis);
// {
// traceId: 'trace_abc',
// runs: 10,
// passCount: 8,
// failCount: 2,
// scores: [0.82, 0.91, 0.78, 0.85, 0.88, 0.79, 0.92, 0.84, 0.76, 0.87],
// mean: 0.842,
// stdDev: 0.052,
// passAtK: { 1: 0.80, 3: 0.99, 5: 1.0 },
// passToK: { 1: 0.80, 3: 0.51, 5: 0.33 },
// reliability: 'low',
// }

Calculate pass@k manually
You can also compute pass@k on your own data using the SDK helpers.
import { nondeterminism } from '@thinkhive/sdk';
const passAtK = nondeterminism.calculatePassAtK({
totalRuns: 10,
passingRuns: 8,
k: 3,
});
// 0.992 -- 99.2% chance at least 1 of 3 runs passes
const passToK = nondeterminism.calculatePassToK({
totalRuns: 10,
passingRuns: 8,
k: 3,
});
// 0.512 -- 51.2% chance all 3 runs pass

Check evaluation reliability
const reliable = nondeterminism.isReliableEvaluation({
totalRuns: 10,
passingRuns: 8,
threshold: 0.9, // require 90% pass-to-3 for "reliable"
k: 3,
});
console.log(reliable);
// {
// reliable: false,
// passToK: 0.512,
// recommendation: 'Increase evaluation runs or adjust pass threshold'
// }

Interpreting Results
Reliability Ratings
| Rating | Pass-to-3 | What It Means |
|---|---|---|
| High | > 0.90 | Evaluation is consistent. Single-run results are trustworthy. |
| Moderate | 0.70 - 0.90 | Some instability. Use majority voting (best of 3). |
| Low | 0.50 - 0.70 | Significant variance. Use best of 5 or adjust evaluator. |
| Unreliable | < 0.50 | Evaluation is essentially random. Redesign the evaluator. |
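The rating bands above can be sketched as a simple mapping. Boundary handling (inclusive at the lower edge of each band) is an assumption; this is illustrative, not the SDK's code:

```typescript
type Rating = 'high' | 'moderate' | 'low' | 'unreliable';

// Maps a pass-to-3 value onto the rating bands in the table above.
// Treating each band's lower edge as inclusive is an assumption.
function rateReliability(passTo3: number): Rating {
  if (passTo3 > 0.9) return 'high';
  if (passTo3 >= 0.7) return 'moderate';
  if (passTo3 >= 0.5) return 'low';
  return 'unreliable';
}

// 8/10 passing runs gives pass-to-3 of ~0.512, a "low" rating --
// consistent with reliability: 'low' in the analyze() example earlier.
console.log(rateReliability(0.512)); // "low"
```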
Score Distribution
const distribution = await nondeterminism.scoreDistribution({
traceId: 'trace_abc',
evaluatorId: 'eval_groundedness',
runs: 20,
});
console.log(distribution);
// {
// histogram: {
// '0.0-0.2': 0, '0.2-0.4': 0, '0.4-0.6': 2,
// '0.6-0.8': 7, '0.8-1.0': 11
// },
// bimodal: false,
// recommendation: 'Scores cluster in the 0.6-1.0 range. Evaluation is moderately stable.'
// }

Bimodal distributions (scores clustering at both extremes) usually indicate that the evaluator prompt is ambiguous. The LLM is “deciding” differently each time rather than scoring on a gradient.
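A crude way to spot this pattern in your own score samples (a heuristic sketch, not the SDK's detection logic; the 0.25/0.75 cutoffs and the 0.3 share threshold are assumptions):

```typescript
// Flags a score sample as bimodal when a substantial share of the
// mass sits at both extremes of the [0, 1] range.
function looksBimodal(scores: number[]): boolean {
  const lowShare = scores.filter((s) => s < 0.25).length / scores.length;
  const highShare = scores.filter((s) => s > 0.75).length / scores.length;
  return lowShare > 0.3 && highShare > 0.3;
}

// Half the scores near 0, half near 1: the evaluator is "deciding",
// not scoring on a gradient.
console.log(looksBimodal([0.1, 0.15, 0.9, 0.95, 0.88, 0.05])); // true
```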
Designing Reliable Evaluations
1. Use majority voting
import { nondeterminism } from '@thinkhive/sdk';
const result = await nondeterminism.majorityVote({
traceId: 'trace_abc',
evaluatorId: 'eval_groundedness',
runs: 3,
passThreshold: 0.8,
});
console.log(result);
// {
// votes: [{ score: 0.85, pass: true }, { score: 0.72, pass: false }, { score: 0.88, pass: true }],
// majorityVerdict: 'pass',
// confidence: 0.67
// }

2. Lower evaluator temperature
Lower temperature reduces variance but can make the evaluator less nuanced.
const analysis = await nondeterminism.analyze({
traceId: 'trace_abc',
evaluatorId: 'eval_groundedness',
runs: 10,
evaluatorConfig: {
temperature: 0.1, // Lower temperature for more deterministic scoring
},
});

3. Improve evaluator prompts
Vague criteria produce inconsistent scores. Be specific:
- "Rate whether the response is good."
+ "Rate whether the response answers the user's question using only
+ information from the provided context. Score 1.0 if all claims are
+ supported, 0.5 if some claims are unsupported, 0.0 if the response
+ contradicts the context."

Batch Nondeterminism Analysis
Run consistency checks across your entire evaluation suite.
const batchResults = await nondeterminism.batchAnalyze({
agentId: 'agent_123',
evaluatorId: 'eval_groundedness',
sampleSize: 100,
runsPerTrace: 5,
});
console.log(batchResults.summary);
// {
// totalTraces: 100,
// reliableEvaluations: 72,
// moderateEvaluations: 21,
// unreliableEvaluations: 7,
// overallReliability: 0.82,
// recommendation: 'Consider majority voting for 28 unstable traces.'
// }

Best Practices
When to Run Nondeterminism Testing
- When introducing a new evaluator — validate it is consistent before trusting it
- After changing evaluator prompts or thresholds
- After switching LLM providers or models
- Periodically (weekly) as a health check on evaluation stability
- Run at least 5 repetitions for meaningful pass@k estimates
- Use pass-to-3 as your primary reliability metric — it balances cost and confidence
- Investigate bimodal distributions — they indicate evaluator prompt issues
- Set reliability gates — require pass-to-3 > 0.9 before trusting single-run evaluations
- Document your reliability baselines so you can detect evaluator drift
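A reliability gate along these lines can be wired into CI. A sketch with a local pass-to-k computation standing in for the SDK helper; the 0.9 threshold is your own policy, per the practice above:

```typescript
// Gate: refuse to trust single-run evaluations until pass-to-3 > 0.9.
// The local (c / n)^3 computation stands in for the SDK's
// calculatePassToK helper; the 0.9 threshold is a policy choice.
function assertReliabilityGate(totalRuns: number, passingRuns: number): void {
  const passTo3 = Math.pow(passingRuns / totalRuns, 3);
  if (passTo3 <= 0.9) {
    throw new Error(
      `Evaluator below reliability gate: pass-to-3 = ${passTo3.toFixed(3)} (need > 0.9)`
    );
  }
}

assertReliabilityGate(10, 10); // passes: pass-to-3 = 1.0
// assertReliabilityGate(10, 8) would throw: pass-to-3 is only ~0.512
```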
Next Steps
- Human Review — Add human oversight for unreliable evaluations
- Evaluation — Set up automated evaluation pipelines
- API Reference — Full API documentation