
Nondeterminism Testing Guide

LLM outputs are inherently nondeterministic. Learn how to measure reliability and design evaluations you can trust.

Why Nondeterminism Matters

When you evaluate an AI response, the evaluation itself uses an LLM — and that evaluator LLM may return different scores on identical inputs. A single evaluation run can be misleading:

  • An evaluation that passes once might fail on a second run
  • Borderline scores are especially unreliable
  • Aggregated metrics can mask instability in individual cases
⚠️ A single evaluation run tells you what happened once. Pass@k analysis tells you what happens on average — and whether you can trust the result.

Core Concepts

Pass@k

Pass@k measures the probability that at least one out of k evaluation runs returns a passing result. It answers: “If I run this evaluation k times, what is the chance it passes at least once?”

pass@k = 1 - ((n - c) / n)^k

Where n is the total number of runs, c is the number of passing runs, and k is the number of sampled runs (the formula treats each sample as passing independently with rate c/n). For example, with 10 runs and 8 passing: pass@3 = 1 - (2/10)³ = 0.992.
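The formula is a one-liner; this standalone sketch (not the SDK helper, which is shown later) computes it directly from the counts:

```typescript
// Probability that at least one of k runs passes, assuming each run
// passes independently with the empirical rate c / n.
function passAtK(n: number, c: number, k: number): number {
  return 1 - Math.pow((n - c) / n, k);
}

passAtK(10, 8, 3); // 1 - (2/10)^3 = 0.992
```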

Pass-to-k

Pass-to-k (also called “consistent pass”) measures the probability that all k runs pass. It answers: “If I run this evaluation k times, will it pass every time?”
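Under the same independence assumption as the pass@k formula above, pass-to-k is the per-run pass rate raised to the k-th power, (c/n)^k. A minimal sketch:

```typescript
// Probability that all k runs pass, assuming each run passes
// independently with the empirical rate c / n.
function passToK(n: number, c: number, k: number): number {
  return Math.pow(c / n, k);
}

passToK(10, 8, 3); // (8/10)^3 = 0.512
```

Note how quickly this drops compared to pass@k: an 80% per-run pass rate gives only about a 51% chance of three consecutive passes.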

Reliability Score

A composite metric combining pass@k and pass-to-k to give a single reliability rating for an evaluation.

Running Pass@k Analysis

Configure nondeterminism testing

import { nondeterminism } from '@thinkhive/sdk';
 
// Run the same evaluation multiple times
const analysis = await nondeterminism.analyze({
  traceId: 'trace_abc',
  evaluatorId: 'eval_groundedness',
  runs: 10, // Number of repeated evaluations
});
 
console.log(analysis);
// {
//   traceId: 'trace_abc',
//   runs: 10,
//   passCount: 8,
//   failCount: 2,
//   scores: [0.82, 0.91, 0.78, 0.85, 0.88, 0.79, 0.92, 0.84, 0.76, 0.87],
//   mean: 0.842,
//   stdDev: 0.052,
//   passAtK: { 1: 0.80, 3: 0.99, 5: 1.0 },
//   passToK: { 1: 0.80, 3: 0.51, 5: 0.33 },
//   reliability: 'low',
// }

Calculate pass@k manually

You can also compute pass@k on your own data using the SDK helpers.

import { nondeterminism } from '@thinkhive/sdk';
 
const passAtK = nondeterminism.calculatePassAtK({
  totalRuns: 10,
  passingRuns: 8,
  k: 3,
});
// 0.992 -- 99.2% chance at least 1 of 3 runs passes
 
const passToK = nondeterminism.calculatePassToK({
  totalRuns: 10,
  passingRuns: 8,
  k: 3,
});
// 0.512 -- 51.2% chance all 3 runs pass

Check evaluation reliability

const reliable = nondeterminism.isReliableEvaluation({
  totalRuns: 10,
  passingRuns: 8,
  threshold: 0.9, // require 90% pass-to-3 for "reliable"
  k: 3,
});
 
console.log(reliable);
// {
//   reliable: false,
//   passToK: 0.512,
//   recommendation: 'Increase evaluation runs or adjust pass threshold'
// }

Interpreting Results

Reliability Ratings

| Rating     | Pass-to-3 | What It Means                                                 |
| ---------- | --------- | ------------------------------------------------------------- |
| High       | > 0.90    | Evaluation is consistent. Single-run results are trustworthy. |
| Moderate   | 0.70–0.90 | Some instability. Use majority voting (best of 3).            |
| Low        | 0.50–0.70 | Significant variance. Use best of 5 or adjust the evaluator.  |
| Unreliable | < 0.50    | Evaluation is essentially random. Redesign the evaluator.     |
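A helper that maps a pass-to-3 value onto these bands might look like the following (the band cutoffs come from the table above; the function itself is an illustrative sketch, not part of the SDK):

```typescript
type Rating = 'high' | 'moderate' | 'low' | 'unreliable';

// Map a pass-to-3 probability onto the reliability bands above.
function reliabilityRating(passTo3: number): Rating {
  if (passTo3 > 0.9) return 'high';
  if (passTo3 >= 0.7) return 'moderate';
  if (passTo3 >= 0.5) return 'low';
  return 'unreliable';
}

reliabilityRating(0.512); // 'low' -- matches the 8-of-10 example earlier
```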

Score Distribution

const distribution = await nondeterminism.scoreDistribution({
  traceId: 'trace_abc',
  evaluatorId: 'eval_groundedness',
  runs: 20,
});
 
console.log(distribution);
// {
//   histogram: {
//     '0.0-0.2': 0, '0.2-0.4': 0, '0.4-0.6': 2,
//     '0.6-0.8': 7, '0.8-1.0': 11
//   },
//   bimodal: false,
//   recommendation: 'Scores cluster in the 0.6-1.0 range. Evaluation is moderately stable.'
// }

Bimodal distributions (scores clustering at both extremes) usually indicate that the evaluator prompt is ambiguous. The LLM is “deciding” differently each time rather than scoring on a gradient.
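One rough way to flag a bimodal distribution is to check whether scores cluster at both extremes. This is a heuristic sketch (not the SDK's detection logic):

```typescript
// Heuristic: flag scores as bimodal when both extremes are populated
// AND the extremes hold most of the mass (default: 80% of scores).
function looksBimodal(scores: number[], extremeShare = 0.8): boolean {
  const low = scores.filter((s) => s < 0.2).length;
  const high = scores.filter((s) => s > 0.8).length;
  return low > 0 && high > 0 && (low + high) / scores.length >= extremeShare;
}

looksBimodal([0.05, 0.1, 0.9, 0.95, 0.92, 0.08]); // true -- split verdicts
looksBimodal([0.82, 0.91, 0.78, 0.85, 0.88, 0.79]); // false -- one cluster
```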

Designing Reliable Evaluations

1. Use majority voting

import { nondeterminism } from '@thinkhive/sdk';
 
const result = await nondeterminism.majorityVote({
  traceId: 'trace_abc',
  evaluatorId: 'eval_groundedness',
  runs: 3,
  passThreshold: 0.8,
});
 
console.log(result);
// {
//   votes: [{ score: 0.85, pass: true }, { score: 0.72, pass: false }, { score: 0.88, pass: true }],
//   majorityVerdict: 'pass',
//   confidence: 0.67
// }

2. Lower evaluator temperature

Lower temperature reduces variance but can make the evaluator less nuanced.

const analysis = await nondeterminism.analyze({
  traceId: 'trace_abc',
  evaluatorId: 'eval_groundedness',
  runs: 10,
  evaluatorConfig: {
    temperature: 0.1, // Lower temperature for more deterministic scoring
  },
});

3. Improve evaluator prompts

Vague criteria produce inconsistent scores. Be specific:

- "Rate whether the response is good."
+ "Rate whether the response answers the user's question using only
+  information from the provided context. Score 1.0 if all claims are
+  supported, 0.5 if some claims are unsupported, 0.0 if the response
+  contradicts the context."

Batch Nondeterminism Analysis

Run consistency checks across your entire evaluation suite.

const batchResults = await nondeterminism.batchAnalyze({
  agentId: 'agent_123',
  evaluatorId: 'eval_groundedness',
  sampleSize: 100,
  runsPerTrace: 5,
});
 
console.log(batchResults.summary);
// {
//   totalTraces: 100,
//   reliableEvaluations: 72,
//   moderateEvaluations: 21,
//   unreliableEvaluations: 7,
//   overallReliability: 0.82,
//   recommendation: 'Consider majority voting for 28 unstable traces.'
// }

Best Practices

When to Run Nondeterminism Testing

  1. When introducing a new evaluator — validate it is consistent before trusting it
  2. After changing evaluator prompts or thresholds
  3. After switching LLM providers or models
  4. Periodically (weekly) as a health check on evaluation stability
General Guidelines

  1. Run at least 5 repetitions for meaningful pass@k estimates
  2. Use pass-to-3 as your primary reliability metric — it balances cost and confidence
  3. Investigate bimodal distributions — they indicate evaluator prompt issues
  4. Set reliability gates — require pass-to-3 > 0.9 before trusting single-run evaluations
  5. Document your reliability baselines so you can detect evaluator drift

Next Steps