eval-variance

Domain: eval · Model class: cheap

Use this skill when the user wants to measure output variance and flakiness across multiple runs to assess model consistency. Triggers include “measure output variance”, “how flaky is my prompt”, and “consistency analysis”. Do NOT use when the user still needs to design the eval itself (use core-eval-design).

This skill provides structured guidance, references, and worked examples for measuring output variance and flakiness across repeated runs, so the user can assess model consistency and produce actionable, comparison-ready results.

Triggers:

  • “measure output variance”
  • “how flaky is my prompt”
  • “consistency analysis”
  • “repeated run benchmarking”
  • “stability of my AI workflow”

Do NOT use when the user wants to:

  • design the eval first (use core-eval-design)
  • analyze quality vs cost tradeoffs after benchmarking
Clarify first:

  1. What is the user’s goal and current state?
  2. What constraints (time, team, compliance) apply?
  3. Are there existing artifacts (specs, code, benchmarks) to reference?

Expected outputs:

  • evaluation criteria
  • scoring or benchmark framing
  • comparison-ready output
  • decision guidance
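A minimal sketch of the core measurement this skill guides: run the same prompt repeatedly, collect the outputs, and report a flakiness score (the fraction of runs that disagree with the most common output). The `run_model` callable mentioned in the comment is a hypothetical stand-in for the user's own model client, not part of any specific API.

```python
from collections import Counter

def flakiness(outputs: list[str]) -> float:
    """Fraction of runs that disagree with the modal (most common) output.

    0.0 means every run returned the identical string; values near 1.0
    mean the prompt is highly flaky under repeated sampling.
    """
    if not outputs:
        raise ValueError("need at least one run")
    modal_count = Counter(outputs).most_common(1)[0][1]
    return 1.0 - modal_count / len(outputs)

# In practice you would collect real runs, e.g. with a hypothetical client:
#   outputs = [run_model(prompt) for _ in range(20)]
outputs = ["A", "A", "B", "A", "C"]  # five repeated runs, for illustration
print(round(flakiness(outputs), 2))  # -> 0.4
```

Exact-string matching is the simplest consistency criterion; for free-form outputs, swap in a normalized or semantic equivalence check before counting.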

Related skills: eval-design · eval-prompt-bench · eval-output-grading