eval-variance
Domain: eval · Model class: cheap
Description
Use this skill when the user wants to measure output variance and flakiness across multiple runs to assess model consistency. Triggers include “measure output variance”, “how flaky is my prompt”, and “consistency analysis”. Do NOT use when the task is to design the eval first (use core-eval-design).
Purpose
Measure output variance and flakiness across multiple runs to assess model consistency. This skill provides structured guidance, references, and worked examples to help produce high-quality, actionable outputs.
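A minimal sketch of the core measurement, assuming a 0/1 pass scorer; the function name `flakiness_report` and the example scorer are hypothetical, not part of the skill:

```python
import statistics
from collections import Counter

def flakiness_report(outputs, score_fn):
    """Summarize variance across repeated runs of the same prompt.

    outputs: list of model outputs from N identical runs
    score_fn: maps an output to a numeric score (e.g. 0/1 pass)
    """
    scores = [score_fn(o) for o in outputs]
    # Fraction of runs that disagree with the modal (most common) output
    _, majority_count = Counter(outputs).most_common(1)[0]
    return {
        "runs": len(outputs),
        "mean_score": statistics.mean(scores),
        "score_stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "flake_rate": 1 - majority_count / len(outputs),
    }

# Example: output "42" counts as a pass
runs = ["42", "42", "forty-two", "42", "41"]
report = flakiness_report(runs, lambda o: 1 if o == "42" else 0)
```

A low `score_stdev` with a high `flake_rate` can still matter: surface forms vary even when the scored outcome is stable, which is worth reporting separately.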
Trigger Phrases
- “measure output variance”
- “how flaky is my prompt”
- “consistency analysis”
- “repeated run benchmarking”
- “stability of my AI workflow”
Anti-Triggers
- design the eval first (use core-eval-design)
- analyze quality vs cost tradeoffs after benchmarking
Intake Questions
- What is the user’s goal and current state?
- What constraints (time, team, compliance) apply?
- Are there existing artifacts (specs, code, benchmarks) to reference?
Output Contract
- evaluation criteria
- scoring or benchmark framing
- comparison-ready output
- decision guidance
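One way to sketch a comparison-ready output, assuming per-variant 0/1 scores from repeated runs; `compare_variants` and the variant names are hypothetical:

```python
import statistics

def compare_variants(results):
    """results: {variant_name: list of 0/1 scores across repeated runs}"""
    rows = []
    for name, scores in results.items():
        stdev = statistics.stdev(scores) if len(scores) > 1 else 0.0
        rows.append((name, statistics.mean(scores), stdev))
    # Higher mean first; at similar means, lower stdev (more consistent) wins
    rows.sort(key=lambda r: (-r[1], r[2]))
    header = f"{'variant':<12}{'mean':>8}{'stdev':>8}"
    lines = [header] + [f"{n:<12}{m:>8.2f}{s:>8.2f}" for n, m, s in rows]
    return "\n".join(lines)

table = compare_variants({
    "prompt_v1": [1, 0, 1, 1, 0, 1],
    "prompt_v2": [1, 1, 1, 1, 1, 0],
})
```

The best-first ordering makes the table double as decision guidance: the top row is the recommended variant under these two criteria.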