bench-blind-comparison
Domain: bench · Model class: cheap
Description
Use this skill when the user wants to run blind pairwise comparisons between AI outputs to remove bias from evaluation. Triggers include “blind comparison of outputs”, “pairwise eval without bias”, “A/B test my prompts blindly”. Do NOT use it to design the eval suite (use adv-eval-suite-designer).
Purpose
This skill covers running blind pairwise comparisons between AI outputs to remove bias from evaluation. It provides structured guidance, references, and worked examples to help produce high-quality, actionable outputs.
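As an illustration of what "blind" can mean in practice, the sketch below randomizes which system's output is shown first for each item, keeps the assignment in a separate key, and maps the judge's verdicts back afterwards. The function names (`make_blind_pairs`, `unblind`), the verdict labels (`"first"`, `"second"`, `"tie"`), and the dict layout are hypothetical choices for this example, not part of the skill itself.

```python
import random


def make_blind_pairs(outputs_a, outputs_b, seed=0):
    """Randomize presentation order per item.

    outputs_a / outputs_b: dicts mapping item_id -> output text (assumed layout).
    Returns (pairs, key). Each pair hides which system produced which text;
    key[item_id] is True when system A's output was shown first.
    """
    rng = random.Random(seed)  # fixed seed so the blinding is reproducible
    pairs, key = [], {}
    for item_id in sorted(outputs_a):
        a_first = rng.random() < 0.5
        a, b = outputs_a[item_id], outputs_b[item_id]
        first, second = (a, b) if a_first else (b, a)
        pairs.append({"item_id": item_id, "first": first, "second": second})
        key[item_id] = a_first
    return pairs, key


def unblind(verdicts, key):
    """Map verdicts ("first", "second", "tie") back to "A", "B", or "tie"."""
    results = {}
    for item_id, verdict in verdicts.items():
        if verdict == "tie":
            results[item_id] = "tie"
        else:
            picked_first = verdict == "first"
            # A wins if the judge picked the slot that A occupied.
            results[item_id] = "A" if picked_first == key[item_id] else "B"
    return results
```

The judge (human or model) only ever sees `pairs`; the `key` stays with whoever runs the comparison until all verdicts are in.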
Trigger Phrases
- “blind comparison of outputs”
- “pairwise eval without bias”
- “A/B test my prompts blindly”
- “unbiased prompt comparison”
Anti-Triggers
- design the eval suite (use adv-eval-suite-designer)
- analyze benchmark trends (use adv-benchmark-analyzer)
Intake Questions
- What is the user’s goal and current state?
- What constraints (time, team, compliance) apply?
- Are there existing artifacts (specs, code, benchmarks) to reference?
Output Contract
- benchmark analysis summary
- trend or regression findings
- comparison-ready evidence
- follow-up actions
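One possible shape for the comparison evidence is a tally of unblinded verdicts with a simple significance check. The sketch below is an assumption about how that could look: the `summarize` function, its input format (item_id mapped to `"A"`, `"B"`, or `"tie"`), and the choice of a two-sided exact sign test are illustrative, not something the skill prescribes.

```python
from math import comb


def summarize(results):
    """Tally per-item outcomes and run a two-sided exact sign test.

    results: dict mapping item_id -> "A", "B", or "tie" (assumed format).
    Ties are reported but excluded from the test.
    """
    wins_a = sum(1 for r in results.values() if r == "A")
    wins_b = sum(1 for r in results.values() if r == "B")
    ties = sum(1 for r in results.values() if r == "tie")
    n = wins_a + wins_b  # number of decisive comparisons

    if n == 0:
        p_value = 1.0
        win_rate_a = None
    else:
        # Null hypothesis: each decisive item is a fair coin flip.
        k = max(wins_a, wins_b)
        tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
        p_value = min(1.0, 2 * tail)
        win_rate_a = wins_a / n

    return {
        "wins_a": wins_a,
        "wins_b": wins_b,
        "ties": ties,
        "win_rate_a": win_rate_a,
        "p_value": p_value,
    }
```

A small p-value only says the preference is unlikely under a coin-flip null; it says nothing about how large or useful the difference is, so the raw counts belong in the summary alongside it.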