bench-blind-comparison

Domain: bench · Model class: cheap

Description

Use this skill when the user wants to run blind pairwise comparisons between AI outputs to remove bias from evaluation. Triggers include “blind comparison of outputs”, “pairwise eval without bias”, and “A/B test my prompts blindly”. Do NOT use it to design the eval suite (use adv-eval-suite-designer).

Purpose

Run blind pairwise comparisons between AI outputs to remove bias from evaluation. This skill provides structured guidance, references, and worked examples to help produce high-quality, actionable outputs.
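
As a rough illustration of what blinding can look like in practice, the Python sketch below randomizes which output is shown first and keeps the source label hidden until after the judge has chosen. It is only a sketch under stated assumptions: the names `BlindPair`, `make_blind_pairs`, `unblind`, and `tally` are hypothetical and not part of this skill, and each system's outputs are assumed to be keyed by a shared item ID.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class BlindPair:
    item_id: str
    first: str         # text shown to the judge in position 1
    second: str        # text shown to the judge in position 2
    first_source: str  # "A" or "B"; assumption: kept hidden from the judge


def make_blind_pairs(outputs_a, outputs_b, seed=0):
    """Pair outputs by item ID and randomize their display position."""
    rng = random.Random(seed)  # seeded so the assignment is reproducible
    pairs = []
    # Only items present in both systems can be compared.
    for item_id in sorted(outputs_a.keys() & outputs_b.keys()):
        if rng.random() < 0.5:
            pairs.append(BlindPair(item_id, outputs_a[item_id], outputs_b[item_id], "A"))
        else:
            pairs.append(BlindPair(item_id, outputs_b[item_id], outputs_a[item_id], "B"))
    rng.shuffle(pairs)  # also shuffle the order in which items are presented
    return pairs


def unblind(pair, preferred_position):
    """Map the judge's positional choice (1 or 2) back to a source label."""
    if preferred_position not in (1, 2):
        raise ValueError("preferred_position must be 1 or 2")
    if preferred_position == 1:
        return pair.first_source
    return "B" if pair.first_source == "A" else "A"


def tally(pairs, choices):
    """Count wins per source; `choices` maps item_id to 1 or 2."""
    counts = {"A": 0, "B": 0}
    for pair in pairs:
        if pair.item_id in choices:  # skip items the judge did not rate
            counts[unblind(pair, choices[pair.item_id])] += 1
    return counts
```

The judge, human or model, would see only `first` and `second`; the mapping back to "A" and "B" happens in `tally` once all choices are recorded, so position and label cannot influence the preference.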

Trigger Phrases

“blind comparison of outputs”
“pairwise eval without bias”
“A/B test my prompts blindly”
“unbiased prompt comparison”

Anti-Triggers

designing the eval suite (use adv-eval-suite-designer)
analyzing benchmark trends (use adv-benchmark-analyzer)

Intake Questions

What is the user’s goal and current state?
What constraints (time, team, compliance) apply?
Are there existing artifacts (specs, code, benchmarks) to reference?

Output Contract

benchmark analysis summary
trend or regression findings
comparison-ready evidence
follow-up actions

bench-analyzer · bench-eval-suite · eval-output-grading