bench-blind-comparison

Domain: bench · Model class: cheap

Use this skill when the user wants to run blind pairwise comparisons between AI outputs to remove bias from evaluation. Triggers include “blind comparison of outputs”, “pairwise eval without bias”, and “A/B test my prompts blindly”. Do NOT use it to design the eval suite (use adv-eval-suite-designer).

This skill covers running blind pairwise comparisons between AI outputs, so that knowing which system produced an output cannot bias the evaluation. It provides structured guidance, references, and worked examples.
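As a rough illustration of what a blind pairwise comparison involves, the sketch below randomizes which side each output appears on, keeps the answer key away from the judge, and unblinds only after all verdicts are in. Every name here (`make_blind_trials`, `unblind`, `longer_wins`) is hypothetical and not part of this skill. The length-based judge is only a stand-in for a human or model rater.

```python
import random
from collections import Counter


def make_blind_trials(outputs_a, outputs_b, seed=0):
    """Pair outputs item by item and randomize which side each source lands on.

    Returns the blinded trials (safe to show a judge) and a separate key
    recording whether each pair was swapped (kept hidden until unblinding).
    """
    rng = random.Random(seed)
    blinded, key = [], {}
    for idx, (a, b) in enumerate(zip(outputs_a, outputs_b)):
        swapped = rng.random() < 0.5
        left, right = (b, a) if swapped else (a, b)
        blinded.append({"id": idx, "left": left, "right": right})
        key[idx] = swapped
    return blinded, key


def unblind(verdicts, key):
    """Map 'left' / 'right' / 'tie' verdicts back to sources 'A' and 'B'."""
    tally = Counter()
    for idx, verdict in verdicts.items():
        if verdict == "tie":
            tally["tie"] += 1
        elif (verdict == "left") != key[idx]:
            # Left held A when not swapped; right held A when swapped.
            tally["A"] += 1
        else:
            tally["B"] += 1
    return tally


def longer_wins(trial):
    """Toy judge (assumption for the demo): prefers the longer text."""
    if len(trial["left"]) == len(trial["right"]):
        return "tie"
    return "left" if len(trial["left"]) > len(trial["right"]) else "right"


outputs_a = ["A detailed answer with context.", "Another fairly long reply."]
outputs_b = ["Short.", "Brief."]

blinded, key = make_blind_trials(outputs_a, outputs_b, seed=42)
verdicts = {trial["id"]: longer_wins(trial) for trial in blinded}
tally = unblind(verdicts, key)
print(dict(tally))
```

Because the toy judge always picks the longer text and every A output is longer, the tally credits A for both items whatever the seed. This shows that the left/right shuffle changes what the judge sees but not the unblinded result.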

Triggers:

  • “blind comparison of outputs”
  • “pairwise eval without bias”
  • “A/B test my prompts blindly”
  • “unbiased prompt comparison”

Do NOT use this skill to:

  • design the eval suite (use adv-eval-suite-designer)
  • analyze benchmark trends (use adv-benchmark-analyzer)

Questions to ask first:

  1. What is the user’s goal and current state?
  2. What constraints (time, team, compliance) apply?
  3. Are there existing artifacts (specs, code, benchmarks) to reference?

Outputs:

  • benchmark analysis summary
  • trend or regression findings
  • comparison-ready evidence
  • follow-up actions

bench-analyzer · bench-eval-suite · eval-output-grading