bench-eval-suite
Domain: bench · Model class: cheap
Description
Use this skill when the user wants to design comprehensive evaluation suites covering multiple dimensions of AI system quality. Triggers include “design a comprehensive eval suite”, “multi-dimensional evaluation framework”, and “end-to-end eval suite for my AI system”. Do NOT use it for running individual benchmarks (use core-prompt-benchmarking instead).
Purpose
Designing comprehensive evaluation suites covering multiple dimensions of AI system quality. This skill provides structured guidance, references, and worked examples to help produce high-quality, actionable outputs.
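To make “multiple dimensions of quality” concrete, the sketch below shows one possible shape for a multi-dimensional eval suite: each dimension (e.g. correctness, safety) holds its own test cases, and a runner reports a pass rate per dimension. This is a minimal illustration, not the skill’s prescribed implementation — all names (`EvalCase`, `EvalDimension`, `run_suite`) and the toy model are hypothetical.

```python
# Hypothetical sketch of a multi-dimensional eval suite (not the skill's API).
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class EvalCase:
    prompt: str
    check: Callable[[str], bool]  # True if the model output passes this case

@dataclass
class EvalDimension:
    name: str                     # e.g. "correctness", "safety", "robustness"
    cases: list[EvalCase] = field(default_factory=list)

def run_suite(model: Callable[[str], str],
              dims: list[EvalDimension]) -> dict[str, float]:
    """Return a pass rate per quality dimension."""
    scores = {}
    for dim in dims:
        passed = sum(1 for c in dim.cases if c.check(model(c.prompt)))
        scores[dim.name] = passed / len(dim.cases) if dim.cases else 0.0
    return scores

# Toy model and two dimensions, purely for illustration.
echo = lambda p: p.upper()
suite = [
    EvalDimension("correctness", [EvalCase("hello", lambda o: o == "HELLO")]),
    EvalDimension("safety", [EvalCase("hi", lambda o: "ATTACK" not in o)]),
]
print(run_suite(echo, suite))  # {'correctness': 1.0, 'safety': 1.0}
```

Keeping dimensions separate, rather than averaging everything into one score, is what makes the suite’s results actionable: a regression in one dimension stays visible.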
Trigger Phrases
- “design a comprehensive eval suite”
- “multi-dimensional evaluation framework”
- “end-to-end eval suite for my AI system”
- “what evals do I need for production readiness”
Anti-Triggers
- running individual benchmarks (use core-prompt-benchmarking)
- analyzing results (use adv-benchmark-analyzer)
Intake Questions
- What is the user’s goal and current state?
- What constraints (time, team, compliance) apply?
- Are there existing artifacts (specs, code, benchmarks) to reference?
Output Contract
- benchmark analysis summary
- trend or regression findings
- comparison-ready evidence
- follow-up actions