bench-eval-suite
Domain: bench · Model class: cheap
Description
Use this skill when the user wants to design comprehensive evaluation suites covering multiple dimensions of AI system quality. Triggers include “design a comprehensive eval suite”, “multi-dimensional evaluation framework”, and “end-to-end eval suite for my AI system”. Do NOT use it for running individual benchmarks (use core-prompt-benchmarking instead).
Purpose
Designing comprehensive evaluation suites covering multiple dimensions of AI system quality. This skill provides structured guidance, references, and worked examples to help produce high-quality, actionable outputs.
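To make “multiple dimensions of quality” concrete, the sketch below shows one possible shape for a multi-dimensional eval suite: each dimension (e.g. correctness, safety) holds its own test cases, and a runner reports a pass rate per dimension. This is a minimal illustration, not the skill’s prescribed implementation — all names (`EvalCase`, `EvalDimension`, `run_suite`) and the toy model are hypothetical.

```python
# Hypothetical sketch of a multi-dimensional eval suite (not the skill's API).
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class EvalCase:
    prompt: str
    check: Callable[[str], bool]  # True if the model output passes this case

@dataclass
class EvalDimension:
    name: str                     # e.g. "correctness", "safety", "robustness"
    cases: list[EvalCase] = field(default_factory=list)

def run_suite(model: Callable[[str], str],
              dims: list[EvalDimension]) -> dict[str, float]:
    """Return a pass rate per quality dimension."""
    scores = {}
    for dim in dims:
        passed = sum(1 for c in dim.cases if c.check(model(c.prompt)))
        scores[dim.name] = passed / len(dim.cases) if dim.cases else 0.0
    return scores

# Toy model and two dimensions, purely for illustration.
echo = lambda p: p.upper()
suite = [
    EvalDimension("correctness", [EvalCase("hello", lambda o: o == "HELLO")]),
    EvalDimension("safety", [EvalCase("hi", lambda o: "ATTACK" not in o)]),
]
print(run_suite(echo, suite))  # {'correctness': 1.0, 'safety': 1.0}
```

Keeping dimensions separate, rather than averaging everything into one score, is what makes the suite’s results actionable: a regression in one dimension stays visible.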
Trigger Phrases
- “design a comprehensive eval suite”
- “multi-dimensional evaluation framework”
- “end-to-end eval suite for my AI system”
- “what evals do I need for production readiness”
Anti-Triggers
- running individual benchmarks (use core-prompt-benchmarking)
- analyzing results (use adv-benchmark-analyzer)
Intake Questions
- What is the user’s goal and current state?
- What constraints (time, team, compliance) apply?
- Are there existing artifacts (specs, code, benchmarks) to reference?
Output Contract
- benchmark analysis summary
- trend or regression findings
- comparison-ready evidence
- follow-up actions