bench-eval-suite

Domain: bench · Model class: cheap

Use this skill when the user wants to design a comprehensive evaluation suite covering multiple dimensions of AI system quality. Triggers include “design a comprehensive eval suite”, “multi-dimensional evaluation framework”, and “end-to-end eval suite for my AI system”. Do NOT use it to run individual benchmarks (use core-prompt-benchmarking).

This skill provides structured guidance, references, and worked examples for designing evaluation suites that cover multiple dimensions of AI system quality.

Use when the user says:
  • “design a comprehensive eval suite”
  • “multi-dimensional evaluation framework”
  • “end-to-end eval suite for my AI system”
  • “what evals do I need for production readiness”

Do NOT use when the user wants to:
  • run individual benchmarks (use core-prompt-benchmarking)
  • analyze results (use adv-benchmark-analyzer)

Questions to ask first:
  1. What is the user’s goal and current state?
  2. What constraints (time, team, compliance) apply?
  3. Are there existing artifacts (specs, code, benchmarks) to reference?

Outputs:
  • benchmark analysis summary
  • trend or regression findings
  • comparison-ready evidence
  • follow-up actions
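
As a rough illustration of what a multi-dimensional suite with a production-readiness check might look like, here is a minimal Python sketch. Everything in it is an assumption made for illustration, not part of this skill: the `Dimension` and `EvalSuite` classes, the dimension names, the thresholds, and the scores are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Dimension:
    """One quality dimension of the suite (hypothetical structure)."""
    name: str
    threshold: float  # minimum acceptable mean score, on a 0..1 scale
    scores: list[float] = field(default_factory=list)

    def mean(self) -> float:
        # A dimension with no scores counts as 0.0, so missing coverage fails loudly.
        return sum(self.scores) / len(self.scores) if self.scores else 0.0

    def passed(self) -> bool:
        return self.mean() >= self.threshold


@dataclass
class EvalSuite:
    """A set of dimensions that must all pass before the system is considered ready."""
    dimensions: list[Dimension]

    def failing(self) -> list[str]:
        # Names of dimensions below their threshold, in declaration order.
        return [d.name for d in self.dimensions if not d.passed()]

    def production_ready(self) -> bool:
        return not self.failing()


def build_example_suite() -> EvalSuite:
    # Illustrative values only; real dimensions and thresholds depend on the system.
    return EvalSuite([
        Dimension("accuracy", threshold=0.80, scores=[0.90, 0.85, 0.80]),
        Dimension("safety", threshold=0.95, scores=[1.0, 0.8]),
        Dimension("latency_budget", threshold=0.90),  # not yet evaluated
    ])


if __name__ == "__main__":
    suite = build_example_suite()
    print("failing dimensions:", suite.failing())
    print("production ready:", suite.production_ready())
```

Treating an unevaluated dimension as a failure is one possible design choice; it keeps gaps in coverage visible rather than letting them pass silently.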

bench-analyzer · bench-blind-comparison · eval-design