eval-prompt-bench
Domain: eval · Model class: cheap
Description
Use this skill when the user wants to run benchmarks to score prompts, compare prompt versions, and detect regressions. Triggers include “benchmark this prompt”, “compare prompt versions”, and “detect prompt regressions”. Do NOT use it when the eval still needs to be designed (use core-eval-design).
Purpose
Running benchmarks to score prompts, compare versions, and detect regressions. This skill provides structured guidance, references, and worked examples to help produce high-quality, actionable outputs.
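A minimal sketch of what “running a benchmark to score a prompt” can look like. Everything here is illustrative: `run_model` is a hypothetical stand-in for the project’s actual model call, and the exact-match grader is the simplest possible scoring rule, not a recommendation.

```python
# Illustrative sketch only: score one prompt version against a small eval set.

def run_model(prompt: str, case_input: str) -> str:
    # Hypothetical placeholder for a real model call; here it just
    # uppercases the input so the sketch is runnable end to end.
    return case_input.upper()

def score_case(output: str, expected: str) -> float:
    # Simplest grader: exact match. Real suites typically use richer graders.
    return 1.0 if output == expected else 0.0

def benchmark(prompt: str, cases: list[tuple[str, str]]) -> float:
    # Mean score across all (input, expected) cases.
    scores = [score_case(run_model(prompt, inp), exp) for inp, exp in cases]
    return sum(scores) / len(scores)

cases = [("ok", "OK"), ("fail", "FAIL"), ("done", "DONE")]
print(benchmark("Uppercase the input.", cases))  # 1.0 with the stand-in model
```

The key shape to preserve in a real harness: a fixed case set, a grader separate from the model call, and a single aggregate score per prompt version so versions are directly comparable.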
Trigger Phrases
- “benchmark this prompt”
- “compare prompt versions”
- “detect prompt regressions”
- “run my eval suite”
- “score this prompt”
Anti-Triggers
- design the eval first (use core-eval-design)
- grade individual outputs (use core-output-grading)
Intake Questions
- What is the user’s goal and current state?
- What constraints (time, team, compliance) apply?
- Are there existing artifacts (specs, code, benchmarks) to reference?
Output Contract
- evaluation criteria
- scoring or benchmark framing
- comparison-ready output
- decision guidance
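The contract items above (scoring, comparison-ready output, decision guidance) can be sketched as a single comparison step. This is an assumption-laden example, not the skill’s prescribed format: per-case scores for two prompt versions go in, and a comparison-ready summary with a regression verdict comes out; the 0.05 threshold is an arbitrary placeholder.

```python
# Illustrative sketch: compare per-case scores of a baseline prompt version
# against a candidate and flag a regression past a (placeholder) threshold.

def detect_regression(baseline: dict[str, float],
                      candidate: dict[str, float],
                      threshold: float = 0.05) -> dict:
    shared = baseline.keys() & candidate.keys()  # compare only shared cases
    base_mean = sum(baseline[k] for k in shared) / len(shared)
    cand_mean = sum(candidate[k] for k in shared) / len(shared)
    regressed = sorted(k for k in shared if candidate[k] < baseline[k])
    return {
        "baseline_mean": base_mean,
        "candidate_mean": cand_mean,
        "regressed_cases": regressed,       # comparison-ready detail
        "verdict": "regression" if base_mean - cand_mean > threshold else "ok",
    }

baseline = {"case1": 1.0, "case2": 1.0, "case3": 0.5}
candidate = {"case1": 1.0, "case2": 0.0, "case3": 0.5}
print(detect_regression(baseline, candidate)["verdict"])  # regression
```

Reporting both the aggregate delta and the specific regressed cases is what makes the output decision-ready: the verdict says whether to ship, and the case list says where to look if not.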