eval-design
Domain: eval · Model class: cheap
Description
Use this skill when the user wants to design high-quality eval datasets with realistic prompts, hard negatives, and discriminative assertions. Triggers include “design an eval set”, “build a benchmark dataset”, and “create test cases for my prompt”. Do NOT use it to run the evals after designing them (use core-prompt-benchmarking).
Purpose
Design high-quality eval datasets built from realistic prompts, hard negatives, and discriminative assertions. This skill provides structured guidance, references, and worked examples so that the resulting eval set is actionable.
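To make the three ingredients concrete, here is a minimal sketch of an eval case with a realistic prompt, a hard negative, and a check for whether an assertion is discriminative. Everything in it is hypothetical and for illustration only: the `EvalCase` type, the `is_discriminative` helper, the refund scenario, and the sample outputs are assumptions, not part of this skill's actual references.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One row of a hypothetical eval dataset."""
    case_id: str
    prompt: str                       # realistic, user-style input
    is_hard_negative: bool            # looks like a positive case but is not
    assertion: Callable[[str], bool]  # True when a model output passes


def is_discriminative(case: EvalCase, good_output: str, bad_output: str) -> bool:
    """An assertion discriminates if it passes a known-good output
    and fails a known-bad one."""
    return case.assertion(good_output) and not case.assertion(bad_output)


cases = [
    EvalCase(
        case_id="refund-positive",
        prompt="hi, i got charged twice for my march invoice, can i get one refunded?",
        is_hard_negative=False,
        assertion=lambda out: "refund" in out.lower(),
    ),
    # Hard negative: mentions refunds, but no refund should be issued.
    EvalCase(
        case_id="refund-hard-negative",
        prompt="what's your refund policy? just curious, nothing to return yet",
        is_hard_negative=True,
        assertion=lambda out: "policy" in out.lower()
        and "refund issued" not in out.lower(),
    ),
]

# Hand-written reference outputs (assumed, not real model output).
good = {
    "refund-positive": "I've started a refund for the duplicate charge.",
    "refund-hard-negative": "Our policy allows returns within 30 days.",
}
bad = {
    "refund-positive": "Thanks for reaching out!",
    "refund-hard-negative": "Refund issued to your card.",
}

results = {
    c.case_id: is_discriminative(c, good[c.case_id], bad[c.case_id])
    for c in cases
}

# A weak assertion that any non-empty output passes, so it cannot
# separate good outputs from bad ones.
weak_case = EvalCase(
    case_id="refund-weak",
    prompt=cases[0].prompt,
    is_hard_negative=False,
    assertion=lambda out: len(out) > 0,
)
weak_result = is_discriminative(
    weak_case, good["refund-positive"], bad["refund-positive"]
)
```

In this sketch both designed cases come out discriminative, while the non-empty check does not, which is the kind of assertion worth tightening or dropping before the dataset is handed off for benchmarking.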
Trigger Phrases
- “design an eval set”
- “build a benchmark dataset”
- “create test cases for my prompt”
- “how do I write good evals”
- “eval-first development”
Anti-Triggers
- run the evals after designing them (use core-prompt-benchmarking)
- grade the outputs (use core-output-grading)
Intake Questions
- What is the user’s goal and current state?
- What constraints (time, team, compliance) apply?
- Are there existing artifacts (specs, code, benchmarks) to reference?
Output Contract
- evaluation criteria
- scoring or benchmark framing
- comparison-ready output
- decision guidance