eval-design

Domain: eval · Model class: cheap

Use this skill when the user wants to design high-quality eval datasets with realistic prompts, hard negatives, and discriminative assertions. Triggers include “design an eval set”, “build a benchmark dataset”, and “create test cases for my prompt”. Do NOT use it to run the evals after designing them (use core-prompt-benchmarking).

This skill provides structured guidance, references, and worked examples for designing high-quality eval datasets with realistic prompts, hard negatives, and discriminative assertions, so the outputs are actionable rather than generic.
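The core idea above — pair each test case with a hard negative, then verify the assertion actually rejects it — can be sketched as follows. The field names and the example case are illustrative assumptions, not a fixed schema:

```python
# A minimal eval-set entry: a realistic prompt, a hard negative
# (a plausible but wrong output), and a discriminative assertion.
# Field names here are illustrative, not a required format.
cases = [
    {
        "prompt": "Summarize the refund policy in one sentence.",
        # Plausible-sounding but vacuous output that should FAIL the check:
        "hard_negative": "We have a policy covering refunds and other matters.",
        # Discriminative assertion: requires specific content, not just fluency.
        "check": lambda out: "30 days" in out and out.count(".") <= 1,
    },
]

def audit_hard_negatives(cases):
    """Return indices of cases whose assertion wrongly accepts its own
    hard negative — a sign the assertion is not discriminative enough."""
    return [i for i, c in enumerate(cases) if c["check"](c["hard_negative"])]
```

Running `audit_hard_negatives(cases)` before using an eval set is a cheap self-test: an empty result means every assertion rejects its hard negative; any flagged index marks an assertion that would pass on superficially plausible output.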

Triggers:

  • “design an eval set”
  • “build a benchmark dataset”
  • “create test cases for my prompt”
  • “how do I write good evals”
  • “eval-first development”

Do NOT use when the user wants to:

  • run the evals after designing them (use core-prompt-benchmarking)
  • grade the outputs (use core-output-grading)
Before starting, clarify:

  1. What is the user’s goal and current state?
  2. What constraints (time, team, compliance) apply?
  3. Are there existing artifacts (specs, code, benchmarks) to reference?

Outputs typically include:

  • evaluation criteria
  • scoring or benchmark framing
  • comparison-ready output
  • decision guidance

Related skills: eval-prompt-bench · eval-output-grading · eval-variance