eval-design

Domain: eval · Model class: cheap

Use this skill when the user wants to design high-quality eval datasets with realistic prompts, hard negatives, and discriminative assertions. Triggers include “design an eval set”, “build a benchmark dataset”, and “create test cases for my prompt”. Do NOT use it to run the evals after designing them (use core-prompt-benchmarking).

This skill provides structured guidance, references, and worked examples for designing high-quality eval datasets with realistic prompts, hard negatives, and discriminative assertions, so the outputs are actionable rather than generic.
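The core idea above — pair each test case with a hard negative, then verify the assertion actually rejects it — can be sketched as follows. The field names and the example case are illustrative assumptions, not a fixed schema:

```python
# A minimal eval-set entry: a realistic prompt, a hard negative
# (a plausible but wrong output), and a discriminative assertion.
# Field names here are illustrative, not a required format.
cases = [
    {
        "prompt": "Summarize the refund policy in one sentence.",
        # Plausible-sounding but vacuous output that should FAIL the check:
        "hard_negative": "We have a policy covering refunds and other matters.",
        # Discriminative assertion: requires specific content, not just fluency.
        "check": lambda out: "30 days" in out and out.count(".") <= 1,
    },
]

def audit_hard_negatives(cases):
    """Return indices of cases whose assertion wrongly accepts its own
    hard negative — a sign the assertion is not discriminative enough."""
    return [i for i, c in enumerate(cases) if c["check"](c["hard_negative"])]
```

Running `audit_hard_negatives(cases)` before using an eval set is a cheap self-test: an empty result means every assertion rejects its hard negative; any flagged index marks an assertion that would pass on superficially plausible output.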

Triggers:

  • “design an eval set”
  • “build a benchmark dataset”
  • “create test cases for my prompt”
  • “how do I write good evals”
  • “eval-first development”

Do NOT use when the user wants to:

  • run the evals after designing them (use core-prompt-benchmarking)
  • grade the outputs (use core-output-grading)
Before starting, clarify:

  1. What is the user’s goal and current state?
  2. What constraints (time, team, compliance) apply?
  3. Are there existing artifacts (specs, code, benchmarks) to reference?

Outputs typically include:

  • evaluation criteria
  • scoring or benchmark framing
  • comparison-ready output
  • decision guidance

Related skills: eval-prompt-bench · eval-output-grading · eval-variance