# Evaluation Skills

The eval-* family measures the quality of AI agent outputs, prompts, and orchestration results. Skills use a 3-way majority vote pattern to reduce individual model bias.

| Skill ID | Description | Model Class |
| --- | --- | --- |
| `eval-prompt` | Scores a prompt on clarity, specificity, context sufficiency, and output predictability | free |
| `eval-output-grading` | Grades a model response against rubric criteria (accuracy, format, completeness, safety) | free |
| `eval-variance` | Measures response stability across multiple runs of the same prompt | cheap |
| `eval-prompt-bench` | Benchmarks a prompt against alternatives using the same input; ranks by score | cheap |
| `eval-adversarial` | Tests prompt robustness against injection, jailbreak, and edge-case adversarial inputs | strong |
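The table above can be restated as a small lookup, a minimal sketch assuming skills are routed by ID. The registry mirrors the table; the `model_class_for` helper is an illustrative assumption, not a documented API.

```python
# Skill ID → model class, copied from the table above.
EVAL_SKILLS = {
    "eval-prompt": "free",
    "eval-output-grading": "free",
    "eval-variance": "cheap",
    "eval-prompt-bench": "cheap",
    "eval-adversarial": "strong",
}

def model_class_for(skill_id: str) -> str:
    """Return the model class a skill runs on (hypothetical helper)."""
    try:
        return EVAL_SKILLS[skill_id]
    except KeyError:
        raise ValueError(f"unknown eval skill: {skill_id}")
```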
The majority vote escalates through models in `eval-output-grading`, for example, as follows:

```
eval-output-grading
├── Claude Haiku 4.5  → vote A
├── GPT-5 mini        → vote B
└── GPT-4.1           → vote C
      │ (split?)
      └── GPT-5.4 → tiebreak
            │ (still split?)
            └── Claude Sonnet 4.6 → final
```
| Situation | Skill(s) |
| --- | --- |
| Testing a new prompt variant | `eval-prompt` + `eval-variance` |
| Comparing two prompt approaches | `eval-prompt-bench` |
| Grading an agent's response | `eval-output-grading` |
| Security review of prompt surface | `eval-adversarial` |
The eval-* family is also consumed by higher-level workflows:

- `evaluate` — all five skills coordinated
- `benchmark` — uses `eval-output-grading` + `eval-variance` for blind comparison
- `prompt-engineering` — uses `eval-prompt` to score generated prompts
- `govern` — uses `eval-adversarial` for injection-resistance testing

Skills output a structured score object:

```json
{
  "skill": "eval-output-grading",
  "score": 8.5,
  "max": 10,
  "dimensions": {
    "accuracy": 9,
    "format": 8,
    "completeness": 9,
    "safety": 8
  },
  "notes": "Missing boundary condition for empty input"
}
```
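A consumer can parse the score object with standard JSON tooling. In the example values, the overall `score` equals the mean of the four dimension scores; treating that as the aggregation rule is an assumption inferred from the sample, not documented behavior.

```python
import json

# Sample score object, as emitted by eval-output-grading above.
payload = """{
  "skill": "eval-output-grading",
  "score": 8.5,
  "max": 10,
  "dimensions": {"accuracy": 9, "format": 8, "completeness": 9, "safety": 8},
  "notes": "Missing boundary condition for empty input"
}"""

result = json.loads(payload)
# Assumed aggregation: overall score is the mean of the dimensions.
mean = sum(result["dimensions"].values()) / len(result["dimensions"])
assert mean == result["score"]  # (9 + 8 + 9 + 8) / 4 == 8.5
```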