quality-evaluate
Mission
Define metrics → measure → compare → report → act. Every evaluation produces a decision or action.
When to Use
Use when benchmarking AI system quality, measuring output consistency, running eval suites, comparing model versions, detecting quality regressions, grading outputs against rubrics, or generating evaluation reports.
Triggers: “benchmark this”, “run evals”, “measure quality”, “compare model outputs”, “quality gate”, “detect regression”, “grade these outputs”, “eval suite”
Skills Invoked
- eval-output-grading — 3-way majority vote (×3: Haiku, GPT-5 mini, GPT-4.1), sketched below
- eval-variance — stability measurement
- eval-prompt-bench — prompt variant comparison
- eval-adversarial (Advanced) — injection/jailbreak resistance
- bench-blind-comparison — side-by-side blind evaluation
- bench-eval-suite — reusable test suite construction
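As a rough illustration of how the ×3 vote could be wired up, the sketch below fans one output out to three grader models and takes the majority verdict. The `grade_with_model` helper, the model identifiers, and the pass/fail rubric interface are hypothetical placeholders, not the skill's actual implementation.

```python
from collections import Counter

# Hypothetical grader IDs; the skill names Haiku, GPT-5 mini, and GPT-4.1 as voters.
GRADER_MODELS = ["claude-haiku", "gpt-5-mini", "gpt-4.1"]

def grade_with_model(model: str, output: str, rubric: str) -> str:
    """Placeholder: call the grading model and return 'pass' or 'fail'.

    Assumed helper -- substitute whatever client/API the eval harness uses.
    """
    raise NotImplementedError

def majority_vote_grade(output: str, rubric: str) -> dict:
    """Collect one verdict per grader model and return the majority decision."""
    votes = {model: grade_with_model(model, output, rubric) for model in GRADER_MODELS}
    tally = Counter(votes.values())
    verdict, count = tally.most_common(1)[0]
    return {
        "verdict": verdict,                      # 'pass' if at least 2 of 3 graders agree
        "agreement": count / len(GRADER_MODELS), # 1.0 = unanimous, ~0.67 = split vote
        "votes": votes,                          # per-model votes, useful for variance analysis
    }
```

Keeping the per-model votes around also gives eval-variance something to work with when the same output is graded across repeated runs.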
Chain-To
- prompt-engineering — optimize prompts that score poorly
- code-refactor — fix code quality issues found in eval
- policy-govern — escalate safety/governance failures
- physics-analysis — deeper analysis via QM/GR metaphors
Example
```json
{
  "request": "Run a quality evaluation on the new code generation prompt against the v1 baseline"
}
```

Output: Score comparison table (v1 vs v2 across 5 dimensions), variance analysis, adversarial test results, and a recommendation (promote / iterate / revert).
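One way the promote / iterate / revert recommendation could be derived from the score table is sketched below. The dimension names, thresholds, and cut-offs are illustrative assumptions, not fixed behaviour of the skill.

```python
from statistics import mean, pstdev

def recommend(baseline: dict, candidate: dict, runs: list[float]) -> str:
    """Compare per-dimension scores and run-to-run spread, then pick an action.

    baseline/candidate: {dimension: score} maps, e.g. five rubric dimensions.
    runs: repeated overall scores for the candidate, used as a stability check.
    Thresholds below are illustrative, not prescribed by the skill.
    """
    deltas = {dim: candidate[dim] - baseline.get(dim, 0.0) for dim in candidate}
    avg_delta = mean(deltas.values())
    spread = pstdev(runs) if len(runs) > 1 else 0.0

    if avg_delta > 0.05 and spread < 0.1 and all(d >= -0.02 for d in deltas.values()):
        return "promote"    # clear win, stable, no dimension regressed meaningfully
    if avg_delta < -0.05:
        return "revert"     # candidate is worse overall than the v1 baseline
    return "iterate"        # mixed or noisy results: refine the prompt and re-run

# Example: v2 beats v1 on most dimensions with low run-to-run spread.
v1 = {"correctness": 0.80, "style": 0.70, "safety": 0.95, "latency": 0.60, "cost": 0.75}
v2 = {"correctness": 0.90, "style": 0.80, "safety": 0.96, "latency": 0.65, "cost": 0.75}
print(recommend(v1, v2, runs=[0.79, 0.81, 0.80]))  # -> "promote"
```

In practice the thresholds would come from the eval suite's own history or the quality gate being enforced, rather than hard-coded constants.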