quality-evaluate

Cross-Model workflow

Define metrics → measure → compare → report → act. Every evaluation produces a decision or action.
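
A minimal sketch of the define-metrics and measure stages in Python (the Metric structure and scoring interface are illustrative assumptions, not the skill's actual API):

from dataclasses import dataclass
from typing import Callable

@dataclass
class Metric:
    name: str
    score: Callable[[str], float]   # maps a single output to a score in [0, 1]

def measure(outputs: list[str], metrics: list[Metric]) -> dict[str, float]:
    """Measure step: average each metric over the full output set."""
    return {
        m.name: sum(m.score(o) for o in outputs) / len(outputs)
        for m in metrics
    }

# Hypothetical usage: a single exact-match metric over two outputs.
exactness = Metric("exact_match", lambda out: float(out.strip() == "42"))
print(measure(["42", "41"], [exactness]))   # {'exact_match': 0.5}

The downstream compare, report, and act stages then diff these scores against a baseline and end in a decision rather than a raw number, as in the recommendation sketch at the end of this section.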

Use when benchmarking AI system quality, measuring output consistency, running eval suites, comparing model versions, detecting quality regressions, grading outputs against rubrics, or generating evaluation reports.

Triggers: “benchmark this”, “run evals”, “measure quality”, “compare model outputs”, “quality gate”, “detect regression”, “grade these outputs”, “eval suite”

Component skills and hand-offs:

  • eval-output-grading (×3 vote: Haiku, GPT-5 mini, GPT-4.1; see the sketch after this list)
  • eval-variance — stability measurement
  • eval-prompt-bench — prompt variant comparison
  • eval-adversarial (Advanced) — injection/jailbreak resistance
  • bench-blind-comparison — side-by-side blind evaluation
  • bench-eval-suite — reusable test suite construction
  • prompt-engineering — optimize prompts that score poorly
  • code-refactor — fix code quality issues found in eval
  • policy-govern — escalate safety/governance failures
  • physics-analysis — deeper analysis via QM/GR metaphors
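
As a rough illustration of the ×3 vote in eval-output-grading: each output is graded independently by three models and the majority verdict wins. The grader callables below are stand-ins; the actual models (Haiku, GPT-5 mini, GPT-4.1) would be called through their own APIs.

from collections import Counter
from typing import Callable

# A grader maps (output, rubric) to a verdict label such as "pass" or "fail".
Grader = Callable[[str, str], str]

def majority_vote(output: str, rubric: str, graders: list[Grader]) -> str:
    """3-way vote: the most common verdict wins; a three-way split counts as 'fail'."""
    verdicts = Counter(g(output, rubric) for g in graders)
    verdict, count = verdicts.most_common(1)[0]
    return verdict if count > len(graders) / 2 else "fail"

Conservative tie handling (no strict majority means fail) is one possible choice; the skill may resolve splits differently.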
Example request:

{
  "request": "Run a quality evaluation on the new code generation prompt against the v1 baseline"
}

Output: Score comparison table (v1 vs v2 across 5 dimensions), variance analysis, adversarial test results, and a recommendation (promote / iterate / revert).
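
The promote / iterate / revert recommendation can be reduced to a simple threshold rule over the per-dimension deltas. The dimension names, scores, and tolerance below are placeholders, not the skill's actual rubric.

def recommend(v1: dict[str, float], v2: dict[str, float],
              regress_tol: float = 0.05) -> str:
    """Compare v2 against the v1 baseline across each scored dimension."""
    deltas = {dim: v2[dim] - v1[dim] for dim in v1}
    if any(d < -regress_tol for d in deltas.values()):
        return "revert"     # some dimension regressed beyond tolerance
    if all(d > 0 for d in deltas.values()):
        return "promote"    # v2 improves on every dimension
    return "iterate"        # mixed results: keep refining

# Placeholder scores for five dimensions, each in [0, 1].
v1 = {"correctness": 0.82, "style": 0.74, "safety": 0.95, "latency": 0.60, "robustness": 0.71}
v2 = {"correctness": 0.88, "style": 0.79, "safety": 0.96, "latency": 0.63, "robustness": 0.75}
print(recommend(v1, v2))    # -> "promote"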