Benchmarking Skills

The bench-* family establishes objective baselines and comparative measurements. Output feeds the evaluate and research instructions.

| Skill ID | Description | Model Class |
| --- | --- | --- |
| bench-analyzer | Profiles runtime performance: latency, throughput, P95/P99 tail percentiles, and regression detection | free |
| bench-blind-comparison | Side-by-side blind comparison of two solutions or responses without revealing their sources | cheap |
| bench-eval-suite | Constructs a reusable evaluation suite: test cases, scoring rubric, and expected outputs | cheap |
| Situation | Skill(s) |
| --- | --- |
| Comparing two algorithm implementations | bench-blind-comparison |
| Establishing performance baselines | bench-analyzer |
| Building a repeatable eval dataset | bench-eval-suite |
Downstream consumers:

  • benchmark — primary consumer; coordinates all three skills
  • evaluate — uses bench-blind-comparison for output comparison
  • research — uses bench-analyzer to validate performance claims
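A minimal sketch of the tail-percentile measurement bench-analyzer reports (latency, throughput, P95/P99). The `profile` function name and the run count are illustrative assumptions; the skill's real interface is not specified here.

```python
import time
import statistics

def profile(fn, runs=1000):
    """Time `fn` repeatedly; report mean/P95/P99 latency and throughput.

    This is a hypothetical stand-in for bench-analyzer's measurement loop.
    """
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    samples.sort()  # percentiles are read off the sorted sample list
    return {
        "mean_ms": statistics.mean(samples) * 1000,
        "p95_ms": samples[int(runs * 0.95) - 1] * 1000,
        "p99_ms": samples[int(runs * 0.99) - 1] * 1000,
        "throughput_per_s": runs / sum(samples),
    }
```

Regression detection would then reduce to comparing a fresh profile against a stored baseline, flagging when P95/P99 drift past a threshold.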

Both options are presented without labels during evaluation:

```
Option A: [implementation 1 — label hidden]
Option B: [implementation 2 — label hidden]
Evaluated by 3 free-tier models → majority vote → result revealed
```

This prevents evaluator bias toward the “familiar” or “expected” implementation.
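The shuffle-vote-reveal flow above can be sketched as follows. The `judges` callables are hypothetical stand-ins for the three free-tier models; each sees only the anonymized options and returns "A" or "B".

```python
import random

def blind_compare(impl_a, impl_b, judges):
    """Present two outputs under shuffled neutral labels, take a majority
    vote, then map the winning slot back to its original source.

    A sketch of the blind-comparison protocol, not the skill's actual code.
    """
    options = [("A", impl_a), ("B", impl_b)]
    random.shuffle(options)  # hide which source appears in which slot
    (label1, out1), (label2, out2) = options
    votes = [judge(out1, out2) for judge in judges]  # votes on slots, not sources
    winner_slot = "A" if votes.count("A") > votes.count("B") else "B"
    # Reveal: translate the winning presentation slot to the original label.
    return label1 if winner_slot == "A" else label2
```

Because judges vote on presentation slots rather than sources, any preference for the "familiar" implementation cannot systematically favor one side.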

An example bench-eval-suite definition:

```json
{
  "suite": "auth-service-v2",
  "cases": [
    {
      "id": "happy-path-login",
      "input": { "email": "user@example.com", "password": "valid" },
      "expected": { "status": 200, "token": "<jwt>" },
      "rubric": ["has_token", "status_200", "response_time_lt_200ms"]
    }
  ]
}
```
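A minimal sketch of how a case like the one above might be scored against its rubric. The check implementations and the `elapsed_ms` response field are assumptions for illustration; the actual bench-eval-suite scoring logic is not specified here.

```python
# Hypothetical rubric checks, keyed by the names used in the suite JSON.
CHECKS = {
    "has_token": lambda r: bool(r.get("token")),
    "status_200": lambda r: r.get("status") == 200,
    "response_time_lt_200ms": lambda r: r.get("elapsed_ms", float("inf")) < 200,
}

def score_case(case, response):
    """Run each rubric check named by the case; pass only if all pass."""
    results = {name: CHECKS[name](response) for name in case["rubric"]}
    return {"results": results, "passed": all(results.values())}
```

Keeping the rubric as named checks in the suite definition means new cases can reuse existing checks without touching scoring code.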