Benchmarking Skills
The bench-* family establishes objective baselines and comparative measurements. Output feeds the evaluate and research instructions.
Skills
Section titled “Skills”| Skill ID | Description | Model Class |
|---|---|---|
bench-analyzer | Profiles runtime performance: measures latency, throughput, P95/P99, and regression detection | free |
bench-blind-comparison | Side-by-side blind comparison of two solutions or responses without revealing source | cheap |
bench-eval-suite | Constructs a reusable evaluation suite: test cases, scoring rubric, and expected outputs | cheap |
When to Use
Section titled “When to Use”| Situation | Skill(s) |
|---|---|
| Comparing two algorithm implementations | bench-blind-comparison |
| Establishing performance baselines | bench-analyzer |
| Building a repeatable eval dataset | bench-eval-suite |
Instructions That Invoke These Skills
Section titled “Instructions That Invoke These Skills”- benchmark — primary consumer; all three coordinated
- evaluate — uses
bench-blind-comparisonfor output comparison - research — uses
bench-analyzerto validate performance claims
bench-blind-comparison Format
Section titled “bench-blind-comparison Format”Both options are presented without labels during evaluation:
Option A: [implementation 1 — label hidden]Option B: [implementation 2 — label hidden]
Evaluated by 3 free-tier models → majority vote → result revealedThis prevents evaluator bias toward the “familiar” or “expected” implementation.
bench-eval-suite Output
Section titled “bench-eval-suite Output”{ "suite": "auth-service-v2", "cases": [ { "id": "happy-path-login", "input": { "email": "user@example.com", "password": "valid" }, "expected": { "status": 200, "token": "<jwt>" }, "rubric": ["has_token", "status_200", "response_time_lt_200ms"] } ]}