Continuous Evaluation
Tool: quality-evaluate
Model: Cross-Model
Trigger & Intent
Triggered by: The implement workflow or a direct eval request against prompt templates.
Intent: Quantify variance and accuracy of agent pipelines without human intervention. Benchmarks must be repeatable.
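As a rough illustration of what "quantify variance and accuracy" could mean in practice, the sketch below summarizes repeated runs of one eval suite into a mean score and a sample variance. The `RunSummary` shape and the `summarizeRuns` helper are hypothetical names introduced here; the source does not specify how scores are aggregated.

```typescript
// Hypothetical summary of repeated runs of the same eval suite.
interface RunSummary {
  mean: number;     // average score across runs
  variance: number; // sample variance (n - 1 denominator)
  runs: number;     // number of runs summarized
}

// Assumption: each run yields a single numeric score.
function summarizeRuns(scores: number[]): RunSummary {
  if (scores.length < 2) {
    throw new Error("need at least two runs to estimate variance");
  }
  const n = scores.length;
  const mean = scores.reduce((sum, s) => sum + s, 0) / n;
  const squaredDiffs = scores.reduce(
    (sum, s) => sum + (s - mean) * (s - mean),
    0
  );
  return { mean, variance: squaredDiffs / (n - 1), runs: n };
}

// Three runs scored on a hypothetical 1-5 rubric.
const summary = summarizeRuns([3, 4, 5]);
console.log(summary.mean, summary.variance);
```

Running the same suite several times and comparing these two numbers between benchmark runs is one simple way to make results comparable without a human in the loop.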
Resource Pooling
Capability profile: evaluation. Requires structured_output and classification; prefers cost_sensitive; falls back to fast_draft; fan-out of 3. Tie-breaking and synthesis escalation are handled by the configured orchestration patterns.
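The profile above could be expressed as data and resolved against a model pool roughly as follows. This is a sketch under assumptions: the `CapabilityProfile` and `ModelEntry` shapes and the `selectModels` function are hypothetical, and treating `fast_draft` as a capability tag that fallback models carry is one possible reading of the source.

```typescript
// Hypothetical shapes; the real pooling configuration is not shown in the source.
interface CapabilityProfile {
  name: string;
  requires: string[]; // every listed capability must be present
  prefers: string[];  // models with these are chosen first
  fallback: string[]; // used when preferred models do not fill the fan-out
  fanOut: number;     // number of models to run in parallel
}

interface ModelEntry {
  id: string;
  capabilities: string[];
}

const evaluationProfile: CapabilityProfile = {
  name: "evaluation",
  requires: ["structured_output", "classification"],
  prefers: ["cost_sensitive"],
  fallback: ["fast_draft"],
  fanOut: 3,
};

function hasAll(model: ModelEntry, caps: string[]): boolean {
  return caps.every((c) => model.capabilities.indexOf(c) !== -1);
}

// Preferred models first, then fallback models, capped at the fan-out.
function selectModels(
  profile: CapabilityProfile,
  pool: ModelEntry[]
): ModelEntry[] {
  const eligible = pool.filter((m) => hasAll(m, profile.requires));
  const preferred = eligible.filter((m) => hasAll(m, profile.prefers));
  const fallback = eligible.filter(
    (m) => preferred.indexOf(m) === -1 && hasAll(m, profile.fallback)
  );
  return preferred.concat(fallback).slice(0, profile.fanOut);
}
```

Models that meet the requirements but carry neither the preferred nor the fallback tag are left out in this sketch; a real pool might rank them last instead.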
Required Skills
| Skill | Role |
|---|---|
| eval-prompt | Prompt template evaluation |
| eval-variance | Statistical variance analysis |
| bench-blind-comparison | A/B pairwise comparison |
Input Schema
{
  evalSuiteId: string;
  targetModel: string;
}
Decisions & Throw-Backs
If performance degrades (variance increases or the score drops), the tool throws an exception and routes back to prompt-engineering to refine the templates. Output quality is evaluated using randomly generated A/B pairwise comparisons.
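The throw-back rule and the A/B comparison might look something like the sketch below. All names (`ThrowBackError`, `Metrics`, `assertNoDegradation`, `makeBlindPair`) are hypothetical. The sketch also assumes that "randomly generated" means randomizing which side of the pair the candidate output appears on; the source does not say how pairs are built.

```typescript
// Hypothetical error type that carries the stage to route back to.
class ThrowBackError extends Error {
  readonly routeTo: string;
  constructor(message: string, routeTo: string) {
    super(message);
    this.name = "ThrowBackError";
    this.routeTo = routeTo;
  }
}

interface Metrics {
  meanScore: number;
  variance: number;
}

// Throws when the score drops or the variance increases against a baseline.
function assertNoDegradation(baseline: Metrics, current: Metrics): void {
  const scoreDropped = current.meanScore < baseline.meanScore;
  const varianceRose = current.variance > baseline.variance;
  if (scoreDropped || varianceRose) {
    throw new ThrowBackError(
      `degraded: score ${baseline.meanScore} -> ${current.meanScore}, ` +
        `variance ${baseline.variance} -> ${current.variance}`,
      "prompt-engineering"
    );
  }
}

interface BlindPair {
  a: string;
  b: string;
  aIsCandidate: boolean; // kept by the harness, hidden from the judge
}

// Randomizes side assignment; pass a seeded generator for repeatable runs.
function makeBlindPair(
  candidate: string,
  reference: string,
  random: () => number = Math.random
): BlindPair {
  const aIsCandidate = random() < 0.5;
  return aIsCandidate
    ? { a: candidate, b: reference, aIsCandidate }
    : { a: reference, b: candidate, aIsCandidate };
}
```

Injecting the random source is a design choice made here so that a seeded generator can reproduce the same pair ordering, in keeping with the stated intent that benchmarks be repeatable.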
Success Chains
On successful completion chains to: prompt-engineering · refactor · govern
FSM — Double-loop learning with assumption revision
stateDiagram-v2
[*] --> RunEvalSuite
RunEvalSuite --> EvalOutput
EvalOutput --> QualityAssessment
QualityAssessment --> PromptFix: method-level failure
PromptFix --> RunEvalSuite
QualityAssessment --> MetricAudit: repeated systemic mismatch
MetricAudit --> RubricReconsideration
RubricReconsideration --> BenchmarkRedefinition
BenchmarkRedefinition --> RunEvalSuite
QualityAssessment --> RetainEvalModel: acceptable quality fit
RetainEvalModel --> [*]
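The transitions in the diagram above can be written as a small transition function. This is a minimal sketch: the state and verdict names are taken from the diagram, while the `nextState` function and the explicit `Done` state (standing in for the terminal `[*]`) are assumptions added for illustration.

```typescript
type State =
  | "RunEvalSuite"
  | "EvalOutput"
  | "QualityAssessment"
  | "PromptFix"
  | "MetricAudit"
  | "RubricReconsideration"
  | "BenchmarkRedefinition"
  | "RetainEvalModel"
  | "Done"; // assumed stand-in for the terminal [*] node

type Verdict =
  | "method-level failure"
  | "repeated systemic mismatch"
  | "acceptable quality fit";

// Only QualityAssessment branches; every other state has one successor.
function nextState(state: State, verdict?: Verdict): State {
  switch (state) {
    case "RunEvalSuite":
      return "EvalOutput";
    case "EvalOutput":
      return "QualityAssessment";
    case "QualityAssessment":
      if (verdict === "method-level failure") return "PromptFix"; // inner loop
      if (verdict === "repeated systemic mismatch") return "MetricAudit"; // outer loop
      if (verdict === "acceptable quality fit") return "RetainEvalModel";
      throw new Error("QualityAssessment requires a verdict");
    case "PromptFix":
      return "RunEvalSuite";
    case "MetricAudit":
      return "RubricReconsideration";
    case "RubricReconsideration":
      return "BenchmarkRedefinition";
    case "BenchmarkRedefinition":
      return "RunEvalSuite";
    case "RetainEvalModel":
      return "Done";
    case "Done":
      return "Done";
  }
}
```

The two loops are visible in the branch at `QualityAssessment`: a method-level failure revises the prompt and reruns the suite, while a repeated systemic mismatch revisits the metrics, rubric, and benchmark before rerunning.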
Execution Sequence
sequenceDiagram
participant Orchestrator
participant Pool (Analytical)
participant Pool (Mechanical)
participant Tool (Context)
Orchestrator->>Pool (Analytical): Allocate Capability Profile
activate Pool (Analytical)
Pool (Analytical)->>Tool (Context): Issue Tool Calls (Parallel)
Tool (Context)-->>Pool (Analytical): Return Data
alt Shallow Loop
Pool (Analytical)->>Pool (Analytical): Auto-correct Schema
else Medium Loop
Pool (Analytical)->>Pool (Mechanical): Delegate Fixes
end
Pool (Analytical)-->>Orchestrator: Synthesis Gate
deactivate Pool (Analytical)
opt Deep Loop
Orchestrator->>Orchestrator: Complete Throw-back to Prior Stage
end