
Continuous Evaluation

Tool: quality-evaluate

Model: Cross-Model

Triggered by: The implement workflow or a direct eval request against prompt templates.

Intent: Quantify variance and accuracy of agent pipelines without human intervention. Benchmarks must be repeatable.

Capability profile: evaluation. Requires structured_output + classification; prefers cost_sensitive; fast_draft fallback; fan-out 3. Tie-break and synthesis escalation are handled by the configured orchestration patterns.
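As a rough illustration of how such a profile could drive model selection, here is a minimal sketch. The type names, the `selectModels` function, and the reading of "fast_draft fallback" as "use fast_draft models when nothing meets the hard requirements" are all assumptions for illustration, not the tool's actual API.

```typescript
// Hypothetical shapes: none of these names are taken from the tool itself.
type Capability =
  | "structured_output"
  | "classification"
  | "cost_sensitive"
  | "fast_draft";

interface CapabilityProfile {
  name: string;
  requires: Capability[]; // a model must offer all of these
  prefers: Capability[];  // only used to rank eligible models
  fallback: Capability;   // assumed meaning: used when nothing meets `requires`
  fanOut: number;         // how many models to run side by side
}

interface ModelInfo {
  id: string;
  capabilities: Capability[];
}

const evaluationProfile: CapabilityProfile = {
  name: "evaluation",
  requires: ["structured_output", "classification"],
  prefers: ["cost_sensitive"],
  fallback: "fast_draft",
  fanOut: 3,
};

function selectModels(profile: CapabilityProfile, pool: ModelInfo[]): ModelInfo[] {
  const has = (m: ModelInfo, c: Capability): boolean => m.capabilities.includes(c);
  const eligible = pool.filter((m) => profile.requires.every((c) => has(m, c)));
  // Fall back to models tagged with the fallback capability if none are eligible.
  const candidates =
    eligible.length > 0 ? eligible : pool.filter((m) => has(m, profile.fallback));
  const score = (m: ModelInfo): number =>
    profile.prefers.filter((c) => has(m, c)).length;
  // Sort a copy so the caller's pool is left untouched, then cap at the fan-out.
  return [...candidates].sort((a, b) => score(b) - score(a)).slice(0, profile.fanOut);
}
```

Keeping `requires` as a hard filter and `prefers` as a ranking signal means a cheaper model is chosen first but never at the cost of a missing required capability.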

| Skill | Role |
| --- | --- |
| eval-prompt | Prompt template evaluation |
| eval-variance | Statistical variance analysis |
| bench-blind-comparison | A/B pairwise comparison |
{
evalSuiteId: string;
targetModel: string;
}
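
A caller might validate this input before starting a run. The sketch below is one possible way to do that; the field names come from the schema above, while `EvalRequest`, `parseEvalRequest`, and the non-empty-string rule are assumptions made for illustration.

```typescript
// Field names match the input schema; the interface and function names are hypothetical.
interface EvalRequest {
  evalSuiteId: string;
  targetModel: string;
}

function parseEvalRequest(input: unknown): EvalRequest {
  if (typeof input !== "object" || input === null) {
    throw new TypeError("eval request must be an object");
  }
  const { evalSuiteId, targetModel } = input as Record<string, unknown>;
  // Assumption: empty strings are rejected, since they cannot identify a suite or model.
  if (typeof evalSuiteId !== "string" || evalSuiteId.length === 0) {
    throw new TypeError("evalSuiteId must be a non-empty string");
  }
  if (typeof targetModel !== "string" || targetModel.length === 0) {
    throw new TypeError("targetModel must be a non-empty string");
  }
  // Return a fresh object so unknown extra fields are dropped.
  return { evalSuiteId, targetModel };
}
```

For example, `parseEvalRequest({ evalSuiteId: "suite-1", targetModel: "model-x" })` returns the same two fields, where both values are placeholders rather than real identifiers.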

Output quality is evaluated using randomly generated A/B pairwise comparisons. If performance degrades (variance increases or the score drops), the tool throws an exception and routes back to prompt-engineering to refine the templates.
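
The degradation gate and the blind A/B pairing can be sketched as follows. This is a minimal sketch, not the tool's implementation: `summarize`, `assertNoDegradation`, `EvalDegradationError`, `makeBlindPair`, and the zero-tolerance defaults are hypothetical, and the random source is injectable so that a benchmark run can be repeated.

```typescript
interface RunStats {
  mean: number;
  variance: number;
}

function summarize(scores: number[]): RunStats {
  if (scores.length === 0) throw new RangeError("need at least one score");
  const mean = scores.reduce((sum, x) => sum + x, 0) / scores.length;
  // Population variance (divides by n, not n - 1).
  const variance =
    scores.reduce((sum, x) => sum + (x - mean) * (x - mean), 0) / scores.length;
  return { mean, variance };
}

// Hypothetical error type carrying the stage to route back to.
class EvalDegradationError extends Error {
  readonly routeTo = "prompt-engineering";
  constructor(message: string) {
    super(message);
    this.name = "EvalDegradationError";
  }
}

// Throws if the candidate scores lower or varies more than the baseline.
function assertNoDegradation(
  baseline: number[],
  candidate: number[],
  maxScoreDrop = 0,
  maxVarianceIncrease = 0,
): RunStats {
  const base = summarize(baseline);
  const cand = summarize(candidate);
  if (cand.mean < base.mean - maxScoreDrop) {
    throw new EvalDegradationError(`score dropped: ${base.mean} -> ${cand.mean}`);
  }
  if (cand.variance > base.variance + maxVarianceIncrease) {
    throw new EvalDegradationError(`variance rose: ${base.variance} -> ${cand.variance}`);
  }
  return cand;
}

interface BlindPair {
  a: string;
  b: string;
  candidateIs: "a" | "b"; // kept away from the judge, used only when scoring
}

// Randomly decides which output is shown as "A" so the judge cannot infer its origin.
function makeBlindPair(
  baselineOutput: string,
  candidateOutput: string,
  rng: () => number = Math.random,
): BlindPair {
  return rng() < 0.5
    ? { a: candidateOutput, b: baselineOutput, candidateIs: "a" }
    : { a: baselineOutput, b: candidateOutput, candidateIs: "b" };
}
```

Passing a seeded generator as `rng` instead of `Math.random` is one way to keep the pairing random across items yet identical across reruns.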

On successful completion chains to: prompt-engineering · refactor · govern

FSM — Double-loop learning with assumption revision

stateDiagram-v2
    [*] --> RunEvalSuite
    RunEvalSuite --> EvalOutput
    EvalOutput --> QualityAssessment

    QualityAssessment --> PromptFix: method-level failure
    PromptFix --> RunEvalSuite

    QualityAssessment --> MetricAudit: repeated systemic mismatch
    MetricAudit --> RubricReconsideration
    RubricReconsideration --> BenchmarkRedefinition
    BenchmarkRedefinition --> RunEvalSuite

    QualityAssessment --> RetainEvalModel: acceptable quality fit
    RetainEvalModel --> [*]
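
The state diagram above can also be read as a transition table. The sketch below encodes it that way; the state and verdict strings are copied from the diagram, while `nextState`, `trace`, and the `Done` marker for the final `[*]` state are hypothetical helpers.

```typescript
type State =
  | "RunEvalSuite"
  | "EvalOutput"
  | "QualityAssessment"
  | "PromptFix"
  | "MetricAudit"
  | "RubricReconsideration"
  | "BenchmarkRedefinition"
  | "RetainEvalModel"
  | "Done"; // stands in for the final [*] state

type Verdict =
  | "method-level failure"       // single loop: fix the prompt, rerun
  | "repeated systemic mismatch" // double loop: question the metric itself
  | "acceptable quality fit";

// States with exactly one outgoing edge.
const NEXT: Record<Exclude<State, "QualityAssessment" | "Done">, State> = {
  RunEvalSuite: "EvalOutput",
  EvalOutput: "QualityAssessment",
  PromptFix: "RunEvalSuite",
  MetricAudit: "RubricReconsideration",
  RubricReconsideration: "BenchmarkRedefinition",
  BenchmarkRedefinition: "RunEvalSuite",
  RetainEvalModel: "Done",
};

// QualityAssessment is the only branching state.
const BRANCH: Record<Verdict, State> = {
  "method-level failure": "PromptFix",
  "repeated systemic mismatch": "MetricAudit",
  "acceptable quality fit": "RetainEvalModel",
};

function nextState(state: State, verdict?: Verdict): State {
  if (state === "Done") throw new Error("Done is terminal");
  if (state === "QualityAssessment") {
    if (verdict === undefined) throw new Error("QualityAssessment needs a verdict");
    return BRANCH[verdict];
  }
  return NEXT[state];
}

// Walks the machine, consuming one verdict per visit to QualityAssessment.
function trace(verdicts: Verdict[]): State[] {
  let state: State = "RunEvalSuite";
  const path: State[] = [state];
  let used = 0;
  while (state !== "Done") {
    const verdict = state === "QualityAssessment" ? verdicts[used++] : undefined;
    state = nextState(state, verdict);
    path.push(state);
  }
  return path;
}
```

In this encoding the two loops differ only in which state `QualityAssessment` hands off to: `PromptFix` returns to the suite directly, whereas `MetricAudit` revises the rubric and benchmark before the suite runs again.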

sequenceDiagram
    participant Orchestrator
    participant Pool (Analytical)
    participant Pool (Mechanical)
    participant Tool (Context)

    Orchestrator->>Pool (Analytical): Allocate Capability Profile
    activate Pool (Analytical)
    Pool (Analytical)->>Tool (Context): Issue Tool Calls (Parallel)
    Tool (Context)-->>Pool (Analytical): Return Data

    alt Shallow Loop
        Pool (Analytical)->>Pool (Analytical): Auto-correct Schema
    else Medium Loop
        Pool (Analytical)->>Pool (Mechanical): Delegate Fixes
    end

    Pool (Analytical)-->>Orchestrator: Synthesis Gate
    deactivate Pool (Analytical)

    opt Deep Loop
        Orchestrator->>Orchestrator: Complete Throw-back to Prior Stage
    end
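
To make the three loop depths concrete, the sketch below lists the messages exchanged for a given scenario. Participant names and message labels are copied from the sequence diagram; `planMessages` and its parameters are hypothetical, and how a real run decides between shallow, medium, and deep handling is not specified here.

```typescript
type Actor =
  | "Orchestrator"
  | "Pool (Analytical)"
  | "Pool (Mechanical)"
  | "Tool (Context)";

interface Message {
  from: Actor;
  to: Actor;
  label: string;
}

// Returns the ordered messages for one pass through the diagram.
function planMessages(correction: "shallow" | "medium", deepLoop: boolean): Message[] {
  const analytical: Actor = "Pool (Analytical)";
  const messages: Message[] = [
    { from: "Orchestrator", to: analytical, label: "Allocate Capability Profile" },
    { from: analytical, to: "Tool (Context)", label: "Issue Tool Calls (Parallel)" },
    { from: "Tool (Context)", to: analytical, label: "Return Data" },
  ];
  // alt: the analytical pool either corrects in place or hands off to the mechanical pool.
  messages.push(
    correction === "shallow"
      ? { from: analytical, to: analytical, label: "Auto-correct Schema" }
      : { from: analytical, to: "Pool (Mechanical)", label: "Delegate Fixes" },
  );
  messages.push({ from: analytical, to: "Orchestrator", label: "Synthesis Gate" });
  // opt: the deep loop happens only after the synthesis gate, inside the orchestrator.
  if (deepLoop) {
    messages.push({
      from: "Orchestrator",
      to: "Orchestrator",
      label: "Complete Throw-back to Prior Stage",
    });
  }
  return messages;
}
```

The shallow and medium loops are mutually exclusive and resolve before the synthesis gate, while the deep loop is an optional extra step that the orchestrator takes afterwards.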