Evaluation & Optimization

Evaluate how well your agents, skills, and action blocks perform on specific tasks — then automatically optimize them using an agent-driven improvement loop.

What It Does

  1. Define experiments — tasks with instructions, environments, and evaluators.
  2. Run subjects (agents, skills, MCPs, action blocks) against those experiments.
  3. Score results using weighted evaluators (string matching, script, agent-judge, deliberation).
  4. Optimize automatically — an optimizer agent reads the subject's code, makes targeted changes, re-evaluates, and keeps improvements.
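
For orientation, here is a minimal sketch of what a task (experiment) definition could look like as JSON. The field names and values below are illustrative assumptions, not the actual schema:

```json
{
  "name": "summarize-changelog",
  "instruction": "Summarize the changelog into five bullet points.",
  "environment": { "workspace": "./fixtures/changelog-repo" },
  "evaluators": [
    { "type": "expected-output", "weight": 0.4, "expected": "Release highlights as five bullet points" },
    { "type": "agent-judge", "weight": 0.6, "rubric": "Is the summary accurate, concise, and complete?" }
  ],
  "optimization": { "enabled": true, "maxIterations": 5 }
}
```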

Architecture

Evaluation and optimization overview: production runs and eval fixtures feed measurement, which drives optimization and promotion.

(Diagram) Developer workflow: Production Runs emit event-log traces, and good or bad runs can be promoted into future eval fixtures. The Eval Set holds curated fixtures, expected outputs, and metrics/rubrics; this is the repeatable benchmark you tune against. Measure the Agent replays past runs or executes the current agent against the eval set, producing scores and traces. The Optimization Loop proposes changes, re-evaluates candidate agents (baseline, v1, v2, ...), ranks winners, and promotes one. Deploy Winner ships the promoted candidate, and its deployed runs feed the next measurement cycle.

Key Concepts

| Concept | What it is |
| --- | --- |
| Task (Experiment) | Defines what to test: instruction, environment, evaluators, optional optimization |
| Subject | The thing being evaluated: agent, skill, action-block, capability, or MCP |
| Suite | A folder grouping related tasks |
| Run | Executes subjects against tasks, produces scored results |
| Evaluator | Scores the subject's output (expected-output, script, agent-judge, deliberation) |
| Optimization | Agent-driven iterative improvement of the subject |
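
To make these concrete, a scored run result for one subject might look roughly like the sketch below; the field names are assumptions for illustration, not the actual run format. The total is the weighted sum of the individual evaluator scores (0.4 × 0.75 + 0.6 × 0.9 = 0.84):

```json
{
  "task": "summarize-changelog",
  "subject": { "type": "agent", "id": "my-docs-agent" },
  "scores": [
    { "evaluator": "expected-output", "weight": 0.4, "score": 0.75 },
    { "evaluator": "agent-judge", "weight": 0.6, "score": 0.9 }
  ],
  "totalScore": 0.84
}
```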

Subject Types

| Type | What it is |
| --- | --- |
| agent | An installed agent |
| skill | A skill |
| action-block | An action block |
| capability | A capability |
| mcp | An MCP server |
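
A run's subject list can mix any of these types. The snippet below is a purely illustrative sketch of how such references might be written; the exact field names are assumptions:

```json
"subjects": [
  { "type": "agent", "id": "my-docs-agent" },
  { "type": "skill", "id": "markdown-formatter" },
  { "type": "mcp", "id": "github-mcp" }
]
```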

Data Storage

All eval data is stored as JSON files in .codebolt/evals/:

.codebolt/evals/
├── index.json
├── tasks/
├── suites/
└── runs/
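
As a purely hypothetical illustration (actual file names are generated by Codebolt and may differ), the tree could fill out like this once a task, a suite, and a run exist:

.codebolt/evals/
├── index.json
├── tasks/
│   └── summarize-changelog.json
├── suites/
│   └── docs-quality.json
└── runs/
    └── run-001.json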

Workflow

  1. Open the Eval Panel in Codebolt (Experiments tab).
  2. Create an experiment — define instruction, environment, evaluators.
  3. Switch to the Runs tab and create a run, selecting the subjects to evaluate.
  4. Click Start — subjects execute, evaluators score, results update in real time.
  5. Optionally enable optimization — optimizer agent iterates to improve scores.
  6. Review the leaderboard, which ranks subjects by score.

See Also