Evaluation & Optimization

Evaluate how well your agents, skills, and action blocks perform on specific tasks — then automatically optimize them using an agent-driven improvement loop.

What It Does

Define experiments — tasks with instructions, environments, and evaluators.
Run subjects (agents, skills, action blocks, capabilities, MCP servers) against those experiments.
Score results using weighted evaluators (expected-output string matching, script, agent-judge, deliberation).
Optimize automatically — an optimizer agent reads the subject's code, makes targeted changes, re-evaluates, and keeps improvements.
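
As a rough illustration of what "weighted evaluators" could mean in practice, the sketch below combines per-evaluator scores into one number with a weighted average. The `EvaluatorResult` class, the 0.0 to 1.0 score range, and the averaging rule are all assumptions made for this example; the document does not specify how Codebolt actually aggregates scores.

```python
from dataclasses import dataclass


@dataclass
class EvaluatorResult:
    """One evaluator's verdict (hypothetical shape, not Codebolt's schema)."""
    name: str      # e.g. "expected-output", "script", "agent-judge"
    score: float   # assumed to be normalized to the range 0.0 to 1.0
    weight: float  # relative importance of this evaluator


def weighted_score(results: list[EvaluatorResult]) -> float:
    """Combine evaluator scores into one number via a weighted average."""
    total_weight = sum(r.weight for r in results)
    if total_weight <= 0:
        raise ValueError("at least one evaluator needs a positive weight")
    return sum(r.score * r.weight for r in results) / total_weight


# Illustrative values only.
results = [
    EvaluatorResult("expected-output", score=1.0, weight=2.0),
    EvaluatorResult("script", score=0.5, weight=1.0),
    EvaluatorResult("agent-judge", score=0.8, weight=1.0),
]
combined = weighted_score(results)
print(f"combined score: {combined:.3f}")
```

Dividing by the total weight keeps the combined score in the same range as the individual scores, so adding or removing an evaluator does not change the scale of the result.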

Architecture

Key Concepts

Concept	What it is
Task (Experiment)	Defines what to test: instruction, environment, evaluators, optional optimization
Subject	The thing being evaluated: agent, skill, action-block, capability, or MCP
Suite	A folder grouping related tasks
Run	Executes subjects against tasks, produces scored results
Evaluator	Scores the subject's output (expected-output, script, agent-judge, deliberation)
Optimization	Agent-driven iterative improvement of the subject
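
To make the relationships between these concepts concrete, here is a minimal sketch that models them as Python dataclasses. Every class and field name below is hypothetical and chosen only to mirror the table; it is not Codebolt's actual schema. The only values taken from this document are the five subject types.

```python
from dataclasses import dataclass, field
from typing import Optional

# Subject types as listed in this document.
SUBJECT_TYPES = {"agent", "skill", "action-block", "capability", "mcp"}


@dataclass
class Subject:
    """The thing being evaluated."""
    type: str
    name: str

    def __post_init__(self) -> None:
        # Reject anything outside the five documented subject types.
        if self.type not in SUBJECT_TYPES:
            raise ValueError(f"unknown subject type: {self.type!r}")


@dataclass
class Evaluator:
    """Scores a subject's output; 'weight' is an assumed field."""
    kind: str            # e.g. "expected-output", "script", "agent-judge"
    weight: float = 1.0


@dataclass
class Task:
    """An experiment: what to test and how to score it."""
    instruction: str
    environment: dict
    evaluators: list[Evaluator]
    optimization: Optional[dict] = None  # optional, per the table above


@dataclass
class Suite:
    """A named group of related tasks."""
    name: str
    tasks: list[Task] = field(default_factory=list)


@dataclass
class Run:
    """Executes subjects against tasks and collects scores."""
    subjects: list[Subject]
    tasks: list[Task]
    scores: dict = field(default_factory=dict)  # filled in as results arrive


# Illustrative usage with made-up names.
task = Task(
    instruction="Summarize the README in one paragraph",
    environment={"workspace": "sample-project"},
    evaluators=[Evaluator("agent-judge", weight=2.0), Evaluator("script")],
)
suite = Suite(name="summaries", tasks=[task])
run = Run(subjects=[Subject("agent", "my-agent")], tasks=suite.tasks)
print(len(run.subjects), len(run.tasks), task.optimization)
```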

Subject Types

Type	What it is
`agent`	An installed agent
`skill`	A skill
`action-block`	An action block
`capability`	A capability
`mcp`	An MCP server

Data Storage

All eval data is stored as JSON files in `.codebolt/evals/`:

.codebolt/evals/
├── index.json
├── tasks/
├── suites/
└── runs/
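
Because the data is plain JSON on disk, it can be inspected with a few lines of standard-library code. The sketch below assumes that `tasks/`, `suites/`, and `runs/` contain `.json` files; the file names and JSON contents in the demo are invented for illustration, since this document does not describe the actual schema.

```python
import json
import tempfile
from pathlib import Path


def load_eval_data(project_root: Path) -> dict:
    """Load every JSON file under .codebolt/evals/, grouped by subfolder."""
    evals_dir = project_root / ".codebolt" / "evals"
    data = {"index": None, "tasks": [], "suites": [], "runs": []}

    index_file = evals_dir / "index.json"
    if index_file.is_file():
        data["index"] = json.loads(index_file.read_text(encoding="utf-8"))

    for kind in ("tasks", "suites", "runs"):
        folder = evals_dir / kind
        if not folder.is_dir():
            continue  # tolerate a missing subfolder
        # rglob also picks up JSON files nested in subfolders
        for path in sorted(folder.rglob("*.json")):
            data[kind].append(json.loads(path.read_text(encoding="utf-8")))
    return data


# Demo against a throwaway directory; the file contents are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    evals = root / ".codebolt" / "evals"
    for folder_name in ("tasks", "suites", "runs"):
        (evals / folder_name).mkdir(parents=True)
    (evals / "index.json").write_text(
        json.dumps({"tasks": ["demo"]}), encoding="utf-8"
    )
    (evals / "tasks" / "demo.json").write_text(
        json.dumps({"instruction": "Say hello"}), encoding="utf-8"
    )
    data = load_eval_data(root)

print(data["index"], len(data["tasks"]), len(data["runs"]))
```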

Workflow

Open the Eval Panel in Codebolt (Experiments tab).
Create an experiment: define its instruction, environment, and evaluators.
Switch to the Runs tab and create a run, selecting the subjects to evaluate.
Click Start: subjects execute, evaluators score their output, and results update in real time.
Optionally enable optimization: the optimizer agent iterates to improve scores.
Review the leaderboard, which ranks subjects by score.
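
The optimization and leaderboard steps above can be sketched as a simple "keep it only if it scores higher" loop followed by a sort. This is an illustration of the general idea, not Codebolt's optimizer: `optimize`, `leaderboard`, and the toy `evaluate` and `propose_change` functions are all hypothetical stand-ins for the optimizer agent and the real evaluators.

```python
from typing import Callable


def optimize(
    subject: str,
    evaluate: Callable[[str], float],
    propose_change: Callable[[str], str],
    iterations: int = 5,
) -> tuple[str, float]:
    """Iteratively change the subject, keeping only changes that score higher."""
    best, best_score = subject, evaluate(subject)
    for _ in range(iterations):
        candidate = propose_change(best)
        score = evaluate(candidate)
        if score > best_score:  # keep improvements, discard everything else
            best, best_score = candidate, score
    return best, best_score


def leaderboard(scores: dict[str, float]) -> list[tuple[str, float]]:
    """Rank subjects from highest to lowest score."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy stand-ins: the "subject" is a prompt string, and the score is the
# fraction of required keywords that the prompt mentions.
REQUIRED = ["tests", "types", "docs"]


def evaluate(prompt: str) -> float:
    return sum(word in prompt for word in REQUIRED) / len(REQUIRED)


def propose_change(prompt: str) -> str:
    for word in REQUIRED:
        if word not in prompt:
            return f"{prompt} Include {word}."
    return prompt  # nothing left to improve


best_prompt, best_score = optimize("Write a function.", evaluate, propose_change)
ranking = leaderboard({"baseline": evaluate("Write a function."), "optimized": best_score})
print(best_score, ranking[0][0])
```

Accepting a candidate only when its score strictly increases means the subject never regresses between iterations, at the cost of possibly stalling when no single change helps.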
