Eval & Optimization

Codebolt's Eval & Optimization system lets you measure agent quality systematically and improve it automatically. You define what "good" looks like, run agents against test tasks, and score their outputs. Optionally, an optimizer agent iterates on the agent's code, prompts, or config, re-evaluating after each change and keeping the changes that improve the score, up to a configured iteration limit.

Open via: Bottom bar → Agents → Eval

How it works

Define Tasks
│ instructions, evaluators, environment

Create a Run
│ select subjects (agents to test)

Execute
│ subjects run tasks in their environment

Score
│ evaluators produce weighted scores

Optimize (optional)
│ optimizer agent modifies subject code/prompts
│ re-evaluates, keeps improvements
└─ repeats up to maxIterations
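
The optimize step above can be sketched roughly as follows. This is a minimal illustration of a "modify, re-evaluate, keep improvements" loop, not Codebolt's actual implementation: the `optimize`, `evaluate`, and `propose_change` names are hypothetical, and the subject is reduced to a plain string for brevity.

```python
from typing import Callable


def optimize(
    subject: str,
    evaluate: Callable[[str], float],
    propose_change: Callable[[str, float], str],
    max_iterations: int,
) -> tuple[str, float]:
    """Return the best-scoring subject found within max_iterations.

    All names here are illustrative assumptions:
    - evaluate: runs the eval and returns a 0-100 score
    - propose_change: stands in for the optimizer agent
    """
    best_subject = subject
    best_score = evaluate(subject)  # baseline score before any modification

    for _ in range(max_iterations):
        candidate = propose_change(best_subject, best_score)
        candidate_score = evaluate(candidate)
        # Keep the modification only if it strictly improves the score.
        if candidate_score > best_score:
            best_subject, best_score = candidate, candidate_score

    return best_subject, best_score
```

In this sketch a candidate that scores the same or worse is discarded, so the returned subject never scores below the baseline.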

Core concepts

| Concept | What it is |
|---|---|
| Task | A test definition: instruction, evaluators, environment, and optional optimization config |
| Suite | A named folder grouping related tasks |
| Subject | What is being tested; currently agents and action blocks |
| Run | An execution that pairs subjects × tasks and produces scored results |
| Evaluator | A scoring mechanism that inspects the subject's output and returns a 0–100 score |
| Optimization | An automatic loop that modifies a subject between eval iterations to improve its score |
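
To make the idea of weighted evaluator scores concrete, here is a small sketch that combines several 0–100 scores into one number. It assumes a weighted average; this page does not specify how Codebolt combines weights, so treat the formula and the `weighted_score` name as illustrative only.

```python
def weighted_score(results: list[tuple[float, float]]) -> float:
    """Combine (score, weight) pairs into a single 0-100 score.

    Assumption: a weighted average. The real combination rule may differ.
    """
    if not results:
        raise ValueError("at least one evaluator result is required")

    weighted_sum = 0.0
    total_weight = 0.0
    for score, weight in results:
        if not 0 <= score <= 100:
            raise ValueError(f"score out of range: {score}")
        if weight < 0:
            raise ValueError(f"negative weight: {weight}")
        weighted_sum += score * weight
        total_weight += weight

    if total_weight == 0:
        raise ValueError("weights must not all be zero")
    return weighted_sum / total_weight
```

For example, `weighted_score([(80, 2), (50, 1)])` returns `70.0`: the evaluator with weight 2 counts twice as much as the one with weight 1.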

What can be evaluated

| Subject type | Description |
|---|---|
| agent | A full agent run: the agent receives the task instruction and produces output |
| action-block | An action block execution |

Support for evaluating skill, capability, and mcp subjects is planned.

Data storage

All eval data lives inside your project under .codebolt/evals/:

.codebolt/evals/
├── index.json ← subjects, tasks, suites, runs metadata
├── subjects/ ← one file per subject
├── tasks/ ← one file per task
├── suites/ ← one file per suite
└── runs/ ← one file per run (includes all results)

Because the data is stored as plain files in your project, it can be committed, diffed, and shared with teammates.
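
Since the layout is just directories of files, ordinary scripts can inspect it. The sketch below counts the files in each subfolder shown above. It relies only on the directory names listed on this page; it makes no assumption about file extensions or the contents of index.json, and the `count_eval_files` helper is hypothetical.

```python
from pathlib import Path


def count_eval_files(project_root: str) -> dict[str, int]:
    """Count the files in each .codebolt/evals/ subfolder.

    Missing subfolders are reported as 0 rather than raising an error.
    """
    evals_dir = Path(project_root) / ".codebolt" / "evals"
    counts: dict[str, int] = {}
    for name in ("subjects", "tasks", "suites", "runs"):
        folder = evals_dir / name
        if folder.is_dir():
            counts[name] = sum(1 for entry in folder.iterdir() if entry.is_file())
        else:
            counts[name] = 0
    return counts
```

Running `count_eval_files(".")` from a project root gives a quick overview of how many subjects, tasks, suites, and runs are stored there.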

Pages in this section

| Page | What it covers |
|---|---|
| Eval Tasks | Defining tasks: instructions, environments, evaluators |
| Evaluators | All four evaluator types and how scoring works |
| Running Evals | Subjects, suites, runs, leaderboard, and results |
| Optimization | The optimization loop, strategies, and output |