Eval & Optimization

Codebolt's Eval & Optimization system lets you scientifically measure agent quality and automatically improve it. You define what "good" looks like, run agents against test tasks, score their outputs, and optionally let an optimizer agent iterate on the agent's code, prompts, or config until the score improves.

Open via: Bottom bar → Agents → Eval

How it works

Define Tasks
    │  instructions, evaluators, environment
    ▼
Create a Run
    │  select subjects (agents to test)
    ▼
Execute
    │  subjects run tasks in their environment
    ▼
Score
    │  evaluators produce weighted scores
    ▼
Optimize (optional)
    │  optimizer agent modifies subject code/prompts
    │  re-evaluates, keeps improvements
    └─ repeats up to maxIterations

Core concepts

Concept	What it is
Task	A test definition — instruction, evaluators, environment, and optional optimization config
Suite	A named folder grouping related tasks
Subject	What is being tested — currently agents and action blocks
Run	An execution that pairs subjects × tasks and produces scored results
Evaluator	A scoring mechanism that inspects the subject's output and returns a 0–100 score
Optimization	An automatic loop that modifies a subject between eval iterations to improve its score

What can be evaluated

Subject type	Description
`agent`	A full agent run — the agent receives the task instruction and produces output
`action-block`	An action block execution

Support for evaluating skill, capability, and mcp subjects is planned.

Data storage

All eval data lives inside your project under .codebolt/evals/:

.codebolt/evals/
├── index.json          ← subjects, tasks, suites, runs metadata
├── subjects/           ← one file per subject
├── tasks/              ← one file per task
├── suites/             ← one file per suite
└── runs/               ← one file per run (includes all results)

Because it's plain files in your project, eval data can be committed, diffed, and shared with teammates.

Pages in this section

Page	What it covers
Eval Tasks	Defining tasks — instructions, environments, evaluators
Evaluators	All four evaluator types and how scoring works
Running Evals	Subjects, suites, runs, leaderboard, and results
Optimization	The optimization loop, strategies, and output

How it works​

Core concepts​

What can be evaluated​

Data storage​

Pages in this section​

How it works

Core concepts

What can be evaluated

Data storage

Pages in this section