Auto-Optimize Agents
Once your agent basically works, the next question is not "does it run?" but "does it reliably perform well?"
That is where Codebolt's eval and optimization system fits.
The full Evaluation & Optimization section lives outside this guide because the same system applies to more than agents:
- agents
- skills
- capabilities
- tools and MCP integrations
- prompt and context strategies
But for agent authors, this is the natural next step after Testing and Debugging.
When to use optimization
Reach for optimization when:
- your agent works, but quality is inconsistent
- you want to compare prompt or model variants
- you added tools or capabilities and want evidence they help
- you want to reduce cost or latency without harming quality
- you are preparing an agent for publishing or wider internal use
Do not start here. First make the agent correct enough to be worth measuring.
The practical sequence
For custom agents, the workflow is usually:
- Build the agent.
- Run it manually on real tasks.
- Add tests and replay coverage.
- Create an eval set from the kinds of tasks the agent should handle well.
- Run optimization loops to compare variants.
- Promote the winning version.
In short:
build -> test -> replay -> eval -> optimize -> publish
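The eval-and-optimize steps of that sequence can be sketched as a small loop: run each candidate variant over a fixed eval set, score the outputs, and promote the best scorer. This is an illustrative sketch only; `run_agent`, `EVAL_SET`, and `VARIANTS` are hypothetical stand-ins, not Codebolt APIs.

```python
import statistics

# Hypothetical stand-ins -- the real Codebolt eval system has its own API.
EVAL_SET = [
    {"task": "summarize release notes", "expected_keyword": "summary"},
    {"task": "triage a bug report", "expected_keyword": "severity"},
]

VARIANTS = {
    "baseline": {"model": "model-a", "temperature": 0.7},
    "low-temp": {"model": "model-a", "temperature": 0.2},
}

def run_agent(task: str, config: dict) -> str:
    """Stand-in for executing the agent on a task with a given config."""
    # A real implementation would invoke the agent and capture its output.
    return f"{task}: summary severity (t={config['temperature']})"

def score(output: str, case: dict) -> float:
    """Toy metric: 1.0 if the expected keyword appears in the output."""
    return 1.0 if case["expected_keyword"] in output else 0.0

def evaluate(config: dict) -> float:
    """Mean score of one variant across the whole eval set."""
    return statistics.mean(
        score(run_agent(case["task"], config), case) for case in EVAL_SET
    )

results = {name: evaluate(cfg) for name, cfg in VARIANTS.items()}
winner = max(results, key=results.get)
print(winner, results)
```

The point of the sketch is the shape, not the metric: a fixed eval set, one score per variant, and a mechanical comparison, so "promote the winning version" is a data-driven step rather than a judgment call.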
What you can optimize
For agents, common optimization targets are:
- system prompt wording
- model choice
- decoding settings (temperature and other sampling parameters)
- tool allowlists
- capability activation
- context assembly choices
The goal is not "make it smarter" in the abstract. The goal is to improve a measurable outcome on a known task set.
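One way to make those targets concrete is to treat each one as a dimension of a search space and expand the combinations into candidate configs. The dimension names below are hypothetical examples, not a Codebolt schema.

```python
import itertools

# Illustrative search space; keys and values are hypothetical examples.
SEARCH_SPACE = {
    "system_prompt": ["concise-v1", "detailed-v2"],
    "model": ["model-a", "model-b"],
    "temperature": [0.2, 0.7],
    "tool_allowlist": [("search",), ("search", "code_exec")],
}

def variants(space: dict):
    """Expand a search space into concrete variant configs (Cartesian product)."""
    keys = list(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

all_variants = list(variants(SEARCH_SPACE))
print(len(all_variants))  # 2 x 2 x 2 x 2 = 16 candidate configs
```

A full grid grows multiplicatively, so in practice you would usually vary one or two dimensions at a time against a fixed baseline rather than sweep everything at once.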
Why this stays outside Creating Agents
The eval system is broader than agent authoring.
It is also the right place to measure:
- whether a skill improves a task class
- whether an MCP tool is called correctly
- whether a capability helps or harms
- whether a provider or model swap changes cost, latency, or quality
That is why Evaluation & Optimization is a top-level section rather than part of Creating Agents. This page is just the bridge for agent builders.
Start here next
- Evaluation & Optimization Overview — the full system
- Replay and Traces — use real runs as eval material
- Writing Evals — build a useful eval set
- Optimization Loop — generate and compare variants
- Metrics & Scoring — decide what "better" means