Evaluation & Optimization

Agents are programs. Like any program, you get better results by measuring them and iterating. Codebolt ships a developer-facing eval and optimization system for exactly that.

This section is about refining what you built — systematically, with evidence. It's distinct from guardrails, which are runtime safety for the person using an agent.

What this is for

  • Quality regression. Did my change to the prompt make the agent worse on tasks it used to handle?
  • A/B comparison. Agent A vs. Agent B on the same inputs — which one performs better, and on what kinds of tasks?
  • Prompt optimization. Systematically tune prompt text, temperature, tool selection, or context rules, guided by scores.
  • Tool quality. Is my new MCP tool actually being called correctly and producing good results?
  • Capability tuning. Does activating a capability improve the behaviour it's meant to improve?
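The A/B case above can be sketched in a few lines. This is an illustrative harness, not a Codebolt API: `run_agent_a` and `run_agent_b` are placeholder stand-ins for invoking two agent versions on the same inputs.

```python
# Hypothetical sketch: run two agent versions over the same inputs
# and tally which one passes more fixtures. The two run_agent_*
# functions are placeholders for real agent invocations.
def run_agent_a(task: str) -> str:
    return task.upper()          # placeholder behaviour

def run_agent_b(task: str) -> str:
    return task.upper().strip()  # placeholder behaviour

def ab_compare(tasks, expected, a, b):
    """Return per-agent pass counts over a shared task list."""
    wins = {"A": 0, "B": 0}
    for task, want in zip(tasks, expected):
        if a(task) == want:
            wins["A"] += 1
        if b(task) == want:
            wins["B"] += 1
    return wins

wins = ab_compare(["hello ", "world"], ["HELLO", "WORLD"],
                  run_agent_a, run_agent_b)
```

Because both agents see identical inputs, the pass counts are directly comparable; breaking the tally down by task category tells you not just which agent wins, but on what kinds of tasks.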

The pieces

| Piece | Purpose | Reference |
| --- | --- | --- |
| Replay & traces | Re-run recorded conversations against a new agent version | Replay and Traces |
| Eval sets | Curated input/expected-output fixtures | Writing Evals |
| Optimization loop | Server-driven tuning — generate variants, evaluate, pick winners | Optimization Loop |
| Metrics & scoring | Measurable dimensions (correctness, cost, latency, tool choice quality) | Metrics & Scoring |
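To make "eval sets" and "scoring" concrete, here is one possible shape for a curated input/expected-output fixture list and an exact-match correctness scorer. The schema and the stand-in agent are assumptions for illustration; real Codebolt eval sets may differ.

```python
# Hypothetical eval-set shape: a list of input/expected-output fixtures.
fixtures = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]

def score(agent, fixtures):
    """Fraction of fixtures where the agent's output matches exactly."""
    passed = sum(1 for f in fixtures if agent(f["input"]) == f["expected"])
    return passed / len(fixtures)

# Trivial stand-in agent that only handles the arithmetic fixture:
agent = lambda q: "4" if q == "2 + 2" else "unknown"
correctness = score(agent, fixtures)  # 0.5
```

Exact match is the bluntest possible metric; correctness in practice is usually one dimension alongside cost, latency, and tool-choice quality, as the Metrics & Scoring reference covers.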

The mental model

┌── event log ──┐        ┌── eval set ──┐
│   real runs   │        │   fixtures   │
└──────┬────────┘        └──────┬───────┘
       │                        │
       ▼                        ▼
 replay agent X          evaluate agent X
       │                        │
       └───────┬────────────────┘
               ▼
        scores + traces
               │
               ▼
 optimization loop — propose changes, re-evaluate
               │
               ▼
        agent X', X'', ...

Production runs become eval material. Eval results drive the optimization loop. The loop produces candidate agents; you pick the one that wins and deploy it.
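The loop's skeleton — propose candidates, evaluate each, keep the winner — can be sketched as below. Everything here is a stand-in: `evaluate` fakes a score surface (real scores come from running the eval set), and the candidates are prompt-temperature variants picked arbitrarily for illustration.

```python
# Hypothetical optimization-loop skeleton: score each proposed variant
# against the eval set, then keep the best one.
def evaluate(temperature: float) -> float:
    # Stand-in scorer: pretend quality peaks at temperature 0.3.
    # In practice this would replay/evaluate an agent variant.
    return 1.0 - abs(temperature - 0.3)

def optimize(candidates):
    """Return the best-scoring candidate and its score."""
    scored = [(evaluate(t), t) for t in candidates]
    best_score, best = max(scored)
    return best, best_score

best, best_score = optimize([0.0, 0.3, 0.7, 1.0])  # picks 0.3
```

A server-driven loop iterates this: the winner becomes the new baseline, fresh variants are proposed around it, and the cycle repeats until scores plateau.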

When to reach for it

  • Before a significant change to a production agent — baseline first, change, compare.
  • When behaviour regresses in a way you can't pin to a specific run.
  • When you want to experiment with many small variations and need a tiebreaker better than intuition.
  • Before publishing a capability or agent to the marketplace.

When not to reach for it: early exploration. Eval overhead is real — don't build an eval set before you know what "good" looks like.

Eval vs. guardrails

| Axis | Eval | Guardrails |
| --- | --- | --- |
| When it runs | Offline, during development | Inline, at runtime |
| What it does | Scores outputs | Blocks / rewrites outputs |
| Who consumes | Developer refining the agent | End user running the agent |
| Where it lives in docs | Build on Codebolt (this section) | Using Codebolt |

They complement each other: guardrails catch runtime issues; evals catch quality regressions before they reach runtime.
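The functional difference is easy to see side by side. These two functions are illustrative only (not Codebolt APIs): an eval scorer observes and measures; a guardrail intervenes in the output path.

```python
# Illustrative contrast, not Codebolt APIs.

def eval_score(output: str, expected: str) -> float:
    """Offline: measure the output, never change it."""
    return 1.0 if output == expected else 0.0

def guardrail(output: str) -> str:
    """Inline: may block or rewrite the output before the user sees it."""
    if "secret" in output:       # hypothetical policy check
        return "[blocked]"
    return output
```

A scorer returning 0.0 is feedback for the developer; a guardrail returning `[blocked]` is protection for the user. Keeping the two roles separate is the point of the table above.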

See also