Evaluators

An evaluator inspects a subject's output after a task run and returns a score from 0 to 100, plus a reasoning string explaining the score. Each task can have multiple evaluators; their scores are combined into a single weighted average.

Scoring

final score = Σ(evaluator.score × evaluator.weight) / Σ(evaluator.weight)

Each evaluator has a weight you set when adding it to a task. A weight of 2 makes that evaluator count twice as much as one with weight 1. Weights are relative — only the ratios matter.
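
As a quick illustration of the weighting, the sketch below computes the combined score for a task run. The scores, weights, and reasoning strings are hypothetical; this is not a Codebolt API, just the formula applied in Python.

# Hypothetical evaluator results for one task run (illustrative only).
results = [
    {"score": 90, "weight": 3, "reasoning": "27/30 tests passed"},
    {"score": 70, "weight": 1, "reasoning": "Clear code, but missing docstrings"},
]

# final score = Σ(score × weight) / Σ(weight)
final_score = sum(r["score"] * r["weight"] for r in results) / sum(r["weight"] for r in results)
print(final_score)  # (90×3 + 70×1) / 4 = 85.0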

Every evaluator result includes:

  • score — normalised 0–100
  • rawScore — original value before normalisation (varies by evaluator type)
  • reasoning — explanation produced by the evaluator

Evaluator types

Expected Output

Compares the subject's output against an expected string you define on the task.

Match mode | How it works
exact      | Output must match the expected string exactly (100 or 0)
contains   | Output must contain the expected string (100 or 0)
regex      | Output must match a regular expression (100 or 0)
semantic   | An LLM judge decides how semantically similar the output is to the expected string (0–100)

Use exact or contains for deterministic, structured outputs. Use semantic when you care about meaning rather than exact wording.
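
To make the deterministic modes concrete, here is a rough sketch of the matching logic. It is an illustration of the rules in the table above, not Codebolt's implementation.

import re

def match_score(mode: str, output: str, expected: str) -> int:
    """Sketch of the deterministic match modes (illustrative only)."""
    if mode == "exact":
        return 100 if output == expected else 0
    if mode == "contains":
        return 100 if expected in output else 0
    if mode == "regex":
        return 100 if re.search(expected, output) else 0
    # semantic is scored by an LLM judge, not by string matching
    raise ValueError(f"unsupported mode: {mode}")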


Script

Runs a shell script that you write. The subject's output is provided as the environment variable EVAL_OUTPUT. Your script prints a JSON object to stdout:

{ "score": 85, "reasoning": "The agent correctly identified all three issues but missed the severity ranking." }

Score range: 0–100, as returned by your script.

Use Script evaluators for custom logic — checking file contents the agent wrote, running a test suite against the agent's code, calling an external API to validate the result, etc.

#!/bin/bash
# Example: run jest tests against the agent's code
cd /workspace
npm test -- --json 2>/dev/null | python3 -c "
import json, sys
data = json.load(sys.stdin)
passed = data['numPassedTests']
total = data['numTotalTests']
score = int(passed / total * 100)
print(json.dumps({'score': score, 'reasoning': f'{passed}/{total} tests passed'}))
"

Agent Judge

Spawns a judge agent to evaluate the output. The judge receives the task instruction, the subject's output, and any evaluation criteria you specify. It returns a JSON score and reasoning.

Config      | Description
Judge Agent | Which agent to use as the judge
Criteria    | Natural-language description of what to look for

The judge agent sees:

  • The original task instruction (what was asked)
  • The subject's output (what was produced)
  • Your criteria (what good looks like)

The judge then responds with a structured score. This evaluator is best for open-ended outputs where you need subjective reasoning — code quality, explanation clarity, tone, completeness.
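
Codebolt assembles the judge's input from those three pieces; conceptually it looks something like the sketch below. The field names and shapes are illustrative, not the actual payload format.

import json

# Illustrative only: roughly what the judge agent receives.
judge_input = {
    "instruction": "Summarise the incident report and rank issues by severity.",
    "output": "...the subject's output...",
    "criteria": "The summary must cover all three issues and rank them by severity.",
}

# The judge replies with a structured score, e.g.:
judge_response = '{"score": 72, "reasoning": "Covers all issues but the ranking is unjustified."}'
result = json.loads(judge_response)
print(result["score"], result["reasoning"])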


Deliberation

Multiple agents discuss and vote on the output quality. Codebolt creates an agent deliberation session with four score brackets:

Bracket   | Score range
Poor      | 0–25
Fair      | 26–50
Good      | 51–75
Excellent | 76–100

Each participating agent votes for a bracket and explains its reasoning. The winning bracket determines the score. This evaluator produces a more robust assessment than a single judge for high-stakes or ambiguous tasks.
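
As a rough sketch of the voting mechanics, the snippet below tallies hypothetical votes using the brackets from the table above. How Codebolt converts the winning bracket into a final number is not specified here, so the bracket midpoint is an assumption.

from collections import Counter

# Score brackets from the table above.
BRACKETS = {"Poor": (0, 25), "Fair": (26, 50), "Good": (51, 75), "Excellent": (76, 100)}

# Hypothetical votes from the participating agents.
votes = ["Good", "Good", "Excellent"]

winner, _ = Counter(votes).most_common(1)[0]
low, high = BRACKETS[winner]
score = (low + high) // 2   # assumption: bracket midpoint; Codebolt may differ
print(winner, score)        # Good 63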

See Agent Deliberation for how the deliberation mechanism works.

Combining evaluators

A task can combine evaluator types. For example:

Evaluator   | Weight | Purpose
Script      | 3      | Run automated tests — hard correctness signal
Agent Judge | 1      | Assess code quality — soft quality signal

The script carries more weight here because automated test results are more reliable than subjective quality judgement. Adjust weights to reflect how much you trust each signal.
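
For instance, with hypothetical scores of 80 from the Script evaluator and 60 from the Agent Judge, the final score would be (80 × 3 + 60 × 1) / 4 = 75.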