
Running Evals and Results

Create runs to execute subjects against experiments, view results in real time, and compare subjects on a leaderboard.

Creating a Run

  1. Open the Eval Panel and switch to the Runs tab.
  2. Click New Run.
  3. Select the subject type (agent, skill, action-block, capability, MCP).
  4. Pick one or more subjects from the available list.
  5. Choose a task (single experiment) or suite (all tasks in the suite).
  6. Click Create. The run is created in the pending state.
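
The same choices can be expressed as a request body for POST /evals/runs. The sketch below is illustrative only: the field names (subjectType, subjectIds, taskId, suiteId) and the exact spelling of the subject type values are assumptions, not the documented schema.

```python
from typing import Optional

# Subject types as listed in step 3; the API's exact spelling is an assumption.
SUBJECT_TYPES = {"agent", "skill", "action-block", "capability", "MCP"}


def build_create_run_payload(
    subject_type: str,
    subject_ids: list[str],
    task_id: Optional[str] = None,
    suite_id: Optional[str] = None,
) -> dict:
    """Build a hypothetical JSON body for POST /evals/runs.

    All field names here are assumed for illustration.
    """
    if subject_type not in SUBJECT_TYPES:
        raise ValueError(f"unknown subject type: {subject_type!r}")
    if not subject_ids:
        raise ValueError("pick at least one subject")
    # A run targets either a single task or a whole suite, not both.
    if (task_id is None) == (suite_id is None):
        raise ValueError("provide exactly one of task_id or suite_id")

    payload = {"subjectType": subject_type, "subjectIds": list(subject_ids)}
    if task_id is not None:
        payload["taskId"] = task_id
    else:
        payload["suiteId"] = suite_id
    return payload
```

Validating "exactly one of task or suite" on the client mirrors step 5 and avoids sending a request the server would presumably reject.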

Starting a Run

Click Start on a pending run. The server:

  1. Transitions the run to running.
  2. For each (subject, task) pair:
    • Resolves the environment (local, remote, or provider).
    • Resolves the instruction (text, script, or hybrid).
    • Executes the subject: creates a thread, sends the instruction, and waits for completion.
    • Extracts the subject's output (last agent message).
    • Runs all evaluators against the output.
    • Computes the weighted score.
    • If optimization is enabled and the score is below target, enters the optimization loop.
  3. Computes aggregate scores per subject.
  4. Marks the run as completed.
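
The sequence above can be sketched as a small simulation. This is not the server's implementation: environment and instruction resolution are omitted, the callables (run_subject, score_output, optimize) are hypothetical stand-ins, and the max_iterations cap on the optimization loop is an assumption.

```python
from collections import defaultdict
from typing import Callable, Optional


def execute_run(
    subjects: list[str],
    tasks: list[str],
    run_subject: Callable[[str, str], str],
    score_output: Callable[[str], float],
    optimize: Optional[Callable[[str, str, str, float], str]] = None,
    target: Optional[float] = None,
    max_iterations: int = 3,  # assumed cap, not from the docs
) -> dict:
    """Simulate the per-run loop: execute, score, optionally optimize, aggregate.

    run_subject stands in for "create a thread, send the instruction, wait".
    score_output stands in for "run all evaluators and compute the weighted score".
    """
    run = {"status": "running", "results": [], "aggregates": {}}
    per_subject = defaultdict(list)

    for subject in subjects:
        for task in tasks:
            output = run_subject(subject, task)
            score = score_output(output)
            iterations = []
            # Optimization only runs when enabled and the score is below target.
            if optimize is not None and target is not None:
                while score < target and len(iterations) < max_iterations:
                    output = optimize(subject, task, output, score)
                    score = score_output(output)
                    iterations.append(score)
            run["results"].append(
                {"subject": subject, "task": task, "output": output,
                 "score": score, "iterations": iterations}
            )
            per_subject[subject].append(score)

    # Aggregate score per subject: mean across that subject's tasks.
    run["aggregates"] = {s: sum(v) / len(v) for s, v in per_subject.items()}
    run["status"] = "completed"
    return run
```

Passing the execution and scoring steps in as callables keeps the control flow visible without guessing at how threads or evaluators are actually implemented.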

Real-Time Updates

Results update in real time via WebSocket events:

| Event | When |
|---|---|
| eval.run.result.updated | After each result completes or changes |
| eval.run.optimization.updated | After each optimization iteration |
| eval.run.completed | When the entire run finishes |

The UI merges these updates live, so you can watch scores and optimization progress as they happen.
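
A client that consumes these events might merge them into local state along the following lines. The event names come from the table above; the payload shape (type, resultId, result, iteration) is an assumption made for illustration, and the WebSocket transport itself is left out.

```python
def apply_event(state: dict, event: dict) -> dict:
    """Merge one decoded event into local run state, in place.

    Assumed state layout: {"status": str, "results": {}, "iterations": {}}.
    Assumed event keys: "type", plus "resultId"/"result"/"iteration".
    """
    kind = event["type"]
    if kind == "eval.run.result.updated":
        # Replace the stored result with the latest version.
        state["results"][event["resultId"]] = event["result"]
    elif kind == "eval.run.optimization.updated":
        # Append the new iteration to that result's history.
        state["iterations"].setdefault(event["resultId"], []).append(event["iteration"])
    elif kind == "eval.run.completed":
        state["status"] = "completed"
    # Unknown event types are ignored so newer servers do not break the client.
    return state
```

Keeping the merge in one pure-ish function makes it easy to replay a recorded event stream when debugging a UI.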

Run Results

Each (subject, task) result contains:

| Field | Description |
|---|---|
| status | pending, running, completed, failed, skipped |
| output | The subject's output text |
| score | Weighted average of evaluator scores (0-100) |
| evaluatorResults | Individual evaluator scores and reasoning |
| durationMs | How long execution took |
| threadId | Link to the chat thread |
| optimization | Iteration history (if optimization was enabled) |
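
As a worked example of the score field, a weighted average over evaluator results could be computed as below. The "score" and "weight" keys on each evaluator result are assumptions; the docs only state that the result is a weighted average on a 0-100 scale.

```python
def weighted_score(evaluator_results: list[dict]) -> float:
    """Weighted average of evaluator scores on a 0-100 scale.

    Each item is assumed to carry "score" (0-100) and "weight" (> 0).
    """
    total_weight = sum(r["weight"] for r in evaluator_results)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(r["score"] * r["weight"] for r in evaluator_results) / total_weight


# Example: a heavily weighted evaluator dominates the result.
example = [
    {"score": 100, "weight": 3},
    {"score": 60, "weight": 1},
]
print(weighted_score(example))  # (100*3 + 60*1) / 4 = 90.0
```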

Leaderboard

After a run completes, the leaderboard ranks subjects by their average score across all tasks:

| Rank | Subject | Score |
|---|---|---|
| 1 | Agent A | 92 |
| 2 | Agent C | 85 |
| 3 | Agent B | 71 |

Access via the UI or API: GET /evals/runs/:id/leaderboard.
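
The ranking rule can be sketched as follows. The per-task scores are made-up illustrative values, the "subject" and "score" keys are assumed names, and how failed or skipped results are counted is not specified in this page, so the sketch simply averages every score it is given.

```python
def build_leaderboard(results: list[dict]) -> list[dict]:
    """Rank subjects by mean score across their tasks, highest first."""
    scores_by_subject: dict[str, list[float]] = {}
    for r in results:
        scores_by_subject.setdefault(r["subject"], []).append(r["score"])
    averages = {s: sum(v) / len(v) for s, v in scores_by_subject.items()}
    # Sort by score descending; break ties by subject name (an assumed rule).
    ordered = sorted(averages.items(), key=lambda item: (-item[1], item[0]))
    return [
        {"rank": i, "subject": s, "score": avg}
        for i, (s, avg) in enumerate(ordered, start=1)
    ]


# Illustrative input only; not real run data.
results = [
    {"subject": "Agent A", "score": 100},
    {"subject": "Agent A", "score": 84},
    {"subject": "Agent B", "score": 71},
    {"subject": "Agent C", "score": 90},
    {"subject": "Agent C", "score": 80},
]
for row in build_leaderboard(results):
    print(row["rank"], row["subject"], row["score"])
```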

Subject Profile

View a single subject's scores across all tasks in a run:

| Task | Score | Status |
|---|---|---|
| Sort array | 100 | completed |
| Parse JSON | 75 | completed |
| Write tests | 60 | completed |

Access via: GET /evals/runs/:id/profile/:subjectId.
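
Conceptually, a profile is the run's results filtered down to one subject. A minimal sketch, assuming each result carries "subject", "task", "score", and "status" keys (hypothetical names):

```python
def subject_profile(results: list[dict], subject: str) -> list[dict]:
    """Return one row per task for the given subject, in input order."""
    return [
        {"task": r["task"], "score": r["score"], "status": r["status"]}
        for r in results
        if r["subject"] == subject
    ]
```

Averaging the score column of a profile gives the figure that subject would contribute to the leaderboard under the same averaging rule.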

Run States

| State | Meaning |
|---|---|
| pending | Created but not started |
| running | Execution in progress |
| completed | All results finished |
| failed | Run failed with an error |
| cancelled | User cancelled the run |
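
These states suggest a simple state machine. The transition table below is an inference from this page (start moves pending to running, cancel applies to a running run, and the last three states are terminal); whether a pending run can be cancelled directly is not stated, so the sketch disallows it as an assumption.

```python
# Allowed transitions, inferred from the docs; treat as an assumption.
ALLOWED_TRANSITIONS = {
    "pending": {"running"},
    "running": {"completed", "failed", "cancelled"},
    "completed": set(),
    "failed": set(),
    "cancelled": set(),
}


def transition(current: str, new: str) -> str:
    """Return the new state, or raise if the move is not allowed."""
    if current not in ALLOWED_TRANSITIONS:
        raise ValueError(f"unknown state: {current!r}")
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current!r} to {new!r}")
    return new


def is_terminal(state: str) -> bool:
    """A state is terminal when no further transitions are allowed."""
    return not ALLOWED_TRANSITIONS[state]
```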

REST API

| Method | Endpoint | Description |
|---|---|---|
| GET | /evals/runs | List runs (filter by suiteId, taskId, status) |
| POST | /evals/runs | Create a run |
| GET | /evals/runs/:id | Get run with full results |
| POST | /evals/runs/:id/start | Start a pending run |
| POST | /evals/runs/:id/cancel | Cancel a running run |
| GET | /evals/runs/:id/leaderboard | Ranked subjects by score |
| GET | /evals/runs/:id/profile/:subjectId | Subject's scores per task |
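
A thin client for these endpoints might look like the sketch below, using only the Python standard library. The base URL is a placeholder assumption, authentication is omitted because this page does not describe it, and the helper names are hypothetical. The functions only build requests; nothing is sent until one is passed to urllib.request.urlopen.

```python
import json
import urllib.request
from typing import Optional
from urllib.parse import quote, urlencode

BASE_URL = "http://localhost:3000"  # assumption: replace with your server


def build_request(
    method: str,
    path: str,
    query: Optional[dict] = None,
    body: Optional[dict] = None,
) -> urllib.request.Request:
    """Build (but do not send) a JSON request for the evals API."""
    url = BASE_URL + path
    if query:
        url += "?" + urlencode(query)
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Accept": "application/json"}
    if data is not None:
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(url, data=data, method=method, headers=headers)


def list_runs(status: Optional[str] = None) -> urllib.request.Request:
    return build_request("GET", "/evals/runs", query={"status": status} if status else None)


def start_run(run_id: str) -> urllib.request.Request:
    return build_request("POST", f"/evals/runs/{quote(run_id, safe='')}/start")


def cancel_run(run_id: str) -> urllib.request.Request:
    return build_request("POST", f"/evals/runs/{quote(run_id, safe='')}/cancel")


def get_leaderboard(run_id: str) -> urllib.request.Request:
    return build_request("GET", f"/evals/runs/{quote(run_id, safe='')}/leaderboard")


def get_profile(run_id: str, subject_id: str) -> urllib.request.Request:
    path = f"/evals/runs/{quote(run_id, safe='')}/profile/{quote(subject_id, safe='')}"
    return build_request("GET", path)
```

To execute a request, something like `json.load(urllib.request.urlopen(get_leaderboard(run_id)))` would work, assuming the server returns JSON.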

See Also