Running Evals and Results

Create runs to execute subjects against experiments, view results in real time, and compare subjects on a leaderboard.

Creating a Run

Click Start on a pending run. The server then begins executing it.

Results update in real time via WebSocket events:

| Event | When |
| --- | --- |
| `eval.run.result.updated` | After each result completes or changes |
| `eval.run.optimization.updated` | After each optimization iteration |
| `eval.run.completed` | When the entire run finishes |

The UI merges these updates live, so you can watch scores and optimization progress as they happen.
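A client that consumes these events itself could merge them into local state in much the same way. The following is a minimal sketch only: the event names come from the table above, but the envelope (`type`, `payload`) and the payload keys (`subjectId`, `taskId`, `iteration`) are assumptions for illustration, not a documented schema.

```python
from typing import Any


def apply_event(state: dict[str, Any], event: dict[str, Any]) -> dict[str, Any]:
    """Merge one decoded WebSocket event into local run state.

    The event envelope and payload keys are hypothetical; adjust them to the
    shape the server actually sends.
    """
    kind = event["type"]
    payload = event.get("payload", {})

    if kind == "eval.run.result.updated":
        # One result per (subject, task) pair; later updates overwrite fields.
        key = (payload["subjectId"], payload["taskId"])
        state["results"].setdefault(key, {}).update(payload)
    elif kind == "eval.run.optimization.updated":
        # Append the new iteration to that result's optimization history.
        key = (payload["subjectId"], payload["taskId"])
        result = state["results"].setdefault(key, {})
        result.setdefault("optimization", []).append(payload["iteration"])
    elif kind == "eval.run.completed":
        state["status"] = "completed"
    # Unknown event types are ignored so new events do not break the client.
    return state


state: dict[str, Any] = {"status": "running", "results": {}}
apply_event(state, {
    "type": "eval.run.result.updated",
    "payload": {"subjectId": "subject-a", "taskId": "task-1", "status": "running"},
})
apply_event(state, {"type": "eval.run.completed"})
```

Keying results by the (subject, task) pair mirrors how results are described in this section, and it makes repeated updates to the same result idempotent.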

Each (subject, task) result contains:

| Field | Description |
| --- | --- |
| `status` | One of `pending`, `running`, `completed`, `failed`, or `skipped` |
| `output` | The subject's output text |
| `score` | Weighted average of evaluator scores (0-100) |
| `evaluatorResults` | Individual evaluator scores and reasoning |
| `durationMs` | How long execution took, in milliseconds |
| `threadId` | Link to the chat thread |
| `optimization` | Iteration history (if optimization was enabled) |
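To make the `score` field concrete, here is a small sketch of a weighted average over evaluator scores. The `weight` attribute and the `EvaluatorResult` class are hypothetical; this section does not say where evaluator weights are stored or how the server handles a total weight of zero.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvaluatorResult:
    name: str
    score: float          # evaluator score on the 0-100 scale
    weight: float = 1.0   # hypothetical field: relative weight of this evaluator
    reasoning: str = ""


def weighted_score(results: list[EvaluatorResult]) -> Optional[float]:
    """Return the weighted average of evaluator scores, or None if undefined."""
    total_weight = sum(r.weight for r in results)
    if total_weight == 0:
        return None  # assumption: no evaluators (or zero weight) means no score
    return sum(r.score * r.weight for r in results) / total_weight


evaluators = [
    EvaluatorResult("accuracy", score=90, weight=2),
    EvaluatorResult("style", score=60, weight=1),
]
print(weighted_score(evaluators))  # (90*2 + 60*1) / 3 = 80.0
```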

After a run completes, the leaderboard ranks subjects by their average score across all tasks.

Access it via the UI or the API: `GET /evals/runs/:id/leaderboard`.
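The ranking described above can be reproduced locally from a run's results. This sketch assumes each result is a dict with `subjectId`, `status`, and `score` keys, and that only completed results with a score count toward the average; how the server treats failed or skipped results is not stated here.

```python
from collections import defaultdict
from typing import Any


def build_leaderboard(results: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Rank subjects by mean score across their scored tasks, highest first."""
    scores: dict[str, list[float]] = defaultdict(list)
    for result in results:
        # Assumption: unscored or unfinished results are left out of the mean.
        if result.get("status") == "completed" and result.get("score") is not None:
            scores[result["subjectId"]].append(result["score"])

    rows = [
        {"subjectId": subject, "averageScore": sum(vals) / len(vals), "tasks": len(vals)}
        for subject, vals in scores.items()
    ]
    rows.sort(key=lambda row: row["averageScore"], reverse=True)
    for rank, row in enumerate(rows, start=1):
        row["rank"] = rank
    return rows


sample = [
    {"subjectId": "a", "taskId": "t1", "status": "completed", "score": 80},
    {"subjectId": "a", "taskId": "t2", "status": "completed", "score": 90},
    {"subjectId": "b", "taskId": "t1", "status": "completed", "score": 70},
    {"subjectId": "b", "taskId": "t2", "status": "failed", "score": None},
]
for row in build_leaderboard(sample):
    print(row["rank"], row["subjectId"], row["averageScore"])
```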

The subject profile shows a single subject's scores across all tasks in a run.

Access it via the API: `GET /evals/runs/:id/profile/:subjectId`.
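As a rough illustration of what a per-subject view contains, the sketch below filters a run's results down to one subject and maps each task to its score. The result keys (`subjectId`, `taskId`, `score`) and the shape of the returned dict are assumptions, not the endpoint's documented response.

```python
from typing import Any


def subject_profile(results: list[dict[str, Any]], subject_id: str) -> dict[str, Any]:
    """Collect one subject's score for every task it has a result for."""
    per_task = {
        result["taskId"]: result.get("score")  # None when the task has no score
        for result in results
        if result["subjectId"] == subject_id
    }
    return {"subjectId": subject_id, "scores": per_task}


run_results = [
    {"subjectId": "a", "taskId": "t1", "score": 80},
    {"subjectId": "a", "taskId": "t2", "score": None},
    {"subjectId": "b", "taskId": "t1", "score": 70},
]
print(subject_profile(run_results, "a"))
```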

| Method | Endpoint | Description |
| --- | --- | --- |
| `GET` | `/evals/runs` | List runs (filter by `suiteId`, `taskId`, `status`) |
| `POST` | `/evals/runs` | Create a run |
| `GET` | `/evals/runs/:id` | Get run with full results |
| `POST` | `/evals/runs/:id/start` | Start a pending run |
| `POST` | `/evals/runs/:id/cancel` | Cancel a running run |
| `GET` | `/evals/runs/:id/leaderboard` | Ranked subjects by score |
| `GET` | `/evals/runs/:id/profile/:subjectId` | Subject's scores per task |
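The endpoints above can be exercised from a script. The sketch below only builds the requests with Python's standard `urllib`; the base URL is a placeholder, authentication is omitted because it is not covered here, and it assumes the list filters are passed as query parameters. The body for `POST /evals/runs` is not shown since its schema is not documented in this section.

```python
import json
import urllib.request
from typing import Any, Optional
from urllib.parse import quote, urlencode

BASE_URL = "http://localhost:3000"  # assumption: replace with your server's address


def build_request(
    method: str,
    path: str,
    query: Optional[dict[str, str]] = None,
    body: Optional[dict[str, Any]] = None,
) -> urllib.request.Request:
    """Build (but do not send) a JSON request for one of the run endpoints."""
    url = BASE_URL + path
    if query:
        url += "?" + urlencode(query)
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(url, data=data, headers=headers, method=method)


def list_runs(**filters: str) -> urllib.request.Request:
    # Filters such as suiteId, taskId, or status, assumed to be query parameters.
    return build_request("GET", "/evals/runs", query=filters)


def start_run(run_id: str) -> urllib.request.Request:
    return build_request("POST", f"/evals/runs/{quote(run_id, safe='')}/start")


def cancel_run(run_id: str) -> urllib.request.Request:
    return build_request("POST", f"/evals/runs/{quote(run_id, safe='')}/cancel")


def get_leaderboard(run_id: str) -> urllib.request.Request:
    return build_request("GET", f"/evals/runs/{quote(run_id, safe='')}/leaderboard")


def get_profile(run_id: str, subject_id: str) -> urllib.request.Request:
    path = f"/evals/runs/{quote(run_id, safe='')}/profile/{quote(subject_id, safe='')}"
    return build_request("GET", path)


request = list_runs(status="completed")
print(request.get_method(), request.full_url)
```

To send a request, pass it to `urllib.request.urlopen(request)` and decode the response body with `json.load`.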