Creating Experiments

An experiment (task) defines what to test, where to run it, and how to score the output. Create experiments in the Eval Panel's Experiments tab.

Task Structure

Every task has three parts:

  1. Instruction — what to tell the subject.
  2. Environment — where to execute.
  3. Evaluators — how to score the output.

Plus an optional optimization config for automatic improvement.
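The three parts (plus the optional optimization config) can be pictured as a single task definition. The field names below are illustrative, not the tool's actual schema:

```python
# Hypothetical shape of a task: instruction, environment, evaluators,
# and an optional optimization config. Field names are assumptions.
task = {
    "instruction": {
        "type": "text",
        "prompt": "Write a function that sorts an array of numbers in ascending order.",
    },
    "environment": {"type": "local"},
    "evaluators": [
        # Weights are combined into a weighted-average final score.
        {"type": "exact-match", "weight": 0.3},
        {"type": "llm-judge", "weight": 0.7},
    ],
    "optimization": {"enabled": False},  # optional
}

# The three required parts must all be present.
required = {"instruction", "environment", "evaluators"}
assert required <= task.keys()
```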

Instruction

The instruction tells the subject what to do. Three types:

| Type | Description |
| --- | --- |
| text | A text prompt sent to the subject |
| script | A setup script that runs before the subject starts |
| hybrid | Both a setup script and a text prompt |

Text instruction

A plain text prompt:

Write a function that sorts an array of numbers in ascending order.

Script instruction

A script that sets up the environment before the subject runs (e.g., create files, seed data).

Hybrid

Combines both — the script runs first to set up context, then the text prompt is sent to the subject.
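As a rough sketch, the three instruction types might differ only in which fields they carry (field names here are illustrative assumptions, not the actual schema):

```python
# Text: just a prompt sent to the subject.
text_instruction = {"type": "text", "prompt": "Sort the array ascending."}

# Script: a setup script that runs before the subject starts.
script_instruction = {
    "type": "script",
    "script": "echo '[3, 1, 2]' > input.json",
}

# Hybrid: the script runs first, then the prompt is sent to the subject.
hybrid_instruction = {
    "type": "hybrid",
    "script": "echo '[3, 1, 2]' > input.json",
    "prompt": "Sort the numbers in input.json in ascending order.",
}
```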

Environment

Choose where the subject runs:

| Type | Description |
| --- | --- |
| local | Run on the local machine |
| remote | Run in a specific remote environment (by ID) |
| provider | Run using an execution provider (E2B, Docker, etc.) |

For remote and provider, you specify the environment or provider ID.
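A minimal sketch of the three environment configs, assuming illustrative field names and IDs:

```python
# "local" needs no ID; "remote" and "provider" each carry one.
local_env = {"type": "local"}
remote_env = {"type": "remote", "environmentId": "env-123"}  # hypothetical ID
provider_env = {"type": "provider", "providerId": "e2b"}     # e.g. E2B

def needs_id(env: dict) -> bool:
    """Only remote and provider environments require an ID."""
    return env["type"] in ("remote", "provider")
```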

Evaluators

Each task has one or more evaluators that score the subject's output. Evaluators are weighted — the final score is a weighted average.
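The weighted average works as you would expect; a small sketch:

```python
def final_score(results: list[tuple[float, float]]) -> float:
    """Combine (score, weight) pairs from evaluators into a weighted average."""
    total_weight = sum(weight for _, weight in results)
    return sum(score * weight for score, weight in results) / total_weight

# One evaluator scored 0.9 (weight 0.7), another 0.5 (weight 0.3): ~0.78.
final_score([(0.9, 0.7), (0.5, 0.3)])
```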

See Evaluators for details on each type.

Optimization (Optional)

Enable optimization to have an agent automatically improve the subject. When enabled, the system:

  1. Runs the initial eval.
  2. Passes the score and output to an optimizer agent.
  3. Lets the optimizer make one targeted change to the subject's code.
  4. Re-evaluates the modified subject.
  5. Repeats until the target score is reached or max iterations hit.
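The steps above can be sketched as a loop. The `evaluate` and `improve` callables stand in for the real eval run and optimizer agent, which this doc does not specify:

```python
def optimize(subject, evaluate, improve, target=0.9, max_iters=5):
    """Run eval -> optimize -> re-eval until target score or max iterations."""
    score, output = evaluate(subject)          # 1. initial eval
    for _ in range(max_iters):
        if score >= target:                    # 5. stop at the target score
            break
        subject = improve(subject, score, output)  # 2-3. one targeted change
        score, output = evaluate(subject)      # 4. re-evaluate
    return subject, score
```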

See Optimization Loop for details.

Suites

Group related tasks into a suite (folder). When you create a run from a suite, all tasks in the suite are executed.

Use suites to:

  • Test different aspects of an agent (accuracy, speed, tool usage).
  • Compare subjects across a standardized benchmark.
  • Run regression tests after changes.

REST API

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /evals/tasks | List all tasks |
| POST | /evals/tasks | Create a task |
| GET | /evals/tasks/:id | Get a task |
| PUT | /evals/tasks/:id | Update a task |
| DELETE | /evals/tasks/:id | Delete a task |
| GET | /evals/suites | List all suites |
| POST | /evals/suites | Create a suite |
| GET | /evals/suites/:id | Get a suite with its tasks |
| PUT | /evals/suites/:id | Update a suite |
| DELETE | /evals/suites/:id | Delete a suite |
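A sketch of building a `POST /evals/tasks` request with the Python standard library. The base URL is a placeholder for wherever your instance runs, and the payload fields are illustrative:

```python
import json
import urllib.request

BASE_URL = "http://localhost:3000"  # hypothetical; adjust for your deployment

def create_task_request(task: dict) -> urllib.request.Request:
    """Build (but do not send) a POST /evals/tasks request with a JSON body."""
    return urllib.request.Request(
        f"{BASE_URL}/evals/tasks",
        data=json.dumps(task).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = create_task_request(
    {"instruction": {"type": "text", "prompt": "Sort an array ascending."}}
)
# Send with urllib.request.urlopen(req) once the server is reachable.
```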

See Also