Creating Experiments

An experiment (task) defines what to test, where to run it, and how to score the output. Create experiments in the Eval Panel's Experiments tab.

Task Structure

Every task has three parts:

  1. Instruction — what to tell the subject.
  2. Environment — where to execute.
  3. Evaluators — how to score the output.

Plus an optional optimization config for automatic improvement.
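The three parts (plus the optional optimization config) can be pictured as a single task definition. The field names below are illustrative, not the tool's actual schema:

```python
# Hypothetical shape of a task: instruction, environment, evaluators,
# and an optional optimization config. Field names are assumptions.
task = {
    "instruction": {
        "type": "text",
        "prompt": "Write a function that sorts an array of numbers in ascending order.",
    },
    "environment": {"type": "local"},
    "evaluators": [
        # Weights are combined into a weighted-average final score.
        {"type": "exact-match", "weight": 0.3},
        {"type": "llm-judge", "weight": 0.7},
    ],
    "optimization": {"enabled": False},  # optional
}

# The three required parts must all be present.
required = {"instruction", "environment", "evaluators"}
assert required <= task.keys()
```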

Instruction

The instruction tells the subject what to do. Three types:

| Type | Description |
| --- | --- |
| text | A text prompt sent to the subject |
| script | A setup script that runs before the subject starts |
| hybrid | Both a setup script and a text prompt |

Text instruction

A plain text prompt:

Write a function that sorts an array of numbers in ascending order.

Script instruction

A script that sets up the environment before the subject runs (e.g., create files, seed data).

Hybrid

Combines both — the script runs first to set up context, then the text prompt is sent to the subject.
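As a rough sketch, the three instruction types might differ only in which fields they carry (field names here are illustrative assumptions, not the actual schema):

```python
# Text: just a prompt sent to the subject.
text_instruction = {"type": "text", "prompt": "Sort the array ascending."}

# Script: a setup script that runs before the subject starts.
script_instruction = {
    "type": "script",
    "script": "echo '[3, 1, 2]' > input.json",
}

# Hybrid: the script runs first, then the prompt is sent to the subject.
hybrid_instruction = {
    "type": "hybrid",
    "script": "echo '[3, 1, 2]' > input.json",
    "prompt": "Sort the numbers in input.json in ascending order.",
}
```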

Environment

Choose where the subject runs:

| Type | Description |
| --- | --- |
| local | Run on the local machine |
| remote | Run in a specific remote environment (by ID) |
| provider | Run using an execution provider (E2B, Docker, etc.) |

For remote and provider, you specify the environment or provider ID.
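A minimal sketch of the three environment configs, assuming illustrative field names and IDs:

```python
# "local" needs no ID; "remote" and "provider" each carry one.
local_env = {"type": "local"}
remote_env = {"type": "remote", "environmentId": "env-123"}  # hypothetical ID
provider_env = {"type": "provider", "providerId": "e2b"}     # e.g. E2B

def needs_id(env: dict) -> bool:
    """Only remote and provider environments require an ID."""
    return env["type"] in ("remote", "provider")
```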

Evaluators

Each task has one or more evaluators that score the subject's output. Evaluators are weighted — the final score is a weighted average.
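The weighted average works as you would expect; a small sketch:

```python
def final_score(results: list[tuple[float, float]]) -> float:
    """Combine (score, weight) pairs from evaluators into a weighted average."""
    total_weight = sum(weight for _, weight in results)
    return sum(score * weight for score, weight in results) / total_weight

# One evaluator scored 0.9 (weight 0.7), another 0.5 (weight 0.3): ~0.78.
final_score([(0.9, 0.7), (0.5, 0.3)])
```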

See Evaluators for details on each type.

Optimization (Optional)

Enable optimization to have an agent automatically improve the subject. When enabled, the system:

  1. Runs the initial eval.
  2. Passes the score and output to an optimizer agent.
  3. Lets the optimizer make one targeted change to the subject's code.
  4. Re-evaluates the modified subject.
  5. Repeats until the target score is reached or max iterations hit.
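The steps above can be sketched as a loop. The `evaluate` and `improve` callables stand in for the real eval run and optimizer agent, which this doc does not specify:

```python
def optimize(subject, evaluate, improve, target=0.9, max_iters=5):
    """Run eval -> optimize -> re-eval until target score or max iterations."""
    score, output = evaluate(subject)          # 1. initial eval
    for _ in range(max_iters):
        if score >= target:                    # 5. stop at the target score
            break
        subject = improve(subject, score, output)  # 2-3. one targeted change
        score, output = evaluate(subject)      # 4. re-evaluate
    return subject, score
```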

See Optimization Loop for details.

Suites

Group related tasks into a suite (folder). When you create a run from a suite, all tasks in the suite are executed.

Use suites to:

  • Test different aspects of an agent (accuracy, speed, tool usage).
  • Compare subjects across a standardized benchmark.
  • Run regression tests after changes.

REST API

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /evals/tasks | List all tasks |
| POST | /evals/tasks | Create a task |
| GET | /evals/tasks/:id | Get a task |
| PUT | /evals/tasks/:id | Update a task |
| DELETE | /evals/tasks/:id | Delete a task |
| GET | /evals/suites | List all suites |
| POST | /evals/suites | Create a suite |
| GET | /evals/suites/:id | Get a suite with its tasks |
| PUT | /evals/suites/:id | Update a suite |
| DELETE | /evals/suites/:id | Delete a suite |
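A sketch of building a `POST /evals/tasks` request with the Python standard library. The base URL is a placeholder for wherever your instance runs, and the payload fields are illustrative:

```python
import json
import urllib.request

BASE_URL = "http://localhost:3000"  # hypothetical; adjust for your deployment

def create_task_request(task: dict) -> urllib.request.Request:
    """Build (but do not send) a POST /evals/tasks request with a JSON body."""
    return urllib.request.Request(
        f"{BASE_URL}/evals/tasks",
        data=json.dumps(task).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = create_task_request(
    {"instruction": {"type": "text", "prompt": "Sort an array ascending."}}
)
# Send with urllib.request.urlopen(req) once the server is reachable.
```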

See Also