Replay an Agent Run
Re-run an agent with the same or a similar prompt to verify that changes to the agent's configuration didn't break its behaviour.
You'll need: a custom agent with at least one previous run, and a change you want to test.
Why replay
LLM behaviour isn't deterministic. You change the agent's instructions, test it, and the new run looks fine — but that's one sample. You don't know if you broke a common case until the bad case shows up in production.
Manual replay helps by:
- Recording what the agent did in a good run (via the run history).
- Re-running the agent with the same prompt after your change.
- Comparing the two runs to see what, if anything, changed.
Step 1 — note a good run
Run your agent with a task and note the prompt you used:
codebolt --prompt "review the current branch" --agent my-agent
After the run completes, check the run history in the UI. Note:
- The exact prompt you used
- The tool calls the agent made
- The final output
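One lightweight way to keep those notes diffable later is a small plain-text baseline file. This is just a sketch — the file layout and field names here are illustrative, not a codebolt format:

```shell
# Record the baseline run as plain text so a later run can be diffed
# against it. Field names ("prompt", "tools", "output") are made up —
# use whatever you can read back out of your run history.
mkdir -p runs
cat > runs/baseline.txt <<'EOF'
prompt: review the current branch
tools: git_diff, read_file, read_file
output: structured review with 3 findings
EOF
cat runs/baseline.txt
```

Anything plain-text works; the point is having the good run pinned down somewhere `diff` can reach it.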
Step 2 — make a change
Edit your agent — tweak the instructions, update the tool list, change the model, whatever.
Step 3 — replay
Run the same prompt against the modified agent:
codebolt --prompt "review the current branch" --agent my-agent
Compare the new run's tool calls and output against the previous run. Key things to check:
- Did the agent still call the right tools?
- Did it follow the same general approach?
- Is the output quality at least as good?
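If you saved the baseline run as a text file (as sketched earlier — the filenames and line format here are assumptions), recording the replay the same way lets plain `diff` surface the differences for you:

```shell
# Record the replay in the same shape as the baseline, then diff.
# The "tools:" line format is illustrative.
mkdir -p runs
cat > runs/baseline.txt <<'EOF'
tools: git_diff, read_file
EOF
cat > runs/replay.txt <<'EOF'
tools: git_diff, read_file, read_file, read_file
EOF
diff runs/baseline.txt runs/replay.txt || true  # diff exits 1 when files differ
```

Here the diff immediately shows the replay made two extra `read_file` calls — exactly the kind of structural drift you're hunting for.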
Step 4 — what to look for
Because LLM outputs are non-deterministic, exact matches aren't expected. Instead, look for:
- Structural changes — the agent now skips a step it used to do, or does things in a different order.
- Quality regressions — the output is noticeably worse or missing key elements.
- Tool call differences — the agent calls different tools or makes many more calls (potential loop).
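A cheap loop signal is a raw count of tool calls. This assumes you saved the transcript with one `tool:` line per call (an illustrative format, not something codebolt emits):

```shell
# Count tool calls in a saved transcript; a big jump vs. the baseline
# run suggests the agent may be looping. "tool:" lines are assumed.
cat > replay.txt <<'EOF'
tool: read_file
tool: read_file
tool: read_file
EOF
count=$(grep -c '^tool:' replay.txt)
echo "tool calls: $count"
```

Compare the count against the baseline run's — the absolute number matters less than the delta.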
Keeping test prompts
Save prompts that exercise important agent behaviour in a file:
# agent-tests.md
## Test: branch review
Prompt: "review the current branch"
Expected: agent reads git diff, reads relevant files, produces structured review
## Test: specific file review
Prompt: "@src/auth/session.ts review for security issues"
Expected: agent focuses on that file, checks for auth bugs
Run these prompts after any agent change to verify nothing broke.
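If you also keep the prompts in a plain one-per-line list (a simpler alternative to the markdown layout above), a small loop can replay them all. The `echo` is a stand-in — swap in the real `codebolt` invocation:

```shell
# Replay every saved test prompt after an agent change.
# prompts.txt holds one prompt per line (assumed format).
cat > prompts.txt <<'EOF'
review the current branch
@src/auth/session.ts review for security issues
EOF
while IFS= read -r prompt; do
  # Real run would be: codebolt --prompt "$prompt" --agent my-agent
  echo "replaying: $prompt"
done < prompts.txt
```

You still inspect each run's output by hand, but at least kicking off the runs is one command.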
Limitations
- Non-determinism. LLM outputs vary between runs. You're checking for structural/quality regressions, not exact matches.
- Context differences. If the codebase changed between runs, the agent will naturally produce different output.
- No automated diff. You compare runs manually by inspecting both in the run history.