Plan-Execute-Review

Three agents in a short pipeline: one plans, one executes, one reviews. The most useful multi-agent pattern in practice — it covers ~70% of real use cases where more than one agent is genuinely warranted.

When to use

Any non-trivial change where review matters. Refactors, migrations, new features, bug fixes.
Tasks where "did this actually solve the problem?" is a real question, not just "does it compile?".
When the executor would benefit from a fresh-eyes check that a second pass from the same agent can't provide.

If you only adopt one multi-agent pattern, adopt this one.

Shape

     ┌────────────┐
     │   Plan     │  big model, reads a lot, writes a plan doc
     └─────┬──────┘
           │
           ▼
     ┌────────────┐
     │  Execute   │  fast model, writes code, follows the plan
     └─────┬──────┘
           │
           ▼
     ┌────────────┐
     │   Review   │  different-family model, reads diff + plan
     └─────┬──────┘
           │
           ├── approve → done
           └── reject  → back to Execute (or back to Plan for major issues)

Each agent's job

Plan agent

Input: user request + codebase context.
Output: a structured plan document — files to touch, approach, test strategy, risks.
Tools: read-only. Codemap, file reads, search, knowledge graph queries. No write tools.
Model: the biggest one you can afford. Planning is where you want maximum capability.

The plan is not chat. It's a structured document the executor will consume directly. Put it in jsonMemoryService or pass as the next agent's input.

Execute agent

Input: the plan document (only — not the full conversation with the user).
Output: file changes + an execution report (what it did, what it skipped, what surprised it).
Tools: file write, code edit, test running. No planning tools — if the plan is wrong, the executor reports it; it doesn't replan.
Model: fast and cheap. Execution is mostly mechanical once the plan is good.

The executor's constrained context is the whole point. It doesn't re-debate decisions; it executes.

Review agent

Input: the original plan + the diff + the execution report.
Output: approve / reject with structured comments.
Tools: read-only. Can run tests, read diffs, query the knowledge graph. No writes.
Model: a different family from the executor. If the executor used Claude, use GPT for review (or vice versa). Same-family models correlate their mistakes.

The review agent should be given the plan, not just the diff. "Does this change match the plan?" is a better question than "is this change good?".

Why this works

Three forces aligned:

Fresh context at each stage — each agent has just the information it needs, not the full history. Smaller context, sharper focus.
Specialisation — different models, different prompts, different tool allowlists. Each agent is good at its one job.
Independence at review — the reviewer's opinion is not correlated with the executor's because they're different processes with different contexts.

The reject loop

When review rejects, you have three options:

Rejection kind	Action
Small fix-up	Back to Execute with the review comments appended to the input
Plan was wrong	Back to Plan with the rejection reason
Unclear/irreconcilable	Escalate to a human

Always cap iterations. A hard limit of 2-3 Execute↔Review rounds prevents runaway cost when the two agents disagree persistently. On hitting the cap, escalate.

Implementation in Codebolt

The whole pattern is naturally expressed as an agent flow with three nodes and one conditional edge (review → execute on reject).
Alternatively, an orchestrator agent can start each stage manually via codebolt_agent.start.
Every stage is a child run under the orchestrator; the event log shows the full plan → execute → review → verdict trace.
reviewMergeRequestService can wrap the review stage if you also want human-in-the-loop before the change actually commits.

Pitfalls

Executor ignoring the plan. If the executor has more context than it needs, it will re-decide. Keep its input tight.
Reviewer not seeing the plan. If the reviewer only sees the diff, it's reviewing code quality, not whether the change solves the task. Pass the plan.
Same-family reviewer. Correlated mistakes. Use a different model family.
Infinite reject loops. See cap above.
Review rubber-stamping. If the reviewer always approves, the reviewer is useless. Add one or two intentional failure tests to tune the reviewer's strictness.

When to use​

Shape​

Each agent's job​

Plan agent​

Execute agent​

Review agent​

Why this works​

The reject loop​

Implementation in Codebolt​

Pitfalls​

See also​