Skip to main content

Plan-Execute-Review

Three agents in a short pipeline: one plans, one executes, one reviews. The most useful multi-agent pattern in practice — it covers ~70% of real use cases where more than one agent is genuinely warranted.

When to use

  • Any non-trivial change where review matters. Refactors, migrations, new features, bug fixes.
  • Tasks where "did this actually solve the problem?" is a real question, not just "does it compile?".
  • When the executor would benefit from a fresh-eyes check that a second pass from the same agent can't provide.

If you only adopt one multi-agent pattern, adopt this one.

Shape

┌────────────┐
│ Plan │ big model, reads a lot, writes a plan doc
└─────┬──────┘


┌────────────┐
│ Execute │ fast model, writes code, follows the plan
└─────┬──────┘


┌────────────┐
│ Review │ different-family model, reads diff + plan
└─────┬──────┘

├── approve → done
└── reject → back to Execute (or back to Plan for major issues)

Each agent's job

Plan agent

  • Input: user request + codebase context.
  • Output: a structured plan document — files to touch, approach, test strategy, risks.
  • Tools: read-only. Codemap, file reads, search, knowledge graph queries. No write tools.
  • Model: the biggest one you can afford. Planning is where you want maximum capability.

The plan is not chat. It's a structured document the executor will consume directly. Put it in jsonMemoryService or pass as the next agent's input.

Execute agent

  • Input: the plan document (only — not the full conversation with the user).
  • Output: file changes + an execution report (what it did, what it skipped, what surprised it).
  • Tools: file write, code edit, test running. No planning tools — if the plan is wrong, the executor reports it; it doesn't replan.
  • Model: fast and cheap. Execution is mostly mechanical once the plan is good.

The executor's constrained context is the whole point. It doesn't re-debate decisions; it executes.

Review agent

  • Input: the original plan + the diff + the execution report.
  • Output: approve / reject with structured comments.
  • Tools: read-only. Can run tests, read diffs, query the knowledge graph. No writes.
  • Model: a different family from the executor. If the executor used Claude, use GPT for review (or vice versa). Same-family models correlate their mistakes.

The review agent should be given the plan, not just the diff. "Does this change match the plan?" is a better question than "is this change good?".

Why this works

Three forces aligned:

  1. Fresh context at each stage — each agent has just the information it needs, not the full history. Smaller context, sharper focus.
  2. Specialisation — different models, different prompts, different tool allowlists. Each agent is good at its one job.
  3. Independence at review — the reviewer's opinion is not correlated with the executor's because they're different processes with different contexts.

The reject loop

When review rejects, you have three options:

Rejection kindAction
Small fix-upBack to Execute with the review comments appended to the input
Plan was wrongBack to Plan with the rejection reason
Unclear/irreconcilableEscalate to a human

Always cap iterations. A hard limit of 2-3 Execute↔Review rounds prevents runaway cost when the two agents disagree persistently. On hitting the cap, escalate.

Implementation in Codebolt

  • The whole pattern is naturally expressed as an agent flow with three nodes and one conditional edge (review → execute on reject).
  • Alternatively, an orchestrator agent can start each stage manually via codebolt_agent.start.
  • Every stage is a child run under the orchestrator; the event log shows the full plan → execute → review → verdict trace.
  • reviewMergeRequestService can wrap the review stage if you also want human-in-the-loop before the change actually commits.

Pitfalls

  • Executor ignoring the plan. If the executor has more context than it needs, it will re-decide. Keep its input tight.
  • Reviewer not seeing the plan. If the reviewer only sees the diff, it's reviewing code quality, not whether the change solves the task. Pass the plan.
  • Same-family reviewer. Correlated mistakes. Use a different model family.
  • Infinite reject loops. See cap above.
  • Review rubber-stamping. If the reviewer always approves, the reviewer is useless. Add one or two intentional failure tests to tune the reviewer's strictness.

See also