Planner × Implementer: Which LLM Duo Writes the Best Code per Dollar?

Most LLM coding benchmarks ask one question: which model is smartest? But that’s not how I actually build with agents. In practice I often split the work — one model plans the change, another model implements it. So the real question is: which pairing gives you the best code per dollar?

To find out, I built a small harness called duobench and ran a tournament: seven planner/implementer pairings of Kimi K2.6, Claude Opus 4.8, and GPT-5.5, each fixing a real GitHub issue, with every result graded by a panel of LLM judges.

The results surprised me. Let me walk you through the whole thing.

The Idea: Quality per Dollar, Not Just Quality

A “condition” in this benchmark is a planner → implementer pair. If the same model both plans and implements, it’s a “solo” run. So Opus-planning-then-Kimi-implementing is one condition; Kimi solo is another.

For each condition we measure two things:

  • Quality — how good is the fix? (scored 0–10 by judges)
  • Cost — how many dollars of tokens did it burn?

Divide them and you get efficiency (quality per dollar), which is what actually matters when you’re paying the bill.

Step 1: A Real Bug, Not a Toy

Synthetic tasks are easy to game, so I used a genuine issue from a popular open-source project: Flask #4041 — “Raise error when blueprint name contains a dot.”

Here’s the bug in plain terms. Flask lets you organize routes into blueprints, and blueprints can be nested. Nesting uses the dot (.) as a separator in endpoint names — something like parent.child.view. Flask already refused dots in endpoint names, but it forgot to check the blueprint’s own name. So if you named a blueprint "my.app", the dot would silently corrupt the routing namespace.

The fix: validate the blueprint name in its constructor and raise a clear ValueError if it contains a dot. Small, self-contained, but it requires real understanding — you have to find the right constructor, pick the right exception, update existing tests that used dotted names, and add a changelog entry. A perfect “medium-hard” task.
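To make the shape of the fix concrete, here is a minimal sketch of the validation. It uses a stand-in class, MiniBlueprint, rather than Flask's real Blueprint, and both the error message wording and the endpoint_for helper are my own illustrations, not quotes from the Flask source.

```python
class MiniBlueprint:
    """Hypothetical stand-in for a blueprint; not Flask's actual class."""

    def __init__(self, name: str) -> None:
        # Dots separate nested blueprint names in endpoints such as
        # "parent.child.view", so a dot inside a single name is ambiguous.
        if "." in name:
            raise ValueError(f"Blueprint name {name!r} may not contain a dot.")
        self.name = name


def endpoint_for(blueprints: list[MiniBlueprint], view: str) -> str:
    """Join nested blueprint names and a view name into a dotted endpoint."""
    return ".".join([bp.name for bp in blueprints] + [view])
```

With this check in place, MiniBlueprint("my.app") fails loudly at construction time instead of producing an endpoint that looks like a nested blueprint.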

I cloned Flask at the commit right before the official fix, so every model started from the same unfixed state.

Step 2: Plan → Implement → Judge

Each condition runs through three phases:

  1. Plan. The planner LLM reads the issue, explores the repo, and writes a handoff plan — no code, just guidance.
  2. Implement. The implementer LLM takes that plan and makes exactly one local commit fixing the issue. (No pull requests, no pushing — everything stays local and is evaluated from the commit itself.)
  3. Judge. A panel scores the commit on four dimensions: task completion, correctness, code quality, and verification.

To keep the judging honest, I used two judges — Opus 4.8 and GPT-5.5 — and averaged their scores. More on why that matters in a moment.
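The three phases can be sketched roughly as below. This is not duobench's actual code: the names Condition, judge_score, and run_condition are hypothetical, the plan/implement/judge steps are passed in as plain callables, and weighting the four dimensions equally is my assumption for the sketch. Averaging across judges is the only part taken directly from the setup described above.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable

# The four judged dimensions named in the article.
DIMENSIONS = ("task_completion", "correctness", "code_quality", "verification")

Scores = dict[str, float]


@dataclass(frozen=True)
class Condition:
    planner: str
    implementer: str

    @property
    def is_solo(self) -> bool:
        # A solo run is one where the same model plans and implements.
        return self.planner == self.implementer


def judge_score(scores: Scores) -> float:
    """Collapse one judge's dimension scores (assumption: unweighted mean)."""
    return mean(scores[d] for d in DIMENSIONS)


def run_condition(
    condition: Condition,
    issue: str,
    plan: Callable[[str, str], str],
    implement: Callable[[str, str], str],
    judges: list[Callable[[str], Scores]],
) -> float:
    """Plan, implement, then average the per-judge scores for the commit."""
    handoff = plan(condition.planner, issue)
    commit = implement(condition.implementer, handoff)
    return mean(judge_score(judge(commit)) for judge in judges)
```

Passing the model calls in as callables keeps the sketch runnable with stubs and makes it easy to swap judges in and out of the panel.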

The Leaderboard

Here’s how all seven conditions stacked up on quality (averaged across both judges):

Leaderboard of planner-implementer conditions by quality

The striking thing is how flat quality is. Six of the seven conditions land between 8.25 and 8.5 out of 10; only Kimi solo, at 7.5, falls outside that band. On a well-scoped task like this, the choice of model barely moves quality at all.

So if quality is roughly a tie… the whole game becomes cost.

The Money Chart

This plot tells most of the story. Cost on the x-axis, quality on the y-axis. Top-left is best: high quality, low price.

Cost versus quality scatter plot with iso-efficiency lines

The dashed lines are “iso-efficiency” contours — every point on the same line has the same quality-per-dollar. The further up-left of a line you are, the better the value.
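If the contours feel abstract, here is a small sketch of the arithmetic behind them, assuming linear axes: a contour for efficiency e is simply the line quality = e × cost. The function names are hypothetical helpers for illustration, not part of the plotting code.

```python
def iso_efficiency_quality(cost: float, efficiency: float) -> float:
    """Quality value on the contour with the given efficiency at this cost."""
    return efficiency * cost


def beats_contour(quality: float, cost: float, efficiency: float) -> bool:
    """True if the point delivers more quality per dollar than the contour."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return quality / cost > efficiency
```

For example, a condition scoring 8.5 at $0.41 sits above the 20-points-per-dollar contour, while one scoring 8.5 at $0.89 sits below the 10-points-per-dollar contour.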

And look who’s hugging the left edge: Kimi. The three most efficient conditions all use Kimi as the implementer. GPT-5.5 solo sits alone on the far right, with the same quality as the best conditions for about 3.6× the price of Kimi solo ($0.89 versus $0.25).

Here’s the full ranking by efficiency:

Rank  Planner → Implementer  Quality  Cost   Efficiency (quality per $)
🥇    Kimi → Kimi            7.50     $0.25  29.5
🥈    GPT-5.5 → Kimi         8.50     $0.41  20.5
🥉    Opus → Kimi            8.50     $0.43  19.9
4     Kimi → GPT-5.5         8.38     $0.49  17.1
5     Kimi → Opus            8.25     $0.65  12.7
6     Opus → Opus            8.25     $0.65  12.7
7     GPT-5.5 → GPT-5.5      8.50     $0.89  9.6
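The ranking is easy to reproduce from the table. The sketch below recomputes efficiency as quality divided by cost and sorts on it. One caveat: the costs here are the rounded values shown in the table, so the recomputed figures can differ slightly from the published column (7.50 / 0.25 gives 30.0 rather than 29.5). My assumption is that the published numbers came from unrounded costs.

```python
# (planner, implementer, quality, cost in USD), copied from the table above.
RESULTS = [
    ("Kimi", "Kimi", 7.50, 0.25),
    ("GPT-5.5", "Kimi", 8.50, 0.41),
    ("Opus", "Kimi", 8.50, 0.43),
    ("Kimi", "GPT-5.5", 8.38, 0.49),
    ("Kimi", "Opus", 8.25, 0.65),
    ("Opus", "Opus", 8.25, 0.65),
    ("GPT-5.5", "GPT-5.5", 8.50, 0.89),
]


def efficiency(quality: float, cost: float) -> float:
    """Quality points per dollar spent."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return quality / cost


def rank_by_efficiency(results):
    """Sort conditions from most to least efficient (ties keep input order)."""
    return sorted(results, key=lambda r: efficiency(r[2], r[3]), reverse=True)


if __name__ == "__main__":
    for planner, implementer, quality, cost in rank_by_efficiency(RESULTS):
        print(f"{planner} → {implementer}: {efficiency(quality, cost):.1f}")
```

Even with rounded inputs, the ordering comes out the same as in the table.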

Why the Implementer Is Everything

If you want to understand where the money goes, this is the chart to study:

Cost breakdown showing plan versus implement spend per condition

It splits each condition’s cost into planning (blue) vs implementing (red). Two things jump out:

  • Planning is cheap. Look at the tiny blue bars on every Kimi-planned condition — about $0.03. Planning is a short, read-only task; it barely costs anything no matter who does it.
  • Implementing is where you pay. The red bars dominate. Implementation is the long agentic loop — reading files, editing, running tests, iterating — and that’s where token costs pile up.

The conclusion is clean: your implementer choice decides your bill. A pricey planner paired with a cheap implementer (Opus → Kimi, $0.43) is a fantastic deal. A cheap planner with a pricey implementer is not: Kimi → GPT-5.5 costs $0.49 and Kimi → Opus $0.65.
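To put a number on "implementation is where you pay", here is a back-of-the-envelope sketch for the Kimi-planned conditions. It takes the roughly $0.03 planning cost quoted above and the rounded totals from the efficiency table, so the shares are approximate. The function name is a hypothetical helper.

```python
KIMI_PLAN_COST = 0.03  # approximate planning cost quoted in the article

# Total cost per Kimi-planned condition, from the efficiency table.
KIMI_PLANNED_TOTALS = {
    "Kimi → Kimi": 0.25,
    "Kimi → GPT-5.5": 0.49,
    "Kimi → Opus": 0.65,
}


def implement_share(total_cost: float, plan_cost: float) -> float:
    """Fraction of a condition's total cost spent on implementation."""
    if total_cost <= 0 or not 0 <= plan_cost <= total_cost:
        raise ValueError("need 0 <= plan_cost <= total_cost and total_cost > 0")
    return (total_cost - plan_cost) / total_cost


if __name__ == "__main__":
    for name, total in KIMI_PLANNED_TOTALS.items():
        print(f"{name}: {implement_share(total, KIMI_PLAN_COST):.0%} implementation")
```

Under these assumptions, implementation accounts for roughly 88% to 95% of the bill in every Kimi-planned condition.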

What Each Model Is Actually Good At

Running the full matrix surfaced a distinct personality for each model:

🟣 Kimi K2.6: the value king. Dramatically the cheapest, especially as an implementer, while staying close to everyone else on quality. Its only ding: in solo mode the judges scored it lower (7.5, versus 8.25 to 8.5 for the other conditions) because it pruned the existing test suite a bit too aggressively. But for raw bang-for-buck, nothing came close.

🔵 Claude Opus 4.8: the quality anchor. Opus planning with Kimi implementing tied for the top score (8.5, level with GPT-5.5 → Kimi and GPT-5.5 solo), and in a separate run Opus showed the best “verification discipline” (it actually ran the tests). It’s a superb planner: with an Opus plan, Kimi’s score rose from 7.5 solo to 8.5. Just don’t pay Opus to do the long implementation grind if a cheaper model will do.

🟢 GPT-5.5: consistent but expensive. It scored 8.38 to 8.5 wherever it appeared, so it was never the problem, but never the value either. With GPT-5.5 planning, swapping the implementer from Kimi to GPT-5.5 more than doubles the cost ($0.41 to $0.89) for the same 8.5 score. Its premium pricing simply didn’t buy better results on this task.

The Judges Have Opinions Too

Using two judges wasn’t just for show. When I checked how each judge scored its own model’s work versus everyone else’s, a bias showed up:

Bar chart comparing how each judge scores its own model versus others

  • GPT-5.5 favored GPT-implemented commits by +0.32 points: mild but real self-promotion.
  • Opus showed no favoritism; it was actually slightly harsher on its own commits (−0.20).

Averaging the two judges dilutes that bias: with equal weights, each judge’s tilt counts for only half in the final score. If I’d trusted a single self-interested judge, the leaderboard would have been subtly wrong. A good reminder: when you use an LLM as a grader, don’t let a contestant be the sole referee.
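The self-bias check itself is simple to sketch: for one judge, take the mean score it gave to commits implemented by its own model and subtract the mean it gave to everyone else's. The function name is a hypothetical helper, and attributing a commit to its implementer (rather than its planner) is my reading of the chart.

```python
from statistics import mean


def self_bias(scores: list[tuple[str, float]], judge_model: str) -> float:
    """Mean score on the judge's own model's commits minus mean on the rest.

    `scores` holds (implementer_model, score) pairs from a single judge.
    A positive result means the judge favors its own model.
    """
    own = [score for model, score in scores if model == judge_model]
    others = [score for model, score in scores if model != judge_model]
    if not own or not others:
        raise ValueError("need scores for both own and other models")
    return mean(own) - mean(others)
```

A value near zero suggests an even-handed judge; on this measure the article's figures are +0.32 for GPT-5.5 and −0.20 for Opus.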

The Takeaways

If you build with coding agents, here’s what I’d carry away from this:

  1. On well-scoped tasks, quality is nearly a tie — optimize for cost. Every model produced a working, reasonable fix.
  2. Split the roles. A strong planner + a cheap implementer (Opus → Kimi, or GPT → Kimi) gets you top-tier quality at a fraction of the price.
  3. Your implementer is your budget. Planning is pennies; implementation is the bill.
  4. Kimi K2.6 is the efficiency champion — and a genuinely strong default for the implementation step.
  5. Judge with a panel, not a single model, and watch for self-bias.

The smartest model isn’t always the right model. Sometimes the right move is to let the expensive one think, and the cheap one type.


This benchmark is a single trial per condition on one issue, so treat the fine-grained rankings as indicative rather than gospel. With only one run per condition, the quality spread (7.5–8.5) could plausibly be judge noise, which is itself part of the finding: the models are closer than the price tags suggest.