The two coding copilots are teammates, not a winner

I've spent the last several months using both Claude Sonnet 4.5 and GPT-5 inside GitHub Copilot, sometimes on alternate days, sometimes side by side. The question I get asked most often is 'which one is better?' The honest answer, after enough hours with both, is that it's the wrong question.

There's a version of the comparison that's tempting, and I fell into it early. Sonnet feels fast, eager, willing to try things. GPT-5 feels considered, methodical, careful about the edges. The shorthand I was using for a while was that GPT-5 was like a senior engineer with fifteen years of experience and Sonnet was like a brilliant junior who types faster than she thinks. The metaphor has rhetorical utility and limited predictive power.

Where the metaphor holds

On refactors of established codebases, on long-range consistency, on the kinds of tasks where the 'right' answer requires respecting existing conventions and guarding against subtle breakage, GPT-5 consistently outperforms. It reads more of the surrounding code before suggesting changes. Its diffs clear code review on the first attempt more often. It tends toward conservatism in ways that are annoying when you want fresh thinking and welcome when you want something that still works a week from now.

On green-field code, on problems where the direction isn't obvious, on exploration of new libraries or patterns, Sonnet earns its reputation. It moves faster, offers more variants, and is more willing to propose something novel. The diffs often need a second pass, but that second pass is where you actually learn something. For prototyping and for any problem where the shape of the solution isn't yet clear, it is my preferred collaborator.

Where the metaphor breaks

The breakage is that these are characteristics at a task level, not stable seniority traits. Sonnet on a refactor in old code is careful. GPT-5 on a brand-new prototype is slow and sometimes actively wrong, because it's extrapolating from patterns in its training data that don't apply. The 'senior vs junior' framing implies the difference is global. It isn't. The difference is task-specific, and mapping the task-to-model preference is a lot of the practical work.

We measured this in one team over a quarter. Fraction of first-pass diffs that passed CI without human edit: Sonnet ~38%, GPT-5 ~47%. Fraction of accepted PRs that didn't require a follow-up within two weeks: Sonnet ~61%, GPT-5 ~78%. GPT-5 led on both, but both numbers measure how polished a diff is, not how useful the exploration behind it was. Teams with a strong review culture get more value out of Sonnet because they catch the rough edges. Teams with weaker review benefit from GPT-5's built-in caution. The choice is contextual to the team, not just the task.

The practical operating model

What has emerged for me, after enough weeks with both, is a rough division of labour. Prototyping and exploration: Sonnet. Migration work and touching old code: GPT-5. Writing tests for existing functions: either, leaning GPT-5 for conservatism. Drafting architectural options: Sonnet, because the divergence is useful. Code review feedback on a junior engineer's PR: GPT-5, because the feedback is calibrated. Debugging an intermittent production bug: GPT-5, because it is more direct about what it doesn't know.
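One way to make that division of labour concrete is to write it down as data rather than keep it in your head. What follows is a minimal sketch in Python, not a real integration: the task categories, the string labels and the preferred_model helper are all hypothetical names chosen for illustration, and none of them correspond to a Copilot setting or a vendor API.

```python
from enum import Enum
from typing import Optional


class Task(Enum):
    """Hypothetical task categories, taken from the list above."""
    PROTOTYPING = "prototyping"
    MIGRATION = "migration"
    TESTS_FOR_EXISTING_CODE = "tests_for_existing_code"
    ARCHITECTURE_OPTIONS = "architecture_options"
    REVIEW_FEEDBACK = "review_feedback"
    INTERMITTENT_BUG = "intermittent_bug"


# Illustrative labels only; these are not real model identifiers.
SONNET = "sonnet"
GPT5 = "gpt-5"

# The task-to-model map: one default per task category.
TASK_TO_MODEL = {
    Task.PROTOTYPING: SONNET,
    Task.MIGRATION: GPT5,
    Task.TESTS_FOR_EXISTING_CODE: GPT5,  # "either, leaning GPT-5"
    Task.ARCHITECTURE_OPTIONS: SONNET,
    Task.REVIEW_FEEDBACK: GPT5,
    Task.INTERMITTENT_BUG: GPT5,
}


def preferred_model(task: Task, override: Optional[str] = None) -> str:
    """Return the engineer's explicit choice if given, else the map's default."""
    return override or TASK_TO_MODEL[task]
```

The override argument is the important part of the design: the map is a default, not a policy, so an engineer who wants the other model for a particular task can simply pick it.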

The cognitive cost of switching is lower than people expect. In user studies inside our engineering org, the developers who used both in the same session reported higher satisfaction than the ones who committed to a single model. That surprised me. I'd expected the context-switching cost to dominate. It didn't. The mental model is 'which teammate for this task', and choosing between two specialists feels natural once you stop trying to rank them.

The caveats

Both models share a strong web-backend bias. Both are better at Python and TypeScript than at Rust or C++. Both struggle with anything that requires reasoning about memory layout, timing, or concurrency at the silicon level. If you're doing low-level work, neither is yet a reliable partner, and the enthusiasm about coding copilots reads quite differently from the hardware seat than it does from the web-app seat. Keep the specific domain in mind before you generalise.

The other caveat is that the models improve quickly, and the specific observations above will age within six months. The durable observation is the one about division of labour. Even if both models become individually better, the choice between them will remain contextual. The task-to-model map is a more stable artefact than any single-task benchmark.

A recommendation

If you're trying to pick one, don't. Deploy both. Let engineers choose per task. Track rollback rate by model, PR follow-up rate by model, time-to-merge by model. Revisit quarterly. Don't have a strong opinion about which is better, because the answer keeps changing, and the cost of being wrong is higher than the cost of supporting both. The organisations that will get the most value from coding copilots in the next two years are the ones that treat them as a small team of specialists, not a single tool to standardise on.
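The tracking suggested above does not need much tooling. Here is a rough sketch in Python of a per-model summary, assuming you can export one record per merged PR with the fields shown. The PullRequest record and its field names are assumptions made for this example, not the schema of any particular tool.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import median
from typing import Dict, Iterable, List


@dataclass
class PullRequest:
    """Hypothetical export record: one row per merged PR."""
    model: str               # which copilot produced the first draft
    rolled_back: bool        # the change was later reverted
    needed_follow_up: bool   # a follow-up PR landed within the tracking window
    hours_to_merge: float    # time from opening the PR to merging it


def summarise_by_model(prs: Iterable[PullRequest]) -> Dict[str, Dict[str, float]]:
    """Group PRs by model and compute the three metrics named above."""
    grouped: Dict[str, List[PullRequest]] = defaultdict(list)
    for pr in prs:
        grouped[pr.model].append(pr)

    summary: Dict[str, Dict[str, float]] = {}
    for model, items in grouped.items():
        count = len(items)
        summary[model] = {
            "prs": count,
            "rollback_rate": sum(p.rolled_back for p in items) / count,
            "follow_up_rate": sum(p.needed_follow_up for p in items) / count,
            "median_hours_to_merge": median([p.hours_to_merge for p in items]),
        }
    return summary
```

Median rather than mean is used for time-to-merge because a single PR that sits open for weeks would otherwise dominate the number. Running the same summary each quarter is enough to see whether the task-to-model map still holds.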

The meta-observation is that we are, quietly, through daily use, learning how to work with distinguishable AI collaborators with different characters. That's the capability that will matter in five years, more than any individual model. Build the muscle now.
