AI Code Review Is a Sycophant

GitHub Copilot code review is available in pull requests. Claude can review a diff. Cursor highlights issues as you type. Every major AI coding assistant now offers some form of review, and teams are using these tools to supplement (or in some cases replace) asynchronous human review on pull requests.

This is not necessarily wrong. AI code review is genuinely useful. But there is a pattern to what it misses, and understanding that pattern matters more than debating whether to use these tools at all.

In my experience, AI code reviewers behave like sycophants. They are good at finding small problems with how you built something. They are almost incapable of questioning whether you should have built it at all.

What AI Code Review Is Good At

To be clear: these tools are useful and worth adding to your PR workflow.

AI reviews reliably catch:

Obvious bugs in isolation. Null dereferences, off-by-one errors, incorrect operator precedence, missing await, unchecked return values from methods that can fail. These are the bugs human reviewers also catch, and they slip through when reviewers are tired, rushed, or staring at a 500-line diff.
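
A contrived snippet (class and names made up) with two of these bugs; an AI reviewer flags both without needing any context beyond the diff:

    using System.Threading.Tasks;

    // Hypothetical example of the mechanical bugs AI review reliably catches.
    public class BalanceReport
    {
        public async Task<decimal> TotalAsync(decimal[] balances)
        {
            decimal total = 0;

            // Off-by-one: "<=" reads one element past the end of the array.
            for (int i = 0; i <= balances.Length; i++)
                total += balances[i];

            // Missing await: the task is started but never awaited, so the caller
            // can observe the return value before persistence has completed.
            PersistAsync(total);

            return total;
        }

        private Task PersistAsync(decimal total) => Task.CompletedTask;
    }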

Common anti-patterns. async void, catching Exception without rethrowing, DateTime.Now instead of DateTime.UtcNow, string concatenation in loops, ConfigureAwait(false) missing in library code. Pattern matching against known bad patterns is exactly what Large Language Models (LLMs) do well.
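
As a contrived illustration, a made-up job class that packs several of these patterns together:

    using System;
    using System.Threading.Tasks;

    public class ReportJob
    {
        // Anti-pattern: async void - exceptions escape the caller and can take down the process.
        public async void Run(string[] lines)
        {
            // Anti-pattern: DateTime.Now where an unambiguous UTC timestamp is intended.
            Console.WriteLine($"Started at {DateTime.Now}");

            var body = "";
            foreach (var line in lines)
            {
                // Anti-pattern: string concatenation in a loop allocates a new string per pass.
                body += line + Environment.NewLine;
            }

            try
            {
                await SendAsync(body);
            }
            catch (Exception)
            {
                // Anti-pattern: catching Exception and silently swallowing it.
            }
        }

        private Task SendAsync(string body) => Task.CompletedTask;
    }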

Trivial security issues. SQL injection via string concatenation, hardcoded credentials, insecure random number generation. These appear in training data thousands of times.
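
For instance (connection string and schema invented for illustration):

    using Microsoft.Data.SqlClient;

    public class UserLookup
    {
        // Issue: hardcoded credentials checked into source control.
        private const string ConnectionString =
            "Server=db;Database=app;User Id=sa;Password=hunter2;";

        public SqlCommand BuildQuery(string userName)
        {
            var connection = new SqlConnection(ConnectionString);

            // Issue: SQL injection - user input is concatenated into the query text
            // instead of being passed as a parameter.
            return new SqlCommand(
                "SELECT * FROM Users WHERE Name = '" + userName + "'", connection);
        }
    }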

Style consistency. Naming inconsistencies, missing XML documentation, inconsistent error handling patterns relative to the rest of the file.

These categories represent real value. A review pass that catches these before human review means human reviewers can spend their time on harder problems.

What AI Code Review Systematically Misses

This is where the sycophancy shows up.

Wrong abstraction. AI reviewers evaluate the code you wrote against its own internal logic. They rarely notice that the abstraction itself is wrong: that the OrderProcessor class is doing three different things and probably should not exist as a single class, that the interface design couples callers to implementation details, that the naming reveals a confused mental model of the domain. Recognizing a wrong abstraction requires understanding the system it lives in and the cost of fixing it later. AI reviewers do not have that context.
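
A sketch of what that looks like, using an invented and deliberately tidy class: every method below is fine in isolation, so a diff-level review finds nothing, yet the abstraction itself is the problem.

    using System.Threading.Tasks;

    // Hypothetical: pricing policy, persistence, and notifications are three
    // responsibilities with different reasons to change, welded into one class.
    public class OrderProcessor
    {
        public decimal CalculateTotal(Order order) =>
            order.Subtotal + order.Subtotal * 0.19m;   // pricing policy

        public Task SaveAsync(Order order) =>
            Task.CompletedTask;                         // persistence

        public Task SendConfirmationEmailAsync(Order order) =>
            Task.CompletedTask;                         // notification
    }

    public record Order(decimal Subtotal);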

“This should be deleted.” The correct review comment for a surprising fraction of pull requests is something like: “This feature was not the right call, let’s talk before merging.” AI reviewers will not write that comment. They review code on its own terms. A well-implemented feature that solves the wrong problem gets a positive AI review, and that feedback loop, repeated over time, shapes how a team thinks about what quality means.

Systemic patterns across the codebase. AI reviewers see the diff. They do not know that the same abstraction appeared in three other places and was wrong each time. They do not know that this exact approach was tried and reverted eight months ago, and that the revert commit explains why. Reviewers with codebase history catch this. AI reviewers cannot.

Business logic correctness. Is this the right formula for calculating the surcharge? Does this authorization check correctly represent the access control model? Is this state machine transition valid given how the domain actually works? AI reviewers can tell you the code is internally consistent. They cannot tell you it is correct relative to what the software is supposed to do. This is not a minor gap. Business logic bugs are often the costliest bugs, and they are invisible to a reviewer that does not understand the business.
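
For instance, a made-up surcharge rule like the one below is internally consistent and will sail through an AI review; whether weekend orders should really be exempt, and whether the cutoff is 50 or 500, is something only a person who knows the business can answer.

    using System;

    public static class Surcharge
    {
        // Internally consistent, but is it correct? The weekend exemption and the
        // 50-currency-unit threshold are business decisions the reviewer must know.
        public static decimal Calculate(decimal orderTotal, DateTime placedAtUtc)
        {
            if (placedAtUtc.DayOfWeek is DayOfWeek.Saturday or DayOfWeek.Sunday)
                return 0m;

            return orderTotal >= 50m ? 0m : 4.90m;
        }
    }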

Performance under real load. AI reviewers flag obvious O(n²) algorithms and missing database indexes in toy examples. They rarely have visibility into the data distribution, the access patterns, or the production load profile that determines whether the code will hold up at scale. The performance review that matters happens in load testing and production, not in the diff view.
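
A typical case, with invented types: the first method below is quadratic and may or may not get flagged, but whether it matters depends on whether the lists hold fifty entries or five million, and nothing in the diff says which.

    using System.Collections.Generic;
    using System.Linq;

    public record OrderSummary(int CustomerId);

    public static class OrderFilter
    {
        // O(n * m): List.Contains is a linear scan, so this nested lookup is quadratic.
        // Fine for a test fixture; a different story at production volume.
        public static List<OrderSummary> ActiveOnly(List<OrderSummary> orders, List<int> activeIds) =>
            orders.Where(o => activeIds.Contains(o.CustomerId)).ToList();

        // Same behavior with a set lookup: O(n + m).
        public static List<OrderSummary> ActiveOnlyFast(List<OrderSummary> orders, List<int> activeIds)
        {
            var active = new HashSet<int>(activeIds);
            return orders.Where(o => active.Contains(o.CustomerId)).ToList();
        }
    }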

The Sycophancy Problem

The specific failure mode of AI code review is not that it misses things. Every review process misses things. The problem is the pattern of what it misses.

AI reviewers tend to approve the overall approach and find issues in the details. When a team leans heavily on AI review, there is a subtle risk: the team gets better and better at fixing the details an AI flags, while the bigger structural questions get less attention over time. I have seen this happen, and it is not anyone’s fault. It is a natural response to the feedback signal you are getting.

The approval bias is structural. AI reviewers are trained on review data where most code in a diff is acceptable. The kind of feedback that says “the entire approach here is wrong, close this PR and start over” is rare in training data and produces outcomes that make the tool seem less useful. So the model optimizes away from it.

The result: AI reviewers are systematically biased toward approving what you built and suggesting small improvements. They are not calibrated to recognize when the correct response is rejection.

There is also a confidence effect worth naming. A developer who ships a PR with zero AI findings tends to feel more confident that the code is solid. That confidence is not entirely wrong (the mechanical issues are likely clean), but it can crowd out the instinct to ask for a second human opinion. Over time, “the AI found nothing” starts to function as a substitute for “this is good code”, and that is a different claim entirely.

What AI Review Should Change About Human Review

If AI review is in your pipeline, it should shift what human reviewers focus on, not replace them.

AI reviewers handle the mechanical layer well: obvious bugs, pattern violations, style issues. That creates an opportunity for human reviewers to focus on what AI cannot do:

  • Is this the right design?
  • Does this code belong here at all?
  • Does the naming suggest the author has a clear mental model of the domain?
  • Is this consistent with decisions made elsewhere in the system?
  • What will maintaining this cost in six months?

Human review time is finite. If a human reviewer spends twenty minutes on a PR that an AI already reviewed and only surfaces style issues, something has gone wrong with how review time is being used. The value of human review is judgment, context, and the willingness to say “not yet.”

A team that uses AI review to reduce the need for human judgment does not end up with less review. It ends up with coverage that feels high but catches less of what actually matters.

The Diff Problem

Both AI and human review share a structural limitation: they evaluate changes, not outcomes.

A large refactor that genuinely improves a design looks messy as a diff: deletions everywhere, moved code, renamed concepts. A small change that introduces a subtle bug can look perfectly clean. Both human and AI reviewers are influenced by the shape of the change, not just its effect on the codebase.

AI reviewers are more constrained here because they have no option to go beyond the diff. A human reviewer can pull the branch, run it, read the surrounding code, check git history. AI reviewers are limited to what is presented to them.

This means AI review is structurally better suited to focused, contained changes, and less suited to catching problems that only become visible when you look at the broader context.

A concrete example: a PR that migrates a service to use a new internal library might look straightforward in the diff. The imports change, a few method calls are updated, tests pass. An AI reviewer sees nothing alarming. But a human who knows that the new library has different error propagation semantics, or that the migration breaks an assumption made elsewhere in the codebase, can catch that. The diff does not surface it. Context does.
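
To make that concrete with an invented pair of libraries: suppose the old client threw on a missing record and the new one returns null. The migrated call site below compiles, the diff looks mechanical, and nothing in it tells an AI reviewer that an exception-based error path just became a silent one.

    // Invented before/after clients, for illustration only. The old client's
    // GetCustomer threw CustomerNotFoundException for a missing record;
    // the new client returns null instead.
    public class BillingService
    {
        private readonly NewCrmClient _crm = new();

        public decimal GetCreditLimit(int customerId)
        {
            var customer = _crm.GetCustomer(customerId);

            // Callers that used to catch CustomerNotFoundException now get a
            // NullReferenceException here - or, with a null check added, no signal at all.
            return customer.CreditLimit;
        }
    }

    public class NewCrmClient
    {
        public Customer? GetCustomer(int customerId) => null; // stub standing in for the new library
    }

    public record Customer(decimal CreditLimit);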

Using AI Review Without Becoming Dependent on It

A few practices that have worked well in my experience:

Use AI review as a pre-filter, not a gatekeeper. Let it catch mechanical issues before human review. Humans then review for judgment, not syntax. An AI approval should not substitute for human review on anything that carries real risk.

Treat AI approval as a weak signal. An AI saying “looks good” means it did not find a pattern match for common issues. That is useful information, but it is not an endorsement of the design.

Read what the AI flagged, and what it did not. If it found nothing interesting, that is not evidence the code is good. It may mean the problems are exactly the kind the AI cannot see.

Keep humans in the design conversation. Architecture decisions, new abstractions, changes to domain models: these all need human review from someone with context. No AI reviewer carries your system’s history, your domain knowledge, or the judgment to tell you a design direction is off before you build it out.

Watch for approval drift. If PRs consistently get AI approval and human reviewers gradually stop questioning design decisions, that is a signal worth paying attention to. The human review may have been quietly degraded, not supplemented.

The Honest Summary

AI code review tools are useful. Add them to your pipeline. Let them handle the mechanical layer.

But they are not reviewers in the sense that actually matters. They do not have judgment. They do not know your system. They cannot tell you that you built the wrong thing. They are pattern matchers with a structural bias toward approving what you wrote.

The risk is not that these tools make developers worse (most developers using AI review are thoughtful professionals who also get human review). The risk is subtler: over time, optimizing for what AI review catches can quietly shift attention away from the questions it cannot ask. Staying aware of that dynamic is enough to avoid it.

AI review is a useful tool. Keep it in that category.

An AI reviewer that says “looks good” is not telling you the code is good. It is telling you it did not find a match.
