ChatGPT Reviews My Code: What It Caught, What It Overlooked in My Project
10/10. That's the score ChatGPT earned for spotting obvious bugs across five real pull-request reviews, according to a recent hands-on experiment.
ChatGPT's performance as a code-review assistant was put to the test in a hands-on experiment by developer Daniil Kornilov, who fed the model five real pull-request diffs and measured its feedback against that of a senior engineer. The results, published on his personal blog on March 8, show that the AI excels at catching low-level defects but flounders when a problem requires architectural insight or domain-specific reasoning. In the "obvious bugs" category, GPT-4 earned a perfect 10/10, flagging a missing zero-division guard in a Swift function and even suggesting a concise guard-statement fix. It scored nearly as well on naming inconsistencies (8/10) and missing error handling (9/10), pointing out abbreviated identifiers such as `usrData` and warning that a silent catch block would swallow failures. These high scores line up with the model's strength in pattern-matching syntax and style conventions: ChatGPT "is shockingly good at spotting basic errors," Kornilov notes.
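The blog post does not reproduce the exact diff, but the kind of bug and guard-statement fix described might look like this minimal Swift sketch (the function name and signature here are illustrative assumptions, not Kornilov's actual code):

```swift
// Illustrative sketch of a missing zero-division guard of the sort the
// review caught, with the guard-statement fix the model reportedly suggested.
func averageScore(total: Double, count: Int) -> Double {
    // Without this guard, count == 0 divides by zero; for Double operands
    // that silently yields .infinity or .nan rather than crashing.
    guard count > 0 else { return 0 }
    return total / Double(count)
}
```

This is exactly the pattern-matching territory where the experiment found GPT-4 strongest: the defect is visible from the function body alone, with no project context required.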
The experiment also exposed the model's blind spots. Architectural concerns received a dismal 2/10: GPT-4 failed to recognize that a 500-line ViewModel was doing too much and should be broken into smaller components. Performance pitfalls fared only slightly better (3/10); the AI missed an N+1 query pattern in a loop that a senior developer spotted in two seconds, instead labeling the code "clean." Business logic was the lowest-scoring category (1/10): a discount calculation that could produce negative prices slipped past the model, which lacks an understanding of the underlying domain rules. Even security checks were only modestly effective (4/10), catching obvious SQL-injection patterns but overlooking subtler issues such as timing attacks and insecure token storage. By contrast, the human reviewer scored 9/10 on architecture, 8/10 on performance, 10/10 on business logic, and 7/10 on security, underscoring the gap between syntactic analysis and holistic code quality.
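To see why the business-logic category is so hard for a syntax-focused reviewer, consider a hedged sketch of the negative-price class of bug described above (the type and function names are hypothetical, invented here for illustration):

```swift
// Hypothetical sketch of the business-logic flaw GPT-4 reportedly missed:
// both versions are syntactically clean, so only domain knowledge
// ("a price can never be negative") distinguishes the bug from the fix.
struct Pricing {
    // Buggy version: a discount larger than the price yields a negative total.
    static func buggyFinalPrice(price: Double, discount: Double) -> Double {
        return price - discount
    }

    // Fixed version: clamp at zero so the total can never go negative.
    static func finalPrice(price: Double, discount: Double) -> Double {
        return max(price - discount, 0)
    }
}
```

Nothing in the buggy version violates a style rule or a type rule, which is precisely why a reviewer that pattern-matches on syntax rates it "clean" while a human who knows the domain rejects it instantly.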
Kornilov's "right-way" workflow treats AI as a first-pass filter rather than a replacement for human review. He recommends writing code, running an AI review to capture roughly 60 % of issues in seconds, fixing the obvious problems, and then handing the diff to a senior engineer for the remaining 40 % of defects. This hybrid approach, he claims, can halve the time senior reviewers spend on trivial typos and null-check oversights. He also combines GPT-4 with static-analysis tools like SwiftLint for iOS projects, noting that the two layers together catch more problems than either could alone. Detailed instructions for his AI-augmented workflow are available on his Boosty page, and he shares daily tips on a Telegram channel, t.me/SwiftUIDaily.
Industry observers have begun to echo Kornilov's cautionary tone. ZDNet's recent piece on "How well does ChatGPT know you?" highlights that the model's strength lies in surface-level prompts, while deeper, context-rich tasks still demand human judgment. Likewise, TechCrunch's coverage of ChatGPT's new "year-end review" feature frames the AI as a complementary assistant rather than a standalone analyst. Both outlets reinforce the notion that, despite impressive gains in natural-language understanding, GPT-4 remains a tool best used in tandem with expert oversight.
In practice, the experiment suggests a pragmatic formula for development teams: leverage GPT-4's rapid syntax-level checks to offload the low-hanging-fruit work, then allocate senior engineers to the nuanced, system-wide concerns that the model routinely overlooks. As AI-driven tooling matures, the balance may shift, but for now the data from Kornilov's five-pull-request test (10/10 on syntax, 8/10 on naming, 9/10 on error handling, but only 2/10 on architecture) makes clear that human expertise remains indispensable for robust, secure, and performant software.
This article was created using AI technology and reviewed by the SectorHQ editorial team for accuracy and quality.