There is an AI code review bubble (greptile.com)
Submitted by dakshgupta 5 hours ago
  • kaishin a minute ago

    We used Greptile where I work and it was so bad we decided to switch to Claude. And even Claude isn’t nearly as good at reviewing as an experienced programmer with domain knowledge.

    • candiddevmike 2 hours ago

      None of these tools perform particularly well and all lack context to actually provide a meaningful review beyond what a linter would find, IMO. The SOTA isn't capable of using a code diff as a jumping off point.

      Also the system prompts for some of them are kinda funny in a hopelessly naive aspirational way. We should all aspire to live and breathe the code review system prompt on a daily basis.

      • gherkinnn 31 minutes ago

        Opus 4.5 catches all sorts of things a linter would not, and with little manual prompting at that. Missing DB indexes, forgotten migration scenarios, inconsistencies with similar services, an overlooked edge case.

        Now I'm getting a robot to review the branch at regular intervals and poke holes in my thinking. The trick is not to use an LLM as a confirmation machine.

        It doesn't replace a human reviewer.

        I don't see the point of paying for yet another CI integration doing LLM code review.

        • shagie an hour ago

          In some code that I was working on, I had

              // stuff
              obj.setSomeData(something);
              // fifteen lines of other code
              obj.setSomeData(something);
              // more stuff
          
          The 'something' was a little bit more complex, but it was the same something with slightly different formatting.

          My linter didn't catch the repeat call. When I asked the AI chat to review the code changes, it correctly flagged the repeat call.

          It also caught a repeat call in

              List<Objs> objs = someList.stream().filter(o -> o.field.isPresent()).toList();
              
              // ...
          
              var something = someFunc(objs);
          
              Thingy someFunc(List<Objs> param) {
                  return param.stream().filter(o -> o.field.isPresent()). ...
          
          Where one of the filter calls is unnecessary... and it caught that across a call boundary.

          So, I'd say that AI code reviews are better than a linter. There are still things it fusses about because it doesn't know the full context of the application, the tables that make certain guarantees about the data, or the team's code conventions (in particular the use of internal terms within naming conventions).

          • uoaei 36 minutes ago

            I'd say I see one anecdote, nothing to draw conclusions from.

            • realusername 32 minutes ago

              I had a similar review by AI, except my equivalent of setSomeData was stateful and needed to be there in both places; the AI just didn't understand any of it.

              • james_marks 27 minutes ago

                When this happens to me it makes me question my design.

                If the AI doesn't understand it, chances are it's counter-intuitive. Of course not all LLMs are equal, etc, etc.

                • wolletd 5 minutes ago

                  Then again, I have a rough idea of how I could implement this check with some (language-dependent) accuracy in a linter. With LLMs I... just hope and pray?

            • cbovis 14 minutes ago

              GH Copilot is definitely far better than just a linter. I don't have examples to hand but one thing that's stood out to me is its use of context outside the changes in the diff. It'll pull in context that typically isn't visible in the PR itself, the sort of things that only someone experienced in the code base with good recall would connect the dots on (e.g. this doesn't conform to typical patterns, or a version of this is already encapsulated in reusable code, or there's an existing constant that could be used here instead of the hardcoded value you have).

              • dakshgupta an hour ago

                I agree that none perform _super_ well.

                I would argue they go far beyond linters now, which was perhaps not true even nine months ago.

                To the degree you consider this to be evidence: in the last 7 days, PR authors have replied to a Greptile comment with "great catch", "good catch", etc. 9,078 times.

                • tadfisher an hour ago

                  Not trying to sidetrack, but a figure like that is data, not evidence. At the very minimum you need context which allows for interpretation; 9,078 positive author comments would be less impressive if Greptile made 1,000,000 comments in that time period, for example.

                  • fragmede 40 minutes ago

                    over 7 days does contextualize it some, though.

                    9,078 comments / 7 (days) / 8 (hours) = 162.107, though, so if this were a single human, that person would be making 162 comments an hour, 8 hours a day, 7 days a week?

                  • onedognight an hour ago

                    I fully agree. Claude's review comments have been 50% useful, which is great. For comparison, I have almost never found a useful TeamScale comment (classic static analyzer). Even more important, half of Claude's good finds are orthogonal to those found by other human reviewers on our team, i.e. it consistently points out things human reviewers miss, and vice versa.

                    • Sharlin 21 minutes ago

                      TBH that sounds like TeamScale just has too verbose default settings. On the other hand, people generally find almost all of the lints in Clippy's [1] default set useful, but if you enable "pedantic" lints, the signal-to-noise ratio starts getting worse – those generally require a more fine-grained setup, disabling and enabling individual lints to suit your needs.

                      [1] https://doc.rust-lang.org/stable/clippy/

                    • mulmboy 34 minutes ago

                      People more often say that to save face, by implying the issue you identified would be reasonable for the author to miss because it's subtle or tricky or whatever. It's often a proxy for embarrassment.

                      • blibble 43 minutes ago

                        > To the degree you consider this to be evidence, in the last 7 days, the authors of a PR has replied to a Greptile comment with "great catch", "good catch", etc. 9,078 times.

                        do you have a bot to do this too?

                        • written-beyond an hour ago

                          I mean, how far Rust's own Clippy lints went before any LLMs existed was actually insane.

                          Clippy + Rust's type system would basically ensure my software was working as close as possible to my spec before the first run. LLMs have greatly lowered the bar for bringing Clippy-quality linting to every language, but at the cost of determinism.

                          • boredtofears an hour ago

                            That sounds more like confirmation that Greptile is being included in a lot of agentic coding loops than anything else.

                          • athrowaway3z an hour ago

                            > The SOTA isn't capable of using a code diff as a jumping off point.

                            Not a jumping off point, but I'm having pretty great results on a complicated fork on a big project with a `git diff main..fork > main.diff`, then load in the specs I keep, and tell it to review the diff in chunks while updating a ./review.md

                            It's solving a problem I created myself by not reviewing some commits well enough, but it's surprisingly effective at picking up interactions spread out over multiple commits that might have slipped through regardless.
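
                              Roughly, the loop looks like this (a Python sketch of what I described; the file names and the review_chunk placeholder are illustrative stand-ins for whatever agent or CLI you point it at, not any particular tool's API):

                                  # Sketch: dump the diff, split it per file, review each chunk, accumulate notes.
                                  import re
                                  import subprocess
                                  from pathlib import Path

                                  def dump_diff(base: str = "main", head: str = "fork", out: str = "main.diff") -> str:
                                      """Write the full diff between two refs to a file and return its text."""
                                      diff = subprocess.run(
                                          ["git", "diff", f"{base}..{head}"],
                                          capture_output=True, text=True, check=True,
                                      ).stdout
                                      Path(out).write_text(diff)
                                      return diff

                                  def split_by_file(diff: str) -> list[str]:
                                      """Split a unified diff into one chunk per file."""
                                      parts = re.split(r"(?=^diff --git )", diff, flags=re.MULTILINE)
                                      return [p for p in parts if p.strip()]

                                  def review_chunk(chunk: str, specs: str) -> str:
                                      """Placeholder: send one chunk plus the specs to whatever agent you use."""
                                      raise NotImplementedError("wire this up to your own model call")

                                  def run_review(specs_path: str = "SPECS.md", review_path: str = "review.md") -> None:
                                      """Review the diff chunk by chunk and accumulate the notes in review.md."""
                                      specs = Path(specs_path).read_text()
                                      notes = [f"## Chunk {i}\n\n{review_chunk(chunk, specs)}\n"
                                               for i, chunk in enumerate(split_by_file(dump_diff()), start=1)]
                                      Path(review_path).write_text("\n".join(notes))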

                            • vimda an hour ago

                                Anecdotally, Claude Bug Bot has actually been super impressive in understanding non-trivial changes. Like, today, it noted a race condition in a ~1000 line Go change that go test -race didn't pick up. There are definitely issues though. For one, it's non-deterministic, so you end up with half a dozen commits, with each run noting different issues. For another, it tends to be quite in favour of premature optimisation. But overall, well worth it in my experience.

                            • geooff_ an hour ago

                              This article has a catchy headline, but there's really no content to it. This is content marketing without content. It seems like every week on Hacker News, there's a dozen of these. All seemingly code reviewers, too. Keep it to LinkedIn.

                              • cbovis 18 minutes ago

                                  I've also noticed this explosion of code review tools, and I feel there's some misplaced focus from the companies building them.

                                  Two that stood out to me are Sentry and Vercel. Both have released code review tools recently and both feel misplaced. I can definitely see why they thought they could expand with that type of product offering, but I just don't see a benefit over their competition. We have GH Copilot natively available on all our PRs; it does a great job, integrates very well with the PR comment system, and is cheap (free with our current usage patterns). GH and other source control services are well placed to have first-class code review functionality baked into their PR tooling.

                                  It's not really clear to me what Sentry/Vercel are offering beyond what Copilot does, and in my brief testing of them I didn't see a noticeable difference in quality or DX. Feels like they're fighting an uphill battle from day one with the product choice and are ultimately limited on DX by how deeply GH and other source control services allow them to integrate.

                                What I would love to see from Vercel, which they feel very well placed to offer, is AI powered QA. They already control the preview environments being deployed to for each PR, they have a feedback system in place with their Vercel toolbar comments, so they "just" need to tie those together with an agentic QA system. A much loftier goal of course but a differentiator and something I'm sure a lot of teams would pay top dollar for if it works well.

                                • ahmadyan an hour ago

                                    The problem with code review is that it's quite straightforward to just prompt for it, and the frontier models, whether Opus or GPT5.2Codex, do a great job at code reviews. I don't need a second subscription or API call when the one I already have works well out of the box; I'd rather focus on integration.

                                    In our case, agentastic.dev, we just baked the code review right into our IDE. It just packages the diff for the agent, with some prompt, and sends it out to different agents of your choice (whether Claude or Codex) in parallel. The reason our users like it so much is that they don't need to pay extra for code review anymore. Hard to beat a free add-on, and the cherry on top is that you don't have to read freaking poems.
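
                                    The fan-out itself is nothing fancy. A rough Python sketch of the shape of it (the run_agent helper is a hypothetical placeholder, not our actual implementation):

                                        # Package one diff with a prompt and send it to several agents in parallel.
                                        from concurrent.futures import ThreadPoolExecutor

                                        PROMPT = "Review this diff for bugs, missed edge cases, and style issues:\n\n"

                                        def run_agent(agent: str, prompt: str) -> str:
                                            """Hypothetical stand-in: call the named agent/CLI and return its review."""
                                            raise NotImplementedError(f"wire '{agent}' up to whatever model call you use")

                                        def review_in_parallel(diff: str, agents: list[str]) -> dict[str, str]:
                                            """Fan the same packaged diff out to every agent at once."""
                                            with ThreadPoolExecutor(max_workers=len(agents)) as pool:
                                                futures = {a: pool.submit(run_agent, a, PROMPT + diff) for a in agents}
                                                return {a: f.result() for a, f in futures.items()}

                                        # e.g. reviews = review_in_parallel(diff_text, ["claude", "codex"])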

                                  • themafia 40 minutes ago

                                    > Unfortunately, code review performance is ephemeral and subjective

                                    > Today's agents are better than the median human code reviewer

                                    Which is it? You cannot have it both ways.

                                    • maxverse 22 minutes ago

                                      > Today's agents are better than the median human code reviewer

                                      "...at catching issues and enforcing standards, and they're only getting better".

                                        I took this to mean that what counts as good code review is subjective. But if you clearly define standards and patterns for your code, your linter/automated tools/AI code reviewer will always catch more than humans.

                                    • personjerry 2 hours ago

                                      I don't really understand how this differentiates against the competition.

                                      > Independence

                                      Any "agent" running against code review instead of code generation is "independent"?

                                      > Autonomy

                                      Most other code review tools can also be automated and integrated.

                                      > Loops

                                      You can also ping other code review tools for more reviews...

                                          I feel like this article actually works against you by presenting the problems and then solving them inadequately.

                                      • dakshgupta 2 hours ago

                                        > Independence

                                            It is, but when the model/harness/tools/system prompts are the same or similar, the generator and the reviewer fail in similar ways. Question: would you trust a Cursor review of Claude-written code more, less, or the same as a Cursor review of Cursor-written code?

                                        > Autonomy

                                            Plenty of tools have invested heavily in AI-assisted review - creating great UIs to help human reviewers understand and check diffs. Our view is that code validation will be completely autonomous in the medium term, so our system is designed to make all human intervention optional. This is possibly an unpopular opinion, and we respect the camp that says people will always review AI-generated code. It's just not the future we want for this profession, nor the one we predict.

                                        > Loops

                                            You can invest in UX and tooling that makes this easier or harder. Our first step towards making it easier is a native Claude Code plugin in the `/plugins` command that lets Claude Code run a plan, write, commit, get review comments, plan, write loop.

                                        • sdenton4 an hour ago

                                              Independence is ridiculous - the underlying LLMs are too similar in their training data and methodologies to be anything like independent. Trying different models may somewhat reduce the dependency, but all of them have read Stack Overflow, Reddit, and GitHub in their training.

                                              It might be an interesting time to double down on automatically building and checking deterministic models of code which were previously too much of a pain to bother with. E.g., adding type checking to lazy Python code. These types of checks really are model-independent, and using agents to build and manage them might bring a lot of value.
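
                                              As a minimal sketch of what I mean (a made-up example; mypy is just one off-the-shelf checker):

                                                  # Previously "lazy" Python: no annotations, so a bad caller only fails at runtime.
                                                  # Once an agent (or a human) adds the annotations, a deterministic checker such
                                                  # as mypy can reject bad call sites before the code ever runs.

                                                  def parse_port(raw: str | None) -> int:
                                                      """Parse a port number from an environment-style string, with a default."""
                                                      if raw is None:
                                                          return 8080  # fall back to a default instead of crashing on int(None)
                                                      return int(raw)

                                                  # mypy now flags e.g. parse_port(8080) at check time, because an int is not
                                                  # compatible with `str | None` - no model involved, same answer every run.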

                                          • liamconnell an hour ago

                                                > It is, but when the model/harness/tools/system prompts are the same or similar, the generator and the reviewer fail in similar ways.

                                                Is there empirical evidence for that? Where does it sit on an epistemic meter between (1) "it sounds good when I say it" and (10) "someone ran an evaluation and got significant support"?

                                                "Vibes" (2 or 3 on that scale) are OK, I'm just honestly curious.

                                        • maxverse 20 minutes ago

                                              Maybe I'm buying into the Kool-Aid, but I actually really liked the self-aware tone of this post.

                                          > Based on our benchmarks, we are uniquely good at catching bugs. However, if all company blogs are to be trusted, this is something we have in common with every other AI code review product. One just has to try a few, and pick the one that feels the best.

                                          • disillusionist 35 minutes ago

                                                My company just finished a several-week review period of Greptile. Devs were split over the usefulness of the tool (compared to our current solution, Cursor). While Greptile did occasionally offer better insights than Cursor, it also exhibited strange behavior, such as entirely overwriting PR descriptions with its own text and occasionally arguing with itself in the comments. In the end we decided NOT to purchase Greptile, as there were enough "not quite there" issues to make it more trouble than it was worth. I am certain, though, that the Greptile team will resolve all those problems, and I wish them the best of luck!

                                            • pawelduda an hour ago

                                                  Good code reviews are part of a team's culture, and it's hard to just patch that with an agent. With millions of tools, it will be an arms race over which one is louder about as many things as possible, because:

                                                  - it will have a higher chance of convincing the author that an issue was important by throwing more darts - something a human wouldn't do, because going through an authentic review takes real mental effort,

                                                  - it will sometimes find a real, big issue, which reinforces the bias that it's useful,

                                                  - there will always be a tendency towards more feedback (not higher quality), because if it's too silent, is it even doing anything?

                                                  So I believe it will just add more rounds of back-and-forth prompting between more people, and I'm not sure that's a net positive.

                                                  Plus, PRs are a good reality check on whether your code makes sense when another person reviews it. A final safeguard before a maintainability miss, or a disaster waiting to be deployed.

                                              • quanwinn an hour ago

                                                    I liked that the post is self-aware that it's promoting its own product. But the writing seemed more focused on the philosophy behind code reviews and the impact of AI, and less on the mechanics of how Greptile differs from competitors. I was hoping to see more on the latter.

                                              • TuringTest an hour ago

                                                >A human rubber-stamping code being validated by a super intelligent machine is the equivalent of a human sitting silently in the driver's seat of a self-driving car, "supervising".

                                                So, absolutely necessary and essential?

                                                      You need someone to get the machine out of trouble when the unavoidable strange situation comes up that didn't appear during training and requires judgement based on ethics or logical reasoning. For that case, you need a human in charge.

                                                • taude an hour ago

                                                        It's not terribly hard to write a Copilot GHA that does this yourself for your specific team's needs. Not sure why you'd need to bring a vendor on for this....

                                                  What do the vendors provide?

                                                        I looked at a couple which were pretty snazzy at first glance, but now that I know more about how Copilot agents work and such, I'm pretty sure that in a few hours I could have a foundation for my team to build on that would take care of a lot of our PR review needs....

                                                  • jackconsidine an hour ago

                                                    > Only once would you have X write a PR, then have X approve and merge it to realize the absurdity of what you just did.

                                                          I get the idea. I'll still throw out that having a single X go through the full workflow could still be useful, in that there's an audit log, undo features (reverting a PR), notifications, what have you. It's not equivalent to "human writes ticket, code deployed live" for that reason.

                                                    • sastraxi an hour ago

                                                      Contrary to some of the other anecdotes in this thread, I've found automated code review to discover some tricky stuff that humans missed. We use https://www.cubic.dev/

                                                      • aurareturn an hour ago

                                                        Before I push any code, I always ask 2 different frontier LLMs to review the changes for any potential issues. Saved my ass a few times before pushing to production.

                                                        • pomarie an hour ago

                                                          Founder of cubic here, thanks for the shoutout!

                                                        • seanmccann 26 minutes ago

                                                          As Claude Code (and Opus) improves, Greptile is finding fewer issues in my code reviews.

                                                          • mohsen1 38 minutes ago

                                                            So far I've been pretty happy with Greptile. Tried Copilot and Cubic.dev but landed on Greptile

                                                            • pnathan an hour ago

                                                                    Claude Code's code review is _sufficient_ imo.

                                                              still need HITL, but the human is shifted right and can do other things rather than grinding through fiddly details.

                                                              • dcreater 31 minutes ago

                                                                      Reminder that this comes from the founder who got rightly lambasted for his comments about work-life balance and then doubled down when called out.

                                                                • trjordan 2 hours ago

                                                                  1. I absolutely agree there's a bubble. Everybody is shipping a code review agent.

                                                                  2. What on earth is this defense of their product? I could see so many arguments for why their code reviewer is the best, and this contains none of them.

                                                                  More broadly, though, if you've gotten to the point where you're relying on AI code review to catch bugs, you've lost the plot.

                                                                  The point of a PR is to share knowledge and to catch structural gaps. Bug-finding is a bonus. Catching bugs, automated self-review, structuring your code to be sensible: that's _your_ job. Write the code to be as sensible as possible, either by yourself or with an AI. Get the review because you work on a team, not in a vacuum.

                                                                  • dakshgupta 2 hours ago

                                                                    2. There is plenty of evidence for this elsewhere on the site, and we do encourage people to try it because like with a lot of AI tools, YMMV.

                                                                          You're totally right that PR reviews go a lot farther than catching issues and enforcing standards. Knowledge sharing is a very important part of it. However, there are processes you can create to enable better knowledge sharing while letting AI handle the issue-catching (maybe not fully yet, but in time). Blocking code from merging because knowledge isn't shared yet seems unnecessary.

                                                                    • ahmadyan an hour ago

                                                                      > 2. What on earth is this defense of their product?

                                                                            I think the distribution channel is the only defensive moat for low-to-mid-complexity, fast-to-implement features like code review agents. So in the case of Linear and Cursor Bugbot it makes a lot of sense. I wonder when GitHub/GitLab/Atlassian or Xcode will release their own review agent.

                                                                      • lenerdenator an hour ago

                                                                        > More broadly, though, if you've gotten to the point where you're relying on AI code review to catch bugs, you've lost the plot.

                                                                        > The point of a PR is to share knowledge and to catch structural gaps.

                                                                        Well, it was to share knowledge and to catch structural gaps.

                                                                        Now you have an idea, for better or for worse, that software needs to be developed AI-first. That's great for the creation of new code but as we all know, it's almost guaranteed that you'll get some bad output from the AI that you used to generate the code, and since it can generate code very fast, you have a lot of it to go through, especially if you're working on a monorepo that wasn't architected particularly well when it was written years ago.

                                                                        PRs seem like an almost natural place to do this. The alternative is the industry finding a more appropriate place to do this sort of thing in the SDLC, which is gonna take time, seeing as how agentic loop software development is so new.

                                                                      • h1fra 18 minutes ago

                                                                        one more ai code review please, I promise it will fix everything this time, please just one more

                                                                        • heliumtera 12 minutes ago

                                                                                No shit. What is the point of using an LLM to review code produced by an LLM?

                                                                                Code review presupposes a different perspective, which no platform can offer at the moment because they are only as sophisticated as the model they wrap. Claude generated the code, and Claude was asked if the code was good enough, and now you want to be in the middle to ask Claude again but with more emphasis, I guess? If I want more emphasis I can ask Claude myself. Or Qwen. I can't even begin to understand this rationale.

                                                                          • dcreater 28 minutes ago

                                                                            There is an AI bubble.

                                                                            Can drop the extra words