I recently had to do a one-off task using SQL in a way that I wasn't too familiar with. Since I could explain conceptually what I needed but didn't know all the right syntax, this seemed like a perfect use case to loop in Claude.
The first couple of back-and-forths went OK, but it quickly gave me some SQL that was invalid. I sent back the exact error and line number, and it responded by changing all of the aliases but repeating the same logical error. I tried again, and this time it rewrote more of the code but still used the exact same invalid operation.
At that point I just went ahead and read some docs and other resources and solved things the traditional way.
Given all of the hype around LLMs, I'm honestly surprised to see top models still failing in such basic and straightforward ways. I keep trying to use LLMs in my regular work so that I'm not missing out on something potentially great, but I still haven't hit a point where they're all that useful.
I'm not convinced LLMs will evolve into general AI. The promise that it's just around the corner feels increasingly like a big scam.
I was never on board with it. It feels like the same step change Google was. There was a time, around 1998, when it was just miles ahead of everything else out there. The first time you used it, it was like "geez, you got it right, didn't know that was possible." It was big and it changed things, but it wasn't the end-of-history event a bunch of people are utterly convinced this is.
We just need a little more of your (not mine) money to get there. Would I lie to you for $100 billion?
Depends what you mean by evolve. I don't think we'll get general AI by simply scaling LLMs, but I think general AI, if it arrives, will be able to trace its lineage very much back to LLMs. Journeys through the embedding space very much feel like the way forward to me, and that's what LLMs are.
I mean it’s been a couple of years!
It may or may not happen but “scam” means intentional deceit. I don’t think anyone actually knows where LLMs are going with enough certainty to use that pejorative.
Is it intentional deceit to tell everyone it's leading to something when, as you correctly point out, nobody actually knows if it will?
when their stock price rises because of their words, yes.
It has made me stop using Google and StackOverflow. I can look most things up quickly without rubber-ducking with other people, and thus I am more efficient. It is also good at spotting bugs in a function if the APIs are known and the API version is something it was trained on. If I need to understand what something is doing, it can help annotate the lines.
I use it to improve my code, but I still cannot get it to do anything that is moderately complex. The paper tracks with what I've experienced.
I do think it will continue to rapidly evolve, but it probably is more of a cognitive aid than a replacement. I try to only use it when I am tight on time or need a crutch to help me keep going.
I had to do something similar with BigQuery and some open source datasets recently.
I had bad results with Claude, as you mentioned. It kept hallucinating parts of the docs for the open datasets, coming up with nonsense columns, and not fixing errors when presented with the error text and more context. I had a similar outcome with 4o.
But I tried the same with o1 and it was much better consistently, with full generations of queries and alterations. I fed in some parts of the docs anytime it struggled, and it figured it out.
Ultimately I was able to achieve what I was trying to do with o1. I’m guessing the reasoning helped, especially when I confronted it about hallucinations and provided bits of the docs.
Maybe the model and the lack of CoT could be part of the challenge you ran into?
> and provided bits of the docs.
At this point I'd ask myself whether I want my original problem solved or if I just want the LLM to succeed with my requested task.
Yes, I imagine some do like to read and then ponder over the BigQuery docs. I like to get my work done. In my case, o1 nailed BigQuery flawlessly, saving me time. I just needed to feed in some parts of the open source dataset docs.
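(Tangent, but one thing that helped me with the hallucinated-columns problem: feed the model the actual schema rather than prose docs. A minimal sketch using the google-cloud-bigquery Python client; the public table here is just an illustration, not the dataset from the task above.)

    # Sketch: print the real column names/types of a public BigQuery table so
    # they can be pasted into the prompt instead of letting the model guess.
    # The table id below is illustrative only.
    from google.cloud import bigquery

    client = bigquery.Client()  # assumes application-default credentials
    table = client.get_table("bigquery-public-data.samples.shakespeare")

    for field in table.schema:
        print(f"{field.name}: {field.field_type}")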
I am a paying user of both Claude AI and ChatGPT, and I think for the use case you mention ChatGPT would have done better than Claude. At $20/month, I recommend that you try it for the same use case. o1 might have succeeded where Claude failed.
I do something like this every day at work lol. It's a good base to start with, but often you'll eventually have to Google or look at the docs to see what it's messing up
I've had a pretty similar outlook and still kind of do, but I think I do understand the hype a little bit: I've found that Claude and Gemini 2 Pro (experimental) sometimes are able to do things that I genuinely don't expect them to be able to do. Of course, that was the case before to a lesser extent already, and I already know that that alone doesn't necessarily translate to usefulness.
So, I have been trying Gemini 2 Pro, mainly because I have free access to it for now, and I think it strikes a bit above being interesting and into the territory of being useful. It has the same failure modes that LLMs have always had, but honestly it has managed to generate code and answer questions that Google definitely was not helping with. When not dealing with hallucinations/knowledge gaps, the resulting code was shockingly decent, and it could generate hundreds of lines of code without an obvious error or bug at times, depending on what you asked. The main issues were occasionally missing an important detail or overly complicating some aspect. I found the quality of the generated unit tests to be subpar, as it often made unit tests that strongly overlapped with each other and didn't necessarily add value (and they rarely worked out of the box anyways, come to think of it).
When trying to use it for real-world tasks where I actually don't know the answers, I've had mixed results. On a couple occasions it helped me get to the right place when Google searches were going absolutely nowhere, so the value proposition is clearly somewhere. It was good at generating decent mundane code, bash scripts, CMake code, Bazel, etc. which to me looked decently written, though I am not confident enough to actually use its output yet. Once it suggested a non-existent linker flag to solve an issue, but surprisingly it actually did inadvertently suggest a solution to my problem that actually did work at the same time (it's a weird rabbit hole, but compiling with -D_GNU_SOURCE fixed an obscure linker error with a very old and non-standard build environment, helping me get my DeaDBeeF plugin building with their upstream apbuild-based system.)
But unfortunately, hallucination remains an issue, and the current workflow (even with Cursor) leaves a lot to be desired. I'd like to see systems that can dynamically grab context and use web searches, try compiling or running tests, and maybe even have other LLMs "review" the work and try to get to a better state. I'm sure all of that exists, but I'm not really a huge LLM person so I haven't kept up with it. Personally, with the state frontier models are in, though, I'd like to try this sort of system if it does exist. I'd just like to see what the state of the art is capable of.
Even that aside, though, I can see this being useful especially since Google Search is increasingly unusable.
I do worry, though. If these technologies get better, it's probably going to make a lot of engineers struggle to develop deep problem-solving skills, since you will need them a lot less to get started. Learning to RTFM, dig into code and generally do research is valuable stuff. Having a bot you can use as an infinite lazyweb may not be the greatest thing.
Makes perfect sense why it couldn't answer your question: you didn't have the vocabulary of relational algebra to correctly prime the model. Any rudimentary field has its own corpus of vocabulary to express ideas and concepts specific to that domain.
I honestly can't tell if this is a sarcastic reply or not
I'm not sure it was even human.
Half of the work is specification and iteration. I think there’s a focus on full SWE replacement because it’s sensational, but we’ll more end up with SWE able to focus on the less patterned or ambiguous work and made way more productive with the LLM handling subtasks more efficiently. I don’t see how full SWE replacement can happen unless non-SWE people using LLMs become technical enough to get what they need out of them, in which case they probably have just become SWE anyway.
Yeah, I tried Copilot for the first time the other day and it seemed to be able to handle this approach fairly well -- I had to refine the details, but none of it was because of hallucinations or anything like that. I didn't give it a chance to try to handle the high-level objective, but based on past experience, it would have done something pointlessly overwrought at best.
Also, as an aside, re the "not a real programmer" salt: if we suppose, as I've been led to believe, that the "true essence" of programming is the ability to granularize instructions and conceptualize data flow like this, and if LLMs remain unsuitable for doing such tasks reliably unless the user can do so, this would seem to undermine the idea that someone who uses LLMs is only pretending to be a programmer.
Anyway, I used Copilot in VSCode to "Fix" this "code" (it advised me that I should "fix" my "code" by . . . implementing it, and then helpfully provided a complete example):
# Take a URL from stdin (prompt)
# If the URL contains "www.reddit.com", replace this substring with "old.reddit.com"
# Curl the URL and extract all links matching /https:\/\/monkeytype\.com\/profile\/[^>]+/ from the html;
# put them in a defaultdict as the first values;
# for each first value, the key is the username that appears in the nearest previous p.tagline > a.author
# For each first value, use Selenium to browse to the monkeytype.com/profile url;
# wait until 'div[class=\'pbsTime\'] div:nth-child(3) div:nth-child(1) div:nth-child(2)' is visible AND contains numbers;
# assign this value as the second value in the defaultdict
# Print the defaultdict as a json object
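(For reference, a rough sketch of what an implementation of that spec might look like; the regex and CSS selector are taken verbatim from the comments above, while requests/BeautifulSoup/Selenium/Firefox are my own assumptions about tooling, not necessarily what Copilot produced.)

    # Rough sketch of the spec above; selectors/regex come from the comments,
    # library choices (requests, BeautifulSoup, Selenium) are assumptions.
    import json
    import re
    from collections import defaultdict

    import requests
    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    PROFILE_RE = re.compile(r"https://monkeytype\.com/profile/[^>]+")
    PBS_SELECTOR = "div[class='pbsTime'] div:nth-child(3) div:nth-child(1) div:nth-child(2)"

    url = input("URL: ").replace("www.reddit.com", "old.reddit.com")
    html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text
    soup = BeautifulSoup(html, "html.parser")

    results = defaultdict(list)  # username -> [profile_url, pbs_time]
    for link in soup.find_all("a", href=PROFILE_RE):
        tagline = link.find_previous("p", class_="tagline")
        author = tagline.find("a", class_="author") if tagline else None
        if author and not results[author.text]:
            results[author.text].append(link["href"])

    driver = webdriver.Firefox()
    for username, values in results.items():
        driver.get(values[0])
        # wait until the element is visible AND its text contains a digit
        WebDriverWait(driver, 30).until(
            lambda d: d.find_element(By.CSS_SELECTOR, PBS_SELECTOR).is_displayed()
            and re.search(r"\d", d.find_element(By.CSS_SELECTOR, PBS_SELECTOR).text)
        )
        values.append(driver.find_element(By.CSS_SELECTOR, PBS_SELECTOR).text)
    driver.quit()

    print(json.dumps(results, indent=2))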
> unless non-SWE people using LLMs become technical enough to get what they need out of them
Non-SWE person here. In the past year I've been able to use LLMs to do several tasks for which I previously would have paid a freelancer on Fiverr.
The most complex one, done last spring, involved writing a Python program that I ran on Google Colab to grab the OCR transcriptions of dozens of 19th-century books off the Internet Archive, send the transcriptions to Gemini 1.5, and collect Gemini's five-paragraph summary of each book.
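(The shape of such a script is roughly this; a minimal sketch, where the Archive identifiers, the "_djvu.txt" OCR file convention, and the google-generativeai client/model name are my assumptions about how it might have been wired up, not the commenter's actual code.)

    # Minimal sketch: fetch OCR text for each Internet Archive item and ask
    # Gemini for a five-paragraph summary. Identifiers, the "_djvu.txt" path
    # convention, and the model name are assumptions/placeholders.
    import requests
    import google.generativeai as genai

    genai.configure(api_key="YOUR_API_KEY")
    model = genai.GenerativeModel("gemini-1.5-flash")

    identifiers = ["some-19th-century-book-id"]  # one Archive identifier per book

    for item in identifiers:
        ocr_url = f"https://archive.org/download/{item}/{item}_djvu.txt"
        text = requests.get(ocr_url).text
        response = model.generate_content(
            "Summarize the following book in five paragraphs:\n\n" + text
        )
        print(f"=== {item} ===\n{response.text}\n")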
If I had posted the job to Fiverr, I would have been willing to pay several hundred dollars for it. Instead, I was able to do it all myself with no knowledge of Python or previous experience with Google Colab. All it cost was my subscription to ChatGPT Plus (which I would have had anyway) and a few dollars of API usage.
I didn't put any full-time SWEs out of work, but I did take one job away from a Fiverr freelancer.
> I didn't put any full-time SWEs out of work, but I did take one job away from a Fiverr freelancer.
Who would use an LLM anyway these days. It will be interesting when Fiverr adds non-human freelancers, something similar to algorithmic traders. Passive income.
> I didn't put any full-time SWEs out of work, but I did take one job away from a Fiverr freelancer.
I think this is the nuance most miss when they think about how AI models will displace work.
Most seem to think “if it can’t fully replace a SWE then it’s not going to happen”
When in reality, it starts by lowering the threshold for someone who's technical but not a SWE to jump in and do the work themselves. Or it makes the job of an existing engineer more efficient. Each hour of work saved, spread across many tasks that would have otherwise gone to an engineer, eventually sums up to a full-time engineer's worth of work. If it's a Fiverr dev whose work you eliminated, that means the Fiverr dev will eventually go after the work that's remaining, putting supply pressure on other devs.
It's the same mistake many made about self-driving cars not happening because they couldn't handle every road. No, they just need to start with one road, master that, and then keep expanding to more roads. Until they can do all of SF, and then more and more cities.
This is a good anecdote but most software engineering is not scripting. It’s getting waist (or neck) deep in a large codebase and many intricacies.
That being said, I'm very bullish on AI being able to handle more and more of this very soon. Cursor definitely does a great job giving us a taste of cross-codebase understanding.
Seconded. Zed makes it trivial to provide entire codebases as context to Claude 3.5 Sonnet. That particular model has felt as good as a junior developer when given small, focused tasks. A year ago, I wouldn’t have imagined that my current use of LLMs was even possible.
Not sure about Claude, but my main problem with o3-mini is that it 'forgets' things which are supposed to fit in the context window. This results in it using different function names and data structures. I think it's guessing them instead of fetching them from the previous records.
If the goal is to get something to run correctly roughly once with some known data or input, then that's fine. Actual software development aims to run under 100% of circumstances, and LLMs are essentially cargo culting the development process and entrusting an automation that is unreliable to do mundane tasks. Sadly the quality of software will keep going down, perhaps even faster.
If the LLM can't find me a solution in 3 to 5 tries while I improve the prompt, I fall back to more traditional methods and/or use another model like Gemini.
Everyone is a typist now, so I don't think it is farfetched that everyone is a SWE in the future.
Very few people are typists.
Most people can use a keyboard, but the majority of non-technical people type at a speed which is orders of magnitude less than that of a professional typist.
Another comment here mentions how they used Colab while not being a SWE, but that is already miles ahead of what average people do with computers.
There are people who have used computers for decades and wouldn't be able to do a sum in a spreadsheet, nor know that is something spreadsheets can do.
What’s the WPM cutoff to be considered a typist?
In the narrowest version of the definition:
> The Registered Skilled Reporter (RSR) is NCRA's new designation that will recognize those stenographic professionals who are looking to validate their beginning level of competency.
> You have to pass three five-minute Skills Tests (SKT), which evaluate your skills level in three areas: Literary at 160 wpm, Jury Charge at 180 wpm, Testimony/Q&A at 200 wpm.
https://www.ncra.org/certification/NCRA-Certifications/regis...
Stenography is a little different to regular typing isn't it?
> in which case they probably have just become SWE anyway
or learn to use something like Bubble
> OpenAI researchers have admitted that even the most advanced AI models still are no match for human coders — even though CEO Sam Altman insists they will be able to beat "low-level" software engineers by the end of this year
This is the “self-driving cars next year, definitely” of the 20s, at this point.
Not sure what they found. Either the models are unable to solve the tasks, or the researchers were unable to solve them using the models. Looks like they used straight questions and not chain-of-thought. The result for the same model depends on how you ask. The tasks probably required more thinking under the hood than the model is allowed to do in one request. More interesting would be whether the model is capable of solving them given enough time, using multiple requests orchestrated automatically by some framework.
This has been obvious for a couple of years to anyone in the industry who has been faced with an onslaught of PRs to review from AI-enabled coders who sometimes can't even explain the changes being made at all. Great job calling it AI.
Well, OpenAI does currently have 288 job openings, including plenty of software engineers, so that says something.
> even though CEO Sam Altman insists they will be able to beat "low-level" software engineers by the end of this year.
"low/high level" starts to lose its meaning to me because it gets used in opposite ways
Where are the low level CEOs vs high level CEOs?
I'll bet AI could do their jobs right now.
Can SOMEONE please write AI software to replace these people?
Management consultants. Main attributes are confidence and ability to generate content. No need to stick around to see it through.
Are you bothered by the fact that software engineers might be easier to automate?
Is that a fact? I mean, see the linked article; even the company whose whole business model lies in convincing people that that _is_ a fact is kinda saying “yeah, perhaps not”, with vague promises of jam tomorrow.
Considering that there are chickens who outperform stockbrokers, no.
It's the opposite. An LLM is better at CEO stuff than at writing working code. A good developer + LLM instead of a CEO can succeed. A good CEO + LLM instead of a developer cannot succeed. (For a tech company.)
Even better, if you click through to the linked source he doesn't say "low-level" at all, or make any claim that is at all like the claim he is cited as making!
Yeah, low-level language gets conflated with low-level coders; it means the opposite in some sense.
I find the framing of this story quite frustrating.
The purpose of new benchmarks is to gather tasks that today's LLMs can't solve comprehensively.
If an AI lab built a benchmark that their models scored 100% on, they would have been wasting everyone's time!
Writing a story that effectively says "ha ha ha, look at OpenAI's models failing to beat the new benchmark they created!" is a complete misunderstanding of the research.
AI doesn't "solve" problems; the best it can do is remember them. Ask it to solve anything new that's challenging and it starts to hallucinate. At least currently.
And I'm ashamed that OpenAI and Sam Altman are walking around talking about AGI. And I'm so... disillusioned by the entire tech community that they have fallen for it, or at least pretend to believe it. It's like LinkedIn, where everybody pretends to be cringe-positivity people even though they know it's cringe and nobody believes it.
I wonder how many of the solutions that pass the SWE-Lancer evals would not be accepted by the poster due to low quality.
I've been trying so many things to automate solving bugs and adding features 100% by AI, and I have to admit it's been a failure. Without someone who can read the code, fully understand the AI-generated code, and suggest improvements (a SWE in the loop), AI code is mostly not good.
So this is an in-house benchmark, after their undisclosed partnership with a previous benchmark company. I really hope their next model doesn't vastly outperform on this benchmark in the coming weeks.
They should feed it bootcamp study materials and Cracking the Coding Interview book in order to improve its ability to code.
If it can master Binary Search Trees, it can master anything.
"If you need to improve speed, add Hash Tables."
> The models weren't allowed to access the internet
How many software developers could solve even simple programming problems (beyond 'Hello world') zero-shot style (you write in Notepad, then can compile only once and execute once) without access to the internet (Stack Overflow, Google search, documentation) or tools (terminal, debugger, linter, CLI)?
I don't think it's the best comparison for making any judgement, then. Future benchmarks should test agents that are allowed to solve the problem in 5-10 minutes, with access to the internet, documentation, a linter, and a terminal with MCP servers.
What would searching the Internet provide the models that they don’t already have? Most likely data sources such as stack overflow, documentation on the language it’s targeting, and a variety of relevant forum posts are already part of its training set.
Unless someone else came along and said “here’s how to solve x problem step by step”, I don’t see how additional information past its cutoff point would help. (Perhaps the AI could post on a forum and wait for an answer?)
Yes, iterative programming could help via access to tools- I can see that helping.
Why do programmers search for specific questions rather than always relying on their inherent knowledge?
I’m a crappy hobbyist programmer but for me it is useful to see if someone has implemented exactly what I need, or debugged the problem I’m having. I don’t think it’s reasonable to expect programmers or LLMs to know everything about every library’s use in every context just from first principles.
I do it to save the limited brain power I have before rest or food is required. You could spend 5 minutes writing a sort (at a high level processing) or just use existing code which might take 5 minutes to find but uses less brain power.
This allows you to use that brain power on specific things that need you and let google remember the format of that specific command or let an ai write out your routing file.
The older I get the less I'm bound by time, lack of knowledge or scope but more limited by clarity. Delegate tasks where possible and keep the clarity for the overall project and your position.
But why would that information not be included in the wide crawl already encoded in the model weights before the knowledge cutoff? I believe the article mentions frontier models so we are talking about models trained on trillions of tokens here
Because the cutoff can be a few months ago, and you still have new versions of libraries being developed every month: APIs getting deprecated or removed, or new APIs being added. The model needs access to the latest API or SDK that is available and needs to know, e.g., what iOS SDK and what macOS version you currently have. Having access to GitHub issues also helps to figure out whether there is a bug in a library.
> How many software developers could solve even simple programming problems (beyond 'Hello world') zero-shot style (you write in Notepad, then can compile only once and execute once) without access to the internet (Stack Overflow, Google search, documentation) or tools (terminal, debugger, linter, CLI)?
Many; there was a time when SO did not exist and people were able to solve non-trivial problems. There was a time when coding problems on exams had to be solved on paper, and if they did not compile you would not pass.
You miss my point about zero-shot style, where you have only one shot to compile and execute your code. Even in the old times, when people programmed using punched cards, it required a lot of reviews and iterations. This is the reason scripting languages like Python, Ruby, PHP, and JavaScript got popular: you had a very fast feedback loop and could do dozens of mini experiments. The majority of coding problems we have today are not algorithmic in nature.
I had one shot at my exams, was writing them on paper, compiling code in my brain.
It depends a lot on the type of problem. If we're talking about fixing a bug or adding a new feature to a large existing code base, which probably describes a huge portion of professional software engineering work, I would say most engineers could do most of those tasks without the internet. Especially if the goal is to simply pass a benchmark test of getting it working without future considerations.
You sound like someone who never used punch cards.
I think most developers could do that if they trained for it. As someone who learned how to program before the internet, it's just a different mindset and would take some time to adjust.
I am doing that now where changes take a day to make it to staging and no local environment. You roll with it.
> You sound like someone who never used punch cards.
I hope HN never changes.
I think about this a lot. AI in its current state is like working with an intern who is stranded on an island with no internet access or compiler; they have to write down all of the code in forward sequence on a piece of paper, and god help them if they have to write any UI while also being blind. None of the "build an app with AI start-to-finish" products work well at all because of this.
AI models are trained on the data from the internet, so sure, they couldn't do their search feature to scour the internet, but I doubt the material is much different than what the models were already trained on.
Additionally, before the age of stackoverflow and google, SWEs cracked open the book or documentation for whatever technology they were using.
Isn't this how interviews tend to work? So I think a good number of devs would, yes.
Interviews like LeetCode on a whiteboard only test your reasoning, not whether your solution will execute out of the box zero-shot style. Humans solve problems in an iterative way; that's why a fast feedback loop and access to tools are essential. When you start coding, the compiler or linter hints that you forgot to close some braces or missed a semicolon. The compiler tips you off that an API changed in a new version; IntelliSense hints at what methods you can use in the current context, what parameters you can use, and their types. Once you execute the program, you get runtime tips that maybe you missed installing some Node or Python package. When you install packages, you get hints that maybe one package has an additional dependency or two package versions are not compatible. Command-line tools like `ls` tell you what the project structure is, etc.
As one who organises competitive programming contests on a regular basis for university students, I would say almost every single one.
I used to be able to. The web's made me lazy. I was better before it and I'm better when I don't use it.
Really, the stuff you think helps you is often just holding you back.
These models are held to higher standards than humans. They should be able to solve any coding problem with just the documentation.
Isn't the point of the training that they already have all the information they could have? So they do not need the internet, as on the internet there would only be information they already "know"...
I barely ever look at StackOverflow as the quality of answers there is so poor. It was once good but the proliferation of duplicates[1] has really ruined it for me, as well as outdated answers not being replaced. Google search results are also crap.
I agree with your point, though. The "LLM" model just isn't a good fit for some tasks, in fact many tasks. It is good for creative writing, but even then only really because our standards for creative writing are pretty low. It doesn't write with any real creativity or flair in the writing. It can make things up and stay on topic. It is poor for anything where accuracy matters. It can't edit what it produces! Nobody writes things in one shot in reality, not even creative writing, but especially not code or technical writing. It needs to be able to do a whole suite of other things: move blocks of output around, rewrite chunks, expand chunks, condense chunks, check chunks against external sources or proper knowledge banks, compare chunks for internal consistency, and more. That is how we operate: at the level of functions or blocks of code, at the level of paragraphs and sentences and sections.
[1]: Yes, the opposite of the problem people here usually have with it, which is things being closed as duplicates. I think more duplicates should be deleted and redirected to a canonical answer, which is then a focus of improvement. Too often google searches give me barely answered or unanswered duplicates and I have to click around in the site to find the result Google clearly should have given me in the first place (better keyword matches, not closed, higher score, etc). I think StackOverflow do this intentionally so people have to click on more pages and see more ads.
> How many software developers could solve even simple programming problems (beyond 'Hello world') zero-shot style (you write in Notepad, then can compile only once and execute once) without access to the internet (Stack Overflow, Google search, documentation) or tools (terminal, debugger, linter, CLI)?
The experienced ones can
Link to the original paper: https://arxiv.org/pdf/2502.12115
TL;DR:
They tested with programming tasks and manager's tasks.
The vast majority of tasks given require bugfixes.
Claude 3.5 Sonnet (the best performing LLM) passed 21.1% of programmer tasks and 47.0% of manager tasks.
The LLMs have a higher probability of passing the tests when they are given more attempts, but there's not a lot of data showing where the improvement tails off. (probably due to how expensive it is to run the tests)
Personally, I have other concerns:
- A human being asked to review repeated LLM attempts to resolve a problem is going to lead that human to review things less thoroughly after a few attempts and over time is going to let false positives slip through
- An LLM being asked to review repeated LLM attempts to resolve a problem is going to lead to the LLM convincing itself that it is correct with no regard for the reality of the situation.
- LLM use increases code churn in a code base
- Increased code churn is known to be bad for the health of projects
Coding, especially the type mentioned in the article (building an app based on a specification), is a highly complex task. It cannot be completed with a single prompt and an immediate, flawless result.
This is why even most software projects (built by humans) go through multiple iterations before they work perfectly.
We should consider a few things before asking, "Can AI code like humans?":
- How did AI learn to code? What structured curriculum was used?
- Did AI receive mentoring from an experienced senior who has solved real-life issues that the AI hasn't encountered yet?
- Did the AI learn through hands-on coding or just by reading Stack Overflow?
If we want to model AI as being on par with (or even superior to) human intelligence, don’t we at least need to consider how humans learn these complex skills?
Right now, it's akin to giving a human thousands of coding books to "read" and "understand," but offering no opportunity to test their programs on a computer. That’s essentially what's happening!
Without doing that, I don't think we'll ever be able to determine whether the limitation of current AI is due to its "low intelligence" or because it hasn’t been given a proper opportunity to learn.
LLMs can fundamentally only do something similar to learning in the training phase. So by the time you interact with it, it has learned all it can. The question we then care about is whether it has learned enough to be useful for problem X. There's no meaningful concept of "how intelligent" the system is beyond what it has learned, no abstract IQ test decoupled from base knowledge you could even conceive of.
> How did AI learn to code?
It didn't; it's just very good at copying already existing code and tweaking it a bit.
> Did AI receive mentoring from an experienced senior
It doesn't even comprehend what an experienced senior is; all it cares about is how frequently certain patterns occurred in certain circumstances.
> Did the AI learn through hands-on coding or just by reading Stack Overflow?
It "learnt" by collecting a large database of existing code, most of which is very low-quality open source proofs of concept, and then it spits out the bits that are probably related to a question.
I think we're drastically oversimplifying what "pattern matching" means. It is also one of the fundamental mechanisms by which the human brain operates. I believe we are consciously (or perhaps subconsciously) conditioned to think that human "logic" and "reasoning" are several degrees more advanced than pattern matching. However, I don't think this is true.
The fundamental difference lies in how patterns are formed in each case. For LLMs, all they know are the patterns they observe in "words" - that is the only "sense" they possess. But for humans, pattern recognition involves continuously ingesting and identifying patterns across our five primary senses—not just separately, but simultaneously.
For example, when an LLM describes something as "ball-shaped," it cannot feel the shape of a ball because it lacks another sense to associate with the word "ball-shaped." In contrast, humans have the sense of touch, allowing them to associate the word or sound pattern "ball" with the physical sensation of holding a ball.
> It is also one of the fundamental mechanisms by which the human brain operates.
One of the fundamental mechanisms by which brains operate: the bits we share with every other animal with a brain.
Good luck teaching your dog to code.
Being great at fetching your newspaper in the morning doesn't mean it's going to wake up and write you an accounting software package at the end of the year.
LLMs will never solve this problem; they are basically just glorified copy-and-paste engines, and solving real code problems requires invention, even for the most basic tasks. The best they will manage in their current direction is to reason that they don't have the capability or capacity to actually solve the problem, rather than just getting it wrong the vast majority of the time.
I believe it. I couldn't even get o1 or claude 3.5 to write a tampermonkey script that would turn off auto-scroll to bottom in LibreChat, even when uploading the html and javascript as context.
Apparently it has to do with overflow anchor or something in React? Idk. I gave up.
Unless it works by literally scrolling with JS, I bet some strategic bit of CSS should do it...
I prompted up a very basic Flask scaffold via Windsurf and once it reached a certain code size, it just started to remove or weirdly rewrite old parts to handle the context. ("You're right let's move that back in"). Didn't end well.
It's so much easier to learn from examples than from documentation, in my opinion; documentation is what I use when I want to know additional parameters or downsides of a piece of functionality. I'm no coder, though.
What I'd like LLMs to do is present examples using acceptable design standards, e.g. what's the pythonic way to do this, what are exceptions that might yield better performance/optimization (and at what cost), or what is the best go(lang) JSON parser (since the built-in isn't very good).
But instead, I get average to below-average examples (surprise surprise, this is what happens when you train on a high noise-to-signal set of data), which are either subtly or wildly incorrect. I can't see this improving, with Reddit and other forums trying to introduce AI-bot-written posts. Surely these companies are aware of how LLM output degenerates when fed its own input within a few (not even a dozen) generations?!
Interesting that Claude wins despite the other models being more expensive and doing much better in the traditional benchmarks.
Despite the lackluster coding performance, AI has PROVEN it's able to provide a rationale for profit-taking job cuts, layoffs, reduced stock grants, and increased executive bonuses.
So it's not ALL bad news.
I don't understand why anyone would even think that the current crop of AI is capable of planning or reasoning on its own.
Transformers with memory would be different story.
But, no memory, no capability to reason. End of story, right?
This mirrors what I've seen. I've found that LLMs are most helpful in places where I have the most experience.
Maybe this is because of explicitness in the prompt and preempting edge cases. Maybe it's because I know exactly what should be done. In these cases, I will still sometimes be surprised by a more complete answer than I was envisioning, or a few edge cases that weren't front of mind.
But if I have _no_ idea things go wildly off course. I was doing some tricky frontend work with dynamically placed reactflow nodes and bezier curve edges. It took me easily 6 hours of bashing my head against the problem, and it was hard to stop using the assistant because of sunk cost. But I probably would have gotten more out of it and been faster if I'd just sat down and really broken down the problem for a few hours and then moved to implement.
The most tempting part of LLMs is letting them figure out design when you're in a time crunch. And the way it solves things when you understand the domain and the bottoms-up view of the work is deceptive in terms of capability.
And in this case, it's hoping that people on Upwork understand their problems deeply. If they did, they probably wouldn't be posting on Upwork. That's what they're trying to pay for.
I just had this conversation with a customer, and it's hard to avoid anthropomorphizing AI. Once you equate the AI system with a human, you apply human heuristics: a human who creates perfectly PEP 8-formatted Python is probably a decent Python programmer, whereas someone who bangs out some barely readable code with mixed spacing and variable naming styles is most likely a novice.
We use these signals to indicate how much we should trust the code - same with written text. Poorly constructed sentences? Gaps or pauses? Maybe that person isn’t as knowledgeable.
These shortcuts fail miserably on a system that generates perfect grammar, so when you bring your stereotypes gleaned from dealing with humans into the ai world, you’re in for an unpleasant surprise when you unpack the info and find it’s only about 75% correct, despite the impeccable grammar.
> But if I have _no_ idea things go wildly off course.
This is the key to getting some amount of productivity from LLMs in my experience, the ability to spot very quickly when they veer off course into fantasyland and nip it in the bud.
Then you point out the issue to them, they agree that they made a dumb mistake and fix it, then you ask them to build on what you just agreed to and they go and reintroduce the same issue they just agreed with you was an obvious problem... because ultimately they are more fancy auto complete machines than they are actual thinking machines.
I have found them to be a time saver on the whole even when working with new languages but I think this may in large part be helped by the fact that I have literally decades of coding experience that sets off my spidey senses as soon as they start going rampant.
I can't begin to imagine how comical it must be when someone who doesn't have a strong programming foundation just blindly trusts these things to produce useful code until the runtime or compile time bugs become unavoidably obvious.
I believe the outcome of this type of article is actually positive. The ‘SWE-Lancer’ benchmark provides visibility into a more pragmatic assessment of LLM capabilities.
Ironically, it actually refutes Altman's claims mentioned in the same article. Hard to replace engineers when you create a benchmark you can't score decently on.
Or it could be a case of: never prepare a benchmark or comparison that you think you won't succeed at. This is especially true when you are funded mostly by private/VC investors. Time will tell.
I think they are trying to frame the narrative and then succeed at it. Let's see. This helps justify OpenAI's valuation and efforts to investors/VCs. After all, IMO, without coding as a use case for LLMs, AI wouldn't have nearly the same hype/buzz as it does now. Greed (profit) and fear (losing jobs) are great motivators to keep investment hype and funds coming in.
The benchmark for AI models to assess their 'coding' ability should be on actual real-world, production-grade repositories and on fixing bugs in them, such as the Linux kernel, Firefox, SQLite, or other large-scale, well-known repositories.
Not these HackerRank, LeetCode, or previous IOI and IMO problems that we already have the solutions to, where the model is just reproducing the most optimal solution copied from someone else.
If it can't manage most unseen coding problems with no previous solutions to them, what hope does it have against explaining and fixing bugs correctly on very complex repositories with over 1M-10M+ lines of code?
The new SWE-Lancer benchmark is on actual problems, and that is where it is failing by a huge margin.
Previously on source: https://news.ycombinator.com/item?id=43086347
It solved a lot of mine.
I’ve got 15 years of coding experience at some of the biggest tech companies. My personal opinion is that most people have no clue how good these AI coding systems already are. If you use something like RepoPrompt, where you selectively choose which files to include in the prompt, and then also provide a clear description of what changes you want to make—along with a significant portion of the source code—a model like O1Pro will nail the solution the first time.
The real issue is that people are not providing proper context to the models. Take any random coding library you’re interfacing with, like a Postgres database connection client. The LLM isn’t going to inherently know all of the different configurations and nuances of that client. However, if you pass in the source code for the client along with the relevant portions of your own codebase, you’re equipping the model with the exact information it needs.
Every time you do this, including a large prompt size—maybe 50,000 to 100,000 tokens—you dramatically improve the model’s ability to generate an accurate and useful response. With a strong model like O1Pro, the results can be exceptional. The key isn’t that these models are incapable; it’s that users aren’t feeding them the right data.
Are you suggesting that OpenAI published a paper assessing their own models on real-world problems, but failed to properly use their own models? And/or that you know better than OpenAI scientists how to use OpenAI models most effectively?
the limiting factor is no longer the answers but the questions
I saw this
still, chain of thought is great for LeetCode 75
Since interviewers “want to see how you think” (and get the right answer in less time than other candidates on average)
I can now see how you’re supposed to think (and get the right answer in less time than other candidates on average, for now)
... so far
The models were restricted from accessing the internet and forced to develop their own solutions internally.
I think researchers will find that human coders are unable to solve most coding problems without access to the internet.
What kinds of problems are you talking about? There are problems that require you to constantly learn new libraries and services, accessing documentation, and there are actual software problems where you have to reflect on your own large code base. I work on both kinds of problems, and in the first case, the models are actually well versed in, say, all of the CloudFormation syntax that I would have to look up. On the opposite end, I have written many features on trips, unable to be distracted by the internet, just me and the code, and being able to read library source code.
The fact is, programming requires abstract modeling that language models aren’t demonstrating the capability of fully replicating. At least, not that we can see, yet.
If you're requiring human coders to write valid, compilable code, maybe. But if you're doing that, you're doing coding interviews wrong.
Any interviews I've run or been a part of have required the interviewee to demonstrate their problem-solving skills using pseudo-code.
A decent human programmer with experience in a particular domain may rely on internet access to look up API documentation and other generic references, but if you read the paper, you'll see that the AI systems tested suffered from more basic deficiencies in approach and reasoning ('3.6. Discussion', starting on page 7).
And you created the internet then?
Damn. How did they write code before 1992?
We did have usenet in the 80s and gopher in the 90s. But yes, in those days it was that mythical "paper" stuff (or were we still using papyrus? I forget)
Seriously? I think I've been most productive in pre-internet days, when stuck on a transatlantic flight with a Java (shudder, talk about PTSD) reference manual and a laptop that could barely last 2 hours on battery, with emphasis on the measure-twice-cut-once mentality.
It's painful to watch junior coders copy-n-paste from SO or W3Schools (including code samples clearly labelled not-for-production) with little effort to understand what they are doing.