I recently had to do a one-off task using SQL in a way that I wasn't too familiar with. Since I could explain conceptually what I needed but didn't know all the right syntax, this seemed like a perfect use case to loop in Claude.
The first couple of back-and-forths went OK, but it quickly gave me some SQL that was invalid. I sent back the exact error and line number, and it responded by changing all of the aliases but repeating the same logical error. I tried again, and this time it rewrote more of the code but still used the exact same invalid operation.
At that point I just went ahead and read some docs and other resources and solved things the traditional way.
Given all of the hype around LLMs, I'm honestly surprised to see top models still failing in such basic and straightforward ways. I keep trying to use LLMs in my regular work so that I'm not missing out on something potentially great, but I still haven't hit a point where they're all that useful.
I'm not convinced LLMs will evolve into general AI. The promise that it's just around the corner feels increasingly like a big scam.
I was never on board with it. It feels like the same step change Google was. There was a time, around 1998, when it was just miles ahead of everything else out there. The first time you used it, it was like "geez, you got it right, didn't know that was possible." It was big and it changed things, but it wasn't the end-of-history event a bunch of people are utterly convinced this is.
We just need a little more of your (not mine) money to get there. Would I lie to you for $100 billion?
Depends what you mean by evolve. I don't think we'll get general AI by simply scaling LLMs, but I think general AI, if it arrives, will be able to trace its lineage very much back to LLMs. Journeys through the embedding space very much feel like the way forward to me, and that's what LLMs are.
I mean it’s been a couple of years!
It may or may not happen but “scam” means intentional deceit. I don’t think anyone actually knows where LLMs are going with enough certainty to use that pejorative.
Is it intentional deceit to tell everyone it's leading to something when, as you correctly point out, nobody actually knows if it will?
when their stock price rises because of their words, yes.
It has made me stop using Google and StackOverflow. I can look most things up quickly without rubber-ducking with other people, and thus I am more efficient. It is also good at spotting bugs in a function if the APIs are known and the API version is something it was trained on. If I need to understand what something is doing, it can help annotate the lines.
I use it to improve my code, but I still cannot get it to do anything that is moderately complex. The paper tracks with what I've experienced.
I do think it will continue to rapidly evolve, but it probably is more of a cognitive aid than a replacement. I try to only use it when I am tight on time or need a crutch to help me keep going.
I had to do something similar with BigQuery and some open source datasets recently.
I had bad results with Claude, as you mentioned. It kept hallucinating parts of the docs for the open datasets, coming up with nonsense columns, and not fixing errors when presented with the error text and more context. I had a similar outcome with 4o.
But I tried the same with o1 and it was much better consistently, with full generations of queries and alterations. I fed in some parts of the docs anytime it struggled, and it figured it out.
Ultimately I was able to achieve what I was trying to do with o1. I’m guessing the reasoning helped, especially when I confronted it about hallucinations and provided bits of the docs.
Maybe the model and the lack of CoT could be part of the challenge you ran into?
> and provided bits of the docs.
At this point I'd ask myself whether I want my original problem solved or if I just want the LLM to succeed with my requested task.
Yes, I imagine some do like to read and then ponder over the BigQuery docs. I like to get my work done. In my case, o1 nailed BigQuery flawlessly, saving me time. I just needed to feed in some parts of the open source dataset docs.
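(Tangent, but one thing that helped me with the hallucinated-columns problem: feed the model the actual schema rather than prose docs. A minimal sketch using the google-cloud-bigquery Python client; the public table here is just an illustration, not the dataset from the task above.)

    # Sketch: print the real column names/types of a public BigQuery table so
    # they can be pasted into the prompt instead of letting the model guess.
    # The table id below is illustrative only.
    from google.cloud import bigquery

    client = bigquery.Client()  # assumes application-default credentials
    table = client.get_table("bigquery-public-data.samples.shakespeare")

    for field in table.schema:
        print(f"{field.name}: {field.field_type}")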
I am a paying user of both Claude AI and ChatGPT, and I think for the use case you mention ChatGPT would have done better than Claude. At $20/month, I recommend that you try it for the same use case. o1 might have succeeded where Claude failed.
I do something like this every day at work lol. It's a good base to start with, but often you'll eventually have to Google or look at the docs to see what it's messing up
I've had a pretty similar outlook and still kind of do, but I think I do understand the hype a little bit: I've found that Claude and Gemini 2 Pro (experimental) sometimes are able to do things that I genuinely don't expect them to be able to do. Of course, that was the case before to a lesser extent already, and I already know that that alone doesn't necessarily translate to usefulness.
So, I have been trying Gemini 2 Pro, mainly because I have free access to it for now, and I think it strikes a bit above being interesting and into the territory of being useful. It has the same failure modes that LLMs have always had, but honestly it has managed to generate code and answer questions that Google definitely was not helping with. When not dealing with hallucinations/knowledge gaps, the resulting code was shockingly decent, and it could generate hundreds of lines of code without an obvious error or bug at times, depending on what you asked. The main issues were occasionally missing an important detail or overly complicating some aspect. I found the quality of the generated unit tests to be subpar, as it often made unit tests that strongly overlapped with each other and didn't necessarily add value (and they rarely worked out of the box anyways, come to think of it).
When trying to use it for real-world tasks where I actually don't know the answers, I've had mixed results. On a couple occasions it helped me get to the right place when Google searches were going absolutely nowhere, so the value proposition is clearly somewhere. It was good at generating decent mundane code, bash scripts, CMake code, Bazel, etc. which to me looked decently written, though I am not confident enough to actually use its output yet. Once it suggested a non-existent linker flag to solve an issue, but surprisingly it actually did inadvertently suggest a solution to my problem that actually did work at the same time (it's a weird rabbit hole, but compiling with -D_GNU_SOURCE fixed an obscure linker error with a very old and non-standard build environment, helping me get my DeaDBeeF plugin building with their upstream apbuild-based system.)
But unfortunately, hallucination remains an issue, and the current workflow (even with Cursor) leaves a lot to be desired. I'd like to see systems that can dynamically grab context and use web searches, try compiling or running tests, and maybe even have other LLMs "review" the work and try to get to a better state. I'm sure all of that exists, but I'm not really a huge LLM person so I haven't kept up with it. Personally, with the state frontier models are in, though, I'd like to try this sort of system if it does exist. I'd just like to see what the state of the art is capable of.
Even that aside, though, I can see this being useful especially since Google Search is increasingly unusable.
I do worry, though. If these technologies get better, it's probably going to make a lot of engineers struggle to develop deep problem-solving skills, since you will need them a lot less to get started. Learning to RTFM, dig into code and generally do research is valuable stuff. Having a bot you can use as an infinite lazyweb may not be the greatest thing.
Makes perfect sense why it couldn't answer your question: you didn't have the vocabulary of relational algebra to correctly prime the model. Any rudimentary field has its own corpus of vocabulary to express ideas and concepts specific to that domain.
I honestly can't tell if this is a sarcastic reply or not
I'm not sure it was even human.
Half of the work is specification and iteration. I think there’s a focus on full SWE replacement because it’s sensational, but we’ll more end up with SWE able to focus on the less patterned or ambiguous work and made way more productive with the LLM handling subtasks more efficiently. I don’t see how full SWE replacement can happen unless non-SWE people using LLMs become technical enough to get what they need out of them, in which case they probably have just become SWE anyway.
Yeah, I tried Copilot for the first time the other day and it seemed to be able to handle this approach fairly well -- I had to refine the details, but none of it was because of hallucinations or anything like that. I didn't give it a chance to try to handle the high-level objective, but based on past experience, it would have done something pointlessly overwrought at best.
Also, as an aside, re the "not a real programmer" salt: if we suppose, as I've been led to believe, that the "true essence" of programming is the ability to granularize instructions and conceptualize data flow like this, and if LLMs remain unsuitable for doing such tasks reliably unless the user can do so, this would seem to undermine the idea that someone who uses LLMs is only pretending to be a programmer.
Anyway, I used Copilot in VSCode to "Fix" this "code" (it advised me that I should "fix" my "code" by . . . implementing it, and then helpfully provided a complete example):
# Take a URL from stdin (prompt)
# If the URL contains "www.reddit.com", replace this substring with "old.reddit.com"
# Curl the URL and extract all links matching /https:\/\/monkeytype\.com\/profile\/[^>]+/ from the html;
# put them in a defaultdict as the first values;
# for each first value, the key is the username that appears in the nearest previous p.tagline > a.author
# For each first value, use Selenium to browse to the monkeytype.com/profile url;
# wait until 'div[class=\'pbsTime\'] div:nth-child(3) div:nth-child(1) div:nth-child(2)' is visible AND contains numbers;
# assign this value as the second value in the defaultdict
# Print the defaultdict as a json object
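(For reference, a rough sketch of what an implementation of that spec might look like; the regex and CSS selector are taken verbatim from the comments above, while requests/BeautifulSoup/Selenium/Firefox are my own assumptions about tooling, not necessarily what Copilot produced.)

    # Rough sketch of the spec above; selectors/regex come from the comments,
    # library choices (requests, BeautifulSoup, Selenium) are assumptions.
    import json
    import re
    from collections import defaultdict

    import requests
    from bs4 import BeautifulSoup
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    PROFILE_RE = re.compile(r"https://monkeytype\.com/profile/[^>]+")
    PBS_SELECTOR = "div[class='pbsTime'] div:nth-child(3) div:nth-child(1) div:nth-child(2)"

    url = input("URL: ").replace("www.reddit.com", "old.reddit.com")
    html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text
    soup = BeautifulSoup(html, "html.parser")

    results = defaultdict(list)  # username -> [profile_url, pbs_time]
    for link in soup.find_all("a", href=PROFILE_RE):
        tagline = link.find_previous("p", class_="tagline")
        author = tagline.find("a", class_="author") if tagline else None
        if author and not results[author.text]:
            results[author.text].append(link["href"])

    driver = webdriver.Firefox()
    for username, values in results.items():
        driver.get(values[0])
        # wait until the element is visible AND its text contains a digit
        WebDriverWait(driver, 30).until(
            lambda d: d.find_element(By.CSS_SELECTOR, PBS_SELECTOR).is_displayed()
            and re.search(r"\d", d.find_element(By.CSS_SELECTOR, PBS_SELECTOR).text)
        )
        values.append(driver.find_element(By.CSS_SELECTOR, PBS_SELECTOR).text)
    driver.quit()

    print(json.dumps(results, indent=2))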
> unless non-SWE people using LLMs become technical enough to get what they need out of them
Non-SWE person here. In the past year I've been able to use LLMs to do several tasks for which I previously would have paid a freelancer on Fiverr.
The most complex one, done last spring, involved writing a Python program that I ran on Google Colab to grab the OCR transcriptions of dozens of 19th-century books off the Internet Archive, send the transcriptions to Gemini 1.5, and collect Gemini's five-paragraph summary of each book.
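(The shape of such a script is roughly this; a minimal sketch, where the Archive identifiers, the "_djvu.txt" OCR file convention, and the google-generativeai client/model name are my assumptions about how it might have been wired up, not the commenter's actual code.)

    # Minimal sketch: fetch OCR text for each Internet Archive item and ask
    # Gemini for a five-paragraph summary. Identifiers, the "_djvu.txt" path
    # convention, and the model name are assumptions/placeholders.
    import requests
    import google.generativeai as genai

    genai.configure(api_key="YOUR_API_KEY")
    model = genai.GenerativeModel("gemini-1.5-flash")

    identifiers = ["some-19th-century-book-id"]  # one Archive identifier per book

    for item in identifiers:
        ocr_url = f"https://archive.org/download/{item}/{item}_djvu.txt"
        text = requests.get(ocr_url).text
        response = model.generate_content(
            "Summarize the following book in five paragraphs:\n\n" + text
        )
        print(f"=== {item} ===\n{response.text}\n")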
If I had posted the job to Fiverr, I would have been willing to pay several hundred dollars for it. Instead, I was able to do it all myself with no knowledge of Python or previous experience with Google Colab. All it cost was my subscription to ChatGPT Plus (which I would have had anyway) and a few dollars of API usage.
I didn't put any full-time SWEs out of work, but I did take one job away from a Fiverr freelancer.
> I didn't put any full-time SWEs out of work, but I did take one job away from a Fiverr freelancer.
Who would use an LLM anyway these days. It will be interesting when Fiverr adds non-human freelancers, something similar to algorithmic traders. Passive income.
> I didn't put any full-time SWEs out of work, but I did take one job away from a Fiverr freelancer.
I think this is the nuance most miss when they think about how AI models will displace work.
Most seem to think “if it can’t fully replace a SWE then it’s not going to happen”
When in reality, it starts by lowering the threshold for someone who's technical but not a SWE to jump in and do the work themselves. Or it makes the job of an existing engineer more efficient. Each hour of work saved, spread across many tasks that would have otherwise gone to an engineer, eventually sums up to a full-time engineer's worth of work. If it's a Fiverr dev whose work you eliminated, that means the Fiverr dev will eventually go after the work that's remaining, putting supply pressure on other devs.
It's the same mistake many made about self-driving cars not happening because they couldn't handle every road. No, they just need to start with one road, master that, and then keep expanding to more roads. Until they can do all of SF, and then more and more cities.
This is a good anecdote but most software engineering is not scripting. It’s getting waist (or neck) deep in a large codebase and many intricacies.
That being said, I'm very bullish on AI being able to handle more and more of this very soon. Cursor definitely does a great job giving us a taste of cross-codebase understanding.
Seconded. Zed makes it trivial to provide entire codebases as context to Claude 3.5 Sonnet. That particular model has felt as good as a junior developer when given small, focused tasks. A year ago, I wouldn’t have imagined that my current use of LLMs was even possible.
Not sure about Claude, but my main problem with o3-mini is that it 'forgets' things which are supposed to fit in the context window. This results in it using different function names and data structures. I think it's guessing them instead of fetching them from the previous records.
If the goal is to get something to run correctly roughly once with some known data or input, then that's fine. Actual software development aims to run under 100% of circumstances, and LLMs are essentially cargo culting the development process and entrusting an automation that is unreliable to do mundane tasks. Sadly the quality of software will keep going down, perhaps even faster.
If the LLM can't find me a solution in 3 to 5 tries while I improve the prompt, I fall back to more traditional methods and/or use another model like Gemini.
Everyone is a typist now, so I don't think it is farfetched that everyone is a SWE in the future.
Very few people are typists.
Most people can use a keyboard, but the majority of non-technical people type at a speed which is orders of magnitude less than that of a professional typist.
Another comment here mentions how they used Colab while not being a SWE, but that is already miles ahead of what average people do with computers.
There are people who have used computers for decades and wouldn't be able to do a sum in a spreadsheet, nor know that is something spreadsheets can do.
What’s the WPM cutoff to be considered a typist?
In the narrowest version of the definition:
> The Registered Skilled Reporter (RSR) is NCRA's new designation that will recognize those stenographic professionals who are looking to validate their beginning level of competency.
> You have to pass three five-minute Skills Tests (SKT), which evaluate your skills level in three areas: Literary at 160 wpm, Jury Charge at 180 wpm, Testimony/Q&A at 200 wpm.
https://www.ncra.org/certification/NCRA-Certifications/regis...
Stenography is a little different to regular typing isn't it?
> in which case they probably have just become SWE anyway
or learn to use something like Bubble
> OpenAI researchers have admitted that even the most advanced AI models still are no match for human coders — even though CEO Sam Altman insists they will be able to beat "low-level" software engineers by the end of this year
This is the “self-driving cars next year, definitely” of the 20s, at this point.
Not sure what they found. Either the models are unable to solve the tasks, or the researchers were unable to solve them using the models. Looks like they used straight questions and not chain-of-thought. The result for the same model depends on how you ask. The tasks probably required more thinking under the hood than the model is allowed to do in one request. More interesting would be whether the model is capable of solving them given enough time, using multiple requests orchestrated automatically by some framework.
This has been obvious for a couple of years to anyone in the industry who has been faced with an onslaught of PRs to review from AI-enabled coders who sometimes can't even explain the changes being made at all. Great job calling it AI.
Well, OpenAI does currently have 288 job openings, including plenty of software engineers, so that says something.
> even though CEO Sam Altman insists they will be able to beat "low-level" software engineers by the end of this year.
"low/high level" starts to lose its meaning to me because it gets used in opposite ways
Where are the low level CEOs vs high level CEOs?
I'll bet AI could do their jobs right now.
Can SOMEONE please write AI software to replace these people?
Management consultants. Main attributes are confidence and ability to generate content. No need to stick around to see it through.
Are you bothered by the fact that software engineers might be easier to automate?
Is that a fact? I mean, see the linked article; even the company whose whole business model lies in convincing people that that _is_ a fact is kinda saying “yeah, perhaps not”, with vague promises of jam tomorrow.
Considering that there are chickens who outperform stockbrokers, no.
It's the opposite. An LLM is better at CEO stuff than at writing working code. A good developer + LLM instead of a CEO can succeed. A good CEO + LLM instead of a developer cannot succeed. (For a tech company.)
Even better, if you click through to the linked source he doesn't say "low-level" at all, or make any claim that is at all like the claim he is cited as making!
Yeah, low-level language gets conflated with low-level coders; it means the opposite in some sense.
I find the framing of this story quite frustrating.
The purpose of new benchmarks is to gather tasks that today's LLMs can't solve comprehensively.
If an AI lab built a benchmark that their models scored 100% on, they would have been wasting everyone's time!
Writing a story that effectively says "ha ha ha, look at OpenAI's models failing to beat the new benchmark they created!" is a complete misunderstanding of the research.
AI doesn't "solve" problems; the best it can do is remember them. Ask it to solve anything new that's challenging and it starts to hallucinate. At least currently.
And I'm ashamed that OpenAI and Sam Altman are walking around talking about AGI. And I'm so... disillusioned by the entire tech community that they have fallen for it, or at least pretend to believe it. It's like LinkedIn, where everybody pretends to be cringe-positivity people even though they know it's cringe and nobody believes it.
I wonder how many of the solutions that pass the SWE-Lancer evals would not be accepted by the poster due to low quality.
I've been trying so many things to automate solving bugs and adding features 100% by AI, and I have to admit it's been a failure. Without someone who can read the code, fully understand the AI-generated code, and suggest improvements (a SWE in the loop), AI code is mostly not good.
So this is an in-house benchmark, after their undisclosed partnership with a previous benchmark company. I really hope their next model doesn't vastly outperform on this benchmark in the coming weeks.
They should feed it bootcamp study materials and Cracking the Coding Interview book in order to improve its ability to code.
If it can master Binary Search Trees, it can master anything.
"If you need to improve speed, add Hash Tables."
> The models weren't allowed to access the internet
How many software developers could solve even simple programming problems (beyond 'Hello world') zero-shot style (you write in Notepad, then can compile only once and execute once) without access to the internet (Stack Overflow, Google search, documentation) or tools (terminal, debugger, linter, CLI)?
I don't think it's the best comparison for making any judgement, then. Future benchmarks should test agents that are allowed to solve the problem in 5-10 minutes, with access to the internet, documentation, a linter, and a terminal with MCP servers.
What would searching the Internet provide the models that they don’t already have? Most likely data sources such as stack overflow, documentation on the language it’s targeting, and a variety of relevant forum posts are already part of its training set.
Unless someone else came along and said “here’s how to solve x problem step by step”, I don’t see how additional information past its cutoff point would help. (Perhaps the AI could post on a forum and wait for an answer?)
Yes, iterative programming could help via access to tools- I can see that helping.
Why do programmers search for specific questions rather than always relying on their inherent knowledge?
I’m a crappy hobbyist programmer but for me it is useful to see if someone has implemented exactly what I need, or debugged the problem I’m having. I don’t think it’s reasonable to expect programmers or LLMs to know everything about every library’s use in every context just from first principles.
I do it to save the limited brain power I have before rest or food is required. You could spend 5 minutes writing a sort (at a high level processing) or just use existing code which might take 5 minutes to find but uses less brain power.
This allows you to use that brain power on specific things that need you and let google remember the format of that specific command or let an ai write out your routing file.
The older I get the less I'm bound by time, lack of knowledge or scope but more limited by clarity. Delegate tasks where possible and keep the clarity for the overall project and your position.
But why would that information not be included in the wide crawl already encoded in the model weights before the knowledge cutoff? I believe the article mentions frontier models so we are talking about models trained on trillions of tokens here
Because the cutoff can be a few months ago, and you still have new versions of libraries being developed every month: APIs getting deprecated or removed, or new APIs being added. The model needs access to the latest API or SDK that is available and needs to know, e.g., what iOS SDK and what macOS version you currently have. Having access to GitHub issues also helps to figure out whether there is a bug in a library.
> How many software developers could solve even simple programming problems (beyond 'Hello world') zero-shot style (you write in Notepad, then can compile only once and execute once) without access to the internet (Stack Overflow, Google search, documentation) or tools (terminal, debugger, linter, CLI)?
Many; there was a time when SO did not exist and people were able to solve non-trivial problems. There was a time when coding problems on exams had to be solved on paper, and if they did not compile you would not pass.
You miss my point about zero-shot style, where you have only one shot to compile and execute your code. Even in the old times, when people programmed using punched cards, it required a lot of reviews and iterations. This is the reason scripting languages like Python, Ruby, PHP, and JavaScript got popular: you had a very fast feedback loop and could do dozens of mini experiments. The majority of coding problems we have today are not algorithmic in nature.
I had one shot at my exams, was writing them on paper, compiling code in my brain.
It depends a lot on the type of problem. If we're talking about fixing a bug or adding a new feature to a large existing code base, which probably describes a huge portion of professional software engineering work, I would say most engineers could do most of those tasks without the internet. Especially if the goal is to simply pass a benchmark test of getting it working without future considerations.
You sound like someone who never used punch cards.
I think most developers could do that if they trained for it. As someone who learned how to program before the internet, it's just a different mindset and would take some time to adjust.
I am doing that now where changes take a day to make it to staging and no local environment. You roll with it.
> You sound like someone who never used punch cards.
I hope HN never changes.
I think about this a lot. AI in its current state is like working with an intern who is stranded on an island with no internet access or compiler; they have to write down all of the code in forward sequence on a piece of paper, and god help them if they have to write any UI while also being blind. None of the "build an app with AI start-to-finish" products work well at all because of this.
AI models are trained on the data from the internet, so sure, they couldn't do their search feature to scour the internet, but I doubt the material is much different than what the models were already trained on.
Additionally, before the age of stackoverflow and google, SWEs cracked open the book or documentation for whatever technology they were using.
Isn't this how interviews tend to work? So I think a good number of devs would, yes.
Interviews like LeetCode on a whiteboard only test your reasoning, not whether your solution will execute out of the box zero-shot style. Humans solve problems in an iterative way; that's why a fast feedback loop and access to tools are essential. When you start coding, the compiler or linter hints that you forgot to close some braces or missed a semicolon. The compiler tips you off that an API changed in a new version; IntelliSense hints at what methods you can use in the current context, what parameters you can use, and their types. Once you execute the program, you get runtime tips that maybe you missed installing some Node or Python package. When you install packages, you get hints that maybe one package has an additional dependency or two package versions are not compatible. Command-line tools like `ls` tell you what the project structure is, etc.
As one who organises competitive programming contests on a regular basis for university students, I would say almost every single one.
I used to be able to. The web's made me lazy. I was better before it and I'm better when I don't use it.
Really, the stuff you think helps you is often just holding you back.
These models are held to higher standards than humans. They should be able to solve any coding problem with just the documentation.
Isn't the point of the training that they already have all the information they could have? So they do not need the internet, as on the internet there would only be information they already "know"...
I barely ever look at StackOverflow as the quality of answers there is so poor. It was once good but the proliferation of duplicates[1] has really ruined it for me, as well as outdated answers not being replaced. Google search results are also crap.
I agree with your point, though. The "LLM" model just isn't a good fit for some tasks, in fact many tasks. It is good for creative writing, but even then only really because our standards for creative writing are pretty low. It doesn't write with any real creativity or flair in the writing. It can make things up and stay on topic. It is poor for anything where accuracy matters. It can't edit what it produces! Nobody writes things in one shot in reality, not even creative writing, but especially not code or technical writing. It needs to be able to do a whole suite of other things: move blocks of output around, rewrite chunks, expand chunks, condense chunks, check chunks against external sources or proper knowledge banks, compare chunks for internal consistency, and more. That is how we operate: at the level of functions or blocks of code, at the level of paragraphs and sentences and sections.
[1]: Yes, the opposite of the problem people here usually have with it, which is things being closed as duplicates. I think more duplicates should be deleted and redirected to a canonical answer, which is then a focus of improvement. Too often google searches give me barely answered or unanswered duplicates and I have to click around in the site to find the result Google clearly should have given me in the first place (better keyword matches, not closed, higher score, etc). I think StackOverflow do this intentionally so people have to click on more pages and see more ads.
> How many software developers could solve even simple programming problems (beyond 'Hello world') zero-shot style (you write in Notepad, then can compile only once and execute once) without access to the internet (Stack Overflow, Google search, documentation) or tools (terminal, debugger, linter, CLI)?
The experienced ones can
Link to the original paper: https://arxiv.org/pdf/2502.12115
TL;DR:
They tested with programming tasks and manager's tasks.
The vast majority of tasks given require bugfixes.
Claude 3.5 Sonnet (the best performing LLM) passed 21.1% of programmer tasks and 47.0% of manager tasks.
The LLMs have a higher probability of passing the tests when they are given more attempts, but there's not a lot of data showing where the improvement tails off. (probably due to how expensive it is to run the tests)
Personally, I have other concerns:
- A human being asked to review repeated LLM attempts to resolve a problem is going to lead that human to review things less thoroughly after a few attempts and over time is going to let false positives slip through
- An LLM being asked to review repeated LLM attempts to resolve a problem is going to lead to the LLM convincing itself that it is correct with no regard for the reality of the situation.
- LLM use increases code churn in a code base
- Increased code churn is known to be bad for the health of projects
Coding, especially the type mentioned in the article (building an app based on a specification), is a highly complex task. It cannot be completed with a single prompt and an immediate, flawless result.
This is why even most software projects (built by humans) go through multiple iterations before they work perfectly.
We should consider a few things before asking, "Can AI code like humans?":
- How did AI learn to code? What structured curriculum was used?
- Did AI receive mentoring from an experienced senior who has solved real-life issues that the AI hasn't encountered yet?
- Did the AI learn through hands-on coding or just by reading Stack Overflow?
If we want to model AI as being on par with (or even superior to) human intelligence, don’t we at least need to consider how humans learn these complex skills?
Right now, it's akin to giving a human thousands of coding books to "read" and "understand," but offering no opportunity to test their programs on a computer. That’s essentially what's happening!
Without doing that, I don't think we'll ever be able to determine whether the limitation of current AI is due to its "low intelligence" or because it hasn’t been given a proper opportunity to learn.
LLMs can fundamentally only do something similar to learning in the training phase. So by the time you interact with it, it has learned all it can. The question we then care about is whether it has learned enough to be useful for problem X. There's no meaningful concept of "how intelligent" the system is beyond what it has learned, no abstract IQ test decoupled from base knowledge you could even conceive of.
> How did AI learn to code?
It didn't; it's just very good at copying already existing code and tweaking it a bit.
> Did AI receive mentoring from an experienced senior
It doesn't even comprehend what an experienced senior is; all it cares about is how frequently certain patterns occurred in certain circumstances.
> Did the AI learn through hands-on coding or just by reading Stack Overflow?
It "learnt" by collecting a large database of existing code, most of which is very low-quality open source proofs of concept, and then it spits out the bits that are probably related to a question.
I think we're drastically oversimplifying what "pattern matching" means. It is also one of the fundamental mechanisms by which the human brain operates. I believe we are consciously (or perhaps subconsciously) conditioned to think that human "logic" and "reasoning" are several degrees more advanced than pattern matching. However, I don't think this is true.
The fundamental difference lies in how patterns are formed in each case. For LLMs, all they know are the patterns they observe in "words" - that is the only "sense" they possess. But for humans, pattern recognition involves continuously ingesting and identifying patterns across our five primary senses—not just separately, but simultaneously.
For example, when an LLM describes something as "ball-shaped," it cannot feel the shape of a ball because it lacks another sense to associate with the word "ball-shaped." In contrast, humans have the sense of touch, allowing them to associate the word or sound pattern "ball" with the physical sensation of holding a ball.
> It is also one of the fundamental mechanisms by which the human brain operates.
One of the fundamental mechanisms by which brains operate: the bits we share with every other animal with a brain.
Good luck teaching your dog to code.
Being great at fetching your newspaper in the morning doesn't mean it's going to wake up and write you an accounting software package at the end of the year.
LLMs will never solve this problem; they are basically just glorified copy-and-paste engines, and solving real code problems requires invention, even for the most basic tasks. The best they will manage in their current direction is to reason that they don't have the capability or capacity to actually solve the problem, rather than just getting it wrong the vast majority of the time.
I believe it. I couldn't even get o1 or claude 3.5 to write a tampermonkey script that would turn off auto-scroll to bottom in LibreChat, even when uploading the html and javascript as context.
Apparently it has to do with overflow anchor or something in React? Idk. I gave up.
Unless it works by literally scrolling with JS, I bet some strategic bit of CSS should do it...
I prompted up a very basic Flask scaffold via Windsurf and once it reached a certain code size, it just started to remove or weirdly rewrite old parts to handle the context. ("You're right let's move that back in"). Didn't end well.
It's so much easier to learn from examples than from documentation, in my opinion; documentation is what I use when I want to know additional parameters or downsides of a piece of functionality. I'm no coder, though.
What I'd like LLMs to do is present examples using acceptable design standards, e.g. what's the pythonic way to do this, what are exceptions that might yield better performance/optimization (and at what cost), or what is the best go(lang) JSON parser (since the built-in isn't very good).
But instead, I get average to below-average examples (surprise surprise, this is what happens when you train on a high noise-to-signal set of data), which are either subtly or wildly incorrect. I can't see this improving, with Reddit and other forums trying to introduce AI-bot-written posts. Surely these companies are aware of how LLM output degenerates when fed its own input within a few (not even a dozen) generations?!
Interesting that Claude wins despite the other models being more expensive and doing much better in the traditional benchmarks.
Despite the lackluster coding performance, AI has PROVEN it's able to provide a rationale for profit-taking job cuts, layoffs, reduced stock grants, and increased executive bonuses.
So it's not ALL bad news.
I don't understand why anyone would even think that the current crop of AI is capable of planning or reasoning on its own.
Transformers with memory would be different story.
But, no memory, no capability to reason. End of story, right?
This mirrors what I've seen. I've found that LLMs are most helpful in places where I have the most experience.
Maybe this is because of explicitness in the prompt and preempting edge cases. Maybe it's because I know exactly what should be done. In these cases, I will still sometimes be surprised by a more complete answer than I was envisioning, or a few edge cases that weren't front of mind.
But if I have _no_ idea things go wildly off course. I was doing some tricky frontend work with dynamically placed reactflow nodes and bezier curve edges. It took me easily 6 hours of bashing my head against the problem, and it was hard to stop using the assistant because of sunk cost. But I probably would have gotten more out of it and been faster if I'd just sat down and really broken down the problem for a few hours and then moved to implement.
The most tempting part of LLMs is letting them figure out design when you're in a time crunch. And the way it solves things when you understand the domain and the bottoms-up view of the work is deceptive in terms of capability.
And in this case, it's hoping that people on Upwork understand their problems deeply. If they did, they probably wouldn't be posting on Upwork. That's what they're trying to pay for.
I just had this conversation with a customer, and it's hard to avoid anthropomorphizing AI. Once you equate the AI system with a human, you apply human heuristics: a human who creates perfectly PEP 8-formatted Python is probably a decent Python programmer, whereas someone who bangs out some barely readable code with mixed spacing and variable naming styles is most likely a novice.
We use these signals to indicate how much we should trust the code - same with written text. Poorly constructed sentences? Gaps or pauses? Maybe that person isn’t as knowledgeable.
These shortcuts fail miserably on a system that generates perfect grammar, so when you bring your stereotypes gleaned from dealing with humans into the ai world, you’re in for an unpleasant surprise when you unpack the info and find it’s only about 75% correct, despite the impeccable grammar.
> But if I have _no_ idea things go wildly off course.
This is the key to getting some amount of productivity from LLMs in my experience, the ability to spot very quickly when they veer off course into fantasyland and nip it in the bud.
Then you point out the issue to them, they agree that they made a dumb mistake and fix it, then you ask them to build on what you just agreed to and they go and reintroduce the same issue they just agreed with you was an obvious problem... because ultimately they are more fancy auto complete machines than they are actual thinking machines.
I have found them to be a time saver on the whole even when working with new languages but I think this may in large part be helped by the fact that I have literally decades of coding experience that sets off my spidey senses as soon as they start going rampant.
I can't begin to imagine how comical it must be when someone who doesn't have a strong programming foundation just blindly trusts these things to produce useful code until the runtime or compile time bugs become unavoidably obvious.
I believe the outcome of this type of article is actually positive. The ‘SWE-Lancer’ benchmark provides visibility into a more pragmatic assessment of LLM capabilities.
Ironically, it actually refutes Altman's claims mentioned in the same article. Hard to replace engineers when you create a benchmark you can't score decently on.
Or it could be a case of: never prepare a benchmark or comparison that you think you won't succeed at. This is especially true when you are funded mostly by private/VC investors. Time will tell.
I think they are trying to frame the narrative and then succeed at it. Let's see. This helps justify OpenAI's valuation and efforts to investors/VCs. After all, IMO, without coding as a use case for LLMs, AI wouldn't have nearly the same hype/buzz as it does now. Greed (profit) and fear (losing jobs) are great motivators to keep investment hype and funds coming in.
The benchmark for AI models to assess their 'coding' ability should be on actual real-world, production-grade repositories and on fixing bugs in them, such as the Linux kernel, Firefox, SQLite, or other large-scale, well-known repositories.
Not these HackerRank, LeetCode, or previous IOI and IMO problems that we already have the solutions to, where the model is just reproducing the most optimal solution copied from someone else.
If it can't manage most unseen coding problems with no previous solutions to them, what hope does it have against explaining and fixing bugs correctly on very complex repositories with over 1M-10M+ lines of code?
The new SWE-Lancer benchmark is on actual problems, and that is where it is failing by a huge margin.
Previously on source: https://news.ycombinator.com/item?id=43086347
It solved a lot of mine.
I’ve got 15 years of coding experience at some of the biggest tech companies. My personal opinion is that most people have no clue how good these AI coding systems already are. If you use something like RepoPrompt, where you selectively choose which files to include in the prompt, and then also provide a clear description of what changes you want to make—along with a significant portion of the source code—a model like O1Pro will nail the solution the first time.
The real issue is that people are not providing proper context to the models. Take any random coding library you’re interfacing with, like a Postgres database connection client. The LLM isn’t going to inherently know all of the different configurations and nuances of that client. However, if you pass in the source code for the client along with the relevant portions of your own codebase, you’re equipping the model with the exact information it needs.
Every time you do this, including a large prompt size—maybe 50,000 to 100,000 tokens—you dramatically improve the model’s ability to generate an accurate and useful response. With a strong model like O1Pro, the results can be exceptional. The key isn’t that these models are incapable; it’s that users aren’t feeding them the right data.
Are you suggesting that OpenAI published a paper assessing their own models on real-world problems, but failed to properly use their own models? And/or that you know better than OpenAI scientists how to use OpenAI models most effectively?
the limiting factor is no longer the answers but the questions
I saw this
still, chain of thought is great for LeetCode 75
Since interviewers “want to see how you think” (and get the right answer in less time than other candidates on average)
I can now see how you’re supposed to think (and get the right answer in less time than other candidates on average, for now)
... so far
The models were restricted from accessing the internet and forced to develop their own solutions internally.
I think researchers will find that human coders are unable to solve most coding problems without access to the internet.
What kinds of problems are you talking about? There are problems that require you to constantly learn new libraries and services, accessing documentation, and there are actual software problems where you have to reflect on your own large code base. I work on both kinds of problems, and in the first case, the models are actually well versed in, say, all of the CloudFormation syntax that I would have to look up. On the opposite end, I have written many features on trips, unable to be distracted by the internet, just me and the code, and being able to read library source code.
The fact is, programming requires abstract modeling that language models aren’t demonstrating the capability of fully replicating. At least, not that we can see, yet.
If you're requiring human coders to write valid, compilable code, maybe. But if you're doing that, you're doing coding interviews wrong.
Any interviews I've run or been a part of have required the interviewee to demonstrate their problem-solving skills using pseudo-code.
A decent human programmer with experience in a particular domain may rely on internet access to look up API documentation and other generic references, but if you read the paper, you'll see that the AI systems tested suffered from more basic deficiencies in approach and reasoning ('3.6. Discussion', starting on page 7).
And you created the internet then?
Damn. How did they write code before 1992?
We did have usenet in the 80s and gopher in the 90s. But yes, in those days it was that mythical "paper" stuff (or were we still using papyrus? I forget)
Seriously? I think I've been most productive in pre-internet days, when stuck on a transatlantic flight with a Java (shudder, talk about PTSD) reference manual and a laptop that could barely last 2 hours on battery, with emphasis on the measure-twice-cut-once mentality.
It's painful to watch junior coders copy-n-paste from SO or W3Schools (including code samples clearly labelled not-for-production) with little effort to understand what they are doing.