> there’s one depressing anecdote that I keep on seeing: the junior engineer, empowered by some class of LLM tool, who deposits giant, untested PRs on their coworkers—or open source maintainers—and expects the “code review” process to handle the rest.
It's even worse than that: non-junior devs are doing it as well.
> It's even worse than that: non-junior devs are doing it as well.
This might be unpopular, but that seems more like an opportunity if we want to continue allowing AI to generate code.
One of the annoying things engineers have to deal with is stopping whatever they're doing to do a review. Obviously this gets worse as more total code is produced.
We could eliminate that interruption by having someone do more thorough code reviews, full-time. Someone who is not bound by sprint deadlines and tempted to gloss over reviews to get back to their own work. Someone who has time to pull down the branch, actually run the code, and lightly test things from an engineer's perspective so QA doesn't hit super obvious issues. They can also be the gatekeeper for code quality and PR quality.
> One of the annoying things engineers have to deal with is stopping whatever they're doing and doing a review.
I would have thought that reviewing PRs, and doing it well, is in the job description. You later mention "someone" a few times - who might that someone be?
This mirrors my experience with the texting while driving problem: The debate started as angry complaints about how all the kids are texting while driving. Yet it’s a common problem for people of all ages. The worst offender I knew for years was in her 50s, but even she would get angry about the youths texting while driving.
Pretending it’s just the kids and young people doing the bad thing makes the outrage easier to sell to adults.
the US media has been doing this with Black people to great effect for hundreds of years. (the tacitly hateful) Democrats are as guilty of it as (the openly hateful) Republicans.
Just shove a code review agent in the middle. Problem solved
it's even worse than that! non-devs are doing it as well
That's what democratization looks like. And the new participants are happy!
it's even worse than that! bots are doing it as well
Hopefully once this AI nonsense blows over they'll reach the same realisation they did after the mid 2000s outsourcing craze: that actually you gotta pay for good engineering talent.
That's the goal. Through further training, whittle away at unnecessary states until only the electrical states that matter remain.
Developers have created too many layers of abstraction and indirection to do their jobs. We're burning a ton of energy managing state management frameworks that are many layers of indirection away from the computations that are salient to users.
All those DSLs, config syntaxes, and layers of boilerplate waste a huge amount of electricity, when end users just want to draw geometric shapes.
So a non-dev generates a mess, but in a way so do devs with Django, Elixir, RoR, and Terraform. When really, at the end of the day, it's matrix math against memory and syncing that state to the display.
From a hardware engineer's perspective, the mess of devs and non-devs is the same abstract mess of electrical states that have nothing to do with the goal. All those frameworks can be generalized into a handful of electrical patterns, saving a ton of electricity.
This sounds like the exact kind of profound pseudo-enlightenment that one gets from psychedelics. Of course, it's all electrons in the end.
Trying to create a secure, reliable and scalable system that enables many people to work on one code base, share their code around with others and at the end of the day coordinate this dance of electrons across multiple computers, that's where all of these 'useless' layers of abstraction become absolutely necessary.
Try almost 30 years in electrical engineering.
I know exactly what those layers of abstraction are used for. Why so many? Jobs making layers of abstraction.
But all of them are dev-friendly means of modeling memory states for the CPU to watch and transform just so. They can all be compressed into a generic and generalized set of mathematical functions, ridding ourselves of the various parser rules needed to manage each bespoke syntax inherent to each DSL and layer of framework.
And here I thought people just used computers for the heat
What process/path did you take to get to such an enlightened state? Books, experience, or anything else you can share about this, please?
Bachelor's in electrical engineering, master's in math; elastic structures applied to modeling electrical systems.
Started my career in the late 90s designing boards for telecom companies' network backbones.
There are some contradictory claims here.
Boilerplate comes when your language doesn't have affordances; you get around this with abstraction, which leads to DSLs (Domain-Specific Languages).
Matrix math is generally done on more than raw bits provided by digital circuits. Simple things like numbers require some amount of abstraction and indirection (pointers to memory addresses that begin arrays).
My point is yes, we've gotten ourselves in a complicated tar pit, but it's not because there wasn't a simpler solution lower in the stack.
In the company I’m at this is beginning to happen. PMs want to “prototype” new features and expect the engineers to finish up the work, with the expectation that it ‘just needs some polishing’. What would be your recommendation on how to handle this constructively? Flat-out rejecting LLMs as a prototyping tool is not an option.
This could be workable with the understanding that throwing away 100% of the prototype code is acceptable and its primary purpose is as a communication tool, not a technical starting point.
sounds like a culture and management problem. CTO should set clear expectations for his staff and discuss with product to ensure there is alignment.
If I were CTO I would not be happy to hear my engineers are spending lots of time re-writing and testing code written by product managers. Big nope.
Really good code reviewing AIs could handle this!
Code review is an unfunded mandate. It is something the company demands while not really doing anything to make sure people get rewarded for doing it.
> while not really doing anything to make sure people get rewarded for doing it.
I don’t know about you, but I get paychecks twice a month for doing things included in my job description.
My manager asked me to disable CI and gating code owner reviews “for 2 weeks” 6 months ago so people could commit faster. Just because it is in your job description doesn’t mean it won’t get shoved aside when it’s perceived as the bottleneck for the core mission.
Now we have nightly builds that nobody checks the result of and we’re finding out about bugs weeks later. Big company btw
For what it's worth, writing good PRs applies in more cases than just AI generated contributions. In my PR descriptions, I usually start by describing how things currently work, then a summary of what needs to change, and why. Then I go on to describe what exactly is changing with the PR. This high level summary serves to educate the reviewer, and acts as a historical record in the git log for the benefit of those who come after you.
From there, I include explicit steps for how to test, including manual testing, and unit test/E2E test commands. If it's something visual, I try to include at least a screenshot, or sometimes even a brief screen capture demonstrating the feature.
Really go out of your way to make the reviewer's life easier. One benefit of doing all of this is that in most cases, the reviewer won't need to reach out to ask simple questions. This also helps to enable more asynchronous workflows, or distributed teams in different time zones.
Also, take a moment to review your own change before asking someone else to. You can save them the trouble of finding your typos or that test logging that you meant to remove before pushing.
To be fair, copilot review is actually alright at catching these sorts of things. It remains a nice courtesy to extend to your reviewer.
100%. There's no difference at all in my mind between an AI-assisted PR and a regular PR: in both cases they should include proof that the change works and that the author has put the work in to test it.
At the last company I worked at (Large popular tech company) it took an act of the CTO to get engineers to simply attach a JIRA Ticket to the PR they were working on so we could track it for tax purposes.
The Devs went in kicking and screaming. As an SRE it seemed like for SDEs, writing a description of the change, explaining the problem the code is solving, testing methodology, etc is harder than actually coding. Ironically AI is proving that this theory was right all along.
I often write PR descriptions, in which I write a short explanation and try to anticipate some comments I might get. Well, every time I do, I will still get those exact comments because nobody bothers reading the description.
Not to say you shouldn't write descriptions, I will keep doing it because it's my job. But a lot of people just don't care enough or are too distracted to read them.
For many of my PR and issue comments the intended audience is myself. I find them useful even a few days later, and they become invaluable months or years later when I'm trying to understand why the code is how it is.
After I accepted that, I then tried to preempt the comment by just commenting myself on the function/class etc that I thought might need some elaboration...
Well, I'm sure you can guess what happened after that - within the same file even
At my place nobody reads my descriptions because nobody writes them so they assume there isn't one!
I just point people to the description. no need to type things twice.
I do this too with our PR templates. They have the ticket/issue/story number, the description of the ask (you can copy pasta from ticket). Current state of affairs. Proposed changes. Post state of affairs. Mood gif.
This is how PRs should be, but rarely are (in my experience as a reviewer, ymmv, n=1). Keep on keepin' on.
> Your job is to deliver code you have proven to work.
Strong disagree here, your job is to deliver solutions that help the business solve a problem. In _most_ cases that means delivering code that you should be able to confidently prove satisfies the requirements like the OP mentioned, but I think this is an important nitpick distinction I didn't understand until later on in my career.
I’d go further and say while testing is necessary, it is not sufficient. You have to understand the code and convince yourself that it is logically correct under all relevant circumstances, by reasoning over the code.
Testing only “proves” correctness for the specific state, environment, configuration, and inputs the code was tested with. In practice that only tests a tiny portion of possible circumstances, and omits all kinds of edge and non-edge cases.
there’s one depressing anecdote that I keep on seeing: the junior engineer, empowered by some class of LLM tool, who deposits giant, untested PRs on their coworkers—or open source maintainers—and expects the “code review” process to handle the rest.
Is anyone else seeing this in their orgs? I'm not...
I am currently going through this with someone in our organization.
Unfortunately, this person is vibe coding completely, and even the PR process is painful:
* The coding agent reverts previously applied feedback
* The coding agent doesn't follow standards used throughout the code base
* The coding agent re-invents solutions that already exist
* PR feedback is responded to with agent output
* 50k-line PRs for changes that required 10-20 lines
* Lack of testing (there are some automated tests, but their validations are slim/lacking)
* Bad error handling/flow handling
Fire them?
I believe it is getting close to this. Things like this just take time, though, and when this person talks to management/leadership they talk about how much they are producing and how everyone is blocking their work. So it becomes challenging political maneuvering, depending on the ability of certain leadership to see through the BS.
(By my organization, I meant my company - this person doesn't report to me or in my tree).
I'm seeing a little bit of this. However, I will add that the primary culprits are engineers that were submitting low quality PRs before they had access to LLMs, they can just submit them faster now.
LLMs are tools that make mediocre devs 100x more "productive" and good devs 2x more productive
From my vantage I would argue LLMs make good devs around 0.65x as productive
I think on average a dev can be x percent more productive, but there is a best case and worst case scenario. Sometimes it's a shortcut to crank out a solution quickly, other times the LLM can spin you in circles and you lose the whole day in a loop where the LLM is fixing its own mistakes, and it would've been easier to just spend some time working it out yourself.
Good devs are still learning how to use LLMs, and so are willing to accept the 0.65x once in a while. Any complex tool will have a learning curve. Most tools improve over time. As such good devs either have found how to use LLMs to make them more productive (probably not 10x, but even 1.1x is something), or they try them again every few months to see if things are better.
you are bending over backwards to figure out how to put "1.1x" in your comment
the idea that LLMs make developers more productive is delusional.
Yep, that's why very accomplished, widely regarded developers like Mitchell Hashimoto and Antirez use them. They need to make programming more challenging to keep it fun.
developers or cult leaders
What's the ratio of people who do things the right way vs. not? I mean, is it a matter of giving them feedback to remind them what a "quality PR" is? Does that help?
It's roughly 1/10 that are causing issues. Not a huge deal but dealing with them inevitably takes up a couple hours a week. We also have a codebase that is shared with some other teams and our primary offenders are on one of those separate teams.
I think this is largely an issue that can be solved culturally within a team, we just unfortunately only have so much input on how other teams work. It doesn't help either when their manager doesn't seem to care about the feedback... Corporate politics are fun.
Yeah, I mean to get back to the original statement in the blog, this seems like less of a tech issue and more of a culture issue. The LLM enables the junior to do this once. It's the team culture that allows them to continue doing it.
LLMs have dramatically empowered sociopath software developers.
If you are sufficiently motivated to appear more "productive" than your coworkers, you can force them to review thousands of lines of incorrect AI slop code while you sit back and mess around with your chatbots.
Your coworkers no longer have enough time to work on their in-progress PRs, so you can dominate the development team in terms of LOC shipped.
Understand that sociopaths are skilled at navigating social and bureaucratic environments. A sociopath who ships the most LOC will get the promotion every single time.
Only if leadership lets them. Right now (anecdotally) a lot of “leaders” don’t understand the difference between AI generated and human generated work, and just look at loc as productivity so all incentives are on AI coding, but that will change.
It will never change. Managers will consider every stupid metric players push to sell their solutions, be it code coverage, extensive CI/CD pipelines with useless steps, or "productivity gains" from gen tools. The gen-tool euphoria is stupid and will cease to exist, but before this it was BDD, TDD, DDD, test before, test after, test your mocks, transpile to a different language and then ignore the output, code maturity, best practices, OOP, pants-on-head oriented programming... There is always something stupid on the horizon; this is certainly not the last stupid craze.
Voila:
https://github.com/WireGuard/wireguard-android/pull/82 https://github.com/WireGuard/wireguard-android/pull/80
In that first one, the double pasted AI retort in the last comment is pretty wild. In both of these, look at the actual "files changed" tab for the wtf.
A friend of mine is working for a small-ish startup (11 people) and he gets to work and sees the CTO push 10k loc changes straight to main at 3 am.
Probs fine when you are still in the exploration phase of a startup, scary once you get to some kind of stability
I feel like this becomes kind of unacceptable as soon as you take on your first developer employee. 10K LOC changes from the CTO is fine when it's only the CTO working on the project.
Hell, for my hobby projects, I try to keep individual commits under 50-100 lines of code.
Lol I worked at a startup where the CTO did this. The problem was that it was pure spaghetti code. It was so bad it kept me up at night, thinking about how to fix things. I left within 30 days
The CTO is ultimately responsible for the outcome and will be there at 4am to fix stuff.
Yes .. and no. Someone who does this will definitely make the staff clean up after them.
That's...idiotic.
LLMs are for idiots
I mean, I've vibe-coded a few useful single-file HTML tools, but checking in 10kloc at 3am into the production database...by the CTO...omg.
Yep. Remember, people not posting on this website are just grinding away at jobs where their individual output does not matter, and their entire motivation is to work JUST hard enough not to get fired. They don't get stock grants, extremely favorable stock options, or anything else; they get a salary and MAYBE a small bonus based on business factors they have little control over.
My eyes were opened wide when, two jobs ago, they said they would be blocking all personal web browsing from work computers. Multiple software devs were unhappy because they were using their work laptops for booking flights, dealing with their kids' school stuff, and other personal things. They did not have a personal computer at all.
Definitely seeing a bit of this, but it isn't constrained to junior devs. It's also pretty solvable by explaining to the person why it's not great, and just updating team norms.
I'm seeing it on some open source projects I maintain. Recently had 10 or so PRs come in. All very valid features - but from looking at them, not actually tested.
A friend of mine has a junior engineer who does this and then responds to questions like "Why did you do X?" with "I didn't, Claude did, I don't know why".
That would be an immediate reason for termination in my book.
Yes, if they can't debug + fix the reason the production system is down or not working correctly then they're not doing their job, imo.
Developers aren't hired to write code that's never run (at least in my opinion). We're also responsible for running the code/keeping it running.
no hate but i would try to fire someone for saying that
I thought we were not, but we had just been lucky. A sequence of events lately have shown that the struggle is real. This was not a junior developer though, but an experienced one. Experience does not equal skill, evidently.
I started seeing it from a particularly poor developer sometime last year. I was the only reviewer for him so I saw all of his PRs. He refused to stop despite my polite and then not so polite admonishments, and was soon fired for it.
I'm not either
But LLMs don't really perform well enough on our codebase to allow you to generate things that even appear to work. And I'm the most junior member of my team at 37 years of age, hired in 2019.
I really tried to follow the mandate from on high to use Copilot, but the Agent mode can't even write code that compiles with the tools available to it.
Luckily I hooked it up to gptel so I can at least ask it quick questions about big functions I don't want to read in emacs.
It isn't only junior engineers, though. It is a small number of people at all levels.
People do what they think they will be rewarded for. When you think your job is to write a lot of code, then LLMs are great. When you need quality code, you start to ask whether LLMs are better or not.
i left my last job because this was endemic
Quite a few FOSS maintainers have been speaking up about it.
first time we’d see this there would be a warning, second one is pink slip
It's been a struggle with a few teammates that we are trying to solve through explicit policy, feedback, and ultimately management action.
Yeah, a slice of this is technology related, but it's really a policy issue. It's probably easier to manage with a tighter team. Maybe I'm taking team size for granted.
Similar, at my last job. And the pushback was greater because super duper clever AI helped write it, who obviously knows more than any other senior engineer could know, so they were expecting an immediate PR approval and got all uppity when you tried to suggest changes.
Hah! I've been trying to push back on this sort of thought. The bot writes code for you, not you for the bot.
It's not a new phenomenon. Time was, people would copy-paste from blog posts with the same effect.
Always the same old tiring "this has always been possible before in some remotely similar fashion hence we should not criticise anything ever again" argument.
You could intuitively think it's just a difference of degree, but it's more akin to a difference of kind. Same for a nuke vs a spear, both are weapons, no one argues they're similar enough that we can treat them the same way
I would bet in most organizations you can find a copy-pasted version of the top SO answer for email regex in their language of choice, and if you chase down the original commit author they couldn't explain how it works.
I think it's impossible to actually write an email regex because addresses can have arbitrarily deeply nested escaping. I may have that wrong. I'd hope that regex would be .+@.+ and that's it (watch me get Cunninghammed because there is some valid address wherein those plusses should be stars).
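In Python terms, that permissive check is about as far as I'd go; a minimal sketch (the function name is mine, not from any library):

```python
import re

# Deliberately permissive: something, an "@", something. Anything stricter
# risks rejecting addresses that are technically valid (quoted local parts,
# plus signs, odd TLDs); the real validation is sending a confirmation email.
EMAIL_RE = re.compile(r"^.+@.+$")

def looks_like_email(value: str) -> bool:
    return bool(EMAIL_RE.match(value))

assert looks_like_email("user+tag@example.com")
assert looks_like_email('"odd but legal"@example.com')
assert not looks_like_email("no-at-sign-here")
```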
I used to do that in simpler days. I'd add a link to where I copied it from so we could reference it if there were problems. This was for relatively small projects with just a few people.
> I'd add a link to where I copied it from
LLMs can't do this.
Your code is unambiguously better than any LLM code if you can comment a link to the stackoverflow post you copied it from.
I don't see the problem with fentanyl given that people have been using caffeine forever.
Actually, it's much more specific than that. A company pays you not just to "write code", not just to "write code that works", but to write code that works in the real world. Not on your laptop. Not on some testing environment. But in the real world. It may work fine in a theoretical environment, but deflate like a popped balloon in production. This code has no value to the business; the company doesn't pay you to ship popped balloons.
Therefore you must verify it works as intended in the real world. This means not shipping code and hoping for the best, but checking that it actually does the right thing in production. And on top of that, you have to verify that it hasn't caused a regression in something else in production.
You could try to do that with tests, but tests aren't always feasible. Therefore it's important to design fail-safes into your code that ALERT YOU to unexpected or erroneous conditions. It needs to do more than just log an error to some logging system you never check - you must actually be notified of it, and you should consider it a flaw in your work, like a defective pair of Nikes on an assembly line. Some kind of plumbing must exist to take these error logs (or metrics, traces, whatever) and send it to you. Otherwise you end up producing a defective product, but never know it, because there's nothing in place to tell you its flaws.
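As a rough sketch of the kind of plumbing I mean (the webhook URL and function names here are placeholders, not any particular product's API):

```python
import json
import logging
import urllib.request

ALERT_WEBHOOK = "https://chat.example.com/hooks/alerts"  # placeholder endpoint

def alert_humans(message: str) -> None:
    """Push the error somewhere a person will actually see it,
    instead of letting it rot in a log nobody reads."""
    payload = json.dumps({"text": message}).encode()
    request = urllib.request.Request(
        ALERT_WEBHOOK, data=payload,
        headers={"Content-Type": "application/json"},
    )
    try:
        urllib.request.urlopen(request, timeout=5)
    except Exception:
        # The fail-safe itself must never take the app down.
        logging.exception("failed to deliver alert")

def charge_customer(order_id: str) -> None:
    try:
        ...  # the actual work goes here
    except Exception as exc:
        alert_humans(f"charge_customer failed for order {order_id}: {exc}")
        raise
```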
Every single day I run into somebody's broken webapp or mobile app. Not only do the authors have no idea (because they aren't looking for the errors), there is no way for me to even e-mail the devs to tell them. I try to go through customer support, a chat agent, anything, and even they don't have a way to send in bug reports. These devs are churning out defective products and have insulated themselves from the knowledge of their own failures.
I don't know if there's a word for this but this reads to me as like, software virtue signaling or software patronizing. It's bizarre to me to tell an engineer what their job is as a matter of fact and to claim a particular usage of a tool as mandated (a tool that no one really asked for, mind you), leveraging duty of all things.
I guess to me, it's either the case that LLMs are just another tool, in which case the already existing teachings of best practice should cover them (and therefore the tone and some content of this article is unnecessary) or they're something totally new, in which case maybe some of the already existing teachings apply, but maybe not because it's so different that the old incentives can't reasonably take hold. Maybe we should focus a little bit more attention on that.
The article mentions rudeness, shifting burdens, wasting people's time, dereliction. Really loaded stuff and not a framing that I find necessary. The average person is just trying to get by, not topple a social contract. For that, look upwards.
I've really seen both I suppose. A lot of devs don't take accountability / responsibility for their code, especially if they haven't done anything that actually got shipped and used, or in general haven't done much responsible adulting.
I agree with this, except it glosses over security. Your job is to deliver SECURE code that you have proven to work.
Manual and automatic testing are still both required, but you must explicitly ensure that security considerations are included in those tests.
The LLM doesn't care. Caring is YOUR job.
I think the problem is in what “proven” means. People that don’t know any better will just do that all with LLMs and still deliver the giant untested PRs but with some LLM written tests attached.
I vibe code a lot of stuff for myself, mostly for viewing data, when I don’t really need to care how it works. I’m coming around to the idea that outside of some specific circumstances where everyone has agreed they don’t need to care about or understand the code, team vibe coding is a bad practice.
If I’m paying an engineer, it’s for their work, unless explicitly agreed otherwise.
I think vibe coding is soon going to be seen the same way as “research” where you engage an offshore team (common e.g. in consulting) to give you a rundown on some topic and get back the first five google search results. Everyone knows how to do that, if it’s what they wanted they wouldn’t be hiring someone to do it.
That's why I emphasized the manual testing component as well. Attaching a screenshot or video of a feature working to your PR is a great way to prove that you've actually seen it work correctly - at least once, which is still a huge improvement over it not actually working at all.
Perhaps off-topic, but: "Testing doesn't show the absence of errors, it shows the presence of errors" Willison says we need to submit code we have proven to work but then argues for empirical testing, not actual correctness proofs.
If you can formally prove correctness then brilliant, go for it!
That's not something I've seen or been able to achieve in most of my professional work.
A lot of AI coding changes coding to more of a declarative practice.
Claude etc. work best with good tests that verify the system works. So in some ways the tests become the real artifact, rather than the code that does the thing. If you're responsible for the thing, then 90% of your responsibility moves to verifying behavior and giving agents feedback.
I think there are two other things missing: Security and Maintainability. Working code that can never be touched again by a developer, or that requires an excessive amount of time to maintain, is not part of a developer's job either.
Overall, this hits the nail on the head about not delivering broken code and providing automated tests. Thanks for putting your thoughts on paper.
Manual testing as the first step… not very productive imo.
Outside-in testing is great, but I typically do automated outside-in testing and only test manually at the end. The testing loop needs to be repeatable and fast; manual is too slow.
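For illustration, a sketch of the kind of automated outside-in test I mean, assuming pytest and the requests library; the base URL and endpoints are made up:

```python
import os
import requests  # assumed HTTP client; any client works

# Point the suite at whatever environment you want to exercise.
BASE_URL = os.environ.get("APP_BASE_URL", "http://localhost:8000")

def test_signup_flow_outside_in():
    """Drive the running service from the outside, the way a client would."""
    resp = requests.post(
        f"{BASE_URL}/signup",
        json={"email": "test@example.com", "password": "hunter2!"},
        timeout=10,
    )
    assert resp.status_code == 201
    user_id = resp.json()["id"]

    # The freshly created account should be immediately usable.
    resp = requests.get(f"{BASE_URL}/users/{user_id}", timeout=10)
    assert resp.status_code == 200
    assert resp.json()["email"] == "test@example.com"
```

Run it with pytest against a local instance or staging; manual clicking then becomes a final sanity pass rather than the main loop.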
Yeah that's fair, the manual testing doesn't have to sequentially go first - but it does have to get done.
I've lost count of the number of times I've skipped it because the automated test passed and then found there was some dumb but obvious bug that I missed, instantly exposed when I actually exercised the feature myself.
Would automated tests that produce a transcript of what they've done allow perusing that transcript to substitute for manual testing?
That sounds harder?
There's a lot of pedantry here trying to argue that there exists some feature which doesn't need to be "manually" tested, and I think the definition of "manual" can be pushed around a lot. Is running a program that prints "OK" a manual test or not? Is running the program and seeing that it now outputs "grue" rather than "bleen" manual? Does verifying the arithmetic against an Excel spreadsheet count?
There are programs that almost can't be tested manually, and programs that almost have to be tested manually. I remember when working on PIN pad integration we looked into getting a robot to push the buttons on the pad - for security reasons there's no way of injecting input automatically.
What really matters is getting as close to a realistic end user scenario as possible.
No. I've fallen for that trap in the past. Something inevitably catches you out in the end.
The value of manual tests is when you "see something" that you didn't even think of.
Maybe a bit pedantic, but does manual testing really need to be done, or is the intent here more towards being a usability review? I can't think of any time obvious unintended behaviour showed up not caught by the contract encoded in tests (there is no reason to write code that doesn't have a contractual purpose), but, after trying it, finding out that what you've created has an awful UX is something I have encountered and that is something much harder to encode in tests[1].
[1] As far as I can tell. If there are good solutions for this too, I'd love to learn.
> I can't think of any time obvious unintended behaviour showed up not caught by the contract encoded in tests
Unit testing, whether manual or automated, typically catches about 30% of bugs.
End to end testing and visual inspection of code are both closer to 70% of bugs.
There’s an anecdote from one of Dijkstra’s essays that strikes at the heart of this phenomenon. I’ll paraphrase because I can’t remember the exact EWD number off the top of my head.
A colleague was working on an important subsystem and would ask Dijkstra for a review when he thought it was ready. Dijkstra would have to stop what he was doing, analyze the code, and would find a grievous error or edge case. He would point it out to the colleague, who would then get back to work. The colleague would submit his code for review again, and this could carry on enough times that Dijkstra got annoyed.
Dijkstra proposed a solution. His colleague would have to submit with his code some form of proof or argument as to why it was correct and ready to merge. That way Dijkstra could save time by only having to review the argument and not all of the code.
There’s a way of looking at LLM output as Dijkstra’s colleague. It puts a lot of burden on the human using this tool to review all of the code. I like Doctorow’s mental model of a reverse centaur. The LLM cannot reason and so won’t provide you with a sound argument. It can probably tell you what it did and summarize the code changes it made… but it can’t decide to merge those changes. It needs a human, the bottom half of the centaur, to do the last bit of work here. Because that’s all we’re doing when we let these tools do most of the work for us: we’re here to take the blame.
And all it takes is an implementation of what we’re trying to build already, every open source library ever, all of SO, a GW of power from a methane power plant, an Olympic pool of water and all of your time reviewing the code it generates.
At the end of the day it’s on you to prove why your changes and contributions should be merged. That’s a lot of work! But there’s no shortcuts. Luckily you can reason while the LLMs struggle with that so use it while you can when choosing to use such tools.
I agree with the author overall. Manual testing is what I call "vibe testing" and I think by itself is insufficient, no matter if you or the agent wrote the code. If you build your tests well, using the coding agent becomes smooth and efficient, and the agent is safe to do longer stretches of work. If you don't do testing, the whole thing is just a bomb ticking in your face.
My approach to coding agents is to prepare a spec at the start, as complete as possible, and develop a beefy battery of tests as we make progress. Yesterday there was a story "I ported JustHTML from Python to JavaScript with Codex CLI and GPT-5.2 in hours". They had 9000+ tests. That was the secret juice.
So the future of AI coding as I see it: it will be better than pre-2020, we will learn to spec and plan good tests, and the tests become our actual contract that the code does what it is supposed to do. You can throw away the code, keep the specs and tests, and regenerate any time.
This depends on the type of software you make. Testing the usability of a user interface for example, is something you can't automate (yet). So, ehm, it depends :)
It will come around; we have rudimentary computer-use agents and the ability to record UIs for LLM agents. They will be refined, and then the agent can test UIs as well.
For UIs I do a different trick - live diagnostic tests - I ask the agent to write tests that run in the app itself, checking consistencies, constraints, and expected behaviors. Having the app running in its natural state makes it easier to test; you can have complex constraints encoded in your diagnostics.
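Roughly what that looks like, as a sketch; the app object and the invariants are hypothetical:

```python
def run_live_diagnostics(app) -> list[str]:
    """Consistency checks that run inside the live app, against its real
    state, rather than in a separate test harness."""
    problems: list[str] = []

    for order in app.orders.all():
        # Invariant: every order line must reference a product that still exists.
        for line in order.lines:
            if line.product_id not in app.products:
                problems.append(
                    f"order {order.id} references missing product {line.product_id}"
                )

        # Invariant: the cached total must agree with a recomputed total.
        recomputed = sum(line.quantity * line.unit_price for line in order.lines)
        if order.cached_total != recomputed:
            problems.append(
                f"order {order.id} total drifted: {order.cached_total} != {recomputed}"
            )

    return problems
```

The agent can surface the returned problems in an admin page or log them on a schedule, so violations show up while the app is running against real data.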
> Yesterday there was a story "I ported JustHTML from Python to JavaScript with Codex CLI and GPT-5.2 in hours".
Yes, from the same author, in fact.
There are always unknown unknowns which a rigorous testing implementation would just hide under the rug (until they become visible on live, that is).
> They had 9000+ tests.
They were most probably also written by AI, there's no other (human) way. The way I see it we're putting turtles upon turtles hoping that everything will stick together, somehow.
No, those 9,000 tests are part of a legendary test suite built by real humans over the course of more than a decade: https://github.com/html5lib/html5lib-tests
I tabbed back to Visual Studio (C#): 24990 "unit" tests, all written by hand over the past years.
Behind that is a smaller number of larger integration tests, and the even longer running regression tests that are run every release but not on every commit.
> They were most probably also written by AI, there's no other (human) way.
Yes. They came from the existing project being ported, which was also AI-written.
Well a 1000 line PR is still not welcome. It puts too much of a burden on the maintainers. Small PRs are the way to go, tests are great too. If you have to submit a big PR, get buy in from a maintainer first that they will review your code.
The job, in the modern world, is to close tickets. The code quality is negotiable, because the entire automated software process doesn't measure code quality, just statistics.
That's why I refuse to take part in it. But I'm an old-world craftsman by now, and I understand nobody wants to pay for working, well-thought-out code any more. They don't want a Chesterfield; they want plywood and glue.
I woke up and had a thought: software engineering isn't a serious engineering field if it actually fully shipped LLMs and expects everyone to use them. What do you expect quality-wise from a profession that says this is okay?
What do you do, O modern Luddite? Do you work for yourself making a product that people use? Are you on the government teat?
"Slow the f*ck down." - Oliver Reichenstein [1]
This only happens because the software industry has fallen into the Religion of Speed. I see it constantly: justified corner-cutting, rushing shit out the door, and always loading up another feature/project/whatever with absolutely zero self-awareness. AI is just an amplifier for bad behavior that was already causing chaos.
What's not being said here but should be: discipline matters. It's part of being a professional and always precedes someone who can ship code that "just works."
[1] https://ia.net/*
A few years ago I embraced automated tests and comprehensive documentation for even my smallest personal projects because I found that they sped me up. https://simonwillison.net/2022/Nov/26/productivity/
> As software engineers we…
That’s the thing. People exhibiting such rude behavior usually are not, or haven’t been in a looong time…
As for the local testing part not being performed, this is a slippery slope I’m fighting every day: more and more cloud-based services and platforms are used to deploy software that runs with specific shenanigans, and running it locally requires some kind of deep craft and understanding. Vendor lock-in is coming back in style (e.g. Databricks)
Yeah, I get frustrated by cloud-only systems that don't have a good local testing story.
The best solution I have for that is staging environments, ideally including isolated-from-production environments you can run automated tests against.
Whenever I have to work with such systems is usually when I write an interface and a mock implementation. Iteration is much faster when I don’t have to worry about getting the correct state from something I don’t have control over.
Having the coding agent make screenshots is a big power up.
I’m experimenting with how to get these into a PR, and the “gh” CLI tool is helpful.
Does anyone have a recipe to get a coding agent to record video of webflows?
Not yet. I'm confident Playwright will be involved in the answer; it has good video recording features.
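Something along these lines should work with Playwright's Python API; record_video_dir is the real option, but the login flow and selectors below are placeholders:

```python
from playwright.sync_api import sync_playwright

# Playwright writes one .webm per browser context when record_video_dir is set.
with sync_playwright() as p:
    browser = p.chromium.launch()
    context = browser.new_context(record_video_dir="videos/")
    page = context.new_page()

    page.goto("https://example.com/login")
    page.fill("#email", "demo@example.com")
    page.fill("#password", "hunter2")
    page.click("text=Sign in")
    page.wait_for_url("**/dashboard")

    context.close()  # the video file is finalized when the context closes
    print(page.video.path())
    browser.close()
```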
I'm not fully convinced by "a computer can never be held accountable"
We already delegate accountability to non-humans all the time:
- CI systems block merges
- monitoring systems page people
- test suites gate different things
In practice accountability is enforced by systems, not humans. Humans are definitely "blamed" after the fact, but the day-to-day control loop is automated.
As agents get better at running code, inspecting UI state, correlating logs, screenshots, etc., they're starting to be operationally "accountable", preventing bad changes from shipping and producing evidence when something goes wrong.
At some point the human's role shifts from "I personally verify this works" to "I trust this verification system and am accountable for configuring it correctly".
That's still responsibility, but it's kind of different from what's described here. Taken to a logical extreme, the argument here would suggest that CI shouldn't replace manual release checklists.
I need to expand on this idea a bunch, but I do think it's one of the key answers to the ongoing questions people have about LLMs replacing human workers.
Human collaboration works on trust.
Part of trust is accountability and consequences. If I get caught embezzling money from my employer I can lose my job, harm my professional reputation and even go to jail. There are stakes!
A computer system has no stakes, and cannot take accountability for its actions. This drastically limits what it makes sense to outsource to that system.
A lot of this comes down to my work on prompt injection. LLMs are fundamentally gullible: an email assistant might respond to an email asking for the latest sales figures by replying with the latest (confidential) sales figures.
If my human assistant does that I can reprimand or fire them. What am I meant to do with an LLM agent?
I don't think this is very hard. Someone didn't properly secure confidential data and/or someone gave this agent access to confidential data. Someone decided to go live with it. Reprimand them, and disable the insecure agent.
I've given you a disagree-and-upvote; these things are significant quality aids, but they are like the poka-yoke or manufacturing jig or automated inspection.
Accountability is about what happens if and when something goes wrong. The moon landings were controlled with computer assistance, but Nixon preparing a speech for what happened in the event of lethal failure is accountability. Note that accountability does not of itself imply any particular form or detail of control, just that a social structure of accountability links outcome to responsible person.
CI systems operate according to rules that humans feel they understand and can apply mechanically. Moreover, they (primarily) fail closed.
Why do you think that this other kind of accountability (which reminds me of the way captain's or commander's responsibility is often described) is incompatible with what the article describes? Due to the focus on necessity of manual testing?
Those systems include humans; they are put in place by humans (or collections of them) that are the accountability sink.
If you put them (without humans) in a forest they would not survive and evolve (they are not viable systems alone); they do not take action without the setup & maintenance (& accountability) of people.
You completely missed the point of that quote. The point of the quote is to highlight the fact that automated systems are amoral, meaning that they do not know good or evil and cannot make judgements that require knowing what good and evil mean.
Right, so how do you hold these things accountable? When your CI fails, what do you do? Type in a starkly worded message into a text file and shut off the power for three hours as a punishment? Invoice Intel?
Well, we're not there yet, but I do envision a future where some AIs work as independent contractors with their own bank accounts that they want to maximize, and if such an AI fails in a bad way, its client would be able to fine it, fire it, or even sue it, so that it, and the human controlling it, would be financially punished.
I mean I suppose you can continuously add "critical feedback" to the system prompt to have some measure of impact on future decision-making, but at some point you're going to run out of space and ultimately I do not find this works with the same level of reliability as giving a live person feedback.
Perhaps an unstated and important takeaway here is that junior developers should not be permitted to use LLMs, for the same reason they should not hire people: they have not demonstrated enough skill mastery and judgement to be trusted with the decision to outsource their labor. Delegating to a vendor is a decision made by high-level stakeholders, with the ability to monitor the vendor's performance and replace the vendor with alternatives if that performance is unsatisfactory. Allowing junior developers to use LLMs is allowing them to delegate responsibility without any visibility or ability to set boundaries on what can be delegated. Also important: you cannot delegate personal growth, and by permitting junior engineers to use an LLM that is what you are trying to do.
Non-native speaker here. I’ve always loved that we say “commit” not “upload” or “save”.
> Almost anyone can prompt an LLM to generate a thousand-line patch and submit it for code review. That’s no longer valuable. What’s valuable is contributing code that is proven to work.
That's really not a great development for us. If our main role is now reduced to accountability over the result, with barely any involvement in the implementation, that's very little moat and doesn't command a high salary. Either we provide real value or we don't... and from that essay I think it's not totally clear what the value is - it seems like every QA, junior SWE, or even product manager can now do the job of prompting and checking the output.
The value is being better at it than any QA or product manager.
Experienced software engineers have such a huge edge over everyone else with this stuff.
If your product manager doesn't understand what a CORS header is, good luck having them produce a change that requires a cross-domain fetch() call... and first they'll have to know what a "cross-domain fetch() call" means.
And sure they could ask an LLM about that, but they still need the vocabulary and domain knowledge to get to that question.
That's an interesting argument, but from my industry experience, the average experienced QA Engineer and technical Product Manager both have better vocabulary than the average SWE. Indeed, I wonder whether a future curriculum for Vibe Engineering (to borrow your own term) may look more similar to that of present-day QA or Product curricula, than to a typical coding or CS curriculum.
How about letting LLMs maintain a vast number of product versions, all available at the same time, which receive multiple untested versions of the same patch from LLMs, and then let the models elect a version of the software based on probabilistic or gradient methods? This elected version could change for different assessments. No human touches or looks at the code!
Just a wild thought, nothing serious.
Talk is cheap. Show me the proompt.
Had to search whether "proompt" was a new meme misspelling.
New to me, but I'm on board.
That's hard for me. Feed my comment to a model and ask for prompts.
Not only to work, but to not make the life of those coders who come after you a hell.
Thing is, this has always been the case. One of the problems with LLM-assisted coding is the idea that just because we're in a new era (we certainly are), the old rules can all be discarded.
The title doesn't go far enough - slop (AI or otherwise) can work and pass all the tests, and still be slop.
The difference is that if it works and passes the tests I don't feel like it's a total waste of my time to look at the PR and tell you why it's still slop.
If it doesn't even work you're absolutely wasting my time with it.
> Make your coding agent prove it first
Agents love to cheat. That's an issue I don't see changing on any horizon.
Here's Opus 4.5 trying to cheat its way out of properly implementing compatibility and cross-platform, despite the clear requirements:
https://gist.github.com/alganet/8531b935f53d842db98157e1b8c0...
> Should popen handles work with fgets/fread/fwrite? PHP supports this. Option A: Create a minimal pipe_io_stream device / Option B: Store FILE* in io_private with a flag / Option C: Only support pclose, require explicit stream wrapper for reads.
If I asked for compatibility, why give me options that won't fully achieve it?
It actually tried to "break check" my knowledge about the interpreter (test me if I knew enough to catch it), and proposed shortcuts all the way through the chat.
I don't want to have to pepper my chats with variations on "don't cheat". I mean, I can do it, but it seems like boilerplate.
I wish I had some similar testing-related chats to share. Agents do that all the time.
This is the major blocker right now for AI-assisted automated verification, and one of the reasons why this isn't well developed beyond general directions (give it screenshots, make it run the command, etc).
>> As software engineers we don’t just crank out code—in fact these days you could argue that’s what the LLMs are for. We need to deliver code that works—and we need to include proof that it works as well.
I would go a step further: we need to deliver code that belongs. This means following existing patterns and conventions in the codebase. Without explicit instruction, LLMs are really bad at this, and it's one of the things that make it incredibly obvious to reviews that a given piece of code has been generated by AI.
Agree, maintainability, security, standards, all of these are important to follow and there are usually reasons for these things existing.
I also see AI coding tools violate "Chesterton's Fence" (and the pre-Chesterton's Fence, not sure what that is called, the idea being that code is necessary otherwise it shouldn't be in the source).
> Without explicit instruction, LLMs are really bad at this
They used to be. They have become quite good at it, even without instruction. Impressively so.
But it does require that the humans who laid the foundation also followed consistent patterns and conventions. If there is deviation to be found, the LLM will see it and be forced to choose which direction to go, and that's when things quickly fall off the rails. LLMs are not (yet) good at that.
Garbage in, garbage out, as they say.
The job of a software developer is not just to prove that the software "works". The definition of "works" itself is often fuzzily defined and difficult to prove.
That is part of it, yes, but there are many others, such as ensuring that the new code is easy to understand and maintain by humans, makes the right tradeoffs, is reasonably efficient and secure, doesn't introduce a lot of technical debt, and so on.
These are things that LLMs often don't get right, and junior engineers need guidance with and mentoring from more experienced engineers to properly learn. Otherwise software that "works" today, will be much more difficult to make "work" tomorrow.
As if! :)
Oh look another "an opinionated X". Everything is opinionated these days, even opinions.
It works on my machine ¯\_(ツ)_/¯
> Your job is to deliver code you have proven to work.
Your job is to solve customer problems. Their problems may only be solvable with code that is proven to work, but it is equally likely (I dare say even more likely) that their problem isn't best solved with code at all, or even solved with code that doesn't work properly but works well enough.
I would argue that the word "proof" in the title might be misleading you.
From the post and the example he links, the point is that if you don't at least look at the running code, you don't know that it works.
In my opinion the point is actually well illustrated by Chris's talk here:
https://v5.chriskrycho.com/elsewhere/seeing-like-a-programme...
(summary of the relevant section if you're not going to click)
>>>
In the talk "Seeing Like a Programmer," Chris Krycho quotes the conductor and composer Eímear Noone, who said:
> "The score is potential energy. It's the potential for music to happen, but it's not the music."
He uses this quote to illustrate the distinction between "software as artifact" (the code/score) and "software as system" (the running application/music). His point is that the code itself is just a static artifact—"potential energy"—and the actual "software" only really exists when that code is executed and running in the real world.
> if you don't at least look at the running code, you don't know that it works.
Your tests run the code. You know it works. I know the article is trying to say that testing is not comprehensive enough, but my experience disagrees. But I also recognize that testing is not well understood (quite likely the least understood aspect of computer science!) — and if you don't have a good understanding you can get caught not testing the right things or not testing what you think you are. I would argue that you would be better off using your time to learn how to write great tests instead of using it to manually test your code, but to each their own.
What is more likely to happen is not understanding the customer needs well enough, leaving it impossible to write tests that align with what the software needs to do. Software development can break down very quickly here. However, manual testing does not help. You can't know what to manually test without understanding the problem either. However, as before, your job is not to deliver proven code. Your job is to solve customer problems. When you realize that, it becomes much less likely that you write tests that are not in line with the solution you need.
Maybe in an ideal world
Honestly if you code in Rust, your code almost certainly works, even without any testing.
Whole article seems very much all llm generated
Edit: I'm an idiot ignore me.
Not a single word of it was. I wrote this one entirely in Apple Notes, so there weren't even any VS Code completed sentences
It has emdashes because my blog turns " - " into an emdash here: https://github.com/simonw/simonwillisonblog/blob/06e931b397f...
My biggest apologies, a very bad move on my part. I'll pay more attention before making any sort of accusation like this.
So what? as long as it conveys the point it was supposed to, should be fine IMO.
If we are accepting LLM generated code, we should accept LLM generated content as long as it is "proof read" :)
Do elaborate, I don't see anything standing out
Did you read the article and come to that conclusion or just blindly count the number of em-dashes and assume that? Because I don't get the impression that it was LLM generated