Knee-jerk reaction is to hate this, but if the PRs are small enough to quickly review and/or fix, having your test suite start to execute branches of code that it previously wasn't exercising has some value? It seems much (or exactly?) like using Copilot: it's going to be wrong, but sometimes it is 80% there nearly instantly, and you spend all of your time on the last 20%. Still, time saved and value gained, so long as you know what you are doing yourself. Maybe at least annotate the bot tests so they don't get mixed in with the intentionally added human tests; then it's even harder to justify throwing this idea out completely.
Even if the test is COMPLETELY off, it might be enough to get someone thinking “but wait, that gives me enough of an idea to test this branch of code” or even better “wait, this branch of code is useless anyway, let’s just delete it”
I'd be worried that the tests it generates would be brittle or not helpful, like testing an exact implementation instead of the API contract.
I don't know if that's the case here. I looked at the diff but I don't understand the code in the project. I'd like to see it on something I recognize better.
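To make the worry concrete, here's a minimal sketch (the function and names are invented, not from the project) of the difference between pinning implementation details and pinning the contract:

    # Hypothetical example -- send_welcome_email and its wording are made up.
    from unittest import mock

    def send_welcome_email(user, mailer):
        """Send a greeting; the exact wording is an implementation detail."""
        mailer.send(to=user["email"], subject="Welcome!", body=f"Hi {user['name']}")

    # Brittle: locks in the exact subject/body strings, so rewording the copy
    # breaks the test even though nothing callers rely on has changed.
    def test_sends_exact_strings():
        mailer = mock.Mock()
        send_welcome_email({"email": "a@b.c", "name": "Ada"}, mailer)
        mailer.send.assert_called_once_with(to="a@b.c", subject="Welcome!", body="Hi Ada")

    # Contract: only asserts what callers actually depend on -- exactly one
    # email goes to the right address.
    def test_sends_one_email_to_the_user():
        mailer = mock.Mock()
        send_welcome_email({"email": "a@b.c", "name": "Ada"}, mailer)
        assert mailer.send.call_count == 1
        assert mailer.send.call_args.kwargs["to"] == "a@b.c"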
Code coverage is useful as a metric as long as it's not turned into a target. But its primary usefulness is to tell a developer whether they've given enough thought to how to test that their code is correct.
A robot that back-fills coverage in tests seems... counterproductive to me.
I think it may be useful if you are carefully reviewing the tests generated. Basically the robot is saying "Did you know that your code had this behaviour? If so, let's lock it in." In many ways this feels similar to mutation testing, but sort of the opposite. Roughly:
- Mutation testing changes or simplifies code that is "dead" to the tests; that is, the change doesn't affect the result of the tests.
- This adds tests that lock in the behaviour of previously "dead" branches. (Or tests that expose the incorrect behaviour of those branches.)
In another way it is like snapshot testing: it records the current behaviour of the code, and that behaviour can be approved/rejected by the human reviewer.
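A rough illustration (the function is invented) of what "lock it in" means in practice: a generated characterization test records whatever a previously untested branch does today, and the reviewer approves or rejects that behaviour:

    # Invented example: a branch no existing test exercises.
    def shipping_cost(weight_kg: float, express: bool = False) -> float:
        if weight_kg <= 0:
            return 0.0              # previously "dead" to the test suite
        base = 5.0 + 1.2 * weight_kg
        return base * 2 if express else base

    # A generated characterization test simply records what the code does today.
    # The reviewer either approves it ("yes, free shipping for zero weight is
    # intended") or rejects it ("no, that branch should raise instead").
    def test_non_positive_weight_is_free():
        assert shipping_cost(0) == 0.0
        assert shipping_cost(-3) == 0.0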
I was thinking the same, one usage could be to discover cases that don't make functional sense. Assuming these tests get reviewed.
"Robot" is giving it too much credit. It's rather a fancy, dumb autocorrect emulating some corpus.
The only place LLMs should be trusted is how we all probably originally tried them out: “Write me a poem about a space traveling teddy bear in the style of Eminem.”
Trying to use LLMs for factual, numeric, logical purposes is a fundamentally flawed endeavor. Especially from our community, I’m disappointed to see this amount of blind trust and willful ignorance and lack of real engineering discipline in treating LLMs as anywhere near trustworthy.
LLMs != AGI. Hot take, even if we did have AGI, it shouldn’t be given access / free rein over code and computing systems. I trust humans, with all our flaws, because we are limited in our time and energy, and speed.
> Especially from our community, I’m disappointed to see this amount of blind trust and willful ignorance and lack of real engineering discipline in treating LLMs as anywhere near trustworthy.
I'm not disappointed anymore, because I've lowered my expectations for software engineers and the tech community in general. VC-adjacent hype and seeing the world through sci-fi fantasies trumps "engineering discipline" every time, at least in its discourse (you've got both for LLMs). There's a lot of unacknowledged ignorance of areas outside the tech bubble, and also a lot of contempt for it. Also, there's a pathological inability to think through the consequences that might temper the compulsion to mindlessly "build the next thing" (especially acute with AGI, which for our sake I hope is infeasible or at least impractical).
I really strongly believe someone needs to "take the keys away" from the tech community. And I say that as a member of it.
> LLMs != AGI. Hot take, even if we did have AGI, it shouldn’t be given access / free rein over code and computing systems. I trust humans, with all our flaws, because we are limited in our time and energy, and speed.
Also, more importantly, we are humans. It's literally one of the scariest, most hopeless things to imagine our world being dominated by something else, especially something that unreachably exceeds our capabilities (I'd count a billionaire controlling an army of AGI drones as essentially inhuman).
I haven’t had the energy to inform myself on all of this but I have a similar instinct.
> with all our flaws, because we are limited in our time and energy, and speed.
Something that might be relevant also is that we have skin in the game. External motivation can lead us to create intentionally correct or malicious programs. A robot’s fate is irrelevant since it has no capacity to care.
I’m sure that algorithms try to emulate this skin-in-the-game factor. But it will always just be a simulation.
I agree; this seems to start from the assumption that a new unit test should pass, but that isn't really the case. A unit test should assert that the code functions according to its planned design. The AI writing unit tests can't know its planned design, so it can't write test cases that prove the code is wrong.
I see, AI has learned how to write terrible commit messages, another thing I'm better at than AI, score!
What would it look like if a tool like this was taken to its logical conclusion: an agent that could write unit tests for any existing code and get it to 100% code coverage. Not only that, it would have tests for every combinatorial code path reachable. Test files would outnumber human code by orders of magnitude. The ultimate pipe dream of some software developers would be fulfilled.
Would such a tool be helpful? Probably in some circumstances (e.g. spacecraft software, perhaps), but I sure wouldn't want such a tool. If this is less than ideal, then how do we reach a compromise? Which code branches and functions should be left untested? Is this question even answerable from the textual representation of code and documentation?
> What would it look like if a tool like this was taken to its logical conclusion: an agent that could write unit tests for any existing code and get it to 100% code coverage. Not only that, it would have tests for every combinatorial code path reachable. Test files would outnumber human code by orders of magnitude. The ultimate pipe dream of some software developers would be fulfilled.
That dream is a nightmare. All code has bugs and that test suite would remove any "not bug" signals provided by the test cases.
Most if not all utopias turn out to be dystopias if you look hard enough.
Eventually you have so many tests that any change at all will break some of them. So you ask someone if the test makes sense or should be removed. But there is no one to ask.
In my opinion, no. An AI writing unit tests can't know the original intent of the code, only the written intent. So an AI-written test case gives you a false sense of security.
Additionally, such high coverage requirements often add friction to writing new code, for good or bad.
Okay, but what do these tests mean? It's easy to add tests that are meaningless and either test the obvious or are just for coverage.
But some of the buggiest stuff I've dealt with was in codebases that had full coverage, because none of the tests were designed to test the original intent of the code.
I suspect there is also another angle, which is whether the tests are maintainable as well. Like you said, if you're not testing intent, this might be one more thing to maintain.
In another view, this might just be a fancy way of doing snapshot testing: use AI to generate all the inputs to produce a robust snapshot. But realize the output isn't unit tests; it's snapshots that report changes in outputs, which devs will just rubber-stamp.
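Something like this (hypothetical) sketch: the assertion is just "matches whatever we recorded last time", so a behaviour change surfaces as a diff that's trivially easy to approve without thinking:

    # Hypothetical snapshot-style test: render_report and the snapshot file
    # are invented for illustration.
    import pathlib

    def render_report(data: dict) -> str:
        return "\n".join(f"{k}: {v}" for k, v in sorted(data.items()))

    def test_report_matches_snapshot():
        snapshot = pathlib.Path(__file__).with_name("report.snapshot.txt")
        output = render_report({"errors": 0, "warnings": 3})
        if not snapshot.exists():
            snapshot.write_text(output)   # first run records the snapshot
        # Any change in output shows up as a failing diff to approve or reject.
        assert output == snapshot.read_text()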
I think these tests are useful as regression tests - unit tests can be really helpful when making changes down the line, tipping you off that you missed something. Also much easier to refactor when there’s good test coverage.
From the PR: unit tests: what are they good for?
Answer: Personal opinion - writing unit tests is not fun. It becomes even less appealing as your codebase grows and maintaining tests becomes a time-consuming chore.
However, the benefits of comprehensive unit tests are real:
Reliability: They create a more reliable codebase where developers can make changes confidently
Speed: Teams can move quickly without fear of breaking existing functionality
Safe Refactoring: Code improvements and restructuring become significantly safer when backed by thorough tests
Living Documentation: Tests serve as clear documentation of your code's behavior:
- They show exactly what happens for each input
- They present changes in a human-readable format: "for this input → expect this output"
- They run quickly and are easy to execute

This immediate feedback loop is beneficial during development.
> I think these tests are useful as regression tests - unit tests can be really helpful when making changes down the line, tipping you off that you missed something. Also much easier to refactor when there’s good test coverage.
The problem is that this assumes that the tests or the method was written correctly in the first place. If the behavior in the method is wrong and the tests are validating that the behavior is wrong, then you pay an extra tax. First to fix the behavior of the method, then to fix the behavior of the tests.
That's why automatically generating unit tests is, in my opinion, adding a bomb to your codebase. The only exception is stuff like basic parameter testing, but even that can be questionable at times (is null a valid input at any point, for example) unless you know the intent of the code, and AI can't really grasp the intent.
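For example (invented code, not from this PR), a generated test would happily freeze buggy null handling, and then you pay twice: once to fix the function and once to fix the test:

    # Invented illustration: the function silently treats None as an empty
    # cart, when the intent was probably to reject missing input.
    def cart_total(items):
        if items is None:
            return 0                  # bug: should likely raise ValueError
        return sum(price for _, price in items)

    # An auto-generated test can't know the intent, so it locks in the buggy
    # behaviour and makes it more expensive to change later.
    def test_cart_total_none():
        assert cart_total(None) == 0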
Will we reach Dead GitHub eventually?
- DependaBot
- CoverageBot
- ReplyBot
- CodeOfConductEnforcementBot
- KudosBot
Old projects can live actively forever without any intervention.