> Companies with relatively young, high-quality codebases benefit the most from generative AI tools, while companies with gnarly, legacy codebases will struggle to adopt them. In other words, the penalty for having a ‘high-debt’ codebase is now larger than ever.
This mirrors my experience using LLMs on personal projects. They can provide good advice only to the extent that your project stays within the bounds of well-known patterns. As soon as your codebase gets a little bit "weird" (i.e., trying to do anything novel and interesting), the model chokes, starts hallucinating, and makes your job considerably harder.
Put another way, LLMs make the easy stuff easier, but royally screw up the hard stuff. The gap does appear to be widening, not shrinking. They work best where we need them the least.
The niche I've found for LLMs is implementing individual functions and unit tests. I'll define an interface and a return type (or a test name and expectation) and say "this is what I want this to do", and let the LLM take the first crack at it. Limiting the bounds of the problem to be solved does a pretty good job of at least scaffolding something out that I can then take to completion. I almost never end up taking the LLM's autocompletion at face value, but having it written out to review and tweak does save substantial amounts of time.
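To make that concrete, here's roughly what one of those scaffolds looks like (a contrived Python sketch; the function, its drafted body, and the test are all hypothetical, not actual LLM output):

```python
from datetime import date, timedelta

def business_days_between(start: date, end: date) -> int:
    """Count weekdays in [start, end), excluding weekends."""
    # the kind of first-crack body the LLM drafts and I then review/tweak
    return sum(
        1
        for i in range((end - start).days)
        if (start + timedelta(days=i)).weekday() < 5
    )

def test_business_days_skips_weekend():
    # the "test name and expectation" variant of the same prompt
    assert business_days_between(date(2024, 11, 1), date(2024, 11, 4)) == 1
```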
The other use case is targeted code review/improvement. "Suggest how I could improve this" fills a role currently occupied by linters, but can be more flexible and robust. It has its place.
The fundamental problem with LLMs is that they follow patterns, rather than doing any actual reasoning. This is essentially the observation made by the article; AI coding tools do a great job of following examples, but their usefulness is limited to the degree to which the problem to be solved maps to a followable example.
Can't tell you how much I love it for testing; it's basically the only thing I use it for. I now have a test suite that can rebuild my entire app from the ground up locally, and that works in the cloud as well. It's a huge motivator, actually, to write a piece of code with the reward being the ability to send it to the LLM to create some tests and then seeing a nice stream of green checkmarks.
> I now have a test suite that can rebuild my entire app from the ground up
What does this mean?
Sorry, should have been more clear. Firebase is (or was) a PITA when I started the app I'm working on a few years ago. I have a lot of records in my db that I need to validate after normalizing the data. I used to have an admin page that spit out a bunch of JSON data with some basic filtering and self-rolled testing that I could verify at a glance.
After a few years off from this project, I refactored it all, and part of that refactoring was building a test suite that I can run. When run, it will rebuild, normalize, and verify all the data in my app (scraped data).
When I deploy, it will also run these tests and then email if something breaks, but skip the seeding portion.
I had plans to do this before, but the Firebase emulator still had a lot of issues a few years ago. Refactoring this project gave me the freedom to finally build a proper testing environment and have my entire app make full use of my local Firebase emulator without issue.
I like giving it my test cases in plain English. It still gets them wrong sometimes, but 90% of the time they are good to go.
I struggle to get GitHub Copilot to create any unit tests that provide any value. How do you get it to create really useful tests?
>see a nice stream of green check marks.
Please tell me this is satire.
If not, I would absolutely hate to be the person that comes in after and finds a big stream of red xs every time I try to change something.
I had Codeium add something to a function that added a new data value to an object. Unbidden, it wrote three new tests, good tests. I wrote my own test by cutting and pasting a test it wrote with a modification; it pointed out that I didn't edit the comment, so I told it to do so.
It also screwed up the imports of my tests pretty bad, some imports that worked before got changed for no good reason. It replaced the JetBrains NotNull annotation with a totally different annotation.
It was able to figure out how to update a DAO object when I added a new field. It got the type of the field wrong when updating the object corresponding to a row in that database column, even though it wrote the Liquibase migration and should have known the type; we had chatted plenty about that migration.
It got many things right but I had to fix a lot of mistakes. It is not clear that it really saves time.
Try using Cursor with the latest claude-3-5-sonnet-20241022.
Unfortunately I “think different” and use Windows. I use Microsoft Copilot and would say it is qualitatively similar to codeium in quality, a real quantitative eval would be a lot of work.
Cursor (cursor.com) is just a vscode wrapper, should work fine with Windows. If you're already in the AI coding space I seriously urge you to at least give it a go.
I'll look into it.
I'll add that my experience with the Codeium plugin for IntelliJ is night and day different from the Windsurf editor from Codeium.
The first one "just doesn't work" and struggles to see files that are in my project; the second basically works.
You can also look into https://www.greptile.com/ to ask codebase questions. There's so many AI coding tools out there now. I've heard good things about https://codebuddy.ca/ as well (for IntelliJ) and https://www.continue.dev/ (also for IntelliJ).
>The first one "just doesn't work"
Haha. You're on a roll.
Let's be clear here, Codeium kinda sucks. Yeah, it's free and it works, somewhat. But I wouldn't trust it much.
It's containerized and I have a script that takes care of everything from the ground up :) I've tested this on multiple OSes and friends' computers. I'm thankful to old me for writing a readme for current me lol.
>Please tell me this is satire.
No. I started doing TDD. It's fun to think about a piece of functionality, write out some tests, and then slowly make it pass. Removes a lot of cognitive load for me and serves as a micro todo. It's also nice that when you're working and think of something to add, you can just write out a quick test for it and add it to kanban later.
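If it helps, the loop is nothing fancy; something like this contrived pytest-style sketch (names made up):

```python
# Write the expectation first and watch it fail -- this is the micro todo.
def test_slugify_lowercases_and_joins_words():
    assert slugify("Hello  World") == "hello-world"

# Then write just enough code to turn the checkmark green.
def slugify(title: str) -> str:
    return "-".join(title.lower().split())
```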
I can't tell you how many times I've worked on projects that are gigantic in complexity and don't do any testing, don't use TypeScript, or both. You're always like 25% paranoid about everything you do and it's the worst.
Can you give some examples? What LLM? What code? What tests?
As a test I just asked "ChatGPT 4o with canvas": "Can you write a set of tests to test glBufferData and all of its edge cases?"
glBufferData is a 32-year-old API, so there are clearly plenty of examples for it to have looked at. There are even multiple public test suites for it, including the official tests, which are open source and so easily scannable. It failed.
It wrote 8 tests; 7 of those tests were wrong in that it did something wrong intentionally, then asserted it got no error. It wasn't close to comprehensive. It didn't test that the function actually put data in the buffer, for example, nor did it check the set of valid enums to see that they work. Nor did it check that the target parameter actually works and affects the correct buffer bound to that target.
This is my experience with LLMs for code so far. I do get answers quicker from LLMs sometimes for tech questions vs searching via Google and reading Stack Overflow. But that's only sometimes. As a recent example, I was trying to add TypeScript types to some JavaScript and it failed. I went round and round telling it it failed, but it got stuck in a loop and just kept saying "Oh, sorry. How about this -- repeat of previous code".
Yes this is the same for me. I’ve shifted my programming style so now I just write function signatures and let the AI do the rest for me. It has been a dream and works consistently well.
I’ll also often add hints at the top of the file in the form of comments or sample data to help keep it on the right track.
Like most of us, it appears LLMs really only want to work on greenfield projects.
Good joke, but the reality is they falter even more on truly greenfield projects.
That is because, by definition, their models are based upon the past. And woe unto thee if that training data was not pristine. Error propagation is a feature; it's a part of the design, unless one is suuuuper careful. As some have said, "Fools rush in."
Or, in comic form: https://www.smbc-comics.com/comic/rise-of-the-machines
I agree with this. But the reason is that AI does better the more constrained it is, and existing codebases come with constraints.
That said, if you are using gen AI without an advanced RAG system feeding it lots of constraints and patterns/templates, I wish you luck.
The site also suggests LLMs care a great deal one way or another.
"Unlock a codebase that your engineers and AI love."
I think they do often act opinionated and show some decision-making ability, so AI alignment really is important.
Remember to tip your LLM
I was recently assigned to work on a huge legacy ColdFusion backend service. I was very surprised at how useful AI was with the code. It was even better, in my experience, than I've seen with Python, Java, or TypeScript. The only explanation I can come up with is that there is so much legacy ColdFusion code out there that was used to train Copilot and whatever AI JetBrains uses for code completion that this is one of the languages they are most suited to assist with.
Perhaps it is the reverse: that ColdFusion training sources are limited, so it is more likely to converge on a homogeneous style?
Casually, we usually think of a programming language as being one thing, but in reality a programming language generally only specifies a syntax. All of the other features of a language emerge from the people using it. And because of that, two different people can end up speaking two completely different languages even when sharing the same syntax.
This is especially apparent when you witness someone who is familiar with programming in language X, who then starts learning language Y. You'll notice, at least at first, they will still try to write their programs in language X using Y syntax, instead of embracing language Y in all its glory. Now, multiply that by the millions of developers who will touch code in a popular language like Python, Java, or Typescript and things end up all over the place.
So while you might have a lot more code to train on overall, you need a lot more code for the LLM to be able to discern the different dialects that emerge out of the additional variety. Quantity doesn't imply quality.
I wonder what a language designed as a target for LLM-generated code would look like? What semantics and syntax would help the LLM generate code that is more likely to be correct and maintainable by humans?
Perhaps something like Cobol? (Shudder.)
That's great, but it's a sample size of 1, and AI utility is also self-confirmation-biasing: if the AI stops providing useful output, you stop using it. It's like "what you're searching for is always in the last place you look". After you recognize AI's limits, most people won't keep trying to ask it to do things they've learned it can't do. But still, there's an area of things it does, and a (ok, fuzzy) boundary of its capabilities.
Basically, for any statement about AI helpfulness, you need to quantify how far it can help you. Depending on your personality, anything else is likely either always a success (if you have a positive outlook) or a failure (if you focus on the negative).
But where did these companies get the ColdFusion code for their training data? Since ColdFusion is an old language and used for backend services, how much ColdFusion code is open source and crawlable?
I'm definitely assuming that they don't limit their training data to what is open source and crawlable.
That's a good question. I presume there is some way to check github for how much code in each language is available on it.
Similar experience with Perl scripts being rewritten into Go. Crazy good experience with Claude.
> Put another way, LLMs make the easy stuff easier, but royally screw up the hard stuff.
This is my experience with generation as well - but I still don't trust it for the easy stuff and thus the model ends up being a hindrance in all scenarios. It is much easier for me to comprehend something I'm actively writing so making sure a generative AI isn't hallucinating costs more than me just writing it myself in the first place.
For me same experience but opposite conclusion. LLM saves me time by being excellent at yak shaving, letting me focus on the things that truly need my attention.
It would be great if they were good at the hard stuff too, but if I had to pick, the basics are where I want them the most. My brain just really dislikes that stuff, and I find it challenging to stay focused and motivated on those things.
Yep, I'm building a dev tool that is based on this principle. Let me focus on the hard stuff, and offload the details to an AI in a principled manner. The current crop of AI dev tools seem to fall outside of this sweet spot: either they try to do everything, or act as a smarter code completion. Ideally I will spend more time domain modeling and less time "coding".
> LLM saves me time by being excellent at yak shaving, letting me focus on the things that truly need my attention.
But these tools often don't generate working, let alone bug-free, code. Even for simple things, you still need to review and fix it, or waste time re-prompting them. All this takes time and effort, so I wonder how much time you're actually saving in the long run.
I use ChatGPT the most when I need to make a small change in a language I'm not fluent in, but I have a clear understanding of the project and what I'm trying to do. Example: "write a function that does this and this in JavaScript". It's essentially a replacement for Stack Overflow.
I never use it for something that really requires knowledge of the code base, so the quality of the code base doesn't really matter. Also, I don't think it has ever provided me something I wouldn't have been able to do myself pretty quickly.
This is true, but I look at it differently. It makes it easier to automate the boring or annoying. Gotta throw up an admin interface? Need to write more unit tests? Need a one-off but complicated SQL query? They tend to excel at these things, and it makes me more likely to do them, while keeping my best attention for the things that really need me.
> They work best where we need them the least.
Au contraire. I hate writing boilerplate. I hate digging through APIs. I hate typing the same damn thing over and over again.
The easy stuff is mind numbing. The hard stuff is fun.
You write these once (or zero times) by using a scaffolding template, a generator, or snippets.
And now LLMs write these for me; it's relaxing.
Same experience, but I think it's going to change. As models get better, their context window keeps growing while mine stays the same.
To be clear, our context window can be really huge if you are living the project. But not if you are new to it or even getting back to it after a few years.
Here's the secret to grokking a software project: a given codebase is not understandable without understanding how and why it was built; i.e. if you didn't build it, you're not going to understand why it is the way it is.
In theory, the codebase should be understandable as it is (and it is, with a great deal of rigorous study). In reality, that's simply not the case, not for any non-trivial software system.
So your secret to understanding code is: Abandon hope all ye who enter here?
Oh, you will understand why things were built. It's inevitable.
And all of that understanding will come from people complaining about you fixing a bug.
Hard work is no secret, it's just avoided by slackers at all costs :-)
What I'm really saying is that our software development software is missing a very important dimension.
LLMs might be a good argument for documenting more of the "why" in code comments.
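Even a one-liner would do; a contrived example of the kind of "why" comment that keeps both humans and LLMs from "simplifying" something load-bearing:

```python
# why: the upstream API starts returning 500s above ~1000 rows,
# so we batch well under the limit; do not round this up to 1000.
MAX_BATCH = 900
```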
Too bad most projects don't document any of those decisions.
> They work best where we need them the least.
I disagree, but it’s largely a matter of expectations. I don’t expect them to solve hard problems for me. That’s currently still my job. But when I’m writing new code, even for a legacy system, they can save a lot of time in getting the initial coding done, helping write comments, unit tests, and so on.
It’s not doing difficult work, but it saves a lot of toil.
I've noticed the same, and wonder if this is the natural result of public codebases on average being simpler, since small projects will always outnumber bigger ones (at least if you ignore forks with zero new commits).
If high quality closed off codebases were used in training, would we see an improvement in LLM quality for more complex use cases?
One description of the class of problems LLMs are a good fit for is anything at which you could throw an army of interns. And this seems consistent with that.
Ironically enough I’ve always found LLMs work best when I don’t know what I’m doing
Is that because you can’t judge the quality of the output?
I find this perspective both scary and exciting. I'm curious: how do you validate the LLM's output? If you have a way to do this, and it's working, then that's amazing. If you don't, how are you gauging "work best"?
> This mirrors my experience using LLMs on personal projects. They can provide good advice only to the extent that your project stays within the bounds of well-known patterns.
I agree, but I find it's still a great productivity boost for certain tasks. Cutting through the hype, figuring out which tasks are well suited to these tools, and prompting optimally has taken me a long time.
I hear people say this a lot, but invariably the tasks end up being "things you shouldn't be doing".
E.g. pointing the AI at your code and getting it to write unit tests or writing more boilerplate, faster.
This is only partly true. AI works really well on very legacy codebases like COBOL and mainframe, and it's very good at converting that to modern languages and architectures. It's all the stuff from like 2001-2015 that it gets weird on.
> AI works really well on very legacy codebases like COBOL and mainframe
Any sources? Seems unlikely that LLMs would be good at something with so little training data in the widely available internet.
LLMs are good at taking the underlying structure of one medium and repeating it using another medium.
Maybe it's a signal that your software should be restructured into modules that fit well-established patterns.
It's like building a website that doesn't use MVC and then complaining that the LLM's advice is garbage...
No, you shouldn't restructure your software into highly-repetitive noise so that a dumb computer can guess what comes next.
I am a proponent of clean and simple architecture that follows standard patterns,
because it is easier to maintain; there should be no clever tricks or clever architecture.
All software architecture should be boring and simple, with as few tricks as possible, unless absolutely warranted.
A pattern is a structured way of working around a language deficiency. Good code does not need patterns or architecture, it expresses the essence of the actual business problem and no more. Such software is also significantly easier to maintain if you measure maintainability against how much functionality the software implements rather than how many lines of code it is. Unfortunately the latter is very common, and there is probably a bright future in using LLMs to edit masses of LLM-copy-pasted code as a poor man's substitute for doing it right.
Simplicity is hard. And difficulty is what almost everyone using LLMs is trying to avoid. More code breeds complexity.
I read somewhere that 1/6 of the time should be allocated to refactoring (every 6th cycle). I wonder how that should be done with LLMs.
Exactly that. LLMs generate a lot of simple and dumb code fast. Then you need to refactor it and you can't because LLMs are still very bad at that. They can only refactor locally with a very limited scope, not globally.
Good luck to anyone having to maintain legacy LLM-generated codebases in the future, I won't.
> the model chokes, starts hallucinating, and makes your job considerably harder.
Coincidentally this also happens with developers in unfamiliar territory.
I often think of LLMs as a really smart junior developer - full of answers, half correct, with zero wisdom but 100% confidence
I'd like to think most developers know how to say "I don't know, let's do some research" but in reality, many probably just take a similar approach to the LLM - feign competence and just hack out whatever is needed for today's goal, don't worry about tomorrow.
Eh, it’s been kinda nice to just hit tab-to-complete on things like formulaic (but comprehensive) test suites, etc.
I never wanted the LLM to take over the (fun) part - thinking through the hard/unusual parts of the problem - but you’re also not wrong that they’re needed the least for the boilerplate. It’s still nice :)
True, if you're using LLMs as a completion engine or to generate scaffolding it's still very useful! But we have to acknowledge that's by far the easiest part of programming. IDEs and deterministic dev tools have done that (very well) for decades.
The LLM gains are in efficiency for rote tasks, not solving the other hard problems that make up 98% of the day. The idea that LLMs are going to advance software in any substantial way seems implausible to me - It's an efficiency tool in the same category as other IDE features, an autocomplete search engine on steroids, not even remotely approaching AGI (yet).
> The idea that LLMs are going to advance software in any substantial way seems implausible to me
I disagree. They won't do that for existing developers. But they will make it so that tech-savvy people will be able to do much more. And they might even make one-off customization per person feasible.
Imagine you want to sort Hacker News comments by number of characters, inline in your browser. Tell the AI to add this feature and maybe it will work (just for you). That's one way I can see substantial changes happening in the future.
> It’s still nice :)
This is the thing about the kind of free advertising so many on this site provide for these llm corpos.
I’ve seen so many comparisons between “ai” and “stack overflow” that mirror this sentiment of “it’s still nice :)”.
Who’s laying off and replacing thousands of working staff for “still nice :)” or because of “stack overflow”?
Who’s hiring former alphabet agency heads to their board for “still nice :)”?
Who’s forcing these services into everything for “still nice :)”?
Who’s raising billions for “still nice :)”?
So while developers argue tooth and nail for these tools that they seemingly think everyone only sees through their personal lens of a “still nice :)” developer tool, the companies are leveraging that effort to oversell their product beyond the scope of “still nice :)”.
I suspected some of that, and your explanation is more general and better put.
Or, for a joke, LLMs plagiarize!
> They work best where we need them the least.
Just like most of the web frameworks and ORMs I've been forced to use over the years.
As the context windows get larger and the UX for analyzing multiple files gets better, I've found them to be pretty good.
But they still fail at devops because so many config scripts are at newer versions than the training set.
> However, in ‘high-debt’ environments with subtle control flow, long-range dependencies, and unexpected patterns, they struggle to generate a useful response
I'd argue that a lot of this is not "tech debt" but just signs of maturity in a codebase. Real world business requirements don't often map cleanly onto any given pattern. Over time codebases develop these "scars", little patches of weirdness. It's often tempting for the younger, less experienced engineer to declare this as tech debt or cruft or whatever, and that a full re-write is needed. Only to re-learn the lessons those scars taught in the first place.
I recently watched a team speedrun this phenomenon in rather dramatic fashion. They released a ground-up rewrite of an existing service to much fanfare, talking about how much simpler it was than the old version. Only to spend the next year systematically restoring most of those pieces of complexity as whoever was on pager duty that week got to experience a high-pressure object lesson in why some design quirk of the original existed in the first place.
Fast forward to now and we're basically back to where we started. Only now they're working on code that was written in a different language, which I suppose is (to misappropriate a Royce quote) "worth something, but not much."
That said, this is also a great example of why I get so irritated with colleagues who believe it's possible for code to be "self-documenting" on anything larger than a micro-scale. That's what the original code tried to do, and it meant that its current maintainers were left without any frickin' clue why all those epicycles were in there. Sure, documentation can go stale, but even a slightly inaccurate accounting for the reason would have, at the very least, served as a clear reminder that a reason did indeed exist. Without that, there wasn't much to prevent them from falling into the perennially popular assumption that one's esteemed predecessors were idiots who had no clue what they were doing.
> Only to spend the next year systematically restoring most of those pieces of complexity as whoever was on pager duty that week got to experience a high-pressure object lesson in why some design quirk of the original existed in the first place.
Just to emphasize the point: even if it's not obvious why there is a line of code, it should at least be obvious that the line of code does something. It's important to find out what that something is and remember it for a refactor. At the very least, the knowledge could help you figure out a bug a day or two before you decide to pore over every line in the diff.
In my refactoring I always refer to that as Chesterton's Fence. Never remove something until you know why it was put in in the first place. Plenty of times it's because you were trying to support Python 3.8 or something else obsolete, and a whole lot of the time it's because you thought that the next project was going to be X so you tried to make that easy but X never got done so you have code to nowhere. Then feel free to refactor it, but a lot of the time it's because of good reasons that are NOT obsolete or overtaken by events, and when refactoring you need to be able to tell the difference.
https://www.chesterton.org/taking-a-fence-down/ has the full cite on the names.
Incidentally the person who really convinced me to stop trying to future-proof made a point along those lines. Not in the same language, but he basically pointed out that, in practice, future-proofing is usually just an extremely efficient way to litter your code with Chesterton's Fences.
I got really 'lucky' in that the first major project I ever worked on was future-proofed to high heaven, and I became the one to maintain that thing for a few years as none of the expected needs for multiple layers of future-proofing abstraction came to pass. Oh but if we ever wanted to switch from Oracle to Sybase, it would have been 30% easier with our database connection factory!
I never let that happen again.
> ...if we ever wanted to switch from Oracle to Sybase...
Yeah like Oracle would ever let that happen
This can be a huge road block. Even if the developer who wrote the code is still around, there's no telling if they will even remember writing that line, or why they did so. But in most projects, that original developer is going to be long gone.
I leave myself notes when I do bug fixes for this exact reason.
> Sure, documentation can go stale, but even a slightly inaccurate accounting for the reason would have, at the very least, served as a clear reminder that a reason did indeed exist.
Which is borderline the reason for version control: do a git/svn blame on that line, find the commit where it was added, and see what the commit message was. Bonus points if it links to a case on a system you still use. Sure, the commit message can be useless, but it's at least something you're forced to enter when committing code, rather than external documentation that can be missed and become misleading. Version control can even show you the codebase at the time that change was made, so you can see it in context (which has saved me a few times, showing what something was added for so I could confirm a suspicion).
Hahaha, Joel Spolsky predicted exactly that IN THE YEAR 2000:
https://www.joelonsoftware.com/2000/04/06/things-you-should-...
Times have changed. Code now does acquire bugs just by sitting there. Assholes you depend on are changing language definitions, compiler behavior, and libraries in a massive effort concentrated on breaking your code. :)
> Assholes you depend on are changing language definitions, compiler behavior, and libraries in a massive effort concentrated on breaking your code. :)
Big Open Source is plotting against the working class developer.
It acquires bugs, security flaws, and obsolescence from the operating system itself.
In general, when people say this sort of thing, if I dig into what exactly they're doing I discover they're importing half of npm/pypi/etc.
My code doesn't acquire bugs by sitting there in 2024 any more than it did in 2004. On most projects these days I'm using Django + Preact + HTM. Preact and HTM get loaded from static files by my root Django template. My PyPi dependencies are pinned to specific versions, and usually I have <10 (usually it's just Django and Django REST framework, sometimes it's even just Django).
Golang really is the best when it comes to backwards compatibility. I'm able to import dependencies from 14 years ago and have them work with 0 changes
100% hear this, and I know as a developer at a big company I have no say over the business side of things, but there's probably something to be said for all of us pushing for clear, logical business processes that make sense. Take something like a complicated offering of subscriptions: it's bad for the customer, it's bad for salespeople, it's bad for customer support, and honestly it's probably even bad for marketing. Keep things simple. I suppose those complexities ultimately allow for greater revenue through greater extraction of dollars per customer (e.g., people who meet this criterion are willing to pay more, so we'll have this niche plan), but as I outlined above, at what cost? Are you even coming out ahead in the long run?
There is a pretty well known essay by Joel Spolsky (which is now 24 years old!) titled "Things You Should Never Do" where he talks about the error of doing a rewrite: https://www.joelonsoftware.com/2000/04/06/things-you-should-... . While I don't necessarily agree with all of his positions here, and given the way most software is architected and deployed these days some of this advice is just obsolete (e.g. relatively little software is complete, client-side binaries where his advice is more relevant), I think he makes some fantastic points. This part is particularly aligned with what you are saying:
> Back to that two page function. Yes, I know, it’s just a simple function to display a window, but it has grown little hairs and stuff on it and nobody knows why. Well, I’ll tell you why: those are bug fixes. One of them fixes that bug that Nancy had when she tried to install the thing on a computer that didn’t have Internet Explorer. Another one fixes that bug that occurs in low memory conditions. Another one fixes that bug that occurred when the file is on a floppy disk and the user yanks out the disk in the middle. That LoadLibrary call is ugly but it makes the code work on old versions of Windows 95.
> Each of these bugs took weeks of real-world usage before they were found. The programmer might have spent a couple of days reproducing the bug in the lab and fixing it. If it’s like a lot of bugs, the fix might be one line of code, or it might even be a couple of characters, but a lot of work and time went into those two characters.
> When you throw away code and start from scratch, you are throwing away all that knowledge. All those collected bug fixes. Years of programming work.
Louder for the people in the back. I've had this notion for quite a long time that "tech debt" is just another way to say "this code does things in ways I don't like". This is so well said, thank you!
There is a difference between "this code does things in ways I don't like" and "this code does things in ways nobody likes"
> Over time codebases develop these "scars", little patches of weirdness. It's often tempting for the younger, less experienced engineer to declare this as tech debt or cruft or whatever, and that a full re-write is needed. Only to re-learn the lessons those scars taught in the first place.
Do you have an opinion on when this maturity becomes too much?
Let's say you need to add a major feature that would drastically change the existing code base. On top of that, by changing the language, this major feature would be effortless to add.
When is it worth fighting with the scars, and when is it better to just rewrite?
Code that has thorough unit and integration tests, no matter how old and crusty, can be refactored with a good deal of confidence, and AI can help with that.
Imo real tech debt is when the separation between business logic and implementation details gets blurry.
Rewrites tend to focus all in on implementation.
I call them warts, but yes, agree, especially in an industry that does a lot of changing, for example a heavily regulated one.
Put another way, sometimes code is complex because it has to be.
>Instead of trying to force genAI tools to tackle thorny issues in legacy codebases, human experts should do the work of refactoring legacy code until genAI can operate on it smoothly
Instead of genAI doing the rubbish, boring, low status part of the job, you should do the bits of the job no one will reward you for, and then watch as your boss waxes lyrical about how genAI is amazing once you've done all the hard work for it?
It just feels like if you're re-directing your efforts to help the AI, because the AI isn't very good at actual complex coding tasks then... what's the benefit of AI in the first place? It's nice that it helps you with the easy bit, but the easy bit shouldn't be that much of your actual work and at the end of the day... it's easy?
This gives very similar vibes to: "I wanted machines to do all the soul crushing monotonous jobs so we would be free to go and paint and write books and fulfill our creative passions but instead we've created a machine to trivially create any art work but can't work a till"
> There is an emerging belief that AI will make tech debt less relevant.
Wow. It's hard to believe that people are earnestly supposing this. From everything we have evidence of so far, AI generated code is destined to be a prolific font of tech debt. It's irregular, inconsistent, highly sensitive to specific prompting and context inputs, and generally produces "make do" code at best. It can be extremely "cheap" vs traditional contributions, but gets to where it's going by the shortest path rather than the most forward-looking or comprehensive.
And so it does indeed work best with young projects where the prevailing tech debt load remains low enough that the project can absorb large additions of new debt and incoherence, but that's not to the advantage of young projects. It's setting those projects up to be young and debt-swamped much sooner than they would otherwise be.
If mature projects can't use generative AI as extensively, that's going to be to their advantage, not their detriment -- at least in terms of tech debt. They'll be forced to continue plodding along at their lumbering pace while competitors bloom and burst in cycles of rapid initial development followed by premature seizure/collapse.
And to be clear: AI generated code can have real value, but the framing of this article is bonkers.
The mainstream layman/MBA view is that "AI/nocode will replace the programmers". Most actual programmers know better, of course.
Evergreen: https://static.googleusercontent.com/media/research.google.c...
Machine learning is the high interest credit card of technical debt.
Came to post this — it’s the same underlying technology, just a lot more compute now.
This is funny in the context of seeing GCP try to deprecate a text embedding API and then push out the deadline by 6 months.
It is a self-reinforcing pattern: the easier it is to generate code, the more code is generated. The more code is generated, the bigger the cost of maintenance is (and the relationship is super-linear).
So every time we generate the same boilerplate, we are really doing copy/paste, adding to maintenance costs.
We are amazed looking at the code generation capabilities of LLMs forgetting the goal is to have less code - not more.
My experience is the opposite - I find large blobs of generated code to be daunting, so I tend to pretty quickly reject them and either write something smaller by hand, or reprompt (in one way for another) for less, easier to review code.
And what do you do with the generated code?
Do you package it in a reusable library so that you don't have to do the same prompting again?
Or rather - just because it is so easy to do - you don't bother?
If it's the latter, that's exactly the pattern I am talking about.
You are an excellent user of AI code generation - but your habit is absolutely not the norm and other developers will throw in paragraphs of AI slop mindlessly.
This is just taking the advice to make code sane so that humans could undertand and modify it, and then justifying it as "AI should be able to understand and modify it". I mean, the same developer efficiency improvements apply to both humans and AI. The only difference is that currently humans working in a space eventually learn the gotchas, while current AIs don't really have that ability to learn the nuances of a particular space over time.
I love the way our SWE jobs are evolving. AI eating the simple stuff, generating more code but with harder to detect bugs... I'm serious, it feels that we can move faster with these tools but perhaps have to operate differently.
We are a long way from automating our jobs away; instead, our expertise evolves.
I suspect doctors go through a similar evolution as surgical methods are updated.
I would love to read or participate in the discussion of how to be strategic in this new world. Specifically, how to best utilize code generating tools as a SWE. I suppose I can wait a couple of years for new school SWEs to teach me, unless anyone is aware of content on this?
On one hand I agree with this conceptually, but on the other hand I've also been able to use AI to rapidly clean up and better structure a bunch of my existing code.
The blind copy-paste has generally been a bad idea though. Still need to read the code spit out, ask for explanations, do some iterating.
Imagine a single file full of complicated logic, where messing with one if statement might cause serious bugs. Here an AI will likely struggle, whereas a human could spend a couple of hours trying to work out the connections.
But if you have a code base with predictable software architectural patterns, the AI will likely recognise and help with all the boilerplate.
Of course there is a lot of middle ground between bad and good.
Do you mind getting into specifics about how you've been using AI to restructure your code? What tools are you using, and how large is the code base you're working with?
Yeah, LLMs are pretty good at doing things like moving a lambda function to the right spot or refactoring two overlapping classes to a base class. Often it only saves five minutes, but that adds up over time.
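For instance, the overlapping-classes case usually lands on something like this (hypothetical names, sketching the shape you'd expect after the refactor):

```python
import json

class Exporter:
    """Shared base hoisted out of two near-duplicate classes."""
    def __init__(self, path: str):
        self.path = path

    def write(self, data: bytes) -> None:
        with open(self.path, "wb") as f:
            f.write(data)

class CsvExporter(Exporter):
    def write_rows(self, rows: list[list[str]]) -> None:
        self.write("\n".join(",".join(r) for r in rows).encode())

class JsonExporter(Exporter):
    def write_obj(self, obj: object) -> None:
        self.write(json.dumps(obj).encode())
```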
> Not only does a complex codebase make it harder for the model to generate a coherent response, it also makes it harder for the developer to formulate a coherent request.
> This experience has lead most developers to “watch and wait” for the tools to improve until they can handle ‘production-level’ complexity in software.
You will be waiting until the heat death of the universe.
If you are unable to articulate the exact nature of your problem, it won't ever matter how powerful the model is. Even a nuclear weapon will fail to have effect on target if you can't approximate its location.
Ideas like dumpstering all of the codebase into a gigantic context window seem insufficient, since the reason you are involved in the first place is because that heap is not doing what the customer wants it to do. It is currently a representation of where you don't want to be.
Well, increasing the temperature (i.e., adding some more randomness) for sure is going to magically generate a solution the customer wants. Right? /s
AI has a different "tech debt" issue.
Because with AI you can turn any problem into a black box. You build a model, and call it "solved". But then reality hits ...
This was what I thought the post would talk about before clicking through. AI adds tech debt because none of the people maintaining or operating the code actually wrote it, so they are not familiar with their own implementation.
Yeah the article is just another borderline-useless self-promotion piece.
"Companies with relatively young, high-quality codebases"
I thought that at the beginning the code might be a bit messy because of the need to iterate fast, and that quality comes with time. What's the experience of the crowd on this?
In my experience you need a high quality codebase to be able to iterate at maximum speed. Any time someone, myself included, thought they could cut corners to speed up iteration, it ended up slowing things down dramatically in the end.
Coding haphazardly can be a lot more thrilling, though! I certainly don't enjoy the process of maintaining high quality code. It is lovely in hindsight, but an awful slog in the moment. I suspect that is why startups often need to sacrifice quality: The aforementioned thrill is the motivation to build something that has a high probability of being a complete waste of time. It doesn't matter how fast you can theoretically iterate if you can't compel yourself to work on it.
> thought they could cut corners to speed up iteration
Anecdotally, I find you can get about 3 days of speed from cutting corners - after that, as you say, you get slowed down more than you got sped up. First day, you get massive speed from going haphazard; second day, you're running out of corners to cut, and on the third day you start running into problems you created for yourself on the first day.
A piece of advice I heard many years ago was to not be afraid to throw away code. I've actually used that advice from time to time. It's not really a waste of time to do a `git reset --hard master` if you wrote shit code, but while writing it, you figured out how you should have written the code.
Very much yes.
There's little reason to try to go straight for the final product when you don't know exactly how to get there, and that's frequently the case. Build toys to learn what you need efficiently, toss them, and then build the real thing. Trying to shoot for the final product while also changing direction multiple times along the way tends to create code with multiple conflicting goals subtly encoded in it, and it'll just confuse you and others later.
I don't think there's such a thing as a single metric for quality - the code should do what is required at the time and scale. At the early stages, you can get away with inefficient things that are faster to develop and iterate on, then when you get to the scale where you have thousands of customers and find that your problem is data throughput or whatever, and not speed of iteration, you can break that apart and make a more complex beast of it.
You gotta make the right trade-off at the right time.
This!
Active tradeoff analysis and a structure that allows for honest reflection on current needs is the holy grail.
Choices are rarely about what is best and are rather about finding the least worst option.
I find messiness often comes from capturing every possible edge case, which a young codebase probably doesn't do, tbh.
A user deleted their account and there's now a request to register that account with that username? We didn't think of that (concerns from UX on imposters and abuse to be handled). Better code in a catch and handle this. Do this 100x and your code has 100x custom branching logic that potentially interacts in n^2 ways, since each exceptional event could probably occur in conjunction with other exceptional events.
It's why I caution strongly against rewrites. It's easy to look at code and say it's too complex for what it does, but is the complexity actually needless? Can you think of a way to refactor the complexity out? If so, do that refactor; if not, a rewrite won't solve it.
I agree. New codebases are clean because they don't have all the warts of accumulated edge cases.
If the new codebase is messy because the team is moving fast as parent describes, that means the dev team is doing sloppy work in order to move fast. That type of speed is very short lived, because it's a lot harder to add 100 bugfixes to an already-messy codebase.
A startup with talent theoretically follows that pattern. If you're not a startup, you don't need to go fast in the beginning. If you don't have talent in both your dev team and your management, the codebase will get worse over time. Every company can differ on those two variables, and their codebases will reflect that. Probably most companies are large and talent-starved, so they go slow, start out with good code, then get bad over time.
Purely depends on whether or not a culture that values leaving options open in the future develops.
Young companies tend to have systems that are small enough, or with institutional knowledge, to pivot when needed, and tend to have small teams with good lines of communication that allow for a shared purpose and values.
Architectural erosion is typically a long-tailed problem.
Large legacy companies that can avoid architectural erosion do better than some startups who don't actively target maintainability, but it tends to require stronger commitment from Leadership than most orgs can maintain.
In my experience most large companies confuse the need to maintain adaptability with a need to impose silly policies that are applied irrespective of the long term impacts.
Integration and disintegration drivers are too fluid, context sensitive, and long term for prescription at a central layer.
The possibly mythical Amazon API edict is an example where focusing on separation and product focus could work, with high costs if you never get to the scale where it pays off.
The runways and guardrails concept seems to be a good thing in the clients I have worked for.
Some frameworks like Laravel can bring you far in terms of features. You're mostly gluing stuff together on top of a high-quality codebase. It gets ugly when you need to add all the edge cases that every real-world use case entails. And suddenly you have hundreds of lines of if statements in one method.
IME, “young” correlates with health b/c less time has been spent making it a mess… but, what’s really going on is the company’s culture and how it relates to quality work, aka, whether engineers are given the time to perform deep maintenance as the iteration concludes.
Maybe… to put it another way, it’s that time spent on quality isn’t time spent on discovery, but it’s only time spent on quality that gets you quality. So while a company is heavily focused on discovery - iteration, p/m fit, engineers figuring it out, etc - it’s not making a good codebase, and if they never carve out time to focus on quality, that won’t change.
That’s not entirely true - IMO, there’s a synergistic, not exclusionary relationship between the two - but it gets the idea across, I think.
> what's the experience of the crowd on this?
It's very hard to retrofit quality into existing code. It really should be there from the very start.
It is not just the code produced with code generation tools, but also business logic that uses gen AI.
For example a RAG pipeline. People are rushing things to market that are not built to last. The likes of LangChain etc. offer little software engineering polishing. I wish there were a more mature enterprise framework. Spring AI is still in the making and Go is lagging behind.
I enjoy reading these articles and reading comments from people who clearly have no idea how to use AI or its abilities.
It's funny that his recommendations - organize code in modules etc. - are not AI-specific; it's what you'd do if you had to hand over your project to an external team, or simply make it maintainable in the long term. So the best strategy for collaborating with AI turns out to be the same as for collaborating with humans.
I completely agree. That's why my stance is to wait and see, and in the meanwhile get our shit together, as in make our code maintainable by any intelligent being, human or not.
I keep waiting for the pairing of coding LLMs with a programming language created specifically to be coupled with a coding LLM.
Ever heard of LISP?
http://jmc.stanford.edu/articles/lisp.html
> This paper concentrates on the development of the basic ideas of LISP... when the programming language was implemented and applied to problems of artificial intelligence.
The problem is less the language and more what is written with any given language
The world is complex and we have to write a lot of code to capture that complexity. LLMs are good at the first 20% but balk at the 80% effort to match reality
> In essence, the goal should be to unblock your AI tools as much as possible. One reliable way to do this is to spend time breaking your system down into cohesive and coherent modules, each interacting through an explicit interface.
I find this works because it's much easier to debug a subtle GPT bug in a well-validated interface than the same bug buried in a nested for loop somewhere.
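Concretely, a minimal sketch of what I mean by a validated interface (Python; the names are hypothetical):

```python
from typing import Protocol

class RateStore(Protocol):
    """The explicit interface one module exposes to the rest of the system."""
    def get_rate(self, currency: str) -> float: ...

def convert(amount: float, currency: str, store: RateStore) -> float:
    rate = store.get_rate(currency)
    # validating at the seam means a subtle GPT bug surfaces here,
    # not three layers down inside a nested for loop
    if rate <= 0:
        raise ValueError(f"bad rate for {currency}: {rate}")
    return amount * rate
```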
Haven't read the article, don't need to read the article: this is so, SO, so painfully obvious! If someone needs this spelled out for them they shouldn't be making technical decisions of any kind. Sad that this needs to be said.
AI code is just more readily available Stack Overflow code. You don't use the code handed to you; you learn from it.
While this primarily focuses on the software development side of things, I’d like to chime in that this applies to the IT side of the equation as well.
LLMs can’t understand why your firewall rules have strange forwards for ancient enterprise systems, nor can they “automate” Operations on legacy systems or custom implementations. The only way to fix those issues is to throw money and political will behind addressing technical debt in a permanent sense, which no organization seemingly wants to do.
These things aren’t silver bullets, and throwing more technology at an inherently political problem (tech debt) won’t ever solve it.
I find AI most helpful with very specific, narrow commands (add a new variable to the logger, which means typescript and a bunch of other things need to be updated) and it can go off and do that. While it's doing that I'll be thinking about the next thing to be fixed already.
Asking it for higher level planning / architecture is just asking for pain
Current gen AI is bad at high level planning. But I've found it useful in iterating on my ideas, sort of a rubberduck++. It helps to have a system prompt that is not overly agreeable
Yes! It's definitely talked me out of making some really dumb decisions.
Speaking personally, I've found this tech much more helpful in existing codebases than new ones.
Missing test? Great, I'll get help identifying what the code should be doing, then use AI to write a boatload of tests in service towards those goals. Then I'll use it to help refactor some of the code.
But unlike the article, this requires actively engaging with the tool rather than, as they say a "sit and wait" (i.e., lazy) approach to developing.
I recently started playing with OpenSCAD and CadQuery -- tried a variety of the commercial LLMs, they all fall on their face so hard, teeth go flying.
This is for tiny code snippets, hello-world size, stringing together some primitives to render relatively simple objects.
Turns out, if the codebase / framework is a bit obscure and poorly documented, even the genie can't help.
Today's job is finishing up and testing some rather gnarly haproxy configuration. There's already a fairly high chance I'm going to stuff something up with it. There is no chance that I'm giving some other entity that chance as well.
> Companies with relatively young, high-quality codebases benefit the most from generative AI tools, while companies with gnarly, legacy codebases will struggle to adopt them.
So you say, but {citation needed}. Stuff like this is simply not known yet.
AI can easily be applied in legacy codebases, like to help with time-consuming refactoring.
The title of this article made me think that paying down traditional tech debt due to bugs or whatever is straightforward. Software with tech debt and/or bugs that incorporates AI isn’t a straightforward rewrite, but takes ML skills to pay down.
> Instead of trying to force genAI tools to tackle thorny issues in legacy codebases, human experts should do the work of refactoring legacy code until genAI can operate on it smoothly. When direct refactoring is still too risky, teams can adjust their development strategy with approaches like strangler fig to build greenfield modules which can benefit immediately from genAI tooling.
Or, y'know, just not bother with any of this bullshit. "We must rewrite everything so that CoPilot will sometimes give correct answers!" I mean, is this worth the effort? Why? This seems bonkers, on the face of it.
>I mean, is this worth the effort? Why?
It doesn't matter, it's the new hotness. Look at scrum, how shit it is for software and for devs, yet it's absolutely everywhere.
Remember "move fast and break things?" Everyone started taking that as gospel and writing garbage code. It seems the industry is run by toddlers.
/rant
> AI makes tech debt more expensive
This isn't AI doing.
It's the doing of adding any new feature to a product with existing tech debt.
And since AI for most companies is a feature, like any feature, it only makes the tech debt worse.
This type of analysis is a mirror of the early days of chess "AI". All kinds of commentary explaining the weaknesses of the engines, and extolling the impossible-to-reproduce capabilities of human players. But while they may have been correct in the moment, they didn't really appreciate the march toward utter dominance and supremacy of the machines over human players.
While there is no guarantee that the same trajectory is true for programming, we need to heed how emotionally attached we can be to denying the possibility.
Is this based on a study or something? I just see a graph with no references. What am I missing here?
I cannot wait for the inevitable top-down backlash banning any use of AI tools.
Good for us I guess?
Not sure it's tech debt as such; it's the hidden cost of having to maintain AI tech. It's not a static state, and it's got an ongoing maintenance cost.
Yeah, this is a total click-bait article. The claim put forth by the title is not at all supported by the article contents, which basically states "old codebases riddled with tech-debt do not benefit very much from GenAI, while newer cleaner codebases will see more benefit." That is so completely far off from "AI will make your tech debt worse."
Coding with AI could easily be a new form of early software/developer tech debt. Taking leaps that are too big, or too small, can be unexpected.
Code is not really lossy zipped text.
And well duh
The author starts with a straw man argument, of someone who thinks that AI is great at dealing with technical debt. He makes little attempt to steel man their argument. Then the author argues the opposite without much supporting evidence. I think the author is right that some people were quick to assume that AI is much better for brownfield projects, but I think the author was also quick to assume the opposite.
I agree with a lot of the assertions made in TFA but not so much the conclusion. AI increasing the velocity of simpler code doesn’t make tech debt more expensive, it just means it won’t benefit as much / be made cheaper.
OTOH if devs are getting the simpler stuff done faster maybe they have more time to work on debt.
I asked the AI to write me some code to get a list of all the objects in an S3 bucket. It returned code that worked and would no doubt be approved by most developers. But on closer inspection I noticed it would break if the bucket had more than 1,000 objects: S3 returns at most 1,000 objects per request, the API is paged, and the AI had no ability to understand this. Exceeding 1,000 objects is really, really easy to do with an S3 bucket.
Most AI code is kind of like that. It's sourced from demo quality examples and piecemeal paid work. The resulting code is focused on succinctly solving the problem in the prompt. Factoring and concerns external to making the demo work disappear first. Then any edge cases that might complicate the result get tossed.
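To make that failure mode concrete, here is a minimal TypeScript sketch using the AWS SDK v3 (the function names are illustrative, not from the original comment): the naive version silently stops at the first page, while the correct one follows the continuation token.

    import { S3Client, ListObjectsV2Command } from "@aws-sdk/client-s3";

    const s3 = new S3Client({});

    // Naive version, like the one described above: looks correct,
    // but silently truncates at the first page (up to 1,000 keys).
    async function listKeysNaive(bucket: string): Promise<string[]> {
      const resp = await s3.send(new ListObjectsV2Command({ Bucket: bucket }));
      return (resp.Contents ?? []).flatMap((o) => (o.Key ? [o.Key] : []));
    }

    // Paginated version: follows NextContinuationToken until exhausted.
    async function listKeysAll(bucket: string): Promise<string[]> {
      const keys: string[] = [];
      let token: string | undefined;
      do {
        const resp = await s3.send(
          new ListObjectsV2Command({ Bucket: bucket, ContinuationToken: token })
        );
        for (const obj of resp.Contents ?? []) {
          if (obj.Key) keys.push(obj.Key);
        }
        token = resp.NextContinuationToken;
      } while (token);
      return keys;
    }

The two functions agree on any bucket with fewer than 1,001 objects, which is exactly why the bug slips through review.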
Claude did the simple version by default, but I asked it to support more than 1,000 objects and it did it fine.
To some extent I do agree with the point you're trying to make.
But unless you state that pagination needs to be handled as well, the LLM will naively implement the bare minimum.
Context matters, and supplying enough context is what makes all the difference when interacting with these kinds of tools.
not parent, but
> I asked the AI to write me some code to get a list of all the objects in an S3 bucket
they didn’t ask for all the objects in the first returned page of the query
they asked for all the objects.
the necessary context is there.
LLMs are just on par with devs who don't read tickets properly / don't pay attention to the API they're calling (I've had this exact case happen with someone on a previous team, and it was a combination of both).
LLMs differ, though. The newest Claude just gave me a paginated solution without further prodding.
In other, more obscure cases I just add the documentation to its context and let it work based on that.
Yeah, AI isn't good at uncovering all the footguns and corner cases, but I think this reflects most of Stack Overflow, which (not coincidentally) also misses them.
... until it won't. A mature codebase also has (or should have) strong test coverage, both unit tests and comprehensive integration tests. With proper CI/CD pipelines, a small team can update and upgrade things at a fraction of the usual cost (see Amazon moving from old Java versions to newer ones) and "pay off" some of that debt.
The tooling for this will only improve.
According to Ilya Sutskever: "results from scaling up pre-training have plateaued". https://www.reuters.com/technology/artificial-intelligence/o...
They're trying other techniques to improve what we already have atm, but we're almost at the limit of its capabilities.
True if you’re using AI the wrong way. AI means dramatically less code, most of which is generated.
Creating React pages is the new COBOL.
Are LLMs even auditable?
Don't make me tap the sign.
"GARBAGE IN -- GARBAGE OUT!!"
Microservices are back on the menu, boys.
> human experts should do the work of refactoring legacy code until genAI can operate on it smoothly
How does one determine if that's even possible, much less estimate the work involved to get there?
After all, 'subtle control flow, long-range dependencies, and unexpected patterns' do not always indicate tech-debt.
As long as you can constrain your solution to the logic contained inside a Todo app, all is golden /s
Bah, this article is a bunch of nonsense. You're saying that a technology that has been around for a whole two years is not yet mature? Color me shocked.
I'm sure nothing will change in the future either.
According to Ilya Sutskever: "results from scaling up pre-training have plateaued".
https://www.reuters.com/technology/artificial-intelligence/o...
They're trying other techniques to improve what we already have atm.
And we plow through plateaus every six months, regularly, by inventing something new. I thought we were engineers, not some kind of Amish cult.
LLM code gen tools are really freaking good...at making the exact same react boilerplate app that everyone else has.
The moment you need to do something novel or complicated they choke up.
This is why I'm not very confident that tools like Vercel's v0 (https://v0.dev/) are useful for more than just playing around. It seems very impressive at first glance - but it's a mile wide and only an inch deep.
Most people don't do novel things, and those that do still have something like 90% the same business logic somebody else has done a million times over.
If you can create boilerplate code, logging, documentation, and common algorithms with AI, it saves you a lot of time, which you can spend on your specialized stuff. I am convinced that you can make yourself x2 by using an AI. Just use it in the proper way.
I feel like we should get rid of the boilerplate, rather than have an LLM barf it out.
There's an inherent tradeoff here though. Beyond a certain complexity threshold, code that leans toward more boilerplate is generally much easier to understand and maintain than code that tries to DRY everything with layers of abstraction, indirection, and magic.
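A hedged illustration of that tradeoff in TypeScript (every name below is invented for the example): the boilerplate version repeats itself but each function reads top to bottom, while the DRY version is shorter but makes the reader chase a generic helper to learn what any one function does.

    class NotFoundError extends Error {}

    interface Table<T> { findById(id: string): Promise<T | null>; }
    interface User { id: string; name: string; }
    interface Order { id: string; total: number; }

    // Tiny in-memory stand-in for a real data layer.
    function tableOf<T extends { id: string }>(rows: T[]): Table<T> {
      return { findById: async (id) => rows.find((r) => r.id === id) ?? null };
    }
    const db = {
      users: tableOf<User>([{ id: "u1", name: "Ada" }]),
      orders: tableOf<Order>([{ id: "o1", total: 42 }]),
    };

    // Boilerplate style: repetitive, but each function is self-contained.
    async function getUser(id: string): Promise<User> {
      const user = await db.users.findById(id);
      if (!user) throw new NotFoundError("user " + id);
      return user;
    }
    async function getOrder(id: string): Promise<Order> {
      const order = await db.orders.findById(id);
      if (!order) throw new NotFoundError("order " + id);
      return order;
    }

    // DRY style: less repetition, but understanding getUser2 now requires
    // understanding makeGetter, its generics, and anything layered in later.
    const makeGetter =
      <T>(table: Table<T>, label: string) =>
      async (id: string): Promise<T> => {
        const row = await table.findById(id);
        if (!row) throw new NotFoundError(label + " " + id);
        return row;
      };
    const getUser2 = makeGetter(db.users, "user");
    const getOrder2 = makeGetter(db.orders, "order");

Neither style is wrong; the point is that the DRY version only pays off once the repeated pattern is stable and numerous enough to justify the indirection.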
Honestly, this bit about genAI being good at generating boilerplate is correct, but it always makes me wonder... is this really a thing that would save a ton of time? How much boilerplate are people writing? Only a small fraction of code that I write involves boilerplate.
I just tend to use an extension such as https://marketplace.visualstudio.com/items?itemName=Huuums.v... for my boilerplate, since I can customize it along the way for the project and not think hard. I have seen a lot of younger devs not using such a thing, or an existing CLI, and instead copy-pasting then renaming, or writing from scratch every time with slight differences... It is weird to me how many don't look for ways to automate boilerplate, as that has always been my default.
Yeah, I often like to point out that our entire industry is already built on taking repeatable stuff and then abstracting it away.
Boilerplate code exists when the next step is often to start customizing it in a unique and unpredictable way.
When I try to read code on GitHub that has the var or val keyword, I have no fucking idea what the types of the variables are. Sure, the compiler can infer, since it’s just ingested your entire code base, but I have a single page of text to look at.
Some boilerplate is good.
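For what it's worth, the same readability concern has a TypeScript analogue (the names below are made up): inference satisfies the compiler, but an explicit annotation keeps the type on the page you're actually reading.

    interface QuarterlyReport { quarter: string; revenue: number; }

    // Stand-in implementation; imagine this living in another module.
    function loadReport(accountId: string): QuarterlyReport {
      return { quarter: "2024-Q4", revenue: 1_000_000 };
    }

    // Inferred: the compiler knows the type, but a reviewer skimming one
    // page of a diff on GitHub has to go find loadReport's signature.
    const report = loadReport("acct-42");

    // Annotated: a little more boilerplate, but self-documenting in place.
    const annotated: QuarterlyReport = loadReport("acct-42");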
Tested code libraries save time; AI-generated code saves time at the writing stage, but review takes more time because it's foreign code.
I guess that's a good way to think of it. Despite not being very useful (currently, anyway) for certain types of complicated or novel work, they are still very useful for other types of work and can help reduce development toil.
Or you can just start with a well-maintained boilerplate.
> I am convinced that you can make yourself x2 by using an AI.
This means you're getting paid 2x more, right?
...Right?