• voytec a day ago

    I agree in general but the web was already polluted by Google's unwritten SEO rules. Single-sentence paragraphs, multiple keyword repetitions and focus on "indexability" instead of readability, made the web a less than ideal source for such analysis long before LLMs.

    It also made the web a less than ideal source for training. And yet LLMs were still fed articles written for Googlebot, not humans. ML/LLM is the second iteration of writing pollution. The first was humans writing for corporate bots, not other humans.

    • doe_eyes a day ago

      > I agree in general but the web was already polluted by Google's unwritten SEO rules. Single-sentence paragraphs, multiple keyword repetitions and focus on "indexability" instead of readability, made the web a less than ideal source for such analysis long before LLMs.

      Blog spam was generally written by humans. While it sucked for other reasons, it seemed fine for measuring basic word frequencies in human-written text. The frequencies are probably biased in some ways, but this is true for most text. A textbook on carburetor maintenance is going to have the word "carburetor" at way above the baseline. As long as you have a healthy mix of varied books, news articles, and blogs, you're fine.

      In contrast, LLM content is just a serpent eating its own tail - you're trying to build a statistical model of word distribution off the output of a (more sophisticated) model of word distribution.

      • weinzierl 21 hours ago

        Isn't it the other way around?

        SEO text carefully tuned to tf-idf metrics and keyword stuffed to them empirically determined threshold Google just allows should have unnatural word frequencies.

        LLM content should just enhance and cement the status quo word frequencies.

        Outliers like the word "delve" could just be sentinels, carefully placed like trap streets on a map.

        • mlsu 17 hours ago

          But you can already see it with Delve. Mistral uses "delve" more than baseline, because it was trained on GPT.

          So it's classic positive feedback. LLM uses delve more, delve appears in training data more, LLM uses delve more...

          Who knows what other semantic quirks are being amplified like this. It could be something much more subtle, like cadence or sentence structure. I already notice that GPT has a "tone" and Claude has a "tone" and they're all sort of "GPT-like." I've read comments online that stop and make me question whether they're coming from a bot, just because their word choice and structure echoes GPT. It will sink into human writing too, since everyone is learning in high school and college that the way you write is by asking GPT for a first draft and then tweaking it (or not).

          Unfortunately, I think human and machine generated text are entirely miscible. There is no "baseline" outside the machines, other than from pre-2022 text. Like pre-atomic steel.

          • bryanrasmussen 10 hours ago

            is the use of miscible here a clue? Or just some workplace vocabulary you've adapted analogically?

            • mlsu 9 hours ago

              Human me just thought it was a good word for this. It implies some irreversible process of mixing, I think that characterizes this process really well.

              • noduerme 7 hours ago

                There were dozens of 20th Century ideological movements which developed their own forms of "Newspeak" in their own native languages. Largely, natural human dialog between native speakers and between those opposed to the prevailing regime recoils violently at stilted, official, or just "uncool" usages in daily vernacular. So I wouldn't be too surprised to see a sharp downtick in the popular use of any word that becomes subject to an LLM's positive-feedback loop.

                Far from saying the pool of language is now polluted, I think we now have a great data set to begin to discern authentic from inauthentic human language. Although sure, people on the fringes could get caught in a false positive for being bots, like you or I.

                The biggest LLM of them all is the daily driver of all new linguistic innovation: Human society, in all its daily interactions. The quintillions of daily phrases exchanged and forever mutating around the globe - each mutation of phrase interacting with its interlocutor, and each drawing from not the last 500,000 tokens but the entire multi-modal, if you will, experience of each human to date in their entire lives - vastly eclipses anything any hardware could ever emulate given the current energy constraints. Software LLMs are just a state machine stuck in a moment in time. At best they will always lag, the way Stalinist language lagged years behind the patois of average Russians, who invented daily linguistic dodges to subvert and mock the regime. The same process takes place anywhere there is a dominant official or uncool accent or phrasing. The ghetto invents new words, new rhythm, and then it becomes cool in the middle class. The authorities never catch up, precisely because the use of subversive language is humanity's immune system against authority.

                If there is one distinctly human trait, it's sniffing out anyone who sounds suspiciously inauthentic. (Sadly, it's also the trait that leads to every kind of conspiracy theorizing imaginable; but this too probably confers in some cases an evolutionary advantage). Sniffing out the sound of a few LLMs is already happening, and will accelerate geometrically, much faster than new models can be trained.

                • bryanrasmussen 7 hours ago

                  humans also lag humans, the future may already be spoken, but the slang is not evenly memed out yet.

              • jazzyjackson 8 hours ago

                If you think that's niche wait til you hear about man-machine miscegenation

              • taneq 17 hours ago

                > LLM uses delve more, delve appears in training data more, LLM uses delve more...

                Some day we may view this as the beginnings of machine culture.

                • mlsu 17 hours ago

                  Oh no, it's been here for quite a while. Our culture is already heavily glued to the machine. The way we express ourselves, the language we use, even our very self-conception originates increasingly in online spaces.

                  Have you ever seen someone use their smartphone? They're not "here," they are "there." Forming themselves in cyberspace -- or being formed, by the machine.

                  • taneq 3 hours ago

                    chat is this real?

              • derefr 21 hours ago

                1. People don't generally use the (big, whole-web-corpus-trained) general-purpose LLM base-models to generate bot slop for the web. Paying per API call to generate that kind of stuff would be far too expensive; it'd be like paying for eStamps to send spam email. Spambot developers use smaller open-source models, trained on much smaller corpuses, sized and quantized to generate text that's "just good enough" to pass muster. This creates a sampling bias in the word-associational "knowledge" the model is working from when generating.

                2. Given how LLMs work, a prompt is a bias — they're one-and-the-same. You can't ask an LLM to write you a mystery novel without it somewhat adopting the writing quirks common to the particular mystery novels it has "read." Even the writing style you use in your prompt influences this bias. (It's common advice among "AI character" chatbot authors, to write the "character card" describing a character, in the style that you want the character speaking in, for exactly this reason.) Whatever prompt the developer uses, is going to bias the bot away from the statistical norm, toward the writing-style elements that exist within whatever hypersphere of association-space contains plausible completions of the prompt.

                3. Bot authors do SEO too! They take the tf-idf metrics and keyword stuffing, and turn it into training data to fine-tune models, in effect creating "automated SEO experts" that write in the SEO-compatible style by default. (And in so doing, they introduce unintentional further bias, given that the SEO-optimized training dataset likely is not an otherwise-perfect representative sampling of writing style for the target language.)

                • travisjungroth 11 hours ago

                  On point 1, that’s surprising to me. A 2,000 word blog post would be 10 cents with GPT-4o. So you put out 1,000 of them, which is a lot, for $100.

                  • brazzy 10 hours ago

                    But then you'll be competing for clicks with others who put out 1,000,000 posts for less costs because they used a small, self hosted model.

                    • baq 9 hours ago

                      if you are a sales & marketing intern, have a potato laptop and $100 budget to spend on seo, you aren't going to be self hosting anything even if you know what that means.

                      • nerdponx 7 hours ago

                        This is about high-volume blog/news-spam created specifically to serve ads and affiliate links, not about occasional content marketing for legitimate companies.

                • tigerlily 6 hours ago

                    Too deep we delved, and awoke the ancient delves.
                  • lbhdc 21 hours ago

                    > LLM content should just enhance and cement the status quo word frequencies.

                    TFA mentions this hasn't been the case.

                    • flakiness 17 hours ago

                      Would you mind dropping the link talking about this point? (context: I'm a total outsider and have no idea what TFA is.)

                      • girvo 17 hours ago

                        TFA means "the featured article", so in this case the "Why wordfreq will not be updated" link we're talking about.

                        • adastra22 16 hours ago

                          To be pedantic, the F in TFA has the same meaning as the F in RTFM.

                          It’s the same origin. On Slashdot (the HN of the early 00’s) people would admonish others to RTFA. Then they started using it as a referent: TFA was the thing you were supposed to have read.

                          • girvo 13 hours ago

                            Oh that I'm aware of, but it's softened over time too haha

                            I miss the old Atomic MPC forums in the ~00s.

                          • jnordwick 3 hours ago

                            The Fucking Article, from RTFA - Read the Fucking Article - and RTFM - Read the Fucking Manual/Manpage

                    • brudgers 11 hours ago

                      serpent eating its own tail

                      GOGI.

                      • romwell 10 hours ago

                        The Inhuman Centipede

                    • bondarchuk a day ago

                      At some point though you have to acknowledge that a specific use of language belongs to the medium through which you're counting word frequencies. There are also specific writing styles (including sentence/paragraph sizes, unnecessary repetitions, focusing on other metrics than readability) associated with newspapers, novels, e-mails to your boss, anything really. As long as text was written by a human who was counting on at least some remote possibility that another human might read it, this is way more legitimate use of language than just generating it with a machine.

                      • kevindamm a day ago

                        Yes but not quite as far as you imply. The training data is weighted by a quality metric, articles written by journalists and wikipedia contributors are given more weight than Aunt May's brownie recipe and corpoblogspam.

                        • jsheard a day ago

                          > The training data is weighted by a quality metric

                          At least in Googles case, they're having so much difficulty keeping AI slop out of their search results that I don't have much faith in their ability to give it an appropriately low training weight. They're not even filtering the comically low-hanging fruit like those YouTube channels which post a new "product review" every 10 minutes, with an AI generated thumbnail and AI voice reading an AI script that was never graced by human eyes before being shat out onto the internet, and is of course always a glowing recommendation since the point is to get the viewer to click an affiliate link.

                          Google has been playing the SEO cat and mouse game forever, so can startups with a fraction of the experience be expected to do any better at filtering the noise out of fresh web scrapes?

                          • acdha a day ago

                            > Google has been playing the SEO cat and mouse game forever, so can startups with a fraction of the experience be expected to do any better at filtering the noise out of fresh web scrapes?

                            Google has been _monetizing_ the SEO game forever. They chose not to act against many notorious actors because the metric they optimize for is ad revenue and and those sites were loaded with ads. As long as advertisers didn’t stop buying, they didn’t feel much pressure to make big changes.

                            A smaller company without that inherent conflict of interest in its business model can do better because they work on a fundamentally different problem.

                            • derefr 21 hours ago

                              > those YouTube channels which post a new "product review" every 10 minutes, with an AI generated thumbnail and AI voice reading an AI script that was never graced by human eyes before being shat out onto the internet

                              The problem is that, of the signals you mention,

                              • the highly-informative ones (posting a new review every 10 minutes, having affiliate links in the description) are contextual — i.e. they're heuristics that only work on a site-specific basis. If the point is to create a training pipeline that consumes "every video on the Internet" while automatically rejecting the videos that are botspam, then contextual heuristics of this sort won't scale. (And Google "doesn't do things that don't scale.")

                              • and, conversely, the context-free signals you mention (thumbnail looks AI-generated, voice is synthesized) aren't actually highly correlated with the script being LLM-barf rather than something a human wrote.

                              Why? One of the primary causes is TikTok (because TikTok content gets cross-posted to YouTube a lot.) TikTok has a built-in voiceover tool; and many people don't like their voice, or don't have a good microphone, or can't speak fluent/unaccented English, or whatever else — so they choose to sit there typing out a script on their phone, and then have the AI read the script, rather than reading the script themselves.

                              And then, when these videos get cross-posted, usually they're being cross-posted in some kind of compilation, through some tool that picks an AI-generated thumbnail for the compilation.

                              Yet, all the content in these is real stuff that humans wrote, and so not something Google would want to throw away! (And in fact, such content is frequently a uniquely-good example of the "gen-alpha vernacular writing style", which otherwise doesn't often appear in the corpus due to people of that age not doing much writing in public-web-scrapeable places. So Google really wants to sample it.)

                              • nneonneo 10 hours ago

                                Reminds me of a Google search I did yesterday: “Hezbollah” yields a little info box with headings “Overview”, “History”, “Apps” and “Return policy”.

                                I’m guessing that the association between “pagers” and “Hezbollah” ended up creating the latter two tabs, but who knows. Maybe some AI video out there did a product review of Hezbollah.

                                • Suppafly a day ago

                                  >At least in Googles case, they're having so much difficulty keeping AI slop out of their search results that I don't have much faith in their ability to give it an appropriately low training weight.

                                  I've noticed that lately. It used to be the top google result was almost always what you needed. Now at the top is an AI summary that is pretty consistently wrong, often in ways that aren't immediately obvious if you aren't familiar with the topic.

                                  • noirscape a day ago

                                    Google has those problems because the company's revenue source (Ads) and the thing that puts it on the map (Search) are fundamentally at odds with one another.

                                    A useful Search would ideally send a user to the site with the most signal and the fewest noise. Meanwhile, ads are inherently noise; they're extra pieces of information inserted into a webpage that at best tangentially correlate to the subject of a page.

                                    Up until ~5 years ago, Google was able to strike a balance on keeping these two stable; you'd get results with some Ads but the signal generally outweighed the noise. Unfortunately from what I can tell from anecdotes and courtroom documents, the Ad team at Google has essentially hijacked every other aspect of the company by threatening that yearly bonuses won't be given out if they don't kowtow to the Ad teams wishes to optimize ad revenue somewhere in 2018-2019 and has no sign of stopping since there's no effective competition to Google. (There's like, Bing and Kagi? Nobody uses Bing though and Kagi is only used by tech enthusiasts. The problem with Google is that to copy it, you need a ton of computing resources upfront and are going up against a company with infinitely more money and ability to ensure users don't leave their ecosystem; go ahead and abandon Search, but good luck convincing others to give up say, their Gmail account, which keeps them locked to Google and Search will be there, enticing the average user.)

                                    Google has absolutely zero incentive to filter out generative AI junk from their search results outside the amount of it that's damaging their PR since most of the SEO spam is also running Google Ads (since unless you're hosting adult content, Google's ad network is practically the only option). Their solution therefore isn't to remove the AI junk, but to instead reduce it enough to the degree where a user will not get the same type of AI junk twice.

                                    • PaulHoule 20 hours ago

                                      My understanding is that Google Ads are what makes Google Search unassailable.

                                      A search engine isn't a two-sided market in itself but the ad network that supports it is. A better search engine is a technological problem, but a decently paying ad network is a technological problem and a hard marketing problem.

                                    • epgui a day ago

                                      I don’t think they were talking about the quality of Google search results. I believe they were talking about how the data was processed by the wordfreq project.

                                      • kevindamm a day ago

                                        I was actually referring to the data ingestion for training LLMs, I don't know what filtering or weighting might be done with wordfreq.

                                    • Freak_NL a day ago

                                      It certainly feels like the amount of regurgitated, nonsensical, generated content (nontent?) has risen spectacularly specifically in the past few years. 2021 sounds about right based on just my own experience, even though I can't point to any objective source backing that up.

                                      • eszed a day ago

                                        Upvoted for "nontent" alone: it'll be my go-to term from now on, and I hope it catches on.

                                        Is it of your own coinage? When the AI sifts through the digital wreckage of the brief human empire, they may give you the credit.

                                        • Freak_NL a day ago

                                          I do hope it catches on! I did come up with this myself, but I really doubt I'm the only one — and indeed: Wiktionary lists it already with a 2023 vintage:

                                          https://en.wiktionary.org/wiki/nontent

                                        • zharknado a day ago

                                          Ooh I like “nontent.” Nothing like a spicy portmanteau!

                                          • eptcyka a day ago

                                            I personally am yet to see this beyond some slop on youtube. And I am here for the AI meme videos. I recognize the dangers of this, all I am saying is that I don't feel the effect, yet.

                                            • Freak_NL a day ago

                                              I'm seeing it a lot when searching for some advice in a well-defined subject, like, say, leatherworking or sewing (or recipes, obviously). Instead of finding forums with hobbyists, in-depth blog posts, or manufacturers advice pages, increasingly I find articles which seem like natural language at first, but are composed of paragraphs and headers repeating platitudes and basic tips. It takes a few seconds to realize the site is just pushing generated articles.

                                              Increasingly I find that for in-depth explanations or tutorials Youtube is the only place to go, but even there the search results can lead to loads of videos which just seem… off. But at least those are still made by humans.

                                              • sharpshadow 7 hours ago

                                                Looking forward to watch perfect generated videos. We need so much more power and chips but it’s completely worth it. After that? Maybe generated videogames. But the video stuff will be awesome and changing the video dominated social media content for ever. Virtual headsets will become useful finally generating anything you want to see and jump tru space and time.

                                                • ghaff a day ago

                                                  There's been a ton of low-rent listicle writing out there for ages. Certainly not new in the past few years. I admit I don't go on YouTube much and don't even have a tiktok account so it's possible there's a lot of newer lousy content I'm not really exposed to.

                                                  It seems to me that the fact it's so cheap and relatively easy for people with dreams of becoming wealthy influencers to put stuff out there has more to do with the flood of often mediocre content than AI does.

                                                  Of course the vast majority don't have much real success and get on with life and the crank turns and a new generation perpetuates the cycle.

                                                  LLMs etc. may make things marginally easier but there's no shortage of twenty somethings with lots of time imagining riches while making pennies.

                                                • jsheard a day ago

                                                  SEO grifters have fully integrated AI at this point, there are dozens of turn-key "solutions" for mass-producing "content" with the absolute minimum effort possible. It's been refined to the point that scraping material from other sites, running it through the LLM blender to make it look original, and publishing it on a platform like Wordpress is fully automated end-to-end.

                                                  • sahmeepee 19 hours ago

                                                    Or check out "money printer" on github: a tongue in cheek mashup of various tools to take a keyword as input and produce a youtube video with subtitles and narration as output.

                                                • darby_nine a day ago

                                                  Aunt may's brownie recipe (or at least her thoughts on it) are likely something you'd want if you want to reflect how humans use language. Both news-style and encyclopedia-style writing represent a pretty narrow slice.

                                                  • creshal a day ago

                                                    That's why search engines rated them highly, and why a million spam sites cropped up that paid writers $1/essay to pretend to be Aunt May, and why today every recipe website has a gigantic useless fake essay in front of their copypasted made up recipes.

                                                    • Freak_NL a day ago

                                                      I hate how looking for recipes has become so… disheartening. Online recipes are fine for reputable sources like newspapers where professional recipe writers are paid for their contributions, but searching for some Aunt May's recipe for 'X' in the big ocean of the internet is pointless — too much raw sewage dumped in.

                                                      It sucks, because sharing recipes seemed like one of those things the internet could be really good at.

                                                      • c6400sc 12 hours ago

                                                        It's interesting to search for recipes in other languages and not find junk as we do in English.

                                                        I read Spanish and Italian fluently and stumble my way through Japanese (with translation). It's easier to find a good recipe in these languages, provided you can find the ingredients or substitutes.

                                                        • smallerfish a day ago

                                                          There seem to be quite a few recipe sharing sites around - e.g. allrecipes.com.

                                                          • creshal a day ago

                                                            And they're all flooded with low effort trash and useless.

                                                            The only remaining reliable source - now that many newspapers are axing the remaining staff in favour of LLMs - is pre-2020 print cookbooks. Anything online or printed later must be assumed to be tainted, full of untested sewage and potentially dangerous suggestions.

                                                            • jerf a day ago

                                                              The wife and I use the internet for recipe ideas... but we hardly ever follow them directly anymore. We're no formally-trained chefs but we've been home cooks for over 20 years now, and so many of them are self-evidently bad, or distinctly suboptimal. The internet chef's aversion to flavor is a meme with us now; "add one-sixty-fourth of a teaspoon of garlic powder to your gallon of soup, and mix in two crystals of table salt". Either that or they're all getting some seriously potent spices all the time and I'd like to know where they shop because my spices are nowhere near as powerful as theirs.

                                                              • halostatue a day ago

                                                                It's not just online recipes, but cookbooks written for the Better Home & Gardens crowd. The ones who write "curry powder" (and mean the yellow McCormick stuff which is so bland as to have almost no flavour) or call for one clove of garlic in their recipe.

                                                                I joke with folks that my assumption with "one clove of garlic" is that they really mean "one head of garlic" if you want any flavour. (And if the recipe title has "garlic" in it and you are using one clove, you’re lying.)

                                                                • nick3443 a day ago

                                                                  If the recipe has "garlic" in the title, I'm budgeting 1/2 head per serving.

                                                              • formerly_proven a day ago

                                                                Well there's https://www.allrecipes.com/author/chef-john/ on that particular site.

                                                                • davejohnclark a day ago

                                                                  I absolutely love Chef John. Great recipes and the cadence of his speech on YouTube (foodwishes) is very soothing, while he cooks up something amazing. If you're a home cook I highly recommend his recipes and his channel.

                                                                  • JohnFen a day ago

                                                                    Chef John is the best.

                                                            • shagie a day ago

                                                              I wish more people presented recipes like cooking for engineers. For example - Meat Lasagna https://www.cookingforengineers.com/recipe/36/Meat-Lasagna

                                                              • bhasi 21 hours ago

                                                                I love the table-diagrams at the end. I've never seen anything like that until now and it really seems useful for visualization of the recipe and the sequence of steps.

                                                                • tirant 8 hours ago

                                                                  Interestingly my wife has been writing recipes on post-it notes for years in that same style, with arrows instead of tables. And she's the opposite to an Engineer, a psychologist (interest in people vs objects).

                                                                  When I saw them, they blew my mind. Short to store and easy to understand.

                                                                  • shagie 20 hours ago

                                                                    Combined with pictures for what each step should look like. I had a few of these pages printed out back in the '00s for some recipes that I did.

                                                                  • grues-dinner a day ago

                                                                    And here I thought my defacement of printed recipes by bracketing everything that goes together at each stage was just me. There are, well, maybe not dozens but at least two of us! Saves a lot of bowls when you know without further checking that you can, say, just dump the flour and sugar, butter and eggs into the big bowl without having to prepare separately because they're in the "1: big bowl" bracket.

                                                                    • halostatue a day ago

                                                                      Depends on what you’re doing. For best cookies, you want to cream the butter with the sugar, then add the eggs, and finally add the flour. If you’re interested and can find one, it’s worth taking a vegan baking class. You learn a lot about ingredient substitutions for baking, about what the different non-vegan ingredients are doing that you have to compensate for…and it does something that I’ve only recently started seeing happen in non-vegan baking recipes: it separates the wet ingredients from the dry ingredients.

                                                                      That is, when baking, you can usually (again, exceptions for creaming the sugar in butter, etc.) take all of your dry ingredients and mix/sift them together, and then you pour your wet ingredients in a well you’ve made in the dry ingredients (these can also usually be mixed together).

                                                                      • grues-dinner a day ago

                                                                        No need to cakesplain, that was an example with three ingredients of the top of my head, very, very obviously the exact ingredients and bracket assignments vary depending on what you are making.

                                                                        But for shortbread or fork biscuits those three could indeed all go in the bowl in one go (but that one admittedly doesn't really need a bracket because the recipe is "put in bowl, mix with hands, bake").

                                                                  • darby_nine a day ago

                                                                    Ok, but what i said is true regardless of SEO, and that SEO has also fed back into english before LLMs were a thing. If you only train on those subsets you'll also end up with a chatbot that doesn't speak in a way we'll identify as natural english.

                                                                    • actionfromafar a day ago

                                                                      Yet. Give it time. The LLMs will train our future children.

                                                                      • darby_nine 17 hours ago

                                                                        I'm sure they already are.

                                                                • Lalabadie a day ago

                                                                  The current state of things leads me to believe that Google's current ranking system has been somehow too transparent for the last 2-3 years.

                                                                  The top of search results is consistently crowded by pages that obviously game ranking metrics instead of offering any value to humans.

                                                                • rockskon 12 hours ago

                                                                  Don't forget Google's adsense rules which penalized useful straightforward websites and mandated websites be full of "content". Doesn't matter if the "content" is garbage nonsense rambling and excessive word use - it's content and much more likely to be okayed by adsense!

                                                                  • sahmeepee 19 hours ago

                                                                    Prior to Google we had Altavista and in those days it was incredibly common to find keywords spammed hundreds of times in white text on a white background in the footer of a page. SEO spam is not new, it's just different.

                                                                    • ToucanLoucan a day ago

                                                                      This feels like a second, magnitudes larger Eternal September. I wonder how much more of this the Internet can take before everyone just abandons it entirely. My usage is notably lower than it was in even 2018, it's so goddamn hard to find anything worth reading anymore (which is why I spend so much damn time here, tbh).

                                                                      • wpietri a day ago

                                                                        I think it's an arms race, but it's an open question who wins.

                                                                        For a while I thought email as a medium was doomed, but spammers mostly lost that arms race. One interesting difference is that with spam, the large tech companies were basically all fighting against it. But here, many of the large tech companies are either providing tools to spammers (LLMs) or actively encouraging spammy behaviors (by integrating LLMs in ways that encourage people to send out text that they didn't write).

                                                                        • jsheard a day ago

                                                                          The fight against spam email also led to mass consolidation of what was supposed to be a decentralised system though. Monoliths like Google and Microsoft now act as de-facto gatekeepers who decide whether or not you're allowed to send emails, and there's little to no transparency or recourse to their decisions.

                                                                          There's probably an analogy to be made about the open decentralised internet in the age of AI here, if it gets to the point that search engines have to assume all sites are spam by default until proven otherwise, much like how an email server is assumed guilty until proven innocent.

                                                                          • jerf a day ago

                                                                            Another problem with this arms race is that spam emails actually are largely separable from ham emails for most people... or at least they were, for most of their run. The thousandth email that claims the UN has set aside money for me due to my non-existent African noble ancestry that they can't find anyone to give it to and I just need to send the Thailand embassy some money to start processing my multi-million yuan payout and send it to my choice of proxy in Colombia to pick it up is quite different from technical conversation about some GitHub issue I'm subscribed to, on all sorts of metrics.

                                                                            However, the frontline of the email war has shifted lately. Now the most important part of the war is being fought over emails that look just like ham, but aren't. Business frauds where someone convinces you that they are the CEO or CFO or some VP and they need you to urgently buy this or that for them right now no time to talk is big business right now, and before you get too high-and-mighty about how immune you are to that, they are now extremely good at looking official. This war has not been won yet, and to a large degree, isn't something you necessarily win by AI either.

                                                                            I think there's an analogy here to the war on content slop. Since what the content slop wants is just for you to see it so they can serve you ads, it doesn't need anything else that our algorithms could trip on, like links to malware or calls to action to be defrauded, or anything else. It looks just like the real stuff, and telling that it isn't could require a human rather vast amounts of input just to be mostly sure. Except we don't have the ability to authenticate where it came from. (There is no content authentication solution that will work at scale. No matter how you try to get humans to "sign their work" people will always work out how to automate it and then it's done.) So the one good and solid signal that helps in email is gone for general web content.

                                                                            I don't judge this as a winning scenario for the defenders here. It's not a total victory for the attackers either, but I'd hesitate to even call an advantage for one side or the other. Fighting AI slop is not going to be easy.

                                                                            • ToucanLoucan a day ago

                                                                              > but spammers mostly lost that arms race

                                                                              I'm not saying this is impossible but that's going to be an uphill sell for me as a concept. According to some quick stats I checked I'm getting roughly 600 emails per day, about 550 of which go directly to spam filtering, and of the remaining 50, I'd say about 6 are actually emails I want to be receiving. That's an impressive amount overall for whoever built this particular filter, but it's also still a ton of chaff to sort wheat from and as a result I don't use email much for anything apart from when I have to.

                                                                              Like, I guess that's technically usable, I'm much happier filtering 44 emails than 594 emails? But that's like saying I solved the problem of a flat tire by installing a wooden cart wheel.

                                                                              It's also worth noting there that if I do have an email thats flagged as spam that shouldn't be, I then have to wade through a much deeper pond of shit to go find it as well. So again, better, but IMO not even remotely solved.

                                                                              • dhosek a day ago

                                                                                I’m not sure what you’ve done to get that level of spam, but I get about 10 spam emails a day at most and that’s across multiple accounts including one that I’ve used for almost 30 years and had used on Usenet which was the uber-spam magnet. A couple newer (10–15 year old) addresses which I’ve published on webpages with mailto links attract maybe one message a week and one that I keep for a specialized purpose (fiction and poetry submissions) gets maybe one to two messages per year, mostly because it’s of the form example@example.com so easily guessed by enterprising spammers.

                                                                                Looking at the last days’ spam¹ I have three 419-style scams (widows wanting to give away their dead husbands’ grand piano or multi-million euro estate) and three phishing attempts. There are duplicate messages in each category.

                                                                                About fifteen years ago, I did a purge of mailing list subscriptions and there’s very little that comes in that I don’t want, most notably a writer who’s a nice guy, but who interpreted my question about a comment he made on a podcast as an invitation to be added to his manually managed email list and given that it’s only four or five messages a year, I guess I can live with that.

                                                                                1. I cleaned out spam yesterday while checking for a confirmation message from a purchase.

                                                                                • wpietri a day ago

                                                                                  I'm having a hard time finding reliably sourced statistics here, but I suspect you're an outlier. My personal numbers are way better, both on Gmail and Fastmail, despite using the same email addresses for decades.

                                                                                • pyrale a day ago

                                                                                  > but spammers mostly lost that arms race.

                                                                                  Advertising in your mails isn't Google's.

                                                                                • BeFlatXIII a day ago

                                                                                  I hope this trend accelerates to force us all into grass-touching and book-reading. The sooner, the better.

                                                                                  • MrLeap 17 hours ago

                                                                                    Books printed before 2018, right?

                                                                                    I already find myself mentally filtering out audible releases after a certain date unless they're from an author I recognize.

                                                                                • redbell 18 hours ago

                                                                                  > ML/LLM is the second iteration of writing pollution. The first was humans writing for corporate bots, not other humans.

                                                                                  Based on the process above, naturally, the third iteration then is LLMs writing for corporate bots, neither for humans nor for other LLMs.

                                                                                  • rgrieselhuber a day ago

                                                                                    Indexability is orthogonal to readability.

                                                                                    • hk__2 a day ago

                                                                                      It should be, but sadly it’s not.

                                                                                    • pphysch 21 hours ago

                                                                                      It's crazy to attribute the downfall of the web/search to Google. What does Google have to do with all the genuine open web content, Google's source of wealth, getting starved by (increasingly) walled gardens like Facebook, Reddit, Discord?

                                                                                      I don't see how Google's SEO rules being written or unwritten has any bearing. Spammers will always find a way.

                                                                                      • krelian a day ago

                                                                                        >And yet LLMs were still fed articles written for Googlebot, not humans.

                                                                                        How do we know what content LLMs were fed? Isn't that a highly guarded secret?

                                                                                        Won't the quality of the content be paramount to the quality of the generated output or does it not work that way?

                                                                                        • GTP a day ago

                                                                                          We do know that the open web consitutes the bulk of the trainig data, although we don't get to know the specific webpages that got used. Plus some more selected sources, like books, of which again we only know that those are books but not which books were used. So it's just a matter of probability that there was a good amount of SEO spam as well.

                                                                                      • jgrahamc a day ago

                                                                                        I created https://lowbackgroundsteel.ai/ in 2023 as a place to gather references to unpolluted datasets. I'll add wordfreq. Please submit stuff to the Tumblr.

                                                                                        • LeoPanthera 21 hours ago

                                                                                          Congratulations on "shipping", I've had a background task to create pretty much exactly this site for a while. What is your cutoff date? I made this handy list, in research for mine:

                                                                                            2017: Invention of transformer architecture
                                                                                            June 2018: GPT-1
                                                                                            February 2019: GPT-2
                                                                                            June 2020: GPT-3
                                                                                            March 2022: GPT-3.5
                                                                                            November 2022: ChatGPT
                                                                                          
                                                                                          You may want to add kiwix archives from before whatever date you choose. You can find them on the Internet Archive, and they're available for Wikipedia, Stack Overflow, Wikisource, Wikibooks, and various other wikis.
                                                                                          • jgrahamc 7 hours ago

                                                                                            I was taking "Release of ChatGPT" as the Trinity date.

                                                                                          • VyseofArcadia a day ago

                                                                                            Clever name. I like the analogy.

                                                                                            • freilanzer a day ago

                                                                                              I don't seem to get it.

                                                                                              • ziddoap a day ago

                                                                                                Steel without nuclear contamination is sought after, and only available from pre-war / pre-atomic sources.

                                                                                                The analogy is that data is now contaminated with AI like steel is now contaminated with nuclear fallout.

                                                                                                https://en.wikipedia.org/wiki/Low-background_steel

                                                                                                >Low-background steel, also known as pre-war steel[1] and pre-atomic steel,[2] is any steel produced prior to the detonation of the first nuclear bombs in the 1940s and 1950s. Typically sourced from ships (either as part of regular scrapping or shipwrecks) and other steel artifacts of this era, it is often used for modern particle detectors because more modern steel is contaminated with traces of nuclear fallout.[3][4]

                                                                                                • umvi a day ago

                                                                                                  > and only available from pre-war / pre-atomic sources.

                                                                                                  From the same wiki you linked:

                                                                                                  "Since the end of atmospheric nuclear testing, background radiation has decreased to very near natural levels, making special low-background steel no longer necessary for most radiation-sensitive uses, as brand-new steel now has a low enough radioactive signature"

                                                                                                  and

                                                                                                  "For the most demanding items even low-background steel can be too radioactive and other materials like high-purity copper may be used"

                                                                                                  • sergiotapia a day ago

                                                                                                    reading stuff like this makes me so happy. no matter how fucked up something may be there is always a way to clean right up.

                                                                                                    • shreddit 7 hours ago

                                                                                                      I wouldn't be so optimistic. The thing you call "way" is actually just time. Yes, anything humanity does (good or bad) will fade with time. But do we have the amount of time to clean up X (and i don't refer to X as in "formally twitter")?

                                                                                                      • felbane a day ago

                                                                                                        glances nervously at atmospheric CO2

                                                                                                        • genewitch 16 hours ago

                                                                                                          the easiest solution is growing dense vegetation including trees, then using that for things[0] or burying it until we have a better mitigation strategy for atmospheric carbon.

                                                                                                          Another solution, and one that, if i weren't such a lazy, is ocean based carbon binding. You can run electricity directly through ocean water and precipitate the carbon out as calcium carbonate, which is both: useful to humans as is and after processing; and useful to the coral reefs and crustaceans/mollusks or whatever in the oceans.

                                                                                                          If anyone wants to kick me about a million US dollars, i can make a POC on a used barge with solar panels and as much recycled material as possible, and have that just run off the coast of florida or something. I figure the total cost to get a barge is around a quarter million, all-in[1], the electronics and seawater stuff is about another $150-200 thousand, and the rest is mine for the idea and the lawyers' to get this approved and left alone to do the research.

                                                                                                          [0] burning it for heat is fine, as the net CO2 levels will remain constant, but i mean things like houses and boardwalks and boats, furniture, and so on.

                                                                                                          [1] could be more, now, the last time i was researching seaworthy barge costs it was between $100,000 and $200,000. I'm hoping there's someone that can donate the barge so i can make the rest more fit for purpose - redundancy, better solar, better mppt, better batteries, better materials for the electrodes (it takes platinum and titanium iirc, i haven't looked at my documents for a long while.)

                                                                                                          • heavensteeth 9 hours ago

                                                                                                            The earth will recover. We may not, but earth will.

                                                                                                            • more-coffee 7 hours ago

                                                                                                              And in a few million years, the next intelligent life form will examine remains of human texts, and wonder: with all the tools and knowledge they possessed, how could they not have prevented their demise?

                                                                                                              (Sorry for pessimism and offtopicism)

                                                                                                              • CAP_NET_ADMIN 3 hours ago

                                                                                                                We are but puny agents of entropy.

                                                                                                      • swyx a day ago

                                                                                                        and I applied to LLMs here: https://www.latent.space/p/nov-2023

                                                                                                      • AlphaAndOmega0 a day ago

                                                                                                        It's a reference to the practise of scavenging steel from sources that were produced before nuclear testing began, as any steel produced afterwards is contaminated with nuclear isotopes from the fallout. Mostly ship wrecks, and WW2 means there are plenty of those. The pun in question is that his project tries to source text that hasn't been contaminated with AI generated material.

                                                                                                        https://en.m.wikipedia.org/wiki/Low-background_steel

                                                                                                        • ms512 a day ago

                                                                                                          After the detonation of the first nuclear weapons, any newly produced steel has a low dose of nuclear fallout.

                                                                                                          For applications that need to avoid the background radiation (like physics research), pre atomic age steel is extracted, like from old shipwrecks.

                                                                                                          https://en.m.wikipedia.org/wiki/Low-background_steel

                                                                                                          • GreenWatermelon a day ago

                                                                                                            From the blog

                                                                                                            > Low Background Steel (and lead) is a type of metal uncontaminated by radioactive isotopes from nuclear testing. That steel and lead is usually recovered from ships that sunk before the Trinity Test in 1945.

                                                                                                            • voytec a day ago

                                                                                                              To whomever downvoted parent: please don't act against people brave enough to state that they don't know something.

                                                                                                              This is a desired quality, increasingly less present in IT work environments. People afraid of being shamed for stating knowledge gaps are not the folks you want to work with.

                                                                                                              • umvi a day ago

                                                                                                                I feel like there's a minimum "due diligence" bar to meet though before asking, otherwise it comes across as "I'm too lazy to google the reference and connect the dots myself, but can someone just go ahead and distill a nice summary for me"

                                                                                                                • voytec a day ago

                                                                                                                  In this particular case, I was out of the loop regarding the clever analogy myself. I'm now a tad smarter because someone else expressed lack of understanding, and I learned from responses to this (grayed due to downvotes) comment.

                                                                                                                  • PhunkyPhil a day ago

                                                                                                                    The problem is that the answer was a really easy google. I didn't know what low background steel was and I just googled it.

                                                                                                                    • cwillu 17 hours ago

                                                                                                                      A person asking the question here means there are now several good succinct explanations of it here.

                                                                                                                    • input_sh 17 hours ago

                                                                                                                      But it's right there in the header, you could just click the link and find out on the top of the webpage.

                                                                                                                    • waveBidder 10 hours ago

                                                                                                                      modern polite way of saying rtfm

                                                                                                                  • KeplerBoy a day ago

                                                                                                                    Steel made before atmospheric tests of nuclear bombs were a thing is referred to as low background steel and invaluable for some applications.

                                                                                                                    LLMs pollute the internet like atomic bombs polluted the environment.

                                                                                                                    • cdman a day ago
                                                                                                                  • astennumero a day ago

                                                                                                                    That's exactly the opposite of what the author wanted IMO. The author no more wants to be a part of this mess. Aggregating these sources would just makes it so much more easier for the tech giants to scrape more data.

                                                                                                                    • iak8god 19 hours ago

                                                                                                                      The main concerns expressed in Robyn's note, as I read them, seem to be 1) generative AI has polluted the web with text that was not written by humans, and so it is no longer feasible to produce reliable word frequency data that reflects how humans use natural language; and 2) simultaneously, sources of natural language text that were previously accessible to researchers are now less accessible because the owners of that content don't want it used by others to create AI models without their permission. A third concern seems to be that support for and practice of any other NLP approaches is vanishing.

                                                                                                                      Making resources like wordfreq more visible won't exacerbate any of these concerns.

                                                                                                                      • rovr138 a day ago

                                                                                                                        The sources are just aggregated. The source doesn't change.

                                                                                                                        The new stuff generated does (and this is honestly already captured).

                                                                                                                        This author doesn't generate content. They analyze data from humans. That "from humans" is the part that can't be discerned enough and thus the project can't continue.

                                                                                                                        Their research and projects are great.

                                                                                                                      • Der_Einzige 10 hours ago

                                                                                                                        FYI: My two datasets, DebateSum and OpenDebateEvidence/OpenCaseList in their current forms qualify for this, as they end at latest in 2022.

                                                                                                                        • jgrahamc 7 hours ago

                                                                                                                          You can either add them to the site yourself via Tumblr or send them to me via email (jgc@cloudflare).

                                                                                                                        • imhoguy a day ago

                                                                                                                          I am not sure we should trust a site contaminated by AI graphics. /s

                                                                                                                          • gorkish a day ago

                                                                                                                            The buildings and shipping containers that store low background steel aren't built out of the stuff either.

                                                                                                                            • whywhywhywhy a day ago

                                                                                                                              Yeah pay an illustrator if this is important to you.

                                                                                                                              See a lot of people upset about AI still using AI image generation because it's not in their field so they feel less strongly about it and can't create art themselves anyway, hypocritical either use it or don't but don't fuss over it then use it for something thats convenient for you.

                                                                                                                              • imhoguy a day ago

                                                                                                                                I have updated my comment with "/s" as that is closer to what I've meant. However, seriously, from ethical point of view it is unlikely illustrators were asked or compensated for their work being used for training AI to produce the image.

                                                                                                                                • heckelson a day ago

                                                                                                                                  I thought the header image was a symbol of AI slop contamination because it looked really off-putting

                                                                                                                            • ClassyJacket 17 hours ago

                                                                                                                              :'( I thought I was clever for realising this parallel myself! Guess it's more obvious than I thought.

                                                                                                                              Another example is how data on humans after 2020 or so can't be separated by sex because gender activists fought to stop recording sex in statistics on crime, medicine, etc.

                                                                                                                              • thebruce87m 9 hours ago

                                                                                                                                I too realised this parallel and frequently tell people about it.

                                                                                                                                Edit: just the first one

                                                                                                                                • sweeter 17 hours ago

                                                                                                                                  This is a psychotic thing to say without a source, considering how it's blatantly untrue.

                                                                                                                              • jll29 a day ago

                                                                                                                                I regret the situation led to the OP feel discourage about the NLP community, wo which I belong, and I just want to say "we're not all like that", even though it is a trend and we're close to peak hype (slightly past even?).

                                                                                                                                The complaint about pollution of the Web with artificial content is timely, and it's not even the first time due to spam farms intended to game PageRank, among other nonsense. This may just mean there is new value in hand-curated lists of high-quality Web sites (some people use the term "small Web").

                                                                                                                                Each generation of the Web needs techniques to overcome its particular generation of adversarial mechanisms, and the current Web stage is no exception.

                                                                                                                                When Eric Arthur Blair wrote 1984 (under his pen name "George Orwell"), he anticipated people consuming auto-generated content to keep the masses from away from critical thinking. This is now happening (he even anticipated auto-generated porn in the novel), but the technologies criticized can also be used for good, and that is what I try to do in my NLP research team. Good will prevail in the end.

                                                                                                                                • solardev a day ago

                                                                                                                                  Have "good" small webs EVER prevailed?

                                                                                                                                  Every content system seems to get polluted by noise once it hits mainstream usage: IRC, Usenet, reddit, Facebook, geocities, Yahoo, webrings, etc. Once-small curated selections eventually grow big enough to become victims of their own successes and taken over by spam.

                                                                                                                                  It's always an arms race of quality vs quantity, and eventually the curators can't keep up with the sheer volume anymore.

                                                                                                                                  • squigz a day ago

                                                                                                                                    > Have "good" small webs EVER prevailed?

                                                                                                                                    You ask on HN, one of the highest quality sites I've ever visited in any age of the Internet.

                                                                                                                                    IRC is still alive and well among pretty much the same audience as always. I'm not sure it's fair to compare that with the others.

                                                                                                                                    • solardev a day ago

                                                                                                                                      Well, niche forums are kinda different when they manage to stay small and niche. Not just HN but car forums, LED forums, etc.

                                                                                                                                      But if they ever include other topics, they risk becoming more mainstream and noisy. Even within adjacent fields (like the various Stacks) it gets pretty bad.

                                                                                                                                      Maybe the trick is to stay within a single small sphere then and not become a general purpose discussion site? And to have a low enough volume of submissions where good moderation is still possible? (Thank you dang and HN staff)

                                                                                                                                      • squigz a day ago

                                                                                                                                        I'm not entirely sure it's about content (while HN is certainly tech-focused, politics, health, philosophy all come up with regularity) or even content moderation, although they both certainly play a part (particularly the moderation around here. Thanks, staff!)

                                                                                                                                        I wonder if it is more to do with the community itself. HN users tend to have very intelligent discussions on pretty much anything, and discourages shitty, unnuanced, one-line takes. This, coupled with a healthy moderation system, makes it hard for the lower quality discussion to break in and override the good stuff.

                                                                                                                                        • nick3443 a day ago

                                                                                                                                          The car headlight forums seem to expose the weakness of small web though, in that a lot of the forums that show up in search are "sponsored" by one or two major brands and any open discussion or validation of off-brand solutions, AliExpress parts, etc are quickly shunned or banned.

                                                                                                                                          • rovr138 a day ago

                                                                                                                                            Yes. That's the small web.

                                                                                                                                            A good example of the generalization problem you discuss is reddit.

                                                                                                                                            You have to unsubscribe from all the defaults and find the small, niche, communities about specific topics. If not, it's the same stuff, reposted, over and over, across different subs and/or social sites.

                                                                                                                                          • bongodongobob a day ago

                                                                                                                                            It's high quality when the content is within HN's bubble. Anything related to health, politics, or Microsoft is full of misinformation, ignorance, and garbage like any other site. The Microsoft discussions in particular are extremely low quality.

                                                                                                                                            • tdb7893 14 hours ago

                                                                                                                                              When economics has come up I've been curious and asked my brother about some of the stuff in the more upvoted comments (he has his PhD in economics with a focus on labor specifically) his reaction has always been something like "that doesn't match my understanding of that" or "I think their analysis is a bit oversimplified".

                                                                                                                                              My experience here is that it's pretty good for things outside of tech (at least better than the average internet) but definitely not great.

                                                                                                                                              • nerdponx 7 hours ago

                                                                                                                                                I don't have a PhD but I do have some background in economics, and economics is consistently one of the worst areas on HN. I think it's representative of society in general. There's something about economics that makes it feel like you can just reason through it with common sense, whereas that's rarely true in reality.

                                                                                                                                              • Retric a day ago

                                                                                                                                                IMO HN actually scores quite highly in terms of health/politics and so forth content because the both mainstream and fringe ideas get both shown and pushback.

                                                                                                                                                A vaping discussion brought up glycerin used was safe and the same thing used in smoke machines and someone else brought up a study showing that smoke machines are an occasional safety issue. Nowhere near every discussion goes that well but stick around and you’ll see in-depth discussion.

                                                                                                                                                Go to a public health website by comparison and you’ll see warnings without context and a possibility positive spin compared to smoking. https://www.cdc.gov/tobacco/e-cigarettes/index.html I suspect most people get basically nothing from looking at it.

                                                                                                                                                • mandevil a day ago

                                                                                                                                                  As a software engineer married to a healthcare professional, I disagree strongly about the quality of the healthcare discussions here. A whole lot of the conversation is software engineers who think that they can reason from first principles in two minutes about this thing that professionals dedicate their whole lives to mastering, and who therefore don't understand the most basic concepts of the field.

                                                                                                                                                  Sometimes I try and engage, but honestly, mostly I think it's not worth it. Otherwise you end up doing this with your life: https://xkcd.com/386/

                                                                                                                                                  • vladms 20 hours ago

                                                                                                                                                    > about this thing that professionals dedicate their whole lives to mastering

                                                                                                                                                    After doing some healthcare work I ended up understanding that some topics are not well known even by the professionals dedicating their whole lives to that because there are big gaps in the human knowledge on the topics.

                                                                                                                                                    I agree that people that think they can reason in two minutes about anything are a problem, but it's not a healthcare only issue (same happens for politics, economics, environment, etc.)

                                                                                                                                                    Engineers have the luck to work in the field where many things have a clear, known explanation (although, try to make an estimation about how long a team will implement a feature, and everybody will come up with something else).

                                                                                                                                                    • mandevil 19 hours ago

                                                                                                                                                      As to the uncertainty and mysteries, you are 100% correct. One of the big failure modes for engineers in dealing with human health is the assumption that things are as simple and logical as the stuff we build, when it's simply not at all like that. There are (1) big arguments over basic things like "why do SSRI's work?" Outside of LLM's I can't think of a thing in software where we are still arguing about why things work in production. We never say "Why does Postgres work?" in the same way. (2)

                                                                                                                                                      And yes, this is true for many other areas of discussion at HN. It's just that it is most obvious to me in the area that my wife specializes in, because I pick up enough via osmosis from her to know when other people don't even have my limited level of understanding.

                                                                                                                                                      1: Or at least were 15 years ago when my wife told me about it- the argument might have been largely concluded and she just never updated me since I don't keep up with the medical literature the way she does.

                                                                                                                                                      2: Two decades ago there was a huge push for the "human genome project" under the basis that this would be "reading the blueprints for human life" and that would give us all of these medical breakthroughs. Basically none of those breakthroughs happened because we've spent the past 20 years learning all of the different ways that it is NOT a blueprint and that cells do things very differently from human engineers.

                                                                                                                                                      • vladms 6 hours ago

                                                                                                                                                        Regarding the human genome project specifically it was research and no matter what was claimed (give us all of these medical breakthroughs) we (as the public) should understand there is no guarantee. Similarly to how most tech startups propose plans that lead to huge scales and ROI, but nobody is amazed when 3-4 years later they have a modest revenue (the lucky ones).

                                                                                                                                                        The benefits for understanding more about genomes are growing (ex: list of adverse effects based on genotype https://go.drugbank.com/pharmaco/genomics) but the field is/(was) so chaotic (just one example: there was not one standard about how to count: https://tidyomics.com/blog/2018/12/09/2018-12-09-the-devil-0...) and so lacking data that it will take many years to reap the benefits (ex: one of the largest study UK Bio bank gave access to researchers only in 2017 - https://en.wikipedia.org/wiki/UK_Biobank)

                                                                                                                                                    • Retric 21 hours ago

                                                                                                                                                      Spend time with medical researchers and they start disparaging Doctors. Everyone wants that one authoritative source free from bias, but IMO even having a few voices in the crowd worth listening to beats most other options.

                                                                                                                                                    • chimeracoder a day ago

                                                                                                                                                      > IMO HN actually scores quite highly in terms of health/politics and so forth content because the both mainstream and fringe ideas get both shown and pushback.

                                                                                                                                                      As someone with domain expertise here, I wholeheartedly disagree. HN is very bad at percolating accurate information about topics outside its wheelhouse, like clinical medicine, public health, or the natural sciences. It is also, simultaneously, extremely prone to overestimating its own collective competency at understanding technical knowledge outside its domain. In tandem, those two make for a rather dangerous combination.

                                                                                                                                                      Anytime I see a post about a topic within my area of specialty, I know to expect articulate, lengthy, and completely misguided or inaccurate comments dominating the discussion. It's enough of a problem that trying to wade in and correct them is a losing battle; I rarely even bother these days.

                                                                                                                                                      It's kind of funny that XKCD #793[0] is written about physicists, because the effect is way worse with software engineers.

                                                                                                                                                      [0] https://xkcd.com/793/

                                                                                                                                                      • matrix87 11 hours ago

                                                                                                                                                        people don't normally talk about healthcare on here so I'm not really sure what you're referring to or what your specialty is

                                                                                                                                                        • Retric a day ago

                                                                                                                                                          Obviously on an objective scale HN isn’t good, but nobody is doing a good job here.

                                                                                                                                                          I’ve worked on the government side of this stuff and find it disheartening.

                                                                                                                                                      • squigz a day ago

                                                                                                                                                        I disagree. Even politics spurs intelligent, nuanced discussion here on HN.

                                                                                                                                                        And to hold up discussions about MS as an example of 'extremely' low quality discussion is, ah, interesting. Do you have any recent examples of such discussions?

                                                                                                                                                        • matrix87 11 hours ago

                                                                                                                                                          > spurs intelligent, nuanced discussion here on HN

                                                                                                                                                          relative to what? reddit?

                                                                                                                                                          also there's a trade off between entropy and "quality". too much "quality" and everyone gets bored and goes somewhere more entertaining

                                                                                                                                                          • squigz 8 hours ago

                                                                                                                                                            Relative to... unintelligent discussions?

                                                                                                                                                            I also don't care if people leave because HN isn't 'entertaining' enough. I don't come here for that, and I don't expect the community members that make this place what it is to either.

                                                                                                                                                          • vundercind a day ago

                                                                                                                                                            Politics and philosophy discussions here are intelligent in that most of the commenters aren’t dumb. They tend to be entirely uneducated and resistant to the educated.

                                                                                                                                                            • bongodongobob a day ago

                                                                                                                                                              I hide every single article about MS because it's filled with all the neckbeardy tropes about their products being garbage spyware, switch to Linux, they're stealing your data, the OS is trash etc. It's comments from people who have never managed large scale MS based environments comparing their Windows Home to the other 90% of the business ecosystem that has nothing to do with home users or MS's main cash cow, businesses, Azure/Entra and M365. I'm done wasting my breath on MS here.

                                                                                                                                                      • htrp a day ago

                                                                                                                                                        Any curation mechanism that depends on passion and/or the goodwill of volunteers is unsustainable.

                                                                                                                                                        • 38 a day ago

                                                                                                                                                          its so easy to solve this problem, not sure why anyone hasnt done it yet.

                                                                                                                                                          1. build a userbase, free product

                                                                                                                                                          2. once userbase get big enough, any new account requires a monthly fee, maybe $1

                                                                                                                                                          3. keep raising the fee higher and higher, until you get to the point that the userbase is manageable.

                                                                                                                                                          no ads, simple.

                                                                                                                                                          • abridges6523 a day ago

                                                                                                                                                            This sounds like a good idea. I do wonder if enough people would sign up for it to be a worthy venture because I think the main issue with ads is I think once you add any price at all dramatically reduces participation even if it’s not about cost some people just see the payment and immediately disengage.!

                                                                                                                                                            • jachee a day ago

                                                                                                                                                              Until N ad views are worth more than $X account creation fee. Then the spammers will just sell ad posts for $X*1.5.

                                                                                                                                                              I can’t find it, but there’s someone selling sock puppet posts on HN even.

                                                                                                                                                          • squigz a day ago

                                                                                                                                                            > people consuming auto-generated content to keep the masses from away from critical thinking. This is now happening

                                                                                                                                                            The people who stay away from critical thinking were doing that already and will continue to do so, 'AI' content or not.

                                                                                                                                                            • psychoslave a day ago

                                                                                                                                                              I don't know, individually finely tuned addictive content served as real time interactive feedback loops is an other level of propaganda and attention capture tool than largest common denominator of the general crowd served as static passive content.

                                                                                                                                                              • squigz 21 hours ago

                                                                                                                                                                Perhaps, but the solution is the same either way, and it isn't trying to ban technology or halt progress or just sit and cry about how society is broken. It's educating each other and our children on the way these things work, how to break out of them, and how we might more responsibly use the technology.

                                                                                                                                                              • trehalose a day ago

                                                                                                                                                                How did they get started?

                                                                                                                                                                • squigz a day ago

                                                                                                                                                                  They likely never started critically thinking, so they never had to get started on not doing so.

                                                                                                                                                                  (If children are never taught to think critically, then...)

                                                                                                                                                                  • sweeter a day ago

                                                                                                                                                                    It's almost like its a systemic failure that is artificially created so that people wont think critically... hmmm

                                                                                                                                                                    • vladms 20 hours ago

                                                                                                                                                                      > is artificially created

                                                                                                                                                                      You imply that thousands of year ago everybody was thinking critically?

                                                                                                                                                                      Thinking critically is hard, stressful and might take some joy from your life.

                                                                                                                                                                      • sweeter 17 hours ago

                                                                                                                                                                        I'm not sure how that would imply anything about the past. We as a society have spent decades defanging the public school system through changing school to be test score driven, tying a schools funding to the local property value, making them less effective and less safe, choking them out financially etc... it should be no surprise that children are not equipped to navigate modern life. I've been though these systems, they are deeply flawed.

                                                                                                                                                                      • squigz a day ago

                                                                                                                                                                        Yeah, it's almost like it has nothing to do with AI

                                                                                                                                                                • Llamamoe a day ago

                                                                                                                                                                  > Good will prevail in the end.

                                                                                                                                                                  Even if, this is a dangerous thought that discourages decisive action that is likely to be necessary for this to happen.

                                                                                                                                                                  • sweeter a day ago

                                                                                                                                                                    tangentially related, but Marx also predicted that crypto and NFT's would exist in 1894 [1] and I only bring it up because its kind of wild how we keep crossing these "red lines" without even blinking. It's like that meme:

                                                                                                                                                                    Sci-fi author:

                                                                                                                                                                    I created the Torment Nexus to serve as a cautionary tale...

                                                                                                                                                                    Tech Company:

                                                                                                                                                                    Alas, we have created the Torment Nexus from the classic Sci-fi novel "Don't Create the Torment Nexus"

                                                                                                                                                                    1. https://www.marxists.org/archive/marx/works/1894-c3/ch25.htm

                                                                                                                                                                    • Intralexical 17 hours ago

                                                                                                                                                                      What if the way for good to prevail is to reject technologies and beliefs that have become destructive?

                                                                                                                                                                    • 0xbadcafebee a day ago

                                                                                                                                                                      I'm going to call it: The Web is dead. Thanks to "AI" I spend more time now digging through searches trying to find something useful than I did back in 2005. And the sites you do find are largely garbage.

                                                                                                                                                                      As a random example: just trying to find a particular popular set of wireless earbuds takes me at least 10 minutes, when I already know the company, the company's website, other vendors that sell the company's goods, etc. It's just buried under tons of dreck. And my laptop is "old" (an 8-core i7 processor with 16GB of RAM) so it struggles to push through graphics-intense "modern" websites like the vendor's. Their old website was plain and worked great, letting me quickly search through their products and quickly purchase them. Last night I literally struggled to add things to cart and check out; it was actually harrowing.

                                                                                                                                                                      Fuck the web, fuck web browsers, web design, SEO, searching, advertising, and all the schlock that comes with it. I'm done. If I can in any way purchase something without the web, I'mma do that. I don't hate technology (entirely...) but the web is just a rotten egg now.

                                                                                                                                                                      • Vegenoid 17 hours ago

                                                                                                                                                                        On Amazon, you used to be able to search the reviews and Q&A section via a search box. This was immensely useful. Now, that search box first routes your search to an LLM, which makes you wait 10-15 seconds while it searches for you. Then it presents its unhelpful summary, saying "some reviews said such and such", and I can finally click the button to show me the actual reviews and questions with the term I searched.

                                                                                                                                                                        This is going to be the thing that makes me quit Amazon. If I'm missing something and there's still a way to to a direct search, please tell me.

                                                                                                                                                                        • graeme 2 hours ago

                                                                                                                                                                          Ran into this the other day. Amazon.ca still has the old version for now

                                                                                                                                                                          • cosmotron 11 hours ago

                                                                                                                                                                            You can still get to product reviews directly and search them. Here's an example:

                                                                                                                                                                            Product page (copy the identifier at the end): https://www.amazon.com/Long-Thanks-Hitchhikers-Guide-Galaxy-...

                                                                                                                                                                            Review page (paste the identifier at the end): https://www.amazon.com/product-reviews/B001OF5F1E/

                                                                                                                                                                            This seems to bypass all of the LLM stuff for now.

                                                                                                                                                                            • Vegenoid 10 hours ago

                                                                                                                                                                              Pretty good! Unfortunately it does not include the Q&As, which are often just as useful as the reviews.

                                                                                                                                                                          • bbarn 21 hours ago

                                                                                                                                                                            No disagreement for the most part.

                                                                                                                                                                            I used to be able to say search for Trek bike derailleur hanger and the first result would be what I wanted. Now I have to scroll past 5 ads to buy a new bike, one that's a broken link to a third party, and if I'm really lucky, at the bottom of page 1 will be the link to that part's page.

                                                                                                                                                                            The shitification of the web is real.

                                                                                                                                                                            • klyrs 17 hours ago

                                                                                                                                                                              R.I.P. Sheldon Brown T_T

                                                                                                                                                                              (The Agner Fog of cycling?)

                                                                                                                                                                              • bbarn 5 minutes ago

                                                                                                                                                                                He was a legend.

                                                                                                                                                                            • Gethsemane 20 hours ago

                                                                                                                                                                              Sounds like your laptop is wholly out of date, you need to buy the next generation of laptops on Amazon that can handle the modern SEO load. I recommend the:

                                                                                                                                                                              LEEZWOO 15.6" Laptop - 16GB RAM 512GB SSD PC Laptop, Quad-Core N95 Processor Up to 3.1GHz, Laptop Computers with Touch ID, WiFi, BT4.2, for Students/Business

                                                                                                                                                                              Name rolls off the tongue doesn’t it

                                                                                                                                                                              • tim333 4 hours ago

                                                                                                                                                                                Or a Macbook.

                                                                                                                                                                              • cedric_h 16 hours ago

                                                                                                                                                                                There is a startup whose product is better search. The killer feature is that you pay for it, so you aren't the produdct. https://kagi.com/welcome

                                                                                                                                                                                • codezero 15 hours ago

                                                                                                                                                                                  Can vouch for this. It’s the first non-Google search alternative I’ve used that has 100% replaced Google. I don’t need Google as a fallback like I did with others.

                                                                                                                                                                                • akkartik 10 hours ago

                                                                                                                                                                                  I've been slowly detaching myself from the web for the past 10 years. These days I mostly build offline apps using native technologies. Those capabilities are still around. They just receded for a while because they'd gotten so polluted with toolbars and malware. But now the malware is on the other side, and native apps are cool again. If you know where to look. Here's my shingle: https://akkartik.name/freewheeling-apps

                                                                                                                                                                                  On the other hand, what you call "The Web" seems to be just what you can get at through search engines. There's still the old web, the thing that's mediated by relationships and reputation rather than aggregation services with billions of users. Like the link I shared above. Or this heroically moderated site we're using right now.

                                                                                                                                                                                  • w10-1 21 hours ago

                                                                                                                                                                                    > If I can in any way purchase something without the web, I'mma do that

                                                                                                                                                                                    To get to the milk you'll have to walk by 3 rows of chips and soda.

                                                                                                                                                                                    • odo1242 21 hours ago

                                                                                                                                                                                      Yeah, this is why I still use the web to order things in a nutshell lol

                                                                                                                                                                                      • 0xbadcafebee 19 hours ago

                                                                                                                                                                                        Where do you order things online that you aren't inundated by ads?

                                                                                                                                                                                        • freddie_mercury 14 hours ago

                                                                                                                                                                                          It's a lot LOT easier for me as an adult to ignore ads online than it is for my kids in brick and mortar stores to ignore the candy and toys placed at their eye level.

                                                                                                                                                                                          • cedric_h 16 hours ago

                                                                                                                                                                                            Ad blocker. Even just putting https://12ft.io/ in front of your link gets you pretty far.

                                                                                                                                                                                            • 0xbadcafebee 15 hours ago

                                                                                                                                                                                              Ah, you mean the web version of https://en.wikipedia.org/wiki/Blinkers_(horse_tack) . I don't think that helps when you're stopped in your tracks by an upsell. Dominos won't let you order a pizza online until you've declined garlic bread, cinnamon rolls and a liter of pepsi three times. And you can't just click "pepperoni pizza near me", you have to build your pepperoni pizza, after putting in your zip code, selecting the store, carry out, then click build again, sure you don't want buffalo wings too?, ....

                                                                                                                                                                                      • matrix87 11 hours ago

                                                                                                                                                                                        > Their old website was plain and worked great, letting me quickly search through their products and quickly purchase them. Last night I literally struggled to add things to cart and check out; it was actually harrowing.

                                                                                                                                                                                        Hey, who cares about making services that work when we can give people a cool chatbot assistant and a 1800 number with no real-person alternative to the decision tree

                                                                                                                                                                                        • gazook89 20 hours ago

                                                                                                                                                                                          The web is much more than a shopping site.

                                                                                                                                                                                          • yifanl 20 hours ago

                                                                                                                                                                                            It is, but the SEO spammers who ruined the web want it to be shopping mall, and they can't even do a particularly good job at being one.

                                                                                                                                                                                          • nlpparty 15 hours ago

                                                                                                                                                                                            I suppose it is just Amazon problems. I have never lived in the area where Amazon is prevalent. Where I live, search engines still can't find synonyms or process misspellings.

                                                                                                                                                                                            • BeetleB 20 hours ago

                                                                                                                                                                                              If search is your metric, the web was dead long before OpenAI's release of GPT. I gave up on web search a long time ago.

                                                                                                                                                                                              • kristopolous 15 hours ago

                                                                                                                                                                                                for tech stuff I just use documentation, bug trackers and source code now. Web searching has become useless.

                                                                                                                                                                                              • weinzierl a day ago

                                                                                                                                                                                                "I don't think anyone has reliable information about post-2021 language usage by humans."

                                                                                                                                                                                                We've been past the tipping point when it comes to text for some time, but for video I feel we are living through the watershed moment right now.

                                                                                                                                                                                                Especially smaller children don't have a good intuition on what is real and what is not. When I get asked if the person in a video is real, I still feel pretty confident to answer but I get less and less confident every day.

                                                                                                                                                                                                The technology is certainly there, but the majority of video content is still not affected by it. I expect this to change very soon.

                                                                                                                                                                                                • frognumber a day ago

                                                                                                                                                                                                  There are a series of challenges like:

                                                                                                                                                                                                  https://www.nytimes.com/interactive/2024/09/09/technology/ai...

                                                                                                                                                                                                  https://www.nytimes.com/interactive/2024/01/19/technology/ar...

                                                                                                                                                                                                  These are a little bit unfair, in that we're comparing handpicked examples, but I don't think many experts will pass a test like this. Technology only moves forward (and seemingly, at an accelerating pace).

                                                                                                                                                                                                  What's a little shocking to me is the speed of progress. Humanity is almost 3 million years old. Homosapiens are around 300,000 years old. Cities, agriculture, and civilization is around 10,000. Metal is around 4000. Industrial revolution is 500. Democracy? 200. Computation? 50-100.

                                                                                                                                                                                                  The revolutions shorten in time, seemingly exponentially.

                                                                                                                                                                                                  Comparing the world of today to that of my childhood....

                                                                                                                                                                                                  One revolution I'm still coming to grips with is automated manufacturing. Going on aliexpress, so much stuff is basically free. I bought a 5-port 120W (total) charger for less than 2 minutes of my time. It literally took less time to find it than to earn the money to buy it.

                                                                                                                                                                                                  I'm not quite sure where this is all headed.

                                                                                                                                                                                                  • homebrewer a day ago

                                                                                                                                                                                                    > so much stuff is basically free

                                                                                                                                                                                                    It really isn't. Have a look at daily median income statistics for the rest of the planet:

                                                                                                                                                                                                    https://ourworldindata.org/grapher/daily-median-income?tab=t...

                                                                                                                                                                                                      $2.48 Eastern and Southern Africa (PIP)
                                                                                                                                                                                                      $2.78 Sub-Saharan Africa (PIP)
                                                                                                                                                                                                      $3.22 Western and Central Africa (PIP)
                                                                                                                                                                                                      $3.72 India (rural)
                                                                                                                                                                                                      $4.22 South Asia (PIP)
                                                                                                                                                                                                      $4.60 India (urban)
                                                                                                                                                                                                      $5.40 Indonesia (rural)
                                                                                                                                                                                                      $6.54 Indonesia (urban)
                                                                                                                                                                                                      $7.50 Middle East and North Africa (PIP)
                                                                                                                                                                                                      $8.05 China (rural)
                                                                                                                                                                                                      $10.00 East Asia and Pacific (PIP)
                                                                                                                                                                                                      $11.60 Latin America and the Caribbean (PIP)
                                                                                                                                                                                                      $12.52 China (urban)
                                                                                                                                                                                                    
                                                                                                                                                                                                    And more generally:

                                                                                                                                                                                                      $7.75 World
                                                                                                                                                                                                    
                                                                                                                                                                                                    I looked around on Ali, and the cheapest charger that doesn't look too dangerous costs around five bucks. So it's roughly equal to one day's income of at least half the population of our planet.
                                                                                                                                                                                                    • knodi123 a day ago

                                                                                                                                                                                                      +100w chargers are one of the products I prefer to spend a little more on, so I get something from a company that knows it can be sued if they make a product that burns down your house or fries your phone.

                                                                                                                                                                                                      Flashlights? Sure, bring on aliexpress. USB cables with pop-off magnetically attached heads, no problem. But power supplies? Welp, to each their own!

                                                                                                                                                                                                      • fph 19 hours ago

                                                                                                                                                                                                        And then you plug your cheap pop-off USB cable into the expensive 100w charger?

                                                                                                                                                                                                        • knodi123 18 hours ago

                                                                                                                                                                                                          Yeah, sure, what could possibly go wrong? :-P

                                                                                                                                                                                                          But seriously, it's harder to accidentally make a USB cable that fries your equipment. The more common failure mode is it fails to work, or wears out too fast. Chargers on the other hand, handle a lot of voltage, generate a lot of heat, and output to sensitive equipment. More room to mess up, and more room for mistakes to cause damage.

                                                                                                                                                                                                      • csomar 10 hours ago

                                                                                                                                                                                                        Democracy (and Republics) are thousands of year old. Computation is also quite old though it only sky-rocketed with electricity and semiconductors. This is not the first time the global world created a potential for exponential growth (I'll consider the Pharaohs and Roman empires to be ones).

                                                                                                                                                                                                        There is the very real possibility that everything just stalls and plateau where we are at. You know, like our population growth, it should have gone exponentially but it did not. Actually, quite the reverse.

                                                                                                                                                                                                        • bee_rider a day ago

                                                                                                                                                                                                          > One revolution I'm still coming to grips with is automated manufacturing. Going on aliexpress, so much stuff is basically free. I bought a 5-port 120W (total) charger for less than 2 minutes of my time. It literally took less time to find it than to earn the money to buy it.

                                                                                                                                                                                                          Is there a big recent qualitative change here? Or is this a continuation of manufacturing trends (also shocking, not trying to minimize it all, just curious if there’s some new manufacturing tech I wasn’t aware of).

                                                                                                                                                                                                          For some reason, your comment got me thinking of a fully automated system, like: you go to a website, pick and choose charger capabilities (ports, does it have a battery, that sort of stuff). Then an automated factor makes you a bespoke device (software picks an appropriate shell, regulators, etc). I bet we’ll see it in our lifetimes at least.

                                                                                                                                                                                                          • jodrellblank a day ago

                                                                                                                                                                                                            > "The revolutions shorten in time, seemingly exponentially."

                                                                                                                                                                                                            The Technological Singularity - https://en.wikipedia.org/wiki/Technological_singularity

                                                                                                                                                                                                            • MengerSponge 19 hours ago

                                                                                                                                                                                                              Democracy is 200? You're off by a full order of magnitude.

                                                                                                                                                                                                              Progress isn't inevitable. It's possible for knowledge to be lost and for civilization to regress.

                                                                                                                                                                                                            • apricot a day ago

                                                                                                                                                                                                              > When I get asked if the person in a video is real, I still feel pretty confident to answer

                                                                                                                                                                                                              I don't. I mean, I can identify the bad ones, sure, but how do I know I'm not getting fooled by the good ones?

                                                                                                                                                                                                              • weinzierl 21 hours ago

                                                                                                                                                                                                                That is very true, but for now we have a baseline of videos that we either remember or that we remember key details of, like the persons in the video. I'm pretty sure if I watch The Primeagen or Tom Scott today, that they are real. Ask me in year, I might not be so sure anymore.

                                                                                                                                                                                                              • olabyne a day ago

                                                                                                                                                                                                                I never thought about that. Humans losing their ability to detect AI content from reality ? It's frightening.

                                                                                                                                                                                                                • BiteCode_dev a day ago

                                                                                                                                                                                                                  It's worse because many humans don't know they are.

                                                                                                                                                                                                                  I see a lot of outrage around fake posts already. People want to believe bad things from the other tribes.

                                                                                                                                                                                                                  And we are going to feed them with it, endlessly.

                                                                                                                                                                                                                  • PhunkyPhil a day ago

                                                                                                                                                                                                                    Did you think the same thing when photoshop came out?

                                                                                                                                                                                                                    It's relatively trivial to photoshop misinformation in a really powerful and undetectable way- but I don't see (legitimate) instances of groundbreaking news over a fake photo of the president or a CEO etc doing something nefarious. Why is AI different just because it's audio/video?

                                                                                                                                                                                                                    • chowells 15 hours ago

                                                                                                                                                                                                                      "AI" is different because it's low-effort and easily automated, making it easy to absolutely flood public spaces. Quantity has a quality all its own.

                                                                                                                                                                                                                      • BiteCode_dev 9 hours ago

                                                                                                                                                                                                                        I did.

                                                                                                                                                                                                                        And it's not the grounbreaking the problem, it's the little constant lies.

                                                                                                                                                                                                                        Last week a photoshopped Musk tweet was going around, people getting all up in arms against it despite the fact it was very easy to spot as a fabricated one.

                                                                                                                                                                                                                        People didn't care, they hate the guy, they just wanted to fuel their hate more.

                                                                                                                                                                                                                        The whole planet run on fake content, magazin covers, food packaging, instagram pics of places that never looks that way...

                                                                                                                                                                                                                        And now, with AI, you can automate it and scale it up.

                                                                                                                                                                                                                        People are not ready. And in fact, they don't want to be.

                                                                                                                                                                                                                    • jerf a day ago

                                                                                                                                                                                                                      It's even worse than that. Most people have no idea how far CGI has come, and how easily it is wielded even by a couple of dedicated teens on their home computer, let alone people with a vested interest in faking something for some financial reason. People think they know what a "special effect" looks like, and for the most part, people are wrong. They know what CGI being used to create something obviously impossible, like a dinosaur stomping through a city, looks like. They have no idea how easy a lot of stuff is to fake already. AI just adds to what is already there. Heck, to some extent it has caused scammers to overreach, with things like obviously fake Elon Musk videos on YouTube generated from (pure) AI and text-to-speech... when with just a little bit more learning, practice, and amounts of equipment completely reasonable for one person to obtain, they could have done a much better fake of Elon Musk using special effects techniques rather than shoveling text into an AI. The fact that "shoveling text into an AI" may in another few years itself generate immaculate videos is more a bonus than a fundamental change of capability.

                                                                                                                                                                                                                      Even what's free & open source in the special effects community is astonishing lately.

                                                                                                                                                                                                                      • jhbadger a day ago

                                                                                                                                                                                                                        And you see things like the The Lion King remake or its upcoming prequel being called "live action" because it doesn't look like a cartoon like the original. But they didn't film actual lions running around -- it's all CGI.

                                                                                                                                                                                                                        • bee_rider a day ago

                                                                                                                                                                                                                          Plus, movies continue (for some reason) to be made with very bad and obvious CGI, leading people to believe all CGI is easy to spot.

                                                                                                                                                                                                                          • PhunkyPhil a day ago

                                                                                                                                                                                                                            This is a common survivorship bias fallacy since you only notice the bad CGI.

                                                                                                                                                                                                                            I'm certain you'd be shocked to see the amount of CG that's in some of your favorite movies made in the last ~10-20 years that you didn't notice because it's undetectable

                                                                                                                                                                                                                            • xsmasher 20 hours ago

                                                                                                                                                                                                                              This is an amazing demo reel of effects shots used in "mundane" TV shows - comedies and produce procedurals. - for faking locations.

                                                                                                                                                                                                                              https://www.youtube.com/watch?v=clnozSXyF4k

                                                                                                                                                                                                                              • vundercind 12 hours ago

                                                                                                                                                                                                                                Luckily, for those of us who prefer when film photography meant at least mostly actually filming things, there’s plenty of very good film and TV (and even more of lesser quality) to keep a person occupied for a couple lifetimes.

                                                                                                                                                                                                                                • bee_rider 20 hours ago

                                                                                                                                                                                                                                  That is really something even as somebody who expects lots of CGI touch-up in sets.

                                                                                                                                                                                                                                  • coderedart 4 hours ago

                                                                                                                                                                                                                                    I hate this. I did not notice the vast majority of them. So many backgrounds/sets are just green screens :(

                                                                                                                                                                                                                                    • ars 13 hours ago

                                                                                                                                                                                                                                      And keep in mind - that video is 14 years old!

                                                                                                                                                                                                                                    • bee_rider 20 hours ago

                                                                                                                                                                                                                                      I won’t be, I’m aware that lots of movies are mostly CGI.

                                                                                                                                                                                                                                      But, yeah, I do think it is some kind of bias. Maybe not survivorship, though… maybe it is a generalized sort of Malmquist bias? Like the measurement is not skewed by the tendency of movies with good CGI to go away. It is skewed by the fact that bad CGI sticks out.

                                                                                                                                                                                                                                      • bee_rider 20 hours ago

                                                                                                                                                                                                                                        Actually wait I take it back, I mean, I was aware that lots of Digital Touch-up happens in movie sets, more than lots of people might expect, and more often that one might expect even in mundane movies, but even still, this comment’s video was pretty shocking anyway.

                                                                                                                                                                                                                                        https://news.ycombinator.com/item?id=41584276

                                                                                                                                                                                                                                • hn_throwaway_99 a day ago

                                                                                                                                                                                                                                  I mean, it's already apparent to me that a lot of people don't have a basic process in place to detect fact from fiction. And it's definitely not always easy, but when I hear some of the dumbest conspiracy theories known to man actually get traction in our media, political figures, and society at large, I just have to shake my head and laugh to keep from crying. I'm constantly reminded of my favorite saying, "people who believe in conspiracy theories have never been a project manager."

                                                                                                                                                                                                                                  • Suppafly a day ago

                                                                                                                                                                                                                                    >Humans losing their ability to detect AI content from reality ? It's frightening.

                                                                                                                                                                                                                                    And it already happened, and no one pushed back while it was happening.

                                                                                                                                                                                                                                    • Sharlin a day ago

                                                                                                                                                                                                                                      It's worse: they don't even care.

                                                                                                                                                                                                                                      • bunderbunder a day ago

                                                                                                                                                                                                                                        This video's worth a watch if you want to get a sense of the current state of things. Despite the (deliberately) clickbait title, the video itself is pretty even-handed.

                                                                                                                                                                                                                                        It's by Language Jones, a YouTube linguist. Title: "The AI Apocalypse is Here"

                                                                                                                                                                                                                                        https://youtu.be/XeQ-y5QFdB4

                                                                                                                                                                                                                                        • wraptile a day ago

                                                                                                                                                                                                                                          I find issue with this statement as content was never a clean representation of human actions or even thought. It was always driven by editorials, SEO, bot remixing and whatnot that heavily influences how we produce content. One might even argue that heightened content distrust is _good_ for our society.

                                                                                                                                                                                                                                          • BeFlatXIII a day ago

                                                                                                                                                                                                                                            It's a defense lawyer's dream.

                                                                                                                                                                                                                                            • bongodongobob a day ago

                                                                                                                                                                                                                                              Oh they definitely are. A lot of people are now calling out real photos as fake. I frequently get into stupid Instagram political arguments and a lot of times they come back with "yeah nice profile with all your AI art haha". It's all real high quality photography. Honestly, I don't think the avg person can tell anymore.

                                                                                                                                                                                                                                              • ziml77 20 hours ago

                                                                                                                                                                                                                                                I've reached a point where even if my first reaction to a photo is to be impressed, I then quickly think "oh but what it this is AI?" and then immediately my excitement for the photo is ruined because it may not actually be a photo at all.

                                                                                                                                                                                                                                                • bongodongobob 20 hours ago

                                                                                                                                                                                                                                                  I don't get that perspective at all. Who cares what made it.

                                                                                                                                                                                                                                                  • pbhjpbhj 5 hours ago

                                                                                                                                                                                                                                                    You don't find a difference between things that exist and things that don't?

                                                                                                                                                                                                                                            • bsder a day ago

                                                                                                                                                                                                                                              > When I get asked if the person in a video is real, I still feel pretty confident to answer

                                                                                                                                                                                                                                              I don't share your confidence in identifying real people anymore.

                                                                                                                                                                                                                                              I often flag as "false-ish" a lot of things from genuinely real people, but who have adopted the behaviors of the TikTok/Insta/YouTube creator. Hell, my beard is grey and even I poked fun at "YouTube Thumbnail Face" back in 2020 in a video talk I gave. AI twigs into these "semi-human" behavioral patterns super fast and super hard.

                                                                                                                                                                                                                                              There is a video floating around with pairs of young ladies with "This is real"/"This is not real" on signs. They could be completely lying about both, and I really can't tell the difference. All of them have behavioral patterns that seems a little "off" but are consistent with the small number of "influencer" videos I have exposure to.

                                                                                                                                                                                                                                            • dweinus a day ago

                                                                                                                                                                                                                                              > Now the Web at large is full of slop generated by large language models, written by no one to communicate nothing.

                                                                                                                                                                                                                                              Fair and accurate. In the best cases the person running the model didn't write this stuff and word salad doesn't communicate whatever they meant to say. In many cases though, content is simply pumped out for SEO with no intention of being valuable to anyone.

                                                                                                                                                                                                                                              • andrethegiant a day ago

                                                                                                                                                                                                                                                That sentence stood out to me too, very powerful. Felt it right in the feels.

                                                                                                                                                                                                                                              • dsign a day ago

                                                                                                                                                                                                                                                Somehow related, paper books from before 2020 could be a valuable commodity in a in a decade or two, when the Internet will be full of slop and even contemporary paper books will be treated with suspicion. And there will be human talking heads posing as the authors of books written by very smart AIs. God, why are we doing this????

                                                                                                                                                                                                                                                • rvnx a day ago

                                                                                                                                                                                                                                                  To support well-known “philanthropists” like Sam Altman or Mark Zuckerberg that many consider as their heroes here.

                                                                                                                                                                                                                                                  • user432678 a day ago

                                                                                                                                                                                                                                                    And I thought I had some kind of mental illness collecting all those books, barely reading them. Need to do that more now.

                                                                                                                                                                                                                                                    • globular-toast 18 hours ago

                                                                                                                                                                                                                                                      Yes. I've always loved my books but now consider them my most valuable possessions.

                                                                                                                                                                                                                                                    • RomanAlexander 21 hours ago

                                                                                                                                                                                                                                                      Or AI talking heads posing as the author of books written by AIs. https://youtu.be/pAPGRGTqIgI (warning: state sponsored disinformation AI)

                                                                                                                                                                                                                                                    • aucisson_masque a day ago

                                                                                                                                                                                                                                                      Did we (the humans) somehow managed to pollute the internet so much with AI that's it's now barely usable ?

                                                                                                                                                                                                                                                      In my opinion the internet can be considered as the equivalent of a natural environment like the earth. it's a space where people share, meet, talk, etc.

                                                                                                                                                                                                                                                      I find it astonishing that after polluting our natural environment we know polluted the internet.

                                                                                                                                                                                                                                                      • nkozyra a day ago

                                                                                                                                                                                                                                                        > Did we (the humans) somehow managed to pollute the internet so much with AI that's it's now barely usable

                                                                                                                                                                                                                                                        If we haven't already, we will be very soon. I'm sure there are people working on this problem, but I think we're starting to hit a very imminent feedback loop moment. Most of human's recorded information is digitized and most of that is generating non-human content at an incredible pace. We've injected a whole lot of noise into our usable data.

                                                                                                                                                                                                                                                        I don't know if the answer is more human content (I'm doing my part!) or novel generative content but this interim period is going to cause some medium-term challenges.

                                                                                                                                                                                                                                                        I like to think the LLM more-tokens-equals-better era is fading and we're getting into better use of existing data, but there's a very real inflection point we're facing.

                                                                                                                                                                                                                                                        • coldpie a day ago

                                                                                                                                                                                                                                                          There are smaller, gated communities that are still very valuable. You're posting in one. But yes, the open Internet is basically useless now, thanks ultimately to advertising as a business model.

                                                                                                                                                                                                                                                          • nicholassmith a day ago

                                                                                                                                                                                                                                                            I've seen plenty of comments here that read like they've been generated by an LLM, if this is a gated community we need a better gate.

                                                                                                                                                                                                                                                            • coldpie a day ago

                                                                                                                                                                                                                                                              Sure, there's bad actors everywhere, but there's really no incentive to do it here so I don't think it's a problem in the same way it is on the open internet, where slop is actively rewarded.

                                                                                                                                                                                                                                                              • globular-toast 18 hours ago

                                                                                                                                                                                                                                                                It's hard to tell, though. People have been saying my borderline autistic comments sound like GPT for years now.

                                                                                                                                                                                                                                                              • lobsterthief 15 hours ago

                                                                                                                                                                                                                                                                Also our collective unwillingness to pay for subscriptions for publications

                                                                                                                                                                                                                                                                • whimsicalism 19 hours ago

                                                                                                                                                                                                                                                                  this is not a gated community at all

                                                                                                                                                                                                                                                                  • coldpie 2 hours ago

                                                                                                                                                                                                                                                                    True, that is maybe too strong a phrase, but I think it's close to accurate. I think the culture & medium provide kind of a self-selecting gate: it's just plain text and links to articles, with the discussion expected by culture to be fairly serious. I think that turns off enough people that it kind of forms its own gate shutting out the people that make "eternal Septembers" happen. But yeah, ultimately, you're right.

                                                                                                                                                                                                                                                                • thwarted a day ago

                                                                                                                                                                                                                                                                  Tragedy of the Commons Ruins Everything Around Me

                                                                                                                                                                                                                                                                  • ashton314 a day ago

                                                                                                                                                                                                                                                                    That's a nice analogy. Fortunately (un)real estate is easier to manufacture out of thin air online. We have lost some valuable spaces like Twitter and Reddit to some degree though.

                                                                                                                                                                                                                                                                    • egypturnash 15 hours ago

                                                                                                                                                                                                                                                                      The public Internet has been relentlessly strip-mined for profit by ever since Canter & Siegel posted their immigration services ad to every single Usenet newsgroup.

                                                                                                                                                                                                                                                                      • mathnmusic a day ago

                                                                                                                                                                                                                                                                        > Did we (the humans) somehow managed to pollute the internet

                                                                                                                                                                                                                                                                        Corporations did that, not humans.

                                                                                                                                                                                                                                                                        "few people recognize that we already share our world with artificial creatures that participate as intelligent agents in our society: corporations" - https://arxiv.org/abs/1204.4116

                                                                                                                                                                                                                                                                        • left-struck a day ago

                                                                                                                                                                                                                                                                          >We the humans

                                                                                                                                                                                                                                                                          Nice try

                                                                                                                                                                                                                                                                          If it’s not clear, I’m joking.

                                                                                                                                                                                                                                                                          • surfingdino a day ago

                                                                                                                                                                                                                                                                            Yes. Here are practical instructions on how to turn it into an even more of a cesspit https://www.youtube.com/watch?v=endHz0jo9Ck I think it's now a law of nature that any new tech leads to SEO amplification. AI has become the Degelman M34 Manure Spreader of the internet https://degelman.com/products/manure-spreaders

                                                                                                                                                                                                                                                                          • baq a day ago

                                                                                                                                                                                                                                                                            All those writers who'll soon be out of job and/or already are and basically unhireable for their previous tasks should be paid for by the AI hyperscalers to write anything at all on one condition: not a single sentence in their works should be created with AI.

                                                                                                                                                                                                                                                                            (I initially wanted to say 'paid for by the government' but that'd be socialising losses and we've had quite enough of that in the past.)

                                                                                                                                                                                                                                                                            • vidarh a day ago

                                                                                                                                                                                                                                                                              There are already several companies doing this - I do occasional contract work for a couple -, and paying rates sometimes well above what an average earning writer can expect elsewhere. However, the vast majority of writers have never been able to make a living from their writing. The threshold to write is too love, too many people love it, and most people read very little.

                                                                                                                                                                                                                                                                              • baq a day ago

                                                                                                                                                                                                                                                                                Transformers read a lot during training, it might actually be beneficial for the companies to the point those works never see the light of day, only machines would read them. That's so dystopian I'd say those works should be published so they eventually get into the public domain.

                                                                                                                                                                                                                                                                                • ckemere a day ago

                                                                                                                                                                                                                                                                                  Rooms full of people writing into a computer is a striking mental picture. It feels like it could be background for a great plot for a book/movie.

                                                                                                                                                                                                                                                                                  • EasyMark 15 hours ago

                                                                                                                                                                                                                                                                                    Lots of books have had plots where a person is training their replacement.

                                                                                                                                                                                                                                                                                    • left-struck a day ago

                                                                                                                                                                                                                                                                                      Have you heard of Severance? This has a vibe extremely similar to that show.

                                                                                                                                                                                                                                                                                • tveita a day ago
                                                                                                                                                                                                                                                                                  • jfultz a day ago

                                                                                                                                                                                                                                                                                    _Thank you_. I read this story probably around 1980 (I think in a magazine that was subsequently trashed or garage-saled), and I have spent my adult life remembering the bones of the story, but not the author or the title.

                                                                                                                                                                                                                                                                                  • bondarchuk a day ago

                                                                                                                                                                                                                                                                                    AI companies are indeed hiring such people to generate customized training data for them.

                                                                                                                                                                                                                                                                                    • passion__desire a day ago

                                                                                                                                                                                                                                                                                      This idea could also be extended to domains like Art. Create new art styles for AI to learn from. But in future, that will also get automated. AI itself will create art styles and all humans would do is choose whether something is Hot or Not. Sort of like art breeder.

                                                                                                                                                                                                                                                                                      • neilv a day ago

                                                                                                                                                                                                                                                                                        Is it the same companies that simply took all the writers' previous work (hoping to be billionaires before the courts understand)?

                                                                                                                                                                                                                                                                                        • shadowgovt a day ago

                                                                                                                                                                                                                                                                                          Yes. This was always the failure with the argument that copyright was the relevant issue... Once the model was proven out, we knew some wealthy companies would hire humans to generate the training data that the companies could then own in whole, at the relative expense of all other humans that didn't get paid to feed the machines.

                                                                                                                                                                                                                                                                                      • nkozyra a day ago

                                                                                                                                                                                                                                                                                        People have been paid to generate noise for a decade+ now. Garbage in, garbage out will always be true.

                                                                                                                                                                                                                                                                                        Next token-seeking is a solved problem. Novel thinking can be solved by humans and possibly by AI soon, but adding more garbage to the data won't improve things.

                                                                                                                                                                                                                                                                                        • trilbyglens a day ago

                                                                                                                                                                                                                                                                                          Have you ever read american history? Lol.

                                                                                                                                                                                                                                                                                        • bane a day ago

                                                                                                                                                                                                                                                                                          This is one of the vanguards warning of the changes coming in the post-AI world.

                                                                                                                                                                                                                                                                                          >> Generative AI has polluted the data

                                                                                                                                                                                                                                                                                          Just like low-background steel marks the break in history from before and after the nuclear age, these types of data mark the distinction from before and after AI.

                                                                                                                                                                                                                                                                                          Future models will begin to continue to amplify certain statistical properties from their training, that amplified data will continue to pollute the public space from which future training data is drawn. Meanwhile certain low-frequency data will be selected by these models less and less and will become suppressed and possibly eliminated. We know from classic NLP techniques that low frequency words are often among the highest in information content and descriptive power.

                                                                                                                                                                                                                                                                                          Bitrot will continue to act as the agent of Entropy further reducing pre-AI datasets.

                                                                                                                                                                                                                                                                                          These feedback loops will persist, language will be ground down, neologisms will be prevented and...society, no longer with the mental tools to describe changing circumstances; new thoughts unable to be realized, will cease to advance and then regress.

                                                                                                                                                                                                                                                                                          Soon there will be no new low frequency ideas being removed from the data, only old low frequency ideas. Language's descriptive power is further eliminated and only the AIs seem able to produce anything that might represent the shadow of novelty. But it ends when the machines can only produce unintelligible pages of particles and articles, language is lost, civilization is lost when we no longer know what to call its downfall.

                                                                                                                                                                                                                                                                                          The glimmer of hope is that humanity figured out how to rise from the dreamstate of the world of animals once. Future humans will be able to climb from the ashes again. There used to be a word, the name of a bird, that encoded this ability to die and return again, but that name is already lost to the machines that will take our tongues.

                                                                                                                                                                                                                                                                                          • fer a day ago

                                                                                                                                                                                                                                                                                            > Future models will begin to continue to amplify certain statistical properties from their training, that amplified data will continue to pollute the public space from which future training data is drawn.

                                                                                                                                                                                                                                                                                            That's why on FB I mark my own writing as AI generated, and the AI generated slop as genuine. Because what is disguised as "transparency disclaimer" is just flagging content of what's a potential dataset to train from and what isn't.

                                                                                                                                                                                                                                                                                            • mitthrowaway2 21 hours ago

                                                                                                                                                                                                                                                                                              I'm sorry for the low-content remark, but, oh my god... I never thought about doing this, and now my mind is reeling at the implications. The idea of shielding my own writing from AI-plagiarism by masquerading it as AI-generated slop in the first place... but then in the same stroke, further undermining our collective ability to identify genuine human writing, while also flagging my own work as low-value to my readers, hoping that they can read between the lines. It's a fascinating play.

                                                                                                                                                                                                                                                                                              • Calzifer 15 hours ago

                                                                                                                                                                                                                                                                                                Reminds me of the good old times of first generation Google ReCaptcha where I always only entered the one word Google knows and ignored or intentionally mistyped the other.

                                                                                                                                                                                                                                                                                                • aanet 21 hours ago

                                                                                                                                                                                                                                                                                                  You, Sir, may have stumbled upon the just the -hack- advice needed to post on social media.

                                                                                                                                                                                                                                                                                                  Apropos of nothing in particular, see LinkedIn now admitting [1] it is training its AI models on "all users by default"

                                                                                                                                                                                                                                                                                                  [1] https://www.techmeme.com/240918/p34#a240918p34

                                                                                                                                                                                                                                                                                                • wvbdmp a day ago

                                                                                                                                                                                                                                                                                                  I Have No Words, And I Must Scream

                                                                                                                                                                                                                                                                                                  • thechao a day ago

                                                                                                                                                                                                                                                                                                    That went off the rails quickly. Calm down dude: my mother-in-law isn't going to forget words because of AI; she's gonna forget words because she's 3 glasses of crappy Texas wine into the evening.

                                                                                                                                                                                                                                                                                                    • bane 21 hours ago

                                                                                                                                                                                                                                                                                                      But your children's children will never learn about love because that word will have been mechanically trained out of existence.

                                                                                                                                                                                                                                                                                                      • Intralexical 17 hours ago

                                                                                                                                                                                                                                                                                                        That's pretty funny. You think love is just a word?

                                                                                                                                                                                                                                                                                                        • bane 16 hours ago

                                                                                                                                                                                                                                                                                                          I leave it up to the reader to determine how serious I may be.

                                                                                                                                                                                                                                                                                                    • midnitewarrior 21 hours ago

                                                                                                                                                                                                                                                                                                      From the day of the first spoken word, humans have guided the development of language through conversational use and institution. With the advent of AI being used to publish documents into the open web, humans have given up their exclusive domain.

                                                                                                                                                                                                                                                                                                      What would it take for Open AI overlords to inject words they want to force into usage in their models and will new words into use? Few have had the power to do such things. Open AI through its popular GPT platform now has the potential of dictating the evolution of human language.

                                                                                                                                                                                                                                                                                                      This is novel and scary.

                                                                                                                                                                                                                                                                                                      • bane 21 hours ago

                                                                                                                                                                                                                                                                                                        It's the ultimate seizure of the means of production, and in the end it will be the capitalists who realize that revolution.

                                                                                                                                                                                                                                                                                                      • Intralexical 17 hours ago

                                                                                                                                                                                                                                                                                                        > Soon there will be no new low frequency ideas being removed from the data, only old low frequency ideas. Language's descriptive power is further eliminated and only the AIs seem able to produce anything that might represent the shadow of novelty. But it ends when the machines can only produce unintelligible pages of particles and articles, language is lost, civilization is lost when we no longer know what to call its downfall.

                                                                                                                                                                                                                                                                                                        Or we'll be fine, because inbreeding isn't actually sustainable either economically nor technologically, and to most of the world the Silicon Valley "AI" crowd is more an obnoxious gang of socially stunted and predatory weirdos than some unstoppable omnipotent force.

                                                                                                                                                                                                                                                                                                      • aryonoco a day ago

                                                                                                                                                                                                                                                                                                        I feel so conflicted about this.

                                                                                                                                                                                                                                                                                                        On the one hand, I completely agree with Robyn Speer. The open web is dead, and the web is in a really sad state. The other day I decided to publish my personal blog on gopher. Just cause, there's a lot less crap on gopher (and no, gopher is not the answer).

                                                                                                                                                                                                                                                                                                        But...

                                                                                                                                                                                                                                                                                                        A couple of weeks ago, I had to send a video file to my wife's grandfather, who is 97, lives in another country, and doesn't use computers or mobile phones. Eventually we determined that he has a DVD player, so I turned to x264 to convert this modern 4K HDR video into a form that can be played by any ancient DVD player, while preserving as much visual fidelity as possible.

                                                                                                                                                                                                                                                                                                        The thing about x264 is, it doesn't have any docs. Unlike x265 which had a corporate sponsor who could spend money on writing proper docs, x264 was basically developed through trial and error by members of the doom9 forum. There are hundreds of obscure flags, some of which now operate differently to what they did 20 years ago. I could spend hours going through dozens of 20 year old threads on doom9 to figure out what each flag did, or I could do what I did and ask a LLM (in this case Claude).

                                                                                                                                                                                                                                                                                                        Claude wasn't perfect. It mixed up a few ffmpeg flags with x264 ones (easy mistake), but combined with some old fashioned searching and some trial and error, I could get the job done in about half an hour. I was quite happy with the quality of the end product, and the video did play on that very old DVD player.

                                                                                                                                                                                                                                                                                                        Back in pre-LLM days, it's not like I would have hired a x264 expert to do this job for me. I would have either had to spend hours more on this task, or more likely, this 97 year old man would never have seen his great granddaughter's dance, which apparently brought a massive smile to his face.

                                                                                                                                                                                                                                                                                                        Like everything before them, LLMs are just tools. Neither inherently good nor bad. It's what we do with them and how we use them that matters.

                                                                                                                                                                                                                                                                                                        • sangnoir a day ago

                                                                                                                                                                                                                                                                                                          > Back in pre-LLM days, it's not like I would have hired a x264 expert to do this job for me. I would have either had to spend hours more on this task, or more likely, this 97 year old man would never have seen his great granddaughter's dance

                                                                                                                                                                                                                                                                                                          Didn't most DVD burning software include video transcoding as a standard feature? Back in the day, you'd have used Nero Burning ROM, or Handbrake - granted, the quality may not have been optimized to your standards, but the result would have been a watchable video (especially to 97 year-old eyes)

                                                                                                                                                                                                                                                                                                          • aryonoco 21 hours ago

                                                                                                                                                                                                                                                                                                            Back in the day they did. I checked handbrake but now there's nothing specific about DVD compatibility there. I could have picked something like Super HQ 576p, and there's a good chance that would have sufficed, but old DVD players were extremely finicky about filenames, extensions, interlacing, etc. I didn't want to risk the DVD traveling half way across the world only to find that it's not playable.

                                                                                                                                                                                                                                                                                                            • sangnoir 21 hours ago

                                                                                                                                                                                                                                                                                                              I mentioned Handbrake without checking its DVD authoring capability - probably used it to rip DVDs many years ago and got it mixed up with burning them; a better FLOSS alternative for authoring would have been DeVeDe or bombono.

                                                                                                                                                                                                                                                                                                        • jgord 15 hours ago

                                                                                                                                                                                                                                                                                                          We will soon face another kind of bit-rot : where so much text is generated by LLMs that it pollutes the human natural language corpus available for training, on the web.

                                                                                                                                                                                                                                                                                                          Maybe we actually need to preserve all the old movies / documentaries / books in all languages and mark them as pre-LLM / non-LLM.

                                                                                                                                                                                                                                                                                                          But I hazard a guess this wont happen, as its a common good that could only be funded by left-leaning taxation policies - no one can make money doing this, unlike burning carbon chains to power LLMs.

                                                                                                                                                                                                                                                                                                          • ipaddr 15 hours ago

                                                                                                                                                                                                                                                                                                            Old content can make money now and will be more valuable why wouldn't it happen more frequency?

                                                                                                                                                                                                                                                                                                          • oneeyedpigeon a day ago

                                                                                                                                                                                                                                                                                                            I wonder if anyone will fork the project. Apart from anything else, the data may still be useful given that we know it is polluted. In fact, it could act as a means of judging the impact of LLMs via that very pollution.

                                                                                                                                                                                                                                                                                                            • Miraltar a day ago

                                                                                                                                                                                                                                                                                                              I guess it would be interesting but differentiating pollution from language evolution seems very tricky since getting a non polluted corpus gets harder and harder

                                                                                                                                                                                                                                                                                                              • Retr0id a day ago

                                                                                                                                                                                                                                                                                                                Arguably it is a form of language evolution. I bet humans have started using "delve" more too, on average. I think the best we can do is look at the trends and think about potential causes.

                                                                                                                                                                                                                                                                                                                • rvnx a day ago

                                                                                                                                                                                                                                                                                                                  “Seamless”, “honed”, “unparalleled”, “delve” are now polluting the landscape because of monkeys repeating what ChatGPT says without even questioning what the words mean.

                                                                                                                                                                                                                                                                                                                  Everything is “seamless” nowadays. Like I am seamlessly commenting here.

                                                                                                                                                                                                                                                                                                                  Arguably, the meaning of these words evolve due to misuse too.

                                                                                                                                                                                                                                                                                                                  • oneeyedpigeon a day ago

                                                                                                                                                                                                                                                                                                                    I see a lot of writing in my day-to-day, and the words that stick out most are things like "plethora" and "utilized". They're not terribly obscure, but they're just 'odd' and, maybe, formal enough to really stick out when overused.

                                                                                                                                                                                                                                                                                                                    • lobsterthief 15 hours ago

                                                                                                                                                                                                                                                                                                                      Btw can’t people just open their prompts by instructing LLMs not to use those words?

                                                                                                                                                                                                                                                                                                                    • pavel_lishin a day ago

                                                                                                                                                                                                                                                                                                                      > I bet humans have started using "delve" more too, on average.

                                                                                                                                                                                                                                                                                                                      I wish there were a way to check.

                                                                                                                                                                                                                                                                                                                    • wpietri a day ago

                                                                                                                                                                                                                                                                                                                      One way to tackle it would be to use LLMs to generate synthetic corpuses, so you have some good fingerprints for pollution. But even there I'm not sure how doable that is given the speed at which LLMs are being updated. Even if I know a particular page was created in, say, January 2023, I may no longer be able to try to generate something similar now to see how suspect it is, because the precise setups of the moment may no longer be available.

                                                                                                                                                                                                                                                                                                                  • greentxt a day ago

                                                                                                                                                                                                                                                                                                                    I think this person has too high a view of pre-2021, probably for ego reasons. In fact, their attitude seems very ego driven. AI didn't just occur in 2021. Nobody knows how much text was machine generated prior to 2021, it was much harder if not impossible to detect. If anything, it's probably easier now since people are all using the same ai that use words like delve so much much it becomes obvious.

                                                                                                                                                                                                                                                                                                                    • croes a day ago

                                                                                                                                                                                                                                                                                                                      >AI didn't just occur in 2021. Nobody knows how much text was machine generated prior to 2021

                                                                                                                                                                                                                                                                                                                      But we do know that now it's a lot more, with a big LOT.

                                                                                                                                                                                                                                                                                                                      • greentxt a day ago

                                                                                                                                                                                                                                                                                                                        I assume you are correct but how can we know rather than assume? I am not sure we can, so why get worked up about "internet died in 2021" when many would claim with similar conviction that it's been dead since 2012, or 2007, or ...

                                                                                                                                                                                                                                                                                                                        • ClassyJacket 17 hours ago

                                                                                                                                                                                                                                                                                                                          You are making a claim that somehow someone was sitting on something as powerful as ChatGPT, long before ChatGPT, and that it was in widespread use, secretly, without even a single leak by anyone at any point. That's not plausible.

                                                                                                                                                                                                                                                                                                                          • nlpparty 15 hours ago

                                                                                                                                                                                                                                                                                                                            Twitter has been accused of being full of bots long before ChatGPT appeared. For 140 symbols, a template with synonyms would be enough to create mass-generated content.

                                                                                                                                                                                                                                                                                                                    • miguno a day ago

                                                                                                                                                                                                                                                                                                                      I have been noticing this trend increasingly myself. It's getting more and more difficult to use tools like Google search to find relevant content.

                                                                                                                                                                                                                                                                                                                      Many of my searches nowadays include suffixes like "site:reddit.com" (or similar havens of, hopefully, still mostly human-generated content) to produce reasonably useful results. There's so much spam pollution by sites like Medium.com that it's disheartening. It feels as if the Internet humanity is already on the retreat into their last comely homes, which are more closed than open to the outside.

                                                                                                                                                                                                                                                                                                                      On the positive side:

                                                                                                                                                                                                                                                                                                                      1. Self-managed blogs (like: not on Substack or Medium) by individuals have become a strong indicator for interesting content. If the blog runs on Hugo, Zola, Astro, you-name-it, there's hope.

                                                                                                                                                                                                                                                                                                                      2. As a result of (1), I have started to use an RSS reader again. Who would have thought!

                                                                                                                                                                                                                                                                                                                      I am still torn about what to make of Discord. On the one hand, the closed-by-design nature of the thousands of Discord servers, where content is locked in forever without a chance of being indexed by a search engine, has many downsides in my opinion. On the other hand, the servers I do frequent are populated by humans, not content-generating bots camouflaged as users.

                                                                                                                                                                                                                                                                                                                      • nlpparty 15 hours ago

                                                                                                                                                                                                                                                                                                                        It has been for me the last 15 years like this.

                                                                                                                                                                                                                                                                                                                      • jchook 18 hours ago

                                                                                                                                                                                                                                                                                                                        If it is (apparently) easy for humans to tell when content is AI-generated slop, then it should be possible to develop an AI to distinguish human-created content.

                                                                                                                                                                                                                                                                                                                        As mentioned, we have heuristics like frequency of the word "delve", and simple techniques such as measuring perplexity. I'd like to see a GAN style approach to this problem. It could potentially help improve the "humanness" of AI-generated content.

                                                                                                                                                                                                                                                                                                                        • aDyslecticCrow 18 hours ago

                                                                                                                                                                                                                                                                                                                          > If it is (apparently) easy for humans to tell when content is AI-generated slop

                                                                                                                                                                                                                                                                                                                          It's actually not. It's rather difficult for humans as well. We can see verbose text that is confused and call it AI, but it could just be a human aswell.

                                                                                                                                                                                                                                                                                                                          To borrow an older model training method, "Generative adversarial network". If we can distinguish AI from humans... We can use it to improve AI and close the gap.

                                                                                                                                                                                                                                                                                                                          So, it becomes an arms race that constantly evolves.

                                                                                                                                                                                                                                                                                                                        • amai 4 hours ago

                                                                                                                                                                                                                                                                                                                          Science publications until 1955 may be last ones not contaminated by calculators.

                                                                                                                                                                                                                                                                                                                          https://news.ycombinator.com/item?id=34966335

                                                                                                                                                                                                                                                                                                                          We will all get used to it.

                                                                                                                                                                                                                                                                                                                          • sashank_1509 a day ago

                                                                                                                                                                                                                                                                                                                            Not to be too dismissive, but is there a worthwhile direction of research to pursue that is not LLM’s in NLP?

                                                                                                                                                                                                                                                                                                                            If we add linguistics to NLP I can see an argument but if we define NLP as the research of enabling a computer process language then it seems to me that LLM’s/ Generative AI is the only research that an NLP practitioner should focus on and everything else is moot. Is there any other paradigm that we think can enable a computer understand language other than training a large deep learning model on a lot of data?

                                                                                                                                                                                                                                                                                                                            • sinkasapa a day ago

                                                                                                                                                                                                                                                                                                                              Maybe it is "including linguistics" but most of the world's languages don't have the data available to train on. So I think one major question for NLP is exactly the question you posed: "Is there any other paradigm that we think can enable a computer understand language other than training a large deep learning model on a lot of data?"

                                                                                                                                                                                                                                                                                                                            • aucisson_masque a day ago

                                                                                                                                                                                                                                                                                                                              It could be used to spot LLM generated text.

                                                                                                                                                                                                                                                                                                                              compare the frequency of words to those used in human natural writings and you spot the computer from the human.

                                                                                                                                                                                                                                                                                                                              • Lvl999Noob a day ago

                                                                                                                                                                                                                                                                                                                                It could be used to differentiate LLM text from pre-LLM human text maybe. The thing, our AIs may not be very good at learning but our brains are. The more we use AI, the more we integrate LLMs and other tools into our life, the more their output will influence us. I believe there was a study (or a few anecdotes) where college papers checked for AI material were marked AI written even though they were written by humans because the students used AI during their studying and learned from it.

                                                                                                                                                                                                                                                                                                                                • MPSimmons a day ago

                                                                                                                                                                                                                                                                                                                                  You're exactly right. You only have to look at the prevalence of the word "unalive" in real life contexts to find an example.

                                                                                                                                                                                                                                                                                                                                  • thfuran a day ago

                                                                                                                                                                                                                                                                                                                                    >our AIs may not be very good at learning but our brains are

                                                                                                                                                                                                                                                                                                                                    Brains aren't nearly as good at slightly adjusting the statistical properties of a text corpus as computers are.

                                                                                                                                                                                                                                                                                                                                    • left-struck a day ago

                                                                                                                                                                                                                                                                                                                                      > The more we use AI, the more we integrate LLMs and other tools into our life, the more their output will influence us

                                                                                                                                                                                                                                                                                                                                      Hmm I don’t disagree but I think it will be valuable skill going forward to write text that doesn’t read like it was written by an LLM

                                                                                                                                                                                                                                                                                                                                      This is an arms race that I’m not sure we can win though. It’s almost like a GAN.

                                                                                                                                                                                                                                                                                                                                    • ithkuil a day ago

                                                                                                                                                                                                                                                                                                                                      it may work for a short time, but after a while natural language will evolve due to natural exposure of those new words or word patterns and even human will write in ways that, while being different from the LLMs, will also be different from the snapshot captured by this snapshot. It's already the case that we used to write differently 20 years ago from 50 years ago and even more so 100 years ago, etc

                                                                                                                                                                                                                                                                                                                                      • slashdave a day ago

                                                                                                                                                                                                                                                                                                                                        Hardly. You are talking about a statistical test, which will have rather large errors (since it is based on word frequencies). Not to mention word frequencies will vary depending on the type of text (essay, description, advertisement, etc).

                                                                                                                                                                                                                                                                                                                                        • TacticalCoder a day ago

                                                                                                                                                                                                                                                                                                                                          > ... compare the frequency of words to those used in human natural writings and you spot the computer from the human.

                                                                                                                                                                                                                                                                                                                                          But that's a losing endeavor: if you can do that, you can immediately ask your LLM to fix its output so that it passes that test (and many others). It can introduce typos, make small errors on purpose, and anything you can think of to make it look human.

                                                                                                                                                                                                                                                                                                                                        • karaterobot a day ago

                                                                                                                                                                                                                                                                                                                                          I guess a manageable, still-useful alternative would be to curate a whitelist of sources that don't use AI, and without making that list public, derive the word frequencies from only those sources. How to compile that list is left as an exercise for the reader. The result would not be as accurate as a broad sample of the web, but in a world where it's impossible to trust a broad sample of the web, it the option you are left with. And I have no reason to doubt that it could be done at a useful scale.

                                                                                                                                                                                                                                                                                                                                          I'm sure this has occurred to them already. Apart from the near-impossibility of continuing the task in the same way they've always done it, it seems like the other reason they're not updating wordfreq is to stick a thumb in the eye of OpenAI and Google. While I appreciate the sentiment, I recognize that those corporations' eyes will never be sufficiently thumbed to satisfy anybody, so I would not let that anger change the course of my life's work, personally.

                                                                                                                                                                                                                                                                                                                                          • WaitWaitWha a day ago

                                                                                                                                                                                                                                                                                                                                            > curate a whitelist of sources that don't use AI,

                                                                                                                                                                                                                                                                                                                                            I like this.

                                                                                                                                                                                                                                                                                                                                            Maybe even take it a step further - have a badge on the source that is both human and machine visible to indicate that the content is not AI generated.

                                                                                                                                                                                                                                                                                                                                          • amai 4 hours ago
                                                                                                                                                                                                                                                                                                                                            • PeterStuer a day ago

                                                                                                                                                                                                                                                                                                                                              Intuitively I feel like word frequency would be one of the things least impacted by LLM output, no?

                                                                                                                                                                                                                                                                                                                                              • Jcampuzano2 a day ago

                                                                                                                                                                                                                                                                                                                                                It'd be in fact quite the opposite. There comes a turning point where the majority of language usage would actually be written by AI, at which point we'd no longer be analysing the word frequency/usage by actual humans and so it wouldn't be representative of how humans actually communicate.

                                                                                                                                                                                                                                                                                                                                                Or potentially even more dystopian would be that AI slop would be dictating/driving human communication going forward.

                                                                                                                                                                                                                                                                                                                                                • baq a day ago

                                                                                                                                                                                                                                                                                                                                                  ‘delve’ is given as an example right there in TFA.

                                                                                                                                                                                                                                                                                                                                                  • PeterStuer a day ago

                                                                                                                                                                                                                                                                                                                                                    Yes, but the material presented in no way makes distiction between potential organic growth of 'delve' vs. LLM induced use. They just note that even though 'delve' was on the rise, in 23-24 the word gains more popularity, at the same time ChatGPT rose. Word adoption is certainly not a linear phenomenon. And as the author states 'I don't think anyone has reliable information about post-2021 language usage by humans'

                                                                                                                                                                                                                                                                                                                                                    So I would still state noun-phrase frequency in LLM output would tend to reflect noun-phrase frequency in training data in a similar context (disregarding enforced bias induced through RLHF and other tuning at the moment)

                                                                                                                                                                                                                                                                                                                                                    I'm sure there will be cross-fertilization from LLM to Human and back, but I'm not seeing the data yet that the influence on word-frequency is that outspoken.

                                                                                                                                                                                                                                                                                                                                                    The author seems to have some other objections to the rise of LLM's, which I fully understand.

                                                                                                                                                                                                                                                                                                                                                    • QuiDortDine a day ago

                                                                                                                                                                                                                                                                                                                                                      The fact that making this distinction is impossible is reason enough to stop.

                                                                                                                                                                                                                                                                                                                                                      • beepbooptheory a day ago

                                                                                                                                                                                                                                                                                                                                                        Even granting that we can disregard a really huge factor here, which I'm not sure we really can, one can not know beforehand how the clustering of the vocabulary is going to go pre-training, and its speculated that both at the center and at the edges of clusters we get random particularities. Hence the "solidgoldmagikarp" phenomenon and many others.

                                                                                                                                                                                                                                                                                                                                                      • whimsicalism 18 hours ago

                                                                                                                                                                                                                                                                                                                                                        there is almost certainly organic growth as well as more people in Nigeria and other SSA countries are getting very good internet penetration in recent years

                                                                                                                                                                                                                                                                                                                                                      • cdrini 7 hours ago

                                                                                                                                                                                                                                                                                                                                                        If only we had a data set that measured word frequency across the internet as we're getting more and more into AI being used... Maybe with a baseline from before 2021 for comparison... But no let's just stop measuring word frequency entirely because we can just assume what will happen and we're angry.

                                                                                                                                                                                                                                                                                                                                                        • joshdavham a day ago

                                                                                                                                                                                                                                                                                                                                                          Think of an LLM as a person on the internet. Just like everyone else, they have their own vocabulary and preferred way of talking which means they’ll use some words more than others. Now imagine we duplicate this hypothetical person an incredible amount of times and have their clones chatter on the internet frequently. ‘Certainly’ this would have an effect.

                                                                                                                                                                                                                                                                                                                                                          • efskap 7 hours ago

                                                                                                                                                                                                                                                                                                                                                            Yes but this person learned to mimic the internet at large. Theoretically its preferred way of talking would be the average of all training data, as mimicry is GPT's training objective, and would therefore have very similar word distributions. Only, this doesn't account for RLHF and prompts spreading memetically among users.

                                                                                                                                                                                                                                                                                                                                                            • joshdavham 3 minutes ago

                                                                                                                                                                                                                                                                                                                                                              > Theoretically its preferred way of talking is would be the average of all the training data

                                                                                                                                                                                                                                                                                                                                                              This is incorrect. Furthermore, what the LLM says is also determined by what its user wants it to say, and how frequently the user wants the LLM to post on the internet. This will have a large effect on the internet’s word frequency distribution.

                                                                                                                                                                                                                                                                                                                                                        • charlieyu1 a day ago

                                                                                                                                                                                                                                                                                                                                                          Web before 2021 was still polluted by content farms. The articles were written by humans, but still, they were rubbish. Not compared to current rate of generation, but the web was already dominated by them.

                                                                                                                                                                                                                                                                                                                                                          • devjab 7 hours ago

                                                                                                                                                                                                                                                                                                                                                            Maybe, it if you’re studying the way humans use language you’re still getting human made data from rubbish. There isn’t any value in AI generated content is what you’re cataloging is human language.

                                                                                                                                                                                                                                                                                                                                                          • jadayesnaamsi a day ago

                                                                                                                                                                                                                                                                                                                                                            The year 2021 is to wordfreq what 1945 was to carbon carbon-14 dating.

                                                                                                                                                                                                                                                                                                                                                            I guess the same way the scientists had to account for the bomb pulse in order to provide accurate carbon-14 dating, wordfreq would need a magic way to account for non human content.

                                                                                                                                                                                                                                                                                                                                                            Saying magic, because unfortunately it was much easier to detect nuclear testing in the atmosphere than to it will be to detect AI-generated content.

                                                                                                                                                                                                                                                                                                                                                            • altcognito a day ago

                                                                                                                                                                                                                                                                                                                                                              It might be fun to collect the same data if not for any other reason than to note the changes but adding the caveat that it doesn’t represent human output.

                                                                                                                                                                                                                                                                                                                                                              Might even change the tool name.

                                                                                                                                                                                                                                                                                                                                                              • jpjoi a day ago

                                                                                                                                                                                                                                                                                                                                                                The point was it’s getting harder and harder to do that as things get locked down or go behind a massive paywall to either profit off of or avoid being used in generative AI. The places where previous versions got data is impossible to gather from anymore so the dataset you would collect would be completely different, which (might) cause weird skewing.

                                                                                                                                                                                                                                                                                                                                                                • oneeyedpigeon a day ago

                                                                                                                                                                                                                                                                                                                                                                  But that would always be the case. Twitter will not last forever; heck, it may not even be long before an open alternative like Bluesky competes with it. Would be interesting to know what percentage of the original mined data was from Twitter.

                                                                                                                                                                                                                                                                                                                                                              • avazhi 8 hours ago

                                                                                                                                                                                                                                                                                                                                                                I agree with the general ethos of the piece (albeit a few of the details are puzzling and unnecessarily partisan - content on X isn't invariably worthless drivel, nor does what Reddit is doing make much intellectual as opposed to economic [IPO-influenced] sense - but this line:

                                                                                                                                                                                                                                                                                                                                                                'OpenAI and Google can collect their own damn data. I hope they have to pay a very high price for it, and I hope they're constantly cursing the mess that they made themselves.'

                                                                                                                                                                                                                                                                                                                                                                really does betray some real naivete. OpenAI and Google could literally burn $10million dollars per day (okay, maybe not OpenAI - but Google surely could) and reasonably fail to notice. Whatever costs those companies have to pay to collect training data will be well worth it to them. Any messes made in the course of obtaining that data will be dealt with by an army of employees either manually cleaning up the data, or by algorithms Google has its own LLM write for itself.

                                                                                                                                                                                                                                                                                                                                                                I do find the general sense of impending dystopian inhumanity arising out of the explosion of LLMs to be super fascinating (and completely understandable).

                                                                                                                                                                                                                                                                                                                                                                • devjab 7 hours ago

                                                                                                                                                                                                                                                                                                                                                                  > puzzling and unnecessarily partisan - content on X isn't invariably worthless drivel

                                                                                                                                                                                                                                                                                                                                                                  Maybe this is because I’m European, but what is partisan about calling X invariably worthless drivel? Seems a lot like facts to me considering what has been going on with the platform moderation since Elon Musk bought it. It’s so bad that the EU consider it a platform for misinformation these days.

                                                                                                                                                                                                                                                                                                                                                                  • cdrini 7 hours ago

                                                                                                                                                                                                                                                                                                                                                                    Do you have a citation on that last claim?

                                                                                                                                                                                                                                                                                                                                                                    • avazhi 5 hours ago

                                                                                                                                                                                                                                                                                                                                                                      Because the author specifically mentioned that it's worthless because it's 'right-wing' (a 'right-wing cesspool'), as if there aren't plenty of people espousing left-wing views on the platform. The right-wing comment in particular is what makes the statement blatantly partisan.

                                                                                                                                                                                                                                                                                                                                                                      • bakugo 6 hours ago

                                                                                                                                                                                                                                                                                                                                                                        > It’s so bad that the EU consider it a platform for misinformation these days.

                                                                                                                                                                                                                                                                                                                                                                        Can you define "misinformation"? Is it just things the government disagrees with?

                                                                                                                                                                                                                                                                                                                                                                    • nlpparty 15 hours ago

                                                                                                                                                                                                                                                                                                                                                                      It's just inevitable. Imagine a world where we get a cheap and accessable AGI. Most work in the world will be done by it. Certainly, it will organise the work the way it finds more preferable. Humans (and other AIs) will find it much harder to train from example as most of the work is performed in the same uniform way. The AI revolution should start with the field closest to its roots.

                                                                                                                                                                                                                                                                                                                                                                      • donatj a day ago

                                                                                                                                                                                                                                                                                                                                                                        I hear this complaint often but in reality I have encountered fairly little content in my day to day that has felt fully AI generated? AI assisted sure, but is that a problem if a human is in the mix, curating?

                                                                                                                                                                                                                                                                                                                                                                        I certainly have not encountered enough straight drivel where I would think it would have a significant effect on overall word statistics.

                                                                                                                                                                                                                                                                                                                                                                        I suspect there may be some over-identification of AI content happening, a sort of Baader–Meinhof effect cognitive bias. People have their eye out for it and suddenly everything that reads a little weird logically "must be AI generated" and isn't just a bad human writer.

                                                                                                                                                                                                                                                                                                                                                                        Maybe I am biased, about a decade ago I worked for an SEO company with a team of copywriters who pumped out mountains the most inane keyword packed text designed for literally no one but Google to read. It would rot your brain if you tried, and it was written by hand by a team of humans beings. This existed WELL before generative AI.

                                                                                                                                                                                                                                                                                                                                                                        • pavel_lishin a day ago

                                                                                                                                                                                                                                                                                                                                                                          > I hear this complaint often but in reality I have encountered fairly little content in my day to day that has felt fully AI generated?

                                                                                                                                                                                                                                                                                                                                                                          How confident are you in this assessment?

                                                                                                                                                                                                                                                                                                                                                                          > straight drivel

                                                                                                                                                                                                                                                                                                                                                                          We're past the point where what AI generates is "straight drivel"; every minute, it's harder to distinguish AI output from actual output unless you're approaching expertise in the subject being written about.

                                                                                                                                                                                                                                                                                                                                                                          > a team of copywriters who pumped out mountains the most inane keyword packed text designed for literally no one but Google to read.

                                                                                                                                                                                                                                                                                                                                                                          And now a machine can generate the same amount of output in 30 seconds. Scale matters.

                                                                                                                                                                                                                                                                                                                                                                          • PhunkyPhil a day ago

                                                                                                                                                                                                                                                                                                                                                                            > every minute, it's harder to distinguish AI output from actual output unless you're approaching expertise in the subject being written about.

                                                                                                                                                                                                                                                                                                                                                                            So, then what really is the problem with just including LLM-generated text in wordfreq?

                                                                                                                                                                                                                                                                                                                                                                            If quirky word distributions will remain a "problem", then I'd bet that human distributions for those words will follow shortly after (people are very quick to change their speech based on their environment, it's why language can change so quickly).

                                                                                                                                                                                                                                                                                                                                                                            Why not just own the fact that LLMs are going to be affecting our speech?

                                                                                                                                                                                                                                                                                                                                                                            • pavel_lishin 3 hours ago

                                                                                                                                                                                                                                                                                                                                                                              > So, then what really is the problem with just including LLM-generated text in wordfreq?

                                                                                                                                                                                                                                                                                                                                                                              > Why not just own the fact that LLMs are going to be affecting our speech?

                                                                                                                                                                                                                                                                                                                                                                              The problem is that we cannot tell what's a result of LLMs affecting our speech, and what's just the output of LLMs.

                                                                                                                                                                                                                                                                                                                                                                              If LLMs result in a 10% increase of the word "gimple" online, which then results in a 1% increase of humans using the word "gimple" online, how do we measure that? Simply continuing to use the web to update wordfreq would show a 10% increase, which is incorrect.

                                                                                                                                                                                                                                                                                                                                                                        • hcks 21 hours ago

                                                                                                                                                                                                                                                                                                                                                                          Okay but how big of a sample size do we even actually need for word frequencies? Like what’s the goal here? It looks like the initial project isn’t even stratified per year/decade

                                                                                                                                                                                                                                                                                                                                                                          • diggan a day ago

                                                                                                                                                                                                                                                                                                                                                                            One of the examples is the increased usage of "delve" which Google Trends confirms increased in usage since 2022 (initial ChatGPT release): https://trends.google.com/trends/explore?date=all&q=delve&hl...

                                                                                                                                                                                                                                                                                                                                                                            It seems however it started increasing most in usage just these last few months, maybe people are talking more about "delve" specifically because of the increase in usage? A usage recursion of some sorts.

                                                                                                                                                                                                                                                                                                                                                                            • bee_rider a day ago

                                                                                                                                                                                                                                                                                                                                                                              We’ve seen this with a couple words and expressions, and I don’t doubt that AI is somewhat likely to “like” some phrases for whatever reason. Big eigenvaues of the latent space or whatever, hahaha (I don’t know AI).

                                                                                                                                                                                                                                                                                                                                                                              But also, words and phrases do become popular among humans, right? It would be a shame if AI caused the language to get more stagnant, as keeping up with which phrases are popular get you labeled as an AI.

                                                                                                                                                                                                                                                                                                                                                                              • cdrini 7 hours ago

                                                                                                                                                                                                                                                                                                                                                                                Exactly, like how "mindful" and "demure" recently became more popular for seemingly no reason. Humans do this all the time.

                                                                                                                                                                                                                                                                                                                                                                                And language in general stagnates and shrinks in vocabulary over time ( https://evoandproud.blogspot.com/2019/09/why-is-vocabulary-s... ). (Link that ChatGPT helped me find :P) I think AI will increase the average persons vocabulary, since it appears to in general be better/more professionally written than a lot of what the average person is exposed to online.

                                                                                                                                                                                                                                                                                                                                                                              • bongodongobob a day ago

                                                                                                                                                                                                                                                                                                                                                                                Delves are a new thing in World of Warcraft released 9/10 this year. Delve is also an M365 product that has been around for some time and is being discontinued in December. So no, that has nothing to do with LLMs.

                                                                                                                                                                                                                                                                                                                                                                                • _proofs 21 hours ago

                                                                                                                                                                                                                                                                                                                                                                                  Delve was also an addition to PoE, which I imagine had its own spike in google searches relative to that word.

                                                                                                                                                                                                                                                                                                                                                                                • nlpparty 15 hours ago

                                                                                                                                                                                                                                                                                                                                                                                  If your select only USA, the trend disappears.

                                                                                                                                                                                                                                                                                                                                                                                • nlpparty 15 hours ago

                                                                                                                                                                                                                                                                                                                                                                                  https://trends.google.com/trends/explore?date=all&geo=US&q=d...

                                                                                                                                                                                                                                                                                                                                                                                  The funny fact: It doesn't result in the increase for search results for "delve".

                                                                                                                                                                                                                                                                                                                                                                                  • 1d22a 15 hours ago

                                                                                                                                                                                                                                                                                                                                                                                    That chart shows people searching for the world delve, and isn't (directly) related to the incidence of words in content on the open web.

                                                                                                                                                                                                                                                                                                                                                                                    • nlpparty 15 hours ago

                                                                                                                                                                                                                                                                                                                                                                                      I just assumed that if many people, especially not proficient language users encounter this word in the text generated by ChatGPT they would look it up.

                                                                                                                                                                                                                                                                                                                                                                                  • jhack 13 hours ago

                                                                                                                                                                                                                                                                                                                                                                                    Kind of weird to believe “slop” didn’t exist on the internet in mass quantities before AI.

                                                                                                                                                                                                                                                                                                                                                                                    • WalterBright 11 hours ago

                                                                                                                                                                                                                                                                                                                                                                                      I've wondered from time to time why I collect history books, keep my encyclopedias, when I could just google it. Now I know why. They predate AI and are unpolluted by generated bilge.

                                                                                                                                                                                                                                                                                                                                                                                      • jijojohnxx 12 hours ago

                                                                                                                                                                                                                                                                                                                                                                                        Sad to see wordfreq halted, it was a real party for linguistics enthusiasts. For those seeking new tools, keep expanding your knowledge with socialsignalai.

                                                                                                                                                                                                                                                                                                                                                                                        • jedberg 18 hours ago

                                                                                                                                                                                                                                                                                                                                                                                          We need a vintage data/handmade data service. A service that can provide text and images for training that are guaranteed to have either been produced by a human or produced before 2021.

                                                                                                                                                                                                                                                                                                                                                                                          Someone should start scanning all those microfiche archives in local libraries and sell the data.

                                                                                                                                                                                                                                                                                                                                                                                          • ilaksh a day ago

                                                                                                                                                                                                                                                                                                                                                                                            Reading through this entire thread, I suspect that somehow generative AI actually became a political issue. Polarized politics is like a vortex sucking all kinds of unrelated things in.

                                                                                                                                                                                                                                                                                                                                                                                            In case that doesn't get my comment completely buried, I will go ahead and say honestly that even though "AI slop" and paywalled content is a problem, I don't think that generative AI in itself is a negative at all. And I also think that part of this person's reaction is that LLMs have made previous NLP techniques, such a those based on simple usage counts etc., largely irrelevant.

                                                                                                                                                                                                                                                                                                                                                                                            What was/is wordfreq used for, and can those tasks not actually be done more effectively with a cutting edge language model of some sort these days? Maybe even a really small one for some things.

                                                                                                                                                                                                                                                                                                                                                                                            • ecshafer a day ago

                                                                                                                                                                                                                                                                                                                                                                                              Generative AI is inherently a political issue, its not surprising at all.

                                                                                                                                                                                                                                                                                                                                                                                              There is the case of what is "truth". As soon as you start to ensure some quality of truth to what is generated, that is political.

                                                                                                                                                                                                                                                                                                                                                                                              As soon as generative AI has the capability to take someone's job, that is political.

                                                                                                                                                                                                                                                                                                                                                                                              The instant AI can make someone money, it is political.

                                                                                                                                                                                                                                                                                                                                                                                              When AI is trained on something that someone has created, and now they can generate something similar, it is political.

                                                                                                                                                                                                                                                                                                                                                                                              • whimsicalism 18 hours ago

                                                                                                                                                                                                                                                                                                                                                                                                > As soon as generative AI has the capability to take someone's job, that is political.

                                                                                                                                                                                                                                                                                                                                                                                                What is political is people enshrining themselves in chokepoints and demanding a toll for passing through or getting anything done. That is what you do when you make a certain job politically 'untakable'.

                                                                                                                                                                                                                                                                                                                                                                                                People who espouse that the 'personal is political' risk making the definition of politics so broad that it is useless.

                                                                                                                                                                                                                                                                                                                                                                                                • ilaksh a day ago

                                                                                                                                                                                                                                                                                                                                                                                                  Then .. everything is political?

                                                                                                                                                                                                                                                                                                                                                                                                  • commodoreboxer a day ago

                                                                                                                                                                                                                                                                                                                                                                                                    Everything involving any kind of coordination, cooperation, competition, and/ot communication between two or more people involves politics by its very nature. LLMs are communication tools. You can't divorce politics from their use when one person is generating text for another person to read.

                                                                                                                                                                                                                                                                                                                                                                                                    • JohnFen a day ago

                                                                                                                                                                                                                                                                                                                                                                                                      "Just because you do not take an interest in politics doesn't mean politics won't take an interest in you." -- Pericles

                                                                                                                                                                                                                                                                                                                                                                                                      • phito a day ago

                                                                                                                                                                                                                                                                                                                                                                                                        It is. Unfortunately.

                                                                                                                                                                                                                                                                                                                                                                                                    • rincebrain a day ago

                                                                                                                                                                                                                                                                                                                                                                                                      The simplest example that comes to mind of something frequency analysis might be useful for would be if you had simple ciphertext where you knew that the characters probably 1:1 mapped, but you didn't know anything about how.

                                                                                                                                                                                                                                                                                                                                                                                                      It could also be useful for guessing whether someone might have been trying to do some kind of steganographic or additional encoding in their work, by telling you how abnormal compared to how many people write it is that someone happened to choose a very unusual construction in their work, or whether it's unlikely that two people chose the same unusual construction by coincidence or plagiarism.

                                                                                                                                                                                                                                                                                                                                                                                                      You might also find statistical models interesting for things like noticing patterns in people for whom English or others are not their first language, and when they choose different constructions more often than speakers for whom it was their first language.

                                                                                                                                                                                                                                                                                                                                                                                                      I'm not saying you can't use an LLM to do some or all of these, but they also have something of a scalar attached to them of how unusual the conclusion is - e.g. "I have never seen this construction of words in 50 million lines of text" versus "Yes, that's natural.", which can be useful for trying to inform how close to the noise floor the answer is, even ignoring the prospect of hallucinations.

                                                                                                                                                                                                                                                                                                                                                                                                      • whimsicalism 18 hours ago

                                                                                                                                                                                                                                                                                                                                                                                                        Yes, it's become extremely politicized and its very tiresome. Tech in general, to be frank. Pray that your field of interest never gets covered in the NYT.

                                                                                                                                                                                                                                                                                                                                                                                                      • yarg 10 hours ago

                                                                                                                                                                                                                                                                                                                                                                                                        Generative AI has done to human speech analysis what atmospheric testing did to carbon dating.

                                                                                                                                                                                                                                                                                                                                                                                                        • grogenaut a day ago

                                                                                                                                                                                                                                                                                                                                                                                                          Is 2023 going to be for data what the trinity test was for iron? Eg post 2023 all data now contains trace amounts of ai?

                                                                                                                                                                                                                                                                                                                                                                                                        • zaik a day ago

                                                                                                                                                                                                                                                                                                                                                                                                          If generative AI has a significantly different word frequency from humans then it also shouldn't be hard to detect text written generative AI. However my last information is that tools to detect text written by generative AI are not that great.

                                                                                                                                                                                                                                                                                                                                                                                                          • DebtDeflation a day ago

                                                                                                                                                                                                                                                                                                                                                                                                            Enshittification is accelerating. A good 70% of my Facebook feed is now obviously AI generated images with AI generated text blurbs that have nothing to do with the accompanying images likely posted by overseas bot farms. I'm also noticing more and more "books" on Amazon that are clearly AI generated and self published.

                                                                                                                                                                                                                                                                                                                                                                                                            • janice1999 a day ago

                                                                                                                                                                                                                                                                                                                                                                                                              It's okay. Amazon has limited authors to self publishing only 3 books per day (yes, really). That will surely solve the problem.

                                                                                                                                                                                                                                                                                                                                                                                                              • wpietri a day ago

                                                                                                                                                                                                                                                                                                                                                                                                                Hah! I'm trying to figure out the exact date that crossed from "plausible line from a Stross or Sterling novel" [1] to "of course they did".

                                                                                                                                                                                                                                                                                                                                                                                                                [1] Or maybe Sheckley or Lem, now that I think about it.

                                                                                                                                                                                                                                                                                                                                                                                                                • Drakim a day ago

                                                                                                                                                                                                                                                                                                                                                                                                                  I read that as 3 books per year at first and thought to myself that that was a rather harsh limitation but surely any true respectable author wouldn't be spitting more than that...

                                                                                                                                                                                                                                                                                                                                                                                                                  ...and then I realized you wrote 3 books a day. What the hell.

                                                                                                                                                                                                                                                                                                                                                                                                                • Sohcahtoa82 a day ago

                                                                                                                                                                                                                                                                                                                                                                                                                  > A good 70% of my Facebook feed is now obviously AI generated images with AI generated text blurbs that have nothing to do with the accompanying images likely posted by overseas bot farms.

                                                                                                                                                                                                                                                                                                                                                                                                                  This is a self-inflicted problem, IMO.

                                                                                                                                                                                                                                                                                                                                                                                                                  Do you just have shitty friends that share all that crap? Or are you following shitty pages?

                                                                                                                                                                                                                                                                                                                                                                                                                  I use Facebook a decent amount, and I don't suffer from what you're complaining about. Your feed is made of what you make it. Unfollow the pages that make that crap. If you have friends that share it, consider unfriending or at the very least, unfollowing. Or just block the specific pages they're sharing posts from.

                                                                                                                                                                                                                                                                                                                                                                                                                • ok123456 a day ago

                                                                                                                                                                                                                                                                                                                                                                                                                  Most of the "random" bot content pre-2021 was low-quality Markov-generated text. If anything, these genitive AI tools would improve the accuracy of scraping large corpora of text from the web.

                                                                                                                                                                                                                                                                                                                                                                                                                  • jonas21 17 hours ago

                                                                                                                                                                                                                                                                                                                                                                                                                    I think the main reason for sunsetting the project is hinted at near the bottom:

                                                                                                                                                                                                                                                                                                                                                                                                                    > The field I know as "natural language processing" is hard to find these days. It's all being devoured by generative AI. Other techniques still exist but generative AI sucks up all the air in the room and gets all the money.

                                                                                                                                                                                                                                                                                                                                                                                                                    Traditional NLP has been surpassed by transformers, making this project obsolete. The rest of the post reads like rationalization and sour grapes.

                                                                                                                                                                                                                                                                                                                                                                                                                    • rovr138 15 hours ago

                                                                                                                                                                                                                                                                                                                                                                                                                      I think the reason to sunset the project is actually near the top.

                                                                                                                                                                                                                                                                                                                                                                                                                      > I don't think anyone has reliable information about post-2021 language usage by humans.

                                                                                                                                                                                                                                                                                                                                                                                                                      It's information about language usage by humans. We know the rate at which generated text has increased after 2021. How do we filter this to only have data from humans?

                                                                                                                                                                                                                                                                                                                                                                                                                      The bottom is just lamenting what's happening in the field (which is pretty much what everyone that's been doing anything with NLP research is also complaining about behind closed doors).

                                                                                                                                                                                                                                                                                                                                                                                                                    • tqi 20 hours ago

                                                                                                                                                                                                                                                                                                                                                                                                                      "Sure, there was spam in the wordfreq data sources, but it was manageable and often identifiable."

                                                                                                                                                                                                                                                                                                                                                                                                                      How sure can we be about that?

                                                                                                                                                                                                                                                                                                                                                                                                                      • thesnide a day ago

                                                                                                                                                                                                                                                                                                                                                                                                                        I think that text on the internet will tainted by AI the same way that steel has being tainted by nuclear devices.

                                                                                                                                                                                                                                                                                                                                                                                                                        • iamnotsure a day ago

                                                                                                                                                                                                                                                                                                                                                                                                                          "Multi-script languages

                                                                                                                                                                                                                                                                                                                                                                                                                          Two of the languages we support, Serbian and Chinese, are written in multiple scripts. To avoid spurious differences in word frequencies, we automatically transliterate the characters in these languages when looking up their words.

                                                                                                                                                                                                                                                                                                                                                                                                                          Serbian text written in Cyrillic letters is automatically converted to Latin letters, using standard Serbian transliteration, when the requested language is sr or sh."

                                                                                                                                                                                                                                                                                                                                                                                                                          I'd support keeping both scripts (српска ћирилица and latin script) , similarly to hiragana (ひらがな) and katakana (カタカナ) in Japanese.

                                                                                                                                                                                                                                                                                                                                                                                                                          • eqvinox a day ago

                                                                                                                                                                                                                                                                                                                                                                                                                            Why is this a HN comment on a thread about it ending due to AI pollution?

                                                                                                                                                                                                                                                                                                                                                                                                                          • andai a day ago

                                                                                                                                                                                                                                                                                                                                                                                                                            Has anyone taken a look at a random sample of web data? It's mostly crap. I was thinking of making my own search engine, knowledge database etc based on a random sample of web pages, but I found that almost all of them were drivel. Flame wars, asinine blog posts, and most of all, advertising. Forget spam, most of the legit pages are trying to sell something too!

                                                                                                                                                                                                                                                                                                                                                                                                                            The conclusion I arrived at was that making my own crawler actually is feasible (and given my goals, necessary!) because I'm only interested in a very, very small fraction of what's out there.

                                                                                                                                                                                                                                                                                                                                                                                                                            • andai 4 hours ago

                                                                                                                                                                                                                                                                                                                                                                                                                              The unspoken question here, of course, is "you wouldn't happen to have already done this for me?" ;)

                                                                                                                                                                                                                                                                                                                                                                                                                            • jijojohnxx 12 hours ago

                                                                                                                                                                                                                                                                                                                                                                                                                              Looks like the wordfreq party is over. Time for the next wave of knowledge tools, wonder what socialsignalai could bring to the table.

                                                                                                                                                                                                                                                                                                                                                                                                                              • honksillet 18 hours ago

                                                                                                                                                                                                                                                                                                                                                                                                                                Twitter was a botnet long before LLMs and Musk got involved.

                                                                                                                                                                                                                                                                                                                                                                                                                                • joshdavham a day ago

                                                                                                                                                                                                                                                                                                                                                                                                                                  If the language you’re processing was generated by AI, it’s no longer NLP, it’s ALP.

                                                                                                                                                                                                                                                                                                                                                                                                                                  • aftbit a day ago

                                                                                                                                                                                                                                                                                                                                                                                                                                    Wow there is so much vitriol both in this post and in the comments here. I understand that there are many ethical and practical problems with generative AI, but when did we stop being hopeful and start seeing the darkest side of everything? Is it just that the average HN reader is now past the age where a new technological development is an exciting opportunity and on to the age where it is a threat? Remember, the Luddites were not opposed to looms, they just wanted to own them.

                                                                                                                                                                                                                                                                                                                                                                                                                                    • aryonoco a day ago

                                                                                                                                                                                                                                                                                                                                                                                                                                      When?

                                                                                                                                                                                                                                                                                                                                                                                                                                      For some of us, it was 1994, the eternal September.

                                                                                                                                                                                                                                                                                                                                                                                                                                      For some of us, it was when Aaron Swartz left us.

                                                                                                                                                                                                                                                                                                                                                                                                                                      For some of us, it was when Google killed Google Reader (in hindsight, the turning point of Google becoming evil).

                                                                                                                                                                                                                                                                                                                                                                                                                                      For some others, like the author of this post, it's when twitter and reddit closed their previously open APIs.

                                                                                                                                                                                                                                                                                                                                                                                                                                      • Der_Einzige 10 hours ago

                                                                                                                                                                                                                                                                                                                                                                                                                                        Aaron Swartz would have loved open source GenAI models.

                                                                                                                                                                                                                                                                                                                                                                                                                                      • JohnFen a day ago

                                                                                                                                                                                                                                                                                                                                                                                                                                        > when did we stop being hopeful and start seeing the darkest side of everything?

                                                                                                                                                                                                                                                                                                                                                                                                                                        I think a decade or two ago, when most of the new tech being introduced (at least by our industry) started being unmistakably abusive and dehumanizing. When the recent past shows a strong trend, it's not unreasonable to expect the the near future will continue that trend. Particularly when it makes companies money.

                                                                                                                                                                                                                                                                                                                                                                                                                                        • slashdave a day ago

                                                                                                                                                                                                                                                                                                                                                                                                                                          Give us examples of generative AI in challenging applications (biology, medicine, physical sciences), and you'll get a lot of optimism. The text LLM stuff is the brute force application of the same class of statistical modeling. It's commercial, and boring.

                                                                                                                                                                                                                                                                                                                                                                                                                                        • anovikov a day ago

                                                                                                                                                                                                                                                                                                                                                                                                                                          Sad. I'd love to see by how much the use of world "delve" has increased since 2021...

                                                                                                                                                                                                                                                                                                                                                                                                                                          • Terretta a day ago

                                                                                                                                                                                                                                                                                                                                                                                                                                            > I'd love to see by how much the use of world "delve" has increased since 2021...

                                                                                                                                                                                                                                                                                                                                                                                                                                            There are charts / graphs in the link, both since 2021, and since earlier.

                                                                                                                                                                                                                                                                                                                                                                                                                                            The final graph suggests the phenomenon started earlier, possibly correlated in some way to Malaysian / Indian usages of English.

                                                                                                                                                                                                                                                                                                                                                                                                                                            It does seem OpenAI's family of GPTs as implemented in ChatGPT unspool concepts in a blend of India-based-consultancy English with American freshmen essay structure, frosted with superficially approachable or upbeat blogger prose ingratiatingly selling you something.

                                                                                                                                                                                                                                                                                                                                                                                                                                            Anthropic has clearly made efforts to steer this differently, Mistral and Meta as well but to a lesser degree.

                                                                                                                                                                                                                                                                                                                                                                                                                                            I've wondered if this reflects training material (the SEO is ruining the Internet theory), or is more simply explained by selection of pools of Hs hired for RLHF.

                                                                                                                                                                                                                                                                                                                                                                                                                                            • chipdart a day ago

                                                                                                                                                                                                                                                                                                                                                                                                                                              From the submission you're commenting on:

                                                                                                                                                                                                                                                                                                                                                                                                                                              > As one example, Philip Shapira reports that ChatGPT (OpenAI's popular brand of generative language model circa 2024) is obsessed with the word "delve" in a way that people never have been, and caused its overall frequency to increase by an order of magnitude.

                                                                                                                                                                                                                                                                                                                                                                                                                                            • slashdave a day ago

                                                                                                                                                                                                                                                                                                                                                                                                                                              Amusing that we now have a feedback loop. Let's see... delve delve delve delve delve delve delve delve. There, I've done my part.

                                                                                                                                                                                                                                                                                                                                                                                                                                              • dqv a day ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                Same for me but with the word “crucial”.

                                                                                                                                                                                                                                                                                                                                                                                                                                                • CaptainFever 20 hours ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                  Google ngram viewer, perhaps?

                                                                                                                                                                                                                                                                                                                                                                                                                                                  • xpl a day ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                    The fun thing is that while GPTs initially learned from humans (because ~100% of the content was human-generated), future humans will learn from GPTs, because almost all available content would be GPT-generated very soon.

                                                                                                                                                                                                                                                                                                                                                                                                                                                    This will surely affect how we speak. It's possible that human language evolution could come to a halt, stuck in time as AI datasets stop being updated.

                                                                                                                                                                                                                                                                                                                                                                                                                                                    In the worst case, we will see a global "model collapse" with human languages devolving along with AI's, if future AIs are trained on their own outputs...

                                                                                                                                                                                                                                                                                                                                                                                                                                                  • whimsicalism 18 hours ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                    NLP and especially 'computational linguistics' in academia has been captured by certain political interests, this is reflective of that.

                                                                                                                                                                                                                                                                                                                                                                                                                                                    • will-burner 17 hours ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                      > It's rare to see NLP research that doesn't have a dependency on closed data controlled by OpenAI and Google, two companies that I already despise.

                                                                                                                                                                                                                                                                                                                                                                                                                                                      The dependency on closed data combined with the cost of compute to do anything interesting with LLMs has made individual contributions to NLP research extremely difficult if one is not associated with a very large tech company. It's super unfortunate, makes the subject area much less approachable, and makes the people doing research in the subject area much more homogeneous.

                                                                                                                                                                                                                                                                                                                                                                                                                                                      • eadmund a day ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                        > the Web at large is full of slop generated by large language models, written by no one to communicate nothing

                                                                                                                                                                                                                                                                                                                                                                                                                                                        That’s neither fair nor accurate. That slop is ultimately generated by the humans who run those models; they are attempting (perhaps poorly) to communicate something.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        > two companies that I already despise

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Life’s too short to go through it hating others.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        > it's very likely because they are creating a plagiarism machine that will claim your words as its own

                                                                                                                                                                                                                                                                                                                                                                                                                                                        That begs the question. Plagiarism has a particular definition. It is not at all clear that a machine learning from text should be treated any differently from a human being learning from text: i.e., duplicating exact phrases or failing to credit ideas may in some circumstances be plagiarism, but no-one is required to append a statement crediting every text he has ever read to every document he ever writes.

                                                                                                                                                                                                                                                                                                                                                                                                                                                        Credits: every document I have ever read grin

                                                                                                                                                                                                                                                                                                                                                                                                                                                        • miningape a day ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                          This is just the "guns don't shoot people, people do." argument except in this case we quite literally have a massive upside incentive to remove people from the process entirely (i.e. websites that automatically generate new content every day) - so I don't buy it.

                                                                                                                                                                                                                                                                                                                                                                                                                                                          This kind of AI slop is quite literally written by no one (an algorithm pushed it out), and it doesn't communicate anything since communication first requires some level of understanding of the source material - and LLM's are just predicting the likely next token without understanding. I would also extend this to AI slop written by someone with a limited domain understanding, they themselves have nothing new to offer, nor the expertise or experience to ensure the AI is producing valuable content.

                                                                                                                                                                                                                                                                                                                                                                                                                                                          I would go even further and say it's "read by no one" - people are sick and tired of reading the next AI slop article on google and add stuff like "reddit" to the end of their queries to limit the amount of garbage they get.

                                                                                                                                                                                                                                                                                                                                                                                                                                                          Sure there are people using LLMs to enhance their research, but a vast, vast majority are using it to create slop that hits a word limit.

                                                                                                                                                                                                                                                                                                                                                                                                                                                          • slashdave a day ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                            > It is not at all clear that a machine learning from text should be treated any differently from a human being learning from text

                                                                                                                                                                                                                                                                                                                                                                                                                                                            Given that LLMs and human creativity work on fundamentally different principles, there is every reason to believe there is a difference.

                                                                                                                                                                                                                                                                                                                                                                                                                                                            • weevil a day ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                              I feel like you're giving certain entities too much credit there. Yes text is generated to do _something_, but it may not be to communicate in good-faith; it could be keyword-dense gibberish designed to attract unsuspecting search engine users for click revenue, or generate political misinformation disseminated to a network of independent-looking "news" websites, or pump certain areas with so much noise and nonsense information that those spaces cannot sustain any kind of meaningful human conversation.

                                                                                                                                                                                                                                                                                                                                                                                                                                                              The issue with generative 'AI' isn't that they generate text, it's that they can (and are) used to generate high-volume low-cost nonsense at a scale no human could ever achieve without them.

                                                                                                                                                                                                                                                                                                                                                                                                                                                              > Life’s too short to go through it hating others

                                                                                                                                                                                                                                                                                                                                                                                                                                                              Only when they don't deserve it. I have my doubts about Google, but I've no love for OpenAI.

                                                                                                                                                                                                                                                                                                                                                                                                                                                              > Plagiarism has a particular definition ... no-one is required to append a statement crediting every text he has ever read

                                                                                                                                                                                                                                                                                                                                                                                                                                                              Of course they aren't, because we rightly treat humans learning to communicate differently from training computer code to predict words in a sentence and pass it off as natural language with intent behind it. Musicians usually pay royalties to those whose songs they sample, but authors don't pay royalties to other authors whose work inspired them to construct their own stories maybe using similar concepts. There's a line there somewhere; falsely equating plagiarism and inspiration (or natural language learning in humans) misses the point.

                                                                                                                                                                                                                                                                                                                                                                                                                                                            • antirez a day ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                              Ok so post author is AI skeptic and this is his retaliation, likely because his work is affected. I believe governments should address the problem with welfare but being against technical advances is always being in the wrong side of history.

                                                                                                                                                                                                                                                                                                                                                                                                                                                              • exo-pla-net a day ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                                This is a tech site, where >50% of us are programmers who have achieved greater productivity thanks to LLM advances.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                And yet we're filled to the gills with Luddite sentiments and AI content fearmongering.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                Imagine the hysteria and the skull-vibrating noise of the non-HN rabble when they come to understand where all of this is going. They're going to do their darndest to stop us from achieving post-economy.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                • devjab 7 hours ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                                  I think programmers are in the perfect profession to call LLMs out for just how bad they are. They are fancy auto-complete and I love them in my daily usage, but a big part of that is because I can tell when they are ridiculously wrong. Which is so often you really have to question how useful they would be for anything where they aren’t just fancy auto-complete.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Which isn’t AIs fault. I’m sure they can be great in cancer detection, unless they replace what we’re already doing because they are cheaper than doctors. In combination with an expert AI is great, but that’s not what’s happening is it?

                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • antirez a day ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    I fail to see the difference. Actually, programming was one of the first field where LLMs shown proficiency. The helper nature of LLMs is true in all the fields so far, in the future this may change. I believe that for instance in the case or journalism the issue was already there: three euros per post written without clue by humans.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Anyway in the long run AI will kill tons of jobs. Regardless of blog posts like that. The true key is governments assistance.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • exo-pla-net a day ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                                      I don't know what difference you are referring to. I was agreeing with you.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                      And also agreed: many trumpet the merits of "unassisted" human output. However, they're suffering from ancestor veneration: human writing has always been a vast mine of worthless rock (slop) with a few gems of high-IQ analysis hidden here and there.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                      For instance, upon the invention of the printing press, it was immediately and predominantly used for promulgating religious tracts.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                      And even when you got to Newton, who created for us some valuable gems, much of his output was nevertheless deranged and worthless. [1]

                                                                                                                                                                                                                                                                                                                                                                                                                                                                      It follows that, whether we're a human or an LLM, if we achieve factual grounding and the capacity to reason, we achieve it despite the bulk of the information we ingest. Filtering out sludge is part of the required skillset for intellectual growth, and LLM slop qualitatively changes nothing.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                      [1] https://www.newtonproject.ox.ac.uk/view/texts/diplomatic/THE...

                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • antirez a day ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Sorry I didn't imply we didn't agree but that programmers were and are going to be impacted as much as writers for instance, yet I see an environment where AI is generally more accepted as a tool.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                        About your last point sometimes I think that in the future there will be models specifically distilling the climax of selected thinkers, so that not only their production will be preserved but maybe something more that is only implicitly contained in their output.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • exo-pla-net 18 hours ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                                          That's a good point: the greatest value that we can glean from one another is likely not epistemological "facts about the world", nor is it even the predictive models seen in science and higher brow social commentary, but in patterns of thinking. That alone is the infinite wellspring for achieving greater understanding, whether formalized with the scientific method or whether more loosely leveraged to succeed with a business endeavor.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Anecdotally, I met success in prompting GPT-3 to "mimic Stephen Pinker" when solving logical puzzles. Puzzles that it would initially fail, it would succeed attempting to mimic his language. GPT-3 seemed to have grokked the pattern of how Stephen Pinker thinks through problems, and it could leverage those patterns to improve its own reasoning. OpenAI o1 needs no such assistance, and I expect that o2 will fully supplant humans with its ability to reason.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                          It follows that all that we have to offer with our brightest minds will be exhausted, and we will be eclipsed in every conceivable way by our creation. It will mark the end of the Anthropocene; something that likely exceeds the headiest of Nick Bostom speculations will take its place.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                          It seems that this is coming in 2026 if not sooner, and Alignment is the only thing that ought occupy our minds: the question of whether we're creating something that will save us from ourselves, or whether all that we've built will culminate in something gross and final.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Looking around myself, however, I see impassioned "discourse" about immigration. The merits of DEI. Patriotism. Transgenderism. Religion. Copyright. Vast herds of dinosaurs preying upon one another, giving only idle attention to the glowing object in the sky. Is it an asteroid? Is it a UFO that is coming down to provide dinosaur healthcare? Nope, not even that level of thought is mustered. With 8 billion people on the planet, Utopia by Nick Bostrom hasn't even mustered 100 reviews on Amazon. On the advent of the defining moment of the universe itself, when virtually all that is imaginable is unlocked for us, our species' heads remains buried in the mud, gnawing at one another's filthy toes, and I'm alienated and disgusted.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                          The only glints of beauty I see in my fellow man are in those with minds which exceed a certain IQ threshold and cognitive flexibility, as well as in lesser minds which exhibit gentleness and humility. There is beauty there, and there is beauty in the staggering possibility of the universe itself. The rest is at best entomology, and I won't mourn its passing.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                • assanineass a day ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Well said

                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • yard2010 8 hours ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    > Now Twitter is gone anyway, its public APIs have shut down, and the site has been replaced with an oligarch's plaything, a spam-infested right-wing cesspool called X

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    God I hate this dystopic timeline we live in.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • keeptakingshots 10 hours ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                                      thank you for sharing this.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • floppiplopp a day ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                                        I really like the fact that the content of the conventional user content internet is becoming willfully polluted and ever more useless by the incessant influx of "ai"-garbage. At some point all of this will become so awful that nerds will create new and quiet corners of real people and real information while the idiot rabble has to use new and expensive tools peddled by scammy tech bros to handle the stench of automated manure that flows out of stagnant llms digesting themselves.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • JohnFen a day ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                                          > At some point all of this will become so awful that nerds will create new and quiet corners of real people and real information

                                                                                                                                                                                                                                                                                                                                                                                                                                                                          It's already happening. There is a growing number of groups that are forming their own "private internets" that is separated from the internet-at-large, precisely because the internet at large is becoming increasingly useless for a whole lot of valuable things.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • biofox a day ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Most of the time, HN is that quiet corner. I just hope it stays that way.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • cdrini 7 hours ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                                            This has to be the most annoying hacker news comment section I've ever seen. It's just the same ~4 viewpoints rehashed again, and again, and again. Why don't folks just upvote other comments that say the same thing instead of repeating the same things?

                                                                                                                                                                                                                                                                                                                                                                                                                                                                            And now a hopefully new comment: having a word frequency measure of the internet as we're going into AI being more used would be IMMENSELY useful specifically _because_ more of the internet is being AI generated! I could see such a dataset being immensely useful to researchers who are looking for the impacts of AI on language, and to test empirically a lot of claims the author has made in this very post! What a shame that they stopped measuring.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Also: as to the claims that AI will cause stagnation and a reduction of the variance of English vocabulary used, this is a trend in English that's been happening for over 100 years ( https://evoandproud.blogspot.com/2019/09/why-is-vocabulary-s... ). I believe the opposite will happen, AI will increase the average persons vocabulary, since chat AIs tends to be more professionally written than a lot of the internet. It's like being able to chat with someone that has an infinite vocabulary. It also makes it possible for people to read complicated documents well out of their domain, since they can ask not just for definitions but more in depth explanations of what words/sections mean.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Here's to a comment that will never be read because of all the noise in this thread :/

                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • appendix-rock 6 hours ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              They want to display how they’re truly intelligent (unlike LLMs) by checks notes rehashing opinions that they’ve read millions of times online.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Sound familiar to anyone?

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • wbillingsley 5 hours ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                I wonder whether future generations will be ingrained with a Truman Show fear that maybe only the few thousand people they meet are real and everything else is generated background noise.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • Cthulhu_ 4 hours ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  I already get this when I look at e.g. youtube comments.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • actionfromafar 6 hours ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                I read, but I can't say I like it. :-D People will ELI5 everything to understand it, no hard word understand necessary, up-goer-five-style, then "de-compress" it into floral (Amorphophallus Titanum scented) GPT speak when sending responses back out.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • advael 5 hours ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  On a meta level I agree that having this kind of dataset with "before and after" would be pretty interesting. On an object level I do not predict that this would increase the overall diversity of language usage - and in fact it would be extremely surprising if this was even possible due to some general mathematical properties of neural networks - nor would "more professional writing," though I do agree with this characterization of the way AI-generated text sounds. The more I work with LLMs and encounter them in the wild, the greater my confidence that I can tell when something was generated, with the exception of B2B marketing copy and communications from HR departments or state agencies

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  On the level of meta-discourse you seem to want to also speak to: Dang even when people have the Official Corporate Approved Perspective (in particular, the claim that it's "like being able to chat with someone that has an infinite vocabulary" is probably the silliest delusional AI hype I've heard all week) and the most upvotes in the thread they still think they're an embattled ideological minority. Starting to think that literally zero people in the modern world don't have or affect a victim complex of some kind

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • cdrini 4 hours ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Haha I'm pleasantly surprised to see my comment at the top, I genuinely thought it would drown to the bottom! Not due to disagreement, just due to sheer volume and being posted rather late in this posts lifespan. Anyways my meta comment wasn't that I disagreed with all the other comments, I was just frustrated at how repetitive they were of one another. When I go to leave a comment, I do a pass reading through all or most of the comments to make sure someone hasn't left a comment in the same vein, and it was just frustrating to go through people saying almost verbatim the same thing others were saying! If your comment isn't adding something new why leave it? I'm all for healthy disagreement :) Also not sure what part of my post sounds like it's from an "embattled ideological minority".

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    But speaking of healthy disagreement, as to "chatting with someone that has an infinite vocabulary", I'd love to hear any counterarguments you might have; or was calling it "silly and delusional" meant to be your argument? :P I think it's a pretty uncontroversial statement seeing as eg ChatGPT very likely knows every word in the English language.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • advael 3 hours ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      The most ridiculous aspects for me were the anthropomorphizing (Reminds me of that one Sam Altman interview a bit) and the use of "infinite", which both doesn't really work on vibes (as many have noted, while I'm sure chatGPT has been exposed to every word, its pattern of communication is very "regression to the mean" among them), but also is silly if taken literally, because unless we're counting like some quirky technically-grammatical combinatoric compounding that we in practice infer the meaning of from composition of what we identify as separate individual words (like just hyphenating a bunch of adjectives and a noun or something) there's not really an argument for there being "infinite vocabulary" in the same sense that there is for "infinite possible sentences" because being a valid word requires at least that someone can meaningfully comprehend what is meant by it, and coordination requirements of this nature tend to truncate infinities

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      The case for ChatGPT doing significant coinage that sticks isn't particularly strong either, partially from theory and partially because I'd think I'd've heard a lot of complaints about it by now, and the ones on hackernews would be repetitive to the point of seeming unavoidable (we agree on that for sure)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Anyway, re: the silliest hype I've heard all week, I'm mostly just trying to find humor in what has been a pretty bad hype wave for someone who's pathologically bad at sounding like the kind of nontechnical hype guys that pervade any tech hype wave but is nonetheless mostly seeking out jobs in this field because it's what my expertise is in. Incredibly awful job market for a lot of people I realize, but it feels like a special hell I get for getting into ML research before it was (quite so) cool. I'm trying to fight the negativity but I've gotten screwed over a lot lately, but I don't have anything against you personally for being silly on hn

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • mark-r 3 hours ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Sure, ChatGPT knows every word in the English language (and probably quite a few that ain't). But how likely is it to use them all?

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • BiteCode_dev 5 hours ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Also it breaks the languagr barreer, you can now read the Chinese internet if you want, or chat transparently in Arabic. That's going to be interesting.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • Cthulhu_ 4 hours ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        At the moment though (and ever since decent online translation services were a thing), it feels one-way, that is, people from that side of the internet coming to the anglosphere internet moreso than anglosphere people going internet-abroad. I may be wrong.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • BiteCode_dev 2 hours ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          As a Frenchman, I learned very quickly that my language sphere market and resource pool is so much smaller than the English one that it's 10 times less effective to do anything in it.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          I understand the position.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          The only exception would be China, but the GFW is probably not helping.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          LLV might lower the cost of that so much that it will become more interesting to do so.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • vlan121 6 hours ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        You haven't read the whole thing. It says that: or that could benefit generative AI.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • cdrini 5 hours ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          I did read it :) not sure how that line applies here, can you expand?

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • shortrounddev2 a day ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Man the AI folks really wrecked everything. Reminds me of when those scooter companies started just dumping their scooters everywhere without asking anybody if they wanted this.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • analog31 a day ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          perhaps germane to this thread, I think the scooter thing was an investment bubble. it was easier to burn investment money on new scooters than to collect and maintain old ones. until the money ran out.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • kdmccormick a day ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            At least scooters did something useful for the environment.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • Sander_Marechal a day ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Did they? A lot of then were barely used, got damaged or vandalized, etc. And when the companies folded or communities outlawed the scooters, they end up as trash. I don't believe for a second that the amount of pollutants and greenhouse gasses saved by usage is larger than the amount produced by manufacturing, shipping and trashing all those scooters.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • DrillShopper a day ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Their batteries on the other hand…

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • kdmccormick a day ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Sure, they're worse than walking or biking, but compared to an electric car battery or an ICE car?

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • Sharlin a day ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    At least where I'm from, scooters have mostly replaced walking and biking, not car trips :(

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            • syngrog66 19 hours ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              A few years ago I began an effort to write a new tech book. I planned orig to do as much of it as I could across a series of commits in a public GitHub repo of mine.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              I then changed course. Why? I had read increasing reports of human e-book pirates (copying your book's content then repackaging it for sale under a diff title, byline, cover, and possibly at a much lower or even much higher price.)

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              And then the rise of LLMs and their ravenous training ingest bots -- plagiarism at scale and potentially even easier to disguise.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              "Not gonna happen." - Bush Sr., via Dana Carvey

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Now I keep the bulk of my book material non-public during dev. I'm sure I'll share a chapter candidate or so at some point before final release, for feedback and publicity. But the bulk will debut all together at once, and only once polished and behind a paywall

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              • adr1an 9 hours ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                I guess curating unpolluted text is one of the new jobs GenAI created? /s

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                • hoseja a day ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  >"Now Twitter is gone anyway, its public APIs have shut down, and the site has been replaced with an oligarch's plaything, a spam-infested right-wing cesspool called X. Even if X made its raw data feed available (which it doesn't), there would be no valuable information to be found there.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  >Reddit also stopped providing public data archives, and now they sell their archives at a price that only OpenAI will pay.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  >And given what's happening to the field, I don't blame them."

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  What beautiful doublethink.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • mschuster91 a day ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    > What beautiful doublethink.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Given just how many AI bots scrape up everything they can, oftentimes ignoring robots.txt or any rate limits (there have been a few complaint threads on HN about that), I can hardly blame the operators of large online services just cutting off data feeds.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Twitter however didn't stop their data feeds due to AI or because they wanted money, they stopped providing them because its new owner does everything he can to hinder researchers specializing in propaganda campaigns or public scrutiny.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • hluska a day ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      What was Reddit’s excuse? They did roughly the same thing (and have just as much garbage content).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      In other words, why is it wrong for X but okay for Reddit? If you ignore one individual’s politics, the two services did the same thing.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • mschuster91 a day ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Reddit shut their API access down only very recently, after the AI craze went off. Twitter did so right after Musk took over, way before Reddit, way before AI ever went nuts.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • dotnet00 a day ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          X shut down API access in Feb 2023, Reddit shut theirs down at the end of June of the same year. Just barely 6 months apart.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Furthermore, while X had also only announced this in February, Reddit announced their API shutdown just 2 months later in April.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          And, to further add to that, X was pretty upfront that they think they have access to a large and powerful dataset in X and didn't want to give it out for free. Reddit used very similar wording when announcing their changes.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  • primer42 a day ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Hear, hear!

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    • QRe 20 hours ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      I understand the frustration shared in this post but I wholeheartedly disagree with the overall sentiment that comes with it.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      The web isn't dead, (Gen)AI, SEO, spam and pollution didn't kill anything.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      The world is chaotic and net entropy (degree of disorder) of any isolated or closed system will always increase. Same goes for the web. We just have to embrace it and overcome the challenges that come with it.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      • ryukoposting 19 hours ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        I'm not so optimistic. The most basic requirements are:

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        1. Prove the human-ness of an author... 2. ...without grossly encroaching on their privacy. 3. Ensure that the author isn't passing off AI-generated material as their own.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        We'll leave out the "don't let AI models train on my data" part for now.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Whatever solution we come up with, if any, will necessarily be mired in the politics of privacy, anonymity, and/or DRM. In any case, it's hard to conceive of a world where the human web returns as we once knew it.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • vundercind 11 hours ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          The good news—such as it is—is that the Web never really became what we assumed it surely would in its early days.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          If it was never really the case that you’d be better off for serious or improving reading having only the Web versus only access to a decent library, then we haven’t lost something so precious.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          I mean, the most valuable site on the Web is probably a book & research paper piracy website. That’s its crowning achievement. Faster interlibrary loan, basically, but illegal.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        • brunokim 19 hours ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Here is an expert saying there is a problem and how it killed its research effort, and yet you say that things are the same as ever and nothing was killed.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • QRe 6 hours ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            1. I am not discrediting the expert in any way, if anything, I think their decision to quit is understandable - there is now a challenge that arose during his research that is not in their interest to pursue (information pollution is not research in corpus linguistics / NLP).

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            2. I never said that things are the same as ever, quite the opposite actually. I am saying the world evolves constantly. It's naive to say company X/Y/Z killed something or made something unusable, when there is constant inevitable change. We should focus on how to move forward giving this constraint, and not dwell on times where the web was so much 'cleaner' and 'nicer', more manageable etc.

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          • scudsworth 11 hours ago

                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            T H A N K S