Cloudflare.com's Robots.txt (cloudflare.com)
Submitted by sans_souse 4 hours ago
  • seanwilson 15 minutes ago

    I have an ASCII art Easter egg like this in an SEO product I made. :)

    https://www.checkbot.io/robots.txt

    I should probably add this SEO tip too because the purpose of robots.txt is confusing: If you want to remove/deindex a page from Google search, you counterintuitively need to actually allow the page to be crawled in the robots.txt file, and then add a noindex response header or noindex meta tag to the page. This way the crawler gets to see the noindex instruction. Robots.txt is there to control which pages can be crawled, not which pages can be indexed.
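
    A minimal sketch of that rule, using only the Python standard library. The robots.txt content, the example.com URLs, and the function name `noindex_will_be_seen` are all hypothetical and only for illustration; real crawlers apply more rules than this.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt used only for this illustration.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


class _RobotsMetaFinder(HTMLParser):
    """Records whether the page has <meta name="robots" content="...noindex...">."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        content = (attr.get("content") or "").lower()
        if name == "robots" and "noindex" in content:
            self.noindex = True


def noindex_will_be_seen(robots_txt, url, headers, html, agent="Googlebot"):
    """True only if the crawler may fetch the page AND the page says noindex."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if not parser.can_fetch(agent, url):
        # Crawling is blocked, so the crawler never sees any noindex signal.
        return False

    # Header names are case-insensitive, so normalise the keys first.
    lowered = {k.lower(): v for k, v in headers.items()}
    header_noindex = "noindex" in lowered.get("x-robots-tag", "").lower()

    finder = _RobotsMetaFinder()
    finder.feed(html)
    return header_noindex or finder.noindex
```

    With this robots.txt, a noindex header on https://example.com/private/page.html is never seen because the path is disallowed, while the same header on https://example.com/old-page.html is.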

    • m-app 5 minutes ago

      What does “OUR TREE IS A REDWOOD” refer to? A quick search doesn’t yield any definite results.

      • palsecam 2 hours ago

        That’s a funny one!

        Does anyone know of others like that?

        Here is mine: https://FreeSolitaire.win/robots.txt

        • jsheard 2 hours ago

          Google used to have a /killer-robots.txt which forbade the T-1000 and T-800 from accessing Larry Page and Sergey Brin, but they took that down at some point.

        • CodesInChaos 2 hours ago

          What's the purpose of "User-Agent: DemandbaseWebsitePreview/0.1"? I couldn't find anything about that agent, but I assume it's somehow related to demandbase.com?

          But why are it and Twitter the only whitelisted entries? Google and Bing being missing is a bit surprising, but I assume they're whitelisted through a different mechanism (like a Google webmaster account)?

          • saddist0 2 hours ago

            It is one of the services they use. As per the cookie policy page [1]:

            > DemandBase - Enables us to identify companies who intend to purchase our products and solutions and deliver more relevant messages and offers to our Website visitors.

            [1]: https://www.cloudflare.com/en-in/cookie-policy/

            • Maken an hour ago

              My guess is that the Twitter one is for previews when you link to a website on Twitter.

            • jsheard 3 hours ago

              This is what happens if your robot isn't nice

                > curl -I -H "User-Agent: Googlebot" https://www.cloudflare.com
                HTTP/2 403

              • jamesog 2 hours ago

                That's not from robots.txt, but their Bot Management feature which blocks things calling themselves Googlebot that don't come from known Google IPs.
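
                A rough sketch of that kind of check, not Cloudflare's actual implementation. The networks below are placeholder documentation ranges (RFC 5737 and RFC 3849), not real Google ranges, and the names `TRUSTED_CRAWLER_NETWORKS` and `should_block` are made up for the example.

```python
import ipaddress

# Placeholder documentation ranges, NOT real Googlebot ranges.
# A real deployment would load the crawler operator's published list instead.
TRUSTED_CRAWLER_NETWORKS = {
    "googlebot": [
        ipaddress.ip_network("192.0.2.0/24"),
        ipaddress.ip_network("2001:db8::/32"),
    ],
}


def should_block(user_agent, client_ip):
    """Block requests that claim a known crawler name from an unknown IP."""
    ip = ipaddress.ip_address(client_ip)
    claimed = user_agent.lower()
    for crawler, networks in TRUSTED_CRAWLER_NETWORKS.items():
        if crawler in claimed:
            # The request claims to be this crawler: only allow it when the
            # source address falls inside one of the trusted networks.
            return not any(ip in network for network in networks)
    # No crawler name claimed, so this particular check does not apply.
    return False
```

                Under these assumptions, a request with "User-Agent: Googlebot" from an address outside the listed networks is blocked, which matches the 403 seen with curl above.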

                • speedgoose an hour ago

                  Are GCP IPs considered Google IPs?

                  • jgrahamc an hour ago

                    No.

                    • crop_rotation an hour ago

                      No, I am very sure they are not.

                • yapyap 3 hours ago

                  That’s cool, if any scrapers still respect robots.txt, that is.

                  • bityard 3 hours ago

                    Think of robots.txt as less of a no trespassing sign and more of a, "You can visit but here are the rules to follow if you don't want to get shot" sign.

                  • op00to 2 hours ago

                    If those robots could read, they'd be very upset.

                    • orliesaurus 29 minutes ago

                      Has anyone worked on anything like this for AI scrapers?