[HN Gopher] Removing my site from Google search
       ___________________________________________________________________
        
       Removing my site from Google search
        
       Author : todsacerdoti
       Score  : 138 points
       Date   : 2021-10-03 17:57 UTC (5 hours ago)
        
 (HTM) web link (www.btao.org)
 (TXT) w3m dump (www.btao.org)
        
       | bouke wrote:
       | Why not use robots.txt instead of littering your html with
       | googlebot instructions?
        
         | intricatedetail wrote:
         | I have disallowed all robots in my robots.txt and my project
         | still shows up in search.
        
         | TheChaplain wrote:
         | Yes, pretty sure this is the way to go.
         | 
         | You can even tell which bots are allowed to index and not.
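
A minimal sketch of the per-bot rules mentioned above, checked with Python's standard urllib.robotparser. The robots.txt content and the example.com URL are hypothetical, not taken from the linked site. As other comments in this thread point out, these rules only govern crawling; they do not by themselves keep a URL out of the index.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep Googlebot out, let every other crawler in.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /

User-agent: *
Allow: /
"""


def can_crawl(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Calling can_crawl("Googlebot", "https://example.com/post") evaluates the Googlebot group, while any other agent name falls through to the wildcard group.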
        
         | tao_oat wrote:
         | Hi, author here. Google stopped supporting robots.txt [edit: as
         | a way to fully remove your site] a few years ago, so these meta
         | tags are now the recommended way of keeping their crawler at
         | bay: https://developers.google.com/search/blog/2019/07/a-note-
         | on-...
        
           | new_guy wrote:
           | Did you actually read your link? That's not at all what it
           | says.
        
             | tao_oat wrote:
             | To be clear, stopped supporting robots.txt _noindex_ a few
             | years ago.
             | 
             | Combined with the fact that Google might list your site
             | [based only on third-party links][1], robots.txt isn't an
             | effective way to remove your site from Google's results.
             | 
             | Sorry, could have been clearer.
             | 
             | [1]: https://developers.google.com/search/docs/advanced/rob
             | ots/in...
        
             | dd82 wrote:
             | >noindex in robots meta tags: Supported both in the HTTP
             | response headers and in HTML, the noindex directive is the
             | most effective way to remove URLs from the index when
             | crawling is allowed.
             | 
             | Seems clear enough to me
        
             | jen20 wrote:
             | Quote from the linked article:
             | 
             | " For those of you who relied on the noindex indexing
             | directive in the robots.txt file, which controls crawling,
             | there are a number of alternative options:"
             | 
             | The first option is the meta tag. It does mention an
             | alternative directive for robots.txt, however.
        
               | intricatedetail wrote:
               | A meta tag implies the robot is going to fetch the page,
               | effectively at the very least using up your bandwidth. It
               | should be illegal.
        
               | ghassanmas wrote:
               | What about blocking Googlebot by its IPs, combined with
               | the user agent? Wouldn't that stop the crawlers?
               | 
               | Google crawlers IPs https://www.lifewire.com/what-is-the-
               | ip-address-of-google-81...
        
             | gnabgib wrote:
             | This page has a little more detail: https://developers.goog
             | le.com/search/docs/advanced/crawling/...
             | 
             | "If other pages point to your page with descriptive text,
             | Google could still index the URL without visiting the page.
             | If you want to block your page from search results, use
             | another method such as password protection or noindex. "
        
           | Animats wrote:
           | Did you think that mighty Google would pay attention to your
           | puny "noindex" tag? Ha!
        
             | [deleted]
        
             | thih9 wrote:
             | According to google's own docs, this should work.
             | 
             | > You can prevent a page from appearing in Google Search by
             | including a noindex meta tag in the page's HTML code, or by
             | returning a noindex header in the HTTP response.
             | 
             | Source: https://developers.google.com/search/docs/advanced/
             | crawling/...
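
The two mechanisms quoted above (a noindex meta tag in the HTML, or a noindex header in the HTTP response) can be sketched with Python's standard http.server. The handler name, page body, and helper function are hypothetical. The generic "robots" name addresses every crawler that honors the tag; Google also documents name="googlebot" for targeting its crawler only.

```python
from http.server import BaseHTTPRequestHandler

# Meta tag variant: goes inside <head> of every page.
NOINDEX_META = '<meta name="robots" content="noindex">'


def render_page(body: str) -> bytes:
    """Wrap body in a minimal HTML document that carries the noindex tag."""
    html = (
        "<!doctype html><html><head>"
        f"{NOINDEX_META}<title>Example</title>"
        f"</head><body>{body}</body></html>"
    )
    return html.encode("utf-8")


class NoIndexHandler(BaseHTTPRequestHandler):
    """Serves one page and also sends the header variant of noindex."""

    def do_GET(self):
        payload = render_page("<p>Hello</p>")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        # Header variant: also works for non-HTML responses such as PDFs.
        self.send_header("X-Robots-Tag", "noindex")
        self.end_headers()
        self.wfile.write(payload)

# To try it locally (not run here):
#   from http.server import HTTPServer
#   HTTPServer(("127.0.0.1", 8000), NoIndexHandler).serve_forever()
```

Either variant only works if the crawler is allowed to fetch the page, which is why it conflicts with a robots.txt Disallow rule.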
        
               | swills wrote:
               | Until they change the rules again...
        
               | bryanrasmussen wrote:
               | I mean technically that says that your site won't appear
               | in search results, not that your site won't be used to
               | profile people, determine other site ratings based on
               | your site's content etc.
               | 
               | they won't show your site's content, but that doesn't
               | mean they won't use your site's content.
        
               | thih9 wrote:
               | I thought that (i.e. removing the site from google
               | search) was the goal.
               | 
               | I'd review the other usage on a case by case basis; e.g.
               | determining ratings of other sites seems fair use to me.
               | I'd guess you're allowing others to use your site's
               | content when you're making your site public (TINLA).
        
               | bryanrasmussen wrote:
               | maybe, but I guess I would be cantankerous enough to see
               | the goal as preventing google from profiting off your
               | site.
        
             | mro_name wrote:
             | yes, I do think that
        
         | walshemj wrote:
         | robots.txt stops crawling - you can get indexed via other
         | mechanisms.
         | 
         | You want noindex robots tags on all your pages and let google
         | see those.
         | 
         | You can use GSC (Google search console) to remove a site / page
         | from the index
        
         | vadfa wrote:
         | Or even better, iptables rules :P
        
           | TedDoesntTalk wrote:
           | Doesn't that mean you have to know every IP Address used by
           | Google bot now and in the future?
        
             | AshamedCaptain wrote:
             | Not a very hard problem; after all, many websites allow
             | full access to Googlebot IP ranges yet show a paywall to
             | everyone else (including competing search engines).
             | 
             | I also happen to ban Google ranges on multiple less-public
             | sites, especially since they completely ignore robots.txt
             | and crawl-delay.
        
             | judge2020 wrote:
             | The way to check googlebot (in a way that will be resistant
             | to expansion of Googlebot's IP ranges in the future) is to
             | perform a reverse DNS lookup on the IP, then a forward DNS
             | lookup on the resulting hostname, to verify
             | that the rDNS isn't a lie: https://developers.google.com/se
             | arch/docs/advanced/crawling/...
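
A rough sketch of the reverse-then-forward DNS check described above, using only Python's socket module. The function names are hypothetical, and the accepted hostname suffixes (googlebot.com and google.com) are an assumption that should be checked against Google's current documentation. The resolvers are passed in as parameters so the logic can be exercised without network access.

```python
import socket
from typing import Callable, Iterable

# Assumed suffixes for genuine Googlebot hostnames; verify before relying on them.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def _reverse(ip: str) -> str:
    # PTR lookup: IP address -> primary hostname.
    return socket.gethostbyaddr(ip)[0]


def _forward(host: str) -> Iterable[str]:
    # A lookup: hostname -> list of IPv4 addresses (IPv4 only).
    return socket.gethostbyname_ex(host)[2]


def is_verified_googlebot(
    ip: str,
    reverse: Callable[[str], str] = _reverse,
    forward: Callable[[str], Iterable[str]] = _forward,
) -> bool:
    """True only if rDNS names a Google host AND that host resolves back to ip."""
    try:
        host = reverse(ip).rstrip(".").lower()
        if not host.endswith(GOOGLE_SUFFIXES):
            return False
        # The forward lookup guards against a spoofed PTR record.
        return ip in forward(host)
    except OSError:
        # socket.herror and socket.gaierror are OSError subclasses.
        return False
```

The forward lookup is the important half: anyone who controls the reverse zone for their own IP range can publish a PTR record that claims to be googlebot.com.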
        
               | lucb1e wrote:
               | Indeed, this was one of the things I considered (note I'm
               | not OP), but then I didn't really want to rely on DNS.
               | https://duckduckgo.com/?q=it's+always+DNS
        
       | intricatedetail wrote:
       | It's crazy that you have to pollute your project with the Google
       | brand just so that they don't steal your bandwidth and content.
       | Why is this not illegal?
        
       | lucb1e wrote:
       | Oh hey I thought I was the only one. lucb1e.com and another site
       | are also not indexed, though I blocked it based on the user agent
       | string. That way it doesn't get page data or non-HTML files from
       | my server. I introduced this when they were pulling this AMP
       | thing: https://lucb1e.com/?p=post&id=130 It personally doesn't
       | impact me, but it impacts other people on the internet and I
       | figured it was the only thing I can do to try and diversify this
       | market (since I myself already switched to another search
       | engine).
       | 
       | There are zero other restrictions on my site. Use any search
       | engine other than google. Or don't, up to you.
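
One way the user-agent block described above might look, as a sketch only; the commenter's actual implementation is not shown in the thread. The token list and function names are hypothetical. The check is a plain substring match, so it only stops crawlers that identify themselves honestly.

```python
from typing import Optional

# Hypothetical list of substrings to refuse; extend as needed.
BLOCKED_AGENT_TOKENS = ("googlebot",)


def is_blocked_user_agent(user_agent: Optional[str]) -> bool:
    """True if the User-Agent header contains any blocked token."""
    if not user_agent:
        return False
    lowered = user_agent.lower()
    return any(token in lowered for token in BLOCKED_AGENT_TOKENS)


def status_for(user_agent: Optional[str]) -> int:
    """HTTP status a server could return before sending any page data."""
    return 403 if is_blocked_user_agent(user_agent) else 200
```

Because the decision is made before any body is sent, the blocked crawler receives neither page data nor non-HTML files.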
        
       | forgotmypw17 wrote:
       | I use simple HTTP auth with an easy username and password on most
       | of my sites. It is rarely a problem for anyone I invite, except
       | perhaps Instagram's browser, and I get no crawler traffic.
        
         | markoutso wrote:
         | That's a shitty solution. The whole point is to keep the
         | website public.
        
       | rezonant wrote:
       | A more interesting issue is the opposite -- many large sites have
       | robots.txt rules that Disallow all crawlers _except_ Google. A
       | new search engine either 100% respects robots.txt, with the
       | result that some major properties are completely unavailable in
       | its index; ignores robots.txt in these special cases where the
       | robots.txt configuration is unreasonable; or crawls anything
       | that allows Google to crawl it. None of these options are great.
        
         | greyivy wrote:
         | Any idea why this would be (other than incompetence)?
        
       | robbrown451 wrote:
       | It seems like the amount it will hurt Google is directly
       | proportional to the amount it will hurt the owner of the site
       | (assuming they want people to read their message).
       | 
       | I'm sure someone at Google is pretty happy that they don't have
       | to show this page in their search results. Nobody can accuse them
       | of bias against anti-Google pages -- the site owner did it to
       | themselves.
       | 
       | Seems like as perfect an example of "cutting off nose to spite
       | face" as I can imagine. (ok, refusing the vax and dying of COVID
       | to get back at the left might be a better example, but this one
       | is close)
        
       | keithnz wrote:
       | So I did a search for the title of his blog post to see what
       | comes up, this HN page is top hit.
        
       | mattlondon wrote:
       | Alternatively just use EFF's privacy badger and duckduckgo to
       | stop feeding the beast?
       | 
       | Those are active steps you can take - I am not convinced a few
       | metatags will stop Google spidering your site (even if it is
       | invisible in results), and is of questionable value if you are
       | still using Google search and not blocking their scripts.
        
         | pydry wrote:
         | It's not either/or. you can do both.
        
       | ObamaBinSpying wrote:
       | Or just block all of the IP addresses listed at these URLs:
       | 
       | https://whois.arin.net/rest/org/GOGL/nets
       | 
       | https://whois.arin.net/rest/org/GOOGL-1/nets
       | 
       | https://whois.arin.net/rest/org/GOOGL-2/nets
       | 
       | https://whois.arin.net/rest/org/GOOGL-24/nets
       | 
       | https://whois.arin.net/rest/org/GOOGL-4/nets
       | 
       | https://whois.arin.net/rest/org/GOOGL-46/nets
       | 
       | https://whois.arin.net/rest/org/GOOGL-5/nets
       | 
       | https://whois.arin.net/rest/org/GOOGL-9/nets
       | 
       | https://whois.arin.net/rest/org/GL-654/nets
       | 
       | https://whois.arin.net/rest/org/GL-895/nets
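
A sketch of how such a range block could be checked in application code with Python's ipaddress module. The CIDR blocks below are reserved documentation ranges used as placeholders, not real Google networks; the actual ranges would have to be pulled from listings like the ones above and kept up to date.

```python
import ipaddress

# Placeholder ranges (RFC 5737 / RFC 3849 documentation blocks), NOT Google's.
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "2001:db8::/32")
]


def is_blocked_ip(addr: str) -> bool:
    """True if addr falls inside any blocked network."""
    ip = ipaddress.ip_address(addr)
    # Membership is False when the address and network versions differ.
    return any(ip in net for net in BLOCKED_NETWORKS)
```

A firewall rule would do the same thing earlier in the stack; the trade-off raised elsewhere in the thread is that a static list goes stale when the crawler's ranges change.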
        
       | FpUser wrote:
       | Removing website from Google search is the least of worries.
       | Every meaningful aspect of your life is now being monitored by
       | corporations and governments. It is too fucking late. That fucked
       | up social scoring system being used in China to oppress people is
       | coming here. Only instead of government doing it directly it will
       | be mostly performed by corporations to keep the appearance of
       | "free" society. Corps will collect your data, assign you a rank
       | and act accordingly.
        
       | cubano wrote:
       | So if "surveillance capitalism" is apparently the new scare,
       | would "surveillance socialism" be better?
       | 
       | Or are we supposed to imagine that under socialism, there would
       | be no need for Big Tech surveillance? This I most certainly
       | disagree with.
       | 
       | I'm just noticing a trend lately where the word "capitalism" is
       | being attacked on many fronts, and I personally find that
       | troublesome.
       | 
       | Like Churchill said..."Capitalism is absolutely the worst system
       | there is, besides everything else of course"
        
         | pyrale wrote:
         | I don't really understand why you would assume that the only
         | alternative to surveillance capitalism is "surveillance
         | socialism".
         | 
         | This sophism seems to be built on two errors:
         | 
         | * that there is nothing outside of pure capitalism and pure
         | socialism
         | 
         | * that adjoining "surveillance" to capitalism means we're
         | talking about an inevitable aspect of our society that is
         | combined with capitalism, rather than a specific subset of the
         | way business is done in this age.
         | 
         | To be honest, this lack of ability to conceive alternative
         | social systems is concerning. The deformed Churchill quote
         | comes as a cherry on top.
        
           | cubano wrote:
           | Of course there are an infinite number of alternatives...I
           | just quickly picked something that sounded decent that tried
           | to make my point about my observation lately that Capitalism
           | was getting knocked around everywhere I looked.
           | 
           | In the last Econtalk podcast that I listened to last night,
           | they discussed the loneliness "epidemic" and the author ended
           | up blaming Thatcher-based Capitalism as perhaps the main
           | reason why people are so lonely today!
           | 
           | I thought that was quite the stretch but imagine my chagrin
           | when here was another spurious back-handed attack on it.
        
             | pyrale wrote:
             | > Capitalism [is] getting knocked around everywhere I
             | looked.
             | 
             | That's what happened with every social system in the past,
             | and, while failures certainly have happened, we've always
             | found ways for that criticism to result in improvements to
             | our societies.
             | 
             | It would be very surprising for the current shape of our
             | society or, more generally, capitalism to be an exception
             | to the rule, unless you subscribe to the "end of history"
             | thesis.
        
         | cweagans wrote:
         | I can't tell if you're being willfully obtuse or not.
         | Harvesting people's data for the express purpose of
         | manipulating them into thinking/buying things that they
         | otherwise wouldn't is wrong in every sense of the word.
         | 
         | Capitalism has its problems just like everything else.
         | Pretending it doesn't is just as disingenuous as pretending
         | that socialism would fix everything. If you're concerned about
         | people attacking capitalism, help fix the problems. Simple as
         | that.
        
         | dreamcompiler wrote:
         | Churchill was referring to democracy, not capitalism.
        
         | cubano wrote:
         | I'm interested in why this comment is getting so knocked down?
         | 
         | I asked what I think is a valid question and would like to hear
         | honest reactions from people about my observation.
         | 
         | I feel I have every right to be a cheerleader for Capitalism as
         | my father escaped communist Cuba in 1959 as Castro was coming
         | to power and used the US's system and tons of hard work to
         | create an extremely comfortable life for himself, while friends
         | and family members who stayed there lived rather wretched
         | lives.
         | 
         | He never forgot how lucky he was to be able to get out of there
         | just in time and told me time and time again that the US while
         | having flaws, was by far the best place in the world to live,
         | so my original comment comes from this background.
         | 
         | I don't give a flying fuck if the comment gets modded down, but
         | I would like to know just what in it is so offensive to those
         | modding it down so I can learn something.
        
         | michaelt wrote:
         | Presumably the "capitalism" in "surveillance capitalism" is to
         | make it clear they're talking about _private companies_ - as
         | distinct from the traditional concerns about _government_
         | surveillance.
        
       | DantesKite wrote:
       | I like Google, but wouldn't mind a better search engine, even at
       | the cost of my privacy, so long as I had a choice for what could
       | be shared.
        
         | ssss11 wrote:
         | Can you explain what you mean by this? I've read it a few times
         | and don't understand
        
           | DantesKite wrote:
           | Sure. I probably could've been much clearer.
           | 
           | I don't think Google taking your information and sharing it
           | with advertisers is a great sin. Somewhat annoying but
           | nothing particularly harmful.
           | 
           | I do think the search results are easily manipulated and it
           | can be frustrating trying to find relevant information. Like
           | most people I end up defaulting to Reddit for search queries
           | just to find something that isn't a blog by someone shilling
           | their product.
           | 
           | But I understand the invasion of privacy would irritate some
           | people and maybe in the long term it would be a net negative.
           | So if there was a search engine that explicitly asked for
           | certain information and you had the option to share, that
           | would probably go a long ways.
        
             | [deleted]
        
           | entire-name wrote:
           | I think OP means they don't mind sharing their information
           | with the search engine (be it Google, another engine that
           | provides better results, or even a better Google in terms of
           | results), _as long as_ OP has control over exactly what is
           | being shared.
           | 
           | As an aside, I do see the trend for some companies to provide
           | this control nowadays. Even Google is doing it (e.g. you can
           | auto delete your information, or turn them off completely):
           | https://myaccount.google.com/data-and-privacy
           | 
           | Of course, whether or not you believe Google is doing what
           | you have configured in the backend is another question... and
           | there is nothing anyone can do to actually make you believe
           | it short of giving you complete access to the entire Google
           | backend. Or is there a way to verify without exposing? Maybe
           | an interesting research topic...
        
         | ignoramous wrote:
         | You're in luck, since there's active development in this space:
         | Neeva.com and kagi.com are two of the many alt search engines.
        
       | avipars wrote:
       | seems like it would hurt your traffic
        
         | amelius wrote:
         | Just join a web ring like in the old days.
         | 
         | https://en.wikipedia.org/wiki/Webring
        
         | [deleted]
        
         | freediver wrote:
         | Only if there is relevant traffic from Google to begin with,
         | which is highly unlikely for a site like this. A high
         | percentage of results in almost every Google search comes from
         | the closed circle of the same top 10,000 sites or so.
         | 
         | This is the beauty of a protest like this, because this site
         | does have valuable content, and if enough sites like this
         | joined the protest it could actually hurt the relevancy of the
         | Google index: by the time Google figures out the content is
         | valuable, it would no longer be allowed to index it.
        
           | jefftk wrote:
           | I don't think that's so unlikely: on my blog ~30% of visitors
           | come from searches
        
             | freediver wrote:
             | Sorry my statement was both generalized and specific at the
             | same time, and that did not turn out well. How many visits
             | does your blog have daily? And what would it take you to
             | remove your site from Google index?
        
               | jefftk wrote:
               | _> How many visits does your blog have daily?_
               | 
               | ~200k sessions in the past year, so ~550/d. Breakdown:
               | 
               | * ~30% search
               | 
               | * ~30% no referer
               | 
               | * ~25% HN
               | 
               | * ~7% Twitter/FB/etc
               | 
               | * ~8% other
               | 
               |  _> what would it take you to remove your site from
               | Google index?_
               | 
               | I don't see why I would want to exclude my site from any
               | index? Being in indexes helps people find my writing,
               | which I like!
        
               | indigochill wrote:
               | > I don't see why I would want to exclude my site from
               | any index? Being in indexes helps people find my writing,
               | which I like!
               | 
               | It's essentially a form of boycott. If one believes
               | Google is a problematic entity (too many fingers in too
               | many aspects of our lives), it's a way to sever
               | connections with them at some personal cost.
               | 
               | At least, if you care about search traffic - one might
               | argue the assumption that Google-like search is the
               | default way to navigate the web is one worth
               | reconsidering and encouraging alternatives to anyway.
        
         | onion2k wrote:
         | That depends on where your traffic originates from. Back when I
         | tracked people on my site, I found I got very little from
         | search results. Most of it (> 95%) came from links from social
         | media and Github. On a blog that's heavily about privacy I
         | wouldn't expect much to come from Google.
         | 
         | Also, so what if the numbers go down? If your reason for
         | writing a blog is to see a number on a screen then what does
         | that actually give you?
        
           | jefftk wrote:
           | _> so what if the numbers go down? If your reason for
           | writing a blog is to see a number on a screen then what does
           | that actually give you?_
           | 
           | Traffic numbers are not an end in themselves, but are a
           | decent proxy for "are other people getting value out of what
           | I write?"
        
         | lucb1e wrote:
         | If that's your goal. Personally I host content that people can
         | use or not. I'll link friends if I want them to see it.
         | Visitors don't cost me anything, it doesn't really bring me
         | anything (other than ego?) to have visitors either. Hence I saw
         | fit to also block google (two years ago already apparently, I
         | thought it was much more recent) and it didn't negatively
         | impact my site in any way.
        
         | floatingatoll wrote:
         | This assumes that the increase in traffic due to Google is
         | beneficial, which it rarely is for personal diary sites.
        
       | JimWestergren wrote:
       | Imagine wikipedia and the top newspapers doing this ... users
       | will start to use another search engine.
        
         | 0x073 wrote:
         | Normal users will start to use another newspaper
        
         | antattack wrote:
         | Wikipedia should do this as Apple and Google are showing
         | Wikipedia results as their own, robbing, IMO, Wikipedia of
         | importance. Wikipedia is large enough that it should have its
         | own search engine, likely with more relevant results.
        
           | nicce wrote:
           | Identical situation when Facebook was asked to not show
           | previews of the news articles, because of the ePrivacy
           | directive. Could this go for the same legislation?
           | 
           | https://www.mysk.blog/2021/02/08/fb-link-previews/
        
           | joshuaissac wrote:
           | It does not help Wikipedia to do that. The content on
           | Wikipedia is licensed so that Apple and Google can show the
           | content from Wikipedia (and this is by design, not a
           | loophole). If the users can get access to the encyclopedic
           | content more conveniently, that is still in line with the
           | project's goals, even if that content reaches the user
           | indirectly via a third party.
        
         | breakingcups wrote:
         | No, people will start to read other news sites.
        
       | larrymcp wrote:
       | The first sentence grabbed my attention, and I was looking
       | forward to learning about the "threat that surveillance
       | capitalism poses to democracy and human autonomy". But then the
       | article fell flat: he gave no examples of that threat, and
       | neither did the linked article in The Guardian.
       | 
       | Are there specific examples of this type of harm? The only
       | complaints that he made were that Google makes a lot of money
       | (which I have no problem with), and that Google's conduct feels
       | "creepy" to him (which is merely an emotional reaction).
       | 
       | He did hint at Google "modifying your off-screen behavior", and I
       | was eager to learn about that as well... but then he left that
       | unexplored too, and gave no follow-up or examples of that
       | intriguing scenario.
        
         | rkarmani wrote:
         | He referenced this page: https://www.socialcooling.com/
        
       ___________________________________________________________________
       (page generated 2021-10-03 23:00 UTC)