[HN Gopher] Only Google is really allowed to crawl the web
       ___________________________________________________________________
        
       Only Google is really allowed to crawl the web
        
       Author : skinkestek
       Score  : 711 points
       Date   : 2021-03-26 14:34 UTC (8 hours ago)
        
 (HTM) web link (knuckleheads.club)
 (TXT) w3m dump (knuckleheads.club)
        
       | graiz wrote:
       | Not sure why http://commoncrawl.org/ wasn't mentioned.
        
       | dclaw wrote:
       | I can't really trust a website that spells its own name wrong on
        | its homepage. "Knucklesheads' Club"
       | 
       | Edit: https://imgur.com/a/inqYrjV
        
         | slenk wrote:
         | Everyone makes mistakes
        
       | sgsvnk wrote:
       | Money earns more money. Privilege begets more privilege.
       | 
        | This is not just true in the case of Google but in other
        | domains as well, like the financial markets.
       | 
       | Would you blame capitalism?
        
       | wunderflix wrote:
       | Even that won't change much. There is no way Google can be out-
       | googled by other search engines because of its market dominance:
       | more traffic means more clicks, more clicks mean better search
       | results, better search results will drive more traffic.
       | 
       | I try bing and DDG for a week or so every 6 months. I always
       | switch back to google eventually because the results are so much
       | better.
       | 
       | Google can only be disrupted if something new is invented,
       | something different than search but delivering way better
       | results. I have no clue what that might be. But I hope someone is
       | working on it.
        
         | internetslave wrote:
         | Yup. My opinion has long been that the only thing that will
         | take down google is a massive increase in NLP, such that the
         | historical click data can be outperformed by a straight up
          | really good NLP model.
        
           | wunderflix wrote:
           | That's interesting. Is anyone working on this already? SV
           | startup? And: don't you think Google is in the best position
           | to build such a thing?
        
         | Zelphyr wrote:
         | I've had the exact opposite reaction to the comparison between
         | Google and DuckDuckGo. I use the latter daily and only rarely
         | revert to Google. Even then I usually don't find the results to
         | be any better and often find them to be worse.
         | 
         | In my estimation, Google's search results have significantly
         | declined in recent years.
        
           | rstupek wrote:
           | Agreed. I've fully changed over to DDG on my phone and rarely
           | add the !g to get a google search.
        
           | wunderflix wrote:
            | Ha, maybe I should give it a try again :) My 6-month period
           | is almost over again.
        
       | jrockway wrote:
       | I think there are plenty of other people crawling the web.
       | There's Common Crawl, there's the Wayback machine... it's not
       | just Google. Then there is a very long tail of crawlers that show
       | up in the logs for my small-potatoes personal website. Whatever
        | they're doing, they seem to coexist in peace, at the very
       | least.
       | 
       | To some extent, I agree with this site that people are nicer to
       | Google than other crawlers. That's because the crawl consumes
       | their resources but provides benefits -- you show up on Google,
       | the only search engine people actually use. But at the same time,
       | they are happy to drag Google in front of Congress for some
       | general abuse, so... maybe there is actually a little bit of
       | balance there.
        
       | anonu wrote:
       | > There Should Be A Public Cache Of The Web
       | 
       | This might be closest to it: https://commoncrawl.org/
        
       | lawwantsin17 wrote:
       | I'm all for killing Google's monopoly but spiders can ignore
       | robots.txt you know. This just seems like a failure of other
       | companies to effectively ignore those.
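A robots.txt file is purely advisory: nothing enforces it. A minimal sketch using Python's standard-library robotparser (the policy text and URLs are made up for illustration) shows both halves of the point: a site can publish rules that welcome only Googlebot, and honoring those rules is entirely up to each crawler:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy in the spirit of the article: Googlebot may fetch
# anything, while every other crawler is barred from the whole site.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A polite crawler checks before fetching; an impolite one simply skips
# this call, since robots.txt carries no technical enforcement.
print(parser.can_fetch("Googlebot", "https://example.com/page"))      # True
print(parser.can_fetch("SomeNewCrawler", "https://example.com/page")) # False
```

Respecting the file is only a convention, which is why ignoring it is possible at all; the practical deterrents are IP blocks and user-agent filtering, not the file itself.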
        
       | jeelecali wrote:
       | I'm looking for $ 576
        
       | villgax wrote:
        | The irony is that they bitch if you scrape search or other
        | platforms without a paid plan & want to do the same to you
        
       | ajcp wrote:
       | They really missed an opportunity to get creative with their own
       | `robots.txt` implementation.
        
       | nova22033 wrote:
       | _This isn't illegal and it isn't Google's fault_
       | 
       | Right there in the article..
        
         | WarOnPrivacy wrote:
         | Again, with critical context.
         | 
         |  _This isn't illegal and it isn't Google's fault, but this
         | monopoly on web crawling that has naturally emerged prevents
         | any other company from being able to effectively compete with
         | Google in the search engine market._
        
       | tyingq wrote:
       | The bigger problem, to me, is not around crawling. It's the
       | asymmetrical power Google has after crawling.
       | 
       | Google is obviously on a mission to keep people on Google owned
       | properties. So, they take what they crawl and find a way to
       | present that to the end user without anyone needing to visit the
       | place that data came from.
       | 
       | Airlines are a good example. If you search for flight status for
       | a particular flight, Google presents that flight status in a box.
       | As an end user, that's great. However, that sort of search used
       | to (most times) lead to a visit to the airline web site.
       | 
       | The airline web site could then present things Google can't do.
       | Like "hey, we see you haven't checked in yet" or "TSA wait times
       | are longer than usual" or "We have a more-legroom seat upgrade if
       | you want it".
       | 
       | Google took those eyeballs away. Okay, fine, that's their choice.
       | But they don't give anything back, which removes incentives from
       | the actual source to do things better.
       | 
       | You see this recently with Wikipedia. Google's widgets have been
       | reducing traffic to Wikipedia pretty dramatically. Enough so that
       | Wikipedia is now pushing back with a product that the Googles of
       | the world will have to pay for.
       | 
       | In short, I don't think the crawler is the problem. And I don't
       | think Google will realize what the problem is until they start
       | accidentally killing off large swaths of the actual sources of
       | this content by taking the audience away.
        
         | bouncycastle wrote:
         | In regards to airlines, Google and Amadeus have a partnership I
         | believe. Amadeus is the main source of data for many of these
         | airline websites. If Google gets the data from Amadeus directly
         | and not these websites, they are just cutting out the
          | middleman. I don't shed a tear for any of these middlemen
         | (together with their Dark Pattern UX design).
        
           | tyingq wrote:
           | Amadeus isn't a source of flight status. It is a source for
           | (some) planned schedules and fares. Global distribution
           | systems are a complex topic that's hard to sum up on HN. For
           | flight status, Google is pulling from OAG and Flight Aware,
           | and also from airline websites. Though they don't show
           | airline sites as a source.
        
         | dan-robertson wrote:
         | The way to look at this from Google's point of view is to
         | realise that most websites are slow and bad[1], so if Google
         | sent you there you would have a bad experience with a bad slow
         | website trying to find the information you want. Google want to
         | make it better for you.
         | 
         | [1] it feels like Google have contributed a lot to websites
         | being slow and bad with eg ads, amp, angular, and probably more
         | things for the other 25 letters of the alphabet.
        
           | [deleted]
        
           | zentiggr wrote:
           | > Google want to make it better for you.
           | 
           | Hehe, sure, nothing nefarious or greedy here... move along,
           | move along, nothing to see...
        
         | supert56 wrote:
         | Perhaps I am misunderstanding or over simplifying things but it
         | always surprises me that there are legal cases brought against
         | companies who scrape data when so many of Google's products are
         | doing exactly this.
         | 
         | It definitely feels like one set of rules for them and a
         | different set for everyone else.
        
           | lupire wrote:
           | Google doesn't scrape anything that the site owner objects
           | to.
        
           | Spivak wrote:
           | I mean it's not that weird that a company would authorize
           | major search engines scraping them but no one else.
           | 
           | I don't really see this as Google playing by different rules
           | so much as economic incentives being aligned in Google's
           | favor.
        
         | 838812052807016 wrote:
         | Standardized interoperability enables overall progress.
         | 
          | Not every airline needs its own webpage. They could all
         | provide a standard API.
        
         | lelanthran wrote:
         | > And I don't think Google will realize what the problem is
         | until they start accidentally killing off large swaths of the
         | actual sources of this content by taking the audience away.
         | 
         | What makes you think they care? Killing off the sources of
          | content might even be their goal. If they kill off sources of
         | content, they'd be more than happy to create an easier-to-
         | datamine replacement.
         | 
         | Hypothetically, if they killed off wikipedia, they are best
         | placed to use the actual wikipedia content[1] in a replacement,
         | which they can use for more intrusive data-mining.
         | 
         | Google sells eyeballs to advertisers; being the source of all
         | content makes them more money from advertisers while making it
         | cheaper to acquire each eyeball.
         | 
         | [1] AFAIK, wikipedia content is free to reuse.
        
         | ilaksh wrote:
         | The way that the web has been fundamentally broken by Google
         | and other companies is one of the reasons I am excited about an
         | alternative protocol called Gemini. It doesn't replace the web
         | entirely, but for basic things like exchanging information,
         | it's great. https://gemini.circumlunar.space/
        
         | treis wrote:
         | >However, that sort of search used to (most times) lead to a
         | visit to the airline web site.
         | 
         | I don't think that's correct. In the old days you'd either call
         | a travel agent or use an aggregator like expedia.
         | 
         | Google muscles out intermediaries like Expedia, Yelp, and so
          | on. It's likely not much better or worse for the end user or
         | supplier. Just swapping one middleman for another.
        
           | darkwater wrote:
            | It's actually pretty different: another middleman can
            | basically only arise as a big success in the iOS App
            | Store, because coming up in Google searches would be
            | impossible, and it's more or less the same in the Play
            | Store. So, Google is not just yet another intermediary.
        
           | tyingq wrote:
           | I can't prove it was that way, but I spent a lot of time in
           | the space. For a long time, the airline's site used to be the
           | top organic result, and there was no widget. Similar for
           | other travel related searches (not just airlines) over time.
           | Google has been pushing down organic results in favor of ads
           | and widgets for a long time...and slowly, one little thing at
           | a time. Like no widgets -> small widget below first organic
           | result -> move the widget up -> make it bigger -> etc.
        
           | supernovae wrote:
           | I don't think google muscling out intermediaries like Expedia
           | is a good thing.
           | 
            | Just for example, Expedia is probably 5% of Google's total
            | revenue, and by and large Google doesn't like slim-margin
            | services that can't be automated.
           | 
           | Travel is fairly high-touch - people centric. It doesn't fit
           | Google's "MO".
           | 
            | But... it's shitty that google can play all sides of the
            | markets while holding people ransom for massive sums of money
            | to pay to play on PPC where google doesn't... i think that's
            | where the problem shows.
           | 
           | In essence, you're advocating that eBay goes away because
           | google could do it... they could.. and eBay is technically
           | just an intermediary, but do we want everything to be
           | googlefied?
           | 
           | Google bought up/destroyed other aggregators - remember the
           | days of fatwallet, priceline, pricewatch, shopzilla and such
           | when they used to focus on discounts/coupons/deals and now
           | they're moving more towards rewards/shopping/experience - it
           | used to be i could do PPC on pricewatch and reach millions of
            | shoppers at a reasonable rate, but now that google destroyed
           | them all, the PPC rate on "goods" is absurdly high and not
           | having an affordable market means only the amazons and
           | walmarts can really afford to play...
           | 
           | it used to be you could niche out, but even then, that's
           | getting harder
        
             | treis wrote:
             | >In essence, you're advocating that eBay goes away because
             | google could do it... they could.. and eBay is technically
             | just an intermediary, but do we want everything to be
             | googlefied?
             | 
              | I don't think I'm really advocating for it so much as I see
              | it as a more or less neutral change.
             | 
             | That said, I'm pretty ambivalent about Google. Their size
             | is a concern, but they also tend to be pretty low on the
             | dark pattern nonsense. eBay, to use an example you gave,
             | screwed me out of some buyer protection because of poor UX
                | and/or a bug (I never saw the option to claim my money after
             | the seller didn't respond). In this specific instance
             | Google ends the process by sending you to the airline to
             | complete the booking. That, imho, is likely better than
             | dealing with Expedia.
        
               | supernovae wrote:
               | Companies opt in to sites like Expedia and list their
               | properties/flights/vacations on their marketplace and
               | they pay a commission for those being booked. Expedia
               | doesn't just crawl them and demand a royalty for sending
               | them traffic...
               | 
               | Google has a huge pay 2 play problem with PPC... i've
               | worked for Expedia so that's the only reason i know this
               | :)
               | 
               | It's the reason companies work with Expedia many times
               | because they don't have the leverage expedia group
               | does...
               | 
               | i see it as unnatural change btw... "borg" if you will.
        
           | josefx wrote:
           | Only if Google stays around long term. I wouldn't be
           | surprised if each free product on its graveyard took down a
              | dozen competing products before it was killed off.
        
             | pc86 wrote:
             | Then someone can start a competitor up again, right?
             | Assuming there's actually a market for it.
        
               | josefx wrote:
               | Not every market is lucrative in the extreme and it can
               | take a long time to recover from being "disrupted". I
               | think it is also a common practice for larger shopping
               | chains to dump prices when they open a new location in
               | order to clear out the local competition, so the damage
               | it causes is well understood to be long lasting.
        
         | devoutsalsa wrote:
          | I've noticed that sometimes Google has updated flight
          | information before the displays at the airport do.
        
           | tyingq wrote:
           | For the most part individual airports own that
           | infrastructure. So it's hard to generalize. For most types of
           | notable flight status/time changes, however, airlines usually
           | know first.
           | 
           | There are exceptions, like an airport-called ground stop.
        
         | magicalist wrote:
         | > _You see this recently with Wikipedia. Google 's widgets have
         | been reducing traffic to Wikipedia pretty dramatically._
         | 
         | Wikipedia visitors, edits, and revenue are all increasing, and
         | the rate that they're increasing is increasing, at least in the
         | last few years. Is this a claim about the third derivative?
         | 
         | > _Enough so that Wikipedia is now pushing back with a product
         | that the Googles of the world will have to pay for._
         | 
         | The Wikimedia Enterprise thing seems like it has nothing to do
         | with missing visitors and that companies ingesting raw
         | Wikipedia edits are an opportunity for diversifying revenue by
         | offering paid structured APIs and service contracts. Kind of
         | the traditional RedHat approach to revenue in open source:
         | https://meta.m.wikimedia.org/wiki/Wikimedia_Enterprise
        
           | tyingq wrote:
            | See https://searchengineland.com/wikipedia-confirms-they-are-ste...
            | from 2015. Google's widgets that present Wikipedia
           | data do reduce visitors to Wikipedia.
           | 
            | Or see page views on English Wikipedia from 2016-current:
            | https://stats.wikimedia.org/#/en.wikipedia.org/reading/total...
           | Looks pretty flat, right? Does that seem normal?
           | 
           | As for Wikimedia Enterprise, you do have to read between the
           | lines a bit. _" The focus is on organizations that want to
           | repurpose Wikimedia content in other contexts, providing data
           | services at a large scale"_.
        
             | SamBam wrote:
             | The first link doesn't seem quite conclusive (see the part
             | at the bottom), and also doesn't give evidence that
             | Google's widgets are to blame.
             | 
             | The flattening of users could also be due to a general
             | internet-wide reduction in long-form (or even medium-form)
             | non-fiction reading. How are page views for The New York
             | Times?
             | 
             | Seems like it should be simple to A/B test, though.
             | Obviously Google could do it themselves by randomly taking
                | away the widget, but we could also see whether referrals
             | from non-Google search engines (though they are themselves
             | a tiny percentage) continue to increase while Google
             | remains flat.
        
               | tyingq wrote:
               | Edit: Removed bad "simple english graph", thanks. Though
                | the regular English Wikipedia traffic is flat from
               | 2016-present.
               | 
               | As for NYT, is there a better proxy to compare to?
               | There's no public pageview stats and they have a paywall.
        
               | magicalist wrote:
               | That first graph is Simple English, not English, and is
               | in millions, not billions. They also explicitly call out
               | the methodology change in 2015...
        
         | JKCalhoun wrote:
         | > In short, I don't think the crawler is the problem.
         | 
         | Except that, allow other companies to crawl/compete, and you
         | can take eyeballs away from Google (which may well then return
         | eyeballs to Wikipedia so long as the Google competitors don't
         | also present scraped data).
        
           | [deleted]
        
         | benatkin wrote:
         | That's the result of the crawling, and it preventing
         | competition. Google would much prefer that people complain
         | about the details while ignoring the root cause.
        
           | tyingq wrote:
           | I don't understand that. The crawling access is mostly the
           | same as it ever was. Google's SERP pages are not. A mutually
            | beneficial search engine that respects its sources would
           | still crawl the same. Google used to be that.
           | 
           | The core problem is incentives:
           | http://infolab.stanford.edu/~backrub/google.html _" we
           | believe the issue of advertising causes enough mixed
           | incentives that it is crucial to have a competitive search
           | engine that is transparent and in the academic realm."_
        
             | [deleted]
        
             | benatkin wrote:
             | That's incorrect. Before the search oligopolies formed, new
              | search engines could start up. There were Excite, HotBot,
              | AltaVista, and more. Now they don't have access. Search
             | these comments for census.gov.
        
               | tyingq wrote:
               | There are companies that do pretty well in this space,
               | like ahrefs, for example. They do resort to trickery,
               | like proxy clients that look like home computers or cell
               | phones. But, if a small entity like ahrefs can do it,
               | anyone can do it.
               | 
               | In a nutshell, though, I don't see equal access for all
               | crawlers changing anything. Maybe that's the first
               | barrier they hit, but it isn't the biggest or hardest one
               | by far. Bing has good crawler access, but shit market
               | share.
        
               | [deleted]
        
         | dr-detroit wrote:
          | So nobody is going to book air travel? I can hardly follow
          | what you're even saying besides google=bad.
        
         | veltas wrote:
         | I swear something like 50% of those digests are totally
         | incorrect as well. It's amazing they have kept the feature
         | because it has never had a very high signal-to-noise ratio. I
         | never trust what's presented in these digests without double-
         | checking the source page.
        
           | bombcar wrote:
           | Have you heard the story of Thomas Running? It's a story
           | Google will tell you.
           | 
           | (Search who invented running)
        
           | tyingq wrote:
           | I remember when rich snippets (one type of those widgets)
           | came out there were a lot of funny examples. One for a common
           | query about cancer treatments that pulled data from a dodgy
           | holistic site saying that "carrots cured most types of
           | cancer" (or something like that).
           | 
           | There was a similar one where Google emphatically claimed a
           | US quarter was worth five cents in a pretty and large snippet
           | graphic.
        
             | Mauricebranagh wrote:
             | I recall in the last uk election google got the infographic
             | of party leaders about 60-70% wrong.
             | 
              | And quite often a People Also Ask refinement is just some
              | random guy's comment from Reddit.
        
             | BeFlatXIII wrote:
             | The most memorable rich snippet humor I've seen is a horse
             | breeder sharing a story of how her searches gave snippets
              | with My Little Ponies as the preview image.
        
         | gtm1260 wrote:
         | I'm not sure I agree with this. I think airline websites are so
         | garbage filled that they've driven people to use the simple
         | alternative of the google flights checkout.
         | 
          | It's a bit of a vicious cycle, but in general most websites are
          | so chock-full of crap that not having to click into them
          | for real is a relief!
        
           | gxs wrote:
           | It's not Google's prerogative to scrape a website and display
           | its content, no matter how awful the website.
        
             | michaelmrose wrote:
             | If 1 airline let me view information in a friendly fashion
             | and the other didn't I would do business with the first.
             | 
             | Lest we forget the money in that scenario is from butts in
             | seats not clicks on a website. The particular example is
             | ill chosen as google is actually taking on a cost, taking
             | nothing, and gifting the airline a better ui.
        
             | BEEdwards wrote:
              | If you make an awful website that can be scraped, it's a
              | matter of when, not if, someone will take your data and give
              | it to your consumers, whether you're trying to upsell them
              | or not...
        
           | cyberpunk wrote:
            | BA had some tracking request inline on the "payment
            | processing" page which, when blocked by my pihole, prevented
            | me from ever getting to the confirmation page; I just had to
            | refresh my email and hope for the best.
           | 
           | I have no idea how these companies, which make quite a decent
           | amount of money at least up until 2020, can have such utterly
           | poor sites.
           | 
           | I once counted some 20+ redirects on a single request during
           | this process heh..
        
             | bombcar wrote:
             | I don't know what they're doing but most every single sign
             | on tool I've seen redirects 10-20 times during the sign on
             | process (and then dumps you to the homepage to navigate
             | your way back).
        
               | merlinscholz wrote:
               | Probably to get first party cookies on a handful of
               | domains
        
           | tyingq wrote:
           | I'm talking about flight status. Not Google Flights,
           | shopping, or booking.
           | 
           | There are events associated with flight status that Google
           | doesn't know. Like change fee waivers, cash comp awards to
           | take a later or earlier flight, seat upgrades, etc.
        
           | creato wrote:
           | Yeah, the Google flights issue is difficult. On one hand, the
           | business practice is problematic. On the other hand, Google
          | flights is _so_ much better than its competitors it's
           | ridiculous.
           | 
           | If there was a way to split Google flights into a separate
           | company and somehow ensure it wouldn't devolve into absolute
           | trash like its competitors, that would be a good thing.
        
             | tyingq wrote:
             | It was ITA and prior to Google buying them, did a pretty
             | good business selling backend flight shopping services to
             | aggregators and airlines.
             | 
             | Shopping for flights is a surprisingly technically
             | difficult thing to do well.
        
         | ChrisArchitect wrote:
         | They're making it easier to search for flights and arrange a
         | trip. It's UX and makes me not hate the airlines/travel process
         | as much. And I end up buying the flight from the airline
         | anyways, and in many cases doing the arranging on the airline
         | site in the end once it's determined, so Google is giving that
         | back. They're not taking stuff from the airlines, I mean what
         | ads and stuff are on the airline sites anyways specifically
         | during the search process. Where they are taking away is from
          | the Expedias and other aggregation sites that offer a
         | garbage/hodgepodge experience that drives people crazy.
        
           | tyingq wrote:
           | You're talking about Google Flights, which is completely
           | unrelated to flight status.
        
         | throwaway_kufu wrote:
         | They are not just taking away internet traffic, but in the
         | flights example, they actually acquired an aggregate
         | flight/travel company and so they are actually entering markets
         | and competing with their own ad customers.
         | 
          | Then it comes full circle to Google unfairly using their
         | market position vis-a-vis data, search and advertising. It's a
          | win-win: Google lets the data dictate which markets to enter,
          | and they can both jack up advertising fees on
          | customers/competitors and unfairly build their own service into
          | search above both ads and organic results.
        
           | danielscrubs wrote:
            | Be careful when using Google Flights: last time I checked
            | they use significantly smaller margins between connecting
            | flights, so trips are shorter but much riskier.
        
             | aetherane wrote:
              | You can get screwed any time you book a connecting flight on
             | two different airlines even if the times aren't tight. For
             | instance if one is cancelled.
             | 
             | If you use the same airline they will make sure you get to
             | the destination.
        
               | HenryBemis wrote:
               | > even if the times aren't tight
               | 
                | Depending on the definition of "tight" each of us has. I
               | remember having 40mins in Munich, and that is a BIG
               | airport. Especially if you disembark on one side of the
               | terminal and your flight is on the far/opposite end.
                | That's 25-30mins of brisk walking. With 5000 people in
                | between, you may well miss your flight. No discussion
               | about stopping to get a coffee or a snack.. you'll miss
               | your flight.
        
               | matwood wrote:
               | That's true, but it can save you a ton of money. You just
               | have to be aware of the risks and plan accordingly.
               | 
               | I have typically used this strategy when flying back to
               | the US from the EU. Take an EZJet or similar low cost
               | airline from random small EU city to a larger EU city
               | like Paris, London, Frankfurt, etc... and book the return
               | trip to the US from the larger city. I've also been
               | forced to do this from some EU cities since there was no
               | connecting partner with a US airline.
        
               | hodgesrm wrote:
               | The difference is mind-boggling in some cases. On one
               | trip in 2019 I had the following coach fare choices for
               | SFO - Moscow return trip tickets booked 3 weeks prior to
               | departure.
               | 
               | * UA or Lufthansa round trip (single carrier) $3K
               | 
               | * UA round trip SFO - Paris + Aeroflot round trip Paris -
               | Moscow: $1K
               | 
               | No amount of search could reduce the gap. I went with the
               | second option. The gap is even bigger if you have a route
               | with multiple segments.
        
               | throwaway1777 wrote:
               | Yeah, this strategy is good, but you need to allow a
               | long layover - 6 hours or so - if you have to go through
               | immigration and change airports for the connection,
               | which happens pretty often with Ryanair and EZJet. It's
               | a big pain, but it does save money.
        
               | cbenneh wrote:
               | If you're booking each leg with a different carrier, I
               | find it best to pay a little extra with kiwi.com, which
               | guarantees the connection. I've missed a connection
               | twice, and they got me on the next flight to the
               | destination for free both times.
        
             | slymon99 wrote:
             | Can you elaborate on this? Do you mean shorter layovers?
        
               | bombcar wrote:
               | It sounds like it - and third-party companies will often
               | show you flights that involve different companies on the
               | different legs - which can leave you in a pickle because
               | technically each airline's job is to get you to the end
               | of THEIR flight, not the entire journey.
        
               | Scoundreller wrote:
               | And sometimes with a change of airport!
        
               | foepys wrote:
               | I remember when in Germany some budget airlines used to
               | say they'd fly to "Frankfurt" (FRA) but actually flew to
               | "Frankfurt-Hahn" (HHN) - 115km away. After arrival in HHN
               | they put you on a bus to FRA that took about 2 hours.
        
             | SV_BubbleTime wrote:
             | Oh don't worry, you have 15 on-paper minutes to go from A1
             | to A70 in Detroit... in January... and the shuttle is down.
        
           | marshmallow_12 wrote:
           | Aren't there antitrust laws to prevent this kind of thing?
        
             | sangnoir wrote:
             | The current anti-trust doctrine in the US has a goal of
             | protecting _consumers_ - not competition. What Google is
             | doing is arguably great for consumers but awful to their
             | competitors /other organizations. Technically, companies
             | can simply block Google using robots.txt - but in reality
             | that will lose them more money than the current partial
             | disintermediation by Google is costing them - and Google
             | knows this.
             | 
             | It's a tall order to convince the courts that Google's
             | actions harm consumers, or are illegal: after all, being
             | innovative in ways that may end up hurting the competition
             | is a key feature of a capitalist society - _proving_ that a
             | line has been crossed is really hard, by design.
        
               | speeder wrote:
               | consumers are in this case the advertisers.
               | 
               | Google has a monopoly on search ads and does enforce
               | it, acting as a drain on the economy, since in many
               | fields you only succeed if you spend on search ads.
        
               | sangnoir wrote:
               | > consumers are in this case the advertisers.
               | 
               | If someone could convince the courts that this is
               | correct, then I'm sure Google would lose. However, I bet
               | dollars to donuts Google's counter-argument would be
               | that the people doing the searching and quickly finding
               | information are also consumers, and they outnumber
               | advertisers and may be harmed by any proposed remediation
               | in favor of advertisers.
        
               | basch wrote:
               | Google's answer to this at yesterday's hearing:
               | 
               | Search isn't a single category. If you break it down,
               | they aren't a monopoly. For example, half of PRODUCT
               | SEARCHES begin on Amazon. It's probably hard to argue
               | Google is a monopoly if the company they see as their
               | main competitor has half the market share.
        
             | supernovae wrote:
             | Just tell people to stop using google. Go direct.
        
               | zentiggr wrote:
               | Upvoted - regardless how pointless some people might
               | think this comment is, it really is the ONLY way that
               | Google is going to drop out of its aggregate lead
               | position.
               | 
               | Enough people realizing Google is trapping and
               | cannibalizing traffic to the other sites it feeds off of,
               | and choosing to do other things EXCEPT touching Google
               | properties, is THE ONLY way they'll be unseated.
               | 
               | No clear legal path to stop a bully means it's an ethical
               | / habit path.
               | 
               | Not saying there's any easy way, just that this is it.
        
             | midoBB wrote:
             | Antitrust enforcement in the US tends not to hit the big
             | tech players as hard as it does other sectors. There is
             | also an ongoing debate in the judicial system about the
             | extent of antitrust laws themselves.
        
             | Majromax wrote:
             | Antitrust laws are hard to enforce in the United States.
             | 
             | Monopolies themselves aren't illegal. To be convicted of an
             | antitrust violation, a firm needs to both have a monopoly
             | and needs to be using anticompetitive means to maintain
             | that monopoly. The recent "textbook" example was of
             | Microsoft, which in the 90s used its dominant position to
             | charge computer manufacturers for a Windows license for
             | each computer sold, regardless of whether it had Windows
             | installed or was a "bare" PC.
             | 
             | Depending on how you define the market, Google may not even
             | have a monopoly. It's probably dominant enough in web
             | search to count, but if you look at its advertising network
             | it competes with Facebook and other ad networks. In the
             | realm of travel planning (to pick an example from these
             | comments), it's barely a blip.
             | 
             | Furthermore, Google can potentially argue it's not being
             | anticompetitive: all businesses use their existing data to
             | optimize new products, so Google could claim that it _not_
             | doing so would be an artificial straitjacket.
        
               | twiddlebits wrote:
               | It's got a monopoly on "search ads" by far.
        
               | arrosenberg wrote:
               | It's not that hard, we're just out of practice due to the
               | absurd Borkist economic theories we've been operating
               | under for 40+ years. The laws are all there if the head
               | of the DOJ antitrust division has the gumption to go
               | reverse some bad precedents.
               | 
               | > In the realm of travel planning (to pick an example
               | from these comments), it's barely a blip.
               | 
               | They used their monopoly in web search to gain non-
               | negligible market share in an entirely unrelated
               | industry. That's textbook anti-competitive behavior.
               | 
               | Google can argue whatever they want, but the argument
               | that they're enabling other businesses is a bad one. It
               | casts Google as a private regulator of the economy, which
               | is exactly what antitrust laws are intended to deal with.
        
               | pmiller2 wrote:
               | Is web search even a "market" independent of ads?
        
               | rijoja wrote:
               | yes
        
               | pmiller2 wrote:
               | Where's the money?
        
             | samuelizdat wrote:
             | That depends, would Google let us know?
        
               | rijoja wrote:
               | not if they could avoid it
        
             | adamcstephens wrote:
             | Yes, but they lack enforcement.
        
           | kingo55 wrote:
           | Even before it gets to that point, they routinely display
           | snippets off regular websites and show ads next to it.
           | 
           | Keeping users from clicking through to organic results helps
           | them generate more revenue.
        
         | jeffbee wrote:
         | You're wrong on a lot of facts here. Google Flights doesn't get
         | its data just by crawling, they get it from Sabre, the FAA,
         | Eurocontrol, etc. Airlines are, obviously, extremely pleased to
         | disseminate this information. Google Flights "gives back" in
         | the exact same way as any other travel outlet: they book
         | passengers.
         | 
         | As for Wikipedia, the WMF is quite happy that most of their
         | traffic is now served by Google. WMF is in the business of
         | distributing knowledge, not in the eyeballs business. Serving
         | traffic is just a cost for them. The main problem has been
         | that the average cost for Wikipedia to serve a page has gone
         | up: since many casual readers now get their answer via Google,
         | a larger share of the people who actually visit Wikipedia are
         | logged-in authors, who cost more to serve. I'm sure there's an
         | easy solution to this problem (for
         | example, beneficiaries of Wikipedia can donate compute
         | facilities and services, or something along those lines).
        
           | tyingq wrote:
           | They don't get individual flight status (what I was talking
           | about) from Sabre or the FAA or Eurocontrol. I didn't get
           | into fares and planned schedules and Google Flights, that's a
           | different topic. I was talking about the big widget you get
           | for queries on status for a particular flight, which is not
           | Google Flights.
           | 
           | They have relented in some ways, rolling out stuff in the
           | widget like: _" The airline has issued a change fee waiver
           | for this flight. See what options are available on American's
           | website"_
           | 
           | But obviously, that kind of stuff isn't shown on Google for
           | quite some time after it exists on the source site. And the
           | widget pushes the organics off the fold unless you have a
           | huge monitor.
           | 
           | As for Wikipedia, I was referring to this:
           | https://news.ycombinator.com/item?id=26487993
           | 
           |  _" Airlines are, obviously, extremely pleased to disseminate
           | this information"_
           | 
           | In the same way that publishers love AMP, yes. They don't
           | actually like it, but they are forced to make the best of it.
        
             | jeffbee wrote:
             | Oh, status. I was thinking of schedules. Still, what is the
             | point for the consumer of being directed to an airline's
             | terrible status page? And are they even capable of being
             | crawled? Looking at American's site (it was the most
             | ghastly airline that sprang to mind) I don't see how a
             | crawler would be able to deal with it, and indeed the
             | Google snippet for AA flight status, on the aa.com result
             | which is far down in the results page, just says "aa.com
             | uses cookies" which is about what you'd expect.
             | 
             | In this case, I want to be sent literally anywhere but
             | aa.com.
        
               | tyingq wrote:
               | _" what is the point for the consumer of being directed
               | to an airline's terrible status page?"_
               | 
               | One example...
               | 
               | If you back up a bit, the widget didn't use to tell you
               | there was a change fee waiver when the flight was full,
               | while aa.com did.
               | 
               | That's an actual, tangible benefit that a consumer might
               | want, worth real money. You can also even often "bid" on
               | a dollar amount to receive if you're willing to change
               | flights. Google doesn't present that info today.
               | 
               | There are more examples. My perspective isn't that Google
               | should lead you to aa.com, but I do feel it's a bit
               | dishonest that the widget is so large it pushes aa.com
               | below the fold. It doesn't need to be that large.
        
         | wbl wrote:
         | Does the concierge of a hotel take anything away when he
         | informs you that your flight has been delayed?
        
         | onlyrealcuzzo wrote:
         | Wikipedia isn't monetized. Doesn't it benefit them if Google is
         | serving their content for free and people are finding the
         | information they want without having to hit Wikipedia?
         | 
         | And also, isn't Google the largest sponsor for Wikipedia
         | already? In 2019 - Google donated $2M [1]. In 2010, Google also
         | donated $2m [2].
         | 
         | [1] https://techcrunch.com/2019/01/22/google-org-
         | donates-2-milli...
         | 
         | [2] https://en.wikipedia.org/wiki/Wikimedia_Foundation
        
           | minikites wrote:
           | Couldn't you make a similar argument about for-profit uses of
           | free/libre software? The software serves a useful purpose,
           | who cares where it came from?
        
           | dmitriid wrote:
           | Google was/is also the largest sponsor of Mozilla. This
           | doesn't stop Google from sabotaging Mozilla.
           | 
           | 2 mln is probably Google's hourly profit. For that they get
           | one of the biggest knowledge bases in the world. It's
           | basically free as far as Google is concerned.
           | 
           | The instant Google becomes confident they can supplant
           | Wikipedia, they will.
        
             | billiam wrote:
             | NOT a sponsor of Mozilla. Google buys web traffic (as
             | default search engine) for ~$300M and turns it into several
             | times that $ in ad revenue.
        
             | jedberg wrote:
             | > 2 mln is probably Google's hourly profit.
             | 
             | You don't have to guess, their numbers are public. In 2020
             | they made $40B in profit, so it takes them about 27 minutes
             | to make $2M in profit.
        
             | magicalist wrote:
             | > _Google was /is also the largest sponsor of Mozilla. This
             | doesn't stop Google from sabotaging Mozilla._
             | 
             | Google isn't a sponsor of Mozilla, they're a customer. Do
             | people think Google is "sponsoring" Apple with $1.5 billion
             | a year too?
        
               | dmitriid wrote:
               | Google being Apple's customer doesn't mean Google isn't
               | sponsoring Mozilla.
               | 
               | These are two very different companies with a very
               | different relationship with Google. And very different
               | influences on Google.
               | 
               | Google _wants_ to be on iOS. It brings customers to
               | Google. A lot of them. iOS is possibly more profitable to
               | Google than Android even with all the payments Apple
               | extracts from them.
               | 
               | Google needs Mozilla so that Google may pretend that
               | there's competition in browser space and that they don't
               | own standards committees. The latter already isn't really
               | true, and Google increasingly doesn't care about the
               | former.
        
               | foobarian wrote:
               | > they're a customer.
               | 
               | The cynic in me thinks the product is anti-trust
               | insurance.
        
             | kelnos wrote:
             | Not sure why you're being downvoted; I completely agree
             | with what you're saying (modulo questionable usage of
             | "sponsor"). If Wikipedia were to try to charge for this use
             | of their data, Google would likely make it a priority to
             | drop the Wikipedia blurbs, either without replacement, or
             | with data sourced elsewhere.
        
               | will4274 wrote:
               | > Google would likely make it a priority to drop the
               | Wikipedia blurbs, either without replacement, or with
               | data sourced elsewhere.
               | 
               | That's an odd way of phrasing things. If Wikipedia were
               | to take away free access to their data, Google wouldn't
               | be dropping Wikipedia, Wikipedia would be dropping
               | Google. This line of thinking "you took this when I was
               | giving it away for free, but now I want to charge for it,
               | so you are expected to keep paying for it" is incorrect.
        
               | zdragnar wrote:
               | Given the scale that Google already operates at, I
               | don't doubt that they would just take a copy of the
               | content and rebrand it as a Google service, complete
               | with user contributions.
               | 
               | Then, after two or five years, let it fester then abandon
               | it. Nobody gets promoted for keeping well oiled machines
               | running.
        
               | dmitriid wrote:
               | Remember Knol?
               | https://en.wikipedia.org/wiki/Knol?wprov=sfti1
               | 
               | It was actually good for writing stuff when I tried it.
               | Never brought in enough traffic. Killed.
        
           | rincebrain wrote:
           | Wikimedia recently announced Wikimedia Enterprise for
           | "organizations that want to repurpose Wikimedia content in
           | other contexts, providing data services at a large scale".
           | 
           | So they're pretty clearly looking to monetize organizations
           | which consume their data in a for-profit context.
        
             | dathinab wrote:
             | monetizing != for-profit
             | 
             | You could e.g. just cover operational cost and/or improve
             | the service quality from it.
        
               | pmiller2 wrote:
               | I think they may have meant "(organizations) (which
               | consume their data in a for-profit context)."
        
           | tomp wrote:
           | Well then they can't nag users to donate to Jimmy Wales'
           | trust fund.
        
           | onetimemanytime wrote:
           | >> _Google donated $2M [1]. In 2010, Google also donated $2m
           | [2]._
           | 
           | $2 Million a year? Now I know why Googlers complained about
           | having one less olive in their lunch salad.
           | 
           | How much does Google PROFIT from Wikipedia, and how much
           | does Wikipedia lose in fundraising when Google fails to
           | send users to the info provider?
        
             | lupire wrote:
             | Wikipedia is drowning in money so this whole line of
             | discussion is weird.
             | 
             | And most of the value of wikipedia is created by its unpaid
             | users, not Wikimedia foundation.
        
           | kelnos wrote:
           | > _Wikipedia isn 't monetized._
           | 
           | No, but they often ask for donations when you visit the site,
           | which people won't see if they just see the in-line blurb
           | from Wikipedia on the Google results page.
           | 
           | > _In 2019 - Google donated $2M [1]. In 2010, Google also
           | donated $2m [2]._
           | 
           | $2M is a pittance compared to what I expect Google believes
           | is the value of their Wikipedia blurbs. If Wikipedia could
           | charge for use of this data (which another commenter claims
           | they are working on doing), they could easily make orders of
           | magnitude more money from Google.
           | 
           | Of course, my expectation is that Google would rather drop
           | the Wikipedia blurbs entirely, or source the data elsewhere,
           | than pay significantly more.
        
             | tylerhou wrote:
             | Unlikely that Wikipedia will be able to charge for content,
             | seeing as all of their content is CC-BY-SA licensed.
             | https://en.wikipedia.org/wiki/Wikipedia:Licensing_update
             | 
             | They may be able to charge for _bandwidth_ (if you want
             | to use a Wikipedia image, you can use Wikipedia's
             | enterprise CDN instead of hosting your own), but their
             | licensing allows me to rehost content as long as I follow
             | the attribution & sublicensing terms.
             | 
             | Google has no problem operating their own CDNs, so I find
             | it unlikely that Wikipedia will be able to monetize Google
             | search results in such a manner as you described.
             | 
             | Disclaimer: I work for Google; opinions are my own.
        
         | Siira wrote:
         | Large swaths of the web are garbage. Wasting people's time
         | and attention visiting pointless sites for something that
         | fits in a small box is obviously not economical.
         | 
         | And if some of the sources somehow die? New sources will spring
         | up. It doesn't matter.
        
       | dheera wrote:
       | > Only a select few crawlers are allowed access to the entire
       | web, and Google is given extra special privileges on top of that.
       | 
       | Hmm, so set up a VPN on the Google Cloud so you have a Google IP
       | address, use a Google User-Agent, and go!
        
         | jesboat wrote:
         | https://developers.google.com/search/docs/advanced/crawling/...
         | 
         | describes the procedure for checking "is this request really
         | from Googlebot". You couldn't fake it just by running on GCP.
        
       | cookiengineer wrote:
       | Can we take a moment to talk about this club's business model?
       | 
       | There's not even any information to see what the "private forum
       | access" that you have to pay for is about, what kind of people
       | are in it...or even to know about what happens with the money.
       | 
       | To me, this sounds like a scam.
       | 
       | I mean, no information about any company. No imprint. No privacy
       | policy. No non-profit organization. And just a copy/paste
       | wordpress instance.
       | 
       | I mean, srsly. I am building a peer-to-peer network that tries to
       | liberate the power of google, specifically, and I would not even
       | consider joining this club. And I am the best case scenario of
       | the proposed market fit.
        
         | adamdusty wrote:
         | They want you to pay them to "research" google's web crawling
         | monopoly. It's really just a donation, but they don't frame it
         | like that. It's probably more credible than using a
         | crowdfunding website, because it sounds like they're pushing
         | for actual legislation.
         | 
         | > Meet with legislators and regulators to present our findings
         | as well as the mock legislation and regulations. We can't
         | expect that we can publish this website or a PDF and then sit
         | back while governments just all start moving ahead on their
         | own. Part of the process is meeting with legislators and
         | regulators and taking the time helping them understand why
         | regulating Google in this way is so important. Showing up and
         | answering legislators' questions is how we got cited in the
         | Congressional Antitrust report and we intend to keep doing
         | what's worked so far.
        
         | judge2020 wrote:
         | Not being set up as a 527 nonprofit[0] is the biggest red flag
         | - no donation or membership money has to be spent for political
         | purposes. They also use memberful for their membership/payment
         | system, which doesn't require owning a business, so you might
         | be paying out to the owner directly instead of to a business
         | with its own bank account. Maybe the owner is looking at HN and
         | can clarify.
         | 
         | To add, there are a lot of businesses that use the term
         | 'Knucklehead', so finding their business on secretary of
         | state business searches might be impossible.
         | 
         | 0: https://www.irs.gov/charities-non-profits/political-
         | organiza...
        
       | drivingmenuts wrote:
       | How about a system whereby we tell others whether or not we want
       | to be crawled/not crawled by them? /s
        
       | [deleted]
        
       | tomc1985 wrote:
       | I think the solution here is everybody masquerades as Googlebot
       | so we can render the whole thing moot
        
         | quantumofalpha wrote:
         | Ignoring robots.txt is trivial - that's why some (many?)
         | sites enforce it by verifying the source IP and recognizing
         | Googlebot by its IP addresses. How will you get access to one
         | of those?
        
           | p-sharma wrote:
           | What does "recognize Googlebot from its IP addresses" mean?
           | If I'm a human and I access a site, I have some other IP than
           | Googlebot, how should this side know if I'm a human or
           | knuckleheadsbot?
        
             | quantumofalpha wrote:
             | if you're claiming to be User-Agent: Googlebot, but your IP
             | doesn't seem like it belongs to Google, don't you think
             | it's a clear sign that you're FAKING IT?
             | 
             | The check itself could be implemented, for example, with
             | an ASN or reverse DNS lookup, or by hard-coding Google's
             | known IP ranges (though those are prone to going stale).
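             | 
             | A minimal sketch of that forward-confirmed reverse DNS
             | check, using only Python's standard socket module (the
             | function name and accepted hostname suffixes here are
             | illustrative - verify against Google's current
             | documentation):

```python
import socket

def is_real_googlebot(ip):
    """Forward-confirmed reverse DNS check for a claimed Googlebot IP.

    1. Reverse-resolve the IP to a hostname.
    2. Require the hostname to end in googlebot.com or google.com.
    3. Forward-resolve that hostname and confirm it maps back to the
       same IP (this defeats spoofed PTR records).
    """
    try:
        host, _, _ = socket.gethostbyaddr(ip)
    except OSError:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        # Forward-confirm: the hostname must resolve back to this IP.
        forward_ips = {info[4][0] for info in socket.getaddrinfo(host, None)}
    except OSError:
        return False
    return ip in forward_ips
```

             | A request claiming User-Agent: Googlebot from an IP that
             | fails this check can be treated as an impostor.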
        
             | smarx007 wrote:
             | https://developers.google.com/search/docs/advanced/crawling
             | /...
        
       | p-sharma wrote:
       | Maybe a naive question, but what prevents Knuckleheads' from
       | ignoring the robots.txt and crawling the site anyway? And if
       | it's so easy to do, how does Google have a monopoly on crawling?
        
         | foobar33333 wrote:
         | On smaller sites, nothing usually. But on bigger sites you will
         | be blocked. You will probably be blocked even if you do
         | follow robots.txt.
        
         | judge2020 wrote:
         | It's just rude to do so, and there are some technical issues
         | with doing that as well (such as crawling an admin panel,
         | which might trigger backend alarms/security alerts). Google
         | also doesn't have a legal monopoly on crawling, only a
         | natural monopoly, thanks to a lot of websites independently
         | choosing to only allow Google and Bing because of the many
         | issues with third-party crawlers (e.g. crawling all pages at
         | once, costing money/slowing down the site[0]).
         | 
         | 0: https://news.ycombinator.com/item?id=26593722
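         | 
         | The kind of policy being described - allow only Google and
         | Bing, deny everyone else - looks roughly like this in
         | robots.txt (an empty Disallow means "allow everything"):

```
User-agent: Googlebot
Disallow:

User-agent: Bingbot
Disallow:

User-agent: *
Disallow: /
```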
        
       | jinseokim wrote:
       | This has been submitted to HN quite a few times.
       | 
       | https://news.ycombinator.com/item?id=25426662 (Most comments; 11
       | comments)
       | 
       | https://news.ycombinator.com/item?id=25417067 (3 comments)
       | 
       | https://news.ycombinator.com/item?id=25546867 (Most recent; 89
       | days ago)
       | 
       | https://news.ycombinator.com/item?id=25543859
       | 
       | https://news.ycombinator.com/item?id=25424852
        
         | Darkphibre wrote:
         | Hooray! Looks like I'm one of today's lucky 10,000. :)
         | 
         | https://xkcd.com/1053/
        
         | [deleted]
        
         | skinkestek wrote:
         | Wasn't aware of that.
         | 
         | Resubmitting interesting content that hasn't gotten traction
         | earlier is, however, explicitly allowed in the guidelines,
         | IIRC.
        
           | pessimizer wrote:
           | And linking past threads on the same subject is helpful.
        
         | monkeybutton wrote:
         | Interesting that the most comments it got before was 11, and
         | today it makes it to the front page! This is a good
         | illustration of how whether or not a submission gets traction
         | can be fairly stochastic.
         | 
         | On topic, Stack Overflow does exactly what the article is
         | talking about: they lock down their sitemap and make special
         | exceptions for the Googlebot:
         | 
         | https://meta.stackexchange.com/a/98087
         | 
         | https://meta.stackexchange.com/questions/33965/how-does-stac...
         | 
         | I can understand SO's reasoning but it only perpetuates the
         | incumbents' stranglehold on the internet.
        
           | jszymborski wrote:
           | I think it's partly because they created a website which
           | reported on the status of the Ever Given, and which rose to
           | #1 on the front page.
           | 
           | I feel like I often see submissions which are, even
           | tangentially, related to front page material rise very
           | quickly.
           | 
           | Regardless, congrats to Knuckleheads Club for fighting the
           | good fight.
        
             | skinkestek wrote:
             | You are right, that was how I found it.
        
           | judge2020 wrote:
           | > They lock down their sitemap and make special exceptions
           | for the Google bot:
           | 
           | Their robots.txt, on the other hand, is more restrictive of
           | Googlebot:
           | 
           | https://stackoverflow.com/robots.txt
           | 
           |   User-agent: Googlebot-Image
           |   Disallow: /*/ivc/*
           |   Disallow: /users/flair/
           |   Disallow: /jobs/n/*
           |   ..
        
       | tmcw wrote:
       | I've definitely scraped by this problem on several occasions.
       | Recently I was writing a tool to check outgoing links from my
       | site, to see which sites are offline (it's called notfoundbot).
       | What I found was that many sites have "DDoS Protection" that
       | makes such an effort impossible, other sites whitelist the cURL
       | headers, and others like it when you pretend to be a search engine.
       | 
       | Basically writing some code that tests whether "a website is
       | currently online or offline" is much, much harder than you think,
       | because, yep, the only company that can do that is Google.
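A sketch of the kind of liveness check involved (this is not notfoundbot's actual code; the UA strings and the retry-on-403/429 heuristic are assumptions):

```python
import urllib.request
import urllib.error

# Honest bot UA first, browser-like UA as a fallback; both invented here.
USER_AGENTS = [
    "notfoundbot/1.0 (+https://example.com/bot)",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36",
]

def is_alive(url, timeout=10.0):
    """Best-effort check: does any UA get a non-error response?"""
    for ua in USER_AGENTS:
        req = urllib.request.Request(url, headers={"User-Agent": ua})
        try:
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                if resp.status < 400:
                    return True
        except urllib.error.HTTPError as e:
            # 403/429/503 against a bot-ish UA often just means "DDoS
            # protection" or UA filtering: retry with the next UA
            # before declaring the link dead.
            if e.code in (403, 429, 503):
                continue
            return False
        except OSError:
            return False          # DNS failure, refused, timeout, ...
    return False
```

Even this is optimistic: some protection layers return 200 with an interstitial challenge page, which no status-code check can distinguish from real content.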
        
       | varispeed wrote:
       | I disallow scanning on all my projects. After GDPR I also removed
       | all analytics - I realised it was just a time sink: instead of
       | focusing on content I would often focus on getting bigger
       | numbers. I am not a marketer, so it didn't have much value to me
       | and it would just enlarge Google's dataset without any payment. I
       | get that you cannot find my projects in a search engine. I am
       | okay with that :-)
        
       | topspin wrote:
       | If the shared cache ever became significant enough to matter it
       | would be devastated by marketers, scammers and other abusers.
       | Google employs the groomers that make their index at least
       | tolerable, if still clearly imperfect. Without that cadre of well
       | compensated expertise to win the arms race against such abusers
       | the scheme is not feasible.
       | 
       | I suppose this could be crowdsourced if I didn't know about
       | politics and how any attempt at delegating the responsibility for
       | blessing sites and their indexes would become a controversy.
       | Google takes lots of heat about its behavior already, but Google
       | is a private entity and can indulge its private prerogatives for
       | the most part. Without that independence this couldn't function.
        
         | finnthehuman wrote:
         | I don't really understand your comment. Marketers, scammers and
         | other abusers already publish to the web with the intention to
         | be included in a crawl. Postprocessing crawl data is already a
         | thing.
         | 
         | Assuming this hypothetical shared crawl cache were to exist, it
         | does not preclude google (and all consumers of that cache)
         | doing their own processing downstream of that cache. Does it?
         | 
         | What's the new attack vector?
        
           | topspin wrote:
           | > I don't really understand your comment.
           | 
            | If you don't, then you fail to appreciate the amount of
            | labor it takes to thwart bad actors from ruining indexes.
            | Abusers do publish to the web, and we enjoy not wallowing
            | in their crap because a small army of experienced and
            | expensive people at a select few Big Tech companies is
            | actively shielding us from it.
           | 
            | It's easy to anticipate the malcontent view: 'Google spends
           | all its resources on ads and ranking and we don't need all
           | that.' That is naive; if Google completely neglected grooming
           | out the bad actors people wouldn't use Google and Google's
           | business model wouldn't be viable.
           | 
            | So the obvious question is: where is this mechanism without
            | Google et al.? Will the published caches be 99% crap (and
           | without an active defense against crap you can bet your life
           | it will) and anything derived from it hopelessly polluted? If
           | so then it isn't viable.
           | 
           | Now the instinct will be to find a groomer. Guess what;
           | that's probably doomed too. No selection will be impartial to
           | all, so you get to fight that battle. Good luck.
        
             | finnthehuman wrote:
             | >Will the published caches be 99% crap
             | 
             | Yes. It will be exactly as crap as whatever's published on
             | the web.
             | 
              | And the utility of Google's search engine would be to
              | perform their proprietary processing on top of the
              | publicly-available crawl results, analogous to how their
              | search is already performing proprietary processing on
              | top of a crawl cache.
             | 
             | >If you don't then you fail to appreciate the amount of
             | labor it takes to thwart bad actors from ruining indexes.
             | 
             | Did you miss the part where I said "Assuming this
             | hypothetical shared crawl cache were to exist, it does not
             | preclude google (and all consumers of that cache) doing
             | their own processing downstream of that cache. Does it?"
        
       | herewhere wrote:
        | Around a decade ago, I was part of the team responsible for
        | msnbot (a web crawler for Bing). Most websites gave Googlebot
        | 10-20x higher limits in robots.txt than any other crawler.
        | 
        | Google definitely has an unfair advantage there.
        | 
        | Bing and DuckDuckGo still provide very reasonable results with
        | 10-20x fewer resources, but not on par with Google.
        
       | andrewclunn wrote:
       | How about an opt-in search engine cache? One where a domain needs
       | to agree to allow their site to be crawled, but as a result also
       | gives said crawler full access? And then that repository would be
       | made publicly available to all search engines to use. Sort of an
       | AP for searches, that would give a baseline that wouldn't
       | preclude search engines from going further, but which would
       | certainly lower the cost and network traffic for the search
       | engines and sites that take advantage of it?
        
       | l72 wrote:
       | I tried to set up YaCy [1] at home to index a few of my favorite
       | smaller websites, so I could quickly search just them. That
       | turned out to be a bad idea. Some ended up blocking my home IP
       | address and others reported me to my ISP. None of these sites
       | were that large, and I wasn't continuously crawling them...
       | 
       | [1] https://yacy.net/
        
         | slenk wrote:
         | I have been running my own Searx instance in AWS for a while
         | and have not gotten blocked yet anywhere
        
         | jedimastert wrote:
         | How often were you searching?
        
           | l72 wrote:
           | I was regularly searching, but I was rarely indexing any of
           | the sites. I struggled to even get an initial index of many
           | of the sites, due to being blocked or being reported.
        
       | samizdis wrote:
       | Coincidentally, this item [1] has just turned up in HN - Common
       | Crawl
       | 
       | [1] https://news.ycombinator.com/item?id=26594172
        
       | mrweasel wrote:
       | While I don't disagree with the idea that all crawlers should
       | have equal access, we also need to address the quality of many
       | crawlers.
       | 
       | Google and Microsoft have never hammered any website I've run
       | into the ground. Crawlers from other, smaller search engines
       | have, to the point where it was easier to just block them
       | entirely.
       | 
       | Part of the problem is that sites want search engines to index
       | their site, but not random people scraping the entire
       | site. So they do the best they can, and forget that Google isn't
       | the web. I doubt it's shady deals with Google, it's just small
       | teams doing the best they can and sometimes they forget to think
       | ideas through, because it's good enough.
        
         | rstupek wrote:
         | We've had the Bing crawler make an obscene number of requests
         | quite often, but fortunately it doesn't bring us down.
        
         | kmeisthax wrote:
         | I think this is a problem which should be solved by automatic
         | rate-limiting and throttling at the application/caching layer
         | (or just individual web server for smaller sites). Requests
         | with a non-browser UA get put into a separate bots-only queue
         | that drains at a rate of ~1/sec or so. If the queue fills up
         | you start sending 429s with random early failures for bots
         | (UA/IP/subnet pairs) that are overrepresented in the traffic
         | flow.
         | 
         | I don't know if such software exists, but it should. It would
         | be a hell of a lot healthier for the web than "everyone but
         | Google f*ck off", and it creates an incentive for bots to
         | throttle themselves (as they're more likely to get a faster
         | response than trying to request as fast as possible).
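The queue-plus-429 idea above can be sketched as a shared token bucket for non-browser User-Agents (the class names, thresholds, and the crude UA heuristic are all illustrative assumptions, not an existing package):

```python
import time

class BotBucket:
    """Shared token bucket for requests whose User-Agent is not a browser."""

    def __init__(self, rate=1.0, burst=5):
        self.rate = rate              # tokens refilled per second (~1 bot req/sec)
        self.burst = burst            # how much bot backlog we tolerate
        self.tokens = float(burst)
        self.last = time.monotonic()

    def admit(self):
        """Return 200 to serve the bot now, 429 to tell it to back off."""
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return 200
        return 429

def handle(bucket, user_agent):
    # Crude browser check for illustration; real UA classification is
    # far more involved (and an arms race of its own).
    if "Mozilla" in user_agent:
        return 200                    # browsers skip the bot queue
    return bucket.admit()
```

Crawlers that back off when they see 429s end up with more useful throughput than ones that hammer the queue, which is exactly the incentive described above.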
        
           | NathanKP wrote:
           | I suspect that at least some of the bots use web server
           | response times and response codes as part of the signal for
           | ranking. If your website does not appear capable of handling
           | load then it won't rank as highly, because it is not in their
           | best interests to have search results that don't load.
        
       | henriquez wrote:
       | I'd like to see some data on their claim that website operators
       | are giving googlebot special privileges. As far as I can tell it
       | would be a huge pain in the ass to block crawler bots from my
       | servers, not that I've tried. I have some weird pages that tend
       | to get crawlers caught in infinite loops, and I try to give them
       | hints with robots.txt but most of the bots don't even respect
       | robots.txt.
       | 
       | If I actually wanted to restrict bots, it would be much easier to
       | restrict googlebot because they actually follow the rules.
       | 
       | I don't disagree in principle that there should be an open index
       | of the web, but for once I don't see Google as a bad actor here.
        
         | throwaway_uat wrote:
         | LinkedIn profiles and Quora answers are accessible to
         | Googlebot without signing in.
        
         | burkaman wrote:
         | See figure I.4 on page 24 of this UK government report:
         | https://assets.publishing.service.gov.uk/media/5efb1db6e90e0...
         | 
         | Additional evidence here: https://knuckleheads.club/the-
         | evidence-we-found-so-far/
        
         | malf wrote:
         | What do you think this is used for?
         | 
         | https://developers.google.com/search/docs/advanced/crawling/...
        
         | calimac wrote:
         | The studies and data to support their claim are in the first
         | paragraph of the article you "read" before posting the
         | question.
        
         | Lammy wrote:
         | Spoofing your user-agent as googlebot is a common way to bypass
         | paywalls, is (was?) a way to read Quora without creating an
         | account, etc. Publishers obviously need to send their
         | page/article to Google if they want it to be indexed but may
         | not want to send the same page content to a normal user:
         | https://www.256kilobytes.com/content/show/1934/spoofing-your...
         | 
         | This was common even back in the mid-2000s:
         | 
         | https://www.avivadirectory.com/bethebot/
         | 
         | https://developers.google.com/search/blog/2006/09/how-to-ver...
        
         | soheil wrote:
         | It's hilarious to think there are people who believe Googlebot
         | does not get special treatment from website operators. Here is
         | an experiment you can do in a jiffy: write a script that crawls
         | any major website and see how many URL fetches it takes before
         | your IP gets blocked.
         | 
         | Googlebot has a range of IP addresses that it publicly
         | announces so websites can whitelist them.
        
           | quitethelogic wrote:
           | > Googlebot has a range of IP addresses that it publicly
           | announces so websites can whitelist them.
           | 
           | Google says[1] they do not do this:
           | 
           | "Google doesn't post a public list of IP addresses for
           | website owners to allowlist."
           | 
           | [1]https://developers.google.com/search/docs/advanced/crawlin
           | g/...
        
             | johncolanduoni wrote:
             | From that same page they recommend using a reverse DNS
             | lookup (and then a forward DNS lookup on the returned
              | domain) to validate that it is Googlebot. So the effect is
             | the same for anyone trying to impersonate googlebot (unless
             | they can attack the DNS resolution of the site they're
             | scraping I guess).
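The double lookup described can be sketched in a few lines (this follows the documented reverse-then-forward procedure, but the helper name and error handling are my own):

```python
import socket

def is_real_googlebot(ip):
    """Reverse-resolve the IP, check the domain, forward-resolve it back."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)        # reverse DNS lookup
    except OSError:
        return False
    # Genuine crawl IPs reverse-resolve into these Google-owned zones.
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        _, _, addrs = socket.gethostbyname_ex(host)  # forward DNS lookup
    except OSError:
        return False
    return ip in addrs                               # must round-trip to the same IP
```

An impersonator can set any UA string and even fake reverse DNS for IPs they control, but they cannot make Google's authoritative nameservers forward-resolve that hostname back to their own IP, which is why the round trip matters.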
        
           | Mauricebranagh wrote:
            | I have never had that problem running Screaming Frog on big
            | brand sites, apart from one or two times.
        
           | dheera wrote:
           | Do any of them intersect with Google Cloud IP addresses? If
           | so set up a VPN server on Google Cloud.
        
           | WesolyKubeczek wrote:
           | I don't scrape a website often, but when I do, I'm using a
           | user agent of a major browser.
        
           | tedunangst wrote:
           | I don't whitelist googlebot, but I don't block them either
           | because their crawler is fairly slow and unobtrusive. Other
           | crawlers seem determined to download the entire site in 60
           | seconds, and then download it again, and again, until they
           | get banned.
        
           | [deleted]
        
         | suicas wrote:
         | A company I worked for ~7 years ago ran its own focused web
         | crawler (fetching ~10-100m pages per month, targeting certain
         | sections of the web).
         | 
         | There were a surprising number of sites out there that
         | explicitly blocked access to anyone but Google/Bing at the
         | time.
         | 
         | We'd also get a dozen complaints or so a month from sites we'd
         | crawled. Mostly upset about us using up their bandwidth, and
         | telling us that only Google was allowed to crawl them (though
         | having no robots.txt configured to say so).
        
           | luckylion wrote:
           | I usually recommend setting only Google/Bing/Yandex/Baidu etc
           | to Allow and everything else to Disallow.
           | 
           | Yes, the bad bots don't give a fuck, but even the non-
           | malicious bots (ahrefs, moz, some university's search engine
           | etc) don't bring any value to me as a site owner, take up
            | bandwidth and resources, and fill up logs. If you can remove
           | them with three lines in your robots.txt, that's less noise.
            | Universities especially, in my opinion, often behave badly
           | and are uncooperative when you point out their throttling
           | does not work and they're hammering your server. Giving them
           | a "Go Away, You Are Not Wanted Here" in a robots.txt works
           | for most, and the rest just gets blocked.
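The three-line pattern mentioned above looks roughly like this (a sketch; in practice each allowed bot gets its own User-agent group, and malicious bots simply ignore the file):

```
User-agent: Googlebot
Disallow:

User-agent: Bingbot
Disallow:

User-agent: *
Disallow: /
```

An empty Disallow means "allow everything", so the named bots get full access while every other cooperative crawler is shut out.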
        
           | dillondoyle wrote:
            | Isn't that the website owner's right, though? I'm not sure I
           | understand the problem here.
           | 
           | If Google is taking traffic and reducing revenue, a company
           | can deny in robots.txt. Google will actually follow those
           | rules - unlike most others that are supposedly in this 2nd
           | class.
        
             | suicas wrote:
             | Yup, no problem here, was just making an observation about
             | how common such blocking was (and about the fact that some
             | people were upset at being crawled by someone other than
             | Google, despite not blocking them).
             | 
             | The company did respect robots.txt, though it was initially
             | a bit of a struggle to convince certain project managers to
             | do so.
        
         | jameshart wrote:
         | When you operate commercial sites at scale, bots are a real
         | thing you spend real engineering hours thinking about and
         | troubleshooting and coding to solve for.
         | 
         | And yes, that means google gets special treatment.
         | 
         | Think about the model for a site like stackoverflow. The
         | longest of long tail questions on that site: what's the actual
         | lifecycle of that question?
         | 
         | - posted by a random user
         | - scraped by Google, Bing, et al.
         | - visited by someone who clicked on a search result on Google
         | - eventually, answered
         | - hopefully, reindexed by Google, Bing, et al.
         | - maybe never visited again because the answer now shows up
         |   on the Google SERP
         | 
         | In the lifetime of that question how many times is it accessed
         | by a human, compared to the number of times it's requested and
         | rerequested by an indexing bot?
         | 
         | What would be the impact on your site of three more bots as
         | persistent as google bot? Why should you bother with their
         | requests?
         | 
         | So yes, sites care about bot traffic and they care about google
         | in particular.
        
         | noxvilleza wrote:
         | Google aren't the bad actor in the sense that they are actively
         | doing something wrong, but they are definitely benefiting from
         | the monopoly that they created and work on maintaining. If this
         | continues then nobody will really ever be able to challenge
         | them, which means possibly "better" products will fail to
         | penetrate the market.
        
         | ehsankia wrote:
         | > but for once I don't see Google as a bad actor here.
         | 
         | As inflammatory as the headline of the page looks, they
         | literally admit it's not google's fault in the smaller text
         | lower down:
         | 
         | "This isn't illegal and it isn't Google's fault, but"
        
         | zmarty wrote:
         | A lot of news websites restrict any crawler other than Google.
         | And this does not happen only via robots.txt.
        
           | simias wrote:
           | Indeed, years ago I had scripts to automatically fetch URLs
           | from IRC and I quickly realized that if I didn't spoof the
           | user agent of a proper web browser many websites would reject
           | the query. Googlebot's UA worked just fine however.
        
             | judge2020 wrote:
             | > Googlebot's UA worked just fine however
             | 
             | They obviously don't care enough then - Google says you
              | should use rDNS to verify that Googlebot crawls are
             | real[0]. Cloudflare does this automatically now as well for
             | customers with WAF (pro plan).
             | 
             | 0: https://developers.google.com/search/docs/advanced/crawl
             | ing/...
        
       | staunch wrote:
       | Google makes $150+ billion from Google Search per year. Google
       | Search could likely be operated for (much less than) $10
       | billion per year.
       | 
       | So, Google is in effect taxing us all $140 billion per year.
       | 
       | It's not dissimilar from how Wall Street effectively taxes us all
       | for an even larger amount.
       | 
       | In both cases, we could use some kind of non-profit open system
       | to facilitate web search and stock trading.
       | 
       | The Great Lie that Google is doing a good thing by charging money
       | to insert "relevant ads" above the search results is totally
       | wrong. If those ads are the most relevant, they should just be
       | the top organic results, obviously.
       | 
       | Google mostly solved search 20 years ago. There's really nothing
       | that impressive about Google Search in 2021. It should be
       | relatively easy to replace it with something open, leveraging the
       | massive improvements in hardware and software. It could operate
       | like Wikipedia or Archive.org. The hard part is probably getting
       | the right team and funding assembled.
        
       | systemBuilder wrote:
       | This is not really about Google.
       | 
       | Websites block crawlers because they get abused / crashed by
       | Crawlers. In the early days (2000-2010) Google not only got
       | banned by some websites, it even got DNS-banned for abusing some
       | DNS domains. You see, Google has already built the
       | "megacrawlers" described in this article; it can melt any website
       | on the Internet, even Facebook, the largest, and it paid a
       | high price for letting the early Google crawlers run free.
       | 
       | Google today has a rate-limit for every single website and DNS
       | sub-domain on the internet. For small websites the default is a
       | handful of web pages every few seconds. Google has a very slow
       | (days) algorithm to increase its crawl rate, and a very fast (1d)
       | algorithm to cut the rate limit if it's getting any of the errors
       | likely due to website overload.
       | 
       | To summarize, Google has several layers of congestion control
       | custom-designed into the crawl application. Most small web
       | crawlers have zero.
       | 
       | None of these other crawlers have figured this out, so they abuse
       | websites, causing all small-scale crawlers to get banned.
       | 
       | - ex-Google Crawl SRE
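The slow-up/fast-down behaviour described is essentially AIMD congestion control; a toy sketch, with invented constants that are certainly not Google's actual parameters:

```python
class CrawlRateLimiter:
    """Per-host crawl pacing: slow additive ramp-up, fast multiplicative backoff."""

    def __init__(self, rate=1.0, floor=0.1, ceiling=50.0):
        self.rate = rate          # allowed fetches per second for this host
        self.floor = floor        # never stop probing the host entirely
        self.ceiling = ceiling    # never exceed this even for huge sites

    def on_success(self):
        # Additive increase: many clean fetches are needed before the
        # host is crawled noticeably faster ("days" to ramp up).
        self.rate = min(self.ceiling, self.rate + 0.01)

    def on_overload(self):
        # Multiplicative decrease: a burst of 429/503s or timeouts
        # halves the crawl rate immediately.
        self.rate = max(self.floor, self.rate / 2.0)
```

This is the same additive-increase/multiplicative-decrease shape TCP uses for network congestion, applied per host instead of per connection; it is exactly the layer most small crawlers never build.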
        
         | ricardo81 wrote:
         | Thank you for those insights, it's a topic I'm interested in.
         | Agree with what you're saying about naive bots hitting
         | websites/hosts/subnets too hard, in the context of site owners
         | being hit by multiple bots for multiple reasons and them
         | questioning the return they'll get.
         | 
         | I'd be interested to know more info wrt DNS lookups. Did you
         | apply a blanket rate limit on the number of DNS requests you'd
         | make to any particular server?
         | 
         | From past experience I know the .uk Nominet servers would temp-
         | ban if you were doing more than a few hundred lookups per
         | second. At the next host level down, was there a blanket limit
         | or was it dependent on the number of domains that nameserver
         | was responsible for?
        
       | dbsmith83 wrote:
       | I just don't see this working out legally. How would it even
       | work?
       | 
       | From the "learn more"
       | 
       | > Sometime soon we will be publishing what we think should happen
       | and what we think will happen. These two futures diverge and we
       | believe that, while the gap between them exists, it will entrench
       | Google's control over the internet further. We believe that
       | nothing short of socialization of these resources will work to
       | remove Google's control over the internet. Our hope is that in
       | publishing this work right now we will let the genie out of the
       | bottle and start a process towards socialization that cannot be
       | undone.
       | 
       | Sorry, but I am deeply skeptical of this. This sounds like the first
       | step towards a non-free internet. At the end of the day, it is
       | your box on the web, and if you want or don't want
       | someone/something to crawl it, that is your call to make.
        
       | marshmallow_12 wrote:
       | I have an idea: remove the art of web crawling from the domain of
       | a single company and create an international group of
       | interested parties to run it instead. I'm thinking broadly along
       | the lines of the Bluetooth SIG. Maybe it will be a bit more
       | complicated, and require international political efforts, but it
       | will make the search engine market way more democratic.
        
       | sxp wrote:
       | https://knuckleheads.club/the-googlebot-monopoly/ has actual
       | details.
       | 
       | > Let's take a look at the robots.txt for census.gov from October
       | of 2018 as a specific example to see how robots.txt files
       | typically work. This document is a good example of a common
       | pattern. The first two lines of the file specify that you cannot
       | crawl census.gov unless given explicit permission. The rest of
       | the file specifies that Google, Microsoft, Yahoo and two other
       | non-search engines are not allowed to crawl certain pages on
       | census.gov, but are otherwise allowed to crawl whatever else they
       | can find on the website. This tells us that there are two
       | different classes of crawlers in the eyes of the operators of
       | census.gov: those given wide access, and those that are totally
       | denied.
       | 
       | > And, broadly speaking, when we examine the robots.txt files for
       | many websites, we find two classes of crawlers. There is Google,
       | Microsoft, and other major search engine providers who have a
       | good level of access and then there is anyone besides the major
       | crawlers or crawlers that have behaved badly in the past that are
       | given much less access. Among the privileged, Google clearly
       | stands out as the preferred crawler of choice. Google is
       | typically given at least as much access as every other crawler,
       | and sometimes significantly more access than any other crawler.
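The two-class pattern described above reads roughly like this (an illustrative reconstruction, not the actual census.gov file):

```
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /cgi-bin/

User-agent: bingbot
Disallow: /cgi-bin/
```

Everyone is denied by default; the named major crawlers are carved out, with only a few paths off-limits to them.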
        
         | indymike wrote:
         | Broadly speaking, robots.txt files are often ignored. I used to
         | run a fairly large job ad scraping organization, and we would
          | be hired by companies (700 of the Fortune 1000 used us) to
          | scrape the job ads from their career pages, and then post those
          | jobs on job boards. 99 times out of 100, the robots file would
          | disallow us from scraping. Since we were being paid by that
         | company's HR team to scrape, we just ignored it because getting
         | it fixed would take six months and 22 meetings.
        
           | chmod775 wrote:
           | > Broadly speaking, robots.txt files are often ignored.
           | 
           | If you wanna go nuclear on people who do that, include an
           | invisible link in your html and forbid access to that URL in
           | your robots.txt, then block every IP who accesses that URL
           | for X amount of time.
           | 
           | Don't do this if you actually rely on search engine traffic
           | though. Google may get pissed and send you lots of angry mail
           | like "There's a problem with your site".
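A minimal sketch of that trap (the path, ban window, and in-memory ban list are all invented for illustration; a real deployment would enforce the ban at the web server or firewall layer):

```python
import time

BAN_SECONDS = 24 * 3600          # invented ban window
TRAP_PATH = "/trap-29a1"         # invented; linked invisibly, and listed
                                 # as "Disallow: /trap-29a1" in robots.txt

banned = {}                      # ip -> unix time when the ban expires

def handle_request(ip, path, now=None):
    """Return an HTTP status for this request; ban robots.txt violators."""
    now = time.time() if now is None else now
    if banned.get(ip, 0.0) > now:
        return 403                       # still serving out a ban
    if path == TRAP_PATH:
        banned[ip] = now + BAN_SECONDS   # only robots.txt ignorers get here
        return 403
    return 200
```

Since no human and no compliant crawler ever follows the invisible link, any hit on the trap path is strong evidence the client ignored robots.txt.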
        
             | jedberg wrote:
             | > Don't do this if you actually rely on search engine
             | traffic though. Google may get pissed and send you lots of
             | angry mail like "There's a problem with your site".
             | 
             | Ah, but of course you would exclude Google's published
             | crawler IPs from this restriction, because that is exactly
             | what they want you to do.
        
         | TheAdamAndChe wrote:
         | Are there any actual repercussions for just ignoring
         | robots.txt?
        
           | 5560675260 wrote:
           | Your crawler's IP might get banned, eventually.
        
           | asciident wrote:
           | There is if you are doing it for work. For example, your
           | company could get sued if you are found using that data and
           | ignoring the ToS. If you are a public figure, you could get
           | your name tarnished as doing something unethical or the media
           | may call it "hacking". If you are rereleasing the data then
           | you risk getting a takedown notice.
        
           | the_dege wrote:
          | Sometimes website admins will also try to report your IPs to
           | the service provider as a source of attacks (even if not
           | true).
        
             | DocTomoe wrote:
              | Given how often I've had misbehaving crawlers slow my own
             | servers in the early 2000s, I do not see how a crawler that
             | disobeys robots.txt is not an attempted attack.
        
         | JackFr wrote:
         | So from the website's point of view there is no difference
         | between 'crawling' and 'scraping'. Census.gov I assume has a
         | ton of very useful information which is in the public domain
         | which a host of potential companies could monetize by regularly
          | scraping census.gov. Census.gov's purpose of making this
          | information available to people is served by Google, Yahoo and
          | Bing. On the other hand, if I have a business which is based on
          | that data, I'm in fact at cross purposes with them.
        
           | njharman wrote:
           | I'm generally anti business. But I have to disagree. "The
           | Public" that the government serves includes businesses.
           | Businesses (ignoring corporate personhood bullshit) are owned
           | and operated by people.
           | 
           | I do not want the government deciding "what purposes" e.g.
           | non-commercial, serve the public good. The public gets to
            | decide that. (Charging a license fee for commercial use is
            | maybe OK, assuming supporting that use costs the government
            | "too much".)
           | 
            | And I very much do not want the current situation, with the
            | government granting a handful of corporations (the farthest
            | thing from the public possible) access and denying everyone
            | else, including all of the actual public.
        
             | hnbroseph wrote:
             | > I do not want the government deciding "what purposes"
             | e.g. non-commercial, serve the public good. The public gets
             | to decide that.
             | 
             | the public's "decision" on things like this is made
             | manifest by government policy, no?
        
               | danShumway wrote:
               | In theory. In practice, is every single policy that our
               | government upholds currently popular with the majority of
               | people?
               | 
               | It's possible to have government policies that the
               | majority of people disagree with, that remain for
               | complicated reasons related to apathy, lobbying, party
               | ideology, or just because those issues get drowned out by
               | more important debates.
               | 
               | Government is an extension of the will of the people, but
               | the farther out that extension gets, the more divorced
               | from the will of the people it's possible to be. That's
               | not to say that businesses are immune from that effect
               | either -- there are markets where the majority of people
               | participating in them aren't happy with what the market
               | is offering. All of these systems are abstractions,
               | they're ways of trying to get closer to public will, and
               | they're all imperfect. But government is particularly
               | abstracted, especially because the US is not a direct
               | democracy.
               | 
               | I'm personally of the opinion that this discussion is
               | moot, because I think that people have a fundamental
               | Right to Delegate[0], and I include web scraping public
               | content under that right. But ignoring that, because not
               | everyone agrees with me that delegation is right,
               | allowing the government to unilaterally rule on who isn't
               | allowed to access public information is still
               | particularly susceptible to abuse above and beyond what
               | the market is capable of.
               | 
               | [0]: https://anewdigitalmanifesto.com/#right-to-delegate
        
             | pessimizer wrote:
             | A specific case where this favorite-picking by government
             | enables corruption: https://en.wikipedia.org/wiki/Nationall
             | y_recognized_statisti...
             | 
             | And an example from the quickly-approaching future, when
             | there will be Nationally Recognized Media Organizations who
              | license "Fact-Checkers," through which posts to public-
              | facing sites will have to be submitted for certification and
             | correction.
        
               | marcosdumay wrote:
               | Favorite-picking by the government is corruption by
               | itself already.
        
           | indymike wrote:
           | I used to run a fairly large job ad scraping operation. Our
           | scraped data was used by many US state and federal job sites.
            | "Scraping" is just using software to load a page and extract
            | content. "Crawling" is just loading a page, finding the
            | hyperlinks (themselves a kind of content), and then crawling
            | those links in turn. Crawling is just a kind of scraping.
        
           | vharuck wrote:
           | In the case of Census.gov, they offer an API to get the
           | data[0]. It's actually pretty nice. Stable, ton of data,
           | fairly uniform data structure across the different products.
            | Very high rate limits, considering most data only needs to be
            | retrieved once a year. I think they understand the difference
           | between crawling and scraping.
           | 
            | [0] https://www.census.gov/data/developers.html
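As a concrete illustration of the API's shape, a query URL can be assembled from a year, a dataset path, variables, and a geography. The `acs/acs1` dataset and the `B01001_001E` (total population) variable below are examples of the documented conventions, best double-checked against the developer docs linked above:

```python
from urllib.parse import urlencode

def census_query_url(year, dataset, variables, geography):
    """Build a query URL for the Census Bureau data API.

    The path layout (/data/<year>/<dataset>) and the get=/for= query
    parameters follow the API's documented conventions; the specific
    dataset and variable names used below are illustrative examples.
    """
    base = f"https://api.census.gov/data/{year}/{dataset}"
    query = urlencode({"get": ",".join(variables), "for": geography})
    return f"{base}?{query}"

# e.g. total population (B01001_001E) for every state, ACS 1-year tables:
url = census_query_url(2019, "acs/acs1", ["NAME", "B01001_001E"], "state:*")
```

The response is plain JSON (a list of rows), so no HTML scraping is involved at all.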
        
           | ricardo81 wrote:
           | Having data in the right format as a download or via an API
           | would be the best way to go for public data.
           | 
           | If people have to 'scrape' that data from a public resource,
           | I'd say they're presenting the data in the wrong way.
        
           | mulmen wrote:
           | But Google, Yahoo and Bing are also monetizing the data. Why
           | are they allowed to provide "benefits" but "scrapers" are
           | not? Why is it wrong to monetize public data?
        
           | jonas21 wrote:
           | The census data is available for bulk download, mostly as CSV
           | (for example [1]). Scraping census.gov is worse for both the
           | Census Bureau (which might have to do an expensive database
           | query for each page) and for the scraper (who has to parse
           | the page).
           | 
           | Blocking scrapers in robots.txt is more of a way of saying,
           | "hey, you're doing it wrong."
           | 
           | It's also worth noting that the original article is out of
           | date. The current robots.txt at census.gov is basically wide-
           | open [2].
           | 
           | [1] https://www.census.gov/programs-surveys/acs/data/data-
           | via-ft...
           | 
           | [2] https://www.census.gov/robots.txt
        
             | foobar33333 wrote:
             | Scrapers don't care about robots.txt. I have scraped
             | multiple websites in a previous job and the robots.txt
             | means nothing. Bigger sites might detect and block you but
             | most don't.
        
         | gnramires wrote:
         | Perhaps there could be some kind of 'Crawler consortium'?
         | 
         | Under this consortium, website owners would be allowed to
          | either allow all crawlers (approved by the consortium) or none
          | at all (that is, none that are in the consortium; you could
         | allow a specific researcher or something to crawl your website
         | on a case-by-case basis).
         | 
         | This consortium would be composed of the search engines
         | (Google, MS, other industry members), as well as government
         | appointed individuals and relevant NGOs (electronic frontier
         | foundation, etc?). There would be an approval process that
         | simply requires your crawl to be ethical and respect bandwidth
         | usage. Violations of ethics or bandwidth limits could imply
         | temporary or permanent suspension. The consortium could have
          | some bargaining or regulatory measures to prevent website owners
         | from ignoring those competitive and fairness provisions.
        
           | dragonwriter wrote:
           | > Perhaps there could be some kind of 'Crawler consortium'?
           | 
           | An industry-wide agreement not to compete for commercially
           | valuable access to suppliers of data?
           | 
           | Comprised of companies that are current (and in some cases
           | perennial) focusses of antitrust attention?
           | 
           | I think there might be a problem with that plan.
        
             | gnramires wrote:
             | Well, yes, and a common solution to anti-trust cases, that
             | I know of, is some kind of industry self-regulation. In
              | this case I wouldn't trust the industry alone to self-
              | regulate; hence, it should at least invite governments and
              | civil society (NGOs and other organizations) to
              | participate, holding a minority but not insignificant
              | position.
             | 
             | Could you better describe your objections?
        
             | neolog wrote:
             | I don't see the problem. If a bunch of non-google companies
             | pooled resources to make a crawl, that would reduce market
             | concentration, not increase it.
        
         | adolph wrote:
         | Is it legal for a government entity to issue a robots.txt like
         | that? Maybe the line between use and abuse hasn't been
            | delineated as well as it needs to be.
        
           | bigwavedave wrote:
           | > Is it legal for a government entity to issue a robots.txt
           | like that?
           | 
           | I may be wrong (this isn't my area), but I was under the
           | impression that robots.txt was just an unofficial convention?
           | I'm not saying people should ignore robots.txt, but are there
           | legal ramifications if ignored? I'm not asking about
           | techniques sites use to discourage crawlers/scrapers, I'm
           | specifically wondering if robots.txt has any legal weight.
        
           | vageli wrote:
           | Is failure to honor a robots.txt a crime? Or rather, would it
           | be unlawful to spoof a user agent to access this publicly
           | available data? After the linkedin [0] case it seems
           | reasonable to think not.
           | 
           | [0]: https://www.eff.org/deeplinks/2019/09/victory-ruling-
           | hiq-v-l...
        
             | Spivak wrote:
             | Spoofing user-agents hasn't worked in a long time for
             | anything but small operations because search engines
             | publish specific IP ranges their scrapers use.
        
       | zepearl wrote:
        | Maybe it would be nice if some sort of simple central index of
        | "URLs + their last-updated timestamp/version/eTag/whatever"
        | existed, updated by the site owners themselves?
        | ("push" notification)
       | 
       | Meaning that whenever a page of a website would be created or
       | updated, that website itself would actively update that central
       | index, basically saying "I just created/deleted page X" or "I
       | just updated the contents of page X".
       | 
       | The consequence would be that...
       | 
        | 1) ...crawlers would no longer have to actively (re)scan the
        | whole Internet to find out if anything has changed; they would
        | only have to query that central index against their own list of
        | URLs & timestamps to find out what needs to be (re)scanned.
       | 
        | 2) ...websites would not have to just wait & hope that some bot
        | would decide to come by and look at their site, nor would they
        | have to answer, over and over again, requests that are just
        | meant to check if some content has changed.
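As a sketch of what such a push notification might carry: the endpoint and payload format below are invented for illustration (nothing like this central index exists today; the closest real mechanisms are sitemap pings and WebSub):

```python
import json
import time

# Hypothetical: neither this index service nor this payload format exists.
INDEX_ENDPOINT = "https://index.example.org/notify"  # placeholder URL

def build_notification(url, action, etag=None):
    """Build the body a site would push to the central index on a change.

    'action' is one of 'created', 'updated', 'deleted'; the optional
    'etag' lets crawlers querying the index skip pages whose content
    hasn't actually changed.
    """
    if action not in ("created", "updated", "deleted"):
        raise ValueError(f"unknown action: {action}")
    body = {"url": url, "action": action, "timestamp": int(time.time())}
    if etag is not None:
        body["etag"] = etag
    return json.dumps(body)
```

A crawler would then poll the index for entries newer than its last visit instead of re-fetching every page on the web.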
        
       | soheil wrote:
       | I'm not sure if it is a good thing if there is a public cache of
        | everything that Google has. The issue is that websites will
        | simply stop serving content to Google to protect it from being
        | accessed by their competitors; this in turn will make search
        | much worse and force us back to the pre-search dark ages of the
        | internet. Sites may even serve a crippled version of their
        | content just to get hits, but there is no doubt search quality
        | will suffer.
       | 
       | We're left with a monopoly that is Google, destroying it now
       | could be foolish.
        
       | sesuximo wrote:
       | Seems like a private cache of the web would solve the problem?
       | Why does it need to be public?
        
         | lisper wrote:
         | Seriously? Google _is_ a private cache of the web. That _is_
         | the problem.
        
           | sesuximo wrote:
            | Google doesn't give anyone access to said cache. I mean one
            | crawler with a shared API among competitors: exactly the
            | same as the public cache, but run by a private company and
            | accessed for a small fee.
        
             | ajcp wrote:
             | > run [by] a private company and accessed for a small fee
             | 
             | That is exactly the opposite of a public cache.
        
               | sesuximo wrote:
               | Not really. It serves the same function. Either you pay
               | this hypothetical company or ??? pays to keep up the
               | public one.
        
               | ajcp wrote:
               | Just because it serves the same function does not mean
               | the implementation is the same. Private military
               | contractors and a US infantry squad serve the same
               | function, but the implementation completely changes their
               | context.
               | 
               | That being said what I think you're arguing for would be
               | the implementation of a public utility or private-public
               | business. If that's the case then yes, what you're saying
               | is correct.
        
             | visarga wrote:
             | > Google doesn't give anyone access to said cache.
             | 
             | It would also be useful for deep searches, exceeding the
             | 1000 result limit, empowering all sorts of NLP
             | applications.
        
             | lisper wrote:
             | I don't think you're quite clear on what the words "public"
             | and "private" mean. "Public" is not a synonym for "run by
             | the government" and "private" is not a synonym for "closed
             | to everyone but the owner". Restaurants, for example, are
             | generally open to the public, but they are not public. A
             | restaurant owner is, with a few exceptions, free to refuse
             | service to anyone at any time.
             | 
             | If it's "exactly the same as a public cache" then it's
             | public, even if it is managed by a private company. The
             | difference is not in who _has_ access, the difference is in
             | _who decides_ who has access.
        
               | sesuximo wrote:
               | Ok I am not clear then, but I'm less clear after your
               | comment! In a public cache, who would you want to decide
               | who has access? Is simply saying "anyone who pays has
               | access" enough to qualify as public? if so, then I agree
               | and this was my (possibly poorly phrased) intention in
               | the original comment.
               | 
               | But imo the restaurant model is also fine; in most cases
               | people have access and it works.
        
               | lisper wrote:
               | > Is simply saying "anyone who pays has access" enough to
               | qualify as public?
               | 
               | No because someone has to set the price, which is just an
               | indirect method of controlling who has access.
               | 
               | > the restaurant model is also fine
               | 
               | It works for restaurants because there is competition.
               | The whole point here is that web crawling/caching is a
               | natural monopoly.
               | 
               | A better analogy here would be Apple's app store, or
               | copyrighted standards with the force of law [1]. These
               | nominally follow the "anyone who pays has access" model
               | but they are not public, and the result is the same set
               | of problems.
               | 
               | [1] https://www.thebrandprotectionblog.com/public-laws-
               | private-s...
        
             | sct202 wrote:
              | You can pull Google search results via an API to make a
              | meta-search engine if you want, but it's like $5 / 1k
              | requests.
        
               | twiddlebits wrote:
               | Google's TOS prevents blending (alterations, etc.)
               | though.
        
             | [deleted]
        
       | ThePhysicist wrote:
       | On a related note, Cloudflare just introduced "Super Bot Fight
       | Mode" (https://blog.cloudflare.com/super-bot-fight-mode/) which
       | is basically a whitelisting approach that will block any
       | automated website crawling that doesn't originate from "good
       | bots" (they cite Google & Paypal as examples of such bots). So
       | basically everyone else is out of luck and will be tarpitted
       | (i.e. connections will get slower and slower until pages won't
       | load at all), presented with CAPTCHAs or outright blocked. In my
       | opinion this will turn the part of the web that Cloudflare
       | controls into a walled garden not unlike Twitter or Facebook: In
       | theory the content is "public", but if you want to interact with
       | it you have to do it on Cloudflare's terms. Quite sad really to
       | see this happen to the web.
        
         | judge2020 wrote:
         | On the other hand, I do not want my site to go down thanks to a
         | few bad 'crawlers' that fork() a thousand http requests every
         | second and take down my site, forcing me to do manual blocking
          | or pay for a bigger server / scale out my infrastructure. Why
         | should I have to serve them?
        
           | progval wrote:
           | You can use the same rate-limiting for all crawlers, Google
           | or not.
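Applied uniformly, that can be as simple as a per-client token bucket; a minimal sketch (production setups usually do this at the reverse proxy instead, e.g. nginx's `limit_req`):

```python
import time
from collections import defaultdict

class TokenBucket:
    """Per-client token bucket: every client, Googlebot or not, gets the
    same refill rate, so polite crawlers pass and floods get rejected."""

    def __init__(self, rate, capacity):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = defaultdict(lambda: float(capacity))
        self.last = defaultdict(time.monotonic)

    def allow(self, client_ip):
        """Return True if this client may make a request right now."""
        now = time.monotonic()
        elapsed = max(0.0, now - self.last[client_ip])
        self.last[client_ip] = now
        self.tokens[client_ip] = min(self.capacity,
                                     self.tokens[client_ip] + elapsed * self.rate)
        if self.tokens[client_ip] >= 1:
            self.tokens[client_ip] -= 1
            return True
        return False
```

Every client is refilled at the same rate, so a well-behaved crawler (Google's or anyone else's) passes while a burst of requests from one IP gets rejected.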
        
             | dodobirdlord wrote:
             | Googlebot is pretty careful and generally doesn't cause
             | these problems.
        
               | spijdar wrote:
                | Right, then they shouldn't be affected by the rate-
                | limiting, as long as it's reasonable. If it was applied
               | evenly to all clients/crawlers, it'd at least allow the
               | possibility for a respectful, well designed crawler to
               | compete.
        
               | jedberg wrote:
               | The problem is, if you own a website, it takes the same
               | amount of resources to handle the crawl from Google and
               | FooCrawler even if both are behaving, but I'm going to
               | get a lot more ROI out of letting Google crawl, so I'm
               | incentivized to block FooCrawler but not Google. In fact,
               | the ROI from Google is so high I'm incentivized to devote
               | _extra_ resources just for them to crawl faster.
        
         | TameAntelope wrote:
         | How hard is it to ask Cloudflare to let you crawl?
        
           | smarx007 wrote:
           | It's not Cloudflare who is deciding it. It's the website
           | owners who request things like "Super Bot Fight Mode". I
           | never enable such things on my CF properties. Mostly it's
           | people who manage websites with "valuable" content, e.g.
           | shops with prices who desperately want to stop scraping by
           | competitors.
        
             | f430 wrote:
              | I can say this will give a lot of businesses a false sense
              | of security. It is already bypassable.
             | 
              | The web-scraping technology that I am aware of has already
              | reached its end game: unless you are prepared to
              | authenticate every user/visitor to your website with a
              | dollar sign, or lobby Congress to pass a bill outlawing
              | web scraping, you will not be able to stop web scraping in
        
         | kristopolous wrote:
          | In the early 90s there were various nascent systems that were
          | essentially public database interfaces for searching.
         | 
         | The idea was that instead of a centralized search, people could
         | have fat clients that individually query these apis and then
         | aggregate the results on the client machine.
         | 
          | Essentially every query would be a what/where or what/who
          | pair. This would focus the results.
         | 
         | I really think we need to reboot those core ideas.
         | 
         | We have a manual version today. There's quite a few large
         | databases that the crawlers don't get.
         | 
         | The one place for everything approach has the same fundamental
         | problems that were pointed out 30 years ago, they've just
         | become obvious to everybody now.
        
         | grishka wrote:
         | So, one more reason to hate Cloudflare and every single website
         | that uses it.
        
           | jakear wrote:
           | Or maybe don't "hate" folks who are just trying to put some
           | content online and don't want to deal with botnets taking
           | down their work? You know, like what the internet was
           | intended for.
        
             | grishka wrote:
              | The internet was certainly _not_ intended for
              | centralization. I hit Cloudflare captchas and error pages
              | so often it's almost sickening. So many things are behind
              | Cloudflare, things you'd least expect to be behind
              | Cloudflare.
        
         | petercooper wrote:
         | I wonder what happens to RSS feeds in this situation. Programs
         | I run that process RSS feeds will just fetch them over HTTP
         | completely headlessly, so if there are any CAPTCHAs, I'm not
         | going to see them.
        
         | luckylion wrote:
         | That will be interesting to see with regards to legal
         | implications. If they (in the website operator's name) block
         | access to e.g. privacy info pages to a normal user "by
         | accident", that could be a compliance issue.
         | 
          | I don't think mass blocking is the right approach in general.
          | IPs, even residential ones, are relatively easy to get and
          | relatively cheap, so at some point you're blocking too many
          | normal users. Captchas are a strong weapon, but they too have a
         | significant cost by annoying the users. Cloudflare could
         | theoretically do invisible-invisible captchas by never even
         | running any code on the client, but that would be wholesale
         | tracking and would probably not fly in the EU.
        
       | dleslie wrote:
       | The idea of a public cache available to anyone who wishes to
       | index it is ... kind of compelling.
       | 
        | If it was the only indexer allowed, and it was publicly
       | governed, then enforcing changes to regulation would be a lot
       | more straightforward. Imagine if indexing public social media
       | profiles was deemed unacceptable, and within days that content
       | disappeared from all search engines.
       | 
       | I don't think it'll ever happen, but it's interesting to think
       | about.
        
         | tlibert wrote:
          | So outlaw web scraping entirely?
        
         | simantel wrote:
         | Common Crawl is attempting to offer this as a non-profit:
         | https://commoncrawl.org
        
           | jackson1442 wrote:
           | o/t but what the hell are they doing to scroll on that page?
           | I move my fingers a centimeter on my trackpad and the page is
           | already scrolled all the way to the bottom.
           | 
           | Hijacking scroll like this is one of the biggest turnoffs a
           | website can have for me, up there with being plastered with
           | ads and crap. It's ok imo in the context of doing some flashy
           | branding stuff (think Google Pixel, Tesla splashes) but
           | contentful pages shouldn't ever do this.
        
             | aembleton wrote:
             | Add *##+js(aeld, scroll) to your uBO filters. That will
             | stop scroll JS for all websites.
        
         | xtracto wrote:
         | That would be a very cool use case for something like STORJ or
         | IPFS.
        
         | ricardo81 wrote:
         | An alternative but similar idea, apply your own algorithms to a
         | crawler/index. That's half the problem with these large
         | platforms commanding the majority of eyeballs, you search the
         | entire web for something and you get results back via a black
         | box. Alternatives in general are most definitely a good thing.
         | 
         | Knuckleheads' Club at the very least are doing a great job of
         | raising awareness and the potential barriers to entry for
         | alternatives.
        
         | ISL wrote:
         | Imagine if Donald Trump decided that indexing Joe Biden's
         | campaign site was unacceptable.
         | 
         | A mandated singular public cache has potential slippery slopes.
        
           | whimsicalism wrote:
           | Imagine if Donald Trump decided to tax campaign donations to
           | Joe Biden's campaign at 100%.
           | 
           | I am unconvinced by the "slippery slope" argument being
           | deployed by default to any governmental attempt to combat
           | tech monopolies.
        
             | ISL wrote:
             | This is an argument against centralization more than it is
             | against government.
             | 
             | "One index to rule them all" seems more fraught with
             | difficulty than, "large cloud providers are unhappy that
             | crawlers on the open web are crawling the open web".
        
               | whimsicalism wrote:
               | If the impact stopped at "large cloud providers" being
               | unhappy, I think that you're correct. But I think we've
               | seen considerably downstream "difficulty" for the rest of
               | society from search essentially being consolidated into
               | one private actor.
        
           | passivate wrote:
           | >A mandated singular public cache has potential slippery
           | slopes.
           | 
           | That may be, but it seems like everything has a slippery
           | slope - if the wrong person gets into power, or if the public
           | look the other way/complacence/ignorance/indifference, etc,
           | etc. It shouldn't stop us evaluating choices on their merits,
           | and there is a lot of merit to entrusting 'core
           | infrastructure' type entities to the government - or at-least
           | having an option.
        
         | drivingmenuts wrote:
          | > If it was the only indexer allowed, and it was publicly
         | governed
         | 
         | Which would put it under government regulation and be forever
         | mired in politics over what was moral, immoral, ethical or
         | unethical and all other kerfuffle. To an extent, it's already
         | that way, but that would make it worse than it is currently.
        
         | hackeraccount wrote:
         | I'd have to look more but maybe running a cache isn't dead
         | simple. I can imagine that the benefits of manipulating what's
         | in the cache either adding or removing would be very high.
         | Google and the others are private companies so they're not
         | required to do everything in the public view.
         | 
          | A public cache wouldn't be able (indeed shouldn't) to play
         | cat and mouse games with potential opponents. I suspect most of
         | the games played require not explaining exactly what you're
         | doing.
        
         | sixdimensional wrote:
         | Here's an idea... what if search became a peer-to-peer
         | standardized protocol that is part of the stack to complement
         | DNS? E.g. instead of using DNS as the primary entry point, you
         | use a different protocol at that level to do "distributed
         | search". DNS would still play a role too, but if "search" was a
         | core protocol, the entry point for most people would be
         | different.
         | 
         | Similar to some of the concepts of "Linked Data", maybe -
         | https://en.wikipedia.org/wiki/Linked_data.
         | 
         | The problem is getting to a standard, it would essentially need
         | to be federated search so a standard would have to be
         | established (de facto most likely).
         | 
         | Also, indexes and storage, distribution of processing load..
         | peer-to-peer search is already a thing, but it doesn't seem to
         | be a core function of the Internet.
         | 
         | This is basically the same concept as making an "open" version
         | of something that is "closed" in order to compete, I guess.
        
       | rezonant wrote:
       | > Let's take a look at the robots.txt for census.gov from October
       | of 2018 as a specific example to see how robots.txt files
       | typically work. This document is a good example of a common
       | pattern. The first two lines of the file specify that you cannot
       | crawl census.gov unless given explicit permission.
       | 
       | This was eyebrow-raising. Actually seems like this is not (any
       | longer?) true:
       | 
       | https://census.gov/robots.txt:
       | 
       | User-agent: *
       | 
       | User-agent: W3C-checklink
       | 
       | Disallow: /cgi-bin/
       | 
       | Disallow: /libs/
       | 
       | ...
       | 
       | That first line wildcards for any user agent but does nothing
        | with it. It should say "Disallow: /" on the next line if it
       | blocked all unnamed robots. It looks like someone found out about
       | it and told the operators, rightfully so, that government
       | webpages with public information (especially the census)
       | shouldn't have such restrictions. They then removed only the
       | second line and left the first. Leaving the first line has no
       | impact on the meaning of the file.
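For comparison, a robots.txt that actually did what the article claims would pair the wildcard with a Disallow rule, relying on standard robots.txt semantics: a crawler obeys the most specific matching User-agent group, and an empty Disallow permits everything.

```text
# Block every crawler by default...
User-agent: *
Disallow: /

# ...then allow specific crawlers by giving them their own group
# (an empty Disallow permits the whole site).
User-agent: Googlebot
Disallow:
```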
        
       | EGreg wrote:
       | Or use MaidSAFE where you get paid to serve your website as
       | opposed to the other way around.
        
       | hannob wrote:
       | I have seen sites behave differently if you use a Googlebot UA,
       | but am I missing something or does this merely mean that anyone
       | doing something like this
       | 
        | curl -A 'Mozilla/5.0 (compatible; Googlebot/2.1;
        | +http://www.google.com/bot.html)' https://example.com/
       | 
       | will get Google-level crawler access?
        
         | kirubakaran wrote:
          | That would work on websites that have a naive check for just
          | the user agent. But Google also publishes the IP address
          | ranges their crawlers run on; a lot of websites check for
          | that, and there's no way around it.
         | 
         | https://developers.google.com/search/docs/advanced/crawling/...
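The check Google documents is a reverse-DNS lookup on the connecting IP followed by a forward lookup to confirm it maps back; a stdlib-only sketch, assuming the documented googlebot.com / google.com domains:

```python
import socket

GOOGLEBOT_DOMAINS = (".googlebot.com", ".google.com")

def hostname_is_googlebot(hostname: str) -> bool:
    """Check a reverse-DNS name against Google's documented crawler domains."""
    return hostname.rstrip(".").endswith(GOOGLEBOT_DOMAINS)

def verify_googlebot(ip: str) -> bool:
    """Double DNS lookup: reverse-resolve the IP, check the domain,
    then forward-resolve the hostname and confirm it maps back to the IP."""
    try:
        hostname, _aliases, _addrs = socket.gethostbyaddr(ip)
    except OSError:
        return False
    if not hostname_is_googlebot(hostname):
        return False
    try:
        return ip in socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return False
```

UA spoofing fails this check because the spoofer's IP doesn't reverse-resolve into one of those domains.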
        
       | AlphaWeaver wrote:
       | This "club" charges a membership fee of $10 a month (or $100 a
       | year) to comment.
       | 
       | Does this go to some sort of nonprofit or holding entity that's
       | governed by its members? Or do people have to trust the owner?
        
       | mancerayder wrote:
       | Any word on or opinions about Brave's initiative to challenge
       | search?
        
       | ChrisArchitect wrote:
       | dupe/posted earlier etc
       | 
       | I also got confused about this page as there's another project of
       | theirs around right now about RIP Google Reader that's on a
        | separate domain...
       | 
        | Funny that a site that's all about Google this and that doesn't
        | have clear URLs/pages for its articles that can be linked to
        | easily, geez
       | 
       | Original post/discussion from the source, 3 months ago:
       | https://news.ycombinator.com/item?id=25417067
        
         | slenk wrote:
         | https://knuckleheads.club/introduction/
         | 
         | That seems like an easy link?
        
       ___________________________________________________________________
       (page generated 2021-03-26 23:00 UTC)