[HN Gopher] Only Google is really allowed to crawl the web ___________________________________________________________________ Only Google is really allowed to crawl the web Author : skinkestek Score : 711 points Date : 2021-03-26 14:34 UTC (8 hours ago) (HTM) web link (knuckleheads.club) (TXT) w3m dump (knuckleheads.club) | graiz wrote: | Not sure why http://commoncrawl.org/ wasn't mentioned. | dclaw wrote: | I can't really trust a website that spells its own name wrong on | their homepage. "Knucklesheads' Club" | | Edit: https://imgur.com/a/inqYrjV | slenk wrote: | Everyone makes mistakes | sgsvnk wrote: | Money earns more money. Privilege begets more privilege. | | This is not just true in the case of Google but in other | domains as well, like the financial markets. | | Would you blame capitalism? | wunderflix wrote: | Even that won't change much. There is no way Google can be | out-googled by other search engines because of its market dominance: | more traffic means more clicks, more clicks mean better search | results, better search results will drive more traffic. | | I try bing and DDG for a week or so every 6 months. I always | switch back to google eventually because the results are so much | better. | | Google can only be disrupted if something new is invented, | something different than search but delivering way better | results. I have no clue what that might be. But I hope someone is | working on it. | internetslave wrote: | Yup. My opinion has long been that the only thing that will | take down google is a massive increase in NLP, such that the | historical click data can be outperformed by a straight up | really good NLP model | wunderflix wrote: | That's interesting. Is anyone working on this already? SV | startup? And: don't you think Google is in the best position | to build such a thing? | Zelphyr wrote: | I've had the exact opposite reaction to the comparison between | Google and DuckDuckGo.
I use the latter daily and only rarely | revert to Google. Even then I usually don't find the results to | be any better and often find them to be worse. | | In my estimation, Google's search results have significantly | declined in recent years. | rstupek wrote: | Agreed. I've fully changed over to DDG on my phone and rarely | add the !g to get a google search. | wunderflix wrote: | Ha, maybe I should give it a try again :) My 6 months period | is almost over again. | jrockway wrote: | I think there are plenty of other people crawling the web. | There's Common Crawl, there's the Wayback machine... it's not | just Google. Then there is a very long tail of crawlers that show | up in the logs for my small-potatoes personal website. Whatever | they're doing, they seem to be existing in peace, at the very | least. | | To some extent, I agree with this site that people are nicer to | Google than other crawlers. That's because the crawl consumes | their resources but provides benefits -- you show up on Google, | the only search engine people actually use. But at the same time, | they are happy to drag Google in front of Congress for some | general abuse, so... maybe there is actually a little bit of | balance there. | anonu wrote: | > There Should Be A Public Cache Of The Web | | This might be closest to it: https://commoncrawl.org/ | lawwantsin17 wrote: | I'm all for killing Google's monopoly but spiders can ignore | robots.txt you know. This just seems like a failure of other | companies to effectively ignore those. | jeelecali wrote: | I'm looking for $ 576 | villgax wrote: | The irony is that they bitch about you not scraping search or | other platforms without paid plans & want to do the same to you | ajcp wrote: | They really missed an opportunity to get creative with their own | `robots.txt` implementation. | nova22033 wrote: | _This isn't illegal and it isn't Google's fault_ | | Right there in the article.. | WarOnPrivacy wrote: | Again, with critical context. 
| | _This isn't illegal and it isn't Google's fault, but this | monopoly on web crawling that has naturally emerged prevents | any other company from being able to effectively compete with | Google in the search engine market._ | tyingq wrote: | The bigger problem, to me, is not around crawling. It's the | asymmetrical power Google has after crawling. | | Google is obviously on a mission to keep people on Google owned | properties. So, they take what they crawl and find a way to | present that to the end user without anyone needing to visit the | place that data came from. | | Airlines are a good example. If you search for flight status for | a particular flight, Google presents that flight status in a box. | As an end user, that's great. However, that sort of search used | to (most times) lead to a visit to the airline web site. | | The airline web site could then present things Google can't do. | Like "hey, we see you haven't checked in yet" or "TSA wait times | are longer than usual" or "We have a more-legroom seat upgrade if | you want it". | | Google took those eyeballs away. Okay, fine, that's their choice. | But they don't give anything back, which removes incentives from | the actual source to do things better. | | You see this recently with Wikipedia. Google's widgets have been | reducing traffic to Wikipedia pretty dramatically. Enough so that | Wikipedia is now pushing back with a product that the Googles of | the world will have to pay for. | | In short, I don't think the crawler is the problem. And I don't | think Google will realize what the problem is until they start | accidentally killing off large swaths of the actual sources of | this content by taking the audience away. | bouncycastle wrote: | In regards to airlines, Google and Amadeus have a partnership I | believe. Amadeus is the main source of data for many of these | airline websites. 
If Google gets the data from Amadeus directly | and not these websites, they are just cutting out the | middleman. I don't shed a tear for any of these middlemen | (together with their Dark Pattern UX design). | tyingq wrote: | Amadeus isn't a source of flight status. It is a source for | (some) planned schedules and fares. Global distribution | systems are a complex topic that's hard to sum up on HN. For | flight status, Google is pulling from OAG and FlightAware, | and also from airline websites. Though they don't show | airline sites as a source. | dan-robertson wrote: | The way to look at this from Google's point of view is to | realise that most websites are slow and bad[1], so if Google | sent you there you would have a bad experience with a bad slow | website trying to find the information you want. Google want to | make it better for you. | | [1] it feels like Google have contributed a lot to websites | being slow and bad with eg ads, amp, angular, and probably more | things for the other 25 letters of the alphabet. | [deleted] | zentiggr wrote: | > Google want to make it better for you. | | Hehe, sure, nothing nefarious or greedy here... move along, | move along, nothing to see... | supert56 wrote: | Perhaps I am misunderstanding or oversimplifying things but it | always surprises me that there are legal cases brought against | companies who scrape data when so many of Google's products are | doing exactly this. | | It definitely feels like one set of rules for them and a | different set for everyone else. | lupire wrote: | Google doesn't scrape anything that the site owner objects | to. | Spivak wrote: | I mean it's not that weird that a company would authorize | major search engines scraping them but no one else. | | I don't really see this as Google playing by different rules | so much as economic incentives being aligned in Google's | favor. | 838812052807016 wrote: | Standardized interoperability enables overall progress.
| | Every airline doesn't need their own webpage. They could all | provide a standard API. | lelanthran wrote: | > And I don't think Google will realize what the problem is | until they start accidentally killing off large swaths of the | actual sources of this content by taking the audience away. | | What makes you think they care? Killing off the sources of | content might even be their goal. If they kill off sources of | content, they'd be more than happy to create an easier-to-datamine | replacement. | | Hypothetically, if they killed off Wikipedia, they are best | placed to use the actual Wikipedia content[1] in a replacement, | which they can use for more intrusive data-mining. | | Google sells eyeballs to advertisers; being the source of all | content makes them more money from advertisers while making it | cheaper to acquire each eyeball. | | [1] AFAIK, Wikipedia content is free to reuse. | ilaksh wrote: | The way that the web has been fundamentally broken by Google | and other companies is one of the reasons I am excited about an | alternative protocol called Gemini. It doesn't replace the web | entirely, but for basic things like exchanging information, | it's great. https://gemini.circumlunar.space/ | treis wrote: | >However, that sort of search used to (most times) lead to a | visit to the airline web site. | | I don't think that's correct. In the old days you'd either call | a travel agent or use an aggregator like Expedia. | | Google muscles out intermediaries like Expedia, Yelp, and so | on. It's likely not much better or worse for the end user or | supplier. Just swapping one middleman for another. | darkwater wrote: | It's actually pretty different, because another middleman can | basically arise only if it's a big success in the iOS App | Store, since coming up in Google searches would be | impossible, and it's more or less the same in the Play Store. So, | Google is not just yet another intermediary.
| tyingq wrote: | I can't prove it was that way, but I spent a lot of time in | the space. For a long time, the airline's site used to be the | top organic result, and there was no widget. Similar for | other travel-related searches (not just airlines) over time. | Google has been pushing down organic results in favor of ads | and widgets for a long time...and slowly, one little thing at | a time. Like no widgets -> small widget below first organic | result -> move the widget up -> make it bigger -> etc. | supernovae wrote: | I don't think google muscling out intermediaries like Expedia | is a good thing. | | Just for example, Expedia is probably 5% of Google's total | revenue, and by and large Google doesn't like slim-margin | services that can't be automated. | | Travel is fairly high-touch - people-centric. It doesn't fit | Google's "MO". | | But... it's shitty that google can play all sides of the | markets while holding people ransom for massive sums of money to | pay to play on PPC where google doesn't... i think that's | where the problem shines. | | In essence, you're advocating that eBay goes away because | google could do it... they could.. and eBay is technically | just an intermediary, but do we want everything to be | googlefied? | | Google bought up/destroyed other aggregators - remember the | days of fatwallet, priceline, pricewatch, shopzilla and such | when they used to focus on discounts/coupons/deals and now | they're moving more towards rewards/shopping/experience - it | used to be i could do PPC on pricewatch and reach millions of | shoppers at a reasonable rate, but now that google destroyed | them all, the PPC rate on "goods" is absurdly high and not | having an affordable market means only the amazons and | walmarts can really afford to play... | | it used to be you could niche out, but even then, that's | getting harder | treis wrote: | >In essence, you're advocating that eBay goes away because | google could do it... they could..
and eBay is technically | just an intermediary, but do we want everything to be | googlefied? | | I don't think I'm really advocating for it as much as I see it | as a more or less neutral change. | | That said, I'm pretty ambivalent about Google. Their size | is a concern, but they also tend to be pretty low on the | dark pattern nonsense. eBay, to use an example you gave, | screwed me out of some buyer protection because of poor UX | and/or a bug (I never saw the option to claim my money after | the seller didn't respond). In this specific instance | Google ends the process by sending you to the airline to | complete the booking. That, imho, is likely better than | dealing with Expedia. | supernovae wrote: | Companies opt in to sites like Expedia and list their | properties/flights/vacations on their marketplace and | they pay a commission for those being booked. Expedia | doesn't just crawl them and demand a royalty for sending | them traffic... | | Google has a huge pay 2 play problem with PPC... i've | worked for Expedia so that's the only reason i know this | :) | | It's the reason companies work with Expedia many times | because they don't have the leverage Expedia Group | does... | | i see it as unnatural change btw... "borg" if you will. | josefx wrote: | Only if Google stays around long term. I wouldn't be | surprised if each free product on its graveyard took down a | dozen competing products before it was killed off. | pc86 wrote: | Then someone can start a competitor up again, right? | Assuming there's actually a market for it. | josefx wrote: | Not every market is lucrative in the extreme and it can | take a long time to recover from being "disrupted". I | think it is also a common practice for larger shopping | chains to dump prices when they open a new location in | order to clear out the local competition, so the damage | it causes is well understood to be long-lasting.
| devoutsalsa wrote: | I've noticed that sometimes Google has updated flight | information before the displays at the airport do. | tyingq wrote: | For the most part individual airports own that | infrastructure. So it's hard to generalize. For most types of | notable flight status/time changes, however, airlines usually | know first. | | There are exceptions, like an airport-called ground stop. | magicalist wrote: | > _You see this recently with Wikipedia. Google's widgets have | been reducing traffic to Wikipedia pretty dramatically._ | | Wikipedia visitors, edits, and revenue are all increasing, and | the rate that they're increasing is increasing, at least in the | last few years. Is this a claim about the third derivative? | | > _Enough so that Wikipedia is now pushing back with a product | that the Googles of the world will have to pay for._ | | The Wikimedia Enterprise thing seems like it has nothing to do | with missing visitors; rather, companies ingesting raw | Wikipedia edits are an opportunity for diversifying revenue by | offering paid structured APIs and service contracts. Kind of | the traditional Red Hat approach to revenue in open source: | https://meta.m.wikimedia.org/wiki/Wikimedia_Enterprise | tyingq wrote: | See https://searchengineland.com/wikipedia-confirms-they-are-ste... | from 2015. Google's widgets that present Wikipedia | data do reduce visitors to Wikipedia. | | Or see page views on English Wikipedia from 2016-current: | https://stats.wikimedia.org/#/en.wikipedia.org/reading/total... | Looks pretty flat, right? Does that seem normal? | | As for Wikimedia Enterprise, you do have to read between the | lines a bit. _" The focus is on organizations that want to | repurpose Wikimedia content in other contexts, providing data | services at a large scale"_. | SamBam wrote: | The first link doesn't seem quite conclusive (see the part | at the bottom), and also doesn't give evidence that | Google's widgets are to blame.
| | The flattening of users could also be due to a general | internet-wide reduction in long-form (or even medium-form) | non-fiction reading. How are page views for The New York | Times? | | Seems like it should be simple to A/B test, though. | Obviously Google could do it themselves by randomly taking | away the widget, but we could also see whether referrals | from non-Google search engines (though they are themselves | a tiny percentage) continue to increase while Google | remains flat. | tyingq wrote: | Edit: Removed bad "Simple English graph", thanks. Though | the regular English Wikipedia traffic is flat from | 2016-present. | | As for NYT, is there a better proxy to compare to? | There are no public pageview stats and they have a paywall. | magicalist wrote: | That first graph is Simple English, not English, and is | in millions, not billions. They also explicitly call out | the methodology change in 2015... | JKCalhoun wrote: | > In short, I don't think the crawler is the problem. | | Except that if you allow other companies to crawl/compete, you | can take eyeballs away from Google (which may well then return | eyeballs to Wikipedia so long as the Google competitors don't | also present scraped data). | [deleted] | benatkin wrote: | That's the result of the crawling, and of it preventing | competition. Google would much prefer that people complain | about the details while ignoring the root cause. | tyingq wrote: | I don't understand that. The crawling access is mostly the | same as it ever was. Google's SERP pages are not. A mutually | beneficial search engine that respects its sources would | still crawl the same. | | The core problem is incentives: | http://infolab.stanford.edu/~backrub/google.html _" we | believe the issue of advertising causes enough mixed | incentives that it is crucial to have a competitive search | engine that is transparent and in the academic realm."_ | [deleted] | benatkin wrote: | That's incorrect.
Before the search oligopolies formed, new | search engines could start up. There were Excite, HotBot, | AltaVista, and more. Now they don't have access. Search | these comments for census.gov. | tyingq wrote: | There are companies that do pretty well in this space, | like ahrefs, for example. They do resort to trickery, | like proxy clients that look like home computers or cell | phones. But, if a small entity like ahrefs can do it, | anyone can do it. | | In a nutshell, though, I don't see equal access for all | crawlers changing anything. Maybe that's the first | barrier they hit, but it isn't the biggest or hardest one | by far. Bing has good crawler access, but shit market | share. | [deleted] | dr-detroit wrote: | So nobody is going to book air travel? I can't hardly follow | what you're even saying besides google=bad. | veltas wrote: | I swear something like 50% of those digests are totally | incorrect as well. It's amazing they have kept the feature | because it has never had a very high signal-to-noise ratio. I | never trust what's presented in these digests without | double-checking the source page. | bombcar wrote: | Have you heard the story of Thomas Running? It's a story | Google will tell you. | | (Search who invented running) | tyingq wrote: | I remember when rich snippets (one type of those widgets) | came out there were a lot of funny examples. One for a common | query about cancer treatments that pulled data from a dodgy | holistic site saying that "carrots cured most types of | cancer" (or something like that). | | There was a similar one where Google emphatically claimed a | US quarter was worth five cents in a pretty and large snippet | graphic. | Mauricebranagh wrote: | I recall in the last UK election Google got the infographic | of party leaders about 60-70% wrong. | | And quite often a "People also ask" refinement is just some | random guy's comment from Reddit.
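A premise running through this thread - that robots.txt keeps out only the crawlers that choose to obey it, and that sites can (and some do) whitelist Googlebot while disallowing everyone else - can be sketched concretely. Below is a minimal example using Python's standard urllib.robotparser; the robots.txt rules and the "UpstartBot" agent name are invented for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt of a site that whitelists only Googlebot.
rules = [
    "User-agent: Googlebot",
    "Disallow:",        # an empty Disallow means "allow everything"
    "",
    "User-agent: *",
    "Disallow: /",      # every other crawler gets nothing
]

rp = RobotFileParser()
rp.parse(rules)

# Googlebot may fetch anything; a would-be competitor may fetch nothing.
print(rp.can_fetch("Googlebot", "https://example.com/any/page"))   # True
print(rp.can_fetch("UpstartBot", "https://example.com/any/page"))  # False
```

The answer is purely advisory: a crawler that never performs this check will fetch the page anyway, which is the point made upthread - robots.txt is a convention, not an access control.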
| BeFlatXIII wrote: | The most memorable rich snippet humor I've seen is a horse | breeder sharing a story of how her searches gave snippets | with My Little Ponies as the preview image. | gtm1260 wrote: | I'm not sure I agree with this. I think airline websites are so | garbage-filled that they've driven people to use the simple | alternative of the Google Flights checkout. | | It's a bit of a vicious cycle, but in general most websites are | so chock-full of crap that not having to click into them | for real is a relief! | gxs wrote: | It's not Google's prerogative to scrape a website and display | its content, no matter how awful the website. | michaelmrose wrote: | If one airline let me view information in a friendly fashion | and the other didn't, I would do business with the first. | | Lest we forget, the money in that scenario is from butts in | seats, not clicks on a website. The particular example is | ill-chosen, as Google is actually taking on a cost, taking | nothing, and gifting the airline a better UI. | BEEdwards wrote: | If you make an awful website that can be scraped, it's a | matter of when, not if, someone will take your data and give | it to your consumers, whether you're trying to upsell them or | not... | cyberpunk wrote: | BA had some tracking request inline on the "payment | processing" page which, when blocked by my pihole, prevents me | from ever getting to the confirmation page; just have to | refresh your email and wait for the best. | | I have no idea how these companies, which make quite a decent | amount of money at least up until 2020, can have such utterly | poor sites. | | I once counted some 20+ redirects on a single request during | this process heh.. | bombcar wrote: | I don't know what they're doing but most every single sign-on | tool I've seen redirects 10-20 times during the sign-on | process (and then dumps you to the homepage to navigate | your way back).
| merlinscholz wrote: | Probably to get first-party cookies on a handful of | domains | tyingq wrote: | I'm talking about flight status. Not Google Flights, | shopping, or booking. | | There are events associated with flight status that Google | doesn't know. Like change fee waivers, cash comp awards to | take a later or earlier flight, seat upgrades, etc. | creato wrote: | Yeah, the Google Flights issue is difficult. On one hand, the | business practice is problematic. On the other hand, Google | Flights is _so_ much better than its competitors it's | ridiculous. | | If there was a way to split Google Flights into a separate | company and somehow ensure it wouldn't devolve into absolute | trash like its competitors, that would be a good thing. | tyingq wrote: | It was ITA, and prior to Google buying them, it did a pretty | good business selling backend flight shopping services to | aggregators and airlines. | | Shopping for flights is a surprisingly technically | difficult thing to do well. | ChrisArchitect wrote: | They're making it easier to search for flights and arrange a | trip. It's UX and makes me not hate the airlines/travel process | as much. And I end up buying the flight from the airline | anyways, and in many cases doing the arranging on the airline | site in the end once it's determined, so Google is giving that | back. They're not taking stuff from the airlines; I mean, what | ads and stuff are on the airline sites anyways, specifically | during the search process? Where they are taking away is from | the Expedias and other aggregation sites that offer a | garbage/hodgepodge experience that drives people crazy. | tyingq wrote: | You're talking about Google Flights, which is completely | unrelated to flight status.
| throwaway_kufu wrote: | They are not just taking away internet traffic; in the | flights example, they actually acquired a flight/travel | aggregator, and so they are actually entering markets | and competing with their own ad customers. | | Then it comes full circle to Google unfairly using their | market position vis-a-vis data, search and advertising. It's a | win-win: Google lets the data dictate which markets to enter, and | they can both jack up advertising fees on | customers/competitors and unfairly build their own service into | search above both ads and organic results. | danielscrubs wrote: | Be careful when using Google Flights; last time I checked they | use significantly smaller margins between flights, so trips are | shorter but much riskier. | aetherane wrote: | You can get screwed any time you book a connecting flight on | two different airlines even if the times aren't tight. For | instance if one is cancelled. | | If you use the same airline they will make sure you get to | the destination. | HenryBemis wrote: | > even if the times aren't tight | | Depending on the definition of "tight" each of us has. I | remember having 40mins in Munich, and that is a BIG | airport. Especially if you disembark on one side of the | terminal and your flight is on the far/opposite end. | That's 25-30mins brisk walking. With 5000 people in-between | you could well miss your flight. No discussion | about stopping to get a coffee or a snack.. you'll miss | your flight. | matwood wrote: | That's true, but it can save you a ton of money. You just | have to be aware of the risks and plan accordingly. | | I have typically used this strategy when flying back to | the US from the EU. Take an EZJet or similar low-cost | airline from a random small EU city to a larger EU city | like Paris, London, Frankfurt, etc... and book the return | trip to the US from the larger city.
I've also been | forced to do this from some EU cities since there was no | connecting partner with a US airline. | hodgesrm wrote: | The difference is mind-boggling in some cases. On one | trip in 2019 I had the following coach fare choices for | SFO - Moscow return trip tickets booked 3 weeks prior to | departure. | | * UA or Lufthansa round trip (single carrier) $3K | | * UA round trip SFO - Paris + Aeroflot round trip Paris - | Moscow: $1K | | No amount of search could reduce the gap. I went with the | second option. The gap is even bigger if you have a route | with multiple segments. | throwaway1777 wrote: | Yeah this strategy is good, but you need to allow a long | layover, like 6 hours, if you have to go through | immigration and change airports for the connection, which | happens pretty often with ryanair and ezjet. It's a big | pain, but it does save money. | cbenneh wrote: | If you're booking each leg with a different carrier, I find | it best to pay the little extra with kiwi.com and they | give you a guarantee for the connection. I missed a | connection twice and they always got me on the next | flight to the destination for free. | slymon99 wrote: | Can you elaborate on this? Do you mean shorter layovers? | bombcar wrote: | It sounds like it - and third-party companies will often | show you flights that involve different companies on the | different legs - which can leave you in a pickle because | technically each airline's job is to get you to the end | of THEIR flight, not the entire journey. | Scoundreller wrote: | And sometimes with a change of airport! | foepys wrote: | I remember when in Germany some budget airlines used to | say they'd fly to "Frankfurt" (FRA) but actually flew to | "Frankfurt-Hahn" (HHN) - 115km away. After arrival in HHN | they put you on a bus to FRA that took about 2 hours. | SV_BubbleTime wrote: | Oh don't worry, you have 15 on-paper minutes to go from A1 | to A70 in Detroit... in January... and the shuttle is down.
| marshmallow_12 wrote: | Aren't there anti-trust laws to prevent this kind of thing? | sangnoir wrote: | The current anti-trust doctrine in the US has a goal of | protecting _consumers_ - not competition. What Google is | doing is arguably great for consumers but awful to their | competitors/other organizations. Technically, companies | can simply block Google using robots.txt - but in reality | that will lose them more money than the current partial | disintermediation by Google is costing them - and Google | knows this. | | It's a tall order to convince the courts that Google's | actions harm consumers, or are illegal: after all, being | innovative in ways that may end up hurting the competition | is a key feature of a capitalist society - _proving_ that a | line has been crossed is really hard, by design. | speeder wrote: | consumers are in this case the advertisers. | | google has a monopoly on search ads and does enforce it, | being a drain on the economy since in many fields you | only succeed if you spend on search ads | sangnoir wrote: | > consumers are in this case the advertisers. | | If someone could convince the courts that this is | correct, then I'm sure Google would lose. However, I bet | dollars to donuts Google's counter-argument would be | that the people doing the searching and quickly finding | information are also consumers, and they outnumber | advertisers and may be harmed by any proposed remediation | in favor of advertisers. | basch wrote: | Google's answer to this at yesterday's hearing: | | Search isn't a single category. If you break it down, they | aren't a monopoly. For example, 1/2 of PRODUCT SEARCHES | begin on Amazon. It's probably hard to argue Google is a | monopoly if the company they see as their main competitor has | half the market share. | supernovae wrote: | Just tell people to stop using google. Go direct.
| zentiggr wrote: | Upvoted - regardless of how pointless some people might | think this comment is, it really is the ONLY way that | Google is going to drop out of its aggregate lead | position. | | Enough people realizing Google is trapping and | cannibalizing traffic to the other sites it feeds off of, | and choosing to do other things EXCEPT touching Google | properties, is THE ONLY way they'll be unseated. | | No clear legal path to stop a bully means it's an ethical | / habit path. | | Not saying there's any easy way, just that this is it. | midoBB wrote: | Anti-trust enforcement in the US tends not to hit the big tech | players as hard as it does other sectors. Also, there is actually a | debate in the judicial system about the extent of anti-trust | laws themselves. | Majromax wrote: | Antitrust laws are hard to enforce in the United States. | | Monopolies themselves aren't illegal. To be convicted of an | antitrust violation, a firm needs both to have a monopoly | and to be using anticompetitive means to maintain | that monopoly. The recent "textbook" example was of | Microsoft, which in the 90s used its dominant position to | charge computer manufacturers for a Windows license for | each computer sold, regardless of whether it had Windows | installed or was a "bare" PC. | | Depending on how you define the market, Google may not even | have a monopoly. It's probably dominant enough in web | search to count, but if you look at its advertising network | it competes with Facebook and other ad networks. In the | realm of travel planning (to pick an example from these | comments), it's barely a blip. | | Furthermore, Google can potentially argue it's not being | anticompetitive: all businesses use their existing data to | optimize new products, so Google could claim that it _not_ | doing so would be an artificial straitjacket. | twiddlebits wrote: | It's got a monopoly on "search ads" by far.
| arrosenberg wrote: | It's not that hard; we're just out of practice due to the | absurd Borkist economic theories we've been operating | under for 40+ years. The laws are all there if the head | of the DOJ antitrust division has the gumption to go | reverse some bad precedents. | | > In the realm of travel planning (to pick an example | from these comments), it's barely a blip. | | They used their monopoly in web search to gain | non-negligible market share in an entirely unrelated industry. | That's textbook anti-competitive behavior. | | Google can argue whatever they want, but the argument | that they're enabling other businesses is a bad one. It | casts Google as a private regulator of the economy, which | is exactly what antitrust laws are intended to deal with. | pmiller2 wrote: | Is web search even a "market" independent of ads? | rijoja wrote: | yes | pmiller2 wrote: | Where's the money? | samuelizdat wrote: | That depends, would Google let us know? | rijoja wrote: | not if they could avoid it | adamcstephens wrote: | Yes, but they lack enforcement. | kingo55 wrote: | Even before it gets to that point, they routinely display | snippets off regular websites and show ads next to them. | | Keeping users from clicking through to organic results helps | them generate more revenue. | jeffbee wrote: | You're wrong on a lot of facts here. Google Flights doesn't get | its data just by crawling, they get it from Sabre, the FAA, | Eurocontrol, etc. Airlines are, obviously, extremely pleased to | disseminate this information. Google Flights "gives back" in | the exact same way as any other travel outlet: they book | passengers. | | As for Wikipedia, the WMF is quite happy that most of their | traffic is now served by Google. WMF is in the business of | distributing knowledge, not in the eyeballs business. Serving | traffic is just a cost for them.
The main problem has been that | the average cost for Wikipedia to serve a page has gone up, | because many readers read it via Google, and more people who | visit Wikipedia are logged-in authors, which costs them more to | serve. I'm sure there's an easy solution to this problem (for | example, beneficiaries of Wikipedia can donate compute | facilities and services, or something along those lines). | tyingq wrote: | They don't get individual flight status (what I was talking | about) from Sabre or the FAA or Eurocontrol. I didn't get | into fares and planned schedules and Google Flights, that's a | different topic. I was talking about the big widget you get | for queries on status for a particular flight, which is not | Google Flights. | | They have relented in some ways, rolling out stuff in the | widget like: _" The airline has issued a change fee waiver | for this flight. See what options are available on American's | website"_ | | But obviously, that kind of stuff isn't shown on Google for | quite some time after it exists on the source site. And the | widget pushes the organics off the fold unless you have a | huge monitor. | | As for Wikipedia, I was referring to this: | https://news.ycombinator.com/item?id=26487993 | | _" Airlines are, obviously, extremely pleased to disseminate | this information"_ | | In the same way that publishers love AMP, yes. They don't | actually like it, but they are forced to make the best of it. | jeffbee wrote: | Oh, status. I was thinking of schedules. Still, what is the | point for the consumer of being directed to an airline's | terrible status page? And are they even capable of being | crawled? Looking at American's site (it was the most | ghastly airline that sprang to mind) I don't see how a | crawler would be able to deal with it, and indeed the | Google snippet for AA flight status, on the aa.com result | which is far down in the results page, just says "aa.com | uses cookies" which is about what you'd expect. 
| | In this case, I want to be sent literally anywhere but | aa.com. | tyingq wrote: | _" what is the point for the consumer of being directed | to an airline's terrible status page?"_ | | One example... | | If you back up a bit, the widget didn't use to tell you | there was a change fee waiver when the flight was full, | while aa.com did. | | That's an actual, tangible benefit that a consumer might | want, worth real money. You can also often "bid" on | a dollar amount to receive if you're willing to change | flights. Google doesn't present that info today. | | There are more examples. My perspective isn't that Google | should lead you to aa.com, but I do feel it's a bit | dishonest that the widget is so large it pushes aa.com | below the fold. It doesn't need to be that large. | wbl wrote: | Does the concierge of a hotel take anything away when he | informs you that your flight has been delayed? | onlyrealcuzzo wrote: | Wikipedia isn't monetized. Doesn't it benefit them if Google is | serving their content for free and people are finding the | information they want without having to hit Wikipedia?? | | And also, isn't Google the largest sponsor of Wikipedia | already? In 2019 - Google donated $2M [1]. In 2010, Google also | donated $2m [2]. | | [1] https://techcrunch.com/2019/01/22/google-org-donates-2-milli... | | [2] https://en.wikipedia.org/wiki/Wikimedia_Foundation | minikites wrote: | Couldn't you make a similar argument about for-profit uses of | free/libre software? The software serves a useful purpose, | who cares where it came from? | dmitriid wrote: | Google was/is also the largest sponsor of Mozilla. This | doesn't stop Google from sabotaging Mozilla. | | 2 mln is probably Google's hourly profit. For that they get | one of the biggest knowledge bases in the world. It's | basically free as far as Google is concerned. | | The instant Google becomes confident they can supplant | Wikipedia, they will. | billiam wrote: | NOT a sponsor of Mozilla.
Google buys web traffic (as | default search engine) for ~$300M and turns it into several | times that $ in ad revenue. | jedberg wrote: | > 2 mln is probably Google's hourly profit. | | You don't have to guess, their numbers are public. In 2020 | they made $40B in profit, so it takes them about 27 minutes | to make $2M in profit. | magicalist wrote: | > _Google was /is also the largest sponsor of Mozilla. This | doesn't stop Google from sabotaging Mozilla._ | | Google isn't a sponsor of Mozilla, they're a customer. Do | people think Google is "sponsoring" Apple with $1.5 billion | a year too? | dmitriid wrote: | Google being Apple's customer doesn't mean Google isn't | sponsoring Mozilla. | | These are two very different companies with a very | different relationship with Google. And very different | influences on Google. | | Google _wants_ to be on iOS. It brings customers to | Google. A lot of them. iOS is possibly more profitable to | Google than Android even with all the payments Apple | extracts from them. | | Google needs Mozilla so that Google may pretend that | there's competition in browser space and that they don't | own standards committees. The latter already isn't really | true, and Google increasingly doesn't care about the | former. | foobarian wrote: | > they're a customer. | | The cynic in me thinks the product is anti-trust | insurance. | kelnos wrote: | Not sure why you're being downvoted; I completely agree | with what you're saying (modulo questionable usage of | "sponsor"). If Wikipedia were to try to charge for this use | of their data, Google would likely make it a priority to | drop the Wikipedia blurbs, either without replacement, or | with data sourced elsewhere. | will4274 wrote: | > Google would likely make it a priority to drop the | Wikipedia blurbs, either without replacement, or with | data sourced elsewhere. | | That's an odd way of phrasing things. 
If Wikipedia were | to take away free access to their data, Google wouldn't | be dropping Wikipedia, Wikipedia would be dropping | Google. This line of thinking "you took this when I was | giving it away for free, but now I want to charge for it, | so you are expected to keep paying for it" is incorrect. | zdragnar wrote: | Given the scale that google already operates at, I don't | doubt that they would just take a copy of the content and | rebrand it as a google service, complete with user | contribution. | | Then, after two or five years, let it fester, then abandon | it. Nobody gets promoted for keeping well-oiled machines | running. | dmitriid wrote: | Remember Knol? | https://en.wikipedia.org/wiki/Knol?wprov=sfti1 | | It was actually good for writing stuff when I tried it. Never | brought in enough traffic. Killed. | rincebrain wrote: | Wikimedia recently announced Wikimedia Enterprise for | "organizations that want to repurpose Wikimedia content in | other contexts, providing data services at a large scale". | | So they're pretty clearly looking to monetize organizations | which consume their data in a for-profit context. | dathinab wrote: | monetizing != for-profit | | You could e.g. just cover operational cost and/or improve | the service quality from it. | pmiller2 wrote: | I think they may have meant "(organizations) (which | consume their data in a for-profit context)." | tomp wrote: | Well then they can't nag users to donate to Jimmy Wales' | trust fund. | onetimemanytime wrote: | >> _Google donated $2M [1]. In 2010, Google also donated $2m | [2]._ | | $2 Million a year? Now I know why Googlers complained about | having one less olive in their lunch salad. | | How much does Google PROFIT from Wikipedia and how much does | Wikipedia lose in fundraising when Google fails to send | users to the info provider? | lupire wrote: | Wikipedia is drowning in money so this whole line of | discussion is weird.
| | And most of the value of wikipedia is created by its unpaid | users, not the Wikimedia foundation. | kelnos wrote: | > _Wikipedia isn't monetized._ | | No, but they often ask for donations when you visit the site, | which people won't see if they just see the in-line blurb | from Wikipedia on the Google results page. | | > _In 2019 - Google donated $2M [1]. In 2010, Google also | donated $2m [2]._ | | $2M is a pittance compared to what I expect Google believes | is the value of their Wikipedia blurbs. If Wikipedia could | charge for use of this data (which another commenter claims | they are working on doing), they could easily make orders of | magnitude more money from Google. | | Of course, my expectation is that Google would rather drop | the Wikipedia blurbs entirely, or source the data elsewhere, | than pay significantly more. | tylerhou wrote: | Unlikely that Wikipedia will be able to charge for content, | seeing as all of their content is CC-BY-SA licensed. | https://en.wikipedia.org/wiki/Wikipedia:Licensing_update | | They may be able to charge for _bandwidth_ (if you want to | use a Wikipedia image, you can use Wikipedia's enterprise | CDN instead of your own), but their licensing allows me to | rehost content as long as I follow the attribution & | sublicensing terms. | | Google has no problem operating their own CDNs, so I find | it unlikely that Wikipedia will be able to monetize Google | search results in such a manner as you described. | | Disclaimer: I work for Google; opinions are my own. | Siira wrote: | Large swaths of the web are garbage. Wasting people's time and | attention on visiting pointless sites for something presentable | in a small box is obviously not economical. | | And if some of the sources somehow die? New sources will spring | up. It doesn't matter. | dheera wrote: | > Only a select few crawlers are allowed access to the entire | web, and Google is given extra special privileges on top of that.
| | Hmm, so set up a VPN on the Google Cloud so you have a Google IP | address, use a Google User-Agent, and go! | jesboat wrote: | https://developers.google.com/search/docs/advanced/crawling/... | | describes the procedure for checking "is this really Googlebot?". | You couldn't fake it just by running on GCP. | cookiengineer wrote: | Can we take a moment to talk about this club's business model? | | There's not even any information to see what the "private forum | access" that you have to pay for is about, what kind of people | are in it...or even to know about what happens with the money. | | For me, this sounds like a scam. | | I mean, no information about any company. No imprint. No privacy | policy. No non-profit organization. And just a copy/paste | wordpress instance. | | I mean, srsly. I am building a peer-to-peer network that tries to | liberate the power of google, specifically, and I would not even | consider joining this club. And I am the best case scenario of | the proposed market fit. | adamdusty wrote: | They want you to pay them to "research" google's web crawling | monopoly. It's really just a donation, but they don't frame it | like that. Probably more credible than using a crowdfunding | website, because it sounds like they're pushing for actual | legislation. | | > Meet with legislators and regulators to present our findings | as well as the mock legislation and regulations. We can't | expect that we can publish this website or a PDF and then sit | back while governments just all start moving ahead on their | own. Part of the process is meeting with legislators and | regulators and taking the time helping them understand why | regulating Google in this way is so important. Showing up and | answering legislators' questions is how we got cited in the | Congressional Antitrust report and we intend to keep doing | what's worked so far.
| judge2020 wrote: | Not being set up as a 527 nonprofit[0] is the biggest red flag | - no donation or membership money has to be spent for political | purposes. They also use memberful for their membership/payment | system, which doesn't require owning a business, so you might | be paying out to the owner directly instead of to a business | with its own bank account. Maybe the owner is looking at HN and | can clarify. | | To add, there are a lot of businesses that use the term | 'Knucklehead' so finding their business on secretary of state | business searches might be impossible. | | 0: https://www.irs.gov/charities-non-profits/political-organiza... | drivingmenuts wrote: | How about a system whereby we tell others whether or not we want | to be crawled/not crawled by them? /s | [deleted] | tomc1985 wrote: | I think the solution here is everybody masquerades as Googlebot | so we can render the whole thing moot | quantumofalpha wrote: | Ignoring robots.txt is trivial, that's why some (many?) sites | enforce it by verifying the source IP and recognize Googlebot from | its IP addresses - how will you get access to one of those? | p-sharma wrote: | What does "recognize Googlebot from its IP addresses" mean? | If I'm a human and I access a site, I have some other IP than | Googlebot, how should this site know if I'm a human or | knuckleheadsbot? | quantumofalpha wrote: | if you're claiming to be User-Agent: Googlebot, but your IP | doesn't seem like it belongs to Google, don't you think | it's a clear sign that you're FAKING IT? | | The check itself could be implemented for example with an ASN | or reverse DNS lookup, or by hard-coding Google's known IP | ranges (though that's prone to become stale) | smarx007 wrote: | https://developers.google.com/search/docs/advanced/crawling/... | p-sharma wrote: | Maybe a naive question but what prevents Knuckleheads' from | ignoring the robots.txt and crawling the site anyway?
And if it's so | easy to do, how does Google have a monopoly on crawling then? | foobar33333 wrote: | On smaller sites, nothing usually. But on bigger sites you will | be blocked. You will probably be blocked even if you do follow | robots.txt | judge2020 wrote: | It's just rude to do so, and there are some technical issues | with doing that as well (such as crawling the admin panel, which | might trigger backend alarms/security alerts). Google also | doesn't have a legal monopoly on crawling, only a natural | monopoly, thanks to a lot of websites independently choosing to | only allow Google and Bing because of the many issues with | third-party crawlers (e.g. crawling all pages at once, costing | money/slowing down the site[0]). | | 0: https://news.ycombinator.com/item?id=26593722 | jinseokim wrote: | This has been submitted to HN quite a few times. | | https://news.ycombinator.com/item?id=25426662 (Most comments; 11 | comments) | | https://news.ycombinator.com/item?id=25417067 (3 comments) | | https://news.ycombinator.com/item?id=25546867 (Most recent; 89 | days ago) | | https://news.ycombinator.com/item?id=25543859 | | https://news.ycombinator.com/item?id=25424852 | Darkphibre wrote: | Hooray! Looks like I'm one of today's lucky 10,000. :) | | https://xkcd.com/1053/ | [deleted] | skinkestek wrote: | Wasn't aware of that. | | Resubmitting interesting content that hasn't got traction | earlier on is however explicitly allowed in the guidelines | IIRC. | pessimizer wrote: | And linking past threads on the same subject is helpful. | monkeybutton wrote: | Interesting that the most comments it got before was 11, and | today it succeeds and makes it to the front page! This is a | good illustration of how whether or not a submission gets any | traction can be fairly stochastic.
| | On topic, stack overflow does exactly what the article is | talking about; they lock down their sitemap and make special | exceptions for the Google bot: | | https://meta.stackexchange.com/a/98087 | | https://meta.stackexchange.com/questions/33965/how-does-stac... | | I can understand SO's reasoning but it only perpetuates the | incumbents' stranglehold on the internet. | jszymborski wrote: | I think it's partly because they created a website which | reported on the status of the Ever Given, which rose to #1 on | the front page. | | I feel like I often see submissions which are, even | tangentially, related to front page material rise very | quickly. | | Regardless, congrats to Knuckleheads Club for fighting the | good fight. | skinkestek wrote: | You are right, that was how I found it. | judge2020 wrote: | > They lock down their sitemap and make special exceptions | for the Google bot: | | Their robots.txt, on the other hand, is more restrictive of | Googlebot: | | https://stackoverflow.com/robots.txt | User-agent: Googlebot-Image | Disallow: /*/ivc/* | Disallow: /users/flair/ | Disallow: /jobs/n/* | .. | tmcw wrote: | I've definitely scraped by this problem on several occasions. | Recently I was writing a tool to check outgoing links from my | site, to see which sites are offline (it's called notfoundbot). | What I found was that many sites have "DDoS Protection" that | makes such an effort impossible, other sites whitelist the cURL | headers, others like it when you pretend to be a search engine. | | Basically, writing some code that tests whether "a website is | currently online or offline" is much, much harder than you think, | because, yep, the only company that can do that is Google. | varispeed wrote: | I disallow scanning on all my projects. After GDPR I also removed | all analytics - I realised it is just a time sink - instead of | focusing on content I would often focus on getting the bigger | numbers.
I am not a marketer, so it didn't have much value to me | and it would just enlarge Google's dataset without any payment. I | get that you cannot find my projects in the search engine. I am | okay with that :-) | topspin wrote: | If the shared cache ever became significant enough to matter it | would be devastated by marketers, scammers and other abusers. | Google employs the groomers that make their index at least | tolerable, if still clearly imperfect. Without that cadre of | well-compensated expertise to win the arms race against such abusers | the scheme is not feasible. | | I suppose this could be crowdsourced if I didn't know about | politics and how any attempt at delegating the responsibility for | blessing sites and their indexes would become a controversy. | Google takes lots of heat about its behavior already, but Google | is a private entity and can indulge its private prerogatives for | the most part. Without that independence this couldn't function. | finnthehuman wrote: | I don't really understand your comment. Marketers, scammers and | other abusers already publish to the web with the intention to | be included in a crawl. Postprocessing crawl data is already a | thing. | | Assuming this hypothetical shared crawl cache were to exist, it | does not preclude google (and all consumers of that cache) | doing their own processing downstream of that cache. Does it? | | What's the new attack vector? | topspin wrote: | > I don't really understand your comment. | | If you don't then you fail to appreciate the amount of labor | it takes to thwart bad actors from ruining indexes. Abusers | do publish to the web, and we enjoy not wallowing in their | crap because a small army of experienced and expensive people | at a select few Big Tech companies are actively shielding us | from it. | | It's easy to anticipate the malcontent view; 'Google spends | all its resources on ads and ranking and we don't need all | that.'
That is naive; if Google completely neglected grooming | out the bad actors, people wouldn't use Google and Google's | business model wouldn't be viable. | | So the obvious question is: where is this mechanism without | Google et al.? Will the published caches be 99% crap (and | without an active defense against crap you can bet your life | it will) and anything derived from it hopelessly polluted? If | so then it isn't viable. | | Now the instinct will be to find a groomer. Guess what; | that's probably doomed too. No selection will be impartial to | all, so you get to fight that battle. Good luck. | finnthehuman wrote: | >Will the published caches be 99% crap | | Yes. It will be exactly as crap as whatever's published on | the web. | | And the utility of google's search engine would be to | perform their proprietary processing on top of the | publicly-available crawl results. Analogous to how their | search is already performing proprietary processing on top | of a crawl cache. | | >If you don't then you fail to appreciate the amount of | labor it takes to thwart bad actors from ruining indexes. | | Did you miss the part where I said "Assuming this | hypothetical shared crawl cache were to exist, it does not | preclude google (and all consumers of that cache) doing | their own processing downstream of that cache. Does it?" | herewhere wrote: | Around a decade ago, I was part of the team responsible for | msnbot (a web crawler for bing). There used to be robots.txt | (forgot the extension now). Most websites were giving 10-20x | higher limits to Googlebot than to other crawlers. | | Google definitely has an unfair advantage there. | | Bing and duckduckgo still provide very reasonable results with | 10-20x fewer resources, but not on par with Google. | andrewclunn wrote: | How about an opt-in search engine cache? One where a domain needs | to agree to allow their site to be crawled, but as a result also | gives said crawler full access?
And then that repository would be | made publicly available to all search engines to use. Sort of an | AP for searches, that would give a baseline that wouldn't | preclude search engines from going further, but which would | certainly lower the cost and network traffic for the search | engines and sites that take advantage of it? | l72 wrote: | I tried to set up YaCy [1] at home to index a few of my favorite | smaller websites, so I could quickly search just them. That | turned out to be a bad idea. Some ended up blocking my home IP | address and others reported me to my ISP. None of these sites | were that large, and I wasn't continuously crawling them... | | [1] https://yacy.net/ | slenk wrote: | I have been running my own Searx instance in AWS for a while | and have not gotten blocked yet anywhere | jedimastert wrote: | How often were you searching? | l72 wrote: | I was regularly searching, but I was rarely indexing any of | the sites. I struggled to even get an initial index of many | of the sites, due to being blocked or being reported. | samizdis wrote: | Coincidentally, this item [1] has just turned up on HN - Common | Crawl | | [1] https://news.ycombinator.com/item?id=26594172 | mrweasel wrote: | While I don't disagree with the idea that all crawlers should | have equal access, we also need to address the quality of many | crawlers. | | Google and Microsoft have never hammered any website I've run | into the ground. Crawlers from other, smaller search | engines have, to the point where it was easier to just block them | entirely. | | Part of the problem is that sites want search engines to index | their site, but not allow random people to just scrape the entire | site. So they do the best they can, and forget that Google isn't | the web. I doubt it's shady deals with Google, it's just small | teams doing the best they can, and sometimes they forget to think | ideas through, because it's good enough.
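The selective allowance described in the comments above is usually expressed through per-User-agent groups in robots.txt. A hypothetical example that admits two major engines, asks one of them to slow down, and turns everyone else away (Crawl-delay is a non-standard extension: Bing and Yandex honor it, Googlebot ignores it):

```
# Hypothetical robots.txt: welcome the big engines, block everyone else
User-agent: Googlebot
Disallow:

User-agent: bingbot
Crawl-delay: 10
Disallow:

# Every crawler not matched above is asked to stay out entirely
User-agent: *
Disallow: /
```

Of course only cooperative crawlers obey any of this; abusive bots still have to be dealt with at the network or application layer, which is exactly the operational burden these comments describe.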
| rstupek wrote: | We've had the Bing crawler make an obscene number of requests | quite often, but fortunately it doesn't bring us down. | kmeisthax wrote: | I think this is a problem which should be solved by automatic | rate-limiting and throttling at the application/caching layer | (or just the individual web server for smaller sites). Requests | with a non-browser UA get put into a separate bots-only queue | that drains at a rate of ~1/sec or so. If the queue fills up | you start sending 429s, with random early failures for bots | (UA/IP/subnet pairs) that are overrepresented in the traffic | flow. | | I don't know if such software exists, but it should. It would | be a hell of a lot healthier for the web than "everyone but | Google f*ck off", and it creates an incentive for bots to | throttle themselves (as they're more likely to get a faster | response than trying to request as fast as possible). | NathanKP wrote: | I suspect that at least some of the bots use web server | response times and response codes as part of the signal for | ranking. If your website does not appear capable of handling | load then it won't rank as highly, because it is not in their | best interests to have search results that don't load. | henriquez wrote: | I'd like to see some data on their claim that website operators | are giving googlebot special privileges. As far as I can tell it | would be a huge pain in the ass to block crawler bots from my | servers, not that I've tried. I have some weird pages that tend | to get crawlers caught in infinite loops, and I try to give them | hints with robots.txt but most of the bots don't even respect | robots.txt. | | If I actually wanted to restrict bots, it would be much easier to | restrict googlebot because they actually follow the rules. | | I don't disagree in principle that there should be an open index | of the web, but for once I don't see Google as a bad actor here.
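The rate-limiting scheme sketched in the comments above is small at its core. A minimal, framework-agnostic sketch in Python (all thresholds hypothetical; a token bucket that sheds excess load with 429s stands in for the drain-at-1/sec queue):

```python
import time
from dataclasses import dataclass, field

@dataclass
class TokenBucket:
    """Drains bot requests at a fixed rate; browser traffic bypasses it."""
    rate: float = 1.0          # tokens added per second (~1 bot request/sec)
    capacity: float = 5.0      # short bursts allowed up to this many requests
    tokens: float = 5.0
    last: float = field(default_factory=time.monotonic)

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# Crude browser heuristic; real UA classification is much messier
BROWSER_MARKERS = ("Mozilla/", "Safari/", "Chrome/")
buckets: dict[str, TokenBucket] = {}

def handle_request(user_agent: str, client_key: str) -> int:
    """Return an HTTP status: 200 to serve, 429 to shed a bot."""
    if user_agent.startswith(BROWSER_MARKERS):
        return 200                        # browser traffic skips the bot queue
    bucket = buckets.setdefault(client_key, TokenBucket())
    return 200 if bucket.allow() else 429
```

A polite crawler that backs off on 429 gets served steadily at the configured rate; one that hammers the site simply burns its bucket and gets shed, which is the incentive structure described above.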
| throwaway_uat wrote: | LinkedIn profiles/Quora answers are accessible by Googlebot | without sign-in | burkaman wrote: | See figure I.4 on page 24 of this UK government report: | https://assets.publishing.service.gov.uk/media/5efb1db6e90e0... | | Additional evidence here: https://knuckleheads.club/the-evidence-we-found-so-far/ | malf wrote: | What do you think this is used for? | | https://developers.google.com/search/docs/advanced/crawling/... | calimac wrote: | The studies and data to support their claim are in the first | paragraph of the article you "read" before posting the | question. | Lammy wrote: | Spoofing your user-agent as googlebot is a common way to bypass | paywalls, is (was?) a way to read Quora without creating an | account, etc. Publishers obviously need to send their | page/article to Google if they want it to be indexed but may | not want to send the same page content to a normal user: | https://www.256kilobytes.com/content/show/1934/spoofing-your... | | This was common even back in the mid-2000s: | | https://www.avivadirectory.com/bethebot/ | | https://developers.google.com/search/blog/2006/09/how-to-ver... | soheil wrote: | It's hilarious to think there exist people who think googlebot | does not get special treatment from website operators. Here is | an experiment you can do in a jiffy: write a script that crawls | any major website and see how many URL fetches it takes before | your IP gets blocked. | | Googlebot has a range of IP addresses that it publicly | announces so websites can whitelist them. | quitethelogic wrote: | > Googlebot has a range of IP addresses that it publicly | announces so websites can whitelist them. | | Google says[1] they do not do this: | | "Google doesn't post a public list of IP addresses for | website owners to allowlist." | | [1]https://developers.google.com/search/docs/advanced/crawling/...
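The verification Google documents instead of a public IP list is a reverse DNS lookup on the requesting IP, a check that the returned name falls under googlebot.com or google.com, and a forward lookup to confirm the name maps back to the same IP. A minimal sketch using only the standard library (production use would also need result caching and IPv6 handling):

```python
import socket

def is_real_googlebot(ip: str) -> bool:
    """Reverse-then-forward DNS check for a client claiming to be Googlebot."""
    try:
        # Reverse lookup: a genuine crawler resolves to something like
        # crawl-66-249-66-1.googlebot.com
        host, _, _ = socket.gethostbyaddr(ip)
    except (socket.herror, socket.gaierror):
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        # Forward-confirm: the name must resolve back to the original IP,
        # otherwise anyone controlling their own reverse DNS could spoof it
        _, _, addresses = socket.gethostbyname_ex(host)
    except (socket.gaierror, socket.herror):
        return False
    return ip in addresses
```

An impersonator on a random IP fails the domain check, and one who forges their reverse DNS to say googlebot.com fails the forward confirmation, since they don't control Google's forward zone.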
| johncolanduoni wrote: | From that same page, they recommend using a reverse DNS | lookup (and then a forward DNS lookup on the returned | domain) to validate that it is Googlebot. So the effect is | the same for anyone trying to impersonate googlebot (unless | they can attack the DNS resolution of the site they're | scraping, I guess). | Mauricebranagh wrote: | I have never had that problem running Screaming Frog on big | brand sites, apart from one or two times. | dheera wrote: | Do any of them intersect with Google Cloud IP addresses? If | so, set up a VPN server on Google Cloud. | WesolyKubeczek wrote: | I don't scrape a website often, but when I do, I'm using a | user agent of a major browser. | tedunangst wrote: | I don't whitelist googlebot, but I don't block them either, | because their crawler is fairly slow and unobtrusive. Other | crawlers seem determined to download the entire site in 60 | seconds, and then download it again, and again, until they | get banned. | [deleted] | suicas wrote: | A company I worked for ~7 years ago ran its own focused web | crawler (fetching ~10-100m pages per month, targeting certain | sections of the web). | | There were a surprising number of sites out there that | explicitly blocked access to anyone but Google/Bing at the | time. | | We'd also get a dozen complaints or so a month from sites we'd | crawled. Mostly upset about us using up their bandwidth, and | telling us that only Google was allowed to crawl them (though | having no robots.txt configured to say so). | luckylion wrote: | I usually recommend setting only Google/Bing/Yandex/Baidu etc. | to Allow and everything else to Disallow. | | Yes, the bad bots don't give a fuck, but even the | non-malicious bots (ahrefs, moz, some university's search engine, | etc.) don't bring any value to me as a site owner, take up | bandwidth and resources, and fill up logs. If you can remove | them with three lines in your robots.txt, that's less noise.
| Especially universities do, in my opinion, often behave badly | and are uncooperative when you point out their throttling | does not work and they're hammering your server. Giving them | a "Go Away, You Are Not Wanted Here" in a robots.txt works | for most, and the rest just gets blocked. | dillondoyle wrote: | Isn't that the website owner's right though? I'm not sure I | understand the problem here. | | If Google is taking traffic and reducing revenue, a company | can deny in robots.txt. Google will actually follow those | rules - unlike most others that are supposedly in this 2nd | class. | suicas wrote: | Yup, no problem here, was just making an observation about | how common such blocking was (and about the fact that some | people were upset at being crawled by someone other than | Google, despite not blocking them). | | The company did respect robots.txt, though it was initially | a bit of a struggle to convince certain project managers to | do so. | jameshart wrote: | When you operate commercial sites at scale, bots are a real | thing you spend real engineering hours thinking about and | troubleshooting and coding to solve for. | | And yes, that means google gets special treatment. | | Think about the model for a site like stackoverflow. The | longest of long tail questions on that site: what's the actual | lifecycle of that question? | | - posted by a random user | - scraped by google, bing, et al | - visited by someone who clicked on a search result on google | - eventually, answered | - hopefully, reindexed by google, bing et al | - maybe never visited again because the answer now shows up | on the google SERP | | In the lifetime of that question how many times is it accessed | by a human, compared to the number of times it's requested and | rerequested by an indexing bot? | | What would be the impact on your site of three more bots as | persistent as googlebot? Why should you bother with their | requests?
| | So yes, sites care about bot traffic and they care about google | in particular. | noxvilleza wrote: | Google aren't the bad actor in the sense that they are actively | doing something wrong, but they are definitely benefiting from | the monopoly that they created and work on maintaining. If this | continues then nobody will really ever be able to challenge | them, which means possibly "better" products will fail to | penetrate the market. | ehsankia wrote: | > but for once I don't see Google as a bad actor here. | | As inflammatory as the headline of the page looks, they | literally admit it's not google's fault in the smaller text | lower down: | | "This isn't illegal and it isn't Google's fault, but" | zmarty wrote: | A lot of news websites restrict any crawler other than Google. | And this does not happen only via robots.txt. | simias wrote: | Indeed, years ago I had scripts to automatically fetch URLs | from IRC and I quickly realized that if I didn't spoof the | user agent of a proper web browser many websites would reject | the query. Googlebot's UA worked just fine however. | judge2020 wrote: | > Googlebot's UA worked just fine however | | They obviously don't care enough then - Google says you | should use rDNS to verify that googlebot crawls are | real[0]. Cloudflare does this automatically now as well for | customers with WAF (pro plan). | | 0: https://developers.google.com/search/docs/advanced/crawling/... | staunch wrote: | Google makes $150+ billion from Google Search per year. Google | Search could likely be operated for (much less than) $10 | billion per year. | | So, Google is in effect taxing us all $140 billion per year. | | It's not dissimilar from how Wall Street effectively taxes us all | for an even larger amount. | | In both cases, we could use some kind of non-profit open system | to facilitate web search and stock trading.
| | The Great Lie that Google is doing a good thing by charging money | to insert "relevant ads" above the search results is totally | wrong. If those ads are the most relevant, they should just be | the top organic results, obviously. | | Google mostly solved search 20 years ago. There's really nothing | that impressive about Google Search in 2021. It should be | relatively easy to replace it with something open, leveraging the | massive improvements in hardware and software. It could operate | like Wikipedia or Archive.org. The hard part is probably getting | the right team and funding assembled. | systemBuilder wrote: | This is not really about Google. | | Websites block crawlers because they get abused / crashed by | Crawlers. In the early days (2000-2010) Google not only got | banned by some websites, it even got DNS-banned for abusing some | DNS domains. You see, Google has already built the | "megacrawlers" described in this article; it can melt any website | on the Internet, even Facebook - the largest, and they paid a | high price for letting the early Google crawlers run free. | | Google today has a rate-limit for every single website and DNS | sub-domain on the internet. For small websites the default is a | handful of web pages every few seconds. Google has a very slow | (days) algorithm to increase its crawl rate, and a very fast (1d) | algorithm to cut the rate limit if it's getting any of the errors | likely due to website overload. | | To summarize, Google has several layers of congestion control | custom-designed into the crawl application. Most small web | crawlers have zero. | | None of these other crawlers have figured this out, so they abuse | websites, causing all small-scale crawlers to get banned. | | - ex-Google Crawl SRE | ricardo81 wrote: | Thank you for those insights, it's a topic I'm interested in. 
| Agree with what you're saying about naive bots hitting | websites/hosts/subnets too hard, in the context of site owners | being hit by multiple bots for multiple reasons and them | questioning the return they'll get. | | I'd be interested to know more info wrt DNS lookups. Did you | apply a blanket rate limit on the number of DNS requests you'd | make to any particular server? | | From past experience I know the .uk Nominet servers would | temp-ban if you were doing more than a few hundred lookups per | second. At the next host level down, was there a blanket limit | or was it dependent on the number of domains that nameserver | was responsible for? | dbsmith83 wrote: | I just don't see this working out legally. How would it even | work? | | From the "learn more" | | > Sometime soon we will be publishing what we think should happen | and what we think will happen. These two futures diverge and we | believe that, while the gap between them exists, it will entrench | Google's control over the internet further. We believe that | nothing short of socialization of these resources will work to | remove Google's control over the internet. Our hope is that in | publishing this work right now we will let the genie out of the | bottle and start a process towards socialization that cannot be | undone. | | Sorry, but I am deeply skeptical of this. This sounds like the first | step towards a non-free internet. At the end of the day, it is | your box on the web, and if you want or don't want | someone/something to crawl it, that is your call to make. | marshmallow_12 wrote: | I have an idea: remove the art of web crawling from the domain of | a single company and instead create an international group of | interested parties to run it. I'm thinking broadly along | the lines of the Bluetooth SIG. Maybe it will be a bit more | complicated, and require international political efforts, but it | will make the search engine market way more democratic. 
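[Editor's note] systemBuilder's description above of per-site crawl congestion control (slow ramp-up over days, fast back-off on errors) can be sketched in miniature. This is a toy illustration only — the class name, thresholds, and rates below are all invented, not Google's actual values:

```python
# Toy per-host crawl-rate controller in the spirit of the comment above:
# additive-style slow increase on sustained success, multiplicative fast
# decrease on signs of overload. All constants are illustrative.

class CrawlRateLimiter:
    """Tracks the allowed crawl rate (pages/sec) for a single host."""

    def __init__(self, initial_rate=0.5, max_rate=10.0, min_rate=0.1):
        self.rate = initial_rate      # pages per second
        self.max_rate = max_rate
        self.min_rate = min_rate
        self._successes = 0

    def record_success(self):
        # Slow ramp-up: only raise the rate after a long run of clean
        # fetches (standing in for the "days" of history mentioned above).
        self._successes += 1
        if self._successes >= 1000:
            self.rate = min(self.rate * 1.1, self.max_rate)
            self._successes = 0

    def record_overload(self):
        # Fast back-off on overload signals (429s, 503s, timeouts):
        # halve the rate immediately and forget the success streak.
        self._successes = 0
        self.rate = max(self.rate / 2, self.min_rate)
```

A polite fetch loop would then sleep `1 / limiter.rate` seconds between requests to that host.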
| sxp wrote: | https://knuckleheads.club/the-googlebot-monopoly/ has actual | details. | | > Let's take a look at the robots.txt for census.gov from October | of 2018 as a specific example to see how robots.txt files | typically work. This document is a good example of a common | pattern. The first two lines of the file specify that you cannot | crawl census.gov unless given explicit permission. The rest of | the file specifies that Google, Microsoft, Yahoo and two other | non-search engines are not allowed to crawl certain pages on | census.gov, but are otherwise allowed to crawl whatever else they | can find on the website. This tells us that there are two | different classes of crawlers in the eyes of the operators of | census.gov: those given wide access, and those that are totally | denied. | | > And, broadly speaking, when we examine the robots.txt files for | many websites, we find two classes of crawlers. There is Google, | Microsoft, and other major search engine providers who have a | good level of access and then there is anyone besides the major | crawlers or crawlers that have behaved badly in the past that are | given much less access. Among the privileged, Google clearly | stands out as the preferred crawler of choice. Google is | typically given at least as much access as every other crawler, | and sometimes significantly more access than any other crawler. | indymike wrote: | Broadly speaking, robots.txt files are often ignored. I used to | run a fairly large job ad scraping organization, and we would | be hired by companies (700 of the fortune 1000 used us) to | scrape the job ads from their career pages, and then post those | jobs on job boards. 99 of 100 times, the robots file would | disallow us from scraping. Since we were being paid by that | company's HR team to scrape, we just ignored it because getting | it fixed would take six months and 22 meetings. | chmod775 wrote: | > Broadly speaking, robots.txt files are often ignored. 
| | If you wanna go nuclear on people who do that, include an | invisible link in your html and forbid access to that URL in | your robots.txt, then block every IP that accesses that URL | for X amount of time. | | Don't do this if you actually rely on search engine traffic | though. Google may get pissed and send you lots of angry mail | like "There's a problem with your site". | jedberg wrote: | > Don't do this if you actually rely on search engine | traffic though. Google may get pissed and send you lots of | angry mail like "There's a problem with your site". | | Ah, but of course you would exclude Google's published | crawler IPs from this restriction, because that is exactly | what they want you to do. | TheAdamAndChe wrote: | Are there any actual repercussions for just ignoring | robots.txt? | 5560675260 wrote: | Your crawler's IP might get banned, eventually. | asciident wrote: | There is if you are doing it for work. For example, your | company could get sued if you are found using that data and | ignoring the ToS. If you are a public figure, you could get | your name tarnished as doing something unethical or the media | may call it "hacking". If you are rereleasing the data then | you risk getting a takedown notice. | the_dege wrote: | Sometimes website admins will also try to report your IPs to | the service provider as a source of attacks (even if not | true). | DocTomoe wrote: | Given how often I've had misbehaving crawlers slow down my own | servers in the early 2000s, I do not see how a crawler that | disobeys robots.txt is not an attempted attack. | JackFr wrote: | So from the website's point of view there is no difference | between 'crawling' and 'scraping'. Census.gov I assume has a | ton of very useful information which is in the public domain | which a host of potential companies could monetize by regularly | scraping census.gov. Census.gov's purpose of making this | information available to people is served by google, yahoo and | bing. 
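[Editor's note] chmod775's honeypot idea above — an invisible link that robots.txt forbids, tripping a temporary IP block — is simple enough to sketch. The trap path, block duration, and in-memory store below are all made up for illustration; per jedberg's reply, a real deployment would also exempt Google's verified crawler IPs:

```python
# Minimal sketch of the robots.txt honeypot described above. Any client
# fetching TRAP_PATH is, by construction, ignoring robots.txt, so its IP
# is blocked for a while. The path and timings are invented.

import time

TRAP_PATH = "/secret-trap"
ROBOTS_TXT = f"User-agent: *\nDisallow: {TRAP_PATH}\n"
BLOCK_SECONDS = 24 * 3600

_blocked = {}  # ip -> timestamp when the block expires

def handle_request(ip, path, now=None):
    """Return an HTTP status code for this request."""
    now = time.time() if now is None else now
    if _blocked.get(ip, 0) > now:
        return 403                      # still serving out its block
    if path == TRAP_PATH:
        _blocked[ip] = now + BLOCK_SECONDS
        return 403                      # tripped the trap: start a block
    return 200
```

The invisible link itself would just be something like `<a href="/secret-trap" style="display:none"></a>` in the page template.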
On the other hand if I have a business which is based on | that data, in fact I'm at cross purposes to them. | njharman wrote: | I'm generally anti-business. But I have to disagree. "The | Public" that the government serves includes businesses. | Businesses (ignoring corporate personhood bullshit) are owned | and operated by people. | | I do not want the government deciding "what purposes" e.g. | non-commercial, serve the public good. The public gets to | decide that. (charging a license for commercial use is maybe | ok, assuming supporting that use costs the government "too | much"). | | And I very much do not want the current situation, with the government | granting a handful of corporations (the farthest thing from | the public possible) access and denying everyone else | including all of the actual public. | hnbroseph wrote: | > I do not want the government deciding "what purposes" | e.g. non-commercial, serve the public good. The public gets | to decide that. | | the public's "decision" on things like this is made | manifest by government policy, no? | danShumway wrote: | In theory. In practice, is every single policy that our | government upholds currently popular with the majority of | people? | | It's possible to have government policies that the | majority of people disagree with, that remain for | complicated reasons related to apathy, lobbying, party | ideology, or just because those issues get drowned out by | more important debates. | | Government is an extension of the will of the people, but | the farther out that extension gets, the more divorced | from the will of the people it's possible to be. That's | not to say that businesses are immune from that effect | either -- there are markets where the majority of people | participating in them aren't happy with what the market | is offering. All of these systems are abstractions, | they're ways of trying to get closer to public will, and | they're all imperfect. 
But government is particularly | abstracted, especially because the US is not a direct | democracy. | | I'm personally of the opinion that this discussion is | moot, because I think that people have a fundamental | Right to Delegate[0], and I include web scraping public | content under that right. But ignoring that, because not | everyone agrees with me that delegation is right, | allowing the government to unilaterally rule on who isn't | allowed to access public information is still | particularly susceptible to abuse above and beyond what | the market is capable of. | | [0]: https://anewdigitalmanifesto.com/#right-to-delegate | pessimizer wrote: | A specific case where this favorite-picking by government | enables corruption: https://en.wikipedia.org/wiki/Nationally_recognized_statisti... | | And an example from the quickly-approaching future, when | there will be Nationally Recognized Media Organizations who | license "Fact-Checkers," through which posts to public-facing | sites will have to be submitted for certification and | correction. | marcosdumay wrote: | Favorite-picking by the government is corruption by | itself already. | indymike wrote: | I used to run a fairly large job ad scraping operation. Our | scraped data was used by many US state and federal job sites. | "Scraping" is just using software to load a page and | extracting content. "Crawling" is just loading a page, finding | hyperlinks (hmm... a kind of content), and then crawling | those links. Crawling is just a kind of scraping. | vharuck wrote: | In the case of Census.gov, they offer an API to get the | data[0]. It's actually pretty nice. Stable, ton of data, | fairly uniform data structure across the different products. | Very high rate limits, considering most data only needs to be | retrieved once a year. I think they understand the difference | between crawling and scraping. 
| | [0] https://www.census.gov/data/developers.html | ricardo81 wrote: | Having data in the right format as a download or via an API | would be the best way to go for public data. | | If people have to 'scrape' that data from a public resource, | I'd say they're presenting the data in the wrong way. | mulmen wrote: | But Google, Yahoo and Bing are also monetizing the data. Why | are they allowed to provide "benefits" but "scrapers" are | not? Why is it wrong to monetize public data? | jonas21 wrote: | The census data is available for bulk download, mostly as CSV | (for example [1]). Scraping census.gov is worse for both the | Census Bureau (which might have to do an expensive database | query for each page) and for the scraper (who has to parse | the page). | | Blocking scrapers in robots.txt is more of a way of saying, | "hey, you're doing it wrong." | | It's also worth noting that the original article is out of | date. The current robots.txt at census.gov is basically | wide-open [2]. | | [1] https://www.census.gov/programs-surveys/acs/data/data-via-ft... | | [2] https://www.census.gov/robots.txt | foobar33333 wrote: | Scrapers don't care about robots.txt. I have scraped | multiple websites in a previous job and the robots.txt | means nothing. Bigger sites might detect and block you but | most don't. | gnramires wrote: | Perhaps there could be some kind of 'Crawler consortium'? | | Under this consortium, website owners would be allowed to | either allow all crawlers (approved by the consortium) or none | at all (that is, none that is in the consortium, i.e. you could | allow a specific researcher or something to crawl your website | on a case-by-case basis). | | This consortium would be composed of the search engines | (Google, MS, other industry members), as well as government | appointed individuals and relevant NGOs (electronic frontier | foundation, etc?). 
There would be an approval process that | simply requires your crawl to be ethical and respect bandwidth | usage. Violations of ethics or bandwidth limits could imply | temporary or permanent suspension. The consortium could have | some bargaining or regulatory measures to prevent website owners | from ignoring those competitive and fairness provisions. | dragonwriter wrote: | > Perhaps there could be some kind of 'Crawler consortium'? | | An industry-wide agreement not to compete for commercially | valuable access to suppliers of data? | | Comprised of companies that are current (and in some cases | perennial) focusses of antitrust attention? | | I think there might be a problem with that plan. | gnramires wrote: | Well, yes, and a common solution to anti-trust cases, that | I know of, is some kind of industry self-regulation. In | this case I wouldn't trust the industry only to | self-regulate; hence, they should at least invite | (while keeping a minority but not insignificant position) | governments and civil society (ngos and other organizations) | to participate. | | Could you better describe your objections? | neolog wrote: | I don't see the problem. If a bunch of non-google companies | pooled resources to make a crawl, that would reduce market | concentration, not increase it. | adolph wrote: | Is it legal for a government entity to issue a robots.txt like | that? Maybe the line between use and abuse hasn't been | delineated as well as it needs to be. | bigwavedave wrote: | > Is it legal for a government entity to issue a robots.txt | like that? | | I may be wrong (this isn't my area), but I was under the | impression that robots.txt was just an unofficial convention? | I'm not saying people should ignore robots.txt, but are there | legal ramifications if ignored? I'm not asking about | techniques sites use to discourage crawlers/scrapers, I'm | specifically wondering if robots.txt has any legal weight. | vageli wrote: | Is failure to honor a robots.txt a crime? 
Or rather, would it | be unlawful to spoof a user agent to access this publicly | available data? After the linkedin [0] case it seems | reasonable to think not. | | [0]: https://www.eff.org/deeplinks/2019/09/victory-ruling- | hiq-v-l... | Spivak wrote: | Spoofing user-agents hasn't worked in a long time for | anything but small operations because search engines | publish specific IP ranges their scrapers use. | zepearl wrote: | Maybe it would be nice if some sort of simple central index of | "URLs + their last updated timestamp/version/eTag/whatever" would | exist, updated by the site owners themselves? | ("push"-notification) | | Meaning that whenever a page of a website is created or | updated, that website itself would actively update that central | index, basically saying "I just created/deleted page X" or "I | just updated the contents of page X". | | The consequence would be that... | | 1) ...crawlers would no longer have to actively (re)scan the | whole Internet to find out if anything has changed, but they | would only have to query that central index against their own | list of URLs & timestamps to find out what needs to be (re)scanned. | | 2) ...websites would not have to just wait&hope that some bot | would decide to come by to have a look at their sites, nor would | they have to answer over and over again requests that are just | meant to check if some content has changed. | soheil wrote: | I'm not sure if it is a good thing if there is a public cache of | everything that Google has. The issue is websites will simply | stop serving content to Google to protect their content from | being accessed by their competitors; this in turn will make | search much worse and will force us back to the pre-search dark | ages of the internet. The sites may even serve an even more | crippled version of their content just to get hits but there is | no doubt search quality will suffer. 
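[Editor's note] zepearl's push model above is easy to prototype as a data structure: sites notify a shared index when pages change, and crawlers diff the index against what they already hold instead of re-scanning everything. This is purely a toy — no such shared index exists, and the class and method names are invented:

```python
# Toy version of the central "URL + last-change" index sketched above.
# Site owners push change notifications; crawlers ask which of their
# known URLs are stale instead of polling every page.

class ChangeIndex:
    def __init__(self):
        self._index = {}                # url -> (etag, deleted?)

    def notify(self, url, etag, deleted=False):
        """Called by the site owner when a page changes or disappears."""
        self._index[url] = (etag, deleted)

    def stale_urls(self, seen):
        """Given a crawler's {url: etag} map, return URLs worth refetching."""
        return [url for url, (etag, deleted) in self._index.items()
                if not deleted and seen.get(url) != etag]
```

A crawler would call `stale_urls` with the ETags from its last crawl and fetch only what comes back, which is the bandwidth saving zepearl describes.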
| | We're left with a monopoly that is Google; destroying it now | could be foolish. | sesuximo wrote: | Seems like a private cache of the web would solve the problem? | Why does it need to be public? | lisper wrote: | Seriously? Google _is_ a private cache of the web. That _is_ | the problem. | sesuximo wrote: | Google doesn't give anyone access to said cache. I mean one | crawler with a shared api among competitors. So exactly the | same as the public cache, but run by a private company and | accessed for a small fee | ajcp wrote: | > run [by] a private company and accessed for a small fee | | That is exactly the opposite of a public cache. | sesuximo wrote: | Not really. It serves the same function. Either you pay | this hypothetical company or ??? pays to keep up the | public one. | ajcp wrote: | Just because it serves the same function does not mean | the implementation is the same. Private military | contractors and a US infantry squad serve the same | function, but the implementation completely changes their | context. | | That being said what I think you're arguing for would be | the implementation of a public utility or private-public | business. If that's the case then yes, what you're saying | is correct. | visarga wrote: | > Google doesn't give anyone access to said cache. | | It would also be useful for deep searches, exceeding the | 1000 result limit, empowering all sorts of NLP | applications. | lisper wrote: | I don't think you're quite clear on what the words "public" | and "private" mean. "Public" is not a synonym for "run by | the government" and "private" is not a synonym for "closed | to everyone but the owner". Restaurants, for example, are | generally open to the public, but they are not public. A | restaurant owner is, with a few exceptions, free to refuse | service to anyone at any time. | | If it's "exactly the same as a public cache" then it's | public, even if it is managed by a private company. 
The | difference is not in who _has_ access, the difference is in | _who decides_ who has access. | sesuximo wrote: | Ok I am not clear then, but I'm less clear after your | comment! In a public cache, who would you want to decide | who has access? Is simply saying "anyone who pays has | access" enough to qualify as public? if so, then I agree | and this was my (possibly poorly phrased) intention in | the original comment. | | But imo the restaurant model is also fine; in most cases | people have access and it works. | lisper wrote: | > Is simply saying "anyone who pays has access" enough to | qualify as public? | | No because someone has to set the price, which is just an | indirect method of controlling who has access. | | > the restaurant model is also fine | | It works for restaurants because there is competition. | The whole point here is that web crawling/caching is a | natural monopoly. | | A better analogy here would be Apple's app store, or | copyrighted standards with the force of law [1]. These | nominally follow the "anyone who pays has access" model | but they are not public, and the result is the same set | of problems. | | [1] https://www.thebrandprotectionblog.com/public-laws- | private-s... | sct202 wrote: | You can API google search results to make a meta-search | engine if you want to but it's like $5 / 1k requests. | twiddlebits wrote: | Google's TOS prevents blending (alterations, etc.) | though. | [deleted] | ThePhysicist wrote: | On a related note, Cloudflare just introduced "Super Bot Fight | Mode" (https://blog.cloudflare.com/super-bot-fight-mode/) which | is basically a whitelisting approach that will block any | automated website crawling that doesn't originate from "good | bots" (they cite Google & Paypal as examples of such bots). So | basically everyone else is out of luck and will be tarpitted | (i.e. connections will get slower and slower until pages won't | load at all), presented with CAPTCHAs or outright blocked. 
In my | opinion this will turn the part of the web that Cloudflare | controls into a walled garden not unlike Twitter or Facebook: In | theory the content is "public", but if you want to interact with | it you have to do it on Cloudflare's terms. Quite sad really to | see this happen to the web. | judge2020 wrote: | On the other hand, I do not want my site to go down thanks to a | few bad 'crawlers' that fork() a thousand http requests every | second and take down my site, forcing me to do manual blocking | or pay for a bigger server/scale-out my infrastructure. Why | should I have to serve them? | progval wrote: | You can use the same rate-limiting for all crawlers, Google | or not. | dodobirdlord wrote: | Googlebot is pretty careful and generally doesn't cause | these problems. | spijdar wrote: | Right, then they shouldn't be affected by the | rate-limiting, as long as it's reasonable. If it was applied | evenly to all clients/crawlers, it'd at least allow the | possibility for a respectful, well designed crawler to | compete. | jedberg wrote: | The problem is, if you own a website, it takes the same | amount of resources to handle the crawl from Google and | FooCrawler even if both are behaving, but I'm going to | get a lot more ROI out of letting Google crawl, so I'm | incentivized to block FooCrawler but not Google. In fact, | the ROI from Google is so high I'm incentivized to devote | _extra_ resources just for them to crawl faster. | TameAntelope wrote: | How hard is it to ask Cloudflare to let you crawl? | smarx007 wrote: | It's not Cloudflare who is deciding it. It's the website | owners who request things like "Super Bot Fight Mode". I | never enable such things on my CF properties. Mostly it's | people who manage websites with "valuable" content, e.g. | shops with prices who desperately want to stop scraping by | competitors. | f430 wrote: | I can say this will give a lot of businesses a false sense of | security. It is already bypassable. 
| | the Web scraping technology that I am aware of has reached | end game already: Unless you are prepared to authenticate | every user/visitor to your website with a dollar sign, or | lobby congress to pass a bill to outlaw web scraping, you | will not be able to stop web scraping in 2021 and beyond. | kristopolous wrote: | In the early 90s there were various nascent systems for | essentially public database interfaces for searching | | The idea was that instead of a centralized search, people could | have fat clients that individually query these apis and then | aggregate the results on the client machine. | | Essentially every query would be a what/where or what/who pair. | This would focus the results | | I really think we need to reboot those core ideas. | | We have a manual version today. There's quite a few large | databases that the crawlers don't get. | | The one place for everything approach has the same fundamental | problems that were pointed out 30 years ago, they've just | become obvious to everybody now. | grishka wrote: | So, one more reason to hate Cloudflare and every single website | that uses it. | jakear wrote: | Or maybe don't "hate" folks who are just trying to put some | content online and don't want to deal with botnets taking | down their work? You know, like what the internet was | intended for. | grishka wrote: | Internet was certainly _not_ intended for centralization. I | hit Cloudflare captchas and error pages so often it's | almost sickening. So many things are behind Cloudflare, | things you least expect to be behind Cloudflare. | petercooper wrote: | I wonder what happens to RSS feeds in this situation. Programs | I run that process RSS feeds will just fetch them over HTTP | completely headlessly, so if there are any CAPTCHAs, I'm not | going to see them. | luckylion wrote: | That will be interesting to see with regards to legal | implications. If they (in the website operator's name) block | access to e.g. 
privacy info pages to a normal user "by | accident", that could be a compliance issue. | | I don't think mass blocking is the right approach in | general. IPs, even residential, are relatively easy and | relatively cheap to obtain. At some point you're blocking too many normal | users. Captchas are a strong weapon, but they too have a | significant cost by annoying the users. Cloudflare could | theoretically do invisible-invisible captchas by never even | running any code on the client, but that would be wholesale | tracking and would probably not fly in the EU. | dleslie wrote: | The idea of a public cache available to anyone who wishes to | index it is ... kind of compelling. | | If it was the only indexer allowed, and it was publicly | governed, then enforcing changes to regulation would be a lot | more straightforward. Imagine if indexing public social media | profiles was deemed unacceptable, and within days that content | disappeared from all search engines. | | I don't think it'll ever happen, but it's interesting to think | about. | tlibert wrote: | So outlaw web scraping entirely? | simantel wrote: | Common Crawl is attempting to offer this as a non-profit: | https://commoncrawl.org | jackson1442 wrote: | o/t but what the hell are they doing to scroll on that page? | I move my fingers a centimeter on my trackpad and the page is | already scrolled all the way to the bottom. | | Hijacking scroll like this is one of the biggest turnoffs a | website can have for me, up there with being plastered with | ads and crap. It's ok imo in the context of doing some flashy | branding stuff (think Google Pixel, Tesla splashes) but | contentful pages shouldn't ever do this. | aembleton wrote: | Add *##+js(aeld, scroll) to your uBO filters. That will | stop scroll JS for all websites. | xtracto wrote: | That would be a very cool use case for something like STORJ or | IPFS. | ricardo81 wrote: | An alternative but similar idea, apply your own algorithms to a | crawler/index. 
That's half the problem with these large | platforms commanding the majority of eyeballs: you search the | entire web for something and you get results back via a black | box. Alternatives in general are most definitely a good thing. | | Knuckleheads' Club at the very least are doing a great job of | raising awareness of the potential barriers to entry for | alternatives. | ISL wrote: | Imagine if Donald Trump decided that indexing Joe Biden's | campaign site was unacceptable. | | A mandated singular public cache has potential slippery slopes. | whimsicalism wrote: | Imagine if Donald Trump decided to tax campaign donations to | Joe Biden's campaign at 100%. | | I am unconvinced by the "slippery slope" argument being | deployed by default to any governmental attempt to combat | tech monopolies. | ISL wrote: | This is an argument against centralization more than it is | against government. | | "One index to rule them all" seems more fraught with | difficulty than, "large cloud providers are unhappy that | crawlers on the open web are crawling the open web". | whimsicalism wrote: | If the impact stopped at "large cloud providers" being | unhappy, I think that you're correct. But I think we've | seen considerably downstream "difficulty" for the rest of | society from search essentially being consolidated into | one private actor. | passivate wrote: | >A mandated singular public cache has potential slippery | slopes. | | That may be, but it seems like everything has a slippery | slope - if the wrong person gets into power, or if the public | look the other way/complacence/ignorance/indifference, etc, | etc. It shouldn't stop us evaluating choices on their merits, | and there is a lot of merit to entrusting 'core | infrastructure' type entities to the government - or at least | having an option. 
| drivingmenuts wrote: | > If it was the only indexer allowed, and it was publicly | governed | | Which would put it under government regulation and be forever | mired in politics over what was moral, immoral, ethical or | unethical and all other kerfuffle. To an extent, it's already | that way, but that would make it worse than it is currently. | hackeraccount wrote: | I'd have to look more but maybe running a cache isn't dead | simple. I can imagine that the benefits of manipulating what's | in the cache - either adding or removing - would be very high. | Google and the others are private companies so they're not | required to do everything in the public view. | | A public cache wouldn't - indeed shouldn't - be able to play | cat and mouse games with potential opponents. I suspect most of | the games played require not explaining exactly what you're | doing. | sixdimensional wrote: | Here's an idea... what if search became a peer-to-peer | standardized protocol that is part of the stack to complement | DNS? E.g. instead of using DNS as the primary entry point, you | use a different protocol at that level to do "distributed | search". DNS would still play a role too, but if "search" was a | core protocol, the entry point for most people would be | different. | | Similar to some of the concepts of "Linked Data", maybe - | https://en.wikipedia.org/wiki/Linked_data. | | The problem is getting to a standard; it would essentially need | to be federated search so a standard would have to be | established (de facto most likely). | | Also, indexes and storage, distribution of processing load.. | peer-to-peer search is already a thing, but it doesn't seem to | be a core function of the Internet. | | This is basically the same concept as making an "open" version | of something that is "closed" in order to compete, I guess. 
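[Editor's note] the client-side federated search that kristopolous and sixdimensional describe — fat clients querying several independent indexes and aggregating locally — can be sketched in a few lines. The backend interface and result format here are invented for illustration; a real protocol would need a shared standard, which is exactly the hard part sixdimensional points out:

```python
# Rough sketch of client-side federated search: fan a query out to
# several independent index backends and merge the results locally,
# so no single party controls the ranking.

def federated_search(query, backends, limit=10):
    """backends: callables mapping a query to [(score, url), ...]."""
    merged = {}
    for search in backends:
        for score, url in search(query):
            # Keep the best score any backend reported for each URL.
            merged[url] = max(score, merged.get(url, 0.0))
    ranked = sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
    return [url for url, _ in ranked[:limit]]
```

In practice the backends would be HTTP endpoints and the scores would need normalizing across indexes, which is one of the open problems with this design.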
| rezonant wrote: | > Let's take a look at the robots.txt for census.gov from October | of 2018 as a specific example to see how robots.txt files | typically work. This document is a good example of a common | pattern. The first two lines of the file specify that you cannot | crawl census.gov unless given explicit permission. | | This was eyebrow-raising. Actually seems like this is not (any | longer?) true: | | https://census.gov/robots.txt: | | User-agent: * | | User-agent: W3C-checklink | | Disallow: /cgi-bin/ | | Disallow: /libs/ | | ... | | That first line wildcards for any user agent but does nothing | with it. It should say "Disallow: /" on the next line if it | blocked all unnamed robots. It looks like someone found out about | it and told the operators, rightfully so, that government | webpages with public information (especially the census) | shouldn't have such restrictions. They then removed only the | second line and left the first. Leaving the first line has no | impact on the meaning of the file. | EGreg wrote: | Or use MaidSAFE where you get paid to serve your website as | opposed to the other way around. | hannob wrote: | I have seen sites behave differently if you use a Googlebot UA, | but am I missing something or does this merely mean that anyone | doing something like this | | curl -A 'Mozilla/5.0 (compatible; Googlebot/2.1; | +http://www.google.com/bot.html)' | | will get Google-level crawler access? | kirubakaran wrote: | That would work on websites that have a naive check for just | user agent. Google also publishes the IP address ranges their | crawlers run on. Lots of websites check for that, and there's no | way around that. | | https://developers.google.com/search/docs/advanced/crawling/... | AlphaWeaver wrote: | This "club" charges a membership fee of $10 a month (or $100 a | year) to comment. | | Does this go to some sort of nonprofit or holding entity that's | governed by its members? Or do people have to trust the owner? 
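[Editor's note] the verification judge2020 and kirubakaran mention — a Googlebot user agent alone proves nothing, so reverse-resolve the client IP, check the hostname is under googlebot.com or google.com, then forward-resolve it back to confirm — looks roughly like this. Resolvers are injectable here so the logic is testable without live DNS; in real code you would pass `socket.gethostbyaddr` / `socket.gethostbyname` wrappers:

```python
# Sketch of Googlebot verification via reverse + forward DNS.
# reverse_lookup(ip) -> hostname; forward_lookup(hostname) -> ip.
# A spoofed PTR record can claim any hostname, which is why the
# forward-confirmation step is required.

def is_real_googlebot(ip, reverse_lookup, forward_lookup):
    try:
        host = reverse_lookup(ip)
    except OSError:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        return forward_lookup(host) == ip   # spoofed PTRs fail here
    except OSError:
        return False
```

The example IP and hostname in the test below follow the crawl-N-N-N-N.googlebot.com pattern Google documents, but are used purely as fixtures.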
| mancerayder wrote: | Any word on or opinions about Brave's initiative to challenge | search? | ChrisArchitect wrote: | dupe/posted earlier etc | | I also got confused about this page as there's another project of | theirs around right now about RIP Google Reader that's on a | separate domain... | | Funny a site that's all about google this and that doesn't have | clear URL/pages for their articles that can be linked to easily, | geez | | Original post/discussion from the source, 3 months ago: | https://news.ycombinator.com/item?id=25417067 | slenk wrote: | https://knuckleheads.club/introduction/ | | That seems like an easy link? ___________________________________________________________________ (page generated 2021-03-26 23:00 UTC)