[HN Gopher] Common Crawl
___________________________________________________________________
 
Common Crawl
 
Author : Aissen
Score  : 283 points
Date   : 2021-03-26 16:42 UTC (6 hours ago)
 
(HTM) web link (commoncrawl.org)
(TXT) w3m dump (commoncrawl.org)
 
| smaddox wrote:
| I was hoping it was going to be a massively multiplayer dungeon
| crawling game...
| breck wrote:
| Love it! Just donated.
| cblconfederate wrote:
| I don't think that this is the answer to "only Google can crawl
| the web". This is a huge archive, suitable maybe for making a web
| search engine.
|
| What if you want to make a simple link previewer? An abstract
| crawler for scientific articles? Most websites are behind
| Cloudflare, which will block/captcha you but happily whitelist
| only Google and the major social sites. The answer is measures
| that bring the web back to basics, not this over-SEOed,
| bot-infested ecosystem. The FAANGs succeeded in sucking all the
| information out of the web, but they suck at creating
| interoperable protocols (even Twitter now needs its own tags!).
|
| Incidentally, maybe the next search engine should use a push
| system, where websites ping it whenever they have updates. If the
| engine has unique features, it might actually be seen as a way to
| reduce the load from bots.
| gillesjacobs wrote:
| This is mainly used for language-modeling research. A filtered
| CC was used in GPT-3, and I have personally used data from CC
| for NLP projects.
| shadowgovt wrote:
| I don't think we can bring the web back to basics in the sense
| you're envisioning without kicking most users off of it.
|
| Cloudflare's protection is there to guard against traffic spikes
| and automated malicious attacks; Google and the social sites are
| allow-listed because they're trustable entities with well-defined
| online presences and a vested interest (generally) in not
| breaking sites. "How do we do away with the need for an
| anti-automated-traffic screen plus a whitelist?" is in the same
| category of problem as "How do we change the modern web to
| address automated malicious attacks?"
| neura wrote:
| Another way to look at this: if CDNs don't allow Google, nobody
| is going to want to use them. Their customers' content doesn't
| get indexed, and anybody doing a search is going to get directed
| to someone else who doesn't put their content behind a CDN with
| that level of protection. That, or someone like Google will just
| solve the problem themselves and be both the CDN and the indexer,
| bringing them one step closer to complete ownership of finding
| anything on the web.
| th0ma5 wrote:
| Actually, I was able to search URLs across the entire collection
| with very little RAM, as they publish a series of indexes you
| can download.
|
| In theory someone could do something similar with terms, or you
| could first use the URL index to filter down what you download
| into Elastic or Solr and do your own custom search that way.
|
| The indexes are really neat, though; I highly recommend playing
| with them.
| thunderbong wrote:
| Can you suggest where I can get them from?
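(A pointer for the question above: besides the downloadable index files, the URL index is queryable over HTTP at index.commoncrawl.org, with one endpoint per crawl. A minimal sketch in Python, using the CC-MAIN-2021-10 crawl as an example:)

    import json
    import urllib.parse
    import urllib.request

    # One index endpoint exists per crawl; CC-MAIN-2021-10 is used
    # here as an example. output=json returns one JSON object per line.
    INDEX = "https://index.commoncrawl.org/CC-MAIN-2021-10-index"

    def lookup(url_pattern):
        """Return the index records for captures matching url_pattern."""
        query = urllib.parse.urlencode({"url": url_pattern, "output": "json"})
        with urllib.request.urlopen(f"{INDEX}?{query}") as resp:
            return [json.loads(line) for line in resp.read().splitlines()]

    for rec in lookup("commoncrawl.org/*"):
        # Each record names the WARC file holding the capture, plus the
        # byte offset and length of that record within it.
        print(rec["url"], rec["filename"], rec["offset"], rec["length"])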
| breischl wrote:
| > maybe the next search engine should use a push-system
|
| So setting up a new search engine that way would require going
| to every site and convincing them to notify you of changes.
| Wouldn't that be even more limited than the current Cloudflare
| whitelist system? At least there's some chance you can get
| around the whitelist system.
| neura wrote:
| This... I couldn't get past the irony of the comment.
| Basically, the problem is that sites only let Google and
| friends index them. The solution? Sites should only send index
| data to Google and friends.
|
| I mean, I get it: then sites can send to any number of
| indexers. But let's be honest, like you say, any new search
| engine has to get sites to push data out to them. That's just
| not. going. to. happen.
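(To make the push model being debated here concrete: it would look something like the old sitemap-ping convention, where a site tells an engine "something changed" and the engine schedules its own crawl. A minimal sketch of the site-side call; the endpoint URL and parameter name are invented for illustration, not any real engine's API:)

    import urllib.parse
    import urllib.request

    # Hypothetical endpoint of a push-based search engine; the URL
    # and "sitemap" parameter are placeholders for illustration.
    ENGINE_PING = "https://engine.example.org/ping"

    def ping_search_engine(sitemap_url):
        """Notify the engine that our sitemap changed; it decides
        when (and whether) to re-crawl."""
        query = urllib.parse.urlencode({"sitemap": sitemap_url})
        with urllib.request.urlopen(f"{ENGINE_PING}?{query}") as resp:
            return resp.status  # 200 = notification accepted

    # Called from a publish hook, e.g.:
    # ping_search_engine("https://myblog.example/sitemap.xml")

(As the comments above note, the hard part is not the mechanism, which is trivial, but adoption: every site would have to be convinced to add the call.)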
| waynesonfire wrote:
| hope these guys team up w/ archive.org
| rektide wrote:
| relates closely to the recent "Only Google is allowed to crawl
| the web" [1][2] post.
|
| [1] https://news.ycombinator.com/item?id=26592635
|
| [2] https://knuckleheads.club/
| staunch wrote:
| It seems like Common Crawl is doing a lot of awesome stuff, but
| they're _not_ attacking Google's stranglehold head on.
|
| Presumably this is because they lack the money to do so. Have
| they attempted to estimate how much it would cost per year to
| crawl the web as aggressively and comprehensively as Google
| does? I've checked their site and didn't find anything like
| that.
|
| If they came up with a number, say $2 or $10 billion per year,
| it might actually be possible to gather enough donations to
| dethrone Google.
|
| A lot of Google competitors would love to see them dethroned,
| and it would be a huge win for virtually everyone else too.
| There's no one in the world who wants Google to maintain their
| web-search monopoly indefinitely.
| ricardo81 wrote:
| Good resource, admirable intention, great that it simply
| exists. Good-sized index.
|
| I see a lot of people subscribe to the idea of this being the
| feeder for alternative search engines.
|
| I'd guess part of the problem with doing things this way is
| 'crawl priority': which pages the search engine thinks are the
| next best ones to crawl is totally out of their hands, or at
| least they'd still need to crawl on top of the Common Crawl
| data.
|
| The recent UK CMA report into monopolies in online advertising
| estimated Google's index to be around 500-600 billion pages in
| size and Bing's to be 100-200 billion pages [0]. Of course,
| what you define as a 'page' is subjective, given URL canonicals
| and page similarity.
|
| At the very least, Common Crawl gets around crawl-rate-limiting
| problems by being one massive download.
|
| It would be interesting to know whether an appreciable % of
| site owners block it, though going on past data (the UK CMA
| report has some data on this too), it's not a huge issue.
|
| [0]
| https://assets.publishing.service.gov.uk/media/5efc57ed3a6f4...
| (page 89)
| Leary wrote:
| Great resource. Does anyone have a good free source for popular
| keywords/topics on Google/the internet?
| ttfxxcc wrote:
| https://keywordshitter.com/
| tyingq wrote:
| https://trends.google.com/trends/ is probably the best resource
| for the "top" queries, though it doesn't dive too far down the
| list.
|
| Particularly their "Year in Search" entries, like:
| https://trends.google.com/trends/yis/2020/US/
| dang wrote:
| The interesting past threads seem to be the following. Others?
|
| _Ask HN: What would be the fastest way to grep Common Crawl?_ -
| https://news.ycombinator.com/item?id=22214474 - Feb 2020 (7
| comments)
|
| _Using Common Crawl to play Family Feud_ -
| https://news.ycombinator.com/item?id=16543851 - March 2018 (4
| comments)
|
| _Web image size prediction for efficient focused image
| crawling_ - https://news.ycombinator.com/item?id=10107819 - Aug
| 2015 (5 comments)
|
| _102TB of New Crawl Data Available_ -
| https://news.ycombinator.com/item?id=6811754 - Nov 2013 (37
| comments)
|
| _SwiftKey's Head Data Scientist on the Value of Common Crawl's
| Open Data [video]_ -
| https://news.ycombinator.com/item?id=6214874 - Aug 2013 (2
| comments)
|
| _A Look Inside Our 210TB 2012 Web Corpus_ -
| https://news.ycombinator.com/item?id=6208603 - Aug 2013 (36
| comments)
|
| _Blekko donates search data to Common Crawl_ -
| https://news.ycombinator.com/item?id=4933149 - Dec 2012 (36
| comments)
|
| _Common Crawl_ - https://news.ycombinator.com/item?id=3690974 -
| March 2012 (5 comments)
|
| _CommonCrawl: an open repository of web crawl data that is
| universally accessible_ -
| https://news.ycombinator.com/item?id=3346125 - Dec 2011 (8
| comments)
|
| _Tokenising the english text of 30TB common crawl_ -
| https://news.ycombinator.com/item?id=3342543 - Dec 2011 (7
| comments)
|
| _Free 5 Billion Page Web Index Now Available from Common Crawl
| Foundation_ - https://news.ycombinator.com/item?id=3209690 -
| Nov 2011 (39 comments)
| Grimm1 wrote:
| I've used this, and it's invaluable for all types of things,
| but a feeder for Google killers it is not.
|
| They don't approach the scale of what Google crawls; they state
| as much. Nor do they crawl on the same timeline as Google. This
| is really nice for research or kick-starting a project, but it
| isn't a long-term viable solution for alternative search
| engines. Between breadth, depth, timeline/speed, priority, and
| information captured, it falls well short.
| ziftface wrote:
| Well, that's to be expected, but if a lot of search engines
| start using it, it's likely that websites will start allowing
| it to crawl and index their pages. So there might be potential
| there.
| Grimm1 wrote:
| It's not an issue of being allowed to crawl.
| ziftface wrote:
| Oh, I guess I misunderstood then. Why don't they crawl at
| Google's scale?
| samcgraw wrote:
| Love to see this.
|
| As an aside, it always jars me when a site hijacks the default
| browser scrolling behavior. In my experience, making scrolling
| as fast as possible is a _far_ better use of dev resources than
| figuring out how to make it unique (no matter what the
| marketing department says).
| yesenadam wrote:
| > it always jars me when a site hijacks default browser
| scrolling functionality
|
| I assume you are saying this because this site does it. What do
| you mean? I can't see any difference from normal scrolling
| behavior on there.
| psKama wrote:
| How feasible would it be to store all that data on a
| decentralized system like IPFS or Sia-Skynet etc., instead of
| Amazon, to add further meaning to the cause?
| gillesjacobs wrote:
| Blockchain storage is going to cost you a pretty penny if you
| were to store all of Common Crawl's petabytes, so not very
| feasible.
| gloriousternary wrote:
| More so than Amazon? From my (limited) experience, blockchain
| storage solutions are often less expensive, although I've never
| worked with petabytes of data, so maybe it's different at that
| scale.
| psKama wrote:
| That's not correct. When it comes to storage and transfer,
| blockchain alternatives are a fraction of the cost of Amazon.
| For example, Sia Skynet is offering $5/TB/month storage [1]. If
| you skip Skynet and run your own Sia node, the price can go as
| low as $2/TB/month, depending on market conditions.
|
| [1] https://blog.sia.tech/announcing-skynet-premium-plans-
| faster...
| coder543 wrote:
| Amazon is hosting the Common Crawl on S3 for free, so... yes,
| $2/TB/month is a lot more expensive.
| gillesjacobs wrote:
| It seems that, at least on Sia's plans, you can host at most
| 20TB for $80/month, which is not even a tenth of a monthly
| Common Crawl.
|
| Of course, Sia's Skynet plans are package deals right now, and
| I guess they're currently bootstrapping the network with users.
| Filecoin has no operational storage yet. Storj quotes
| $10/TB/month [1], so that will come out expensive.
|
| 1. https://www.storj.io/blog/2019/11/announcing-
| pioneer-2-and-t...
| duskwuff wrote:
| You may not grasp just how large the Common Crawl dataset is.
| It's been growing steadily at 200-300 TB _per month_ for the
| last few years. I'm not certain how large the entire corpus is
| at this point, but it's almost certainly in the tens to low
| hundreds of petabytes. (This is significantly larger than the
| capacity of the entire Sia network, for example.)
|
| Storing a dataset of this size and making it available online
| is not inexpensive. Amazon has generously donated their
| services to handle both of these tasks; it would be foolish to
| turn them down.
| duskwuff wrote:
| (Update: the complete Common Crawl dataset is actually a little
| smaller than I thought, at 6.4 PB. That's still pretty big,
| though.)
| new_realist wrote:
| It's not clear, but it looks like the last crawl was 280 TiB
| (100 TiB compressed) and contains a snapshot of the web at that
| point; i.e. you don't need prior snapshots unless you're
| interested in historical content.
|
| EDIT: the state of the crawls is summarized at
| https://commoncrawl.github.io/cc-crawl-statistics/.
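(One practical consequence of the S3 hosting worth spelling out: a consumer never has to mirror those petabytes. Each record in the URL index sketched earlier names a WARC file plus a byte offset and length, so a single capture can be pulled out of a multi-terabyte crawl with one ranged HTTP request. A minimal sketch under the same assumptions as before, i.e. the public index endpoint and the commoncrawl S3 bucket being reachable over plain HTTPS:)

    import gzip
    import json
    import urllib.parse
    import urllib.request

    # Look up one capture in the index (see the earlier sketch) ...
    INDEX = "https://index.commoncrawl.org/CC-MAIN-2021-10-index"
    query = urllib.parse.urlencode({"url": "commoncrawl.org", "output": "json"})
    with urllib.request.urlopen(f"{INDEX}?{query}") as resp:
        rec = json.loads(resp.read().splitlines()[0])

    # ... then pull just that record out of the public bucket with a
    # Range header built from the record's offset and length fields.
    offset, length = int(rec["offset"]), int(rec["length"])
    req = urllib.request.Request(
        f"https://commoncrawl.s3.amazonaws.com/{rec['filename']}",
        headers={"Range": f"bytes={offset}-{offset + length - 1}"},
    )
    with urllib.request.urlopen(req) as resp:
        raw = resp.read()

    # Each record is an independently gzipped member, so the byte
    # slice decompresses on its own into one WARC record (headers
    # plus the captured payload).
    record = gzip.decompress(raw)
    print(record[:300])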
| riedel wrote:
| As a European (German), I always wonder about the legal basis
| for a) making a copy of copyrighted material and databases
| available and b) processing the personal data they contain.
|
| The Internet Archive seems more like a library, with exceptions
| applying, but Common Crawl seems to advertise many other
| purposes that go beyond archiving publicly relevant content.
|
| Would this be possible in Europe, too? My feeling is that US
| legislation is different here. Do you have to actively claim
| copyright in the US, or enforce it technically, e.g. via DRM?
| Can anyone use anything without a license as long as nobody
| finds out?
| new_realist wrote:
| Soon, you and your friends will be able to host your own
| private search engine at modest cost and enjoy total privacy.
| kristopolous wrote:
| I'm kinda surprised this is new to people. It's 10 years old.
| Is this really the first time it's been talked about here?
| TechBro8615 wrote:
| It's the 39th time, apparently, as you can see by clicking the
| domain next to the submission.
| kristopolous wrote:
| Oh, I always forget about that feature. Excuse my stupidity.
___________________________________________________________________
(page generated 2021-03-26 23:00 UTC)