[HN Gopher] Common Crawl
       ___________________________________________________________________
        
       Common Crawl
        
       Author : Aissen
       Score  : 283 points
       Date   : 2021-03-26 16:42 UTC (6 hours ago)
        
 (HTM) web link (commoncrawl.org)
 (TXT) w3m dump (commoncrawl.org)
        
       | smaddox wrote:
       | I was hoping it was going to be a massively multiplayer dungeon
       | crawling game...
        
       | breck wrote:
       | Love it! Just donated.
        
       | cblconfederate wrote:
        | I don't think this is the answer to "only Google can crawl
        | the web". This is a huge archive, maybe suitable for building
        | a web search engine.
       | 
        | What if you want to make a simple link previewer? An abstract
        | crawler for scientific articles? Most websites are behind
        | Cloudflare, which will block/captcha you but happily
        | whitelists only Google & the major social sites. The answer is
        | measures that bring the web back to basics, not this over-
        | SEOed, bot-infested ecosystem. The FAANGs succeeded in sucking
        | all the information out of the web, but they are terrible at
        | creating interoperable protocols (even Twitter now needs its
        | own tags!).
       | 
        | Incidentally, maybe the next search engine should use a push
        | system, where websites ping it whenever they have updates. If
        | the engine has unique features, it might actually be seen as a
        | way to reduce the load from bots.
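        | 
        | A push protocol along these lines already exists (WebSub). A
        | minimal hypothetical sketch of the receiving end in Python
        | (the /ping endpoint and the queue are made up for
        | illustration):
        | 
        |     import queue
        |     from http.server import BaseHTTPRequestHandler, HTTPServer
        | 
        |     crawl_queue = queue.Queue()  # URLs waiting to be fetched
        | 
        |     class PingHandler(BaseHTTPRequestHandler):
        |         def do_POST(self):
        |             if self.path != "/ping":
        |                 self.send_response(404)
        |                 self.end_headers()
        |                 return
        |             length = int(self.headers.get("Content-Length", 0))
        |             url = self.rfile.read(length).decode().strip()
        |             # A real engine would verify ownership and
        |             # rate-limit the sender before queueing.
        |             crawl_queue.put(url)
        |             self.send_response(202)  # accepted for crawling
        |             self.end_headers()
        | 
        |     HTTPServer(("", 8080), PingHandler).serve_forever()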
        
         | gillesjacobs wrote:
         | This is mainly used for language modeling research. A filtered
         | CC was used in GPT3 and I have personally used data from CC for
         | NLP projects.
        
         | shadowgovt wrote:
         | I don't think we can bring the web back to basics in the sense
         | you're envisioning without kicking most users off of it.
         | 
         | Cloudflare's protection is to guard against traffic spikes and
         | automated malicious attacks; Google and social sites are allow-
         | listed because they're trustable entities with well-defined
         | online presences and a vested interest (generally) in not
         | breaking sites. "How do we do away with the need to put up an
         | anti-automated-traffic screen plus whitelist" is in the same
         | category of problem as "How do we change the modern web to
         | address automated malicious attacks?"
        
           | neura wrote:
            | Another way to look at this: if CDNs didn't allow Google,
            | nobody would want to use them. Their customers' content
            | wouldn't get indexed, and anybody doing a search would be
            | directed to someone else who doesn't put their content
            | behind a CDN with that level of protection. That, or
            | someone like Google would just solve the problem
            | themselves and be both the CDN and the indexer, bringing
            | them one step closer to complete ownership of finding
            | anything on the web.
        
         | th0ma5 wrote:
          | Actually, I was able to search URLs across the entire
          | collection with very little RAM, as they publish a series of
          | indexes you can download.
          | 
          | In theory someone could do something similar with terms, or
          | you could first use the URLs to filter what you download
          | into Elasticsearch or Solr and do your own custom search
          | that way.
          | 
          | The indexes are really neat though; I highly recommend
          | playing with them.
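          | 
          | The per-crawl URL index is also queryable over HTTP via the
          | public index server. A minimal sketch in Python against the
          | CDX index API (the CC-MAIN-2021-10 crawl label is just an
          | example):
          | 
          |     import json
          |     import requests  # pip install requests
          | 
          |     # Look up captures of a URL prefix in one crawl's index.
          |     resp = requests.get(
          |         "https://index.commoncrawl.org/CC-MAIN-2021-10-index",
          |         params={"url": "commoncrawl.org/*", "output": "json"},
          |     )
          |     records = [json.loads(l) for l in resp.text.splitlines()]
          | 
          |     # Each record points at a byte range inside a WARC file
          |     # on S3, so one archived page can be fetched on its own.
          |     rec = records[0]
          |     start = int(rec["offset"])
          |     end = start + int(rec["length"]) - 1
          |     warc_gz = requests.get(
          |         "https://commoncrawl.s3.amazonaws.com/" + rec["filename"],
          |         headers={"Range": f"bytes={start}-{end}"},
          |     ).content  # a single gzipped WARC record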
        
           | thunderbong wrote:
           | Can you suggest where I can get them from?
        
         | breischl wrote:
         | >maybe the next search engine should use a push-system
         | 
         | So setting up a new search engine that way would require going
         | to every site and convincing them to notify you of changes.
         | Wouldn't that be even more limited than the current Cloudflare
         | whitelist system? At least there's some chance you can get
         | around the whitelist system.
        
           | neura wrote:
           | This... I couldn't get past the irony of the comment.
            | Basically, the problem is that sites only let Google and
            | friends index them. The solution? Sites should only send
            | index data to Google and friends.
           | 
           | I mean, I get it, then sites can send to any number of
           | indexers, but let's be honest, like you say, any new search
           | engine has to get sites to push data out to them. That's just
           | not. going. to. happen.
        
       | waynesonfire wrote:
       | hope these guys team up w/ archive.org
        
       | rektide wrote:
       | relates closely to the recent "Only Google is allowed to crawl
       | the web"[1][2] post.
       | 
       | [1] https://news.ycombinator.com/item?id=26592635
       | 
       | [2] https://knuckleheads.club/
        
       | staunch wrote:
        | It seems like Common Crawl is doing a lot of awesome stuff,
        | but they're _not_ attacking Google's stranglehold head on.
       | 
       | Presumably this is because they lack the money to do so. Have
       | they attempted to estimate how much it would cost per year to
       | crawl the web as aggressively and comprehensively as Google does?
       | I've checked their site and didn't find anything like that.
       | 
       | If they came up with a number, say $2 or $10 billion per year, it
       | might actually be possible to gather enough donations to dethrone
       | Google.
       | 
       | A lot of Google competitors would love to see them dethroned. And
       | it would be a huge win for virtually everyone else too. There's
       | no one in the world that wants Google to maintain their web
       | search monopoly indefinitely.
        
       | ricardo81 wrote:
        | Good resource, admirable intention, great that it simply
        | exists. Good-sized index.
       | 
       | I see a lot of people subscribe to the idea of this being the
       | feeder to alternative search engines.
       | 
        | I'd guess part of the problem with doing things this way is
        | 'crawl priority': which pages the engine thinks are the next
        | best ones to crawl is totally out of its hands, or at least
        | it would still need to crawl on top of the Common Crawl data.
       | 
       | The recent UK CMA report into monopolies in online advertising
       | estimated Google's index to be around 500-600 billion pages in
       | size and Bing's to be 100-200 billion pages in size [0]. Of
       | course, what you define as a 'page' is subjective given URL
       | canonicals and page similarity.
       | 
        | At the very least, the Common Crawl gets around crawl-rate-
        | limiting problems by being one massive download.
        | 
        | Would be interesting to know whether an appreciable % of site
        | owners block it, though going on past data (the UK CMA report
        | has some data on this too), it's not a huge issue. (A quick
        | way to check a single site is sketched below.)
       | 
       | [0]
       | https://assets.publishing.service.gov.uk/media/5efc57ed3a6f4...
       | (page 89)
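        | 
        | Sites opt out by disallowing Common Crawl's user agent, CCBot,
        | in robots.txt, so checking one site takes a few lines of the
        | Python standard library (a sketch; example.com stands in for
        | whatever site you care about):
        | 
        |     import urllib.robotparser
        | 
        |     # Does this site's robots.txt allow Common Crawl (CCBot)?
        |     rp = urllib.robotparser.RobotFileParser()
        |     rp.set_url("https://example.com/robots.txt")
        |     rp.read()
        |     print(rp.can_fetch("CCBot", "https://example.com/"))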
        
       | Leary wrote:
       | Great resource. Does anyone have a good free source for popular
       | keywords/topics on google/the internet?
        
         | ttfxxcc wrote:
         | https://keywordshitter.com/
        
         | tyingq wrote:
         | https://trends.google.com/trends/ is probably the best resource
         | for the "top" queries, though it doesn't dive too far down the
         | list.
         | 
         | Particularly their "Year in Search" entries, like:
         | https://trends.google.com/trends/yis/2020/US/
        
       | dang wrote:
       | The interesting past threads seem to be the following. Others?
       | 
       |  _Ask HN: What would be the fastest way to grep Common Crawl?_ -
       | https://news.ycombinator.com/item?id=22214474 - Feb 2020 (7
       | comments)
       | 
       |  _Using Common Crawl to play Family Feud_ -
       | https://news.ycombinator.com/item?id=16543851 - March 2018 (4
       | comments)
       | 
       |  _Web image size prediction for efficient focused image crawling_
       | - https://news.ycombinator.com/item?id=10107819 - Aug 2015 (5
       | comments)
       | 
       |  _102TB of New Crawl Data Available_ -
       | https://news.ycombinator.com/item?id=6811754 - Nov 2013 (37
       | comments)
       | 
       |  _SwiftKey's Head Data Scientist on the Value of Common Crawl's
       | Open Data [video]_ - https://news.ycombinator.com/item?id=6214874
       | - Aug 2013 (2 comments)
       | 
       |  _A Look Inside Our 210TB 2012 Web Corpus_ -
       | https://news.ycombinator.com/item?id=6208603 - Aug 2013 (36
       | comments)
       | 
       |  _Blekko donates search data to Common Crawl_ -
       | https://news.ycombinator.com/item?id=4933149 - Dec 2012 (36
       | comments)
       | 
       |  _Common Crawl_ - https://news.ycombinator.com/item?id=3690974 -
       | March 2012 (5 comments)
       | 
       |  _CommonCrawl: an open repository of web crawl data that is
       | universally accessible_ -
       | https://news.ycombinator.com/item?id=3346125 - Dec 2011 (8
       | comments)
       | 
       |  _Tokenising the english text of 30TB common crawl_ -
       | https://news.ycombinator.com/item?id=3342543 - Dec 2011 (7
       | comments)
       | 
       |  _Free 5 Billion Page Web Index Now Available from Common Crawl
       | Foundation_ - https://news.ycombinator.com/item?id=3209690 - Nov
       | 2011 (39 comments)
        
       | Grimm1 wrote:
       | I've used this and it's invaluable for all types of things but a
       | feeder for Google killers it is not.
       | 
        | They don't approach the scale of what Google crawls; they say
        | as much themselves. Nor do they crawl on the same timeline as
        | Google. This is really nice for research or for kick-starting
        | a project, but it isn't a long-term viable solution for
        | alternative search engines. On breadth, depth, timeliness,
        | priority, and information captured, it falls well short.
        
         | ziftface wrote:
         | Well that's to be expected, but if a lot of search engines
         | start using it, it's likely that websites will start allowing
         | it to crawl and index their pages. So there might be potential
         | there.
        
           | Grimm1 wrote:
           | It's not an issue of being allowed to crawl.
        
             | ziftface wrote:
             | Oh I guess I misunderstood then. Why don't they crawl at
             | Google's scale?
        
       | samcgraw wrote:
       | Love to see this.
       | 
       | As an aside, it always jars me when a site hijacks default
       | browser scrolling functionality. In my experience, making it as
       | fast as possible is a _far_ better use of dev resources than
       | figuring out how to make scrolling unique (no matter what the
       | marketing department says).
        
         | yesenadam wrote:
         | > it always jars me when a site hijacks default browser
         | scrolling functionality
         | 
         | I assume you are saying this because this site does it. What do
         | you mean? I can't see any difference from normal scrolling
         | functionality on there.
        
       | psKama wrote:
       | How feasible would it be to store all that data on a
       | decentralized system like IPFS or Sia-Skynet etc, instead of
       | Amazon, to add further meaning to the cause?
        
         | gillesjacobs wrote:
          | Blockchain storage would cost you a pretty penny if you
          | were to store all of Common Crawl's petabytes, so not very
          | feasible.
        
           | gloriousternary wrote:
            | More so than Amazon? From my (limited) experience,
            | blockchain storage solutions are often less expensive,
            | although I've never worked with petabytes of data, so
            | maybe it's different at that scale.
        
           | psKama wrote:
            | That's not correct. When it comes to storage and
            | transfer, blockchain alternatives are a fraction of the
            | cost of Amazon. For example, Sia's Skynet offers storage
            | at $5/TB/month [1]. If you skip Skynet and run your own
            | Sia node, the price can go as low as $2/TB/month,
            | depending on market conditions.
           | 
           | [1] https://blog.sia.tech/announcing-skynet-premium-plans-
           | faster...
        
             | coder543 wrote:
             | Amazon is hosting the Common Crawl on S3 for free, so...
             | yes, $2/month/TB is a lot more expensive.
        
             | gillesjacobs wrote:
              | It seems that on Sia's plans you can host at most 20 TB
              | for $80/month, not even a tenth of a single monthly
              | crawl.
              | 
              | Of course, Sia's Skynet plans are package deals right
              | now, and I guess they're currently bootstrapping the
              | network with users. Filecoin has no operational storage
              | yet. Storj quotes $10/TB/month [1], so that will come
              | out expensive.
             | 
             | 1. https://www.storj.io/blog/2019/11/announcing-
             | pioneer-2-and-t...
        
         | duskwuff wrote:
         | You may not grasp just how large the Common Crawl dataset is.
         | It's been growing steadily at 200-300 TB _per month_ for the
          | last few years. I'm not certain how large the entire corpus is
         | at this point, but it's almost certainly in the tens to low
         | hundreds of petabytes. (This is significantly larger than the
         | capacity of the entire Sia network, for example.)
         | 
         | Storing a dataset of this size and making it available online
         | is not inexpensive. Amazon has generously donated their
         | services to handle both of these tasks; it would be foolish to
         | turn them down.
        
           | duskwuff wrote:
           | (Update: the complete Common Crawl dataset is actually a
           | little smaller than I thought, at 6.4 PB. That's still pretty
           | big, though.)
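            | 
            | A back-of-envelope on what 6.4 PB would cost at the
            | per-TB prices quoted elsewhere in this thread (the
            | thread's numbers, not anyone's current price list):
            | 
            |     corpus_tb = 6.4 * 1000  # 6.4 PB in TB, decimal units
            |     rates = [("Sia, own node", 2),
            |              ("Skynet plan", 5),
            |              ("Storj", 10)]  # USD per TB per month
            |     for name, usd in rates:
            |         print(f"{name}: ${corpus_tb * usd:,.0f}/month")
            |     # Sia, own node: $12,800/month
            |     # Skynet plan: $32,000/month
            |     # Storj: $64,000/month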
        
           | new_realist wrote:
           | It's not clear, but it looks like the last crawl was 280 TiB
           | (100 TiB compressed) and contains a snapshot of the web at
           | that point; i.e. you don't need prior snapshots unless you're
           | interested in historical content.
           | 
            | EDIT: the state of the crawls is summarized at
           | https://commoncrawl.github.io/cc-crawl-statistics/.
        
       | riedel wrote:
        | As a European (German), I always wonder about the legal basis
        | for a) making a copy of copyrighted material and databases
        | available and b) processing the personal data they contain.
        | 
        | The Internet Archive seems more like a library, with
        | exceptions applying, but Common Crawl seems to advertise many
        | other purposes that go beyond archiving publicly relevant
        | content.
        | 
        | Would this be possible in Europe, too? My feeling is that US
        | legislation is different here. Do you have to actively claim
        | copyright in the US, or enforce it technically, e.g. via DRM?
        | Can anyone use anything without a license as long as nobody
        | finds out?
        
       | new_realist wrote:
       | Soon, you and your friends can host your own private search
       | engine at modest cost and enjoy total privacy.
        
       | kristopolous wrote:
       | I'm kinda surprised this is new to people. It's 10 years old. Is
       | this really the first time it's been talked about here?
        
         | TechBro8615 wrote:
         | It's the 39th time, apparently, as you can see by clicking the
         | domain next to the submission.
        
           | kristopolous wrote:
           | Oh I always forget about that feature. Excuse my stupidity.
        
       ___________________________________________________________________
       (page generated 2021-03-26 23:00 UTC)