[HN Gopher] Internet Archive Scholar: Search Millions of Researc...
       ___________________________________________________________________
        
       Internet Archive Scholar: Search Millions of Research Papers
        
       Author : bnewbold
       Score  : 144 points
       Date   : 2021-03-09 18:06 UTC (4 hours ago)
        
 (HTM) web link (blog.archive.org)
 (TXT) w3m dump (blog.archive.org)
        
       | nathias wrote:
       | archive.org is really one of the few things still good on the
       | internet. It has been invaluable for my studies; I can't imagine
       | what the previous generations, who could only access 5% of
       | sources, were even doing.
        
       | 8bitsrule wrote:
       | Oh yeah! Tried this on several specific topics I've looked at
       | recently (2 years ago, 7ya, and 150ya) and the results were fast
       | and on the mark. I'll certainly favor using Scholar over IA
       | searches. Congratulations!
        
       | BugsJustFindMe wrote:
       | I couldn't find a list of what sources (like which journals)
       | they're archiving from. Does anyone know where to find that? It
       | would be nice to see what subject categories the archive covers.
        
       | jahewson wrote:
       | I took one look at that logo and concluded "this is not for me".
        
         | throwaway8451 wrote:
         | Here is an appropriate soundtrack for browsing the results:
         | 
         | https://www.youtube.com/watch?v=x8gBfEDoEbY
        
         | simonw wrote:
         | I had the exact opposite reaction. That logo is fabulous.
        
         | AnimalMuppet wrote:
         | If you're going to judge it by the logo rather than by the
         | search results, it almost certainly is not for you...
        
       | betamaxthetape wrote:
       | This is amazing. I had a play around with it whilst it was in
       | beta, and was blown away by the variety of papers returned. On a
       | whim I searched for a very obscure topic that I'd researched
       | before (just for personal interest) using WorldCat / Google
       | Scholar, and to my surprise was presented with several
       | highly relevant papers I'd never come across before, that were
       | _exactly_ what I was looking for.
        
       | carbocation wrote:
       | Interesting. For my field (cardiovascular genetics), the results
       | weren't really what I was expecting. I think that my expectations
       | probably fit pretty well with a PageRank graph of citations. So
       | my guess is that the "relevancy" is semantic only?
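
For readers unfamiliar with the idea mentioned above, here is a minimal
sketch of PageRank computed over a citation graph. It is not how Scholar
ranks results (the thread does not say). The function name `pagerank`,
the damping factor of 0.85, the iteration count, and the three-paper toy
graph are all illustrative assumptions.

```python
def pagerank(citations, damping=0.85, iterations=50):
    """Rank papers by citation structure using plain power iteration.

    `citations` maps each paper id to the list of paper ids it cites.
    All ids and the default parameters are hypothetical.
    """
    # Collect every paper that appears as a citer or as a cited work.
    papers = set(citations)
    for cited in citations.values():
        papers.update(cited)

    n = len(papers)
    rank = {p: 1.0 / n for p in papers}

    for _ in range(iterations):
        # Every paper receives a small baseline share of rank.
        new_rank = {p: (1.0 - damping) / n for p in papers}
        for paper in papers:
            cited = citations.get(paper, [])
            if cited:
                # A paper passes its rank evenly to the works it cites.
                share = damping * rank[paper] / len(cited)
                for target in cited:
                    new_rank[target] += share
            else:
                # A paper citing nothing spreads its rank over everyone.
                share = damping * rank[paper] / n
                for target in papers:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Toy graph: papers "a" and "b" both cite "c"; "c" cites nothing.
ranks = pagerank({"a": ["c"], "b": ["c"], "c": []})
top_paper = max(ranks, key=ranks.get)
```

In this toy graph the most-cited paper ends up with the highest rank, which
is the kind of ordering a purely text-based relevance score would not give.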
        
       | sundarurfriend wrote:
       | (OffTopic) All this talk about the logo here made me check the
       | page out, instead of moving on after reading just the comments as
       | I might otherwise have done. Perhaps that's a HN strategy to use,
       | to get people to actually click through - add a bikesheddy thing
       | to the page that's likely to be divisive, but doesn't require
       | thought. Gives us a cheap way to have an opinion, and thus an
       | incentive to click!
        
       | endisneigh wrote:
       | I'm curious, how does the Internet Archive handle copyright with
       | all of its services?
        
       | marcodiego wrote:
       | The Internet Archive is becoming a good alternative internet. It
       | has a web archive, film archive, software archive, media
       | archive... and now a research paper archive. That is the internet
       | as a giant library, as we dreamed of in the early '90s.
        
         | Black101 wrote:
         | Way too centralized (Centranet?), but it is very nice for now.
         | It's a bit like the library of Alexandria, so it could
         | change/disappear at any time.
        
           | dbrereton wrote:
           | I'm sure they'd be willing to decentralize it if there was a
           | good way to do that. Maybe this can be done with something
           | like IPFS [0].
           | 
           | [0] https://ipfs.io/
        
             | zucker42 wrote:
             | The amount of data is absolutely insane.
        
             | Black101 wrote:
             | Yes, they have very good intentions right now, but what if
             | the leader gets hit by a bus?
        
               | musicale wrote:
               | Presumably it would be acquired, paywalled, and
               | monetized by a private equity firm (or some suitably
               | hostile intellectual property rightsholder organization)
               | before going bankrupt and shutting down for good.
               | 
               | Thanks for an incredible journey.
        
         | puddingnomeat wrote:
         | Is it easy to have a local copy?
        
       | capableweb wrote:
       | Internet Archive strikes again! I love Internet Archive, not just
       | for archiving websites but for archiving everything and making it
       | easily accessible. This is another great service that'll help a
       | lot of researchers and hobby-researchers, which is lovely to see.
       | 
       | Don't forget to donate if you also like Internet Archive, they
       | need every penny: https://archive.org/donate/?origin=hn
        
       | bnewbold wrote:
       | This service was hinted at back in September, but is now formally
       | announced and live at https://scholar.archive.org
       | 
       | Related previous post:
       | https://news.ycombinator.com/item?id=24485444
       | 
       | Much of the catalog functionality can be accessed from the
       | fatcat.wiki API (https://api.fatcat.wiki/redoc). Scholar adds a
       | search index over the body content of papers, and we are still
       | thinking through how to make this available through a public API
       | without increasing query latency even more.
       | 
       | Folks here might also be interested in this CLI for interfacing
       | with the catalog and making edits:
       | https://gitlab.com/bnewbold/fatcat-cli
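
As a rough illustration of querying the catalog API mentioned above, the
sketch below builds a lookup URL for a release by DOI and, optionally,
fetches it. The `/v0/release/lookup` path and the `doi` query parameter
are assumptions based on the linked API docs and should be checked there
before use. The helper names and the placeholder DOI are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base path; verify against https://api.fatcat.wiki/redoc.
API_BASE = "https://api.fatcat.wiki/v0"


def release_lookup_url(doi):
    """Build a catalog lookup URL for a DOI (endpoint name is assumed)."""
    query = urlencode({"doi": doi})
    return f"{API_BASE}/release/lookup?{query}"


def fetch_release(doi, timeout=10):
    """Fetch and decode the JSON record; performs a network request."""
    with urlopen(release_lookup_url(doi), timeout=timeout) as response:
        return json.load(response)


# Placeholder DOI for illustration only; no request is made here.
example_url = release_lookup_url("10.1000/xyz123")
```

Only `release_lookup_url` runs at import time. Calling `fetch_release`
needs network access and a DOI that actually exists in the catalog.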
        
         | breck wrote:
         | I absolutely love everything about it (the logo <3).
         | 
         | Super fast. All my test searches returned what I was looking
         | for.
         | 
         | What is your relationship with semantic scholar like?
         | 
         | Any plans to integrate ranking signals like references, etc?
         | 
         | I'm going to double my monthly donation. This is great.
        
           | bnewbold wrote:
           | Thank you for the kind words!
           | 
           | We are friendly with Semantic Scholar, and have used their
           | "open corpus" dumps as one of several URL seed lists for
           | crawling in the past. Their search and discovery tech is more
           | sophisticated than ours is likely to be any time soon
           | (https://medium.com/ai2-blog/building-a-better-search-
           | engine-...). We would love to get to the place where groups
           | like AI2, which are primarily research-oriented, could build
           | on an existing open catalog and corpus, and not need to
           | duplicate time crawling, merging catalogs, cleaning metadata,
           | etc. As of today Microsoft Academic (used by Semantic
           | Scholar) might be a better option.
           | 
           | Want to be thoughtful about ranking signals, and are deeply
           | skeptical of journal impact factor, h-index, and most
           | bibliometrics. "Has this been cited more than a handful of
           | times" seems like a reasonable coarse boost. Hope to include
           | more curated signals, like "won a paper prize", "journal in
           | DOAJ and other reviewed indices", etc.
           | 
           | Have been working on a citation graph, keep an eye out for
           | something about that in coming months. One cool thing we hope
           | to do with the citation graph is find "missing works" not yet
           | in the catalog (eg, don't have a DOI, especially for pre-1990
           | era).
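
The two ideas in the comment above, a coarse citation boost and finding
"missing works" through the citation graph, can be sketched roughly as
follows. This is not the Scholar implementation. The function names, the
threshold of 5 standing in for "a handful", the 1.2 multiplier, and the
toy identifiers are all assumptions made for illustration.

```python
from collections import Counter


def coarse_citation_boost(score, citation_count, threshold=5, boost=1.2):
    """Multiply a relevance score by a flat factor once a work has been
    cited more than `threshold` times; otherwise leave it unchanged.
    Both default values are hypothetical."""
    if citation_count > threshold:
        return score * boost
    return score


def find_missing_works(citation_edges, catalog):
    """Return (cited_key, times_cited) pairs for cited works that are not
    in the catalog, most frequently cited first.

    `citation_edges` is an iterable of (citing_id, cited_key) pairs and
    `catalog` is a set of known keys. Both shapes are assumed here.
    """
    missing = Counter(
        cited for _citing, cited in citation_edges if cited not in catalog
    )
    return missing.most_common()


# Toy data: two catalogued works that cite two uncatalogued ones.
catalog = {"w1", "w2"}
edges = [
    ("w1", "w2"),
    ("w1", "old-1975-report"),
    ("w2", "old-1975-report"),
    ("w2", "thesis-1988"),
]
missing_works = find_missing_works(edges, catalog)
boosted = coarse_citation_boost(1.0, citation_count=10)
unboosted = coarse_citation_boost(1.0, citation_count=2)
```

A flat boost keeps the ranking from leaning on fine-grained bibliometrics,
and the missing-works list gives a queue of uncatalogued items ordered by
how often catalogued papers refer to them.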
        
       ___________________________________________________________________
       (page generated 2021-03-09 23:00 UTC)