[HN Gopher] Internet Archive Scholar
       ___________________________________________________________________
        
       Internet Archive Scholar
        
       Author : nabla9
       Score  : 389 points
       Date   : 2022-12-09 10:35 UTC (12 hours ago)
        
 (HTM) web link (scholar.archive.org)
 (TXT) w3m dump (scholar.archive.org)
        
       | zozbot234 wrote:
       | Not from the IA, but see https://scholia.toolforge.org for an
       | especially nice presentation of freely-available scholarly
       | metadata.
        
         | jboynyc wrote:
         | You may also be interested in OpenAlex.org which also uses
         | wikidata (along with DOIs, ORCIDs, ISSNs and a few other
         | standard identifiers) to classify publications.
        
       | photochemsyn wrote:
       | After a little testing, this looks like a good information
       | source, although the combination of Google Scholar and sci-hub is
       | probably still the best option, i.e. I couldn't find anything
       | with Internet Archive Scholar that wasn't available with the other
       | options, and the quality of results on searches is somewhat higher
       | with Google Scholar (this may be because Google Scholar utilizes
       | citation count as a search parameter, which Internet Archive
       | Scholar doesn't seem to do).
       | 
       | Internet Archive is a great resource, however, it should get
       | state funding as it provides a fundamentally important archival
       | service. It's too bad it has to rely so much on private
       | philanthropic donations (although state support comes with
       | possible political interference, i.e. censorship of material that
       | some politician doesn't like, maybe that's less of a problem with
       | private donations, although then you could have some billionaire
       | doing the same thing).
        
         | macrolime wrote:
         | I tried to search for scientific authors from the 1800s and
         | they're there.
         | 
         | Google Scholar on the other hand brings me to paywalls, even
         | though the articles are so old they should be out of copyright.
        
         | bnewbold wrote:
         | The content in scholar.archive.org has been indexed into
         | Google Scholar (and other indices are likewise welcome to crawl
         | the sitemap). There was some content "only" in
         | scholar.archive.org, but now it should basically all be in
         | Google Scholar. We haven't gotten around to describing this
         | publicly, but it was an explicit decision and partnership
         | between the organizations.
         | 
         | Indeed scholar.archive.org does not currently use citation
         | count in search rankings. We have a decent citation graph,
         | which we are working to expose in scholar (it is visible in
         | fatcat.wiki today). Would probably only ever use citation count
         | as a weak boost in search rankings (eg, "any citations at all",
         | "more than 25 citations" as boosts, nothing beyond that), don't
         | want to create too strong a feedback loop influencing future
         | citations.
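         | 
         | As a rough illustration of that kind of stepped boost (a
         | sketch only, not the actual scholar.archive.org ranking code:
         | the two thresholds come from the examples above, while the
         | function names and the 1.1 multipliers are made up):

```python
def citation_boost(citation_count: int) -> float:
    """Return a weak, stepped multiplier for a text-relevance score.

    Only two steps exist, so a paper with 10,000 citations gets the
    same boost as one with 26. That caps the feedback loop.
    """
    boost = 1.0
    if citation_count > 0:       # "any citations at all"
        boost *= 1.1             # hypothetical multiplier
    if citation_count > 25:      # "more than 25 citations"
        boost *= 1.1             # hypothetical multiplier
    return boost


def rank(results):
    """Sort (title, text_score, citation_count) tuples, best first."""
    return sorted(
        results,
        key=lambda r: r[1] * citation_boost(r[2]),
        reverse=True,
    )


if __name__ == "__main__":
    hits = [
        ("uncited but very relevant", 0.90, 0),
        ("heavily cited, less relevant", 0.70, 5000),
        ("lightly cited, relevant", 0.85, 3),
    ]
    for title, score, cites in rank(hits):
        print(title, round(score * citation_boost(cites), 3))
```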
         | 
         | scholar.archive.org specifically was partially funded by the
         | Mellon Foundation (and partially through donations and other
         | service revenue). IA overall has diverse funding, including
         | grants and service revenue from the USA (Library of Congress,
         | IMLS, etc); other national governments (paid crawl services);
         | foundation grants; universities and libraries (crawl,
         | preservation, and digitization services); and of course general
         | donations. The last category of course has the fewest strings
         | and lets us pursue new projects which might be hard to get
         | traditional funding for. Remember that the whole premise of web
         | archiving was considered radical and quixotic at the beginning!
         | 
         | (source: I work at IA on scholar)
        
       | veqq wrote:
       | Some days, nearly half the links I click are dead so I've found
       | myself relying on the waybackmachine more and more over the past
       | few months. It's really shocking just how fast digital
       | obsolescence reared its ugly head. Of course angelcites etc. were
       | a clear early blow, but nowadays...
       | 
       | I've started saving the html (including the css seems like too
       | much overhead, and often it's incomplete or relies on downloads
       | still - and screenshots are not searchable) of every interesting
       | article I find online, downloading quite a few videos too with
       | yt-dlp. I'd long copypasted all interesting comments into a txt
       | file, but now it seems like data hoarding's the way to go - at
       | least in moderation, focusing on things I'll actually refer back
       | to.
       | 
       | I remember 15 years ago, discovering pdf dumps on random sites
       | like a kid in a candy store. Perhaps it'll be like that again,
       | with people presenting museums of their favorite old pages.
        
         | b1zguy wrote:
         | I've been wanting to run my own search engine sorta thingy that
         | indexes websites I feed it. I sometimes find little nooks of
         | the net that post resources I may need in the future. Like my
         | own mini-Google that indexes a list of sites.
         | 
         | How can I go about creating this? Are there off-the-shelf
         | solutions, or will I need to, say, combine Scrapy with
         | Elasticsearch? The links in this thread look promising.
        
         | graderjs wrote:
         | Some people use this tool (of mine) for saving web content from
         | either bookmarks or just everything you browse:
         | https://github.com/crisdosyago/Diskernet
         | 
         | There are also plenty of other similar tools:
         | 
         | - https://github.com/ArchiveBox/ArchiveBox
         | 
         | - https://github.com/gildas-lormeau/SingleFile
        
           | randomguy12 wrote:
           | Zotero also saves snapshots of pages if you already cite
           | academic pages
        
           | zote wrote:
           | How have I never seen your tool before.
           | 
           | >22120 archives content exactly as it is received and
           | presented by a browser, and it also replays that content
           | exactly as if the resource were being taken from online.
           | 
           | I've been looking for this for a really long time.
        
           | arminiusreturns wrote:
           | Beware the very strange and bad license for Diskernet, which
           | is "Polyform Strict License 1.0.0"
        
             | tetris11 wrote:
             | For people looking for more info on these strange licenses:
             | 
             | https://www.reddit.com/r/linux/comments/coazye/what_does_rl
             | i...
        
             | martyvis wrote:
             | That Polyform licence looks truly awful. Lack of mention of
             | how I could use it while working.
        
               | detaro wrote:
               | > _how I could use it while working._
               | 
               | The readme clearly links where to buy a license for that.
        
           | leslielurker wrote:
           | I'd never heard of SingleFile before but it looks excellent.
           | It would be great if Firefox could incorporate it into its
           | save function too. Firefox save page works but as shown on
           | the SingleFile demo video, it's not really what a user would
           | expect, it's often not complete and splitting it across
           | multiple files/directories isn't ideal either.
        
             | Springtime wrote:
             | It's unfortunate as Firefox used to have excellent MHTML
             | support (which similarly achieves an all-in-one file) via
             | addons, particularly the feature-rich _UnMHT_. While
             | Chromium and its derivatives support MHTML saving natively
             | (and in the past Opera Presto and IE).
             | 
             | If they brought back MHTML saving support it'd be a great
             | win.
        
           | tetris11 wrote:
           | > Coming to a future release, soon!: The ability to publish
           | your own search engine that you curated with the best
           | resources based on your expert knowledge and experience.
           | 
           | This would be fantastic, being able to browse a curated
           | internet made of accumulated lists from other trusted users
           | on the net, similar to how ad blocking lists work today.
           | 
           | You are genuinely trying to steer the internet into what it
           | used to be: a museum of knowledge and expert discussion.
           | 
           | Edit: Ah but wow, Polyform license. Huh.
        
         | msravi wrote:
         | I use the markdownload extension[1] on firefox and move the .md
         | file into my notes folder (notable[2]). Works very well.
         | 
         | 1. https://addons.mozilla.org/en-US/firefox/addon/markdownload/
         | 
         | 2. https://notable.app/
        
         | dspillett wrote:
         | _> I've started saving_
         | 
         | I do similar. I've had https://github.com/ArchiveBox/ArchiveBox
         | bookmarked for a while as something to try, to better organise
         | all that, but like a great many things I haven't got around to
         | it yet.
        
           | circustaco wrote:
           | I use Raindrop for this. It's a pretty great bookmark manager
           | made by an indie dev, but it also can create archives of
           | bookmarked pages.
           | 
           | https://help.raindrop.io/backups#permanent-library
        
             | dspillett wrote:
             | "Only available in Pro plan". No information on whether
             | that is a general limitation or if these features are
             | present if self-hosted, so I assume the former. And there
             | seems to be little obvious information for self-hosting.
             | 
             | So probably not one for my use case.
        
             | dspillett wrote:
             | Thanks, I'll add that to the list of things to try out.
        
             | patrickdward wrote:
             | I use Raindrop but didn't know about that feature. Thanks.
        
         | gwbas1c wrote:
         | I just save the pdf of a site that's _really_ important to me.
         | 
         | When I did a major college project in 2003, I made sure to make
         | pdfs of any academic article that I referenced. It actually
         | saved me, because some articles disappeared by the time I went
         | to revise my references.
        
         | 082349872349872 wrote:
         | While the author(s) are still alive, they are often a
         | productive contact.
         | 
         | (In one case I was able to give back: bundling up the several
         | scans an author had of a half-century old paper from their
         | student days into a single, hopefully cromulent, PDF)
         | 
         | Edit: recall also that accepting that links are one-way and
         | might be dead was _the key_ simplification that allowed HTTP to
         | take off after prior attempts at hypermedia had failed.
        
           | mrybczyn wrote:
           | Good (Edit) point! It's good that the web accepts dead links
           | by design, we can't expect perfection from our distributed
           | information, but it seems that the bitrot of information is
           | too high compared to the information storage technologies
           | available.
           | 
           | 2 spinning rust drives can store the Library of Congress. ~
           | 2,000 drives would store the web (1). How many millions of
           | these drives get manufactured per year? Our technology
           | systems are failing us - all those words are being lost, like
           | tears in rain.
           | 
           | (1)Back of the envelope estimation:
           | https://www.worldwidewebsize.com/ ~ estimates 50 billion
           | websites, with some estimates ~ 6 pages of information per
           | website. Let's say 1mb per page on average. So ~2,000 drives
           | would store the entire web.
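           | 
           | The same estimate as a quick sketch. The site count, pages
           | per site and page size are the figures above; the 20 TB
           | drive capacity is an assumption that the estimate does not
           | state. With these inputs the total is about 300 PB, or
           | roughly 15,000 such drives, so the ~2,000 figure only holds
           | with much larger drives or smaller inputs.

```python
# Inputs taken from the back-of-the-envelope estimate above.
WEBSITES = 50_000_000_000      # ~50 billion websites
PAGES_PER_SITE = 6             # ~6 pages of information per website
BYTES_PER_PAGE = 1_000_000     # ~1 MB per page

# Assumption (not stated in the estimate): one 20 TB hard drive.
DRIVE_BYTES = 20 * 10**12


def total_web_bytes(sites=WEBSITES, pages=PAGES_PER_SITE,
                    page_bytes=BYTES_PER_PAGE):
    """Total storage for the whole web under the stated inputs."""
    return sites * pages * page_bytes


def drives_needed(total_bytes, drive_bytes=DRIVE_BYTES):
    """Whole drives required, rounding up (integer ceiling division)."""
    return -(-total_bytes // drive_bytes)


if __name__ == "__main__":
    total = total_web_bytes()
    print(total // 10**15, "PB")            # 300 PB
    print(drives_needed(total), "drives")   # 15000 drives
```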
        
             | brewdad wrote:
             | Even if that estimate is off by an order of magnitude,
             | which given the weight of modern web pages it easily could
             | be, 20,000 drives to store the entire web seems way more
             | doable than I ever would have imagined.
        
             | 8bitsrule wrote:
             | 2000 drives x $100 = $200,000. Double that for backup,
             | $400,000. Admin, maintenance, let's say total, $1M/year.
             | So, Wikipedia could end its own deadlink problem (IF the
             | reference sources would agree).
             | 
             | But stuff goes missing at Wayback because people don't
             | agree to their pages being backed-up. Copyright, whatever.
             | So it's like Global Heating, the tech is there, but people
             | just can't agree. So 'pirate' backer-uppers go to jail. And
             | island-nations and expensive ocean-side properties are
             | being submerged. So it goes.
        
               | toomuchtodo wrote:
               | The Internet Archive has a bot that updates dead
               | Wikipedia references to point to archived content.
               | 
               | https://meta.wikimedia.org/wiki/InternetArchiveBot
        
         | leephillips wrote:
         | I find the SingleFile extension superb for this.
        
         | [deleted]
        
       | riskable wrote:
       | This seems like the type of thing that will become the search
       | engine of _first_ resort in the future as AI-generated propaganda
       | and nonsense pollutes the spectrum of websites and search
       | results.
        
         | [deleted]
        
       | 082349872349872 wrote:
       | cf https://gallica.bnf.fr/accueil/en/content/accueil-
       | en?mode=de...
        
       | schmudde wrote:
       | I've already made my donation to IA this year but I might need to
       | make another.
       | 
       | Somehow it's the IA's job to fix problems that we all know are
       | problems, sadly.
        
         | wnkrshm wrote:
         | Wikipedia reminded me multiple times to donate to the Internet
         | Archive this year.
        
           | xattt wrote:
           | Is this so Wikipedia can be archived by the IA?
        
             | shepherdjerred wrote:
             | It's a joke. Hacker News doesn't like donating to
             | Wikipedia; many choose to donate to Internet Archive
             | instead.
        
             | wnkrshm wrote:
             | I always take the wikipedia donation drive as a reminder to
             | donate to archive.org instead.
        
           | nix23 wrote:
           | Haha, i know what you mean...my mind works the same way ;)
        
           | password4321 wrote:
           | Another chance to upvote the donation link to the top thread
           | on an IA story since a direct submission got swallowed by the
           | dupe detector! They are doing so many amazing things.
           | 
           | https://archive.org/donate
           | 
           |  _Your Donation Will Be Matched 2-to-1! [...] Right now, we
           | have a 2-to-1 Matching Gift Campaign, tripling the impact of
           | every donation._ (from the home page)
        
           | brookst wrote:
           | Same. Oh hey these scammers are asking for money again? Wait,
           | I haven't given to IA in a while.
        
         | coldpie wrote:
         | If you're able/comfortable, please consider setting up a
         | recurring donation. For long-term planning reasons, it's
         | helpful for organizations to have a consistent recurring
         | revenue stream that they can use to project assets further into
         | the future. One-off donations are good, too! But if you're
         | going to consistently send them money anyway, you may as well
         | do it in a predictable manner to help their accounting.
        
       | pabs3 wrote:
       | I wonder if this includes anything from Sci-Hub or are they
       | unrelated.
        
         | toomuchtodo wrote:
         | Artifacts that are of questionable legality due to copyright
         | are archived but not made public for obvious reasons (typically
         | referred to as being "darked").
        
       | aaroninsf wrote:
       | PSA the Internet Archive is a 501c3 non-profit (library!) and
       | survives on donations and grants.
       | 
       | A huge percentage of the operating budget is from small donors.
       | The funding is preposterously small compared to other
       | public-interest services such as Wikipedia.
       | 
       | A lot of us take it for granted and assume there is e.g. support
       | from FAANG companies proportionate to the degree they lean on it.
       | 
       | This is 100% NOT THE CASE.
       | 
       | Please advocate for recurring institutional donations from your
       | firm. The audience reading this has a lot of voice in a lot of
       | organizations who could without a thought sign up to make annual
       | 10K, 100K, 1M donations...
       | 
       | ...and essentially, none do.
       | 
       | Please help change that!!!
       | 
       | https://archive.org/donate/
        
         | doc_gunthrop wrote:
         | Anyone who uses amazon.com can set the Internet Archive as
         | their preferred charity and shop using smile.amazon.com. A
         | percentage of your purchase amount will go to the IA.
        
       | sfusato wrote:
       | archive.org is a good alternative internet in the form of a giant
       | library, as dreamed of in the early '90s: web archive, film
       | archive, software archive, media archive... and now research
       | papers archive.
        
       | dang wrote:
       | Related:
       | 
       |  _Internet Archive Scholar_ -
       | https://news.ycombinator.com/item?id=26419782 - March 2021 (3
       | comments)
       | 
       |  _Internet Archive Scholar: Search Millions of Research Papers_ -
       | https://news.ycombinator.com/item?id=26401568 - March 2021 (47
       | comments)
        
       ___________________________________________________________________
       (page generated 2022-12-09 23:00 UTC)