[HN Gopher] Internet Archive Scholar ___________________________________________________________________ Internet Archive Scholar Author : nabla9 Score : 389 points Date : 2022-12-09 10:35 UTC (12 hours ago) (HTM) web link (scholar.archive.org) (TXT) w3m dump (scholar.archive.org) | zozbot234 wrote: | Not from the IA, but see https://scholia.toolforge.org for an | especially nice presentation of freely-available scholarly | metadata. | jboynyc wrote: | You may also be interested in OpenAlex.org, which also uses | Wikidata (along with DOIs, ORCIDs, ISSNs and a few other | standard identifiers) to classify publications. | photochemsyn wrote: | After a little testing, this looks like a good information | source, although the combination of Google Scholar and sci-hub is | probably still the best option, i.e. I couldn't find anything | with Internet Archive Scholar that wasn't available with the other | options, and the quality of results on searches is somewhat higher | with Google Scholar (this may be because Google Scholar utilizes | citation count as a search parameter, which Internet Archive | Scholar doesn't seem to do). | | Internet Archive is a great resource; however, it should get | state funding as it provides a fundamentally important archival | service. It's too bad it has to rely so much on private | philanthropic donations (although state support comes with | possible political interference, i.e. censorship of material that | some politician doesn't like; maybe that's less of a problem with | private donations, although then you could have some billionaire | doing the same thing). | macrolime wrote: | I tried to search for scientific authors from the 1800s and | they're there. | | Google Scholar, on the other hand, brings me to paywalls, even | though the articles are so old they should be out of copyright. | bnewbold wrote: | The content in scholar.archive.org has been indexed into | Google Scholar (and other indices are likewise welcome to crawl | the sitemap). 
There was some content "only" in | scholar.archive.org, but now it should basically all be in | Google Scholar. We haven't gotten around to describing this | publicly, but it was an explicit decision and partnership | between the organizations. | | Indeed, scholar.archive.org does not currently use citation | count in search rankings. We have a decent citation graph, | which we are working to expose in scholar (it is visible in | fatcat.wiki today). We would probably only ever use citation count | as a weak boost in search rankings (e.g., "any citations at all", | "more than 25 citations" as boosts, nothing beyond that); we don't | want to create too strong a feedback loop influencing future | citations. | | scholar.archive.org specifically was partially funded by the | Mellon Foundation (and partially through donations and other | service revenue). IA overall has diverse funding, including | grants and service revenue from the USA (Library of Congress, | IMLS, etc.); other national governments (paid crawl services); | foundation grants; universities and libraries (crawl, | preservation, and digitization services); and of course general | donations. The last category of course has the fewest strings | and lets us pursue new projects which might be hard to get | traditional funding for. Remember that the whole premise of web | archiving was considered radical and quixotic at the beginning! | | (source: I work at IA on scholar) | veqq wrote: | Some days, nearly half the links I click are dead, so I've found | myself relying on the Wayback Machine more and more over the past | few months. It's really shocking just how fast digital | obsolescence reared its ugly head. Of course Angelfire/GeoCities etc. | were a clear early blow, but nowadays... 
| | I've started saving the html (including the css seems like too | much overhead, and often it's incomplete or relies on downloads | still - and screenshots are not searchable) of every interesting | article I find online, downloading quite a few videos too with | yt-dlp. I'd long copypasted all interesting comments into a txt | file, but now it seems like data hoarding's the way to go - at | least in moderation, focusing on things I'll actually refer back | to. | | I remember 15 years ago, discovering pdf dumps on random sites | like a kid in a candy store. Perhaps it'll be like that again, | with people presenting museums of their favorite old pages. | b1zguy wrote: | I've been wanting to run my own search engine sorta thingy that | indexes websites I feed it. I sometimes find little nooks of | the net that post resources I may need in the future. Like my | own mini-Google that indexes a list of sites. | | How can I go about creating this? Are there off-the-shelf | solutions, or will I need to, say, combine Scrapy with | Elasticsearch? The links in this thread look promising. | graderjs wrote: | Some people use this tool (of mine) for saving web content from | either bookmarks or just everything you browse: | https://github.com/crisdosyago/Diskernet | | There are also plenty of other similar tools: | | - https://github.com/ArchiveBox/ArchiveBox | | - https://github.com/gildas-lormeau/SingleFile | randomguy12 wrote: | Zotero also saves snapshots of pages if you already cite | academic pages | staringback wrote: | zote wrote: | How have I never seen your tool before. | | >22120 archives content exactly as it is received and | presented by a browser, and it also replays that content | exactly as if the resource were being taken from online. | | I've been looking for this for a really long time. 
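For b1zguy's "mini-Google" question above: the practical answers are the off-the-shelf tools mentioned in this thread (ArchiveBox, SingleFile) plus a crawler feeding a full-text index such as Scrapy with Elasticsearch. The core idea, though, is just an inverted index. Here is a minimal sketch in pure Python; the class name, URLs, and page texts are all invented for illustration, and real use would also strip HTML and handle stemming:

```python
import re
from collections import defaultdict

class MiniIndex:
    """A tiny in-memory inverted index over saved pages.
    Maps each lowercased word to the set of page URLs containing it."""

    def __init__(self):
        self.index = defaultdict(set)

    def add_page(self, url, text):
        # Naive tokenization on word characters; a real indexer
        # would strip HTML tags, stem words, weight titles, etc.
        for word in re.findall(r"\w+", text.lower()):
            self.index[word].add(url)

    def search(self, query):
        # AND semantics: return pages containing every query term.
        terms = re.findall(r"\w+", query.lower())
        if not terms:
            return set()
        results = self.index[terms[0]].copy()
        for term in terms[1:]:
            results &= self.index[term]
        return results

idx = MiniIndex()
idx.add_page("http://example.com/a", "Internet Archive preserves web pages")
idx.add_page("http://example.com/b", "Search engines index web pages")
print(idx.search("web pages"))      # both pages match
print(idx.search("archive pages"))  # only the first page matches
```

An off-the-shelf engine adds ranking (e.g. TF-IDF or BM25), persistence, and crawling on top of this same structure, which is why combining a scraper with an existing index is usually the better route than building one.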
| arminiusreturns wrote: | Beware the very strange and bad license for Diskernet, which | is "Polyform Strict License 1.0.0" | tetris11 wrote: | For people looking for more info on these strange licenses: | | https://www.reddit.com/r/linux/comments/coazye/what_does_rl | i... | martyvis wrote: | That Polyform licence looks truly awful. Lack of mention of | how I could use it while working. | detaro wrote: | > _how I could use it while working._ | | The readme clearly links where to buy a license for that. | leslielurker wrote: | I'd never heard of SingleFile before but it looks excellent. | It would be great if Firefox could incorporate it into its | save function too. Firefox save page works but as shown on | the SingleFile demo video, it's not really what a user would | expect, it's often not complete and splitting it across | multiple files/directories isn't ideal either. | Springtime wrote: | It's unfortunate as Firefox used to have excellent MHTML | support (which similarly achieves an all-in-one file) via | addons, particularly the feature-rich _UnMHT_. While | Chromium and its derivatives support MHTML saving natively | (and in the past Opera Presto and IE). | | If they brought back MHTML saving support it'd be a great | win. | tetris11 wrote: | > Coming to a future release, soon!: The ability to publish | your own search engine that you curated with the best | resources based on your expert knowledge and experience. | | This would be fantastic, being able to browse a curated | internet made of accumulated lists from other trusted users | on the net, similar to how ad blocking lists work today. | | You are genuinely trying to steer the internet into what it | used to be: a museum of knowledge and expert discussion. | | Edit: Ah but wow, Polyform license. Huh. | msravi wrote: | I use the markdownload extension[1] on firefox and move the .md | file into my notes folder (notable[2]). Works very well. | | 1. 
https://addons.mozilla.org/en-US/firefox/addon/markdownload/ | | 2. https://notable.app/ | dspillett wrote: | _> I've started saving_ | | I do similar. I've had https://github.com/ArchiveBox/ArchiveBox | bookmarked for a while as something to try to better organise all | that, but like a great many things I haven't got around to it | yet. | circustaco wrote: | I use Raindrop for this. It's a pretty great bookmark manager | made by an indie dev, but it also can create archives of | bookmarked pages. | | https://help.raindrop.io/backups#permanent-library | dspillett wrote: | "Only available in Pro plan". No information on whether | that is a general limitation or if these features are | present if self-hosted, so I assume the former. And there | seems to be little obvious information for self-hosting. | | So probably not one for my use case. | dspillett wrote: | Thanks, I'll add that to the list of things to try out. | patrickdward wrote: | I use Raindrop but didn't know about that feature. Thanks. | gwbas1c wrote: | I just save the pdf of a site that's _really_ important to me. | | When I did a major college project in 2003, I made sure to make | pdfs of any academic article that I referenced. It actually | saved me, because some articles disappeared by the time I went | to revise my references. | 082349872349872 wrote: | While the author(s) are still alive, they are often a | productive contact. | | (In one case I was able to give back: bundling up the several | scans an author had of a half-century-old paper from their | student days into a single, hopefully cromulent, PDF) | | Edit: recall also that accepting that links are one-way and | might be dead was _the key_ simplification that allowed HTTP to | take off after prior attempts at hypermedia had failed. | mrybczyn wrote: | Good (Edit) point! 
It's good that the web accepts dead links | by design; we can't expect perfection from our distributed | information, but it seems that the bitrot of information is | too high compared to the information storage technologies | available. | | Two spinning-rust drives can store the Library of Congress; | ~2,000 drives would store the web (1). How many millions of | these drives get manufactured per year? Our technology | systems are failing us - all those words are being lost, like | tears in rain. | | (1) Back-of-the-envelope estimation: | https://www.worldwidewebsize.com/ estimates ~50 billion | websites, with some estimates of ~6 pages of information per | website. Let's say 1 MB per page on average. So ~2,000 drives | would store the entire web. | brewdad wrote: | Even if that estimate is off by an order of magnitude, | which given the weight of modern web pages it easily could | be, 20,000 drives to store the entire web seems way more | doable than I ever would have imagined. | 8bitsrule wrote: | 2,000 drives x $100 = $200,000. Double that for backup: | $400,000. Admin, maintenance, let's say total, $1M/year. | So, Wikipedia could end its own dead-link problem (IF the | reference sources would agree). | | But stuff goes missing at Wayback because people don't | agree to their pages being backed up. Copyright, whatever. | So it's like Global Heating: the tech is there, but people | just can't agree. So 'pirate' backer-uppers go to jail. And | island nations and expensive ocean-side properties are | being submerged. So it goes. | toomuchtodo wrote: | The Internet Archive has a bot that updates dead | Wikipedia references to point to archived content. | | https://meta.wikimedia.org/wiki/InternetArchiveBot | leephillips wrote: | I find the SingleFile extension superb for this. 
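The back-of-the-envelope footnote above can be checked directly. Under the thread's own inputs (50 billion sites, ~6 pages each, ~1 MB per page) the total comes to ~300 PB, which means the 2,000-drive figure implies ~150 TB per drive; at a more realistic large-HDD capacity (the 20 TB used below is my assumption for a 2022-era drive) it works out to ~15,000 drives, close to brewdad's order-of-magnitude-worse case:

```python
# Back-of-the-envelope check of "store the whole web on spinning
# rust". Inputs are the thread's own assumptions; the 20 TB drive
# capacity is an added assumption (large 2022-era HDD).

websites = 50e9        # worldwidewebsize.com estimate
pages_per_site = 6
bytes_per_page = 1e6   # 1 MB; decimal units throughout

total_bytes = websites * pages_per_site * bytes_per_page
petabytes = total_bytes / 1e15

drive_capacity = 20e12            # 20 TB per drive
drives = total_bytes / drive_capacity

print(f"{petabytes:.0f} PB total")   # 300 PB
print(f"{drives:,.0f} drives")       # 15,000 drives
```

Either way, the striking point of the original comment stands: the whole public web fits in a quantity of hardware that is tiny next to annual drive production.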
| [deleted] | riskable wrote: | This seems like the type of thing that will become the search | engine of _first_ resort in the future as AI-generated propaganda | and nonsense pollutes the spectrum of websites and search | results. | [deleted] | maggu123 wrote: | 082349872349872 wrote: | cf https://gallica.bnf.fr/accueil/en/content/accueil- | en?mode=de... | schmudde wrote: | I've already made my donation to IA this year but I might need to | make another. | | Somehow it's the IA's job to fix problems that we all know are | problems, sadly. | wnkrshm wrote: | Wikipedia reminded me multiple times to donate to the Internet | Archive this year. | xattt wrote: | Is this so Wikipedia can be archived by the IA? | shepherdjerred wrote: | It's a joke. Hacker News doesn't like donating to | Wikipedia; many choose to donate to Internet Archive | instead. | wnkrshm wrote: | I always take the Wikipedia donation drive as a reminder to | donate to archive.org instead. | nix23 wrote: | Haha, I know what you mean... my mind works the same way ;) | password4321 wrote: | Another chance to upvote the donation link to the top of the | thread on an IA story, since a direct submission got swallowed | by the dupe detector! They are doing so many amazing things. | | https://archive.org/donate | | _Your Donation Will Be Matched 2-to-1! [...] Right now, we | have a 2-to-1 Matching Gift Campaign, tripling the impact of | every donation._ (from the home page) | brookst wrote: | Same. Oh hey, these scammers are asking for money again? Wait, | I haven't given to IA in a while. | coldpie wrote: | If you're able/comfortable, please consider setting up a | recurring donation. For long-term planning reasons, it's | helpful for organizations to have a consistent recurring | revenue stream that they can use to project assets further into | the future. One-off donations are good, too! 
But if you're | going to consistently send them money anyway, you may as well | do it in a predictable manner to help their accounting. | pabs3 wrote: | I wonder if this includes anything from Sci-Hub, or are they | unrelated. | toomuchtodo wrote: | Artifacts that are of questionable legality due to copyright | are archived but not made public, for obvious reasons (typically | referred to as being "darked"). | aaroninsf wrote: | PSA: the Internet Archive is a 501(c)(3) non-profit (library!) and | survives on donations and grants. | | A huge percentage of the operating budget is from small donors. | The funding is preposterously small compared to other | public-service, public-interest organizations such as Wikipedia. | | A lot of us take it for granted and assume there is e.g. support | from FAANG companies proportionate to the degree they lean on it. | | This is 100% NOT THE CASE. | | Please advocate for recurring institutional donations from your | firm. The audience reading this has a lot of voice in a lot of | organizations who could without a thought sign up to make annual | 10K, 100K, 1M donations... | | ...and essentially, none do. | | Please help change that!!! | | https://archive.org/donate/ | doc_gunthrop wrote: | Anyone who uses amazon.com can set the Internet Archive as | their preferred charity and shop using smile.amazon.com. A | percentage of your purchase amount will go to the IA. | sfusato wrote: | archive.org is an alternative, good internet as a giant library, | as dreamed of in the early '90s: web archive, film archive, | software archive, media archive... and now research papers | archive. | dang wrote: | Related: | | _Internet Archive Scholar_ - | https://news.ycombinator.com/item?id=26419782 - March 2021 (3 | comments) | | _Internet Archive Scholar: Search Millions of Research Papers_ - | https://news.ycombinator.com/item?id=26401568 - March 2021 (47 | comments) ___________________________________________________________________ (page generated 2022-12-09 23:00 UTC)