[HN Gopher] Sci-Hub statistics and database
___________________________________________________________________
Sci-Hub statistics and database

Author : NmAmDa
Score  : 317 points
Date   : 2022-02-12 17:40 UTC (5 hours ago)

(HTM) web link (sci-hub.ru)
(TXT) w3m dump (sci-hub.ru)

| gw67 wrote:
| How are they able to store data without being seized?
|
| modeless wrote:
| Did Sci-Hub start working again? Last time I checked it wasn't adding new papers because of some legal thing going on in India.
|
| phoe-krk wrote:
| Yes - see https://twitter.com/ringo_ring/status/1492419986291408898
|
| Ansil849 wrote:
| The site is working in the sense that you can download old papers, but I don't believe any new papers from the last year are accessible.
|
| DoItToMe81 wrote:
| I accessed a paper from late last year not so long ago. I think it's working fine.
|
| Ansil849 wrote:
| > I accessed a paper from late last year not so long ago. I think it's working fine.
|
| It is not. A large batch of new papers was added manually, but the old service of typing in a DOI and having a paper be retrieved automatically is not working. Pick 10 random DOIs from 2022 and see how many Scihub will return.
|
| lamontcg wrote:
| AFAIK it's not really working again? I think there was an upload of a bunch of papers in a batch recently, but not ones that I was hoping for. I'm sort of worried about the past-tense language in this page suggesting that it isn't starting back up again.
|
| nefitty wrote:
| Here's a notebook that fetches Sci-Hub mirrors from Wikidata and tests them. I also included an iOS Shortcut to add to your Share screen. When you're on a site that Sci-Hub recognizes and you use the shortcut it will try to fetch the paper.
|
| https://observablehq.com/@iz/sci-hub
|
| raziel2701 wrote:
| Alexandra Elbakyan is a titan and a saint. I wouldn't have been able to finish my research without access to papers my institution wasn't subscribed to.
|
| [deleted]
|
| OmicronCeti wrote:
| I snuck her into my own dissertation acknowledgements: https://imgur.com/bDgtBAE
|
| jdrc wrote:
| Now i feel foolish for not acknowledging her, especially in elsevier papers.
|
| [deleted]
|
| p1esk wrote:
| Interesting. Medical field dominates research in terms of publications. Chemistry produces double the papers compared to physics, and humanities are smaller than biology but larger than physics. I wonder where machine learning papers fit in - CS or Math or both?
|
| sgillen wrote:
| I would imagine it depends on the particular paper, the more experimental ones in CS, the more theory ones in math.
|
| Do note though that most math and ML practitioners use arxiv over sci-hub.
|
| remuskaos wrote:
| It is noteworthy that most of physics (at least high energy physics) is published on arxiv.org and open access.
|
| I don't know if sci hub bothers with publications that are available freely from an official source.
|
| p1esk wrote:
| Good point. If so, this data is a lot less interesting :(
|
| philipkglass wrote:
| Sci-hub will grab and serve anything with a DOI (or at least used to; I don't know if they have started ingesting papers again after turning it off a while ago). I have found open access papers there before. It's simpler to just paste the DOI into sci-hub than to check to see if it's one of the few open access articles in a mostly paywalled journal.
|
| anon_123g987 wrote:
| > published on arxiv.org and open access
|
| Don't use the term "open access" like this. A paper published on arXiv is free to read, and was freely published. "Open access" is a scam by the big publishers, where they don't take money from the _readers_, but make the _authors_ pay. Or, putting it another way, anyone can pay their way in those journals and publish (sometimes sub par) papers.
|
| nicoburns wrote:
| No, "open access" means that the paper is available to readers for free. Making the authors pay is typically termed "gold open access".
|
| mNovak wrote:
| I've never heard the term "gold open access", but I know plenty of "open access" journals that charge a fee to authors.
|
| remuskaos wrote:
| I wasn't aware that there were different distinct forms of "open access", so I had to read it up on Wikipedia. From what I understand, publications on arxiv are either gratis or libre open access.
|
| Either way, we don't pay anyone any fee to publish on arxiv.
|
| 13415 wrote:
| Not that I want to defend open access fees but the way you describe it is incorrect. Paying for open access fees with large publishers like Springer is an option that is separate from the review system, you can only choose it once your paper has been reviewed and accepted.
|
| remuskaos wrote:
| As I wrote on another comment, I wasn't aware that there are multiple forms of open access. Since it appears that arxiv (again, at least high energy physics) employs mostly either gratis or libre open access, and since the Wikipedia article explicitly calls it an open access archive, I see no harm in calling it that either.
|
| "arXiv (pronounced "archive"--the X represents the Greek letter chi [χ])[1] is an open-access repository of electronic preprints and postprints[...] "
|
| The_rationalist wrote:
| Machine learning publication rate is small, at least by assuming that paperswithcode contains most of the publications.
|
| [deleted]
|
| ok123456 wrote:
| What's the most popular paper on all of scihub? By field?
|
| mmettler wrote:
| Alexandra should get the Nobel prize.
|
| iqanq wrote:
|
| na85 wrote:
| I mean, even if you limited yourself to just the Peace prize (arguably the most controversial), you'd still have to reconcile your statement with the fact that people like Malala Yousafzai have won.
|
| iqanq wrote:
|
| allisdust wrote:
| With the rent seeking companies being from Europe? Not a chance.
|
| Nobel is a political tool that's mostly there to make a point (especially that peace prize).
|
| anon_123g987 wrote:
| He said "should", not "will". Both of you are right.
|
| [deleted]
|
| 2Gkashmiri wrote:
| i asked this question here and at many places before. why do people "rely" on an organization that sifts through hundreds of thousands of papers and then charges exorbitant prices for providing this service? if we use the amazon analogy, is amazon with millions of products worse than a boutique cat food seller that specializes in a specific cat food for a specific cat breed? maybe. but what about the "rest" of products?
|
| why are our scientists made to rely on elsevier et al to sift through the junk and find for them the perfect paper instead of doing it themselves? is science now such a cutthroat quick competition that it requires you to give a company the privilege to work for you so that you don't have to do your own due diligence?
|
| in india, we have a lot of local research that is done on open databases like shodh ganga and many more. but if you have to access foreign research material, you'd better hope your university has an agreement with elsevier and others to pay them millions for a login. the alternative, go to scihub and find what you need.
|
| i understand the whole quality/delivery debate but doesn't the average user already know who the big players in the specific domain are and who are trusted? or you want discoverability at the hands of a "trusted third party" without doing the legwork yourself.
|
| then at the other end you have non-academics like me. I might have heard of a research paper in some article and i cannot read it without paying an arm and a leg. why? if we use the whole ebook/book argument that compensation is commensurate to the sales so more popular book means more money to the author but here authors aren't compensated but elsevier so why should i pay elsevier? because they filtered through 1000 papers to provide 10 and for that privilege, they require unlimited royalty for ever? why?
|
| Hendrikto wrote:
| > why are our scientists made to rely on elsevier et al to sift through the junk and find for them the perfect paper instead of doing it themselves?
|
| Scientists do do that themselves. That's why it is called peer review. Journals take scientists' work for free, they just pre-select papers, but don't do the review.
|
| slater wrote:
| There's some truth to "publish or perish". Scientists are expected to publish in prestigious journals.
|
| Qem wrote:
| Not only expected, but actually forced. In many places, a streak of a few years with no publications in prestigious journals can irrecoverably sink a researcher's career.
|
| OmicronCeti wrote:
| A typical PhD dissertation these days is 3 publications in high-quality journals. It is explicitly required at most schools.
|
| f6v wrote:
| Because otherwise you'll have to sift through tons of garbage "research". It's already common knowledge that many articles coming from certain countries are fraudulent. There's a lower chance of having those in journals like Nature Medicine.
|
| aurizon wrote:
| If only the Nobel Committee would say: The Nobel Committee will only consider research published under an Open Source Access repository in reviewing published papers for consideration for the Nobel Prize after around June 30, 2022. This would unleash a horde of hungry cats among those fat pigeons that are the paywalled journals. There would be a crying and wailing - ending with piles of feathers (and purring cats), and researchers all over the world, and especially in the many 'third world' Universities whose minds are currently held hostage to budgets and local politics. The world would gain immeasurably by this simple act!
|
| [deleted]
|
| gumby wrote:
| 100 TB is pretty small. I wonder if she will start torrenting it so people can back it up and share the load.
|
| logifail wrote:
| > 100 TB is pretty small. I wonder if she will start torrenting it so people can back it up and share the load
|
| This has been ongoing for a while now:
|
| _Rescue Mission for Sci-Hub and Open Science: We are the library_
| https://www.reddit.com/r/DataHoarder/comments/nc27fv/rescue_...
|
| gumby wrote:
| Excellent, thanks!
|
| intunderflow wrote:
| Remember to donate to sci-hub to keep it going! Even a small donation helps and is way more than the extortionate prices we'd all have to pay without it :D
|
| [deleted]
|
| The_rationalist wrote:
| How come we don't have extensive software for helping doctors' decision making by making use of, e.g., Bayesian inference while feeding on the available superintelligence that those 24 million papers enable? Expert systems long passed the hype curve and it's time for them to cycle up again!
|
| f6v wrote:
| Because research can be controversial. There are papers in my field saying patients have increased frequency of certain cells. There are other papers saying they don't. Go figure.
|
| Qem wrote:
| Nailed it. With publish or perish incentivizing shenanigans like "p-hacking", many of those papers are the research-equivalent of spam.
|
| [deleted]
|
| nefitty wrote:
| I think Watson does something like this.
|
| roywiggins wrote:
| It didn't seem to actually work though.
|
| https://slate.com/technology/2022/01/ibm-watson-health-failu...
|
| monkeybutton wrote:
| I wonder how many years that sets back the field. Who will want to invest in something that could end up being Watson 2.0?
|
| dagw wrote:
| Did Watson fail because they were bad at their job or because the problem is much harder than people assumed?
|
| nefitty wrote:
| I think the marketing got ahead of the tech. I would classify that as a business failure.
|
| kilburn wrote:
| An older comment of mine https://news.ycombinator.com/item?id=30049522 fits well here. I'll adapt it to your question ;)
|
| Basically: medicine as a whole is already some sort of expert system.
|
| - Data collection and cleanup: Researchers conduct experiments to produce meaningful data and extract conclusions from that data.
|
| This part isn't more automated because we have strict rules that prevent medical data collection and analysis without a clear purpose. Otherwise we'd be able to collect a lot more information to try and extract results from it using more inference-oriented techniques (deep learning and the like).
|
| - Modeling & training: Expert panels produce guidelines from the results of that research. These panels are the "training part" of the system.
|
| As a sibling comment said, replacing these panels with ML-based techniques isn't trivial because the data produced in the previous step is fairly noisy (p-value hacking, difficulty of capturing all the variables, etc.). Furthermore, the techniques that yield best results nowadays also produce them without clear explanations on why they hold, which is not something we are prepared to accept in medicine.
|
| - Execution: Doctors diagnose and treat following said guidelines. In fact, they use decision flows that they themselves call... algorithms!
|
| The main reason why execution is not automated is that we do not have the technology for machines to capture the contextual and communication nuances that doctors pick up on. There can be a world of difference between the exact same statement given by two different patients or even the same patient in two different situations. Likewise, the effect of a doctor's statement can be quite literally the opposite depending on who the patient is and their state of mind. One of the most important aspects of the GP's job is to handle these differences to achieve the best possible outcomes for their patients.
|
| All that being said, there are companies trying to produce expert systems to help doctors diagnose. See https://infermedica.com/product/infermedica-api for instance.
|
| [deleted]
|
| [deleted]
|
| belter wrote:
| It looks like the torrents have all subjects. Anybody aware if there are torrents only for Math or Comp.sci?
|
| Ansil849 wrote:
| Is there any word about when sci hub is going to start adding new articles again? It's currently only useful as an archive of old research articles. New papers from the last year are not available. I never understood the rationale for stopping new content, though I believe it had some relation to some court case in India...but I don't understand why that was a reason to stop adding articles, and why it hasn't been restarted yet.
|
| derbOac wrote:
| What I read was that the Indian judicial system tends to be favorable to things like Sci Hub in its interpretation of copyright, and Sci Hub wanted to act in good faith with regard to that court, so as to have a fairly solid basis in international law for operating, should it rule in Sci Hub's favor. I might be off in this understanding, but that's what I understood.
|
| Ansil849 wrote:
| Yeah, I have heard this reasoning, but it seems muddled. How is keeping the site online so old articles are available but no new articles are added acting in "good faith"? It's not like the old articles are any less copyrighted than the new articles, so this doesn't make sense to me.
|
| The court case has also been delayed for over a year now, so if it is delayed indefinitely, like it seems to be, then we will also not get access to new articles, also indefinitely? That's ridiculous. The last update from the court proceedings claimed that there would be a new update over a month ago, which in turn got delayed yet again to a few days ago, and there's been nothing [1].
|
| [1] https://delhihighcourt.nic.in/dhc_case_status_oj_list.asp?pn...
|
| baybal2 wrote:
| In India, courts have famously few remedies against no-show from plaintiffs.
|
| joshuaissac wrote:
| They had resumed adding articles after receiving legal advice that the Delhi High Court injunction only applied for a few months.
|
| https://mobile.twitter.com/ringo_ring/status/143435621720862...
|
| Ansil849 wrote:
| I saw that tweet, but it doesn't change the material reality: try plugging in some DOIs from recent articles from the last year, and they will not be there.
|
| Scihub used to be a great resource, now it's only a resource for old research. Still useful for background material, but not for current work.
|
| I also don't understand why the Indian court case has any impact on new article availability. The owner is not Indian. The servers and domains are not Indian. There doesn't seem to be any actual reason to stop adding new articles, other than some idiotic half-baked point that only hurts the people who need the articles, like when Project Gutenberg banned anyone from a German IP, except this is much worse since there is no way around it for people who need new papers.
|
| [deleted]
|
| generationP wrote:
| I have a hunch that the downfall of the "Plato" real-time downloader wasn't the Indian court case but rather the fact that it helped publishers trivially identify the university accounts through which the downloads were happening. Even if the appearance of papers were delayed by a random number of days, there are other pitfalls now, and most importantly, publishers started caring. In particular, Elsevier now slaps UUIDs onto all PDFs you download from them, and no, I'm not just talking of visible watermarks. Other publishers seem to be doing similar things (there was a recent twitter thread on this, retweeted by @textfiles, which I can't find). The rational solution for Sci-Hub seems to be to buffer their uploads and release them in yearly batches, maybe programmatically removing various kinds of watermarks and diffing against the same paper downloaded from a second IP. If this is what they are doing, I'm not surprised. Not sure how much of a winning strategy they have in the long run, though.
|
| Guys: post your papers on the arXiv.
|
| mohammad_ali85 wrote:
| This might be the twitter thread you're referring to? https://twitter.com/json_dirs/status/1486120144141123584
|
| generationP wrote:
| Yep, thank you!
|
| joshuaissac wrote:
| > I also don't understand why the Indian court case has any impact on new article availability. The owner is not Indian. The servers and domains are not Indian.
|
| Because Sci-Hub has a good chance of winning the case. The court in question has previously backed a very broad definition of what constitutes fair dealing.
|
| https://en.m.wikipedia.org/wiki/University_of_Oxford_v._Rame...
|
| Ansil849 wrote:
| > Because Sci-Hub has a good chance of winning the case.
|
| I understand that this is the party line that is parroted whenever this issue comes up, but it does not make any sense as a rationale for keeping new articles off the site. How is not adding any new articles (but, for example, keeping old articles accessible) assisting the possible winning of the case? And more to the point, why does it matter at all if it wins or loses the case? As stated, neither the owner nor the infrastructure is Indian, so of what relevancy is this jurisdiction?
|
| And further still, the case appears to have been delayed indefinitely. That last update claims that there was going to be an update a few days ago, but there was not. The proceedings are just now a list of one postponement after another [1]. Given that new articles are being held hostage, it thus very obviously benefits the legal system and the prosecution to continue to delay the case indefinitely.
|
| [1] https://delhihighcourt.nic.in/dhc_case_status_oj_list.asp?pn...
|
| sa1 wrote:
| The owner might not be Indian, but she's actively defending the case (through lawyers) in India. Not following the injunction would lead her to losing the case, which is why she followed through. She didn't have to fight the case in India, but she chose to. Why keep old papers and stop adding new papers - that probably depends on the terms of the injunction.
|
| Ansil849 wrote:
| > Why keep old papers and stop adding new papers - that probably depends on the terms of the injunction.
|
| As per the official tweet that has already been mentioned in this thread [1]:
|
| > how about the lawsuit in India you may ask: our lawyers say that restriction is expired already
|
| So according to the owner's official Twitter, this is no longer a valid reason, and yet new papers are still not accessible. Why is that?
|
| [1] https://mobile.twitter.com/ringo_ring/status/143435621720862...
|
| sa1 wrote:
| Haven't got around to adding yet?
|
| Ansil849 wrote:
| That is not how scihub used to function. Scihub used to have an engine, named Plato, which would fetch papers automatically if not already in their database. For the last year now, this essential service has not been operational. This is what the issue I am raising is about.
|
| sa1 wrote:
| It's clear what you're talking about. Software bitrots over time. Plato might need fixes, might have a huge backlog, lots of stuff can happen.
|
| pmoriarty wrote:
| It's interesting how sci-hub's papers on medicine dwarf those in many other fields like comp-sci, math, and physics. I wonder if that reflects the number of papers in those fields, or if sci-hub just has a non-representative sample. If the latter, why?
|
| p1esk wrote:
| It does appear to be the latter. I just searched for several famous ML papers (attention is all you need, lottery ticket hypothesis, capsules, etc) and they are not there. I think if someone counted all papers that have been ever published anywhere, the picture would be a lot different.
|
| pmoriarty wrote:
| So does that mean that vastly more people in medicine use sci-hub than do people from other fields?
|
| Or is there some other reason for the discrepancy?
|
| p1esk wrote:
| Could be. I've been an ML researcher for 8 years and I haven't used sci-hub until today. Ironically one of my (very obscure) papers is available there.
|
| Qem wrote:
| I guess today it is much easier to find new noteworthy, publishable facts in medicine than physics. New diseases are discovered every year, and old diseases are poorly understood (e.g. Alzheimer's disease), and the treatments for many of them are still sub-optimal, or even nonexistent. Every patient is different, individual cases are research-worthy. We only got antibiotics in the 1940s. On the other hand, most big breakthroughs of physics happened between the 17th century and the first decades of the 20th century. After the general case is cracked in physics, individual cases have very little publishing value.
|
| _Wintermute wrote:
| I think it's due to the sheer number of biomedical papers published each year, coupled with comp-sci, maths and physics papers being less likely to be behind a paywall.
___________________________________________________________________
(page generated 2022-02-12 23:00 UTC)