[HN Gopher] Sci-Hub statistics and database
       ___________________________________________________________________
        
       Sci-Hub statistics and database
        
       Author : NmAmDa
       Score  : 317 points
       Date   : 2022-02-12 17:40 UTC (5 hours ago)
        
 (HTM) web link (sci-hub.ru)
 (TXT) w3m dump (sci-hub.ru)
        
       | gw67 wrote:
       | How are they able to store data without it being seized?
        
       | modeless wrote:
       | Did Sci-Hub start working again? Last time I checked it wasn't
       | adding new papers because of some legal thing going on in India.
        
         | phoe-krk wrote:
         | Yes - see
         | https://twitter.com/ringo_ring/status/1492419986291408898
        
           | Ansil849 wrote:
           | The site is working in the sense of you can download old
           | papers, but I don't believe any new papers from the last year
           | are accessible.
        
             | DoItToMe81 wrote:
             | I accessed a paper from late last year not so long ago. I
             | think it's working fine.
        
               | Ansil849 wrote:
               | > I accessed a paper from late last year not so long ago.
               | I think it's working fine.
               | 
               | It is not. A large batch of new papers was added
               | manually, but the old service of typing in a DOI and
               | having a paper be retrieved automatically is not working.
               | Pick 10 random DOIs from 2022 and see how many Scihub
               | will return.
        
         | lamontcg wrote:
         | AFAIK it's not really working again? I think there was an upload
         | of a bunch of papers in a batch recently, but not ones that I
         | was hoping for. I'm sort of worried about the past-tense
         | language in this page suggesting that it isn't starting back up
         | again.
        
       | nefitty wrote:
       | Here's a notebook that fetches Sci-Hub mirrors from Wikidata and
       | tests them. I also included an iOS Shortcut to add to your Share
       | screen. When you're on a site that Sci-Hub recognizes and you use
       | the shortcut it will try to fetch the paper.
       | 
       | https://observablehq.com/@iz/sci-hub
        
       | raziel2701 wrote:
       | Alexandra Elbakyan is a titan and a saint. I wouldn't have been
       | able to finish my research without access to papers my
       | institution wasn't subscribed to.
        
         | [deleted]
        
         | OmicronCeti wrote:
         | I snuck her into my own dissertation acknowledgements:
         | https://imgur.com/bDgtBAE
        
           | jdrc wrote:
           | Now I feel foolish for not acknowledging her, especially in
           | Elsevier papers.
        
         | [deleted]
        
       | p1esk wrote:
       | Interesting. Medical field dominates research in terms of
       | publications. Chemistry produces double the papers compared to
       | physics, and humanities are smaller than biology but larger than
       | physics. I wonder where machine learning papers fit in - CS or
       | Math or both?
        
         | sgillen wrote:
         | I would imagine it depends on the particular paper, the more
         | experimental ones in CS, the more theory ones in math.
         | 
         | Do note though that most math and ML practitioners use arxiv
         | over sci-hub.
        
         | remuskaos wrote:
         | It is noteworthy that most of physics (at least high energy
         | physics) is published on arxiv.org and open access.
         | 
         | I don't know if sci hub bothers with publications that are
         | available freely from an official source.
        
           | p1esk wrote:
           | Good point. If so, this data is a lot less interesting :(
        
           | philipkglass wrote:
           | Sci-hub will grab and serve anything with a DOI (or at least
           | used to; I don't know if they have started ingesting papers
           | again after turning it off a while ago). I have found open
           | access papers there before. It's simpler to just paste the
           | DOI into sci-hub than to check to see if it's one of the few
           | open access articles in a mostly paywalled journal.
        
           | anon_123g987 wrote:
           | > published on arxiv.org and open access
           | 
           | Don't use the term "open access" like this. A paper published
           | on arXiv is free to read, and was freely published. "Open
           | access" is a scam by the big publishers, where they don't
           | take money from the _readers_ , but make the _authors_ pay.
           | Or, putting it another way, anyone can pay their way in those
           | journals and publish (sometimes sub par) papers.
        
             | nicoburns wrote:
             | No, "open access" means that the paper is available to
             | readers for free. Making the authors pay is typically
             | termed "gold open access".
        
               | mNovak wrote:
               | I've never heard the term "gold open access", but I know
               | plenty of "open access" journals that charge a fee to
               | authors.
        
               | remuskaos wrote:
               | I wasn't aware that there were different distinct forms
               | of "open access", so I had to read it up on Wikipedia.
               | From what I understand, publications on arxiv are either
               | gratis or libre open access.
               | 
               | Either way, we don't pay anyone any fee to publish on
               | arxiv.
        
             | 13415 wrote:
             | Not that I want to defend open access fees but the way you
             | describe it is incorrect. Paying for open access fees with
             | large publishers like Springer is an option that is
             | separate from the review system, you can only choose it
             | once your paper has been reviewed and accepted.
        
             | remuskaos wrote:
             | As I wrote on another comment, I wasn't aware that there
             | are multiple forms of open access. Since it appears that
             | arxiv (again, at least high energy physics) employs mostly
             | either gratis or libre open access, and since the Wikipedia
             | article explicitly calls it an open access archive, I see
             | no harm in calling it that either.
             | 
             | "arXiv (pronounced "archive"--the X represents the Greek
             | letter chi [χ])[1] is an open-access repository of
             | electronic preprints and postprints[...] "
        
         | The_rationalist wrote:
         | Machine learning publication rate is small, at least by
         | assuming that paperswithcode contains most of the publications.
        
           | [deleted]
        
       | ok123456 wrote:
       | What's the most popular paper on all of scihub? By field?
        
       | mmettler wrote:
       | Alexandra should get the Nobel prize.
        
         | iqanq wrote:
        
           | na85 wrote:
           | I mean, even if you limited yourself to just the Peace prize
           | (arguably the most controversial), you'd still have to
           | reconcile your statement with the fact that people like
           | Malala Yousafzai have won.
        
             | iqanq wrote:
        
         | allisdust wrote:
         | With the rent seeking companies being from Europe? Not a
         | chance.
         | 
         | Nobel is a political tool that's mostly there to make a point
         | (especially that peace prize).
        
           | anon_123g987 wrote:
           | He said "should", not "will". Both of you are right.
        
         | [deleted]
        
       | 2Gkashmiri wrote:
       | i asked this question here and at many places before. why do
       | people "rely" on an organization that sifts through hundreds of
       | thousands of papers and then charge exorbitant prices for
       | providing this service? if we use the amazon analogy, is amazon
       | with millions of products worse than a boutique cat food seller
       | that specializes in a specific cat food for a specific cat breed?
       | maybe. but what about the "rest" of products?
       | 
       | why are our scientists made to rely on elsevier et al to sift
       | through the junk and find for them the perfect paper instead of
       | doing it themselves? is science now such a cutthroat quick
       | competition that it requires you to give a company the privilege
       | to work for you so that you dont have to do your own due
       | diligence?
       | 
       | in india, we have a lot of local research that is done on open
       | databases like shodh ganga and many more. but if you have to
       | access foreign research material, you'd better hope your
       | university has an agreement with elsevier and others to pay them
       | millions for a login. the alternative: go to scihub and find what
       | you need.
       | 
       | i understand the whole quality/delivery debate but doesnt the
       | average user already know who the big players in the specific
       | domain are and who are trusted? or you want discoverability at
       | the hands of a "trusted third party" without doing the legwork
       | yourself.
       | 
       | then at the other end you have non-academics like me. I might
       | have heard of a research paper in some article and i cannot read
       | it without paying an arm and a leg. why? if we use the whole
       | ebook/book argument that compensation is commensurate to the
       | sales so more popular book means more money to the author but
       | here authors arent compensated but elsevier so why should i pay
       | elsevier? because they filtered through 1000 papers to provide 10
       | and for that privilege, they require unlimited royalty for ever?
       | why?
        
         | Hendrikto wrote:
         | > why are our scientists made to rely on elsevier et al to sift
         | through the junk and find for them the perfect paper instead of
         | doing it themselves?
         | 
         | Scientists do do that themselves. That's why it is called peer
         | review. Journals take scientists' work for free; they just
         | pre-select papers, but don't do the review.
        
         | slater wrote:
         | There's some truth to "publish or perish". Scientists are
         | expected to publish in prestigious journals.
        
           | Qem wrote:
           | Not only expected, but actually forced. In many places, a
           | streak of a few years with no publications in prestigious
           | journals can irrecoverably sink a researcher's career.
        
           | OmicronCeti wrote:
           | A typical PhD dissertation these days is 3 publications in
           | high-quality journals. It is explicitly required at most
           | schools.
        
         | f6v wrote:
         | Because otherwise you'll have to sift through tons of garbage
         | "research". It's already common knowledge that many articles
         | coming from certain countries are fraudulent. There's a lower
         | chance of having those in journals like Nature Medicine.
        
       | aurizon wrote:
       | If only the Nobel Committee would say:- The Nobel Committee will
       | only consider research published under an Open Source Access
       | repository in reviewing published papers for consideration for
       | the Nobel Prize after ~~ June 30, 2022. This would unleash a
       | horde of hungry cats among those fat pigeons that are the
       | paywalled journals. There would be a crying and wailing - ending
       | with piles of feathers (and purring cats), and researchers all
       | over the world, and especially in the many 'third world'
       | Universities whose minds are currently held hostage to budgets
       | and local politics. The world would gain immeasurably by this
       | simple act!
        
       | [deleted]
        
       | gumby wrote:
       | 100 TB is pretty small. I wonder if she will start torrenting it
       | so people can back it up and share the load.
        
         | logifail wrote:
         | > 100 TB is pretty small. I wonder if she will start torrenting
         | it so people can back it up and share the load
         | 
         | This has been ongoing for a while now:
         | 
         |  _Rescue Mission for Sci-Hub and Open Science: We are the
         | library_
         | https://www.reddit.com/r/DataHoarder/comments/nc27fv/rescue_...
        
           | gumby wrote:
           | Excellent, thanks!
        
       | intunderflow wrote:
       | Remember to donate to sci-hub to keep it going! Even a small
       | donation helps and is way more than the extortionate prices we'd
       | all have to pay without it :D
        
         | [deleted]
        
       | The_rationalist wrote:
       | How come we don't have extensive software to help doctors'
       | decision making by making use of, e.g., Bayesian inference while
       | feeding on the available superintelligence that those 24 million
       | papers enable? Expert systems long ago passed the hype curve and
       | it's time for them to cycle up again!
        
         | f6v wrote:
         | Because research can be controversial. There're papers in my
         | field saying patients have increased frequency of certain
         | cells. There're other papers saying they're not. Go figure.
        
           | Qem wrote:
           | Nailed it. With publish or perish incentivizing shenanigans
           | like "p-hacking", many of those papers are the research-
           | equivalent of spam.
        
         | [deleted]
        
         | nefitty wrote:
         | I think Watson does something like this.
        
           | roywiggins wrote:
           | It didn't seem to actually work though.
           | 
           | https://slate.com/technology/2022/01/ibm-watson-health-
           | failu...
        
             | monkeybutton wrote:
             | I wonder how many years that sets back the field. Who will
             | want to invest in something that could end up being Watson
             | 2.0?
        
             | dagw wrote:
             | Did Watson fail because they were bad at their job or
             | because the problem is much harder than people assumed?
        
               | nefitty wrote:
               | I think the marketing got ahead of the tech. I would
               | classify that as a business failure.
        
         | kilburn wrote:
         | An older comment of mine
         | https://news.ycombinator.com/item?id=30049522 fits well here.
         | I'll adapt it to your question ;)
         | 
         | Basically: medicine as a whole is already some sort of expert
         | system.
         | 
         | - Data collection and cleanup: Researchers conduct experiments
         | to produce meaningful data and extract conclusions from that
         | data.
         | 
         | This part isn't more automated because we have strict rules
         | that prevent medical data collection and analysis without a
         | clear purpose. Otherwise we'd be able to collect a lot more
         | information to try and extract results from it using more
         | inference-oriented techniques (deep learning and the like).
         | 
         | - Modeling & training: Expert panels produce guidelines from
         | the results of that research. These panels are the "training
         | part" of the system.
         | 
         | As a sibling comment said, replacing these panels with ML-based
         | techniques isn't trivial because the data produced in the
         | previous step is fairly noisy (p-value hacking, difficulty of
         | capturing all the variables, etc.). Furthermore, the techniques
         | that yield best results nowadays also produce them without
         | clear explanations on why they hold, which is not something we
         | are prepared to accept in medicine.
         | 
         | - Execution: Doctors diagnose and treat following said
         | guidelines. In fact, they use decision flows that they
         | themselves call... algorithms!
         | 
         | The main reason why execution is not automated is that we do
         | not have the technology for machines to capture the contextual
         | and communication nuances that doctors pick up on. There can be
         | a world of difference between the exact same statement given by
         | two different patients or even the same patient in two
         | different situations. Likewise, the effect of a doctor's
         | statement can be quite literally the opposite depending on who
         | the patient is and their state of mind. One of the most
         | important aspects of the GP's job is to handle these
         | differences to achieve the best possible outcomes for their
         | patients.
         | 
         | All that being said, there are companies trying to produce
         | expert systems to help doctors diagnose. See
         | https://infermedica.com/product/infermedica-api for instance.
        
         | [deleted]
        
         | [deleted]
        
       | belter wrote:
       | It looks like the torrents have all subjects. Anybody aware if
       | there are torrents only for Math or Comp.sci ?
        
       | Ansil849 wrote:
       | Is there any word about when sci hub is going to start adding new
       | articles again? It's currently only useful as an archive of old
       | research articles. New papers from the last year are not
       | available. I never understood the rationale for stopping new
       | content, though I believe it had some relation to some court case
       | in India...but I don't understand why that was a reason to stop
       | adding articles, and why it hasn't been restarted yet.
        
         | derbOac wrote:
         | What I read was that the Indian judicial system tends to be
         | favorable to things like Sci Hub in its interpretation of
         | copyright, and Sci Hub wanted to act in good faith with regard
         | to that court, so as to have a fairly solid basis in
         | international law for operating, should it rule in Sci Hub's
         | favor. I might be off in this understanding, but that's what I
         | understood.
        
           | Ansil849 wrote:
           | Yeah, I have heard this reasoning, but it seems muddled. How
           | is keeping the site online so old articles are available but
           | no new articles are added acting in "good faith"? It's not
           | like the old articles are any less copyrighted than the new
           | articles, so this doesn't make sense to me.
           | 
           | The court case has also been delayed for over a year now, so
           | if it is delayed indefinitely, like it seems to be, then we
           | will also not get access to new articles, also indefinitely?
           | That's ridiculous. The last update from the court proceedings
           | claimed that there would be a new update over a month ago,
           | which in turn got delayed yet again to a few days ago, and
           | there's been nothing [1].
           | 
           | [1] https://delhihighcourt.nic.in/dhc_case_status_oj_list.asp
           | ?pn...
        
             | baybal2 wrote:
             | In India, courts have famously few remedies against no-show
             | from plaintiffs.
        
         | joshuaissac wrote:
         | They had resumed adding articles after receiving legal advice
         | that the Delhi High Court injunction only applied for a few
         | months.
         | 
         | https://mobile.twitter.com/ringo_ring/status/143435621720862...
        
           | Ansil849 wrote:
           | I saw that tweet, but it doesn't change the material reality:
           | try plugging in some DOIs from recent articles from the last
           | year, and they will not be there.
           | 
           | Scihub used to be a great resource, now it's only a resource
           | for old research. Still useful for background material, but
           | not for current work.
           | 
           | I also don't understand why the Indian court case has any
           | impact on new article availability. The owner is not Indian.
           | The servers and domains are not Indian. There doesn't seem to
           | be any actual reason to stop adding new articles, other than
           | some idiotic halfbaked point that only hurts the people who
           | need the articles, like when Project Gutenberg banned anyone
           | from a German IP, except this is much worse since there is no
           | way around it for people who need new papers.
        
             | [deleted]
        
             | generationP wrote:
             | I have a hunch that the downfall of the "Plato" real-time
             | downloader wasn't the Indian court case but rather the fact
             | that it helped publishers trivially identify the university
             | accounts through which the downloads were happening. Even
             | if the appearance of papers were delayed by a random number
             | of days, there are other pitfalls now, and most
             | importantly, publishers started caring. In particular,
             | Elsevier now slaps UUIDs onto all PDFs you download from
             | them, and no, I'm not just talking of visible watermarks.
             | Other publishers seem to be doing similar things (there was
             | a recent twitter thread on this, retweeted by @textfiles,
             | which I can't find). The rational solution for Sci-Hub
             | seems to be to buffer their uploads and release them in
             | yearly batches, maybe programmatically removing various
             | kinds of watermarks and diffing against the same paper
             | downloaded from a second IP. If this is what they are
             | doing, I'm not surprised. Not sure how much of a winning
             | strategy they have in the long run, though.
             | 
             | Guys: post your papers on the arXiv.
        
               | mohammad_ali85 wrote:
               | This might be the twitter thread you're referring to?
               | https://twitter.com/json_dirs/status/1486120144141123584
        
               | generationP wrote:
               | Yep, thank you!
        
             | joshuaissac wrote:
             | > I also don't understand why the Indian court case has any
             | impact on new article availability. The owner is not
             | Indian. The servers and domains are not Indian.
             | 
             | Because Sci-Hub has a good chance of winning the case. The
             | court in question has previously backed a very broad
             | definition of what constitutes fair dealing.
             | 
             | https://en.m.wikipedia.org/wiki/University_of_Oxford_v._Ram
             | e...
        
               | Ansil849 wrote:
               | > Because Sci-Hub has a good chance of winning the case.
               | 
               | I understand that this is the party line that is parroted
               | whenever this issue comes up, but it does not make any
               | sense as a rationale for keeping new articles off the
               | site. How is not adding any new articles (but, for
               | example, keeping old articles accessible) assisting the
               | possible winning of the case? And more to the point, why
               | does it matter at all if it wins or loses the case? As
               | stated, neither the owner nor the infrastructure is
               | Indian, so of what relevancy is this jurisdiction?
               | 
               | And further still, the case appears to have been delayed
               | indefinitely. That last update claims that there was
               | going to be an update a few days ago, but there was not.
               | The proceedings are just now a list of one postponement
               | after another [1]. Given that new articles are being held
               | hostage, it thus very obviously benefits the legal system
               | and the prosecution to continue to delay the case
               | indefinitely.
               | 
               | [1] https://delhihighcourt.nic.in/dhc_case_status_oj_list
               | .asp?pn...
        
               | sa1 wrote:
               | The owner might not be Indian, but she's actively
               | defending the case (through lawyers) in India. Not
               | following the injunction would lead to her losing the
               | case, which is why she followed through. She didn't have
               | to fight the case in India, but she chose to. Why keep
               | old papers and stop adding new papers - that probably
               | depends on the terms of the injunction.
        
               | Ansil849 wrote:
               | > Why keep old papers and stop adding new papers - that
               | probably depends on the terms of the injunction.
               | 
               | As per the official tweet that has already been mentioned
               | in this thread [1]:
               | 
               | > how about the lawsuit in India you may ask: our lawyers
               | say that restriction is expired already
               | 
               | So according to the owner's official Twitter, this is no
               | longer a valid reason, and yet new papers are still not
               | accessible. Why is that?
               | 
               | [1] https://mobile.twitter.com/ringo_ring/status/14343562
               | 1720862...
        
               | sa1 wrote:
               | Haven't got around to adding yet?
        
               | Ansil849 wrote:
               | That is not how scihub used to function. Scihub used to
               | have an engine, named Plato, which would fetch papers
               | automatically if not already in their database. For the
               | last year now, this essential service has not been
               | operational. This is what the issue I am raising is
               | about.
        
               | sa1 wrote:
               | It's clear what you're talking about. Software bitrots
               | over time. Plato might need fixes, might have a huge
               | backlog, lots of stuff can happen.
        
       | pmoriarty wrote:
       | It's interesting how sci-hub's papers on medicine dwarf those in
       | many other fields like comp-sci, math, and physics. I wonder if
       | that reflects the number of papers in those fields, or if sci-hub
       | just has a non-representative sample. If the latter, why?
        
         | p1esk wrote:
         | It does appear to be the latter. I just searched for several
         | famous ML papers (attention is all you need, lottery ticket
         | hypothesis, capsules, etc) and they are not there. I think if
         | someone counted all papers that have been ever published
         | anywhere, the picture would be a lot different.
        
           | pmoriarty wrote:
           | So does that mean that vastly more people in medicine use
           | sci-hub than do people from other fields?
           | 
           | Or is there some other reason for the discrepancy?
        
             | p1esk wrote:
             | Could be. I've been an ML researcher for 8 years and I
             | haven't used sci-hub until today. Ironically one of my
             | (very obscure) papers is available there.
        
         | Qem wrote:
         | I guess today it is much easier to find new noteworthy,
         | publishable facts in medicine than in physics. New diseases are
         | discovered every year, and old diseases are poorly understood
         | (e.g. Alzheimer's disease), and the treatments for many of them
         | are still sub-optimal, or even nonexistent. Every patient is
         | different; individual cases are research-worthy. We only got
         | antibiotics in the 1940s. On the other hand, most big
         | breakthroughs of physics happened between the 17th century and
         | the first decades of the 20th century. After the general case
         | is cracked in physics, individual cases have very little
         | publishing value.
        
         | _Wintermute wrote:
         | I think it's due to the sheer number of biomedical papers
         | published each year, coupled with comp-sci, maths and physics
         | papers being less likely to be behind a paywall.
        
       ___________________________________________________________________
       (page generated 2022-02-12 23:00 UTC)