[HN Gopher] Pirate Library Mirror: Preserving 7TB of books (that...
       ___________________________________________________________________
        
       Pirate Library Mirror: Preserving 7TB of books (that are not in
       Libgen)
        
       Author : ValentineC
       Score  : 243 points
       Date   : 2022-07-03 20:39 UTC (2 hours ago)
        
 (HTM) web link (pilimi.org)
 (TXT) w3m dump (pilimi.org)
        
       | krick wrote:
       | So, if I get it right. First there was Libgen, which is
       | mirrorable. Then, some Z-Library copied Libgen and added some
       | more books, without making it mirrorable. The goal is to make
       | these new books, which are not mirrorable -- mirrorable (i.e. to
       | "preserve" them).
       | 
       | So, why not just re-upload them to Libgen, then? I guess somebody
       | will do that now anyway, but you could easily have done it in the
       | first place, without making your own mirror, which is not a
       | mirror of Libgen. Just upload them to Libgen and make a mirror of
       | Libgen.
        
         | facethewolf wrote:
         | From their FAQ:
         | 
         | > _Q: Should the Z-Library collection be added to Library
         | Genesis?_
         | 
         | > _A: Yes! However, it is tricky. Library Genesis splits out
         | its collection between non-fiction and fiction. They also have
         | relatively high quality standards. If you are interested in
         | organizing all the books to meet their requirements, let us
         | know._
        
           | krick wrote:
           | Oh, it didn't occur to me to navigate "backwards". Thanks.
           | Actually, their whole FAQ is quite a bit more enlightening than
           | just the linked page.
           | 
           | http://pilimi.org/faq.html
        
       | sacrosanct wrote:
       | This really needs to be hosted on a Tor hidden service. Clearnet
       | sites are easy to take down.
        
         | fsflover wrote:
         | Or, better on torrents inside I2P: http://geti2p.net.
        
         | progman32 wrote:
         | The homepage has a link to the onion site.
        
           | sacrosanct wrote:
           | So why does it have a clearnet address? To have more reach?
           | What's their threat model such that a clearnet presence could
           | possibly out the people behind this?
        
             | progman32 wrote:
             | I don't know. I'd guess reach. Probably wouldn't be on HN
             | if it were darknet only.
        
             | ipaddr wrote:
             | There is no SSL, so there is no cert fingerprint that could
             | leak through Shodan matching.
        
       | metadat wrote:
       | Please don't flag it this time. Folks deserve the option to at
       | least be aware of it and decide for themselves if they wish to
       | pursue it.
        
       | AdmiralAsshat wrote:
       | HTTP-only makes me weary of visiting a self-professed piracy
       | site.
       | 
       | They couldn't even spring for a Let's Encrypt cert?
        
         | 1vuio0pswjnm7 wrote:
         | https://web.archive.org/web/20220701204054if_/http://pilimi....
        
         | mkl wrote:
         | *wary
         | 
         | For some reason I'm seeing this mistake more and more lately.
         | https://en.wiktionary.org/wiki/weary vs
         | https://en.wiktionary.org/wiki/wary
        
           | nerdponx wrote:
           | I've seen it too, and it's a weird mistake to me because I am
           | certain that at least one of my friends who makes the mistake
           | did not make it in the past.
           | 
           | I wonder if there's a kind of memetic effect happening
           | online, where people who lack confidence in their English
           | spelling ability see somebody make this mistake and somehow
           | think that it's correct, so they switch how they write it.
        
             | gjvc wrote:
             | cf. revert / reply
        
           | jamiek88 wrote:
           | People have mixed up leery and wary and it has memetically
           | become 'weary'.
           | 
           | I notice it more and more too.
        
           | InCityDreams wrote:
           | No room for auto-correct?
        
             | brewdad wrote:
             | Even so, there are enough ESL readers on here, along with
             | native speakers who may not understand the difference, that
             | it makes sense to point it out once in a while.
             | 
             | Otherwise we end up in a lose/loose situation where I see
             | more people use it wrongly than correctly.
        
             | mkl wrote:
             | Seems unlikely, as "wary" appears to be the more common
             | word (71M Google results vs 55M). Maybe some keyboard
             | mistakenly does it though.
        
               | bbarnett wrote:
               | Not sure why the count is relevant here. Both are valid
               | words.
        
           | wccrawford wrote:
           | I've heard people _say_ it. People who I think should know
           | better. It's really frustrating.
        
         | quazar wrote:
         | It's a read only blog with 2 pages. What do you gain for
         | putting this over HTTPS?
        
           | ac0lyte wrote:
           | Trust?
        
             | na85 wrote:
             | Malicious sites can still use letsencrypt
        
           | SquareWheel wrote:
           | Do you really want your ISP to know which piracy sites you
           | frequent? This is all being sent in plain text. Or they could
           | change the content, insert a redirect, or inject ads without
           | your knowledge. TLS is needed on all websites - not just
           | those with interaction.
        
             | cookiengineer wrote:
             | I hate to break it to you, but why do you think ISPs
             | override the DNS responses with TTL set to 0?
             | 
             | TLS itself is only useful if you also rely on DNS over
             | HTTPS/TLS. Well, setting the issues with TLS 1.2 and
             | earlier aside.
        
               | SquareWheel wrote:
               | Those are problems too, but they aren't exploited nearly
               | as often as MITMing cleartext has been historically. The
               | solution you mention is already becoming widely-
               | supported, as are newer protocols like QUIC that
               | discourage snooping.
               | 
               | There's no reason to ignore a good solution just because
               | it's not 100% perfect.
        
             | ars wrote:
             | They still know which sites you visit even with https.
        
               | kenniskrag wrote:
               | only the destination IP. TLS encryption is inside tcp and
               | around the http protocol.
        
               | hombre_fatal wrote:
               | You don't need to copy and paste your reply everywhere
               | it's relevant on HN. Even us flea brains can carry your
               | remarks in our head and apply them to similar comments.
        
               | geoffeg wrote:
               | And, unless you set up an appropriate DNS server instead of
               | the default from your ISP, they also know that you
               | looked up the site's hostname(s).
        
             | generalizations wrote:
             | https won't keep your ISP from knowing you visited the
             | site. And the rest of those? For a text-only blog, they
             | seem kinda trivial.
        
               | simlevesque wrote:
               | > https won't keep your ISP from knowing you visited the
               | site
               | 
               | If you use DoH, yes it does. Unless I'm mistaken. They
               | only know the IP address of the remote server.
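
A minimal sketch of what the DoH point above means in practice: the DNS question is encoded in ordinary wire format and carried as a parameter of an HTTPS request (the GET form described in RFC 8484), so an on-path observer sees a TLS connection to the resolver rather than the hostname being looked up. The resolver URL below is a placeholder and the helper names are made up for illustration; nothing is sent over the network.

```python
import base64
import struct


def build_dns_query(hostname: str, qtype: int = 1) -> bytes:
    """Encode a single-question DNS query in wire format (RFC 1035)."""
    # Header: ID=0, flags=0x0100 (recursion desired), one question, no records.
    header = struct.pack("!HHHHHH", 0, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is length-prefixed; a zero byte terminates the name.
    labels = hostname.rstrip(".").encode("ascii").split(b".")
    qname = b"".join(bytes([len(label)]) + label for label in labels) + b"\x00"
    # QTYPE (1 = A record) followed by QCLASS (1 = IN).
    return header + qname + struct.pack("!HH", qtype, 1)


def doh_get_url(resolver_url: str, hostname: str) -> str:
    """Build a DoH GET URL; the query is base64url-encoded without padding."""
    wire = build_dns_query(hostname)
    encoded = base64.urlsafe_b64encode(wire).rstrip(b"=").decode("ascii")
    return f"{resolver_url}?dns={encoded}"


if __name__ == "__main__":
    # "https://doh.example/dns-query" is a hypothetical resolver endpoint.
    print(doh_get_url("https://doh.example/dns-query", "www.example.com"))
```

The hostname only appears inside the encrypted HTTPS request; the destination IP of the later connection to the site itself is still visible, which is the limit the comment mentions.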
        
               | kevin_thibedeau wrote:
               | And nobody would ever think of keeping a reverse DNS
               | index.
        
               | Anunayj wrote:
               | And the SNI. Until ECH is widely adopted, the SNI is leaked
               | in plaintext when connecting to a server. It has to be:
               | how else would the server know which TLS cert to reply
               | with?
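
To illustrate why the SNI is readable on the wire, here is a rough sketch of the server_name extension layout from RFC 6066, built and parsed by hand. It covers only this one extension, not a full ClientHello, and the function names are invented for the example.

```python
import struct


def build_sni_extension(hostname: str) -> bytes:
    """Encode a TLS server_name extension carrying one host_name entry."""
    name = hostname.encode("ascii")
    # ServerName entry: name_type (0 = host_name), 2-byte length, then the name.
    entry = struct.pack("!BH", 0, len(name)) + name
    # ServerNameList: 2-byte length prefix around all entries.
    server_name_list = struct.pack("!H", len(entry)) + entry
    # Extension: type 0x0000 (server_name), 2-byte length, then the list.
    return struct.pack("!HH", 0x0000, len(server_name_list)) + server_name_list


def read_sni_hostname(extension: bytes) -> str:
    """Recover the hostname exactly as a passive observer could."""
    ext_type, _ext_len, _list_len, name_type, name_len = struct.unpack_from(
        "!HHHBH", extension, 0
    )
    if ext_type != 0 or name_type != 0:
        raise ValueError("not a server_name extension with a host_name entry")
    # The fixed-size fields above occupy the first 9 bytes.
    return extension[9:9 + name_len].decode("ascii")


if __name__ == "__main__":
    ext = build_sni_extension("pilimi.org")
    print(ext)                     # the hostname is visible in the raw bytes
    print(read_sni_hostname(ext))
```

No key material is involved anywhere in the parse, which is the point: without ECH these bytes are sent before any encryption is negotiated.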
        
               | cookiengineer wrote:
               | > They only know the IP address of the remote server.
               | 
               | It's the internet. Everyone can scrape links and
               | measure/correlate which assets were on them to correlate
               | likely visited websites.
               | 
               | Especially if every web page these days is pretty unique
               | in terms of what kind of assets (network streams) with
               | what kind of byte size were loaded at which point in the
               | document loading timeline.
               | 
               | Now include the TLS fingerprint of your web browser and
               | well, privacy went to shit.
               | 
               | HTTP needs an upgrade with scattering and rerouting on
               | the fly, otherwise these deanonymization techniques can
               | never be fixed.
        
               | 1vuio0pswjnm7 wrote:
               | Not disagreeing but presenting a hypothetical:
               | 
               | If the user requests the page from Internet Archive,
               | Common Crawl or even Google Cache, how does the ISP know
               | what the user requested? (NB. Neither IA nor Google Cache
               | require sending SNI,^1 so the ISP may only see IP
               | addresses).
               | 
               | With IA, the IP address alone does not reveal which IA
               | site or page the user is requesting. There is more to IA
               | than only Wayback Machine.
               | 
               | With Common Crawl, the user can send the Cloudfront
               | domain name instead of a commoncrawl.org domain. Are all
               | ISPs going to know that this is Common Crawl? Even if
               | they expend the effort to learn, what benefit is
               | achieved?
               | 
               | With Google Cache, the IP address alone does not reveal
               | which Google site the user is accessing. Needless to say,
               | there are many, many domains using these IP addresses.
               | 
               | There is nothing that requires any web user to retrieve
               | web pages from a given host. The page may be mirrored at
               | a number of hosts. Some of those hosts might offer HTTPS,
               | support TLS1.3 and not require plaintext SNI/offer
               | encrypted ClientHello.
               | 
               | Even assuming an ISP can determine what domain name a
               | customer is sending in a Host header or ClientHello
               | packet, it would still be necessary to subpoena the
               | archive/CDN/cache to figure out precisely what pages were
               | being requested.
               | 
               | 1. The same party is controlling all the server
               | certificates. IA controls the certificates for all IA
               | domains, Amazon (issues and) controls all the
               | certificates for Cloudfront customers and Google controls
               | all the certificates for Google domains. Perhaps there
               | are web users commenting on HN who believe that
               | ingress/egress traffic for a site saved/hosted/cached at an
               | archive/CDN/cache is somehow private as against the
               | company running the archive/CDN/cache in a meaningful
               | way. I am not one of them.
               | 
               | As for the question of an ISP modifying the contents of
               | web pages, this is an issue that could be addressed
               | contractually in a subscriber agreement. It stands to
               | reason that if this was a serious issue and not merely a
               | hypothetical one raised by nerds debating the merits of
               | TLS then it would be addressed in such agreements.
               | 
               | As for the "injection of advertising" issue as an argument
               | in favour of the way TLS^2 is being administered on the
               | web, IMO this is a bit silly since (a) it is trivial to
               | filter such advertising (e.g., Javascript in the examples
               | I saw) out of the page and/or block it from
               | running/connecting/loading and (b) the amount of "tech"
               | company-mediated advertising that web users endure in
               | spite of using TLS is enormous. More likely than being
               | seen as a threat to web users, the injection of
               | advertising by ISPs was seen as a threat to the
               | advertising revenue of "tech" companies. The latter are
               | responsible for facilitating the injection of advertising
               | (by their customers, not their competitors, i.e., ISPs),
               | not preventing it.
               | 
               | 2. By "TLS administration" I do not mean encryption as a
               | concept nor certificates as a concept. I mean TLS
               | administration measures designed to support "tech"
               | companies first and web users second, if at all. A system
               | where the questions of "threat model" and "trust" are
               | both decided by "tech" companies not users.
        
               | robonerd wrote:
               | > _For a text-only blog_
               | 
               | If somebody MITMs it, they can serve you anything they
               | want.
        
               | enriquto wrote:
               | > they can serve you anything they want.
               | 
               | Great. More books!
               | 
               | No really, I don't understand this argument. A static
               | site served by plain http is perfectly appropriate. It's
               | like a poster hanging on the wall for all to see. Of
               | course people can paint over it, but it doesn't really
               | matter.
        
               | robonerd wrote:
               | They could serve you javascript that exploits your
               | browser. At the very least, they could replace that
               | bitcoin donation address with their own. That's a
               | tempting target if nothing else.
        
               | SquareWheel wrote:
               | And "they" isn't just your ISP. It's also that free wifi
               | hotspot you connected to, or the hotel service, or your
               | company's network. Even if you trust your ISP (and you
               | probably shouldn't), there are other bad actors to be
               | aware of.
        
               | criddell wrote:
               | HTTP connections can be used as a weapon against others.
               | One example is China's Great Cannon.
               | 
               | https://citizenlab.ca/2015/04/chinas-great-cannon/
        
               | kenniskrag wrote:
               | only the destination IP. TLS encryption is inside tcp and
               | around the http protocol.
        
               | Nextgrid wrote:
               | Until ESNI becomes mainstream (and browsers offer the
               | ability to enforce it), the domain name is also sent out
               | in plaintext.
        
               | syntheticcorp wrote:
               | ESNI has been dropped, a new spec alters how it works and
               | renames it Encrypted client hello (ECH)
               | 
               | https://blog.mozilla.org/security/2021/01/07/encrypted-
               | clien...
        
               | btdmaster wrote:
               | ECH looks quite interesting, but isn't it quite easy to
               | do a reverse DNS lookup for most domains?
        
           | notriddle wrote:
           | The same thing you always get. Assurance that your free Wi-Fi
           | hotspot isn't tampering with the page.
        
             | krick wrote:
             | This, and also MITM (like ISP) needs to make their own
             | request to this site, to know what I read. And,
             | technically, they cannot really be sure it's what I've
             | read, since nothing says that this site is static.
             | 
             | I'm not that offended, and torrents are only available via
             | TOR anyway, but I do actually appreciate the sentiment.
             | There's no reason not to be using TLS.
        
       | uniqueuid wrote:
       | It's really funny to think about how advances in technology
       | keep changing how we perceive books.
       | 
       | 7TB is even a commodity disk these days. And it's a lot less than
       | the torrent of scientific papers that floated around some time
       | ago (that was ~18TB IIRC).
        
         | nonrandomstring wrote:
         | I foresee storage density reaching the point that for most
         | ordinary people "online" becomes rather unimportant. What would
         | be the effects of technology when computers behave as in early
         | science fiction, as stand-alone oracles? [1]
         | 
         | [1]
         | https://www.timeshighereducation.com/opinion/2048-informatio...
        
           | themodelplumber wrote:
           | How would the appeal of streamers and live data/content
           | settle out in that case? Sometimes context is available in
           | the moment that makes it easier for all parties to consume
           | and analyze in that moment as well.
           | 
           | Since transient, ethereal meme culture is also basically
           | emergent culture now, it's difficult not to also foresee a
           | greater cultural divide in such a case. This is saying
           | nothing of live data tools as well, even weather data...
        
         | ars wrote:
         | It's 7TB compressed. If it's text you'd need about 70TB to
         | decompress it. It's probably mostly images though, so probably
         | not quite that bad.
        
           | emj wrote:
           | I've tried to do lossy compression of epubs with some lines
           | of bash scripts; i.e. removing the images and fonts that were
           | not needed. Many epubs could be downsized to a third of their
           | size, but then I found a book that needed the supplied fonts
           | and gave up. When doing lossy compression, you cannot have
           | those kinds of bugs.
           | 
           | What I also found was that many of the images in the epubs
           | were already unusable and nothing like their counterparts
           | in physical books.
        
             | solarmist wrote:
             | I don't understand this. Are they epubs of comics or
             | something? Epubs are already compressed (zip).
        
               | robonerd wrote:
               | It's not terribly uncommon to find an epub with several
               | megabytes of cover art and a few hundred kilobytes of
               | text.
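
Since an epub is a zip container, the "megabytes of cover art, kilobytes of text" observation can be checked per file. The following is a small sketch using Python's standard zipfile module; the extension lists are a rough guess rather than anything from the EPUB spec, and the demo file is synthetic (random bytes standing in for an image).

```python
import io
import os
import zipfile
from collections import defaultdict

# Rough, assumed groupings by file extension.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".gif", ".svg", ".webp"}
FONT_EXTS = {".ttf", ".otf", ".woff", ".woff2"}
TEXT_EXTS = {".xhtml", ".html", ".htm", ".css", ".ncx", ".opf", ".xml"}


def epub_size_breakdown(epub) -> dict:
    """Sum compressed bytes per category; `epub` is a path or file object."""
    totals = defaultdict(int)
    with zipfile.ZipFile(epub) as archive:
        for info in archive.infolist():
            ext = os.path.splitext(info.filename)[1].lower()
            if ext in IMAGE_EXTS:
                category = "images"
            elif ext in FONT_EXTS:
                category = "fonts"
            elif ext in TEXT_EXTS:
                category = "text"
            else:
                category = "other"
            totals[category] += info.compress_size
    return dict(totals)


def make_demo_epub() -> io.BytesIO:
    """Build a synthetic epub-like zip in memory for the demonstration."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("OEBPS/chapter1.xhtml", b"<p>Some prose.</p>" * 2000)
        # Random bytes do not compress, much like an already-compressed JPEG.
        archive.writestr("OEBPS/cover.jpg", os.urandom(200_000))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(epub_size_breakdown(make_demo_epub()))
```

Repetitive markup shrinks a lot under deflate while image data barely shrinks at all, so images tend to dominate the size of the container.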
        
       | generalizations wrote:
       | I wonder if there are any search engines dedicated to indexing
       | these kinds of libraries. I know there's a decent one just for
       | scihub, but it would be awesome if I could do a Google-style
       | search that returned the contents of books, magazines and journal
       | articles instead of just websites.
        
         | delusional wrote:
         | Wasn't that what google books was supposed to be?
        
           | voisin wrote:
           | I assume Google abandoned this along with all their earlier
           | mission statements in favour of building another chat app.
        
             | bbarnett wrote:
             | I can't imagine a life at google. So much promise, all
             | turned to ash.
        
             | lupire wrote:
             | Could you take a moment to check if Google Books search
             | still exists?
             | 
             | I'll give you a hint: https://books.google.com/?hl=en
             | 
             | > Search the world's most comprehensive index of full-text
             | books.
        
               | voisin wrote:
               | I know it exists, but it has appeared to languish for
               | years. They rely on third parties now for inclusion of
               | books, whereas in the early days they innovated on their
               | own with specialized scanning technology. They seemed
               | quite proud of it a decade ago. When was the last time
               | Google has touted their books project? Have they even
               | integrated searching books into their main search (which
               | was supposed to catalogue and make searchable all the
               | world's information)?
               | 
               | It hasn't been killed but it is clearly a zombie.
        
               | londons_explore wrote:
               | It was paralyzed by legal disputes with book publishers.
               | 
               | In the years the lawsuits were going on, nearly everyone
               | left the project. And then the lawyers have put in so
               | many red lines that it's nearly impossible to make any
               | changes to it.
        
         | zozbot234 wrote:
         | Book metadata is widely available via sites like e.g. Open
         | Library. With good metadata, full text search is not as
         | relevant.
        
         | sacrosanct wrote:
         | There is the Imperial Library of Trantor: https://trantor.is/
         | 
         | They offer a clearnet and a hidden service .onion in case you
         | don't want ISPs blocking access to it.
        
         | lupire wrote:
         | books.google.com
        
       | 2OEH8eoCRo0 wrote:
       | Your movie recommendations suck, ValentineC ;)
        
       | ars wrote:
       | 7TB of compressed text? I don't think humanity has generated that
       | many written words in its entire existence. Although it would be
       | an interesting Fermi Problem to estimate (and don't forget just
       | how well text compresses).
       | 
       | This has to be a lot of duplicates or bad formats (images). This
       | would be far more useful to people with some curating.
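
The Fermi estimate suggested above can be roughed out in a few lines. Every per-book size here is an assumption picked for illustration, not a measurement of this collection; the only figure taken from the thread is the 7 TB total.

```python
TB = 10**12
MB = 10**6


def estimate_book_count(archive_bytes: float, avg_file_bytes: float) -> float:
    """Number of files of the given average size that fit in the archive."""
    return archive_bytes / avg_file_bytes


if __name__ == "__main__":
    archive = 7 * TB  # total size quoted in the thread
    # Assumed average file sizes, for illustration only.
    scenarios = [
        ("compressed plain text, ~0.5 MB per book", 0.5 * MB),
        ("typical EPUB, ~2 MB per book", 2 * MB),
        ("scanned PDF, ~20 MB per book", 20 * MB),
    ]
    for label, avg_size in scenarios:
        count = estimate_book_count(archive, avg_size)
        print(f"{label}: about {count:,.0f} books")
```

The spread across scenarios is two orders of magnitude, which is why the file format (plain text versus scanned PDF) matters more to the estimate than any other guess.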
        
         | khazhoux wrote:
         | These are most likely PDFs, not plaintext.
        
       | Larrikin wrote:
       | Is there an index of what's included? 7 TB is a lot to ask for
       | simply upholding an ideal.
        
         | robonerd wrote:
         | It apparently comes with an index in the form of a MySQL
         | database that contains title, author, description, and
         | filetype.
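
As a sketch of how such an index could be queried before committing to a multi-terabyte download: the example below uses an in-memory SQLite table as a stand-in for the MySQL dump. The table name `books`, the exact column layout, and the rows are all hypothetical; only the four field names come from the comment above.

```python
import sqlite3


def build_demo_index() -> sqlite3.Connection:
    """Create a stand-in index; the schema and rows are made up."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE books (title TEXT, author TEXT, description TEXT, filetype TEXT)"
    )
    conn.executemany(
        "INSERT INTO books VALUES (?, ?, ?, ?)",
        [
            ("Placeholder Title A", "Author One", "Illustrative row", "epub"),
            ("Placeholder Title B", "Author Two", "Illustrative row", "pdf"),
            ("Another Example", "Author Three", "Illustrative row", "pdf"),
        ],
    )
    return conn


def search_titles(conn: sqlite3.Connection, term: str, filetype=None) -> list:
    """Find titles containing `term`, optionally restricted to one filetype."""
    sql = "SELECT title, author, filetype FROM books WHERE title LIKE ?"
    params = [f"%{term}%"]
    if filetype is not None:
        sql += " AND filetype = ?"
        params.append(filetype)
    sql += " ORDER BY title"
    return conn.execute(sql, params).fetchall()


if __name__ == "__main__":
    index = build_demo_index()
    for row in search_titles(index, "Placeholder", filetype="pdf"):
        print(row)
```

The same parameterized query shape should carry over to a real MySQL client with minor placeholder-syntax changes, but that has not been checked against the actual dump.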
        
       | robonerd wrote:
       | > _We will release the data in stages, as we are still processing
       | the files. Right now the metadata file and a few of the torrents
       | are available. Note that the torrent files are only available
       | through our TOR mirror._
       | 
       | Presently, only the first four of several dozen parts are
       | available.
        
       | dkjaudyeqooe wrote:
       | To be fair, Z-Library doesn't charge unless you want to download
       | more than 10 books per 24 hour period. That's per account and
       | although they ask you not to open multiple accounts they don't
       | seem to do anything to stop you.
        
         | PostOnce wrote:
         | What kind of fairness can there be in charging for stolen
         | books?
         | 
         | I believe in free access to education, but charging for these
         | books they have no rights to is a whole other thing.
        
           | mdp2021 wrote:
           | Running costs.
        
           | Sparkle-san wrote:
           | It would depend on how they use the funds. I wouldn't be
           | surprised if bandwidth expenses made up a majority of what it
           | cost to run Z-Library and that money has to come from
           | somewhere.
        
           | II2II wrote:
           | At this point it is worth noting that there are reputable
           | sources for free books, such as public libraries and Project
           | Gutenberg.
           | 
           | I realize that neither source will satisfy many of the people
           | on HN, simply because there is a need for current technical
           | books.
        
           | fancyfredbot wrote:
           | You don't have to pay them if you go steal them yourself. If
           | you find it more convenient to pay another thief to do it for
           | you then I don't think that's significantly less fair.
        
             | UmbertoNoEco wrote:
             | All those India/Malaysia/Guatemala kids stealing a 200 USD
             | PDE book from Pearson should go to jail... in America...
             | Assange style
        
           | KMnO4 wrote:
           | They're not charging for the books; they're charging for the
           | bandwidth.
        
           | dkjaudyeqooe wrote:
           | The "to be fair" is against the claim in the article that
           | they charge for books.
           | 
           | It's a figure of speech, not a comment on what they do.
        
           | jl6 wrote:
           | Yeah, grey hat shenanigans easily become black hat as soon as
           | money is involved.
        
           | mgaunard wrote:
           | Technically, they're not charging for the books, they're
           | charging for the bandwidth.
        
           | krick wrote:
           | I don't quite agree. I mean, they provide useful service, and
           | it costs money to run it. It's ok that they earn (even if
           | it's actually making a profit, not just covering the costs).
           | 
           | That being said, 10 downloads/day feels a bit restrictive to
           | me. I'd get it if it was 100, or 50, heck, maybe even 20. I
           | mean, I don't appreciate that it's not mirrorable in the
           | first place, but maybe they cannot afford it, I don't know...
           | But 10 feels less than somebody researching a new topic might
           | need to access in a day, even if he won't read them all
           | immediately.
           | 
           | ...That being said as well, it has some really nice UI. I
           | wish somebody did it for Libgen.
        
             | lupire wrote:
             | You read more than 10 books per day?
        
               | krick wrote:
               | Read -- no, I don't. Download to skim and see the
               | contents -- yes (even though I don't do it everyday,
               | obviously). In fact, I rarely download less than 4 books
               | at once, except for occasions when it's a new book of my
               | favorite writer (in which case I can as well just buy
               | it). Instead, there is some topic, some reason why I need
               | these books, and I somehow can gather a dozen
               | recommendations, maybe more, then I need to actually get
               | a look inside of them, to see what I'll be reading (if
               | anything). It also happens that I kinda know the book,
               | but not precisely enough, because some authors really
               | like to milk the topic by publishing 5 books kinda the
               | same as the first successful one, and if they are
               | technical they can have 5 revisions each. I may not read
               | them at all, or I may be reading them during the whole
               | next year, but I'll need to get them all at once at
               | first.
               | 
               | And if we also count papers, which this site provides too
               | -- easily.
        
           | swayvil wrote:
           | >What kind of fairness...?
           | 
           | Well they are providing a great public service and their
           | system takes money to maintain. It's just a small fee.
        
       ___________________________________________________________________
       (page generated 2022-07-03 23:00 UTC)