[HN Gopher] LibGen's Bloat Problem
       ___________________________________________________________________
        
       LibGen's Bloat Problem
        
       Author : liberalgeneral
       Score  : 226 points
       Date   : 2022-08-21 12:22 UTC (10 hours ago)
        
 (HTM) web link (liberalgeneral.neocities.org)
 (TXT) w3m dump (liberalgeneral.neocities.org)
        
       | idealmedtech wrote:
       | I'd love to see that distribution at the end with a log-axis for
       | the file size! Or maybe even log-log, depending. Gives a much
       | better sense of "shape" when working with these sorts of
       | exponential distributions
        
       | Retr0id wrote:
       | In an ideal world, every book could be given an "importance"
       | score, for some arbitrary value of importance. For example, how
       | often it is cited. This could be customised on a per-user basis,
       | depending on which subjects and time periods you're interested
       | in.
       | 
        | Then you can specify your disk size and solve the knapsack
        | problem to figure out the optimal subset of files to store.
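        | 
        | For illustration, a rough greedy sketch of that selection
        | (a real knapsack solver would do better; the importance
        | scores here are entirely hypothetical):
        | 
        |     def pick_books(books, disk_bytes):
        |         """books: list of (book_id, size_bytes, importance)."""
        |         # Take the highest importance-per-byte books first.
        |         ranked = sorted(books, key=lambda b: b[2] / b[1],
        |                         reverse=True)
        |         chosen, used = [], 0
        |         for book_id, size, importance in ranked:
        |             if used + size <= disk_bytes:
        |                 chosen.append(book_id)
        |                 used += size
        |         return chosen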
       | 
       | Edit: Curious to see this being downvoted. Is it really that bad
       | of an idea? Or just off-topic?
        
         | ssivark wrote:
          | Seems like a perfectly good idea to me! Basically you're
          | proposing that we decide caching by some score, and that
          | the details of the score function be tweaked to handle the
          | different aspects we care about.
         | 
         | I wonder whether this idea is already used for locating data in
         | distributed systems -- from clusters all the way to something
         | like IPFS.
        
         | DiggyJohnson wrote:
          | Not to sound blunt, but to answer your question about the
          | downvotes (which you probably didn't deserve, especially
          | without a reply):
         | 
         | The concept of an importance score feels very centralized and
         | against the federated / free nature of the site. Towards what
         | end?
         | 
         | If the "importance score" impacts curation, I am strongly
         | against it. Not only is it icky, but how is it different than a
         | function of popularity?
        
           | Retr0id wrote:
           | I'm not suggesting reducing the size of the LibGen
           | collection, I'm thinking along the lines of "I have 2TB of
           | disk space spare, and I want to fill it with as much
           | culturally-relevant information as possible".
           | 
            | If the entire collection were available as a torrent (maybe it
           | already is?), I could select which files I wish to download,
           | and then seed.
           | 
           | Those who have 52TB to spare would of course aim to store
           | everything, but most people don't.
           | 
           | Just as the proposal in the OP would result in the remaining
           | 32.59 TB of data being less well replicated, my approach has
           | the problem that less "popular" files would be poorly
           | replicated, but you could solve that by _also_ selecting some
           | files at random. (e.g. 1.5TB chosen algorithmically, 0.5TB
           | chosen at random).
        
             | liberalgeneral wrote:
              | I don't think you deserved the downvotes, and I don't
              | think it's a bad idea either; indeed, some coordination
              | on how to seed the collection is really needed.
             | 
             | For instance phillm.net maintains a dynamically updated
             | list of LibGen and Sci-Hub torrents with less than 3
             | seeders so that people can pick some at random and start
             | seeding: https://phillm.net/libgen-seeds-needed.php
        
           | [deleted]
        
       | wishfish wrote:
       | Has anyone ever stumbled across an executable on LibGen? The
       | article mentioned finding them but I've never seen one.
       | 
       | I agree with the other comments that LibGen shouldn't purge the
       | larger books. But, in terms of mirrors, it would be nice to have
       | a slimmed down archive I could torrent. 19 TB would be
       | manageable. And would be nice to have a local copy of most of the
       | books.
        
         | liberalgeneral wrote:
         | > Has anyone ever stumbled across an executable on LibGen? The
         | article mentioned finding them but I've never seen one.
         | 
         | Here is a list of .exe files in LibGen:
         | https://paste.debian.net/hidden/1c82739a/
         | 
         | And a breakdown of file extensions:
         | https://paste.debian.net/hidden/579e319c/
         | 
         | > And would be nice to have a local copy of most of the books.
         | 
         | Yes! That was my intention--I wasn't advocating for a purge of
         | content but a leaner and more practical version would be
         | amazing.
        
           | marcosdumay wrote:
           | So, 1000 exes and 500 isos (that may be problematic, but most
           | probably aren't). Everything else seems to be what one would
           | expect.
           | 
           | That's way cleaner than I could possibly expect. Do people
           | manually review suspect files?
        
           | macintux wrote:
           | > Yes! That was my intention--I wasn't advocating for a purge
           | of content but a leaner and more practical version would be
           | amazing.
           | 
           | Your piece doesn't make that obvious at all, and given how
           | many people here are misunderstanding that point, you might
           | want to update it.
        
             | liberalgeneral wrote:
             | You are right, added a paragraph at the end.
        
           | wishfish wrote:
           | Thanks for the lists. I was genuinely curious about the exes.
           | Nice to know where they originate. Interesting that over half
           | of them have titles in Cyrillic. I guess not so many English
           | language textbooks (with included CDs) have been uploaded
           | with the data portion intact.
        
         | pnw wrote:
         | You can search by extension - there's a lot of .exe files,
         | mostly Russian AFAIK.
        
         | hoppyhoppy2 wrote:
         | I saw a book on antenna design on libgen that originally
         | included a CD with software, and that disk image had been
         | uploaded to the site.
        
       | johndough wrote:
       | That graph of file size vs. number of files would be much easier
       | to read if it were logarithmic. I guess OP is using matplotlib.
       | In this case, use plt.loglog instead of plt.plot. Also, consider
       | plt.savefig("chart.svg") instead of png.
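        | 
        | Something like this, with synthetic stand-in data (the real
        | numbers are in OP's chart):
        | 
        |     import numpy as np
        |     import matplotlib.pyplot as plt
        |     # Synthetic stand-in for (file size, file count) pairs.
        |     sizes = np.logspace(-1, 3, 50)      # 0.1 .. 1000 MiB
        |     counts = 1e6 / (1.0 + sizes) ** 2   # made-up shape
        |     plt.loglog(sizes, counts)
        |     plt.xlabel("file size (MiB)")
        |     plt.ylabel("number of files")
        |     plt.savefig("chart.svg")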
        
         | liberalgeneral wrote:
         | Here is the raw data if you are interested:
         | https://paste.debian.net/hidden/77876d00/
        
           | johndough wrote:
           | Thanks. Here is a logarithmic plot as SVG:
           | https://files.catbox.moe/zbf35r.svg
           | 
            | On second thought, a logarithmic histogram might convey
            | even more information, but that would require the
            | individual file sizes in order to recompute the bins.
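            | 
            | Roughly like this; synthetic sizes stand in for the real
            | per-file data:
            | 
            |     import numpy as np
            |     import matplotlib.pyplot as plt
            |     # Synthetic per-file sizes in MiB.
            |     sizes = np.random.lognormal(1.0, 1.5, 100_000)
            |     bins = np.logspace(np.log10(sizes.min()),
            |                        np.log10(sizes.max()), 60)
            |     plt.hist(sizes, bins=bins)
            |     plt.xscale("log")
            |     plt.savefig("histogram.svg")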
        
             | kqr wrote:
             | Huh, this distribution is not the power law I would have
             | expected. Maybe because it's limited to one media type
             | (books)?
        
       | cbarrick wrote:
        | This is a complete nit, but
        | 
        |     s/an utopia/a utopia/
       | 
       | Even though "utopia" is spelled starting with a vowel, it is
       | pronounced as /ju:'toUpi@/, like "yoo-TOH-pee-@", with a
       | consonant sound at the start.
       | 
       | Since the word starts with a consonant sound, the proper
       | indefinite article is "a".
        
         | kevin_thibedeau wrote:
         | Now you have to convince the intelligentsia how to use the
         | proper article with "history".
        
       | boarush wrote:
        | I don't think OP takes into account that there are often
        | multiple editions of the same book, which people need to
        | refer to. Not everyone wants the latest edition when the
        | class you're in is using some old edition.
        
         | liberalgeneral wrote:
          | If you are referring to my duplication comments, sure (but
          | even then I believe there are duplicates of the exact same
          | edition of the same book). The filtering by file size is
          | orthogonal to editions etc., so it has nothing to do with
          | that.
        
           | xenr1k wrote:
           | I agree. There are duplicates. I have seen it.
           | 
            | I have found the same book as multiple PDFs of different
            | sizes, with the same content. Maybe someone uploaded a
            | poorly scanned PDF when the book was first released, and
            | later someone else uploaded an OCRed version, but the
            | first one just stayed, hogging a large amount of
            | storage.
        
             | MichaelCollins wrote:
             | How do you automate the process of figuring out which
             | version is better? It's not safe to assume the smaller
             | versions are always better, nor the inverse. Particularly
             | for books with images, one version of the book may have
             | passable image quality while the other compressed the
             | images to jpeg mush. And there are considerations that are
             | difficult to judge quantitatively, like the quality of
             | formatting. Even something seemingly simple like testing
             | whether a book's TOC is linked correctly entails a huge
              | rat's nest of heuristics and guesswork.
        
               | macintux wrote:
               | I don't think anyone is arguing it can be fully
               | automated, but automating the selection of books to
               | manually review is certainly viable.
        
           | rinze wrote:
           | As the previous reply said, I've also seen duplicates while
           | browsing. Would it be possible to let users flag duplicates
           | somehow? It involves human unreliability, which is like
           | automated unreliability, only different.
        
         | generationP wrote:
         | In practice, it's more often the same file with minor edits
         | such as a PDF table of contents added or page numbers
         | corrected. Say, how many distinct editions of this standard
         | text on elementary algebraic geometry are in the following
         | list?
         | 
         | http://libgen.rs/search.php?req=cox+little+o%27shea+ideals&o...
         | 
         | Fun fact: the newest one (the 2018 corrected version of the
         | 2015 fourth edition) is not among them.
        
           | ZeroGravitas wrote:
           | I notice they have a place to store the OpenLibrary ID,
           | though I've not seen one filled in as yet.
           | 
           | OpenLibrary provides both Work and Edition ids, which helps
           | connect different versions.
           | 
           | Their database is not perfect either, but it might make more
            | sense to keep the bibliographic data separate from the
           | copyright contents anyway.
           | 
           | https://openlibrary.org/works/OL1849157W/Ideals_varieties_an.
           | ..
        
           | boarush wrote:
           | I like to think that LibGen also serves as a historical
           | database wherein there is a record that a book of a specific
           | edition had its errors corrected. (Although it would be
           | better if errata could be appended to the same file if
           | possible)
           | 
            | Yes, for very minor edits, those files should obviously
            | not exist, but someone would need to verify this, which
            | is such an enormous task that likely no one would take
            | it up.
        
       | keepquestioning wrote:
       | Can we put LibGen on the blockchain?
        
         | FabHK wrote:
         | In case that was not a joke:
         | 
          | No. LibGen is not a trivial amount of data; it fills
          | several hard disks. A blockchain can only handle tiny
          | amounts of data, very slowly.
        
       | rolling_robot wrote:
       | The graph should be in logarithmic scale to be readable,
       | actually.
        
         | remram wrote:
         | https://news.ycombinator.com/item?id=32540202
        
       | bagrow wrote:
       | > by filtering any "books" (rather, files) that are larger than
       | 30 MiB we can reduce the total size of the collection from 51.50
       | TB to 18.91 TB
       | 
       | I can see problems with a hard cutoff in file size. A long
       | architectural or graphic design textbook could be much larger
       | than that, for instance.
        
         | mananaysiempre wrote:
         | While it's a bit of an extreme case, the file for a single
         | 15-page article on Monte Carlo noise in rendering[1] is over
         | 50M (as noise should specifically not be compressed out of the
         | pictures).
         | 
         | [1] https://dl.acm.org/doi/10.1145/3414685.3417881
        
           | TigeriusKirk wrote:
           | I was just checking my PDFs over 30M because of this post and
           | was surprised to see the DALL-E 2 paper is 41.9M for 27
            | pages. Lots of images, of course; it was just surprising
            | to see it clock in around the size of a group of full
            | textbooks.
        
             | elteto wrote:
              | If I remember correctly, images in PDFs can be stored
              | at full resolution but are then rendered at their final
              | size, which in double-column research papers more often
              | than not ends up being tiny.
        
       | _Algernon_ wrote:
        | My main issue with libgen is its awful search: can't search
        | by multiple criteria, shitty fuzzy search, and can't filter
        | by file type.
        
         | MichaelCollins wrote:
         | Have you ever used a card catalogue?
        
           | _Algernon_ wrote:
           | No. Your point being?
        
             | MichaelCollins wrote:
             | In that case your expectations are understandable. People
             | in your generation are accustomed to finding anything in
             | mere seconds. Not very long ago, if it took you a few
             | minutes to find a book in the catalogue you would count
             | yourself lucky. And if your local library didn't have the
             | book you're looking for, you could spend weeks waiting for
             | the book to arrive from another library in the system.
             | 
             | Libgen's search certainly isn't as good as it could be, but
              | it's _more_ than good enough. If you can't bear spending a
             | few minutes searching for a book, can you even claim to
             | want that book in the first place? It's hard for me to even
             | imagine being in such a rush that a few minutes searching
             | in a library is too much to tolerate. But then again, I
             | wasn't raised with the expectations of your generation.
        
         | liberalgeneral wrote:
         | Z-Library has been innovating a great deal in that regard.
         | Sadly they are not as open/sharing as LibGen mirrors in giving
         | back to the community (in terms of database dumps, torrents,
         | and source code).
        
       | Invictus0 wrote:
       | Storage space is not a problem, especially not on the order of
       | terabytes. If you want to download all of libgen on a cheap
       | drive, perhaps limit yourself to epub files only. No one needs
       | all of libgen anyway except archivists and data hoarders.
        
         | liberalgeneral wrote:
         | https://news.ycombinator.com/item?id=32540854
        
           | Invictus0 wrote:
           | Yes, that makes you a data hoarder. Normal people would just
           | use one of the many other methods of getting free books, like
           | legal libraries, googling it on Yandex, torrents, asking a
           | friend, etc. Or just actually pay for a book.
        
             | liberalgeneral wrote:
             | My target audience is not normal people though, and I don't
             | mean this in the "edgy" sense. The fact that we are having
             | this discussion is very abnormal to begin with, and I think
             | it's great that there are some deviants from the norm who
             | care about the longevity of such projects.
             | 
             | I can imagine many students and researchers hosting a
             | mirror of LibGen for their fellows for example.
        
               | Invictus0 wrote:
                | In that case, just pay whatever it costs to store the
                | data. With AWS Glacier it would cost $50 a month.
        
             | [deleted]
        
       | RcouF1uZ4gsC wrote:
       | > I chose 30 MiB somewhat arbitrarily based on my personal e-book
       | library, thinking "30 MiB ought to be enough for anyone"
       | 
       | There are books on art and photography and pathology that have
       | multiple high resolution photographs.
       | 
       | I don't think limiting by file size is a good idea.
        
       | c-fe wrote:
        | This is a bit anecdotal, but I did upload a book to libgen. I
        | am an avid user of the site, and during my thesis research I
        | was looking for a specific book and could not find it on
        | there. I did however find it on archive.org. I spent the
        | better part of one afternoon extracting the book from
        | archive.org with some Adobe software, since I had to
        | circumvent some DRM and other things, and all of this was
        | also novel to me. In the end I got a scanned PDF, which was
        | several hundred MB. I managed to reduce it to 47 MB, but
        | further reduction was not easily possible, at least not with
        | the means I knew or had at my disposal. I uploaded this
        | version to libgen.
       | 
        | I do agree that there may be some large files on there, but I
        | don't agree with removing them. I spent some hours putting
        | this book on there so others who need it can access it within
        | seconds. Removing it because it is too large would void all
        | this effort and require future users to go through a similar
        | process to the one I did just to browse through the book.
       | 
        | Also, any book published today is most likely available in
        | some ebook format, which is much smaller in size, so I don't
        | think that the size of libgen will continue to grow at the
        | same pace as it is doing now.
        
         | culi wrote:
         | I've always wanted to contribute to LibGen. Got me through
         | college and has powered my Wikipedia editing hobby
         | 
         | Are there any good guides out there for best practices for
         | minimizing files, scanning books, etc?
        
           | generationP wrote:
           | There's a bunch. Here's what I do (for black-and-white text;
           | I'm not sure how to deal with more complex scenarios):
           | 
           | Scan with 600dpi resolution. Nevermind that this gives huge
           | output files; you'll compress them to something much smaller
           | at the end, and the better your resolution, the stronger
           | compression you can use without losing readability.
           | 
           | While scanning, periodically clean the camera or the scanner
           | screen, to avoid speckles of dirt on the scan.
           | 
           | The ideal output formats are TIF and PNG; use them if your
           | scanner allows. PDF is also fine (you'll then have to extract
           | the pages into TIF using pdfimages or using ScanKromsator).
           | Use JPG only as a last resort, if nothing else works.
           | 
           | Once you have TIF, PNG or JPG files, put them into a folder.
           | Make sure that the files are sorted correctly: IIRC, the
           | numbers in their names should match their order (i.e.,
           | blob030 must be an earlier page than blah045; it doesn't
           | matter whether the numbers are contiguous or what the non-
           | numerical characters are). (I use the shell command mmv for
           | convenient renaming.)
           | 
           | Import this folder into ScanTailor (
           | https://github.com/4lex4/scantailor-advanced/releases ), save
           | the project, and run it through all 6 stages.
           | 
           | Stage 1 (Fix Orientation): Use the arrow buttons to make sure
           | all text is upright. Use Q and W to move between pages.
           | 
           | Stage 2 (Split Pages): You can auto-run this using the |>
           | button, but you should check that the result is correct. It
           | doesn't always detect the page borders correctly. (Again, use
           | Q and W to move between pages.)
           | 
            | Stage 3 (Deskew): Auto-run using |>. This is supposed to
            | ensure that all text is correctly rotated. If some text is
            | still skewed, you can detect and fix this later.
           | 
           | Stage 4 (Select Content): This is about cutting out the
           | margins. This is the most grueling and boring stage of the
           | process. You can auto-run it using |>, but it will often cut
           | off too much and you'll have to painstakingly fix it by hand.
           | Alternatively (and much more quickly), set "Content Box" to
           | "Disable" and manually cut off the most obvious parts without
           | trying to save every single pixel. Don't worry: White space
           | will not inflate the size of the ultimate file; it compresses
           | well. The important thing is to cut off the black/grey parts
           | beyond the pages. In this process, you'll often discover
           | problems with your scan or with previous stages. You can
           | always go back to previous stages to fix them.
           | 
           | Stage 5 (Margins): I auto-run this.
           | 
           | Stage 6 (Output): This is important to get right. The
           | despeckling algorithm often breaks formulas (e.g., "..."s get
           | misinterpreted as speckles and removed), so I typically
           | uncheck "Despeckle" when scanning anything technical (it's
            | probably fine for fiction). I also tend to uncheck
            | "Savitzky-Golay smoothing" and "Morphological smoothing"
            | for some reason; I don't remember why (probably they
            | broke something for me in some case). The "threshold"
            | slider is important:
           | Experiment with it! (Check which value makes a typical page
           | of your book look crisp. Be mindful of pages that are paler
           | or fatter than others. You can set it for each page
           | separately, but most of the time it suffices to find one
           | value for the whole book, except perhaps the cover.) Note the
           | "Apply To..." buttons; they allow you to promote a setting
           | from a single page to the whole book. (Keep in mind that
           | there are two -- the second one is for the despeckling
           | setting.)
           | 
           | Now look at the tab on the right of the page. You should see
           | "Output" as the active one, but you can switch to "Fill
           | Zones". This lets you white-out (or black-out) certain
           | regions of the page. This is very useful if you see some
           | speckles (or stupid write-ins, or other imperfections) that
           | need removal. I try not to be perfectionistic: The best way
           | to avoid large speckles is by keeping the scanner clean at
           | the scanning stage; small ones aren't too big a deal; I often
           | avoid this stage unless I _know_ I got something dirty. Some
           | kinds of speckles (particularly those that look like
           | mathematical symbols) can be confusing in a scan.
           | 
            | There is also a "Picture Zones" tab for graphics and
            | color; that's beyond my paygrade.
           | 
           | Auto-run stage 6 again at the end (even if you think you've
           | done everything -- it needs to recompile the output TIFFs).
           | 
           | Now, go to the folder where you have saved your project, and
           | more precisely to its "out/" subfolder. You should see a
           | bunch of .tif files, each one corresponding to a page. Your
           | goal is to collect them into one PDF. I usually do this as
            | follows:
            | 
            |     tiffcp *.tif ../combined.tif
            |     tiff2pdf -o ../combined.pdf ../combined.tif
            |     rm -v ../combined.tif
           | 
           | Thus you end up with a PDF in the folder in which your
           | project is.
           | 
           | Optional: add OCR to it; add bookmarks for chapters and
           | sections; add metadata; correct the page numbering (so that
           | page 1 is actual page 1). I use PDF-XChangeLite for this all;
           | but use whatever tool you know best.
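            | 
            | If you'd rather script the OCR step, ocrmypdf exposes a
            | Python API; a minimal sketch, assuming ocrmypdf and
            | Tesseract are installed:
            | 
            |     import ocrmypdf
            |     # Adds a searchable text layer on top of the scans.
            |     ocrmypdf.ocr("combined.pdf", "combined_ocr.pdf")
            | 
            | Bookmarks, metadata and page numbering still need a
            | separate tool.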
           | 
           | At that point, your PDF isn't super-compressed (don't know
           | how to get those), but it's reasonable (about 10MB per 200
           | pages), and usually the quality is almost professional.
           | 
           | Uploading to LibGen... well, I think they've made the UI
           | pretty intuitive these days :)
           | 
           | PS. If some of this is out of date or unnecessarily
           | complicated, I'd love to hear!
        
             | crazygringo wrote:
              | > _At that point, your PDF isn't super-compressed (don't
             | know how to get those)_
             | 
              | As far as I know, it's a matter of making sure your
              | text-only pages are monochrome (not grayscale) and of
              | using Group4 compression for them, which is actually
              | what fax machines use (!) and is optimized specifically
              | for monochrome text. Both TIFF and PDF support Group4.
              | I use ImageMagick to take a scanned input page and run
              | grayscale, contrast, Group4 monochrome encoding, and
              | PDF conversion in one fell swoop, which generates one
              | PDF per page, and then "pdfunite" to join the pages.
              | Works like a charm.
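              | 
              | Roughly, scripted (a sketch assuming ImageMagick's
              | "convert" and poppler's "pdfunite" are installed; the
              | threshold and paths are illustrative):
              | 
              |     import glob, subprocess
              |     pdfs = []
              |     for page in sorted(glob.glob("scans/page_*.png")):
              |         out = page[:-4] + ".pdf"
              |         # gray -> monochrome threshold -> Group4 PDF
              |         subprocess.run(
              |             ["convert", page, "-colorspace", "Gray",
              |              "-threshold", "60%", "-compress", "Group4",
              |              out], check=True)
              |         pdfs.append(out)
              |     subprocess.run(["pdfunite", *pdfs, "book.pdf"],
              |                    check=True)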
             | 
             | I'm not aware of anything superior to Group4 for regular
             | black and white text pages, but would love to know if there
             | is.
        
               | generationP wrote:
               | Oh, I should have said that I scan in grayscale, but
               | ScanTailor (at stage 6) makes the output monochrome;
               | that's what the slider is about (it determines the
               | boundary between what will become black and what will
               | become white). So this isn't what I'm missing.
               | 
               | I am not sure if the result is G4-compressed, though. Is
               | there a quick way to tell?
        
         | liberalgeneral wrote:
         | Thank you for your efforts!
         | 
          | To be clear, I am not advocating for the removal of any
          | files larger than 30 MiB (or any other arbitrary hard
          | limit). It'd be great, of course, to flag large files for
          | further review, but sadly the current software doesn't do a
          | great job of crowdsourcing these kinds of tasks (another
          | being deduplication).
          | 
          | Given how little volunteer-power there is, I'm suggesting
          | that a "lean edition" of LibGen can still be immensely
          | useful to many people.
        
           | ssivark wrote:
           | Files are a very bad unit to elevate in importance, and
           | number of files or file size are really bad proxy metrics,
           | especially without considering the statistical distribution
            | of downloads (let alone the question of what is more
           | "important"!). Eg: Junk that's less than the size limit is
           | implicitly being valued over good content that happens to be
           | larger in size. Textbooks & reference books will likewise get
           | filtered out with higher likelihood -- and that would screw
           | students in countries where they cannot afford them (which
            | might arguably be a more important audience to some, compared
           | to those downloading comics). Etc.
           | 
           | After all this, the most likely human response from people
           | who really depend on this platform would be to slice a big
           | file into volumes under the size limit. Seems to be a
           | horrible UX downgrade in the medium to long term for no other
           | reason than satisfying some arbitrary metric of
           | legibility[1].
           | 
           | Here's a different idea -- might it be worthwhile to convert
           | the larger files to better compressed versions eg. PDF ->
           | DJVU? This would lead to a duplication in the medium term,
           | but if one sees a convincing pattern that users switch to the
           | compressed versions without needing to come back to the
           | larger versions, that would imply that the compressed version
           | works and the larger version could eventually be garbage
           | collected.
           | 
           | Thinking in an even more open-ended manner, if this corpus is
           | not growing at a substantial rate, can we just wait out a
           | decade or so of storage improvements before this becomes a
           | non-issue? How long might it take for storage to become 3x,
           | 10x, 30x cheaper?
           | 
           | [1]: https://www.ribbonfarm.com/2010/07/26/a-big-little-idea-
           | call...
        
             | didgetmaster wrote:
             | > can we just wait out a decade or so of storage
             | improvements before this becomes a non-issue?
             | 
             | I'm not sure that there is anything on the horizon which
             | would make duplicate data a 'non-issue'. Capacities are
             | certainly growing, so within a decade we might see 100TB
             | HDDs available and affordable 20TB SSDs. But that does not
             | solve the bandwidth issues. It still takes a long, long
             | time to transfer all the data.
             | 
             | The fastest HDD is still under 300MB/s which means it takes
             | a minimum of 20 hours to read all the data off a 20TB HDD.
             | That is if you could somehow get it to read the whole thing
             | at the maximum sustained read speed.
             | 
             | SSDs are much faster, but it will always be easier to
             | double the capacity than it is to double the speed.
        
               | fragmede wrote:
               | The problem isn't the technology, it's the cost. Given a
               | far larger budget, you wouldn't run the hard drives at
               | anywhere near capacity, in order to gain a read speed
               | advantage by running a ton in parallel. That'll let you
               | read 20 TB in a hour if you can afford it. Put it this
               | way; Netflix is able to do 4k video and that's far more
               | intensive.
        
         | jtbayly wrote:
         | Agreed. Deduplication should be the bigger goal, in my opinion.
        
           | CamperBob2 wrote:
           | Have to be careful there. A jihad against duplication means
           | that poor-quality scans will drive out good ones, or prevent
           | them from ever being created. Especially if you're misguided
           | enough to optimize for minimum file size.
           | 
           | I agree with samatman's position below: as long as the format
           | is the slightest bit lossy -- and it always will be --
           | aggressive deduplication has more downsides than upsides.
        
             | willnonya wrote:
              | While this is intended as agreement, the duplicates need
              | to be easily identifiable and preferably filterable by
              | quality for bulk downloads.
        
             | exmadscientist wrote:
             | Deduplication doesn't have to mean removal. It might be
             | just tagging. It would be very nice to be able to fetch the
             | "best filesize" version of the entire collection, then pull
             | down the "best quality" editions of only a few things I'm
             | particularly interested in.
        
           | signaru wrote:
           | Probably only safe in cases where the files in question are
           | exactly the same binaries (if binary diffing can be automated
           | somehow).
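            | 
            | Exact binary duplicates, at least, are easy to find with
            | content hashes; a minimal sketch (the "library" folder is
            | hypothetical):
            | 
            |     import collections, hashlib, pathlib
            |     def sha256_of(path, chunk=1 << 20):
            |         h = hashlib.sha256()
            |         with path.open("rb") as f:
            |             while block := f.read(chunk):
            |                 h.update(block)
            |         return h.hexdigest()
            |     by_hash = collections.defaultdict(list)
            |     for p in pathlib.Path("library").rglob("*"):
            |         if p.is_file():
            |             by_hash[sha256_of(p)].append(p)
            |     dupes = {d: ps for d, ps in by_hash.items()
            |              if len(ps) > 1}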
        
           | DiggyJohnson wrote:
           | Even then, I wouldn't want a file with text + illustrations
           | to be considered a dupe of a text-only copy of the same work.
        
             | ajsnigrutin wrote:
              | Plus there are a lot of books where one version is a
              | high-quality scan with no OCR, and the other is an OCRed
              | scan (with a bunch of errors, but searching works 80% of
              | the time) with horrible scan quality.
              | 
              | Also, some books include appendices that are scanned in
              | some versions but not in others, plus large posters that
              | are shrunk to A4 size in one version, split onto
              | multiple A4 pages in another, and kept as one huge page
              | in a third version.
              | 
              | Then there are zips of books, containing one PDF plus
              | e.g. example code, libraries, etc. (e.g. programming
              | books).
        
             | samatman wrote:
             | IMHO a process which is lossy should never be described as
             | deduplication.
             | 
             | What would work out fairly well for this use case is to
             | group files by similarity, and compress them with an
             | algorithm which can look at all 'editions' of a text.
             | 
             | This should mean that storing a PDF with a (perhaps badly,
             | perhaps brilliantly) type-edited version next to it would
             | 'weigh' about as much as the original PDF plus a patch.
        
               | duskwuff wrote:
               | > IMHO a process which is lossy should never be described
               | as deduplication.
               | 
               | Depends. There are going to be some cases where files
                | aren't _literally_ duplicates, but the duplicates don't
               | add any value -- for example, MOBI conversions of EPUB
               | files, or multiple versions of an EPUB with different
               | publisher-inserted content (like adding a preview of a
               | sequel, or updating an author's bibliography).
        
               | samatman wrote:
                | Splitting those into two cases: I think getting rid of
                | format conversions (which can, after all, be performed
                | again) is worthwhile, but that isn't deduplication;
                | it's more like pruning.
               | 
                | Multiple versions of an EPUB with slightly different
                | content are exactly the case where a compression
                | algorithm with an attention span, and some metadata to
                | work with, can get the multiple copies down enough in
                | size that there's no point in disposing of the unique
                | parts.
        
       | ad404b8a372f2b9 wrote:
       | That's funny, I did the same analysis with sci-hub. Back when
       | there was an organized drive to back it up.
       | 
       | I downloaded parts of it and wanted to figure out why it was so
       | heavy, seeing as you'd expect articles to be mostly text and very
       | light.
       | 
        | There was a similar distribution of file sizes. My immediate
        | instinct was also to cut off the tail end, but looking at the
        | larger files I realized they were a whole range of good
        | articles that included high-quality graphics crucial to the
        | research being presented, not poor compression or useless
        | bloat.
        
         | dredmorbius wrote:
          | It can be illuminating to look at the size of ePub
          | documents. ePub is in general a compressed HTML container,
          | so file sizes tend to be quite small. A book-length text
          | (~250 pp or more) might be from 0.3 -- 5 MB, and often at
          | the lower end of the scale.
         | 
         | Books with a large number of images or graphics, however, can
         | still bloat to 40-50 MB or even more.
         | 
         | Otherwise, generally, text-based PDFs (as opposed to scans) are
         | often in the 2--5 MB range, whilst scans can run 40--400 MB.
         | The largest I'm aware of in my own collection is a copy of
          | Lyell's _Geography_, sourced from Archive.org. It is of course
         | scans of the original 19th century typography. Beautiful to
         | read, but a bit on the weighty side.
        
         | liberalgeneral wrote:
          | I think Sci-Hub is the opposite, since 1 DOI = 1 PDF in its
          | canonical form (straight from the publisher), so neither
          | duplication nor low quality is an issue.
        
           | dredmorbius wrote:
           | It does depend on when the work was published. Pre-digital
           | works scanned in without OCR can be larger in size. That's
           | typically works from the 1980s and before.
           | 
           | Given the explosion of scientific publishing, that's likely a
           | small fraction of the archive by _work_ though it may be
           | significant in terms of _storage_.
        
       | aaron695 wrote:
       | > by filtering any "books" (rather, files) that are larger than
       | 30 MiB we can reduce the total size of the collection from 51.50
       | TB to 18.91 TB, shaving a whopping 32.59 TB
       | 
       | Books greater than 30 MiB are all the textbooks.
       | 
       | You are killing the knowledge.
       | 
       | Also killing a lot of rare things.
       | 
       | If you want to do something amazing and small, OCR them.
       | 
        | As an example of something greater than 30 MB: the other day
        | I grabbed a short story by Greg Bear that isn't available
        | digitally; it was in a 90 MB copy of a 1983 Analog Science
        | Fiction and Fact.
        | 
        | Side note: de-duping is an incredibly hard project. How will
        | you diff a mobi and an epub and then make a decision? Or
        | decide between one mobi and another?
        | 
        | Books also change with time. Even in the 90's, kids' books
        | from the 60's had been 'edited'. These can be hidden gems to
        | collectors. Cover art too.
        
       | gizajob wrote:
       | One of my favourite places on the internet too. The thing is, you
       | just search for what you want and spend 10 seconds finding the
        | right book and link. While I'd love to mirror the whole
        | archive locally, it would really be superfluous because I
        | can only read a couple of quality books at a time anyway, so
        | building my own
       | small archive of annotated PDFs (philosophy is my drug of choice)
       | is better than having the whole. I think it's actually remarkably
       | free of bloat and cruft considering, but maybe I'm not trawling
        | the same corners as you are. I do kind of wish they'd clear
        | out the mobi and djvu versions and unify it all, however.
        
         | sitkack wrote:
         | > djvu versions
         | 
          | This would be disastrous for preservation. Often the djvu
          | versions have no other digital version; the books are not
          | in print and the publisher isn't around. The djvu archives
          | often exist specifically because some old book _really_ has
          | and had value to people.
        
           | crazygringo wrote:
            | Yeah, I always convert DJVU to PDF (pretty easy) but it never
           | compresses quite as nicely.
           | 
           | DJVU is pretty clever in how it uses a mask layer for more
           | efficient compression, and as far as I know, converting to
           | PDF is always done "dumb" -- flattening the DJVU file into a
           | single image and then encoding that image traditionally in
           | PDF.
           | 
           | I wonder if it's possible to create a "lossless" DJVU to PDF
           | converter, or something close to it, if the PDF primitives
           | allow it? I'm not sure if they do, if the "mask layer" can be
           | efficiently reproduced in PDF.
        
             | sitkack wrote:
             | If you smoke enough algebra, you could use the DJVU
             | algorithm to implement DJVU in PDF with layers. Or heck you
             | could do it in SVG.
        
         | napier wrote:
          | Is there a torrent available that would allow
          | straightforward setup of a locally storable and accessible
          | LibGen library? For those rich in storage but poor in
          | internet connection reliability, something like this would
          | be a godsend.
        
           | mdaniel wrote:
           | They have a dedicated page where they offer torrents, so pick
           | one of the currently available hostnames:
           | https://duckduckgo.com/?q=libgen+torrent&ia=web
           | 
           | Obviously, folks can disagree on the "straightforward" part
           | of your comment given the overwhelming number of files we're
           | discussing
        
         | liberalgeneral wrote:
         | > While I'd love to mirror whole archive locally, it would
         | really be superfluous because I can only read a couple of
         | quality books at a time anyway, [...]
         | 
         | I'd love to agree but as a matter of fact LibGen and Sci-Hub
         | are (forced to be) "pirates" and they are more vulnerable to
         | takedowns than other websites. So while I feel no need to
         | maintain a local copy of Wikipedia, since I'm relatively
         | certain that it'll be alive in the next decade, I cannot say
         | the same about those two with the same certainty (not that I
         | think there are any imminent threats to either, just reasoning
         | a priori).
        
           | jart wrote:
           | Well when a site claims it's for scientific research
           | articles, and you search for "Game Of Thrones" and find this:
           | 
           | https://libgen.is/search.php?req=game+of+thrones&lg_topic=li.
           | ..
           | 
           | Someone's going to prison eventually, like The Pirate Bay
           | founders. It's only a matter of time.
        
             | contingencies wrote:
              | First, Sci-Hub != LibGen. They are allied projects that
              | clearly share a support base, but they are not
              | identical.
             | 
             | Second, please provide a citation for the assertion that
             | sharing copies of printed fiction erodes sales volume. At
             | this point, one may assume that anything that helps to sell
             | computer games and offline swag is cash-in-bank for content
             | producers. Whether original authors get the same royalties
             | is an interesting question.
             | 
             | Third, the former Soviet milieu probably isn't currently in
             | the mood to cooperate with western law enforcement.
        
           | BossingAround wrote:
           | Speaking of mirroring, is there a way to download one big
           | "several-hundred-GB" blob with the full content of the sites
           | for archival purposes?
           | 
           | Surely that would act as a failsafe to your problem.
        
             | charcircuit wrote:
              | I think it's split into several different torrents since
              | it's so big.
        
         | scott_siskind wrote:
          | Why would they clear out djvu? It's one of the best/most
          | efficient storage formats for scanned books.
        
           | nsajko wrote:
           | I'm not for clearing out djvu, but it sure is frustrating
           | when a PDF isn't available.
           | 
           | It's not just about laziness preventing one from installing
           | the more obscure ebook readers which support djvu. It's about
           | security: I only trust PDFs when I create them myself with
           | TeX or similar, otherwise I need to use the Chromium PDF
           | reader to be (relatively) safe. I don't trust the readers
           | that support Djvu to be robust enough against maliciously
           | malformed djvu files, as I'm guessing the readers are
           | implemented in some ancient dialect of C or C++ and I doubt
           | they're getting much if any scrutiny in the way of security.
        
             | crazygringo wrote:
             | It's super easy to convert a DJVU file to PDF though.
             | There's an increase in filesize but it's not the end of the
             | world.
             | 
              | And since you're creating the PDF yourself, it seems
              | like you can trust it? Nothing malicious could survive
              | the DJVU-to-PDF conversion, since it's just "dumb" and
              | bitmap-based.
        
           | xdavidliu wrote:
           | djvu is really quite a marvellous format, but I'm only able
           | to read them on Evince (the default pdf reader that comes
           | with Debian, Fedora, and probably a bunch of other distros).
           | For my macbook I need to download a Djvu reader, and for my
           | ipad, I didn't even bother trying because the experience
           | would likely be much worse than Preview / Ibooks.
        
             | eru wrote:
             | Apparently you can install Evince on MacOS as well. But I
             | haven't tried it there.
             | 
             | Evince doesn't come by default with Archlinux (my desktop
             | distribution of choice), but I still install it everywhere.
        
               | nsajko wrote:
               | > Evince doesn't come by default with Archlinux (my
               | desktop distribution of choice)
               | 
               | This doesn't make sense; nothing comes "by default" on
               | Arch, but evince _is_ in the official repos as far as I
               | see.
        
             | dredmorbius wrote:
             | DJVU is supported by numerous book-reading applications,
             | including (in my experience) FB Reader (FS/OSS),
             | Pocketbook, and Onyx's Neoreader.
             | 
             | As a format for preserving full native scan views (large,
             | but often strongly preferable for visually-significant
             | works or preserving original typesetting / typography),
             | DJVU is highly useful.
             | 
             | I _do_ wish that it were more widely supported by both
             | toolchains and readers. That will come in time, I suspect.
        
             | MichaelCollins wrote:
             | Calibre supports djvu on any platform. Deleting djvu books
             | just because Microsoft and Apple don't see fit to support
             | it by default would be a travesty.
        
         | gizajob wrote:
         | My comment about djvu was mostly just about my own laziness,
         | because (kill me if you need to) I like using Preview on the
         | Mac for reading and annotating, and it doesn't read them, and
         | once they have to live in a djvu viewer, I tend not to read
         | them or mark them up. Same goes for Adobe Acrobat Reader when
         | I'm on Windows on my University's networked PCs.
        
       | repple wrote:
       | This book has a great overview of the origins of library genesis.
       | 
       | Shadow Libraries: Access to Knowledge in Global Higher Education
       | 
       | https://libgen.is/search.php?req=shadow+libraries
        
       | gmjoe wrote:
       | Honestly, it's not a big problem.
       | 
        | First of all, bloat has nothing to do with file size -- EPUBs
        | are often around 2 MB, typeset PDFs are often 2-10 MB
        | (depending on the quantity of illustrations), and scanned
        | PDFs are anywhere from 10 MB (if reduced to black and white)
        | to 100 MB (for color scans, like where necessary for
        | full-color illustrations).
        | 
        | The idea of a 30 MB cutoff does nothing to reduce bloat, it
        | just removes many of the most essential textbooks. :( Also
        | it's very rare to see duplicates of 100 MB PDFs.
       | 
        | Second, file duplication exists, but it's not really an
        | unwieldy problem right now. Probably the majority of titles
        | have only a single file, many have 2-5 versions, and a tiny
        | minority have 10+. But they're often useful variants --
        | different editions (2nd, 3rd, 4th) plus alternate formats
        | like reflowable EPUB vs PDF scan. These are all genuinely
        | useful and need to be kept.
       | 
       | Most of the unhelpful duplication I see tends to fall into three
       | categories:
       | 
       | 1) There are often 2-3 versions of the identical typeset PDF
       | except with a different resolution for the cover page image. That
       | one baffles me -- zero idea who uploads the extras or why. My
       | best guess is a bot that re-uploads lower-res cover page
       | versions? But it's usually like original 2.7 MB becoming 2.3 MB,
       | not a big difference. Feels very unnecessary to me.
       | 
        | 2) People (or a bot?) who seem to take EPUBs and produce PDF
        | versions. I can understand how that could be done in a
        | helpful spirit, but honestly the resulting PDFs are so
        | abysmally ugly that I really think people are better off
        | producing their own PDFs using e.g. Calibre, with their own
        | desired paper size, font, etc. Unless there's no original
        | EPUB/MOBI on the site, PDF conversions of them should be
        | discouraged IMHO.
       | 
       | 3) A very small number of titles do genuinely have like 5+
       | seemingly identical EPUB versions. These are usually very popular
       | bestselling books. I'm totally baffled here as to why this
       | happens.
       | 
       | It does seem like it would be a nice feature to be able to leave
       | some kind of crowdsourced comments/flags/annotations to help
       | future downloaders figure out which version is best for them
       | (e.g. is this PDF an original typeset, a scan, or a conversion?
       | -- metadata from the uploader is often missing or inaccurate
        | here). But for a site that operates on anonymity, it seems like
       | this would be too open to abuse/spamming. Being able to delete
       | duplicates opens the door to accidental or malicious deleting of
       | anything. I'd rather live with the "bloat", it's really not an
       | impediment to anything at the moment.
        
         | titoCA321 wrote:
          | When you look at movie pirates, there are still uploads of Xvid
         | in 2022. Crap goes in as PDF, mobi, epub, txt and comes out as
         | PDF, mobi, DOCX, txt.
        
       | agumonkey wrote:
        | There are classes of books that are significantly larger
        | than the rest, like medical / biology books. I don't know if
        | they embed vector-based images of the whole body or maybe
        | hundreds of images, but it's surprising how big they are.
        | 
        | Who's in to do some large-scale data gathering on unoptimized
        | books and potentially redundant ones? Or maybe trim PDFs
        | (qpdf can optimize a structure to an extent)?
        
         | liberalgeneral wrote:
         | Database dumps are available here if you are interested:
         | http://libgen.rs/dbdumps/
         | 
         | libgen_compact_* is what you are probably looking for, but they
         | are all SQL dumps so you'll need to import them into MySQL
         | first. :/
        
           | agumonkey wrote:
            | the dumps are not enough; one has to scan the actual file
            | content to assess the quality
            | 
            | are you alone in your analysis or are there groups who
            | try to improve lg?
        
       | [deleted]
        
       | Synaesthesia wrote:
       | >"30 MiB ought to be enough for anyone"
       | 
        | Sometimes you have e.g. a history book with a lot of high-
        | quality photos, and then it can be quite large.
        
       | spiffistan wrote:
        | I've been dreaming of a book decompiler that would use some
        | newfangled AI/ML to produce a perfectly typeset copy of an
        | older book, in the same font or similar, recognizing multiple
        | languages and scripts within the work.
        
         | copperx wrote:
         | In the same vein, I would like an e-reader that has TeX or
         | InDesign quality typesetting. I'd settle for Knuth-Plass line
         | breaking with decent justification (and hyphenation).
         | 
         | At the very least, make it so that headings do not appear at
         | the bottom of a page. Who thought that was OK?
        
       | signaru wrote:
        | I have experience scanning personal books, and I also try to
        | reduce their size since I'm concerned with bloat on my
        | (older) mobile reading devices. Unfortunately, there are
        | reasons I cannot upload those, but the procedures might still
        | be helpful for existing scans.
       | 
       | Use ScanTailor to clean them up. If there is no need for
       | color/grayscale, have the output strictly black and white.
       | 
       | OCR them with Adobe Acrobat ClearScan (or something else, that is
       | what I have).
       | 
       | Convert to black and white DJVU (Djvu-Spec).
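        | 
        | One way to script that conversion with DjVuLibre (not the
        | tool mentioned above; "cjb2" encodes bitonal TIFF/PBM pages,
        | "djvm" bundles them; paths are illustrative):
        | 
        |     import glob, subprocess
        |     pages = []
        |     for tif in sorted(glob.glob("out/*.tif")):
        |         djvu = tif[:-4] + ".djvu"
        |         subprocess.run(["cjb2", "-dpi", "600", tif, djvu],
        |                        check=True)
        |         pages.append(djvu)
        |     subprocess.run(["djvm", "-c", "book.djvu", *pages],
        |                    check=True)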
       | 
       | Dealing with color is another thing, and takes some time. I find
       | that using G'MIC's anisotropic smoothing can help with the ink-
       | jet/half-tone patterns. But it's too time consuming to be used
       | for books.
        
         | pronoiac wrote:
         | I like ScanTailor! I've used ocrmypdf for the OCR and
         | compression steps. It uses lossless JBIG2 by default, at 2 or
         | 3k per page; I'm curious how that compares to DJVU. (And my
         | mistake, pdf and DJVU are competing container formats.)
        
           | signaru wrote:
            | If the PDF is from a scanned source, converting it to DJVU
            | with equivalent DPI typically results in about half the
            | file size (figures can vary depending on the specifics of
            | the PDF source).
        
       | powera wrote:
       | Curation is hard, particularly for a "community" project.
       | 
       | Every file is there for a reason, and much of the time, even if
       | it is a stupid reason, removing it means there is one more person
       | opposed to the concept of "curation".
        
       | Hizonner wrote:
       | Um, if the goal is to fit what you can onto a 20TB hard drive at
       | home, then nobody is stopping you from choosing your own subset,
       | as opposed to deleting stuff out of the main archive based on
       | ham-handed criteria...
        
       | mjreacher wrote:
       | I think one of the problems is the lack of a good open source PDF
       | compressor. We have good open source OCR software like ocrmypdf
       | which I've seen used before, but some of the best compressed
       | books I've seen on libgen used some commercial compressor while
       | the open source ones I've used were generally quite lackluster.
        | This applies doubly so when people are ripping images from
        | another source, combining them into a PDF, then uploading it
        | as a high-resolution PDF which inevitably ends up being
        | between 70-370 MB.
       | 
       | How to deal with duplication is also a very difficult problem
       | because there's loads of reasons why things could be duplicated.
       | Take a textbook, I've seen duplicates which contain either one or
       | several of the following: different editions, different printings
       | (of any particular edition), added bookmarks/table of contents
       | for the file, removed blank white pages, removed front/end cover
       | pages, removed introduction/index/copyright/book information
       | pages, LaTeX'd copies of pre-TeX textbooks, OCR'd, different
       | resolution, other kinds of optimization by software that reduces
       | to wildly different file sizes, different file types (eg .chm,
       | PDFs that are straight conversions from epub/mobi), etc. Some of
       | this can be detected by machines, eg usage of OCR but some of the
       | other things aren't easy at all to detect.
        
         | crazygringo wrote:
         | What commercial compressor/performance are you talking about?
         | 
          | AFAIK the best compression you see is monochrome pages
          | encoded in Group4, which for example ImageMagick (which is
          | open source) will do, and which ocrmypdf happily works on
          | top of.
         | 
         | Otherwise it's just your choice of using underlying JPG, PNG,
         | or JPEG 2000, and up to you to set your desired lossy
         | compression ratio.
        
       ___________________________________________________________________
       (page generated 2022-08-21 23:00 UTC)