[HN Gopher] LibGen's Bloat Problem ___________________________________________________________________ LibGen's Bloat Problem Author : liberalgeneral Score : 226 points Date : 2022-08-21 12:22 UTC (10 hours ago) (HTM) web link (liberalgeneral.neocities.org) (TXT) w3m dump (liberalgeneral.neocities.org) | idealmedtech wrote: | I'd love to see that distribution at the end with a log-axis for the file size! Or maybe even log-log, depending. Gives a much better sense of "shape" when working with these sorts of exponential distributions | Retr0id wrote: | In an ideal world, every book could be given an "importance" score, for some arbitrary value of importance. For example, how often it is cited. This could be customised on a per-user basis, depending on which subjects and time periods you're interested in. | | Then you can specify your disk size, and solve the knapsack problem to figure out the optimal subset of files that you should store. | | Edit: Curious to see this being downvoted. Is it really that bad of an idea? Or just off-topic? | ssivark wrote: | Seems like a perfectly good idea to me! Basically, you're proposing that we decide caching by some score, with the details of the score function tweaked to handle the different aspects we care about. | | I wonder whether this idea is already used for locating data in distributed systems -- from clusters all the way to something like IPFS. | DiggyJohnson wrote: | Not to sound blunt, but answering your question on the downvotes (which you probably didn't deserve, especially without reply). | | The concept of an importance score feels very centralized and against the federated / free nature of the site. Towards what end? | | If the "importance score" impacts curation, I am strongly against it. Not only is it icky, but how is it different than a function of popularity? | Retr0id wrote: | I'm not suggesting reducing the size of the LibGen collection, I'm thinking along the lines of "I have 2TB of disk space spare, and I want to fill it with as much culturally-relevant information as possible". | | If the entire collection were available as a torrent (maybe it already is?), I could select which files I wish to download, and then seed. | | Those who have 52TB to spare would of course aim to store everything, but most people don't. | | Just as the proposal in the OP would result in the remaining 32.59 TB of data being less well replicated, my approach has the problem that less "popular" files would be poorly replicated, but you could solve that by _also_ selecting some files at random (e.g. 1.5TB chosen algorithmically, 0.5TB chosen at random). | liberalgeneral wrote: | I don't think you deserved the downvotes, and I don't think it's a bad idea either; indeed some coordination as to how to seed the collection is really needed. | | For instance phillm.net maintains a dynamically updated list of LibGen and Sci-Hub torrents with fewer than 3 seeders so that people can pick some at random and start seeding: https://phillm.net/libgen-seeds-needed.php | [deleted] | wishfish wrote: | Has anyone ever stumbled across an executable on LibGen? The article mentioned finding them but I've never seen one. | | I agree with the other comments that LibGen shouldn't purge the larger books. But, in terms of mirrors, it would be nice to have a slimmed-down archive I could torrent. 19 TB would be manageable. And it would be nice to have a local copy of most of the books.
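A minimal sketch of the selection scheme Retr0id describes above, for illustration only: it assumes each file comes with a size and an "importance" score, uses a greedy score-per-byte pass as a cheap stand-in for a real knapsack solver, and reserves a slice of the budget for random picks so less popular files still get replicated somewhere. The function name and tuple layout are hypothetical, not anything LibGen provides.

    import random

    def choose_files(files, budget_bytes, random_share=0.25):
        """files: list of (file_id, size_bytes, score) tuples; returns chosen file_ids."""
        chosen, used = [], 0
        scored_budget = budget_bytes * (1 - random_share)

        # Greedy by importance per byte: a rough approximation of the 0/1 knapsack.
        for fid, size, score in sorted(files, key=lambda f: f[2] / f[1], reverse=True):
            if used + size <= scored_budget:
                chosen.append(fid)
                used += size

        # Spend the reserved share on random picks so the long tail still gets seeded.
        chosen_set = set(chosen)
        leftovers = [f for f in files if f[0] not in chosen_set]
        random.shuffle(leftovers)
        for fid, size, _ in leftovers:
            if used + size <= budget_bytes:
                chosen.append(fid)
                used += size
        return chosen

With millions of files that are each tiny relative to a multi-terabyte budget, the greedy density ordering lands very close to the true knapsack optimum, which is why a heuristic like this is the practical choice over an exact solver.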
| liberalgeneral wrote: | > Has anyone ever stumbled across an executable on LibGen? The | article mentioned finding them but I've never seen one. | | Here is a list of .exe files in LibGen: | https://paste.debian.net/hidden/1c82739a/ | | And a breakdown of file extensions: | https://paste.debian.net/hidden/579e319c/ | | > And would be nice to have a local copy of most of the books. | | Yes! That was my intention--I wasn't advocating for a purge of | content but a leaner and more practical version would be | amazing. | marcosdumay wrote: | So, 1000 exes and 500 isos (that may be problematic, but most | probably aren't). Everything else seems to be what one would | expect. | | That's way cleaner than I could possibly expect. Do people | manually review suspect files? | macintux wrote: | > Yes! That was my intention--I wasn't advocating for a purge | of content but a leaner and more practical version would be | amazing. | | Your piece doesn't make that obvious at all, and given how | many people here are misunderstanding that point, you might | want to update it. | liberalgeneral wrote: | You are right, added a paragraph at the end. | wishfish wrote: | Thanks for the lists. I was genuinely curious about the exes. | Nice to know where they originate. Interesting that over half | of them have titles in Cyrillic. I guess not so many English | language textbooks (with included CDs) have been uploaded | with the data portion intact. | pnw wrote: | You can search by extension - there's a lot of .exe files, | mostly Russian AFAIK. | hoppyhoppy2 wrote: | I saw a book on antenna design on libgen that originally | included a CD with software, and that disk image had been | uploaded to the site. | johndough wrote: | That graph of file size vs. number of files would be much easier | to read if it were logarithmic. I guess OP is using matplotlib. | In this case, use plt.loglog instead of plt.plot. Also, consider | plt.savefig("chart.svg") instead of png. | liberalgeneral wrote: | Here is the raw data if you are interested: | https://paste.debian.net/hidden/77876d00/ | johndough wrote: | Thanks. Here is a logarithmic plot as SVG: | https://files.catbox.moe/zbf35r.svg | | On a second thought, a logarithmic histogram might convey | even more information, but that would require all file sizes | to recompute the bin sizes. | kqr wrote: | Huh, this distribution is not the power law I would have | expected. Maybe because it's limited to one media type | (books)? | cbarrick wrote: | This is a complete nit, but s/an utopia/a | utopia/ | | Even though "utopia" is spelled starting with a vowel, it is | pronounced as /ju:'toUpi@/, like "yoo-TOH-pee-@", with a | consonant sound at the start. | | Since the word starts with a consonant sound, the proper | indefinite article is "a". | kevin_thibedeau wrote: | Now you have to convince the intelligentsia how to use the | proper article with "history". | boarush wrote: | I don't think OP takes into account that there seem to be | multiple editions of the same book which are often required by | people to refer to. Not everyone wants the latest edition when | the class you're in is using some old edition. | liberalgeneral wrote: | If you are referring to my duplication comments, sure (but even | then I believe there are duplicates of the exact same edition | of the same book). Though the filtering by filesize is | orthogonal to editions etc. so has nothing to do with that. | xenr1k wrote: | I agree. There are duplicates. I have seen it. 
| | I have found the same book with multiple sized pdf, with same | content. Someone maybe uploaded a poorly scanned pdf when the | book was first released but later Someone else uploaded a | OCRed version, but the first one just stayed hogging a large | amount of storage. | MichaelCollins wrote: | How do you automate the process of figuring out which | version is better? It's not safe to assume the smaller | versions are always better, nor the inverse. Particularly | for books with images, one version of the book may have | passable image quality while the other compressed the | images to jpeg mush. And there are considerations that are | difficult to judge quantitatively, like the quality of | formatting. Even something seemingly simple like testing | whether a book's TOC is linked correctly entails a huge | rats nest of heuristics and guesswork. | macintux wrote: | I don't think anyone is arguing it can be fully | automated, but automating the selection of books to | manually review is certainly viable. | rinze wrote: | As the previous reply said, I've also seen duplicates while | browsing. Would it be possible to let users flag duplicates | somehow? It involves human unreliability, which is like | automated unreliability, only different. | generationP wrote: | In practice, it's more often the same file with minor edits | such as a PDF table of contents added or page numbers | corrected. Say, how many distinct editions of this standard | text on elementary algebraic geometry are in the following | list? | | http://libgen.rs/search.php?req=cox+little+o%27shea+ideals&o... | | Fun fact: the newest one (the 2018 corrected version of the | 2015 fourth edition) is not among them. | ZeroGravitas wrote: | I notice they have a place to store the OpenLibrary ID, | though I've not seen one filled in as yet. | | OpenLibrary provides both Work and Edition ids, which helps | connect different versions. | | Their database is not perfect either, but it might make more | sense to keep the bibliographic data seperate from the | copyright contents anyway. | | https://openlibrary.org/works/OL1849157W/Ideals_varieties_an. | .. | boarush wrote: | I like to think that LibGen also serves as a historical | database wherein there is a record that a book of a specific | edition had its errors corrected. (Although it would be | better if errata could be appended to the same file if | possible) | | Yes, for very minor edits, those files should obviously not | exist, but for that there would need to be someone who | verifies this, which is such an enormous task that likely no | one would take up. | keepquestioning wrote: | Can we put LibGen on the blockchain? | FabHK wrote: | In case that was not a joke: | | No. LibGen is not a trivial amount of data (a few hard disks | full). The blockchain can only handle tiny amounts of data very | slowly. | rolling_robot wrote: | The graph should be in logarithmic scale to be readable, | actually. | remram wrote: | https://news.ycombinator.com/item?id=32540202 | bagrow wrote: | > by filtering any "books" (rather, files) that are larger than | 30 MiB we can reduce the total size of the collection from 51.50 | TB to 18.91 TB | | I can see problems with a hard cutoff in file size. A long | architectural or graphic design textbook could be much larger | than that, for instance. 
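Picking up johndough's plotting suggestion from a bit earlier in the thread, here is a minimal matplotlib sketch of the log-log view of the size distribution. The input format is an assumption (a hypothetical filesizes.txt with one file size in bytes per line); logarithmically spaced bins are one way to make the shape of a heavy-tailed distribution readable.

    import numpy as np
    import matplotlib.pyplot as plt

    # Hypothetical input: one file size (in bytes) per line.
    sizes = np.loadtxt("filesizes.txt")

    # Log-spaced bins so every decade of file size gets comparable resolution.
    bins = np.logspace(np.log10(max(sizes.min(), 1.0)), np.log10(sizes.max()), 100)
    counts, edges = np.histogram(sizes, bins=bins)

    plt.loglog(edges[:-1], counts, drawstyle="steps-post")
    plt.xlabel("file size (bytes)")
    plt.ylabel("number of files")
    plt.savefig("chart.svg")  # SVG stays crisp at any zoom, as suggested above

On log-log axes a power law shows up as a straight line, which makes the kind of question kqr raises above easy to eyeball.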
| mananaysiempre wrote: | While it's a bit of an extreme case, the file for a single | 15-page article on Monte Carlo noise in rendering[1] is over | 50M (as noise should specifically not be compressed out of the | pictures). | | [1] https://dl.acm.org/doi/10.1145/3414685.3417881 | TigeriusKirk wrote: | I was just checking my PDFs over 30M because of this post and | was surprised to see the DALL-E 2 paper is 41.9M for 27 | pages. Lots of images, of course, it was just surprising to | see it clock in around a group of full textbooks. | elteto wrote: | If I remember correctly images in PDFs can be stored full | res but are then rendered to final size, which more often | than not in double column research papers end up being | tiny. | _Algernon_ wrote: | My main issue with libgen is its awful search. Can't search by | multiple criteria, shitty fuzzy search, and cant filter by file | type. | MichaelCollins wrote: | Have you ever used a card catalogue? | _Algernon_ wrote: | No. Your point being? | MichaelCollins wrote: | In that case your expectations are understandable. People | in your generation are accustomed to finding anything in | mere seconds. Not very long ago, if it took you a few | minutes to find a book in the catalogue you would count | yourself lucky. And if your local library didn't have the | book you're looking for, you could spend weeks waiting for | the book to arrive from another library in the system. | | Libgen's search certainly isn't as good as it could be, but | it's _more_ than good enough. If you can 't bear spending a | few minutes searching for a book, can you even claim to | want that book in the first place? It's hard for me to even | imagine being in such a rush that a few minutes searching | in a library is too much to tolerate. But then again, I | wasn't raised with the expectations of your generation. | liberalgeneral wrote: | Z-Library has been innovating a great deal in that regard. | Sadly they are not as open/sharing as LibGen mirrors in giving | back to the community (in terms of database dumps, torrents, | and source code). | Invictus0 wrote: | Storage space is not a problem, especially not on the order of | terabytes. If you want to download all of libgen on a cheap | drive, perhaps limit yourself to epub files only. No one needs | all of libgen anyway except archivists and data hoarders. | liberalgeneral wrote: | https://news.ycombinator.com/item?id=32540854 | Invictus0 wrote: | Yes, that makes you a data hoarder. Normal people would just | use one of the many other methods of getting free books, like | legal libraries, googling it on Yandex, torrents, asking a | friend, etc. Or just actually pay for a book. | liberalgeneral wrote: | My target audience is not normal people though, and I don't | mean this in the "edgy" sense. The fact that we are having | this discussion is very abnormal to begin with, and I think | it's great that there are some deviants from the norm who | care about the longevity of such projects. | | I can imagine many students and researchers hosting a | mirror of LibGen for their fellows for example. | Invictus0 wrote: | In that case, just pay whatever it costs to store the | data. With AWS glacier it would cost $50 a month. | [deleted] | RcouF1uZ4gsC wrote: | > I chose 30 MiB somewhat arbitrarily based on my personal e-book | library, thinking "30 MiB ought to be enough for anyone" | | There are books on art and photography and pathology that have | multiple high resolution photographs. 
| | I don't think limiting by file size is a good idea. | c-fe wrote: | This is a bit anecdotal, but I did upload a book to libgen. I am | am avid user of the site, and during my thesis research I was | looking for a specific book and could not find it on there. I did | however find it on archive.org. I spent the better half of one | afternoon extracting the book from archive.org with some Adobe | software, since I had to circumvent some DRM and other things, | and all of this was also novel to me. In the end I got a scanned | PDF, which had several hundred MB. I managed to reduce it to 47 | MB, however further reduction was not easily possible at least | not with the means I knew or had at my disposal. I uploaded this | version to libgen. | | I do agree that there may be some large files on there, however I | dont agree with removing them. I spent some hours to put this | book on there so others who need it can access it within seconds. | Removing it because it is too large would void all this effort | and require future users to go through a similar process than i | did just to browse through the book. | | Also any book published today is most likely available in some | ebook format, which is much smaller in size, so I dont think that | the size of libgen will continue to grow at the same pace as it | is doing now. | culi wrote: | I've always wanted to contribute to LibGen. Got me through | college and has powered my Wikipedia editing hobby | | Are there any good guides out there for best practices for | minimizing files, scanning books, etc? | generationP wrote: | There's a bunch. Here's what I do (for black-and-white text; | I'm not sure how to deal with more complex scenarios): | | Scan with 600dpi resolution. Nevermind that this gives huge | output files; you'll compress them to something much smaller | at the end, and the better your resolution, the stronger | compression you can use without losing readability. | | While scanning, periodically clean the camera or the scanner | screen, to avoid speckles of dirt on the scan. | | The ideal output formats are TIF and PNG; use them if your | scanner allows. PDF is also fine (you'll then have to extract | the pages into TIF using pdfimages or using ScanKromsator). | Use JPG only as a last resort, if nothing else works. | | Once you have TIF, PNG or JPG files, put them into a folder. | Make sure that the files are sorted correctly: IIRC, the | numbers in their names should match their order (i.e., | blob030 must be an earlier page than blah045; it doesn't | matter whether the numbers are contiguous or what the non- | numerical characters are). (I use the shell command mmv for | convenient renaming.) | | Import this folder into ScanTailor ( | https://github.com/4lex4/scantailor-advanced/releases ), save | the project, and run it through all 6 stages. | | Stage 1 (Fix Orientation): Use the arrow buttons to make sure | all text is upright. Use Q and W to move between pages. | | Stage 2 (Split Pages): You can auto-run this using the |> | button, but you should check that the result is correct. It | doesn't always detect the page borders correctly. (Again, use | Q and W to move between pages.) | | Stage 3 (Deskew): Auto-run using |>. This is supposed to | ensure that all text is correctly rotated. If some text is | still skew, you can detect and fix this later. | | Stage 4 (Select Content): This is about cutting out the | margins. This is the most grueling and boring stage of the | process. 
You can auto-run it using |>, but it will often cut off too much and you'll have to painstakingly fix it by hand. Alternatively (and much more quickly), set "Content Box" to "Disable" and manually cut off the most obvious parts without trying to save every single pixel. Don't worry: white space will not inflate the size of the ultimate file; it compresses well. The important thing is to cut off the black/grey parts beyond the pages. In this process, you'll often discover problems with your scan or with previous stages. You can always go back to previous stages to fix them. | | Stage 5 (Margins): I auto-run this. | | Stage 6 (Output): This is important to get right. The despeckling algorithm often breaks formulas (e.g., "..."s get misinterpreted as speckles and removed), so I typically uncheck "Despeckle" when scanning anything technical (it's probably fine for fiction). I also tend to uncheck "Savitzky-Golay smoothing" and "Morphological smoothing" for some reason; I don't remember why (probably they broke something for me in some case). The "threshold" slider is important: experiment with it! (Check which value makes a typical page of your book look crisp. Be mindful of pages that are paler or fatter than others. You can set it for each page separately, but most of the time it suffices to find one value for the whole book, except perhaps the cover.) Note the "Apply To..." buttons; they allow you to promote a setting from a single page to the whole book. (Keep in mind that there are two -- the second one is for the despeckling setting.) | | Now look at the tab on the right of the page. You should see "Output" as the active one, but you can switch to "Fill Zones". This lets you white-out (or black-out) certain regions of the page. This is very useful if you see some speckles (or stupid write-ins, or other imperfections) that need removal. I try not to be perfectionistic: the best way to avoid large speckles is by keeping the scanner clean at the scanning stage; small ones aren't too big a deal; I often avoid this stage unless I _know_ I got something dirty. Some kinds of speckles (particularly those that look like mathematical symbols) can be confusing in a scan. | | There is also a "Picture Zones" tab for graphics and color; that's beyond my paygrade. | | Auto-run stage 6 again at the end (even if you think you've done everything -- it needs to recompile the output TIFFs). | | Now, go to the folder where you have saved your project, and more precisely to its "out/" subfolder. You should see a bunch of .tif files, each one corresponding to a page. Your goal is to collect them into one PDF. I usually do this as follows:
    tiffcp *.tif ../combined.tif
    tiff2pdf -o ../combined.pdf ../combined.tif
    rm -v ../combined.tif
| | Thus you end up with a PDF in the folder in which your project is. | | Optional: add OCR to it; add bookmarks for chapters and sections; add metadata; correct the page numbering (so that page 1 is actual page 1). I use PDF-XChangeLite for all of this, but use whatever tool you know best. | | At that point, your PDF isn't super-compressed (don't know how to get those), but it's reasonable (about 10MB per 200 pages), and usually the quality is almost professional. | | Uploading to LibGen... well, I think they've made the UI pretty intuitive these days :) | | PS. If some of this is out of date or unnecessarily complicated, I'd love to hear!
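For the optional OCR step in the guide above, ocrmypdf (mentioned further down in the thread) is a free command-line alternative; a sketch, with the caveat that the heavier optimization levels lean on extra tools such as jbig2enc and pngquant when they are installed:

    # add a searchable text layer to the assembled scan and recompress the page images
    ocrmypdf --language eng --optimize 3 combined.pdf combined-ocr.pdf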
| crazygringo wrote: | > _At that point, your PDF isn 't super-compressed (don't | know how to get those)_ | | As far as I know, it's making sure your text-only pages are | monochrome (not grayscale) and to use Group4 compression | for them, which is actually what fax machines use (!) and | is optimized specifically for monochrome text. Both TIFF | and PDF's support Group4 -- I use ImageMagick to take a | scanned input page and run grayscale, contrast, Group4 | monochrome encoding, and PDF conversion in one fell swoop | which generates one PDF per page, and then "pdfunite" to | join the pages. Works like a charm. | | I'm not aware of anything superior to Group4 for regular | black and white text pages, but would love to know if there | is. | generationP wrote: | Oh, I should have said that I scan in grayscale, but | ScanTailor (at stage 6) makes the output monochrome; | that's what the slider is about (it determines the | boundary between what will become black and what will | become white). So this isn't what I'm missing. | | I am not sure if the result is G4-compressed, though. Is | there a quick way to tell? | liberalgeneral wrote: | Thank you for your efforts! | | To be clear, I am not advocating for the removal of any files | larger than 30 MiB (or any other arbitrary hard limits). It'd | be great of course to flag large files for further review, but | the current software doesn't do a great job at crowdsourcing | these kinds of tasks (another one being deduplication) sadly. | | Given the very little amount of volunteer-power, I'm suggesting | that a "lean edition" of LibGen can still be immensely useful | to many people. | ssivark wrote: | Files are a very bad unit to elevate in importance, and | number of files or file size are really bad proxy metrics, | especially without considering the statistical distribution | of downloads (leave alone the question of what is more | "important"!). Eg: Junk that's less than the size limit is | implicitly being valued over good content that happens to be | larger in size. Textbooks & reference books will likewise get | filtered out with higher likelihood -- and that would screw | students in countries where they cannot afford them (which | might arguable be a more important audience to some, compared | to those downloading comics). Etc. | | After all this, the most likely human response from people | who really depend on this platform would be to slice a big | file into volumes under the size limit. Seems to be a | horrible UX downgrade in the medium to long term for no other | reason than satisfying some arbitrary metric of | legibility[1]. | | Here's a different idea -- might it be worthwhile to convert | the larger files to better compressed versions eg. PDF -> | DJVU? This would lead to a duplication in the medium term, | but if one sees a convincing pattern that users switch to the | compressed versions without needing to come back to the | larger versions, that would imply that the compressed version | works and the larger version could eventually be garbage | collected. | | Thinking in an even more open-ended manner, if this corpus is | not growing at a substantial rate, can we just wait out a | decade or so of storage improvements before this becomes a | non-issue? How long might it take for storage to become 3x, | 10x, 30x cheaper? | | [1]: https://www.ribbonfarm.com/2010/07/26/a-big-little-idea- | call... | didgetmaster wrote: | > can we just wait out a decade or so of storage | improvements before this becomes a non-issue? 
| | I'm not sure that there is anything on the horizon which | would make duplicate data a 'non-issue'. Capacities are | certainly growing, so within a decade we might see 100TB | HDDs available and affordable 20TB SSDs. But that does not | solve the bandwidth issues. It still takes a long, long | time to transfer all the data. | | The fastest HDD is still under 300MB/s which means it takes | a minimum of 20 hours to read all the data off a 20TB HDD. | That is if you could somehow get it to read the whole thing | at the maximum sustained read speed. | | SSDs are much faster, but it will always be easier to | double the capacity than it is to double the speed. | fragmede wrote: | The problem isn't the technology, it's the cost. Given a | far larger budget, you wouldn't run the hard drives at | anywhere near capacity, in order to gain a read speed | advantage by running a ton in parallel. That'll let you | read 20 TB in a hour if you can afford it. Put it this | way; Netflix is able to do 4k video and that's far more | intensive. | jtbayly wrote: | Agreed. Deduplication should be the bigger goal, in my opinion. | CamperBob2 wrote: | Have to be careful there. A jihad against duplication means | that poor-quality scans will drive out good ones, or prevent | them from ever being created. Especially if you're misguided | enough to optimize for minimum file size. | | I agree with samatman's position below: as long as the format | is the slightest bit lossy -- and it always will be -- | aggressive deduplication has more downsides than upsides. | willnonya wrote: | While intended to agree the duplicates need to be easily | identifiable and preferably filterable by quality for bulk | downloads. | exmadscientist wrote: | Deduplication doesn't have to mean removal. It might be | just tagging. It would be very nice to be able to fetch the | "best filesize" version of the entire collection, then pull | down the "best quality" editions of only a few things I'm | particularly interested in. | signaru wrote: | Probably only safe in cases where the files in question are | exactly the same binaries (if binary diffing can be automated | somehow). | DiggyJohnson wrote: | Even then, I wouldn't want a file with text + illustrations | to be considered a dupe of a text-only copy of the same work. | ajsnigrutin wrote: | Plus there are a lot of books, where one version is a high | quality scan, but no OCR, and the other is OCRed scan (with | a bunch of errors, but searching works 80% of the time) and | horrible scan quality. | | Also, some books included appendices, that are scanned in | some versions but not in others, plus large posters, that | are shrunk to a4 size in one version, split onto multiple | a4 pages in another, and one huge page in a third version. | | Then there are zips of books, containing 1 pdf + eg. | example code, libraries, etc (eg progrmaming books). | samatman wrote: | IMHO a process which is lossy should never be described as | deduplication. | | What would work out fairly well for this use case is to | group files by similarity, and compress them with an | algorithm which can look at all 'editions' of a text. | | This should mean that storing a PDF with a (perhaps badly, | perhaps brilliantly) type-edited version next to it would | 'weigh' about as much as the original PDF plus a patch. | duskwuff wrote: | > IMHO a process which is lossy should never be described | as deduplication. | | Depends. 
There are going to be some cases where files | aren't _literally_ duplicates, but the duplicates don 't | add any value -- for example, MOBI conversions of EPUB | files, or multiple versions of an EPUB with different | publisher-inserted content (like adding a preview of a | sequel, or updating an author's bibliography). | samatman wrote: | Splitting those into two cases: I think getting rid of | format conversions (which can, after all, be performed | again) is worthwhile, but isn't deduplication, that's | more like pruning. | | Multiple versions of an EPUB with slightly different | content is exactly the case where a compression algorithm | with an attention span, and some metadata to work with | can, get the multiple copies down enough in size that | there's no point in disposing of the unique parts. | ad404b8a372f2b9 wrote: | That's funny, I did the same analysis with sci-hub. Back when | there was an organized drive to back it up. | | I downloaded parts of it and wanted to figure out why it was so | heavy, seeing as you'd expect articles to be mostly text and very | light. | | There was a similar distribution of file sizes. My immediate | instinct was also to cut off the tail-end, but looking at the | larger files I realized it was a whole range of good articles | that included high quality graphics that were crucial to the | research being presented, not poor compression or useless bloat. | dredmorbius wrote: | It can be illuminating to look at the size of ePub documents. | This is in general an HTML container (and compressed), such | that file sizes tend to be quite small. A book-length text | (~250 pp or more) might be from 0.3 -- 5 MB, and often at the | lower end of the scale. | | Books with a large number of images or graphics, however, can | still bloat to 40-50 MB or even more. | | Otherwise, generally, text-based PDFs (as opposed to scans) are | often in the 2--5 MB range, whilst scans can run 40--400 MB. | The largest I'm aware of in my own collection is a copy of | Lyell's _Geography_ , sourced from Archive.org. It is of course | scans of the original 19th century typography. Beautiful to | read, but a bit on the weighty side. | liberalgeneral wrote: | I think Sci-Hub is the opposite since 1 DOI = 1 PDF in its | canonical form (straight from the publisher) so neither | duplication nor low-quality is the case. | dredmorbius wrote: | It does depend on when the work was published. Pre-digital | works scanned in without OCR can be larger in size. That's | typically works from the 1980s and before. | | Given the explosion of scientific publishing, that's likely a | small fraction of the archive by _work_ though it may be | significant in terms of _storage_. | aaron695 wrote: | > by filtering any "books" (rather, files) that are larger than | 30 MiB we can reduce the total size of the collection from 51.50 | TB to 18.91 TB, shaving a whopping 32.59 TB | | Books greater than 30 MiB are all the textbooks. | | You are killing the knowledge. | | Also killing a lot of rare things. | | If you want to do something amazing and small, OCR them. | | As an example of greater than 30 meg, I grabbed a short story by | Greg Bear the other day not available digitally, it was in a 90 | meg copy of a 1983 Analog Science Fiction and Fact | | Side note de-duping is an incredibly hard project, how will you | diff a mobi and a epub and then make a decision? Or a decision | between a mobi and a mobi? | | Books also change with time. 
Even in the 90's kids books from the | 60's had been 'edited' These can be hidden gems to collectors. | Cover art also. | gizajob wrote: | One of my favourite places on the internet too. The thing is, you | just search for what you want and spend 10 seconds finding the | right book and link. While I'd love to mirror whole archive | locally, it would really be superfluous because I can only read a | couple of quality books at a time anyway, so building my own | small archive of annotated PDFs (philosophy is my drug of choice) | is better than having the whole. I think it's actually remarkably | free of bloat and cruft considering, but maybe I'm not trawling | the same corners as you are. Do kind of wish they'd clear out the | mobi and djvu versions and make it unified however. | sitkack wrote: | > djvu versions | | This would be disastrous for preservation. Often the djvu | versions have no digital version, the books not in print and | the publisher isn't around. The djvu archives are often | specifically because some old book, _really_ has and had value | to people. | crazygringo wrote: | Yeah, I always convert DVJU to PDF (pretty easy) but it never | compresses quite as nicely. | | DJVU is pretty clever in how it uses a mask layer for more | efficient compression, and as far as I know, converting to | PDF is always done "dumb" -- flattening the DJVU file into a | single image and then encoding that image traditionally in | PDF. | | I wonder if it's possible to create a "lossless" DJVU to PDF | converter, or something close to it, if the PDF primitives | allow it? I'm not sure if they do, if the "mask layer" can be | efficiently reproduced in PDF. | sitkack wrote: | If you smoke enough algebra, you could use the DJVU | algorithm to implement DJVU in PDF with layers. Or heck you | could do it in SVG. | napier wrote: | Is there a torrent available that would allow straightforward | setup of locally storable and accessible Libgen library? For | the storage rich but internet connection reliability poor, | something like this would be a godsend. | mdaniel wrote: | They have a dedicated page where they offer torrents, so pick | one of the currently available hostnames: | https://duckduckgo.com/?q=libgen+torrent&ia=web | | Obviously, folks can disagree on the "straightforward" part | of your comment given the overwhelming number of files we're | discussing | liberalgeneral wrote: | > While I'd love to mirror whole archive locally, it would | really be superfluous because I can only read a couple of | quality books at a time anyway, [...] | | I'd love to agree but as a matter of fact LibGen and Sci-Hub | are (forced to be) "pirates" and they are more vulnerable to | takedowns than other websites. So while I feel no need to | maintain a local copy of Wikipedia, since I'm relatively | certain that it'll be alive in the next decade, I cannot say | the same about those two with the same certainty (not that I | think there are any imminent threats to either, just reasoning | a priori). | jart wrote: | Well when a site claims it's for scientific research | articles, and you search for "Game Of Thrones" and find this: | | https://libgen.is/search.php?req=game+of+thrones&lg_topic=li. | .. | | Someone's going to prison eventually, like The Pirate Bay | founders. It's only a matter of time. | contingencies wrote: | First, SciHub != LibGen. Allied projects that clearly share | a support base but not identical. 
| | Second, please provide a citation for the assertion that | sharing copies of printed fiction erodes sales volume. At | this point, one may assume that anything that helps to sell | computer games and offline swag is cash-in-bank for content | producers. Whether original authors get the same royalties | is an interesting question. | | Third, the former Soviet milieu probably isn't currently in | the mood to cooperate with western law enforcement. | BossingAround wrote: | Speaking of mirroring, is there a way to download one big | "several-hundred-GB" blob with the full content of the sites | for archival purposes? | | Surely that would act as a failsafe to your problem. | charcircuit wrote: | I think it's split into a several different torrents since | it's so big. | scott_siskind wrote: | Why would they clear out djvu? It's one of the best/most | efficient storage format for scanned books. | nsajko wrote: | I'm not for clearing out djvu, but it sure is frustrating | when a PDF isn't available. | | It's not just about laziness preventing one from installing | the more obscure ebook readers which support djvu. It's about | security: I only trust PDFs when I create them myself with | TeX or similar, otherwise I need to use the Chromium PDF | reader to be (relatively) safe. I don't trust the readers | that support Djvu to be robust enough against maliciously | malformed djvu files, as I'm guessing the readers are | implemented in some ancient dialect of C or C++ and I doubt | they're getting much if any scrutiny in the way of security. | crazygringo wrote: | It's super easy to convert a DJVU file to PDF though. | There's an increase in filesize but it's not the end of the | world. | | And since you're creating the PDF yourself seems like you | can trust it? Since nothing malicious could survive the | DJVU to PDF conversion since it's just "dumb" bitmap-based. | xdavidliu wrote: | djvu is really quite a marvellous format, but I'm only able | to read them on Evince (the default pdf reader that comes | with Debian, Fedora, and probably a bunch of other distros). | For my macbook I need to download a Djvu reader, and for my | ipad, I didn't even bother trying because the experience | would likely be much worse than Preview / Ibooks. | eru wrote: | Apparently you can install Evince on MacOS as well. But I | haven't tried it there. | | Evince doesn't come by default with Archlinux (my desktop | distribution of choice), but I still install it everywhere. | nsajko wrote: | > Evince doesn't come by default with Archlinux (my | desktop distribution of choice) | | This doesn't make sense; nothing comes "by default" on | Arch, but evince _is_ in the official repos as far as I | see. | dredmorbius wrote: | DJVU is supported by numerous book-reading applications, | including (in my experience) FB Reader (FS/OSS), | Pocketbook, and Onyx's Neoreader. | | As a format for preserving full native scan views (large, | but often strongly preferable for visually-significant | works or preserving original typesetting / typography), | DJVU is highly useful. | | I _do_ wish that it were more widely supported by both | toolchains and readers. That will come in time, I suspect. | MichaelCollins wrote: | Calibre supports djvu on any platform. Deleting djvu books | just because Microsoft and Apple don't see fit to support | it by default would be a travesty. 
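For reference, the DJVU-to-PDF conversion crazygringo describes can be done with ddjvu from the DjVuLibre package; a minimal sketch (as noted above, the layered DjVu data gets flattened to page images, so the resulting PDF is usually larger than the original):

    ddjvu -format=pdf book.djvu book.pdf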
| gizajob wrote: | My comment about djvu was mostly just about my own laziness, | because (kill me if you need to) I like using Preview on the | Mac for reading and annotating, and it doesn't read them, and | once they have to live in a djvu viewer, I tend not to read | them or mark them up. Same goes for Adobe Acrobat Reader when | I'm on Windows on my University's networked PCs. | repple wrote: | This book has a great overview of the origins of library genesis. | | Shadow Libraries: Access to Knowledge in Global Higher Education | | https://libgen.is/search.php?req=shadow+libraries | gmjoe wrote: | Honestly, it's not a big problem. | | First of all, bloat has nothing to do with file size -- EPUB's | are often around 2 MB, typeset PDF's are often 2-10 MB (depending | on quantity of illustrations), and scanned PDF's are anywhere | from 10 MB (if reduced to black and white) to 100 MB (for colors | scans, like where necessary for full-color illustrations). | | The idea of a 30 MB cutoff does nothing to reduce bloat, it just | removes many of the most essential textbooks. :( Also it's very | rare to see duplicates of 100 MB PDF's. | | Second, file duplication is there, but it's not really an | unwieldy problem right now. Probably the majority of titles have | only a single file, many have 2-5 versions, and a tiny minority | have 10+. But they're often useful variants -- different editions | (2nd, 3rd, 4th) plus alternate formats like reflowable EBUB vs | PDF scan. These are all genuinely useful and need to be kept. | | Most of the unhelpful duplication I see tends to fall into three | categories: | | 1) There are often 2-3 versions of the identical typeset PDF | except with a different resolution for the cover page image. That | one baffles me -- zero idea who uploads the extras or why. My | best guess is a bot that re-uploads lower-res cover page | versions? But it's usually like original 2.7 MB becoming 2.3 MB, | not a big difference. Feels very unnecessary to me. | | 2) People (or a bot?) who seem to take EPUB's and produce PDF | versions. I can understand how that could be done in a helpful | spirit, but honestly the resulting PDF's are so abysmally ugly | that I really think people are better off producing their own | PDF's using e.g. Calibre, with their own desired paper size, | font, etc. Unless there's no original EPUB/MOBI on the site, PDF | conversions of them should be discouraged IMHO | | 3) A very small number of titles do genuinely have like 5+ | seemingly identical EPUB versions. These are usually very popular | bestselling books. I'm totally baffled here as to why this | happens. | | It does seem like it would be a nice feature to be able to leave | some kind of crowdsourced comments/flags/annotations to help | future downloaders figure out which version is best for them | (e.g. is this PDF an original typeset, a scan, or a conversion? | -- metadata from the uploader is often missing or inaccurate | here). But for a site that operates on anoynmity, it seems like | this would be too open to abuse/spamming. Being able to delete | duplicates opens the door to accidental or malicious deleting of | anything. I'd rather live with the "bloat", it's really not an | impediment to anything at the moment. | titoCA321 wrote: | When you look at movie pirates, there's still uploads of Xvid | in 2022. Crap goes in as PDF, mobi, epub, txt and comes out as | PDF, mobi, DOCX, txt. 
| agumonkey wrote: | There are classes of books that are significantly larger than the rest, like medical / biology books. I don't know if they embed vector-based images of the whole body or maybe hundreds of images, but it's surprising how big they are. | | Who's in to do some large-scale data gathering about unoptimized books and potentially redundant ones? Or maybe trim PDFs (qpdf can optimize a structure to an extent). | liberalgeneral wrote: | Database dumps are available here if you are interested: http://libgen.rs/dbdumps/ | | libgen_compact_* is what you are probably looking for, but they are all SQL dumps so you'll need to import them into MySQL first. :/ | agumonkey wrote: | The dumps are not enough; one has to scan the actual file content to assess the quality. | | Are you alone in your analysis, or are there groups who try to improve LG? | [deleted] | Synaesthesia wrote: | > "30 MiB ought to be enough for anyone" | | Sometimes you have e.g. a history book which has a lot of high-quality photos, and then it can be quite large. | spiffistan wrote: | I've been dreaming of a book decompiler that would use some newfangled AI/ML to produce a perfectly typeset copy of an older book; in the same font or similar, recognizing multiple languages and scripts within the work. | copperx wrote: | In the same vein, I would like an e-reader that has TeX or InDesign quality typesetting. I'd settle for Knuth-Plass line breaking with decent justification (and hyphenation). | | At the very least, make it so that headings do not appear at the bottom of a page. Who thought that was OK? | signaru wrote: | I have experience scanning personal books and also try to reduce them in size, since I'm also concerned with bloat on my (older) mobile reading devices. Unfortunately, there are reasons I cannot upload those, but the procedures might still be helpful for existing scans. | | Use ScanTailor to clean them up. If there is no need for color/grayscale, have the output strictly black and white. | | OCR them with Adobe Acrobat ClearScan (or something else; that is what I have). | | Convert to black and white DJVU (Djvu-Spec). | | Dealing with color is another thing, and takes some time. I find that using G'MIC's anisotropic smoothing can help with the ink-jet/half-tone patterns. But it's too time consuming to be used for books. | pronoiac wrote: | I like ScanTailor! I've used ocrmypdf for the OCR and compression steps. It uses lossless JBIG2 by default, at 2 or 3k per page; I'm curious how that compares to DJVU. (And my mistake, pdf and DJVU are competing container formats.) | signaru wrote: | If the PDF is from a scanned source, converting it to DJVU with equivalent DPI typically results in about half the file size (figures can vary depending on the specifics of the PDF source). | powera wrote: | Curation is hard, particularly for a "community" project. | | Every file is there for a reason, and much of the time, even if it is a stupid reason, removing it means there is one more person opposed to the concept of "curation". | Hizonner wrote: | Um, if the goal is to fit what you can onto a 20TB hard drive at home, then nobody is stopping you from choosing your own subset, as opposed to deleting stuff out of the main archive based on ham-handed criteria... | mjreacher wrote: | I think one of the problems is the lack of a good open source PDF compressor.
We have good open source OCR software like ocrmypdf, which I've seen used before, but some of the best-compressed books I've seen on libgen used some commercial compressor, while the open source ones I've used were generally quite lackluster. This applies doubly so when people rip images from another source, combine them into a PDF, then upload it as a high-resolution PDF which inevitably ends up being between 70 and 370 MB. | | How to deal with duplication is also a very difficult problem, because there are loads of reasons why things could be duplicated. Take a textbook: I've seen duplicates which differ in one or several of the following: different editions, different printings (of any particular edition), added bookmarks/table of contents for the file, removed blank white pages, removed front/end cover pages, removed introduction/index/copyright/book information pages, LaTeX'd copies of pre-TeX textbooks, OCR'd, different resolution, other kinds of software optimization that result in wildly different file sizes, different file types (e.g. .chm, PDFs that are straight conversions from epub/mobi), etc. Some of this can be detected by machines, e.g. usage of OCR, but some of the other things aren't easy at all to detect. | crazygringo wrote: | What commercial compressor/performance are you talking about? | | AFAIK the best compression you see is monochrome pages encoded in Group4, which ImageMagick (open source) will do, for example, and which ocrmypdf happily works on top of. | | Otherwise it's just your choice of underlying JPG, PNG, or JPEG 2000, and it's up to you to set your desired lossy compression ratio. ___________________________________________________________________ (page generated 2022-08-21 23:00 UTC)