[HN Gopher] Academic Torrents
___________________________________________________________________
 
Academic Torrents
 
Author : julianj
Score  : 231 points
Date   : 2020-01-13 12:29 UTC (10 hours ago)
 
(HTM) web link (academictorrents.com)
(TXT) w3m dump (academictorrents.com)
 
| krick wrote:
| Yeah, it could really benefit from some organizational work, like
| on more mature music torrent trackers: categories, mandatory
| tags, unified names, review by community-chosen per-category
| moderators. In its current state it's basically a file dump:
| either you have the direct link, or you can only _hope_ to find
| something interesting. Not that much better than sharing magnet
| links via public pastebin records...
| colechristensen wrote:
| One very interesting thing I wish were studied in depth is the
| virtual economies of mature trackers. Limiting access to
| resources, and granting increasing access for contributing and
| for correcting quality, has in places been extremely successful.
| It is interesting to see the varying quality and the associated
| economic mechanics.
| 
| Some environments based just on prestige have big problems with
| toxicity (StackOverflow, Wikipedia), which I didn't see _at all_
| in some music trackers.
| ryacko wrote:
| Wikipedia does cover that issue. Competing views are difficult
| to reconcile.
| 
| https://en.wikipedia.org/w/index.php?title=Wikipedia:Systemi...
| 
| (using a version of the article from ten years ago, because
| everything is unnecessarily verbose on Wikipedia now)
| ailideex wrote:
| I'm not sure what the point of quoting that is, really. I guess
| if you subscribe to the idea that reality is somehow modified by
| your age, gender, sex, education, or whatever the heck, then it
| has some relevance; but then the whole idea behind an
| encyclopedia seems pointless, and we should each just maintain
| our own unique knowledge bases, as they will have no relevance
| to anyone other than us.
| 
| That an article like that exists is patently absurd in my view
| and kind of makes me a bit ill. Things like that are what led to
| this: https://www.youtube.com/watch?v=C9SiRNibD14
| 
| I really firmly believe that if you think there is a European (?)
| science and an African science, and that they are distinct and
| equally valid, then either you or I do not belong on Wikipedia,
| and I would actually like Wikipedia to clarify its mission in
| this light.
| husainalshehhi wrote:
| Downloading some of this might be illegal. I see some entries
| that say "No license specified, the work may be protected by
| copyright."
| glofish wrote:
| Cool idea, and it is impressive that it is still around - alas,
| it is flawed the same way all scientific data is flawed.
| 
| There is no metadata - all you have is an awkward, imprecise
| textual search of the abstract that comes with the data. Good
| luck hosting the world's data that way.
| ieee8023 wrote:
| There is metadata. It is stored in BibTeX along with every
| torrent. This format allows it to be a freeform database where
| users can add fields as they want. We (Academic Torrents) can
| then build new ways to display this metadata. Also, the
| "abstract" part of the metadata is rendered as Markdown on the
| details page of a torrent. Here is a good example:
| https://academictorrents.com/details/d52ccc21455c7a82fd6e589...
| glofish wrote:
| Ok, I see that there is code provided there. Better than
| nothing, but geez, it is not really what metadata should look
| like:
| 
|     def get_labels(rightside):
|         met = {}
|         met['brain'] = (1. * (rightside != 0).sum() /
|                         (rightside == 0).sum())
|         met['tumor'] = (1. * (rightside > 2).sum() /
|                         ((rightside != 0).sum() + 1e-10))
|         met['has_enough_brain'] = met['brain'] > 0.30
|         met['has_tumor'] = met['tumor'] > 0.01
|         return met
| 
| I will say that it is very handy to know exactly how the labels
| were computed.
| 
| What I really meant is a way to search and select data based on
| metadata.
| For example, has_tumor.
| 
| Also note how everything is still one single blob: to get one
| line of any of the files, one would need to download everything.
| Mathnerd314 wrote:
| BitTorrent does support partial downloads that request only some
| files or byte ranges out of a torrent. Some of the torrents are
| just compressed zips, but for the others you could look at the
| code / documentation to see which files were relevant before
| downloading 10GB of data.
| 
| I think the abstract is sufficient for searching data; expecting
| some kind of smart database that can handle all the weird
| formats science uses is a bit much.
| mtone wrote:
| | one would need to download everything
| 
| Just download it, then. We got mp3 albums off Napster on modems
| back in the day; surely getting that torrent is easier and
| faster today.
| derefr wrote:
| One nice thing about digital data, as opposed to physical
| artefacts, is that you don't need to keep digital data's
| metadata attached to the data "at the hip."
| 
| Through the magic of cryptographic hash algorithms, you can just
| keep your data sets floating around "raw" (like in these
| torrents), and then, _elsewhere_, ascribe metadata _to the hash_
| of the content it is meant to annotate.
| 
| Then, later, you can reassemble them in either order -- either
| by first finding a data set, hashing it, and then looking up its
| metadata in some metadata-hosting service; or by first browsing
| a catalogue of indexed metadata, finding a dataset that meets
| your needs, and then retrieving the data set _by_ its hash.
| 
| Which is to say: with digital data, library science (creating
| metadata and chains of custody and indexing them for search) and
| archiving (ensuring access to pristine artifacts over time)
| don't need to happen at the same time, in the same place. There
| can be separate "artifact hosting" and "metadata library"
| services.
| (Which is especially helpful in contexts where private IP is
| involved -- you can still keep, in your metadata library, the
| metadata for a data set you don't have the rights to; and those
| _with_ the rights can go get the data set themselves.)
| metasj wrote:
| This flexibility in time, specialization, and order of operation
| is surely one of the joys of modern digital collections.
| 
| Library scientists might say archiving and structuring and
| curation are all facets of that science. And you'll also want a
| hash search engine that finds related hashes, as there can be
| many revisions and versions, only some of which have any
| metadata.
| bordercases wrote:
| Aaaand someone has to do the work of computing the index and
| annotating the hashes.
| robbya wrote:
| I think it's worth recognizing that this is a good first step in
| a hard problem. Hosting many TB of data for free isn't easy.
| Building an index on top of that data isn't easy either, and it
| looks like no such index exists today, but if someone decided to
| build that index they wouldn't need to worry about the hosting
| portion of the problem. That's a great starting point.
| robbya wrote:
| https://academictorrents.com/about.php#mirroring
| 
| Using RSS to let mirrors host different subjects is really
| clever, although some of the categories seem quite large (>5TB).
| It may be worth breaking up each category (sharding) to keep
| each shard to 100GB or less, so a volunteer can pick a couple
| and not worry about running out of disk when a category grows.
| 
| Then it would be good to track how many seeds each
| category-shard has, so volunteers can help where help is most
| needed.
| DuskStar wrote:
| Some individual items are multiple TB, which would make 100GB
| shards a little difficult.
| aldoushuxley001 wrote:
| This is amazing, really a great source of data.
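derefr's "ascribe metadata to the hash" scheme above can be sketched in a few lines. This is a minimal illustration under assumed names -- `MetadataLibrary`, `annotate`, and `lookup` are invented for the "metadata library" service he describes, not anything Academic Torrents actually ships:

```python
import hashlib

def sha256_of(data: bytes) -> str:
    """Content hash that ties metadata to a dataset without storing it."""
    return hashlib.sha256(data).hexdigest()

class MetadataLibrary:
    """Toy metadata store: content hash -> free-form metadata dict.
    It never needs to hold the data itself, only annotations on hashes."""

    def __init__(self):
        self._by_hash = {}

    def annotate(self, digest: str, metadata: dict) -> None:
        # We can annotate a hash even for data we have no rights to host.
        self._by_hash.setdefault(digest, {}).update(metadata)

    def lookup(self, digest: str) -> dict:
        return self._by_hash.get(digest, {})

# Order 1: have the data set, hash it, look up its metadata.
dataset = b"...raw dataset bytes..."
lib = MetadataLibrary()
lib.annotate(sha256_of(dataset), {"title": "example scan", "has_tumor": True})
print(lib.lookup(sha256_of(dataset))["title"])  # -> example scan
```

The reverse order works the same way: browse the library's annotations first, then fetch the artifact from whatever host serves that hash (e.g. a torrent).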
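robbya's category "sharding" above is essentially a bin-packing problem. A minimal sketch, assuming the 100GB cap he proposes and a simple greedy first-fit strategy (both assumptions, not anything the site implements); per DuskStar's objection, an item larger than the cap simply gets a shard of its own:

```python
CAP_GB = 100  # assumed per-shard budget from the comment above

def pack_shards(item_sizes_gb, cap=CAP_GB):
    """Greedily pack item sizes (in GB) into shards of at most `cap` GB.
    An oversized item becomes a singleton shard rather than being split."""
    shards, current, used = [], [], 0
    for size in item_sizes_gb:
        if current and used + size > cap:
            shards.append(current)  # close the current shard
            current, used = [], 0
        current.append(size)
        used += size
    if current:
        shards.append(current)
    return shards

# A small category with one oversized 250GB item:
sizes = [40, 70, 30, 250, 90, 15]
print(pack_shards(sizes))  # -> [[40], [70, 30], [250], [90], [15]]
```

Tracking seed counts per shard, as suggested above, would then just be a second map from shard index to seeder count so volunteers can adopt the neediest shards first.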
| yig wrote:
| 2016 HN discussion: https://news.ycombinator.com/item?id=12381791
| 
| 2014 HN discussion: https://news.ycombinator.com/item?id=7149006
| dang wrote:
| 2018 too: https://news.ycombinator.com/item?id=17744150
| DuskStar wrote:
| I wish I could add Gwern's Danbooru dataset [0] here - 2.7TB of
| labeled anime images. But they only support torrent files up to
| 10MB, and that's over 20MB for the full dataset, or 12MB for the
| SFW low-res set...
| 
| Incidentally, when the _torrent file_ for your anime image
| collection passes 20MB, something has obviously gone very wrong,
| right?
| 
| 0: https://www.gwern.net/Danbooru2019
___________________________________________________________________
(page generated 2020-01-13 23:00 UTC)