[HN Gopher] Academic Torrents
       ___________________________________________________________________
        
       Academic Torrents
        
       Author : julianj
       Score  : 231 points
       Date   : 2020-01-13 12:29 UTC (10 hours ago)
        
 (HTM) web link (academictorrents.com)
 (TXT) w3m dump (academictorrents.com)
        
       | krick wrote:
       | Yeah, it could really benefit from some organizational work, like
       | on more mature music torrent trackers or such. Categories,
       | mandatory tags, unified names, reviewed by community-chosen
        | category-wise moderators. In its current state it's basically a
       | file dump, either you have the direct link, or you can only
       | _hope_ to find something interesting. Not that much better than
       | sharing magnet links via public pastebin records...
        
         | colechristensen wrote:
         | One very interesting thing I wish would be studied in depth are
         | the virtual economies of mature trackers. Limiting access to
         | resources and granting increasing access for contributing and
         | correcting quality has in places been extremely successful. It
         | is interesting to see the varying quality and associated
         | economic mechanics.
         | 
         | Some environments, based just on prestige, have big problems
         | with toxicity (StackOverflow, Wikipedia) which I didn't see _at
         | all_ in some music trackers.
        
           | ryacko wrote:
           | Wikipedia does cover that issue. Competing views are
           | difficult to reconcile.
           | 
           | https://en.wikipedia.org/w/index.php?title=Wikipedia:Systemi.
           | ..
           | 
           | (using a version of the article from ten years ago because
           | everything is unnecessarily verbose on wikipedia now)
        
             | ailideex wrote:
             | I'm not sure what the point of quoting that is really. I
             | guess if you subscribe to the idea that reality is somehow
              | modified by your age, gender, sex, education or
             | whatever the heck then it has some relevance but then the
             | whole idea behind an encyclopedia seems pointless and we
             | should just each maintain our own unique knowledge bases as
             | they will have no relevance to someone other than us.
             | 
             | That an article like that exists is patently absurd in my
              | view and kind of makes me a bit ill. Things like that are
             | what led to this:
             | https://www.youtube.com/watch?v=C9SiRNibD14
             | 
             | I really firmly believe that if you think there is a
             | European (?) science and an African science and they are
              | distinct and equally valid then either you or I do not
             | belong on Wikipedia and I would actually like Wikipedia to
             | clarify their mission in this light.
        
       | husainalshehhi wrote:
       | Downloading some of this might be illegal. I see some entries
       | that says "No license specified, the work may be protected by
       | copyright."
        
       | glofish wrote:
       | Cool idea, it is impressive that it is still around - alas it is
       | flawed the same way all scientific data is flawed.
       | 
       | There is no metadata - all you have is an awkward imprecise
       | textual search of the abstract that comes with the data. Good
       | luck hosting the world's data that way.
        
         | ieee8023 wrote:
         | There is metadata. It is stored in bibtex along with every
         | torrent. This format allows it to be a freeform database where
         | the user can add fields as they want. We (Academic Torrents)
         | can then build new ways to display this metadata. Also the
         | "abstract" part of the metadata is rendered as markdown on the
         | details page of a torrent. Here is a good example:
         | https://academictorrents.com/details/d52ccc21455c7a82fd6e589...
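The freeform-fields idea can be sketched in a few lines; the entry below and its extra `modality` field are invented for illustration, not taken from the site:

```python
import re

# Hypothetical bibtex-style entry of the kind stored alongside a torrent;
# users can add freeform fields (here, "modality") as they want.
entry = """@dataset{example_2013,
  title    = {Example brain imaging data},
  year     = {2013},
  modality = {MRI},
}"""

def parse_fields(bibtex):
    """Extract field = {value} pairs from a single bibtex entry."""
    return dict(re.findall(r"(\w+)\s*=\s*\{([^}]*)\}", bibtex))

print(parse_fields(entry)["modality"])  # → MRI
```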
        
           | glofish wrote:
            | Ok, I see that there is code provided there. Better than
            | nothing but geez, it is not really what metadata should be
            | like:
            | 
            |     def get_labels(rightside):
            |         met = {}
            |         met['brain'] = (
            |             1. * (rightside != 0).sum() / (rightside == 0).sum())
            |         met['tumor'] = (
            |             1. * (rightside > 2).sum() / ((rightside != 0).sum() + 1e-10))
            |         met['has_enough_brain'] = met['brain'] > 0.30
            |         met['has_tumor'] = met['tumor'] > 0.01
            |         return met
           | 
           | I will say that it is very handy to know exactly how the
           | labels were computed.
           | 
           | What I really meant is a way to search and select data based
           | on metadata. For example has_tumor.
           | 
           | Also note how everything is still one single blob, to get one
           | line of any of the files, one would need to download
           | everything.
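The kind of metadata search being asked for could look like the sketch below; the records and field names are invented for illustration:

```python
# Invented records; in practice these would come from each torrent's
# metadata rather than being hard-coded.
datasets = [
    {"name": "scan_a", "has_tumor": True,  "size_gb": 12},
    {"name": "scan_b", "has_tumor": False, "size_gb": 3},
    {"name": "scan_c", "has_tumor": True,  "size_gb": 40},
]

def search(records, **criteria):
    """Select records whose fields match every given criterion."""
    return [r for r in records
            if all(r.get(k) == v for k, v in criteria.items())]

print([r["name"] for r in search(datasets, has_tumor=True)])
# → ['scan_a', 'scan_c']
```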
        
             | Mathnerd314 wrote:
             | Bittorrent does support partial downloads that request only
             | some files or byte ranges out of a torrent. Some of the
              | torrents are just compressed zips but for the others you
             | could look at the code / documentation to see which files
             | were relevant before downloading 10GB of data.
             | 
             | I think the abstract is sufficient for searching data;
             | expecting some kind of smart database that can handle all
             | the weird formats science uses is a bit much.
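A toy sketch of the arithmetic behind selective downloading, assuming the usual BitTorrent layout where fixed-size pieces span the concatenation of all files (real clients expose this via file priorities, so this is illustration only):

```python
def pieces_for_file(offset, length, piece_length):
    """Indices of the pieces a client must fetch to reconstruct one
    file inside a multi-file torrent (pieces span all files in order)."""
    first = offset // piece_length
    last = (offset + length - 1) // piece_length
    return list(range(first, last + 1))

# A 6 MiB file sitting 100 MiB into a torrent with 4 MiB pieces needs
# only two pieces, not the whole multi-GB torrent.
mib = 1024 * 1024
print(pieces_for_file(100 * mib, 6 * mib, 4 * mib))  # → [25, 26]
```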
        
             | mtone wrote:
             | | one would need to download everything
             | 
             | Just download it then. We got mp3 albums off Napster on
             | modems back in the day, surely getting that torrent is
             | easier and faster today.
        
         | derefr wrote:
         | One nice thing about digital data, as opposed to physical
         | artefacts, is that you don't need to keep digital data's
         | metadata attached to the data "at the hip."
         | 
         | Through the magic of cryptographic hash algorithms, you can
         | just keep your data sets floating around "raw" (like in these
          | torrents), and then, _elsewhere_, ascribe metadata _to the_
         | hash_ of the content it is meant to annotate.
         | 
         | Then, later, you can reassemble them in either order--either by
         | first finding a data set, hashing it, and then looking up
         | metadata in some metadata-hosting service; or by first browsing
         | a catalogue of indexed metadata, finding out about a dataset
         | that meets your needs, and then retrieving the data set _by_
         | its hash.
         | 
         | Which is to say: with digital data, library science (creating
         | metadata and chains-of-custody and indexing them for search)
         | and archiving (ensuring access to pristine artifacts over time)
         | don't need to happen at the same time, in the same place. There
         | can be separate "artifact hosting" and "metadata library"
         | services. (Which is especially helpful in contexts where
         | private IP is involved--you can still keep in your metadata
         | library, the metadata for a data-set you don't have the rights
         | to; and those _with_ the rights can go get the data-set
         | themselves.)
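The scheme described above can be sketched with stdlib hashing; the catalogue and its metadata fields are invented for illustration:

```python
import hashlib

def content_id(data: bytes) -> str:
    """Content address: the SHA-256 hex digest of the raw bytes."""
    return hashlib.sha256(data).hexdigest()

# The metadata library lives apart from the data, keyed only by hash.
catalogue = {}
dataset = b"raw dataset bytes"
catalogue[content_id(dataset)] = {"title": "Example set", "license": "CC0"}

# Later, anyone holding the same bytes can recompute the hash and look
# up the metadata; anyone browsing the catalogue can fetch by hash.
print(catalogue[content_id(b"raw dataset bytes")]["title"])  # → Example set
```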
        
           | metasj wrote:
           | This flexibility in time, specialization, and order of
           | operation is surely one of the joys of modern digital
           | collections.
           | 
           | Library scientists might say archiving and structuring and
           | curation are all facets of that science. And you'll also want
           | a hash search engine that finds related hashes, as there can
           | be many revisions + versions, only some of which have some
           | metadata.
        
           | bordercases wrote:
           | Aaaand someone has to do the work for computing the index and
           | annotating the hashes.
        
             | robbya wrote:
             | I think it's worth recognizing that this is a good first
             | step in a hard problem. Hosting many TB of data for free
             | isn't easy. Building an index on top of that data isn't
             | easy either, and it looks like no such index exists today,
             | but if someone decided to build that index they wouldn't
             | need to worry about the hosting portion of the problem.
             | That's a great starting point.
        
       | robbya wrote:
       | https://academictorrents.com/about.php#mirroring
       | 
       | Using RSS to allow mirrors to host different subjects is really
       | clever, although some of the categories seem quite large (>5TB).
       | It may be worth breaking up each category (sharding) to keep each
       | to 100GB or less so a volunteer can pick a couple and not worry
       | about running out of disk when a category grows.
       | 
       | Then it would be good to track how many seeds each category-shard
       | has so volunteers can help where it's most needed.
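The sharding idea might be sketched as a greedy first-fit packing; the sizes are invented, and an oversize item simply gets a shard of its own:

```python
def shard(items, max_gb=100):
    """Greedy first-fit: pack (name, size_gb) items into shards of at
    most max_gb each, so a volunteer can mirror a single shard."""
    shards = []
    for name, size in sorted(items, key=lambda x: -x[1]):
        for s in shards:
            if sum(sz for _, sz in s) + size <= max_gb:
                s.append((name, size))
                break
        else:  # nothing fits (or the item is oversize): open a new shard
            shards.append([(name, size)])
    return shards

print(len(shard([("a", 60), ("b", 50), ("c", 40), ("d", 30)])))  # → 2
```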
        
         | DuskStar wrote:
         | Some individual items are multiple TB, which would make 100GB
         | shards a little difficult.
        
       | aldoushuxley001 wrote:
       | This is amazing, really a great source of data.
        
       | yig wrote:
       | 2016 HN discussion: https://news.ycombinator.com/item?id=12381791
       | 
       | 2014 HN discussion: https://news.ycombinator.com/item?id=7149006
        
         | dang wrote:
         | 2018 too: https://news.ycombinator.com/item?id=17744150
        
       | DuskStar wrote:
       | I wish I could add Gwern's Danbooru dataset [0] here - 2.7TB of
       | labeled anime images. But they only support torrent files up to
       | 10MB, and that's over 20MB for the full dataset or 12MB for the
       | SFW low-rez set...
       | 
       | Incidentally, when the _torrent file_ for your anime image
        | collection passes 20MB, something has obviously gone very
        | wrong, right?
       | 
       | 0: https://www.gwern.net/Danbooru2019
        
       ___________________________________________________________________
       (page generated 2020-01-13 23:00 UTC)