[HN Gopher] Download the Entire Wikimedia Database
       ___________________________________________________________________
        
       Download the Entire Wikimedia Database
        
       Author : surround
       Score  : 49 points
       Date   : 2021-03-06 20:46 UTC (2 hours ago)
        
 (HTM) web link (dumps.wikimedia.org)
 (TXT) w3m dump (dumps.wikimedia.org)
        
       | orblivion wrote:
       | You can also get it in a user-friendly format with the
       | application Kiwix (https://www.kiwix.org/) if that's your use
       | case. PC, phone, or server. You get subsets of the data, and
       | images are smaller to save space.
        
         | MeinBlutIstBlau wrote:
         | Kiwix use is still somewhat hit or miss when browsing. I'm not
         | sure how it handles text parsing, but search either takes
         | forever or doesn't return results, making it sort of unusable.
         | 
         | But it's still a fantastic and incredible piece of software.
         | When it gets to the point where I can portably keep the full
         | 60 GB ZIM file seamlessly, it will change simple computing for
         | low-bandwidth areas. Imagine the uses as a portable and
         | versatile database that could accept JSON, HTML/CSS, and other
         | data to make your own offline encyclopedias!
        
         | kregasaurusrex wrote:
         | It's useful to browse on a phone if you have limited mobile
         | data, and the text-only English Wikipedia fits onto a modern
         | micro SD card.
        
       | karlicoss wrote:
       | Wouldn't it be cool if Steam supported distributing an offline
       | Wikipedia database? It's just a few gigs (depending on
       | languages/images/etc, but it fits the DLC model perfectly), and
       | it already uses bittorrent.
        
       | dwheeler wrote:
       | I'm so glad the download-entire-wikipedia function continues to
       | exist. That will help counter the "lost the entire library
       | problem" from the city of Alexandria. To be fair, Wikipedia only
       | has summaries, not the detailed material, but it's still
       | important.
        
         | tablespoon wrote:
         | > I'm so glad the download-entire-wikipedia function continues
         | to exist. That will help counter the "lost the entire library
         | problem" from the city of Alexandria. To be fair, Wikipedia
         | only has summaries, not the detailed material, but it's still
         | important.
         | 
         | Personally, I think Wikipedia's quality is too poor for that.
         | Plus, it's digital, so when our civilization is at risk of
         | "[losing] the entire library" it probably would have already
         | lost the ability to maintain the computer systems to access
         | Wikipedia dumps.
        
           | smoldesu wrote:
           | The content on Wikipedia is really not that bad. Obviously a
           | Wikipedia article will never be the final say on any specific
           | subject, but it tends to do a pretty good job of aggregating
           | sources and condensing it into a reader-friendly synopsis.
           | This data is super valuable, if not just for the sources
           | alone.
        
           | buzzerbetrayed wrote:
           | > so when our civilization is at risk of "[losing] the entire
           | library" it probably would have already lost the ability to
           | maintain the computer systems to access Wikipedia dumps
           | 
           | But as long as it continues to exist, some future
           | civilization could figure out how to read the data again,
           | eventually. Just like we eventually discovered how to read
           | ancient languages that were once forgotten.
        
             | tablespoon wrote:
             | > But as long as it continues to exist, some future
             | civilization could figure out how to read the data again,
             | eventually. Just like we eventually discovered how to read
             | ancient languages that were once forgotten.
             | 
             | Eh, I think you're vastly underestimating how difficult
             | that would be.
             | 
             | 1. The media would have to last hundreds of years at least,
             | when it's _hoped_ modern archival media can last _maybe_
             | fifty.
             | 
             | 2. Even assuming the media did last, the new civilization
             | would have to reverse engineer encoding on top of encoding
             | on top of encoding (e.g. disk encoding, complex
             | filesystems, file formats, character encodings). Our
             | civilization _already_ has trouble reading some old file
             | formats.
             | 
             | It took the Rosetta Stone to figure out how to read the
             | encoding of Egyptian hieroglyphics, even though that
             | language was still alive in the form of Coptic.
             | 
             | 3. Then you're dealing with the probability that the hard
             | disks the future archeologists find will even have a
             | Wikipedia dump on them. That probability will be very
             | small, given very few people will download these dumps.
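
To make the "encoding on top of encoding" point concrete, here is a minimal sketch of just two of those layers: a character encoding (UTF-8) wrapped in a compression format (bzip2, which the Wikimedia dumps also use). The sample text and the helper names `wrap` and `unwrap` are made up for illustration; a real dump adds further layers (XML, wikitext, filesystem, disk encoding) on top.

```python
import bz2


def wrap(text: str) -> bytes:
    """Apply two layers: encode the text as UTF-8, then compress with bzip2."""
    return bz2.compress(text.encode("utf-8"))


def unwrap(blob: bytes) -> str:
    """Undo the layers in reverse order: decompress, then decode."""
    return bz2.decompress(blob).decode("utf-8")


# Hypothetical sample text, standing in for an article.
article = "Rosetta Stone: a stele inscribed in three scripts."
blob = wrap(article)

# Without knowing both layers, the stored bytes are not readable as text.
# The only obvious hint is the bzip2 magic number at the start.
print(blob[:3])      # b'BZh'
print(unwrap(blob))  # the original article text
```

A reader who does not know the order and kind of each layer has to rediscover all of them before any text appears.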
        
         | LeoPanthera wrote:
         | Kiwix offers downloadable material (including the _full_
         | Wikipedia), in a format specifically designed for offline
         | browsing.
         | 
         | https://wiki.kiwix.org/wiki/Content
         | 
         | Their Wikipedia bundle wasn't being updated for a while and had
         | fallen out of date, but that seems to have been fixed now.
        
         | qwertywert_ wrote:
         | Can it also download the sources too, if available? That would
         | be cool.
        
         | porphyra wrote:
         | It is pretty awesome that there are people like /r/datahoarder
         | that are obsessed with backing up the collective knowledge of
         | humanity.
        
           | aarchi wrote:
           | There's also Archive Team, focused on preserving at-risk
           | sites before being taken offline.
           | 
           | https://wiki.archiveteam.org/
        
           | capableweb wrote:
           | I'm not familiar with r/datahoarder, but if the name bears
           | any significance, it seems they are mostly centered on
           | hoarding data, which means just digital, I guess? If so, I
           | would much rather promote efforts like the Internet
           | Archive that back up all kinds of things, not just digital
           | data.
        
             | KMnO4 wrote:
             | What does the Internet Archive back up that isn't
             | represented by 1s and 0s?
        
               | LeoPanthera wrote:
               | The Physical Archive.
               | https://en.wikipedia.org/wiki/Internet_Archive#Physical_medi...
        
               | cguess wrote:
               | A ton of 35mm and 16mm film reels, vinyl and even wax
               | recordings, physical books, and a lot more. They make
               | digital copies of them, but they also archive the
               | physical versions. Here's a selection of the
               | movies:
               | https://archive.org/details/moviesandfilms?tab=about
        
       | bawolff wrote:
       | Well not the entire db, just the public parts. User passwords are
       | not included ;)
        
       | dudus wrote:
       | Wikipedia is always bugging me about donations, and yet here is
       | a feature they could charge for, or at least use to hint at
       | donating. It would be perfectly acceptable to charge here since
       | abuse of this can rack up quite a bill. Maybe they don't pay as
       | much as I do for outbound traffic on AWS, but still.
        
         | capableweb wrote:
         | > would be perfectly acceptable to charge here since abuse of
         | this can rack up quite a bill
         | 
         | Not according to Wikipedia. Wikipedia would much rather beg
         | people from all corners of the world to donate than restrict
         | access to their data. That's what a good, honest and
         | well-meaning foundation does.
         | 
         | And yes, no sane person shuffling a lot of data around is using
         | AWS because of their awful bandwidth pricing, Wikipedia
         | included.
        
         | morsch wrote:
         | I guess a hint would be fine, but charging for access, even
         | bulk access, feels quite contrary to the spirit of the project.
         | It excludes huge ranges of people who cannot afford it or don't
         | have access to Internet payment methods.
         | 
         | I suspect the traffic caused by this is minuscule compared to
         | the overall traffic, anyway. But that's just a guess.
        
       | libraryofalex wrote:
       | One of the only things you can do to ensure lasting democracy
       | today is to download the pages, with complete history, put it on
       | a usb drive or microsd card properly labelled for you to keep
       | offline, and just forget about it. You can do this as a consumer,
       | it's easy. There's no harm in it, it's not some kind of private
       | data such as personal photos or documents. If you end up
       | forgetting or losing track of it, it really is no big deal. You
       | just decided to download it when you saw it on hacker news back
       | in 2021, right?
       | 
       | My reason for calling this one of the only things you can do to
       | ensure lasting democracy is that it is physically possible that,
       | at some point and through some mechanism, the online version
       | simply stops informing the public on some important public issue,
       | whereas the history as you can download it today still does.
       | Though, I wouldn't speculate about what the mechanism might be or
       | what kinds of subject.
       | 
       | At that point in a physical sense you could consult your offline
       | copy on an airgapped PC or future equivalent and I think it would
       | be impossible for any group of any kind to even know you were
       | doing that let alone stop it.
       | 
       | How you might get the word out is another question but having
       | this personal capability is easy for the people here, as
       | technical users and simple consumers. Indeed the whole entire
       | Internet was set up as a distributed network in case of nuclear
       | attack, so the entire topology of the Internet is set up for you
       | to do this easily today.
       | 
       | It's a click and a cheap flash drive or slightly more expensive
       | microsd card away. You can take this step in less than 20 active
       | minutes of your time and for less than $50 if you go with an
       | external spinning disk drive (such as 1 terabyte) or $200 or so
       | if you go with a microsd card. It doesn't really matter if the
       | file ultimately fails; this is not a critical backup for you to
       | have, just a nice-to-have. You could write the file's checksum
       | onto the drive in marker so you can tell whether it's still
       | correct later (as opposed to having bit errors).
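
For anyone who wants to try the checksum idea, here is a minimal sketch that hashes a large file in chunks so it never has to fit in memory. SHA-256 is an assumption here (the comment does not name an algorithm), and the function name and the example path in the comment are hypothetical.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Hypothetical usage; the path is a placeholder, not a real dump file name:
#   print(sha256_of_file("/mnt/usb/wikipedia-dump.xml.bz2"))
```

Re-running the same function years later and comparing against the digest written on the drive shows whether any bits changed. A mismatch only says that something changed, not where.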
       | 
       | Maybe there is some file type that has a bit of redundancy
       | (checksums) for long-term storage, since with this much data
       | (several hundred gigabytes) I wouldn't be all that surprised if
       | a few bits flipped over the course of several years in cold
       | storage. But I don't know what kind of file type has any sort of
       | redundancy or parity built into it that is supposed to protect
       | against this. (Does anyone know?) Most likely the hash just
       | wouldn't match what you wrote in pen on it but it would still be
       | useable.
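
On the redundancy question: tools such as PAR2 create separate recovery files using Reed-Solomon codes. The sketch below is far simpler and only illustrates the idea with a single XOR parity block, assuming equal-length blocks and that you already know which block is bad (for example from per-block checksums). The function names and sample blocks are made up for illustration.

```python
from functools import reduce
from typing import List, Optional


def xor_blocks(blocks: List[bytes]) -> bytes:
    """XOR equal-length blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def recover(blocks: List[Optional[bytes]], parity: bytes) -> bytes:
    """Rebuild the single missing block (marked None) from the others plus parity."""
    present = [b for b in blocks if b is not None]
    return xor_blocks(present + [parity])


# Hypothetical 4-byte blocks standing in for chunks of a dump file.
data = [b"wiki", b"dump", b"2021"]
parity = xor_blocks(data)

# Pretend the middle block was lost or failed its checksum.
damaged = [data[0], None, data[2]]
print(recover(damaged, parity))  # b'dump'
```

One parity block can repair only one lost block per group, which is why real recovery formats use stronger codes.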
       | 
       | Regarding choice of spinning disk or microsd card: I guess it's
       | in the realm of what's possible in a physical sense that at some
       | point people would have their personal property rummaged through
       | by some group and a hard drive is pretty obvious and could be
       | stolen or removed for that reason. (In a physical sense, not
       | speculating about social or political developments that might
       | lead to that.)
       | 
       | So for this reason perhaps best would be to put it on a microsd
       | card even though it is quite a bit more expensive. I guess that,
       | once written, microsd cards decay from bit rot within a few
       | years if not used at all.[1] I don't know about spinning media,
       | but I guess it lasts about 5-10 years at least.[2]
       | 
       | You could put the microsd card under a postage stamp for example
       | and put an important unrelated document into the envelope, which
       | you would expect to keep for many years. Of course you could
       | always end up accidentally discarding your envelope (while
       | retaining its contents) but that risk shouldn't matter too much.
       | In a physical sense it is possible for groups to x-ray all
       | paperwork (such as envelopes as I just suggested) and a microsd
       | card's electrical contacts are pretty obvious in an x-ray. (It
       | looks like this [3]). I don't have any suggestion that works
       | against this attack, which is within the realm of what's possible
       | according to the laws of physics.
       | 
       | I'm not speculating on what social or political developments
       | might possibly make anything like this necessary at some point in
       | the future, but we still live in a world governed by the laws of
       | physics so as technical professionals you have a huge leg up on
       | most of the world. Spending $50 doing this today might save
       | democracy tomorrow. You could also leave it as a time capsule
       | however the storage longevity is not that long (between 5 and 20
       | years I guess), and in a physical sense, a time capsule is not
       | particularly secure and would require instructions for someone
       | else to figure out so it's not great in that sense.
       | 
       | So in terms of what you can do today, I would suggest just
       | getting an external 1 terabyte usb drive ($50), downloading the
       | dump together with history (20 active minutes), writing the
       | checksum onto it in marker and just putting it somewhere.
       | Obviously this small $50 investment is one you would hope never
       | to have to use, but who knows, you might go down in history as
       | the one who saved some small part of the world. Though,
       | obviously, not in Wikipedia history.
       | 
       | [1] https://www.quora.com/What-is-the-longevity-of-a-sd-memory-c...
       | 
       | [2] https://serverfault.com/questions/986911/how-long-will-unuse...
       | 
       | [3] https://www.reddit.com/r/pics/comments/3b6bjw/i_xrayed_an_sd...
        
       | nayuki wrote:
       | Back in 2014 I computed the PageRanks within English Wikipedia,
       | thanks to their database dump.
       | https://www.nayuki.io/page/computing-wikipedias-internal-pag...
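
For readers unfamiliar with PageRank, here is a generic power-iteration sketch on a toy link graph. It is not taken from the linked write-up and is not necessarily how that computation was done; the damping factor 0.85, the iteration count, and the four-page graph are all assumptions for illustration.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Approximate PageRank by repeatedly redistributing rank along links."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Pages with no outgoing links spread their rank evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: page -> pages it links to.
example_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
for page, score in sorted(pagerank(example_links).items(), key=lambda kv: -kv[1]):
    print(page, round(score, 4))
```

On the real dump the graph has millions of pages, so a practical version would use arrays rather than dictionaries, but the update rule is the same.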
        
         | crazygringo wrote:
         | That's intriguing.
         | 
         | Curious if you ever compared how PageRanks correlate to
         | traffic? (They make their per-page traffic available too.)
         | 
         | It would be interesting to see the largest disparities --
         | super-popular pages in visits but which don't have nearly as
         | many internal Wikipedia links to them, versus unpopular pages
         | but that have tons of internal Wikipedia links to them.
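
The comparison described above could be sketched roughly as follows: rank every page by PageRank and by views, then sort by the gap between the two positions. The page names and numbers are invented for illustration, and the sketch assumes both dictionaries cover the same pages.

```python
from typing import Dict, List, Tuple


def positions(scores: Dict[str, float]) -> Dict[str, int]:
    """Map each page to its 1-based position when sorted by descending score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {page: i + 1 for i, page in enumerate(ordered)}


def disparities(pagerank: Dict[str, float],
                views: Dict[str, float]) -> List[Tuple[int, str]]:
    """Return (PageRank position minus view position, page), largest gap first.

    A large positive gap means the page is viewed far more than its internal
    links would suggest; a large negative gap means the opposite.
    """
    pr_pos, view_pos = positions(pagerank), positions(views)
    return sorted(((pr_pos[p] - view_pos[p], p) for p in pr_pos), reverse=True)


# Hypothetical data, not real Wikipedia figures.
toy_pagerank = {"Main": 0.5, "Physics": 0.3, "Taxon": 0.15, "Meme": 0.05}
toy_views = {"Main": 1000, "Meme": 900, "Physics": 200, "Taxon": 5}
print(disparities(toy_pagerank, toy_views))
```

Here the heavily viewed but weakly linked page comes out on top, which is the kind of disparity the comment asks about.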
        
         | tomaszs wrote:
         | What page had the highest PageRank?
        
           | vinger wrote:
           | The homepage.
        
             | tomaszs wrote:
             | Cool. Was there any surprising result in top highest ranks?
        
             | codezero wrote:
             | Would be interesting to see the results if you compare
             | that rank to the most viewed page (maybe that is the
             | homepage though).
        
       ___________________________________________________________________
       (page generated 2021-03-06 23:00 UTC)