[HN Gopher] Download the Entire Wikimedia Database
___________________________________________________________________
Download the Entire Wikimedia Database
Author : surround
Score  : 49 points
Date   : 2021-03-06 20:46 UTC (2 hours ago)

(HTM) web link (dumps.wikimedia.org)
(TXT) w3m dump (dumps.wikimedia.org)

| orblivion wrote:
| You can also get it in a user-friendly format with the
| application Kiwix (https://www.kiwix.org/) if that's your use
| case. PC, phone, or server. You get subsets of the data, and
| images are smaller to save space.
| MeinBlutIstBlau wrote:
| Kiwix use is still somewhat hit or miss when browsing. I'm not
| sure how it handles text parsing, but it either takes forever or
| doesn't return results, making it sort of unusable.
|
| But it's still a fantastic and incredible piece of software.
| When it gets to the point where I can portably keep the full
| 60 GB ZIM file seamlessly, it will change simple computing for
| low-bandwidth areas. Imagine the uses as a portable and
| versatile database that could accept JSON, HTML/CSS, and data to
| make your own offline encyclopedias!
| kregasaurusrex wrote:
| It's useful to browse on a phone if you have limited mobile
| data, and the text-only English Wikipedia fits onto a modern
| micro SD card.
| karlicoss wrote:
| Wouldn't it be cool if Steam supported distributing an offline
| Wikipedia database? It's just a few gigs (depending on
| languages/images/etc, but it fits the DLC model perfectly), and
| it already uses bittorrent.
| dwheeler wrote:
| I'm so glad the download-entire-wikipedia function continues to
| exist. That will help counter the "lost the entire library
| problem" from the city of Alexandria. To be fair, Wikipedia only
| has summaries, not the detailed material, but it's still
| important.
| tablespoon wrote:
| > I'm so glad the download-entire-wikipedia function continues
| to exist. That will help counter the "lost the entire library
| problem" from the city of Alexandria.
| To be fair, Wikipedia only has summaries, not the detailed
| material, but it's still important.
|
| Personally, I think Wikipedia's quality is too poor for that.
| Plus, it's digital, so when our civilization is at risk of
| "[losing] the entire library" it probably would have already
| lost the ability to maintain the computer systems to access
| Wikipedia dumps.
| smoldesu wrote:
| The content on Wikipedia is really not that bad. Obviously a
| Wikipedia article will never be the final say on any specific
| subject, but it tends to do a pretty good job of aggregating
| sources and condensing them into a reader-friendly synopsis.
| This data is super valuable, if only for the sources alone.
| buzzerbetrayed wrote:
| > so when our civilization is at risk of "[losing] the entire
| library" it probably would have already lost the ability to
| maintain the computer systems to access Wikipedia dumps
|
| But as long as it continues to exist, some future
| civilization could figure out how to read the data again,
| eventually. Just like we eventually discovered how to read
| ancient languages that were once forgotten.
| tablespoon wrote:
| > But as long as it continues to exist, some future
| civilization could figure out how to read the data again,
| eventually. Just like we eventually discovered how to read
| ancient languages that were once forgotten.
|
| Eh, I think you're vastly underestimating how difficult
| that would be.
|
| 1. The media would have to last hundreds of years at least,
| when it's _hoped_ modern archival media can last _maybe_
| fifty.
|
| 2. Even assuming the media did last, the new civilization
| would have to reverse engineer encoding on top of encoding
| on top of encoding (e.g. disk encoding, complex
| filesystems, file formats, character encodings). Our
| civilization _already_ has trouble reading some old file
| formats.
|
| It took the Rosetta Stone to figure out how to read the
| encoding of Egyptian hieroglyphics, when that language was
| still alive in the form of Coptic.
|
| 3. Then you're dealing with the probability that the hard
| disks the future archeologists find will even have a
| Wikipedia dump on them. That probability will be very
| small, given very few people will download these dumps.
| LeoPanthera wrote:
| Kiwix offers downloadable material (including the _full_
| Wikipedia), in a format specifically designed for offline
| browsing.
|
| https://wiki.kiwix.org/wiki/Content
|
| Their Wikipedia bundle wasn't being updated for a while and had
| fallen out of date, but that seems to have been fixed now.
| qwertywert_ wrote:
| Can it also download sources, if available? That would be cool.
| porphyra wrote:
| It is pretty awesome that there are people like /r/datahoarder
| who are obsessed with backing up the collective knowledge of
| humanity.
| aarchi wrote:
| There's also Archive Team, focused on preserving at-risk
| sites before they are taken offline.
|
| https://wiki.archiveteam.org/
| capableweb wrote:
| I'm not familiar with r/datahoarder, but if the name bears
| any significance, it seems they are mostly centered on
| hoarding data, which means just digital I guess? If so, I
| would much rather promote efforts like Internet Archive
| that back up all kinds of things, not just digital data.
| [deleted]
| KMnO4 wrote:
| What does the Internet Archive back up that isn't
| represented by 1s and 0s?
| LeoPanthera wrote:
| The Physical Archive.
| https://en.wikipedia.org/wiki/Internet_Archive#Physical_medi...
| cguess wrote:
| A ton of 35mm and 16mm film reels, vinyl, and even wax
| recordings, physical books, and a lot more. They make
| digital copies of them, but they archive the physical
| versions as well.
| Here's a selection of the movies:
| https://archive.org/details/moviesandfilms?tab=about
| bawolff wrote:
| Well, not the entire DB, just the public parts. User passwords
| are not included ;)
| dudus wrote:
| Wikipedia is always bugging me about donations, and yet here is
| a feature they could charge for, or at least use to hint at
| donating. It would be perfectly acceptable to charge here since
| abuse of this can rack up quite a bill. Maybe they don't pay as
| much as I do for outbound traffic on AWS, but still.
| capableweb wrote:
| > would be perfectly acceptable to charge here since abuse of
| this can rack up quite a bill
|
| Not according to Wikipedia. Wikipedia would much rather beg
| people from all corners of the world to donate than restrict
| access to their data. That's what a good, honest and
| well-meaning foundation does.
|
| And yes, no sane person shuffling a lot of data around uses
| AWS, because of its awful bandwidth pricing, Wikipedia
| included.
| morsch wrote:
| I guess a hint would be fine, but charging for access, even
| bulk access, feels quite contrary to the spirit of the project.
| It excludes huge ranges of people who cannot afford it or don't
| have access to Internet payment methods.
|
| I suspect the traffic caused by this is minuscule compared to
| the overall traffic, anyway. But that's just a guess.
| libraryofalex wrote:
| One of the only things you can do to ensure lasting democracy
| today is to download the pages, with complete history, put them
| on a USB drive or microSD card, properly labelled, for you to
| keep offline, and just forget about it. You can do this as a
| consumer, it's easy. There's no harm in it, it's not some kind of
| private data such as personal photos or documents. If you end up
| forgetting or losing track of it, it really is no big deal. You
| just decided to download it when you saw it on Hacker News back
| in 2021, right?
|
| My reason for saying this is one of the only things you can do to
| ensure lasting democracy is that it is in the realm of what is
| possible in a physical sense that, at some point, through some
| mechanism, the online version simply does not inform the public on
| some important public issue, whereas the history as you can
| download it today does. Though, I wouldn't speculate about what
| the mechanism might be or what kind of subject.
|
| At that point, in a physical sense, you could consult your offline
| copy on an airgapped PC or future equivalent, and I think it would
| be impossible for any group of any kind to even know you were
| doing that, let alone stop it.
|
| How you might get the word out is another question, but having
| this personal capability is easy for the people here, as
| technical users and simple consumers. Indeed, the entire
| Internet was set up as a distributed network in case of nuclear
| attack, so the entire topology of the Internet is set up for you
| to do this easily today.
|
| It's a click and a cheap flash drive or slightly more expensive
| microSD card away. You can take this step in less than 20 active
| minutes of your time and for less than $50 if you go with an
| external spinning disk drive (such as 1 terabyte) or $200 or so
| if you go with a microSD card. It doesn't really matter if the
| file ultimately fails; this is not a critical backup for you to
| have, just a nice-to-have. You could write the file's checksum
| onto the drive in marker so you can tell whether it's still
| correct later (as opposed to having bit errors).
|
| Maybe there is some file type that has a bit of redundancy
| (checksums) for long-term storage, since, given the large amount
| of data (several hundred gigabytes), I wouldn't be all that
| surprised if a few bits flipped over the course of several years
| in cold storage.
| But I don't know what kind of file type has any sort of
| redundancy or parity built into it that is supposed to protect
| against this. (Does anyone know?) Most likely the hash just
| wouldn't match what you wrote in pen on it, but it would still be
| usable.
|
| Regarding choice of spinning disk or microSD card: I guess it's
| in the realm of what's possible in a physical sense that at some
| point people would have their personal property rummaged through
| by some group, and a hard drive is pretty obvious and could be
| stolen or removed for that reason. (In a physical sense, not
| speculating about social or political developments that might
| lead to that.)
|
| So for this reason perhaps best would be to put it on a microSD
| card even though it is quite a bit more expensive. I guess that
| bit rot causes microSD cards, once written, to decay within a few
| years if not used at all.[1] I don't know for spinning media, but
| I guess it's also about 5-10 years at least.[2]
|
| You could put the microSD card under a postage stamp, for example,
| and put an important unrelated document into the envelope, which
| you would expect to keep for many years. Of course you could
| always end up accidentally discarding your envelope (while
| retaining its contents), but that risk shouldn't matter too much.
| In a physical sense it is possible for groups to x-ray all
| paperwork (such as envelopes as I just suggested) and a microSD
| card's electrical contacts are pretty obvious in an x-ray. (It
| looks like this [3]). I don't have any suggestion that works
| against this attack, which is within the realm of what's possible
| according to the laws of physics.
|
| I'm not speculating on what social or political developments
| might possibly make anything like this necessary at some point in
| the future, but we still live in a world governed by the laws of
| physics, so as technical professionals you have a huge leg up on
| most of the world.
| Spending $50 doing this today might save democracy tomorrow.
| You could also leave it as a time capsule; however, the storage
| longevity is not that long (between 5 and 20 years I guess), and
| in a physical sense, a time capsule is not particularly secure
| and would require instructions for someone else to figure out,
| so it's not great in that sense.
|
| So in terms of what you can do today, I would suggest just
| getting an external 1 terabyte USB drive ($50), downloading the
| dump together with history (20 active minutes), writing the
| checksum onto it in marker and just putting it somewhere.
| Obviously this small $50 investment is one you would hope never
| to have to use, but who knows, you might go down in history as
| the one who saved some small part of the world. Though,
| obviously, not in Wikipedia history.
|
| [1] https://www.quora.com/What-is-the-longevity-of-a-sd-memory-c...
|
| [2] https://serverfault.com/questions/986911/how-long-will-unuse...
|
| [3] https://www.reddit.com/r/pics/comments/3b6bjw/i_xrayed_an_sd...
| nayuki wrote:
| Back in 2014 I computed the PageRanks within English Wikipedia,
| thanks to their database dump.
| https://www.nayuki.io/page/computing-wikipedias-internal-pag...
| crazygringo wrote:
| That's intriguing.
|
| Curious if you ever compared how PageRanks correlate to
| traffic? (They make their per-page traffic available too.)
|
| It would be interesting to see the largest disparities --
| super-popular pages in visits but which don't have nearly as
| many internal Wikipedia links to them, versus unpopular pages
| that have tons of internal Wikipedia links to them.
| tomaszs wrote:
| What page had the highest PageRank?
| vinger wrote:
| The homepage.
| tomaszs wrote:
| Cool. Were there any surprising results among the top ranks?
| codezero wrote:
| Would be interesting to see the results if you have that
| rank to the most viewed page (maybe that is the homepage
| though)
___________________________________________________________________
(page generated 2021-03-06 23:00 UTC)