[HN Gopher] Why I link to Wayback Machine instead of original we... ___________________________________________________________________ Why I link to Wayback Machine instead of original web content Author : puggo Score : 496 points Date : 2020-09-08 08:03 UTC (14 hours ago) (HTM) web link (hawaiigentech.com) (TXT) w3m dump (hawaiigentech.com) | NateEag wrote: | I understand where the author is coming from, but I think the | best approach is to write your content with direct links to the | canonical versions of articles. | | Have a link checking process you run regularly against your site, | using some of the standard tools I've mentioned elsewhere in this | thread: | | https://www.npmjs.com/package/broken-link-checker-local | | https://linkchecker.github.io/linkchecker/ | | When you run the link check (which should be regularly, perhaps | at least weekly), also run a process that harvests the non-local | links from your site and 1) adds any new links' content to your | own local, unpublished archive of external content, and 2) | submits those new links to archive.org. | | This keeps canonical URLs canonical, makes sure content you've | linked to is backed up on archive.org so a reasonably trustworthy | source is available should the canonical one die out, and gives | you your own backup in case archive.org and the original both | vanish. | | I don't currently do this with my own sites, but now I'm | questioning why not. I already have the regular link checks, and | the second half seems pretty straightforward to add (for static | sites, anyway). | axelfreeman wrote: | You could link to the original web url and also do a print | version of the web content as PDF. That's how i archive howtos | and write-ups of interesting content. Print view and create a PDF | version. | drummer wrote: | For anything important you can't beat a good save to pdf feature | in the browser. You can then upload the pdf and link to that | instead. 
Someone should make a wordpress plugin to do this | automatically. | Cthulhu_ wrote: | If it's to actually reference a third party source, it's probably | better to make a self-hosted copy of the page. You can print it | to a PDF file for example. I don't believe archive.org is | eternal, or that its pages will remain the same. | LostJourneyman wrote: | There's some subtle irony in that the linked site is not in fact | a WayBackMachine link, but instead a direct link to the site. | Andrew_nenakhov wrote: | Hmm. is there a place for a service that makes a permanent copy | of content, available at the original url at the time of posting? | luord wrote: | While I generally disagree because I'd rather my site was the one | getting the hits--and I would rather give the same courtesy to | other authors--this does give me the idea of checking (or | creating if none exists) an archive link of whatever I reference, | and include that archive link in the metadata of every link I | include. | | Users will find the archive link if they really want to, and it | will make it easier for me to replace broken links in the future. | hgo wrote: | Maybe the solution isn't technical and we should look at other | fields that have relied on referencing credible sources for a | long time? I can think of research, news and perhaps law. | bartread wrote: | I'm not sure I'm a fan of this because it just turns | WayBackMachine into another content silo. It's called the world | wide web for a reason, and this isn't helping. | | I can see it for corporate sites where they change content, | remove pages, and break links without a moment's consideration. | | But for my personal site, for example, I'd much rather you link | to me directly rather than content in WayBackMachine. Apart from | anything else linking to WayBackMachine only drives traffic to | WayBackMachine, not my site. 
Similarly, when I link to other | content, I want to show its creators the same courtesy by linking | directly to their content rather than WayBackMachine. | | What I can see, and I don't know if it exists yet (a quick search | suggests perhaps not), is some build task that will check all | links and replace those that are broken with links to | WayBackMachine, or (perhaps better) generate a report of broken | links and allow me to update them manually just in case a site or | two happen to be down when my build runs. | | I think it would probably need to treat redirects like broken | links given the prevalence of corporate sites where content is | simply removed and redirected to the homepage, or geo-locked and | redirected to the homepage in other locales (I'm looking at you | and your international warranty, and access to tutorials, Fender. | Grr.). | | I also probably wouldn't run it on every build because it would | take a while, but once a week or once a month would probably do | it. | akavel wrote: | Or: snapshot a WARC archive of the site locally, then start | serving it only in case the original goes down. For extra | street cred, seed it to IPFS. (A.k.a. one of too many projects | on my To Build One Day list.) | nikisweeting wrote: | ArchiveBox is built for exactly this use-case :) | | https://github.com/pirate/ArchiveBox | silicon2401 wrote: | > But for my personal site, for example, I'd much rather you | link to me directly rather than content in WayBackMachine. | | That would make sense if users were archiving your site for | your benefit, but they're probably not. If I were to archive | your site, it's because I want my own bookmarks/backups/etc to | be more reliable than just a link, not because I'm looking out | to preserve your website. Otherwise, I'm just gambling that you | won't one day change your content, design, etc on a whim. | | Hence I'm in a similar boat as the blog author. 
If there's a | webpage I really like, I download and archive it myself. If | it's not worth going through that process, I use the wayback | machine. If it's not worth that, then I just keep a bookmark. | 3pt14159 wrote: | The issue is that if this becomes widespread then we're going | to get into copyright claims against the wayback machine. | When I write content it is mine. I don't even let Facebook | crawlers index it because I don't want it appearing on their | platform. I'm happy to have the wayback machine archive it, but | that's with the understanding that it is a backup, not an | authoritative or primary source. | | Ideally, links would be able to handle 404s and fall back, | like we can do with images and srcset in HTML. That way if my | content goes away we have a backup. I can still write updates | to a blog piece or add translations that people send in, and | everyone benefits from the dynamic nature of content, while | still being able to either fall back or verify content at the | time it was published via the wayback machine. | alisonkisk wrote: | Perhaps the wayback machine can help fix that by telling | users to visit the authoritative site and demanding a | confirmation clickthrough before showing the archived | content. | bartread wrote: | > Perhaps the wayback machine can help fix that by | telling users to visit the authoritative site and | demanding a confirmation clickthrough before showing the | archived content. | | I'm trying to figure out if you're being ironic or | serious. | | People on here (rightly) spend a lot of time complaining | about how user experience on the web is becoming terrible | due to ads, pop-ups, pop-unders, endless cookie banners, | consent forms, and miscellaneous GDPR nonsense, all of | which get in the way of whatever it is you're trying to | read or watch, and all of it on top of the more run-of-the-mill UX snafus with which people casually litter | their sites.
| | Your idea boils down to adding another layer of consent | clicking to the mess, to implement a semi-manual redirect | through the WayBackMachine for every link clicked. That's | ridiculous. | | I have to believe you're being ironic, because nobody | could seriously think this is a good idea. | CarCooler wrote: | Agreed; cut the clutter, just like it's kept simple on the HN | website. | notriddle wrote: | There already have been copyright claims against The | Wayback Machine. They've been responding by allowing | site owners to use robots.txt to remove their content. | headmelted wrote: | But it's also not guaranteed to be consistent. What if you | don't delete the content but just change it? (I.e. what if | your opinions change or you're pressured to edit | information by a third party?) | 3pt14159 wrote: | I addressed this. | | > I can still write updates to a blog piece or add | translations that people send in, and everyone benefits | from the dynamic nature of content, while still being | able to either fall back or verify content at the time it | was published via the wayback machine. | | Updates are usually good. Sometimes you need to verify | what was said, though, and for that the wayback machine works. | I agree it would be nice if there was a technical way to | support both, but for the average web request it's better | to link to the source. | [deleted] | ethagnawl wrote: | > If it's not worth that, then I just keep a bookmark. | | I've made a habit of saving every page I bookmark to the | WayBackMachine. To my mind, this is the best of both worlds: | you'll see any edits, additions, etc. to the source material, | and if something you remember has been changed or gone | missing, you have a static reference. I just wish there was | a simple way to diff the two. | | I keep meaning to write browser extensions to do both of | these things on my behalf ... | PaulHoule wrote: | It's a deep problem with the web as we know it.
| | If I want to make a "scrapbook" to support a research project | of some kind. Really I want to make a "pyramid" with a | general overview that is at most a few pages at the top, then | some documents that are more detailed, but with the original | reference material incorporated and linked to what it | supports. | | In 2020 much of that reference material will come from the | web and you are left with doing the "webby" thing (linking) | which is doomed to fall victim to broken links or with | archiving the content which is OK for personal use, but will | not be OK with the content owners if you make it public. You | could say the public web is also becoming a cess pool/crime | scene, where even reputable web sites are suspected of | pervasive click fraud, where the line between marketing and | harassment gets harder to see every day. | alisonkisk wrote: | Is it a deep problem? You can download content you want to | keep. There are many services like evernote and pocket that | can help you with it. | TeMPOraL wrote: | It is, because it ultimately comes down to owner's | control of how their content is being used. | | For example, a modern news site will want the ability to | define which text is "authoritative", and make | modifications to it on the fly, including unpublishing | it. As a reader OTOH, I want a permanent, immutable copy | of everything said site ever publishes, so that silent | edits and unpublishing is not possible. These two | perspectives are in conflict, and that conflict repeats | itself throughout the entire web. | PaulHoule wrote: | Some consumers will want the latest and greatest content. | To please everyone (other than the owner) you'd need to | look at the content across time, versions, alternate | world views,... Thus "deep". | | My central use case is that I might 'scrape' content from | sources such as | | https://en.wikipedia.org/wiki/List_of_U.S._states_and_ter | rit... 
| | and have the process be "repeatable" in the sense that: | | 1. The system archives the original inputs and the | process to create refined data outputs | | 2. If the inputs change the system should normally be | able to download updated versions of the inputs, apply | the process and produce good outputs | | 3. If something goes wrong there are sufficient | diagnostics and tests that would show invariants are | broken, or that the system can't tell how many fingers | you are holding up | | 4. and in that case you can revert to "known good" inputs | | I am thinking of data products here, but even if the | 'product' is a paper, presentation, or report that | involves human judgements there should be a structured | process to propagate changes. | [deleted] | [deleted] | ogre_codes wrote: | I can understand posting a link, plus an archival link just | in case the original content is lost. But linking to an | archival site only is IMO somewhat rude. | mcv wrote: | Would be nice if there's an automatic way to have a link revert | to the Wayback Machine once the original link stops working. I | can't think of an easy way to do that, though. | boogies wrote: | I just use a bookmarklet javascript:void(wi | ndow.open('https://web.archive.org/web/*/'+location.href.repl | ace(/\/$/,%20''))); | | (which is only slightly less convenient than what others have | already pointed out -- the FF extension and Brave built-in | feature). | kevincox wrote: | Another nice solution is to create a "search engine" for | https://web.archive.org/web/*/%s you can then just add the | keyword before the URL (For example I type `<Ctrl-l><Left>w | <Enter>`). Search engines like this are supported by chrome | and firefox. | boogies wrote: | I would love for there to be a site that redirected eg. 
| better.site/ https://www.youtube.com/watch?v=jzwMjOl8Iyo | to https://invidious.site/watch?v=jzwMjOl8Iyo so I could | easily open YouTube links with Invidious, and the same | for Twitter-Nitter, Instagram-bibliogram, Google Maps - | OSM, etc without having to manually remove the beginning | of the URL. I'd presume someone on HN has the skill to do | this similarly to | https://news.ycombinator.com/item?id=24344127 | kevincox wrote: | You can make a "search engine" or bookmarklet that is a | javascript/data URL that does whatever URL mangling you | need. (Other than some minor escaping issues). | | Something like the following should work. You can add | more logic to supoort all of the sites with the same | script or make one per site. | | javascript:document.location="%s".replace(/^https:\/\/www | .youtube.com/, "https://invidious.site") | riffraff wrote: | wikipedia just does "$some-link-here (Archived $archived- | version-link)", and it works pretty well, imo. | II2II wrote: | Agreed, and it shouldn't be too much of a burden to use | since the author was quite clear about it being for | reference materials. The idea isn't all that different from | referring to specific print editions. | notagoodidea wrote: | For me that is the real solution when you know that the | _archived-link_ is the one consulted by the author | /whatever and the normal one being the content (or its | evolution). | jazzyjackson wrote: | Brave browser has this built in, if you end up at a dead link | the address bar offers to take you to wayback machine. | | http://blog.archive.org/2020/02/25/brave-browser-and-the- | way... | bad_user wrote: | This was first implemented in Firefox, as an experiment, | and is now an extension: | | https://addons.mozilla.org/ro/firefox/addon/wayback- | machine_... | liability wrote: | I used this extension for a while but had to stop due to | frequent false positives. 
YMMV | CompuHacker wrote: | There exists a manual extension called Resurrect Pages | for Firefox 57+, with Google Cache, archive.is, Wayback | Machine, and WebCite. | MaxBarraclough wrote: | Either a browser extension, or an 'active' system where your | site checks the health of the pages it links to. | iggldiggl wrote: | > browser extension | | E.g. https://addons.mozilla.org/firefox/addon/wayback- | machine_new... | DavideNL wrote: | Their browser extension does exactly that... | polygot wrote: | I made a browser extension which replaces links in articles and | stackoverflow answers with archive.org links on the date of | their publication (and date of answers for stackoverflow | questions): | https://github.com/alexyorke/archiveorg_link_restorer | FinnLeSueur wrote: | > generate a report of broken links | | I actually made a little script that does just this. It's | pretty dinky but works a charm on a couple of sites I run. | | https://github.com/finnito/link-checker | scruffyherder wrote: | I spent hours getting all the stupid redirects working from | different hosts, domains and platforms. | | People still use RSS either to steal my stuff, or to discuss it | off-site (as if commenting to the author is so scary!), or in a | way that leaves me totally unaware it is happening; so many times | people ask questions of the author on a site like this, or bring | up good points or something worth going further on, that I would | otherwise miss. | | It's a shame pingbacks were hijacked, but the siloing sucks | too. | | Sometimes I forget for months at a time to check other sites; | not every post generates 5000+ hits in an hour.
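Earlier in the thread, boogies asks for a better.site-style redirector that swaps a link's host for an alternative front end (Invidious, Nitter, etc.). The core of that is a small host-rewrite table; a minimal sketch, where the replacement hostnames (invidious.site, nitter.net, bibliogram.art) are illustrative placeholders, since real instances come and go:

```python
from urllib.parse import urlsplit, urlunsplit

# Host rewrites; the target instances are examples, not endorsements.
REWRITES = {
    "www.youtube.com": "invidious.site",
    "youtube.com": "invidious.site",
    "twitter.com": "nitter.net",
    "www.instagram.com": "bibliogram.art",
}

def rewrite(url):
    """Swap the hostname of a known site for its alternative
    front end, keeping scheme, path, query, and fragment intact."""
    parts = urlsplit(url)
    host = REWRITES.get(parts.netloc)
    if host is None:
        return url  # not a site we rewrite
    return urlunsplit((parts.scheme, host, parts.path,
                       parts.query, parts.fragment))
```

The same table-driven approach works as a browser "search engine" keyword, a bookmarklet, or a tiny redirect service.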
| abdullahkhalids wrote: | Gwern.net has a pretty sophisticated system for this | https://www.gwern.net/Archiving-URLs | jrochkind1 wrote: | The International Internet Preservation Consortium is | attempting a technological solution that gives you the best of | both worlds in a flexible way, and is meant to be extended to | support multiple archival preservation content providers. | | https://robustlinks.mementoweb.org/about/ | | (although nothing else like the IA Wayback machine exists | presently, and I'm not sure what would make someone else try to | 'compete' when IA is doing so well, which is a problem, but | refusing to use the IA doesn't solve it!) | NateEag wrote: | I use linkchecker for this on my personal sites: | | https://linkchecker.github.io/linkchecker/ | | There's a similar NodeJS program called blcl (broken-link- | checker-local) which has the handy attribute that it works on | local directories, making it particularly easy to use with | static websites before deploying them. | | https://www.npmjs.com/package/broken-link-checker-local | privong wrote: | > There's a similar NodeJS program called blcl (broken-link- | checker-local) which has the handy attribute that it works on | local directories | | linkchecker can do this as well, if you provide it a | directory path instead of a url. | NateEag wrote: | Ah, thanks! I was not aware of that feature. | DeusExMachina wrote: | > generate a report of broken links and allow me to update them | manually just in case a site or two happen to be down when my | build runs. | | SEO tools like Ahrefs do this already. Although, the price | might be a bit too steep if you only want that functionality. | But there are probably cheaper alternatives as well. 
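The build task bartread describes above (check every outbound link, treat redirects as broken, and swap dead ones for Wayback links) could be sketched roughly as follows. The `2020` default timestamp and the 200-only policy are assumptions drawn from the discussion, not settled behavior, and the regex rewrite is deliberately crude; a real tool would parse the HTML:

```python
import re
import urllib.error
import urllib.request

WAYBACK_PREFIX = "https://web.archive.org/web/"

def wayback_url(url, timestamp="2020"):
    """Build a Wayback Machine URL; the timestamp selects the
    snapshot closest to that date."""
    return f"{WAYBACK_PREFIX}{timestamp}/{url}"

def is_broken(status):
    """Per the suggestion above, treat redirects as broken too,
    since dead corporate pages often redirect to the homepage."""
    return status != 200

def check_link(url):
    """Return the HTTP status for url without following redirects
    (0 if the host is unreachable)."""
    class NoRedirect(urllib.request.HTTPRedirectHandler):
        def redirect_request(self, *args, **kwargs):
            return None  # turn redirects into HTTPError
    opener = urllib.request.build_opener(NoRedirect)
    try:
        return opener.open(url, timeout=10).status
    except urllib.error.HTTPError as e:
        return e.code
    except urllib.error.URLError:
        return 0

def rewrite_broken(html, statuses):
    """Swap every broken href for its Wayback equivalent."""
    def repl(m):
        url = m.group(1)
        if is_broken(statuses.get(url, 200)):
            return f'href="{wayback_url(url)}"'
        return m.group(0)
    return re.sub(r'href="(https?://[^"]+)"', repl, html)
```

Run weekly or monthly, as suggested above, either rewriting in place or just emitting a report for manual review.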
| codethief wrote: | > What I can see, and I don't know if it exists yet (a quick | search suggests perhaps not), is some build task that will | check all links and replace those that are broken with links to | WayBackMachine | | Addendum: First, that same tool should - at the time of | creating your web site / blog post / ... - ask the WayBackMachine | to capture those links in the first place. That would actually | be a very neat feature, as it would guarantee that you could | always roll back the linked websites to exactly the time you | linked to them on your page. | ethagnawl wrote: | Doesn't Wikipedia do something like this? If not, the | WBM/Archive.org does something like it on Wikipedia's behalf. | thotsBgone wrote: | I don't care enough to look into it, but I think Gwern has | something like this set up on gwern.net. | zwayhowder wrote: | Not to mention that while I might go to an article written ten | years ago, the Wayback archive won't show me a related article | that you published two years ago, updating the information or | correcting a mistake. | deepstack wrote: | Yeah, at some point the Wayback Machine needs to be on a | WebTorrent/IPFS type of thing, where it is immutable. | toomuchtodo wrote: | https://blog.archive.org/2018/07/21/decentralized-web-faq/ | scruffyherder wrote: | I was surprised when digital.com got purged. | | Then further dismayed that the utzoo Usenet archives were | purged. | | Archive sites are still subject to being censored and | deleted. | alfonsodev wrote: | Is there any active project pursuing this idea? | Taek wrote: | https://github.com/exp0nge/wayback | | Here's an extension to archive pages on Skynet, which is | similar to IPFS but uses financial compensation to ensure | availability and reliability. | | I don't know if the author intends to continue developing | this idea or if it was a one-off for a hackathon. | Sargos wrote: | FileCoin is the incentivization layer for IPFS, both | built by Protocol Labs.
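codethief's addendum above — capture every outbound link at the moment of publishing — can be sketched against the Wayback Machine's Save Page Now endpoint, which accepts a plain GET of `https://web.archive.org/save/<url>`. The link-harvesting half is ordinary HTML parsing; the `link-archiver-sketch` User-Agent is a made-up placeholder:

```python
import urllib.request
from html.parser import HTMLParser

class LinkHarvester(HTMLParser):
    """Collect absolute outbound hrefs from a page's HTML."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                # skip relative/local links; only archive outbound URLs
                if name == "href" and value and value.startswith("http"):
                    self.links.append(value)

def harvest_links(html):
    parser = LinkHarvester()
    parser.feed(html)
    return parser.links

def save_to_wayback(url):
    """Ask Save Page Now to capture url; a plain GET of
    /save/<url> triggers a crawl."""
    req = urllib.request.Request(
        "https://web.archive.org/save/" + url,
        headers={"User-Agent": "link-archiver-sketch/0.1"})
    urllib.request.urlopen(req, timeout=60)
```

Running this once at publish time, and recording the snapshot timestamp next to each link, is what makes the roll-back codethief describes possible later.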
| nikisweeting wrote: | The largest active project doing this (to my knowledge) is | the Inter-Planetary Wayback Machine: | | https://github.com/oduwsdl/ipwb | | There have been many other attempts though, including | internetarchive.bak on IPFS, which ended up failing because | it was too much data. | | http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/i... | | http://brewster.kahle.org/2015/08/11/locking-the-web-open-a-... | deepstack wrote: | I'm hoping someone here on Hacker News will pick it up and | apply for the next round at ycombinator. A non-profit would | be better than for-profit in this case. Blockchain-ish | tech would be perfect for this. If in a few years no | one does, then I'll do it. | 1vuio0pswjnm7 wrote: | What if your personal site is, like so many others these days, | on shared IP hosting like Cloudflare, AWS, Fastly, Azure, etc.? | | In the case of Cloudflare, for example, we as users are not | reaching the target site, we are just accessing a CDN. The nice | thing about archive.org is that it does not require SNI. | (Cloudflare's TLS 1.3 and ESNI work quite well AFAICT, but they | are the only CDN who has it working.) | | I think there should be more archive.org's. We need more CDNs | for users as opposed to CDNs for website owners. | bad_user wrote: | The "target site" is the URL from the author's domain, and | Cloudflare is the domain's designated CDN. The user is | reaching the server that the webmaster wants reachable. | | That's how the web works. | | > _The nice thing about archive.org is that it does not | require SNI_ | | I fail to see how that's even a thing to consider. | 1vuio0pswjnm7 wrote: | If the user follows an Internet Archive URL (or Google | cache URL or BING cache URL or ...), does she still | reach "the server the webmaster wants reachable"?
| | SNI, more specifically sending domain names in plaintext | over the wire when using HTTPS, matters to the IETF because | they have gone through the trouble of encrypting the server | certificate in TLS 1.3, and eventually they will be | encrypting SNI. If you truly know "how the web works", then | you should be able to figure out why they think domain | names in plaintext are an issue. | [deleted] | uniqueid wrote: | Yeah, that's another problem with the design of the web, and kind | of a significant one! Somewhat pointless to link to external | documents when half of them won't be around next year. | nullandvoid wrote: | I experienced this just the other day. | | I was browsing an old HN post from 2018, with lots of what seemed | like useful links to their blog. | | Upon visiting it, the site had been rebranded and the blog entries | had disappeared. | | The Wayback Machine saved me in this case, but a link to it | originally would have saved me a few clicks. | ique wrote: | Just another reason to have content-addressable storage | everywhere: then at least if it changed you'll know it changed, | and if you can't get the original content anymore, the change | is probably malicious. | fornowiamhere wrote: | > _Now it's spam from a site suffering financial need._ Well, | yeah! | | Of course, linking to the WBM is not the main reason why a site | might be in this situation, but it piles up. | asdfman123 wrote: | > So in Feb 14 2019 your users would have seen the content you | intended. However in Sep 07 2020, your users are being asked to | support independent Journalism instead. | | Can you believe it? Yesterday, I tried to walk out of the grocery | store with a head of lettuce for free, and they instead were more | interested in making me pay money to support the grocery and | agricultural business! | monktastic1 wrote: | Right.
I thought it was pretty bad form for him to call this | "spam," as though they're the ones wronging _him._ | 8bitsrule wrote: | Gotta completely agree ... for anything you need to be stable and | available. | | I've been building lists of -reference- URLs for over a decade | ... and the ones aimed at Archive.org (are slower to load, but) | are much more reliable. | | Saved Wayback URLs contain the original site URL. It's really | easy to check it to see if the site has deteriorated (usually it | has). If it's gotten better ... it's easy to update your saved WB | link. | shortformblog wrote: | This man's entire argument is completely terrible for two | reasons: | | 1) The example he uses is The Epoch Times, a questionable source | even on the best of days. | | 2) What he refers to as "spam" is a paywall. He is literally | taking away from business opportunities for this outlet that | produced a piece of content he wants to draw attention to, but he | does not want to otherwise support. | | He's a taker. And while the Wayback Machine is very useful for | sharing archived information, that's not what this guy is doing. | He's trying to undermine the business model of the outlets he's | reading. | | The Epoch Times is one thing--it's an outlet that is essentially | propaganda--but when he does this to a local newspaper or an | actual independent media outlet, what happens? | zdw wrote: | For reference: https://en.wikipedia.org/wiki/Epoch_Times | | They're hyper right wing Qanon/antivax spreaders associated | with the Falun Gong movement. | Ensorceled wrote: | > 2) What he refers to as "spam" is a paywall. He is literally | taking away from business opportunities for this outlet that | produced a piece of content he wants to draw attention to, but | he does not want to otherwise support. | | For the destination site, this is all of the downsides of AMP | with none of the upsides. | aldo712 wrote: | Here's a WayBackMachine Link to this article. 
:) | https://web.archive.org/web/20200908090515/https://hawaiigen... | lizardmancan wrote: | Also, something to take home from this is that we all think we | have an idea of what the WWW is or amounts to, while in reality it | is changing all the time at a much more dramatic rate than we can | see or indeed imagine. Depending on current events, very large | numbers of new sites are created and new top-of-index content | is written, and an even larger amount vanishes. When new | topics mature and their angles are reasonably fleshed out, the | incineration wave kicks in again and POOF, we have a whole new | WWW. And since most content is rarely linked to, the | problem is much larger still. Naive people think the value of | content is also static. They can of course advertise that opinion, | but for the rest of us to just accept it as gospel? We should be | outraged that "they" can delete it. Then we can truly feel the | totalitarianism of it. | koboll wrote: | This seems like a problem that would be better solved by | something like: | | 1. Browsers build in a system whereby if a link appears dead, | they first check against the Wayback Machine to see if a backup | exists. | | 2. If it does, they go there instead. | | 3. In return for this service, and to offset costs associated | with increased traffic, they jointly agree to financially support | the Internet Archive in perpetuity. | nikisweeting wrote: | Or link to your own archive of the content with ArchiveBox! | | That way we're not all completely reliant on a central system. | (ArchiveBox submits your links to Archive.org in addition to | saving them locally.) | | https://github.com/pirate/ArchiveBox | | There are many other tools that can do this too: | | https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Comm...
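The dead-link fallback koboll outlines above is close to what the Wayback Machine's availability API already exposes: given a URL (and optionally a date), it returns the closest archived snapshot. A hedged sketch of a client, with the JSON field names following the API's documented `archived_snapshots`/`closest` shape:

```python
import json
import urllib.parse
import urllib.request

API = "https://archive.org/wayback/available"

def availability_query(url, timestamp=""):
    """Build the availability-API query; an optional YYYYMMDD
    timestamp asks for the snapshot closest to that date."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return API + "?" + urllib.parse.urlencode(params)

def closest_snapshot(response):
    """Extract the closest archived URL from a decoded API
    response, or None if the page was never captured."""
    snap = response.get("archived_snapshots", {}).get("closest")
    if snap and snap.get("available"):
        return snap["url"]
    return None

def fallback_url(url):
    """Resolve url to an archived copy, or None."""
    with urllib.request.urlopen(availability_query(url), timeout=10) as r:
        return closest_snapshot(json.load(r))
```

A browser, extension, or 404 handler could call something like `fallback_url` only after the original request fails, which keeps traffic (and hits) going to the canonical site first.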
| [deleted] | rmoriz wrote: | I once discovered an information leak of German public | broadcasting organization ARD which leaked real mobile numbers on | their CI/CD page where they showed the business card designs | (lol). | | All records of this page on Archive.org were deleted after a | couple of days, a twitter account posting the details with a | screenshot and link was reported and my account temporarily | suspended. | | I assume it must be very easy to remove inconvenient content from | archive.org. | | (in German) https://blog.rolandmoriz.de/2019/04/25/sind-die- | leute-von-de... | krapp wrote: | Apropos of nothing but I added the ability to archive links in | Anarki a few months back[0]. If dang or someone wants to take it | for HN they're welcome to. Excuse the crappy quality of my code | and pr format, though. | | It might be useful as a backup if the original site starts | getting hugged to death. | | [0]https://github.com/arclanguage/anarki/pull/179 | rkagerer wrote: | I link to the original, but archive it in both WayBackMachine and | Archive.is. | scruffyherder wrote: | So it can be deleted too? | | Or so there is no engagement at the source? | mountainb wrote: | Link rot has convinced me that the web is not good for its | ostensible purpose. I used to roll my eyes reading how academic | researchers and librarians would discourage using webpages as | resources. Many years later, it's obvious that the web is pretty | bad for anything that isn't ephemeral. | sfg wrote: | We have deposit libraries in the U.K., such as The British | library and Oxford University's Bodleian. When you publish a | book in the U.K. you are supposed to offer a copy to these | institutions. | | If we had legal deposit web archiving institutions, then | academics, and others, could create an archive snapshot of some | resource and then reference the URI to that (either with or | without the original URI), so as to ensure permanence. 
| ImaCake wrote: | >I used to roll my eyes reading how academic researchers and | librarians would discourage using webpages as resources. | | While this is true in general, I am amused that this is _not_ | true for citing wikipedia. Wikipedia can be trusted to remain | online for many more years to come. And it has a built-in | wayback machine in the form of Revision History. | mountainb wrote: | Try following the references on big Wiki pages and you will | see why Wikipedia pages are nightmarish for any kind of | research. This is important when you are trying to drill down | to the sources of various claims. Many major pages relating | to significant events and concepts are riddled with rotted | links. | | The page can be completely correct and accurate, but if you | cannot trace the references then it cannot be verified and | you cannot make the claims in a new work as a result. The | whole point of references is to make it so that the claims | can be independently verified. Even when there isn't a link | rot problem you will often find junk references that cannot | be verified. | | Wikipedia isn't a bad starting point and sometimes you can | find good references. But it is not anywhere close to | reliable: just trace the references in the next 20 Wiki | articles you read and your faith will be shaken. | techphys_91 wrote: | Usually a reference indicates that an author believes | something to be true, but won't explicitly state their | reasons. It isn't just a statement of where information comes | from, but a justification for trusting that information. If | the reference is from a reputable source, then it indicates | that this belief is justified. If an author believes | something to be true because they read it on wikipedia, then | that belief probably isn't justified, because the reliability | of wikipedia content is mixed. 
| | Good quality information on wikipedia often refers back to | published sources, and at the very least an author should | check that source and refer to it, rather than wikipedia | itself. | scruffyherder wrote: | After someone published an authoritative FTP listing, many | people panicked because theirs were out-of-date and insecure | versions, so rather than patch they all went dark. | | Anyone doing research just got screwed. | | So many papers have code linked to places that don't exist | anymore. | k1m wrote: | I think this is a good idea, especially because the | WayBackMachine uses good content security policies to block | some of the intrusive JS that ad-dependent sites like to push on | people. So you're not only protecting against future 404 scenarios, | but also protecting your visitors' privacy from unscrupulous ad- | tech, which seems to be everywhere now. | | The example provided in the article, showing how a site looked | cleaner before, could simply be the content security policies at | the WayBackMachine preventing the clutter from getting loaded, | rather than any specific changes on the site - although I haven't | checked that particular site. | prgmatic wrote: | I stopped reading after the part where they describe the paywall- | gated version of the journalism website as "Now it's spam from a | site suffering financial need." | | That website spends money creating content for commercial | viability; it doesn't have to bow to you and make sure you can | consume it for free, and the Wayback Machine isn't a tool for you | to bypass premium content. | s9w wrote: | In practice however, archive.org did censor content based on | political preference. | encom wrote: | Sounds plausible, but I sure would like a citation for that | claim. | s9w wrote: | I do have two links in my "clownworld" link list, but | ironically they're both in subreddits that have since been | banned and are therefore not available anymore.
| dependenttypes wrote: | They exclude Snopes and I think Salon from archiving. | romwell wrote: | Good idea, but why not both (i.e. link to a webpage, _and_ to the Archive)? | | Linking only to the Archive makes the Archive a single point of failure. | thunderrabbit wrote: | Agreed. I usually link to both the original and then archive.org in parentheses. | sseneca wrote: | Yes, this makes the most sense in my opinion: | | Check out [this link](https://...) ([archived](https://...)) | | This can also help in the event of a "hug of death". | roberto wrote: | This is what I do on my blog, with some additional metadata:
|         <p>
|           <a data-archive-date="2020-09-01T22:11:02.287871+00:00"
|              data-archive-url="https://web.archive.org/web/20200901221101/https://reubenwu.com/projects/25/aeroglyphs"
|              href="https://reubenwu.com/projects/25/aeroglyphs">Aeroglyphs</a>
|           <span class="archive">
|             [<a href="https://web.archive.org/web/20200901221101/https://reubenwu.com/projects/25/aeroglyphs">archived</a>]
|           </span>
|           is an ongoing series of photos of nature with superimposed
|           geometrical shapes drawn by drones.
|         </p>
| dredmorbius wrote: | The WBM link includes the canonical source clearly within the URL. | romwell wrote: | Yeah, and the non-technical users will surely understand that what they need to do when the link doesn't work is: | | 1. Recognize that it's an Archive.org URL | | 2. Understand that the link references an archived page whose URL is "clearly" referenced as a parameter | | 3. Edit the URL (especially pleasant on a cell phone) correctly and try loading that | | If you expect the user to be able to go through all this trouble when the Archive is down, you can also expect them to look up the page on the Archive if the link does not load. | | But better yet, one shouldn't expect either. | iib wrote: | By the way the archive works, isn't the link just adding https://web.archive.org/web/*/ before the actual link?
I guess linking to both is especially important for people who don't know about the existence of archive.org, and a small convenience for everyone. But the link seems to be reversible in either direction. | hinkley wrote: | I wonder if the anchor tag should be altered to support this? | | Alternatively, this is a good thing for a user agent to handle natively, or through a plugin. | cornedor wrote: | But how certain is the future of the WayBackMachine? When disaster strikes, all your links are dead. On the other hand, the original links can still be read from the URL, so the original reference is not completely gone. | dredmorbius wrote: | INTERNETARCHIVE.BAK: | | _The INTERNETARCHIVE.BAK project (also known as IA.BAK or IABAK) is a combined experiment and research project to back up the Internet Archive's data stores, utilizing zero infrastructure of the Archive itself (save for bandwidth used in download) and, along the way, gain real-world knowledge of what issues and considerations are involved with such a project. Started in April 2015, the project already has dozens of contributors and partners, and has resulted in a fairly robust environment backing up terabytes of the Archive in multiple locations around the world._ | | https://www.archiveteam.org/index.php?title=INTERNETARCHIVE.... | | Snapshots from 2002 and 2006 are preserved in Alexandria, Egypt. I hope there's good fire suppression. | | https://www.bibalex.org/isis/frontend/archive/archive_web.as... | phendrenad2 wrote: | I wish there were a way to get a low-res copy of their entire archive. So, only text; no images, binaries, or PDFs (other than PDFs converted to text, which they seem to do). As it stands, the archive is so huge that the barrier to mirroring is high. | dredmorbius wrote: | Agreed.
| | When scoping out the size of Google+, one of ArchiveTeam's recent projects, it emerged that the typical size of a post was roughly 120 bytes, but total page weight a minimum of 1 MB, for roughly a 0.01% payload-to-throw-weight ratio. This seems typical of much of the modern Web. And that excludes external assets: images, JS, CSS, etc. | | If _just the source text and sufficient metadata_ were preserved, all of G+ would be startlingly small -- on the order of 100 GB, I believe. Yes, posts _could_ be longer (I wrote some large ones), and images (associated with about 30% of posts by my estimate) blew things up a lot. But the scary thing is actually how _little_ content there really was. And while G+ certainly had a "ghost town" image (which I somewhat helped define), it wasn't _tiny_ --- there were plausibly 100 - 300 million users with substantial activity. | | But IA's WBM has a goal and policy of preserving the Web _as it manifests_, which means one hell of a lot of cruft and bloat. As you note, increasingly a liability. | ta8908695 wrote: | The external assets for a page could be archived separately though, right? I would think that the static G+ assets: JS, CSS, images, etc. could be archived once, and then all the remaining data would be much closer to the 120 B of real content. Is there a technical reason that's not the case? | dredmorbius wrote: | In theory. | | In practice, this would likely involve recreating at least some of the presentation side of numerous changing (some constantly) Web apps. Which is a substantial programming overhead. | | WARC is dumb as rocks, from a redundancy standpoint, but also atomically complete, independent (all WARCs are entirely self-contained), and reliable. When dealing with billions of individual websites, these are useful attributes. | | It's a matter of trade-offs. | nikisweeting wrote: | So archive your links yourself with one of the many local-web-archiving tools.
| | https://webrecorder.io | | https://github.com/pirate/ArchiveBox | | https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Comm... | oblio wrote: | Doesn't the link to the WayBackMachine contain the original link? | INTPenis wrote: | Yeah, my thoughts were more about the way the Waybackmachine is funded. | | I don't feel comfortable sending a bunch of web traffic to them for no reason other than it being convenient. The wayback machine is a web archival project, not your personal content proxy to make sure your links don't go stale. | | They need our help both in funding and in action; one simple action is not to abuse their service. | sanitycheck wrote: | Precisely my first thoughts, too. It's an archive, not a free CDN. | | I hope the author of this piece considers donating and promoting donation to their readers: https://archive.org/donate/ | Lex-2008 wrote: | A WayBackMachine alternative, archive.is, has an option to download a zip archive of the HTML with images and CSS (but no JS) - this way you can preserve and host a copy of the original webpage on your own website. | moonchild wrote: | Or just wget -rk... | | Mirroring a website isn't so hard that you need a service to do it for you. Your browser even has such a function; try ctrl-s. | abricot wrote: | The "SingleFile" plugin is a better version of ctrl+s. It will save a page as a single html file and even include images as an octet stream in the file so they aren't missed. | peq wrote: | I would be careful about mirroring a site. It's very likely to violate copyright or similar laws, depending on where you are. I think archive.org is considered fair use, but if you put it on a personal or even business page it might be different. For example, Google News in the EU is very limited in what content they may steal from other web pages. | j1elo wrote: | This is a bad idea for the reasons that other commenters have already stated. If WayBackMachine falls, all links would fall.
| Actually the "Web" would stop being one if all links are within the same service. | | For docs and other texts, I just link to the original site and add an (Archive) suffix, e.g. the "Sources" section in https://doc-kurento.readthedocs.io/en/latest/knowledge/nat.h... | | That is a simple and effective solution; yes, it is a bit more cumbersome, but it does not bother me. | euske wrote: | This is both a good and a scary idea: for the good part, I'm frustrated enough that some unscrupulous websites (even some news outlets) secretly alter their contents without mentioning the change. I want a mechanism that holds the publisher responsible. At the same time, this is scary because we're basically using one private organization as a single arbitrator. (I know it's a nonprofit, but they're probably not as public as a government entity.) Maybe it's good for the time being, but we should be aware that this is a solution that's far from perfect. | anaganisk wrote: | Public "or" a government entity. | icemelt8 wrote: | Just FYI, archive.org is banned in a few countries, including the UAE, so I cannot open any links from there. | dirtnugget wrote: | Huh, I wonder if they are also blocking mirrors. Also, in countries with restrictions on internet access you probably want to make using TOR a general habit. | kibibu wrote: | Can we update this link to point to the archive version? | drummer wrote: | Brilliant | arnoooooo wrote: | On the same topic, I wish I could link with highlights in the page. Having a spec for highlights in URLs would be neat. | basscomm wrote: | Chrome 80 supports this: | https://www.chromestatus.com/feature/4733392803332096 | [deleted] | bherb wrote: | Here, I fixed your link: | https://web.archive.org/web/20200908090515/https://hawaiigen... | shemnon42 wrote: | Came here for this. Have my upvote. | EllieEffingMae wrote: | I maintain a fork of a program that does exactly this!
You can check it out here: | | https://github.com/Lifesgood123/prevent-link-rot | celsoazevedo wrote: | Is there any WordPress plugin that adds a link to the WayBack Machine next to the original link? I would use something like that. | dredmorbius wrote: | Perhaps: https://wordpress.org/plugins/media-library-internet-archive... | aargh_aargh wrote: | Look at the format of the wayback machine URL. It's trivial to generate. | | Where a WP plugin would add value is by saving to the archive whenever WP publishes a new or edited article. | sebastianconcpt wrote: | Clever way to make the reference immutable. | | Some blockchain will end up taking care of this. | imhoguy wrote: | This is building yet another silo and point of failure. We can't route the entire Internet's traffic through the WayBackMachine, as its resources are limited. | | Most preservation solutions are like that, and in the end funding or business priorities (Google Groups) become a serious problem. | | I think we need something like the web itself - distributed, and dead easy to participate in and contribute preservation space to. | | Look, there are torrents that have been available for 17 years [0]. Sure, some uninteresting ones are long gone, but there is always a little chance somebody still has the file and someday comes online with it. | | I know about IPFS/Dat/SSB, but still, that stuff, like Bitcoin, is too complex for a layman contributor with a plain altruistic motivation. It should be like SETI@Home - fire and forget. Eventually integrated with a browser to cache content you star/bookmark and share it when it is offline. | | [0] https://torrentfreak.com/worlds-oldest-torrent-still-alive-a... | TheSpiceIsLife wrote: | This behaviour should be reported to the WayBackMachine as abuse.
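The URL format aargh_aargh mentions really is trivial to generate: a Wayback snapshot URL is just a fixed prefix, a 14-digit YYYYMMDDhhmmss timestamp (or `*` for the snapshot index), and the original URL. A hedged sketch, with helper names of my own invention; note that real Wayback URLs can also carry timestamp modifiers (such as `id_`), which this ignores:

```python
# Sketch of generating and reversing Wayback Machine URLs.

WAYBACK_PREFIX = "https://web.archive.org/web/"


def to_wayback(url: str, timestamp: str = "*") -> str:
    """Build a Wayback Machine URL for `url`.
    The default "*" timestamp links to the snapshot index; a 14-digit
    YYYYMMDDhhmmss timestamp links to the snapshot nearest that moment."""
    return f"{WAYBACK_PREFIX}{timestamp}/{url}"


def from_wayback(wayback_url: str) -> str:
    """Recover the original URL embedded in a Wayback Machine URL."""
    rest = wayback_url[len(WAYBACK_PREFIX):]  # "<timestamp>/<original url>"
    _, original = rest.split("/", 1)
    return original


link = to_wayback("https://example.com/some-article", "20200908090515")
# link == "https://web.archive.org/web/20200908090515/https://example.com/some-article"
assert from_wayback(link) == "https://example.com/some-article"
```

This reversibility is what several commenters rely on: even if the archive is down, the canonical URL can be read straight out of the archive link.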
| cpcallen wrote: | This seems like a risky strategy, what with the pending lawsuit against archive.org over their National Emergency Library: I am fully expecting that web.archive.org will go away permanently within a few years. | ffpip wrote: | The wayback machine helps me on a daily basis. So many old links are dead. | | The other day, I noticed that even old links from the front page of Google and Youtube are dead now. Internet Archive still has them. These were links on the front page of YT. Was very disappointed that even Google has dead links. | lizardmancan wrote: | The real problem here is that URLs provide only a single method to obtain content. Combined with the registrars' rent-seeking scheme, we are left with flimsy technology. | | I implemented this one time for images when a bunch of free image hosts I was using failed:
|         <img src="http://example.com/img.jpg" data-x="0"
|              data-uri='data:image/gif,GIF89a%1...'
|              onerror="a=['http://example.com/img.jpg',
|                          'http://example.com/img2.jpg',
|                          'http://example.com/michael-faraday.jpg',
|                          this.dataset.uri];
|                       this.src=a[this.dataset.x++]">
| ffpip wrote: | You can create a bookmark in Firefox to save a link quickly. | | Bookmark Location: https://web.archive.org/save/%s | | Keyword: save | | So searching 'save https://news.ycombinator.com/item?id=24406193' archives this post. | | You can use any keyword instead of 'save'. | | You can also search with https://web.archive.org/*/%s | bad_user wrote: | Does that `save` keyword work? | | The problem is %s gets escaped, so Firefox generates this URL, which seems to be invalid: | | https://web.archive.org/save/https%3A%2F%2Fnews.ycombinator.... | aendruk wrote: | Uppercase %S for unescaped, e.g.: | | https://web.archive.org/web/*/%S | bad_user wrote: | Ah, nice, thanks! | ffpip wrote: | web.archive.org automatically converts the https%3A%2F things to https:// for me. I noticed it many times.
| | If you are still facing problems, go to https://web.archive.org . In the bottom right 'Save page now' field, right click and select 'add keyword for search'. Choose your desired keyword. | fireattack wrote: | >web.archive.org automatically converts the https%3A%2F | | Did you try the link provided by the one you replied to? | | Because it says "HTTP 400" here, so apparently it doesn't convert well, at least not on my end. | kilroy123 wrote: | Nice. I forgot how you can do that. | | I just use the extension myself: | | https://addons.mozilla.org/en-US/firefox/addon/wayback-machi... | ffpip wrote: | Yeah. That requires access to all sites. I wasn't comfortable adding another addon with that permission. | | The permission is just for a simple reason and should be off by default. It is so you can right click a link on any page and select 'archive' from the menu. Small function, but requires access to all sites. | robotron wrote: | The source is available if you want to know what's going on with those permissions: | https://github.com/internetarchive/wayback-machine-chrome | ffpip wrote: | Thanks. I already knew that. I'm familiar with the dev's extensions. Clear Browsing Data and Captcha Buster are very useful. | badsectoracula wrote: | One issue I have with this extension is that it randomly pops up the 'this site appears to be offline' message (which overlays the entire page) even when the site actually works (I hit the back button and it appears). I have had it installed for some time now, and so far I have almost daily false negatives; only once has it actually worked as intended. | | Also, there doesn't seem to be a way to open a URL directly from the extension, which seems a weird omission, so I end up going to the archive site anyway, since I very often want to find old long-lost sites. | fireattack wrote: | It pops up when there is an HTTP 404 status code or similar returned.
So these false negatives are likely due to specific sites being configured in a wacky way. | | (Don't get me wrong, it is still very annoying for the user regardless of what the cause is.) | badsectoracula wrote: | Does it pop up for _any_ 404 error? If so, it might be some script or font or whatever resource the site itself is using that would otherwise fail silently. If not... then there has to be some other bug/issue, because I get it for many different sites that shouldn't have it. | fireattack wrote: | Nope, only for the "main" page (for lack of a better word), and when there _is_ an archive for it. | eruci wrote: | WBM is like a content snapshot. You can't go back in time and change anything. That's why it is better than linking to the original. | wila wrote: | The idea of being able to access the URL once it is gone is good. However, this also means that any updates made to the original page are no longer seen. | | Not all updates are about "begging for money" as in the example in the article. | markjgraham wrote: | We suggest/encourage people link to original URLs but ALSO (as opposed to instead of) provide Wayback Machine URLs so that if/when the original URLs go bad (link rot) the archive URL is available, or to give people a way to compare the content associated with a given URL over time (content drift). | | BTW, we archive all outlinks from all Wikipedia articles from all Wikipedia sites, in near-real-time... so that we are able to fix them if/when they break. We have rescued more than 10 million so far from more than 30 Wikipedia sites. We are now working to have Wayback Machine URLs added IN ADDITION to Live Web links when any new outlinks are added... so that those references are "born archived" and inherently persistent. | | Note, I manage the Wayback Machine team at the Internet Archive. We appreciate all your support, advice, suggestions and requests.
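The link-plus-archive-URL pattern markjgraham recommends is easy to automate against the Wayback Machine's public availability API (`https://archive.org/wayback/available`), which returns the closest snapshot for a URL as JSON. A hedged sketch; the function names are illustrative, the network call itself is left to the caller, and the sample response below is abbreviated from the API's documented shape:

```python
import json
from typing import Optional
from urllib.parse import urlencode

# Sketch of querying the Wayback Machine availability API to find the
# closest archived snapshot for a URL.


def availability_query(url: str, timestamp: str = "") -> str:
    """Build the availability-API request URL for `url`.
    `timestamp` (YYYYMMDDhhmmss) optionally biases toward a point in time."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return "https://archive.org/wayback/available?" + urlencode(params)


def closest_snapshot(response_body: str) -> Optional[str]:
    """Extract the closest available snapshot URL from an API response, if any."""
    snap = json.loads(response_body).get("archived_snapshots", {}).get("closest")
    return snap["url"] if snap and snap.get("available") else None


# Abbreviated example of the response shape:
sample = ('{"archived_snapshots": {"closest": {"available": true, '
          '"timestamp": "20200908090515", '
          '"url": "https://web.archive.org/web/20200908090515/https://example.com/"}}}')
assert closest_snapshot(sample) == "https://web.archive.org/web/20200908090515/https://example.com/"
```

A periodic job could feed each outlink through this and store the snapshot URL alongside the original, which is essentially the "born archived" workflow described above.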
| arendtio wrote: | I always wonder about rising hosting costs in the wake of people linking to the Wayback Machine from popular sites. | | How do you think about it? | Arkanosis wrote: | This is so much better than INSTEAD. | | Not for the sole reason that it leaves some control to the content owner while ultimately leaving the choice to the user, but also because things like updates and errata (e.g. retracted papers) can't be found in archives. When you have both, it's the best of both worlds: you have the original version, the updated version, and you can somehow have the diff between them. IMHO, this is especially relevant when the purpose is reference. | tracker1 wrote: | I mostly agree... however, given how many "news" sites are now going back and completely changing articles (headlines, content) without any history, I think it's a mixed bag. | | Link rot isn't the only reason why one would want an archive link instead of the original. Not that I'd want to overwhelm the internet archive's resources. | jhallenworld wrote: | It's interesting to think about how HTML could be modified to fix the issue. Initial thought: along with HREF, provide AREF: a list of archive links. The browser could automatically try a backup if the main one fails. The user should be able to right-click the link to select a specific backup. Another idea is to allow the web-page author to provide a rewrite rule to automatically generate wayback machine (or whatever) links from the original. This seems less error-prone, and browsers could provide a default that authors could override. | | Anyway, the fix should work even with plain HTML. I'm sure there are a bunch of corner cases and security issues involved. | | Well, as mentioned by others, there is a browser extension. It's interesting to read the issues people have with it: | | https://addons.mozilla.org/en-US/firefox/addon/wayback-machi...
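Short of new browser support, jhallenworld's rewrite-rule idea can be approximated today as a static-site build step: post-process the HTML so every external link gains an "(archived)" companion, much like the markup roberto posted earlier. A rough sketch under stated assumptions: the regex only handles the simplest anchor form, and the fixed `SNAPSHOT` timestamp is a placeholder for a real per-link snapshot date.

```python
import re

# Sketch of a build-step that appends an "[archived]" link, pointing at a
# generated Wayback Machine URL, after each absolute http(s) anchor.
# A production version would use an HTML parser, not a regex.

SNAPSHOT = "20200908000000"  # placeholder: ideally the page's publication date


def add_archive_links(html: str) -> str:
    def rewrite(match: re.Match) -> str:
        href = match.group(1)
        archived = f"https://web.archive.org/web/{SNAPSHOT}/{href}"
        return f'{match.group(0)} [<a href="{archived}">archived</a>]'

    # Only touch absolute http(s) links; leave relative/internal links alone.
    return re.sub(r'<a href="(https?://[^"]+)">[^<]*</a>', rewrite, html)
```

For example, `add_archive_links('<p>See <a href="https://example.com/a">this</a>.</p>')` leaves the canonical link in place and appends a bracketed archive link after it, which keeps canonical URLs canonical while giving readers a fallback.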
| devenblake wrote: | Yup, I've been using the extension for probably about a year now and get the same issues they do. It really isn't that bad; most of the time backing out of the message once or twice does the trick, but it's funny because most of the time I get that message when going to the IA web uploader. | shortformblog wrote: | This is literally where my brain was going, and I was glad to see someone went in the same direction. Given the <img> tag's addition of srcset in recent years, there is precedent for doing something more with href. | javajosh wrote: | So this is a little indirect, but it does avoid the case where the Wayback machine goes down (or is subverted): include a HASHREF which is a hash of the state of the content when linked. Then you could find the resource using the content-addressable system of your choice. (Including, it must be said, the wayback machine itself.) | punnerud wrote: | I love the feature that you can easily add a page to the archive: | https://web.archive.org/save/https://example.com | | Replace https://example.com in the URL above. I try to respect the cost of archiving by not saving the same page too often. | Ziggy_Zaggy wrote: | Kudos for doing what you do. | michaelanckaert wrote: | In the past I would fall back to the WBM when something was no longer online. Though recently I've been bookmarking interesting content very rigorously and just rely on the archival feature of my bookmarking software. | codetrotter wrote: | By that reasoning, shouldn't you be using WayBack Machine links when posting your own content to HN, instead of posting direct links? | dirtnugget wrote: | He is actually showcasing a very nice technique to get around paywalls: turn off JS. Often that's enough to get around the paywall. I believe the archives also disable JS when grabbing the content. | rchaud wrote: | That is changing.
I've noticed over the past couple of years that sites that could be accessed with JS turned off are now showing a "Please enable Javascript to continue" message (Quora) or just hiding the content entirely (Business Insider). | | I'm sure there are other examples as well. | dirtnugget wrote: | Not surprised. When paywalls started becoming a thing, most of them could be circumvented simply by removing a DOM element and some CSS classes. Nowadays this is basically not possible anywhere anymore. | not2b wrote: | It's probably better to link to both. If a site corrects a story, your readers will want to see the correction, but if the page disappears, it's good to have the backup. | AnonHP wrote: | WayBackMachine is slow (slower than many bloated websites). So it's not a good enough experience for the person clicking on that link. | | Secondly, I personally don't like the fact that WayBackMachine doesn't provide an easy way to get content removed and to stop indexing and caching content (the only way I know is to email them, with delayed responses or responses that don't help). It's far easier to get content de-indexed in the major search engines. I know that the team running it has some reasons to archive anything and everything (as) permanently (as possible), but it doesn't serve everybody's needs. | tannhaeuser wrote: | The proper way is for a site to expose a canonical link to an article via a meta-link (rel=canonical) if necessary, and then have a browser plugin automatically try archive.org with a URL generated from the canonical one if it is down. | LoSboccacc wrote: | Has the waybackmachine stopped retroactively applying robots.txt? | | If not, links to it are one misconfiguration or one parked domain away from being wiped. | runxel wrote: | While I certainly wouldn't do this with every page, and also not every time, I got so anxious about link rot lately that I save any good content I come across to the Waybackmachine out of reflex.
| | The use of the bookmarklet makes this really convenient. | ImAlreadyTracer wrote: | Is there a chrome app that utilises waybackmachine? | CaptArmchair wrote: | So, this is the problem of the persistence of URLs always referencing the original content, regardless of where it is hosted, in an authoritative way. | | It's an okay idea to link to WB, because (a) it's de facto assumed to be authoritative by the wider global community and (b) as an archive it provides a promise that its URLs will keep pointing to the archived content come what may. | | Though, such promises are just that: promises. Over a long period of time, no one can truly guarantee the persistence of a relationship between a URI and the resource it references. That's not something technology itself solves. | | The "original" URI still does carry the most authority, as that's the domain on which the content was first published. Moreover, the author can explicitly point to the original URI as the "canonical" URI in the HTML head of the document. | | Finally, when you link to the WB machine, what do you link to? A specific archived version? Or the overview page with many different archived versions?
Which of those versions is currently endorsed by the original publisher, and which are deprecated? How do you know this? | | WB also supports linking to the very latest version. If the archive is updated frequently enough, I would say it is reasonable to link to that if you use WB just as a mirror. In some cases I've seen error pages being archived after the original page has been moved or removed, but that is probably just a technical issue caused by some website misconfiguration or bad error handling. | susam wrote: | I think the fundamental problem here is that URLs locate resources. We find the desired content by finding its location given by an address. Now, what server or content lives at that address may change from time to time or may even disappear. This leads to broken links. | | The problem with linking to the Wayback Machine is that we are still writing location-based URLs, just ones pointing at Wayback Machine servers. What guarantee is there that those archive.org links will not break in the future? | | It would have been nice if the web were designed to be content-addressable. That is, the identifier or string we use to access content addresses the content directly, not a location where the content lives. There is good effort going on in this area in the InterPlanetary File System (IPFS) project, but I don't think the mainstream content providers on the Internet are going to move to IPFS anytime soon. | spurgu wrote: | I think a good solution might be to host the archive version yourself (archive.org is slow, and always using it centralizes everything there).
| | Let's say you write an article on your site, https://yoursite.com/my-article, and from it you want to link to an article https://example.com/some-article | | You then create a mirror of https://example.com/some-article to be served from your site at https://yoursite.com/mirror/2019-09-08/some-article (put /mirror/ in robots.txt and set it to noindex (or maybe even better, put a rel="canonical" towards the original article?)), and at the top of this mirrored page you add a header bar thingy containing a link to the original article, as well as one to archive.org if you so want. | | tl;dr instead of linking to https://example.com/some-article you link to https://yoursite.com/mirror/2019-09-08/some-article (which has links to the original) | andy_ppp wrote: | It would be good to create a distributed, consensus version (to help stop edits) of the content rather than have a single point of failure... | zoid_ wrote: | I find that web archive pages always appear broken --- perhaps a lot of js or css is not properly archived? | dltj wrote: | Take a look at _Robustify Your Links_.[1] It is an API and a snippet of JavaScript that saves your target HREF in one of the web archiving services and adds a decorator to the link display that offers the user the option to view the web archive. | | [1] https://robustlinks.mementoweb.org/about/ | spqr233 wrote: | I made a chrome extension called Capsule that works perfectly for this use case. With just a click, you can create a publicly shareable link that preserves the webpage exactly as you see it in your browser. | | https://capsule.click | nikisweeting wrote: | Does it use SingleFile under the hood? What storage format does this use, and is it portable? e.g. WARC/memento/zim/etc? | outsomnia wrote: | This is a bad idea... | | In the worst case one might write a cool article and get two hits: one noticing it exists, and the other from the archive service.
After that it might go viral, but the author may have | given up by then. | | The author is losing out on inbound links so google thinks their | site is irrelevant and gives it a bad pagerank. | | All you need to do is get archive.org to take a copy at the time, | you can always adjust your link to point to that if the original | is dead. | ethanwillis wrote: | Google shouldn't be the center of the Web. They could also | easily determine where the archive link is pointing to and not | penalize. But I guess making sure we align with Google's | incentives is more important than just using the Web. | bartread wrote: | > Google shouldn't be the center of the Web. | | I agree, but are you suggesting it's going to be better if | WayBackMachine is? | ethanwillis wrote: | That's a strawman because I never said they should be. | There's room for better alternatives. | | We as a community need to think bigger rather than | resigning ourselves to our fate. | bartread wrote: | It's not a strawman because (a) I agreed with you, (b) | context, and (c) I asked a question based on what you | seemed to be implying in that context: a question to | which you still haven't provided an answer. | | Let me put it another way: what specifically are you | suggesting as an alternative? | ethanwillis wrote: | If I had to pick a solution from what's available right | now technology wise I'd pick something that links based | on content hashes. And then pulls the content from | decentralized hosting. | | I don't think I like IPFS as an organization, but tech | wise it's probably what I'd go with. | encom wrote: | Yes. At least Archive.org isn't an evil mega corporation | destroying the internet. Yet. | rriepe wrote: | We'll see what their new owners do after the lawsuit. | rchaud wrote: | Every search engine uses the number of backlinks as one of | the key factors in influencing search rank; it's a | fundamental KPI when it comes to understanding whether a link | is credible. 
| | What is true for Google in this regard is also true of Bing, DDG and Yandex. | luckylion wrote: | > But I guess making sure we align with Google's incentives is more important than just using the Web. | | It's not about Google's incentives. It's about directing the traffic where it should go. Google is just the means to do so. | | Build an alternative. I'm sure nobody _wants_ Google to be the number one way of finding content, it's just that they are, so pretending they're not and doing something that will hurt your ability to have your content found isn't productive. | johannes1234321 wrote: | One can also do it similarly to Wikipedia's references sections, which link to the original and to the memento in the archive. (Once the bot notices it's gone.) | | Additional benefit: Some edits are good (addenda, typo corrections, etc.) | marcus_holmes wrote: | I totally agree. | | I guess the answer is "don't mess with your old site", but that's also impractical. | | And I'm sorry, but if it's my site, then it's _my_ site. I reserve the right to mess about with it endlessly. Including taking down a post for whatever reason I like. | | I'm sorry if that conflicts with someone else's need for everything to stay the same, but it's _my_ site. | | Also, if you're linking to my article, and I decide to remove said article, then surely that's my right? It's _my_ article. Your right to not have a dead link doesn't supersede my right to withdraw a previous publication, surely? | pingpongchef wrote: | You can go down this road, but it looks like you're advocating for each party to simply do whatever he wants. In which case the viewing party will continue to value archiving. | mitchdoogle wrote: | I certainly don't know about legal rights, but I think the ethical thing is to make sure that any writings published as freely accessible should remain so forever.
What would people | think if an author went into every library in the world to | yank out one of their books that they no longer wanted seen? | | I do think the author is wrong to _immediately_ post links to | archived versions of sources. At the least, he could link to | both the original and the archived version. | fwip wrote: | Why is that the most ethical thing to do? | | As a motivating example, I wrote some stuff on my MySpace | page as a teenager that I'm very glad is no longer | available. It was published as "freely accessible" and | indeed, I wanted people to see it. But when I read it back | 15 years later, I was more than a little embarrassed about | it, and I deleted it - despite it also having comments from | my friends at the time, and being referenced in their pages. | | No great value was contained in those works. | marcus_holmes wrote: | I'm not sure I agree. I know that journalism (as a | discipline) considers this ethical. I kinda get that this | is part of the newspaper industry as a public service - | that withdrawing publication of something, or changing it | without alerting the reader to the change, alters the | historical record. | | But no-one has a problem with other creative industries | withdrawing their publications. Film-makers are forever | deciding that movies are no longer available, for purely | commercial reasons. Why is writing different? Why is | pulling your books from a library unethical but pulling | your movie from distribution OK? | | I think we either need to extend this to all creative | activity, or reconsider it for writing. | falcolas wrote: | This has a very easy answer for me: it's _not_ ethical | for film-makers to decide that movies are no longer | available. | | Copyright was created to encourage publication of | information, not to squirrel it away. Copyright should be | considered the exception to the standard: public domain. | fwip wrote: | Why not?
| | Is it unacceptable for an artist to throw her art away | after it has finished its museum tour? Should a parent | hang on to every drawing their child has ever made? | | If you are a software developer - is all of the code | you've ever written still accessible online, for free? | (To the legal extent that you are able, of course.) | | Have you written a blog before, or did you have a | MySpace? Have you taken care to make sure your creative | work has been preserved in perpetuity, regardless of how | you feel about the artistic value of displaying your teen | emotions? | | Consider why you feel it is unethical for the author or | persons responsible for the work to ever stop selling it. | falcolas wrote: | > Is it unacceptable for an artist to throw her art away | after it has finished its museum tour? Should a parent | hang on to every drawing their child has ever made? | | This boils down to the public domain, IMO. We have made a | long practice of rescuing art from private caches and | trash bins to make it publicly available after the | artists' passing (the copyright expiring), regardless of | their views on what should happen with those works. | | > Consider why you feel it is unethical for the author or | persons responsible for the work to ever stop selling it. | | Selling something and then pulling it down is | fundamentally an attempt to create scarcity for something | that would otherwise be freely available. It's a | marketing technique that capitalizes on our fear of | missing out to make a sale. | | Again, the right to even sell writings was enshrined in | law as an _exception_ to the norm of it immediately | being part of the public domain, in an effort to | encourage more writing. | Fargren wrote: | > But no-one has a problem with other creative industries | withdrawing their publications | | I wouldn't say no one has a problem with this. It does | happen, but it certainly doesn't make everyone happy.
I | for one would like all released media to be | available, or at least not actively removed from access. | badprose wrote: | Publishing on your own website is more akin to putting up a | signboard on your front lawn than writing a book for | publication. | | People are free to view it and take pictures for their own | records, but I could still take it down and put something | else up. | bryanrasmussen wrote: | There's no reason that PageRank couldn't be adapted to take | Wayback Machine URLs into account. If there is a link with a URL | pointing at | https://web.archive.org/web/*/https://news.ycombinator.com/ | Google could easily register that as a link to both resources - | one to web.archive.org, the other to the site. | | There is also no reason why that has to become a slippery | slope, if anyone is going to ask "but where do you stop!!" | dmitriid wrote: | After all, they did change their search to accommodate AMP. | Changing it to take the Web Archive into account is a) peanuts and | b) actually better for the web. | TheSpiceIsLife wrote: | There's a business idea in there somewhere. | | Some kind of CDN-edge-archive hybrid. | britmob wrote: | "CDN-Whether-You-Want-It-Or-Not" | quickthrower2 wrote: | Foreverspin meets Cloudflare | scruffyherder wrote: | Even worse is when you have people using RSS to wholesale copy | your site and its updates, and again the traffic and, more | importantly, the engagement disappear. | | It's very demotivating. | acatton wrote: | archive.org sends the HTTP header Link: | <https://example.com>; rel="original" | | This can be used by search engines to adjust their ranking | algorithms. | stratigos wrote: | I link to WayBackMachine as I've built a great many greenfield | applications for startups as a freelancer, which only existed for | about 6-8 months before hitting their burn rate. If I linked to | their original domains, my portfolio would be a list of 404s. | jakeogh wrote: | If it's not distributed, it is going to disappear.
| | The Wayback Machine is backed by WARC files. It's perhaps the only | thing on archive.org that can't be downloaded... well, except the | original MPG files for 9/11 news footage. | | https://news.ycombinator.com/item?id=20623177 | wolco wrote: | No one has touched on this, but the experience of viewing through the | Wayback Machine is awful. | | Media often won't be saved, so pages look broken. Iframes, | and the iframe-breakers on original sites, can kill any | navigation. | | The Wayback Machine is okay for research but a poor replacement | for a permalink. | ethagnawl wrote: | > Media often won't be saved, so pages look broken. | | In my experience, this has gotten much, much better in the last | few years. I haven't explored enough to know if this is part of | the archival process or not, but I've noticed on a few | occasions that assets will suddenly appear some time after | archiving a page. For instance, when I first archived this page | (https://web.archive.org/web/20180928051336/https://www.intel.. | .), none of the stylesheets, scripts, fonts or images were | present. However, after some amount of time (days/weeks) they | suddenly appeared and I was able to use the site as it | originally appeared. | yreg wrote: | I'm all for Archive.org. However, using it in this way -- setting | up a mirror of some content and purposefully diverting traffic to | said mirror -- is copyright infringement (freebooting), as it | competes with the original source. | samatman wrote: | This is such a fundamental problem that I'd like to be able to | solve it at the HTML level. | | An anchor type which allows several URLs to be tried in order | would go a long way. Then we could add automatic archiving and | backup links to a CMS. | | It isn't real content-centric networking, which is a pity, but | it's achievable with what we have. | hownottowrite wrote: | Awesome. Hey, mods...
Can you change the link on this post to | http://web.archive.org/web/20200908090515/https://hawaiigent... | ashishb wrote: | I wrote a link checker[1] to detect outbound links and mark dead | links so that I can replace them manually with archive.org | links. | | 1 - https://github.com/ashishb/outbound-link-checker ___________________________________________________________________ (page generated 2020-09-08 23:00 UTC)
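[Editor's note] Several comments in the thread converge on one workflow: check your outbound links, and when one dies, swap in an archive.org snapshot; conversely, the original URL can be recovered from a Wayback link itself (the point bryanrasmussen and acatton raise). Below is a minimal sketch of that workflow in Python using only the standard library. The Wayback availability API (archive.org/wayback/available) and the web.archive.org/web/<timestamp>/<url> link shape are real, but the helper names and the pluggable `is_dead` check are illustrative assumptions, not any commenter's actual implementation.

```python
import json
import urllib.parse
import urllib.request

WAYBACK_PREFIX = "https://web.archive.org/web/"
AVAILABILITY_API = "https://archive.org/wayback/available?url="

def original_from_wayback(url):
    """Recover the original URL embedded in a Wayback Machine link,
    e.g. https://web.archive.org/web/20200908090515/https://example.com
    -> https://example.com. Returns None for non-Wayback URLs."""
    if not url.startswith(WAYBACK_PREFIX):
        return None
    # After the prefix comes a timestamp (or '*'), then '/', then the URL.
    _, _, original = url[len(WAYBACK_PREFIX):].partition("/")
    return original or None

def closest_snapshot(url, timeout=10):
    """Ask the Wayback availability API for the nearest archived copy
    of `url`; returns the snapshot URL or None. Needs network access."""
    query = AVAILABILITY_API + urllib.parse.quote(url, safe="")
    with urllib.request.urlopen(query, timeout=timeout) as resp:
        data = json.load(resp)
    closest = data.get("archived_snapshots", {}).get("closest", {})
    return closest.get("url") if closest.get("available") else None

def repair_link(url, is_dead):
    """Keep a live link as-is; swap a dead one for its archive snapshot
    when one exists. `is_dead` is whatever liveness check you prefer
    (e.g. a HEAD request returning >= 400)."""
    if not is_dead(url):
        return url
    return closest_snapshot(url) or url
```

Run over the links a checker flags (weekly, as suggested upthread), this keeps canonical URLs canonical while still giving readers a working fallback.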