[HN Gopher] Why I link to Wayback Machine instead of original we... ___________________________________________________________________ Why I link to Wayback Machine instead of original web content Author : puggo Score : 496 points Date : 2020-09-08 08:03 UTC (14 hours ago) (HTM) web link (hawaiigentech.com) (TXT) w3m dump (hawaiigentech.com) | NateEag wrote: | I understand where the author is coming from, but I think the | best approach is to write your content with direct links to the | canonical versions of articles. | | Have a link checking process you run regularly against your site, | using some of the standard tools I've mentioned elsewhere in this | thread: | | https://www.npmjs.com/package/broken-link-checker-local | | https://linkchecker.github.io/linkchecker/ | | When you run the link check (which should be regularly, perhaps | at least weekly), also run a process that harvests the non-local | links from your site and 1) adds any new links' content to your | own local, unpublished archive of external content, and 2) | submits those new links to archive.org. | | This keeps canonical URLs canonical, makes sure content you've | linked to is backed up on archive.org so a reasonably trustworthy | source is available should the canonical one die out, and gives | you your own backup in case archive.org and the original both | vanish. | | I don't currently do this with my own sites, but now I'm | questioning why not. I already have the regular link checks, and | the second half seems pretty straightforward to add (for static | sites, anyway). | axelfreeman wrote: | You could link to the original web url and also do a print | version of the web content as PDF. That's how i archive howtos | and write-ups of interesting content. Print view and create a PDF | version. | drummer wrote: | For anything important you can't beat a good save to pdf feature | in the browser. You can then upload the pdf and link to that | instead. 
Someone should make a wordpress plugin to do this | automatically. | Cthulhu_ wrote: | If it's to actually reference a third party source, it's probably | better to make a self-hosted copy of the page. You can print it | to a PDF file for example. I don't believe archive.org is | eternal, or that its pages will remain the same. | LostJourneyman wrote: | There's some subtle irony in that the linked site is not in fact | a WayBackMachine link, but instead a direct link to the site. | Andrew_nenakhov wrote: | Hmm. is there a place for a service that makes a permanent copy | of content, available at the original url at the time of posting? | luord wrote: | While I generally disagree because I'd rather my site was the one | getting the hits--and I would rather give the same courtesy to | other authors--this does give me the idea of checking (or | creating if none exists) an archive link of whatever I reference, | and include that archive link in the metadata of every link I | include. | | Users will find the archive link if they really want to, and it | will make it easier for me to replace broken links in the future. | hgo wrote: | Maybe the solution isn't technical and we should look at other | fields that have relied on referencing credible sources for a | long time? I can think of research, news and perhaps law. | bartread wrote: | I'm not sure I'm a fan of this because it just turns | WayBackMachine into another content silo. It's called the world | wide web for a reason, and this isn't helping. | | I can see it for corporate sites where they change content, | remove pages, and break links without a moment's consideration. | | But for my personal site, for example, I'd much rather you link | to me directly rather than content in WayBackMachine. Apart from | anything else linking to WayBackMachine only drives traffic to | WayBackMachine, not my site. 
Similarly, when I link to other | content, I want to show its creators the same courtesy by linking | directly to their content rather than WayBackMachine. | | What I can see, and I don't know if it exists yet (a quick search | suggests perhaps not), is some build task that will check all | links and replace those that are broken with links to | WayBackMachine, or (perhaps better) generate a report of broken | links and allow me to update them manually just in case a site or | two happen to be down when my build runs. | | I think it would probably need to treat redirects like broken | links given the prevalence of corporate sites where content is | simply removed and redirected to the homepage, or geo-locked and | redirected to the homepage in other locales (I'm looking at you | and your international warranty, and access to tutorials, Fender. | Grr.). | | I also probably wouldn't run it on every build because it would | take a while, but once a week or once a month would probably do | it. | akavel wrote: | Or: snapshot a WARC archive of the site locally, then start | serving it only in case the original goes down. For extra | street cred, seed it to IPFS. (A.k.a. one of too many projects | on my To Build One Day list.) | nikisweeting wrote: | ArchiveBox is built for exactly this use-case :) | | https://github.com/pirate/ArchiveBox | silicon2401 wrote: | > But for my personal site, for example, I'd much rather you | link to me directly rather than content in WayBackMachine. | | That would make sense if users were archiving your site for | your benefit, but they're probably not. If I were to archive | your site, it's because I want my own bookmarks/backups/etc to | be more reliable than just a link, not because I'm looking out | to preserve your website. Otherwise, I'm just gambling that you | won't one day change your content, design, etc on a whim. | | Hence I'm in a similar boat as the blog author. 
If there's a | webpage I really like, I download and archive it myself. If | it's not worth going through that process, I use the wayback | machine. If it's not worth that, then I just keep a bookmark. | 3pt14159 wrote: | The issue is that if this becomes widespread then we're going | to get into copyright claims against the wayback machine. | When I write content it is mine. I don't even let Facebook | crawlers index it because I don't want it appearing on their | platform. I'm happy to have the wayback machine archive it, but | that's with the understanding that it is a backup, not an | authoritative or primary source. | | Ideally, links would be able to handle 404s and fall back, | like we can do with images and srcset in HTML. That way if my | content goes away we have a backup. I can still write updates | to a blog piece or add translations that people send in, and | everyone benefits from the dynamic nature of content, while | still being able to either fall back or verify content at the | time it was published via the wayback machine. | alisonkisk wrote: | Perhaps the wayback machine can help fix that by telling | users to visit the authoritative site and demanding a | confirmation clickthrough before showing the archived | content. | bartread wrote: | > Perhaps the wayback machine can help fix that by | telling users to visit the authoritative site and | demanding a confirmation clickthrough before showing the | archived content. | | I'm trying to figure out if you're being ironic or | serious. | | People on here (rightly) spend a lot of time complaining | about how user experience on the web is becoming terrible | due to ads, pop-ups, pop-unders, endless cookie banners, | consent forms, and miscellaneous GDPR nonsense, all of | which get in the way of whatever it is you're trying to | read or watch, and all of it on top of the more run-of-the-mill UX snafus with which people casually litter | their sites.
| | Your idea boils down to adding another layer of consent | clicking to the mess, to implement a semi-manual redirect | through the WayBackMachine for every link clicked. That's | ridiculous. | | I have to believe you're being ironic, because nobody | could seriously think this is a good idea. | CarCooler wrote: | Agreed; cut the clutter, just like it's kept simple on the HN | website. | notriddle wrote: | There already have been copyright claims against The | Wayback Machine. They've been responding by allowing | site owners to use robots.txt to remove their content. | headmelted wrote: | But it's also not guaranteed to be consistent. What if you | don't delete the content but just change it? (I.e. what if | your opinions change or you're pressured to edit | information by a third party?) | 3pt14159 wrote: | I addressed this. | | > I can still write updates to a blog piece or add | translations that people send in, and everyone benefits | from the dynamic nature of content, while still being | able to either fall back or verify content at the time it | was published via the wayback machine. | | Updates are usually good. Sometimes you need to verify | what was said, though, and for that the wayback machine works. | I agree it would be nice if there was a technical way to | support both, but for the average web request it's better | to link to the source. | [deleted] | ethagnawl wrote: | > If it's not worth that, then I just keep a bookmark. | | I've made a habit of saving every page I bookmark to the | WayBackMachine. To my mind, this is the best of both worlds: | you'll see any edits, additions, etc. to the source material, | and if something you remember has been changed or gone | missing, you have a static reference. I just wish there was | a simple way to diff the two. | | I keep meaning to write browser extensions to do both of | these things on my behalf ... | PaulHoule wrote: | It's a deep problem with the web as we know it.
| | If I want to make a "scrapbook" to support a research project | of some kind. Really I want to make a "pyramid" with a | general overview that is at most a few pages at the top, then | some documents that are more detailed, but with the original | reference material incorporated and linked to what it | supports. | | In 2020 much of that reference material will come from the | web and you are left with doing the "webby" thing (linking) | which is doomed to fall victim to broken links or with | archiving the content which is OK for personal use, but will | not be OK with the content owners if you make it public. You | could say the public web is also becoming a cess pool/crime | scene, where even reputable web sites are suspected of | pervasive click fraud, where the line between marketing and | harassment gets harder to see every day. | alisonkisk wrote: | Is it a deep problem? You can download content you want to | keep. There are many services like evernote and pocket that | can help you with it. | TeMPOraL wrote: | It is, because it ultimately comes down to owner's | control of how their content is being used. | | For example, a modern news site will want the ability to | define which text is "authoritative", and make | modifications to it on the fly, including unpublishing | it. As a reader OTOH, I want a permanent, immutable copy | of everything said site ever publishes, so that silent | edits and unpublishing is not possible. These two | perspectives are in conflict, and that conflict repeats | itself throughout the entire web. | PaulHoule wrote: | Some consumers will want the latest and greatest content. | To please everyone (other than the owner) you'd need to | look at the content across time, versions, alternate | world views,... Thus "deep". | | My central use case is that I might 'scrape' content from | sources such as | | https://en.wikipedia.org/wiki/List_of_U.S._states_and_ter | rit... 
| | and have the process be "repeatable" in the sense that: | | 1. The system archives the original inputs and the | process to create refined data outputs | | 2. If the inputs change the system should normally be | able to download updated versions of the inputs, apply | the process and produce good outputs | | 3. If something goes wrong there are sufficient | diagnostics and tests that would show invariants are | broken, or that the system can't tell how many fingers | you are holding up | | 4. and in that case you can revert to "known good" inputs | | I am thinking of data products here, but even if the | 'product' is a paper, presentation, or report that | involves human judgements there should be a structured | process to propagate changes. | [deleted] | [deleted] | ogre_codes wrote: | I can understand posting a link, plus an archival link just | in case the original content is lost. But linking to an | archival site only is IMO somewhat rude. | mcv wrote: | Would be nice if there's an automatic way to have a link revert | to the Wayback Machine once the original link stops working. I | can't think of an easy way to do that, though. | boogies wrote: | I just use a bookmarklet javascript:void(wi | ndow.open('https://web.archive.org/web/*/'+location.href.repl | ace(/\/$/,%20''))); | | (which is only slightly less convenient than what others have | already pointed out -- the FF extension and Brave built-in | feature). | kevincox wrote: | Another nice solution is to create a "search engine" for | https://web.archive.org/web/*/%s you can then just add the | keyword before the URL (For example I type `<Ctrl-l><Left>w | <Enter>`). Search engines like this are supported by chrome | and firefox. | boogies wrote: | I would love for there to be a site that redirected eg. 
| better.site/ https://www.youtube.com/watch?v=jzwMjOl8Iyo | to https://invidious.site/watch?v=jzwMjOl8Iyo so I could | easily open YouTube links with Invidious, and the same | for Twitter-Nitter, Instagram-bibliogram, Google Maps - | OSM, etc without having to manually remove the beginning | of the URL. I'd presume someone on HN has the skill to do | this similarly to | https://news.ycombinator.com/item?id=24344127 | kevincox wrote: | You can make a "search engine" or bookmarklet that is a | javascript/data URL that does whatever URL mangling you | need. (Other than some minor escaping issues). | | Something like the following should work. You can add | more logic to supoort all of the sites with the same | script or make one per site. | | javascript:document.location="%s".replace(/^https:\/\/www | .youtube.com/, "https://invidious.site") | riffraff wrote: | wikipedia just does "$some-link-here (Archived $archived- | version-link)", and it works pretty well, imo. | II2II wrote: | Agreed, and it shouldn't be too much of a burden to use | since the author was quite clear about it being for | reference materials. The idea isn't all that different from | referring to specific print editions. | notagoodidea wrote: | For me that is the real solution when you know that the | _archived-link_ is the one consulted by the author | /whatever and the normal one being the content (or its | evolution). | jazzyjackson wrote: | Brave browser has this built in, if you end up at a dead link | the address bar offers to take you to wayback machine. | | http://blog.archive.org/2020/02/25/brave-browser-and-the- | way... | bad_user wrote: | This was first implemented in Firefox, as an experiment, | and is now an extension: | | https://addons.mozilla.org/ro/firefox/addon/wayback- | machine_... | liability wrote: | I used this extension for a while but had to stop due to | frequent false positives. 
YMMV | CompuHacker wrote: | There exists a manual extension called Resurrect Pages | for Firefox 57+, with Google Cache, archive.is, Wayback | Machine, and WebCite. | MaxBarraclough wrote: | Either a browser extension, or an 'active' system where your | site checks the health of the pages it links to. | iggldiggl wrote: | > browser extension | | E.g. https://addons.mozilla.org/firefox/addon/wayback- | machine_new... | DavideNL wrote: | Their browser extension does exactly that... | polygot wrote: | I made a browser extension which replaces links in articles and | stackoverflow answers with archive.org links on the date of | their publication (and date of answers for stackoverflow | questions): | https://github.com/alexyorke/archiveorg_link_restorer | FinnLeSueur wrote: | > generate a report of broken links | | I actually made a little script that does just this. It's | pretty dinky but works a charm on a couple of sites I run. | | https://github.com/finnito/link-checker | scruffyherder wrote: | I spent hours getting all the stupid redirects working from | different hosts, domains and platforms. | | People still use RSS either to steal my stuff, or to discuss it | off-site (as if commenting to the author is so scary!), or in a | way that leaves me totally unaware it is happening; so many times | people ask questions of the author on a site like this, or bring | up good points or something worth going further on, that I would | otherwise miss. | | It's a shame pingbacks were hijacked, but the siloing sucks | too. | | Sometimes I forget for months at a time to check other sites; | not every post generates 5000+ hits in an hour.
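Earlier in the thread, boogies asks for a better.site-style redirector that swaps a link's host for an alternative front end (Invidious, Nitter, etc.). The core of that is a small host-rewrite table; a minimal sketch, where the replacement hostnames (invidious.site, nitter.net, bibliogram.art) are illustrative placeholders, since real instances come and go:

```python
from urllib.parse import urlsplit, urlunsplit

# Host rewrites; the target instances are examples, not endorsements.
REWRITES = {
    "www.youtube.com": "invidious.site",
    "youtube.com": "invidious.site",
    "twitter.com": "nitter.net",
    "www.instagram.com": "bibliogram.art",
}

def rewrite(url):
    """Swap the hostname of a known site for its alternative
    front end, keeping scheme, path, query, and fragment intact."""
    parts = urlsplit(url)
    host = REWRITES.get(parts.netloc)
    if host is None:
        return url  # not a site we rewrite
    return urlunsplit((parts.scheme, host, parts.path,
                       parts.query, parts.fragment))
```

The same table-driven approach works as a browser "search engine" keyword, a bookmarklet, or a tiny redirect service.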
| abdullahkhalids wrote: | Gwern.net has a pretty sophisticated system for this | https://www.gwern.net/Archiving-URLs | jrochkind1 wrote: | The International Internet Preservation Consortium is | attempting a technological solution that gives you the best of | both worlds in a flexible way, and is meant to be extended to | support multiple archival preservation content providers. | | https://robustlinks.mementoweb.org/about/ | | (although nothing else like the IA Wayback machine exists | presently, and I'm not sure what would make someone else try to | 'compete' when IA is doing so well, which is a problem, but | refusing to use the IA doesn't solve it!) | NateEag wrote: | I use linkchecker for this on my personal sites: | | https://linkchecker.github.io/linkchecker/ | | There's a similar NodeJS program called blcl (broken-link- | checker-local) which has the handy attribute that it works on | local directories, making it particularly easy to use with | static websites before deploying them. | | https://www.npmjs.com/package/broken-link-checker-local | privong wrote: | > There's a similar NodeJS program called blcl (broken-link- | checker-local) which has the handy attribute that it works on | local directories | | linkchecker can do this as well, if you provide it a | directory path instead of a url. | NateEag wrote: | Ah, thanks! I was not aware of that feature. | DeusExMachina wrote: | > generate a report of broken links and allow me to update them | manually just in case a site or two happen to be down when my | build runs. | | SEO tools like Ahrefs do this already. Although, the price | might be a bit too steep if you only want that functionality. | But there are probably cheaper alternatives as well. 
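The build task bartread describes above (check every outbound link, treat redirects as broken, and swap dead ones for Wayback links) could be sketched roughly as follows. The `2020` default timestamp and the 200-only policy are assumptions drawn from the discussion, not settled behavior, and the regex rewrite is deliberately crude; a real tool would parse the HTML:

```python
import re
import urllib.error
import urllib.request

WAYBACK_PREFIX = "https://web.archive.org/web/"

def wayback_url(url, timestamp="2020"):
    """Build a Wayback Machine URL; the timestamp selects the
    snapshot closest to that date."""
    return f"{WAYBACK_PREFIX}{timestamp}/{url}"

def is_broken(status):
    """Per the suggestion above, treat redirects as broken too,
    since dead corporate pages often redirect to the homepage."""
    return status != 200

def check_link(url):
    """Return the HTTP status for url without following redirects
    (0 if the host is unreachable)."""
    class NoRedirect(urllib.request.HTTPRedirectHandler):
        def redirect_request(self, *args, **kwargs):
            return None  # turn redirects into HTTPError
    opener = urllib.request.build_opener(NoRedirect)
    try:
        return opener.open(url, timeout=10).status
    except urllib.error.HTTPError as e:
        return e.code
    except urllib.error.URLError:
        return 0

def rewrite_broken(html, statuses):
    """Swap every broken href for its Wayback equivalent."""
    def repl(m):
        url = m.group(1)
        if is_broken(statuses.get(url, 200)):
            return f'href="{wayback_url(url)}"'
        return m.group(0)
    return re.sub(r'href="(https?://[^"]+)"', repl, html)
```

Run weekly or monthly, as suggested above, either rewriting in place or just emitting a report for manual review.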
| codethief wrote: | > What I can see, and I don't know if it exists yet (a quick | search suggests perhaps not), is some build task that will | check all links and replace those that are broken with links to | WayBackMachine | | Addendum: First, that same tool should - at the time of | creating your web site / blog post / ... - ask the WayBackMachine | to capture those links in the first place. That would actually | be a very neat feature, as it would guarantee that you could | always roll back the linked websites to exactly the time you | linked to them on your page. | ethagnawl wrote: | Doesn't Wikipedia do something like this? If not, the | WBM/Archive.org does something like it on Wikipedia's behalf. | thotsBgone wrote: | I don't care enough to look into it, but I think Gwern has | something like this set up on gwern.net. | zwayhowder wrote: | Not to mention that while I might go to an article written ten | years ago, the Wayback archive won't show me a related article | that you published two years ago, updating the information or | correcting a mistake. | deepstack wrote: | Yeah, at some point the Wayback Machine needs to be on a | WebTorrent/IPFS type of thing, where it is immutable. | toomuchtodo wrote: | https://blog.archive.org/2018/07/21/decentralized-web-faq/ | scruffyherder wrote: | I was surprised when digital.com got purged. | | Then further dismayed that the utzoo Usenet archives were | purged. | | Archive sites are still subject to being censored and | deleted. | alfonsodev wrote: | Is there any active project pursuing this idea? | Taek wrote: | https://github.com/exp0nge/wayback | | Here's an extension to archive pages on Skynet, which is | similar to IPFS but uses financial compensation to ensure | availability and reliability. | | I don't know if the author intends to continue developing | this idea or if it was a one-off for a hackathon. | Sargos wrote: | FileCoin is the incentivization layer for IPFS, both | built by Protocol Labs.
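codethief's addendum above — capture every outbound link at the moment of publishing — can be sketched against the Wayback Machine's Save Page Now endpoint, which accepts a plain GET of `https://web.archive.org/save/<url>`. The link-harvesting half is ordinary HTML parsing; the `link-archiver-sketch` User-Agent is a made-up placeholder:

```python
import urllib.request
from html.parser import HTMLParser

class LinkHarvester(HTMLParser):
    """Collect absolute outbound hrefs from a page's HTML."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                # skip relative/local links; only archive outbound URLs
                if name == "href" and value and value.startswith("http"):
                    self.links.append(value)

def harvest_links(html):
    parser = LinkHarvester()
    parser.feed(html)
    return parser.links

def save_to_wayback(url):
    """Ask Save Page Now to capture url; a plain GET of
    /save/<url> triggers a crawl."""
    req = urllib.request.Request(
        "https://web.archive.org/save/" + url,
        headers={"User-Agent": "link-archiver-sketch/0.1"})
    urllib.request.urlopen(req, timeout=60)
```

Running this once at publish time, and recording the snapshot timestamp next to each link, is what makes the roll-back codethief describes possible later.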
| nikisweeting wrote: | The largest active project doing this (to my knowledge) is | the Inter-Planetary Wayback Machine: | | https://github.com/oduwsdl/ipwb | | There have been many other attempts though, including | internetarchive.bak on IPFS, which ended up failing because | it was too much data. | | http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/i... | | http://brewster.kahle.org/2015/08/11/locking-the-web-open-a-... | deepstack wrote: | I'm hoping someone here on Hacker News will pick it up and | apply for the next round at ycombinator. A non-profit would | be better than for-profit in this case. Blockchain-ish | tech would be perfect for this. If in a few years no | one does, then I'll do it. | 1vuio0pswjnm7 wrote: | What if your personal site is, like so many others these days, | on shared IP hosting like Cloudflare, AWS, Fastly, Azure, etc.? | | In the case of Cloudflare, for example, we as users are not | reaching the target site, we are just accessing a CDN. The nice | thing about archive.org is that it does not require SNI. | (Cloudflare's TLS 1.3 and ESNI work quite well AFAICT, but they | are the only CDN who has it working.) | | I think there should be more archive.org's. We need more CDNs | for users as opposed to CDNs for website owners. | bad_user wrote: | The "target site" is the URL from the author's domain, and | Cloudflare is the domain's designated CDN. The user is | reaching the server that the webmaster wants reachable. | | That's how the web works. | | > _The nice thing about archive.org is that it does not | require SNI_ | | I fail to see how that's even a thing to consider. | 1vuio0pswjnm7 wrote: | If the user follows an Internet Archive URL (or Google | cache URL or BING cache URL or ...), does she still | reach "the server the webmaster wants reachable"?
| | SNI, more specifically sending domain names in plaintext | over the wire when using HTTPS, matters to the IETF because | they have gone through the trouble of encrypting the server | certificate in TLS 1.3, and eventually they will be | encrypting SNI. If you truly know "how the web works", then | you should be able to figure out why they think domain | names in plaintext are an issue. | [deleted] | uniqueid wrote: | Yeah, that's another problem with the design of the web, and kind | of a significant one! Somewhat pointless to link to external | documents when half of them won't be around next year. | nullandvoid wrote: | I experienced this just the other day. | | I was browsing an old HN post from 2018, with lots of what seemed | like useful links to their blog. | | Upon visiting it, the site had been rebranded and the blog entries | had disappeared. | | The Wayback Machine saved me in this case, but a link to it | originally would have saved me a few clicks. | ique wrote: | Just another reason to have content-addressable storage | everywhere: then at least if it changed you'll know it changed, | and if you can't get the original content anymore, the change | is probably malicious. | fornowiamhere wrote: | > _Now it's spam from a site suffering financial need._ Well, | yeah! | | Of course, linking to the WBM is not the main reason why a site | might be in this situation, but it piles up. | asdfman123 wrote: | > So in Feb 14 2019 your users would have seen the content you | intended. However in Sep 07 2020, your users are being asked to | support independent Journalism instead. | | Can you believe it? Yesterday, I tried to walk out of the grocery | store with a head of lettuce for free, and they instead were more | interested in making me pay money to support the grocery and | agricultural business! | monktastic1 wrote: | Right.
I thought it was pretty bad form for him to call this | "spam," as though they're the ones wronging _him._ | 8bitsrule wrote: | Gotta completely agree ... for anything you need to be stable and | available. | | I've been building lists of -reference- URLs for over a decade | ... and the ones aimed at Archive.org (are slower to load, but) | are much more reliable. | | Saved Wayback URLs contain the original site URL. It's really | easy to check it to see if the site has deteriorated (usually it | has). If it's gotten better ... it's easy to update your saved WB | link. | shortformblog wrote: | This man's entire argument is completely terrible for two | reasons: | | 1) The example he uses is The Epoch Times, a questionable source | even on the best of days. | | 2) What he refers to as "spam" is a paywall. He is literally | taking away from business opportunities for this outlet that | produced a piece of content he wants to draw attention to, but he | does not want to otherwise support. | | He's a taker. And while the Wayback Machine is very useful for | sharing archived information, that's not what this guy is doing. | He's trying to undermine the business model of the outlets he's | reading. | | The Epoch Times is one thing--it's an outlet that is essentially | propaganda--but when he does this to a local newspaper or an | actual independent media outlet, what happens? | zdw wrote: | For reference: https://en.wikipedia.org/wiki/Epoch_Times | | They're hyper right wing Qanon/antivax spreaders associated | with the Falun Gong movement. | Ensorceled wrote: | > 2) What he refers to as "spam" is a paywall. He is literally | taking away from business opportunities for this outlet that | produced a piece of content he wants to draw attention to, but | he does not want to otherwise support. | | For the destination site, this is all of the downsides of AMP | with none of the upsides. | aldo712 wrote: | Here's a WayBackMachine Link to this article. 
:) | https://web.archive.org/web/20200908090515/https://hawaiigen... | lizardmancan wrote: | Also, something to take home from this is that we all think we | have an idea of what the WWW is or amounts to, while in reality it | is changing all the time at a much more dramatic rate than we can | see or indeed imagine. Depending on current events, very large | numbers of new sites are created and new top-of-index content | is written, and an even larger amount vanishes. When new | topics mature and their angles are reasonably fleshed out, the | incineration wave kicks in again and POOF, we have a whole new | WWW. And since most content is rarely linked to, the | problem is much larger still. Naive people think the value of | content is also static. They can of course advertise that opinion, | but for the rest of us to just accept it as gospel? We should be | outraged that "they" can delete it. Then we can truly feel the | totalitarianism of it. | koboll wrote: | This seems like a problem that would be better solved by | something like: | | 1. Browsers build in a system whereby if a link appears dead, | they first check against the Wayback Machine to see if a backup | exists. | | 2. If it does, they go there instead. | | 3. In return for this service, and to offset costs associated | with increased traffic, they jointly agree to financially support | the Internet Archive in perpetuity. | nikisweeting wrote: | Or link to your own archive of the content with ArchiveBox! | | That way we're not all completely reliant on a central system. | (ArchiveBox submits your links to Archive.org in addition to | saving them locally.) | | https://github.com/pirate/ArchiveBox | | There are many other tools that can do this too: | | https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Comm...
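The dead-link fallback koboll outlines above is close to what the Wayback Machine's availability API already exposes: given a URL (and optionally a date), it returns the closest archived snapshot. A hedged sketch of a client, with the JSON field names following the API's documented `archived_snapshots`/`closest` shape:

```python
import json
import urllib.parse
import urllib.request

API = "https://archive.org/wayback/available"

def availability_query(url, timestamp=""):
    """Build the availability-API query; an optional YYYYMMDD
    timestamp asks for the snapshot closest to that date."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return API + "?" + urllib.parse.urlencode(params)

def closest_snapshot(response):
    """Extract the closest archived URL from a decoded API
    response, or None if the page was never captured."""
    snap = response.get("archived_snapshots", {}).get("closest")
    if snap and snap.get("available"):
        return snap["url"]
    return None

def fallback_url(url):
    """Resolve url to an archived copy, or None."""
    with urllib.request.urlopen(availability_query(url), timeout=10) as r:
        return closest_snapshot(json.load(r))
```

A browser, extension, or 404 handler could call something like `fallback_url` only after the original request fails, which keeps traffic (and hits) going to the canonical site first.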
| [deleted] | rmoriz wrote: | I once discovered an information leak of German public | broadcasting organization ARD which leaked real mobile numbers on | their CI/CD page where they showed the business card designs | (lol). | | All records of this page on Archive.org were deleted after a | couple of days, a twitter account posting the details with a | screenshot and link was reported and my account temporarily | suspended. | | I assume it must be very easy to remove inconvenient content from | archive.org. | | (in German) https://blog.rolandmoriz.de/2019/04/25/sind-die- | leute-von-de... | krapp wrote: | Apropos of nothing but I added the ability to archive links in | Anarki a few months back[0]. If dang or someone wants to take it | for HN they're welcome to. Excuse the crappy quality of my code | and pr format, though. | | It might be useful as a backup if the original site starts | getting hugged to death. | | [0]https://github.com/arclanguage/anarki/pull/179 | rkagerer wrote: | I link to the original, but archive it in both WayBackMachine and | Archive.is. | scruffyherder wrote: | So it can be deleted too? | | Or so there is no engagement at the source? | mountainb wrote: | Link rot has convinced me that the web is not good for its | ostensible purpose. I used to roll my eyes reading how academic | researchers and librarians would discourage using webpages as | resources. Many years later, it's obvious that the web is pretty | bad for anything that isn't ephemeral. | sfg wrote: | We have deposit libraries in the U.K., such as The British | library and Oxford University's Bodleian. When you publish a | book in the U.K. you are supposed to offer a copy to these | institutions. | | If we had legal deposit web archiving institutions, then | academics, and others, could create an archive snapshot of some | resource and then reference the URI to that (either with or | without the original URI), so as to ensure permanence. 
| ImaCake wrote: | >I used to roll my eyes reading how academic researchers and | librarians would discourage using webpages as resources. | | While this is true in general, I am amused that this is _not_ | true for citing wikipedia. Wikipedia can be trusted to remain | online for many more years to come. And it has a built-in | wayback machine in the form of Revision History. | mountainb wrote: | Try following the references on big Wiki pages and you will | see why Wikipedia pages are nightmarish for any kind of | research. This is important when you are trying to drill down | to the sources of various claims. Many major pages relating | to significant events and concepts are riddled with rotted | links. | | The page can be completely correct and accurate, but if you | cannot trace the references then it cannot be verified and | you cannot make the claims in a new work as a result. The | whole point of references is to make it so that the claims | can be independently verified. Even when there isn't a link | rot problem you will often find junk references that cannot | be verified. | | Wikipedia isn't a bad starting point and sometimes you can | find good references. But it is not anywhere close to | reliable: just trace the references in the next 20 Wiki | articles you read and your faith will be shaken. | techphys_91 wrote: | Usually a reference indicates that an author believes | something to be true, but won't explicitly state their | reasons. It isn't just a statement of where information comes | from, but a justification for trusting that information. If | the reference is from a reputable source, then it indicates | that this belief is justified. If an author believes | something to be true because they read it on wikipedia, then | that belief probably isn't justified, because the reliability | of wikipedia content is mixed. 
| | Good quality information on wikipedia often refers back to | published sources, and at the very least an author should | check that source and refer to it, rather than wikipedia | itself. | scruffyherder wrote: | After someone published an authoritative FTP listing, many | people panicked because theirs were out-of-date and insecure | versions, so rather than patch they all went dark. | | Anyone doing research just got screwed. | | So many papers have code linked to places that don't exist | anymore. | k1m wrote: | I think this is a good idea, especially because the | WayBackMachine uses good content security policies to block | some of the intrusive JS that ad-dependent sites like to push on | people. So you're not only protecting against future 404 scenarios, | but also protecting your visitors' privacy from unscrupulous ad- | tech, which seems to be everywhere now. | | The example provided in the article, showing how a site looked | cleaner before, could simply be the content security policies at | the WayBackMachine preventing the clutter from getting loaded, | rather than any specific changes on the site - although I haven't | checked that particular site. | prgmatic wrote: | I stopped reading after the part where they describe the paywall- | gated version of the journalism website as "Now it's spam from a | site suffering financial need." | | That website spends money creating content for commercial | viability; it doesn't have to bow to you and make sure you can | consume it for free, and the Wayback Machine isn't a tool for you | to bypass premium content. | s9w wrote: | In practice however, archive.org did censor content based on | political preference. | encom wrote: | Sounds plausible, but I sure would like a citation for that | claim. | s9w wrote: | I do have two links in my "clownworld" link list, but | ironically they're both in subreddits that have since been | banned and are therefore not available anymore.
| dependenttypes wrote: | They exclude Snopes and I think Salon from archiving. | romwell wrote: | Good idea, but why not both (i.e. link to a webpage, _and_ to the Archive)? | | Linking only to the Archive makes the Archive a single point of failure. | thunderrabbit wrote: | Agreed. I usually link to both the original and then archive.org in parentheses. | sseneca wrote: | Yes, this makes the most sense in my opinion: | | Check out [this link](https://...) ([archived](https://...)) | | This can also help in the event of a "hug of death". | roberto wrote: | This is what I do on my blog, with some additional metadata:
|         <p>
|           <a data-archive-date="2020-09-01T22:11:02.287871+00:00"
|              data-archive-url="https://web.archive.org/web/20200901221101/https://reubenwu.com/projects/25/aeroglyphs"
|              href="https://reubenwu.com/projects/25/aeroglyphs">Aeroglyphs</a>
|           <span class="archive">
|             [<a href="https://web.archive.org/web/20200901221101/https://reubenwu.com/projects/25/aeroglyphs">archived</a>]
|           </span>
|           is an ongoing series of photos of nature with superimposed
|           geometrical shapes drawn by drones.
|         </p>
| dredmorbius wrote: | The WBM link includes the canonical source clearly within the URL. | romwell wrote: | Yeah, and the non-technical users will surely understand that what they need to do when the link doesn't work is: | | 1. Recognize that it's an Archive.org URL | | 2. Understand that the link references an archived page whose URL is "clearly" referenced as a parameter | | 3. Edit the URL (especially pleasant on a cell phone) correctly and try loading that | | If you expect the user to be able to go through all this trouble when the Archive is down, you can also expect them to look up the page on the Archive if the link does not load. | | But better yet, one shouldn't expect either. | iib wrote: | By the way the archive works, isn't the link just adding https://web.archive.org/web/*/ before the actual link?
I guess linking to both is especially important for people who don't know about the existence of archive.org, and a small convenience for everyone. But the link seems to be reversible in either direction. | hinkley wrote: | I wonder if the anchor tag should be altered to support this? | | Alternatively, this is a good thing for a user agent to handle natively, or through a plugin. | cornedor wrote: | But how certain is the future of the WayBackMachine? When disaster strikes, all your links are dead. On the other hand, the original links can still be read from the URL, so the original reference is not completely gone. | dredmorbius wrote: | INTERNETARCHIVE.BAK: | | _The INTERNETARCHIVE.BAK project (also known as IA.BAK or IABAK) is a combined experiment and research project to back up the Internet Archive's data stores, utilizing zero infrastructure of the Archive itself (save for bandwidth used in download) and, along the way, gain real-world knowledge of what issues and considerations are involved with such a project. Started in April 2015, the project already has dozens of contributors and partners, and has resulted in a fairly robust environment backing up terabytes of the Archive in multiple locations around the world._ | | https://www.archiveteam.org/index.php?title=INTERNETARCHIVE.... | | Snapshots from 2002 and 2006 are preserved in Alexandria, Egypt. I hope there's good fire suppression. | | https://www.bibalex.org/isis/frontend/archive/archive_web.as... | phendrenad2 wrote: | I wish there were a way to get a low-res copy of their entire archive. So, only text; no images, binaries, or PDFs (other than PDFs converted to text, which they seem to do). As it stands, the archive is so huge that the barrier to mirroring is high. | dredmorbius wrote: | Agreed.
| | When scoping out the size of Google+, one of ArchiveTeam's recent projects, it emerged that the typical size of a post was roughly 120 bytes, but total page weight a minimum of 1 MB, for roughly a 0.01% payload-to-throw-weight ratio. This seems typical of much of the modern Web. And that excludes external assets: images, JS, CSS, etc. | | If _just the source text and sufficient metadata_ were preserved, all of G+ would be startlingly small -- on the order of 100 GB, I believe. Yes, posts _could_ be longer (I wrote some large ones), and images (associated with about 30% of posts by my estimate) blew things up a lot. But the scary thing is actually how _little_ content there really was. And while G+ certainly had a "ghost town" image (which I somewhat helped define), it wasn't _tiny_ --- there were plausibly 100 - 300 million users with substantial activity. | | But IA's WBM has a goal and policy of preserving the Web _as it manifests_, which means one hell of a lot of cruft and bloat. As you note, increasingly a liability. | ta8908695 wrote: | The external assets for a page could be archived separately though, right? I would think that the static G+ assets: JS, CSS, images, etc. could be archived once, and then all the remaining data would be much closer to the 120 B of real content. Is there a technical reason that's not the case? | dredmorbius wrote: | In theory. | | In practice, this would likely involve recreating at least some of the presentation side of numerous changing (some constantly) Web apps. Which is a substantial programming overhead. | | WARC is dumb as rocks, from a redundancy standpoint, but also atomically complete, independent (all WARCs are entirely self-contained), and reliable. When dealing with billions of individual websites, these are useful attributes. | | It's a matter of trade-offs. | nikisweeting wrote: | So archive your links yourself with one of the many local-web-archiving tools.
| | https://webrecorder.io | | https://github.com/pirate/ArchiveBox | | https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Comm... | oblio wrote: | Doesn't the link to the WayBackMachine contain the original link? | INTPenis wrote: | Yeah, my thoughts were more about the way the Waybackmachine is funded. | | I don't feel comfortable sending a bunch of web traffic to them for no reason other than it being convenient. The wayback machine is a web archival project, not your personal content proxy to make sure your links don't go stale. | | They need our help both in funding and in action; one simple action is not to abuse their service. | sanitycheck wrote: | Precisely my first thoughts, too. It's an archive, not a free CDN. | | I hope the author of this piece considers donating and promoting donation to their readers: https://archive.org/donate/ | Lex-2008 wrote: | A WayBackMachine alternative, archive.is, has an option to download a zip archive of the HTML with images and CSS (but no JS) - this way you can preserve and host a copy of the original webpage on your own website. | moonchild wrote: | Or just wget -rk... | | Mirroring a website isn't so hard that you need a service to do it for you. Your browser even has such a function; try ctrl-s. | abricot wrote: | The "SingleFile" plugin is a better version of ctrl+s. It will save a page as a single html file and even include images as an octet stream in the file so they aren't missed. | peq wrote: | I would be careful about mirroring a site. It's very likely to violate copyright or similar laws, depending on where you are. I think archive.org is considered fair use, but if you put it on a personal or even business page it might be different. For example, Google News in the EU is very limited in what content they may steal from other web pages. | j1elo wrote: | This is a bad idea for the reasons that other commenters have already stated. If WayBackMachine falls, all links would fall.
| Actually the "Web" would stop being one if all links are within the same service. | | For docs and other texts, I just link to the original site and add an (Archive) suffix, e.g. the "Sources" section in https://doc-kurento.readthedocs.io/en/latest/knowledge/nat.h... | | That is a simple and effective solution; yes, it is a bit more cumbersome, but it does not bother me. | euske wrote: | This is both a good and a scary idea: for the good part, I'm frustrated enough that some unscrupulous websites (even some news outlets) secretly alter their contents without mentioning the change. I want a mechanism that holds the publisher responsible. At the same time, this is scary because we're basically using one private organization as a single arbitrator. (I know it's a nonprofit, but they're probably not as public as a government entity.) Maybe it's good for the time being, but we should be aware that this is a solution that's far from perfect. | anaganisk wrote: | Public "or" a government entity. | icemelt8 wrote: | Just FYI, archive.org is banned in a few countries, including the UAE, so I cannot open any links from there. | dirtnugget wrote: | Huh, I wonder if they are also blocking mirrors. Also, in countries with restrictions on internet access you probably want to make using TOR a general habit. | kibibu wrote: | Can we update this link to point to the archive version? | drummer wrote: | Brilliant | arnoooooo wrote: | On the same topic, I wish I could link with highlights in the page. Having a spec for highlights in URLs would be neat. | basscomm wrote: | Chrome 80 supports this: | https://www.chromestatus.com/feature/4733392803332096 | [deleted] | bherb wrote: | Here, I fixed your link: | https://web.archive.org/web/20200908090515/https://hawaiigen... | shemnon42 wrote: | Came here for this. Have my upvote. | EllieEffingMae wrote: | I maintain a fork of a program that does exactly this!
You can check it out here: | | https://github.com/Lifesgood123/prevent-link-rot | celsoazevedo wrote: | Is there any WordPress plugin that adds a link to the WayBack Machine next to the original link? I would use something like that. | dredmorbius wrote: | Perhaps: https://wordpress.org/plugins/media-library-internet-archive... | aargh_aargh wrote: | Look at the format of the wayback machine URL. It's trivial to generate. | | Where a WP plugin would add value is by saving to the archive whenever WP publishes a new or edited article. | sebastianconcpt wrote: | Clever way to make the reference immutable. | | Some blockchain will end up taking care of this. | imhoguy wrote: | This is building yet another silo and point of failure. We can't route the entire Internet's traffic through the WayBackMachine, as its resources are limited. | | Most preservation solutions are like that, and in the end funding or business priorities (Google Groups) become a serious problem. | | I think we need something like the web itself - distributed, and dead easy to participate in and contribute preservation space to. | | Look, there are torrents that have been available for 17 years [0]. Sure, some uninteresting ones are long gone, but there is always a little chance somebody still has the file and someday comes online with it. | | I know about IPFS/Dat/SSB, but still, that stuff, like Bitcoin, is too complex for a layman contributor with a plain altruistic motivation. It should be like SETI@Home - fire and forget. Eventually integrated with a browser to cache content you star/bookmark and share it when it is offline. | | [0] https://torrentfreak.com/worlds-oldest-torrent-still-alive-a... | TheSpiceIsLife wrote: | This behaviour should be reported to the WayBackMachine as abuse.
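The URL format aargh_aargh mentions really is trivial to generate: a Wayback snapshot URL is just a fixed prefix, a 14-digit YYYYMMDDhhmmss timestamp (or `*` for the snapshot index), and the original URL. A hedged sketch, with helper names of my own invention; note that real Wayback URLs can also carry timestamp modifiers (such as `id_`), which this ignores:

```python
# Sketch of generating and reversing Wayback Machine URLs.

WAYBACK_PREFIX = "https://web.archive.org/web/"


def to_wayback(url: str, timestamp: str = "*") -> str:
    """Build a Wayback Machine URL for `url`.
    The default "*" timestamp links to the snapshot index; a 14-digit
    YYYYMMDDhhmmss timestamp links to the snapshot nearest that moment."""
    return f"{WAYBACK_PREFIX}{timestamp}/{url}"


def from_wayback(wayback_url: str) -> str:
    """Recover the original URL embedded in a Wayback Machine URL."""
    rest = wayback_url[len(WAYBACK_PREFIX):]  # "<timestamp>/<original url>"
    _, original = rest.split("/", 1)
    return original


link = to_wayback("https://example.com/some-article", "20200908090515")
# link == "https://web.archive.org/web/20200908090515/https://example.com/some-article"
assert from_wayback(link) == "https://example.com/some-article"
```

This reversibility is what several commenters rely on: even if the archive is down, the canonical URL can be read straight out of the archive link.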
| cpcallen wrote: | This seems like a risky strategy, what with the pending lawsuit against archive.org over their National Emergency Library: I am fully expecting that web.archive.org will go away permanently within a few years. | ffpip wrote: | The wayback machine helps me on a daily basis. So many old links are dead. | | The other day, I noticed that even old links from the front page of Google and Youtube are dead now. Internet Archive still has them. These were links on the front page of YT. Was very disappointed that even Google has dead links. | lizardmancan wrote: | The real problem here is that URLs provide only a single method to obtain content. Combined with the registrars' rent-seeking scheme, we are left with flimsy technology. | | I implemented this one time for images when a bunch of free image hosts I was using failed:
|         <img src="http://example.com/img.jpg" data-x="0"
|              data-uri='data:image/gif,GIF89a%1...'
|              onerror="a=['http://example.com/img.jpg',
|                          'http://example.com/img2.jpg',
|                          'http://example.com/michael-faraday.jpg',
|                          this.dataset.uri];
|                       this.src=a[this.dataset.x++]">
| ffpip wrote: | You can create a bookmark in Firefox to save a link quickly. | | Bookmark Location: https://web.archive.org/save/%s | | Keyword: save | | So searching 'save https://news.ycombinator.com/item?id=24406193' archives this post. | | You can use any keyword instead of 'save'. | | You can also search with https://web.archive.org/*/%s | bad_user wrote: | Does that `save` keyword work? | | The problem is %s gets escaped, so Firefox generates this URL, which seems to be invalid: | | https://web.archive.org/save/https%3A%2F%2Fnews.ycombinator.... | aendruk wrote: | Uppercase %S for unescaped, e.g.: | | https://web.archive.org/web/*/%S | bad_user wrote: | Ah, nice, thanks! | ffpip wrote: | web.archive.org automatically converts the https%3A%2F things to https:// for me. I noticed it many times.
| | If you are still facing problems, go to https://web.archive.org . In the bottom right 'Save page now' field, right click and select 'add keyword for search'. Choose your desired keyword. | fireattack wrote: | >web.archive.org automatically converts the https%3A%2F | | Did you try the link provided by the one you replied to? | | Because it says "HTTP 400" here, so apparently it doesn't convert well, at least not on my end. | kilroy123 wrote: | Nice. I forgot how you can do that. | | I just use the extension myself: | | https://addons.mozilla.org/en-US/firefox/addon/wayback-machi... | ffpip wrote: | Yeah. That requires access to all sites. I wasn't comfortable adding another addon with that permission. | | The permission is just for a simple reason and should be off by default. It is so you can right click a link on any page and select 'archive' from the menu. Small function, but requires access to all sites. | robotron wrote: | The source is available if you want to know what's going on with those permissions: | https://github.com/internetarchive/wayback-machine-chrome | ffpip wrote: | Thanks. I already knew that. I'm familiar with the dev's extensions. Clear Browsing Data and Captcha Buster are very useful. | badsectoracula wrote: | One issue I have with this extension is that it randomly pops up the 'this site appears to be offline' message (which overlays the entire page) even when the site actually works (I hit the back button and it appears). I have had it installed for some time now, and so far I have almost daily false negatives; only once has it actually worked as intended. | | Also, there doesn't seem to be a way to open a URL directly from the extension, which seems a weird omission, so I end up going to the archive site anyway, since I very often want to find old long-lost sites. | fireattack wrote: | It pops up when there is an HTTP 404 status code or similar returned.
So these false negatives are likely due to specific sites being configured in a wacky way. | | (Don't get me wrong, it is still very annoying for the user regardless of what the cause is.) | badsectoracula wrote: | Does it pop up for _any_ 404 error? If so, it might be some script or font or whatever resource the site itself is using that would otherwise fail silently. If not... then there has to be some other bug/issue, because I get it for many different sites that shouldn't have it. | fireattack wrote: | Nope, only for the "main" page (for lack of a better word), and when there _is_ an archive for it. | eruci wrote: | WBM is like a content snapshot. You can't go back in time and change anything. That's why it is better than linking to the original. | wila wrote: | The idea of being able to access the URL once it is gone is good. However, this also means that any updates made to the original page are no longer seen. | | Not all updates are about "begging for money" as in the example in the article. | markjgraham wrote: | We suggest/encourage people link to original URLs but ALSO (as opposed to instead of) provide Wayback Machine URLs so that if/when the original URLs go bad (link rot) the archive URL is available, or to give people a way to compare the content associated with a given URL over time (content drift). | | BTW, we archive all outlinks from all Wikipedia articles from all Wikipedia sites, in near-real-time... so that we are able to fix them if/when they break. We have rescued more than 10 million so far from more than 30 Wikipedia sites. We are now working to have Wayback Machine URLs added IN ADDITION to Live Web links when any new outlinks are added... so that those references are "born archived" and inherently persistent. | | Note, I manage the Wayback Machine team at the Internet Archive. We appreciate all your support, advice, suggestions and requests.
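The link-plus-archive-URL pattern markjgraham recommends is easy to automate against the Wayback Machine's public availability API (`https://archive.org/wayback/available`), which returns the closest snapshot for a URL as JSON. A hedged sketch; the function names are illustrative, the network call itself is left to the caller, and the sample response below is abbreviated from the API's documented shape:

```python
import json
from typing import Optional
from urllib.parse import urlencode

# Sketch of querying the Wayback Machine availability API to find the
# closest archived snapshot for a URL.


def availability_query(url: str, timestamp: str = "") -> str:
    """Build the availability-API request URL for `url`.
    `timestamp` (YYYYMMDDhhmmss) optionally biases toward a point in time."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return "https://archive.org/wayback/available?" + urlencode(params)


def closest_snapshot(response_body: str) -> Optional[str]:
    """Extract the closest available snapshot URL from an API response, if any."""
    snap = json.loads(response_body).get("archived_snapshots", {}).get("closest")
    return snap["url"] if snap and snap.get("available") else None


# Abbreviated example of the response shape:
sample = ('{"archived_snapshots": {"closest": {"available": true, '
          '"timestamp": "20200908090515", '
          '"url": "https://web.archive.org/web/20200908090515/https://example.com/"}}}')
assert closest_snapshot(sample) == "https://web.archive.org/web/20200908090515/https://example.com/"
```

A periodic job could feed each outlink through this and store the snapshot URL alongside the original, which is essentially the "born archived" workflow described above.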
| arendtio wrote: | I always wonder about rising hosting costs in the wake of people linking to the Wayback Machine from popular sites. | | How do you think about it? | Arkanosis wrote: | This is so much better than INSTEAD. | | Not for the sole reason that it leaves some control to the content owner while ultimately leaving the choice to the user, but also because things like updates and errata (e.g. retracted papers) can't be found in archives. When you have both, it's the best of both worlds: you have the original version, the updated version, and you can somehow have the diff between them. IMHO, this is especially relevant when the purpose is reference. | tracker1 wrote: | I mostly agree... however, given how many "news" sites are now going back and completely changing articles (headlines, content) without any history, I think it's a mixed bag. | | Link rot isn't the only reason why one would want an archive link instead of the original. Not that I'd want to overwhelm the internet archive's resources. | jhallenworld wrote: | It's interesting to think about how HTML could be modified to fix the issue. Initial thought: along with HREF, provide AREF: a list of archive links. The browser could automatically try a backup if the main one fails. The user should be able to right-click the link to select a specific backup. Another idea is to allow the web-page author to provide a rewrite rule to automatically generate wayback machine (or whatever) links from the original. This seems less error-prone, and browsers could provide a default that authors could override. | | Anyway, the fix should work even with plain HTML. I'm sure there are a bunch of corner cases and security issues involved. | | Well, as mentioned by others, there is a browser extension. It's interesting to read the issues people have with it: | | https://addons.mozilla.org/en-US/firefox/addon/wayback-machi...
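Short of new browser support, jhallenworld's rewrite-rule idea can be approximated today as a static-site build step: post-process the HTML so every external link gains an "(archived)" companion, much like the markup roberto posted earlier. A rough sketch under stated assumptions: the regex only handles the simplest anchor form, and the fixed `SNAPSHOT` timestamp is a placeholder for a real per-link snapshot date.

```python
import re

# Sketch of a build-step that appends an "[archived]" link, pointing at a
# generated Wayback Machine URL, after each absolute http(s) anchor.
# A production version would use an HTML parser, not a regex.

SNAPSHOT = "20200908000000"  # placeholder: ideally the page's publication date


def add_archive_links(html: str) -> str:
    def rewrite(match: re.Match) -> str:
        href = match.group(1)
        archived = f"https://web.archive.org/web/{SNAPSHOT}/{href}"
        return f'{match.group(0)} [<a href="{archived}">archived</a>]'

    # Only touch absolute http(s) links; leave relative/internal links alone.
    return re.sub(r'<a href="(https?://[^"]+)">[^<]*</a>', rewrite, html)
```

For example, `add_archive_links('<p>See <a href="https://example.com/a">this</a>.</p>')` leaves the canonical link in place and appends a bracketed archive link after it, which keeps canonical URLs canonical while giving readers a fallback.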
| devenblake wrote: | Yup, I've been using the extension for probably about a year now and get the same issues they do. It really isn't that bad; most of the time backing out of the message once or twice does the trick, but it's funny because most of the time I get that message when going to the IA web uploader. | shortformblog wrote: | This is literally where my brain was going, and I was glad to see someone went in the same direction. Given the <img> tag's addition of srcset in recent years, there is precedent for doing something more with href. | javajosh wrote: | So this is a little indirect, but it does avoid the case where the Wayback machine goes down (or is subverted): include a HASHREF which is a hash of the state of the content when linked. Then you could find the resource using the content-addressable system of your choice. (Including, it must be said, the wayback machine itself.) | punnerud wrote: | I love the feature that you can easily add a page to the archive: | https://web.archive.org/save/https://example.com | | Replace https://example.com in the URL above. I try to respect the cost of archiving by not saving the same page too often. | Ziggy_Zaggy wrote: | Kudos for doing what you do. | michaelanckaert wrote: | In the past I would fall back to the WBM when something was no longer online. Though recently I've been bookmarking interesting content very rigorously and just rely on the archival feature of my bookmarking software. | codetrotter wrote: | By that reasoning, shouldn't you be using WayBack Machine links when posting your own content to HN, instead of posting direct links? | dirtnugget wrote: | He is actually showcasing a very nice technique to get around paywalls: turn off JS. Often that's enough to get around the paywall. I believe the archives also disable JS when grabbing the content. | rchaud wrote: | That is changing.
I've noticed over the past couple of years that sites that could be accessed with JS turned off are now showing a "Please enable Javascript to continue" message (Quora) or just hiding the content entirely (Business Insider). | | I'm sure there are other examples as well. | dirtnugget wrote: | Not surprised. When paywalls started becoming a thing, most of them could be circumvented simply by removing a DOM element and some CSS classes. Nowadays this is basically not possible anywhere anymore. | not2b wrote: | It's probably better to link to both. If a site corrects a story, your readers will want to see the correction, but if the page disappears, it's good to have the backup. | AnonHP wrote: | WayBackMachine is slow (slower than many bloated websites). So it's not a good enough experience for the person clicking on that link. | | Secondly, I personally don't like the fact that WayBackMachine doesn't provide an easy way to get content removed and to stop indexing and caching content (the only way I know is to email them, with delayed responses or responses that don't help). It's far easier to get content de-indexed in the major search engines. I know that the team running it has some reasons to archive anything and everything (as) permanently (as possible), but it doesn't serve everybody's needs. | tannhaeuser wrote: | The proper way is for a site to expose a canonical link to an article via a meta-link (rel=canonical) if necessary, and then have a browser plugin automatically try archive.org with a URL generated from the canonical one if it is down. | LoSboccacc wrote: | Has the waybackmachine stopped retroactively applying robots.txt? | | If not, links to it are one misconfiguration or one parked domain away from being wiped. | runxel wrote: | While I certainly wouldn't do this with every page, and also not every time, I got so anxious about link rot lately that I save any good content I come across to the Waybackmachine out of reflex.
| | The use of the bookmarklet makes this really convenient. | ImAlreadyTracer wrote: | Is there a chrome app that utilises waybackmachine? | CaptArmchair wrote: | So, this is the problem of the persistence of URLs always referencing the original content, regardless of where it is hosted, in an authoritative way. | | It's an okay idea to link to WB, because (a) it's de facto assumed to be authoritative by the wider global community and (b) as an archive it provides a promise that its URLs will keep pointing to the archived content come what may. | | Though, such promises are just that: promises. Over a long period of time, no one can truly guarantee the persistence of a relationship between a URI and the resource it references. That's not something technology itself solves. | | The "original" URI still does carry the most authority, as that's the domain on which the content was first published. Moreover, the author can explicitly point to the original URI as the "canonical" URI in the HTML head of the document. | | Finally, when you link to the WB machine, what do you link to? A specific archived version? Or the overview page with many different archived versions?
Which of those versions is currently endorsed by the original publisher, and which are deprecated? How do you know this? | | WB also supports linking to the very latest version. If the archive is updated frequently enough, I would say it is reasonable to link to that if you use WB just as a mirror. In some cases I've seen error pages being archived after the original page has been moved or removed, but that is probably just a technical issue caused by some website misconfiguration or bad error handling. | susam wrote: | I think the fundamental problem here is that URLs locate resources. We find the desired content by finding its location given by an address. Now, what server or content lives at that address may change from time to time or may even disappear. This leads to broken links. | | The problem with linking to the Wayback Machine is that we are still writing location-based URLs, just ones pointing at Wayback Machine servers. What guarantee is there that those archive.org links will not break in the future? | | It would have been nice if the web were designed to be content-addressable. That is, the identifier or string we use to access content addresses the content directly, not a location where the content lives. There is good effort going on in this area in the InterPlanetary File System (IPFS) project, but I don't think the mainstream content providers on the Internet are going to move to IPFS anytime soon. | spurgu wrote: | I think a good solution might be to host the archive version yourself (archive.org is slow, and always using it centralizes everything there).
| | Let's say you write an article on your site, https://yoursite.com/my-article, and from it you want to link to an article https://example.com/some-article | | You then create a mirror of https://example.com/some-article to be served from your site at https://yoursite.com/mirror/2019-09-08/some-article (put /mirror/ in robots.txt and set it to noindex (or maybe even better, put a rel="canonical" towards the original article?)), and at the top of this mirrored page you add a header bar thingy containing a link to the original article, as well as one to archive.org if you so want. | | tl;dr instead of linking to https://example.com/some-article you link to https://yoursite.com/mirror/2019-09-08/some-article (which has links to the original) | andy_ppp wrote: | It would be good to create a distributed, consensus version (to help stop edits) of the content rather than have a single point of failure... | zoid_ wrote: | I find that web archive pages always appear broken --- perhaps a lot of js or css is not properly archived? | dltj wrote: | Take a look at _Robustify Your Links_.[1] It is an API and a snippet of JavaScript that saves your target HREF in one of the web archiving services and adds a decorator to the link display that offers the user the option to view the web archive. | | [1] https://robustlinks.mementoweb.org/about/ | spqr233 wrote: | I made a chrome extension called Capsule that works perfectly for this use case. With just a click, you can create a publicly shareable link that preserves the webpage exactly as you see it in your browser. | | https://capsule.click | nikisweeting wrote: | Does it use SingleFile under the hood? What storage format does this use, and is it portable? e.g. WARC/memento/zim/etc? | outsomnia wrote: | This is a bad idea... | | In the worst case one might write a cool article and get two hits: one noticing it exists, and the other from the archive service.
After that it might go viral, but the author may have | given up by then. | | The author is losing out on inbound links so google thinks their | site is irrelevant and gives it a bad pagerank. | | All you need to do is get archive.org to take a copy at the time, | you can always adjust your link to point to that if the original | is dead. | ethanwillis wrote: | Google shouldn't be the center of the Web. They could also | easily determine where the archive link is pointing to and not | penalize. But I guess making sure we align with Google's | incentives is more important than just using the Web. | bartread wrote: | > Google shouldn't be the center of the Web. | | I agree, but are you suggesting it's going to be better if | WayBackMachine is? | ethanwillis wrote: | That's a strawman because I never said they should be. | There's room for better alternatives. | | We as a community need to think bigger rather than | resigning ourselves to our fate. | bartread wrote: | It's not a strawman because (a) I agreed with you, (b) | context, and (c) I asked a question based on what you | seemed to be implying in that context: a question to | which you still haven't provided an answer. | | Let me put it another way: what specifically are you | suggesting as an alternative? | ethanwillis wrote: | If I had to pick a solution from what's available right | now technology wise I'd pick something that links based | on content hashes. And then pulls the content from | decentralized hosting. | | I don't think I like IPFS as an organization, but tech | wise it's probably what I'd go with. | encom wrote: | Yes. At least Archive.org isn't an evil mega corporation | destroying the internet. Yet. | rriepe wrote: | We'll see what their new owners do after the lawsuit. | rchaud wrote: | Every search engine uses the number of backlinks as one of | the key factors in influencing search rank; it's a | fundamental KPI when it comes to understanding whether a link | is credible. 
| | What is true for Google in this regard is also true of Bing, DDG and Yandex. | luckylion wrote: | > But I guess making sure we align with Google's incentives is more important than just using the Web. | | It's not about Google's incentives. It's about directing the traffic where it should go. Google is just the means to do so. | | Build an alternative. I'm sure nobody _wants_ Google to be the number one way of finding content, it's just that they are, so pretending they're not and doing something that will hurt your ability to have your content found isn't productive. | johannes1234321 wrote: | One can also do it similarly to Wikipedia's references sections, which link to the original and to the memento in the archive. (Once the bot notices it's gone.) | | Additional benefit: Some edits are good (addenda, typo corrections, etc.) | marcus_holmes wrote: | I totally agree. | | I guess the answer is "don't mess with your old site", but that's also impractical. | | And I'm sorry, but if it's my site, then it's _my_ site. I reserve the right to mess about with it endlessly. Including taking down a post for whatever reason I like. | | I'm sorry if that conflicts with someone else's need for everything to stay the same, but it's _my_ site. | | Also, if you're linking to my article, and I decide to remove said article, then surely that's my right? It's _my_ article. Your right to not have a dead link doesn't supersede my right to withdraw a previous publication, surely? | pingpongchef wrote: | You can go down this road, but it looks like you're advocating for each party to simply do whatever he wants. In which case the viewing party will continue to value archiving. | mitchdoogle wrote: | I certainly don't know about legal rights, but I think the ethical thing is to make sure that any writings published as freely accessible should remain so forever.
What would people | think if an author went into every library in the world to | yank out one of their books that they no longer wanted seen? | | I do think the author is wrong to _immediately_ post links to | archived versions of sources. At the least, he could link to | both the original and the archived version. | fwip wrote: | Why is that the most ethical thing to do? | | As a motivating example, I wrote some stuff on my MySpace | page as a teenager that I'm very glad is no longer | available. It was published as "freely accessible" and | indeed, I wanted people to see it. But when I read it back | 15 years later, I was more than a little embarrassed about | it, and I deleted it - despite it also having comments from | my friends at the time, and being referenced in their pages. | | No great value was contained in those works. | marcus_holmes wrote: | I'm not sure I agree. I know that journalism (as a | discipline) considers this ethical. I kinda get that this | is part of the newspaper industry as a public service - | that withdrawing publication of something, or changing it | without alerting the reader to the change, alters the | historical record. | | But no-one has a problem with other creative industries | withdrawing their publications. Film-makers are forever | deciding that movies are no longer available, for purely | commercial reasons. Why is writing different? Why is | pulling your books from a library unethical but pulling | your movie from distribution OK? | | I think we either need to extend this to all creative | activity, or reconsider it for writing. | falcolas wrote: | This has a very easy answer for me: it's _not_ ethical | for film-makers to decide that movies are no longer | available. | | Copyright was created to encourage publication of | information, not to squirrel it away. Copyright should be | considered the exception to the standard: public domain. | fwip wrote: | Why not?
| | Is it unacceptable for an artist to throw her art away | after it has finished its museum tour? Should a parent | hang on to every drawing their child has ever made? | | If you are a software developer - is all of the code | you've ever written still accessible online, for free? | (To the legal extent that you are able, of course.) | | Have you written a blog before, or did you have a | MySpace? Have you taken care to make sure your creative | work has been preserved in perpetuity, regardless of how | you feel about the artistic value of displaying your teen | emotions? | | Consider why you feel it is unethical for the author or | persons responsible for the work to ever stop selling it. | falcolas wrote: | > Is it unacceptable for an artist to throw her art away | after it has finished its museum tour? Should a parent | hang on to every drawing their child has ever made? | | This boils down to the public domain, IMO. We have made a | long practice of rescuing art from private caches and | trash bins to make it publicly available after the | artists' passing (the copyright expiring), regardless of | their views on what should happen with those works. | | > Consider why you feel it is unethical for the author or | persons responsible for the work to ever stop selling it. | | Selling something and then pulling it down is | fundamentally an attempt to create scarcity for something | that would otherwise be freely available. It's a | marketing technique that capitalizes on our fear of | missing out to make a sale. | | Again, the right to even sell writings was enshrined in | law as an _exception_ to the norm of it immediately | being part of the public domain, in an effort to | encourage more writing. | Fargren wrote: | > But no-one has a problem with other creative industries | withdrawing their publications | | I wouldn't say no one has a problem with this. It does | happen, but it certainly doesn't make everyone happy.
I | for one would like all released media to be | available, or at least not actively removed from access. | badprose wrote: | Publishing on your own website is more akin to putting up a | signboard on your front lawn than writing a book for | publication. | | People are free to view it and take pictures for their own | records, but I could still take it down and put something | else up. | bryanrasmussen wrote: | There's no reason that PageRank couldn't be adapted to take | Wayback Machine URLs into account. If there is a link with a URL | pointing at | https://web.archive.org/web/*/https://news.ycombinator.com/ | Google could easily register that as a link to both resources - | one to web.archive.org, the other to the site. | | There is also no reason why that has to become a slippery | slope, if anyone is going to ask "but where do you stop!!" | dmitriid wrote: | After all, they did change their search to accommodate AMP. | Changing it to take the Web Archive into account is a) peanuts and | b) actually better for the web. | TheSpiceIsLife wrote: | There's a business idea in there somewhere. | | Some kind of CDN-edge-archive hybrid. | britmob wrote: | "CDN-Whether-You-Want-It-Or-Not" | quickthrower2 wrote: | Foreverspin meets Cloudflare | scruffyherder wrote: | Even worse is when you have people using RSS to wholesale copy | your site and its updates, and again the traffic and, more | importantly, the engagement disappear. | | It's very demotivating. | acatton wrote: | archive.org sends the HTTP header Link: | <https://example.com>; rel="original" | | This can be used by search engines to adjust their ranking | algorithms. | stratigos wrote: | I link to WayBackMachine as I've built a great many greenfield | applications for startups as a freelancer, which only existed for | about 6-8 months before hitting their burn rate. If I linked to | their original domains, my portfolio would be a list of 404s. | jakeogh wrote: | If it's not distributed, it is going to disappear.
| | The Wayback Machine is backed by WARC files. It's perhaps the only | thing on archive.org that can't be downloaded... well, except the | original MPG files for 9/11 news footage. | | https://news.ycombinator.com/item?id=20623177 | wolco wrote: | No one has touched on this, but the experience of viewing through the | Wayback Machine is awful. | | Media often won't be saved, so pages look broken. Iframes, | and the iframe-breakers on original sites, can kill any | navigation. | | The Wayback Machine is okay for research but a poor replacement | for a permalink. | ethagnawl wrote: | > Media often won't be saved, so pages look broken. | | In my experience, this has gotten much, much better in the last | few years. I haven't explored enough to know if this is part of | the archival process or not, but I've noticed on a few | occasions that assets will suddenly appear some time after | archiving a page. For instance, when I first archived this page | (https://web.archive.org/web/20180928051336/https://www.intel.. | .), none of the stylesheets, scripts, fonts or images were | present. However, after some amount of time (days/weeks) they | suddenly appeared and I was able to use the site as it | originally appeared. | yreg wrote: | I'm all for Archive.org. However, using it in this way -- setting | up a mirror of some content and purposefully diverting traffic to | said mirror -- is copyright infringement (freebooting), as it | competes with the original source. | samatman wrote: | This is such a fundamental problem that I'd like to be able to | solve it at the HTML level. | | An anchor type which allows several URLs to be tried in order | would go a long way. Then we could add automatic archiving and | backup links to a CMS. | | It isn't real content-centric networking, which is a pity, but | it's achievable with what we have. | hownottowrite wrote: | Awesome. Hey, mods...
Can you change the link on this post to | http://web.archive.org/web/20200908090515/https://hawaiigent... | ashishb wrote: | I wrote a link checker[1] to detect outbound links and mark dead | links so that I can replace them manually with archive.org | links. | | 1 - https://github.com/ashishb/outbound-link-checker ___________________________________________________________________ (page generated 2020-09-08 23:00 UTC)
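[Editor's note] Several comments in the thread converge on one workflow: check your outbound links, and when one dies, swap in an archive.org snapshot; conversely, the original URL can be recovered from a Wayback link itself (the point bryanrasmussen and acatton raise). Below is a minimal sketch of that workflow in Python using only the standard library. The Wayback availability API (archive.org/wayback/available) and the web.archive.org/web/<timestamp>/<url> link shape are real, but the helper names and the pluggable `is_dead` check are illustrative assumptions, not any commenter's actual implementation.

```python
import json
import urllib.parse
import urllib.request

WAYBACK_PREFIX = "https://web.archive.org/web/"
AVAILABILITY_API = "https://archive.org/wayback/available?url="

def original_from_wayback(url):
    """Recover the original URL embedded in a Wayback Machine link,
    e.g. https://web.archive.org/web/20200908090515/https://example.com
    -> https://example.com. Returns None for non-Wayback URLs."""
    if not url.startswith(WAYBACK_PREFIX):
        return None
    # After the prefix comes a timestamp (or '*'), then '/', then the URL.
    _, _, original = url[len(WAYBACK_PREFIX):].partition("/")
    return original or None

def closest_snapshot(url, timeout=10):
    """Ask the Wayback availability API for the nearest archived copy
    of `url`; returns the snapshot URL or None. Needs network access."""
    query = AVAILABILITY_API + urllib.parse.quote(url, safe="")
    with urllib.request.urlopen(query, timeout=timeout) as resp:
        data = json.load(resp)
    closest = data.get("archived_snapshots", {}).get("closest", {})
    return closest.get("url") if closest.get("available") else None

def repair_link(url, is_dead):
    """Keep a live link as-is; swap a dead one for its archive snapshot
    when one exists. `is_dead` is whatever liveness check you prefer
    (e.g. a HEAD request returning >= 400)."""
    if not is_dead(url):
        return url
    return closest_snapshot(url) or url
```

Run over the links a checker flags (weekly, as suggested upthread), this keeps canonical URLs canonical while still giving readers a working fallback.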