[HN Gopher] SingleFile: Save a complete web page into a single H...
       ___________________________________________________________________
        
       SingleFile: Save a complete web page into a single HTML file
        
       Author : crbelaus
       Score  : 604 points
       Date   : 2022-03-02 14:55 UTC (8 hours ago)
        
 (HTM) web link (github.com)
        
       | IggleSniggle wrote:
       | What a cool project! I love the way this embeds images. One of
       | the things I miss most, though, when going back to old sites, is
       | embedded audio or video. From looking at the options, it seems
       | like it might be able to handle encoding video and/or audio as
       | Data URIs, but it's not totally clear if SingleFile does this or
       | not. I wasn't sure if I was doing the correct things to force
       | this behavior in the options. It would be great if the README
       | could clarify how these are handled by SingleFile. Sometimes it
       | might be nice to be able to embed these sorts of things, even if
       | it does make the HTML ridiculous and bloated. Or, barring that,
       | maybe just a recommendation to use one of the other formats in
       | the comparison table for this kind of use case.
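
A minimal sketch of the Data URI idea raised above, not SingleFile's actual implementation. The helper name `to_data_uri` and the zero-filled stand-in clip are assumptions made for illustration. It shows why embedding audio or video this way bloats the HTML: base64 turns every 3 bytes into 4 characters.

```python
import base64


def to_data_uri(payload: bytes, mime_type: str) -> str:
    """Encode raw bytes as a base64 data: URI."""
    encoded = base64.b64encode(payload).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Stand-in for a real media file; an actual clip would be read from disk.
fake_clip = bytes(3000)

uri = to_data_uri(fake_clip, "video/mp4")

# Ratio of embedded size to original size (about 4/3 plus a short prefix).
overhead = len(uri) / len(fake_clip)
print(f"{len(fake_clip)} bytes -> {len(uri)} characters ({overhead:.2f}x)")
```

The resulting string can be placed in a `src` attribute; whether a browser copes well with a very large one is a separate question.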
        
       | manigandham wrote:
       | Relevant 'awesome' list for web archiving:
       | https://github.com/iipc/awesome-web-archiving
       | 
       | There are many similar tools there, from archiving to rendering.
        
       | abnry wrote:
       | I love, love this extension. I am working on an app to turn this
       | into a single click bookmark system on Linux. Run an inotify
       | service to watch your downloads and then process any SingleFile
       | downloads to a database and update a browsable index.
        
         | jrm4 wrote:
         | TELL ME MORE.
         | 
         | I think I basically get the idea, what kind of database are you
         | using? Recoll sounds like a good idea, but I'm also thinking
         | about how I might also make this public-ish.
         | 
         | (i.e. I teach in college and would love to have a centralized
         | way to store and search all my assigned readings, which are
         | most often webpages)
        
           | abnry wrote:
           | I am not a trained software engineer but...
           | 
           | Each html page is processed by (1) getting url, title, time
           | saved (this is under-rated as approximate time of saving is
           | useful if you want to rediscover) and then (2) taking a
           | screenshot and finally (3) extracting text with
           | readability.js and hopefully doing some keyword analysis.
           | 
           | Right now it is stored in a local SQLite Database, although
           | the article content is stored in text files. For search, I
           | can use ripgrep to look through the associated text files.
           | 
           | The eventual goal is to create a flask app which will allow
           | for interactive management of the bookmarks (tagging,
           | searching). I've already got static generation of bookmarks.
           | 
           | Here's a screenshot: https://imgur.com/5YP4sP5
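
The first step of the pipeline described above (URL, title, time saved, stored in SQLite) could look roughly like the sketch below. It is not abnry's code. It assumes the saved page starts with a comment containing `url:` and `saved date:` lines, and the table name `bookmarks`, the helper names, and the sample page are all hypothetical. The inotify watcher, screenshot, and readability.js steps are left out; a directory watcher would simply call `index_page` for each new file.

```python
import re
import sqlite3
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.title:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_metadata(html: str) -> dict:
    """Pull title, source URL, and save time out of one saved page."""
    parser = TitleParser()
    parser.feed(html)
    # Assumed header layout: "url: ..." and "saved date: ..." on their own lines.
    url = re.search(r"^\s*url:\s*(\S+)", html, re.MULTILINE)
    saved = re.search(r"^\s*saved date:\s*(.+?)\s*$", html, re.MULTILINE)
    return {
        "title": parser.title.strip(),
        "url": url.group(1) if url else None,
        "saved": saved.group(1) if saved else None,
    }


def index_page(db: sqlite3.Connection, html: str) -> None:
    """Insert one page's metadata into the (hypothetical) bookmarks table."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS bookmarks (url TEXT, title TEXT, saved TEXT)"
    )
    meta = extract_metadata(html)
    db.execute(
        "INSERT INTO bookmarks VALUES (?, ?, ?)",
        (meta["url"], meta["title"], meta["saved"]),
    )
    db.commit()


sample = """<!DOCTYPE html> <html><!--
 Page saved with SingleFile
 url: https://example.com/article
 saved date: Wed Mar 02 2022 14:55:00 GMT+0000
--><head><title>Example article</title></head><body><p>Hello</p></body></html>"""

db = sqlite3.connect(":memory:")
index_page(db, sample)
rows = db.execute("SELECT url, title, saved FROM bookmarks").fetchall()
print(rows)
```

Keeping the article text in separate files, as described, leaves full-text search to a tool like ripgrep and keeps the database small.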
        
         | m-p-3 wrote:
         | I archived (privately) some documentation pages from some of
         | our vendors that were behind a login page using that just in
         | case it became inaccessible at a critical time for us.
        
         | samstave wrote:
         | WANT
        
         | makeworld wrote:
         | You might like https://archivebox.io/, I think it can do this
         | for you and then some.
        
         | rhn_mk1 wrote:
         | I'm using Recoll for this exact purpose. Just without inotify.
        
         | sitkack wrote:
         | This sounds neat.
        
       | causi wrote:
       | This is great for a page. I'd love to see it expanded to include
       | an entire site.
        
       | dgellow wrote:
       | That's a nice and simple tool, good work. I'm personally using
       | Zotero to save copies of web pages: https://www.zotero.org/. With
       | the browser extension you can save a snapshot in a few seconds.
        
         | gildas wrote:
         | Zotero is actually using SingleFile under the hood to save web
         | pages ;)
        
           | dgellow wrote:
           | Oh, that's nice :)
        
       | js8 wrote:
       | I use SavePageWE, it can save the page (into single file) as it
       | was modified by JS after load, which is often useful.
       | 
       | The only thing I miss: I wish it was easier to script.
        
       | rambambram wrote:
       | I have been using WebScrapBook (an add-on for Firefox) for some
       | time. I really like it. Has anyone else some experience with this
       | add-on? Good or bad.
        
         | jjice wrote:
         | I've been using it for a couple years (2 maybe) and I like it
         | quite a bit as a quick and easy way to save pages. ArchiveBox
         | looks fantastic, but I just don't have the motivation to set up
         | the service and maintain it since I don't save enough links to
         | make it worthwhile. SingleFile might be worth a shot, but it
         | looks like WebScrapBook has been handling your needs just fine
         | (they seem to have 90% of the same functionality).
        
           | vageli wrote:
           | As a webscrapbook user, do you know if there is a migration
           | path from pocket or another hosted service?
        
             | rambambram wrote:
             | Don't know about a migration option, but I do remember
             | there's a lot of custom configuration possible.
        
           | rambambram wrote:
           | Thanks!
           | 
           | ArchiveBox does indeed look fantastic. Their homepage alone
           | is beautiful.
           | 
           | I bookmarked both ArchiveBox and now also SingleFile, but
           | WebScrapBook gets the job done (in almost all cases).
        
       | sharps1 wrote:
       | Should be noted Manifest V3 will break this extension for
       | chromium based browsers.
       | 
       | https://github.com/gildas-lormeau/SingleFile-Lite
        
         | photon-torpedo wrote:
         | Love the list of notable "features". :)
        
           | a1445c8b wrote:
           | Also this:
           | 
           | > Benefits of the Manifest V3
           | 
           | > - None
        
       | black3r wrote:
       | Can we please stop with the 17MB GIF images used as demos? They
       | use up lots of data immediately as you open the page, and are
       | impractical, you don't know how long the animation is, can't
       | forward/rewind, and you can't press fullscreen on a mobile.
       | 
       | And GitHub supports embedded videos in README.md files, videos
       | are generally smaller than GIF files and their disabled autoplay
       | is a feature: you save your data until you press play.
        
         | andrewmcwatters wrote:
         | I wish browsers came standard, preconfigured with warning
         | dialogs that triggered if assets attempting to load were beyond
         | some threshold. That threshold could be decided by the browser
         | vendors group based on some collection of network statistics
         | and be adjusted on an annual basis or so.
        
         | wackget wrote:
         | https://old.reddit.com/r/firefox/comments/aaek23/how_to_stop...
        
           | black3r wrote:
           | The issue is mainly with mobile browsers, as mobile data is
           | expensive, and Firefox on iPhone doesn't have about:config.
        
         | foobarbecue wrote:
         | GitHub only recently expanded video support from gif to decent
         | video formats, and many github enterprise installs don't have
         | those new features yet. So, keep spreading the word.
        
         | gildas wrote:
         | Author here, sorry for the GIF file. I created it because
         | people were not happy with the video hosted on Youtube. AFAIK,
         | video files did not work when I did this demo. I'll try to
         | improve this in the future.
        
           | tux1968 wrote:
           | Would you mind sharing which tools you used to create that
           | demo? It is really well done.
        
             | gildas wrote:
             | Sure! See here
             | https://news.ycombinator.com/item?id=30530438
        
           | localhost wrote:
           | I wanted to comment on how useful that demo was to me. It did
           | a great job at demonstrating why this is useful and how well
           | it works compared to the native browser implementation. Thank
           | you both for the demo and for the project!
        
             | gildas wrote:
             | Thanks :)
        
         | lostgame wrote:
         | Giving a massive upvote for this, disappointed and confused to
         | see you've been downvoted here. There's literally no reason to
         | use GIFs like this, and - as you stated, it's massively
         | disrespectful to those not fortunate enough to have broadband
         | connections, but would like access to the information.
         | 
         | Using data so wastefully like this always reeks of privilege to
         | me - especially on something like GitHub. Wikipedia, for
         | instance, never allows things like this.
        
           | Zababa wrote:
           | Sharing a project with the world and taking time to document
           | it reeks of privilege? I really can't understand your
           | reasoning.
        
           | pizza234 wrote:
           | > disappointed and confused to see you've been downvoted here
           | 
           | Because it's a relatively new feature, and probably, a lot of
           | devs don't know about it (I didn't).
           | 
           | I did this [animated gif] once actually, before the feature
           | was introduced, and I definitely hated it, but I had no
           | choice.
           | 
           | Thanks for bringing this to the general attention, though :)
        
         | bob1029 wrote:
         | I think there is some nuance here.
         | 
         | If the demo sequence is <5 seconds, I have never found myself
         | becoming impatient. Gif is perfect for very brief demos.
         | Anything longer than that and I'd like to have some idea where
         | I am at in the video stream (and other controls as indicated)
        
         | jazzyjackson wrote:
         | > GitHub supports embedded videos in README.md files
         | 
         | True since May 2021 so I think a lot of people are still
         | finding this out...
         | 
         | In my experience GIF is still the most set-it-and-forget-it way
         | to know a video will play, to get cross-platform support out of
         | mp4 you may have to provide two different codecs. Anyway, not
         | disagreeing with you and most gifs could drop 90% of their size
         | with better choice of resolution and framerate. This readme is
         | particularly egregious doing a screen capture with scrolling.
         | 
         | As for saving bandwidth until you want to play, I haven't tried
         | this yet but it seems adequately clever to wrap a loading=lazy
         | gif inside a details/summary tag:
         | https://css-tricks.com/pause-gif-details-summary/
        
           | Melatonic wrote:
           | Not to mention that H264 can take quite a bit of horsepower
           | to decode and play as well (assuming your machine doesn't have
           | a hardware chip specifically for doing just that)
        
             | Mogzol wrote:
             | Is this really still an issue in 2022? How many people are
             | browsing the internet on a device that can't do hardware
             | H264 decoding?
        
               | TingPing wrote:
               | Some browsers have poor hw decoding support on Linux
               | (their problem, not drivers) but it's gotten a lot better
               | recently.
        
             | tambourine_man wrote:
             | Which machine doesn't? Anything in the last 10 or so years
             | will decode H264 with much less power than GIF because of
             | it. Even a Pi supports it.
        
               | jjice wrote:
               | My 2014 Thinkpad X1 Carbon (gen 3) doesn't have hardware
               | transcoding as far as I can tell, which made Zoom and Discord
               | impossible to use for class, especially because there was
               | no way (that I knew of) to disable all video except the
               | presenter. Even playing a YouTube video on it makes it
               | ramp up.
        
               | botdan wrote:
               | I'm not sure which CPU you have specifically but the
               | lowest-end model of the X1 Carbon Gen3 has an i5-5200U
               | [1] that lists Intel Quick Sync Video support.
               | 
               | From the wiki page for Quick Sync [2]:
               | 
               | > Intel Quick Sync Video is Intel's brand for its
               | dedicated video encoding and decoding hardware core.
               | Quick Sync was introduced with the Sandy Bridge CPU
               | microarchitecture on 9 January 2011 and has been found on
               | the die of Intel CPUs ever since.
               | 
               | I can't confirm but I'd guess your performance issues lie
               | elsewhere than in the h264 decoding specifically.
               | 
               | [1] -
               | https://ark.intel.com/content/www/us/en/ark/products/85212/i...
               | 
               | [2] -
               | https://en.wikipedia.org/wiki/Intel_Quick_Sync_Video
        
               | jjice wrote:
               | If you check out the generation-codec table in that
               | wikipedia article [1], under Broadwell (I believe that's
               | the 5200U's generation name), it says there is support
               | for AVC (which I believe is H264, I'm not a codec wiz),
               | so that's a really good point. I'm not sure why I've
               | consistently had issues with this on my machine then. I
               | wonder if this is something with a configuration on Linux
               | then?
               | 
               | Thanks for pointing that out. I've looked at this table
               | before and paid attention to HEVC, not AVC, so I believe
               | that's where my mistake came from.
               | 
               | [1]
               | https://en.wikipedia.org/wiki/Intel_Quick_Sync_Video#Hardwar...
        
               | zerocrates wrote:
               | AVC is H.264, yes.
               | 
               | Accelerated video decode is often disabled by default on
               | Linux versions of browsers and can be quite dependent on
               | versions of drivers/mesa/X-vs-Wayland/etc.
        
               | black3r wrote:
               | YouTube by default prefers newer, bitrate saving codecs
               | over old ones if it thinks your CPU can handle software
               | decoding them. On my 2017 Dell XPS 1080p and lower
               | resolutions on YouTube play in software decoded AV1, only
               | 1440p and higher play in hardware decoded VP9, so playing
               | 4K video on YouTube is less taxing for my CPU than
               | playing a 1080p video....
        
               | folmar wrote:
               | You can use h264ify extension to fix it.
        
           | divbzero wrote:
           | > _to get cross-platform support out of mp4 you may have to
           | provide two different codecs_
           | 
           | Video codecs are not my area of expertise. Which codecs are
           | these and what tool(s) would you typically use to ensure you
           | provide them?
        
         | berkes wrote:
         | > And GitHub supports embedded videos in README.md files
         | 
         | Any documentation on this? Because I have tried to embed video
         | in issues and PRs before, and did not manage. I'm hoping such
         | documentation will explain how this extends to issues and PRs.
        
           | TingPing wrote:
           | In issues it's just drag-n-drop.
        
       | bachmeier wrote:
       | Maybe a little OT, but founders should take a careful look at
       | this landing page. That's how you sell something. The demo is
       | clear about the problem they're trying to solve and it convinced
       | me that their product actually solves it. It's not just all the
       | information they've included, but also the lack of irrelevant
       | clutter.
        
       | wanderer_ wrote:
       | Dang it, he beat me to it! I have been toying with the idea for
       | quite some time, but this implementation is great, better than
       | mine would have been, so I'm glad he did it.
       | 
       | Maybe I'll make a CLI implementation (sorta like wget but with
       | this tacked on...)
        
       | givemeethekeys wrote:
       | Naming a thing takes creativity and luck. Congratulations on an
       | excellent name!
        
       | civilian wrote:
       | I was hoping this tool also solved a problem that comes from
       | saving & reproducing JS-framework-heavy websites.
       | 
       | Here's the bug: According to the HTML spec, elements like <h2> and
       | <div> cannot be inside <a> tags. But using js you _can_ push
       | <div>s inside of <a>s. (It happens from document.insert-type
       | functions, frameworks like Angular/React allow this)
       | 
       | Look at nasa.gov, there's html:
       | 
       |     <a href="/press-release/nasa-invites-media-to-next-spacex-commercial-crew-space-station-launch-0"
       |        date="Wed Mar 02 2022 10:35:00 GMT-0800 (Pacific Standard Time)"
       |        id="ember196"
       |        class="card ubernode cards--card cards--2row cards--2col nodeid-477815 ember-view">
       |       <div class="bg-card-canvas"
       |            style="background-image: url(/sites/default/files/styles/2x2_cardfeed/public/thumbnails/image/51846702013_a0cc55100a_k.jpeg);">
       |         <!---->
       |         <h2 class="headline"> ... </h2>
       |       </div>
       |     </a>
       | 
       | After running this through SingleFile you can visually see the
       | changes, but the html changes are:
       | 
       |     <a href="/press-release/nasa-invites-media-to-next-spacex-commercial-crew-space-station-launch-0"
       |        date="Wed Mar 02 2022 10:35:00 GMT-0800 (Pacific Standard Time)"
       |        id="ember196"
       |        class="card ubernode cards--card cards--2row cards--2col nodeid-477815 ember-view"></a>
       |     <div class="bg-card-canvas"
       |          style="background-image: url(/sites/default/files/styles/2x2_cardfeed/public/thumbnails/image/51846702013_a0cc55100a_k.jpeg);">
       |       <h2 class="headline"> ...</h2>
       | 
       | The way that sites like Wayback Machine handle this is by using
       | the web-replay library Wombat
       | https://github.com/webrecorder/wombat that also uses JS to insert
       | those elements.
       | 
       | But what the hell! I was working on a similar html-
       | downloading/reproducing tool and this bug really bothers me. I'd
       | either like the HTML reading standard to be updated to accept
       | <div> inside of <a>, or _also_ make that impossible to do via JS.
        
         | gildas wrote:
         | I think this issue could be circumvented by manipulating the
         | page (replacing images, frames, css etc.) in the tab itself
         | (SingleFile does it in background with a DOMParser instance).
         | The trick is to avoid HTML parsing.
        
       | zmix wrote:
       | I'd also recommend "Print Edit WE" [1] and "Save Page WE" [2] for
       | Chrome type browsers, both by one author. First one allows for
       | editing of the page before printing/saving (as a single page HTML
       | or MHTML), second one allows for single-page save.
       | 
       | [1] https://chrome.google.com/webstore/detail/print-edit-we/olnb...
       | 
       | [2] https://chrome.google.com/webstore/detail/save-page-we/dhhpe...
        
       | sandes wrote:
       | wget -r url ?
        
       | reidjs wrote:
       | Unfortunately that won't allow you to click links in your offline
       | version. You can do this properly with wget: (sorry I don't know
       | how to do code formatting in hackernews)
       | 
       |     wget --mirror \
       |          --convert-links \
       |          --html-extension \
       |          --wait=2 \
       |          -o log \
       |          https://example.com
        
         | berkes wrote:
         | Are you suggesting to mirror e.g. the entire Wikipedia through
         | wget?
         | 
         | That is not only suboptimal, it is stressful for the server. At
         | least you added a --wait=2, but on any large site/hoster/CDN,
         | this might still get your IP banned or throttled. And on e.g.
         | the English wikipedia this will then take 149 days. Which means
         | that by the time you hit the last page, the first ones (and
         | their links) are out of date.
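
The 149-day figure above can be reproduced with simple arithmetic. The page count below is an assumption, back-derived from that figure and the 2-second wait rather than taken from any official statistic, and download time per page is ignored.

```python
SECONDS_PER_DAY = 24 * 60 * 60

WAIT_SECONDS = 2            # from --wait=2 in the wget command
ASSUMED_PAGES = 6_436_800   # assumption: implied by the 149-day estimate


def crawl_days(pages: int, wait_seconds: float) -> float:
    """Days spent just waiting between requests, ignoring transfer time."""
    return pages * wait_seconds / SECONDS_PER_DAY


days = crawl_days(ASSUMED_PAGES, WAIT_SECONDS)
print(days)  # 149.0
```

Since the wait alone accounts for the whole estimate, a real crawl would take longer once transfer time is added.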
        
           | falcolas wrote:
           | If you add '--no-parent' (doesn't request anything that's not
           | a page dependency above the requested URI) and a '--level=5'
           | (only follows links 5 levels deep), you won't get all of a site. It
           | makes it more realistic for getting wikipedia articles.
        
         | lysium wrote:
         | Looks like SingleFile helps with sites where you have to be
         | logged in, something that is not that easy with wget.
        
         | hombre_fatal wrote:
         | You don't need to newline every flag of a trivial command.
        
           | all2 wrote:
           | I'm guessing the user's intent was to have the command
           | formatted across multiple lines.
        
         | [deleted]
        
         | _dain_ wrote:
         | What are you talking about? I have hundreds of pages saved with
         | SingleFile and I can click links in all of them.
        
           | reidjs wrote:
           | Oh maybe it does work then. I assumed it didn't follow links
           | because they didn't show it in the video.
        
         | z3c0 wrote:
         | Code formatting is just blockquotes.
         | 
         |     So one empty space followed by indented text (2 or more spaces)
        
       | megaman821 wrote:
       | Is this still on track to become a standard?
       | https://github.com/WICG/webpackage
        
       | j1elo wrote:
       | Related: I used to keep a collection of locally mirrored web
       | pages a long time ago, with a legendary Firefox extension called
       | _ScrapBook_ [0] (now long retired). The surprise for me is that
       | after all these years I still remembered the name...
       | 
       | While writing this comment I found that it lived on as a (now
       | "legacy") new extension named _ScrapBook X_ [1], and then yet
       | another one named _WebScrapBook_ [2], which seems to still be
       | alive!
       | 
       | [0]: http://www.xuldev.org/scrapbook/
       | 
       | [1]: https://github.com/danny0838/firefox-scrapbook
       | 
       | [2]: https://addons.mozilla.org/en-US/firefox/addon/webscrapbook/
        
       | wetpaws wrote:
       | Ah, millennials invented .mht
        
       | als0 wrote:
       | This is great. I've always wondered why this isn't the default
       | behaviour for page saving in browsers. To an ordinary user saving
       | a page implies saving a single file, not a file plus a directory
       | of stuff. HAR can be useful but seems only for niche or
       | specialised reasons.
        
       | kwhitefoot wrote:
       | The list of problems that Manifest V3 causes is just more
       | reasons to never use Chrome.
        
       | avivo wrote:
       | Why does this need to:
       | 
       | - Read and change all your data on all websites
       | 
       | - Modify data you copy and paste
       | 
       | - Manage your downloads
       | 
       | Is there a way to use a version that requires fewer of these
       | permissions? e.g. it seems we can address the first permission by
       | only activating it on click, but I'm not sure if that addresses
       | the other ones.
        
         | gildas wrote:
         | I try to use optional permissions as much as I can. The first
         | permission is required because of assets and frames stored on
         | third-party servers. The second permission should be optional,
         | I don't remember why it's not. I'll try to see if I can make it
         | optional. The last permission is required in order to save the
         | page on the filesystem with the "downloads" API. Note that even
         | if I make these permissions optional, you might still have to
         | trust me anyway ;)
        
         | [deleted]
        
       | anned20 wrote:
       | I also want to give praise about the demo. It's one of the best
       | demos I've ever seen with such a project. Nice job!
        
         | netsharc wrote:
         | A 16MB gif with no playback controls, so you had to go through
         | the tedium.
        
           | Minor49er wrote:
           | I would be surprised that the author wasn't using WebM to get
           | a smaller filesize (not to mention higher quality) but the
           | project itself leads me to believe that the author has a lot
           | of free disk space to use
        
             | a1445c8b wrote:
             | There's no need to make further assumptions about the
             | author (who btw took the time to build a very useful tool
             | and share it on the Internet for free). Just point out the
             | issue of the GIF and move along.
        
               | Minor49er wrote:
               | I never made an assumption about the author and certainly
               | never said that the tool wasn't useful. You can feel free
               | to move along yourself, though.
        
       | treeman79 wrote:
       | Iran has a habit of using tools like this to trick defense
       | contractors into using their page.
        
       | dtjohnnymonkey wrote:
       | Thank you! I've been looking for this for a while, nice to see
       | someone finally did it!
        
       | ilrwbwrkhv wrote:
       | Thanks for this. I expected to see a pricing link somewhere,
       | having been attuned to all the subscription Saas these days. Glad
       | to see there are tools offering immense value for free still.
        
         | gildas wrote:
         | It is in fact more or less self-financed by... hmmm... a SaaS
         | that I market but it's in B2B.
        
       | [deleted]
        
       | mysterypie wrote:
       | Security question: Is a web extension safe if it is installed but
       | if you're not using it at the moment? For example, if I were
       | logged into my bank's website and I did _not_ click the
       | SingleFile button in the extension toolbar, could it still
       | theoretically collect info from my bank's webpage or do other
       | actions?
       | 
       | I'd like to use SingleFile and have no reason at all to distrust
       | it, but I'd like to understand the security impact of installing
       | lots of web extensions. How do people handle security risks like
       | that? Do you run a separate vanilla browser with no extensions
       | for sensitive tasks?
        
         | fsflover wrote:
         | If you care about security, consider using Qubes OS with
         | hardware-virtualized VMs for compartmentalization. Then, your
         | Firefox for banking won't have the same extensions which you
         | use elsewhere. Works for me.
        
         | gildas wrote:
         | For technical reasons beyond my control, SingleFile injects a
         | (very small) script when the page loads even if you don't click
         | on the button. It could also send any data to a third party
         | server. Unfortunately, it is therefore impossible for me to
         | technically and formally guarantee that SingleFile cannot
         | behave maliciously. Note however that the extension has the
         | status "recommended" on Firefox and that it undergoes a manual
         | code review by Mozilla at each update.
        
           | fmntf wrote:
           | Could you please elaborate on what script is injected, the
           | reason, and why it is out of your control? Thank you
        
             | gildas wrote:
             | I will do it, but it will take me some time to explain it
             | and rather than answering on HN I will integrate it in the
             | FAQ. I created an issue for this here:
             | https://github.com/gildas-lormeau/SingleFile/issues/885.
        
         | prox wrote:
         | In Firefox you could run a totally different profile.
         | 
         | I don't do this myself, I try to research any extension I add
         | and don't do automatic upgrades. I use as few extensions as
         | possible.
        
       | tzs wrote:
       | > For security reasons, you cannot save pages hosted on
       | https://chrome.google.com, https://addons.mozilla.org and some
       | other Mozilla domains.
       | 
       | Interesting. What is it about those pages that makes saving them
       | raise security issues?
        
         | Isthatablackgsd wrote:
         | That is not an issue with the extension; it's a Google/Mozilla
         | policy.
        
         | amccollum wrote:
         | Maybe because JS files (specifically add-ons) run from the
         | local filesystem are given escalated privileges compared to
         | normal usage, perhaps for ease of development. I'm just
         | speculating, though.
        
       | slmjkdbtl wrote:
       | Does it create an inline dataurl for each image even if they're
       | the same?
        
       | assemblylang wrote:
       | Nice project! This project, and a similar project called
       | Monolith[0], was a bit of an inspiration for making my own single
       | HTML file tool called Humble[1] to solve a few edge cases I was
       | having with bundling pages (and since I wanted a TypeScript API
       | for making page bundles).
       | 
       | [0] https://github.com/Y2Z/monolith
       | 
       | [1] https://github.com/assemblylanguage/humble
        
       | alberth wrote:
       | FYI - there's an official standard (MHTML) for doing this that
       | has existed for 20+ years and exists natively in browsers.
       | 
       | https://en.m.wikipedia.org/wiki/MHTML
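
For readers unfamiliar with the format: MHTML is a MIME multipart/related message in which each part carries a Content-Location header naming the URL it stands in for. The sketch below builds such a bundle with Python's standard `email` package. The function name `build_mhtml`, the URLs, and the fake image bytes are assumptions for illustration, and nothing here guarantees that a given browser will open the result.

```python
from email import message_from_bytes
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_mhtml(page_url: str, html: str, images: dict) -> bytes:
    """Bundle one HTML document and its PNG images as multipart/related."""
    bundle = MIMEMultipart("related", type="text/html")

    # The root document comes first, labelled with its original URL.
    page = MIMEText(html, "html", "utf-8")
    page["Content-Location"] = page_url
    bundle.attach(page)

    # Each resource is attached under the URL the HTML refers to it by.
    for url, png_bytes in images.items():
        part = MIMEImage(png_bytes, "png")  # subtype given, so no sniffing
        part["Content-Location"] = url
        bundle.attach(part)

    return bundle.as_bytes()


raw = build_mhtml(
    "https://example.com/",
    "<img src='https://example.com/a.png'>",
    {"https://example.com/a.png": b"\x89PNG fake image bytes"},
)

# Round-trip through the parser to inspect the structure.
parsed = message_from_bytes(raw)
parts = parsed.get_payload()
print(parsed.get_content_type(), [p["Content-Location"] for p in parts])
```

Unlike a single HTML file with data URIs, the resources stay as separate MIME parts, so an identical image referenced several times only needs to be stored once.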
        
         | setum wrote:
         | IIRC, back in the day mhtml wouldn't save java applets.
        
           | pstuart wrote:
           | Are any sites still using applets these days?
        
             | IYasha wrote:
             | 80% of server IPMI Web control panels. But who would want
             | to save those anyway? :)
        
         | twapi wrote:
         | I use this Chrome extension to save web pages as MHTML:
         | https://chrome.google.com/webstore/detail/save-webpages-offl...
        
         | paulirish wrote:
         | The Chrome engineer who maintains the MHTML work wrote up a
         | comprehensive doc on the modifications on the MHTML spec (RFC
         | 2557) that are implemented:
         | https://docs.google.com/document/d/1FvmYUC0S0BkdkR7wZsg0hLdK...
         | Might be useful for you, gildas.
        
           | gildas wrote:
           | Thank you Paul! I had read this document some time ago,
           | especially to see how the shadow DOM was serialized.
        
         | rplnt wrote:
         | I was gonna say Opera (the old, good one) had this. When saving
         | a page there were some options and one was a single file IIRC.
        
         | rtsil wrote:
         | I remember saving webpages in MHTML when I was using dial-up so
         | that I could read them offline later.
         | 
         | I would also download entire websites using software whose
         | name I forgot, to read them offline. Back when websites fit on
         | a single floppy disk.
         | 
         | Good times!
        
           | TheFlyingFish wrote:
           | I remember using HTTrack for this a while back. Still have a
           | few of those sites lying around, I think.
        
         | domador wrote:
         | Does anyone else get two security warnings whenever you try to
         | save an MHTML page using a Chrome extension? I have to click on
         | one warning's button to confirm that I indeed want to save the
         | "dangerous" file and another to confirm I'm really sure. It's
         | gotten very annoying. I've looked all over for an option to
         | disable this behavior but haven't been able to.
        
         | toqy wrote:
         | For anyone else that didn't read the README, MHTML is mentioned
         | in the comparison section:
         | https://github.com/gildas-lormeau/SingleFile#file-format-com...
        
           | dsl wrote:
           | Take the comparison with a grain of salt. Not including WARC
           | is like excluding water from a comparison of beverages, it is
           | the baseline standard.
        
         | bgro wrote:
         | I've looked into this extensively, as I can't find a good, light,
         | and easy backup option that isn't extreme overkill.
         | 
         | I thought MHTML was NOT standardized, which is why it wasn't
         | supported across all browsers yet. From what I remember, every company
         | was doing their own implementation of it. Maybe it's gotten
         | more standardized the last few years though.
        
           | chungy wrote:
           | I've always thought the "M" stood for "Microsoft" -- wasn't
           | even aware any browsers other than IE supported it.
        
             | chme wrote:
             | There is also CHM which is actually a Microsoft only file
             | format for "Compiled HTML Help" files.
        
               | IYasha wrote:
               | I love this format. Very fast and compact. Entire Visual
               | Studio help was in it once. Worked VERY well. And there's
               | a KDE/Qt reader.
        
         | iKlsR wrote:
         | Over a decade ago I had a laptop but no internet at home. This
         | was one of the ways I taught myself programming (and also
         | downloading dozens of manga) by using internet explorer at a
         | cafe which had an option to save to mhtml which was one file
         | and had everything self contained. Legit owe a portion of my
         | success to this. I still have some of these files, old crusty
         | hello world c++ tutorials etc.
        
           | falcolas wrote:
           | I have fantastic internet, and I still do something similar.
           | Local docs just load so much faster, and if something happens
           | (which it still does, even on Fiber in the US), I have docs
           | and can program.
           | 
           | Lemme see if I can pull up the command I use to mirror doc
           | sites.
           | 
           |     wget \
           |       --recursive \
           |       --level=5 \
           |       --convert-links \
           |       --page-requisites \
           |       --wait=1 \
           |       --random-wait \
           |       --timestamping \
           |       --no-parent \
           |       $1
        
           | a9h74j wrote:
           | For people who cannot afford internet access now, and for
           | perhaps more in the future if times get more difficult, I
           | believe this is a very important use-case.
        
         | geitir wrote:
         | And it generally does not do a good job
        
           | als0 wrote:
           | What are the issues?
        
             | hulitu wrote:
             | From my experience, wrong layout, missing pictures.
        
         | ByThyGrace wrote:
         | > MHTML, (...) is a web page archive format used to combine, in
         | a single computer file, the HTML code and its companion
         | resources (such as images, _Flash animations, Java applets_ ,
         | (...)
         | 
         | Well that goes to show its longevity I guess.
        
         | rpdillon wrote:
         | The browser compatibility section suggests MHTML is unsupported
         | in current versions of Firefox and Safari.
        
           | tekknik wrote:
           | Safari supports webarchive, which does basically the same
           | thing
        
             | gildas wrote:
             | The problem is that it is a proprietary format. The
             | advantage of the format produced by SingleFile (HTML) is
             | that as long as your browser is capable of interpreting
             | HTML, you will be able to read your archives without
             | worries.
        
               | tekknik wrote:
               | Not so proprietary. It's really just a plist file, whose
               | format is known and even open sourced by Apple[1].
               | Really it's only proprietary in that no other platforms
               | have implemented it.
               | 
               | [1]: https://opensource.apple.com/source/CF/CF-550/CFBinaryPList....
        
           | mrspuratic wrote:
           | I don't think it was ever native in Firefox, there is/was the
           | excellent unMHT extension that was broken by
           | Quantum/WebExtensions and The Great XUL Silliness. Shame.
           | 
           | I have Waterfox-Classic and unMHT (fished out of the Classic
           | Addons Archive, just remember to turn off Waterfox's
           | multiprocess feature) since I occasionally need to archive
           | web pages - and more importantly, reopen them later.
           | 
           | mhtml is just MIME, literally every discrete URL as a MIME
           | part with its origin in a Content-Location header, all
           | wrapped in a multipart container. I don't understand why it's
           | not a default format.
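           | 
           | A minimal sketch of that structure, using only Python's
           | standard email package: each resource becomes one MIME part
           | tagged with a Content-Location header, and all parts sit in
           | a multipart/related container. The function names
           | build_mhtml and read_mhtml are made up for illustration, and
           | real browser-generated MHTML carries extra headers that are
           | not reproduced here.

```python
from email import encoders, message_from_bytes, policy
from email.mime.base import MIMEBase
from email.mime.multipart import MIMEMultipart


def build_mhtml(resources):
    """Wrap (url, mime_type, payload_bytes) tuples in one multipart/related message."""
    root = MIMEMultipart("related")
    for url, mime_type, payload in resources:
        maintype, subtype = mime_type.split("/", 1)
        part = MIMEBase(maintype, subtype)
        part.set_payload(payload)
        encoders.encode_base64(part)      # binary-safe transfer encoding
        part["Content-Location"] = url    # original URL of this resource
        root.attach(part)
    return root.as_bytes()


def read_mhtml(raw):
    """Return {url: payload_bytes} for every leaf part of the archive."""
    message = message_from_bytes(raw, policy=policy.default)
    return {
        str(part["Content-Location"]): part.get_payload(decode=True)
        for part in message.walk()
        if not part.is_multipart()
    }
```

           | Reading an archive back is then a plain MIME walk, which is
           | why any mail library can, in principle, unpack such a file.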
        
             | Groxx wrote:
             | I can see WebExtensions breaking it (as it's a completely
             | new set of APIs for extensions, and the losses do
             | definitely still hurt)... but quantum/xul? How is that
             | related, aside from "it happened around the same time"?
        
         | cookiengineer wrote:
         | > FYI
         | 
         | The alternative format (used by the Internet Archive and
         | Wayback Machine) is WARC. It's also a single file, but it
         | preserves the HTTP headers as well, so its application is
         | specifically archival. [1] The "wget" tool, which is
         | co-maintained by the Web Archive people, also has support for
         | it via CLI flags. [2]
         | 
         | Though when it comes to mobile browser support I'd recommend
         | using MHTML, because WebKit and Chromium both have support for
         | it upstream.
         | 
         | [1] http://iipc.github.io/warc-specifications/
         | 
         | [2] https://www.gnu.org/software/wget/wget.html
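         | 
         | To make "preserving the HTTP headers" concrete, here is a
         | rough sketch of a single WARC "response" record built by hand
         | in Python. The helper name warc_response_record is
         | hypothetical, and the record is deliberately simplified: real
         | WARC files usually also contain warcinfo and request records,
         | and are often gzip-compressed per record.

```python
import uuid
from datetime import datetime, timezone


def warc_response_record(url, http_response_bytes):
    """Wrap a raw HTTP response (status line, headers, body) in one WARC record."""
    block = http_response_bytes
    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    header_lines = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {timestamp}",
        f"WARC-Target-URI: {url}",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(block)}",
    ]
    # WARC headers, a blank line, the untouched HTTP message, then two CRLFs.
    return "\r\n".join(header_lines).encode("utf-8") + b"\r\n\r\n" + block + b"\r\n\r\n"
```

         | The HTTP status line and headers travel inside the record
         | block untouched, which is the part a plain saved HTML file
         | throws away.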
        
           | londons_explore wrote:
           | Is there any objection to adding WARC support to
           | webkit/chromium? Seems like a not-so-complex project...
        
             | cookiengineer wrote:
             | I know that WebKit relies on either libsoup [1] (on
             | Linux/Unices) or curl [2] (legacy Windows and maybe WPE(?))
             | as a network adapter, so the header handling and parsing
             | mechanisms would have to be implemented in there.
             | 
             | Though, on MacOS, WebKit tries to migrate most APIs to the
             | Core Foundation Framework, which makes it kind of
             | impossible to implement as a non-Apple-employee because
             | it's basically a dump-it-and-never-care Open Source
             | approach. [3]
             | 
             | Don't know about chromium (my knowledge is ~2012ish about
             | their architecture, and pre-Blink).
             | 
             | [1] https://github.com/WebKit/WebKit/tree/main/Source/WebKit/Net...
             | 
             | [2] https://github.com/WebKit/WebKit/tree/main/Source/WebKit/Net...
             | 
             | [3] https://github.com/opensource-apple/CF
        
               | TingPing wrote:
               | GTK/WPE use libsoup. Playstation/Windows uses curl. And
               | yes Apples networking is proprietary.
        
           | chefandy wrote:
           | WARC is also used by the Webrecorder project. They made an
           | app called Wabac which does entirely client-side WARC or HAR
           | replays using service workers and it seems to have pretty
           | good browser support, but I haven't really dug into the
           | specifics.
           | 
           | https://github.com/webrecorder/wabac.js-1.0
        
         | admax88qqq wrote:
         | Unfortunately mhtml is not widely supported.
        
       | pan69 wrote:
       | In the olden days, Internet Explorer used to allow you to do this
       | by saving the page to a HTM file. It would be a single archive
       | with HTML and images etc embedded.
       | 
       | New browsers don't seem to do this; they create a separate folder
       | for the assets, which is super annoying.
        
         | nickflood wrote:
         | The Chromium Edge can produce .MHT files as well
        
       | xnx wrote:
       | I love SingleFile and have been using it for years! Is there any
       | version that works on current mobile browser versions? I've stuck
       | with an old version of Firefox on Android that still supports the
       | extension.
        
         | gildas wrote:
         | You should be able to use it on Firefox for Android Nightly
         | (which is very stable) by following this procedure:
         | https://blog.mozilla.org/addons/2020/09/29/expanded-extensio...
        
       | moffkalast wrote:
       | This is what 10 year old me thought "Save As" in IE would do, but
       | soon realized the harsh reality of "that's not how any of this
       | works".
        
       | edf13 wrote:
       | The most impressive part of the demo is seeing how tidy his
       | Downloads folder is!
        
       | ctxc wrote:
       | Been eyeing this for a long time!
       | 
       | I'm building a bookmark app, and I plan to use this to save
       | bookmarks!
       | 
       | I'm a simple man, nothing too fancy. Here's a crude demo in
       | progress - https://zewallet.netlify.app/ Follow progress here -
       | https://twitter.com/recursiveSwings/status/14917723874649088...
       | 
       | Would love to have ANY tips or feedback!
        
         | TehShrike wrote:
         | the signup email confirmation link points to
         | http://localhost:3000/ btw
         | 
         | I'm definitely in the market for a bookmark service that
         | archives my bookmarks, Diigo stopped working a year or two ago,
         | and Pinboard can't stay up
        
           | ctxc wrote:
           | Fixed now!
        
           | cxr wrote:
           | Zotero deals with this reasonably well--and happens to be
           | using SingleFile under the hood. Its landing page just
           | targets a different audience (academics), which means
           | probably upwards of 90% of the people who would happily use
           | it probably end up bouncing after thinking, "This isn't for
           | me", before ever trying it. Give it a shot.
        
           | ctxc wrote:
           | Ahh damn, should fix it! For now, you can edit the URL
           | manually to take a peek. If you're interested, feel free to
           | send a DM on Twitter @recursiveSwings, I'll let you know once
           | it's in Beta! :)
        
       | fender256 wrote:
       | You read my mind, I was exactly looking for that!
        
       | didericis wrote:
       | Similar project -> https://github.com/Y2Z/monolith
       | 
       | (I used both and ended up favoring monolith, but can't remember
       | why. I think they're pretty comparable/am grateful for both of
       | them)
        
       | theden wrote:
       | This would be very useful in many situations, and a great demo!
        
       | spankalee wrote:
       | We really, really need Web Bundles to progress and fix these
       | problems correctly, once and for all. There are a lot of things
       | that a tool like this can never get right, and the rest is
       | complicated work that should never need to be done if we have a
       | standard multi-file bundle format.
       | 
       | https://wicg.github.io/webpackage/draft-yasskin-wpack-bundle...
        
       | necovek wrote:
       | Great stuff!
       | 
       | For some reason, I went in expecting to see a JS-enabled
       | multi-page web site turned into a SPA in a single HTML file, but
       | I didn't expect to see images get embedded.
       | 
       | Perhaps offer a recursive traversal option too, but don't try
       | that on Wikipedia :)
        
       | sam0x17 wrote:
       | Back in the day this was always one thing that had me
       | begrudgingly and shamefully opening IE so I could save a page as
       | an MHT file. So long ago now. Cool to see this idea has been
       | revived and not in a proprietary way
        
       | vincentmarle wrote:
       | If it's a single file, then how do the images get stored?
        
         | gildas wrote:
         | Images are stored as data URIs [1]. Note that they could also
         | be stored as entries in a zip file too! [2].
         | 
         | [1] https://en.wikipedia.org/wiki/Data_URI_scheme
         | 
         | [2] https://github.com/gildas-lormeau/SingleFileZ
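         | 
         | As an illustration of the data URI approach (a sketch, not
         | SingleFile's actual code), the Python below base64-encodes a
         | resource and swaps it into an img tag. The helper names and
         | the resources mapping are assumptions, and the regex only
         | handles double-quoted src attributes.

```python
import base64
import re


def to_data_uri(payload, mime_type):
    """Encode raw bytes as a base64 data URI."""
    encoded = base64.b64encode(payload).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def inline_images(html, resources):
    """Replace <img src="URL"> with a data URI when URL is a key in resources.

    resources maps URL -> (mime_type, payload_bytes).
    """
    def replace(match):
        url = match.group(2)
        if url not in resources:
            return match.group(0)  # leave unknown URLs untouched
        mime_type, payload = resources[url]
        return f'{match.group(1)}{to_data_uri(payload, mime_type)}"'

    return re.sub(r'(<img\b[^>]*?\bsrc=")([^"]*)"', replace, html)
```

         | Note that this naive version repeats the encoded payload for
         | every occurrence of the same image, and base64 adds roughly a
         | third to the size of each one.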
        
         | danielam wrote:
         | They're base64 encoded[0]. (This is an approach I myself have
         | used in the past for simplifying the archival of regulatory
         | texts.)
         | 
         | [0] https://github.com/gildas-lormeau/SingleFile/blob/15801c8ef4...
        
       | codeflo wrote:
       | Does this simply remove the JavaScript or do something more
       | clever? Because I think in the age of SPAs, the proper way to
       | save "content pages" might be to execute the JavaScript once and
       | serialize the resulting DOM back to HTML. I didn't find anything
       | in the FAQ that explains if it does something like that.
        
         | gildas wrote:
         | It saves what you see (and removes JS by default). There is an
         | option to embed the JS and another one to save the "raw" page
         | but I would not say it is reliable. The cleverness lies more in
         | the ability to produce light pages.
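         | 
         | For readers wondering what "remove JS" can look like on
         | already-serialized HTML, here is a small standard-library
         | sketch in Python. It is not how SingleFile works internally;
         | the class and function names are invented, and it only drops
         | script elements (inline on* handlers and javascript: URLs are
         | left alone).

```python
from html.parser import HTMLParser


class ScriptStripper(HTMLParser):
    """Re-emit HTML while skipping <script> elements and their contents."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if tag != "script":
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.in_script:
            self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def strip_scripts(html):
    parser = ScriptStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

         | This only makes sense after the page's JavaScript has already
         | run and the resulting DOM has been serialized; applied to the
         | raw source of an SPA it would leave an empty shell.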
        
       | sergiotapia wrote:
       | I'm building a tool for people to have a personal archive of their
       | digital life so that 30 years from now they can revisit content
       | they enjoyed in their younger years.
       | 
       | https://github.com/sergiotapia/ekeko
       | 
       | This is awesome! I would love to integrate this somehow into my
       | project to "singlefile" bookmarks as people make them.
       | 
       | @gildas do you have any recommendation on how to approach this
       | with your extension? Could I run a headless chrome and trigger
       | this extension?
        
         | gildas wrote:
         | I confirm that you could use a headless browser for this. This
         | is actually what SingleFile CLI does [1]. Here is an example of
         | JS code showing how to configure and inject SingleFile with
         | puppeteer [2].
         | 
         | [1] https://github.com/gildas-lormeau/SingleFile/tree/master/cli
         | 
         | [2] https://github.com/gildas-lormeau/SingleFile/blob/master/cli...
        
           | sergiotapia wrote:
           | Thank you!
        
       | phil294 wrote:
       | How old is that demo gif? I just tried reproducing the normal
       | saving shortcomings, and the bottom image ("Example of an SVG
       | image with embedded JPEG images") loads just fine from the local
       | folder, so this seems outdated.
       | 
       | That being said, it's a bit weird that this kind of tool is even
       | necessary at all. I would have expected native saving to include
       | CSS background graphics as well, but apparently they don't for
       | some reason, so I think this is pretty useful. Until now, I have
       | also used pandoc (--standalone) to merge all resources into a
       | single HTML file which worked great.
        
         | gildas wrote:
         | The demo is approximately 2 years old. Things probably changed
         | meanwhile.
        
       | brentcetinich wrote:
       | I use HAR file extractor because normally I don't want a single
       | file I want a replica of the web servers file system structure
       | including any dynamically loaded assets
       | https://blog.cetinich.net/content/2022/download-website-and-...
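       | 
       | A rough sketch of the HAR idea in Python, not the tool from the
       | linked post: walk log.entries, decode each response body, and
       | write it under a directory tree derived from the request URL.
       | The function name extract_har is made up, query strings are
       | ignored, and paths are not sanitized, so only run something
       | like this on HAR files you trust.

```python
import base64
import json
from pathlib import Path
from urllib.parse import urlsplit


def extract_har(har_text, out_dir):
    """Write every captured response body to out_dir/<host>/<url path>."""
    out_dir = Path(out_dir)
    written = []
    for entry in json.loads(har_text)["log"]["entries"]:
        content = entry["response"].get("content", {})
        text = content.get("text")
        if text is None:
            continue  # body was not captured for this entry
        if content.get("encoding") == "base64":
            payload = base64.b64decode(text)
        else:
            payload = text.encode("utf-8")
        parts = urlsplit(entry["request"]["url"])
        relative = parts.path.lstrip("/")
        if not relative or relative.endswith("/"):
            relative += "index.html"  # directory-style URLs get an index file
        target = out_dir / parts.netloc / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(payload)
        written.append(target)
    return written
```

       | Because the HAR records every request the page actually made,
       | dynamically loaded assets end up in the tree as well.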
        
       | kosasbest wrote:
       | Love this. Use it all the time. Handy for saving huge pages with
       | all the styling intact for reading offline (like on a plane). You
       | could save a webpage as a PDF, but I prefer this over a PDF.
        
       | steren wrote:
       | Chrome can save to a single file (.mhtml). I am not sure I
       | understand the difference.
        
         | gildas wrote:
         | The difference is the output format. I created SingleFile
         | before Chrome supported MHTML files. At that time, to save web
         | pages in a single file, the only technical solution in Chrome
         | was to implement something like SingleFile. The advantage of
         | HTML is that this format is much more durable though.
        
         | Isthatablackgsd wrote:
         | Yes, there is .mhtml, but its execution plainly sucks because it
         | doesn't save everything. It attempts to save the page but
         | doesn't try very hard; it is like using mhtml without a
         | "force (-f)" argument.
        
       | gildas wrote:
       | Author here, it makes me really happy to see SingleFile on the
       | front page of HN. Thank you! I take the opportunity to make you
       | aware of the upcoming impacts of the Manifest V3 [1], and for
       | those who prefer zip files, I recommend you to have a look here
       | [2].
       | 
       | [1] https://github.com/gildas-lormeau/SingleFile-Lite
       | 
       | [2] https://github.com/gildas-lormeau/SingleFileZ
        
         | joisig wrote:
         | Thank you for the Manifest V3 critique, the examples you give
         | make it really clear how many things are regressing with this
         | upcoming change :/
        
         | austincheney wrote:
         | Twelve year project with nearly 7000 commits shows a lot of
         | dedication. Good work.
        
         | mieko wrote:
         | Thanks for this project. I found SingleFile a year or two ago,
         | and used it to take "HTML Screenshots" of third party sites I
         | could embed in guided walkthroughs with modified/example data
         | changed, instead of just PNGs.
         | 
         | SingleFile was ultra-valuable for this.
         | 
         | If anyone has a similar use-case, I wrote some pretty rough
         | (and slow) code to post-process SingleFile's output to remove
         | any HTML that wasn't contributing to the presentational render
         | by launching puppeteer and comparing pixels. It's available
         | here: https://github.com/mieko/trailcap
        
           | gildas wrote:
           | It's interesting! I had started something similar as part of
           | testing but hadn't really finished my work. I will have a
           | look at your project.
        
         | stragio wrote:
         | Very nice! Will use it for sure. May I ask you how you created
         | that good looking demo gif?
        
           | gildas wrote:
           | I used:
           | 
           | - ScreenToGif to record video sequences and produce the final
           | GIF: https://www.screentogif.com/
           | 
           | - Macro Recorder to record and replay user navigation:
           | https://www.macrorecorder.com/
           | 
           | - Blender to edit the video, add text comments, and make the
           | intro: https://www.blender.org/
        
         | badsectoracula wrote:
         | Single File is one of my favorite addons since it allows me to
         | keep offline copies of articles, tutorials, etc i see online
         | without losing images, etc (there have been a ton of articles
         | lost over the years and while some are preserved in
         | archive.org, they often lack things like images, etc, so i
         | prefer to save anything i come across). So thank you for making
         | it :-).
         | 
         | Now, having said that, the text in SingleFile-Lite's "Notable
         | features of SingleFile Lite" sound like a list of issues :-P.
         | It looks like these are issues with Chrome, but do you know
         | if/how these "improvements" will affect Firefox?
        
           | gildas wrote:
           | AFAIK, for the moment Mozilla is aware of the regressions
           | that Manifest V3 causes and shows a good will to try to
           | reduce them as much as possible. You can find some
           | information about this here
           | https://github.com/w3c/webextensions/tree/main/_minutes
        
         | rahimnathwani wrote:
         | If I start using SingleFile today, will I still be able to open
         | saved pages after the update to Manifest V3?
         | 
         | I mean, if I want to save pages over the next 11 months, should
         | I install SingleFile or SingleFile-Lite?
        
           | gildas wrote:
           | In fact, you simply do not need an extension to open pages
           | saved with SingleFile (or SingleFile Lite) because they are
           | standard HTML pages. So you don't have to worry about that.
        
             | warmwaffles wrote:
             | This alone is fantastic. I've been looking for an mhtml
             | replacement that worked well across all browsers.
        
         | JeremyNT wrote:
         | I've been using SingleFile for the last year or so, it's
         | amazing!
         | 
         | I'm going to hijack your post for a question! I love the way
         | you can use the editor and select "format for better
         | readability," then save just the stripped down version of the
         | page. I use this to send it to my e-ink device.
         | 
         | The question I have is whether it's possible to toggle the
         | default save to use the formatted version automatically? I dug
         | into the options and didn't turn anything up!
        
           | gildas wrote:
           | You can enable these options for this:
           | 
           | - Annotation editor > default mode > edit the page
           | 
           | - Annotation editor > annotate the page before saving
        
             | gildas wrote:
             | Sorry, I was wrong, you have to select "format the page"
             | instead of "edit the page" (first item).
        
         | narag wrote:
         | Thank you, very useful and works like a charm: a must have.
        
         | cloudwizard wrote:
         | Is there a configuration for the zip version where I can avoid
         | duplicating the static assets? Thanks
        
           | gildas wrote:
           | I guess you're referring to SingleFileZ. This option is not
           | needed because zip files (i.e. what SingleFileZ produces)
           | already provide this feature.
        
         | aantix wrote:
         | Is it possible to use this within the context of the current
         | web page, without the extension portion?
         | 
         | Taking a snapshot of my user's screen and then display it to
         | them later (maybe in an iFrame)?
        
           | gildas wrote:
           | It's possible but it's a bit limited. It won't be able, for
           | example, to save images coming from a different origin.
        
         | hrgiger wrote:
         | Thanks for the work you have done, it's a lazy man's heaven,
         | especially for bulk downloads, and it helped me a lot. About a
         | month ago I decided to back up my bookmarks via archivebox.
         | It was more than 1k bookmarks, and the most reliable methods
         | were singlefile and wget.
        
         | gwbas1c wrote:
         | FYI: Figure tags don't convert their hrefs to base64.
         | 
         | For example, try saving my home page:
         | https://andrewrondeau.herokuapp.com/
         | 
         | The img tags are converted correctly, but there's still <figure
         | class=image><a href="https://andrewrondeau.herokuapp.com/... in
         | the single HTML file.
        
           | gildas wrote:
           | I cannot reproduce your issue, I just did a test on this page
           | and I see the expected `<img src="data:image/jpeg;base64,...`
           | in the saved page.
        
       | Mr_Modulo wrote:
       | This is good for people who don't have constant internet access
       | who need to reference web resources offline.
       | 
       | Webpage saving technology does not seem to have kept pace with
       | the evolution of the web.
       | 
       | Images loaded by CSS aren't saved at all. JavaScript on the page
       | will often hijack a saved page and not let it display at all.
       | 
       | One option that works fairly well and does not require installing
       | a browser extension is to save the page as a PDF.
       | 
       | I wish browser developers would put more effort in this area.
        
       | manor wrote:
       | If you keep the javascript, you also get the world's most
       | portable (desktop) application format...
        
       | stanislavb wrote:
       | Opening the repo makes you download a 17MB gif. I hope you are
       | not on an expensive mobile connection.
       | 
       | p.s. the demo is nice
        
       ___________________________________________________________________
       (page generated 2022-03-02 23:00 UTC)