[HN Gopher] SingleFile: Save a complete web page into a single H...
___________________________________________________________________
SingleFile: Save a complete web page into a single HTML file

Author : crbelaus
Score  : 604 points
Date   : 2022-03-02 14:55 UTC (8 hours ago)

(HTM) web link (github.com)
(TXT) w3m dump (github.com)

| IggleSniggle wrote:
| What a cool project! I love the way this embeds images. One of the things I miss most, though, when going back to old sites, is embedded audio or video. From looking at the options, it seems like it might be able to handle encoding video and/or audio as Data URIs, but it's not totally clear if SingleFile does this or not. I wasn't sure if I was doing the correct things to force this behavior in the options. It would be great if the README could clarify how these are handled by SingleFile. Sometimes it might be nice to be able to embed these sorts of things, even if it does make the HTML ridiculous and bloated. Or, barring that, maybe just a recommendation to use one of the other formats in the comparison table for this kind of use case.
| manigandham wrote:
| Relevant 'awesome' list for web archiving:
| https://github.com/iipc/awesome-web-archiving
|
| There are many similar tools there, from archiving to rendering.
| abnry wrote:
| I love, love this extension. I am working on an app to turn this into a single-click bookmark system on Linux. Run an inotify service to watch your downloads, then process any SingleFile downloads into a database and update a browsable index.
| jrm4 wrote:
| TELL ME MORE.
|
| I think I basically get the idea. What kind of database are you using? Recoll sounds like a good idea, but I'm also thinking about how I might make this public-ish.
|
| (i.e. I teach in college and would love to have a centralized way to store and search all my assigned readings, which are most often webpages)
| abnry wrote:
| I am not a trained software engineer but...
|
| Each HTML page is processed by (1) getting the URL, title and time saved (this is under-rated, as the approximate time of saving is useful if you want to rediscover a page), then (2) taking a screenshot and finally (3) extracting text with readability.js and hopefully doing some keyword analysis.
|
| Right now it is stored in a local SQLite database, although the article content is stored in text files. For search, I can use ripgrep to look through the associated text files.
|
| The eventual goal is to create a Flask app which will allow for interactive management of the bookmarks (tagging, searching). I've already got static generation of bookmarks.
|
| Here's a screenshot: https://imgur.com/5YP4sP5
| m-p-3 wrote:
| I archived (privately) some documentation pages from some of our vendors that were behind a login page using that, just in case they became inaccessible at a critical time for us.
| samstave wrote:
| WANT
| makeworld wrote:
| You might like https://archivebox.io/, I think it can do this for you and then some.
| rhn_mk1 wrote:
| I'm using Recoll for this exact purpose. Just without inotify.
| sitkack wrote:
| This sounds neat.
| causi wrote:
| This is great for a page. I'd love to see it expanded to include an entire site.
| dgellow wrote:
| That's a nice and simple tool, good work. I'm personally using Zotero to save copies of web pages: https://www.zotero.org/. With the browser extension you can save a snapshot in a few seconds.
| gildas wrote:
| Zotero is actually using SingleFile under the hood to save web pages ;)
| dgellow wrote:
| Oh, that's nice :)
| js8 wrote:
| I use SavePageWE, it can save the page (into a single file) as it was modified by JS after load, which is often useful.
|
| The only thing I miss: I wish it was easier to script.
| rambambram wrote:
| I have been using WebScrapBook (an add-on for Firefox) for some time. I really like it. Does anyone else have experience with this add-on? Good or bad.
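A minimal sketch of the indexing step abnry describes earlier in the thread: pull a title out of each saved page, record when it was saved, and store the row in SQLite. The table layout, the function names and the use of the file's modification time as "time saved" are all assumptions for illustration; the thread does not show the actual code. URL recovery, screenshots and the readability.js text extraction are left out.

```python
import sqlite3
from datetime import datetime, timezone
from html.parser import HTMLParser
from pathlib import Path


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html_text):
    """Return the page title, or an empty string if there is none."""
    parser = TitleParser()
    parser.feed(html_text)
    return parser.title.strip()


def index_page(conn, path):
    """Insert or update one saved page in a hypothetical 'bookmarks' table."""
    path = Path(path)
    html_text = path.read_text(encoding="utf-8", errors="replace")
    # Assumption: file modification time stands in for "time saved".
    saved_at = datetime.fromtimestamp(
        path.stat().st_mtime, tz=timezone.utc
    ).isoformat()
    conn.execute(
        "CREATE TABLE IF NOT EXISTS bookmarks "
        "(path TEXT PRIMARY KEY, title TEXT, saved_at TEXT)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO bookmarks VALUES (?, ?, ?)",
        (str(path), extract_title(html_text), saved_at),
    )
    conn.commit()
```

A watcher (inotify or a simple polling loop) would call `index_page(conn, path)` for each new `.html` file in the downloads folder; that part is not shown here.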
| jjice wrote: | I've been using it for a couple years (2 maybe) and I like it | quite a bit as a quick and easy way to save pages. ArchiveBox | looks fantastic, but I just don't have the motivation to set up | the service and maintain it since I don't save enough links to | make it worthwhile. SingleFile might be worth a shot, but it | looks like WebScrapBook has been handling your needs just fine | (they seem to have 90% of the same functionality). | vageli wrote: | As a webscrapbook user, do you know if there is a migration | path from pocket or another hosted service? | rambambram wrote: | Don't know about a migration option, but I do remember | there's a lot of custom configuration possible. | rambambram wrote: | Thanks! | | ArchiveBox does indeed look fantastic. Their homepage alone | is beautiful. | | I bookmarked both ArchiveBox and now also SingleFile, but | WebScrapBook gets the job done (in almost all cases). | sharps1 wrote: | Should be noted Manifest V3 will break this extension for | chromium based browsers. | | https://github.com/gildas-lormeau/SingleFile-Lite | photon-torpedo wrote: | Love the list of notable "features". :) | a1445c8b wrote: | Also this: | | > Benefits of the Manifest V3 | | > - None | black3r wrote: | Can we please stop with the 17MB GIF images used as demos? They | use up lots of data immediately as you open the page, and are | impractical, you don't know how long the animation is, can't | forward/rewind, and you can't press fullscreen on a mobile. | | And GitHub supports embedded videos in README.md files, videos | are generally smaller than GIF files and their disabled autoplay | is a feature = you save your data until you press play. | andrewmcwatters wrote: | I wish browsers came standard, preconfigured with warning | dialogs that triggered if assets attempting to load were beyond | some threshold. 
That threshold could be decided by the browser | vendors group based on some collection of network statistics | and be adjusted on an annual basis or so. | wackget wrote: | https://old.reddit.com/r/firefox/comments/aaek23/how_to_stop... | black3r wrote: | The issue is mainly with mobile browsers, as mobile data is | expensive..., Firefox on iPhone doesn't have about:config. | foobarbecue wrote: | GitHub only recently expanded video support from gif to decent | video formats, and many github enterprise installs don't have | those new features yet. So, keep spreading the word. | gildas wrote: | Author here, sorry for the GIF file. I created it because | people were not happy with the video hosted on Youtube. AFAIK, | video files did not work when I did this demo. I'll try to | improve this in the future. | tux1968 wrote: | Would you mind sharing which tools you used to create that | demo? It is really well done. | gildas wrote: | Sure! See here | https://news.ycombinator.com/item?id=30530438 | localhost wrote: | I wanted to comment on how useful that demo was to me. It did | a great job at demonstrating why this is useful and how well | it works compared to the native browser implementation. Thank | you both for the demo and for the project! | gildas wrote: | Thanks :) | lostgame wrote: | Giving a massive upvote for this, disappointed and confused to | see you've been downvoted here. There's literally no reason to | use GIFs like this, and - as you stated, it's massively | disrespectful to those not fortunate enough to have broadband | connections, but would like access to the information. | | Using data so wastefully like this always reeks of privilege to | me - especially on something like GitHub. Wikipedia, for | instance, never allows things like this. | Zababa wrote: | Sharing a project with the world and taking time to document | it reeks of privilege? I really can't understand your | reasoning. 
| pizza234 wrote: | > disappointed and confused to see you've been downvoted here | | Because it's a relatively new feature, and probably, a lot of | devs don't know about it (I didn't). | | I did this [animated gif] once actually, before the feature | was introduced, and I definitely hated it, but I had no | choice. | | Thanks for bringing this to the general attention, though :) | bob1029 wrote: | I think there is some nuance here. | | If the demo sequence is <5 seconds, I have never found myself | becoming impatient. Gif is perfect for very brief demos. | Anything longer than that and I'd like to have some idea where | I am at in the video stream (and other controls as indicated) | jazzyjackson wrote: | > GitHub supports embedded videos in README.md files | | True since May 2021 so I think a lot of people are still | finding this out... | | In my experience GIF is still the most set-it-and-forget-it way | to know a video will play, to get cross-platform support out of | mp4 you may have to provide two different codecs. Anyway, not | disagreeing with you and most gifs could drop 90% of their size | with better choice of resolution and framerate. This readme is | particularly egregious doing a screen capture with scrolling. | | As for saving bandwidth until you want to play, I haven't tried | this yet but it seems adequately clever to wrap a loading=lazy | gif inside a details/summary tag: https://css-tricks.com/pause- | gif-details-summary/ | Melatonic wrote: | Not to mention that H264 can take quite a bit of horsepower | to decode and play as well (assuming your machine doesnt have | a hardware chip specifically for doing just that) | Mogzol wrote: | Is this really still an issue in 2022? How many people are | browsing the internet on a device that can't do hardware | H264 decoding? | TingPing wrote: | Some browsers have poor hw decoding support on Linux | (their problem, not drivers) but its gotten a lot better | recently. 
| tambourine_man wrote: | Which machine doesn't? Anything in the last 10 or so years | will decode H264 with much less power than GIF because of | it. Even a Pi supports it. | jjice wrote: | My 2014 Thinkpad X1 Carbon (gen 3) doesn't have hardware | transcoding as far as I can tell made Zoom and Discord | impossible to use for class, especially because there was | no way (that I knew of) to disable all video except the | presenter. Even playing a YouTube video on it makes it | ramp up. | botdan wrote: | I'm not sure which CPU you have specifically but the | lowest-end model of the X1 Carbon Gen3 has an i5-5200U | [1] that lists Intel Quick Sync Video support. | | From the wiki page for Quick Sync [2]: | | > Intel Quick Sync Video is Intel's brand for its | dedicated video encoding and decoding hardware core. | Quick Sync was introduced with the Sandy Bridge CPU | microarchitecture on 9 January 2011 and has been found on | the die of Intel CPUs ever since. | | I can't confirm but I'd guess your performance issues lie | elsewhere than in the h264 decoding specifically. | | [1] - https://ark.intel.com/content/www/us/en/ark/product | s/85212/i... | | [2] - | https://en.wikipedia.org/wiki/Intel_Quick_Sync_Video | jjice wrote: | If you check out the generation-codec table in that | wikipedia article [1], under Broadwell (I believe that's | the 5200U's generation name), it says there is support | for AVC (which I believe is H264, I'm not a codec wiz), | so that's a really good point. I'm not sure why I've | consistently had issues with this on my machine then. I | wonder if this is something with a configuration on Linux | then? | | Thanks for pointing that out. I've looked at this table | before and payed attention to HEVC, not AVC, so I believe | that's where my mistake came from. | | [1] https://en.wikipedia.org/wiki/Intel_Quick_Sync_Video# | Hardwar... | zerocrates wrote: | AVC is H.264, yes. 
| | Accelerated video decode is often disabled by default on | Linux versions of browsers and can be quite dependent on | versions of drivers/mesa/X-vs-Wayland/etc. | black3r wrote: | YouTube by default prefers newer, bitrate saving codecs | over old ones if it thinks your CPU can handle software | decoding them. On my 2017 Dell XPS 1080p and lower | resolutions on YouTube play in software decoded AV1, only | 1440p and higher play in hardware decoded VP9, so playing | 4K video on YouTube is less taxing for my CPU than | playing a 1080p video.... | folmar wrote: | You can use h264ify extension to fix it. | divbzero wrote: | > _to get cross-platform support out of mp4 you may have to | provide two different codecs_ | | Video codecs are not my area of expertise. Which codecs are | these and what tool(s) would you typically use to ensure you | provide them? | berkes wrote: | > And GitHub supports embedded videos in README.md files | | Any documentation on this? Because I have tried to embed video | in issues and PRs before, and did not manage. I'm hoping such | documentation will explain how this extends to issues and PRs. | TingPing wrote: | In issues its just drag-n-drop. | bachmeier wrote: | Maybe a little OT, but founders should take a careful look at | this landing page. That's how you sell something. The demo is | clear about the problem they're trying to solve and it convinced | me that their product actually solves it. It's not just all the | information they've included, but also the lack of irrelevant | clutter. | wanderer_ wrote: | Dang it, he beat me to it! I have been toying with the idea for | quite some time, but this implementation is great, better than | mine would have been, so I'm glad he did it. | | Maybe I'll make a CLI implementation (sorta like wget but with | this tacked on...) | givemeethekeys wrote: | Naming a thing takes creativity and luck. Congratulations on an | excellent name! 
| civilian wrote:
| I was hoping this tool also solved a problem that comes from saving & reproducing JS-framework-heavy websites.
|
| Here's the bug: According to the HTML spec, elements like <h2> and <div> cannot be inside <a> tags. But using JS you _can_ push <div>s inside of <a>s. (It happens from document.insert-type functions, frameworks like Angular/React allow this)
|
| Look at nasa.gov, where there's this HTML:
|
|   <a href="/press-release/nasa-invites-media-to-next-spacex-commercial-crew-space-station-launch-0" date="Wed Mar 02 2022 10:35:00 GMT-0800 (Pacific Standard Time)" id="ember196" class="card ubernode cards--card cards--2row cards--2col nodeid-477815 ember-view">
|     <div class="bg-card-canvas" style="background-image: url(/sites/default/files/styles/2x2_cardfeed/public/thumbnails/image/51846702013_a0cc55100a_k.jpeg);">
|       <!---->
|       <h2 class="headline"> ... </h2>
|     </div>
|   </a>
|
| After running this through SingleFile you can visually see the changes, but the HTML changes are:
|
|   <a href="/press-release/nasa-invites-media-to-next-spacex-commercial-crew-space-station-launch-0" date="Wed Mar 02 2022 10:35:00 GMT-0800 (Pacific Standard Time)" id="ember196" class="card ubernode cards--card cards--2row cards--2col nodeid-477815 ember-view"></a>
|   <div class="bg-card-canvas" style="background-image: url(/sites/default/files/styles/2x2_cardfeed/public/thumbnails/image/51846702013_a0cc55100a_k.jpeg);">
|     <h2 class="headline"> ...</h2>
|
| The way that sites like Wayback Machine handle this is by using the web-replay library Wombat https://github.com/webrecorder/wombat that also uses JS to insert those elements.
|
| But what the hell! I was working on a similar html-downloading/reproducing tool and this bug really bothers me. I'd either like the HTML reading standard to be updated to accept <div> inside of <a>, or _also_ make that impossible to do via JS.
| gildas wrote:
| I think this issue could be circumvented by manipulating the page (replacing images, frames, CSS etc.) in the tab itself (SingleFile does it in the background with a DOMParser instance). The trick is to avoid HTML parsing.
| zmix wrote:
| I'd also recommend "Print Edit WE" [1] and "Save Page WE" [2] for Chrome-type browsers, both by one author. The first one allows for editing of the page before printing/saving (as a single-page HTML or MHTML), the second one allows for single-page save.
|
| [1] https://chrome.google.com/webstore/detail/print-edit-we/olnb...
| [2] https://chrome.google.com/webstore/detail/save-page-we/dhhpe...
| sandes wrote:
| wget -r url ?
| reidjs wrote:
| Unfortunately that won't allow you to click links in your offline version. You can do this properly with wget (sorry, I don't know how to do code formatting on Hacker News):
|
|   wget --mirror \
|     --convert-links \
|     --html-extension \
|     --wait=2 \
|     -o log \
|     https://example.com
| berkes wrote:
| Are you suggesting to mirror e.g. the entire Wikipedia through wget?
|
| That is not only suboptimal, it is stressful for the server. At least you added a --wait=2, but on any large site/hoster/CDN, this might still get your IP banned or throttled. And on e.g. the English Wikipedia this will then take 149 days. Which means that by the time you hit the last page, the first ones (and their links) are out of date.
| falcolas wrote:
| If you add '--no-parent' (doesn't request anything that's not a page dependency above the requested URI) and '--level=5' (only follows links 5 deep), you won't get all of a site. It makes it more realistic for getting Wikipedia articles.
| lysium wrote:
| Looks like SingleFile helps with sites where you have to be logged in, something that is not that easy with wget.
| hombre_fatal wrote:
| You don't need to newline every flag of a trivial command.
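berkes's 149-day estimate above follows from simple arithmetic: one request every `--wait` seconds, multiplied by the number of pages. The sketch below reproduces it. The page count (about 6.45 million English Wikipedia articles) is an assumption chosen to match the figure; it is not stated in the thread, and the time spent actually downloading each page is ignored.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def crawl_days(page_count, wait_seconds):
    """Lower bound on crawl duration: only the pause between requests is counted."""
    return page_count * wait_seconds / SECONDS_PER_DAY


# Assumed article count (not from the thread); --wait=2 as in the wget command above.
print(round(crawl_days(6_450_000, 2)))  # prints 149
```

The same function shows why falcolas's `--level` and `--no-parent` suggestions matter: cutting the page count is the only way to bring the total down without removing the wait.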
| all2 wrote: | I'm guessing the user's intent was to have the command | formatted across multiple lines. | [deleted] | _dain_ wrote: | What are you talking about? I have hundreds of pages saved with | SingleFile and I can click links in all of them. | reidjs wrote: | Oh maybe it does work then. I assumed it didn't follow links | because they didn't show it in the video. | z3c0 wrote: | Code formatting is just blockquotes. So one | empty space followed by indented text (2 or more spaces) | megaman821 wrote: | Is this still on track to become a standard? | https://github.com/WICG/webpackage | j1elo wrote: | Related: I used to keep a collection of locally mirrored web | pages a long time ago, with a legendary Firefox extension called | _ScrapBook_ [0] (now long retired). The surprise for me is that | after all these years I still remembered the name... | | While writing this comment I found that it lived on as a (now | "legacy") new extension named _ScrapBook X_ [1], and then yet | another one named _WebScrapBook_ [2], which seems to still be | alive! | | [0]: http://www.xuldev.org/scrapbook/ | | [1]: https://github.com/danny0838/firefox-scrapbook | | [2]: https://addons.mozilla.org/en-US/firefox/addon/webscrapbook/ | wetpaws wrote: | Ah, millennials invented .mht | als0 wrote: | This is great. I've always wondered why this isn't the default | behaviour for page saving in browsers. To an ordinary user saving | a page implies saving a single file, not a file plus a directory | of stuff. HAR can be useful but seems only for niche or | specialised reasons. | kwhitefoot wrote: | The list of problems that Manifest V3 causes are just more | reasons to never use Chrome. | avivo wrote: | Why does this need to: | | - Read and change all your data on all websites | | - Modify data you copy and paste | | - Manage your downloads | | Is there a way to use a version that requires less of these | permissions? e.g. 
it seems we can address the first permission by | only activating it on click, but I'm not sure if that addresses | the other ones. | gildas wrote: | I try to use optional permissions as much as I can. The first | permission is required because of assets and frames stored on | third-party servers. The second permission should be optional, | I don't remember why it's not. I'll try to see if I can make it | optional. The last permission is required in order to save the | page on the filesystem with the "downloads" API. Note that even | if I make these permissions optional, you might still have to | trust me anyway ;) | [deleted] | anned20 wrote: | I also want to give praise about the demo. It's one of the best | demos I've ever seen with such a project. Nice job! | netsharc wrote: | A 16MB gif with no playback controls, so you had to go through | the tedium. | Minor49er wrote: | I would be surprised that the author wasn't using WebM to get | a smaller filesize (not to mention higher quality) but the | project itself leads me to believe that the author has a lot | of free disk space to use | a1445c8b wrote: | There's no need to make further assumptions about the | author (who btw took the time to build a very useful tool | and share to in the Internet for free). Just point out the | issue of the GIF and move along. | Minor49er wrote: | I never made an assumption about the author and certainly | never said that the tool wasn't useful. You can feel free | to move along yourself, though. | treeman79 wrote: | Iran has a habit of using tools like this to trick defense | contractors into using their page. | dtjohnnymonkey wrote: | Thank you! I've been looking for this for a while, nice to see | someone finally did it! | ilrwbwrkhv wrote: | Thanks for this. I expected to see a pricing link somewhere, | having been attuned to all the subscription Saas these days. Glad | to see there are tools offering immense value for free still. 
| gildas wrote:
| It is in fact more or less self-financed by... hmmm... a SaaS that I market, but it's in B2B.
| [deleted]
| mysterypie wrote:
| Security question: Is a web extension safe if it is installed but you're not using it at the moment? For example, if I were logged into my bank's website and I did _not_ click the SingleFile button in the extension toolbar, could it still theoretically collect info from my bank's webpage or do other actions?
|
| I'd like to use SingleFile and have no reason at all to distrust it, but I'd like to understand the security impact of installing lots of web extensions. How do people handle security risks like that? Do you run a separate vanilla browser with no extensions for sensitive tasks?
| fsflover wrote:
| If you care about security, consider using Qubes OS with hardware-virtualized VMs for compartmentalization. Then your Firefox for banking won't have the same extensions that you use elsewhere. Works for me.
| gildas wrote:
| For technical reasons beyond my control, SingleFile injects a (very small) script when the page loads even if you don't click on the button. It could also send any data to a third-party server. Unfortunately, it is therefore impossible for me to technically and formally guarantee that SingleFile cannot behave maliciously. Note however that the extension has the status "recommended" on Firefox and that it undergoes a manual code review by Mozilla at each update.
| fmntf wrote:
| Could you please elaborate on what script is injected, the reason for it, and why it is out of your control? Thank you
| gildas wrote:
| I will do it, but it will take me some time to explain it, and rather than answering on HN I will integrate it in the FAQ. I created an issue for this here: https://github.com/gildas-lormeau/SingleFile/issues/885.
| prox wrote:
| In Firefox you could run a totally different profile.
| | I don't do this myself, I try to research any extension I add | and don't do automatic upgrades. I use as little extensions as | possible. | tzs wrote: | > For security reasons, you cannot save pages hosted on | https://chrome.google.com, https://addons.mozilla.org and some | other Mozilla domains. | | Interesting. What is it about those pages that makes saving them | raise security issues? | Isthatablackgsd wrote: | That is not the extension issue, that's the Google/Mozilla | policy thing. | amccollum wrote: | Maybe because JS files (specifically add-ons) run from the | local filesystem are given escalated privileges compared to | normal usage, perhaps for ease of development. I'm just | speculating, though. | slmjkdbtl wrote: | Does it create an inline dataurl for each image even if they're | the same? | assemblylang wrote: | Nice project! This project, and a similar project called | Monolith[0], was a bit of an inspiration for making my own single | HTML file tool called Humble[1] to solve a few edges cases I was | having with bundling pages (and since I wanted a TypeScript API | for making page bundles). | | [0] https://github.com/Y2Z/monolith | | [1] https://github.com/assemblylanguage/humble | alberth wrote: | FYI - there's an official standard (MHTML) for doing this that | has existed for 20+ years and exists natively in browsers. | | https://en.m.wikipedia.org/wiki/MHTML | setum wrote: | IIRC, back in the day mhtml won't save java applets. | pstuart wrote: | Are any sites still using applets these days? | IYasha wrote: | 80% of server IPMI Web control panels. But who whould want | to save those anyway? :) | twapi wrote: | I use this Chrome extension to save web pages as MHTML: | https://chrome.google.com/webstore/detail/save-webpages-offl... 
| paulirish wrote: | The Chrome engineer who maintains the MTHML work wrote up a | comprehensive doc on the modifications on the MHTML spec (RFC | 2557) that are implemented: | https://docs.google.com/document/d/1FvmYUC0S0BkdkR7wZsg0hLdK... | Might be useful for you, gildas. | gildas wrote: | Thank you Paul! I had read this document some time ago, | especially to see how the shadow DOM was serialized. | rplnt wrote: | I was gonna say Opera (the old, good one) had this. When saving | a page there were some options and one was a single file IIRC. | rtsil wrote: | I remember saving webpages in MHTML when I was using dial-up so | that I could read them offline later. | | I would also download entire websites using a software which | name I forgot, to read them offline. Back when websites held in | a single floppy disk. | | Good times! | TheFlyingFish wrote: | I remember using HTTrack for this a while back. Still have a | few of those sites lying around, I think. | domador wrote: | Does anyone else get two security warnings whenever you try to | save an MHTML page using a Chrome extension? I have to click on | one warning's button to confirm that I indeed want to save the | "dangerous" file and another to confirm I'm really sure. It's | gotten very annoying. I've looked all over for an option to | disable this behavior but haven't been able to. | toqy wrote: | For anyone else that didn't read the README, MHTML is mentioned | in the comparison section https://github.com/gildas- | lormeau/SingleFile#file-format-com... | dsl wrote: | Take the comparison with a grain of salt. Not including WARC | is like excluding water from a comparison of beverages, it is | the baseline standard. | bgro wrote: | I've extensively looked into this as I can't find a good light | and easy backup options that isn't extreme overkill. | | I thought MHTML was NOT standardized which is why it wasn't | across all browsers yet. 
| From what I remember, every company was doing their own implementation of it. Maybe it's gotten more standardized the last few years though.
| chungy wrote:
| I've always thought the "M" stood for "Microsoft" -- wasn't even aware any browsers other than IE supported it.
| chme wrote:
| There is also CHM, which is actually a Microsoft-only file format for "Compiled HTML Help" files.
| IYasha wrote:
| I love this format. Very fast and compact. Entire Visual Studio help was in it once. Worked VERY well. And there's a KDE/Qt reader.
| iKlsR wrote:
| Over a decade ago I had a laptop but no internet at home. This was one of the ways I taught myself programming (and also downloading dozens of manga): by using Internet Explorer at a cafe, which had an option to save to MHTML, which was one file and had everything self-contained. Legit owe a portion of my success to this. I still have some of these files, old crusty hello world C++ tutorials etc.
| falcolas wrote:
| I have fantastic internet, and I still do something similar. Local docs just load so much faster, and if something happens (which it still does, even on Fiber in the US), I have docs and can program.
|
| Lemme see if I can pull up the command I use to mirror doc sites.
|
|   wget \
|     --recursive \
|     --level=5 \
|     --convert-links \
|     --page-requisites \
|     --wait=1 \
|     --random-wait \
|     --timestamping \
|     --no-parent \
|     "$1"
| a9h74j wrote:
| For people who cannot afford internet access now, and for perhaps more in the future if times get more difficult, I believe this is a very important use case.
| geitir wrote:
| And it generally does not do a good job
| als0 wrote:
| What are the issues?
| hulitu wrote:
| From my experience, wrong layout, missing pictures.
| ByThyGrace wrote:
| > MHTML, (...) is a web page archive format used to combine, in a single computer file, the HTML code and its companion resources (such as images, _Flash animations, Java applets_, (...)
| | Well that goes to show its longevity I guess. | rpdillon wrote: | The browser compatibility section suggests MHTML is unsupported | in current versions of Firefox and Safari. | tekknik wrote: | Safari supports webarchive, which does basically the same | thing | gildas wrote: | The problem is that it is a proprietary format. The | advantage of the format produced by SingleFile (HTML) is | that as long as your browser is capable of interpreting | HTML, you will be able to read your archives without | worries. | tekknik wrote: | Not so proprietary. It's really just a plist file, which | the format is known and even open sourced by Apple[1]. | Really it's only proprietary in that no other platforms | have implemented it. | | [1]: https://opensource.apple.com/source/CF/CF-550/CFBina | ryPList.... | mrspuratic wrote: | I don't think it was ever native in Firefox, there is/was the | excellent unMHT extension that was broken by | Quantum/WebExtensions and The Great XUL Silliness. Shame. | | I have Waterfox-Classic and unMHT (fished out of the Classic | Addons Archive, just remember to turn off Waterfox's | multiprocess feature) since I occasionally need to archive | web pages - and more importantly, reopen them later. | | mhtml is just MIME, literally every discrete URL as a MIME | part with its origin in a Content-Location header, all | wrapped in a multipart container. I don't understand why it's | not a default format. | Groxx wrote: | I can see WebExtensions breaking it (as it's a completely | new set of APIs for extensions, and the losses do | definitely still hurt)... but quantum/xul? How is that | related, aside from "it happened around the same time"? | cookiengineer wrote: | > FYI | | The alternative format (used by the Internet Archive and | Wayback Machine) is WARC. It's also a single file, but it's | preserving the HTTP headers as well; so its applications is | specifically for archival purposes. 
[1] The "wget" tool which | is co-maintained by the Web Archive people also has support for | it via CLI flags. | | Though when it comes to mobile browser support I'd recommend to | use MHTML, because webkit and chromium both have support for it | upstream. | | [1] http://iipc.github.io/warc-specifications/ | | [2] https://www.gnu.org/software/wget/wget.html | londons_explore wrote: | Is there any objection to adding WARC support to | webkit/chromium? Seems like a not-so-complex project... | cookiengineer wrote: | I know that WebKit relies on either libsoup [1] (on | Linux/Unices) or curl [2] (legacy Windows and maybe WPE(?)) | as a network adapter, so the header handling and parsing | mechanisms would have to be implemented in there. | | Though, on MacOS, WebKit tries to migrate most APIs to the | Core Foundation Framework, which makes it kind of | impossible to implement as a non-Apple-employee because | it's basically a dump-it-and-never-care Open Source | approach. [3] | | Don't know about chromium (my knowledge is ~2012ish about | their architecture, and pre-Blink). | | [1] https://github.com/WebKit/WebKit/tree/main/Source/WebKi | t/Net... | | [2] https://github.com/WebKit/WebKit/tree/main/Source/WebKi | t/Net... | | [3] https://github.com/opensource-apple/CF | TingPing wrote: | GTK/WPE use libsoup. Playstation/Windows uses curl. And | yes Apples networking is proprietary. | chefandy wrote: | WARC is also used by the Webrecorder project. They made an | app called Wabac which does entirely client-side WARC or HAR | replays using service workers and it seems to have pretty | good browser support, but I haven't really dug into the | specifics. | | https://github.com/webrecorder/wabac.js-1.0 | admax88qqq wrote: | Unfortunately mhtml is not widely supported. | pan69 wrote: | In the olden days, Internet Explorer used to allow you to do this | by saving the page to a HTM file. It would be a single archive | with HTML and images etc embedded. 
| | New browsers don't seem to do this; they create a separate folder | for the assets, which is super annoying. | nickflood wrote: | The Chromium Edge can produce .MHT files as well | xnx wrote: | I love SingleFile and have been using it for years! Is there any | version that works on current mobile browser versions? I've stuck | with an old version of Firefox on Android that still supports the | extension. | gildas wrote: | You should be able to use it on Firefox for Android Nightly | (which is very stable) by following this procedure: | https://blog.mozilla.org/addons/2020/09/29/expanded-extensio... | | > approx | moffkalast wrote: | This is what 10-year-old me thought "Save As" in IE would do, but | I soon realized the harsh reality of "that's not how any of this | works". | edf13 wrote: | The most impressive part of the demo is seeing how tidy his | Downloads folder is! | ctxc wrote: | Been eyeing this for a long time! | | I'm building a bookmark app, and I plan to use this to save | bookmarks! | | I'm a simple man, nothing too fancy. Here's a crude demo in | progress - https://zewallet.netlify.app/ Follow progress here - | https://twitter.com/recursiveSwings/status/14917723874649088... | | Would love to have ANY tips or feedback! | TehShrike wrote: | the signup email confirmation link points to | http://localhost:3000/ btw | | I'm definitely in the market for a bookmark service that | archives my bookmarks, Diigo stopped working a year or two ago, | and Pinboard can't stay up | ctxc wrote: | Fixed now! | cxr wrote: | Zotero deals with this reasonably well--and happens to be | using SingleFile under the hood. Its landing page just | targets a different audience (academics), which means | probably upwards of 90% of the people who would happily use | it end up bouncing after thinking, "This isn't for | me", before ever trying it. Give it a shot. | ctxc wrote: | Ahh damn, should fix it! For now, you can edit the URL | manually to take a peek.
If you're interested, feel free to | send a DM on Twitter @recursiveSwings, I'll let you know once | it's in Beta! :) | fender256 wrote: | You read my mind, I was looking for exactly that! | didericis wrote: | Similar project -> https://github.com/Y2Z/monolith | | (I used both and ended up favoring monolith, but can't remember | why. I think they're pretty comparable/am grateful for both of | them) | theden wrote: | This would be very useful in many situations, and a great demo! | spankalee wrote: | We really, really need Web Bundles to progress and fix these | problems correctly, once and for all. There are a lot of things | that a tool like this can never get right, and the rest is | complicated work that should never need to be done if we had a | standard multi-file bundle format. | | https://wicg.github.io/webpackage/draft-yasskin-wpack-bundle... | necovek wrote: | Great stuff! | | For some reason, I went in expecting to see a JS-enabled | multi-page web site turned into a SPA in a single HTML file, but I didn't | expect to see images get embedded. | | Perhaps offer a recursive traversal option too, but don't try | that on Wikipedia :) | sam0x17 wrote: | Back in the day this was always one thing that had me | begrudgingly and shamefully opening IE so I could save a page as | an MHT file. So long ago now. Cool to see this idea has been | revived and not in a proprietary way | vincentmarle wrote: | If it's a single file, then how do the images get stored? | gildas wrote: | Images are stored as data URIs [1]. Note that they could also | be stored as entries in a zip file! [2] | | [1] https://en.wikipedia.org/wiki/Data_URI_scheme | | [2] https://github.com/gildas-lormeau/SingleFileZ | danielam wrote: | They're base64 encoded[0]. (This is an approach I myself have | used in the past for simplifying the archival of regulatory | texts.) | | [0] https://github.com/gildas-lormeau/SingleFile/blob/15801c8ef4...
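As a rough illustration of the data-URI approach described in the answers above, here is a minimal Python sketch. It is not SingleFile's own code (which is JavaScript); the helper names `to_data_uri` and `inline_image` and the sample bytes are hypothetical, and a real tool would also handle CSS, fonts, and relative URL resolution.

```python
import base64
import mimetypes


def to_data_uri(data: bytes, filename: str) -> str:
    """Encode raw bytes as a base64 data URI, guessing the MIME type from the name."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None:
        mime = "application/octet-stream"  # fallback when the type is unknown
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def inline_image(html: str, url: str, data: bytes) -> str:
    """Replace every occurrence of `url` in `html` with the equivalent data URI.

    Hypothetical helper: plain string replacement, no HTML parsing.
    """
    return html.replace(url, to_data_uri(data, url))


# Stand-in for real image bytes (only the PNG signature, for illustration).
fake_png = b"\x89PNG\r\n\x1a\n"
page = '<img src="logo.png">'
print(inline_image(page, "logo.png", fake_png))
```

Base64 grows the payload by roughly a third, which is the size trade-off that the zip-based variant mentioned above avoids.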
| codeflo wrote: | Does this simply remove the JavaScript or do something more | clever? Because I think in the age of SPAs, the proper way to | save "content pages" might be to execute the JavaScript once and | serialize the resulting DOM back to HTML. I didn't find anything | in the FAQ that explains if it does something like that. | gildas wrote: | It saves what you see (and removes JS by default). There is an | option to embed the JS and another one to save the "raw" page | but I would not say it is reliable. The cleverness lies more in | the ability to produce light pages. | sergiotapia wrote: | I'm building a tool for people to have a personal archive of their | digital life so that 30 years from now they can revisit content | they enjoyed in their younger years. | | https://github.com/sergiotapia/ekeko | | This is awesome! I would love to integrate this somehow into my | project to "singlefile" bookmarks as people make them. | | @gildas do you have any recommendation on how to approach this | with your extension? Could I run a headless Chrome and trigger | this extension? | gildas wrote: | I confirm that you could use a headless browser for this. This | is actually what SingleFile CLI does [1]. Here is an example of | JS code showing how to configure and inject SingleFile with | puppeteer [2]. | | [1] https://github.com/gildas-lormeau/SingleFile/tree/master/cli | | [2] https://github.com/gildas-lormeau/SingleFile/blob/master/cli... | sergiotapia wrote: | Thank you! | phil294 wrote: | How old is that demo gif? I just tried reproducing the normal | saving shortcomings, and the bottom image ("Example of an SVG | image with embedded JPEG images") loads just fine from the local | folder, so this seems outdated. | | That being said, it's a bit weird that this kind of tool is even | necessary at all. I would have expected native saving to include | CSS background graphics as well, but apparently it doesn't for | some reason, so I think this is pretty useful.
Until now, I have | also used pandoc (--self-contained) to merge all resources into a | single HTML file, which worked great. | gildas wrote: | The demo is approximately 2 years old. Things have probably changed | in the meantime. | brentcetinich wrote: | I use a HAR file extractor because normally I don't want a single | file; I want a replica of the web server's file system structure, | including any dynamically loaded assets: | https://blog.cetinich.net/content/2022/download-website-and-... | kosasbest wrote: | Love this. Use it all the time. Handy for saving huge pages with | all the styling intact for reading offline (like on a plane). You | could save a webpage as a PDF, but I prefer this over a PDF. | steren wrote: | Chrome can save to a single file (.mhtml). I am not sure I | understand the difference. | gildas wrote: | The difference is the output format. I created SingleFile | before Chrome supported MHTML files. At that time, to save web | pages in a single file, the only technical solution in Chrome | was to implement something like SingleFile. The advantage of | HTML is that this format is much more durable though. | Isthatablackgsd wrote: | Yes, there is .mhtml, but its execution plainly sucks because it | doesn't save everything. It attempts to save the page but | isn't valiant at it; it is like using mhtml without a | "force (-f)" argument. | gildas wrote: | Author here, it makes me really happy to see SingleFile on the | front page of HN. Thank you! I'll take the opportunity to make you | aware of the upcoming impact of Manifest V3 [1], and for | those who prefer zip files, I recommend having a look here | [2].
| | [1] https://github.com/gildas-lormeau/SingleFile-Lite | | [2] https://github.com/gildas-lormeau/SingleFileZ | joisig wrote: | Thank you for the Manifest V3 critique, the examples you give | make it really clear how many things are regressing with this | upcoming change :/ | austincheney wrote: | Twelve year project with nearly 7000 commits shows a lot of | dedication. Good work. | mieko wrote: | Thanks for this project. I found SingleFile a year or two ago, | and used it to take "HTML Screenshots" of third party sites I | could embed in guided walkthroughs with modified/example data | changed, instead of just PNGs. | | SingleFile was ultra-valuable for this. | | If anyone has a similar use-case, I wrote some pretty rough | (and slow) code to post-process SingleFile's output to remove | any HTML that wasn't contributing to the presentational render | by launching puppeteer and comparing pixels. It's available | here: https://github.com/mieko/trailcap | gildas wrote: | It's interesting! I had started something similar as part of | testing but hadn't really finished my work. I will have a | look at your project. | stragio wrote: | Very nice! Will use it for sure. May I ask you how you created | that good looking demo gif? | gildas wrote: | I used: | | - ScreenToGif to record video sequences and produce the final | GIF: https://www.screentogif.com/ | | - Macro Recorder to record and replay user navigation: | https://www.macrorecorder.com/ | | - Blender to edit the video, add text comments, and make the | intro: https://www.blender.org/ | badsectoracula wrote: | Single File is one of my favorite addons since it allows me to | keep offline copies of articles, tutorials, etc i see online | without losing images, etc (there have been a ton of articles | lost over the years and while some are preserved in | archive.org, they often lack things like images, etc, so i | prefer to save anything i come across). So thank you for making | it :-). 
| | Now, having said that, the text in SingleFile-Lite's "Notable | features of SingleFile Lite" sounds like a list of issues :-P. | It looks like these are issues with Chrome, but do you know | if/how these "improvements" will affect Firefox? | gildas wrote: | AFAIK, for the moment Mozilla is aware of the regressions | that Manifest V3 causes and shows goodwill in trying to | reduce them as much as possible. You can find some | information about this here: | https://github.com/w3c/webextensions/tree/main/_minutes | rahimnathwani wrote: | If I start using SingleFile today, will I still be able to open saved | pages after the update to Manifest V3? | | I mean, if I want to save pages over the next 11 months, should | I install SingleFile or SingleFile-Lite? | gildas wrote: | In fact, you simply do not need an extension to open pages | saved with SingleFile (or SingleFile Lite) because they are | standard HTML pages. So you don't have to worry about that. | warmwaffles wrote: | This alone is fantastic. I've been looking for an MHTML | replacement that works well across all browsers. | JeremyNT wrote: | I've been using SingleFile for the last year or so, it's | amazing! | | I'm going to hijack your post for a question! I love the way | you can use the editor and select "format for better | readability," then save just the stripped down version of the | page. I use this to send it to my e-ink device. | | The question I have is whether it's possible to toggle the | default save to use the formatted version automatically? I dug | into the options and didn't turn anything up! | gildas wrote: | You can enable these options for this: | | - Annotation editor > default mode > edit the page | | - Annotation editor > annotate the page before saving | gildas wrote: | Sorry, I was wrong, you have to select "format the page" | instead of "edit the page" (first item). | narag wrote: | Thank you, very useful and works like a charm: a must have.
| cloudwizard wrote: | Is there a configuration for the zip version where I can avoid | duplicating the static assets? Thanks | gildas wrote: | I guess you're referring to SingleFileZ. This option is not | needed because zip files (i.e. what SingleFileZ produces) | already provide this feature. | aantix wrote: | Is it possible to use this within the context of the current | web page, without the extension portion? | | Taking a snapshot of my user's screen and then displaying it to | them later (maybe in an iFrame)? | gildas wrote: | It's possible but it's a bit limited. For example, it won't be | able to save images coming from a different origin. | hrgiger wrote: | Thanks for the work you have done, it's a lazy man's heaven, | especially for bulk downloads, and helped me a lot. About a | month ago I decided to back up my bookmarks via archivebox; | it was more than 1k bookmarks, and the most reliable methods were | singlefile and wget. | gwbas1c wrote: | FYI: Figure tags don't convert their hrefs to base64. | | For example, try saving my home page: | https://andrewrondeau.herokuapp.com/ | | The img tags are converted correctly, but there's still <figure | class=image><a href="https://andrewrondeau.herokuapp.com/... in | the single HTML file. | gildas wrote: | I cannot reproduce your issue, I just did a test on this page | and I see the expected `<img src="data:image/jpeg;base64,...` | in the saved page. | Mr_Modulo wrote: | This is good for people who don't have constant internet access | and need to reference web resources offline. | | Webpage saving technology does not seem to have kept pace with | the evolution of the web. | | Images loaded by CSS aren't saved at all. JavaScript on the page | will often hijack a saved page and not let it display at all. | | One option that works fairly well and does not require installing | a browser extension is to save the page as a PDF. | | I wish browser developers would put more effort into this area.
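One way to check a saved page for images that were not inlined, as in the figure-tag report above, is to list every `img` whose `src` is not a `data:` URI. The sketch below uses only Python's standard library; the class and function names are hypothetical, and it deliberately ignores other reference types such as `<a href>`, CSS `url(...)`, and `srcset`.

```python
from html.parser import HTMLParser


class ExternalImageFinder(HTMLParser):
    """Collect the src of every <img> that does not use a data: URI."""

    def __init__(self):
        super().__init__()
        self.external = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        # Anything that is not a data: URI still points outside the file.
        if src and not src.startswith("data:"):
            self.external.append(src)


def find_external_images(html: str) -> list:
    """Return the src values of images in `html` that were not inlined."""
    finder = ExternalImageFinder()
    finder.feed(html)
    finder.close()
    return finder.external


# Hypothetical saved page: one inlined image and one leftover remote image.
sample = (
    '<img src="data:image/jpeg;base64,AAAA">'
    '<img src="https://example.com/photo.jpg">'
)
print(find_external_images(sample))
```

An empty result means every `img` in the file carries its own data, so those images should display without network access.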
| manor wrote: | If you keep the JavaScript, you also get the world's most | portable (desktop) application format... | stanislavb wrote: | Opening the repo makes you download a 17MB gif. I hope you are | not on an expensive mobile connection. | | p.s. the demo is nice ___________________________________________________________________ (page generated 2022-03-02 23:00 UTC)