[HN Gopher] Show HN: Self-hosted offline Internet from your brow...
       ___________________________________________________________________
        
       Show HN: Self-hosted offline Internet from your browsing history
        
       Author : graderjs
       Score  : 297 points
       Date   : 2020-11-11 16:15 UTC (6 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | segmondy wrote:
       | I just wanna cache all my bookmarks. I rarely look at them, but
       | when I do go look at them, a good chunk tend to have rotted. It
       | would be awesome to cache all my bookmarks and then have the
       | option to recursively cache the path I'm in. I don't want to
       | cache every page I visit, 90% is junk.
        
         | knyazhefilms wrote:
         | >have option to recursively cache the path I'm in
         | 
         | It's interesting, what do you mean by that?
        
       | jb775 wrote:
       | Would be cool if you could create a whitelist of websites, then
       | have a feature to check if any other users have more recent
       | versions of those sites (if they happen to be online). This way
       | you get decentralized site updates without actually going to each
       | site itself.
        
         | cutemonster wrote:
         | Yes. But also keep one's own originally downloaded version, in
         | case the newer version is messed up
         | 
         | And even more cool: If one could browse one's friends' sites,
         | while everyone was offline (if their privacy / sharing settings
         | allowed), just a local net in maybe a rural village
         | 
         | Edit: roadmap: "Distributed p2p web browser on IPFS" -- is that
         | it? :-)
        
       | severine wrote:
       | Given the hard 'no' in the FAQ, does anyone know about a similar
       | project for Firefox?
        
         | rzzzt wrote:
         | Could the "HAR" file I can save from Firefox' Network tab
         | somehow be used for this? That looks to be a recording from the
         | entire timeline, including payloads.
        
           | franga2000 wrote:
           | I have used HAR files for archiving purposes in the past and
           | it did work fairly well, but I'm not sure if there's a way of
           | getting them programmatically
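
A minimal sketch of reading a HAR file after the fact, for anyone following the HAR idea above. It relies only on the HAR 1.2 layout (`log.entries[].request.url` and `log.entries[].response.content`, where `text` may be base64-encoded). The function names are hypothetical, and this does not address getting the HAR out of Firefox programmatically, which the thread leaves open.

```python
import base64
import json


def load_har(path):
    """Parse a HAR file from disk (utf-8-sig tolerates a leading BOM)."""
    with open(path, encoding="utf-8-sig") as handle:
        return json.load(handle)


def iter_har_responses(har):
    """Yield (url, mime_type, body_bytes) for every entry with a recorded body."""
    for entry in har["log"]["entries"]:
        content = entry["response"].get("content", {})
        text = content.get("text")
        if text is None:
            continue  # the browser did not record a body for this entry
        if content.get("encoding") == "base64":
            body = base64.b64decode(text)  # binary payloads are stored as base64
        else:
            body = text.encode("utf-8")
        yield entry["request"]["url"], content.get("mimeType", ""), body
```

Something like `for url, mime, body in iter_har_responses(load_har("session.har"))` would then walk the recording; `session.har` is a placeholder name.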
        
         | BlackLotus89 wrote:
         | As the title of the GitHub repo suggests, ArchiveBox [0] can be
         | used. You have to manually import your browsing history
         | though....
         | 
         | In theory you could also use yacy [1]... But that is intended as
         | a search engine and not as an archive.
         | 
         | Edit: while looking into it I found alternatives [2] and Memex
         | [3] seems to be interesting.
         | 
         | Edit2: I remember 2 Show HNs. One recorded your entire desktop
         | and made it searchable. Can't remember what that was called,
         | but the AllSeingEye I found [4]
         | 
         | [0] https://archivebox.io/
         | 
         | [1] https://yacy.net/
         | 
         | [2] https://docs.archivebox.io/en/latest/Web-Archiving-
         | Community...
         | 
         | [3] https://getmemex.com/
         | 
         | [4] https://news.ycombinator.com/item?id=7886270
        
           | severine wrote:
           | Thanks for your answer, I had missed ArchiveBox completely!
           | 
           | I currently use Memex, but this is a different approach, and I
           | keep looking for a polished experience that can get more
           | mainstream users into archiving/offline browsing.
        
           | vezycash wrote:
           | Add Webrecorder to the list
        
             | BlackLotus89 wrote:
             | It's now called Conifer and listed under my [2] link :) but
             | thank you for mentioning it by name so I could look it up
             | again, seems interesting.
             | 
             | Looks like I got some research for this week
        
       | tiborsaas wrote:
       | I'd love to see an entry in the FAQ explaining the weird name.
        
         | guavaNinja wrote:
         | Just the port they used by default, to help remember it
        
       | deelawn wrote:
       | Seems cool, but can someone explain to me the need for all of the
       | obfuscated code in the files with "22120" in the name?
        
         | totony wrote:
         | I think those are the build artifacts
        
       | alliao wrote:
       | I miss RSS primarily because I was able to search for stuff
       | either I've read or I care about...
       | 
       | I am too embarrassed to admit that a disproportionate amount of
       | my time is spent looking for a sentence or, god forbid, a tweet
       | I vaguely remember reading last week.
       | 
       | So yes, consider this a vote for that sexy full text search
       | please.
        
       | [deleted]
        
       | avmich wrote:
       | Awesome thing, but -
       | 
       | > Can I use this with a browser that's not Chrome-based?
       | > No.
       | 
       | Note that a (rather similar) thing I've participated in in 2002
       | was browser-neutral.
        
       | nosmokewhereiam wrote:
       | This is really cool. Thank you for being open source.
        
       | xiphias2 wrote:
       | I'd love to use this on my mobile, as that's where I mostly have
       | problems with connecting to internet, but it still looks pretty
       | interesting
        
       | peterburkimsher wrote:
       | It looks like something I'd appreciate! I make a significant
       | effort to archive things that I think I'll need.
       | 
       | Unfortunately it didn't work when I just tried installing it now
       | (macOS 10.13.6, node v14.8.0).
       | 
       |     MacBook-Pro:Desktop peter$ npx archivist1
       |     npx: installed 79 in 8.282s
       |     Preferences file does not exist. Creating one...
       |     Args usage: <server_port> <save|serve> <chrome_port> <library_path>
       |     Updating base path from undefined to /Users/peter...
       |     Archive directory (/Users/peter/22120-arc/public/library) does not exist, creating...
       |     Created.
       |     Cache file does not exist, creating...
       |     Created!
       |     Index file does not exist, creating...
       |     Created!
       |     Base path updated to: /Users/peter. Saving to preferences...
       |     Saved!
       |     Running in node...
       |     Importing dependencies...
       |     Attempting to shut running chrome...
       |     There was no running chrome.
       |     Removing 22120's existing temporary browser cache if it exists...
       |     Launching library server...
       |     Library server started.
       |     Waiting 1 second...
       |     {"server_up":{"upAt":"2020-11-11T21:48:25.324Z","port":22120}}
       |     Launching chrome...
       |     (node:33988) UnhandledPromiseRejectionWarning: Error: connect ECONNREFUSED 127.0.0.1:9222
       |         at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1144:16)
       |     (Use `node --trace-warnings ...` to show where the warning was created)
       |     (node:33988) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 1)
       |     (node:33988) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
       |     (node:33988) UnhandledPromiseRejectionWarning: TypeError: Cannot read property 'writeFileSync' of undefined
       |         at ae (/Users/peter/.npm/_npx/33988/lib/node_modules/archivist1/22120.js:321:14209)
       |         at Object.changeMode (/Users/peter/.npm/_npx/33988/lib/node_modules/archivist1/22120.js:321:8088)
       |         at /Users/peter/.npm/_npx/33988/lib/node_modules/archivist1/22120.js:321:16174
       |         at s.handle_request (/Users/peter/.npm/_npx/33988/lib/node_modules/archivist1/22120.js:128:783)
       |         at s (/Users/peter/.npm/_npx/33988/lib/node_modules/archivist1/22120.js:121:879)
       |         at p.dispatch (/Users/peter/.npm/_npx/33988/lib/node_modules/archivist1/22120.js:121:901)
       |         at s.handle_request (/Users/peter/.npm/_npx/33988/lib/node_modules/archivist1/22120.js:128:783)
       |         at /Users/peter/.npm/_npx/33988/lib/node_modules/archivist1/22120.js:114:2533
       |         at Function.v.process_params (/Users/peter/.npm/_npx/33988/lib/node_modules/archivist1/22120.js:114:3436)
       |         at b (/Users/peter/.npm/_npx/33988/lib/node_modules/archivist1/22120.js:114:2476)
       |     (node:33988) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 3)
       |     ^CCleanup called on reason: SIGINT
       |     MacBook-Pro:Desktop peter$
        
       | halukakin wrote:
       | This was a nice IE feature 20 years ago.
       | 
       | https://support.microsoft.com/en-us/help/196646/how-to-make-...
        
       | atum47 wrote:
       | Very interesting indeed. I remember having to paste a script in
       | the console in order to be able to view my cached files.
        
       | abnry wrote:
       | I am coding my own hacked together bookmarks manager. I can save
       | any page with a click of the button using SingleFile (a fantastic
       | Chrome extension, by the way!).
       | 
       | Then a cronjob runs and puts it into a folder to be processed
       | into a database, which generates a static html index and puts it
       | in my Google Drive.
       | 
       | Then it syncs offline on my chromebook. Which means that without
       | internet, I can put my chromebook in tablet mode and do some nice
       | reading. I've been very pleased so far.
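
The "folder of saved pages to static HTML index" step in the comment above could look roughly like this. It is a sketch under assumptions, not abnry's actual code: it skips the database entirely, assumes each saved page is a single `.html` file (as SingleFile produces), and the function names are made up for illustration.

```python
import html
import re
from pathlib import Path
from urllib.parse import quote

TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def page_title(markup, fallback):
    """Return the page's <title> text, or the fallback if there is none."""
    match = TITLE_RE.search(markup)
    if not match:
        return fallback
    # Decode entities and collapse whitespace inside the title.
    title = " ".join(html.unescape(match.group(1)).split())
    return title or fallback


def build_index(folder):
    """Return the HTML for an index page linking every saved page in folder."""
    items = []
    for page in sorted(Path(folder).glob("*.html")):
        if page.name == "index.html":
            continue  # do not list the index itself
        markup = page.read_text(encoding="utf-8", errors="replace")
        title = page_title(markup, page.stem)
        href = html.escape(quote(page.name), quote=True)
        items.append(f'<li><a href="{href}">{html.escape(title)}</a></li>')
    return (
        '<!doctype html>\n<meta charset="utf-8">\n<title>Saved pages</title>\n'
        "<ul>\n" + "\n".join(items) + "\n</ul>\n"
    )
```

A cron job could write the result to `index.html` inside the synced folder, so the index travels with the pages.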
        
         | johnmaguire2013 wrote:
         | Any chance you have this open-sourced or described in more
         | detail somewhere?
        
           | abnry wrote:
           | It's too hackish, system dependent, and not feature complete
           | yet. I plan to run it as a flask app on the local network as
           | a more intuitive way of tagging and managing bookmarks... lot
           | more to do.
        
         | kilroy_jones wrote:
         | I had started working on something similar to this, but without
         | the Google Drive component. I wanted something where I could
         | right click and "snag" a file, link or document and have it
         | saved to a server I controlled.
         | 
         | It's not complete, mostly because the frontend is a mess, but
         | the backend is able to save files, pages and links
         | (https://gitlab.com/thebird/snag). Used Rust backend, Svelte
         | and JS for the extension (of course).
        
         | rezeroed wrote:
         | Pocket?
        
         | throwii wrote:
         | I currently extract bookmarks from Firefox and Safari and store
         | them inside a local database. Then a cronjob saves them to the
         | Wayback Machine if a prior check revealed that they are not
         | archived there yet. I donate regularly for that. Mine just makes
         | sure that the pages are not lost, but yours enables offline
         | reading.
         | 
         | I'm uncertain what the best mechanism is, there are so many
         | ways to solve it. From filtering to recrawling for new content
         | to enabling more advanced features, there are so many
         | possibilities.
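
A rough sketch of the "check first, then save to the Wayback Machine" cron job described above. It assumes the Internet Archive's public availability endpoint (`https://archive.org/wayback/available?url=...`) and the Save Page Now URL form (`https://web.archive.org/save/<url>`); the function names are hypothetical and this is not throwii's code. The network calls are injectable so the logic can be exercised offline, and a real job would also need rate limiting and error handling.

```python
import json
import urllib.parse
import urllib.request

AVAILABILITY_API = "https://archive.org/wayback/available"
SAVE_ENDPOINT = "https://web.archive.org/save/"


def availability_url(page_url):
    """Build the availability query for one bookmark."""
    return AVAILABILITY_API + "?" + urllib.parse.urlencode({"url": page_url})


def is_archived(api_response):
    """True if the availability response reports a usable snapshot."""
    closest = api_response.get("archived_snapshots", {}).get("closest")
    return bool(closest and closest.get("available"))


def fetch_json(url, timeout=30):
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def request_save(page_url, timeout=60):
    # Asks the Wayback Machine to capture the page; the response is ignored.
    with urllib.request.urlopen(SAVE_ENDPOINT + page_url, timeout=timeout):
        pass


def archive_missing(bookmarks, fetch=fetch_json, save=request_save):
    """Request a capture for every bookmark without a snapshot; return those URLs."""
    missing = []
    for url in bookmarks:
        if not is_archived(fetch(availability_url(url))):
            save(url)
            missing.append(url)
    return missing
```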
        
         | agumonkey wrote:
         | one day this will be as famous as youtube-dl
        
           | jbc1 wrote:
           | https://github.com/pirate/ArchiveBox
        
       | darepublic wrote:
       | Interesting stuff I will look into this more
        
       | wooptoo wrote:
       | Isn't this how the internet was supposed to work in the first
       | place? I remember Netscape navigator having a 'go offline' icon
       | in the corner.
        
         | rusk wrote:
         | Kind of. HTTP was designed with caching in mind, so the idea
         | was that if you GET a page it should more or less not change
         | and you could add headers and stuff to instruct proxy servers
         | about whether to cache or not and for how long. I think you
         | could use HEAD then to check if a page had changed ...
         | 
         | The browser cache used to actually be quite dependable as an
         | offline way to view pages but this seems to have fallen out of
         | favour in the mid-noughties. I remember how disgusted I was
         | when I realised Safari was no longer letting me see a page
         | unless it could contact the server and download the latest
         | version.
         | 
         | I used to have a caching proxy server that would basically MITM
         | my browsing and be more vigilant than even the cache and it
         | really worked quite well. This was back in the 90s when every
         | bit of your max 54kbs counted, or when you wanted to read
         | something while your Dad or sister wanted to also use the
         | phone.
         | 
         | Anyway, you can no longer take this approach because bad people
         | broke the Internet and now you have to have a great honking
         | opaque TLS layer between you and the caching servers so there's
         | no way for this optimisation to work any more.
         | 
         | Of course it isn't really as important these days because we've
         | got faster connections and interactions with the server are far
         | less transactional and richer. But I still would like to have a
         | way of tracking my own webusage and being able to go back in
         | time without having to actually revisit each and every site.
         | 
         | These days you have to hack the browser because that's where
         | your TLS endpoint emerges. Kaspersky tried this for their HTTP
         | firewall application and there were ructions over that.
         | 
         | I'll defo take a look at this. Sounds just like what I've been
         | looking for.
         | 
         | > Isn't this how the internet was supposed to work in the first
         | place? I remember Netscape navigator having a 'go offline' icon
         | in the corner.
         | 
         | Thinking back actually, if you forget about "the web"/HTTP -
         | then yes actually - this is exactly how usenet worked and now
         | I'm remembering that the "go offline" button used to download
         | all your newsgroups along with your email and stuff so you
         | could look at it all offline :-)
         | 
         | If you want something that's like Usenet these days check out
         | Scuttlebutt.
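
The HTTP caching idea at the start of rusk's comment (headers that say whether and how long to cache, plus a cheap check for changes) can be sketched as below. This is a simplified illustration, not a full HTTP cache: it only looks at `Cache-Control`, assumes header names in their usual capitalisation, and uses a conditional GET (`If-None-Match` / `If-Modified-Since`, answered with 304 when nothing changed) where the comment mentions HEAD. The function names are made up.

```python
import re


def max_age(cache_control):
    """Return the freshness lifetime in seconds, or None if none is given.

    no-store and no-cache are both treated as "never fresh" here, which is a
    simplification: no-cache responses may be stored but must be revalidated.
    """
    directives = [d.strip().lower() for d in cache_control.split(",") if d.strip()]
    if "no-store" in directives or "no-cache" in directives:
        return 0
    for directive in directives:
        match = re.fullmatch(r"max-age=(\d+)", directive)
        if match:
            return int(match.group(1))
    return None


def is_fresh(stored_headers, age_seconds):
    """True if a stored response may be reused without contacting the server."""
    limit = max_age(stored_headers.get("Cache-Control", ""))
    return limit is not None and age_seconds < limit


def validation_headers(stored_headers):
    """Headers for a conditional request that asks 'has this changed?'."""
    headers = {}
    if "ETag" in stored_headers:
        headers["If-None-Match"] = stored_headers["ETag"]
    if "Last-Modified" in stored_headers:
        headers["If-Modified-Since"] = stored_headers["Last-Modified"]
    return headers
```

A stale entry is not discarded: it is revalidated with these headers, and a 304 reply means the stored copy can still be served.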
        
         | romanoderoma wrote:
         | Don't know if recently it changed, but that's how internet in
         | Cuba works (at least until 2014)
         | 
         | https://www.google.com/amp/s/amp.theguardian.com/world/2014/...
        
         | lights0123 wrote:
         | Firefox still has it under File.
        
           | runxel wrote:
           | Sure, but does it do anything?
        
           | teddyh wrote:
           | In the modern hamburger menu, it's under "More".
        
             | anonymfus wrote:
             | Even if you have the menu disabled, you can still open it
             | by pressing Alt, no need to suffer using the hamburger.
        
         | reaperducer wrote:
         | I remember that button, too, but I think it had more to do with
         | connection charges than caching.
         | 
         | In Netscape days, many people would have to pay by the minute
         | to be connected to the internet. In those days, web pages
         | generally contained far more information than they do now, and
         | were less interactive. So you'd connect, load the content you
         | wanted to see, disconnect, and then just sit there and read it
         | for free, instead of bleeding cash.
        
           | asdff wrote:
           | Or even dialup. Expecting a phone call? Go offline.
        
           | rusk wrote:
           | Back in the nineties the web was fairly new and people still
           | used a thing called usenet quite a bit. You interacted with
           | it kind of like email (Google groups is actually the final
           | vestiges of it) - and the go offline button would just
           | download all your emails and newsgroups and you got to peruse
           | them offline at your pleasure. It might seem strange also
           | that back in those days you downloaded your emails from a
           | server using POP3 rather than looking at them remotely (e.g.
           | Web or IMAP), and you viewed them offline.
        
           | derefr wrote:
           | I'm not sure that was the use-case. With pre-DHTML HTML4,
           | there really just wasn't anything on a page that could
           | continue to interact with the server after the page finished
           | loading. So, presuming the button was for your described use-
           | case, what would the difference be between "going offline"
           | and just... not clicking any more links? (It's not like
           | Netscape could or _should_ signal your modem to hang up --
           | Netscape doesn't know what else in your OS might also be
           | using the modem.)
        
             | Donald wrote:
             | You must be young :)
             | 
             | These browsers were born in the era of dialup Internet that
             | had per minute charges and/or long distance charges. At the
             | very least you were tying up your family's phone line.
             | 
             | Basically it's like paying for every minute your cable
             | modem is plugged in.
             | 
             | For the feature itself: Netscape had integration with the
             | modem connectivity for the OS and would initiate a
             | connection when you tried to visit a remote page. Offline
             | mode let you disable automatic dialing of the modem.
        
               | derefr wrote:
               | I ran a BBS, my friend :) I'm quite familiar with modems.
               | I just never used Windows (or the web!) until well past
               | the Netscape era, so I'm not too familiar with the
               | intersection of modems and early web browsers.
               | 
               | > Netscape had integration with the modem connectivity
               | for the OS and would initiate a connection when you tried
               | to visit a remote page.
               | 
               | That's not "integration with modem connectivity", that's
               | just going through the OS's socket API (or userland
               | socket stack, e.g. Trumpet Winsock); where the socket
               | library dials the modem to serve the first bind(2). Sort
               | of like auto-mounting a network share to serve a VFS
               | open(2).
               | 
               | Try it yourself: boot up a Windows 95 OSR2 machine with a
               | (configured) modem, and try e.g. loading your Outlook
               | Express email. The modem will dial. It's a feature of the
               | socket stack.
               | 
               | These socket stacks would also automatically hang up the
               | modem if the stack was idle (= no open sockets) for long
               | enough.
               | 
               | My point was that a quiescent HTML4 browser _has_ no open
               | sockets, whether or not it's intentionally "offline." If
               | you do as you say -- load up a bunch of pages, and then
               | sit there reading them -- your modem _will_ hang up,
               | whether or not you play with Netscape's toggles.
               | 
               | (On single-tasking OSes like DOS -- where a TCP/IP socket
               | stack would be a part of a program, rather than a part of
               | the OS -- there was software that would eagerly hang up
               | the modem whenever its internal socket refcount dropped
               | to zero. But this isn't really a useful strategy for a
               | multitasking OS, since a lot of things -- e.g. AOL's
               | chatroom software presaging AIM -- would love to poll
               | _just_ often enough to cause the line that had just
               | disconnected to reconnect. Since calls were charged per-
               | minute rather than per-second, these reconnects had
               | overhead costs!)
               | 
               | > [Netscape's] offline mode let you disable automatic
               | dialing of the modem.
               | 
               | When you do... what?
               | 
               | When you first open the browser, to avoid loading your
               | home page? (I guess that's sensible, especially if you're
               | using Netscape in its capacity as an email client to read
               | your already-synced email; or using it to author and test
               | HTML; or using it to read local HTML documentation. And
               | yet, not _too_ sensible, since you need to _open_ the
               | browser to _get_ to that toggle... is this a thing you
               | had to think about in advance, like turning off your AC
               | before shutting off your car?)
               | 
               | But I think you're implying that it's for when you try to
               | navigate to a URL in the address bar, or click a link.
               | 
               | In which case, would the page, in fact, be served from
               | the client-side cache, or would you just get nothing?
               | (Was HTTP client-side caching even a _thing_ in the early
               | 90s? Did disks have the room to _hold_ client-side
               | caches? Did web servers by-and-large _bother_ to send
               | HTTP/1.0 Expires and Last-Modified headers? Etc.)
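
The dial-on-demand behaviour derefr describes (the socket stack dials on first use, counts open sockets, and hangs up once it has been idle long enough) can be modelled with a toy class. This is a simulation for illustration only; the class and its fields are invented here, and time is passed in explicitly rather than read from a clock.

```python
class DialOnDemandLink:
    """Toy model of a modem link managed by a socket stack."""

    def __init__(self, idle_timeout):
        self.idle_timeout = idle_timeout  # seconds with no open sockets before hang-up
        self.connected = False
        self.open_sockets = 0
        self.idle_since = None  # time the last socket closed, if currently idle
        self.dial_count = 0     # each dial is what incurs per-minute billing overhead

    def open_socket(self, now):
        if not self.connected:
            self.connected = True  # first socket use triggers the dial
            self.dial_count += 1
        self.open_sockets += 1
        self.idle_since = None

    def close_socket(self, now):
        if self.open_sockets == 0:
            raise ValueError("no open sockets to close")
        self.open_sockets -= 1
        if self.open_sockets == 0:
            self.idle_since = now  # start the idle timer

    def tick(self, now):
        """Hang up if the link has been idle for at least idle_timeout."""
        idle = self.idle_since is not None and now - self.idle_since >= self.idle_timeout
        if self.connected and idle:
            self.connected = False
            self.idle_since = None
```

With `idle_timeout=0` this becomes the eager DOS-style hang-up, and a program that polls every so often forces a redial each time, which is the overhead cost the comment points out.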
        
               | fiddlerwoaroof wrote:
               | I used to go into offline-mode so the browser would
               | access pages from the cache when I went to their URLs. It
               | wasn't a ton, but it was enough that you could queue up a
               | handful of sites, go offline and then, if you
               | accidentally closed the tab, re-open it and see the
               | cached version.
        
             | rzzzt wrote:
             | Server-Sent Events? How old is that mechanism?
        
             | [deleted]
        
       | yamrzou wrote:
       | This uses Chrome DevTools Protocol in a pretty clever way. I used
       | it to archive a highly interactive website and it worked like a
       | charm.
       | 
       | The README states: "It runs connected to a browser, and so is
       | able to access the full-scope of resources (with, currently, the
       | exception of video, audio and websockets, for now)"
       | 
       | I wonder what kind of limitations makes it hard to intercept
       | those resources like the rest of the content.
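
For readers unfamiliar with the Chrome DevTools Protocol, here is a sketch of what response interception and replay can look like at the message level. It uses the protocol's `Fetch` domain (`Fetch.enable`, the `Fetch.requestPaused` event, `Fetch.getResponseBody` results, `Fetch.fulfillRequest`, `Fetch.continueRequest`). Whether 22120 works this way is an assumption; the helper names and the in-memory `archive` dict are hypothetical, and the WebSocket transport to the browser is left out.

```python
import base64
import itertools
import json

_ids = itertools.count(1)  # every CDP command carries a unique id


def command(method, params=None):
    """Serialize one CDP command as a JSON message."""
    return json.dumps({"id": next(_ids), "method": method, "params": params or {}})


def enable_interception():
    # Pause every request at the response stage so the body can be read.
    return command("Fetch.enable",
                   {"patterns": [{"urlPattern": "*", "requestStage": "Response"}]})


def archive_key(request):
    return f'{request["method"]} {request["url"]}'


def store_response(archive, paused_event, body_result):
    """Save a paused response; body_result is the Fetch.getResponseBody result."""
    params = paused_event["params"]
    raw = body_result["body"]
    body = base64.b64decode(raw) if body_result["base64Encoded"] else raw.encode("utf-8")
    archive[archive_key(params["request"])] = {
        "status": params.get("responseStatusCode", 200),
        "headers": params.get("responseHeaders", []),
        "body": body,
    }


def replay_command(archive, paused_event):
    """Answer a paused request from the archive, or let it go to the network."""
    params = paused_event["params"]
    saved = archive.get(archive_key(params["request"]))
    if saved is None:
        return command("Fetch.continueRequest", {"requestId": params["requestId"]})
    return command("Fetch.fulfillRequest", {
        "requestId": params["requestId"],
        "responseCode": saved["status"],
        "responseHeaders": saved["headers"],
        "body": base64.b64encode(saved["body"]).decode("ascii"),  # bodies travel as base64
    })
```

This request/response shape also hints at why streams are harder: a long-lived media stream or a two-way WebSocket does not reduce to one stored body per URL.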
        
         | jmaygarden wrote:
         | Video and audio is probably just a matter of not having gotten
         | around to it yet. WebSockets are another matter. I'm not sure
         | what one would do with a two-way channel in a general sense.
         | It's often not an idempotent operation.
        
         | nexuist wrote:
         | Video files are massive, so it may just be the case that
         | archiving videos takes so long they didn't want to support it.
        
         | harlanji wrote:
         | I could smash in the audio visual, as my platform is all about
         | archiving and being the origin for a CDN-fronted open offline-
         | first platform with minimal resources that can go into a boat
         | etc. tinydatacenter.com, Github / harlanji /
         | ispooge,tinydatacenter, biz@harlanji.com - need $375/wk.
        
       | a254613e wrote:
       | I remember seeing this on reddit, the license changed quite a lot
       | over the past month, with some very weird custom licenses asking
       | you not to be a fake victim, lie, etc in the process -
       | https://github.com/c9fe/22120/commits/master/LICENSE - how safe
       | is it to assume that the current license will stay?
        
         | ciarannolan wrote:
         | Assuming good faith in the creator of this, it looks like they
         | tried to type up something that they thought would cover their
         | bases, then realized that wasn't right and copy/pasted in a
         | real license.
        
       | yuskii wrote:
       | This post has made me angor;
        
       | caymanjim wrote:
       | This is a neat idea, but I wish it would respect some basic Unix
       | standards by default. Two big annoyances jump out at first
       | glance: it assumes you want to use port 22120, and it puts its
       | config in ~/22120-arc. Maybe both of these are configurable, but
       | the directory is a terrible default. Use XDG (~/.config/22120) or
       | _at least_ use a hidden directory in the home dir. And the port
       | it operates on should be completely configurable. Naming the
       | project 22120 is a terrible idea, and assuming that port won't
       | need to change is bad practice.
       | 
       | I'm not making any value judgment about the actual tool. It
       | sounds interesting enough. But it should behave better.
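
What caymanjim asks for could look something like the sketch below: resolve the config directory per the XDG Base Directory convention and treat the port as overridable. The "Args usage: <server_port> ..." line in the log posted elsewhere in this thread suggests the port may already be accepted as the first argument. Everything here is illustrative: `ARCHIVIST_PORT` is a made-up environment variable and none of these names come from the project.

```python
import os
from pathlib import Path

APP_NAME = "22120"
DEFAULT_PORT = 22120


def config_dir(env=None):
    """Return $XDG_CONFIG_HOME/22120, falling back to ~/.config/22120."""
    env = os.environ if env is None else env
    base = env.get("XDG_CONFIG_HOME", "")
    # The XDG spec says an unset, empty, or relative value must be ignored.
    if not base or not os.path.isabs(base):
        home = env.get("HOME") or str(Path.home())
        base = os.path.join(home, ".config")
    return Path(base) / APP_NAME


def server_port(argv, env=None):
    """Pick the port from the first argument, then the environment, then the default."""
    env = os.environ if env is None else env
    raw = argv[0] if argv else env.get("ARCHIVIST_PORT")  # hypothetical variable
    if raw is None:
        return DEFAULT_PORT
    port = int(raw)
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return port
```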
        
         | fizixer wrote:
         | I agree with what you say while also pointing out that the unix
         | home directory has become a complete mess. Anyone (any
         | installed software) can do whatever they like, there is no
         | mechanism of enforcement, and advice in the form of
         | constructive critique or comment is not even a drop in the
         | bucket towards fixing the problem.
        
       | lxgr wrote:
       | Very interesting project.
       | 
       | I wish this was actually (optional) built-in behavior for
       | browsers when bookmarking pages, or at least when adding to a
       | "read later" list like Pocket/Instapaper etc.
       | 
       | Pocket seems to offer something like this, but only in the
       | premium version, so the "permanent archive" ironically seems to
       | go away when unsubscribing.
       | 
       | As a workaround, what if bookmarking a (public) page could
       | actually ping it to archive.org for archival?
        
         | sbeckeriv wrote:
         | selfplug: https://sbeckeriv.github.io/personal_search/
         | 
         | I am working on a personal project like this. It is in its
         | early stages. I am creating a local search based on my browser
         | history, so it doesn't crawl pages. Also, the fetch happens
         | outside the browser, so authed URLs are not supported out of
         | the box.
         | 
         | I have a bookmarklet currently that lets me "pin" my page. My
         | pinned pages are my new home page. It's how I keep my tabs
         | closed.
         | 
         | I do not do a full archive (but I could). Instead you get
         | an offline view that is stripped of most things. Example:
         | https://raw.githubusercontent.com/sbeckeriv/personal_search/...
         | 
         | Demo of the pin:
         | 
         | https://www.youtube.com/watch?v=5g_mXXFwQlg
         | 
         | a self hosted version is on the roadmap.
        
         | asaddhamani wrote:
         | I have a project https://www.github.com/dhamaniasad/crestify
         | that does the archival to archive.org and archive.today, you
         | might find it useful
        
         | shrike wrote:
         | Pinboard.in (not affiliated, just a happy customer) offers an
         | archiving service for saved bookmarks.
        
           | tokamak-teapot wrote:
           | What it doesn't offer is an integration with the browser to
           | make it seamless to work with those bookmarks. There are
           | various extensions for Firefox which will save to Pinboard
           | (one of them is mine!) but to work with them - you have the
           | option of going to the website or using the mobile site in a
           | sidebar (I do this, with some custom css to make it more
           | readable for me).
           | 
           | There's a nice macOS application (sorry, can't remember the
           | name right now) which gives you a better interface, but...
           | they're bookmarks. I would rather they were integrated with
           | browser bookmarks. And usable when the site is down or I'm
           | offline. And to appear when I search... lots of
           | possibilities there.
        
         | severine wrote:
         | _I wish this was actually (optional) built-in behavior for
         | browsers when bookmarking pages, or at least when adding to a
         | "read later" list like Pocket/Instapaper etc._
         | 
         | Sideshow Ask HN: Didn't Firefox mobile work like this? I could
         | read the reader view items offline...
         | 
         | Does anyone know what's happening with the whole
         | bookmarks/collections situation?
        
         | gildas wrote:
         | I implemented some options for that purpose in SingleFile [1].
         | They allow you to save the page when you bookmark it and
         | eventually replace the URL of the page with the file URI on
         | your disk.
         | 
         | [1] https://github.com/gildas-lormeau/SingleFile
        
           | johnchristopher wrote:
           | Cool, I was so into maff back in the days. I'll give it a
           | try.
           | 
           | (I even wrote this before checking out your link: Have you
           | heard of https://en.wikipedia.org/wiki/Mozilla_Archive_Format
           | from two or three Internets ago? If so, what are your
           | thoughts on it?)
        
             | gildas wrote:
             | I would recommend you to take a look at SingleFileZ [1], it
             | should remind you of something ;)
             | 
             | [1] https://github.com/gildas-lormeau/SingleFileZ
        
           | toomuchtodo wrote:
           | Why a zip file instead of a WARC file?
           | 
           | https://en.wikipedia.org/wiki/Web_ARChive
        
             | walski wrote:
             | see: https://github.com/c9fe/22120#why-not-warc-or-another-
             | format...
             | 
             | > Both WARC and MHTML require mutilatious modifications of
             | the resources so that the resources can be "forced to fit"
             | the format. At 22120, we believe this is not required
        
             | gildas wrote:
             | Because it's easier to produce and extract. The zip format
             | also allows creating self-extracting files (I'm referring
             | to SingleFileZ). I'm not sure this is possible with the
             | WARC format.
        
               | toomuchtodo wrote:
               | I see you answered this in a thread a year ago [1] (came
               | up in a Google search), my apologies.
               | 
               | [1] https://news.ycombinator.com/item?id=21426056
        
         | kall wrote:
         | This is a feature of iOS Safari with the read later list. It's
         | not been particularly reliable for me though.
        
           | lxgr wrote:
           | iOS's implementation is definitely useful, but I was thinking
           | more along the lines of a permanent archive persistently
           | stored.
           | 
           | iOS seems to optimize for temporary offline scenarios; saved
           | pages do not seem to be backed up or synced to iCloud.
        
             | kall wrote:
             | Yeah. I also assume it deletes the pages after they are
             | "read" but who knows, there's no insight into the feature.
             | 
             | The best bookmarking option for archival seems to be the
             | pinboard.in archive plan.
        
       | xtiansimon wrote:
       | I've been storing my research as text files (manual copy and
       | paste of web page content) for years.
       | 
       | And, I've wanted a _search history first_ plugin for web search
       | to find pages I missed saving, but recall reading.
       | 
       | Since the former takes time and the latter doesn't exist, I
       | gather I could buy storage and save browsing using this tool.
       | 
       | It would be interesting to see how it works in practice--saving
       | so much data.
       | 
       | Also, for work I'd be interested to know how it works for
       | password-protected sites like banking, social media, etc.
        
         | ryanfox wrote:
         | I've been working on an app that's pretty much exactly "the
         | latter"! [0]
         | 
         | The amount of disk space it takes up isn't crazy. It has been
         | _very_ useful for me.
         | 
         | [0] https://apse.io
        
         | hiisukun wrote:
         | Just chiming in to say that the Firefox location bar has some
         | great filters [1] that might help you search history first (and
         | other things). It doesn't do a full-text search, but it often
         | helps me in the way I think you're after.
         | 
         | If you type "^ worms" in the location bar, it will search your
         | history for 'worms' and show the results in the dropdown.
         | Typing "* worms" will search your bookmarks instead. The rest
         | of the shortcut symbols are listed on the linked page. Hope
         | that helps!
         | 
         | [1] http://kb.mozillazine.org/Location_Bar_search
        
       | dksidana wrote:
       | Reminds me of the days of Webaroo [1] and Google Gears [2].
       | 
       | [1] https://en.m.wikipedia.org/wiki/Webaroo
       | [2] https://en.m.wikipedia.org/wiki/Gears_(software)
        
         | lxgr wrote:
         | Ah, that brings back memories. Didn't Palm OS have something
         | similar? I think it was Plucker [1], but I'm not too sure.
         | 
         | [1]
        
           | reaperducer wrote:
           | Yep. With Plucker, I could download the New York Times web
           | site (I think via RSS) before I went to work, sync it to my
           | Palm Pilot, and then read it on my lunch.
        
       | ghostbrainalpha wrote:
       | Very cool idea. I always bring my laptop with me camping in case
       | I get the urge to write something.
       | 
       | Having the ability to see the last week or so of my browsing
       | history would have come in handy on more than one occasion.
        
       | jsilence wrote:
       | Awesome! I always wanted this and at one point tried to achieve
       | it with WWWOFFLE, but the welcome proliferation of HTTPS thwarted
       | that attempt.
       | 
       | Gonna check it.
       | 
       | Unfortunately it's only for Chrome. I am very much used to having
       | my favourite set of Firefox plugins. Will have to check whether I
       | can replicate that with Chrome.
        
         | wolco2 wrote:
         | Unfortunate state of affairs with Firefox extensions. Niche
         | extensions do not exist anymore.
         | 
         | I had to switch to Chrome for extensions. Finding a Chrome
         | extension that provides similar functionality to your Firefox
         | ones should be easy.
        
           | phkahler wrote:
           | >> Unfortunate state of affairs with Firefox extensions.
           | Niche extensions do not exist anymore.
           | 
           | This seems like something that could be done in a proxy and
           | be browser independent.
        
             | codetrotter wrote:
             | But then your proxy would need to do the TLS termination,
             | which is probably cumbersome to set up and also means you
             | can no longer inspect the certificates for your
             | connections.
        
         | silon42 wrote:
         | I used http://www.gedanken.org.uk/software/wwwoffle/ a long
         | time ago (when on a modem).
         | 
         | What I'd like is to cache the history for each page too
         | (important for news pages).
        
       | ppezaris wrote:
       | Cool concept. In a world that's getting increasingly connected
       | what are the main use-cases?
       | 
       | I ask because the dev tool that our company creates occasionally
       | (okay, very rarely) gets a question about offline mode, and when
       | I prod, it's usually just out of curiosity, not because they
       | actually need it in real life.
        
         | lxgr wrote:
         | This seems to be geared towards "content goes down" scenarios,
         | rather than "reader is temporarily offline".
         | 
         | It's a concern I have every time I find a particularly
         | interesting independently hosted blog post or article.
         | 
         | The Internet Archive goes a long way towards making me worry
         | about this less, though. (Let's just hope they don't go away!)
        
         | reaperducer wrote:
         | _In a world that's getting increasingly connected what are the
         | main use-cases?_
         | 
         | Increasingly ≠ totally.
         | 
         | Even though I'm a developer, pre-pandemic I would have to spend
         | a day or three offline several times a year while working. This
         | would be useful for that.
         | 
         | I know an IT guy who works in mines. He loves anything that
         | works offline.
        
         | jaggirs wrote:
         | The coolest part, I think, is that you have a copy of all these
         | websites on disk, which means you can run a full-text search on
         | all the websites you visited (or on their HTML, technically).
         | 
         | Browser history sucks. I don't know if this project does this,
         | but I would absolutely love to be able to do SQL queries on my
         | browsing history.
         | 
         | I have 'lost' many websites I remember visiting, but for which
         | I didn't remember anything in the title.
         | 
         | Also, obviously, websites change sometimes, and the web archive
         | might not have cached the website you visited. Although from
         | what I can tell, this project doesn't version websites, it just
         | caches the latest, so you would probably just overwrite the
         | previous version accidentally.
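
A rough sketch of the "SQL queries on my browsing history" idea above, in Python with only the standard library. This is not how 22120 stores or searches pages; the `pages` table, its columns, and the helper names `open_index`, `add_page`, and `search` are all hypothetical. Each saved copy gets its own row keyed by URL and timestamp, so a later save does not overwrite an earlier one.

```python
import sqlite3
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Reduce an HTML document to its whitespace-joined text nodes."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def open_index(path=":memory:"):
    """Open (or create) the hypothetical index database."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        " url TEXT NOT NULL,"
        " saved_at TEXT NOT NULL,"
        " body TEXT NOT NULL)"
    )
    return db


def add_page(db, url, saved_at, html):
    """Store one saved copy; repeated saves of a URL add new rows."""
    db.execute(
        "INSERT INTO pages (url, saved_at, body) VALUES (?, ?, ?)",
        (url, saved_at, html_to_text(html)),
    )
    db.commit()


def search(db, term):
    """Substring search over page text, newest copy first.

    Uses LIKE, so '%' and '_' in the term act as wildcards and
    ASCII matching is case-insensitive.
    """
    rows = db.execute(
        "SELECT url, saved_at FROM pages"
        " WHERE body LIKE ? ORDER BY saved_at DESC",
        (f"%{term}%",),
    )
    return rows.fetchall()
```

Usage would be along the lines of calling `add_page(db, url, timestamp, html)` for each archived file and then `search(db, "worms")`. A plain `LIKE` scan is the simplest thing that works; for a large archive, SQLite's FTS5 extension (where the local build includes it) would likely be a better fit.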
        
       | hnguy321 wrote:
       | Anyone know of something like this that can sit on a network,
       | possibly as a web proxy?
        
         | erulabs wrote:
         | Squid (http://www.squid-cache.org/) is fairly close to what
         | you're looking for.
        
       ___________________________________________________________________
       (page generated 2020-11-11 23:00 UTC)