[HN Gopher] Webrecorder: Make an interactive copy of any web pag...
       ___________________________________________________________________
        
       Webrecorder: Make an interactive copy of any web page that you
       browse
        
       Author : pcr910303
       Score  : 209 points
       Date   : 2020-05-11 16:56 UTC (6 hours ago)
        
 (HTM) web link (webrecorder.io)
 (TXT) w3m dump (webrecorder.io)
        
       | A4ET8a8uTh0 wrote:
       | This is neat. More and more stuff is hidden behind a login
       | screen. Odd question. How does internet archive handle those
       | types of pages these days?
       | 
       | edit: types of
        
       | bacondude3 wrote:
       | Just want to plug HTTrack [1] here as well. Not nearly as slick,
       | but it's worked extremely well for me when Webrecorder couldn't
       | do the job. Being usable through the command line also makes it
       | useful for some projects WR can't do.
       | 
       | [1]: http://www.httrack.com/
        
         | anigbrowl wrote:
         | Seconded, it ain't pretty but it's powerful.
        
         | causality0 wrote:
         | I came here to say the same thing. HTTRACK is incredibly handy
         | especially when it comes to old pages. My personal use case is
         | archiving small sites to make sure they don't go dark, for
         | example webcomics. Having a single folder with all the comic
         | image files is great.
        
       | BiteCode_dev wrote:
       | For a KISS version of this, there is the Single File add on:
       | 
       | https://addons.mozilla.org/fr/firefox/addon/single-file/
       | 
       | It will save any page as a standalone HTML file, including
       | inlined external resources.
        
         | joepie91_ wrote:
         | Doesn't solve the same problem. Webrecorder is for saving
         | browsing sessions, not individual pages.
        
           | BiteCode_dev wrote:
           | Hence the kiss
        
         | mlok wrote:
         | I like the zipped-HTML version of this :
         | https://addons.mozilla.org/fr/firefox/addon/singlefilez/
        
         | abnry wrote:
         | I love this extension! I use it in place of bookmarks. I then
         | run a cron job to move .html files from my downloads folder to
         | a bookmarks folder. Then I generate thumbnails and an html
         | index to easily browse my bookmarks. I feel so much more
         | relaxed knowing I can save with one click any good
         | information I find on the web. Eventually I want to add NLP
         | keyword extraction and categorization, and an internal search
         | feature.
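
The workflow described above (move saved .html files out of the downloads folder, then build an HTML index) could be sketched roughly as follows. This is not abnry's script, which isn't shown in the thread; the function names and folder arguments are hypothetical, and thumbnail generation is left out.

```python
import html
import shutil
from pathlib import Path
from urllib.parse import quote


def file_bookmarks(downloads, bookmarks):
    """Move saved .html pages from downloads into bookmarks; return new paths."""
    downloads, bookmarks = Path(downloads), Path(bookmarks)
    bookmarks.mkdir(parents=True, exist_ok=True)
    moved = []
    for page in sorted(downloads.glob("*.html")):
        target = bookmarks / page.name
        if target.exists():
            # Never overwrite an earlier capture that has the same file name.
            continue
        shutil.move(str(page), str(target))
        moved.append(target)
    return moved


def write_index(bookmarks, index_name="index.html"):
    """Write a plain HTML list linking to every saved page in the folder."""
    bookmarks = Path(bookmarks)
    pages = sorted(p for p in bookmarks.glob("*.html") if p.name != index_name)
    items = "\n".join(
        # quote() makes the file name URL-safe, escape() makes it HTML-safe.
        f'<li><a href="{html.escape(quote(p.name))}">{html.escape(p.stem)}</a></li>'
        for p in pages
    )
    index = bookmarks / index_name
    index.write_text(f"<!doctype html>\n<ul>\n{items}\n</ul>\n", encoding="utf-8")
    return index
```

A cron entry would then just call `file_bookmarks(...)` followed by `write_index(...)` on whatever schedule suits.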
        
       | jl6 wrote:
       | I would love just a simple one-click print-current-page-to-PDF
       | button that managed to capture the whole page, and did something
       | intelligent about infinite scroll sites.
        
       | Diederich wrote:
       | The problem here is quite real; just because one has access to a
       | remote resource now doesn't mean that access will remain, either
       | in the short or long term.
       | 
       | I'm not crazy about using a 3rd party service for it though.
       | 
       | I've half-considered wiring up something with OBS to just record
       | my web browser all day, but, besides the intense storage,
       | indexing and searching is more than painful.
        
         | greglindahl wrote:
         | Webrecorder is open source, and you can run your own instance
         | if you like.
        
           | mellosouls wrote:
           | Yes - it's also a digital preservation project from a not-
           | for-profit arts organisation.
           | 
           | https://webrecorder.io/_faq
        
         | jabroni_salad wrote:
         | It looks like webrecorder has a downloadable version you can
         | run entirely on local storage with no need for an account.
        
       | elil17 wrote:
       | Is there a reason you can't just use archive.org for this?
        
       | NelsonMinar wrote:
       | Wow, neat!
       | 
       | Is there a reliable screenshot version of this I can install on
       | my own Linux system? Some sort of headless browser, I imagine. I
       | realize this interactive system is much more powerful but it's
       | overkill for an application I have in mind, a best effort single
       | screenshot is fine. I've seen various attempts at doing this over
       | the years and none have really worked reliably.
       | 
       | I'd consider paying a service to do this but it seems like self-
       | installable is better.
        
         | heipei wrote:
         | Couple of options: You can build a small JavaScript package
         | with an HTTP server to call your endpoint and puppeteer for
         | actually running and interfacing with Chrome headless and then
         | either Dockerize it or even run it as a Lambda / Google Cloud
         | Function. Then there's https://github.com/browserless/chrome/
         | which is both a commercial service as well as being Open
         | Source. Plenty of other OSS and commercial options as well,
         | ultimately it comes down to your use-case. Personally I've
         | found reliably taking a screenshot at the right point in time
         | challenging since a page is never really "done" loading, i.e.
         | there could always be a refresh or timeout which loads another
         | URL, adds an iframe or changes something else entirely, so
         | there is no single "yes, we're done loading now" event to wait
         | for. Source: I run the service at https://urlscan.io where we
         | have the same problem ;)
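
Since there is no single "done loading" event, one common workaround is a quiet-period heuristic: wait until nothing has happened for a while, with a hard cap for pages that never go quiet (Puppeteer's `networkidle` wait options follow a similar idea). A minimal sketch of that decision logic, with hypothetical names and default thresholds, and not a description of how urlscan.io does it:

```python
def settle_time(event_times, quiet_window=2.0, hard_timeout=30.0):
    """Return when to take the screenshot, in seconds after navigation start.

    event_times: times (in seconds) at which network or DOM activity was seen.
    The page counts as "settled" once quiet_window seconds pass without any
    activity; hard_timeout caps the wait for pages that never go quiet.
    """
    last_activity = 0.0
    for t in sorted(event_times):
        if t >= hard_timeout:
            break
        if t - last_activity >= quiet_window:
            # A long enough gap had already opened before this event arrived.
            return last_activity + quiet_window
        last_activity = t
    return min(last_activity + quiet_window, hard_timeout)
```

The trade-off is visible in the two parameters: a short quiet window risks capturing a half-rendered page, a long one slows every capture, and anything the page does after the chosen moment is simply missed.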
        
         | asab wrote:
         | Firefox can screenshot the whole page - it's built in. I
         | believe it can be done in headless as well.
         | 
         | https://support.mozilla.org/en-US/kb/firefox-screenshots
        
           | Nux wrote:
           | Can it screenshot lazy loading pages?
        
             | williamdclt wrote:
             | I think it can be smart-ish about this (scrolling for you,
             | but maybe I'm thinking about an extension). I've never
             | found anything that handles virtualised lists/tables
             | though, like screenshotting an entire Slack thread
        
       | mdaniel wrote:
       | I can't tell which spawned which, but there's a related
       | discussion in r/DataHoarder about an extension that does that:
       | https://old.reddit.com/r/DataHoarder/comments/ggyzoy/is_ther...
       | in which Webrecorder was mentioned yesterday
        
       | seph-reed wrote:
       | This seems like something that would work really well on the
       | piHole level.
        
         | jcahill wrote:
         | A pi isn't going to cut it here.
        
           | edoceo wrote:
           | Pi could for sure do the capture and save, it can run full or
           | headless, and all the things. For personal use it'd be fine,
           | not at scale tho
        
             | dependenttypes wrote:
             | You would have to intercept HTTPS connections and add your
             | own CA to every computer that uses the pihole which is a
             | pain.
        
               | seph-reed wrote:
               | Hmmm...
               | 
               | https://security.stackexchange.com/questions/8145/does-
               | https...
               | 
               | I need to read more about CAs to figure out why the Pi
               | couldn't fake it.
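
On the CA question: a TLS client only accepts a certificate that chains to a CA it already trusts, and it also checks the hostname, so an intercepting box cannot simply present its own certificate. The sketch below uses Python's standard `ssl` module to show those client defaults. `trust_interception_ca` and its `ca_pem_path` argument are hypothetical names for the per-device step dependenttypes calls a pain.

```python
import ssl


def describe_default_tls_policy():
    """Report what a default client-side TLS context insists on."""
    ctx = ssl.create_default_context()
    return {
        # The server certificate must chain to a CA in the trust store.
        "verifies_certificate_chain": ctx.verify_mode == ssl.CERT_REQUIRED,
        # The certificate must also be valid for the requested hostname.
        "checks_hostname": ctx.check_hostname,
    }


def trust_interception_ca(ca_pem_path):
    """Return a context that additionally trusts the given CA file.

    ca_pem_path is a hypothetical path to the interceptor's CA certificate;
    every client behind the interceptor needs an equivalent step.
    """
    ctx = ssl.create_default_context()
    ctx.load_verify_locations(cafile=ca_pem_path)
    return ctx
```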
        
       | boromi wrote:
       | Can't we do this with OBS? What's the difference? Edit: was going
       | to download, then saw it was an Electron app... no thanks.
        
         | xfer wrote:
         | It's an HTTP Archive (HAR) recorder+indexer, not a screen
         | capture/mp4 video recorder program. You can use it with your web
         | browser, you don't need an app. You can self-host the instance as
         | well.
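
To make the "recorder+indexer" idea concrete, here is a minimal sketch that indexes a HAR capture by URL. It follows the commenter's description and the standard HAR layout (`log.entries[]`, each with `request` and `response`); it is not Webrecorder's code, and `index_har` is a hypothetical name.

```python
import json


def index_har(har_text):
    """Map each captured URL to a list of (method, status, MIME type) tuples."""
    har = json.loads(har_text)
    index = {}
    for entry in har["log"]["entries"]:
        request, response = entry["request"], entry["response"]
        # "content" and "mimeType" may be absent in sparse captures.
        mime = response.get("content", {}).get("mimeType", "")
        index.setdefault(request["url"], []).append(
            (request["method"], response["status"], mime)
        )
    return index
```

Unlike a video, a capture like this keeps the actual requests and responses, which is what lets a replay tool serve the page back to a browser.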
        
       | eastendguy wrote:
       | Technically interesting, but why would I use this over one of the
       | many full page screenshot or "Print to PDF" browser extensions?
       | That is what I use when I want to archive something.
        
         | heinrichhartman wrote:
         | Which browser and extension are you using?
         | 
         | I tried a few, but could not get good results with any of them.
         | Print to PDF output always looked terrible. Full page
         | screenshot took ages to scroll through the page, hijacking my
         | viewport. Those captures should happen in the background.
        
           | eastendguy wrote:
           | For manual capture I use
           | https://chrome.google.com/webstore/detail/full-page-
           | screen-c... and Chrome "Print to PDF". Layout issues are not
           | important for my use case.
           | 
           | For scheduled captures I automated this workflow with kantu:
           | https://chrome.google.com/webstore/detail/uivision-
           | rpa/gcbal...
           | 
           | > Those captures should happen in the background.
           | 
           | Yes, it would be nice if the Chrome extension API would allow
           | full page screen captures to happen instantly. Currently all
           | extensions need to scroll up/down.
        
             | gildas wrote:
             | > Currently all extensions need to scroll up/down.
             | 
             | SingleFile does not. It can save lazy loaded contents
             | without scrolling.
        
         | jcahill wrote:
         | Web archivists are coming at the problem with different
         | requirements, related to fidelity and the systematic mirroring
         | of address structure along with payload content and control
         | information.
         | 
         | By the sound of it, you likely wouldn't want to substantially
         | change your personal filing system to have your page captures
         | conform with those requirements.
        
       | amelius wrote:
       | > Webrecorder creates an interactive copy of any web page that
       | you browse, including content revealed by your interactions such
       | as (...) clicking buttons, and so forth.
       | 
       | How can they guarantee this if the code may run on the server?
        
         | netsharc wrote:
         | It's probably a replay of what you did. I'm trying to think how
         | the internals would be made using Javascript, but probably it's
         | a case of recording changes to the DOM structure ("After 3
         | seconds, the password input field had the value 'hunter2'.
         | After 5 seconds, a DIV appears with the text 'Incorrect
         | password'", etc).
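
The guess above (a timestamped log of DOM changes that is replayed later) can be sketched as a small data structure. This only illustrates netsharc's speculation, not how Webrecorder works internally; `Mutation` and `replay` are hypothetical names, and the "DOM" is reduced to a mapping from node selector to text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Mutation:
    at: float             # seconds since recording started
    node: str             # selector identifying the node that changed
    text: Optional[str]   # new text or value; None means the node was removed


def replay(mutations, until):
    """Rebuild the recorded page state as it was `until` seconds in."""
    dom = {}
    for m in sorted(mutations, key=lambda m: m.at):
        if m.at > until:
            break
        if m.text is None:
            dom.pop(m.node, None)
        else:
            dom[m.node] = m.text
    return dom
```

A log like this can only show states that were actually recorded, which is the limitation amelius's question about server-side code is getting at.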
        
           | amelius wrote:
           | Makes sense, but I wouldn't call it an "interactive copy".
        
       | dang wrote:
       | Surprisingly little prior discussion, but
       | https://news.ycombinator.com/item?id=10838985 was related (2016).
        
       | jcahill wrote:
       | We use webrecorder for some interactive work at my workplace, a
       | web archival nonprofit.
       | 
       | If you're new to web archival, expect a learning curve.
        
       | chrischen wrote:
       | I feel like the "Make website browsable offline" feature of
       | yesteryear has been neglected. Somehow people assume everyone
       | always has internet connectivity... but I need to save websites,
       | docs, etc, for long flights or camping trips without signal.
       | 
       | Safari annoyingly has a reading list feature that claims to cache
       | the web page but 50% of the time doesn't. As with all Apple cloud
       | services, there is just no way to explicitly sync.
        
         | saurik wrote:
         | Since when is the Safari Reading List a cloud service? It might
         | be that I just have most of iCloud disabled and that makes the
         | system work as expected, but for me it is definitely 100% local
         | (at which point I would presume failures are due to the
         | mechanism it is using to save things not working on all kinds of
         | web pages: I honestly only use it for simple document sites,
         | and so wouldn't know if it fails a lot on rich web app sites).
        
           | saagarjha wrote:
           | I think perhaps they're annoyed that their Reading List isn't
           | syncing between their devices and Apple hasn't put a button
           | that forces the sync to occur.
        
             | chrischen wrote:
             | Yes, it is advertised as being able to make it available
             | offline. But if you save it to reading list on iPhone, or
             | your Mac, there's no guarantee it's available on the other
             | device and often it even stops working on the device that I
             | added to reading list.
        
         | eitland wrote:
         | Until Firefox broke its extension API there was an extremely
         | useful extension called Scrapbook that you could point at a
         | website and save that site and all linked sites (with settings
         | for how far to recurse and what URL patterns to accept).
         | 
         | FTR: I still use Firefox as for me the alternatives are worse.
         | As someone else mentioned, Pocket exists, and while the
         | communication around it could hardly have been worse, it is still
         | way better than Chrome IMO.
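
The Scrapbook-style settings mentioned above (how far to recurse, which URL patterns to accept) amount to a depth-limited breadth-first crawl with a filter. A minimal sketch of the planning step, assuming a caller-supplied `get_links` function that returns the links found on a page. All names here are hypothetical and this is not Scrapbook's implementation.

```python
from collections import deque
from fnmatch import fnmatchcase


def plan_capture(start, get_links, max_depth, accept=("*",)):
    """Return the URLs to save, in breadth-first order from `start`.

    max_depth: how many link hops away from `start` to follow.
    accept: shell-style patterns; a link is followed if it matches any one.
    """
    seen = {start}
    order = [start]
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue  # save this page, but do not follow its links
        for link in get_links(url):
            if link in seen or not any(fnmatchcase(link, p) for p in accept):
                continue
            seen.add(link)
            order.append(link)
            queue.append((link, depth + 1))
    return order
```

Keeping the fetch behind `get_links` means the depth and pattern logic can be checked against an in-memory link map before anything touches the network.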
        
         | judex wrote:
         | Exactly the reason I use Opera on Android devices. Does not
         | save movies or odd file data but html + images perfectly and
         | web pages look like the originals. I use it a lot to save pages
         | for offline viewing during flights. Now Vivaldi is on Android
         | too and I hope it does have the same feature.
        
         | EamonnMR wrote:
         | Offline users don't generate data/ad impressions, so offline
         | user experience isn't considered.
        
         | giancarlostoro wrote:
         | Oddly enough Mozilla gets crap for it but they bought Pocket
         | which allows for offline reading. I think it's fine for Mozilla
         | to include Pocket. We seem to be fine installing Chrome which
         | has much worse spyware. You can uninstall Pocket, you gotta go
         | through more effort to uninstall Google's OS from Chrome.
        
       | boredgamer2 wrote:
       | Very cool. This seems like such a "Duh. Users want this" feature.
       | I wish it was integrated in Firefox years ago. I bookmark some
       | sites and then come back years later, but perhaps they hijacked
       | the URL or they later changed the URL's parameters.
       | 
       | And then when I return, I get a 404. Instead of bookmarking, I'd
       | love to "capture" the current info, divs, and graphics.
        
         | PenguinCoder wrote:
         | Not a firefox extension, but I moved to using Polar
         | Bookshelf[0] for exactly this reason.
         | 
         | [0] https://getpolarized.io/
        
         | gildas wrote:
         | There's an option for this ("Misc. > save the page of a newly
         | created bookmark") in SingleFile [1].
         | 
         | [1] https://github.com/gildas-lormeau/SingleFile/
        
         | pzmarzly wrote:
         | While it's far from an ideal solution, you can throw your
         | bookmarks periodically to archive.org's "Save Page Now!"
         | service. It's easy to semi-automate it - here's how I use it
         | with pinboard.in bookmark exports:
         | https://pastebin.com/uUVE22RD
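
For a rough idea of what such semi-automation can look like: the sketch below reads a bookmark export and builds Save Page Now requests. It is not the linked pastebin script, whose contents aren't shown here. It assumes the export is a JSON list of objects with an `href` field and that prefixing a URL with `https://web.archive.org/save/` triggers a capture; the function names and the delay value are hypothetical.

```python
import json
import time
import urllib.request

SAVE_ENDPOINT = "https://web.archive.org/save/"


def bookmark_urls(export_json):
    """Extract http(s) links from a JSON bookmark export (assumed 'href' key)."""
    return [
        b["href"]
        for b in json.loads(export_json)
        if b.get("href", "").startswith(("http://", "https://"))
    ]


def save_requests(urls):
    """Build one Save Page Now URL per bookmark."""
    return [SAVE_ENDPOINT + url for url in urls]


def submit(urls, delay_seconds=10.0):
    """Request a capture of each URL, pausing between requests to be polite."""
    for save_url in save_requests(urls):
        with urllib.request.urlopen(save_url, timeout=60) as response:
            response.read()
        time.sleep(delay_seconds)
```

Only `submit` touches the network, so the export parsing can be tried on a file first.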
        
           | holidaygoose wrote:
           | This solution is on the user side, which is great because
           | each person can get and manage saved pages for themselves.
           | 
           | But if we're looking for a developer side solution, then
           | making pages that last an order of magnitude longer may be
           | better for everyone in the long run, e.g.
           | https://jeffhuang.com/designed_to_last/
        
         | Cymen wrote:
         | Joplin is a local thing you can run for this. Basically, saving
         | a snapshot of a page. Has browser extensions.
         | 
         | https://joplinapp.org/
         | 
         | I like the ability to use tags too -- I've got various
         | product/tech ideas and it's nice collecting information with it
         | and not having to worry about the pages going away or changing.
        
         | zxter wrote:
         | Pinboard [0] has this feature for an additional price. Just
         | bookmark and you are done. Later you can text search in their
         | content.
         | 
         | [0] https://www.pinboard.in/
        
           | derefr wrote:
           | Does it capture _your view_ of the page, though (i.e. use
           | your own browser, with its cookie jar, to do the scrape)? I'd
           | like to, for example, snapshot my Facebook feed.
        
             | zxter wrote:
             | Then check out historysearch [0]. If I remember correctly
             | they index your history so it could include your history.
             | You don't even need to bookmark.
             | 
             | Of course I cannot vouch for their respect for privacy.
             | 
             | [0] https://historysearch.com/
        
             | jabroni_salad wrote:
             | No it does not. That means it also doesn't work for any news
             | sites that you have subscriptions to. I'm using the Joplin
             | web clipper pretty heavily for this purpose.
        
       ___________________________________________________________________
       (page generated 2020-05-11 23:00 UTC)