[HN Gopher] Webrecorder: Make an interactive copy of any web pag... ___________________________________________________________________ Webrecorder: Make an interactive copy of any web page that you browse Author : pcr910303 Score : 209 points Date : 2020-05-11 16:56 UTC (6 hours ago) (HTM) web link (webrecorder.io) (TXT) w3m dump (webrecorder.io) | A4ET8a8uTh0 wrote: | This is neat. More and more stuff is hidden behind a login | screen. Odd question. How does the Internet Archive handle those | types of pages these days? | | edit: types of | bacondude3 wrote: | Just want to plug HTTrack [1] here as well. Not nearly as slick, | but it's worked extremely well for me when Webrecorder couldn't | do the job. Being usable through the command line also makes it | useful for some projects WR can't do. | | [1]: http://www.httrack.com/ | anigbrowl wrote: | Seconded, it ain't pretty but it's powerful. | causality0 wrote: | I came here to say the same thing. HTTrack is incredibly handy, | especially when it comes to old pages. My personal use case is | archiving small sites to make sure they don't go dark, for | example webcomics. Having a single folder with all the comic | image files is great. | BiteCode_dev wrote: | For a KISS version of this, there is the Single File add-on: | | https://addons.mozilla.org/fr/firefox/addon/single-file/ | | It will save any page as a standalone HTML file, including | inlined external resources. | joepie91_ wrote: | Doesn't solve the same problem. Webrecorder is for saving | browsing sessions, not individual pages. | BiteCode_dev wrote: | Hence the KISS | mlok wrote: | I like the zipped-HTML version of this: | https://addons.mozilla.org/fr/firefox/addon/singlefilez/ | abnry wrote: | I love this extension! I use it in place of bookmarks. I then | run a cron job to move .html files from my downloads folder to | a bookmarks folder. Then I generate thumbnails and an HTML | index to easily browse my bookmarks.
I feel so much more | relaxed knowing I can save any good | information I find on the web with one click. Eventually I want to add NLP | keyword extraction and categorization, and an internal search | feature. | jl6 wrote: | I would love just a simple one-click print-current-page-to-PDF | button that managed to capture the whole page, and did something | intelligent about infinite-scroll sites. | Diederich wrote: | The problem here is quite real; just because one has access to a | remote resource now doesn't mean that access will remain, either | in the short or long term. | | I'm not crazy about using a 3rd-party service for it though. | | I've half-considered wiring up something with OBS to just record | my web browser all day, but, besides the intense storage, | indexing and searching is more than painful. | greglindahl wrote: | Webrecorder is open source, and you can run your own instance | if you like. | mellosouls wrote: | Yes - it's also a digital preservation project from a not- | for-profit arts organisation. | | https://webrecorder.io/_faq | jabroni_salad wrote: | It looks like Webrecorder has a downloadable version you can | run entirely on local storage with no need for an account. | elil17 wrote: | Is there a reason you can't just use archive.org for this? | NelsonMinar wrote: | Wow, neat! | | Is there a reliable screenshot version of this I can install on | my own Linux system? Some sort of headless browser, I imagine. I | realize this interactive system is much more powerful, but it's | overkill for the application I have in mind; a best-effort single | screenshot is fine. I've seen various attempts at doing this over | the years and none have really worked reliably. | | I'd consider paying a service to do this, but it seems like self- | installable is better.
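abnry's bookmark pipeline above (a cron job that sweeps saved .html files out of the downloads folder into a bookmarks folder, then regenerates a browsable index) can be sketched in a few lines. The paths, the function name, and the bare-bones index format are illustrative assumptions, not abnry's actual script:

```python
# Sketch of the cron-driven pipeline: sweep saved pages, rebuild the index.
import html
import shutil
from pathlib import Path

def sweep_and_index(downloads: Path, bookmarks: Path) -> Path:
    """Move *.html files from downloads into bookmarks, then rewrite a
    plain index.html linking every saved page. Returns the index path."""
    bookmarks.mkdir(parents=True, exist_ok=True)
    for page in sorted(downloads.glob("*.html")):
        shutil.move(str(page), str(bookmarks / page.name))
    names = sorted(p.name for p in bookmarks.glob("*.html")
                   if p.name != "index.html")
    items = "\n".join(
        f'<li><a href="{html.escape(n)}">{html.escape(n)}</a></li>'
        for n in names)
    index = bookmarks / "index.html"
    index.write_text(
        f"<!doctype html><title>Bookmarks</title><ul>\n{items}\n</ul>")
    return index
```

A thumbnailer or the NLP keyword pass abnry mentions would slot in as extra steps before the index is written.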
| heipei wrote: | A couple of options: you can build a small JavaScript package | with an HTTP server to call your endpoint and puppeteer for | actually running and interfacing with headless Chrome, and then | either Dockerize it or even run it as a Lambda / Google Cloud | Function. Then there's https://github.com/browserless/chrome/ | which is both a commercial service and open | source. Plenty of other OSS and commercial options as well; | ultimately it comes down to your use case. Personally I've | found reliably taking a screenshot at the right point in time | challenging, since a page is never really "done" loading, i.e. | there could always be a refresh or timeout which loads another | URL, adds an iframe or changes something else entirely, so | there is no single "yes, we're done loading now" event to wait | for. Source: I run the service at https://urlscan.io where we | have the same problem ;) | asab wrote: | Firefox can screenshot the whole page - it's built in. I | believe it can be done in headless mode as well. | | https://support.mozilla.org/en-US/kb/firefox-screenshots | Nux wrote: | Can it screenshot lazy-loading pages? | williamdclt wrote: | I think it can be smart-ish about this (scrolling for you, | but maybe I'm thinking of an extension). I've never | found anything that handles virtualised lists/tables, | though, like screenshotting an entire Slack thread. | mdaniel wrote: | I can't tell which spawned which, but there's a related | discussion in r/DataHoarder about an extension that does that: | https://old.reddit.com/r/DataHoarder/comments/ggyzoy/is_ther... | in which Webrecorder was mentioned yesterday. | seph-reed wrote: | This seems like something that would work really well at the | Pi-hole level. | jcahill wrote: | A Pi isn't going to cut it here. | edoceo wrote: | A Pi could for sure do the capture and save; it can run full or | headless, and all the things.
For personal use it'd be fine, not | at scale though. | dependenttypes wrote: | You would have to intercept HTTPS connections and add your | own CA to every computer that uses the Pi-hole, which is a | pain. | seph-reed wrote: | Hmmm... | | https://security.stackexchange.com/questions/8145/does- | https... | | I need to read more about CAs to figure out why the Pi | couldn't fake it. | boromi wrote: | Can't we do this with OBS? What's the difference? Edit: I was going | to download it, then saw it was an Electron app... no thanks. | xfer wrote: | It's an HTTP Archive (HAR) recorder+indexer, not a screen-capture/mp4 | video recorder program. You can use it with your web browser; | you don't need an app. You can self-host the instance as well. | eastendguy wrote: | Technically interesting, but why would I use this over one of the | many full-page screenshot or "Print to PDF" browser extensions? | That is what I use when I want to archive something. | heinrichhartman wrote: | Which browser and extension are you using? | | I tried a few, but could not get good results with any of them. | With print to PDF, the output always looked terrible. Full-page | screenshots took ages to scroll through the page, hijacking my | viewport. Those captures should happen in the background. | eastendguy wrote: | For manual capture I use | https://chrome.google.com/webstore/detail/full-page- | screen-c... and Chrome "Print to PDF". Layout issues are not | important for my use case. | | For scheduled captures I automated this workflow with kantu: | https://chrome.google.com/webstore/detail/uivision- | rpa/gcbal... | | > Those captures should happen in the background. | | Yes, it would be nice if the Chrome extension API would allow | full-page screen captures to happen instantly. Currently all | extensions need to scroll up/down. | gildas wrote: | > Currently all extensions need to scroll up/down. | | SingleFile does not. It can save lazy-loaded content | without scrolling.
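heipei's observation above (a page is never really "done" loading, so there is no single event to wait for) is typically handled with a network-idle heuristic plus a hard timeout: capture once no new request has started for a short window, and give up after a ceiling. A minimal sketch in Python, assuming we can observe request start times (e.g. from a headless-browser event hook); the function and parameter names are illustrative:

```python
def choose_capture_time(request_times, idle_window=0.5, hard_timeout=30.0):
    """Pick the moment (seconds since navigation start) to take the
    screenshot: the end of the first idle_window-second gap with no new
    requests, capped at hard_timeout for pages that never go quiet
    (polling, ads, lazy loading)."""
    last = 0.0  # time of the most recent request seen so far;
                # the initial document request is expected near t=0
    for t in sorted(request_times):
        if t >= hard_timeout:
            break  # ignore requests past the ceiling
        if t - last >= idle_window:
            # the network stayed quiet long enough before this request
            return last + idle_window
        last = t
    return min(last + idle_window, hard_timeout)
```

This is the same idea behind the "network idle" wait conditions that headless-browser tools like Puppeteer expose instead of a single load event; the window and ceiling are tuning knobs, not guarantees.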
| jcahill wrote: | Web archivists are coming at the problem with different | requirements, related to fidelity and the systematic mirroring | of address structure along with payload content and control | information. | | By the sound of it, you likely wouldn't want to substantially | change your personal filing system to have your page captures | conform with those requirements. | amelius wrote: | > Webrecorder creates an interactive copy of any web page that | you browse, including content revealed by your interactions such | as (...) clicking buttons, and so forth. | | How can they guarantee this if the code may run on the server? | netsharc wrote: | It's probably a replay of what you did. I'm trying to think how | the internals would be made using JavaScript, but probably it's | a case of recording changes to the DOM structure ("After 3 | seconds, the password input field had the value 'hunter2'. | After 5 seconds, a DIV appears with the text 'Incorrect | password'", etc). | amelius wrote: | Makes sense, but I wouldn't call it an "interactive copy". | dang wrote: | Surprisingly little prior discussion, but | https://news.ycombinator.com/item?id=10838985 was related (2016). | jcahill wrote: | We use Webrecorder for some interactive work at my workplace, a | web archival nonprofit. | | If you're new to web archival, expect a learning curve. | chrischen wrote: | I feel like the "make websites browsable offline" feature of | yesteryear has been neglected. Somehow people assume everyone | always has internet connectivity... but I need to save websites, | docs, etc., for long flights or camping trips without signal. | | Safari has a Reading List feature that claims to cache | the web page but annoyingly doesn't about 50% of the time. As with all | Apple cloud services, there is just no way to explicitly sync. | saurik wrote: | Since when is the Safari Reading List a cloud service?
It might | be that I just have most of iCloud disabled and that makes the | system work as expected, but for me it is definitely 100% local | (at which point I would presume failures are due to the | mechanism it uses to save things not working on all kinds of | web pages: I honestly only use it for simple document sites, | and so wouldn't know if it fails a lot on rich web app sites). | saagarjha wrote: | I think perhaps they're annoyed that their Reading List isn't | syncing between their devices and Apple hasn't put in a button | that forces the sync to occur. | chrischen wrote: | Yes, it is advertised as being able to make pages available | offline. But if you save a page to the Reading List on your iPhone or | your Mac, there's no guarantee it's available on the other | device, and often it even stops working on the device I | added it to the Reading List on. | eitland wrote: | Until Firefox broke its extension API there was an extremely | useful extension called ScrapBook that you could point at | a website to save that site and all linked sites (with settings | for how far to recurse and what URL patterns to accept). | | FTR: I still use Firefox, as for me the alternatives are worse. | As someone else mentioned, Pocket exists, and while the | communication around it could hardly have been worse, it is still way | better than Chrome IMO. | judex wrote: | Exactly the reason I use Opera on Android devices. It does not | save movies or odd file data, but it saves HTML + images perfectly, and | web pages look like the originals. I use it a lot to save pages | for offline viewing during flights. Now Vivaldi is on Android | too and I hope it has the same feature. | EamonnMR wrote: | Offline users don't generate data/ad impressions, so the offline | user experience isn't considered. | giancarlostoro wrote: | Oddly enough, Mozilla gets crap for it, but they bought Pocket, | which allows for offline reading. I think it's fine for Mozilla | to include Pocket.
We seem to be fine installing Chrome, which | has much worse spyware. You can uninstall Pocket; you gotta go | through more effort to uninstall Google's OS from Chrome. | boredgamer2 wrote: | Very cool. This seems like such a "Duh. Users want this" feature. | I wish it had been integrated into Firefox years ago. I bookmark some | sites and then come back years later, but perhaps they hijacked | the URL or later changed the URL's parameters. | | And then when I return, I get a 404. Instead of bookmarking, I'd | love to "capture" the current info, divs, and graphics. | PenguinCoder wrote: | Not a Firefox extension, but I moved to using Polar | Bookshelf [0] for exactly this reason. | | [0] https://getpolarized.io/ | gildas wrote: | There's an option for this ("Misc. > save the page of a newly | created bookmark") in SingleFile [1]. | | [1] https://github.com/gildas-lormeau/SingleFile/ | pzmarzly wrote: | While it's far from an ideal solution, you can periodically throw your | bookmarks at archive.org's "Save Page Now!" | service. It's easy to semi-automate - here's how I use it | with pinboard.in bookmark exports: | https://pastebin.com/uUVE22RD | holidaygoose wrote: | This solution is on the user side, which is great because | each person can get and manage saved pages for themselves. | | But if we're looking for a developer-side solution, then | making pages that last an order of magnitude longer may be | better for everyone in the long run, e.g. | https://jeffhuang.com/designed_to_last/ | Cymen wrote: | Joplin is a local thing you can run for this. Basically, it saves | a snapshot of a page. It has browser extensions. | | https://joplinapp.org/ | | I like the ability to use tags too -- I've got various | product/tech ideas and it's nice collecting information with it | and not having to worry about the pages going away or changing. | zxter wrote: | Pinboard [0] has this feature for an additional price. Just | bookmark and you are done.
Later you can text-search their | content. | | [0] https://www.pinboard.in/ | derefr wrote: | Does it capture _your view_ of the page, though (i.e. use | your own browser, with its cookie jar, to do the scrape)? | I'd like to, for example, snapshot my Facebook feed. | zxter wrote: | Then check out HistorySearch [0]. If I remember correctly, | they index your browsing history, so it could capture your view of pages. | You don't even need to bookmark. | | Of course, I cannot vouch for their respect for privacy. | | [0] https://historysearch.com/ | jabroni_salad wrote: | No, it does not. That means it also doesn't work for any news | sites that you have subscriptions to. I'm using the Joplin | web clipper pretty heavily for this purpose. ___________________________________________________________________ (page generated 2020-05-11 23:00 UTC)