[HN Gopher] Web scraping via JavaScript runtime heap snapshots ___________________________________________________________________ Web scraping via JavaScript runtime heap snapshots Author : adriancooney Score : 196 points Date : 2022-04-29 13:56 UTC (9 hours ago) (HTM) web link (www.adriancooney.ie) (TXT) w3m dump (www.adriancooney.ie) | superasn wrote: | Awesome, I wonder if there is a possibility to create a Chrome | extension that works like 'Vue devtools' and shows the heap and | changes in real-time and maybe allows editing. That would be | amazing for learning / debugging. | | > We use the --no-headless argument to boot a windowed Chrome | instance (i.e. not headless) because Google can detect and thwart | headless Chrome - but that's a story for another time. | | Use `puppeteer-extra-plugin-stealth`(1) for such sites. It | defeats a lot of bot identification including reCAPTCHA v3. | | (1) https://www.npmjs.com/package/puppeteer-extra-plugin-stealth | acemarke wrote: | Not _quite_ what you're describing, but Replay [0], the company | I work for, _is_ building a true "time-traveling debugger" for | JS. It works by recording the OS-level interactions with the | browser process, then re-running those in the cloud. From the | user's perspective in our debugging client UI, they can jump to | any point in a timeline and do typical step debugging. However, | you can also see how many times any line of code ran, and also | add print statements to any line that will print out the | results from _every time that line got executed_. | | So, no heap analysis per se, but you can definitely inspect the | variables and stack from anywhere in the recording. | | Right now our debugging client is just scratching the surface | of the info we have available from our backend.
We recently put | together a couple of small examples that use the Replay backend | API to extract data from recordings and do other analysis, like | generating code coverage reports and introspecting React's | internals to determine whether a given component was mounting | or re-rendering. | | Given that capability, we hope to add the ability to do "React | component stack" debugging in the not-too-distant future, such | as a button that would let you "Step Back to Parent Component". | We're also working on adding Redux DevTools integration now | (like, I filed an initial PR for this today! [2]), and hope to | add integration with other frameworks down the road. | | [0] https://replay.io | | [1] https://github.com/RecordReplay/replay-protocol-examples | | [2] https://github.com/RecordReplay/devtools/pull/6601 | invalidname wrote: | Scraping is inherently fragile due to all the small changes that | can happen to the data model as a website evolves. The important | thing is to fix these things quickly. This article discusses a | related approach of debugging such failures directly on the | server: https://talktotheduck.dev/debugging-jsoup-java-code-in- | produ... | | It's in Java (using JSoup) but the approach will work for Node, | Python, Kotlin etc. The core concept is to discover the cause of | the regression instantly on the server and deploy a fix fast. | There are also user-specific regressions in scraping that are | again very hard to debug. | dymk wrote: | Would this method work if the website obfuscated its HTML as per | the usual techniques, but also rendered everything server-side? | adriancooney wrote: | If it's rendered server-side - no. The data likely won't be | loaded into the JS heap (the DOM isn't included in the heap | snapshots) when you visit the page. You might be in luck, | however, if the website executes JavaScript to augment the | server-side rendered page. If it does, your data may be loaded | into memory in a way you can extract it.
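The heap-based extraction being discussed works by shape rather than by selectors: find every object in memory that carries a given set of property names. A rough sketch of that matching idea in plain Node (the findObjectsWithProperties helper here is hypothetical, and it walks an already-parsed object graph rather than the raw node/edge arrays of a real .heapsnapshot file):

```javascript
// Hypothetical sketch: recursively walk an object graph and collect every
// object that owns all of the requested property names. Tools like
// puppeteer-heap-snapshot do the equivalent over a serialized heap snapshot.
function findObjectsWithProperties(root, properties, results = [], seen = new Set()) {
  if (root === null || typeof root !== "object" || seen.has(root)) return results;
  seen.add(root); // guard against cycles in the graph
  if (properties.every((p) => Object.prototype.hasOwnProperty.call(root, p))) {
    results.push(root);
  }
  for (const value of Object.values(root)) {
    findObjectsWithProperties(value, properties, results, seen);
  }
  return results;
}

// Example: pull video metadata out of a nested state object by shape alone,
// with no knowledge of where in the tree it lives.
const state = {
  ui: { theme: "dark" },
  entities: {
    videos: [{ title: "Heap snapshots", viewCount: 196, author: "adriancooney" }],
  },
};
const hits = findObjectsWithProperties(state, ["title", "viewCount"]);
console.log(hits[0].title); // "Heap snapshots"
```

Matching on property names is what makes the approach resilient to redesigns: class names and DOM structure can churn, but the shape of the underlying data rarely does.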
| marwis wrote: | This sadly does not help if js code is minified/obfuscated and | data is exchanged using some binary/binary-like protocol like | grpc. Unfortunately this is increasingly common. | | The only long term way is to parse visible text. | mdaniel wrote: | I've never seen grpc from a browser on a consumer-facing site; | do you have an example I could see? | | That said, for this approach something like grpc would be a | benefit since AIUI grpc is designed to be versioned so one | could identify structural changes in the payload fairly quickly | versus the json-y way of "I dunno, are there suddenly new | fields?" | marwis wrote: | Not aware of any actual grpc websites but given grpc-web has | 6.5k stars on github something must be out there. | | Google's websites frequently use binary-like formats where | json is just an array of values with no properties, and most | of these values are numbers. See for example Gmail. | toraway1234 wrote: | swsieber wrote: | Yeah, found when it happened - | https://news.ycombinator.com/item?id=30275804 | d4a wrote: | You're banned, which means everything you post will be marked | as "Dead". Only those with "showdead" enabled in their profile | will be able to see your comments and posts. Others can "vouch" | for your post to make it not "dead" so it can be replied to | (which is what I have done) | | As for why you were banned: | https://news.ycombinator.com/item?id=30275804 | BbzzbB wrote: | Odd, I see his comment and I'm playing HN with undeads off | ("showdead" is "no"). | d4a wrote: | I had to vouch for the comment (make it not dead) to reply | to it | BbzzbB wrote: | Oh I see, you're playing medic. | | Thanks, I always wondered what "showdead" meant (tho not | enough to Google it I guess). | d4a wrote: | https://github.com/minimaxir/hacker-news-undocumented | [deleted] | kvathupo wrote: | The article brings up two interesting points for web | preservation: | | 1. The reliance on externally hosted APIs | | 2. 
Source code obfuscation | | For 1, in order to fully preserve a webpage, you'd have to go | down the rabbit hole of externally hosted APIs, and preserve | those as well. For example, sometimes a webpage won't render | LaTeX notation since a MathJax endpoint can't be connected to. | Were we to save this webpage, we would need a copy of MathJax JS | too. | | For 2, I think WASM makes things more interesting. With | WebAssembly, I'd imagine it's much easier to obfuscate source | code: a preservationist would need a WASM decompiler for | whatever source language was used. | flockonus wrote: | Awesome experimentation! I'd be curious how you navigate the | heap dump in some real website examples. | mwcampbell wrote: | > Developers no longer need to label their data with class-names | or ids - it's only a courtesy to screen readers now. | | In general, screen readers don't use class names or IDs. In | principle they can, to enable site-specific workarounds for | accessibility problems. But of course, that's as fragile as | scraping. Perhaps you were thinking of semantic HTML tag names | and ARIA roles. | ComputerGuru wrote: | Anything relying on id/class names has been broken since the | advent of machine-generated names that come part and parcel | with the most popular SPA frameworks. They're all gobbledygook | now, which makes writing custom ad block cosmetic filters a | real PITA. | jchw wrote: | React doesn't do that. You may still find gibberish on | hostile sites like Twitter which intentionally obfuscate | class names, using something like React Armor. | mdaniel wrote: | That's an exceedingly clever idea, thanks for sharing it! | | Please consider adding an actual license text file to your repo, | since (a) I don't think GitHub's licensee looks inside | package.json (b) I bet _most_ of the "license" properties of | package.json files are "yeah, yeah, whatever" versus an | intentional choice: https://github.com/adriancooney/puppeteer- | heap-snapshot/blob...
I'm not saying that applies to you, but an | explicit license file in the repo would make your wishes clearer | adriancooney wrote: | Ah thank you for the reminder. Added it now! | marmada wrote: | Wow this is brilliant. I've sometimes tried to reverse engineer | APIs in the past, but this is definitely the next level. | | I used to think ML models could be good for scraping too, but | this seems better. | | I think this + a network request interception tool (to get data | that is embedded into HTML) could be the future. | lemax wrote: | I've used a similar technique on some web pages that get returned | from the server with an intact Redux state object just sitting | in a <script> tag. Instead of parsing the HTML, I just pull out | the state object. Super | 1vuio0pswjnm7 wrote: | "We can see the response is JSON! A clean, well-formed data | structure extracted magically from the webpage." | | Honest question: Why are heap snapshots required? | | Why not just request the page and then extract the JSON?
| tnftp -4o 1.htm https://www.youtube.com/watch?v=L_o_O7v1ews
| yy059 < 1.htm > 1.json
| less 1.json
| | yy059 is a quick and dirty, small, simple, portable C program to | extract and reformat JSON to make reading JSON easier for me. | | For me, it works. There is no "magic". | | "Tech" companies, the ones manipulating access to public | information to sell online advertising services, can change their | free browsers at any time. In the future, they could cripple or | remove the heap snapshot feature. Utilising the feature to | extract data is interesting, but how is this technique more | "future-proof" than requesting a web page using any client, not | necessarily a "tech" company's web browser, and looking at what | it contains? | quickthrower2 wrote: | Because Cloudflare, reCAPTCHA etc. mean this is not generally | possible. You need to quack like a normal user for it to work.
| If a site is really against scraping they could probably | make it completely uneconomical by tracking user footprints and | detecting unexpected patterns of usage. | datalopers wrote: | They detect and block headless browsers just as easily. | Jiger104 wrote: | Really cool approach, great work | elbajo wrote: | Love this approach, thanks for sharing! | | I am trying this on a website that Puppeteer has trouble | loading, so I got a heap snapshot directly in Chrome. I was | trying to search for relevant objects directly in the Chrome | heap viewer but I don't think the search looks inside objects. | | I think your tool would work: "puppeteer-heap-snapshot query -f | /tmp/file.heapsnapshot -p property1" or really any JSON parser, | but it requires extra steps. Would you say this is the easiest | way to view/debug a heap snapshot? | BbzzbB wrote: | This is great, thanks a lot. | | It's my understanding that Playwright is the "new Puppeteer" | (even with core devs migrating). I presume this sort of technique | would be feasible on Playwright too? Do you think there's any | advantage or disadvantage of using one over the other for this | use case, or is it basically the same (or am I off base and | they're not so interchangeable)? | | I'm basing my personal "scraping toolbox" off Scrapy, which I | think has decent Playwright integration, hence the question as I | try to reproduce this strategy in Playwright. | mdaniel wrote: | My understanding of Playwright is that it's trying to be the | new Selenium, in that it's a programming language orchestrating | the WebDriver protocol | | That means that if you are running against Chromium, this will | likely work, but unless Firefox has a similar heapdump | function, it is unlikely to work[1]. And almost certainly not | Safari, based on my experience.
All of that is also qualified | by whether Playwright exposes that behavior, or otherwise | allows one to "get into the weeds" to invoke the function under | the hood | | 1 = as an update, I checked and Firefox does have a memory | snapshot feature, but the file it saved is some kind of | binary-encoded thing without any obvious strings in it | | I didn't see any such thing in Safari | asabla wrote: | Well, kind of, for Firefox: there is this profiling tool which | you could use (semi-built-in) | | https://github.com/firefox-devtools/profiler, which lets you | save a report in json.gz format | anyfactor wrote: | Very interesting. Can't wait to give it a shot. | | I personally use a combination of XPath, basic math and regex, so | this class/id security solution isn't a major deterrent. A couple | of times, I did find it to be a hassle to scrape data embedded | in iframes, and I can see the heap snapshots treat iframes | differently. | | Also, if a website takes the extra steps to block web scrapers, | identification of elements is never the main problem. It is | always IP bans and other security measures. | | After all that, I do look forward to using something like this | and making a switch to a Node.js-based solution soon. But if you | are trying web scraping at scale, reverse engineering should | always be your first choice. Not only does it enable a faster | solution, it is more ethical (IMO), as you are minimizing your | impact on the site's resources. Rendering full website resources | is always my last choice. | rvnx wrote: | Nice, this won't work anymore then | benbristow wrote: | Exactly my thoughts - the author is using it 'in production' - | speaking out loud on a forum where Facebook/Meta employees (and | other Silicon Valley folk) are definitely observing is a rookie | mistake | chrismeller wrote: | A neat idea for sure, I just wanted to point out that this is why | I prefer XPath over CSS selectors.
| | We all know the display of the page and the structure of the page | should be kept separate, so why would you base your | selectors on display? Particularly if you're looking for | something on a semantically designed page, why would I look for | an .article, a class that may disappear with the next redesign, | when they're unlikely to stop using the article HTML tag? | goldenkey wrote: | CSS selectors don't have to select purely by classes. They can | be something like: | | div > div > * > *:nth-child(7) | | XPath doesn't have any additional abilities, it's just verbose | and difficult to write. It's a lemon. | tommica wrote: | I might be wrong, but XPath has contains(), where you can look | for text content inside an element, which I don't think CSS | can do | mdaniel wrote: | Yeah, for sure XPath is the more powerful of the two, so | much so that Scrapy's parsel library parses CSS selectors | and transforms them into the equivalent XPath for execution | | To the best of my knowledge, CSS selectors care only about | the _structure_ of the page, lightly dipping its toes in | the content only for attributes and things like ::first-letter | and ::first-line | chrismeller wrote: | Well, that is 100% originally an XPath selector (:nth-child), | so kudos if CSS selectors support it now. | | Still, using // instead of multiple *'s (and the two divs) | seems better for longer-term scraping. ___________________________________________________________________ (page generated 2022-04-29 23:00 UTC)