[HN Gopher] Web scraping via JavaScript runtime heap snapshots
       ___________________________________________________________________
        
       Web scraping via JavaScript runtime heap snapshots
        
       Author : adriancooney
       Score  : 196 points
       Date   : 2022-04-29 13:56 UTC (9 hours ago)
        
 (HTM) web link (www.adriancooney.ie)
 (TXT) w3m dump (www.adriancooney.ie)
        
       | superasn wrote:
        | Awesome, I wonder if it would be possible to create a Chrome
        | extension that works like 'Vue devtools' and shows the heap and
        | its changes in real time, and maybe allows editing. That would be
        | amazing for learning / debugging.
       | 
       | > We use the --no-headless argument to boot a windowed Chrome
       | instance (i.e. not headless) because Google can detect and thwart
       | headless Chrome - but that's a story for another time.
       | 
        | Use `puppeteer-extra-plugin-stealth`(1) for such sites. It
        | defeats a lot of bot detection, including reCAPTCHA v3.
       | 
       | (1) https://www.npmjs.com/package/puppeteer-extra-plugin-stealth
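        | 
        | A minimal sketch of wiring that up (this assumes the standard
        | puppeteer-extra plugin API; the URL is just a placeholder):
        | 
        |   // npm i puppeteer puppeteer-extra puppeteer-extra-plugin-stealth
        |   const puppeteer = require("puppeteer-extra");
        |   const StealthPlugin = require("puppeteer-extra-plugin-stealth");
        |   puppeteer.use(StealthPlugin()); // patches common headless giveaways
        | 
        |   (async () => {
        |     const browser = await puppeteer.launch({ headless: true });
        |     const page = await browser.newPage();
        |     await page.goto("https://example.com"); // placeholder URL
        |     // ...scrape / take heap snapshots as usual...
        |     await browser.close();
        |   })();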
        
         | acemarke wrote:
         | Not _quite_ what you're describing, but Replay [0], the company
         | I work for, _is_ building a true "time-traveling debugger" for
         | JS. It works by recording the OS-level interactions with the
         | browser process, then re-running those in the cloud. From the
         | user's perspective in our debugging client UI, they can jump to
         | any point in a timeline and do typical step debugging. However,
         | you can also see how many times any line of code ran, and also
         | add print statements to any line that will print out the
         | results from _every time that line got executed_.
         | 
         | So, no heap analysis per se, but you can definitely inspect the
         | variables and stack from anywhere in the recording.
         | 
         | Right now our debugging client is just scratching the surface
         | of the info we have available from our backend. We recently put
          | together a couple of small examples [1] that use the Replay backend
         | API to extract data from recordings and do other analysis, like
         | generating code coverage reports and introspecting React's
         | internals to determine whether a given component was mounting
         | or re-rendering.
         | 
         | Given that capability, we hope to add the ability to do "React
         | component stack" debugging in the not-too-distant future, such
         | as a button that would let you "Step Back to Parent Component".
         | We're also working on adding Redux DevTools integration now
         | (like, I filed an initial PR for this today! [2]), and hope to
         | add integration with other frameworks down the road.
         | 
         | [0] https://replay.io
         | 
         | [1] https://github.com/RecordReplay/replay-protocol-examples
         | 
         | [2] https://github.com/RecordReplay/devtools/pull/6601
        
       | invalidname wrote:
       | Scraping is inherently fragile due to all the small changes that
       | can happen to the data model as a website evolves. The important
       | thing is to fix these things quickly. This article discusses a
       | related approach of debugging such failures directly on the
       | server: https://talktotheduck.dev/debugging-jsoup-java-code-in-
       | produ...
       | 
       | It's in Java (using JSoup) but the approach will work for Node,
       | Python, Kotlin etc. The core concept is to discover the cause of
       | the regression instantly on the server and deploy a fix fast.
        | There are also user-specific regressions in scraping that are,
        | again, very hard to debug.
        
       | dymk wrote:
       | Would this method work if the website obfuscated its HTML as per
       | the usual techniques, but also rendered everything server side?
        
         | adriancooney wrote:
         | If it's rendered server-side - no. The data likely won't be
         | loaded into the JS heap (the DOM isn't included in the heap
          | snapshots) when you visit the page. However, you might be in
          | luck if the website executes JavaScript to augment the server-
          | side-rendered page. If it does, your data may be loaded into
          | memory in a form you can extract.
        
       | marwis wrote:
        | This sadly does not help if the JS code is minified/obfuscated
        | and the data is exchanged using some binary or binary-like
        | protocol like gRPC. Unfortunately this is increasingly common.
        | 
        | The only long-term way is to parse the visible text.
        
         | mdaniel wrote:
         | I've never seen grpc from a browser on a consumer-facing site;
         | do you have an example I could see?
         | 
         | That said, for this approach something like grpc would be a
         | benefit since AIUI grpc is designed to be versioned so one
         | could identify structural changes in the payload fairly quickly
         | versus the json-y way of "I dunno, are there suddenly new
         | fields?"
        
           | marwis wrote:
            | Not aware of any actual gRPC websites, but given that grpc-web
            | has 6.5k stars on GitHub, something must be out there.
           | 
           | Google's websites frequently use binary-like formats where
           | json is just an array of values with no properties, and most
           | of these values are numbers. See for example Gmail.
        
       | toraway1234 wrote:
        
         | swsieber wrote:
         | Yeah, found when it happened -
         | https://news.ycombinator.com/item?id=30275804
        
         | d4a wrote:
         | You're banned, which means everything you post will be marked
         | as "Dead". Only those with "showdead" enabled in their profile
         | will be able to see your comments and posts. Others can "vouch"
         | for your post to make it not "dead" so it can be replied to
         | (which is what I have done)
         | 
         | As for why you were banned:
         | https://news.ycombinator.com/item?id=30275804
        
           | BbzzbB wrote:
           | Odd, I see his comment and I'm playing HN with undeads off
           | ("showdead" is "no").
        
             | d4a wrote:
             | I had to vouch for the comment (make it not dead) to reply
             | to it
        
               | BbzzbB wrote:
               | Oh I see, you're playing medic.
               | 
               | Thanks, I always wondered what "showdead" meant (tho not
               | enough to Google it I guess).
        
               | d4a wrote:
               | https://github.com/minimaxir/hacker-news-undocumented
        
         | [deleted]
        
       | kvathupo wrote:
       | The article brings up two interesting points for web
       | preservation:
       | 
       | 1. The reliance on externally hosted APIs
       | 
       | 2. Source code obfuscation
       | 
       | For 1, in order to fully preserve a webpage, you'd have to go
       | down the rabbit hole of externally hosted APIs, and preserve
       | those as well. For example, sometimes a webpage won't render
       | latex notation since a MathJax endpoint can't be connected to.
       | Were we to save this webpage, we would need a copy of MathJax JS
       | too.
       | 
        | For 2, I think WASM makes things more interesting. With
        | WebAssembly, I'd imagine it's much easier to obfuscate source code:
       | a preservationist would need a WASM decompiler for whatever
       | source language was used.
        
       | flockonus wrote:
        | Awesome experimentation! I'd be curious how you navigate the
        | heap dump in some real-world website examples.
        
       | mwcampbell wrote:
       | > Developers no longer need to label their data with class-names
       | or ids - it's only a courtesy to screen readers now.
       | 
       | In general, screen readers don't use class names or IDs. In
       | principle they can, to enable site-specific workarounds for
       | accessibility problems. But of course, that's as fragile as
       | scraping. Perhaps you were thinking of semantic HTML tag names
       | and ARIA roles.
        
         | ComputerGuru wrote:
         | Anything relying on id/class names has been broken since the
         | advent of machine-generated names that come part and parcel
          | with the most popular SPA frameworks. They're all gobbledygook
         | now, which makes writing custom ad block cosmetic filters a
         | real PITA.
        
           | jchw wrote:
           | React doesn't do that. You may still find gibberish on
           | hostile sites like Twitter which intentionally obfuscate
           | class names, using something like React Armor.
        
       | mdaniel wrote:
       | That's an exceedingly clever idea, thanks for sharing it!
       | 
       | Please consider adding an actual license text file to your repo,
       | since (a) I don't think GitHub's licensee looks inside
        | package.json, and (b) I bet _most_ of the "license" properties of
        | package.json files are "yeah, yeah, whatever" versus an
        | intentional choice: https://github.com/adriancooney/puppeteer-
        | heap-snapshot/blob... I'm not saying that applies to you, but an
        | explicit license file in the repo would make your wishes clearer.
        
         | adriancooney wrote:
         | Ah thank you for the reminder. Added it now!
        
       | marmada wrote:
       | Wow this is brilliant. I've sometimes tried to reverse engineer
       | APIs in the past, but this is definitely the next level.
       | 
       | I used to think ML models could be good for scraping too, but
       | this seems better.
       | 
       | I think this + a network request interception tool (to get data
       | that is embedded into HTML) could be the future.
        
       | lemax wrote:
        | I've used a similar technique on some web pages that get returned
        | from the server with an intact Redux state object just sitting
        | in a <script> tag. Instead of parsing the HTML, I just pull out
        | the state object. Super.
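        | 
        | A rough sketch of that trick (the URL and the assumption that the
        | page inlines its store as window.__PRELOADED_STATE__ are
        | placeholders; real pages serialize the state in all sorts of ways):
        | 
        |   // Node 18+ (global fetch): grab the serialized state straight out
        |   // of the HTML instead of parsing the rendered markup.
        |   (async () => {
        |     const res = await fetch("https://example.com/some-page");
        |     const html = await res.text();
        |     const match = html.match(
        |       /window\.__PRELOADED_STATE__\s*=\s*(\{.*?\});/s
        |     );
        |     if (match) {
        |       const state = JSON.parse(match[1]); // the whole store as JSON
        |       console.log(Object.keys(state));    // top-level state slices
        |     }
        |   })();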
        
       | 1vuio0pswjnm7 wrote:
       | "We can see the response is JSON! A clean, well-formed data
       | structure extracted magically from the webpage."
       | 
        | Honest question: why are heap snapshots required?
        | 
        | Why not just request the page and then extract the JSON?
        | 
        |   tnftp -4o 1.htm https://www.youtube.com/watch?v=L_o_O7v1ews
        |   yy059 < 1.htm > 1.json
        |   less 1.json
       | 
       | yy059 is a quick and dirty, small, simple, portable C program to
       | extract and reformat JSON to make reading JSON easier for me.
       | 
       | For me, it works. There is no "magic".
       | 
       | "Tech" companies, the ones manipulating access to public
       | information to sell online advertising services, can change their
       | free browsers at any time. In the future, they could cripple or
       | remove the heap snapshot feature. Utilising the feature to
       | extract data is interesting but how is this technique more
       | "future-proof" than requesting a web page using any client, not
       | necessarily a "tech" company's web browser, and looking at what
        | it contains?
        
         | quickthrower2 wrote:
          | Because Cloudflare, reCAPTCHA, etc. mean this is not generally
          | possible. You need to quack like a normal user for it to work.
          | If a site is really against scraping, they could probably make
          | it completely uneconomical by tracking user footprints and
          | detecting unexpected patterns of usage.
        
           | datalopers wrote:
            | They detect and block headless browsers just as easily.
        
       | Jiger104 wrote:
       | Really cool approach, great work
        
       | elbajo wrote:
       | Love this approach, thanks for sharing!
       | 
        | I am trying this on a website that Puppeteer has trouble
        | loading, so I got a heap snapshot directly in Chrome. I was trying
       | to search for relevant objects directly in the Chrome heap viewer
       | but I don't think the search looks inside objects.
       | 
       | I think your tool would work: "puppeteer-heap-snapshot query -f
       | /tmp/file.heapsnapshot -p property1" or really any JSON parser
       | but it requires extra steps. Would you say this is the easiest
       | way to view/debug a heap snapshot?
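        | 
        | One quick way to sanity-check a raw .heapsnapshot before reaching
        | for any tooling (a sketch that assumes only that the V8 snapshot
        | is JSON with a top-level "strings" array; the property names are
        | placeholders):
        | 
        |   const fs = require("fs");
        |   // V8 .heapsnapshot files are JSON; every property name and
        |   // string value ends up in the top-level "strings" array.
        |   const raw = fs.readFileSync("/tmp/file.heapsnapshot", "utf8");
        |   const snapshot = JSON.parse(raw);
        |   for (const name of ["property1", "property2"]) { // placeholders
        |     console.log(name, snapshot.strings.includes(name));
        |   }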
        
       | BbzzbB wrote:
       | This is great, thanks a lot.
       | 
       | It's my understanding that Playwright is the "new Puppeteer"
       | (even with core devs migrating). I presume this sort of technique
       | would be feasible on Playwright too? Do you think there's any
       | advantage or disadvantage of using one over the other for this
       | use case, or it's basically the same (or I'm off base and they're
       | not so interchangeable)?
       | 
       | I'm basing my personal "scraping toolbox" off Scrapy which I
       | think has decent Playwright integration, hence the question if I
       | try to reproduce this strategy in Playwright.
        
         | mdaniel wrote:
         | My understanding of Playwright is that it's trying to be the
         | new Selenium, in that it's a programming language orchestrating
         | the WebDriver protocol
         | 
         | That means that if you are running against Chromium, this will
         | likely work, but unless Firefox has a similar heapdump
         | function, it is unlikely to work[1]. And almost certainly not
         | Safari, based on my experience. All of that is also qualified
         | by whether Playwright exposes that behavior, or otherwise
         | allows one to "get into the weeds" to invoke the function under
         | the hood
         | 
         | 1 = as an update, I checked and Firefox does have a memory
         | snapshot feature, but the file it saved is some kind of binary
         | encoded thing without any obvious strings in it
         | 
         | I didn't see any such thing in Safari
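          | 
          | For Chromium it does expose the needed hook: you can open a raw
          | CDP session and call the heap profiler yourself. A rough sketch
          | (Chromium-only; the URL is a placeholder and the property
          | filtering you'd do afterwards is omitted):
          | 
          |   const { chromium } = require("playwright");
          | 
          |   (async () => {
          |     const browser = await chromium.launch();
          |     const page = await browser.newPage();
          |     await page.goto("https://example.com"); // placeholder URL
          | 
          |     // Ask V8's heap profiler for a snapshot over the DevTools
          |     // protocol; the snapshot arrives as streamed chunks.
          |     const cdp = await page.context().newCDPSession(page);
          |     const chunks = [];
          |     cdp.on("HeapProfiler.addHeapSnapshotChunk",
          |       ({ chunk }) => chunks.push(chunk));
          |     await cdp.send("HeapProfiler.enable");
          |     await cdp.send("HeapProfiler.takeHeapSnapshot",
          |       { reportProgress: false });
          |     const snapshot = JSON.parse(chunks.join("")); // same format Chrome saves
          | 
          |     await browser.close();
          |   })();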
        
           | asabla wrote:
            | Well, kind of. For Firefox there is this profiling tool,
            | which you could use (semi-built-in):
            | 
            | https://github.com/firefox-devtools/profiler. It lets you
            | save a report in json.gz format.
        
       | anyfactor wrote:
       | Very interesting. Can't wait to give it a shot.
       | 
        | I personally use a combination of XPath, basic math and regex, so
        | this class/id obfuscation isn't a major deterrent. A couple of
        | times I did find it a hassle to scrape data embedded in iframes,
        | and I can see the heap snapshots treat iframes differently.
       | 
       | Also, if a website takes the extra steps to block web scrapers,
       | identification of elements is never the main problem. It is
       | always IP bans and other security measures.
       | 
        | After all that, I do look forward to using something like this
        | and switching to a Node.js-based solution soon. But if you are
        | doing web scraping at scale, reverse engineering should always
        | be your first choice. Not only does it give you a faster
        | solution, it is more ethical (IMO) because you minimize your
        | impact on the site's resources. Rendering the full website is
        | always my last choice.
        
       | rvnx wrote:
        | Nice, this won't work anymore then.
        
         | benbristow wrote:
          | Exactly my thoughts - the author is using it 'in production',
          | and speaking out loud on a forum where Facebook/Meta employees
          | (and other Silicon Valley folk) are definitely watching is a
          | rookie mistake.
        
       | chrismeller wrote:
        | A neat idea for sure. I just wanted to point out that this is why
        | I prefer XPath over CSS selectors.
        | 
        | We all know the display of the page and the structure of the page
        | should be independent, so why would you base your selectors on
        | display? Particularly if you're looking for something on a
        | semantically designed page, why would you look for an .article, a
        | class that may disappear with the next redesign, when the site is
        | unlikely to stop using the article HTML tag?
        
         | goldenkey wrote:
         | CSS selectors don't have to select purely by classes. They can
         | be something like:
         | 
         | div > div > * > *:nth-child(7)
         | 
            | XPath doesn't have any additional abilities; it's just verbose
            | and difficult to write. It's a lemon.
        
           | tommica wrote:
            | I might be wrong, but XPath has contains(), which lets you
            | match on the text content inside an element - something I
            | don't think CSS can do.
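            | 
            | For example, in browser JS (the link text "Next" is just a
            | placeholder), this matches on visible text in a way no CSS
            | selector can:
            | 
            |   // XPath: find an <a> whose text content contains "Next"
            |   const link = document.evaluate(
            |     '//a[contains(text(), "Next")]',
            |     document, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null
            |   ).singleNodeValue;
            | 
            |   // CSS has no text predicate; the closest it gets is attribute
            |   // matching, e.g. document.querySelector('a[href*="page="]')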
        
             | mdaniel wrote:
             | Yeah, for sure XPath is the more powerful of the two, so
             | much so that Scrapy's parsel library parses CSS selectors
              | and transforms them into the equivalent XPath for execution
             | 
             | To the best of my knowledge, CSS selectors care only about
              | the _structure_ of the page, lightly dipping their toes in
              | the content only for attributes and things like ::first-letter
              | and ::first-line
        
           | chrismeller wrote:
           | Well that is 100% originally an XPath selector (:nth-child),
           | so kudos if CSS selectors support it now.
           | 
            | Still, using // instead of multiple *'s (and the two divs)
            | seems better for longer-term scraping.
        
       ___________________________________________________________________
       (page generated 2022-04-29 23:00 UTC)