[HN Gopher] Web Scraping with JavaScript
       ___________________________________________________________________
        
       Web Scraping with JavaScript
        
       Author : paulpro
       Score  : 95 points
       Date   : 2020-10-26 16:34 UTC (6 hours ago)
        
 (HTM) web link (qoob.cc)
 (TXT) w3m dump (qoob.cc)
        
       | tekkk wrote:
       | Hmm, I think I'd still choose Scrapy over JS in this case. While
       | it can be a bit convoluted, for real production stuff I don't
       | know any better choices.
       | 
        | I have myself deployed a Scrapy web scraper as an AWS Lambda
        | function and it has worked quite nicely. Every day for the last
        | year or so it has been scraping some websites to make my
       | life a little easier.
        
       | cvhashim wrote:
       | Cool :)
       | 
       | I've been thinking about building a web app that scrapes specific
       | subreddits.
        
         | onion2k wrote:
         | Reddit has a pretty decent API. PRAW is the most commonly used
          | library for it (in Python), but there's
          | https://github.com/not-an-aardvark/snoowrap if you're set on
          | JS too.
        
         | lpellis wrote:
         | You can just append .json for most subreddits, eg
         | https://www.reddit.com/r/startups/ -->
         | https://www.reddit.com/r/startups.json
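          | 
          | A minimal sketch of consuming that (assuming node-fetch; it
          | also helps to send a descriptive User-Agent):
          | 
          | const fetch = require('node-fetch');
          | 
          | (async () => {
          |   const res = await fetch('https://www.reddit.com/r/startups.json', {
          |     headers: { 'User-Agent': 'my-scraper/0.1' },
          |   });
          |   const listing = await res.json();
          |   // Reddit listings nest posts under data.children
          |   for (const post of listing.data.children) {
          |     console.log(post.data.title, post.data.url);
          |   }
          | })();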
        
       | leptons wrote:
       | This article is woefully incomplete and only covers a very
       | specific limited use case for web scraping.
       | 
       | It doesn't mention puppeteer or why you may need to use something
       | like that. It doesn't mention cookies or sessions or anything
       | like that. And it doesn't mention using proxies or any web
       | scraping countermeasures. It's very easy to make crawling
       | difficult, and only very basic sites are easy to crawl with the
       | methods described in the article.
        
         | tuckerconnelly wrote:
         | I was thinking this too. This article really shouldn't be
         | upvoted. I can't really give away any of the secret sauces
         | though :)
        
       | [deleted]
        
       | forlorn wrote:
        | Web scraping sounds like a simple thing, but when you add
       | multithreading, queues, workers and rate limiting it becomes a
       | real monster.
        
         | zepearl wrote:
          | Let me also add more specific issues, like the depth of the
          | websites, the length of parameters in dynamically generated
          | links (which can grow without bound if the site's code keeps
          | appending to them in a loop), upper/lowercase characters in
          | links (irrelevant for the protocol & domain but significant for
          | the rest, like the path and parameters), etc.
         | 
         | I just started with these things and I'm having a lot of
         | unexpected "fun" :)
        
       | lxe wrote:
       | > I've also seen few articles where they teach you how to parse
       | HTML content with regular expressions, spoiler: don't do this.
       | 
       | It's fine, and probably faster, to parse HTML with a regex for a
       | wide variety of use cases. You won't release zalgo.
        
         | danpalmer wrote:
         | To expand on this...
         | 
         | Engineers often love to say you can't do this because regular
         | expressions parse regular languages, and HTML is context-
         | sensitive, not regular, and therefore it's impossible to parse.
         | 
          | What they often miss is that the language actually being
          | scraped may in fact be regular. If you want to parse a page to
          | see if it has the word Banana on it, then your language may be
          | defined as .*?Banana.*?, and that's regular; it doesn't matter
          | that it's HTML. This even applies to questions like "does this
          | contain <element> in the <head>?" or "is there a table in the
          | body?".
         | 
         | HTML is not regular, but you're not implementing a browser,
         | you're implementing the language of what you're scraping, and
         | that may well be regular.
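          | 
          | A trivial sketch of that kind of check, straight over the raw
          | HTML string, no parser involved:
          | 
          | // Assumes `html` holds the fetched page source
          | const hasBanana = /Banana/.test(html);
          | 
          | // "Is there a <meta name=description> in the <head>?" --
          | // grab the head, then test inside it
          | const head = (/<head[\s\S]*?<\/head>/i.exec(html) || [''])[0];
          | const hasDescription = /<meta[^>]+name=["']description["']/i.test(head);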
        
           | roywiggins wrote:
            | This works as long as you're _really_ sure that the language
            | you'll want to parse _tomorrow_ will be regular also. It
            | doesn't take much to accidentally add a new requirement that
            | _isn't_, and once you've committed to regexps you may be
           | tempted to break out the non-regular extensions that most
           | regexp engines support, and that way lies madness.
           | 
           | Starting with a real HTML parser is a good way to future-
           | proof your code for when someone asks you to add just one
           | more thing.
        
             | danpalmer wrote:
             | That's true, although I've also seen scraping fail because
             | it was being too precise - looking for something at a
             | particular point in the DOM tree because the parser
             | encourages things like XPaths or CSS selectors, where a
             | regex would have been less brittle _for that use-case_.
             | 
              | For me this just highlights why it's important that
              | engineers understand at some basic level what these
              | different things all mean, and what limitations you may
              | have with your solutions, or even those you may want.
        
         | Minor49er wrote:
         | Overwhelmingly (in my experience), you're not even really
         | parsing HTML with regex. Rather, you're just treating it as a
         | text document and using certain tags or code snippets as
         | boundary points for finding the data that you want. It's
         | certainly way faster, though prone to its own issues that don't
         | come up as often with something like a DOM library or headless
         | browser.
         | 
         | Many HTML documents will have the same data included multiple
         | times, so a lot of the limitations can be avoided by targeting
          | the places that appear the most consistently. The most common
          | reason a web scraper breaks is that only one place was being
          | targeted for data, and often very loosely; that place gets
          | changed, and suddenly you wind up with either a lot of wrong
          | data or none at all.
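          | 
          | A rough sketch of that "boundary points" idea (the marker
          | string here is hypothetical):
          | 
          | // Treat the page as plain text: find a landmark, slice out what follows
          | const marker = 'itemprop="price" content="';
          | const start = html.indexOf(marker);
          | const price = start === -1
          |   ? null
          |   : html.slice(start + marker.length,
          |                html.indexOf('"', start + marker.length));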
        
       | vmatouch wrote:
       | For more generic web indexing you need to use a browser. You do
       | not index pages served by a server anymore, you index pages
       | rendered by javascript apps in the browser. So as a part of the
       | "fetch" stage I usually let parsing of title and other page
       | metadata to a javascript script running inside the browser (using
       | https://www.browserless.io/) and then as part of the "parse"
       | phase I use cheerio to extract links and such. It is very
       | tempting to do everything in the browser, but architecturally it
       | does not belong there. So you need to find the balance that works
       | best for you.
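        | 
        | Roughly the split described above, sketched here with puppeteer +
        | cheerio rather than the browserless.io setup itself:
        | 
        | const puppeteer = require('puppeteer');
        | const cheerio = require('cheerio');
        | 
        | (async () => {
        |   const browser = await puppeteer.launch();
        |   const page = await browser.newPage();
        |   await page.goto('https://example.com', { waitUntil: 'networkidle0' });
        | 
        |   // "fetch" stage: let the browser's own JS produce the rendered page
        |   const title = await page.evaluate(() => document.title);
        |   const html = await page.content();
        |   await browser.close();
        | 
        |   // "parse" stage: extract links outside the browser with cheerio
        |   const $ = cheerio.load(html);
        |   const links = $('a[href]').map((i, el) => $(el).attr('href')).get();
        |   console.log(title, links.length);
        | })();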
        
         | mnmkng wrote:
         | Not necessarily. It is true that most websites today are
         | JavaScript heavy. However, they are server-side rendered more
         | often than not. Mostly for performance reasons. Also, not all
         | search engines are as good as Google at indexing dynamic JS
         | websites, so it's better to serve pre-rendered HTML for that
         | reason as well.
        
         | [deleted]
        
         | domenicd wrote:
         | Maintainer of jsdom here. jsdom will run the JavaScript on a
         | page, so it can get you pretty far in this regard without a
         | proper browser. It has some definite limitations, most notably
         | that it doesn't do any layout or handling of client-side
         | redirects, but it allows scraping of most single-page client-
         | side-rendered apps.
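          | 
          | A minimal sketch of that (the URL and selector are just
          | placeholders):
          | 
          | const { JSDOM } = require('jsdom');
          | 
          | // runScripts + loadable resources let the page's own JS execute,
          | // so client-rendered content shows up in the DOM
          | JSDOM.fromURL('https://example.com/spa', {
          |   runScripts: 'dangerously',
          |   resources: 'usable',
          | }).then((dom) => {
          |   // give the page's scripts a moment to render, then read the DOM
          |   setTimeout(() => {
          |     console.log(dom.window.document.querySelector('#app').textContent);
          |   }, 1000);
          | });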
        
         | mrskitch wrote:
         | Thanks for the mention! I'm the founder of browserless.io, and
         | agree with pretty much everything you're saying.
         | 
          | Our infrastructure actually does this procedure for some of our
          | scraping needs: we scrape puppeteer's GH documentation page to
          | build out our debugger's autocomplete tool. To do this, we
          | "goto" the page, extract the page's content, and then hand it
          | off to nodejs libraries for parsing. This has two benefits: it
          | cuts down the time you have the browser open and running, and
          | lets you "offload" some of that work to your back-end with
         | more sophisticated libraries. You get the best of both worlds
         | with this approach, and it's one we generally recommend to
         | folks everywhere. Also a great way that we "dogfood" our own
         | product as well :)
        
           | paulpro wrote:
            | What is the reason you are not just getting the page content
            | directly with an HTTP request? Does a headless browser
            | provide some benefit in your case?
        
             | mrskitch wrote:
             | Yes: often the case is that JS does some kind of data-
             | fetching, API calls, or whatever else to render a full page
             | (single-page apps for instance). With Github being mostly
             | just HTML markup and not needing a JS runtime we could have
             | definitely gone that route. The rationale was that we had a
             | desire to use our product ourselves, to gain better insight
             | into what our users do, and become more empathetic to their
             | cause.
             | 
              | In short: we wanted to dogfood the product at the cost of
              | some time and machine resources.
        
       | sxp wrote:
       | +1 to using cheerio.js. When I need to write a web scraper, I've
       | used Node's `request` library to get the HTML text and cheerio to
       | extract links and resources for the next stage.
       | 
       | I've also used cheerio when I want to save a functioning local
       | cache of a webpage since I can have it transform all the various
       | multi-server references for <img>, <a>, <script>, etc on the page
       | to locally valid URLs and then fetch those URLs.
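        | 
        | The first stage of that looks roughly like this (`request` is
        | deprecated these days, but the shape is the same with any HTTP
        | client):
        | 
        | const request = require('request');
        | const cheerio = require('cheerio');
        | 
        | request('https://example.com', (err, res, body) => {
        |   if (err) throw err;
        |   const $ = cheerio.load(body);
        |   // collect links and resources to feed the next stage
        |   const links = $('a[href]').map((i, el) => $(el).attr('href')).get();
        |   const assets = $('img[src], script[src]').map((i, el) => $(el).attr('src')).get();
        |   console.log(links.length, assets.length);
        | });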
        
         | rajangdavis wrote:
          | Another +1 for cheerio.js
          | 
          | If I recall correctly, what was really helpful about it was that
          | I could write whatever code I would need to query and parse the
          | DOM in the browser console and then copy and paste it into a
          | script with almost no changes.
          | 
          | It made it really simple to go from a proof of concept to a
          | pipeline for scraping material and feeding it into a database.
        
         | rodw wrote:
         | Self-plug warning but FWIW if you're using cheerio _just_ for
         | the selector syntax a related tool is Stew [1] which is a
         | dependency-free [2] node module that allows one to extract
         | content from web pages (DOM trees) using CSS selectors, like:
         | 
         | var links = stew.select(dom,'a[href]');
         | 
          | extended with support for embedded regular expressions (for
         | tags, classes, IDs, attributes or attribute values). E.g.:
         | 
         | var metadata = stew.select(dom,'head meta[name=/^dc\\.|:/i]');
         | 
         | It's on npm as `stew-select`
         | 
         | [1] https://github.com/rodw/stew/
         | 
         | [2] there's an optional peer-dependency-ish relationship to
         | htmlparser or htmlparser2 or similar to generate a DOM tree
         | from raw HTML but anything that creates a basic DOM tree
         | (`{type:, name:, children:[] }`) will suffice
        
         | domenicd wrote:
         | The article didn't touch on this very well, but the reason to
         | upgrade from cheerio to jsdom is if you want to run scripts.
         | E.g., for client-rendered apps, or apps that pull their data
         | from XHR. Since jsdom implements the script element, and the
         | XHR API, and a bunch of other APIs that pages might use, it can
         | get a lot further in the page lifecycle than just "parse the
         | bytes from the server into an initial DOM tree".
         | 
         | (I'm a maintainer of jsdom.)
        
           | megous wrote:
            | Running [arbitrary] scripts not written by me is what I
            | usually try to avoid, and fear, when scraping.
        
       | tiborsaas wrote:
       | I've just discovered Headless Chrome crawler and it works pretty
       | well. Not sure how well it will scale, but I'll index a few
       | hundred sites only.
       | 
       | https://github.com/yujiosaka/headless-chrome-crawler
        
         | mnmkng wrote:
         | I have not tried the Headless Chrome Crawler personally, but
         | try the Apify SDK out https://github.com/apify/apify-js if the
         | Headless Chrome crawler does not scale well enough. We use it
         | to scrape billions of pages every month.
        
       | Jarred wrote:
       | For websites that use React, my favorite trick is loading a copy
       | of React Developer Tools inside a headless Chrome instance.
       | 
       | From there, you just find the component you want to copy data
       | from and you copy the state or props. Very little string parsing
       | or data formatting required, no malformed data, etc. There's a
       | library floating around on GitHub somewhere that makes loading a
       | simplified version of React Developer Tools inside Puppeteer just
       | a script you eval with a jQuery-like API for selecting React
       | components, but I can't remember the name right now.
       | 
       | Someone could probably do this without needing a headless web
       | browser (via jsdom)
        
         | dastx wrote:
         | Doesn't most/all react data come from xhr? Can't you just
         | figure out how the xhr works, and simply parse that?
         | 
         | I did this with an investment website, where I was able to
         | retrieve all data using simple python. It _should_ be more
         | robust than parsing react components/html.
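            | 
            | E.g., once you've found the call in the network tab, it's
            | often just something like this (endpoint and params are
            | hypothetical):
            | 
            | const fetch = require('node-fetch');
            | 
            | (async () => {
            |   // the JSON endpoint the React app itself calls
            |   const res = await fetch('https://example.com/api/holdings?account=123', {
            |     headers: { Accept: 'application/json' },
            |   });
            |   console.log(await res.json());
            | })();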
        
           | adeelk93 wrote:
            | I'd add Postman into that workflow, especially if there are
            | headers you need to know about which are non-obvious in the
           | xhr url. From the network tab of your browser's debugger,
           | copy the network request as cURL, paste the cURL into
           | Postman's import, and then click the "code" button to
           | translate to python (or whatever else) code.
        
           | Jarred wrote:
           | > Doesn't most/all react data come from xhr? Can't you just
           | figure out how the xhr works, and simply parse that?
           | 
           | Content-heavy websites using React often generate static
           | versions of pages at build time (using e.g.
           | https://nextjs.org/docs/advanced-features/automatic-
           | static-o...). In those cases, there might not be a public API
           | endpoint to fetch the data you want
           | 
           | For applications though, it's definitely easier to just make
           | an HTTP request if you can. However, you're more likely to
           | run into issues like APIs blocking datacenter IPs, rate
           | limiting etc than when it appears you're just loading the
           | website like a human
        
         | mnmkng wrote:
         | Could you explain a bit more about how you run the React
         | DevTools in a headless Chrome? As far as I know, headless
         | Chrome can't run extensions.
        
           | Jarred wrote:
           | I don't precisely mean React Developer Tools because the UI
            | is unnecessary for this use case, but it provides similar
           | functionality where you can access the state/props from the
           | component instance.
           | 
           | The library is: https://github.com/baruchvlz/resq
           | 
            | Example code:
            | 
            | // resq is the stringified source of the library
            | // page is a Puppeteer page
            | 
            | // this line injects resq into the page
            | await page.evaluate(resq);
            | 
            | // This finds a React component with a prop "country" set to "us"
            | const usProps = await page.evaluate(
            |   `window["resq"].resq$("*", document.querySelector("#__next")).byProps({country: "us"}).props`
            | );
            | 
            | // This finds a React component with a prop "expandRowByClick" set to true
            | const news = await page.evaluate(
            |   `window["resq"].resq$("*", document.querySelector("#__next")).byProps({expandRowByClick: true}).props.dataSource`
            | );
        
       | furstenheim wrote:
        | Shameless plug: some time ago I ported webscraper to work with
       | node and puppeteer https://www.npmjs.com/package/web-scraper-
       | headless.
       | 
       | That way one can build the scraper with the ui in the browser
       | from the extension https://chrome.google.com/webstore/detail/web-
       | scraper-free-w... and scrape on the server.
        
       | mnmkng wrote:
       | Hey everyone, maintainer of the Apify SDK here. As far as we
       | know, it is the most comprehensive open-source scraping library
       | for JavaScript (Node.js).
       | 
       | It gives you tools to work with both HTTP requests and headless
       | browsers, storages to save data without having to fiddle with
       | databases and automatic scaling based on available system
       | resources. We use it every day in our web scraping business, but
        | 90% of the features are available for free in the library itself.
       | 
       | Try it out and tell us what you think:
       | https://github.com/apify/apify-js
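        | 
        | Roughly what a basic crawl looks like (a sketch; check the repo
        | for the exact current API):
        | 
        | const Apify = require('apify');
        | 
        | Apify.main(async () => {
        |   const requestQueue = await Apify.openRequestQueue();
        |   await requestQueue.addRequest({ url: 'https://example.com' });
        | 
        |   // CheerioCrawler does plain HTTP + cheerio parsing; swap in
        |   // PuppeteerCrawler when a real browser is needed
        |   const crawler = new Apify.CheerioCrawler({
        |     requestQueue,
        |     handlePageFunction: async ({ request, $ }) => {
        |       await Apify.pushData({ url: request.url, title: $('title').text() });
        |     },
        |   });
        | 
        |   await crawler.run();
        | });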
        
       | mcraiha wrote:
       | Minor nitpick, it is Web scraping with NodeJS.
        
         | paulpro wrote:
          | Have you made it to the end? There is also a quick PoC of how
          | to do the same in the browser console.
        
         | [deleted]
        
       | simonw wrote:
       | If you're using JavaScript for scraping, you should go straight
       | to the logical conclusion and run your scraper inside a real
       | browser (potentially headless) - using Puppeteer or Selenium or
       | Playwright.
       | 
       | My current favourite stack for this is Selenium + Python - it
       | lets me write most of my scraper in JavaScript that I run inside
       | of the browser, but having Python to control it means I can
       | really easily write the results to a SQLite database while the
       | scraper is running.
       | 
       | I wrote a bit about this here:
       | https://simonwillison.net/2020/Oct/16/weeknotes-evernote-dat...
        
         | megous wrote:
         | I do that for the background scraping. (via userscript that
         | parses the data out of the page I visit and stores info to
         | database in the background)
         | 
         | So for example if I buy some electronics module on aliexpress,
          | my scraper automatically saves all the product description and
         | images to the database right from the browser as I'm making the
         | order.
         | 
         | These details usually contain vital info to use the module, so
         | it's important to me to have an easily searchable reference for
         | all this information. I really don't trust myself to collect
         | all the necessary info manually.
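          | 
          | The general shape of such a userscript (the selectors and the
          | local endpoint are placeholders; the local service has to allow
          | the cross-origin POST):
          | 
          | // ==UserScript==
          | // @name   Save product details
          | // @match  https://www.aliexpress.com/item/*
          | // @grant  none
          | // ==/UserScript==
          | 
          | (function () {
          |   const data = {
          |     url: location.href,
          |     title: document.querySelector('h1') && document.querySelector('h1').textContent.trim(),
          |     images: [...document.querySelectorAll('img')].map((img) => img.src),
          |   };
          |   // hand the data to a small local service that writes it to the database
          |   fetch('http://localhost:3000/save', {
          |     method: 'POST',
          |     headers: { 'Content-Type': 'application/json' },
          |     body: JSON.stringify(data),
          |   });
          | })();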
        
         | paulpro wrote:
          | IMO, for most data-gathering needs running a browser (even a
          | headless one) would be overkill. A browser is better suited for
          | complex interactions, when you need to fully pretend
         | to be a user. Or just for testing purposes so your environments
         | match.
        
           | dwd wrote:
            | I've used the Selenium API running in Firefox in the past to
            | scrape customer data out of proprietary .Net WebForm systems
           | requiring a login that didn't offer any option to export the
           | data.
           | 
           | Crawling the list pages and then each edit page in turn
           | allowed for dumping the name and value from each input field
           | to the log as key:value pairs for processing offline.
           | 
           | Navigating paging was probably the biggest challenge.
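            | 
            | In JS that kind of dump is only a few lines with
            | selenium-webdriver (a sketch; login and paging handling are
            | omitted, and the URL is a placeholder):
            | 
            | const { Builder, By } = require('selenium-webdriver');
            | 
            | (async () => {
            |   const driver = await new Builder().forBrowser('firefox').build();
            |   try {
            |     await driver.get('https://example.com/records/edit/1');
            |     const inputs = await driver.findElements(By.css('input'));
            |     for (const input of inputs) {
            |       const name = await input.getAttribute('name');
            |       const value = await input.getAttribute('value');
            |       // key:value pairs for offline processing
            |       if (name) console.log(name + ':' + value);
            |     }
            |   } finally {
            |     await driver.quit();
            |   }
            | })();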
        
       | ricardo81 wrote:
       | Done a fair bit of scraping in my time, mostly with PHP/curl and
       | PHP's DOMDocument if necessary.
       | 
        | I'd say to anyone learning how to code that it's a good learning
        | exercise. I think a scraper for most sites can be built in an hour
       | or two, depending on navigation and how data is sent to the
       | client.
       | 
       | Definitely noticed a trend towards XHR and JSON responses
       | typically using a numeric ID. Probably the easiest type of site
       | to scrape where you don't need to crawl navigation, simply
       | iterate over a number range and the scraped data is already
       | pretty much structured.
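        | 
        | That pattern in a nutshell (hypothetical endpoint; add real rate
        | limiting for anything serious):
        | 
        | const fetch = require('node-fetch');
        | 
        | (async () => {
        |   const items = [];
        |   for (let id = 1; id <= 500; id++) {
        |     const res = await fetch(`https://example.com/api/items/${id}`);
        |     await new Promise((r) => setTimeout(r, 500)); // be polite
        |     if (res.status === 404) continue; // gaps in the ID range are common
        |     items.push(await res.json());
        |   }
        |   console.log(items.length, 'items');
        | })();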
        
         | paulryanrogers wrote:
         | Agreed. Though often I find sites and pages that need Chrome's
         | flavor of JS. It's becoming increasingly inevitable one will
         | need Chrome/ium to reliably get the rendered markup.
        
           | ricardo81 wrote:
           | I've never really scraped anything where the valued data is
           | in JS or dependent on a browser. Sometimes the browser uses
           | JS to fetch the data, but generally the call is easily found
           | out in your browser console. The patterns are generally
           | obvious.
        
       | danpalmer wrote:
       | Tools like JSDom are pretty nice for this, but I've found that
       | most web scraping involves a lot of low level manipulation of
       | strings and lists -
       | stripping/formatting/concatenating/ranges/etc, and I find JS to
       | have much worse ergonomics for this than languages like Python
       | and Ruby. I actually find the ergonomics of this most comparable
       | to Swift, the difference being that with Swift you get a ton of
       | safety and speed for that trade-off.
       | 
       | If your whole stack is JS and you need a little bit of web
       | scraping, this makes sense. If you're starting a new scraping
       | project from scratch, I think you'll get far further, faster,
       | with Python or Ruby.
        
         | 0df8dkdf wrote:
         | >Tools like JSDom are pretty nice for this, but I've found that
         | most web scraping involves a lot of low level manipulation of
         | strings and lists -
         | stripping/formatting/concatenating/ranges/etc, and I find JS to
         | have much worse ergonomics for this than languages like Python
         | and Ruby. I actually find the ergonomics of this most
         | comparable to Swift, the difference being that with Swift you
         | get a ton of safety and speed for that trade-off.
         | 
          | I think from ES6 and up this is handled pretty well.
        
           | danpalmer wrote:
            | It has made things better, but slice operators, which can
            | help a lot, are still missing; Set/Map types aren't that
            | great to use and aren't used much in practice; and there are
            | still lots of sharp edges for newcomers even with simple
            | things like iteration. That's also not mentioning
           | the itertools/collections modules in Python which provide
           | some rich types that come in handy.
        
             | jacobolus wrote:
             | It's certainly _possible_ to make itertools-like stuff in
             | Javascript.
             | 
             | https://observablehq.com/@jrus/itertools
        
             | 0df8dkdf wrote:
              | Seems like the slice operator is more like syntactic sugar
              | for substring?
        
               | mikedelfino wrote:
                | I'm not the author of the comment you're replying to, but
               | doesn't that fall under the worse ergonomics argument?
        
         | mnmkng wrote:
         | Actually, if you're scraping at any scale above a hobby
         | project, most of your web scraping hours would now be spent on
         | avoiding bot detection, reverse engineering APIs and trying to
         | make HTTP requests work where it seems only a browser can help.
         | The time spent "working with strings" is not even noticeable to
         | me.
         | 
         | I scrape for a living and I work with JS, because currently, it
         | has the better tools.
        
         | moneywoes wrote:
            | Where would you get started with Python web scraping?
        
           | karanbhangui wrote:
           | https://www.crummy.com/software/BeautifulSoup/
           | 
           | https://github.com/encode/httpx
        
       ___________________________________________________________________
       (page generated 2020-10-26 23:01 UTC)