[HN Gopher] Web Scraping with JavaScript
___________________________________________________________________
  Web Scraping with JavaScript
  Author : paulpro
  Score  : 95 points
  Date   : 2020-10-26 16:34 UTC (6 hours ago)
  (HTM) web link (qoob.cc)
  (TXT) w3m dump (qoob.cc)

| tekkk wrote:
| Hmm, I think I'd still choose Scrapy over JS in this case. While it can be a bit convoluted, for real production stuff I don't know any better choices.
|
| I have myself deployed a Scrapy web scraper as an AWS Lambda function and it has worked quite nicely. It has been running every day for about a year now, scraping some websites to make my life a little easier.
| cvhashim wrote:
| Cool :)
|
| I've been thinking about building a web app that scrapes specific subreddits.
| onion2k wrote:
| Reddit has a pretty decent API. PRAW is the most commonly used library for it (in Python), but there's https://github.com/not-an-aardvark/snoowrap if you're set on JS too.
| lpellis wrote:
| You can just append .json for most subreddits, eg https://www.reddit.com/r/startups/ --> https://www.reddit.com/r/startups.json
| leptons wrote:
| This article is woefully incomplete and only covers a very specific, limited use case for web scraping.
|
| It doesn't mention Puppeteer or why you may need to use something like that. It doesn't mention cookies or sessions or anything like that. And it doesn't mention using proxies or any web scraping countermeasures. It's very easy to make crawling difficult, and only very basic sites are easy to crawl with the methods described in the article.
| tuckerconnelly wrote:
| I was thinking this too. This article really shouldn't be upvoted. I can't really give away any of the secret sauces though :)
| [deleted]
| forlorn wrote:
| Web scraping only sounds like a simple thing; when you add multithreading, queues, workers and rate limiting, it becomes a real monster.
| zepearl wrote:
| Let me also add more specific issues: the depth of the websites, the length of the parameters of dynamically generated links (which can potentially be infinite if there is a circular, perpetually "adding" mechanism in the website's code), upper/lowercase characters in links (irrelevant for the protocol and domain but relevant for the rest, like the path and parameters), etc.
|
| I just started with these things and I'm having a lot of unexpected "fun" :)
| lxe wrote:
| > I've also seen few articles where they teach you how to parse HTML content with regular expressions, spoiler: don't do this.
|
| It's fine, and probably faster, to parse HTML with a regex for a wide variety of use cases. You won't release Zalgo.
| danpalmer wrote:
| To expand on this...
|
| Engineers often love to say you can't do this because regular expressions parse regular languages, and HTML is context-sensitive, not regular, and therefore it's impossible to parse.
|
| What they often miss is that the language actually being scraped may itself be regular. If you want to parse a page to see if it has the word Banana on it, then your language may be defined as .*?Banana.*?, and that's regular; it doesn't matter that it's HTML. This even applies to questions like "does this contain <element> in the <head>?", or "is there a table in the body?".
|
| HTML is not regular, but you're not implementing a browser, you're implementing the language of what you're scraping, and that may well be regular.
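A minimal sketch of the point above in Node.js - the URL is a placeholder, and the built-in https module stands in for whatever fetch layer you use:

    // Sketch: checking a page against a regular language, per the
    // comment above. The URL is a placeholder, not from the thread.
    const https = require("https");

    https.get("https://example.com/", (res) => {
      let html = "";
      res.on("data", (chunk) => (html += chunk));
      res.on("end", () => {
        // The language ".*?Banana.*?" is regular even though HTML isn't:
        // we only ask whether the word appears anywhere in the document.
        console.log(/Banana/.test(html) ? "contains Banana" : "no Banana");
      });
    });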
| roywiggins wrote:
| This works as long as you're _really_ sure that the language you'll want to parse _tomorrow_ will be regular also. It doesn't take much to accidentally add a new requirement that _isn't_, and once you've committed to regexps you may be tempted to break out the non-regular extensions that most regexp engines support, and that way lies madness.
|
| Starting with a real HTML parser is a good way to future-proof your code for when someone asks you to add just one more thing.
| danpalmer wrote:
| That's true, although I've also seen scraping fail because it was being too precise - looking for something at a particular point in the DOM tree because the parser encourages things like XPaths or CSS selectors, where a regex would have been less brittle _for that use-case_.
|
| For me this just highlights why it's important that engineers understand at some basic level what these different things all mean, and what limitations you may have with your solutions, or even those you may want.
| Minor49er wrote:
| Overwhelmingly (in my experience), you're not even really parsing HTML with regex. Rather, you're just treating it as a text document and using certain tags or code snippets as boundary points for finding the data that you want. It's certainly way faster, though prone to its own issues that don't come up as often with something like a DOM library or headless browser.
|
| Many HTML documents will have the same data included multiple times, so a lot of the limitations can be avoided by targeting the places that appear the most consistently. Most of the time a web scraper breaks because only one place was being targeted for data, often very loosely, and that place got changed. Suddenly, you wind up with either a lot of wrong data or none at all.
| vmatouch wrote:
| For more generic web indexing you need to use a browser. You do not index pages served by a server anymore; you index pages rendered by JavaScript apps in the browser. So, as part of the "fetch" stage, I usually leave parsing of the title and other page metadata to a JavaScript script running inside the browser (using https://www.browserless.io/), and then as part of the "parse" phase I use cheerio to extract links and such. It is very tempting to do everything in the browser, but architecturally it does not belong there. So you need to find the balance that works best for you.
| mnmkng wrote:
| Not necessarily. It is true that most websites today are JavaScript heavy. However, they are server-side rendered more often than not, mostly for performance reasons. Also, not all search engines are as good as Google at indexing dynamic JS websites, so it's better to serve pre-rendered HTML for that reason as well.
| [deleted]
| domenicd wrote:
| Maintainer of jsdom here. jsdom will run the JavaScript on a page, so it can get you pretty far in this regard without a proper browser. It has some definite limitations, most notably that it doesn't do any layout or handling of client-side redirects, but it allows scraping of most single-page client-side-rendered apps.
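A minimal sketch of what domenicd describes, assuming jsdom is installed (npm install jsdom); the URL and the one-second wait are placeholders:

    const { JSDOM } = require("jsdom");

    JSDOM.fromURL("https://example.com/", {
      runScripts: "dangerously", // execute the page's own scripts
      resources: "usable",       // also fetch external <script> files
    }).then((dom) => {
      // Give client-side rendering a moment to run, then query the DOM.
      setTimeout(() => {
        console.log(dom.window.document.title);
        dom.window.close();
      }, 1000);
    });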
| mrskitch wrote:
| Thanks for the mention! I'm the founder of browserless.io, and agree with pretty much everything you're saying.
|
| Our infrastructure actually uses this procedure for some of our scraping needs: we scrape Puppeteer's GH documentation page to build out our debugger's autocomplete tool. To do this, we "goto" the page, extract the page's content, and then hand it off to Node.js libraries for parsing. This has two benefits: it cuts down the time you have the browser open and running, and lets you "offload" some of that work to your back-end with more sophisticated libraries. You get the best of both worlds with this approach, and it's one we generally recommend to folks everywhere. It's also a great way for us to "dogfood" our own product :)
| paulpro wrote:
| What is the reason you are not just getting the page content directly with an HTTP request? Is the headless browser providing some benefit in your case?
| mrskitch wrote:
| Yes: often the case is that JS does some kind of data-fetching, API calls, or whatever else to render a full page (single-page apps, for instance). With GitHub being mostly plain HTML markup and not needing a JS runtime, we definitely could have gone that route. The rationale was that we had a desire to use our product ourselves, to gain better insight into what our users do, and become more empathetic to their cause.
|
| In short: we wanted to dogfood the product at the cost of some time and machine resources.
| sxp wrote:
| +1 to using cheerio.js. When I need to write a web scraper, I've used Node's `request` library to get the HTML text and cheerio to extract links and resources for the next stage.
|
| I've also used cheerio when I want to save a functioning local cache of a webpage, since I can have it transform all the various multi-server references for <img>, <a>, <script>, etc. on the page to locally valid URLs and then fetch those URLs.
| rajangdavis wrote:
| Another +1 for cheerio
|
| If I recall correctly, what was really helpful about it was that I could write whatever code I would need to query and parse the DOM in the browser console and then copy and paste it into a script with almost no changes.
|
| It made it really simple to go from a proof of concept to a pipeline for scraping material and feeding it into a database.
| rodw wrote:
| Self-plug warning, but FWIW, if you're using cheerio _just_ for the selector syntax, a related tool is Stew [1], which is a dependency-free [2] node module that allows one to extract content from web pages (DOM trees) using CSS selectors, like:
|
|     var links = stew.select(dom,'a[href]');
|
| extended with support for embedded regular expressions (for tags, classes, IDs, attributes or attribute values). E.g.:
|
|     var metadata = stew.select(dom,'head meta[name=/^dc\\.|:/i]');
|
| It's on npm as `stew-select`
|
| [1] https://github.com/rodw/stew/
|
| [2] there's an optional peer-dependency-ish relationship to htmlparser or htmlparser2 or similar to generate a DOM tree from raw HTML, but anything that creates a basic DOM tree (`{type:, name:, children:[]}`) will suffice
| domenicd wrote:
| The article didn't touch on this very well, but the reason to upgrade from cheerio to jsdom is if you want to run scripts, e.g. for client-rendered apps, or apps that pull their data from XHR. Since jsdom implements the script element, the XHR API, and a bunch of other APIs that pages might use, it can get a lot further in the page lifecycle than just "parse the bytes from the server into an initial DOM tree".
|
| (I'm a maintainer of jsdom.)
| megous wrote:
| Running the [arbitrary] scripts not written by me is what I usually try to avoid and fear when scraping.
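A small sketch of the fetch-then-parse split recommended in this subthread, assuming cheerio is installed (npm install cheerio); the URL is a placeholder:

    const https = require("https");
    const cheerio = require("cheerio");

    https.get("https://example.com/", (res) => {
      let html = "";
      res.on("data", (chunk) => (html += chunk));
      res.on("end", () => {
        const $ = cheerio.load(html);
        // Collect every href for the next stage of the crawl.
        const links = $("a[href]")
          .map((i, el) => $(el).attr("href"))
          .get();
        console.log(links);
      });
    });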
| tiborsaas wrote:
| I've just discovered Headless Chrome Crawler and it works pretty well. Not sure how well it will scale, but I'll only be indexing a few hundred sites.
|
| https://github.com/yujiosaka/headless-chrome-crawler
| mnmkng wrote:
| I have not tried the Headless Chrome Crawler personally, but try the Apify SDK (https://github.com/apify/apify-js) if the Headless Chrome Crawler does not scale well enough. We use it to scrape billions of pages every month.
| Jarred wrote:
| For websites that use React, my favorite trick is loading a copy of React Developer Tools inside a headless Chrome instance.
|
| From there, you just find the component you want to copy data from and you copy the state or props. Very little string parsing or data formatting required, no malformed data, etc. There's a library floating around on GitHub somewhere that makes loading a simplified version of React Developer Tools inside Puppeteer just a script you eval, with a jQuery-like API for selecting React components, but I can't remember the name right now.
|
| Someone could probably do this without needing a headless web browser (via jsdom).
| dastx wrote:
| Doesn't most/all React data come from XHR? Can't you just figure out how the XHR works and simply parse that?
|
| I did this with an investment website, where I was able to retrieve all the data using simple Python. It _should_ be more robust than parsing React components/HTML.
| adeelk93 wrote:
| I'd add Postman into that workflow, especially if there are headers you need to know about which are non-obvious in the XHR URL. From the network tab of your browser's debugger, copy the network request as cURL, paste the cURL into Postman's import, and then click the "code" button to translate it to Python (or whatever else) code.
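A sketch of the call-the-JSON-endpoint-directly approach dastx and adeelk93 describe. The endpoint and header values are placeholders you'd find via the browser's network tab; assumes Node 18+, where fetch is built in:

    const url = "https://example.com/api/items?page=1"; // placeholder

    fetch(url, {
      headers: {
        // Copy any non-obvious headers from the recorded request.
        "Accept": "application/json",
        "User-Agent": "Mozilla/5.0",
      },
    })
      .then((res) => res.json())
      .then((data) => {
        // The response is already structured; no HTML parsing needed.
        console.log(data);
      });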
| Jarred wrote:
| > Doesn't most/all React data come from XHR? Can't you just figure out how the XHR works and simply parse that?
|
| Content-heavy websites using React often generate static versions of pages at build time (using e.g. https://nextjs.org/docs/advanced-features/automatic-static-o...). In those cases, there might not be a public API endpoint to fetch the data you want.
|
| For applications though, it's definitely easier to just make an HTTP request if you can. However, you're more likely to run into issues like APIs blocking datacenter IPs, rate limiting, etc. than when it appears you're just loading the website like a human.
| mnmkng wrote:
| Could you explain a bit more about how you run the React DevTools in a headless Chrome? As far as I know, headless Chrome can't run extensions.
| Jarred wrote:
| I don't precisely mean React Developer Tools, because the UI is unnecessary for this use case, but it provides similar functionality, where you can access the state/props from the component instance.
|
| The library is: https://github.com/baruchvlz/resq
|
| Example code:
|
|     // resq is the stringified source of the library
|     // page is a Puppeteer page
|     // this line injects resq into the page
|     await page.evaluate(resq);
|     // This finds a React component with a prop "country" set to "us"
|     const usProps = await page.evaluate(
|       `window["resq"].resq$("*", document.querySelector("#__next")).byProps({country: "us"}).props`
|     );
|     // This finds a React component with a prop "expandRowByClick" set to true
|     const news = await page.evaluate(
|       `window["resq"].resq$("*", document.querySelector("#__next")).byProps({expandRowByClick: true}).props.dataSource`
|     );
| furstenheim wrote:
| Shameless plug: some time ago I ported Web Scraper to work with Node and Puppeteer: https://www.npmjs.com/package/web-scraper-headless
|
| That way one can build the scraper with the UI in the browser from the extension (https://chrome.google.com/webstore/detail/web-scraper-free-w...) and scrape on the server.
| mnmkng wrote:
| Hey everyone, maintainer of the Apify SDK here. As far as we know, it is the most comprehensive open-source scraping library for JavaScript (Node.js).
|
| It gives you tools to work with both HTTP requests and headless browsers, storages to save data without having to fiddle with databases, and automatic scaling based on available system resources. We use it every day in our web scraping business, but 90% of the features are available for free in the library itself.
|
| Try it out and tell us what you think: https://github.com/apify/apify-js
| mcraiha wrote:
| Minor nitpick: it is web scraping with Node.js.
| paulpro wrote:
| Have you made it to the end? There is also a quick PoC of how to do the same in the browser console.
| [deleted]
| simonw wrote:
| If you're using JavaScript for scraping, you should go straight to the logical conclusion and run your scraper inside a real browser (potentially headless) - using Puppeteer or Selenium or Playwright.
|
| My current favourite stack for this is Selenium + Python - it lets me write most of my scraper in JavaScript that I run inside the browser, but having Python to control it means I can really easily write the results to a SQLite database while the scraper is running.
|
| I wrote a bit about this here: https://simonwillison.net/2020/Oct/16/weeknotes-evernote-dat...
| megous wrote:
| I do that for background scraping (via a userscript that parses the data out of the page I visit and stores the info to a database in the background).
|
| So, for example, if I buy some electronics module on AliExpress, my scraper automatically saves all the product description and images to the database right from the browser as I'm making the order.
|
| These details usually contain vital info for using the module, so it's important to me to have an easily searchable reference for all this information. I really don't trust myself to collect all the necessary info manually.
| paulpro wrote:
| IMO, for most data-gathering needs, running a browser (even a headless one) would be overkill. A browser is better suited for complex interactions, when you need to fully pretend to be a user, or just for testing purposes, so your environments match.
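A minimal Puppeteer version of the run-it-in-a-real-browser approach simonw suggests upthread (Puppeteer in place of Selenium + Python; assumes puppeteer is installed, and the URL and selector are placeholders):

    const puppeteer = require("puppeteer");

    (async () => {
      const browser = await puppeteer.launch({ headless: true });
      const page = await browser.newPage();
      await page.goto("https://example.com/", { waitUntil: "networkidle2" });

      // The extraction logic runs as JavaScript inside the page itself.
      const headings = await page.evaluate(() =>
        Array.from(document.querySelectorAll("h2"), (h) => h.textContent.trim())
      );

      console.log(headings);
      await browser.close();
    })();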
| dwd wrote:
| I've used the Selenium API running in Firefox in the past to scrape customers' data out of proprietary .NET WebForms systems that required a login and didn't offer any option to export the data.
|
| Crawling the list pages and then each edit page in turn allowed for dumping the name and value from each input field to the log as key:value pairs for processing offline.
|
| Navigating paging was probably the biggest challenge.
| ricardo81 wrote:
| Done a fair bit of scraping in my time, mostly with PHP/curl and PHP's DOMDocument if necessary.
|
| I'd say to anyone learning how to code that it's a good exercise. I think a scraper for most sites can be built in an hour or two, depending on navigation and how data is sent to the client.
|
| Definitely noticed a trend towards XHR and JSON responses, typically using a numeric ID. Probably the easiest type of site to scrape: you don't need to crawl navigation; simply iterate over a number range and the scraped data is already pretty much structured.
| paulryanrogers wrote:
| Agreed. Though often I find sites and pages that need Chrome's flavor of JS. It's becoming increasingly inevitable that one will need Chrome/ium to reliably get the rendered markup.
| ricardo81 wrote:
| I've never really scraped anything where the valued data is in JS or dependent on a browser. Sometimes the browser uses JS to fetch the data, but generally the call is easily found in your browser console. The patterns are generally obvious.
| danpalmer wrote:
| Tools like jsdom are pretty nice for this, but I've found that most web scraping involves a lot of low-level manipulation of strings and lists - stripping/formatting/concatenating/ranges/etc. - and I find JS to have much worse ergonomics for this than languages like Python and Ruby. I actually find the ergonomics most comparable to Swift, the difference being that with Swift you get a ton of safety and speed for that trade-off.
|
| If your whole stack is JS and you need a little bit of web scraping, this makes sense. If you're starting a new scraping project from scratch, I think you'll get far further, faster, with Python or Ruby.
| 0df8dkdf wrote:
| > Tools like jsdom are pretty nice for this, but I've found that most web scraping involves a lot of low-level manipulation of strings and lists - stripping/formatting/concatenating/ranges/etc. - and I find JS to have much worse ergonomics for this than languages like Python and Ruby. I actually find the ergonomics most comparable to Swift, the difference being that with Swift you get a ton of safety and speed for that trade-off.
|
| I think from ES6 and up this is handled pretty well.
| danpalmer wrote:
| It has got better, but things like slice operators are still missing, which can help a lot; Set/Map types aren't that great to use and aren't used much in practice; and there are still lots of sharp edges for newcomers, even with simple things like iteration. That's also not mentioning things like the itertools/collections modules in Python, which provide some rich types that come in handy.
| jacobolus wrote:
| It's certainly _possible_ to make itertools-like stuff in JavaScript.
|
| https://observablehq.com/@jrus/itertools
| 0df8dkdf wrote:
| Seems like the slice operator is more like syntactic sugar for substring?
| mikedelfino wrote:
| I'm not the author of the comment you're replying to, but doesn't that fall under the worse-ergonomics argument?
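For comparison, a small sketch of the slice ergonomics being discussed; the values are hypothetical, nothing from the thread:

    const s = "  Price: $42.50  ";

    // Python: s.strip()[8:] - JS chains method calls instead.
    const price = s.trim().slice(8);        // "42.50"

    // Python: items[::2] (every other element) has no direct JS slice.
    const items = ["a", "b", "c", "d", "e"];
    const everyOther = items.filter((_, i) => i % 2 === 0); // ["a", "c", "e"]

    console.log(price, everyOther);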
| mnmkng wrote:
| Actually, if you're scraping at any scale above a hobby project, most of your web scraping hours will be spent on avoiding bot detection, reverse engineering APIs and trying to make HTTP requests work where it seems only a browser can help. The time spent "working with strings" is not even noticeable to me.
|
| I scrape for a living and I work with JS, because currently it has the better tools.
| moneywoes wrote:
| Where would you get started with Python web scraping?
| karanbhangui wrote:
| https://www.crummy.com/software/BeautifulSoup/
|
| https://github.com/encode/httpx
___________________________________________________________________
(page generated 2020-10-26 23:01 UTC)