[HN Gopher] How to Crawl the Web with Scrapy
___________________________________________________________________

How to Crawl the Web with Scrapy

Author : babblingfish
Score  : 98 points
Date   : 2021-09-13 18:34 UTC (4 hours ago)

(HTM) web link (www.babbling.fish)
(TXT) w3m dump (www.babbling.fish)

| aynyc wrote:
| I used Scrapy a lot. Just my opinion:
|
| 1. Instead of creating a urls global variable, use the
| start_requests function.
|
| 2. Don't use BeautifulSoup to parse; use CSS or XPath.
|
| 3. If you are going into multiple pages over and over again, use
| CrawlSpider with Rule.
| sintezcs wrote:
| Can you please give some details about your second point?
| What's wrong with BeautifulSoup?
| estebarb wrote:
| It is very slow. But personally, I prefer to write my
| crawlers in Go (custom code, not Colly).
| zatarc wrote:
| Try Parsel: https://github.com/scrapy/parsel
|
| It's way faster and has better support for CSS selectors.
| aynyc wrote:
| Using CSS and XPath to select elements is very natural for web
| pages. BS4 has very limited CSS selector support and zero
| XPath support.
| tamaharbor wrote:
| Any suggestions regarding how to scrape Java-based websites? (For
| example, harness racing entries and results from:
| https://racing.ustrotting.com/default.aspx)
| jcun4128 wrote:
| What's wrong with it? That seems like a server-side rendered
| page/easier to deal with than waiting for JS to load.
| artembugara wrote:
| We have to crawl about 60-80k news websites per day [0].
|
| I spent about a month testing how Scrapy could be a fit for
| our purposes. And, quite surprisingly, it was hard to design a
| distributed web crawler with it. Scrapy is great for those
| in-the-middle tasks where you need to crawl a bit and process
| data on the go.
|
| We ended up just using requests to crawl the web, then post-
| processing the web pages in the next step.
|
| Many thanks to Zyte [1] (ex-ScrapingHub) for open-sourcing so
| many wonderful tools for us.
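The fetch-first, parse-later split described above can be sketched with just the standard library (the commenter used requests for fetching; the URL handling and the title-extraction step here are illustrative stand-ins, not their pipeline):

```python
import pathlib
import urllib.request
from html.parser import HTMLParser

def fetch(url, dest):
    """Step 1: download the raw page and store it to disk; no parsing yet."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        pathlib.Path(dest).write_bytes(resp.read())

class TitleExtractor(HTMLParser):
    """Step 2 (a later, separate pass): pull one field out of stored HTML."""
    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def extract_title(raw_html):
    """Parse previously stored HTML, independent of the crawl step."""
    parser = TitleExtractor()
    parser.feed(raw_html)
    return parser.title.strip()
```

Decoupling the two steps means a parser bug only requires re-reading stored pages, not re-crawling tens of thousands of sites.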
| I've spoken to Zyte's CEO, and was
| really fascinated by how he is still a dev person while running
| such a big company.
|
| [0] https://newscatcherapi.com/news-api
| [1] https://www.zyte.com/
| adamqureshi wrote:
| OK, so I can just hire Zyte to build me a custom scraper?
| artembugara wrote:
| Well, I think it is the cheapest & fastest way, tbh.
| adamqureshi wrote:
| Thank you. The other site is also very interesting. I am
| working on this MVP; it's a news aggregator-type site for a
| NICHE product. So I need to aggregate news for a brand from
| maybe 10-20 blogs and list the URLs. Thank you for sharing
| both. I'll reach out to them.
| artembugara wrote:
| I'm the co-founder of the other one; we could help you
| with your task.
|
| Feel free to contact me: artem [at] newscatcherapi.com
| adamqureshi wrote:
| Oh awesome! Emailing you now. Thank you.
| jcun4128 wrote:
| > We have to crawl about 60-80k news websites per day [0]
|
| Can't even imagine that number... different languages or
| something?
| artembugara wrote:
| Yeah, plus many of the websites are quite niche news sites
| (construction news, for example).
| no_time wrote:
| While a decent post, this is more or less inadequate in 2021. Do
| a post on bypassing Cloudflare/other anti-botting tech using
| residential proxy swarms.
| matheusmoreira wrote:
| Yeah. I hate Cloudflare and captchas. Why can't these companies
| accept that our scrapers are valid user agents? Only Google is
| allowed to do it, nobody else.
| Eikon wrote:
| Because most scrapers aren't providing any value to website
| owners; in fact, they are costing them, unlike Google.
| r_singh wrote:
| Exactly!
|
| While there are scraping APIs that unblock requests and charge
| for them, I'd love to learn more about how they work...
| wswope wrote:
| Scraping is a cat and mouse game that'll vary a lot by site.
| I'm far from an expert and welcome correction here, but the
| two big tricks that'll go a long way AFAIK are using a
| residential proxy service (never tried one - they tend to be
| quite shady), and using a webdriver-type setup like Selenium
| or Puppeteer to mock realistic behavior (though IIRC you have
| to obfuscate both of those systems since they're detectable via
| JS).
| [deleted]
| Eikon wrote:
| They use residential proxies with altered clients and/or
| headless browsers. Cloudflare's bot protection mostly makes
| use of TLS fingerprinting, and is thus pretty easy to bypass.
| mmerlin wrote:
| Yes, Scrapy is quite a good scraper technology for some
| features, especially caching, but for some websites it's like
| doing things the hard way...
|
| The easiest scraper with a proxy rotator I've found is in my
| current fave web automator, scraper scripter, and scheduler:
| Rtila [1]
|
| Created by an indie/solo developer-on-fire cranking out user-
| requested features quite quickly...
| check the releases page [2].
|
| I have used (or at least trialled) the vast majority of
| scraper tech and written hundreds of scrapers since my first
| VB5 controlling IE and dumping to SQL Server in the '90s, then
| moving to various PHP and Python libs/frameworks and a
| handful of Windows apps like uBot and iMacros (both of which
| were useful to me at some point, but I never use them nowadays).
|
| A recent release of Rtila allows creating standalone bots you
| can run using its built-in local Node.js server (which also
| has its own locally hosted server API you can program anything
| else against using any language you like).
|
| [1] https://www.rtila.net
| [2] https://github.com/IKAJIAN/rtila-releases/releases
| Lammy wrote:
| I'm sure Rtila is fantastic at what it does, but I gotta say
| it's hilarious to see a landing page done in the Corporate
| Memphis art style but worded in euphemism:
| https://www.rtila.net/#h.d30as4n2092u
|
| "'Cause if the web server said no, then the answer obviously
| is no. The thing is that it's not _gonna_ say no--it'd never
| say no, because of the innovation."
| r_singh wrote:
| I've used Scrapy extensively for writing crawlers.
|
| There are a lot of good things, like not having to worry about
| storage backends, request throttling (random seconds between
| requests), and the ability to run parallel parsers easily. There is
| also a lot of open-source middleware to help with things like
| retrying requests with proxies and rotating user agents.
|
| However, like any batteries-included framework, it has downsides in
| terms of flexibility.
|
| In most cases requests and lxml should be enough to crawl the
| web.
| aynyc wrote:
| If you are just doing one or two pages, say you want to get
| the weather for your location, then requests is sufficient. But if
| you want to do many pages where you might want to scan and
| follow, requests gets tedious very quickly.
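The "scan and follow" bookkeeping that makes plain requests tedious can be sketched as a small breadth-first crawler (the `fetch` callback and the URLs are hypothetical stand-ins for real HTTP calls); this dedup-and-queue logic is roughly what Scrapy's CrawlSpider with Rule handles for you, along with scheduling, throttling, and retries:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collect href targets from anchor tags on one page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, fetch, max_pages=100):
    """Breadth-first scan-and-follow; fetch(url) returns HTML text."""
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:       # never fetch a page twice
                seen.add(absolute)
                queue.append(absolute)
    return pages
```

Once you also need politeness delays, error handling, and resumability, hand-rolling this is where requests-based code starts to sprawl.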
| r_singh wrote:
| If you're a web developer, not really: rather than worrying
| about storage backends, spiders, yielding, and managing loops
| and items, you could just host a DRF or Flask API with your
| scrapers (written in requests+lxml) initiated by an API
| request.
|
| I guess it's a matter of preference.
| ducktective wrote:
| > In most cases requests and lxml should be enough to crawl the
| web.
|
| Don't mind my `curl | pup xmlstarlet grep(!!)`s... Nothing to
| see here...
| amozoss wrote:
| My brother-in-law had just finished his pilot training and
| was trying to apply for a job as a teacher to continue his
| training.
|
| However, the jobs were first come, first served, so he was
| waking up at 4 am and constantly refreshing for hours trying
| to be the first one.
|
| When I heard about it, I quickly whipped up a `curl | grep &&
| send_notif` (used pushback.io for notifs) and it helped him
| not have to worry so much.
|
| When a new job posting finally came along, he was the first in
| line and got the job :)
| davidatbu wrote:
| Is the complete example (i.e., a git repo or the Python file)
| linked anywhere in the blog post?
| babblingfish wrote:
| That's a good idea. I added a link to download a Python file
| with all the code at the end of the article.
| question002 wrote:
| Like, who upvotes this? We actually have programming news here
| too. It's just funny we're supposed to believe stuff like Rust is
| ever going to catch on, when 90% of the interest on this site
| is just simple scripting tasks.
| yewenjie wrote:
| Related question - what is a very fast and easy-to-use library
| for scraping static sites such as Google search results?
| zamadatix wrote:
| Google search isn't a static site; the results are dynamically
| generated based on what it knows about you (location, browser
| language, recent searches from IP, recent searches from
| account, and so on, with all of the things they know from trying
| to sell ad slots to that device).
|
| That being said, there isn't anything wrong with using Scrapy
| for this. If you're more familiar with web browsers than Python,
| something like https://github.com/puppeteer/puppeteer can also
| be turned into a quick way to scrape a site by giving you a
| headless browser controlled by whatever you script in Node.js.
| yewenjie wrote:
| I see. I am familiar with Python, but I don't need something
| as heavy as Scrapy. Ideally I am looking for something that
| is very lightweight and fast and can just parse the DOM using
| CSS selectors.
| paulcole wrote:
| I've had excellent luck with SerpApi. It's $50 a month for
| 5,000 searches, which has been plenty for my needs at a small
| SEO/marketing agency.
|
| http://serpapi.com
| wirthjason wrote:
| I love Scrapy! It's a wonderful tool.
|
| One of the most underrated features is the request caching. It
| really helps with the problem of finding out your spider crashed
| or you didn't parse all the data you wanted and having to rerun
| the job. Rather than making hundreds or thousands of requests,
| you can get them from the cache.
|
| One nitpick is that the documentation could be a bit better about
| integrating Scrapy with other Python projects/code rather than
| running it directly from the command line.
|
| Also, some of the internal names are a bit vague. There's a
| Spider and a Crawler. What's the difference? To most people these
| would be the same thing. This makes reading the source code a
| little tricky.
___________________________________________________________________
(page generated 2021-09-13 23:00 UTC)
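The request caching praised in the last comment is Scrapy's built-in HTTP cache, switched on through project settings; a minimal settings.py fragment might look like this (the directory, expiration, and storage values shown are Scrapy's documented defaults, spelled out for clarity):

```python
# settings.py -- enable Scrapy's HTTP cache so a re-run replays
# responses from disk instead of re-fetching every page.
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = "httpcache"      # stored under the project's .scrapy/ directory
HTTPCACHE_EXPIRATION_SECS = 0    # 0 means cached responses never expire
HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"
```

With this in place, a spider that crashes halfway through (or a parser that missed a field) can be rerun against the cached responses at disk speed.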