[HN Gopher] How to Crawl the Web with Scrapy
       ___________________________________________________________________
        
       How to Crawl the Web with Scrapy
        
       Author : babblingfish
       Score  : 98 points
       Date   : 2021-09-13 18:34 UTC (4 hours ago)
        
 (HTM) web link (www.babbling.fish)
 (TXT) w3m dump (www.babbling.fish)
        
       | aynyc wrote:
        | I've used Scrapy a lot. Just my opinion:
       | 
        | 1. Instead of creating a urls global variable, use the
        | start_requests method.
        | 
        | 2. Don't parse with BeautifulSoup; use CSS or XPath selectors.
        | 
        | 3. If you are following links into multiple pages over and over
        | again, use CrawlSpider with Rule.
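        | 
        | A minimal sketch of what those three points might look like
        | together (the selectors and the example.com URL are
        | hypothetical):
        | 
        |     import scrapy
        |     from scrapy.spiders import CrawlSpider, Rule
        |     from scrapy.linkextractors import LinkExtractor
        | 
        |     class BooksSpider(CrawlSpider):
        |         name = "books"
        | 
        |         # 3. Follow listing and detail links via Rules instead
        |         #    of hand-written loops over pages.
        |         rules = (
        |             Rule(LinkExtractor(restrict_css=".pagination a")),
        |             Rule(LinkExtractor(restrict_css="article a"),
        |                  callback="parse_item"),
        |         )
        | 
        |         # 1. Generate the initial requests here instead of a
        |         #    global urls variable.
        |         def start_requests(self):
        |             yield scrapy.Request("https://example.com/catalog")
        | 
        |         # 2. Parse with built-in CSS/XPath selectors instead of
        |         #    BeautifulSoup.
        |         def parse_item(self, response):
        |             title = response.css("h1::text").get()
        |             price = response.xpath("//span/text()").get()
        |             yield {"title": title, "price": price}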
        
         | sintezcs wrote:
         | Can you please give some details about your second point?
          | What's wrong with BeautifulSoup?
        
           | estebarb wrote:
           | It is very slow. But personally, I prefer to write my
           | crawlers in Go (custom code, not Colly).
        
             | zatarc wrote:
             | Try Parsel: https://github.com/scrapy/parsel
             | 
             | It's way faster and has better support for CSS selectors.
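              | 
              | A quick sketch of what that looks like (the HTML string
              | here is just a made-up example):
              | 
              |     from parsel import Selector
              | 
              |     html = "<ul><li class='item'>foo</li>" \
              |            "<li class='item'>bar</li></ul>"
              |     sel = Selector(text=html)
              | 
              |     # CSS and XPath both work on the same object.
              |     sel.css("li.item::text").getall()
              |     # -> ['foo', 'bar']
              |     sel.xpath("//li[@class='item']/text()").getall()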
        
           | aynyc wrote:
            | Using CSS & XPath to select elements is very natural for
            | web pages. BS4 has very limited CSS selector support and
            | zero XPath support.
        
       | tamaharbor wrote:
       | Any suggestions regarding how to scrape Java-based websites? (For
       | example, harness racing entries and results from:
       | https://racing.ustrotting.com/default.aspx).
        
         | jcun4128 wrote:
         | What's wrong with it? That seems like a server-side rendered
         | page/easier to deal with than waiting for JS to load.
        
       | artembugara wrote:
       | We have to crawl about 60-80k news websites per day [0].
       | 
        | I spent about a month testing whether Scrapy could be a fit for
        | our purposes, and, quite surprisingly, it was hard to design a
        | distributed web crawler with it. Scrapy is great for those
        | in-the-middle tasks where you need to crawl a bit and process
        | data on the go.
       | 
        | We ended up just using requests to crawl the web, then
        | post-processing the pages in a separate step.
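        | 
        | Roughly, the fetch step only downloads and stores raw HTML,
        | and parsing happens later. A simplified illustration (not our
        | actual pipeline):
        | 
        |     import requests
        |     from lxml import html
        | 
        |     def fetch(urls):
        |         # Step 1: download raw HTML only, no parsing here.
        |         session = requests.Session()
        |         pages = {}
        |         for url in urls:
        |             resp = session.get(url, timeout=10)
        |             if resp.ok:
        |                 # In practice: write to disk/object storage.
        |                 pages[url] = resp.text
        |         return pages
        | 
        |     def post_process(pages):
        |         # Step 2: parse the stored pages in a separate pass.
        |         for url, raw in pages.items():
        |             tree = html.fromstring(raw)
        |             yield url, tree.findtext(".//title")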
       | 
        | Many thanks to Zyte [1] (ex-ScrapingHub) for open-sourcing so
        | many wonderful tools for us. I've spoken to Zyte's CEO, and was
        | really fascinated by how he is still a dev person while running
        | such a big company.
       | 
       | [0] https://newscatcherapi.com/news-api [1] https://www.zyte.com/
        
         | adamqureshi wrote:
          | OK, so I can just hire Zyte to build me a custom scraper?
        
           | artembugara wrote:
            | Well, I think it is the cheapest & fastest way, tbh.
        
             | adamqureshi wrote:
              | Thank you. The other site is also very interesting. I am
              | working on this MVP, and it's a news aggregator type site
              | for a NICHE product. So I need to aggregate news for a
              | brand from maybe 10-20 blogs and list the URLs. Thank you
              | for sharing both. I'll reach out to them.
        
               | artembugara wrote:
                | I'm the co-founder of the other one; we could help you
                | with your task.
               | 
               | Feel free to contact me. artem [at] newscatcherapi.com
        
               | adamqureshi wrote:
               | Oh awesome! emailing you now. Thank you.
        
         | jcun4128 wrote:
         | > We have to crawl about 60-80k news websites per day [0]
         | 
         | Can't even imagine that number... different languages or
         | something?
        
           | artembugara wrote:
            | Yeah, plus many of the websites are quite niche news
            | sources (construction news, for example).
        
       | no_time wrote:
        | While a decent post, this is more or less inadequate in 2021.
        | Do a post on bypassing Cloudflare and other anti-botting tech
        | using residential proxy swarms.
        
         | matheusmoreira wrote:
          | Yeah. I hate Cloudflare and captchas. Why can't these
          | companies accept that our scrapers are valid user agents?
          | Only Google is allowed to do it, nobody else.
        
           | Eikon wrote:
            | Because most scrapers aren't providing any value to website
            | owners; in fact, they are costing them money, unlike Google.
        
         | r_singh wrote:
         | Exactly!
         | 
         | While there are scraping APIs that unblock requests and charge
         | for them, I'd love to learn more about how they work....
        
           | wswope wrote:
            | Scraping is a cat-and-mouse game that will vary a lot by
            | site. I'm far from an expert and welcome correction here,
            | but the two big tricks that will go a long way, AFAIK, are
            | using a residential proxy service (never tried one - they
            | tend to be quite shady) and using a webdriver-type setup
            | like Selenium or Puppeteer to mock realistic behavior
            | (though IIRC you have to obfuscate both of those systems,
            | since they're detectable via JS).
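            | 
            | For the webdriver half, the bare-bones starting point is
            | something like this (the proxy address is a placeholder,
            | and real anti-bot evasion needs a lot more than this):
            | 
            |     from selenium import webdriver
            | 
            |     options = webdriver.ChromeOptions()
            |     # Route traffic through a (placeholder) proxy endpoint.
            |     options.add_argument(
            |         "--proxy-server=http://proxy.example:8000")
            |     # Headless is convenient but itself detectable.
            |     options.add_argument("--headless")
            | 
            |     driver = webdriver.Chrome(options=options)
            |     driver.get("https://example.com/")
            |     print(driver.title)
            |     driver.quit()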
        
           | [deleted]
        
           | Eikon wrote:
            | They use residential proxies with altered clients and/or
            | headless browsers. Cloudflare's bot protection mostly makes
            | use of TLS fingerprinting, and is thus pretty easy to
            | bypass.
        
         | mmerlin wrote:
         | Yes, Scrapy is quite a good scraper technology for some
         | features, especially caching, but for some websites it's like
         | doing things the hard way...
         | 
         | The easiest scraper with a proxy rotator I've found is in my
         | current fave web-automator, scraper scripter and scheduler:
         | Rtila [1]
         | 
          | Created by an indie/solo developer on fire, cranking out
          | user-requested features quite quickly... check the releases
          | page [2].
         | 
          | I have used (or at least trialled) the vast majority of
          | scraper tech and written hundreds of scrapers, starting with
          | my first VB5 script controlling IE and dumping to SQL Server
          | in the '90s, then moving on to various PHP and Python
          | libs/frameworks and a handful of Windows apps like uBot and
          | iMacros (both of which were useful to me at some point, but I
          | never use them nowadays).
         | 
          | A recent release of Rtila allows creating standalone bots you
          | can run using its built-in local Node.js server (which also
          | has its own locally hosted server API that you can program
          | anything else against, using any language you like).
         | 
         | [1] https://www.rtila.net
         | 
         | [2] https://github.com/IKAJIAN/rtila-releases/releases
        
           | Lammy wrote:
           | I'm sure Rtila is fantastic at what it does, but I gotta say
           | it's hilarious to see a landing page done in the Corporate
            | Memphis art style but worded in euphemism:
           | https://www.rtila.net/#h.d30as4n2092u
           | 
           | "'Cause if the web server said no, then the answer obviously
           | is no. The thing is that it's not _gonna_ say no--it'd never
           | say no, because of the innovation. "
        
       | r_singh wrote:
       | I've used Scrapy extensively for writing crawlers.
       | 
        | There are a lot of good things, like not having to worry about
        | storage backends, request throttling (random seconds between
        | requests), and the ability to run parallel parsers easily.
        | There is also a lot of open-source middleware to help with
        | things like retrying requests with proxies and rotating user
        | agents.
       | 
        | However, like any batteries-included framework, it has
        | downsides in terms of flexibility.
       | 
       | In most cases requests and lxml should be enough to crawl the
       | web.
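        | 
        | For example, a tiny requests + lxml fetch-and-parse (the URL
        | and XPath here are just placeholders):
        | 
        |     import requests
        |     from lxml import html
        | 
        |     resp = requests.get("https://example.com/", timeout=10)
        |     resp.raise_for_status()
        | 
        |     tree = html.fromstring(resp.content)
        |     # Pull out what you need with XPath (or cssselect).
        |     titles = tree.xpath("//h2/a/text()")
        |     links = tree.xpath("//h2/a/@href")
        |     print(list(zip(titles, links)))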
        
         | aynyc wrote:
          | If you are just doing one or two pages, say you want to get
          | the weather for your location, then requests is sufficient.
          | But if you want to do many pages where you might want to scan
          | and follow links, requests gets tedious very quickly.
        
           | r_singh wrote:
            | If you're a web developer, not really: rather than worrying
            | about storage backends, spiders, yielding, and managing
            | loops and items, you could just host a DRF or Flask API
            | with your scrapers (written in requests + lxml) initiated
            | by an API request.
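            | 
            | Something along these lines (a toy sketch; the endpoint
            | and the scrape logic are made up):
            | 
            |     import requests
            |     from flask import Flask, jsonify, request
            |     from lxml import html
            | 
            |     app = Flask(__name__)
            | 
            |     @app.route("/scrape")
            |     def scrape():
            |         # Target URL arrives as ?url=... on the request.
            |         url = request.args.get("url",
            |                                "https://example.com/")
            |         resp = requests.get(url, timeout=10)
            |         tree = html.fromstring(resp.content)
            |         title = tree.findtext(".//title")
            |         return jsonify({"url": url, "title": title})
            | 
            |     if __name__ == "__main__":
            |         app.run()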
           | 
           | I guess it's a matter of preference
        
         | ducktective wrote:
         | > In most cases requests and lxml should be enough to crawl the
         | web.
         | 
         | Don't mind my `curl | pup xmlstarlet grep(!!)`s... Nothing to
         | see here...
        
           | amozoss wrote:
            | My brother-in-law had just finished his pilot training and
            | was trying to apply for a job as a teacher to continue his
            | training.
           | 
            | However, the jobs were first come, first served, so he was
            | waking up at 4 am and constantly refreshing for hours,
            | trying to be the first one.
           | 
           | When I heard about it, I quickly whipped up a `curl | grep &&
           | send_notif` (used pushback.io for notifs) and it helped him
           | not have to worry so much.
           | 
           | When a new job posting finally came along he was the first in
           | line and got the job :)
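            | 
            | In Python terms, the whole thing boils down to roughly
            | this (the URL, keyword and notify step are placeholders):
            | 
            |     import time
            |     import requests
            | 
            |     URL = "https://example.com/jobs"   # placeholder page
            |     KEYWORD = "open position"          # placeholder term
            | 
            |     def notify(message):
            |         # Placeholder: push/SMS/email however you like.
            |         print("NOTIFY:", message)
            | 
            |     while True:
            |         page = requests.get(URL, timeout=10).text
            |         if KEYWORD.lower() in page.lower():
            |             notify(f"'{KEYWORD}' appeared on {URL}")
            |             break
            |         time.sleep(300)   # poll every 5 minutes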
        
       | davidatbu wrote:
       | Is the complete example (ie, a git repo or the python file)
       | linked anywhere in the blog post?
        
         | babblingfish wrote:
          | That's a good idea. I've added a link to download a Python
          | file with all the code at the end of the article.
        
       | question002 wrote:
        | Like who upvotes this? We actually have programming news here
        | too. It's just funny we're supposed to believe stuff like Rust
        | is ever going to catch on, when 90% of the interest on this
        | site is just in doing simple scripting tasks.
        
       | yewenjie wrote:
        | Related question - what is a very fast and easy-to-use library
        | for scraping static sites such as Google search results?
        
         | zamadatix wrote:
          | Google search isn't a static site; the results are
          | dynamically generated based on what it knows about you
          | (location, browser language, recent searches from the IP,
          | recent searches from the account, and so on, along with all
          | of the things they know from trying to sell ad slots to that
          | device).
         | 
          | That being said, there isn't anything wrong with using Scrapy
          | for this. If you're more familiar with web browsers than
          | Python, something like https://github.com/puppeteer/puppeteer
          | can also be turned into a quick way to scrape a site, by
          | giving you a headless browser controlled by whatever you
          | script in Node.js.
        
           | yewenjie wrote:
            | I see. I am familiar with Python, but I don't need
            | something as heavy as Scrapy. Ideally I am looking for
            | something that is very lightweight and fast and can just
            | parse the DOM using CSS selectors.
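            | 
            | If Parsel (mentioned upthread) counts as lightweight
            | enough, that combination might be close to what you're
            | after (URL and selector are placeholders):
            | 
            |     import requests
            |     from parsel import Selector
            | 
            |     resp = requests.get("https://example.com/", timeout=10)
            |     sel = Selector(text=resp.text)
            |     # Plain CSS selectors, no framework around it.
            |     for href in sel.css("a::attr(href)").getall():
            |         print(href)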
        
         | paulcole wrote:
         | I've had excellent luck with SerpAPI. It's $50 a month for
         | 5,000 searches which has been plenty for my needs at a small
         | SEO/marketing agency.
         | 
         | http://serpapi.com
        
       | wirthjason wrote:
       | I love scrapy! It's a wonderful tool.
       | 
        | One of the most underrated features is the request caching. It
        | really helps when you find out your spider crashed, or that
        | you didn't parse all the data you wanted, and you have to
        | rerun the job. Rather than making hundreds or thousands of
        | requests again, you can serve them from the cache.
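        | 
        | (For reference, the cache is switched on with a couple of
        | settings in settings.py; the values here are just examples:)
        | 
        |     HTTPCACHE_ENABLED = True
        |     HTTPCACHE_DIR = "httpcache"        # under .scrapy/
        |     HTTPCACHE_EXPIRATION_SECS = 0      # 0 = never expire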
       | 
        | One nitpick is that the documentation could be a bit better
        | about integrating Scrapy with other Python projects/code rather
        | than running it directly from the command line.
       | 
       | Also, some of their internal names are a bit vague. There's a
       | Spider and a Crawler. What's the difference? To most people these
       | would be the same thing. This makes reading the source code a
       | little tricky.
        
       ___________________________________________________________________
       (page generated 2021-09-13 23:00 UTC)