[HN Gopher] Avoiding bot detection: How to scrape the web withou...
       ___________________________________________________________________
        
       Avoiding bot detection: How to scrape the web without getting
       blocked?
        
       Author : proszkinasenne2
       Score  : 90 points
       Date   : 2021-10-31 20:48 UTC (2 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | curun1r wrote:
       | There's one technique that can be very useful in some
       | circumstances that isn't mentioned. Put simply, some sites try to
       | block all bots except for those from the major search engines.
       | They don't want their content scraped, but they want the traffic
       | that comes from search. In those cases, it's often possible to
       | scrape the search engines instead using specialized queries
       | designed to get the content you want into the blurb for each
       | search result.
       | 
       | This kind of indirect scraping can be useful for getting almost
       | all the information you want from sites like LinkedIn that do
       | aggressive scraping detection.
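       | 
       | A minimal sketch of the query-building step, assuming the
       | engine supports the common site: operator and quoted phrases.
       | The function name and the search.example endpoint are
       | hypothetical placeholders, not any real engine's API.

```python
from urllib.parse import urlencode


def build_site_query(domain, phrases, base_url="https://search.example/search"):
    """Return a search URL restricted to one domain, matching exact phrases."""
    # site: limits results to the target domain; quotes force exact phrases,
    # which makes it more likely the wanted text shows up in the result blurb.
    terms = [f"site:{domain}"] + [f'"{phrase}"' for phrase in phrases]
    return base_url + "?" + urlencode({"q": " ".join(terms)})


url = build_site_query("example.com", ["data engineer", "Berlin"])
```

       | The result blurbs would still need to be parsed out of the
       | response, which depends entirely on the engine used.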
        
         | amelius wrote:
         | But won't the search engines block you after some limit has
         | been reached?
        
           | curun1r wrote:
           | Eventually, but they're not very aggressive when it comes to
           | bot detection. Simple IP rotation usually works.
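           | 
           | As a rough illustration of simple IP rotation, the sketch
           | below cycles round-robin through a proxy list. The
           | addresses are hypothetical documentation-range
           | placeholders.

```python
from itertools import cycle

# Hypothetical proxy endpoints (documentation-range addresses, not real proxies).
PROXIES = [
    "http://198.51.100.1:8080",
    "http://198.51.100.2:8080",
    "http://198.51.100.3:8080",
]


def proxy_rotation(proxies):
    """Yield one protocol-to-proxy mapping per request, round-robin."""
    for proxy in cycle(proxies):
        yield {"http": proxy, "https": proxy}


rotation = proxy_rotation(PROXIES)
first_four = [next(rotation)["http"] for _ in range(4)]
```

           | Each mapping has the shape that
           | urllib.request.ProxyHandler accepts, so the next request
           | can be sent through the next address in the list.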
        
       | rfraile wrote:
       | Datadome, PerimeterX, anyone tried one of them?
        
       | IceWreck wrote:
       | Half of the short-links to cutt.ly aren't working. Why use short
       | links in markdown ?
        
         | yamakadi wrote:
         | It's most likely for tracking clicks. Better to just search for
         | the company names instead of clicking on the links in case they
         | lead to unexpected places.
        
       | rp1 wrote:
       | It's very easy to install Chrome on a linux box and launch it
       | with a whitelisted extension. You can run Xorg using the dummy
       | driver and get a full Chrome instance (i.e. not headless). You
       | can even enable the DevTools API programmatically. I don't see
       | how this would be detectable, and probably a lot safer than
       | downloading a random browser package from an unknown developer.
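       | 
       | A sketch of what that launch could look like, assuming an X
       | server (for example Xorg with the dummy driver) is already
       | running on the given display. The binary name, paths, display
       | number, and port are assumptions, and flag support may differ
       | between Chrome versions.

```python
import os
import subprocess


def chrome_command(extension_dir, profile_dir, debug_port=9222, binary="google-chrome"):
    """Build the argv for a headed (non-headless) Chrome instance."""
    return [
        binary,
        f"--user-data-dir={profile_dir}",         # separate profile for the bot
        f"--load-extension={extension_dir}",      # unpacked extension to load
        f"--remote-debugging-port={debug_port}",  # exposes the DevTools protocol
    ]


def launch(display=":99", **kwargs):
    """Start Chrome on an existing X display; the display number is an assumption."""
    env = dict(os.environ, DISPLAY=display)
    return subprocess.Popen(chrome_command(**kwargs), env=env)


cmd = chrome_command("/opt/ext", "/tmp/profile")
```

       | Calling launch(extension_dir=..., profile_dir=...) would
       | start the browser. No --headless flag is passed, so this is a
       | full Chrome instance rendering to the dummy display.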
        
       | bsamuels wrote:
       | > I need to make a general remark to people who are evaluating
       | (and/or) planning to introduce anti-bot software on their
       | websites. Anti-bot software is nonsense. Its snake oil sold to
       | people without technical knowledge for heavy bucks.
       | 
       | If this guy got to experience how systemically bad the credential
       | stuffing problem is, he'd probably take down the whole
       | repository.
       | 
       | None of these anti-bot providers give a shit about invading your
       | privacy, tracking your every movement, or whatever other power
       | fantasy that can be imagined. Nobody pays those vendors $10m/year
       | to frustrate web crawler enthusiasts, they do it to stop
       | credential stuffing.
        
         | melony wrote:
         | The gold standard is residential IP. It is not cheap but its
         | effectiveness is indisputable.
        
           | northwest65 wrote:
           | Back when we had to scrape airline websites to get the deals
           | they withheld for themselves, residential IP was indeed the
           | way. Once they cottoned on to it and blocked it, you'd simply
           | cycle the ADSL modem, get a new IP, and off you'd go again.
           | 
           | Now the best part... one division (big team) of our company
           | worked for the (national carrier) airline, one division of
           | our company worked for the resellers (we had a single grad
           | allocated to web scraping). The airline threw ridiculous
           | dollars at trying to stop it, and we just used a caffeine
           | fueled nerd to keep it running. It wasn't all fun though,
           | they'd often release their new anti scraping stuff on a
           | Friday afternoon. They were less than impressed when they
           | learnt who the 'enemy' was. Good times!
        
             | 1cvmask wrote:
             | What do you mean by deals withheld for themselves?
        
               | northwest65 wrote:
               | Most flights are available through the airline booking
               | systems such as Sabre. However, airlines might have
               | flights available only on their own website at (sometimes
               | massively) reduced cost, which needs to be booked through
               | that site. So the web scraping had two parts. The first
               | provided the data to our search engine to present to our
               | customers' (travel agents') customers. The second booked
               | via the airline's website with the details provided by
               | our customer's customer.
        
           | jonatron wrote:
           | A residential IP would help for IP based detection. As the
           | Readme mentions, there's also Javascript based detection. If,
           | for example, your browser has navigator.webdriver set
           | incorrectly, then you can still get blocked even on a
           | residential IP.
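           | 
           | To make the point concrete, here is a toy decision rule,
           | not any vendor's real logic. The function and the
           | fingerprint dictionary are hypothetical; the one real
           | detail is that navigator.webdriver is true when the
           | browser reports that it is under automation control.

```python
def should_block(ip_is_residential, fingerprint):
    """Toy rule: a bad IP signal OR a bad browser signal is enough to block."""
    # Hypothetical fingerprint payload collected by client-side JavaScript.
    reports_automation = fingerprint.get("webdriver") is True
    return (not ip_is_residential) or reports_automation


blocked_despite_good_ip = should_block(True, {"webdriver": True})
```

           | Because the signals are checked independently, fixing the
           | IP side alone does not clear the browser side.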
        
         | [deleted]
        
         | devit wrote:
         | If users using weak/reused passwords is your problem, just
         | don't let users choose a password (generate it for them), or
         | don't use passwords at all (send link by e-mail that adds a
         | cookie), or use oauth login.
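         | 
         | A minimal sketch of the first two options using Python's
         | secrets module. The function names are hypothetical, and a
         | real system would also need token expiry and single-use
         | handling, which are left out here.

```python
import hashlib
import secrets


def generate_password(nbytes=16):
    """Generate a random password so the user never picks a weak or reused one."""
    return secrets.token_urlsafe(nbytes)


def issue_login_token():
    """Return (token to put in the e-mailed link, digest to store server-side)."""
    token = secrets.token_urlsafe(32)
    return token, hashlib.sha256(token.encode()).hexdigest()


def verify_login_token(presented, stored_digest):
    """Check a presented token against the stored digest in constant time."""
    digest = hashlib.sha256(presented.encode()).hexdigest()
    return secrets.compare_digest(digest, stored_digest)


token, stored = issue_login_token()
```

         | Storing only the digest means a leaked database does not
         | directly yield usable login links.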
        
         | Gigachad wrote:
         | 2FA should be a requirement on everything now. And if your site
         | can't for some reason or you don't want to deal with it, then
         | limit your site to external login providers only.
         | 
         | 2FA, especially app based, has been proven to work really
         | really well.
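         | 
         | For reference, app-based 2FA is typically TOTP (RFC 6238)
         | built on HOTP (RFC 4226). The sketch below shows the
         | server-side code computation with SHA-1 and a 30 second
         | step, which are the common defaults. It omits secret
         | provisioning, clock-drift windows, and rate limiting.

```python
import hashlib
import hmac
import struct
import time


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """HOTP (RFC 4226): HMAC-SHA1 over the counter, dynamically truncated."""
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F  # low nibble of the last byte picks the window
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def totp(secret: bytes, at=None, step: int = 30, digits: int = 6) -> str:
    """TOTP (RFC 6238): HOTP with the counter derived from Unix time."""
    now = time.time() if at is None else at
    return hotp(secret, int(now // step), digits)


def verify(secret: bytes, submitted: str, at=None) -> bool:
    """Compare the submitted code with the current one in constant time."""
    return hmac.compare_digest(totp(secret, at=at), submitted)
```

         | The server and the authenticator app share the secret once
         | at enrollment, then both derive the same short-lived code.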
        
         | oxymoron wrote:
         | Yeah, I used to work for one of the major anti-bot vendors.
         | Customers weren't clueless. Nobody buys these solutions because
         | they're so much fun, it's a cost center and they monitor their
         | ROI quite closely. Credit card chargebacks, impact to
         | infrastructure, extra incurred cost due to underlying APIs
         | (like in the airline industry in particular) etc. are all
         | reasons why bot mitigation is a better option than nothing for
         | a lot of companies, even if it's not 100% effective.
        
       | al2o3cr wrote:
       | > You use this software at your own risk. Some of them contain
       | malwares just fyi
       | 
       | LOL why post LINKS to them then? Flat-out irresponsible...
       | 
       | > you build a tool to automate social media accounts to manage
       | ads more efficiently
       | 
       | If by "manage" you mean "commit click fraud"
        
       | adinosaur123 wrote:
       | Are there any court cases that provide precedent regarding the
       | legality of web scraping?
       | 
       | I'm currently looking for ways to get real estate listings in a
       | particular area and apparently the only real solution is to
       | scrape the few big online listing sites.
        
         | Grimm1 wrote:
         | https://en.m.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn
         | 
         | That's one of the bigger ones. Unfortunately recent events
         | mean scraping is still a gray area.
        
           | amelius wrote:
           | Legal gray areas are perfect for growth hacking. Just look at
           | Uber and AirBnb.
        
           | omgwtfbyobbq wrote:
           | Do you mean this case?
           | 
           | https://en.m.wikipedia.org/wiki/Van_Buren_v._United_States
           | 
           | I think it only applies to systems that aren't available to
           | the general public, which in this case was the GCIC. Anything
           | that is available to the public, even if it requires some
           | sort of registration, would I think be legal to scrape. YMMV
           | though.
        
         | [deleted]
        
         | adanto6840 wrote:
         | I was involved in a scraping-related case, though in my
         | situation we were scraping public domain data/facts/public
         | domain media. Email me if you'd like additional info. :)
         | 
         | More related to the submission content -- at the time we used
         | rotating proxies, both in-house & external (ProxyMesh - still
         | exists & only good things to say about it); they allowed us to
         | "pin" multiple requests to an IP or to fetch a new IP, etc...
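         | 
         | A rough sketch of the "pin or fetch a new IP" behaviour as a
         | small in-process pool. This is not ProxyMesh's API. The
         | class, its methods, and the addresses are hypothetical.

```python
import random


class ProxyPool:
    """Keeps each logical session on one proxy until it is explicitly rotated."""

    def __init__(self, proxies, seed=None):
        self._proxies = list(proxies)
        self._rng = random.Random(seed)
        self._pinned = {}

    def get(self, session_id):
        """Return the proxy pinned to this session, assigning one if needed."""
        if session_id not in self._pinned:
            self._pinned[session_id] = self._rng.choice(self._proxies)
        return self._pinned[session_id]

    def rotate(self, session_id):
        """Pin the session to a different proxy (when one exists) and return it."""
        current = self._pinned.get(session_id)
        candidates = [p for p in self._proxies if p != current] or self._proxies
        self._pinned[session_id] = self._rng.choice(candidates)
        return self._pinned[session_id]


pool = ProxyPool(["http://198.51.100.1:8080", "http://198.51.100.2:8080"], seed=0)
pinned = pool.get("login-flow")
```

         | Pinning matters for multi-step flows such as logins, where
         | a mid-session IP change is itself a detection signal.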
        
       ___________________________________________________________________
       (page generated 2021-10-31 23:00 UTC)