[HN Gopher] Avoiding bot detection: How to scrape the web withou...
___________________________________________________________________
Avoiding bot detection: How to scrape the web without getting blocked?
Author : proszkinasenne2
Score : 90 points
Date : 2021-10-31 20:48 UTC (2 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| curun1r wrote:
| There's one technique that can be very useful in some
| circumstances that isn't mentioned. Put simply, some sites try to
| block all bots except for those from the major search engines.
| They don't want their content scraped, but they want the traffic
| that comes from search. In those cases, it's often possible to
| scrape the search engines instead, using specialized queries
| designed to get the content you want into the blurb for each
| search result.
|
| This kind of indirect scraping can be useful for getting almost
| all the information you want from sites like LinkedIn that do
| aggressive scraping detection.
| amelius wrote:
| But won't the search engines block you after some limit has
| been reached?
| curun1r wrote:
| Eventually, but they're not very aggressive when it comes to
| bot detection. Simple IP rotation usually works.
| rfraile wrote:
| Datadome, PerimeterX, anyone tried one of them?
| IceWreck wrote:
| Half of the short-links to cutt.ly aren't working. Why use short
| links in markdown?
| yamakadi wrote:
| It's most likely for tracking clicks. Better to just search for
| the company names instead of clicking on the links, in case they
| lead to unexpected places.
| rp1 wrote:
| It's very easy to install Chrome on a linux box and launch it
| with a whitelisted extension. You can run Xorg using the dummy
| driver and get a full Chrome instance (i.e. not headless). You
| can even enable the DevTools API programmatically. I don't see
| how this would be detectable, and it's probably a lot safer than
| downloading a random browser package from an unknown developer.
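The setup rp1 describes might look roughly like the sketch below: an X server running the `dummy` video driver (no GPU or monitor needed) with a real, non-headless Chrome pointed at it and the DevTools protocol exposed. This is an untested config fragment, not a verified recipe — the config path, display number, resolution, and extension path are all placeholders, and it assumes a Debian-style system with the `xserver-xorg-video-dummy` package installed.

```shell
# Hypothetical /etc/X11/xorg.conf.d/10-dummy.conf -- a minimal config
# for the "dummy" video driver: a virtual framebuffer with no hardware.
#
#   Section "Device"
#       Identifier "DummyDevice"
#       Driver     "dummy"
#       VideoRam   256000
#   EndSection
#   Section "Monitor"
#       Identifier  "DummyMonitor"
#       HorizSync   30-70
#       VertRefresh 50-75
#   EndSection
#   Section "Screen"
#       Identifier   "DummyScreen"
#       Device       "DummyDevice"
#       Monitor      "DummyMonitor"
#       DefaultDepth 24
#       SubSection "Display"
#           Depth 24
#           Modes "1920x1080"
#       EndSubSection
#   EndSection

# Start X on a spare display with that config, then aim Chrome at it.
Xorg :1 -config /etc/X11/xorg.conf.d/10-dummy.conf &
export DISPLAY=:1

# --remote-debugging-port exposes the DevTools (CDP) API for automation
# tools to attach to; --load-extension side-loads a local extension.
# The extension path is a placeholder.
google-chrome \
    --remote-debugging-port=9222 \
    --load-extension=/path/to/extension \
    'https://example.com' &
```

Because this is a full Chrome talking to a real (if virtual) X display, it avoids the `HeadlessChrome` user-agent and the other fingerprint differences that headless mode introduces.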
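The indirect approach curun1r describes above can be sketched as: build a `site:`-restricted query, then pull the data out of the search engine's result snippets instead of fetching the blocked site itself. Everything concrete below is an assumption for illustration — the endpoint URL, the `snippet` class name, and the markup are placeholders, not any real engine's format (and most engines' terms of service restrict automated querying).

```python
from html.parser import HTMLParser
from urllib.parse import quote_plus


def build_query_url(site, terms):
    """Build a search query restricted to one site, so the result
    blurbs carry that site's content.  The endpoint is made up."""
    return "https://search.example.com/?q=" + quote_plus(f"site:{site} {terms}")


class SnippetParser(HTMLParser):
    """Collect the text of every <span class="snippet"> element.
    The class name is hypothetical; inspect the engine's real markup."""

    def __init__(self):
        super().__init__()
        self._in_snippet = False
        self.snippets = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "snippet") in attrs:
            self._in_snippet = True
            self.snippets.append("")

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_snippet = False

    def handle_data(self, data):
        if self._in_snippet:
            self.snippets[-1] += data


def extract_snippets(html):
    """Return the snippet texts found in a result page."""
    parser = SnippetParser()
    parser.feed(html)
    return parser.snippets
```

For example, `extract_snippets('<li><span class="snippet">Jane Doe - Engineer at Acme</span></li>')` returns `['Jane Doe - Engineer at Acme']` — enough to recover profile-style data without ever touching the aggressively defended site directly.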
| bsamuels wrote:
| > I need to make a general remark to people who are evaluating
| (and/or) planning to introduce anti-bot software on their
| websites. Anti-bot software is nonsense. Its snake oil sold to
| people without technical knowledge for heavy bucks.
|
| If this guy got to experience how systemically bad the credential
| stuffing problem is, he'd probably take down the whole
| repository.
|
| None of these anti-bot providers give a shit about invading your
| privacy, tracking your every movement, or whatever other power
| fantasy can be imagined. Nobody pays those vendors $10m/year
| to frustrate web crawler enthusiasts; they do it to stop
| credential stuffing.
| melony wrote:
| The gold standard is residential IP. It is not cheap, but its
| effectiveness is indisputable.
| northwest65 wrote:
| Back when we had to scrape airline websites to get the deals
| they withheld for themselves, residential IP was indeed the
| way. Once they cottoned on to it and blocked it, you'd simply
| cycle the ADSL modem, get a new IP, and off you'd go again.
|
| Now the best part... one division (big team) of our company
| worked for the (national carrier) airline, one division of
| our company worked for the resellers (we had a single grad
| allocated to web scraping). The airline threw ridiculous
| dollars at trying to stop it, and we just used a
| caffeine-fueled nerd to keep it running. It wasn't all fun
| though: they'd often release their new anti-scraping stuff on
| a Friday afternoon. They were less than impressed when they
| learnt who the 'enemy' was. Good times!
| 1cvmask wrote:
| What do you mean by deals withheld for themselves?
| northwest65 wrote:
| Most flights are available through the airline booking
| systems such as Sabre. However, airlines might have
| flights available only on their own website at (sometimes
| massively) reduced cost, which need to be booked through
| that site.
| So the web scraping became two parts: one to
| provide the data to our search engine to present to our
| customer (travel agent) customers. The second part was
| that we would then book via the airline's website with the
| details provided by our customer's customer.
| jonatron wrote:
| A residential IP would help for IP-based detection. As the
| Readme mentions, there's also Javascript-based detection. If,
| for example, your browser has navigator.webdriver set
| incorrectly, then you can still get blocked even on a
| residential IP.
| [deleted]
| devit wrote:
| If users using weak/reused passwords is your problem, just
| don't let users choose a password (generate it for them), or
| don't use passwords at all (send a link by e-mail that adds a
| cookie), or use OAuth login.
| Gigachad wrote:
| 2FA should be a requirement on everything now. And if your site
| can't for some reason, or you don't want to deal with it, then
| limit your site to external login providers only.
|
| 2FA, especially app-based, has been proven to work really,
| really well.
| oxymoron wrote:
| Yeah, I used to work for one of the major anti-bot vendors.
| Customers weren't clueless. Nobody buys these solutions because
| they're so much fun; it's a cost center and they monitor their
| ROI quite closely. Credit card chargebacks, impact to
| infrastructure, extra incurred cost due to underlying APIs
| (like in the airline industry in particular), etc. are all
| reasons why bot mitigation is a better option than nothing for
| a lot of companies, even if it's not 100% effective.
| al2o3cr wrote:
| > You use this software at your own risk. Some of them contain
| malwares just fyi
|
| LOL, why post LINKS to them then? Flat-out irresponsible...
|
| > you build a tool to automate social media accounts to manage
| ads more efficiently
|
| If by "manage" you mean "commit click fraud"
| adinosaur123 wrote:
| Are there any court cases that provide precedent regarding the
| legality of web scraping?
|
| I'm currently looking for ways to get real estate listings in a
| particular area, and apparently the only real solution is to
| scrape the few big online listing sites.
| Grimm1 wrote:
| https://en.m.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn
|
| That's one of the bigger ones. Unfortunately, recent events
| mean scraping is still a gray area.
| amelius wrote:
| Legal gray areas are perfect for growth hacking. Just look at
| Uber and AirBnb.
| omgwtfbyobbq wrote:
| Do you mean this case?
|
| https://en.m.wikipedia.org/wiki/Van_Buren_v._United_States
|
| I think it only applies to systems that aren't available to
| the general public, which in this case was the GCIC. Anything
| that is available to the public, even if it requires some
| sort of registration, would, I think, be legal to scrape. YMMV
| though.
| [deleted]
| adanto6840 wrote:
| I was involved in a scraping-related case, though in my
| situation we were scraping public domain data/facts/public
| domain media. Email me if you'd like additional info. :)
|
| More related to the submission content -- at the time we used
| rotating proxies, both in-house & external (ProxyMesh - still
| exists & only good things to say about it); they allowed us to
| "pin" multiple requests to an IP or to fetch a new IP, etc...
___________________________________________________________________
(page generated 2021-10-31 23:00 UTC)