hngopher.com

       [HN Gopher] How to Bypass Cloudflare: A Comprehensive Guide
       ___________________________________________________________________
        
       How to Bypass Cloudflare: A Comprehensive Guide
        
       Author : jakobdabo
       Score  : 168 points
       Date   : 2022-09-18 11:59 UTC (11 hours ago)
        
 (HTM) web link (www.zenrows.com)
 (TXT) w3m dump (www.zenrows.com)
        
       | Tiberium wrote:
       | The actual "easiest" way (at least for me) to bypass Cloudflare
       | is to find the actual IP of the web-server running behind it. Of
       | course in a lot of cases it's not possible, for example when the
       | web admin correctly limits the webserver to only respond to
       | Cloudflare IP ranges, or if
       | https://developers.cloudflare.com/ssl/origin-configuration/o...
       | is used.
       | 
       | Most useful services for that are https://shodan.io/ and
       | https://search.censys.io/. I've had decent success with Censys
       | in finding real IP addresses of websites behind Cloudflare. Of
       | course you might also have success by checking history of DNS
       | records for a particular domain.
        
         | gingerlime wrote:
         | > or if https://developers.cloudflare.com/ssl/origin-
         | configuration/o... is used.
         | 
         | How is using CF's origin CA preventing the connection to the
         | real backend in order to bypass Cloudflare? You can just ignore
         | the SSL error, couldn't you?
        
           | temp0826 wrote:
           | In addition, I think CF provides a list of IPs to whitelist
           | to only allow access from their servers.
        
           | kevincox wrote:
           | They probably meant to link
           | https://developers.cloudflare.com/ssl/origin-
           | configuration/a... where Cloudflare uses a client TLS
           | certificate to pull from the origin and the origin should be
           | configured to reject requests without a client certificate.
        
         | PigiVinci83 wrote:
         | Companies spend thousands of dollars on these anti-bot
         | solutions and then they are so misconfigured that using a
         | specific user agent or faking browsing via mobile, bypasses
         | them. Real life stories.
        
           | grogenaut wrote:
           | Often this is because you are hamstrung by old mobile apps or
           | TV apps that can't be updated forcibly and so you break
           | users. So you're making a trade-off of user pain and bot
           | deflection. So many times this is actually known and on
           | purpose. Botters hitting that loophole helps prioritize
           | closing that loophole in an agile customer experience and
           | makes it easier for engineering and product to prioritize.
           | Real life stories
        
             | PigiVinci83 wrote:
             | Totally agree.
        
             | PigiVinci83 wrote:
             | Just shared some of these stories here https://twitter.com/
             | pigivinciguerra/status/15715437943480893...
        
         | discreditable wrote:
         | This is what I expected the article to be about. I would wager
         | a lot of shops don't do the whitelisting. If they wanted to be
         | really intense they could do authenticated origin pulls.
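
The whitelisting mentioned above can be sketched roughly as follows. This is
a minimal illustration, not Cloudflare's or any web server's actual
mechanism: the function name is hypothetical, and the three CIDR blocks are
only an assumed sample of the published list, which changes over time and
should be fetched from Cloudflare rather than hard-coded.

```python
import ipaddress

# ASSUMPTION: illustrative sample only. The authoritative, current list is
# published by Cloudflare and should be loaded from there in real use.
ASSUMED_CLOUDFLARE_RANGES = [
    "173.245.48.0/20",
    "103.21.244.0/22",
    "104.16.0.0/13",
]

_NETWORKS = [ipaddress.ip_network(cidr) for cidr in ASSUMED_CLOUDFLARE_RANGES]


def is_from_allowed_range(peer_ip: str) -> bool:
    """Return True if the connecting address falls inside an allowed range."""
    addr = ipaddress.ip_address(peer_ip)
    # An IPv6 address is simply never "in" an IPv4 network, so this is safe
    # for mixed traffic.
    return any(addr in net for net in _NETWORKS)
```

An origin that rejects every connection for which this returns False would
not answer a scraper that found its real IP through Censys or DNS history.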
        
           | akira2501 wrote:
           | AWS CloudFront with S3 recommends that you just set your S3
           | to require a specific 'Referer' header variable and you set
           | CloudFront to send that custom 'Referer' with each origin
           | request.
           | 
           | Seems to work great when you use something like a GUID, and
           | no need for IP whitelisting.
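
A rough sketch of the shared-secret header idea described above, seen from
the origin's side. The secret value and function name are hypothetical, and
this is not how S3 itself evaluates bucket policies; it only shows the
comparison an origin would make.

```python
import hmac

# ASSUMPTION: hypothetical secret. In practice it would be generated randomly
# (for example a GUID) and kept out of source control.
EXPECTED_REFERER = "3f2b8c1e-9d4a-4e6f-b7a1-0c5d2e8f9a3b"


def is_request_from_cdn(headers: dict) -> bool:
    """Accept a request only if it carries the secret 'Referer' value."""
    # Header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    supplied = normalized.get("referer", "")
    # Constant-time comparison avoids leaking the secret through timing.
    return hmac.compare_digest(supplied.encode(), EXPECTED_REFERER.encode())
```

The check is only as strong as the secrecy of the header value, so it
belongs on TLS connections between the CDN and the origin.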
        
         | nlh wrote:
         | Just want to say THANK YOU for this insight. It never occurred
         | to me, and I just checked out one of the big sites that I
         | scrape for a side project and, lo and behold, you are 100%
         | correct. Found their origin in Censys in about 30 seconds and
         | I've never been able to crunch through their pages more easily.
         | 
         | To others out there who explore this: As with all scraping, be
         | gentle! If you start pounding on someone's origin server
         | directly, you're much more likely to be noticed than if you're
         | pounding on something behind a CloudFlare cache. Set rate
         | limits, scrape during off-peak hours, etc. Be a good scraping
         | citizen.
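
One simple way to follow the "set rate limits" advice above is to enforce a
minimum gap between requests. The class below is a hypothetical sketch, not
taken from any scraping library; the clock and sleep functions are
injectable only so the behaviour can be checked without real waiting.

```python
import time


class MinIntervalLimiter:
    """Ensure at least `min_interval_s` seconds pass between requests."""

    def __init__(self, min_interval_s, clock=time.monotonic, sleep=time.sleep):
        self.min_interval_s = min_interval_s
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # no request has been made yet

    def wait(self):
        """Block until the next request is allowed; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval_s
        return delay
```

Calling `limiter.wait()` before each request, with a limiter created as
`MinIntervalLimiter(2.0)`, keeps a scraper to at most one request every two
seconds against the origin.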
        
       | dizhn wrote:
       | Use zenrows. Got it. It's clickbait but it does provide a good
       | summary of how cloudflare's anti bot stuff works.
        
         | yjftsjthsd-h wrote:
         | I'd call it an ad, not clickbait. An ad with some useful
         | content, but still.
         | 
         | Edit: I think the modern term is content native advertising,
         | although I'm perfectly happy to keep using the word
         | infomercial.
        
           | throwaway81523 wrote:
           | I'd call it clickbait since its actual nature is not revealed
           | til the very end.
        
           | dizhn wrote:
           | I called it clickbait because the article does not contain
           | the thing promised in the title.
        
           | Sebb767 wrote:
           | > An ad with some useful content, but still.
           | 
           | There's a term for that, infomercial :-)
        
         | fishtoaster wrote:
         | Yeah, it's content marketing. It's got all the stylistic tells:
         | 
         | - Giving more background than is appropriate to the subject
         | (explaining what cloudflare is in an article about bypassing
         | it)
         | 
         | - Lots of fluff about "what we're going to cover" like it's a
         | poorly-written highschool essay
         | 
         | - Asking and answering questions rather than stating things:
         | "Can Cloudflare be bypassed? Thankfully, the answer is yes!"
         | 
         | I'm not _entirely_ sure what drives these things, but they seem
         | to be very common in this sort of content marketing article.
         | I'm guessing a lot of it is SEO-driven.
         | 
         | This particular article has more actual content than most, but
         | still ultimately devolves into an ad, of course.
        
           | return_to_monke wrote:
           | > - Asking and answering questions rather than stating
           | things: "Can Cloudflare be bypassed? Thankfully, the answer
           | is yes!"
           | 
           | > I'm not _entirely_ sure what drives these things, but they
           | seem to be very common in this sort of content marketing
           | article. I'm guessing a lot of it is SEO-driven.
           | 
           | I suspect that this is them trying to get into Google's
           | "frequent question"/"people also ask" [0] box, because that
           | seems like a common search term ("can you bypass cloudflare")
           | 
           | [0]: https://www.brightedge.com/glossary/people-also-ask
        
           | throwaway81523 wrote:
           | Another tell: emphasizing various phrases by boldfacing, as
           | if the rest of the article is not intended to actually be
           | read.
           | 
           | I find the mention of a series A fundraising round at the top
           | interesting too. Do the funders really expect something other
           | than an escalating technical arms race that eventually
           | outpaces them?
        
       | nothasan wrote:
       | Some impressive documentation on how to get around this BM
       | solution.
        
       | unixbane wrote:
       | This is the endgame for the web. Since it doesn't care about
       | having any simple, well-defined protocol and set of features, you
       | will always just have an arms race between charlatans and their
       | security gimmicks (cloudbleed comes to mind) vs people bypassing
       | that, and so you will only be able to browse websites in a
       | certified way, like using your bare IP address or a big 4
       | browser. The web is essentially no better than proprietary
       | software. It's broken by design, no matter what new shiny
       | (plastic) features they added this week.
       | 
       | https://en.wikipedia.org/wiki/Cloudbleed
       | 
       | edit: why is it that for the last 11 years, every single time i
       | posted something about cloudflare doing bad stuff, i get
       | immediately downvoted? the only reason i can imagine is because
       | cloudflare is your favorite pet company. I've noticed that no
       | matter what I post in my set of controversial opinions, the only
       | one that is consistently downvoted is anything against
       | cloudflare. you guys are fucking losers.
        
       | yjftsjthsd-h wrote:
       | The frustrating thing to me is that CF is that invasive and still
       | can't distinguish bots from people; it usually eventually lets me
       | through, but I've spent enough time staring at the "are you
       | _sure_ you're not a bot?" screen to laugh off their claims about
       | human/not traffic ratios.
        
       | Anunayj wrote:
       | I would also like to mention FlareSolverr [1] here, which just
       | uses a headless browser to solve the challenges, which might be
       | acceptable in some situations (that don't need high request rate)
       | 
       | 1. https://github.com/FlareSolverr/FlareSolverr
        
       | alokjnv10 wrote:
       | I hate cloudflare. I had a really hard time making a web scraper.
        
         | vntok wrote:
         | That's... the whole point.
        
           | unixbane wrote:
           | And it's an invalid point. Scraping prevention is the most
           | stupid thing Cloudflare has ever done, and that's after a
           | very long list.
        
             | simondotau wrote:
             | It's not an invalid point. Setting aside Government and
             | business services, you aren't morally entitled to clean,
             | uninterrupted access to any random website. If a webmaster
             | chooses to make your life difficult for any reason, that's
             | entirely their prerogative.
        
       | midislack wrote:
       | All this and it's just an ad for some SAAS? Fuck I got gypped.
        
       | urtom wrote:
       | If I just need to make plain GET requests in my web scraping,
       | I've found the easiest way to bypass Cloudflare on most sites is
       | to make the requests via the Internet Archive. That has some rate
       | limiting, but it can be worked around by using several source IP
       | addresses in parallel.
        
       | donutshop wrote:
       | Are there other products out there that offer a similar feature
       | set at this price point?
        
       | cj wrote:
       | There are legitimate use cases for bypassing cloudflare's bot
       | protection.
       | 
       | I discovered our company's help documentation (and integration
       | guides), hosted by readme.com, were completely de-indexed from
       | Google for the past 3 months.
       | 
       | Our Readme docs were formerly our #1 source of organic (free)
       | leads.
       | 
       | After investigating, Cloudflare (as configured by Readme) was
       | blocking Googlebot when using Cloudflare Workers. Cloudflare was
       | returning a 403 for Googlebot, but returning pages as usual for
       | regular users.
       | 
       | The cause: we were using Workers to rewrite some URLs at the edge
       | (replacing Readme's default images with optimized + compressed
       | images, using Cloudflare's own image optimization service).
       | 
       | By using Workers to do this, it resulted in Readme's Cloudflare
       | account receiving requests from our domain with "googlebot"
       | useragent, but from an IP that wasn't verified as a googlebot IP
       | address (I assume the Worker was requesting the Readme site using
       | the Googlebot user agent but with whatever IP address is used
       | when using CF Workers).
       | 
       | I emailed Cloudflare support but it was clear it would take a lot
       | of time to get them to understand the issue (and probably longer
       | to fix it).
       | 
       | So, we had to spend a lot of time figuring out how to allow
       | Googlebot requests past Cloudflare's "fake bot" firewall rule.
       | 
       | In our own Cloudflare account, we have all security settings at
       | the lowest sensitivity possible (or turned off completely). We
       | serve over 500 billion requests a month (10+ TB of bandwidth),
       | and the amount of blocked traffic to seemingly legitimate clients
       | was surprisingly high.
       | 
       | I love Cloudflare (and own quite a bit of their stock) but I'm
       | beginning to rethink my stance on their service. They make it
       | extremely easy to enable powerful features with little visibility
       | or control over the details of how those features work.
       | 
       | Another SEO nightmare is their "Crawler Hints" service. I highly
       | recommend no one uses this if you are ever the target of
       | automated security scanners (e.g. ones used by bug bounty white
       | hat hackers). With "crawler hints" enabled, a white hat hacker
       | running a scan of your site and hitting random URLs results in
       | bingbot, yandex, and other search engines attempting to index
       | every single one of the URLs hit by the security scanners used by
       | hackers.
       | 
       | Basically, it's a mess, and the only way to really fix it is to
       | bypass cloudflare or spend a lot of time and money with
       | Cloudflare debugging.
       | 
       | Next quarter I'm faced with the decision of either doubling down
       | on Cloudflare and getting an Enterprise plan with them ($20k+) or
       | just ripping them out of our stack and going back to our old AWS
       | Cloudfront set up which has fewer POPs, but was much less of a
       | hassle.
        
         | dom96 wrote:
         | > By using Workers to do this, it resulted in Readme's
         | Cloudflare account receiving requests from our domain with
         | "googlebot" useragent, but from an IP that wasn't verified as a
         | googlebot IP address (I assume the Worker was requesting the
         | Readme site using the Googlebot user agent but with whatever IP
         | address is used when using CF Workers).
         | 
         | Was this definitely the cause? It's somewhat surprising to hear
         | that requests would be rejected if the user agent doesn't match
         | a set of hard coded IP addresses.
         | 
         | Were you able to resolve this in the end? If not and the cause
         | is what you suspect then perhaps changing the user agent in
         | your worker might be a workaround.
        
           | [deleted]
        
           | kentonv wrote:
           | It actually makes sense to me. I've pinged the bots team to
           | see if we can improve here.
        
           | traek wrote:
           | > It's somewhat surprising to hear that requests would be
           | rejected if the user agent doesn't match a set of hard coded
           | IP addresses.
           | 
           | It's fairly common for DDoS/scraping prevention; Googlebot
           | (and most other crawlers) publish their IP ranges for that
           | reason[0][1][2]. I don't work at Cloudflare though, so no
           | insider knowledge of what you folks are doing.
           | 
           | [0] https://developers.google.com/search/docs/crawling-
           | indexing/...
           | 
           | [1] https://developers.facebook.com/docs/sharing/webmasters/c
           | raw...
           | 
           | [2] https://developer.twitter.com/en/docs/twitter-for-
           | websites/c...
        
         | kevincox wrote:
         | I feel this as well. Cloudflare markets itself as a set-and-
         | forget solution but really doesn't work that way. Furthermore
         | in the limited visibility that they give to blocking they frame
         | each blocked request as a success unconditionally. Of course
         | they would, that is the service they are providing. However,
         | this is often not the case: for many websites most requests
         | benefit very little from blocking and bot protection only
         | really needs to be provided for mutating endpoints and DoS
         | attacks.
         | 
         | For example the Cloudflare Blog's RSS feed is very often
         | blocked from public-cloud IP ranges with specific clients. This
         | is an endpoint that is intended to be public, is cachable and
         | even intended to be accessed by bots! This is a common issue
         | that should be very easy to solve technically but highlights
         | how Cloudflare is not a set-and-forget solution. If they can't
         | configure their own blog (a super simple case) correctly it is
         | clear that using the tool correctly requires special care and
         | monitoring of the limited visibility that they provide you.
        
           | yjftsjthsd-h wrote:
           | > Furthermore in the limited visibility that they give to
           | blocking they frame each blocked request as a success
           | unconditionally.
           | 
           | Two things that have happened to me:
           | 
           | * Cloudflare has decided that I'm a bot and stalled me, given
           | me captchas, or just blocked me outright.
           | 
           | * Cloudflare has shown me marketing claiming that 40% of
           | traffic is bots.
           | 
           | I'm not particularly impressed.
        
         | sammy2255 wrote:
         | What are you using Cloudflare for??
        
         | [deleted]
        
         | pbowyer wrote:
         | > Next quarter I'm faced with the decision of either doubling
         | down on Cloudflare and getting an Enterprise plan with them
         | ($20k+) or just ripping them out of our stack and going back to
         | our old AWS Cloudfront set up which has fewer POPs, but was
         | much less of a hassle.
         | 
         | Is Fastly a viable alternative for you?
        
         | hutrdvnj wrote:
         | I've used Googlebot as my fake browser's user agent for years.
         | It's really interesting to explore the web when everyone thinks
         | you're Google.
        
           | blitzar wrote:
           | Do websites not spit at you or do they just assume you 'will
           | do no evil'?
        
           | TrickyRick wrote:
           | What are some of the most interesting differences you've
           | seen?
        
           | simondotau wrote:
           | Like the OP, I've employed a custom configuration in
           | Cloudflare which detects (and blocks) browsers which claim to
           | be Googlebot but don't originate from Google's approved
           | Googlebot IP ranges.
           | 
           | The vast majority of such requests are dodgy scanning
           | operations likely looking for email addresses or exploitable
           | forms.
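
The "fake bot" rule discussed in this subthread boils down to comparing the
claimed user agent against the source address. Below is a minimal sketch
under stated assumptions: the function name and labels are hypothetical,
this is not Cloudflare's actual rule, and the single CIDR block is a
placeholder for the ranges Google publishes.

```python
import ipaddress

# ASSUMPTION: placeholder range for illustration. A real deployment should
# load the Googlebot ranges that Google publishes instead of hard-coding one.
ASSUMED_GOOGLEBOT_RANGES = [ipaddress.ip_network("66.249.64.0/19")]


def classify_request(user_agent: str, peer_ip: str) -> str:
    """Label a request by whether its Googlebot claim matches its source IP."""
    if "googlebot" not in user_agent.lower():
        return "regular"
    addr = ipaddress.ip_address(peer_ip)
    if any(addr in net for net in ASSUMED_GOOGLEBOT_RANGES):
        return "verified-googlebot"
    # Claims to be Googlebot but comes from somewhere else.
    return "fake-googlebot"
```

A proxy that forwards the original Googlebot user agent from its own
address would land in the "fake-googlebot" branch, which matches the
Workers behaviour cj describes.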
        
       ___________________________________________________________________
       (page generated 2022-09-18 23:00 UTC)