[HN Gopher] How to Bypass Cloudflare: A Comprehensive Guide
___________________________________________________________________
How to Bypass Cloudflare: A Comprehensive Guide

Author : jakobdabo
Score  : 168 points
Date   : 2022-09-18 11:59 UTC (11 hours ago)

(HTM) web link (www.zenrows.com)
(TXT) w3m dump (www.zenrows.com)

| Tiberium wrote:
| The actual "easiest" way (at least for me) to bypass Cloudflare is to find the actual IP of the web server running behind it. Of course, in a lot of cases that's not possible, for example when the web admin correctly limits the web server to respond only to Cloudflare IP ranges, or if https://developers.cloudflare.com/ssl/origin-configuration/o... is used.
|
| The most useful services for that are https://shodan.io/ and https://search.censys.io/. I've had decent success with Censys at finding the real IP addresses of websites behind Cloudflare. Of course, you might also have success by checking the history of DNS records for a particular domain.
| gingerlime wrote:
| > or if https://developers.cloudflare.com/ssl/origin-configuration/o... is used.
|
| How does using CF's origin CA prevent a connection to the real backend in order to bypass Cloudflare? You can just ignore the SSL error, couldn't you?
| temp0826 wrote:
| In addition, I think CF provides a list of IPs to whitelist to only allow access from their servers.
| kevincox wrote:
| They probably meant to link https://developers.cloudflare.com/ssl/origin-configuration/a... where Cloudflare uses a client TLS certificate to pull from the origin and the origin should be configured to reject requests without a client certificate.
| PigiVinci83 wrote:
| Companies spend thousands of dollars on these anti-bot solutions, and then they are so misconfigured that using a specific user agent, or faking browsing via mobile, bypasses them. Real life stories.
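The allowlisting idea raised in this subthread (an origin that only answers requests arriving from Cloudflare's IP ranges) can be sketched roughly as follows. This is a minimal Python illustration, not anyone's production setup: `ALLOWED_RANGES` and `is_allowed_source` are hypothetical names, and the two CIDR blocks are only a sample. A real deployment would load the full, current list that Cloudflare publishes and would usually enforce it at the firewall or web server rather than in application code.

```python
import ipaddress

# Illustrative subset only (assumption): load the complete, current list
# published by Cloudflare instead of hard-coding ranges like this.
ALLOWED_RANGES = [
    ipaddress.ip_network("173.245.48.0/20"),
    ipaddress.ip_network("104.16.0.0/13"),
]


def is_allowed_source(remote_addr: str) -> bool:
    """Return True if remote_addr falls inside one of the allowed CIDR ranges."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Not a parseable IP address at all: reject.
        return False
    # An IPv6 address is never "in" an IPv4 network, so mixed versions
    # simply evaluate to False here.
    return any(addr in net for net in ALLOWED_RANGES)
```

An origin that rejects every request where `is_allowed_source` returns False cannot be reached by someone who merely discovers its IP address. As kevincox notes, authenticated origin pulls add a client-certificate check on top of this.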
| grogenaut wrote:
| Often this is because you are hamstrung by old mobile apps or TV apps that can't be forcibly updated, and so you break users. So you're making a trade-off between user pain and bot deflection. So many times this is actually known and on purpose. Botters hitting that loophole helps prioritize closing that loophole in an agile customer experience and makes it easier for engineering and product to prioritize. Real life stories
| PigiVinci83 wrote:
| Totally agree.
| PigiVinci83 wrote:
| Just shared some of these stories here https://twitter.com/pigivinciguerra/status/15715437943480893...
| discreditable wrote:
| This is what I expected the article to be about. I would wager a lot of shops don't do the whitelisting. If they wanted to be really intense they could do authenticated origin pulls.
| akira2501 wrote:
| AWS CloudFront with S3 recommends that you just set your S3 to require a specific 'Referer' header variable and you set CloudFront to send that custom 'Referer' with each origin request.
|
| Seems to work great when you use something like a GUID, and no need for IP whitelisting.
| nlh wrote:
| Just want to say THANK YOU for this insight. It never occurred to me, and I just checked out one of the big sites that I scrape for a side project and, lo and behold, you are 100% correct. Found their origin in Censys in about 30 seconds and I've never been able to crunch through their pages more easily.
|
| To others out there who explore this: As with all scraping, be gentle! If you start pounding on someone's origin server directly, you're much more likely to be noticed than if you're pounding on something behind a CloudFlare cache. Set rate limits, scrape during off-peak hours, etc. Be a good scraping citizen.
| dizhn wrote:
| Use zenrows. Got it. It's clickbait but it does provide a good summary of how cloudflare's anti bot stuff works.
| yjftsjthsd-h wrote:
| I'd call it an ad, not clickbait.
| An ad with some useful content, but still.
|
| Edit: I think the modern term is content native advertising, although I'm perfectly happy to keep using the word infomercial.
| throwaway81523 wrote:
| I'd call it clickbait since its actual nature is not revealed til the very end.
| dizhn wrote:
| I called it clickbait because the article does not contain the thing promised in the title.
| Sebb767 wrote:
| > An ad with some useful content, but still.
|
| There's a term for that, infomercial :-)
| fishtoaster wrote:
| Yeah, it's content marketing. It's got all the stylistic tells:
|
| - Giving more background than is appropriate to the subject (explaining what cloudflare is in an article about bypassing it)
|
| - Lots of fluff about "what we're going to cover" like it's a poorly-written highschool essay
|
| - Asking and answering questions rather than stating things: "Can Cloudflare be bypassed? Thankfully, the answer is yes!"
|
| I'm not _entirely_ sure what drives these things, but they seem to be very common in this sort of content marketing article. I'm guessing a lot of it is SEO-driven.
|
| This particular article has more actual content than most, but still ultimately devolves into an ad, of course.
| return_to_monke wrote:
| > - Asking and answering questions rather than stating things: "Can Cloudflare be bypassed? Thankfully, the answer is yes!"
|
| > I'm not _entirely_ sure what drives these things, but they seem to be very common in this sort of content marketing article. I'm guessing a lot of it is SEO-driven.
|
| I suspect that this is them trying to get into Google's "frequent question"/"people also ask" [0] box, because that seems like a common search term ("can you bypass cloudflare")
|
| [0]: https://www.brightedge.com/glossary/people-also-ask
| throwaway81523 wrote:
| Another tell: emphasizing various phrases by boldfacing, as if the rest of the article is not intended to actually be read.
|
| I find the mention of a series A fundraising round at the top interesting too. Do the funders really expect something other than an escalating technical arms race that eventually outpaces them?
| nothasan wrote:
| Some impressive documentation on how to get around this BM solution.
| unixbane wrote:
| This is the endgame for the web. Since it doesn't care about having any simple, well-defined protocol and set of features, you will always just have an arms race between charlatans and their security gimmicks (cloudbleed comes to mind) vs people bypassing that, and so you will only be able to browse websites in a certified way, like using your bare IP address or a big 4 browser. The web is essentially no better than proprietary software. It's broken by design, no matter what new shiny (plastic) features they added this week.
|
| https://en.wikipedia.org/wiki/Cloudbleed
|
| edit: why is it that for the last 11 years, every single time i posted something about cloudflare doing bad stuff, i get immediately downvoted? the only reason i can imagine is because cloudflare is your favorite pet company. I've noticed that no matter what I post in my set of controversial opinions, the only one that is consistently downvoted is anything against cloudflare. you guys are fucking losers.
| yjftsjthsd-h wrote:
| The frustrating thing to me is that CF is that invasive and still can't distinguish bots from people; it usually eventually lets me through, but I've spent enough time staring at the "are you _sure_ you're not a bot?" screen to laugh off their claims about human/not traffic ratios.
| Anunayj wrote:
| I would also like to mention FlareSolverr [1] here, which just uses a headless browser to solve the challenges, which might be acceptable in some situations (that don't need high request rate)
|
| 1. https://github.com/FlareSolverr/FlareSolverr
| alokjnv10 wrote:
| I hate cloudflare.
| I had a really hard time making a web scraper.
| vntok wrote:
| That's... the whole point.
| unixbane wrote:
| And it's an invalid point. Scraping prevention is the most stupid thing Cloudflare has ever done, and that's after a very long list.
| simondotau wrote:
| It's not an invalid point. Setting aside Government and business services, you aren't morally entitled to clean, uninterrupted access to any random website. If a webmaster chooses to make your life difficult for any reason, that's entirely their prerogative.
| midislack wrote:
| All this and it's just an ad for some SAAS? Fuck I got gypped.
| urtom wrote:
| If I just need to make plain GET requests in my web scraping, I've found the easiest way to bypass Cloudflare on most sites is to make the requests via the Internet Archive. That has some rate limiting, but it can be worked around by using several source IP addresses in parallel.
| donutshop wrote:
| Are there other products out there that offer a similar feature set at this price point?
| cj wrote:
| There are legitimate use cases for bypassing cloudflare's bot protection.
|
| I discovered our company's help documentation (and integration guides), hosted by readme.com, were completely de-indexed from Google for the past 3 months.
|
| Our Readme docs were formerly our #1 source of organic (free) leads.
|
| After investigating, Cloudflare (as configured by Readme) was blocking Googlebot when using Cloudflare Workers. Cloudflare was returning a 403 for Googlebot, but returning pages as usual for regular users.
|
| The cause: we were using Workers to rewrite some URLs at the edge (replacing Readme's default images with optimized + compressed images, using Cloudflare's own image optimization service).
|
| By using Workers to do this, it resulted in Readme's Cloudflare account receiving requests from our domain with "googlebot" useragent, but from an IP that wasn't verified as a googlebot IP address (I assume the Worker was requesting the Readme site using the Googlebot user agent but with whatever IP address is used when using CF Workers).
|
| I emailed Cloudflare support but it was clear it would take a lot of time to get them to understand the issue (and probably longer to fix it).
|
| So, we had to spend a lot of time figuring out how to allow Googlebot requests past Cloudflare's "fake bot" firewall rule.
|
| In our own Cloudflare account, we have all security settings at the lowest sensitivity possible (or turned off completely). We serve over 500 billion requests a month (10+ TB of bandwidth), and the amount of blocked traffic to seemingly legitimate clients was surprisingly high.
|
| I love Cloudflare (and own quite a bit of their stock) but I'm beginning to rethink my stance on their service. They make it extremely easy to enable powerful features with little visibility or control over the details of how those features work.
|
| Another SEO nightmare is their "Crawler Hints" service. I highly recommend no one uses this if you are ever the target of automated security scanners (e.g. ones used by bug bounty white hat hackers). With "crawler hints" enabled, a white hat hacker running a scan of your site hitting random URLs results in bingbot, yandex, and other search engines attempting to index every single one of the URLs hit by the security scanners used by hackers.
|
| Basically, it's a mess, and the only way to really fix it is to bypass cloudflare or spend a lot of time and money with Cloudflare debugging.
|
| Next quarter I'm faced with the decision of either doubling down on Cloudflare and getting an Enterprise plan with them ($20k+) or just ripping them out of our stack and going back to our old AWS Cloudfront set up which has fewer POPs, but was much less of a hassle.
| dom96 wrote:
| > By using Workers to do this, it resulted in Readme's Cloudflare account receiving requests from our domain with "googlebot" useragent, but from an IP that wasn't verified as a googlebot IP address (I assume the Worker was requesting the Readme site using the Googlebot user agent but with whatever IP address is used when using CF Workers).
|
| Was this definitely the cause? It's somewhat surprising to hear that requests would be rejected if the user agent doesn't match a set of hard coded IP addresses.
|
| Were you able to resolve this in the end? If not and the cause is what you suspect then perhaps changing the user agent in your worker might be a workaround.
| [deleted]
| kentonv wrote:
| It actually makes sense to me. I've pinged the bots team to see if we can improve here.
| traek wrote:
| > It's somewhat surprising to hear that requests would be rejected if the user agent doesn't match a set of hard coded IP addresses.
|
| It's fairly common for DDoS/scraping prevention; Googlebot (and most other crawlers) publish their IP ranges for that reason[0][1][2]. I don't work at Cloudflare though, so no insider knowledge of what you folks are doing.
|
| [0] https://developers.google.com/search/docs/crawling-indexing/...
|
| [1] https://developers.facebook.com/docs/sharing/webmasters/craw...
|
| [2] https://developer.twitter.com/en/docs/twitter-for-websites/c...
| kevincox wrote:
| I feel this as well. Cloudflare markets itself as a set-and-forget solution but really doesn't work that way. Furthermore in the limited visibility that they give to blocking they frame each blocked request as a success unconditionally.
| Of course they would, that is the service they are providing. However this is often not the case: for many websites most requests benefit very little from blocking, and bot protection only really needs to be provided for mutating endpoints and DoS attacks.
|
| For example the Cloudflare Blog's RSS feed is very often blocked from public-cloud IP ranges with specific clients. This is an endpoint that is intended to be public, is cachable and even intended to be accessed by bots! This is a common issue that should be very easy to solve technically but highlights how Cloudflare is not a set-and-forget solution. If they can't configure their own blog (a super simple case) correctly it is clear that using the tool correctly requires special care and monitoring of the limited visibility that they provide you.
| yjftsjthsd-h wrote:
| > Furthermore in the limited visibility that they give to blocking they frame each blocked request as a success unconditionally.
|
| Two things that have happened to me:
|
| * Cloudflare has decided that I'm a bot and stalled me, given me captchas, or just blocked me outright.
|
| * Cloudflare has shown me marketing claiming that 40% of traffic is bots.
|
| I'm not particularly impressed.
| sammy2255 wrote:
| What are you using Cloudflare for??
| [deleted]
| pbowyer wrote:
| > Next quarter I'm faced with the decision of either doubling down on Cloudflare and getting an Enterprise plan with them ($20k+) or just ripping them out of our stack and going back to our old AWS Cloudfront set up which has fewer POPs, but was much less of a hassle.
|
| Is Fastly a viable alternative for you?
| hutrdvnj wrote:
| I've used Googlebot as my fake browser's user agent for years. It's really interesting to explore the web when everyone thinks you're Google.
| blitzar wrote:
| Do websites not spit at you or do they just assume you 'will do no evil'?
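The "fake bot" check discussed in this subthread (a request whose user agent claims to be Googlebot but whose source IP does not belong to Google) can be sketched as a reverse-then-forward DNS lookup, which is the verification approach Google documents alongside its published IP ranges. What follows is a hedged Python illustration, not Cloudflare's actual rule: `classify_request`, the injectable `reverse`/`forward` resolvers, and the exact hostname suffixes in `GOOGLE_SUFFIXES` are assumptions made for this example.

```python
import socket
from typing import Callable, Iterable

# Assumed suffixes for genuine Googlebot reverse-DNS names; check Google's
# current documentation before relying on this list.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def _reverse(ip: str) -> str:
    """Reverse DNS: IP address -> primary hostname."""
    return socket.gethostbyaddr(ip)[0]


def _forward(host: str) -> Iterable[str]:
    """Forward DNS: hostname -> all IPv4/IPv6 addresses it resolves to."""
    return {info[4][0] for info in socket.getaddrinfo(host, None)}


def classify_request(
    user_agent: str,
    ip: str,
    reverse: Callable[[str], str] = _reverse,
    forward: Callable[[str], Iterable[str]] = _forward,
) -> str:
    """Return 'not_googlebot', 'verified', or 'spoofed' for one request."""
    if "googlebot" not in user_agent.lower():
        return "not_googlebot"
    try:
        host = reverse(ip)
        # The hostname must be under a Google domain AND resolve back to
        # the same IP; otherwise anyone controlling their own PTR record
        # could pass the first half of the check.
        if host.endswith(GOOGLE_SUFFIXES) and ip in forward(host):
            return "verified"
    except OSError:
        # Lookup failures (socket.herror, socket.gaierror) count as unverified.
        pass
    return "spoofed"
```

Under this scheme, a browser sending a Googlebot user agent from a home connection, or a Worker re-issuing a crawler's request from its own IP as in cj's case, would be classified as `spoofed`.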
| TrickyRick wrote:
| What are some of the most interesting differences you've seen?
| simondotau wrote:
| Like the OP, I've employed a custom configuration in Cloudflare which detects (and blocks) browsers which claim to be Googlebot but don't originate from Google's approved Googlebot IP ranges.
|
| The vast majority of such requests are dodgy scanning operations likely looking for email addresses or exploitable forms.
___________________________________________________________________
(page generated 2022-09-18 23:00 UTC)