[HN Gopher] 1.1.1.1 lookup failures on October 4th, 2023
       ___________________________________________________________________
        
       1.1.1.1 lookup failures on October 4th, 2023
        
       Author : todsacerdoti
       Score  : 64 points
       Date   : 2023-10-04 19:41 UTC (3 hours ago)
        
 (HTM) web link (blog.cloudflare.com)
 (TXT) w3m dump (blog.cloudflare.com)
        
       | homero wrote:
       | This got me. I spent an hour trying to figure out why my Internet
       | seemingly went down but not fully
        
       | throwaway67743 wrote:
       | [flagged]
        
         | jarym wrote:
         | Up until a few months ago the HN crowd loved Cloudflare. How
         | sentiment has changed in such a short period.
         | 
         | My guess would be their weird 'site protection' stuff is
         | burning too many people and negatively impacting their
         | reputation.
        
           | kkielhofner wrote:
           | > My guess would be their weird 'site protection' stuff is
           | burning too many people and negatively impacting their
           | reputation.
           | 
           | What's always been interesting to me about this take is it's
           | not as though Cloudflare is randomly inserting themselves in
           | internet traffic.
           | 
           | Cloudflare customers have choice in the marketplace and they
           | chose Cloudflare for whatever reasons. If end-users take
           | issue with accessing the site of a Cloudflare customer they
           | should take it up with the owners of the site that chose
           | Cloudflare. Theoretically the Cloudflare customer would take
           | it up with them if it becomes problematic. Cloudflare has no
           | obligation to the site end-users other than meeting the needs
           | of their customer, who does have an obligation to their end-users
           | (theoretically).
           | 
           | Cloudflare is, ostensibly, providing a solution for their
           | customers. How that impacts their customers' end-users is
           | between Cloudflare and the customer.
        
           | reaperman wrote:
           | In general, I find that comments along these lines are very easy
           | to thoroughly disprove. There has been
           | consistent criticism of Cloudflare for many years, ever since
           | the majority of web traffic started going through their anti-
           | DDOS and anti-bot gateways.
           | 
           | Here's an HN post with lots of very critical comments[0] from
           | 7 years ago, including a fairly scathing one from 'tptacek.
           | Even way back then, you'd get the same comments you hear
           | today like:
           | 
           | > So rather than demand fixes for the fundamental issues that
           | enable ddos attacks (preventing IP spoofing, allowing
           | infected computers to remain connected, etc), we just
           | continue down this path of massive centralization of services
           | into a few big players that can afford the arms race against
           | botnets. Using services like Cloudflare as a 'fix' is
           | wrecking the decentralized principles of the Internet. At
           | that point we might as well just write all apps as Facebook
           | widgets.
           | 
           | 0: https://news.ycombinator.com/item?id=13718947
        
           | throwaway67743 wrote:
           | I've never loved cloudflare - as someone doing this long
           | before they existed I see through their wordy blog posts
           | about rookie mistakes. It's embarrassing really.
        
         | Eduard wrote:
         | Maybe to compensate for Cloudflare's success blog posts, where
         | they usually present themselves as the saviors of the world.
        
           | throwaway67743 wrote:
           | Quite. Nobody else can do what they do! (Brb doing the same
           | thing before Prince was even born)
        
             | kkielhofner wrote:
             | This is peak HN comment.
             | 
             | 300 pops around the world delivering 210 Tbps of capacity,
             | mitigation of some of the largest DDoS attacks in history,
             | 20% of internet traffic. Workers, Pages, R2, D1, Zero
             | Trust, Stream, Images, Warp, 1.1.1.1, etc, etc, etc - all
             | at incredible scale.
             | 
             | But yes, of course you have been doing the exact same thing
             | since before Prince was born.
        
               | throwaway67743 wrote:
               | People had global networks of the same scale long before,
               | they just didn't offer the same features because they had
               | different products.
        
         | Zambyte wrote:
         | I would rather they be open about their failures than deceptive
         | about it. Of course simply not failing would be ideal, but we
         | don't live in a perfect world. If a single, external point of
         | failure causes your system to crumble, that's a design problem,
         | not a dependency problem.
        
           | reaperman wrote:
           | To your point, Cloudflare leadership are pretty active on HN.
           | They generally do a pretty good job of providing detailed
           | explanations to good-faith questions here and providing
           | decent post-mortems of major incidents to the HN community.
           | 
           | They do take care to avoid engaging with people who are
           | opposed to their dominance on ideological levels ("no one
           | should be the gatekeeper for that much of the internet", etc)
           | and there are a small handful of questions they seem to avoid
           | (e.g. direct feature-to-feature comparisons between Warp and
           | Mullvad)
        
           | throwaway67743 wrote:
           | They use transparency as a cover for rookie mistakes; it's not
           | the same as actual transparency. Especially as these are
           | really bad examples of doing it wrong.
        
         | aftbit wrote:
         | They're practicing "just culture" (as in justice), which
         | rewards explaining and root causing your failures, and rejects
         | the concept that "someone sucks" in favor of "systems can
         | always be improved".
        
       | LeoPanthera wrote:
       | Did 1.0.0.1 also go down? The article doesn't say.
        
         | homero wrote:
         | Of course it did; it's the same service.
        
           | toast0 wrote:
           | A highly reliable service might run one partition on a
           | completely separate serving stack. It's worth asking.
        
       | morugam wrote:
       | We noticed this through our own, homegrown scripts that check for
       | this, having been screwed by an outage a few years ago. I'm happy
       | they so quickly acknowledge and explain these issues. Good work!
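
A homegrown check of the kind morugam mentions might look like the sketch
below. It is only an illustration, not morugam's actual script: the function
names, the probe name example.com, and the timeout are all assumptions. It
sends one plain UDP DNS query to a resolver and returns the response code
(RCODE), so a SERVFAIL (2) or a timeout can be told apart from a healthy
answer (0).

```python
import socket
import struct


def build_query(name, qtype=1, query_id=0x1234):
    """Build a minimal DNS query packet (qtype 1 = A record, class IN)."""
    # Header: id, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = [label for label in name.split(".") if label]
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii") for label in labels
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def rcode_of(response):
    """Return the RCODE (low 4 bits of the flags), or None if truncated."""
    if len(response) < 12:
        return None
    flags = struct.unpack("!H", response[2:4])[0]
    return flags & 0x000F


def check_resolver(ip, name="example.com", timeout=2.0):
    """Query one resolver over UDP; return its RCODE, or None on failure."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name), (ip, 53))
            data, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and unreachable networks
            return None
    return rcode_of(data)
```

Calling check_resolver("1.1.1.1") and check_resolver("1.0.0.1") from a cron
job and paging on anything other than 0 would also answer the question of
whether both addresses fail together.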
        
       | suprjami wrote:
       | Strangely I noticed this because some parts of eBay stopped
       | loading. I spent a while troubleshooting my privacy/adblock
       | nonsense because _surely CloudFlare couldn't be down_ but that's
       | the only conclusion I could come to.
        
       | tedunangst wrote:
       | > Visit 1.1.1.1 from any device to get started with our free app
       | that makes your Internet faster and safer.
       | 
       | Ironic.
        
       | denysvitali wrote:
       | My only concern:
       | 
       | 7:57 UTC: first reports coming in
       | 
       | I noticed this issue quite quickly ("reported" at 7:54 UTC [1]),
       | and I noticed I wasn't alone thanks to Twitter / X. I tried to
       | get in touch with Cloudflare to report this issue - but I haven't
       | found any meaningful contact other than Twitter.
       | 
       | For such an important service, I'm surprised there is no contact
       | email / form where you can get in touch with the engineers
       | responsible for keeping the service up and running.
       | 
       | Other than that, kudos for the well written blog post - as
       | always!
       | 
       | [1]: https://nitter.net/DenysVitali/status/1709476961523835246
        
       | araes wrote:
       | I like how it's a 42 joke.
       | 
       | 4(0b10) 7:00 ends at 11:02 (4 hr 2 min) on a 4 sum 2x2. And refs
       | to 1.1.1.1 vs 1.0.0.1
        
       | robhlt wrote:
       | The lack of additional alerts in the Remediation section is a
       | little bit concerning. Adding an alert for serving stale root
       | zone data is great, but I think a few more would be very useful
       | too:
       | 
       | - There's a clear uptick in SERVFAIL responses at 7:00 UTC but
       | they don't start their response until an hour later after
       | receiving external reports. This uptick should have automatically
       | triggered an alert. It can't have been within the normal range
       | because they got customer reports about it.
       | 
       | - The resolver failed to load the root zone data on startup and
       | resorted to a fallback path. Even if this isn't an error for the
       | resolver it should still be an alert for the static_zone service,
       | because its only client is failing to consume its data.
       | 
       | - The static_zone service should also alert when some percentage
       | of instances fail to parse the root zone data, to get ahead of
       | potential problems before the existing data becomes stale.
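
The first alert robhlt suggests could be sketched roughly as follows. This is
a minimal illustration, not anything Cloudflare describes: the class name,
window size, threshold, and minimum sample count are all hypothetical. It
keeps a sliding window of recent response codes and fires when the share of
SERVFAIL responses in that window exceeds the threshold.

```python
from collections import deque

SERVFAIL = 2  # DNS RCODE for "server failure" (RFC 1035)


class ServfailRateAlert:
    """Sliding-window alert on the fraction of SERVFAIL responses."""

    def __init__(self, window=1000, threshold=0.05, min_samples=100):
        self.responses = deque(maxlen=window)  # most recent RCODEs only
        self.threshold = threshold             # hypothetical alert level
        self.min_samples = min_samples         # avoid firing on tiny samples

    def observe(self, rcode):
        """Record one response code; return True if the alert should fire."""
        self.responses.append(rcode)
        if len(self.responses) < self.min_samples:
            return False
        failures = sum(1 for r in self.responses if r == SERVFAIL)
        return failures / len(self.responses) > self.threshold
```

A real deployment would more likely compare against a learned baseline than a
fixed threshold, but even a fixed one would flag a sustained uptick well
before external reports arrive.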
        
       | ChrisArchitect wrote:
       | Earlier discussion while outage was active:
       | https://news.ycombinator.com/item?id=37763143
        
       ___________________________________________________________________
       (page generated 2023-10-04 23:00 UTC)