[HN Gopher] Cloudflare outage on June 21, 2022
       ___________________________________________________________________
        
       Cloudflare outage on June 21, 2022
        
       Author : jgrahamc
       Score  : 580 points
       Date   : 2022-06-21 12:39 UTC (10 hours ago)
        
 (HTM) web link (blog.cloudflare.com)
 (TXT) w3m dump (blog.cloudflare.com)
        
       | grenbys wrote:
        | Would be great if the timeline covered the 19 minutes from 06:32
        | to 06:51. How long did it take to get the right people on the
        | call? How long did it take to identify the deployment as a
        | suspect?
        | 
        | Another massive gap is the rollback: 06:58 to 07:42 - 44 minutes!
        | What exactly was going on, and why did it take so long? What were
        | those back-up procedures mentioned briefly? Why were engineers
        | stepping on each other's toes? What's the story with reverting
        | reverts?
       | 
        | Adding more automation and tests, and fixing that specific
        | ordering issue, is of course an improvement. But that adds more
        | complexity, and any automation will ultimately fail some day.
       | 
        | Technical details are all appreciated. But it is going to be
        | something else next time. Would be great to learn more about
        | human interactions. That's where the resilience of the socio-
        | technical system showed itself, and I bet there is some room for
        | improvement there.
        
         | systemvoltage wrote:
          | It would be fun to be a fly on the wall when shit hits the fan
          | in general. From nuclear meltdowns to 9/11 ATC recordings, it
         | is fascinating to see how emergencies play out and what kind of
         | things go on with boots-on-ground, all-hands-on-deck
         | situations.
         | 
         | Like, does Cloudflare have an emergency procedure for
         | escalation? What does that look like? How does the CTO get
         | woken up in the middle of the night? How to get in touch with
          | critical and most important engineers? Who noticed Cloudflare
          | was down first? How do quick decisions get made? Do people get
          | on a giant Zoom call? Or do emails go around? What
         | if they can't get hold of the most important people that can
         | flip switches? Do they have a control room like the movies? CTO
         | looking over the shoulder calling "Affirmative, apply the fix."
         | followed by a progress bar painfully moving towards completion.
        
         | nijave wrote:
         | Sounds like they had engineers connecting to the devices and
         | manually rolling back changes. Something like...
         | 
         | Slack: "@here need to connect to <long list of devices> to
         | rollback change asap"
        
       | edf13 wrote:
       | It's nearly always BGP when this level of failure occurs.
        
         | jgrahamc wrote:
         | I dunno man, you can really fuck things up with DNS also.
        
           | ngz00 wrote:
           | I was on a severely understaffed edge team fronting several
           | thousand engineers at a fortune 500 - every deploy felt like
           | a spacex launch from my cubicle. I have a lot of reverence
           | for the engineers who take on that kind of responsibility.
        
         | star-glider wrote:
         | Generally speaking:
         | 
          | You broke half the internet: BGP
          | You broke half of your company's ability to access the
          | internet: DNS
        
       | sidcool wrote:
       | This is a very nice write up.
        
       | testplzignore wrote:
       | Are there any steps that can be taken to test these types of
       | changes in a non-production environment?
        
         | vnkr wrote:
          | It's very difficult, if not impossible, to create a staging
          | environment that would replicate production well enough at
          | this scale. What the blog post suggests as a remediation in
          | the process:
         | "There are several opportunities in our automation suite that
         | would mitigate some or all of the impact seen from this event.
         | Primarily, we will be concentrating on automation improvements
         | that enforce an improved stagger policy for rollouts of network
         | configuration and provide an automated "commit-confirm"
         | rollback. The former enhancement would have significantly
         | lessened the overall impact, and the latter would have greatly
         | reduced the Time-to-Resolve during the incident."
        
       | malikNF wrote:
       | off-topic-ish, this post on /r/ProgrammerHumor gave me a chuckle
       | 
       | https://www.reddit.com/r/ProgrammerHumor/comments/vh9peo/jus...
        
         | jgrahamc wrote:
         | That made me smile.
        
       | Cloudef wrote:
        
       | thejosh wrote:
       | 07:42: The last of the reverts has been completed. This was
       | delayed as network engineers walked over each other's changes,
       | reverting the previous reverts, causing the problem to re-appear
       | sporadically.
       | 
       | Ouch
        
         | jgrahamc wrote:
         | Well, the "we can't reach these data centers at all and need to
         | go through the break glass procedure" was pretty "ouch" also.
        
         | gouggoug wrote:
         | I'd be super interested in understanding what this means
         | concretely. For example, are we talking about reverting
         | commits? If so, why were engineers reverting reverts?
        
           | yuliyp wrote:
           | Developer 1 fetches code, changes flag A. Rebuilds config.
           | Developer 2 fetches code, changes flag B. Rebuilds config.
           | Developer 1 deploys built config. Developer 2 deploys built
           | config, inadvertently reverts developer 1's changes.
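            | 
            | A toy sketch of that race (Python, hypothetical flag names;
            | the point is that each deploy bakes *all* flags into one
            | artifact, so deploying from a stale checkout clobbers newer
            | changes):
            | 
            |   repo = {"A": False, "B": False}       # flags at fetch time
            | 
            |   dev1 = dict(repo); dev1["A"] = True   # dev 1 flips flag A
            |   dev2 = dict(repo); dev2["B"] = True   # dev 2 (stale) flips B
            | 
            |   live = dict(dev1)                     # dev 1 deploys: A=True
            |   live = dict(dev2)                     # dev 2 deploys: A is
            |                                         # silently False again
            | 
            |   assert live == {"A": False, "B": True}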
        
             | wstuartcl wrote:
              | This can also happen when your deploy process has two
              | flows for reverts: a forward revert (where new bits are
              | committed at head, fixing the items that needed to be
              | reverted) and a "previous head" revert, which just goes
              | back one revision in the RCS (or tagged version).
              | 
              | Imagine the first eng team did a forward revert that
              | corrected the issue and deployed new head bits, and
              | shortly after, another engineer fires off the second
              | process type and tells the system to pull back to the
              | last revision (which is now the bad revision, as it was
              | just replaced with fresher deploy bits).
              | 
              | Having two revert processes in the toolkit, and maybe a
              | few dispersed teams working to revert the issue without
              | tight communication, leads to this issue.
              | 
              | I think this is more likely the basic issue vs. a bad
              | merge (I assume the root cause was broadcast far and wide
              | to anyone making a merge).
        
         | dangrossman wrote:
         | I think I experienced first-hand the moment those network
         | engineers were reverting their own reverts, breaking the web
         | again. For example, DoorDash.com had come back online, then
         | went back to serving only HTTP 500 errors from Cloudflare, then
         | came back online again. I raised it in the HN discussion and
         | @jgrahamc responded minutes later.
         | 
         | https://news.ycombinator.com/item?id=31821290
        
         | michaelmior wrote:
         | This was something I was surprised not to see directly
         | addressed in terms of follow up steps. When discussing process
         | changes, they mention additional testing, but nothing to
         | address what seems to be a significant communication gap.
        
           | scottlamb wrote:
           | I'm sure they have a more detailed internal postmortem, and I
           | imagine it'd go into that. This is a nice high-level
           | overview. They probably don't want to bury that under details
           | of their communication processes, much less go into exactly
           | who did what when for wide consumption by an audience that
           | may not be on board with blameless postmortem culture.
        
       | dpz wrote:
        | Really appreciate the speed, detail, and transparency of this
        | post-mortem. Really one of the best, if not the best, in the
        | industry.
        
       | ElectronShak wrote:
        | What's it like to be an engineer designing and working on these
        | systems? Must be sooo fulfilling! #Goals; Y'all are my heroes!!
        
         | jgrahamc wrote:
         | https://www.cloudflare.com/careers/
        
           | ElectronShak wrote:
           | Thanks, unfortunately I live in Africa, no roles yet for my
           | location. I'll wait as I use the products :)
        
           | Icathian wrote:
           | I'm currently waiting on a recruiter to get my panel
           | interviews scheduled. You guys are in "dream gig" territory
           | for me. Any tips? ;-)
        
       | CodeWriter23 wrote:
       | Gotta hand it to them, a shining example of transparency and
       | taking responsibility for mistakes.
        
       | minecraftchest1 wrote:
        | Something else that I think would be smart to implement is
        | reorder detection. Have the change approval specifically point
        | out stuff that gets reordered, and require manual approval for
        | each section that gets moved around.
        | 
        | I also think that having a script that walks through the file and
        | points out any obvious mistakes would be good to have as well.
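        | 
        | A minimal sketch of such reorder detection (Python; illustrative
        | only, and it assumes no duplicate lines): flag any pair of lines
        | that appear in both revisions but in swapped order.
        | 
        |   def reordered_pairs(old, new):
        |       # lines common to both revisions, in their *old* order
        |       common = [l for l in old if l in set(new)]
        |       pos = {l: i for i, l in enumerate(new)}
        |       return [(a, b) for i, a in enumerate(common)
        |               for b in common[i + 1:]
        |               if pos[a] > pos[b]]  # a used to precede b
        | 
        |   old = ["term allow-a", "term allow-b", "term reject-all"]
        |   new = ["term reject-all", "term allow-a", "term allow-b"]
        |   print(reordered_pairs(old, new))
        |   # -> pairs involving 'term reject-all', which moved up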
        
         | dane-pgp wrote:
         | Yeah, there's got to be some sweet spot between "formally
         | verify all the things" and "i guess this diff looks okay,
         | yolo!".
         | 
         | I'd say that if you're designing a system which has the
         | potential to disconnect half your customers based on a
         | misconfiguration, then you should spend at least an hour
         | thinking about what sorts of misconfigurations are possible,
         | and how you could prevent or mitigate them.
         | 
         | The cost-benefit analysis of "how likely is it such a mistake
         | would get to production (and what would that cost us)?" vs "how
         | much effort would it take to write and maintain a verifier that
         | prevents this mistake?" should then be fairly easy to estimate
         | with sufficient accuracy.
        
       | mproud wrote:
       | If I use Cloudflare, what can I do -- if anything -- to avoid
       | disruption when they go down?
        
         | meibo wrote:
         | On the enterprise plans, you are able to set up your own DNS
         | server that can route users away from Cloudflare, either to
         | your origin or to another CDN/proxy.
        
       | junon wrote:
       | Now _this_ is a post mortem.
        
       | weird-eye-issue wrote:
       | One of our sites uses Cloudflare and serves 400k pageviews per
       | month and generates around $650/day in ad and affiliate revenue.
       | If the site is not up the business is not making any money.
       | 
       | Looking at the hourly chart in Google Analytics (compared to the
       | previous day) there isn't even a blip during this outage.
       | 
       | So for all the advantages we get from Cloudflare (caching, WAF,
       | security [our WP admin is secured with Cloudflare Teams],
       | redirects, page rules, etc) I'll take these minor outages that
       | make HN go apeshit.
       | 
        | Of course it helped that most of our traffic is from the US and
        | this happened when it did, but in the past week alone we served
        | users in over 180 countries, which Cloudflare helps make sure is
        | nice and fast :D
        
         | algo_trader wrote:
         | Could you possibly, kindly, mention which tools you use to
         | track/buy/calculate conversions/revenue?
         | 
         | Many thanks
         | 
         | (Or DM the puppet email in my profile)
        
           | harrydehal wrote:
           | Not OP, but my team really, really enjoys using a combination
           | of Segment.io for event tracking and piping that data into
           | Amplitude for data viz, funnel metrics, A/B tests,
           | conversion, etc.
        
         | herpderperator wrote:
         | I didn't quite understand this. It sounds like Cloudflare's
          | outage didn't affect you despite being their customer. Why did
         | their large outage not affect you?
        
           | jgrahamc wrote:
           | It wasn't a global outage.
        
             | prophesi wrote:
             | I thought it was global? 19 data centers were taken offline
             | which "handle a significant proportion of [Cloudflare's]
             | global traffic".
        
               | madeofpalk wrote:
               | I did not notice Cloudflare going down. Only reason I
               | knew was because of this thread. Either it was because I
               | was asleep, or my local PoP wasn't affected.
        
               | jgrahamc wrote:
               | I am in Lisbon and was not having trouble because
               | Cloudflare's Lisbon data center was not affected. But
                | over in Madrid there was trouble. It depended on where
                | you were.
        
               | mkl wrote:
               | From the article: "Depending on your location in the
               | world you may have been unable to access websites and
               | services that rely on Cloudflare. In other locations,
               | Cloudflare continued to operate normally."
        
               | ihaveajob wrote:
               | But if your clients are mostly asleep while this is
               | happening, they might not notice.
        
           | uhhhhuhyes wrote:
           | Because of the time at which the outage occurred, most of
           | this person's customers were not trying to access the site.
        
         | markdown wrote:
         | Would you mind sharing which site that is?
        
       | keyle wrote:
        | How did no one at Cloudflare think that this MCP thing should be
       | part of the staging rollout? I imagine that was part of a //
       | TODO.
       | 
       | It sounds like it's a key architectural part of the system that
       | "[...] convert all of our busiest locations to a more flexible
       | and resilient architecture."
       | 
        | 25 years of experience, and it's always the things that are
        | supposed to make us "more flexible" and "more resilient" or
        | robust/stable/safer <keyword> that end up royally f'ing us where
        | the light don't shine.
        
       | kache_ wrote:
       | shit dawg i just woke up
        
       | rocky_raccoon wrote:
        | Time and time again, this type of response proves that it's the
        | right way to handle a bad situation. Be humble, apologize, own
        | your mistake, and give a transparent snapshot into what went
        | wrong and how you're going to learn from the mistake.
       | 
       | Or you could go the opposite direction and risk turning something
       | like this into a PR death spiral.
        
         | can16358p wrote:
         | Exactly. I trust businesses/people that are transparent about
         | their mistakes/failures much more than the ones that avoid them
         | (except Apple which never accepts their mistakes, but I still
         | trust their products, I think I'm affected by RDF).
         | 
         | At the end of the day, everybody makes mistakes and that's
          | okay. Everybody else also knows that everybody makes mistakes.
         | So why not accept it?
         | 
         | I really don't get what's wrong with accepting mistakes,
         | learning from them, and moving on.
        
           | coob wrote:
           | The exception that proves the rule with Apple:
           | 
           | https://appleinsider.com/articles/12/09/28/apple-ceo-tim-
           | coo...
        
             | can16358p wrote:
             | Yeah. Forgot that one. When it first came out it was
             | terrible.
             | 
             | Apparently so terrible that Apple apologized, perhaps for
             | the first (and last) time for something.
        
               | kylehotchkiss wrote:
               | They didn't apologize about the direction the pro macs
               | were going a few years back but they certainly listened
               | and made amends for it with the recent Pro line and
               | MacBook Pro enhancements
        
             | dylan604 wrote:
             | "Is it Apple Maps bad?" --Gavin Belson, Silicon Valley
             | 
             | This one line will forever cement exactly how bad Apple
             | Maps' release was. Thanks Mike Judge!
        
               | stnmtn wrote:
               | I agree, but lately (as in the past month) I've been
               | finding myself using apple maps more and more than
               | google. When on a complicated highway interchange, the 3d
               | view that Apple Maps gives for which exit to take is a
               | life-saver
        
               | can16358p wrote:
               | Yup. Just remember the episode. IIRC in that context
               | Apple Maps was placed even worse than Windows Vista.
        
               | dylan604 wrote:
               | I would agree with that. Apple Maps was worse than the
                | hockey puck mouse or the trashcan Mac Pro. Trying to
                | decide if it is worse than the butterfly keyboard, but I
                | think the keyboard wins for the sheer fact that it
                | impacted me in a way that was uncorrectable, whereas with
                | Maps I could just use a different app.
        
           | rocky_raccoon wrote:
           | > I really don't get what's wrong with accepting mistakes,
           | learning from them, and moving on.
           | 
           | Some people really struggle with this (myself included) but I
           | think it's one of the easiest "power ups" you can use in
           | business and in life. The key is that you have to actually
           | follow through on the "learning from them" clause.
        
             | dylan604 wrote:
             | Sure, this can be a good thing when it's a rare occurrence.
             | If it is a weekly event, then you just start to look
             | incompetent
        
       | gcau wrote:
        | Am I the only one who really doesn't think this is a big deal?
        | They had an outage, they fixed it very quickly. Life goes on.
        | Talking about the outage as if it's a reason for us all to ditch
        | CF and buy/run our own hardware (which will be totally better) is
        | so hyperbolic.
        
         | Deukhoofd wrote:
         | It was a bit of a thing as people in Europe started their
         | office work, and found out a lot of their internet services
         | were down, and they were unable to access the things they
         | needed. It's rather dangerous that we all depend on this one
         | service being online.
        
         | nielsole wrote:
         | > Talking about the outage as if it's reason for us to all
         | ditch CF
         | 
          | At the time of writing, no comment has done that except yours.
        
           | gcau wrote:
           | I'm referring to other posts and discussions outside this
           | website. I don't expect as much criticism in this post.
        
         | simiones wrote:
         | It is kind of a big deal to discover just how much of the
         | Internet and the WWW is now dependent on CloudFlare.
         | 
         | For their part, they handled this very well, and are to be
         | commended (quick fix, quick explanation of failure).
         | 
         | But you also can't help but see that they have a dangerous
         | amount of control over such important systems.
        
       | leetrout wrote:
       | BGP changes should be like the display resolution changes on your
       | PC...
       | 
       | It should revert as a failsafe if not confirmed within X minutes.
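        | 
        | A minimal sketch of that failsafe (Python; apply_fn/revert_fn
        | are hypothetical stand-ins for whatever pushes the config):
        | 
        |   import threading
        | 
        |   def commit_confirm(apply_fn, revert_fn, timeout_s=300):
        |       timer = threading.Timer(timeout_s, revert_fn)  # the fuse
        |       apply_fn()
        |       timer.start()
        |       return timer.cancel  # call to confirm, keeping the change
        | 
        |   # confirm = commit_confirm(push_new_policy, push_old_policy)
        |   # ... verify reachability over a separate connection ...
        |   # confirm()  # only now does the change stick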
        
         | addingnumbers wrote:
         | That's the "commit-confirm" process they mention they will use
         | in the write-up:
         | 
         | > Primarily, we will be concentrating on automation
         | improvements ... and provide an automated "commit-confirm"
         | rollback.
        
           | Melatonic wrote:
           | Surprised everyone has not switched to this already - great
           | idea
        
             | neuronexmachina wrote:
              | I assume there are some non-trivial caveats when using
              | this with a widely-distributed system.
        
         | vnkr wrote:
          | That's what is suggested in the blog post as one of the
          | future prevention plans.
        
         | antod wrote:
         | There was a common pattern in use back in the day when I
          | managed OpenBSD firewalls (can't remember if it was ipf or pf
         | days). When changing firewall rules over ssh, you'd use a
         | command line like:
         | 
         | $ apply new rules; sleep 10; apply original rules
         | 
         | If your ssh access was still working and various sites were
         | still up during that 10sec you were probably good to go - or at
         | least you hadn't shut yourself out.
        
         | DRW_ wrote:
          | Back when I was briefly a network engineer at the start of my
          | career, on Cisco equipment we'd do 'reload in 5' before big
          | changes - so it'd auto-restart after 5 minutes unless
          | cancelled.
         | 
         | I'm sure there were and are better ways of doing it, but it was
         | simple enough and worked for us.
        
           | kazen44 wrote:
            | Most ISP-tier routers have an entire commit engine to load
            | and apply configs.
            | 
            | Juniper, for instance, allows one to issue the command
            | "commit confirmed", which applies the configuration and
            | reverts back to the previous version if one does not
            | acknowledge the command within a predefined time. This
            | prevents permanent lockout from a system.
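            | 
            | Roughly what a Junos session looks like (from memory, so
            | treat as illustrative; the second commit, issued within the
            | window, makes the change permanent):
            | 
            |   [edit]
            |   user@router# commit confirmed 10
            |   commit confirmed will be automatically rolled back in 10
            |   minutes unless confirmed
            |   commit complete
            | 
            |   [edit]
            |   user@router# commit
            |   commit complete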
        
       | ransom1538 wrote:
       | Still seeing failed network calls.
       | 
       | https://i.imgur.com/xHqvOzj.png
        
         | jgrahamc wrote:
         | Feel free to email me (jgc) details but based on that error I
         | don't think that's us.
        
           | ransom1538 wrote:
            | One more? I'll email too. https://i.imgur.com/Cxwv58g.png
        
         | zinekeller wrote:
         | Yeah, that's not Cloudflare at all (it's unlikely that CF still
         | uses nginx/1.14).
        
         | Jamie9912 wrote:
          | Is that actually coming from Cloudflare? IIRC Cloudflare
          | reports itself as Cloudflare, not nginx, in the 5xx error
          | pages.
        
           | lenn0x wrote:
            | Correct, I saw that too. The outage returned 500/nginx, with
            | no version number in the footer either. @jgrahamc thought
            | that was strange too, as a few commenters last night were
            | caught off guard trying to determine if it was their systems
            | or Cloudflare. Supposedly it's been forwarded along.
        
             | kc10 wrote:
              | Yes, there is definitely an nginx service in the path. We
              | don't have any nginx in our infrastructure, but this was
              | the response we had for our URLs during the outage.
              | 
              |   <html>
              |   <head><title>500 Internal Server Error</title></head>
              |   <body bgcolor="white">
              |   <center><h1>500 Internal Server Error</h1></center>
              |   <hr><center>nginx</center>
              |   </body>
              |   </html>
        
           | samwillis wrote:
            | The outage this morning manifested itself as an Nginx error
            | page, somewhat unusually for CF.
        
       | Belphemur wrote:
       | It's interesting that in 2022 we still have network issues caused
       | by wrong order of rules.
       | 
        | Everybody at one time or another experiences the dreaded REJECT
        | not being at the end of the rule stack but coming just too
        | early.
       | 
       | Kudos to CF for such a good explanation of what caused the issue.
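        | 
        | A toy first-match evaluator (Python, illustrative only) makes
        | the failure mode concrete: move the catch-all REJECT too early
        | and the allow rules become unreachable.
        | 
        |   def evaluate(rules, prefix):
        |       for action, match in rules:  # first match wins
        |           if match(prefix):
        |               return action
        |       return "REJECT"
        | 
        |   allow = ("ACCEPT", lambda p: p.startswith("192.0.2."))
        |   deny_all = ("REJECT", lambda p: True)
        | 
        |   print(evaluate([allow, deny_all], "192.0.2.1"))  # ACCEPT
        |   print(evaluate([deny_all, allow], "192.0.2.1"))  # REJECT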
        
         | ec109685 wrote:
         | I wonder what tool the engineers used to view that diff. With a
         | side by side one, it's a bit more obvious when lines are
         | reordered.
         | 
         | Even better if the tool was syntax aware so it could highlight
         | the different types of rules in unique colors.
        
       | xiwenc wrote:
        | I'm surprised they did not conclude that roll-outs should be
        | executed over a longer period with smaller batches. When a
        | system is as complicated as theirs, with so much impact, the
        | only sane strategy is slow rolling updates so that you can hit
        | the brakes when needed.
        
         | jgrahamc wrote:
         | That's literally one of the conclusions.
        
       | ttul wrote:
       | Every outage represents an opportunity to demonstrate resilience
       | and ingenuity. Outages are guaranteed to happen. Might as well
       | make the most of it to reveal something cool about their
       | infrastructure.
        
       | asadlionpk wrote:
        | Been a fan of CF since they were essential for DDoS protection
        | for various WordPress sites I deployed back then.
       | 
       | I buy more NET every time I see posts like this.
        
         | malfist wrote:
         | Hackernews isn't wallstreetbets.
        
       | nerdbaggy wrote:
       | Really interesting that 19 cities handle 50% of the requests.
        
         | JCharante wrote:
          | Well, half of those cities were in Asia, where it was business
          | hours, so given that the majority of humans live in Asia it
          | makes sense. CF data centers in Asia also seem to be less
          | distributed than in the West (e.g. Vietnam traffic seems to go
          | to Singapore), meanwhile CF has multiple centers distributed
          | throughout the US.
        
         | jgrahamc wrote:
         | Actually, I think the flip side is even more interesting. If
         | you want to give good, low latency service to 50% of the world
         | you need a lot of data centers.
        
           | worldofmatthew wrote:
           | If you have an efficient website, you can get decent
            | performance to most of the world with one PoP on the West
            | Coast of the USA.
        
       | rubatuga wrote:
       | Uh, shouldn't there be a staging environment for these sort of
       | changes?
        
         | alanning wrote:
         | Yes, that was one of the issues they mentioned in the post. Not
         | that they didn't have a staging/testing environment but that it
         | didn't include the specific type of new architecture
         | configuration, "MCP", that ultimately failed.
         | 
         | One of their future changes is to include MCPs in their testing
         | environments.
        
           | nijave wrote:
           | Ahh the old "dev doesn't quite match prod" issue
        
       | ggalihpp wrote:
        | The DNS resolver was also impacted and seems to still have
        | issues. We changed to Google DNS and that solved it.
        | 
        | The problem is, we couldn't tell all our clients that they
        | should change this :(
        
       | jiggawatts wrote:
       | The default way that most networking devices are managed is crazy
       | in this day and age.
       | 
       | Like the post-mortem says, they will put mitigations in place,
       | but this is something every network admin has to implement
       | bespoke after learning the hard way that the default management
       | approach is dangerous.
       | 
       | I've personally watched admins make routing changes where any
       | error would cut them off from the device they are managing and
       | prevent them from rolling it back -- pretty much what happened
       | here.
       | 
       | What should be the _default_ on every networking device is a two-
       | stage commit where the second stage requires a new TCP
       | connection.
       | 
       | Many devices still rely on "not saving" the configuration, with a
       | power cycle as the rollback to the previous saved state. This is
       | a great way to turn a small outage into a big one.
       | 
       | This style of device management may have been okay for small
       | office routers where you can just walk into the "server closet"
       | to flip the switch. It was okay in the era when device firmware
       | was measured in kilobytes and boot times in single digit seconds.
       | 
       | Globally distributed backbone routers are an entirely different
       | scenario but the manufacturers use the same outdated management
       | concepts!
       | 
       | (I have seen some _small_ improvements in this space, such as
       | devices now keeping a history of config files by default instead
       | of a single current-state file only.)
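        | 
        | A sketch of that two-stage idea (Python; names are hypothetical
        | stand-ins for the device's config hooks). The confirmation must
        | arrive on a brand-new TCP connection, which proves the operator
        | can still reach the device after the change:
        | 
        |   import socket
        | 
        |   def wait_confirm(port, keep_fn, revert_fn, timeout_s=300):
        |       srv = socket.create_server(("", port))
        |       srv.settimeout(timeout_s)
        |       try:
        |           conn, _ = srv.accept()  # fresh connection = proof of
        |           conn.close()            # post-change reachability
        |           keep_fn()               # keep the new config
        |       except socket.timeout:
        |           revert_fn()             # unreachable: auto-revert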
        
         | inferiorhuman wrote:
         | The power cycle as a rollback is IMO reasonable. If you're
         | talking about equipment in a data center you should presumably
         | have some sort of remote power management on a separate
         | network.
         | 
         | Alternatively some sort of watchdog timer would be a great
         | addition (e.g. rollback within X minutes if the changes are not
         | confirmed).
        
       | throwaway_uke wrote:
       | i'm gonna go with the less popular view here that overly detailed
       | post mortems do little in the grand scheme of things other than
       | satisfy tech p0rn for a tiny, highly technical audience. does
       | wonders for hiring indeed.
       | 
        | sure, transparency is better than "something went wrong, we take
        | this very seriously, sorry." (although the non-technical crowd
        | couldn't care less)
        | 
        | only people who don't do anything make no mistakes, but making
        | such highly impactful changes so quickly (inside one day!) where
        | 50% of traffic happens seems a huge red flag to me, no matter the
        | procedures and safety valves.
        
       | kylegalbraith wrote:
       | As others have said, this is a clear and concise write up of the
       | incident. That is underlined even more when you take into account
       | how quickly they published this. I have seen some companies take
       | weeks or even months to publish an analysis that is half as good
       | as this.
       | 
        | Not to take away from the outage itself - the outage was bad.
        | But the relative quickness of the recovery is pretty impressive,
        | in my opinion. Sounds like they could have recovered even
        | quicker if not for a bit of toe-stepping that happened.
        
         | april_22 wrote:
          | I think it's even better that they explained the background of
          | the outage in a really easy-to-understand way, so that not only
          | experts can get the hang of what was happening.
        
       | sharps_xp wrote:
       | who will make the abstraction as a service we all need to protect
       | us from config changes
        
         | pigtailgirl wrote:
            | -- how much are you willing to pay for said system? --
        
           | sharps_xp wrote:
            | depends on how guaranteed your solution is?
        
             | richardwhiuk wrote:
             | 100%. You can never roll out any changes.
        
       | ruined wrote:
       | happy solstice everyone
        
       | psim1 wrote:
       | CF is the only company I have ever seen that can have an outage
       | and get pages of praise for it. I don't have any (current) use
       | for CloudFlare's products but I would love to see the culture
       | that makes them praiseworthy spread to other companies.
        
         | homero wrote:
         | I'm also a huge fan
        
         | capableweb wrote:
         | I think a lot of companies don't realize the whole
         | "Acknowledging our problems in public" thing CF got going for
         | it is a positive. Lots of companies don't want to publish
         | public post-mortems as they think it'll make them look weak
         | rather than showing that they care about transparency in the
         | face of failures/downtimes.
        
         | [deleted]
        
         | systemvoltage wrote:
         | Nerds in the executive office (CEO & CTO, etc). People just
         | like us.
        
       | badrabbit wrote:
       | Having been on the other side of similar outages, I am very
        | impressed by their response timeline.
        
       | lilyball wrote:
       | They said they ran a dry-run. What did that do, just generate
       | these diffs? I would have expected them to have some way of
       | _simulating_ the network for BGP changes in order to verify that
        | they didn't just fuck up their traffic.
        
       | kurtextrem wrote:
       | Yet another BGP caused outage. At some point we should collect
       | all of them:
       | 
       | - Cloudflare 2022 (this one)
       | 
       | - Facebook 2021: https://news.ycombinator.com/item?id=28752131 -
       | this one probably had the single biggest impact, since engineers
       | got locked out of their systems, which made the fixing part look
       | like a sci-fi movie
       | 
       | - (Indirectly caused by BGP: Cloudflare 2020:
       | https://blog.cloudflare.com/cloudflare-outage-on-july-17-202...)
       | 
       | - Google Cloud 2020:
       | https://www.theregister.com/2020/12/16/google_europe_outage/
       | 
       | - IBM Cloud 2020:
       | https://www.bleepingcomputer.com/news/technology/ibm-cloud-g...
       | 
       | - Cloudflare 2019: https://news.ycombinator.com/item?id=20262214
       | 
       | - Amazon 2018:
       | https://www.techtarget.com/searchsecurity/news/252439945/BGP...
       | 
       | - AWS: https://www.thousandeyes.com/blog/route-leak-causes-
       | amazon-a... (2015)
       | 
       | - Youtube: https://www.infoworld.com/article/2648947/youtube-
       | outage-und... (2008)
       | 
       | And then there are incidents caused by hijacking:
       | https://en.wikipedia.org/wiki/BGP_hijacking#:~:text=end%20us...
        
         | eddieroger wrote:
         | These are the public facing BGP announcements that cause
         | problems, but doesn't account for the ones on private LANs that
         | also happen. Previous employers of mine have had significant
         | internal network issues because internal BGP between sites
         | started causing problems. I'm not sure there's anything better
         | (I am not a network guy), but this list can't be exhaustive.
        
         | simiones wrote:
         | Google in Japan 2017:
         | https://www.internetsociety.org/blog/2017/08/google-leaked-p...
        
         | jve wrote:
         | Came here to say exactly this... things that mess with BGP have
         | the power to wipe you off the internet.
         | 
         | Some more:
         | 
         | - Google 2016, configuration management bug/BGP:
         | https://status.cloud.google.com/incident/compute/16007
         | 
         | - Valve 2015: https://www.thousandeyes.com/blog/steam-outage-
         | monitor-data-...
         | 
         | - Cloudflare 2013: https://blog.cloudflare.com/todays-outage-
         | post-mortem-82515/
        
         | ssms27 wrote:
          | The internet runs on BGP, so I would think that most internet
          | issues would be a result of BGP, then.
        
           | perlgeek wrote:
           | There are lots of other causes of incidents, like cut cables,
           | failed router hardware, data centers losing power etc.
           | 
           | It just seems that most of these are local enough and the
           | Internet resilient enough that they don't cause global
           | issues. Maybe the exception would be AWS us-east-1 outages
           | :-)
        
             | tedunangst wrote:
              | BGP is the reason you _don't_ hear about cable cuts taking
             | down the internet.
        
             | addingnumbers wrote:
              | Maybe it's a testament to BGP's effectiveness that so many
              | large-scale outages are due to misconfiguring BGP rather
              | than to the frequent cable cuts and hardware failures that
              | BGP routes around.
        
         | mlyle wrote:
         | > since engineers got locked out of their systems
         | 
         | Sounds like the same happened here:
         | 
         | "Due to this withdrawal, Cloudflare engineers experienced added
         | difficulty in reaching the affected locations to revert the
         | problematic change. We have backup procedures for handling such
         | an event and used them to take control of the affected
         | locations."
         | 
         | But Cloudflare had sufficient backup connectivity to fix it.
          | I'm curious how Cloudflare does that today -- the solution long
         | ago was always a modem on an auxiliary port.
        
           | jve wrote:
           | > the solution long ago was always a modem on an auxiliary
           | port
           | 
           | Now you can use mobile Internet (4G/5G)
        
             | ccakes wrote:
             | Cell coverage inside datacenters isn't always suitable,
             | occasionally even by-design.
        
           | cwt137 wrote:
           | They have their machines also connected to another AS, so
           | when their network doesn't/can't route, they can still get to
           | their machines to fix stuff.
        
           | Melatonic wrote:
            | Worst case, if I was designing this I would probably have a
            | satellite connection running over Iridium at each of their
            | biggest DCs.
            | 
            | Also, let's face it - the utility of a trusted security
            | guard/staff with an old-fashioned physical key is pretty
            | hard to screw up!
        
           | nijave wrote:
           | Not sure how common it is, but you can get serial OOBM
           | devices accessible over cellular which would then give you
           | access to your equipment.
           | 
           | I'm surprised more places don't implement a "click here to
           | confirm changes or it'll be rolled back in 5 minutes" like
           | all those monitor settings dialogues
        
         | merlyn wrote:
          | That's like blaming the hammer for breaking.
          | 
          | BGP is just a tool; it would be something else serving the
          | same purpose.
        
           | forrestthewoods wrote:
           | Some tools are more fragile and error prone than others.
        
             | techsupporter wrote:
             | Except that this wasn't an example of BGP being prone to
             | error or fragile. This was, as the blog post specifically
             | calls out, human error. They put two BGP announcement rules
             | after the "deny everything not previously allowed" rule.
             | It's the same as if someone did this to a set of ACLs on a
             | firewall.
             | 
             | The main difference between BGP and all other tools is that
             | if you mess up BGP, you've done a very visible thing
             | because BGP underpins how we get to each other's networks.
             | But it's not a sign of BGP being fragile, just very
             | important.
        
         | witcher_rat wrote:
         | You say that like it hasn't been going on since the mid 1990's,
         | when it got deployed.
         | 
         | I'm not blaming BGP, since it prevents far more outages than it
         | causes, but BGP-based outages have been a thing since its
         | beginning. And any other protocol would have outages too - BGP
         | just happens to be the protocol being used.
        
       | Tsiklon wrote:
       | This is a great concise explanation. Thank you for providing it
       | so quickly
       | 
       | If you forgive my prying, was this an implementation issue with
       | the maintenance plan (operator or tooling error), a fundamental
       | issue with the soundness of the plan as it stood, or an
       | unexpected outcome from how the validated and prepared changes
       | interacted with the system?
       | 
       | I imagine that an outage of this scope wasn't foreseen in the
       | development of the maintenance & rollback plan of the work.
        
       | xtat wrote:
       | Feels a little disingenuous to use the first 3/4 of the report to
       | advertise.
        
       | DustinBrett wrote:
       | I wish computers could stop us from making these kinds of
       | mistakes without turning into Skynet.
        
       | terom wrote:
       | TODO: use commit-confirm for automated rollbacks
       | 
       | Sounds like a good idea!
        
         | buggeryorkshire wrote:
         | Is that the equivalent of Cisco 'save running config' with a
         | timer? It's been many years so can't remember the exact
         | incantations...
        
       | trollied wrote:
       | Sounds like Cloudflare need a small low-traffic MCP that they can
       | deploy to first.
        
       | samwillis wrote:
       | In a world where it can take weeks for other companies to publish
        | a postmortem after an outage (if they ever do), it never ceases
        | to amaze me how quickly CF manage to get something like this
        | out.
       | 
       | I think it's a testament to their Ops/Incident response teams and
       | internal processes, it builds confidence in their ability to
       | respond quickly when something does go wrong. Incredible work!
        
         | thejosh wrote:
         | Yep, look at heroku and their big incident, and the amount of
         | downtime they've had lately.
        
         | mattferderer wrote:
         | To further add to your point, the CTO is the one who shared it
         | here & the CEO is incredibly active on forums & social media
         | everywhere with customers. Communication has always been one of
         | their strengths.
        
           | bluehatbrit wrote:
            | I do wonder what would happen if either of them left the
            | company. I feel like there's a lot of trust on HN (and other
            | places) that's heavily attached to them as individuals and
            | their track record of good communication.
        
             | unityByFreedom wrote:
             | Good communicators generally foster that environment. And
             | their customers appreciate it, so there is an external
             | expectation now too. Everything ends some day, but I think
              | this will be regarded as a valuable attribute for a while.
        
               | jgrahamc wrote:
               | This is deeply, deeply embedded in Cloudflare culture.
        
               | unityByFreedom wrote:
               | Devil's advocate, you could get taken over or end up with
               | a different board. I wouldn't like to see it but
               | someone's got to compete with you or we'll have to send
               | in the FTC! :)
        
             | bombcar wrote:
             | It could be good or bad; I suspect they've thought about it
             | and have worked on succession (I hope!) and have like-
             | minded people in the wings.
             | 
             | But once it happens things will change and, to be honest,
             | likely for the worse.
             | 
             | edit> fix typo
        
               | unityByFreedom wrote:
               | > secession
               | 
               | Succession?
        
               | bombcar wrote:
               | Eep yes, auto spell check on macOS is usually good but
               | sometimes it causes a civil war.
        
           | formerkrogemp wrote:
            | The contrast between this and the recent Atlassian outage
            | is night and day.
        
         | agilob wrote:
         | I'd love to see the postmortem from Facebook :(
        
         | Melatonic wrote:
          | To be fair though, they sort of MUST do things like this to
          | have our confidence - their whole business is about being FAST
          | and AVAILABLE. We're not talking about Oracle here :-D
        
         | viraptor wrote:
         | I feel like others lose opportunities by not doing the same. By
         | publishing early and publishing the details they: keep the
         | company in the news with positive stuff (free ad), get an
         | internal documentation of the incident (ignoring the customer
         | oriented "we're sorry" part), effectively get a free
         | recruitment post (you're reading this because you're in tech
         | and we do cool stuff, wink), release some internal architecture
         | info that people will reference in discussions later. At a
         | certain size it feels stupid not to post them publicly. I
         | wonder how much those posts are calculated and how much
         | organic/culture related.
        
           | zaidf wrote:
           | >I feel like others lose opportunities by not doing the same
           | 
           | IMO it is a slippery slope to see this as _opportunity_ too
           | strongly. Sure, doing the right thing may be net beneficial
           | to the business in the long run...but the $RIGHT_THING should
            | be done first and foremost because it's the right thing.
        
             | ethbr0 wrote:
             | I believe Marcus Aurelius had something similar to say on
             | the matter. :-)
        
               | jjtheblunt wrote:
               | quodcumque erat ?
        
           | hunter2_ wrote:
           | > I wonder how much those posts are calculated and how much
           | organic/culture related.
           | 
           | Don't companies have a fiduciary duty to calculate things;
           | the reason for doing something actually cannot just be that
           | it's a nice thing to do? Not down to the word, but at least
           | the general decision to be this way?
        
             | rrss wrote:
             | No.
             | 
             | https://scholarship.law.cornell.edu/cgi/viewcontent.cgi?htt
             | p...
        
             | viraptor wrote:
              | No, they don't have such a duty. In practice very little
              | decision making is based on hard data, in my experience.
              | The real world being fuzzy and risk being hard to quantify
              | do not help the situation.
        
           | Uehreka wrote:
           | Ehhhh... I think it's good (for us) that they do this, but I
           | don't think it's a free ad (contrary to popular belief, not
           | all news is good news, and this is bad news) and any sort of
           | conversion rate on recruitment is probably vanishingly small
           | (which would normally be fine, but incidents like these may
           | turn off some actual customers, which is where actual revenue
           | comes from).
           | 
           | I think their calculation (to the extent you can call it
           | that) is that in the interest of PR and damage control, it's
           | better to get a thorough postmortem out quickly to stem the
           | bleeding and keep people like us from going "I can't wait to
           | hear what happened at Cloudflare" for a week. Now we know,
           | the customers have an explanation, and this bad news cycle
           | has a higher chance of ending quickly.
        
           | smugma wrote:
           | I agree that this is a free ad/recruitment. However, it's
           | easy to see how more conservative businesses see this as a
           | risk. They are highlighting their deficiencies, letting their
           | big important clients know that human error can bring their
           | network down.
           | 
            | Additionally, these post-mortems work for Cloudflare because
           | they have a great reputation and good uptime. If this were
           | happening daily or weekly, it _would_ be a warning sign to
           | customers.
           | 
           | It's a strategy other companies could adopt, but to do it
           | effectively requires changes all across the organization.
        
             | saghm wrote:
             | OTOH, I think most actual engineers would know that
             | everywhere has deficiencies and can be brought down by
             | human error, and I'd personally rather use a product where
             | the people running it admit this rather than just claim
             | that their genius engineers made it 100% foolproof and
             | nothing could ever possibly go wrong
        
               | ethbr0 wrote:
               | Absolutely. The first step of good SRE is admitting
               | (publicly and within the organization) that you have a
               | problem.
        
         | solardev wrote:
         | No provider is perfect, but it's because of stuff like this
         | that I trust Cloudflare waaaaaaaaaaay more than the likes of
         | Amazon. Transparency engenders trust, and eventually, love!
         | Thank you, Cloudflare.
         | 
         | The sheer level of technical competence of your engineering
         | team continues to astound me. (Yes, they made a mistake and
         | didn't catch an error in the diff. But your response process
         | went exactly as it should, and your postmortem is excellent.) I
         | couldn't even _begin_ to think about designing or implementing
         | something of this complexity, much less being able to explain
         | it to a layperson after a failure. It is really impressive, and
         | I hope you will continue to do so into the future!
         | 
         | Most of the companies I've worked for unfortunately don't use
         | your services, but I've always been a staunch advocate and
         | converted a few. Maybe the higher-ups only see downtime and
         | name recognition (i.e. you're not Amazon), but for what it's
         | worth, us devs down the ladder definitely notice your
         | transparency and communications, and it means the world. I've
         | learned to structure my own postmortems after yours, and it's
         | really aided in internal communications.
         | 
         | Thank you again. I can't wait for the day I get to work in a
         | fully-Cloudflare stack :)
        
           | nijave wrote:
           | AWS is pretty decent if you're in an NDA contract (you have
            | paid support). You can request RCAs for any incident you
            | were impacted by, and you'll usually get them within a day.
           | 
           | Not as transparent as "post it on the internet" but at least
           | better than the usual hand wavey bullshit
        
         | kevin_nisbet wrote:
         | I agree, I think the transparency builds trust and I encourage
         | it where I can. The counter thought I had when reading this
         | case though, is it almost feels too fast. What I mean by that
         | is I hope there isn't an incentive to wrap up the internal
         | investigation quickly and write the blog and send it, and go
         | we're done.
         | 
         | Doing incident response (both outage and security), the
         | tactical fixes for a specific problem are usually pretty easy.
         | We can fix a bug, or change this specific plan to avoid the
         | problem. The search for conditions that allowed the incident to
          | occur can be a lot more time-consuming, and most organizations
         | I've worked for are happy to make a couple tactical changes and
         | move on.
        
           | cowsandmilk wrote:
           | I have to agree. The environment that leads to a fast blog
           | post may also lead to this quote from the post:
           | 
           | > This was delayed as network engineers walked over each
           | other's changes, reverting the previous reverts, causing the
           | problem to re-appear sporadically.
           | 
           | They are running as fast as they can and this extended the
           | incident. There is a "slow is smooth, smooth is fast" lesson
           | in here. I'd rather have a team that takes a day to put up
           | the blog post, but doesn't unnecessarily extend downtime
           | because they are sprinting.
        
             | jgrahamc wrote:
             | There's normal operating procedure and sign offs and
             | automation etc. etc. and then there's "we've lost contact
             | with these data centers and normal procedures don't work we
             | need to break glass and use the secondary channels". In
             | that situation you are in an emergency without normal
             | visibility.
        
             | bombcar wrote:
             | It can be easy to arm-chair it afterwards, but unless
             | things can be done in parallel (and systems should be
             | designed so this can be done, things like "we're not sure
             | what's wrong, we're bringing up a new cluster on the last
             | known good version even as we try to repair this one") you
             | have to make a choice, and sometimes it won't be optimal.
        
           | jgrahamc wrote:
           | _What I mean by that is I hope there isn 't an incentive to
           | wrap up the internal investigation quickly and write the blog
           | and send it, and go we're done._
           | 
           | There is not. From here there's an ongoing process with a
           | formal post-mortem, all sorts of tickets tracking work to
           | prevent further reoccurrence. This post is just the beginning
           | internally.
        
         | kortilla wrote:
         | Well cloudflare's entire value is in uptime and preventing
         | outages. Showing they have a rapid response and strong
         | fundamental technical understanding is much more critical in
         | the "prevent downtime" business.
        
         | tyingq wrote:
         | >take weeks for other companies to publish a postmortem
         | 
         | And with nowhere near the detail level of what was presented
         | here. Typically lots of sweeping generalizations that don't
         | tell you much about what happened, or give you any confidence
         | they really know what happened or have the right fix in place.
        
       | sschueller wrote:
       | Nodejs is still having issues. For example:
       | https://nodejs.org/dist/v16.15.1/node-v16.15.1-darwin-x64.ta...
       | doesn't download if you do "n lts"
        
       | thomashabets2 wrote:
       | tl;dr: Another BGP outage due to bad config changes.
       | 
       | Here's a somewhat old (2016) but very impressive system at a
       | major ISP for avoiding exactly this:
       | https://www.youtube.com/watch?v=R_vCdGkGeSk
        
       | johnklos wrote:
       | ...and yet they still push so hard for recentralization of the
       | web...
        
         | goodpoint wrote:
          | As if the centralization of email, social networks, VPS, and
          | SaaS was not bad enough.
         | 
         | It's pretty appalling that you are even being downvoted.
        
         | samwillis wrote:
         | CloudFlare are a hosting provider and CDN, they aren't
         | "push[ing] ... hard for recentralization of the web".
         | 
         | If it was AWS, Akamai, Google Cloud, or any of the other
         | massive providers this comment wouldn't be made. I don't really
         | understand the association between centralisation and
         | CloudFlare, other than it being a Meme.
        
           | viraptor wrote:
           | It's often mentioned about AWS, especially when us-east-1
           | fails. The others are not big enough to affect basically "the
           | internet" when they go down, so don't get pointed out as
           | centralisation issues as much.
           | 
           | And yeah, cf is trying to get as much traffic to go through
           | them as possible and add edge services for more opportunities
           | - that's literally their business. Also now r2 with object
           | storage. They're already too big, harmful (as in actually
           | putting people in danger) and untouchable in some ways.
        
           | johnklos wrote:
           | I think you've already drunk the Flavor Aid.
           | 
           | What do you have when you have all DNS going through them,
           | via DoH, and all web requests going through them, if not
           | recentralization?
           | 
           | Sure, they want us to think they give us the freedom to host
           | our web sites anywhere because they're "protected" by them,
           | but that "protection" means we've agreed to recentralize.
           | 
           | It's pretty dismissive to describe something as a meme just
           | because you don't understand it, and either you're pretending
           | to not understand it, or you truly don't.
           | 
           | Look at it this way: If a single company goes down for an
           | hour, and that company going down for an hour causes half the
           | web traffic on the Internet to fail for that hour, what is
           | that if not recentralization?
        
             | samwillis wrote:
             | I understand that for their WAF, DDOS and threat detection
             | products they need to have a very large amount of traffic
             | going through them. They have been very aggressive with
             | their free service to achieve that, to the benefit of all
             | their customers (including the free ones). Some could see
              | that as a push toward centralisation; I don't.
             | 
             | What I don't understand, or believe, is that they want to
             | be the sole (as in centralised) network for the internet. I
             | don't believe they as a company, or the people running it,
             | want that. They obviously have ambition to be one of the
             | largest networking/cloud providers, and are achieving that.
             | 
             | I don't intend either to dismiss your concerns (which are a
             | legitimate thing to have, centralisation would be very
             | bad), my suggestion with the meme comment is that there is
             | at times a trend to "brigade" on large successful companies
             | in a meme-like way. That isn't to suggest you were.
        
               | johnklos wrote:
               | They want to be a monopoly. They want everyone to depend
               | on them. They may not want recentralization in general,
               | but they definitely want as much of the Internet to
               | depend on them as possible.
        
       | philipwhiuk wrote:
       | Is there no system to unit test a rule-set?
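        | 
        | (Not Cloudflare's tooling, but one cheap invariant such a test
        | could assert, sketched in Python with made-up term names: no
        | policy terms may follow the terminal reject.)
        | 
        |   def terms_after_reject(terms):
        |       if "reject-all" in terms:
        |           return terms[terms.index("reject-all") + 1:]
        |       return []
        | 
        |   def test_no_unreachable_terms():
        |       policy = ["advertise-free", "advertise-pro", "reject-all"]
        |       assert terms_after_reject(policy) == []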
        
       | thesuitonym wrote:
       | Where does one even start with learning BGP? It always seemed
       | super interesting to me, but not really something that could be
        | dealt with on a small-scale, lab-type basis. Or am I wrong there?
        
         | _whiteCaps_ wrote:
         | https://github.com/Exa-Networks/exabgp
         | 
         | They've got some Docker examples in the README.
        
         | ThaDood wrote:
          | DN42 <https://dn42.eu/Home> gets mentioned a lot. It's
          | basically a big dynamic VPN that you can do BGP stuff with.
          | Pretty cool, but I could never get my node working properly.
        
           | bpye wrote:
           | I started setting that up and totally forgot, maybe I should
           | actually try and peer with someone.
        
         | jamal-kumar wrote:
          | Nah, Cisco has labs you can download and learn from for their
          | networking certifications, which are kinda the standard.
          | 
          | Networking talent is kind of hard to find, and if you learn
          | that, your chances of employment get pretty high.
        
         | nonameiguess wrote:
         | You can learn BGP with mininet: https://mininet.org/
         | 
         | You can simulate arbitrarily large networks and internetworks
         | with this, provided you have the hardware to run a large enough
         | number of virtual appliances, but they are pretty lightweight.
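          | 
          | A minimal Mininet sketch (Python; assumes Mininet is installed
          | and you run it as root -- for actual BGP you'd also run a
          | routing daemon such as FRR on the nodes):
          | 
          |   from mininet.net import Mininet
          |   from mininet.topo import SingleSwitchTopo
          | 
          |   net = Mininet(topo=SingleSwitchTopo(k=2))  # 2 hosts, 1 switch
          |   net.start()
          |   net.pingAll()  # verify h1 <-> h2 connectivity
          |   net.stop()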
        
           | Icathian wrote:
           | Mininet is what the Georgia Tech OMSCS Computer Networking
           | labs use. It's not bad, the two labs that stood out to me
           | were using it to implement BGP and a Distance Vector Routing
           | protocol.
        
       ___________________________________________________________________
       (page generated 2022-06-21 23:00 UTC)