[HN Gopher] Cloudflare outage on June 21, 2022
___________________________________________________________________

Cloudflare outage on June 21, 2022

Author : jgrahamc
Score  : 580 points
Date   : 2022-06-21 12:39 UTC (10 hours ago)

(HTM) web link (blog.cloudflare.com)
(TXT) w3m dump (blog.cloudflare.com)

| grenbys wrote:
| Would be great if the timeline covered the 19 minutes of 06:32 -
| 06:51. How long did it take to get the right people on the call?
| How long did it take to identify the deployment as a suspect?
|
| Another massive gap is the rollback: 06:58 - 07:42 - 44 minutes!
| What exactly was going on and why did it take so long? What were
| those back-up procedures mentioned briefly? Why were engineers
| stepping on each other's toes? What's the story with reverting
| reverts?
|
| Adding more automation and tests, and fixing that specific
| ordering issue, is of course an improvement. But that adds more
| complexity, and any automation will ultimately fail some day.
|
| Technical details are all appreciated. But it is going to be
| something else next time. Would be great to learn more about the
| human interactions. That's where the resilience of a socio-
| technical system lives, and I bet there is some room for
| improvement there.
| systemvoltage wrote:
| It would be fun to be a fly on the wall when shit hits the fan
| in general. From nuclear meltdowns to 9/11 ATC recordings, it
| is fascinating to see how emergencies play out and what kind of
| things go on in boots-on-ground, all-hands-on-deck situations.
|
| Like, does Cloudflare have an emergency procedure for
| escalation? What does that look like? How does the CTO get
| woken up in the middle of the night? How do they get in touch
| with the critical, most important engineers? Who noticed
| Cloudflare was down first? How do quick decisions get made? Do
| people get on a giant Zoom call? Or do emails go around? What
| if they can't get hold of the most important people that can
| flip switches? Do they have a control room like in the movies?
| The CTO looking over a shoulder calling "Affirmative, apply the
| fix." followed by a progress bar painfully moving towards
| completion.
| nijave wrote:
| Sounds like they had engineers connecting to the devices and
| manually rolling back changes. Something like...
|
| Slack: "@here need to connect to <long list of devices> to
| rollback change asap"
| edf13 wrote:
| It's nearly always BGP when this level of failure occurs.
| jgrahamc wrote:
| I dunno man, you can really fuck things up with DNS also.
| ngz00 wrote:
| I was on a severely understaffed edge team fronting several
| thousand engineers at a Fortune 500 - every deploy felt like
| a SpaceX launch from my cubicle. I have a lot of reverence
| for the engineers who take on that kind of responsibility.
| star-glider wrote:
| Generally speaking:
|
| You broke half the internet: BGP.
| You broke half of your company's ability to access the
| internet: DNS.
| sidcool wrote:
| This is a very nice write-up.
| testplzignore wrote:
| Are there any steps that can be taken to test these types of
| changes in a non-production environment?
| vnkr wrote:
| It's very difficult, if not impossible, to create a staging
| environment that replicates production well enough at this
| scale. What the blog post suggests as a remediation in the
| process: "There are several opportunities in our automation
| suite that would mitigate some or all of the impact seen from
| this event. Primarily, we will be concentrating on automation
| improvements that enforce an improved stagger policy for
| rollouts of network configuration and provide an automated
| "commit-confirm" rollback. The former enhancement would have
| significantly lessened the overall impact, and the latter would
| have greatly reduced the Time-to-Resolve during the incident."
| malikNF wrote:
| off-topic-ish, this post on /r/ProgrammerHumor gave me a chuckle
|
| https://www.reddit.com/r/ProgrammerHumor/comments/vh9peo/jus...
| jgrahamc wrote:
| That made me smile.
| Cloudef wrote:
| thejosh wrote:
| 07:42: The last of the reverts has been completed. This was
| delayed as network engineers walked over each other's changes,
| reverting the previous reverts, causing the problem to re-appear
| sporadically.
|
| Ouch
| jgrahamc wrote:
| Well, the "we can't reach these data centers at all and need to
| go through the break glass procedure" was pretty "ouch" also.
| gouggoug wrote:
| I'd be super interested in understanding what this means
| concretely. For example, are we talking about reverting
| commits? If so, why were engineers reverting reverts?
| yuliyp wrote:
| Developer 1 fetches code, changes flag A. Rebuilds config.
| Developer 2 fetches code, changes flag B. Rebuilds config.
| Developer 1 deploys built config. Developer 2 deploys built
| config, inadvertently reverting developer 1's changes.
| wstuartcl wrote:
| It can also happen when your deploy process has two revert
| flows: a forward revert (where new bits fixing the items
| that needed to be reverted are committed at head) and a
| "previous head" revert, which just goes back one revision
| in the RCS (or tagged version).
|
| Imagine the first eng team did a forward revert that
| corrected the issue, with new bits at head that got
| deployed, and shortly after another eng fires off the
| second process type and tells the system to pull back to
| the last revision (which is now the bad revision, as it
| was just replaced by the fresher deploy).
|
| Having two revert processes in the toolkit, and maybe a
| few dispersed teams working to revert the issue without
| tight communication, leads to this issue.
|
| I think this is more likely the underlying issue than a
| bad merge (I assume that the root cause was broadcast far
| and wide to anyone making a merge).
| dangrossman wrote:
| I think I experienced first-hand the moment those network
| engineers were reverting their own reverts, breaking the web
| again. For example, DoorDash.com had come back online, then
| went back to serving only HTTP 500 errors from Cloudflare, then
| came back online again. I raised it in the HN discussion and
| @jgrahamc responded minutes later.
|
| https://news.ycombinator.com/item?id=31821290
| michaelmior wrote:
| This was something I was surprised not to see directly
| addressed in terms of follow-up steps. When discussing process
| changes, they mention additional testing, but nothing to
| address what seems to be a significant communication gap.
| scottlamb wrote:
| I'm sure they have a more detailed internal postmortem, and I
| imagine it'd go into that. This is a nice high-level
| overview. They probably don't want to bury that under details
| of their communication processes, much less go into exactly
| who did what when, for wide consumption by an audience that
| may not be on board with blameless postmortem culture.
| dpz wrote:
| Really appreciate the speed, detail and transparency of this
| post-mortem.
| Really one of the best, if not the best, in the industry.
| ElectronShak wrote:
| What's it like to be an engineer designing and working on these
| systems? Must be sooo fulfilling! #Goals; Y'all are my heroes!!
| jgrahamc wrote:
| https://www.cloudflare.com/careers/
| ElectronShak wrote:
| Thanks, unfortunately I live in Africa, no roles yet for my
| location. I'll wait as I use the products :)
| Icathian wrote:
| I'm currently waiting on a recruiter to get my panel
| interviews scheduled. You guys are in "dream gig" territory
| for me. Any tips? ;-)
| CodeWriter23 wrote:
| Gotta hand it to them, a shining example of transparency and
| taking responsibility for mistakes.
| minecraftchest1 wrote:
| Something else that I think would be smart to implement is
| reorder detection. Have the change approval specifically point
| out stuff that gets reordered, and require manual approval for
| each section that gets moved around.
|
| I also think that having a script that walks through the file
| and points out any obvious mistakes would be good to have as
| well.
| dane-pgp wrote:
| Yeah, there's got to be some sweet spot between "formally
| verify all the things" and "i guess this diff looks okay,
| yolo!".
|
| I'd say that if you're designing a system which has the
| potential to disconnect half your customers based on a
| misconfiguration, then you should spend at least an hour
| thinking about what sorts of misconfigurations are possible,
| and how you could prevent or mitigate them.
|
| The cost-benefit analysis of "how likely is it that such a
| mistake would get to production (and what would that cost
| us)?" vs "how much effort would it take to write and maintain
| a verifier that prevents this mistake?" should then be fairly
| easy to estimate with sufficient accuracy.
| mproud wrote:
| If I use Cloudflare, what can I do -- if anything -- to avoid
| disruption when they go down?
| meibo wrote:
| On the enterprise plans, you are able to set up your own DNS
| server that can route users away from Cloudflare, either to
| your origin or to another CDN/proxy.
| junon wrote:
| Now _this_ is a post-mortem.
| weird-eye-issue wrote:
| One of our sites uses Cloudflare and serves 400k pageviews per
| month and generates around $650/day in ad and affiliate revenue.
| If the site is not up, the business is not making any money.
|
| Looking at the hourly chart in Google Analytics (compared to the
| previous day) there isn't even a blip during this outage.
|
| So for all the advantages we get from Cloudflare (caching, WAF,
| security [our WP admin is secured with Cloudflare Teams],
| redirects, page rules, etc.) I'll take these minor outages that
| make HN go apeshit.
|
| Of course it helped that most of our traffic is from the US and
| this happened when it did, but in the past week alone we served
| over 180 countries, which Cloudflare helps make nice and fast :D
| algo_trader wrote:
| Could you possibly, kindly, mention which tools you use to
| track/buy/calculate conversions/revenue?
|
| Many thanks
|
| (Or DM the puppet email in my profile)
| harrydehal wrote:
| Not OP, but my team really, really enjoys using a combination
| of Segment.io for event tracking and piping that data into
| Amplitude for data viz, funnel metrics, A/B tests,
| conversion, etc.
| herpderperator wrote:
| I didn't quite understand this. It sounds like Cloudflare's
| outage didn't affect you despite being their customer. Why did
| their large outage not affect you?
| jgrahamc wrote:
| It wasn't a global outage.
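
A toy sketch of the reorder detection minecraftchest1 suggests
above. It assumes a config can be reduced to an ordered list of
rule names (a made-up representation, not Cloudflare's actual
tooling) and flags every pair of rules whose relative order flipped
between revisions, so a reviewer has to approve each move
explicitly:

    def reordered_rules(old, new):
        """Return pairs of rules whose relative order changed."""
        old_set = set(old)
        kept_new = [r for r in new if r in old_set]    # rules in both revisions
        pos = {r: i for i, r in enumerate(kept_new)}   # position in new revision
        kept_old = [r for r in old if r in pos]
        return [(a, b)
                for i, a in enumerate(kept_old)
                for b in kept_old[i + 1:]
                if pos[a] > pos[b]]                    # a preceded b, now follows it

    # The mistake from the post, in miniature: announcement terms
    # ended up below the catch-all reject (rule names are made up).
    old = ["announce-spine-prefixes", "announce-local-prefixes", "reject-all"]
    new = ["reject-all", "announce-spine-prefixes", "announce-local-prefixes"]
    print(reordered_rules(old, new))
    # [('announce-spine-prefixes', 'reject-all'),
    #  ('announce-local-prefixes', 'reject-all')]
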
| prophesi wrote:
| I thought it was global? 19 data centers were taken offline,
| which "handle a significant proportion of [Cloudflare's]
| global traffic".
| madeofpalk wrote:
| I did not notice Cloudflare going down. The only reason I
| knew was because of this thread. Either it was because I
| was asleep, or my local PoP wasn't affected.
| jgrahamc wrote:
| I am in Lisbon and was not having trouble because
| Cloudflare's Lisbon data center was not affected. But
| over in Madrid there was trouble. It depended on where
| you were.
| mkl wrote:
| From the article: "Depending on your location in the
| world you may have been unable to access websites and
| services that rely on Cloudflare. In other locations,
| Cloudflare continued to operate normally."
| ihaveajob wrote:
| But if your clients are mostly asleep while this is
| happening, they might not notice.
| uhhhhuhyes wrote:
| Because of the time at which the outage occurred, most of
| this person's customers were not trying to access the site.
| markdown wrote:
| Would you mind sharing which site that is?
| keyle wrote:
| How did no one at Cloudflare think that this MCP thing should be
| part of the staging rollout? I imagine that was part of a //
| TODO.
|
| It sounds like it's a key architectural part of the system that
| "[...] convert all of our busiest locations to a more flexible
| and resilient architecture."
|
| 25 years' experience, and it's always the things that are
| supposed to make us "more flexible" and "more resilient" or
| robust/stable/safer <keyword> that end up royally f'ing us where
| the light don't shine.
| kache_ wrote:
| shit dawg i just woke up
| rocky_raccoon wrote:
| Time and time again, this type of response proves that it's the
| right way to handle a bad situation. Be humble, apologize, own
| your mistake, and give a transparent snapshot into what went
| wrong and how you're going to learn from the mistake.
|
| Or you could go the opposite direction and risk turning something
| like this into a PR death spiral.
| can16358p wrote:
| Exactly. I trust businesses/people that are transparent about
| their mistakes/failures much more than the ones that avoid them
| (except Apple, which never accepts its mistakes, but I still
| trust their products; I think I'm affected by RDF).
|
| At the end of the day, everybody makes mistakes and that's
| okay. Everybody else also knows that everybody makes mistakes.
| So why not accept it?
|
| I really don't get what's wrong with accepting mistakes,
| learning from them, and moving on.
| coob wrote:
| The exception that proves the rule with Apple:
|
| https://appleinsider.com/articles/12/09/28/apple-ceo-tim-coo...
| can16358p wrote:
| Yeah. Forgot that one. When it first came out it was
| terrible.
|
| Apparently so terrible that Apple apologized, perhaps for
| the first (and last) time, for something.
| kylehotchkiss wrote:
| They didn't apologize about the direction the pro Macs
| were going a few years back, but they certainly listened
| and made amends for it with the recent Pro line and
| MacBook Pro enhancements.
| dylan604 wrote:
| "Is it Apple Maps bad?" --Gavin Belson, Silicon Valley
|
| This one line will forever cement exactly how bad Apple
| Maps' release was. Thanks Mike Judge!
| stnmtn wrote:
| I agree, but lately (as in the past month) I've been
| finding myself using Apple Maps more and more instead of
| Google. When on a complicated highway interchange, the 3D
| view that Apple Maps gives for which exit to take is a
| life-saver.
| can16358p wrote:
| Yup. Just remember the episode. IIRC in that context
| Apple Maps was ranked even worse than Windows Vista.
| dylan604 wrote:
| I would agree with that. Apple Maps was worse than the
| hockey puck mouse or the trashcan Mac Pro. Trying to
| decide if it is worse than the butterfly keyboard, but I
| think the keyboard wins for the sheer fact that it
| impacted me in a way that was uncorrectable, whereas I
| could just use a different Maps app.
| rocky_raccoon wrote:
| > I really don't get what's wrong with accepting mistakes,
| learning from them, and moving on.
|
| Some people really struggle with this (myself included) but I
| think it's one of the easiest "power ups" you can use in
| business and in life. The key is that you have to actually
| follow through on the "learning from them" clause.
| dylan604 wrote:
| Sure, this can be a good thing when it's a rare occurrence.
| If it is a weekly event, then you just start to look
| incompetent.
| gcau wrote:
| Am I the only one who really doesn't think this is a big deal?
| They had an outage, they fixed it very quickly. Life goes on.
| Talking about the outage as if it's a reason for us all to ditch
| CF and buy/run our own hardware (which will be totally better)
| is so hyperbolic.
| Deukhoofd wrote:
| It was a bit of a thing as people in Europe started their
| office work and found out a lot of their internet services
| were down, and they were unable to access the things they
| needed. It's rather dangerous that we all depend on this one
| service being online.
| nielsole wrote:
| > Talking about the outage as if it's a reason for us all to
| ditch CF
|
| At the time of writing, no comment has done that except yours.
| gcau wrote:
| I'm referring to other posts and discussions outside this
| website. I don't expect as much criticism in this post.
| simiones wrote:
| It is kind of a big deal to discover just how much of the
| Internet and the WWW is now dependent on CloudFlare.
|
| For their part, they handled this very well, and are to be
| commended (quick fix, quick explanation of failure).
|
| But you also can't help but see that they have a dangerous
| amount of control over such important systems.
| leetrout wrote:
| BGP changes should be like the display resolution changes on your
| PC...
|
| They should revert as a failsafe if not confirmed within X
| minutes.
| addingnumbers wrote:
| That's the "commit-confirm" process they mention they will use
| in the write-up:
|
| > Primarily, we will be concentrating on automation
| improvements ... and provide an automated "commit-confirm"
| rollback.
| Melatonic wrote:
| Surprised everyone has not switched to this already - great
| idea
| neuronexmachina wrote:
| I assume there are some non-trivial caveats when using this
| with a widely-distributed system.
| vnkr wrote:
| That's what is suggested in the blog post as one of the
| future prevention plans.
| antod wrote:
| There was a common pattern in use back in the day when I
| managed OpenBSD firewalls (can't remember if it was the ipf or
| pf days). When changing firewall rules over ssh, you'd use a
| command line like:
|
| $ apply new rules; sleep 10; apply original rules
|
| If your ssh access was still working and various sites were
| still up during that 10 sec, you were probably good to go - or
| at least you hadn't shut yourself out.
| DRW_ wrote:
| Back when I was briefly a network engineer at the start of my
| career, on Cisco equipment we'd do 'reload in 5' before big
| changes - so it'd auto-restart after 5 minutes unless
| cancelled.
|
| I'm sure there were and are better ways of doing it, but it
| was simple enough and worked for us.
| kazen44 wrote:
| Most ISP-tier routers have an entire commit engine to load
| and apply configs.
|
| Juniper, for instance, allows one to run the command "commit
| confirmed", which will apply the configuration and revert
| back to the previous version if one does not acknowledge the
| commit within a predefined time. This prevents permanent
| lockout from a system.
| ransom1538 wrote:
| Still seeing failed network calls.
|
| https://i.imgur.com/xHqvOzj.png
| jgrahamc wrote:
| Feel free to email me (jgc) details, but based on that error I
| don't think that's us.
| ransom1538 wrote:
| One more? I'll email too. https://i.imgur.com/Cxwv58g.png
| zinekeller wrote:
| Yeah, that's not Cloudflare at all (it's unlikely that CF still
| uses nginx/1.14).
| Jamie9912 wrote:
| Is that actually coming from Cloudflare? IIRC Cloudflare
| reports itself as Cloudflare, not nginx, in the 5xx error pages.
| lenn0x wrote:
| Correct, I saw that too. The outage returned 500/nginx, with
| no version number in the footer either. @jgrahamc thought that
| was strange too, as a few commenters last night were caught
| off guard trying to determine if it was their systems or
| Cloudflare. Supposedly it's been forwarded along.
| kc10 wrote:
| Yes, there is definitely an nginx service in the path. We
| don't have any nginx in our infrastructure, but this was
| the response we had for our URLs during the outage.
|
| <html>
| <head><title>500 Internal Server Error</title></head>
| <body bgcolor="white">
| <center><h1>500 Internal Server Error</h1></center>
| <hr><center>nginx</center>
| </body>
| </html>
| samwillis wrote:
| The outage this morning manifested itself as an nginx error
| page, somewhat unusually for CF.
| Belphemur wrote:
| It's interesting that in 2022 we still have network issues caused
| by the wrong ordering of rules.
|
| Everybody at one time or another experiences the dreaded REJECT
| not being at the end of the rule stack but coming just too early.
|
| Kudos to CF for such a good explanation of what caused the issue.
| ec109685 wrote:
| I wonder what tool the engineers used to view that diff. With a
| side-by-side one, it's a bit more obvious when lines are
| reordered.
|
| Even better if the tool were syntax-aware, so it could highlight
| the different types of rules in unique colors.
| xiwenc wrote:
| I'm surprised they did not conclude that rollouts should be
| executed over a longer period with smaller batches. When a system
| is as complicated as theirs, with so much impact, the only sane
| strategy is slow rolling updates so that you can hit the brakes
| when needed.
| jgrahamc wrote:
| That's literally one of the conclusions.
| ttul wrote:
| Every outage represents an opportunity to demonstrate resilience
| and ingenuity. Outages are guaranteed to happen. Might as well
| make the most of it to reveal something cool about their
| infrastructure.
| asadlionpk wrote:
| Been a fan of CF since they were essential for DDoS protection
| for various WordPress sites I deployed back then.
|
| I buy more NET every time I see posts like this.
| malfist wrote:
| Hackernews isn't wallstreetbets.
| nerdbaggy wrote:
| Really interesting that 19 cities handle 50% of the requests.
| JCharante wrote:
| Well, half of those cities are in Asia, where it was business
| hours, so given that the majority of humans live in Asia it
| makes sense. CF data centers in Asia also seem to be less
| distributed than in the West (e.g. Vietnam traffic seems to go
| to Singapore), whereas CF has multiple centers distributed
| throughout the US.
| jgrahamc wrote:
| Actually, I think the flip side is even more interesting. If
| you want to give good, low-latency service to 50% of the world
| you need a lot of data centers.
| worldofmatthew wrote:
| If you have an efficient website, you can get decent
| performance to most of the world with one PoP on the West
| coast of the USA.
| rubatuga wrote:
| Uh, shouldn't there be a staging environment for these sorts of
| changes?
| alanning wrote:
| Yes, that was one of the issues they mentioned in the post. Not
| that they didn't have a staging/testing environment, but that
| it didn't include the specific type of new architecture
| configuration, "MCP", that ultimately failed.
|
| One of their future changes is to include MCPs in their testing
| environments.
| nijave wrote:
| Ahh, the old "dev doesn't quite match prod" issue.
| ggalihpp wrote:
| The DNS resolver was also impacted and seems to still have
| issues. We changed to Google DNS and that solved it.
|
| The problem is, we couldn't tell all our clients they should
| change this :(
| jiggawatts wrote:
| The default way that most networking devices are managed is crazy
| in this day and age.
|
| Like the post-mortem says, they will put mitigations in place,
| but this is something every network admin has to implement
| bespoke after learning the hard way that the default management
| approach is dangerous.
|
| I've personally watched admins make routing changes where any
| error would cut them off from the device they are managing and
| prevent them from rolling it back -- pretty much what happened
| here.
|
| What should be the _default_ on every networking device is a two-
| stage commit where the second stage requires a new TCP
| connection.
|
| Many devices still rely on "not saving" the configuration, with a
| power cycle as the rollback to the previous saved state. This is
| a great way to turn a small outage into a big one.
|
| This style of device management may have been okay for small
| office routers where you can just walk into the "server closet"
| to flip the switch. It was okay in the era when device firmware
| was measured in kilobytes and boot times in single-digit seconds.
|
| Globally distributed backbone routers are an entirely different
| scenario, but the manufacturers use the same outdated management
| concepts!
|
| (I have seen some _small_ improvements in this space, such as
| devices now keeping a history of config files by default instead
| of a single current-state file only.)
| inferiorhuman wrote:
| The power cycle as a rollback is IMO reasonable. If you're
| talking about equipment in a data center, you should presumably
| have some sort of remote power management on a separate
| network.
|
| Alternatively, some sort of watchdog timer would be a great
| addition (e.g. roll back within X minutes if the changes are
| not confirmed).
| throwaway_uke wrote:
| i'm gonna go with the less popular view here that overly detailed
| post-mortems do little in the grand scheme of things other than
| satisfy tech p0rn for a tiny, highly technical audience. does
| wonders for hiring indeed.
|
| sure, transparency is better than "something went wrong, we take
| this very seriously, sorry." (although the non-technical crowd
| couldn't care less)
|
| only people who don't do anything make no mistakes, but making
| such highly impactful changes so quickly (inside one day!) in
| locations where 50% of traffic happens seems a huge red flag to
| me, no matter the procedure and safety valves.
| kylegalbraith wrote:
| As others have said, this is a clear and concise write-up of the
| incident. That is underlined even more when you take into account
| how quickly they published this. I have seen some companies take
| weeks or even months to publish an analysis that is half as good
| as this.
|
| Not trying to take the light away from the outage; the outage was
| bad. But the relative quickness of recovery is pretty impressive,
| in my opinion. Sounds like they could have recovered even quicker
| if not for a bit of toe-stepping that happened.
| april_22 wrote:
| I think it's even better that they explained the background of
| the outage in a really easy-to-understand way, so that not only
| experts can get a grasp of what was happening.
| sharps_xp wrote:
| who will make the abstraction-as-a-service we all need to protect
| us from config changes?
| pigtailgirl wrote:
| -- how much are you willing to pay for said system? --
| sharps_xp wrote:
| depends on how guaranteed your solution is?
| richardwhiuk wrote:
| 100%. You can never roll out any changes.
| ruined wrote:
| happy solstice everyone
| psim1 wrote:
| CF is the only company I have ever seen that can have an outage
| and get pages of praise for it. I don't have any (current) use
| for CloudFlare's products but I would love to see the culture
| that makes them praiseworthy spread to other companies.
| homero wrote:
| I'm also a huge fan
| capableweb wrote:
| I think a lot of companies don't realize the whole
| "acknowledging our problems in public" thing CF has going for
| it is a positive. Lots of companies don't want to publish
| public post-mortems, as they think it'll make them look weak
| rather than showing that they care about transparency in the
| face of failures/downtimes.
| [deleted]
| systemvoltage wrote:
| Nerds in the executive office (CEO & CTO, etc). People just
| like us.
| badrabbit wrote:
| Having been on the other side of similar outages, I am very
| impressed by their response timeline.
| lilyball wrote:
| They said they ran a dry-run. What did that do, just generate
| these diffs? I would have expected them to have some way of
| _simulating_ the network for BGP changes in order to verify that
| they didn't just fuck up their traffic.
| kurtextrem wrote:
| Yet another BGP-caused outage. At some point we should collect
| all of them:
|
| - Cloudflare 2022 (this one)
|
| - Facebook 2021: https://news.ycombinator.com/item?id=28752131 -
| this one probably had the single biggest impact, since engineers
| got locked out of their systems, which made the fixing part look
| like a sci-fi movie
|
| - (Indirectly caused by BGP: Cloudflare 2020:
| https://blog.cloudflare.com/cloudflare-outage-on-july-17-202...)
|
| - Google Cloud 2020:
| https://www.theregister.com/2020/12/16/google_europe_outage/
|
| - IBM Cloud 2020:
| https://www.bleepingcomputer.com/news/technology/ibm-cloud-g...
|
| - Cloudflare 2019: https://news.ycombinator.com/item?id=20262214
|
| - Amazon 2018:
| https://www.techtarget.com/searchsecurity/news/252439945/BGP...
|
| - AWS: https://www.thousandeyes.com/blog/route-leak-causes-amazon-a...
| (2015)
|
| - YouTube: https://www.infoworld.com/article/2648947/youtube-outage-und...
| (2008)
|
| And then there are the incidents caused by hijacking:
| https://en.wikipedia.org/wiki/BGP_hijacking#:~:text=end%20us...
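
Since policy-term ordering keeps showing up as a root cause (it was
the trigger here), a toy lint in the spirit of philipwhiuk's "unit
test a rule-set" question further down. It assumes a policy
flattened to an ordered list of (name, action, matches_everything)
terms, a made-up representation, and flags terms left dead below a
catch-all terminal action:

    def unreachable_terms(policy):
        """Flag terms that can never match because an earlier
        catch-all term already accepts or rejects everything."""
        dead, terminated = [], False
        for name, action, matches_everything in policy:
            if terminated:
                dead.append(name)
            elif matches_everything and action in ("accept", "reject"):
                terminated = True   # nothing below this term is evaluated
        return dead

    # The failure mode from the post, in miniature: announcement
    # terms reordered below the final reject (term names made up).
    policy = [
        ("reject-bogons",       "reject", False),
        ("reject-everything",   "reject", True),
        ("announce-site-local", "accept", False),  # never evaluated
    ]
    print(unreachable_terms(policy))   # ['announce-site-local']
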
| eddieroger wrote:
| These are the public-facing BGP announcements that cause
| problems, but that doesn't account for the ones on private LANs
| that also happen. Previous employers of mine have had
| significant internal network issues because internal BGP between
| sites started causing problems. I'm not sure there's anything
| better (I am not a network guy), but this list can't be
| exhaustive.
| simiones wrote:
| Google in Japan 2017:
| https://www.internetsociety.org/blog/2017/08/google-leaked-p...
| jve wrote:
| Came here to say exactly this... things that mess with BGP have
| the power to wipe you off the internet.
|
| Some more:
|
| - Google 2016, configuration management bug/BGP:
| https://status.cloud.google.com/incident/compute/16007
|
| - Valve 2015: https://www.thousandeyes.com/blog/steam-outage-monitor-data-...
|
| - Cloudflare 2013: https://blog.cloudflare.com/todays-outage-post-mortem-82515/
| ssms27 wrote:
| The internet runs on BGP, so I would think that most internet
| issues would be a result of BGP.
| perlgeek wrote:
| There are lots of other causes of incidents, like cut cables,
| failed router hardware, data centers losing power, etc.
|
| It just seems that most of these are local enough, and the
| Internet resilient enough, that they don't cause global
| issues. Maybe the exception would be AWS us-east-1 outages :-)
| tedunangst wrote:
| BGP is the reason you _don't_ hear about cable cuts taking
| down the internet.
| addingnumbers wrote:
| Maybe it's a testament to BGP's effectiveness that so many
| large-scale outages are due to misconfiguring BGP rather
| than the frequent cable cuts and hardware failures that BGP
| routes around.
| mlyle wrote:
| > since engineers got locked out of their systems
|
| Sounds like the same happened here:
|
| "Due to this withdrawal, Cloudflare engineers experienced added
| difficulty in reaching the affected locations to revert the
| problematic change. We have backup procedures for handling such
| an event and used them to take control of the affected
| locations."
|
| But Cloudflare had sufficient backup connectivity to fix it.
| I'm curious how Cloudflare does that today -- the solution long
| ago was always a modem on an auxiliary port.
| jve wrote:
| > the solution long ago was always a modem on an auxiliary
| port
|
| Now you can use mobile Internet (4G/5G).
| ccakes wrote:
| Cell coverage inside datacenters isn't always suitable,
| occasionally even by design.
| cwt137 wrote:
| They have their machines also connected to another AS, so
| when their network doesn't/can't route, they can still get to
| their machines to fix stuff.
| Melatonic wrote:
| Worst case, if I was designing this, I would probably have a
| satellite connection running over Iridium at each of their
| biggest DCs.
|
| Also, let's face it - the utility of a trusted security
| guard/staff with an old-fashioned physical key is pretty hard
| to screw up!
| nijave wrote:
| Not sure how common it is, but you can get serial OOBM
| devices accessible over cellular, which would then give you
| access to your equipment.
|
| I'm surprised more places don't implement a "click here to
| confirm changes or it'll be rolled back in 5 minutes", like
| all those monitor settings dialogues.
| merlyn wrote:
| That's like blaming the hammer for breaking.
|
| BGP is just a tool; it would be something else serving the same
| purpose.
| forrestthewoods wrote:
| Some tools are more fragile and error-prone than others.
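
A minimal sketch of the confirm-or-roll-back idea that leetrout,
antod, DRW_, kazen44 and nijave describe in this thread, in the
spirit of Juniper's "commit confirmed". The apply_fn and revert_fn
callbacks are placeholders for whatever idempotent apply/rollback
operations you have; this is not any vendor's API:

    import threading

    def commit_confirm(apply_fn, revert_fn, timeout_s=300):
        """Apply a change, then revert it automatically unless it
        is confirmed within timeout_s seconds."""
        apply_fn()
        timer = threading.Timer(timeout_s, revert_fn)
        timer.start()
        def confirm():
            timer.cancel()   # operator proved they still have access
        return confirm

    # Usage sketch: confirm over a *new* connection, so a change
    # that cuts you off can never be confirmed and rolls itself back.
    confirm = commit_confirm(lambda: print("config applied"),
                             lambda: print("config reverted"))
    confirm()
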
| techsupporter wrote:
| Except that this wasn't an example of BGP being error-prone
| or fragile. This was, as the blog post specifically calls
| out, human error. They put two BGP announcement rules
| after the "deny everything not previously allowed" rule.
| It's the same as if someone did this to a set of ACLs on a
| firewall.
|
| The main difference between BGP and all other tools is that
| if you mess up BGP, you've done a very visible thing,
| because BGP underpins how we get to each other's networks.
| But it's not a sign of BGP being fragile, just very
| important.
| witcher_rat wrote:
| You say that like it hasn't been going on since the mid-1990s,
| when it got deployed.
|
| I'm not blaming BGP, since it prevents far more outages than it
| causes, but BGP-based outages have been a thing since its
| beginning. And any other protocol would have outages too - BGP
| just happens to be the protocol being used.
| Tsiklon wrote:
| This is a great, concise explanation. Thank you for providing it
| so quickly.
|
| If you forgive my prying, was this an implementation issue with
| the maintenance plan (operator or tooling error), a fundamental
| issue with the soundness of the plan as it stood, or an
| unexpected outcome from how the validated and prepared changes
| interacted with the system?
|
| I imagine that an outage of this scope wasn't foreseen in the
| development of the maintenance & rollback plan for the work.
| xtat wrote:
| Feels a little disingenuous to use the first 3/4 of the report to
| advertise.
| DustinBrett wrote:
| I wish computers could stop us from making these kinds of
| mistakes without turning into Skynet.
| terom wrote:
| TODO: use commit-confirm for automated rollbacks
|
| Sounds like a good idea!
| buggeryorkshire wrote:
| Is that the equivalent of Cisco 'save running config' with a
| timer? It's been many years, so I can't remember the exact
| incantations...
| trollied wrote:
| Sounds like Cloudflare need a small low-traffic MCP that they can
| deploy to first.
| samwillis wrote:
| In a world where it can take weeks for other companies to publish
| a postmortem after an outage (if they ever do), it never ceases
| to amaze me how quickly CF manage to get something like this out.
|
| I think it's a testament to their Ops/incident response teams and
| internal processes; it builds confidence in their ability to
| respond quickly when something does go wrong. Incredible work!
| thejosh wrote:
| Yep, look at Heroku and their big incident, and the amount of
| downtime they've had lately.
| mattferderer wrote:
| To further add to your point, the CTO is the one who shared it
| here, & the CEO is incredibly active on forums & social media
| everywhere with customers. Communication has always been one of
| their strengths.
| bluehatbrit wrote:
| I do wonder what would happen if either of them left the
| company. I feel like there's a lot of trust on HN (and other
| places) that's heavily attached to them as individuals and
| their track record of good communication.
| unityByFreedom wrote:
| Good communicators generally foster that environment. And
| their customers appreciate it, so there is an external
| expectation now too. Everything ends some day, but I think
| this will be regarded as a valuable attribute for a while.
| jgrahamc wrote:
| This is deeply, deeply embedded in Cloudflare culture.
| unityByFreedom wrote:
| Devil's advocate: you could get taken over or end up with
| a different board. I wouldn't like to see it, but
| someone's got to compete with you or we'll have to send
| in the FTC! :)
| bombcar wrote:
| It could be good or bad; I suspect they've thought about it
| and have worked on succession (I hope!) and have like-
| minded people in the wings.
|
| But once it happens things will change and, to be honest,
| likely for the worse.
|
| edit> fix typo
| unityByFreedom wrote:
| > secession
|
| Succession?
| bombcar wrote:
| Eep, yes. Auto spell check on macOS is usually good, but
| sometimes it causes a civil war.
| formerkrogemp wrote:
| The contrast between this and the recent Atlassian outage is
| night and day.
| agilob wrote:
| I'd love to see the postmortem from Facebook :(
| Melatonic wrote:
| To be fair, though, they sort of MUST do things like this to have
| our confidence - their whole business is about being FAST and
| AVAILABLE. We're not talking about Oracle here :-D
| viraptor wrote:
| I feel like others lose opportunities by not doing the same. By
| publishing early and publishing the details they: keep the
| company in the news with positive stuff (free ad), get internal
| documentation of the incident (ignoring the customer-oriented
| "we're sorry" part), effectively get a free recruitment post
| (you're reading this because you're in tech and we do cool
| stuff, wink), and release some internal architecture info that
| people will reference in discussions later. At a certain size it
| feels stupid not to post them publicly. I wonder how much those
| posts are calculated and how much organic/culture-related.
| zaidf wrote:
| > I feel like others lose opportunities by not doing the same
|
| IMO it is a slippery slope to see this as _opportunity_ too
| strongly. Sure, doing the right thing may be net beneficial
| to the business in the long run... but the $RIGHT_THING should
| be done first and foremost because it's the right thing.
| ethbr0 wrote:
| I believe Marcus Aurelius had something similar to say on
| the matter. :-)
| jjtheblunt wrote:
| quodcumque erat? ("whatever it was?")
| hunter2_ wrote:
| > I wonder how much those posts are calculated and how much
| organic/culture related.
|
| Don't companies have a fiduciary duty to calculate things, such
| that the reason for doing something actually cannot just be
| that it's a nice thing to do? Not down to the word, but at least
| the general decision to be this way?
| rrss wrote:
| No.
|
| https://scholarship.law.cornell.edu/cgi/viewcontent.cgi?http...
| viraptor wrote:
| No, they don't have such a duty. In practice, very little
| decision-making is based on hard data, in my experience. The
| real world being fuzzy and risk being hard to quantify do
| not help the situation.
| Uehreka wrote:
| Ehhhh... I think it's good (for us) that they do this, but I
| don't think it's a free ad (contrary to popular belief, not
| all news is good news, and this is bad news) and any sort of
| conversion rate on recruitment is probably vanishingly small
| (which would normally be fine, but incidents like these may
| turn off some actual customers, which is where actual revenue
| comes from).
|
| I think their calculation (to the extent you can call it
| that) is that in the interest of PR and damage control, it's
| better to get a thorough postmortem out quickly to stem the
| bleeding and keep people like us from going "I can't wait to
| hear what happened at Cloudflare" for a week. Now we know,
| the customers have an explanation, and this bad news cycle
| has a higher chance of ending quickly.
| smugma wrote:
| I agree that this is a free ad/recruitment. However, it's
| easy to see how more conservative businesses see this as a
| risk. They are highlighting their deficiencies, letting their
| big, important clients know that human error can bring their
| network down.
|
| Additionally, these post-mortems work for Cloudflare because
| they have a great reputation and good uptime. If this were
| happening daily or weekly, it _would_ be a warning sign to
| customers.
|
| It's a strategy other companies could adopt, but to do it
| effectively requires changes all across the organization.
| saghm wrote:
| OTOH, I think most actual engineers would know that
| everywhere has deficiencies and can be brought down by
| human error, and I'd personally rather use a product where
| the people running it admit this rather than just claim
| that their genius engineers made it 100% foolproof and
| nothing could ever possibly go wrong.
| ethbr0 wrote:
| Absolutely. The first step of good SRE is admitting
| (publicly and within the organization) that you have a
| problem.
| solardev wrote:
| No provider is perfect, but it's because of stuff like this
| that I trust Cloudflare waaaaaaaaaaay more than the likes of
| Amazon. Transparency engenders trust, and eventually, love!
| Thank you, Cloudflare.
|
| The sheer level of technical competence of your engineering
| team continues to astound me. (Yes, they made a mistake and
| didn't catch an error in the diff. But your response process
| went exactly as it should, and your postmortem is excellent.) I
| couldn't even _begin_ to think about designing or implementing
| something of this complexity, much less being able to explain
| it to a layperson after a failure. It is really impressive, and
| I hope you will continue to do so into the future!
|
| Most of the companies I've worked for unfortunately don't use
| your services, but I've always been a staunch advocate and
| converted a few. Maybe the higher-ups only see downtime and
| name recognition (i.e. you're not Amazon), but for what it's
| worth, us devs down the ladder definitely notice your
| transparency and communications, and it means the world. I've
| learned to structure my own postmortems after yours, and it's
| really aided in internal communications.
|
| Thank you again. I can't wait for the day I get to work in a
| fully-Cloudflare stack :)
| nijave wrote:
| AWS is pretty decent if you're in an NDA contract (you have
| paid support). You can request RCAs for any incident you were
| impacted by, and you'll usually get them within a day.
|
| Not as transparent as "post it on the internet", but at least
| better than the usual hand-wavey bullshit.
| kevin_nisbet wrote:
| I agree, I think the transparency builds trust, and I encourage
| it where I can. The counter-thought I had when reading this
| case, though, is that it almost feels too fast. What I mean by
| that is I hope there isn't an incentive to wrap up the internal
| investigation quickly, write the blog, send it, and go "we're
| done."
|
| Doing incident response (both outage and security), the
| tactical fixes for a specific problem are usually pretty easy.
| We can fix a bug, or change this specific plan to avoid the
| problem. The search for the conditions that allowed the
| incident to occur can be a lot more time-consuming, and most
| organizations I've worked for are happy to make a couple of
| tactical changes and move on.
| cowsandmilk wrote:
| I have to agree.
| The environment that leads to a fast blog post may also lead
| to this quote from the post:
|
| > This was delayed as network engineers walked over each
| other's changes, reverting the previous reverts, causing the
| problem to re-appear sporadically.
|
| They are running as fast as they can, and this extended the
| incident. There is a "slow is smooth, smooth is fast" lesson
| in here. I'd rather have a team that takes a day to put up
| the blog post, but doesn't unnecessarily extend downtime
| because they are sprinting.
| jgrahamc wrote:
| There's normal operating procedure with sign-offs and
| automation etc., and then there's "we've lost contact
| with these data centers, normal procedures don't work, we
| need to break glass and use the secondary channels". In
| that situation you are in an emergency without normal
| visibility.
| bombcar wrote:
| It can be easy to armchair it afterwards, but unless
| things can be done in parallel (and systems should be
| designed so this can be done, things like "we're not sure
| what's wrong, we're bringing up a new cluster on the last
| known good version even as we try to repair this one") you
| have to make a choice, and sometimes it won't be optimal.
| jgrahamc wrote:
| _What I mean by that is I hope there isn't an incentive to
| wrap up the internal investigation quickly and write the blog
| and send it, and go we're done._
|
| There is not. From here there's an ongoing process with a
| formal post-mortem and all sorts of tickets tracking work to
| prevent further recurrence. This post is just the beginning
| internally.
| kortilla wrote:
| Well, Cloudflare's entire value is in uptime and preventing
| outages. Showing they have a rapid response and a strong
| fundamental technical understanding is much more critical in
| the "prevent downtime" business.
| tyingq wrote:
| > take weeks for other companies to publish a postmortem
|
| And with nowhere near the detail level of what was presented
| here. Typically lots of sweeping generalizations that don't
| tell you much about what happened, or give you any confidence
| they really know what happened or have the right fix in place.
| sschueller wrote:
| Node.js is still having issues. For example:
| https://nodejs.org/dist/v16.15.1/node-v16.15.1-darwin-x64.ta...
| doesn't download if you do "n lts".
| thomashabets2 wrote:
| tl;dr: Another BGP outage due to bad config changes.
|
| Here's a somewhat old (2016) but very impressive system at a
| major ISP for avoiding exactly this:
| https://www.youtube.com/watch?v=R_vCdGkGeSk
| johnklos wrote:
| ...and yet they still push so hard for recentralization of the
| web...
| goodpoint wrote:
| As if the centralization of email, social networks, VPS, and
| SaaS was not bad enough.
|
| It's pretty appalling that you are even being downvoted.
| samwillis wrote:
| CloudFlare are a hosting provider and CDN; they aren't
| "push[ing] ... hard for recentralization of the web".
|
| If it were AWS, Akamai, Google Cloud, or any of the other
| massive providers, this comment wouldn't be made. I don't
| really understand the association between centralisation and
| CloudFlare, other than it being a meme.
| viraptor wrote:
| It's often mentioned about AWS, especially when us-east-1
| fails. The others are not big enough to affect basically "the
| internet" when they go down, so they don't get pointed out as
| centralisation issues as much.
|
| And yeah, CF is trying to get as much traffic to go through
| them as possible and add edge services for more opportunities
| - that's literally their business. Also now R2 with object
| storage. They're already too big, harmful (as in actually
| putting people in danger), and untouchable in some ways.
| johnklos wrote:
| I think you've already drunk the Flavor Aid.
|
| What do you have when you have all DNS going through them,
| via DoH, and all web requests going through them, if not
| recentralization?
|
| Sure, they want us to think they give us the freedom to host
| our web sites anywhere because they're "protected" by them,
| but that "protection" means we've agreed to recentralize.
|
| It's pretty dismissive to describe something as a meme just
| because you don't understand it, and either you're pretending
| to not understand it, or you truly don't.
|
| Look at it this way: if a single company goes down for an
| hour, and that company going down for an hour causes half the
| web traffic on the Internet to fail for that hour, what is
| that if not recentralization?
| samwillis wrote:
| I understand that for their WAF, DDoS and threat detection
| products they need to have a very large amount of traffic
| going through them. They have been very aggressive with
| their free service to achieve that, to the benefit of all
| their customers (including the free ones). Some could see
| that as a push toward centralisation; I don't.
|
| What I don't understand, or believe, is that they want to
| be the sole (as in centralised) network for the internet. I
| don't believe they as a company, or the people running it,
| want that. They obviously have ambitions to be one of the
| largest networking/cloud providers, and are achieving that.
|
| I don't intend to dismiss your concerns either (which are a
| legitimate thing to have; centralisation would be very
| bad). My suggestion with the meme comment is that there is
| at times a trend to "brigade" on large, successful
| companies in a meme-like way. That isn't to suggest you
| were.
| johnklos wrote:
| They want to be a monopoly. They want everyone to depend
| on them. They may not want recentralization in general,
| but they definitely want as much of the Internet to
| depend on them as possible.
| philipwhiuk wrote:
| Is there no system to unit-test a rule-set?
| thesuitonym wrote:
| Where does one even start with learning BGP? It always seemed
| super interesting to me, but not really something that could be
| dealt with on a small-scale, lab-type basis. Or am I wrong there?
| _whiteCaps_ wrote:
| https://github.com/Exa-Networks/exabgp
|
| They've got some Docker examples in the README.
| ThaDood wrote:
| DN42 <https://dn42.eu/Home> gets mentioned a lot. It's basically
| a big dynamic VPN that you can do BGP stuff with. Pretty cool,
| but I could never get my node working properly.
| bpye wrote:
| I started setting that up and totally forgot; maybe I should
| actually try and peer with someone.
| jamal-kumar wrote:
| Nah, Cisco has labs you can download and learn from for their
| networking certifications, which are kinda the standard.
|
| Networking talent is kind of hard to find, and if you learn
| this, your chances of employment get pretty high.
| nonameiguess wrote:
| You can learn BGP with mininet: https://mininet.org/
|
| You can simulate arbitrarily large networks and internetworks
| with this, provided you have the hardware to run a large enough
| number of virtual appliances, but they are pretty lightweight.
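
For anyone taking nonameiguess's suggestion: the canonical Mininet
hello-world below builds a two-host topology and checks
connectivity. BGP experiments on top of this require a routing
daemon (e.g. FRR or Quagga) on the emulated nodes, which is assumed
rather than shown here:

    # Requires Mininet (https://mininet.org/); typically run as root.
    from mininet.net import Mininet
    from mininet.topo import Topo

    class TwoHosts(Topo):
        def build(self):
            s1 = self.addSwitch('s1')
            self.addLink(self.addHost('h1'), s1)
            self.addLink(self.addHost('h2'), s1)

    net = Mininet(topo=TwoHosts())
    net.start()
    net.pingAll()   # sanity check: h1 <-> h2 across s1
    # For BGP labs, start bgpd on router nodes here, each with its
    # own AS number and neighbor config.
    net.stop()
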
| Icathian wrote:
| Mininet is what the Georgia Tech OMSCS Computer Networking
| labs use. It's not bad; the two labs that stood out to me
| were using it to implement BGP and a Distance Vector Routing
| protocol.
___________________________________________________________________
(page generated 2022-06-21 23:00 UTC)