[HN Gopher] WAN router IP address change blamed for global Micro...
___________________________________________________________________

WAN router IP address change blamed for global Microsoft 365 outage

Author : mikece
Score  : 79 points
Date   : 2023-01-30 13:53 UTC (9 hours ago)

(HTM) web link (www.theregister.com)
(TXT) w3m dump (www.theregister.com)

| kloch wrote:
| > As part of a planned change to update the IP address on a WAN
| router, a command given to the router caused it to send messages
| to all other routers in the WAN, which resulted in all of them
| recomputing their adjacency and forwarding tables. During this
| re-computation process, the routers were unable to correctly
| forward packets traversing them. "The command that caused the
| issue has different behaviors on different network devices, and
| the command had not been vetted using our full qualification
| process on the router on which it was executed."
|
| From this it sounds like they might have changed the primary
| loopback IP, which by default is the "router-id" for various
| routing protocols, causing the entire network to have to
| reconverge. You can override the default router-id with an
| explicit address that does not depend on lo0, but lots of
| networks don't do that.
|
| It's extremely uncommon to change the primary loopback address.
| It's less uncommon to add an additional one, but as the article
| says, that syntax varies by vendor: Juniper will add as additional
| by default, Cisco and Arista will replace the existing primary
| one (IPv4) unless you include the "secondary" keyword...
| AdamJacobMuller wrote:
| I feel like they intended to /ADD/ a new loopback IP and in the
| process accidentally removed the existing one and replaced it,
| because I think anyone intentionally changing the loopback IP
| knows it's going to reset all bgp sessions. I think more modern
| cisco/arista platforms now add as "secondary" by default and
| perhaps that is what bit them?
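A minimal sketch of the failure mode kloch and AdamJacobMuller describe above: the same "add an address" command either appends to the loopback or silently replaces its primary address, and the router-id follows the primary unless one is set explicitly. Everything here is an assumption made for illustration: `Loopback`, `set_address`, `router_id`, and the `replaces_by_default` flag are hypothetical names, not any vendor's API, and the 192.0.2.x addresses are documentation-range placeholders. It is a toy model of the two CLI styles mentioned in the thread, not a claim about what was actually run at Microsoft.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Loopback:
    """Toy model of a loopback interface: one primary address plus extras."""
    primary: str
    secondaries: List[str] = field(default_factory=list)


def set_address(lo: Loopback, new_ip: str, *, replaces_by_default: bool,
                secondary: bool = False) -> Loopback:
    """Apply an 'add address' command under one of two hypothetical CLI styles.

    replaces_by_default=True:  the new address overwrites the primary unless
                               the operator also passes secondary=True.
    replaces_by_default=False: the new address is always appended.
    """
    if replaces_by_default and not secondary:
        return Loopback(primary=new_ip, secondaries=list(lo.secondaries))
    return Loopback(primary=lo.primary, secondaries=lo.secondaries + [new_ip])


def router_id(lo: Loopback, explicit: Optional[str] = None) -> str:
    """An explicit router-id wins; otherwise fall back to the primary loopback."""
    return explicit if explicit is not None else lo.primary


before = Loopback(primary="192.0.2.1")

# Same intent ("add 192.0.2.99"), three different outcomes.
replaced = set_address(before, "192.0.2.99", replaces_by_default=True)
appended = set_address(before, "192.0.2.99", replaces_by_default=False)
with_keyword = set_address(before, "192.0.2.99", replaces_by_default=True,
                           secondary=True)

for label, lo in [("replace style", replaced),
                  ("append style", appended),
                  ("replace style + secondary", with_keyword)]:
    changed = router_id(lo) != router_id(before)
    print(f"{label}: router-id {router_id(lo)} (changed: {changed})")

# Pinning the router-id decouples it from whatever happens to the loopback.
pinned_changed = (router_id(replaced, explicit="192.0.2.1")
                  != router_id(before, explicit="192.0.2.1"))
print(f"replace style, explicit router-id (changed: {pinned_changed})")
```

In this model only the "replace style" call without the keyword changes the router-id, which is the case that would force neighbors to reconverge; an explicitly pinned router-id stays put either way.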
| kemals wrote:
| This was a rather interesting event. In general, changing the
| IP address (even the loopback address) shouldn't have caused it
| from the BGP perspective. For example, if you were to change
| the IP address of a BGP-enabled router that has multiple BGP
| sessions, all other routers would tear down their sessions to it
| and withdraw the prefixes. BGP reconvergence events take time,
| but less than this one took (90+ minutes and then a few more
| hours until __full__ recovery).
|
| This seems like one of the events in which they changed the IP on
| Route Reflector routers that were pretty busy, which would
| cause reconvergence and CPU spikes for all routers that they had
| sessions with. Also, there was a lot of volatility, as part of
| which re-advertisements were happening continuously. They also
| attempted a rollback, which caused the reverse operation and
| triggered reconvergence again. The other scenario is doing this
| change on the SDN controller, which affected all other routers.
|
| More details:
| https://www.thousandeyes.com/blog/microsoft-outage-analysis-...
| https://www.thousandeyes.com/resources/na-microsoft-outage-a...
| nzgrover wrote:
| It's not DNS
|
| There's no way it's DNS
|
| It was DNS
|
| Credit: https://www.cyberciti.biz/humour/a-haiku-about-dns/
| jimmyl02 wrote:
| It seems that in modern large scale systems networking continues
| to be one of the few things where a seemingly small and
| inconsequential change can cause entire cloud providers and
| highly redundant systems to go down. It makes sense as networking
| is the fabric connecting all systems together but each time an
| incident like this occurs I'm reminded of just how important
| networking is.
|
| Network engineers and the people handling network ops always
| amaze me.
| iso1631 wrote:
| IME Network engineers put too much faith in vendors.
| They think "the vendor says this is a resilient virtual chassis
| so it can't break", rather than thinking "ok, if this breaks what
| happens"
|
| A crash affecting both sides of a "resilient" virtual chassis I
| had to work with took off a major broadcast last year (it was a
| last minute favour I was doing, and I rerouted to a tertiary
| route in a couple of minutes).
|
| Meanwhile I ran a rather large event going out to some hundred
| million listeners via two crappy PS300 switches which were
| completely independent of each other, into two independent
| routers, running via two separate systems (one on a UPS, one on
| mains). If one of them broke the other one was completely
| independent and the broadcast would have continued just fine.
|
| As far as I am concerned, that is far better than a virtual
| chassis.
| ccakes wrote:
| This may be true of enterprise network engineers but I've
| worked across a lot of very large networks (telco, not cloud)
| and we never _ever_ trust the vendor.
|
| The kinds of bugs that I've read about in errata notes over
| the years are wild and truly unpredictable.
| Spooky23 wrote:
| Enterprise is definitely different - network guys need
| multiple customers to develop the vendor skepticism. I used
| to get into brutal internal fights with network directors
| over whatever bullshit the Cisco salesman said offhand that
| was treated as though it was delivered by Moses off the
| mountain. One guy tried to get me fired because I offended
| an SE. lol.
|
| I worked on systems and platforms at the time, and we were
| more cynical even about vendors we liked.
| jacquesm wrote:
| It wouldn't be the first time that your redundant vendors
| end up sharing a conduit for a bunch of fiber somewhere.
| Guess where that backhoe will start digging?
| oarsinsync wrote:
| Redundant vendors in the GP's context referred to using
| multiple router vendors, e.g. Cisco and Juniper.
|
| Using multiple connectivity vendors doesn't guarantee
| path diversity. Demanding fibre maps and ensuring that
| your connectivity has separate points of entry into the
| building, doesn't cross outside the building, and
| validating with your DC provider that your cross connects
| aren't crossing either, guarantees path diversity /
| redundancy.
| jacquesm wrote:
| The GP was clearly talking about whole networks, not just
| the hardware vendors. If I read that differently than the
| GP intended, I'll wait for their correction.
|
| One of the problems that I've seen in practice is that,
| with the degree of virtualization at play, it has at the
| same time become much easier to _in principle_ be
| guaranteed 100% independence, while _in practice_ it has
| become much harder to verify that this is the case
| because of all of the abstraction layers underneath the
| topology. One of my customers specializes in software
| that allows one to make such guarantees and this is a
| non-trivial problem, to put it mildly, especially when
| the situation becomes more dynamic due to outages from
| various causes.
| jacquesm wrote:
| The network _is_ a single point of failure, even if the network
| itself is redundant!
| wmf wrote:
| One possible way to fix that is to replace _the_ network with
| multiple independent networks. It's really expensive though.
| jacquesm wrote:
| Yes, exactly. Most really mission critical places do
| exactly that.
|
| The first time I saw something like that put into practice
| was when an experiment in the oil and gas industry that was
| scheduled to run for years delivered their network design.
| On the runtime cost of the experiment the extra network
| wasn't a big deal, but a service interruption would have
| been and would have caused them to have to restart the
| whole thing from scratch.
| It's more than a decade ago and I
| forgot what the exact context was but the whole thing was
| fascinating from a redundancy perspective as well as the
| degree of thinking that had gone into the risk assessment.
| Those guys _really_ knew their business. Also the amount of
| data that experiment was expected to generate was off the
| scale. Multiple petabytes, which at the time (a decade ago
| or so) was a non-trivial amount of data.
| bogomipz wrote:
| This doesn't really make sense. The modern WAN operates on
| multiple independent networks - SD-WANs, multiple transit
| providers, fiber-ring MPLS, EVPN etc. If you propagate a
| bad network change throughout your autonomous system or
| backbone you can still have an outage on your hands.
| jgrahamc wrote:
| Having, uh, had bad things happen with router configuration I
| feel for them.
|
| https://blog.cloudflare.com/cloudflare-outage-on-july-17-202...
| newah1 wrote:
| I remember this happening. The 20-some sites we ran went down
| as they were supported by cloudflare. I spent a panicked 30
| minutes trying to figure out what I had done wrong, to
| eventually find out it was on CF's end.
|
| I remember voicing at our team meeting "boy, they must be
| panicking at CloudFlare."
|
| Cloudflare works so spectacularly we just wrote it off as a one
| time thing.
| jgrahamc wrote:
| There was no panic but there was a lot of VUF (Very Urgent
| Focus)!
| alex-mohr wrote:
| It does seem like network configuration remains rather manual
| compared to other large scale systems that include more
| automation.
|
| In Microsoft's case, the remediation is not to put in place
| higher level systems to safely accomplish the goal of the
| command.
| Instead:
|
| - "We have blocked highly impactful commands from getting
| executed on the devices (Completed)"
|
| - "We will require all command execution on the devices to
| follow safe change guidelines (Estimated completion: February
| 2023)"
|
| Requiring commands to follow guidelines sounds suspiciously
| like they're requiring network ops not to break things.
| iso1631 wrote:
| That's the norm in network ops. Automated testing is pretty
| much impossible, easy rollback may be possible depending on
| exactly what was screwed, but not always.
|
| Take this for example, looks like the problem was an
| unplanned recalculation of routing tables. That's not going
| to be the case on a small scale test network, and rolling
| back won't help, indeed in this case it likely would cause
| more problems.
| jsz0 wrote:
| One of the reasons I got out of network engineering was how
| frequently the work I was required to do would cause
| unintended consequences. You can do all your due diligence,
| get your work blessed by vendor support, and still get
| blown up by a bug or undocumented behaviors on a regular
| basis. The conspiratorial part of my brain says these
| network device makers intentionally provide unreliable
| software and terrible documentation to bolster their
| support contract profits. I was just the guy typing in the
| commands and getting all the blame.
| candiddevmike wrote:
| Networking and storage changes are always butt clenching
| affairs. Way more stressful than anything else in IT due
| to their blast radius if something shits the bed.
| spookthesunset wrote:
| I remember the first time I got access to an employer's
| production Cisco router. It's pretty scary how easy it is
| to majorly fuck something up.
|
| There isn't a concept of a transaction or a rollback. You
| just enter a command, press enter and it's live.
|
| To counter this we'd write all the commands we planned on
| executing and peer review them.
| Nothing was to be done "on
| the fly" (at least in theory)
|
| In short, coming from a developer perspective with ample
| version controls and gated releases... networking is a
| very wild ride.
| simoncion wrote:
| > There isn't a concept of a transaction or a rollback.
|
| Yeah, Cisco gear is bonkers.
|
| Mikrotik has "Safe Mode", which undoes all commands since
| you entered "Safe Mode" if the connection that created
| the shell gets interrupted. It has saved my bacon on
| several occasions, but there are several obvious
| situations in which you can get yourself locked out.
|
| Juniper gear has "commit confirmed $NUMBER_OF_MINUTES",
| which will roll back everything since your last commit if
| you don't do a "commit" within $NUMBER_OF_MINUTES. It
| will also apply all of the changes you've staged all at
| once (and do configuration sanity checking before it
| performs the commit).
|
| I have no idea how Juniper's rollback works when
| multiple users are doing simultaneous config editing...
| maybe don't do that?
| atxbcp wrote:
| That's not entirely true, you can rollback a change on
| modern switches/routers, either via a rollback command,
| or with a revert timer (configure terminal revert timer
| X) (because the new configuration might have made the
| router unreachable, so you're never sure you'll be able
| to rollback manually if you're working remotely).
| bogomipz wrote:
| >"There isn't a concept of a transaction or a rollback.
| You just enter a command, press enter and it's live."
|
| This hasn't been true for a very long time. Juniper
| routers have rollbacks, commits and revisions:
|
| https://www.juniper.net/documentation/us/en/software/junos/c...
|
| and
|
| https://www.juniper.net/documentation/us/en/software/junos/c...
|
| Cisco has similar:
|
| https://www.cisco.com/c/en/us/td/docs/ios/ios_xe/fundamental...
| ccakes wrote:
| Modern router operating systems have this.
|
| It's been a long time since I've touched IOS-XE (Cisco
| enterprise gear) but Cisco IOS-XR, Junos, Arista EOS and
| the Nokia SRs all support some combination of
| configuration transactions with rollback and commit
| confirm on a timer.
|
| This definitely doesn't stop you shooting yourself in the
| foot, similar to how you can still push broken config to
| a k8s controller, but it's some level of protection for
| certain types of changes.
| meltyness wrote:
| Interesting. There's also some stuff in Cisco that can't
| be done both atomically and remotely, so you may have to
| push a change as a file to the router and then source the
| file into the running config with some permutation of
| `copy`.
| meltyness wrote:
| Hadn't thought about it from the perspective of support
| contract profits, but they also have their friendship
| stick firmly planted in technicians via the semi-required
| training since, as you indicate, the manuals are deficient.
|
| At some point network vendors switched manuals from
| engineers documenting features whitebox to educated techs
| documenting features blackbox.
|
| There's a clear transition for docs produced after 2008,
| prior to which more care went into tech notes and
| interpreting technologies -- after that, you're lucky to even
| get a complete set of steps and caveats without having to
| cross-reference bugs, release notes, old-manuals,
| new-manuals, draft manuals, reference manuals, licensing
| manuals, the inevitable errors that appear in the logs,
| and of course the configuration guide where this should
| all be in the first place.
|
| In short, yes, this.
| Cyph0n wrote:
| > The conspiratorial part of my brain says these network
| device makers intentionally provide unreliable software
| and terrible documentation to bolster their support
| contract profits.
|
| As a dev who has worked at one of the major networking
| vendors, I can assure you that is not the case.
| You'd
| be surprised by how major bugs are handled internally,
| especially if the bug affects "important" customers.
| throw0101a wrote:
| > _That's the norm in network ops. Automated testing is
| pretty much impossible, easy rollback may be possible
| depending on exactly what was screwed, but not always._
|
| Ansible/Napalm is a thing in NetOps in some places. Some
| folks use Eve-ng / GNS3 to spin up virtual networks to test
| config changes, and it may be possible to do CI/CD changes
| if you track things in a repo.
|
| Juniper JunOS has auto-rollback if you don't confirm the
| change after "x" minutes:
|
| * https://www.juniper.net/documentation/us/en/software/junos/c...
|
| So if you did something that causes breakage and
| disconnection from the router, you (ideally) don't have to
| do anything but wait it out.
| AdamJacobMuller wrote:
| commit confirmed is such a life-saver. I ran a production
| network which spanned multiple continents and even though
| I probably only ever actually needed commit confirmed a
| single digit number of times, the fact that it was there
| made every change I did 99% less stressful. I knew that
| even if I made a mistake, all I had to do was wait 5-10
| minutes and it would all revert.
|
| Compare this to my cisco/foundry/other experience where I
| would delay changes until I was in the office (physically
| colocated with main routers) or calling people to be
| onsite for what was 99% of the time an innocuous change.
| The stress of it led to me deferring changes or just
| skipping them entirely which led to more
| issues/stress/etc.
|
| I'm really not sure there is a single software feature
| which improved my life as much as "commit confirmed".
| bogomipz wrote:
| Your CEO sure doesn't seem to have much empathy when it's
| someone else though:
|
| https://twitter.com/eastdakota/status/1143182575680143361
| latchkey wrote:
| How about describing how you implement systems that prevent
| this?
| You kind of talk about what was 'fixed', but not how.
| CI/CD is pretty hard to do for global networking changes. I'm
| sure whatever CF has done in this area is a lot of magic sauce
| and it would be super interesting to learn more about it, even
| at a high level.
| libraryatnight wrote:
| Holy shit I have been there and it sucks. I wasn't the guy who
| made the change, but I was on the long call that followed.
| int0x2e wrote:
| Time to share one of my favorite talks (and speakers) ever -
|
| "Debugging Under Fire: Keep your Head when Systems have Lost
| their Mind" (Bryan Cantrill, GOTO 2017)
|
| https://www.youtube.com/watch?v=30jNsCVLpAE
| libraryatnight wrote:
| This was an awesome lunch listen, thank you for sharing!
| rexarex wrote:
| The curse of network engineering. You're invisible and
| insignificant when everything is running well, and public enemy
| number one if you make a mistake!
| asmor wrote:
| Security too.
| septune wrote:
| I miss the old days of IOS :
|
| switchport trunk allowed vlan (add) xxx
|
| Can't imagine how many outages were caused by the missing
| << add >> command.
| candiddevmike wrote:
| Too many Cisco commands would truncate the syntax if you didn't
| know better:
|
| no access-list 101 permit something
|
| so long access-list 101!
| raffraffraff wrote:
| Token-ring network. Someone configured their printer to use the
| gateway address in the IP address field. Idiot. "Turn off all
| devices on the internet, then turn them all on again one by one
| until we find the bastard who did this"
___________________________________________________________________
(page generated 2023-01-30 23:00 UTC)