[HN Gopher] WAN router IP address change blamed for global Micro...
       ___________________________________________________________________
        
       WAN router IP address change blamed for global Microsoft 365 outage
        
       Author : mikece
       Score  : 79 points
       Date   : 2023-01-30 13:53 UTC (9 hours ago)
        
 (HTM) web link (www.theregister.com)
 (TXT) w3m dump (www.theregister.com)
        
       | kloch wrote:
       | > As part of a planned change to update the IP address on a WAN
       | router, a command given to the router caused it to send messages
       | to all other routers in the WAN, which resulted in all of them
       | recomputing their adjacency and forwarding tables. During this
       | re-computation process, the routers were unable to correctly
       | forward packets traversing them. "The command that caused the
       | issue has different behaviors on different network devices, and
       | the command had not been vetted using our full qualification
       | process on the router on which it was executed."
       | 
       | From this it sounds like they might have changed the primary
       | loopback IP, which by default is the "router-id" for various
       | routing protocols, causing the entire network to have to
       | reconverge. You can override the default router-id with an
       | explicit address that does not depend on lo0 but lots of networks
       | don't do that.
       | 
       | It's extremely uncommon to change the primary loopback address.
       | It's less uncommon to add an additional one but as the article
       | says that syntax varies by vendor: Juniper will add as additional
       | by default, Cisco and Arista will replace the existing primary
       | one (IPv4) unless you include the "secondary" keyword...
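
A minimal Python sketch of the behaviour kloch describes, under a toy model. `Loopback`, `set_address` and `router_id` are hypothetical names for illustration, not any vendor's API, and the default router-id selection is simplified to "use the primary loopback address". It shows how a replace-by-default address command changes an implicit router-id, while an explicitly pinned router-id is unaffected.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Loopback:
    """Toy model of a loopback interface (not a real vendor API)."""
    primary: Optional[str] = None
    secondaries: List[str] = field(default_factory=list)

    def set_address(self, addr: str, secondary: bool = False) -> None:
        # Cisco/Arista-style semantics as described in the comment: without
        # the "secondary" keyword the existing primary address is replaced.
        if secondary:
            self.secondaries.append(addr)
        else:
            self.primary = addr


def router_id(lo: Loopback, explicit: Optional[str] = None) -> Optional[str]:
    # An explicitly configured router-id wins; otherwise fall back to the
    # primary loopback address (simplified default selection).
    return explicit if explicit is not None else lo.primary


lo0 = Loopback(primary="10.0.0.1")
before = router_id(lo0)
lo0.set_address("10.0.0.2")                    # meant as "add", acts as "replace"
after_replace = router_id(lo0)

lo1 = Loopback(primary="10.0.0.1")
lo1.set_address("10.0.0.2", secondary=True)    # "secondary" keeps the primary
after_add = router_id(lo1)

pinned = router_id(lo0, explicit="192.0.2.1")  # does not depend on lo0
```

In this model only the first case changes the router-id, which is the change that would force neighbours to reconverge.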
        
         | AdamJacobMuller wrote:
         | I feel like they intended to /ADD/ a new loopback IP and in the
         | process accidentally removed and replaced the existing one,
         | because I think anyone intentionally changing the loopback IP
         | knows it's going to reset all bgp sessions. I think more modern
         | cisco/arista platforms now default to "secondary" and perhaps
         | that is what bit them?
        
         | kemals wrote:
         | This was a rather interesting event. In general, changing the
         | IP address (even the loopback address) shouldn't have caused it
         | from the BGP perspective. For example, if you were to change
         | the IP address of a BGP-enabled router that has multiple BGP
         | sessions, all other routers would tear down their sessions to
         | it and withdraw the prefixes. BGP reconvergence events take
         | time, but less than this one took (90+ minutes and then a few
         | more hours until __full__ recovery).
         | 
         | This seems like one of the events in which they changed IP on
         | Route Reflector routers that were pretty busy, which would
         | cause reconvergence and CPU spikes for all routers that it had
         | sessions with. Also, there was a lot of volatility, as part of
         | which re-advertisements were happening continuously. They also
         | attempted rollback, which caused reverse operation, which
         | triggered reconvergence. The other scenario is doing this
         | change on the SDN controller, which affected all other routers.
         | 
         | More details: https://www.thousandeyes.com/blog/microsoft-
         | outage-analysis-... https://www.thousandeyes.com/resources/na-
         | microsoft-outage-a...
        
       | nzgrover wrote:
       | It's not DNS
       | 
       | There's no way it's DNS
       | 
       | It was DNS
       | 
       | Credit: https://www.cyberciti.biz/humour/a-haiku-about-dns/
        
       | jimmyl02 wrote:
       | It seems that in modern large scale systems networking continues
       | to be one of the few things where a seemingly small and
       | inconsequential change can cause entire cloud providers and
       | highly redundant systems to go down. It makes sense as networking
       | is the fabric connecting all systems together but each time an
       | incident like this occurs I'm reminded of just how important
       | networking is.
       | 
       | Network engineers and the people handling network ops always
       | amaze me.
        
         | iso1631 wrote:
         | IME Network engineers put too much faith in vendors. They think
         | "the vendor says this is a resilient virtual chassis so it
         | can't break", rather than thinking "ok, if this breaks what
         | happens"
         | 
         | A crash affecting both sides of a "resilient" virtual chassis I
         | had to work with took off a major broadcast last year (it was a
         | last minute favour I was doing, and I rerouted to a tertiary
         | route in a couple of minutes).
         | 
         | Meanwhile I ran a rather large event going out to some hundred
         | million listeners via two crappy PS300 switches which were
         | completely independent of each other, into two independent
         | routers, running via two separate systems (one on a UPS, one on
         | mains). If one of them broke the other one was completely
         | independent and the broadcast would have continued just fine.
         | 
         | As far as I am concerned, that is far better than a virtual
         | chassis.
        
           | ccakes wrote:
           | This may be true of enterprise network engineers but I've
           | worked across a lot of very large networks (telco, not cloud)
           | and we never _ever_ trust the vendor.
           | 
           | The kind of bugs that I've read about in errata notes over
           | the years is wild and truly unpredictable.
        
             | Spooky23 wrote:
             | Enterprise is definitely different - network guys need
             | multiple customers to develop the vendor skepticism. I used
             | to get into brutal internal fights with network directors
             | over whatever bullshit the Cisco salesman said offhand that
             | was treated as though it was delivered by Moses off the
             | mountain. One guy tried to get me fired because I offended
             | an SE. lol.
             | 
             | I worked on systems and platforms at the time, and we were
             | more cynical even about vendors we liked.
        
             | jacquesm wrote:
             | It wouldn't be the first time that your redundant vendors
             | end up sharing a conduit for a bunch of fiber somewhere.
             | Guess where that backhoe will start digging?
        
               | oarsinsync wrote:
               | Redundant vendors in the GP's context referred to using
               | multiple router vendors, eg Cisco and Juniper.
               | 
               | Using multiple connectivity vendors doesn't guarantee
               | path diversity. Demanding fibre maps, ensuring that
               | your connectivity has separate points of entry into the
               | building and doesn't cross outside the building, and
               | validating with your DC provider that your cross connects
               | aren't crossing either, does guarantee path diversity /
               | redundancy.
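
The fibre-map check described above can be sketched in a few lines of Python. This is an illustration only: the segment identifiers and the `shared_segments` helper are hypothetical, and it assumes each carrier's path has already been reduced to a list of physical segments (entry points, conduits, cross connects).

```python
from typing import Iterable, Set


def shared_segments(path_a: Iterable[str], path_b: Iterable[str]) -> Set[str]:
    """Return the physical segments that appear on both paths."""
    return set(path_a) & set(path_b)


# Hypothetical segment identifiers read off two carriers' fibre maps.
carrier_a = ["entry-north", "conduit-17", "pop-east"]
carrier_b = ["entry-south", "conduit-17", "pop-west"]
carrier_c = ["entry-south", "conduit-42", "pop-west"]

overlap_ab = shared_segments(carrier_a, carrier_b)  # share a conduit
overlap_ac = shared_segments(carrier_a, carrier_c)  # no shared segment
```

An empty result only says the maps show no overlap; it is only as trustworthy as the maps themselves.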
        
               | jacquesm wrote:
               | The GP was clearly talking about whole networks, not just
               | the hardware vendors; if I read that differently than the
               | GP intended I'll wait for their correction.
               | 
               | One of the problems that I've seen in practice is that,
               | with the degree of virtualization at play, it has at the
               | same time become much easier to _in principle_ be
               | guaranteed 100% independence and _in practice_ it has
               | become much harder to verify that this is the case
               | because of all of the abstraction layers underneath the
               | topology. One of my customers specializes in software
               | that allows one to make such guarantees and this is a
               | non-trivial problem, to put it mildly, especially when
               | the situation becomes more dynamic due to outages from
               | various causes.
        
         | jacquesm wrote:
         | The network _is_ a single point of failure, even if the network
         | itself is redundant!
        
           | wmf wrote:
           | One possible way to fix that is to replace _the_ network with
           | multiple independent networks. It's really expensive though.
        
             | jacquesm wrote:
             | Yes, exactly. Most really mission critical places do
             | exactly that.
             | 
             | The first time I saw something like that put into practice
             | was when an experiment in the oil and gas industry that was
             | scheduled to run for years delivered their network design.
             | On the runtime cost of the experiment the extra network
             | wasn't a big deal, but a service interruption would have
             | been and would have caused them to have to restart the
             | whole thing from scratch. It's more than a decade ago and I
             | forgot what the exact context was but the whole thing was
             | fascinating from a redundancy perspective as well as the
             | degree of thinking that had gone into the risk assessment.
             | Those guys _really_ knew their business. Also the amount of
             | data that experiment was expected to generate was off the
             | scale. Multiple petabytes, which at the time (a decade ago
             | or so) was a non trivial amount of data.
        
             | bogomipz wrote:
             | This doesn't really make sense. The modern WAN operates on
             | multiple independent networks - SD-WANs, multiple transit
             | providers, fiber-ring MPLS, EVPN etc. If you propagate a
             | bad network change throughout your autonomous system or
             | backbone you can still have an outage on your hands.
        
       | jgrahamc wrote:
       | Having, uh, had bad things happen with router configuration I
       | feel for them.
       | 
       | https://blog.cloudflare.com/cloudflare-outage-on-july-17-202...
        
         | newah1 wrote:
         | I remember this happening. The 20 some sites we ran went down
         | as they were supported by cloudflare. I spent a panicked 30
         | minutes trying to figure out what I had done wrong, to
         | eventually find out it was on CF's end.
         | 
         | I remember voicing at our team meeting "boy, they must be
         | panicking at CloudFlare."
         | 
         | Cloudflare works so spectacularly we just wrote it off as a one
         | time thing.
        
           | jgrahamc wrote:
           | There was no panic but there was a lot of VUF (Very Urgent
           | Focus)!
        
         | alex-mohr wrote:
         | It does seem like network configuration remains rather manual
         | compared to other large scale systems that include more
         | automation.
         | 
         | In Microsoft's case, the remediation is not to put in place
         | higher level systems to safely accomplish the goal of the
         | command. Instead:
         | 
         | - "We have blocked highly impactful commands from getting
         | executed on the devices (Completed)"
         | 
         | - "We will require all command execution on the devices to
         | follow safe change guidelines (Estimated completion: February
         | 2023)"
         | 
         | Requiring commands to follow guidelines sounds suspiciously
         | like they're requiring network ops not to break things.
        
           | iso1631 wrote:
           | That's the norm in network ops. Automated testing is pretty
           | much impossible, easy rollback may be possible depending on
           | exactly what was screwed, but not always.
           | 
           | Take this for example, looks like the problem was an
           | unplanned recalculation of routing tables. That's not going
           | to be the case on a small scale test network, and rolling
           | back won't help, indeed in this case it likely would cause
           | more problems.
        
             | jsz0 wrote:
             | One of the reasons I got out of network engineering was how
             | frequently the work I was required to do would cause
             | unintended consequences. You can do all your due diligence,
             | get your work blessed by vendor support, and still get
             | blown up by a bug or undocumented behaviors on a regular
             | basis. The conspiratorial part of my brain says these
             | network device makers intentionally provide unreliable
             | software and terrible documentation to bolster their
             | support contract profits. I was just the guy typing in the
             | commands and getting all the blame.
        
               | candiddevmike wrote:
               | Networking and storage changes are always butt clenching
               | affairs. Way more stressful than anything else in IT due
               | to their blast radius if something shits the bed.
        
               | spookthesunset wrote:
               | I remember the first time I got access to an employer's
               | production Cisco router. It's pretty scary how easy it is
               | to majorly fuck something up.
               | 
               | There isn't a concept of a transaction or a rollback. You
               | just enter a command, press enter and it's live.
               | 
               | To counter this we'd write all the commands we planned on
               | executing and peer review it. Nothing was to be done "on
               | the fly" (at least in theory)
               | 
               | In short, coming from a developer perspective with ample
               | version controls and gated releases... networking is a
               | very wild ride.
        
               | simoncion wrote:
               | > There isn't a concept of a transaction or a rollback.
               | 
               | Yeah, Cisco gear is bonkers.
               | 
               | Mikrotik has "Safe Mode", which undoes all commands since
               | you entered "Safe Mode" if the connection that created
               | the shell gets interrupted. It has saved my bacon on
               | several occasions, but there are several obvious
               | situations in which you can get yourself locked out.
               | 
               | Juniper gear has "commit confirmed $NUMBER_OF_MINUTES",
               | which will roll back everything since your last commit if
               | you don't do a "commit" within $NUMBER_OF_MINUTES. It
               | will also apply all of the changes you've staged all at
               | once (and do configuration sanity checking before it
               | performs the commit).
               | 
               | I have no idea how Juniper's rollback works when
               | multiple users are doing simultaneous config editing...
               | maybe don't do that?
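
A rough Python sketch of the "commit confirmed" idea described above, using a simulated clock. The `Router` class and its method names are hypothetical and only model the behaviour as the comment describes it (staged changes applied together, automatic revert unless confirmed in time); this is not Junos itself and ignores multi-user editing.

```python
from typing import Dict, Optional


class Router:
    """Toy model of a candidate/active configuration with a confirm timer."""

    def __init__(self, config: Dict[str, str]) -> None:
        self.active = dict(config)
        self.candidate = dict(config)
        self._previous: Optional[Dict[str, str]] = None
        self._deadline: Optional[float] = None

    def set(self, key: str, value: str) -> None:
        # Edits only touch the candidate; nothing is live yet.
        self.candidate[key] = value

    def commit_confirmed(self, now: float, minutes: float) -> None:
        # Apply all staged changes at once and start the rollback timer.
        self._previous = dict(self.active)
        self.active = dict(self.candidate)
        self._deadline = now + minutes * 60

    def commit(self) -> None:
        # Confirmation: keep the new config and cancel the pending rollback.
        self.active = dict(self.candidate)
        self._previous = None
        self._deadline = None

    def tick(self, now: float) -> None:
        # Called periodically; reverts if no confirmation arrived in time.
        if self._deadline is not None and now >= self._deadline:
            self.active = dict(self._previous)
            self.candidate = dict(self._previous)
            self._previous = None
            self._deadline = None


# Case 1: the change locks the operator out, so no confirmation arrives.
r = Router({"lo0": "10.0.0.1"})
r.set("lo0", "10.0.0.2")
r.commit_confirmed(now=0, minutes=5)
live_during_window = r.active["lo0"]
r.tick(now=301)
live_after_timeout = r.active["lo0"]

# Case 2: the operator confirms within the window.
r2 = Router({"lo0": "10.0.0.1"})
r2.set("lo0", "10.0.0.2")
r2.commit_confirmed(now=0, minutes=5)
r2.commit()
r2.tick(now=301)
live_after_confirm = r2.active["lo0"]
```

The useful property is that the safe outcome requires no action: doing nothing restores the previous configuration.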
        
               | atxbcp wrote:
               | That's not entirely true, you can rollback a change on
               | modern switches/routers, either via a rollback command,
               | or with a revert timer (configure terminal revert timer
               | X) (because the new configuration might have made the
               | router unreachable, so you're never sure you'll be able
               | to rollback manually if you're working remotely).
        
               | bogomipz wrote:
               | >"There isn't a concept of a transaction or a rollback.
               | You just enter a command, press enter and it's live."
               | 
               | This hasn't been true for a very long time. Juniper
               | routers have rollbacks, commits and revisions:
               | 
               | https://www.juniper.net/documentation/us/en/software/juno
               | s/c...
               | 
               | and
               | 
               | https://www.juniper.net/documentation/us/en/software/juno
               | s/c...
               | 
               | Cisco has similar:
               | 
               | https://www.cisco.com/c/en/us/td/docs/ios/ios_xe/fundamen
               | tal...
        
               | ccakes wrote:
               | Modern router operating systems have this.
               | 
               | It's been a long time since I've touched IOS-XE (Cisco
               | enterprise gear) but Cisco IOS-XR, Junos, Arista EOS and
               | the Nokia SRs all support some combination of
               | configuration transactions with rollback and commit
               | confirm on a timer
               | 
               | This definitely doesn't stop you shooting yourself in the
               | foot, similar to how you can still push broken config to
               | a k8s controller, but it's some level of protection for
               | certain types of changes.
        
               | meltyness wrote:
               | Interesting. There's also some stuff in Cisco that can't
               | be done both atomically and remotely, so you may have to
               | push a change as a file to the router and then source the
               | file into the running config with some permutation of
               | `copy`.
        
               | meltyness wrote:
               | Hadn't thought about it from the perspective of support
               | contract profits, but they also have their friendship
               | stick firmly planted in technicians via the semi-required
               | training since as you indicate the manuals are deficient.
               | 
               | At some point network vendors switched manuals from
               | engineers documenting features whitebox to educated techs
               | documenting features blackbox.
               | 
               | There's a clear transition for docs produced after 2008,
               | prior to which more care went into tech notes and
               | interpreting technologies -- after that you're lucky to even
               | get a complete set of steps and caveats without having to
               | cross-reference bugs, release notes, old-manuals, new-
               | manuals, draft manuals, reference manuals, licensing
               | manuals, the inevitable errors that appear in the logs,
               | and of course the configuration guide where this should
               | all be in the first place.
               | 
               | In short, yes, this.
        
               | Cyph0n wrote:
               | > The conspiratorial part of my brain says these network
               | device makers intentionally provide unreliable software
               | and terrible documentation to bolster their support
               | contract profits.
               | 
               | As a dev who has worked at one of the major networking
               | vendors, I can assure you that is not the case. You'd
               | be surprised by how major bugs are handled internally,
               | especially if the bug affects "important" customers.
        
             | throw0101a wrote:
             | > _That's the norm in network ops. Automated testing is
             | pretty much impossible, easy rollback may be possible
             | depending on exactly what was screwed, but not always._
             | 
             | Ansible/Napalm is a thing in NetOps in some places. Some
             | folks use Eve-ng / GNS3 to spin up virtual networks to test
             | config changes, and it may be possible to do CI/CD changes
             | if you track things in a repo.
             | 
             | Juniper JunOS has auto-rollback if you don't confirm the
             | change after "x" minutes:
             | 
             | * https://www.juniper.net/documentation/us/en/software/juno
             | s/c...
             | 
             | So if you did something that causes breakage and
             | disconnection from the router, you (ideally) don't have to
             | do anything but wait it out.
        
               | AdamJacobMuller wrote:
               | commit confirmed is such a life-saver. I ran a production
               | network which spanned multiple continents and even though
               | I probably only ever actually needed commit confirmed a
               | single digit number of times, the fact that it was there
               | made every change I did 99% less stressful. I knew that
               | even if I made a mistake, all I had to do was wait 5-10
               | minutes and it would all revert.
               | 
               | Compare this to my cisco/foundry/other experience where I
               | would delay changes until I was in the office (physically
               | colocated with main routers) or calling people to be
               | onsite for what was 99% of the time an innocuous change.
               | The stress of it led to me deferring changes or just
               | skipping them entirely which led to more
               | issues/stress/etc.
               | 
               | I'm really not sure there is a single software feature
               | which improved my life as much as "commit confirmed"
        
         | bogomipz wrote:
         | Your CEO sure doesn't seem to have much empathy when it's
         | someone else though:
         | 
         | https://twitter.com/eastdakota/status/1143182575680143361
        
         | latchkey wrote:
         | How about describing how you implement systems that prevent
         | this? You kind of talk about what was 'fixed', but not how.
         | CI/CD is pretty hard to do for global networking changes. I'm
         | sure whatever CF has done in this area is a lot of magic sauce
         | and it would be super interesting to learn more about it, even
         | at a high level.
        
         | libraryatnight wrote:
         | Holy shit I have been there and it sucks. I wasn't the guy who
         | made the change, but I was on the long call that followed.
        
           | int0x2e wrote:
           | Time to share one of my favorite talks (and speakers) ever -
           | 
           | "Debugging Under Fire: Keep your Head when Systems have Lost
           | their Mind" (Bryan Cantrill, GOTO 2017)
           | 
           | https://www.youtube.com/watch?v=30jNsCVLpAE
        
             | libraryatnight wrote:
             | This was an awesome lunch listen, thank you for sharing!
        
       | rexarex wrote:
       | The curse of network engineering. You're invisible and
       | insignificant when everything is running well, and public enemy
       | number one if you make a mistake!
        
         | asmor wrote:
         | Security too.
        
       | septune wrote:
       | I miss the old days of IOS :
       | 
       | switchport trunk allowed vlan (add) xxx
       | 
       | Can't imagine how many outages were caused by the missing << add
       | >> command.
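
The difference the "add" keyword makes can be shown with a small Python model. `allowed_vlan` is a hypothetical helper that treats the trunk's allowed list as a set; it is a sketch of the replace-versus-extend semantics, not router code.

```python
from typing import Iterable, Set


def allowed_vlan(current: Set[int], vlans: Iterable[int], add: bool = False) -> Set[int]:
    """Toy model: without `add`, the allowed list is replaced, not extended."""
    return (current | set(vlans)) if add else set(vlans)


trunk = {10, 20, 30}
with_add = allowed_vlan(trunk, [40], add=True)   # existing VLANs kept
without_add = allowed_vlan(trunk, [40])          # VLANs 10, 20, 30 dropped
```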
        
         | candiddevmike wrote:
         | Too many Cisco commands would truncate the syntax if you didn't
         | know better:
         | 
         | no access-list 101 permit something
         | 
         | so long access-list 101!
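
A toy Python model of the pitfall described above, where negating one entry of a numbered access list deletes the whole list. `no_access_list` and the sample entries are hypothetical and only mirror the behaviour as the comment describes it.

```python
from typing import Dict, List


def no_access_list(acls: Dict[int, List[str]], number: int, entry: str = "") -> None:
    """Toy model: the trailing entry is ignored and the whole numbered
    list is deleted."""
    acls.pop(number, None)


acls = {
    101: ["permit tcp any any eq 80", "deny ip any any"],
    102: ["permit ip any any"],
}
# The operator intends to remove a single line from list 101.
no_access_list(acls, 101, "permit tcp any any eq 80")
```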
        
       | raffraffraff wrote:
       | Token-ring network. Someone configured their printer to use the
       | gateway address in the ip address field. Idiot. "Turn off all
       | devices on the internet, then turn them all on again one by one
       | until we find the bastard who did this"
        
       ___________________________________________________________________
       (page generated 2023-01-30 23:00 UTC)