[HN Gopher] A coding error caused Rogers outage that left millio...
       ___________________________________________________________________
        
       A coding error caused Rogers outage that left millions without
       service
        
       Author : kelseydh
       Score  : 32 points
       Date   : 2022-07-25 22:21 UTC (38 minutes ago)
        
 (HTM) web link (web.archive.org)
 (TXT) w3m dump (web.archive.org)
        
       | gwen-shapira wrote:
       | "Although Bell and Telus offered to help, Rogers quickly
       | determined that it would not be able to transfer its customers to
       | its rivals' networks because certain elements of the Rogers
       | network, such as its centralized user database, were inaccessible
       | as a result of the outage."
       | 
       | It sounds like their control/management plane (including the user
       | database) depended on their data plane, so a data plane outage
       | was harder to mitigate than it would have been in a decoupled
       | architecture. A good lesson for any architecture.
        
       | mabbo wrote:
       | Both my home internet and mobile are through Rogers (for the time
       | being). I had no access to the internet for 15 hours, 6am to 9pm.
       | Couldn't do my job as a remote developer.
       | 
       | And all through the day I kept thinking to myself "I bet someone
       | pushed an update to prod, causing this. And I'm glad that this
       | time it wasn't me."
        
       | amatecha wrote:
       | You can see Rogers' own report (with some redactions) as provided
       | to the CRTC. See the doc linked under the first (2022-07-22)
       | heading here:
       | https://crtc.gc.ca/otf/eng/2022/8000/c12-202203868.htm
        
       | Victerius wrote:
       | The key bit:
       | 
       | > But, at 4:43 a.m. on July 8, a piece of code was introduced
       | that deleted a routing filter. In telecom networks, packets of
       | data are guided and directed by devices called routers, and
       | filters prevent those routers from becoming overwhelmed, by
       | limiting the number of possible routes that are presented to
       | them.
       | 
       | > Deleting the filter caused all possible routes to the internet
       | to pass through the routers, resulting in several of the devices
       | exceeding their memory and processing capacities. This caused the
       | core network to shut down.
       | 
       | Lesson no. 1: Do not design your system with a single point of
       | failure.
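
To make the quoted failure mode concrete, here is a toy sketch of a route filter protecting a router with limited table capacity. This is not Rogers' actual configuration or any vendor's API: `Router`, `route_filter`, `advertise`, and the capacity figure are hypothetical names and numbers chosen purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    # Hypothetical limit on how many routes this device can hold.
    capacity: int
    table: set = field(default_factory=set)

    def install(self, prefixes):
        """Install routes; raise MemoryError once the table is full."""
        for prefix in prefixes:
            if prefix not in self.table and len(self.table) >= self.capacity:
                raise MemoryError("routing table full")
            self.table.add(prefix)


def route_filter(prefixes, allowed):
    """Keep only the prefixes that appear in the allow-list."""
    return [p for p in prefixes if p in allowed]


def advertise(router, prefixes, allowed=None):
    """Offer routes to a router, optionally through a filter.

    Returns True if the router accepted everything offered,
    False if it ran out of capacity.
    """
    offered = prefixes if allowed is None else route_filter(prefixes, allowed)
    try:
        router.install(offered)
        return True
    except MemoryError:
        return False
```

With the filter in place the router only ever sees the allow-listed prefixes. Passing `allowed=None` stands in for the deleted filter: the full table is offered and a router whose capacity is smaller than that table fails.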
        
         | erentz wrote:
         | > But, in the early hours, the company's technicians had not
         | yet pinpointed the cause of the catastrophe. Rogers apparently
         | considered the possibility that its networks had been attacked
         | by cybercriminals.
         | 
         | I mean, if you just pushed a config change and the whole
         | network goes kaput, take a look at the config change before you
         | start suspecting hackers.
        
           | pitched wrote:
           | I heard that the teams were having trouble communicating with
           | each other and so the ones who pushed the config might not
           | have been the ones looking for hackers.
           | 
           | This is why some hospitals still use the old pager systems to
           | contact people in the city. One hospital-owned antenna on a
           | battery can coordinate a lot of people. I don't know what the
           | equivalent to that would be in this case though.
        
             | chx wrote:
             | Ham radio.
             | 
             | It still works, you know?
             | 
             | Also, PagerDuty works over Wi-Fi...
        
       ___________________________________________________________________
       (page generated 2022-07-25 23:00 UTC)