[HN Gopher] A coding error caused Rogers outage that left millio...
___________________________________________________________________
A coding error caused Rogers outage that left millions without service

Author : kelseydh
Score : 32 points
Date : 2022-07-25 22:21 UTC (38 minutes ago)

(HTM) web link (web.archive.org)
(TXT) w3m dump (web.archive.org)

| gwen-shapira wrote:
| "Although Bell and Telus offered to help, Rogers quickly
| determined that it would not be able to transfer its customers to
| its rivals' networks because certain elements of the Rogers
| network, such as its centralized user database, were inaccessible
| as a result of the outage."
|
| It sounds like their control/management plane (with the user's
| database) was dependent on their data plane. So a data plane
| outage was more challenging to mitigate than it should have been
| in a decoupled architecture. Good lesson for any architecture.

| mabbo wrote:
| Both my home internet and mobile are through Rogers (for the time
| being). I had no access to the internet for 15 hours, 6am to 9pm.
| Couldn't do my job as a remote developer.
|
| And all through the day I kept thinking to myself "I bet someone
| pushed an update to prod, causing this. And I'm glad that this
| time it wasn't me."

| amatecha wrote:
| You can see Rogers' own report (with some redactions) as provided
| to the CRTC. See the doc linked under the first (2022-07-22)
| heading here:
| https://crtc.gc.ca/otf/eng/2022/8000/c12-202203868.htm

| Victerius wrote:
| The key bit:
|
| > But, at 4:43 a.m. on July 8, a piece of code was introduced
| that deleted a routing filter. In telecom networks, packets of
| data are guided and directed by devices called routers, and
| filters prevent those routers from becoming overwhelmed, by
| limiting the number of possible routes that are presented to
| them.
|
| > Deleting the filter caused all possible routes to the internet
| to pass through the routers, resulting in several of the devices
| exceeding their memory and processing capacities. This caused the
| core network to shut down.
|
| Lesson no 1: Do not design your system to have a single point of
| failure.

| erentz wrote:
| > But, in the early hours, the company's technicians had not
| yet pinpointed the cause of the catastrophe. Rogers apparently
| considered the possibility that its networks had been attacked
| by cybercriminals.
|
| I mean, if you just pushed a config change and the whole
| network goes kaput, take a look at the config change before you
| start suspecting hackers.

| pitched wrote:
| I heard that the teams were having trouble communicating with
| each other and so the ones who pushed the config might not
| have been the ones looking for hackers.
|
| This is why some hospitals still use the old pager systems to
| contact people in the city. One hospital-owned antenna on a
| battery can coordinate a lot of people. I don't know what the
| equivalent to that would be in this case though.

| chx wrote:
| Ham radio.
|
| It still works, you know?
|
| Also, pagerduty works over wifi...
___________________________________________________________________
(page generated 2022-07-25 23:00 UTC)