[HN Gopher] We unplugged a data center to test our disaster read...
___________________________________________________________________

We unplugged a data center to test our disaster readiness

Author : ianrahman
Score  : 187 points
Date   : 2022-04-25 14:47 UTC (1 day ago)

(HTM) web link (dropbox.tech)
(TXT) w3m dump (dropbox.tech)

| rkagerer wrote:
| This read to me as much like a history of technical debt as an
| article about current efforts.

| eptcyka wrote:
| Given that the blog doesn't load for me, I guess the data center
| remains unplugged?

| [deleted]

| deathanatos wrote:
| > _Not Found_
|
| > _The requested URL /infrastructure/disaster-readiness-test-
| failover-blackhole-sjc was not found on this server._
|
| > _Additionally, a 404 Not Found error was encountered while
| trying to use an ErrorDocument to handle the request._
|
| I see which data center was hosting the article, then.
|
| Internet Archive to the rescue:
| https://web.archive.org/web/20220426191128/https://dropbox.t...

| lloydatkinson wrote:
| I admire the forward thinking and application of common sense
| here. There's literally no better way of testing the system
| than this. It seems that a lot of big tech companies would
| never have the balls to do this themselves.

| mike_d wrote:
| Netflix has Chaos Gorilla, Facebook has Storms(?), Google has
| DiRT. Everyone does this type of testing.

| aaaaaaaaata wrote:

| ckwalsh wrote:
| Similar to Facebook's Storm initiative:
| https://www.forbes.com/sites/roberthof/2016/09/11/interview-...
|
| These exercises happen several times a year.

| Shish2k wrote:
| I used to be heavily involved in those. It was a process where
| we'd take a week to prepare, do weeks of post-mortems, and
| print a run of t-shirts for everyone involved to celebrate
| pulling one off successfully.
|
| These days the team running them announces that it's happening
| in an opt-in announcement group at 8am, pulls the plug at 9am,
| and barely anyone even notices because the automation handles
| it so gracefully.
|
| Mostly I just miss the t-shirts, as the <datacenter>-storm
| events got the coolest graphical designs...

| bob1029 wrote:
| > This complex ownership model made it hard to move just a
| > subset of user data to another region.
|
| Welcome to basically any large-scale enterprise.
|
| I have come to learn that the active-passive strategy is the
| best option if the business can tolerate the necessary human
| delays. You get to side-step so much complicated bullshit with
| this path.
|
| Trying to force active-active instant failover magic is
| sometimes not possible, profitable, or even desirable. I can
| come up with a few scenarios where I would absolutely _insist_
| that a human go and check a few control points before letting
| another exact copy of the same system start up on its own,
| even if it would be possible in theory to have automatic
| failover work reliably 99.999% of the time.

| dboreham wrote:
| My fear with any sort of passive standby approach is that when
| the disaster comes, the standby won't work, or the mechanism
| used to fail over to it won't work. I prefer schemes where the
| "failover" is happening all the time, hence I can be confident
| it works.

| mike_d wrote:
| A solid active/standby design should regularly fail over; every
| two weeks seems to be a sweet spot. This also balances wear
| across consumable hardware like disks.
|
| If your failover is happening "all the time," you basically
| just have a single system with failures.
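As a concrete illustration of the scheduled flip mike_d describes,
here is a minimal Python sketch. Everything in it is hypothetical:
the Node abstraction, the healthy() check, and the "dfw" region name
are made up for illustration and do not come from Dropbox or any
real failover product.

    # Minimal sketch of a scheduled active/standby flip.
    # All names are illustrative, not a real system's API.

    FLIP_INTERVAL_DAYS = 14  # the suggested cadence: every two weeks


    class Node:
        """One side of a hypothetical active/standby pair."""

        def __init__(self, name: str, active: bool):
            self.name = name
            self.active = active

        def healthy(self) -> bool:
            # Placeholder health check. A real system would verify
            # replication lag, disk state, reachability, and so on.
            return True


    def scheduled_flip(a: Node, b: Node) -> None:
        """Swap roles so the standby path is exercised on a schedule,
        rather than for the first time during a disaster."""
        active, standby = (a, b) if a.active else (b, a)
        if not standby.healthy():
            # Refusing to promote a broken standby is the point: the
            # scheduled flip surfaces the failure while the active
            # side is still fine.
            raise RuntimeError(f"standby {standby.name} failed health check")
        active.active, standby.active = False, True
        print(f"{standby.name} is now active; {active.name} is standby")


    if __name__ == "__main__":
        # "sjc" is the region named in the article; "dfw" is made up.
        sjc = Node("sjc", active=True)
        dfw = Node("dfw", active=False)
        scheduled_flip(sjc, dfw)  # in production, run every FLIP_INTERVAL_DAYS
        scheduled_flip(sjc, dfw)  # roles swap back on the next scheduled run

The health check before the swap is the design point: a flip that
refuses to proceed tells you the standby is broken while the active
side is still up, which is exactly when you want to find out.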
| orev wrote:
| Having a passive strategy doesn't mean you don't test it, and
| you can even perform actual failovers once in a while to
| validate everything.
|
| Active-active is also valid, but the point is that it comes
| with a huge amount of increased complexity. At some point you
| need to make a value calculation to decide whether you want to
| focus on that, or on building the product.

| [deleted]

| aftbit wrote:
| How big is metaserver these days?
|
| I might run three deployments at each data center: the primary,
| and secondaries for two other regions. Replicate between them
| at the block-device level, bypassing the MySQL replication
| situation entirely (except for on-disk consistency
| requirements, of course).
|
| Of course this comes with a 3x increase in service
| infrastructure costs because of the two backups in each data
| center that sit idle waiting for load.

| Johnny555 wrote:
| And much higher write latency, assuming write consistency is
| important to you.

| outside1234 wrote:
| DiRT

| zoover2020 wrote:
| Take that, ChaosMonkey! This is King Kong

| qw3rty01 wrote:
| Chaos Gorilla has already existed for over a decade

| beckman466 wrote:
| Gotta prepare for the climate crisis. As Douglas Rushkoff
| discovered, rich capitalists "are plotting to leave us behind":
|
| https://onezero.medium.com/survival-of-the-richest-9ef6cddd0...
|
| unpaywalled: https://archive.ph/AABsP

| aftbit wrote:
| > Given San Jose's proximity to the San Andreas Fault, it was
| > critical we ensured an earthquake wouldn't take Dropbox
| > offline.
|
| > Given this context, we structured our RTO--and more broadly,
| > our disaster readiness plans--around imminent failures where
| > our primary region is still up, but may not be for long.
|
| IMO if the big one hits the San Andreas, the SJC facilities
| will likely go down with ~0 warning. Certainly not enough time
| to drain user traffic.
|
| It's interesting to note that Dropbox realistically can
| probably tolerate the loss of a few seconds to minutes of user
| data in a major earthquake, but cannot tolerate the same losses
| to perform realistic tests (just yank the cable, no warning).
|
| If the earthquake hits at 3am in SF, it'll likely take both the
| metro and a significant number of the DR team out of the
| picture for at least a period of time. Surviving that kind of
| blow in the short term with 0 downtime is a very hard goal.

| paxys wrote:
| A random (large) earthquake along the San Andreas fault is not
| the same as all of the western USA ripping apart. A much more
| likely scenario is that the power grid goes down and the data
| center stays reasonably intact for a while on emergency power.

| [deleted]

| dmitrygr wrote:
| Google does this yearly, with different scenarios - it is
| called DiRT and you can read a little about it here:
| https://cloud.google.com/blog/products/management-tools/shri...
___________________________________________________________________
(page generated 2022-04-26 23:00 UTC)