[HN Gopher] We unplugged a data center to test our disaster readiness
       ___________________________________________________________________
        
       We unplugged a data center to test our disaster readiness
        
       Author : ianrahman
       Score  : 187 points
        Date   : 2022-04-25 14:47 UTC (1 day ago)
        
 (HTM) web link (dropbox.tech)
 (TXT) w3m dump (dropbox.tech)
        
       | rkagerer wrote:
       | This read to me as much like a history of technical debt as an
       | article about current efforts.
        
       | eptcyka wrote:
       | Given that the blog doesn't load for me, I guess the datacenter
       | remains unplugged?
        
         | [deleted]
        
       | deathanatos wrote:
       | > _Not Found_
       | 
       | > _The requested URL /infrastructure/disaster-readiness-test-
       | failover-blackhole-sjc was not found on this server._
       | 
       | > _Additionally, a 404 Not Found error was encountered while
       | trying to use an ErrorDocument to handle the request._
       | 
       | I see which data center was hosting the article, then.
       | 
       | Internet Archive to the rescue:
       | https://web.archive.org/web/20220426191128/https://dropbox.t...
        
       | lloydatkinson wrote:
       | I admire the forward thinking and application of common sense
       | here. There's literally no better way of testing the system than
       | this. It seems that a lot of big tech companies would never have
       | the balls to do this themselves.
        
         | mike_d wrote:
         | Netflix has Chaos Gorilla, Facebook has Storms(?), Google has
         | DiRT. Everyone does this type of testing.
        
       | ckwalsh wrote:
       | Similar to Facebook's Storm initiative:
       | https://www.forbes.com/sites/roberthof/2016/09/11/interview-...
       | 
       | These exercises happen several times a year.
        
         | Shish2k wrote:
          | I used to be heavily involved in those; it was a process where
          | we'd take a week to prepare for it, do weeks of post-mortems,
          | and print a run of t-shirts for everyone involved to celebrate
          | pulling one off successfully.
         | 
         | These days the team running them announces that it's happening
         | in an opt-in announcement group at 8am, pulls the plug at 9am,
         | and barely anyone even notices because the automation handles
         | it so gracefully.
         | 
         | Mostly I just miss the t-shirts, as the <datacenter>-storm
         | events got the coolest graphical designs...
        
       | bob1029 wrote:
       | > This complex ownership model made it hard to move just a subset
       | of user data to another region.
       | 
       | Welcome to basically any large-scale enterprise.
       | 
        | I have come to learn that the active-passive strategy is the
        | best option if the business can tolerate the necessary human
        | delays. You get to sidestep so much complicated bullshit with
        | this path.
       | 
       | Trying to force active-active instant failover magic is sometimes
       | not possible, profitable or even desirable. I can come up with a
       | few scenarios where I would absolutely _insist_ that a human go
       | and check a few control points before letting another exact copy
       | of the same system start up on its own, even if it would be
       | possible in theory to have automatic failover work reliably
       | 99.999% of the time.
        
         | dboreham wrote:
         | My fear with any sort of passive standby approach is that when
         | the disaster comes, that standby won't work, or the mechanism
         | used to fail over to it won't work. I prefer schemes where the
         | "failover" is happening all the time hence I can be confident
         | it works.
        
           | mike_d wrote:
           | A solid active/standby design should regularly flop. Every 2
           | weeks seems to be a sweet spot. This also balances wear
           | across consumable hardware like disks.
           | 
           | If your failover is happening "all the time" you basically
           | just have a single system with failures.
        
           | orev wrote:
           | Having a passive strategy doesn't mean you don't test it, and
            | you can even perform actual failovers once in a while to
           | validate everything.
           | 
           | Active-active is also valid, but the point is that it comes
           | with a huge amount of increased complexity. At some point you
           | need to make a value calculation to decide if you want to
           | focus on that, or on building the product.
        
         | [deleted]
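
        As a rough illustration of the active/passive pattern discussed in
        this sub-thread (a standby that is exercised on a regular schedule,
        but only promoted after a human signs off on a few control points),
        here is a minimal Python sketch. The control points, function
        names, and promotion step are all hypothetical and do not reflect
        Dropbox's actual tooling.

            import sys

            CONTROL_POINTS = [
                "standby replica lag is under 30 seconds",
                "no backup or schema migration is currently running",
                "old primary is fenced and cannot accept new writes",
            ]

            def human_confirms(check):
                """Ask an operator to sign off on one control point."""
                answer = input("Confirm: %s? [y/N] " % check).strip().lower()
                return answer == "y"

            def promote_standby():
                """Placeholder for the real promotion step (VIP move,
                DNS flip, replica promotion, etc.)."""
                print("Promoting standby to primary...")

            def run_drill():
                # Run from a scheduler (e.g. every two weeks) so the
                # passive side is exercised regularly, not only during a
                # real disaster.
                for check in CONTROL_POINTS:
                    if not human_confirms(check):
                        print("Aborting failover: %r not confirmed." % check)
                        sys.exit(1)
                promote_standby()

            if __name__ == "__main__":
                run_drill()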
        
       | aftbit wrote:
       | How big is metaserver these days?
       | 
       | I might run three deployments at each data center: the primary,
       | and secondaries for two other regions. Replicate between them at
       | the block device level, bypassing the mysql replication situation
       | entirely (except for on-disk consistency requirements of course).
       | 
       | Of course this comes with a 3x increase in service infrastructure
       | costs because of the two backups in each data center that are
       | idle waiting for load.
        
         | Johnny555 wrote:
         | And much higher write latency, assuming write consistency is
         | important to you.
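
        A rough back-of-the-envelope sketch of the latency concern raised
        above: if every write must be acknowledged by a synchronous replica
        in another region before it commits, each write pays at least one
        cross-region round trip on top of the local commit. The RTT figures
        below are illustrative assumptions, not measurements.

            # Assumed figures; real numbers depend on the sites and network.
            LOCAL_COMMIT_MS = 1.0            # local fsync/commit cost
            CROSS_REGION_RTT_MS = {
                "same metro":       2.0,
                "SJC <-> Dallas":  40.0,
                "SJC <-> US-East": 70.0,
            }

            def sync_write_latency(local_ms, rtt_ms):
                """Minimum commit latency when one remote replica must
                acknowledge each write before it counts as durable."""
                return local_ms + rtt_ms

            for pair, rtt in CROSS_REGION_RTT_MS.items():
                print("%-16s >= %.0f ms per synchronous write"
                      % (pair + ":", sync_write_latency(LOCAL_COMMIT_MS, rtt)))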
        
       | outside1234 wrote:
       | DiRT
        
       | zoover2020 wrote:
       | Take that, ChaosMonkey! This is King Kong
        
         | qw3rty01 wrote:
         | Chaos Gorilla has already existed for over a decade
        
       | beckman466 wrote:
       | gotta prepare for the climate crisis, as Douglas Rushkoff
       | discovered, rich capitalists "are plotting to leave us behind"
       | 
       | https://onezero.medium.com/survival-of-the-richest-9ef6cddd0...
       | 
       | unpaywalled: https://archive.ph/AABsP
        
       | aftbit wrote:
        | >Given San Jose's proximity to the San Andreas Fault, it was
       | critical we ensured an earthquake wouldn't take Dropbox offline.
       | 
       | >Given this context, we structured our RTO--and more broadly, our
       | disaster readiness plans--around imminent failures where our
       | primary region is still up, but may not be for long.
       | 
       | IMO if the big one hits San Andreas, the SJC facilities will
       | likely go down with ~0 warning. Certainly not enough time to
       | drain user traffic.
       | 
       | It's interesting to note that Dropbox realistically can probably
        | tolerate a loss of a few seconds to minutes of user data in a
       | major earthquake, but cannot tolerate the same losses to perform
       | realistic tests (just yank the cable, no warning).
       | 
       | If the earthquake hits at 3am in SF, it'll likely take both the
       | metro and a significant number of the DR team out of the picture
       | for at least a period of time. Surviving that kind of blow in the
       | short term with 0 downtime is a very hard goal.
        
         | paxys wrote:
         | A random (large) earthquake along the San Andreas fault is not
         | the same as all of western USA ripping apart. A much more
         | likely scenario is that the power grid goes down and the data
         | center stays reasonably intact for a while on emergency power.
        
           | [deleted]
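
        For concreteness, "draining user traffic" out of a region generally
        means shifting routing weight away from it in small steps rather
        than cutting over instantly, which is why it needs lead time. The
        sketch below is purely illustrative and assumes a hypothetical
        set_region_weight() control-plane call; the region names and
        timings are made up.

            import time

            def set_region_weight(region, weight):
                """Hypothetical control-plane call setting a region's share
                of user traffic (0..100 percent)."""
                print("routing weight for %s -> %d%%" % (region, weight))

            def drain(from_region, to_region, steps=10, pause_s=60):
                # Shift traffic away in small steps, pausing between steps
                # so operators can watch error rates and back out if needed.
                for i in range(1, steps + 1):
                    remaining = 100 - (100 * i // steps)
                    set_region_weight(from_region, remaining)
                    set_region_weight(to_region, 100 - remaining)
                    time.sleep(pause_s)

            if __name__ == "__main__":
                # A ten-step drain with one minute between steps takes
                # about ten minutes, which is the kind of lead time a
                # no-warning earthquake would not provide.
                drain("sjc", "dfw")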
        
       | dmitrygr wrote:
       | Google does this yearly, with different scenarios - it is called
       | DiRT and you can read a little here:
       | https://cloud.google.com/blog/products/management-tools/shri...
        
       ___________________________________________________________________
       (page generated 2022-04-26 23:00 UTC)