[HN Gopher] I want to have an AWS region where everything breaks...
___________________________________________________________________
 
I want to have an AWS region where everything breaks with high
frequency
 
Author : caiobegotti
Score  : 451 points
Date   : 2020-08-09 23:12 UTC (23 hours ago)
 
(HTM) web link (twitter.com)
(TXT) w3m dump (twitter.com)
 
| rdoherty wrote:
| This is called chaos engineering, and many companies have built
| tooling to do exactly this. Netflix pioneered/proselytized it
| years ago. Since your app likely relies on more than just AWS
| services even if it runs in AWS, you want something either on your
| servers themselves or built into whatever low-level HTTP wrapper
| you use. Use that library to do fault injection: high latency,
| errors, timeouts, etc.
|
| tempsolution wrote:
| Sorry, but this is just strangely naive. Yes, you should have
| integration test setups that allow you to swap out any network
| dependency with in-memory stubs, fault-injected proxies, etc. But
| all of that can never replace the behavior and chaos of a real
| outage. On top of that, a lot of the more complicated scenarios
| are immensely difficult to set up, and you first also need to know
| HOW a region can fail.
|
| I agree with you on one thing, though. AWS, and actually every
| cloud provider, should ship a chaos engineering toolbox that comes
| with ready-to-use, realistic failure simulations you can use in
| tests, i.e. drop-in replacements for their SDK clients that then
| just start running through predefined failure scenarios.
|
| jedberg wrote:
| This type of service would be a complement to those techniques,
| not a replacement. Ideally we could have both.
|
| infogulch wrote:
| It's harder to do chaos engineering if you're not engineering it.
| What this is really asking for is for the service provider to sell
| chaos engineering as a service (CEaaS?), on the services they
| provide.
| I've wanted this kind of thing for testing cloud infrastructure
| before: you read in the docs about various failure states and
| scenarios you might want to handle, but there's no way to trigger
| them, so you just have to hope that they work as described and
| that your code is correct. At the least, let users simulate the
| effect of the failures that are _part of your API_.
|
| This would be great for testing the pieces of the stack that the
| provider is responsible for, but you may still want to inject
| chaos into the part of your stack that you do control.
|
| segmondy wrote:
| Netflix runs on AWS, and they do chaos engineering on it just
| fine.
|
| fizwhiz wrote:
| Came here to say this exact thing. There have been a variety of
| techniques to achieve this - some intrusive to your binaries (i.e.
| they require embedding specific libraries) and others that are
| more "external" (ex: tc/iptables).
|
| The "real" challenge is not _creating_ chaos but managing it and
| verifying that your apps are resilient to said chaos.
|
| [deleted]
|
| georgewfraser wrote:
| I think people overestimate the importance of failures of the
| underlying cloud platform. One of the most surprising lessons of
| the last 5 years at my company has been how rarely single points
| of failure actually _fail_. A simple load-balanced group of EC2
| instances, pointed at a single RDS Postgres database, is
| astonishingly reliable. If you get fancy and build a multi-master
| system, you can easily end up creating more downtime than you
| prevent when your own failover/recovery system runs amok.
|
| msla wrote:
| A zone not only of sight and sound, but of CPU faults and RAM
| errors, cache inconsistency and microcode bugs. A zone of the pit
| of prod's fears and the peak of test's paranoia. Look, up ahead:
| Your root is now read-only and your page cache has been mapped to
| /dev/null! You're in the Unavailability Zone!
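The "embedded in your binaries" style of fault injection mentioned above (rdoherty's HTTP-wrapper idea, fizwhiz's intrusive libraries) can be sketched in a few lines. This is an illustrative wrapper, not any real SDK's or library's API; the names (`FaultInjector`, `wrap`) are made up:

```python
import random

class FaultInjector:
    """Wrap callables and make a configurable fraction of calls fail.

    A minimal sketch of library-embedded fault injection; a seed
    makes a given failure pattern reproducible across test runs.
    """
    def __init__(self, rate, error=ConnectionError, seed=None):
        self.rate = rate                # probability of an injected failure
        self.error = error              # exception type to raise
        self.rng = random.Random(seed)  # seeded for reproducibility

    def wrap(self, fn):
        def wrapped(*args, **kwargs):
            if self.rng.random() < self.rate:
                raise self.error("injected fault")
            return fn(*args, **kwargs)
        return wrapped

# Example: force callers of a (stubbed) HTTP GET to exercise their
# retry/fallback paths about 30% of the time.
flaky_get = FaultInjector(rate=0.3, seed=42).wrap(lambda url: "200 OK")
```

Pointing such a wrapper at every network dependency is cheap, but as tempsolution notes above, it still only models the failures you thought to script.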
|
| mindcrime wrote:
| Your conductor on this journey through the Unavailability Zone:
| the BOFH!
|
| whoisjuan wrote:
| Isn't this what Gremlin does?
|
| bob1029 wrote:
| It sounds to me like what some people would like is a magical box
| they can throw their infrastructure into that will automatically
| shit-test all the things that could potentially go wrong for them.
| This is poor engineering. Arbitrary, contrived error conditions do
| not constitute a rational test fixture. If you are not already
| aware of where failures might arise in your application and how to
| explicitly probe those areas, you are gambling at best. Not all
| errors are going to generate stack traces, and not all errors are
| going to be detectable by your users. What you would consider an
| error condition for one application may be a completely acceptable
| outcome for another.
|
| This is the reliability engineering equivalent of building a data
| warehouse when you don't know what sorts of reports you want to
| run or how the data will generally be used after you collect it.
|
| riskneutral wrote:
| > building a data warehouse when you don't know what sorts of
| reports you want to run or how the data will generally be used
| after you collect it.
|
| Hi Bob, I can't tell you what reports I want or what we'll do with
| the data until you've first collected the data for analysis.
| Thanks!
|
| cogman10 wrote:
| I disagree.
|
| Not handling failures correctly is a time-honored tradition in
| programming. It is so easy to miss.
|
| For example, how often have you seen a malloc call checked for
| `ENOMEM`? Even though that's a condition that can be semi-common,
| and one you might actually be able to handle. Instead, most code
| will simply blow chunks when it happens. Is the person that wrote
| it "wrong"? That's debatable.
|
| Some languages like Go make it even trickier to detect that
| someone forgot to handle an error condition.
| Nothing obvious in the code review (other than knowledge of the
| API in question) would get someone senior to catch those sorts of
| issues.
|
| So the question is, HOW do you catch those problems?
|
| The answer seems obvious to me: you simulate problems in
| integration tests. What happens when Service X simply disappears?
| What happens when a server restarts mid-communication? Is
| everything handled, or does this put the apps into a
| non-recoverable state?
|
| These are all great infrastructure tests that can catch a lot of
| edge-case problems that may have been missed in code reviews. Even
| better, that sort of infrastructure testing can be generalized and
| applied to many applications. Making rare events common in an
| environment makes it a lot easier to catch the hard-to-notice bugs
| that everyone writes.
|
| It's basically just fuzz testing, but for infrastructure. Fuzz
| testing has been shown to have a ton of value; infrastructure
| fuzzing seems like a natural extension of that, especially when
| high reliability and low maintenance are things everyone should
| want.
|
| chucky_z wrote:
| it's us-west-1! :D
|
| we've had a ton of instances fail at once because they had some
| kind of rack-level failure and a bunch of our EC2s ended up in the
| same rack. :(
|
| missosoup wrote:
| That region is called Microsoft Azure. It will even break the
| control UI with high frequency.
|
| llama052 wrote:
| I was going to post this but you beat me to it.
|
| We are forced to use Azure for business reasons where I work, and
| the frequency of one-off failures and outages is insane.
|
| moooo99 wrote:
| Thank you, this is the exact comment I was looking for
|
| terom wrote:
| There are multiple methods for automating AWS EC2 instance
| recovery for instances in the "system status check failed" or
| "scheduled for retirement" cases.
|
| Yet to figure out how to test any of those CloudWatch
| alerts/rules.
I've had them deployed in my dev/test environments | for months now, after having to manually deal with a handful of | them in a short time period. They've yet to trigger once since. | | Umbrellas when it's raining etc. | wmf wrote: | This is why it seems like it would be good to have explicit | fault injection APIs instead of assuming that the normal APIs | behave the same as a real failure. | rob-olmos wrote: | I imagine AWS and other clouds have a staging/simulation | environment for testing their own services. I seem to recall them | discussing that for VPC during re:Invent or something. | | I'm on the fence though if I would want a separate region for | this with various random failures. I think I'd be more interested | in being able to inject faults/latencies/degradation in existing | regions, and when I want them to happen for more control and | ability to verify any fixes. | | Would be interesting to see how they price it as well. High per- | API cost depending on the service being affected, combined with a | duration. Eg, make these EBS volumes 50% slower for the next | 5min. | | Then after or in tandem with the API pieces, release their own | hosted Chaos Monkey type service. | jedberg wrote: | For those saying "Chaos Engineering", first off, the poster is | well aware of Chaos Engineering. He's an AWS Hero and the founder | of Tarsnap. | | Secondly, this would help make CE better. I actually asked Amazon | for an API to do this ten years ago when I was working on Chaos | Monkey. | | I asked for an API to do a hard power off of an instance. To this | day, you can only do a graceful power off. I want to know what | happens when the instance just goes away. | | I also asked for an API to slow down networking, set a random | packet drop rate, EBS failures, etc. All of these things can be | simulated with software, but it's still not exactly the same as | when it happens outside the OS. 
|
| Basically I want an API where I can torture an EC2 instance to see
| what happens to it, for science!
|
| exdsq wrote:
| Dumb question here from a CE beginner, but can't you have a Docker
| image for that service and turn it off?
|
| paranoidrobot wrote:
| OS-level 'turn off' options don't replicate what happens when you
| yank power on a rack of equipment.
|
| Pretty much every option you have from the OS will let caches
| flush, and will let in-progress writes complete.
|
| Yank the power and none of that will happen. It'll let you
| actually see what level of fibbing your OS and hardware are
| telling you.
|
| Oh, you got a result from that flush to disk? It completed? Are
| you sure? Really, really sure? Let's find out...
|
| jrockway wrote:
| The main result of this testing is that you'll find bugs in all
| the abstraction layers that you can't control. You'll then be
| paranoid but unable to take any action. Have you ever seen the
| source code that's running on your SSD's CPU? Nope. And it's
| probably not a small program. My guess is that it works well when
| everything is OK, and fails catastrophically 0.0001% of the time
| when everything isn't OK. But you'll never know unless you try
| failing catastrophically a million times. Did the vendor even try
| failing catastrophically one million times before they started
| manufacturing (or at least before they sent the batch to
| retailers)? Did they do that, fail 1 time, and mark the bug closed
| as non-reproducible?
|
| I have no idea! Maybe everything is actually great. Or maybe
| someone else will be on call the week you hit that one-in-a-
| million failure case. Or maybe it will happen to your competitors
| instead of you! Without testing, all we have is hope.
|
| landryraccoon wrote:
| I think this is too pessimistic. If all you do is test your error
| detection and backup recovery mechanism (of course you have
| backups, right?), then that's an advantage.
|
| Let's say a hard power off causes data corruption that can't be
| fixed. For lots of applications downtime is better than
| corruption, so in that case you will at least be able to test when
| you should take the system down and recover from a known-good
| backup.
|
| alexchamberlain wrote:
| Even with testing, it's still just hope; to quote Buzz Lightyear,
| it's hoping with style.
|
| IgorPartola wrote:
| _kill -9_ on the VM process from the host?
|
| jedberg wrote:
| If you're using Docker, yes, you can get a lot more options in
| testing by poking "from the outside", but that still doesn't test
| what happens when your Docker host just dies.
|
| dotancohen wrote:
| That's not a dumb question at all.
|
| I'm not the GP, but I assume it is because the GP would like to
| run his "production" setup (or even the production setup itself)
| under such circumstances. So far as I remember, the origin of the
| idea was the Chaos Monkey, which would sabotage certain services
| in a production environment to ensure that everything was
| redundant and fail-safe.
|
| [deleted]
|
| aloknnikhil wrote:
| > I asked for an API to do a hard power off of an instance. To
| this day, you can only do a graceful power off. I want to know
| what happens when the instance just goes away.
|
| Wouldn't just running "halt -f" do the same?
|
| vetrom wrote:
| Not exactly. "halt -f" depends on some variant of power management
| working. For extra fun, you could also have partial power-offs,
| like say a power cut to the motherboard, or any combination of
| other devices with their own power lines from the power supply (in
| a current PC these would typically be PCI supplementary power,
| motherboard, SATA power, 4-pin 12V Molex power, and supplemental
| CPU power).
|
| Granted, most of those are out of scope for cloud development, but
| the concept of externally cutting off a VM is different from even
| calling whatever power service you have to cut power.
| In the 'real' world, enough of those failure triggers above would
| probably also trigger an automated power cycle if you're in a
| managed environment.
|
| 0xEFF wrote:
| Kernel panics work very well for this use case.
|
| echo c > /proc/sysrq-trigger
|
| derefr wrote:
| Maybe for testing the crash-resilience of software running _on_
| the node, but not necessarily for testing how the SDN autoscaling
| glop you've got configured responds to the node's death.
|
| A panicking instance is still "alive" from its hypervisor's
| perspective (it'll hang, sleep, or reboot, but it won't usually
| _turn off_ its vCPU in a way the hypervisor would register as "the
| instance is now off"); while if a hypervisor box suffers a power
| cut, the rest of the compute cluster knows that the instances on
| that node are now very certainly _off_.
|
| raverbashing wrote:
| For reference: https://unix.stackexchange.com/a/66205 (you
| obviously need to be root)
|
| jedberg wrote:
| It would be really close, yes, but not exactly the same as ripping
| the power cord out. It still gives the OS a hint that shutdown is
| coming.
|
| zyamada wrote:
| Possibly, but how can we be 100% sure without the ability to
| compare behavior? If I were following this line of research I'd
| still want to know if there's any difference in the nature of a
| failure when it comes from within the OS (possibly simulated by
| "halt -f") and the situation the parent OP pointed out, where the
| instance just goes poof without sending any kind of signal to the
| OS itself.
|
| JulianMorrison wrote:
| To give an example of the difference: does it fsync() before
| stopping?
|
| satya71 wrote:
| I think localstack[1] gets you a lot of this.
|
| [1] https://localstack.cloud/
|
| nbadg wrote:
| I was just about to suggest localstack. I use it religiously in
| personal projects; can't recommend it enough. I haven't started
| telling it to induce errors yet, but it definitely has that
| capability.
| And if you're running it in Docker, some of the network stuff can
| be simulated that way as well.
|
| pojzon wrote:
| Did you try to set up an on-premis Eucalyptus cloud for that?
| Eucalyptus has an API compliant with the AWS API.
|
| jimnotgym wrote:
| I really wish the world could agree on a better term than 'on
| premis' for 'not in the cloud'. At least 'on premises' is
| descriptive of the situation, although a bit clunky. 'On premis'
| doesn't make any sense at all: 'premis' is not the singular of
| 'premises'; 'premises' is the singular of itself! I prefer 'on
| site' or 'self-hosted' myself.
|
| simonebrunozzi wrote:
| > Basically I want an API where I can torture an EC2 instance to
| see what happens to it, for science!
|
| And one day there will be PETA [0] for EC2 instances!
|
| [0]: https://www.peta.org/
|
| peterwwillis wrote:
| > Basically I want an API where I can torture an EC2 instance to
| see what happens to it, for science!
|
|     # cat > dropme.sh <<'EOFILE'
|     #!/bin/sh
|     set -eu
|     read -r SLEEP
|     tmp=`mktemp` ; tmp6=`mktemp`
|     iptables-save > $tmp
|     ip6tables-save > $tmp6
|     for t in iptables ip6tables ; do
|       for c in INPUT OUTPUT FORWARD ; do $t -P $c DROP ; done
|       $t -t nat -F ; $t -t mangle -F ; $t -F ; $t -X
|     done
|     sleep "$SLEEP"
|     iptables-restore < $tmp
|     ip6tables-restore < $tmp6
|     rm -f $tmp $tmp6
|     EOFILE
|     # chmod 755 dropme.sh
|     # ncat -k -l -c ./dropme.sh \
|         $(ifconfig eth0 | grep 'inet ' | awk '{print $2}') 12345 &
|     # echo "60" | ncat -v \
|         $(ifconfig eth0 | grep 'inet ' | awk '{print $2}') 12345
|
| If you're lucky the existing connections won't even die, but the
| box will be offline for 60 seconds.
|
| dmurray wrote:
| > For those saying "Chaos Engineering", first off, the poster is
| well aware of Chaos Engineering. He's an AWS Hero and the founder
| of Tarsnap.
|
| Yeah, but did he win the Putnam?
|
| jedberg wrote:
| He did, but that particular accolade didn't seem relevant.
:)
|
| hitekker wrote:
| For reference: https://news.ycombinator.com/item?id=35079
|
| Terretta wrote:
| Same thread where Drew from "getdropbox.com" says he also has a
| "sync and backup done right" idea...
|
| asdff wrote:
| I wonder how much you would have to pay Amazon for them to send a
| tech down to the datacenter and pull the plug on your running node
|
| yjftsjthsd-h wrote:
| It's AWS; surely half of the work would be finding the server that
| happens to be running your instance, anonymously sitting in a
| datacenter with thousands of other servers. There's also the
| question of how many other customers' instances are located on the
| same hardware.
|
| breatheoften wrote:
| Fascinating!
|
| The mere existence of such an API would be an interesting source
| of problems when used accidentally / via buggy code...
|
| I wonder to what degree this functionality would end up being
| relied on as an "in the worst case, hard-kill things to recover"
| behavior that folks utilize for bad engineering reasons as opposed
| to good ones...
|
| odonnellryan wrote:
| Make it hard to call accidentally, and hide it in the docs
| somewhere with items related to testing.
|
| giancarlostoro wrote:
| I would assume said code should never be part of actual
| deployments - only part of unit tests, or maybe even an external
| project.
|
| agravier wrote:
| Oh you, sweet summer child...
|
| hiyer wrote:
| Today you can use spot block instances for this. They are
| guaranteed to die off after your chosen time block of 1-6 hours.
|
| jedberg wrote:
| As far as I know it still sends a graceful shutdown at the end of
| the time block. It sends exactly the same signal as the shutdown
| API, which is the same as pressing the power button on the front
| of the machine.
|
| hackmiester wrote:
| Surely it eventually kills you if you just ignore the signal...?!
| (Though of course, by that time, we've missed the point of this
| exercise.)
|
| toredash wrote:
| Correct!
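At the process level, the graceful-versus-hard distinction discussed above is easy to demonstrate: SIGTERM is a request the workload can ignore (or use to flush and fsync), while SIGKILL, like a power pull, gives it no chance to react. A small POSIX-only sketch (the helper names are illustrative):

```python
import signal
import subprocess
import sys
import time

# A child process that ignores SIGTERM, standing in for a workload
# that never completes its "graceful" shutdown path.
CHILD = (
    "import signal, time\n"
    "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    "time.sleep(30)\n"
)

def stopped_by(sig):
    """Start the stubborn child, send it `sig`, report whether it died."""
    p = subprocess.Popen([sys.executable, "-c", CHILD])
    time.sleep(1.0)  # give the child time to install its handler
    p.send_signal(sig)
    try:
        p.wait(timeout=3)  # did the signal actually stop it?
        return True
    except subprocess.TimeoutExpired:
        p.kill()           # clean up the survivor
        p.wait()
        return False
```

On a POSIX system, `stopped_by(signal.SIGTERM)` comes back False while `stopped_by(signal.SIGKILL)` comes back True; the same asymmetry is why a shutdown API that only ever sends the polite signal can't exercise the "instance just goes away" path.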
|
| recuter wrote:
| Wouldn't a simple cron job that disconnects the networking on an
| instance randomly and/or pegs all the cores be equivalent?
|
| jedberg wrote:
| There are lots of ways to get a very close experience to a power
| pull, but nothing you can do from within the VM is quite the same
| as doing it outside the VM.
|
| recuter wrote:
| I believe you, but I'm curious as to the subtle difference.
|
| To my mind, simply turning off the networking, physically cutting
| a cable, or hardware spontaneously combusting are not discernible
| events to an outside observer. What am I missing?
|
| thejj100100 wrote:
| A network outage wouldn't stop a task that is writing to disk,
| would it?
|
| recuter wrote:
| If the disk is network-attached it would, and if it isn't, what
| difference does it make?
|
| hackmiester wrote:
| Well, we would know that if we could try it.
|
| fred_is_fred wrote:
| us-east-1?
|
| NovemberWhiskey wrote:
| Anecdotally, I hear the South American regions are the places
| where the really canary stuff goes out first.
|
| [deleted]
|
| mitchs wrote:
| I've heard a fun story from the old timers in my org about a fiber
| outage in Brazil. A routine fiber cut occurred. They figure out
| how far from one end the cut is (there is gear that measures the
| time it takes a light pulse to reflect off of the cut end). Then
| they pull out a map of where the fiber was laid, count out the
| distance, and send a technician out to have a look at where they
| expect the cut to be. All standard practice up to this point.
|
| The technician updates the ticket after a while with "cannot find
| road." The folks back in the office try to send them directions,
| but then the technician clarifies: "road is gone." Our fiber, and
| the road it was buried under, was totally demolished in the few
| hours it took to get someone out there. The developing world can
| develop at alarming rates.
|
| Other tales from the middle of nowhere: people shoot at aerial
| fiber with guns.
| Or dig it up and cut it for fun. One time our technician was
| carjacked on the way to doing a repair.
|
| swasheck wrote:
| For us it's AP-SE-2, with us-east-1 as a close second
|
| spullara wrote:
| I came here to post this and it isn't even a joke. Just true.
|
| yupyup54133 wrote:
| same here :-)
|
| falcolas wrote:
| I don't work with the group directly, but one group at our company
| has set up Gremlin, and the breadth and depth of outages Gremlin
| can cause is pretty impressive. Chaos testing FTW.
|
| robpco wrote:
| I've also had a customer who used Gremlin to dramatically improve
| their stability.
|
| imhoguy wrote:
| Failing individual compute instances isn't hard; some chaos script
| to kill VMs is enough. Worst are the situations when things seem
| to be up but aren't healthy: abnormal network latency, random
| packet drops, random but repeatable service errors, lagging
| eventual consistency. Not even mentioning any hardware woes.
|
| davidrupp wrote:
| [Disclaimer: I work as a software engineer at Amazon (opinions my
| own, obvs)]
|
| The chaos aspect of this would certainly increase the evolutionary
| pressure on your systems to get better. You would need really good
| visibility into what exactly was going on at the time your stuff
| fell over, so you could know what combination(s) to guard against
| next time. But there is definitely a class of problems this would
| help you discover and solve.
|
| The problem with the testing aspect, though, is that test failures
| are most helpful when they're deterministic. If you could dictate
| the type, number, and sequence of specific failures, then write
| tests (and corresponding code) that help make your system
| resilient to that combination, that would definitely be useful. It
| seems like "us-fail-1" would be more helpful for organic discovery
| of failure conditions, less so for the testing of specific
| conditions.
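The deterministic alternative described above can be sketched concretely: rather than random chaos, the test dictates the exact type and sequence of faults, so any failure reproduces on every run. `ScriptedFaults` and `get_with_retries` are hypothetical names for illustration, not any real AWS or Gremlin API:

```python
class ScriptedFaults:
    """Replay a fixed failure schedule: one entry per call, where
    None means succeed and an exception class means raise that
    fault on that call. Calls past the schedule succeed."""
    def __init__(self, schedule):
        self.schedule = list(schedule)
        self.calls = 0

    def call(self, fn, *args, **kwargs):
        i = self.calls
        self.calls += 1
        fault = self.schedule[i] if i < len(self.schedule) else None
        if fault is not None:
            raise fault("injected fault on call %d" % (i + 1))
        return fn(*args, **kwargs)

# Code under test: a retry loop that should survive scripted timeouts.
def get_with_retries(faults, attempts=3):
    for _ in range(attempts):
        try:
            return faults.call(lambda: "200 OK")
        except TimeoutError:
            continue
    raise RuntimeError("out of retries")
```

With `ScriptedFaults([TimeoutError, None])` the retry loop recovers on the second attempt every time, which is exactly the property that random injection can't give a test suite.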
|
| cogman10 wrote:
| > The problem with the testing aspect, though, is that test
| failures are most helpful when they're deterministic.
|
| Let's not let `perfect` get in the way of `good`.
|
| Certainly having a 100% traceable system would be ideal, but most
| systems are not that.
|
| There are still a TON of low-hanging, easy-to-find issues that
| would automatically fall out of a system of random failures. Even
| if engineers have to spend some time figuring out what the hell is
| going on, it would overall improve their system, because it would
| shine a bright flashlight on the system to let them know "Hey,
| something is rotten here". From there, more deterministic tests
| and better tracing can be added.
|
| haecceity wrote:
| Why does Twitter often fail to load when I open a thread, yet if I
| refresh it works? Does Twitter use us-fail-1?
|
| saagarjha wrote:
| I think they don't like browsers they can't fingerprint, or
| something like that.
|
| Havoc wrote:
| I get this too. Seems to always be desktop
|
| notpiika wrote:
| For me it's the exact opposite -- always on mobile, always when
| logged out (haven't extensively tested being logged in on mobile,
| since I'm barely logged in to Twitter on mobile.)
|
| caymanjim wrote:
| I don't know why, but it happens to everyone and it's been that
| way for a long time. Either their engineers are failing, or
| there's some sketchy monetary reason for it. You're not the only
| one.
|
| martin-adams wrote:
| I can see a use case for this being implemented on top of
| Kubernetes. I've no idea if that's achievable, but it could go
| some way toward making your code more resilient.
|
| MattGaiser wrote:
| Whichever region Quora is using.
|
| CloudNetworking wrote:
| You can use IBM Cloud for that purpose
|
| chrishynes wrote:
| Savage.
|
| [deleted]
|
| [deleted]
|
| enahs-sf wrote:
| There's a Microsoft Azure/Google Cloud joke in there somewhere...
|
| dotancohen wrote:
| No, really, it's the IBM cloud that is the joke.
| This isn't the first I've heard of it, though I've not used it
| myself.
|
| I'm a happy AWS user, and I'll stay a happy AWS user, not for
| their prices or features but for their service. Which was the
| reason I was a Rackspace fan before it was sold and went down the
| tube.
|
| toast0 wrote:
| Hey, I used their load balancers for a couple months, and they
| only failed every 30 days; that's not high frequency.
|
| paranoidrobot wrote:
| I used to use their Citrix NetScaler VPX1000s at a previous job.
|
| They were very reliable, imo. Aside from general NetScaler
| bullshit, we only ever had issues with them when we'd try to get
| them to do too much, so that the CPU or memory was overloaded.
|
| We tried on a few occasions to get more cores allocated to them,
| but no. This made terminating large numbers of SSL connections on
| them problematic.
|
| toast0 wrote:
| I was using their shared load balancers, not the run-a-load-
| balancer-in-a-VM option, because I was hoping for something more
| reliable than a single computer. For the couple months they were
| running, it was literally every 30 days, 10 minutes of downtime.
| So I went back to DNS round robin, because it was better.
|
| bootyfarm wrote:
| I believe this is available as a service called "SoftLayer"
|
| raverbashing wrote:
| There's an easier way: spot instances (and us-east-1 as mentioned)
|
| As for things like EBS failing, or dropping packets, it's a bit
| tricky, as some things might break at the OS level
|
| And given sufficient failures, you can't swim anymore; you'll just
| sink.
|
| exabrial wrote:
| Sort of counter-intuitive, but for small projects you want
| hardware systems that are as resilient as possible; the larger you
| scale out, the less reliable you want them, to force that
| resilience out of the hardware and into resilient software.
|
| vemv wrote:
| While these are not exclusive, personally I'd look instead into
| studying my system's reliability in a way that is independent of a
| cloud provider, or even of performing any side-effectful testing
| at all.
|
| There's extensive research and work on all things resilience. One
| could say: if one builds a system that is proven to be
| theoretically resilient, that model should extrapolate to
| real-world resilience.
|
| This approach is probably intimately related to pure-functional
| programming, which I feel has not been explored enough in this
| area.
|
| exabrial wrote:
| Simply host on Google Cloud! They will terminate your access for
| something random, like someone saying your name on YouTube while
| doing something bad. They don't have a number you can call, and
| their support is run by the stupidest of all AI algorithms.
|
| code4tee wrote:
| This is what Chaos Monkey does.
|
| dijit wrote:
| Isn't us-east-1 exactly that?
|
| All jokes aside, I actually asked my Google Cloud rep about stuff
| like this; they came back with some solutions, but often the
| problem with that is: what kind of failure condition are you
| hoping for?
|
| Zonal outage (networking)? Hypervisor outage? Storage outage?
|
| Unless it's something like S3 giving high error rates, most things
| can actually be done manually. (And this was the advice I got
| back, because faulting the entire set of APIs and tools in unique
| and interesting ways is quite impossible.)
|
| londons_explore wrote:
| > Unless it's something like s3 giving high error rates
|
| Just firewall off the real S3, and point clients at a proxy which
| forwards most requests to the real S3 and returns errors or delays
| to the rest.
|
| caymanjim wrote:
| Yeah, us-east-1 is pretty good at failing already. We lost
| us-east-1c for most of the day about a week ago due to a fiber
| line being cut. I'd estimate that AWS manages fewer than "three
| 9s" in us-east-1 on average.
| Not across the board, but at any given time something has a decent
| chance of not working, be it an entire AZ, or regional S3, etc.
| They're still pretty reliable, and I like the idea of a zone with
| built-in failure for testing things, but your joke about us-east-1
| is based in solid fact.
|
| 6510 wrote:
| Sounds useful. Crank it up to 99% failure and it becomes
| interesting science.
|
| caiobegotti wrote:
| It actually sounds useful, to the point that I wouldn't be
| surprised if in the near future cloud providers bundled up some
| chaos monkey stack and offered it at a neat price within their
| realms (dunno, maybe per VPC or project).
|
| pseudosavant wrote:
| They will definitely figure out a way to charge us more for
| hardware that is less reliable.
|
| emerged wrote:
| That's a great idea: instead of throwing away failing hardware,
| toss it into the chaos region and charge double.
|
| kentlyons wrote:
| I want this at the programming language level too. If a function
| call can fail, I want to set a flag and have it (randomly?) fail.
| I hacked my way around this by adding a wrapper that would, at
| random, err for a bunch of critical functions. It was great for
| working through a ton of race conditions in Go with channels,
| remote connections, etc. But hacking it in manually was annoying
| and not something I'd want to commit.
|
| lordgeek wrote:
| brilliant!
|
| swasheck wrote:
| Wait. I thought this was ap-southeast-2
|
| jariel wrote:
| This is a really great idea.
|
| jonplackett wrote:
| This is such a clever idea. I wonder if Amazon are smart enough to
| actually do this.
|
| foota wrote:
| Just deploy a new region with no ops support; it'll quickly become
| that.
|
| gregdoesit wrote:
| When I worked at Skype / Microsoft and Azure was quite young, the
| data team next to me had a close relationship with one of the
| Azure groups who were building new data centers.
|
| The Azure group would ask them to send large loads of data their
| way, so they could get some "real" load on the servers. There
| would be issues at the infra level, and the team had to detect and
| respond to them. In return, the data team would ask the Azure
| folks to just unplug a few machines - power them off, take out
| network cables - helping them test what happens.
|
| Unfortunately, this was a one-off, and once the data center was
| stable, the team lost this kind of "insider" connection.
|
| However, as a fun fact, at Skype we could use Azure for free for
| about a year - every dev in the office, for work purposes
| (including work pet projects). We spun up way too many instances
| during that time, as you'd expect, and only came around to turning
| them off when Azure changed billing to charge 10% of the "regular"
| pricing to internal customers.
|
| thoraway1010 wrote:
| A great idea! I'd love to run stuff in this zone. Rotate through a
| bunch of errors, unavailability, latency spikes, power outages,
| etc. every day; make it a 12-hour torture test cycle.
|
| kevindong wrote:
| At my job, my team owns a service that generally has great uptime.
| Dependent teams/services have gotten into the habit of assuming
| that our service will be 100% available, which is problematic
| because it's obviously not. That false assumption has
| unfortunately caused several minor incidents.
|
| There has been some talk internally of doing chaos engineering to
| help improve the reliability of our company's products as a whole.
| Unfortunately, the most easily simulatable failure scenarios (e.g.
| entire containers going down all at once, instantly) tend to be
| the least helpful, since my team designed the service to tolerate
| those kinds of easily modelable situations.
|
| The more subtle/complex/interesting failure conditions are far
| harder to recognize and simulate (e.g.
all containers hosted on | one particular node experience 10s latencies on all network | traffic, stale DNS entries, broken service discovery, etc.). ___________________________________________________________________ (page generated 2020-08-10 23:00 UTC)