[HN Gopher] I want to have an AWS region where everything breaks...
       ___________________________________________________________________
        
       I want to have an AWS region where everything breaks with high
       frequency
        
       Author : caiobegotti
       Score  : 451 points
       Date   : 2020-08-09 23:12 UTC (23 hours ago)
        
 (HTM) web link (twitter.com)
 (TXT) w3m dump (twitter.com)
        
       | rdoherty wrote:
        | This is called chaos engineering, and many companies have
        | built tooling to do exactly this. Netflix
        | pioneered/proselytized it years ago. Since your app likely
        | doesn't rely solely on AWS services even if it runs in AWS,
        | you want something either on the servers themselves or built
        | into whatever low-level HTTP wrapper you use, and then use
        | that library to inject faults like high latency, errors,
        | timeouts, etc.
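        | 
        | A rough sketch of that wrapper idea in Go, assuming net/http
        | (faultTransport and the knobs on it are invented for
        | illustration, not any particular library):
        | 
        |     package main
        |     
        |     import (
        |         "errors"
        |         "log"
        |         "math/rand"
        |         "net/http"
        |         "time"
        |     )
        |     
        |     // faultTransport wraps an http.RoundTripper and injects
        |     // extra latency plus a configurable error rate.
        |     type faultTransport struct {
        |         next      http.RoundTripper
        |         errRate   float64       // fraction of calls that fail
        |         extraWait time.Duration // added latency per call
        |     }
        |     
        |     func (t *faultTransport) RoundTrip(r *http.Request) (*http.Response, error) {
        |         time.Sleep(t.extraWait)
        |         if rand.Float64() < t.errRate {
        |             return nil, errors.New("injected fault: simulated network error")
        |         }
        |         return t.next.RoundTrip(r)
        |     }
        |     
        |     func main() {
        |         // Every call through this client sees +200ms latency
        |         // and a 5% failure rate.
        |         client := &http.Client{Transport: &faultTransport{
        |             next:      http.DefaultTransport,
        |             errRate:   0.05,
        |             extraWait: 200 * time.Millisecond,
        |         }}
        |         resp, err := client.Get("https://example.com")
        |         if err != nil {
        |             log.Fatal(err) // the path your retries must survive
        |         }
        |         resp.Body.Close()
        |     }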
        
         | tempsolution wrote:
         | Sorry but this is just strangely naive. Yes, you should have
         | integration test setups that allow you to swap out any network
         | dependency with in-memory stubs, with fault-injected proxies,
         | etc. But all of that can never replace the behavior and chaos
          | of a real outage. On top of that, a lot of the more
          | complicated scenarios are immensely difficult to set up, and
          | you first also need to know HOW a region can fail.
         | 
         | I agree with you on one thing though. AWS, and actually every
         | cloud provider should ship a Chaos Engineering toolbox that
         | comes with ready-to-use, realistic failure simulations that you
         | can use in tests. I.e. drop-in replacements for their SDK
         | clients that then just start running through predefined failure
         | scenarios.
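          | 
          | A hedged sketch of that shape in Go; ObjectStore,
          | scriptedStore, and ErrThrottled are invented names for
          | illustration, not real SDK types:
          | 
          |     package chaos
          |     
          |     import "errors"
          |     
          |     // ObjectStore is a hypothetical app-side interface
          |     // over a real SDK client.
          |     type ObjectStore interface {
          |         Get(key string) ([]byte, error)
          |     }
          |     
          |     // ErrThrottled stands in for a realistic service
          |     // error (think: HTTP 503 SlowDown).
          |     var ErrThrottled = errors.New("simulated: 503 SlowDown")
          |     
          |     // scriptedStore wraps the real client and replays a
          |     // predefined failure scenario before passing calls on.
          |     type scriptedStore struct {
          |         real   ObjectStore
          |         script []error // nil entry = let the call through
          |         step   int
          |     }
          |     
          |     func (s *scriptedStore) Get(key string) ([]byte, error) {
          |         if s.step < len(s.script) {
          |             err := s.script[s.step]
          |             s.step++
          |             if err != nil {
          |                 return nil, err
          |             }
          |         }
          |         return s.real.Get(key)
          |     }
          | 
          | A test could then construct it with script:
          | []error{ErrThrottled, ErrThrottled, nil} and assert the
          | caller retries with backoff instead of falling over.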
        
         | jedberg wrote:
          | This type of service would be a complement to those
          | techniques, not a replacement. Ideally we could have both.
        
         | infogulch wrote:
         | It's harder to do chaos engineering if you're not engineering
         | it. What this is really asking for is the service provider to
         | sell chaos engineering as a service (CEAAS?), on the services
          | they provide. I've wanted this kind of thing for testing
          | cloud infrastructure before: you read in the docs about
          | various failure states and scenarios you might want to
          | handle, but there's no way to trigger them, so you just have
          | to hope they work as described and that your code is
          | correct. At the least, let users
         | simulate the effect of the failures that are _part of your
         | API_.
         | 
         | This would be great for testing the pieces of the stack that
         | the provider is responsible for, but you may still want to
         | inject chaos into the part of your stack that you do control.
        
           | segmondy wrote:
            | Netflix runs on AWS; they're doing chaos engineering on
            | it just fine.
        
         | fizwhiz wrote:
         | Came here to say this exact thing. There have been a variety of
         | techniques to achieve this - some intrusive to your binaries
         | (i.e. they require embedding specific libraries) and others
         | that are more "external" (ex: tc/iptables).
         | 
         | The "real" challenge is not _creating_ chaos but managing it
         | and verifying that your apps are resilient to said chaos.
        
         | [deleted]
        
       | georgewfraser wrote:
       | I think people overestimate the importance of failures of the
       | underlying cloud platform. One of the most surprising lessons of
       | the last 5 years at my company has been how rarely single points
       | of failure actually _fail_. A simple load-balanced group of EC2
       | instances, pointed at a single RDS Postgres database, is
        | astonishingly reliable. If you get fancy and build a
        | multi-master system, you can easily end up creating more
        | downtime than you prevent when your own failover/recovery
        | system runs amok.
        
       | msla wrote:
       | A zone not only of sight and sound, but of CPU faults and RAM
       | errors, cache inconsistency and microcode bugs. A zone of the pit
       | of prod's fears and the peak of test's paranoia. Look, up ahead:
       | Your root is now read-only and your page cache has been mapped to
       | /dev/null! You're in the Unavailability Zone!
        
         | mindcrime wrote:
         | Your conductor on this journey through the Unavailability Zone:
         | the BOFH!
        
       | whoisjuan wrote:
       | Isn't this what Gremlin does?
        
       | bob1029 wrote:
        | It sounds to me like what some people would like is a magical
        | box they can throw their infrastructure into that will
        | automatically shit-test all the things that could potentially
        | go wrong for them. This is poor engineering. Arbitrary,
        | contrived error
       | conditions do not constitute a rational test fixture. If you are
       | not already aware of where failures might arise in your
       | application and how to explicitly probe those areas, you are
       | gambling at best. Not all errors are going to generate stack
       | traces, and not all errors are going to be detectable by your
       | users. What you would consider an error condition for one
       | application may be a completely acceptable outcome for another.
       | 
       | This is the reliability engineering equivalent of building a data
       | warehouse when you don't know what sorts of reports you want to
       | run or how the data will generally be used after you collect it.
        
         | riskneutral wrote:
         | > building a data warehouse when you don't know what sorts of
         | reports you want to run or how the data will generally be used
         | after you collect it.
         | 
         | Hi Bob, I can't tell you what reports I want or what we'll do
         | with the data until you've first collected the data for
         | analysis. Thanks!
        
         | cogman10 wrote:
         | I disagree.
         | 
          | Not handling failures correctly is a time-honored tradition
          | in programming. It is so easy to miss.
          | 
          | For example, how often have you seen code check malloc for
          | `ENOMEM`?
         | 
          | Even though that's something that could be semi-common, and
          | definitely something you might be able to handle, most code
          | will simply blow chunks when that sort of condition happens.
          | Is the person that wrote it "wrong"? That's debatable.
         | 
         | Some languages like Go make it even trickier to detect that
         | someone forgot to handle an error condition. Nothing obvious in
         | the code review (other than knowledge of the API in question)
         | would get someone senior to catch those sorts of issues.
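          | 
          | For instance, a made-up but typical Go snippet; nothing
          | here looks wrong in review, yet two errors are silently
          | dropped:
          | 
          |     package main
          |     
          |     import "os"
          |     
          |     // saveState drops two errors no reviewer will flag:
          |     // Write can short-write (e.g. on ENOSPC), and Close is
          |     // often where buffered data actually reaches the disk,
          |     // so EIO can surface there.
          |     func saveState(data []byte) error {
          |         f, err := os.Create("state.json")
          |         if err != nil {
          |             return err
          |         }
          |         f.Write(data) // error ignored
          |         f.Close()     // error ignored
          |         return nil
          |     }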
         | 
         | So the question is, HOW do you catch those problems?
         | 
          | The answer seems obvious to me: you simulate problems in
          | integration tests. What happens when Service X simply
          | disappears? What happens when a server restarts
          | mid-communication? Is everything handled, or does this put
          | the apps into a non-recoverable state?
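          | 
          | As a sketch of how reproducible this can be made, Go's
          | net/http/httptest can stand in for "Service X" and then
          | vanish mid-test (the handler and client here are invented):
          | 
          |     package app
          |     
          |     import (
          |         "net/http"
          |         "net/http/httptest"
          |         "testing"
          |         "time"
          |     )
          |     
          |     func TestSurvivesDependencyVanishing(t *testing.T) {
          |         srv := httptest.NewServer(http.HandlerFunc(
          |             func(w http.ResponseWriter, r *http.Request) {
          |                 w.Write([]byte("ok"))
          |             }))
          |         client := &http.Client{Timeout: 2 * time.Second}
          |     
          |         // Service X disappears: drop live connections,
          |         // then the listener itself.
          |         srv.CloseClientConnections()
          |         srv.Close()
          |     
          |         if _, err := client.Get(srv.URL); err == nil {
          |             t.Fatal("expected an error once the dependency vanished")
          |         }
          |         // The real assertion in an app test: the code under
          |         // test degrades gracefully instead of hanging.
          |     }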
         | 
          | These are all great infrastructure tests that can catch a
          | lot of edge-case problems that may have been missed in code
          | reviews. Even better, that sort of infrastructure testing
          | can be generalized and applied to many applications. Making
          | rare events common in an environment makes it a lot easier
          | to catch the hard-to-notice bugs that everyone writes.
         | 
          | It's basically fuzz testing, but for infrastructure. Fuzz
          | testing has been shown to have a ton of value, and
          | infrastructure fuzzing seems like a natural, valuable
          | extension of it, especially since high reliability and low
          | maintenance are things everyone should want.
        
       | chucky_z wrote:
       | it's us-west-1! :D
       | 
       | we've had a ton of instances fail at once because they had some
       | kind of rack-level failure and a bunch of our EC2s ended up in
       | the same rack. :(
        
       | missosoup wrote:
       | That region is called Microsoft Azure. It will even break the
       | control UI with high frequency.
        
         | llama052 wrote:
         | I was going to post this but you beat me to it.
         | 
          | We are forced to use Azure for business reasons where I
          | work, and the frequency of one-off failures and outages is
          | insane.
        
         | moooo99 wrote:
         | Thank you, this is the exact comment I was looking for
        
       | terom wrote:
       | There are multiple methods for automating AWS EC2 instance
       | recovery for instances in the "system status check failed" or
       | "scheduled for retirement event" cases.
       | 
        | I've yet to figure out how to test any of those CloudWatch
        | alarms/rules. I've had them deployed in my dev/test
        | environments for months now, after having to manually deal
        | with a handful of them in a short time period. They've yet to
        | trigger even once since.
       | 
       | Umbrellas when it's raining etc.
        
         | wmf wrote:
         | This is why it seems like it would be good to have explicit
         | fault injection APIs instead of assuming that the normal APIs
         | behave the same as a real failure.
        
       | rob-olmos wrote:
       | I imagine AWS and other clouds have a staging/simulation
       | environment for testing their own services. I seem to recall them
       | discussing that for VPC during re:Invent or something.
       | 
       | I'm on the fence though if I would want a separate region for
       | this with various random failures. I think I'd be more interested
       | in being able to inject faults/latencies/degradation in existing
       | regions, and when I want them to happen for more control and
       | ability to verify any fixes.
       | 
        | Would be interesting to see how they price it as well. A high
        | per-API cost depending on the service being affected, combined
        | with a duration. E.g., make these EBS volumes 50% slower for
        | the next 5 min.
       | 
       | Then after or in tandem with the API pieces, release their own
       | hosted Chaos Monkey type service.
        
       | jedberg wrote:
       | For those saying "Chaos Engineering", first off, the poster is
       | well aware of Chaos Engineering. He's an AWS Hero and the founder
       | of Tarsnap.
       | 
       | Secondly, this would help make CE better. I actually asked Amazon
       | for an API to do this ten years ago when I was working on Chaos
       | Monkey.
       | 
       | I asked for an API to do a hard power off of an instance. To this
       | day, you can only do a graceful power off. I want to know what
       | happens when the instance just goes away.
       | 
        | I also asked for an API to slow down networking, set a random
        | packet drop rate, simulate EBS failures, etc. All of these
        | things can be simulated with software, but it's still not
        | exactly the same as when it happens outside the OS.
       | 
       | Basically I want an API where I can torture an EC2 instance to
       | see what happens to it, for science!
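        | 
        | Purely as a wish list - nothing like this exists in the real
        | AWS SDK, and every name below is hypothetical - the shape I
        | have in mind is roughly:
        | 
        |     package chaos
        |     
        |     import "time"
        |     
        |     // InstanceTorturer is a hypothetical chaos API for EC2.
        |     type InstanceTorturer interface {
        |         // Yank the cord: no ACPI signal, no cache flush.
        |         HardPowerOff(instanceID string) error
        |         // Drop a fraction of packets, e.g. 0.05 = 5%.
        |         SetPacketLoss(instanceID string, rate float64) error
        |         // Slow the virtual NIC by a fixed delay.
        |         AddLatency(instanceID string, d time.Duration) error
        |         // Make an EBS volume vanish from the guest.
        |         FailVolume(volumeID string) error
        |     }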
        
         | exdsq wrote:
         | Dumb question here from a CE beginner but can't you have a
         | Docker image for that service and turn it off?
        
           | paranoidrobot wrote:
           | OS-level 'turn off' options don't replicate what happens when
           | you yank power on a rack of equipment.
           | 
            | Pretty much every option you have from the OS will let
            | caches flush and in-progress writes complete.
            | 
            | Yank the power and none of that will happen. It lets you
            | actually see just how much your OS and hardware have been
            | fibbing to you.
            | 
            | Oh, you got a result from that flush to disk? It
            | completed? Are you sure? Really really sure? Let's find
            | out...
        
             | jrockway wrote:
             | The main result of this testing is that you'll find bugs in
             | all the abstraction layers that you can't control. You'll
             | then be paranoid but unable to take any action. Have you
             | ever seen the source code that's running on your SSD's CPU?
             | Nope. And it's probably not a small program. My guess is
             | that it works well when everything is OK, and fails
             | catastrophically 0.0001% of the time when everything isn't
             | OK. But you'll never know unless you try failing
             | catastrophically a million times. Did the vendor even try
             | failing catastrophically one million times before they
             | started manufacturing (or at least sent the batch to
             | retailers)? Did they do that, fail 1 time, and mark the bug
             | closed as non-reproducible?
             | 
             | I have no idea! Maybe everything is actually great. Or
             | maybe someone else will be oncall the week you hit that one
             | in a million failure case. Or maybe it will happen to your
             | competitors instead of you! Without testing, all we have is
             | hope.
        
               | landryraccoon wrote:
                | I think this is too pessimistic. If all you do is
                | test your error detection and backup recovery
                | mechanism (of course you have backups, right?), then
                | that's still an advantage.
                | 
                | Let's say a hard power-off causes data corruption
                | that can't be fixed. For lots of applications,
                | downtime is better than corruption, so in that case
                | you'll at least be able to test when you should take
                | the system down and recover from a known-good backup.
        
               | alexchamberlain wrote:
               | Even with testing, it's still just hope; to quote Buzz
               | Lightyear, it's hoping with style.
        
             | IgorPartola wrote:
             | _kill -9_ on the VM process from the host?
        
           | jedberg wrote:
           | If you're using Docker, yes, you can get a lot more options
           | in testing by poking "from the outside", but that still
           | doesn't test what happens when your docker host just dies.
        
           | dotancohen wrote:
           | That's not a dumb question at all.
           | 
            | I'm not the GP, but I assume it is because the GP would
            | like to run his "production" setup (or even the production
            | setup itself) under such circumstances. As far as I
            | remember, the idea originated with Chaos Monkey, which
            | would sabotage certain services in a production
            | environment to ensure that everything was redundant and
            | fail-safe.
        
             | [deleted]
        
         | aloknnikhil wrote:
         | > I asked for an API to do a hard power off of an instance. To
         | this day, you can only do a graceful power off. I want to know
         | what happens when the instance just goes away.
         | 
         | Wouldn't just running "halt -f" do the same?
        
           | vetrom wrote:
           | Not exactly. "halt -f" depends on some variant of power
           | management working. For extra fun, you could also have
           | partial power-offs, like say a power cut to the motherboard,
           | or any combination of other devices with their own power
           | lines from a power supply. (in a current PC these would
           | typically be PCI supplementary power, motherboard, SATA
           | power, 4-pin 12v molex power, supplemental CPU power.)
           | 
            | Granted, most of those are out of scope for cloud
            | development, but externally cutting off a VM is different
            | from even calling whatever power service you have to cut
            | power. In the 'real' world, enough of the failure triggers
            | above would probably also trigger an automated power cycle
            | if you're in a managed environment.
        
           | 0xEFF wrote:
           | Kernel panics work very well for this use case.
           | 
           | echo c > /proc/sysrq-trigger
        
             | derefr wrote:
              | Maybe for testing the crash-resilience of software
              | running _on_ the node, but not necessarily for testing
              | how the SDN autoscaling glop you've got configured
              | responds to the node's death.
             | 
             | A panicking instance is still "alive" from its hypervisor's
             | perspective (either it'll hang, sleep, or reboot, but it
             | won't usually _turn off_ its vCPU in a way the hypervisor
             | would register as  "the instance is now off"); while if a
             | hypervisor box suffers a power cut, the rest of the compute
             | cluster knows that the instances on that node are now very
             | certainly _off_.
        
             | raverbashing wrote:
              | For reference: https://unix.stackexchange.com/a/66205
              | (you obviously need to be root)
        
           | jedberg wrote:
           | It would be really close yes, but not exactly the same as
           | ripping the power cord out. It still gives the OS a hint that
           | shutdown is coming.
        
           | zyamada wrote:
            | Possibly, but how can we be 100% sure without the ability
            | to compare behavior? If I were following this line of
            | research I'd still want to know if there's any difference
            | in the nature of a failure when it comes from within the
            | OS (possibly simulated by halt -f) versus the situation
            | the parent OP pointed out, where the instance just goes
            | poof without sending any kind of signal to the OS itself.
        
           | JulianMorrison wrote:
           | To give an example of the difference: does it fsync() before
           | stopping?
        
         | satya71 wrote:
          | I think localstack[1] gets you a lot of this.
         | 
         | [1] https://localstack.cloud/
        
           | nbadg wrote:
           | I was just about to suggest localstack. I use it religiously
           | in personal projects; can't recommend it enough. I haven't
           | started telling it to induce errors yet, but it definitely
           | has that capability. And if you're running it in docker, some
           | of the network stuff can be simulated that way as well.
        
         | pojzon wrote:
          | Did you try to set up an on premis Eucalyptus cloud for
          | that? Eucalyptus has an API compliant with the AWS API.
        
           | jimnotgym wrote:
            | I really wish the world could agree on a better term than
            | 'on premis' for 'not in the cloud'. At least 'on premises'
            | is descriptive of the situation, although a bit clunky.
            | 'On premis' doesn't make any sense at all: 'premis' is not
            | the singular of 'premises', 'premises' is the singular of
            | itself! I prefer 'on site' or 'self hosted' myself.
        
         | simonebrunozzi wrote:
         | > Basically I want an API where I can torture an EC2 instance
         | to see what happens to it, for science!
         | 
         | And one day there will be PETA [0] for EC2 instances!
         | 
         | [0]: https://www.peta.org/
        
         | peterwwillis wrote:
          | > Basically I want an API where I can torture an EC2
          | instance to see what happens to it, for science!
          | 
          |     # cat > dropme.sh <<'EOFILE'
          |     #!/bin/sh
          |     set -eu
          |     read -r SLEEP
          |     tmp=`mktemp` ; tmp6=`mktemp`
          |     iptables-save > $tmp
          |     ip6tables-save > $tmp6
          |     for t in iptables ip6tables ; do
          |       for c in INPUT OUTPUT FORWARD ; do $t -P $c DROP ; done
          |       $t -t nat -F ; $t -t mangle -F ; $t -F ; $t -X
          |     done
          |     sleep "$SLEEP"
          |     iptables-restore < $tmp
          |     ip6tables-restore < $tmp6
          |     rm -f $tmp $tmp6
          |     EOFILE
          |     # chmod 755 dropme.sh
          |     # ncat -k -l -c ./dropme.sh $(ifconfig eth0 | grep 'inet ' | awk '{print $2}') 12345 &
          |     # echo "60" | ncat -v $(ifconfig eth0 | grep 'inet ' | awk '{print $2}') 12345
         | 
         | If you're lucky the existing connections won't even die, but
         | the box will be offline for 60 seconds.
        
         | dmurray wrote:
         | > For those saying "Chaos Engineering", first off, the poster
         | is well aware of Chaos Engineering. He's an AWS Hero and the
         | founder of Tarsnap.
         | 
         | Yeah, but did he win the Putnam?
        
           | jedberg wrote:
           | He did, but that particular accolade didn't seem relevant. :)
        
           | hitekker wrote:
           | For reference: https://news.ycombinator.com/item?id=35079
        
             | Terretta wrote:
             | Same thread where Drew from "getdropbox.com" says he also
             | has a "sync and backup done right" idea...
        
         | asdff wrote:
         | I wonder how much you would have to pay Amazon for them to send
         | a tech down to the datacenter and pull the plug on your running
         | node
        
           | yjftsjthsd-h wrote:
           | It's AWS; surely half of the work would be finding the server
           | that happens to be running your instance, anonymously sitting
           | in a datacenter with thousands of other servers. There's also
           | the question of how many other customers' instances are
           | located on the same hardware.
        
         | breatheoften wrote:
         | Fascinating!
         | 
          | The mere existence of such an API would be an interesting
          | source of problems when used accidentally or via buggy code
          | ...
          | 
          | I wonder to what degree this functionality would end up
          | being relied on as an "in the worst case, hard-kill things
          | to recover" behavior that folks use for bad engineering
          | reasons as opposed to good ones ...
        
           | odonnellryan wrote:
            | Make it hard to call accidentally, and hide it in the
            | docs somewhere among the items related to testing.
        
           | giancarlostoro wrote:
            | I would assume said code should never be part of actual
            | deployments, or only part of unit tests; maybe even an
            | external project.
        
             | agravier wrote:
             | Oh you, sweet summer child...
        
         | hiyer wrote:
         | Today you can use spot block instances for this. They are
         | guaranteed to die off after your chosen time block of 1-6
         | hours.
        
           | jedberg wrote:
           | As far as I know it still sends a graceful shutdown at the
           | end of the time block. It sends exactly the same signal as if
           | you use the shutdown API, which is the same as pressing the
           | power button on the front of the machine.
        
             | hackmiester wrote:
             | Surely it eventually kills you if you just ignore the
             | signal...?! (Though of course, by that time, we've missed
             | the point of this exercise.)
        
             | toredash wrote:
             | Correct!
        
             | recuter wrote:
             | Wouldn't a simple cron that disconnects the networking on
             | an instance randomly and/or pegs all the cores be
             | equivalent?
        
               | jedberg wrote:
               | There are lots of ways to get a very close experience to
               | a power pull, but nothing you can do from within the VM
               | is quite the same as doing it outside the VM.
        
               | recuter wrote:
                | I believe you but I'm curious as to the subtle
                | difference.
               | 
               | To my mind simply turning off the networking, physically
               | cutting a cable, or hardware spontaneously combusting are
               | not discernible events to an outside observer. What am I
               | missing?
        
               | thejj100100 wrote:
                | For tasks that are writing to disk, a network outage
                | wouldn't stop the task?
        
               | recuter wrote:
               | If the disk is network attached it would, and if it
               | isn't, what difference does it make?
        
               | hackmiester wrote:
               | Well, we would know that if we could try it.
        
       | fred_is_fred wrote:
       | us-east-1?
        
         | NovemberWhiskey wrote:
          | Anecdotally, I hear the South American regions are the
          | places where the real canary stuff goes out first.
        
           | [deleted]
        
           | mitchs wrote:
           | I've heard a fun story from the old timers in my org about a
            | fiber outage in Brazil. A routine fiber cut occurred.
            | They figure out how far from one end the cut is (there is
            | gear that measures the time it takes a light pulse to
            | reflect off the cut end). Then they pull out a map of
            | where the fiber was laid, count out the distance, and
            | send a technician out to have a look at where they expect
            | the cut to be. All standard practice up until this point.
           | 
           | The technician updates the ticket after a while with "cannot
           | find road." The folks back in the office try to send them
           | directions, but then the technician clarifies, "road is
           | gone." Our fiber, and the road it was buried under was
           | totally demolished in the few hours it took to get someone
           | out there. The developing world can develop at alarming
           | rates.
           | 
            | Other tales from the middle of nowhere: people shoot at
            | aerial fiber with guns, or dig it up and cut it for fun.
            | One time our technician was carjacked on the way to doing
            | a repair.
        
         | swasheck wrote:
         | For us it's AP-SE-2 with us-east-1 as a close second
        
         | spullara wrote:
         | I came here to post this and it isn't even a joke. Just true.
        
           | yupyup54133 wrote:
           | same here :-)
        
       | falcolas wrote:
       | I don't work with the group directly, but one group at our
       | company has set up Gremlin, and the breadth and depth of outages
       | Gremlin can cause is pretty impressive. Chaos Testing FTW.
        
         | robpco wrote:
         | I've also had a customer who used Gremlin to dramatically
         | improve their stability.
        
       | imhoguy wrote:
        | Failing individual compute instances isn't hard; some chaos
        | script to kill VMs is enough. Worst are the situations when
        | things seem to be up but aren't behaving acceptably: abnormal
        | network latency, random packet drops, random but repeatable
        | service errors, lagging eventual consistency. And that's not
        | even mentioning hardware woes.
        
       | davidrupp wrote:
       | [Disclaimer: I work as a software engineer at Amazon (opinions my
       | own, obvs)]
       | 
       | The chaos aspect of this would certainly increase the
       | evolutionary pressure on your systems to get better. You would
       | need really good visibility into what exactly was going on at the
       | time your stuff fell over, so you could know what combination(s)
       | to guard against next time. But there is definitely a class of
       | problems this would help you discover and solve.
       | 
       | The problem with the testing aspect, though, is that test
       | failures are most helpful when they're deterministic. If you
       | could dictate the type, number, and sequence of specific
       | failures, then write tests (and corresponding code) that help
       | make your system resilient to that combination, that would
       | definitely be useful. It seems like "us-fail-1" would be more
       | helpful for organic discovery of failure conditions, less so for
       | the testing of specific conditions.
        
         | cogman10 wrote:
         | > The problem with the testing aspect, though, is that test
         | failures are most helpful when they're deterministic.
         | 
         | Let's not let `perfect` get in the way of `good`.
         | 
          | Certainly having a 100% traceable system would be ideal,
          | but most systems are not that.
         | 
          | There is still a TON of low-hanging, easy-to-find issues
          | that would automatically fall out of a system of random
          | failures. Even if engineers have to spend some time figuring
          | out what the hell is going on, it would improve their system
          | overall, because it would shine a bright flashlight on the
          | system to let them know "hey, something is rotten here".
          | From there, more deterministic tests and better tracing can
          | be added.
        
       | haecceity wrote:
        | Why does Twitter often fail to load when I open a thread, yet
        | work if I refresh? Does Twitter use us-fail-1?
        
         | saagarjha wrote:
         | I think they don't like browsers they can't fingerprint, or
         | something like that.
        
         | Havoc wrote:
         | I get this too. Seems to always be desktop
        
           | notpiika wrote:
           | For me it's the exact opposite -- always on mobile, always
           | when logged out (haven't extensively tested being logged in
           | on mobile, since I'm barely logged in to Twitter on mobile.)
        
         | caymanjim wrote:
         | I don't know why, but it happens to everyone and it's been that
         | way for a long time. Either their engineers are failing, or
         | there's some sketchy monetary reason for it. You're not the
         | only one.
        
       | martin-adams wrote:
        | I can see a use case for this being implemented on top of
        | Kubernetes. I've no idea if that's achievable, but it could go
        | some way toward making your code more resilient.
        
       | MattGaiser wrote:
       | Whichever region Quora is using.
        
       | CloudNetworking wrote:
       | You can use IBM cloud for that purpose
        
         | chrishynes wrote:
         | Savage.
        
         | [deleted]
        
         | [deleted]
        
         | enahs-sf wrote:
         | There's a microsoft azure/google cloud joke in there
         | somewhere...
        
           | dotancohen wrote:
           | No, really, it's the IBM cloud that is the joke. This isn't
           | the first I've heard of it, though I've not used it myself.
           | 
            | I'm a happy AWS user, and I'll stay a happy AWS user not
            | for their prices or features but for their service, which
            | was the reason I was a Rackspace fan before it was sold
            | and went down the tubes.
        
         | toast0 wrote:
        | Hey, I used their load balancers for a couple of months, and
        | they only failed every 30 days; that's not high frequency.
        
           | paranoidrobot wrote:
           | I used to use their Citrix Netscaler VPX1000s at a previous
           | job.
           | 
            | They were very reliable, IMO. Aside from general
            | Netscaler bullshit, we only ever had issues with them
            | when we'd try to get them to do too much, so that the CPU
            | or memory was overloaded.
           | 
           | We tried on a few occasions to get more cores allocated to
           | them, but no. This made terminating large numbers of SSL
           | connections on them problematic.
        
             | toast0 wrote:
              | I was using their shared load balancers, not the
              | run-a-load-balancer-in-a-VM option, because I was hoping
              | for something more reliable than a single computer. For
              | the couple of months they were running, it was literally
              | 10 minutes of downtime every 30 days. So I went back to
              | DNS round robin, 'cause it was better.
        
       | bootyfarm wrote:
        | I believe this is available as a service called "SoftLayer"
        
       | raverbashing wrote:
       | There's an easier way: spot instances (and us-east-1 as
       | mentioned)
       | 
        | As for things like EBS failing or dropping packets, it's a
        | bit tricky, as some things might break at the OS level.
       | 
       | And given sufficient failures, you can't swim anymore, you'll
       | just sink.
        
       | exabrial wrote:
        | Sort of counter-intuitive, but for small projects you want
        | hardware systems that are as resilient as possible... the
        | larger your scale-out, the less reliable you want them, to
        | force that resilience out of hardware and into software.
        
       | vemv wrote:
       | While these are not exclusive, personally I'd look instead into
       | studying my system's reliability in a way that is independent of
       | a cloud provider, or even of performing any side-effectful
       | testing at all.
       | 
        | There's extensive research and work on all things resilience.
        | One could say: if one builds a system that is proven to be
        | theoretically resilient, that model should extrapolate to
        | real-world resilience.
        | 
        | This approach is probably intimately related to
        | pure-functional programming, which I feel has not been
        | explored enough in this area.
        
       | exabrial wrote:
       | Simply host on Google Cloud! They will terminate your access for
       | something random, like someone said your name on YouTube while
        | doing something bad! They don't have a number you can call,
        | and their support is run by the stupidest of all AI
        | algorithms.
        
       | code4tee wrote:
        | This is what Chaos Monkey does.
        
       | dijit wrote:
       | Isn't us-east-1 exactly that?
       | 
        | All jokes aside, I actually asked my Google Cloud rep about
        | stuff like this; they came back with some solutions, but the
        | problem with that is: what kind of failure condition are you
        | hoping for?
       | 
       | Zonal outage (networking)? Hypervisor outage? Storage outage?
       | 
        | Unless it's something like S3 giving high error rates, most
        | things can actually be done manually. (And this was the advice
        | I got back, because faulting the entire set of APIs and tools
        | in unique and interesting ways is quite impossible.)
        
         | londons_explore wrote:
         | > Unless it's something like s3 giving high error rates
         | 
         | Just firewall off the real s3, and point clients at a proxy
         | which forwards most requests to the real s3 and returns errors
         | or delays to the rest.
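          | 
          | A minimal sketch of that proxy in Go, assuming
          | net/http/httputil and ignoring the Host-header and SigV4
          | signing details a real S3 setup would need:
          | 
          |     package main
          |     
          |     import (
          |         "log"
          |         "math/rand"
          |         "net/http"
          |         "net/http/httputil"
          |         "net/url"
          |     )
          |     
          |     func main() {
          |         s3, err := url.Parse("https://s3.amazonaws.com")
          |         if err != nil {
          |             log.Fatal(err)
          |         }
          |         proxy := httputil.NewSingleHostReverseProxy(s3)
          |     
          |         // Fail ~10% of requests the way S3 does under
          |         // load; forward the rest untouched.
          |         log.Fatal(http.ListenAndServe(":8080", http.HandlerFunc(
          |             func(w http.ResponseWriter, r *http.Request) {
          |                 if rand.Float64() < 0.10 {
          |                     http.Error(w, "SlowDown",
          |                         http.StatusServiceUnavailable)
          |                     return
          |                 }
          |                 proxy.ServeHTTP(w, r)
          |             })))
          |     }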
        
         | caymanjim wrote:
         | Yeah, us-east-1 is pretty good at failing already. We lost us-
         | east-1c for most of the day about a week ago due to a fiber
         | line being cut. I'd estimate that AWS manages fewer than "three
         | 9s" in us-east-1 on average. Not across the board, but at any
         | given time something has a decent chance of not working, be it
         | an entire AZ, or regional S3, etc. They're still pretty
         | reliable, and I like the idea of a zone with built-in failure
         | for testing things, but your joke about us-east-1 is based in
         | solid fact.
        
       | 6510 wrote:
       | Sounds useful. Crank it up to 99% failure and it becomes
       | interesting science.
        
         | caiobegotti wrote:
         | It actually sounds useful, to the point I wouldn't be surprised
         | if in the near future cloud providers bundled up some chaos
         | monkey stack and offered that with a neat price within their
         | realms (dunno, maybe per VPC or project).
        
           | pseudosavant wrote:
           | They will definitely figure out a way to charge us more for
           | hardware that is less reliable.
        
             | emerged wrote:
             | That's a great idea: instead of throwing away failing
             | hardware, toss it into the chaos region and charge double.
        
       | kentlyons wrote:
        | I want this at the programming language level too. If a
        | function call can fail, I want to set a flag and have it
        | (randomly?) fail. I hacked my way around this by adding a
        | wrapper that would randomly return errors for a bunch of
        | critical functions. It was great for working through a ton of
        | race conditions in golang with channels, remote connections,
        | etc. But hacking it in manually was annoying and not something
        | I'd want to commit.
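        | 
        | A sketch of that flag-plus-wrapper hack in Go (the chaos
        | package and MaybeFail are invented for illustration):
        | 
        |     package chaos
        |     
        |     import (
        |         "errors"
        |         "math/rand"
        |     )
        |     
        |     // Enabled is the flag: off in production, flipped on
        |     // during torture-test runs.
        |     var Enabled = false
        |     
        |     // ErrInjected marks failures that came from the wrapper
        |     // rather than the real call.
        |     var ErrInjected = errors.New("chaos: injected failure")
        |     
        |     // MaybeFail wraps any fallible call and randomly fails
        |     // it at the given rate when the flag is on.
        |     func MaybeFail(rate float64, call func() error) error {
        |         if Enabled && rand.Float64() < rate {
        |             return ErrInjected
        |         }
        |         return call()
        |     }
        | 
        | A critical call then becomes err := chaos.MaybeFail(0.01,
        | func() error { return conn.Send(msg) }) - which is exactly
        | the kind of thing that's annoying to hack in by hand.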
        
       | lordgeek wrote:
       | brilliant!
        
       | swasheck wrote:
       | Wait. I thought this was ap-southeast-2
        
       | jariel wrote:
       | This is a really great idea.
        
       | jonplackett wrote:
        | This is such a clever idea. I wonder if Amazon is smart enough
        | to actually do this.
        
       | foota wrote:
       | Just deploy a new region with no ops support, it'll quickly
       | become that.
        
       | gregdoesit wrote:
       | When I worked at Skype / Microsoft and Azure was quite young, the
       | Data team next to me had a close relationship with one of the
       | Azure groups who were building new data centers.
       | 
       | The Azure group would ask them to send large loads of data their
       | way, so they could get some "real" load on the servers. There
       | would be issues at the infra level, and the team had to detect
        | this and respond to it. In return, the data team would also
        | ask the Azure folks to just unplug a few machines - power them
        | off, take out network cables - helping them test what happens.
       | 
       | Unfortunately, this was a one-off, and once the data center was
       | stable, the team lost this kind of "insider" connection.
       | 
        | However, as a fun fact, at Skype we could use Azure for free
        | for about a year - every dev in the office, for work purposes
        | (including work pet projects). We spun up way too many
        | instances during that time, as you'd expect, and only got
        | around to turning them off when Azure changed billing to
        | charge internal customers 10% of the "regular" pricing.
        
       | thoraway1010 wrote:
       | A great idea! I'd love to run stuff in this zone. Rotate through
       | a bunch of errors, unavailability, latency spikes, power outages
       | etc every day, make it a 12 hour torture test cycle.
        
       | kevindong wrote:
       | At my job, my team owns a service that generally has great
        | uptime. Dependent teams/services have gotten into the habit of
        | assuming that our service will be 100% available, which is
        | problematic because it's obviously not. That false assumption
        | has unfortunately caused several minor incidents.
       | 
        | There has been some talk internally of doing chaos engineering
        | to help improve the reliability of our company's products as a
        | whole. Unfortunately, the most easily simulated failure
        | scenarios (e.g. entire containers going down at once
        | instantly) tend to be the least helpful, since my team
        | designed the service to tolerate those kinds of easily
        | modelable situations.
       | 
       | The more subtle/complex/interesting failure conditions are far
       | harder to recognize and simulate (e.g. all containers hosted on
       | one particular node experience 10s latencies on all network
       | traffic, stale DNS entries, broken service discovery, etc.).
        
       ___________________________________________________________________
       (page generated 2020-08-10 23:00 UTC)