[HN Gopher] AWS vs. GCP reliability is wildly different
       ___________________________________________________________________
        
       AWS vs. GCP reliability is wildly different
        
       Author : icyfox
       Score  : 160 points
       Date   : 2022-09-21 20:29 UTC (2 hours ago)
        
 (HTM) web link (freeman.vc)
 (TXT) w3m dump (freeman.vc)
        
       | user- wrote:
        | I wouldn't call this reliability, which already has a loaded
        | definition in the cloud world; I'd call it something more along
        | the lines of time-to-start or launch latency.
        
       | 1-6 wrote:
       | This is all about cloud GPUs, I was expecting something totally
       | different from the title.
        
       | s-xyz wrote:
        | Would be interested to see a comparison of Lambda functions vs
        | Google 2nd gen functions. I think that GCP is more serverless-
        | focused.
        
       | duskwuff wrote:
       | ... why does the first graph show some instances as having a
       | negative launch time? Is that meant to indicate errors, or has
       | GCP started preemptively launching instances to anticipate
       | requests?
        
         | tra3 wrote:
          | The y axis here measures the duration it took to successfully
          | spin up the box; negative results were requests that timed out
          | after 200 seconds. The results are pretty staggering.
        
         | zaltekk wrote:
         | I don't know how that value (looks like -50?) was chosen, but
         | it seems to correspond to the launch failures.
        
         | staringback wrote:
          | Perhaps if you read the line directly above the graph you
          | would see it was explained, and would not have to ask this
          | question.
        
       | zmmmmm wrote:
       | > In total it scaled up about 3,000 T4 GPUs per platform
       | 
       | > why I burned $150 on GPUs
       | 
        | How do you rent 3,000 GPUs over a period of weeks for $150? Were
        | they literally requisitioning them and releasing them
        | immediately? Seems like quite an unrealistic usage pattern, and
        | one that would depend a lot on whether the cloud provider
        | optimises to hand you back the same warm instance you just
        | relinquished.
       | 
       | > GCP allows you to attach a GPU to an arbitrary VM as a hardware
       | accelerator
       | 
        | It's quite fascinating that GCP can do this. GPUs are physical
        | things (!); do they provision every single instance type in the
        | data center with GPUs? That would seem very expensive.
        
         | bushbaba wrote:
          | Unlikely. More likely they put your VM on a host with a GPU
          | attached, and use live migration to move workloads around for
          | better resource utilization.
          | 
          | However, live migration can impact HPC workloads.
        
         | ZiiS wrote:
         | GPUs are physical but VMs are not; I expect they just move them
         | to a host with a GPU.
        
         | NavinF wrote:
         | It probably live-migrates your VM to a physical machine that
         | has a GPU available.
         | 
          | ...if there are any GPUs available in the AZ, that is. I had a
          | hell of a time last year moving back and forth between regions
          | to grab just 1 GPU to test something. The web UI didn't have
          | an "any region" option for launching VMs, so if you didn't use
          | the API you had to sit there for 20 minutes trying each
          | AZ/region until you managed to grab one.
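          | 
          | The API loop people end up writing is roughly this (a sketch,
          | assuming an authenticated gcloud CLI; the zone list, instance
          | name, and machine type are placeholders):
          | 
          |     import subprocess
          | 
          |     ZONES = ["us-central1-a", "us-central1-b", "us-east1-c",
          |              "europe-west4-a"]
          | 
          |     for zone in ZONES:
          |         r = subprocess.run(
          |             ["gcloud", "compute", "instances", "create",
          |              "gpu-scratch", "--zone", zone,
          |              "--machine-type", "n1-standard-4",
          |              "--accelerator", "type=nvidia-tesla-t4,count=1",
          |              "--maintenance-policy", "TERMINATE"],
          |             capture_output=True, text=True)
          |         if r.returncode == 0:
          |             print(f"got one in {zone}")
          |             break
          |         # else: usually a resource-exhausted error; try next zone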
        
       | remus wrote:
       | > The offerings between the two cloud vendors are also not the
       | same, which might relate to their differing response times. GCP
       | allows you to attach a GPU to an arbitrary VM as a hardware
       | accelerator - you can separately configure quantity of the CPUs
       | as needed. AWS only provisions defined VMs that have GPUs
       | attached - the g4dn.x series of hardware here. Each of these
       | instances are fixed in their CPU allocation, so if you want one
       | particular varietal of GPU you are stuck with the associated CPU
       | configuration.
       | 
       | At a surface level, the above (from the article) seems like a
       | pretty straightforward explanation? GCP gives you more
        | flexibility in configuring GPU instances at the cost of
        | increased startup-time variability.
        
         | btgeekboy wrote:
         | I wouldn't be surprised if GCP has GPUs scattered throughout
         | the datacenter. If you happen to want to attach one, it has to
         | find one for you to use - potentially live migrating your
         | instance or someone else's so that it can connect them. It'd
         | explain the massive variability between launch times.
        
           | master_crab wrote:
           | Yeah that was my thought too when I first read the blurb.
           | 
            | It's neat...but like a lot of things in large-scale
            | operations, the devil is in the details. GPU-CPU
            | communication is a low-latency, high-bandwidth operation,
            | not something you can trivially do over standard TCP. GCP
            | offering something like that without the ability to
            | flawlessly migrate the VM or procure enough "local" GPUs
            | means it's just vaporware.
            | 
            | As a side note, I'm surprised the author didn't note the
            | number of ICEs (insufficient capacity errors) AWS throws
            | whenever you spin up a G-type instance. AWS is notorious for
            | offering very few G's and P's in certain AZs and regions.
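            | 
            | (For reference, an ICE surfaces as an ordinary API error you
            | have to catch; a minimal boto3 sketch, with a placeholder
            | AMI ID:)
            | 
            |     import boto3
            |     from botocore.exceptions import ClientError
            | 
            |     ec2 = boto3.client("ec2", region_name="us-east-1")
            |     try:
            |         ec2.run_instances(ImageId="ami-12345678",
            |                           InstanceType="g4dn.xlarge",
            |                           MinCount=1, MaxCount=1)
            |     except ClientError as e:
            |         code = e.response["Error"]["Code"]
            |         if code == "InsufficientInstanceCapacity":
            |             pass  # retry in another AZ, or fall back
            |         else:
            |             raise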
        
       | dekhn wrote:
       | What would you expect? AWS is an org dedicated to giving
       | customers what they want and charging them for it, while GCP is
       | an org dedicated to telling customers what they want and using
       | the revenue to get slightly better cost margins on Intel servers.
        
         | dilyevsky wrote:
          | I don't believe this reasoning has been used since at least
          | Diane Greene.
        
           | dekhn wrote:
            | I haven't seen any real change in how Google approaches
            | cloud in the past decade (first as an employee and
            | developer of cloud services there, and now as a customer).
            | Their sales people have hollow eyes.
        
       | playingalong wrote:
       | This is great.
       | 
        | I have always felt there is so little independent content
        | benchmarking the IaaS providers. There is so much you can
        | measure about how they behave.
        
       | endisneigh wrote:
       | this doesn't really seem like a fair comparison, nor is it a
       | measure of "reliability".
        
       | humanfromearth wrote:
       | We have constant autoscaling issues because of this in GCP - glad
       | someone plotted this - hope people in GCP will pay a bit more
       | attention to this. Thanks to the OP!
        
       | kccqzy wrote:
       | Heard from a Googler that the internal infrastructure (Borg) is
       | simply not optimized for quick startup. Launching a new Borg job
       | often takes multiple minutes before the job runs. Not surprising
       | at all.
        
         | dekhn wrote:
          | A well-configured isolated Borg cluster and a well-configured
          | job can be really fast. If there's no preemption (i.e., no
          | other job that has to be kicked off and given a grace period),
          | the packages are already cached locally, there is no undue
          | load on the scheduler, the resources are available, and it's a
          | single job with tasks rather than multiple jobs, it will be
          | close to instantaneous.
          | 
          | I spent a significant fraction of my 11+ years there clicking
          | Reload on my job's Borg page. I was able to (re-)start ~100K
          | jobs globally in about 15 minutes.
        
         | dekhn wrote:
         | booting VMs != starting a borg job.
        
           | kccqzy wrote:
            | The technology may be different but the culture carries
            | over. People simply don't have the habit of optimizing for
            | startup time.
        
         | readams wrote:
            | Borg is not used for GCP VMs, though.
        
           | dilyevsky wrote:
            | It is used, but the Borg scheduler does not manage VM
            | startups.
        
         | epberry wrote:
          | Echoing this. The SRE book is also highly revealing about how
          | Google request prioritization works:
          | https://sre.google/sre-book/load-balancing-datacenter/
         | 
          | My personal opinion is that Google's resources are more
          | tightly optimized than AWS's: they may try to find the 99%
          | best allocation versus the 95% best allocation on AWS, and
          | this leads to more rejected requests. Open to being wrong on
          | this.
        
         | valleyjo wrote:
          | As another comment points out, GPU resources are less common,
          | so they take longer to create, which makes sense. In general,
          | startup times are pretty quick on GCP, as other comments also
          | confirm.
        
       | MonkeyMalarky wrote:
       | I would love to see the same for deploying things like a
       | cloud/lambda function.
        
       | politelemon wrote:
        | A few weeks ago I needed to change the volume type on an EC2
        | instance to gp3. Following the instructions, I made the change
        | while the instance was running. I didn't need to reboot or stop
        | the instance; it just changed the type. While the instance was
        | running.
        | 
        | I didn't understand how they were able to do this; I had
        | thought volume types mapped to hardware clusters of some kind.
        | And since I didn't understand, I wasn't able to distinguish it
        | from magic.
        
         | osti wrote:
         | Look up AWS Nitro on YouTube if you are interested in learning
         | more about it.
        
         | ArchOversight wrote:
          | Changing the volume type on AWS is somewhat magical. Seeing it
          | happen online was amazing.
        
         | cavisne wrote:
          | EBS is already replicated, so they probably just migrate
          | behind the scenes, same as if the original physical disk were
          | corrupted. It looks like only certain conditions allow this
          | kind of migration.
         | 
         | https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/modify-v...
        
         | Salgat wrote:
          | If I remember right, they use the equivalent of a ledger of
          | changes to manage volume state. So in this case, they copy the
          | contents (up to a certain point in time) over to the new,
          | faster virtual volume, then append and direct all new writes
          | to the new volume.
         | 
         | This is also how they are able to snapshot a volume at a
         | certain point in time without having any downtime or data
         | inconsistencies.
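          | 
          | A toy model of that idea (purely illustrative; not AWS's
          | actual design):
          | 
          |     class LedgerVolume:
          |         def __init__(self):
          |             self.ledger = []  # (block_no, data), write order
          | 
          |         def write(self, block_no, data):
          |             self.ledger.append((block_no, data))
          | 
          |         def snapshot(self):
          |             # a snapshot is just an offset, captured instantly
          |             return len(self.ledger)
          | 
          |         def read(self, block_no, upto=None):
          |             upto = len(self.ledger) if upto is None else upto
          |             for bn, data in reversed(self.ledger[:upto]):
          |                 if bn == block_no:
          |                     return data
          |             return b"\x00"  # unwritten blocks read as zeroes
          | 
          |     # Migration: bulk-copy ledger[:snapshot()] to the faster
          |     # volume in the background while fresh writes append past
          |     # the cut; no downtime, no inconsistency.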
        
         | xyzzyz wrote:
          | Dunno about AWS, but GCP uses live migration, and will migrate
          | your VM across physical machines as necessary. The disk
          | volumes are all connected over the network; nothing really
          | depends on the actual physical machine your VM is run on.
        
           | lbhdc wrote:
           | How does migrating a vm to another physical machine work?
        
             | the_duke wrote:
             | This blog post is pretty old (2015) but gives a good
             | introduction.
             | 
             | https://cloudplatform.googleblog.com/2015/03/Google-
             | Compute-...
        
               | lbhdc wrote:
               | Thanks for sharing, I will give it a read!
        
             | rejectfinite wrote:
             | vsphere vmotion has been a thing for years lmao
        
             | roomey wrote:
              | VMware has been doing this for years; it's called vMotion,
              | and there is a lot of documentation about it if you are
              | interested (e.g.
              | https://www.thegeekpub.com/8407/how-vmotion-works/ )
              | 
              | Essentially, memory state is copied to the new host, the
              | VM is stunned for a millisecond while the CPU state is
              | copied, and it resumes on the new host (you may see a
              | dropped ping). All the networking and storage is virtual
              | anyway, so that is "moved" (it's not really moved) in the
              | background.
        
               | davidro80 wrote:
        
               | lbhdc wrote:
                | That is really interesting, I didn't realize it was so
                | fast. Thanks for the post, I will give it a read!
        
               | mh- wrote:
               | Up to 500ms per your source, depending on how much churn
               | there is in the memory from the source system.
               | 
               | Very cool.
        
             | valleyjo wrote:
              | Stream the contents of RAM from source to dest, pause the
              | source, reprogram the network and copy any memory that
              | changed since the initial stream, resume the dest, destroy
              | the source, profit.
        
             | pclmulqdq wrote:
             | They pause your VM, copy everything about its state over to
             | the new machine, and quickly start the other instance. It's
             | pretty clever. I think there are tricks you can play with
             | machines that have large memory footprints to copy most of
             | it before the pause, and only copy what has changed since
             | then during the pause.
             | 
             | The disks are all on the network, so no need to move
             | anything there.
        
               | prmoustache wrote:
                | In reality it syncs the memory to the other host first,
                | and only pauses the VM once the remaining delta is small
                | enough that the pause is barely measurable.
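                | 
                | (In toy form, with hypothetical src/dst host handles,
                | the pre-copy loop is roughly:)
                | 
                |     def live_migrate(src, dst, threshold=64):
                |         dirty = set(src.all_pages())  # first pass: all
                |         while len(dirty) > threshold:
                |             src.clear_dirty_tracking()
                |             for page in dirty:
                |                 dst.write_page(page, src.read_page(page))
                |             dirty = src.dirty_pages()  # touched meanwhile
                |         src.pause()  # brief stop-the-world
                |         for page in dirty:
                |             dst.write_page(page, src.read_page(page))
                |         dst.write_cpu_state(src.read_cpu_state())
                |         dst.resume()  # VM continues on the new host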
        
               | lbhdc wrote:
                | When it's transferring the state to the target, how does
                | it handle memory updates that are happening at that
                | time? Is the program's execution paused at that point?
        
               | water-your-self wrote:
                | Indiana Jones and the register states
        
           | valleyjo wrote:
            | Azure, AWS and GCP all have live migration. VMware has it
            | too.
        
             | dilyevsky wrote:
                | EC2 does not have live migration. On Azure it's spotty,
                | so not every maintenance event can use it.
        
               | [deleted]
        
             | free652 wrote:
              | Are you sure? AWS consistently requires me to migrate to a
              | different host. They go as far as shutting down instances,
              | but don't do any kind of live migration.
        
         | shrubble wrote:
         | Assuming this blurb is accurate: " General-purpose SSD volume
         | (gp3) provides the consistent 125 MiB/s throughput and 3000
         | IOPS within the price of provisioned storage. Additional IOPS
         | (up to 16,000) and throughput (1000 MiB/s) can be provisioned
         | with an additional price. The General-purpose SSD volume (gp2)
         | provides 3 IOPS per GiB storage provisioned with a minimum of
         | 100 IOPS"
         | 
          | ... then it seems like there is a device that limits bandwidth
          | either on the storage cluster or between the node and the
          | storage cluster. 125 MiB/s is right around the speed of a
          | 1 Gbit link, I believe. That it's just a networking setting
          | changed in-switch wouldn't be surprising.
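          | 
          | (Back-of-envelope: 1 Gbit/s / 8 = 125 MB/s ~= 119 MiB/s;
          | conversely, 125 MiB/s ~= 1.05 Gbit/s, so the gp3 baseline is
          | within about 5% of a saturated gigabit link.)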
        
           | nonameiguess wrote:
            | This would have been my guess. All EBS volumes are stored on
            | physical disks that support the highest bandwidth and IOPS
            | you can migrate to, and the actual rates you get are
            | determined by something in the interconnect. Live migration
            | is thus a matter of swapping out the interconnect between
            | the VM and the disk, or even just relaxing a logical
            | rate-limiter, without having to migrate your data to a
            | different disk.
        
             | prmoustache wrote:
              | The actual migration is not instantaneous, despite the
              | volume being immediately reported as gp3. You get a status
              | change to "optimizing", if my memory is correct, with a
              | percentage. And the larger the volume, the longer it
              | takes, so there is definitely a sync to faster storage.
        
       | 0xbadcafebee wrote:
        | Reliability in general is measured on the basic principle of:
        | _does it function within our defined expectations?_ As long as
        | it's launching, and it eventually responds within SLA/SLO
        | limits, and on failure comes back within SLA/SLO limits, it is
        | reliable. Even with GCP's multiple failures to launch, that may
        | still be considered "reliable" within their SLA.
        | 
        | If both AWS and GCP had the same SLA, and one did better than
        | the other at starting up, you could say one is _more performant_
        | than the other, but you couldn't say it's _more reliable_ if
        | they are both meeting the SLA. It's easy to look at something
        | that never goes down and say "that is more reliable", but it
        | might have been pure chance that it never went down. Always read
        | the fine print, and don't expect anything better than what they
        | guarantee.
        
       | cmcconomy wrote:
       | I wish Azure was here to round it out!
        
       | londons_explore wrote:
        | AWS normally has machines sitting idle just waiting for you to
        | use. That's why they can get you going in a couple of seconds.
        | 
        | GCP on the other hand fills all machines with background jobs.
        | When you want a machine, they need to terminate a background job
        | to make room for you. That background job has a shutdown grace
        | period, usually 30 seconds.
        | 
        | Sometimes, to prevent fragmentation, they actually need to
        | shuffle many other users around to give you the perfect slot -
        | and some of those jobs have start-new-before-stop-old semantics -
        | which is why the delay is sometimes far higher too.
        
         | dekhn wrote:
         | borg implements preemption but the delay to start VMs is not
         | because they are waiting for a background task to clean up.
        
       | devxpy wrote:
       | Is this testing for spot instances?
       | 
       | In my limited experience, persistent (on-demand) GCP instances
       | always boot up much faster than AWS EC2 instances.
        
         | marcinzm wrote:
         | In my experience GPU persistent instances often simply don't
         | boot up on GCP due to lack of available GPUs. One reason I
         | didn't choose GCP at my last company.
        
       | rwalle wrote:
        | Looks like the author has never heard of the word "histogram".
        | 
        | That graph is a pain to read.
        
       | charbull wrote:
        | Can you put this in the context of the problem/use case/need
        | you are solving for?
        
       | ajross wrote:
       | Worth pointing out that the article is measuring provisioning
       | latency and success rates (how quickly can you get a GPU box
       | running and whether or not you get an error back from the API
       | when you try), and not "reliability" as most readers would
       | understand it (how likely they are to do what you want them to do
       | without failure).
       | 
       | Definitely seems like interesting info, though.
        
       | curious_cat_163 wrote:
        | Setting the use of the word "reliability" aside, it is
        | interesting to see the differences in launch times and errors.
       | 
       | One explanation is that AWS has been at it longer, so they know
       | better. That seems like an unsatisfying explanation though, given
       | Google's massive advantage on building and running distributed
       | systems.
       | 
        | Another explanation could be that AWS is more
        | "customer-focused", i.e. they pay a lot more attention to
        | technical issues that are perceptible to a blog writer. But I
        | am not sure why Google would not be incentivized to do the
        | same. They are certainly motivated and have brought capital to
        | bear in this fight.
       | 
       | So, what gives?
        
       | PigiVinci83 wrote:
        | Thank you for this article; it confirms my direct experience.
        | I've never run a benchmarking test, but I can see this every
        | day.
        
       | amaks wrote:
       | The link is broken?
        
         | lucb1e wrote:
         | Works for me using Firefox in Germany, although the article
         | doesn't really match the title so maybe that's why you were
         | confused? :p
        
       | danielmarkbruce wrote:
       | It's meant to say "ephemeral"... right? It's hard to read after
       | that.
        
         | datalopers wrote:
         | ephemeral and ethereal are commonly confused words.
        
           | dublin wrote:
            | Ephemerides really throw them. (And thank God for PyEphem,
            | which makes all that otherwise quite fiddly stuff really
            | easy...)
        
           | danielmarkbruce wrote:
            | I guess that's fair. It's sort of a smell when someone uses
            | the wrong word (especially in writing), though. It suggests
            | they aren't in the industry, throwing ideas around with
            | other folks. The word "ephemeral" is used extensively
            | amongst software engineers.
        
       | dark-star wrote:
       | I wonder why someone would equate "instance launch time" with
       | "reliability"... I won't go as far as calling it "clickbait" but
       | wouldn't some other noun ("startup performance is wildly
       | different") have made more sense?
        
         | santoshalper wrote:
         | I won't go so far as saying "you didn't read the article", but
         | I think you missed something.
        
         | xmonkee wrote:
          | GCP also had 84 errors, compared to 1 for AWS.
        
           | danielmarkbruce wrote:
           | If not a 4xx, what should they return for instance not
           | available?
        
             | eurasiantiger wrote:
             | 503 service unavailable?
        
               | sn0wf1re wrote:
               | That would be confusing. The HTTP response code should
               | not be conflated with the application's state.
        
               | dheera wrote:
               | Using HTTP error codes for non-REST things is cringe.
               | 
               | 503 would mean the IaaS API calls themselves are
               | unavailable. Very different from the API working
               | perfectly fine but the instances not being available.
        
           | sheeshkebab wrote:
            | Maybe 1 reported. Not saying AWS reliability is bad, but the
            | number of glitches that crop up in various AWS services and
            | are not reflected on their status page is quite high.
        
             | theamk wrote:
              | That was measured from API call return codes, not by
              | looking at the overall service status page.
              | 
              | Amazon is pretty good about this: if their API says a
              | machine is ready, it usually is.
        
             | mcqueenjordan wrote:
             | Errors returned from APIs and the status page are
             | completely separate topics in this context.
        
         | mikewave wrote:
          | Well, if your system elastically uses GPU compute and needs to
          | be able to spin up, run compute on a GPU, and spin down in a
          | predictable amount of time to provide a reasonable UX, launch
          | time would definitely be a factor in customer-perceived
          | reliability.
        
           | rco8786 wrote:
            | Sure, but nowhere near clearing the bar for simply calling
            | that "reliability".
        
             | VWWHFSfQ wrote:
             | I would still call it "reliability".
             | 
             | If the instance takes too long to launch then it doesn't
             | matter if it's "reliable" once it's running. It took too
             | long to even get started.
        
               | rco8786 wrote:
                | Why would you not call it "startup performance"?
               | 
               | Calling this reliability is like saying a Ford is more
               | reliable than a Chevy because the Ford has a better
               | throttle response.
        
               | endisneigh wrote:
               | that's not what reliability means
        
               | VWWHFSfQ wrote:
               | > that's not what reliability means
               | 
               | What is your definition of reliability?
        
               | endisneigh wrote:
                | unfortunately cloud computing and marketing have
                | conflated reliability, availability and fault tolerance,
                | so it's hard to give you a definition everyone would
                | agree to, but in general I'd say reliability refers to
                | your ability to use the system without errors or
                | decreases in throughput significant enough that it's not
                | usable for the stated purpose.
                | 
                | in other words, reliability is that it does what you
                | expect it to. GCP does not have any particular
                | guarantees around being able to spin up VMs fast, so its
                | inability to do so wouldn't make it unreliable. it would
                | be like me saying that you're unreliable for not doing
                | something when you never said you were going to do it.
                | 
                | if this were comparing Lambda vs Cloud Functions, which
                | both have stated SLAs around cold start times, and there
                | were significant discrepancies, sure.
        
               | pas wrote:
                | true, the grammar and semantics work out, but
                | reliability needs a target, and it's usually a serious
                | design flaw to rely on something that never demonstrably
                | worked the way your reliability target assumes.
                | 
                | so that's why in engineering it's not really used this
                | way (as far as I understand, at least).
        
             | somat wrote:
              | It is not reliably running the machine, but reliably
              | getting the machine.
              | 
              | Like the article said, the promise of the cloud is that
              | you can easily get machines when you need them. A cloud
              | that sometimes does not get you that machine (or does not
              | get it to you in time) is a less reliable cloud than one
              | that does.
        
             | [deleted]
        
           | Art9681 wrote:
            | Why would you scale to zero in high-perf compute? Wouldn't
            | it be wise to have a buffer of instances ready to pick up
            | workloads instantly? I get that it shouldn't be necessary
            | with a reliable and performant backend, and that the cost of
            | having some instances waiting for a job can be substantial
            | depending on how you do it, but I wonder if the cost
            | difference between AWS and GCP would make up for that, and
            | you could get an equivalent amount of performance for an
            | equivalent price? I'm not sure. I'd like to know, though.
        
             | thwayunion wrote:
             | _> Why would you scale to zero in high perf compute?_
             | 
              | Midnight - 6am is six hours. The on-demand price for a G5
              | is $1/hr. That's over $2K/yr, or "an extra week of skiing
              | paid for by your B2B side project that almost never has
              | customers from ~9pm west coast to ~6am east coast". And
              | I'm not even counting weekends. Even in that rather
              | extreme case there's a real business case.
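              | 
              | (Back-of-envelope: 6 h/day x $1/h x 365 days ~= $2,190/yr,
              | before counting weekends.)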
             | 
             | But that's sort of a silly edge case. The real savings are
              | in predictable startup times for bursty workloads. Fast,
              | low-variance startup times unlock a huge amount of
              | savings. Without both speed and predictability, you have
              | to plan to fail and over-allocate, which can get really
              | expensive fast.
        
             | diroussel wrote:
              | Scaling to zero means zero cost when there is zero work.
              | If you have a buffer pool, how long do you keep it
              | populated when you have no work?
              | 
              | Maintaining a buffer pool is hard. You need to maintain
              | state, have a prediction function, track usage through
              | time, etc. Just spinning up new nodes for new work is
              | substantially easier; see the sketch below.
              | 
              | And the author said he could spin up new nodes in 15
              | seconds; that's pretty quick.
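              | 
              | Even a bare-minimum warm pool is a fair amount of state to
              | carry around; a sketch, with a hypothetical provisioner
              | object:
              | 
              |     import time
              | 
              |     class WarmPool:
              |         def __init__(self, prov, target=2, idle_ttl=600):
              |             self.prov = prov
              |             self.target = target
              |             self.idle_ttl = idle_ttl
              |             self.idle = []  # (instance, idle_since)
              | 
              |         def acquire(self):
              |             if self.idle:
              |                 return self.idle.pop()[0]  # warm: ~0s
              |             return self.prov.launch()      # cold: slow
              | 
              |         def release(self, instance):
              |             self.idle.append((instance, time.time()))
              | 
              |         def reap(self):
              |             # drop instances idle past the TTL, but keep
              |             # `target` instances warm
              |             now = time.time()
              |             while len(self.idle) > self.target:
              |                 inst, since = self.idle[0]
              |                 if now - since < self.idle_ttl:
              |                     break
              |                 self.idle.pop(0)
              |                 self.prov.terminate(inst)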
        
         | iLoveOncall wrote:
         | It is clickbait, the real title should be "AWS vs. GCP on-
         | demand provisioning of GPU resources performance is wildly
         | different".
         | 
         | That said, while I agree that launch time and provisioning
         | error rate are not sufficient to define reliability, they are
         | definitely a part of it.
        
           | [deleted]
        
           | lelandfe wrote:
           | > wildly different
           | 
           | For this, I'd prefer a title that lets me draw my own
           | conclusions. 84 errors out of 3000 doesn't sound awful to
           | me...? But what do I know - maybe just give me the data:
           | 
           | "1 in 3000 GPUs fail to spawn on AWS. GCP: 84"
           | 
           | "Time to provision GPU with AWS: 11.4s. GCP: 42.6s"
           | 
           | "GCP >4x avg. time to provision GPU than AWS"
           | 
           | "Provisioning on GCP both slower and more error-prone than
           | AWS"
        
         | rmah wrote:
         | They are talking about the reliability of AWS vs GCP. As a user
         | of both, I'd categorize predictable startup times under
         | reliability because if it took more than a minute or so, we'd
         | consider it broken. I suspect many others would have even
         | tighter constraints.
        
         | chrismarlow9 wrote:
          | I mean, if you're talking about worst-case scenarios, you
          | assume everything is gone except your infra code and backups.
          | In that case your instance launch time would ultimately define
          | what your downtime looks like, assuming all else is equal. It
          | does seem a little weird to define it that way, but in a
          | strict sense maybe not.
        
       ___________________________________________________________________
       (page generated 2022-09-21 23:00 UTC)