[HN Gopher] AWS vs. GCP reliability is wildly different
___________________________________________________________________

  AWS vs. GCP reliability is wildly different

  Author : icyfox
  Score  : 160 points
  Date   : 2022-09-21 20:29 UTC (2 hours ago)

  (HTM) web link (freeman.vc)
  (TXT) w3m dump (freeman.vc)

| user- wrote:
| I wouldn't call this reliability, which already has a loaded definition in the cloud world, and instead something along time-to-start or latency or something.

| 1-6 wrote:
| This is all about cloud GPUs, I was expecting something totally different from the title.

| s-xyz wrote:
| Would be interested to see a comparison of lambda functions vs google 2nd gen functions. I think that gcp is more serverless focused.

| duskwuff wrote:
| ... why does the first graph show some instances as having a negative launch time? Is that meant to indicate errors, or has GCP started preemptively launching instances to anticipate requests?

| tra3 wrote:
| The y axis here measures the duration it took to successfully spin up the box, where negative results were requests that timed out after 200 seconds. The results are pretty staggering.

| zaltekk wrote:
| I don't know how that value (looks like -50?) was chosen, but it seems to correspond to the launch failures.

| staringback wrote:
| Perhaps if you read the line directly above the graph you would see it was explained and would not have to ask this question.
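  A minimal sketch of the kind of timing loop being discussed, assuming boto3 and the g4dn.xlarge type from the article; this is illustrative rather than the author's actual harness, and "running" per the EC2 API stands in for "successfully spun up":

    # Time how long one GPU instance takes to reach "running", with the
    # 200-second cutoff described above. Sketch only; assumes boto3 and
    # configured AWS credentials.
    import time
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    def time_one_launch(ami_id, timeout=200.0):
        start = time.monotonic()
        resp = ec2.run_instances(ImageId=ami_id, InstanceType="g4dn.xlarge",
                                 MinCount=1, MaxCount=1)
        instance_id = resp["Instances"][0]["InstanceId"]
        try:
            while time.monotonic() - start < timeout:
                desc = ec2.describe_instances(InstanceIds=[instance_id])
                state = desc["Reservations"][0]["Instances"][0]["State"]["Name"]
                if state == "running":
                    return time.monotonic() - start  # seconds to launch
                time.sleep(1)
            return None  # timed out; the article plots these as negative bars
        finally:
            ec2.terminate_instances(InstanceIds=[instance_id])

  Repeating this a few thousand times per provider and plotting the elapsed times is what produces graphs like the one being discussed.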
| zmmmmm wrote:
| > In total it scaled up about 3,000 T4 GPUs per platform
|
| > why I burned $150 on GPUs
|
| How do you rent 3,000 GPUs over a period of weeks for $150? Were they literally requisitioning it and releasing it immediately? Seems like this is quite an unrealistic type of usage pattern and would depend a lot on whether the cloud provider optimises to hand you back the same warm instance you just relinquished.
|
| > GCP allows you to attach a GPU to an arbitrary VM as a hardware accelerator
|
| It's quite fascinating that GCP can do this. GPUs are physical things (!); do they provision every single instance type in the data center with GPUs? That would seem very expensive.

| bushbaba wrote:
| Unlikely. More likely they put your VM on a host with a GPU attached, and use live migration to move workloads around for better resource utilization.
|
| However, live migration can cause impact to HPC workloads.

| ZiiS wrote:
| GPUs are physical but VMs are not; I expect they just move them to a host with a GPU.

| NavinF wrote:
| It probably live-migrates your VM to a physical machine that has a GPU available.
|
| ...if there are any GPUs available in the AZ, that is. I had a hell of a time last year moving back and forth between regions to grab just 1 GPU to test something. The web UI didn't have an "any region" option for launching VMs, so if you don't use the API you'll have to sit there for 20 minutes trying each AZ/region until you manage to grab one.
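  For the zone-hopping NavinF describes, a rough sketch of the API-side workaround, assuming the google-cloud-compute client library; make_instance is a hypothetical helper that builds a zone-scoped Instance body carrying the desired guest accelerator:

    # Try each zone in turn until a GPU VM actually gets created.
    # Sketch only; error handling is deliberately coarse.
    from google.cloud import compute_v1

    def first_zone_with_a_gpu(project, zones, make_instance):
        client = compute_v1.InstancesClient()
        for zone in zones:
            try:
                op = client.insert(project=project, zone=zone,
                                   instance_resource=make_instance(zone))
                op.result()  # blocks until done; raises if the zone has no capacity
                return zone
            except Exception:
                continue     # e.g. a ZONE_RESOURCE_POOL_EXHAUSTED stockout; try the next one
        return None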
| remus wrote:
| > The offerings between the two cloud vendors are also not the same, which might relate to their differing response times. GCP allows you to attach a GPU to an arbitrary VM as a hardware accelerator - you can separately configure quantity of the CPUs as needed. AWS only provisions defined VMs that have GPUs attached - the g4dn.x series of hardware here. Each of these instances are fixed in their CPU allocation, so if you want one particular varietal of GPU you are stuck with the associated CPU configuration.
|
| At a surface level, the above (from the article) seems like a pretty straightforward explanation? GCP gives you more flexibility in configuring GPU instances at the trade-off of increased startup time variability.

| btgeekboy wrote:
| I wouldn't be surprised if GCP has GPUs scattered throughout the datacenter. If you happen to want to attach one, it has to find one for you to use - potentially live migrating your instance or someone else's so that it can connect them. It'd explain the massive variability between launch times.

| master_crab wrote:
| Yeah, that was my thought too when I first read the blurb.
|
| It's neat... but like a lot of things in large scale operations, the devil is in the details. GPU-CPU communication is a low-latency, high-bandwidth operation. Not something you can trivially do over standard TCP. GCP offering something like that without the ability to flawlessly migrate the VM or procure enough "local" GPUs means it's just vaporware.
|
| As a side note, I'm surprised the author didn't note the amount of ICEs (insufficient capacity errors) AWS throws whenever you spin up a G type instance. AWS is notorious for offering very few G's and P's in certain AZs and regions.

| dekhn wrote:
| What would you expect? AWS is an org dedicated to giving customers what they want and charging them for it, while GCP is an org dedicated to telling customers what they want and using the revenue to get slightly better cost margins on Intel servers.

| dilyevsky wrote:
| I don't believe this reasoning is used since at least Diane

| dekhn wrote:
| I haven't seen any real change from Google about how they approach cloud in the past decade (first as an employee and developer of cloud services there, and now as a customer). Their sales people have hollow eyes.

| playingalong wrote:
| This is great.
|
| I have always felt there is so little independent content on benchmarking the IaaS providers. There is so much you can measure in how they behave.

| endisneigh wrote:
| this doesn't really seem like a fair comparison, nor is it a measure of "reliability".

| humanfromearth wrote:
| We have constant autoscaling issues because of this in GCP - glad someone plotted this - hope people in GCP will pay a bit more attention to this. Thanks to the OP!

| kccqzy wrote:
| Heard from a Googler that the internal infrastructure (Borg) is simply not optimized for quick startup. Launching a new Borg job often takes multiple minutes before the job runs. Not surprising at all.

| dekhn wrote:
| A well-configured isolated borg cluster and well-configured job can be really fast. If there's no preemption (i.e. no other job that is kicked off and gets some grace period), the packages are already cached locally, there is no undue load on the scheduler, the resources are available, and it's a job with tasks rather than multiple jobs, it will be close to instantaneous.
|
| I spent a significant fraction of my 11+ years there clicking Reload on my job's borg page. I was able to (re-)start ~100K jobs globally in about 15 minutes.

| dekhn wrote:
| booting VMs != starting a borg job.

| kccqzy wrote:
| The technology may be different but the culture carries over. People simply don't have the habit of optimizing for startup time.

| readams wrote:
| Borg is not used for gcp vms, though.

| dilyevsky wrote:
| It is used, but the borg scheduler does not manage vm startups.

| epberry wrote:
| Echoing this. The SRE book is also highly revealing about how Google request prioritization works. https://sre.google/sre-book/load-balancing-datacenter/
|
| My personal opinion is that Google's resources are more tightly optimized than AWS and they may try to find the 99% best allocation versus the 95% best allocation on AWS... and this leads to more rejected requests. Open to being wrong on this.

| valleyjo wrote:
| As another comment points out, GPU resources are less common so they take longer to create, which makes sense. In general, startup times are pretty quick on GCP, as other comments also confirm.

| MonkeyMalarky wrote:
| I would love to see the same for deploying things like a cloud/lambda function.

| politelemon wrote:
| A few weeks ago I needed to change the volume type on an EC2 instance to gp3. Following the instructions, the change happened while the instance was running. I didn't need to reboot or stop the instance, it just changed the type. While the instance was running.
|
| I didn't understand how they were able to do this, I had thought volume types mapped to hardware clusters of some kind. And since I didn't understand, I wasn't able to distinguish it from magic.

| osti wrote:
| Look up AWS Nitro on YouTube if you are interested in learning more about it.

| ArchOversight wrote:
| Changing the volume type on AWS is somewhat magical. Seeing it happen online was amazing.

| cavisne wrote:
| EBS is already replicated so they probably just migrate behind the scenes, same as if the original physical disk was corrupted. It looks like only certain conditions allow this kind of migration.
|
| https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/modify-v...

| Salgat wrote:
| If I remember right they use the equivalent of a ledger of changes to manage volume state. So in this case, they copy over the contents (up to a certain point in time) to the new faster virtual volume, then append and direct all new changes to the new volume.
|
| This is also how they are able to snapshot a volume at a certain point in time without having any downtime or data inconsistencies.
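  The online gp2-to-gp3 change politelemon describes is a single API call; a sketch with boto3 (illustrative, minimal error handling), including the follow-up call that reports the modification's progress:

    # Switch a volume from gp2 to gp3 while the instance keeps running,
    # then watch the modification progress. Sketch only; assumes boto3.
    import time
    import boto3

    ec2 = boto3.client("ec2")

    def convert_to_gp3(volume_id):
        ec2.modify_volume(VolumeId=volume_id, VolumeType="gp3")
        while True:
            mods = ec2.describe_volumes_modifications(VolumeIds=[volume_id])
            state = mods["VolumesModifications"][0]["ModificationState"]
            # States go modifying -> optimizing -> completed; the volume
            # stays attached and usable the whole time.
            if state == "completed":
                return
            time.sleep(10)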
| xyzzyz wrote:
| Dunno about AWS, but GCP uses live migration, and will migrate your VM across physical machines as necessary. The disk volumes are all connected over the network; nothing really depends on the actual physical machine your VM is run on.

| lbhdc wrote:
| How does migrating a vm to another physical machine work?

| the_duke wrote:
| This blog post is pretty old (2015) but gives a good introduction.
|
| https://cloudplatform.googleblog.com/2015/03/Google-Compute-...

| lbhdc wrote:
| Thanks for sharing, I will give it a read!

| rejectfinite wrote:
| vsphere vmotion has been a thing for years lmao

| roomey wrote:
| VMware has been doing this for years, it's called vmotion and there is a lot of documentation about it if you are interested (e.g. https://www.thegeekpub.com/8407/how-vmotion-works/ ).
|
| Essentially, memory state is copied to the new host, the VM is stunned for a millisecond and the CPU state is copied and resumed on the new host (you may see a dropped ping). All the networking and storage is virtual anyway, so that is "moved" (it's not really moved) in the background.

| davidro80 wrote:
|

| lbhdc wrote:
| That is really interesting, I didn't realize it was so fast. Thanks for the post, I will give it a read!

| mh- wrote:
| Up to 500ms per your source, depending on how much churn there is in the memory from the source system.
|
| Very cool.

| valleyjo wrote:
| Stream the contents of RAM from source to dest, pause the source, reprogram the network and copy any memory that changed since the initial stream, resume the dest, destroy the source, profit.

| pclmulqdq wrote:
| They pause your VM, copy everything about its state over to the new machine, and quickly start the other instance. It's pretty clever. I think there are tricks you can play with machines that have large memory footprints to copy most of it before the pause, and only copy what has changed since then during the pause.
|
| The disks are all on the network, so no need to move anything there.

| prmoustache wrote:
| In reality it syncs the memory first to the other host and only pauses the VM when the last state sync is small enough that the pause is barely measurable.

| lbhdc wrote:
| When it's transferring the state to the target, how does it handle memory updates that are happening at that time? Is the program's execution paused at that point?
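  To the question above: in pre-copy live migration the hypervisor keeps re-copying whatever memory the still-running guest dirties, and only pauses it for the final, small delta. A toy, self-contained simulation of that loop (purely illustrative; real hypervisors track dirty pages in hardware, below the guest OS):

    import random

    def guest_writes(memory, dirty, n=16):
        """Simulate the still-running guest touching some pages."""
        for _ in range(n):
            page = random.randrange(len(memory))
            memory[page] += 1
            dirty.add(page)

    def live_migrate(pages=4096, small_enough=8, max_rounds=30):
        src = [0] * pages
        dst = [None] * pages
        dirty = set(range(pages))      # round 1: everything still to copy
        rounds = 0
        # Real hypervisors also cap the number of pre-copy rounds.
        while len(dirty) > small_enough and rounds < max_rounds:
            to_copy, dirty = dirty, set()
            for p in to_copy:          # copy while the guest keeps running
                dst[p] = src[p]
            guest_writes(src, dirty)   # pages touched during this round
            rounds += 1
        # Pause ("stun") the guest only for the final, small delta, then
        # copy CPU/device state, flip networking/storage, resume on dst.
        for p in dirty:
            dst[p] = src[p]
        assert dst == src
        return rounds

    print("guest paused after", live_migrate(), "pre-copy round(s)")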
| water-your-self wrote:
| Indiana Jones and the register states

| valleyjo wrote:
| Azure, AWS and GCP all have live migration. VMware has it too.

| dilyevsky wrote:
| Ec2 does not have live migration. On azure it's spotty so not every maintenance can offer it.

| [deleted]

| free652 wrote:
| Are you sure? Because AWS consistently requires me to migrate to a different host. They go as far as shutting down instances, but don't do any kind of live migrations.

| shrubble wrote:
| Assuming this blurb is accurate: "General-purpose SSD volume (gp3) provides the consistent 125 MiB/s throughput and 3000 IOPS within the price of provisioned storage. Additional IOPS (up to 16,000) and throughput (1000 MiB/s) can be provisioned with an additional price. The General-purpose SSD volume (gp2) provides 3 IOPS per GiB storage provisioned with a minimum of 100 IOPS"
|
| ... then it seems like a device that limits bandwidth either on the storage cluster or between the node and storage cluster is present. 125 MiB/s is right around the speed of a 1 Gbit link, I believe. That it's just a networking setting changed in-switch doesn't seem surprising.

| nonameiguess wrote:
| This would have been my guess. All EBS volumes are stored on a physical disk that supports the highest bandwidth and IOPS you can live migrate to, and the actual rates you get are determined by something in the interconnect. Live migration is thus a matter of swapping out the interconnect between the VM and the disk, or even just relaxing a logical rate limiter, without having to migrate your data to a different disk.

| prmoustache wrote:
| The actual migration is not instantaneous despite the volume being immediately reported as gp3. You get a status change to "optimizing", if my memory is correct, with a percentage. And the larger the volume, the longer it takes, so there is definitely a sync to faster storage.
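  Taking just the figures from the blurb quoted above, the baseline difference between the two volume types is easy to work out; a small sketch (the shared 16,000 IOPS ceiling for gp2 is an assumption beyond the quote):

    # Baseline IOPS per the quoted blurb: gp2 scales with size (3 IOPS/GiB,
    # minimum 100), gp3 starts at a flat 3000 regardless of size.
    def gp2_baseline_iops(size_gib):
        return min(max(100, 3 * size_gib), 16_000)

    def gp3_baseline_iops(size_gib):
        return 3_000  # more can be provisioned separately, up to 16,000

    # e.g. a 500 GiB volume: gp2 gives 1500 IOPS, gp3 gives 3000 by default;
    # gp2 only catches up to gp3's baseline at 1000 GiB.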
| 0xbadcafebee wrote:
| Reliability in general is measured on the basic principle of: _does it function within our defined expectations?_ As long as it's launching, and it eventually responds within SLA/SLO limits, and on failure comes back within SLA/SLO limits, it is reliable. Even with GCP's multiple failures to launch, that may still be considered "reliable" within their SLA.
|
| If both AWS and GCP had the same SLA, and one did better than the other at starting up, you could say one is _more performant_ than the other, but you couldn't say it's _more reliable_ if they are both meeting the SLA. It's easy to look at something that never goes down and say "that is more reliable", but it might have been pure chance that it never went down. Always read the fine print, and don't expect anything better than what they guarantee.

| cmcconomy wrote:
| I wish Azure was here to round it out!

| londons_explore wrote:
| AWS normally has machines sitting idle just waiting for you to use. That's why they can get you going in a couple of seconds.
|
| GCP on the other hand fills all machines with background jobs. When you want a machine, they need to terminate a background job to make room for you. That background job has a shutdown grace time. Usually that's 30 seconds.
|
| Sometimes, to prevent fragmentation, they actually need to shuffle around many other users to give you the perfect slot - and some of those jobs have start-new-before-stop-old semantics - that's why sometimes the delay is far higher too.

| dekhn wrote:
| borg implements preemption but the delay to start VMs is not because they are waiting for a background task to clean up.

| devxpy wrote:
| Is this testing for spot instances?
|
| In my limited experience, persistent (on-demand) GCP instances always boot up much faster than AWS EC2 instances.

| marcinzm wrote:
| In my experience GPU persistent instances often simply don't boot up on GCP due to lack of available GPUs. One reason I didn't choose GCP at my last company.

| rwalle wrote:
| Looks like the author has never heard of the word "histogram".
|
| That graph is a pain to see.

| charbull wrote:
| Can you put this in context of the problem/use case/need you are solving for?

| ajross wrote:
| Worth pointing out that the article is measuring provisioning latency and success rates (how quickly can you get a GPU box running and whether or not you get an error back from the API when you try), and not "reliability" as most readers would understand it (how likely they are to do what you want them to do without failure).
|
| Definitely seems like interesting info, though.

| curious_cat_163 wrote:
| Setting the use of the word "reliability" aside, it is interesting to see the differences in launch time and errors.
|
| One explanation is that AWS has been at it longer, so they know better. That seems like an unsatisfying explanation though, given Google's massive advantage on building and running distributed systems.
|
| Another explanation could be that AWS is more "customer-focused", i.e. they pay a lot more attention to technical issues that are perceptible by a blog writer. But I am not sure why Google would not be incentivized to do the same. They are certainly motivated and have brought the capital to bear to this fight.
|
| So, what gives?

| PigiVinci83 wrote:
| Thank you for this article, it confirms my direct experience. Never ran a benchmarking test, but I can see this every day.

| amaks wrote:
| The link is broken?

| lucb1e wrote:
| Works for me using Firefox in Germany, although the article doesn't really match the title so maybe that's why you were confused? :p

| danielmarkbruce wrote:
| It's meant to say "ephemeral"... right? It's hard to read after that.

| datalopers wrote:
| ephemeral and ethereal are commonly confused words.

| dublin wrote:
| Ephemerides really throw them. (And thank God for PyEphem, which makes all that otherwise quite fiddly stuff really easy...)

| danielmarkbruce wrote:
| I guess that's fair. It's sort of a smell when someone uses the wrong word (especially in writing) though. It suggests they aren't in industry, throwing ideas around with other folks. The word "ephemeral" is used extensively amongst software engineers.

| dark-star wrote:
| I wonder why someone would equate "instance launch time" with "reliability"... I won't go as far as calling it "clickbait", but wouldn't some other noun ("startup performance is wildly different") have made more sense?

| santoshalper wrote:
| I won't go so far as saying "you didn't read the article", but I think you missed something.

| xmonkee wrote:
| GCP also had 84 errors compared to 1 for AWS.

| danielmarkbruce wrote:
| If not a 4xx, what should they return for instance not available?

| eurasiantiger wrote:
| 503 service unavailable?

| sn0wf1re wrote:
| That would be confusing. The HTTP response code should not be conflated with the application's state.

| dheera wrote:
| Using HTTP error codes for non-REST things is cringe.
|
| 503 would mean the IaaS API calls themselves are unavailable. Very different from the API working perfectly fine but the instances not being available.
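  For what it's worth, on the EC2 side a failed launch comes back as a named error code on the API response rather than as a bare HTTP status; a sketch of what a caller sees (illustrative, assumes boto3):

    # A capacity failure surfaces as a ClientError with a specific code
    # (e.g. InsufficientInstanceCapacity), which callers can branch on.
    import boto3
    from botocore.exceptions import ClientError

    ec2 = boto3.client("ec2")

    def try_launch(ami_id):
        try:
            resp = ec2.run_instances(ImageId=ami_id, InstanceType="g4dn.xlarge",
                                     MinCount=1, MaxCount=1)
            return resp["Instances"][0]["InstanceId"]
        except ClientError as err:
            if err.response["Error"]["Code"] == "InsufficientInstanceCapacity":
                return None  # the "ICE" mentioned elsewhere in the thread
            raise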
| sheeshkebab wrote:
| Maybe 1 reported. Not saying aws reliability is bad, but the number of glitches that crop up in various aws services and are not reflected on their status page is quite high.

| theamk wrote:
| that was measured from API call return codes, not by looking at the overall service status page
|
| Amazon is pretty good about this: if their API says a machine is ready, it usually is.

| mcqueenjordan wrote:
| Errors returned from APIs and the status page are completely separate topics in this context.

| mikewave wrote:
| Well, if your system elastically uses GPU compute and needs to be able to spin up, run compute on a GPU, and spin down in a predictable amount of time to provide reasonable UX, launch time would definitely be a factor in terms of customer-perceived reliability.

| rco8786 wrote:
| Sure, but not anywhere remotely near clearing the bar to simply calling that "reliability".

| VWWHFSfQ wrote:
| I would still call it "reliability".
|
| If the instance takes too long to launch then it doesn't matter if it's "reliable" once it's running. It took too long to even get started.

| rco8786 wrote:
| Why would you not call it "startup performance"?
|
| Calling this reliability is like saying a Ford is more reliable than a Chevy because the Ford has a better throttle response.

| endisneigh wrote:
| that's not what reliability means

| VWWHFSfQ wrote:
| > that's not what reliability means
|
| What is your definition of reliability?

| endisneigh wrote:
| unfortunately cloud computing and marketing have conflated reliability, availability and fault tolerance, so it's hard to give you a definition everyone would agree to, but in general I'd say reliability refers to your ability to use the system without errors or significant decreases in throughput, such that it's not usable for the stated purpose.
|
| in other words, reliability is that it does what you expect it to. GCP does not have any particular guarantees around being able to spin up VMs fast, so its inability to do so wouldn't make it unreliable. it would be like me saying that you're unreliable for not doing something when you never said you were going to.
|
| if this were comparing Lambda vs Cloud Functions, which both have stated SLAs around cold start times, and there were significant discrepancies, sure.

| pas wrote:
| true, the grammar and semantics work out, but since reliability needs a target, usually it's a serious design flaw to rely on something that never demonstrably worked like your reliability target assumes.
|
| so that's why in engineering it's not really used as such. (as far as I understand at least.)

| somat wrote:
| It is not reliably running the machine but reliably getting the machine.
|
| Like the article said, the promise of the cloud is that you can easily get machines when you need them. The cloud that sometimes does not get you that machine (or does not get you that machine in time) is a less reliable cloud than the one that does.

| [deleted]

| Art9681 wrote:
| Why would you scale to zero in high perf compute? Wouldn't it be wise to have a buffer of instances ready to pick up workloads instantly? I get that it shouldn't be necessary with a reliable and performant backend, and that the cost of having some instances waiting for a job can be substantial depending on how you do it, but I wonder if the cost difference between AWS and GCP would make up for that and you can get an equivalent amount of performance for an equivalent price? I'm not sure. I'd like to know though.

| thwayunion wrote:
| _> Why would you scale to zero in high perf compute?_
|
| Midnight - 6am is six hours. The on-demand price for a G5 is $1/hr. That's over $2K/yr, or "an extra week of skiing paid for by your B2B side project that almost never has customers from ~9pm west coast to ~6am east coast". And I'm not even counting weekends. Even in that rather extreme case there's a real business case.
|
| But that's sort of a silly edge case. The real savings are in predictable startup times for bursty workloads. Fast and low-variance startup times unlock a huge amount of savings. Without both speed and predictability, you have to plan to fail and over-allocate. Which can get really expensive fast.
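  thwayunion's arithmetic, spelled out with the figures quoted in the comment (weekends ignored, as stated):

    # Overnight scale-to-zero savings, using the numbers quoted above.
    hours_idle_per_night = 6       # midnight to 6am
    on_demand_per_hour = 1.00      # ~$1/hr for the G5 instance cited
    per_year = hours_idle_per_night * on_demand_per_hour * 365
    print(f"${per_year:,.0f}/yr")  # $2,190/yr, i.e. "over $2K/yr"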
| diroussel wrote:
| Scaling to zero means zero cost when there is zero work. If you have a buffer pool, how long do you keep it populated when you have no work?
|
| Maintaining a buffer pool is hard. You need to maintain state, have a prediction function, track usage through time, etc. Just spinning up new nodes for new work is substantially easier.
|
| And the author said he could spin up new nodes in 15 seconds; that's pretty quick.

| iLoveOncall wrote:
| It is clickbait, the real title should be "AWS vs. GCP on-demand provisioning of GPU resources performance is wildly different".
|
| That said, while I agree that launch time and provisioning error rate are not sufficient to define reliability, they are definitely a part of it.

| [deleted]

| lelandfe wrote:
| > wildly different
|
| For this, I'd prefer a title that lets me draw my own conclusions. 84 errors out of 3000 doesn't sound awful to me...? But what do I know - maybe just give me the data:
|
| "1 in 3000 GPUs fail to spawn on AWS. GCP: 84"
|
| "Time to provision GPU with AWS: 11.4s. GCP: 42.6s"
|
| "GCP >4x avg. time to provision GPU than AWS"
|
| "Provisioning on GCP both slower and more error-prone than AWS"

| rmah wrote:
| They are talking about the reliability of AWS vs GCP. As a user of both, I'd categorize predictable startup times under reliability because if it took more than a minute or so, we'd consider it broken. I suspect many others would have even tighter constraints.

| chrismarlow9 wrote:
| I mean, if you're talking about worst-case systems, you assume everything is gone except your infra code and backups. In that case your instance launch time would ultimately define what your downtime looks like, assuming all else is equal. It does seem a little weird to define it that way, but in a strict sense maybe not.

___________________________________________________________________
(page generated 2022-09-21 23:00 UTC)