[HN Gopher] Leveraging mispriced AWS spot instances ___________________________________________________________________ Leveraging mispriced AWS spot instances Author : ericpauley Score : 105 points Date : 2022-10-21 13:33 UTC (9 hours ago) (HTM) web link (pauley.me) (TXT) w3m dump (pauley.me) | bluelightning2k wrote: | The article makes a HUGE assumption. | | They spot an inconsistency between two prices, and decide that | the fair market value must be the very highest part of the | spread. Anything under this is therefore "under-priced". | | Is it not possible instead that people are _overpaying_ for the | popular ones through sub-optimal bids, rather than simply | assuming that only these inflexible/least sophisticated bidding | strategies represent the fair market value? | | They actually go further. They assume that AWS could realize this | value, and that encouraging more flexible bids through tooling | etc. would move everyone to the top of the spread, instead of | smoothing it out towards the average. And that what is | essentially a price increase can be achieved without hurting the | overall value (price-performance vs. flexibility). Given the | entire point of this is auctioning unused cycles at a discount, | clearly any overall increase would decrease the overall demand. | | Having said this, it's a great article. I think the overall | quality of the article made it so surprising to see this missed. | bushbaba wrote: | There's underlying capacity as well. Would you rather pay a bit | more to get 100 r6g.4xl OR pay a bit less to have 90 r6g.4xl + | 10 r6gd.4xl? | | Many workloads do not have deployment configuration supporting | a non-homogeneous fleet of instances. Over time this will be | addressed, but it could be a current major contributor to the | discrepancies viewed. | ericpauley wrote: | Author here. This is definitely a big assumption. 
I cut the | price differences in half to account for market movement, but | the price difference could definitely be more or less, | especially as these pools are probably thinner markets. | bluelightning2k wrote: | As I said - it's a great article! This was just one thing I | noticed which I pointed out as it made me think. | | Keep up the good work | benlivengood wrote: | Interestingly GCP already offers over 75% discounts for n2d (AMD) | spot instances that don't rely on any internal market, and the | discounts for other families are fairly close. | | We see individual spot instances go away every few days which | works pretty well for GKE. The older preemptible class of | instances restarted every 24 hours which was more of a pain | (mitigated a bit with a preemptible killer to spread the restarts | out). | plantain wrote: | I've spent a lot of time trying to capitalize on these mispricings | - and often they're priced like that because the capacity in that | region/configuration is much lower and you are exposed to more | preemptions than in higher priced region/configurations. | JoachimSchipper wrote: | > Compute-optimized (C) instances can substitute for a general- | purpose (M) instances of half the size | | They do have the same amount of memory (and twice the CPU). But | if you run a workload that automatically scales to the number of | available cores, starting twice the number of processes / threads | might well run you out of memory. | | The article is interesting, but blindly running your code on | unexpected instance types may be more "exciting" than the author | makes it sound. | ericpauley wrote: | Author here. If you design the workload you can ignore the | extra cores. You can actually hide these CPU cores from | instances within the AWS API (see setting instance vCPU) so it | is truly transparent. | hosh wrote: | One thing to note is volatility. 
Spot instances are great for | workloads that can absorb spot instance interruptions, and those | interruptions tend to happen more if everyone else is trying to | get spot instances at that time. Stateless web workloads that can | start up and shut down fast are a good example. | | Some workloads might not. You wouldn't want to run stateful | workloads on spot, for instance. In our case, we have something | that doesn't handle bootup under load very well, and until we can | improve that, the overall reliability is not as good. | | I also like GCP's way of pricing these: you say whether your | workload is preemptible or not, and you get discounts. You | automatically get discounts if you run the workload for a long | time. | StratusBen wrote: | As someone who spends entirely too much time thinking about cloud | infrastructure costs (I'm co-founder of https://www.vantage.sh/ | which maintains https://ec2instances.info/ ) I just want to | recognize the amount of effort that went into this blog post to | collect the data and express an interesting perspective for a | fairly complicated topic. | | Kudos to the author on producing this. | pradn wrote: | Great work on vantage.sh and ec2instances.info! | | Quick, small fix: This instance is shown as having 0 GB of | memory, but in fact it has 0.5 GB. | https://instances.vantage.sh/aws/ec2/t4g.nano | StratusBen wrote: | It's community supported! We just pay the bills and maintain | hosting it :) | | Do you mind opening an issue on the repo here? | https://github.com/vantage-sh/ec2instances.info | | Thank you for the report! | awsthrowawy5767 wrote: | You should know we use this tool _inside_ of AWS as well. Not | in EC2 itself, but many many other places | alex_duf wrote: | Oh I've used ec2instances.info very _very_ often, so thank you | for that. So useful! | kureikain wrote: | I run a service that has an API 
which can help get spot prices | https://ec2.shop/ | | Simply do: | | curl 'https://ec2.shop?region=us-west-2&filter=m5&json' | jq | | You can pipe it to whatever store your system uses to get the real | time price without dealing with the AWS Price API | Havoc wrote: | Quite surprised that there is this degree of mispricing. I would | have thought it's a market that is big and diverse enough to iron | that out. Especially given that the participants in question | would tend toward the analytical side of things | milesvp wrote: | I was thinking the same thing. I'm wondering if the price | differences are reflecting a general demand for certain sizes? | When I was maintaining AWS servers, I don't think it would have | been easy for me to take advantage of spot prices that were | outside of the sizes I was already using. I'd tuned things such | that I knew the sizes I tended to need to have the redundancy I | needed, and then could auto scale when necessary. Which means, | I would never have bid on spot instances that were bigger than | what I needed, because it would have been way more complicated | to analyze the state of the system as a whole and make sure | scaling happened when it needed to. Which also introduces risk | that probably was never worth the savings. So if you had a lot | of people like me, you'd get m3.large (or whatever current | naming) as the thing that gets bid up the most, because it hit | an autoscaling sweet spot | Havoc wrote: | > it would have been way more complicated to analyze the | state of the system as a whole and make sure scaling happened | when it needed to | | Yeah that's probably what's going on here. Complexity & that | it's just a bit counterintuitive | pid-1 wrote: | > For example, you can make all these substitutions: | | >c6g.2xlarge-c6g.4xlarge-m6g.4xlarge-r6g.4xlarge-r6gd.4xlarge | | A long standing ticket in my personal project backlog is | comparing different instance types' performance. 
I'm not sure this | equivalence is without caveats. | | Anyhow, the reasons "misprices" exist are: | | - Many AWS products are elastic but only allow one to choose a | single instance type. So you need to guess the best instance for | a workload and stick with it. | | - No AWS product exposes a "Just give me the cheapest VM with X | CPU and Y memory" API | universa1 wrote: | The end of the article shows a request for just that, doesn't | it? No clue about the API though... | | Depending on your workload you might be able to actually | substitute a single 8xlarge with two 4xlarge for example... A | while back I was actually doing something like that to save | some money :-) | pid-1 wrote: | Wow totally missed that. Cool stuff! | alFReD-NSH wrote: | An Auto Scaling group with a mixed instance type spot strategy does | that. You can even give weights to instance types, giving more | performant/higher capacity types a higher weight, and it can choose the | cheapest one with the weight in mind. | pclmulqdq wrote: | AWS doesn't want you to have that API! That's a significant | part of their margin. | jftuga wrote: | I wrote a program to get AWS spot instance pricing. This program | is similar to using "aws ec2 describe-spot-price-history" but is | faster and has a few more options. | | https://github.com/jftuga/spotprice | Moissanite wrote: | It is completely incorrect to characterize these observations as | "mispricing" - this is a quirk of automatically-determined prices | across very different products. If the author actually tried to | use these instances in any significant volume they would | understand the driver - capacity pools are nowhere near equal, | and not as interchangeable for AWS as the article implies they | would be for a user. Prices reflect demand munged with available | capacity - uncommon instance types are uncommon precisely because | they aren't used as much, so there aren't the same signals to | drive the price up and down automatically. 
| | Instances with attached NVMe are available in much lower volumes | than others, as are AMD instances. Obviously these pools cannot | be used as a drop-in replacement for non-"d" instances or Intel | families. | ericpauley wrote: | Author here. The key here is that customers can leverage these | pools in addition to their existing pools, improving capacity | and price. AWS actually supports this out of the box (including | substituting instances with drives) by specifying core and | memory requirements directly instead of instance types. | snake_doc wrote: | > Across all AWS availability zones instances are mispriced | by roughly $400/hr at any given time. This means that, with | just a single instance of each type, Amazon is missing out on | $200/hr or roughly $1.7 million each year. This is over | roughly 15,000 pools of instances. Given Amazon controls | roughly 100 million IPs, we can guess that each instance pool | probably has on the order of 1000 instances (more for smaller | instances, less for larger instances). Given this, the | average mispriced pool might have hundreds of instances, | meaning hundreds of millions each year in missed revenue due | to mispriced spot instances. Because amazon keeps their | number of instances a secret, it's difficult to make a | precise estimate from the outside, but the missed revenue | probably falls somewhere in this range. | | You are hypothesizing that the price differences produce | "lost" revenues. | | An alternative hypothesis can be that the price differences | produce a similar or higher level of revenues for AWS through | price segmentation, with Amazon recognizing the lack of | adoption of certain spot instance bidding features and | auction markets reacting appropriately. | | Unless you have the capacity and quantity demanded for each | instance type, you can't prove your hypothesis. You are | assuming scenario 3 (below) with no insights into price | elasticity of the underlying customers. 
| | Example: Baseline: | | Instance types A and B are equivalent. | | A is priced at $3, with capacity of 1000, quantity demanded | of 800. B is priced at $2, with capacity of 1000, quantity | demanded of 200. Total quantity demanded = 1,000. | | Revenues from instance type A = $3 x 800 = $2,400 | Revenues from instance type B = $2 x 200 = $400 | | Total revenues = $2,800 | | Scenario 1: All | customers purchase instance B instead due to better price | discovery. | | Revenues from instance type A = $3 x 0 = $0 | Revenues from instance type B = $2 x 1,000 = $2,000 | Total quantity demanded = 1,000. | | Total revenues = $2,000 | | Amazon loses $800 in revenues; there are no "lost" revenues | recovered. | | Scenario 2: Amazon changes | instance type B price to $3. Total quantity demanded decreases | to 900 due to price elasticity of instance type B customers. | | Revenues from instance type A = $3 x 800 = $2,400 | Revenues from instance type B = $3 x 100 = $300 | | Total revenues = $2,700 | | Amazon loses $100 in revenues; there are no "lost" revenues | recovered. | | Scenario 3: Amazon changes | instance type B price to $3. Total quantity demanded remains at | 1,000. | | Revenues from instance type A = $3 x 800 = $2,400 | Revenues from instance type B = $3 x 200 = $600 | | Total revenues = $3,000 | | Amazon recovers $200 in "lost" revenues. | ericpauley wrote: | The missing component of your analysis is that Amazon has a | 4th option: re-sell instances of B as instances of A when A | is more expensive, and otherwise allow the market to | adjust. The analysis is strictly limited to instances where | Amazon could, in theory, do this (e.g., reselling c6gd as | c6g). | | Assuming the market is in equilibrium, the above scenarios | aren't realistic, as demand at the market price would equal | supply at the current price ( _roughly_ , of course). | | Suppose there are 1000 c6g and 200 c6gd, with equilibrium | price of $3 and $2, respectively (i.e., all instances have | demand). 
Amazon re-SKUs c6gd as c6g until there are 1100 | c6g selling for $2.90 and 100 c6gd selling at $2.90. Total | revenue is $3480 vs. $3400. Of course it's impossible to | know the true numbers without hidden knowledge of the | market, but this is more akin to what would occur. Amazon | effectively has a risk-free arbitrage opportunity here, so | it stands to reason that there is revenue to be made. | Customers don't have this option (since you can't short | spot instances), so the best you can do is diversify and | save money. | | Edit: Actually, the AWS spot market is often out of | equilibrium in a way that makes this reselling _even more | effective_. For instance, in the example in the article the | c6gd instance is actually pegged at the minimum price, so | some number of those instances could be resold as c6g | without moving the c6gd price _at all_. | snake_doc wrote: | I think you're thinking about the revenue functions for spot | instances in isolation from the larger supply base of all | instances. Spot instances are already a result of revenue | management of a fixed supply base that increases in | discrete increments over time. Instance capacity overall | usually leads instance demand; shortage costs are very | high in data centers. | | Spot instance capacities are a function of all | instance capacity for the same type and on-demand | instance usage. Spot instance pricing can influence the | quantity demanded of on-demand instances of the same | type, and vice-versa. | | Anyhow, there's no way we can figure out whether you're | right or wrong with any reasonable level of certainty. | ericpauley wrote: | While it's tough to say with certainty how much revenue | is lost, there is certainly lost revenue. Consider that | many substitute instances are available at the minimum | allowable price (i.e., won't go any lower, there is | unused capacity). These could be resold without moving | the substitute market. 
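The revenue arithmetic traded back and forth in the comments above can be checked with a few lines of Python. This is only a sketch using the illustrative numbers from the two comments (800/200 units at $3/$2 in snake_doc's scenarios, and 1000 c6g plus 200 c6gd re-SKUed to a uniform $2.90 in ericpauley's reply). The `revenue_cents` helper is a hypothetical name, prices are held in integer cents to avoid floating-point rounding, and none of these figures are real AWS data.

```python
def revenue_cents(pools):
    """Total revenue in cents for a list of (price_cents, quantity_sold) pools."""
    return sum(price * qty for price, qty in pools)

# snake_doc's example: instance types A and B are equivalent.
baseline = revenue_cents([(300, 800), (200, 200)])    # A at $3, B at $2
scenario_1 = revenue_cents([(300, 0), (200, 1000)])   # everyone moves to B
scenario_2 = revenue_cents([(300, 800), (300, 100)])  # B repriced to $3, demand drops
scenario_3 = revenue_cents([(300, 800), (300, 200)])  # B repriced to $3, demand holds

# ericpauley's re-SKU example: 1000 c6g at $3 and 200 c6gd at $2,
# then 100 c6gd resold as c6g until both pools clear at $2.90.
before_resku = revenue_cents([(300, 1000), (200, 200)])
after_resku = revenue_cents([(290, 1100), (290, 100)])

for name, total, reference in [
    ("scenario 1", scenario_1, baseline),
    ("scenario 2", scenario_2, baseline),
    ("scenario 3", scenario_3, baseline),
    ("re-SKU", after_resku, before_resku),
]:
    delta = (total - reference) / 100
    print(f"{name}: ${total / 100:,.2f} ({delta:+,.2f} vs. its starting point)")
```

Both sides' numbers are internally consistent; the disagreement is about which scenario describes actual demand, which the sketch cannot settle.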
| pclmulqdq wrote: | The mispricing is likely good for Amazon. It indicates that | most people aren't doing this arbitrage, so Amazon can milk | them for extra money. | Moissanite wrote: | Totally agree with that; it is a pretty common approach. The | only part I don't agree with is calling out the price | differences as some kind of "gotcha" that AWS somehow missed, | particularly given the speculative "lost revenue" data which | have no basis in reality. | ericpauley wrote: | See the emphasis on transparent substitutes in the article. | This analysis is limited _strictly_ to sets of instances | that are fully hardware compatible, meaning AWS could | resell one instance as another. There are way more savings | to be had as a customer by leveraging instances that aren't | transparent substitutes. | Moissanite wrote: | I read it all, and don't agree with your interpretation | of "transparent substitutes" in several of the cases. | ericpauley wrote: | Which instances are not transparent substitutes, in your | opinion? Keep in mind the definition here is that _Amazon_ | could substitute the image transparently, e.g., by | ignoring the additional resources in the hypervisor, not that | the instances are by default indistinguishable. | | That being said, the substitute instances considered | could be trivially accepted by any task running on the | original instance, so long as it doesn't misbehave when | given too many resources. In the case of vCPU, you can | even hide extra vCPU cores, so a c6g.xlarge can be made | effectively indistinguishable from a m6g.2xlarge by | disabling the vCPUs at the hypervisor level. | pclmulqdq wrote: | In financial markets, this quirk of automatically-determined | prices across different products is frequently called | "mispricing" when those products logically _should_ have a | relationship with each other. | | Straightforwardly: All hosts with space for a c6gd spot | instance have space for a c6g instance. 
If Amazon is willing to | host a c6gd instance in that slot for $X, they should be | willing to also host a c6g instance there for $X. | | In financial markets, the way this gets handled is through | arbitrage: someone will buy the equivalent of the c6gd | instance, and sell the c6g part for the higher price (they may | also sell the "d" part for even more money). This has the | effect of "correcting" the price. The AWS spot market does not | allow you to do arbitrage, and AWS doesn't appear to do the | arbitrage for you. | | AWS probably likes this inefficiency in their market: some | instance types are more popular than others, and some customers | make assumptions that require them to use a very specific | instance type (ie a c6gd would not work as a substitute for | their c6g instance). However, the vast majority of users | probably could work just fine if their c6g instance were a | c6gd, and don't look for the arbitrage opportunity. That means | Amazon gets paid extra. | Moissanite wrote: | > If Amazon is willing to host a c6gd instance in that slot | for $X, they should be willing to also host a c6g instance | there for $X. | | The reality is that direct c6gd demand might be an order of | magnitude lower than c6g direct demand - if AWS can get some | more flexible people to adopt c6gd by offering a lower price, | c6g capacity is slightly stabilized for on-demand usage by | people who don't value the flexibility. | | Also note that c6g to c6gd has a non-zero switching cost - | extra NVMe on the instance adds a new source of potential | hardware failure, increasing the probability of termination | very slightly. There might be other software-related costs | depending on whether your application makes any ill-advised | assumptions about attached storage during setup. | | So overall, I would just be happier to read this article if | it was framed as "PSA: having more features in an ec2 | instance is sometimes cheaper! 
Don't rule yourself out of | extra savings by making overly-constrained fleet requests." | The extra commentary about foregone revenue makes too many | assumptions and detracts from the core point. | pclmulqdq wrote: | The point is that Amazon doesn't have to fill that slot | with a c6gd. They can also fill it with a c6g. They just | choose not to. | | The fact that you have to host a c6gd to get that price | instead of a c6g is an inefficiency in the spot market that | likely makes Amazon money, but is a little customer- | hostile. I think the article is probably wrong that Amazon | is foregoing revenue due to this. This is a form of price | discrimination and it is likely making Amazon money, but in | a scummy way. | ericpauley wrote: | Agreed that it's definitely difficult to know the true | missed revenue here without internal data, and even then | you'd be making some assumptions. I am confident there is | _some_ missed revenue here, as Amazon routinely has spot | capacity constraints under existing prices so could | definitely sell some substitute instances without moving | the original instance market (even one instance per pool | substituted equates to >$1M per year). In either case, a | savvy organization can definitely benefit from the price | discrepancy even if Amazon couldn't. | Moissanite wrote: | I can agree that there is missed revenue - but | realistically it would make much more sense to sell that | capacity via Fargate (which is closer to undifferentiated | generic compute and RAM) rather than monkeying with the | spot pricing algorithm. | ericpauley wrote: | Great point on Fargate, I'd be very curious on whether | they select capacity for that from EC2 capacity or if | there's a separate physical footprint for it. | phamilton wrote: | We run a very large installation 100% on spot and have done for a | few years. We serve our web traffic, do background work, etc. all | on spot instances. 
| | We see similar mismatched pricing all the time and take advantage | of it. One additional area not called out here is the difference | between c5.24xlarge and c5.metal instance pricing. These are | pretty much identical hardware but metal instances are often | cheaper. | | As you go down this path, do expect to see a lot of weird things | that you'll have to track down. For example, when we introduced | metal instances we found that the default Ubuntu AMI launched | with a powersave CPU governor. Non-metal instances don't support | CPU throttling so it never came up with c5.24xlarges. When we | first launched metal instances the performance per instance was | significantly worse and took a bit of work to track down. | | Recently we've seen a lot more spot interruptions and it's | pushing us to incorporate more 6th gen instances to get us more | diversity. We've also temporarily switched to capacity optimized | over price optimized and we've enabled capacity rebalancing. | | It's absolutely a win for us from a pricing perspective. Our | traffic is extremely variable each day and very seasonal | throughout the year. RIs don't make sense given <12 hrs daily | peak and 10x difference between July and September. However, just | plan for some odd surprises along the way. | Moissanite wrote: | Have you observed metal instances taking longer to boot? I did | last time I checked, and the difference was big enough to | affect pricing in a non-trivial way, given that performance is | the same and that you start paying immediately. | TheP1000 wrote: | If you want to leverage cheap spot, use us-east-2 / Ohio region. | The prices are typically half of what you see in us-east-1. | | Also, it really helps to analyze at the AZ level. Certain AZs | lack instances or have very low spot availability and contrary to | recommended best practice, reducing AZs can sometimes be | beneficial (I am looking at you eu-central-1a). 
| | While lowest price sounds nice, they can be really messy in terms | of spot interruption rate. It is much better to set a max price | and choose capacity optimized with as many instances as possible. | playingalong wrote: | > eu-central-1a | | FYI, AZ names are not universal. Your eu-central-1a might be | someone else's eu-central-1b. | bscanlan wrote: | Fun article, the phenomenon is interesting to see in practice, | I've seen it regularly with newer instance types as it can take | time for people to add them to their configurations. | | We're heavy users of spot here in Intercom. I spot-checked our | biggest workload, and this week we could have paid around 10% | less if we were able to get the cheapest spot host possible in | us-east-1 that is suitable for our workload (all 16xlarge | Gravitons). However that would be at the cost of fleet stability, | I think that to run relatively large production services used in | realtime on spot you need to prioritise fleet stability, so | choosing the "Capacity Optimized" strategy. We've seen incessant | fleet churn when trying out cost optimised strategies. | socialismisok wrote: | Is there tooling to find the global minimum price for an instance | with certain characteristics? | | I found it easy enough to do that in one region, but I've got | some compute workloads that just read/write from S3 and are not | latency sensitive. | | They do need 128 GB RAM and ephemeral disks. | DJBunnies wrote: | > compute workloads that just read/write from S3 | | > need 128 GB RAM | | Eh? | tyingq wrote: | I took "just read/write from S3" to mean that they didn't | interact with any other AWS services apart from S3. Such that | they didn't care where in the world it ran. | | Not that they didn't do anything memory intensive. | socialismisok wrote: | You got it. It's some drone image processing. Read in data | from S3, do analysis, write results. 
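The question above (the cheapest instance anywhere that has at least 128 GB of RAM and an ephemeral disk) boils down to a filter followed by a minimum over price records. Below is a minimal sketch of that selection step, assuming the records have already been collected per region, for example with the "aws ec2 describe-spot-price-history" command mentioned earlier in the thread. The `SpotOffer` type, the `cheapest_offer` helper, and every price in the sample list are made up for illustration; they are not real quotes or part of any AWS API.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class SpotOffer:
    """One hypothetical spot price record for an instance type in a region."""
    region: str
    instance_type: str
    memory_gib: float
    has_local_disk: bool
    price_per_hour: float


def cheapest_offer(
    offers: Iterable[SpotOffer],
    min_memory_gib: float,
    need_local_disk: bool,
) -> Optional[SpotOffer]:
    """Return the lowest-priced offer meeting the minimum specs, or None."""
    eligible = [
        o for o in offers
        if o.memory_gib >= min_memory_gib
        and (o.has_local_disk or not need_local_disk)
    ]
    return min(eligible, key=lambda o: o.price_per_hour, default=None)


# Made-up sample data: the prices are placeholders, not real spot quotes.
offers = [
    SpotOffer("us-east-1", "r6g.4xlarge", 128, False, 0.30),
    SpotOffer("us-east-1", "r6gd.4xlarge", 128, True, 0.35),
    SpotOffer("us-east-2", "r6gd.4xlarge", 128, True, 0.20),
    SpotOffer("eu-central-1", "m6gd.4xlarge", 64, True, 0.15),
]

best = cheapest_offer(offers, min_memory_gib=128, need_local_disk=True)
print(best)
```

Picking purely on price ignores interruption rates, which several commenters here report as the real cost of chasing the lowest-priced pool.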
| ericpauley wrote: | Spot fleet requests allow you to set minimum specs for | instances, and the fleet will be composed of any instances that | meet the spec. If it's asynchronous work, you could pick lowest | price allocation and not worry too much about interruptions. In | fact, if your work is tolerant of interruptions (batch size | <2min), you can actually save even more by being interrupted, | as you don't get billed for partial hours: | https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/billing-... ___________________________________________________________________ (page generated 2022-10-21 23:00 UTC)