[HN Gopher] Leveraging mispriced AWS spot instances
       ___________________________________________________________________
        
       Leveraging mispriced AWS spot instances
        
       Author : ericpauley
       Score  : 105 points
       Date   : 2022-10-21 13:33 UTC (9 hours ago)
        
 (HTM) web link (pauley.me)
 (TXT) w3m dump (pauley.me)
        
       | bluelightning2k wrote:
       | The article makes a HUGE assumption.
       | 
       | They spot an inconsistency between two prices, and decide that
       | the fair market value must be the very highest part of the
       | spread. Anything under this is therefore "under-priced".
       | 
       | Is it not possible instead that people are _overpaying_ for the
       | popular ones through sub-optimal bids? The article simply
       | assumes that only these inflexible/least sophisticated bidding
       | strategies represent the fair market value.
       | 
       | They actually go further. They assume that AWS could realize this
       | value, and that encouraging more flexible bids through tooling
       | etc. would move everyone to the top of the spread, instead of
       | smoothing it out towards the average. And that what is
       | essentially a price-increase can be achieved without hurting the
       | overall value (price-performance vs. flexibility). Given the
       | entire point of this is auctioning unused cycles at a discount,
       | clearly any overall increase would decrease the overall demand.
       | 
       | Having said this it's a great article. I think the overall
       | quality of the article made it so surprising to see this missed.
        
         | bushbaba wrote:
         | There's underlying capacity as well. Would you rather pay a bit
         | more to get 100 r6g.4xl OR pay a bit less to have 90 r6g.4xl +
         | 10 r6gd.4xl?
         | 
         | Many workloads do not have deployment configuration supporting
         | a non-homogenous fleet of instances. Over time this will be
         | addressed, but it could be a current major contributor to the
         | discrepancies viewed.
        
         | ericpauley wrote:
         | Author here. This is definitely a big assumption. I cut the
         | price differences in half to account for market movement, but
         | the price difference could definitely be more or less
         | especially as these pools are probably thinner markets.
        
           | bluelightning2k wrote:
           | As I said - it's a great article! This was just one thing I
           | noticed which I pointed out as it made me think.
           | 
           | Keep up the good work
        
       | benlivengood wrote:
       | Interestingly GCP already offers over 75% discounts for n2d (AMD)
       | spot instances that don't rely on any internal market, and the
       | discounts for other families are fairly close.
       | 
       | We see individual spot instances go away every few days which
       | works pretty well for GKE. The older preemptible class of
       | instances restarted every 24 hours which was more of a pain
       | (mitigated a bit with a preemptible killer to spread the restarts
       | out).
        
       | plantain wrote:
       | I've spent a lot of time trying to capitalize on these
       | mispricings, and often they're priced like that because the
       | capacity in that region/configuration is much lower and you are
       | exposed to more preemptions than in higher-priced
       | region/configurations.
        
       | JoachimSchipper wrote:
       | > Compute-optimized (C) instances can substitute for a general-
       | purpose (M) instances of half the size
       | 
       | They do have the same amount of memory (and twice the CPU). But
       | if you run a workload that automatically scales to the number of
       | available cores, starting twice the number of processes / threads
       | might well run you out of memory.
       | 
       | The article is interesting, but blindly running your code on
       | unexpected instance types may be more "exciting" than the author
       | makes it sound.
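
A minimal sketch of the concern above, assuming a workload where each worker needs a fixed amount of memory. The function name and the 4 GiB per-worker figure are hypothetical; the instance shapes (4 vCPU / 16 GiB and 8 vCPU / 16 GiB) are used only as illustrative inputs.

```python
GIB = 1024 ** 3

def worker_count(cpu_count: int, memory_bytes: int, bytes_per_worker: int) -> int:
    """Cap the worker count by both cores and memory.

    Scaling purely by core count can run a substitute instance (more
    cores, same memory) out of memory, so take the smaller of the two limits.
    """
    if bytes_per_worker <= 0:
        raise ValueError("bytes_per_worker must be positive")
    by_memory = memory_bytes // bytes_per_worker
    return max(1, min(cpu_count, by_memory))

# Illustrative shapes: a 4 vCPU / 16 GiB instance and an 8 vCPU / 16 GiB substitute.
original = worker_count(4, 16 * GIB, 4 * GIB)
substitute = worker_count(8, 16 * GIB, 4 * GIB)
print(original, substitute)
```

With a memory-aware cap like this, the substitute instance runs the same number of workers as the original instead of doubling them.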
        
         | ericpauley wrote:
         | Author here. If you design the workload you can ignore the
         | extra instances. You can actually hide these cpu cores from
         | instances within the AWS api (see setting instance vCPU) so it
         | is truly transparent.
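
A hedged sketch of what "setting instance vCPU" could look like with boto3: the `CpuOptions` parameter of `run_instances` limits the cores visible to the instance. The helper name and the AMI ID below are placeholders, and the set of valid core counts depends on the instance type, so treat the numbers as assumptions.

```python
def run_instances_params(instance_type, image_id, visible_cores, threads_per_core=1):
    """Build keyword arguments for boto3's ec2.run_instances().

    CpuOptions limits how many cores the guest sees. This only builds the
    request; nothing is sent to AWS here.
    """
    return {
        "ImageId": image_id,
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
        # Request the instance on the spot market.
        "InstanceMarketOptions": {"MarketType": "spot"},
        "CpuOptions": {
            "CoreCount": visible_cores,
            "ThreadsPerCore": threads_per_core,
        },
    }

# Hypothetical example: expose only 4 of the 8 vCPUs of a c6g.2xlarge.
# "ami-EXAMPLE" is a placeholder, not a real image ID.
params = run_instances_params("c6g.2xlarge", "ami-EXAMPLE", visible_cores=4)
# To launch for real: boto3.client("ec2").run_instances(**params)
print(params["CpuOptions"])
```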
        
       | hosh wrote:
       | One thing to note is volatility. Spot instances are great for
       | workloads that can absorb spot instance interruptions, and those
       | interruptions tend to happen more if everyone else is trying to
       | get spot instances at that time. Stateless web workloads that can
       | start up and shut down fast are a good example.
       | 
       | Some workloads might not. You wouldn't want to run stateful
       | workloads on spot, for instance. In our case, we have something
       | that doesn't handle bootup under load very well, and until we can
       | improve that, the overall reliability is not as good.
       | 
       | I also like GCP's way of pricing these: you say whether your
       | workload is preemptible or not, and you get discounts. You
       | automatically get discounts if you run the workload for a long
       | time.
        
       | StratusBen wrote:
       | As someone who spends entirely too much time thinking about cloud
       | infrastructure costs ( I'm co-founder of https://www.vantage.sh/
       | which maintains https://ec2instances.info/ ) I just want to
       | recognize the amount of effort that went into this blog post to
       | collect the data and express an interesting perspective for a
       | fairly complicated topic.
       | 
       | Kudos to the author on producing this.
        
         | pradn wrote:
         | Great work on vantage.sh and ec2instances.info!
         | 
         | Quick, small fix: This instance is shown as having 0 GBs of
         | memory, but in fact it has 0.5 GB.
         | https://instances.vantage.sh/aws/ec2/t4g.nano
        
           | StratusBen wrote:
           | It's community supported! We just pay the bills and maintain
           | hosting it :)
           | 
           | Do you mind opening an issue on the repo here?
           | https://github.com/vantage-sh/ec2instances.info
           | 
           | Thank you for the report!
        
         | awsthrowawy5767 wrote:
         | You should know we use this tool _inside_ of AWS as well. Not
         | in EC2 itself, but many many other places
        
         | alex_duf wrote:
         | Oh I've used ec2instances.info very _very_ often, so thank you
         | for that. So useful!
        
       | kureikain wrote:
       | I run a service with an API that can help get spot prices:
       | https://ec2.shop/
       | 
       | Simply do:
       | 
       | curl 'https://ec2.shop?region=us-west-2&filter=m5&json' | jq
       | 
       | You can pipe it to whatever store your system uses to get the
       | real-time price without dealing with the AWS Price API.
        
       | Havoc wrote:
       | Quite surprised that there is this degree of mispricing. I would
       | have thought it's a market that is big and diverse enough to iron
       | that out. Especially given that the participants in question
       | would tend toward the analytical side of things
        
         | milesvp wrote:
         | I was thinking the same thing. I'm wondering if the price
         | differences are reflecting a general demand for certain sizes?
         | When I was maintaining AWS servers, I don't think it would have
         | been easy for me to take advantage of spot prices that were
         | outside of the sizes I was already using. I'd tuned things such
         | that I knew the sizes I tended to need to have the redundancy I
         | needed, and then could auto scale when necessary. Which means,
         | I would never have bid on spot instances that were bigger than
         | what I needed, because it would have been way more complicated
         | to analyze the state of the system as a whole and make sure
         | scaling happened when it needed to. Which also introduces risk
         | that probably was never worth the savings. So if you had a lot
         | of people like me, you'd get m3.large (or whatever current
         | naming) as the thing that gets bid up the most, because it hit
         | an autoscaling sweet spot
        
           | Havoc wrote:
           | > it would have been way more complicated to analyze the
           | state of the system as a whole and make sure scaling happened
           | when it needed to
           | 
           | Yeah, that's probably what's going on here. Complexity, and
           | that it's just a bit counterintuitive.
        
       | pid-1 wrote:
       | > For example, you can make all these substitutions:
       | 
       | >c6g.2xlarge-c6g.4xlarge-m6g.4xlarge-r6g.4xlarge-r6gd.4xlarge
       | 
       | A long-standing ticket in my personal project backlog is
       | comparing the performance of different instance types. I'm not
       | sure this equivalence is without caveats.
       | 
       | Anyhow, the reason "misprices" exist is because:
       | 
       | - Many AWS products are elastic but only allow one to choose a
       | single instance type. So you need to guess the best instance for
       | a workload and stick with it.
       | 
       | - No AWS product exposes a "Just give me the cheapest VM with x
       | CPU and Y memory" API
        
         | universa1 wrote:
         | The end of the article shows a request for just that, doesn't
         | it? No clue about the API though...
         | 
         | Depending on your workload you might be able to actually
         | substitute a single 8xlarge with two 4xlarge for example... A
         | while back I was actually doing something like that to save
         | some money :-)
        
           | pid-1 wrote:
           | Wow totally missed that. Cool stuff!
        
         | alFReD-NSH wrote:
         | An Auto Scaling group with a mixed-instance-type spot strategy
         | does that. You can even give weights to instance types, giving
         | more performant/higher-capacity types a higher weight, and it
         | can choose the cheapest one with the weights in mind.
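
The weighting idea can be sketched as a price-per-capacity-unit comparison. This illustrates the concept only, not the Auto Scaling group's actual selection algorithm; the `Offer` type and all prices are made up.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Offer:
    instance_type: str
    spot_price: float  # USD per instance-hour (illustrative values only)
    weight: int        # capacity units this instance type counts for

def cheapest_per_unit(offers):
    """Return the offer with the lowest price per weighted capacity unit."""
    return min(offers, key=lambda o: o.spot_price / o.weight)

offers = [
    Offer("c6g.2xlarge", 0.12, 1),
    Offer("c6g.4xlarge", 0.30, 2),
    Offer("c6gd.4xlarge", 0.20, 2),
]
best = cheapest_per_unit(offers)
print(best.instance_type)
```

Here the larger instance with local storage wins because its price per unit (0.10) is below both alternatives, even though its per-instance price is not the lowest.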
        
         | pclmulqdq wrote:
         | AWS doesn't want you to have that API! That's a significant
         | part of their margin.
        
       | jftuga wrote:
       | I wrote a program to get AWS spot instance pricing. This program
       | is similar to using "aws ec2 describe-spot-price-history" but is
       | faster and has a few more options.
       | 
       | https://github.com/jftuga/spotprice
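
For comparison, a rough sketch of post-processing a `describe_spot_price_history` response in Python. It assumes the boto3 response shape (a `SpotPriceHistory` list whose entries carry `InstanceType`, `AvailabilityZone`, `SpotPrice` as a string, and `Timestamp`); the sample data is invented so the example runs offline.

```python
from datetime import datetime, timezone

def latest_spot_prices(response):
    """Map (instance type, AZ) to the most recent spot price in the response."""
    latest = {}
    for entry in response["SpotPriceHistory"]:
        key = (entry["InstanceType"], entry["AvailabilityZone"])
        if key not in latest or entry["Timestamp"] > latest[key]["Timestamp"]:
            latest[key] = entry
    # SpotPrice arrives as a string, so convert it for comparisons.
    return {key: float(e["SpotPrice"]) for key, e in latest.items()}

def _ts(hour):
    return datetime(2022, 10, 21, hour, tzinfo=timezone.utc)

# Invented sample data in the assumed response shape.
sample = {"SpotPriceHistory": [
    {"InstanceType": "c6g.4xlarge", "AvailabilityZone": "us-east-1a",
     "SpotPrice": "0.2500", "Timestamp": _ts(12)},
    {"InstanceType": "c6g.4xlarge", "AvailabilityZone": "us-east-1a",
     "SpotPrice": "0.2600", "Timestamp": _ts(13)},
    {"InstanceType": "c6gd.4xlarge", "AvailabilityZone": "us-east-1a",
     "SpotPrice": "0.2100", "Timestamp": _ts(13)},
]}
prices = latest_spot_prices(sample)
cheapest = min(prices, key=prices.get)
print(cheapest, prices[cheapest])
```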
        
       | Moissanite wrote:
       | It is completely incorrect to characterize these observations as
       | "mispricing" - this is a quirk of automatically-determined prices
       | across very different products. If the author actually tried to
       | use these instances in any significant volume they would
       | understand the driver - capacity pools are nowhere near equal,
       | and not as interchangeable for AWS as the article implies they
       | would be for a user. Prices reflect demand munged with available
       | capacity - uncommon instance types are uncommon precisely because
       | they aren't used as much, so there aren't the same signals to
       | drive the price up and down automatically.
       | 
       | Instances with attached NVMe are available in much lower volumes
       | than others, as are AMD instances. Obviously these pools cannot
       | be used as a drop-in replacement for non-"d" instances or Intel
       | families.
        
         | ericpauley wrote:
         | Author here. The key here is that customers can leverage these
         | pools in addition to their existing pools, improving capacity
         | and price. AWS actually supports this out of the box (including
         | substituting instances with drives) by specifying core and
         | memory requirements directly instead of instance types.
        
           | snake_doc wrote:
           | > Across all AWS availability zones instances are mispriced
           | by roughly $400/hr at any given time. This means that, with
           | just a single instance of each type, Amazon is missing out on
           | $200/hr or roughly $1.7 million each year. This is over
           | roughly 15,000 pools of instances. Given Amazon controls
           | roughly 100 million IPs, we can guess that each instance pool
           | probably has on the order of 1000 instances (more for smaller
           | instances, less for larger instances). Given this, the
           | average mispriced pool might have hundreds of instances,
           | meaning hundreds of millions each year in missed revenue due
           | to mispriced spot instances. Because amazon keeps their
           | number of instances a secret, it's difficult to make a
           | precise estimate from the outside, but the missed revenue
           | probably falls somewhere in this range.
           | 
           | You are hypothesizing that the price differences produce
           | "lost" revenues.
           | 
           | An alternative hypothesis can be that the price differences
           | produce similar or higher level of revenues for AWS through
           | price segmentation, with Amazon recognizing the lack of
           | adoption of certain spot instance bidding features and
           | auction markets reacting appropriately.
           | 
           | Unless you have the capacity and quantity demanded for each
           | instance types, you can't prove your hypothesis. You are
           | assuming scenario 3 (below) with no insights into price
           | elasticity of the underlying customers.
           | 
           | Example:
           | 
           | Baseline:
           | 
           | Instance types A and B are equivalent.
           | 
           | A is priced at $3, with capacity of 1000, quantity demanded
           | of 800.
           | B is priced at $2, with capacity of 1000, quantity demanded
           | of 200.
           | Total quantity demanded = 1,000.
           | 
           | Revenues from instance type A = $3 x 800 = $2,400
           | Revenues from instance type B = $2 x 200 = $400
           | 
           | Total revenues = $2,800
           | 
           | Scenario 1: All customers purchase instance B instead due to
           | better price discovery.
           | 
           | Revenues from instance type A = $3 x 0 = $0
           | Revenues from instance type B = $2 x 1,000 = $2,000
           | Total quantity demanded = 1,000.
           | 
           | Total revenues = $2,000
           | 
           | Amazon loses $800 in revenues; there are no "lost" revenues
           | recovered.
           | 
           | Scenario 2: Amazon changes instance type B price to $3. Total
           | quantity demanded decreases to 900 due to price elasticity of
           | instance type B customers.
           | 
           | Revenues from instance type A = $3 x 800 = $2,400
           | Revenues from instance type B = $3 x 100 = $300
           | 
           | Total revenues = $2,700
           | 
           | Amazon loses $100 in revenues; there are no "lost" revenues
           | recovered.
           | 
           | Scenario 3: Amazon changes instance type B price to $3. Total
           | quantity demanded remains at 1,000.
           | 
           | Revenues from instance type A = $3 x 800 = $2,400
           | Revenues from instance type B = $3 x 200 = $600
           | 
           | Total revenues = $3,000
           | 
           | Amazon recovers $200 in "lost" revenues.
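
The arithmetic in the scenarios above can be checked with a short script. The prices and quantities are exactly the ones from the comment; the function and variable names are just for illustration.

```python
def revenue(pools):
    """Total revenue for a list of (price, quantity demanded) pools."""
    return sum(price * quantity for price, quantity in pools)

# (price in USD, quantity demanded) for instance types A and B.
scenarios = {
    "baseline": [(3, 800), (2, 200)],
    "scenario 1": [(3, 0), (2, 1000)],   # everyone moves to B at $2
    "scenario 2": [(3, 800), (3, 100)],  # B repriced to $3, demand drops
    "scenario 3": [(3, 800), (3, 200)],  # B repriced to $3, demand holds
}

baseline = revenue(scenarios["baseline"])
deltas = {name: revenue(pools) - baseline for name, pools in scenarios.items()}
for name, delta in deltas.items():
    print(f"{name}: total={revenue(scenarios[name])} change={delta:+d}")
```

Only the scenario that assumes perfectly inelastic demand comes out ahead of the baseline.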
        
             | ericpauley wrote:
             | The missing component of your analysis is that Amazon has a
             | 4th option: re-sell instances of B as instances of A when A
             | is more expensive, and otherwise allowing the market to
             | adjust. The analysis is strictly limited to instances where
             | amazon could, in theory, do this (e.g., reselling c6gd as
             | c6g).
             | 
             | Assuming the market is in equilibrium, the above scenarios
             | aren't realistic, as demand at the market price would equal
             | supply at the current price ( _roughly_ , of course).
             | 
             | Suppose there are 1000 c6g and 200 c6gd, with equilibrium
             | price of $3 and $2, respectively (i.e., all instances have
             | demand). Amazon re-SKUs c6gd as c6g until there are 1100
             | c6g selling for $2.90 and 100 c6gd selling at $2.90. Total
             | revenue is $3480 vs. $3400. Of course it's impossible to
             | know the true numbers without hidden knowledge of the
             | market, but this is more akin to what would occur. Amazon
             | effectively has a risk-free arbitrage opportunity here, so
             | it stands to reason that there is revenue to be made.
             | Customers don't have this option (since you can't short
             | spot instances), so the best you can do is diversify and
             | save money.
             | 
             | Edit: Actually, the AWS spot market is often out of
             | equilibrium in a way that makes this reselling _even more
             | effective_. For instance, in the example in the article the
             | c6gd instance is actually pegged at the minimum price, so
             | some number of those instances could be resold as c6g
             | without moving the c6gd price _at all_.
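
The re-SKU example above works out as follows. This sketch uses integer cents to avoid floating-point rounding; the pool sizes and prices are the hypothetical ones from the comment, not real market data.

```python
def revenue_cents(pools):
    """Total revenue in cents for a list of (price in cents, instance count)."""
    return sum(price * count for price, count in pools)

# Before: 1000 c6g at $3.00 and 200 c6gd at $2.00.
before = revenue_cents([(300, 1000), (200, 200)])
# After re-SKUing 100 c6gd as c6g: 1100 c6g and 100 c6gd, all at $2.90.
after = revenue_cents([(290, 1100), (290, 100)])

print(before / 100, after / 100, (after - before) / 100)
```

The gain in this toy case is $80 on $3,400, and it depends entirely on the assumed equilibrium prices.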
        
               | snake_doc wrote:
               | I think you're thinking about the revenue functions for
               | spot instances in isolation from the larger supply base of
               | all instances. Spot instances are already a result of revenue
               | management of a fixed supply base that increases in
               | discrete increments over time. Instance capacity overall
               | usually leads instance demand, shortage costs are very
               | high in data centers.
               | 
               | Spot instance capacities are a function of the total
               | instance capacity for the same type and on-demand
               | instance usage. Spot instance pricing can influence the
               | quantity demanded of on-demand instances of the same
               | type, and vice-versa.
               | 
               | Anyhow, there's no way we can figure out whether you're
               | right or wrong with any reasonable level of certainty.
        
               | ericpauley wrote:
               | While it's tough to say with certainty how much revenue
               | is lost, there is certainly lost revenue. Consider that
               | many substitute instances are available at the minimum
               | allowable price (i.e., won't go any lower, there is
               | unused capacity). These could be resold without moving
               | the substitute market.
        
             | pclmulqdq wrote:
             | The mispricing is likely good for Amazon. It indicates that
             | most people aren't doing this arbitrage, so Amazon can milk
             | them for extra money.
        
           | Moissanite wrote:
           | Totally agree with that; it is a pretty common approach. The
           | only part I don't agree with is calling out the price
           | differences as some kind of "gotcha" that AWS somehow missed,
           | particularly given the speculative "lost revenue" data which
           | have no basis in reality.
        
             | ericpauley wrote:
             | See the emphasis on transparent substitutes in the article.
             | This analysis is limited _strictly_ to sets of instances
             | that are fully hardware compatible, meaning AWS could
             | resell one instance as another. There are way more savings
             | to be had as a customer by leveraging instances that aren't
             | transparent substitutes.
        
               | Moissanite wrote:
               | I read it all, and don't agree with your interpretation
               | of "transparent substitutes" in several of the cases.
        
               | ericpauley wrote:
               | Which instances are not transparent substitutes, in your
               | opinion? Keep in mind the definition here is that _Amazon_
               | could substitute the image transparently, e.g., by
               | ignoring the additional resources in the hypervisor, not that
               | the instances are by default indistinguishable.
               | 
               | That being said, the substitute instances considered
               | could be trivially accepted by any task running on the
               | original instance, so long as it doesn't misbehave when
               | given too many resources. In the case of vCPU, you can
               | even hide extra vCPU cores, so a c6g.xlarge can be made
               | effectively indistinguishable from a m6g.2xlarge by
               | disabling the vCPUs at the hypervisor level.
        
         | pclmulqdq wrote:
         | In financial markets, this quirk of automatically-determined
         | prices across different products is frequently called
         | "mispricing" when those products logically _should_ have a
         | relationship with each other.
         | 
         | Straightforwardly: All hosts with space for a c6gd spot
         | instance have space for a c6g instance. If Amazon is willing to
         | host a c6gd instance in that slot for $X, they should be
         | willing to also host a c6g instance there for $X.
         | 
         | In financial markets, the way this gets handled is through
         | arbitrage: someone will buy the equivalent of the c6gd
         | instance, and sell the c6g part for the higher price (they may
         | also sell the "d" part for even more money). This has the
         | effect of "correcting" the price. The AWS spot market does not
         | allow you to do arbitrage, and AWS doesn't appear to do the
         | arbitrage for you.
         | 
         | AWS probably likes this inefficiency in their market: some
         | instance types are more popular than others, and some customers
         | make assumptions that require them to use a very specific
         | instance type (ie a c6gd would not work as a substitute for
         | their c6g instance). However, the vast majority of users
         | probably could work just fine if their c6g instance were a
         | c6gd, and don't look for the arbitrage opportunity. That means
         | Amazon gets paid extra.
        
           | Moissanite wrote:
           | > If Amazon is willing to host a c6gd instance in that slot
           | for $X, they should be willing to also host a c6g instance
           | there for $X.
           | 
           | The reality is that direct c6gd demand might be an order of
           | magnitude lower than c6g direct demand - if AWS can get some
           | more flexible people to adopt c6gd by offering a lower price,
           | c6g capacity is slightly stabilized for on-demand usage by
           | people who don't value the flexibility.
           | 
           | Also note that c6g to c6gd has a non-zero switching cost -
           | extra NVMe on the instance adds a new source of potential
           | hardware failure, increasing the probability of termination
           | very slightly. There might be other software-related costs
           | depending on whether your application makes any ill-advised
           | assumptions about attached storage during setup.
           | 
           | So overall, I would just be happier to read this article if
           | it was framed as "PSA: having more features in an ec2
           | instance is sometimes cheaper! Don't rule yourself out of
           | extra savings by making overly-constrained fleet requests."
           | The extra commentary about foregone revenue makes too many
           | assumptions and detracts from the core point.
        
             | pclmulqdq wrote:
             | The point is that Amazon doesn't have to fill that slot
             | with a c6gd. They can also fill it with a c6g. They just
             | choose not to.
             | 
             | The fact that you have to host a c6gd to get that price
             | instead of a c6g is an inefficiency in the spot market that
             | likely makes Amazon money, but is a little customer-
             | hostile. I think the article is probably wrong that Amazon
             | is foregoing revenue due to this. This is a form of price
             | discrimination and it is likely making Amazon money, but in
             | a scummy way.
        
               | ericpauley wrote:
               | Agreed that it's definitely difficult to know the true
               | missed revenue here without internal data, and even then
               | you'd be making some assumptions. I am confident there is
               | _some_ missed revenue here, as amazon routinely has spot
               | capacity constraints under existing prices so could
               | definitely sell some substitute instances without moving
               | the original instance market (even one instance per pool
               | substituted equates to  >$1M per year). In either case, a
               | savvy organization can definitely benefit from the price
               | discrepancy even if Amazon couldn't.
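
For reference, the ">$1M per year" figure follows from the numbers quoted from the article earlier in the thread: roughly $400/hr of price gaps across all pools, halved to account for market movement. A quick check, with hypothetical function and parameter names:

```python
HOURS_PER_YEAR = 24 * 365

def annual_missed_revenue(gap_per_hour, haircut=0.5, instances_per_pool=1):
    """Annualize an hourly price gap summed over all pools.

    haircut=0.5 mirrors the author's halving of the price differences;
    instances_per_pool is unknown from the outside, so it defaults to 1.
    """
    return gap_per_hour * haircut * instances_per_pool * HOURS_PER_YEAR

single = annual_missed_revenue(400)
print(f"${single:,.0f} per year with one instance per pool")
```

The estimate scales linearly with the assumed number of substitutable instances per pool, which is exactly the quantity nobody outside AWS can observe.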
        
               | Moissanite wrote:
               | I can agree that there is missed revenue - but
               | realistically it would make much more sense to sell that
               | capacity via Fargate (which is closer to undifferentiated
               | generic compute and RAM) rather than monkeying with the
               | spot pricing algorithm.
        
               | ericpauley wrote:
               | Great point on Fargate. I'd be very curious whether they
               | select capacity for that from EC2 capacity or if there's
               | a separate physical footprint for it.
        
       | phamilton wrote:
       | We run a very large installation 100% on spot and have done for a
       | few years. We serve our web traffic, do background work, etc. all
       | on spot instances.
       | 
       | We see similar mismatched pricing all the time and take advantage
       | of it. One additional area not called out here is the difference
       | between c5.24xlarge and c5.metal instance pricing. These are
       | pretty much identical hardware but metal instances are often
       | cheaper.
       | 
       | As you go down this path, do expect to see a lot of weird things
       | that you'll have to track down. For example, when we introduced
       | metal instances we found that the default ubuntu AMI launched
       | with a powersave cpu governor. Non-metal instances don't support
       | CPU throttling so it never came up with c5.24xlarges. When we
       | first launched metal instances the performance per instance was
       | significantly worse and took a bit of work to track down.
       | 
       | Recently we've seen a lot more spot interruptions and it's
       | pushing us to incorporate more 6th gen instances to get us more
       | diversity. We've also temporarily switched to capacity optimized
       | over price optimized and we've enabled capacity rebalancing.
       | 
       | It's absolutely a win for us from a pricing perspective. Our
       | traffic is extremely variable each day and very seasonal
       | throughout the year. RIs don't make sense given <12 hrs daily
       | peak and 10x difference between July and September. However, just
       | plan for some odd surprises along the way.
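
One way to catch the CPU governor surprise described above is to check the governor at boot. A minimal sketch, assuming a Linux host that exposes the standard cpufreq sysfs files; on instances without frequency scaling the files are simply absent and the result is empty. The function name is hypothetical.

```python
from pathlib import Path

def cpu_governors(sysfs_root="/sys/devices/system/cpu"):
    """Return {cpu name: scaling governor} for every CPU exposing cpufreq."""
    governors = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for path in sorted(Path(sysfs_root).glob(pattern)):
        # path is .../cpuN/cpufreq/scaling_governor, so the CPU name is two levels up.
        governors[path.parent.parent.name] = path.read_text().strip()
    return governors

if __name__ == "__main__":
    slow = {cpu: gov for cpu, gov in cpu_governors().items() if gov == "powersave"}
    if slow:
        print(f"{len(slow)} CPUs are using the powersave governor")
```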
        
         | Moissanite wrote:
         | Have you observed metal instances taking longer to boot? I did
         | last time I checked, and the difference was big enough to
         | affect pricing in a non-trivial way, given that performance is
         | the same and that you start paying immediately.
        
       | TheP1000 wrote:
       | If you want to leverage cheap spot, use us-east-2 / Ohio region.
       | The prices are typically half of what you see in us-east-1.
       | 
       | Also, it really helps to analyze at the AZ level. Certain AZs
       | lack instances or have very low spot availability and contrary to
       | recommended best practice, reducing AZs can sometimes be
       | beneficial (I am looking at you eu-central-1a).
       | 
       | While lowest price sounds nice, it can be really messy in terms
       | of spot interruption rate. It is much better to set a max price
       | and choose capacity optimized with as many instances as possible.
        
         | playingalong wrote:
         | > eu-central-1a
         | 
         | FYI, AZ names are not universal. Your eu-central-1a might be
         | someone else's eu-central-1b.
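
To compare AZs across accounts, the stable identifier is the zone ID rather than the zone name. A small sketch, assuming the response shape of boto3's `describe_availability_zones` (entries with `ZoneName` and `ZoneId`); the sample mapping is invented and will differ per account.

```python
def zone_name_to_id(response):
    """Map account-specific AZ names to zone IDs."""
    return {az["ZoneName"]: az["ZoneId"] for az in response["AvailabilityZones"]}

# Invented sample in the assumed response shape. A real call would be:
# boto3.client("ec2", region_name="eu-central-1").describe_availability_zones()
sample = {"AvailabilityZones": [
    {"ZoneName": "eu-central-1a", "ZoneId": "euc1-az2"},
    {"ZoneName": "eu-central-1b", "ZoneId": "euc1-az3"},
]}
mapping = zone_name_to_id(sample)
print(mapping["eu-central-1a"])
```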
        
       | bscanlan wrote:
       | Fun article, the phenomenon is interesting to see in practice,
       | I've seen it regularly with newer instance types as it can take
       | time for people to add them to their configurations.
       | 
       | We're heavy users of spot here in Intercom. I spot-checked our
       | biggest workload, and this week we could have paid around 10%
       | less if we were able to get the cheapest spot host possible in
       | us-east-1 that is suitable for our workload (all 16xlarge
       | Gravitons). However that would be at the cost of fleet stability,
       | I think that to run relatively large production services used in
       | realtime on spot you need to prioritise fleet stability, so
       | choosing the "Capacity Optimized" strategy. We've seen incessant
       | fleet churn when trying out cost optimised strategies.
        
       | socialismisok wrote:
       | Is there tooling to find the global minimum price for an instance
       | with certain characteristics?
       | 
       | I found it easy enough to do that in one region, but I've got
       | some compute workloads that just read/write from S3 and are not
       | latency sensitive.
       | 
       | They do need 128 GB RAM and ephemeral disks.
        
         | DJBunnies wrote:
         | > compute workloads that just read/write from S3
         | 
         | > need 128 GB RAM
         | 
         | Eh?
        
           | tyingq wrote:
           | I took "just read/write from S3" to mean that they didn't
           | interact with any other AWS services apart from S3. Such that
           | they didn't care where in the world it ran.
           | 
           | Not that they didn't do anything memory intensive.
        
             | socialismisok wrote:
             | You got it. It's some drone image processing. Read in data
             | from S3, do analysis, write results.
        
         | ericpauley wrote:
         | Spot fleet requests allow you to set minimum specs for
         | instances, and the fleet will be composed of any instances that
         | meet the spec. If it's asynchronous work, you could pick lowest
         | price allocation and not worry too much about interruptions. In
         | fact, if your work is tolerant of interruptions (batch size
         | <2min), you can actually save even more by being interrupted,
         | as you don't get billed for partial hours:
         | https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/billing-...
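
A hedged sketch of what "minimum specs" could look like for the workload asked about above (128 GB RAM and ephemeral disks). It builds an `InstanceRequirements` fragment of the kind used in EC2 Fleet and Spot Fleet launch template overrides; the helper name is hypothetical and the minimum vCPU count is an assumption, since the question did not state one.

```python
def instance_requirements(min_vcpus, min_memory_gib, need_local_disk):
    """Build an InstanceRequirements fragment for attribute-based selection."""
    return {
        "VCpuCount": {"Min": min_vcpus},
        # The API expresses memory in MiB.
        "MemoryMiB": {"Min": min_memory_gib * 1024},
        # "required" restricts the fleet to types with instance store volumes;
        # "included" allows types with or without them.
        "LocalStorage": "required" if need_local_disk else "included",
    }

# min_vcpus=1 is a placeholder assumption, not part of the original question.
requirements = instance_requirements(min_vcpus=1, min_memory_gib=128,
                                     need_local_disk=True)
print(requirements)
```

The fragment would sit under an override in the fleet request alongside an allocation strategy; it does not by itself search across regions, so a global minimum would still need one request or price query per region.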
        
       ___________________________________________________________________
       (page generated 2022-10-21 23:00 UTC)