[HN Gopher] Are GPUs Worth It for ML?
       ___________________________________________________________________
        
       Are GPUs Worth It for ML?
        
       Author : varunkmohan
       Score  : 81 points
       Date   : 2022-08-29 18:34 UTC (4 hours ago)
        
 (HTM) web link (exafunction.com)
 (TXT) w3m dump (exafunction.com)
        
       | mpaepper wrote:
       | This also very much depends on the inference use case / context.
       | For example, I work in deep learning on digital pathology where
        | images can be up to 100000x100000 pixels in size and inference
       | needs GPUs as it's just way too slow otherwise.
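
          As an illustration of why throughput matters at that scale, a
          minimal sketch of tiled inference over one such slide; here
          read_region and model are hypothetical stand-ins for the slide
          reader and the batched network.

              import numpy as np

              TILE = 1024    # patch edge in pixels
              BATCH = 32     # patches sent to the GPU at once

              def tiled_inference(read_region, model, width, height):
                  # ~9,600 tiles for a 100000x100000 slide at TILE=1024
                  batch, coords, results = [], [], []
                  for y in range(0, height, TILE):
                      for x in range(0, width, TILE):
                          # read_region is assumed to pad at the borders
                          batch.append(read_region(x, y, TILE, TILE))
                          coords.append((x, y))
                          if len(batch) == BATCH:
                              preds = model(np.stack(batch))
                              results.extend(zip(coords, preds))
                              batch, coords = [], []
                  if batch:
                      results.extend(zip(coords, model(np.stack(batch))))
                  return results    # list of ((x, y), prediction)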
        
       | PeterStuer wrote:
       | " It feels wasteful to have an expensive GPU sitting idle while
       | we are executing the CPU portions of the ML workflow"
       | 
       | What is expensive? Those 3090ti's are looking very tasteful at
       | current prices.
        
       | ummonk wrote:
        | What a clickbaity article. It's an interesting discussion of GPU
        | multiplexing for ML inference merged with a sales pitch, but the
        | clickbait title made the article feel like a bait and switch.
        | This wasn't even an example of Betteridge's law, just a
        | completely misleading headline.
        
         | Eridrus wrote:
         | Is everyone with relevant inference costs not doing this
         | already?
         | 
         | I am so confused how there seems to be a startup around having
         | a work queue that does batching...
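
            For reference, the core of such a batching work queue fits in a
            few lines; this is only a sketch of the idea, with run_model
            standing in for whatever scores a whole batch on the GPU.

                import queue, time

                MAX_BATCH = 16
                MAX_WAIT_S = 0.005   # wait this long for the batch to fill

                def batching_worker(requests, run_model):
                    while True:
                        first = requests.get()   # block for the first item
                        batch = [first]
                        deadline = time.monotonic() + MAX_WAIT_S
                        while len(batch) < MAX_BATCH:
                            remaining = deadline - time.monotonic()
                            if remaining <= 0:
                                break
                            try:
                                batch.append(requests.get(timeout=remaining))
                            except queue.Empty:
                                break
                        inputs, replies = zip(*batch)
                        outputs = run_model(list(inputs))
                        for reply, out in zip(replies, outputs):
                            reply.put(out)       # hand each caller its result

                def infer(requests, x):
                    # callers enqueue (input, private reply queue) pairs
                    reply = queue.Queue(maxsize=1)
                    requests.put((x, reply))
                    return reply.get()

            The worker would run in its own thread; real servers add
            per-request deadlines, padding, and error handling on top of
            this.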
        
       | Kukumber wrote:
        | An interesting question; it shows how insanely overpriced GPUs
        | still are, especially in the cloud environment
        
         | pqn wrote:
         | Disclaimer: I work at Exafunction
         | 
         | I empathize a bit with the cloud providers as they have to
         | upgrade their data centers every few years with new GPU
         | instances and it's hard for them to anticipate demand.
         | 
         | But if you can easily use every trick in the book (CPU version
         | of the model, autoscaling to zero, model compilation, keeping
         | inference in your own VPC, using spot instances, etc.) then
         | it's usually still worth it.
        
         | NavinF wrote:
         | *only in the cloud environment
         | 
         | Throw some 3090s in a rack and you'll break even in 3 months
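
            A back-of-the-envelope version of that claim, with placeholder
            prices (neither number is a quote; power, chassis and ops time
            are ignored):

                CARD_PRICE = 1200.0  # assumed street price of one 3090, USD
                CLOUD_RATE = 0.60    # assumed hourly rate, comparable cloud GPU
                HOURS_PER_DAY = 24   # card kept busy around the clock

                days = CARD_PRICE / (CLOUD_RATE * HOURS_PER_DAY)
                print(f"break even after ~{days:.0f} days")  # ~83 days here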
        
         | mistrial9 wrote:
          | the HPC crowd is not able to add GPUs, as far as I know.. the
          | deep learning family of algorithms does kick butt for lots of
          | kinds of problems and data .. though I will advocate that DL is
          | NOT the only game in town, despite what you often read here
        
           | Frost1x wrote:
            | In what context? HPC and certain code bases have been
            | effectively leveraging heterogeneous CPU/GPU workloads for a
            | variety of applications for quite a while. I know of some
            | doing so in at least 2009, and I know plenty of prior art was
            | already there by that point; it's just a specific time I
            | happen to remember.
        
       | fancyfredbot wrote:
       | There are some pretty elegant solutions out there for the problem
       | of having the right ratio of CPU to GPU. One of the nicer ones is
       | rCUDA.
       | https://scholar.google.com/citations?view_op=view_citation&h...
        
         | varunkmohan wrote:
          | rCUDA is super cool! One of the issues though is that a lot of
          | the common model frameworks are not supported, and a new
          | release has not come out in a while.
        
           | fancyfredbot wrote:
            | Fair point. It's not obvious from the website which model
            | frameworks Exafunction supports, or when the last Exafunction
            | release was.
        
       | PeterisP wrote:
       | For some reason they focus on the inference, which is the
       | computationally cheap part. If you're working on ML (as opposed
       | to deploying someone else's ML) then almost all of your workload
       | is training, not inference.
        
         | cardine wrote:
         | I have not found this to be true at all in my field (natural
         | language generation).
         | 
         | We have a 7 figure GPU setup that is running 24/7 at 100%
         | utilization just to handle inference.
        
           | fartcannon wrote:
           | How do you train new models if your GPUs are being used for
           | inference? I guess the training happens significantly less
           | frequently?
           | 
           | Forgive my ignorance.
        
             | jacquesm wrote:
             | Typically a different set of hardware for model training.
        
             | cardine wrote:
             | We have different servers for each. But the split is
             | usually 80%/20% for inference/training. As our product
             | grows in usage the 80% number is steadily increasing.
             | 
             | That isn't because we aren't training that often - we are
             | almost always training many new models. It is just that
             | inference is so computationally expensive!
        
           | dheera wrote:
           | Also true of self-driving. You train a perception model for a
           | week and then log millions of vehicle-hours on inference.
        
         | MichaelBurge wrote:
         | Think Google: Every time you search, some model somewhere gets
         | invoked, and the aggregate inference cost would dwarf even very
         | large training costs if you have billions of searches.
         | 
          | Marketing blogspam like this is always targeting big (not
          | Google, but big) companies, hoping to divert their big IT
          | budgets to their coffers: "You have X million queries to your
          | model every day. Imagine if we billed you per-request, but
          | scaled the price so in aggregate it's slightly cheaper than
          | your current spending."
         | 
          | People who are training-constrained are early-stage (i.e. they
          | correlate with not having money), and then they'd need to buy
          | an entirely separate set of GPUs to support you (e.g. T4s are
          | good for inference, but they need V100s for training). So they
          | choose to ignore you entirely.
        
         | Jensson wrote:
         | If you are training models that are intended to be used in
         | production at scale then training is dirt cheap compared to
         | inference. There is a reason why Google focused on inference
          | first with their TPUs even though Google does a lot of ML
         | training.
        
           | dllthomas wrote:
           | I think another part of the question is whether you're
           | scaling on your own hardware or the customers' hardware.
        
         | gowld wrote:
        
         | jldugger wrote:
         | > If you're working on ML (as opposed to deploying someone
         | else's ML) then almost all of your workload is training, not
         | inference.
         | 
         | Wouldn't that depend on the size of your customer base? Or at
         | least, requests per second?
        
           | karamanolev wrote:
           | With more customers usually the revenue and profit grow, then
           | the team becomes larger, wants to perform more experiments,
           | spends more on training and so on. Inference is just so
           | computationally cheap compared to training.
           | 
            | That's what I've seen in my experience, but I concede that
            | there might be cases where the ML is a more-or-less solved
            | problem for a very large customer base and inference is the
            | bigger cost. I've rarely seen it happen, but other people are
           | sharing scenarios where it happens frequently. So I guess it
           | massively depends on the domain.
        
         | mgraczyk wrote:
         | This depends a lot on what you're doing. If you are ranking 1M
         | qps in a recommender system, then training cost will be tiny
         | compared to inference.
        
           | lajamerr wrote:
           | I wonder if there's room for model caching. Sacrifice some
           | personalization for near similar results so you aren't
           | hitting the model so often.
        
             | mgraczyk wrote:
             | Yeah we did lots of things like this at Instagram. Can be
             | very brittle and dangerous though to share any caching
             | amongst multiple users. If you work at Facebook you can
             | search for some SEVs related to this lol
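
                A minimal sketch of that caching idea, with the key scoped
                per user to avoid the cross-user sharing problem described
                above; model is a hypothetical scoring function and the
                constants are arbitrary.

                    import time

                    class PredictionCache:
                        def __init__(self, model, ttl_s=60.0, quantum=0.1):
                            self.model = model
                            self.ttl_s = ttl_s
                            self.quantum = quantum  # feature rounding step
                            self._store = {}

                        def _key(self, user_id, features):
                            # round features so near-identical requests
                            # share one cache entry
                            q = self.quantum
                            return (user_id,
                                    tuple(round(f / q) for f in features))

                        def predict(self, user_id, features):
                            key = self._key(user_id, features)
                            hit = self._store.get(key)
                            if hit and time.monotonic() - hit[1] < self.ttl_s:
                                return hit[0]
                            result = self.model(features)
                            self._store[key] = (result, time.monotonic())
                            return result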
        
         | varunkmohan wrote:
         | Agreed that there are workloads where inference is not
         | expensive, but it's really workload dependent. For applications
         | that run inference over large amounts of data in the computer
         | vision space, inference ends up being a dominant portion of the
         | spend.
        
           | PeterisP wrote:
            | The way I see it, generally every new data point (on which
            | the production model inference gets run once) becomes part of
            | the data set which then gets used to train every subsequent
            | model, so the same data point is processed many more times in
            | training; thus training unavoidably takes more effort than
            | inference.
           | 
           | Perhaps I'm a bit biased towards all kinds of self-supervised
           | or human-in-the-loop or semi-supervised models, but the
           | notion of discarding large amounts of good domain-specific
           | data that get processed _only_ for inference and not used for
           | training afterward feels a bit foreign to me, because you
            | can usually extract an advantage from it. But perhaps that's
           | the difference between data-starved domains and overwhelming-
           | data domains?
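
              A rough way to put numbers on that argument, with purely
              illustrative constants: each point is inferred once, then
              revisited by every later training run.

                  FORWARD = 1    # one forward pass, arbitrary units
                  BACKWARD = 2   # backward pass ~2x a forward pass
                  EPOCHS = 10    # epochs per training run
                  RETRAINS = 12  # later models trained on this point

                  inference_cost = FORWARD
                  training_cost = RETRAINS * EPOCHS * (FORWARD + BACKWARD)
                  print(training_cost / inference_cost)  # 360x here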
        
             | varunkmohan wrote:
             | Yup, exactly. It's a good point that for self-supervised
             | workloads, the training set can become arbitrarily large.
             | For a lot of other workloads in the vision space, most data
              | needs to be labeled before it can be used for training.
        
             | pdpi wrote:
             | There's one piece of the puzzle you're missing: field-
             | deployed devices.
             | 
             | If I play chess on my computer, the games I play locally
             | won't hit the Stockfish models. When I use the feature on
             | my phone that allows me to copy text from a picture, it
             | won't phone home with all the frames.
        
             | version_five wrote:
              | What you say re saving all data is the ideal. I'd add a
              | couple of caveats. One is that in many fields you often get
              | lots of redundant data that adds nothing to training (for
              | example, if an image classifier is looking for some rare
              | class, you can be drowning in images of the majority
              | class). Or you can just have lots of data that is
              | unambiguously and correctly classified; some kind of active
              | learning can tell you what is worth keeping.
              | 
              | The other is that for various reasons the customer doesn't
              | want to share their data (or at least doesn't want sharing
              | built into the inference system), so even if you'd like to
              | have everything they record, it's just not available.
              | Obviously something to discourage, but it seems common.
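
                A minimal sketch of the active-learning-style filter
                mentioned above: keep only the examples the current model
                is unsure about, judged here by prediction entropy;
                predict_proba is a hypothetical callable returning class
                probabilities.

                    import math

                    def worth_keeping(predict_proba, x, threshold=0.5):
                        # entropy (bits) of the predicted distribution
                        probs = predict_proba(x)
                        entropy = -sum(p * math.log2(p)
                                       for p in probs if p > 0)
                        return entropy >= threshold  # drop confident ones

                    def filter_stream(predict_proba, examples, threshold=0.5):
                        return [x for x in examples
                                if worth_keeping(predict_proba, x, threshold)]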
        
         | acchow wrote:
         | Is your inference running on some daily jobs? That's not a ton
         | of inference compared to running online for every live request
         | (10k QPS?)
        
       | sabotista wrote:
       | It depends a lot on your problem, of course.
       | 
       | Game-playing (e.g. AlphaGo) is computationally hard but the rules
       | are immutable, target functions (e.g., heuristics) don't change
       | much, and you can generate arbitrarily sized clean data sets
       | (play more games). On these problems, ML-scaling approaches work
       | very well. For business problems where the value of data decays
        | rapidly, though, you probably don't need the power of a deep or
       | complex neural net with millions of parameters, and expensive
       | specialty hardware probably isn't worth it.
        
       | scosman wrote:
       | We did a big analysis of this a few years back. We ended up using
       | a big spot-instance cluster of CPU machines for our inference
       | cluster. Much more consistently available than spot GPU, at
       | greater scale, and at better price per inference (at least at the
        | time). It scaled well to many billions of inferences. Of course,
        | compare cost per inference on your own models to make sure the
        | logic applies.
       | Article on how it worked: https://www.freecodecamp.org/news/ml-
       | armada-running-tens-of-...
       | 
       | Training was always GPUs (for speed), non-spot-instance (for
       | reliability), and cloud based (for infinite parallelism).
        | Training work tended to be chunky, so it never made sense to
        | build servers in house that would be idle some of the time and
        | queued up at other times.
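
          The comparison itself is simple once you have measured throughput
          on your own models; the numbers below are placeholders, not
          quotes.

              def cost_per_million(hourly_price_usd, inferences_per_sec):
                  # the instance's hourly price and measured throughput
                  # are the only inputs you need
                  per_hour = inferences_per_sec * 3600
                  return hourly_price_usd / per_hour * 1_000_000

              # e.g. a $0.05/hr spot CPU doing 60 inferences/sec:
              print(cost_per_million(0.05, 60))  # ~$0.23 per million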
        
         | machinekob wrote:
          | What cloud is even remotely worth it over buying 20x RTX 3090s,
          | or even some Quadros, for training? Maybe if you have a very
          | small team and small problems, but if you have CV/video tasks
          | and a team of more than 3, maybe even 2, people, in-house
          | servers are always the better choice, as you'll get your money
          | back in 2-3 months of training versus a cloud solution, and
          | maybe even more so if you wait for the RTX 4090.
          | 
          | And if you are a solo dev it's an even easier choice, as you
          | can reuse your rig for other stuff when you aren't training
          | anything (for example gaming :D).
          | 
          | The only exception is if you get a free 100k from AWS and then
          | 100k from GCP; you can live on that for a year or even two if
          | you stack both providers, but that is a special case and I'm
          | not sure how easy it is to get 100k right now.
        
           | beecafe wrote:
           | You are years behind if you think you're training a model
           | worth anything on consumer grade GPUs. Table stakes these
           | days is 8x A100 pods, and lots of them. Luckily you can just
           | get DGX pods so you don't have to build racks but for many
           | orgs just renting the pods is much cheaper.
        
             | rockemsockem wrote:
             | 300 billion parameters or GTFO, eh?
             | 
             | There is tons of value to be had from smaller models. Even
             | some state of the art results can be obtained on a
             | relatively small set of commodity GPUs. Not everything is
             | GPT-scale.
        
             | cjbgkagh wrote:
             | Years behind what? Table stakes for what? There is much
             | more to ML than the latest transformer and diffusion
             | models. While those get the attention the amount of
             | research not in that space dominates.
        
             | mushufasa wrote:
             | > You are years behind if you think you're training a model
             | worth anything on consumer grade GPUs
             | 
             | Ah yes, my code can't be useful to people unless it takes a
             | long time to compile...
        
               | skimo8 wrote:
               | To be fair, I think ML workloads are quite a bit
               | different than the days of compiling over lunch breaks.
               | 
               | What the above post was probably trying to get at is that
               | the ML specific hardware is far more efficient these days
               | than consumer GPUs.
        
             | machinekob wrote:
              | Ahh yes, because there is only one way to do deep learning,
              | and it is of course stacking models so large they aren't
              | usable outside pods of GPUs. That is surely the way to go
              | if you want to make money (from VCs, of course, because you
              | won't have many users willing to pay enough for you to ever
              | break even, as with OpenAI and other big-model providers;
              | maybe you can get some money/sponsoring from a state or a
              | university).
              | 
              | The market for small, efficient models running locally on
              | device is pretty big, maybe even the biggest that exists
              | right now [iOS, Android and macOS are pretty easy to
              | monetize with low-cost models that are useful]. I can
              | assure you of that, and you can do it on even 4x RTX 3090
              | [it won't be fast, but you'll get there :)]
        
           | scosman wrote:
           | As mentioned in the comment, ML training workloads tend to be
           | super chunky (at least in my experience). Some days we want
           | to train 50 models, some weeks we are evaluating and don't
           | need any compute.
           | 
           | I'd rather be able to spin up 200 gpus in parallel when
           | needed (yes, at a premium), but ramp to 0 when not. Data
           | scientists waiting around are more expensive than GPUs.
           | Replacing/maintaining servers is more work/money than you
           | expect. And for us the training data was cloud native, so
           | transfer/privacy/security is easier; nothing on prem, data
           | scientists can design models without having access to raw
           | data, etc.
        
             | machinekob wrote:
              | If you are a cloud-only company then for sure it is just
              | easier, but it still won't be cheaper, just more
              | convenient. If the data science team is very big, the
              | "best" solution without unlimited money is probably to run
              | locally and go to the cloud [at a premium] when you don't
              | have free resources for your teams (that was the case when
              | I was working at a pretty big EU bank, though it wasn't
              | "true" deep learning yet [about 4-5 years ago]).
        
           | varunkmohan wrote:
            | You have a good point. I think for small enough workloads,
            | self-managing instances on-prem is more cost-effective. There
            | is a simplicity gain in being able to scale instances up and
            | down in the cloud, but it may not make sense if you can
            | self-manage without too much work.
        
         | varunkmohan wrote:
         | Disclaimer: I'm the Cofounder / CEO at Exafunction
         | 
         | That's a great point. We'll be addressing this in an upcoming
         | post as well.
         | 
         | We've served workloads that run entirely on spot GPUs where it
         | makes sense since a small number of spot GPUs can make up for a
         | large amount of spot CPU capacity. The best of all worlds is if
         | you can manage both spot and on-demand instances (with a
          | preference towards spot instances). Also, for latency-sensitive
          | workloads, running on spot instances or CPUs is sometimes not
          | an option.
         | 
         | I could definitely see cases where it makes sense to run on
         | spot CPUs though.
        
           | fortysixdegrees wrote:
           | Disclaimer != Disclosure
           | 
            | Probably one of HN's most common mistakes in comments
        
             | adgjlsfhk1 wrote:
              | Perhaps, but I think "disclaimer" in this context is just
              | an abbreviation, since the disclosure carries with it the
              | implicit disclaimer of "so the things I'm saying are
              | subconsciously influenced by the fact that they could
              | potentially make me money".
        
       | synergy20 wrote:
       | I think TPU is the way to go for ML, be it training or inference.
       | 
        | We're using GPUs (some contain a TPU block inside) due to
        | 'historical reasons'. With a vector unit (x86 AVX, ARM SVE,
        | RISC-V RVV) as part of the host CPU, either putting a TPU on a
        | separate die of a chiplet or just putting it on a PCIe card will
        | handle the heavy-lifting ML jobs fine. It should be much cheaper
        | than the GPU model for ML nowadays, unless you are both a PC
        | gamer and an ML engineer.
        
       | andrewmutz wrote:
        | At training time they sure are. The only thing more expensive
        | than fancy GPUs is the ML engineers whose productivity they are
        | improving.
        
       | jacquesm wrote:
       | This is an ad.
        
       | triknomeister wrote:
       | I thought this post would be about how ASICs are probably a
       | better bet.
        
       | rfrey wrote:
       | Not related to the article, but how would one begin to become
       | smart on optimizing GPU workloads? I've been charged with
       | deploying an application that is a mixture of heuristic search
        | and inference, which has been exclusively single-user to this
        | point.
       | 
       | I'm sure every little thing I've discovered (e.g. measuring
       | cpu/gpu workloads, trying to multiplex access to the gpu, etc)
       | was probably covered in somebody's grad school notes 12 years
       | ago, but I haven't found a source of info on the topic.
        
         | pqn wrote:
         | Let's just take the topic of measuring GPU usage. This alone is
         | quite tricky -- tools like nvidia-smi will show full GPU
          | utilization even if not all SMs are running. And the workload
          | may change behavior over time, if for instance inputs to
          | transformers get longer. And then it gets even
         | more complicated to measure when considering optimizations like
         | dynamic batching. I think if you peek into some ML Ops
         | communities you can get a flavor of these nuances, but not sure
         | if there are good exhaustive guides around right now.
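
            As a starting point, the counter nvidia-smi reports can be
            sampled programmatically; a minimal sketch using the pynvml
            bindings (nvidia-ml-py), with the caveat above that "GPU
            utilization" only means some kernel was resident, not that all
            SMs were busy.

                import time
                import pynvml  # pip install nvidia-ml-py

                pynvml.nvmlInit()
                handle = pynvml.nvmlDeviceGetHandleByIndex(0)

                for _ in range(10):  # sample once a second for ~10 s
                    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
                    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
                    # util.gpu: % of time at least one kernel was running,
                    # NOT the fraction of SMs that were busy
                    print(f"gpu={util.gpu}% mem={mem.used / mem.total:.0%}")
                    time.sleep(1)

                pynvml.nvmlShutdown()

            Per-SM activity needs profiler-level tooling (DCGM or Nsight)
            rather than this counter.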
        
       | einpoklum wrote:
       | > And CPUs are so much cheaper
       | 
       | Doesn't look like it. Consumer:
       | 
       | AMD ThreadRipper 3970X: ~3000 USD on NewEgg
       | 
       | https://www.newegg.com/amd-ryzen-threadripper-2990wx/p/N82E1...
       | 
       | NVIDIA RTX 3080 Ti Founders' Edition: ~2000 USD
       | 
       | https://www.newegg.com/nvidia-900-1g133-2518-000/p/1FT-0004-...
       | 
       | For servers, a comparison is even more complicated and it
       | wouldn't be fair to just give two numbers, but I still don't
       | think GPUs are more expensive.
       | 
        | ... besides, none of that may matter if your constraint is a
        | power budget.
        
       | 37ef_ced3 wrote:
       | For small-scale transformer CPU inference you can use, e.g.,
       | Fabrice Bellard's https://bellard.org/libnc/
       | 
       | Similarly, for small-scale convolutional CPU inference, where you
       | only need to do maybe 20 ResNet-50 (batch size 1) per second per
       | CPU (cloud CPUs cost $0.015 per hour) you can use inference
       | engines designed for this purpose, e.g., https://NN-512.com
       | 
       | You can expect about 2x the performance of TensorFlow or PyTorch.
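
          Plugging in the figures quoted above (20 batch-size-1 ResNet-50
          inferences per second, $0.015 per CPU-hour):

              RATE = 20      # ResNet-50 inferences per second per CPU
              PRICE = 0.015  # USD per CPU-hour

              per_hour = RATE * 3600                 # 72,000 per hour
              cost = PRICE / per_hour * 1_000_000
              print(f"~${cost:.2f} per million inferences")  # ~$0.21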
        
         | tombert wrote:
         | Is there a thing that Fabrice Bellard hasn't built? I had no
         | idea that he was interested in something like machine learning,
         | but I guess I shouldn't have been surprised because he has
         | built every tool that I use.
        
           | mistrial9 wrote:
           | https://en.wikipedia.org/wiki/Fabrice_Bellard
        
           | [deleted]
        
       | [deleted]
        
       | [deleted]
        
       ___________________________________________________________________
       (page generated 2022-08-29 23:00 UTC)