[HN Gopher] The cost to train an AI system is improving at 50x t...
       ___________________________________________________________________
        
       The cost to train an AI system is improving at 50x the pace of
       Moore's Law
        
       Author : kayza
       Score  : 199 points
       Date   : 2020-07-05 12:23 UTC (10 hours ago)
        
 (HTM) web link (ark-invest.com)
 (TXT) w3m dump (ark-invest.com)
        
       | lukevp wrote:
        | What are some domains in which a solo developer could build
        | something commercially compelling to capture some of this $37
        | trillion? Are there any workflows, tools or efficiencies that
        | could easily be realized as a commercial offering without
        | requiring massive man-hours to implement?
        
         | KorfmannArno wrote:
          | krisp.ai, but using the GPU (also on Mac) and with a desktop
          | version for Ubuntu Linux.
        
         | op03 wrote:
          | Extracting and selling the data stuck in the mountain ranges
          | of PDFs and other useless formats in every large corp, org
          | and govt dept on the planet.
          | 
          | Do it for a couple of publicly available docs, then contact
          | the org saying you offer 'archive digitization' so their
          | data people can mine it for intelligence.
          | 
          | Most of the time and resources of 'Digital Transformation' /
          | Data Science depts go to just manually extracting info from
          | all kinds of old docs, PDFs and spreadsheets containing
          | institutional knowledge.
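          | 
          | A minimal sketch of the extraction step, assuming the
          | pdfminer.six library and a folder of PDFs (the paths are
          | placeholders); in practice the hard part is cleaning and
          | structuring what comes out:
          | 
          |   from pathlib import Path
          |   from pdfminer.high_level import extract_text
          | 
          |   # Dump the raw text of every PDF in a folder so a data
          |   # team can mine it. "archive/" and "dump/" are placeholders.
          |   src, dst = Path("archive"), Path("dump")
          |   dst.mkdir(exist_ok=True)
          |   for pdf in src.glob("*.pdf"):
          |       text = extract_text(str(pdf))  # plain text, layout lost
          |       (dst / (pdf.stem + ".txt")).write_text(text)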
        
         | jacquesm wrote:
         | Take any domain that requires classification work that has not
         | yet been targeted and make a run for it. You likely will be
         | able to adapt one of the existing nets or even use transfer
         | learning to outperform a human. That's the low hanging fruit.
         | 
          | For instance: quality control, abnormality detection (e.g.
          | in medicine), agriculture (lots of movement there right
          | now), parts inspection, assembly inspection, sorting and so
          | on. There are more applications for this stuff than you
          | might think at first glance; essentially, if a toddler can
          | do it and it is a job right now, that's a good target.
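          | 
          | A minimal sketch of what that kind of transfer learning can
          | look like, assuming PyTorch/torchvision; the dataset path
          | and the four-class head are placeholders:
          | 
          |   import torch
          |   import torch.nn as nn
          |   from torchvision import datasets, models, transforms
          | 
          |   # ImageNet-pretrained ResNet-50; retrain only the last
          |   # layer on a small labelled dataset (e.g. part defects).
          |   model = models.resnet50(pretrained=True)
          |   for p in model.parameters():
          |       p.requires_grad = False      # freeze the backbone
          |   model.fc = nn.Linear(model.fc.in_features, 4)
          | 
          |   tfm = transforms.Compose([
          |       transforms.Resize(256),
          |       transforms.CenterCrop(224),
          |       transforms.ToTensor(),
          |   ])
          |   # "parts_photos/" is a placeholder folder-per-class dataset
          |   data = datasets.ImageFolder("parts_photos/", tfm)
          |   loader = torch.utils.data.DataLoader(
          |       data, batch_size=32, shuffle=True)
          | 
          |   opt = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
          |   loss_fn = nn.CrossEntropyLoss()
          |   for epoch in range(5):
          |       for x, y in loader:
          |           opt.zero_grad()
          |           loss_fn(model(x), y).backward()
          |           opt.step()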
        
           | yelloweyes wrote:
           | anything that's even remotely profitable is already taken
        
             | jacquesm wrote:
              | This simply isn't true. Every year since the present-day
              | ML wave started has seen more and more domains tackled.
              | Even something like that silly Lego sorting machine I
              | built could be the basis of a whole company pursuing
              | sorting technology if you set your mind to it. And that's
              | just resnet50 in disguise; likely you could do better
              | today without any effort.
              | 
              | Your statement reminds me of 'all the good domains are
              | taken', which I've been hearing since 1996 or so. Of
              | course you'll need to do some work to identify a niche
              | that doesn't have a major player in it yet. But the
              | 'boring' niches are where a lot of money is to be made;
              | the sexy stuff (cancer, fruit sorting) is well covered.
              | More obscure things are still wide open, and I get decks
              | with some regularity about new players in very
              | interesting spaces using thinly wrapped ML to do very
              | profitable things.
        
             | smabie wrote:
              | Ah yes, of course. There will never be a new profitable
              | ML startup until the end of time. Makes perfect sense.
        
           | Barrin92 wrote:
           | > abnormality detection (for instance: in medicine),
           | agriculture (lots of movement there right now), parts
           | inspection, assembly inspection, sorting and so on
           | 
            | none of these is something someone can run from their
            | bedroom, because they have very high quality and
            | regulatory requirements and require constant work outside
            | of the actual AI training.
            | 
            | This is actually reflected in the margins of "AI"
            | companies, which are significantly lower than those of
            | traditional SaaS businesses and require significantly more
            | manpower to deal with the long-tailed problems - the cases
            | where the AI fails but which actually matter.
        
             | jacquesm wrote:
             | Well, depending on the size of your bedroom ;) I've seen
             | teams of two people running fairly impressive ML based
             | stuff. They were good enough at it that they didn't remain
             | at two people for very long but that was more than enough
             | to be useful to others. One interesting company - that I'm
             | free to talk about - did a nice one on e-commerce sites to
             | help with risk management: spot fraudulent orders before
             | they ship.
             | 
              | In the long term, and to stay competitive, you will
              | always have to get out of bed and go to work. But the
              | initial push can easily be just a very small number of
              | people engaging an otherwise dormant niche.
              | 
              | Yes, medicine has regulatory requirements. But as long as
              | you _advise_ rather than diagnose, the regulatory
              | requirements drop to almost nil.
        
               | [deleted]
        
         | Isinlor wrote:
         | You need to be creative. But one example - colorizing old
         | photos: https://twitter.com/citnaj
        
         | vertak wrote:
          | You can give this article by Chip Huyen a read. Mayhaps you
          | will find a niche for a solo or small dev team. Though it is
          | focused on MLOps, if that makes a difference for the type of
          | niche you're looking for.
         | 
         | https://huyenchip.com/2020/06/22/mlops.html
        
       | calebkaiser wrote:
       | This is an odd framing.
       | 
       | Training has become much more accessible, due to a variety of
       | things (ASICs, offerings from public clouds, innovations on the
       | data science side). Comparing it to Moore's Law doesn't make any
       | sense to me, though.
       | 
       | Moore's Law is an observation on the pace of increase of a
       | tightly scoped thing, the number of transistors.
       | 
       | The cost of training a model is not a single "thing," it's a
       | cumulative effect of many things, including things as fluid as
       | cloud pricing.
       | 
       | Completely possible that I'm missing something obvious, though.
        
         | gumby wrote:
         | Like many things, Moore's law is garbled when adopted by
         | analogy outside its domain.
         | 
         | What does "more transistors" mean? To you, it means just what
         | Gordon Moore means when he said it: opportunity for more
         | function in same space/cost.
         | 
         | The laypersons, marketing grabbed the term and said it would
         | imply "faster". Which then was absurdly conflated with CPU
         | clock speed (itself an important input, though hardly the only
         | one, determining the actual speed of A system).
         | 
         | The use here is of the "garbled analogy" sort which surely is
         | the dominant use today.
        
           | bcrosby95 wrote:
            | Yes, but that aspect of Moore's Law for CPUs expired over a
            | _decade_ ago. It's the whole reason we got multicore in the
            | first place.
        
             | andrewprock wrote:
              | Even with multi-core, a CPU today is only 6x faster than
              | a 10-year-old CPU.
        
         | jessriedel wrote:
         | Ok, but achieving Moore's law has required combining an
         | enormous number of conceptually distinct technical insights.
         | Both training costs and transistor density seem like well-
         | defined single parameters that incorporate many small
         | complicated effects.
        
         | staycoolboy wrote:
          | Agreed, but Moore's Law has morphed to refer to both xtors
          | and performance, despite his original phrasing.
          | 
          | The biggest innovation I've seen is in the cloud: backplane
          | I/O and memory are essential, and up until a few years ago
          | there weren't many cloud configurations suitable for massive
          | amounts of I/O.
        
         | adrianmonk wrote:
          | > _Comparing it to Moore's Law doesn't make any sense to me,
          | though._
         | 
         | I assume it's meant as a qualitative comparison rather than a
         | meaningful quantitative one. Sort of a (sub-)cultural
         | touchstone to illustrate a point about which phase of
         | development we're in.
         | 
         | With CPUs, during the phase of consistent year after year
         | exponential growth, there were ripple effects on software. For
         | example, for a while it was cost-prohibitive to run HTTPS for
         | everything, then CPUs got faster and it wasn't anymore. So
         | during that phase, you expected all kinds of things to keep
         | changing.
         | 
         | If deep learning is in a similar phase, then whatever the
         | numbers are, we can expect other things to keep changing as a
         | result.
        
           | Const-me wrote:
           | > then CPUs got faster and it wasn't anymore
           | 
            | The enabling tech there was the AES-NI instruction set,
            | not the speed.
            | 
            | Agree on the rest. The main reason modern CPUs and GPUs
            | all have 16-bit floats is probably the deep learning trend.
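            | 
            | As a rough illustration of what those 16-bit floats buy you
            | in training throughput, a small sketch (PyTorch and a GPU
            | with fast fp16, e.g. tensor cores, assumed):
            | 
            |   import torch
            | 
            |   def bench(dtype, n=4096, iters=50):
            |       a = torch.randn(n, n, device="cuda", dtype=dtype)
            |       b = torch.randn(n, n, device="cuda", dtype=dtype)
            |       torch.cuda.synchronize()
            |       t0 = torch.cuda.Event(enable_timing=True)
            |       t1 = torch.cuda.Event(enable_timing=True)
            |       t0.record()
            |       for _ in range(iters):
            |           a @ b
            |       t1.record()
            |       torch.cuda.synchronize()
            |       return t0.elapsed_time(t1) / iters  # ms per matmul
            | 
            |   print("fp32:", bench(torch.float32), "ms")
            |   print("fp16:", bench(torch.float16), "ms")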
        
         | mtgx wrote:
          | Many people use Moore's Law to mean some kind of "law of
          | accelerating returns" - which actually is a thing, and it
          | kind of does work the way the author implies:
         | 
         | https://www.kurzweilai.net/the-law-of-accelerating-returns
        
       | seek3r00 wrote:
       | tl;dr: Training learners is becoming cheaper every year, thanks
       | to big tech companies pushing hardware and software.
        
       | anonu wrote:
       | Ark Invest are the creators of the ARKK [1] and ARKW ETFs that
       | have become retail darlings, mainly because they're heavily
       | invested in TSLA.
       | 
       | They pride themselves on this type of fundamental, bottom up
       | analysis on the market.
       | 
        | It's fine... I don't know if I agree with comparing Moore's
        | Law, which is fundamentally about hardware, with the cost to
        | run a "system", which is a combination of customized hardware
        | and new software techniques.
       | 
       | [1] https://pages.etflogic.io/?ticker=ARKK
        
         | [deleted]
        
       | gentleman11 wrote:
        | This is despite Nvidia vaguely prohibiting users from using
        | their desktop cards for machine learning in any sort of data-
        | center-like or server-like capacity. Hopefully AMD's ML
        | support / OpenCL will continue improving.
        
         | QuixoticQuibit wrote:
         | Last I saw, they don't even support ROCm on their recent Navi
         | cards, so I'd be hesitant.
        
           | Reelin wrote:
           | Wow. This is really disappointing to see.
           | (https://github.com/RadeonOpenCompute/ROCm/issues/887)
           | 
           | I guess PlaidML might be a viable option?
        
       | solidasparagus wrote:
        | ResNet-50 with DawnBench settings is a very poor choice for
        | illustrating this trend. The main technique driving this
        | reduction in cost-to-train has been finding arcane, fast
        | training schedules. This sounds good until you realize it's a
        | kind of sleight of hand: finding that schedule takes tens of
        | thousands of dollars (usually more) that isn't counted in
        | cost-to-train, but is a real-world cost you would incur if you
        | want to train models.
        | 
        | However, I think the overall trend this article talks about is
        | accurate. There has been an increased focus on cost-to-train,
        | and you can see that with models like EfficientNet, where NAS
        | is used to optimize both accuracy and model size jointly.
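        | 
        | For context, the kind of schedule being tuned here is
        | something like a one-cycle learning-rate policy; a minimal
        | sketch with PyTorch's built-in scheduler, where the model,
        | data and the max_lr value are placeholders (and finding a
        | good max_lr is exactly the search cost that goes uncounted):
        | 
        |   import torch
        | 
        |   # Placeholders standing in for a real model and data.
        |   model = torch.nn.Linear(128, 10)
        |   loader = [(torch.randn(32, 128),
        |              torch.randint(0, 10, (32,)))] * 100
        |   loss_fn = torch.nn.CrossEntropyLoss()
        | 
        |   opt = torch.optim.SGD(model.parameters(), lr=0.1,
        |                         momentum=0.9)
        |   epochs = 5
        |   sched = torch.optim.lr_scheduler.OneCycleLR(
        |       opt, max_lr=0.4,           # found by (costly) search
        |       epochs=epochs, steps_per_epoch=len(loader))
        | 
        |   for _ in range(epochs):
        |       for x, y in loader:
        |           opt.zero_grad()
        |           loss_fn(model(x), y).backward()
        |           opt.step()
        |           sched.step()           # LR warms up, then anneals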
        
         | sdenton4 wrote:
         | I would guess that this means DawnBench is basically working.
         | You'll get some "overfit" training schedule optimizations, but
         | hopefully amongst those you'll end up with some improvements
         | you can take to other models.
         | 
         | We also seem to be moving more towards a world where big
         | problem-specific models are shared (BERT, GPT), so that the
         | base time to train doesn't matter much unless you're doing
          | model architecture research. For most end-use cases in
          | language and perception, you'll end up picking up a
          | 99%-trained model and fine-tuning it on your particular
          | version of the problem.
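          | 
          | A sketch of that "pick up a mostly-trained model" workflow
          | with the Hugging Face transformers library; the model name
          | and toy data are placeholders, and exact return types vary
          | a bit between library versions:
          | 
          |   import torch
          |   from transformers import AutoTokenizer
          |   from transformers import \
          |       AutoModelForSequenceClassification
          | 
          |   # Load pretrained BERT, then fine-tune briefly on task data.
          |   name = "bert-base-uncased"
          |   tok = AutoTokenizer.from_pretrained(name)
          |   model = AutoModelForSequenceClassification.from_pretrained(
          |       name, num_labels=2)
          | 
          |   texts = ["great product", "arrived broken"]  # toy examples
          |   labels = torch.tensor([1, 0])
          |   batch = tok(texts, padding=True, truncation=True,
          |               return_tensors="pt")
          | 
          |   opt = torch.optim.AdamW(model.parameters(), lr=2e-5)
          |   for _ in range(3):            # a few fine-tuning steps
          |       out = model(**batch, labels=labels)
          |       loss = out[0]             # loss is the first output
          |       opt.zero_grad()
          |       loss.backward()
          |       opt.step()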
        
       | bra-ket wrote:
       | "AI" is not really appropriate name for what it is
        
       | gxx wrote:
        | The cost to collect the huge amounts of data needed to train
        | meaningful models is surely not improving at this rate.
        
       | m3kw9 wrote:
        | It was probably because it was very inefficient to begin with.
        
         | techbio wrote:
         | Indeed nonexistent
        
       | sktguha wrote:
        | Does it mean that the cost to train something like GPT-3 by
        | OpenAI will drop from 12 million dollars to something lower
        | next year? If so, how much will it drop to?
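        | 
        | For a back-of-the-envelope sense of scale: if the roughly
        | 10x/year cost decline implied by the article's ResNet-50
        | numbers (about $1,000 in 2017 down to about $10 in 2019)
        | carried over to giant language models - a big if - you would
        | get something like:
        | 
        |   # Hypothetical projection only: assumes the decline seen on
        |   # small benchmarks applies unchanged to GPT-3-scale runs.
        |   cost_2020 = 12e6       # reported ~$12M GPT-3 training cost
        |   annual_decline = 10    # assumed 10x cheaper per year
        |   for year in range(2020, 2025):
        |       factor = annual_decline ** (year - 2020)
        |       print(year, f"${cost_2020 / factor:,.0f}")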
        
       | ersiees wrote:
        | I would really like a thorough analysis of how expensive it is
        | to multiply large matrices, which, according to the profiler,
        | is the most expensive part of transformer training, for
        | example. Is there some Moore's-Law-like trend there?
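        | 
        | One crude way to get a number, sketched below: time a large
        | matmul, convert that to achieved FLOP/s, and price it at an
        | hourly instance rate (the $2.50/h V100 figure quoted elsewhere
        | in the thread is just an example; this runs on whatever device
        | PyTorch uses by default):
        | 
        |   import time
        |   import torch
        | 
        |   n = 8192
        |   a = torch.randn(n, n)
        |   b = torch.randn(n, n)
        |   t0 = time.time()
        |   c = a @ b
        |   elapsed = time.time() - t0
        | 
        |   flops = 2 * n ** 3             # multiply-adds in the matmul
        |   tflops = flops / elapsed / 1e12
        |   usd_per_hour = 2.50            # assumed instance price
        |   usd_per_tflop = usd_per_hour / (tflops * 3600)
        |   print(f"{tflops:.1f} TFLOP/s, ~${usd_per_tflop:.6f}/TFLOP")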
        
         | [deleted]
        
       | gchamonlive wrote:
       | I remember this article from 2018: https://medium.com/the-
       | mission/why-building-your-own-deep-le...
       | 
       | Hackernews discussion for the article:
       | https://news.ycombinator.com/item?id=18063893
       | 
        | It really is interesting how this is changing the dynamics of
        | neural network training. Now it is affordable to train a
        | useful network in the cloud, whereas 2 years ago that was
        | reserved for companies with either bigger investments or an
        | already consolidated product.
        
         | mtgp1000 wrote:
          | I trained a useful neural network and prototyped a viable
          | [failed] startup technology something like 4 years ago on a
          | 1080 Ti with a mid-range CPU. It was enough to get me
          | meetings with a couple of the largest companies in the
          | world.
          | 
          | Yeah, it took 12-24 hours to do what I could log in to AWS
          | and accomplish in minutes with parallel GPUs... but
          | practical solutions were already in reach. The primary
          | changes now are buzz and a possibly unprecedented rate of
          | research progress.
        
         | qayxc wrote:
         | > Now it is affordable to train a useful network on the cloud
         | 
          | I honestly don't see how anything changed significantly in
          | the past 2 years. Benchmarks indicate that a V100 is barely
          | 2x the performance of an RTX 2080 Ti [1], and a V100 is:
         | 
         | * $2.50/h at Google [2]
         | 
         | * $13.46/h (4xV100) at Microsoft Azure [3]
         | 
         | * $12.24/h (4xV100) at AWS [4]
         | 
         | * ~$2.80/h (2xV100, 1 month) at LeaderGPU [5]
         | 
         | * ~$3.38/h (4xV100, 1 month) at Exoscale [6]
         | 
         | Other smaller cloud providers are in a similar price range to
         | [5] and [6] (read: GCE, Azure and AWS are way overpriced...).
         | 
         | Using the 2x figure from [1] and adjusting the price for the
         | build to a 2080 Ti and an AMD R9 3950X instead of the TR
         | results in similar figures to the article you provided.
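          | 
          | Spelling that comparison out as a rough sketch (the 2080 Ti
          | build cost, lifetime and utilisation below are my own
          | assumptions, not figures from the article; the 2x factor is
          | from [1]):
          | 
          |   # Rough $ per "V100-equivalent hour": cloud rental vs.
          |   # owning a 2080 Ti counted as half a V100.
          |   cloud = {"Google 1xV100": 2.50 / 1,
          |            "AWS 4xV100": 12.24 / 4,
          |            "LeaderGPU 2xV100": 2.80 / 2}
          |   build_usd = 2500            # assumed 2080 Ti workstation
          |   hours = 2 * 365 * 24 * 0.5  # 2 years at 50% utilisation
          |   cloud["Owned 2080 Ti"] = build_usd / hours / 0.5
          |   for name, usd in sorted(cloud.items(), key=lambda x: x[1]):
          |       print(f"{name:>16}: ${usd:.2f} per V100-equiv hour")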
         | 
         | Please point me to any resources that show how the content of
         | the article doesn't apply anymore, 2 years later. I'd be very
         | interested to learn what actually changed (if anything).
         | 
         | NVIDIA's new A100 platform might be a game changer, but it's
         | not yet available in public cloud offerings.
         | 
         | [1] https://lambdalabs.com/blog/best-gpu-tensorflow-2080-ti-
         | vs-v...
         | 
         | [2] https://cloud.google.com/compute/gpus-pricing
         | 
         | [3] https://azure.microsoft.com/en-us/pricing/details/virtual-
         | ma...
         | 
         | [4] https://aws.amazon.com/ec2/pricing/on-demand/
         | 
         | [5] https://www.leadergpu.com/#chose-best
         | 
         | [6] https://www.exoscale.com/gpu/
        
           | solidasparagus wrote:
           | You are missing TPU and spot/preemptible pricing, which need
           | to be considered when we are talking about training cost. The
           | big one to me is the ability to consistently train on V100s
           | with spot pricing, which was not possible a couple of years
           | ago (there wasn't enough spare capacity). Also, the
           | improvement in cloud bandwidth for DL-type instances has
           | helped distributed training a lot.
        
           | gchamonlive wrote:
            | I don't really know if the hardware breakthroughs the
            | article refers to are already reflected in cloud GPU
            | performance, but the software advances are reflected
            | nonetheless. So even though pricing has only fluctuated
            | marginally since 2018, it is just plain faster to train a
            | neural network today because of software advances, from
            | what I understood.
        
             | qayxc wrote:
             | But that's not what the actual data says.
             | 
              | Here are some figures from an actual benchmark [1]
              | w.r.t. training costs:
             | 
             | 1. [Mar 2020] $7.43 (AlibabaCloud, 8xV100, TF v2.1)
             | 
             | 2. [Sep 2018] $12.60 (Google, 8 TPU cores, TF v1.11)
             | 
             | 3. [Mar 2020] $14.42 (AlibabaCloud, 128xV100, TF v2.1)
             | 
             | --
             | 
             | Training time didn't go down exponentially either [1]:
             | 
             | 1. [Mar 2020] 0:02:38 (AlibabaCloud, 128 x V100, TF v2.1)
             | 
             | 2. [May 2019] 0:02:43 (Huawei Cloud, 128 x V100, TF v1.13)
             | 
             | 3. [Dec 2018] 0:09:22 (Huawei Cloud, 128 x V100, MXNet)
             | 
              | So again, I have to ask where exactly these magical
              | improvements occur (regarding training - inference is
              | another matter entirely, I understand that). I've yet to
              | find a source that supports 4x to 10x cost reductions.
             | 
             | [1] https://dawn.cs.stanford.edu/benchmark/index.html
        
               | gchamonlive wrote:
                | I guess I should have been more skeptical of the
                | article's figures. But still, if we give it the
                | benefit of the doubt, is there any scenario in which
                | we might see the reduction mentioned? 1000 to 10 USD?
        
               | qayxc wrote:
               | The scenario is indeed there - if you take early 2017
               | numbers and restrict yourself to AWS/Google/Azure and
               | outdated hardware and software, you can get to the
               | US$1000 figure.
               | 
               | Likewise, if your other point of comparison is late 2019
               | AlibabaCloud spot pricing, you can get to US$10 for the
               | same task.
               | 
                | Realistically, though, that's worst-case 2017 vs best-
                | case 2019/2020. So sure, you can get to that if you
                | choose your numbers correctly.
               | 
               | They basically compared results from H/W that even in
               | 2017 was 2 generations behind with the latest H/W. So
               | yeah - between 2015 and 2019 we indeed saw a cost
               | reduction from ~1000 to ~10 USD (on the _major_ cloud
               | provider vs best offer today scale).
               | 
               | I only take issue with the assumption that the trend
               | continues this way, which it doesn't seem to.
        
           | sabalaba wrote:
           | Nothing really has changed in the last two years in terms of
           | training cost. I think the author is making unreasonable
           | extrapolations based on changes in performance on the Dawn
           | benchmarks. A lot of the results are fast but require a lot
           | more compute / search time to find the best parameters and
           | training regimen that lead to those fast convergence times.
           | (Learning rate schedule, batch size, image size schedules,
           | etc.) The point being that once the juice is squeezed out you
           | aren't going to continue to see training convergence time
           | improvements on the same hardware.
           | 
            | Also, because you cited our GPU benchmarks, I wanted to
            | throw in a mention of our GPU instances, which have some
            | of the lowest training costs on the Stanford DAWN
            | benchmarks discussed in the article.
           | 
           | https://lambdalabs.com/service/gpu-cloud
        
           | robecommerce wrote:
           | Another data point:
           | 
           | "For example, we recently internally benchmarked an
           | Inferentia instance (inf1.2xlarge) against a GPU instance
           | with an almost identical spot price (g4dn.xlarge) and found
           | that, when serving the same ResNet50 model on Cortex, the
           | Inferentia instance offered a more than 4x speedup."
           | 
           | https://towardsdatascience.com/why-every-company-will-
           | have-m...
        
             | qayxc wrote:
              | That data point talks about _inference_ though, and
              | nobody's disputing that deployment and inference have
              | improved significantly over the past few years.
              | 
              | I'm referring to training and fine-tuning, not inference,
              | which - let's be honest - can be done on a phone these
              | days.
        
       | mellosouls wrote:
        | It would be regrettable if an equivalent to the self-
        | fulfilling prophecy of Moore's "Law" (originally an astute
        | observation and forecast, but not remotely a law) became a
        | driver/limiter in this field as well, even more so if it's a
        | straight transplant for soundbite reasons rather than the
        | product of any impartial and thoughtful analysis.
        
         | kens wrote:
          | One thing I've wondered is whether Moore's Law is good or
          | bad, in the sense of how fast we should have been able to
          | improve IC technology. Was progress limited by business
          | decisions, or is this as fast as improvements could take
          | place?
         | 
         | A thought experiment: suppose we meet aliens who are remarkably
         | similar to ourselves and have an IC industry. Would they be
         | impressed by our Moore's law progress, or wonder why we took so
         | long?
        
           | NortySpock wrote:
           | https://en.wikipedia.org/wiki/Moore%27s_law, third paragraph
           | of the header, claims that Moore's Law drove targets in R&D
           | and manufacturing, but does not cite a reference for this
           | claim.
           | 
           | "Moore's prediction has been used in the semiconductor
           | industry to guide long-term planning and to set targets for
           | research and development."
        
           | imtringued wrote:
            | I'm not sure what the point of that question is. In theory
            | you could have a government subsidize construction of fabs
            | so that skipping nodes is feasible, but why on earth would
            | you do that when the industry is fully self-sufficient and
            | wildly profitable?
        
       | [deleted]
        
       ___________________________________________________________________
       (page generated 2020-07-05 23:00 UTC)