[HN Gopher] The cost to train an AI system is improving at 50x t...
___________________________________________________________________
 
The cost to train an AI system is improving at 50x the pace of
Moore's Law
 
Author : kayza
Score  : 199 points
Date   : 2020-07-05 12:23 UTC (10 hours ago)
 
(HTM) web link (ark-invest.com)
(TXT) w3m dump (ark-invest.com)
 
  | lukevp wrote:
  | What are some domains in which a solo developer could build
  | something commercially compelling to capture some of this $37
  | trillion? Are there any workflows or tools or efficiencies that
  | could be easily realized as a commercial offering and that would
  | not require massive man-hours to implement?
 
  | KorfmannArno wrote:
  | krisp.ai, but using the GPU (also on Mac) and with a desktop
  | version for Ubuntu Linux.
 
  | op03 wrote:
  | Extracting and selling the data stuck in the mountain ranges of
  | PDFs and other useless formats in every large corp, org, and
  | govt dept on the planet.
  | 
  | Do it for a couple of publicly available docs, then contact the
  | org saying you offer 'archive digitization' so their data ppl
  | can mine it for intelligence.
  | 
  | Most of the time and resources of 'Digital Transformation'/Data
  | Science depts go to just manually extracting info from all kinds
  | of old docs, PDFs, and spreadsheets containing institutional
  | knowledge.
 
  | jacquesm wrote:
  | Take any domain that requires classification work and has not
  | yet been targeted, and make a run for it. You will likely be
  | able to adapt one of the existing nets, or even use transfer
  | learning, to outperform a human. That's the low-hanging fruit.
  | 
  | For instance: quality control, abnormality detection (for
  | instance: in medicine), agriculture (lots of movement there
  | right now), parts inspection, assembly inspection, sorting and
  | so on. There are more applications for this stuff than you might
  | think at first glance. Essentially, if a toddler can do it and
  | it is a job right now, that's a good target.
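 
A minimal sketch of the transfer-learning approach described in the
comment above, assuming PyTorch and torchvision (the class count,
hyperparameters, and train_loader are illustrative placeholders, not
anything from the thread):
 
    # Reuse a pretrained ResNet-50; retrain only a new classification
    # head for a narrow task (e.g. sorting or part inspection).
    import torch
    import torch.nn as nn
    from torchvision import models
 
    model = models.resnet50(pretrained=True)       # ImageNet weights
    for param in model.parameters():
        param.requires_grad = False                # freeze the backbone
 
    num_classes = 4                                # hypothetical categories
    model.fc = nn.Linear(model.fc.in_features, num_classes)
 
    optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()
 
    for images, labels in train_loader:            # your labeled data
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()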
  | yelloweyes wrote:
  | Anything that's even remotely profitable is already taken.
 
  | jacquesm wrote:
  | This simply isn't true. Every year since the present-day ML wave
  | started has seen more and more domains tackled. Even something
  | like that silly Lego sorting machine I built could be the basis
  | of a whole company pursuing sorting technology, if you set your
  | mind to it. And that's just resnet50 in disguise; you could
  | likely do better today without any effort.
  | 
  | Your statement reminds me of 'all the good domains are taken',
  | which I've been hearing since 1996 or so. Of course you'll need
  | to do some work to identify a niche that doesn't have a major
  | player in it yet. But the 'boring' niches are where a lot of
  | money is to be made; the sexy stuff (cancer, fruit sorting) is
  | well covered. More obscure things are still wide open - I get
  | decks with some regularity about new players in very interesting
  | spaces using thinly wrapped ML to do very profitable things.
 
  | smabie wrote:
  | Ah yes, of course. There will never be a new profitable ML
  | startup until the end of time. Makes perfect sense.
 
  | Barrin92 wrote:
  | > abnormality detection (for instance: in medicine), agriculture
  | (lots of movement there right now), parts inspection, assembly
  | inspection, sorting and so on
  | 
  | None of these is anything someone can run from their bedroom,
  | because they have very high quality and regulatory requirements
  | and require constant work outside of the actual AI training.
  | 
  | This is actually reflected in the margins of "AI" companies,
  | which are significantly lower than those of traditional SaaS
  | businesses; they also require significantly more manpower to
  | deal with the long-tailed problems, which are where the AI fails
  | but are what actually matters.
 
  | jacquesm wrote:
  | Well, depending on the size of your bedroom ;) I've seen teams
  | of two people running fairly impressive ML-based stuff. They
  | were good enough at it that they didn't remain at two people for
  | very long, but that was more than enough to be useful to others.
  | One interesting company - one that I'm free to talk about - did
  | a nice one on e-commerce sites to help with risk management:
  | spotting fraudulent orders before they ship.
  | 
  | In the long term, and to stay competitive, you will always have
  | to get out of bed and go to work. But the initial push can
  | easily be just a very low number of people engaging an otherwise
  | dormant niche.
  | 
  | Yes, medicine has regulatory requirements. But as long as you
  | _advise_ rather than diagnose, the regulatory requirements drop
  | to almost nil.
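 
A toy sketch of the kind of order-screening classifier such a small
team might start from, assuming scikit-learn; the features, data, and
threshold below are invented for illustration only:
 
    # Score e-commerce orders for fraud risk and queue the riskiest
    # for human review (the "advise, don't diagnose" pattern).
    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split
 
    # Placeholder features per order: amount, account age (days),
    # shipping/billing country mismatch, orders in the last 24h.
    rng = np.random.default_rng(0)
    X = rng.random((5000, 4))
    y = (X[:, 0] > 0.9).astype(int)        # stand-in fraud labels
 
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    clf = GradientBoostingClassifier().fit(X_tr, y_tr)
 
    risk = clf.predict_proba(X_te)[:, 1]
    review_queue = np.flatnonzero(risk > 0.8)
    print(f"{len(review_queue)} of {len(X_te)} orders flagged")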
  | [deleted]
 
  | Isinlor wrote:
  | You need to be creative. But one example - colorizing old
  | photos: https://twitter.com/citnaj
 
  | vertak wrote:
  | You can give this article by Chip Huyen a read. Mayhaps you will
  | find a niche for a solo or small dev team. It is focused on
  | MLOps, though, if that makes a difference for the type of niche
  | you're looking for.
  | 
  | https://huyenchip.com/2020/06/22/mlops.html
 
  | calebkaiser wrote:
  | This is an odd framing.
  | 
  | Training has become much more accessible, due to a variety of
  | things (ASICs, offerings from public clouds, innovations on the
  | data science side). Comparing it to Moore's Law doesn't make any
  | sense to me, though.
  | 
  | Moore's Law is an observation on the pace of increase of a
  | tightly scoped thing: the number of transistors.
  | 
  | The cost of training a model is not a single "thing"; it's the
  | cumulative effect of many things, including things as fluid as
  | cloud pricing.
  | 
  | It's completely possible that I'm missing something obvious,
  | though.
 
  | gumby wrote:
  | Like many things, Moore's law gets garbled when adopted by
  | analogy outside its domain.
  | 
  | What does "more transistors" mean? To you, it means just what
  | Gordon Moore meant when he said it: the opportunity for more
  | function in the same space/cost.
  | 
  | Laypersons and marketing grabbed the term and said it implied
  | "faster", which was then absurdly conflated with CPU clock speed
  | (itself an important input, though hardly the only one,
  | determining the actual speed of a system).
  | 
  | The use here is of the "garbled analogy" sort, which is surely
  | the dominant use today.
 
  | bcrosby95 wrote:
  | Yes, but that aspect of Moore's law for CPUs expired over a
  | _decade_ ago. It's the whole reason we got multicore in the
  | first place.
 
  | andrewprock wrote:
  | Even with multi-core, a CPU today is only about 6x faster than a
  | 10-year-old CPU.
 
  | jessriedel wrote:
  | OK, but achieving Moore's law has required combining an enormous
  | number of conceptually distinct technical insights. Both
  | training costs and transistor density seem like well-defined
  | single parameters that incorporate many small, complicated
  | effects.
 
  | staycoolboy wrote:
  | Agreed, but Moore's Law has morphed to refer to both transistor
  | counts and performance, despite his original phrasing.
  | 
  | The biggest innovation I've seen is in the cloud: backplane I/O
  | and memory are essential, and up until a few years ago there
  | weren't many cloud configurations suitable for massive amounts
  | of I/O.
 
  | adrianmonk wrote:
  | > _Comparing it to Moore's Law doesn't make any sense to me,
  | though._
  | 
  | I assume it's meant as a qualitative comparison rather than a
  | meaningful quantitative one. Sort of a (sub-)cultural touchstone
  | to illustrate a point about which phase of development we're in.
  | 
  | With CPUs, during the phase of consistent year-after-year
  | exponential growth, there were ripple effects on software. For
  | example, for a while it was cost-prohibitive to run HTTPS for
  | everything; then CPUs got faster and it wasn't anymore. So
  | during that phase, you expected all kinds of things to keep
  | changing.
  | 
  | If deep learning is in a similar phase, then whatever the
  | numbers are, we can expect other things to keep changing as a
  | result.
 
  | Const-me wrote:
  | > then CPUs got faster and it wasn't anymore
  | 
  | The enabling tech was the AES-NI instruction set, not raw speed.
  | 
  | Agree on the rest. The main reason why modern CPUs and GPUs all
  | have 16-bit floats is probably the deep learning trend.
 
  | mtgx wrote:
  | Many people use Moore's Law to mean some kind of "law of
  | accelerating returns" - which actually is a thing, and it kind
  | of does work the way the author implies:
  | 
  | https://www.kurzweilai.net/the-law-of-accelerating-returns
 
  | seek3r00 wrote:
  | tl;dr: Training learners is becoming cheaper every year, thanks
  | to big tech companies pushing hardware and software.
 
  | anonu wrote:
  | Ark Invest are the creators of the ARKK [1] and ARKW ETFs, which
  | have become retail darlings, mainly because they're heavily
  | invested in TSLA.
  | 
  | They pride themselves on this type of fundamental, bottom-up
  | analysis of the market.
  | 
  | It's fine... I just don't know if I agree with comparing Moore's
  | law, which is fundamentally about hardware, with the cost to run
  | a "system" that is a combination of customized hardware and new
  | software techniques.
  | 
  | [1] https://pages.etflogic.io/?ticker=ARKK
 
  | [deleted]
 
  | gentleman11 wrote:
  | Despite Nvidia vaguely prohibiting users from using their
  | desktop cards for machine learning in any sort of data-center-
  | like or server-like capacity. Hopefully AMD's ML support /
  | OpenCL will continue improving.
 
  | QuixoticQuibit wrote:
  | Last I saw, they don't even support ROCm on their recent Navi
  | cards, so I'd be hesitant.
 
  | Reelin wrote:
  | Wow. This is really disappointing to see.
  | (https://github.com/RadeonOpenCompute/ROCm/issues/887)
  | 
  | I guess PlaidML might be a viable option?
 
  | solidasparagus wrote:
  | Resnet-50 with DAWNBench settings is a very poor choice for
  | illustrating this trend. The main technique driving this
  | reduction in cost-to-train has been finding arcane, fast
  | training schedules. This sounds good until you realize it's a
  | kind of sleight of hand: finding that schedule takes tens of
  | thousands of dollars (usually more) that isn't counted in the
  | cost-to-train, but it is a real-world cost you would incur if
  | you wanted to train models.
  | 
  | However, I think the overall trend this article talks about is
  | accurate. There has been an increased focus on cost-to-train,
  | and you can see that with models like EfficientNet, where NAS is
  | used to optimize both accuracy and model size jointly.
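 
For reference, the fast DAWNBench entries typically lean on schedules
like the one-cycle learning-rate policy. A sketch assuming PyTorch
(model, train_loader, and every hyperparameter here are placeholders;
finding values that actually converge this fast is the hidden search
cost described above):
 
    import torch
    import torch.nn.functional as F
 
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=5e-4)
    # Ramp the LR up and back down once over a short, aggressive run.
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer,
        max_lr=1.0,                        # found by (expensive) search
        steps_per_epoch=len(train_loader),
        epochs=18)
 
    for epoch in range(18):
        for images, labels in train_loader:
            optimizer.zero_grad()
            loss = F.cross_entropy(model(images), labels)
            loss.backward()
            optimizer.step()
            scheduler.step()               # LR updated every batch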
  | sdenton4 wrote:
  | I would guess that this means DAWNBench is basically working.
  | You'll get some "overfit" training-schedule optimizations, but
  | hopefully amongst those you'll end up with some improvements you
  | can take to other models.
  | 
  | We also seem to be moving more towards a world where big
  | problem-specific models are shared (BERT, GPT), so that the base
  | time to train doesn't matter much unless you're doing model
  | architecture research. For most end-use cases in language and
  | perception, you'll end up picking up a 99%-trained model and
  | fine-tuning it on your particular version of the problem.
 
  | bra-ket wrote:
  | "AI" is not really an appropriate name for what it is.
 
  | gxx wrote:
  | The cost to collect the huge amounts of data needed to train
  | meaningful models is surely not improving at this rate.
 
  | m3kw9 wrote:
  | That's probably because it was very inefficient to begin with.
 
  | techbio wrote:
  | Indeed nonexistent.
 
  | sktguha wrote:
  | Does this mean that the cost to train something like GPT-3 by
  | OpenAI will drop from 12 million dollars to less next year? If
  | so, how much will it drop to?
 
  | ersiees wrote:
  | I would really like a thorough analysis of how expensive it is
  | to multiply large matrices, which is the most expensive part of
  | transformer training, for example, according to the profiler. Is
  | there some Moore's-law-like trend there?
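 
A back-of-the-envelope answer to the question above: a dense (m x k)
by (k x n) multiply costs about 2*m*k*n floating-point operations, so
its cost trend is essentially FLOPs-per-dollar plus wins from lower
precision. A small worked example; the matrix sizes are illustrative
and the peak figures are NVIDIA's published fp16 tensor-core numbers
(real utilization runs well below peak):
 
    # Time for one large transformer-style matmul at peak throughput.
    m, k, n = 2048, 12288, 12288     # e.g. tokens x hidden x hidden
    flops = 2 * m * k * n            # ~6.2e11 floating-point ops
 
    peaks = {"V100": 125e12, "A100": 312e12}   # fp16 tensor FLOP/s
    for name, peak in peaks.items():
        print(f"{name}: {flops / peak * 1e3:.2f} ms at peak")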
  | [deleted]
 
  | gchamonlive wrote:
  | I remember this article from 2018:
  | https://medium.com/the-mission/why-building-your-own-deep-le...
  | 
  | Hacker News discussion of the article:
  | https://news.ycombinator.com/item?id=18063893
  | 
  | It really is interesting how this is changing the dynamics of
  | neural-network training. Now it is affordable to train a useful
  | network on the cloud, whereas 2 years ago that would have been
  | reserved for companies with either bigger investments or an
  | already consolidated product.
 
  | mtgp1000 wrote:
  | I trained a useful neural network and prototyped a viable
  | [failed] startup technology something like 4 years ago on a 1080
  | Ti with a mid-range CPU. It was enough to get me meetings with a
  | couple of the largest companies in the world.
  | 
  | Yeah, it took 12-24 hours to do what I could log in to AWS and
  | accomplish in minutes with parallel GPUs... but practical
  | solutions were already in reach. The primary changes now are
  | buzz and a possibly unprecedented rate of research progress.
 
  | qayxc wrote:
  | > Now it is affordable to train a useful network on the cloud
  | 
  | I honestly don't see how anything has changed significantly in
  | the past 2 years. Benchmarks indicate that a V100 is barely 2x
  | the performance of an RTX 2080 Ti [1], and a V100 is
  | 
  | * $2.50/h at Google [2]
  | 
  | * $13.46/h (4xV100) at Microsoft Azure [3]
  | 
  | * $12.24/h (4xV100) at AWS [4]
  | 
  | * ~$2.80/h (2xV100, 1 month) at LeaderGPU [5]
  | 
  | * ~$3.38/h (4xV100, 1 month) at Exoscale [6]
  | 
  | Other, smaller cloud providers are in a similar price range to
  | [5] and [6] (read: GCE, Azure, and AWS are way overpriced...).
  | 
  | Using the 2x figure from [1] and adjusting the price for the
  | build to a 2080 Ti and an AMD R9 3950X instead of the
  | Threadripper results in figures similar to those in the article
  | you provided.
  | 
  | Please point me to any resources showing that the content of the
  | article doesn't apply anymore, 2 years later. I'd be very
  | interested to learn what actually changed (if anything).
  | 
  | NVIDIA's new A100 platform might be a game changer, but it's not
  | yet available in public cloud offerings.
  | 
  | [1] https://lambdalabs.com/blog/best-gpu-tensorflow-2080-ti-vs-v...
  | 
  | [2] https://cloud.google.com/compute/gpus-pricing
  | 
  | [3] https://azure.microsoft.com/en-us/pricing/details/virtual-ma...
  | 
  | [4] https://aws.amazon.com/ec2/pricing/on-demand/
  | 
  | [5] https://www.leadergpu.com/#chose-best
  | 
  | [6] https://www.exoscale.com/gpu/
 
  | solidasparagus wrote:
  | You are missing TPUs and spot/preemptible pricing, which need to
  | be considered when we are talking about training cost. The big
  | one to me is the ability to consistently train on V100s with
  | spot pricing, which was not possible a couple of years ago
  | (there wasn't enough spare capacity). Also, the improvement in
  | cloud bandwidth for DL-type instances has helped distributed
  | training a lot.
 
  | gchamonlive wrote:
  | I don't really know whether the hardware breakthroughs the
  | article refers to are already reflected in cloud GPU
  | performance, but software advances are reflected nonetheless. So
  | even though pricing has fluctuated only marginally since 2018,
  | it is just plain faster to train a neural network today because
  | of software advances, from what I understood.
 
  | qayxc wrote:
  | But that's not what the actual data says.
  | 
  | Here are some figures from an actual benchmark [1] w.r.t.
  | training costs:
  | 
  | 1. [Mar 2020] $7.43 (AlibabaCloud, 8xV100, TF v2.1)
  | 
  | 2. [Sep 2018] $12.60 (Google, 8 TPU cores, TF v1.11)
  | 
  | 3. [Mar 2020] $14.42 (AlibabaCloud, 128xV100, TF v2.1)
  | 
  | --
  | 
  | Training time didn't go down exponentially either [1]:
  | 
  | 1. [Mar 2020] 0:02:38 (AlibabaCloud, 128 x V100, TF v2.1)
  | 
  | 2. [May 2019] 0:02:43 (Huawei Cloud, 128 x V100, TF v1.13)
  | 
  | 3. [Dec 2018] 0:09:22 (Huawei Cloud, 128 x V100, MXNet)
  | 
  | So again I have to ask: where exactly do these magical
  | improvements occur? (Regarding training, that is - inference is
  | another matter entirely, I understand that.) I've yet to find a
  | source that supports 4x to 10x cost reductions.
  | 
  | [1] https://dawn.cs.stanford.edu/benchmark/index.html
 
  | gchamonlive wrote:
  | I guess I should have been more skeptical of the article's
  | figures. But still, if we give it the benefit of the doubt, is
  | there any scenario in which we might see the reduction
  | mentioned? 1000 to 10 USD?
 
  | qayxc wrote:
  | The scenario is indeed there - if you take early 2017 numbers
  | and restrict yourself to AWS/Google/Azure and outdated hardware
  | and software, you can get to the US$1000 figure.
  | 
  | Likewise, if your other point of comparison is late 2019
  | AlibabaCloud spot pricing, you can get to US$10 for the same
  | task.
  | 
  | Realistically, though, that's worst-case 2017 vs best-case
  | 2019/2020. So sure, you can get there if you choose your numbers
  | carefully.
  | 
  | They basically compared results from hardware that even in 2017
  | was 2 generations behind against the latest hardware. So yeah -
  | between 2015 and 2019 we did indeed see a cost reduction from
  | ~1000 to ~10 USD (on a _major cloud provider then_ vs _best
  | offer today_ scale).
  | 
  | I only take issue with the assumption that the trend continues
  | this way, which it doesn't seem to.
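 
Taking the endpoints in this subthread at face value, the implied
rate is easy to check (a sketch; the $1000 and $10 figures and the
year spans come from the comment above, and Moore's law is read as a
doubling every two years):
 
    import math
 
    def cost_halving_time(cost_start, cost_end, years):
        annual = (cost_start / cost_end) ** (1 / years)
        return math.log(2) / math.log(annual)   # years per halving
 
    for label, years in [("2017->2020", 3), ("2015->2019", 4)]:
        t = cost_halving_time(1000, 10, years)
        print(f"{label}: cost halves every {t:.2f} yr "
              f"(~{2 / t:.1f}x Moore's 2-yr pace)")
 
On these numbers, the cost halves every ~0.45-0.6 years, i.e. roughly
3-4x Moore's pace rather than 50x, which is consistent with the
skepticism above about extrapolating the trend.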
  | sabalaba wrote:
  | Nothing has really changed in the last two years in terms of
  | training cost. I think the author is making unreasonable
  | extrapolations based on changes in performance on the DAWNBench
  | benchmarks. A lot of the results are fast but require a lot more
  | compute / search time to find the best parameters and training
  | regimen that lead to those fast convergence times (learning-rate
  | schedule, batch size, image-size schedules, etc.). The point
  | being that once the juice is squeezed out, you aren't going to
  | continue to see training convergence-time improvements on the
  | same hardware.
  | 
  | Also, because you cited our GPU benchmarks, I wanted to throw in
  | a mention of our GPU instances, which have some of the lowest
  | training costs on the Stanford DAWNBench benchmarks discussed in
  | the article.
  | 
  | https://lambdalabs.com/service/gpu-cloud
 
  | robecommerce wrote:
  | Another data point:
  | 
  | "For example, we recently internally benchmarked an Inferentia
  | instance (inf1.2xlarge) against a GPU instance with an almost
  | identical spot price (g4dn.xlarge) and found that, when serving
  | the same ResNet50 model on Cortex, the Inferentia instance
  | offered a more than 4x speedup."
  | 
  | https://towardsdatascience.com/why-every-company-will-have-m...
 
  | qayxc wrote:
  | That data point talks about _inference_ , though, and nobody is
  | disputing that deployment and inference have improved
  | significantly over the past years.
  | 
  | I'm referring to training and fine-tuning, not inference, which
  | - let's be honest - can be done on a phone these days.
 
  | mellosouls wrote:
  | It would be regrettable if an equivalent of the self-fulfilling
  | prophecy of Moore's "Law" (originally an astute observation and
  | forecast, but not remotely a law) became a driver/limiter in
  | this field as well - even more so if it's a straight transplant
  | made for soundbite reasons rather than through any impartial and
  | thoughtful analysis.
 
  | kens wrote:
  | One thing I've wondered is whether Moore's Law is good or bad,
  | in the sense of how fast we should have been able to improve IC
  | technology. Was progress limited by business decisions, or was
  | this as fast as improvements could take place?
  | 
  | A thought experiment: suppose we meet aliens who are remarkably
  | similar to ourselves and have an IC industry. Would they be
  | impressed by our Moore's-law progress, or wonder why we took so
  | long?
 
  | NortySpock wrote:
  | https://en.wikipedia.org/wiki/Moore%27s_law, third paragraph of
  | the header, claims that Moore's Law drove targets in R&D and
  | manufacturing, but does not cite a reference for this claim:
  | 
  | "Moore's prediction has been used in the semiconductor industry
  | to guide long-term planning and to set targets for research and
  | development."
 
  | imtringued wrote:
  | I'm not sure what the point of that question is. In theory you
  | could have a government subsidize the construction of fabs so
  | that skipping nodes is feasible, but why on earth would you do
  | that when the industry is fully self-sufficient and wildly
  | profitable?
 
  | [deleted]
___________________________________________________________________
(page generated 2020-07-05 23:00 UTC)