[HN Gopher] TPU v4 provides exaFLOPS-scale ML with efficiency gains
       ___________________________________________________________________
        
       TPU v4 provides exaFLOPS-scale ML with efficiency gains
        
       Author : zekrioca
       Score  : 53 points
       Date   : 2023-04-05 20:52 UTC (2 hours ago)
        
 (HTM) web link (cloud.google.com)
 (TXT) w3m dump (cloud.google.com)
        
       | reaperman wrote:
        | I am very impressed with what Google has done for the state of
        | machine learning infrastructure. I'm looking forward to future
        | models based on OpenXLA which can run across Nvidia, Apple
        | Silicon, and Google's TPUs. My main limiter to using TPUs more
        | often is model compatibility. The TPU hardware is clearly the
        | very best, just not always cost-effective for those of us who are
        | starved of available engineering hours. OpenXLA may fix this if
        | it lives up to its promise.
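        | 
        | Roughly what that portability looks like in practice (a minimal
        | sketch with toy shapes, not anything from the article): the same
        | jit-compiled JAX function is lowered through XLA to whichever
        | backend happens to be present.
        | 
        |     import jax
        |     import jax.numpy as jnp
        | 
        |     @jax.jit
        |     def predict(params, x):
        |         # toy two-layer MLP; shapes are illustrative only
        |         h = jnp.tanh(x @ params["w1"] + params["b1"])
        |         return h @ params["w2"] + params["b2"]
        | 
        |     key = jax.random.PRNGKey(0)
        |     params = {"w1": jax.random.normal(key, (4, 8)),
        |               "b1": jnp.zeros(8),
        |               "w2": jax.random.normal(key, (8, 2)),
        |               "b2": jnp.zeros(2)}
        | 
        |     print(jax.devices())  # whichever CPU/GPU/TPU devices exist
        |     print(predict(params, jnp.ones((1, 4))).shape)  # (1, 2)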
       | 
       | That said, it's also incredible how fast things move in this
       | space:
       | 
       | > Midjourney, one of the leading text-to-image AI startups, have
       | been using Cloud TPU v4 to train their state-of-the-art model,
       | coincidentally also called "version four".
       | 
       | Midjourney is already on v5 as of the date of publication of this
       | press release.
        
       | xiphias2 wrote:
       | ,,Midjourney, one of the leading text-to-image AI startups, have
       | been using Cloud TPU v4 to train their state-of-the-art model,
       | coincidentally also called "version four''
       | 
       | This sounds quite bad in a press release when Midjourney is at
       | v5. Why did they move away?
        
         | sebzim4500 wrote:
          | Sounds like the press release is just out of date; it is
          | possible MJ v5 was also trained on TPUs.
        
       | obblekk wrote:
       | This is very impressive technology and engineering.
       | 
       | However, I remain a bit skeptical of the business case for TPUs
       | for 3 core reasons:
       | 
       | 1) 100000x lower unit production volume than GPUs means higher
       | unit costs
       | 
       | 2) Slow iteration cycle - these TPUv4 were launched in 2020.
       | Maybe Google publishes one gen behind, but that would still be a
       | 2-3 year iteration cycle from v3 to v4.
       | 
        | 3) Constant multiple advantage over GPUs - maybe a 5-10x compute
        | advantage over off-the-shelf GPUs, and that number isn't
        | increasing with each generation.
       | 
        | It's cool to get that 5-10x performance over GPUs, but that's
        | about 4.5 years of Moore's Law, and it might already be offset
        | today by GPUs' unit cost advantages.
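        | 
        | Back-of-the-envelope for that figure (a sketch assuming an
        | 18-month doubling period, which is itself generous these days):
        | 
        |     import math
        | 
        |     def years_of_moores_law(speedup, months_per_doubling=18):
        |         # doublings needed for the speedup, in years
        |         return math.log2(speedup) * months_per_doubling / 12
        | 
        |     print(years_of_moores_law(5))   # ~3.5 years
        |     print(years_of_moores_law(10))  # ~5.0 years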
       | 
        | If the TPU architecture did something to allow fundamentally
        | faster transistor density scaling, its advantage over GPUs would
        | increase each year and become unbeatable. But based on the TPUv3
        | to TPUv4 perf improvement over 3 years, it doesn't seem so.
       | 
        | Apple's competing approach seems a bit more promising from a
        | business perspective. The M1 unifies memory, reducing the
        | overhead of moving data and switching between CPU and GPU
        | processing. This allows advances in GPUs to continue scaling
        | independently, while decreasing the user-experience cost of
        | using GPUs.
        | 
        | Apple's design also seems to scale from 8 GB of RAM to 128 GB,
        | meaning the same fundamental design can be used at high volume,
        | achieving a low unit cost.
        | 
        | Are there other interesting hardware-for-ML approaches out there?
        
         | sliken wrote:
         | > 100000x lower unit production volume than GPUs means higher
         | unit costs
         | 
          | Two points. First, Nvidia's RTX 3000 series (3060 Ti, 3080,
          | and many other flavors) ships on six or more distinct dies per
          | generation; the related silicon has names like GA102, GA103,
          | GA104, GA106, and GA107. So only about 1/6th of the consumer
          | market for Nvidia silicon can be amortized over any single
          | design.
         | 
          | Second, I wouldn't be at all surprised to see Google making
          | TPUs by the million. I found a vague reference to 9 exaflops
          | and single facilities (one of many) costing $4 billion to $8
          | billion.
         | 
         | So I wouldn't assume that the consumer GPU market/number of
         | silicon designs is 100,000 times larger than the TPUv4 market.
         | 
         | > Slow iteration cycle
         | 
          | True. Then again, generations make much less difference than
          | they used to. Gone are the days when even multiple generations
          | would double average performance. Sure, Nvidia's 4000 series
          | claims 2x ... on ray tracing, but normal game performance
          | seems to be more like 15% better. Various trickery like DLSS
          | helps, but similar tricks are increasing the performance of
          | older cards as well. Similarly, Apple's A14 -> A15 -> A16 (or
          | M1 -> M2 if you prefer) chips have had modest performance
          | increases and mostly deliver gains in perf/watt.
         | 
          | > 4.5 years of Moore's Law
         | 
         | It's dead Jim.
        
           | pclmulqdq wrote:
           | I believe that Nvidia uses the same chip from A16 up to the
           | A100, and maybe for some of the Quadro chips. That easily
           | puts the unit count into the several millions.
           | 
            | The picture in this article shows 8 racks, each of which
            | (according to the paper on arXiv) has 16 TPU sleds with 4
            | TPUs per sled. That's only 512 chips. According to the
            | paper, it is one of eight such blocks in a 4096-chip
            | supercomputer. If you give them 10-100 of those around the
            | world, you get 40,000-400,000 chips. That's enough for
            | reasonable scale. Nvidia should still have 100x (or more)
            | their scale.
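            | 
            | The arithmetic behind those numbers, taking the rack and
            | sled counts above at face value:
            | 
            |     chips = 8 * 16 * 4    # racks * sleds * chips per sled
            |     pod = chips * 8       # the 4096-chip supercomputer
            |     print(chips, pod * 10, pod * 100)  # 512 40960 409600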
        
         | kccqzy wrote:
          | > Are there other interesting hardware-for-ML approaches out
          | there?
         | 
          | Google also has Coral, a non-cloud, edge-focused TPU that you
          | can buy and plug in (USB or PCIe).
         | 
         | https://coral.ai/products/
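          | 
          | Getting one of those running looks roughly like this (a sketch
          | following Coral's TensorFlow Lite docs; the model path is a
          | placeholder and "libedgetpu.so.1" is the Linux delegate name):
          | 
          |     import numpy as np
          |     import tflite_runtime.interpreter as tflite
          | 
          |     # route an Edge TPU-compiled model through the delegate
          |     interpreter = tflite.Interpreter(
          |         model_path="model_edgetpu.tflite",
          |         experimental_delegates=[
          |             tflite.load_delegate("libedgetpu.so.1")])
          |     interpreter.allocate_tensors()
          | 
          |     inp = interpreter.get_input_details()[0]
          |     dummy = np.zeros(inp["shape"], dtype=inp["dtype"])
          |     interpreter.set_tensor(inp["index"], dummy)
          |     interpreter.invoke()
          |     out = interpreter.get_output_details()[0]
          |     print(interpreter.get_tensor(out["index"]).shape)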
        
           | 1MachineElf wrote:
           | Just received my mSATA Coral TPU in the mail. I'd been
           | waiting 11 months for it after backordering on Digikey.
           | Perhaps this speaks to the parent comment's concerns over
           | unit volume and iteration cycle? Hopefully that will improve
           | in the future and modules like these will become widespread.
        
         | cubefox wrote:
         | > If the TPU architecture did something to allow fundamentally
          | faster transistor density scaling, its advantage over GPUs
         | would increase each year and become unbeatable.
         | 
         | It is completely unreasonable to expect something like that.
        
         | sebzim4500 wrote:
         | >100000x lower unit production volume than GPUs
         | 
          | This is obviously an exaggeration; I wonder what the actual
          | ratio is between TPU production and e.g. A100 production.
        
           | summerlight wrote:
            | Probably closer to 1000x? I see a fairly large number of TPU
            | pods these days, and I don't think the A100 is as prevalent
            | as high-end consumer GPUs, which are typically measured in
            | millions, not billions.
        
       | tinco wrote:
        | They're so non-confrontational. Their performance comparisons
        | are against "CPU". Just come out and say it, even if it's not
        | apples to apples. If the 3D torus interconnect is so much better,
        | just say how it compares to Nvidia's latest and greatest. It's
        | cool that Midjourney committed to building on TPU, but I have a
        | hard time betting my company on a technology that's so guarded
        | that they won't even post a benchmark against their main
        | competitor.
        
         | jeffbee wrote:
         | The paper compares it to A100.
         | 
         | https://arxiv.org/pdf/2304.01433.pdf
        
       | KeplerBoy wrote:
        | I refuse to care about them until they sell them on PCIe cards.
        | 
        | The lock-in is bad enough when dealing with niche hardware on-
        | prem; I certainly won't deal with niche hardware in the cloud.
        
       | TradingPlaces wrote:
       | If Google can't become the king of AI cloud training, they should
       | all just quit.
        
       ___________________________________________________________________
       (page generated 2023-04-05 23:00 UTC)