[HN Gopher] TPU v4 provides exaFLOPS-scale ML with efficiency gains
___________________________________________________________________
TPU v4 provides exaFLOPS-scale ML with efficiency gains
Author : zekrioca
Score  : 53 points
Date   : 2023-04-05 20:52 UTC (2 hours ago)
(HTM) web link (cloud.google.com)
(TXT) w3m dump (cloud.google.com)
| reaperman wrote:
| I am very impressed with what Google has done for the state of
| machine learning infrastructure. I'm looking forward to future
| models based on OpenXLA which can run between Nvidia, Apple
| Silicon, and Google's TPUs. My main limiter to using TPUs more
| often is model compatibility. The TPU hardware is clearly the
| very best, just not always cost-effective for those of us who
| are starved of available engineering hours. OpenXLA may fix
| this if it lives up to its promise.
|
| That said, it's also incredible how fast things move in this
| space:
|
| > Midjourney, one of the leading text-to-image AI startups,
| have been using Cloud TPU v4 to train their state-of-the-art
| model, coincidentally also called "version four".
|
| Midjourney is already on v5 as of the date of publication of
| this press release.
| xiphias2 wrote:
| "Midjourney, one of the leading text-to-image AI startups, have
| been using Cloud TPU v4 to train their state-of-the-art model,
| coincidentally also called 'version four'"
|
| This sounds quite bad in a press release when Midjourney is at
| v5. Why did they move away?
| sebzim4500 wrote:
| Sounds like they are just out of date; it is possible MJ v5 was
| also trained on TPUs.
| obblekk wrote:
| This is very impressive technology and engineering.
|
| However, I remain a bit skeptical of the business case for TPUs
| for 3 core reasons:
|
| 1) 100000x lower unit production volume than GPUs means higher
| unit costs
|
| 2) Slow iteration cycle - these TPU v4s were launched in 2020.
| Maybe Google publishes one gen behind, but that would still be
| a 2-3 year iteration cycle from v3 to v4.
|
| 3) Constant multiple advantage over GPUs - maybe a 5-10x
| compute advantage over an off-the-shelf GPU, and that number
| isn't increasing with each generation.
|
| It's cool to get that 5-10x performance over GPUs, but that's
| 4.5 yrs of Moore's Law, and might already be offset today by
| GPUs' unit cost advantages.
|
| If the TPU architecture did something to allow fundamentally
| faster transistor density scaling, its advantage over GPUs
| would increase each year and become unbeatable. But based on
| the TPUv3 to TPUv4 perf improvement over 3 years, it doesn't
| seem so.
|
| Apple's competing approach seems a bit more promising from a
| business perspective. The M1 unifies memory, reducing the time
| required to move data and switch between CPU and GPU
| processing. This allows advances in GPUs to continue scaling
| independently, while decreasing the user-experience cost of
| using GPUs.
|
| Apple's version also seems to scale from 8GB RAM to 128GB,
| meaning the same fundamental process can be used at high
| volume, achieving a low unit cost.
|
| Are there other interesting hardware approaches for ML out
| there?
| sliken wrote:
| > 100000x lower unit production volume than GPUs means higher
| unit costs
|
| Two points. Nvidia's RTX 3000 series (3060 Ti, 3080, and many
| other flavors) ships 6 or more flavors per generation. The
| related silicon has names like the GA102, GA103, GA104, GA106,
| and GA107. So only 1/6th of the consumer market for Nvidia
| silicon can be amortized over any single design.
|
| I wouldn't be at all surprised to see Google making the TPUs by
| the million. I found a vague reference to 9 exaflops and single
| facilities (one of many) costing $4 billion to $8 billion.
|
| So I wouldn't assume that the consumer GPU market/number of
| silicon designs is 100,000 times larger than the TPUv4 market.
|
| > Slow iteration cycle
|
| True. Then again, generations make much less difference than
| they used to.
| Gone are the days when, even after multiple generations,
| average performance increases by 2x. Sure, Nvidia's 4000 series
| claims 2x ... on raytracing. But normal game performance seems
| to be more like 15%. Sure, various trickery like DLSS helps,
| but similar tricks are increasing the performance of older
| cards as well. Similarly, Apple's A14 -> A15 -> A16 (or M1 ->
| M2 if you prefer) chips have had modest performance increases
| and mostly have increases in perf/watt.
|
| > 4.5 yrs of Moore's Law
|
| It's dead, Jim.
| pclmulqdq wrote:
| I believe that Nvidia uses the same chip from the A16 up to the
| A100, and maybe for some of the Quadro chips. That easily puts
| the unit count into the several millions.
|
| The picture in this article shows 8 racks, each with (according
| to the paper on arXiv) 16 TPU sleds, 4 TPUs per sled. That's
| only 512 chips. According to the paper, it is one of eight in a
| 4096-chip supercomputer. If you give them 10-100 of those
| around the world, you get 40,000-400,000 chips. That's enough
| for reasonable scale. Nvidia should still have 100x (or more)
| their scale.
| kccqzy wrote:
| > Are there other interesting hardware approaches for ML out
| there?
|
| Google also has Coral, which is a non-cloud, mobile-focused TPU
| that you can buy and plug in (USB or PCIe).
|
| https://coral.ai/products/
| 1MachineElf wrote:
| Just received my mSATA Coral TPU in the mail. I'd been waiting
| 11 months for it after backordering on Digikey. Perhaps this
| speaks to the parent comment's concerns over unit volume and
| iteration cycle? Hopefully that will improve in the future and
| modules like these will become widespread.
| cubefox wrote:
| > If the TPU architecture did something to allow fundamentally
| faster transistor density scaling, its advantage over GPUs
| would increase each year and become unbeatable.
|
| It is completely unreasonable to expect something like that.
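[Editor's note] The back-of-envelope numbers in the comments above (the Moore's Law equivalent of a 5-10x speedup, and the chip-count estimate of 512 chips pictured out of a 4096-chip pod) can be sketched as follows. The 2-year doubling period and the 10-100 pod guess are the thread's own assumptions, not official figures:

```python
from math import log2

def moores_law_years(speedup, doubling_period=2.0):
    """Years of Moore's Law doubling equivalent to a given speedup."""
    return log2(speedup) * doubling_period

# A 5-10x TPU-over-GPU advantage, assuming a 2-year doubling period:
print(round(moores_law_years(5), 1))   # 4.6
print(round(moores_law_years(10), 1))  # 6.6

# Chip-count estimate from the comment above:
chips_pictured = 8 * 16 * 4        # racks x sleds/rack x TPUs/sled = 512
pod = 4096                         # chips per supercomputer, per the paper
low, high = 10 * pod, 100 * pod    # guessed 10-100 pods worldwide
print(chips_pictured, low, high)   # 512 40960 409600
```

With an 18-month doubling period instead, the same 5-10x advantage works out to roughly 3.5-5 years, which is where the "4.5 yrs of Moore's Law" figure lands.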
| sebzim4500 wrote:
| > 100000x lower unit production volume than GPUs
|
| This is obviously an exaggeration. I wonder what the actual
| ratio is between TPU production and e.g. A100 production.
| summerlight wrote:
| Probably closer to 1000x? I see a fairly large number of TPU
| pods these days, and I don't think the A100 is as prevalent as
| high-end consumer GPUs, which are typically measured in
| millions, not billions.
| tinco wrote:
| They're so non-confrontational. Their performance comparisons
| are against "CPU". Just come out and say it, even if it's not
| apples to apples. If the 3D-torus interconnect is so much
| better, just say how it compares to Nvidia's latest and
| greatest. It's cool that Midjourney committed to building on
| TPU, but I have a hard time betting my company on a technology
| that's so guarded that they won't even post a benchmark against
| their main competitor.
| jeffbee wrote:
| The paper compares it to the A100.
|
| https://arxiv.org/pdf/2304.01433.pdf
| KeplerBoy wrote:
| I refuse to care about them until they sell them on PCIe cards.
|
| The lock-in is bad enough when dealing with niche hardware
| on-prem; I certainly won't deal with niche hardware in the
| cloud.
| TradingPlaces wrote:
| If Google can't become the king of AI cloud training, they
| should all just quit.
___________________________________________________________________
(page generated 2023-04-05 23:00 UTC)