[HN Gopher] Intel's Ponte Vecchio Xe-HPC GPU Boasts 100B Transis...
___________________________________________________________________
Intel's Ponte Vecchio Xe-HPC GPU Boasts 100B Transistors

Author : rbanffy
Score  : 59 points
Date   : 2021-03-26 09:09 UTC (1 day ago)

(HTM) web link (www.tomshardware.com)
(TXT) w3m dump (www.tomshardware.com)

| [deleted]

| barkingcat wrote:
| This will probably be a nightmare for a consumer product.
|
| Too many components from too many different sources, with Intel
| doing the "integration".
|
| Doesn't this remind anyone of the engineering philosophy of the
| Boeing 787 Dreamliner? Have individual manufacturers build
| component parts and then use just-in-time integration to put
| assembly and packaging at the end. If any individual manufacturer
| runs out of chips or components, or de-prioritizes production (for
| example, if Samsung or TSMC is ordered by Korea or Taiwan to
| specifically prioritize chips for their automotive industries),
| this could lead to shortages that cause ripples down the assembly
| line for these Xe-HPC chips.
|
| Especially in today's world, when companies like Apple are
| constantly moving toward vertical integration and bringing all
| external dependencies inward (or at least have ironclad contracts
| mandating partners satisfy their contractual duties), this move by
| Intel is in the wrong direction in the post-COVID chip-shortage era.

| wmf wrote:
| Ponte Vecchio isn't a consumer product. In fact, I've long
| predicted that they'll only manufacture enough to satisfy the
| Aurora contract.

| rincebrain wrote:
| It would be... unfortunate for Intel to do something with such
| little volume as that again after Xeon Phi.

| marcodiego wrote:
| Nice. But without benchmarks, these numbers mean nothing.

| [deleted]

| rubyn00bie wrote:
| It kind of feels like this is just Intel's marketing machine. The
| chip is less impressive than the article makes it sound; Nvidia
| _shipped_ the A100 in 2020. This Intel chip doesn't even exist in a
| production system... and the A100 is already pretty damn close,
| hitting (with FP16) 624 TFLOPS with sparsity according to Nvidia's
| documentation, which is at least as accurate as an unreleased data
| center chip from Intel:
|
| https://www.nvidia.com/en-us/data-center/a100/
|
| I'd guess that by the time Intel actually ships anything useful,
| Nvidia will have made it mostly moot.

| Veedrac wrote:
| TFLOPS-equivalent-with-sparsity is not real TFLOPS. The article
| compared to the A100's 312 TFLOPS, which is much more reasonable.

| my123 wrote:
| Intel didn't say what the TFLOPS number was about at all; we just
| don't know anything other than the headline.

| dragontamer wrote:
| We know the performance targets of the Aurora supercomputer though.
|
| The only way Intel reaches those performance targets is to outdo
| the current crop of GPUs: MI100 (from AMD) and A100 (Nvidia).
|
| Not that that is a guarantee we have a winner here, but at least we
| know the goal Intel is shooting for.

| my123 wrote:
| Given the _actual_ performance metrics that they gave for its Xe-HP
| cousin (for which Intel didn't publish any indication of FP64
| support at all), I'm inclined to believe that the 1 PF number is
| indeed some very ML-specific figure.
|
| https://cdn.mos.cms.futurecdn.net/BUsZ5EdKUcP8mWRKypTNB4-970...
|
| Excluding ML (because that's what Intel gave actual metrics on for
| Xe-HP): 41 TFLOPS FP32 with 4 dies. For comparison, an RTX 3090
| (arch whitepaper at
| https://www.nvidia.com/content/dam/en-zz/Solutions/geforce/a...)
| has 35.6 TFLOPS FP32 with a single die.
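A rough sketch of the peak-throughput arithmetic behind the numbers quoted
above. The RTX 3090 unit count and boost clock are from NVIDIA's Ampere
whitepaper; the Xe-HP, Ponte Vecchio, and A100 figures are the ones quoted
in this thread (the Intel numbers remain unverified marketing claims). The
peak_tflops helper below is illustrative, not anyone's official tool.

    # Peak FLOPS = (number of FP units) x (clock in GHz) x (FLOPs per unit per cycle)
    # An FMA (fused multiply-add) counts as 2 FLOPs per cycle per unit.

    def peak_tflops(units: int, clock_ghz: float, flops_per_cycle: int = 2) -> float:
        return units * clock_ghz * flops_per_cycle / 1e3

    rtx3090_fp32 = peak_tflops(units=10496, clock_ghz=1.695)  # ~35.6 TFLOPS
    print(f"RTX 3090 peak FP32: {rtx3090_fp32:.1f} TFLOPS")

    xe_hp_4tile_fp32   = 41.0    # Intel's own 4-die Xe-HP FP32 figure (TFLOPS)
    a100_fp16_dense    = 312.0   # A100 tensor-core FP16, no sparsity (TFLOPS)
    ponte_vecchio_fp16 = 1000.0  # the "PetaFLOP scale" claim, if it means FP16

    print(f"Xe-HP (4 dies) vs RTX 3090 (1 die): {xe_hp_4tile_fp32 / rtx3090_fp32:.2f}x")
    print(f"Claimed Ponte Vecchio FP16 vs A100 dense FP16: "
          f"{ponte_vecchio_fp16 / a100_fp16_dense:.1f}x")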
| gigatexal wrote:
| "Intel usually considers FP16 to be the optimal precision for AI,
| so when the company says that its Ponte Vecchio is a 'PetaFLOP
| scale AI computer in the palm of the hand,' this might mean that
| the GPU features about 1 PFLOPS FP16 performance, or 1,000 TFLOPS
| FP16 performance. To put the number into context, Nvidia's A100
| compute GPU provides about 312 TFLOPS FP16 performance."
|
| wow

| mrDmrTmrJ wrote:
| Manufacturing the chiplets independently appears to be an
| interesting approach to maximizing yields. If any one component has
| a defect, you just assemble using a different chiplet, instead of
| the defect scrapping the final product.
|
| Anyone know how this affects power, compute, or communication
| metrics compared to monolithic designs?
|
| Or am I off in thinking this approach maximizes yields?

| zamadatix wrote:
| See/compare AMD's CPUs from the last couple of years to Intel's:
| they use the chiplet approach, with up to 8+1 dies (eight compute
| chiplets plus an I/O die) in their Epyc CPUs, for example.

| baybal2 wrote:
| > Anyone know how this affects power, compute, or communications
| metrics compared to monolithic designs?
|
| It has an enormous effect, but everything is highly design
| specific.
|
| The die-size limits are not only yield related.
|
| Power and clock stopped scaling a few generations ago.
|
| New chips have more and more disaggregated, independent blocks
| separated by asynchronous interfaces to accommodate more clock and
| power domains.
|
| If you have to break a chip along such a domain boundary, you lose
| little in terms of speed, unlike if you cut right across registers,
| logic, and synchronous parallel links.
|
| Caches also stopped scaling, and making them bigger also makes them
| slower.
|
| Instead, more elaborate application-specific cache hierarchies are
| getting popular: L1/L2 get smaller and faster, while L3 can be
| built to one's fancy: eDRAM, standalone SRAM, stacked memory, etc.
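A back-of-the-envelope sketch of the yield argument in the chiplet
discussion above, using the standard Poisson die-yield model. The defect
density and die areas below are illustrative assumptions, not Intel's or
TSMC's actual numbers, and packaging/assembly costs are ignored.

    import math

    def die_yield(area_mm2: float, defects_per_mm2: float) -> float:
        """Poisson yield model: probability a die of this area has zero defects."""
        return math.exp(-defects_per_mm2 * area_mm2)

    D0      = 0.001   # assumed defect density, defects per mm^2 (illustrative)
    big_die = 600.0   # hypothetical monolithic die, mm^2
    chiplet = 150.0   # one of four hypothetical chiplets covering the same area

    y_mono = die_yield(big_die, D0)
    y_chip = die_yield(chiplet, D0)

    # Chiplets are tested before assembly, so a bad die is discarded on its own
    # instead of scrapping a whole 600 mm^2 product.
    mono_area_per_good    = big_die / y_mono
    chiplet_area_per_good = 4 * chiplet / y_chip

    print(f"Monolithic die yield:  {y_mono:.1%}")   # ~54.9%
    print(f"Single chiplet yield:  {y_chip:.1%}")   # ~86.1%
    print(f"Wafer area per good product, monolithic: {mono_area_per_good:.0f} mm^2")
    print(f"Wafer area per good product, chiplets:   {chiplet_area_per_good:.0f} mm^2")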
| LegitShady wrote:
| I don't think I've ever cared how many transistors were in
| something I purchased.

| varispeed wrote:
| I can't shake the feeling that buying anything with Intel today is
| like buying already obsolete technology. Have I fallen too much
| under the influence of advertising, or is that valid to an extent?
| My laptop is currently 3 years old, so I am looking for a
| replacement, and it seems like there is no point in buying anything
| right now apart from the M1, and AMD is out of stock everywhere.
| But even the latest AMD processors are not that great of an
| upgrade. So I am left with the M1, but I cannot support that
| company's politics, and my conclusion is that I am going to stick
| with my old laptop for the time being...

| wmf wrote:
| You're basically right. Tiger Lake-H and Alder Lake should catch up
| to AMD this year though.

| zokier wrote:
| Alder Lake is still only 8+8 big+small cores, while you can already
| get 16 big cores in the 5950X, with Zen 3 Threadrippers hopefully
| in the pipeline now that Milan is out. Feels like Intel has little
| to offer in competition.

| NathanielK wrote:
| 3 years isn't that old. Unless someone else is footing the bill,
| keep using what works for you. The 14nm Intel laptops haven't
| changed much in that time.
|
| Very small laptops with Intel Tiger Lake are on a level with AMD
| and Apple products. They have all the new I/O bits (PCIe 4,
| LPDDR4x, Wi-Fi 6) and low power usage on 10nm.
|
| If you want a bit more battery life or performance, or just want to
| try a fancier display, upgrading could be nice.

| xiphias2 wrote:
| I don't see Apple laptops having worse politics than other
| companies. On my iPad I feel many of the problems of the closed
| ecosystem, but M1 laptops are accessible enough for developers to
| work with (even though the hardware is sadly undocumented).

| bserge wrote:
| Performance aside, that thing looks beautiful.

| choppaface wrote:
| The Cerebras prototype was about 0.86 PFLOPS (?) for a whole wafer
| (1T transistors), so this Intel chip looks like a potentially
| viable competitor at 1 PFLOPS for only 100B transistors (even if
| just FP16). I'm sure Intel will want to chase Nvidia, but Cerebras
| is also a threat given it already has software support (TensorFlow,
| PyTorch, etc.). Maybe I'm making an unfair comparison, but it looks
| like Ponte Vecchio would put Intel just above where Cerebras was a
| couple of years ago.
|
| https://www.nextbigfuture.com/2020/11/cerebras-trillion-tran...

| aokiji wrote:
| Let's not forget all of the Intel backdoors that were exploited and
| forced us to use patched hardware with lower performance than what
| was advertised.

| caycep wrote:
| Actually, regardless of the performance of this (and perhaps this
| is orthogonal to their GPU), with the global crunch in chips/GPUs,
| would this be a natural market space for Intel to compete in,
| especially with the new foundry services? I would imagine there is
| a lot of business to be had from Nvidia/AMD for GPUs... assuming
| the mining boom holds up.

| wmf wrote:
| Intel has the same capacity shortage as everyone else, and GPUs
| actually seem pretty cheap (i.e. less profitable) given their large
| dies.

| onli wrote:
| Intel, with their own production facilities, seems to be managing
| the shortage better than everyone else. Their product may be worse,
| but their supply situation has been consistently better since
| December.

| Google234 wrote:
| Very cool! I'm looking forward to seeing how it performs.

| cs702 wrote:
| NVIDIA's hardware and software (CUDA) badly need competition in
| this space -- from Intel, from AMD, from anyone, please.
|
| If anyone at Intel is reading this, please consider releasing all
| Ponte Vecchio drivers under a permissive open-source license; it
| would facilitate and encourage faster adoption.

| dogma1138 wrote:
| Intel's oneAPI is already miles ahead of AMD's ROCm, which is
| pretty awesome.

| zepmck wrote:
| When? Where? How can it be miles ahead if the hardware has not been
| released yet?

| baybal2 wrote:
| Yes, seconding that.
|
| What's the point of using oneAPI, yet another compute API wrapper,
| to make software for just a single platform?
|
| You can just use regular computing libs and C or C++.
|
| Serious HPC will still stay with its own serious HPC stuff,
| super-optimised C and Fortran code, no matter how labour intensive
| it is.
|
| So I see very little point in it.

| dogma1138 wrote:
| oneAPI is already cross-platform through Codeplay's implementation,
| which can also run on NVIDIA GPUs; its whole point is to be an
| open, cross-platform framework that targets a wide range of
| hardware.
|
| Whether it will be successful or not is up in the air, but its
| goals are pretty solid.
| my123 wrote:
| So basically, a thing that will provide first-class capabilities
| only on Intel hardware, and won't really be optimised for maximum
| performance or expose all the underlying capabilities of the
| hardware elsewhere.

| pjmlp wrote:
| Now they need to catch up with the polyglot CUDA ecosystem.

| johnnycerberus wrote:
| I really don't get this push for polyglot programming when 99% of
| the high-performance libraries use C++. Even more, oneAPI has
| DPC++, SPIR-V has SYCL, and CUDA is even building a heterogeneous
| C++ standard library supporting both CPU and GPU, libcu++.
| Seriously now, how many people from the JVM or CLR world actually
| need this level of performance? How many actually push kernels to
| the GPU from these runtimes? I have yet to see a programming
| language that will replace C++ at what it does best. Maybe Zig,
| because it is streamlined and easier to get into, will be a true
| contender to C++ in HPC, but only time will tell.

| pjmlp wrote:
| Enough people to keep a couple of companies in business, and NVidia
| is doing collaboration projects with Microsoft and Oracle. HPC is
| not the only market for CUDA.

| bionhoward wrote:
| Whenever I hit AI limits, it's due to memory. That's why I would
| argue the future of AI is Rust, not C++. Memory efficiency matters!

| jacques_chester wrote:
| > _Seriously now, how many people from the JVM or CLR world
| actually need this level of high performance?_
|
| The big data ecosystem is Java-centric.

| johnnycerberus wrote:
| Indeed it is, but the developers in those ecosystems created
| complements like Apache Arrow, which unloads the data into a
| language-independent columnar memory format for efficient analytics
| in services that run C++ on clusters of CPUs and GPUs. Even Spark
| has recently rewritten its analytics engine in C++. These were
| created because of the limitations of the JVM. We have tried to
| move numerical processing away from C++ over the past decades, but
| we have always failed.

| jacques_chester wrote:
| You asked who in the JVM world would be interested in this kind of
| performance: that's the big data folks. To the extent that
| improvements accrue to the JVM, they accrue to that world without
| needing a rewrite into C++.

| dogma1138 wrote:
| Finance too: large exchanges with microsecond latency have their
| core systems written in Java. CME Globex and EBS/BrokerTec are
| written in Java.

| spijdar wrote:
| Sadly, that's not a very high bar to set...

| xiphias2 wrote:
| CUDA is not as important as TensorFlow, PyTorch, and JAX support at
| this point. Those frameworks are what people code against, so
| having high-quality backends for them is more important than the
| drivers themselves.
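A small sketch of the point just above: framework-level code rarely
mentions CUDA directly, so what a vendor needs is a solid backend. This
assumes PyTorch is installed; the "xpu" device is shown as a hypothetical
Intel GPU backend (exposed by Intel's PyTorch extension on systems that
have it), not something this thread confirms exists for Ponte Vecchio.

    import torch

    def pick_device() -> torch.device:
        # Only this selection logic knows about vendors at all; the model
        # code below is identical regardless of which backend is used.
        if torch.cuda.is_available():            # NVIDIA (or ROCm builds of PyTorch)
            return torch.device("cuda")
        # Hypothetical Intel GPU backend ("xpu"); falls back to CPU otherwise.
        if getattr(torch, "xpu", None) is not None and torch.xpu.is_available():
            return torch.device("xpu")
        return torch.device("cpu")

    device = pick_device()
    model = torch.nn.Linear(1024, 1024).to(device)
    x = torch.randn(64, 1024, device=device)
    y = model(x)                                  # same code on every backend
    print(device, y.shape)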
| elihu wrote:
| The oneAPI and OpenCL implementations, the Intel Graphics Compiler,
| and the Linux driver are all open source. Ponte Vecchio support
| just hasn't been publicly released yet.
|
| https://github.com/intel/compute-runtime
|
| https://github.com/intel/intel-graphics-compiler
|
| https://github.com/torvalds/linux/tree/master/drivers/gpu/dr...

| zepmck wrote:
| oneAPI is not completely open source. Support for Ponte Vecchio
| will not be released as open source, for many reasons.

| elihu wrote:
| I don't have specific knowledge of Ponte Vecchio in particular, so
| I'll defer to you if you have such info. The support for their
| mainstream GPU products is open source, though.

| nine_k wrote:
| Where to find more details?

| pjmlp wrote:
| oneAPI focuses too much on C++ (SYCL plus Intel's own stuff), while
| OpenCL is all about C.
|
| CUDA is polyglot, with very nice graphical debuggers that can even
| single-step shaders.
|
| Something the anti-CUDA crowd keeps forgetting.

| UncleOxidant wrote:
| oneAPI support in Julia: https://github.com/JuliaGPU/oneAPI.jl

| pjmlp wrote:
| Nice to know, thanks.

| dogma1138 wrote:
| CUDA's biggest advantage over OpenCL, other than not being a camel
| (designed by committee), was its C++ support, which is still the
| main language used for CUDA in production. I doubt Fortran was the
| reason CUDA got to where it is; C++, on the other hand, had quite a
| lot to do with it in the early days, when OpenCL was still stuck in
| OpenGL C-land.
|
| NVIDIA also understood early on the importance of first-party
| libraries and commercial partnerships, something Intel understands
| as well, which is why oneAPI already has wider adoption than ROCm.

| pjmlp wrote:
| CUDA supports many more languages than just C++ and Fortran.
|
| .NET, Java, Julia, Python (RAPIDS/cuDF), and Haskell don't have a
| place on oneAPI so far.
|
| And yes, going back to C++, the hardware is based on the C++11
| memory model (which was based on the Java/.NET models).
|
| So there is plenty to catch up on, besides "we can do C++".

| dragandj wrote:
| How does CUDA support any of these (.NET, Java, etc.)? It's the
| first time I've heard this claim. There are 3rd-party wrappers in
| Java, .NET, etc. that call CUDA's C++ API, and that's all.
| Equivalent APIs exist for OpenCL too...

| my123 wrote:
| The CUDA runtime takes the PTX intermediate language as its input.
|
| The toolkit ships with compilers from C++ and Fortran to NVVM, and
| provides documentation for the PTX virtual machine at
| https://docs.nvidia.com/cuda/parallel-thread-execution/index... and
| for the higher-level NVVM IR (which compiles down to PTX) at
| https://docs.nvidia.com/cuda/nvvm-ir-spec/index.html.

| navaati wrote:
| Oooh, I didn't know PTX was an intermediate representation,
| explicitly documented as such. I really thought it was the actual
| assembly run by the chips...

| my123 wrote:
| You can get the GPU-targeted assembly (sometimes called SASS by
| NVIDIA) by compiling specifically for a given GPU and then using
| nvdisasm, which also has a very terse definition of the underlying
| instruction set in the docs
| (https://docs.nvidia.com/cuda/cuda-binary-utilities/index.htm...).
|
| But it's one-way only: NVIDIA ships a disassembler, but explicitly
| doesn't ship an assembler.

| The_rationalist wrote:
| https://github.com/NVIDIA/grcuda

| dogma1138 wrote:
| There are Java and C# compilers for CUDA, such as JCUDA and
| http://www.altimesh.com/hybridizer-essentials/, but the CUDA
| runtime, libraries, and first-party compiler only support C/C++ and
| Fortran; for Python you need to use something like Numba.
|
| Most non-C++ frameworks and implementations, though, simply use
| wrappers and bindings.
|
| I'm also not aware of any high-performance lib for CUDA that wasn't
| written in C++.
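As a concrete illustration of the Numba route mentioned just above, and of
the PTX stage described earlier in the thread: a minimal sketch, assuming
numba and a CUDA-capable GPU are available. The inspect_asm() call is
Numba's hook for dumping the PTX it generated; treat the exact return
shape (a dict keyed by signature) as an assumption of this sketch.

    import numpy as np
    from numba import cuda

    @cuda.jit
    def vec_add(a, b, out):
        # One thread per element, standard CUDA-style indexing.
        i = cuda.grid(1)
        if i < out.size:
            out[i] = a[i] + b[i]

    n = 1 << 20
    a = np.random.rand(n).astype(np.float32)
    b = np.random.rand(n).astype(np.float32)
    out = np.zeros_like(a)

    threads_per_block = 256
    blocks = (n + threads_per_block - 1) // threads_per_block
    vec_add[blocks, threads_per_block](a, b, out)  # Numba handles the host/device copies

    assert np.allclose(out, a + b)

    # Numba compiles the Python kernel through LLVM/NVVM down to PTX, the same
    # intermediate language the CUDA toolkit documents; the generated PTX can
    # then be inspected.
    for sig, ptx in vec_add.inspect_asm().items():
        print(sig)
        print(ptx[:300], "...")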
| pjmlp wrote:
| "Hybridizer: High-Performance C# on GPUs"
|
| https://developer.nvidia.com/blog/hybridizer-csharp/
|
| "Simplifying GPU Access: A Polyglot Binding for GPUs with GraalVM"
|
| https://developer.nvidia.com/gtc/2020/video/s21269-vid
|
| And then you can browse for products on
| https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Cent...

| dogma1138 wrote:
| Hybridizer simply generates CUDA C++ code from C#, which is then
| compiled to PTX. It does the same for AVX, which you can then
| compile with Intel's compiler or GCC. It's not particularly good,
| you often need to debug the generated CUDA source code yourself,
| and it doesn't always play well with the CUDA programming model,
| especially its more advanced features.
|
| And again, it's a commercial product developed by a 3rd party;
| while some people use it, I wouldn't even count it as a rounding
| error when accounting for why CUDA has the market share it has.

| pjmlp wrote:
| It is like everyone arguing about C++ for AAA studios, as if
| everyone were making Crysis and Fortnite clones, while forgetting
| the legions of people making money selling A games.
|
| Or forgetting the days when games written in C were actually full
| of inline assembly.
|
| It is still CUDA, regardless of whether it goes through PTX or CUDA
| C++ as an implementation detail for the high-level code.

| my123 wrote:
| https://www.ibm.com/support/knowledgecenter/SSYKE2_8.0.0/com...
| goes a level above:
|
| "Alternatively you can let the virtual machine (VM) make this
| decision automatically by setting a system property on the command
| line. The JIT can also offload certain processing tasks based on
| performance heuristics."
|
| A lot of what ultimately limits GPUs today is that they are
| connected over a relatively slow bus (PCIe); this will change in
| the future, allowing smaller and smaller tasks to be offloaded.

| The_rationalist wrote:
| In addition, grCUDA is a breakthrough that enables interop with
| many more languages such as Ruby, R, and JS (soon Python too):
| https://github.com/NVIDIA/grcuda
___________________________________________________________________
(page generated 2021-03-27 23:00 UTC)