[HN Gopher] Intel's Ponte Vecchio Xe-HPC GPU Boasts 100B Transis...
       ___________________________________________________________________
        
       Intel's Ponte Vecchio Xe-HPC GPU Boasts 100B Transistors
        
       Author : rbanffy
       Score  : 59 points
       Date   : 2021-03-26 09:09 UTC (1 day ago)
        
 (HTM) web link (www.tomshardware.com)
 (TXT) w3m dump (www.tomshardware.com)
        
       | [deleted]
        
       | barkingcat wrote:
       | This will probably be a nightmare for a consumer product.
       | 
       | Too many components from too many different sources, with intel
       | doing the "integration".
       | 
       | Doesn't this remind anyone of the engineering philosophy of the
       | Boeing 787 Dreamliner? Have individual manufacturers build
       | component parts, then use just-in-time integration to put
       | assembly and packaging at the end. If any individual manufacturer
       | runs out of chips or components, or de-prioritizes production
       | (for example, if Samsung or TSMC is ordered by Korea or Taiwan to
       | specifically prioritize chips for their automotive industries),
       | this could lead to shortages that cause ripples down the assembly
       | line for these Xe-HPC chips.
       | 
       | Especially in today's world, when companies like Apple are
       | constantly moving toward vertical integration, bringing all
       | external dependencies inward (or at least having ironclad
       | contracts mandating partners satisfy their contractual duties),
       | this move by Intel is in the wrong direction in the post-Covid
       | chip-shortage era.
        
         | wmf wrote:
         | Ponte Vecchio isn't a consumer product. In fact, I've long
         | predicted that they'll only manufacture enough to satisfy the
         | Aurora contract.
        
           | rincebrain wrote:
           | It would be...unfortunate for Intel to do something with
           | so little volume again after Xeon Phi.
        
       | marcodiego wrote:
       | Nice. But without benchmarks, these numbers mean nothing.
        
       | [deleted]
        
       | rubyn00bie wrote:
       | It kind of feels like this is just Intel's marketing machine.
       | The chip is less impressive than the article makes it sound;
       | Nvidia _shipped_ the A100 in 2020. This Intel chip doesn't even
       | exist in a production system... and the A100 is already pretty
       | damn close, hitting 624 TF (FP16, with sparsity) according to
       | Nvidia's documentation, which is at least as accurate as an
       | unreleased data center chip from Intel:
       | 
       | https://www.nvidia.com/en-us/data-center/a100/
       | 
       | I'd guess that by the time Intel actually ships anything
       | useful, Nvidia will have made it mostly moot.
        
         | Veedrac wrote:
         | TFLOPS-equivalent-with-sparsity is not real TFLOPS. The
         | article compared against the A100's 312 TFLOPS, which is much
         | more reasonable.
        
           | my123 wrote:
           | Intel didn't say what the TFLOPS number refers to at all;
           | we just don't know anything beyond the headline figure.
        
             | dragontamer wrote:
             | We know the performance targets of the Aurora supercomputer
             | though.
             | 
             | The only way Intel reaches those performance targets is to
             | outdo the current crop of GPUs: MI100 (from AMD) and A100
             | (NVidia).
             | 
             | Not that that guarantees we have a winner here, but at
             | least we know what Intel is shooting for.
        
         | my123 wrote:
         | Given the _actual_ performance metrics they gave for its Xe
         | HP cousin (for which Intel didn't publish any indication of
         | FP64 support at all), I'm inclined to believe that the 1 PF
         | number is indeed some very ML-specific figure.
         | 
         | https://cdn.mos.cms.futurecdn.net/BUsZ5EdKUcP8mWRKypTNB4-970...
         | 
         | When excluding ML... (because that's what Intel gave actual
         | metrics for on Xe-HP)
         | 
         | 41 TFLOPS FP32 with 4 dies. For comparison, an RTX 3090 (arch
         | whitepaper at https://www.nvidia.com/content/dam/en-
         | zz/Solutions/geforce/a...) has 35.6 TFLOPS FP32, with a single
         | die.
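         | 
         | Per die that works out to about 41 / 4 = 10.25 TFLOPS FP32,
         | assuming the four-tile figure scales linearly, versus 35.6
         | TFLOPS for the 3090's single GA102 die.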
        
       | gigatexal wrote:
       | "Intel usually considers FP16 to be the optimal precision for AI,
       | so when the company says that that its Ponte Vecchio is a
       | 'PetaFLOP scale AI computer in the palm of the hand,' this might
       | mean that that the GPU features about 1 PFLOPS FP16 performance,
       | or 1,000 TFLOPS FP16 performance. To put the number into context,
       | Nvidia's A100 compute GPU provides about 312 TFLOPS FP16
       | performance. "
       | 
       | wow
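       | 
       | If that petaflop figure really is dense FP16, it works out to
       | roughly 1,000 / 312, or about 3.2x the A100's quoted FP16
       | throughput - though it could just as easily be a sparsity or
       | lower-precision number.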
        
       | mrDmrTmrJ wrote:
       | Manufacturing the chiplets independently appears to be an
       | interesting approach to maximizing yields. If any one component
       | has a defect, you just assemble with a different chiplet, rather
       | than the defect compromising the final product.
       | 
       | Anyone know how this affects power, compute, or communications
       | metrics compared to monolithic designs?
       | 
       | Or am I off in thinking this approach maximizes yields?
        
         | zamadatix wrote:
         | See/compare AMD's CPUs from the last couple of years to
         | Intel's - they use the chiplet approach with up to 8+1
         | chiplets in their Epyc CPUs, for example.
        
         | baybal2 wrote:
         | > Anyone know how this affects power, compute, or
         | communications metrics compared to monolithic designs?
         | 
         | It affects them enormously, but everything is highly
         | design-specific.
         | 
         | The die size limits are not only yield related.
         | 
         | Power and clock stopped scaling a few generations ago.
         | 
         | New chips have more and more disaggregated, independent
         | blocks separated by asynchronous interfaces, to accommodate
         | more clock and power domains.
         | 
         | If you have to break a chip along such a domain boundary, you
         | lose little in terms of speed, unlike if you cut right across
         | registers, logic, and synchronous parallel links.
         | 
         | Caches have stopped scaling too, and making them bigger also
         | makes them slower.
         | 
         | Instead, more elaborate application-specific cache
         | hierarchies are getting popular. L1 and L2 get smaller and
         | faster, but L3 can be built however one fancies: eDRAM,
         | standalone SRAM, stacked memory, etc.
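         | 
         | On the yield side specifically, a rough illustration with the
         | classic Poisson model and purely made-up numbers: at a defect
         | density of 0.1 defects/cm^2, a single 600 mm^2 die yields
         | about e^-0.6 ~ 55%, while four 150 mm^2 chiplets each yield
         | about e^-0.15 ~ 86%, and the defective ones can be discarded
         | before packaging.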
        
       | LegitShady wrote:
       | I don't think I've ever cared how many transistors were in
       | something I purchased.
        
       | varispeed wrote:
       | I can't shake the feeling that buying anything with Intel today
       | is like buying already obsolete technology. Have I fallen too
       | much under the influence of advertising etc., or is that valid
       | to an extent? My laptop is currently 3 years old, so I am
       | looking for a replacement, and it seems like there is no point
       | buying anything right now apart from the M1, and AMD is out of
       | stock everywhere. But even the latest AMD processors are not
       | that great of an upgrade. So I am left with the M1, but I cannot
       | support that company's politics, and my conclusion is that I am
       | going to stick with my old laptop for the time being...
        
         | wmf wrote:
         | You're basically right. Tiger Lake-H and Alder Lake should
         | catch up to AMD this year though.
        
           | zokier wrote:
           | Alder Lake is still only 8+8 big+small cores, while you
           | can already get 16 big cores in the 5950X, with Zen 3
           | Threadrippers hopefully in the pipeline now that Milan is
           | out. It feels like Intel has little to offer in competition.
        
         | NathanielK wrote:
         | 3 years isn't that old. Unless someone else is footing the
         | bill, keep using what works for you. The 14nm Intel laptops
         | haven't changed much in that time.
         | 
         | Very small laptops with Intel Tiger Lake are on a level with
         | AMD and Apple products. They have all the new I/O bits (PCIe
         | 4, LPDDR4x, Wi-Fi 6) and low power usage on 10nm.
         | 
         | If you want a bit more battery life or performance, or just
         | want to try a fancier display, upgrading could be nice.
        
         | xiphias2 wrote:
         | I don't see Apple laptops as having worse politics than
         | other companies. On my iPad I feel a lot of the problems of
         | the closed ecosystem, but M1 laptops are accessible enough
         | for developers to work with (even though the hardware is
         | sadly undocumented).
        
       | bserge wrote:
       | Performance aside, that thing looks beautiful
        
       | choppaface wrote:
       | The Cerebras prototype was about 0.86 PFLOPS (?) for a whole
       | wafer (1T transistors), so this Intel chip looks like a
       | potentially viable competitor at 1 PFLOPS for only 100B
       | transistors (even if just FP16). I'm sure Intel will want to
       | chase Nvidia, but Cerebras is also a threat given it already has
       | software support (TensorFlow, PyTorch, etc.). Maybe I'm making
       | an unfair comparison, but it looks like Ponte Vecchio would put
       | Intel just above where Cerebras was a couple of years ago.
       | 
       | https://www.nextbigfuture.com/2020/11/cerebras-trillion-tran...
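       | 
       | On those numbers alone, that is roughly ten times the throughput
       | per transistor (1 PFLOPS / 100B vs 0.86 PFLOPS / ~1T), though
       | the precisions behind the two figures are probably not directly
       | comparable.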
        
       | aokiji wrote:
       | Let's not forget all of the Intel backdoors that were exploited
       | and forced us to use patched hardware with lower performance
       | than what was advertised.
        
       | caycep wrote:
       | Actually - regardless of how this performs, and perhaps this is
       | orthogonal to their GPU - with the global crunch in chips/GPUs,
       | would this be a natural market for Intel to compete in,
       | especially with the new foundry services? I would imagine there
       | is a lot of business to be had from Nvidia/AMD for GPUs...
       | assuming the mining boom holds up.
        
         | wmf wrote:
         | Intel has the same capacity shortage as everyone else and GPUs
         | actually seem pretty cheap (i.e. less profitable) given their
         | large dies.
        
           | onli wrote:
           | Intel with their own production facilities seems to manage
           | the shortage better than everyone else. Their product may be
           | worse, but their supply situation has been consistently
           | better since December.
        
       | Google234 wrote:
       | Very cool! I'm looking forward to seeing how it performs.
        
       | cs702 wrote:
       | NVIDIA's hardware and software (CUDA) badly need competition in
       | this space -- from Intel, from AMD, from anyone, please.
       | 
       | If anyone at Intel is reading this, please consider releasing all
       | Ponte Vecchio drivers under a permissive open-source license; it
       | would facilitate and encourage faster adoption.
        
         | dogma1138 wrote:
         | Intel's OneAPI is already miles ahead of AMD's ROCm, which is
         | pretty awesome.
        
           | zepmck wrote:
           | When? Where? How can it be miles ahead if the hardware has
           | not been released yet?
        
             | baybal2 wrote:
             | Yes, seconding that.
             | 
             | What's the point of using OneAPI, yet another compute
             | API wrapper, to make software for just a single platform?
             | 
             | You can just use regular compute libs, and C or C++.
             | 
             | Serious HPC will still stick with its own serious HPC
             | stuff - super-optimised C and Fortran code - no matter
             | how labour-intensive it is.
             | 
             | So I see very little point in it.
        
               | dogma1138 wrote:
               | OneAPI is already cross-platform through Codeplay's
               | implementation, which can also run on NVIDIA GPUs; its
               | whole point is to be an open, cross-platform framework
               | that targets a wide range of hardware.
               | 
               | Whether it will be successful or not is up in the air,
               | but its goals are pretty solid.
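               | 
               | To make that concrete, here is a minimal SYCL 2020
               | vector-add sketch of the kind DPC++/oneAPI compiles
               | (standard SYCL API only, nothing Intel-specific; which
               | device runs it - an Intel GPU, an NVIDIA GPU via the
               | Codeplay plugin, or the CPU - is just whatever the
               | default queue picks):
               | 
               |     #include <sycl/sycl.hpp>
               |     #include <vector>
               |     
               |     int main() {
               |       sycl::queue q;  // default device selection
               |       std::vector<float> a(1024, 1.0f),
               |                          b(1024, 2.0f), c(1024);
               |       {
               |         // Buffers write back to the vectors when
               |         // they go out of scope.
               |         sycl::buffer<float> A(a), B(b), C(c);
               |         q.submit([&](sycl::handler& h) {
               |           sycl::accessor x(A, h, sycl::read_only);
               |           sycl::accessor y(B, h, sycl::read_only);
               |           sycl::accessor z(C, h, sycl::write_only);
               |           h.parallel_for(sycl::range<1>(1024),
               |                          [=](sycl::id<1> i) {
               |                            z[i] = x[i] + y[i];
               |                          });
               |         });
               |       }
               |       return c[0] == 3.0f ? 0 : 1;
               |     }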
        
               | my123 wrote:
               | So basically, a thing that will provide first-class
               | capabilities only on Intel hardware, and won't really
               | be optimised for maximum performance or expose all the
               | underlying capabilities of the hardware elsewhere.
        
           | pjmlp wrote:
           | Now they need to catch up with the polyglot CUDA ecosystem.
        
             | johnnycerberus wrote:
             | I really don't get this push for polyglot programming
             | when 99% of the high-performance libraries use C++.
             | What's more, oneAPI has DPC++, SPIR-V has SYCL, and CUDA
             | is even building a heterogeneous C++ standard library
             | supporting both CPU and GPU, libcu++. Seriously now, how
             | many people from the JVM or CLR world actually need this
             | level of performance? How many actually push kernels to
             | the GPU from these runtimes? I have yet to see a
             | programming language that will replace C++ at what it
             | does best. Maybe Zig, because it is streamlined and
             | easier to get into, will be a true contender to C++ for
             | HPC, but only time will tell.
        
               | pjmlp wrote:
               | Enough people to keep a couple of companies in
               | business, and to keep NVidia doing collaboration
               | projects with Microsoft and Oracle; HPC is not the only
               | market for CUDA.
        
               | bionhoward wrote:
               | Whenever I hit AI limits, it's due to memory. That's why
               | I would argue the future of AI is Rust, not C++. Memory
               | efficiency matters!
        
               | jacques_chester wrote:
               | > _Seriously now, how many people from JVM or CLR world
               | actually need this level of high performance?_
               | 
               | The big data ecosystem is Java-centric.
        
               | johnnycerberus wrote:
               | Indeed it is, but the developers in these ecosystems
               | created complements like Apache Arrow, which unloads
               | the data into a language-independent columnar memory
               | format for efficient analytics in services that run C++
               | on clusters of CPUs and GPUs. Even Spark has recently
               | rewritten its analytics engine in C++. These were
               | created because of the limitations of the JVM. We have
               | tried to move numerical processing away from C++ over
               | the past decades, but we have always failed.
        
               | jacques_chester wrote:
               | You asked who in the JVM world would be interested in
               | this kind of performance: that's big data folks. To the
               | extent that improvements accrue to the JVM they accrue to
               | that world without needing to rewrite into C++.
        
               | dogma1138 wrote:
               | Finance too: large exchanges with microsecond latency
               | have their core systems written in Java; CME Globex and
               | EBS/BrokerTec are written in Java.
        
           | spijdar wrote:
           | Sadly, that's not a very high bar to set...
        
         | xiphias2 wrote:
         | CUDA is not as important as TensorFlow, PyTorch, and JAX
         | support at this point. Those frameworks are what people code
         | against, so having high-quality backends for them is more
         | important than the drivers themselves.
        
         | elihu wrote:
         | The One-API and OpenCL implementations, the Intel Graphics
         | Compiler, and the Linux driver are all open source. Ponte
         | Vecchio support just hasn't been publicly released yet.
         | 
         | https://github.com/intel/compute-runtime
         | 
         | https://github.com/intel/intel-graphics-compiler
         | 
         | https://github.com/torvalds/linux/tree/master/drivers/gpu/dr...
        
           | zepmck wrote:
           | One-API is not completely open source. Support for Ponte
           | Vecchio will not be released as open source, for many
           | reasons.
        
             | elihu wrote:
             | I don't have specific knowledge of Ponte Vecchio in
             | particular, so I'll defer to you if you have such info. The
             | support for their mainstream GPU products is open source,
             | though.
        
             | nine_k wrote:
             | Where can one find more details?
        
           | pjmlp wrote:
           | One-API focuses too much on C++ (SYCL plus Intel's own
           | stuff), while OpenCL is all about C.
           | 
           | CUDA is polyglot, with very nice graphical debuggers that
           | can even single-step shaders.
           | 
           | Something the anti-CUDA crowd keeps forgetting.
        
             | UncleOxidant wrote:
             | oneAPI support in Julia:
             | https://github.com/JuliaGPU/oneAPI.jl
        
               | pjmlp wrote:
               | Nice to know, thanks.
        
             | dogma1138 wrote:
             | CUDA's biggest advantage over OpenCL, other than not
             | being a camel, was its C++ support, which is still the
             | main language used for CUDA in production. I doubt
             | FORTRAN was the reason CUDA got to where it is; C++, on
             | the other hand, had quite a lot to do with it in the
             | early days, when OpenCL was still stuck in OpenGL C-land.
             | 
             | NVIDIA also understood early on the importance of first-
             | party libraries and commercial partnerships, something
             | Intel understands as well, which is why OneAPI already
             | has wider adoption than ROCm.
        
               | pjmlp wrote:
               | CUDA supports many more languages than just C++ and
               | Fortran.
               | 
               | .NET, Java, Julia, Python (RAPIDS/cuDF), and Haskell
               | have no place on OneAPI so far.
               | 
               | And yes, going back to C++, the hardware is based on
               | the C++11 memory model (which was itself based on the
               | Java/.NET models).
               | 
               | So there is plenty of stuff to catch up on, besides "we
               | can do C++".
        
               | dragandj wrote:
               | How does CUDA support any of these (.NET, Java, etc.)?
               | It's the first time I've heard this claim. There are
               | 3rd-party wrappers in Java, .NET, etc. that call CUDA's
               | C++ API, and that's all. Equivalent APIs exist for
               | OpenCL too...
        
               | my123 wrote:
               | The CUDA runtime takes the PTX intermediate language
               | as input.
               | 
               | The toolkit ships with compilers from C++ and Fortran
               | to NVVM, and provides documentation about the PTX
               | virtual machine at https://docs.nvidia.com/cuda/parallel-
               | thread-execution/index... and about the higher-level
               | NVVM (which compiles down to PTX) at
               | https://docs.nvidia.com/cuda/nvvm-ir-spec/index.html.
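               | 
               | As a minimal illustration of that flow (standard driver
               | API calls only, error checks omitted; the file name
               | "kernel.ptx" and kernel name "vec_add" are made-up
               | placeholders, with the PTX produced beforehand by e.g.
               | nvcc -ptx kernel.cu):
               | 
               |     #include <cuda.h>
               |     
               |     int main() {
               |       cuInit(0);
               |       CUdevice dev;
               |       cuDeviceGet(&dev, 0);
               |       CUcontext ctx;
               |       cuCtxCreate(&ctx, 0, dev);
               |     
               |       // Load the PTX and look up a kernel in it.
               |       CUmodule mod;
               |       cuModuleLoad(&mod, "kernel.ptx");
               |       CUfunction fn;
               |       cuModuleGetFunction(&fn, mod, "vec_add");
               |     
               |       // ...allocate buffers with cuMemAlloc, then
               |       // launch with cuLaunchKernel(fn, grid, 1, 1,
               |       //                block, 1, 1, 0, 0, args, 0);
               |     
               |       cuModuleUnload(mod);
               |       cuCtxDestroy(ctx);
               |       return 0;
               |     }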
        
               | navaati wrote:
               | Oooh, I didn't know PTX was an intermediate
               | representation and explicitly documented as such; I
               | really thought it was the actual assembly run by the
               | chips...
        
               | my123 wrote:
               | You can get the GPU-targeted assembly (sometimes
               | called SASS by NVIDIA) by compiling specifically for a
               | given GPU and then using nvdisasm, which also has a
               | very terse definition of the underlying instruction set
               | in the docs
               | (https://docs.nvidia.com/cuda/cuda-binary-
               | utilities/index.htm...).
               | 
               | But it's one-way only: NVIDIA ships a disassembler,
               | but explicitly doesn't ship an assembler.
        
               | The_rationalist wrote:
               | https://github.com/NVIDIA/grcuda
        
               | dogma1138 wrote:
               | There are Java and C# compilers for CUDA, such as
               | JCUDA and http://www.altimesh.com/hybridizer-essentials/
               | but the CUDA runtime, the libraries, and the first-
               | party compiler only support C/C++ and FORTRAN; for
               | Python you need to use something like Numba.
               | 
               | Most non-C++ frameworks and implementations, though,
               | simply use wrappers and bindings.
               | 
               | I'm also not aware of any high-performance lib for CUDA
               | that wasn't written in C++.
        
               | pjmlp wrote:
               | "Hybridizer: High-Performance C# on GPUs"
               | 
               | https://developer.nvidia.com/blog/hybridizer-csharp/
               | 
               | "Simplifying GPU Access: A Polyglot Binding for GPUs with
               | GraalVM"
               | 
               | https://developer.nvidia.com/gtc/2020/video/s21269-vid
               | 
               | And then you can browse for products on
               | https://www.nvidia.com/content/dam/en-zz/Solutions/Data-
               | Cent...
        
               | dogma1138 wrote:
               | Hybridizer simply generates CUDA C++ code from C#,
               | which is then compiled to PTX; it also does the same
               | for AVX, which you can then compile with Intel's
               | compiler or gcc. It's not particularly good - you often
               | need to debug the generated CUDA source code yourself -
               | and it doesn't always play well with the CUDA
               | programming model, especially its more advanced
               | features.
               | 
               | And again, it's a commercial product developed by a 3rd
               | party; while some people use it, I wouldn't even count
               | it as a rounding error when accounting for why CUDA has
               | the market share it has.
        
               | pjmlp wrote:
               | It is like everyone arguing about C++ for AAA studios,
               | as if everyone was making Crysis and Fortnite clones,
               | while forgetting the legions of people making money
               | selling A games.
               | 
               | Or forgetting the days when games written in C were
               | actually full of inline assembly.
               | 
               | It is still CUDA, regardless of whether it goes through
               | PTX or CUDA C++ as an implementation detail for the
               | high-level code.
        
               | my123 wrote:
               | https://www.ibm.com/support/knowledgecenter/SSYKE2_8.0.0/
               | com... goes to a level above:
               | 
               | "Alternatively you can let the virtual machine (VM) make
               | this decision automatically by setting a system property
               | on the command line. The JIT can also offload certain
               | processing tasks based on performance heuristics."
               | 
               | A lot of what ultimately limits GPUs today is that
               | they are connected over a relatively slow bus (PCIe);
               | this will change in the future, allowing smaller and
               | smaller tasks to be offloaded.
        
               | The_rationalist wrote:
               | In addition, grCUDA is a breakthrough that enables
               | interop with many more languages, such as Ruby, R, JS
               | (soon Python), etc.: https://github.com/NVIDIA/grcuda
        
       ___________________________________________________________________
       (page generated 2021-03-27 23:00 UTC)