[HN Gopher] Intel Xe-HP Graphics: Early Samples Offer 42 TFLOPs ...
       ___________________________________________________________________
        
       Intel Xe-HP Graphics: Early Samples Offer 42 TFLOPs of FP32
       Performance
        
       Author : rbanffy
       Score  : 71 points
       Date   : 2020-08-21 16:54 UTC (6 hours ago)
        
 (HTM) web link (www.anandtech.com)
 (TXT) w3m dump (www.anandtech.com)
        
       | MR4D wrote:
        | So if four tiles = 42 teraflops, then does that mean 25 of these
        | will produce 1 petaflop?
       | 
       | Wow.
       | 
       | I'd imagine these things are super expensive.
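        | 
        | Back-of-the-envelope, taking the quoted 42.3 TF per 4-tile part
        | at face value and assuming ideal scaling (which a real cluster
        | won't get):
        | 
        |     tflops_per_card = 42.3   # 4 tiles x ~10.6 TF FP32
        |     for cards in (24, 25):
        |         pf = cards * tflops_per_card / 1000
        |         print(f"{cards} cards -> {pf:.2f} PFLOPs")
        |     # 24 cards already clear 1 PF; 25 land around 1.06 PF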
        
         | ivalm wrote:
          | You need thermal and memory bandwidth solutions for a multi-
          | tile setup, so you can't easily scale it up too much. Although
          | companies like Cerebras do show us a path toward "very very
          | large chips".
        
         | rbanffy wrote:
         | The tiles are huge, so yes.
        
       | m0zg wrote:
        | The number is a bit misleading: the quoted performance is for
        | the "4-tile" configuration. Per tile, this is still markedly
        | slower than NVIDIA.
        
       | aidenn0 wrote:
       | I would buy a discrete GPU from Intel without seeing any
       | benchmarks. The only system I own in which desktop compositing
       | actually works on Linux is the Intel one.
       | 
       | Both AMD and Nvidia drivers are dumpster fires in terms of
       | stability (and for Nvidia, I've tried both nouveau and the binary
       | drivers)
        
         | bayindirh wrote:
          | Actually, the Intel drivers work really well until they don't.
          | With every new kernel release, my laptop shows more and more
          | visual artifacts, god knows why.
          | 
          | I hope their GPU driver quality won't degrade to e1000e levels,
          | where cards freeze randomly or lose their links just because
          | they didn't find the passing packets as exciting as I do, got
          | bored and stopped working.
        
           | mpol wrote:
           | Can confirm :)
           | 
            | I once bought a PCI card with an Intel Gigabit NIC. Slashdot
            | comments were very positive about this being the best card
            | with the best drivers. Oh wow, just looking at the kernel
            | dmesg was infuriating. Reset upon reset, where the NIC would
            | just be unavailable for seconds. Even a major update (I don't
            | know, 4.x to 5.x or something) did not change anything.
            | 
            | Recently my laptop updated to Linux kernel 5.7, where Xfce
            | and Xfwm just break down and get stuck. Going back to 5.6
            | makes everything work. Nope, I don't trust Intel to make good
            | Linux drivers. Will not buy again :( I never had this in 20
            | years of using AMD/ATI.
        
             | formerly_proven wrote:
             | Intel drivers seem to go down the drain as soon as the
             | hardware is a few years old.
        
               | bayindirh wrote:
                | Currently e1000e is problematic even on relatively new
                | hardware.
                | 
                | On Windows, they publish WHQL-certified broken drivers.
        
         | boudin wrote:
          | AMD support is really good; I would put their Linux support
          | ahead of Intel's.
        
         | DoofusOfDeath wrote:
         | A few weeks ago I put together a gaming system with an RX 5700
         | XT GPU and X570 motherboard. I've installed both Linux (Pop_OS!
         | 20.04) and Windows 10 on it.
         | 
         | With Pop_OS! using the open-source amdgpu driver, everything
         | worked beautifully. The only notable downside (which I don't
         | really care about) was not having polished utilities for
         | tweaking the GPU's behavior, and uncertainty about the status
         | of FreeSync.
         | 
         | My experience on Windows 10 has been much worse. There seemed
         | to be a 3-way argument between my monitor, GPU driver, and
         | Windows regarding HDR settings. And unlike on Linux, an
         | apparent bug in AMD's driver causes instability when using HDMI
         | to stream audio to my monitor.
        
         | drewg123 wrote:
         | I'll probably get downvoted for saying this, but the only
         | graphics drivers that have worked reliably for me on Linux (and
         | FreeBSD) are the proprietary Nvidia ones. When I ran Ubuntu,
         | I'd have to blacklist the Nouveau driver to avoid oopses.
        
           | notyourday wrote:
            | I agree. I'm running the proprietary nVidia driver with 4x 4K
            | monitors under X11 and it. just. works. The Nouveau driver
            | crashes every few days.
            | 
            | Intel has freezes in Mesa. It's a known issue.
        
         | winter_blue wrote:
         | What problems have you experienced with AMD drivers?
         | 
         | I'm asking because Nvidia has been an absolute pain on my
         | laptop (e.g. sleep is terribly broken), and I'm considering a
         | switch over to the Ryzen 4700U, which has pretty powerful
         | integrated Radeon graphics.
         | 
          | So I'm trying to decide if I should instead switch to the
          | i7-1065G7, which has good integrated (Intel) graphics.
         | 
         | This is a really important question to me; I would appreciate
         | any answers.
        
           | fiddlerwoaroof wrote:
            | I ran into an amdgpu bug when I built my most recent desktop
            | (the card would randomly lock up and turn its fans on full
            | speed with some combination of OpenGL and power management).
            | I think it's been fixed, though; I haven't had any issues for
            | several months now.
        
           | Athas wrote:
           | I have been running open source AMD drivers on my desktop
           | since I got it (early 2018, Vega 64 GPU). They work well, and
           | after so many years, it felt like a revelation to have
           | problem-free high-performance 3D acceleration after a fresh
           | installation of a default Linux kernel. The only place where
           | AMD is still wonky is when you want to do GPGPU, but that is
            | mostly down to AMD's byzantine and schizophrenic software
           | strategy (just try to pin down which parts of ROCm you need,
           | or what they do). You don't have to worry about that for
           | graphics, though, as the amdgpu driver is in the kernel, and
           | the default Mesa OpenGL works perfectly with it.
        
           | cycomanic wrote:
            | I've been running an RX 480 for quite a while and the amdgpu
            | drivers in the kernel are completely without problems (I've
            | recently started gaming a bit again and Fallout 4 runs on
            | highest settings under Proton without issues). At some point
            | I tried the Pro drivers and performance actually degraded. I
            | can second what somebody else posted, that doing GPGPU is a
            | bit of a pain because it's difficult to figure out what to
            | install. On the other hand, Intel isn't really better there.
        
         | pengaru wrote:
         | > Both AMD and Nvidia drivers are dumpster fires in terms of
         | stability (and for Nvidia, I've tried both nouveau and the
         | binary drivers)
         | 
          | I've had a great experience with the mainline amdgpu driver,
          | and find it ridiculous to group AMD and Nvidia together in
          | this context today. With amdgpu, AMD has moved much closer to
          | Intel on the mainline Linux GPU support front.
        
           | mixedCase wrote:
           | Using the 5700XT with amdgpu has been a complete shit show
           | since day one, and you have success depending on your set-up
           | since QA appears to be "if it works on a dev's machine ship
           | it".
           | 
            |  _Mainline_ (let alone stable kernels) was almost unusable
            | for like 4-5 months after release; you were completely out
            | of luck unless you followed the right incantations and used
            | a precise firmware binary that had been available at some
            | point in an AMD developer's personal repo but later had to
            | be fetched from someone else's Dropbox. After that period it
            | got stable-ish for me. Occasionally Mesa/kernel updates made
            | some games stop working or freeze the system, but there were
            | workarounds. Yesterday I swapped a monitor for a higher-res
            | one and triggered a bug where DPM essentially shits its
            | pants and you have to switch to manual power management.
            | Meaning a year after release the card can't handle basic
            | multi-monitor stuff.
           | 
           | I've had AMD CPUs exclusively ever since Athlon II, always
           | coupled with Nvidia cards. I wanted to support them also in
           | GPUs given their open source work but for me it's been a
           | complete mess. If Intel GPUs are actually any good and don't
           | shit the bed I'll sell this card for whatever I can get.
        
             | vondur wrote:
              | I've been making sure to run the latest kernels and latest
              | Mesa, and my 5700 XT has been great. But I agree that the
              | 5700 series had some serious issues in its original kernel
              | releases.
        
           | the8472 wrote:
            | Afaik AMD is buggy when passed through to a virtualized
            | guest. Nvidia also requires hacks to make it work. Intel is
            | the only party where it's supported out of the box.
           | 
           | Edit: incorrect terminology
        
             | formerly_proven wrote:
             | Doesn't nVidia explicitly disable this feature unless it's
             | a Quadro?
        
               | the8472 wrote:
               | There are several ways to virtualize a card. Splitting
               | the card into multiple virtual units only works with the
               | pro/datacenter models. Passing through the whole card
               | works if you trick the guest driver by hiding
               | virtualization.
        
             | zamadatix wrote:
             | This comment conflates pretty much everything it mentions:
             | 
              | - Intel GPUs support mediated passthrough (GVT-g), not
              | SR-IOV
              | 
              | - Nvidia blocks passthrough as a feature gatekeep in the
              | driver. You can work around this block by hiding some
              | info from the guest (the hacks mentioned). Consumer
              | Nvidia cards do not support SR-IOV; the special models
              | that do require the proprietary driver and a license for
              | SR-IOV (no hacks to work around this)
              | 
              | - AMD works fine under passthrough but requires specific
              | models for SR-IOV (though this works with the official
              | open driver). SR-IOV is still unstable.
             | 
             | - None of this is really relevant to the original topic
        
               | the8472 wrote:
               | Thanks for the correction. I only played around with it
               | for a few evenings and intel was the only thing I managed
               | to virtualize. I think they're relevant because being
               | able to virtualize it is an aspect of not being a
               | dumpster fire.
        
             | tjoff wrote:
              | That is a wholly different use case, though. And I'd
              | rather take AMD's difficulties in that regard over
              | nVidia, who are actively fighting it.
             | 
             | Don't give money to someone that shits on you.
        
             | pengaru wrote:
             | Judging from mainline's commit log for
             | linux/drivers/gpu/drm/amd/amdgpu there's been a steady flow
             | of SR-IOV related fixes/updates up to and including v5.8 so
             | it's at least not being ignored.
             | 
             | I admit I haven't used amdgpu in this capacity, but just
             | the fact that we can scrutinize amdgpu's history in
             | mainline is a huge improvement over the past.
        
         | pjmlp wrote:
         | Given my experience with GL programming across all three, Intel
         | is definitely not the first choice.
        
         | Symmetry wrote:
         | My experience has been that recently AMD's open source Linux
         | driver has been quite nice, not always having all the latest
         | features implemented but being very stable and roughly as
         | performant as the closed source driver.
        
           | bayindirh wrote:
           | AMD has revised its silicon (decoupled HDCP stuff from video
           | encoder/decoder to be able to open source video hardware) to
           | enable unencumbered open source drivers while nVidia
           | obfuscates everything deeper and deeper.
           | 
           | It's a big disservice to compare them to nVidia about driver
           | quality at this point.
        
         | pizza234 wrote:
         | I guess YMMV.
         | 
         | I use Nvidia with proprietary drivers, and I have only one
         | minor problem (the windows of some open programs get corrupted
         | on resume). I also have Intel on my laptop and have no problems
         | at all (I had an issue in the past, but I've solved it by
         | updating the drivers).
         | 
         | Nouveau is definitely a "dumpster fire", if one wants to put it
         | that way, but that's pretty much openly caused by Nvidia.
         | 
          | To be clear, I don't really use any 3D though, so I can only
          | speak for "typical office usage".
        
           | aidenn0 wrote:
            | It was fine when only 3D games could cause crashes, but now
            | that Firefox and KDE's window manager all use hardware
            | acceleration it's just terrible. The binary drivers are
            |  _definitely_ better than nouveau for stability, though,
            | except maybe on some older hardware.
            | 
            | Most issues fix themselves when I switch to a text console
            | and then back to X11, but every now and then I have to
            | restart X11 to fix things.
        
       | strictnein wrote:
       | I know they're going to have a gaming GPU in 2021, and GFLOPs
       | aren't everything, but:
       | 
       | > One Tile: 10588 GFLOPs (10.6 TF) of FP32
       | 
       | > NVIDIA RTX 2080: 10.07 TFLOPS - FP32
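        | 
        | For a rough idea of where the per-tile number comes from: the
        | sketch below assumes 512 EUs per tile with 8-wide FP32 FMA per
        | EU, and the clock is just backed out of the quoted figure, so
        | treat it as illustrative rather than confirmed specs.
        | 
        |     eus_per_tile = 512     # assumed Xe-HP tile configuration
        |     fp32_lanes = 8         # FP32 ALUs per EU
        |     ops_per_lane = 2       # FMA counts as 2 ops
        |     clock_ghz = 1.3        # implied by the quoted ~10.6 TF
        |     tf = (eus_per_tile * fp32_lanes * ops_per_lane
        |           * clock_ghz / 1000)
        |     print(f"{tf:.1f} TFLOPs per tile")   # ~10.6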
        
         | ivalm wrote:
          | That's actually pretty bad; it will potentially be worse than
          | the 3070 (which it will compete against)...
        
         | znpy wrote:
          | I'm no GPU expert; thank you for providing something to
          | compare to.
        
       | [deleted]
        
       | fancyfredbot wrote:
       | Looks a lot more interesting than the Xeon Phi ever did. If they
       | can provide the huge memory bandwidth this will need to keep it
       | fed, and if they can offer a decent programming model, then this
       | could be very competitive. I suspect they can do these things and
       | the next challenge for them is going to be optimized software.
       | NVIDIA have a massive lead in terms of software support for their
       | accelerators so I can see this being a challenge.
        
         | trhway wrote:
         | >NVIDIA have a massive lead in terms of software support for
         | their accelerators
         | 
          | When Nvidia was hiring compiler people from Sun around
          | 2005-2006, it was puzzling ...
        
         | gnufx wrote:
         | They're pushing "One API" for programming. They rather have to
         | deliver this time on the Aurora supercomputer.
        
         | rbanffy wrote:
          | The nice thing about Phi was that it looked like a lot of
          | x86s. It was easy to program something for it, but,
          | unfortunately, harder to make it perform like a similarly
          | priced GPU. It was sold for HPC, but I saw it as an
          | interesting way to explore what future Intel desktop and
          | server CPUs could look like - core counts will only go up,
          | after all. The last generation brought virtualization, and
          | the previous one up to 16 GB of in-module memory that could
          | either be system memory or behave like an L4 cache (as a
          | comparison, IBM's latest z15 mainframe has about a gigabyte
          | of L4 per 4-socket drawer).
          | 
          | Too bad it never found its niche - it was too slow for HPC,
          | lacked virtualization for servers and was ludicrously
          | expensive (and hot) for desktops.
        
       | tweedledee wrote:
        | Does anyone else here think the coprocessor on the Nvidia
        | Ampere looks a lot like the RC 18? If so, it should post some
        | crazy perf numbers.
        
       | aspaceman wrote:
       | Much more interested in architecture design and memory hierarchy
       | than flops. Anything interesting going on in caching or memory
       | hardware?
       | 
        | All the problems I work on benefit more from memory bandwidth
        | and cache latency than from raw FLOPS. I imagine others are in
        | the same boat.
        | 
        | I was hoping this would be the start of some more architecture
        | diversity, like Apple's tile-based deferred rendering.
        
         | ivalm wrote:
          | If anything this is a move to a Maxwell-like architecture,
          | away from normal Intel stuff. I feel like all GPU
          | architectures are kind of converging (AMD Navi is also a move
          | towards a more Maxwell-like arch).
        
         | winter_blue wrote:
         | In terms of memory architecture -- I've heard that memory is a
         | bottleneck for the GPU, specifically the time it takes to move
         | stuff from RAM (main memory) to the GPU's RAM/memory. If it is
         | such a big bottleneck, then why don't we (yet) see powerful GPU
         | sharing the same die as the CPU and accessing/using/sharing RAM
         | with the CPU (like integrated GPUs do)? Then there'd be a
         | _zero_ bottleneck. You would just load whatever into RAM, and
         | just give the ( _super-powerful integrated_ ) GPU a
         | pointer/address. Bam, done. Why hasn't this happened yet? /
          |  _What am I missing/misunderstanding here?_
        
           | tarlinian wrote:
           | Mainly because what you're saying isn't true for many
           | workloads? (Also this already works for most existing
           | integrated GPUs.)
           | 
           | The types of workloads run on GPUs typically like very high
           | memory bandwidth and are usually willing to live with higher
           | memory latency to get it. Onboard GPU memory is usually built
           | with this in mind (trade off capacity and latency for
           | increases in bandwidth). This is generally speaking the
            | opposite of what you want in a CPU, where people often want
            | very high memory capacity and lower latencies but may not be
            | limited by memory bandwidth, so simply sticking a GPU on die
            | and giving it access to a memory subsystem that was not
            | designed to feed a GPU is not going to make anything better.
        
           | qzw wrote:
           | Besides the massive heat/power issues you would have, main
           | memory like DDR4 is optimized for different characteristics
           | than graphics memory like GDDR5 (mostly trade offs in latency
           | vs bandwidth).
        
           | nordsieck wrote:
           | > I've heard that memory is a bottleneck for the GPU,
           | specifically the time it takes to move stuff from RAM (main
           | memory) to the GPU's RAM/memory. If it is such a big
           | bottleneck, then why don't we (yet) see powerful GPU sharing
           | the same die as the CPU and accessing/using/sharing RAM with
           | the CPU (like integrated GPUs do)? Then there'd be a zero
            | bottleneck. You would just load whatever into RAM, and
           | just give the (super-powerful integrated) GPU a
           | pointer/address. Bam, done. Why hasn't this happened yet? /
           | What am I missing/misunderstanding here?
           | 
           | That is not the only bottleneck involved.
           | 
           | Historically, GPUs have used GDDR ram as opposed to general
           | purpose DDR memory. One of the key differences between GDDR
           | and DDR is the bus width, which can be as large as 1024 bits,
           | compared to conventional ram with a 64 bit bus width
           | (although dual channel is effectively 128 bits). This much
           | wider bus results in much higher memory bandwidth which is
           | generally necessary to feed the truly enormous number of
           | functional units in a GPU.
           | 
           | I suppose you could ask: why doesn't everyone just
           | standardize on GDDR?
           | 
           | 1. This would dramatically increase cache line size. I don't
           | have data, but I assume this would generally be bad.
           | 
           | 2. My recollection (but I don't have a source for this) is
           | that DDR has lower latency than GDDR ram, so for branchy code
           | (which CPUs often have to deal with, but GPUs typically never
           | have to deal with), DDR could actually be faster.
           | 
            | 3. DDR is cheaper to manufacture. Aside from being higher
            | volume, a lower bus width just makes it simpler to
            | manufacture.
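            | 
            | To put rough numbers on the bandwidth gap (the part choices
            | are illustrative, not anything specific to Xe-HP):
            | 
            |     # peak bandwidth in GB/s = (bus bits / 8) * Gb/s per pin
            |     def peak_gb_s(bus_bits, gbps_per_pin):
            |         return bus_bits / 8 * gbps_per_pin
            | 
            |     print(peak_gb_s(128, 3.2))   # dual-ch DDR4-3200: ~51 GB/s
            |     print(peak_gb_s(256, 14))    # 256-bit GDDR6:     448 GB/s
            |     print(peak_gb_s(1024, 2.4))  # one HBM2 stack:    ~307 GB/s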
        
             | PixelOfDeath wrote:
             | > 1. This would dramatically increase cache line size. I
             | don't have data, but I assume this would generally be bad.
             | 
              | Why would it change cache line size? GPUs also use cache
              | lines in the range of 32-128 bytes. I think that is
              | independent of the bus system/width.
        
               | nordsieck wrote:
               | > Why would it change cache line size? GPUs also use
               | cache lines in the range of 32-128 byte?! I think that is
               | independent of the bus system/width.
               | 
               | I just assumed that bus width = cache line size. I guess
               | I was wrong.
               | 
               | Sorry.
        
             | jandrese wrote:
             | > One of the key differences between GDDR and DDR is the
             | bus width, which can be as large as 1024 bits, compared to
             | conventional ram with a 64 bit bus width
             | 
              | Does this mean there are over a thousand traces between
              | the GPU chip and the memory chips? If that's the case, it
              | would be pretty clear why regular motherboards don't use
              | it: the sockets for the chips would be enormous! You're
              | talking about roughly doubling the pin count vs. a 64-bit
              | memory bus on a modern LGA socket.
        
               | [deleted]
        
               | thechao wrote:
               | The on-die Larrabee traces were 3072 wires wide.
        
             | easde wrote:
             | GDDR also uses a ton of power (per gigabyte) and has very
             | tight signaling tolerances so it can't be socketed. Not a
             | good choice for laptops (power), desktops (not socketed) or
             | servers (both reasons).
        
           | PixelOfDeath wrote:
            | GPUs can hide memory latency very well, because they are
            | basically SMT on steroids. (Imagine instead of 2 threads per
            | "core" you have 32-64 threads.)
            | 
            | But they are starved of memory bandwidth! And the lower
            | latency memory CPUs prefer is not the same as the high
            | bandwidth memory that GPUs like.
            | 
            | Also there are different kinds of caching. Modern APUs often
            | have two ways to access memory: over their own cache, or
            | over the CPU's cache. So shared memory for CPU<->GPU gets
            | the full advantage of a cache, but it is still a trade-off.
            | 
            | If you want some work done by the GPU part of an APU, by
            | sharing a pointer, you can do that today. But from the
            | point of view of the CPU there is no prediction beyond the
            | "GPU do X" commands, and there's a very high latency until
            | the job is done. So you need a minimum GPU job size for it
            | to make sense.
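            | 
            | A rough Little's-law sketch of why so many threads are
            | needed; the latency and bandwidth values are plausible
            | placeholders, not measurements of any particular GPU:
            | 
            |     mem_latency_s = 400e-9   # assumed DRAM round trip
            |     target_bw = 448e9        # bytes/s, GDDR6-class
            |     line_bytes = 64          # per outstanding request
            | 
            |     # bytes that must be in flight to sustain the bandwidth
            |     in_flight = target_bw * mem_latency_s
            |     requests = in_flight / line_bytes
            |     print(int(in_flight), "bytes,", int(requests), "requests")
            |     # ~179 kB in flight, ~2800 outstanding requests -- hence
            |     # thousands of threads resident at once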
        
           | nwallin wrote:
           | What you're describing actually exists, it's called
           | Heterogeneous Systems Architecture. https://en.wikipedia.org/
           | wiki/Heterogeneous_System_Architect... It's nifty, but it's
           | generally only used as a cost saving measure, not a
           | performance one. It does, as you mention, broaden the scope
           | of what is a useful target for GPU compute operations.
           | 
            | Memory bandwidth is typically the bottleneck in GPUs.
            | Meanwhile, the access patterns are typically very
            | predictable, so data can be prefetched and latency is
            | generally not a problem. GPU memory is therefore designed
            | for very high bandwidth, even if it means completely
            | tanking the latency.
            | 
            | On the other hand, CPUs almost always need better memory
            | latency, and most workloads do not saturate the memory
            | bandwidth. Unlike in a GPU, many memory access patterns are
            | unpredictable. Operations that are pointer-heavy tend to be
            | limited by latency: tree operations and linked-list
            | iteration are latency limited, and languages like C#, Java,
            | Python and Javascript, where all data lives behind a
            | pointer, tend to benefit significantly from improved
            | latency. So while improving memory bandwidth up to a point
            | is important, there's much more attention given to latency.
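            | 
            | Back-of-the-envelope on how big that gap is, using typical
            | desktop-class figures (assumed, not measured):
            | 
            |     nodes = 10_000_000
            |     node_bytes = 64
            |     miss_latency_s = 100e-9   # ~100 ns per dependent miss
            |     stream_bw = 50e9          # ~50 GB/s sequential reads
            | 
            |     # pointer chase: each load waits on the previous one
            |     chase_s = nodes * miss_latency_s
            |     # streaming the same data: limited only by bandwidth
            |     stream_s = nodes * node_bytes / stream_bw
            |     print(f"{chase_s:.2f} s vs {stream_s:.3f} s")  # 1.00 vs 0.013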
        
           | formerly_proven wrote:
           | It's more of a bandwidth problem. The arithmetic intensity of
           | a non-HBM GPU is around 100, meaning the GPU can perform 100
           | FLOPs for every float load/store.
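            | 
            | Rough math behind that figure, using the per-tile number
            | from upthread and an assumed GDDR6-class bandwidth:
            | 
            |     peak_flops = 10.6e12       # FP32 FLOP/s, one tile
            |     mem_bw = 448e9             # bytes/s, assumed
            |     floats_per_s = mem_bw / 4  # 4-byte FP32 loads
            |     print(peak_flops / floats_per_s)   # ~95 FLOPs per float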
        
           | brandmeyer wrote:
           | > What am I missing/misunderstanding here
           | 
           | Some (not all) GPU memory types are not cache coherent with
           | the CPU. Some of the cache-coherent cases have poorer
           | performance relative to the non-coherent memory from the
           | perspective of the GPU's memory bandwidth and access latency.
        
           | aspaceman wrote:
           | That's one bottleneck, but you're missing another bottleneck
           | entirely. In many problems, we load the entirety of the data
           | into the GPU's RAM anyways (some games do so with the
           | entirety of their assets during load times). It doesn't
           | matter that the data is already in RAM, because we care about
           | how much data can come from GDDR to registers (in GB/s). You
           | have to get the data to your compute units. This is your
           | memory bandwidth.
           | 
            | GDDR RAM wants to be accessed in bulk - large rows at once.
            | Things are easy when each thread wants a consecutive byte,
            | but if not, things become much slower. Caching and other
           | techniques can help mitigate this, and they're (imo) the
           | place for a lot of creativity in architecture design. Having
           | more potential FLOPS just means more ALUs.
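            | 
            | A toy illustration of why the access pattern matters (the
            | 32-byte transaction size is an assumption, not a spec for
            | any particular GPU):
            | 
            |     transaction_bytes = 32
            |     useful_bytes = 4     # one float used per transaction
            |     peak_bw = 448e9      # assumed GDDR6-class, bytes/s
            | 
            |     # coalesced access uses every byte fetched; a scattered
            |     # pattern wastes 28 of every 32 bytes transferred
            |     effective = peak_bw * useful_bytes / transaction_bytes
            |     print(effective / 1e9, "GB/s of useful data")   # 56.0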
        
             | winter_blue wrote:
              | Ah, things make a lot more sense now.
             | 
             | Thank you for explaining it so clearly.
        
           | wolf550e wrote:
           | You would have 600W in a single socket.
        
           | boxfire wrote:
            | The Intel Kaby Lake G processors pair a Radeon RX Vega M
            | GL/GH on-package with an 8th-gen Core series processor over
            | an internal PCIe 3 x8 link. Seems like they could do much
            | better, but it's the only such product I know of.
           | 
           | https://fuse.wikichip.org/news/1634/hot-chips-30-intel-
           | kaby-...
        
         | jra101 wrote:
         | Tile based Deferred Rendering came from PowerVR. Prior to
         | moving to their own GPU architecture, Apple licensed various
         | PowerVR designs.
        
           | ksec wrote:
            | Yes, and it goes as far back as the Sega Dreamcast. And the
            | technique itself is now used in pretty much all current
            | GPUs, from Nvidia to AMD and ARM.
            | 
            | Still a little pissed at how Apple handled IMG / PowerVR.
        
             | monocasa wrote:
              | It goes farther back. You could buy PowerVR discrete GPUs
              | for desktops as far back as 1996, IIRC.
        
             | TomVDB wrote:
             | Neither AMD nor Nvidia are using deferred rendering.
        
             | Veedrac wrote:
             | NVIDIA uses tiled rasterization, but not TBDR. I believe
             | the same is true of AMD.
        
       | MangoCoffee wrote:
        | Is it going to be based on 14+++nm? AMD and Nvidia are on 7nm.
        
         | sgerenser wrote:
         | I thought these were being fabbed by TSMC, presumably at 7nm?
        
         | formerly_proven wrote:
         | 10nm+++, actually.
        
           | ivalm wrote:
           | I thought intel said during architecture day that GPUs will
           | be done on outside fabs (presumably tsmc or samsung).
        
             | formerly_proven wrote:
             | > We also know, due to disclosures made at Intel's
             | Architecture Day, that it is set to be built on Intel's
             | 10nm Enhanced SuperFin (10ESF, formerly 10++, formerly
             | 10+++) manufacturing process, which we believe to be a late
             | 2021 process.
        
       ___________________________________________________________________
       (page generated 2020-08-21 23:00 UTC)