[HN Gopher] Intel Xe-HP Graphics: Early Samples Offer 42 TFLOPs ...
___________________________________________________________________

Intel Xe-HP Graphics: Early Samples Offer 42 TFLOPs of FP32 Performance

Author : rbanffy
Score : 71 points
Date : 2020-08-21 16:54 UTC (6 hours ago)

(HTM) web link (www.anandtech.com)
(TXT) w3m dump (www.anandtech.com)

| MR4D wrote:
| So if four tiles = 42 teraflops, then does that mean 25 of these
| will produce 1 petaflop?
|
| Wow.
|
| I'd imagine these things are super expensive.
| ivalm wrote:
| You need a thermal and memory bandwidth solution for a multi-tile
| setup, so you can't easily scale it up very far. Although
| companies like Cerebras do show us a path to "very very large
| chips".
| rbanffy wrote:
| The tiles are huge, so yes.
| m0zg wrote:
| The number is a bit misleading: the quoted performance is for the
| "4-tile" configuration. Per-tile this is still markedly slower
| than NVIDIA.
| aidenn0 wrote:
| I would buy a discrete GPU from Intel without seeing any
| benchmarks. The only system I own in which desktop compositing
| actually works on Linux is the Intel one.
|
| Both AMD and Nvidia drivers are dumpster fires in terms of
| stability (and for Nvidia, I've tried both nouveau and the binary
| drivers)
| bayindirh wrote:
| Actually, the Intel drivers work really well until they don't.
| With every new kernel release, my laptop shows more and more
| visual artifacts, god knows why.
|
| Hope that their GPU driver quality won't degrade to e1000e
| levels, where cards freeze randomly or lose links just because
| they didn't find the passing packets as exciting as I do, get
| bored and stop working.
| mpol wrote:
| Can confirm :)
|
| I once bought a PCI card with Intel Gigabit. Slashdot
| comments were very positive about this being the best card
| with the best drivers. Oh wow, just looking at the kernel
| dmesg was infuriating. Reset upon reset, where the NIC would
| just be unavailable for seconds. Even a major update (I don't
| know, 4.x to 5.x or something) did not change anything.
|
| Recently my laptop updated to Linux kernel 5.7, where Xfce
| and Xfwm just break down and get stuck. Going back to 5.6
| makes everything work. Nope, I don't trust Intel to make good
| Linux drivers. Will not buy again :( I never had this in 20
| years of using AMD/ATI.
| formerly_proven wrote:
| Intel drivers seem to go down the drain as soon as the
| hardware is a few years old.
| bayindirh wrote:
| Currently e1000e is problematic even in relatively new
| hardware.
|
| On Windows, they publish WHQL-certified broken drivers.
| boudin wrote:
| AMD support is really good; I would put their support on Linux
| ahead of Intel's.
| DoofusOfDeath wrote:
| A few weeks ago I put together a gaming system with an RX 5700
| XT GPU and X570 motherboard. I've installed both Linux (Pop_OS!
| 20.04) and Windows 10 on it.
|
| With Pop_OS! using the open-source amdgpu driver, everything
| worked beautifully. The only notable downside (which I don't
| really care about) was not having polished utilities for
| tweaking the GPU's behavior, and uncertainty about the status
| of FreeSync.
|
| My experience on Windows 10 has been much worse. There seemed
| to be a 3-way argument between my monitor, GPU driver, and
| Windows regarding HDR settings. And unlike on Linux, an
| apparent bug in AMD's driver causes instability when using HDMI
| to stream audio to my monitor.
| drewg123 wrote:
| I'll probably get downvoted for saying this, but the only
| graphics drivers that have worked reliably for me on Linux (and
| FreeBSD) are the proprietary Nvidia ones. When I ran Ubuntu,
| I'd have to blacklist the Nouveau driver to avoid oopses.
| notyourday wrote:
| I agree. I'm running proprietary nVidia driving 4x 4k monitors
| under X11 and it. just. works. The Nouveau driver crashes every
| few days.
|
| Intel has freezes in Mesa. It is a known issue.
| winter_blue wrote:
| What problems have you experienced with AMD drivers?
|
| I'm asking because Nvidia has been an absolute pain on my
| laptop (e.g. sleep is terribly broken), and I'm considering a
| switch over to the Ryzen 4700U, which has pretty powerful
| integrated Radeon graphics.
|
| So I'm trying to decide if I should instead switch to the
| i7-1065G7, which has good integrated (Intel) graphics.
|
| This is a really important question to me; I would appreciate
| any answers.
| fiddlerwoaroof wrote:
| I ran into an amdgpu bug when I built my most recent desktop
| (the card would randomly lock up and turn its fans on full
| speed with some combination of OpenGL and power management). I
| think it's been fixed, though; I haven't had any issues for
| several months now.
| Athas wrote:
| I have been running open source AMD drivers on my desktop
| since I got it (early 2018, Vega 64 GPU). They work well, and
| after so many years, it felt like a revelation to have
| problem-free high-performance 3D acceleration after a fresh
| installation of a default Linux kernel. The only place where
| AMD is still wonky is when you want to do GPGPU, but that is
| mostly down to AMD's byzantine and schizophrenic software
| strategy (just try to pin down which parts of ROCm you need,
| or what they do). You don't have to worry about that for
| graphics, though, as the amdgpu driver is in the kernel, and
| the default Mesa OpenGL works perfectly with it.
| cycomanic wrote:
| I've been running an RX 480 for quite a while and the amdgpu
| drivers in the kernel are completely without problems (I've
| recently started gaming a bit again and Fallout 4 runs on
| highest settings under Proton without issues). At some point
| I tried the Pro drivers and performance actually degraded. I
| can second what somebody else posted, that doing GPGPU is a
| bit of a pain because it's difficult to figure out what to
| install. On the other hand Intel isn't really better there.
| pengaru wrote:
| > Both AMD and Nvidia drivers are dumpster fires in terms of
| stability (and for Nvidia, I've tried both nouveau and the binary
| drivers)
|
| I've had a great experience with the mainline amdgpu driver, and
| find it ridiculous to group AMD and nvidia together in this
| context today. With amdgpu, AMD has moved much closer to Intel
| on the mainline Linux GPU support front.
| mixedCase wrote:
| Using the 5700XT with amdgpu has been a complete shit show
| since day one, and your success depends on your set-up,
| since QA appears to be "if it works on a dev's machine, ship
| it".
|
| _Mainline_ (let alone stable kernels) was almost unusable
| for like 4-5 months after release; you were completely out of
| luck unless you followed the right incantations and used a
| precise firmware binary that had been available at some point
| on an AMD developer's personal repo but that you later had to
| get through someone else's Dropbox. After that period it got
| stable-ish for me.
| Occasionally Mesa/kernel updates made some games stop working
| or freeze the system, but there were workarounds.
| Yesterday I swapped a monitor for a higher-res one, and I
| triggered a bug where DPM essentially shits its pants and you
| have to switch to manual power management. Meaning a year
| after release the card can't handle basic multi-monitor
| stuff.
|
| I've had AMD CPUs exclusively ever since the Athlon II, always
| coupled with Nvidia cards. I wanted to support them also in
| GPUs given their open source work, but for me it's been a
| complete mess. If Intel GPUs are actually any good and don't
| shit the bed I'll sell this card for whatever I can get.
| vondur wrote:
| I've been making sure to run the latest kernels and latest
| Mesa, and my 5700 XT has been great. But I agree that
| the 5700 series had some serious issues in its original
| kernel releases.
| the8472 wrote:
| Afaik AMD is buggy when passed to a virtualized guest. Nvidia
| also requires hacks to make it work. Intel is the only party
| where it's supported out of the box.
|
| Edit: incorrect terminology
| formerly_proven wrote:
| Doesn't nVidia explicitly disable this feature unless it's
| a Quadro?
| the8472 wrote:
| There are several ways to virtualize a card. Splitting
| the card into multiple virtual units only works with the
| pro/datacenter models. Passing through the whole card
| works if you trick the guest driver by hiding
| virtualization.
| zamadatix wrote:
| This comment conflates pretty much everything it mentions:
|
| - Intel GPUs support mediated virtualization passthrough
| through GVT-g, not SR-IOV
|
| - Nvidia blocks passthrough as a feature gate in the
| driver. You can work around this block by hiding some info
| from the guest (the hacks mentioned). Consumer Nvidia cards
| do not support SR-IOV; the special models that do require
| the proprietary driver and a license for SR-IOV (no hacks
| to work around this)
|
| - AMD works fine under passthrough but requires specific
| models for SR-IOV (though this works on the official open
| driver). SR-IOV is still unstable.
|
| - None of this is really relevant to the original topic
| the8472 wrote:
| Thanks for the correction. I only played around with it
| for a few evenings and Intel was the only thing I managed
| to virtualize. I think it's relevant because being
| able to virtualize it is an aspect of not being a
| dumpster fire.
| tjoff wrote:
| That is a wholly different use case though. And I'd rather
| take AMD's difficulties in that regard over nVidia, who are
| actively fighting it.
|
| Don't give money to someone that shits on you.
| pengaru wrote:
| Judging from mainline's commit log for
| linux/drivers/gpu/drm/amd/amdgpu there's been a steady flow
| of SR-IOV-related fixes/updates up to and including v5.8, so
| it's at least not being ignored.
|
| I admit I haven't used amdgpu in this capacity, but just
| the fact that we can scrutinize amdgpu's history in
| mainline is a huge improvement over the past.
| pjmlp wrote:
| Given my experience with GL programming across all three, Intel
| is definitely not the first choice.
| Symmetry wrote:
| My experience has been that recently AMD's open source Linux
| driver has been quite nice, not always having all the latest
| features implemented but being very stable and roughly as
| performant as the closed source driver.
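For anyone who wants to check what their own card advertises along the lines zamadatix describes: on Linux, SR-IOV capability is visible through standard sysfs attributes. A minimal sketch, assuming only the usual /sys/bus/pci layout (the vendor-ID table is added just for readability, and GVT-g mediated passthrough is configured separately, so it will not show up here):

    # Sketch: list PCI display controllers and whether they expose SR-IOV,
    # using the standard sysfs attributes (Linux only).
    import glob
    import os

    VENDORS = {"0x8086": "Intel", "0x10de": "NVIDIA", "0x1002": "AMD"}

    def read(path):
        with open(path) as f:
            return f.read().strip()

    for dev in sorted(glob.glob("/sys/bus/pci/devices/*")):
        if not read(os.path.join(dev, "class")).startswith("0x03"):
            continue  # PCI class 0x03xxxx = display controller
        vendor = VENDORS.get(read(os.path.join(dev, "vendor")), "unknown vendor")
        vfs = os.path.join(dev, "sriov_totalvfs")
        if os.path.exists(vfs):
            print(f"{os.path.basename(dev)} ({vendor}): SR-IOV, up to {read(vfs)} VFs")
        else:
            print(f"{os.path.basename(dev)} ({vendor}): no SR-IOV capability exposed")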
| bayindirh wrote:
| AMD has revised its silicon (decoupled the HDCP stuff from the
| video encoder/decoder to be able to open source the video
| hardware) to enable unencumbered open source drivers, while
| nVidia obfuscates everything deeper and deeper.
|
| It's a big disservice to compare them to nVidia on driver
| quality at this point.
| pizza234 wrote:
| I guess YMMV.
|
| I use Nvidia with proprietary drivers, and I have only one
| minor problem (the windows of some open programs get corrupted
| on resume). I also have Intel on my laptop and have no problems
| at all (I had an issue in the past, but I solved it by
| updating the drivers).
|
| Nouveau is definitely a "dumpster fire", if one wants to put it
| that way, but that's pretty much openly caused by Nvidia.
|
| To be clear, I don't use pretty much any 3D though, so I can
| only speak for "typical office usage".
| aidenn0 wrote:
| It was fine when only 3D games could cause crashes, but now
| that Firefox and KDE's window manager all use hardware
| acceleration it's just terrible. The binary drivers are
| _definitely_ better than nouveau for stability, though, except
| maybe for some older hardware.
|
| Most issues fix themselves when I switch to a text console
| then back to X11, but every now and then I have to restart
| X11 to fix things.
| strictnein wrote:
| I know they're going to have a gaming GPU in 2021, and GFLOPs
| aren't everything, but:
|
| > One Tile: 10588 GFLOPs (10.6 TF) of FP32
|
| > NVIDIA RTX 2080: 10.07 TFLOPS - FP32
| ivalm wrote:
| That's actually pretty bad; it will potentially be worse than
| the 3070 (which it will compete against)...
| znpy wrote:
| I'm no GPU expert; thank you for providing something to compare
| to.
| [deleted]
| fancyfredbot wrote:
| Looks a lot more interesting than the Xeon Phi ever did. If they
| can provide the huge memory bandwidth this will need to keep it
| fed, and if they can offer a decent programming model, then this
| could be very competitive. I suspect they can do these things,
| and the next challenge for them is going to be optimized
| software. NVIDIA have a massive lead in terms of software
| support for their accelerators, so I can see this being a
| challenge.
| trhway wrote:
| > NVIDIA have a massive lead in terms of software support for
| their accelerators
|
| When Nvidia was hiring compiler people from Sun around
| 2005-2006, it was puzzling ...
| gnufx wrote:
| They're pushing "One API" for programming. They rather have to
| deliver this time, on the Aurora supercomputer.
| rbanffy wrote:
| The nice thing about Phi was that it looked like a lot of
| x86s. It was easy to program something for it, but,
| unfortunately, harder to make it perform like a similarly
| priced GPU. It was sold for HPC, but I saw it as an interesting
| way to explore what future Intel desktop and server CPUs could
| look like - core counts will only go up, after all. The last
| generation added virtualization, and the previous one up to
| 16 GB of in-module memory that could either be system memory or
| behave like an L4 cache (as a comparison, IBM's latest z15
| mainframe has about a gigabyte of L4 per 4-socket drawer).
|
| Too bad it never found its niche - it was too slow for HPC,
| lacked virtualization for servers, and was ludicrously expensive
| (and hot) for desktops.
| tweedledee wrote:
| Does anyone else here think the coprocessor on the Nvidia Ampere
| looks a lot like the RC 18? If so, that should post some crazy
| perf numbers.
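A quick sanity check of the figures quoted above - a minimal sketch that simply takes the 42 TFLOPs four-tile number and the 10.07 TFLOPs RTX 2080 FP32 figure at face value:

    # Back-of-the-envelope math using the FP32 numbers quoted in the thread.
    XE_HP_4_TILE_TFLOPS = 42.0   # early 4-tile Xe-HP sample (quoted above)
    RTX_2080_TFLOPS = 10.07      # RTX 2080 FP32 (quoted above)

    per_tile = XE_HP_4_TILE_TFLOPS / 4                  # ~10.5 TFLOPs per tile
    devices_per_petaflop = 1000 / XE_HP_4_TILE_TFLOPS   # ~23.8 four-tile devices

    print(f"Per tile: {per_tile:.2f} TFLOPs "
          f"({per_tile / RTX_2080_TFLOPS:.2f}x an RTX 2080 in FP32)")
    print(f"Four-tile devices for a nominal petaflop: {devices_per_petaflop:.1f}")

By these numbers a single tile lands roughly at RTX 2080 FP32 throughput, and the opening comment's estimate of 25 devices for a petaflop is about right, with a small margin.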
| aspaceman wrote:
| Much more interested in architecture design and memory hierarchy
| than FLOPS. Anything interesting going on in caching or memory
| hardware?
|
| All the problems I work on benefit more from memory bandwidth and
| cache latency than raw FLOPS. I imagine others are in the same
| boat.
|
| I was hoping this would be the start of some more architecture
| diversity, like Apple's tile-based deferred rendering.
| ivalm wrote:
| If anything this is a move to a Maxwell-like architecture, away
| from normal Intel stuff. I feel like all GPU architectures are
| kind of converging (AMD Navi is also a move towards a more
| Maxwell-like arch).
| winter_blue wrote:
| In terms of memory architecture -- I've heard that memory is a
| bottleneck for the GPU, specifically the time it takes to move
| stuff from RAM (main memory) to the GPU's RAM/memory. If it is
| such a big bottleneck, then why don't we (yet) see a powerful
| GPU sharing the same die as the CPU and accessing/using/sharing
| RAM with the CPU (like integrated GPUs do)? Then there'd be
| _zero_ bottleneck. You would just load whatever into RAM, and
| just give the (_super-powerful integrated_) GPU a
| pointer/address. Bam, done. Why hasn't this happened yet?
| _What am I missing/misunderstanding here?_
| tarlinian wrote:
| Mainly because what you're saying isn't true for many
| workloads? (Also this already works for most existing
| integrated GPUs.)
|
| The types of workloads run on GPUs typically like very high
| memory bandwidth and are usually willing to live with higher
| memory latency to get it. Onboard GPU memory is usually built
| with this in mind (trade off capacity and latency for
| increases in bandwidth). This is, generally speaking, the
| opposite of what you want in a CPU, where people often want
| very high memory capacity and lower latencies, but may not be
| limited by memory bandwidth, so simply sticking a GPU on die
| and giving it access to a memory subsystem that was not
| designed to feed a GPU is not going to make anything better.
| qzw wrote:
| Besides the massive heat/power issues you would have, main
| memory like DDR4 is optimized for different characteristics
| than graphics memory like GDDR5 (mostly trade-offs in latency
| vs bandwidth).
| nordsieck wrote:
| > I've heard that memory is a bottleneck for the GPU,
| specifically the time it takes to move stuff from RAM (main
| memory) to the GPU's RAM/memory. If it is such a big
| bottleneck, then why don't we (yet) see a powerful GPU sharing
| the same die as the CPU and accessing/using/sharing RAM with
| the CPU (like integrated GPUs do)? Then there'd be zero
| bottleneck. You would just load whatever into RAM, and
| just give the (super-powerful integrated) GPU a
| pointer/address. Bam, done. Why hasn't this happened yet?
| What am I missing/misunderstanding here?
|
| That is not the only bottleneck involved.
|
| Historically, GPUs have used GDDR RAM as opposed to general
| purpose DDR memory. One of the key differences between GDDR
| and DDR is the bus width, which can be as large as 1024 bits,
| compared to conventional RAM with a 64-bit bus width
| (although dual channel is effectively 128 bits). This much
| wider bus results in much higher memory bandwidth, which is
| generally necessary to feed the truly enormous number of
| functional units in a GPU.
|
| I suppose you could ask: why doesn't everyone just
| standardize on GDDR?
|
| 1. This would dramatically increase cache line size.
| I don't have data, but I assume this would generally be bad.
|
| 2. My recollection (but I don't have a source for this) is
| that DDR has lower latency than GDDR RAM, so for branchy code
| (which CPUs often have to deal with, but GPUs typically never
| have to deal with), DDR could actually be faster.
|
| 3. DDR is cheaper to manufacture. Aside from being higher
| volume, a lower bus width just makes it simpler to
| manufacture.
| PixelOfDeath wrote:
| > 1. This would dramatically increase cache line size. I
| don't have data, but I assume this would generally be bad.
|
| Why would it change cache line size? GPUs also use cache
| lines in the range of 32-128 bytes?! I think that is
| independent of the bus system/width.
| nordsieck wrote:
| > Why would it change cache line size? GPUs also use
| cache lines in the range of 32-128 bytes?! I think that is
| independent of the bus system/width.
|
| I just assumed that bus width = cache line size. I guess
| I was wrong.
|
| Sorry.
| jandrese wrote:
| > One of the key differences between GDDR and DDR is the
| bus width, which can be as large as 1024 bits, compared to
| conventional RAM with a 64-bit bus width
|
| Does this mean there are over a thousand traces between the
| GPU chip and the memory chips? It would be pretty clear why
| regular motherboards don't use it if that's the case; the
| sockets for the chips would be enormous! You're talking
| about roughly doubling the pin count vs. a 64-bit memory bus
| on a modern LGA socket.
| [deleted]
| thechao wrote:
| The on-die Larrabee traces were 3072 wires wide.
| easde wrote:
| GDDR also uses a ton of power (per gigabyte) and has very
| tight signaling tolerances, so it can't be socketed. Not a
| good choice for laptops (power), desktops (not socketed) or
| servers (both reasons).
| PixelOfDeath wrote:
| GPUs can hide memory latency very well, because they are
| basically SMT on steroids. (Imagine instead of 2 threads per
| "core" you have 32-64 threads.)
|
| But they are starved of memory bandwidth! And the lower
| latency memory CPUs prefer is not the same as the high
| bandwidth memory that GPUs like.
|
| Also there is a different kind of caching. Modern APUs often
| have two ways to access memory: over their own cache, or over
| the CPU's cache. So shared memory for CPU<->GPU gets the full
| advantage of a cache, but it is still a trade-off.
|
| If you want some work done by the GPU part of an APU, by
| sharing a pointer, you can do that today. But from the point
| of view of the CPU there is no prediction beyond the "GPU do
| X" commands. And a very high latency until the job is done.
| So you need a minimum GPU job size for it to make sense.
| nwallin wrote:
| What you're describing actually exists; it's called
| Heterogeneous System Architecture:
| https://en.wikipedia.org/wiki/Heterogeneous_System_Architect...
| It's nifty, but it's generally only used as a cost-saving
| measure, not a performance one. It does, as you mention,
| broaden the scope of what is a useful target for GPU compute
| operations.
|
| Memory bandwidth is typically the bottleneck in GPUs.
| Meanwhile, the access patterns are typically very
| predictable. So they're able to prefetch data, so latency is
| generally not a problem. So GPU memory is designed to have
| very high bandwidth, even if it means completely tanking the
| latency.
|
| On the other hand, CPUs almost always need better memory
| latency, and most workloads do not saturate the memory
| bandwidth.
| Unlike in a GPU, many memory access patterns are
| unpredictable. Operations that are pointer-heavy tend to be
| limited by latency. All tree operations and linked-list
| iteration tend to be latency-limited. Languages like C#,
| Java, Python and JavaScript, where all data lives behind a
| pointer, tend to benefit significantly from improved latency.
| So while improving memory bandwidth up to a point is
| important, there's much more attention given to latency.
| formerly_proven wrote:
| It's more of a bandwidth problem. The arithmetic intensity of
| a non-HBM GPU is around 100, meaning the GPU can perform 100
| FLOPs for every float load/store.
| brandmeyer wrote:
| > What am I missing/misunderstanding here
|
| Some (not all) GPU memory types are not cache coherent with
| the CPU. Some of the cache-coherent cases have poorer
| performance relative to the non-coherent memory from the
| perspective of the GPU's memory bandwidth and access latency.
| aspaceman wrote:
| That's one bottleneck, but you're missing another bottleneck
| entirely. In many problems, we load the entirety of the data
| into the GPU's RAM anyway (some games do so with the
| entirety of their assets during load times). It doesn't
| matter that the data is already in RAM, because we care about
| how much data can come from GDDR to registers (in GB/s). You
| have to get the data to your compute units. This is your
| memory bandwidth.
|
| GDDR RAM wants to be accessed in bulk - large rows at once.
| Things are easy when each thread wants a consecutive byte, but
| if not, things become much slower. Caching and other
| techniques can help mitigate this, and they're (imo) the
| place for a lot of creativity in architecture design. Having
| more potential FLOPS just means more ALUs.
| winter_blue wrote:
| Ah, things make a lot more sense now.
|
| Thank you for explaining it so clearly.
| wolf550e wrote:
| You would have 600W in a single socket.
| boxfire wrote:
| The Intel Kaby Lake G processors are Radeon RX Vega M GL/GH
| paired on-package with an 8th-gen Core series processor over
| an internal PCIe 3 x8 link. Seems like they could do much
| better, but it's the only such product I know of.
|
| https://fuse.wikichip.org/news/1634/hot-chips-30-intel-kaby-...
| jra101 wrote:
| Tile-based Deferred Rendering came from PowerVR. Prior to
| moving to their own GPU architecture, Apple licensed various
| PowerVR designs.
| ksec wrote:
| Yes, and it goes as far back as the Sega Dreamcast. And the
| technique itself is now used in pretty much all of the
| current GPUs from Nvidia to AMD and ARM.
|
| Still a little pissed at how Apple handled IMG / PowerVR.
| monocasa wrote:
| It goes farther back. You could buy PowerVR discrete GPUs
| for desktops as far back as 1996, IIRC.
| TomVDB wrote:
| Neither AMD nor Nvidia are using deferred rendering.
| Veedrac wrote:
| NVIDIA uses tiled rasterization, but not TBDR. I believe
| the same is true of AMD.
| MangoCoffee wrote:
| Is it going to be based on 14+++nm? AMD/Nvidia are on 7nm.
| sgerenser wrote:
| I thought these were being fabbed by TSMC, presumably at 7nm?
| formerly_proven wrote:
| 10nm+++, actually.
| ivalm wrote:
| I thought Intel said during Architecture Day that GPUs would
| be done at outside fabs (presumably TSMC or Samsung).
| formerly_proven wrote:
| > We also know, due to disclosures made at Intel's
| Architecture Day, that it is set to be built on Intel's
| 10nm Enhanced SuperFin (10ESF, formerly 10++, formerly
| 10+++) manufacturing process, which we believe to be a late
| 2021 process.
___________________________________________________________________
(page generated 2020-08-21 23:00 UTC)