[HN Gopher] Ice Lake AVX-512 Downclocking
___________________________________________________________________
Ice Lake AVX-512 Downclocking

Author : ingve
Score : 51 points
Date : 2020-08-19 19:25 UTC (3 hours ago)

(HTM) web link (travisdowns.github.io)
(TXT) w3m dump (travisdowns.github.io)

| tarlinian wrote:
| Hopefully this behavior change will help improve AVX-512 uptake
| and end the somewhat ridiculous conception people have that the
| instructions are entirely useless. Intel's Hot Chips presentation
| on Ice Lake-SP also indicated that the behavior will be
| significantly better on server chips, but the behavior is more
| instruction dependent, with 3 license levels and only 512-bit
| instructions that utilize the FMA unit being subject to
| downclocking (by ~15-17% as opposed to ~27-29% on SKX-derived
| chips).
|
| (Intel Hot Chips slide:
| https://images.anandtech.com/doci/15984/202008171757161.jpg)
| CoolGuySteve wrote:
| I don't think the book is closed. Thermal and TDP downclocking
| are still present.
|
| It would have been nice to see the vcore and thermal values
| graphed as part of the benchmark. Do they increase faster for
| AVX-512 vs the other instruction sets?
|
| I've had problems in the past with Sandy Bridge, where AVX hit
| thermal throttles before SSE. I ended up having to disable AVX
| in my build because of it. Presumably, the same behaviour would
| be seen here now that the vector unit is wider and there are
| more densely packed transistors flipping.
| wmf wrote:
| Doing work faster is almost always going to consume more
| power, and if you're already at the power or temperature limit
| (which is how most CPUs/GPUs operate now) then the frequency
| will have to reduce. This isn't automatically bad; ultimately
| what matters is performance. Did you really see lower
| performance with AVX than with SSE?
| donor20 wrote:
| Isn't the problem that if you have just a small % of your
| workload or some random binaries on your system doing a
| few 512 instructions, they bonk the rest of your
| performance?
|
| If I had a little background daemon that used 512 because
| they were cool or in the Hot Chips presentation, and that
| bonked my overall system performance, that would be
| annoying.
|
| It's also annoying because Intel benchmarks with no
| mitigations. So what can happen is you think you should be
| seeing X performance, and then with mitigations applied and
| some 512 instructions hitting you are at Y performance.
| CoolGuySteve wrote:
| The problem is that the downclocking affects other cores.
| So a performance improvement in this one task can hurt
| performance on other threads, which is what happened to me.
| wmf wrote:
| Yeah, I think Intel is trying to fix this with Speed
| Select but it's complex enough that no one will probably
| use it.
| gameswithgo wrote:
| It does not affect other cores on all CPUs; on most newer
| ones it doesn't.
| BeeOnRope wrote:
| At least on most new cores, the frequency is per-core.
| This isn't true on, for example, some Skylake client
| cores - but these don't have much SIMD-related
| downclocking either.
| BeeOnRope wrote:
| The book is definitely not closed, but the other limits are
| somehow less problematic than the license-based downclocking.
|
| You use more power (and get hotter temps: these are
| exactly proportional, so you can mostly just talk about them
| as one) with wider vectors because you are doing more work.
| When you look at it on a per-element basis, you use _less_
| power per element with wider vectors. E.g., you might use 1
| pJ per element for 256-bit FMA but only 0.8 for 512-bit FMA.
|
| Of course, since you can do 2x as many total elements in
| 512-bits on a 2 FMA machine, you can be both more efficient
| but use more total power, so you can hit TDP or thermal
| limits with 512-bit code that you wouldn't on 256-bit, but it
| should still be faster and more efficient per element.
|
| All of this assumes you can usefully use the 2x more work
| with the larger vectors. Sometimes the scaling is worse:
| e.g., for lots of short arrays, when a lookup table is
| involved, when additional shuffling or transposition is
| required with larger vectors, etc. In that case you could end
| up less efficient with larger vectors.
| tarlinian wrote:
| To reiterate, the problem with the existing license-based
| downclocking is that a few AVX-512 operations can drop your
| frequency for subsequent scalar loads on the same core, so you
| need to carefully analyze the overall application to make
| sure that you have enough AVX-512 work over which you can
| amortize the loss in performance in the rest of your code
| that is affected by the frequency drop.
|
| If the only issue with AVX-512 is thermal downclocking
| because you end up using more power, it's almost definitely
| because you are getting more work done per time. A few
| AVX-512 instructions in a mostly scalar workload are not going
| to significantly increase power dissipation and therefore
| should not induce thermal downclocking, while a heavily
| utilized AVX-512 kernel will burn power, but should also be
| doing work twice as fast per instruction.
| Dylan16807 wrote:
| It definitely helps.
|
| And it looks like they've _reduced_ how often a single
| instruction will cause a lockup as the core shifts to a
| different power level. But until they've eliminated that
| issue, it's still scary to toss in a few AVX-512 instructions.
| BeeOnRope wrote:
| Makes sense - the main difference between ICL and ICL-X would
| seem to be 2 FMA units.
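
Editor's sketch: tarlinian's amortization point above can be put into rough numbers. This is a toy model, not a measurement, and every name in it is hypothetical. It assumes a fraction f of the original runtime can use AVX-512, that this fraction runs s times faster at equal clock, and (pessimistically) that the whole run executes at a reduced relative frequency r because the license is never released. The values 0.72 and 0.85 are illustrative picks loosely based on the ~27-29% and ~15-17% figures quoted earlier in the thread.

```python
def relative_runtime(f, s, r):
    """Runtime relative to an all-scalar run at full clock (1.0 = no change).

    f: fraction of the original runtime that can use AVX-512 (0..1)
    s: assumed speedup of that fraction at equal clock (e.g. 2.0)
    r: assumed core frequency relative to nominal (e.g. 0.85 for a 15% drop)
    Pessimistic assumption: the entire run executes at the reduced clock.
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be in [0, 1]")
    if s <= 0.0 or not 0.0 < r <= 1.0:
        raise ValueError("s must be > 0 and r must be in (0, 1]")
    return ((1.0 - f) + f / s) / r


def break_even_fraction(s, r):
    """Fraction f at which relative_runtime(f, s, r) equals 1.0.

    Solves (1 - f + f/s) / r = 1 for f. A result above 1.0 means the
    vectorized code never pays for the frequency drop in this model.
    """
    if s <= 1.0:
        raise ValueError("s must be > 1 for a break-even point to exist")
    return (1.0 - r) / (1.0 - 1.0 / s)


if __name__ == "__main__":
    # Illustrative inputs only; see the assumptions stated above.
    for label, r in (("larger drop (r=0.72)", 0.72), ("smaller drop (r=0.85)", 0.85)):
        f = break_even_fraction(2.0, r)
        print(f"{label}: break-even vectorizable fraction = {f:.2f}")
```

In this model a smaller frequency drop lowers the break-even fraction, which matches the thread's point that sprinkling in a few AVX-512 instructions becomes less risky as the license penalty shrinks. Real cores recover their clock after a delay, so an actual workload can do better than this worst case.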
| The_rationalist wrote:
| I generally agree but
| https://www.phoronix.com/scan.php?page=news_item&px=LLVM-Cla...
| gameswithgo wrote:
| Compilers are not at a point where they do a great job of
| leveraging SIMD in general, and definitely not where they
| leverage AVX-512, but hand-written intrinsics with AVX-512
| can attain amazing performance.
| paulmd wrote:
| Given the process improvements in Tiger Lake - I wonder if this
| improves further, or at least all levels become somewhat faster?
| BooneJS wrote:
| At Hot Chips 32 this week, Intel mentioned that Tiger Lake Xeon
| with Sunny Cove core would only downclock if AVX-512 usage hit
| TDP limits.
| donor20 wrote:
| That is MUCH better it seems? Because then you don't randomly
| throw away lots of performance because some 512 items hit, so
| there is less risk in using 512.
| rbanffy wrote:
| I wonder how many clock cycles that'd take ;-)
| BooneJS wrote:
| They made it sound like some instructions were more power
| hungry than others. The impression I got is that the unit can
| run some kinds of streams without reduction of clock.
| kardos wrote:
| Using 'licenses' here is odd because it evokes ideas of
| deliberate crippling that can be turned off with a
| subscription...
| rbanffy wrote:
| Did the Xeon Phi also downclock when using AVX-512?
| th3typh00n wrote:
| I haven't seen any numbers on that but there's literally zero
| reason to run a Xeon Phi without using AVX-512, so I'd assume
| no design considerations were taken to optimize the clock
| frequency for a non-AVX-512 use case.
| jiggawatts wrote:
| One thing I really want to know is whether SQL Server's new
| vector-accelerated "Batch Mode" uses AVX2 only or if it also has
| AVX-512 code paths?
|
| I'd like to be able to recommend the right CPU to customers, but
| there just isn't any information out in public about this...
| rbanffy wrote:
| There are a lot of different factors that can affect
| performance. I'd advise you to always benchmark.
| jiggawatts wrote:
| I can't benchmark with a CPU I don't have, and I can't advise
| a customer to go off and buy a $50K server just for a "quick
| benchmark".
|
| Even if I were to, say, test with some cloud VMs, even then
| there are confounding issues. The different VM size
| categories aren't just different in the CPU type only, there
| are other differences that'll make the benchmark difficult to
| interpret. Memory type, throttling, HT on/off, etc...
|
| Why is it so difficult for Microsoft to simply say "AVX-512
| supported" somewhere in their documentation?
|
| This is like every TV vendor saying "HDMI" instead of "HDMI
| 2.1" or whatever. Just because the port looks the same
| doesn't mean that they're identical! Versions matter.
___________________________________________________________________
(page generated 2020-08-19 23:00 UTC)