[HN Gopher] Ice Lake AVX-512 Downclocking
       ___________________________________________________________________
        
       Ice Lake AVX-512 Downclocking
        
       Author : ingve
       Score  : 51 points
       Date   : 2020-08-19 19:25 UTC (3 hours ago)
        
 (HTM) web link (travisdowns.github.io)
 (TXT) w3m dump (travisdowns.github.io)
        
       | tarlinian wrote:
       | Hopefully this behavior change will help improve AVX-512 uptake
       | and end the somewhat ridiculous conception people have that the
       | instructions are entirely useless. Intel's HotChips presentation
       | on Icelake-SP also indicated that the behavior will be
       | significantly better on server chips, but the behavior is more
       | instruction dependent with 3 license levels and only 512-bit
       | instructions that utilize the FMA unit being subject to
       | downclocking (by ~15-17% as opposed to ~27-29% on SKX derived
       | chips).
       | 
       | (Intel Hotchips slide:
       | https://images.anandtech.com/doci/15984/202008171757161.jpg)
        
         | CoolGuySteve wrote:
         | I don't think the book is closed. Thermal and TDP downclocking
         | are still present.
         | 
         | It would have been nice to see the vcore and thermal values
         | graphed as part of the benchmark. Do they increase faster for
         | AVX-512 vs the other instruction sets?
         | 
         | I've had problems in the past with Sandybridge, where AVX hit
         | thermal throttles before SSE. I ended up having to disable them
         | in my build because of it. Presumably, the same behaviour would
         | be seen here now that the vector unit is wider and there are
         | more densely packed transistors flipping.
        
           | wmf wrote:
           | Doing work faster is almost always going to consume more
           | power and if you're already at the power or temperature limit
           | (which is how most CPUs/GPUs operate now) then the frequency
           | will have to reduce. This isn't automatically bad; ultimately
           | what matters is performance. Did you really see lower
           | performance with AVX than with SSE?
        
             | donor20 wrote:
             | Isn't the problem if you have just a small % of your
             | workload or some random binaries on your system doing a
             | few 512 instructions they bonk the rest of your
             | performance?
             | 
             | If I had a little background daemon that used 512 because
             | they were cool or in the hotchips presentation, and that
             | bonked my overall system performance that would be
             | annoying.
             | 
             | It's also annoying because intel benchmarks with no
             | mitigations. So what can happen is you think you should be
             | seeing X performance, and then with mitigations applied and
             | some 512 instructions hitting you are at Y performance.
        
             | CoolGuySteve wrote:
             | The problem is that the downclocking affects other cores.
             | So a performance improvement in this one task can hurt
             | performance on other threads, which is what happened to me.
        
               | wmf wrote:
               | Yeah, I think Intel is trying to fix this with Speed
               | Select but it's complex enough that probably no one will
               | use it.
        
               | gameswithgo wrote:
               | it does not affect other cores on all cpus; on most newer
               | ones it doesn't
        
               | BeeOnRope wrote:
               | At least on most new cores, the frequency is per-core.
               | This isn't true on, for example, some Skylake client
               | cores - but these don't have much SIMD related
               | downclocking either.
        
           | BeeOnRope wrote:
           | The book is definitely not closed, but the other limits are
           | somehow less problematic than the license-based downclocking.
           | 
           | You use more power (and get hotter temps: these are
           | exactly proportional, so you can mostly just talk about them
           | as one) with wider vectors because you are doing more work.
           | When you look at it on a per-element basis, you use _less_
           | power per element with wider vectors. E.g., you might use 1
           | pJ per element for 256-bit FMA but only 0.8 for 512-bit FMA.
           | 
           | Of course, since you can do 2x as many total elements in
           | 512-bits on a 2 FMA machine, you can be both more efficient
           | but use more total power, so you can get TDP or thermal
           | limits with 512-bit code that you wouldn't on 256-bit, but it
           | should still be faster and more efficient per element.
           | 
           | All of this assumes you can usefully use the 2x more work
           | with the larger vectors. Sometimes the scaling is worse:
           | e.g., for lots of short arrays, when a lookup table is
           | involved, when additional shuffling or transposition is
           | required with larger vectors, etc. In that case you could end
           | up less efficient with larger vectors.
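
The arithmetic in the comment above can be sketched in a few lines. This is only an illustration: the 1 pJ and 0.8 pJ figures are the comment's own example values, not measurements, and the sketch assumes a fixed clock frequency, two FMA units, and single-precision elements (8 per 256-bit vector, 16 per 512-bit vector).

```python
# Illustrative per-element energy figures taken from the comment above.
# They are example values, not measurements.
PJ_PER_ELEM_256 = 1.0
PJ_PER_ELEM_512 = 0.8

# Assumption: 2 FMA units, single-precision (32-bit) elements.
FMA_UNITS = 2
ELEMS_PER_CYCLE_256 = FMA_UNITS * (256 // 32)  # 16 elements per cycle
ELEMS_PER_CYCLE_512 = FMA_UNITS * (512 // 32)  # 32 elements per cycle


def energy_per_cycle(pj_per_elem, elems_per_cycle):
    """Energy per cycle in pJ; proportional to power at a fixed frequency."""
    return pj_per_elem * elems_per_cycle


p256 = energy_per_cycle(PJ_PER_ELEM_256, ELEMS_PER_CYCLE_256)
p512 = energy_per_cycle(PJ_PER_ELEM_512, ELEMS_PER_CYCLE_512)

power_ratio = p512 / p256
throughput_ratio = ELEMS_PER_CYCLE_512 / ELEMS_PER_CYCLE_256

print(f"throughput ratio (512 vs 256): {throughput_ratio:.2f}")
print(f"power ratio (512 vs 256):      {power_ratio:.2f}")
print(f"energy per element ratio:      {PJ_PER_ELEM_512 / PJ_PER_ELEM_256:.2f}")
```

Under these assumptions the 512-bit code draws more total power while spending less energy per element, which is how it can reach a TDP or thermal limit that the 256-bit code stays under.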
        
           | tarlinian wrote:
           | To reiterate, the problem with the existing license based
           | downclocking is that a few AVX-512 operations can drop your
           | frequency for subsequent scalar loads on the same core so you
           | need to carefully analyze the overall application to make
           | sure that you have enough AVX-512 work over which you can
           | amortize the loss in performance in the rest of your code
           | that is affected by the frequency drop.
           | 
           | If the only issue with AVX-512 is thermal downclocking
           | because you end up using more power, it's almost definitely
           | because you are getting more work done per time. A few
           | AVX-512 instructions in a mostly scalar workload is not going
           | to significantly increase power dissipation and therefore
           | should not induce thermal downclocking, while a heavily
           | utilized AVX-512 kernel will burn power, but should also be
           | doing work twice as fast per instruction.
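
A rough way to reason about the amortization described above is a break-even model. The sketch below is hypothetical and deliberately simplified: it assumes the frequency drop applies to the whole run (real license transitions have their own timing), that the vectorizable part speeds up by a fixed factor, and that everything else scales linearly with frequency. The function names and parameters are made up for illustration; the 0.72 ratio corresponds to the ~27-29% SKX figure quoted at the top of the thread.

```python
def relative_time(vector_fraction, vector_speedup, freq_ratio):
    """Run time relative to the baseline (baseline = 1.0).

    vector_fraction: share of baseline run time that moves to AVX-512
    vector_speedup:  speedup of that share at equal frequency (e.g. 2.0)
    freq_ratio:      downclocked frequency / baseline frequency (e.g. 0.72)
    """
    work = (1.0 - vector_fraction) + vector_fraction / vector_speedup
    return work / freq_ratio


def break_even_fraction(vector_speedup, freq_ratio):
    """Vector fraction at which relative_time() is exactly 1.0."""
    return (1.0 - freq_ratio) / (1.0 - 1.0 / vector_speedup)


if __name__ == "__main__":
    # Frequency ratios picked from the ranges quoted in the thread.
    for label, ratio in (("~28% downclock", 0.72), ("~16% downclock", 0.84)):
        f = break_even_fraction(2.0, ratio)
        print(f"{label}: break-even vector fraction = {f:.2f}")
```

Below the break-even fraction the model predicts a net slowdown, which matches the point that a few AVX-512 instructions in mostly scalar code can cost more than they gain.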
        
         | Dylan16807 wrote:
         | It definitely helps.
         | 
         | And it looks like they've _reduced_ how often a single
         | instruction will cause a lockup as the core shifts to a
         | different power level. But until they've eliminated that
         | issue, it's still scary to toss in a few AVX-512 instructions.
        
         | BeeOnRope wrote:
         | Makes sense - the main difference between ICL and ICL-X would
         | seem to be 2 FMA units.
        
         | The_rationalist wrote:
         | I generally agree but
         | https://www.phoronix.com/scan.php?page=news_item&px=LLVM-Cla...
        
           | gameswithgo wrote:
           | compilers are not at a point where they do a great job of
           | leveraging SIMD in general, and definitely not where they
           | leverage AVX-512, but hand written intrinsics with AVX-512
           | can attain amazing performance.
        
       | paulmd wrote:
       | Given the process improvements in Tiger Lake - I wonder if this
       | improves further, or at least all levels become somewhat faster?
        
       | BooneJS wrote:
       | At Hot Chips 32 this week, Intel mentioned that Tiger Lake Xeon
       | with Sunny Cove core would only downclock if AVX-512 usage hit
       | TDP limits.
        
         | donor20 wrote:
         | That is MUCH better it seems? Because then you don't randomly
         | throw away lots of performance because some 512 items hit so
         | there is less risk in using 512.
        
         | rbanffy wrote:
         | I wonder how many clock cycles that'd take ;-)
        
           | BooneJS wrote:
           | They made it sound like some instructions were more power
           | hungry than others. The impression I got is that the unit can
           | run some kinds of streams without reduction of clock.
        
       | kardos wrote:
       | Using 'licenses' here is odd because it evokes ideas of
       | deliberate-crippling that can be turned off with a
       | subscription...
        
       | rbanffy wrote:
       | Did the Xeon Phi also downclock when using AVX-512?
        
         | th3typh00n wrote:
         | I haven't seen any numbers on that but there's literally zero
         | reason to run a Xeon Phi without using AVX-512, so I'd assume
         | no design considerations were taken to optimize the clock
         | frequency for a non-AVX-512 use case.
        
       | jiggawatts wrote:
       | One thing I really want to know is whether SQL Server's new
       | vector-accelerated "Batch Mode" uses AVX2 only or if it also has
       | AVX-512 code paths?
       | 
       | I'd like to be able to recommend the right CPU to customers, but
       | there just isn't any information out in public about this...
        
         | rbanffy wrote:
         | There are a lot of different factors that can affect
         | performance. I'd advise you to always benchmark.
        
           | jiggawatts wrote:
           | I can't benchmark with a CPU I don't have, and I can't advise
           | a customer to go off and buy a $50K server just for a "quick
           | benchmark".
           | 
           | Even if I were to, say, test with some cloud VMs, even then
           | there are confounding issues. The different VM size
           | categories aren't just different in the CPU type only, there
           | are other differences that'll make the benchmark difficult to
           | interpret. Memory type, throttling, HT on/off, etc...
           | 
           | Why is it so difficult for Microsoft to simply say "AVX-512
           | supported" somewhere in their documentation?
           | 
           | This is like every TV vendor saying "HDMI" instead of "HDMI
           | 2.1" or whatever. Just because the port looks the same
           | doesn't mean that they're identical! Versions matter.
        
       ___________________________________________________________________
       (page generated 2020-08-19 23:00 UTC)