[HN Gopher] Auto-vectorization for the masses (2011)
___________________________________________________________________

Auto-vectorization for the masses (2011)

Author : lelf
Score  : 26 points
Date   : 2020-02-15 05:46 UTC (17 hours ago)

(HTM) web link (leiradel.github.io)
(TXT) w3m dump (leiradel.github.io)

| epistasis wrote:
| Very interesting and useful to see.
|
| And on an entirely different approach to vectorization for the
| masses: I do wish it were easier to access vectorization through
| BLAS, a library that is well supported across nearly all
| languages and massively optimized, but hard to install
| correctly.
| chewxy wrote:
| Good news: the Gonum team has been working on an optimized
| pure-Go version of BLAS. It's at parity with Netlib BLAS for
| some of the important functions (GEMV, etc.).
|
| Why is this good news? Go is a very easy-to-use language, and
| it favours cross-compilation, which makes it available across
| different platforms. To install, one simply runs `go get
| gonum.org/v1/gonum`.
| jedbrown wrote:
| Netlib BLAS is a very low bar [1], and not at all how one
| should go about writing a performance-portable BLAS. BLIS
| (https://github.com/flame/blis/) is a much better approach,
| and it underlies vendor implementations on AMD
| (https://developer.amd.com/amd-aocl/blas-library/) and many
| embedded systems.
|
| [1] GEMV is entirely limited by memory bandwidth, and thus
| quite uninteresting from a vectorization standpoint. Maybe
| you meant GEMM?
| marklacey wrote:
| I only barely skimmed the post and the follow-on posts, so this
| is less about that and more about autovectorizers in general.
|
| Autovectorization is the wrong approach to data parallelism.
| You don't want to rely on a brittle, unpredictable code
| transformation for performance in this case. You want to bake
| the parallelism into the programming model.
|
| ispc takes this approach, and it results in performance
| predictability to a large degree.
| You can imagine other approaches as well, like explicitly
| data-parallel loops, or a declarative approach.
|
| Most of these (and the GPU data-parallel models) rely to a very
| large extent on the programmer to manage data dependencies to
| ensure correctness.
| llukas wrote:
| Just for the record: you rely on performance tests to guarantee
| performance, and nothing else.
| tom_mellior wrote:
| > You don't want to rely on a brittle, unpredictable code
| > transformation for performance in this case.
|
| That's somewhat true, but much of the unpredictability could be
| removed if compilers provided annotations saying "I expect this
| loop to be vectorized", where the compiler would be forced to
| report an error if it didn't manage to do it.
| rsp1984 wrote:
| This has been done by Intel: https://ispc.github.io
| tom_ wrote:
| More about ispc, from Matt Pharr:
| https://pharr.org/matt/blog/2018/04/30/ispc-all.html - includes
| some discussion of Intel's corporate culture. Interesting
| throughout.
| tom_mellior wrote:
| So... skimming this post and its successors, I didn't see any
| actual examples of generated vector code, especially not
| examples that GCC can't handle even though they are supposedly
| "easy". And no benchmarks. Did I miss anything, or did this
| project really die before it got to vectorization (or anything
| more interesting than constant folding)?
___________________________________________________________________
(page generated 2020-02-15 23:00 UTC)