[HN Gopher] Vectorization: Introduction
       ___________________________________________________________________
        
       Vectorization: Introduction
        
       Author : gsav
       Score  : 126 points
       Date   : 2023-06-01 19:50 UTC (3 hours ago)
        
 (HTM) web link (cvw.cac.cornell.edu)
 (TXT) w3m dump (cvw.cac.cornell.edu)
        
       | NKosmatos wrote:
       | I know I'm probably going to get downvoted for this, but I'm
       | going to post it anyhow :-)
       | 
       | https://jvns.ca/blog/2023/02/08/why-does-0-1-plus-0-2-equal-...
        
         | sva_ wrote:
         | Perhaps you could point out in which way you think this is
         | relevant?
        
           | NKosmatos wrote:
           | From the introductory text "The ultimate goal of
           | vectorization is an increase in floating-point
           | performance...". I find it a bit funny that by vectorizing
           | float executions we'll be able to calculate wrong answers
           | faster ;-)
        
             | windsignaling wrote:
             | Still irrelevant.
        
             | m00x wrote:
             | How is this relevant? This is more of a critique of floats
             | than it is about vectorization.
        
             | jcranmer wrote:
              | It's not a wrong answer, any more than writing down 1/3 as
              | 0.333, multiplying that by 3, and discovering that the
              | result is 0.999 and not 1.
             | 
             | And there's nothing in vectorization that requires
             | floating-point numbers; it's just that the algorithms that
             | tend to demand the most out of a computer tend to be large
             | numerical algorithms, which use floating-point.
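              | 
              | A quick way to see both the decimal analogy and the 0.1 +
              | 0.2 case from the linked post (just a toy C++ sketch):
              | 
              |     #include <cstdio>
              | 
              |     int main() {
              |         double third = 0.333;              // 1/3 rounded to three decimal digits
              |         std::printf("%g\n", third * 3);    // prints 0.999, not 1
              |         std::printf("%.17g\n", 0.1 + 0.2); // 0.30000000000000004 in IEEE-754 doubles
              |     }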
        
               | sva_ wrote:
               | > the result is 0.999 and not 1.
               | 
               | But 0.999... is 1.
               | 
               | Edit: Nevermind, I see you meant "writing down 1/3 as
               | 0.333" literally.
        
             | comicjk wrote:
             | They're not wrong answers, they're floating-point answers.
             | What's wrong is the assumption that infinite precision will
             | fit in a handful of bytes.
        
             | shoo wrote:
             | for many practical applications where we compute things,
             | the input data isn't fully accurate, it's an estimate with
             | significant error or uncertainty. when we consider how the
             | outputs of computations are used, in many cases it isn't
             | useful for outputs to be arbitrarily precise. maybe the
             | output is used to make a decision and the expected cost or
             | benefit decision won't significantly change even if the
             | output is wrong by 1%, for example.
             | 
             | one person's "wrong" is another person's approximation. by
             | reducing accuracy we can often get very large performance
             | improvements. good engineering makes an appropriate
             | tradeoff.
             | 
             | there's even more scope for this kind of thing inside some
             | numerical algorithms where having a "pretty good" guess
             | gives a massive speed boost, but the algorithm can be
             | iterated to find quite precise answers even if the "pretty
             | good" guess is an approximation that's off by 20%
             | 
             | there's yet more scope for this if we consider if we're
             | trying to mathematically model reality -- even if we
             | perform all the math with arbitrary precision calculations,
             | the theoretical model itself is still an approximation of
             | reality, there's some error there. there's limited value in
             | solving a model exactly, as there's always some
             | approximation error between the model and reality.
             | 
             | it's amazing we have such good support for high-performance
             | scientific quality ieee floating point math everywhere in
             | commodity hardware. it's a great tool to have in the
             | toolbelt
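              | 
              | A toy illustration of the "pretty good guess" point, in
              | C++ (just a sketch): Newton's method for a square root
              | converges to near machine precision in a few iterations
              | even when the starting guess is off by roughly 20%.
              | 
              |     #include <cmath>
              |     #include <cstdio>
              | 
              |     // Newton iteration for sqrt(a): x <- (x + a/x) / 2
              |     int main() {
              |         double a = 2.0;
              |         double x = 1.7;  // "pretty good" guess, ~20% above sqrt(2)
              |         for (int i = 0; i < 5; ++i) {
              |             x = 0.5 * (x + a / x);
              |             std::printf("iter %d: %.17g\n", i, x);
              |         }
              |         std::printf("error vs sqrt: %g\n", x - std::sqrt(a));
              |     }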
        
       | nologic01 wrote:
        | We also need something like Tensorization: Introduction
        
         | alfalfasprout wrote:
         | Tensor ops are usually implemented using the principles of
         | vectorizing code.
         | 
          | After all, most tensors are implemented as 1D arrays, with a
          | stride for each dimension and the innermost dimension always
          | contiguous.
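          | 
          | For illustration, a minimal sketch of that layout (the names
          | here are just made up for the example):
          | 
          |     #include <cstdio>
          |     #include <vector>
          | 
          |     // A 2x3 "tensor" stored as one flat, row-major array.
          |     // strides[d] = elements to skip to advance one step in dimension d.
          |     int main() {
          |         std::vector<float> data = {0, 1, 2, 3, 4, 5}; // shape (2, 3)
          |         int strides[2] = {3, 1};  // innermost stride is 1 => contiguous
          |         int i = 1, j = 2;         // logical index (1, 2)
          |         std::printf("%g\n", data[i * strides[0] + j * strides[1]]); // 5
          |     }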
        
           | sva_ wrote:
           | I think the bigger difficulty lies in how to store and
           | process this stuff on a GPU/other accelerator.
           | 
           | If the OP cares about implementation details about how an API
           | like PyTorch is made, I think the MiniTorch 'book' is a
           | pretty good intro:
           | 
           | https://minitorch.github.io/
        
           | richardw wrote:
           | ChatGPT interpretation:
           | 
           | Sure! Let me break it down for you:
           | 
           | When we talk about "tensor ops," we're referring to
           | operations performed on tensors, which are multidimensional
           | arrays of numbers commonly used in mathematics and computer
           | science.
           | 
           | Now, these tensor operations are typically implemented using
           | a technique called "vectorizing code." In simple terms,
           | vectorizing code means performing operations on entire arrays
           | of data instead of looping through each element one by one.
           | It's like doing multiple calculations at once, which can be
           | more efficient and faster.
           | 
           | Tensors are usually represented as 1D arrays, meaning all the
           | elements are arranged in a single line. Each dimension of the
           | tensor has a concept called "stride," which represents how
           | many elements we need to skip to move to the next element in
           | that dimension. This helps us efficiently access and
           | manipulate the data in the tensor.
           | 
           | Additionally, the innermost dimension of a tensor is always
           | "contiguous," which means the elements are stored
           | sequentially without any gaps. This arrangement also aids in
           | efficient processing of the tensor data.
           | 
           | So, in summary, tensor ops involve performing operations on
           | multidimensional arrays, and we use vectorized code to do
           | these operations more efficiently. Tensors are represented as
           | 1D arrays with strides for each dimension, and the innermost
           | dimension is always stored sequentially without any gaps.
        
           | riku_iki wrote:
           | > the principles of vectorizing code. > 1D arrays
           | 
           | I think accelerators mostly do computations over small
           | matrices and not vectors, so it is "matricized" code, but I
           | am not an expert in this area.
        
         | amelius wrote:
         | We really need to name tensors differently. "Array" would be a
         | more fitting name.
         | 
          | Tensors are really different mathematical objects with a far
          | richer structure than those used in the Deep Learning
          | context.
         | 
         | https://en.wikipedia.org/wiki/Tensor
        
           | nologic01 wrote:
           | I think that naming battle is lost.
           | 
            | Would indeed be cool to have _true_ tensor processing units
            | :-)
            | 
            | PKD level of cool, as you could argue those chips directly
            | process spacetime chunks.
        
           | sva_ wrote:
           | Names can have different meanings based on context, even
           | within pure mathematics. I think a mathematician would have
           | little trouble discerning which kind of tensor is meant, as
            | this sort of reuse of existing terms is pretty common.
           | 
           | Just calling it an array might be underselling it, at this
           | point. Perhaps a tensor in the context of CompSci is just a
           | particular type of array with certain expected properties.
        
           | itishappy wrote:
           | Same with vectors!
           | 
            | https://en.wikipedia.org/wiki/Vector_(mathematics_and_physic...
        
           | opportune wrote:
            | You could apply that argument to "isomorphic" functions, C++
            | "vectors" or compute "vectorization", "integer" types... even
            | though the semantics aren't exactly the same between the
            | computing and math terms, they're related enough that they do
            | help with understanding for those coming from a math
            | background (which to be fair is probably fewer and fewer
            | programmers over time).
        
             | myownpetard wrote:
             | Isomorphic javascript is one of the most egregious.
        
         | godelski wrote:
         | This is mostly going to be the same. The main difference in
          | tensorization is that your vectors are not contiguous. So you
          | lose some speedup. How you typically optimize this is by
          | flattening your tensor. Then your data is contiguous and all
          | the same rules apply.
         | 
         | This is overly simplified though. Things get different when we
         | start talking about {S,M}I{S,M}D (page addresses SIMD) or GPUs.
         | Parallel computing is a whole other game, and CUDA takes it to
         | the next level (lots of people can write CUDA kernels, not a
         | lot of people can write GOOD CUDA kernels (I'm definitely not a
         | pro, but know some)). Parallelism adds another level of
         | complexity due to being able to access different memory
         | locations simultaneously. If you think concurrent optimization
         | is magic, parallel optimization is wizardry (I still believe
         | this is true)
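          | 
          | A rough sketch of the flattening idea, assuming the tensor's
          | storage is already contiguous:
          | 
          |     #include <vector>
          | 
          |     // Elementwise scale of a (D0 x D1 x D2) tensor: instead of three
          |     // nested loops, treat the contiguous buffer as one flat 1D array.
          |     // The single loop is the easiest shape for a compiler to vectorize.
          |     void scale(std::vector<float>& t, float s) {
          |         for (float& v : t)  // one flat pass over the whole buffer
          |             v *= s;
          |     }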
        
         | corsix wrote:
         | From a hardware perspective, vector instructions operate on
         | small 1D vectors, whereas tensor instructions operate on small
         | 2D matrices. I say "instructions", but it's really only matrix
         | multiply or matrix multiply and accumulate - most other
         | instructions are fine staying as 1D.
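          | 
          | Conceptually (not any particular ISA), one such tensor
          | instruction computes a small tile multiply-accumulate in one
          | shot, something like:
          | 
          |     // D = A * B + C on small fixed-size tiles (here 4x4).
          |     void mma4x4(const float A[4][4], const float B[4][4],
          |                 const float C[4][4], float D[4][4]) {
          |         for (int i = 0; i < 4; ++i)
          |             for (int j = 0; j < 4; ++j) {
          |                 float acc = C[i][j];
          |                 for (int k = 0; k < 4; ++k)
          |                     acc += A[i][k] * B[k][j];
          |                 D[i][j] = acc;
          |             }
          |     }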
        
           | nologic01 wrote:
            | If there is matrix multiply at the hardware level, it's fair
            | to have another name than vectorization. For example, the
            | dimensions and partitioning of large matrices to fit would be
            | specific to that design and very different from rolling
            | things out on 1D arrays.
        
       | dragontamer wrote:
       | Auto-vectorization is easier to get into than other SIMD-
       | frameworks like CUDA, OpenCL, ROCm, Intel's ISPC and whatnot. But
       | in my experience, auto-vectorizers are just not as flexible as
       | the proper SIMD-tools.
       | 
        | I'd say auto-vectorization should still be learned by modern
        | high-performance programmers, because it's very low-hanging
        | fruit. You barely have to do anything and suddenly your for-loops
        | are AVX512-optimized, though maybe not to the fullest extent
        | possible.
       | 
       | Still, I suggest that programmers also learn how to properly make
       | SIMD code. Maybe intrinsics are too hard in practice, but ISPC,
       | CUDA, and other SIMD-programming environments make things far
       | easier to learn than you might expect.
       | 
       | ------------
       | 
       | ISPC in particular is Intel's SIMD programming language, much
       | akin to CUDA except it compiles into AVX512. So for AVX512-like
       | code execution environments, using the ISPC language/compiler is
       | exceptionally useful.
       | 
        | It's harder to learn a new language than to learn a few compiler
        | options to enable auto-vectorization, however. So in practice,
        | auto-vectorization will continue to be used. But for tasks that
        | would specifically benefit from SIMD thinking, the dedicated ISPC
        | language should be beneficial.
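        | 
        | For a sense of the low-hanging fruit: a plain loop like the one
        | below is typically auto-vectorized at -O3 (marking the pointers
        | __restrict, or letting the compiler insert an overlap check,
        | handles the aliasing question), and reports such as Clang's
        | -Rpass=loop-vectorize or GCC's -fopt-info-vec tell you whether
        | it happened. A minimal sketch:
        | 
        |     // saxpy: y[i] += a * x[i]. With -O3 (and e.g. -march=native)
        |     // most compilers turn this straight loop into SSE/AVX/AVX-512
        |     // instructions on their own.
        |     void saxpy(float a, const float* x, float* y, int n) {
        |         for (int i = 0; i < n; ++i)
        |             y[i] += a * x[i];
        |     }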
        
         | sva_ wrote:
          | > I'd say auto-vectorization should still be learned by modern
          | high-performance programmers, because it's very low-hanging
          | fruit.
         | 
         | Somewhat related: 55 GiB/s FizzBuzz
         | https://codegolf.stackexchange.com/questions/215216/high-thr...
         | (discussion: https://news.ycombinator.com/item?id=29031488)
        
           | saagarjha wrote:
           | Definitely not autovectorized.
        
         | jackmott42 wrote:
         | What do you mean by learning auto vectorization? Do you mean
         | learning how to structure your code so that your compiler of
         | choice has a good chance of pulling it off?
        
           | Veliladon wrote:
           | Pretty much.
           | 
            | If one wants to sum two arrays, instead of for-looping
            | through the elements one by one, one might iterate by chunks
            | so that the compiler can easily tie each chunk together into
            | a single operation, which it can then vectorize.
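            | 
            | A rough sketch of that chunked style in C++ (the block size
            | here is arbitrary):
            | 
            |     // Sum two arrays in fixed-size blocks. Each block is a small,
            |     // branch-free unit of work that maps naturally onto one vector
            |     // register's worth of lanes.
            |     void add(const float* a, const float* b, float* out, int n) {
            |         constexpr int B = 8;  // e.g. 8 floats = one AVX register
            |         int i = 0;
            |         for (; i + B <= n; i += B)
            |             for (int j = 0; j < B; ++j)  // one vector add, ideally
            |                 out[i + j] = a[i + j] + b[i + j];
            |         for (; i < n; ++i)               // scalar tail
            |             out[i] = a[i] + b[i];
            |     }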
        
       | photochemsyn wrote:
        | This looks like a curious case: non-fixed-length vectorization
        | code in RISC-V:
       | 
       | > "Perhaps the most interesting part of the open RISC-V
       | instruction set architecture (ISA) is the vector extension
       | (RISC-V "V"). In contrast to the average single-instruction
       | multipe-data (SIMD) instruction set, RISC-V vector instructions
       | are vector length agnostic (VLA). Thus, a RISC-V "V" CPU is
       | flexible in choosing a vector register size while RISC-V "V"
       | binary code is portable between different CPU implementations."
       | 
       | https://gms.tf/riscv-vector.html
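        | 
        | In pseudo-C the vector-length-agnostic style looks roughly like
        | the sketch below (conceptual only, not real RVV intrinsics): the
        | hardware reports how many elements it will handle per iteration,
        | so the same binary runs unchanged on CPUs with different vector
        | widths.
        | 
        |     // Strip-mined VLA-style loop: ask how many elements (vl) can be
        |     // processed this iteration, do that many, advance. VLMAX differs
        |     // per CPU, but the code does not change.
        |     void scale(float* x, long n, float s) {
        |         const long VLMAX = 8;  // stand-in; the real value comes from the hardware
        |         for (long i = 0; i < n; ) {
        |             long vl = (n - i < VLMAX) ? (n - i) : VLMAX;  // "vsetvl"-like step
        |             for (long j = 0; j < vl; ++j)  // stands in for one vector op over vl lanes
        |                 x[i + j] *= s;
        |             i += vl;
        |         }
        |     }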
        
       | divbzero wrote:
       | With interpreted languages such as Python or MATLAB, the
       | performance gains come not only from hardware parallelism but
       | also from the use of compiled libraries instead of interpreted
       | code.
        
       | saagarjha wrote:
       | > Vectorization is a process by which floating-point computations
       | in scientific code are compiled into special instructions that
       | execute elementary operations (+,-,*, etc.) or functions (exp,
       | cos, etc.) in parallel on fixed-size vector arrays.
       | 
       | I guess this is a scientific computing course or something but I
       | feel like even so it's important to point out that most
       | processors have many vector instructions that operate on
       | integers, bitfields, characters, and the like. The fundamental
       | premise of "do a thing on a bunch of data at once" isn't limited
       | to just floating point operations.
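        | 
        | For instance, a purely integer loop like this is just as
        | amenable to vectorization as a floating-point one (a minimal
        | sketch):
        | 
        |     // Count bytes above a threshold: integer compares and adds, no
        |     // floats anywhere, yet compilers routinely vectorize this with
        |     // integer SIMD instructions.
        |     int count_above(const unsigned char* p, int n, unsigned char t) {
        |         int count = 0;
        |         for (int i = 0; i < n; ++i)
        |             count += (p[i] > t);
        |         return count;
        |     }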
        
       ___________________________________________________________________
       (page generated 2023-06-01 23:00 UTC)