[HN Gopher] Vectorization: Introduction ___________________________________________________________________ Vectorization: Introduction Author : gsav Score : 126 points Date : 2023-06-01 19:50 UTC (3 hours ago) (HTM) web link (cvw.cac.cornell.edu) (TXT) w3m dump (cvw.cac.cornell.edu) | NKosmatos wrote: | I know I'm probably going to get downvoted for this, but I'm | going to post it anyhow :-) | | https://jvns.ca/blog/2023/02/08/why-does-0-1-plus-0-2-equal-... | sva_ wrote: | Perhaps you could point out in which way you think this is | relevant? | NKosmatos wrote: | From the introductory text "The ultimate goal of | vectorization is an increase in floating-point | performance...". I find it a bit funny that by vectorizing | float operations we'll be able to calculate wrong answers | faster ;-) | windsignaling wrote: | Still irrelevant. | m00x wrote: | How is this relevant? This is more of a critique of floats | than it is about vectorization. | jcranmer wrote: | It's not a wrong answer, any more than writing down 1/3 as | 0.333, multiplying that by 3, and discovering that the | result is 0.999 and not 1. | | And there's nothing in vectorization that requires | floating-point numbers; it's just that the algorithms that | tend to demand the most out of a computer tend to be large | numerical algorithms, which use floating-point. | sva_ wrote: | > the result is 0.999 and not 1. | | But 0.999... is 1. | | Edit: Never mind, I see you meant "writing down 1/3 as | 0.333" literally. | comicjk wrote: | They're not wrong answers, they're floating-point answers. | What's wrong is the assumption that infinite precision will | fit in a handful of bytes. | shoo wrote: | for many practical applications where we compute things, | the input data isn't fully accurate, it's an estimate with | significant error or uncertainty. when we consider how the | outputs of computations are used, in many cases it isn't | useful for outputs to be arbitrarily precise.
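The rounding behavior this subthread is debating is easy to see directly. A minimal Python sketch (an illustration added here, not from the thread) shows both the 0.1 + 0.2 case from the linked post and jcranmer's decimal 1/3 analogy:

```python
# 0.1 and 0.2 have no exact binary64 representation, so their sum is not
# exactly 0.3 -- but it is the correctly rounded result, not a "wrong" one.
import math

a = 0.1 + 0.2
print(a)                      # 0.30000000000000004
print(a == 0.3)               # False
print(math.isclose(a, 0.3))   # True: equal within a relative tolerance

# jcranmer's analogy in decimal: round 1/3 to 3 digits, multiply back.
third = round(1 / 3, 3)       # 0.333
print(third * 3)              # not exactly 1 -- the same rounding story
```

The point being made upthread is that the answer is exact arithmetic on rounded inputs, not incorrect arithmetic.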
maybe the | output is used to make a decision, and the expected cost or | benefit of the decision won't significantly change even if the | output is wrong by 1%, for example. | | one person's "wrong" is another person's approximation. by | reducing accuracy we can often get very large performance | improvements. good engineering makes an appropriate | tradeoff. | | there's even more scope for this kind of thing inside some | numerical algorithms where having a "pretty good" guess | gives a massive speed boost, but the algorithm can be | iterated to find quite precise answers even if the "pretty | good" guess is an approximation that's off by 20%. | | there's yet more scope for this if we consider that we're | trying to mathematically model reality -- even if we | perform all the math with arbitrary-precision calculations, | the theoretical model itself is still an approximation of | reality, so there's some error there. there's limited value in | solving a model exactly, as there's always some | approximation error between the model and reality. | | it's amazing we have such good support for high-performance, | scientific-quality IEEE floating point math everywhere in | commodity hardware. it's a great tool to have in the | toolbelt. | nologic01 wrote: | We also need something like Tensorization: Introduction | alfalfasprout wrote: | Tensor ops are usually implemented using the principles of | vectorizing code. | | After all, most tensors are implemented as 1D arrays with | strides for each dimension, with the innermost dimension always | contiguous. | sva_ wrote: | I think the bigger difficulty lies in how to store and | process this stuff on a GPU/other accelerator. | | If the OP cares about implementation details of how an API | like PyTorch is made, I think the MiniTorch 'book' is a | pretty good intro: | | https://minitorch.github.io/ | richardw wrote: | ChatGPT interpretation: | | Sure!
Let me break it down for you: | | When we talk about "tensor ops," we're referring to | operations performed on tensors, which are multidimensional | arrays of numbers commonly used in mathematics and computer | science. | | Now, these tensor operations are typically implemented using | a technique called "vectorizing code." In simple terms, | vectorizing code means performing operations on entire arrays | of data instead of looping through each element one by one. | It's like doing multiple calculations at once, which can be | more efficient and faster. | | Tensors are usually represented as 1D arrays, meaning all the | elements are arranged in a single line. Each dimension of the | tensor has a concept called "stride," which represents how | many elements we need to skip to move to the next element in | that dimension. This helps us efficiently access and | manipulate the data in the tensor. | | Additionally, the innermost dimension of a tensor is always | "contiguous," which means the elements are stored | sequentially without any gaps. This arrangement also aids in | efficient processing of the tensor data. | | So, in summary, tensor ops involve performing operations on | multidimensional arrays, and we use vectorized code to do | these operations more efficiently. Tensors are represented as | 1D arrays with strides for each dimension, and the innermost | dimension is always stored sequentially without any gaps. | riku_iki wrote: | > the principles of vectorizing code. > 1D arrays | | I think accelerators mostly do computations over small | matrices and not vectors, so it is "matricized" code, but I | am not an expert in this area. | amelius wrote: | We really need to name tensors differently. "Array" would be a | more fitting name. | | Tensors are really different mathematical objects with a far | richer structure than those used in the Deep Learning | context. | | https://en.wikipedia.org/wiki/Tensor | nologic01 wrote: | I think that naming battle is lost.
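The "1D array + strides" layout described above can be sketched concretely. This example (added for illustration) uses NumPy as a stand-in for how tensor libraries store data; PyTorch exposes the same layout via `Tensor.stride()`, though in element counts rather than bytes:

```python
import numpy as np

# A 2x3x4 float32 tensor backed by one flat 24-element buffer.
t = np.arange(24, dtype=np.float32).reshape(2, 3, 4)

# Strides are in bytes: to move one step along a dimension, skip this many.
print(t.strides)        # (48, 16, 4) for float32: 3*4*4, 4*4, 4
# The innermost dimension's stride equals the itemsize, i.e. it is contiguous.
assert t.strides[-1] == t.itemsize

# Element (i, j, k) lives at flat offset i*12 + j*4 + k in the 1D buffer.
i, j, k = 1, 2, 3
flat_index = i * 12 + j * 4 + k
assert t[i, j, k] == t.ravel()[flat_index]

# Transposing only permutes the strides -- no data is copied -- but the
# innermost dimension is no longer contiguous, which hurts vectorization.
tt = t.transpose(2, 1, 0)
print(tt.strides)       # (4, 16, 48)
```

This is also why the flattening trick mentioned elsewhere in the thread works: a contiguous buffer lets the innermost loop run over adjacent memory.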
| | Would indeed be cool to have _true_ tensor processing units | :-) | | PKD level of cool, as you could argue those chips directly | process spacetime chunks | sva_ wrote: | Names can have different meanings based on context, even | within pure mathematics. I think a mathematician would have | little trouble discerning which kind of tensor is meant, as | this sort of reuse of existing terms is pretty common. | | Just calling it an array might be underselling it, at this | point. Perhaps a tensor in the context of CompSci is just a | particular type of array with certain expected properties. | itishappy wrote: | Same with vectors! | | https://en.wikipedia.org/wiki/Vector_(mathematics_and_physic. | .. | opportune wrote: | You could apply that argument to "isomorphic" functions, C++ | "vectors" or compute "vectorization", "integer" types... even | though the semantics aren't exactly the same between the | computing and math terms, they're related enough that they do | help with understanding for those coming from a math | background (which to be fair is probably fewer and fewer | programmers over time). | myownpetard wrote: | Isomorphic javascript is one of the most egregious. | godelski wrote: | This is mostly going to be the same. The main difference in | tensorization is that your vectors are not contiguous. So you | lose some speedup. How you typically optimize this is by | flattening your tensor. Then your data is contiguous and all | the same rules apply. | | This is overly simplified though. Things get different when we | start talking about {S,M}I{S,M}D (the page addresses SIMD) or GPUs. | Parallel computing is a whole other game, and CUDA takes it to | the next level (lots of people can write CUDA kernels, not a | lot of people can write GOOD CUDA kernels (I'm definitely not a | pro, but know some)). Parallelism adds another level of | complexity due to being able to access different memory | locations simultaneously.
If you think concurrent optimization | is magic, parallel optimization is wizardry (I still believe | this is true). | corsix wrote: | From a hardware perspective, vector instructions operate on | small 1D vectors, whereas tensor instructions operate on small | 2D matrices. I say "instructions", but it's really only matrix | multiply or matrix multiply and accumulate - most other | instructions are fine staying as 1D. | nologic01 wrote: | If there is matrix multiply at the hardware level, it's fair to | have another name than vectorization. For example, the | dimensions and partitioning of large matrices to fit would be | specific to that design and very different from rolling | things out on 1D arrays. | dragontamer wrote: | Auto-vectorization is easier to get into than other SIMD | frameworks like CUDA, OpenCL, ROCm, Intel's ISPC and whatnot. But | in my experience, auto-vectorizers are just not as flexible as | the proper SIMD tools. | | I'd say auto-vectorization should still be learned by modern | high-performance programmers, because it's very low-hanging fruit. | You barely have to do anything and suddenly your for-loops are | AVX512-optimized, though maybe not to the fullest extent | possible. | | Still, I suggest that programmers also learn how to properly write | SIMD code. Maybe intrinsics are too hard in practice, but ISPC, | CUDA, and other SIMD programming environments make things far | easier to learn than you might expect. | | ------------ | | ISPC in particular is Intel's SIMD programming language, much | akin to CUDA except it compiles into AVX512. So for AVX512-like | code execution environments, using the ISPC language/compiler is | exceptionally useful. | | It's harder to learn a new language than to learn a few compiler | options to enable auto-vectorization, however. So in practice, | auto-vectorization will continue to be used. But for tasks that | specifically would benefit from SIMD thinking, the dedicated ISPC | language should be beneficial.
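The "SIMD thinking" described above can be sketched without leaving Python: express the computation as one whole-array operation instead of a scalar loop, so the work runs inside compiled (and typically vectorized) kernels. This illustrative example uses NumPy and made-up function names; it shows the restructuring idea, not ISPC or intrinsics themselves:

```python
import numpy as np

def saxpy_loop(a, x, y):
    # Scalar formulation: one element at a time, like a naive for-loop.
    out = np.empty_like(x)
    for i in range(len(x)):
        out[i] = a * x[i] + y[i]
    return out

def saxpy_vectorized(a, x, y):
    # Whole-array formulation: one expression over the full vectors,
    # dispatched to a compiled elementwise kernel.
    return a * x + y

x = np.arange(4, dtype=np.float32)
y = np.ones(4, dtype=np.float32)
print(saxpy_vectorized(2.0, x, y))   # [1. 3. 5. 7.]
assert np.array_equal(saxpy_loop(2.0, x, y), saxpy_vectorized(2.0, x, y))
```

The same reshaping of a problem into straight-line operations over whole arrays is what makes a C loop amenable to auto-vectorization, or a kernel natural to write in ISPC or CUDA.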
| sva_ wrote: | > I'd say auto-vectorization should still be learned by modern | high-performance programmers, because it's very low-hanging | fruit. | | Somewhat related: 55 GiB/s FizzBuzz | https://codegolf.stackexchange.com/questions/215216/high-thr... | (discussion: https://news.ycombinator.com/item?id=29031488) | saagarjha wrote: | Definitely not autovectorized. | jackmott42 wrote: | What do you mean by learning auto-vectorization? Do you mean | learning how to structure your code so that your compiler of | choice has a good chance of pulling it off? | Veliladon wrote: | Pretty much. | | If one wants to sum two arrays, instead of for-looping through | the elements of the two arrays one might iterate in chunks so | that the compiler can easily tie them all together | as a single operation, which can then easily vectorize. | photochemsyn wrote: | This looks like a curious case: non-fixed-length vectorization | code in RISC-V: | | > "Perhaps the most interesting part of the open RISC-V | instruction set architecture (ISA) is the vector extension | (RISC-V "V"). In contrast to the average single-instruction | multiple-data (SIMD) instruction set, RISC-V vector instructions | are vector length agnostic (VLA). Thus, a RISC-V "V" CPU is | flexible in choosing a vector register size while RISC-V "V" | binary code is portable between different CPU implementations." | | https://gms.tf/riscv-vector.html | divbzero wrote: | With interpreted languages such as Python or MATLAB, the | performance gains come not only from hardware parallelism but | also from the use of compiled libraries instead of interpreted | code. | saagarjha wrote: | > Vectorization is a process by which floating-point computations | in scientific code are compiled into special instructions that | execute elementary operations (+,-,*, etc.) or functions (exp, | cos, etc.) in parallel on fixed-size vector arrays.
| | I guess this is a scientific computing course or something, but I | feel like even so it's important to point out that most | processors have many vector instructions that operate on | integers, bitfields, characters, and the like. The fundamental | premise of "do a thing on a bunch of data at once" isn't limited | to just floating-point operations. ___________________________________________________________________ (page generated 2023-06-01 23:00 UTC)