[HN Gopher] Sorting with SIMD
       ___________________________________________________________________
        
       Sorting with SIMD
        
       Author : lukastyrychtr
       Score  : 117 points
       Date   : 2022-12-17 16:59 UTC (6 hours ago)
        
 (HTM) web link (tweedegolf.nl)
 (TXT) w3m dump (tweedegolf.nl)
        
       | anovikov wrote:
       | SIMD is a fantastic tool. My first try with AVX2 (256-bit),
       | building an "overlay with transparency" for two YUVA420P frames,
        | yielded speed better than 1 pixel per CPU cycle! Although I
        | didn't even try optimising it all that much.
        
       | the__alchemist wrote:
       | When would you use SIMD vice a GPU? (Eg Vulkan comp shader) Is it
       | easier to write for CPU, but you bring in the GPU shader if doing
        | massively parallel ops vice just a few? I've skimmed a Rust lib
        | that uses CPU SIMD (glam), and it used a different syntax from
        | normal Rust, so I'm not positive it would be easier if you're
        | familiar with the GPU process.
        
         | fathyb wrote:
          | It depends on how much, where, and when you'd like the data
          | to be sent. Most discrete GPUs cannot access CPU memory
          | directly, so you need to make a copy over the PCIe bus,
          | which can be slow.
         | 
         | If you have small chunks of data (mining), or your destination
         | is the screen (video game), it might make sense to use the GPU.
         | If you need high-throughput, low latency, or your destination
         | is something like a sound card (DAW), then SIMD might be a
         | better choice.
         | 
         | Fast integrated GPUs like Apple's allow for directly accessing
         | the main memory without copy, making the GPU more viable for
         | general purposes.
        
       | anonymoushn wrote:
       | This is a good explanation. Writing the masks in big-endian
       | somewhat obscures things.
       | 
       | As an aside, while recent developments in quicksort are quite
       | good, it seems like MSB radix sort performs comparably to (or
       | slightly better than!) these algorithms on random inputs.
        
         | bee_rider wrote:
         | Random inputs for which you don't know the bounds?
        
           | anonymoushn wrote:
            | Yes. If the input is bounded in a narrow range compared to
            | the length of the input, it seems like it should actually be
           | a bit slower than normal, since the first few passes may do
           | nothing and the later passes will need to be run on more of
           | the input than normal. We're using it for arrays of doubles
           | that have fairly small exponents though, and this causes us
           | to have lots of empty buckets in the first pass, and it
           | performs well enough.
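To make the bucket behaviour concrete, here is a minimal byte-wise MSB radix sort on uint32 keys. This is a sketch, not the linked implementations (which add refinements such as skipping empty buckets and falling back to insertion sort on small ranges):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Minimal MSB radix sort on uint32 keys, one byte per pass.
 * Partition by the current byte, then recurse into each bucket on
 * the next byte. With keys confined to a narrow range, the first
 * pass(es) put everything into a single bucket, as described above. */
static void msb_radix_sort(uint32_t *a, size_t n, int shift) {
    if (n < 2 || shift < 0)
        return;
    size_t count[256] = {0};
    for (size_t i = 0; i < n; i++)
        count[(a[i] >> shift) & 0xFF]++;
    size_t start[257];                 /* bucket boundaries */
    start[0] = 0;
    for (int b = 0; b < 256; b++)
        start[b + 1] = start[b] + count[b];
    uint32_t *tmp = malloc(n * sizeof *tmp);
    size_t pos[256];
    memcpy(pos, start, sizeof pos);
    for (size_t i = 0; i < n; i++)     /* scatter into buckets */
        tmp[pos[(a[i] >> shift) & 0xFF]++] = a[i];
    memcpy(a, tmp, n * sizeof *a);
    free(tmp);
    for (int b = 0; b < 256; b++)      /* recurse per bucket */
        msb_radix_sort(a + start[b], count[b], shift - 8);
}
```

Call it with `shift = 24` to start at the most significant byte of a 32-bit key.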
        
             | bee_rider wrote:
             | Cool! I should probably not have commented, I haven't
             | really kept up with advances in sorting, haha. I assumed it
             | was basically done to death 20 years ago. This will spur
             | some reading I think! Thanks.
        
               | anonymoushn wrote:
               | Please do comment in situations like this.
               | 
               | It's tough to find a good overview of this topic that
               | includes recent work or that is aimed at empirical
               | performance. That's part of why TFA is great!
               | 
               | A version of MSB radix sort for sorting ints, doubles, or
               | floats, or sorting structs by a single key of those types
               | is here[0] and a version for sorting strings is here[1].
               | These are based on the MSB radix sort from this
               | repository[2] which is associated with a technical report
               | about parallel string sorts. I was only interested in
               | sequential sorts, but this repository turned out to be a
               | great resource anyway.
               | 
                | [0]: https://github.com/alichraghi/zort/blob/main/src/radix.zig
               | 
                | [1]: https://github.com/dendibakh/perf-challenge6/blob/Solution_R...
               | 
               | [2]: https://github.com/bingmann/parallel-string-sorting
        
       | rikschennink wrote:
        | This is why, with redact.photo, I randomly shuffle the pixels
        | before blurring: it looks like you could reverse engineer it,
        | but you'll just get scrambled pixels.
        
         | boberoni wrote:
         | I thought that blurring a photo would set each pixel to the
         | average color value of all its neighbors. Since information is
         | lost, how could you reverse engineer a blurred photo?
        
           | snovv_crash wrote:
           | If there are a limited number of input permutations then you
           | can test all of them to see which is plausible.
        
         | lucb1e wrote:
         | The threads are right next to each other, might you have meant
         | to reply here? https://news.ycombinator.com/item?id=34031568
         | "Unredacter: Never use pixelation [for] redaction"
        
       | fearthetelomere wrote:
       | >You should probably not use inline assembly in production code
       | 
        | What are the alternatives here? Write the assembly in a
        | separate file and call it through an FFI?
        
         | corysama wrote:
          | Popular compilers support popular SIMD architectures through
          | "intrinsic" functions. They look and act like regular
          | functions, but they are built in to the compiler and usually
          | compile to a single specific assembly instruction. In the
          | article, _mm_set_epi32 is an intrinsic function (that
          | particular one compiles to a short load/shuffle sequence
          | rather than one instruction, but most intrinsics map 1:1).
          | 
          | This is a sharp contrast to inline assembly, into which the
          | compiler has practically zero visibility. Inline assembly
          | can't be scheduled alongside other work by the compiler. And
          | the compiler has to fall back to the super-conservative
          | assumption that the inline assembly might have done
          | god-knows-what behind its back.
         | 
         | AFAICT, the last holdouts for hand-written assembly are people
         | working on media codecs. Even AAA game engines use intrinsic
         | functions rarely and assembly nearly never.
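As a sketch of what intrinsics look like in practice (C, using only the SSE2 baseline from <emmintrin.h>; the function and variable names here are illustrative, not from the article):

```c
#include <emmintrin.h>  /* SSE2 intrinsics, baseline on x86-64 */
#include <stdint.h>

/* Add four pairs of 32-bit ints at once. The compiler sees ordinary
 * function calls, so it can schedule and optimize around them;
 * _mm_add_epi32 compiles to the PADDD instruction, and the unaligned
 * load/store intrinsics to MOVDQU. */
void add4(const int32_t *a, const int32_t *b, int32_t *out) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));
}
```

Because these are plain functions to the front end, the optimizer can inline, vector-schedule, and constant-fold across them — exactly what it cannot do across an inline-asm block.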
        
           | girvo wrote:
           | Isn't the reason they had to use inline assembly there
           | because the compiler they're using doesn't have that
           | particular instruction bound as an intrinsic?
           | 
            | What do you do in that case? I'm genuinely curious, as it's
            | something I've run up against: there are no intrinsics for
            | the vector extensions of the LX7 core in the ESP32-S3.
        
         | jvanderbot wrote:
         | There are SIMD abstraction libraries floating around. And many
         | so-called "Math" libraries will use SIMD instructions to speed
         | things up, I believe. So the work is to cast the problem to the
         | language of the library(ies) and do some profiling.
        
       | Dwedit wrote:
       | If you are exclusively moving numbers around in an array, SIMD
       | sorting sounds great.
       | 
        | But when you want to actually sort things, it's different: you
        | have objects in an array, and each object has one member, a
        | key, that you sort the objects on.
       | 
       | In order to create a sorted permutation of the list, you either
       | need to rearrange the list of object pointers (fast), rearrange
       | the memory of the objects (slow), or you simply output a list of
       | array indexes that represents what the sort order should be
       | (fast).
       | 
       | Code that doesn't physically move the object memory around
       | creates indirection. Indirection makes any SIMD sorting depend on
       | scatter-gather in order to get data in. Scatter-Gather causes
       | random memory accesses which don't cache as well as sequential
       | access.
        
         | ww520 wrote:
          | It might be easier to make SIMD work by extracting the key
          | value from each object into its own array.
         | 
         | Form an array with fixed element size: [ {index0 | key0},
         | {index1 | key1}, ...], where indices are the element index to
         | the original array of objects and the keys are fixed size. The
         | fat element {index | key} is moved as a whole unit to carry the
         | original index along with the swap during sorting. SIMD can
         | deal with fixed size units very well.
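A sketch of that scheme in C (names are mine; plain qsort stands in for the SIMD sorter, since anything that can sort 64-bit units will order the packed values correctly):

```c
#include <stdint.h>
#include <stdlib.h>

/* Pack each 32-bit key with its original index into one u64: key in
 * the high half so plain integer comparison orders by key, index in
 * the low half so it travels along with every swap. */
static int cmp_u64(const void *x, const void *y) {
    uint64_t a = *(const uint64_t *)x, b = *(const uint64_t *)y;
    return (a > b) - (a < b);
}

void sorted_order(const uint32_t *keys, uint32_t *order, size_t n) {
    uint64_t *packed = malloc(n * sizeof *packed);
    for (size_t i = 0; i < n; i++)
        packed[i] = ((uint64_t)keys[i] << 32) | i;
    qsort(packed, n, sizeof *packed, cmp_u64);
    for (size_t i = 0; i < n; i++)
        order[i] = (uint32_t)packed[i];  /* low half: original index */
    free(packed);
}
```

After the sort, `order[i]` is the index of the i-th smallest element in the original object array, so the objects themselves never move.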
        
         | dan-robertson wrote:
          | Yeah, I find the thing you generally want, but which many
          | languages don't support well, is a struct-of-arrays or data
          | frame layout rather than an array of objects. Then you can
          | do one of:
         | 
         | Sort one array and rearrange other arrays the same way
         | 
          | Compute the array of indices into another array such that
          | array[index[i]] < array[index[i+1]], i.e. the sorted version
          | of the array is then [array[i] for i in index]. If you have a
          | vectorised index operation (e.g. APL, R, ...) getting the
          | sorted array is one line of code.
         | 
         | I think with simd you'd want to sort your array and
         | simultaneously permute either a second array or the array of
         | indices 1, 2, ..., n so you can use that to do accesses to
         | corresponding elements in your other associated columns.
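The index-array variant can be sketched in C (an insertion sort keeps the example short; the function names are mine):

```c
#include <stddef.h>

/* argsort: fill index[] so that keys[index[i]] is nondecreasing. */
void argsort(const double *keys, size_t *index, size_t n) {
    for (size_t i = 0; i < n; i++)
        index[i] = i;
    for (size_t i = 1; i < n; i++) {   /* simple insertion sort */
        size_t k = index[i], j = i;
        while (j > 0 && keys[index[j - 1]] > keys[k]) {
            index[j] = index[j - 1];
            j--;
        }
        index[j] = k;
    }
}

/* Gather a companion column through the permutation: the one-line
 * col[index] of APL/R, spelled out. */
void gather(const double *col, const size_t *index, double *out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = col[index[i]];
}
```

The same `index` array can be reused to gather every other column of the struct-of-arrays layout, so you sort once and permute many.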
        
       | xphos wrote:
        | This is a well-written SIMD example. I find these are the
        | hardest topics to explain clearly because of how much is
        | happening. Anyone got other nice links? I've been doing a lot
        | of vector processing, so I need to learn!
        
         | gww wrote:
         | Someone previously shared this YouTube playlist on HN, and I
         | have found it to be very helpful [1].
         | 
         | [1]
         | https://www.youtube.com/playlist?list=PLYCMvilhmuPEM8DUvY6Wg...
        
       | wolf550e wrote:
       | 128bit SIMD on x86 is called SSE, not AVX. AVX is 256bit. SSE
       | (first introduced with the Pentium 3) had many extensions that
       | added instructions: SSE2, SSE3, SSSE3, SSE4.1, SSE4.2 and some
       | others.
        
         | cpgxiii wrote:
          | That's not entirely true. AVX has contained 128-bit
          | instructions from the beginning (Sandy Bridge), not just
          | since AVX2 (Haswell), and those 128-bit AVX instructions are
          | what's being used in the article (e.g. vpermilps).
        
         | adrian_b wrote:
         | For most 128-bit SSE instructions, there are equivalent 128-bit
         | AVX instructions, which differ from them by allowing 3 register
         | addresses instead of 2 register addresses, and by not causing
         | execution stalls when they are mixed with 256-bit AVX
         | instructions.
         | 
         | For most AVX instructions, you may choose between 128-bit and
         | 256-bit instructions, which is especially useful on those CPUs
         | where 256-bit instructions cause down-clocking.
         | 
         | There are also 128-bit AVX-512 instructions and 256-bit AVX-512
         | instructions.
         | 
         | So the same SIMD algorithm may be implemented for Intel CPUs
         | using either 128-bit SSE instructions or 128-bit AVX
         | instructions or 128-bit AVX-512 instructions or 256-bit AVX
         | instructions or 256-bit AVX-512 instructions or 512-bit AVX-512
         | instructions.
        
         | [deleted]
        
       ___________________________________________________________________
       (page generated 2022-12-17 23:00 UTC)