[HN Gopher] Sorting with SIMD
___________________________________________________________________
 
Sorting with SIMD
 
Author : lukastyrychtr
Score  : 117 points
Date   : 2022-12-17 16:59 UTC (6 hours ago)
 
(HTM) web link (tweedegolf.nl)
(TXT) w3m dump (tweedegolf.nl)
 
| anovikov wrote:
| SIMD is a fantastic tool. My first try with AVX2 (256-bit),
| building an "overlay with transparency" for two YUVA420P frames,
| yielded speed better than 1 pixel per CPU cycle! Although I
| didn't even try optimising it all that much.
| the__alchemist wrote:
| When would you use SIMD instead of a GPU (e.g. a Vulkan compute
| shader)? Is it easier to write for the CPU, but you bring in the
| GPU shader if doing massively parallel ops rather than just a
| few? I've skimmed a Rust lib that uses CPU SIMD (GLAM), and it
| used a different syntax from normal, so I'm not positive it
| would be easier if you're familiar with the GPU process.
| fathyb wrote:
| It depends how much, where, and when you'd like the data to be
| sent. Most discrete GPUs cannot access CPU memory directly, so
| you need to make a copy over the PCIe bus, which can be slow.
| 
| If you have small chunks of data (mining), or your destination
| is the screen (video games), it might make sense to use the GPU.
| If you need high throughput, low latency, or your destination is
| something like a sound card (DAW), then SIMD might be a better
| choice.
| 
| Fast integrated GPUs like Apple's allow direct access to main
| memory without a copy, making the GPU more viable for general
| purposes.
| anonymoushn wrote:
| This is a good explanation. Writing the masks in big-endian
| somewhat obscures things.
| 
| As an aside, while recent developments in quicksort are quite
| good, it seems like MSB radix sort performs comparably to (or
| slightly better than!) these algorithms on random inputs.
| bee_rider wrote:
| Random inputs for which you don't know the bounds?
| anonymoushn wrote:
| Yes.
| If the input is bounded in a narrow range compared to the
| length of the input, it seems like it should actually be a bit
| slower than normal, since the first few passes may do nothing
| and the later passes will need to be run on more of the input
| than normal. We're using it for arrays of doubles that have
| fairly small exponents, though, and this causes us to have lots
| of empty buckets in the first pass, and it performs well enough.
| bee_rider wrote:
| Cool! I should probably not have commented; I haven't really
| kept up with advances in sorting, haha. I assumed it was
| basically done to death 20 years ago. This will spur some
| reading, I think! Thanks.
| anonymoushn wrote:
| Please do comment in situations like this.
| 
| It's tough to find a good overview of this topic that includes
| recent work or that is aimed at empirical performance. That's
| part of why TFA is great!
| 
| A version of MSB radix sort for sorting ints, doubles, or
| floats, or sorting structs by a single key of those types, is
| here[0], and a version for sorting strings is here[1]. These are
| based on the MSB radix sort from this repository[2], which is
| associated with a technical report about parallel string sorts.
| I was only interested in sequential sorts, but this repository
| turned out to be a great resource anyway.
| 
| [0]: https://github.com/alichraghi/zort/blob/main/src/radix.zig
| 
| [1]: https://github.com/dendibakh/perf-challenge6/blob/Solution_R...
| 
| [2]: https://github.com/bingmann/parallel-string-sorting
| rikschennink wrote:
| This is why with redact.photo I randomly shuffle the pixels
| before blurring, so it looks like you can reverse engineer it
| but you'll just get scrambled pixels.
| boberoni wrote:
| I thought that blurring a photo would set each pixel to the
| average color value of all its neighbors. Since information is
| lost, how could you reverse engineer a blurred photo?
| snovv_crash wrote:
| If there is a limited number of input permutations, then you
| can test all of them to see which is plausible.
| lucb1e wrote:
| The threads are right next to each other; might you have meant
| to reply here? https://news.ycombinator.com/item?id=34031568
| "Unredacter: Never use pixelation [for] redaction"
| fearthetelomere wrote:
| > You should probably not use inline assembly in production code
| 
| What are the alternatives here? Write the assembly in a separate
| file and provide an FFI?
| corysama wrote:
| Popular compilers support popular SIMD architectures through
| "intrinsic" functions. They look and act like regular functions,
| but they are built into the compiler and usually compile to a
| single specific assembly instruction. In the article,
| _mm_set_epi32 is an intrinsic function that compiles to the
| instruction of the same name.
| 
| This is a sharp contrast to inline assembly, into which the
| compiler has practically zero visibility. Inline assembly can't
| be pipelined with other work by the compiler, and the compiler
| has to switch to a super-conservative assumption that the inline
| assembly might have done god-knows-what behind its back.
| 
| AFAICT, the last holdouts for hand-written assembly are people
| working on media codecs. Even AAA game engines use intrinsic
| functions rarely and assembly nearly never.
| girvo wrote:
| Isn't the reason they had to use inline assembly there because
| the compiler they're using doesn't have that particular
| instruction bound as an intrinsic?
| 
| What do you do in that case? I'm genuinely curious, as it's
| something I've run up against: the vector extensions for the LX7
| processor in the ESP32-S3 don't have intrinsics for them.
| jvanderbot wrote:
| There are SIMD abstraction libraries floating around. And many
| so-called "math" libraries will use SIMD instructions to speed
| things up, I believe.
| So the work is to cast the problem into the language of the
| library (or libraries) and do some profiling.
| Dwedit wrote:
| If you are exclusively moving numbers around in an array, SIMD
| sorting sounds great.
| 
| But when you want to actually sort things, it's different. You
| have objects in an array, and the objects will have one member,
| a key, that you sort the objects on.
| 
| In order to create a sorted permutation of the list, you either
| need to rearrange the list of object pointers (fast), rearrange
| the memory of the objects (slow), or simply output a list of
| array indexes that represents what the sort order should be
| (fast).
| 
| Code that doesn't physically move the object memory around
| creates indirection. Indirection makes any SIMD sort depend on
| scatter/gather to get data in, and scatter/gather causes random
| memory accesses, which don't cache as well as sequential access.
| ww520 wrote:
| It might be easier for SIMD to work by extracting the key value
| from each object into its own array.
| 
| Form an array with a fixed element size: [ {index0 | key0},
| {index1 | key1}, ... ], where the indices are element indices
| into the original array of objects and the keys are fixed size.
| The fat element {index | key} is moved as a whole unit, carrying
| the original index along with each swap during sorting. SIMD
| deals with fixed-size units very well.
| dan-robertson wrote:
| Yeah, I find the thing you generally want, but which many
| languages don't support well, is a struct-of-arrays or data
| frame layout rather than an array of objects. Then you can do
| one of:
| 
| Sort one array and rearrange the other arrays the same way.
| 
| Compute the array of indices into another array such that
| array[index[i]] < array[index[i+1]], i.e. the sorted version of
| the array is then [array[i] for i in index]. If you have a
| vectorised index operation (e.g. APL, R, ...), getting the
| sorted array is one line of code.
| 
| I think with SIMD you'd want to sort your array and
| simultaneously permute either a second array or the array of
| indices 1, 2, ..., n, so you can use that to access the
| corresponding elements in your other associated columns.
| xphos wrote:
| This is a well-written SIMD example. I find these are the
| hardest topics to explain clearly because of how much is
| happening. Anyone got other nice links? I've been doing a lot
| of vector processing, so I need to learn!
| gww wrote:
| Someone previously shared this YouTube playlist on HN, and I
| have found it to be very helpful [1].
| 
| [1] https://www.youtube.com/playlist?list=PLYCMvilhmuPEM8DUvY6Wg...
| wolf550e wrote:
| 128-bit SIMD on x86 is called SSE, not AVX. AVX is 256-bit. SSE
| (first introduced with the Pentium III) had many extensions that
| added instructions: SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, and some
| others.
| cpgxiii wrote:
| That's not entirely true. AVX (introduced with Sandy Bridge),
| not just AVX2 (introduced with Haswell), contains 128-bit
| instructions as well, and these are what's being used in the
| article (e.g. vpermilps).
| adrian_b wrote:
| For most 128-bit SSE instructions there are equivalent 128-bit
| AVX instructions, which differ from them by allowing three
| register operands instead of two, and by not causing execution
| stalls when they are mixed with 256-bit AVX instructions.
| 
| For most AVX instructions you may choose between 128-bit and
| 256-bit forms, which is especially useful on those CPUs where
| 256-bit instructions cause down-clocking.
| 
| There are also 128-bit AVX-512 instructions and 256-bit AVX-512
| instructions.
| 
| So the same SIMD algorithm may be implemented for Intel CPUs
| using 128-bit SSE, 128-bit AVX, 128-bit AVX-512, 256-bit AVX,
| 256-bit AVX-512, or 512-bit AVX-512 instructions.
| [deleted]
___________________________________________________________________
(page generated 2022-12-17 23:00 UTC)