[HN Gopher] tolower() in bulk at speed ___________________________________________________________________ tolower() in bulk at speed Author : fanf2 Score : 84 points Date : 2022-06-27 21:12 UTC (1 hours ago) (HTM) web link (dotat.at) (TXT) w3m dump (dotat.at) | [deleted] | mananaysiempre wrote: | This technique of SWAR ("SIMD within a register") with very | narrow elements and no hardware support is explained very well in | an SO answer[1] (and the linked slides) about the mythical (and | useless) "fusion trees". | | [1] https://stackoverflow.com/a/56604876 | dragontamer wrote: | Seems a bit odd to stop at SWAR on today's systems. | | Every system I'm aware of has 128-bit SIMD implemented: either | SSE (x86), NEON (ARM), or AltiVec (POWERPC). As such, 128-bit | SIMD is the "reliably portable" SIMD operation. | | Of course, for fastest speeds, you need to go to the largest | SIMD-register size for your platform: 512-bit for some Intel | processors, rumored AMD Zen4 and Centaur chips. 256-bit for most | Intel / AMD Zen3 chips. 128-bit for Apple ARMs, 512-bit for | Fujitsu A64 ARMs, etc. etc. | | > And these are the fastest kinds of instructions :-) | | And it should be noted that modern SIMD instructions execute at | one-per-clock-tick. So the 64-bit instructions certainly are | fast, but the SIMD instructions tie if you're using simple XOR or | comparison operations. | | -------- | | This style of code might even be "reliably auto-vectorizable" on | GCC and CLANG actually. I wonder if I could get portable auto- | vectorizing C from these examples. | fanf2 wrote: | Is there a good way to load an arbitrary number of bytes (up to | the vector size) into a vector register? As I said in the | article, I could not find one when looking at AVX2 or NEON | reference manuals. Getting the data from RAM is the main | bottleneck for short strings, which DNS names usually are. | dragontamer wrote: | You're not thinking in terms of SIMD. | | If you load 5 bytes into a 128-bit register (16-bytes) you'll | have 5-bytes of data + 11-bytes of garbage. | | Perform the calculation over all 16-bytes. Yes, this makes | 11-bytes of garbage at the end. Once you're done, just write | back the first 5 bytes and you're set. | | The 11-bytes of garbage are "free". They didn't cost you | anything. | | > Getting the data from RAM is the main bottleneck for short | strings | | Your L1 and L2 cache-lines are 64-byte minimum read / write | anyway (actually, some systems were 128-byte minimum IIRC). A | 16-byte read is literally smaller than what your cache does | on the path into your registers. | | EDIT: More importantly, modern CPUs only have 2x or 3x | load/store units. Meaning a modern CPU can only perform ~2ish | load/stores per clock tick. The SSE (16-byte / 128-bit) read | / write to L1 cache will perform the same speed as a 8-byte | /64-bit read/write to L1 cache in practice. | fanf2 wrote: | In fact the code that I left out of the blog post does | almost exactly what you suggest :-) It's the "load 5 bytes" | and "store 5 bytes" that I don't have a good solution for. | At the moment I am using memmove() and relying on the | compiler developers to have better ideas about optimizing | it than I do... The bottleneck comes from the number of | instructions and the branchiness, not the data bandwidth. | | I briefly considered playing games with overlong reads, but | then asan gave me a slap and I reconsidered the path to | wisdom. | dzaima wrote: | Doesn't seem to be for SSE & AVX at least (only 32-bit and | 64-bit groups in AVX, and nothing for SSE). AVX-512 has | _mm512_maskz_loadu_epi8, but not much has AVX-512. | | It's a shame that there aren't guarantees on being able to | overread some number of garbage bytes. (for a SIMD-heavy | project, I've just went with a custom allocator that | guarantees the ability to read past the end of allocations, | and storing by either maskstore for 32/64-bit, and blending | with what's already there for 8/16-bit) | dietrichepp wrote: | Naive version to explore autovectorization: | | https://gcc.godbolt.org/z/bcefvznG3 | | Yes, there's some amount of autovectorization here. Seems like | a mess of a function. Here's something more like tolower8(): | | https://gcc.godbolt.org/z/h8GTanodd | | The generated code definitely looks funny to me. Here is a | manual vectorization, which is shorter: | | https://gcc.godbolt.org/z/1e4odEsKq | | The issue of loading into your vector register... well, there | are some dirty tricks for that. The 16-byte slice containing | the first byte, and the 16-byte slice containing the last byte, | can both be loaded into registers and then shifted around in | order to construct the desired value. Note the careful wording | here... these slices might be the same slice. Or you can | iterate over the 16-byte slices containing the array, and shift | as you go, if you're storing into a different location. Or you | can use various masked load/store operations on various | architectures. | dragontamer wrote: | > The issue of loading into your vector register... well, | there are some dirty tricks for that. | | Good discussion. I think the only method you haven't talked | about is the simple "unaligned load" instructions (which | might be the simplest, and most portable way, to do this). I | know that ARM and x86 both can do unaligned loads no problem, | but at a possible performance penalty. | moonchild wrote: | > modern SIMD instructions execute at one-per-clock-tick | | Or even two, or four on apple arm, plus load and store which | can be done in parallel. | dragontamer wrote: | Yeah, good point. SIMD is also superscalar on most systems. | | For x86, Intel can do 3x AVX512 instructions per clock tick | as long as each instruction is simple enough (add, AND, OR, | NOT, XOR, maybe even multiply if you're not counting the | latency issue) | sedatk wrote: | "labels are frequently less than 8 bytes long, and therefore | fit inside a 64-bit register. So it probably isn't worth | dealing with the portability issues of working with wide vector | registers (Especially since I could not find a quick way to | load an arbitrary number of bytes into a vector register with | AVX2 nor with NEON.)" | dragontamer wrote: | DNS labels like "ycombina" (tor) you mean? :-) | fanf2 wrote: | Two out of three ain't bad :-) | | _news_.ycombinator. _com_ | londons_explore wrote: | All those arithmetic calculations depend on the previous... | | You should get much more throughput if you can interleave them | with other instructions... | | I wonder if that's what the benchmarks did? | fanf2 wrote: | Hmm, yes, there are only 2 or 3 instructions that can execute | concurrently - but your observation made me realise I can eke | out another by masking is_ascii more thoroughly. Thanks! | | Bulk throughput isn't really my aim, it's just a convenient way | to get numbers big enough to be easily measurable :-) | dzaima wrote: | An out-of-order CPU will interleave things for you. Any modern | CPU should have no issues running many iterations of the loop | in parallel, as nothing should depend on the previous | iteration. | | But, given that the task at hand usually deals with very short | inputs, there's not much to interleave with anyway. | londons_explore wrote: | Worth patching libc? | vinkelhake wrote: | libc tolower() works on a single char at a time. This post is | about the gains you can get by converting multiple chars at | once. | mananaysiempre wrote: | The standard tolower() / toupper() take a single character | only; bulk strlwr() / strupr() are nonstandard (if commonly | present) and--more importantly--virtually unused. I suppose | implementing this technique in the C/POSIX-locale version of | nonstandard stricmp() / POSIX strcasecmp() might be helpful in | some cases, because people do use that one, but I still expect | that any situations that truly call for an ASCII-only case- | insensitive comparison (parsers?) will be doing much more work | per byte for other reasons (state machine, perfect hash, _etc._ | ). | stabbles wrote: | It's not the same, libc's tolower takes an int as an individual | character | Findecanor wrote: | Although ... the existence of a libc with SIMD versions of | its functions is not implausible. There are compilers that | would produce such functions beside the normal ones if the | source is decorated with the right #pragma. | | Such functions would be called only from within vectorised | loops (or other SIMD versions of functions). | Findecanor wrote: | This type of bit manipulation will only work with pure ASCII, | which had been designed to make this transformation simple with | only bit manipulation -- just not in parallel. | | Most systems default to using Unicode these days, for which the | problem is _much_ more complex even when the language is set to | English. | babelfish wrote: | Great article. Have been brushing up on bit manipulation for | interview prep lately and I love finding easy to digest, real- | world use cases like this. | MrYellowP wrote: | Had I known anyone cares about this ... | | Anything else anyone cares about, that could use a speedup... ? | charcircuit wrote: | Since domains are UTF-8 doesn't the assumption of ASCII | breakdown? | sveiss wrote: | Domain names are limited to a small subset of ASCII; the DNS | protocol doesn't support UTF-8. | | To work around this, the international domain name standard | defines an encoding called Punycode which maps Unicode to the | limited character set DNS supports. The server is unaware of | this, and so this optimised tolower() implementation works | without any Unicode considerations. | asveikau wrote: | Lower case is locale-sensitive, however. For example, | tolower('I') in Turkish should be 'i'. | | And within unicode, doing it in a "dumb ascii" way probably | needs some normalization of diacritics. Eg. 'e' should be | U+0065 U+0301 ("e\xcc\x81"), not U+00E9 ("\xc3\x89"). | | Not sure how punycode handles this, I did once look deeply | into it but that was years ago. | jimmygrapes wrote: | Absolutely. This article is reminiscent of the type of brief | tutorials from the mid/late 90s in QBASIC newsletters | introducing how to use ASM to get speed gains, with all the | associated assumptions. The concept of "lowercase" breaks down | terribly once you go beyond simple A-Z. You can apply similar | rules to certain limited code sets, but a universal tolower() | won't work so well with this method. | mgcunha wrote: | DNS labels are limited to some ASCII characters. UTF support in | DNS is available via punycode which is an encoding from UTF | onto that restricted ASCII set acceptable for DNS. Libraries as | the ones discussed here typically perform UTF to punycode | conversions before doing any label comparisons to ensure | accuracy. In particular this tolower implementation would | likely be used against the punycode encoded version of the UTF | domain. | | https://en.wikipedia.org/wiki/Punycode ___________________________________________________________________ (page generated 2022-06-27 23:00 UTC)