[HN Gopher] tolower() in bulk at speed
       ___________________________________________________________________
        
       tolower() in bulk at speed
        
       Author : fanf2
       Score  : 84 points
       Date   : 2022-06-27 21:12 UTC (1 hours ago)
        
 (HTM) web link (dotat.at)
 (TXT) w3m dump (dotat.at)
        
       | [deleted]
        
       | mananaysiempre wrote:
       | This technique of SWAR ("SIMD within a register") with very
       | narrow elements and no hardware support is explained very well in
       | an SO answer[1] (and the linked slides) about the mythical (and
       | useless) "fusion trees".
       | 
       | [1] https://stackoverflow.com/a/56604876
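       | 
       | For concreteness, here is a minimal sketch of the pattern (my
       | reconstruction, not necessarily the article's exact code):
       | lower-case eight ASCII bytes at once inside a plain uint64_t,
       | using nothing but ordinary 64-bit integer operations.
       | 
       |     #include <stdint.h>
       | 
       |     uint64_t tolower8(uint64_t octets) {
       |         /* clear top bits so per-byte adds cannot carry over */
       |         uint64_t heptets  = octets & 0x7f7f7f7f7f7f7f7f;
       |         /* top bit set where byte > 'Z': 0x5b + 0x25 = 0x80 */
       |         uint64_t is_gt_Z  = heptets + 0x2525252525252525;
       |         /* top bit set where byte >= 'A': 0x41 + 0x3f = 0x80 */
       |         uint64_t is_ge_A  = heptets + 0x3f3f3f3f3f3f3f3f;
       |         /* top bit set where the byte is ASCII (high bit 0) */
       |         uint64_t is_ascii = ~octets & 0x8080808080808080;
       |         /* 0x80 in each byte holding an ASCII capital letter */
       |         uint64_t is_upper = is_ascii & (is_ge_A ^ is_gt_Z);
       |         /* 0x80 >> 2 == 0x20, the lower-case bit in ASCII */
       |         return octets | is_upper >> 2;
       |     }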
        
       | dragontamer wrote:
       | Seems a bit odd to stop at SWAR on today's systems.
       | 
       | Every system I'm aware of has 128-bit SIMD implemented: either
       | SSE (x86), NEON (ARM), or AltiVec (PowerPC). As such, 128-bit
       | SIMD is the "reliably portable" SIMD width.
       | 
       | Of course, for the fastest speeds, you need to go to the largest
       | SIMD-register size for your platform: 512-bit for some Intel
       | processors and the rumored AMD Zen4 and Centaur chips, 256-bit
       | for most Intel / AMD Zen3 chips, 128-bit for Apple ARMs, 512-bit
       | for Fujitsu A64FX ARMs, etc.
       | 
       | > And these are the fastest kinds of instructions :-)
       | 
       | And it should be noted that modern SIMD instructions execute at
       | one-per-clock-tick. So the 64-bit instructions certainly are
       | fast, but the SIMD instructions tie if you're using simple XOR or
       | comparison operations.
       | 
       | --------
       | 
       | This style of code might even be "reliably auto-vectorizable" on
       | GCC and Clang, actually. I wonder if I could get portable auto-
       | vectorizing C from these examples.
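       | 
       | Something like this plain byte-at-a-time loop (an illustrative
       | sketch, not code from the article) is the shape GCC and Clang
       | will usually auto-vectorize at -O3, turning the branch into a
       | vector compare plus blend; worth checking the generated asm,
       | since auto-vectorization is never guaranteed.
       | 
       |     #include <stddef.h>
       | 
       |     void tolower_bulk(unsigned char *s, size_t n) {
       |         for (size_t i = 0; i < n; i++) {
       |             unsigned char c = s[i];
       |             if (c >= 'A' && c <= 'Z')
       |                 s[i] = c | 0x20;   /* set the case bit */
       |         }
       |     }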
        
         | fanf2 wrote:
         | Is there a good way to load an arbitrary number of bytes (up to
         | the vector size) into a vector register? As I said in the
         | article, I could not find one when looking at AVX2 or NEON
         | reference manuals. Getting the data from RAM is the main
         | bottleneck for short strings, which DNS names usually are.
        
           | dragontamer wrote:
           | You're not thinking in terms of SIMD.
           | 
           | If you load 5 bytes into a 128-bit register (16 bytes),
           | you'll have 5 bytes of data + 11 bytes of garbage.
           | 
           | Perform the calculation over all 16 bytes. Yes, this makes
           | 11 bytes of garbage at the end. Once you're done, just write
           | back the first 5 bytes and you're set.
           | 
           | The 11 bytes of garbage are "free". They didn't cost you
           | anything.
           | 
           | > Getting the data from RAM is the main bottleneck for short
           | strings
           | 
           | Your L1 and L2 cache lines are 64 bytes minimum per read /
           | write anyway (actually, some systems were 128-byte minimum
           | IIRC). A 16-byte read is literally smaller than what your
           | cache does on the path into your registers.
           | 
           | EDIT: More importantly, modern CPUs only have 2x or 3x
           | load/store units, meaning a modern CPU can only perform ~2ish
           | loads/stores per clock tick. The SSE (16-byte / 128-bit) read
           | / write to L1 cache will perform at the same speed as an
           | 8-byte / 64-bit read/write to L1 cache in practice.
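           | 
           | A rough SSE2 sketch of that shape (illustrative only; the
           | memcpy in and out stands in for "load/store n bytes", and
           | the function name is made up):
           | 
           |     #include <emmintrin.h>
           |     #include <string.h>
           | 
           |     /* lower-case up to 16 ASCII bytes; the unused lanes
           |        hold garbage and are simply never stored back */
           |     void tolower_upto16(unsigned char *s, size_t n) {
           |         unsigned char buf[16] = {0};
           |         memcpy(buf, s, n);             /* n <= 16 assumed */
           |         __m128i v = _mm_loadu_si128((const __m128i *)buf);
           |         /* bytes between 'A'-1 and 'Z'+1 are upper case */
           |         __m128i up = _mm_and_si128(
           |             _mm_cmpgt_epi8(v, _mm_set1_epi8('A' - 1)),
           |             _mm_cmplt_epi8(v, _mm_set1_epi8('Z' + 1)));
           |         /* OR the 0x20 case bit into the upper-case lanes */
           |         v = _mm_or_si128(v,
           |             _mm_and_si128(up, _mm_set1_epi8(0x20)));
           |         _mm_storeu_si128((__m128i *)buf, v);
           |         memcpy(s, buf, n);            /* store n bytes back */
           |     }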
        
             | fanf2 wrote:
             | In fact the code that I left out of the blog post does
             | almost exactly what you suggest :-) It's the "load 5 bytes"
             | and "store 5 bytes" that I don't have a good solution for.
             | At the moment I am using memmove() and relying on the
             | compiler developers to have better ideas about optimizing
             | it than I do... The bottleneck comes from the number of
             | instructions and the branchiness, not the data bandwidth.
             | 
             | I briefly considered playing games with overlong reads, but
             | then asan gave me a slap and I reconsidered the path to
             | wisdom.
        
           | dzaima wrote:
           | Doesn't seem to be one for SSE & AVX at least (AVX has masked
           | loads only for 32-bit and 64-bit elements, and SSE has
           | nothing). AVX-512 has _mm512_maskz_loadu_epi8, but not much
           | hardware has AVX-512.
           | 
           | It's a shame that there aren't guarantees on being able to
           | overread some number of garbage bytes. (For a SIMD-heavy
           | project, I just went with a custom allocator that guarantees
           | the ability to read past the end of allocations, and I store
           | with either maskstore for 32/64-bit elements or by blending
           | with what's already there for 8/16-bit.)
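           | 
           | For the AVX-512BW case, the masked byte load/store looks
           | roughly like this (a sketch, assuming n <= 64, -mavx512bw,
           | and a made-up function name):
           | 
           |     #include <immintrin.h>
           |     #include <stddef.h>
           | 
           |     void tolower_upto64(unsigned char *s, size_t n) {
           |         /* mask with the low n bits set: only those lanes
           |            are ever loaded from or stored to memory */
           |         __mmask64 k = ~(__mmask64)0;
           |         if (n < 64)
           |             k = ((__mmask64)1 << n) - 1;
           |         __m512i v = _mm512_maskz_loadu_epi8(k, s);
           |         __m512i A = _mm512_set1_epi8('A' - 1);
           |         __m512i Z = _mm512_set1_epi8('Z' + 1);
           |         __mmask64 up = _mm512_cmpgt_epi8_mask(v, A) &
           |                        _mm512_cmplt_epi8_mask(v, Z);
           |         /* set the 0x20 case bit in the upper-case lanes */
           |         __m512i bit = _mm512_set1_epi8(0x20);
           |         __m512i low = _mm512_or_si512(v, bit);
           |         _mm512_mask_storeu_epi8(s, k,
           |             _mm512_mask_blend_epi8(up, v, low));
           |     }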
        
         | dietrichepp wrote:
         | Naive version to explore autovectorization:
         | 
         | https://gcc.godbolt.org/z/bcefvznG3
         | 
         | Yes, there's some amount of autovectorization here. Seems like
         | a mess of a function. Here's something more like tolower8():
         | 
         | https://gcc.godbolt.org/z/h8GTanodd
         | 
         | The generated code definitely looks funny to me. Here is a
         | manual vectorization, which is shorter:
         | 
         | https://gcc.godbolt.org/z/1e4odEsKq
         | 
         | The issue of loading into your vector register... well, there
         | are some dirty tricks for that. The 16-byte slice containing
         | the first byte, and the 16-byte slice containing the last byte,
         | can both be loaded into registers and then shifted around in
         | order to construct the desired value. Note the careful wording
         | here... these slices might be the same slice. Or you can
         | iterate over the 16-byte slices containing the array, and shift
         | as you go, if you're storing into a different location. Or you
         | can use various masked load/store operations on various
         | architectures.
        
           | dragontamer wrote:
           | > The issue of loading into your vector register... well,
           | there are some dirty tricks for that.
           | 
           | Good discussion. I think the only method you haven't talked
           | about is plain "unaligned load" instructions (which might be
           | the simplest and most portable way to do this). I know that
           | ARM and x86 can both do unaligned loads no problem, but at a
           | possible performance penalty.
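           | 
           | For what it's worth, once the string is at least one vector
           | long, the unaligned-load version of the trick is pleasantly
           | simple (a sketch, assuming SSE2, len >= 16 and a lower16()
           | helper of my own naming): process 16 bytes at a time with
           | unaligned loads/stores, then redo the last 16 bytes of the
           | buffer, overlapping the previous chunk, so nothing outside
           | the buffer is ever touched. Since tolower() is idempotent,
           | lower-casing a few bytes twice is harmless.
           | 
           |     #include <emmintrin.h>
           |     #include <stddef.h>
           | 
           |     /* lower-case one 16-byte vector of ASCII */
           |     static __m128i lower16(__m128i v) {
           |         __m128i up = _mm_and_si128(
           |             _mm_cmpgt_epi8(v, _mm_set1_epi8('A' - 1)),
           |             _mm_cmplt_epi8(v, _mm_set1_epi8('Z' + 1)));
           |         return _mm_or_si128(v,
           |             _mm_and_si128(up, _mm_set1_epi8(0x20)));
           |     }
           | 
           |     void tolower_ge16(unsigned char *s, size_t len) {
           |         for (size_t i = 0; i + 16 <= len; i += 16) {
           |             __m128i v = _mm_loadu_si128((__m128i *)(s + i));
           |             _mm_storeu_si128((__m128i *)(s + i), lower16(v));
           |         }
           |         /* final, possibly overlapping, tail chunk */
           |         unsigned char *tail = s + len - 16;
           |         __m128i t = _mm_loadu_si128((__m128i *)tail);
           |         _mm_storeu_si128((__m128i *)tail, lower16(t));
           |     }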
        
         | moonchild wrote:
         | > modern SIMD instructions execute at one-per-clock-tick
         | 
         | Or even two, or four on Apple ARM, plus loads and stores,
         | which can be done in parallel.
        
           | dragontamer wrote:
           | Yeah, good point. SIMD is also superscalar on most systems.
           | 
           | For x86, Intel can do 3x AVX512 instructions per clock tick
           | as long as each instruction is simple enough (add, AND, OR,
           | NOT, XOR, maybe even multiply if you're not counting the
           | latency issue)
        
         | sedatk wrote:
         | "labels are frequently less than 8 bytes long, and therefore
         | fit inside a 64-bit register. So it probably isn't worth
         | dealing with the portability issues of working with wide vector
         | registers (Especially since I could not find a quick way to
         | load an arbitrary number of bytes into a vector register with
         | AVX2 nor with NEON.)"
        
           | dragontamer wrote:
           | DNS labels like "ycombina" (tor) you mean? :-)
        
             | fanf2 wrote:
             | Two out of three ain't bad :-)
             | 
             |  _news_.ycombinator. _com_
        
       | londons_explore wrote:
       | All those arithmetic calculations depend on the previous one...
       | 
       | You should get much more throughput if you can interleave them
       | with other instructions...
       | 
       | I wonder if that's what the benchmarks did?
        
         | fanf2 wrote:
         | Hmm, yes, there are only 2 or 3 instructions that can execute
         | concurrently - but your observation made me realise I can eke
         | out another by masking is_ascii more thoroughly. Thanks!
         | 
         | Bulk throughput isn't really my aim, it's just a convenient way
         | to get numbers big enough to be easily measurable :-)
        
         | dzaima wrote:
         | An out-of-order CPU will interleave things for you. Any modern
         | CPU should have no issues running many iterations of the loop
         | in parallel, as nothing should depend on the previous
         | iteration.
         | 
         | But, given that the task at hand usually deals with very short
         | inputs, there's not much to interleave with anyway.
        
       | londons_explore wrote:
       | Worth patching libc?
        
         | vinkelhake wrote:
         | libc tolower() works on a single char at a time. This post is
         | about the gains you can get by converting multiple chars at
         | once.
        
         | mananaysiempre wrote:
         | The standard tolower() / toupper() take a single character
         | only; bulk strlwr() / strupr() are nonstandard (if commonly
         | present) and--more importantly--virtually unused. I suppose
         | implementing this technique in the C/POSIX-locale version of
         | nonstandard stricmp() / POSIX strcasecmp() might be helpful in
         | some cases, because people do use that one, but I still expect
         | that any situations that truly call for an ASCII-only case-
         | insensitive comparison (parsers?) will be doing much more work
         | per byte for other reasons (state machine, perfect hash, _etc._
         | ).
        
         | stabbles wrote:
         | It's not the same: libc's tolower() takes an int representing
         | a single character.
        
           | Findecanor wrote:
           | Although ... the existence of a libc with SIMD versions of
           | its functions is not implausible. There are compilers that
           | will produce such functions alongside the normal ones if the
           | source is decorated with the right #pragma.
           | 
           | Such functions would be called only from within vectorised
           | loops (or other SIMD versions of functions).
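           | 
           | For instance, something along these lines (a sketch using
           | OpenMP's "declare simd", which GCC and Clang support to
           | varying degrees via -fopenmp or -fopenmp-simd; the function
           | names here are made up) asks the compiler to emit vector
           | variants of a function in addition to the scalar one:
           | 
           |     #pragma omp declare simd
           |     int my_tolower(int c) {
           |         return (c >= 'A' && c <= 'Z') ? (c | 0x20) : c;
           |     }
           | 
           |     void lower_all(unsigned char *s, int n) {
           |         #pragma omp simd
           |         for (int i = 0; i < n; i++)
           |             s[i] = (unsigned char)my_tolower(s[i]);
           |     }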
        
         | Findecanor wrote:
         | This type of bit manipulation will only work with pure ASCII,
         | which was designed to make this transformation simple with
         | only bit manipulation -- just not in parallel.
         | 
         | Most systems default to using Unicode these days, for which
         | the problem is _much_ more complex even when the language is
         | set to English.
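         | 
         | The ASCII version, by contrast, is just a range test and a
         | single bit flip per byte (a sketch, with a made-up name):
         | 
         |     /* ASCII only: 'A' is 0x41, 'a' is 0x61; the two cases
         |        differ only in the 0x20 bit */
         |     unsigned char ascii_tolower(unsigned char c) {
         |         return (c >= 'A' && c <= 'Z') ? (c | 0x20) : c;
         |     }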
        
       | babelfish wrote:
       | Great article. Have been brushing up on bit manipulation for
       | interview prep lately and I love finding easy to digest, real-
       | world use cases like this.
        
       | MrYellowP wrote:
       | Had I known anyone cared about this ...
       | 
       | Anything else anyone cares about that could use a speedup...?
        
       | charcircuit wrote:
       | Since domains are UTF-8, doesn't the assumption of ASCII break
       | down?
        
         | sveiss wrote:
         | Domain names are limited to a small subset of ASCII; the DNS
         | protocol doesn't support UTF-8.
         | 
         | To work around this, the international domain name standard
         | defines an encoding called Punycode which maps Unicode to the
         | limited character set DNS supports. The server is unaware of
         | this, and so this optimised tolower() implementation works
         | without any Unicode considerations.
        
           | asveikau wrote:
           | Lower case is locale-sensitive, however. For example,
           | tolower('I') in Turkish should be the dotless 'ı'.
           | 
           | And within Unicode, doing it in a "dumb ASCII" way probably
           | needs some normalization of diacritics. E.g. 'é' should be
           | U+0065 U+0301 ("e\xcc\x81"), not U+00E9 ("\xc3\xa9").
           | 
           | Not sure how punycode handles this, I did once look deeply
           | into it but that was years ago.
        
         | jimmygrapes wrote:
         | Absolutely. This article is reminiscent of the type of brief
         | tutorials from the mid/late 90s in QBASIC newsletters
         | introducing how to use ASM to get speed gains, with all the
         | associated assumptions. The concept of "lowercase" breaks down
         | terribly once you go beyond simple A-Z. You can apply similar
         | rules to certain limited code sets, but a universal tolower()
         | won't work so well with this method.
        
         | mgcunha wrote:
         | DNS labels are limited to a subset of ASCII characters.
         | Unicode support in DNS is provided via Punycode, which is an
         | encoding from Unicode onto the restricted ASCII set acceptable
         | to DNS. Libraries like the ones discussed here typically
         | perform Unicode-to-Punycode conversion before doing any label
         | comparisons to ensure accuracy. In particular, this tolower
         | implementation would likely be used on the Punycode-encoded
         | version of a Unicode domain name.
         | 
         | https://en.wikipedia.org/wiki/Punycode
        
       ___________________________________________________________________
       (page generated 2022-06-27 23:00 UTC)