[HN Gopher] Parsing 8-bit integers quickly
       ___________________________________________________________________
        
       Parsing 8-bit integers quickly
        
       Author : usdogu
       Score  : 60 points
       Date   : 2023-11-28 20:19 UTC (2 hours ago)
        
 (HTM) web link (lemire.me)
 (TXT) w3m dump (lemire.me)
        
       | nasretdinov wrote:
       | I imagine that you are not allowed to allocate a constant array
       | that would contain a mapping between ASCII values of integers and
       | the actual ints :)? The're just 255 of them needed. Or woukd it
       | be slower?
        
         | jstanley wrote:
         | How would you actually implement that?
         | 
         | The naive approach where you just index with the ASCII values
         | you have would have to contain 16 million elements since it
         | needs to look at 3 bytes.
         | 
         | Anything more space efficient than that would probably be
         | slower than TFA.
        
           | kazinator wrote:
           | Here is a table-driven idea.                 #include
           | <stdio.h>            int bcd2dec[] = {         [0x000] = 0,
           | [0x001] = 1, [0x002] = 2, [0x003] = 3, [0x004] = 4,
           | [0x005] = 5, [0x006] = 6, [0x007] = 7, [0x008] = 8, [0x009] =
           | 9,         [0x010] = 10, [0x011] = 11, [0x012] = 12, [0x013]
           | = 13, [0x014] = 14,         [0x015] = 15, [0x016] = 16,
           | [0x017] = 17, [0x018] = 18, [0x019] = 19,         [0x020] =
           | 20, [0x021] = 21, [0x022] = 22, [0x023] = 23, [0x024] = 24,
           | [0x025] = 25, [0x026] = 26, [0x027] = 27, [0x028] = 28,
           | [0x029] = 29,         [0x030] = 30, [0x031] = 31, [0x032] =
           | 32, [0x033] = 33, [0x034] = 34,         [0x035] = 35, [0x036]
           | = 36, [0x037] = 37, [0x038] = 38, [0x039] = 39,
           | [0x040] = 40, [0x041] = 41, [0x042] = 42, [0x043] = 43,
           | [0x044] = 44,         [0x045] = 45, [0x046] = 46, [0x047] =
           | 47, [0x048] = 48, [0x049] = 49,         [0x050] = 50, [0x051]
           | = 51, [0x052] = 52, [0x053] = 53, [0x054] = 54,
           | [0x055] = 55, [0x056] = 56, [0x057] = 57, [0x058] = 58,
           | [0x059] = 59,         [0x060] = 60, [0x061] = 61, [0x062] =
           | 62, [0x063] = 63, [0x064] = 64,         [0x065] = 65, [0x066]
           | = 66, [0x067] = 67, [0x068] = 68, [0x069] = 69,
           | [0x070] = 70, [0x071] = 71, [0x072] = 72, [0x073] = 73,
           | [0x074] = 74,         [0x075] = 75, [0x076] = 76, [0x077] =
           | 77, [0x078] = 78, [0x079] = 79,         [0x080] = 80, [0x081]
           | = 81, [0x082] = 82, [0x083] = 83, [0x084] = 84,
           | [0x085] = 85, [0x086] = 86, [0x087] = 87, [0x088] = 88,
           | [0x089] = 89,         [0x090] = 90, [0x091] = 91, [0x092] =
           | 92, [0x093] = 93, [0x094] = 94,         [0x095] = 95, [0x096]
           | = 96, [0x097] = 97, [0x098] = 98, [0x099] = 99,
           | [0x100] = 100, [0x101] = 101, [0x102] = 102, [0x103] = 103,
           | [0x104] = 104,         [0x105] = 105, [0x106] = 106, [0x107]
           | = 107, [0x108] = 108, [0x109] = 109,         [0x110] = 110,
           | [0x111] = 111, [0x112] = 112, [0x113] = 113, [0x114] = 114,
           | [0x115] = 115, [0x116] = 116, [0x117] = 117, [0x118] = 118,
           | [0x119] = 119,         [0x120] = 120, [0x121] = 121, [0x122]
           | = 122, [0x123] = 123, [0x124] = 124,         [0x125] = 125,
           | [0x126] = 126, [0x127] = 127, [0x128] = 128, [0x129] = 129,
           | [0x130] = 130, [0x131] = 131, [0x132] = 132, [0x133] = 133,
           | [0x134] = 134,         [0x135] = 135, [0x136] = 136, [0x137]
           | = 137, [0x138] = 138, [0x139] = 139,         [0x140] = 140,
           | [0x141] = 141, [0x142] = 142, [0x143] = 143, [0x144] = 144,
           | [0x145] = 145, [0x146] = 146, [0x147] = 147, [0x148] = 148,
           | [0x149] = 149,         [0x150] = 150, [0x151] = 151, [0x152]
           | = 152, [0x153] = 153, [0x154] = 154,         [0x155] = 155,
           | [0x156] = 156, [0x157] = 157, [0x158] = 158, [0x159] = 159,
           | [0x160] = 160, [0x161] = 161, [0x162] = 162, [0x163] = 163,
           | [0x164] = 164,         [0x165] = 165, [0x166] = 166, [0x167]
           | = 167, [0x168] = 168, [0x169] = 169,         [0x170] = 170,
           | [0x171] = 171, [0x172] = 172, [0x173] = 173, [0x174] = 174,
           | [0x175] = 175, [0x176] = 176, [0x177] = 177, [0x178] = 178,
           | [0x179] = 179,         [0x180] = 180, [0x181] = 181, [0x182]
           | = 182, [0x183] = 183, [0x184] = 184,         [0x185] = 185,
           | [0x186] = 186, [0x187] = 187, [0x188] = 188, [0x189] = 189,
           | [0x190] = 190, [0x191] = 191, [0x192] = 192, [0x193] = 193,
           | [0x194] = 194,         [0x195] = 195, [0x196] = 196, [0x197]
           | = 197, [0x198] = 198, [0x199] = 199,         [0x200] = 200,
           | [0x201] = 201, [0x202] = 202, [0x203] = 203, [0x204] = 204,
           | [0x205] = 205, [0x206] = 206, [0x207] = 207, [0x208] = 208,
           | [0x209] = 209,         [0x210] = 210, [0x211] = 211, [0x212]
           | = 212, [0x213] = 213, [0x214] = 214,         [0x215] = 215,
           | [0x216] = 216, [0x217] = 217, [0x218] = 218, [0x219] = 219,
           | [0x220] = 220, [0x221] = 221, [0x222] = 222, [0x223] = 223,
           | [0x224] = 224,         [0x225] = 225, [0x226] = 226, [0x227]
           | = 227, [0x228] = 228, [0x229] = 229,         [0x230] = 230,
           | [0x231] = 231, [0x232] = 232, [0x233] = 233, [0x234] = 234,
           | [0x235] = 235, [0x236] = 236, [0x237] = 237, [0x238] = 238,
           | [0x239] = 239,         [0x240] = 240, [0x241] = 241, [0x242]
           | = 242, [0x243] = 243, [0x244] = 244,         [0x245] = 245,
           | [0x246] = 246, [0x247] = 247, [0x248] = 248, [0x249] = 249,
           | [0x250] = 250, [0x251] = 251, [0x252] = 252, [0x253] = 253,
           | [0x254] = 254,         [0x255] = 255,       };            /*
           | * If the input is an empty string or a string of length four
           | or more,        * we return -1. Other than that, we assume we
           | have digits.        */       int parse_uint8_bcd(const char
           | *str)       {         if (!str[0])           return -1;
           | if (!str[1])           return str[0] & 0xF;         if
           | (!str[2])           return bcd2dec[(str[0] & 0xF) << 4 |
           | (str[1] & 0xF)];         if (!str[3])           return
           | bcd2dec[(str[0] & 0xF) << 8 | (str[1] & 0xF) << 4 | (str[2] &
           | 0xF)];         return -1;       }            int main(int
           | argc, char **argv)       {         if (argv[1])
           | printf("value of %s is %d\n", argv[1],
           | parse_uint8_bcd(argv[1]));              return 0;       }
        
           | zamadatix wrote:
           | For most CPUs you can do multiple lookups at once and have
           | multiple in flight. This would allow you to use 256 bytes
           | which should be able to stay in lower level cache once you
           | start doing any meaningful number of conversions.
           | 
           | So not necessarily enormous and dog slow... but the real test
           | is put it with the rest of your real code and see what
           | happens. If you come to a certain answer it may not stay the
           | same for another real bit of code.
        
           | nasretdinov wrote:
           | Yeah you're right, I didn't consider that. However even if
           | you allocate a 16 mln array you'll only need to access ~26
           | different cache lines, so it might still be ok? That's
           | assuming that my calculation is correct that numbers will be
           | grouped in tens, e.g. 240-249 would be on the same cache line
           | as their numeric values are different by just 10
        
         | pxx wrote:
         | That is likely to be extremely slow.
         | https://news.ycombinator.com/item?id=37823805
         | 
         | Also, the 8-bit lookup table only saves you a couple
         | subtractions and conditional moves from the naive solution?
         | That seems like one of the worst tradeoffs to make.
        
         | gpvos wrote:
         | You'd need a very sparse array of 2^(3*8) =~ 16 million
         | entries, not 256, since you're going in the other direction. A
         | hashtable might work.
        
         | wholesomepotato wrote:
         | Would use XOR 0x30303030lu, then OR with a value shifted left
         | 12 bits, take bits 24..12 and lookup in 4k pre-computed lookup
         | array.
        
       | kazinator wrote:
       | $ cat parse256.c       #include <stdio.h>       #include
       | <stdint.h>       #include <string.h>            int
       | parse_uint8_fastswar(const char *str, size_t len, uint8_t *num) {
       | union {           uint8_t as_str[4];           uint32_t as_int;
       | } digits;         memcpy(&digits.as_int, str, sizeof(digits));
       | digits.as_int ^= 0x30303030lu;         digits.as_int <<= (4 -
       | (len & 0x3)) * 8;         uint32_t y = ((UINT64_C(0x640a0100) *
       | digits.as_int)>>32)&0xff;         *num = (uint8_t)(y);
       | return (digits.as_int & 0xf0f0f0f0) == 0 && y < 256 && len != 0
       | && len < 4;       }            int main(int argc, char **argv)
       | {         if (argv[1]) {           uint8_t val = 0;           if
       | (parse_uint8_fastswar(argv[1], strlen(argv[1]), &val)) {
       | printf("value of %s is %d\n", argv[1], (int) val);           }
       | else {             printf("%s is invalid, value stored %d\n",
       | argv[1], (int) val);           }         }              return 0;
       | }       $ ./parse256 123       123 is invalid, value stored 225
       | $ uname -a       Linux gcc1-power7.osuosl.org
       | 3.10.0-862.14.4.el7.ppc64 #1 SMP Wed Sep 26 20:38:32 GMT 2018
       | ppc64 ppc64 ppc64 GNU/Linux
       | 
       | Oops!
        
         | zimpenfish wrote:
         | ~ $ ./a.out 123         value of 123 is 123         ~ $ uname
         | -a         Darwin CyanistesCyanus.local 23.2.0 Darwin Kernel
         | Version 23.2.0: Thu Nov  9 06:28:34 PST 2023; root:xnu-
         | 10002.60.71.505.1~3/RELEASE_ARM64_T8103 arm64         ~ $ gcc
         | -v         Apple clang version 15.0.0 (clang-1500.0.40.1)
         | Target: arm64-apple-darwin23.2.0         Thread model: posix
         | InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolch
         | ains/XcodeDefault.xctoolchain/usr/bin
        
         | bangonkeyboard wrote:
         | These gotchas make keeping a big-endian machine around totally
         | worth it.
        
           | dataflow wrote:
           | I'm guessing there isn't any way to simulate big endian on
           | x86?
        
       | jstanley wrote:
       | I tried this out but it parsed the string "123\n" as 32.
       | 
       | Also it parses "400" as 144, when the reference implementation
       | considers it not-a-uint8, but I don't mind so much about that.
       | 
       | EDIT: Ah, I think it assumes the string contains _only_ a uint8,
       | rather than trying to parse a uint8 from the start of a string.
       | So you need to zero out the  "\n" separately, and then it works.
        
         | Sharlin wrote:
         | Like the article says, it uses SWAR and _requires_ a buffer of
         | at least four bytes, with the trailing bytes zeroed. Your
         | strings were both four bytes which saved you; had you tried
         | with  "42" or something, that would've been undefined behavior!
         | 
         | Edit: No, wait, the `len` parameter is there to ensure that
         | trailing bytes don't matter. But note that it must be the
         | length of the number-containing prefix part, not the whole
         | input string!
         | 
         | > Also it parses "400" as 144
         | 
         | Did you check the return value? If false (as it is in case of
         | overflow), the result is meaningless.
        
           | jstanley wrote:
           | Yes I checked the return value. Example using the program
           | from https://news.ycombinator.com/item?id=38451560
           | $ ./parse256 400       value of 400 is 144
        
       | firebaze wrote:
       | Looking forward to "parsing bit sequences in roman literals
       | quickly"
       | 
       | https://hn.algolia.com/?q=lemire+parsing
        
       | nvartolomei wrote:
       | dlemire, you note that the read "overflows". Why can't you copy
       | just `len` bytes? Does it slow too much because of the
       | branch/more load/store operations?
        
       | vinkelhake wrote:
       | The SWAR algorithm accepts the 6 ASCII characters after '9'.
       | It'll parse ":>" as 114.                   int res =
       | parse_uint8_fastswar(":>\0", 2, &num);
       | 
       | Returns true and num is 114.
        
       | zokier wrote:
       | The use of union here is bit confusing (I think its
       | unnecessary?), although I don't imagine it making any difference
       | in the generated code.
        
         | mwkaufma wrote:
         | That's the accepted method in c to do well-defined type-
         | punning.
        
           | vinkelhake wrote:
           | But the `as_str` field is never used. The type punning (and
           | all the wonderful endian issues) is already happening in the
           | memcpy.
        
             | mwkaufma wrote:
             | Oh yeah, wheee, I'm not carefully reading. Maybe it's
             | vestigial leftovers from a port from c to c++ (major impls
             | handle it in C++ like in C but the standard insists it's
             | UB).
        
         | progmetaldev wrote:
         | Agreed, as someone that isn't very familiar with C, but has
         | used many C-like languages, the names "as_str" and "as_int"
         | threw me off.
        
       | amelius wrote:
       | Fetching the data should be the bottleneck (by far), so why is
       | the naive approach 2x slower than this smarter approach?
       | 
       | Sounds like the CPU should be designed in a smarter way, not the
       | code.
        
       | IshKebab wrote:
       | Presumably this also only works well if the data is 4-byte
       | aligned.
        
       ___________________________________________________________________
       (page generated 2023-11-28 23:00 UTC)