[HN Gopher] Parsing 8-bit integers quickly ___________________________________________________________________ Parsing 8-bit integers quickly Author : usdogu Score : 60 points Date : 2023-11-28 20:19 UTC (2 hours ago) (HTM) web link (lemire.me) (TXT) w3m dump (lemire.me) | nasretdinov wrote: | I imagine that you are not allowed to allocate a constant array | that would contain a mapping between ASCII values of integers and | the actual ints :)? The're just 255 of them needed. Or woukd it | be slower? | jstanley wrote: | How would you actually implement that? | | The naive approach where you just index with the ASCII values | you have would have to contain 16 million elements since it | needs to look at 3 bytes. | | Anything more space efficient than that would probably be | slower than TFA. | kazinator wrote: | Here is a table-driven idea. #include | <stdio.h> int bcd2dec[] = { [0x000] = 0, | [0x001] = 1, [0x002] = 2, [0x003] = 3, [0x004] = 4, | [0x005] = 5, [0x006] = 6, [0x007] = 7, [0x008] = 8, [0x009] = | 9, [0x010] = 10, [0x011] = 11, [0x012] = 12, [0x013] | = 13, [0x014] = 14, [0x015] = 15, [0x016] = 16, | [0x017] = 17, [0x018] = 18, [0x019] = 19, [0x020] = | 20, [0x021] = 21, [0x022] = 22, [0x023] = 23, [0x024] = 24, | [0x025] = 25, [0x026] = 26, [0x027] = 27, [0x028] = 28, | [0x029] = 29, [0x030] = 30, [0x031] = 31, [0x032] = | 32, [0x033] = 33, [0x034] = 34, [0x035] = 35, [0x036] | = 36, [0x037] = 37, [0x038] = 38, [0x039] = 39, | [0x040] = 40, [0x041] = 41, [0x042] = 42, [0x043] = 43, | [0x044] = 44, [0x045] = 45, [0x046] = 46, [0x047] = | 47, [0x048] = 48, [0x049] = 49, [0x050] = 50, [0x051] | = 51, [0x052] = 52, [0x053] = 53, [0x054] = 54, | [0x055] = 55, [0x056] = 56, [0x057] = 57, [0x058] = 58, | [0x059] = 59, [0x060] = 60, [0x061] = 61, [0x062] = | 62, [0x063] = 63, [0x064] = 64, [0x065] = 65, [0x066] | = 66, [0x067] = 67, [0x068] = 68, [0x069] = 69, | [0x070] = 70, [0x071] = 71, [0x072] = 72, [0x073] = 73, | [0x074] = 74, [0x075] = 75, [0x076] = 76, [0x077] = | 77, [0x078] = 78, [0x079] = 79, [0x080] = 80, [0x081] | = 81, [0x082] = 82, [0x083] = 83, [0x084] = 84, | [0x085] = 85, [0x086] = 86, [0x087] = 87, [0x088] = 88, | [0x089] = 89, [0x090] = 90, [0x091] = 91, [0x092] = | 92, [0x093] = 93, [0x094] = 94, [0x095] = 95, [0x096] | = 96, [0x097] = 97, [0x098] = 98, [0x099] = 99, | [0x100] = 100, [0x101] = 101, [0x102] = 102, [0x103] = 103, | [0x104] = 104, [0x105] = 105, [0x106] = 106, [0x107] | = 107, [0x108] = 108, [0x109] = 109, [0x110] = 110, | [0x111] = 111, [0x112] = 112, [0x113] = 113, [0x114] = 114, | [0x115] = 115, [0x116] = 116, [0x117] = 117, [0x118] = 118, | [0x119] = 119, [0x120] = 120, [0x121] = 121, [0x122] | = 122, [0x123] = 123, [0x124] = 124, [0x125] = 125, | [0x126] = 126, [0x127] = 127, [0x128] = 128, [0x129] = 129, | [0x130] = 130, [0x131] = 131, [0x132] = 132, [0x133] = 133, | [0x134] = 134, [0x135] = 135, [0x136] = 136, [0x137] | = 137, [0x138] = 138, [0x139] = 139, [0x140] = 140, | [0x141] = 141, [0x142] = 142, [0x143] = 143, [0x144] = 144, | [0x145] = 145, [0x146] = 146, [0x147] = 147, [0x148] = 148, | [0x149] = 149, [0x150] = 150, [0x151] = 151, [0x152] | = 152, [0x153] = 153, [0x154] = 154, [0x155] = 155, | [0x156] = 156, [0x157] = 157, [0x158] = 158, [0x159] = 159, | [0x160] = 160, [0x161] = 161, [0x162] = 162, [0x163] = 163, | [0x164] = 164, [0x165] = 165, [0x166] = 166, [0x167] | = 167, [0x168] = 168, [0x169] = 169, [0x170] = 170, | [0x171] = 171, [0x172] = 172, [0x173] = 173, [0x174] = 174, | [0x175] = 175, [0x176] = 176, [0x177] = 177, [0x178] = 178, | [0x179] = 179, [0x180] = 180, [0x181] = 181, [0x182] | = 182, [0x183] = 183, [0x184] = 184, [0x185] = 185, | [0x186] = 186, [0x187] = 187, [0x188] = 188, [0x189] = 189, | [0x190] = 190, [0x191] = 191, [0x192] = 192, [0x193] = 193, | [0x194] = 194, [0x195] = 195, [0x196] = 196, [0x197] | = 197, [0x198] = 198, [0x199] = 199, [0x200] = 200, | [0x201] = 201, [0x202] = 202, [0x203] = 203, [0x204] = 204, | [0x205] = 205, [0x206] = 206, [0x207] = 207, [0x208] = 208, | [0x209] = 209, [0x210] = 210, [0x211] = 211, [0x212] | = 212, [0x213] = 213, [0x214] = 214, [0x215] = 215, | [0x216] = 216, [0x217] = 217, [0x218] = 218, [0x219] = 219, | [0x220] = 220, [0x221] = 221, [0x222] = 222, [0x223] = 223, | [0x224] = 224, [0x225] = 225, [0x226] = 226, [0x227] | = 227, [0x228] = 228, [0x229] = 229, [0x230] = 230, | [0x231] = 231, [0x232] = 232, [0x233] = 233, [0x234] = 234, | [0x235] = 235, [0x236] = 236, [0x237] = 237, [0x238] = 238, | [0x239] = 239, [0x240] = 240, [0x241] = 241, [0x242] | = 242, [0x243] = 243, [0x244] = 244, [0x245] = 245, | [0x246] = 246, [0x247] = 247, [0x248] = 248, [0x249] = 249, | [0x250] = 250, [0x251] = 251, [0x252] = 252, [0x253] = 253, | [0x254] = 254, [0x255] = 255, }; /* | * If the input is an empty string or a string of length four | or more, * we return -1. Other than that, we assume we | have digits. */ int parse_uint8_bcd(const char | *str) { if (!str[0]) return -1; | if (!str[1]) return str[0] & 0xF; if | (!str[2]) return bcd2dec[(str[0] & 0xF) << 4 | | (str[1] & 0xF)]; if (!str[3]) return | bcd2dec[(str[0] & 0xF) << 8 | (str[1] & 0xF) << 4 | (str[2] & | 0xF)]; return -1; } int main(int | argc, char **argv) { if (argv[1]) | printf("value of %s is %d\n", argv[1], | parse_uint8_bcd(argv[1])); return 0; } | zamadatix wrote: | For most CPUs you can do multiple lookups at once and have | multiple in flight. This would allow you to use 256 bytes | which should be able to stay in lower level cache once you | start doing any meaningful number of conversions. | | So not necessarily enormous and dog slow... but the real test | is put it with the rest of your real code and see what | happens. If you come to a certain answer it may not stay the | same for another real bit of code. | nasretdinov wrote: | Yeah you're right, I didn't consider that. However even if | you allocate a 16 mln array you'll only need to access ~26 | different cache lines, so it might still be ok? That's | assuming that my calculation is correct that numbers will be | grouped in tens, e.g. 240-249 would be on the same cache line | as their numeric values are different by just 10 | pxx wrote: | That is likely to be extremely slow. | https://news.ycombinator.com/item?id=37823805 | | Also, the 8-bit lookup table only saves you a couple | subtractions and conditional moves from the naive solution? | That seems like one of the worst tradeoffs to make. | gpvos wrote: | You'd need a very sparse array of 2^(3*8) =~ 16 million | entries, not 256, since you're going in the other direction. A | hashtable might work. | wholesomepotato wrote: | Would use XOR 0x30303030lu, then OR with a value shifted left | 12 bits, take bits 24..12 and lookup in 4k pre-computed lookup | array. | kazinator wrote: | $ cat parse256.c #include <stdio.h> #include | <stdint.h> #include <string.h> int | parse_uint8_fastswar(const char *str, size_t len, uint8_t *num) { | union { uint8_t as_str[4]; uint32_t as_int; | } digits; memcpy(&digits.as_int, str, sizeof(digits)); | digits.as_int ^= 0x30303030lu; digits.as_int <<= (4 - | (len & 0x3)) * 8; uint32_t y = ((UINT64_C(0x640a0100) * | digits.as_int)>>32)&0xff; *num = (uint8_t)(y); | return (digits.as_int & 0xf0f0f0f0) == 0 && y < 256 && len != 0 | && len < 4; } int main(int argc, char **argv) | { if (argv[1]) { uint8_t val = 0; if | (parse_uint8_fastswar(argv[1], strlen(argv[1]), &val)) { | printf("value of %s is %d\n", argv[1], (int) val); } | else { printf("%s is invalid, value stored %d\n", | argv[1], (int) val); } } return 0; | } $ ./parse256 123 123 is invalid, value stored 225 | $ uname -a Linux gcc1-power7.osuosl.org | 3.10.0-862.14.4.el7.ppc64 #1 SMP Wed Sep 26 20:38:32 GMT 2018 | ppc64 ppc64 ppc64 GNU/Linux | | Oops! | zimpenfish wrote: | ~ $ ./a.out 123 value of 123 is 123 ~ $ uname | -a Darwin CyanistesCyanus.local 23.2.0 Darwin Kernel | Version 23.2.0: Thu Nov 9 06:28:34 PST 2023; root:xnu- | 10002.60.71.505.1~3/RELEASE_ARM64_T8103 arm64 ~ $ gcc | -v Apple clang version 15.0.0 (clang-1500.0.40.1) | Target: arm64-apple-darwin23.2.0 Thread model: posix | InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolch | ains/XcodeDefault.xctoolchain/usr/bin | bangonkeyboard wrote: | These gotchas make keeping a big-endian machine around totally | worth it. | dataflow wrote: | I'm guessing there isn't any way to simulate big endian on | x86? | jstanley wrote: | I tried this out but it parsed the string "123\n" as 32. | | Also it parses "400" as 144, when the reference implementation | considers it not-a-uint8, but I don't mind so much about that. | | EDIT: Ah, I think it assumes the string contains _only_ a uint8, | rather than trying to parse a uint8 from the start of a string. | So you need to zero out the "\n" separately, and then it works. | Sharlin wrote: | Like the article says, it uses SWAR and _requires_ a buffer of | at least four bytes, with the trailing bytes zeroed. Your | strings were both four bytes which saved you; had you tried | with "42" or something, that would've been undefined behavior! | | Edit: No, wait, the `len` parameter is there to ensure that | trailing bytes don't matter. But note that it must be the | length of the number-containing prefix part, not the whole | input string! | | > Also it parses "400" as 144 | | Did you check the return value? If false (as it is in case of | overflow), the result is meaningless. | jstanley wrote: | Yes I checked the return value. Example using the program | from https://news.ycombinator.com/item?id=38451560 | $ ./parse256 400 value of 400 is 144 | firebaze wrote: | Looking forward to "parsing bit sequences in roman literals | quickly" | | https://hn.algolia.com/?q=lemire+parsing | nvartolomei wrote: | dlemire, you note that the read "overflows". Why can't you copy | just `len` bytes? Does it slow too much because of the | branch/more load/store operations? | vinkelhake wrote: | The SWAR algorithm accepts the 6 ASCII characters after '9'. | It'll parse ":>" as 114. int res = | parse_uint8_fastswar(":>\0", 2, &num); | | Returns true and num is 114. | zokier wrote: | The use of union here is bit confusing (I think its | unnecessary?), although I don't imagine it making any difference | in the generated code. | mwkaufma wrote: | That's the accepted method in c to do well-defined type- | punning. | vinkelhake wrote: | But the `as_str` field is never used. The type punning (and | all the wonderful endian issues) is already happening in the | memcpy. | mwkaufma wrote: | Oh yeah, wheee, I'm not carefully reading. Maybe it's | vestigial leftovers from a port from c to c++ (major impls | handle it in C++ like in C but the standard insists it's | UB). | progmetaldev wrote: | Agreed, as someone that isn't very familiar with C, but has | used many C-like languages, the names "as_str" and "as_int" | threw me off. | amelius wrote: | Fetching the data should be the bottleneck (by far), so why is | the naive approach 2x slower than this smarter approach? | | Sounds like the CPU should be designed in a smarter way, not the | code. | IshKebab wrote: | Presumably this also only works well if the data is 4-byte | aligned. ___________________________________________________________________ (page generated 2023-11-28 23:00 UTC)