[HN Gopher] Fun with Glibc and the Ctype.h Functions
___________________________________________________________________
Fun with Glibc and the Ctype.h Functions

Author : picture
Score  : 24 points
Date   : 2021-09-30 05:36 UTC (1 day ago)

(HTM) web link (rachelbythebay.com)
(TXT) w3m dump (rachelbythebay.com)

| _kst_ wrote:
| IMHO the more interesting oddity about the functions declared in
| <ctype.h> is that they work with unsigned char, which means that
| they have undefined behavior if you pass a negative char value
| (other than EOF, which is typically -1).
|
| This means that if you have a char value (say, an element of a
| string), you need to cast it to unsigned char before passing it
| to any of the is*() functions.
|
| gumby wrote:
| The rant behind her post
| (https://drewdevault.com/2020/09/25/A-story-of-two-libcs.html),
| which has had some circulation, really shows its author's limited
| perspective.
|
| glibc needs to solve two hard problems: be very fast and run on
| innumerable systems. Some of that conditional stuff is because
| all the world is not Linux or BSD; some of the macrology is there
| to make sure such handling is performed everywhere needed, and of
| course the preprocessor is the closest a language like C can get
| to preprocessing.
|
| I was in the code as glibc started to exist (we paid for a lot of
| it) and it looked like Musl: very straightforward.
|
| cryptonector wrote:
| The definition of the ctype functions as working on unsigned
| char values and EOF, plus CHAR_BIT being 8 everywhere now,
| basically means that there isn't much locale-specificity to the
| ctype functions: they can be made to work with ASCII, ISO-8859-*,
| and... EBCDIC, but not UTF-8 in general (just ASCII) or any
| Unicode encoding (idk, maybe they can be made to be
| locale-specific for Shift-JIS, but only for ASCII in Shift-JIS).
|
| And...
| yes, glibc does have support for EBCDIC, which is probably
| ultimately why it has these run-time indirections in its ctype.
| There's no other reason to have run-time indirections for ctype
| functions given the limitation of unsigned char values + EOF.
| That means this code can be simplified a great deal.
|
| Anyways, yes, Drew DeVault's rant misses glibc's need to support
| EBCDIC, but glibc is exactly like this for every little thing --
| an unmaintainable mess. There has to be a better way to produce a
| fast C library w/o being such a mess on the inside.
|
| Hello71 wrote:
| > all the world is not Linux or BSD
|
| since when does glibc run on bsd
|
| LukeShu wrote:
| Well for one, Debian GNU/kFreeBSD.
|
| jcelerier wrote:
| didn't glibc exist before linux? surely it would have been
| running on bsd then
|
| tyingq wrote:
| I do get what you're saying, but musl also has to live in many
| different worlds. Take the example in the post you linked, where
| glibc is trawling into endianness. Musl runs on a bunch of
| different big- and little-endian router boxes and other unusual
| use cases. While I haven't tested, I'm guessing that their much
| simpler isalnum() works fine on all of them.
|
| Musl does have a lot less legacy to contend with, and musl is
| often much slower than glibc, so your point stands, of course.
|
| masklinn wrote:
| > While I haven't tested, I'm guessing that their much
| > simpler isalnum() works fine on all of them.
|
| isalnum works fine on both; it only veers off when you get into
| UB, which is UB.
|
| If you define "works fine" as "gives correct answers even in
| UB" then musl's is completely broken, since it only gives
| correct answers for English in ASCII.
|
| cryptonector wrote:
| It can't give correct answers for anything other than
| English in UTF-8 locales.
|
| It can't give correct answers for any non-Latin scripts in
| any locales.
|
| The problem is ctype and POSIX.
|
| Given that, making ctype only work for ASCII (and maybe
| EBCDIC if you're really unlucky, which glibc is) is
| basically sufficient.
|
| jcelerier wrote:
| musl's "isalpha" is trivially wrong; for instance, it wouldn't
| support "ç" (0xe7) or "ß" (0xdf) in ISO 8859-1, which are both
| alphabetic characters that fit in an unsigned char.
|
| cryptonector wrote:
| ctype is trivially non-localizable to locales with codesets
| whose characters don't fit in an unsigned char anyways. Maybe
| the problem here is POSIX.
|
| jcelerier wrote:
| oh yes, no code written in 2021 should use that mess. but
| with glibc aiming for some level of posix compatibility..
| hard to blame them for at least trying to make it work.
|
| cryptonector wrote:
| Hmm, well, I mean, if ctype can't work for any interesting
| non-ASCII (and non-EBCDIC) cases (no one should still be using
| ISO-8859 locales...)... maybe stop trying so hard?
|
| tyingq wrote:
| Those both return 0 for isalpha() on glibc for me, with or
| without export LC_CTYPE=iso_8859_1
|
| Is there some other setup I'd need to do to see it work in
| glibc?
|
| jcelerier wrote:
| most likely you need to build the locale on your system
| (uncomment the relevant line in /etc/locale.gen and run
| sudo locale-gen).
|
| here
|
|     #include <ctype.h>
|     #include <locale.h>
|     #include <stdio.h>
|
|     int main(int argc, char** argv)
|     {
|       setlocale(LC_CTYPE, "fr_FR.iso88591");
|       /* cast so a negative char value is never passed in */
|       if(isalpha((unsigned char)'ç'))
|         printf("ok\n");
|     }
|
| prints ok (with the file in the correct encoding)
|
| _kst_ wrote:
| isalpha() works with the "C" locale unless you first call
| setlocale().
|
| For example, on my system isalpha(0xe7) is true if I first
| call setlocale(LC_ALL, "en_US.iso88591").
|
| jcelerier wrote:
| well, yes, in "normal" C programs you're supposed to
| fetch the locale from the user's env vars (with
| setlocale(LC_ALL, ""))
|
| [deleted]
|
| anonymousiam wrote:
| Ran it on 32-bit ARM, 64-bit ARM, 32-bit x86, and 64-bit x86.
| All had different results, but all were the same until index 549,
| which is greater than the maximum value for unsigned char (255).
|
| zx2c4 wrote:
| Here are some branchless/constant-time versions of those
| functions that don't rely on locale:
| https://git.zx2c4.com/wireguard-tools/tree/src/ctype.h
|
| malkia wrote:
| I like the suffix in 0x80001FU
|
| josephcsible wrote:
| Here's what the C standard says about character handling
| functions:
|
| > In all cases the argument is an int, the value of which shall
| > be representable as an unsigned char or shall equal the value of
| > the macro EOF. If the argument has any other value, the behavior
| > is undefined.
|
| So this is just a case of glibc being optimized in a way that's
| really unforgiving if you commit that particular UB.
|
| cryptonector wrote:
| No, this is a case of glibc trying to support localization of
| ctype in spite of the fact that it can't be localized to
| anything other than English in UTF-8 locales, anything other
| than Latin scripts in ISO-8859-* locales, or English in C/POSIX
| or EBCDIC locales. And then on top of that trying to be fast.
|
| I'd give up on supporting localization for ctype.
|
| This makes me think, too, "never use ctype, just hardcode my
| own that assumes ASCII".
|
| guidovranken wrote:
| This also applies to the C++ <cctype> functions, like
| std::isspace.
|
| Another fun one: With FD_CLR, FD_ISSET, FD_SET you can corrupt
| memory by merely passing a socket descriptor that is outside
| the range 0..1023 (that is, not below FD_SETSIZE, which is 1024
| on glibc). Pass a negative integer for some undefined behavior
| as well (shift by negative value occurs here [1])
|
| [1]
| https://github.com/lattera/glibc/blob/895ef79e04a953cac14938...
___________________________________________________________________
(page generated 2021-10-01 23:00 UTC)