[HN Gopher] Fun with Glibc and the Ctype.h Functions
       ___________________________________________________________________
        
       Fun with Glibc and the Ctype.h Functions
        
       Author : picture
       Score  : 24 points
       Date   : 2021-09-30 05:36 UTC (1 days ago)
        
 (HTM) web link (rachelbythebay.com)
 (TXT) w3m dump (rachelbythebay.com)
        
       | _kst_ wrote:
       | IMHO the more interesting oddity about the functions declared in
       | <ctype.h> is that they work with unsigned char, which means that
       | they have undefined behavior if you pass a negative char value
       | (other than EOF, which is typically -1).
       | 
       | This means that if you have a char value (say, an element of a
       | string), you need to cast it to unsigned char before passing it
       | to any of the is*() functions.
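
A minimal sketch of the cast described above, assuming a NUL-terminated string and the default "C" locale. The helper name `count_alpha` is hypothetical and used only for illustration.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: count the bytes in s that isalpha() accepts. */
size_t count_alpha(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        /* Cast first: plain char may be signed, and a negative value
           other than EOF is outside the domain of the is*() functions. */
        if (isalpha((unsigned char)*s))
            ++n;
    }
    return n;
}
```

Without the cast, a byte such as 0xE7 stored in a signed char would reach isalpha() as a negative int.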
        
       | gumby wrote:
       | The rant behind her post
       | (https://drewdevault.com/2020/09/25/A-story-of-two-libcs.html ),
       | which has had some circulation, really shows its author's limited
       | perspective.
       | 
       | glibc needs to solve two hard problems: be very fast and run on
       | innumerable systems. Some of that conditional stuff is because
       | all the world is not Linux or BSD; some of the macrology is there
       | to make sure such handling is performed everywhere needed, and of
       | course the preprocessor is the closest a language like C can get
       | to preprocessing.
       | 
       | I was in the code as glibc started to exist (we paid for a lot of
       | it) and it looked like Musl: very straightforward.
        
         | cryptonector wrote:
         | The definition of the ctype functions as working on unsigned
         | char values and EOF, plus CHAR_BIT being 8 everywhere now,
         | basically means that there isn't much locale-specificity to
         | the ctype functions: they can be made to work with ASCII,
         | ISO-8859-*, and... EBCDIC, but not UTF-8 in general (just
         | ASCII) or any Unicode encoding (idk, maybe they can be made to
         | be locale-specific for Shift-JIS, but only for ASCII in
         | Shift-JIS).
         | 
         | And... yes, glibc does have support for EBCDIC, which is
         | probably ultimately why it has these run-time indirections in
         | its ctype. There's no other reason to have run-time
         | indirections for ctype functions given the limitation of
         | unsigned char values + EOF. That means this code can be
         | simplified a great deal.
         | 
         | Anyways, yes, Drew DeVault's rant misses glibc's need to
         | support EBCDIC, but glibc is exactly like this for every little
         | thing -- an unmaintainable mess. There has to be a better way
         | to produce a fast C library w/o being such a mess on the
         | inside.
        
         | Hello71 wrote:
         | > all the world is not Linux or BSD
         | 
         | since when does glibc run on bsd
        
           | LukeShu wrote:
           | Well for one, Debian GNU/kFreeBSD.
        
           | jcelerier wrote:
           | didn't glibc exist before linux ? surely it would have been
           | running on bsd then
        
         | tyingq wrote:
         | I do get what you're saying, but musl also has to live in many
         | different worlds. Using the example where glibc is trawling
         | into endianness in the post you linked, for example. Musl runs
         | on a bunch of different big and little endian router boxes and
         | other unusual use cases. While I haven't tested, I'm guessing
         | that their much simpler isalnum() works fine on all of them.
         | 
         | Musl does have a lot less legacy to contend with, and musl is
         | often much slower than glibc, so your point stands, of course.
        
           | masklinn wrote:
           | > While I haven't tested, I'm guessing that their much
           | simpler isalnum() works fine on all of them.
           | 
           | isalnum works fine on both, it only veers off when you get
           | into UB which is UB.
           | 
           | If you define "works fine" as "gives correct answers even in
           | ub" then musl's is completely broken since it only gives
           | correct answers for english in ascii.
        
             | cryptonector wrote:
             | It can't give correct answers for anything other than
             | English in UTF-8 locales.
             | 
             | It can't give correct answers for any non-Latin scripts in
             | any locales.
             | 
             | The problem is ctype and POSIX.
             | 
             | Given that, making ctype only work for ASCII (and maybe
             | EBCDIC if you're really unlucky, which glibc is) is
             | basically sufficient.
        
           | jcelerier wrote:
           | musl's "isalpha" is trivially wrong, for instance it wouldn't
           | support "ç" (0xe7) or "ß" (0xdf) in ISO 8859-1 which are
           | both alphabetic characters which fit in an unsigned char.
        
             | cryptonector wrote:
             | ctype is trivially non-localizable to locales with codesets
             | larger than sizeof(unsigned char) anyways. Maybe the
             | problem here is POSIX.
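
To illustrate why a per-byte interface cannot classify multibyte characters, here is a rough sketch that reports how many bytes a UTF-8 sequence occupies, judged from its first byte. The function name `utf8_seq_len` is hypothetical, and the check looks at the bit pattern only; it does not validate the sequence.

```c
/*
 * Hypothetical helper: length in bytes of the UTF-8 sequence that
 * starts with `lead`. Returns 0 for a continuation byte or a byte
 * that cannot start a sequence.
 */
int utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;            /* 0xxxxxxx: ASCII        */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx               */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx               */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx               */
    return 0;                             /* 10xxxxxx or 11111xxx   */
}
```

In UTF-8, "ç" is the two bytes 0xC3 0xA7, so isalpha() only ever sees one half of it at a time. The wide-character functions in <wctype.h>, such as iswalpha(), are the standard interface that takes a whole character.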
        
               | jcelerier wrote:
               | oh yes, no code written in 2021 should use that mess. but
               | with glibc aiming for some level of posix compatibility..
               | hard to blame them for at least trying to make it work.
        
               | cryptonector wrote:
               | Hmm, well, I mean, if ctype can't work for any
               | interesting non-ASCII (and non-EBCDIC) cases (no one
               | should still be using ISO-8859 locales...)... maybe stop
               | trying so hard?
        
             | tyingq wrote:
             | Those both return 0 for isalpha() on glibc for me, with or
             | without export LC_CTYPE=iso_8859_1
             | 
             | Is there some other setup I'd need to do to see it work in
             | glibc?
        
               | jcelerier wrote:
               | most likely you need to build the locale on your system
               | (uncomment the relevant line in /etc/locale.gen and run
               | sudo locale-gen).
               | 
               | here
               | 
               |     #include <ctype.h>
               |     #include <locale.h>
               |     #include <stdio.h>
               | 
               |     int main(int argc, char** argv)
               |     {
               |       setlocale(LC_CTYPE, "fr_FR.iso88591");
               |       if(isalpha('ç'))
               |         printf("ok\n");
               |     }
               | 
               | prints ok (with the file in the correct encoding)
        
             | _kst_ wrote:
             | isalpha() works with the "C" locale unless you first call
             | setlocale().
             | 
             | For example, on my system isalpha(0xe7) is true if I first
             | call setlocale(LC_ALL, "en_US.iso88591").
        
               | jcelerier wrote:
               | well, yes, in "normal" C programs you're supposed to
               | fetch the locale from the user's env vars (with setlocale
               | (LC_ALL, ""))
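
The locale dependence discussed in this subthread can be probed with a sketch like the following. The helper name `e7_is_alpha_in` is hypothetical, and whether a locale such as "en_US.iso88591" exists depends on what has been generated on the system, so the function reports a missing locale instead of assuming success.

```c
#include <ctype.h>
#include <locale.h>

/*
 * Hypothetical helper: switch LC_CTYPE to `name` and report whether
 * byte 0xE7 is alphabetic there. Returns -1 if the locale could not
 * be selected, otherwise 0 or 1.
 */
int e7_is_alpha_in(const char *name)
{
    if (setlocale(LC_CTYPE, name) == NULL)
        return -1;   /* locale not installed or name not recognised */
    return isalpha(0xE7) != 0;
}
```

Calling `e7_is_alpha_in("C")` should give 0, since the "C" locale treats only the 26 Latin letters in each case as alphabetic. Passing "" selects the locale named by the user's environment variables, so the result then depends on the environment.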
        
         | [deleted]
        
       | anonymousiam wrote:
       | Ran it on 32-bit ARM, 64-bit ARM, 32-bit x86, and 64-bit x86. All
       | had different results, but all were the same until index 549,
       | which is greater than the maximum value for unsigned char (255).
        
       | zx2c4 wrote:
       | Here are some branchless/constant-time versions of those
       | functions that don't rely on locale:
       | https://git.zx2c4.com/wireguard-tools/tree/src/ctype.h
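
For readers unfamiliar with the idea, the sketch below shows one common way to write locale-independent range checks as a single unsigned comparison. It is not the code from the linked header; the function names are hypothetical, ASCII is assumed, and whether the compiler actually emits branch-free machine code is not guaranteed by the C source.

```c
/* Hypothetical ASCII-only check: nonzero if c is '0'..'9'. */
int ascii_isdigit_nb(int c)
{
    /* Values below '0' wrap around to large unsigned numbers,
       so one comparison covers both ends of the range. */
    return (unsigned)c - '0' < 10u;
}

/* Hypothetical ASCII-only check: nonzero if c is 'A'..'Z' or 'a'..'z'. */
int ascii_isalpha_nb(int c)
{
    /* Setting bit 0x20 folds 'A'..'Z' onto 'a'..'z'. */
    return ((unsigned)c | 0x20u) - 'a' < 26u;
}
```

Because the argument is converted to unsigned before any arithmetic, negative inputs and values above 255 simply compare as out of range.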
        
         | malkia wrote:
         | I like the suffix in 0x80001FU
        
       | josephcsible wrote:
       | Here's what the C standard says about character handling
       | functions:
       | 
       | > In all cases the argument is an int, the value of which shall
       | be representable as an unsigned char or shall equal the value of
       | the macro EOF. If the argument has any other value, the behavior
       | is undefined.
       | 
       | So this is just a case of glibc being optimized in a way that's
       | really unforgiving if you commit that particular UB.
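
The domain rule quoted above can be written down as a check. This is only a sketch; `ctype_arg_ok` and `checked_isalpha` are hypothetical names, and the choice to report out-of-domain values as "not alphabetic" is an assumption made for illustration.

```c
#include <ctype.h>
#include <limits.h>
#include <stdio.h>   /* EOF */

/* Hypothetical helper: nonzero if c is a legal argument for is*(). */
int ctype_arg_ok(int c)
{
    return c == EOF || (c >= 0 && c <= UCHAR_MAX);
}

/* Hypothetical wrapper: out-of-domain values are reported as
   "not alphabetic" instead of reaching isalpha() at all. */
int checked_isalpha(int c)
{
    return ctype_arg_ok(c) ? isalpha(c) != 0 : 0;
}
```

With this wrapper, a value past UCHAR_MAX never indexes whatever table the library uses internally.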
        
         | cryptonector wrote:
         | No, this is a case of glibc trying to support localization of
         | ctype in spite of the fact that it can't be localized to
         | anything other than English in UTF-8 locales, anything other
         | than Latin scripts in ISO-8859-* locales, or English in C/POSIX
         | or EBCDIC locales. And then on top of that trying to be fast.
         | 
         | I'd give up on supporting localization for ctype.
         | 
         | This makes me think, too, "never use ctype, just hardcode my
         | own that assumes ASCII".
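
A hardcoded, ASCII-only replacement along the lines suggested here might look like the sketch below. The function names are hypothetical and not taken from any library; the code assumes an ASCII execution character set and is defined for every int value, including negative ones.

```c
/* Hypothetical ASCII-only check: letters and digits. */
int ascii_isalnum(int c)
{
    return (c >= '0' && c <= '9') ||
           (c >= 'A' && c <= 'Z') ||
           (c >= 'a' && c <= 'z');
}

/* Hypothetical ASCII-only check: the six whitespace characters
   of the "C" locale (space, \t, \n, \v, \f, \r). */
int ascii_isspace(int c)
{
    return c == ' ' || (c >= '\t' && c <= '\r');
}

/* Hypothetical ASCII-only case conversion. */
int ascii_tolower(int c)
{
    return (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;
}
```

The trade-off is that bytes such as 0xE7 are never treated as letters, regardless of the locale in effect.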
        
       | guidovranken wrote:
       | This also applies to C++ <locale> functions, like std::isspace.
       | 
       | Another fun one: With FD_CLR, FD_ISSET, FD_SET you can corrupt
       | memory by merely passing a socket descriptor outside the range
       | 0..FD_SETSIZE-1 (0..1023 with glibc's default). Pass a negative
       | integer for some undefined behavior as well (shift by negative
       | value occurs here [1])
       | 
       | [1]
       | https://github.com/lattera/glibc/blob/895ef79e04a953cac14938...
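
One way to guard against the fd_set problem is to check the descriptor before touching the set. This is a minimal sketch for POSIX systems; `fd_in_range` and `checked_fd_set` are hypothetical names, not part of any library.

```c
#include <sys/select.h>

/* Hypothetical helper: nonzero if fd is a valid index into an fd_set. */
int fd_in_range(int fd)
{
    return fd >= 0 && fd < FD_SETSIZE;
}

/* Hypothetical wrapper: add fd to set only when it is in range.
   Returns 0 on success, -1 if fd was rejected. */
int checked_fd_set(int fd, fd_set *set)
{
    if (!fd_in_range(fd))
        return -1;
    FD_SET(fd, set);
    return 0;
}
```

Programs that may hold descriptors at or above FD_SETSIZE generally need poll() or a similar interface instead, since select() cannot represent them.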
        
       ___________________________________________________________________
       (page generated 2021-10-01 23:00 UTC)