hngopher.com

       [HN Gopher] The Wonderfully Terrible World of C and C++ Text Enc...
       ___________________________________________________________________
        
       The Wonderfully Terrible World of C and C++ Text Encoding APIs
       (With Some Rust)
        
       Author : codewiz
       Score  : 14 points
       Date   : 2022-10-15 21:27 UTC (1 hours ago)
        
 (HTM) web link (thephd.dev)
        
       | int_19h wrote:
       | It's kind of amazing that something as basic as character
       | encoding conversion (at least the basics like UTF-16, UTF-8,
       | and the stdio encoding!) is still not in the C++ standard
       | library. For a while there was codecvt_utf8 et al, but that was
       | deprecated 5 years ago in C++17 with no replacement "to clear the
       | path for the future" (https://www.open-
       | std.org/jtc1/sc22/wg21/docs/papers/2017/p06...), yet no
       | replacement came in C++20, and none are planned for C++23.
        
         | lultimouomo wrote:
          | I feel your pain. Last week I just gave up and wrote my own
          | UTF-8 to UTF-32 conversion routine. It took me far less time
          | to do that than I had spent looking for a standard solution.
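
For readers wondering what such a hand-rolled routine involves, here is a minimal sketch, assuming C++17. It is not the commenter's code; the function name utf8_to_utf32 and the choice to reject malformed input outright (rather than substitute U+FFFD) are assumptions made for illustration.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper: decode UTF-8 bytes into UTF-32 code points.
// Returns std::nullopt on malformed input: bad lead or continuation
// bytes, truncated sequences, overlong forms, surrogates, or values
// above U+10FFFF.
std::optional<std::u32string> utf8_to_utf32(std::string_view in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const auto b0 = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;      // code point being assembled
        std::size_t len = 0;  // total bytes in this sequence
        char32_t min = 0;     // smallest value allowed for this length
        if (b0 < 0x80)                { cp = b0;        len = 1; min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; min = 0x10000; }
        else return std::nullopt;  // stray continuation or invalid lead byte

        if (in.size() - i < len) return std::nullopt;  // truncated sequence

        for (std::size_t k = 1; k < len; ++k) {
            const auto b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) return std::nullopt;  // not 10xxxxxx
            cp = (cp << 6) | (b & 0x3F);
        }

        // Reject overlong encodings, surrogates, and out-of-range values.
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return std::nullopt;

        out.push_back(cp);
        i += len;
    }
    return out;
}
```

Most of the routine is validation; the decoding itself is a handful of shifts and masks.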
        
         | kevin_thibedeau wrote:
         | Unicode support requires incorporating their database into a
         | library. At a minimum you need to know which code points are
          | combining chars. For a language with five- to ten-year update
          | cycles, should everyone be stuck with outdated data if the
          | Unicode standard is revised in the interim?
        
           | arka2147483647 wrote:
            | All operating systems have the Unicode database saved
            | somewhere. There should just be a standard way of accessing
            | it. Just like filesystem.
            | 
            | Edit: that is, it does not have to be linked into the standard
            | lib. It can be a data file somewhere, or a shared lib.
        
             | duskwuff wrote:
             | There's precedent for this, too: time zones! Time zone data
             | can change over time, and as such it's typically stored in
             | system files and loaded at runtime, rather than being
             | embedded in executables.
             | 
             | Locales have some similar behavior as well.
        
           | Ferrotin wrote:
           | Conversion between encodings doesn't require a database or
           | knowledge of combining characters.
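
To illustrate the point above, converting between Unicode encoding forms is pure arithmetic on code point values. The sketch below encodes UTF-32 as UTF-16 with no lookup tables; the function name utf32_to_utf16 and the policy of replacing invalid values with U+FFFD are assumptions for illustration, not an existing standard API.

```cpp
#include <string>

// Hypothetical helper: encode UTF-32 code points as UTF-16.
// Values that are not Unicode scalar values (surrogates, or anything
// above U+10FFFF) are replaced with U+FFFD REPLACEMENT CHARACTER.
std::u16string utf32_to_utf16(const std::u32string& in) {
    std::u16string out;
    for (char32_t cp : in) {
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            cp = 0xFFFD;
        if (cp < 0x10000) {
            // Basic Multilingual Plane: one 16-bit code unit.
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Supplementary planes: split the 20-bit offset into a
            // high and a low surrogate.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

Nothing here depends on the Unicode Character Database, so the routine does not go stale when a new Unicode version is published.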
        
             | poorlyknit wrote:
              | This. But it strengthens the argument that programming
             | environments should just come with some sort of support for
             | the most common encoding forms.
        
             | kevin_thibedeau wrote:
             | A library that only extracts code points will do more
             | damage than not having one at all. If you have to decode
             | Unicode you presumably want to parse it some of the time.
             | Not supporting the needs for string processing with multi-
             | point graphemes leads to broken Unicode "support" that
             | doesn't actually work with all valid Unicode.
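
A small sketch of the failure mode described above, assuming text already decoded to UTF-32. The helper first_code_points is hypothetical and deliberately naive: it treats one code point as one character, which is exactly what breaks on multi-code-point graphemes.

```cpp
#include <cstddef>
#include <string>

// The same user-perceived character, written two ways:
// precomposed U+00E9, and "e" followed by U+0301 COMBINING ACUTE ACCENT.
const std::u32string precomposed = U"\u00E9";
const std::u32string decomposed  = U"e\u0301";

// Naive "first n characters" that counts code points, not graphemes.
std::u32string first_code_points(const std::u32string& s, std::size_t n) {
    return s.substr(0, n);
}
```

Taking the first code point of the decomposed form yields a bare "e" and silently drops the accent, and the two spellings compare unequal even though they are canonically equivalent. Handling this correctly needs normalization and grapheme cluster segmentation (Unicode Standard Annex #29), both of which depend on Unicode data and are not shown here.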
        
       ___________________________________________________________________
       (page generated 2022-10-15 23:00 UTC)