[HN Gopher] The Wonderfully Terrible World of C and C++ Text Enc...
___________________________________________________________________

The Wonderfully Terrible World of C and C++ Text Encoding APIs (With
Some Rust)

Author : codewiz
Score  : 14 points
Date   : 2022-10-15 21:27 UTC (1 hour ago)

(HTM) web link (thephd.dev)
(TXT) w3m dump (thephd.dev)

| int_19h wrote:
| It's kind of amazing that something as basic as character
| encodings - at least the basics like UTF-16 <-> UTF-8 <-> stdio
| encoding! - is still not in the C++ standard library. For a while
| there was codecvt_utf8 et al, but that was deprecated 5 years ago
| in C++17 with no replacement "to clear the path for the future"
| (https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p06...),
| yet no replacement came in C++20, and none is planned for C++23.
| lultimouomo wrote:
| I feel your pain. Last week I just gave up and wrote my own UTF-8
| to UTF-32 conversion routine. It took me far less time to do that
| than I spent looking for a standard solution.
| kevin_thibedeau wrote:
| Unicode support requires incorporating the Unicode database into
| a library. At a minimum you need to know which code points are
| combining characters. For a language with five- to ten-year
| update cycles, should everyone be stuck with outdated data if the
| Unicode standard is revised in the interim?
| arka2147483647 wrote:
| All operating systems have the Unicode database saved somewhere.
| There should just be a standard way of accessing it. Just like
| filesystem.
|
| Edit: that is, it does not have to be linked into the standard
| lib. It can be a data file somewhere, or a shared lib.
| duskwuff wrote:
| There's precedent for this, too: time zones! Time zone data can
| change over time, and as such it's typically stored in system
| files and loaded at runtime, rather than being embedded in
| executables.
|
| Locales have some similar behavior as well.
| Ferrotin wrote:
| Conversion between encodings doesn't require a database or
| knowledge of combining characters.
| poorlyknit wrote:
| This. But it strengthens the argument that programming
| environments should just come with some sort of support for the
| most common encoding forms.
| kevin_thibedeau wrote:
| A library that only extracts code points will do more damage than
| not having one at all. If you have to decode Unicode, you
| presumably want to parse it some of the time. Not supporting the
| needs of string processing with multi-point graphemes leads to
| broken Unicode "support" that doesn't actually work with all
| valid Unicode.
___________________________________________________________________

(page generated 2022-10-15 23:00 UTC)
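
For illustration only, here is a minimal sketch of the kind of hand-written UTF-8 to UTF-32 routine lultimouomo mentions. It is not code from the thread: the name utf8_to_utf32 and its bool-plus-out-parameter signature are assumptions made for this example. It decodes code points and rejects malformed input (overlong forms, surrogates, values above U+10FFFF, truncated sequences), which needs no Unicode database, in line with Ferrotin's remark. It knows nothing about combining characters or grapheme clusters, which is the gap kevin_thibedeau describes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, not part of any standard library.
// Decodes the UTF-8 bytes in `in` into UTF-32 code points in `out`.
// Returns false on malformed input; `out` then holds only the code
// points decoded before the error.
bool utf8_to_utf32(const std::string& in, std::u32string& out) {
    out.clear();
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;      // code point being assembled
        std::size_t len = 0;  // total bytes in this sequence
        char32_t min = 0;     // smallest value allowed for this length

        if (lead < 0x80)                { cp = lead;         len = 1; min = 0x0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1Fu; len = 2; min = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0Fu; len = 3; min = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07u; len = 4; min = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte

        if (in.size() - i < len) return false;  // truncated sequence

        // Each continuation byte must look like 10xxxxxx and adds 6 bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (b & 0x3Fu);
        }

        if (cp < min) return false;                     // overlong encoding
        if (cp > 0x10FFFF) return false;                // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false; // UTF-16 surrogate

        out.push_back(cp);
        i += len;
    }
    return true;
}
```

Decoding "e\xCC\x81" (the letter e followed by U+0301 COMBINING ACUTE ACCENT) with this sketch yields two code points even though the text displays as a single character, so a routine like this covers the conversion problem but not the string-processing one.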