[HN Gopher] Libgrapheme: A simple freestanding C99 library for U... ___________________________________________________________________ Libgrapheme: A simple freestanding C99 library for Unicode Author : harporoeder Score : 58 points Date : 2022-11-15 17:16 UTC (5 hours ago) (HTM) web link (libs.suckless.org) (TXT) w3m dump (libs.suckless.org) | daaaaaaan wrote: | https://twitter.com/kuschku/status/1156488420413362177 | raspyberr wrote: | https://en.wikipedia.org/wiki/Talk:Suckless.org | manifoldgeo wrote: | I've been trying to get into C this past week, and this is a | great coincidence to see this today! I was just thinking how | convenient it is to type emojis right into strings in Python and | print them. I assumed C didn't have much unicode compatibility, | though I didn't research it. | | I gave libgrapheme a try, and it compiled just as the | instructions said it would. The hello-world program also mostly | worked, but in my terminal it malformed several things. For | example the American flag emoji rendered in my terminal as | [U][S], and the family emoji rendered as three distinct emoji | faces (side-by-side) rather than one grouped one. | | I went to a website that lets me copy emojis to my clipboard, and | I directly copy-pasted the American flag into my terminal, and I | still got [U][S], so I think the problem is just with the | terminal and not the library. | | edit: Indeed, this is a problem in Gnome terminal. I found a | Bugzilla link[0] that is still open. The official name for the | grouped emoji type is "ZWJ sequence"[1], short for Zero-Width | Joiner, and it appears not a lot of terminals support them. If | anyone knows of a good one for Linux, please let me know! | | Great stuff, thank you for sharing! 
| | References: | | [0]: https://gitlab.gnome.org/GNOME/vte/-/issues/2317 | | [1]: https://emojipedia.org/emoji-zwj-sequence/ | csande17 wrote: | > I was just thinking how convenient it is to type emojis right | into strings in Python and print them. I assumed C didn't have | much unicode compatibility, though I didn't research it. | | Libgrapheme is a nice library, but it doesn't really have | anything to do with this. | | Almost all modern terminal emulators use the UTF-8 character | encoding. In order to successfully output Unicode characters, | your programming language doesn't actually need much "Unicode | support"; it just needs to be able to send UTF-8-encoded bytes | to stdout. (That's why many modern programming languages like | Go and Zig define strings as a simple array of bytes.) Modern C | compilers allow you to printf("e") and get the appropriate | behavior. | | As you mention, the terminal emulator also needs to be able to | _decode_ and _display_ those UTF-8 bytes correctly, and a lot | of terminals don't get it right in some situations. Off the | top of my head, I don't know of a terminal that actually | implements the entire (very complex) set of Unicode text | rendering behaviors; maybe one of the web-based ones that run | in Electron? macOS's Terminal.app is also pretty good IIRC. | | Where libgrapheme comes in is if you want to analyze or | manipulate a UTF-8-encoded string. It provides operations like | "split into words" and "convert to uppercase". A surprising | number of programs never need to do that stuff, but if you do, | libgrapheme will give you a Unicode-compatible implementation. | (Many more basic operations, like concatenating two strings, | will work just fine without libgrapheme.) | mananaysiempre wrote: | (Not a language or Unicode expert, the following likely has | important mistakes.)
| | > Off the top of my head, I don't know of a terminal that | actually implements the entire (very complex) set of Unicode | text rendering behaviors | | There are at least two reasons for this: | | First, nobody actually seems to know how bidirectional text | should interact with terminal control sequences, or indeed | how it should be typeset on a terminal in the first place | (where are the paragraph boundaries?). There is the pre- | Unicode bidirectional support mode (BDSM, I kid you not) in | ECMA-48[1] and TR/53[2], which AFAIK nobody implements nor | cares about; there are terminal emulators endorsed by bidi- | language users[3], which AFAIK nobody has written down the | behaviour of; there is the Freedesktop bidi terminal spec[4], | which is a draft and AFAIK nobody implements yet either but | at least some people care about; finally, there are bidi- | language users who say that spec is a mistake[5]. | | Second, aside from bidi and a smattering of other things such | as emoji, there _is_ no detailed "Unicode text rendering | behaviour", only standards specific to font formats--the most | recent among them being OpenType, which is dubiously | compatible across implementations, decently documented only | through painstaking reverse engineering (sometimes in | words[6], sometimes only in Freetype library code), and | generally full of snakes[7]. And it has no notion of a | monospace font--only of a (proportional) font where all Lat | /Cyr/Grk characters just happen to have the same advance. | | AFAICT that is not negligence or an oversight, but rather a | concession to the fact that there are scripts which don't | really have a notion of monospace in the typographic | tradition and in fact are written such that it's extremely | unclear what monospace would even mean--certainly not one or | two cells per codepoint (e.g. Burmese or Tibetan; apparently | there _are_ Arabic monospace fonts[8] but I've no idea how | the hell they work). 
Not coincidentally, those are the | scripts where you really, really need that shaper, otherwise | nothing looks anywhere close to correct. | | [This post could have been titled "_Contra_ Muratori on | Unicode in terminal emulators".] | | [1] https://www.ecma-international.org/publications-and- | standard... | | [2] https://www.ecma-international.org/publications-and- | standard... | | [3] https://news.ycombinator.com/item?id=8086417 | | [4] https://terminal-wg.pages.freedesktop.org/bidi/ | | [5] http://litcave.rudi.ir/ | | [6] https://github.com/n8willis/opentype-shaping-documents | | [7] https://litherum.blogspot.com/2019/03/addition-font.html | | [8] https://news.ycombinator.com/item?id=10395464 | csande17 wrote: | Here's another fun Unicode pitfall: does _any_ terminal | provide a way to display Chinese and Japanese text | simultaneously, using the appropriate versions of the | glyphs for each language's characters? | mananaysiempre wrote: | As far as existing terminals are concerned, I don't know. | FWIW, there are similar problems (though only to the | point of looking wrong, not of misunderstanding) in other | scripts: Cyrillic as used in Bulgarian and a number of | other languages[1] and even Latin as used in Polish[2]. | | Even the Han version, though, does not seem to me to be | the sort of "what does it even _mean_?" problem like | those I listed above; more like what you want the input | to be. You can make your terminal keep language state, | e.g. using the deprecated language tags. Pro: some form | of this likely already needs to happen for bidi support; | similar to what HTML does. Con: no text file or program | ever did this; your nice UTF-8-only terminal is now | stateful and goes mad after `head /dev/urandom`. | Alternatively, you can require the driving program to | emit variation selectors for each Han character. Pro: the | state and the ensuing madness are now limited; you can | still pretend you're looking at a stream of characters.
| Con: no text file or program ever did this; neither does | HTML although it theoretically could. | | [1] https://commons.wikimedia.org/wiki/File:Cyrillic_alte | rnates.... | | [2] | https://www.twardoch.com/download/polishhowto/kreska.html | duskwuff wrote: | > First, nobody actually seems to know how bidirectional | text should interact with terminal control sequences... | | This goes beyond just bidirectional text. The traditional | behavior of text in a terminal is based around two key | assumptions, both of which break down catastrophically when | dealing with non-ASCII text: | | 1) The state of a terminal can be represented as a set of | cells, each of which has exactly one glyph in it and can be | drawn independently from any other cell. | | 2) Printing a character will write a glyph to the cell the | cursor is in and move the cursor to the right by one cell | (or down to the next line). | | The first assumption breaks down when dealing with full- | width characters and ligatures/complex scripts, but can at | least be papered over to handle full-width. The second | assumption breaks down when exposed to virtually any | interesting typographical feature (RTL, combining | characters and ZWJ, shaped characters, etc). And I'm not | sure it's possible to fix without some pretty substantial | changes to how terminals operate -- standard terminal | control sequences, and the code that uses them, are all | built around these assumptions; introducing new behaviors | like "the cursor doesn't always move from left to right" or | "erasing the middle of a string might change how the rest | of it displays" _will_ break existing applications. | | The ECMA standards are of absolutely no help in the matter. | They were written in the early 1990s, before Unicode came | onto the scene. Their idea of "international language | support" was supporting both French and German. | __d wrote: | Does anyone know offhand whether this does comparisons? And | normalization?
| gigel82 wrote: | This is interesting, particularly for implementing Intl in JS | engines without the mega-heavy ICU. But I wonder how portable it | really is. | | Sometimes I have to dig very deep to find that what folks call | "portable C" is actually POSIX-dependent. | | It doesn't appear to be the case after going through the code for | a bit, so that's promising. | mananaysiempre wrote: | You can also refer to the Unicode routines of other small JS | engines[1,2], those don't use ICU either, although the | implementations are mercilessly size-optimized (to put it | politely) and restricted to what the target JS version requires | (e.g. Duktape does casemapping but no normalization). Still, | Bellard's in particular look like he had a small Unicode | processing library lying around and just copied it into the | tree, not like he was forced to write the absolute minimum to | do a JS interpreter, so they can even be compared with dedicated | libraries like libgrapheme, libutf8proc or libutf. | | [1] https://github.com/bellard/quickjs/blob/master/libunicode.c | | [2] https://github.com/svaarala/duktape/blob/master/src- | input/du... | dochtman wrote: | Maybe a comparison to ICU4X is more interesting. | sylware wrote: | tomcam wrote: ___________________________________________________________________ (page generated 2022-11-15 23:01 UTC)