[HN Gopher] Libgrapheme: A simple freestanding C99 library for U...
       ___________________________________________________________________
        
       Libgrapheme: A simple freestanding C99 library for Unicode
        
       Author : harporoeder
       Score  : 58 points
       Date   : 2022-11-15 17:16 UTC (5 hours ago)
        
 (HTM) web link (libs.suckless.org)
 (TXT) w3m dump (libs.suckless.org)
        
       | daaaaaaan wrote:
       | https://twitter.com/kuschku/status/1156488420413362177
        
         | raspyberr wrote:
         | https://en.wikipedia.org/wiki/Talk:Suckless.org
        
       | manifoldgeo wrote:
       | I've been trying to get into C this past week, and this is a
       | great coincidence to see this today! I was just thinking how
       | convenient it is to type emojis right into strings in Python and
       | print them. I assumed C didn't have much unicode compatibility,
       | though I didn't research it.
       | 
       | I gave libgrapheme a try, and it compiled just as the
       | instructions said it would. The hello-world program also mostly
       | worked, but in my terminal it malformed several things. For
       | example the American flag emoji rendered in my terminal as
       | [U][S], and the family emoji rendered as three distinct emoji
       | faces (side-by-side) rather than one grouped one.
       | 
       | I went to a website that lets me copy emojis to my clipboard, and
       | I directly copy-pasted the American flag into my terminal, and I
       | still got [U][S], so I think the problem is just with the
       | terminal and not the library.
       | 
       | edit: Indeed, this is a problem in Gnome terminal. I found a
       | Bugzilla link[0] that is still open. The official name for the
       | grouped emoji type is "ZWJ sequence"[1], short for Zero-Width
       | Joiner, and it appears not a lot of terminals support them. If
       | anyone knows of a good one for Linux, please let me know!
       | 
       | Great stuff, thank you for sharing!
       | 
       | References:
       | 
       | [0]: https://gitlab.gnome.org/GNOME/vte/-/issues/2317
       | 
       | [1]: https://emojipedia.org/emoji-zwj-sequence/
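
To make the ZWJ sequence mentioned above concrete, here is a minimal C sketch that walks a UTF-8 buffer and counts code points and zero-width joiners. The names `utf8_decode` and `count_joiners` are hypothetical helpers, not part of libgrapheme, and the decoder is deliberately simplified (it does not reject overlong forms or bad continuation bytes).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ZWJ 0x200D /* U+200D ZERO WIDTH JOINER */

/*
 * Decode one code point from the start of s (len bytes available).
 * Returns the number of bytes consumed, or 0 if the lead byte is not a
 * valid UTF-8 lead byte or the sequence is cut short. Simplified:
 * overlong forms and bad continuation bytes are not rejected.
 */
size_t
utf8_decode(const unsigned char *s, size_t len, uint_least32_t *cp)
{
	size_t n, i;

	assert(s != NULL && cp != NULL);
	if (len == 0)
		return 0;
	if (s[0] < 0x80) {
		n = 1;
		*cp = s[0];
	} else if ((s[0] & 0xE0) == 0xC0) {
		n = 2;
		*cp = s[0] & 0x1F;
	} else if ((s[0] & 0xF0) == 0xE0) {
		n = 3;
		*cp = s[0] & 0x0F;
	} else if ((s[0] & 0xF8) == 0xF0) {
		n = 4;
		*cp = s[0] & 0x07;
	} else {
		return 0;
	}
	if (len < n)
		return 0;
	/* Each continuation byte contributes its low six bits. */
	for (i = 1; i < n; i++)
		*cp = (*cp << 6) | (s[i] & 0x3F);
	return n;
}

/*
 * Walk a UTF-8 buffer and report how many code points and how many
 * ZWJs it contains. Returns 0 on success, -1 on a decoding error.
 */
int
count_joiners(const char *str, size_t len, size_t *code_points, size_t *joiners)
{
	const unsigned char *s = (const unsigned char *)str;
	uint_least32_t cp = 0;
	size_t off = 0, n;

	assert(str != NULL && code_points != NULL && joiners != NULL);
	*code_points = 0;
	*joiners = 0;
	while (off < len) {
		n = utf8_decode(s + off, len - off, &cp);
		if (n == 0)
			return -1;
		(*code_points)++;
		if (cp == ZWJ)
			(*joiners)++;
		off += n;
	}
	return 0;
}
```

For the family emoji U+1F468 U+200D U+1F469 U+200D U+1F467 (18 bytes of UTF-8), this reports five code points, two of them joiners. A terminal that ignores the joiners draws the three people side by side, which matches the behaviour described above.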
        
         | csande17 wrote:
         | > I was just thinking how convenient it is to type emojis right
         | into strings in Python and print them. I assumed C didn't have
         | much unicode compatibility, though I didn't research it.
         | 
         | Libgrapheme is a nice library, but it doesn't really have
         | anything to do with this.
         | 
         | Almost all modern terminal emulators use the UTF-8 character
         | encoding. In order to successfully output Unicode characters,
         | your programming language doesn't actually need much "Unicode
         | support"; it just needs to be able to send UTF-8-encoded bytes
         | to stdout. (That's why many modern programming languages like
         | Go and Zig define strings as a simple array of bytes.) Modern C
         | compilers allow you to printf("é") and get the appropriate
         | behavior.
         | 
         | As you mention, the terminal emulator also needs to be able to
         | _decode_ and _display_ those UTF-8 bytes correctly, and a lot
         | of terminals don't get it right in some situations. Off the
         | top of my head, I don't know of a terminal that actually
         | implements the entire (very complex) set of Unicode text
         | rendering behaviors; maybe one of the web-based ones that run
         | in Electron? macOS's Terminal.app is also pretty good IIRC.
         | 
         | Where libgrapheme comes in is if you want to analyze or
         | manipulate a UTF-8-encoded string. It provides operations like
         | "split into words" and "convert to uppercase". A surprising
         | number of programs never need to do that stuff, but if you do,
         | libgrapheme will give you a Unicode-compatible implementation.
         | (Many more basic operations, like concatenating two strings,
         | will work just fine without libgrapheme.)
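
A small C sketch of the point made above: emitting and concatenating UTF-8 needs nothing beyond byte handling. The helper names `concat_bytes` and `print_bytes` are hypothetical, and the flag is spelled out as escaped bytes so that the example does not depend on the source file's encoding.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* U+1F1FA U+1F1F8 (regional indicators "U" and "S") as UTF-8 bytes. */
const char us_flag[] = "\xF0\x9F\x87\xBA\xF0\x9F\x87\xB8";

/*
 * Join two UTF-8 strings with plain byte copying. Returns 0 on success,
 * -1 if dst is too small. No Unicode awareness is needed for this.
 */
int
concat_bytes(char *dst, size_t dstsize, const char *a, const char *b)
{
	int n;

	assert(dst != NULL && a != NULL && b != NULL);
	n = snprintf(dst, dstsize, "%s%s", a, b);
	return (n >= 0 && (size_t)n < dstsize) ? 0 : -1;
}

/* Send the bytes to a stream unchanged; the terminal does the decoding. */
int
print_bytes(FILE *out, const char *s)
{
	assert(out != NULL && s != NULL);
	return fputs(s, out) == EOF ? -1 : 0;
}
```

Calling `concat_bytes(buf, sizeof(buf), "flag: ", us_flag)` followed by `print_bytes(stdout, buf)` sends 14 bytes to the terminal. Whether they show up as one flag or as two boxed letters is decided entirely by the terminal and its fonts.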
        
           | mananaysiempre wrote:
           | (Not a language or Unicode expert, the following likely has
           | important mistakes.)
           | 
           | > Off the top of my head, I don't know of a terminal that
           | actually implements the entire (very complex) set of Unicode
           | text rendering behaviors
           | 
           | There are at least two reasons for this:
           | 
           | First, nobody actually seems to know how bidirectional text
           | should interact with terminal control sequences, or indeed
           | how it should be typeset on a terminal in the first place
           | (where are the paragraph boundaries?). There is the pre-
           | Unicode bidirectional support mode (BDSM, I kid you not) in
           | ECMA-48[1] and TR/53[2], which AFAIK nobody implements nor
           | cares about; there are terminal emulators endorsed by bidi-
           | language users[3], which AFAIK nobody has written down the
           | behaviour of; there is the Freedesktop bidi terminal spec[4],
           | which is a draft and AFAIK nobody implements yet either but
           | at least some people care about; finally, there are bidi-
           | language users who say that spec is a mistake[5].
           | 
           | Second, aside from bidi and a smattering of other things such
           | as emoji, there _is_ no detailed "Unicode text rendering
           | behaviour", only standards specific to font formats--the most
           | recent among them being OpenType, which is dubiously
           | compatible across implementations, decently documented only
           | through painstaking reverse engineering (sometimes in
           | words[6], sometimes only in Freetype library code), and
           | generally full of snakes[7]. And it has no notion of a
           | monospace font--only of a (proportional) font where all
           | Lat/Cyr/Grk characters just happen to have the same advance.
           | 
           | AFAICT that is not negligence or an oversight, but rather a
           | concession to the fact that there are scripts which don't
           | really have a notion of monospace in the typographic
           | tradition and in fact are written such that it's extremely
           | unclear what monospace would even mean--certainly not one or
           | two cells per codepoint (e.g. Burmese or Tibetan; apparently
           | there _are_ Arabic monospace fonts[8] but I've no idea how
           | the hell they work). Not coincidentally, those are the
           | scripts where you really, really need that shaper, otherwise
           | nothing looks anywhere close to correct.
           | 
           | [This post could have been titled "_Contra_ Muratori on
           | Unicode in terminal emulators".]
           | 
           | [1] https://www.ecma-international.org/publications-and-
           | standard...
           | 
           | [2] https://www.ecma-international.org/publications-and-
           | standard...
           | 
           | [3] https://news.ycombinator.com/item?id=8086417
           | 
           | [4] https://terminal-wg.pages.freedesktop.org/bidi/
           | 
           | [5] http://litcave.rudi.ir/
           | 
           | [6] https://github.com/n8willis/opentype-shaping-documents
           | 
           | [7] https://litherum.blogspot.com/2019/03/addition-font.html
           | 
           | [8] https://news.ycombinator.com/item?id=10395464
        
             | csande17 wrote:
             | Here's another fun Unicode pitfall: does _any_ terminal
             | provide a way to display Chinese and Japanese text
             | simultaneously, using the appropriate versions of the
             | glyphs for each language's characters?
        
               | mananaysiempre wrote:
               | As far as existing terminals are concerned, I don't know.
               | FWIW, there are similar problems (though only to the
               | point of looking wrong, not of misunderstanding) in other
               | scripts: Cyrillic as used in Bulgarian and a number of
               | other languages[1] and even Latin as used in Polish[2].
               | 
               | Even the Han version, though, does not seem to me to be
               | the sort of "what does it even _mean_?" problem like
               | those I listed above; more like what you want the input
               | to be. You can make your terminal keep language state,
               | e.g. using the deprecated language tags. Pro: some form
               | of this likely already needs to happen for bidi support;
               | similar to what HTML does. Con: no text file or program
               | ever did this; your nice UTF-8-only terminal is now
               | stateful and goes mad after `head  /dev/urandom`.
               | Alternatively, you can require the driving program to
               | emit variation selectors for each Han character. Pro: the
               | state and the ensuing madness is now limited; you can
               | still pretend you're looking at a stream of characters.
               | Con: no text file or program ever did this; neither does
               | HTML although it theoretically could.
               | 
               | [1] https://commons.wikimedia.org/wiki/File:Cyrillic_alte
               | rnates....
               | 
               | [2]
               | https://www.twardoch.com/download/polishhowto/kreska.html
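
As a rough sketch of the variation-selector option described above, this is what a driving program would emit for a single character: the base code point followed by a selector, both encoded as UTF-8. The function names are hypothetical, and the pair U+845B plus U+E0100 used in the note below is only an illustrative choice. Nothing here implies that a given terminal or font selects a regional glyph from it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Encode one code point as UTF-8 into out (room for 4 bytes needed).
 * Returns the number of bytes written, or 0 for surrogates and for
 * values above U+10FFFF.
 */
size_t
utf8_encode(uint_least32_t cp, unsigned char *out)
{
	assert(out != NULL);
	if (cp < 0x80) {
		out[0] = (unsigned char)cp;
		return 1;
	}
	if (cp < 0x800) {
		out[0] = (unsigned char)(0xC0 | (cp >> 6));
		out[1] = (unsigned char)(0x80 | (cp & 0x3F));
		return 2;
	}
	if (cp < 0x10000) {
		if (cp >= 0xD800 && cp <= 0xDFFF)
			return 0; /* surrogates are not encodable */
		out[0] = (unsigned char)(0xE0 | (cp >> 12));
		out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[2] = (unsigned char)(0x80 | (cp & 0x3F));
		return 3;
	}
	if (cp <= 0x10FFFF) {
		out[0] = (unsigned char)(0xF0 | (cp >> 18));
		out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
		out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
		out[3] = (unsigned char)(0x80 | (cp & 0x3F));
		return 4;
	}
	return 0;
}

/*
 * Write a base character immediately followed by a variation selector.
 * out needs room for 8 bytes. Returns bytes written, or 0 on error.
 */
size_t
encode_with_selector(uint_least32_t base, uint_least32_t selector,
                     unsigned char *out)
{
	size_t a, b;

	a = utf8_encode(base, out);
	if (a == 0)
		return 0;
	b = utf8_encode(selector, out + a);
	return b == 0 ? 0 : a + b;
}
```

With base U+845B and selector U+E0100 the result is the seven bytes E8 91 9B F3 A0 84 80, so every Han character sent this way costs four extra bytes on the wire, but the terminal needs no state beyond the character itself.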
        
             | duskwuff wrote:
             | > First, nobody actually seems to know how bidirectional
             | text should interact with terminal control sequences...
             | 
             | This goes beyond just bidirectional text. The traditional
             | behavior of text in a terminal is based around two key
             | assumptions, both of which break down catastrophically when
             | dealing with non-ASCII text:
             | 
             | 1) The state of a terminal can be represented as a set of
             | cells, each of which has exactly one glyph in it and can be
             | drawn independently from any other cell.
             | 
             | 2) Printing a character will write a glyph to the cell the
             | cursor is in and move the cursor to the right by one cell
             | (or down to the next line).
             | 
             | The first assumption breaks down when dealing with full-
             | width characters and ligatures/complex scripts, but can at
             | least be papered over to handle full-width. The second
             | assumption breaks down when exposed to virtually any
             | interesting typographical feature (RTL, combining
             | characters and ZWJ, shaped characters, etc). And I'm not
             | sure it's possible to fix without some pretty substantial
             | changes to how terminals operate -- standard terminal
             | control sequences, and the code that uses them, are all
             | built around these assumptions; introducing new behaviors
             | like "the cursor doesn't always move from left to right" or
             | "erasing the middle of a string might change how the rest
             | of it displays" _will_ break existing applications.
             | 
             | The ECMA standards are of absolutely no help in the matter.
             | They were written in the early 1990s, before Unicode came
             | onto the scene. Their idea of "international language
             | support" was supporting both French and German.
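
The two assumptions listed above can be sketched as a toy one-row terminal in C. Everything here (`struct row`, `row_init`, `row_put`, the 8-column width) is a hypothetical illustration of the naive model, not code from any real terminal emulator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define COLS 8

/* Assumption 1: the screen is a set of independent single-glyph cells. */
struct row {
	uint_least32_t cell[COLS];
	size_t cursor;
};

void
row_init(struct row *r)
{
	size_t i;

	assert(r != NULL);
	for (i = 0; i < COLS; i++)
		r->cell[i] = ' ';
	r->cursor = 0;
}

/*
 * Assumption 2: printing writes to the cell under the cursor and then
 * moves the cursor one cell to the right. Output past the last column
 * is dropped in this sketch instead of wrapping.
 */
void
row_put(struct row *r, uint_least32_t cp)
{
	assert(r != NULL);
	if (r->cursor < COLS)
		r->cell[r->cursor++] = cp;
}
```

Feeding this model `e` followed by U+0301 (combining acute accent) leaves the cursor at column 2 with the accent in a cell of its own, although a reader sees a single character. A real terminal has to special-case this, and that is the simplest of the features listed above.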
        
       | __d wrote:
       | Does anyone know offhand whether this does comparisons? And
       | normalization?
        
       | gigel82 wrote:
       | This is interesting, particularly for implementing Intl in JS
       | engines without the mega-heavy ICU. But I wonder how portable it
       | really is.
       | 
       | Sometimes I have to dig very deep to find that what folks call
       | "portable C" is actually POSIX-dependent.
       | 
       | It doesn't appear to be the case after going through the code for
       | a bit, so that's promising.
        
         | mananaysiempre wrote:
         | You can also refer to the Unicode routines of other small JS
         | engines[1,2], those don't use ICU either, although the
         | implementations are mercilessly size-optimized (to put it
         | politely) and restricted to what the target JS version requires
         | (e.g. Duktape does casemapping but no normalization). Still,
         | Bellard's in particular look like he had a small Unicode
         | processing library lying around and just copied it into the
         | tree, not like he was forced to write the absolute minimum to
         | do a JS interpreter, so they can even be compared with dedicated
         | libraries like libgrapheme, libutf8proc or libutf.
         | 
         | [1] https://github.com/bellard/quickjs/blob/master/libunicode.c
         | 
         | [2] https://github.com/svaarala/duktape/blob/master/src-
         | input/du...
        
         | dochtman wrote:
         | Maybe a comparison to ICU4X is more interesting.
        
       ___________________________________________________________________
       (page generated 2022-11-15 23:01 UTC)