[HN Gopher] Unicode is harder than you think
       ___________________________________________________________________
        
       Unicode is harder than you think
        
       Author : mcilloni
       Score  : 86 points
       Date   : 2023-07-25 16:47 UTC (6 hours ago)
        
 (HTM) web link (mcilloni.ovh)
 (TXT) w3m dump (mcilloni.ovh)
        
       | jkaptur wrote:
       | The logical next step here is to realize that if you want to be
       | truly internationalized, pretty much every single method of the
       | string class in your favorite language is an antipattern and
       | should be used with extreme caution. Seriously!
        
         | frizlab wrote:
         | I _think_ Swift did it properly. At least that's what they
         | claim[1]
         | 
         | [1]: https://www.swift.org/blog/utf8-string/
         | 
         | There were many discussions on how to handle strings in the
         | forums too. Remarkably, it is not possible to access "abc"[1],
         | as it has unpredictable performance; instead one has to build
         | the index explicitly, and having to do so should make the
         | developer realize the operation is costly. All in all, most
         | beginners in Swift hate working with strings because it's not
         | intuitive at first, but to be fair, in almost all languages
         | string handling is not done properly.
        
           | jkaptur wrote:
           | Oh, interesting! The fact that "abc"[1] doesn't work is a
           | great sign, since my contention is that "the character at
           | index i" is not a well-defined concept.
        
       | spudlyo wrote:
       | If you found this essay interesting, you owe it to yourself to
       | check out this super entertaining talk "Plain Text"[0] from NDC
       | 2022 by Dylan Beattie. Rabbit hole warning: This video caused me
       | to lose an entire Sunday watching Dylan's talks on YouTube, which
       | are uniformly awesome.
       | 
       | [0]: https://www.youtube.com/watch?v=gd5uJ7Nlvvo
        
         | WaffleIronMaker wrote:
         | I also really enjoy Dylan Beattie's work. For those with some
         | spare time, who might like to see a true "rockstar" programmer,
         | you may like "The Art of Code"[0].
         | 
         | [0] https://youtu.be/6avJHaC3C2U
        
           | aidos wrote:
           | Oh wow. The Amstrad 6128 was my first machine (1985). Looking
           | forward to watching this!
        
           | spudlyo wrote:
           | Another good talk! I also really loved "Email vs Capitalism,
           | or, Why We Can't Have Nice Things"[0] which has one of the
           | best audience participation gimmicks I have ever seen in a
           | talk.
           | 
           | [0]: https://www.youtube.com/watch?v=mrGfahzt-4Q
        
         | soneil wrote:
         | This is exactly what I came here to post, too. I believe he's
         | unusually prolific because he's part of the organisation of
         | these conferences - but his delivery pays off regardless. I
         | actually sent exactly the same video to a colleague a few days
         | ago.
        
       | NovemberWhiskey wrote:
       | I used to work on a platform at a large financial services firm;
       | it was essentially completely ignorant of Unicode with
       | respect to string handling; strings were null-terminated byte
       | streams. The platform had CSV import capability for tabular data,
       | and it had an integrated pivot table capability based on some
       | widgets that had been grafted onto it.
       | 
       | Some of the users in Hong Kong discovered that you could import
       | CSVs with Unicode text (e.g. index compositions with Chinese
       | company names) and they'd display in the pivot table widgets and
       | even be exportable to reports. But only most names. Some names
       | were truncated or turned into garbage, and I was called upon to
       | help debug this.
       | 
       | My first reaction was frank amazement that this "worked" at all:
       | apparently, the path from the dumb CSV import code through to the
       | Unicode-aware pivot table was sufficiently clean that much of the
       | encoded text made it through OK. I can't remember the precise
       | details now but I think the problem turned out to be embedded
       | nulls from UTF-16 encoding and so was completely insoluble
       | without a gut renovation of the platform.
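
A minimal sketch of how that failure mode can arise, assuming (as the comment itself only tentatively recalls) that the culprit was UTF-16 data passing through code that treats the first zero byte as the end of the string. The helper `c_string_view` is hypothetical and only mimics what a `strlen`-style reader would see; it is not the platform's actual code.

```python
def c_string_view(raw: bytes) -> bytes:
    """Return what null-terminated string code would see: bytes before the first 0x00."""
    return raw.split(b"\x00", 1)[0]

# U+4E2D, U+4E00, U+56FD: the middle character's UTF-16 code unit is 0x4E00,
# which contains a zero byte.
name = "\u4e2d\u4e00\u56fd"

utf16 = name.encode("utf-16-le")   # bytes: 2d 4e | 00 4e | fd 56
utf8 = name.encode("utf-8")        # UTF-8 never emits 0x00 for a non-NUL character

seen_utf16 = c_string_view(utf16)  # cut off after the first character
seen_utf8 = c_string_view(utf8)    # passes through intact
```

Under this assumption, only names containing a code unit with a zero byte would be cut short, which would be consistent with "only most names" surviving the trip.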
        
       | Pannoniae wrote:
       | Most programs claim to support Unicode but they actually don't.
       | They either miscount string lengths (you type a CJK character or
       | an emoji in, string appears shorter than what the program
       | thinks), split strings in the wrong places, or get many other
       | things wrong. It doesn't help that, by default, most programming
       | languages also handle Unicode poorly, with the default APIs
       | producing wrong results.
       | 
       | I'd take "we don't do unicode at all" or "we only support BMP" or
       | "we don't support composite characters" any day over pretend-
       | support (but then inevitably breaking when the program wasn't
       | tested with anything non-ASCII)
       | 
       | (ninjaedit: to see how prevalent it is, even gigantic messaging
       | apps such as Discord make this mistake. There are users on
       | Discord who you can't add as friends because the friend input
       | field is limited to 32... something - probably bytes, yet
       | elsewhere the program allows the name to be taken. This is easy
       | to do with combining characters)
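
To make the "32... something" ambiguity concrete, here is a small sketch that measures one string four different ways. The function name `length_report` and the sample name are made up for illustration, and the last count only skips combining marks; it is a rough stand-in for user-perceived characters, not full grapheme cluster segmentation.

```python
import unicodedata

def length_report(text: str) -> dict:
    """Measure the same string in several units that are all loosely called 'length'."""
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        # Rough approximation of visible characters: skip combining marks.
        "non_combining": sum(1 for ch in text if unicodedata.combining(ch) == 0),
    }

# 16 visible letters, each written as "e" followed by U+0301 COMBINING ACUTE ACCENT.
name = "e\u0301" * 16
report = length_report(name)
# report == {"code_points": 32, "utf8_bytes": 48,
#            "utf16_code_units": 32, "non_combining": 16}
```

A limit of "32" accepts or rejects this name depending on which of these units one component counts and which another component counts.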
        
         | david_draco wrote:
         | Maybe an "Acid test" for Unicode would help? These pages seem
         | to go in that direction:
         | https://www.kermitproject.org/utf8.html
         | https://web.archive.org/web/20160306060703/http://www.inter-...
         | 
         | Placing a fuzzy tester like "hypothesis.strategies.characters"
         | into the CI may also be revealing.
        
       | alcover wrote:
       | Currently working on a language, I feel dizzy after reading this.
       | 
       | My stdlib will provide a (byte) Buffer class with basic low-level
       | methods but I feel like iterating through it in fancy ways should
       | be the concern of the user or 3rd-party libraries.
       | 
       | I fail to see this as part of a programming language.
       | 
       | Am I wrong here ?
        
         | zadokshi wrote:
         | You're definitely wrong. You're designing a language that only
         | works for "Americans" by default.
         | 
         | Imagine how you would feel about a language that supports
         | Arabic by default and needs special foo to work with American
         | English?
         | 
         | You need to start thinking of characters as a type. Characters
         | do not fit in bytes unless you're American.
         | 
         | UTF-8 is a reasonable compromise though.
        
         | scatters wrote:
         | It depends on the level and domain of your language. A low-
         | level language can get by with just byte arrays. A mid-level
         | language should probably handle at least some encodings, and
         | provide codepoint access. A high-level language should handle
         | locale-sensitive casing and collating, and grapheme-cluster
         | access (note that this depends on the font!).
        
         | ekidd wrote:
         | One good approach is to have separate "byte array" and "string"
         | types, and say, "Strings are always UTF-8. Anything else is a
         | bug. Deal with it."
         | 
         | Then you can have a nice, user-friendly string class for basic
         | UTF-8 text, which is pretty easy. Ignore sorting and grapheme
         | clusters (those probably belong in libraries, and they require
         | fairly large tables). Consider providing a library function to
         | iterate over UTF-8 "characters" (as unpacked 32 bit Unicode
         | code points).
         | 
         | This is one of the sweet spots in language design, and it
         | provides enough structure for third-party libraries to work
         | well together, without everyone reinventing their own string
         | type.
         | 
         | For another good alternative to this approach, see Ruby.
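
Python happens to follow a similar split (`bytes` versus `str`), so the suggested "iterate over code points" library function can be sketched in a few lines. The name `iter_code_points` is hypothetical, and this is just one way to illustrate the idea, not a recommendation for any particular language design.

```python
from typing import Iterator

def iter_code_points(data: bytes) -> Iterator[int]:
    """Yield each Unicode code point of a UTF-8 byte string as an int."""
    # Strict decoding: anything that is not valid UTF-8 raises UnicodeDecodeError,
    # matching the "anything else is a bug" stance.
    text = data.decode("utf-8")
    for ch in text:
        yield ord(ch)

raw = "na\u00efve \U0001f600".encode("utf-8")   # 11 bytes
points = list(iter_code_points(raw))            # 7 code points
```

Here the byte array and the sequence of code points have different lengths, and invalid input such as `b"\xff"` is rejected instead of being passed along.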
        
       | night-rider wrote:
       | It's not that it's hard, it's just people don't go out of their
       | way to escape UTF-8 glyphs into ASCII when dealing with exotic
       | glyphs in a text editor. It's more mundane and tedious, but not
       | 'hard'.
       | 
       | Try working with raw UTF-8 in JS and find yourself in a world of
       | pain. Mathias Bynens talks about these gotchas here:
       | 
       | https://mathiasbynens.be/notes/javascript-unicode
        
         | JohnFen wrote:
         | "Hard" is a bit context-dependent. Instead of thinking of it as
         | "hard", I think of it as a real pain in the ass full of
         | footguns.
        
       | o1y32 wrote:
       | Regarding the title -- anecdotally, everyone I know is sacred of
       | encoding issues, and I don't know anyone who claims they have a
       | great understanding of Unicode or thinks it is easy (including
       | myself). It is often overlooked for sure -- people don't realize
       | there is a problem in the code until they run into a bug, and it
       | turns out they have been treating strings wrong from the very
       | beginning.
        
         | m_0x wrote:
         | sacred or scared? ;)
        
       | zamadatix wrote:
       | TIL of UTF-1, what an odd specification.
        
       | dmitrygr wrote:
       | Favourite Unicode fact: properly rendering Unicode requires
       | understanding of the current geopolitical situation (Depending on
       | whom you accept as a country and whom you do not, two
       | country-code letters may or may not render as a flag. This
       | changes sometimes in today's world.).
       | https://esham.io/2014/06/unicode-flags
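
For readers who have not seen how this works: a flag is encoded as two regional indicator symbols, one per letter of the country code, and it is up to the font and renderer whether the pair is drawn as a flag or as two boxed letters. A minimal sketch follows; `flag_sequence` is an illustrative name, and the function does not check that the code belongs to any recognized country.

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # U+1F1E6 REGIONAL INDICATOR SYMBOL LETTER A

def flag_sequence(country_code: str) -> str:
    """Map a two-letter code such as 'DE' to its pair of regional indicator symbols."""
    code = country_code.upper()
    if len(code) != 2 or not all("A" <= c <= "Z" for c in code):
        raise ValueError("expected two ASCII letters")
    return "".join(chr(REGIONAL_INDICATOR_A + ord(c) - ord("A")) for c in code)

de = flag_sequence("de")   # U+1F1E9 U+1F1EA
zz = flag_sequence("zz")   # encodes fine; whether anything draws a flag is another matter
```

The encoding side is purely mechanical, which is why the decision about which pairs get a flag ends up with whoever ships the font.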
        
         | svachalek wrote:
         | Interesting. They pushed all the politics onto the font
         | designers.
        
           | not2b wrote:
           | The font designer has to include a flag for each supported
           | country. Often a given font is missing lots of flags for
           | reasons that have nothing to do with whether the designer
           | recognizes a given country or not, just a question of
           | priorities; perhaps only 100 out of 200 flags are supported.
        
         | Longhanks wrote:
         | Imho, unicode should stay out of politics. Country flags,
         | vaccine syringes and pregnant men should have nothing to do
         | with how computers handle text and writing systems.
        
           | veave wrote:
           | Or when big tech banded together to change the pistol emoji
           | to some scifi zapper.
        
             | spookthesunset wrote:
             | From what I recall the problem was on some devices it was
             | rendered as a "sci-fi zapper" or squirtgun and on others it
             | was a fairly realistic depiction of a gun. Leading to some
             | misunderstandings...
        
           | makeworld wrote:
           | There is no way to avoid it. It is very obvious that deciding
           | whether "vaccine syringes" are political (and therefore
           | excluded) or not is itself a political decision.
        
             | naniwaduni wrote:
             | There's a certain kind of extremist who claims that their
             | contentious positions aren't political, but the fact that
             | there's an argument, and that you can point to mainstream
             | coverage of it, strongly suggests that they're full of shit
             | and everyone can see it.
        
               | kergonath wrote:
               | What a strange position. The fact that an argument exists
               | just shows that some people want to argue. Anyone can
               | start an argument about anything. It's hardly a good base
               | to make a decision.
        
           | kergonath wrote:
           | What does a syringe have to do with this, exactly?
           | 
           | Besides, why do you care what funny symbols people use in
           | discussions that don't involve you?
        
           | qalmakka wrote:
           | If only it were that easy - sadly, everything that's in any
           | way related to how we communicate and relate to the world is
           | political. Just look at the kerfuffle about skin tones...
           | Everything is political if you are looking for a reason to
           | fight.
           | 
           | Language is a very sensitive topic - in Central Asian
           | countries, using Latin, Cyrillic or Perso-Arabic script for
           | instance has very strong political connotations, same in the
           | Balkans. The world is just like that
        
           | mseepgood wrote:
           | How do you recognize whether a syringe emoji is a vaccine
           | syringe or a regular syringe?
        
       | Roark66 wrote:
       | Indeed it is. One thing I use Unicode for is icons in console
       | programs like (neo)vim. I was quite happy that xterm supports
       | Unicode these days, so I can use a fast terminal that supports
       | OSC52 system clipboard integration (none of the newer GNOME/KDE
       | terminals do).
       | 
       | I was rather disappointed when I noticed my pretty Unicode icons
       | would sometimes end up cut in half :-(
        
       | jmclnx wrote:
       | No kidding, you have not lived until you try to explain UTF-8 to
       | people who only believe in what they call "doublebyte".
       | 
       | You think they get it, but the surprise comes when a database
       | load fails while loading a Chinese character "string" into a
       | field whose size was calculated assuming 2 bytes per character.
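
The arithmetic behind that failed load can be sketched as follows. The helper names and the sample text are made up for illustration, and the sketch assumes the field limit is enforced in UTF-8 bytes, which the comment implies but does not spell out.

```python
def naive_field_bytes(max_chars: int, bytes_per_char: int = 2) -> int:
    """Field size under the 'doublebyte' assumption."""
    return max_chars * bytes_per_char

def utf8_len(text: str) -> int:
    """Number of bytes the text actually needs in UTF-8."""
    return len(text.encode("utf-8"))

value = "\u4e2d\u6587\u5b57\u7b26"     # four common Chinese characters, all in the BMP
rare = "\U00020000"                    # a CJK Extension B character, outside the BMP

field = naive_field_bytes(len(value))  # 8 bytes reserved
needed = utf8_len(value)               # 12 bytes: each of these takes 3 bytes in UTF-8
overflows = needed > field             # True

rare_utf8 = utf8_len(rare)                              # 4 bytes
rare_utf16_units = len(rare.encode("utf-16-le")) // 2   # 2 code units, so 4 bytes in UTF-16 too
```

Even in UTF-16, the "2 bytes per character" rule fails for the last example.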
        
         | theamk wrote:
         | Thank god for emojis! Those people would say, "No one in our
         | org would use Chinese" and refuse to fix things... but now I
         | just point them to the latest message from upper management,
         | which contains an emoji or two.
         | 
         | (And emoji are such a fine example - once they are on the
         | table, you need support for combining characters, characters
         | outside of the BMP, ligatures... a large part of the Unicode
         | spec)
        
         | qalmakka wrote:
         | It's terrible, and IMHO we owe that to some introductory
         | university Java courses (plus some Win32 veterans). I got
         | very close to being rejected by a professor who was
         | obstinately convinced that Unicode "characters" were 2 bytes
         | because they had drunk the Kool-Aid in the '90s about Java's
         | `char` type representing a Unicode character. Ugh. I still get
         | angry thinking back on that sometimes.
        
           | jmclnx wrote:
           | I can relate, I remember a teacher stating "you never have to
           | worry about the amount of memory". This was in the late 90s,
           | I then asked "So I can load a 20 gig data file into memory",
           | he said yes.
        
       | nightpool wrote:
       | Thank you for being the first article I've ever actually read to
       | explain the difference between NFC, NFD, NFKD and NFKC in a way
       | that I actually understood. I was a little bored through the
       | whole UCS/UTF* history lesson because I knew a lot of it already,
       | but the normalization and collation examples were definitely
       | worth it
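
For anyone who skipped that part of the article, the four forms can be tried directly with Python's standard `unicodedata` module. This is a small sketch with sample characters chosen here for illustration; it does not reproduce the article's own examples.

```python
import unicodedata

composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT, two code points
ligature = "\ufb01"      # LATIN SMALL LIGATURE FI

# Canonical forms: NFD decomposes, NFC recomposes. Both keep the ligature as is.
nfd = unicodedata.normalize("NFD", composed)      # "e\u0301"
nfc = unicodedata.normalize("NFC", decomposed)    # "\u00e9"
nfc_lig = unicodedata.normalize("NFC", ligature)  # unchanged

# Compatibility forms: additionally replace "compatibility" characters
# such as the ligature with their plain equivalents.
nfkc_lig = unicodedata.normalize("NFKC", ligature)  # "fi"
nfkd_lig = unicodedata.normalize("NFKD", ligature)  # "fi"

# Comparing without normalizing first treats the two spellings as different.
equal_raw = composed == decomposed                # False
equal_normalized = nfc == composed                # True
```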
        
         | Lammy wrote:
         | Agreed, and it would be even better if it mentioned some
         | real-world normalization issues like it does for the UCS
         | encodings. I learned about it the hard way when dealing with
         | Apple filesystems:
         | https://eclecticlight.co/2021/05/08/explainer-unicode-normal...
        
       | skitter wrote:
       | Annoyingly, Java, JavaScript, Windows file paths and more don't
       | quite use UTF-16 (well, even if they did, that would be annoying)
       | -- they allow unpaired surrogates, which don't represent any
       | Unicode character. So if you want to represent e.g. an arbitrary
       | Windows file path in UTF-8, you can't; you have to use WTF-8
       | (wobbly transformation format) instead.
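
Python strings also allow unpaired surrogates, so the restriction is easy to observe there. The sketch below uses Python's `surrogatepass` error handler, which is not an implementation of the WTF-8 spec; as far as I understand it, though, it produces the same three bytes for a lone surrogate that the relaxed encoding would.

```python
lone = "\ud800"  # an unpaired high surrogate, permitted inside a Python str

# Strict UTF-8 refuses to encode it.
try:
    lone.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False

# Skipping that check applies the ordinary three-byte pattern to U+D800.
relaxed = lone.encode("utf-8", errors="surrogatepass")   # b"\xed\xa0\x80"

# A strict UTF-8 decoder rejects those bytes in turn.
try:
    relaxed.decode("utf-8")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False

round_trip = relaxed.decode("utf-8", errors="surrogatepass")
```

The bit pattern is the usual one; the difference lies only in whether the validity check is applied, and any consumer that does apply it will reject the data.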
        
         | Knee_Pain wrote:
         | >WTF-8
         | 
         | truly an appropriate name
        
         | deadbeeves wrote:
         | But UTF-8 is just a way to encode a number as a variable-length
         | string of octets. Why would you be unable to encode, say, a
         | terminating U+D800 as a string of three bytes at the end of a
         | UTF-8 stream?
        
           | skitter wrote:
           | Because that's how UTF-8 is defined[1]. WTF-8 lifts that
           | restriction.
           | 
           | [1] https://simonsapin.github.io/wtf-8/#utf-8
        
             | deadbeeves wrote:
             | It doesn't sound very annoying, then. You use the exact
             | same encoding scheme, but skip a verification step.
             | Actually it sounds more convenient.
        
               | jraph wrote:
               | Still potentially annoying if you deal with some other
               | code that expects UTF-8 proper and you pass it a wtf-8
               | string that fails the lifted verification.
        
           | [deleted]
        
         | sedatk wrote:
         | Certainly not true for Windows. Windows uses UTF-16; i.e. it
         | uses proper surrogate pairs.
         | 
         | https://learn.microsoft.com/en-us/windows/win32/intl/surroga...
        
           | skitter wrote:
           | That would be great, but that article is about
           | recommendations for applications running on Windows, not
           | about what valid file names applications may encounter.
           | Here's a counter-example:
           | https://github.com/golang/go/issues/32334
        
             | sedatk wrote:
             | No, I mean the Windows API honors UTF-16 surrogate pairs
             | and processes them correctly. It doesn't produce invalid
             | UTF-16 strings either. Apps may not support UTF-16 properly,
             | and that's not on Windows, is it?
             | 
             | NTFS, on the other hand, has no dictated format for
             | filename encoding. It just stores raw bytes as filenames,
             | so anything can be a filename on NTFS, including invalid
             | strings if the caller decides to do so. That's not on
             | Windows either, otherwise, we should add Linux to the list
             | too as ext4 and most other file systems also don't care
             | about filename encoding.
        
       ___________________________________________________________________
       (page generated 2023-07-25 23:00 UTC)