[HN Gopher] Unicode is harder than you think ___________________________________________________________________ Unicode is harder than you think Author : mcilloni Score : 86 points Date : 2023-07-25 16:47 UTC (6 hours ago) (HTM) web link (mcilloni.ovh) (TXT) w3m dump (mcilloni.ovh) | jkaptur wrote: | The logical next step here is to realize that if you want to be | truly internationalized, pretty much every single method of the | string class in your favorite language is an antipattern and | should be used with extreme caution. Seriously! | frizlab wrote: | I _think_ Swift did it properly. At least that's what they | claim[1] | | [1]: https://www.swift.org/blog/utf8-string/ | | There were many discussions on how to handle strings in the | forums too. Remarkably, it is not possible to access "abc"[1], | as it has unpredictable performance; instead one has to build | an index, and doing so should make the developer realize the | operation is costly. All in all, most beginners in Swift hate | working with strings because it's not intuitive at first, but | to be fair, in almost all languages string handling is not done | properly. | jkaptur wrote: | Oh, interesting! The fact that "abc"[1] doesn't work is a | great sign, since my contention is that "the character at | index i" is not a well-defined concept. | spudlyo wrote: | If you found this essay interesting, you owe it to yourself to | check out this super entertaining talk "Plain Text"[0] from NDC | 2022 by Dylan Beattie. Rabbit hole warning: This video caused me | to lose an entire Sunday watching Dylan's talks on YouTube, which | are uniformly awesome. | | [0]: https://www.youtube.com/watch?v=gd5uJ7Nlvvo | WaffleIronMaker wrote: | I also really enjoy Dylan Beattie's work. For those with some | spare time, who might like to see a true "rockstar" programmer, | you may like "The Art of Code"[0]. | | [0] https://youtu.be/6avJHaC3C2U | aidos wrote: | Oh wow. The Amstrad 6128 was my first machine (1985). 
Looking | forward to watching this! | spudlyo wrote: | Another good talk! I also really loved "Email vs Capitalism, | or, Why We Can't Have Nice Things"[0] which has one of the | best audience participation gimmicks I have ever seen in a | talk. | | [0]: https://www.youtube.com/watch?v=mrGfahzt-4Q | soneil wrote: | This is exactly what I came here to post, too. I believe he's | unusually prolific because he's part of the organisation of | these conferences - but his delivery pays off regardless. I | actually sent exactly the same video to a colleague a few | days ago. | NovemberWhiskey wrote: | I used to work on a platform at a large financial services firm; | it was essentially completely ignorant of anything Unicode with | respect to string handling: strings were null-terminated byte | streams. The platform had CSV import capability for tabular data, | and it had an integrated pivot table capability based on some | widgets that had been grafted onto it. | | Some of the users in Hong Kong discovered that you could import | CSVs with Unicode text (e.g. index compositions with Chinese | company names) and they'd display in the pivot table widgets and | even be exportable to reports. But only for most names. Some names | were truncated or turned into garbage, and I was called upon to | help debug this. | | My first reaction was frank amazement that this "worked" at all: | apparently, the path from the dumb CSV import code through to the | Unicode-aware pivot table was sufficiently clean that much of the | encoded text made it through OK. I can't remember the precise | details now but I think the problem turned out to be embedded | nulls from UTF-16 encoding and so was completely insoluble | without a gut renovation of the platform. | Pannoniae wrote: | Most programs claim to support Unicode but they actually don't. 
| They either miscount string lengths (you type in a CJK character | or an emoji, and the string appears shorter than what the program | thinks), split them up improperly, or get many other things wrong. | It doesn't help that by default, most programming languages also | handle Unicode poorly, with the default APIs producing wrong | results. | | I'd take "we don't do Unicode at all" or "we only support the BMP" | or "we don't support composite characters" any day over pretend- | support (which then inevitably breaks when the program wasn't | tested with anything non-ASCII) | | (ninjaedit: to see how prevalent this is, even gigantic messaging | apps such as Discord make this mistake. There are users on | Discord who you can't add as friends because the friend input | field is limited to 32... something - probably bytes - yet | elsewhere the program allows the name to be taken. This is easy | to do with combining characters) | david_draco wrote: | Maybe an "Acid test" for Unicode would help? These pages seem | to go in that direction: | https://www.kermitproject.org/utf8.html | https://web.archive.org/web/20160306060703/http://www.inter-... | | Placing a fuzzy tester like "hypothesis.strategies.characters" | into the CI may also be revealing. | alcover wrote: | Currently working on a language, I feel dizzy after reading this. | | My stdlib will provide a (byte) Buffer class with basic low-level | methods, but I feel like iterating through it in fancy ways should | be the concern of the user or third-party libraries. | | I fail to see this as part of a programming language. | | Am I wrong here? | zadokshi wrote: | You're definitely wrong. You're designing a language that only | works for "Americans" by default. | | Imagine how you would feel about a language that supports | Arabic by default and needs special foo to work with American | English? | | You need to start thinking of characters as a type. Characters | do not fit in bytes unless you're American. 
| | UTF-8 is a reasonable compromise though. | scatters wrote: | It depends on the level and domain of your language. A low- | level language can get by with just byte arrays. A mid-level | language should probably handle at least some encodings, and | provide codepoint access. A high-level language should handle | locale-sensitive casing and collating, and grapheme-cluster | access (note that this depends on the font!). | ekidd wrote: | One good approach is to have separate "byte array" and "string" | types, and say, "Strings are always UTF-8. Anything else is a | bug. Deal with it." | | Then you can have a nice, user-friendly string class for basic | UTF-8 text, which is pretty easy. Ignore sorting and grapheme | clusters (those probably belong in libraries, and they require | fairly large tables). Consider providing a library function to | iterate over UTF-8 "characters" (as unpacked 32 bit Unicode | code points). | | This is one of the sweet spots in language design, and it | provides enough structure for third-party libraries to work | well together, without everyone reinventing their own string | type. | | For another good alternative to this approach, see Ruby. | night-rider wrote: | It's not that it's hard, it's just people don't go out of their | way to escape UTF-8 glyphs into ASCII when dealing with exotic | glyphs in a text editor. It's more mundane and tedious, but not | 'hard'. | | Try working with raw UTF-8 in JS and find yourself in a world of | pain. Mathias Bynens talks about these gotchas here: | | https://mathiasbynens.be/notes/javascript-unicode | JohnFen wrote: | "Hard" is a bit context-dependent. Instead of thinking of it as | "hard", I think of it as a real pain in the ass full of | footguns. | o1y32 wrote: | Regarding the title -- anecdotally, everyone I know is sacred of | encoding issues, and I don't know anyone who claims they have a | great understanding of Unicode or think it is easy (including | myself). 
It is often overlooked for sure -- people don't realize | there is a problem in the code until they run into a bug, and it | turns out they have been treating strings wrong from the very | beginning. | m_0x wrote: | sacred or scared? ;) | zamadatix wrote: | TIL of UTF-1, what an odd specification. | dmitrygr wrote: | Favourite unicode fact: properly rendering unicode requires | understanding of the current geopolitical situation (depending on | whom you accept as a country and whom you do not, two country- | code letters may or may not render as a flag. This changes | sometimes in today's world.). | https://esham.io/2014/06/unicode-flags | svachalek wrote: | Interesting. They pushed all the politics onto the font | designers. | not2b wrote: | The font designer has to include a flag for each supported | country. Often a given font is missing lots of flags for | reasons that have nothing to do with whether the designer | recognizes a given country or not, just a question of | priorities; perhaps only 100 out of 200 flags are supported. | Longhanks wrote: | Imho, Unicode should stay out of politics. Country flags, | vaccine syringes and pregnant men should have nothing to do | with how computers handle text and writing systems. | veave wrote: | Or when big tech banded together to change the pistol emoji | to some scifi zapper. | spookthesunset wrote: | From what I recall the problem was on some devices it was | rendered as a "sci-fi zapper" or squirtgun and on others it | was a fairly realistic depiction of a gun. Leading to some | misunderstandings... | makeworld wrote: | There is no way to avoid it. It is very obvious that deciding | whether "vaccine syringes" are political (and therefore | excluded) or not is itself a political decision. 
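The flag mechanics described above - a flag is just two Regional Indicator letters that the font pairs up - can be checked in a few lines of Python. This is a sketch using only the standard library; the French flag is my example, not one from the thread:

```python
import unicodedata

# A "flag" emoji is not a single character: it is two Regional
# Indicator Symbol code points, one per letter of the country code.
flag_fr = "\U0001F1EB\U0001F1F7"  # Regional Indicator letters F + R

assert flag_fr == "🇫🇷"
assert len(flag_fr) == 2  # two code points, rendered as one glyph

# The pairing into a flag glyph happens entirely in the font/renderer,
# which is why a font can simply omit flags it doesn't want to ship.
assert [unicodedata.name(c) for c in flag_fr] == [
    "REGIONAL INDICATOR SYMBOL LETTER F",
    "REGIONAL INDICATOR SYMBOL LETTER R",
]
```

Nothing in the character data itself says "this is the flag of France"; unrecognized or unsupported pairs simply render as two boxed letters.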
| naniwaduni wrote: | There's a certain kind of extremist who claims that their | contentious positions aren't political, but the fact that | there's an argument, and that you can point to mainstream | coverage of it, strongly suggests that they're full of shit | and everyone can see it. | kergonath wrote: | What a strange position. The fact that an argument exists | just shows that some people want to argue. Anyone can | start an argument about anything. It's hardly a good basis | on which to make a decision. | kergonath wrote: | What does a syringe have to do with this, exactly? | | Besides, why do you care what funny symbols people use in | discussions that don't involve you? | qalmakka wrote: | If only it were that easy - sadly, politics touches everything | that's in any way related to the way we communicate and relate | to the world. Just look at the kerfuffle about skin tones... | Everything is political if you are looking for a reason to fight. | | Language is a very sensitive topic - in Central Asian | countries, using Latin, Cyrillic or Perso-Arabic script for | instance has very strong political connotations, same in the | Balkans. The world is just like that. | mseepgood wrote: | How do you recognize whether a syringe emoji is a vaccine | syringe or a regular syringe? | Roark66 wrote: | Indeed it is. One use I make of Unicode is for icons that can be | used by console programs like (neo)vim. I was quite happy that | xterm supports Unicode these days, so I can use a fast terminal | that supports OSC52 system clipboard integration (none of the | newer GNOME/KDE terminals do). | | I was rather disappointed when I noticed my pretty Unicode icons | would sometimes end up cut in half :-( | jmclnx wrote: | No kidding, you have not lived until you try to explain UTF-8 to | people who only believe in what they call "doublebyte". 
| | You think they get it, but surprise happens when a database load | fails when loading a Chinese-character "string" into a field whose | size was calculated based upon 2 bytes per character. | theamk wrote: | Thank god for emojis! Those people would say, "No one in our | org would use Chinese" and refuse to fix things... but now I | just point them to the latest message from upper management which | contains an emoji or two. | | (And emoji are such a fine example - once they are on the | table, you need support for combining characters, characters | outside the BMP, ligatures... a large part of the Unicode spec) | qalmakka wrote: | It's terrible, and IMHO we owe that to some introductory | university courses on Java (plus some Win32 veterans). I got | very close to being rejected by a professor who was | obstinately convinced that Unicode "characters" were 2 bytes, | because he drank the Kool-Aid in the '90s about Java's `char` | type representing a Unicode character. Ugh. I still get angry | thinking back at that sometimes. | jmclnx wrote: | I can relate. I remember a teacher stating "you never have to | worry about the amount of memory". This was in the late 90s. | I then asked, "So I can load a 20 gig data file into memory?", | and he said yes. | nightpool wrote: | Thank you for being the first article I've ever actually read to | explain the difference between NFC, NFD, NFKD and NFKC in a way | that I actually understood. I was a little bored through the | whole UCS/UTF* history lesson because I knew a lot of it already, | but the normalization and collation examples were definitely | worth it. | Lammy wrote: | Agreed, and it would be even better if it mentioned some real- | world normalization issues like it does for the UCS encodings. I | learned about it the hard way when dealing with Apple | filesystems: https://eclecticlight.co/2021/05/08/explainer-unicode-normal... 
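The normalization forms discussed above can be made concrete with Python's standard `unicodedata` module. This is a minimal sketch; the accented-e and ligature examples are mine, not from the thread:

```python
import unicodedata

composed = "\u00e9"      # 'é' as one precomposed code point
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# Visually identical, but unequal as code point sequences:
assert composed != decomposed

# NFC composes, NFD decomposes; both are canonical forms and
# round-trip to the same text.
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed

# The compatibility (K) forms additionally fold "lookalike" characters,
# e.g. the 'fi' ligature U+FB01 becomes the two letters 'f' + 'i'.
assert unicodedata.normalize("NFKC", "\ufb01") == "fi"
assert unicodedata.normalize("NFC", "\ufb01") == "\ufb01"  # NFC leaves it alone
```

The Apple-filesystem pain mentioned above comes from exactly this: HFS+ stored file names in a decomposed form, so a name created as NFC could come back from the filesystem as a different byte sequence.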
| skitter wrote: | Annoyingly, Java, JavaScript, Windows file paths and more don't | quite use UTF-16 (well, even if they did, that would be annoying) | -- they allow unpaired surrogates, which don't represent any | Unicode character. So if you want to represent e.g. an arbitrary | Windows file path in UTF-8, you can't; you have to use WTF-8 | (wobbly transformation format) instead. | Knee_Pain wrote: | >WTF-8 | | truly an appropriate name | deadbeeves wrote: | But UTF-8 is just a way to encode a number as a variable-length | string of octets. Why would you be unable to encode, say, a | terminating U+D800 as a string of three bytes at the end of a | UTF-8 stream? | skitter wrote: | Because that's how UTF-8 is defined[1]. WTF-8 lifts that | restriction. | | [1] https://simonsapin.github.io/wtf-8/#utf-8 | deadbeeves wrote: | It doesn't sound very annoying, then. You use the exact | same encoding scheme, but skip a verification step. | Actually it sounds more convenient. | jraph wrote: | Still potentially annoying if you deal with some other | code that expects UTF-8 proper and you pass it a wtf-8 | string that fails the lifted verification. | [deleted] | sedatk wrote: | Certainly not true for Windows. Windows uses UTF-16; e.g. it | uses proper surrogate pairs. | | https://learn.microsoft.com/en-us/windows/win32/intl/surroga... | skitter wrote: | That would be great, but that article is about | recommendations for applications running on Windows, not | about what valid file names applications may encounter. | Here's a counter-example: | https://github.com/golang/go/issues/32334 | sedatk wrote: | No, I mean Windows API honors UTF-16 surrogate pairs, and | processes them correctly. It doesn't produce invalid UTF-16 | strings either. Apps may not support UTF-16 properly, and | that's not on Windows, is it? | | NTFS, on the other hand, has no dictated format for | filename encoding. 
It just stores raw bytes as filenames, | so anything can be a filename on NTFS, including invalid | strings if the caller decides to do so. That's not on | Windows either, otherwise, we should add Linux to the list | too as ext4 and most other file systems also don't care | about filename encoding. ___________________________________________________________________ (page generated 2023-07-25 23:00 UTC)
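The unpaired-surrogate problem from the last subthread is easy to reproduce in Python, whose `surrogatepass` error handler applies the same relaxation WTF-8 does. A sketch, not a full WTF-8 implementation:

```python
# U+D800 is a lone (unpaired) surrogate: a valid code point, but not a
# valid Unicode scalar value, so strict UTF-8 refuses to encode it.
lone = "\ud800"
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
assert not strict_ok

# With "surrogatepass", the surrogate gets the usual three-byte
# encoding anyway - a wobbly string a plain UTF-8 consumer would reject.
wobbly = lone.encode("utf-8", "surrogatepass")
assert wobbly == b"\xed\xa0\x80"
assert wobbly.decode("utf-8", "surrogatepass") == lone
```

This is why arbitrary Windows file names (or Java/JavaScript strings) cannot always round-trip through strict UTF-8: the moment a lone surrogate appears, something WTF-8-shaped is needed.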