[HN Gopher] Unicode Normalization Forms: When ö ≠ ö
___________________________________________________________________
 
Unicode Normalization Forms: When ö ≠ ö
 
Author : ocrb
Score  : 82 points
Date   : 2021-12-31 19:32 UTC (3 hours ago)
 
(HTM) web link (blog.opencore.ch)
(TXT) w3m dump (blog.opencore.ch)
 
| mannerheim wrote:
| Duolingo doesn't handle Unicode normalisation for certain
| languages, and it's incredibly frustrating. Here's one example[0]
| (Vietnamese) and I know it's the case for Yiddish as well.
| 
| [0]: https://forum.duolingo.com/comment/17787660/Bug-Correct-
| Viet...
| nixpulvis wrote:
| Half normal isn't normal. That said, I personally try to avoid
| unicode in filenames (and caps too) for similar reasons.
| javajosh wrote:
| tl;dr - don't use crazy unicode characters in filenames; they can
| be problematic for non-trivial reasons (in this case because of
| unicode normalization on an SMB mount).
| int_19h wrote:
| What's "crazy" about the letter? It's a standard letter of
| several European alphabets.
| drpixie wrote:
| Nothing crazy about the "letter", but it is crazy that there
| are multiple different ways to encode the "letter".
| hinkley wrote:
| Reading about unicode has made me much, much more circumspect
| about the meaning of != in languages, and what fall-through
| behavior should look like. Unicode domain names lasted for a hot
| minute until someone registered microsoft.com with Cyrillic
| letters.
| 
| Years ago I read a rant by someone who insisted that being able
| to mix arbitrary languages into a single String object makes
| sense for linguists, but for most of us we would be better off
| being able to assert that a piece of text was German, or
| Sanskrit, not a jumble of both. It's been living rent free in my
| head for almost two decades and I can't agree with it, nor can I
| laugh it off.
| 
| It might have been better if the 'code pages' idea was refined
| instead of eliminated (that is, the string uses one or more code
| pages, not the process). I don't know what the right answer is,
| but I know "Every X is a Y" almost always gets us into trouble.
| LAC-Tech wrote:
| Sprinkling English with foreign words is really, really common.
| I'm in New Zealand and people do it all the time. And even in
| the states, right? You don't want two different strings because
| someone writes an English sentence about how much they love
| jalapeño.
| mr_luc wrote:
| Heh, funny, I'm implementing this _exact_ thing at the moment,
| oddly enough -- rather, implementing a security check that
| provides that same guarantee you mention, Mixed Script
| protections.
| 
| In Unicode spec terms, 'UTS #39 (Security)' contains the
| description of how to do this, mostly in section 5, and it
| relies on 'UAX #24 (Scripts)'.
| 
| It's more nuanced than your example, but only slightly. If you
| replace "German" with "Japanese" you're talking about multiple
| scripts in the same 'writing system', but the spec provides
| files with the lists of 'sets of scripts' each character
| belongs to.
| 
| The way the spec tells us to ensure that the word 'microsoft'
| isn't made up of fishy characters is that we just keep the
| intersection of each character's augmented script set. If at
| the end that intersection is empty, that's often fishy -- i.e.,
| there's no intersection between '{Latin}' and '{Cyrillic}'.
| 
| However, the spec allows the legit uses of writing systems that
| use more than one script; the lookup procedure outlined in the
| spec could give script sets like '{Jpan, Kore, Hani, Hanb},
| {Jpan, Kana}' for two characters, and that intersection isn't
| empty; it'd give us the answer "Okay, this word is contained
| within the Japanese writing system".
| drdaeman wrote:
| > but for most of us we would be better off
| 
| That's simple - it is provably wrong.
While relatively uncommon,
| there are plenty of examples that contradict this statement.
| And it's not about being able to encode the Rosetta Stone -
| non-scientists mix languages all the time, from Carmina Burana
| to Blinkenlights. They even make meaningful portmanteau words
| and write them with characters from multiple unrelated writing
| systems, like "zashitano" (see - Latin and Cyrillic scripts in
| the same single word!)
| david-gpu wrote:
| _> Years ago I read a rant by someone who insisted that being
| able to mix arbitrary languages into a single String object
| makes sense for linguists but for most of us we would be better
| off being able to assert that a piece of text was German, or
| Sanskrit, not a jumble of both._
| 
| Presumably the person who wrote it speaks a single language.
| 
| Just because something is not useful to them, it doesn't mean
| it is not useful in general. There are millions of polyglots as
| well as documents that include words and names in multiple
| scripts.
| jerf wrote:
| I think in that case the idea would either be that you should
| then have an array of strings, each of which may have its own
| language set, or that the string should be labelled as
| "containing Latin and Cyrillic", but still not able to include
| arbitrary other characters from Unicode. And multi-lingual text
| still generally breaks on _words_... Kilobytes of Latin text
| with a single Cyrillic character in the middle of a word is
| very suspicious, in a way that kilobytes of Latin text with a
| single Cyrillic _word_ isn't.
| 
| Of course you'd always need an "unrestricted" string (to speak
| to the rest of the system if necessary), but there are very few
| natural strings out there in the world that consist of
| half-a-dozen languages just mishmashed together. Those
| exceptions can be treated as _exceptions_.
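The intersection procedure described above can be sketched in a few lines of Python. This is a crude toy, not the real thing: it infers a script from the character's Unicode name prefix rather than from the Scripts.txt data in UAX #24, and it omits the augmented script sets (Jpan, Kore, etc.) that UTS #39 uses to allow multi-script writing systems:

```python
import unicodedata

def scripts(ch):
    # Toy script lookup via the character's Unicode name prefix.
    # A real implementation would use Scripts.txt (UAX #24) and the
    # augmented script sets from UTS #39.
    name = unicodedata.name(ch, "")
    for s in ("LATIN", "CYRILLIC", "GREEK"):
        if name.startswith(s):
            return {s}
    return None  # treat digits, punctuation, etc. as script-neutral

def single_script(word):
    # Intersect every character's script set; an empty intersection
    # means the word suspiciously mixes scripts.
    sets = [s for s in map(scripts, word) if s is not None]
    return not sets or bool(set.intersection(*sets))

print(single_script("microsoft"))       # all Latin -> True
print(single_script("micr\u043esoft"))  # U+043E is Cyrillic 'o' -> False
```

The real spec also special-cases "Common" and "Inherited" characters and whole-script confusables; this sketch only shows the core intersection idea.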
| danudey wrote:
| In our Jenkins system, we have remote build nodes return data
| back to the primary node via environment-variable-style
| formatted files (e.g. FOO=bar), so when I had to send back a
| bunch of arbitrary multi-line textual data, I decided to
| base64-encode it. Simple enough.
| 
| On *nix systems, I ran this through the base64 command; the
| data was UTF-8, which meant that in practice it was ASCII
| (because we didn't have any special characters in our commit
| messages).
| 
| On Windows systems... oh god. The system treats all text as
| UTF-16 with whatever byte order, and it took me ages to figure
| out how to get it to convert the data to UTF-8 before encoding
| it. Eventually it started working, and it worked for a while
| until it didn't for whatever reason. I ended up tearing out all
| the code and just encoding the UTF-16 in base64 and then
| processing that into UTF-8 on the master, where I had access to
| much saner tools.
| 
| Generally speaking, "Unicode" works great in most cases, but
| when you're dealing with systems with weird or unusual encoding
| habits, like Windows using UTF-16 or MySQL's "utf8" being
| limited to three bytes per unicode character instead of four,
| everything goes out the window and it's the wild west all over
| again.
| int_19h wrote:
| You can already map Unicode ranges to "code pages" of sorts, so
| how would that help?
| 
| Thing is, people who are not linguists _do_ want to mix
| languages. It's very common in some cultures to intersperse the
| native language with English. But even if not, if the language
| in question uses a non-Latin alphabet, there are often bits and
| pieces of data that have to be written down in Latin. So that
| "most of us" perspective is really "most of us in the US and
| Western Europe", at best.
| 
| For domains and such, what I think is really needed is a new
| definition of string equality that boils down to "are people
| likely to consider these two the same?". So that would e.g.
| treat similarly-shaped Latin/Greek/Cyrillic letters the same.
| jrochkind1 wrote:
| Oh, you can do far more than "code pages of sorts". Unicode has
| a variety of metadata available about each codepoint. The
| things that are "code pages of sorts" are maybe "block" (for ö,
| "Latin-1 Supplement") and "plane" (for ö it's the Basic
| Multilingual Plane), but those are really mostly administrative
| and probably not what you want.
| 
| But you also have "Script" (for ö, "Latin"). Some characters
| belong to more than one script, though. Unicode will tell you
| that.
| 
| Unicode also has a variety of algorithms available already
| written. One of the most relevant ones here is...
| normalization. To compare two strings in the broadest semantic
| sense of "are people likely to consider these the same", you
| want a "compatibility" normalization: NFKC or NFKD. They will
| for instance make `1` and `¹` (superscript one) the same, which
| is definitely one kind of "consider these the same" -- very
| useful for, say, a search index.
| 
| That won't be iron-clad, but it will be better than trying to
| roll your own algorithm involving looking at character metadata
| yourself! But it won't get you past intentional attacks using
| "look-alike" characters that are actually different
| semantically but look similar/indistinguishable depending on
| font. The tricky thing is that "consider these the same", it
| turns out, really depends on context and purpose; it's not
| always the same.
| 
| Unicode also has a variety of useful guides as part of the
| standard, including the guide to normalization
| https://unicode.org/reports/tr15/ and some guides related to
| security (such as https://unicode.org/reports/tr36/ and
| http://unicode.org/reports/tr39/), all of which are relevant to
| this concern and suggest approaches and algorithms.
| 
| Unicode has a LOT of very clever stuff in it to handle the
| inherently complicated problem of dealing with the entire
| universe of global languages that Unicode makes possible. It
| pays to spend some time with them.
| BlueTemplar wrote:
| Yeah, the Greek alphabet is used _a lot_ in the sciences. It's
| really annoying that we're only starting to get proper support
| _now_. (Including on keyboards: http://norme-azerty.fr/en/ )
| wisty wrote:
| What's a word? (A quick test - how many words were in the
| previous sentence? Maybe 3 or 4, depending on whether the 's is
| part of a word; so can we talk about Johannesson's foreign
| policy?)
| 
| It's hard enough to know what a letter is in unicode. Breaking
| things into words is just another massive headache.
| Someone wrote:
| That doesn't make sense to me. Even disregarding cases where
| people mix languages (how do you write a dictionary? If the
| answer is "just create a data structure combining multiple
| strings", shouldn't we standardize how to do that?), all
| languages share thousands of symbols such as currency symbols,
| mathematical symbols, and the Greek and Hebrew alphabets (to be
| used in math books written in the language), etc. So even
| languages such as Greek and English share far more symbols than
| they have unique ones.
| jrochkind1 wrote:
| It seems like a bug that to get consistent unicode
| normalization you need to flip a non-default config option.
| What am I missing?
| tpmx wrote:
| As a European, I _kinda_ miss iso-8859-1 being used everywhere.
| 0x0 wrote:
| Java is terrible in this regard, as most file APIs use
| "java.lang.String" to identify the filename, which most of the
| time depends on the system property "file.encoding". With the
| result that there will be files that you can never read from a
| java application if the filename encoding does not match the
| java file.encoding encoding.
| mgaunard wrote:
| Most formats (including XML) require data to be normalized to
| NFC.
| chrismorgan wrote:
| Can you point me to a single format that actually _requires_
| NFC? Most things either make no comment or just express
| preferences, though I'm confident there will be some somewhere.
| 
| XML does _not_ require normalisation: per
| <https://www.w3.org/TR/xml11/#sec-normalization-checking>, XML
| data SHOULD be fully normalised, but MUST NOT be transformed by
| processors; in other words, it's a dead-letter "SHOULD", and no
| one actually cares, just like almost everything else.
| guerrilla wrote:
| > But here, normalization caused this issue.
| 
| Nope, the lack of normalization on both accounts by the SMB
| server caused the issue. It could have normalized before
| emitting, but it definitely should have normalized on receiving
| for comparison.
| B-Con wrote:
| I think that in the ls->read workflow, Nextcloud shouldn't
| normalize the response from SMB and should issue back to SMB
| whatever SMB returned to Nextcloud.
| guerrilla wrote:
| According to Unicode, it should be allowed to, and the SMB
| server should be able to handle it. That's kind of the point of
| normalization: it's meant to be done before all comparisons so
| that exactly this doesn't happen. Your suggestion is just
| premature optimization, i.e. eliminating a redundancy.
| int_19h wrote:
| Unicode doesn't say anything about what "should be allowed to"
| with respect to an unrelated protocol. If the protocol says
| that filenames are sequences of 16-bit values that have to be
| compared one by one, then that's what it is.
| guerrilla wrote:
| It does say that if comparisons are being made then... and
| comparisons are being made, so yes, it does.
| silon42 wrote:
| At least it should perform validation and reject the NFD form
| and force the client to normalize to NFC?
| misnome wrote:
| Why isn't the answer just "Don't unicode normalise the file
| name"?
| 
| I thought the generally recommended way to deal with file names
| is to treat them as a block of bytes (to the extent that e.g.
| Rust has an entirely separate string type for OS-provided
| strings), or just to allow direct encoding/decoding but not
| normalisation or alteration.
| tialaramex wrote:
| In terms of what filenames _are_, neither Windows nor Linux (I
| don't know for sure with macOS, but I doubt it) actually
| guarantees you any sort of _characters_.
| 
| Linux filenames are a sequence of non-zero bytes (they might be
| ASCII, or at least UTF-8, they might be an old 8-bit charset,
| but they also might just be arbitrary non-zero bytes) and
| Windows file names are a sequence of non-zero 16-bit unsigned
| integers, which you could think of as UTF-16 code units but
| they don't promise to encode UTF-16.
| 
| _Probably_ the files have human-readable names, but maybe not.
| If you're accepting command-line file names it's not crazy to
| insist on human-readable (thus, Unicode) names, but if you
| process arbitrary input files you didn't create, particularly
| files you just found by looking around on disks unsupervised -
| you need to accept that utter gibberish is inevitable sooner or
| later and you must cope with that successfully.
| 
| Rust's OsStr variants match this reality.
| atoav wrote:
| This is what I found quite refreshing about Rust -- instead of
| choosing one of the following:
|     A) The programmer is an almighty god who knows everything,
|        we just expose him to the raw thing
|     B) The programmer is an immature toddler who cannot be
|        trusted, so we handle things for them
| 
| What Rust does is more along the lines of "you might already
| know this, but anyway, here is a reminder that you, the
| programmer, need to make a decision about this".
| [deleted]
| GlitchMr wrote:
| Filenames in the HFS+ filesystem (an old filesystem used by Mac
| OS X) are normalized with a proprietary variant of NFD - this
| is a filesystem feature.
APFS removed this feature.
| 1over137 wrote:
| > APFS removed this feature.
| 
| And then brought it back. It normalizes now.
| lilyball wrote:
| By "proprietary variant" you mean "publicly documented
| variant", which IIRC is just the normalization tables frozen in
| time from an early version of Unicode (the idea being that
| updating your OS shouldn't change the rules about what
| filenames are valid).
| 
| As for APFS, it ~~doesn't~~didn't normalize, but I believe it
| still requires UTF-8. And the OS will normalize filenames at a
| higher level. EDIT: they added native normalization. At least
| for iOS; I didn't dig enough to check if macOS is doing native
| normalizing or is just normalization-insensitive.
| chrismorgan wrote:
| Normalisation is expressly done with the composition of version
| 3.1 for compatibility: see
| <https://www.unicode.org/reports/tr15/#Versioning>. If that's
| what HFS+ does, then "proprietary variant" is wrong. And if
| not, I'm curious what it does differently.
| 
| (On the use of version 3.1, note that in practice version 3.2
| is used, correcting one typo: see
| <https://www.unicode.org/versions/corrigendum3.html>.)
| 
| I find a few references to it being slightly different, but not
| one of them actually says what's different; Wikipedia is the
| only one with a citation
| (<https://en.wikipedia.org/wiki/HFS_Plus>: "and normalized to a
| form very nearly the same as Unicode Normalization Form D
| (NFD)[12]"), and that citation says it's UAX #15 NFD, no
| deviations. One library that handles HFS+ differently switches
| to UCD 3.2.0 for HFS+
| <https://github.com/ksze/filename-sanitizer/blob/e990e963dc5b...>,
| but my impression from UAX #15 is that this should be
| superfluous, not actually changing anything. (Why is UCD 3.2.0
| still around there? Probably because IDNA 2003 needs it:
| <https://bugs.python.org/issue42157#msg379674>.)
| 
| _Update:_ https://developer.apple.com/library/archive/technotes/tn/tn1...
| has actual technical information, but the table in question
| doesn't show Unicode version changes like they claim it does,
| so I dunno. Looks like maybe from macOS 10.3 it's exactly UAX
| #15, but 8.1-10.2 was a precursor? I'm fuzzy on where the
| normalisation actually happens, anyway.
| GlitchMr wrote:
| The `filename-sanitizer` library you have linked has the
| following comment.
| 
|     # FIXME: improve HFS+ handling, because it does not use the
|     # standard NFD. It's close, but it's not exactly the same thing.
|     'hfs+': (255, 'characters', 'utf-16', 'NFD'),
| 
| I wonder what that means...
| matja wrote:
| ZFS can support normalization also:
| 
|     $ echo test > $'\xc3\xb6'
|     $ cat $'\x6f\xcc\x88'
|     cat: ö: No such file or directory
|     $ zfs create -o normalization=formD pool/dataset
|     $ echo test > $'\xc3\xb6'
|     $ cat $'\x6f\xcc\x88'
|     test
| zekica wrote:
| macOS is interesting: some APIs normalize filenames while
| others don't. And it causes some very interesting bugs.
| 
| One example: when you submit a file in Safari it doesn't
| normalize the file name, while js file.name does.
| stefan_ wrote:
| Sure, but at some point you might want to create a file,
| frequently using user input, or filter files using some
| user-provided query string - the kind of use cases that unicode
| normalization was invented for. So the whole "opaque blob of
| bytes" filesystem handling is nice if all you want is to not
| silently corrupt files, but it is very obviously not even
| covering 10% of normal use cases. Rust isn't being super smart,
| it just has its hands thrown up in the air.
| alkonaut wrote:
| Falls over on the fact that I don't want to be able to write
| these two files in the same dir. If I write file ö1.txt and
| ö1.txt then I want to be warned that the file exists even if
| the encoding is different when I use two different apps but try
| to write the same file.
| 
| The same applies for a.txt and A.txt on case-insensitive file
| systems (which, as someone pointed out, the most common desktop
| file systems are).
| pavlov wrote:
| The most common desktop file systems are case-insensitive,
| which complicates the picture.
| Pxtl wrote:
| Still, it looks like the right thing to do is let the
| filesystem do the filesystem's job. The filesystem should be
| normalizing unicode and enforcing the case-insensitivity and
| whatnot, but _just_ the filesystem. Wrappers around it like
| whatever Nextcloud is doing should be treating the filenames as
| a dumb pile of bytes.
| dataflow wrote:
| I'm not sure this problem even _has_ a "right" solution.
| 
| > Wrappers around it like whatever Nextcloud is doing should be
| treating the filenames as a dumb pile of bytes.
| 
| What do you do when the input isn't a dumb pile of bytes, but
| actual text? (Like from a text box the user typed into?)
| rzzzt wrote:
| Maintain a table that maps the original file name to a
| randomly generated one that doesn't hit these gotchas.
| rob_c wrote:
| And place the files in chunks, and... Wait, I think we're
| getting close to reinventing block storage again ;)
| dataflow wrote:
| I'm afraid I don't follow. Who maintains this table and who
| consumes it? What if they're different entities? How do you
| prevent it from going out of sync with the file system when
| the user renames a file? Are you inventing your own file
| system here? How do you deal with existing file systems?
| rzzzt wrote:
| I assumed that you have a system where file
| management/synchronization happens strictly through a web
| interface, and files are not changed or renamed outside this
| system's knowledge. Under these preconditions, having such a
| mapping table frees the users from having to abide by whatever
| restrictions the underlying file system places on valid file
| names.
| dataflow wrote:
| Oh, I was talking about the general case from a programming
| standpoint.
What do you do on a typical local filesystem?
| 
| The point I'm trying to get at is that you need to worry about
| the representation at multiple layers, not just at the bottom
| FS layer.
| mjevans wrote:
| Case insensitivity is a braindead behavior. If desired, it
| should be a fallback path selecting the best match, not the
| first resort.
| laurent92 wrote:
| So you're fine with ~/Downloads and ~/downloads coexisting as
| entirely separate directories? And John.McCauley@yahoo.fr and
| john.mccauley@yahoo.fr being attributed to two different
| people ;)
| im3w1l wrote:
| > So you're fine with ~/Downloads and ~/downloads coexisting
| as entirely separate directories?
| 
| Case (in)sensitivity for filenames is a non-issue in my
| experience. Never had problems with either convention. As for
| emails, I do think insensitivity was the right choice.
| tim-- wrote:
| The RFC states that email addresses are case sensitive.
| 
|     The local-part of a mailbox MUST BE treated as case
|     sensitive.
| 
| Section 2.4, RFC 2821: https://www.ietf.org/rfc/rfc2821.txt
| deadbunny wrote:
| My guess would be that the local part of an email address
| would usually map to a directory on case-sensitive
| filesystems...
| justaguy37 wrote:
| can we just say no to capital letters? (or lowercase?)
| 
| do capital letters have a good enough use case to justify
| their continued existence?
| Lammy wrote:
| Fun fact: The Apple II and II+ originally only did upper-case,
| and it was very popular to add a Shift Key / lower-case mod
| via one of the gamepad buttons:
| https://web.archive.org/web/20010212094858/http://home.swbel...
| colejohnson66 wrote:
| You are free to stop using capital letters, but good luck
| getting everyone to go along. Capitals have been around for
| centuries (they're older than the printing press) and aren't
| going anywhere.
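For what it's worth, a Unicode-aware case-insensitive comparison is more than ASCII tolower(). A minimal Python sketch using full case folding (real filesystems each ship their own frozen folding tables, so this is only an approximation of what any given one does):

```python
# Naive case-insensitive name comparison via Unicode case folding.
# str.casefold() applies full case folding, which is more aggressive
# than lower() - e.g. the German sharp s folds to "ss".
def same_name(a, b):
    return a.casefold() == b.casefold()

print(same_name("A.txt", "a.txt"))           # True
print(same_name("Straße.txt", "STRASSE.txt"))  # True: 'ß' folds to 'ss'
print(same_name("a.txt", "b.txt"))           # False
```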
| vgel wrote:
| First one: yes, though good UI should prevent it from happening
| unless the user really intended it (for example, I have
| ~/Documents symlinked into Dropbox, so ~/documents could be
| local-only documents).
| 
| Second one: no, emails are not filenames, and more generally
| distinguishability is more important for identifiers. In cases
| where identifiers like emails need to be mapped to filenames,
| like caches, they should be normalized.
| jodrellblank wrote:
| The opposite; case insensitivity is what human brains do. We
| read word, WORD, Word and woRD as the same thing; it's computer
| case-sensitive matching which is "brainless". Computers not
| aligning with what humans do is annoying and frustrating; they
| should be tools for us, not us for them. There's no way two
| people would write ö and ö and have readers think they were
| different because one was written in oil-based ink and one in
| water-based ink, or whatever compares with behind-the-scenes
| implementation details like combining form vs. single
| character.
| 
| I have just been arguing the same thing in far too much detail
| in this thread: https://news.ycombinator.com/item?id=29722019
| rob_c wrote:
| WORD, Word WoRD....
| 
| Sorry to say, I tend to use case sensitivity as a filter for
| whether I offer support to other developers. I'm not willing to
| find time for people who can't get their head around "turn
| on/off caps lock". You don't do it in professional writeups or
| applications (and I hope not in a CV), so don't pollute my
| filesystems or codebases with that madness.
| skymt wrote:
| There are a couple of arguments against case-insensitive
| filesystems I think are strong. The first is simply
| compatibility with existing case-sensitive systems. The second
| is that case is locale-dependent, so a pair of names could be
| equivalent or not depending on the device's locale.
| 
| I don't think I've seen any good argument against
| normalization, though.
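The locale-dependence point is visible even in Unicode's default case mappings. Python applies the default (locale-independent) mapping, under which the Turkish dotted capital İ (U+0130) lowercases to 'i' plus a combining dot, not plain 'i'; only Turkish locale tailoring yields plain 'i'. A quick illustration:

```python
# U+0130 is LATIN CAPITAL LETTER I WITH DOT ABOVE (Turkish 'İ').
# Unicode's default lowercase mapping turns it into 'i' + U+0307
# (COMBINING DOT ABOVE); Turkish locale tailoring would give plain
# 'i' instead - so whether two names are "equivalent" can depend on
# the locale doing the folding.
assert "I".lower() == "i"
assert "\u0130".lower() == "i\u0307"
assert "\u0130".lower() != "i"
print("default-mapping checks passed")
```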
| jrochkind1 wrote:
| Well, precisely because if you _don't_ normalize the filenames,
| ö ≠ ö. You could have two files with different filenames,
| `göteborg.txt` and `göteborg.txt`, and they are different files
| with different filenames.
| 
| Or you could have one file `göteborg.txt`, and when you try to
| ask for it as `göteborg.txt`, the system tells you "no file by
| that name".
| 
| Unicode normalization is the _solution_ to this. And the
| unicode normalization algorithms are pretty good. The bug in
| this case is that the system did not apply unicode
| normalization consistently. It required a non-default config
| option to be turned on to do so? I don't really understand
| what's going on here, but it sounds like a bug in the system to
| me that this would be a non-default config option.
| 
| Dealing with the entire universe of human language is
| inherently complicated. But unicode gives us some actually
| pretty marvelous tools for doing it consistently and
| reasonably. But you still have to use them, and use them right,
| and as with all software, bugs are possible.
| 
| But I don't think you get fewer crazy edge cases by not
| normalizing at all. (In some cases you can even get security
| concerns; think about usernames and the risk of `john` and
| `john` being two different users...) I know that this is the
| choice some traditional/legacy OSs/file systems make, in order
| to keep pre-unicode-hegemony backwards compat. It has problems
| as well. I think the right choice for any greenfield
| possibilities is consistent unicode normalization, so
| `göteborg.txt` and `göteborg.txt` can't be two different files
| with two different filenames.
| 
| [btw I tried to actually use the two common different forms of
| ö in this text; I don't believe HN normalizes them so they
| should remain.]
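The two-encodings situation this comment describes is easy to reproduce with the stdlib `unicodedata` module; a short sketch:

```python
import unicodedata

nfc = "g\u00f6teborg.txt"    # 'ö' as one precomposed code point (NFC)
nfd = "go\u0308teborg.txt"   # 'o' + U+0308 COMBINING DIAERESIS (NFD)

assert nfc != nfd            # == compares code points, so these differ
assert unicodedata.normalize("NFC", nfd) == nfc
assert unicodedata.normalize("NFD", nfc) == nfd

# Compatibility normalization folds even more distinctions:
assert unicodedata.normalize("NFKC", "\u00b9") == "1"  # superscript one
print("all normalization checks passed")
```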
| nieve wrote:
| It looks like instead of the config option switching everything
| to use the same normalization, it keeps a second copy of the
| name in a database to compare to. What a horrible kludge; I
| wonder how they even got into this situation of using different
| normalization in different parts of the system?
| arka2147483647 wrote:
| That works for programmers, but not for users. There could be
| several files with the same name, but with different encodings.
| Worse, depending on how your terminal encodes user input, some
| of them might not be typable.
| zarzavat wrote:
| From the user's perspective I don't want any normalisation at
| all. It's fine as long as you only have one file system, but as
| soon as you get multiple file systems with conflicting rules
| (which includes transferring files to other people) it becomes
| hell. Unfortunately we are stuck with that hell.
| heikkilevanto wrote:
| Well, if 7-bit US ASCII was good enough for our Lord, it is
| good enough for me ;-)
___________________________________________________________________
(page generated 2021-12-31 23:00 UTC)