[HN Gopher] Unicode Normalization Forms: When ö ≠ ö
___________________________________________________________________
 
Unicode Normalization Forms: When ö ≠ ö
 
Author : ocrb
Score  : 82 points
Date   : 2021-12-31 19:32 UTC (3 hours ago)
 
(HTM) web link (blog.opencore.ch)
(TXT) w3m dump (blog.opencore.ch)
 
| mannerheim wrote:
| Duolingo doesn't handle Unicode normalisation for certain
| languages, and it's incredibly frustrating. Here's one example[0]
| (Vietnamese) and I know it's the case for Yiddish as well.
| 
| [0]: https://forum.duolingo.com/comment/17787660/Bug-Correct-
| Viet...
| nixpulvis wrote:
| Half normal isn't normal. That said, I personally try to avoid
| unicode in filenames (and caps too) for similar reasons.
| javajosh wrote:
| tl;dr - don't use crazy unicode characters in filenames; they can
| be problematic for non-trivial reasons (in this case because of
| unicode normalization on an SMB mount).
| int_19h wrote:
| What's "crazy" about the letter? It's a standard letter of
| several European alphabets.
| drpixie wrote:
| Nothing crazy about the "letter", but it is crazy that there
| are multiple different ways to encode the "letter".
| hinkley wrote:
| Reading about unicode has made me much, much more circumspect
| about the meaning of != in languages, and what fall-through
| behavior should look like. Unicode domain names lasted for a hot
| minute until someone registered microsoft.com with Cyrillic
| letters.
| 
| Years ago I read a rant by someone who insisted that being able
| to mix arbitrary languages into a single String object makes
| sense for linguists, but for most of us we would be better off
| being able to assert that a piece of text was German, or
| Sanskrit, not a jumble of both. It's been living rent free in my
| head for almost two decades and I can't agree with it, nor can I
| laugh it off.
| 
| It might have been better if the 'code pages' idea was refined
| instead of eliminated (that is, the string uses one or more code
| pages, not the process). I don't know what the right answer is,
| but I know "Every X is a Y" almost always gets us into trouble.
| LAC-Tech wrote:
| Sprinkling English with foreign words is really, really common.
| I'm in New Zealand and people do it all the time. And even in
| the states, right? You don't want two different strings because
| someone writes an English sentence about how much they love
| jalapeño.
| mr_luc wrote:
| Heh, funny, I'm implementing this _exact_ thing at the moment,
| oddly enough -- rather, implementing a security check that
| provides that same guarantee you mention, Mixed Script
| protections.
| 
| In Unicode spec terms, 'UTS #39 (Security)' contains the
| description of how to do this, mostly in section 5, and it
| relies on 'UAX #24 (Scripts)'.
| 
| It's more nuanced than your example, but only slightly. If you
| replace "German" with "Japanese" you're talking about multiple
| scripts in the same 'writing system', but the spec provides
| files with the lists of 'sets of scripts' each character
| belongs to.
| 
| The way the spec tells us to ensure that the word 'microsoft'
| isn't made up of fishy characters is that we just keep the
| intersection of each character's augmented script set. If at
| the end that intersection is empty, that's often fishy -- i.e.,
| there's no intersection between '{Latin}' and '{Cyrillic}'.
| 
| However, the spec allows the legit uses of writing systems that
| use more than one script; the lookup procedure outlined in the
| spec could give script sets like '{Jpan, Kore, Hani, Hanb},
| {Jpan, Kana}' for two characters, and that intersection isn't
| empty; it'd give us the answer "Okay, this word is contained
| within the Japanese writing system".
| drdaeman wrote:
| > but for most of us we would be better off
| 
| That's simple - it is provably wrong.
While relatively uncommon,
| there are plenty of examples that contradict this statement.
| And it's not about being able to encode the Rosetta Stone -
| non-scientists mix languages all the time, from Carmina Burana
| to Blinkenlights. They even make meaningful portmanteau words
| and write them with characters from multiple unrelated writing
| systems, like "zashitano" (see - Latin and Cyrillic scripts in
| the same single word!)
| david-gpu wrote:
| _> Years ago I read a rant by someone who insisted that being
| able to mix arbitrary languages into a single String object
| makes sense for linguists but for most of us we would be better
| off being able to assert that a piece of text was German, or
| Sanskrit, not a jumble of both._
| 
| Presumably the person who wrote it speaks a single language.
| 
| Just because something is not useful to them, it doesn't mean
| it is not useful in general. There are millions of polyglots as
| well as documents that include words and names in multiple
| scripts.
| jerf wrote:
| I think in that case the idea would either be that you should
| then have an array of strings, each of which may have its own
| language set, or that the string should be labelled as
| "containing Latin and Cyrillic", but still not able to include
| arbitrary other characters from Unicode. And multi-lingual text
| still generally breaks on _words_... Kilobytes of Latin text
| with a single Cyrillic character in the middle of a word is
| very suspicious, in a way that kilobytes of Latin text with a
| single Cyrillic _word_ isn't.
| 
| Of course you'd always need an "unrestricted" string (to speak
| to the rest of the system if necessary), but there are very few
| natural strings out there in the world that consist of
| half-a-dozen languages just mishmashed together. Those
| exceptions can be treated as _exceptions_.
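The intersection procedure described above can be sketched in a few lines of Python. This is a crude toy, not the real thing: it infers a script from the character's Unicode name prefix rather than from the Scripts.txt data in UAX #24, and it omits the augmented script sets (Jpan, Kore, etc.) that UTS #39 uses to allow multi-script writing systems:

```python
import unicodedata

def scripts(ch):
    # Toy script lookup via the character's Unicode name prefix.
    # A real implementation would use Scripts.txt (UAX #24) and the
    # augmented script sets from UTS #39.
    name = unicodedata.name(ch, "")
    for s in ("LATIN", "CYRILLIC", "GREEK"):
        if name.startswith(s):
            return {s}
    return None  # treat digits, punctuation, etc. as script-neutral

def single_script(word):
    # Intersect every character's script set; an empty intersection
    # means the word suspiciously mixes scripts.
    sets = [s for s in map(scripts, word) if s is not None]
    return not sets or bool(set.intersection(*sets))

print(single_script("microsoft"))       # all Latin -> True
print(single_script("micr\u043esoft"))  # U+043E is Cyrillic 'o' -> False
```

The real spec also special-cases "Common" and "Inherited" characters and whole-script confusables; this sketch only shows the core intersection idea.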
| danudey wrote:
| In our Jenkins system, we have remote build nodes return data
| back to the primary node via environment-variable-style
| formatted files (e.g. FOO=bar), so when I had to send back a
| bunch of arbitrary multi-line textual data, I decided to
| base64-encode it. Simple enough.
| 
| On *nix systems, I ran this through the base64 command; the
| data was UTF-8, which meant that in practice it was ASCII
| (because we didn't have any special characters in our commit
| messages).
| 
| On Windows systems... oh god. The system treats all text as
| UTF-16 with whatever byte order, and it took me ages to figure
| out how to get it to convert the data to UTF-8 before encoding
| it. Eventually it started working, and it worked for a while
| until it didn't for whatever reason. I ended up tearing out all
| the code and just encoding the UTF-16 in base64 and then
| processing that into UTF-8 on the master, where I had access to
| much saner tools.
| 
| Generally speaking, "Unicode" works great in most cases, but
| when you're dealing with systems with weird or unusual encoding
| habits, like Windows using UTF-16 or MySQL's "utf8" being
| limited to three bytes per unicode character instead of four,
| everything goes out the window and it's the wild west all over
| again.
| int_19h wrote:
| You can already map Unicode ranges to "code pages" of sorts, so
| how would that help?
| 
| Thing is, people who are not linguists _do_ want to mix
| languages. It's very common in some cultures to intersperse the
| native language with English. But even if not, if the language
| in question uses a non-Latin alphabet, there are often bits and
| pieces of data that have to be written down in Latin. So that
| "most of us" perspective is really "most of us in the US and
| Western Europe", at best.
| 
| For domains and such, what I think is really needed is a new
| definition of string equality that boils down to "are people
| likely to consider these two the same?". So that would e.g.
| treat similarly-shaped Latin/Greek/Cyrillic letters the same.
| jrochkind1 wrote:
| Oh, you can do far more than "code pages of sorts". Unicode has
| a variety of metadata available about each codepoint. The
| things that are "code pages of sorts" are maybe "block" (for ö,
| "Latin-1 Supplement") and "plane" (for ö it's the Basic
| Multilingual Plane), but those are really mostly administrative
| and probably not what you want.
| 
| But you also have "Script" (for ö, "Latin"). Some characters
| belong to more than one script, though. Unicode will tell you
| that.
| 
| Unicode also has a variety of algorithms available already
| written. One of the most relevant ones here is...
| normalization. To compare two strings in the broadest semantic
| sense of "are people likely to consider these the same", you
| want a "compatibility" normalization: NFKC or NFKD. They will
| for instance make `1` and `¹` (superscript one) the same, which
| is definitely one kind of "consider these the same" -- very
| useful for, say, a search index.
| 
| That won't be iron-clad, but it will be better than trying to
| roll your own algorithm involving looking at character metadata
| yourself! But it won't get you past intentional attacks using
| "look-alike" characters that are actually different
| semantically but look similar/indistinguishable depending on
| font. The tricky thing is that "consider these the same", it
| turns out, really depends on context and purpose; it's not
| always the same.
| 
| Unicode also has a variety of useful guides as part of the
| standard, including the guide to normalization
| https://unicode.org/reports/tr15/ and some guides related to
| security (such as https://unicode.org/reports/tr36/ and
| http://unicode.org/reports/tr39/), all of which are relevant to
| this concern and suggest approaches and algorithms.
| 
| Unicode has a LOT of very clever stuff in it to handle the
| inherently complicated problem of dealing with the entire
| universe of global languages that Unicode makes possible. It
| pays to spend some time with them.
| BlueTemplar wrote:
| Yeah, the Greek alphabet is used _a lot_ in the sciences. It's
| really annoying that we're only starting to get proper support
| _now_. (Including on keyboards: http://norme-azerty.fr/en/ )
| wisty wrote:
| What's a word? (A quick test - how many words were in the
| previous sentence? Maybe 3 or 4, depending on whether the 's is
| part of a word; so can we talk about Johannesson's foreign
| policy?)
| 
| It's hard enough to know what a letter is in unicode. Breaking
| things into words is just another massive headache.
| Someone wrote:
| That doesn't make sense to me. Even disregarding cases where
| people mix languages (how do you write a dictionary? If the
| answer is "just create a data structure combining multiple
| strings", shouldn't we standardize how to do that?), all
| languages share thousands of symbols such as currency symbols,
| mathematical symbols, and the Greek and Hebrew alphabets (to be
| used in math books written in the language), etc. So even
| languages such as Greek and English share far more symbols than
| they have unique ones.
| jrochkind1 wrote:
| It seems like a bug that to get consistent unicode
| normalization you need to flip a non-default config option.
| What am I missing?
| tpmx wrote:
| As a European, I _kinda_ miss iso-8859-1 being used everywhere.
| 0x0 wrote:
| Java is terrible in this regard, as most file APIs use
| "java.lang.String" to identify the filename, which most of the
| time depends on the system property "file.encoding". With the
| result that there will be files that you can never read from a
| java application if the filename encoding does not match the
| java file.encoding encoding.
| mgaunard wrote:
| Most formats (including XML) require data to be normalized to
| NFC.
| chrismorgan wrote:
| Can you point me to a single format that actually _requires_
| NFC? Most things either make no comment or just express
| preferences, though I'm confident there will be some somewhere.
| 
| XML does _not_ require normalisation: per
| <https://www.w3.org/TR/xml11/#sec-normalization-checking>, XML
| data SHOULD be fully normalised, but MUST NOT be transformed by
| processors; in other words, it's a dead-letter "SHOULD", and no
| one actually cares, just like almost everything else.
| guerrilla wrote:
| > But here, normalization caused this issue.
| 
| Nope, the lack of normalization on both accounts by the SMB
| server caused the issue. It could have normalized before
| emitting, but it definitely should have normalized on receiving
| for comparison.
| B-Con wrote:
| I think that in the ls->read workflow, Nextcloud shouldn't
| normalize the response from SMB and should issue back to SMB
| whatever SMB returned to Nextcloud.
| guerrilla wrote:
| According to Unicode, it should be allowed to, and the SMB
| server should be able to handle it. That's kind of the point of
| normalization: it's meant to be done before all comparisons so
| that exactly this doesn't happen. Your suggestion is just
| premature optimization, i.e. eliminating a redundancy.
| int_19h wrote:
| Unicode doesn't say anything about what "should be allowed to"
| with respect to an unrelated protocol. If the protocol says
| that filenames are sequences of 16-bit values that have to be
| compared one by one, then that's what it is.
| guerrilla wrote:
| It does say that if comparisons are being made then... and
| comparisons are being made, so yes, it does.
| silon42 wrote:
| At least it should perform validation and reject the NFD form
| and force the client to normalize to NFC?
| misnome wrote:
| Why isn't the answer just "Don't unicode normalise the file
| name"?
| 
| I thought the generally recommended way to deal with file names
| is to treat them as a block of bytes (to the extent that e.g.
| Rust has an entirely separate string type for OS-provided
| strings), or just to allow direct encoding/decoding but not
| normalisation or alteration.
| tialaramex wrote:
| In terms of what filenames _are_, neither Windows nor Linux (I
| don't know for sure with macOS, but I doubt it) actually
| guarantees you any sort of _characters_.
| 
| Linux filenames are a sequence of non-zero bytes (they might be
| ASCII, or at least UTF-8, they might be an old 8-bit charset,
| but they also might just be arbitrary non-zero bytes) and
| Windows file names are a sequence of non-zero 16-bit unsigned
| integers, which you could think of as UTF-16 code units but
| they don't promise to encode UTF-16.
| 
| _Probably_ the files have human-readable names, but maybe not.
| If you're accepting command-line file names it's not crazy to
| insist on human-readable (thus, Unicode) names, but if you
| process arbitrary input files you didn't create, particularly
| files you just found by looking around on disks unsupervised -
| you need to accept that utter gibberish is inevitable sooner or
| later and you must cope with that successfully.
| 
| Rust's OsStr variants match this reality.
| atoav wrote:
| This is what I found quite refreshing about Rust -- instead of
| choosing one of the following:
|     A) The programmer is an almighty god who knows everything,
|        we just expose him to the raw thing
|     B) The programmer is an immature toddler who cannot be
|        trusted, so we handle things for them
| 
| What Rust does is more along the lines of "you might already
| know this, but anyway, here is a reminder that you, the
| programmer, need to make a decision about this".
| [deleted]
| GlitchMr wrote:
| Filenames in the HFS+ filesystem (an old filesystem used by Mac
| OS X) are normalized with a proprietary variant of NFD - this
| is a filesystem feature.
APFS removed this feature.
| 1over137 wrote:
| > APFS removed this feature.
| 
| And then brought it back. It normalizes now.
| lilyball wrote:
| By "proprietary variant" you mean "publicly documented
| variant", which IIRC is just the normalization tables frozen in
| time from an early version of Unicode (the idea being that
| updating your OS shouldn't change the rules about what
| filenames are valid).
| 
| As for APFS, it ~~doesn't~~didn't normalize, but I believe it
| still requires UTF-8. And the OS will normalize filenames at a
| higher level. EDIT: they added native normalization. At least
| for iOS; I didn't dig enough to check if macOS is doing native
| normalizing or is just normalization-insensitive.
| chrismorgan wrote:
| Normalisation is expressly done with the composition of version
| 3.1 for compatibility: see
| <https://www.unicode.org/reports/tr15/#Versioning>. If that's
| what HFS+ does, then "proprietary variant" is wrong. And if
| not, I'm curious what it does differently.
| 
| (On the use of version 3.1, note that in practice version 3.2
| is used, correcting one typo: see
| <https://www.unicode.org/versions/corrigendum3.html>.)
| 
| I find a few references to it being slightly different, but not
| one of them actually says what's different; Wikipedia is the
| only one with a citation
| (<https://en.wikipedia.org/wiki/HFS_Plus>: "and normalized to a
| form very nearly the same as Unicode Normalization Form D
| (NFD)[12]"), and that citation says it's UAX #15 NFD, no
| deviations. One library that handles HFS+ differently switches
| to UCD 3.2.0 for HFS+
| <https://github.com/ksze/filename-sanitizer/blob/e990e963dc5b...>,
| but my impression from UAX #15 is that this should be
| superfluous, not actually changing anything. (Why is UCD 3.2.0
| still around there? Probably because IDNA 2003 needs it:
| <https://bugs.python.org/issue42157#msg379674>.)
| 
| _Update:_ https://developer.apple.com/library/archive/technotes/tn/tn1...
| has actual technical information, but the table in question
| doesn't show Unicode version changes like they claim it does,
| so I dunno. Looks like maybe from macOS 10.3 it's exactly UAX
| #15, but 8.1-10.2 was a precursor? I'm fuzzy on where the
| normalisation actually happens, anyway.
| GlitchMr wrote:
| The `filename-sanitizer` library you have linked has the
| following comment.
| 
|     # FIXME: improve HFS+ handling, because it does not use the
|     # standard NFD. It's close, but it's not exactly the same thing.
|     'hfs+': (255, 'characters', 'utf-16', 'NFD'),
| 
| I wonder what that means...
| matja wrote:
| ZFS can support normalization also:
| 
|     $ echo test > $'\xc3\xb6'
|     $ cat $'\x6f\xcc\x88'
|     cat: ö: No such file or directory
|     $ zfs create -o normalization=formD pool/dataset
|     $ echo test > $'\xc3\xb6'
|     $ cat $'\x6f\xcc\x88'
|     test
| zekica wrote:
| macOS is interesting: some APIs normalize filenames while
| others don't. And it causes some very interesting bugs.
| 
| One example: when you submit a file in Safari it doesn't
| normalize the file name, while js file.name does.
| stefan_ wrote:
| Sure, but at some point you might want to create a file,
| frequently using user input, or filter files using some
| user-provided query string - the kind of use cases that unicode
| normalization was invented for. So the whole "opaque blob of
| bytes" filesystem handling is nice if all you want is to not
| silently corrupt files, but it is very obviously not even
| covering 10% of normal use cases. Rust isn't being super smart,
| it just has its hands thrown up in the air.
| alkonaut wrote:
| Falls over on the fact that I don't want to be able to write
| these two files in the same dir. If I write file ö1.txt and
| ö1.txt then I want to be warned that the file exists even if
| the encoding is different when I use two different apps but try
| to write the same file.
| 
| The same applies for a.txt and A.txt on case-insensitive file
| systems (which, as someone pointed out, the most common desktop
| file systems are).
| pavlov wrote:
| The most common desktop file systems are case-insensitive,
| which complicates the picture.
| Pxtl wrote:
| Still, it looks like the right thing to do is let the
| filesystem do the filesystem's job. The filesystem should be
| normalizing unicode and enforcing the case-insensitivity and
| whatnot, but _just_ the filesystem. Wrappers around it like
| whatever Nextcloud is doing should be treating the filenames as
| a dumb pile of bytes.
| dataflow wrote:
| I'm not sure this problem even _has_ a "right" solution.
| 
| > Wrappers around it like whatever Nextcloud is doing should be
| treating the filenames as a dumb pile of bytes.
| 
| What do you do when the input isn't a dumb pile of bytes, but
| actual text? (Like from a text box the user typed into?)
| rzzzt wrote:
| Maintain a table that maps the original file name to a
| randomly generated one that doesn't hit these gotchas.
| rob_c wrote:
| And place the files in chunks, and... Wait, I think we're
| getting close to reinventing block storage again ;)
| dataflow wrote:
| I'm afraid I don't follow. Who maintains this table and who
| consumes it? What if they're different entities? How do you
| prevent it from going out of sync with the file system when
| the user renames a file? Are you inventing your own file
| system here? How do you deal with existing file systems?
| rzzzt wrote:
| I assumed that you have a system where file
| management/synchronization happens strictly through a web
| interface, and files are not changed or renamed outside this
| system's knowledge. Under these preconditions, having such a
| mapping table frees the users from having to abide by whatever
| restrictions the underlying file system places on valid file
| names.
| dataflow wrote:
| Oh, I was talking about the general case from a programming
| standpoint.
What do you do on a typical local filesystem?
| 
| The point I'm trying to get at is that you need to worry about
| the representation at multiple layers, not just at the bottom
| FS layer.
| mjevans wrote:
| Case insensitivity is a braindead behavior. If desired, it
| should be a fallback path selecting the best match, not the
| first resort.
| laurent92 wrote:
| So you're fine with ~/Downloads and ~/downloads coexisting as
| entirely separate directories? And John.McCauley@yahoo.fr and
| john.mccauley@yahoo.fr being attributed to two different
| people ;)
| im3w1l wrote:
| > So you're fine with ~/Downloads and ~/downloads coexisting
| as entirely separate directories?
| 
| Case (in)sensitivity for filenames is a non-issue in my
| experience. Never had problems with either convention. As for
| emails, I do think insensitivity was the right choice.
| tim-- wrote:
| The RFC states that email addresses are case sensitive.
| 
|     The local-part of a mailbox MUST BE treated as case
|     sensitive.
| 
| Section 2.4, RFC 2821: https://www.ietf.org/rfc/rfc2821.txt
| deadbunny wrote:
| My guess would be that the local part of an email address
| would usually map to a directory on case-sensitive
| filesystems...
| justaguy37 wrote:
| can we just say no to capital letters? (or lowercase?)
| 
| do capital letters have a good enough use case to justify
| their continued existence?
| Lammy wrote:
| Fun fact: The Apple II and II+ originally only did upper-case,
| and it was very popular to add a Shift Key / lower-case mod
| via one of the gamepad buttons:
| https://web.archive.org/web/20010212094858/http://home.swbel...
| colejohnson66 wrote:
| You are free to stop using capital letters, but good luck
| getting everyone to go along. Capitals have been around for
| centuries (they're older than the printing press) and aren't
| going anywhere.
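For what it's worth, a Unicode-aware case-insensitive comparison is more than ASCII tolower(). A minimal Python sketch using full case folding (real filesystems each ship their own frozen folding tables, so this is only an approximation of what any given one does):

```python
# Naive case-insensitive name comparison via Unicode case folding.
# str.casefold() applies full case folding, which is more aggressive
# than lower() - e.g. the German sharp s folds to "ss".
def same_name(a, b):
    return a.casefold() == b.casefold()

print(same_name("A.txt", "a.txt"))           # True
print(same_name("Straße.txt", "STRASSE.txt"))  # True: 'ß' folds to 'ss'
print(same_name("a.txt", "b.txt"))           # False
```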
| vgel wrote:
| First one: yes, though good UI should prevent it from happening
| unless the user really intended it (for example, I have
| ~/Documents symlinked into Dropbox, so ~/documents could be
| local-only documents).
| 
| Second one: no, emails are not filenames, and more generally
| distinguishability is more important for identifiers. In cases
| where identifiers like emails need to be mapped to filenames,
| like caches, they should be normalized.
| jodrellblank wrote:
| The opposite; case insensitivity is what human brains do. We
| read word, WORD, Word and woRD as the same thing; it's computer
| case-sensitive matching which is "brainless". Computers not
| aligning with what humans do is annoying and frustrating; they
| should be tools for us, not us for them. There's no way two
| people would write ö and ö and have readers think they were
| different because one was written in oil-based ink and one in
| water-based ink, or whatever compares with behind-the-scenes
| implementation details like combining form vs. single
| character.
| 
| I have just been arguing the same thing in far too much detail
| in this thread: https://news.ycombinator.com/item?id=29722019
| rob_c wrote:
| WORD, Word WoRD....
| 
| Sorry to say, I tend to use case sensitivity as a filter for
| whether I offer support to other developers. I'm not willing to
| find time for people who can't get their head around "turn
| on/off caps lock". You don't do it in professional writeups or
| applications (and I hope not in a CV), so don't pollute my
| filesystems or codebases with that madness.
| skymt wrote:
| There are a couple of arguments against case-insensitive
| filesystems I think are strong. The first is simply
| compatibility with existing case-sensitive systems. The second
| is that case is locale-dependent, so a pair of names could be
| equivalent or not depending on the device's locale.
| 
| I don't think I've seen any good argument against
| normalization, though.
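The locale-dependence point is visible even in Unicode's default case mappings. Python applies the default (locale-independent) mapping, under which the Turkish dotted capital İ (U+0130) lowercases to 'i' plus a combining dot, not plain 'i'; only Turkish locale tailoring yields plain 'i'. A quick illustration:

```python
# U+0130 is LATIN CAPITAL LETTER I WITH DOT ABOVE (Turkish 'İ').
# Unicode's default lowercase mapping turns it into 'i' + U+0307
# (COMBINING DOT ABOVE); Turkish locale tailoring would give plain
# 'i' instead - so whether two names are "equivalent" can depend on
# the locale doing the folding.
assert "I".lower() == "i"
assert "\u0130".lower() == "i\u0307"
assert "\u0130".lower() != "i"
print("default-mapping checks passed")
```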
| jrochkind1 wrote:
| Well, precisely because if you _don't_ normalize the filenames,
| ö ≠ ö. You could have two files with different filenames,
| `göteborg.txt` and `göteborg.txt`, and they are different files
| with different filenames.
| 
| Or you could have one file `göteborg.txt`, and when you try to
| ask for it as `göteborg.txt`, the system tells you "no file by
| that name".
| 
| Unicode normalization is the _solution_ to this. And the
| unicode normalization algorithms are pretty good. The bug in
| this case is that the system did not apply unicode
| normalization consistently. It required a non-default config
| option to be turned on to do so? I don't really understand
| what's going on here, but it sounds like a bug in the system to
| me that this would be a non-default config option.
| 
| Dealing with the entire universe of human language is
| inherently complicated. But unicode gives us some actually
| pretty marvelous tools for doing it consistently and
| reasonably. But you still have to use them, and use them right,
| and as with all software, bugs are possible.
| 
| But I don't think you get fewer crazy edge cases by not
| normalizing at all. (In some cases you can even get security
| concerns; think about usernames and the risk of `john` and
| `john` being two different users...) I know that this is the
| choice some traditional/legacy OSs/file systems make, in order
| to keep pre-unicode-hegemony backwards compat. It has problems
| as well. I think the right choice for any greenfield
| possibilities is consistent unicode normalization, so
| `göteborg.txt` and `göteborg.txt` can't be two different files
| with two different filenames.
| 
| [btw I tried to actually use the two common different forms of
| ö in this text; I don't believe HN normalizes them so they
| should remain.]
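The two-encodings situation this comment describes is easy to reproduce with the stdlib `unicodedata` module; a short sketch:

```python
import unicodedata

nfc = "g\u00f6teborg.txt"    # 'ö' as one precomposed code point (NFC)
nfd = "go\u0308teborg.txt"   # 'o' + U+0308 COMBINING DIAERESIS (NFD)

assert nfc != nfd            # == compares code points, so these differ
assert unicodedata.normalize("NFC", nfd) == nfc
assert unicodedata.normalize("NFD", nfc) == nfd

# Compatibility normalization folds even more distinctions:
assert unicodedata.normalize("NFKC", "\u00b9") == "1"  # superscript one
print("all normalization checks passed")
```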
| nieve wrote:
| It looks like instead of the config option switching everything
| to use the same normalization, it keeps a second copy of the
| name in a database to compare to. What a horrible kludge; I
| wonder how they even got into this situation of using different
| normalization in different parts of the system?
| arka2147483647 wrote:
| That works for programmers, but not for users. There could be
| several files with the same name, but with different encodings.
| Worse, depending on how your terminal encodes user input, some
| of them might not be typable.
| zarzavat wrote:
| From the user's perspective I don't want any normalisation at
| all. It's fine as long as you only have one file system, but as
| soon as you get multiple file systems with conflicting rules
| (which includes transferring files to other people) it becomes
| hell. Unfortunately we are stuck with that hell.
| heikkilevanto wrote:
| Well, if 7-bit US ASCII was good enough for our Lord, it is
| good enough for me ;-)
___________________________________________________________________
(page generated 2021-12-31 23:00 UTC)