[HN Gopher] UTF-8 Everywhere
       ___________________________________________________________________
        
       UTF-8 Everywhere
        
       Author : pcr910303
       Score  : 244 points
       Date   : 2020-04-14 15:55 UTC (7 hours ago)
        
 (HTM) web link (utf8everywhere.org)
 (TXT) w3m dump (utf8everywhere.org)
        
       | ddebernardy wrote:
       | (2012)
       | 
       | Previous discussions:
       | https://news.ycombinator.com/from?site=utf8everywhere.org
        
       | rakoo wrote:
       | Maybe it's time for MySQL to make "utf8" actually mean UTF-8 then
       | (https://medium.com/@adamhooper/in-mysql-never-use-utf8-use-u...)
        
         | treve wrote:
         | > Although utf8 is currently an alias for utf8mb3, at some
         | point utf8 will become a reference to utf8mb4. To avoid
         | ambiguity about the meaning of utf8, consider specifying
         | utf8mb4 explicitly for character set references instead of
         | utf8.
         | 
         | https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8...
         | 
          | Can't fault a database vendor for being conservative, but it
          | looks like this is planned. Maybe this will be a 9.0 thing.
        
         | smacktoward wrote:
          | They probably couldn't even if they wanted to; by this point
          | there will be too much software out there depending on "utf8"
          | meaning "MySQL's weird proprietary hacked-up version of UTF-8".
         | 
         | The only real solution is to hammer home the message that
         | "utf8mb4" is what you put into MySQL if you want UTF-8.
        
           | cosarara wrote:
            | There are actual problems too when switching from utf8mb3 to
            | utf8mb4, because of the maximum varchar length in indices:
           | https://stackoverflow.com/questions/48500355/mysql-
           | character...
        
       | tialaramex wrote:
        | However, sometimes you're in a layer where ASCII is fine, and you
        | should just be explicit about that.
        | 
        | Server Name Indication (in RFC 3546) is flawed in several ways.
        | It's a classic unused extension point, for example: it has an
        | entire field for what type of server name you mean, with only a
        | single value for that field ever defined. But the flaw that
        | stands out is that it uses UTF-8 encoding rather than insisting
        | on ASCII for the server name.
       | 
        | You can see the reasoning: international domain names are a big
        | deal, so we should embrace Unicode. But IDNA already needed to
        | handle all this work; the DNS A-labels are already ASCII even for
        | IDNs.
       | 
       | Essentially choosing UTF-8 here only made things needlessly more
       | complicated in a critical security component. Users, the people
       | who IDNs were for, don't know what SNI is, and don't care how
       | it's encoded.
        
       | magicalhippo wrote:
       | I'd be happy if I could just get consistent encoding. Have to
       | handle way too many files with mixed encoding, even XML files
       | with explicit encoding header.
        
       | [deleted]
        
       | GnarfGnarf wrote:
       | I came to the same conclusion years ago. My app is Win32, but I
       | never defined UNICODE or used the TCHAR abomination. All strings
       | are stored as UTF8 until they are passed to Win32 APIs, whereupon
       | they are converted to UCS-2. I explicitly call the wchar version
       | of functions (ex: TextOutW). This strategy enabled me to
       | transition easily and safely from single-byte ASCII (Windows 3.1)
       | to Unicode.
       | 
       | The database is also UTF8.
        
       | projektfu wrote:
       | When I used to do a lot of windows programming in the late 90s, I
       | wish that I had a sensible guide like this for handling strings.
       | TCHAR was always a source of subtle bugs.
       | 
       | I suppose, though, that the underlying problem was that Microsoft
       | was so late to implement a compatibility solution for Windows 9x.
        | Most software of the time ended up implementing the "ANSI"
        | multibyte character set (MBCS), just because otherwise you would
        | need to either deploy 2 executables or do your own thunking. The
        | UTF-8 approach would be a double thunk on 9x, because you'd be
        | thunking your UTF-8 to Unicode and then thunking that back to
        | MBCS.
        
       | [deleted]
        
       | Animats wrote:
       | I'd argue for some standard tests for UTF-8 strings:
       | 
       | - Basic - UTF-8 byte syntax correct.
       | 
        | - Unambiguous - similar to the rules for Unicode domain names.
        | The rules are complicated, but basically they prohibit
        | homoglyphs, mixing glyphs from different character sets, forward
        | and backward modifiers in the same string, emoji, modifiers,
        | etc. Use where people have to visually compare two things for
        | identity or retype them, such as file names.
       | 
       | - Unambiguous, light version - as above, but allow emoji and
       | modifiers. Normal form for documents.
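        | 
        | A minimal Python sketch of the "Basic" level above (strict
        | decoding rejects malformed sequences, overlong forms and
        | surrogate code points); the higher levels need real Unicode
        | property data and are much more work:
        | 
        |     def is_valid_utf8(data: bytes) -> bool:
        |         try:
        |             data.decode("utf-8", errors="strict")
        |             return True
        |         except UnicodeDecodeError:
        |             return False
        | 
        |     assert is_valid_utf8("na\u00efve".encode("utf-8"))
        |     assert not is_valid_utf8(b"\xc0\xaf")      # overlong '/'
        |     assert not is_valid_utf8(b"\xed\xa0\x80")  # lone surrogate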
        
       | shpx wrote:
       | What I never see mentioned about Unicode is Han Unification
       | 
       | https://en.m.wikipedia.org/wiki/Han_unification
       | 
       | As I understand it, it's impossible to have a txt file that uses
       | Japanese and Chinese characters at the same time. The file will
       | either use the Chinese or Japanese forms of the characters,
       | depending on your font. I would think this is a big gotcha people
       | must run into all the time, but I never hear anyone talk about
       | it.
        
         | gsnedders wrote:
         | Relatively few people frequently look at different Han
         | languages, and relatively few people are looking at txt files
         | containing Han characters (and I expect those that do are
         | typically running with their OS locale set to one of the Han
         | languages?).
         | 
         | Enough CJK HTML content is tagged and heuristics are mostly
         | good enough that incorrect font selection isn't a massive issue
         | on the web, and AFAIK most major word processors include
         | metadata in the file that suffices to distinguish language.
        
         | klodolph wrote:
         | I'm not going to try and minimize the problem, here. Han
         | unification was pushed through by western interests, by my
         | understanding.
         | 
         | However, most Unicode characters are identical or nearly
         | identical in Chinese and Japanese. Characters with
         | "significant" visual differences got encoded as different
         | Unicode characters. The same thing applies to simplified and
         | traditional Chinese characters.
         | 
         | So for a given "Han character", there might be between one and
         | three different Unicode characters, and there might be between
         | one and three different ways of writing it.
         | 
         | Here's an illustration:
         | https://japanese.stackexchange.com/questions/64590/why-are-j...
         | 
          | So the issue does come up when mixing Chinese and Japanese
          | text, but it's not really one that has a big impact on the
          | _legibility_ of the text, though you would definitely be
          | concerned if you were writing a Japanese textbook for Chinese
          | students, or vice versa.
         | 
         | Beyond that, it is usually fairly trivial to distinguish
         | between Japanese and Chinese text, so you could just lean on
         | simple heuristics to get the work done (Japanese text, with the
         | exception of fairly ancient text or very short fragments,
         | contains kana, but Chinese does not).
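          | 
          | A minimal sketch of that heuristic in Python (kana blocks
          | only; real detection would also want half-width katakana and
          | better handling of very short fragments):
          | 
          |     def looks_japanese(text: str) -> bool:
          |         # Hiragana U+3040-309F, Katakana U+30A0-30FF
          |         return any(0x3040 <= ord(c) <= 0x30FF for c in text)
          | 
          |     looks_japanese("\u6771\u4eac\u3067\u3059")  # True
          |     looks_japanese("\u4e1c\u4eac")              # False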
        
           | cygx wrote:
           | _Han unification was pushed through by western interests, by
           | my understanding._
           | 
           | Note that as far as I'm aware, the interest in question was
           | the initial 16-bit limit of the character set and later on
           | the non-proliferation of competing standards.
           | 
           | Also note that while Han unification is the most prominent
           | example, there are technically similar cases, which just
           | aren't as charged culturally. For one, Unicode doesn't encode
           | German Fraktur: While some characters are available due to
           | their use in mathematics, it's lacking the corresponding
            | variants of ä, ö, ü, ß and long s (ſ), as well as
            | specific ligatures. So
           | if you want to intermix modern with old German writing,
           | you'll also have to go out-of-band.
        
             | anoncake wrote:
             | That's not the same thing. Fraktur is just a style of
             | fonts, antiqua and fraktur letters are semantically the
             | same.
        
               | cygx wrote:
               | There are differences as well as similarities. I'm no
               | expert, but shouldn't, say, U+4ECA still translate to
               | 'now' no matter if you draw a particular line
               | horizontally or diagonally? There are also some
               | mandatory[1] ligatures in Fraktur unavailable in Unicode.
               | What if I wanted to preserve that distinction in historic
               | writing?
               | 
               |  _edit:_
               | 
               | [1] I think the mandatory ones are actually there (just
               | not in Fraktur), it's some optional ones like sch that
               | are missing.
        
           | ksec wrote:
            | Yes, the real problem is when you start mixing all four (or
            | five) of them together - Traditional Chinese, Simplified
            | Chinese, Korean, Japanese - things become extremely
            | problematic.
            | 
            | I think it is by luck that all four writing systems have
            | significant usage within their own regions. Imagine if one of
            | them were significantly smaller and over time were forced (by
            | ease of use, or whatever reason) to switch to a different
            | style without knowing it.
        
           | cryptonector wrote:
            | As I understand it, Han unification happened because at the
            | time all there was was UCS-2 (no UTF-16, no UTF-8), so
            | codespace was tight and precious, and that motivated
            | codespace-preserving optimizations, of which Han unification
            | is the notable one.
            | 
            | To avoid that, they would have needed to invent UTF-8 many
            | years earlier. Perhaps if the people designing Unicode had
            | been more diverse, they might have felt the necessity of
            | something like UTF-8 strongly enough to actually invent it,
            | but then perhaps they might have done it poorly. At any rate,
            | I don't know enough details to really know if "Han
            | unification was pushed through by western interests" is
            | remotely fair.
        
             | macintux wrote:
             | UTF-8 was sketched on a placemat as a response to a
             | different idea. It seems likely that had it not arisen in a
             | moment of inspiration by a genius, we would be stuck with
             | another inferior design by committee.
             | 
             | https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt
        
         | cryptonector wrote:
         | First of all, there is no new unification work ongoing. The
         | Unicode Consortium moved on from that by moving on from UCS-2.
         | UCS-2 drove unification as a way to preserve precious
         | codespace.
         | 
         | There used to be language tag codepoints for this, but they've
         | been deprecated. Han unification is an accident of history: a
         | result of UTF-8 not having existed until it was too late!
         | 
         | There's not going to be a different new Unicode for doing away
         | with Han unification, which is why no one mentions it: besides
         | crying about it, what else can one do? Maybe we should revive
         | language tags?
         | 
         | Anyways, isn't the difference between unified Han/Kanji
         | characters mostly stylistic rather than semantic? I'm not
         | denying that many users would get annoyed, but again, what to
         | do about it??
        
           | innocenat wrote:
            | It's only a stylistic issue if you also consider a and α
            | (alpha) to be just stylistically different.
           | 
           | I have learned to live with it, but it is very annoying.
        
           | ksec wrote:
            | The same could be said about whether e and é should just be
            | the same e with different fonts. People who care about it
            | would complain. To those who only use English it is just the
            | same _e_.
        
             | microtherion wrote:
              | I don't think that's the same, because e.g. in French, e,
              | é, è, and ê are all used, with different pronunciations.
        
           | wheybags wrote:
            | It's different enough that users will _immediately_ complain
            | if you get it wrong. And it means that you, as a developer
            | who might not understand either Chinese or Japanese, now have
            | to deal with the fallout by setting a different font in your
            | application depending on which of the two languages it is.
            | This happened to us in Factorio, and it was super annoying,
            | because it's really hard to spot the problem before it goes
            | live: you A: don't know the problem exists (how would you?),
            | and B: have a hard time seeing it even when you do know. The
            | whole point of Unicode is to not have to think about this
            | crap or handle it explicitly, and this breaks that guarantee
            | fantastically.
        
       | hutzlibu wrote:
        | As someone who experienced serious pain with broken strings that
        | I sometimes only discovered after the original files were gone
        | and new special characters had been introduced, I directed quite
        | some anger at the fact that computer systems are internally
        | mostly operated in English only, so usually nobody notices bugs
        | with wrong character encoding. So I share the sentiment of the
        | article.
        | 
        | I do not want to think about UTF encodings when I simply create
        | a 7z or tar file, without even programming. But I learned the
        | hard way that I had to. I never even found out, for example,
        | whether it was/is a bug in 7z, tar, rsync, the SciTE text editor
        | / Notepad++, or just wrong usage/configuration. I just had (and
        | still have, even now that my workflow is clean) a special first
        | file/code line with special characters, which I checked to be
        | correct after compressing and rsyncing between different
        | systems, especially between Windows and Linux. But it probably
        | helps that I don't have to do that anymore.
        
       | anderspitman wrote:
       | Trying to figure out how to express this without making people
       | mad at me. I think the conflation of Unicode with "plain text"
       | might be a mistake. Don't get me wrong, Unicode serves an
       | important purpose. But bumping the version from plain text 1.0
       | (ASCII) to plain text 2.0 (Unicode) introduced a ton of
       | complexity, and there are cases where the abstractions start
       | leaking (iterating characters etc).
       | 
       | With things like data archival, if I have a hard drive with the
       | library of congress stored in ASCII, I need half a sheet of paper
       | to understand how to decode it.
       | 
       | Whereas apparently UTF8 requires 7k words just to explain why
       | it's important. And that's not even looking at the spec.
       | 
       | Just to be crystal clear, I'm not advocating to not use Unicode,
       | or even use it less. I'm just saying I think it maybe shouldn't
       | count as plain text, since it looks a lot like a relatively
       | complicated binary format to me.
        
         | cryptonector wrote:
         | There are tens of thousands of characters in all the human
         | scripts. If you're a librarian, scholar, researcher -- why
         | would you not want to be able to use them seamlessly??
        
           | droopyEyelids wrote:
           | If there was a complicated tool that claimed it could do the
           | job of every tool in history, or a simple tool that was
           | focused to cover 99% of the work you do-- and we lived on
           | planet earth-- which would you choose?
        
             | crazygringo wrote:
             | Umm... but ASCII doesn't work for 99% of people's work.
             | 
             | A majority of the world's population have writing systems
             | that ASCII doesn't encode.
             | 
             | So not really sure what you're suggesting here.
        
         | dtech wrote:
          | Of course ASCII is simpler than Unicode; it handles only 128
          | characters. If you restrict yourself to those characters, ASCII
          | is binary-equivalent to UTF-8.
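          | 
          | A quick way to convince yourself of that claim in Python:
          | 
          |     s = "plain ASCII text"
          |     assert s.encode("ascii") == s.encode("utf-8")
          |     # ...and it stops holding as soon as you leave ASCII:
          |     "caf\u00e9".encode("utf-8")   # b'caf\xc3\xa9'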
         | 
         | So yeah, maybe you shouldn't use characters 128+ for data
         | archival, I doubt that's a good idea, but that's irrelevant to
         | whether UTF-8 is plain text or not.
        
           | tachyonbeam wrote:
           | I think that sometimes it makes sense to enforce strict
           | limitations early on (eg: overly strict input validation).
           | You can then remove such limitations in later versions of
           | your software, after careful consideration and after
           | inserting the necessary tests. The reverse usually doesn't
           | work. If you didn't have those limitations early on, and your
           | database is full of strings with characters that should never
           | have been allowed in there, you will have a hard time
           | cleaning up the mess.
           | 
           | This seems especially true to me in the design of programming
           | languages. If you have useless, badly thought out features in
           | your programming language, people will begin to rely on them,
           | and you will never be able to get rid of them... So start
           | with a small language, and make it strict. Grow it gradually.
        
         | [deleted]
        
         | nlitened wrote:
         | As a person who comes from a country with non-ASCII alphabet, I
         | strongly disagree. Since UTF-8 became de-facto standard
         | everywhere, so many headaches went away.
        
         | tingletech wrote:
         | LET'S GO BACK TO 6-BITS
        
         | hechang1997 wrote:
          | That complexity comes from the fact that you are using non-
          | ASCII characters. UTF-8 is a superset of standard ASCII. If you
          | are using only standard ASCII characters, they're exactly the
          | same thing.
        
         | jaseemabid wrote:
         | ASCII is English and limiting access to knowledge for the rest
         | of humanity for a simpler encoding is just not an acceptable
         | option. Someone needs to interpret those 7k words and write a
         | (complicated?) program once so that billions can read in their
         | own language? Sounds like an easy win to me.
        
           | droopyEyelids wrote:
           | counterpoint:
           | 
           | A complicated program is never an easy win, and English is
           | already spoken in every country in the world.
        
             | WorldMaker wrote:
             | Sure spoken, but both Arabic and CJK ideograms are written
             | in far more countries in the world, with far more people,
             | and for far longer in history than the ASCII set. The
             | oldest surviving great works of Mathematics were written in
             | Arabic and some of the oldest surviving great works of
              | Poetry were written in Chinese, as just two easy and
             | obvious examples of things worth preserving in "plain
             | text".
        
             | crazygringo wrote:
             | So your argument is... it's easier to teach billions of
             | people fluent English... than for software to support
             | UTF-8?
             | 
             | You are aware that a majority of the world's population
             | speaks no English whatsoever?
        
               | tachyonbeam wrote:
               | Playing the devil's advocate here. I am not a native
               | English speaker, I'm a French speaker, but I'm happy that
               | English is kind of the default international language.
                | It's a relatively simple language. I actually make fewer
                | grammar mistakes in English than I do in my native
                | language. I suppose it's probably not a politically
                | correct thing to say (the English are the colonists, the
                | invaders, the oppressors), but eh, maybe it's also kind
                | of a nice thing for world peace if there is one
                | relatively simple language that's accessible to everyone?
               | 
               | Go ahead and make nice libraries that support Unicode
               | effectively, but I think it's fair game, for a small
               | software development shop (or a one-person programming
               | project), to support ASCII only for some basic software
               | projects. Things are of course different when you're
               | talking about governments providing essential services,
               | etc.
        
         | TheCoelacanth wrote:
         | You only need one sentence to explain why ASCII isn't
         | sufficient: There are languages other than English.
        
           | ignoramous wrote:
           | > You only need one sentence to explain why ASCII isn't
           | sufficient
           | 
           | Nitpick: ASCII is sufficient when you consider that Base64,
           | despite its 33% overhead from representing 6 bits with 8
           | bits, makes life easier for certain classes of software.
        
             | TheCoelacanth wrote:
             | Base64 is an encoding for representing bytes[0] in ASCII.
             | 
             | That doesn't help you represent text unless you already
             | have an encoding for representing text in bytes (e.g.
             | UTF8).
             | 
             | [0] Octets if you want to be pedantic
        
               | ignoramous wrote:
               | What I was alluding to is, I often convert any binary
               | data, including text, to Base64 to avoid dealing with
               | cross platform, cross language, cross format, cross
               | storage, cross network data-handling. Only the layer that
               | needs to deal with the blob's actual string
               | representation needs to worry about encoding schemes that
               | are outside the purview of the humble ASCII table.
        
             | dtech wrote:
              | You still need an encoding to represent non-ASCII
              | characters like é or µ. Base64 is no help at all there.
        
           | msla wrote:
           | And you're naive if you think ASCII suffices for English. I
            | wouldn't give you ½¢ for an OS incapable of handling
            | Unicode and UTF-8 even if you told me every language other
            | than English were mysteriously destroyed. Going back to ASCII
            | is 180° from what would enrich English-language text.
        
         | pjscott wrote:
         | _Unicode_ is complicated because the languages it needs to
          | handle are, alas, complicated. _UTF-8_ is super simple. It's a
         | variable-length encoding for 21-bit unsigned integers.
         | Wikipedia gives a handy table showing how it works:
         | 
         | https://en.wikipedia.org/wiki/UTF-8#Description
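          | 
          | A minimal sketch of that table as Python (validity details
          | like the surrogate range are ignored for brevity):
          | 
          |     def encode_utf8(cp: int) -> bytes:
          |         if cp < 0x80:                  # 0xxxxxxx
          |             return bytes([cp])
          |         if cp < 0x800:                 # 110xxxxx 10xxxxxx
          |             return bytes([0xC0 | cp >> 6,
          |                           0x80 | cp & 0x3F])
          |         if cp < 0x10000:               # 1110xxxx 10... 10...
          |             return bytes([0xE0 | cp >> 12,
          |                           0x80 | (cp >> 6) & 0x3F,
          |                           0x80 | cp & 0x3F])
          |         return bytes([0xF0 | cp >> 18, # 11110xxx 10... etc.
          |                       0x80 | (cp >> 12) & 0x3F,
          |                       0x80 | (cp >> 6) & 0x3F,
          |                       0x80 | cp & 0x3F])
          | 
          |     assert encode_utf8(0x20AC) == "\u20ac".encode("utf-8")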
        
           | ftvy wrote:
           | When I wrote a very primitive UTF-8 library, I really began
           | to appreciate UTF-8's design. For example; the first byte
           | says how many bytes the character requires. At first it was
           | daunting, but when I put 2 and 2 together, it really opened
           | up.
           | 
           | I am sure there are many aspects I am missing about UTF-8,
           | but it is all reasonable in its design and implementation.
           | 
           | For reference, I was converting between code points and
            | actual bytes, and also implemented strlen and strcmp (the
            | latter of which the standard library apparently handles fine).
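            | 
            | A minimal sketch of reading the length from the lead byte
            | in Python:
            | 
            |     def seq_len(lead: int) -> int:
            |         if lead < 0x80: return 1    # 0xxxxxxx
            |         if lead < 0xC0: raise ValueError("continuation")
            |         if lead < 0xE0: return 2    # 110xxxxx
            |         if lead < 0xF0: return 3    # 1110xxxx
            |         return 4                    # 11110xxx
            | 
            |     seq_len(0xE2)   # 3, e.g. the lead byte of U+20AC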
        
             | TheCoelacanth wrote:
             | The self-synchronizing property is also very clever. If you
             | start at an arbitrary byte, you can find the start of the
             | next character by scanning forward a maximum of 3 bytes.
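              | 
              | A minimal Python sketch: continuation bytes all look like
              | 10xxxxxx, so you just skip them to find the next lead
              | byte.
              | 
              |     def next_start(buf: bytes, i: int) -> int:
              |         i += 1
              |         # skip 10xxxxxx continuation bytes
              |         while i < len(buf) and (buf[i] & 0xC0) == 0x80:
              |             i += 1
              |         return i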
        
           | carapace wrote:
           | Yeah, this. I have a pat "Unicode Rant" that boils down to
           | this essentially.
           | 
           | Having a catalog of standard numbers-to-glyphs (or symbols or
           | whatever, little pictures humans use to communicate with) is
           | awesome and useful (and all ASCII ever was) but trying to
           | digitalize all of human language is much much more
           | challenging.
        
       | dpc_pw wrote:
       | > For instance, 'ch' is two letters in English and Latin, but
       | considered to be one letter in Czech and Slovak.
       | 
       | Is "ch" really considered one _character_ in Czech and Slovak?
       | I'm Polish and we do have "ch" and consider it one ... sound...
       | represented by two letters? I mean... if you asked anyone to
       | count letters/characters in a word, they would count "ch" as two.
        | So I wonder if that's different in Slovakia or the Czech Republic,
        | or if it's just my definition of "character" that's wrong.
        
         | andy_wrote wrote:
         | Based on my experience learning Czech (not native at all, just
         | interested):
         | 
         | - it's typically listed as a separate letter when writing out
         | the alphabet
         | 
         | - but in practice it's typed out as "c h" and not as a single
         | character
         | 
         | - it occupies its own place in Czech standard alphabetical
         | order, my English-Czech dictionary has all the "ch" words after
         | "h" (so interestingly in order to do a proper sort
         | programmatically you need to possibly look 2 characters ahead)
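          | 
          | A minimal sketch of getting that sort in Python via the C
          | library's collation data (assumes a Czech locale such as
          | cs_CZ.UTF-8 is installed on the system):
          | 
          |     import locale
          |     locale.setlocale(locale.LC_COLLATE, "cs_CZ.UTF-8")
          | 
          |     words = ["cibule", "chata", "hrad", "ihned"]
          |     sorted(words, key=locale.strxfrm)
          |     # Czech order puts "ch" after "h":
          |     # ['cibule', 'hrad', 'chata', 'ihned']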
        
         | pilsetnieks wrote:
          | At first I thought they simply meant the letter "c", but no, it
          | turns out that "ch" (and also "dz") is a digraph with a
          | separate place in the Czech and Slovak alphabets.
        
         | masklinn wrote:
         | > So I wonder if that's different in Slovakia or Chech
         | Republic, or is just my definition of "character" wrong.
         | 
          | According to wikipedia, "Ch" is a character of the Czech
          | alphabet in the sense that it impacts alphabetical ordering
          | ("Ch" sorts between H and I), in the same way Ł or Ę are
          | apparently characters of the Polish alphabet distinct from L
          | and E respectively (wikipedia mentions that "być comes after
          | bycie").
          | 
          | That is unlike, say, French, where É and E are the same
          | character alphabetically.
         | 
         | [0] https://en.wikipedia.org/wiki/Czech_orthography
        
         | mlj45 wrote:
         | This depends on your definition of informal terms like
         | "letter", "character" etc.
         | 
         | The typographic term for combinations like this is "digraph".
         | (Wikipedia's definition: "A digraph [...] is a pair of
         | characters used in the orthography of a language to write
         | either a single phoneme [...] or a sequence of phonemes that
         | does not correspond to the normal values of the two characters
         | combined".)
         | 
         | Whether digraphs have separate keys on a keyboard, are treated
         | as distinct for the purposes of alphabetisation, whether
         | speakers of the language think of them as separate "letters"
         | when spelling out a word and so on, are all separate issues and
         | varies between languages (or, more precisely, between the
         | conventions for writing a certain language).
        
         | Svip wrote:
         | A better example would probably be "ij" in Dutch. That's
         | definitely considered a single letter, as words starting with
        | ij in Dutch are capitalised IJ. Though there are glyphs for
        | IJ/ij already in Unicode.
        
           | roelschroeven wrote:
           | "ij" is sometimes considered a single letter, but certainly
           | not always. Quoting Wikipedia
           | (https://en.wikipedia.org/wiki/IJ_(digraph)):
           | 
           | "IJ (lowercase ij; Dutch pronunciation: [ei]) is a digraph of
           | the letters i and j. Occurring in the Dutch language, it is
           | sometimes considered a ligature, or a letter in itself. In
           | most fonts that have a separate character for ij, the two
           | composing parts are not connected but are separate glyphs,
           | which are sometimes slightly kerned."
           | 
           | (and equivalent in the Dutch Wikipedia article)
        
           | bartwe wrote:
           | Nobody has that as a letter on the keyboard here though, so
           | it doesn't matter. Normally typed as a digraph. Would be nice
            | if we just switched over to using y at this point. Makes me
            | wonder, is the use of diacritics declining since ASCII
            | keyboards became the norm?
        
             | kosievdmerwe wrote:
             | Afrikaans did this. We use "y" instead of "ij".
        
           | mercer wrote:
           | "Ij" is also one sounds represented bij two letters, and I
           | think capitalizing just the 'I' is pretty standard. As a
           | Dutch person myself, I didn't even know that there's a glyph
           | for it!
           | 
           | We also have "ei", which sounds the same and was invented to
           | annoy people learning Dutch. Then there's "oe", "eu", "ui".
           | And just to fuck even more with people learning the language,
           | we have "au" and "ou" which also sound the same. Oh, and "ch"
           | and "g".
           | 
           | Hans Brinker, the inventor of the Dutch language, famously
           | would toss a florijn to decide between using ei/ij and au/ou,
           | as he was not fond of foreigners. He's mostly known for
           | saving our country though when he plugged a hole in a dyke
           | with his finger (yes, I know what you're thinking, and no, we
           | do not appreciate your dirty minds making light of this
           | heroic act).
        
             | akie wrote:
             | As a Dutch person myself, capitalizing just the I and not
             | the J hurts my eyes. Ijsselmeer or IJsselmeer?
        
               | mercer wrote:
               | Interesting. I never really gave it much thought, but Ij
               | actually bothers me so much that I usually try to avoid
               | using it at the beginning of a sentence, and I cringe
               | when I need to capitalize because it's a place (like
               | Ijsselmeer).
               | 
               | Just did some googling. Turns out that unlike the other
               | combinations, capitalizing both letters is mandatory for
               | 'IJ'. TIL...
        
             | alexis_fr wrote:
              | Same as Œ/OE in French, then.
        
             | unwind wrote:
             | Spelling it "dike" helps keep people's minds on the right
             | thing. :)
        
               | samatman wrote:
               | If you spell it "dijk" it's even less racy, because it's
               | no longer a four-letter word.
        
               | mercer wrote:
               | Well shit. Guess I'll have to clean out my mind with some
               | soap...
        
           | masklinn wrote:
           | I don't know that that's correct. That there exists a
           | ligature character doesn't mean the ligature is a character
           | of the language.
           | 
            | It could, mind, I don't know Dutch. But in French "œ" (which
            | is a ligatured character, as you can see) is canonically
            | equivalent to "oe". It is not a separate letter of the
            | alphabet even though:
           | 
           | * many words should not be written with the ligatured form
           | 
           | * many words should be written with the ligatured form
           | 
           | * it has a different pronunciation than the base form
        
         | enedil wrote:
         | Yeah, but in Czech it's "c".
        
           | pilsetnieks wrote:
           | No. C is something else, ch is a digraph that's pronounced
           | differently. Take a look at Czech and Slovak alphabets
           | specifically:
           | 
           | https://en.wikipedia.org/wiki/Czech_orthography
           | 
           | https://en.wikipedia.org/wiki/Slovak_orthography
        
             | enedil wrote:
             | I'm Polish and I have just tangential knowledge of Czech
             | language. Sorry for confusion.
        
       | camgunz wrote:
       | This pops up every so often, and is wrong on several fronts (UNIX
       | is UTF-8, UTF-8/32 lexicographically sort, etc.) There's not
       | really a good reason to support UTF-8 over UTF-16; you can
       | quibble over byte order (just pick one) and you can try and make
       | an argument about everything being markup (it's not), but the
       | fact is that UTF-16 is a more efficient encoding for the
       | languages a plurality of people use natively.
       | 
       | But more broadly, being able to assume $encoding everywhere is
       | unrealistic. Write your programs/whatevers allowing your users to
       | be aware of and configure encodings. It might not be ideal, but
       | such is life.
        
         | jeltz wrote:
         | But is it really a plurality? Portuguese, English, Spanish,
         | Turkish, Vietnamese, French, Indonesian and German are stored
         | more efficiently in UTF-8 while Chinese, Korean and Japanese
          | are stored less efficiently. My gut feeling is that more people
         | use the Latin script than people using CJK scripts. Indic
         | scripts, Thai, Cyrillic, etc are stored using two bytes in both
         | UTF-8 AND UTF-16.
         | 
          | And this ignores markup, which is in ASCII.
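          | 
          | The trade-off is easy to measure (a minimal Python sketch;
          | the sample strings are arbitrary):
          | 
          |     def sizes(s: str):
          |         return (len(s.encode("utf-8")),
          |                 len(s.encode("utf-16-le")))
          | 
          |     sizes("hello")          # (5, 10)  Latin
          |     sizes("\u4f60\u597d")   # (6, 4)   two CJK characters
          |     sizes("\u043f\u043e")   # (4, 4)   two Cyrillic letters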
        
           | camgunz wrote:
           | Looking at the basic multilingual plane [1], UTF-8 will use >
           | 2 bytes to encode essentially anything that isn't:
           | 
           | * ASCII/Latin
           | 
           | * Cyrillic
           | 
           | * Greek
           | 
           | * Most of Arabic
           | 
           | That leaves out:
           | 
           | * China
           | 
           | * India
           | 
           | * Japan
           | 
           | * Korea
           | 
           | * All of Southeast Asia
           | 
           | Re: markup, think about any text that's in a database, stored
           | in RAM, or stored on a disk--relatively little of it will be
           | in noisy ASCII markup formats like HTML or XML.
           | 
           | [1]: https://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Mult
           | ilin...
        
             | jeltz wrote:
             | > All of Southeast Asia
             | 
             | Did you forget Indonesia, Vietnam, Malaysia, Brunei and the
             | Philippines?
        
               | camgunz wrote:
               | Again, here's what UTF-8 will use <= 2 bytes for:
               | 
               | Basic Latin (Lower half of ISO/IEC 8859-1: ISO/IEC
               | 646:1991-IRV aka ASCII) (0000-007F)
               | 
               | Latin-1 Supplement (Upper half of ISO/IEC 8859-1)
               | (0080-00FF)
               | 
               | Latin Extended-A (0100-017F)
               | 
               | Latin Extended-B (0180-024F)
               | 
               | IPA Extensions (0250-02AF)
               | 
               | Spacing Modifier Letters (02B0-02FF)
               | 
               | Combining Diacritical Marks (0300-036F)
               | 
               | Greek and Coptic (0370-03FF)
               | 
               | Cyrillic (0400-04FF)
               | 
               | Cyrillic Supplement (0500-052F)
               | 
               | Armenian (0530-058F)
               | 
                | Aramaic Scripts: Hebrew (0590-05FF), Arabic (0600-06FF),
                | Syriac (0700-074F), Arabic Supplement (0750-077F),
                | Thaana (0780-07BF), N'Ko (07C0-07FF)
               | 
               | In UTF-8, everything over U+0800 requires > 2 bytes. Am I
               | misunderstanding something? It's possible.
        
         | jcranmer wrote:
         | > There's not really a good reason to support UTF-8 over UTF-16
         | 
         | Two big reasons:
         | 
         | 1. All legal ASCII text is UTF-8. That means upgrading ASCII to
         | UTF-8 to support i18n doesn't require you to convert all your
         | files that were in ASCII.
         | 
         | 2. UTF-16 gives people the mistaken impression that characters
         | are fixed-width instead of variable-width, and this causes
         | things to break horribly on non-BMP data. I've seen amusing
         | examples of this.
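          | 
          | A minimal Python illustration of that trap (U+1F600 is an
          | emoji outside the BMP):
          | 
          |     s = "\U0001F600"
          |     len(s)                           # 1 code point
          |     len(s.encode("utf-16-le")) // 2  # 2 code units
          | 
          | Anything that indexes by UTF-16 code units sees two
          | "characters" there and can cut the surrogate pair in half.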
         | 
         | > Write your programs/whatevers allowing your users to be aware
         | of and configure encodings.
         | 
         | Internally, your program should be using UTF-8 (or UTF-16 if
         | you have to for legacy reasons), and you should convert from
         | non-Unicode charsets as soon as possible. But if you're
         | emitting stuff... you should try hard to make sure that UTF-8
         | is the only output charset you have to support. Letting people
         | select non-UTF-8 charsets for output adds lots of complication
         | (now you have to have error paths for characters that can't be
         | emitted), and you need to have strong justification for why
         | your code needs that complication.
        
           | mark-r wrote:
           | Every program that purports to support Unicode should be
           | tested with a bunch of emoticons.
        
             | coolreader18 wrote:
             | Do you mean emoji? I don't see what the issue would be with
             | [{}:();P\\[\\],.<>/~-_+=XD]
        
               | mark-r wrote:
               | Yes, that's what I meant. I knew I was using the wrong
               | word but couldn't remember the right one.
        
           | camgunz wrote:
           | > 1. All legal ASCII text is UTF-8. That means upgrading
           | ASCII to UTF-8 to support i18n doesn't require you to convert
           | all your files that were in ASCII.
           | 
           | Eh, realistically if you're doing this, you should be
           | validating it like converting from one encoding to another
           | anyway. I get that people won't and haven't, but that's
           | because UTF-8 has this anti-feature where ASCII is compatible
           | with it, and that's led to a lot of problems.
           | 
           | > 2. UTF-16 gives people the mistaken impression that
           | characters are fixed-width instead of variable-width, and
           | this causes things to break horribly on non-BMP data. I've
           | seen amusing examples of this.
           | 
           | This is one of those problems, and it's way worse with UTF-8
           | because it encodes ASCII the same way ASCII does. It's let
           | programmers stay naive about this stuff for... decades?
           | 
           | > Internally, your program should be using UTF-8 (or UTF-16
           | if you have to for legacy reasons), and you should convert
           | from non-Unicode charsets as soon as possible.
           | 
           | There are all kinds of reasons to not use UTF-8. tialaramex
           | pointed out one above. "UTF-8 everywhere" is simply
           | unrealistic, and it forces a lot of applications to be
           | slower, or to take on unnecessary complexity. Maybe it's
           | worth it to "never have to think about encodings again", but
           | that's pretty hard to verify and there's no way it happens in
           | our lifetimes anyway.
           | 
           | > and you need to have strong justification for why your code
           | needs that complication.
           | 
           | Yeah see, I strongly disagree with this. I'll choose whatever
           | encoding I like, thanks. Maybe you don't mean to be super
           | prescriptive here, but I think a little more consideration by
           | UTF-8 advocates wouldn't hurt.
        
             | jcranmer wrote:
             | > I'll choose whatever encoding I like, thanks.
             | 
             | If everyone chooses whatever encoding they like, then the
             | charset being used has to be encoded somewhere. The problem
             | is, there are lots of places where charset isn't encoded
             | (such as your filesystem). That this is a problem can be
             | missed, because almost all charsets are a strict superset
             | of ASCII (UTF-{7,16} are the only such charsets to be found
             | in the top 99.99% of usage), so it's only when you try your
             | first non-ASCII characters that problems emerge.
             | 
             | Unicode has its share of issues, but at this point, Unicode
             | is the standard for dealing with text, and all i18n-aware
             | code is going to be built on Unicode internally. The only
              | safe way to handle text that has even the remotest chance
             | of being i18n-aware is to work with charsets that support
             | all of Unicode, and given its compatibility with ASCII,
             | UTF-8 is the most reasonable one to pick.
             | 
             | If you want to insist on using KOI-8, or ISO-2022-JP, or
             | ISO-8859-1, you're implicitly saying "fuck you" to 2/3 of
             | the world's population since you can't support tasks as
             | basic as "let me write my name" for them.
        
               | camgunz wrote:
               | > If everyone chooses whatever encoding they like, then
               | the charset being used has to be encoded somewhere.
               | 
               | This is gonna be the case for the foreseeable future, as
               | you point out. Settling on one encoding only fixes this
               | like, 100 years from now. I'd prefer to build encoding-
               | aware software that solves this problem now.
               | 
               | > given its compatibility with ASCII, UTF-8 is the most
               | reasonable one to pick
               | 
                | This only makes sense if your system is ASCII in the
               | first place, and if you can't build encoding-aware
               | software. I think we can both agree that's essentially
               | legacy ASCII software, so you don't get to choose
               | anything anyway. And any system that interacts with it
               | should be encoding-aware and still validate the encoding
               | anyway, as though it might be BIG5 or whatever. Assuming
               | ASCII/UTF-8 is a bad idea, always and forever.
               | 
               | > If you want to insist on using KOI-8, or ISO-2022-JP,
               | or ISO-8859-1, you're implicitly saying "fuck you" to 2/3
               | of the world's population since you can't support tasks
               | as basic as "let me write my name" for them.
               | 
               | I'm not obligated to write software for every possible
               | user at every point in time. It's perfectly acceptable
               | for me to say, "I'm writing this program for my 1 friend
               | who speaks Spanish" and have that be my requirements. But
               | if I were to write software that had a hope of being
               | broadly useful, UTF-8 everywhere doesn't get me there.
               | I'd have to build it to be encoding-aware, and let my
               | users configure the encoding(s) it uses.
        
               | jcranmer wrote:
               | > But if I were to write software that had a hope of
               | being broadly useful, UTF-8 everywhere doesn't get me
               | there.
               | 
               | Actually, it does.
               | 
               | Right now, in 2020, if you're writing a new programming
               | language, you can insist that the input files must be
               | valid UTF-8 or it's a compiler error. If you're writing a
               | localization tool, you can insist that the localization
               | files be valid UTF-8 or it's an error. Even if you're
               | writing a compiler for an existing language (e.g., C), it
               | would not be unreasonable to say that the source file
               | must be valid UTF-8 or it's an error--and let those not
               | using UTF-8 right now handle it by converting their
               | source code to use UTF-8. And this has been the case for
               | a decade or so.
               | 
               | That's the point of UTF-8 everywhere: if you don't have
               | legacy concerns [someone actively using a non-ASCII, non-
               | UTF-8 charset that you have to support], force UTF-8 and
               | be done with it. And if you do have legacy concerns, try
               | to push people to using UTF-8 anyways (e.g., default to
               | UTF-8).
        
               | camgunz wrote:
               | I can't insist that other systems send your program
               | UTF-8, or that the users' OS use UTF-8 for filenames and
               | file contents, or that data in databases uses UTF-8, or
               | that the UTF-8 you might get is always valid. The end
               | result of all these things you're raising is "you can't
               | assume, you have to check always, UTF-8 everywhere buys
               | you nothing". Even if we did somehow get there, you'd
               | still have to validate it.
        
         | flohofwoe wrote:
         | I think it's quite obvious that UTF-8 is the better choice over
         | UTF-16 or UTF-32 for exchanging data (if just for the
         | little/big endian mess alone, and that UTF-16 isn't a fixed-
         | length encoding either).
         | 
         | From that perspective, keeping the data in UTF-8 for most of
         | its lifetime also when loaded into a program, and only convert
         | "at the last minute" when talking to underlying operating
         | system APIs makes a lot of sense, except for some very specific
         | application types which do heavy text processing.
        
           | camgunz wrote:
           | I'm gonna do little quotes but, I don't mean to be passive
           | aggressive. It's just that this stuff comes up all the time
           | 
           | > I think it's quite obvious that UTF-8 is the better choice
           | over UTF-16 or UTF-32 for exchanging data (if just for the
           | little/big endian mess alone...
           | 
           | This should be the responsibility of a string library
           | internally, and if you're saving data to disk or sending it
           | over the network, you should be serializing to a specific
           | format. That format can be UTF-8, or it can be whatever,
           | depending on your application's needs.
           | 
           | > and that UTF-16 isn't a fixed-length encoding either)
           | 
           | We should stop assuming any string data is a fixed-length
           | encoding. This is a major disadvantage of UTF-8, because it
           | allows for this conflation.
           | 
           | > keeping the data in UTF-8 for most of its lifetime also
           | when loaded into a program, and only convert "at the last
           | minute" when talking to underlying operating system APIs
           | makes a lot of sense, except for some very specific
           | application types which do heavy text processing.
           | 
           | Well, you're essentially saying "I know about your use case
           | better than you do". It might be important to me to not blow
           | space on UTF-8. But if my platform/libraries have bought into
           | "UTF-8 everywhere" and don't give me knobs to configure the
           | encoding, I have no recourse.
           | 
           | And that's the entire basis for this. It's "having to mess
           | with encodings is worse than the application-specific
           | benefits of being able to choose an encoding". I think
           | that's... at best an impossible claim and at worst pretty
           | arrogant. Again here I don't mean you, but this "UTF-8
           | everywhere" thing.
        
             | jeltz wrote:
             | > We should stop assuming any string data is a fixed-length
             | encoding. This is a major disadvantage of UTF-8, because it
             | allows for this conflation.
             | 
             | So what do you suggest? UTF-16 and UTF-32 encourage this
             | even more.
        
               | camgunz wrote:
               | Yeah, ASCII is such a powerful mental model that I think
               | anyone working with Unicode made a lot of concessions to
               | convert people, no argument there. But I think we need to
               | say we're done with that and move on to phase 2. Here's
               | what I advocate:
               | 
               | - Encodings should be configurable. Programmers get to
               | decide what format their strings are internally, users
               | get to decide what encoding programs use when dealing
               | with filenames or saving data to disk, etc. Defaults
               | matter, and we should employ smarts, but we should never
               | say "I know best" and remove those knobs.
               | 
               | - Engineers need to internalize that "strings" conceal
               | mountains of complexity (because written language is
                | complex), and default to using libraries to manage them.
                | We should start viewing manual string manipulation as an
               | anti-pattern. There isn't an encoding out there that we
               | can all standardize on that makes this untrue, again
               | because written language is complex.
        
             | eMSF wrote:
             | >We should stop assuming any string data is a fixed-length
             | encoding. This is a major disadvantage of UTF-8, because it
             | allows for this conflation.
             | 
             | Mistaking a variable-width encoding for a fixed-width one
             | is _specifically_ a UTF-16 problem. UTF-8 is so obviously
             | not fixed-width that such an error could not happen by a
             | mistake, because even before widespread use of emojis,
             | multibyte sequences were not in any way a corner case for
             | UTF-8 text (for additional reference, compare UTF-16 String
              | APIs in Java/JavaScript/etc. with UTF-8 ones in, say, Rust
             | and Go, and see which ones allow you to easily split a
             | string where you shouldn't be able to, or access "half-
             | chars" as a datatype called "char".)
        
               | camgunz wrote:
               | I mean, I think we're both in the realm of [citation
               | needed] here. I would argue that people index into
               | strings quite a lot--whether that's because we thought
               | UCS-2 would be enough for anybody or UTF-8 == ASCII and
               | "it's probably fine" is academic. The solution is the
               | same though: don't index into strings, don't assume an
               | encoding until you've validated. That makes any
               | "advantage" UTF-8 has disappear.
               | 
               | If you really think no one made this mistake with UTF-8,
               | just read up on Python 3.
        
               | mark-r wrote:
               | The difference is that with UTF-8 you're much more likely
               | to trip over those bugs in random testing. With UTF-16
               | you're likely to pass all your test cases if you didn't
               | think to include a non-BMP character somewhere. Then
               | someone feeds you an emoji character and you blow up.
        
               | camgunz wrote:
               | Which is why you should be using a library for all this,
               | that uses fuzzing and other robustness checks.
        
         | crazygringo wrote:
         | > _not really a good reason to support UTF-8 over UTF-16_
         | 
         | Of course there is, the fact that if you're dealing only with
         | ASCII characters then it's backwards-compatible. Which is a
         | nice convenience in a great number of situations programmers
         | encounter.
         | 
         | The minor details of efficiency of an encoding these days isn't
         | particularly relevant -- sure UTF-16 is better for Chinese, but
         | the average webpage usually _does_ have way more markup, CSS
         | and JavaScript than text, and gzip-ing it on delivery will
         | result in a similar payload totally independent of the encoding
         | you choose.
        
           | camgunz wrote:
           | UTF-8's ASCII compatibility is an anti-feature; it's allowed
           | us to continue to use systems that are encoding naive (in
           | practice ASCII-only). It's no substitute for creating
           | encoding-aware programs, libraries, and systems.
           | 
           | The vast majority of text is not in HTML or XML, and there's
           | no reason you can't use Chinese characters in JavaScript
           | besides (your strings and variable/class/component/file names
           | will surely outpace your use of keywords).
        
             | crazygringo wrote:
             | It's not an anti-feature, it's a benefit that is a huge
             | asset in the real world. For example, you can be on a
             | legacy ASCII system, inspect a modern UTF-8 file, and if
             | it's in a Latin language then it will still be readable as
             | opposed to gibberish. Yes all modern tools should be (and
             | these days generally are) encoding-aware, but in the real
             | world we're stuck with a lot of legacy tools too.
             | 
             | And of course the vast majority of transmitted digital text
             | is in HTML and similar! What do you think it's in instead?
             | 
             | By sheer quantity of digital words consumed by the average
             | person, it's news and social media delivered in browsers
             | (HTML), followed by apps (still using HTML markup to a huge
             | degree) and ebooks (ePub based on HTML). And of course
             | plenty of JSON and XML wrapping too.
             | 
              | And of course you _can_ use Chinese characters in
              | JavaScript/JSON, but development teams are increasingly
             | international and English is the de-facto lingua franca.
        
               | camgunz wrote:
               | That huge asset has become a liability. We always needed
               | to become encoding-aware, but UTF-8's ASCII compatibility
                | has let us delay it for decades, and caused exactly the
                | confusion we're debating right now. So many
               | engineers have been foiled by putting off learning about
               | encodings. Joel Spolsky wrote an article, Atwood wrote an
               | article, Python made a backwards incompatible change,
               | etc. etc. etc.
               | 
               | To be honest, I'm just guessing about what text is stored
               | in--I'll cop to it being very hard to prove. But my guess
               | is the vast majority of text is in old binary formats,
               | executables, log files, firmware, or in databases without
               | markup. That's pretty much all your webpages right there.
               | 
                |  _n.b._ JSON doesn't really fit the markup argument. The
               | whole idea is that HTML is super noisy and the noise is 1
               | byte in UTF-8, and 2 bytes in UTF-16. JSON isn't noisy so
               | the overhead is very low.
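
       A rough illustration (Python sketch) of the byte counts in
       question: ASCII markup costs one byte per character in UTF-8 but
       two in UTF-16, while BMP CJK text costs three versus two.

           markup = '<div class="post"><p>' + "\u6f22\u5b57" + "</p></div>"
           print(len(markup.encode("utf-8")))      # 31 ASCII bytes + 2*3 CJK = 37
           print(len(markup.encode("utf-16-le")))  # 33 characters * 2 bytes = 66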
        
               | crazygringo wrote:
               | I just don't know what you're talking about.
               | 
               | You can't rewrite all existing legacy software to support
               | encodings. You just can't. A backwards-compatible format
               | was a huge catalyst for widely supporting Unicode in the
               | first place. What exactly are we delaying for decades?
               | Engineers everywhere use Unicode today for new software.
               | The battle has been won, moving forwards.
               | 
               | And the vast majority of text isn't in computer code or
               | even books. It's in the seemingly endless stream of
               | content produced by journalists and social media each and
               | every day, _dwarfing_ executables, firmware, etc. And if
                | it supports any kind of formatting (bold/italics etc.)
               | -- which most does -- then it's virtually always stored
               | in HTML or similar (XML). I mean, what are even the
               | alternatives? Neither RTF nor Markdown come even close in
               | terms of adoption.
        
               | camgunz wrote:
               | > You can't rewrite all existing legacy software to
               | support encodings. You just can't. A backwards-compatible
               | format was a huge catalyst for widely supporting Unicode
               | in the first place.
               | 
               | Totally agree.
               | 
               | > What exactly are we delaying for decades?
               | 
               | Learning how encodings work and using that knowledge to
               | write encoding-aware software.
               | 
               | > Engineers everywhere use Unicode today for new
               | software. The battle has been won, moving forwards.
               | 
               | They do, but they're frequently foiled by on-disk
               | encodings, filenames, internal string formats, network
               | data, etc. etc. etc. All this stuff is outlined in TFA.
               | 
               | > And the vast majority of text isn't in computer code or
               | even books. It's in the seemingly endless stream of
               | content produced by journalists and social media each and
               | every day
               | 
               | I concede I'm not likely to convince you here, but like,
               | do you think Twitter is storing markup in their
               | persistence layer? I doubt it. And even if there is some
               | formatting, we're talking about <b> here, not huge
               | amounts of angle brackets.
               | 
               | But think about any car display. That's probably not
               | markup. Think about ATMs. Log files. Bank records. Court
               | records. Label makers. Airport signage. Road signage.
               | University presses.
        
             | jeltz wrote:
              | The reason most programmers use English in their source
              | code has nothing to do with file size (for that there are
              | JS minifiers) or supported encodings. It comes down to two
              | things: English is the most widely used language in the
              | industry, so if you want to cooperate with programmers from
              | other parts of the world English is a good idea; and it
              | frankly looks ugly to mix languages in the same file, so
              | when the standard library is in English your source code
              | will be too.
              | 
              | So since most source code is in English (and, for JS,
              | minified anyway), UTF-8 works perfectly there too.
        
       | nayuki wrote:
       | I love the typesetting on the page. It is content-first, clean,
       | and simple.
       | 
       | It lacks all the usual noise like modal dialogs, headers and
       | footers, social media icons, colorful sidebars, newsletter sign-
       | ups, cookie warnings, etc.
        
       | legulere wrote:
       | > In the UNIX world, narrow strings are considered UTF-8 by
       | default almost everywhere. Because of that, the author of the
       | file copy utility would not need to care about Unicode
       | 
       | It couldn't be further from the truth. Unix paths don't need to
       | be valid UTF-8 and most programs happily pipe the mess through
       | into text that should be valid. (Windows filenames don't have to
       | be proper UTF-16 either)
       | 
       | Rust is one of the few programming languages that correctly
       | doesn't treat file paths as strings.
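
       A quick way to see this (Python sketch, assuming a Linux filesystem
       that accepts arbitrary bytes): POSIX filenames are byte strings, so
       a name that is not valid UTF-8 is perfectly legal.

           import os, tempfile

           d = tempfile.mkdtemp()
           # b"\xe9" is Latin-1 "é"; on its own it is not valid UTF-8.
           open(os.path.join(d.encode(), b"caf\xe9.txt"), "w").close()
           print(os.listdir(d.encode()))  # [b'caf\xe9.txt'] -- raw bytes back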
        
         | marcosdumay wrote:
         | > Unix paths don't need to be valid UTF-8
         | 
         | Yet, your shell will treat them like UTF-8 just as well. As
         | will the standard library of almost every programming language,
         | as you noticed.
         | 
         | If you open one such file in most text editors, they will
         | render whatever is in it as UTF-8. If you use text manipulating
         | utilities, they will work with it as if it was encoded in
         | UTF-8.
         | 
         | It's mostly the Linux kernel that disagrees. Everything else
         | considers them UTF-8.
        
           | arendtio wrote:
           | Doesn't it depend on your locales?
           | 
           | At least for source-based Linux distributions (Gentoo,
           | Exherbo) I remember that you have to define the locales you
           | want to use and which ones should be the default. And when I
           | build a system without UTF-8 locales, I doubt that the shell
           | will treat paths as UTF-8.
        
           | Spivak wrote:
           | Which is a silly position since the kernel is the only thing
           | that matters. You're right that not too many people will
           | complain if your program crashes on non-UTF-8 paths. Same
           | with spaces in group names. 100% valid and accepted. Breaks a
           | ridiculous amount of software if you actually do it.
           | 
           | But that doesn't mean it's right. It just means that we have
           | a calcified convention.
        
             | marcosdumay wrote:
             | > narrow strings are considered UTF-8 by default almost
             | everywhere
             | 
             | It means that this is mostly true.
             | 
              | I dunno what it should be. There are benefits and costs to
              | both allowing and restricting the names, and there are good
              | reasons for the kernel alone to support them even though
              | all the userland doesn't. But it does mean that you can
              | just use UTF-8 and be done with it.
        
         | lisper wrote:
         | > Rust is one of the few programming languages that correctly
         | doesn't treat file paths as strings.
         | 
         | Common Lisp too.
        
           | tester89 wrote:
            | I've never actually understood how pathnames work in CL.
        
             | lisper wrote:
             | That makes two of us. But they aren't strings :-)
             | 
             | (Seriously though, is it pathnames you don't understand or
             | logical hosts? Because CL pathnames are actually pretty
             | straightforward. Logical hosts, on the other hand, are a
             | hot mess.)
        
             | gumby wrote:
             | They are pretty straightforward: they are just path
             | _structures_ rather than path _names_ that may turn into
             | single strings when supplied to your kernel. Or, depending
             | on the OS maybe only part of the name is turned into a
             | string and part determines which device or syntax applies.
             | All of which is abstracted away by the path objects.
             | 
              | Back in the 1970s when this first appeared on lisp
              | machines it was not uncommon to use remote file systems
             | transparently, and those remote file systems could be on
             | quite different OSes like ITS, TOPS10 or -20, VMS, one of
             | the lisp machine file systems and even Unix (though
             | Networking came quite late to Unix). "MC:GUMBY; FOO >" and
             | "OZ:<GUMBY>FOO.TXT;0" were perfectly reasonable filenames.
             | Some of those systems had file versioning built into them.
              | So if the world looks like Unix to you, some of that
              | additional expressive power could be confusing.
             | 
             | C++17 path support is a neutered version of Common Lisp's.
        
         | [deleted]
        
         | ken wrote:
         | > one of the few programming languages that correctly doesn't
         | treat file paths as strings
         | 
         | I hear: one of those few programming languages that, despite
         | its vaunted type-safety, makes it possible to accidentally
         | create a file with a completely bogus name that I won't be able
         | to view or open correctly with half the programs on my
         | computer.
         | 
         | Languages which allow arbitrary byte sequences in paths are the
         | cause of, and solution to, all of Unix's pathname problems.
        
           | lilyball wrote:
           | So what you're saying is the language should not be able to
           | work with pre-existing files whose names are not valid UTF-8?
        
           | orf wrote:
           | No, it's impossible to do that accidentally. Due to its type
           | safety. You have to be pretty explicit about passing a non-
           | string in (all rust strings are valid utf8).
        
         | jcranmer wrote:
         | > It couldn't be further from the truth. Unix paths don't need
         | to be valid UTF-8 and most programs happily pipe the mess
         | through into text that should be valid. (Windows filenames
         | don't have to be proper UTF-16 either)
         | 
         | A decent fraction of software can impose rules on the portion
         | of the filesystem within their control. A tool like mv or vim
          | has to be prepared to handle any filepath encoding. But
          | something like a VCS could reasonably insist on only supporting
          | filetrees with normalized UTF-8 encoding and no case-
          | insensitivity conflicts, since those are the only things that
          | reliably work cross-platform.
        
           | Thrymr wrote:
           | Sure, as long as you don't have to be compatible with
           | anything else, you can assume whatever encoding you want.
           | That doesn't change the point that general programs can't
           | make that assumption.
        
           | acdha wrote:
           | The history of Git and Subversion handling filenames makes me
           | think that the opposite is true: A VCS which doesn't handle
           | arbitrary byte-strings will have weird edge cases which
           | prevent users from adding files or accessing them, possibly
           | even "losing" data in a local checkout. This is especially
           | tedious because it'll appear to work for a while until
           | someone first tries to commit an unusual file or checks it
           | out with a previously-unused client.
        
             | roblabla wrote:
             | My understanding is, you can't treat the filename as an
             | arbitrary bytestring, since you have to transcode it across
             | platforms, otherwise the filename won't show up properly
             | everywhere. E.G. if I make a file named "test" on unix, it
             | will be UTF-8 (assuming sane unix). If on windows I create
             | a file with the filename "test", encoded as UTF-8, it will
             | show up as worthless garbage in explorer.exe since it will
             | decode it to UTF-16.
             | 
             | So VCS needs to know the filename encoding in order to work
             | properly.
        
         | oconnor663 wrote:
         | > It couldn't be further from the truth. Unix paths don't need
         | to be valid UTF-8
         | 
         | Yes _but_ , most programs expect to be able to print filepaths
         | at least under some circumstances, like printing error
         | messages. Even if a program is fully correct and doesn't assume
         | an encoding in normal operation, it still has to assume one for
          | printing. Filepaths that aren't utf-8 lead to a bunch of
          | replacement characters (�) in your output (at best). So I think
          | it's fair to say that Unix
         | paths are assumed to be utf-8 by almost all programs, even if
         | being invalid utf-8 doesn't actually cause a correct program to
         | crash.
        
           | Spivak wrote:
           | I mean it doesn't have to assume an encoding for printing, it
           | just has to have a sane way of turning the path into
           | something human readable.
           | 
           | Look you're right that this ship has sailed but ideally we
           | would have decided on a way to display and encode binary for
           | file paths.
        
             | oconnor663 wrote:
             | I dunno. That sounds like proposing to render "foo.txt" as
             | "Zm9vLnR4dA==" or "[102, 111, 111, 46, 116, 120, 116]" or
             | something. I think you probably meant something like "print
             | the regular characters if the string is UTF-8, or a
             | lossless fallback representation of the bytes otherwise."
             | That's a good idea, and I think a lot of programs do that,
             | but at the same time "if the string is UTF-8" is
             | problematic. There's no reliable way for us to know what
             | strings are or are not intended to be decoded as UTF-8,
             | because non-UTF-8 encodings can coincidentally produce
              | valid UTF-8 bytes. For example, the two characters "&!" are
              | the same bytes in UTF-8 as the single character "Ω" (U+2126,
              | OHM SIGN) is in little-endian UTF-16. This works in Python:
              | 
              |     assert "&!".encode("UTF-8").decode("UTF-16le") == "\u2126"
             | 
             | So I think I want to claim something a bit stronger:
             | 
              | 1) Users demand, quite rightly, to be able to read paths as
              | text.
              | 
              | 2) There is no reliable way to determine the encoding of a
              | string just by looking at its bytes. And Unix doesn't
              | provide any other metadata.
              | 
              | 3) Therefore, useful Unix programs _must_ assume that any
              | path that could be UTF-8, is UTF-8, for the purpose of
              | displaying it to the user.
             | 
             | Maybe in an alternate reality, the system locale could've
             | been the reliable source of truth for string encodings? But
             | of course if we were starting from scratch today, we'd just
             | mandate UTF-8 and be done with it :)
        
           | eska wrote:
            | In the Rust std one can easily use the lossless representation
            | with file APIs, and print a lossy version in error messages.
           | I find this to be good enough.
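
       For comparison, an illustrative sketch of the same split in Python:
       the surrogateescape error handler keeps the raw bytes recoverable,
       while errors="replace" gives a lossy version suitable for display.

           name = b"caf\xe9.txt"                   # Latin-1 "é": not valid UTF-8
           lossless = name.decode("utf-8", "surrogateescape")
           assert lossless.encode("utf-8", "surrogateescape") == name
           print(name.decode("utf-8", "replace"))  # caf�.txt -- lossy, for display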
        
         | gumby wrote:
         | > Unix paths don't need to be valid UTF-8
         | 
         | And a lucky thing too; OSes that _do_ have UTF-8 filesystems
         | don't always agree on how to apply canonicalization, much less
         | how to deal with canonicalization differences between user
         | entered data and normalized filesystem names.
        
         | DannyB2 wrote:
         | > Rust is one of the few programming languages that correctly
         | doesn't treat file paths as strings.
         | 
         | Imagine if languages allowed subtypes of strings which are not
         | directly assignment compatible.
         | 
         | HtmlString
         | 
         | SqlString
         | 
         | String
         | 
         | A String could be converted to HtmlString not by assignment,
         | but through a function call, which escapes characters that the
         | browser would recognize as markup.
         | 
         | Similarly a String would be converted to a SqlString via a
         | function.
         | 
         | It would be difficult to accidentally mix up strings because
         | they would be assignment incompatible without the functions
         | that translate them.
         | 
         | There could be mixed "languages" within a string. Like a JSP or
         | PHP that might contain scripting snippets, and also JavaScript
         | and CSS snippets, each with different syntax rules and escaping
         | conventions.
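
       A minimal sketch of that idea in Python (the names are illustrative,
       not from any library): one wrapper type plus one blessed conversion
       function, so a static checker such as mypy can flag places where a
       raw string is used as already-escaped HTML.

           import html

           class HtmlString(str):
               """A string that has already been escaped for HTML output."""

           def to_html(s: str) -> HtmlString:
               # The one sanctioned way to turn a plain str into an HtmlString.
               return HtmlString(html.escape(s))

           def render(fragment: HtmlString) -> None:
               print(fragment)

           render(to_html("<script>alert('hi')</script>"))  # &lt;script&gt;...
           # render("<b>raw</b>")  # a checker such as mypy flags this misuse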
        
           | jdc wrote:
           | Cf. _newtype_ in Python and Haskell.
        
           | torstenvl wrote:
           | Failing that, you could also adopt a naming convention with
           | prefixes to indicate what sort of thing it is you're storing
           | there:
           | 
           | hsCode = hsFromUs(usInputBuffer);
           | 
           | ssStoredCode = ssFromHs(hsCode);
           | 
           | https://www.joelonsoftware.com/2005/05/11/making-wrong-
           | code-...
        
             | DannyB2 wrote:
             | Yes. But having the compiler enforce it is your first line
             | of defense. If it doesn't compile, you know there is an
             | actual problem. In modern IDEs, you see these compile
             | errors as quickly as you type them.
        
           | bruckie wrote:
           | Some security-sensitive libraries do this, e.g. https://www.j
           | avadoc.io/doc/com.google.common.html.types/type...
        
           | mhh__ wrote:
            | Allowed you to? You could do that in C++ quite happily, it's
            | just not useful enough to bother implementing, at least.
        
             | eska wrote:
             | It's absolutely useful enough, it's just that it's awful in
             | C++ due to language limitations as opposed to other
             | languages such as Haskell, where it is standard.
        
             | akiselev wrote:
             | They're not worth the effort in C++ because it doesn't have
              | strictly enforced affine/dependent types. The GP is
              | envisioning a language that does.
        
               | ori_b wrote:
               | Why do you need them to enforce that only escaped strings
               | are passed to functions?
               | html::append(html::string text);
               | 
                | with a constructor
               | html::string(std::string)
               | 
               | that handled escaping seems like it'd work just fine.
        
           | mika9090 wrote:
           | Try Pascal (free pascal or Delphi)
        
             | DannyB2 wrote:
              | I used Pascal in the 80's and part of the 90's. Currently
             | use Java. I almost tried Delphi, but my shop moved on to
             | something else between Pascal and Java.
        
             | robocat wrote:
              | AFAIK they just provide type name aliases, which do not
              | enforce anything or warn you if you mix the "types".
        
           | gnarbarian wrote:
           | You would probably like Java 1.4
        
             | masklinn wrote:
             | This pattern ( _newtyping_ ) is a _huge_ weakness of Java
             | in general, and even more so older Java, and people who
             | like newtyping are not going to like java.
             | 
             | Because creating newtypes in Java is
             | 
             | 1. verbose, defining a trivial wrapper takes half a dozen
             | lines before you've even done anything
             | 
             | 2. slow, because you're paying for the overhead of an extra
             | allocation and pointer indirection every time, unless you
             | jump through unreadable hoops making for even more verbose
             | newtypes[0]
             | 
             | It is a much more convenient (and thus frequent) pattern in
             | languages like Haskell. Or Rust.
             | 
             | [0] https://gist.github.com/jbgi/d6b677d084fafc641fe01f7ffd
             | 00591...
        
             | DannyB2 wrote:
             | I use Java 14 now. Java 11 in production.
        
         | sitzkrieg wrote:
          | git will also do this, so on a fs that allows arbitrary byte-
          | named files, you end up with tree objects of the same name,
          | which makes digging them out later "fun"
        
           | benibela wrote:
           | I have a repository full of such files:
           | https://github.com/benibela/nasty-files
           | 
           | You can clone the repository, and then you cannot delete it
           | with tools that expect utf-8 names (like KDE's Dolphin)
        
         | cryptonector wrote:
         | Yes, but the only way to interop multiple scripts on a POSIX
         | filesystem is to use UTF-8. I can forgive people for not
         | realizing that filenames in POSIX are a weird animal: they are
         | NUL-terminated strings of characters (char) in some arbitrary
         | codeset and encoding, but US-ASCII '/' is special.
         | 
         | EDIT: Also, "considered UTF-8 by default almost everywhere"
         | is... not necessarily wrong -- nowadays users should be using
         | UTF-8 locales by default. Maybe "almost everywhere" is an
         | exaggeration, but I wouldn't really know.
        
         | masklinn wrote:
         | > Unix paths don't need to be valid
         | 
         | unless they do.
         | 
         | OSX will most likely barf at or mangle invalid file names (HFS+
         | requires well-formed UTF-16, which translates to well-formed
         | UTF-8 at the POSIX layer), and there are ZFS systems which are
         | configured with utf8only set.
         | 
          | It would be more precise to say that you _can't assume_ UNIX
         | paths are anything other than garbage.
        
         | ngrnjp wrote:
         | That's a fundamental flaw of UNIX.
        
           | msla wrote:
           | It's a reflection of the fact people aren't going to throw
           | out existing filesystems because they aren't in a specific
           | character encoding. There's nothing the OS can do about that,
           | there's nothing programmers in general can do about that, and
           | the only way to fix it is with a time machine and enough
           | persuasion to force everyone to implement Unicode and UTF-8
           | to the exclusion of any other character encoding schemes.
        
           | downerending wrote:
           | As flaws go, it's pretty awesome. Wish we had more such.
        
       | kyberias wrote:
       | Well, the font on that article is too small and otherwise ugly.
        
       | jfkebwjsbx wrote:
       | Even Microsoft is finally giving up UTF-16!
       | 
        | They now recommend using the UTF-8 "code page" in new code.
        
         | nathanaldensr wrote:
         | Do you have a source for this? AFAIK the .NET Framework CLR and
         | CoreCLR both still store strings internally as UTF-16.
        
           | mormegil wrote:
           | AFAICT, it's not only "internal representation". .NET strings
           | are defined as a sequence of UTF-16 units, including the
           | definition of the Char type representing a single UTF-16 code
           | unit. I can't imagine how such a change could be implemented
           | (other than changing the internal representation but
           | converting on all accesses which would be nonsense, I think).
        
             | leosarev wrote:
             | Current plan is:
             | https://github.com/dotnet/corefxlab/issues/2350
        
           | ChrisSD wrote:
           | The closest I could find to a recommendation for UTF-8 is in
           | UWP design guidelines: https://docs.microsoft.com/en-
           | us/windows/uwp/design/globaliz...
           | 
           | However it's not quite unequivocal. Windows still uses UTF-16
           | in the kernel (or actually an array of 16bit integers, but
           | UTF-16 is a very strong convention). The code page will often
           | allow the Win32 API to perform the conversion back and forth
           | instead of your application doing it.
        
           | leosarev wrote:
            | CoreCLR is actively discussing introducing a Utf8String type.
           | https://github.com/dotnet/corefxlab/issues/2350
        
         | gpvos wrote:
         | Have they fixed all the bugs with that pseudocodepage?
        
           | xeeeeeeeeeeenu wrote:
           | Bugs like WriteFile() reporting the wrong number of bytes
           | written with 65001 codepage were fixed years ago.
        
             | buckminster wrote:
             | That's good news. Last time I looked, more than a decade
             | ago admittedly, that bug was WONTFIX.
             | 
             | In fact I was so surprised I just wrote a test program.
             | They have fixed it!
             | 
             | It was the dumbest bug I ever saw in Windows. It was
             | special case code in the console output code path of the
             | user mode part of WriteFile. It only existed to make utf8
             | work, and it didn't even do that.
        
             | gpvos wrote:
             | Ah, that's surprising, Microsoft was very stubbornly _not_
             | doing that for at least a decade and a half.
             | 
             | In fact, the FAQ in TFA (questions 9 and 20) mentions that
             | there are still problems with CP_UTF8 (65001). Is the
             | article out of date? Can someone respond to those
             | statements?
        
         | snazz wrote:
         | Is java.lang.String still UTF-16? Is there any plan to fix
         | that? Once Windows and Java take care of it, I can't think of
         | any other major UTF-16 uses left. Are there any that I've
         | forgotten about?
         | 
         | Edit: Still looks like UTF-16, according to the Oracle
         | documentation page:
         | https://docs.oracle.com/en/java/javase/14/docs/api/java.base...
         | Edit 2: JavaScript too. See my reply to someone else below.
        
           | lokedhs wrote:
           | I think it will be hard to change that. But it's not alone.
           | Javascript also uses UTF-16.
        
             | snazz wrote:
             | You're right! I'm surprised I didn't know that. It looks
             | like it can also be UCS-2, going by the spec:
             | 
             | > A conforming implementation of this International
             | standard shall interpret characters in conformance with the
             | Unicode Standard, Version 3.0 or later and ISO/IEC 10646-1
             | with either UCS-2 or UTF-16 as the adopted encoding form,
             | implementation level 3. If the adopted ISO/IEC 10646-1
             | subset is not otherwise specified, it is presumed to be the
             | BMP subset, collection 300. If the adopted encoding form is
             | not otherwise specified, it is presumed to be the UTF-16
             | encoding form.
        
               | im3w1l wrote:
                | UCS-2 is an old version of UTF-16 that lacks support for
               | surrogate pairs, which means that rare symbols and emoji
               | don't work.
        
           | josefx wrote:
           | I don't think they can fix that without completely breaking
           | backwards compatibility. The basic char type in Java is
           | defined as a 16 bit wide unsigned integer value and String
           | doesn't abstract over that.
        
           | masklinn wrote:
           | > Is java.lang.String still UTF-16?
           | 
           | Yes.
           | 
           | > Is there any plan to fix that?
           | 
           | That's not really possible as strings are defined in terms of
           | char and guarantee O(1) access to UTF16 code units. They
           | might try to switch to "indexed UTF8" (as pypy did in the
           | Python ecosystem whereas "CPython proper" refused to switch
           | to UTF8 with the Python 3 upheaval and went with the death
           | trap that is PEP 393 instead).
        
           | projektfu wrote:
           | I don't think it's a big deal for Java because it's always
           | easy to transfer in from and out to UTF-8. Very few Java
           | programs use UTF-16 as a persistence format, and Java-native
           | applications can directly marshal strings around as they are
           | a first-class datatype.
        
           | rimunroe wrote:
           | JavaScript:
           | 
           | https://www.ecma-international.org/ecma-262/5.1/#sec-2
           | 
           | > A conforming implementation of this Standard shall
           | interpret characters in conformance with the Unicode
           | Standard, Version 3.0 or later and ISO/IEC 10646-1 with
           | either UCS-2 or UTF-16 as the adopted encoding form,
           | implementation level 3. If the adopted ISO/IEC 10646-1 subset
           | is not otherwise specified, it is presumed to be the BMP
           | subset, collection 300. If the adopted encoding form is not
            | otherwise specified, it is presumed to be the UTF-16 encoding
           | form.
           | 
           | https://www.ecma-international.org/ecma-262/5.1/#sec-4.3.16
           | 
           | > A String value is a member of the String type. Each integer
           | value in the sequence usually represents a single 16-bit unit
           | of UTF-16 text. However, ECMAScript does not place any
           | restrictions or requirements on the values except that they
           | must be 16-bit unsigned integers.
        
           | diroussel wrote:
           | Compact Strings were added in Java 9;
           | https://openjdk.java.net/jeps/254
           | 
           | So they can now be stored as one byte per character.
        
             | kllrnohj wrote:
             | Only for ASCII text. There is still no UTF-8 support (it's
             | even called out as a non-goal in the JEP: "It is not a goal
             | to use alternate encodings such as UTF-8 in the internal
             | representation of strings.")
        
         | jdsully wrote:
         | Are you sure? That will result in a conversion every time a
         | string is passed to the kernel.
         | 
         | Windows can handle utf-8 but it is not the native character set
         | for the platform.
        
           | JdeBP wrote:
           | There's a conversion in every ...A() function. Conversion
           | between UTF-8 and WTF-16 is just more of the same, but
           | without codepage lookup tables. (-:
        
             | mark-r wrote:
             | They probably still do a codepage lookup just for
             | consistency.
        
             | Shebanator wrote:
             | WTF-16? I like it...
        
               | ekimekim wrote:
               | WTF-8 and WTF-16 are a thing:
               | https://simonsapin.github.io/wtf-8/
               | 
               | Basically WTF-16 is any sequence of 16-bit integers, and
               | is thus a superset of UTF-16 (because UTF-16 doesn't
               | allow certain combinations of integers, mainly surrogate
               | code points that exist outside of surrogate pairs).
               | 
               | Then WTF-8 is what you get if you naively transform
               | invalid UTF-16 into UTF-8. It is a superset of UTF-8.
               | 
               | This is very useful when dealing with applications like
               | Java and Javascript that treat strings as sequences of
               | 16-bit code points, even though not all such strings are
               | valid UTF-16.
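
       A small illustration in Python: "\ud800" is a lone high surrogate,
       which strict UTF-8 refuses to encode, while the surrogatepass
       handler produces exactly the WTF-8-style byte form.

           s = "\ud800"                   # unpaired high surrogate
           try:
               s.encode("utf-8")
           except UnicodeEncodeError as e:
               print("strict UTF-8 refuses it:", e.reason)
           print(s.encode("utf-8", "surrogatepass"))  # b'\xed\xa0\x80' (WTF-8 form)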
        
               | masklinn wrote:
               | > Basically WTF-16 is any sequence of 16-bit integers,
               | and is thus a superset of UTF-16 (because UTF-16 doesn't
               | allow certain combinations of integers, mainly surrogate
               | code points that exist outside of surrogate pairs).
               | 
                | If WTF-16 is the ability _in potentia_ to store and
                | return invalid UTF-16 without signalling errors, I don't
                | know that there's any actual UTF-16 system out there, with
                | the possible exception of... HFS+ maybe?
        
       | xg15 wrote:
       | > _When writing a UTF-8 string to a file, it is the length in
       | bytes which is important. Counting any other type of 'characters'
       | is, on the other hand, not very helpful._
       | 
       | So, suppose I have a UTF-8 string of n code units (bytes) length.
       | Unfortunately my data structure only permits strings of length m
       | < n bytes.
       | 
       | How do I correctly truncate the string so it doesn't become
       | invalid UTF-8 and won't show any unexpected gibberish when
       | rendered? (E.g., the truncated string doesn't suddenly contain
       | any glyphs or grapheme clusters that weren't in the original
       | string)
        
         | toast0 wrote:
         | > How do I correctly truncate the string?
         | 
         | Refuse to accept a string that is overlong, and require an
         | interactive user (hopefully one literate in the language) to
         | truncate it for you. In a non-interactive context, you can't.
        
         | Tyr42 wrote:
         | https://play.rust-lang.org/?version=stable&mode=debug&editio...
         | 
         | Something like this? Check if each character pushes the byte
         | total over the limit?
         | 
         | I think this might fail for combining characters though.
        
         | samatman wrote:
         | Avoiding invalid UTF-8 is easy, almost trivial: just make sure
         | you don't truncate in the middle of a code point.
         | 
         | The latter is fiendishly difficult to get right in all cases,
         | the ugliest case being emoji flags. Being all-or-nothing on
         | both sides of a ZWJ will get you most of the way there,
         | however.
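
       A minimal sketch of the easy half (Python, illustrative only): cut
       at the byte limit, then back up over any UTF-8 continuation bytes so
       the result stays valid. It does nothing about grapheme clusters,
       which is the hard part described above.

           def truncate_utf8(data: bytes, limit: int) -> bytes:
               # Back up past continuation bytes (0b10xxxxxx) so a
               # multi-byte sequence is never cut in half.
               if len(data) <= limit:
                   return data
               end = limit
               while end > 0 and (data[end] & 0xC0) == 0x80:
                   end -= 1
               return data[:end]

           s = "na\u00efve".encode("utf-8")            # b'na\xc3\xafve'
           print(truncate_utf8(s, 3).decode("utf-8"))  # 'na' -- never splits 'ï'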
        
           | smasher164 wrote:
           | It's not though. Replacing invalid byte sequences is not
           | terribly difficult.
           | 
           | https://golang.org/src/strings/strings.go?s=15854:15900#L627.
        
       | heyplanet wrote:
       | I think UTF-8 was a mistake.
       | 
       | It is a pain in the ass to have a variable number of bytes per
       | char.
       | 
       | In Ascii, you could easily know every character personally. No
       | strange surprises.
       | 
       | Also no surprises while reading black on white text and suddenly
       | being confronted with clors [1].
       | 
       | [1] Also no surprises when writing a comment on HN like this one
       | and having some characters stripped. I put in a smiley as the
        | first "o" in colors, but it was stripped out. Looks like the
       | makers of HN don't like UTF-8 either.
        
         | goatinaboat wrote:
         | Certain things such as DNS, email addresses and so on should be
         | restricted to ASCII, it's a security nightmare otherwise.
        
           | bartwe wrote:
            | I assume you mean a limited subset of 7-bit ASCII? 33-126
        
             | JdeBP wrote:
              |     % host -t a $'\015'.
              |     1 \015:
              |     19 bytes, 1+0+0+0 records, response, authoritative, nxdomain
              |     query: 1 \015
              |     %
             | 
             | It's not as straightforward or sensible as you think. It's
             | case insensitive; it's case preserving; and C0 control
             | characters, SPC, and DEL are allowed. The case
             | differentiating bits for letters are nowadays sometimes
             | used in an attempt to foil attackers. If you want things to
             | look back on and say "I think that X was a mistake." then
             | forget UTF of any stripe. The DNS is full of them.
        
               | zokier wrote:
               | I thought DNS allowed any arbitrary byte sequence as
               | label (up to max length limit)
        
         | DagAgren wrote:
         | You can't even write proper English in ASCII. ASCII is an
         | absolute dead end. It's history.
         | 
         | Actually representing human language is HARD. It is also
         | absolutely necessary. Whatever solution you choose is going to
         | be complicated, because it is solving a very complicated
         | problem.
         | 
         | Throwing your hands up and going "oh this is too hard, I don't
         | like it" will get you nowhere.
        
           | kazinator wrote:
           | You can't write proper _snooty_ English in ASCII, with
           | diaereses and whatnot.
        
             | DagAgren wrote:
                | ASCII doesn't have all the punctuation regularly used
             | in English.
        
               | kazinator wrote:
               | ASCII doesn't have a direct representation of all the
               | punctuation used in English _print_ , like 66 99 quotes,
               | and different kinds of dashes (distinct from minus). For
               | non-print, it's entirely fine.
               | 
               | Typesetting should be handled by a markup language
               | anyway. Adding a few characters to Notepad doesn't create
               | a typesetting system. A typesetting system needs to be
               | able to do kerning, ligatures, justification. Not to
               | mention bold, italics, and different fonts.
        
             | kps wrote:
             | 1967 ASCII anticipated that, with dual-use character shapes
              | so you could type o BS " - ö
             | 
             | But then people invented video terminals that didn't
             | overstrike.
        
         | magicalhippo wrote:
         | > It is a pain in the ass to have a variable number of bytes
         | per char.
         | 
         | In the same vein it's a pain in the ass to write everything in
         | assembler. Which is why we don't do that, we use high-level
         | languages instead.
        
         | thechao wrote:
         | You're conflating code points and _some_ encoding; more
          | importantly, you're mistaking "array of encoded objects
          | (bytes)" for "a string of text". They're not -- and never have
         | been -- the same.
        
         | kllrnohj wrote:
         | > It is a pain in the ass to have a variable number of bytes
         | per char.
         | 
         | This is from API & language mistakes more than an issue with
         | UTF-8 itself.
         | 
         | If you actually design your API & system around being UTF-8,
         | like Rust did, then there's really no issue for the programmer.
         | The API enforces the rules, and still gives you things like a
         | simple character iterator (with characters being 32-bit, so
         | that it actually fits: https://doc.rust-
         | lang.org/std/char/index.html). The String class handles all the
         | multi-byte stuff for you, you never "see" it: https://doc.rust-
         | lang.org/std/string/struct.String.html
         | 
         | Retrofitting this into existing languages isn't going to be
         | _easy_ , but that's not an excuse to not do it at all, either.
        
         | jandrese wrote:
         | > It is a pain in the ass to have a variable number of bytes
         | per char.
         | 
         | Maybe, but nobody can stomach the wasted space you get with
         | UTF-32 in almost every situation. The encoding time tradeoff
         | was considered less objectionable than making most of your text
         | twice or four times larger.
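
       The size difference is easy to see (Python, illustrative only): for
       ASCII-dominated text, UTF-32 is four times the size of UTF-8.

           text = "Hello, world!"
           print(len(text.encode("utf-8")))      # 13 bytes
           print(len(text.encode("utf-32-le")))  # 52 bytes: 4 per code point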
        
           | FabHK wrote:
           | And as the article points out, even then you might have more
           | than one code point for a character.
           | 
           | > For example, the only way to represent the abstract
           | character iu _cyrillic small letter yu with acute_ is by the
           | sequence U+044E _cyrillic small letter yu_ followed by U+0301
           | _combining acute accent._
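
       A short check of that in Python (illustrative only): the sequence
       really is two code points, and NFC normalization cannot collapse it
       because no precomposed form exists.

           import unicodedata

           s = "\u044e\u0301"   # cyrillic small letter yu + combining acute
           print(len(s))                                # 2 code points
           print(unicodedata.normalize("NFC", s) == s)  # True: no precomposed form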
        
       | jodrellblank wrote:
       | > _Q: What do you think about Byte Order Marks? A: According to
       | the Unicode Standard (v6.2, p.30): "Use of a BOM is neither
       | required nor recommended for UTF-8". [...] Using BOMs would
       | require all existing code to be aware of them, even in simple
       | scenarios as file concatenation. This is unacceptable._
       | 
       | Then your site "UTF-8 everywhere" is misnamed, because standards-
       | following UTF-8 can have a BOM. It's not required or recommended,
       | but it is possible and allowable, so you might see them and if
       | you follow the standard you have to deal with them. It's not a
       | matter of "this would require all existing code to handle them" -
       | that is not hypothetical, that is the current world, to be
       | standards-compliant all existing code _does already_ need to be
       | aware of them. It isn 't, which means it's broken. Declaring it
       | "unacceptable" is meaningless, except to say you're rejecting the
       | standard and doing something incompatible and broken because it's
       | easier.
       | 
       | Which is a position one can take and defend, but it's not a good
       | position for a site claiming to be pushing for people to follow
       | the standard. What it is, is yet another non-standard ad-hoc
       | variant defined by what some subset of tools the authors use
       | can/can't handle in April 2020.
       | 
       | > " _the UTF-8 BOM exists only to manifest that this is a UTF-8
       | stream_ "
       | 
       | Throwing the word "only" in there doesn't make it go away. It
       | exists as a standards-compliant way to distinguish UTF-8 from
       | ASCII, not recommended but not forbidden.
       | 
       | > " _A: Are you serious about not supporting all of Unicode in
       | your software design? And, if you are going to support it anyway,
       | how does the fact that non-BMP characters are rare practically
       | change anything_ "
       | 
       | Well in the same way, how does the fact that UTF8+BOM is rare
       | practically change anything? At some level you're either pushing
       | for everyone to follow standards even if it's inconvenient
       | because that makes life better for everyone overall, like you are
       | with surrogate pairs and indexing, or you're creating another ad-
       | hoc incompatible variation of UTF-8 which you prefer to the
       | standard and trying to strong-arm everyone else into using it
       | with threats of being incompatible with all the code which
       | already does it wrong.
       | 
       | Being wary of Chesterton's Fence, presumably there's some company
       | or system which got UTF-8+BOM added to the standard because they
       | wanted it, or needed it.
        
         | jodrellblank wrote:
         | Downvoting doesn't make the BOM stop being part of the standard
         | either, btw.
         | 
         | Yes, supporting BOM on arbitrary UTF-8 streams is varying
         | between difficult and impossible, but then get it removed from
         | the standard, or state that you don't support the standard.
         | Don't pretend you support the standard while ignoring the bits
         | you don't like, that's dishonest and unhelpful.
        
         | alkonaut wrote:
         | 100% agree.
         | 
         | > using BOMs would require all existing code to be aware of
         | them, even in simple scenarios as file concatenation
         | 
          | Absolutely! Any app that writes UTF files can (and probably
          | should) avoid writing BOMs. But any program that reads UTF
         | files _must_ handle a BOM. A lot of apps write UTF-8 including
         | the BOM by default, for example Visual Studio.
         | 
         | You can NOT concatenate two UTF-8 streams and expect that the
         | resulting stream is also a valid UTF-8 stream. NO tool should
         | assume that, ever.
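
       A small illustration of the concatenation problem (Python sketch):
       the joined bytes still decode as UTF-8, but the second BOM survives
       as a zero-width U+FEFF in the middle of the text.

           bom = "\ufeff"
           a = (bom + "first\n").encode("utf-8")   # files written with a UTF-8
           b = (bom + "second\n").encode("utf-8")  # BOM, as some editors do
           joined = (a + b).decode("utf-8")
           print(repr(joined))  # '\ufefffirst\n\ufeffsecond\n'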
        
       | malkia wrote:
       | Still doesn't solve the fact that filesystems across different
       | OS's allow invalid UTF8 sequences in the filenames.
       | 
       | Maybe 99% of apps do not care, but even a simple "cp" tool should
        | care. Filenames (and maybe other named resources) should be
       | treated completely differently, and not blindly assumed that they
       | are utf8 compatible.
        
         | ken wrote:
         | To me, that's a design flaw. Would we really be any worse off
         | if we simply declared filenames must be UTF-8?
         | 
         | That seems to be the only case where a user-visible and user-
         | editable field is allowed to be an arbitrary byte sequence, and
         | its primary purpose seems to be allowing this argument to pop
         | up on HN every month.
         | 
         | I've never seen any non-malicious use of it. All popular
         | filesystems already disallow specific sets of ASCII characters
         | in names. Any database which needs to save data in files by
         | number has no problem using safe hex filenames.
        
           | ChrisSD wrote:
           | Sure we could declare that but then what? Non-unicode
           | filenames won't suddenly disappear. Operating systems won't
           | suddenly enforce unicode. Filesystems will still allow non-
           | unicode names.
           | 
           | Simply declaring it doesn't help anybody. In the meantime
           | your application still needs to handle non-unicode filenames
           | otherwise those malicious ones are free to be malicious.
        
             | AnIdiotOnTheNet wrote:
              | If unicode had a set of "explicitly this byte" codepoints,
             | it should be simple to deal with, just pass the invalid
             | bytes of the filename in that way.
        
             | PeterisP wrote:
             | I'd assume that the proper place for defining what's a
             | valid filename would be on the filesystem level, so a
             | filesystem of standard ABC v123 would not allow non-unicode
             | names; so non-unicode filenames would either get refused or
             | modified upon copying/writing them to the filesystem.
             | 
             | This is not new, this would match the current behavior of
             | the OS/filesystem enforcing other character restrictions
             | such as when writing (for example) a file name with an
             | asterisk or colon to a FAT32 USB flash drive.
        
             | mark-r wrote:
             | Once you lose the expectation of being able to work with
             | non-unicode filenames, those files will quickly get renamed
             | and cease to be a problem.
        
               | ChrisSD wrote:
               | How can you rename them if you can only use unicode
               | paths?
        
               | mark-r wrote:
               | You would need to use some special utility created just
               | for that purpose.
        
             | bjourne wrote:
             | As long as the tool for _renaming_ files handles non-utf8
             | filenames you 'd be fine.
        
         | qiqitori wrote:
         | Are you saying that operating systems (i.e. the kernel) should
         | check and enforce encodings in filenames?
         | 
         | 1) Why?
         | 
         | 2) Bye bye backward compatibility and interoperability
        
           | ghettoimp wrote:
           | Backward compatibility is a laudable goal and is not to be
           | broken lightly. But sometimes, things are so fundamentally
           | broken that we would be far better off with a clean break.
           | 
           | Interoperability is quite possibly a good argument _for_
           | coming up with some reasonable restrictions on filenames.
           | Today you could easily (case sensitive names, special
           | characters, etc.) create a ZIP file or similar that cannot be
           | successfully extracted on this platform or that.
           | 
           | In an excellent article, David A. Wheeler [1] lays out a
           | compelling case against the status quo. TL;DR: bad filenames
           | are too hard to handle correctly. Programs, standards, and
           | operating systems already assume there are no bad filenames.
           | Your programs will fail in numerous ways when they encounter
           | bad filenames. Some of these failures are security problems.
           | 
           | He concludes: "In sum: It'd be far better if filenames were
           | more limited so that they would be safer and easier to use.
           | This would eliminate a whole class of errors and
           | vulnerabilities in programs that "look correct" but subtly
           | fail when unusual filenames are created (possibly by
           | attackers)." He goes on to consider many ideas towards
           | getting to this goal.
           | 
           | [1] https://dwheeler.com/essays/fixing-unix-linux-
           | filenames.html
        
           | wtetzner wrote:
           | It sounds like they're saying the opposite. All programs
           | dealing with filenames need to be able to support an
           | arbitrary stream of bytes, they can't just assume UTF-8.
        
           | masklinn wrote:
           | > 2) Bye bye backward compatibility and interoperability
           | 
           | It's already not really a thing.
           | 
           | Traditional unices allow arbitrary bytes with the exception
           | of 00 and 2f, NTFS allows arbitrary _utf-16 code units_
           | (including unpaired surrogates) with the exception of 0000
           | and 002f, and I think HFS+ requires valid UTF-16 and allows
           | everything (including NUL).
           | 
           | The OS then adds its own limitations e.g. win32 forbids \, :,
           | *, ", ?, <, >, | (as well as a few special names I think) and
           | OSX forbids 0000 and 003a (":"), the latter of which gets
           | converted to and from "/" (and similarly forbidden) by the
           | POSIX compatibility layer.
           | 
           | The latter is really weird to see in action, if you have
           | access to an OSX machine: open a terminal, try to create a
           | file called "/" and it'll fail. Now create one called ":".
           | Switch over to the Finder, and you'll see that that file is
           | now called "/" (and creating a file called ":" fails).
           | 
           | Oh yeah and ZFS doesn't really care but can require that all
           | paths be valid UTF8 (by setting the utf8only flag).
        
       ___________________________________________________________________
       (page generated 2020-04-14 23:01 UTC)