[HN Gopher] UTF-8 Everywhere ___________________________________________________________________ UTF-8 Everywhere Author : pcr910303 Score : 244 points Date : 2020-04-14 15:55 UTC (7 hours ago) (HTM) web link (utf8everywhere.org) (TXT) w3m dump (utf8everywhere.org) | ddebernardy wrote: | (2012) | | Previous discussions: | https://news.ycombinator.com/from?site=utf8everywhere.org | rakoo wrote: | Maybe it's time for MySQL to make "utf8" actually mean UTF-8 then | (https://medium.com/@adamhooper/in-mysql-never-use-utf8-use-u...) | treve wrote: | > Although utf8 is currently an alias for utf8mb3, at some | point utf8 will become a reference to utf8mb4. To avoid | ambiguity about the meaning of utf8, consider specifying | utf8mb4 explicitly for character set references instead of | utf8. | | https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8... | | Can't fault a database vendor for being conservative, but it looks | like this is planned. Maybe this will be a 9.0 thing. | smacktoward wrote: | They probably couldn't even if they wanted to; by this point | there will be too much software out there depending on "utf8" | meaning "MySQL's weird proprietary hacked-up version of UTF-8". | | The only real solution is to hammer home the message that | "utf8mb4" is what you put into MySQL if you want UTF-8. | cosarara wrote: | There are actual problems too, when switching from utf8mb3 to | utf8mb4, because of maximum varchar length in indices: | https://stackoverflow.com/questions/48500355/mysql-character... | tialaramex wrote: | However, sometimes you're in a layer where ASCII is fine, and you | should just be explicit about that. | | Server Name Indication (in RFC 3546) is flawed in several ways. | It's a classic unused extension point, for example: it has | an entire field for what type of server name you mean, with only | a single value for that field ever defined. But one flaw that stands | out is that it uses UTF-8 encoding rather than insisting on ASCII for | the server name. | | You can see the reasoning - international domain names are a big | deal, we should embrace Unicode. But IDNA already needed to | handle all this work; the DNS A-labels are already ASCII even for | IDNs. | | Essentially, choosing UTF-8 here only made things needlessly more | complicated in a critical security component. Users, the people | who IDNs were for, don't know what SNI is, and don't care how | it's encoded. | magicalhippo wrote: | I'd be happy if I could just get consistent encoding. Have to | handle way too many files with mixed encodings, even XML files | with an explicit encoding header. | [deleted] | GnarfGnarf wrote: | I came to the same conclusion years ago. My app is Win32, but I | never defined UNICODE or used the TCHAR abomination. All strings | are stored as UTF8 until they are passed to Win32 APIs, whereupon | they are converted to UCS-2. I explicitly call the wchar version | of functions (ex: TextOutW). This strategy enabled me to | transition easily and safely from single-byte ASCII (Windows 3.1) | to Unicode. | | The database is also UTF8.
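A minimal sketch of the store-UTF-8-and-convert-at-the-boundary pattern described above, in Python via ctypes and the Win32 MultiByteToWideChar call (Windows-only; the string and helper name are illustrative, not from the original comment):

    import ctypes

    CP_UTF8 = 65001

    def utf8_to_wide(utf8_bytes):
        # Ask MultiByteToWideChar for the required length first,
        # then do the real conversion into a wide-char buffer.
        k32 = ctypes.windll.kernel32
        n = k32.MultiByteToWideChar(CP_UTF8, 0, utf8_bytes, len(utf8_bytes), None, 0)
        buf = ctypes.create_unicode_buffer(n)
        k32.MultiByteToWideChar(CP_UTF8, 0, utf8_bytes, len(utf8_bytes), buf, n)
        return buf[:n]

    # Strings live as UTF-8 bytes everywhere inside the program...
    title = "Grüße".encode("utf-8")
    # ...and become UTF-16 only at the W-API boundary (e.g. before TextOutW).
    wide_title = utf8_to_wide(title)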
| projektfu wrote: | When I used to do a lot of windows programming in the late 90s, I | wish that I had a sensible guide like this for handling strings. | TCHAR was always a source of subtle bugs. | | I suppose, though, that the underlying problem was that Microsoft | was so late to implement a compatibility solution for Windows 9x. | Most software of the time ended up implementing the "ANSI" | multibyte character set (MBCS) just because otherwise you would | need to either deploy 2 executables or do your own thunking. This | solution would be a double thunk on 9x because you'd be thunking | your UTF-8 to Unicode and then thunking that back to MBCS. | [deleted] | Animats wrote: | I'd argue for some standard tests for UTF-8 strings: | | - Basic - UTF-8 byte syntax correct. | | - Unambiguous - similar to the rules for Unicode domain names. | The rules are complicated, but basically they prohibit | homoglyphs, mixing of glyphs from different character sets, forward | and backward modifiers in the same string, emoji, | modifiers, etc. Use where people have to visually compare two | things for identity or retype them, such as file names. | | - Unambiguous, light version - as above, but allow emoji and | modifiers. Normal form for documents. | shpx wrote: | What I never see mentioned about Unicode is Han Unification | | https://en.m.wikipedia.org/wiki/Han_unification | | As I understand it, it's impossible to have a txt file that uses | Japanese and Chinese characters at the same time. The file will | either use the Chinese or Japanese forms of the characters, | depending on your font. I would think this is a big gotcha people | must run into all the time, but I never hear anyone talk about | it. | gsnedders wrote: | Relatively few people frequently look at different Han | languages, and relatively few people are looking at txt files | containing Han characters (and I expect those that do are | typically running with their OS locale set to one of the Han | languages?). | | Enough CJK HTML content is tagged and heuristics are mostly | good enough that incorrect font selection isn't a massive issue | on the web, and AFAIK most major word processors include | metadata in the file that suffices to distinguish language. | klodolph wrote: | I'm not going to try and minimize the problem here. Han | unification was pushed through by western interests, by my | understanding. | | However, most Unicode characters are identical or nearly | identical in Chinese and Japanese. Characters with | "significant" visual differences got encoded as different | Unicode characters. The same thing applies to simplified and | traditional Chinese characters. | | So for a given "Han character", there might be between one and | three different Unicode characters, and there might be between | one and three different ways of writing it. | | Here's an illustration: | https://japanese.stackexchange.com/questions/64590/why-are-j... | | So the issue does come up when mixing Chinese and Japanese | text, but it's not really one that has a big impact on the | _legibility_ of the text, though you would definitely be concerned | if you were writing a Japanese textbook for Chinese students, | or vice versa. | | Beyond that, it is usually fairly trivial to distinguish | between Japanese and Chinese text, so you could just lean on | simple heuristics to get the work done (Japanese text, with the | exception of fairly ancient text or very short fragments, | contains kana, but Chinese does not). | cygx wrote: | _Han unification was pushed through by western interests, by | my understanding._ | | Note that as far as I'm aware, the interest in question was | the initial 16-bit limit of the character set and later on | the non-proliferation of competing standards.
| | Also note that while Han unification is the most prominent | example, there are technically similar cases, which just | aren't as charged culturally. For one, Unicode doesn't encode | German Fraktur: While some characters are available due to | their use in mathematics, it's lacking the corresponding | variants of ä, ö, ü, ß, ſ as well as specific ligatures. So | if you want to intermix modern with old German writing, | you'll also have to go out-of-band. | anoncake wrote: | That's not the same thing. Fraktur is just a style of | fonts; Antiqua and Fraktur letters are semantically the | same. | cygx wrote: | There are differences as well as similarities. I'm no | expert, but shouldn't, say, U+4ECA still translate to | 'now' no matter if you draw a particular line | horizontally or diagonally? There are also some | mandatory[1] ligatures in Fraktur unavailable in Unicode. | What if I wanted to preserve that distinction in historic | writing? | | _edit:_ | | [1] I think the mandatory ones are actually there (just | not in Fraktur), it's some optional ones like sch that | are missing. | ksec wrote: | Yes, the real problem is when you start mixing all four (or | five) of them together: Traditional Chinese, Simplified | Chinese, Korean, Japanese. Things become extremely problematic. | | I think it is by luck that all four writing systems have significant | usage within their own region; imagine if one of them were | significantly smaller and over time were forced (by ease | of use or whatever reason) to switch to a different style | without knowing it. | cryptonector wrote: | As I understand it, Han unification happened because at the | time all there was was UCS-2 (no UTF-16, no UTF-8), so | codespace was tight and precious, and that motivated | codespace-preserving optimizations, of which Han unification | is the notable one. | | To avoid that they needed to have invented UTF-8 many years | earlier. Perhaps if the people designing Unicode were more | diverse they might have felt the necessity of something like UTF-8 | to the point of actually inventing it, but then perhaps they might | have done it poorly. At any rate, I don't know enough details | to really know if "Han unification was pushed through by | western interests" is remotely fair. | macintux wrote: | UTF-8 was sketched on a placemat as a response to a | different idea. It seems likely that had it not arisen in a | moment of inspiration by a genius, we would be stuck with | another inferior design by committee. | | https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt | cryptonector wrote: | First of all, there is no new unification work ongoing. The | Unicode Consortium moved on from that by moving on from UCS-2. | UCS-2 drove unification as a way to preserve precious | codespace. | | There used to be language tag codepoints for this, but they've | been deprecated. Han unification is an accident of history: a | result of UTF-8 not having existed until it was too late! | | There's not going to be a different new Unicode for doing away | with Han unification, which is why no one mentions it: besides | crying about it, what else can one do? Maybe we should revive | language tags? | | Anyways, isn't the difference between unified Han/Kanji | characters mostly stylistic rather than semantic? I'm not | denying that many users would get annoyed, but again, what to | do about it?? | innocenat wrote: | It's only a stylistic issue if you also consider a and α | (alpha) to be just stylistically different. | | I have learned to live with it, but it is very annoying.
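innocenat's point can be made concrete with a few code points: Latin a and Greek α got separate characters, while a unified ideograph is a single code point whose preferred shape depends on the font's target language. A small Python illustration (U+76F4 is a commonly cited example of a unified ideograph whose Chinese and Japanese forms differ):

    import unicodedata

    for ch in ("a", "α"):
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
    # U+0061 LATIN SMALL LETTER A
    # U+03B1 GREEK SMALL LETTER ALPHA

    # Han unification went the other way: one code point, many shapes.
    print(f"U+{ord('直'):04X} {unicodedata.name('直')}")
    # U+76F4 CJK UNIFIED IDEOGRAPH-76F4 -- the same code point covers
    # the Chinese and the Japanese glyph variant; only the font decides.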
| ksec wrote: | The same could be said of whether é and è should be the same as e | with different fonts. People who care about it would | complain. To those who only use English it is only the same | _e_. | microtherion wrote: | I don't think that's the same, because e.g. in French, e, é, | è, and ê are all used, with different pronunciations. | wheybags wrote: | It's different enough that users will _immediately_ complain | if you get it wrong. And it means that you, as a developer | who might not understand either Chinese or Japanese, now have | to deal with the fallout by setting a different font in your | application depending on which of the two languages it is. | This happened to us in Factorio, and it was super | annoying, because it's really hard to spot the problem | before it goes live, because you A: don't know the problem | exists, how would you? B: have a hard time seeing it even | when you do know. The whole point of Unicode is to not have | to think about this crap or handle it explicitly, and this | breaks that guarantee fantastically. | hutzlibu wrote: | As someone who experienced serious pain with broken strings that | I sometimes only discovered after the original files were gone | and new special characters were integrated, I directed quite some | anger at the fact that computer systems are internally operated | mostly in English only, so usually nobody notices bugs with | wrong character encoding. So I share the sentiment of the article... | | I do not want to think about UTF encoding when I simply create a | 7z or tar file, without even programming. But I learned the hard | way that I had to. I never even found out, for example, if it was/is a | bug with 7z, tar, rsync, the SciTE text editor/Notepad++... or just | wrong usage/configuration. I just had (and still have, even now that my | workflow is clean) a special first file/codeline with special | characters that I checked to be correct after compressing and rsyncing | between different systems. Especially between Windows and Linux. | But it probably helps that I don't have to do that anymore. | anderspitman wrote: | Trying to figure out how to express this without making people | mad at me. I think the conflation of Unicode with "plain text" | might be a mistake. Don't get me wrong, Unicode serves an | important purpose. But bumping the version from plain text 1.0 | (ASCII) to plain text 2.0 (Unicode) introduced a ton of | complexity, and there are cases where the abstractions start | leaking (iterating characters etc). | | With things like data archival, if I have a hard drive with the | Library of Congress stored in ASCII, I need half a sheet of paper | to understand how to decode it. | | Whereas apparently UTF8 requires 7k words just to explain why | it's important. And that's not even looking at the spec. | | Just to be crystal clear, I'm not advocating to not use Unicode, | or even use it less. I'm just saying I think it maybe shouldn't | count as plain text, since it looks a lot like a relatively | complicated binary format to me. | cryptonector wrote: | There are tens of thousands of characters in all the human | scripts. If you're a librarian, scholar, researcher -- why | would you not want to be able to use them seamlessly?? | droopyEyelids wrote: | If there was a complicated tool that claimed it could do the | job of every tool in history, or a simple tool that was | focused to cover 99% of the work you do-- and we lived on | planet earth-- which would you choose? | crazygringo wrote: | Umm...
but ASCII doesn't work for 99% of people's work. | | A majority of the world's population have writing systems | that ASCII doesn't encode. | | So not really sure what you're suggesting here. | dtech wrote: | Of course ASCII is simpler than Unicode, it handles only 128 | characters. If you restrict yourself to those characters, ASCII | is binary-equivalent to UTF-8. | | So yeah, maybe you shouldn't use characters 128+ for data | archival, I doubt that's a good idea, but that's irrelevant to | whether UTF-8 is plain text or not. | tachyonbeam wrote: | I think that sometimes it makes sense to enforce strict | limitations early on (eg: overly strict input validation). | You can then remove such limitations in later versions of | your software, after careful consideration and after | inserting the necessary tests. The reverse usually doesn't | work. If you didn't have those limitations early on, and your | database is full of strings with characters that should never | have been allowed in there, you will have a hard time | cleaning up the mess. | | This seems especially true to me in the design of programming | languages. If you have useless, badly thought out features in | your programming language, people will begin to rely on them, | and you will never be able to get rid of them... So start | with a small language, and make it strict. Grow it gradually. | [deleted] | nlitened wrote: | As a person who comes from a country with a non-ASCII alphabet, I | strongly disagree. Since UTF-8 became the de-facto standard | everywhere, so many headaches went away. | tingletech wrote: | LET'S GO BACK TO 6-BITS | hechang1997 wrote: | That complexity comes from the fact that you are using non-ASCII | characters. UTF8 is a superset of standard ASCII. If you | are using only standard ASCII characters, they're exactly the | same thing. | jaseemabid wrote: | ASCII is English, and limiting access to knowledge for the rest | of humanity for a simpler encoding is just not an acceptable | option. Someone needs to interpret those 7k words and write a | (complicated?) program once so that billions can read in their | own language? Sounds like an easy win to me. | droopyEyelids wrote: | counterpoint: | | A complicated program is never an easy win, and English is | already spoken in every country in the world. | WorldMaker wrote: | Sure spoken, but both Arabic and CJK ideograms are written | in far more countries in the world, with far more people, | and for far longer in history than the ASCII set. The | oldest surviving great works of Mathematics were written in | Arabic and some of the oldest surviving great works of | Poetry were written in Chinese, as just two easy and | obvious examples of things worth preserving in "plain | text". | crazygringo wrote: | So your argument is... it's easier to teach billions of | people fluent English... than for software to support | UTF-8? | | You are aware that a majority of the world's population | speaks no English whatsoever? | tachyonbeam wrote: | Playing the devil's advocate here. I am not a native | English speaker, I'm a French speaker, but I'm happy that | English is kind of the default international language. | It's a relatively simple language. I actually make fewer | grammar mistakes in English than I do in my native | language.
I suppose it's probably not a politically | correct thing to say (the English are the colonists, the | invaders, the oppressors), but eh, maybe it's also kind of | a nice thing for world peace if there is one relatively | simple language that's accessible to everyone? | | Go ahead and make nice libraries that support Unicode | effectively, but I think it's fair game, for a small | software development shop (or a one-person programming | project), to support ASCII only for some basic software | projects. Things are of course different when you're | talking about governments providing essential services, | etc. | TheCoelacanth wrote: | You only need one sentence to explain why ASCII isn't | sufficient: There are languages other than English. | ignoramous wrote: | > You only need one sentence to explain why ASCII isn't | sufficient | | Nitpick: ASCII is sufficient when you consider that Base64, | despite its 33% overhead from representing 6 bits with 8 | bits, makes life easier for certain classes of software. | TheCoelacanth wrote: | Base64 is an encoding for representing bytes[0] in ASCII. | | That doesn't help you represent text unless you already | have an encoding for representing text in bytes (e.g. | UTF8). | | [0] Octets if you want to be pedantic | ignoramous wrote: | What I was alluding to is, I often convert any binary | data, including text, to Base64 to avoid dealing with | cross-platform, cross-language, cross-format, cross-storage, | cross-network data-handling. Only the layer that | needs to deal with the blob's actual string | representation needs to worry about encoding schemes that | are outside the purview of the humble ASCII table. | dtech wrote: | You still need an encoding to represent non-ASCII | characters like é or μ. Base64 is no help at all there | msla wrote: | And you're naive if you think ASCII suffices for English. I | wouldn't give you ½¢ for an OS incapable of handling | Unicode and UTF-8 even if you told me every language other | than English were mysteriously destroyed. Going back to ASCII | is 180° from what would enrich English-language text. | pjscott wrote: | _Unicode_ is complicated because the languages it needs to | handle are, alas, complicated. _UTF-8_ is super simple. It's a | variable-length encoding for 21-bit unsigned integers. | Wikipedia gives a handy table showing how it works: | | https://en.wikipedia.org/wiki/UTF-8#Description | ftvy wrote: | When I wrote a very primitive UTF-8 library, I really began | to appreciate UTF-8's design. For example: the first byte | says how many bytes the character requires. At first it was | daunting, but when I put 2 and 2 together, it really opened | up. | | I am sure there are many aspects I am missing about UTF-8, | but it is all reasonable in its design and implementation. | | For reference, I was converting between code points and | actual bytes, and also implemented strlen and strcmp (the | latter of which the standard library apparently handles fine). | TheCoelacanth wrote: | The self-synchronizing property is also very clever. If you | start at an arbitrary byte, you can find the start of the | next character by scanning forward a maximum of 3 bytes.
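A minimal Python sketch of the two properties just described: the lead byte alone tells you the sequence length, and continuation bytes (10xxxxxx) are recognizable on sight, which is what makes resynchronization from an arbitrary offset possible:

    def utf8_len(lead: int) -> int:
        # Sequence length implied by a UTF-8 lead byte.
        if lead < 0x80: return 1            # 0xxxxxxx: ASCII
        if lead >> 5 == 0b110: return 2     # 110xxxxx
        if lead >> 4 == 0b1110: return 3    # 1110xxxx
        if lead >> 3 == 0b11110: return 4   # 11110xxx
        raise ValueError("continuation or invalid lead byte")

    def next_char_start(buf: bytes, i: int) -> int:
        # Self-synchronization: skip continuation bytes to find
        # the start of the next character.
        i += 1
        while i < len(buf) and (buf[i] & 0xC0) == 0x80:
            i += 1
        return i

    data = "héllo😀".encode("utf-8")
    print(utf8_len(data[0]))          # 1: 'h' is plain ASCII
    print(utf8_len(data[1]))          # 2: 'é' starts a two-byte sequence
    print(next_char_start(data, 1))   # 3: skips é's continuation byte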
| carapace wrote: | Yeah, this. I have a pat "Unicode Rant" that boils down to | this essentially. | | Having a catalog of standard numbers-to-glyphs (or symbols or | whatever, little pictures humans use to communicate with) is | awesome and useful (and all ASCII ever was), but trying to | digitize all of human language is much, much more | challenging. | dpc_pw wrote: | > For instance, 'ch' is two letters in English and Latin, but | considered to be one letter in Czech and Slovak. | | Is "ch" really considered one _character_ in Czech and Slovak? | I'm Polish and we do have "ch" and consider it one ... sound... | represented by two letters? I mean... if you asked anyone to | count letters/characters in a word, they would count "ch" as two. | So I wonder if that's different in Slovakia or the Czech Republic, or | if my definition of "character" is just wrong. | andy_wrote wrote: | Based on my experience learning Czech (not native at all, just | interested): | | - it's typically listed as a separate letter when writing out | the alphabet | | - but in practice it's typed out as "c h" and not as a single | character | | - it occupies its own place in Czech standard alphabetical | order, my English-Czech dictionary has all the "ch" words after | "h" (so interestingly, in order to do a proper sort | programmatically you need to possibly look 2 characters ahead; | see the sketch below) | pilsetnieks wrote: | At first I thought they simply meant the letter "c" but no, it | turns out that "ch" (and also "dz") is a digraph with a | separate place in the Czech and Slovak alphabets. | masklinn wrote: | > So I wonder if that's different in Slovakia or the Czech | Republic, or if my definition of "character" is just wrong. | | According to wikipedia, "Ch" is a character of the Czech | alphabet in the sense that it impacts alphabetical ordering | ("Ch" sorts between H and I), in the same way Ł or Ę are | apparently characters of the Polish alphabet distinct from L | and E respectively (wikipedia mentions that "być comes after | bycie"). | | That is unlike, say, French, where E and É are the same | character alphabetically. | | [0] https://en.wikipedia.org/wiki/Czech_orthography | mlj45 wrote: | This depends on your definition of informal terms like | "letter", "character" etc. | | The typographic term for combinations like this is "digraph". | (Wikipedia's definition: "A digraph [...] is a pair of | characters used in the orthography of a language to write | either a single phoneme [...] or a sequence of phonemes that | does not correspond to the normal values of the two characters | combined".) | | Whether digraphs have separate keys on a keyboard, are treated | as distinct for the purposes of alphabetisation, whether | speakers of the language think of them as separate "letters" | when spelling out a word and so on, are all separate issues and | vary between languages (or, more precisely, between the | conventions for writing a certain language). | Svip wrote: | A better example would probably be "ij" in Dutch. That's | definitely considered a single letter, as words starting with | ij in Dutch are capitalised IJ. Though there are glyphs for | IJ/ij already in Unicode.
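Referring back to andy_wrote's sorting point: a small Python sketch of digraph-aware collation with a two-character lookahead, assuming a simplified lowercase alphabet (real Czech collation, e.g. via ICU, also handles accents and case):

    # Czech-style ordering: the digraph "ch" is one collation unit
    # sorting between "h" and "i".
    ALPHABET = ["a","b","c","d","e","f","g","h","ch","i","j","k","l",
                "m","n","o","p","q","r","s","t","u","v","w","x","y","z"]
    RANK = {letter: i for i, letter in enumerate(ALPHABET)}

    def czech_key(word: str):
        key, i = [], 0
        while i < len(word):
            if word[i:i+2] == "ch":     # look two characters ahead
                key.append(RANK["ch"]); i += 2
            else:
                key.append(RANK.get(word[i], len(ALPHABET))); i += 1
        return key

    print(sorted(["hrad", "chata", "ihned"], key=czech_key))
    # ['hrad', 'chata', 'ihned'] -- "ch" sorts after "h", before "i"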
| roelschroeven wrote: | "ij" is sometimes considered a single letter, but certainly | not always. Quoting Wikipedia | (https://en.wikipedia.org/wiki/IJ_(digraph)): | | "IJ (lowercase ij; Dutch pronunciation: [ɛi]) is a digraph of | the letters i and j. Occurring in the Dutch language, it is | sometimes considered a ligature, or a letter in itself. In | most fonts that have a separate character for ij, the two | composing parts are not connected but are separate glyphs, | which are sometimes slightly kerned." | | (and equivalent in the Dutch Wikipedia article) | bartwe wrote: | Nobody has that as a letter on the keyboard here though, so | it doesn't matter. Normally typed as a digraph. Would be nice | if we just switched over to using y at this point. Makes me | wonder, is the use of diacritics declining since ASCII | keyboards became the norm? | kosievdmerwe wrote: | Afrikaans did this. We use "y" instead of "ij". | mercer wrote: | "Ij" is also one sound represented bij two letters, and I | think capitalizing just the 'I' is pretty standard. As a | Dutch person myself, I didn't even know that there's a glyph | for it! | | We also have "ei", which sounds the same and was invented to | annoy people learning Dutch. Then there's "oe", "eu", "ui". | And just to fuck even more with people learning the language, | we have "au" and "ou" which also sound the same. Oh, and "ch" | and "g". | | Hans Brinker, the inventor of the Dutch language, famously | would toss a florijn to decide between using ei/ij and au/ou, | as he was not fond of foreigners. He's mostly known for | saving our country though when he plugged a hole in a dyke | with his finger (yes, I know what you're thinking, and no, we | do not appreciate your dirty minds making light of this | heroic act). | akie wrote: | As a Dutch person myself, capitalizing just the I and not | the J hurts my eyes. Ijsselmeer or IJsselmeer? | mercer wrote: | Interesting. I never really gave it much thought, but Ij | actually bothers me so much that I usually try to avoid | using it at the beginning of a sentence, and I cringe | when I need to capitalize because it's a place (like | Ijsselmeer). | | Just did some googling. Turns out that unlike the other | combinations, capitalizing both letters is mandatory for | 'IJ'. TIL... | alexis_fr wrote: | Same as Œ/OE in French, then. | unwind wrote: | Spelling it "dike" helps keep people's minds on the right | thing. :) | samatman wrote: | If you spell it "dijk" it's even less racy, because it's | no longer a four-letter word. | mercer wrote: | Well shit. Guess I'll have to clean out my mind with some | soap... | masklinn wrote: | I don't know that that's correct. That there exists a | ligature character doesn't mean the ligature is a character | of the language. | | It could, mind, I don't know Dutch. But in French "œ" (which | has a ligatured character as you can see) is canonically | equivalent to "oe". It is not a separate letter of the | alphabet even though: | | * many words should not be written with the ligatured form | | * many words should be written with the ligatured form | | * it has a different pronunciation than the base form | enedil wrote: | Yeah, but in Czech it's "c". | pilsetnieks wrote: | No. C is something else, ch is a digraph that's pronounced | differently. Take a look at the Czech and Slovak alphabets | specifically: | | https://en.wikipedia.org/wiki/Czech_orthography | | https://en.wikipedia.org/wiki/Slovak_orthography | enedil wrote: | I'm Polish and I have just tangential knowledge of the Czech | language. Sorry for the confusion. | camgunz wrote: | This pops up every so often, and is wrong on several fronts (UNIX | is UTF-8, UTF-8/32 lexicographically sort, etc.)
There's not | really a good reason to support UTF-8 over UTF-16; you can | quibble over byte order (just pick one) and you can try and make | an argument about everything being markup (it's not), but the | fact is that UTF-16 is a more efficient encoding for the | languages a plurality of people use natively. | | But more broadly, being able to assume $encoding everywhere is | unrealistic. Write your programs/whatevers allowing your users to | be aware of and configure encodings. It might not be ideal, but | such is life. | jeltz wrote: | But is it really a plurality? Portuguese, English, Spanish, | Turkish, Vietnamese, French, Indonesian and German are stored | more efficiently in UTF-8, while Chinese, Korean and Japanese | are stored less efficiently. My gut feeling is that more people | use the Latin script than use CJK scripts. Indic | scripts, Thai, Cyrillic, etc. are stored using two bytes in both | UTF-8 AND UTF-16. | | And this ignores markup, which is in ASCII. | camgunz wrote: | Looking at the basic multilingual plane [1], UTF-8 will use > | 2 bytes to encode essentially anything that isn't: | | * ASCII/Latin | | * Cyrillic | | * Greek | | * Most of Arabic | | That leaves out: | | * China | | * India | | * Japan | | * Korea | | * All of Southeast Asia | | Re: markup, think about any text that's in a database, stored | in RAM, or stored on a disk--relatively little of it will be | in noisy ASCII markup formats like HTML or XML. | | [1]: https://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilin... | jeltz wrote: | > All of Southeast Asia | | Did you forget Indonesia, Vietnam, Malaysia, Brunei and the | Philippines? | camgunz wrote: | Again, here's what UTF-8 will use <= 2 bytes for: | | Basic Latin (Lower half of ISO/IEC 8859-1: ISO/IEC | 646:1991-IRV aka ASCII) (0000-007F) | | Latin-1 Supplement (Upper half of ISO/IEC 8859-1) | (0080-00FF) | | Latin Extended-A (0100-017F) | | Latin Extended-B (0180-024F) | | IPA Extensions (0250-02AF) | | Spacing Modifier Letters (02B0-02FF) | | Combining Diacritical Marks (0300-036F) | | Greek and Coptic (0370-03FF) | | Cyrillic (0400-04FF) | | Cyrillic Supplement (0500-052F) | | Armenian (0530-058F) | | Aramaic Scripts: Hebrew (0590-05FF) | Arabic (0600-06FF) Syriac (0700-074F) | Arabic Supplement (0750-077F) Thaana | (0780-07BF) N'Ko (07C0-07FF) | | In UTF-8, everything from U+0800 up requires > 2 bytes. Am I | misunderstanding something? It's possible.
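camgunz's threshold is easy to check directly; a quick Python comparison (code points at U+0800 and above cost three UTF-8 bytes versus two UTF-16 code-unit bytes):

    for ch in ("a", "и", "ا", "अ", "中", "😀"):
        u8, u16 = ch.encode("utf-8"), ch.encode("utf-16-le")
        print(f"U+{ord(ch):04X}: utf-8 = {len(u8)} bytes, utf-16 = {len(u16)} bytes")
    # Latin, Cyrillic and Arabic stay at 1-2 bytes in UTF-8; Devanagari
    # and CJK (>= U+0800) take 3 bytes in UTF-8 vs 2 in UTF-16; emoji
    # outside the BMP take 4 bytes in both.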
| jcranmer wrote: | > There's not really a good reason to support UTF-8 over UTF-16 | | Two big reasons: | | 1. All legal ASCII text is UTF-8. That means upgrading ASCII to | UTF-8 to support i18n doesn't require you to convert all your | files that were in ASCII. | | 2. UTF-16 gives people the mistaken impression that characters | are fixed-width instead of variable-width, and this causes | things to break horribly on non-BMP data. I've seen amusing | examples of this. | | > Write your programs/whatevers allowing your users to be aware | of and configure encodings. | | Internally, your program should be using UTF-8 (or UTF-16 if | you have to for legacy reasons), and you should convert from | non-Unicode charsets as soon as possible. But if you're | emitting stuff... you should try hard to make sure that UTF-8 | is the only output charset you have to support. Letting people | select non-UTF-8 charsets for output adds lots of complication | (now you have to have error paths for characters that can't be | emitted), and you need to have strong justification for why | your code needs that complication. | mark-r wrote: | Every program that purports to support Unicode should be | tested with a bunch of emoticons. | coolreader18 wrote: | Do you mean emoji? I don't see what the issue would be with | [{}:();P\\[\\],.<>/~-_+=XD] | mark-r wrote: | Yes, that's what I meant. I knew I was using the wrong | word but couldn't remember the right one. | camgunz wrote: | > 1. All legal ASCII text is UTF-8. That means upgrading | ASCII to UTF-8 to support i18n doesn't require you to convert | all your files that were in ASCII. | | Eh, realistically if you're doing this, you should be | validating it like converting from one encoding to another | anyway. I get that people won't and haven't, but that's | because UTF-8 has this anti-feature where ASCII is compatible | with it, and that's led to a lot of problems. | | > 2. UTF-16 gives people the mistaken impression that | characters are fixed-width instead of variable-width, and | this causes things to break horribly on non-BMP data. I've | seen amusing examples of this. | | This is one of those problems, and it's way worse with UTF-8 | because it encodes ASCII the same way ASCII does. It's let | programmers stay naive about this stuff for... decades? | | > Internally, your program should be using UTF-8 (or UTF-16 | if you have to for legacy reasons), and you should convert | from non-Unicode charsets as soon as possible. | | There are all kinds of reasons to not use UTF-8. tialaramex | pointed out one above. "UTF-8 everywhere" is simply | unrealistic, and it forces a lot of applications to be | slower, or to take on unnecessary complexity. Maybe it's | worth it to "never have to think about encodings again", but | that's pretty hard to verify and there's no way it happens in | our lifetimes anyway. | | > and you need to have strong justification for why your code | needs that complication. | | Yeah see, I strongly disagree with this. I'll choose whatever | encoding I like, thanks. Maybe you don't mean to be super | prescriptive here, but I think a little more consideration by | UTF-8 advocates wouldn't hurt. | jcranmer wrote: | > I'll choose whatever encoding I like, thanks. | | If everyone chooses whatever encoding they like, then the | charset being used has to be encoded somewhere. The problem | is, there are lots of places where the charset isn't encoded | (such as your filesystem). That this is a problem can be | missed, because almost all charsets are a strict superset | of ASCII (UTF-7 and UTF-16 are the only exceptions to be found | in the top 99.99% of usage), so it's only when you try your | first non-ASCII characters that problems emerge. | | Unicode has its share of issues, but at this point, Unicode | is the standard for dealing with text, and all i18n-aware | code is going to be built on Unicode internally. The only | safe way to handle text that has even the remotest chance | of being i18n-aware is to work with charsets that support | all of Unicode, and given its compatibility with ASCII, | UTF-8 is the most reasonable one to pick. | | If you want to insist on using KOI-8, or ISO-2022-JP, or | ISO-8859-1, you're implicitly saying "fuck you" to 2/3 | of the world's population since you can't support tasks as | basic as "let me write my name" for them.
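jcranmer's point, that without out-of-band metadata the charset is guesswork, can be demonstrated in a few lines of Python: the same byte sequence is "valid" in several legacy charsets and means something different in each, while UTF-8 at least has a chance of rejecting it:

    data = "Привет".encode("koi8-r")
    print(data.decode("koi8-r"))    # Привет
    print(data.decode("latin-1"))   # ðÒÉ×ÅÔ (mojibake, but "valid")
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as e:
        print("not valid UTF-8:", e)  # UTF-8's structure rejects it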
| camgunz wrote: | > If everyone chooses whatever encoding they like, then | the charset being used has to be encoded somewhere. | | This is gonna be the case for the foreseeable future, as | you point out. Settling on one encoding only fixes this | like, 100 years from now. I'd prefer to build encoding-aware | software that solves this problem now. | | > given its compatibility with ASCII, UTF-8 is the most | reasonable one to pick | | This only makes sense if your system is ASCII in the | first place, and if you can't build encoding-aware | software. I think we can both agree that's essentially | legacy ASCII software, so you don't get to choose | anything anyway. And any system that interacts with it | should be encoding-aware and still validate the encoding | anyway, as though it might be Big5 or whatever. Assuming | ASCII/UTF-8 is a bad idea, always and forever. | | > If you want to insist on using KOI-8, or ISO-2022-JP, | or ISO-8859-1, you're implicitly saying "fuck you" to 2/3 | of the world's population since you can't support tasks | as basic as "let me write my name" for them. | | I'm not obligated to write software for every possible | user at every point in time. It's perfectly acceptable | for me to say, "I'm writing this program for my 1 friend | who speaks Spanish" and have that be my requirements. But | if I were to write software that had a hope of being | broadly useful, UTF-8 everywhere doesn't get me there. | I'd have to build it to be encoding-aware, and let my | users configure the encoding(s) it uses. | jcranmer wrote: | > But if I were to write software that had a hope of | being broadly useful, UTF-8 everywhere doesn't get me | there. | | Actually, it does. | | Right now, in 2020, if you're writing a new programming | language, you can insist that the input files must be | valid UTF-8 or it's a compiler error. If you're writing a | localization tool, you can insist that the localization | files be valid UTF-8 or it's an error. Even if you're | writing a compiler for an existing language (e.g., C), it | would not be unreasonable to say that the source file | must be valid UTF-8 or it's an error--and let those not | using UTF-8 right now handle it by converting their | source code to use UTF-8. And this has been the case for | a decade or so. | | That's the point of UTF-8 everywhere: if you don't have | legacy concerns [someone actively using a non-ASCII, non- | UTF-8 charset that you have to support], force UTF-8 and | be done with it. And if you do have legacy concerns, try | to push people to using UTF-8 anyways (e.g., default to | UTF-8). | camgunz wrote: | I can't insist that other systems send your program | UTF-8, or that the users' OS use UTF-8 for filenames and | file contents, or that data in databases uses UTF-8, or | that the UTF-8 you might get is always valid. The end | result of all these things you're raising is "you can't | assume, you have to check always, UTF-8 everywhere buys | you nothing". Even if we did somehow get there, you'd | still have to validate it. | flohofwoe wrote: | I think it's quite obvious that UTF-8 is the better choice over | UTF-16 or UTF-32 for exchanging data (if just for the | little/big endian mess alone, and because UTF-16 isn't a | fixed-length encoding either).
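The endian point in Python terms (a small sketch; note that the BOM-prefixed form uses the platform's native byte order, here shown little-endian):

    s = "hi"
    print(s.encode("utf-8").hex())      # 6869 -- one form, no byte order
    print(s.encode("utf-16-le").hex())  # 68006900
    print(s.encode("utf-16-be").hex())  # 00680069
    print(s.encode("utf-16").hex())     # fffe68006900 -- BOM prepended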
| | From that perspective, keeping the data in UTF-8 for most of | its lifetime, also when loaded into a program, and only converting | "at the last minute" when talking to underlying operating | system APIs makes a lot of sense, except for some very specific | application types which do heavy text processing. | camgunz wrote: | I'm gonna do little quotes, but I don't mean to be passive-aggressive. | It's just that this stuff comes up all the time | | > I think it's quite obvious that UTF-8 is the better choice | over UTF-16 or UTF-32 for exchanging data (if just for the | little/big endian mess alone... | | This should be the responsibility of a string library | internally, and if you're saving data to disk or sending it | over the network, you should be serializing to a specific | format. That format can be UTF-8, or it can be whatever, | depending on your application's needs. | | > and because UTF-16 isn't a fixed-length encoding either) | | We should stop assuming any string data is a fixed-length | encoding. This is a major disadvantage of UTF-8, because it | allows for this conflation. | | > keeping the data in UTF-8 for most of its lifetime, also | when loaded into a program, and only converting "at the last | minute" when talking to underlying operating system APIs | makes a lot of sense, except for some very specific | application types which do heavy text processing. | | Well, you're essentially saying "I know about your use case | better than you do". It might be important to me to not blow | space on UTF-8. But if my platform/libraries have bought into | "UTF-8 everywhere" and don't give me knobs to configure the | encoding, I have no recourse. | | And that's the entire basis for this. It's "having to mess | with encodings is worse than the application-specific | benefits of being able to choose an encoding". I think | that's... at best an impossible claim and at worst pretty | arrogant. Again here I don't mean you, but this "UTF-8 | everywhere" thing. | jeltz wrote: | > We should stop assuming any string data is a fixed-length | encoding. This is a major disadvantage of UTF-8, because it | allows for this conflation. | | So what do you suggest? UTF-16 and UTF-32 encourage this | even more. | camgunz wrote: | Yeah, ASCII is such a powerful mental model that I think | anyone working with Unicode made a lot of concessions to | convert people, no argument there. But I think we need to | say we're done with that and move on to phase 2. Here's | what I advocate: | | - Encodings should be configurable. Programmers get to | decide what format their strings are internally, users | get to decide what encoding programs use when dealing | with filenames or saving data to disk, etc. Defaults | matter, and we should employ smarts, but we should never | say "I know best" and remove those knobs. | | - Engineers need to internalize that "strings" conceal | mountains of complexity (because written language is | complex), and default to using libraries to manage them. | We should start viewing manual string manipulation as an | anti-pattern. There isn't an encoding out there that we | can all standardize on that makes this untrue, again | because written language is complex. | eMSF wrote: | >We should stop assuming any string data is a fixed-length | encoding. This is a major disadvantage of UTF-8, because it | allows for this conflation. | | Mistaking a variable-width encoding for a fixed-width one | is _specifically_ a UTF-16 problem.
UTF-8 is so obviously | not fixed-width that such an error could not happen by | mistake, because even before the widespread use of emojis, | multibyte sequences were not in any way a corner case for | UTF-8 text (for additional reference, compare UTF-16 String | APIs in Java/JavaScript/etc. with UTF-8 ones in, say, Rust | and Go, and see which ones allow you to easily split a string | where you shouldn't be able to, or access "half-chars" as a | datatype called "char".) | camgunz wrote: | I mean, I think we're both in the realm of [citation | needed] here. I would argue that people index into | strings quite a lot--whether that's because we thought | UCS-2 would be enough for anybody, or because UTF-8 == ASCII and | "it's probably fine", is academic. The solution is the | same though: don't index into strings, don't assume an | encoding until you've validated. That makes any | "advantage" UTF-8 has disappear. | | If you really think no one made this mistake with UTF-8, | just read up on Python 3. | mark-r wrote: | The difference is that with UTF-8 you're much more likely | to trip over those bugs in random testing. With UTF-16 | you're likely to pass all your test cases if you didn't | think to include a non-BMP character somewhere. Then | someone feeds you an emoji character and you blow up. | camgunz wrote: | Which is why you should be using a library for all this, | one that uses fuzzing and other robustness checks. | crazygringo wrote: | > _not really a good reason to support UTF-8 over UTF-16_ | | Of course there is: the fact that if you're dealing only with | ASCII characters then it's backwards-compatible. Which is a | nice convenience in a great number of situations programmers | encounter. | | The minor details of the efficiency of an encoding these days aren't | particularly relevant -- sure UTF-16 is better for Chinese, but | the average webpage usually _does_ have way more markup, CSS | and JavaScript than text, and gzip-ing it on delivery will | result in a similar payload totally independent of the encoding | you choose. | camgunz wrote: | UTF-8's ASCII compatibility is an anti-feature; it's allowed | us to continue to use systems that are encoding-naive (in | practice ASCII-only). It's no substitute for creating | encoding-aware programs, libraries, and systems. | | The vast majority of text is not in HTML or XML, and there's | no reason you can't use Chinese characters in JavaScript | besides (your strings and variable/class/component/file names | will surely outpace your use of keywords). | crazygringo wrote: | It's not an anti-feature, it's a benefit that is a huge | asset in the real world. For example, you can be on a | legacy ASCII system, inspect a modern UTF-8 file, and if | it's in a Latin language then it will still be readable as | opposed to gibberish. Yes, all modern tools should be (and | these days generally are) encoding-aware, but in the real | world we're stuck with a lot of legacy tools too. | | And of course the vast majority of transmitted digital text | is in HTML and similar! What do you think it's in instead? | | By sheer quantity of digital words consumed by the average | person, it's news and social media delivered in browsers | (HTML), followed by apps (still using HTML markup to a huge | degree) and ebooks (ePub based on HTML). And of course | plenty of JSON and XML wrapping too. | | And of course you _can_ use Chinese characters in | JavaScript/JSON, but development teams are increasingly | international and English is the de-facto lingua franca.
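That backwards compatibility is checkable in a couple of lines of Python: an all-ASCII byte stream is already valid UTF-8, byte for byte, while UTF-16 doubles its size and breaks ASCII-naive tools:

    text = "plain ASCII text"
    assert text.encode("ascii") == text.encode("utf-8")  # identical bytes
    print(len(text.encode("utf-8")),
          len(text.encode("utf-16-le")))                 # 16 vs 32 bytes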
| camgunz wrote: | That huge asset has become a liability. We always needed | to become encoding-aware, but UTF-8's ASCII compatibility | has let us delay it for decades, and caused exactly the | confusion we're debating right now. So many | engineers have been foiled by putting off learning about | encodings. Joel Spolsky wrote an article, Atwood wrote an | article, Python made a backwards-incompatible change, | etc. etc. etc. | | To be honest, I'm just guessing about what text is stored | in--I'll cop to it being very hard to prove. But my guess | is the vast majority of text is in old binary formats, | executables, log files, firmware, or in databases without | markup. That's pretty much all your webpages right there. | | _n.b._ JSON doesn't really fit the markup argument. The | whole idea is that HTML is super noisy and the noise is 1 | byte in UTF-8, and 2 bytes in UTF-16. JSON isn't noisy so | the overhead is very low. | crazygringo wrote: | I just don't know what you're talking about. | | You can't rewrite all existing legacy software to support | encodings. You just can't. A backwards-compatible format | was a huge catalyst for widely supporting Unicode in the | first place. What exactly are we delaying for decades? | Engineers everywhere use Unicode today for new software. | The battle has been won, moving forwards. | | And the vast majority of text isn't in computer code or | even books. It's in the seemingly endless stream of | content produced by journalists and social media each and | every day, _dwarfing_ executables, firmware, etc. And if | it supports any kind of formatting (bold/italics etc.) | -- which most does -- then it's virtually always stored | in HTML or similar (XML). I mean, what are even the | alternatives? Neither RTF nor Markdown come even close in | terms of adoption. | camgunz wrote: | > You can't rewrite all existing legacy software to | support encodings. You just can't. A backwards-compatible | format was a huge catalyst for widely supporting Unicode | in the first place. | | Totally agree. | | > What exactly are we delaying for decades? | | Learning how encodings work and using that knowledge to | write encoding-aware software. | | > Engineers everywhere use Unicode today for new | software. The battle has been won, moving forwards. | | They do, but they're frequently foiled by on-disk | encodings, filenames, internal string formats, network | data, etc. etc. etc. All this stuff is outlined in TFA. | | > And the vast majority of text isn't in computer code or | even books. It's in the seemingly endless stream of | content produced by journalists and social media each and | every day | | I concede I'm not likely to convince you here, but like, | do you think Twitter is storing markup in their | persistence layer? I doubt it. And even if there is some | formatting, we're talking about <b> here, not huge | amounts of angle brackets. | | But think about any car display. That's probably not | markup. Think about ATMs. Log files. Bank records. Court | records. Label makers. Airport signage. Road signage. | University presses. | jeltz wrote: | The reason most programmers use English in their source | code has nothing to do with file size (for that there are | JS minifiers) or supported encodings.
It has to do with two things: English is the most used language in the | industry, so if you want to cooperate with programmers from | other parts of the world English is a good idea; and | it frankly looks ugly to mix languages in the same file, so | when the standard library is in English your source code | will be too. | | So since most source code is in English (and for JS is | minified), UTF-8 works perfectly there too. | nayuki wrote: | I love the typesetting on the page. It is content-first, clean, | and simple. | | It lacks all the usual noise like modal dialogs, headers and | footers, social media icons, colorful sidebars, newsletter sign- | ups, cookie warnings, etc. | legulere wrote: | > In the UNIX world, narrow strings are considered UTF-8 by | default almost everywhere. Because of that, the author of the | file copy utility would not need to care about Unicode | | It couldn't be further from the truth. Unix paths don't need to | be valid UTF-8 and most programs happily pipe the mess through | into text that should be valid. (Windows filenames don't have to | be proper UTF-16 either) | | Rust is one of the few programming languages that correctly | doesn't treat file paths as strings. | marcosdumay wrote: | > Unix paths don't need to be valid UTF-8 | | Yet, your shell will treat them like UTF-8 just as well. As | will the standard library of almost every programming language, | as you noticed. | | If you open one such file in most text editors, they will | render whatever is in it as UTF-8. If you use text-manipulating | utilities, they will work with it as if it was encoded in | UTF-8. | | It's mostly the Linux kernel that disagrees. Everything else | considers them UTF-8. | arendtio wrote: | Doesn't it depend on your locales? | | At least for source-based Linux distributions (Gentoo, | Exherbo) I remember that you have to define the locales you | want to use and which ones should be the default. And when I | build a system without UTF-8 locales, I doubt that the shell | will treat paths as UTF-8. | Spivak wrote: | Which is a silly position since the kernel is the only thing | that matters. You're right that not too many people will | complain if your program crashes on non-UTF-8 paths. Same | with spaces in group names. 100% valid and accepted. Breaks a | ridiculous amount of software if you actually do it. | | But that doesn't mean it's right. It just means that we have | a calcified convention. | marcosdumay wrote: | > narrow strings are considered UTF-8 by default almost | everywhere | | It means that this is mostly true. | | I dunno what it should be. There are benefits and costs to | both allowing and restricting the names. As well, there are | good reasons for the kernel alone to support them even | though all the userland doesn't. But it does mean that you | just use UTF-8 and it's done. | lisper wrote: | > Rust is one of the few programming languages that correctly | doesn't treat file paths as strings. | | Common Lisp too. | tester89 wrote: | I've never actually understood how pathnames work in CL. | lisper wrote: | That makes two of us. But they aren't strings :-) | | (Seriously though, is it pathnames you don't understand or | logical hosts? Because CL pathnames are actually pretty | straightforward. Logical hosts, on the other hand, are a | hot mess.) | gumby wrote: | They are pretty straightforward: they are just path | _structures_ rather than path _names_ that may turn into | single strings when supplied to your kernel.
Or, depending | on the OS, maybe only part of the name is turned into a | string and part determines which device or syntax applies. | All of which is abstracted away by the path objects. | | Back in the 1970s, when this first appeared on lisp | machines, it was not uncommon to use remote file systems | transparently, and those remote file systems could be on | quite different OSes like ITS, TOPS10 or -20, VMS, one of | the lisp machine file systems and even Unix (though | networking came quite late to Unix). "MC:GUMBY; FOO >" and | "OZ:<GUMBY>FOO.TXT;0" were perfectly reasonable filenames. | Some of those systems had file versioning built into them. | So if the world looks like Unix to you, some of that | additional expressive power could be confusing. | | C++17 path support is a neutered version of Common Lisp's. | [deleted] | ken wrote: | > one of the few programming languages that correctly doesn't | treat file paths as strings | | I hear: one of those few programming languages that, despite | its vaunted type-safety, makes it possible to accidentally | create a file with a completely bogus name that I won't be able | to view or open correctly with half the programs on my | computer. | | Languages which allow arbitrary byte sequences in paths are the | cause of, and solution to, all of Unix's pathname problems. | lilyball wrote: | So what you're saying is the language should not be able to | work with pre-existing files whose names are not valid UTF-8? | orf wrote: | No, it's impossible to do that accidentally. Due to its type | safety. You have to be pretty explicit about passing a non- | string in (all Rust strings are valid UTF-8). | jcranmer wrote: | > It couldn't be further from the truth. Unix paths don't need | to be valid UTF-8 and most programs happily pipe the mess | through into text that should be valid. (Windows filenames | don't have to be proper UTF-16 either) | | A decent fraction of software can impose rules on the portion | of the filesystem within its control. A tool like mv or vim | has to be prepared to handle any filepath encoding. But | something like a VCS could reasonably insist that it only | supports filetrees with normalized UTF-8 encoding and no case- | insensitive conflicts, as the only things reliably working | cross-platform. | Thrymr wrote: | Sure, as long as you don't have to be compatible with | anything else, you can assume whatever encoding you want. | That doesn't change the point that general programs can't | make that assumption. | acdha wrote: | The history of Git and Subversion handling filenames makes me | think that the opposite is true: a VCS which doesn't handle | arbitrary byte-strings will have weird edge cases which | prevent users from adding files or accessing them, possibly | even "losing" data in a local checkout. This is especially | tedious because it'll appear to work for a while until | someone first tries to commit an unusual file or checks it | out with a previously-unused client. | roblabla wrote: | My understanding is, you can't treat the filename as an | arbitrary bytestring, since you have to transcode it across | platforms, otherwise the filename won't show up properly | everywhere. E.g. if I make a file named "test" on Unix, it | will be UTF-8 (assuming a sane Unix). If on Windows I create | a file with the filename "test", encoded as UTF-8, it will | show up as worthless garbage in explorer.exe since it will | decode it to UTF-16. | | So a VCS needs to know the filename encoding in order to work | properly.
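Python's approach to this mess is worth knowing here: byte filenames that aren't valid UTF-8 are smuggled into str losslessly via the surrogateescape error handler. A small sketch (assuming a POSIX system with a UTF-8 locale; the file name is illustrative):

    import os

    raw = b"caf\xe9.txt"            # Latin-1 era name, not valid UTF-8
    name = os.fsdecode(raw)         # 'caf\udce9.txt' -- lossless escape
    assert os.fsencode(name) == raw # round-trips to the exact same bytes
    # For display only, degrade gracefully instead of crashing:
    print(name.encode("utf-8", "replace").decode())  # 'caf?.txt'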
| oconnor663 wrote: | > It couldn't be further from the truth. Unix paths don't need | to be valid UTF-8 | | Yes _but_, most programs expect to be able to print filepaths | at least under some circumstances, like printing error | messages. Even if a program is fully correct and doesn't assume | an encoding in normal operation, it still has to assume one for | printing. Filepaths that aren't UTF-8 lead to a bunch of � in | your output (at best). So I think it's fair to say that Unix | paths are assumed to be UTF-8 by almost all programs, even if | being invalid UTF-8 doesn't actually cause a correct program to | crash. | Spivak wrote: | I mean it doesn't have to assume an encoding for printing, it | just has to have a sane way of turning the path into | something human-readable. | | Look, you're right that this ship has sailed, but ideally we | would have decided on a way to display and encode binary for | file paths. | oconnor663 wrote: | I dunno. That sounds like proposing to render "foo.txt" as | "Zm9vLnR4dA==" or "[102, 111, 111, 46, 116, 120, 116]" or | something. I think you probably meant something like "print | the regular characters if the string is UTF-8, or a lossless | fallback representation of the bytes otherwise." That's a | good idea, and I think a lot of programs do that, but at the | same time "if the string is UTF-8" is | problematic. There's no reliable way for us to know what | strings are or are not intended to be decoded as UTF-8, | because non-UTF-8 encodings can coincidentally produce | valid UTF-8 bytes. For example, the two characters "&!" are | the same bytes in UTF-8 as the character "Ω" is in UTF-16. | This works in Python: assert | "&!".encode("UTF-8").decode("UTF-16le") == "Ω" | | So I think I want to claim something a bit stronger: | | 1) Users demand, quite rightly, to be able to read paths as | text. 2) There is no reliable way to determine the encoding | of a string, just by looking at its bytes. And Unix doesn't | provide any other metadata. 3) Therefore, useful Unix | programs _must_ assume that any path that could be UTF-8, | is UTF-8, for the purpose of displaying it to the user. | | Maybe in an alternate reality, the system locale could've | been the reliable source of truth for string encodings? But | of course if we were starting from scratch today, we'd just | mandate UTF-8 and be done with it :) | eska wrote: | In the Rust std one can easily use the lossless representation | with file APIs, and print a lossy version in error messages. | I find this to be good enough. | gumby wrote: | > Unix paths don't need to be valid UTF-8 | | And a lucky thing too; OSes that _do_ have UTF-8 filesystems | don't always agree on how to apply canonicalization, much less | how to deal with canonicalization differences between user-entered | data and normalized filesystem names. | DannyB2 wrote: | > Rust is one of the few programming languages that correctly | doesn't treat file paths as strings. | | Imagine if languages allowed subtypes of strings which are not | directly assignment-compatible. | | HtmlString | | SqlString | | String | | A String could be converted to an HtmlString not by assignment, | but through a function call, which escapes characters that the | browser would recognize as markup. | | Similarly a String would be converted to a SqlString via a | function. | | It would be difficult to accidentally mix up strings because | they would be assignment-incompatible without the functions | that translate them. | | There could be mixed "languages" within a string. Like a JSP or | PHP that might contain scripting snippets, and also JavaScript | and CSS snippets, each with different syntax rules and escaping | conventions.
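A sketch of that idea using Python's typing.NewType (enforcement is static-only, via a type checker such as mypy; the names are illustrative):

    from typing import NewType
    import html

    HtmlString = NewType("HtmlString", str)

    def to_html(s: str) -> HtmlString:
        # The one sanctioned way to make an HtmlString: escape markup.
        return HtmlString(html.escape(s))

    def render(fragment: HtmlString) -> None:
        print(fragment)

    user_input = "<script>alert(1)</script>"
    render(to_html(user_input))  # ok: &lt;script&gt;alert(1)&lt;/script&gt;
    render(user_input)           # runs, but a type checker flags this line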
| gumby wrote:
| > Unix paths don't need to be valid UTF-8
|
| And a lucky thing too; OSes that _do_ have UTF-8 filesystems
| don't always agree on how to apply canonicalization, much less
| how to deal with canonicalization differences between user-
| entered data and normalized filesystem names.
| DannyB2 wrote:
| > Rust is one of the few programming languages that correctly
| doesn't treat file paths as strings.
|
| Imagine if languages allowed subtypes of strings which are not
| directly assignment compatible.
|
| HtmlString
|
| SqlString
|
| String
|
| A String could be converted to an HtmlString not by assignment,
| but through a function call, which escapes characters that the
| browser would recognize as markup.
|
| Similarly, a String would be converted to a SqlString via a
| function.
|
| It would be difficult to accidentally mix up strings because
| they would be assignment incompatible without the functions
| that translate them.
|
| There could be mixed "languages" within a string. Like a JSP or
| PHP file that might contain scripting snippets, and also
| JavaScript and CSS snippets, each with different syntax rules
| and escaping conventions.
| jdc wrote:
| Cf. _newtype_ in Python and Haskell.
| torstenvl wrote:
| Failing that, you could also adopt a naming convention with
| prefixes to indicate what sort of thing it is you're storing
| there:
|     hsCode = hsFromUs(usInputBuffer);
|     ssStoredCode = ssFromHs(hsCode);
|
| https://www.joelonsoftware.com/2005/05/11/making-wrong-code-...
| DannyB2 wrote:
| Yes. But having the compiler enforce it is your first line of
| defense. If it doesn't compile, you know there is an actual
| problem. In modern IDEs, you see these compile errors as
| quickly as you type them.
| bruckie wrote:
| Some security-sensitive libraries do this, e.g.
| https://www.javadoc.io/doc/com.google.common.html.types/type...
| mhh__ wrote:
| Allowed you to? You could do that in C++ quite happily; it's
| just not useful enough to bother implementing, at least.
| eska wrote:
| It's absolutely useful enough; it's just that it's awful in C++
| due to language limitations, as opposed to other languages such
| as Haskell, where it is standard.
| akiselev wrote:
| They're not worth the effort in C++ because it doesn't have
| strictly enforced affine/dependent types. The GP is envisioning
| a language that does.
| ori_b wrote:
| Why do you need them to enforce that only escaped strings are
| passed to functions?
|     html::append(html::string text);
| with a constructor
|     html::string(std::string)
| that handled escaping seems like it'd work just fine.
| mika9090 wrote:
| Try Pascal (Free Pascal or Delphi)
| DannyB2 wrote:
| I used Pascal through the 80's and part of the 90's. Currently
| I use Java. I almost tried Delphi, but my shop moved on to
| something else between Pascal and Java.
| robocat wrote:
| AFAIK they just provide type name aliases, which do not enforce
| anything or warn you if you mix the "types".
| gnarbarian wrote:
| You would probably like Java 1.4
| masklinn wrote:
| This pattern (_newtyping_) is a _huge_ weakness of Java in
| general, and even more so older Java, and people who like
| newtyping are not going to like Java.
|
| Because creating newtypes in Java is
|
| 1. verbose, defining a trivial wrapper takes half a dozen lines
| before you've even done anything
|
| 2. slow, because you're paying for the overhead of an extra
| allocation and pointer indirection every time, unless you jump
| through unreadable hoops making for even more verbose
| newtypes[0]
|
| It is a much more convenient (and thus frequent) pattern in
| languages like Haskell. Or Rust.
|
| [0] https://gist.github.com/jbgi/d6b677d084fafc641fe01f7ffd00591...
| DannyB2 wrote:
| I use Java 14 now. Java 11 in production.
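A minimal Rust sketch of the assignment-incompatible string types
imagined in this subthread; HtmlString and its escaping rules are
illustrative, not a real library API:

    // `HtmlString` is a hypothetical newtype; real code would use a
    // vetted escaping library.
    struct HtmlString(String);

    impl HtmlString {
        // The only way in: a conversion that escapes markup characters.
        fn from_plain(s: &str) -> Self {
            let escaped = s
                .replace('&', "&amp;") // must come first
                .replace('<', "&lt;")
                .replace('>', "&gt;")
                .replace('"', "&quot;");
            HtmlString(escaped)
        }
    }

    // Sinks accept only the escaped type, never a bare String.
    fn render(html: HtmlString) {
        println!("{}", html.0);
    }

    fn main() {
        let user_input = String::from("<script>alert(1)</script>");
        // render(user_input); // does not compile: String is not HtmlString
        render(HtmlString::from_plain(&user_input));
    }

The commented-out call fails type checking, which is exactly the
compile-time enforcement described above.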
| sitzkrieg wrote:
| git will also do this, so on a fs that allows arbitrary bytes
| in file names, you can end up with tree objects of the same
| name, which makes digging them out later "fun"
| benibela wrote:
| I have a repository full of such files:
| https://github.com/benibela/nasty-files
|
| You can clone the repository, and then you cannot delete it
| with tools that expect utf-8 names (like KDE's Dolphin)
| cryptonector wrote:
| Yes, but the only way to interop multiple scripts on a POSIX
| filesystem is to use UTF-8. I can forgive people for not
| realizing that filenames in POSIX are a weird animal: they are
| NUL-terminated strings of characters (char) in some arbitrary
| codeset and encoding, but US-ASCII '/' is special.
|
| EDIT: Also, "considered UTF-8 by default almost everywhere"
| is... not necessarily wrong -- nowadays users should be using
| UTF-8 locales by default. Maybe "almost everywhere" is an
| exaggeration, but I wouldn't really know.
| masklinn wrote:
| > Unix paths don't need to be valid
|
| unless they do.
|
| OSX will most likely barf at or mangle invalid file names (HFS+
| requires well-formed UTF-16, which translates to well-formed
| UTF-8 at the POSIX layer), and there are ZFS systems which are
| configured with utf8only set.
|
| It would be more precise to say that you _can't assume_ UNIX
| paths are anything other than garbage.
| ngrnjp wrote:
| That's a fundamental flaw of UNIX.
| msla wrote:
| It's a reflection of the fact that people aren't going to throw
| out existing filesystems because they aren't in a specific
| character encoding. There's nothing the OS can do about that,
| there's nothing programmers in general can do about that, and
| the only way to fix it is with a time machine and enough
| persuasion to force everyone to implement Unicode and UTF-8 to
| the exclusion of any other character encoding schemes.
| downerending wrote:
| As flaws go, it's pretty awesome. Wish we had more such.
| kyberias wrote:
| Well, the font on that article is too small and otherwise ugly.
| jfkebwjsbx wrote:
| Even Microsoft is finally giving up on UTF-16!
|
| They now recommend using the UTF-8 "code page" in new code.
| nathanaldensr wrote:
| Do you have a source for this? AFAIK the .NET Framework CLR and
| CoreCLR both still store strings internally as UTF-16.
| mormegil wrote:
| AFAICT, it's not only "internal representation". .NET strings
| are defined as a sequence of UTF-16 units, including the
| definition of the Char type representing a single UTF-16 code
| unit. I can't imagine how such a change could be implemented
| (other than changing the internal representation but converting
| on all accesses, which would be nonsense, I think).
| leosarev wrote:
| Current plan is:
| https://github.com/dotnet/corefxlab/issues/2350
| ChrisSD wrote:
| The closest I could find to a recommendation for UTF-8 is in
| the UWP design guidelines:
| https://docs.microsoft.com/en-us/windows/uwp/design/globaliz...
|
| However, it's not quite unequivocal. Windows still uses UTF-16
| in the kernel (or actually an array of 16-bit integers, but
| UTF-16 is a very strong convention). The code page will often
| allow the Win32 API to perform the conversion back and forth
| instead of your application doing it.
| leosarev wrote:
| CoreCLR is actively discussing introducing a Utf8String type:
| https://github.com/dotnet/corefxlab/issues/2350
| gpvos wrote:
| Have they fixed all the bugs with that pseudo-codepage?
| xeeeeeeeeeeenu wrote:
| Bugs like WriteFile() reporting the wrong number of bytes
| written with the 65001 codepage were fixed years ago.
| buckminster wrote:
| That's good news. Last time I looked, more than a decade ago
| admittedly, that bug was WONTFIX.
|
| In fact I was so surprised I just wrote a test program. They
| have fixed it!
|
| It was the dumbest bug I ever saw in Windows. It was special-
| case code in the console output code path of the user-mode
| part of WriteFile. It only existed to make utf8 work, and it
| didn't even do that.
| gpvos wrote:
| Ah, that's surprising; Microsoft was very stubbornly _not_
| doing that for at least a decade and a half.
|
| In fact, the FAQ in TFA (questions 9 and 20) mentions that
| there are still problems with CP_UTF8 (65001). Is the article
| out of date? Can someone respond to those statements?
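For contrast with letting the code page do the conversion, here is a
small sketch of an application keeping strings UTF-8 internally and
converting itself at the UTF-16 API boundary; wide_null is a
hypothetical helper, not a standard-library function:

    // UTF-8 internally; convert only at the UTF-16 API boundary.
    fn wide_null(s: &str) -> Vec<u16> {
        // encode_utf16 emits surrogate pairs where needed; C-style
        // wide-string APIs additionally expect a trailing NUL.
        s.encode_utf16().chain(std::iter::once(0)).collect()
    }

    fn main() {
        let title = "héllo";
        let wide = wide_null(title);
        // `wide.as_ptr()` could now be handed to a ...W-style API.
        println!("{} UTF-16 code units (including the NUL)", wide.len());
    }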
| snazz wrote:
| Is java.lang.String still UTF-16? Is there any plan to fix
| that? Once Windows and Java take care of it, I can't think of
| any other major UTF-16 uses left. Are there any that I've
| forgotten about?
|
| Edit: Still looks like UTF-16, according to the Oracle
| documentation page:
| https://docs.oracle.com/en/java/javase/14/docs/api/java.base...
| Edit 2: JavaScript too. See my reply to someone else below.
| lokedhs wrote:
| I think it will be hard to change that. But it's not alone.
| JavaScript also uses UTF-16.
| snazz wrote:
| You're right! I'm surprised I didn't know that. It looks like
| it can also be UCS-2, going by the spec:
|
| > A conforming implementation of this International standard
| shall interpret characters in conformance with the Unicode
| Standard, Version 3.0 or later and ISO/IEC 10646-1 with either
| UCS-2 or UTF-16 as the adopted encoding form, implementation
| level 3. If the adopted ISO/IEC 10646-1 subset is not otherwise
| specified, it is presumed to be the BMP subset, collection 300.
| If the adopted encoding form is not otherwise specified, it is
| presumed to be the UTF-16 encoding form.
| im3w1l wrote:
| UCS-2 is an old precursor of UTF-16 that lacks support for
| surrogate pairs, which means that rare symbols and emoji don't
| work.
| josefx wrote:
| I don't think they can fix that without completely breaking
| backwards compatibility. The basic char type in Java is defined
| as a 16-bit-wide unsigned integer value, and String doesn't
| abstract over that.
| masklinn wrote:
| > Is java.lang.String still UTF-16?
|
| Yes.
|
| > Is there any plan to fix that?
|
| That's not really possible, as strings are defined in terms of
| char and guarantee O(1) access to UTF-16 code units. They might
| try to switch to "indexed UTF-8" (as PyPy did in the Python
| ecosystem, whereas "CPython proper" refused to switch to UTF-8
| with the Python 3 upheaval and went with the death trap that is
| PEP 393 instead).
| projektfu wrote:
| I don't think it's a big deal for Java because it's always easy
| to transfer in from and out to UTF-8. Very few Java programs
| use UTF-16 as a persistence format, and Java-native
| applications can directly marshal strings around as they are a
| first-class datatype.
| rimunroe wrote:
| JavaScript:
|
| https://www.ecma-international.org/ecma-262/5.1/#sec-2
|
| > A conforming implementation of this Standard shall interpret
| characters in conformance with the Unicode Standard, Version
| 3.0 or later and ISO/IEC 10646-1 with either UCS-2 or UTF-16 as
| the adopted encoding form, implementation level 3. If the
| adopted ISO/IEC 10646-1 subset is not otherwise specified, it
| is presumed to be the BMP subset, collection 300. If the
| adopted encoding form is not otherwise specified, it is
| presumed to be the UTF-16 encoding form.
|
| https://www.ecma-international.org/ecma-262/5.1/#sec-4.3.16
|
| > A String value is a member of the String type. Each integer
| value in the sequence usually represents a single 16-bit unit
| of UTF-16 text. However, ECMAScript does not place any
| restrictions or requirements on the values except that they
| must be 16-bit unsigned integers.
| diroussel wrote:
| Compact Strings were added in Java 9:
| https://openjdk.java.net/jeps/254
|
| So they can now be stored as one byte per character.
| kllrnohj wrote:
| Only for Latin-1 text. There is still no UTF-8 support (it's
| even called out as a non-goal in the JEP: "It is not a goal to
| use alternate encodings such as UTF-8 in the internal
| representation of strings.")
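The UTF-16 code-unit vs. character distinction this subthread keeps
circling can be shown in a few lines of Rust:

    fn main() {
        let s = "a😀"; // 'a' plus one emoji outside the BMP
        assert_eq!(s.chars().count(), 2);        // Unicode scalar values
        assert_eq!(s.encode_utf16().count(), 3); // UTF-16 code units: the
                                                 // emoji is a surrogate pair
        assert_eq!(s.len(), 5);                  // UTF-8 bytes: 1 + 4
        // Java's String.length() and JavaScript's .length both report
        // the UTF-16 code-unit count (3 here), not the character count.
        println!("ok");
    }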
| jdsully wrote:
| Are you sure? That will result in a conversion every time a
| string is passed to the kernel.
|
| Windows can handle UTF-8, but it is not the native character
| set for the platform.
| JdeBP wrote:
| There's a conversion in every ...A() function. Conversion
| between UTF-8 and WTF-16 is just more of the same, but without
| codepage lookup tables. (-:
| mark-r wrote:
| They probably still do a codepage lookup just for consistency.
| Shebanator wrote:
| WTF-16? I like it...
| ekimekim wrote:
| WTF-8 and WTF-16 are a thing:
| https://simonsapin.github.io/wtf-8/
|
| Basically WTF-16 is any sequence of 16-bit integers, and is
| thus a superset of UTF-16 (because UTF-16 doesn't allow certain
| combinations of integers, mainly surrogate code points that
| exist outside of surrogate pairs).
|
| Then WTF-8 is what you get if you naively transform invalid
| UTF-16 into UTF-8. It is a superset of UTF-8.
|
| This is very useful when dealing with applications like Java
| and JavaScript that treat strings as sequences of 16-bit code
| units, even though not all such strings are valid UTF-16.
| masklinn wrote:
| > Basically WTF-16 is any sequence of 16-bit integers, and is
| thus a superset of UTF-16 (because UTF-16 doesn't allow certain
| combinations of integers, mainly surrogate code points that
| exist outside of surrogate pairs).
|
| If WTF-16 is the ability _in potentia_ to store and return
| invalid UTF-16 without signalling errors, I don't know that
| there's any actual UTF-16 system out there, with the possible
| exception of... HFS+, maybe?
| xg15 wrote:
| > _When writing a UTF-8 string to a file, it is the length in
| bytes which is important. Counting any other type of
| 'characters' is, on the other hand, not very helpful._
|
| So, suppose I have a UTF-8 string of n code units (bytes)
| length. Unfortunately my data structure only permits strings of
| length m < n bytes.
|
| How do I correctly truncate the string so it doesn't become
| invalid UTF-8 and won't show any unexpected gibberish when
| rendered? (E.g., the truncated string doesn't suddenly contain
| any glyphs or grapheme clusters that weren't in the original
| string)
| toast0 wrote:
| > How do I correctly truncate the string?
|
| Refuse to accept a string that is overlong, and require an
| interactive user (hopefully one literate in the language) to
| truncate it for you. In a non-interactive context, you can't.
| Tyr42 wrote:
| https://play.rust-lang.org/?version=stable&mode=debug&editio...
|
| Something like this? Check if each character pushes the byte
| total over the limit?
|
| I think this might fail for combining characters though.
| samatman wrote:
| Avoiding invalid UTF-8 is easy, almost trivial: just make sure
| you don't truncate in the middle of a code point.
|
| Avoiding unexpected gibberish is fiendishly difficult to get
| right in all cases, the ugliest case being emoji flags. Being
| all-or-nothing on both sides of a ZWJ will get you most of the
| way there, however.
| smasher164 wrote:
| It's not though. Replacing invalid byte sequences is not
| terribly difficult.
|
| https://golang.org/src/strings/strings.go?s=15854:15900#L627
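A minimal sketch of the "don't cut mid-code-point" truncation samatman
describes, in Rust; as noted above, this keeps the result valid UTF-8
but does not respect grapheme clusters such as combining marks or ZWJ
emoji sequences:

    // Truncate to at most `max_bytes` without splitting a UTF-8 code
    // point. Valid UTF-8 in, valid UTF-8 out -- but grapheme clusters
    // can still be cut apart.
    fn truncate_utf8(s: &str, max_bytes: usize) -> &str {
        if s.len() <= max_bytes {
            return s;
        }
        let mut end = max_bytes;
        // Walk back to the nearest code point boundary.
        while !s.is_char_boundary(end) {
            end -= 1;
        }
        &s[..end]
    }

    fn main() {
        let s = "héllo"; // 'é' occupies two bytes
        assert_eq!(truncate_utf8(s, 2), "h"); // cutting at 2 would split 'é'
        assert_eq!(truncate_utf8(s, 3), "hé");
        assert_eq!(truncate_utf8(s, 9), "héllo");
    }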
| heyplanet wrote:
| I think UTF-8 was a mistake.
|
| It is a pain in the ass to have a variable number of bytes per
| char.
|
| In ASCII, you could easily know every character personally. No
| strange surprises.
|
| Also no surprises while reading black-on-white text and
| suddenly being confronted with clors [1].
|
| [1] Also no surprises when writing a comment on HN like this
| one and having some characters stripped. I put in a smiley as
| the first "o" in colors, but it was stripped out. Looks like
| the makers of HN don't like UTF-8 either.
| goatinaboat wrote:
| Certain things such as DNS, email addresses and so on should be
| restricted to ASCII; it's a security nightmare otherwise.
| bartwe wrote:
| I assume you mean a limited subset of 7-bit ASCII? 33-126
| JdeBP wrote:
|     % host -t a $'\015'.
|     1 \015: 19 bytes, 1+0+0+0 records, response,
|     authoritative, nxdomain query: 1 \015
|     %
|
| It's not as straightforward or sensible as you think. It's
| case insensitive; it's case preserving; and C0 control
| characters, SPC, and DEL are allowed. The case-differentiating
| bits for letters are nowadays sometimes used in an attempt to
| foil attackers. If you want things to look back on and say "I
| think that X was a mistake," then forget UTF of any stripe. The
| DNS is full of them.
| zokier wrote:
| I thought DNS allowed any arbitrary byte sequence as a label
| (up to the max length limit)
| DagAgren wrote:
| You can't even write proper English in ASCII. ASCII is an
| absolute dead end. It's history.
|
| Actually representing human language is HARD. It is also
| absolutely necessary. Whatever solution you choose is going to
| be complicated, because it is solving a very complicated
| problem.
|
| Throwing your hands up and going "oh this is too hard, I don't
| like it" will get you nowhere.
| kazinator wrote:
| You can't write proper _snooty_ English in ASCII, with
| diaereses and whatnot.
| DagAgren wrote:
| ASCII doesn't have all the punctuation regularly used in
| English.
| kazinator wrote:
| ASCII doesn't have a direct representation of all the
| punctuation used in English _print_, like 66/99-style curly
| quotes, and different kinds of dashes (distinct from minus).
| For non-print, it's entirely fine.
|
| Typesetting should be handled by a markup language anyway.
| Adding a few characters to Notepad doesn't create a typesetting
| system. A typesetting system needs to be able to do kerning,
| ligatures, justification. Not to mention bold, italics, and
| different fonts.
| kps wrote:
| 1967 ASCII anticipated that, with dual-use character shapes, so
| you could type o BS " to get ö.
|
| But then people invented video terminals that didn't
| overstrike.
| magicalhippo wrote:
| > It is a pain in the ass to have a variable number of bytes
| per char.
|
| In the same vein, it's a pain in the ass to write everything in
| assembler. Which is why we don't do that; we use high-level
| languages instead.
| thechao wrote:
| You're conflating code points and _some_ encoding; more
| importantly, you're conflating "array of encoded objects
| (bytes)" with "a string of text". They're not -- and never have
| been -- the same.
| kllrnohj wrote:
| > It is a pain in the ass to have a variable number of bytes
| per char.
|
| This is from API & language mistakes more than an issue with
| UTF-8 itself.
|
| If you actually design your API & system around being UTF-8,
| like Rust did, then there's really no issue for the programmer.
| The API enforces the rules, and still gives you things like a
| simple character iterator (with characters being 32-bit, so
| that it actually fits:
| https://doc.rust-lang.org/std/char/index.html). The String
| class handles all the multi-byte stuff for you; you never "see"
| it: https://doc.rust-lang.org/std/string/struct.String.html
|
| Retrofitting this into existing languages isn't going to be
| _easy_, but that's not an excuse to not do it at all, either.
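A quick illustration of the iterator kllrnohj mentions: Rust's
chars() decodes the multi-byte UTF-8 storage on the fly into 32-bit
scalar values, so the programmer never indexes raw bytes:

    fn main() {
        let s = "naïve"; // 'ï' occupies two bytes in UTF-8
        // chars() yields one 32-bit scalar value per character.
        for c in s.chars() {
            print!("[{}]", c);
        }
        println!();
        println!("{} chars, {} bytes", s.chars().count(), s.len()); // 5, 6
    }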
| jandrese wrote:
| > It is a pain in the ass to have a variable number of bytes
| per char.
|
| Maybe, but nobody can stomach the wasted space you get with
| UTF-32 in almost every situation. The encoding-time tradeoff
| was considered less objectionable than making most of your text
| two or four times larger.
| FabHK wrote:
| And as the article points out, even then you might have more
| than one code point per character.
|
| > For example, the only way to represent the abstract character
| iu _cyrillic small letter yu with acute_ is by the sequence
| U+044E _cyrillic small letter yu_ followed by U+0301 _combining
| acute accent_.
| jodrellblank wrote:
| > _Q: What do you think about Byte Order Marks? A: According to
| the Unicode Standard (v6.2, p.30): "Use of a BOM is neither
| required nor recommended for UTF-8". [...] Using BOMs would
| require all existing code to be aware of them, even in simple
| scenarios as file concatenation. This is unacceptable._
|
| Then your site "UTF-8 Everywhere" is misnamed, because
| standards-following UTF-8 can have a BOM. It's not required or
| recommended, but it is possible and allowable, so you might see
| them, and if you follow the standard you have to deal with
| them. It's not a matter of "this would require all existing
| code to handle them" - that is not hypothetical, that is the
| current world; to be standards-compliant, all existing code
| _does already_ need to be aware of them. It isn't, which means
| it's broken. Declaring it "unacceptable" is meaningless, except
| to say you're rejecting the standard and doing something
| incompatible and broken because it's easier.
|
| Which is a position one can take and defend, but it's not a
| good position for a site claiming to be pushing for people to
| follow the standard. What it is, is yet another non-standard
| ad-hoc variant defined by what some subset of tools the authors
| use can/can't handle in April 2020.
|
| > "_the UTF-8 BOM exists only to manifest that this is a UTF-8
| stream_"
|
| Throwing the word "only" in there doesn't make it go away. It
| exists as a standards-compliant way to distinguish UTF-8 from
| ASCII, not recommended but not forbidden.
|
| > "_A: Are you serious about not supporting all of Unicode in
| your software design? And, if you are going to support it
| anyway, how does the fact that non-BMP characters are rare
| practically change anything_"
|
| Well, in the same way, how does the fact that UTF-8+BOM is rare
| practically change anything? At some level you're either
| pushing for everyone to follow standards even if it's
| inconvenient, because that makes life better for everyone
| overall, like you are with surrogate pairs and indexing, or
| you're creating another ad-hoc incompatible variation of UTF-8
| which you prefer to the standard and trying to strong-arm
| everyone else into using it with threats of being incompatible
| with all the code which already does it wrong.
|
| Being wary of Chesterton's Fence, presumably there's some
| company or system which got UTF-8+BOM added to the standard
| because they wanted it, or needed it.
| jodrellblank wrote:
| Downvoting doesn't make the BOM stop being part of the standard
| either, btw.
|
| Yes, supporting a BOM on arbitrary UTF-8 streams varies between
| difficult and impossible, but then get it removed from the
| standard, or state that you don't support the standard. Don't
| pretend you support the standard while ignoring the bits you
| don't like; that's dishonest and unhelpful.
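Whichever side of this argument one takes, code that reads UTF-8 has
to cope with a possible BOM; a minimal sketch of detecting and
stripping one, assuming the rest of the input is ordinary UTF-8:

    // Strip a leading UTF-8 BOM (EF BB BF) if present; otherwise
    // return the input unchanged.
    fn strip_utf8_bom(bytes: &[u8]) -> &[u8] {
        bytes.strip_prefix(&[0xEF, 0xBB, 0xBF]).unwrap_or(bytes)
    }

    fn main() {
        let with_bom: &[u8] = b"\xEF\xBB\xBFhello";
        assert_eq!(strip_utf8_bom(with_bom), &b"hello"[..]);
        assert_eq!(strip_utf8_bom(b"hello"), &b"hello"[..]);
    }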
| alkonaut wrote:
| 100% agree.
|
| > using BOMs would require all existing code to be aware of
| them, even in simple scenarios as file concatenation
|
| Absolutely! Any app that writes UTF-8 files can (and probably
| should) avoid writing them. But any program that reads UTF-8
| files _must_ handle a BOM. A lot of apps write UTF-8 including
| the BOM by default, for example Visual Studio.
|
| You can NOT concatenate two UTF-8 streams and expect the result
| to be one sensible UTF-8 document: if the second stream starts
| with a BOM, that BOM ends up embedded in the middle of the
| text. NO tool should assume otherwise, ever.
| malkia wrote:
| Still doesn't solve the fact that filesystems across different
| OSes allow invalid UTF-8 sequences in filenames.
|
| Maybe 99% of apps do not care, but even a simple "cp" tool
| should care. Filenames (and maybe other named resources) should
| be treated completely differently, and not blindly assumed to
| be UTF-8 compatible.
| ken wrote:
| To me, that's a design flaw. Would we really be any worse off
| if we simply declared that filenames must be UTF-8?
|
| That seems to be the only case where a user-visible and user-
| editable field is allowed to be an arbitrary byte sequence, and
| its primary purpose seems to be allowing this argument to pop
| up on HN every month.
|
| I've never seen any non-malicious use of it. All popular
| filesystems already disallow specific sets of ASCII characters
| in names. Any database which needs to save data in files by
| number has no problem using safe hex filenames.
| ChrisSD wrote:
| Sure, we could declare that, but then what? Non-unicode
| filenames won't suddenly disappear. Operating systems won't
| suddenly enforce unicode. Filesystems will still allow non-
| unicode names.
|
| Simply declaring it doesn't help anybody. In the meantime your
| application still needs to handle non-unicode filenames;
| otherwise those malicious ones are free to be malicious.
| AnIdiotOnTheNet wrote:
| If unicode had a set of "explicitly this byte" codepoints, it
| would be simple to deal with: just pass the invalid bytes of
| the filename in that way.
| PeterisP wrote:
| I'd assume that the proper place for defining what's a valid
| filename would be at the filesystem level, so a filesystem of
| standard ABC v123 would not allow non-unicode names; non-
| unicode filenames would either get refused or modified upon
| copying/writing them to the filesystem.
|
| This is not new; it would match the current behavior of the
| OS/filesystem enforcing other character restrictions, such as
| when writing (for example) a file name with an asterisk or
| colon to a FAT32 USB flash drive.
| mark-r wrote:
| Once you lose the expectation of being able to work with non-
| unicode filenames, those files will quickly get renamed and
| cease to be a problem.
| ChrisSD wrote:
| How can you rename them if you can only use unicode paths?
| mark-r wrote:
| You would need to use some special utility created just for
| that purpose.
| bjourne wrote:
| As long as the tool for _renaming_ files handles non-UTF-8
| filenames, you'd be fine.
| qiqitori wrote:
| Are you saying that operating systems (i.e. the kernel) should
| check and enforce encodings in filenames?
|
| 1) Why?
|
| 2) Bye bye backward compatibility and interoperability
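To illustrate what handling a non-unicode filename looks like in
practice, a Unix-only Rust sketch; the Latin-1 byte string is a
made-up example of a name that is legal on disk but not valid UTF-8:

    use std::ffi::OsStr;
    use std::os::unix::ffi::OsStrExt; // Unix-only

    fn main() {
        // Latin-1 bytes for "café.txt"; 0xE9 alone is not valid UTF-8.
        let raw: &[u8] = b"caf\xE9.txt";
        let name = OsStr::from_bytes(raw);
        assert!(name.to_str().is_none()); // not representable as &str
        // Display requires the lossy route: 0xE9 becomes U+FFFD.
        println!("lossy: {}", name.to_string_lossy());
    }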
| ghettoimp wrote:
| Backward compatibility is a laudable goal and is not to be
| broken lightly. But sometimes things are so fundamentally
| broken that we would be far better off with a clean break.
|
| Interoperability is quite possibly a good argument _for_ coming
| up with some reasonable restrictions on filenames. Today you
| could easily create a ZIP file or similar (case-sensitive
| names, special characters, etc.) that cannot be successfully
| extracted on this platform or that.
|
| In an excellent article, David A. Wheeler [1] lays out a
| compelling case against the status quo. TL;DR: bad filenames
| are too hard to handle correctly. Programs, standards, and
| operating systems already assume there are no bad filenames.
| Your programs will fail in numerous ways when they encounter
| bad filenames. Some of these failures are security problems.
|
| He concludes: "In sum: It'd be far better if filenames were
| more limited so that they would be safer and easier to use.
| This would eliminate a whole class of errors and
| vulnerabilities in programs that "look correct" but subtly fail
| when unusual filenames are created (possibly by attackers)." He
| goes on to consider many ideas towards getting to this goal.
|
| [1] https://dwheeler.com/essays/fixing-unix-linux-filenames.html
| wtetzner wrote:
| It sounds like they're saying the opposite. All programs
| dealing with filenames need to be able to support an arbitrary
| stream of bytes; they can't just assume UTF-8.
| masklinn wrote:
| > 2) Bye bye backward compatibility and interoperability
|
| It's already not really a thing.
|
| Traditional unices allow arbitrary bytes with the exception of
| 00 and 2f, NTFS allows arbitrary _utf-16 code units_ (including
| unpaired surrogates) with the exception of 0000 and 002f, and I
| think HFS+ requires valid UTF-16 and allows everything
| (including NUL).
|
| The OS then adds its own limitations, e.g. win32 forbids \, :,
| *, ", ?, <, >, | (as well as a few special names, I think) and
| OSX forbids 0000 and 003a (":"), the latter of which gets
| converted to and from "/" (and similarly forbidden) by the
| POSIX compatibility layer.
|
| The latter is really weird to see in action, if you have access
| to an OSX machine: open a terminal, try to create a file called
| "/" and it'll fail. Now create one called ":". Switch over to
| the Finder, and you'll see that the file is now called "/" (and
| creating a file called ":" fails).
|
| Oh yeah, and ZFS doesn't really care but can require that all
| paths be valid UTF-8 (by setting the utf8only flag).
___________________________________________________________________
(page generated 2020-04-14 23:01 UTC)