[HN Gopher] How UTF-8 Works ___________________________________________________________________ How UTF-8 Works Author : SethMLarson Score : 206 points Date : 2022-02-08 14:56 UTC (8 hours ago) (HTM) web link (sethmlarson.dev) (TXT) w3m dump (sethmlarson.dev) | brian_rak wrote: | This was presented well. A follow up for unicode might be in | order! | SethMLarson wrote: | Glad you enjoyed! Unicode and how it interacts with other | aspects of computers (IDNA, NFKC, grapheme clusters, etc) are | some of the spaces I want to explore more. | dspillett wrote: | Not sure if the issue is with Chrome or my local config generally | (bog standard Windows, nothing fancy), but the us-flag example | doesn't render as intended. It shows as "US" with the components | in the next step being "U" and "S" (not the ASCII characters U & | S; the encoding is as intended, but those characters are shown in | place of the intended flag). | | Displays as I assume intended in Firefox on the same machine: | the American flag emoji, then, when broken down in the next step, | U-in-a-box & S-in-a-box. The other examples seem fine in Chrome. | | Take care when using relatively new additions to the Unicode | emoji set, and test to make sure your intentions are correctly | displayed in all the browsers you might expect your audience to | be using. | SethMLarson wrote: | Yeah, there's not much I can do there unfortunately (since I'm | using SVG with the actual U and S emojis to show the flag). I | can't comment on whether it's your config or not, but I've | tested the SVGs on iOS and Firefox/Chrome on desktop to make | sure they rendered nicely for most people. Sorry you aren't | getting a great experience there. | | Here's how it's rendering for me on Firefox: | https://pasteboard.co/rjLtqANVQUIJ.png | xurukefi wrote: | For me it also renders like this on Chrome/Windows: | https://i.imgur.com/HCJTpfA.png | | Really nice diagrams nevertheless | andylynch wrote: | They aren't new (2010) - this is a Windows thing - speculation | is it's a policy decision to avoid awkward conversations with | various governments (presumably large customers) about TW, PS | and others -- see long discussion here for instance | https://answers.microsoft.com/en-us/windows/forum/all/flag-e... | zaik wrote: | Those diagrams look really good. How were they made? | jeremieb wrote: | The author mentions at the end of the article that he spent a | lot of time on https://www.diagrams.net/. :) | nayuki wrote: | Excellent presentation! One improvement to consider is that many | usages of "code point" should be "Unicode scalar value" instead. | Basically, you don't want to use UTF-8 to encode UTF-16 surrogate | code points (which are not scalar values). | | Fun fact, UTF-8's prefix scheme can cover up to 31 payload bits. | See https://en.wikipedia.org/wiki/UTF-8#FSS-UTF , section | "FSS-UTF (1992) / UTF-8 (1993)". | | A manifesto that was much more important ~15 years ago when UTF-8 | hadn't completely won yet: https://utf8everywhere.org/ | masklinn wrote: | > Fun fact, UTF-8's prefix scheme can cover up to 31 payload | bits. | | It'd probably be more correct to say that it was originally | defined to cover 31 payload bits: you can easily complete the | first byte to get 7 and 8 byte sequences (35 and 41 bit | payloads). | | Alternatively, you could save the 11111111 leading byte to flag | the following bytes as counts (5 bits each since you'd need a | flag bit to indicate whether this was the last), then add the | actual payload afterwards; this would give you an infinite-size | payload, though it would make the payload size dynamic and | streamed (where currently you can get the entire USV in two | fetches, as the first byte tells you exactly how many | continuation bytes you need).
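To make the payload-bit arithmetic in this sub-thread concrete, here is a minimal Python sketch of the standard four-length layout from RFC 3629 (the helper name is illustrative, not from the article or the thread):

      # RFC 3629 layout: header octet pattern and total payload bits per length.
      #   1 octet:  0xxxxxxx                             ->  7 bits
      #   2 octets: 110xxxxx 10xxxxxx                    -> 11 bits
      #   3 octets: 1110xxxx 10xxxxxx 10xxxxxx           -> 16 bits
      #   4 octets: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx  -> 21 bits
      def utf8_octet_count(scalar: int) -> int:
          """How many octets UTF-8 needs for a Unicode scalar value."""
          if scalar < 0x80:
              return 1
          if scalar < 0x800:
              return 2
          if scalar < 0x10000:
              return 3
          return 4  # valid scalar values stop at 0x10FFFF

      assert utf8_octet_count(0x1F602) == len("\U0001F602".encode("utf-8")) == 4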
| SethMLarson wrote: | Yeah the current definition is restricted to 4 octets in RFC | 3629. Really interesting to see the history of ranges UTF-8 | was able to cover. | CountSessine wrote: | _Basically, you don't want to use UTF-8 to encode UTF-16 | surrogate code points_ | | The awful truth is that there is such a beast. UTF-8 wrapper | with UTF-16 surrogate pairs. | | https://en.wikipedia.org/wiki/CESU-8 | nayuki wrote: | Is CESU-8 a synonym of WTF-8? | https://en.wikipedia.org/wiki/UTF-8#WTF-8 ; | https://simonsapin.github.io/wtf-8/ | bhawks wrote: | Utf8 is one of the most momentous and under-appreciated / | relatively unknown achievements in software. | | A sketch on a diner placemat has led to every person in the | world being able to communicate written language digitally using | a common software stack. Thanks to Ken Thompson and Rob Pike we | have avoided the deeply siloed and incompatible world that code | pages, wide chars and other insufficient encoding schemes were | guiding us towards. | cryptonector wrote: | And stayed ASCII-compatible. And did not have to go to wide | chars. And it does not suck. And it resynchronizes. And... | ahelwer wrote: | It really is wonderful. I was forced to wrap my head around it | in the past year while writing a tree-sitter grammar for a | language that supports Unicode. Calculating column position | gets a whole lot trickier when the preceding codepoints are of | variable byte-width! | | It's one of those rabbit holes where you can see people whose | entire career is wrapped up in incredibly tiny details like | what number maps to what symbol - and it can get real | political! | GekkePrutser wrote: | It's great as a global character set and really enabled the | world to move ahead at just the right time. | | But the whole emoji modifier (e.g. guy + heart + lips + girl = | one kissing couple character) thing is a disaster. Too many | rules made up on the fly that make building an accurate parser | a nightmare. It should have either specified this strictly and | consistently as part of the standard, or just left it out for a | future standard to implement, and just used separate | codepoints for the combinations that were really necessary. | | This complexity is also something that has led to multiple | vulnerabilities, especially on mobiles. | | See here all the combos: | https://unicode.org/emoji/charts/full-emoji-modifiers.html
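For a sense of what the emoji-modifier machinery above means in practice, here is a small Python check of the standard "kiss" ZWJ sequence: one user-perceived character built from several scalar values (the byte count assumes UTF-8; this is only an illustration of the point, not anything from the article):

      # woman, ZWJ, heavy black heart, variation selector-16, ZWJ, kiss mark, ZWJ, man
      couple = "\U0001F469\u200D\u2764\uFE0F\u200D\U0001F48B\u200D\U0001F468"
      print(len(couple))                  # 8 code points
      print(len(couple.encode("utf-8")))  # 27 octets
      # ...yet a renderer that knows this sequence draws it as a single glyph.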
| inglor_cz wrote: | As a young Czech programming acolyte in the late 1990s, I had | to cope with several competing 8-bit encodings. It was a pure | nightmare. | | Long live UTF-8. Finally I can write any Central European name | without mutilating it. | [deleted] | Simplicitas wrote: | I still wanna know WHICH Jersey diner it was invented in! :-) | jsrcout wrote: | This may be the first explanation of Unicode representation that | I can actually follow. Great work. | SethMLarson wrote: | Wow, thank you for the kind words. You've made my morning!! | ctxc wrote: | Such clean presentation, refreshing. | RoddaWallPro wrote: | I spent 2 hours last Friday trying to wrap my head around what | UTF-8 was (https://www.joelonsoftware.com/2003/10/08/the-absolute-minim | is great, but doesn't explain the inner workings like this does) | and completely failed, could not understand it. This made it | super easy to grok, thank you! | karsinkk wrote: | The following article is one of my favorite primers on Character | sets/Unicode : | https://www.joelonsoftware.com/2003/10/08/the-absolute-minim... | jokoon wrote: | I wonder how large a font must be to display all UTF8 | characters... | | I'm also waiting for new emojis; they recently added more and | more that can be used as icons, which is simpler than integrating | PNG or SVG icons. | banana_giraffe wrote: | OpenType makes this impossible. A glyph has an index of a | UINT16, so you can't fit all of the ~143k Unicode characters. | | There are some attempts at font families to cover the majority | of characters. Like Noto ( https://fonts.google.com/noto/fonts | ), broken out into different fonts for different regions. | | Or, Unifont's ( http://www.unifoundry.com/ ) goal of gathering | the first 65536 code points in one font, though it leaves a lot | to be desired if you actually use it as a font. | dspillett wrote: | Take care using recently added Unicode entries, unless you have | some control of your user-base and when they update, or are | providing a custom font that you know has those items | represented. You could be giving out broken-looking UI to many | if their setup does not interpret the newly assigned codes | correctly. | jvolkman wrote: | Rob Pike wrote up his version of its inception almost 20 years | ago. | | The history of UTF-8 as told by Rob Pike (2003): | http://doc.cat-v.org/bell_labs/utf-8_history | | Recent HN discussion: | https://news.ycombinator.com/item?id=26735958 | filleokus wrote: | Recently I learned about UTF-16 when doing some stuff with | PowerShell on Windows. | | In parallel with my annoyance with Microsoft, I realized how long | it's been since I encountered any kind of text encoding drama. As | a regular typer of åäö, many hours of my youth were spent on | configuring shells, terminal emulators, and IRC clients to use | compatible encodings. | | The wide adoption of UTF-8 has been truly awesome. Let's just | hope it's another 15-20 years until I have to deal with UTF-16 | again... | legulere wrote: | There's increasing support for UTF-8 as an ANSI codepage on | Windows. And UTF-8 support is also part of the modernization of | the terminal: | https://devblogs.microsoft.com/commandline/windows-command-l... | ChrisSD wrote: | There are many reasons why UTF-8 is a better encoding but | UTF-16 does at least have the benefit of being simpler. Every | scalar value is either encoded as a single unit or a pair of | units (leading surrogate + trailing surrogate). | | However, Powershell (or more often the host console) has a lot | of issues with handling Unicode. This has been improving in | recent years but it's still a work in progress. | masklinn wrote: | > There are many reasons why UTF-8 is a better encoding but | UTF-16 does at least have the benefit of being simpler. Every | scalar value is either encoded as a single unit or a pair of | units (leading surrogate + trailing surrogate). | | UTF16 is really not noticeably simpler. Decoding UTF8 is | really rather straightforward in any language which has even | minimal bit-twiddling abilities. | | And that's assuming you need to write your own encoder or | decoder, which seems unlikely.
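As a rough illustration of the bit-twiddling masklinn describes, here is a minimal Python decoder for a single scalar value (a sketch only: it does not reject overlong encodings, surrogate code points, or truncated input, which a real decoder must):

      def decode_one(buf: bytes) -> tuple:
          """Decode the first UTF-8 sequence in buf; return (scalar value, octets used)."""
          b0 = buf[0]
          if b0 < 0x80:                 # 0xxxxxxx: ASCII, done in one octet
              return b0, 1
          if b0 >> 5 == 0b110:          # 110xxxxx: header of a 2-octet sequence
              count, value = 2, b0 & 0x1F
          elif b0 >> 4 == 0b1110:       # 1110xxxx: header of a 3-octet sequence
              count, value = 3, b0 & 0x0F
          elif b0 >> 3 == 0b11110:      # 11110xxx: header of a 4-octet sequence
              count, value = 4, b0 & 0x07
          else:
              raise ValueError("invalid header octet")
          for b in buf[1:count]:
              if b >> 6 != 0b10:        # every tail octet must look like 10xxxxxx
                  raise ValueError("invalid tail octet")
              value = (value << 6) | (b & 0x3F)
          return value, count

      assert decode_one(b"\xf0\x9f\x98\x82") == (0x1F602, 4)  # U+1F602, the article's example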
| Fill1200 wrote: | I have a MySQL database, which has a large amount of Japanese | text data. When I convert it from UTF8 to UTF16, it certainly | reduces disk space. | tialaramex wrote: | UTF-16 _only_ makes sense if you were sure UCS-2 would be | fine, and then oops, Unicode is going to be more than 16 bits | and so UCS-2 won't work and you need to somehow cope anyway. | It makes zero sense to adopt this in greenfield projects | today, whereas Java and Windows, which had bought into UCS-2 | back in the early-mid 1990s, needed UTF-16 or else they would | need to throw all their 16-bit text APIs away and start over. | | UTF-32 / UCS-4 is fine but feels very bloated especially if a | lot of your text data is more or less ASCII, which if it's | not literally human text it usually will be, and feels a bit | bloated even on a good day (it's always wasting 11 bits per | character!) | | UTF-8 is a little more complicated to handle than UTF-16 and | certainly than UTF-32 but it's nice and compact, it's pretty | ASCII compatible (lots of tools that work with ASCII also | work fine with UTF-8 unless you insist on adding a spurious | UTF-8 "byte order mark" to the front of text) and so it was a | huge success once it was designed. | ChrisSD wrote: | As I said, there are many reasons UTF-8 is a better | encoding. And indeed a compact, backwards-compatible | encoding of ASCII is one of them. | glandium wrote: | It is less compact than UTF-16 for CJK languages, FWIW. | nwallin wrote: | > There are many reasons why UTF-8 is a better encoding but | UTF-16 does at least have the benefit of being simpler. | | Big endian or little endian? | cryptonector wrote: | LOL | ts4z wrote: | And did they handle surrogate pairs correctly? | | My team managed a system that did a read from user data, | doing input validation. One day we got a smart quote | character that happened to be > U+10000. But because the | data validation happened in chunks, we only got half of it. | Which was an invalid character, so input validation failed. | | In UTF-8, partial characters happen so often, they're | likely to get tested. In UTF-16, they are more rarely seen, | so things work until someone pastes in emoji and then it | falls apart. | [deleted] | daenz wrote: | Great explanation. The only part that tripped me up was in | determining the number of octets to represent the codepoint. From | the post: | | >From the previous diagram the value 0x1F602 falls in the range | for a 4 octets header (between 0x10000 and 0x10FFFF) | | Relying on the diagram in the post felt like a crutch. It | seems easier to remember the maximum number of "data" bits that | each octet layout can support (7, 11, 16, 21). Then by knowing | that 0x1F602 maps to 11111011000000010, which is 17 bits, you | know it must fit into the 4-octet layout, which can hold 21 bits. | mananaysiempre wrote: | As the continuation bytes always bear the payload in the low 6 | bits, Connor Lane Smith suggests writing them out in octal[1]. | Though the fact that 3 octets of UTF-8 precisely cover the BMP is | also quite convenient and easy to remember (but perhaps don't use | that like MySQL did[2]?..). | | [1] http://www.lubutu.com/soso/write-out-unicode-in-octal | | [2] https://mathiasbynens.be/notes/mysql-utf8mb4
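Following daenz's reasoning above (0x1F602 needs 17 bits, so it lands in the 4-octet layout with room for 21), the encode direction can be written out by hand in a few lines of Python; the bit arithmetic below is the standard RFC 3629 layout, shown only as an illustration:

      cp = 0x1F602                        # 0b1_1111_0110_0000_0010: 17 bits, so 4 octets
      octets = bytes([
          0xF0 | (cp >> 18),              # 11110xxx header carries the top 3 bits
          0x80 | ((cp >> 12) & 0x3F),     # 10xxxxxx next 6 bits
          0x80 | ((cp >> 6) & 0x3F),      # 10xxxxxx next 6 bits
          0x80 | (cp & 0x3F),             # 10xxxxxx lowest 6 bits
      ])
      assert octets == b"\xf0\x9f\x98\x82" == chr(cp).encode("utf-8")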
| bumblebritches5 wrote: | riwsky wrote: | How UTF-8 works? | | "pretty well, all things considered" | who-shot-jr wrote: | Fantastic! Very well explained. | SethMLarson wrote: | Thanks for the kind comment :) | BitwiseFool wrote: | I feel the same way as the GP, great work. I also appreciate | how clean and easy to read the diagrams are. | nabla9 wrote: | > _NOTE: You can always find a character boundary from an | arbitrary point in a stream of octets by moving left an octet | each time the current octet starts with the bit prefix 10 which | indicates a tail octet. At most you'll have to move left 3 | octets to find the nearest header octet._ | | This is incorrect. You can only find boundaries between code | points this way. | | Until you learn that not all "user perceived characters" | (grapheme clusters) can be expressed as a single code point, | Unicode seems cool. These UTF-8 explanations explain the encoding | but leave out this unfortunate detail. The author might not even | know this because they deal with a subset of Unicode in their | life. | | If you want to split text between two user-perceived characters, | not merely between two code points, this tutorial does not help. | | Unicode encodings are great if you want to handle a subset of | languages and characters; if you want to be complete, it's a | mess. | SethMLarson wrote: | You're right, that should read "codepoint boundary" not | "character boundary". I can fix that. | | I do briefly mention grapheme clusters near the end, didn't | want to introduce them as this article was more about the | encoding mechanism itself. Maybe a future article after more | research :) | nabla9 wrote: | Please do. You have the best visualizations of UTF-8 I have | seen so far. | | Usually people write just the UTF-8 encoding part, then don't | mention the rest of Unicode, because it's clearly not as | good and simple. | mark-r wrote: | UTF-8 is one of the most brilliant things I've ever seen. I only | wish it had been invented and caught on before so many | influential bodies started using UCS-2 instead. | SethMLarson wrote: | 100% agree, it's really rare that there's a ~blanket solution | to a whole class of problems. "Just use UTF-8!" | BiteCode_dev wrote: | Like anything new, people had a hard time with it at the | beginning. | | I remember that I got a home assignment in an interview for a | PHP job. The person evaluating my code said I should not have | used UTF8, which causes "compatibility problems". At the time, | I didn't know better, and I answered that no, it was explicitly | created to solve compatibility problems, and that they just | didn't understand how to deal with encoding properly. | | Needless to say, I didn't get the job :) | | Same with Python 2 code. So many people, when migrating to | Python 3, suddenly thought Python 3 encoding management was | broken, since it was raising so many UnicodeDecodeError | exceptions. | | Only much later did people realize the huge number of programs | that couldn't deal with non-ASCII characters in file paths, html | attributes or user names, because they just implicitly assumed | ASCII. "My code used to work fine", they said. But it worked | fine on their machine, set to an English locale, tested only | using ASCII plain text files in their ASCII-named directories | with their ASCII last name.
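The failure mode BiteCode_dev describes is easy to reproduce in a couple of lines of Python; the bytes below stand in for data written long ago under a legacy 8-bit code page (an illustration only):

      legacy = b"na\xefve"                # "naive" with an accented i, as stored under Latin-1
      legacy.decode("utf-8")              # raises UnicodeDecodeError: invalid continuation byte
      legacy.decode("latin-1")            # works once you admit the data isn't ASCII or UTF-8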
| SAI_Peregrinus wrote: | My Slack name at work is "This name iss a valid POSIX path". | My hope is that it serves as an amusing reminder to consider | things like spaces and non-ASCII characters. | andrepd wrote: | That's in general a problem with dynamic languages with weak | type systems. How "Your code runs without crashing" is really | really != "your code works". How do people even manage | production python! A bug could be lurking anywhere, | undetected until it's actually run. Whereas in a compiled | language with a strong type system, "your code compiles" is | much closer to "your code is correct". | [deleted] | digisign wrote: | There are a number of mitigations, so those kinds of bugs | are quite rare. In our large code base, about 98% of bugs | we find are of the "we need to handle another case" | variety. Pyflakes quickly finds typos, which eliminates most | of the rest. | BiteCode_dev wrote: | I don't think a type system can help you with decoding a | file with the wrong charset. | morelisp wrote: | Python 3 encoding management was broken, because it tried to | impose Unicode semantics on things that were actually byte | streams. For anyone _actually correctly handling encodings_ | in Python 2 it was awful because suddenly the language | runtime was hiding half the data you needed. | BiteCode_dev wrote: | Nowadays, passing bytes to any os function returns bytes | objects, not unicode. You'll get strings if you pass string | objects, though, and they will be using utf8 surrogate | escaping. | junon wrote: | There are (a few) very good reasons not to use UTF-8. It's a | great encoding but not suitable for all cases. | | For example, constant time subscripting, or improved length | calculations, are made possible by encodings other than utf-8. | | But when performance isn't critical, utf-8 should be the | default. I don't see a reason for any other encoding. | [deleted] | jfk13 wrote: | > For example, constant time subscripting, or improved length | calculations, are made possible by encodings other than | utf-8. | | Assuming you mean different encoding forms of Unicode (rather | than entirely different and far less comprehensive character | sets, such as ASCII or Latin-1), there are very few use cases | where "subscripting" or "length calculations" would benefit | significantly from using a different encoding form, because | it is rare that individual Unicode code points are the most | appropriate units to work with. | | (If you're happy to sacrifice support for most of the world's | writing systems in favour of raw performance for a limited | subset of scripts and text operations, that's different.) | ninkendo wrote: | Constant time subscripting is a myth. There's nothing(*) | useful to be obtained by adding a fixed offset to the base of | your string, in _any_ unicode encoding, including UTF-32. | | If you're hoping that a fixed offset gives you a | user-perceived character boundary, then you're not handling | composed characters or zero-width-joiners or any number of | other things that may cause a grapheme cluster to be composed | of multiple UTF code points. | | The "fixed" size of code points in encodings like UTF-32 gives | you just that: code points. Whether a code point corresponds with | anything useful, like the boundary of a visible character, | will always require linear-time indexing of the string, in | any encoding. | | (*) Approximately nothing. If you're in a position where | you've somehow already vetted that the text is of a subset of | human languages where you're guaranteed to never have | grapheme clusters that occupy more than a single code point, | then you maybe have a use case for this, but I'd argue you | really just have a bunch of bugs waiting to happen. | irq-1 wrote: | > Constant time subscripting is a myth.
There's nothing(*) | useful to be obtained by adding a fixed offset to the base | of your string, in any unicode encoding, including UTF-32. | | What about UTF-256? Maybe not today, maybe not tomorrow, | but someday... | ts4z wrote: | I know you're kidding, but I want to note that UTF-256 | isn't enough. There's an Arabic ligature that decomposes | into 20 codepoints. That was already in Unicode 20 years | ago. You can probably do something even crazier with the | family emoji. These make "single characters" that do not | have precomposed forms. | pjscott wrote: | Also, if you want O(1) indexing by grapheme cluster you | can get that with less memory overhead by precomputing a | lookup table of the location in the string where you can | find every k-th grapheme cluster, for some constant k >= | 1. (This requires a single O(n) pass through the string | to build the index, but you were always going to have to | make at least one such pass through the string for other | reasons.) | wizzwizz4 wrote: | Some characters are longer than 32 codepoints. | josephg wrote: | Absolutely. At least it's well supported now in very old | languages (like C) and very new languages (like Rust). But | Java, Javascript, C# and others will probably be stuck using | UCS-2 forever. | HideousKojima wrote: | There's actually a proposal with a decent amount of support | to add utf-8 strings to C#. Probably won't be added to the | language for another 3 or 4 years (if ever) but it's not | outside the realm of possibility. | | Edit: The proposal for anyone interested | https://github.com/dotnet/runtime/issues/933 | stewx wrote: | What is stopping people from encoding their Java, JS, and C# | files in UTF-8? | mark-r wrote: | Nothing at all, and in fact there's a site set up | specifically to advocate for this: | https://utf8everywhere.org/ | | The biggest problem is when you're working in an ecosystem | that uses a different encoding and you're forced to convert | back and forth constantly. | | I like the way Python 3 does it - every string is Unicode, | and you don't know or care what encoding it is using | internally in memory. It's only when you read or write to a | file that you need to care about encoding, and the default | has slowly been converging on UTF-8. | maskros wrote: | Nothing, but Java's "char" type is always going to be | 16-bit. | josephg wrote: | Yep. In javascript (and Java and C# from memory) the | String.length property is based on the encoding length in | UTF16. It's essentially useless. I don't know if I've | _ever_ seen a valid use for the javascript String.length | field in a program which handles Unicode correctly. | | There are 3 valid (and useful) ways to measure a string | depending on context: | | - Number of Unicode characters (useful in collaborative | editing) | | - Byte length when encoded (these days usually in utf8) | | - and the number of rendered grapheme clusters | | All of these measures are identical in ASCII text - which | is an endless source of bugs. | | Sadly these languages give you a deceptively useless | .length property and make you go fishing when you want to | make your code correct.
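A quick Python illustration of the three measures josephg lists (grapheme-cluster counting is not in the standard library; the third-party regex module's \X pattern is one way to get it):

      s = "cafe\u0301"                    # 'cafe' followed by U+0301 COMBINING ACUTE ACCENT
      print(len(s))                       # 5 code points
      print(len(s.encode("utf-8")))       # 6 octets (the combining accent takes two)
      # import regex; len(regex.findall(r"\X", s))  -> 4 grapheme clusters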
| tialaramex wrote: | Java's char is a strong competitor for the most stupid "char" | type award. | | I would give it to Java outright if not for the fact that | C's char type _doesn't define how big it is at all, nor | whether it is signed_. In practice it's probably a byte, | but you aren't actually promised that, and even if it is | a byte you aren't promised whether this byte is treated | as signed or unsigned, that's implementation dependent. | Completely useless. | | For years I thought char was just pointless, and even | today I would still say that a high level language like | Java (or Javascript) should not offer a "char" type | because the problems you're solving with these languages | are so unlikely to make effective use of such a type as | to make it far from essential. Just have a string type, | and provide methods acting on strings, forget "char". But | Rust did show me that a strongly typed systems language | might actually have some use for a distinct type here | (Rust's char really does only hold the 21-bit Unicode | Scalar Values, you can't put arbitrary 32-bit values in | it, nor UTF-16's surrogate code points) so I'll give it | that. | mark-r wrote: | The only guarantee that C gives you is that sizeof char | == 1, and even that's not as useful as it looks. | SAI_Peregrinus wrote: | It also guarantees that char is at least 8 bits. | jasode wrote: | _> What is stopping [...] Java, JS, and C# files in UTF-8?_ | | The output of files on disk can be UTF-8. The continued use | of UCS-2 (later revised to UTF16) is happening _in the | runtime_ because things like the Win32 API, which C# uses, | are UCS-2. The internal raw memory layout of strings in | Win32 is UCS-2. | | *EDIT to add correction | kevin_thibedeau wrote: | Win32 narrow API calls support UTF-8 natively now. | mark-r wrote: | Code page 65001 has existed for a long time now, but it | was discouraged because there were a lot of corner cases | that didn't work. Did they finally get all the kinks out | of it? | kevin_thibedeau wrote: | Yes. Applications can switch code page on their own. | colejohnson66 wrote: | UTF-16*, not UCS-2. Although there are probably many | programs that assume UCS-2. | mark-r wrote: | When Windows adopted Unicode, I think the only encoding | available was UCS-2. They converted pretty quickly to | UTF-16 though, and I think the same is true of everybody | else who started with UCS-2. Unfortunately UTF-16 has its | own set of hassles. | nwallin wrote: | Note that the asterisk in `UTF-16*` is a _really_ big | asterisk. I fixed a UCS-16 bug last week at my day job. | DannyB2 wrote: | There is an error in the first example under Giant Reference | Card. | | The bytes come out as: | | 0xF0 0x9F 0x87 0xBA 0xF0 0x9F 0x87 0xBA | | but the bits directly above them all have the bit pattern: 010 | 10111 | SethMLarson wrote: | Great eye! I'll fix this and push it out. | bussyfumes wrote: | BTW here's a surprise I had to learn at some point: strings in JS | are UTF-16. Keep that in mind if you want to use the console to | follow this great article, you'll get the surrogate pair for the | emoji instead. | pierrebai wrote: | I never understood why UTF-8 did not use the _much_ simpler | encoding of:
|          0xxxxxxx -> 7 bits, ASCII compatible (same as UTF-8)
|          10xxxxxx -> 6 bits, more bits to come
|          11xxxxxx -> final 6 bits
| | It has multiple benefits:
|          - It encodes more bits per octet: 7, 12, 18, 24 vs 7, 11, 16, 21 for UTF-8
|          - It is easily extensible for more bits.
|          - Such extra bits extension is backward compatible for reasonable implementations.
| | The last point is key: UTF-8 would need to invent a new prefix to | go beyond 21 bits. Old software would not know the new prefix and | what to do with it.
With the simpler scheme, they could | potentially work out of the box up to at least 30 bits (that's a | billion code points, much more than the mere million of 21 bits). | | The | nephrite wrote: | From the Wikipedia article: | | Prefix code: The first byte indicates the number of bytes in | the sequence. Reading from a stream can instantaneously decode | each individual fully received sequence, without first having | to wait for either the first byte of a next sequence or an end- | of-stream indication. The length of multi-byte sequences is | easily determined by humans as it is simply the number of high- | order 1s in the leading byte. An incorrect character will not | be decoded if a stream ends mid-sequence. | | https://en.wikipedia.org/wiki/UTF-8#Comparison_with_other_en... | pierrebai wrote: | "instantaneously" in the sense of first having to read the | first byte to know how many bytes to read. So it's a two-step | process. Given the current maximum length and SIMD, detecting | the end-byte of my scheme is easily parallelizable for up to | 4 bytes, which conveniently goes to 24 bits, enough for all | current unicode code points, so there is no waiting for | termination. Furthermore, to decode a UTF-8 characters needs | bits extraction and shifting of all bytes, so there is no | practical gain of not looking at every byte. It actually | makes the decoding loop more complex. | | Also, the human readability sounds fishy. Humans are really | bad at decoding _high-order_ bits. For example can you tell | the length of a UTF-8 sequence that would begin with 0xEC at | a glance? With my scheme, either the high bit is not set | (0x7F or less), which is easy to see you only need to compare | the first digit to 7. Or the high bit is set and the high | nibble is less than 0xC, meaning there is another byte, also | easy to see, you compare the first digit to C. | | The quote also implicitly mis-characterized the fact that in | my scheme an incorrect character would also not be decoded if | interrupted since it would lack the terminating flag (No byte | > 0xC0). | masklinn wrote: | UTF-8 as defined (or restricted) is a _prefix code_ , it gets | all relevant information on the first read, and the rest on the | (optional) second. Your scheme requires an unbounded number of | reads. | | > - It is easily extensible for more bits. | | UTF8 already is easily extensible to more bits, either 7 | continuation bytes (and 42 bits), or infinite. Neither of which | is actually useful to its purposes. | | > The last point is key: UTF-8 would need to invent a new | prefix to go beyond 21 bits | | UTF8 was defined as encoding 31 bits over 6 bytes. It was | restricted to 21 bits (over 4 bytes) when unicode itself was | restricted to 21 bits. | cesarb wrote: | > UTF8 already is easily extensible to more bits, either 7 | continuation bytes (and 42 bits), or infinite. | | Extending UTF-8 to 7 continuation bytes (or more) loses the | useful property that the all-ones byte (0xFF) never happens | in a valid UTF-8 string. Limiting it to 36 bits (6 | continuation bytes) would be better. | edflsafoiewq wrote: | Why is that useful? | mananaysiempre wrote: | You can use FF as a sentinel byte internally (I think | utf8proc actually does that?); given that FE never | occurs, either, if you see the byte sequence | corresponding to U+FEFF BYTE ORDER MARK in one of the | other UTFs you can pretty much immediately tell it can't | possibly be UTF-8. 
(In general UTF-8, because of all the | self-synchronization redundancy, has a very distinctive | pattern that allows it to be detected with almost perfect | reliability, and that is a frequent point of UTF-8 | advocacy, which lends some irony to the fact that UTF-8 | is the one encoding that Web browsers support but refuse | to detect[1].) I don't think there is any other advantage | to excluding FF specifically, it's not like we're using | punched paper tape. | | [1] https://hsivonen.fi/utf-8-detection/ | xigoi wrote: | You can use 11000000 and 11000001 as sentinel bytes; | since a sequence beginning with them can't possibly be | minimal. | cryptonector wrote: | And Unicode was restricted to 21 bits because of UTF-16. | There is still the possibility of that restriction being | lifted eventually. | pierrebai wrote: | No software decodes data by reading a stream byte-by-byte. | Like I said in a previous comment, decoding 4 bytes using | SIMD is possible and probably the best way to go. | Furthermore, to actually decode, you need bit twiddling | anyway, so you do need to do byte-processing. Finally, the | inner loop of detecting character boundary is simpler: the | UTF-8 scheme, due to the variable-length prefixes, requires | to detect the first non-1 bits. It is probably written with a | switch/case in C, vs two bit tests in my scheme. I'm not | convinced the UTF-8 ends-up with a faster loop. | LegionMammal978 wrote: | The problem is that UTF-8 has the ability to detect and reject | partial characters at the start of the string; this encoding | would silently produce an incorrect character. Also, UTF-8 is | easily extensible already: the bit patterns 111110xx, 1111110x, | and 11111110 are only disallowed for compatibility with | UTF-16's limits. | pierrebai wrote: | How often are stream truncated _at the start_? In my career, | I 've seen plenty of end truncation, but start truncation | never happens. Or, to be more precise, it only happens if | previous decoding is already borked. If a previous decoding | read too much data, then even UTF-8 is borked. You could be | decoding UTF-8 from the bits of any follow-up data. | | Even for pure text data, if a previous field was over-read | (the only plausible way to have start-truncation), then you | probably are decoding incorrect data from then on. | | IOW, this upside is both ludicrously improbable and much more | damning to the decoding than simply be able to skip a | character. | cryptonector wrote: | UTF-8 is self-resynchronizing. You can scan forwards and/or | backwards and all you have to do is look for bytes that start a | UTF-8 codepoint encoding to find the boundaries between | codepoints. It's genius. | [deleted] | stkdump wrote: | The current scheme is extensible to 7x6=42 bits (which will | probably never be needed). The advantage of the current scheme | is that when you read the first byte you know how long the code | point is in memory and you have less branching dependencies, | i.e. better performance. | | EDIT: another huge advantage is that lexicographical | comparison/sorting is trivial (usually the ascii version of the | code can be reused without modification). | lkuty wrote: | like "A Branchless UTF-8 Decoder" at | https://nullprogram.com/blog/2017/10/06/ | coldpie wrote: | > The current scheme is extensible to 7x6=42 bits (which will | probably never be needed). | | I have printed this out and inserted it into my safe deposit | box, so my children's children's children can take it out and | have a laugh. 
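cryptonector's resynchronization point (and the boundary note nabla9 quoted earlier in the thread) is small enough to show in a few lines of Python; the helper name is illustrative only, not from the article:

      def scalar_start(buf: bytes, i: int) -> int:
          """Back up from an arbitrary offset to the first octet of the scalar value
          containing it, by skipping tail octets (10xxxxxx). At most 3 steps back."""
          while i > 0 and buf[i] >> 6 == 0b10:
              i -= 1
          return i

      data = b"A\xf0\x9f\x98\x82"         # "A" followed by U+1F602
      assert scalar_start(data, 3) == 1   # offset 3 is a tail octet; its header is at offset 1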
| jjice wrote: | Fun fact: Ken Thompson and Rob Pike of Unix, Plan 9, Go, and | other fame had a heavy influence on the standard while working on | Plan 9. To quote Wikipedia: | | > Thompson's design was outlined on September 2, 1992, on a | placemat in a New Jersey diner with Rob Pike. | | If that isn't a classic story of an international standard's | creation/impactful update, then I don't know what is. | | https://en.wikipedia.org/wiki/UTF-8#FSS-UTF | SethMLarson wrote: | I knew that Ken Thompson had an influence but wasn't aware of | Rob Pike, what a great fact! Thanks for sharing this :) | ChrisSD wrote: | For whatever it's worth Rob Pike seems to credit Ken Thompson | for the invention, though they both worked together to make | it the encoding used by Plan 9 and to advocate for its use | more widely. | YaBomm wrote: ___________________________________________________________________ (page generated 2022-02-08 23:01 UTC)