[HN Gopher] How UTF-8 Works
       ___________________________________________________________________
        
       How UTF-8 Works
        
       Author : SethMLarson
       Score  : 206 points
       Date   : 2022-02-08 14:56 UTC (8 hours ago)
        
 (HTM) web link (sethmlarson.dev)
 (TXT) w3m dump (sethmlarson.dev)
        
       | brian_rak wrote:
        | This was presented well. A follow-up on Unicode might be in
        | order!
        
         | SethMLarson wrote:
          | Glad you enjoyed! Unicode and how it interacts with other
          | aspects of computers (IDNA, NFKC, grapheme clusters, etc.) is
          | one of the spaces I want to explore more.
        
       | dspillett wrote:
       | Not sure if the issue is with Chrome or my local config generally
       | (bog standard Windows, nothing fancy), but the us-flag example
        | doesn't render as intended. It shows as "US", with the
        | components in the next step being "U" and "S" (not the ASCII
        | characters U & S; the encoding is as intended, but those
        | characters are shown in place of the intended flag).
       | 
       | Displays as I assume intended in Firefox on the same machine:
       | American flag emoji then when broken down in the next step U-in-
       | a-box & S-in-a-box. The other examples seem fine in Chrome.
       | 
        | Take care when using relatively new additions to the Unicode
        | emoji set; test to make sure your intentions are correctly
        | displayed in all the browsers you might expect your audience to
        | be using.
        
         | SethMLarson wrote:
         | Yeah, there's not much I can do there unfortunately (since I'm
         | using SVG with the actual U and S emojis to show the flag). I
         | can't comment on whether it's your config or not, but I've
         | tested the SVGs on iOS and Firefox/Chrome on desktop to make
         | sure they rendered nicely for most people. Sorry you aren't
         | getting a great experience there.
         | 
         | Here's how it's rendering for me on Firefox:
         | https://pasteboard.co/rjLtqANVQUIJ.png
        
           | xurukefi wrote:
           | For me it also renders like this on Chrome/Windows:
           | https://i.imgur.com/HCJTpfA.png
           | 
           | Really nice diagrams nevertheless
        
         | andylynch wrote:
          | They aren't new (2010) - this is a Windows thing - speculation
          | is that it's a policy decision to avoid awkward conversations
          | with various governments (presumably large customers) about TW,
          | PS and others -- see the long discussion here for instance:
          | https://answers.microsoft.com/en-us/windows/forum/all/flag-e...
        
       | zaik wrote:
       | Those diagrams look really good. How were they made?
        
         | jeremieb wrote:
         | The author mentions at the end of the article that he spent a
         | lot of time on https://www.diagrams.net/. :)
        
       | nayuki wrote:
       | Excellent presentation! One improvement to consider is that many
       | usages of "code point" should be "Unicode scalar value" instead.
       | Basically, you don't want to use UTF-8 to encode UTF-16 surrogate
       | code points (which are not scalar values).
       | 
       | Fun fact, UTF-8's prefix scheme can cover up to 31 payload bits.
       | See https://en.wikipedia.org/wiki/UTF-8#FSS-UTF , section "FSS-
       | UTF (1992) / UTF-8 (1993)".
       | 
       | A manifesto that was much more important ~15 years ago when UTF-8
       | hadn't completely won yet: https://utf8everywhere.org/
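        | 
        | A quick Python illustration of the scalar-value point (the
        | encoder accepts scalar values but rejects lone surrogates):
        | 
        |     chr(0x1F602).encode("utf-8")  # b'\xf0\x9f\x98\x82'
        |     chr(0xD83D).encode("utf-8")   # raises UnicodeEncodeError:
        |                                   # surrogates not allowed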
        
         | masklinn wrote:
         | > Fun fact, UTF-8's prefix scheme can cover up to 31 payload
         | bits.
         | 
          | It'd probably be more correct to say that it was originally
          | defined to cover 31 payload bits: you can easily complete the
          | first byte to get 7- and 8-byte sequences (36- and 42-bit
          | payloads).
         | 
          | Alternatively, you could reserve the 11111111 leading byte to
          | flag the following bytes as counts (5 bits each, since you'd
          | need a flag bit to indicate whether this was the last), then
          | add the actual payload afterwards. This would give you an
          | infinite-size payload, though it would make the payload size
          | dynamic and streamed (whereas currently you can get the entire
          | USV in two fetches, as the first byte tells you exactly how
          | many continuation bytes you need).
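          | 
          | (A rough Python sketch of that first fetch: for multibyte
          | sequences, the count of leading 1 bits in the first byte is
          | the total sequence length.)
          | 
          |     def seq_len(first_byte):
          |         if first_byte < 0x80:      # 0xxxxxxx: plain ASCII
          |             return 1
          |         n = 0
          |         while first_byte & (0x80 >> n):
          |             n += 1
          |         return n                   # 110..., 1110..., 11110...
          | 
          |     assert [seq_len(b) for b in (0xC3, 0xF0)] == [2, 4]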
        
           | SethMLarson wrote:
            | Yeah, the current definition is restricted to 4 octets in
            | RFC 3629. Really interesting to see the history of the
            | ranges UTF-8 was able to cover.
        
         | CountSessine wrote:
          | _Basically, you don't want to use UTF-8 to encode UTF-16
          | surrogate code points_
          | 
          | The awful truth is that there is such a beast: a UTF-8 wrapper
          | around UTF-16 surrogate pairs.
         | 
         | https://en.wikipedia.org/wiki/CESU-8
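          | 
          | A Python sketch of what that looks like for one astral
          | character (surrogatepass lets you hand-build the CESU-8-style
          | bytes; purely illustrative):
          | 
          |     cesu8 = "\ud83d\ude02".encode("utf-8", "surrogatepass")
          |     utf8 = "\U0001F602".encode("utf-8")
          |     print(cesu8.hex())  # eda0bdedb882 (6 bytes)
          |     print(utf8.hex())   # f09f9882 (4 bytes of real UTF-8)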
        
           | nayuki wrote:
           | Is CESU-8 a synonym of WTF-8?
           | https://en.wikipedia.org/wiki/UTF-8#WTF-8 ;
           | https://simonsapin.github.io/wtf-8/
        
       | bhawks wrote:
        | UTF-8 is one of the most momentous and underappreciated /
        | relatively unknown achievements in software.
        | 
        | A sketch on a diner placemat has led to every person in the
        | world being able to communicate written language digitally using
        | a common software stack. Thanks to Ken Thompson and Rob Pike we
        | have avoided the deeply siloed and incompatible world that code
        | pages, wide chars and other insufficient encoding schemes were
        | guiding us towards.
        
         | cryptonector wrote:
         | And stayed ASCII-compatible. And did not have to go to wide
         | chars. And it does not suck. And it resynchronizes. And...
        
         | ahelwer wrote:
         | It really is wonderful. I was forced to wrap my head around it
         | in the past year while writing a tree-sitter grammar for a
         | language that supports Unicode. Calculating column position
         | gets a whole lot trickier when the preceding codepoints are of
         | variable byte-width!
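          | 
          | (One way to do it, sketched in Python and assuming the offset
          | lands on a codepoint boundary and "column" means codepoints:)
          | 
          |     line = "na\u00efve \U0001F602 !".encode("utf-8")
          |     off = line.index("\U0001F602".encode("utf-8"))
          |     col = len(line[:off].decode("utf-8"))
          |     print(off, col)   # 7 6: byte offset 7 is column 6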
         | 
         | It's one of those rabbit holes where you can see people whose
         | entire career is wrapped up in incredibly tiny details like
         | what number maps to what symbol - and it can get real
         | political!
        
         | GekkePrutser wrote:
         | It's great as a global character set and really enabled the
         | world to move ahead at just the right time.
         | 
          | But the whole emoji modifier (e.g. guy + heart + lips + girl =
          | one kissing couple character) thing is a disaster. Too many
          | rules made up on the fly that make building an accurate parser
          | a nightmare. The standard should either have specified this
          | strictly and consistently, or just left it out for a future
          | standard to implement and used separate codepoints for the
          | combinations that were really necessary.
          | 
          | This complexity is also something that has led to multiple
          | vulnerabilities, especially on mobiles.
         | 
         | See here all the combos: https://unicode.org/emoji/charts/full-
         | emoji-modifiers.html
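          | 
          | To make the parsing pain concrete, here is one of those
          | sequences pulled apart in Python (the "kiss: woman, man"
          | emoji, one visible glyph glued together with zero-width
          | joiners):
          | 
          |     kiss = ("\U0001F469\u200D\u2764\uFE0F"
          |             "\u200D\U0001F48B\u200D\U0001F468")
          |     print(len(kiss))                  # 8 codepoints
          |     print(len(kiss.encode("utf-8")))  # 27 bytes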
        
         | inglor_cz wrote:
         | As a young Czech programming acolyte in the late 1990s, I had
         | to cope with several competing 8-bit encodings. It was a pure
         | nightmare.
         | 
         | Long live UTF-8. Finally I can write any Central European name
         | without mutilating it.
        
         | [deleted]
        
       | Simplicitas wrote:
        | I still wanna know WHICH Jersey diner it was invented in! :-)
        
       | jsrcout wrote:
       | This may be the first explanation of Unicode representation that
       | I can actually follow. Great work.
        
         | SethMLarson wrote:
         | Wow, thank you for the kind words. You've made my morning!!
        
       | ctxc wrote:
       | Such clean presentation, refreshing.
        
       | RoddaWallPro wrote:
       | I spent 2 hours last Friday trying to wrap my head around what
       | UTF-8 was (https://www.joelonsoftware.com/2003/10/08/the-
       | absolute-minim is great, but doesn't explain the inner workings
       | like this does) and completely failed, could not understand it.
       | This made it super easy to grok, thank you!
        
       | karsinkk wrote:
       | The following article is one of my favorite primers on Character
       | sets/Unicode : https://www.joelonsoftware.com/2003/10/08/the-
       | absolute-minim...
        
       | jokoon wrote:
        | I wonder how large a font must be to display all UTF-8
        | characters...
        | 
        | I'm also waiting for new emojis; they recently added more and
        | more that can be used as icons, which is simpler than integrating
        | PNG or SVG icons.
        
         | banana_giraffe wrote:
          | OpenType makes this impossible. A glyph index is a UINT16, so
          | you can't fit all of the ~143k Unicode characters in one font.
         | 
         | There are some attempts at font families to cover the majority
         | of characters. Like Noto ( https://fonts.google.com/noto/fonts
         | ), broken out into different fonts for different regions.
         | 
         | Or, Unifont's ( http://www.unifoundry.com/ ) goal of gathering
         | the first 65536 code points in one font, though it leaves a lot
         | to be desired if you actually use it as a font.
        
         | dspillett wrote:
         | Take care using recently added Unicode entries, unless you have
         | some control of your user-base and when they update or are
         | providing a custom font that you know has those items
         | represented. You could be giving out broken-looking UI to many
         | if their setup does not interpret the newly assigned codes
         | correctly.
        
       | jvolkman wrote:
       | Rob Pike wrote up his version of its inception almost 20 years
       | ago.
       | 
       | The history of UTF-8 as told by Rob Pike (2003):
       | http://doc.cat-v.org/bell_labs/utf-8_history
       | 
       | Recent HN discussion:
       | https://news.ycombinator.com/item?id=26735958
        
       | filleokus wrote:
       | Recently I learned about UTF-16 when doing some stuff with
       | PowerShell on Windows.
       | 
        | In parallel with my annoyance with Microsoft, I realized how
        | long it's been since I encountered any kind of text encoding
        | drama. As a regular typer of åäö, many hours of my youth were
        | spent configuring shells, terminal emulators, and IRC clients to
        | use compatible encodings.
       | 
       | The wide adoption of UTF-8 has been truly awesome. Let's just
       | hope it's another 15-20 years until I have to deal with UTF-16
       | again...
        
         | legulere wrote:
          | There's increasing support for UTF-8 as an ANSI codepage on
          | Windows. And UTF-8 support is also part of the modernization of
         | the terminal:
         | https://devblogs.microsoft.com/commandline/windows-command-l...
        
         | ChrisSD wrote:
         | There are many reasons why UTF-8 is a better encoding but
         | UTF-16 does at least have the benefit of being simpler. Every
         | scalar value is either encoded as a single unit or a pair of
         | units (leading surrogate + trailing surrogate).
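          | 
          | (For reference, turning a pair back into a scalar value is
          | just a shift and an add; a small Python sketch:)
          | 
          |     def combine(hi, lo):
          |         # hi in D800-DBFF, lo in DC00-DFFF
          |         return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
          | 
          |     assert combine(0xD83D, 0xDE02) == 0x1F602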
         | 
         | However, Powershell (or more often the host console) has a lot
         | of issues with handling Unicode. This has been improving in
         | recent years but it's still a work in progress.
        
           | masklinn wrote:
           | > There are many reasons why UTF-8 is a better encoding but
           | UTF-16 does at least have the benefit of being simpler. Every
           | scalar value is either encoded as a single unit or a pair of
           | units (leading surrogate + trailing surrogate).
           | 
           | UTF16 is really not noticeably simpler. Decoding UTF8 is
           | really rather straightforward in any language which has even
           | minimal bit-twiddling abilities.
           | 
           | And that's assuming you need to write your own encoder or
           | decoder, which seems unlikely.
        
           | Fill1200 wrote:
            | I have a MySQL database which has a large amount of Japanese
            | text data. When I convert it from UTF-8 to UTF-16, it
            | certainly reduces disk space.
        
           | tialaramex wrote:
            | UTF-16 _only_ makes sense if you were sure UCS-2 would be
            | fine, and then oops, Unicode is going to be more than 16 bits
            | and so UCS-2 won't work and you need to somehow cope anyway.
            | It makes zero sense to adopt this in greenfield projects
            | today, whereas Java and Windows, which had bought into UCS-2
            | back in the early-mid 1990s, needed UTF-16 or else they would
            | have had to throw all their 16-bit text APIs away and start
            | over.
            | 
            | UTF-32 / UCS-4 is fine but feels very bloated, especially if
            | a lot of your text data is more or less ASCII (which, if it's
            | not literally human text, it usually will be), and it feels a
            | bit bloated even on a good day (it's always wasting 11 bits
            | per character!)
           | 
           | UTF-8 is a little more complicated to handle than UTF-16 and
           | certainly than UTF-32 but it's nice and compact, it's pretty
           | ASCII compatible (lots of tools that work with ASCII also
           | work fine with UTF-8 unless you insist on adding a spurious
           | UTF-8 "byte order mark" to the front of text) and so it was a
           | huge success once it was designed.
        
             | ChrisSD wrote:
              | As I said, there are many reasons UTF-8 is a better
              | encoding. And indeed a compact, backwards-compatible
              | encoding of ASCII is one of them.
        
               | glandium wrote:
               | It is less compact than UTF-16 for CJK languages, FWIW.
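                | 
                | Roughly, in Python:
                | 
                |     s = "\u65e5\u672c\u8a9e"
                |     len(s.encode("utf-8"))      # 9 bytes
                |     len(s.encode("utf-16-le"))  # 6 bytes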
        
           | nwallin wrote:
           | > There are many reasons why UTF-8 is a better encoding but
           | UTF-16 does at least have the benefit of being simpler.
           | 
           | Big endian or little endian?
        
             | cryptonector wrote:
             | LOL
        
             | ts4z wrote:
             | And did they handle surrogate pairs correctly?
             | 
             | My team managed a system that did a read from user data,
             | doing input validation. One day we got a smart quote
             | character that happened to be > U+10000. But because the
             | data validation happened in chunks, we only got half of it.
             | Which was an invalid character, so input validation failed.
             | 
             | In UTF-8, partial characters happen so often, they're
             | likely to get tested. In UTF-16, they are more rarely seen,
             | so things work until someone pastes in emoji and then it
             | falls apart.
        
         | [deleted]
        
       | daenz wrote:
       | Great explanation. The only part that tripped me up was in
       | determining the number of octets to represent the codepoint. From
       | the post:
       | 
       | >From the previous diagram the value 0x1F602 falls in the range
       | for a 4 octets header (between 0x10000 and 0x10FFFF)
       | 
        | Relying on the diagram in the post would be a crutch. It seems
        | easier to remember the maximum number of "data" bits that each
        | octet layout can support (7, 11, 16, 21). Then, knowing that
        | 0x1F602 maps to 11111011000000010, which is 17 bits, you know it
        | must fit into the 4-octet layout, which can hold 21 bits.
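        | 
        | A small Python sketch of that rule of thumb (illustrative only,
        | ignoring surrogates and other invalid scalar values):
        | 
        |     def octets_needed(cp):
        |         bits = max(cp.bit_length(), 1)
        |         for octets, cap in ((1, 7), (2, 11), (3, 16), (4, 21)):
        |             if bits <= cap:
        |                 return octets
        |         raise ValueError("beyond U+10FFFF")
        | 
        |     assert octets_needed(0x1F602) == 4   # 17 bits -> 4 octets
        |     assert octets_needed(0x20AC) == 3    # 14 bits -> 3 octets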
        
         | mananaysiempre wrote:
          | As the continuation bytes always bear the payload in the low 6
          | bits, Connor Lane Smith suggests writing them out in octal[1].
          | The fact that 3 octets of UTF-8 precisely cover the BMP is also
          | quite convenient and easy to remember (but perhaps don't use
          | that the way MySQL did[2]...).
         | 
         | [1] http://www.lubutu.com/soso/write-out-unicode-in-octal
         | 
         | [2] https://mathiasbynens.be/notes/mysql-utf8mb4
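          | 
          | A tiny Python example of the octal trick (U+20AC, the euro
          | sign, is 0o20254, and the octal digits show through):
          | 
          |     oct(0x20AC)                 # '0o20254'
          |     [oct(b) for b in "\u20ac".encode("utf-8")]
          |     # ['0o342', '0o202', '0o254'] -- the lead byte carries the
          |     # leading 2, the continuations carry 02 and 54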
        
         | bumblebritches5 wrote:
        
       | riwsky wrote:
       | How UTF-8 works?
       | 
       | "pretty well, all things considered"
        
       | who-shot-jr wrote:
       | Fantastic! Very well explained.
        
         | SethMLarson wrote:
         | Thanks for the kind comment :)
        
           | BitwiseFool wrote:
           | I feel the same way as the GP, great work. I also appreciate
           | how clean and easy to read the diagrams are.
        
       | nabla9 wrote:
        | > _NOTE: You can always find a character boundary from an
        | arbitrary point in a stream of octets by moving left an octet
        | each time the current octet starts with the bit prefix 10 which
        | indicates a tail octet. At most you'll have to move left 3
        | octets to find the nearest header octet._
       | 
       | This is incorrect. You can only find boundaries between code
       | points this way.
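        | 
        | (The codepoint-boundary scan itself is just a test for the
        | 10xxxxxx tail prefix; a Python sketch:)
        | 
        |     def codepoint_start(buf, i):
        |         # step left while on a tail octet (0b10xxxxxx)
        |         while i > 0 and (buf[i] & 0xC0) == 0x80:
        |             i -= 1
        |         return i
        | 
        |     data = "\U0001F602".encode("utf-8")   # f0 9f 98 82
        |     assert codepoint_start(data, 3) == 0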
       | 
        | Unicode seems cool until you learn that not all "user-perceived
        | characters" (grapheme clusters) can be expressed as a single
        | code point. These UTF-8 explanations explain the encoding but
        | leave out this unfortunate detail. The author might not even
        | know this because they deal with a subset of Unicode in their
        | life.
        | 
        | If you want to split text between two user-perceived characters,
        | not in the middle of one, this tutorial does not help.
        | 
        | Unicode encodings are great if you want to handle a subset of
        | languages and characters; if you want to be complete, it's a
        | mess.
        
         | SethMLarson wrote:
         | You're right, that should read "codepoint boundary" not
         | "character boundary". I can fix that.
         | 
         | I do briefly mention grapheme clusters near the end, didn't
         | want to introduce them as this article was more about the
         | encoding mechanism itself. Maybe a future article after more
         | research :)
        
           | nabla9 wrote:
           | Please do. You have the best visualizations of UTF-8 I have
           | seen so far.
           | 
            | Usually people write about just the UTF-8 encoding part and
            | don't mention the rest of Unicode, because it's clearly not
            | as good and simple.
        
       | mark-r wrote:
       | UTF-8 is one of the most brilliant things I've ever seen. I only
       | wish it had been invented and caught on before so many
       | influential bodies started using UCS-2 instead.
        
         | SethMLarson wrote:
         | 100% agree, it's really rare that there's a ~blanket solution
         | to a whole class of problems. "Just use UTF-8!"
        
         | BiteCode_dev wrote:
         | Like anything new, people had a hard time with it at the
         | beginning.
         | 
          | I remember that I got a take-home assignment in an interview
          | for a PHP job. The person evaluating my code said I should not
          | have used UTF-8, which causes "compatibility problems". At the
          | time, I didn't know better, and I answered that no, it was
          | explicitly created to solve compatibility problems, and that
          | they just didn't understand how to deal with encoding properly.
          | 
          | Needless to say, I didn't get the job :)
         | 
          | Same with Python 2 code. So many people, when migrating to
          | Python 3, suddenly thought Python 3's encoding management was
          | broken, since it was raising so many UnicodeDecodeErrors.
          | 
          | Only much later did people realize the huge number of programs
          | that couldn't deal with non-ASCII characters in file paths,
          | HTML attributes or user names, because they just implicitly
          | assumed ASCII. "My code used to work fine", they said. But it
          | worked fine on their machine, set to an English locale, tested
          | only using ASCII plain-text files in their ASCII-named
          | directories with their ASCII last name.
        
           | SAI_Peregrinus wrote:
            | My Slack name at work is "This name is a valid POSIX path".
           | My hope is that it serves as an amusing reminder to consider
           | things like spaces and non-ASCII characters.
        
           | andrepd wrote:
            | That's in general a problem with dynamic languages with weak
            | type systems. "Your code runs without crashing" is really,
            | really != "your code works". How do people even manage
            | production Python! A bug could be lurking anywhere,
            | undetected until it's actually run. Whereas in a compiled
            | language with a strong type system, "your code compiles" is
            | much closer to "your code is correct".
        
             | [deleted]
        
             | digisign wrote:
              | There are a number of mitigations, so those kinds of bugs
              | are quite rare. In our large code base, about 98% of the
              | bugs we find are of the "we need to handle another case"
              | variety. Pyflakes quickly finds typos, which eliminates
              | most of the rest.
        
             | BiteCode_dev wrote:
             | I don't think a type system can help you with decoding a
             | file with the wrong charset.
        
           | morelisp wrote:
           | Python 3 encoding management was broken, because it tried to
           | impose Unicode semantics on things that were actually byte
           | streams. For anyone _actually correctly handling encodings_
           | in Python 2 it was awful because suddenly the language
           | runtime was hiding half the data you needed.
        
             | BiteCode_dev wrote:
              | Nowadays, passing bytes to any os function returns bytes
              | objects, not str. You'll get str if you pass str objects
              | though, and those are decoded with the filesystem encoding
              | (typically UTF-8) using surrogateescape.
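              | 
              | (A minimal sketch of that escaping, with a made-up non-
              | UTF-8 byte:)
              | 
              |     name = b"caf\xe9"    # latin-1 bytes, invalid UTF-8
              |     s = name.decode("utf-8", "surrogateescape")
              |     s                    # 'caf\udce9'
              |     s.encode("utf-8", "surrogateescape") == name  # True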
        
         | junon wrote:
         | There are (a few) very good reasons not to use UTF-8. It's a
         | great encoding but not suitable for all cases.
         | 
         | For example, constant time subscripting, or improved length
         | calculations, are made possible by encodings other than utf-8.
         | 
         | But when performance isn't critical, utf-8 should be the
         | default. I don't see a reason for any other encoding.
        
           | [deleted]
        
           | jfk13 wrote:
           | > For example, constant time subscripting, or improved length
           | calculations, are made possible by encodings other than
           | utf-8.
           | 
           | Assuming you mean different encoding forms of Unicode (rather
           | than entirely different and far less comprehensive character
           | sets, such as ASCII or Latin-1), there are very few use cases
           | where "subscripting" or "length calculations" would benefit
           | significantly from using a different encoding form, because
           | it is rare that individual Unicode code points are the most
           | appropriate units to work with.
           | 
           | (If you're happy to sacrifice support for most of the world's
           | writing systems in favour of raw performance for a limited
           | subset of scripts and text operations, that's different.)
        
           | ninkendo wrote:
           | Constant time subscripting is a myth. There's nothing(*)
           | useful to be obtained by adding a fixed offset to the base of
           | your string, in _any_ unicode encoding, including UTF-32.
           | 
            | If you're hoping that a fixed offset gives you a user-
            | perceived character boundary, then you're not handling
            | composed characters or zero-width joiners or any number of
            | other things that may cause a grapheme cluster to be composed
            | of multiple UTF code points.
            | 
            | The "fixed" size of code points in encodings like UTF-32 is
            | just that: code points. Determining whether a code point
            | corresponds to anything useful, like the boundary of a
            | visible character, will always require linear-time indexing
            | of the string, in any encoding.
           | 
           | (*) Approximately nothing. If you're in a position where
           | you've somehow already vetted that the text is of a subset of
           | human languages where you're guaranteed to never have
           | grapheme clusters that occupy more than a single code point,
           | then you maybe have a use case for this, but I'd argue you
           | really just have a bunch of bugs waiting to happen.
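            | 
            | A quick Python illustration (an 'e' plus a combining accent
            | is two code points but one user-perceived character, and a
            | fixed index happily splits it):
            | 
            |     s = "caf" + "e\u0301"   # 'e' + COMBINING ACUTE ACCENT
            |     len(s)                  # 5 code points
            |     s[:4]                   # 'cafe' -- the accent is lost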
        
             | irq-1 wrote:
             | > Constant time subscripting is a myth. There's nothing(*)
             | useful to be obtained by adding a fixed offset to the base
             | of your string, in any unicode encoding, including UTF-32.
             | 
             | What about UTF-256? Maybe not today, maybe not tomorrow,
             | but someday...
        
               | ts4z wrote:
               | I know you're kidding, but I want to note that UTF-256
               | isn't enough. There's an Arabic ligature that decomposes
               | into 20 codepoints. That was already in Unicode 20 years
               | ago. You can probably do something even crazier with the
               | family emoji. These make "single characters" that do not
               | have precomposed forms.
        
               | pjscott wrote:
                | Also, if you want O(1) indexing by grapheme cluster you
                | can get that with less memory overhead by precomputing a
                | lookup table of the location in the string where you can
                | find every k-th grapheme cluster, for some constant k >=
                | 1. (This requires a single O(n) pass through the string
                | to build the index, but you were always going to have to
                | make at least one such pass through the string for other
                | reasons.)
        
               | wizzwizz4 wrote:
               | Some characters are longer than 32 codepoints.
        
         | josephg wrote:
         | Absolutely. At least it's well supported now in very old
         | languages (like C) and very new languages (like Rust). But
         | Java, Javascript, C# and others will probably be stuck using
         | UCS-2 forever.
        
           | HideousKojima wrote:
           | There's actually a proposal with a decent amount of support
           | to add utf-8 strings to C#. Probably won't be added to the
           | language for another 3 or 4 years (if ever) but it's not
           | outside the realm of possibility.
           | 
           | Edit: The proposal for anyone interested
           | https://github.com/dotnet/runtime/issues/933
        
           | stewx wrote:
           | What is stopping people from encoding their Java, JS, and C#
           | files in UTF-8?
        
             | mark-r wrote:
             | Nothing at all, and in fact there's a site set up
             | specifically to advocate for this:
             | https://utf8everywhere.org/
             | 
             | The biggest problem is when you're working in an ecosystem
             | that uses a different encoding and you're forced to convert
             | back and forth constantly.
             | 
             | I like the way Python 3 does it - every string is Unicode,
             | and you don't know or care what encoding it is using
             | internally in memory. It's only when you read or write to a
             | file that you need to care about encoding, and the default
             | has slowly been converging on UTF-8.
        
             | maskros wrote:
             | Nothing, but Java's "char" type is always going to be
             | 16-bit.
        
               | josephg wrote:
                | Yep. In JavaScript (and Java and C# from memory) the
                | String.length property is based on the encoding length in
                | UTF-16. It's essentially useless. I don't know if I've
                | _ever_ seen a valid use for the JavaScript String.length
                | field in a program which handles Unicode correctly.
                | 
                | There are 3 valid (and useful) ways to measure a string,
                | depending on context:
               | 
               | - Number of Unicode characters (useful in collaborative
               | editing)
               | 
               | - Byte length when encoded (these days usually in utf8)
               | 
               | - and the number of rendered grapheme clusters
               | 
               | All of these measures are identical in ASCII text - which
               | is an endless source of bugs.
               | 
               | Sadly these languages give you a deceptively useless
               | .length property and make you go fishing when you want to
               | make your code correct.
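                | 
                | For instance, in Python (UTF-16 code units standing in
                | for the JS .length; a real grapheme count needs a
                | segmentation library):
                | 
                |     s = "\U0001F602"
                |     len(s)                         # 1 codepoint
                |     len(s.encode("utf-8"))         # 4 bytes
                |     len(s.encode("utf-16-le"))//2  # 2, like JS .length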
        
               | tialaramex wrote:
                | Java's char is a strong competitor for the most stupid
                | "char" type award.
                | 
                | I would give it to Java outright if not for the fact that
                | C's char type _doesn't define how big it is at all, nor
                | whether it is signed_. In practice it's probably a byte,
                | but you aren't actually promised that, and even if it is
                | a byte you aren't promised whether this byte is treated
                | as signed or unsigned; that's implementation-dependent.
                | Completely useless.
               | 
               | For years I thought char was just pointless, and even
               | today I would still say that a high level language like
               | Java (or Javascript) should not offer a "char" type
               | because the problems you're solving with these languages
               | are so unlikely to make effective use of such a type as
               | to make it far from essential. Just have a string type,
               | and provide methods acting on strings, forget "char". But
               | Rust did show me that a strongly typed systems language
               | might actually have some use for a distinct type here
               | (Rust's char really does only hold the 21-bit Unicode
               | Scalar Values, you can't put arbitrary 32-bit values in
               | it, nor UTF-16's surrogate code points) so I'll give it
               | that.
        
               | mark-r wrote:
               | The only guarantee that C gives you is that sizeof char
               | == 1, and even that's not as useful as it looks.
        
               | SAI_Peregrinus wrote:
               | It also guarantees that char is at least 8 bits.
        
             | jasode wrote:
             | _> What is stopping [...] Java, JS, and C# files in UTF-8?_
             | 
              | The output of files on disk can be UTF-8. The continued use
              | of UCS-2 (later revised to UTF-16) is happening _in the
              | runtime_ because things like the Win32 API, which C# uses,
              | are UCS-2. The internal raw memory layout of strings in
              | Win32 is UCS-2.
             | 
             | *EDIT to add correction
        
               | kevin_thibedeau wrote:
               | Win32 narrow API calls support UTF-8 natively now.
        
               | mark-r wrote:
               | Code page 65001 has existed for a long time now, but it
               | was discouraged because there were a lot of corner cases
               | that didn't work. Did they finally get all the kinks out
               | of it?
        
               | kevin_thibedeau wrote:
               | Yes. Applications can switch code page on their own.
        
               | colejohnson66 wrote:
               | UTF-16*, not UCS-2. Although there are probably many
               | programs that assume UCS-2.
        
               | mark-r wrote:
               | When Windows adopted Unicode, I think the only encoding
               | available was UCS-2. They converted pretty quickly to
               | UTF-16 though, and I think the same is true of everybody
               | else who started with UCS-2. Unfortunately UTF-16 has its
               | own set of hassles.
        
               | nwallin wrote:
               | Note that the asterisk in `UTF-16*` is a _really_ big
               | asterisk. I fixed a UCS-16 bug last week at my day job.
        
       | DannyB2 wrote:
       | There is an error in the first example under Giant Reference
       | Card.
       | 
       | The bytes come out as:
       | 
       | 0xF0 0x9F 0x87 0xBA 0xF0 0x9F 0x87 0xBA
       | 
        | but the bits directly above them are all of the bit pattern:
        | 010 10111
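        | 
        | (For reference, Python agrees the last byte should differ -- the
        | flag is U+1F1FA U+1F1F8, regional indicators U and S:)
        | 
        |     "\U0001F1FA\U0001F1F8".encode("utf-8").hex(" ")
        |     # 'f0 9f 87 ba f0 9f 87 b8'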
        
         | SethMLarson wrote:
         | Great eye! I'll fix this and push it out.
        
       | bussyfumes wrote:
        | BTW here's a surprise I had to learn at some point: strings in JS
        | are UTF-16. Keep that in mind if you want to use the console to
        | follow this great article: you'll get the surrogate pair for the
        | emoji instead.
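        | 
        | You can see that pair from Python by asking for UTF-16:
        | 
        |     "\U0001F602".encode("utf-16-be").hex()  # 'd83dde02'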
        
       | pierrebai wrote:
        | I never understood why UTF-8 did not use the _much_ simpler
        | encoding of:
        | 
        |     - 0xxxxxxx -> 7 bits, ASCII compatible (same as UTF-8)
        |     - 10xxxxxx -> 6 bits, more bits to come
        |     - 11xxxxxx -> final 6 bits.
        | 
        | It has multiple benefits:
        | 
        |     - It encodes more bits per octet: 7, 12, 18, 24 vs 7, 11,
        |       16, 21 for UTF-8.
        |     - It is easily extensible for more bits.
        |     - Such extra-bits extension is backward compatible for
        |       reasonable implementations.
       | 
       | The last point is key: UTF-8 would need to invent a new prefix to
       | go beyond 21 bits. Old software would not know the new prefix and
       | what to do with it. With the simpler scheme, they could
       | potentially work out of the box up to at least 30 bits (that's a
       | billion code points, much more than the mere million of 21 bits).
       | 
       | The
        
         | nephrite wrote:
         | From the Wikipedia article:
         | 
         | Prefix code: The first byte indicates the number of bytes in
         | the sequence. Reading from a stream can instantaneously decode
         | each individual fully received sequence, without first having
         | to wait for either the first byte of a next sequence or an end-
         | of-stream indication. The length of multi-byte sequences is
         | easily determined by humans as it is simply the number of high-
         | order 1s in the leading byte. An incorrect character will not
         | be decoded if a stream ends mid-sequence.
         | 
         | https://en.wikipedia.org/wiki/UTF-8#Comparison_with_other_en...
        
           | pierrebai wrote:
           | "instantaneously" in the sense of first having to read the
           | first byte to know how many bytes to read. So it's a two-step
           | process. Given the current maximum length and SIMD, detecting
           | the end-byte of my scheme is easily parallelizable for up to
           | 4 bytes, which conveniently goes to 24 bits, enough for all
           | current unicode code points, so there is no waiting for
           | termination. Furthermore, to decode a UTF-8 characters needs
           | bits extraction and shifting of all bytes, so there is no
           | practical gain of not looking at every byte. It actually
           | makes the decoding loop more complex.
           | 
            | Also, the human readability sounds fishy. Humans are really
            | bad at decoding _high-order_ bits. For example, can you tell
            | the length of a UTF-8 sequence that would begin with 0xEC at
            | a glance? With my scheme, either the high bit is not set
            | (0x7F or less), which is easy to see: you only need to
            | compare the first digit to 7. Or the high bit is set and the
            | high nibble is less than 0xC, meaning there is another byte,
            | also easy to see: you compare the first digit to C.
            | 
            | The quote also implicitly mis-characterizes the fact that in
            | my scheme an incorrect character would also not be decoded if
            | interrupted, since it would lack the terminating flag (no
            | byte >= 0xC0).
        
         | masklinn wrote:
          | UTF-8 as defined (or restricted) is a _prefix code_: it gets
          | all relevant information on the first read, and the rest on the
          | (optional) second. Your scheme requires an unbounded number of
          | reads.
         | 
         | > - It is easily extensible for more bits.
         | 
         | UTF8 already is easily extensible to more bits, either 7
         | continuation bytes (and 42 bits), or infinite. Neither of which
         | is actually useful to its purposes.
         | 
         | > The last point is key: UTF-8 would need to invent a new
         | prefix to go beyond 21 bits
         | 
         | UTF8 was defined as encoding 31 bits over 6 bytes. It was
         | restricted to 21 bits (over 4 bytes) when unicode itself was
         | restricted to 21 bits.
        
           | cesarb wrote:
           | > UTF8 already is easily extensible to more bits, either 7
           | continuation bytes (and 42 bits), or infinite.
           | 
           | Extending UTF-8 to 7 continuation bytes (or more) loses the
           | useful property that the all-ones byte (0xFF) never happens
           | in a valid UTF-8 string. Limiting it to 36 bits (6
           | continuation bytes) would be better.
        
             | edflsafoiewq wrote:
             | Why is that useful?
        
               | mananaysiempre wrote:
               | You can use FF as a sentinel byte internally (I think
               | utf8proc actually does that?); given that FE never
               | occurs, either, if you see the byte sequence
               | corresponding to U+FEFF BYTE ORDER MARK in one of the
               | other UTFs you can pretty much immediately tell it can't
               | possibly be UTF-8. (In general UTF-8, because of all the
               | self-synchronization redundancy, has a very distinctive
               | pattern that allows it to be detected with almost perfect
               | reliability, and that is a frequent point of UTF-8
               | advocacy, which lends some irony to the fact that UTF-8
               | is the one encoding that Web browsers support but refuse
               | to detect[1].) I don't think there is any other advantage
                | to excluding FF specifically; it's not like we're using
                | punched paper tape.
               | 
               | [1] https://hsivonen.fi/utf-8-detection/
        
               | xigoi wrote:
                | You can use 11000000 and 11000001 as sentinel bytes,
                | since a sequence beginning with them can't possibly be
                | minimal.
        
           | cryptonector wrote:
           | And Unicode was restricted to 21 bits because of UTF-16.
           | There is still the possibility of that restriction being
           | lifted eventually.
        
           | pierrebai wrote:
            | No software decodes data by reading a stream byte-by-byte.
            | Like I said in a previous comment, decoding 4 bytes using
            | SIMD is possible and probably the best way to go.
            | Furthermore, to actually decode, you need bit twiddling
            | anyway, so you do need to do byte-processing. Finally, the
            | inner loop for detecting a character boundary is simpler: the
            | UTF-8 scheme, due to the variable-length prefixes, requires
            | detecting the first non-1 bit. It is probably written with a
            | switch/case in C, vs two bit tests in my scheme. I'm not
            | convinced UTF-8 ends up with a faster loop.
        
         | LegionMammal978 wrote:
         | The problem is that UTF-8 has the ability to detect and reject
         | partial characters at the start of the string; this encoding
         | would silently produce an incorrect character. Also, UTF-8 is
         | easily extensible already: the bit patterns 111110xx, 1111110x,
         | and 11111110 are only disallowed for compatibility with
         | UTF-16's limits.
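          | 
          | (Concretely, in Python, a stream chopped at the start begins
          | with a continuation byte, which a strict decoder rejects:)
          | 
          |     data = "\U0001F602".encode("utf-8")   # f0 9f 98 82
          |     data[1:].decode("utf-8")
          |     # UnicodeDecodeError: invalid start byte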
        
           | pierrebai wrote:
            | How often are streams truncated _at the start_? In my career,
            | I've seen plenty of end truncation, but start truncation
            | never happens. Or, to be more precise, it only happens if a
            | previous decoding is already borked. If a previous decoding
            | read too much data, then even UTF-8 is borked. You could be
            | decoding UTF-8 from the bits of any follow-up data.
           | 
           | Even for pure text data, if a previous field was over-read
           | (the only plausible way to have start-truncation), then you
           | probably are decoding incorrect data from then on.
           | 
            | IOW, this upside is both ludicrously improbable and much more
            | damning to the decoding than simply being able to skip a
            | character.
        
         | cryptonector wrote:
         | UTF-8 is self-resynchronizing. You can scan forwards and/or
         | backwards and all you have to do is look for bytes that start a
         | UTF-8 codepoint encoding to find the boundaries between
         | codepoints. It's genius.
        
         | [deleted]
        
         | stkdump wrote:
          | The current scheme is extensible to 7x6=42 bits (which will
          | probably never be needed). The advantage of the current scheme
          | is that when you read the first byte you know how long the code
          | point is in memory, and you have fewer branching dependencies,
          | i.e. better performance.
          | 
          | EDIT: another huge advantage is that lexicographical
          | comparison/sorting is trivial (usually the ASCII version of the
          | code can be reused without modification).
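          | 
          | (The sorting property in Python: byte-wise order of the UTF-8
          | encodings matches codepoint order of the strings:)
          | 
          |     words = ["z", "\u00e9clair", "abc", "\U0001F602"]
          |     encoded = sorted(w.encode("utf-8") for w in words)
          |     sorted(words) == [b.decode() for b in encoded]  # True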
        
           | lkuty wrote:
           | like "A Branchless UTF-8 Decoder" at
           | https://nullprogram.com/blog/2017/10/06/
        
           | coldpie wrote:
           | > The current scheme is extensible to 7x6=42 bits (which will
           | probably never be needed).
           | 
           | I have printed this out and inserted it into my safe deposit
           | box, so my children's children's children can take it out and
           | have a laugh.
        
       | jjice wrote:
       | Fun fact: Ken Thompson and Rob Pike of Unix, Plan 9, Go, and
       | other fame had a heavy influence on the standard while working on
       | Plan 9. To quote Wikipedia:
       | 
       | > Thompson's design was outlined on September 2, 1992, on a
       | placemat in a New Jersey diner with Rob Pike.
       | 
       | If that isn't a classic story of an international standard's
       | creation/impactful update, then I don't know what is.
       | 
       | https://en.wikipedia.org/wiki/UTF-8#FSS-UTF
        
         | SethMLarson wrote:
          | I knew that Ken Thompson had an influence but wasn't aware of
          | Rob Pike - what a great fact! Thanks for sharing this :)
        
           | ChrisSD wrote:
           | For whatever it's worth Rob Pike seems to credit Ken Thompson
           | for the invention, though they both worked together to make
           | it the encoding used by Plan 9 and to advocate for its use
           | more widely.
        
       | YaBomm wrote:
        
       ___________________________________________________________________
       (page generated 2022-02-08 23:01 UTC)