[HN Gopher] A Spectre Is Haunting Unicode (2018)
       ___________________________________________________________________
        
       A Spectre Is Haunting Unicode (2018)
        
       Author : EvanAnderson
       Score  : 231 points
       Date   : 2022-07-14 13:00 UTC (10 hours ago)
        
 (HTM) web link (www.dampfkraft.com)
 (TXT) w3m dump (www.dampfkraft.com)
        
       | NeoTar wrote:
       | Is there anything similar for Latin characters?
       | 
       | The only circumstance I can imagine is where a Latin character
       | has been erroneously encoded with an unused diacritic, for
       | instance a T with a diaeresis.
        
         | wbl wrote:
         | Multiocular O is known only from a single word in a single
         | manuscript.
        
           | indecisive_user wrote:
           | Link to the wiki article. Though this is a variation of a
           | Cyrillic letter, not Latin.
           | 
           | https://en.wikipedia.org/wiki/Multiocular_O
        
             | asveikau wrote:
             | This is funny. The Indo European (and hence Slavic) roots
             | for eyes typically have an /o/, and this glyph is round
             | like the eye, so it seems like this character and others
             | linked in that article are just people making little
             | cartoonish drawings on writings involving descriptions of
             | eyes.
        
         | jxy wrote:
         | Something like the letters V and U.
        
           | lisper wrote:
           | Or double-U (i.e. UU == W)
        
       | krossitalk wrote:
       | > At this rate they'll presumably be with humanity forever. Ps
       | 
       | So, that's a really interesting thought. Perhaps our solution to
       | a permanent reminder of nuclear destruction[1] could be hidden
       | inside a plane of Unicode.
       | 
       | [1] https://en.wikipedia.org/wiki/Long-term_nuclear_waste_warnin...
        
         | remram wrote:
         | Maybe Unicode will feature the same kind of warnings one day.
         | 
         | > This Unicode range is not a place of honor. No highly-
         | esteemed symbol is registered here.
         | 
         | > What was here represented cultural signs that were considered
         | powerful in our time.
        
         | wongarsu wrote:
         | Maybe we can encode instructions on how to restart society in
         | Unicode character names? After all basically every computer
         | contains a list of them.
        
           | bogwog wrote:
           | That did not end well for the Georgia guidestones...
        
             | ethbr0 wrote:
             | Also thought about posting that this morning, but wasn't
             | sure anyone else would get the reference. (As context for
             | everyone else, some kook blew up some of the guidestones
             | last week in the middle of the night)
        
               | _dain_ wrote:
               | Not so kooky, they were a call for genocide.
        
               | tomcatfish wrote:
               | From what I see, this was the maximally flame-y way to
               | say what you said, and it's still inaccurate to call it
               | "Not so kooky" as these commandments, while disagreeable
               | to me, are not really that violent.
               | 
               | 1. Maintain humanity under 500,000,000 in perpetual
               | balance with nature. 2. Guide reproduction wisely -
               | improving fitness and diversity. 3. Unite humanity with a
               | living new language. 4. Rule passion - faith - tradition
               | - and all things with tempered reason. 5. Protect people
               | and nations with fair laws and just courts. 6. Let all
               | nations rule internally resolving external disputes in a
               | world court. 7. Avoid petty laws and useless officials.
               | 8. Balance personal rights with social duties. 9. Prize
               | truth - beauty - love - seeking harmony with the
               | infinite. 10. Be not a cancer on the Earth - Leave room
               | for nature - Leave room for nature.
               | 
               | source: https://en.wikipedia.org/wiki/Georgia_Guidestones
               | #Inscriptio...
        
       | ethbr0 wrote:
       | The Latin alphabet being boring, I spent some time going through
       | ancient alphabets included in Unicode.
       | 
       | It gets pretty trippy, pretty quick.
       | 
       | As in "We don't have a clear idea what this rune was for, or what
       | it means, but we see it in documents and so added it to Unicode."
       | 
       | https://en.m.wikipedia.org/wiki/Runic_(Unicode_block)
        
         | shantara wrote:
         | My favorite Unicode glyph is Multiocular O (ꙮ). There is only
         | one recorded usage, by a 15th century Russian monk, who decided
         | to use it in the phrase "many-eyed seraphim" instead of two
         | regular letters 'o'. So of course it was added to Unicode.
         | 
         | https://en.wikipedia.org/wiki/Multiocular_O
        
           | lmkg wrote:
           | It gets better: this glyph is bugged. Somehow, the guy
           | responsible for adding it to Unicode got _the number
           | of eyes wrong_. Per his description, Unicode fonts represent
           | it with 7 eyes, but after getting called out on Twitter he
           | realized the original manuscript shows 10 eyes.
           | 
           | This bug will be fixed in Unicode 15.
        
             | corrral wrote:
             | What about modern uses of the character that specifically
             | intended 7 eyes? Unicode needs to add a time or (worse, but
             | probably OK) version datum to glyphs or glyph ranges, I
             | suppose (applying it only at the document level wouldn't
             | suffice, as in the case of quoting).
        
             | B1FF_PSUVM wrote:
             | Achieving peak Byzantium there, I guess.
        
         | thaumasiotes wrote:
         | > As in "We don't have a clear idea what this rune was for, or
         | what it means, but we see it in documents and so added it to
         | Unicode."
         | 
         | Documents? I had the strong impression that there are no
         | documents written in runes. A rune we only know by its
         | occurrence in documents would be far more interesting for the
         | existence of a document than it would be for its own sake!
         | 
         | Compare what the page about Anglo-Saxon runes says about the
         | corpus:
         | 
         | > The Old English and Old Frisian Runic Inscriptions database
         | project at the Catholic University of Eichstatt-Ingolstadt,
         | Germany aims at collecting the genuine corpus of Old English
         | inscriptions containing more than two runes in its paper
         | edition, while the electronic edition aims at including both
         | genuine and doubtful inscriptions down to single-rune
         | inscriptions.
         | 
         | > The corpus of the paper edition encompasses about one hundred
         | objects (including stone slabs, stone crosses, bones, rings,
         | brooches, weapons, urns, a writing tablet, tweezers, a sun-
         | dial,[clarification needed] comb, bracteates, caskets, a font,
         | dishes, and graffiti). The database includes, in addition, 16
         | inscriptions containing a single rune, several runic coins, and
         | 8 cases of dubious runic characters (runelike signs, possible
         | Latin characters, weathered characters). Comprising fewer than
         | 200 inscriptions, the corpus is slightly larger than that of
         | Continental Elder Futhark (about 80 inscriptions, c. 400-700),
         | but slightly smaller than that of the Scandinavian Elder
         | Futhark (about 260 inscriptions, c. 200-800).
         | 
         | So across every runic system we know, we have under 600 texts,
         | _all_ of those texts are short inscriptions, and even to reach
         | that number of samples we need to include texts that we aren't
         | even sure contain any runes.
        
           | yorwba wrote:
           | Runes continued to be used long past the Elder Futhark period
           | and from the medieval period manuscripts survive that fit the
           | modern conception of a "document", most famously the Codex
           | Runicus https://www.e-pages.dk/ku/579/html5/ (202 pages)
        
           | bombcar wrote:
           | https://www.youtube.com/watch?v=2yWWFLI5kFU describes another
           | side-effect of encoding old scripts/runes.
        
           | hypertele-Xii wrote:
           | > I had the strong impression that there are no documents
           | written in runes.
           | 
           | There are. Such documents are called runestones and thousands
           | survive to this day, most in Sweden.
           | 
           | https://en.wikipedia.org/wiki/Runestone
        
             | eesmith wrote:
             | Huh. https://en.wikipedia.org/wiki/Document says:
             | 
             | > Documents are also distinguished from "realia", which are
             | three-dimensional objects that would otherwise satisfy the
             | definition of "document" because they memorialize or
             | represent thought; documents are considered more as
             | 2-dimensional representations.
             | 
             | I think "realia" - a term I had never heard before -
             | describes runestones better than "document".
        
           | hprotagonist wrote:
           | >Documents? I had the strong impression that there are no
           | documents written in runes.
           | 
           | If a clay tablet counts, why not a runestone?
        
             | thaumasiotes wrote:
             | I'm not knocking runestones for being the wrong medium. I'm
             | knocking them for not being documents. A typical cuneiform
             | record might be analogized to an invoice for delivery of a
             | crate of shirts or whatever. (And of course we also have
             | textbooks, dictionaries, literature, correspondence,
             | business reports, mathematical treatises, and every other
             | type of written work.) A typical runic record would be more
             | like the text "Made in Taiwan" printed on the shirt labels.
             | 
             | One of the biggest problems in the study of these cultures
             | is that they left no written records. We know they had a
             | writing _system_ , the runes, but as far as we can tell
             | they almost never used it for anything. Quite the opposite
             | is true of Mesopotamian cultures, where we're buried in
             | more records than we have the manpower to translate.
        
               | hprotagonist wrote:
               | I suppose it also matters what you think a rune is. Does
               | futhork count? There's parchment with that written on it.
               | Elder Futhark, none as far as I know.
        
           | gumby wrote:
           | > Documents? I had the strong impression that there are no
           | documents written in runes.
           | 
           | One of the original goals of Unicode was to be able to
           | computerize every document. I still have some old linguistics
           | books in which characters have been handwritten into typed or
           | even typeset text. So these are the types of documents being
           | referred to: academic papers.
           | 
           | Some fancy books have photographs of ancient writing; I'm not
           | sure if Unicode tries to encode such sources and I pretty
           | much doubt it (how would you even know what to call the
           | symbols? You touch on this in your comment). However often
           | they are attached to treatises that order the characters in
           | some way (I.e. index an alphabet) in which case the first
           | case above would apply.
           | 
           | In other words: thanks to some scholars who wrote down and
           | ordered runic alphabets, you can now discuss runes with your
           | friends and colleagues through email.
        
             | thaumasiotes wrote:
             | > One of the original goals of Unicode was to be able to
             | computerize every document. I still have some old
             | linguistics books in which characters have been handwritten
             | into typed or even typeset text.
             | 
             | That's a weird goal for Unicode to have. We've already
             | accomplished that; a PDF file does the job _better_ (note:
             | PDF documents _already support_ every character existing in
             | the past, present, or future!) while being less complex.
        
               | gumby wrote:
               | I don't understand. If there is no computerized way to
               | represent the script, all you can do would be to include
               | photographs in your pdf. The point of computerization is
               | not simply storage and retrieval (and retrieval is hard
               | if you can't represent the script) but automated
               | processing, which is meaningless if you can't represent
               | any semantics.
               | 
               | Separately, PDF felt like a step backwards on the day it
               | was announced and sadly nothing since then has changed
               | that.
        
               | CorrectHorseBat wrote:
               | How do you search for non-unicode characters in a pdf
               | document?
        
               | thaumasiotes wrote:
               | How do you search for them in a book?
        
               | gpderetta wrote:
               | ctrl-F once you have digitized it.
        
               | jen20 wrote:
               | And how do you type the character you are searching for?
        
               | cgriswald wrote:
               | On Ubuntu: l-ctrl+l-shift+u, <codepoint>, <enter>
               | 
               | Of course, that sucks, so I've programmed a nearby key to
               | act as l-ctrl+l-shift+u.
               | 
               | Several characters can also be typed with Compose Key.
               | 
               | For characters I use regularly (in my case, generally the
               | elder and younger futharks), I've created a keyboard out
               | of an Elgato StreamDeck XL so I can type any of these
               | runes with a single button press.
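
The Ctrl+Shift+U entry described above takes a hexadecimal code point. A minimal Python sketch of the same mapping, using only the standard `unicodedata` module. The helper name `char_from_hex` is hypothetical, and U+16A0 (the first code point of the Runic block) is chosen purely for illustration; it is not taken from the comment.

```python
import unicodedata

def char_from_hex(codepoint_hex: str) -> str:
    """Return the character for a hexadecimal code point such as '16A0'."""
    return chr(int(codepoint_hex, 16))

rune = char_from_hex("16A0")          # illustrative code point in the Runic block
print(rune, unicodedata.name(rune))   # name as recorded in the Unicode Character Database
print(f"U+{ord(rune):04X}")           # and back from character to code point
```

Whether the character actually renders depends on the fonts installed, which is a separate problem from being able to type or store it.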
        
               | gpderetta wrote:
               | I don't think that not having a physical key on the
               | keyboard has ever stopped anybody from inputting unicode
               | symbols.
        
               | [deleted]
        
             | runarberg wrote:
             | This is interesting. I'm comparing this to how musical
             | notation is encoded in unicode. I mean, there is a block
             | dedicated to the symbols, so the symbols are encoded, but
             | you can't document music using only unicode. But musical
             | documents are being composed and written all the time. To
             | write music you need additional software which arranges
             | these symbols in a certain way so that they express the
             | author's intention.
             | 
             | I guess math has a similar representation in unicode as
             | well.
             | 
             | All that said, I think people use runes to express magic
             | and spells (even to this day). I don't think all the
             | magical runes are expressed in unicode (and perhaps they
             | shouldn't). If you want to use a rune in that way, you
             | might have to draw it out in SVG or something and then
             | email it to your friends.
        
               | thaumasiotes wrote:
               | > I guess math has a similar representation in unicode as
               | well.
               | 
               | It's an ongoing project. As you seem to have guessed,
               | Unicode math symbols are just about as useless for
               | representing math as Unicode music symbols are for
               | representing music. Producing mathematical documents is
               | done using dedicated software, generally LaTeX.
               | 
               | (And what you get is a PDF, because, as I noted in
               | another comment, PDFs already support every notation
               | there is, was, or ever will be.)
        
       | jake_morrison wrote:
       | In the 90s I worked on a project to digitize land registration in
       | Taiwan.
       | 
       | In order to record deeds and property transfers, we needed to
       | enter people's names and official registered addresses into the
       | computer system. The problem was that some people used non-
       | traditional writing variants for their names, and some of their
       | birthplaces were tiny places in China with weird names.
       | 
       | Someone might write their name with a two-dot water radical
       | instead of three-dot radical. We would print it out in the normal
       | font, and the people would lose their minds, saying that it was
       | wrong. Chinese people can be superstitious about the number of
       | strokes in their name, so adding a stroke might make it unlucky,
       | so they would not buy the property.
       | 
       | The customer went to the agency responsible for managing the big
       | character set, https://en.wikipedia.org/wiki/CNS_11643 Despite
       | having more characters than anything else on earth, it didn't
       | have those variants. The agency said they would not encode them,
       | because they were not real characters, just printing differences.
       | 
       | The solution was for the staff in the office to use a "font
       | maker" program to create a custom font with these characters.
       | Then they could print out the deeds using a Chinese variant of
       | Adobe Acrobat, and everyone was happy.
        
         | agumonkey wrote:
         | I forget which country (Iran, Turkey...), but one diacritic in
         | a phone text got a girl killed because it altered the meaning
         | of one word, turning the sentence from loving to threatening or
         | insulting.
        
           | not2b wrote:
           | In Spanish, dropping one diacritic (~) changes "How old are
           | you?" to "How many anuses do you have?".
        
             | eesmith wrote:
             | In English, dropping one diacritic changes "Where's the
             | rosé?" to "Where's the rose?", and changes "My maté is
             | cold" to "My mate is cold."
        
               | ajuc wrote:
               | In Polish "zrób mi łaskę" means "do me a favor" and "zrób
               | mi laskę" means "give me a blowjob".
        
               | Dylan16807 wrote:
               | > rosé
               | 
               | Maybe, though it's still halfway the same word.
               | 
               | > maté
               | 
               | Not a change, both spellings are valid.
        
               | eesmith wrote:
               | Maybe even three-quarters the same word. (4/5ths if you
               | count code points in NFD!)
               | 
               | Male parties are a lot of fun.
               | 
               | Those are some pretty lame runners.
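
The "three-quarters" versus "4/5ths" arithmetic above can be checked with Python's standard `unicodedata` module. This is a small sketch that assumes the word under discussion is "rosé" written with an acute accent (the accent is easily lost in plain-text dumps); the variable names are illustrative.

```python
import unicodedata

word = "ros\u00e9"  # "rosé" with a precomposed é (U+00E9); assumed spelling

nfc = unicodedata.normalize("NFC", word)  # composed form
nfd = unicodedata.normalize("NFD", word)  # decomposed form

print(len(nfc))  # 4 code points: r, o, s, é
print(len(nfd))  # 5 code points: r, o, s, e, U+0301 COMBINING ACUTE ACCENT

# Count how many code points of each form also occur in plain "rose".
shared_nfc = sum(1 for ch in nfc if ch in "rose")
shared_nfd = sum(1 for ch in nfd if ch in "rose")
print(f"{shared_nfc}/{len(nfc)}")  # 3/4
print(f"{shared_nfd}/{len(nfd)}")  # 4/5
```

In NFC the accented letter is a single code point that plain "rose" lacks, so three of four match. In NFD the base letter "e" is separate from the combining accent, so four of five match.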
        
               | schoen wrote:
               | An oddity is that "maté" (meant to indicate that the e is
               | _pronounced_ ) is an incorrect spelling in both Spanish
               | and Portuguese, where it would wrongly suggest that the e
               | is _stressed_.
               | 
               | https://en.wikipedia.org/wiki/Yerba_mate#Name_and_pronunciat...
        
           | kps wrote:
           | https://gizmodo.com/a-cellphones-missing-dot-kills-two-peopl...
        
             | jwilk wrote:
             | Discussed on HN in 2008:
             | 
             | https://news.ycombinator.com/item?id=226853 (18 comments)
        
           | asveikau wrote:
           | That sounds terrible, however, it's important to remember
           | that diacritics don't get people killed, the person who
           | decides to kill ultimately needs to stop themselves.
        
             | _jal wrote:
             | No, "diacritics don't kill people, people kill people" is
             | not an important life lesson. It is a reductive just-so
             | generalization of basic common sense that obscures more
             | than it enlightens.
             | 
             | The important thing for engineers to note is a technical
             | shortcoming caused a tragic misunderstanding. Focusing
             | instead on the well-known fact that some people have poor
             | impulse control, knowing full well that is a non-
             | controllable input, instead makes an excuse for poor
             | engineering and implicitly expresses powerlessness to do
             | anything about the problem.
        
               | asveikau wrote:
               | I am all for good localization efforts. I've been
               | something of a champion for that whenever I've been
               | around user facing code and people working on it. I also
               | am a bit of a language nerd and not monolingual.
               | 
               | But yes, misunderstanding or not, we should not kill
               | people.
               | 
               | The story in the sibling comment is about a man attacking
               | his daughter's ex because the ex came to apologize about
               | a confusion over the Turkish dotless I. That's still a
               | violent attack, and the father could have kept his
               | emotions in check. I don't condone calling the daughter
               | names, even accidentally, but it is not a crime and the
               | right response is not attempted murder.
        
               | _jal wrote:
               | > but it is not a crime and the right response
               | 
               | I don't know who you're arguing with, but it isn't me.
               | Nobody is saying it was.
               | 
               | I'm saying it is an irrelevant non sequitur.
               | 
               | Imagine that Dad instead misunderstood an instruction
               | related to a financial transaction and lost a ton of
               | money. Would you now be discounting the technical problem
               | that caused the misunderstanding and berating Dad for
               | being foolish?
        
               | asveikau wrote:
               | I'm not discounting the technical problem.
               | 
               | If I were on a code review and I spotted an issue
               | affecting Turkish dotless I, I assure you I would rant
               | about it more than is reasonable.
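
For readers unfamiliar with the kind of issue such a code review would catch: Turkish has four distinct letters (I/ı and İ/i), while Python's `str.lower()` and `str.upper()` apply locale-independent default mappings. The sketch below shows the mismatch and one possible workaround. The helpers `turkish_lower` and `turkish_upper` are hypothetical names introduced here, not part of any library, and they cover only these four letters.

```python
def turkish_lower(text: str) -> str:
    """Lowercase with Turkish rules: I -> ı (dotless), İ -> i (dotted)."""
    return text.replace("I", "\u0131").replace("\u0130", "i").lower()

def turkish_upper(text: str) -> str:
    """Uppercase with Turkish rules: i -> İ (dotted capital), ı -> I."""
    return text.replace("i", "\u0130").replace("\u0131", "I").upper()

# Default mappings ignore the Turkish distinction between dotted and dotless I.
print("I".lower())             # 'i', whereas Turkish expects dotless 'ı'
print(turkish_lower("I"))      # 'ı' (U+0131)
print(turkish_upper("i"))      # 'İ' (U+0130)

# The default lowercase of İ is two code points: 'i' + U+0307 COMBINING DOT ABOVE.
print(len("\u0130".lower()))   # 2
```

A production system would more likely rely on a locale-aware library than on string replacement; the point here is only that case conversion cannot be done correctly without knowing the language.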
        
             | agumonkey wrote:
             | Even to a lesser extent, it's easy to forget how a small
             | mistake can have a butterfly effect in other cultures.
        
             | pixl97 wrote:
             | Ya, I don't see that happening in authoritarian countries.
             | 
             | As a contrived example if you had a symbol for 'happy' you
             | want to be very cautious that it doesn't get converted to
             | 'gay' because in your language gay and happy mean the same
             | thing, in some repressive regime it means the leadership
             | gets to execute you with the approval of the law.
        
               | zarzavat wrote:
               | A recent example is that "Let's go [gun emoji] him" could
               | be interpreted as either harmless fun, or conspiracy to
               | murder, depending on if the recipient's phone displays
               | that as a water pistol or a real gun.
               | 
               | Edit: weirdly HN refuses to display that emoji.
        
               | tomcatfish wrote:
               | HN does not like displaying emojis, though a few slip
               | through I believe.
        
               | lolc wrote:
               | Hacker News doesn't allow emojis because only serious fun
               | or something.
        
         | EvanAnderson wrote:
         | Yikes. If somebody hasn't written a "falsehoods programmers
         | believe about human writing systems" document this would make
         | for a good start.
        
           | cestith wrote:
           | It deserves its own entry in "falsehoods programmers believe
           | about names" lists too.
        
             | lmkg wrote:
             | It's already there, #11 "People's names are all mapped in
             | Unicode code points."
        
               | Scarblac wrote:
               | The falsehood here is thinking that if you can encode the
               | name into the right code points, and you have a font that
               | can print them, the result will be acceptable to the
               | people whose name it is.
               | 
               | They had that, but needed a font that used a different
               | number of strokes for the characters because of the
               | superstition.
        
               | lmkg wrote:
               | One could argue they're facets of the same issue.
               | Although in the spirit of the original list, they would
               | probably get split into separate line items.
               | 
               | On further review, I think this is also similar to #12 &
               | #13 on the list: "names are case-sensitive," and "names
               | are not case-sensitive." To generalize that to include
               | non-Western alphabets: display variations of the same
               | character are significant, and display variations of the
               | same character are not significant.
               | 
               | This of course goes back to the evergreen philosophical
               | question "what even is a character, anyways?" Since we've
               | found a case where two characters which are the same
               | character are not the same character. Are they distinct
               | characters or typographical variants? Yesn't: one would
               | want them unified for searching, but distinct for
               | printing.
               | 
               | But regardless of what they are, these
               | characters/variants only show up in names. Names tend to
               | retain archaic (or extinct) language variations longer
               | than speech, which is the reason for rule #11, which is
               | at least part of the problem.
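
One way to get "unified for searching, but distinct for printing" is to store the text as entered and derive a separate search key from it. The sketch below assumes the variant is expressed with a Unicode variation selector following the base character, which is only one possible encoding and not something the thread states. The function names are hypothetical, and U+845B followed by U+E0100 is used purely as sample data.

```python
import unicodedata

def is_variation_selector(ch: str) -> bool:
    """True for Variation Selectors (U+FE00..U+FE0F) and the
    Variation Selectors Supplement (U+E0100..U+E01EF)."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF

def search_key(name: str) -> str:
    """Derive a comparison key: drop variation selectors, then normalize."""
    stripped = "".join(ch for ch in name if not is_variation_selector(ch))
    return unicodedata.normalize("NFKC", stripped).casefold()

base = "\u845b"                    # a CJK ideograph, sample data only
variant = base + "\U000E0100"      # same ideograph plus a variation selector

print(base == variant)                          # False: stored forms differ
print(search_key(base) == search_key(variant))  # True: they match for search
```

The stored string keeps the selector so that a font which supports the sequence can print the requested variant, while lookups go through `search_key`. Variants that have no registered sequence at all, as in the land-registry story, would still need something like a custom font.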
        
               | cestith wrote:
               | I fully agree with this second, expanded take of yours.
               | Some names are both represented and not represented by
               | the Unicode simultaneously. This suggests there should be
               | variant versions of characters, but that becomes an even
               | thornier combinatorics (and sorting/collation, and
               | lookalike characters) issue than what already exists.
        
               | corrral wrote:
               | More generally, the notion that human culture, systems,
               | and behaviors can be mapped, losslessly and without
               | causing harm, to something a computer understands.
               | 
               | I think these language examples are so good, as examples,
               | because all aspects of them are clear and easy to follow.
               | I think computerization of business and society and the
               | systems that make them work, causes immense amounts of
               | this kind of friction and pain all the time, in ways that
               | are much harder to understand, explain, or catalog (which
               | is precisely why it's such a big problem, though as far
               | as I know it's received little attention)
               | 
               | [EDIT] To distill it, I think that trying to make a
               | computer a "source of truth" rather than a tool, tends to
               | do substantial violence to the "truth".
        
               | derefr wrote:
               | I feel like there has to be some level of triviality at
               | which the harm is no longer being caused by the attempt
               | to systematize something, but rather by a small group of
               | people refusing to be systematized _not_ out of cultural
               | heritage et al, but rather purely out of the (inane)
               | human desire to feel special by intentionally doing
               | something in a way nobody else does it.
               | 
               | Language and writing exist to _communicate_ , using
               | patterns of signals that have _shared meaning and
               | recognition_ ; things like alphabets and vocabularies are
               | effectively (loose, overlapping, diasporic) consensus-
               | state autoencoding models. They only _work_ to compress
               | meaning, when there are rules for said compression that
               | generalize, and which don't have as many exceptions with
               | their own separate symbols as there are words/names
               | needing to be encoded.
               | 
               | Most countries don't allow you to just make up your own
               | novel graphemes when writing a name on a birth
               | certificate. And nobody is asking for that, either.
               | (Presumably because living in a world where that was
               | allowed would be horrible: you'd no longer be able to
               | error-correct when reading, because any given mysterious
               | squiggle in the middle of a word or name, might be
               | exactly what some unknown-to-you-or-anyone-other-than-
               | the-author character is _supposed_ to look like. Is that
               | "o with a curlicue" written here just a semi-cursive
               | attempt at writing an "o" -- or is it an "o" with a novel
               | accent marker, one that appears nowhere else, but which
               | must be preserved nevertheless to properly record this
               | person's name?)
               | 
               | Instead, _legal names_ are (in every country I'm aware
               | of) required to be spelled using the character-set of the
               | country you're entering a legal relationship with by
               | being born / immigrating / etc. America? Legal names
               | using the Latin alphabet. Japan? Legal names using
               | characters from this set:
               | https://en.wikipedia.org/wiki/Jinmeiy%C5%8D_kanji
               | 
               | Note, though, that legal names are _representations_ of
               | names. They aren't _encodings_ of names. Your legal name
               | is a _distinct thing_ from your name, just as your
               | credit-card number is a distinct thing from your name. It's
               | an applied-for + registered + assigned systematic
               | identifier for you -- a bit like a domain name, or a
               | vanity license-plate number. Which means that your legal
               | name is not a lossy _or_ lossless encoding of your name.
               | It 's, per se, a nickname. It doesn't have to have
               | anything to do with your name. (And it often doesn't;
               | immigrants often choose legal names entirely distinct
               | from what they / their home country thinks of as their
               | name.)
        
           | cardiffspaceman wrote:
           | "If the character isn't in Unicode it's in CNS-11643"
           | apparently is also false.
        
         | lostlogin wrote:
         | There was a great thread on HN about names and falsehoods
         | programmers believe.
         | 
         | You've added to it, as custom fonts weren't among the ones covered.
         | 
         | I think it's this thread:
         | https://news.ycombinator.com/item?id=18567548
         | 
         | Edit: and it's there, #11.
        
         | duxup wrote:
         | That sounds equally fascinating, and a little maddening.
        
           | kurthr wrote:
           | Yep, and with pictographic writing systems it's a lot more
           | common than with Latin... but even here we have X AE A-12 Musk,
           | and Prince's name symbol.
           | 
           | Heck, my initials are totally non-standard.
        
         | 77pt77 wrote:
         | > Chinese people can be superstitious about the number of
         | strokes in their name, so adding a stroke might make it unlucky
         | 
         | Why am I not surprised in the slightest?
        
         | jetrink wrote:
         | That's a great story. The inability to represent a name with
         | standard characters reminds me of when Prince changed his name
         | to a symbol and they had to send all of the media floppy disks
         | containing a custom font with a single character.
         | 
         | https://nymag.com/intelligencer/2016/04/princes-legendary-fl...
        
           | mdp2021 wrote:
           | Are you acquainted with Freur (which means, "Underworld 0.5"
           | - Rick Smith and Karl Hyde in the '80s)?
           | 
           | "Freur", or, "The squiggle we chose as the name for a band
           | but that CBS Records insisted should at least have a
           | pronunciation".
           | 
           | I see it is not in Unicode (well, you can never really know
           | if you do not try), nor can I find pieces to reconstruct it.
           | 
           | The "freur" in foreground: https://d4q8jbdc3dbnf.cloudfront.n
           | et/user/6885/edb290c6183ac...
        
         | [deleted]
        
         | Findecanor wrote:
         | I've been told that this is also an issue in Japan, except the
         | reason might more often be a matter of pride than superstition.
         | It is supposedly one reason (of a few) why fax machines are
         | still in common use in Japan.
         | 
         | Later versions of Unicode support "Variation Forms" of Han
         | characters as a way to be able to encode different variations.
         | They are encoded as a Variation Selector code (U+E0100 and up)
         | after the Han character. The forms are listed separate from
         | Unicode versions in the "Ideographic Variation Database"
         | <https://www.unicode.org/ivd/>. So far, it contains characters
         | from a couple of Japanese dictionaries, a Korean and one from
         | Macao/Hong Kong.
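
A minimal sketch of what such a sequence looks like at the code point level, in Python. The base character 葛 (U+845B) is used here only as a commonly cited example of a character with registered variants; whether a given font actually draws a different glyph for the sequence is an assumption this snippet does not test.

```python
import unicodedata

base = "\u845b"          # 葛, a Han character (example choice, see lead-in)
selector = "\U000E0100"  # first code point of the Variation Selectors Supplement
ivs = base + selector    # an ideographic variation sequence: base + selector

# The sequence is two code points, even though it displays as one glyph.
print(len(base), len(ivs))
print(unicodedata.name(selector))
print([f"U+{ord(c):04X}" for c in ivs])

# In UTF-8 the selector adds four bytes on top of the three for the base.
print(len(ivs.encode("utf-8")))
```

Software that is unaware of variation sequences will typically just render the base character, which is why the selector is a fairly safe way to carry the distinction.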
        
           | hinkley wrote:
           | I knew someone who added an accent character to their name
           | because everyone pronounced it wrong. She met someone
           | bilingual who shot back that if she wants it pronounced that
           | way she needs to add an aigue. So she did, and everyone still
           | pronounced her name wrong.
           | 
           | In fact going any place with her very nearly became an "are
           | we living in a simulation" crisis for me because the number
           | of times she would say her name and the other person would
           | say it back incorrectly was... upsetting. The degree to which
           | some people butchered her name, especially combining half of
           | her first and last name into a completely different name,
           | made us joke about buggy NPCs.
           | 
           | I could imagine how in some cultures writing it incorrectly
           | hurts as much as pronouncing it incorrectly. Or possibly
           | more so in places where multiple plausible pronunciations have
           | to be negotiated via an introduction, which is the case in
           | China, is it not?
        
             | teknopaul wrote:
             | In Poland people have a neat life hack for that problem.
             | They have other names for non-polish folk to use. Eg pawek,
             | tomek, bartek rather than have people mangle their real
             | name.
             | 
             | My name got changed when I moved to Spain and it never
             | bothered me, while I have met people who took great offence
             | at the use of standard nicks that they had not explicitly
             | sanctioned in advance. I know a guy who makes a new name up
             | for everyone he meets. Like it or lump it. If you are too
             | sensitive about your name, you risk people not using it at
             | all.
        
               | lostlogin wrote:
               | It does go both ways though. Take the time to learn how
               | to say and spell someone's name and it usually goes down
               | well.
               | 
               | I say this while fully aware of my own butchering.
        
             | isoprophlex wrote:
             | People are just incredibly dense sometimes. My wife has a
             | name that's one letter different from a more common name,
             | but clearly different in pronunciation.
             | 
             | Nevertheless there have been countless times where people
             | automatically substitute the more common name, or even
             | worse in text messages manage to misread it and reply
             | incorrectly.
             | 
             | It sometimes upsets her. The NPC analogy is very apt, I
             | guess many people are just very preoccupied?!
        
               | derefr wrote:
               | > or even worse in text messages manage to misread it and
               | reply incorrectly
               | 
               | Overzealous autocorrect can happen to names, too. There's
               | a whole thing about Asian names not being in computer
               | spellcheck dictionaries:
               | https://www.abbynews.com/news/youre-not-a-mistake-b-c-
               | group-...
        
               | InitialLastName wrote:
               | Not just Asian names; my SO's (English-language) nickname
               | frequently autocorrects to its common homophone. I can
               | always tell who proofreads their texts by how it ends up
               | spelled.
        
               | lostlogin wrote:
               | Try typing 'Sian' on iOS, (well, maybe Sian) and it
               | autocorrects to Asian.
               | 
               | Unhelpful, though luckily found funny when I did it.
        
       | irusensei wrote:
       | If you type 彁 on Google Translate and set it to detect
       | language it will switch to Chinese and translate it to
       | "lingering". If you switch to Japanese no translation will
       | happen.
       | 
       | Also if you google search for 彁 one of the results will be
       | this video [!!!!seizure warning!!!!]
       | https://www.youtube.com/watch?v=EsOU0V2kpUI that seems to borrow
       | on the theme of a computer ghost character.
        
         | TazeTSchnitzel wrote:
         | Google Translate will hallucinate translations for complete
         | nonsense, so this probably doesn't mean anything.
        
       | einpoklum wrote:
       | I'm more worried about the inflation of emoji than a couple dozen
       | unused ghost JIS characters.
        
         | npteljes wrote:
         | What worries you about it?
        
         | jimmygrapes wrote:
         | If Slack/Discord/etc. custom emojis get used enough, do they
         | get incorporated into Unicode? I've seen something like 40
         | variants of laughing emoji, and closer to 400 variants of Pepe
         | the Frog, and I'm not even in any "alt right" or 4chan-adjacent
         | chat rooms/guilds where I imagine there are even more. Not to
         | mention the countless custom anime face ones.
        
         | wongarsu wrote:
         | Godwin's second law: any sufficiently long discussion about
         | Unicode includes a discussion about emoji :)
        
           | raphlinus wrote:
           | Yeah, it does seem to come up a lot more often than
           | discussions about U+5350.
        
         | edent wrote:
         | Why? Unicode isn't running out of space any time soon.
        
           | kevin_thibedeau wrote:
           | The encoding has gotten out of hand with compound emoji.
           | Splitting them on glyph boundaries is non-trivial.
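
To illustrate why splitting is non-trivial, here is a small Python sketch using a family emoji built from three people joined by ZERO WIDTH JOINER. It only counts code points and code units; proper grapheme segmentation needs the Unicode text segmentation rules, which the Python standard library does not implement.

```python
ZWJ = "\u200d"  # ZERO WIDTH JOINER

# man + ZWJ + woman + ZWJ + girl: displayed as one "family" glyph where supported
family = "\U0001F468" + ZWJ + "\U0001F469" + ZWJ + "\U0001F467"

code_points = len(family)
utf8_bytes = len(family.encode("utf-8"))
utf16_units = len(family.encode("utf-16-le")) // 2

print(code_points, utf8_bytes, utf16_units)

# Naive slicing by code point cuts the glyph apart and leaves a lone "man".
print(family[:1] == "\U0001F468")

# Splitting on the joiner recovers the three component emoji.
print(len(family.split(ZWJ)))
```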
        
           | Mountain_Skies wrote:
           | 640K should be enough for anybody.
        
       | olivierestsage wrote:
       | Reminds me of the case of U+237C ⍼ RIGHT ANGLE WITH DOWNWARDS
       | ZIGZAG ARROW [0], also discussed on HN [1].
       | 
       | [0] https://ionathan.ch/2022/04/09/angzarr.html
       | 
       | [1] https://news.ycombinator.com/item?id=31012865
        
       | jmillikin wrote:
       | Previously:
       | 
       | https://news.ycombinator.com/item?id=24951130 (2020)
       | 
       | https://news.ycombinator.com/item?id=17637375 (2018)
        
       | helsinkiandrew wrote:
       | It looks as if these (at least Shi ) are being used in various
       | places on and offline. It's eventually possible that they will
       | become associated with one or more meanings and perhaps a
       | pronunciation.
        
         | hnfong wrote:
         | In East Asian cultures that use Han characters, people used to
         | make up new characters when the need arises.
         | 
         | These days, we scroll through the Unicode standard and find
         | rarely used characters that were accidentally added and imbue
         | them with new meaning. (yes, this is seriously a thing)
        
           | dane-pgp wrote:
           | When the article said:
           | 
           | "In the end only one character had neither a clear source nor
           | any historical precedent: 彁."
           | 
           | my instinct was that this character could be retconned to
           | mean "character whose meaning has been lost", thus creating a
           | self-referential paradox.
           | 
           | Presumably someone would have to then separately come up with
           | a pronunciation for it. Perhaps pronouncing it "duangu" would
           | solve another problem:
           | 
           | https://coconuts.co/hongkong/lifestyle/duang-jackie-chan-
           | ins...
        
           | 1-more wrote:
           | Oooh that sounds fascinating. Any examples of that that
           | spring to mind? Is the pronunciation (or a reasonable
           | representation thereof) already recorded in the Unicode
           | standard or is that also a bit of free-jazz?
        
             | adastra22 wrote:
             | The character usually has a radical component which hints
             | at the pronunciation. They are ordered by radical in the
             | standard. So you would go spelunking for a little-used
             | character in the part of the standard which has characters
             | close in meaning or pronunciation to what you are looking
             | for
             | 
             | Or you just make something up. If you're coining a new
             | character, you probably don't care about whether the
             | pronunciation is already known.
        
             | ssnistfajen wrote:
             | An old one but possibly the earliest and most prominent of
             | obsolete Chinese characters being imbued with new
             | (Internet-based) meanings:
             | https://en.wikipedia.org/wiki/Jiong
             | 
             | There's also 奭 https://en.wiktionary.org/wiki/%E5%A5%AD
             | which is occasionally used as a censorship workaround to
             | mock one of Xi Jinping's gaffes in an early-2000s TV
             | interview where he bluffed about being able to carry two
             | hundred "catty" (~100kg)'s worth of wheat on rural mountain
             | roads. The character is composed of two 百 ("hundred")
             | and one 人 ("human/person/people") which is a pictorial
             | allusion to that line he said on TV. I can't find any
             | sources about this one that's in English so please bear
             | with my half-assed explanation.
        
               | 1-more wrote:
               | Both cases are fascinating, thank you!! Side note: of
               | course 奭 is pronounced shi. I only know a bare minimum
               | about Chinese but when in doubt: it's pronounced "shi"
               | (with some license regarding tone).
               | https://en.wikipedia.org/wiki/Lion-
               | Eating_Poet_in_the_Stone_...
        
           | adastra22 wrote:
           | One of the reasons I wish a compositional language had been
           | standardized for Unihan instead of the code-point-for-every-
           | character approach.
        
         | jxy wrote:
         | Wiktionary claims this character is in Guangyun (1007-1008, see
         | https://en.wikipedia.org/wiki/Guangyun), and gives the link to
         | Kangxi dictionary (1716),
         | https://www.kangxizidian.com/kangxi/0256.gif which means that
         | this character likely predates the Japanese "Overview of
         | National Administrative Districts".
        
       | sbf501 wrote:
       | Can we talk about the artwork used?
       | 
       | https://dl.ndl.go.jp/info:ndljp/pid/1312837?itemId=info%3And...
       | 
       | https://philamuseum.org/collection/object/84871
       | 
       | Googling for Tsukioka Yoshitoshi brings up so much SEO that it is
       | hard to find information in English. If anyone knows anything
       | about it, I'd be appreciative for a pointer about its
       | content/subject!
        
         | polm23 wrote:
         | Author here. Nobody has ever asked about the art before. It
         | depicts Maruyama Oukyo, a famous painter of ghosts (and other
         | things), where one of his pieces comes to life and frightens
         | him.
         | 
         | https://en.wikipedia.org/wiki/Maruyama_%C5%8Ckyo
        
       | lapetitejort wrote:
       | I can't be the only person who thought the character would be ,
       | right? (based on the first line of the Communist Manifesto:
       | https://en.wikisource.org/wiki/Manifesto_of_the_Communist_Pa...)
       | 
       | edit: ah the character (hammer and sickle) does not show up
        
       | aatharuv wrote:
       | For obviously fake characters, a Unicode proposal for the
       | Egyptian Hieroglyphics Extended-A block managed to include a
       | hieroglyph for an ancient Egyptian holding a laptop. (Note that
       | this is a proposal, and has not yet made it into the standard.)
       | Presumably it was a copyright trap.
       | 
       | https://www.unicode.org/mail-arch/unicode-ml/y2020-m02/0018....
        
       | ChrisArchitect wrote:
       | (2018)
        
       | hnfong wrote:
       | This might be an interesting read to those unfamiliar with CJK, but
       | character bloat(?) isn't remotely a recent thing. It's actually
       | at least a couple hundred years old.
       | 
       | The Kangxi dictionary (1716), an authoritative dictionary of
       | Chinese characters, contains definitions for 47035 characters,
       | even though only a couple thousand are in common use. Quoting
       | from Wikipedia: "The dictionary was the largest of the
       | traditional dictionaries, containing 47,035 characters. Some 40%
       | of them are graphic variants, however, while others are dead,
       | archaic, or found only once. Fewer than a quarter of the
       | characters it contains are now in common use."
       | 
       | All of these archaic (or even bogus in some cases) characters
       | found in the dictionary are now part of the Unicode standard, of
       | course :) The unihan database even has a field that shows the
       | page number where the character appears in the Kangxi dictionary.
       | If you're wondering why 65536 characters isn't enough for
       | everyone, the junk in Kangxi dictionary is a significant
       | contribution.
        
         | mytailorisrich wrote:
         | I think 'character bloat' is simply inherent to the writing
         | system when characters are written by hand (now that perhaps
         | most written communication is digital people can't use
         | characters that are not already supported)
         | 
         | Anyone can invent characters whenever they want, and it's only
         | a question of them sticking or not.
         | 
         | I think this is also one of the reasons for the Chinese
         | tendency to push for unification and uniformity.
        
           | lazide wrote:
           | When it's character based instead of alphabet based, I think
           | it's the equivalent of coming up with a new word in English,
           | which is basically what you're describing.
           | 
           | Sometimes it's mashing two previously unrelated 'words'
           | together (aka the tons of compound characters in Chinese),
           | other times it's coming up with something completely new.
           | 
           | Same rules apply though, if it doesn't add value worth the
           | trouble (or get mandated by the powers that be), it'll
           | eventually just die out or be a curiosity.
           | 
           | Also, to keep it tech related:
           | 
           | RISC = English, CISC/VLIW = Chinese?
        
             | tokinonagare wrote:
             | > Sometimes it's mashing two previously unrelated 'words'
             | together (aka the tons of compound characters in Chinese),
             | other times it's coming up with something completely new.
             | 
             | That's not how it works. Most Chinese characters stem from
             | a character C having a pronunciation A referring to a
             | meaning M being used to note another word of meaning M'
             | with same pronunciation A (sometimes slightly different
             | A'). This of course doesn't scale really well, hence the
             | existence of determiners in logographic scripts, which are
             | words used without their pronunciations placed before or
             | after another to give a semantic clue. The innovation of
             | Chinese (which I think is why it's still an efficient
             | script today) was to incorporate the determiner in the
             | character itself to give birth to a character C' where a
             | part refer to the pronunciation and another acts as the
             | determiner, instead of padding the main text with (a lot
             | of) determiners.
        
             | nneonneo wrote:
             | IIUC Old Chinese was a much more "isolating" language, in
             | that words were typically single characters - meaning that
             | to make new words, you typically needed to make new
             | characters. As it evolved through the ages, "compound"
             | words composed of multiple characters became more common.
             | These days, new words are almost always combinations of
             | multiple characters (often 2, occasionally 3-4).
        
               | lazide wrote:
               | Any idea if it was due to things like the Confucian
               | Official's exam system (and corresponding increase in
               | prioritization of education)?
               | 
               | More complex characters require more education to
               | understand is my guess. Some of the traditional ones
               | are..... obscure, and crazy complex.
        
               | R0b0t1 wrote:
               | I'm not entirely sure what you mean to ask nor am I a
               | Chinese speaker, but I have myself suspected that the
               | massive variety of characters was a side-effect of having
               | a middle class that was differentiated based on their
               | ability to read. You see various in-group signalling
               | systems similar to this in lots of areas.
               | 
               | A good historical example is all the strangely specific
               | words for groups of animals. A history I read of this
               | indicated these terms were first found in books sold to
               | nobility, and they were just made up. But you weren't hip
               | if you weren't reading that literature.
        
               | duskwuff wrote:
               | > These days, new words are almost always combinations of
               | multiple characters (often 2, occasionally 3-4).
               | 
               | Yep! For example, the most common Chinese term for
               | "Internet" is 互联网 (hulianwang). This is composed of three
               | characters:
               | 
               | 互 (hu): "mutual"
               | 
               | 联 (lian): "join", "coupled", "allied"
               | 
               | 网 (wang): "net" -- carrying both the meaning of a woven net
               | and a computer network
        
         | ars wrote:
         | Does Unicode really need to store Chinese words? Is it
         | impossible to deconstruct the glyphs into strokes, each stroke
         | effectively being a character?
        
           | j16sdiz wrote:
           | Many have attempted, but nobody has succeeded. The most famous one
           | is `Chu, B.F.: Han Zi Ji Yin Zhu Bang Fu Han Zi Ji Yin Gong
           | Cheng  (Genetic engineering of Chinese characters) (2003),
           | http://cbflabs.com/down/show.php?id=26 `
        
           | peter303 wrote:
           | In the early days of computers some character systems were
           | stroke-based because that used less memory than a 32x32
           | bitmap. A kilobit of ROM (one character) could cost $10.
           | 
           | Currently stroke-based systems are used for calligraphic
           | effect. You could generate new font types, e.g. bold, by
           | controlling the shape of strokes.
           | 
           | Stroke systems are important for teaching character writing
           | because the drawing order is rigorously prescribed. Once you
           | learn the first couple hundred, you can pretty much guess
           | future characters. Wrong order characters often look bad and
           | suggest a non-Chinese speaker mis-copied them. (e.g. some
           | tattoos)
        
           | nneonneo wrote:
           | Unicode has support for this, in the Ideographic Description
           | Characters block (https://en.m.wikipedia.org/wiki/Ideographic
           | _Description_Char...). However, it's purely descriptive, and
           | not designed for rendering.
           | 
           | There are somewhat more sophisticated systems which define
           | both the rendering and stroke decomposition of characters
           | (e.g. CDL: http://guide.wenlininstitute.org/wenlin4.3/Charact
           | er_Descrip...). The general workaround for characters that
           | aren't on Unicode would be to use one of these stroke
           | description systems to create the character, then render it
           | to an image and insert it.
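
As a small illustration of the "purely descriptive" point, the Python sketch below builds an ideographic description sequence for 林 out of the left-to-right operator and two 木 components. The choice of 林 is just an example; the point is that the sequence is ordinary text that never turns into the character it describes.

```python
import unicodedata

LEFT_TO_RIGHT = "\u2ff0"                      # ⿰
description = LEFT_TO_RIGHT + "\u6728\u6728"  # ⿰木木, a description of 林
target = "\u6797"                             # 林 itself

print(unicodedata.name(LEFT_TO_RIGHT))

# The description is three separate code points and is not equal to the target.
print(len(description), description == target)

# Normalization does not compose it either: it stays a description.
print(unicodedata.normalize("NFKC", description) == description)
```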
        
           | cyphar wrote:
           | Even with the current system, very little software is even
           | aware that the same codepoint should be rendered differently
           | in different languages (Fan su needs to be rendered
           | differently in every CJK locale) which often results in
           | websites and programs using Chinese fonts for Japanese text
           | (even if you've configured your language as Japanese). Having
           | stroke breakdowns would not make this situation better
           | because there are multiple ways to render the same stroke
           | description and there aren't really systematic rules for how
           | to correctly represent the Japanese (or Taiwanese or Korean)
           | version of a character.
           | 
           | I dread to think what an enormous mess would result if every
           | character was represented as a build-it-yourself instruction
           | manual rather than allowing font authors to correctly
           | represent the characters.
           | 
           | Also nobody in China, Japan, nor Korea would use an encoding
           | system so incredibly inefficient that more strokes results in
           | more bytes being necessary to store the character (Japan
           | already compromised with having 3-byte UTF-8 characters when
           | JIS only required 2). This would've resulted in the failure
           | of Unicode's mission to be the One True Encoding Format.
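
The byte-count compromise mentioned above is easy to check. This Python sketch encodes the two-character word 漢字 (picked only as an example) with UTF-8 and with two common JIS-based encodings from the standard codec set.

```python
text = "\u6f22\u5b57"  # 漢字, two kanji

# Byte length of the same text under each encoding.
sizes = {codec: len(text.encode(codec)) for codec in ("utf-8", "shift_jis", "euc_jp")}

for codec, size in sizes.items():
    print(f"{codec}: {size} bytes for {len(text)} characters")
```

UTF-8 spends three bytes on each of these kanji while the JIS-based encodings spend two, so a stroke-by-stroke encoding would have widened that gap much further.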
        
           | yongjik wrote:
           | The problem with that would be that every software must know
           | the intricate rules about combining glyphs, and if they guess
           | wrong, users get garbage characters.
           | 
           | Considering that the majority of code is written by people
           | who don't know Chinese characters, it would result in never-
           | ending issues, pretty much everywhere.
           | 
           | Korean actually has a two-way system in Unicode. Every
           | conceivable character (= syllable) possible in modern Korean
           | has its own codepoint, which allows most software to display
           | them correctly: from their point of view, it's just another
           | CJK character.
           | 
           | On the other hand, there is a Unicode area containing Korean
           | sub-blocks ("jamo") that were used historically. In theory,
           | you can combine them and get some pretty funky archaic
           | syllables. Almost no software renders them right.
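
The two-way system can be seen with Unicode normalization. The Python sketch below decomposes the precomposed syllable 한 into its jamo and recomposes it, and also applies the arithmetic the standard defines for modern syllables. The helper name `compose` and the index values passed to it are illustrative choices, not part of any library.

```python
import unicodedata

S_BASE, V_COUNT, T_COUNT = 0xAC00, 21, 28  # constants from the Hangul composition algorithm

def compose(lead: int, vowel: int, tail: int = 0) -> str:
    """Build a precomposed syllable from jamo indices (hypothetical helper)."""
    return chr(S_BASE + (lead * V_COUNT + vowel) * T_COUNT + tail)

syllable = "\ud55c"                            # 한, one precomposed code point
jamo = unicodedata.normalize("NFD", syllable)  # leading consonant, vowel, trailing consonant

print([f"U+{ord(c):04X}" for c in jamo])
print(unicodedata.normalize("NFC", jamo) == syllable)

# Lead index 18, vowel index 0, tail index 4 give the same syllable arithmetically.
print(compose(18, 0, 4) == syllable)
```

Archaic jamo combinations have no precomposed code point, so they only exist as jamo sequences and depend entirely on the renderer to stack them correctly.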
        
             | mike_hock wrote:
             | They can't even get much simpler things right. Qt
             | incorrectly combines accents with the character to the
             | right instead of the left and has been refusing to fix this
             | bug for years.
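
For reference, the expected behavior is that a combining mark attaches to the character before it. A short Python check using normalization, with an "e" followed by a combining acute accent and then an "x":

```python
import unicodedata

decomposed = "e\u0301x"  # e, COMBINING ACUTE ACCENT, x
composed = unicodedata.normalize("NFC", decomposed)

# The accent composes with the preceding "e", never with the following "x".
print(composed == "\u00e9x")

# A non-zero combining class marks U+0301 as a combining character.
print(unicodedata.combining("\u0301"))
```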
        
             | [deleted]
        
         | ComodoHacker wrote:
         | >Fewer than a quarter of the characters it contains are now in
         | common use
         | 
         | 12K characters in common use is just as impressive to me as a
         | non-Asian.
        
           | adastra22 wrote:
           | More like 12k characters currently in use at all. Common use
           | characters are a much smaller set than that. (3k or so?)
        
           | ssnistfajen wrote:
           | It's actually way fewer than that IRL. Japan's official list
           | of commonly used Kanji only has 2136 characters. Taiwan's
           | list has 4808, and the PRC's list has 3500 "frequent"
           | characters with another 3000 supplementary "common" ones.
           | Digitization has made it even easier to use these characters
           | without recognizing the actual form or how to write them.
        
             | cyphar wrote:
             | The 常用漢字 (Japanese Common Use Kanji) list
             | does not include many kanji that native speakers can read
             | and newspapers don't always follow the rule that they only
             | should use characters from the list. In addition, you need
             | to include the 人名用漢字 (Personal Name Use
             | Kanji) in the list because basically all of those
             | characters are also used in fairly common words.
             | 
             | Native speakers can probably recognise at least 3-4k kanji
             | if not more but can probably only write around 2k from
             | memory, depending on how well-read they are.
             | 
             | 嘘 (lie) is the best example of an incredibly common word
             | whose kanji form (which is used fairly often) is not in any
             | official government list.
        
             | DiogenesKynikos wrote:
             | If you look at a frequency list of Chinese characters,[0]
             | the top 4800 characters make up about 99.9% of modern
             | texts.
             | 
             | That means that if you know 4800 characters, and you read a
             | text that is 1000 characters (equivalent to around 700
             | words) long, there's likely one character you won't
             | recognize.
             | 
             | The funny thing is, if you recognize only the top six
             | characters, you already know 10% of the characters in a
             | typical text. The distribution is very top-heavy, but with
             | a long tail that you do have to learn to become literate.
             | 
             | 0. https://lingua.mtsu.edu/chinese-
             | computing/statistics/char/li...
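
The arithmetic behind the "one character you won't recognize" estimate, as a Python sketch. It takes the 99.9% coverage figure from the comment at face value and assumes each character in the text is an independent draw, which real text is not.

```python
coverage = 0.999    # share of running text covered by the top 4800 characters
text_length = 1000  # characters in the text being read

# Expected number of characters outside the known set.
expected_unknown = text_length * (1 - coverage)

# Probability of reading the whole text without meeting a single unknown character.
p_no_unknown = coverage ** text_length

print(f"expected unknown characters: {expected_unknown:.2f}")
print(f"chance of meeting none at all: {p_no_unknown:.3f}")
```

Under those assumptions the expectation is one unknown character per thousand, but roughly a third of such texts would contain none at all.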
        
       | jamal-kumar wrote:
       | I thought this was going to be about something like the massive
       | security problem of homoglyph attacks being currently deployed in
       | stuff like phishing baked into the standard at first glance of
       | the title, but this ghost character business is pretty
       | interesting. Japanese literacy requires you to know 2-4 meanings
       | per 2,136 kanji characters (something like 6000+ in total
       | possible meanings between these characters) just to be able to
       | pass a university level literacy test, it's a massive amount of
       | complexity to get right. Even if you just need basic literacy
       | it's still about a thousand less than that, and there's even more
       | than these I mentioned for further literacy competence.
       | Furthermore each of these characters looks funny if not unreadable
       | if you write it down using the wrong order of strokes. I can
       | see how mistakes might have been made even by native speakers of
       | that language. The two kana syllabaries are there of course and
       | mixed in with the kanji, but if everything was written in that
       | you wouldn't be able to achieve the same amount of information
       | density, which is probably part of the reason they never switched
       | over (I understand before world war 2 or so, the more rounded
       | hiragana was for women while the more sword stroke like katakana
       | was for men).
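
On the homoglyph point raised at the start of this comment, a minimal Python sketch of the problem: two strings that look alike but differ in code points. The word "bank" is an arbitrary example, and using the first word of the Unicode character name as a stand-in for the script is a rough heuristic, not the real Script property.

```python
import unicodedata

genuine = "bank"
spoofed = "b\u0430nk"  # second letter is CYRILLIC SMALL LETTER A, not Latin "a"

print(genuine == spoofed)  # visually similar, but different strings

def rough_scripts(text: str) -> set:
    """Approximate the scripts in a string via character names (heuristic only)."""
    return {unicodedata.name(ch).split()[0] for ch in text}

# A mixed-script label is a common warning sign for spoofing.
print(rough_scripts(genuine))
print(rough_scripts(spoofed))
```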
        
       | js8 wrote:
       | Is it possible for Unicode standard to deprecate characters? If
       | yes, has it already happened?
        
         | Fell wrote:
         | I don't think so. It would make it impossible to talk about
         | deprecated characters ever again, even in a historical context.
         | 
         | Unicode contains even some ancient and long forgotten scripts
         | so historians can keep proper records of them.
        
         | jfk13 wrote:
         | Yes, and yes.
         | 
         | https://en.wikipedia.org/wiki/Unicode_character_property#Dep...
        
       | bqmjjx0kac wrote:
       | This is a tangent, but I felt like sharing. In college, I
       | purchased a used copy of the Communist Manifesto. Famously, the
       | first line reads, "A spectre is haunting Europe, ...".
       | 
       | The previous owner had both highlighted and circled the word
       | "spectre" and wrote "ghost?" in the margins. The rest of the text
       | was similarly marked up.
       | 
       | Every time I hear the word "spectre" I see "ghost?" in my mind's
       | eye.
        
       ___________________________________________________________________
       (page generated 2022-07-14 23:00 UTC)