[HN Gopher] A Spectre Is Haunting Unicode (2018) ___________________________________________________________________ A Spectre Is Haunting Unicode (2018) Author : EvanAnderson Score : 231 points Date : 2022-07-14 13:00 UTC (10 hours ago) (HTM) web link (www.dampfkraft.com) (TXT) w3m dump (www.dampfkraft.com) | NeoTar wrote: | Is there anything similar for Latin characters? | | The only circumstance I can imagine is where a Latin character | has been erroneously encoded with an unused diacritic, for | instance a T with a diaeresis. | wbl wrote: | Multilocular o is known only from a single word in a single | manuscript. | indecisive_user wrote: | link to the wiki article. Though this is a variation of a | Cyrillic letter, not latin | | https://en.wikipedia.org/wiki/Multiocular_O | asveikau wrote: | This is funny. The Indo European (and hence Slavic) roots | for eyes typically have an /o/, and this glyph is round | like the eye, so it seems like this character and others | linked in that article are just people making little | cartoonish drawings on writings involving descriptions of | eyes. | jxy wrote: | Something like the letters V and U. | lisper wrote: | Or double-U (i.e. UU == W) | krossitalk wrote: | > At this rate they'll presumably be with humanity forever. Ps | | So, that's a really interesting thought. Perhaps our solution to | a permanent reminder of nuclear destruction[1] could be hidden | inside a plane of Unicode. | | [1] https://en.wikipedia.org/wiki/Long- | term_nuclear_waste_warnin... | remram wrote: | Maybe Unicode will feature the same kind of warnings one day. | | > This Unicode range is not a place of honor. No highly- | esteemed symbol is registered here. | | > What was here represented cultural signs that were considered | powerful in our time. | wongarsu wrote: | Maybe we can encode instructions on how to restart society in | Unicode character names? After all basically every computer | contains a list of them. 
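On the point that basically every computer already carries a list of Unicode character names: a minimal sketch, assuming a stock Python install whose `unicodedata` module bundles some version of the Unicode Character Database. The helper name `describe` is made up for illustration. Han ideographs only get algorithmic names (the code point repeated), so the names would not say much about them.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for one character, with a placeholder if it has no name."""
    # The second argument is the fallback used for unnamed code points
    # (for example private-use characters).
    name = unicodedata.name(ch, "<no name in this Unicode database>")
    return f"U+{ord(ch):04X} {name}"


if __name__ == "__main__":
    # A Latin letter, the zigzag arrow mentioned in this thread, Multiocular O,
    # a CJK ideograph (algorithmic name only), and a private-use code point.
    for ch in ["A", "\u237C", "\uA66E", "\u5F41", "\uE000"]:
        print(describe(ch))
```

Going the other way, `unicodedata.lookup("CYRILLIC LETTER MULTIOCULAR O")` returns the character for a given name, provided the bundled database is new enough to know it.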
| bogwog wrote: | That did not end well for the Georgia guidestones... | ethbr0 wrote: | Also thought about posting that this morning, but wasn't | sure anyone else would get the reference. (As context for | everyone else, some kook blew up some of the guidestones | last week in the middle of the night) | _dain_ wrote: | Not so kooky, they were a call for genocide. | tomcatfish wrote: | From what I see, this was the maximally flame-y way to | say what you said, and it's still inaccurate to call it | "Not so kooky" as these commandments, while disagreeable | to me, are not really that violent. | | 1. Maintain humanity under 500,000,000 in perpetual | balance with nature. 2. Guide reproduction wisely - | improving fitness and diversity. 3. Unite humanity with a | living new language. 4. Rule passion - faith - tradition | - and all things with tempered reason. 5. Protect people | and nations with fair laws and just courts. 6. Let all | nations rule internally resolving external disputes in a | world court. 7. Avoid petty laws and useless officials. | 8. Balance personal rights with social duties. 9. Prize | truth - beauty - love - seeking harmony with the | infinite. 10. Be not a cancer on the Earth - Leave room | for nature - Leave room for nature. | | source: https://en.wikipedia.org/wiki/Georgia_Guidestones | #Inscriptio... | ethbr0 wrote: | The Latin alphabet being boring, I spent some time going through | ancient alphabets included in Unicode. | | It gets pretty trippy, pretty quick. | | As in "We don't have a clear idea what this rune was for, or what | it means, but we see it in documents and so added it to Unicode." | | https://en.m.wikipedia.org/wiki/Runic_(Unicode_block) | shantara wrote: | My favorite Unicode glyph is Multiocular O (ꙮ). There is only | one recorded usage, by a 15th century Russian monk, who decided | to use it in phrase "many-eyed seraphim" instead of two regular | letters 'o'. So of course it was added to Unicode. 
| | https://en.wikipedia.org/wiki/Multiocular_O | lmkg wrote: | It gets better: this glyph is bugged. Somehow, the guy | responsible for adding it to Unicode somehow got _the number | of eyes wrong_. Per his description, Unicode fonts represent | it with 7 eyes, but after getting called out on Twitter he | realized the original manuscript shows 10 eyes. | | This bug will be fixed in Unicode 15. | corrral wrote: | What about modern uses of the character that specifically | intended 7 eyes? Unicode needs to add a time or (worse, but | probably OK) version datum to glyphs or glyph ranges, I | suppose (applying it only at the document level wouldn't | suffice, as in the case of quoting). | B1FF_PSUVM wrote: | Achieving peak Byzantium there, I guess. | thaumasiotes wrote: | > As in "We don't have a clear idea what this rune was for, or | what it means, but we see it in documents and so added it to | Unicode." | | Documents? I had the strong impression that there are no | documents written in runes. A rune we only know by its | occurrence in documents would be far more interesting for the | existence of a document than it would be for its own sake! | | Compare what the page about Anglo-Saxon runes says about the | corpus: | | > The Old English and Old Frisian Runic Inscriptions database | project at the Catholic University of Eichstatt-Ingolstadt, | Germany aims at collecting the genuine corpus of Old English | inscriptions containing more than two runes in its paper | edition, while the electronic edition aims at including both | genuine and doubtful inscriptions down to single-rune | inscriptions. | | > The corpus of the paper edition encompasses about one hundred | objects (including stone slabs, stone crosses, bones, rings, | brooches, weapons, urns, a writing tablet, tweezers, a sun- | dial,[clarification needed] comb, bracteates, caskets, a font, | dishes, and graffiti). 
The database includes, in addition, 16 | inscriptions containing a single rune, several runic coins, and | 8 cases of dubious runic characters (runelike signs, possible | Latin characters, weathered characters). Comprising fewer than | 200 inscriptions, the corpus is slightly larger than that of | Continental Elder Futhark (about 80 inscriptions, c. 400-700), | but slightly smaller than that of the Scandinavian Elder | Futhark (about 260 inscriptions, c. 200-800). | | So across every runic system we know, we have under 600 texts, | _all_ of those texts are short inscriptions, and even to reach | that number of samples we need to include texts that we aren 't | even sure contain any runes. | yorwba wrote: | Runes continued to be used long past the Elder Futhark period | and from the medieval period manuscripts survive that fit the | modern conception of a "document", most famously the Codex | Runicus https://www.e-pages.dk/ku/579/html5/ (202 pages) | bombcar wrote: | https://www.youtube.com/watch?v=2yWWFLI5kFU describes another | side-effect of encoding old scripts/runes. | hypertele-Xii wrote: | > I had the strong impression that there are no documents | written in runes. | | There are. Such documents are called runestones and thousands | survive to this day, most in Sweden. | | https://en.wikipedia.org/wiki/Runestone | eesmith wrote: | Huh. https://en.wikipedia.org/wiki/Document says: | | > Documents are also distinguished from "realia", which are | three-dimensional objects that would otherwise satisfy the | definition of "document" because they memorialize or | represent thought; documents are considered more as | 2-dimensional representations. | | I think "realia" - a term I had never heard before - | describes runestones better than "document". | hprotagonist wrote: | >Documents? I had the strong impression that there are no | documents written in runes. | | If a clay tablet counts, why not a runestone? 
| thaumasiotes wrote: | I'm not knocking runestones for being the wrong medium. I'm | knocking them for not being documents. A typical cuneiform | record might be analogized to an invoice for delivery of a | crate of shirts or whatever. (And of course we also have | textbooks, dictionaries, literature, correspondence, | business reports, mathematical treatises, and every other | type of written work.) A typical runic record would be more | like the text "Made in Taiwan" printed on the shirt labels. | | One of the biggest problems in the study of these cultures | is that they left no written records. We know they had a | writing _system_ , the runes, but as far as we can tell | they almost never used it for anything. Quite the opposite | is true of Mesopotamian cultures, where we're buried in | more records than we have the manpower to translate. | hprotagonist wrote: | I suppose it also matters what you think a rune is. Does | futhork count? There's parchment with that written on it. | Elder Futhark, none as far as i know. | gumby wrote: | > Documents? I had the strong impression that there are no | documents written in runes. | | One of the original goals of Unicode was to be able to | computerize every document. I still have some old linguistics | books in which characters have been handwritten into typed or | even typeset text. So these are the types of documents being | referred to: academic papers. | | Some fancy books have photographs of ancient writing; I'm not | sure if Unicode tries to encode such sources and I pretty | much doubt it (how would you even know what to call the | symbols? You touch on this in your comment). However often | they are attached to treatises that order the characters in | some way (I.e. index an alphabet) in which case the first | case above would apply. | | In other words: thanks to some scholars who wrote down and | ordered runic alphabets, you can now discuss runes with your | friends and colleagues through email. 
| thaumasiotes wrote: | > One of the original goals of Unicode was to be able to | computerize every document. I still have some old | linguistics books in which characters have been handwritten | into typed or even typeset text. | | That's a weird goal for Unicode to have. We've already | accomplished that; a PDF file does the job _better_ (note: | PDF documents _already support_ every character existing in | the past, present, or future!) while being less complex. | gumby wrote: | I don't understand. If there is no computerized way to | represent the script, all you can do would be to include | photographs in your pdf. The point of computerization is | not simply storage and retrieval (and retrieval is hard | if you can't represent the script) but automated | processing, which is meaningless if you can't represent | any semantics). | | Separately, PDF felt like a step backwards on the day it | was announced and sadly nothing since then has changed | that. | CorrectHorseBat wrote: | How do you search for non-unicode characters in a pdf | document? | thaumasiotes wrote: | How do you search for them in a book? | gpderetta wrote: | ctrl-F once you have digitized it. | jen20 wrote: | And how do you type the character you are searching for? | cgriswald wrote: | On Ubuntu: l-ctrl+l-shift+u, <codepoint>, <enter> | | Of course, that sucks, so I've programmed a nearby key to | act as l-ctrl+l-shift+u. | | Several characters can also be typed with Compose Key. | | For characters I use regularly (in my case, generally the | elder and younger futharks), I've created a keyboard out | of an Elgato StreamDeck XL so I can type any of these | runes with a single button press. | gpderetta wrote: | I don't think that not having a physical key on the | keyboard has ever stopped anybody from inputing unicode | symbols. | [deleted] | runarberg wrote: | This is interesting. I'm comparing this to how musical | notation is encoded in unicode. 
I mean, there is a block | dedicated to the symbols, so the symbols are encoded, but | you can't document music using only unicode. But musical | documents are being composed and written all the time. To | write music you need an additional software which arranges | these symbols in a certain way so that they express the | authors intention. | | I guess math has a similar representation in unicode as | well. | | All that said, I think people use runes to express magic | and spells (even to this day). I don't think all the | magical runes are expressed in unicode (and perhaps they | shouldn't). If you want to use a rune in that way, you | might have to draw it out in SVG or something and then | email it to your friends. | thaumasiotes wrote: | > I guess math has a similar representation in unicode as | well. | | It's an ongoing project. As you seem to have guessed, | Unicode math symbols are just about as useless for | representing math as Unicode music symbols are for | representing music. Producing mathematical documents is | done using dedicated software, generally LaTeX. | | (And what you get is a PDF, because, as I noted in | another comment, PDFs already support every notation | there is, was, or ever will be.) | jake_morrison wrote: | In the 90s I worked on a project to digitize land registration in | Taiwan. | | In order to record deeds and property transfers, we needed to | enter people's names and official registered addresses into the | computer system. The problem was that some people used non- | traditional writing variants for their names, and some of their | birthplaces were tiny places in China with weird names. | | Someone might write their name with a two-dot water radical | instead of three-dot radical. We would print it out in the normal | font, and the people would lose their minds, saying that it was | wrong. 
Chinese people can be superstitious about the number of | strokes in their name, so adding a stroke might make it unlucky, | so they would not buy the property. | | The customer went to the agency responsible for managing the big | character set, https://en.wikipedia.org/wiki/CNS_11643 Despite | having more characters than anything else on earth, it didn't | have those variants. The agency said they would not encode them, | because they were not real characters, just printing differences. | | The solution was for the staff in the office to use a "font | maker" program to create a custom font with these characters. | Then they could print out the deeds using a Chinese variant of | Adobe Acrobat, and everyone was happy. | agumonkey wrote: | Forgot which country (Iran, Turkey...) but one diacritic on a | phone text got a girl killed because it altered the meaning of one | word, turning the sentence from loving to threatening or | insulting. | not2b wrote: | In Spanish, dropping one diacritic (~) changes "How old are | you?" to "How many anuses do you have?". | eesmith wrote: | In English, dropping one diacritic changes "Where's the | rosé?" to "Where's the rose?", and changes "My maté is | cold" to "My mate is cold." | ajuc wrote: | In Polish "zrób mi łaskę" means "do me a favor" and "zrób | mi laskę" means "give me a blowjob". | Dylan16807 wrote: | > rosé | | Maybe, though it's still halfway the same word. | | > maté | | Not a change, both spellings are valid. | eesmith wrote: | Maybe even three-quarters the same word. (4/5ths if you | count code points in NFD!) | | Malé parties are a lot of fun. | | Those are some pretty lamé runners. | schoen wrote: | An oddity is that "maté" (meant to indicate that the e is | _pronounced_) is an incorrect spelling in both Spanish | and Portuguese, where it would wrongly suggest that the e | is _stressed_. | | https://en.wikipedia.org/wiki/Yerba_mate#Name_and_pronunc
| kps wrote: | https://gizmodo.com/a-cellphones-missing-dot-kills-two- | peopl... | jwilk wrote: | Disussed on HN in 2008: | | https://news.ycombinator.com/item?id=226853 (18 comments) | asveikau wrote: | That sounds terrible, however, it's important to remember | that diacritics don't get people killed, the person who | decides to kill ultimately needs to stop themselves. | _jal wrote: | No, "diacritics don't kill people, people kill people" is | not an important life lesson. It is a reductive just-so | generalization of basic common sense that obscures more | than it enlightens. | | The important thing for engineers to note is a technical | shortcoming caused a tragic misunderstanding. Focusing | instead on the well-known fact that some people have poor | impulse control, knowing full well that is a non- | controllable input, instead makes an excuse for poor | engineering and implicitly expresses powerlessness to do | anything about the problem. | asveikau wrote: | I am all for good localization efforts. I've been | something of a champion for that whenever I've been | around user facing code and people working on it. I also | am a bit of a language nerd and not monolingual. | | But yes, misunderstanding or not, we should not kill | people. | | The story in the sibling comment is about a man attacking | his daughter's ex because the ex came to apologize about | a confusion over the Turkish dotless I. That's still a | violent attack that the father could have kept his | emotions in check. I don't condone calling the daughter | names, even accidentally, but it is not a crime and the | right response is not attempted murder. | _jal wrote: | > but it is not a crime and the right response | | I don't know who you're arguing with, but it isn't me. | Nobody is saying it was. | | I'm saying it is an irrelevant non sequitur. | | Imagine that Dad instead misunderstood an instruction | related to a financial transaction and lost a ton of | money. 
Would you now be discounting the technical problem | that caused the misunderstanding and berating Dad for | being foolish? | asveikau wrote: | I'm not discounting the technical problem. | | If I were on a code review and I spotted an issue | affecting Turkish dotless I, I assure you I would rant | about it more than is reasonable. | agumonkey wrote: | Even to a lesser extent, it's easy to forget how a small | mistake can have a butterfly effect in other cultures. | pixl97 wrote: | Ya, I don't see that happening in authoritarian countries. | | As a contrived example if you had a symbol for 'happy' you | want to be very cautious that it doesn't get converted to | 'gay' because in your language gay and happy mean the same | thing, in some repressive regime it means the leadership | gets to execute you with the approval of the law. | zarzavat wrote: | A recent example is that "Let's go [gun emoji] him" could | be interpreted as either harmless fun, or conspiracy to | murder, depending on if the recipient's phone displays | that as a water pistol or a real gun. | | Edit: weirdly HN refuses to display that emoji. | tomcatfish wrote: | HN does not like displaying emojis, though a few slip | through I believe. | lolc wrote: | Hacker News doesn't allow emojis because only serious fun | or something. | EvanAnderson wrote: | Yikes. If somebody hasn't written a "falsehoods programmers | believe about human writing systems" document this would make | for a good start. | cestith wrote: | It deserves its own entry in "falsehoods programmers believe | about names" lists too. | lmkg wrote: | It's already there, #11 "People's names are all mapped in | Unicode code points." | Scarblac wrote: | The falsehood here is thinking that if you can encode the | name into the right code points, and you have a font that | can print them, the result will be acceptable to the | people whose name it is. 
| | They had that, but needed a font that used a different | number of strokes for the characters because of the | superstition. | lmkg wrote: | One could argue they're facets of the same issue. | Although in the spirit of the original list, they would | probably get split into separate line items. | | On further review, I think this is also similar to #12 & | #13 on the list: "names are case-sensitive," and "names | are not case-sensitive." To generalize that to include | non-Western alphabets: display variations of the same | character are significant, and display variations of the | same character are not significant. | | This of course goes back to the evergreen philosophical | question "what even is a character, anyways?" Since we've | found a case where two characters which are the same | character are not the same character. Are they distinct | characters or typographical variants? Yesn't: one would | want them unified for searching, but distinct for | printing. | | But regardless of what they are, these | characters/variants only show up in names. Names tend to | retain archaic (or extinct) language variations longer | than speech, which is the reason for rule #11, which is | at least part of the problem. | cestith wrote: | I fully agree with this second, expanded take of yours. | Some names are both represented and not represented by | Unicode simultaneously. This suggests there should be | variant versions of characters, but that becomes an even | thornier combinatorics (and sorting/collation, and | lookalike characters) issue than what already exists. | corrral wrote: | More generally, the notion that human culture, systems, | and behaviors can be mapped, losslessly and without | causing harm, to something a computer understands. | | I think these language examples are so good, as examples, | because all aspects of them are clear and easy to follow. 
| I think computerization of business and society and the | systems that make them work, causes immense amounts of | this kind of friction and pain all the time, in ways that | are much harder to understand, explain, or catalog (which | is precisely why it's such a big problem, though as far | as I know it's received little attention) | | [EDIT] To distill it, I think that trying to make a | computer a "source of truth" rather than a tool, tends to | do substantial violence to the "truth". | derefr wrote: | I feel like there has to be some level of triviality at | which the harm is no longer being caused by the attempt | to systematize something, but rather by a small group of | people refusing to be systematized _not_ out of cultural | heritage et al, but rather purely out of the (inane) | human desire to feel special by intentionally doing | something in a way nobody else does it. | | Language and writing exist to _communicate_ , using | patterns of signals that have _shared meaning and | recognition_ ; things like alphabets and vocabularies are | effectively (loose, overlapping, diasporic) consensus- | state autoencoding models. They only _work_ to compress | meaning, when there are rules for said compression that | generalize, and which don 't have as many exceptions with | their own separate symbols as there are words/names | needing to be encoded. | | Most countries don't allow you to just make up your own | novel graphemes when writing a name on a birth | certificate. And nobody is asking for that, either. | (Presumably because living in a world where that was | allowed would be horrible: you'd no longer being able to | error-correct when reading, because any given mysterious | squiggle in the middle of a word or name, might be | exactly what some unknown-to-you-or-anyone-other-than- | the-author character is _supposed_ to look like. 
Is that | "o with a curlicue" written here just a semi-cursive | attempt at writing an "o" -- or is it an "o" with a novel | accent marker, one that appears nowhere else, but which | must be preserved nevertheless to properly record this | person's name?) | | Instead, _legal names_ are (in every country I'm aware | of) required to be spelled using the character-set of the | country you're entering a legal relationship with by | being born / immigrating / etc. America? Legal names | using the Latin alphabet. Japan? Legal names using | characters from this set: | https://en.wikipedia.org/wiki/Jinmeiy%C5%8D_kanji | | Note, though, that legal names are _representations_ of | names. They aren't _encodings_ of names. Your legal name | is a _distinct thing_ from your name, just as your | credit-card number is a distinct thing from your name. It's | an applied-for + registered + assigned systematic | identifier for you -- a bit like a domain name, or a | vanity license-plate number. Which means that your legal | name is not a lossy _or_ lossless encoding of your name. | It's, per se, a nickname. It doesn't have to have | anything to do with your name. (And it often doesn't; | immigrants often choose legal names entirely distinct | from what they / their home country thinks of as their | name.) | cardiffspaceman wrote: | "If the character isn't in Unicode it's in CNS-11643" | apparently is also false. | lostlogin wrote: | There was a great thread on HN about names and falsehoods | programmers believe. | | You've added to it, as custom fonts weren't covered. | | I think it's this thread: | https://news.ycombinator.com/item?id=18567548 | | Edit: and it's there, #11. | duxup wrote: | That sounds equally fascinating, and a little maddening. | kurthr wrote: | Yep, and with pictographic writing systems it's a lot more | common than Latin... but even here we have X AE A-12 Musk, | and Prince's name symbol. | | Heck, my initials are totally non-standard. 
| 77pt77 wrote: | > Chinese people can be superstitious about the number of | strokes in their name, so adding a stroke might make it unlucky | | Why am I not surprised in the slightest? | jetrink wrote: | That's a great story. The inability to represent a name with | standard characters reminds me of when Prince changed his name | to a symbol and they had to send all of the media floppy disks | containing a custom font with a single character. | | https://nymag.com/intelligencer/2016/04/princes-legendary-fl... | mdp2021 wrote: | Are you acquainted with Freur (which means, "Underworld 0.5" | - Rick Smith and Karl Hyde in the '80s)? | | "Freur", or, "The squiggle we chose as the name for a band | but that CBS Records insisted should at least have a | pronunciation". | | I see it is not in Unicode (well, you can never really know | if you do not try), nor can I find pieces to reconstruct it. | | The "freur" in foreground: https://d4q8jbdc3dbnf.cloudfront.n | et/user/6885/edb290c6183ac... | [deleted] | Findecanor wrote: | I've been told that this is also an issue in Japan, except the | reason might more often be a matter of pride than superstition. | It is supposedly one reason (of a few) why fax machines are | still in common use in Japan. | | Later versions of Unicode support "Variation Forms" of Han | characters as a way to be able to encode different variations. | They are encoded as a Variation Selector code (U+E0100 and up) | after the Han character. The forms are listed separately from | Unicode versions in the "Ideographic Variation Database" | <https://www.unicode.org/ivd/>. So far, it contains characters | from a couple of Japanese dictionaries, a Korean and one from | Macao/Hong Kong. | hinkley wrote: | I knew someone who added an accent character to their name | because everyone pronounced it wrong. She met someone | bilingual who shot back that if she wants it pronounced that | way she needs to add an aigu. 
So she did, and everyone still | pronounced her name wrong. | | In fact going any place with her very nearly became an "are | we living in a simulation" crisis for me because the number | of times she would say her name and the other person would | say it back incorrectly was... upsetting. The degree to which | some people butchered her name, especially combining half of | her first and last name into a completely different name, | made us joke about buggy NPCs. | | I could imagine how in some cultures writing it incorrectly | hurts as much as pronouncing it incorrectly. Or possibly | moreso in places where multiple plausible pronunciations have | to be negotiated via an introduction, which is the case in | China, is it not? | teknopaul wrote: | In Poland people have a neat life hack for that problem. | They have other names for non-polish folk to use. Eg pawek, | tomek, bartek rather than have people mangle their real | name. | | My name got changed when I moved to Spain and it never | bothered me, while I have met people who took great offence | at the use of standard nicks that they had not explicitly | sanctioned in advance. I know a guy who makes a new name up | for everyone he meets. Like or lump it. If you are too | sensitive about your name, you risk people not using it at | all. | lostlogin wrote: | It does goe both ways though. Take the time to learn how | to say and spell someone's name and it usually goes down | well. | | I say this while fully aware of my own butchering. | isoprophlex wrote: | People are just incredibly dense sometimes. My wife has a | name that's one letter different from a more common name, | but clearly different in pronunciation. | | Nevertheless there have been countless times where people | automatically substitute the more common name, or even | worse in text messages manage to misread it and reply | incorrectly. | | It sometimes upsets her. The npc analogy is very apt, i | guess many people are just very preoccupied?! 
| derefr wrote: | > or even worse in text messages manage to misread it and | reply incorrectly | | Overzealous autocorrect can happen to names, too. There's | a whole thing about Asian names not being in computer | spellcheck dictionaries: | https://www.abbynews.com/news/youre-not-a-mistake-b-c- | group-... | InitialLastName wrote: | Not just Asian names; my SO's (English-language) nickname | frequently autocorrects to its common homophone. I can | always tell who proofreads their texts by how it ends up | spelled. | lostlogin wrote: | Try typing 'Sian' on iOS, (well, maybe Sian) and it | autocorrects to Asian. | | Unhelpful, though luckily found funny when I did it. | irusensei wrote: | If you type Sei on google translate and set it to detect | language it will switch to Chinese and translate it to | "lingering". If you switch to Japanese no translation will | happen. | | Also if you google search for Sei one of the results will be | this video [!!!!seizure warning!!!!] | https://www.youtube.com/watch?v=EsOU0V2kpUI that seems to borrow | on the theme of a computer ghost character. | TazeTSchnitzel wrote: | Google Translate will hallucinate translations for complete | nonsense, so this probably doesn't mean anything. | einpoklum wrote: | I'm more worried about the inflation of emoji than a couple dozen | unused ghost JIS characters. | npteljes wrote: | What worries you about it? | jimmygrapes wrote: | If Slack/Discord/etc. custom emojis get used enough, do they | get incorporated into Unicode? I've seen something like 40 | variants of laughing emoji, and closer to 400 variants of Pepe | the Frog, and I'm not even in any "alt right" or 4chan-adjacent | chat rooms/guilds where I imagine there are even more. Not to | mention the countless custom anime face ones. 
| wongarsu wrote: | Godwin's second law: any sufficiently long discussion about | Unicode includes a discussion about emoji :) | raphlinus wrote: | Yeah, it does seem to come up a lot more often than | discussions about U+5350. | edent wrote: | Why? Unicode isn't running out of space any time soon. | kevin_thibedeau wrote: | The encoding has gotten out of hand with compound emoji. | Splitting them on glyph boundaries is non-trivial. | Mountain_Skies wrote: | 640K should be enough for anybody. | olivierestsage wrote: | Reminds me of the case of U+237C [?] RIGHT ANGLE WITH DOWNWARDS | ZIGZAG ARROW [0], also discussed on HN [1]. | | [0] https://ionathan.ch/2022/04/09/angzarr.html | | [1] https://news.ycombinator.com/item?id=31012865 | jmillikin wrote: | Previously: | | https://news.ycombinator.com/item?id=24951130 (2020) | | https://news.ycombinator.com/item?id=17637375 (2018) | helsinkiandrew wrote: | It looks as if these (at least Shi ) are being used in various | places on and offline. It's eventually possible that they will | become associated with one or more meanings and perhaps a | pronunciation. | hnfong wrote: | In East Asian cultures that use Han characters, people used to | make up new characters when the need arises. | | These days, we scroll though the Unicode standard and find | rarely used characters that were accidentally added and imbue | them with new meaning. (yes, this is seriously a thing) | dane-pgp wrote: | When the article said: | | "In the end only one character had neither a clear source nor | any historical precedent: Sei ." | | my instinct was that this character could be retconned to | mean "character whose meaning has been lost", thus creating a | self-referential paradox. | | Presumably someone would have to then separately come up with | a pronunciation for it. Perhaps pronouncing it "duangu" would | solve another problem: | | https://coconuts.co/hongkong/lifestyle/duang-jackie-chan- | ins... 
| 1-more wrote: | Oooh that sounds fascinating. Any examples of that that | spring to mind? Is the pronunciation (or a reasonable | representation thereof) already recorded in the Unicode | standard or is that also a bit of free-jazz? | adastra22 wrote: | The character usually has a radical component which hints | at the pronunciation. They are ordered by radical in the | standard. So you would go spelunking for a little-used | character in the part of the standard which has characters | close in meaning or pronunciation to what you are looking | for. | | Or you just make something up. If you're coining a new | character, you probably don't care about whether the | pronunciation is already known. | ssnistfajen wrote: | An old one but possibly the earliest and most prominent of | obsolete Chinese characters being imbued with new | (Internet-based) meanings: | https://en.wikipedia.org/wiki/Jiong | | There's also Shi https://en.wiktionary.org/wiki/%E5%A5%AD | which is occasionally used as a censorship workaround to | mock one of Xi Jinping's gaffes in an early 2000's TV | interview where he bluffed about being able to carry two | hundred "catty" (~100kg)'s worth of wheat on rural mountain | roads. The character is composed of two Bai ("hundred") | and one Ren ("human/person/people") which is a pictorial | euphemism for that line he said on TV. I can't find any | sources about this one that's in English so please bear | with my half-assed explanation. | 1-more wrote: | Both cases are fascinating, thank you!! Side note: of | course Shi is pronounced shi. I only know a bare minimum | about Chinese but when in doubt: it's pronounced "shi" | (with some license regarding tone). | https://en.wikipedia.org/wiki/Lion- | Eating_Poet_in_the_Stone_... | adastra22 wrote: | One of the reasons I wish a compositional language had been | standardized for Unihan instead of the code-point-for-every- | character approach.
| jxy wrote: | Wiktionary claims this character is in Guangyun (1007-1008, see | https://en.wikipedia.org/wiki/Guangyun), and gives the link to | Kangxi dictionary (1716), | https://www.kangxizidian.com/kangxi/0256.gif which means that | this character likely predates the Japanese "Overview of | National Administrative Districts". | sbf501 wrote: | Can we talk about the artwork used? | | https://dl.ndl.go.jp/info:ndljp/pid/1312837?itemId=info%3And... | | https://philamuseum.org/collection/object/84871 | | Googling for Tsukioka Yoshitoshi brings up so much SEO that it is | hard to find information in English. If anyone knows anything | about it, I'd be appreciative for a pointer about its | content/subject! | polm23 wrote: | Author here. Nobody has ever asked about the art before. It | depicts Maruyama Oukyo, a famous painter of ghosts (and other | things), where one of his pieces comes to life and frightens | him. | | https://en.wikipedia.org/wiki/Maruyama_%C5%8Ckyo | lapetitejort wrote: | I can't be the only person who thought the character would be , | right? (based on the first line of the Communist Manifesto: | https://en.wikisource.org/wiki/Manifesto_of_the_Communist_Pa...) | | edit: ah the character (hammer and sickle) does not show up | aatharuv wrote: | For obviously fake characters, a Unicode proposal for the | Egyptian Hieroglyphics Extended-A block managed to include a | hieroglyph for an ancient Egyptian holding a laptop. (Note that | this is a proposal, and has not yet made it into the standard.) | Presumably it was a copyright trap. | | https://www.unicode.org/mail-arch/unicode-ml/y2020-m02/0018.... | ChrisArchitect wrote: | (2018) | hnfong wrote: | This might be interesting read to those unfamiliar with CJK, but | character bloat(?) isn't remotely a recent thing. It's actually | at least a couple hundred years old. 
| | The Kangxi dictionary (1716), an authoritative dictionary of | Chinese characters, contains definitions for 47035 characters, | even though only a couple thousand are in common use. Quoting | from Wikipedia: "The dictionary was the largest of the | traditional dictionaries, containing 47,035 characters. Some 40% | of them are graphic variants, however, while others are dead, | archaic, or found only once. Fewer than a quarter of the | characters it contains are now in common use." | | All of these archaic (or even bogus in some cases) characters | found in the dictionary are now part of the Unicode standard, of | course :) The unihan database even has a field that shows the | page number where the character appears in the Kangxi dictionary. | If you're wondering why 65536 characters isn't enough for | everyone, the junk in Kangxi dictionary is a significant | contribution. | mytailorisrich wrote: | I think 'character bloat' is simply inherent to the writing | system when characters are written by hand (now that perhaps | most written communication is digital people can't use | characters that are not already supported) | | Anyone can invent characters whenever they want, and it's only | a question of them sticking or not. | | I think this is also one of the reasons for the Chinese | tendency to push for unification and uniformity. | lazide wrote: | When it's character based instead of alphabet based, I think | it's the equivalent of coming up with a new word in English, | which is basically what you're describing. | | Sometimes it's mashing two previously unrelated 'words' | together (aka the tons of compound characters in Chinese), | other times it's coming up with something completely new. | | Same rules apply though, if it doesn't add value worth the | trouble (or get mandated by the powers that be), it'll | eventually just die out or be a curiosity. | | Also, to keep it tech related: | | RISC = English CISC/VLIW = Chinese? 
| tokinonagare wrote: | > Sometimes it's mashing two previously unrelated 'words' | together (aka the tons of compound characters in Chinese), | other times it's coming up with something completely new. | | That's not how it works. Most Chinese characters stem from | a character C having a pronunciation A referring to a | meaning M being used to note another word of meaning M' | with the same pronunciation A (sometimes slightly different | A'). This of course doesn't scale really well, hence the | existence of determiners in logographic scripts, which are | words used without their pronunciations placed before or | after another to give a semantic clue. The innovation of | Chinese (which I think is why it's still an efficient | script today) was to incorporate the determiner in the | character itself to give birth to a character C' where one | part refers to the pronunciation and another acts as the | determiner, instead of padding the main text with (a lot | of) determiners. | nneonneo wrote: | IIUC Old Chinese was a much more "isolating" language, in | that words were typically single characters - meaning that | to make new words, you typically needed to make new | characters. As it evolved through the ages, "compound" | words composed of multiple characters became more common. | These days, new words are almost always combinations of | multiple characters (often 2, occasionally 3-4). | lazide wrote: | Any idea if it was due to things like the Confucian | Official's exam system (and corresponding increase in | prioritization of education)? | | More complex characters require more education to | understand is my guess. Some of the traditional ones | are..... obscure, and crazy complex. | R0b0t1 wrote: | I'm not entirely sure what you mean to ask nor am I a | Chinese speaker, but I have myself suspected that the | massive variety of characters was a side-effect of having | a middle class that was differentiated based on their | ability to read.
You see various in-group signalling | systems similar to this in lots of areas. | | A good historical example is all the strangely specific | words for groups of animals. A history I read of this | indicated these terms were first found in books sold to | nobility, and they were just made up. But you weren't hip | if you weren't reading that literature. | duskwuff wrote: | > These days, new words are almost always combinations of | multiple characters (often 2, occasionally 3-4). | | Yep! For example, the most common Chinese term for | "Internet" is Hu Lian Wang . This is composed of three | characters: | | Hu : "mutual" | | Lian : "join", "coupled", "allied" | | Wang : "net" -- carrying both the meaning of a woven net | and a computer network | ars wrote: | Does Unicode really need to store Chinese words? Is it | impossible to deconstruct the glyphs into strokes, each stroke | effectively being a character? | j16sdiz wrote: | Many have attempted, but nobody has succeeded. The most famous one | is `Chu, B.F.: Han Zi Ji Yin Zhu Bang Fu Han Zi Ji Yin Gong | Cheng (Genetic engineering of Chinese characters) (2003), | http://cbflabs.com/down/show.php?id=26 ` | peter303 wrote: | In the early days of computers some character systems were | stroke-based because that used less memory than a 32x32 bit | map. A kilobit of ROM (one character) could cost $10. | | Currently stroke-based systems are used for calligraphic | effect. You could generate new font types, e.g. bold, by | controlling the shape of strokes. | | Stroke systems are important for teaching character writing | because the drawing order is rigorously prescribed. Once you | learn the first couple hundred, you can pretty much guess | future characters. Wrong order characters often look bad and | suggest a non-Chinese speaker mis-copied them. (e.g.
some | tattoos) | nneonneo wrote: | Unicode has support for this, in the Ideographic Description | Characters block (https://en.m.wikipedia.org/wiki/Ideographic | _Description_Char...). However, it's purely descriptive, and | not designed for rendering. | | There are somewhat more sophisticated systems which define | both the rendering and stroke decomposition of characters | (e.g. CDL: http://guide.wenlininstitute.org/wenlin4.3/Charact | er_Descrip...). The general workaround for characters that | aren't on Unicode would be to use one of these stroke | description systems to create the character, then render it | to an image and insert it. | cyphar wrote: | Even with the current system, very little software is even | aware that the same codepoint should be rendered differently | in different languages (Fan su needs to be rendered | differently in every CJK locale) which often results in | websites and programs using Chinese fonts for Japanese text | (even if you've configured your language as Japanese). Having | stroke breakdowns would not make this situation better | because there are multiple ways to render the same stroke | description and there aren't really systematic rules for how | to correctly represent the Japanese (or Taiwanese or Korean) | version of a character. | | I dread to think what an enormous mess would result if every | character was represented as a build-it-yourself instruction | manual rather than allowing font authors to correctly | represent the characters. | | Also nobody in China, Japan, nor Korea would use an encoding | system so incredibly inefficient that more strokes results in | more bytes being necessary to store the character (Japan | already compromised with having 3-byte UTF-8 characters when | JIS only required 2). This would've resulted in the failure | of Unicode's mission to be the One True Encoding Format. 
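To make the byte-count remark above concrete, here is a minimal sketch in Python. It assumes Python 3 with its bundled `shift_jis` codec, and it uses Shift_JIS as a stand-in for "JIS", since the comment does not say which JIS encoding was meant. The helper name `encoded_sizes` is made up for illustration.

```python
# Compare how many bytes a single kanji needs in UTF-8 versus Shift_JIS.
# Assumption: Shift_JIS stands in for the "JIS" mentioned in the thread.
def encoded_sizes(ch):
    """Return the encoded length in bytes of one character, per encoding."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "shift_jis": len(ch.encode("shift_jis")),
    }

# U+6F22 is the "kan" of "kanji"; it sits in the Basic Multilingual Plane,
# above U+07FF, which is the range UTF-8 encodes with three bytes.
sizes = encoded_sizes("\u6f22")
print(sizes)
```

The same helper can be pointed at any other character that both codecs can encode; a character missing from Shift_JIS raises `UnicodeEncodeError` rather than returning a size.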
| yongjik wrote: | The problem with that would be that every piece of software must know | the intricate rules about combining glyphs, and if they guess | wrong, users get garbage characters. | | Considering that the majority of code is written by people | who don't know Chinese characters, it would result in never- | ending issues, pretty much everywhere. | | Korean actually has a two-way system in Unicode. Every | conceivable character (= syllable) possible in modern Korean | has its own codepoint, which allows most software to display | them correctly: from their point of view, it's just another | CJK character. | | On the other hand, there is a Unicode area containing Korean | sub-blocks ("jamo") that were used historically. In theory, | you can combine them and get some pretty funky archaic | syllables. Almost no software renders them right. | mike_hock wrote: | They can't even get much simpler things right. Qt | incorrectly combines accents with the character to the | right instead of the left and has been refusing to fix this | bug for years. | [deleted] | ComodoHacker wrote: | >Fewer than a quarter of the characters it contains are now in | common use | | 12K characters in common use is equally impressive to me as a | non-Asian. | adastra22 wrote: | More like 12k characters currently in use at all. Common use | characters are a much smaller set than that. (3k or so?) | ssnistfajen wrote: | It's actually way fewer than that IRL. Japan's official list | of commonly used Kanji only has 2136 characters. Taiwan's | list has 4808, and the PRC's list has 3500 "frequent" | characters with another 3000 supplementary "common" ones. | Digitization has made it even easier to use these characters | without recognizing the actual form or how to write them.
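On the Korean point earlier in this subthread (one codepoint per modern syllable, plus separate combinable jamo), a minimal sketch using Python's standard `unicodedata` module may help. It only covers a modern syllable; the archaic combinations mentioned above are not shown, and the particular syllable is just an arbitrary example.

```python
import unicodedata

# Three conjoining jamo: HIEUH (U+1112), A (U+1161), NIEUN (U+11AB).
jamo = "\u1112\u1161\u11ab"

# NFC normalization composes the sequence into one precomposed syllable.
syllable = unicodedata.normalize("NFC", jamo)

# NFD normalization takes the syllable apart into the same jamo again.
decomposed = unicodedata.normalize("NFD", syllable)

print(len(jamo), len(syllable), hex(ord(syllable)))
```

Software that treats the precomposed form as "just another character" never has to perform this composition itself, which is the convenience described above.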
| cyphar wrote: | The Chang Yong Han Zi (Japanese Common Use Kanji) list | does not include many kanji that native speakers can read | and newspapers don't always follow the rule that they only | should use characters from the list. In addition, you need | to include the Ren Ming Yong Han Zi (Personal Name Use | Kanji) in the list because basically all of those | characters are also used in fairly common words. | | Native speakers can probably recognise at least 3-4k kanji | if not more but can probably only write around 2k from | memory, depending on how well-read they are. | | Xu (lie) is the best example of an incredibly common word | whose kanji form (which is used fairly often) is not in any | official government list. | DiogenesKynikos wrote: | If you look at a frequency list of Chinese characters,[0] | the top 4800 characters make up about 99.9% of modern | texts. | | That means that if you know 4800 characters, and you read a | text that is 1000 characters (equivalent to around 700 | words) long, there's likely one character you won't | recognize. | | The funny thing is, if you recognize only the top six | characters, you already know 10% of the characters in a | typical text. The distribution is very top-heavy, but with | a long tail that you do have to learn to become literate. | | 0. https://lingua.mtsu.edu/chinese- | computing/statistics/char/li... | jamal-kumar wrote: | I thought this was going to be about something like the massive | security problem of homoglyph attacks being currently deployed in | stuff like phishing baked into the standard at first glance of | the title, but this ghost character business is pretty | interesting. Japanese literacy requires you to know 2-4 meanings | per 2,136 kanji characters (something like 6000+ in total | possible meanings between these characters) just to be able to | pass a university level literacy test, it's a massive amount of | complexity to get right. 
Even if you just need basic literacy | it's still about a thousand less than that, and there's even more | than these I mentioned for further literacy competence. | Furthermore each of these characters looks funny if not unreadable | if you write them down using the wrong order of strokes. I can | see how mistakes might have been made even by native speakers of | that language. The two kana syllabaries are there of course and | mixed in with the kanji, but if everything was written in that | you wouldn't be able to achieve the same amount of information | density, which is probably part of the reason they never switched | over (I understand before World War 2 or so, the more rounded | hiragana was for women while the more sword stroke like katakana | was for men). | js8 wrote: | Is it possible for the Unicode standard to deprecate characters? If | yes, has it already happened? | Fell wrote: | I don't think so. It would make it impossible to talk about | deprecated characters ever again, even in a historical context. | | Unicode contains even some ancient and long forgotten scripts | so historians can keep proper records of them. | jfk13 wrote: | Yes, and yes. | | https://en.wikipedia.org/wiki/Unicode_character_property#Dep... | bqmjjx0kac wrote: | This is a tangent, but I felt like sharing. In college, I | purchased a used copy of the Communist Manifesto. Famously, the | first line reads, "A spectre is haunting Europe, ...". | | The previous owner had both highlighted and circled the word | "spectre" and wrote "ghost?" in the margins. The rest of the text | was similarly marked up. | | Every time I hear the word "spectre" I see "ghost?" in my mind's | eye. ___________________________________________________________________ (page generated 2022-07-14 23:00 UTC)