[HN Gopher] Katakana, Hiragana, and Unicode
___________________________________________________________________

Katakana, Hiragana, and Unicode

Author : zdw
Score  : 69 points
Date   : 2022-09-26 20:11 UTC (2 days ago)

(HTM) web link (www.johndcook.com)
(TXT) w3m dump (www.johndcook.com)

| sapkernel wrote:
| This is a cool language learning project in code. Thanks.

| dudeinjapan wrote:
| And here's half-width katakana in non-Unicode, circa 1979:
| https://github.com/receipt-print-hq/escpos-printer-db/issues...
|
| This is probably the system that is still used today by most bank
| printers in Japan. The code point usage was carried forward to
| JIS.

| gweinberg wrote:
| What are these mysterious symbols that don't fit in the table?

| teshigahara wrote:
| ゐ (wi) (wu) ゑ (we) (yi) (ye)
|
| ゐ and ゑ (these days pronounced the same as Japanese い and え)
| are known by all native Japanese speakers, were used
| historically, and actually still see some use in certain
| scenarios (like signs, or names of things). The other ones were
| never actually used much afaik and were only recently
| introduced to Unicode at all, and are probably unknown to most
| Japanese people except those interested in this kind of thing.

| Symmetry wrote:
| I was confused by the lack of little `yx` characters there so
| looked it up on my own. The 'ゆ' in りゆ, "riyu", is 86 and the
| 'ゅ' in りゅ, "ryu", is 85.

| viggity wrote:
| same with the small tsu which makes you kinda pause/emphasize
| the following consonant.
|
| Ex. あさり = asari = ah - sa - ri
| あっさり = assari = ah <tiny pause, hard s> sa - ri
|
| not to be confused with あつさり which is atsusari (which is a
| made up word), but because the tsu is regular sized, you
| pronounce it instead of altering other character
| pronunciations.
|
| Also of note - they completely left out "n": ん in hiragana, ン in
| katakana.
|
| And "wo" (を) isn't really pronounced "wo", it is pronounced just
| "oh" and spelled "o" in romaji.
| And while there is a "wo" (ヲ) in katakana, I have never seen it
| used. It is used as a particle, which is inherently a native
| Japanese thing, and ergo you use hiragana for it.

| Anon1096 wrote:
| You see ヲ if you read stuff where there's a robot, alien, or
| super stereotypical foreigner speaking, since oftentimes
| their entire lines are written in katakana to feel non-native.

| kensai wrote:
| John Cook is a genius. All his blog posts are gems. I love his
| job, I wish I could do it.

| sylware wrote:
| Is there a plain and simple text shaper written in C for Unicode
| katakana and hiragana strings?

| ranger_danger wrote:
| What is a text shaper?

| HelloNurse wrote:
| The software that controls how to render text to images,
| relying on fonts but considering higher-level issues (e.g.
| line breaking and metrics for multiple characters) than the
| low-level information in a font.

| vore wrote:
| There's no shaping or really even anything fancy
| typographically required for kana: just put the glyphs next to
| each other, fixed-width, no kerning.

| mananaysiempre wrote:
| Modern horizontal hiragana and katakana are not complex or huge
| scripts: there are several dozen base characters (of one or two
| different widths) and two or so accent marks. There might be no
| spaces; you break lines whenever they run out, without
| considering word boundaries. I expect anything capable of
| dealing with Latin should be able to handle this, and it hardly
| deserves the name of "shaping".
|
| (Adding kanji into the mix somewhat complicates matters, as
| there are so many potential characters you cannot just blindly
| cache the rasterization of every one of them and never throw
| any away, but that's also not the degree of complexity you get
| from Arabic and such.)

| amichal wrote:
| Line layout rules are a bit complex.
| Long long ago, when I was 19, someone handed me a photocopied set
| of around 50 rules for line breaking Japanese text, and I followed
| them to implement our first draft of it in a text layout program
| we were adding Japanese support to. I implemented it blind; I
| don't speak Japanese, it never shipped, and I don't remember the
| rules, but I do remember quite some complexity around punctuation
| etc. This section from W3C covers some of what I remember and
| quite a bit more, I'm sure:
| https://www.w3.org/TR/jlreq/#line_composition_rules_for_punc...

| innocenat wrote:
| To be honest, both questions can be answered in a few seconds by
| looking at the code point table for Hiragana/Katakana if you
| already know Japanese. Hence, that's why nobody writes about it.
|
| > How do the 46 characters map into the 90 characters?
|
| Because there are actually more than 46 characters.
|
| > Do they map the same way for both hiragana and katakana?
|
| Yes. That's also how we do conversion between hiragana and
| katakana: by adding/subtracting 0x60.

| superjan wrote:
| For curiosity: how are they sorted? Are hiragana/katakana
| symbols considered equivalent?

| innocenat wrote:
| They are sorted by dictionary order (五十音順, gojuuonjun) and
| there are no duplicated symbols in hiragana-katakana blocks.

| zerocrates wrote:
| The hiragana and katakana and various versions thereof for
| each mora all share the same "primary" Unicode collation
| value. Adding a dakuten or handakuten creates a secondary
| difference: e.g. は (ha) < ば (ba) < ぱ (pa).
|
| As between the versions for the same mora, they get sorted
| with tertiary differences as: hiragana comes before katakana,
| small comes before regular-size, and for katakana regular
| width comes before halfwidth. There's also a "circled" set of
| the katakana that sort after the halfwidth ones.
|
| So they're equivalent (or not) depending on how you're doing
| the collation/comparison.
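
The 0x60 offset innocenat mentions can be sketched in a few lines of Python. This is only an illustration: the function names and the range limits (U+3041 to U+3096 for hiragana, U+30A1 to U+30F6 for katakana) are assumptions made here, not something from the article or the thread. Characters outside those ranges, such as the prolonged sound mark ー, pass through unchanged.

```python
# Assumed constant: katakana code point minus the matching hiragana code point.
KANA_OFFSET = 0x60


def hira_to_kata(text):
    """Shift hiragana letters (U+3041..U+3096) up into the katakana block."""
    return "".join(
        chr(ord(ch) + KANA_OFFSET) if "\u3041" <= ch <= "\u3096" else ch
        for ch in text
    )


def kata_to_hira(text):
    """Shift katakana letters (U+30A1..U+30F6) down into the hiragana block."""
    return "".join(
        chr(ord(ch) - KANA_OFFSET) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in text
    )


if __name__ == "__main__":
    print(hira_to_kata("ひらがな"))
    print(kata_to_hira("カタカナ"))
```

Restricting the shift to the two letter ranges is a deliberate choice: blindly adding 0x60 to every character would corrupt Latin text, punctuation, and the few katakana that have no hiragana counterpart.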

| innocentoldguy wrote:
| The article shows how they are sorted. Hiragana is used for
| things like Japanese words, particles, names, and to
| conjugate verbs. Katakana is used for things like foreign
| words, names, and sometimes emphasis. Both writing systems
| describe the same phonetics. For example, the hiragana か and
| katakana カ are both pronounced "ka."

| Pxtl wrote:
| I'm surprised they're both used, from that description it
| sounds like one would fall by the wayside, like cursive has
| in North America.
|
| ...
|
| That said, culturally Japan seems like exactly the kind of
| place where, were they English-speakers, all the kids would
| absolutely be required to learn perfect cursive.

| msbarnett wrote:
| > I'm surprised they're both used, from that description
| it sounds like one would fall by the wayside, like
| cursive has in North America.
|
| In practice, there are no less than 4 separate scripts
| that are used in Japanese: hiragana, katakana, kanji, and
| romaji, and some mix of all 4 can appear in the same
| sentence.
|
| It's not so much analogous to cursive, which is a
| different "style" of writing the same "thing" - katakana
| and hiragana developed at different times for different
| groups and came to play different roles, and there are
| (usually) semantic implications to which are used.

| AnIdiotOnTheNet wrote:
| English print still has two separate character sets with
| exactly the same pronunciation too. One is used most of
| the time, and the other is used to start sentences, for
| EMPHASIS on whole words, or to indicate proper nouns.

| popularonion wrote:
| Japanese people could ask the same question about why
| English continues to have uppercase and lowercase
| letters.
|
| Actually when you look at the use of English in Japanese
| media, you'll quickly notice a lot of unnatural-looking
| overuse of uppercase. That's because to them it feels
| natural to use uppercase the same way they use katakana.

| dfinninger wrote:
| I am very early on in my Japanese-learning journey. So if
| others contradict me, they are probably a better source.
| :)
|
| But from what I understand Hiragana is used more for
| Japanese words, and Katakana more for loan words from
| other languages.
|
| It actually leads to a nice shortcut for some words. If
| I'm reading Hiragana I'll try to match that with words in
| Japanese that I know. However, if the word I'm looking at
| is Katakana, I'll flip that off and start trying to match
| phonetically.
|
| I assume with fluency this all becomes automatic, but I'm
| a ways off from that yet!

| layer8 wrote:
| It's more akin to italics in usage than to cursive.

| bsder wrote:
| > I'm surprised they're both used, from that description
| it sounds like one would fall by the wayside, like
| cursive has in North America.
|
| Katakana, hiragana, and kanji are all in active use;
| that's why they don't fall away.
|
| Kanji are your primary word base. They are sort of like
| root words in English.
|
| Hiragana often serves as a kind of marker: endings of
| certain words as well as phrase markers (particles).
| These are particularly important because Japanese does
| not normally break words with spaces.
|
| Katakana often denotes a foreign phonetic word or foreign
| names. Login is a particularly good example for this forum:
| ログイン (ro-gu-i-n).
|
| Japanese speakers actively use the differences as cues
| when reading. Watch a native Japanese speaker try to
| puzzle out Japanese learning materials for non-native
| speakers. If everything is written in hiragana (not
| uncommon for beginning materials), native speakers often
| have to puzzle over things a bit before they work out
| what a sentence says. This is one of the reasons why you
| want to get to Kanji as fast as possible when learning
| Japanese: the differences in script are _important_ for
| reading comprehension.
|
| You can see all of these in play on the Asahi Shimbun
| webpage: https://www.asahi.com/

| spacehunt wrote:
| Another simple technical reason is that that's how JIS did it,
| and Unicode wanted to have lossless round-trip encoding
| conversions in order to promote its adoption in East Asia at
| the time.

| dhosek wrote:
| There are some interesting variations in different scripts
| thanks to how they were handled in pre-Unicode encodings.
| Perhaps the most interesting divergence is in the various
| scripts derived from the old Brahmi script. These are all
| abugidas (as are the Japanese kana) where vowels do not exist
| independently of consonants. But in Thai, for example, the
| syllable NA is written นา with น and า treated as separate
| characters, while in Devanagari, NA is written ना where न is
| the N sound and the A sound ा is a spacing mark which
| changes the shape and spacing of the first letter to give
| ना. Although a Thai reader will read the combination of
| consonant and vowel as a single entity, they are treated as
| two graphemes by Unicode, while the equivalent in Devanagari
| is a single grapheme (and it's not simply because they're
| printed connected since नाना will be connected but treated as
| two graphemes).
|
| Perhaps most interesting in this respect is the comparison
| between the Devanagari i and the Thai ai which both appear
| before the consonant that they're attached to, but in Thai
| the input will be ai + kh to get aikh (so you input in the
| order of appearance rather than the order of pronunciation)
| while in Devanagari, the input would be k + i to get ki (so
| you input in pronunciation order rather than graphic order).
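
The Thai/Devanagari contrast dhosek describes is visible in the Unicode character database. Below is a rough sketch using Python's standard `unicodedata` module: the Thai vowel signs are ordinary letters (general category Lo) stored in visual order, while the Devanagari vowel signs are spacing combining marks (Mc) stored after their consonant. The specific characters are picked here for illustration (in particular, KHO KHAI stands in for the unspecified "kh"), and general category is only a proxy: actual grapheme-cluster boundaries are defined by UAX #29, which this sketch does not implement.

```python
import unicodedata

# Illustrative samples, written as escapes so the stored order is explicit.
THAI_NA = "\u0E19\u0E32"     # NO NU + SARA AA
DEVA_NA = "\u0928\u093E"     # LETTER NA + VOWEL SIGN AA
THAI_AI_KH = "\u0E44\u0E02"  # SARA AI MAIMALAI + KHO KHAI (vowel stored first)
DEVA_KI = "\u0915\u093F"     # LETTER KA + VOWEL SIGN I (consonant stored first)


def categories(text):
    """Return the Unicode general category of each code point in text."""
    return [unicodedata.category(ch) for ch in text]


def describe(text):
    """Return one 'U+XXXX NAME [category]' line per code point."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch)} [{unicodedata.category(ch)}]"
        for ch in text
    ]


if __name__ == "__main__":
    for sample in (THAI_NA, DEVA_NA, THAI_AI_KH, DEVA_KI):
        print("\n".join(describe(sample)))
        print()
```

The Mc category on the Devanagari vowel signs is what lets segmentation rules attach them to the preceding consonant; the Thai vowels, being plain letters, get no such treatment.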

| lloeki wrote:
| For someone like me who knows somewhere between pretty much
| none and a very small bit of Japanese and is slowly working my
| way up as time permits in a busy life, this was an interesting
| and very well presented article, saving me more than a few
| seconds of searching that I don't have to spare, and for which
| the reading time was both enjoyable and knowledge incrementing.
|
| Hence, it's very fortunate that someone wrote about it.

| innocenat wrote:
| I don't disagree about that. I just answered the first
| question in the article about why no one is writing
| about this.

| resoluteteeth wrote:
| There are a bunch of other annoying complexities with dealing
| with Japanese text like halfwidth/full-width characters:
| depending on what you're doing you may have to account for
| additional stuff like ｱ instead of ア, or Ａ instead of A. Ideally
| these wouldn't actually be used (this formatting should not be
| done at the character set level) but since they were included in
| Unicode for backwards compatibility reasons, they do
| unfortunately get used a fair amount.
|
| Also I guess this isn't specific to Japanese, but if you use
| normalization in NFD form, the modifiers like handakuten get
| split into separate characters (I don't think most people ever
| use Unicode normalization but iirc mac filesystem paths are
| normalized so it can be really confusing when you do actually run
| into it).
___________________________________________________________________
(page generated 2022-09-28 23:00 UTC)
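
A small sketch of the two effects resoluteteeth describes, using Python's standard `unicodedata` module. NFD splits a kana carrying a handakuten into a base character plus a combining mark, and NFC puts it back together. NFKC is a swapped-in technique that the comment does not name: it is one common way to fold halfwidth katakana and fullwidth Latin into their ordinary forms. The sample characters and the helper name `code_points` are illustrative choices made here.

```python
import unicodedata


def code_points(text):
    """Return the code points of text as a list of 'U+XXXX' strings."""
    return [f"U+{ord(ch):04X}" for ch in text]


# NFD: precomposed hiragana PA (U+3071) becomes
# HA (U+306F) + COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK (U+309A).
pa_nfd = unicodedata.normalize("NFD", "\u3071")

# NFC recomposes the pair into the single precomposed character.
pa_nfc = unicodedata.normalize("NFC", pa_nfd)

# NFKC: halfwidth KA + halfwidth voiced sound mark (U+FF76 U+FF9E)
# folds to the ordinary katakana GA (U+30AC).
ga = unicodedata.normalize("NFKC", "\uFF76\uFF9E")

# NFKC: fullwidth Latin capital A (U+FF21) folds to ASCII "A".
latin_a = unicodedata.normalize("NFKC", "\uFF21")

if __name__ == "__main__":
    print(code_points(pa_nfd))
    print(code_points(pa_nfc))
    print(code_points(ga), latin_a)
```

Normalizing both sides to the same form before comparing strings avoids the confusion of two visually identical strings that differ in their code points.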