[HN Gopher] Katakana, Hiragana, and Unicode
       ___________________________________________________________________
        
       Katakana, Hiragana, and Unicode
        
       Author : zdw
       Score  : 69 points
       Date   : 2022-09-26 20:11 UTC (2 days ago)
        
 (HTM) web link (www.johndcook.com)
 (TXT) w3m dump (www.johndcook.com)
        
       | sapkernel wrote:
       | This is a cool language-learning project in code. Thanks.
        
       | dudeinjapan wrote:
       | And here's half-width katakana in non-Unicode, circa 1979:
       | https://github.com/receipt-print-hq/escpos-printer-db/issues...
       | 
       | This is probably the system that is still used today by most bank
       | printers in Japan. The code point usage was carried forward to
       | JIS.
        
       | gweinberg wrote:
       | What are these mysterious symbols that don't fit in the table?
        
         | teshigahara wrote:
         | ゐ (wi)  (wu) ゑ (we)  (yi)  (ye)
         | 
         | ゐ and ゑ (these days pronounced the same as Japanese い and え)
         | are known by all native Japanese speakers, were used
         | historically, and actually still see some use in certain
         | scenarios (like signs, or names of things). The other ones were
         | never actually used much afaik and only recently were
         | introduced to Unicode at all, and are probably unknown to most
         | Japanese people except those interested in this kind of thing.
        
       | Symmetry wrote:
       | I was confused by the lack of little `yx` characters there so
       | looked it up on my own. The 'yu' in りゆ, "riyu", is 86 (U+3086)
       | and the 'yu' in りゅ, "ryu", is 85 (U+3085).
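
The code points mentioned in this comment can be checked with Python's standard `unicodedata` module. This is only a small sketch; the helper name `describe` is made up for illustration.

```python
import unicodedata

def describe(ch):
    """Return (code point as U+XXXX, official Unicode name) for one character."""
    return f"U+{ord(ch):04X}", unicodedata.name(ch)

# The small yu (as in "ryu") sits immediately before the full-size yu
# (as in "riyu") in the hiragana block.
for ch in "\u3085\u3086":
    print(ch, *describe(ch))
```

The same pattern holds for the other small kana in the block: each small form is encoded right before its full-size counterpart.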
        
         | viggity wrote:
         | same with the small tsu which makes you kinda pause/emphasize
         | the following consonant.
         | 
         | Ex. あさり = asari = ah - sa - ri
         | あっさり = assari = ah <tiny pause, hard s> sa - ri
         | 
         | Not to be confused with あつさり, which is atsusari (a made-up
         | word): because the つ is regular-sized, you pronounce it
         | instead of altering other characters' pronunciation.
         | 
         | Also of note: they completely left out "n", ん in hiragana and
         | ン in katakana.
         | 
         | And "wo" (を) isn't really pronounced "wo"; it is pronounced
         | just "oh" and spelled "o" in romaji. And while there is a "wo"
         | in katakana (ヲ), I have never seen it used. を is used as a
         | particle, which is inherently a native Japanese thing, so you
         | use hiragana for it.
        
           | Anon1096 wrote:
           | You see wo if you read stuff where there's a robot, alien, or
           | super stereotypical foreigner speaking, since oftentimes
           | their entire lines are written in katakana to feel non-
           | native.
        
       | kensai wrote:
       | John Cook is a genius. All his blog posts are gems. I love his
       | job, I wish I could do it.
        
       | sylware wrote:
       | Is there a plain and simple C written text shaper for unicode
       | Katakana, Hiragana strings?
        
         | ranger_danger wrote:
         | what is a text shaper?
        
           | HelloNurse wrote:
           | The software that controls how to render text to images,
           | relying on fonts but considering higher level issues (e.g.
           | line breaking and metrics for multiple characters) than the
           | low-level information in a font.
        
         | vore wrote:
         | There's no shaping or really even anything fancy
         | typographically required for kana, just put the glyphs next to
         | each other fixed-width no kerning.
        
         | mananaysiempre wrote:
         | Modern horizontal hiragana and katakana are not complex or huge
         | scripts, there are several dozen base characters (of one or two
         | different widths) and two or so accent marks. There might be no
         | spaces, you break lines whenever they run out without
         | considering word boundaries. I expect anything capable of
         | dealing with Latin should be able to handle this, and it hardly
         | deserves the name of "shaping".
         | 
         | (Adding kanji into the mix somewhat complicates matters, as
         | there are so many potential characters you cannot just blindly
         | cache the rasterization of every one of them and never throw
         | any away, but that's also not the degree of complexity you get
         | from Arabic and such.)
        
           | amichal wrote:
           | Line layout rules are a bit complex. Long, long ago, when I
           | was 19, someone handed me a photocopied set of around 50
           | rules for line breaking Japanese text, and I followed them to
           | implement our first draft of it in a text layout program we
           | were adding Japanese support to. I implemented it blind (I
           | don't speak Japanese), it never shipped, and I don't remember
           | the rules, but I do remember quite some complexity around
           | punctuation etc. This section from the W3C covers some of
           | what I remember and quite a bit more, I'm sure:
           | https://www.w3.org/TR/jlreq/#line_composition_rules_for_punc...
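
To give a feel for why punctuation complicates "break wherever the line runs out", here is a deliberately tiny sketch. It is not the JLREQ rule set and not the roughly 50 rules mentioned in this comment. It assumes a fixed line width counted in characters and a small, hypothetical set of characters that may not start a line; the names `NO_LINE_START` and `wrap_japanese` are invented for this example.

```python
# Closing punctuation that should not begin a line (a small subset, chosen
# for illustration only).
NO_LINE_START = set("、。」』）")

def wrap_japanese(text, width):
    """Break text every `width` characters, ignoring word boundaries, but let
    a line run long rather than start the next line with closing punctuation."""
    lines = []
    i = 0
    while i < len(text):
        end = min(i + width, len(text))
        # Pull any prohibited line-start characters onto the current line.
        while end < len(text) and text[end] in NO_LINE_START:
            end += 1
        lines.append(text[i:end])
        i = end
    return lines

print(wrap_japanese("あいうえお、かきく", 5))
```

Real implementations also handle characters that may not end a line, spacing adjustments around punctuation, and more, which is where the complexity described here comes from.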
        
       | innocenat wrote:
       | To be honest, both questions can be answered in a few seconds by
       | looking at the code point table for Hiragana/Katakana if you
       | already know Japanese. Hence nobody writes about it.
       | 
       | > How do the 46 characters map into the 90 characters?
       | 
       | Because there are actually more than 46 characters.
       | 
       | > Do they map the same way for both hiragana and katakana?
       | 
       | Yes. That's also how we do conversion between hiragana and
       | katakana. By adding/subtracting 0x60.
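
The 0x60 offset described in this comment can be sketched in a few lines of Python. The function and constant names are made up for illustration, and the range limits (U+3041 to U+3096 for hiragana) are an assumption about which characters have a direct counterpart; anything outside that range, such as the long-vowel mark, is left untouched.

```python
HIRAGANA_START, HIRAGANA_END = 0x3041, 0x3096  # small a .. small ke
KANA_OFFSET = 0x60  # distance between the hiragana and katakana blocks

def hira_to_kata(text):
    """Shift hiragana up by 0x60; leave every other character alone."""
    return "".join(
        chr(ord(ch) + KANA_OFFSET)
        if HIRAGANA_START <= ord(ch) <= HIRAGANA_END else ch
        for ch in text
    )

def kata_to_hira(text):
    """Shift katakana down by 0x60; leave every other character alone."""
    lo, hi = HIRAGANA_START + KANA_OFFSET, HIRAGANA_END + KANA_OFFSET
    return "".join(
        chr(ord(ch) - KANA_OFFSET) if lo <= ord(ch) <= hi else ch
        for ch in text
    )

print(hira_to_kata("ひらがな"))
print(kata_to_hira("カタカナ"))
```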
        
         | superjan wrote:
         | For curiosity: how are they sorted? Are hiragana/katakana
         | symbols considered equivalent?
        
           | innocenat wrote:
           | They are sorted in dictionary order (五十音順, gojūonjun), and
           | there are no duplicated symbols in the hiragana and katakana
           | blocks.
        
           | zerocrates wrote:
           | The hiragana and katakana and various versions thereof for
           | each mora all share the same "primary" Unicode collation
           | value. Adding a dakuten or handakuten creates a secondary
           | difference: e.g. は (ha) < ば (ba) < ぱ (pa).
           | 
           | As between the versions for the same mora, they get sorted
           | with tertiary differences as: hiragana comes before katakana,
           | small comes before regular-size, and for katakana regular
           | width comes before halfwidth. There's also a "circled" set of
           | the katakana that sort after the halfwidth ones.
           | 
           | So they're equivalent (or not) depending on how you're doing
           | the collation/comparison.
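
The layered comparison described in this comment can be imitated with a toy sort key. To be clear, this is not the Unicode Collation Algorithm and uses no collation library: it is a simplified sketch that assumes NFD decomposition is enough to separate the voicing marks, folds katakana onto hiragana with the 0x60 offset, and ignores small and halfwidth forms entirely. The name `kana_sort_key` is hypothetical.

```python
import unicodedata

def kana_sort_key(text):
    """Toy three-level key: base kana, then voicing marks, then script."""
    primary, secondary, tertiary = [], [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) and secondary:
            # A dakuten or handakuten only makes a secondary difference.
            secondary[-1] += ch
            continue
        is_katakana = 0x30A1 <= ord(ch) <= 0x30F6
        # Primary level: katakana is folded onto the matching hiragana.
        primary.append(chr(ord(ch) - 0x60) if is_katakana else ch)
        secondary.append("")
        # Tertiary level: hiragana (0) sorts before katakana (1).
        tertiary.append(1 if is_katakana else 0)
    return "".join(primary), tuple(secondary), tuple(tertiary)

print(sorted(["ぱ", "ハ", "ば", "は"], key=kana_sort_key))
```

Comparing only the first element of the key treats all four strings as equal, which is the "equivalent (or not) depending on how you're doing the comparison" point.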
        
           | innocentoldguy wrote:
           | The article shows how they are sorted. Hiragana is used for
           | things like Japanese words, particles, names, and to
           | conjugate verbs. Katakana is used for things like foreign
           | words, names, and sometimes emphasis. Both writing systems
           | describe the same phonetics. For example, the hiragana か and
           | katakana カ are both pronounced "ka."
        
             | Pxtl wrote:
             | I'm surprised they're both used, from that description it
             | sounds like one would fall by the wayside, like cursive has
             | in North America.
             | 
             | ...
             | 
             | That said, culturally Japan seems like exactly the kind of
             | place where, were they English-speakers, all the kids would
             | absolutely be required to learn perfect cursive.
        
               | msbarnett wrote:
               | > I'm surprised they're both used, from that description
               | it sounds like one would fall by the wayside, like
               | cursive has in North America.
               | 
               | In practice, there are no less than 4 separate scripts
               | that are used in Japanese: hiragana, katakana, kanji, and
               | romaji, and some mix of all 4 can appear in the same
               | sentence.
               | 
               | It's not so much analogous to cursive, which is a
               | different "style" of writing the same "thing" - katakana
               | and hiragana developed at different times for different
               | groups and came to play different roles, and there are
               | (usually) semantic implications to which are used.
        
               | AnIdiotOnTheNet wrote:
               | English print still has two separate character sets with
               | exactly the same pronunciation too. One is used most of
               | the time, and the other is used to start sentences, for
               | EMPHASIS on whole words, or to indicate proper nouns.
        
               | popularonion wrote:
               | Japanese people could ask the same question about why
               | English continues to have uppercase and lowercase
               | letters.
               | 
               | Actually when you look at the use of English in Japanese
               | media, you'll quickly notice a lot of unnatural-looking
               | overuse of uppercase. That's because to them it feels
               | natural to use uppercase the same way they use katakana.
        
               | dfinninger wrote:
               | I am very early on in my Japanese-learning journey. So if
               | others contradict me, they are probably a better source.
               | :)
               | 
               | But from what I understand Hiragana is used more for
               | Japanese words, and Katakana more for loan words from
               | other languages.
               | 
               | It actually leads to a nice shortcut for some words. If
               | I'm reading Hiragana I'll try to match that with words in
               | Japanese that I know. However, if the word I'm looking at
               | is Katakana, I'll flip that off and start trying to match
               | phonetically.
               | 
               | I assume with fluency this all becomes automatic, but I'm
               | a ways off from that yet!
        
               | layer8 wrote:
               | It's more akin to italics in usage than to cursive.
        
               | bsder wrote:
               | > I'm surprised they're both used, from that description
               | it sounds like one would fall by the wayside, like
               | cursive has in North America.
               | 
               | Katakana, hiragana, and kanji are all in active use--
               | that's why they don't fall away.
               | 
               | Kanji are your primary word base. They are sort of like
               | root words in English.
               | 
               | Hiragana often serves as kind of a marker--endings of
               | certain words as well as phrase markers (particles).
               | These are particularly important because Japanese does
               | not normally break words with spaces.
               | 
               | Katakana often denotes a foreign phonetic word or foreign
               | names. Login is a particularly good example for this
               | forum: ログイン (ro-gu-i-n).
               | 
               | Japanese speakers actively use the differences as cues
               | when reading. Watch a native Japanese speaker try to
               | puzzle out Japanese learning materials for non-native
               | speakers. If everything is written in hiragana (not
               | uncommon for beginning materials), native speakers often
               | have to puzzle over things a bit before they work out
               | what a sentence says. This is one of the reasons why you
               | want to get to Kanji as fast as possible when learning
               | Japanese--the differences in script are _important_ for
               | reading comprehension.
               | 
               | You can see all of these in play on the Asahi Shimbun
               | webpage: https://www.asahi.com/
        
         | spacehunt wrote:
         | Another simple technical reason is that that's how JIS did it,
         | and Unicode wants to have lossless round-trip encoding
         | conversions in order to promote its adoption in East Asia at
         | the time.
        
           | dhosek wrote:
           | There are some interesting variations in different scripts
           | thanks to how they were handled in pre-Unicode encodings.
           | Perhaps the most interesting divergence is in the various
           | scripts derived from the old Brahmi script. These are all
           | abugidas (as are the Japanese kana) where vowels do not exist
           | independently of consonants. But in Thai, for example, the
           | syllable NA is written นา with น and า treated as separate
           | characters, while in Devanagari, NA is written ना where न is
           | the N sound and the A sound ा is a spacing mark which
           | changes the shape and spacing of the first letter to give
           | ना. Although a Thai reader will read the combination of
           | consonant and vowel as a single entity, they are treated as
           | two graphemes by Unicode, while the equivalent in Devanagari
           | is a single grapheme (and it's not simply because they're
           | printed connected since नाना will be connected but treated
           | as two graphemes).
           | 
           | Perhaps most interesting in this respect is the comparison
           | between the Devanagari ि and the Thai ไ which both appear
           | before the consonant that they're attached to, but in Thai
           | the input will be ไ + ข to get ไข (so you input in the
           | order of appearance rather than the order of pronunciation)
           | while in Devanagari, the input would be क + ि to get कि (so
           | you input in pronunciation order rather than graphic order).
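
The difference in stored order can be seen by listing the code points of each syllable. This is a minimal sketch using the standard `unicodedata` module; the helper name `code_points` is made up, and the choice of Thai consonant (KHO KHAI) is an assumption made for the example.

```python
import unicodedata

def code_points(text):
    """List (U+XXXX, Unicode name) for each code point, in stored order."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

# Devanagari "ki": consonant first, then the vowel sign, even though the
# vowel sign is drawn to the left of the consonant.
devanagari_ki = "\u0915\u093F"

# Thai: the vowel is stored first, matching its visual position.
thai_khai = "\u0E44\u0E02"

for label, text in (("Devanagari", devanagari_ki), ("Thai", thai_khai)):
    print(label, code_points(text))
```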
        
         | lloeki wrote:
         | For someone like me who knows somewhere between pretty much
         | none to a very small bit of Japanese and slowly working my way
         | up as time permits in a busy life, this was an interesting and
         | very well presented article, saving me more than a few seconds
         | of searching that I don't have to spare, and for which the
         | reading time was both enjoyable and knowledge incrementing.
         | 
         | Hence, that's very much fortunate that someone wrote about it.
        
           | innocenat wrote:
           | I don't disagree about that. I just answered the first
           | question in the article about why no one is writing about
           | this.
        
       | resoluteteeth wrote:
       | There are a bunch of other annoying complexities with dealing
       | with Japanese text like halfwidth/full-width characters:
       | depending on what you're doing you may have to account for
       | additional stuff like ａ instead of a, or Ａ instead of A. Ideally
       | these wouldn't actually be used (this formatting should not be
       | done at the character set level) but since they were included in
       | unicode for backwards compatibility reasons, they do
       | unfortunately get used a fair amount.
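
One common way to cope with the width variants is compatibility normalization. The sketch below uses NFKC from Python's standard `unicodedata` module; the wrapper name `fold_width` is invented here. Note the assumption that NFKC is acceptable for your data: it also rewrites other compatibility characters (circled digits, ligatures and so on), not just width variants.

```python
import unicodedata

def fold_width(text):
    """Map fullwidth Latin and halfwidth katakana to their ordinary forms
    by applying NFKC compatibility normalization."""
    return unicodedata.normalize("NFKC", text)

# Fullwidth "a" and "A" become plain ASCII letters.
print(fold_width("\uFF41\uFF21"))

# Halfwidth katakana KA plus the halfwidth voiced sound mark becomes the
# single ordinary katakana GA.
print(fold_width("\uFF76\uFF9E"))
```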
       | 
       | Also I guess this isn't specific to Japanese, but if you use
       | normalization in NFD form, the modifiers like handakuten get
       | split into separate characters (I don't think most people ever
       | use unicode normalization but iirc mac filesystem paths are
       | normalized so it can be really confusing when you do actually run
       | into it).
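
The NFD behaviour mentioned in this comment is easy to reproduce with the standard `unicodedata` module. This is a small sketch; `same_text` is a hypothetical helper showing one way to compare strings that may be in different normalization forms.

```python
import unicodedata

composed = "\u3071"  # hiragana PA as a single code point
decomposed = unicodedata.normalize("NFD", composed)

# NFD splits PA into HA followed by a combining handakuten.
print([f"U+{ord(ch):04X}" for ch in decomposed])

def same_text(a, b):
    """Compare two strings after bringing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# The raw strings differ, which is what makes filenames confusing,
# but they compare equal once normalized.
print(composed == decomposed, same_text(composed, decomposed))
```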
        
       ___________________________________________________________________
       (page generated 2022-09-28 23:00 UTC)