[HN Gopher] How does Chrome decide what to highlight when you do...
       ___________________________________________________________________
        
       How does Chrome decide what to highlight when you double-click
       Japanese text?
        
       Author : polm23
       Score  : 124 points
       Date   : 2020-05-08 06:15 UTC (16 hours ago)
        
 (HTM) web link (stackoverflow.com)
 (TXT) w3m dump (stackoverflow.com)
        
       | darkerside wrote:
       | > The quality is not amazing but I'm surprised this is supported
       | at all.
       | 
       | I find this line hilarious for some reason. Reminds me of the
       | line about being a tourist in France, "French people don't expect
       | you to speak French, but they appreciate it when you try"
        
         | whoisjuan wrote:
          | Was OP referring to the fact that the V8 namespace is available
          | inside JSFiddle?
          | 
          | Because I was a bit surprised about that, and it made me wonder
          | whether opening this JSFiddle in Safari would work at all (I'm
          | on a phone so I can't test).
        
           | darkerside wrote:
           | That's funny. Not at all the way I read it, but it could
           | totally be read that way. Made me do a double take.
           | 
           | Still, I think my original reading is correct, because I
            | don't think there is any issue with the "quality" of V8
            | inside of JSFiddle. Meanwhile, imagining Chrome doing its
            | best to identify real words in long strings of Japanese text
            | and failing spectacularly just made me laugh again.
        
           | yftsui wrote:
           | The fiddle doesn't work on Safari.
           | 
           | TypeError: Intl.v8BreakIterator is not a function. (In
           | 'Intl.v8BreakIterator(['ja-JP'], {type:'word'})',
           | 'Intl.v8BreakIterator' is undefined)
        
         | curiousgal wrote:
          | Being a resident of France, I have to say that French people
          | are the opposite of that. Same for Japanese people. The only
          | people I've noticed who get genuinely excited as you butcher
          | their language are Arabic speakers.
        
       | LikeAnElephant wrote:
        | This is often determined by Unicode and not the browsers
       | specifically (though some browsers could override the suggested
       | Unicode approach).
       | 
        | Each Unicode character has certain properties, one of which is
        | whether a break is allowed before / after it.
       | 
       | I've done extensive research on this for my job, but
       | unfortunately don't have time to do the whole writeup here. Here
        | are several resources for those who are interested:
       | 
       | Info on break opportunities:
       | 
       | https://unicode.org/reports/tr14/#BreakOpportunities
       | 
       | The entire Unicode Character Database (~80MB XML file last I
       | checked)
       | 
       | https://unicode.org/reports/tr44/
       | 
        | The properties within the UCD are hard to parse; here's a
       | reference if you're interested:
       | 
       | https://unicode.org/reports/tr14/#Table1
       | 
       | https://www.unicode.org/Public/5.2.0/ucd/PropertyAliases.txt
       | 
       | https://www.unicode.org/Public/5.2.0/ucd/PropertyValueAliase...
       | 
        | Overall, word / line breaking in no-space languages is a very
        | difficult problem for Unicode. Where the UCD says there can be a
        | line break isn't always where a native speaker would put one. In
        | order to do it correctly you have to bring in natural language
        | processing, but that has its own set of complexities.
       | 
       | In summary: I18N is hard!
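        | 
        | As a toy illustration of the break-opportunity idea (the
        | Line_Break classes here are hard-coded for a handful of
        | characters, and this is nowhere near the full UAX #14 rule set;
        | real implementations read the classes from the UCD):
        | 
        |     // CL = closing punctuation (no break allowed before it);
        |     // everything else is treated as ID (ideographic) for the
        |     // purposes of this demo.
        |     function lineBreakClass(ch) {
        |       return '。、」）'.includes(ch) ? 'CL' : 'ID'
        |     }
        | 
        |     function breakOpportunities(text) {
        |       var breaks = []
        |       for (var i = 1; i < text.length; i++) {
        |         if (lineBreakClass(text[i]) !== 'CL') breaks.push(i)
        |       }
        |       return breaks
        |     }
        | 
        |     // A break is allowed between every pair of kanji/kana, even
        |     // mid-word, hence the need for dictionaries or NLP on top.
        |     console.log(breakOpportunities('日本語を勉強する。'))
        |     // -> [1, 2, 3, 4, 5, 6, 7]  (no break before 。)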
        
         | swang wrote:
          | Yes. This seems to work even when you pass it Chinese while
          | maintaining ja-JP as the language:
          | 
          |     function tokenizeJA(text) {
          |       var it = Intl.v8BreakIterator(['ja-JP'], {type: 'word'})
          |       it.adoptText(text)
          |       var words = []
          |       var cur = 0, prev = 0
          |       while (cur < text.length) {
          |         prev = cur
          |         cur = it.next()
          |         words.push(text.substring(prev, cur))
          |       }
          |       return words
          |     }
          | 
          |     console.log(tokenizeJA("今天要去哪里?"))
          | 
          | It still seems to parse just fine, so it's most likely just
          | using the passed input to parse.
        
           | LikeAnElephant wrote:
           | Yep, the browser has the UCD info built into it (a
            | simplification... but basically). Similarly, our mobile
            | devices and various backend languages have the same data
            | baked into them.
           | 
            | This is why there are sometimes discrepancies in how a given
            | browser or device outputs this data: it could be working off
            | of an outdated version of Unicode's data.
           | 
            | Some devices even override the default Unicode behavior.
           | There are just SO many languages and SO many regions and SO
           | many combinations thereof that even Unicode can't cover all
           | the bases. It's all very fascinating from an engineering
           | perspective.
        
           | yorwba wrote:
           | The underlying library is actually using a single dictionary
           | for both Chinese and Japanese https://github.com/unicode-
           | org/icu/tree/7814980f51bca2000a96...
        
           | erjiang wrote:
           | It turns out that's because ICU uses a combined
           | Chinese/Japanese dictionary instead of separate dictionaries
            | for each language, which is probably a little more robust if
            | you misdetect some Chinese text as Japanese or vice versa.
        
         | erjiang wrote:
         | The library that Chrome uses seems to use a dictionary[0],
         | since you can't determine word boundaries in Japanese just by
         | looking at two characters.
         | 
         | Your first link also says:
         | 
         | > To handle certain situations, some line breaking
         | implementations use techniques that cannot be expressed within
         | the framework of the Unicode Line Breaking Algorithm. Examples
         | include using dictionaries of words for languages that do not
         | use spaces
         | 
         | [0] posted in another top-level comment: http://userguide.icu-
         | project.org/boundaryanalysis
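          | 
          | As a toy sketch of why a word list helps (this is greedy
          | longest-match, not ICU's actual frequency-weighted algorithm,
          | and the tiny dictionary here is made up for the example):
          | 
          |     var dict = new Set(['日本語', '日本', '語', '勉強', 'する'])
          |     var maxLen = 3  // longest entry in the toy dictionary
          | 
          |     function segment(text) {
          |       var words = [], i = 0
          |       while (i < text.length) {
          |         var len = Math.min(maxLen, text.length - i)
          |         // Try the longest dictionary entry first, then fall
          |         // back to a single character.
          |         while (len > 1 && !dict.has(text.slice(i, i + len))) {
          |           len--
          |         }
          |         words.push(text.slice(i, i + len))
          |         i += len
          |       }
          |       return words
          |     }
          | 
          |     console.log(segment('日本語を勉強する'))
          |     // -> ['日本語', 'を', '勉強', 'する']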
        
           | LikeAnElephant wrote:
           | Yea, CJK (Chinese, Japanese, Korean) breaking is particularly
            | complex. Google has done a lot of work and has this open-
            | source implementation, which uses NLP. It's the best I've
            | personally come across:
           | 
           | https://github.com/google/budou
        
             | oehtXRwMkIs wrote:
             | Can't imagine it being difficult for Korean.
        
               | Asooka wrote:
               | Korean is especially difficult. Chinese uses only hanzi
               | and you have a limited set of logogram combinations that
               | result in a word. Japanese is easier because they often
                | add kana to the ends of words (e.g. for verb conjugation)
               | and you can use that (in addition to the Chinese
               | algorithm) to delineate words - there is no word* where a
               | kanji character follows a kana character. Korean on the
               | other hand uses only phonetic characters with no semantic
               | component, so you just have to guess the way a human
               | guesses.
               | 
               | * with a few exceptions, of course :)
        
               | oehtXRwMkIs wrote:
               | Curious where you came up with "phonetic characters with
               | no semantic component". It's just an alphabet with
                | spaces, with each block representing a syllable. It's
               | easier than Latin.
        
               | qiqitori wrote:
               | Korean uses spaces.
        
             | jcampbell1 wrote:
             | I wrote a Chinese Segmenter that is available on the web:
             | 
             | https://chinese.yabla.com/chinese-english-pinyin-
             | dictionary....
             | 
             | It does basic path finding, and then picks the best path
             | based on the following rules:
             | 
             | 1) Fewest words
             | 
             | 2) Least variance in word length (e.g. prefer a 2,2
             | character split vs a 3-1 split)
             | 
              | 3) Solo freedom (this is based on corpus analysis, which
              | tags characters with a probability of being a 1-character
              | word). For example, 王家庭 (Wang Jia Ting) is either "Wang
              | household" (王 家庭) or "prince's courtyard" (王家 庭), and
              | we split it as "Wang household" because Wang (王) is a
              | common name that frequently appears in isolation, and Ting
              | (庭) is less likely to be in isolation. It is interesting
              | that solo freedom works better than comparing the corpus
              | frequency of "prince" (王家) vs "household" (家庭).
             | 
             | It works reasonably well. A surprising number of people use
             | it every day.
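              | 
              | A rough sketch of the ranking idea (rules 1 and 2 only;
              | "solo freedom" needs corpus-derived probabilities, and the
              | candidate paths are assumed to be generated already):
              | 
              |     // Score a candidate split: fewer words first, then
              |     // lower variance in word length.
              |     function score(words) {
              |       var mean = words.join('').length / words.length
              |       var variance = words.reduce(function (s, w) {
              |         return s + Math.pow(w.length - mean, 2)
              |       }, 0) / words.length
              |       return [words.length, variance]
              |     }
              | 
              |     function best(candidates) {
              |       return candidates.slice().sort(function (a, b) {
              |         var sa = score(a), sb = score(b)
              |         return (sa[0] - sb[0]) || (sa[1] - sb[1])
              |       })[0]
              |     }
              | 
              |     // Rule 2 in action: the 2,2 split beats the 3,1 split.
              |     console.log(best([['中文分', '词'], ['中文', '分词']]))
              |     // -> ['中文', '分词']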
        
             | krackers wrote:
             | I don't think their own lexing backend is actually open-
             | source, as Budou just relies on a choice of 3 backends
             | (MeCab, TinySegmenter, Google NLP) to do the lexing. I'm
             | assuming Google NLP performs the best, but that isn't free
             | and certainly not open source.
        
               | Wowfunhappy wrote:
               | Does this mean that highlighting doesn't work properly
               | when you're offline?
        
               | krackers wrote:
               | I think the implementation used in Chrome is different
               | from the ones used in Budou. The implementation in Chrome
                | is dictionary-based, as one of the parent threads
                | mentioned, and _that_ is completely open-source, though
                | it probably doesn't produce as good a result as their
                | homegrown NLP stuff.
        
       | dirtydroog wrote:
       | TIL: Some languages do not have spaces
        
       | 1024core wrote:
       | So what is the most accurate way to tokenize CJK text?
        
         | edflsafoiewq wrote:
         | By hand using a native speaker.
        
           | tobyhinloopen wrote:
           | Is there an API for that?
        
             | syrrim wrote:
             | you could try something with mturk, I'm sure.
        
             | cferr wrote:
             | Yes, but it hasn't been written yet.
        
       | wikibob wrote:
       | Safari seems to do this too.
        
       | emilfihlman wrote:
        | From a very quick, cursory look, Japanese seems to have some
        | sort of commas and periods. Shouldn't it then be simple to adopt
        | spaces too (along with proper commas, periods, and parentheses,
        | which they actually use already)?
        
         | Freak_NL wrote:
         | You want to change a written language for the express purpose
          | of enabling computers to detect word boundaries more easily?
        
           | Asooka wrote:
            | It's not entirely unprecedented. Modern Cyrillic shapes are
            | in part influenced by the printing press - when it was being
            | imported into Russia, they didn't want to mill an entire new
            | set of characters, so they repurposed the Latin ones that
            | look like Cyrillic characters and only added the ones that
            | were missing. Of course, the parent's suggestion will never
            | pass, as it would require everyone everywhere to start using
            | spaces, which is more trouble than it's worth to native
            | language users.
        
           | bluquark wrote:
           | Note that Japanese and Chinese already changed to commonly
           | use horizontal left-to-right text (in addition to vertical
           | top-to-bottom/right-to-left text, which is still usual
           | especially in "proper published typesetting") in large part
           | because computers handled that much better for decades.
        
           | emilfihlman wrote:
           | It's just lighthearted pondering on the issue. Written
           | language evolves, or is that only a standard applied to
           | $YOUR_LANGUAGE when defending people not learning it?
           | 
           | I think having spaces is useful, too, for learning speed and
           | expression.
        
             | rafi_kamal wrote:
              | A similar argument also applies to English. English isn't
              | phonetic, which makes it hard to pronounce correctly unless
              | you are learning from a native speaker. But I don't see
              | written English evolving in a way that makes it easier to
              | pronounce for new learners.
        
               | mring33621 wrote:
               | I'm a life-long English speaker and am 100% OK with
               | changing the language to make it easier to learn.
        
               | zeroimpl wrote:
               | Let's make phase 1 be the unification of American and
               | British spelling. That should be easy enough right?
        
         | 4bpp wrote:
         | Who would be both interested in and able to champion such a
         | change, though? Spaced Japanese text would look at least as
         | jarring to someone who is used to its present state as English
         | text following the German capitalization pattern (all Nouns to
         | be capitalised in all Contexts) would to an English speaker,
         | and I doubt the present anglophone world could be persuaded to
         | adopt the latter change even if it turned out to convey
         | immeasurable advantages to computational processing of text.
        
         | corey_moncure wrote:
          | Sort of, but then no one can agree on things like whether
         | particles are part of the word or not. To the fluent reader,
         | the kanji/kana switching breaks up the text enough to lex it.
          | I'm non-native, and reading all-kana is actually much more
          | difficult for me than the usual mixture.
         | 
         | Many all-kana texts employ spaces. All-kana with no spaces is a
         | nightmare.
        
           | LikeAnElephant wrote:
            | Agree with this, but then you come across a more
            | philosophical debate: should the very nature of written
            | language change to adapt to the needs of technology, or the
            | other way around?
            | 
            | On one hand, language has always been an amorphous thing and
            | has changed throughout history. On the other, could some
            | changes alter the very nature and heart of it? No one agrees
            | on those trade-offs, and I'd argue no one likely ever will.
        
             | knolax wrote:
             | If the Japanese language needs to change, it won't be
             | because some Western developer found it too hard to develop
             | some obscure unused highlighting feature for it. The HN
              | crowd's fetishistic need to change everything for the sake
              | of developer convenience really reflects poorly on
              | programmers as an industry.
        
             | Asooka wrote:
             | > should the very nature of written language change to
             | adapt to the needs of technology
             | 
              | This has always been the case. I mean, think of how modern
             | latin letter forms developed. Think back to the Roman
             | Empire, when the letters were etched in stone or wax
             | tablets, with only capitals (before the invention of
             | minuscule), drawn with straight lines that are easy to
             | chisel. Think of the later periods when many books were
             | written by scribes using pen and ink, resulting in changes
             | to letters so they could write entire words in one hand
             | motion. Later still, we have the printing press and its
             | demands that each letter be a discrete unit, doing away
             | with the wavy flow of words. Today most people write using
             | print letters simply because the majority of writing you
             | would encounter is in print, so that's what's easiest to
              | read. Those changes probably don't register to you as being
              | as egregious as the one the parent proposes, because
             | the person enforcing the change and the person developing
             | the technology are one and the same - a native speaker
             | modifying his own language to fit the tools at hand. Which
             | I think brings us to a resolution - the native speakers of
             | the language are the ones to decide how to change it to
             | suit their everyday needs, including making it easier to
             | produce and consume with the tools of the day.
        
               | LikeAnElephant wrote:
               | > Which I think brings us to a resolution - the native
               | speakers of the language are the ones to decide how to
               | change it to suit their everyday needs, including making
               | it easier to produce and consume with the tools of the
               | day.
               | 
                | Agree with you here; it's well outside my rights as an
                | English-speaking American to have an opinion of any
                | merit. However, I would still struggle to call handing it
                | over to native speakers a resolution. Many native
                | speakers have incredibly strong feelings on both sides of
                | the argument, and often for very good and valid reasons.
               | 
               | I'm glad the Unicode Consortium exists, because I am
               | CERTAIN I don't want to be the decider or facilitator of
               | these discussions. Way above my pay grade.
        
         | nayuki wrote:
         | Spaces have been tried in Japanese. If you look at pure-kana
         | (hiragana/katakana) texts aimed at grade schoolers and foreign
         | language learners, there are spaces because without them there
         | would be too much parsing ambiguity. Text with full kanji for
          | adults does not use spaces, and this does not hurt
          | comprehension at all.
        
         | bluquark wrote:
         | The squarish glyph-shapes aren't a good fit for spaces. Scripts
         | with spaces have unevenly shaped letters so that one can
         | quickly scan a word by its length and pattern of
         | ascenders/descenders. In Japanese spaces do not aid this type
         | of scanning, and conversely the uneven _densities_ of the
         | glyphs allow for fast scanning without the aid of spaces.
        
       | peter303 wrote:
        | I wonder if this is applicable to Chinese. In Chinese, one to
        | four characters comprise a word. In post-revolutionary Chinese,
        | all the characters in a sentence are run together and you have
        | to mentally parse the words. (It's worse in pre-revolutionary
        | Chinese, where neither sentences nor paragraphs are punctuated.)
        | (Pre-medieval European languages used to run all their letters
        | together without word or sentence breaks. Torah Hebrew still
        | does this.)
        
       | oh_sigh wrote:
       | Given this property of Japanese text, is there wordplay
       | associated with a string of characters with double/reverse
       | meanings depending on how the characters are combined?
        
       | dlivingston wrote:
       | Here's a brief write-up [0] on techniques and software for
       | Japanese tokenization.
       | 
       | [0]: http://www.solutions.asia/2016/10/japanese-
       | tokenization.html...
        
       | knolax wrote:
       | Double-click highlighting barely makes any sense in English, let
       | alone in a language that doesn't use spaces. The fact that mobile
        | browsers treat long presses as a double press for URLs has been
       | the bane of my existence. I doubt any native Japanese speakers
       | use this feature.
        
       | JonathonW wrote:
       | ICU (International Components for Unicode) provides an API for
       | this: http://userguide.icu-project.org/boundaryanalysis
       | 
       | Assuming Blink is using the same technique for text selection as
       | V8 is for the public Intl.v8BreakIterator method, that's how
        | Chrome's handling this: Intl.v8BreakIterator is a pretty thin
       | wrapper around the ICU BreakIterator implementation:
       | https://chromium.googlesource.com/v8/v8/+/refs/heads/master/...
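        | 
        | As a rough sketch of how word-at-an-offset selection could sit
        | on top of that wrapper (this is not Blink's actual selection
        | code, just the same non-standard Intl.v8BreakIterator API the
        | fiddle uses):
        | 
        |     function wordAt(text, offset) {
        |       var it = Intl.v8BreakIterator(['ja-JP'], {type: 'word'})
        |       it.adoptText(text)
        |       var prev = 0, cur = 0
        |       while (cur < text.length) {
        |         prev = cur
        |         cur = it.next()
        |         // The double-clicked offset falls inside this segment.
        |         if (offset >= prev && offset < cur) break
        |       }
        |       return text.substring(prev, cur)
        |     }
        | 
        |     // e.g. wordAt('日本語を勉強する', 5) would likely return
        |     // '勉強' with the ICU dictionary.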
        
         | TheChaplain wrote:
          | For Firefox there's a bug discussing the pros and cons of using
         | ICU for boundaries, so at least they are aware of the issue.
         | 
         | https://bugzilla.mozilla.org/show_bug.cgi?id=820261
        
         | chch wrote:
         | Doing a bit more deep diving into the ICU code, it looks like
         | the source code for the Break engine (used by Chinese,
         | Japanese, and Korean) is here: https://github.com/unicode-
         | org/icu/blob/778d0a6d1d46faa724ea...
         | 
          | and then according to the LICENSE file[1], the dictionary:
          | 
          |     #  The word list in cjdict.txt are generated by combining
          |     #  three word lists listed below with further processing
          |     #  for compound word breaking. The frequency is generated
          |     #  with an iterative training against Google web corpora.
          |     #
          |     #  * Libtabe (Chinese)
          |     #    - https://sourceforge.net/project/?group_id=1519
          |     #    - Its license terms and conditions are shown below.
          |     #
          |     #  * IPADIC (Japanese)
          |     #    - http://chasen.aist-nara.ac.jp/chasen/distribution.html
          |     #    - Its license terms and conditions are shown below.
          |     #
         | 
         | It's interesting to see some of the other techniques used in
         | that engine, such as a special function to figure out the
         | weights of potential katakana word splits.
         | 
         | [1] https://github.com/unicode-
         | org/icu/blob/6417a3b720d8ae3643f7...
        
         | erjiang wrote:
         | According to your first link, BreakIterator uses a dictionary
         | for several languages, including Japanese. So I guess the full
         | answer is something like:
         | 
          | Chrome uses V8's Intl.v8BreakIterator, which uses
          | icu::BreakIterator, which, for Japanese text, uses a big long
         | list of Japanese words to try to figure out what is a word and
         | what isn't. I've worked on a similar segmenter for Chinese and
         | yeah, quality isn't great but it works in enough cases to be
         | useful.
        
         | wwarner wrote:
         | was looking for that last link myself -- thanks!
        
       | trnglina wrote:
       | Firefox, in contrast, breaks at script boundaries, so it'll
       | select runs of Hiragana, Katakana, and Kanji. Not nearly as
       | useful, and definitely makes copying Japanese text especially
       | annoying.
        
         | zeroimpl wrote:
         | It also prevents the dictionary lookup gesture of macOS from
         | working in Firefox, since it selects the whole sentence and
         | looks that up (which fails).
        
         | knolax wrote:
         | Personally I find double click highlighting to be a useless
         | feature in any language, but the Firefox approach is superior
         | imo. Breaking at script boundaries is predictable behavior the
         | user can anticipate whereas doing some janky ad hoc natural
         | language processing invariably results in behavior that is
         | essentially random from a user perspective. I tried out double
         | click highlighting on some basic Japanese sentences on Chromium
         | and it failed to highlight any of what would be considered
         | words.
         | 
         | It's not like English highlighting does complex grammatical
         | analysis to make sure "Project Manager" gets highlighted as one
         | chunk and "eventUpdate" gets highlighted as two chunks, most
         | implementations just breaks at spaces like the user expects.
        
       ___________________________________________________________________
       (page generated 2020-05-08 23:00 UTC)