[HN Gopher] How does Chrome decide what to highlight when you do...
___________________________________________________________________

How does Chrome decide what to highlight when you double-click
Japanese text?

Author : polm23
Score  : 124 points
Date   : 2020-05-08 06:15 UTC (16 hours ago)

(HTM) web link (stackoverflow.com)
(TXT) w3m dump (stackoverflow.com)

| darkerside wrote:
| > The quality is not amazing but I'm surprised this is supported
| > at all.
|
| I find this line hilarious for some reason. Reminds me of the
| line about being a tourist in France: "French people don't expect
| you to speak French, but they appreciate it when you try."
| whoisjuan wrote:
| Was OP referring to the fact that the V8 namespace is available
| inside JSFiddle?
|
| Because I was a bit surprised about that, and it made me wonder
| whether opening this JSFiddle in Safari would work at all (I'm on
| a phone, so I can't test).
| darkerside wrote:
| That's funny. Not at all the way I read it, but it could totally
| be read that way. Made me do a double take.
|
| Still, I think my original reading is correct, because I don't
| think there is any issue with the "quality" of V8 inside of
| JSFiddle. Meanwhile, imagining Chrome doing its best to identify
| real words in long strings of Japanese text and failing
| spectacularly just made me laugh again.
| yftsui wrote:
| The fiddle doesn't work on Safari.
|
|     TypeError: Intl.v8BreakIterator is not a function. (In
|     'Intl.v8BreakIterator(['ja-JP'], {type:'word'})',
|     'Intl.v8BreakIterator' is undefined)
| curiousgal wrote:
| Being a resident of France, I have to say that French people are
| the opposite of that. Same for Japanese people. The only people
| who get genuinely excited as you butcher their language, I've
| noticed, are Arabic speakers.
| LikeAnElephant wrote:
| This is often determined by Unicode and not the browsers
| specifically (though some browsers could override the suggested
| Unicode approach).
|
| Each Unicode character has certain properties, one of which is
| whether that character indicates a break before / after itself.
|
| I've done extensive research on this for my job, but
| unfortunately don't have time to do the whole writeup here. Here
| are several resources for those who are interested.
|
| Info on break opportunities:
|
| https://unicode.org/reports/tr14/#BreakOpportunities
|
| The entire Unicode Character Database (~80MB XML file last I
| checked):
|
| https://unicode.org/reports/tr44/
|
| The properties within the UCD are hard to parse; here's a
| reference if you're interested:
|
| https://unicode.org/reports/tr14/#Table1
|
| https://www.unicode.org/Public/5.2.0/ucd/PropertyAliases.txt
|
| https://www.unicode.org/Public/5.2.0/ucd/PropertyValueAliase...
|
| Overall, word / line breaking in no-space languages is a very
| difficult problem for Unicode. Where the UCD says there can be a
| line break often isn't where a native speaker would put one. To
| do it correctly you have to bring in natural language processing,
| but that has its own set of complexities.
|
| In summary: I18N is hard!
| swang wrote:
| Yes. This seems to work even when you pass it Chinese while
| maintaining ja-JP as the language:
|
|     function tokenizeJA(text) {
|       var it = Intl.v8BreakIterator(['ja-JP'], {type: 'word'})
|       it.adoptText(text)
|       var words = []
|       var cur = 0, prev = 0
|       while (cur < text.length) {
|         prev = cur
|         cur = it.next()
|         words.push(text.substring(prev, cur))
|       }
|       return words
|     }
|
|     console.log(tokenizeJA("今天要去哪里？"))
|
| It still seems to parse just fine, so it's most likely just using
| the passed input to parse.
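A side note on the snippet above: Intl.v8BreakIterator was a
non-standard, Chrome-only API; its standardized successor,
Intl.Segmenter, exposes the same ICU word segmentation. A minimal
sketch of the same tokenizer against the standard API, assuming a
runtime that implements it:

    // Standard Intl.Segmenter equivalent of the snippet above.
    // isWordLike filters out runs of punctuation and whitespace.
    function tokenizeJA(text) {
      const seg = new Intl.Segmenter('ja-JP', {granularity: 'word'});
      const words = [];
      for (const {segment, isWordLike} of seg.segment(text)) {
        if (isWordLike) words.push(segment);
      }
      return words;
    }

    console.log(tokenizeJA("今天要去哪里？"))

As with v8BreakIterator, the locale is only a hint: the ICU
dictionary backing both APIs covers Chinese and Japanese together
(see the comments below), which is why Chinese input still segments
under ja-JP.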
| LikeAnElephant wrote:
| Yep, the browser has the UCD info built into it (a
| simplification... but basically). Similarly, our mobile devices
| and various backend languages have the same data baked into them.
|
| This is why there are sometimes discrepancies in how a given
| browser or device outputs this data: it could be working off an
| outdated version of Unicode's data.
|
| Some devices even overwrite the default Unicode behavior. There
| are just SO many languages and SO many regions and SO many
| combinations thereof that even Unicode can't cover all the bases.
| It's all very fascinating from an engineering perspective.
| yorwba wrote:
| The underlying library is actually using a single dictionary for
| both Chinese and Japanese:
| https://github.com/unicode-org/icu/tree/7814980f51bca2000a96...
| erjiang wrote:
| It turns out that's because ICU uses a combined Chinese/Japanese
| dictionary instead of separate dictionaries for each language,
| which is probably a little more robust if you misdetect some
| Chinese text as Japanese or vice versa.
| erjiang wrote:
| The library that Chrome uses seems to use a dictionary[0], since
| you can't determine word boundaries in Japanese just by looking
| at two characters.
|
| Your first link also says:
|
| > To handle certain situations, some line breaking
| > implementations use techniques that cannot be expressed within
| > the framework of the Unicode Line Breaking Algorithm. Examples
| > include using dictionaries of words for languages that do not
| > use spaces
|
| [0] posted in another top-level comment:
| http://userguide.icu-project.org/boundaryanalysis
| LikeAnElephant wrote:
| Yea, CJK (Chinese, Japanese, Korean) breaking is particularly
| complex. Google has done a lot of work here and has this open-
| source implementation, which uses NLP. It's the best I've
| personally come across:
|
| https://github.com/google/budou
| oehtXRwMkIs wrote:
| Can't imagine it being difficult for Korean.
| Asooka wrote:
| Korean is especially difficult. Chinese uses only hanzi, and you
| have a limited set of logogram combinations that result in a
| word. Japanese is easier because they often add kana to the ends
| of words (e.g. for verb conjugation) and you can use that (in
| addition to the Chinese algorithm) to delineate words - there is
| no word* where a kanji character follows a kana character.
| Korean, on the other hand, uses only phonetic characters with no
| semantic component, so you just have to guess the way a human
| guesses.
|
| * with a few exceptions, of course :)
| oehtXRwMkIs wrote:
| Curious where you came up with "phonetic characters with no
| semantic component". It's just an alphabet with spaces, with each
| block representing a syllable. It's easier than Latin.
| qiqitori wrote:
| Korean uses spaces.
| jcampbell1 wrote:
| I wrote a Chinese segmenter that is available on the web:
|
| https://chinese.yabla.com/chinese-english-pinyin-dictionary....
|
| It does basic path finding, and then picks the best path based on
| the following rules:
|
| 1) Fewest words
|
| 2) Least variance in word length (e.g. prefer a 2-2 character
| split over a 3-1 split)
|
| 3) Solo freedom (this is based on corpus analysis which tags
| characters with a probability of being a one-character word).
| For example, 王家庭 is either "Wang Household" (王 家庭) or
| "Prince's courtyard" (王家 庭), and we split it as "Wang
| Household" because 王 is a common name that frequently appears in
| isolation, while 庭 is less likely to appear in isolation. It is
| interesting that solo freedom works better than comparing the
| corpus frequency of "Prince" (王家) vs. "Household" (家庭).
|
| It works reasonably well. A surprising number of people use it
| every day.
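A toy sketch of the path-finding approach described above, scoring
candidate splits by rules 1 and 2. The four-entry dictionary is
hypothetical, just big enough to reproduce the 王家庭 example; the
real segmenter's word list and weights are not shown here:

    // Toy dictionary-based path finding. DICT and MAX_WORD_LEN are
    // made up for this example; a real segmenter uses a large
    // corpus-derived word list.
    const DICT = new Set(['王', '王家', '家庭', '庭']);
    const MAX_WORD_LEN = 2;

    // Enumerate every way to split `text` into dictionary words.
    function segmentations(text) {
      if (text.length === 0) return [[]];
      const results = [];
      for (let len = 1; len <= Math.min(MAX_WORD_LEN, text.length); len++) {
        const head = text.slice(0, len);
        if (!DICT.has(head)) continue;
        for (const rest of segmentations(text.slice(len))) {
          results.push([head, ...rest]);
        }
      }
      return results;
    }

    // Rules 1 and 2: fewest words, then least variance in length.
    function score(words) {
      const mean = words.reduce((s, w) => s + w.length, 0) / words.length;
      const variance =
        words.reduce((s, w) => s + (w.length - mean) ** 2, 0) / words.length;
      return [words.length, variance];
    }

    const ranked = segmentations('王家庭').sort(
      (a, b) => score(a)[0] - score(b)[0] || score(a)[1] - score(b)[1]
    );
    console.log(ranked);
    // [['王', '家庭'], ['王家', '庭']] - both splits have two words
    // and the same length variance, so it is exactly rule 3 ("solo
    // freedom") that breaks the tie in favor of 王 / 家庭.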
| krackers wrote:
| I don't think their own lexing backend is actually open-source,
| as Budou just relies on a choice of 3 backends (MeCab,
| TinySegmenter, Google NLP) to do the lexing. I'm assuming Google
| NLP performs the best, but that isn't free and certainly isn't
| open source.
| Wowfunhappy wrote:
| Does this mean that highlighting doesn't work properly when
| you're offline?
| krackers wrote:
| I think the implementation used in Chrome is different from the
| ones used in Budou. The implementation in Chrome is dictionary-
| based, as one of the parent threads mentioned, and _that_ is
| completely open-source, though it probably doesn't produce as
| good a result as their homegrown NLP stuff.
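Of the three Budou backends mentioned above, TinySegmenter is
notable for being a tiny, dictionary-free segmenter written in
plain JavaScript, so it runs entirely client-side. A usage sketch,
assuming its tiny_segmenter.js script has been loaded on the page:

    // TinySegmenter ships as a single script; once loaded it
    // exposes a TinySegmenter constructor. No dictionary file is
    // needed: it embeds a small pre-trained model instead.
    var segmenter = new TinySegmenter();
    var segs = segmenter.segment("私の名前は中野です");
    console.log(segs.join(" | "));  // 私 | の | 名前 | は | 中野 | です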
| dirtydroog wrote:
| TIL: Some languages do not have spaces
| 1024core wrote:
| So what is the most accurate way to tokenize CJK text?
| edflsafoiewq wrote:
| By hand, using a native speaker.
| tobyhinloopen wrote:
| Is there an API for that?
| syrrim wrote:
| You could try something with mturk, I'm sure.
| cferr wrote:
| Yes, but it hasn't been written yet.
| wikibob wrote:
| Safari seems to do this too.
| emilfihlman wrote:
| From a very quick, cursory look, Japanese seems to have some sort
| of commas and periods. Shouldn't it be simple to just adopt
| spaces too (and proper commas and periods and parentheses, which
| they actually use already)?
| Freak_NL wrote:
| You want to change a written language for the express purpose of
| enabling computers to detect word boundaries more easily?
| Asooka wrote:
| It's not entirely unprecedented. Modern Cyrillic shapes are in
| part influenced by the printing press - when it was being
| imported into Russia, they didn't want to mill an entire new set
| of characters, so they repurposed the Latin ones that look like
| Cyrillic characters and only added the ones that were missing. Of
| course, the parent's suggestion will never pass, as it would
| require everyone everywhere to start using spaces, which is more
| trouble than it's worth to native users of the language.
| bluquark wrote:
| Note that Japanese and Chinese already changed to commonly use
| horizontal left-to-right text (in addition to vertical top-to-
| bottom / right-to-left text, which is still usual, especially in
| "proper" published typesetting), in large part because computers
| handled it much better for decades.
| emilfihlman wrote:
| It's just lighthearted pondering on the issue. Written language
| evolves - or is that a standard applied only to $YOUR_LANGUAGE
| when defending people who don't learn it?
|
| I think having spaces is useful, too, for learning speed and
| expression.
| rafi_kamal wrote:
| A similar argument also applies to English. English isn't
| phonetic, which makes it hard to pronounce correctly unless you
| are learning from a native speaker. But I don't see written
| English evolving in a way that makes it easier for new learners
| to pronounce.
| mring33621 wrote:
| I'm a life-long English speaker and am 100% OK with changing the
| language to make it easier to learn.
| zeroimpl wrote:
| Let's make phase 1 the unification of American and British
| spelling. That should be easy enough, right?
| 4bpp wrote:
| Who would be both interested in and able to champion such a
| change, though? Spaced Japanese text would look at least as
| jarring to someone who is used to its present state as English
| text following the German capitalization pattern (all Nouns to be
| capitalised in all Contexts) would to an English speaker, and I
| doubt the present anglophone world could be persuaded to adopt
| the latter change even if it turned out to convey immeasurable
| advantages for the computational processing of text.
| corey_moncure wrote:
| Sort of, but then no one can agree on things like whether
| particles are part of the word or not. To the fluent reader, the
| kanji/kana switching breaks up the text enough to lex it. I'm
| non-native, and reading all-kana is actually much more difficult
| for me than the usual mixture.
|
| Many all-kana texts employ spaces. All-kana with no spaces is a
| nightmare.
| LikeAnElephant wrote:
| Agreed, and then you come across a more philosophical debate:
| should the very nature of written language change to adapt to the
| needs of technology, or the other way around?
|
| On one hand, language has always been an amorphous thing and has
| changed throughout history. On the other, could some changes
| alter the very nature and heart of it? No one agrees on those
| trade-offs, and I'd argue no one likely ever will.
| knolax wrote:
| If the Japanese language needs to change, it won't be because
| some Western developer found it too hard to develop some obscure,
| unused highlighting feature for it. The HN crowd's fetishistic
| need to change everything for the sake of developer convenience
| really reflects poorly on programmers as an industry.
| Asooka wrote:
| > should the very nature of written language change to adapt to
| > the needs of technology
|
| This has always been the case. I mean, think of how modern Latin
| letter forms developed. Think back to the Roman Empire, when
| letters were etched in stone or wax tablets, with only capitals
| (before the invention of minuscule), drawn with straight lines
| that are easy to chisel. Think of the later periods when many
| books were written by scribes using pen and ink, resulting in
| changes to letters so they could write entire words in one hand
| motion. Later still, we have the printing press and its demand
| that each letter be a discrete unit, doing away with the wavy
| flow of words. Today most people write using print letters simply
| because the majority of writing you encounter is in print, so
| that's what's easiest to read. Those changes probably don't
| register to you as being as egregious as the one the parent
| proposes, because the person enforcing the change and the person
| developing the technology were one and the same - a native
| speaker modifying his own language to fit the tools at hand.
| Which I think brings us to a resolution - the native speakers of
| the language are the ones to decide how to change it to suit
| their everyday needs, including making it easier to produce and
| consume with the tools of the day.
| LikeAnElephant wrote:
| > Which I think brings us to a resolution - the native speakers
| > of the language are the ones to decide how to change it to suit
| > their everyday needs, including making it easier to produce and
| > consume with the tools of the day.
|
| Agree with you here; it's well outside my rights as an English-
| speaking American to have an opinion of any merit. However, I
| would still struggle to call handing it over to native speakers a
| resolution. Many native speakers have incredibly strong feelings
| on both sides of the argument, and often for very good and valid
| reasons.
|
| I'm glad the Unicode Consortium exists, because I am CERTAIN I
| don't want to be the decider or facilitator of these discussions.
| Way above my pay grade.
| nayuki wrote:
| Spaces have been tried in Japanese. If you look at pure-kana
| (hiragana/katakana) texts aimed at grade schoolers and foreign
| language learners, there are spaces, because without them there
| would be too much parsing ambiguity. Texts in full kanji for
| adults do not use spaces, and this does not hurt comprehension at
| all.
| bluquark wrote:
| The squarish glyph shapes aren't a good fit for spaces. Scripts
| with spaces have unevenly shaped letters, so one can quickly scan
| a word by its length and its pattern of ascenders/descenders. In
| Japanese, spaces do not aid this type of scanning; conversely,
| the uneven _densities_ of the glyphs allow for fast scanning
| without the aid of spaces.
| peter303 wrote:
| I wonder if this is applicable to Chinese. In Chinese, one to
| four characters comprise a word. In post-revolutionary Chinese,
| all the characters in a sentence are run together and you have to
| mentally parse the words. (It's worse in pre-revolutionary
| Chinese, where neither sentences nor paragraphs are punctuated.)
| (Pre-medieval European languages used to run all their letters
| together without word or sentence breaks. Torah Hebrew still does
| this.)
| oh_sigh wrote:
| Given this property of Japanese text, is there wordplay
| associated with a string of characters having double/reverse
| meanings depending on how the characters are combined?
| dlivingston wrote:
| Here's a brief write-up [0] on techniques and software for
| Japanese tokenization.
|
| [0]: http://www.solutions.asia/2016/10/japanese-tokenization.html...
| knolax wrote:
| Double-click highlighting barely makes any sense in English, let
| alone in a language that doesn't use spaces. The fact that mobile
| browsers treat long presses as a double press for URLs has been
| the bane of my existence. I doubt any native Japanese speakers
| use this feature.
| JonathonW wrote:
| ICU (International Components for Unicode) provides an API for
| this: http://userguide.icu-project.org/boundaryanalysis
|
| Assuming Blink is using the same technique for text selection as
| V8 is for the public Intl.v8BreakIterator method, that's how
| Chrome's handling this -- Intl.v8BreakIterator is a pretty thin
| wrapper around the ICU BreakIterator implementation:
| https://chromium.googlesource.com/v8/v8/+/refs/heads/master/...
| TheChaplain wrote:
| For Firefox there's a bug discussing the pros and cons of using
| ICU for boundaries, so at least they are aware of the issue.
|
| https://bugzilla.mozilla.org/show_bug.cgi?id=820261
| chch wrote:
| Doing a bit more deep diving into the ICU code, it looks like the
| source code for the break engine (used by Chinese, Japanese, and
| Korean) is here:
| https://github.com/unicode-org/icu/blob/778d0a6d1d46faa724ea...
|
| and then, according to the LICENSE file[1], the dictionary:
|
|     # The word list in cjdict.txt are generated by combining three word lists
|     # listed below with further processing for compound word breaking. The
|     # frequency is generated with an iterative training against Google web
|     # corpora.
|     #
|     #  * Libtabe (Chinese)
|     #    - https://sourceforge.net/project/?group_id=1519
|     #    - Its license terms and conditions are shown below.
|     #
|     #  * IPADIC (Japanese)
|     #    - http://chasen.aist-nara.ac.jp/chasen/distribution.html
|     #    - Its license terms and conditions are shown below.
|
| It's interesting to see some of the other techniques used in that
| engine, such as a special function to figure out the weights of
| potential katakana word splits.
|
| [1] https://github.com/unicode-org/icu/blob/6417a3b720d8ae3643f7...
| erjiang wrote:
| According to your first link, BreakIterator uses a dictionary for
| several languages, including Japanese. So I guess the full answer
| is something like:
|
| Chrome uses V8's Intl.v8BreakIterator, which uses
| icu::BreakIterator, which, for Japanese text, uses a big long
| list of Japanese words to try to figure out what is a word and
| what isn't. I've worked on a similar segmenter for Chinese and,
| yeah, the quality isn't great, but it works in enough cases to be
| useful.
| wwarner wrote:
| Was looking for that last link myself -- thanks!
| trnglina wrote:
| Firefox, in contrast, breaks at script boundaries, so it'll
| select runs of Hiragana, Katakana, and Kanji. Not nearly as
| useful, and it definitely makes copying Japanese text especially
| annoying.
| zeroimpl wrote:
| It also prevents the dictionary lookup gesture of macOS from
| working in Firefox, since it selects the whole sentence and looks
| that up (which fails).
| knolax wrote:
| Personally, I find double-click highlighting to be a useless
| feature in any language, but the Firefox approach is superior,
| imo. Breaking at script boundaries is predictable behavior the
| user can anticipate, whereas doing some janky, ad hoc natural
| language processing invariably results in behavior that is
| essentially random from the user's perspective. I tried out
| double-click highlighting on some basic Japanese sentences in
| Chromium and it failed to highlight any of what would be
| considered words.
|
| It's not like English highlighting does complex grammatical
| analysis to make sure "Project Manager" gets highlighted as one
| chunk and "eventUpdate" gets highlighted as two chunks; most
| implementations just break at spaces, like the user expects.
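For the curious, the script-boundary splitting described in the last
few comments can be approximated in a few lines with Unicode script
property escapes (ES2018). This is only an illustration of the idea,
not Firefox's actual implementation:

    // Split text into runs of Hiragana, Katakana, Han, or "other".
    // Script_Extensions (rather than Script) is used so that marks
    // like the long-vowel bar ー stay inside katakana runs.
    const RUN = new RegExp(
      [
        '\\p{Script_Extensions=Hiragana}+',
        '\\p{Script_Extensions=Katakana}+',
        '\\p{Script_Extensions=Han}+',
        '[^\\p{Script_Extensions=Hiragana}' +
          '\\p{Script_Extensions=Katakana}\\p{Script_Extensions=Han}]+',
      ].join('|'),
      'gu'
    );

    function splitByScript(text) {
      return text.match(RUN) || [];
    }

    console.log(splitByScript('彼はラーメンを食べた'));
    // ['彼', 'は', 'ラーメン', 'を', '食', 'べた'] - note how the
    // single word 食べた ("ate") is split at the kanji/kana
    // boundary, which is why this is less useful than Chrome's
    // dictionary-based segmentation.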
___________________________________________________________________
(page generated 2020-05-08 23:00 UTC)