[HN Gopher] Unicode Utilities: Confusables
___________________________________________________________________
Unicode Utilities: Confusables

Author : simonpure
Score  : 46 points
Date   : 2022-08-17 15:09 UTC (3 days ago)

(HTM) web link (util.unicode.org)
(TXT) w3m dump (util.unicode.org)

| gcau wrote:
| If it's of interest, I'm the author of the JavaScript/npm package
| 'confusables', which does the same thing (and also in the
| reverse).
|
| Source: https://github.com/gc/confusables
| Demo: https://confusables.gc.codes/
| danbruc wrote:
| _Total raw values: 148,068,974,592,000
|
| Too many raw items to process._
|
| Any good guesses what this is about?
|
| EDIT: Got it, the number of strings confusable with the input.
| jancsika wrote:
| Too bad there isn't a standard lib that will flag any non-ASCII
| characters to be rendered big like those oversized sprites in the
| "giant" level of Super Mario 3.
|
| So you scroll through source code and hit this weirdly spaced
| line with a big "h" character and go-- oh-- that's some buggy
| crap in there.
|
| Then maybe have a map for ranges that you want to include in your
| allowable set, and we're all good to go. :)
| jabbany wrote:
| I think the Rust compiler kind of does this:
| https://mobile.twitter.com/hmemcpy/status/115189096847766732...
| jancsika wrote:
| Ooh, good job, Rust!
| jeroenhd wrote:
| Only a partial set, though:
| https://github.com/rust-lang/rust/blob/master/compiler/rustc...
| II2II wrote:
| Better yet, highlight characters out of context. Not everyone
| writes in languages that are fully representable in ASCII, yet
| confusable characters are still an issue.
| hilbert42 wrote:
| Potentially this is a big problem. Especially with OCR,
| transliterations from and across different languages, characters
| missing their diacriticals, etc.
|
| Whilst I was aware of the problem I wasn't aware that it's as big
| as it is.
| Those of us who use Latin scripts are reasonably familiar with
| the common ones such as _o, O_ and _0_ [lowercase o, uppercase O
| and zero], but there are some tricky ones even in these Latin
| scripts that many of us get wrong.
|
| As I discovered a while back - but I can't remember how - most of
| us get the World War Two abbreviation _WWII_ wrong (myself
| included). The _'II'_ should not be two alphabetic characters, as
| we almost inevitably type, but rather the Unicode characters for
| Roman numerals. Even then I cannot remember if the correct
| transliteration for Arabic numeral 2 is supposed to be Roman
| numeral '1' used twice/repeated or if Roman numeral '2' actually
| has its own Roman numeral glyph (I suspect the latter is
| correct). There are many more instances like this too: the dash,
| minus sign, and en and em dashes, for instance.
|
| Yes, I could look them up but it's a nuisance to do so on-the-fly
| and that's the whole point/trouble.
|
| It seems to me we need much better proofing tools that would flag
| errors or potential errors. I reckon we've been very poorly
| served in this regard in that there are no simple software tools
| available of the quality we need.
|
| It's not only symbols or characters we need to correct but also
| typos that don't show up on spelling checkers such as _for_ and
| _fro_ and the big troublemakers _it's_ and _its._ Spelling and
| grammar checkers should automatically highlight or flag such
| words whether their usage is correct or not.
| jonstewart wrote:
| Confusables are used by attackers to make malicious domains and
| URLs appear innocuous. If you show such things to users, it's
| good to highlight confusables.
___________________________________________________________________
(page generated 2022-08-20 23:00 UTC)
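
A rough illustration of the idea raised in the thread by jancsika (flag any character outside an allowable set of ranges) might look like the Python sketch below. It is only a sketch under stated assumptions: the names `ALLOWED_RANGES`, `is_allowed` and `flag_suspicious` are hypothetical, the chosen ranges are arbitrary, and it does not consult the Unicode confusables data at all. As II2II notes, a range check like this is of limited use for text that legitimately mixes scripts.

```python
import unicodedata

# Hypothetical allowlist of inclusive code point ranges (an assumption
# made for this example, not a recommendation).
ALLOWED_RANGES = [
    (0x0009, 0x0009),  # horizontal tab
    (0x0020, 0x007E),  # printable ASCII
    (0x00A0, 0x00FF),  # Latin-1 Supplement
]


def is_allowed(ch, ranges=ALLOWED_RANGES):
    """Return True if the character falls inside any allowed range."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges)


def flag_suspicious(text, ranges=ALLOWED_RANGES):
    """Return (line, column, char, name) for each character outside the ranges.

    Lines and columns are 1-based. Characters without a Unicode name
    (for example control characters) are reported as "UNNAMED".
    """
    findings = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if not is_allowed(ch, ranges):
                name = unicodedata.name(ch, "UNNAMED")
                findings.append((line_no, col, ch, name))
    return findings


if __name__ == "__main__":
    # The second letter is U+0430, which renders much like a Latin "a".
    sample = "p\u0430ypal.com"
    for line_no, col, ch, name in flag_suspicious(sample):
        print(f"{line_no}:{col} U+{ord(ch):04X} {name}")
    # prints: 1:2 U+0430 CYRILLIC SMALL LETTER A
```

Reporting the Unicode name next to the position is a stand-in for the "render it big" suggestion: the reader sees at a glance that the character is not the Latin letter it resembles. The same check reports the Roman numeral character mentioned by hilbert42 (U+2161) as ROMAN NUMERAL TWO.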