[HN Gopher] I couldn't debug the code because of my name ___________________________________________________________________ I couldn't debug the code because of my name Author : mikasjp Score : 158 points Date : 2021-10-18 08:40 UTC (2 days ago) (HTM) web link (mikolaj-kaminski.com) (TXT) w3m dump (mikolaj-kaminski.com) | m_kos wrote: | Isn't it bizarre that we have self-driving cars, the ISS, and | phones with 50 megapixel cameras but still struggle with | character encoding? | tetha wrote: | Character encoding is in a special class of problems. Like time | handling. | | If you pick up a halfway non-ancient framework in a somewhat | common language with a somewhat non-terrible persistence like | postgres, you just don't have problems. Just don't care, and it | just works. | | But it's super easy to derail that fragile correctness with | something like MySQLs utf8-ish handling, or some OS's path | handling, or 'efficiency', or a user or frontend dev submitting | data in a wrong encoding. And then it gets mangled. And then | the user is unhappy. | | At that point, it becomes very hard to argue why one of the two | things is wrong, and the other is not. While the user argues | the other way around. Because both look correct, if you look | from the right angle. And the only reason why I am right is | because of some standard, while the customer is right because | of money. | | And yes, it is very 'surprising' why our software now functions | correctly for russian or greek customers. | darkhorn wrote: | I think it is a Java related issue. Relevant issue occurs in | Jaspersoft Report. You cannot install Jaspersoft Report on | Turkish Windows no matter what. | dmingod666 wrote: | The domain name to the website is all ascii.. | zamalek wrote: | If you use a Microsoft account to set up windows then you have | no control over the local username. | dmingod666 wrote: | That sucks.. always hated the idea of an online account to | access your local system.. 
| moonchrome wrote: | This is exactly why I don't do that initially - I don't mind | my account being linked - but I've been bitten by the home | path bugs multiple times, so I unplug my PC during setup. | numpad0 wrote: | Oh, is it not common knowledge that you should not use UTF-8 in | a Windows username? That has been the case since the 95 days. Only | recently has it supposedly improved, after Microsoft Account login | became semi-mandatory. | progval wrote: | On the contrary, the first bug happens because docker-compose | tries to decode the path as UTF-8, but it is not UTF-8-encoded. | ("'utf-8' codec can't decode byte") | chris_overseas wrote: | I don't think this bug has anything to do with Windows; rather, | it is due to the way the paths are handled in the IDE's | codebase. Presumably the same problem exists when using these | IDEs in conjunction with a path containing non-ASCII characters | in the Linux or macOS world. | numpad0 wrote: | Isn't it some compilation option issue in the native part? I | thought it's a line in the .sln or an include library in a C++ | source or something that has to be explicitly specified when | building a Win32 binary. | GoblinSlayer wrote: | IntelliJ has a native part? | Fordec wrote: | A lot of adults today weren't even alive in 95. Also, the | assumption that people are familiar with Windows vs other | operating systems is becoming less and less valid. And as the | world gets more globalised and remote, it can no longer be | assumed that all technical people are of an Anglo-American | culture. | david422 wrote: | There's also this article: falsehoods-programmers-believe-about- | names: | | https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-... | | Certainly informative if you haven't seen it before. | | My takeaway from it was: design your system to | accommodate as much as possible, but it is basically | impossible to accommodate every case, so aim for your target | audience. 
| ygra wrote: | One way of working around such issues is to use subst. That way | the application thinks your project directory is actually located | on P:\ or something like that. | rcxdude wrote: | Sadly there is still software which fails to build or even | fails to run when there is a space in a filename (as is super | common in Windows file paths, as well as autogenerated CI build | folders). It's ridiculous to no end that software cannot handle | paths correctly. | tazjin wrote: | The number of random encoding problems that still exist is | bizarre. I recently left a UK job after already leaving the | country more than a year ago, and in their attempt to mail a P45 | form to my new address (in Moscow) the only bits that survived | were the string "c/o" and the postal code. | tediousdemise wrote: | The solution to this is extremely simple: don't validate | usernames, period. | | The rationale is from an article someone linked here ("Falsehoods | Programmers Believe About Names"): | | > Anything someone tells you is their name is--by definition--an | appropriate identifier for them. | | If you try to validate by checking for profanity, knowing full | well that people can have names that contain profane substrings, | I have a tongue-in-cheek message for you-- _you are a fucking | asshole_. | xlii wrote: | A problem very similar to the one described started my exodus from | Google services. | | I also have non-Latin characters in my name; however, I knew it was | always an issue, so I never used them in paths etc. | | At some point, a long time ago, I was tasked with doing some maintenance | on a Google Cloud service (can't remember the name of the service | now) which was doable only through a Python CLI utility, and it | failed with a very similar Python error. | | What I found out rather quickly is that the utility took my name from | my Google+ profile, which did include those non-Latin characters. 
No | biggie - I thought, and fired off an e-mail to support (yeah, in those | times it was still that easy). A few hours passed and I received | word that this wouldn't be fixed anytime soon and the best | course of action would be to change my name. | | Of course, the support person probably meant that I should remove the | diacritics from my Google+ profile, but it still left an | unpleasant aftertaste for years to come. | nullspace wrote: | > the best course of action would be to change my name | | As someone who has been told this, for other reasons, I | empathize. My reaction has always been - "Your system can't | even handle names, you need to fix it". | | Edit: I wish there was a library / service that helped you | handle all sorts of edge cases in names, so that you don't | have to worry about it. Just use a user-id, and set / get a | name from a lib / service that can actually handle it. | dymk wrote: | Has that reaction ever resulted in the other party fixing | their system in a timely manner? | mjevans wrote: | This is exactly why I hate the way Python3 handles Unicode. | | EVERY language should _try_ to handle Unicode such that if a | data sequence were valid before it remains valid after. NONE | should ever FORCE validation, since sometimes, like in the | article's case, the correct answer is GIGO. Just pass it | through and hope it continues to work. Sometimes the error is | trying to enforce that validation. | geofft wrote: | Python 3 usually handles this correctly, and I'm a little bit | confused what's going on in the article, exactly. | | For UNIX path names (and other OS data like environment | variables), Python uses the "surrogateescape" error handling | method, which does exactly what you ask. Any byte sequence | can be converted to a string. If it decodes as valid UTF-8, | it will do that. If it hits a byte that does not decode as | valid UTF-8 (necessarily a byte >= 128), it will map it to | code points U+DC80 through U+DCFF. 
These are in a reserved | range of code points ("surrogates", which make it possible | to represent code points > 0xFFFF in UTF-16), and they can't | show up in actual Unicode text (i.e., there is no UTF-8 | encoding of them, strictly speaking, and if you applied the | UTF-8 encoding algorithm to a code point in the U+D800 to | U+DFFF range, you would get bytes that aren't valid UTF-8). | | On the way out, this is reversed. So you get the results you | expect if your filenames are in UTF-8, but since UNIX has no | requirement that filenames are indeed UTF-8 (the only | constraint is they can't contain NUL or ASCII-forward-slash), | the bytes are preserved in a funky-looking format in Python | and you get the exact same output on the other end. | | See https://www.python.org/dev/peps/pep-0383/ for more on | what's going on. The tl;dr for users of Python is that if you | want to interact with, say, subprocess output as mostly- | normal strings (instead of bytes) but you want to be robust | to non-UTF-8 bytes, you should do something like | subprocess.check_output(["some", "command"], | errors="surrogateescape") | | You don't need to do this for APIs that directly interact | with pathnames, because they do it already. You just need to | do it for things like subprocess output and file contents | that Python doesn't know you want to handle in this way. | | ... | | On Windows, however, path names must be valid Unicode and are | stored in UTF-16. So the idea of a "ł" that doesn't decode | properly shouldn't even happen! Mikołaj's home directory | ought to be a very boring (and valid) 004d 0069 006b 006f | 0142 0061 006a on disk. 
| | Windows doesn't enforce that file paths are _valid_ UTF-16 | though (specifically, the surrogate code points are only | supposed to show up in a certain way, but nothing enforces | that and you can have random surrogates on disk), and hence | Rust, which internally represents all strings in UTF-8, has a | solution ("WTF-8") that's basically the inverse of | surrogateescape - it uses extrapolated-UTF-8-encoding-of- | surrogates to handle unpaired surrogates. | http://simonsapin.github.io/wtf-8/ But it seems very odd to | me that the directory C:\Users\Mikołaj would actually contain | any of those, and if it doesn't, I would expect it to very | easily turn into a Python Unicode string. | | Maybe this is from a Python version before | https://www.python.org/dev/peps/pep-0529/ , which is claimed | to "fail to round-trip characters outside of the user's | active code page"? Maybe this is from a Python version | _after_ that change and it's wrong? | nightpool wrote: | The incorrect docker-compose file was _generated_ by Java | (JetBrains) but _consumed_ by Python (docker-compose). The | GP comment was complaining about Python's strict Unicode | consumption, not Java's invalid Unicode generation. | nightpool wrote: | How is this Python's fault? It's not like the `docker- | compose` file would have worked any better if it silently | replaced one of the volumes with an inaccessible file. | Instead, you'd just get a failure from the Windows filesystem | API when you tried to access or create a file at "C:\\\Users\ | \\Mikoaj\\\AppData\\\Local\\\JetBrains\\\Rider2021.2\\\log\\\ | DebuggerWorker\\\\\", right? | sschueller wrote: | Many years ago I could not access the Apple developer panel | because of the umlaut in my last name. It was eventually fixed | but I was quite surprised that such a large company would run | into such a basic issue. 
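The surrogateescape behaviour geofft describes in the comment above can be tried out directly. What follows is a minimal sketch, not a reconstruction of what docker-compose or Rider actually did: the input bytes are a made-up example (the name encoded in Windows-1250, which is an assumption and not something the article states), and only the standard `str.encode` / `bytes.decode` calls are used.

```python
# Hypothetical input: "Mikołaj" encoded in Windows-1250 (an assumption for
# illustration), where ł is the single byte 0xB3.
raw = "Mikołaj".encode("cp1250")

# Strict UTF-8 decoding rejects the stray byte; this is the kind of
# "'utf-8' codec can't decode byte" failure quoted earlier in the thread.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# With surrogateescape, the undecodable byte 0xB3 is smuggled through as the
# lone surrogate U+DCB3 instead of raising.
escaped = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = escaped.encode("utf-8", errors="surrogateescape")

# The UTF-16 code units of the properly decoded name, grouped two bytes at a
# time, for comparison with the hex string in geofft's comment.
utf16_units = "Mikołaj".encode("utf-16-be").hex(" ", 2)

print(strict_ok, ascii(escaped), round_tripped == raw, utf16_units)
```

The escaped string is only safe to hand back to byte-oriented APIs; printing it or encoding it with the default strict handler fails, because a lone surrogate has no UTF-8 encoding.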
| rodgerd wrote: | If you look at many of the responses here it's sadly | unsurprising: small-minded provincialism or outright xenophobia | are no less common amongst programmers than the general | population. | [deleted] | devrand wrote: | My last name has an apostrophe in it which Apple apparently | loves to embed directly into their JavaScript unescaped. For a | long time neither I nor Apple could look up AppleCare status on | my stuff as they were all linked to my Apple ID. The portal | would thus require me to log in, but then would just show a | partially rendered page as my last name was causing a JS | syntax error. | nneonneo wrote: | Hmm, it sure sounds like John <script>alert(1);</script>Doe | (Bobby Tables' distant cousin) should sign up for an Apple | account. An XSS attack which could target the AppleCare reps' | machines could be catastrophically bad... | doubled112 wrote: | You'd think the apostrophe would be common enough they'd know | it could happen, but no. | | I love to enter it and see what each vendor and website's | backend does with it. | | The Staples Canada website, for example, returns it as &#39; | (HTML escaped). A couple times I've logged in, it seems to | escape a new character. I'm currently up to &amp;#39; | irrational wrote: | >such a large company would run into such a basic issue | | Every large company is just a conglomeration of smaller | departments. Each department has individual contributors. Some | individual contributor in that department wrote the code and if | nobody else in their department caught it, nobody else at the | large company would have caught it since they have their own | work to consider and don't have time to look at other people's | stuff. | lostgame wrote: | I think what OP means is that a company so large should have | the resources to test such edge cases. 
| supernes wrote: | It's somewhat common to see videogames issue a patch shortly | after release where they fix crashes due to non-ASCII Windows | usernames or non-English locales. I'm not sure what the root | cause of the confusion is, other than text strings being hard in | general. | GoblinSlayer wrote: | It's text encoding confusion: | https://en.wikipedia.org/wiki/Mojibake | jerf wrote: | It's easy to think the answer is "just UTF-8 everything" but | unfortunately the long and twisty history of filesystems means | that's not the correct answer, and the "correct answer" is | really hard to write down quickly. | | If you never display the filename, the answer is to treat | existing filenames as bags of bytes, but that breaks down as | soon as you need to display them, or if you need to manipulate | them by appending unicode to them, in which case you have to | decide on an encoding. | | Unicode encodings tend to mangle non-Unicode values because | they're specified to replace whatever they can't understand | with a particular Unicode character, usually represented as a | diamond with an inverted ? inside of it. | | There's some obscure solutions to this problem, like | https://simonsapin.github.io/wtf-8/ (which includes discussion | of the 16 bit encodings you need for Windows), but I haven't | seen broad support for them. You need a deliberately | "noncompliant" encoding/decoding system that doesn't replace | unknown characters with replacement characters. Fortunately, | compliant systems are becoming more and more popular and | available. Unfortunately, that can make file name handling | _harder_ than when you had a non-Unicode-compliant handling | system for your strings. | nyanpasu64 wrote: | Rust uses WTF-8 on Windows for OsStr[ing] and Path[Buf]. It's | zero-overhead to cast from &str to &OsStr/&Path to &[u8] | (though converting WTF-8 to UTF-16 costs an extra operation | when performing a Win32 function call). 
However this doesn't | solve the inability to round-trip "possibly-valid UTF-8/16" | to "Unicode text" and back (though Python's surrogateescape | might be one viable approach). | | Other libraries handle this even worse than Rust. On Linux | (filenames are bytes), Qt is unable to open files with | invalid UTF-8 names, while GTK can open them (but shows an | "invalid encoding" message instead of the original filename), | which I think is a good-enough approach. | garaetjjte wrote: | Part of the problem is legacy Windows cruft. For a long time, to | properly handle Unicode characters you needed to explicitly use | the wide-char UTF-16 functions. The legacy narrow encoding is a system-wide | setting and couldn't be set to UTF-8, so only a subset of | characters would be represented correctly. Only recently did they | introduce the ability to set the narrow encoding for an application to | UTF-8 with setlocale, which is a lot saner. | mkotowski wrote: | In the case of home-grown code, it could simply be a question | of programmer awareness. There are still many outdated and/or | unfinished tutorials that use WinAPI without any concern about | enabling Unicode and wide-char support. | | If we are talking about ready game engines like Unity and | Unreal... it is probably a naive assumption about input being 1 | byte wide and things getting lost because of that in some | gamedev-made script. | jan_Inkepa wrote: | I've been bitten on a few small releases by forgetting that C# | localises number->string conversion by default (which makes | sense. But if you forget, and you're writing floats to csv | files and the decimal points become decimal commas....). | breakingcups wrote: | It's also a common thing that Silent (aka CookiePLMonster) | fixes in the games he patches. | | See for example: - | https://cookieplmonster.github.io/2020/05/23/silentpatch-maf... | - https://cookieplmonster.github.io/2021/02/27/silentpatch- | yak... 
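Two of the failure modes named in this subthread, mojibake and a legacy narrow code page that can only represent a subset of characters, can be reproduced in a few lines of Python. This is a minimal sketch: cp1252 is picked purely as an example of a narrow code page that has no ł, and nothing in the thread says which code page was actually involved in the article's bug.

```python
name = "Mikołaj"

# UTF-8 stores ł as two bytes (C5 82).
utf8_bytes = name.encode("utf-8")

# Mojibake: the UTF-8 bytes are read back as if they were cp1252 text, so the
# two bytes of ł show up as two unrelated characters.
mojibake = utf8_bytes.decode("cp1252")

# As long as nothing else touched the text, the mix-up can be undone by
# reversing the two steps.
repaired = mojibake.encode("cp1252").decode("utf-8")

# Narrow code page: cp1252 has no ł at all, so the character is either
# replaced with "?" or silently dropped, depending on the error handler.
replaced = name.encode("cp1252", errors="replace")
dropped = name.encode("cp1252", errors="ignore").decode("cp1252")

print(mojibake, repaired, replaced, dropped)
```

Unlike the mojibake case, the last two conversions are not reversible: once the byte for ł is gone, no later decoding step can bring it back.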
| amarshall wrote: | For a list of strings that often cause problems to, e.g., add to | a test suite, see https://github.com/minimaxir/big-list-of- | naughty-strings | tomaslaureano wrote: | Great resource! I usually use pangrams (holoalphabetic | sentences like "The quick brown fox jumps over the lazy dog") | to ensure that my code can handle all the alphabet characters | for the languages that should be supported at the very minimum. | munk-a wrote: | It's also important to width-test fields. Never forget to make | sure that WWWWWWWWWWWW doesn't cause weird application | wrapping. | aidenn0 wrote: | I used a system where the maximum length on the "new | password" field in the change password form was longer than | the password field in the login form. | | The symptom was that I could login if I used my password | manager browser plugin, but not if I pasted it from my | password manager. | kevinmgranger wrote: | You're lucky they weren't different lengths in the backend. | I've been bitten by that surprise one too many times (which | is any number higher than zero) | aidenn0 wrote: | The most ridiculous thing is the UI for setting the | password even said "X-Y characters long, must include at | least one..." but the login page could not support Y | characters. | pferde wrote: | I have seen a windows app with a text field whose max | character count was somehow determined by system font size | - probably a crude way to make sure the entered text fits | the hard-coded field size. | | The problem was that this field was used to enter a | 10-digit code, and as it turns out, on default Windows10 | system, the fonts are set up so that this field only fit 8 | of them. Oops! :) | munk-a wrote: | I'd like to see how that App would work with me sitting | here fonts cranked up to 175%. I've never heard of a | setup like that though - it sounds like it'd be | surprisingly intricate to actually configure. 
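The advice in this subthread (keep awkward strings in the test suite) can be applied to the path-handling bug from the article. The sketch below is one possible shape for such a check, under stated assumptions: `SAMPLES` is a tiny hand-picked list rather than the big-list-of-naughty-strings itself, `round_trip_failures` is a hypothetical helper name, and the check only exercises Python's own file APIs, not any IDE or docker-compose.

```python
import tempfile
from pathlib import Path

# Hand-picked sample names (an assumption for illustration): non-ASCII
# letters, a space, an apostrophe, and a wide run of W for layout checks.
SAMPLES = ["Mikołaj", "Łódź", "two words", "O'Brien", "WWWWWWWWWWWW"]


def round_trip_failures(names):
    """Create a directory per name, write a file in it, read it back.

    Returns the names for which any step failed or the content differed.
    """
    failures = []
    with tempfile.TemporaryDirectory() as tmp:
        for name in names:
            path = Path(tmp) / name / "note.txt"
            try:
                path.parent.mkdir()
                path.write_text(name, encoding="utf-8")
                if path.read_text(encoding="utf-8") != name:
                    failures.append(name)
            except (OSError, UnicodeError):
                failures.append(name)
    return failures


failures = round_trip_failures(SAMPLES)
print(failures)
```

On a system with a UTF-8 filesystem encoding the list of failures should be empty; the value of the check lies in running the same names through whatever tool chain sits on top of the paths.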
| munk-a wrote: | I maintained a system where we had unbounded password | length... but only respected the first six characters of | the password. (We did fix that.) | amarshall wrote: | Related (we do this at my work): | https://en.wikipedia.org/wiki/Pseudolocalization | OskarS wrote: | An enormously useful list, I've used it several times, and it | can often dig up some real nastiness if you haven't been super | careful. | | This entry, by the way, is a fantastic little easter egg in the | list: https://github.com/minimaxir/big-list-of-naughty- | strings/blo... | vertis wrote: | No, seriously, wake up | [deleted] | ryanianian wrote: | Very handy. My previous simple test-case was simply a selection | from this well-known text-file which is simply a collection of | somewhat uncommon unicode characters, usually used for | rendering tests. | | https://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt | | But this set of strings is specifically designed to cause edge- | case errors. | | Also don't forget Spolsky's seminal "The Absolute Minimum Every | Software Developer Absolutely, Positively Must Know About | Unicode and Character Sets (No Excuses!)". | | https://www.joelonsoftware.com/2003/10/08/the-absolute-minim... | spicybright wrote: | So frustrating how this still happens. It's too Latin-centric. | mikasjp wrote: | I think the whole problem is keeping the character encoding | consistent in the applications and their dependencies. | Programmers often forget this because they avoid non-ASCII | characters in their code. | mkotowski wrote: | I, too, have the letter Ł in my name, and yes, it is a sick joke | that so many things even in supposedly modern systems | assume that the world runs on ASCII. | | In the case of the Windows operating system, the worst fact is | that every single part of it behaves differently. Some parts | display the path with a wrong encoding, but handle it correctly. 
| A third-party app can display it correctly, but fails while | trying to access any file. From what I remember, even the built-in | PATH variable editor/manager goes through some arcane steps to | display the letters in a wrong way, but getting them to work | _sometimes_. | | I can only imagine how much more pain it is for someone using any | of the less widely-used writing systems or those with more | advanced features compared to ASCII (Hebrew's RTL, Arabic script's | mid- and final forms, etcetera). | gerdesj wrote: | Can Ł have an alternative representation? For example the | German ß => ss. Also I think ö can be written as oe. | | In English we simply shake the big bag of letters, pick a few | at random and then throw them at the page until a few stick. | q3k wrote: | > Can Ł have an alternative representation? | | Nope. Neither can ż, ć, ś, ą or ę. You can, and people do | write them as z, c, s, a and e when writing in a restricted | character set, but that is not 'correct' and is not a | bijection, ie. „polka" and „półka" mean two different | things. | | There's also the case of technically-same-sounding- | especially-recently ż/rz and ó/u (whose replacement would let | you get rid of two 'non standard' characters), but for | historical reasons these are not interchangeable. | gerdesj wrote: | I do find this sort of stuff fascinating and also faintly | frustrating but of course my mother tongue is (in)famous | for being a bit loose at first sight. | | According to one of my employees (Polish) Ł sounds roughly | like w as in win or water but not as in what. A quick read | of this: https://en.wikipedia.org/wiki/%C5%81 doesn't help | too much. | | Does enforcing Ł instead of say w cause your written | language to fail in some way? I don't want to cause | offense, I want to understand the causes of difference. | q3k wrote: | 'W' in Polish is already used, but for a different sound | - it's pronounced like the English 'v'. 
'V' in turn is | not present in the Polish alphabet (in the sense of it not | being present in words of Polish origin). | | If you wanna change that, you might as well change the | entire writing system of the language, eg. to be more in | line with some other, more common writing system (ie. | other latin alphabets or the cyrillic alphabet which | would probably make the most sense phonetically). But no- | one's gonna go for that any time soon. | gerdesj wrote: | "If you wanna change". | | I think we have found the disconnect: you quite happily | use a word like "wanna" which is nonsense in English. It's | allowed because it is understandable. Wanna is "want to". | | Ooh, "gonna": That'll be "going to". | | What's gonna to you is l bar for me or vice versa or | something 8) | xwdv wrote: | What's wrong with just writing it as Mikolaj? It's not like it's | a kanji or something. | sophacles wrote: | Because that's not their name? | dahfizz wrote: | Their URL is even mikolaj-kaminski.com . I get it's annoying, | but I would never use non-ASCII chars in a username / file | path. | jerf wrote: | So, what does A Bu Ming Ren do in this case? | | Polish may be close enough that an approximation is available | in English, but there's an awful lot of languages that don't | have a large overlap with English characters. | | In the Asian case above, if someone with that name did try to | "convert to English" they are ironically just as likely to | end up with Akihito Abe as the ASCII, which will be just as | broken! | numpad0 wrote: | Assuming that hypothetical guy is an average Japanese | male (somewhat leaning right), he'd just turn IME off. 
| Japanese input on the desktop consists of the following three | states: | | - IME On state. The IME captures and interprets keypresses as | engraved and generates the corresponding Kana-Kanji text. | | - IME Off state. The IME passes through keypresses as engraved | on the keytops. | | - Direct Input state. The IME becomes dormant. | | In the IME Off state, the keyboard behaves as a plain jp106 (or | ANSI if it is) keyboard, like I'm doing right now. The | cases where you would use conversion with IME on for an | English word are when you have reasons for the word to be in | "full width" (usually for typesetting reasons). | jerf wrote: | I don't think it's something that people should 'just | know' that when Windows asks them their name during | install time, they _ought_ to use 7-bit clean ASCII for | everything, no matter where they are in the world or how | much they know about other languages. When Windows says | "What is your name?", they ought to be able to _use_ | their name without things breaking. | | I'm sure a computer savvy speaker of a fully-non-Latin | language may still guess this is a good idea, but | "computer savvy" doesn't cover everyone... and they | shouldn't _have_ to. | | "Just use 7-bit-clean ASCII English" is not a solution to | this problem. | dahfizz wrote: | They could use a different name as their windows name (Do | people use their real names as their usernames? I never | do). Or, they would have to go through the pain of finding | a real solution, like the author did. | | Considering JetBrains seems unwilling to fix this bug, | maybe the best solution of all is to switch to an IDE that | works. | Kye wrote: | "You're holding it wrong" | | The problem is the technology, not the user using it in a | reasonable way. ł is older than computers and the only reason | computers struggle with it is lack of foresight or choosing | to make things harder for most of the world by some of the | people involved early on. | dahfizz wrote: | Obviously the IDE is at fault here. 
Rider has a bug with | Unicode. | | BUT, there is an easy workaround to avoid all Unicode | related bugs: don't use Unicode. If that's morally | objectionable for you, then you can keep fighting this | fight. | bivargen wrote: | Avoiding unicode, or anything but 7-bit ASCII, is like | chiseling text into a stone instead of using pen and | paper because the pen might break. Fix the pen! Or | replace it with a computer (and we're back full circle)! | | Avoiding it is not morally objectionable, it's just | stupid. | tremon wrote: | I think it's reasonable to find that morally | objectionable: English is the only language* that can be | fully represented in ASCII, so pretending that ASCII is | all you need excludes a large part of the world. | | * yes, by and large. Many languages make do, but even the | European languages that use the same script as English | cannot be fully represented: | | - Pretty much all mainland European languages use accents | (simple example, in Spanish el and él are different | words) | | - French misses ç | | - German/Swiss/Austrian misses ß | | - Spanish misses ñ | | - Dutch misses ĳ | InitialLastName wrote: | It's naive of you to maintain the facade that English can | be fully represented in ASCII. We've just had longer than | other languages to adapt to that particular encoding | technology, and the good luck to have a code set built to | represent our language become the lingua franca of | computer technology. | Symbiote wrote: | Not even Britain and Ireland can manage with ASCII: they | need £ and €. | | I agree with you, and disagree strongly with dahfizz, who | is essentially telling people their name and language are | unacceptable. | Muromec wrote: | Cyrillic-writing countries miss all of their alphabets | and so does Greek. | ludamad wrote: | For the record, it's a stark pronunciation difference as ł has | drifted to a very different "w" sound | MadeThisToReply wrote: | Yep. 
For example, the name of the third-largest city in | Poland is "Łódź", which might look like it's pronounced | "lods", but is actually pronounced more like "wootch". | garaetjjte wrote: | Sometimes you end up with a parcel addressed to the city "??d?". | Shipping systems cannot cope with non-ASCII chars more | often than I would expect... | greenshackle2 wrote: | I've seen shipping labels with HTML encoded characters, | like &eacute; and &egrave;. I'm not sure if that's better | or worse: | | &#321;&oacute;d&#378; | ssivark wrote: | That's about as aggravating as asking Ryan to change name to | Pyan -- because the encoding doesn't support "R" and "P" looks | very similar. | no_time wrote: | Because it's not his name. Imagine you are John but you had to | make do with Yohn because the people designing your software | didn't need the letter J... | kmlx wrote: | it was 30 years ago when i discovered that it doesn't really | matter what my name is. the system i'm interacting with | expects my name to be "john" or something like that. so i let | it be. | | 30 years later and i completely dropped all non-latin chars | from my name in any and all forms. from airplane tickets to | passport to you name it. | | and you know what? no one cared about non-latin. not even the | government. i loled when i actually realised. | | i've encountered zero issues ever since. | | and it's been the same for lots of my friends. they just | adopted some western name. case closed, no more issues. | | it all depends on how much importance you attribute to your | name. for me it's always been a random variable. for others | it's a matter of pride. but to the "system" it will be a | "random list of chars", sometimes latin, other times utf. | zanderwohl wrote: | It's not strange to localize your name. In ASL for example, | you could sign your English name letter-by-letter, but it's | much more common to have a totally new sign for your name - | usually a word combined with the first letter of your name. 
| Taking part in a different system often means taking on | another name. | q3k wrote: | It seems that you're implying computers are universally | american and therefore people are expected to | speak/use/adapt to american. | thereddaikon wrote: | That's the harsh way to put it. A more diplomatic way is | that computing is not unique in having deeply ingrained | artifacts of the language and culture that birthed it and | developed many of the paradigms. | | Take anything having to do with seamanship. There are | many terms that date back to early modern English that | simply don't make sense anymore yet are accepted and | universal because the British Empire had a large and | enduring influence on maritime matters and happened to be | at the forefront of most modern developments until about | 70 years ago. | | In some cases this is actually built into laws and | industry practice. Pilots speak English. That's the | rules. Don't like it? Invent the time machine and beat | Wilbur and Orville. For much the same reason, science | speaks Latin. | | This technical debt is difficult if not impossible to | overcome, especially in regards to computers because we | still haven't cracked general purpose AI. Software will | only accommodate what it was written to accommodate. | | Recognizing the problem and working to fix it is all well | and good. But its wise to understand that this wont be | solved any time soon so in the meantime it is pragmatic | to operate in such a way to maximize compatibility. | | After all, I still have to call it a Foc'sle even if I | think that's dumb or isn't inclusive of my culture. | xxpor wrote: | There's also the practical consideration that English, | due to having a) an alphabet b) letter shapes that aren't | affected by surrounding letters and c) no diacritics, is | the easiest major language to store and display on a | computer. 
Even if Silicon Valley ended up in a country | with a logographic writing system, I'd bet that the first | character set that would have been used would have been | Latin based | [deleted] | AdrianB1 wrote: | My name contains non-Latin characters (apparently strange as | we use a Latin language), but in 40 years of working with | computers I learned to avoid using the original form and | always convert to ASCII; yes, it is not my name, but my pride | and sense of entitlement are not hurt at all. | | Sometimes it is better to avoid being hit by the bus even if | you are right. | wbsss4412 wrote: | So the solution is for the user to change their entire Windows | account name, rather than handling common characters in your | code? | toast0 wrote: | For a user, changing their account (probably creating a new | user, since rename apparently doesn't change the directory), | is something they can do. | | Changing all software to respect their perfectly valid name | isn't something they can do. | | They shouldn't need to change their name, but if they do, | they can ignore all the broken software and go about their | day. | | This particular user is more capable than most, and found a | workaround for this particular problem, which is good... But | this is not likely to be the last of the problems. | dahfizz wrote: | Of course it would be better if all code was bug free. But | that's impossible. As a user, avoiding Unicode is a pretty | easy way to avoid bugs like this - it's the rational thing to | do. | Jensson wrote: | When you have non-standard characters in your name you | quickly learn to never use them in computers since even | though most systems work fine, some don't. And you can't fix | all the thousands of systems your name has to interact with. | | I even had trouble booking flight tickets since their | security system couldn't parse my name, and then had to go | through some special security check due to it returning | errors. After that, never again.
Not sure how they managed to | do it but they had some basic rules that they used to say "no | real name can look like this, this is a fake person!" and | just kicked it out. | wbsss4412 wrote: | I totally understand what you're saying, but it's also a | sad state of affairs when we can't handle "non standard | characters". | | Standard characters (i.e. English) are only used by a small | subset (maybe 5-10%) of the global population. | yuliyp wrote: | They're not non-standard characters. They're just as much a | part of the Polish alphabet as 'a' and 'b' are. | Jensson wrote: | That is exactly what I meant. My name doesn't have non- | standard characters either from the perspective of my | home country, it is just normal letters in the alphabet, | but not in the English alphabet. | q3k wrote: | > When you have non-standard characters in your name | | 'standard' by what measure? Ł is more standard than X or Q | in the Polish alphabet. | | ~ Sincerely, a person whose name contains „ń” and | therefore had to deal with this bullshit his entire life. | Jensson wrote: | From a programmer's perspective. The characters in my name | are standard where I come from, but they are not standard | to the international air travel security systems likely | developed by Americans. | | Edit: You know how aircraft travel security always | transforms your name into letters from the English | alphabet to parse? Yeah, it transformed my name and then | the resulting string looked so bad that the system | rejected that. The original name doesn't look bad, but | after transformations it did... | miloignis wrote: | From the article: | | The first idea was to change the username to one that does not | contain Polish characters. It turned out that Windows does not | rename the user's folder when changing the username. Manually | renaming the folder was not an option. This way I could corrupt | my profile in the system.
| | The end of the article is about how to change the directory | where the temporary files go to one not under the user folder. | jasonpeacock wrote: | And yet it's one of the simplest things to add non-ASCII chars to | your tests to validate their handling. | | It's like not testing if your calculator application can handle | negative numbers or decimals. | nradov wrote: | In fact it's trivial to generate a text file of all valid | Unicode code points and use that as input to unit tests. | yakubin wrote: | It may be faster to generate them on the fly. Iterating over | ranges of integers is a lot faster than reading files from | disk. | Someone wrote: | I would have to do research on whether the list of valid code | points depends on the Unicode version. For example, can | regional indicator code points | (https://en.wikipedia.org/wiki/Regional_indicator_symbol) | appear in isolation? If not, is that different in Unicode < | 6, where those code points weren't assigned yet? | | Similarly, what about tags | (https://en.wikipedia.org/wiki/Tags_(Unicode_block) )? Do | these _require_ a U+E007F CANCEL TAG? | | The 66 noncharacters certainly need consideration. | http://www.unicode.org/faq/private_use.html says: | | _"Because of this complicated history and confusing changes | of wording in the standard over the years regarding what are | now known as noncharacters, there is still considerable | disagreement about their use and whether they should be | considered "illegal" or "invalid" in various contexts"_ | | Edit: also, testing all code points likely is overkill and | using code points in isolation likely isn't enough. Most | tests are better off with something like the big list of | naughty strings (https://github.com/minimaxir/big-list-of-naughty-strings) | mrweasel wrote: | It's a pretty good test case.
Similarly we found a number of bugs | in a Django application and path handling, because I happened to | be using Windows for six months, while the rest of the team was | on Linux and Mac. | umvi wrote: | Using non-ASCII characters in file paths, toolchain config files, | and other non-display contexts is just asking for trouble, even | if it is your name... | fluxem wrote: | Also spaces. I spent half an hour debugging why a CMake CUDA | build was failing. | munk-a wrote: | A lack of support for spaces at this point is unacceptable. | I, personally, despise spaces in paths but on Windows a whole | bunch of default system paths already have spaces embedded in | them in major ways... and let's not forget parens as well - | thanks "Program Files (x86)" | bbarnett wrote: | This wouldn't have happened if using Rust! | burnished wrote: | Some of the other attempts are a little subtle, this one is a | pretty blatant attempt to rile up the folks that are already | angry about Rust for whatever reason. Please stop. | nightfly wrote: | Can you knock it off??? This is even more annoying than out- | of-place Rust evangelism | jasonpeacock wrote: | This is the modern, post-ASCII computing world, we should no | longer be willing to settle for the lowest-common-denominator | of ASCII-only strings. | | There's no excuse for actively supported, _paid_ products to | have these problems today. | amenod wrote: | True. But these actively supported, paid products build upon | layers and layers of no-longer-supported, free/opensource | products. Good luck fixing them. | | Not saying that this is OK, just explaining why using non- | ASCII characters, in this day and age, is still asking for | trouble. | SAI_Peregrinus wrote: | This is on the Windows version. | | Windows 2000 is when the OS changed to UTF-16 by default. | Before that Windows NT was UCS-2; IIRC only the DOS-based | Windows versions were Windows-1252 internally, starting | from Windows 1.0.
So while ł wasn't supported in Windows 1, | characters like ñ were. Windows has literally NEVER been an | ASCII-based OS. | horsawlarway wrote: | Sure, but having used a lot of the Windows system APIs | (admittedly - a lot of years ago) it was a complete | hodgepodge of which API would take a char vs a wchar, and | then they tried to hide the whole thing behind tchar, | which just made it even harder to keep track of. | | Basically - I agree: This shouldn't be a problem, and 7 | months is a long time to wait for a basic fix. But there | are a lot of footguns hanging around in Windows code with | respect to character encodings. | | Just looking at the first result on Google for "c++ get | windows home directory" shows this: | https://docs.microsoft.com/en-us/windows/win32/api/userenv/n... | | Which takes a long pointer to tchar string (LPTSTR) - so | this behavior is dependent on the Unicode settings of the | project at compile time, even today. | david_allison wrote: | > Windows 2000 is when the OS changed to UTF-16 by | default. | | Paths are UTF-16 + unpaired surrogates, so a Windows path | isn't legally representable in UTF-8. | ainar-g wrote: | _Especially_ if those products are developed by a company | from Russia, where Cyrillic is used. For me, a Russian | myself, this situation is honestly ridiculous. | zczc wrote: | Russian companies generally have ASCII-only username | policies | mbesto wrote: | Do you write "if" statements in Cyrillic when you write in | <insert Python/Ruby/Java/.NET/whatever>? | pavel_lishin wrote: | It would be very amusing to see "esli" in an if | statement, given how much it looks and sounds like "else" | at a brief glance. | GoblinSlayer wrote: | I thought ArnoldC was just a couple of #define's, but | looks like it isn't. | nine_k wrote: | No. Keywords are ASCII everywhere (no, APL's are not | words). Mixing English in keywords and non-English in | identifiers feels odd.
| | Algol-68 supported localized sets of keywords; | fortunately this language is gone. | | You can #define non-ASCII stuff in modern C++. It's your | best chance to "localize" a mainstream language. | | Same would work for Clojure, but Lisp uses a lot of | quirky abbreviations like `cdr` or `setq` that give | awkward translations. | gumby wrote: | This is blaming the victim | BiteCode_dev wrote: | Unfortunately, it's true, most toolchains are stuck in the | past, and don't deal with non-ASCII characters or even spaces | very well. In fact, I just learned that spaces in .desktop file | values could cause trouble after a long debugging session. | | But it's a shame. | | In Europe, we do have a lot of non-ASCII characters everywhere. | Ubuntu puts a "Vidéo" and a "Téléchargements" directory in my | $HOME because I'm French. If I were to use my name as my | username I would have even more troubles. | | I'm careful with not using special chars in names for work, but | it feels like I'm a girl trying to not dress sexy in the wrong | part of town: necessary, but I shouldn't have to do this, and | it's definitely the others to blame. | | All in all, I thank the Gods of encoding for Python 3 Unicode | handling. Having a scripting language that does the right thing | out of the box is wonderful on this side of the pond. | mjevans wrote: | "The right thing" for filesystem entries is to transparently | copy, do not evaluate. A file path is a mem-copied, length | value sized block of identifier you don't ever mangle. If you | must mangle it, touch only the necessary areas as directed. | (E.G. join with os.sep and do not normalize anything). | | Want to offer Unicode validation? Sure having that as an | OPTION is fine. Forcing it means I can't rely on that tool to | handle real world data which happens to not be valid but is | still a valid file-system address. | GoblinSlayer wrote: | No seriously, create a user d'Artagnan.
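Two encoding points from this thread can be shown together: the error progval quoted ("'utf-8' codec can't decode byte") and mjevans's advice to copy a path rather than evaluate it. The sketch below is only an illustration, not the code of docker-compose or of any IDE discussed here. It assumes a made-up Windows path stored in the cp1250 code page, and the helper names (`decode_strict`, `decode_lossless`, `encode_lossless`, `strict_decode_fails`) are hypothetical. It shows strict UTF-8 decoding failing on such bytes, while Python's `surrogateescape` error handler carries the same bytes through a `str` and back unchanged.

```python
# Hypothetical raw path bytes, as a legacy Central European code page (cp1250)
# would store them. The user name is illustrative only. "\u0142" is the letter
# l-with-stroke, which cp1250 encodes as the single byte 0xB3.
RAW_PATH = "C:\\Users\\Miko\u0142aj\\AppData".encode("cp1250")


def decode_strict(raw: bytes) -> str:
    """Decode as UTF-8 and raise on any byte sequence that is not valid UTF-8."""
    return raw.decode("utf-8")


def decode_lossless(raw: bytes) -> str:
    """Decode as UTF-8, mapping each undecodable byte to a lone surrogate."""
    return raw.decode("utf-8", errors="surrogateescape")


def encode_lossless(text: str) -> bytes:
    """Inverse of decode_lossless: turn the surrogates back into the raw bytes."""
    return text.encode("utf-8", errors="surrogateescape")


def strict_decode_fails(raw: bytes) -> bool:
    """Return True if strict UTF-8 decoding rejects the given bytes."""
    try:
        decode_strict(raw)
    except UnicodeDecodeError:
        return True
    return False


if __name__ == "__main__":
    # Strict decoding rejects the cp1250 byte; the lossless pair round-trips it.
    print(strict_decode_fails(RAW_PATH))
    print(encode_lossless(decode_lossless(RAW_PATH)) == RAW_PATH)
```

`surrogateescape` is the error handler that `os.fsdecode` applies on POSIX systems, so a tool that wants to pass paths through untouched can lean on it instead of insisting that every path is valid UTF-8. The resulting string is not safe to display or log as-is, which fits the point above that validation should be an option rather than a requirement.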
| simonblack wrote: | Isn't this one of those "100 things Programmers don't know about | People's Names" things? | | Like the poor, it will be with us always. | xdfgh1112 wrote: | I don't know, it's just a Unicode character? Not even a newer | one, it's just 2 UTF-8 bytes. Pretty much everything should | support that in 2021. | | When I think of 100 things I think of stuff like "some people | spell their name in all lowercase and get really funny if you | change it" | numpad0 wrote: | Yeah so double byte characters cost extra. I don't know, a | checkbox or something default off. Always did, still does. | Double width costs even more. | horsawlarway wrote: | you're getting downvoted, but between tchar hiding wchar vs | char... this literally could be someone toggling off the | "UNICODE" checkbox in Visual Studio somewhere. | hprotagonist wrote: | windows probably defaults to latin-1 | bryanrasmussen wrote: | the default windows encoding is UTF-16, a long time ago it | was Windows-1252 https://en.wikipedia.org/wiki/Windows-1252 | hprotagonist wrote: | or CP-1251, in some locations. | f311a wrote: | That's a pretty common problem, especially for Cyrillic names. | People just use ASCII names. | souptonuts wrote: | Idk changing your stupid fucking name could be a fix too | Dannymetconan wrote: | I can very much relate to this but also have very little sympathy | here. | | I have a special character in my name, an apostrophe, and it | causes trouble regularly online and with tooling. A number of | years ago I decided just to never use it when it came to anything | to do with technical work be it email, logins or usernames. | | Unicode characters are a pain to deal with and I have suffered | from it first hand trying to handle it. At the end of the day it | is much easier just to not use the special characters and move on | with your life rather than be battling the constant frustration.
| | I'm sure these tools have lots of open issues and you would be | surprised at the amount of time, effort and testing that would be | required to provide full Unicode support. Most people would see | it as a very small positive and not worth the effort. I find it | hard to disagree. | vultour wrote: | I'm really surprised someone technically minded thought it's a | good idea to put a non-ASCII character in their username. I'd | never do that. | ctdonath wrote: | I'm really surprised someone technically minded thought it's | a good idea to not allow non-ASCII alphanumerics in a | username. | | Unicode has been a thing since 1988. Names have included non | a-z characters since forever. | jltsiren wrote: | My legal last name is "Sirén". When I was younger, I almost | always used "Siren", because it was easier to type. Then, ~15 | years ago, I started noticing that American websites sometimes | rejected it, because they considered it inappropriate. | Sometimes "Sirén" would work, sometimes it worked but caused | minor annoyances, and sometimes it would not work for technical | reasons. | | Both versions work most of the time these days, but I still run | into trouble once in a while no matter which name I use. | 10000truths wrote: | Why would Siren be an inappropriate name? | lostgame wrote: | Someone who I know has the last name 'Island' and was | unable to sign up for Facebook forever because they thought | it was a fake last name. | | Maybe 'Siren' is similar. It's a pre-existing word that | perhaps flags some sort of weird edge case. | pledess wrote: | The article offers a solution of | idea.system.path=${root.dir}/JetBrains/Rider/system but doesn't | mention the C:\JetBrains directory permissions. Directory | permissions under %LOCALAPPDATA% (the location that works for | people without a Polish character) should restrict write access | to one user.
With the Windows default behavior, creating | C:\JetBrains would inherit permissions from C:\ - and wouldn't | restrict write access to one user. Maybe 99% of the time this is | irrelevant (i.e., there's no realistic threat from malicious | actors who control unprivileged user accounts on your own | development machine). Still, it's a potential downside of the | solution, and more motivation for the vendor to fix their code so | that Polish characters can be used under %LOCALAPPDATA%. | Kwpolska wrote: | If you are on a multi-user system, the path "C:\JetBrains" | isn't really ideal (what if other users also need Rider and | have non-ASCII usernames?). That said, you can easily change | file permissions on Windows if the default ones don't work for | you. | [deleted] ___________________________________________________________________ (page generated 2021-10-20 23:00 UTC)