[HN Gopher] In the Turkish locale, "INFO".lower() != "info" ___________________________________________________________________ In the Turkish locale, "INFO".lower() != "info" Author : duckerude Score : 140 points Date : 2020-08-16 13:10 UTC (9 hours ago) (HTM) web link (github.com) (TXT) w3m dump (github.com) | beeforpork wrote: | My genius idea was once to use toupper() to normalise paths on | Windows, which are case-insensitive. One day, a customer from | Azerbaijan reported that my application failed to access a file | in C:\WINDOWS\\... | tryauuum wrote: | i feel your pain | CodesInChaos wrote: | In the Danish locale "aa" doesn't start with "a". | 60secz wrote: | Stringly typed: Play stupid games, win stupid prizes. | bayindirh wrote: | Welcome to the Turkish language, where we have ı, i, I and İ. In | our language the conversion is as follows: | | - ı <-> I | | - i <-> İ | | We love our dots and preserve them. For a more detailed read, | please see: | | https://blog.codinghorror.com/whats-wrong-with-turkey/ | Natsu wrote: | As I understand it, Turkish is one of the more important | locales to test with because of things like this. | bayindirh wrote: | Turkish is the only language which has the ı & İ pair. | Similarly, AFAIK, Turkish is again the only language with ğ | and ş letters. So, by testing for Turkish, you test for a lot | of European languages at once. Moreover we share some | modified letters(ç, ü) with other Central European languages. | | If your program can pass "The Turkish Test", you pass a lot | of others too. | [deleted] | anticensor wrote: | Azerbaijani too. Moreover, Azerbaijani has an additional | letter ə, which sounds like /æ/. | therein wrote: | I love the feeling of camaraderie arising from that | partial mutual intelligibility of Turkish and | Azerbaijani. | | That connection through language goes a long way.
| | müqəddəs bacı millət :) | 1-more wrote: | Poor encoding can lead to the odd murder too: | http://gizmodo.com/382026/a-cellphones-missing-dot-kills- | two... | | > The use of "i" resulted in an SMS with a completely twisted | meaning: instead of writing the word "sıkışınca" it looked | like he wrote "sikişince." Ramazan wanted to write "You | change the topic every time you run out of arguments" (sounds | familiar enough) but what Emine read was, "You change the | topic every time they are fucking you" (sounds familiar too.) | rvnx wrote: | That doesn't explain the e instead of the a, does it? | bayindirh wrote: | In the olden times, ending words with _e_ instead of _a_ | was considered an acceptable typo. | | Also the article explicitly says "it looked _like_ he | wrote". So when you see red, that last letter can become | anything and nothing would change. | generationP wrote: | > - ı <-> I | | > - i <-> İ | | After seeing this, I don't understand how the rest of us can | fail to have the same distinction. There's something logically | beautiful -- like the rhyme in a good poem -- about artificial | languages (or, in this case, alphabets) that naturally evolved | languages just cannot compete with. | sampo wrote: | > We love our dots and preserve them. | | Turkish preserves the dots of i, ö and ü in their capital | versions, but not with j. The capital J is dotless. | bayindirh wrote: | Isn't capital J dotless in every language? | Macha wrote: | 07/04/2008 -> April 7th seems about as reasonable a result as | July 4th, especially when you've explicitly opted in to a Turkish | locale. I don't agree with the article's assertion that the | format being interpreted according to the user's locale is wrong | here, the one wrong part is a US centric programmer's expectation | that PP-QQ-YYYY is an unambiguous format.
Use YYYY-mm-dd when you | need a format that's not ambiguous | frabert wrote: | YYYY-mm-dd also plays nice with lexicographic ordering, which | is why I always use it when I need to put dates in e.g. | filenames | Macha wrote: | I'm a European working primarily with Americans. My home | country uses dd/mm/YYYY (or dd/mm for short) and the US uses | mm/dd/YYYY for with mm/dd for short. I've switched to YYYY- | mm-dd simply for my own sanity and if I omit the year I write | the month in text format, such as "5 June". | withinboredom wrote: | The US military uses the almost same convention (dd-mmm- | yyyy) so 07-aug-2020. | dgellow wrote: | That's dd-mmm-yyyy | withinboredom wrote: | Thanks! | Macha wrote: | Note: This is actually a reply to the article here: | https://news.ycombinator.com/item?id=24178270 , for some reason | I thought that was the top level link. | | Maybe if dang sees this, it could be reparented? | heavenlyblue wrote: | > PP-QQ-YYYY is an unambiguous format | | "US centric" is one way to say it | snthd wrote: | > 07/04/2008 -> March 7th | | I think you mean April. | Macha wrote: | Fixed | kentonv wrote: | Has anyone here _ever_ had a use case for toLower() where they | actually wanted localization to apply? | | It seems to me that in practice, it's extremely rare to want to | change case of real, natural-language text. When I have natural- | language text, it's just a blob to me, and I don't want to touch | it. | | The only time I ever want to lower-case or capitalize something, | I'm working with identifiers meant for computer -- not human -- | consumption. Usually, specifically, I'm dealing with identifiers | that have annoyingly been defined to be case-insensitive even | though the only humans that ever see them are programmers and | programmers hate case-insensitivity. HTTP headers are a common | example. 
| | I mostly write C++, and I end up writing code like: | for (char& c: str) { if ('A' <= c && c <= 'Z') c = c - | 'A' + 'a'; } | | Later on, some well-meaning developer on my team will come along | and say "Ugh what is this NIH syndrome?" and then they "clean it | up" as: #include <ctype.h> for | (char& c: str) { c = tolower(c); } | | And then I have to say NOOOOOOO DON'T DO THAT YOU HAVE NO IDEA | WHAT tolower() REALLY DOES! | | I struggle to imagine any real use case where you'd actually want | locale-dependent tolower() other than, maybe, a word processor -- | but if you're writing a word processor, you're probably not going | to be depending on the language's built-in string APIs to do your | text manipulation. | felixarba wrote: | I have a morse code app which consistently crashed when certain | users would try to translate letter "i", and it took me a long | time to figure out that only the turkish users would complain | about it, and when one of them sent me a screenshot I only | noticed a "wrongly" rendered capital letter i (I used toUpper) | and after digging around a bunch, I learned about this while | turkish letter i. | aflag wrote: | File names, URLs and email address support utf-8 characters and | you may want to lower case them in many situations. If the user | is trying to search for a string, they probably want case | insensitivity. I don't think it's that rare/weird for people to | want localisation to apply when calling toLower. | nurettin wrote: | You need localization if you do any kind of multilingual text | processing. Not sure how it could escape a thinking person's | imagination. | vasama wrote: | This is why I have a set of functions like AsciiToLower(char* | string, size_t size). They only touch characters in the ASCII | space at <0x80. Even went and implemented them with SSE for | x86. | crazygringo wrote: | Of course! 
There are _tons_ of cases where you need to store in | "sentence case" (first word and proper nouns and acryonyms | capitalized, nothing else) so you can convert to title case or | all-caps as needed for display purposes. Templates are full of | this kind of stuff. | | There are similarly tons of cases where you reduce everything | to lowercase without accents for searching and indexing | purposes. Depending on your setup, your database might handle | that for you, but there are edge cases where you need to do it | at the application level. | | Long story short, every string has a locale, and you should | never change the case of something without specifying its | locale. Either be explicit that it's American English or ASCII | or Latin1 or whatever... or that it's something else. Never | leave someone reading the code guessing. | asveikau wrote: | > you can convert to title case or ... for display purposes. | | I am skeptical if someone thinks they need to do this and how | they will get it done. | | Eg. Looping through and capitalizing the first gylph after | breaking whitespace regardless of locale is not the way to | go, but I guarantee you a nontrivial amount of people reading | this would write exactly that if asked to solve the high | level problem. | | I find it annoying when software or even in some cases human | typists try to enforce English language title case. Some | other languages have different rules for titles and | capitalization and seeing the English rules enforced out of | context can be jarring. | reaperducer wrote: | _Has anyone here ever had a use case for toLower() where they | actually wanted localization to apply?_ | | Yes. In a system I'm about done with, there is a sortable chart | of dates and times. In some languages day and month names are | capitalized, and in some they are not. | bjourne wrote: | How does that work? toUpper() can't possibly know that the | string is a day or month name. 
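For the "reduce everything to lowercase without accents for searching" case crazygringo mentions above, a minimal Python sketch could look like the following. The helper name search_key is made up for illustration, and combining str.casefold() with NFKD decomposition is an assumption about what such a key might be, not what any particular database does.

```python
import unicodedata


def search_key(text: str) -> str:
    """Build a case- and accent-insensitive key for matching (hypothetical helper)."""
    folded = text.casefold()  # locale-independent Unicode case folding
    decomposed = unicodedata.normalize("NFKD", folded)
    # Drop the combining marks (accents) that the decomposition split off.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(search_key("Café"))      # cafe
print(search_key("Straße"))    # strasse
print(search_key("İstanbul"))  # istanbul
```

Dotless ı has no Unicode decomposition, so this key does not fold it to i; a Turkish-aware search would still need its own explicit rule.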
| [deleted] | mrighele wrote: | > Has anyone here ever had a use case for toLower() where they | actually wanted localization to apply? | | If you are collecting data which include people's names or | addresses you probably want localization to be applied | correctly so that you can compare data coming from different | sources and possibly with different cases. Having your name | spelled differently in different documents can cause a non | trivial amount of problems with an overzealous bureaucracy. | jmiller099 wrote: | i like c |= 0x20; :) | tyingq wrote: | Airlines might be a good example. The back end system doesn't | grok lowercase characters at all, so you need to transform data | to uppercase A-Z, 0-9 and a few punctuation marks. | miahi wrote: | But they do have the most extensive transliteration rules | library to match everything to that limited character set | (ICAO Doc 9303[1]) that is used by many systems outside the | aviation world. | | [1] https://www.icao.int/publications/pages/publication.aspx? | doc... | fovc wrote: | What about sorting users by name? | phonebanshee wrote: | That's completely language+locale dependent. For example, | here'an alphabetical list of Irish surnames - | https://www.duchas.ie/en/nom?txt=M. You'll notice that sort | order ignores an initial O or Mac (or Ni or Bean, etc). | rkangel wrote: | This is a classic case of a 'why' code comment being needed. | It's obvious what you're doing, but without a 2 line | explanation, it's not clear _why_. | kentonv wrote: | Yeah I probably wrote that comment the first few times I did | this but it's hard to write it the 50th time. | | Maybe I should have my own tolower() function that I can call | so I only have to write the comment once but it just feels | ridiculous somehow. | Izkata wrote: | #include <kentonv.h> | random314 wrote: | Why does it feel ridiculous? | kentonv wrote: | Because I've already rewritten more of the standard | library than is healthy. 
| | I mean, it's clearly the right thing to do here but I can | predict the conversation that will inevitably result... | "You wrote your own tolower() function? Why?" "The | standard one is horribly broken." "How could a function | that lower-cases a letter be broken??? Jesus Kenton your | NIH syndrome is out of control." "Sigh..." | | (Slightly more seriously, any particular time I need to | lower-case something, it takes 10 seconds to write out | the code, but would take 10 minutes to find a good place | to define a reusable function and exactly what its API | should be, and so it never seems worth the effort in the | moment. Just like how most messy code comes to be.) | nitrogen wrote: | Most codebases I've worked with have a StringUtils.java, | or .kt, or a str.c or utils.c. Maybe just start one. | Interestingly I haven't needed it as much in Ruby. | | But I too feel the cognitive (and social!) burden of | introducing a new function. It's not just "where do I put | this", but "how do I convince the team I know what I'm | doing since 15 years of experience clearly isn't enough | and developers (mostly rightly) ignore positional | authority and seniority". | Natsu wrote: | It's far more ridiculous to repeat yourself over and over | instead of making a simple function that describes exactly | what you want and why. | eitland wrote: | Write the function, comment it! | | Many of these are obvious to many people here, but some | aren't. | | Even I can admit that some of the stuff in this thread is | not obvious at all. | dmurray wrote: | Seems like it would be even better to put this in its own | function with a descriptive name, ascii_tolower or | roman_tolower or whatever, that has exactly the semantics you | want. | gregmac wrote: | This is exactly right, is and is a great example of what | self-documenting code can be. The function itself could | have a bit more explanation but any code calling it is | going to be obvious. 
| | The big difference is it _looks deliberate_ , instead of | just code written by someone trying to micro-optimize, be | very clever, or who just didn't realize tolower() exists. | Most people will pause before just replacing it, and | likewise it should trigger questions in the PR. | sedatk wrote: | C# has a `ToLowerInvariant()` variety for that. | paranoidrobot wrote: | Which iirc is an alias for ToLower on the en-us locale. | (Same for the other C# *Invariant() methods) | eitland wrote: | Still warrants a comment to be sure no one concludes the | built in is good enough. | a-nikolaev wrote: | I think, this is why you need explicitly ASCII and explicitly | Unicode lower-/upper-/capitalized-transformations. So you don't | assume these things to work automagically. Some times you need | one type, the other times you need the other type. | [deleted] | grumple wrote: | Now I'm wondering about what happens when we change email | addresses to lowercase... | | https://en.m.wikipedia.org/wiki/Email_address#Internationali... | sedatk wrote: | You shouldn't. Email addresses are case-sensitive. | happytoexplain wrote: | We frequently use localized upper/lower casing at my workplace, | as we do not store such stylization in user facing copy. Most | copy is written and translated in sentence case or title case | (because both are much harder to achieve programmatically), and | then our designers have the option of using that casing as-is, | or using all-upper or all-lower. | ramshorns wrote: | A char in C++ is one byte, right? Is it even possible for this | "fixed" code to call ctype::tolower() on something like a UTF-8 | or UTF-16 code point? | kentonv wrote: | Correct, it won't even work as intended with modern Unicode | locales. | ramshorns wrote: | So maybe if the code is broken anyway for non-ASCII | characters, it's fine to use tolower, since somewhere else | in the code it ensures that c is a byte. | kentonv wrote: | The code is not broken for non-ASCII characters. 
UTF-8 | works just fine with 8-bit chars, and the code I wrote | correctly lower-cases ASCII letters even when UTF-8 is | present (it just won't touch the UTF-8 chars, which is | fine in this use case). | | It's only tolower() and toupper() specifically that are | broken because they expect to be able to do their job on | a single byte, which is no longer possible with UTF-8. | | Meanwhile, using tolower() to lower-case an HTTP header | name won't give you the correct results if the locale is | set to Turkish with the ISO 8859-9 character set, which | is 8-bit, and where tolower('I') will produce the byte | 0xFD which is 'ı' in this character set. | ramshorns wrote: | I see, thanks for the explanation. | cesarb wrote: | Java has two variants of toLowerCase(): one which uses the | default/current locale (almost never what you want), and one | which receives an explicit locale (Locale.ROOT is almost always | the one you want). At work, we use the "forbidden APIs" checker | (https://github.com/policeman-tools/forbidden-apis) to fail the | CI if the variant which uses the default locale is ever used; | if you really want to use a locale-dependent toLowerCase(), you | have to explicitly call Locale.getDefault() and use it as the | locale. | | Is there something similar for C and C++? It could help in your | case, by making your well-meaning colleagues aware of the | issue. | vesinisa wrote: | > Locale.ROOT is almost always the one you want | | At least Android developers are advised to use Locale.US: | https://developer.android.com/reference/java/util/Locale | | > The default locale is not appropriate for machine-readable | output. The best choice there is usually Locale.US - this | locale is guaranteed to be available on all devices, and the | fact that it has no surprising special cases and is | frequently used | | It would be indeed interesting to see in which features these | two locales actually differ.
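Python 3's str.lower() takes no locale argument and applies the same Unicode rules everywhere, so the explicit-locale style cesarb describes for Java has to be spelled out by hand there. A rough sketch, in which lower_for and its lang parameter are made-up names and only the Turkish/Azerbaijani dotted/dotless rule is handled:

```python
def lower_for(text: str, lang: str) -> str:
    """Lowercase with an explicit language tag (hypothetical helper)."""
    if lang in ("tr", "az"):
        # Turkish/Azerbaijani: I lowercases to dotless ı, İ lowercases to dotted i.
        text = text.replace("İ", "i").replace("I", "ı")
    return text.lower()


print(lower_for("INFO", "en"))      # info
print(lower_for("INFO", "tr"))      # ınfo
print(lower_for("İSTANBUL", "tr"))  # istanbul
```

The point is only that the caller has to name the language; nothing is read from the environment.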
| rvnx wrote: | Yes, tolower_l(c, locale) | __s wrote: | I recently ordered a Pixel, on the mail slip they had converted | my name to uppercase, last name read "DUBé" | | Also got my address screwed up on account of living at a half | address.. 1/2 some street #42 | smnrchrds wrote: | > _Has anyone here ever had a use case for toLower() where they | actually wanted localization to apply?_ | | On many documents, including Turkish passport and identity card | and many (all?) other passports, names are written in all caps. | Maybe toLower() is not that useful, but toUpper() is crucial in | any application where you are dealing with real person names. | phonebanshee wrote: | toUpper is definitely language-dependent. For example, in | Irish there are initial letters that are written as lower- | case even in all caps. Wikipedia's example is amusing, since | it's a photo of a government passport office sign - the all- | caps version of Oifig na bPasanna is OIFIG NA bPASANNA (photo | https://upload.wikimedia.org/wikipedia/commons/thumb/f/f8/AL. | .., article https://en.wikipedia.org/wiki/Irish_orthography#C | apitalisati...). It would look utterly bizarre to write OIFIG | NA BPASANNA. And this isn't at all an unusual construction in | Irish, it happens in personal names all the time. | | Plus, there's the issue of diacritical marks. Irish keeps | long marks over capitals, but French drops accents. Do you | plan to do é => É (required for Irish - POBLACHT NA hÉIREANN | is the all-caps version of Poblacht na hÉireann [the Republic | of Ireland]) or é => E (common practice for French)? You have | to get it right, and you have to know the language to do | that. (Poblacht na hÉireann also illustrates the fact that | initial caps is also a language-dependent idea; you | absolutely can't write Poblacht Na Héireann - that makes my | eyes burn just looking at it.)
| | (And before you say, well, Irish isn't a language spoken by | very many people, remember that it's an official language of | the European Union. If you're writing software to be used by | EU agencies, you're going to have to care.) | smnrchrds wrote: | > _French drops accents_ | | Official position of both _Académie française_ and _Office | québécois de la langue française_ is that accents must be | preserved in capital letters. However, it is common in | France to drop them, while they are almost always preserved | in Quebec. I have heard that the reason is that European | French keyboard layout makes it difficult to type accented | capital letters, unlike Quebecois French layout which makes | writing them easy. But I am not sure if this is the cause | rather than the effect of the practice. | ccccc0 wrote: | French here, I recall my primary school textbook where | they said something along the lines of "sometimes accents | are dropped, that's sort of fine as long as it doesn't | change the meaning". They gave the example of a | fictitious newspaper whose headline was "UN POLICIER | TUE": depending on the accent (tue/tué) it means either | "a policeman kills" or "a policeman killed". | hocuspocus wrote: | Wrongly dropping accents on uppercase letters predates | computer keyboards; the French azerty layout puts | accented letters on the first level of number row: | | http://j.poitou.free.fr/pro/img/tkn/tw-image.jpg | | This idiocy carried over. The recent layout update makes | dead accents more accessible: | | https://norme-azerty.fr/ | | But I haven't seen much adoption yet. | zorked wrote: | That button to the right of the P that contains four | different forms of dashes is... interesting. | | Even more if you consider that the minus sign there is | not the character - that is used by every programming | language. | masklinn wrote: | > That button to the right of the P that contains four | different forms of dashes is... interesting.
| | And there are two more on the 8 key. | | I like the mac international keyboard layout, but it | still only provides for 4 of those: the non-breaking | hyphen and the "proper" minus sign are lacking. | | I like that the "new azerty" provides for pretty much | every diacritic, even those which are not in use in | french. | phonebanshee wrote: | Interesting that I was wrong - another data point in the | "it's more complicated than you think it is" column. I | always thought you were supposed to drop them (because I | was explicitly told so by a French engineer I worked with | in the 90s, talking about one particular poster, and many | years later still assume that one hallway conversation | was enough to make that THE OFFICIAL RULE without | bothering to actually check...) | forty wrote: | I confirm that in French capital letters should have | accents. | | I have an anecdote on this: on birth certificates, family | names are written in capital letters. It turns out my | partner's name ends with an É which was written as E in her | birth certificate. She never noticed (it had never | prevented her from getting a national ID with her name properly | accented) until we had our first kid which has both our | names, and they refused to have the name accented until | we had my partner's birth certificate updated (which as | you can imagine is quite an adventure, since you need to | dig ancient family birth certificates to prove it was | originally written with an accent...). | mehrdadn wrote: | > Has anyone here ever had a use case for toLower() where they | actually wanted localization to apply? | | How do you lowercase _without_ localization? Remember all text | isn't English. Unless you're actually asking if anyone has | ever had a use case for lower-casing non-English text? | karmakaze wrote: | And a note that it assumes ASCII. On an EBCDIC system, the | 'A'-'Z' test will translate other characters besides letters.
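karmakaze's EBCDIC point can be checked from Python, which ships a cp037 codec. Picking cp037 is an assumption here (it is one common EBCDIC code page; other variants differ in detail). The sketch counts how many byte values lie between 'A' and 'Z' and how many of those are actually letters:

```python
import string

lo = "A".encode("cp037")[0]
hi = "Z".encode("cp037")[0]

# Decode every byte value that an 'A' <= c && c <= 'Z' test would accept.
in_range = [bytes([b]).decode("cp037") for b in range(lo, hi + 1)]
letters = [ch for ch in in_range if ch in string.ascii_uppercase]
others = [ch for ch in in_range if ch not in string.ascii_uppercase]

print(hex(lo), hex(hi))               # 0xc1 0xe9
print(len(in_range), len(letters))    # 41 26
print("}" in others, "\\" in others)  # True True
```

So on such a system the range test also accepts 15 non-letter byte values, including '}' and the backslash.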
| iforgotpassword wrote: | Wouldn't converting to nfkd/c first solve this issue too? My | understanding of those forms was that they're made exactly for | this case. | brewmarche wrote: | Case mapping and case folding are independent of normalization | (in practice and it is the case here, see the end of | SpecialCasing.txt) | | There is a good Unicode FAQ on the topic: < | http://unicode.org/faq/casemap_charprop.html > | | E: to elaborate, I'm not sure whether the independence of case | handling and normalization is guaranteed anywhere, and if we | for example were to change the uppercase of s to something else | than S then its compatibility forms' (s) case handling would | differ. In practice the SpecialCasing.txt is designed to "make | it work" (e.g. s uppercases to S). | jwilk wrote: | No, these are ASCII strings, so they are already normalized. | iforgotpassword wrote: | Oh, I haven't used python much, but I thought it's all | Unicode? If this were ascii it would work out of the box | since there is no dotless lowercase i in ascii. | estebank wrote: | There is no code point for TURKISH LOWERCASE DOTTED I nor | for TURKISH UPPERCASE DOTLESS I, which means that the text | doesn't carry enough information for roundtrip | preservation. | | I believe this has proven to be a mistake but I'm not an | expert. I don't know _why_ it wasn't done. | FrontAid wrote: | Changes to the casing might also change the value's length. E.g. | uppercasing the German ß will transform it to SS. Example using | JavaScript: | | 'ß'.toUpperCase(); // returns 'SS' | | https://en.wikipedia.org/wiki/%C3%9F | dathinab wrote: | Which can be both correct and wrong depending on context. | | Normally there is no such thing as a capital ß, so it was | decided that if for some unreasonable reason you do uppercase | it you go with SS. | | But then for some all-caps usages this is not right. E.g. an all- | caps name of a restaurant as placed above the restaurant's | door.
In which case it was common to have a ß in an all-caps | name like FOOßBAR. So they decided that for reasons like this | we now have an (EDIT: semi?) official uppercase ß. | | So all in all this and other examples in other languages mean | you should never do a case insensitive comparison by | upper/lower casing both sides, it won't work reliably. | schoen wrote: | There is apparently a multi-decade controversy about that: | | https://en.wikipedia.org/wiki/Capital_%E1%BA%9E | | (with German language authorities recently endorsing the idea | that ß can have a distinctive uppercase form "ẞ") | [deleted] | seqizz wrote: | Yeah, there were some weird bugs about that. I remember one in a | media player. Also "info".upper() would be İNFO probably. | cazim wrote: | http://www.moserware.com/2008/02/does-your-code-pass-turkey-... | | This is old but still valid reading... | shagmin wrote: | I learned about this in javascript when I discovered Angular has | its own lowercase method. Apparently it's internal only now. | | https://github.com/angular/angular.js/commit/1daa4f2231a89ee... | decafbad wrote: | Please stop doing this. Don't bind lower() upper() functions to | environment variables or anything else system related. Sun did | this in Java and doesn't even bother to mention the issue in | documents. It caused huge problems for more than a decade. | | You can just make string lowercase() uppercase() function work | the same everywhere, regardless of locale settings. Provide a | special case function lowercaseTR() or so. This works very well | in Go. | | By the way, Azerbaijan has the same problem because they accepted | help from wrong guys when they switched to Latin. | netsharc wrote: | > lowercaseTR() | | Huh, that works well if we know the input string is in Turkish. | What if this information is not available as you're writing the | code?
| | And what will lowercase()/uppercase() be hard coded to do, and | what are they supposed to output when the input isn't ASCII? | alkonaut wrote: | Repeat after me: don't do string operations without explicit | locale. Don't do string operations without explicit locale. | | I don't know why so many languages have string functions that | should take a locale but provide an overload that doesn't and | which uses the _system_ locale as the default. It can't be what | many developers actually want, yet it has become the norm. Worse, | code using a default locale _appears_ to work on the developers | machine and in production, until someone parses a number in | France or lowercases a string in Turkey, which is a late and | expensive discovery of the bug. | | The default shouldn't be the system locale, it should be an | invariant locale. And I'll go so far as arguing this invariant | locale should be invariant across systems (meaning it can't just | defer to a system C library either). | madeofpalk wrote: | I ran into this with C#/.NET on Windows - I tried to convert a | string "1.3" to the float 1.3, and it failed on languages that | use comma as their decimal separator. | | That was a learning experience. | alkonaut wrote: | Indeed. As a person from a comma country, I find these | mistakes in most code bases I look at. It makes it | frustrating to contribute to open source, for example. | | Perhaps it'll make you feel better about your parsing bug | that even the C# compiler (Roslyn) code base had several of | these issues. | garydgregory wrote: | See also https://garygregory.wordpress.com/2015/11/03/java- | lowercase-... | sedatk wrote: | Note to the next language designer: don't use strings as a | substitute for enums. | teddyh wrote: | It might be OK if strings are immutable and therefore | internable. | TazeTSchnitzel wrote: | The PHP interpreter has an internal reimplementation of string | case conversion that's ASCII-only in order to avoid this problem. 
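An ASCII-only case conversion like the one TazeTSchnitzel describes for the PHP interpreter is short to write in most languages. A Python sketch follows; the name ascii_lower is made up for illustration, and this is not PHP's actual implementation:

```python
import string

# Maps only A-Z to a-z; every other code point passes through unchanged.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def ascii_lower(text: str) -> str:
    """Lowercase ASCII letters only, regardless of locale or Unicode rules."""
    return text.translate(_ASCII_LOWER)


print(ascii_lower("Content-Type"))  # content-type
print(ascii_lower("INFO"))          # info
print(ascii_lower("İSTANBUL"))      # İstanbul
```

This suits machine identifiers such as header names or log levels; it is deliberately wrong for natural-language text.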
| asddubs wrote: | doesn't php have this exact problem with their case-insensitive | (hate that btw) function/method names and turkish localization? | or did they actually fix it at some point? | stevoski wrote: | For a similar reason, Java on Mac and Linux was briefly broken | for anyone using it in the Turkish locale. It was because in the | Turkish locale, !"POSIX".toLowerCase().equals("posix"). | | Relevant bug report here: | https://bugs.openjdk.java.net/browse/JDK-8047340 | anticensor wrote: | Correct: you would get "ınfo", "warnıng" and "crıtıcal" in | Turkish and in Azerbaijani. | mapgrep wrote: | Further context: | | https://en.m.wikipedia.org/wiki/Dotted_and_dotless_I | | Did not know Istanbul is actually İstanbul. | gvx wrote: | Me neither. I did know it's not Constantinople, though. | chihuahua wrote: | I remember running into problems with SQL stored procedures where | column and table names were case-insensitive, so you don't know | if you've properly typed all the column and table names. Until a | customer in Turkey eventually installs it and you find out you've | missed the proper capitalization of an identifier containing the | letter "I", and the stored procedure fails. | Pxtl wrote: | Honestly, I'm very pro case-insensitivity, but my experience | with SQL servers has impressively demonstrated how _not_ to do | it. | | For example, MS SqlPackage, used for deploying schema, is case- | insensitive... But that also means changes to text constants | within your stored procs do not get treated as changes. | heavenlyblue wrote: | This is what I usually think about whenever people say yay to | Unicode in language identifiers. | formerly_proven wrote: | "I" is in ASCII. | tantalor wrote: | https://bugs.python.org/issue1524081 | | > KeyError: 'Info' | tryauuum wrote: | Unrelated story about Russian language. | | The first letter of russian alphabet is А, the last one is Я. So | it's natural to try to match russian words with '[А-Яа-я]+'.
| But this is a recipe for disaster, this regexp doesn't match | words with 'Ё' in them like "Артём". | | This is due to the fact that regexp ranges work on byte values. | All letters of russian language have neatly ordered byte values, | except for the Ё. | blkhawk wrote: | not that unusual - for German for instance üöäÜÖÄß need to be | added so all words can be matched. | a3w wrote: | now, there is even a capital ß ;) | Forge36 wrote: | Out of curiosity I tried on my phone: ß | | Ss | | SS | | So my phone doesn't have that yet! | maxpro wrote: | mine has it - ẞ the small one is ß | Igelau wrote: | Is the capital supposed to be shorter? | mikepurvis wrote: | This would be an argument for just using [:alpha:] everywhere; | presumably it does the correct thing based on locale? | tryauuum wrote: | No, alpha doesn't work, at least in "grep -P" with | "ru_RU.UTF-8" locale: $ echo Test | grep -oP | '[[:alpha:]]+' Test $ echo Артём | grep -oP | '[[:alpha:]]+' $ echo Артём | grep -oP '[А-Яа-я]+' | Арт м | | This thing works, though I've never seen one in the wild: | $ echo Артём | grep -oP '[\p{Cyrillic}]+' Артём | Sharlin wrote: | English is probably the only commonly spoken language where | naïve char range matching _kind of sort of_ works. I say "kind | of sort of" because [a-zA-Z] trivially fails to match all words | in many English texts that haven't been lossily compressed to | ASCII, including this comment. | | It is practically always wrong to match on [a-z] unless you're | parsing a computer language whose spec guarantees that it | works. | Izkata wrote: | Forget ascii conversion, that also fails on contractions like | "don't". | dheera wrote: | The easiest solution to this problem would be to just rename | it to "naive". | tryauuum wrote: | I always wanted to know, how easy is it to type naïve on a | common western keyboard? | | Do you have to press some obscure keyboard shortcut? | dylz wrote: | "i or similar works. ^i, `i, 'i, etc.
| for the others. | reaperducer wrote: | _how easy is it to type naïve on a common western | keyboard?_ | | In macOS, you can either use Command-u (for "umlaut") | followed by i, or hold down the i key for a second and | press 2 to select the ï from the pop-up menu. | [deleted] | masklinn wrote: | > Command-u | | option-u (aka alt-u). | | Generally speaking, command is for application-level or | os-level commands, control is for text editing, and alt | is for alternate characters (all can be shifted and | command "overrides" the rest). | reaperducer wrote: | You're right, it's Option-u. Most of the key labels on my | MacBook have long since been scratched away. | | This has happened with every single Apple keyboard I've | ever used. I suspect it's my fault, since I'm a key | pounder, having learned to type on an IBM Selectric | typewriter. | boring_twenties wrote: | On any Unix, just enable the "compose" key, then | <Compose>+"+i. | | It's always something easy to remember, like " for umlauts, | o for circles (©, ®), obviously ' for accents (ń) and | so on. | Izkata wrote: | On Ubuntu, I use xmodmap to turn Print Screen into a | Compose key. Then it's: <compose>"i | | https://en.m.wikipedia.org/wiki/Compose_key | figomore wrote: | I use the Macintosh keyboard map in Linux. So I do <right | alt>+e to ', <right alt>+n to ~. | fnord123 wrote: | It is acceptable to write English without diacritics. | "Naive" is accepted. | madeofpalk wrote: | By default on a Mac you just hold down the key to get | different options, similar to on an iPhone (and I presume | touch-Android). | | https://i.imgur.com/yuG063t.png | nemetroid wrote: | On a Swedish keyboard, there's a dead key for ¨, so you | press that followed by _i_ to get _ï_. | | It's not very clear why the Swedish keyboard has that key, | since ä and ö each have their own keys. The layout has | other quirks as well, such as keys for §, ½ and the | useless "currency sign", ¤. | aliswe wrote: | Yes! 
I think the "mine" character should be switched for | the dollar sign. | | BTW, the dead key could be from german, for writing their | U:s. | jakub_g wrote: | I'm Windows-based and wanted a keyboard layout that will | allow me typing easily Polish and French at the same time, | without switching keyboard layouts (PL == US+AltGr for | accents; while FR layout is insane, because apart from | being AZERTY, all special chars are in different places, | and you need a Shift to type numbers; and the way to type | accents is also special). | | I found "Polish international" [1] layout which honestly | can be perfect for many people. It's optimized to be | compatible with regular Polish keyboard (hence with US | keyboard too), and maybe not the fastest if you type a lot | special chars, but it's extremely intuitive: | | i = AltGr+:, i | | u = AltGr+:, u | | e = AltGr+/, e | | e = AltGr+\, e (since it's extremely common, also aliased | as AltGr+w) | | If you're Windows based and want US-compatible keyboard | layout that allows easily typing any special chars, I | highly recommend it. | | [1] https://translate.google.com/translate?sl=pl&tl=en&u=ht | tps%3... | 205guy wrote: | I type English and French in Windows on the same QWERTY | keyboard. I once learned to type on Azerty, but I mainly | type English now on a standard US keyboard layout. For | the French, I find the windows alt-numbers works the | easiest for accented characters. Alt-130=e, Alt-133=a, | Alt-135=c, Alt-137=e, Alt-138=e which covers 95% of the | accented character usage. I have a little chart next to | my desk with all the others (i,o,u) they're nearly all | Alt-14x and Alt-15x. And then I'll put e in the paste | buffer because it is the most used and a bit quicker that | way (for words like "prefere"). | | The Alt-13x codes are not as quick as the Azerty keys, | but good enough and once memorized are fairly easy with a | keyboard that has a keypad (most PCs do, even my laptop). 
| This is especially true because they are done with both | hands simultaneously, as opposed to something like | Cmd-e+e on a Mac. Actually, they are faster than finding | the accented characters on my QWERTY virtual keyboard as | I type this comment on iOS. | | Those AltGr- combos seem complicated to me, I would much | prefer a system such as AltGr-e =e, then AltGr-ee=e, | AltGr-eee=e, etc. To me that would be more intuitive than | remembering the composing character (slash for aigue, | etc). | cassepipe wrote: | You seem to be quite used to your Alt combination but as | you said they really are not straightforward. I found | another very simple solution, on Linux you can set a | compose key (typically Alt gr or the contextual menu | key). You type one after another, the compose key and | then any two keys that make sense like ' followed by e | (et vice versa), it will give you a e. It is both fast | and easy to work with. | jakub_g wrote: | Reminds me of when the 'D' key broke in my physical | keyboard long time ago. I liked that keyboard a lot and | couldn't find a good replacement so I learnt to type | Alt-100 do get 'd'. | mbostleman wrote: | Hence toUpper/toLower is not a strategy that passes the Turkey | Test for case insensitivity. | jwilk wrote: | Looks like it's no longer the case in Python 3: | Python 3.7.3 (default, Jul 25 2020, 13:03:44) [GCC 8.3.0] | on linux Type "help", "copyright", "credits" or "license" | for more information. >>> from locale import * >>> | setlocale(LC_ALL, 'tr_TR.UTF-8') 'tr_TR.UTF-8' >>> | 'INFO'.lower() 'info' | xyst wrote: | Python 3.7.5 (default, Nov 5 2019, 22:30:48) | | [Clang 11.0.0 (clang-1100.0.33.12)] on darwin | | Type "help", "copyright", "credits" or "license" for more | information. 
| | >>> from locale import * | | >>> setlocale(LC_ALL, 'tr_TR.UTF-8') | | 'tr_TR.UTF-8' | | >>> 'INFO'.lower() | | 'info' >>> ' [?]'.lower() | | '\u200d[?]' | | >>> exit() | | There's something wrong with emojis + lower() though | Dylan16807 wrote: | It lowercased the 'show this as emoji' variation selector to | zero width joiner? | anderskaseorg wrote: | Oddly, it also wasn't the case for Python 2 Unicode strings | (u'INFO'), only for Python 2 byte strings ('INFO'). So it's | possible that Python 3 lost this behavior by accident. | scrollaway wrote: | Ive long thought programming languages need a "localizable | string" (Aka user-facing string) type, different from regular | utf8 strings. Something like what gettext and other i18n | libraries fake for you, but native to the language. | | Behaviour like this is definitely a good reason why: sorting, | changing case, etc should be consistent when dealing with strings | used as constants and identifiers, but Python's .lower() | behaviour makes sense in a localizable string context. | lazulicurio wrote: | Along similar lines, I've thought that it would be useful if | Unicode included language marks (i.e. codepoints to identify | blocks of text as being written in a specific language). It | would be strictly more useful than the barebones left-to- | right/right-to-left marks (U+200E/U+200F) when deciding how to | process and display text. And it would be a step towards | correcting the mess that was Han unification. | jwilk wrote: | See RFC 2482 -- Language Tagging in Unicode Plain Text: | | https://tools.ietf.org/html/rfc2482 | | But it was deprecated later on: | | https://tools.ietf.org/html/rfc6082 | lazulicurio wrote: | Interesting. Unfortunate that the deprecation notice | doesn't include much rationale. I found at least one mail | thread about it[1], which seems to confirm that the main | thought was that semantic information about text should be | handled at a higher layer (e.g. XML). 
I can understand that | argument for a general purpose tagging mechanism, but | language and glyphs are strongly semantically linked. | | (Somewhat ironically, the previous thread on that mailing | list is about the struggles of case folding in a general | fashion across multiple language scripts[2]) | | Edit: I also found [3], which offers the following: | | ---- | | - Most of the data sources used to assemble the documents | on the Web will not contain these characters; producers, in | the process of assembling or serializing the data, will | need to introspect and insert the characters as needed-- | changing the data from the original source. Consumers must | then deserialize and introspect the information using an | identical agreement. The consumer has no way of knowing if | the characters found in the data were inserted by the | producer (and should be removed) or if the characters were | part of the source data. Overzealous producers might | introduce additional and unnecessary characters, for | example adding an additional layer of bidi control codes to | a string that would not otherwise require it. Equally, an | overzealous consumer might remove characters that are | needed by or intended for downstream processes. | | - Another challenge is that many applications that use | these data formats have limitations on content, such as | length limits or character set restrictions. Inserting | additional characters into the data may violate these | externally applied requirements, and interfere with | processing. In the worst case, portions (or all of) the | data value itself might be rejected, corrupted, or lost as | a result. | | - Inserting additional characters changes the identity of | the string. This may have important consequences in certain | contexts. | | - Inserting and removing characters from the string is not | a common operation for most data serialization libraries. 
| Any processing that adds language or direction controls | would need to introspect the string to see if these are | already present or might need to do other processing to | insert or modify the contents of the string as part of | serializing the data. | | ---- | | Other than #3 (the one about string identity), I find these | wholly unpersuasive. And even #3 isn't even that great a | reason considering that programmatic processors have to | deal with that issue anyway due to case folding. | | [1] https://www.unicode.org/mail-arch/unicode- | ml/y2010-m11/0039.... | | [2] https://www.unicode.org/mail-arch/unicode- | ml/y2010-m11/0038.... | | [3] https://www.w3.org/TR/string-meta/ | Ericson2314 wrote: | What this gets right down to is that Unicode is a flawed | idea: the meaning/behavior/whatever of characters is insanely | dependent on their context. | | The problem was never gazillions of code pages, but our | inability to write C to deal with that amount of complexity | circa 1990. | | With modern machines, and good programming languages with | good type systems, I absolutely think we could store a | language per string, and concatenate into a polylinguistic | rope if needed. | | This would hopefully push us away from stringly-typed crap in | general. | throwaway_pdp09 wrote: | > the meaning/behavior/whatever of characters is insanely | dependent on their context | | I wish you would give an example instead of just | proclaiming crapness. You know, so we n00bs can learn | something. | toast0 wrote: | Different languages have different rules for change case | (as seen here) or what to do when translitterating to | 7-bit ascii, in French, you can mostly drop accents if | you need to, in German, you need to transform an umlaut | to an e following the vowel. Of course, many languages | don't have a way to translitterate to 7-bit ascii. 
| | Sorting of strings is language dependent, but I don't | know that there's a defined order for mixed language | lists, so I guess user's context works if you're sorting | for user purposes, but if you're sorting for machine | purposes, you better not use the locale aware sort | without telling it a hardcoded locale that doesn't change | between localization library versions. | throwaway_pdp09 wrote: | @toast0, @lazulicurio, both of your points seem to | illustrate the complexities of the languages, not | "...that Unicode is a flawed idea" as the original poster | said. AFAIKS this is intrinsic complexity showing itself | and does not make any indication of how it should be done | correctly, or better. | Ericson2314 wrote: | The benefit of looking at languages/scripts in isolation | is that the _combinatorial explosion_ of all | languages/scripts at once is dodged. | | E.g. lookalike characters, and social engineering by using | a vs а. (One is Cyrillic). I don't want to even _define_ | "a == а". I want Latin and Cyrillic to be different types | of characters, and that expression to be ill-typed. | | This solves the Turkish problem, where the upper case I | is two different characters in two different types (Turkish | Roman script?), and the case folding functions likewise | have disjoint types. | lazulicurio wrote: | > both of your points seem to illustrate the complexities | of the languages, not "...that Unicode is a flawed idea" | | The flaw in Unicode is that it punts on the intrinsic | complexity---pretending that codepoints have language- | independent, plain-text, semantic meaning. 
| | A couple of threads that have molded my views over time: | | _I can 't write my name in Unicode_ | https://news.ycombinator.com/item?id=9219162 | (Specifically these two comments | https://news.ycombinator.com/item?id=9220530 and | https://news.ycombinator.com/item?id=9220970) | | _Why isn 't the external link symbol in Unicode?_ | https://news.ycombinator.com/item?id=23016832 | Ericson2314 wrote: | > The flaw in Unicode is that it punts on the intrinsic | complexity---pretending that codepoints have language- | independent, plain-text, semantic meaning. | | > Pretending "plain text" isn't an oxymoron | | FTFY :) | lazulicurio wrote: | How about: case folding for the letter 'I' is dependent | on whether the locale is Turkish or not. | | ;) | arcticbull wrote: | Unicode goes to great pains to avoid ascribing any | meaning/behavior/whatever to character sets. Because to | your point you can't. Unicode is actually incredibly well | thought out. That's why we have values, code points and | grapheme clusters. I don't think the Unicode standard even | defines casing except in the human-readable names ascribed | to code points. | | If you want to build a polylinguistic rope you can | certainly do that with Unicode, but you won't have solved | anything because language alone without context doesn't | really define many of the operations you're describing. | | The answer is usually the same as "doctor it hurts when | I..." -- stop doing it. Stop manipulating user input | without context. Stop trying to limit user visible strings | by character count, use pixel width in the rendered font. | And so on. | jfk13 wrote: | > I don't think the Unicode standard even defines casing | except in the human-readable names ascribed to code | points | | Sure it does; the Unicode Character Database includes | fields for the lowercase, uppercase and titlecase | mappings. 
| But it also acknowledges that these are just | default mappings, and may need to be tailored for | specific languages/locales. | Ericson2314 wrote: | Unicode is well thought out! And that's what makes it | hard to critique :). I think it's one of the best- | maintained, well-thought out standards there is, but I | still think the premise is wrong. | | If all that good effort went into something along the | lines I am describing, where languages, or at least | scripts, cannot be arbitrarily mixed at the character | level, I think we would have an even better result with | the same level of effort. | kevin_thibedeau wrote: | Unicode supported this with tag sequences but that is | deprecated and unlikely to work with modern libs. | thomasahle wrote: | That would be great! For example, in Python you currently have | to do something like this: import locale | sorted(list_of_strings, key=locale.strxfrm) | | To sort using the current locale, which many people forget. | DougBTX wrote: | Along the lines of this? | | https://docs.microsoft.com/en-us/dotnet/api/system.globaliza... | layer8 wrote: | In Java, there is Locale.ROOT, which can be used in a similar | way. In particular, it is useful when performing locale- | dependent operations in locale-independent contexts (e.g. | working with case-insensitive identifiers) where you don't | want the behavior of your code to depend on the current | default locale. | wongarsu wrote: | .NET is one of the few ecosystems to get this right. It | offers the invariant culture for identifier-like things, "fr" | for French language and "fr-FR" for French language in | France, allowing you to specify your intention to every | string-modifying function. | | Support at the type level would be a lot less verbose, but | support at the function level is already much better than | many other popular languages. 
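A minimal Python sketch of the "say which locale you mean on every call" idea from the comments above. Python's own str.lower() and str.upper() take no locale argument, so the helper names lower_in and upper_in and the lang parameter are hypothetical, not a real API. Only the Turkish/Azerbaijani dotted and dotless I tailoring is handled, and an empty lang stands in for an invariant locale.

```python
# Hypothetical helpers, not part of the Python standard library.
TURKIC = {"tr", "az"}  # languages that pair I with ı and İ with i


def _is_turkic(lang):
    # Accept tags such as "tr", "tr-TR" or "tr_TR".
    return lang.replace("_", "-").split("-")[0].lower() in TURKIC


def lower_in(text, lang=""):
    """Lowercase text for a language tag ("" means invariant)."""
    if _is_turkic(lang):
        text = (text.replace("I\u0307", "i")   # I + combining dot above -> i
                    .replace("\u0130", "i")     # İ -> i
                    .replace("I", "\u0131"))    # I -> ı (dotless)
    return text.lower()


def upper_in(text, lang=""):
    """Uppercase text for a language tag ("" means invariant)."""
    if _is_turkic(lang):
        text = (text.replace("i", "\u0130")    # i -> İ (dotted capital)
                    .replace("\u0131", "I"))    # ı -> I
    return text.upper()


print(lower_in("INFO"))           # identifier-like use, invariant rules
print(lower_in("INFO", "tr-TR"))  # user-facing Turkish text
print(upper_in("info", "tr"))
```

Identifier-like strings such as log level names would use the default, while text shown to a Turkish or Azerbaijani reader would pass the language tag explicitly.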
| kanox wrote: | It would be great if strings and especially date-time | values always carried locale and timezone information with | them. | | It would take slightly more memory but not significant on | modern machines. | wongarsu wrote: | Putting the locale information on the string sounds like | a good idea. However I'm not sure how that should handle | combined strings with components from different locales. | For example `logLevel + ": " + logMessage` might produce | "info: baglanti kesildi" in Turkish. How to annotate | that? Neither English nor Turkish would work correctly, | each would produce the wrong result when uppercasing. | | You could treat it as a series of string slices with | different locales `[("info", "en"), (": ", ""), | ("baglanti kesildi", "tr")]`. That would work correctly, | and you could now uppercase each slice according to its | appropriate locale, but it wouldn't really be low | overhead anymore. Maybe still worth it. It would be an | interesting approach that might even be able to be | implemented pretty seamlessly as a library in some | languages (C++ or rust for example) | scrollaway wrote: | That just seems to be a parameter for locale-dependent | functions. Very useful, but no, I'm talking about splitting | the unicode-string datatype in two: "user-facing unicode | string" vs "internal unicode string". | | Example: logging.log("INFO", i"This is a localizable string") | | In the i18n world, we could gather i-strings just like | gettext does (where it looks like `logging.log("INFO", | _("This is a localizable string")`). The language could then | have other useful hooks/behaviours into that datatype, and | definitely one of them would be whether various methods have | i18n behaviour enabled on them, versus using a C locale. | maweki wrote: | As it isn't yet mentioned: for these cases the Python standard | library explicitly has | https://docs.python.org/3.8/library/stdtypes.html#str.casefo... 
| (str.casefold), which aggressively lowercase-normalizes strings | with an algorithm from the unicode standard. Every case | comparison using lower() instead of casefold() can be considered | a bug. | Alex3917 wrote: | > Every case comparison using lower() instead of casefold() can | be considered a bug. | | If you just casefold two strings and compare them, it's still a | bug. You need to normalize them to NFKC first. | pas wrote: | Is NFKC necessary, isn't NFKD enough? (As in you have to | normalize and decompose both strings, but at that point you | can check them for equality, and doing the canonical | composition isn't needed, right?) | Alex3917 wrote: | I think that would work if you're just checking for | equality and want to minimize processing. I guess as a web | developer I always just assume people are going to be | storing strings in a database after normalizing them, so | would want to minimize string length. | chippy wrote: | https://garygregory.wordpress.com/2015/11/03/java-lowercase-... | | In the Turkish locale, the Unicode LATIN CAPITAL LETTER I becomes | a LATIN SMALL LETTER DOTLESS I. That's not a lowercase "i". | [deleted] | geofft wrote: | In C (POSIX.1-2008, specifically), there's tolower_l() and the | rest of the _l functions for this use case, which take a locale | as an argument. That lets you ask for the English (or even "C | locale") lowercase versions of these English words, even when | your process's current locale is Turkish. | | https://www.man7.org/linux/man-pages/man3/tolower_l.3.html | adamjb wrote: | The mention of _l functions reminded me of this gloriously over | the top git message/rant. | | "Those not comfortable with toxic language should pretend this | is a religious text." | | https://github.com/mpv-player/mpv/commit/1e70e82baa9193f6f02... | [deleted] | [deleted] | formerly_proven wrote: | ITT calling setlocale or std::locale::global(...) 
is ALMOST | ALWAYS a heinously bad idea and should rarely be done, because it | breaks tons of code (notably everything that uses printf/scanf | and everything using stringstream). | crazygringo wrote: | Serious question. | | Why on earth would you hard-code these, instead of simply call a | lowercase function in the en-US locale? | | These are English words. Naively lowercasing them according to | whatever locale the server or user has set seems like a terrible | programming practice. Any call to a lowercase function should be | explicitly including an argument that specifies it's English, no? | | In the same way we've all learned to never store times without an | explicit timezone (even if it's UTC), or locate a string offset | without knowing your encoding... you should never perform | language transformations (case changes, accent removal, etc.) | without a locale. | | Hardcoding these things is just patching over the symptoms | without addressing the cause, no? | jaclaz wrote: | Only for the record, there is something very similar that may | happen when creating CD/DVD's (please read when using mkisofs and | similar), with the "dash" that when "capital" becomes underscore | (but not only ) depending on the reference ISO | 9660/Joliet/RockRidge convention in use. | | https://web.archive.org/web/20151007005513/http://www.911cd.... | TwoBit wrote: | This particular case seems odd to me because INFO is an English | word, and info is not. | wongarsu wrote: | You could make a case that Unicode should have different "i" | characters for different languages. Then you could do all | transformations unambiguously. On the other hand almost | everyone abuses the minus sign as a dash, and treats the | apostrophe and the prime sign (signifying feet or minutes) as | interchangeable, so in all likelihood they would constantly use | the wrong i too. | kps wrote: | > On the other hand almost everyone abuses the minus sign as | a dash | | Unicode calls it HYPHEN-MINUS. 
It does also have an | unambiguous '-' MINUS SIGN as well as '-' U+2010 HYPHEN and | the various dashes, but most people use bad keyboard layouts. | josefx wrote: | > You could make a case that Unicode should have different | "i" characters for different languages. | | And different "SS" for any case where the lowercase was an | sz, of course at some point Germany introduced an uppercase | SZ character to avoid that round trip loss issue, but we | still have tons of text that use the old sz -> SS conversion. | Also note that "y" in Germany, not all German speaking | countries follow the same rules for sz, some dropped it | entirely. We basically need something like the time zone | database to have even a snowballs chance in hell to handle | text correctly. | heavenlyblue wrote: | Pretty sure that's not true. When you switch your keyboard | you will have a proper i character in another language unless | your keymap is broken. How do you think Chinese, Russians or | Greek type their characters? | tzot wrote: | The grandparent obviously meant "latin i"; none of the | three languages you mention have any latin letters, but at | least Russian and Greek have some lowercase and some more | uppercase letters with the same glyph/shape as latin ones. | heavenlyblue wrote: | Yeah, and those similar glyphs are not available on their | own language keyboard. | wongarsu wrote: | I frequently type German with a US layout with dead keys | (so I can type "a to get a). I also imagine that most | Turkish developers type English on a Turkish layout, since | Turkish contains all characters used by English. | anticensor wrote: | I have a better solution: use combining characters COMBINING | DOT ABOVE (which already exists) and DELETE DOT ABOVE (which | needs to be added into Unicode), which would manipulate "I" | into "I" and "i" into "i" respectively. Those combining | characters would also work perfectly with j too. 
| estebank wrote: | The only issue I can see is with people working in a | Turkish locale writing Latin text producing, let's say | English blogposts with the wrong i and I. I still think | that this should have been done this way though... | anticensor wrote: | Indeed. LATIN SMALL LETTER I + DELETE DOT ABOVE becomes | LATIN CAPITAL LETTER I + DELETE DOT ABOVE in uppercase, | which then becomes LATIN SMALL LETTER I + DELETE DOT | ABOVE back in lowercase. The same thing applies to LATIN | CAPITAL LETTER I + COMBINING DOT ABOVE. Survives an infinite | number of case conversions. | johnwalkr wrote: | Well a round-trip or two could still be ambiguous which could | easily fail when comparing strings later in some edge case. | Especially when we can't even consistently agree to use by- | application, by-OS, by-language and by-locale settings | consistently. I don't have a solution, just pointing out that | this is a really challenging problem to fully solve. | mapgrep wrote: | Dumb question, if you _really_ need the exact string "info" in a | given context, why not hard code it? What does .lower() or even a | map like the linked one actually buy you? | nicoburns wrote: | Presumably it's for normalising input. Following the principle | that you ought to be permissive in what data you accept, and | strict in what data you give out. | simion314 wrote: | Maybe the input is case insensitive, for example if you work | with html you might see "DIV","div" who knows some crazy dev or | tool might generate "DIv" or "dIv" so it is simpler to lowercase | the input then work on it. | dependenttypes wrote: | Ah yes, locales. Everyone loves them | https://github.com/mpv-player/mpv/commit/1e70e82baa9193f6f02... | ramses0 wrote: | ObTurkeyTest: | http://www.moserware.com/2008/02/does-your-code-pass-turkey-... ___________________________________________________________________ (page generated 2020-08-16 23:00 UTC)