hngopher.com

       [HN Gopher] In the Turkish locale, "INFO".lower() != "info"
       ___________________________________________________________________
        
       Author : duckerude
       Score  : 140 points
       Date   : 2020-08-16 13:10 UTC (9 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | beeforpork wrote:
       | My genius idea was once to use toupper() to normalise paths on
       | Windows, which are case-insensitive. One day, a customer from
       | Azerbaijan reported that my application failed to access a file
       | in C:\WINDOWS\\...
        
         | tryauuum wrote:
         | i feel your pain
        
       | CodesInChaos wrote:
       | In the Danish locale "aa" doesn't start with "a".
        
       | 60secz wrote:
       | Stringly typed: Play stupid games, win stupid prizes.
        
       | bayindirh wrote:
       | Welcome to the Turkish language, where we have ı, i, I and İ. In
       | our language the conversion is as follows:
       | 
       | - ı <-> I
       | 
       | - i <-> İ
       | 
       | We love our dots and preserve them. For a more detailed read,
       | please see:
       | 
       | https://blog.codinghorror.com/whats-wrong-with-turkey/
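
A minimal Python sketch of the Turkish casing rule described above. Python's own str.lower() and str.upper() ignore the process locale, so the two helpers below are hypothetical, hand-rolled stand-ins for what a Turkish-locale case conversion does in other runtimes; they are for illustration only.

```python
# Hypothetical helpers (not a library API): hand-rolled Turkish/Azerbaijani
# casing, where dotted and dotless i are two separate letter pairs.

def turkish_lower(text: str) -> str:
    # Map the two capital I's explicitly, then let str.lower() do the rest.
    return text.replace("I", "ı").replace("İ", "i").lower()

def turkish_upper(text: str) -> str:
    # Map the two small i's explicitly, then let str.upper() do the rest.
    return text.replace("i", "İ").replace("ı", "I").upper()

print(turkish_lower("INFO"))     # ınfo
print("INFO".lower())            # info
print(turkish_upper("windows"))  # WİNDOWS
```

Under these rules "INFO" lower-cases to "ınfo" (dotless ı), so a comparison against the literal "info" fails, which is the bug in the title.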
        
         | Natsu wrote:
         | As I understand it, Turkish is one of the more important
         | locales to test with because of things like this.
        
           | bayindirh wrote:
           | Turkish is the only language which has the ı & İ pair.
           | Similarly, AFAIK, Turkish is again the only language with ğ
           | and ş letters. So, by testing for Turkish, you test for a lot
           | of European languages at once. Moreover we share some
           | modified letters (ç, ü) with other Central European languages.
           | 
           | If your program can pass "The Turkish Test", you pass a lot
           | of others too.
        
             | [deleted]
        
             | anticensor wrote:
             | Azerbaijani too. Moreover, Azerbaijani has an additional
             | letter ə, which sounds like /æ/.
        
               | therein wrote:
               | I love the feeling of camaraderie arising from that
               | partial mutual intelligibility of Turkish and
               | Azerbaijani.
               | 
               | That connection through language goes a long way.
               | 
               | müqəddəs bacı millət :)
        
           | 1-more wrote:
           | Poor encoding can lead to the odd murder too:
           | http://gizmodo.com/382026/a-cellphones-missing-dot-kills-
           | two...
           | 
           | > The use of "i" resulted in an SMS with a completely twisted
           | meaning: instead of writing the word "sıkısınca" it looked
           | like he wrote "sikisince." Ramazan wanted to write "You
           | change the topic every time you run out of arguments" (sounds
           | familiar enough) but what Emine read was, "You change the
           | topic every time they are fucking you" (sounds familiar too.)
        
             | rvnx wrote:
             | That doesn't explain the e instead of the a, does it?
        
               | bayindirh wrote:
               | In the olden times, ending words with _e_ instead of _a_
               | was considered an acceptable typo.
               | 
               | Also the article explicitly says "it looked _like_ he
               | wrote ". So when you see red, that last letter can become
               | anything and nothing would change.
        
         | generationP wrote:
         | > - ı <-> I
         | 
         | > - i <-> İ
         | 
         | After seeing this, I don't understand how the rest of us can
         | fail to have the same distinction. There's something logically
         | beautiful -- like the rhyme in a good poem -- about artificial
         | languages (or, in this case, alphabets) that naturally evolved
         | languages just cannot compete with.
        
         | sampo wrote:
         | > We love our dots and preserve them.
         | 
         | Turkish preserves the dots of i, ö and ü in their capital
         | versions, but not with j. The capital J is dotless.
        
           | bayindirh wrote:
           | Isn't capital J dotless in every language?
        
       | Macha wrote:
       | 07/04/2008 -> April 7th seems about as reasonable a result as
       | July 4th, especially when you've explicitly opted in to a Turkish
       | locale. I don't agree with the article's assertion that
       | interpreting the format according to the user's locale is wrong
       | here; the one wrong part is a US-centric programmer's expectation
       | that PP-QQ-YYYY is an unambiguous format. Use YYYY-mm-dd when you
       | need a format that's not ambiguous.
        
         | frabert wrote:
         | YYYY-mm-dd also plays nice with lexicographic ordering, which
         | is why I always use it when I need to put dates in e.g.
         | filenames
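
A small Python sketch of the ordering point above, assuming dates are embedded in names as plain strings. The sample dates are made up for illustration. Sorting YYYY-mm-dd strings as text gives chronological order, while mm/dd/YYYY strings do not.

```python
from datetime import date

# Illustrative dates only; any set spanning several years shows the effect.
days = [date(2020, 8, 16), date(2008, 4, 7), date(2008, 7, 4), date(2019, 12, 31)]
chronological = sorted(days)

iso_names = [d.isoformat() for d in days]           # YYYY-mm-dd
us_names = [d.strftime("%m/%d/%Y") for d in days]   # mm/dd/YYYY

# Plain string sort versus true chronological order.
iso_ok = sorted(iso_names) == [d.isoformat() for d in chronological]
us_ok = sorted(us_names) == [d.strftime("%m/%d/%Y") for d in chronological]

print(iso_ok)  # True
print(us_ok)   # False
```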
        
           | Macha wrote:
           | I'm a European working primarily with Americans. My home
           | country uses dd/mm/YYYY (or dd/mm for short) and the US uses
           | mm/dd/YYYY, with mm/dd for short. I've switched to YYYY-mm-dd
           | simply for my own sanity and if I omit the year I write
           | the month in text format, such as "5 June".
        
             | withinboredom wrote:
             | The US military uses almost the same convention
             | (dd-mmm-yyyy), so 07-aug-2020.
        
               | dgellow wrote:
               | That's dd-mmm-yyyy
        
               | withinboredom wrote:
               | Thanks!
        
         | Macha wrote:
         | Note: This is actually a reply to the article here:
         | https://news.ycombinator.com/item?id=24178270 , for some reason
         | I thought that was the top level link.
         | 
         | Maybe if dang sees this, it could be reparented?
        
         | heavenlyblue wrote:
         | > PP-QQ-YYYY is an unambiguous format
         | 
         | "US centric" is one way to say it
        
         | snthd wrote:
         | > 07/04/2008 -> March 7th
         | 
         | I think you mean April.
        
           | Macha wrote:
           | Fixed
        
       | kentonv wrote:
       | Has anyone here _ever_ had a use case for toLower() where they
       | actually wanted localization to apply?
       | 
       | It seems to me that in practice, it's extremely rare to want to
       | change case of real, natural-language text. When I have natural-
       | language text, it's just a blob to me, and I don't want to touch
       | it.
       | 
       | The only time I ever want to lower-case or capitalize something,
       | I'm working with identifiers meant for computer -- not human --
       | consumption. Usually, specifically, I'm dealing with identifiers
       | that have annoyingly been defined to be case-insensitive even
       | though the only humans that ever see them are programmers and
       | programmers hate case-insensitivity. HTTP headers are a common
       | example.
       | 
       | I mostly write C++, and I end up writing code like:
       | 
       |     for (char& c: str) {
       |       if ('A' <= c && c <= 'Z') c = c - 'A' + 'a';
       |     }
       | 
       | Later on, some well-meaning developer on my team will come along
       | and say "Ugh what is this NIH syndrome?" and then they "clean it
       | up" as:
       | 
       |     #include <ctype.h>
       | 
       |     for (char& c: str) {
       |       c = tolower(c);
       |     }
       | 
       | And then I have to say NOOOOOOO DON'T DO THAT YOU HAVE NO IDEA
       | WHAT tolower() REALLY DOES!
       | 
       | I struggle to imagine any real use case where you'd actually want
       | locale-dependent tolower() other than, maybe, a word processor --
       | but if you're writing a word processor, you're probably not going
       | to be depending on the language's built-in string APIs to do your
       | text manipulation.
        
         | felixarba wrote:
         | I have a morse code app which consistently crashed when certain
         | users would try to translate letter "i", and it took me a long
         | time to figure out that only the Turkish users would complain
         | about it. When one of them sent me a screenshot I noticed a
         | "wrongly" rendered capital letter i (I used toUpper), and after
         | digging around a bunch, I learned about this whole Turkish
         | letter i.
        
         | aflag wrote:
         | File names, URLs and email addresses support UTF-8 characters and
         | you may want to lower case them in many situations. If the user
         | is trying to search for a string, they probably want case
         | insensitivity. I don't think it's that rare/weird for people to
         | want localisation to apply when calling toLower.
        
         | nurettin wrote:
         | You need localization if you do any kind of multilingual text
         | processing. Not sure how it could escape a thinking person's
         | imagination.
        
         | vasama wrote:
         | This is why I have a set of functions like AsciiToLower(char*
         | string, size_t size). They only touch characters in the ASCII
         | space at <0x80. Even went and implemented them with SSE for
         | x86.
        
         | crazygringo wrote:
         | Of course! There are _tons_ of cases where you need to store in
         | "sentence case" (first word and proper nouns and acronyms
         | capitalized, nothing else) so you can convert to title case or
         | all-caps as needed for display purposes. Templates are full of
         | this kind of stuff.
         | 
         | There are similarly tons of cases where you reduce everything
         | to lowercase without accents for searching and indexing
         | purposes. Depending on your setup, your database might handle
         | that for you, but there are edge cases where you need to do it
         | at the application level.
         | 
         | Long story short, every string has a locale, and you should
         | never change the case of something without specifying its
         | locale. Either be explicit that it's American English or ASCII
         | or Latin1 or whatever... or that it's something else. Never
         | leave someone reading the code guessing.
        
           | asveikau wrote:
           | > you can convert to title case or ... for display purposes.
           | 
           | I am skeptical if someone thinks they need to do this and how
           | they will get it done.
           | 
           | Eg. Looping through and capitalizing the first glyph after
           | breaking whitespace regardless of locale is not the way to
           | go, but I guarantee you a nontrivial amount of people reading
           | this would write exactly that if asked to solve the high
           | level problem.
           | 
           | I find it annoying when software or even in some cases human
           | typists try to enforce English language title case. Some
           | other languages have different rules for titles and
           | capitalization and seeing the English rules enforced out of
           | context can be jarring.
        
         | reaperducer wrote:
         | _Has anyone here ever had a use case for toLower() where they
         | actually wanted localization to apply?_
         | 
         | Yes. In a system I'm about done with, there is a sortable chart
         | of dates and times. In some languages day and month names are
         | capitalized, and in some they are not.
        
           | bjourne wrote:
           | How does that work? toUpper() can't possibly know that the
           | string is a day or month name.
        
         | [deleted]
        
         | mrighele wrote:
         | > Has anyone here ever had a use case for toLower() where they
         | actually wanted localization to apply?
         | 
         | If you are collecting data which include people's names or
         | addresses you probably want localization to be applied
         | correctly so that you can compare data coming from different
         | sources and possibly with different cases. Having your name
         | spelled differently in different documents can cause a non
         | trivial amount of problems with an overzealous bureaucracy.
        
         | jmiller099 wrote:
         | i like c |= 0x20; :)
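
A quick Python sketch of what OR-ing with 0x20 does to single ASCII characters. The helper name or_0x20 is made up for this illustration. It lower-cases A-Z, but it also rewrites any other byte in 0x40-0x5F, so it is only safe when the input is already known to be letters, digits, or characters that have the 0x20 bit set.

```python
def or_0x20(ch: str) -> str:
    """Hypothetical helper: the `c |= 0x20` trick applied to one character."""
    return chr(ord(ch) | 0x20)

print(or_0x20("A"), or_0x20("Z"), or_0x20("a"))  # a z a
print(or_0x20("5"), or_0x20("-"))                # 5 -   (bit 0x20 already set)
print(or_0x20("@"), or_0x20("["))                # ` {   (not letters, still changed)
print(hex(ord(or_0x20("_"))))                    # 0x7f  (underscore becomes DEL)
```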
        
         | tyingq wrote:
         | Airlines might be a good example. The back end system doesn't
         | grok lowercase characters at all, so you need to transform data
         | to uppercase A-Z, 0-9 and a few punctuation marks.
        
           | miahi wrote:
           | But they do have the most extensive transliteration rules
           | library to match everything to that limited character set
           | (ICAO Doc 9303[1]) that is used by many systems outside the
           | aviation world.
           | 
           | [1] https://www.icao.int/publications/pages/publication.aspx?
           | doc...
        
         | fovc wrote:
         | What about sorting users by name?
        
           | phonebanshee wrote:
           | That's completely language+locale dependent. For example,
           | here's an alphabetical list of Irish surnames -
           | https://www.duchas.ie/en/nom?txt=M. You'll notice that sort
           | order ignores an initial O or Mac (or Ni or Bean, etc).
        
         | rkangel wrote:
         | This is a classic case of a 'why' code comment being needed.
         | It's obvious what you're doing, but without a 2 line
         | explanation, it's not clear _why_.
        
           | kentonv wrote:
           | Yeah I probably wrote that comment the first few times I did
           | this but it's hard to write it the 50th time.
           | 
           | Maybe I should have my own tolower() function that I can call
           | so I only have to write the comment once but it just feels
           | ridiculous somehow.
        
             | Izkata wrote:
             | #include <kentonv.h>
        
             | random314 wrote:
             | Why does it feel ridiculous?
        
               | kentonv wrote:
               | Because I've already rewritten more of the standard
               | library than is healthy.
               | 
               | I mean, it's clearly the right thing to do here but I can
               | predict the conversation that will inevitably result...
               | "You wrote your own tolower() function? Why?" "The
               | standard one is horribly broken." "How could a function
               | that lower-cases a letter be broken??? Jesus Kenton your
               | NIH syndrome is out of control." "Sigh..."
               | 
               | (Slightly more seriously, any particular time I need to
               | lower-case something, it takes 10 seconds to write out
               | the code, but would take 10 minutes to find a good place
               | to define a reusable function and exactly what its API
               | should be, and so it never seems worth the effort in the
               | moment. Just like how most messy code comes to be.)
        
               | nitrogen wrote:
               | Most codebases I've worked with have a StringUtils.java,
               | or .kt, or a str.c or utils.c. Maybe just start one.
               | Interestingly I haven't needed it as much in Ruby.
               | 
               | But I too feel the cognitive (and social!) burden of
               | introducing a new function. It's not just "where do I put
               | this", but "how do I convince the team I know what I'm
               | doing since 15 years of experience clearly isn't enough
               | and developers (mostly rightly) ignore positional
               | authority and seniority".
        
             | Natsu wrote:
             | It's far more ridiculous to repeat yourself over and over
             | instead of making a simple function that describes exactly
             | what you want and why.
        
             | eitland wrote:
             | Write the function, comment it!
             | 
             | Many of these are obvious to many people here, but some
             | aren't.
             | 
             | Even I can admit that some of the stuff in this thread is
             | not obvious at all.
        
           | dmurray wrote:
           | Seems like it would be even better to put this in its own
           | function with a descriptive name, ascii_tolower or
           | roman_tolower or whatever, that has exactly the semantics you
           | want.
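
A sketch of such a named helper, written in Python for brevity. The name ascii_lower and its docstring are illustrative, not an existing API. The point is that the name and the comment carry the "why", so a later cleanup is less likely to swap in a locale- or Unicode-aware call.

```python
def ascii_lower(text: str) -> str:
    """Lower-case only A-Z and leave every other code point untouched.

    Intended for protocol identifiers (e.g. HTTP header names), where case
    folding must not depend on locale or on Unicode special cases.
    """
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in text)

print(ascii_lower("Content-Type"))   # content-type
print(ascii_lower("İSTANBUL"))       # İstanbul  (non-ASCII İ is left alone)
print(len("İSTANBUL".lower()))       # 9  (str.lower() expands İ to i + U+0307)
```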
        
             | gregmac wrote:
             | This is exactly right, and is a great example of what
             | self-documenting code can be. The function itself could
             | have a bit more explanation but any code calling it is
             | going to be obvious.
             | 
             | The big difference is it _looks deliberate_ , instead of
             | just code written by someone trying to micro-optimize, be
             | very clever, or who just didn't realize tolower() exists.
             | Most people will pause before just replacing it, and
             | likewise it should trigger questions in the PR.
        
             | sedatk wrote:
             | C# has a `ToLowerInvariant()` variety for that.
        
               | paranoidrobot wrote:
               | Which iirc is an alias for ToLower on the en-us locale.
               | (Same for the other C# *Invariant() methods)
        
             | eitland wrote:
             | Still warrants a comment to be sure no one concludes the
             | built in is good enough.
        
         | a-nikolaev wrote:
         | I think, this is why you need explicitly ASCII and explicitly
         | Unicode lower-/upper-/capitalized-transformations. So you don't
         | assume these things to work automagically. Some times you need
         | one type, the other times you need the other type.
        
           | [deleted]
        
         | grumple wrote:
         | Now I'm wondering about what happens when we change email
         | addresses to lowercase...
         | 
         | https://en.m.wikipedia.org/wiki/Email_address#Internationali...
        
           | sedatk wrote:
           | You shouldn't. Email addresses are case-sensitive.
        
         | happytoexplain wrote:
         | We frequently use localized upper/lower casing at my workplace,
         | as we do not store such stylization in user facing copy. Most
         | copy is written and translated in sentence case or title case
         | (because both are much harder to achieve programmatically), and
         | then our designers have the option of using that casing as-is,
         | or using all-upper or all-lower.
        
         | ramshorns wrote:
         | A char in C++ is one byte, right? Is it even possible for this
         | "fixed" code to call ctype::tolower() on something like a UTF-8
         | or UTF-16 code point?
        
           | kentonv wrote:
           | Correct, it won't even work as intended with modern Unicode
           | locales.
        
             | ramshorns wrote:
             | So maybe if the code is broken anyway for non-ASCII
             | characters, it's fine to use tolower, since somewhere else
             | in the code it ensures that c is a byte.
        
               | kentonv wrote:
               | The code is not broken for non-ASCII characters. UTF-8
               | works just fine with 8-bit chars, and the code I wrote
               | correctly lower-cases ASCII letters even when UTF-8 is
               | present (it just won't touch the UTF-8 chars, which is
               | fine in this use case).
               | 
               | It's only tolower() and toupper() specifically that are
               | broken because they expect to be able to do their job on
               | a single byte, which is no longer possible with UTF-8.
               | 
               | Meanwhile, using tolower() to lower-case an HTTP header
               | name won't give you the correct results if the locale is
               | set to Turkish with the ISO 8859-9 character set, which
               | is 8-bit, and where tolower('I') will produce the byte
               | 0xFD which is 'i' in this character set.
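
The byte-level claims above can be checked with a short Python sketch. The helper ascii_lower_bytes and the header name are hypothetical stand-ins for the hand-written C++ loop and its input. In ISO 8859-9 the dotless ı is the single byte 0xFD, whereas in UTF-8 every byte of a non-ASCII character is 0x80 or higher, so an A-Z-only loop never touches it.

```python
# ISO 8859-9 (Latin-5) is a single-byte charset with both Turkish i's.
print("ı".encode("iso8859_9"))  # b'\xfd'
print("İ".encode("iso8859_9"))  # b'\xdd'

def ascii_lower_bytes(data: bytes) -> bytes:
    """Hypothetical helper: lower-case only the bytes 'A'..'Z' (0x41-0x5A)."""
    return bytes(b + 32 if 0x41 <= b <= 0x5A else b for b in data)

# Made-up header name containing a non-ASCII letter, encoded as UTF-8.
header = "X-İstanbul-ID".encode("utf-8")

print(all(b >= 0x80 for b in "İ".encode("utf-8")))  # True
print(ascii_lower_bytes(header).decode("utf-8"))    # x-İstanbul-id
```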
        
               | ramshorns wrote:
               | I see, thanks for the explanation.
        
         | cesarb wrote:
         | Java has two variants of toLowerCase(): one which uses the
         | default/current locale (almost never what you want), and one
         | which receives an explicit locale (Locale.ROOT is almost always
         | the one you want). At work, we use the "forbidden APIs" checker
         | (https://github.com/policeman-tools/forbidden-apis) to fail the
         | CI if the variant which uses the default locale is ever used;
         | if you really want to use a locale-dependent toLowerCase(), you
         | have to explicitly call Locale.getDefault() and use it as the
         | locale.
         | 
         | Is there something similar for C and C++? It could help in your
         | case, by making your well-meaning colleagues aware of the
         | issue.
        
           | vesinisa wrote:
           | > Locale.ROOT is almost always the one you want
           | 
           | At least Android developers are advised to use Locale.US:
           | https://developer.android.com/reference/java/util/Locale
           | 
           | > The default locale is not appropriate for machine-readable
           | output. The best choice there is usually Locale.US - this
           | locale is guaranteed to be available on all devices, and the
           | fact that it has no surprising special cases and is
           | frequently used
           | 
           | It would be indeed interesting to see in which features these
           | two locales actually differ.
        
           | rvnx wrote:
           | Yes, tolower_l(c, locale)
        
         | __s wrote:
         | I recently ordered a Pixel, on the mail slip they had converted
         | my name to uppercase, last name read "DUBe"
         | 
         | Also got my address screwed up on account of living at a half
         | address.. 1/2 some street #42
        
         | smnrchrds wrote:
         | > _Has anyone here ever had a use case for toLower() where they
         | actually wanted localization to apply?_
         | 
         | On many documents, including Turkish passport and identity card
         | and many (all?) other passports, names are written in all caps.
         | Maybe toLower() is not that useful, but toUpper() is crucial in
         | any application where you are dealing with real person names.
        
           | phonebanshee wrote:
           | toUpper is definitely language-dependent. For example, in
           | Irish there are initial letters that are written as lower-
           | case even in all caps. Wikipedia's example is amusing, since
           | it's a photo of a government passport office sign - the all-
           | caps version of Oifig na bPasanna is OIFIG NA bPASANNA (photo
           | https://upload.wikimedia.org/wikipedia/commons/thumb/f/f8/AL.
           | .., article https://en.wikipedia.org/wiki/Irish_orthography#C
           | apitalisati...). It would look utterly bizarre to write OIFIG
           | NA BPASANNA. And this isn't at all an unusual construction in
           | Irish, it happens in personal names all the time.
           | 
           | Plus, there's the issue of diacritical marks. Irish keeps
           | long marks over capitals, but French drops accents. Do you
           | plan to do é => É (required for Irish - POBLACHT NA hÉIREANN
           | is the all-caps version of Poblacht na hÉireann [the Republic
           | of Ireland]) or é => E (common practice for French)? You have
           | to get it right, and you have to know the language to do
           | that. (Poblacht na hÉireann also illustrates the fact that
           | initial caps is also a language-dependent idea; you
           | absolutely can't write Poblacht Na Héireann - that makes my
           | eyes burn just looking at it.)
           | 
           | (And before you say, well, Irish isn't a language spoken by
           | very many people, remember that it's an official language of
           | the European Union. If you're writing software to be used by
           | EU agencies, you're going to have to care.)
        
             | smnrchrds wrote:
             | > _French drops accents_
             | 
             | Official position of both _Académie française_ and _Office
             | québécois de la langue française_ is that accents must be
             | preserved in capital letters. However, it is common in
             | France to drop them, while they are almost always preserved
             | in Quebec. I have heard that the reason is that European
             | French keyboard layout makes it difficult to type accented
             | capital letters, unlike Quebecois French layout which makes
             | writing them easy. But I am not sure if this is the cause
             | rather than the effect of the practice.
        
               | ccccc0 wrote:
               | French here, I recall my primary school textbook where
               | they said something along the lines of "sometimes accents
               | are dropped, that's sort of fine as long as it doesn't
               | change the meaning". They gave the example of a
               | fictitious newspaper whose headline was "UN POLICIER
               | TUE": depending on the accent (tue/tué) it means either
               | "a policeman kills" or "a policeman killed".
        
               | hocuspocus wrote:
               | Wrongly dropping accents on uppercase letters predates
               | computer keyboards; the French azerty layout puts
               | accented letters on the first level of number row:
               | 
               | http://j.poitou.free.fr/pro/img/tkn/tw-image.jpg
               | 
               | This idiocy carried over. The recent layout update makes
               | dead accents more accessible:
               | 
               | https://norme-azerty.fr/
               | 
               | But I haven't seen much adoption yet.
        
               | zorked wrote:
               | That button to the right of the P that contains four
               | different forms of dashes is... interesting.
               | 
               | Even more if you consider that the minus sign there is
               | not the character - that is used by every programming
               | language.
        
               | masklinn wrote:
               | > That button to the right of the P that contains four
               | different forms of dashes is... interesting.
               | 
               | And there are two more on the 8 key.
               | 
               | I like the mac international keyboard layout, but it
               | still only provides for 4 of those: the non-breaking
               | hyphen and the "proper" minus sign are lacking.
               | 
               | I like that the "new azerty" provides for pretty much
               | every diacritic, even those which are not in use in
               | french.
        
               | phonebanshee wrote:
               | Interesting that I was wrong - another data point in the
               | "it's more complicated than you think it is" column. I
               | always thought you were supposed to drop them (because I
               | was explicitly told so by a French engineer I worked with
               | in the 90s, talking about one particular poster, and many
               | years later still assume that one hallway conversation
               | was enough to make that THE OFFICIAL RULE without
               | bothering to actually check...)
        
               | forty wrote:
               | I confirm that in French capital letters should have
               | accents.
               | 
               | I have an anecdote on this: on birth certificates, family
               | names are written in capital letters. It turns out my
               | partner's name ends with an É which was written as E in her
               | birth certificate. She never noticed (it had never
               | prevented her from getting a national ID with her name properly
               | accented) until we had our first kid, who has both our
               | names, and they refused to have the name accented until
               | we had my partner's birth certificate updated (which as
               | you can imagine is quite an adventure, since you need to
               | dig ancient family birth certificates to prove it was
               | originally written with an accent...).
        
         | mehrdadn wrote:
         | > Has anyone here ever had a use case for toLower() where they
         | actually wanted localization to apply?
         | 
         | How do you lowercase _without_ localization? Remember all text
         | isn't English. Unless you're actually asking if anyone has
         | ever had a use case for lower-casing non-English text?
        
         | karmakaze wrote:
         | And a note that it assumes ASCII. On an EBCDIC system, the
         | 'A'-'Z' test will translate other characters besides letters.
        
       | iforgotpassword wrote:
       | Wouldn't converting to nfkd/c first solve this issue too? My
       | understanding of those forms was that they're made exactly for
       | this case.
        
         | brewmarche wrote:
         | Case mapping and case folding are independent of normalization
         | (in practice and it is the case here, see the end of
         | SpecialCasing.txt)
         | 
         | There is a good Unicode FAQ on the topic: <
         | http://unicode.org/faq/casemap_charprop.html >
         | 
         | E: to elaborate, I'm not sure whether the independence of case
         | handling and normalization is guaranteed anywhere, and if we
         | for example were to change the uppercase of s to something else
         | than S then its compatibility forms' (s) case handling would
         | differ. In practice the SpecialCasing.txt is designed to "make
         | it work" (e.g. s uppercases to S).
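
A short Python check of the point that normalization does not resolve the Turkish i problem, using only the standard unicodedata module. Dotless ı (U+0131) has no decomposition, so no normalization form turns it into ASCII i, and İ (U+0130) only decomposes into I plus a combining dot.

```python
import unicodedata

forms = ("NFC", "NFD", "NFKC", "NFKD")

# Dotless ı survives every normalization form unchanged.
dotless_unchanged = all(unicodedata.normalize(f, "ı") == "ı" for f in forms)
print(dotless_unchanged)                              # True

# Capital dotted İ decomposes to "I" + U+0307 (combining dot above).
print(unicodedata.normalize("NFKD", "İ") == "I\u0307")  # True

# So a Turkish-lowercased "ınfo" still does not equal "info" after NFKC.
print(unicodedata.normalize("NFKC", "ınfo") == "info")  # False
```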
        
         | jwilk wrote:
         | No, these are ASCII strings, so they are already normalized.
        
           | iforgotpassword wrote:
           | Oh, I haven't used python much, but I thought it's all
           | Unicode? If this were ascii it would work out of the box
           | since there is no dotless lowercase i in ascii.
        
             | estebank wrote:
             | There is no code point for TURKISH LOWERCASE DOTTED I nor
             | for TURKISH UPPERCASE DOTLESS I, which means that the text
             | doesn't carry enough information for roundtrip
             | preservation.
             | 
             | I believe this has proven to be a mistake but I'm not an
             | expert. I don't know _why_ it wasn't done.
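
The round-trip loss can be seen with Python's locale-independent str methods, which apply Unicode's default case mappings. This is only a sketch of the default behaviour, not of any Turkish-aware library. Because Turkish shares the ASCII code points for i and I, the default mappings cannot bring ı or İ back.

```python
# Default (language-neutral) Unicode case mappings, as used by Python's str.
print("ı".upper())                   # I
print("ı".upper().lower())           # i   (dotless ı did not survive)

lowered = "İ".lower()                # "i" followed by U+0307 combining dot
print(len(lowered))                  # 2
print(lowered == "i\u0307")          # True
print(lowered.upper() == "İ")        # False (comes back as "I" + U+0307)
```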
        
       | FrontAid wrote:
       | Changes to the casing might also change the value's length. E.g.
       | uppercasing the German ß will transform it to SS. Example using
       | JavaScript:
       | 
       | 'ß'.toUpperCase(); // returns 'SS'
       | 
       | https://en.wikipedia.org/wiki/%C3%9F
        
         | dathinab wrote:
         | Which can be both correct and wrong depending on context.
         | 
         | Normally there is no such thing as a capital ß, so it was
         | decided that if for some unreasonable reason you do uppercase
         | it you go with SS.
         | 
         | But for some all-caps usages this is not right, e.g. the
         | all-caps name of a restaurant placed above the restaurant's
         | door. In such cases it was common to keep a ß in the all-caps
         | name, like FOOßBAR. For reasons like this we now have an
         | (EDIT: semi?) official uppercase ß.
         | 
         | So all in all, this and similar examples in other languages
         | mean you should never do a case-insensitive comparison by
         | upper/lower casing both sides; it won't work reliably.
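
A minimal Python 3 sketch of the point above: lowercasing both sides
misses ß versus SS, while Unicode case folding (str.casefold()) catches
it. The helper names caseless_key and caseless_equal are made up for
illustration. Case folding is locale-independent, so this does not apply
the Turkish dotted/dotless rules.

```python
import unicodedata


def caseless_key(text: str) -> str:
    """Hypothetical helper: NFD, then case fold, then NFD again."""
    folded = unicodedata.normalize("NFD", text).casefold()
    return unicodedata.normalize("NFD", folded)


def caseless_equal(left: str, right: str) -> bool:
    """Compare two strings without regard to case."""
    return caseless_key(left) == caseless_key(right)


# Lowercasing both sides keeps the sharp s and misses the match.
print("Straße".lower() == "STRASSE".lower())   # False
# Case folding maps ß to "ss", so the comparison succeeds.
print(caseless_equal("Straße", "STRASSE"))     # True
# Uppercasing changes the length: one character becomes two.
print(len("ß"), len("ß".upper()))              # 1 2
```

The normalize/fold/normalize order follows the usual recipe for a
canonical caseless match. Whether that is the right notion of equality
still depends on what the strings are used for.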
        
         | schoen wrote:
         | There is apparently a multi-decade controversy about that:
         | 
         | https://en.wikipedia.org/wiki/Capital_%E1%BA%9E
         | 
         | (with German language authorities recently endorsing the idea
         | that ß can have a distinctive uppercase form "ẞ")
        
         | [deleted]
        
       | seqizz wrote:
       | Yeah, there were some weird bugs about that. I remember one in a
       | media player. Also "info".upper() would be İNFO probably.
        
       | cazim wrote:
       | http://www.moserware.com/2008/02/does-your-code-pass-turkey-...
       | 
       | This is old but still valid reading...
        
       | shagmin wrote:
       | I learned about this in javascript when I discovered Angular has
       | its own lowercase method. Apparently it's internal only now.
       | 
       | https://github.com/angular/angular.js/commit/1daa4f2231a89ee...
        
       | decafbad wrote:
       | Please stop doing this. Don't bind lower() upper() functions to
       | environment variables or anything else system related. Sun did
       | this in Java and doesn't even bother to mention the issue in
       | documents. It caused huge problems for more than a decade.
       | 
       | You can just make string lowercase() uppercase() function work
       | the same everywhere, regardless of locale settings. Provide a
       | special case function lowercaseTR() or so. This works very well
       | in Go.
       | 
       | By the way, Azerbaijan has the same problem because they accepted
       | help from the wrong guys when they switched to Latin.
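
A rough Python 3 sketch of the "locale-free default plus an explicit
Turkish variant" idea. The names lower_tr and upper_tr are hypothetical,
not part of any standard library, and the sketch assumes precomposed
input (it does not handle I followed by a combining dot above).

```python
def lower_tr(text: str) -> str:
    """Hypothetical helper: lowercase using Turkish/Azerbaijani i rules."""
    # Map the two Turkic capitals first, then let str.lower() do the rest.
    return text.replace("I", "\u0131").replace("\u0130", "i").lower()


def upper_tr(text: str) -> str:
    """Hypothetical helper: uppercase using Turkish/Azerbaijani i rules."""
    # Dotted i keeps its dot; dotless ı already uppercases to I.
    return text.replace("i", "\u0130").upper()


# The built-in method ignores the process locale in Python 3.
print("INFO".lower())      # info
# The explicit Turkish variants apply the dotted/dotless pairing.
print(lower_tr("INFO"))    # ınfo
print(upper_tr("info"))    # İNFO
```

Keeping the default locale-free and making the Turkish behaviour an
explicit call means identifiers such as log level names are never
affected by the user's settings.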
        
         | netsharc wrote:
         | > lowercaseTR()
         | 
         | Huh, that works well if we know the input string is in Turkish.
         | What if this information is not available as you're writing the
         | code?
         | 
         | And what will lowercase()/uppercase() be hard coded to do, and
         | what are they supposed to output when the input isn't ASCII?
        
       | alkonaut wrote:
       | Repeat after me: don't do string operations without explicit
       | locale. Don't do string operations without explicit locale.
       | 
       | I don't know why so many languages have string functions that
       | should take a locale but provide an overload that doesn't and
       | which uses the _system_ locale as the default. It can't be what
       | many developers actually want, yet it has become the norm. Worse,
       | code using a default locale _appears_ to work on the developer's
       | machine and in production, until someone parses a number in
       | France or lowercases a string in Turkey, which is a late and
       | expensive discovery of the bug.
       | 
       | The default shouldn't be the system locale, it should be an
       | invariant locale. And I'll go so far as arguing this invariant
       | locale should be invariant across systems (meaning it can't just
       | defer to a system C library either).
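
One way to read "explicit locale" for number parsing, sketched in
Python: the caller names the separators instead of inheriting them from
the system. parse_decimal and its parameters are hypothetical names used
only for this illustration.

```python
from decimal import Decimal


def parse_decimal(text: str, decimal_sep: str = ".", group_sep: str = "") -> Decimal:
    """Hypothetical helper: parse a number with explicitly named separators."""
    cleaned = text.strip()
    if group_sep:
        cleaned = cleaned.replace(group_sep, "")
    if decimal_sep != ".":
        # A leftover "." would silently change the value, so reject it.
        if "." in cleaned:
            raise ValueError(f"unexpected '.' in {text!r}")
        cleaned = cleaned.replace(decimal_sep, ".")
    return Decimal(cleaned)


print(parse_decimal("1.3"))                                      # 1.3
print(parse_decimal("1,3", decimal_sep=","))                     # 1.3
print(parse_decimal("1.234,5", decimal_sep=",", group_sep="."))  # 1234.5
```

The default arguments play the role of the invariant locale: the result
for a given input never depends on where the code runs.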
        
         | madeofpalk wrote:
         | I ran into this with C#/.NET on Windows - I tried to convert a
         | string "1.3" to the float 1.3, and it failed on languages that
         | use comma as their decimal separator.
         | 
         | That was a learning experience.
        
           | alkonaut wrote:
           | Indeed. As a person from a comma country, I find these
           | mistakes in most code bases I look at. It makes it
           | frustrating to contribute to open source, for example.
           | 
           | Perhaps it'll make you feel better about your parsing bug
           | that even the C# compiler (Roslyn) code base had several of
           | these issues.
        
       | garydgregory wrote:
       | See also https://garygregory.wordpress.com/2015/11/03/java-
       | lowercase-...
        
       | sedatk wrote:
       | Note to the next language designer: don't use strings as a
       | substitute for enums.
        
         | teddyh wrote:
         | It might be OK if strings are immutable and therefore
         | internable.
        
       | TazeTSchnitzel wrote:
       | The PHP interpreter has an internal reimplementation of string
       | case conversion that's ASCII-only in order to avoid this problem.
        
         | asddubs wrote:
         | doesn't php have this exact problem with their case-insensitive
         | (hate that btw) function/method names and turkish localization?
         | or did they actually fix it at some point?
        
       | stevoski wrote:
       | For a similar reason, Java on Mac and Linux was briefly broken
       | for anyone using it in the Turkish locale. It was because in the
       | Turkish locale, !"POSIX".toLowerCase().equals("posix").
       | 
       | Relevant bug report here:
       | https://bugs.openjdk.java.net/browse/JDK-8047340
        
       | anticensor wrote:
       | Correct: you would get "ınfo", "warnıng" and "crıtıcal" in
       | Turkish and in Azerbaijani.
        
         | mapgrep wrote:
         | Further context:
         | 
         | https://en.m.wikipedia.org/wiki/Dotted_and_dotless_I
         | 
           | Did not know Istanbul is actually İstanbul.
        
           | gvx wrote:
           | Me neither. I did know it's not Constantinople, though.
        
       | chihuahua wrote:
       | I remember running into problems with SQL stored procedures where
       | column and table names were case-insensitive, so you don't know
       | if you've properly typed all the column and table names. Until a
       | customer in Turkey eventually installs it and you find out you've
       | missed the proper capitalization of an identifier containing the
       | letter "I", and the stored procedure fails.
        
         | Pxtl wrote:
         | Honestly, I'm very pro case-insensitivity, but my experience
         | with SQL servers has impressively demonstrated how _not_ to do
         | it.
         | 
         | For example, MS SqlPackage, used for deploying schema, is case-
         | insensitive... But that also means changes to text constants
         | within your stored procs do not get treated as changes.
        
         | heavenlyblue wrote:
         | This is what I usually think about whenever people say yay to
         | Unicode in language identifiers.
        
           | formerly_proven wrote:
           | "I" is in ASCII.
        
       | tantalor wrote:
       | https://bugs.python.org/issue1524081
       | 
       | > KeyError: 'Info'
        
       | tryauuum wrote:
       | Unrelated story about Russian language.
       | 
       | The first letter of the Russian alphabet is А, the last one is Я.
       | So it's natural to try to match Russian words with '[А-Яа-я]+'.
       | But this is a recipe for disaster: this regexp doesn't match
       | words with 'ё' in them, like "Артём".
       | 
       | This is due to the fact that regexp ranges work on code point
       | values. All letters of the Russian alphabet have neatly ordered
       | code points, except for Ё.
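
The same effect can be reproduced with Python's re module. This is a
small sketch, assuming the word is written with the precomposed letter
ё (U+0451); the variable names are arbitrary.

```python
import re

word = "Артём"

# А..Я and а..я are contiguous ranges, but Ё and ё sit outside both.
for ch in "АЯаяЁё":
    print(ch, hex(ord(ch)))

naive = re.compile(r"[А-Яа-я]+")
fixed = re.compile(r"[А-Яа-яЁё]+")

print(naive.findall(word))   # ['Арт', 'м']
print(fixed.findall(word))   # ['Артём']
```

Adding the two missing letters to the class is the smallest fix. It
still only covers Russian, not other languages written in Cyrillic.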
        
         | blkhawk wrote:
         | not that unusual - for German for instance üöäÜÖÄß need to be
         | added so all words can be matched.
        
           | a3w wrote:
           | now, there is even a capital ß ;)
        
             | Forge36 wrote:
             | Out of curiosity I tried on my phone: ss
             | 
             | Ss
             | 
             | SS
             | 
             | So my phone doesn't have that yet!
        
               | maxpro wrote:
               | mine has it - ẞ the small one is ß
        
               | Igelau wrote:
               | Is the capital supposed to be shorter?
        
         | mikepurvis wrote:
         | This would be an argument for just using [:alpha:] everywhere;
         | presumably it does the correct thing based on locale?
        
           | tryauuum wrote:
           | No, alpha doesn't work, at least in "grep -P" with
           | "ru_RU.UTF-8" locale:
           | 
           |     $ echo Test | grep -oP '[[:alpha:]]+'
           |     Test
           |     $ echo Артём | grep -oP '[[:alpha:]]+'
           |     $ echo Артём | grep -oP '[А-Яа-я]+'
           |     Арт
           |     м
           | 
           | This thing works, though I've never seen one in the wild:
           | 
           |     $ echo Артём | grep -oP '[\p{Cyrillic}]+'
           |     Артём
        
         | Sharlin wrote:
         | English is probably the only commonly spoken language where
         | naïve char range matching _kind of sort of_ works. I say "kind
         | of sort of" because [a-zA-Z] trivially fails to match all words
         | in many English texts that haven't been lossily compressed to
         | ASCII, including this comment.
         | 
         | It is practically always wrong to match on [a-z] unless you're
         | parsing a computer language whose spec guarantees that it
         | works.
        
           | Izkata wrote:
           | Forget ascii conversion, that also fails on contractions like
           | "don't".
        
           | dheera wrote:
           | The easiest solution to this problem would be to just rename
           | it to "naive".
        
           | tryauuum wrote:
           | I always wanted to know, how easy is it to type naïve on a
           | common western keyboard?
           | 
           | Do you have to press some obscure keyboard shortcut?
        
             | dylz wrote:
             | "i or similar works. ^i, `i, 'i, etc. for the others.
        
             | reaperducer wrote:
             | _how easy is it to type naïve on a common western
             | keyboard?_
             | 
             | In macOS, you can either use Command-u (for "umlaut")
             | followed by i, or hold down the i key for a second and
             | press 2 to select the ï from the pop-up menu.
        
               | [deleted]
        
               | masklinn wrote:
               | > Command-u
               | 
               | option-u (aka alt-u).
               | 
               | Generally speaking, command is for application-level or
               | os-level commands, control is for text edition, and alt
               | is for alternate characters (all can be shifted and
               | command "overrides" the rest).
        
               | reaperducer wrote:
               | You're right, it's Option-u. Most of the key labels on my
               | MacBook have long since been scratched away.
               | 
               | This has happened with every single Apple keyboard I've
               | ever used. I suspect it's my fault, since I'm a key
               | pounder, having learned to type on an IBM Selectric
               | typewriter.
        
             | boring_twenties wrote:
             | On any Unix, just enable the "compose" key, then
             | <Compose>+"+i.
             | 
             | It's always something easy to remember, like " for umlauts,
             | o for circles (©, ®), obviously ' for accents (n) and
             | so on.
        
             | Izkata wrote:
             | On Ubuntu, I use xmodmap to turn Print Screen into a
             | Compose key. Then it's: <compose>"i
             | 
             | https://en.m.wikipedia.org/wiki/Compose_key
        
               | figomore wrote:
               | I use the Macintosh keyboard map in Linux. So I do <right
               | alt>+e to ', <right alt>+n to ~.
        
             | fnord123 wrote:
             | It is acceptable to write English without diacritics.
             | "Naive" is accepted.
        
             | madeofpalk wrote:
             | By default on a Mac you just hold down the key to get
             | different options, similar to on an iPhone (and I presume
             | touch-Android).
             | 
             | https://i.imgur.com/yuG063t.png
        
             | nemetroid wrote:
             | On a Swedish keyboard, there's a dead key for ", so you
             | press that followed by _i_ to get _ï_.
             | 
             | It's not very clear why the Swedish keyboard has that key,
             | since ä and ö each have their own keys. The layout has
             | other quirks as well, such as keys for §, ½ and the
             | useless "currency sign", ¤.
        
               | aliswe wrote:
               | Yes! I think the "mine" character should be switched for
               | the dollar sign.
               | 
               | BTW, the dead key could be from German, for writing their
               | Ü:s.
        
             | jakub_g wrote:
             | I'm Windows-based and wanted a keyboard layout that would
             | let me type Polish and French easily at the same time,
             | without switching keyboard layouts (PL == US+AltGr for
             | accents, while the FR layout is insane: apart from
             | being AZERTY, all special chars are in different places,
             | you need Shift to type numbers, and the way to type
             | accents is also special).
             | 
             | I found "Polish international" [1] layout which honestly
             | can be perfect for many people. It's optimized to be
             | compatible with regular Polish keyboard (hence with US
             | keyboard too), and maybe not the fastest if you type a lot
             | of special chars, but it's extremely intuitive:
             | 
             | ï = AltGr+:, i
             | 
             | ü = AltGr+:, u
             | 
             | é = AltGr+/, e
             | 
             | è = AltGr+\, e (since it's extremely common, also aliased
             | as AltGr+w)
             | 
             | If you're Windows based and want US-compatible keyboard
             | layout that allows easily typing any special chars, I
             | highly recommend it.
             | 
             | [1] https://translate.google.com/translate?sl=pl&tl=en&u=ht
             | tps%3...
        
               | 205guy wrote:
               | I type English and French in Windows on the same QWERTY
               | keyboard. I once learned to type on Azerty, but I mainly
               | type English now on a standard US keyboard layout. For
               | the French, I find the windows alt-numbers works the
               | easiest for accented characters. Alt-130=é, Alt-133=à,
               | Alt-135=ç, Alt-137=ë, Alt-138=è which covers 95% of the
               | accented character usage. I have a little chart next to
               | my desk with all the others (i,o,u) they're nearly all
               | Alt-14x and Alt-15x. And then I'll put é in the paste
               | buffer because it is the most used and a bit quicker that
               | way (for words like "préféré").
               | 
               | The Alt-13x codes are not as quick as the Azerty keys,
               | but good enough and once memorized are fairly easy with a
               | keyboard that has a keypad (most PCs do, even my laptop).
               | This is especially true because they are done with both
               | hands simultaneously, as opposed to something like
               | Cmd-e+e on a Mac. Actually, they are faster than finding
               | the accented characters on my QWERTY virtual keyboard as
               | I type this comment on iOS.
               | 
               | Those AltGr- combos seem complicated to me, I would much
               | prefer a system such as AltGr-e =é, then AltGr-ee=è,
               | AltGr-eee=ê, etc. To me that would be more intuitive than
               | remembering the composing character (slash for aigue,
               | etc).
        
               | cassepipe wrote:
               | You seem to be quite used to your Alt combination but as
               | you said they really are not straightforward. I found
               | another very simple solution, on Linux you can set a
               | compose key (typically Alt gr or the contextual menu
               | key). You type one after another, the compose key and
               | then any two keys that make sense like ' followed by e
               | (et vice versa), it will give you an é. It is both fast
               | and easy to work with.
        
               | jakub_g wrote:
               | Reminds me of when the 'D' key broke in my physical
               | keyboard long time ago. I liked that keyboard a lot and
               | couldn't find a good replacement so I learnt to type
               | Alt-100 to get 'd'.
        
       | mbostleman wrote:
       | Hence toUpper/toLower is not a strategy that passes the Turkey
       | Test for case insensitivity.
        
       | jwilk wrote:
       | Looks like it's no longer the case in Python 3:
       | 
       |     Python 3.7.3 (default, Jul 25 2020, 13:03:44)
       |     [GCC 8.3.0] on linux
       |     Type "help", "copyright", "credits" or "license" for more information.
       |     >>> from locale import *
       |     >>> setlocale(LC_ALL, 'tr_TR.UTF-8')
       |     'tr_TR.UTF-8'
       |     >>> 'INFO'.lower()
       |     'info'
        
         | xyst wrote:
         | Python 3.7.5 (default, Nov 5 2019, 22:30:48)
         | 
         | [Clang 11.0.0 (clang-1100.0.33.12)] on darwin
         | 
         | Type "help", "copyright", "credits" or "license" for more
         | information.
         | 
         | >>> from locale import *
         | 
         | >>> setlocale(LC_ALL, 'tr_TR.UTF-8')
         | 
         | 'tr_TR.UTF-8'
         | 
         | >>> 'INFO'.lower()
         | 
         | 'info'
         | 
         | >>> ' [?]'.lower()
         | 
         | '\u200d[?]'
         | 
         | >>> exit()
         | 
         | There's something wrong with emojis + lower() though
        
           | Dylan16807 wrote:
           | It lowercased the 'show this as emoji' variation selector to
           | zero width joiner?
        
         | anderskaseorg wrote:
         | Oddly, it also wasn't the case for Python 2 Unicode strings
         | (u'INFO'), only for Python 2 byte strings ('INFO'). So it's
         | possible that Python 3 lost this behavior by accident.
        
       | scrollaway wrote:
       | I've long thought programming languages need a "localizable
       | string" (Aka user-facing string) type, different from regular
       | utf8 strings. Something like what gettext and other i18n
       | libraries fake for you, but native to the language.
       | 
       | Behaviour like this is definitely a good reason why: sorting,
       | changing case, etc should be consistent when dealing with strings
       | used as constants and identifiers, but Python's .lower()
       | behaviour makes sense in a localizable string context.
        
         | lazulicurio wrote:
         | Along similar lines, I've thought that it would be useful if
         | Unicode included language marks (i.e. codepoints to identify
         | blocks of text as being written in a specific language). It
         | would be strictly more useful than the barebones left-to-
         | right/right-to-left marks (U+200E/U+200F) when deciding how to
         | process and display text. And it would be a step towards
         | correcting the mess that was Han unification.
        
           | jwilk wrote:
           | See RFC 2482 -- Language Tagging in Unicode Plain Text:
           | 
           | https://tools.ietf.org/html/rfc2482
           | 
           | But it was deprecated later on:
           | 
           | https://tools.ietf.org/html/rfc6082
        
             | lazulicurio wrote:
             | Interesting. Unfortunate that the deprecation notice
             | doesn't include much rationale. I found at least one mail
             | thread about it[1], which seems to confirm that the main
             | thought was that semantic information about text should be
             | handled at a higher layer (e.g. XML). I can understand that
             | argument for a general purpose tagging mechanism, but
             | language and glyphs are strongly semantically linked.
             | 
             | (Somewhat ironically, the previous thread on that mailing
             | list is about the struggles of case folding in a general
             | fashion across multiple language scripts[2])
             | 
             | Edit: I also found [3], which offers the following:
             | 
             | ----
             | 
             | - Most of the data sources used to assemble the documents
             | on the Web will not contain these characters; producers, in
             | the process of assembling or serializing the data, will
             | need to introspect and insert the characters as needed--
             | changing the data from the original source. Consumers must
             | then deserialize and introspect the information using an
             | identical agreement. The consumer has no way of knowing if
             | the characters found in the data were inserted by the
             | producer (and should be removed) or if the characters were
             | part of the source data. Overzealous producers might
             | introduce additional and unnecessary characters, for
             | example adding an additional layer of bidi control codes to
             | a string that would not otherwise require it. Equally, an
             | overzealous consumer might remove characters that are
             | needed by or intended for downstream processes.
             | 
             | - Another challenge is that many applications that use
             | these data formats have limitations on content, such as
             | length limits or character set restrictions. Inserting
             | additional characters into the data may violate these
             | externally applied requirements, and interfere with
             | processing. In the worst case, portions (or all of) the
             | data value itself might be rejected, corrupted, or lost as
             | a result.
             | 
             | - Inserting additional characters changes the identity of
             | the string. This may have important consequences in certain
             | contexts.
             | 
             | - Inserting and removing characters from the string is not
             | a common operation for most data serialization libraries.
             | Any processing that adds language or direction controls
             | would need to introspect the string to see if these are
             | already present or might need to do other processing to
             | insert or modify the contents of the string as part of
             | serializing the data.
             | 
             | ----
             | 
             | Other than #3 (the one about string identity), I find these
             | wholly unpersuasive. And even #3 isn't even that great a
             | reason considering that programmatic processors have to
             | deal with that issue anyway due to case folding.
             | 
             | [1] https://www.unicode.org/mail-arch/unicode-
             | ml/y2010-m11/0039....
             | 
             | [2] https://www.unicode.org/mail-arch/unicode-
             | ml/y2010-m11/0038....
             | 
             | [3] https://www.w3.org/TR/string-meta/
        
           | Ericson2314 wrote:
           | What this gets right down to is that Unicode is a flawed
           | idea: the meaning/behavior/whatever of characters is insanely
           | dependent on their context.
           | 
           | The problem was never gazillions of code pages, but our
           | inability to write C to deal with that amount of complexity
           | circa 1990.
           | 
           | With modern machines, and good programming languages with
           | good type systems, I absolutely think we could store a
           | language per string, and concatenate into a polylinguistic
           | rope if needed.
           | 
           | This would hopefully push us away from stringly-typed crap in
           | general.
        
             | throwaway_pdp09 wrote:
             | > the meaning/behavior/whatever of characters is insanely
             | dependent on their context
             | 
             | I wish you would give an example instead of just
             | proclaiming crapness. You know, so we n00bs can learn
             | something.
        
               | toast0 wrote:
               | Different languages have different rules for changing case
               | (as seen here) or for transliterating to 7-bit ASCII: in
               | French, you can mostly drop accents if you need to; in
               | German, you need to transform an umlaut to an e following
               | the vowel. Of course, many languages don't have a way to
               | transliterate to 7-bit ASCII.
               | 
               | Sorting of strings is language dependent, but I don't
               | know that there's a defined order for mixed language
               | lists, so I guess user's context works if you're sorting
               | for user purposes, but if you're sorting for machine
               | purposes, you better not use the locale aware sort
               | without telling it a hardcoded locale that doesn't change
               | between localization library versions.
        
               | throwaway_pdp09 wrote:
               | @toast0, @lazulicurio, both of your points seem to
               | illustrate the complexities of the languages, not
               | "...that Unicode is a flawed idea" as the original poster
               | said. AFAIKS this is intrinsic complexity showing itself
               | and does not make any indication of how it should be done
               | correctly, or better.
        
               | Ericson2314 wrote:
               | The benefit of looking at languages/scripts in isolation
               | is that the _combinatorial explosion_ of all languages
               | /scripts at once is dodged.
               | 
               | E.g. lookalike characters, and social engineering by using
               | a vs а. (One is Cyrillic). I don't want to even _define_
               | "a == а". I want Latin and Cyrillic to be different types
               | of characters, and that expression to be ill-typed.
               | 
               | This solves the Turkish problem, where the upper case I
               | is two different characters in two different types (Turkish
               | Roman script?), and the case folding functions likewise
               | have disjoint types.
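
Python has no script-typed characters, but the lookalike problem can at
least be made visible at runtime. The sketch below guesses the script
from the first word of each character's Unicode name, which is a crude
heuristic and not a real script property lookup; scripts_used is a
hypothetical helper name.

```python
import unicodedata


def scripts_used(text: str) -> set:
    """Hypothetical helper: rough script guess from Unicode character names."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            # Names look like "LATIN SMALL LETTER A" or "CYRILLIC SMALL LETTER A".
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


latin_a = "a"          # U+0061
cyrillic_a = "\u0430"  # U+0430, renders like the Latin letter

print(latin_a == cyrillic_a)                      # False
print(scripts_used("p" + cyrillic_a + "ypal"))    # both LATIN and CYRILLIC
```

A check like this can flag mixed-script identifiers, but it only
detects the mix after the fact instead of ruling it out by type.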
        
               | lazulicurio wrote:
               | > both of your points seem to illustrate the complexities
               | of the languages, not "...that Unicode is a flawed idea"
               | 
               | The flaw in Unicode is that it punts on the intrinsic
               | complexity---pretending that codepoints have language-
               | independent, plain-text, semantic meaning.
               | 
               | A couple of threads that have molded my views over time:
               | 
               |  _I can't write my name in Unicode_
               | https://news.ycombinator.com/item?id=9219162
               | (Specifically these two comments
               | https://news.ycombinator.com/item?id=9220530 and
               | https://news.ycombinator.com/item?id=9220970)
               | 
               |  _Why isn't the external link symbol in Unicode?_
               | https://news.ycombinator.com/item?id=23016832
        
               | Ericson2314 wrote:
               | > The flaw in Unicode is that it punts on the intrinsic
               | complexity---pretending that codepoints have language-
               | independent, plain-text, semantic meaning.
               | 
               | > Pretending "plain text" isn't an oxymoron
               | 
               | FTFY :)
        
               | lazulicurio wrote:
               | How about: case folding for the letter 'I' is dependent
               | on whether the locale is Turkish or not.
               | 
               | ;)
        
             | arcticbull wrote:
             | Unicode goes to great pains to avoid ascribing any
             | meaning/behavior/whatever to character sets. Because to
             | your point you can't. Unicode is actually incredibly well
             | thought out. That's why we have values, code points and
             | grapheme clusters. I don't think the Unicode standard even
             | defines casing except in the human-readable names ascribed
             | to code points.
             | 
             | If you want to build a polylinguistic rope you can
             | certainly do that with Unicode, but you won't have solved
             | anything because language alone without context doesn't
             | really define many of the operations you're describing.
             | 
             | The answer is usually the same as "doctor it hurts when
             | I..." -- stop doing it. Stop manipulating user input
             | without context. Stop trying to limit user visible strings
             | by character count, use pixel width in the rendered font.
             | And so on.
        
               | jfk13 wrote:
               | > I don't think the Unicode standard even defines casing
               | except in the human-readable names ascribed to code
               | points
               | 
               | Sure it does; the Unicode Character Database includes
               | fields for the lowercase, uppercase and titlecase
               | mappings. But it also acknowledges that these are just
               | default mappings, and may need to be tailored for
               | specific languages/locales.
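
Python 3's str methods appear to follow these default, untailored
mappings, which makes them easy to poke at. A few probes, assuming a
reasonably recent Python 3:

```python
# Full uppercase mapping: one character can become two.
print("ß".upper() == "SS")                  # True
# Titlecase is a separate mapping: dž (U+01C6) has its own titlecase form.
print("\u01c6".upper() == "\u01c4")         # True
print("\u01c6".title() == "\u01c5")         # True
# Default (untailored) lowercase of İ is i plus a combining dot above.
print("\u0130".lower() == "i\u0307")        # True
```

The last line is the default mapping that a Turkish tailoring would
replace with a plain i.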
        
               | Ericson2314 wrote:
               | Unicode is well thought out! And that's what makes it
               | hard to critique :). I think it's one of the best-
               | maintained, well-thought out standards there is, but I
               | still think the premise is wrong.
               | 
               | If all that good effort went into something along the
               | lines I am describing, where languages, or at least
               | scripts, cannot be arbitrarily mixed at the character
               | level, I think we would have an even better result with
               | the same level of effort.
        
           | kevin_thibedeau wrote:
           | Unicode supported this with tag sequences but that is
           | deprecated and unlikely to work with modern libs.
        
         | thomasahle wrote:
         | That would be great! For example, in Python you currently have
         | to do something like this
         | 
         |     import locale
         |     sorted(list_of_strings, key=locale.strxfrm)
         | 
         | to sort using the current locale, which many people forget.
        
         | DougBTX wrote:
         | Along the lines of this?
         | 
         | https://docs.microsoft.com/en-us/dotnet/api/system.globaliza...
        
           | layer8 wrote:
           | In Java, there is Locale.ROOT, which can be used in a similar
           | way. In particular, it is useful when performing locale-
           | dependent operations in locale-independent contexts (e.g.
           | working with case-insensitive identifiers) where you don't
           | want the behavior of your code to depend on the current
           | default locale.
        
           | wongarsu wrote:
           | .NET is one of the few ecosystems to get this right. It
           | offers the invariant culture for identifier-like things, "fr"
           | for French language and "fr-FR" for French language in
           | France, allowing you to specify your intention to every
           | string-modifying function.
           | 
           | Support at the type level would be a lot less verbose, but
           | support at the function level is already much better than
           | many other popular languages.
        
             | kanox wrote:
             | It would be great if strings and especially date-time
             | values always carried locale and timezone information with
             | them.
             | 
             | It would take slightly more memory but not significant on
             | modern machines.
        
               | wongarsu wrote:
               | Putting the locale information on the string sounds like
               | a good idea. However I'm not sure how that should handle
               | combined strings with components from different locales.
               | For example `logLevel + ": " + logMessage` might produce
               | "info: bağlantı kesildi" in Turkish. How to annotate
               | that? Neither English nor Turkish would work correctly,
               | each would produce the wrong result when uppercasing.
               | 
               | You could treat it as a series of string slices with
               | different locales `[("info", "en"), (": ", ""),
               | ("bağlantı kesildi", "tr")]`. That would work correctly,
               | and you could now uppercase each slice according to its
               | appropriate locale, but it wouldn't really be low
               | overhead anymore. Maybe still worth it. It would be an
               | interesting approach that might even be able to be
               | implemented pretty seamlessly as a library in some
               | languages (C++ or rust for example)
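
The slice idea above can be sketched in Python. This is only an illustration under stated assumptions: `upper_tr` and `upper_slices` are hypothetical helpers, the Turkish rule is hand-written (only i to İ is special-cased, because Python's built-in `str.upper()` applies the untailored default Unicode mappings), and only the exact tag "tr" is recognized. A real implementation would more likely delegate to a library such as ICU.

```python
from typing import List, Tuple

# A "localized string" as a list of (text, language tag) slices.
# An empty tag means "no language", e.g. punctuation or identifiers.
Slices = List[Tuple[str, str]]


def upper_tr(text: str) -> str:
    """Uppercase with the Turkish rule: dotted i becomes dotted capital İ."""
    # Python's default mapping already sends dotless ı (U+0131) to I,
    # so only the dotted i needs special handling before calling upper().
    return text.replace("i", "\u0130").upper()


def upper_slices(slices: Slices) -> str:
    """Uppercase each slice according to its own language tag."""
    out = []
    for text, lang in slices:
        if lang == "tr":
            out.append(upper_tr(text))
        else:
            out.append(text.upper())
    return "".join(out)


message = [("info", "en"), (": ", ""), ("bağlantı kesildi", "tr")]
result = upper_slices(message)

# Uppercasing the flattened string loses the dotted capital İ.
naive = "".join(text for text, _ in message).upper()
print(result)
print(naive)
```

The per-slice version keeps "INFO" in its English form and the message in its Turkish form, at the cost of carrying a tag with every fragment.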
        
           | scrollaway wrote:
           | That just seems to be a parameter for locale-dependent
           | functions. Very useful, but no, I'm talking about splitting
           | the unicode-string datatype in two: "user-facing unicode
           | string" vs "internal unicode string".
           | 
           | Example: logging.log("INFO", i"This is a localizable string")
           | 
           | In the i18n world, we could gather i-strings just like
           | gettext does (where it looks like `logging.log("INFO",
           | _("This is a localizable string"))`). The language could then
           | have other useful hooks/behaviours into that datatype, and
           | definitely one of them would be whether various methods have
           | i18n behaviour enabled on them, versus using a C locale.
        
       | maweki wrote:
       | As it isn't yet mentioned: for these cases the Python standard
       | library explicitly has
       | https://docs.python.org/3.8/library/stdtypes.html#str.casefo...
       | (str.casefold), which aggressively lowercase-normalizes strings
       | with an algorithm from the unicode standard. Every case
       | comparison using lower() instead of casefold() can be considered
       | a bug.
        
         | Alex3917 wrote:
         | > Every case comparison using lower() instead of casefold() can
         | be considered a bug.
         | 
         | If you just casefold two strings and compare them, it's still a
         | bug. You need to normalize them to NFKC first.
        
           | pas wrote:
           | Is NFKC necessary, isn't NFKD enough? (As in you have to
           | normalize and decompose both strings, but at that point you
           | can check them for equality, and doing the canonical
           | composition isn't needed, right?)
        
             | Alex3917 wrote:
             | I think that would work if you're just checking for
             | equality and want to minimize processing. I guess as a web
             | developer I always just assume people are going to be
             | storing strings in a database after normalizing them, so
             | would want to minimize string length.
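
A minimal sketch of the comparison discussed in this subthread, assuming NFKC as the normalization form (swapping in "NFKD" is a one-word change if only equality matters). `caseless_key` and `caseless_equal` are hypothetical helper names, and the Unicode standard's own caseless-matching definitions are more elaborate than this approximation.

```python
import unicodedata


def caseless_key(text: str) -> str:
    """Normalize, case-fold, then normalize again (folding can denormalize)."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return unicodedata.normalize("NFKC", folded)


def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings ignoring case and normalization differences."""
    return caseless_key(a) == caseless_key(b)


precomposed = "\u00e9"   # é as a single code point
decomposed = "E\u0301"   # E followed by COMBINING ACUTE ACCENT

# casefold() alone leaves the two spellings of é different.
naive_match = precomposed.casefold() == decomposed.casefold()
normalized_match = caseless_equal(precomposed, decomposed)
eszett_match = caseless_equal("Straße", "STRASSE")
```

Here `naive_match` is False while `normalized_match` and `eszett_match` are True, which is the bug Alex3917 describes: case folding by itself does not reconcile composed and decomposed forms.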
        
       | chippy wrote:
       | https://garygregory.wordpress.com/2015/11/03/java-lowercase-...
       | 
       | In the Turkish locale, the Unicode LATIN CAPITAL LETTER I becomes
       | a LATIN SMALL LETTER DOTLESS I. That's not a lowercase "i".
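
To make the mapping concrete, here is a small Python sketch. Note the assumptions: CPython 3's `str.lower()` does not consult the process locale, so the Turkish behaviour is emulated by a hypothetical helper `lower_tr`. The effect described in the thread shows up in runtimes whose lowercase operation uses the default locale, such as Java's `String.toLowerCase()` from the linked article.

```python
import unicodedata


def lower_tr(text: str) -> str:
    """Lowercase using the Turkish rule for the two I letters (sketch)."""
    # İ (U+0130) -> i and I -> ı (U+0131); everything else uses the default.
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()


default_result = "INFO".lower()      # locale-independent default mapping
turkish_result = lower_tr("INFO")    # emulated Turkish tailoring
first_letter_name = unicodedata.name(turkish_result[0])

print(default_result, turkish_result, first_letter_name)
```

The two results differ only in the first character, which is why a lookup keyed on "info" fails under the Turkish rule.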
        
       | [deleted]
        
       | geofft wrote:
       | In C (POSIX.1-2008, specifically), there's tolower_l() and the
       | rest of the _l functions for this use case, which take a locale
       | as an argument. That lets you ask for the English (or even "C
       | locale") lowercase versions of these English words, even when
       | your process's current locale is Turkish.
       | 
       | https://www.man7.org/linux/man-pages/man3/tolower_l.3.html
        
         | adamjb wrote:
         | The mention of _l functions reminded me of this gloriously over
         | the top git message/rant.
         | 
         | "Those not comfortable with toxic language should pretend this
         | is a religious text."
         | 
         | https://github.com/mpv-player/mpv/commit/1e70e82baa9193f6f02...
        
           | [deleted]
        
       | [deleted]
        
       | formerly_proven wrote:
       | ITT calling setlocale or std::locale::global(...) is ALMOST
       | ALWAYS a heinously bad idea and should rarely be done, because it
       | breaks tons of code (notably everything that uses printf/scanf
       | and everything using stringstream).
        
       | crazygringo wrote:
       | Serious question.
       | 
       | Why on earth would you hard-code these, instead of simply calling
       | a lowercase function in the en-US locale?
       | 
       | These are English words. Naively lowercasing them according to
       | whatever locale the server or user has set seems like a terrible
       | programming practice. Any call to a lowercase function should be
       | explicitly including an argument that specifies it's English, no?
       | 
       | In the same way we've all learned to never store times without an
       | explicit timezone (even if it's UTC), or locate a string offset
       | without knowing your encoding... you should never perform
       | language transformations (case changes, accent removal, etc.)
       | without a locale.
       | 
       | Hardcoding these things is just patching over the symptoms
       | without addressing the cause, no?
        
       | jaclaz wrote:
       | Only for the record, there is something very similar that may
       | happen when creating CD/DVD's (please read when using mkisofs and
       | similar), with the "dash" that when "capital" becomes underscore
       | (but not only) depending on the reference ISO
       | 9660/Joliet/RockRidge convention in use.
       | 
       | https://web.archive.org/web/20151007005513/http://www.911cd....
        
       | TwoBit wrote:
       | This particular case seems odd to me because INFO is an English
       | word, and info is not.
        
         | wongarsu wrote:
         | You could make a case that Unicode should have different "i"
         | characters for different languages. Then you could do all
         | transformations unambiguously. On the other hand almost
         | everyone abuses the minus sign as a dash, and treats the
         | apostrophe and the prime sign (signifying feet or minutes) as
         | interchangeable, so in all likelihood they would constantly use
         | the wrong i too.
        
           | kps wrote:
           | > On the other hand almost everyone abuses the minus sign as
           | a dash
           | 
           | Unicode calls it HYPHEN-MINUS. It does also have an
           | unambiguous '−' MINUS SIGN as well as '‐' U+2010 HYPHEN and
           | the various dashes, but most people use bad keyboard layouts.
        
           | josefx wrote:
           | > You could make a case that Unicode should have different
           | "i" characters for different languages.
           | 
           | And different "SS" for any case where the lowercase was an
           | sz, of course at some point Germany introduced an uppercase
           | SZ character to avoid that round trip loss issue, but we
           | still have tons of text that use the old sz -> SS conversion.
           | Also note that "y" in Germany, not all German speaking
           | countries follow the same rules for sz, some dropped it
           | entirely. We basically need something like the time zone
           | database to have even a snowball's chance in hell to handle
           | text correctly.
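
The round-trip loss is easy to reproduce with Python's built-in case mappings, which follow the default (untailored) Unicode rules. This is just a demonstration of those defaults for the sharp s (ß, the "sz" above), not a proposal for how German text should be handled.

```python
lower_word = "straße"
upper_word = lower_word.upper()   # the default mapping expands ß to SS
round_trip = upper_word.lower()   # the ß is gone for good

capital_eszett = "\u1e9e"         # LATIN CAPITAL LETTER SHARP S
# Lowercasing the capital form gives ß, but uppercasing ß still gives SS.
capital_round_trip = capital_eszett.lower().upper()

print(upper_word, round_trip, capital_round_trip)
```

Neither spelling survives an upper/lower round trip under the default mappings, which is the loss described above.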
        
           | heavenlyblue wrote:
           | Pretty sure that's not true. When you switch your keyboard
           | you will have a proper i character in another language unless
           | your keymap is broken. How do you think Chinese, Russians or
           | Greek type their characters?
        
             | tzot wrote:
             | The grandparent obviously meant "latin i"; none of the
             | three languages you mention have any latin letters, but at
             | least Russian and Greek have some lowercase and some more
             | uppercase letters with the same glyph/shape as latin ones.
        
               | heavenlyblue wrote:
               | Yeah, and those similar glyphs are not available on their
               | own language keyboard.
        
             | wongarsu wrote:
             | I frequently type German with a US layout with dead keys
             | (so I can type "a to get ä). I also imagine that most
             | Turkish developers type English on a Turkish layout, since
             | Turkish contains all characters used by English.
        
           | anticensor wrote:
           | I have a better solution: use combining characters COMBINING
           | DOT ABOVE (which already exists) and DELETE DOT ABOVE (which
           | needs to be added into Unicode), which would manipulate "I"
           | into "İ" and "i" into "ı" respectively. Those combining
           | characters would also work perfectly with j too.
        
             | estebank wrote:
             | The only issue I can see is with people working in a
             | Turkish locale writing Latin text producing, let's say
             | English blogposts with the wrong i and I. I still think
             | that this should have been done this way though...
        
               | anticensor wrote:
               | Indeed. LATIN SMALL LETTER I + DELETE DOT ABOVE becomes
               | LATIN CAPITAL LETTER I + DELETE DOT ABOVE in uppercase,
               | which then becomes LATIN SMALL LETTER I + DELETE DOT
               | ABOVE back in lowercase. The same thing applies to LATIN
               | CAPITAL LETTER I + COMBINING DOT ABOVE. Survives infinite
               | number of case conversions.
        
           | johnwalkr wrote:
           | Well a round-trip or two could still be ambiguous which could
           | easily fail when comparing strings later in some edge case.
           | Especially when we can't even agree to use by-application,
           | by-OS, by-language and by-locale settings consistently. I
           | don't have a solution, just pointing out that
           | this is a really challenging problem to fully solve.
        
       | mapgrep wrote:
       | Dumb question, if you _really_ need the exact string "info" in a
       | given context, why not hard code it? What does .lower() or even a
       | map like the linked one actually buy you?
        
         | nicoburns wrote:
         | Presumably it's for normalising input. Following the principle
         | that you ought to be permissive in what data you accept, and
         | strict in what data you give out.
        
         | simion314 wrote:
         | Maybe the input is case insensitive. For example, if you work
         | with HTML you might see "DIV" or "div", and who knows, some
         | crazy dev or tool might generate "DIv" or "dIv", so it is
         | simpler to lowercase the input and then work on it.
        
       | dependenttypes wrote:
       | Ah yes, locales. Everyone loves them https://github.com/mpv-
       | player/mpv/commit/1e70e82baa9193f6f02...
        
       | ramses0 wrote:
       | ObTurkeyTest: http://www.moserware.com/2008/02/does-your-code-
       | pass-turkey-...
        
       ___________________________________________________________________
       (page generated 2020-08-16 23:00 UTC)