[HN Gopher] Why can't you reverse a string with a flag emoji?
___________________________________________________________________
 
Why can't you reverse a string with a flag emoji?
 
Author : da12
Score  : 106 points
Date   : 2022-01-27 18:35 UTC (4 hours ago)
 
(HTM) web link (davidamos.dev)
(TXT) w3m dump (davidamos.dev)
 
| zanzibar735 wrote:
| Of course you can reverse a string with a flag emoji. You just need to treat a "string" as a collection of Extended Grapheme Clusters (EGCs), and then you reverse the order of the EGCs. So if the string is `a<flag unicode bytes>b`, the output should be `b<flag unicode bytes>a`.
 
| Crazyontap wrote:
| The section of the linked Wikipedia article (1) on how the family emoji is rendered using a zero-width joiner is quite amazing.
|
| (1) https://en.wikipedia.org/wiki/Emoji#Joining
|
| edit: forgot HN doesn't render emojis. Better read it directly on Wikipedia, I guess.
 
| codezero wrote:
| You also can't URL-encode a string (in JS at least) if you truncate an emoji at the beginning or end of it.
 
| coreyp_1 wrote:
| If you think the Unicode flag emoji take a lot of bytes, then consider the family emoji! (https://unicode.org/emoji/charts/full-emoji-list.html#family)
|
| I'm in the process of designing a scripting language and implementing it in C++. I plan to put together a YouTube series about it. (Doesn't everyone want to see Bison and Flex mixed with proper unit tests and C++20 code?)
|
| Due to my future intended use case, I needed good support for Unicode. I thought that I could write it myself, and I was wrong. I wasted two weeks (in my spare time, mostly evenings) trying to cobble together things that should work, identifying patterns, figuring out how to update it as Unicode itself is updated, thinking about edge cases, i18n, zalgo text, etc. And then I finally reached the point where I knew enough to know that I was making the wrong choice.
|
| I'm now using ICU. (https://icu.unicode.org/) It's huge, it was hard to get it working in my environment, and there are very few examples of its usage online, but after the initial setup dues are paid, it WORKS.
|
| Aside: Yes, I know I'm crazy for implementing a programming language that I intend for serious usage. Yes, I have good reasons for doing it, and yes, I have considered alternatives. But it's fun, so I'm doing it anyways.
|
| Moral of the story: Dealing with Unicode is hard, and if you think it shouldn't be that hard, then you probably don't know enough about the problem!
 
| josephg wrote:
| Handling unicode can be fine, depending on what you're doing. The hard parts are:
|
| - Counting, rendering and collapsing grapheme clusters (like the flag emoji)
|
| - Converting between legacy encodings (shift-jis, koi8, etc.) and UTF-8 / UTF-16
|
| - Canonicalization
|
| If all you need is to deal with utf8 byte buffers, you don't need all that stuff. And your code can stay simple, small and fast.
|
| IIRC the Rust standard library doesn't bother supporting any of the hard parts of unicode. The only real unicode support in std is utf8 validation for strings. All the complex aspects of unicode are delegated to 3rd-party crates.
|
| By contrast, nodejs (and web browsers) do all of this. But they implement it in the same way you're suggesting - they simply call out to libicu.
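 
A minimal sketch of the EGC-based reversal zanzibar735 describes, in Python. It leans on the third-party grapheme package (recommended by the article's author further down the thread); the function name and sample string are mine:
 
    import grapheme  # third-party: pip install grapheme
 
    def reverse_graphemes(s: str) -> str:
        # Split into extended grapheme clusters, then reverse their order.
        return "".join(reversed(list(grapheme.graphemes(s))))
 
    flag = "\U0001F1FA\U0001F1F8"   # regional indicators U, S: the US flag
    print(reverse_graphemes("a" + flag + "b"))  # "b" + intact flag + "a"
 
Naive s[::-1] on the same input would split the flag into its two regional indicators and swap them.
 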
| tialaramex wrote:
| > The only real unicode support in std is utf8 validation for strings.
|
| Rust's core library gives char methods such as is_numeric, which asks whether this Unicode codepoint is in one of Unicode's numeric classes, such as the letter-like numerics and various digits. (Rust does provide char with is_ascii_digit and is_ascii_hexdigit if that's all you actually cared about.)
|
| So yes, the Rust standard library is carrying around the entire Unicode standard's class rule list, among other things. Of course Rust's library isn't all linked in as one vast binary, so if you never use these features your binary doesn't get that code.
 
| Gigachad wrote:
| It always feels like the most work goes into the least-used emoji. So many revisions and additions to the family emoji, and yet it's one of the ones I don't recall anyone ever using.
|
| I think the trap Unicode got into is that technically they can have infinite emoji, so they just don't ever have a way to say no to new proposals.
 
| masklinn wrote:
| > It always feels like the most work goes into the least-used emoji.
|
| I always feel like those emoji were added on purpose in order to force implementations to fix their unicode support. Before emoji were added, most software had completely broken support for anything beyond the BMP (case study: MySQL's so-called "UTF8" encoding). The introduction of emoji, and their immediate popularity, forced many systems to better support the astral planes (that is officially acknowledged: https://unicode.org/faq/emoji_dingbats.html#EO1).
|
| Progressively, emoji using more advanced features got introduced, which forced systems (and developers) to fix their unicode handling, or at least improve it somewhat, e.g. skin tones, combining codepoints, etc.
|
| > I think the trap Unicode got into is technically they can have infinite emoji so they just don't ever have a way to say no to new proposals.
|
| You should try to follow a new character through the process, because that's absolutely not what happens, and shepherding a new emoji through to standardisation is not an easy task. The unicode consortium absolutely does say no, and has many reasons to do so. There's an entire page on just the proposal guidelines (https://unicode.org/emoji/proposals.html), and following it does not in any way ensure it'll be accepted.
 
| mike_hock wrote:
| WTF business do emojis have in Unicode? The BMP is all there ever should have been. Standardize the actual writing systems of the world, so everyone can write in their language. And once that is done, the standard doesn't need to change for a hundred years.
|
| What we need now is a standardized, sane subset of Unicode that implementations can support while rejecting the insane scope creep that got added on top of that. I guess the BMP is a good start, even though it already contains superfluous crap like "dingbats" and boxes.
 
| laumars wrote:
| They do say no though. Frequently, too.
|
| The problem with Unicode is simply that it's trying to solve a very hard problem.
 
| tialaramex wrote:
| Exactly this. Humans have _incredibly_ complicated writing systems, and all Unicode wants to do is encode them all. Keep in mind that the trivial toy system we're more familiar with, ASCII, already has some pretty strange features, because even to half-arse one human writing system they needed those features.
| Case is totally wild: it only applies to like 5% of the symbols in ASCII, but in the process it means they each need two codepoints, and you're expected to carry around tech for switching back and forth between cases.
|
| And then there are several distinct types of white space; each gets a codepoint, and some of them try to mess with your text's "position", which may not make any sense in the context where you wanted to use it. What does it mean to have a "horizontal tab" between two parts of the text I wanted to draw on this mug? I found a document which says it is the same as "eight spaces", which seems wrong, because surely if you wanted eight spaces you'd just write eight spaces.
|
| And after all that, ASCII doesn't have working quotation marks, and it doesn't understand how to spell a bunch of common English words like naïve or café. Pretty disappointing.
 
| xxpor wrote:
| > Humans have incredibly complicated writing systems
|
| Not only that, there isn't even agreement about what's correct all the time!
|
| > it doesn't understand how to spell a bunch of common English words like naïve or café, pretty disappointing.
|
| A perfect example of this, since I would argue English doesn't have any diacritics at all. So the use of café is code switching. :)
 
| mattkrause wrote:
| Not a New Yorker writer, I see....
 
| mappu wrote:
| If you like this, you may also like why len(emoji) is still not 1 in Python 3 despite all the unicode breakage: https://storytime.ivysaur.me/posts/grapheme-clusters/
|
| I do feel like these are all 'gotcha' questions - I haven't seen any real-world requirement to reverse a string and then have it be displayed in a useful way.
 
| raffy wrote:
| Kinda related: I am developing a library for ENS (Ethereum Name Service) name normalization: https://github.com/adraffy/ens-normalize.js
|
| I'm trying to find the best combination of UTS-46, UTS-51, UTS-39, and prior work on IDN resolution w/r/t confusables: https://adraffy.github.io/ens-normalize.js/test/report-confu...
|
| Personally, I found the Unicode spec very messy. Critical information is all over the place. You can see the direct effect of this when you compare various packages across different languages and discover that every library disagrees in multiple places. Even JS String.normalize() isn't consistent in the latest version of most browsers: https://adraffy.github.io/ens-normalize.js/test/report-nf.ht... (fails in Chrome, Safari)
|
| The major difference between ENS and DNS is that emoji are front and center. ENS resolves by computing a hash of a name in a canonicalized form. Since resolution must happen in a decentralized way, simply punting to punycode and relying on custom logic for Unicode handling isn't possible. On-chain records are 1:1, so there's no fuzzy matching either. Additionally, ENS is actively registering names, so any improvement to the system must preserve as many names as possible.
|
| At the moment, I'm attempting to improve upon the confusables in the Common/Greek/Latin/Cyrillic scripts, and will combine these new groupings with mixed-script limitations similar to the IDN handling in Chromium.
|
| Interactive Demo: https://adraffy.github.io/ens-normalize.js/test/resolver.htm...
|
| Also this emoji report is pretty cool: https://adraffy.github.io/ens-normalize.js/test/report-emoji...
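 
Python's standard library is enough to demo the normalization raffy mentions; a small sketch (the é example is mine, not from the article):
 
    import unicodedata
 
    composed = "\u00e9"      # "é" as a single code point
    decomposed = "e\u0301"   # "e" followed by a combining acute accent
 
    print(composed == decomposed)                                # False
    print(unicodedata.normalize("NFC", decomposed) == composed)  # True
    print(unicodedata.normalize("NFD", composed) == decomposed)  # True
 
Two strings that render identically can compare unequal until both are normalized to the same form, which is one reason different libraries disagree with each other.
 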
| [deleted]
 
| xmprt wrote:
| This is a cool article about Unicode encoding; however, I still feel like it should be possible to reverse strings with flag emojis. I don't see why computers can't handle multi-rune symbols in the same way that they handle multi-byte runes. We could combine all the runes that should be a single symbol and make sure that we're maintaining the ordering of those runes in the reversed string. Of course that means that naive string reversing doesn't work anymore, but naive string reversing wouldn't work in the world of UTF-8 anyway if we just went byte by byte.
 
| happytoexplain wrote:
| Swift, for example, does what you're saying. I thought that the reason many languages don't do it that way is that part of the definition of an array (or at least what's expected by convention) is constant-time operations. If you treat a string as an array, then having to deal with variable-length units breaks that rule. That's why, when there _is_ an API for dealing with grapheme clusters, it is usually a special case that duplicates an array-like API, instead of literally using an array.
|
| I actually don't know how/why Python is apparently using code points, since they are variable length. That seems like a compromise between using code units and using grapheme clusters that gets you the worst of both worlds.
|
| Edit: Maybe it uses UTF-32 under the hood when it's doing array operations on code points?
 
| kevin_thibedeau wrote:
| This misses the real problem with flag emoji: they are composed of codepoints that can appear in any order. With other emoji you get a base codepoint with potential combining characters. Using a table of combining-character ranges, you can skip over them and isolate the logical glyph sequences. Unlike flags, you don't need surrounding context to parse them out.
 
| uniqueuid wrote:
| Thanks for that interesting detail!
|
| If such re-purposing continues, it might be easier to go straight to utf-32 for some use cases.
 
| dhosek wrote:
| Nope, because the repurposing is independent of how the Unicode is represented. There's absolutely no advantage to having a string in UTF-32 over UTF-8: you'll still need to examine every character, and the added overhead of converting UTF-8 byte strings to 32-bit code points is by far offset by the huge memory increase necessary to store UTF-32.
|
| What's more, it's really not that difficult to start at the end of a valid UTF-8 string and get the characters in reverse order. UTF-8 is well-designed that way, in that there's never ambiguity about whether you're looking at the beginning byte of a code point.
 
| colejohnson66 wrote:
| > UTF-8 is well-designed that way in that there's never ambiguity about whether you're looking at the beginning byte of a code point.
|
| To expand: if the most significant bit is a 0, it's an ASCII codepoint. If the top two bits are '10', it's a continuation byte, and if they're '11', it's the start of a multibyte codepoint (the other most-significant bits specify how long it is, to facilitate easy codepoint counting).
|
| So a naive codepoint reversal algorithm would start at the end, and move backwards until it sees either an ASCII codepoint or the start of a multibyte one. Upon reaching it, copy those 1-4 bytes to the start of a new buffer. Continue until you reach the start.
|
| [0]: https://en.wikipedia.org/wiki/UTF-8#Encoding
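 
A sketch of colejohnson66's backwards scan in Python, operating on the raw UTF-8 bytes (the function name and framing are mine). Note that it preserves code points, not grapheme clusters, so a flag would still come out scrambled:
 
    def reverse_codepoints_utf8(data: bytes) -> bytes:
        out = bytearray()
        end = len(data)
        i = end
        while i > 0:
            i -= 1
            # Step over continuation bytes (0b10xxxxxx) to find the lead byte.
            while data[i] & 0b1100_0000 == 0b1000_0000:
                i -= 1
            out += data[i:end]   # copy this 1-4 byte code point as a unit
            end = i
        return bytes(out)
 
    print(reverse_codepoints_utf8("héllo".encode()).decode())  # "olléh"
 
On a str, Python's s[::-1] gives the same code-point-level result, since Python strings are sequences of code points.
 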
| jug wrote:
| I think that somewhere in this answer lies a reason why Windows still doesn't support flag emoji. I don't count Microsoft Edge as "Windows" in this case, but as Chromium. Windows doesn't support flag emoji in its native text boxes, even though it supports other emoji, colorized ones included.
|
| But then again, flags seem to be not only Unicode-hard but post-Unicode-hard.
 
| masklinn wrote:
| > But then again, flags seem to be not only Unicode-hard but post-Unicode-hard.
|
| Flags are not that hard; they're a very specific block combining in a very predictable way. They're little more than ligatures. Family emoji are much harder.
|
| And this is not "post-Unicode" in any way.
 
| cygx wrote:
| _Flags are not that hard, they're a very specific block combining in very predictable way._
|
| But before their introduction, you could decide if there's a grapheme cluster break between codepoints just by looking at the two codepoints in question. Now, you may need to parse a whole sequence of codepoints to see how the flags pair up.
 
| otagekki wrote:
| If flag emojis are really a combination of 2 special characters, reversing the U.S. flag should result in the Soviet Union flag.
 
| TonyTrapp wrote:
| It's up to the installed fonts, really. I don't know if the combination of S + U is standardized as a Soviet Union flag emoji, but even if it is, your locally installed fonts may not contain every single flag emoji, so the browser would still fall back to rendering the two letters instead.
 
| masklinn wrote:
| > the reversal of the U.S. flag should result in having the Soviet Union flag.
|
| Except it has been deleted from the ISO 3166-1 registry, so not having it is perfectly valid (arguably more so than having it).
 
| jameshart wrote:
| I was _so_ disappointed that didn't turn out to be the case.
 
| brewmarche wrote:
| Just tried reversing a Spanish flag with Python and indeed I got Sweden back.
 
| ezfe wrote:
| Works in Swift, which is the benefit of Swift having the most painful String API possible:
|
|     let v = "Flag: <flag emoji>"
|     String(v.reversed())  // Output: "<flag emoji> :galF"
|     v.count               // Output: 7
 
| jiveturkey wrote:
| Interesting article. Written for beginners, conversationally. It has excessive amounts of whitespace, for "readability" I guess. But at the same time it dives quite deep, and I don't think this "style" of presentation matches up with the amount of time a more novice reader is going to devote to a single long-form article.
|
| As to the content: for all the deep dive, a simple link to https://unicode.org/reports/tr51/#Flags and what an emoji is would have saved so much exposition. I also wish he'd touched on normalization. With the amount of time he's demanding from readers, he could have mentioned this important subject, because then he could discuss why (starting from his emoji example) a-grave (à) might or might not be reversible, depending on how the character is composed.
|
| Also wish he'd pointed to some libraries that can do such reversals.
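 
cygx's point above is easy to see in code: with regional indicators, where the grapheme boundaries fall depends on pairing parity counted from the start of the run. A sketch using the third-party grapheme package; the country codes are arbitrary examples:
 
    import grapheme
 
    us = "\U0001F1FA\U0001F1F8"   # regional indicators U, S
    ca = "\U0001F1E8\U0001F1E6"   # regional indicators C, A
    run = us + ca                 # four indicators in a row: U S C A
 
    print(list(grapheme.graphemes(run)))
    # two clusters: the US flag and the CA flag
 
    print(list(grapheme.graphemes(run[1:])))
    # drop one indicator and every boundary shifts: S and C now pair up
    # (the Seychelles flag), leaving a lone A
 
So unlike combining characters, you cannot classify a regional indicator without looking arbitrarily far to its left.
 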
| faebi wrote:
| Why reverse them, when one can barely implement, display and edit them correctly? I never could make them work perfectly in VIM. Also, I had to open a bug in Firefox recently:
|
| _Flag emojis and others are displayed in double the size on Windows 10 using Firefox Nightly_ https://bugzilla.mozilla.org/show_bug.cgi?id=1746795
 
| [deleted]
 
| nottorp wrote:
| So basically unicode, along with c++, is great job security if you do bother to learn them.
|
| There's another word that comes to mind when thinking about those two: metastasis.
 
| [deleted]
 
| ts4z wrote:
| Let me cheat a bit and say Unicode comes in three flavors: UTF-8, UCS-2 aka UTF-16, and UTF-32. UTF-8 is byte-oriented, UTF-16 is double-byte oriented, and UTF-32 nobody uses because you waste half the word almost all of the time.
|
| You can't reverse the _bytes_ in UTF-8 or UTF-16, because you'll scramble the encoding. But you could parse the string, codepoint-at-a-time, handling the specifics of UTF-8, or UTF-16 with its surrogate pairs, and reverse those. This sounds equivalent to reversing UTF-32, and I believe is what the original poster was imagining.
|
| Except you can't do that, because Unicode has combining characters. Now, I'm American and too stupid to type anything other than ASCII, but I know about n + ~ = ñ. If you have the pre-composed version of ñ, you can reverse the codepoint (it's one codepoint). If you don't have it, and you have n + dead ~, you can't reverse it, or in the word "año" you might put the ~ on the "o". (Even crazier things happen when you get to the ligatures in Arabic; IIRC one of those is about 20 codepoints.)
|
| So we can't just reverse codepoints, even in ancient versions of Unicode. Other posters have talked about the even more exotic stuff like emoji + skin tone. It's necessary to be very careful.
|
| Now, the old fart in me says that ASCII never had this problem. But the old fart in me knows about CRLF in text protocols, and that's never LFCR; and that if you want to make an ñ in ASCII you must send n ^H ~. I guess you can reverse that, but if you want to do more exotic things it becomes less obvious.
|
| (IIRC UCS-2 is the deadname; now we call it UTF-16 to remind us to always handle surrogate pairs correctly, which we don't.)
|
| TLDR: Strings are hard.
 
| progbits wrote:
| Semi-related (about the length of emoji "characters", not reversing): https://hsivonen.fi/string-length/
|
| Previously discussed:
|
| https://news.ycombinator.com/item?id=20914184
|
| https://news.ycombinator.com/item?id=26591373
|
| As for this article & Python: as usual, it biases towards convenience and implicit behavior rather than properly handling all edge cases.
|
| Compare with Rust, where you can't "reverse" a string - that is not a defined operation. But you can either break it into a sequence of characters or graphemes and then reverse that, with expected results: https://play.rust-lang.org/?version=stable&mode=debug&editio...
|
| (Sadly the grapheme segmentation is not part of the standard library, at least yet.)
 
| aidenn0 wrote:
| > The answer is: it depends. There isn't a canonical way to reverse a string, at least that I'm aware of.
|
| Unicode defines grapheme clusters[1] that represent "user-perceived characters"; separating a string into those and reversing them seems like a pretty good way to go about it.
|
| 1: http://www.unicode.org/reports/tr29/
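 
The different length answers from progbits' hsivonen link, condensed into Python (the grapheme package is third-party; the rest is stdlib):
 
    import grapheme
 
    flag = "\U0001F1FA\U0001F1F8"  # US flag: two regional indicators
 
    print(len(flag))                           # 2 - code points (Python's view)
    print(len(flag.encode("utf-8")))           # 8 - UTF-8 code units (bytes)
    print(len(flag.encode("utf-16-le")) // 2)  # 4 - UTF-16 code units (two surrogate pairs)
    print(grapheme.length(flag))               # 1 - extended grapheme clusters
 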
| qqii wrote:
| > Challenge: How would you go about writing a function that reverses a string while leaving symbols encoded as sequences of code points intact? Can you do it from scratch? Is there a package available in your language that can do it for you? How did that package solve the problem?
|
| So are there any good libraries that can deal with code points that are merged together into a single pictograph and reverse them "as expected"?
 
| da12 wrote:
| If you're using Python, check out grapheme: https://github.com/alvinlindstam/grapheme
 
| tl wrote:
| This is a nice dive into the limitations in Python's unicode handling and, at the end, how to work around some problems. But you could use languages with proper unicode support like Swift or Elixir (weirdly, HN is fighting flags in comment code, which makes examples harder to demonstrate).
 
| anamexis wrote:
| HN doesn't allow any emoji.
 
| mlindner wrote:
| The person tries to define "character" when there isn't actually any definition of what that even means. "Character" is a term limited to languages that actually use characters, and not all text is made up of them.
 
| yoyohello13 wrote:
| Maybe I'm missing some prerequisite knowledge here, but why would I assume `flag="us"` is an emoji? Looking at that first block of code, there is no reason for me to think "us" is a single character.
|
| Edit: Turns out my browser wasn't rendering the flags.
 
| ljm wrote:
| If it's Windows, it doesn't actually use flags for those emojis; it renders a country code instead. If it wasn't supported, you would just see the glyph for an unknown character.
|
| The reason is that they didn't want to be caught up in any arguments about what flag to render for a country during any dispute, as with, e.g., the flag for Afghanistan after the Taliban took control.
 
| happytoexplain wrote:
| In Windows Chrome, it doesn't render the emoji for me. In Android Chrome, it renders a flag emoji - not the raw regional indicators (which look like the letters "u" and "s").
 
| Benlights wrote:
| I had the same issue when I read the article; I kept getting stuck and asking myself what I was missing.
 
| greenyoda wrote:
| In my browser (Firefox on Windows), the thing between the quotes in the first block of code looks like a picture of the US flag cropped to a circle, not like the characters "us".
 
| yoyohello13 wrote:
| Ah, I see. I just opened it in Firefox. It looks like some JS library is not getting loaded in Edge. The author was talking about "us", "so", etc. looking like one character and I thought I was going crazy, lol.
 
| da12 wrote:
| A whole lesson in Unicode in itself right there with your experience, haha!
 
| bialpio wrote:
| Reminds me of an image that renders differently on Macs (https://www.bleepingcomputer.com/news/technology/this-image-...). I bet it'd make for a fun conversation that could make the participants question their sanity. :-)
 
| masklinn wrote:
| There should not be any JS involved though, only a font able to render these grapheme clusters.
|
| Do you see the US flag after "copy and paste this emoji" on https://emojipedia.org/flag-united-states/?
 
| jfk13 wrote:
| I don't think that's about a JS library. Firefox bundles an emoji font that supports some things -- such as the flags -- that aren't supported by Segoe UI Emoji on Windows, so it has additional coverage for such character sequences.
 
| yoyohello13 wrote:
| That makes sense. I saw a failure to load a JS module in the console and assumed that was part of the problem.
 
| jug wrote:
| I'm not surprised the flag had two components, but I _was_ surprised the US flag was made by literally U and S, haha!
|
| I definitely thought it'd be something like [I am a Flag] and [the flag ID between 0 and 65535]. And reversing it would be [flag ID] + [I am a Flag], which would not be a defined "component" and would instead be rendered as the two individual nonsense characters.
 
| andylynch wrote:
| You might also have noticed this is partly a very well-thought-out hack to make Unicode less sensitive to disagreements and changes in consensus on which flags are encoded, or even the names of the countries concerned!
 
| happytoexplain wrote:
| I guessed that it would become the USSR flag (US -> SU), but apparently Unicode doesn't define that one! I wonder why. That would have been humorous.
 
| bloak wrote:
| As I understand it, there is no two-letter ISO code for the USSR because when they update the standard they remove countries that no longer exist. In at least one case they have reused a code: CS has been both "Czechoslovakia" and "Serbia and Montenegro", neither of which currently exist.
|
| As a result, two-letter ISO codes are useless for many potential applications, such as, for example, recording which country a book was published in, unless you supplement them with a reference to a particular version of the standard.
|
| Is there a way of getting the Czechoslovakian flag as an emoji? And did Serbia and Montenegro get round to making a flag?
 
| happytoexplain wrote:
| Ah, I didn't realize they reused codes from ISO 3166-3. I figured, because they keep these regions around in their own set, that there was some implication that the codes would not be reused.
 
| ts4z wrote:
| IIRC Unicode doesn't define country codes. It was a workaround for a political issue of which countries recognize which other countries.
|
| It would have been difficult to get the CN delegation to sign off on a list that contained TW, although there are probably others.
 
| andylynch wrote:
| There are many more than I realised - Wikipedia has a decent list: https://en.m.wikipedia.org/wiki/List_of_states_with_limited_...
 
| chungy wrote:
| Unicode doesn't define any flags, really. That's up to the font rendering on systems/libraries.
 
| happytoexplain wrote:
| True, but Unicode explicitly defines "SU" as a deprecated combination, regardless of flags. It seems like they omit everything from the list of "no longer used" country codes, with some exceptions. I would think they would have no reason not to allow historical regions.
 
| WA9ACE wrote:
| I feel like I'm obligated to share this almost 20-year-old Spolsky post that gave me my understanding of characters.
|
| https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...
 
| xmprt wrote:
| In that same vein, here's my introduction to Unicode from about 10 years ago, from Tom Scott.
|
| https://www.youtube.com/watch?v=MijmeoH9LT4
 
| zerox7felf wrote:
| The poor man gave me and many others something like half of our introduction to computer science, but has gotten far more fame as the "emoji guy" for his repeated bouts with this particular part of unicode :)
 
| ciupicri wrote:
| That's more about the UTF-8 encoding than Unicode itself.
 
| bandyaboot wrote:
| Would be interesting to see the list of flag emojis that, when reversed, become a different flag emoji.
 
| jfk13 wrote:
| There are plenty of country codes that when reversed become a different, valid country code: e.g. Israel (IL) when reversed is Lithuania (LI); Australia (AU) becomes Ukraine (UA).
|
| Whether "reversing flag emojis" causes such transformations will depend on what is meant by "reversing", which is kind of the whole point here: there are a number of possible interpretations of "reverse".
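 
bandyaboot's list is easy to generate; a sketch over just the codes mentioned in this thread (a real version would iterate the full ISO 3166-1 alpha-2 list):
 
    RI_BASE = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A
 
    def flag(cc: str) -> str:
        # Map a two-letter country code onto regional indicator symbols.
        return "".join(chr(RI_BASE + ord(c) - ord("A")) for c in cc.upper())
 
    codes = {"US", "IL", "LI", "AU", "UA", "CA", "AC", "ES", "SE"}
    for code in sorted(codes):
        rev = code[::-1]
        if rev != code and rev in codes:
            print(flag(code), code, "->", flag(rev), rev)
 
US drops out because SU is no longer an assigned code, as discussed above.
 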
| | Whether "reversing flag emojis" causes such transformations | will depend on what is meant by "reversing", which is kind of | the whole point here: there are a number of possible | interpretations of "reverse". | alfredxing wrote: | Related -- I did a deep dive a couple years ago on emoji | codepoints and how they're encoded in the Apple emoji font file, | with the end goal of extracting the embedded images -- | https://github.com/alfredxing/emoji | utopcell wrote: | There are unicode characters that reverse parsing order | themselves. This has been the basis of a code injection attack, | analyzed in [1]. | | [1] ``Trojan Source: Invisible Vulnerabilities'': | https://trojansource.codes/trojan-source.pdf | uniqueuid wrote: | Upper and lower codepoints are really way too obscure and can | create issues you didn't even know you had. | | I once had the very unpleasant experience of debugging a case | where data saved with R on windows and loaded on macOS ended up | with individually double-encoded codepoints. | | Not fun. | randpx wrote: | Try reversing the Canadian flag (CA) and you get the Ascension | Island Flag (AC). Great article, but completely misses the point. | Mesopropithecus wrote: | Unfortunately the HN text input won't let me do this, but a funny | starter for the article would have been this: | | '(Spanish flag)'[::-1] | | basically ''.join([chr(127466), chr(127480)]) vs. | ''.join([chr(127466), chr(127480)])[::-1] | | I'll add this to my collection of party tricks and show myself | out. | | Cool article! | dhosek wrote: | On the challenge front, there are things like a which might be a | single code point or two code points (a+'). Then there are the | really challenging things like a where if the components are | individual characters, the order of and ~ are not guaranteed to | be consistent. | saltminer wrote: | Then you have stuff like zalgo text (http://eeemo.net/) which | takes pride in abusing code points | happytoexplain wrote: | Which is why these APIs should always make normalization | available: https://unicode.org/reports/tr15/ | treesknees wrote: | But you can, and did, reverse a string. It seems you would need | more details, such as a request to reverse the meaning or | interpretation of the string, which is what the author is getting | at. | | If someone challenges you to reverse an image, what do you do? Do | you invert the colors? Mirror horizontally? Mirror vertically? | Just reverse the byte order? | wahern wrote: | There's a specification problem here. I like to say that a | "string" isn't a data structure, it's the absence of one. | Discussing "strings" is pointless. It follows that comparing | programming languages by their "string" handling is likewise | pointless. | | Case in point: a "struct" in languages like C and Rust is | literally a specification of how to treat segments of a | "string" of contiguous bytes. | shadowgovt wrote: | Even the most basic ASCII string is still a data structure. | | Is it a PASCAL string (length byte followed by data) or a C | string (arbitrary run of bytes terminated by a null | character)? | wahern wrote: | You qualified "string" with "ASCII", and also tacitly | admitted you still need more information than the octets | themselves--the length. | | Of course, various programming languages have primitives | and concepts which they may label "string". But you still | need to specify that _context_ , drawing in the additional | specification those languages provide. 
| shadowgovt wrote:
| I think I understand the difference; you're using "string" the way I would use "blob" or "untyped byte array."
|
| Shifting definitions to yours, I agree.
 
| avianlyric wrote:
| In languages like C, a "string" isn't a proper data structure; it's a `char` array, which itself is little more than an `int` array or `byte` array.
|
| But these languages don't provide true "string" support. They just have a vaguely useful type alias that renames a byte array to a char array, and a bunch of byte-array functions that have been renamed to sound like string functions. In reality all the language supports are byte arrays, with some syntactic sugar so you can pretend they're strings.
|
| Newer languages, like Go and Python 3, that were created in the world of Unicode, provide true string types, where the type primitives properly deal with the idea of variable-length characters and provide tools to make it easy to manipulate strings and characters as independent concepts. If you want to ignore Unicode, because your specific application doesn't need to understand it, then you need to cast your strings into byte arrays, and all pretences of true string manipulation vanish at the same time.
|
| This is not to say that C can't handle Unicode; just as the language doesn't provide true primitives to manipulate strings, it instead relies on libraries to provide that functionality, which is a perfectly valid approach. Just as baking more complex string primitives into your language is also a perfectly valid approach. It's just a question of trade-offs and use cases, i.e. the problem at the heart of all good engineering.
 
| samatman wrote:
| We would all be better off if this were actually true.
|
| Tragically, in C, a string is just _barely_ a data structure, because it must have \0 at the end.
|
| If it were the complete absence of a data structure, we would need some way to get at the length of it, and could treat a slice of it as the same sort of thing as the thing itself.
 
| jameshart wrote:
| Yep, it's as meaningful a programming task as 'reverse this double-precision float'.
 
| egypturnash wrote:
| Galaxy-brain image reversal: completely redraw it from scratch, with a viewpoint 180° from the original.
 
| ravi-delia wrote:
| New computer vision challenge
 
| zwerdlds wrote:
| In normal conditions you can check for a ZWJ, but with regional coding chars you would have to consider the regional chars block as a single char in the reversal. Given that it isn't necessarily locale-dependent but presentation-layer-dependent, there might not be enough info to decide how to act.
 
| jerf wrote:
| So, in terms of acing interviews, increasingly one of the best answers to the question "Write some code that reverses a string" is that in a world of unicode, "reversing a string" is no longer possible or meaningful.
|
| You'll probably be told "oh, assume US ASCII" or something, but in the meantime, if you can back that up when they dig into it, you'll look really smart.
 
| Someone wrote:
| Even ASCII can be argued to be problematic.
|
| What is "3 >= 2", reversed?
|
| What is "Rijksmuseum", reversed?
| (https://en.wikipedia.org/wiki/IJ_(digraph)); capitalization isn't simple here, either (https://en.wikipedia.org/wiki/IJ_(digraph)#Capitalisation).
 
| greenyoda wrote:
| > "reversing a string" is no longer possible or meaningful.
|
| If you really wanted to, you could write a string reversal algorithm that treated two-character emojis as an indivisible element of the string and preserved their order (just as you'd need to preserve the order of the bytes in a single multi-byte UTF-8 character). You'd just need to carefully specify what you mean by the terms "string", "character" and "reverse" in a way that includes ordered, multi-character sequences like flag emojis.
 
| happytoexplain wrote:
| I would argue that it is possible and meaningful. AFAIK extended grapheme clusters are well defined by the standard, and are very well suited to the default meaning of "character", so, given no other information, it's reasonable to reverse a string based on them. I guess the issue is that "reverse a string" lacks details, but I think that's different from "not meaningful".
 
| viktorcode wrote:
| You certainly can. `print(String(flag.reversed()))` in Swift reverses emojis correctly.
 
| Spivak wrote:
| Reversing a string is still meaningful. Take a step back outside the implementation and imagine handing a Unicode string to a human. They could, without any knowledge, look at the characters they see and produce the correct string reversal.
|
| There is a solution to this, which is to compute the list of grapheme clusters and reverse that.
|
| https://unicode.org/reports/tr29/
 
| akersten wrote:
| > imagine handing a Unicode string to a human. They could without any knowledge look at the characters they see and produce the correct string reversal.
|
| I really highly doubt it.
|
| How do you reverse this?: mrHban , hdhh slsl@.
|
| Can you do it without any knowledge about whether what looks like one character is actually a special-case joiner between two adjacent codepoints that only happens in one direction? Can you do it without knowing that this string appears wrongly in the HN text box due to an apparent RTL issue?
|
| It's just not well-defined to reverse a string, and the reason we say it's not meaningful is that no user story ever starts "as a visitor to this website I want to be able to see this string in opposite order - no, not just that all the bytes are reversed, but you know what I mean."
 
| adolph wrote:
| Is an RTL character string already "reversed" from an LTR POV?
|
| Is an absolute value signed as positive?
 
| Spivak wrote:
| I mean, no, but only because I don't understand the characters. Someone who reads Arabic (I assume, based on the shape) would have no trouble. You're nitpicking cases where for _some readers_ visual characters might be hard to distinguish, but it doesn't change the fact that _there exists a correct answer_ for every piece of text that will be obvious to readers of that text, which is the definition of a grapheme cluster.
 
| akersten wrote:
| > the fact that there exists a correct answer for every piece of text that will be obvious to readers of that text which is the definition of a grapheme cluster.
|
| No, I insist there is _not_ a single "correct answer," even if a reader has perfect knowledge of the language(s) involved. Now remember, this is already moving the goalposts, since it was claimed that a human needed "no knowledge" to get to this allegedly "correct answer."
|
| You already admit that people who don't speak Arabic will have trouble finding the "grapheme clusters," but even two people who speak Arabic may do your clustering or not, depending on some implicit feeling of "the right way to do it" vs taking the question literally and pasting the smallest highlightable selection of the string in reverse at a time.
|
| Anyway, take a string like this: "here is some Arabic text: <RLM> <Arabic codepoints> <LRM> And back to English"
|
| Whether you discard the ordering marks[0], keep them, or invert them is an implementation decision that already produces three completely different strings. Unless we want to write a rulebook for the right way to reverse a string, it remains an impossibility to declare anything the correct answer, and because there is no _reason_ to reverse such a string outside of contrived interview questions and ivory-tower debates, it is also meaningless.
|
| [0]: https://en.m.wikipedia.org/wiki/Right-to-left_mark https://en.m.wikipedia.org/wiki/Left-to-right_mark
 
| Spivak wrote:
| You added the requirement that it be a single correct answer. I just asserted that there existed a correct answer. You're being woefully pedantic -- a human who can read the text presented to them but has no knowledge of unicode was my intended meaning. Grapheme clusters are language-dependent and chosen for readers of the languages that use the characters involved. There's no implicit feeling; this is what the standards body has decided is the "right way to do it." If you want to use different grapheme clusters because you think the Unicode people are wrong, then fine, use those. You can still reverse the string.
|
| Like, what are you even arguing? You declared that something was impossible and then ended with the position that it's not only possible, but so possible that there are many reasonable correct answers. Pick one and call it a day.
 
| akersten wrote:
| > Like what are you even arguing?
|
| It is impossible to "correctly reverse a string" because "reverse a string" is not well defined. We explored many different potential definitions of it, to show that there is no meaningful singular answer.
|
| > You added the requirement that it be a single correct answer.
|
| Your original post says "they could produce _the_ correct string reversal"?
 
| happytoexplain wrote:
| > what looks like one character is actually a special case joiner between two adjacent codepoints
|
| Are you referring to a grouping not covered by the definition of grapheme clusters (which I am only passingly familiar with)? If so, then I don't think it's any more non-meaningful to reverse it than to reverse an English string. The result is gibberish to humans either way - it sounds more like you're saying that there is no universally "meaningful to humans" way to reverse some text in potentially any language, which is true regardless of what encoding or written language you're using. I was thinking of it more from the programmer side - i.e. that Unicode provides ways to reverse strings that are more "meaningful" (as opposed to arbitrary) than, e.g., just reversing code points.
 
| nonameiguess wrote:
| You can even demonstrate a similar concept with English and Latin characters. There is no single thing called a "grapheme" linguistically. There are actually two different types of graphemes. The character sequence "sh" in English is a single referential grapheme but two analogical graphemes.
Depending on what the specification means, "short" could be reversed as either "trosh" or "trohs". That's without getting into transliteration. The word for Cherokee in the Cherokee language is "Tsalagi", but the "ts" is a Latin transliteration of a single Cherokee character. Should we count that as one grapheme or two?
|
| Of course, if an interviewer is really asking you how to do this, they're probably either 1) working in bioinformatics, in which case there are exactly four ASCII characters they really care about and the problem is well-defined, or 2) implementing something like rev | cut -d '-' -f1 | rev to get rid of the last field, and it doesn't matter how you implement "rev" just so long as it works exactly the same in reverse and you can always recover the original string.
 
| Spivak wrote:
| The fact that how to reverse a piece of text is locale-dependent doesn't mean it's impossible. Basically any transformation on text will be locale-dependent. Hell, _length_ is locale-dependent.
 
| lloeki wrote:
| Should it reverse a BOM as well, or keep it first?
 
| Spivak wrote:
| Keep it first? Like, that's not a gotcha. Your input is a string and the output is that string visually reversed. What it looks like in memory is irrelevant.
 
| paxys wrote:
| UTF-8 string reversal has been a thing for a long time in most/all programming languages. It may not work perfectly in 100% of the cases, but that doesn't mean reversing a string is no longer possible.
 
| jerf wrote:
| "It may not work perfectly in 100% of the cases, but that doesn't mean reversing a string is no longer possible."
|
| It depends on your point of view. From a strict point of view, it _does_ exactly mean it is no longer possible. By contrast, we all 100% knew what reversing an ASCII string meant, with no ambiguity.
|
| It also depends on the version of Unicode you are using, and, oh by the way, unicode strings do not come annotated with the version they are in. Since it's supposed to be backwards compatible, hopefully the latest works, but I'd be unsurprised if someone can name something whose correct reversal depends on the version of Unicode. And, if not now, then in some later not-yet-existing pair of Unicode standards.
 
| pwdisswordfish9 wrote:
| > By contrast, we all 100% knew what reversing an ASCII string meant, with no ambiguity.
|
| Not if the ASCII string employed the backspace control character to accomplish what is today done with Unicode combining characters.
|
| Or, in fact, if it employed any other kind of control sequence.
 
| thaumasiotes wrote:
| I always thought it was interesting that ASCII is transparently just a bunch of control codes for a typewriter (where "strike an 'a'" is a mechanical instruction no different from "reset the carriage position"), but when we wanted to represent symbolic data we copied it and included all of the nonsensical mechanical instructions.
 
| adzm wrote:
| Well, the control codes were specifically for TTY rather than typewriters; many of the control codes still make sense from that standpoint.
 
| jameshart wrote:
| Like... \r\n
 
| jcelerier wrote:
| > It may not work perfectly in 100% of the cases, but that doesn't mean reversing a string is no longer possible.
|
| I don't understand why in maths finding one single counter-example is enough to disprove a theorem, yet in programming people seem to be happy with a 99.x% success rate.
To me, "It may not work perfectly in 100% of the cases" exactly means "no longer possible", as "possible" used to imply that it would work consistently, 100% of the time.
 
| tux3 wrote:
| It is very useful in engineering to do things that are mathematically impossible, by simply ignoring or rejecting the last 1%.
|
| Sometimes that's unacceptable, because you really do care about 100% of cases. When it isn't, you get really cool "impossible" tools out of it :)
 
| paxys wrote:
| Because programming is not a science (or at most it is an applied science).
|
| By your logic any software that has a single bug would be useless, and if that were the case this entire profession wouldn't exist.
 
| jameshart wrote:
| I'd go further and argue that _in general_ reversing a string isn't possible or meaningful.
|
| It's just not a thing people do, so it's just... not very interesting to argue about what the 'correct' way to do it is.
|
| Similarly, any argument over whether a string has n characters or n+1 characters in it is almost entirely meaningless and uninteresting for real-world string processing problems. Allow me to let you in on a secret:
|
| _there's never really such a thing as a 'character limit'_
|
| There might be a 'printable character width' limit, or there might be a 'number of bytes of storage' limit. Which means interesting questions about a string include things like 'how wide is it when displayed in this font?' or 'how many bytes does it take to store or transmit it?'... But there's rarely any point where, for a general string, it is really interesting to know 'how many characters does the string contain?'
|
| Processing direct user text input is the only situation where you really need a rich notion of 'character', because you need to have a clear sense of what will happen if the user moves a cursor using a left or right arrow, of exactly what will be deleted when a user hits backspace, and of what will be copied/cut and pasted when they operate on a selection. The ij ligature might be a single glyph, but is it a single character? When does it matter? Probably not at all, unless you're trying to decide whether to let a user put a cursor in the middle of it or not.
|
| And next to that, to argue that there is such a thing as a 'correct' way to reverse "Rijndael" according to a strict reading of Unicode glyph composability rules seems like a supremely silly thing to try to do.
|
| I'd much rather, when asked to reverse a string, more developers simply said 'that doesn't make sense; you can't arbitrarily chunk up a string and reassemble it in a different order and expect any good to come of it'.
 
| Beldin wrote:
| Interestingly, on my phone the so-called flag is not a flag at all, but "US" in outline.
|
| So python behaves as expected: the 2-character string, when reversed, becomes "SU". Similar stuff happens with the other "flag" strings.
|
| I'm sure the emoji on my phone are outdated. I'm not sure how that affects whether I see a flag or letters.
 
| pilsetnieks wrote:
| Thankfully, there isn't an assigned ISO 3166-1 two-letter country code for SU currently; people might have interesting reactions seeing what happens when reversing a US flag emoji if there were.
 
| nextstep wrote:
| Compare all of this nonsense to how it's done in Swift. String APIs in Swift are great: intuitive, and they do what you expect.
 
| exdsq wrote:
| Am I missing something, or is this Day 1 of a programming course in C?
| techwiz137 wrote:
| It's pretty funny that reversing the American flag yields the Soviet Union (SU).
 
| emodendroket wrote:
| What I'd like to know is: given the explosion of the character set for emoji, does the rationale for Han unification still make sense? The case for not allowing national variants seems less and less compelling with every emoji they add.
|
| This is a bit of a hobby horse, but imagine if every time you read an article in English on your phone some of the letters were replaced with "equivalent" Greek or Cyrillic ones, and you can get an idea of the annoyance. Yeah, you can still read it with a bit of thought, but who wants to read that way?
 
| AlanYx wrote:
| I agree that Han unification was an unfortunate design decision, but I'd argue that the consortium's approach to emoji is consistent with Han unification. For example, they treat "regional" vendor variations in emoji as a font issue. If you get a message with the gun emoji, unless you have out-of-band information regarding which vendor variant is intended, there's no way in software to know if it should be displayed as a water gun (Apple "regional" variant) or a weapon (other vendor variants). Which is not that different from a common problem stemming from Han unification.
 
| emodendroket wrote:
| I don't disagree, but my point is more that their concern was about having "too many characters" in Unicode, which no longer seems to be a real concern, so what would be the harm of adding national variants?
 
| hougaard wrote:
| In other news, water is wet :)
 
| michaelsbradley wrote:
| See chapter 7 in _Hacking the Planet (with Notcurses)_ for a short treatment of encodings, extended grapheme clusters, etc.
|
| https://nick-black.com/htp-notcurses.pdf#page53
 
| smegsicle wrote:
| did they think all those skin-tone emojis are individual codepoints?
 
| advisedwang wrote:
| They might have thought that `reverse()` had some kind of unicode-aware handling. I believe `upper()`/`lower()` do.
 
| daveslash wrote:
| When I first realized that the skin-tone emojis were a code point + a color-modifier code point, I tried to see what other colors there were and whether I could apply those to _other_ emojis. The immature child in me looked to see if there was a red color code point and, if so, whether I could use it to make a _"blood poop"_ emoji. Turns out.... no.
 
| codingkev wrote:
| Yes, this allows for easy building of flag emojis as long as you know the ISO 3166 two-letter country code.
|
| Example: https://github.com/kennell/flagz/blob/master/flagz.py
 
| sltkr wrote:
| So what was the deal with the Scottish flag?
 
| gsnedders wrote:
| From Wikipedia:
|
| > A separate mechanism (emoji tag sequences) is used for regional flags, such as England, Scotland, Wales, Texas or California. It uses U+1F3F4 WAVING BLACK FLAG and formatting tag characters instead of regional indicator symbols. It is based on ISO 3166-2 regions with the hyphen removed and lowercased, e.g. GB-ENG -> gbeng, terminating with U+E007F CANCEL TAG. The flag of England is therefore represented by the sequence U+1F3F4, U+E0067, U+E0062, U+E0065, U+E006E, U+E0067, U+E007F.
 
| ghostly_s wrote:
| This was the only part that was surprising to me, and as it turns out my surprise mostly stems from still not really understanding how the United Kingdom works.
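 
The tag-sequence mechanism gsnedders quotes is mechanical enough to sketch in a few lines of Python (the helper name is mine; see also dhosek's explanation further down):
 
    BLACK_FLAG = "\U0001F3F4"
    CANCEL_TAG = "\U000E007F"
    TAG_BASE = 0xE0000          # tag characters mirror ASCII at U+E00xx
 
    def subdivision_flag(iso_3166_2: str) -> str:
        # e.g. "GB-SCT" -> black flag + tag letters "gbsct" + cancel tag
        code = iso_3166_2.replace("-", "").lower()
        tags = "".join(chr(TAG_BASE + ord(c)) for c in code)
        return BLACK_FLAG + tags + CANCEL_TAG
 
    print([hex(ord(c)) for c in subdivision_flag("GB-ENG")])
    # ['0x1f3f4', '0xe0067', '0xe0062', '0xe0065', '0xe006e', '0xe0067', '0xe007f']
 
The output matches the England sequence quoted from Wikipedia above.
 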
| tialaramex wrote:
| Don't worry, "How the United Kingdom works" is a political question and so subject to change.
|
| For example, Wales was essentially just straight-up conquered, and so for long periods Wales did not have any distinct legal identity from England. You'll see that today there's a bunch of laws which are for _England and Wales_ but notably not Scotland, including criminal laws. In living memory Wales got some measure of independent control over its own affairs, via an elected "Assembly", but what powers are "devolved" to this assembly are in effect the gift of the Parliament in Westminster, which is sovereign. Whether taking away those powers would go well is a good question.
|
| On the other hand, Northern Ireland is what's left of English/British dominion over the entire island of Ireland, most of which today is the Republic of Ireland, a sovereign entity with its own everything. It's only existed for about a century, and is a result of the agreed "partition" when the Irish rebelled, because _most of the Irish_ wanted independence but those in the North not so much. Feel free to read about the euphemistically named "Troubles". In the modern era, Northern Ireland, like Wales, gets a devolved government, in Stormont. Unlike Wales, the Northern Ireland government is a total mess, and e.g. they have abortion (like the rest of the UK, and like the rest of Ireland) only because Stormont was so broken that Westminster imposed abortion legalisation on them since they weren't actually governing. If you think the US Congress is dysfunctional, check out Stormont...
|
| Finally, Scotland was for a very long time an independent but closely related sovereign nation. It _agreed_ to join this United Kingdom about three hundred years ago in the Acts of Union, after about a century with the same monarch ruling both countries. However, it too got a devolved government, a Parliament, probably the most powerful of the three, in Holyrood, Edinburgh, in the 20th century, and it has a relatively powerful pro-independence politics; the Scottish National Party is the dominant power in Scottish politics, although how many of its voters _actually_ support independence per se is tricky to judge.
|
| Brexit changed all this again, because as part of the EU a bunch of the powers you could reasonably localise, and which were therefore "devolved" to Wales, Scotland and Northern Ireland, had been controlled by EU law. So Westminster could _say_ they were devolved, knowing that the constituent entities couldn't actually do much with this supposed power. Having left the EU, those powers were among the things Brexiteers seemed to have imagined now lay at Westminster, but of course the devolved countries said no, these are our powers; we get to decide, e.g., how agricultural subsidies are distributed to suit our farmers.
|
| That's even more fun in Northern Ireland, because they share a border with the Republic, an EU member, and so they're not allowed to have certain rules that would obviously result in a physical border with guards and so on. Their Unionists (the people who are why it isn't just part of the Republic of Ireland, because they want to be in the United Kingdom) feel like they were sold out by Westminster politicians, while the Republicans (those who'd rather be part of the Republic) see this as potentially a further argument in favour of that. All of which isn't helping at all to keep the peace between these rivals, that peace being the whole reason we don't want to put up a border...
| dhosek wrote: | Most flags use the ISO 2-character country code to access their | values. However, some flags don't map to 2-character country | codes (Scotland being one example). In this case it uses the | sequence black flag, GBSCT (for Great Britain-Scotland, | represented using the tag latin small letter codes for the | letters) then cancel tag to end the sequence. Changing the | middle five to be GBENG gives the English flag and GBWLS gives | the Welsh flag. | [deleted] | architectdrone wrote: | humorously, on my local machine, I only see the string "us", and | was rather confused when he was asserting that it was a single | character :D ___________________________________________________________________ (page generated 2022-01-27 23:00 UTC)