[HN Gopher] So Long Surrogates: How We Moved to UTF-8 in Haskell
___________________________________________________________________
So Long Surrogates: How We Moved to UTF-8 in Haskell

Author : wofo
Score : 103 points
Date : 2022-04-27 15:55 UTC (7 hours ago)

(HTM) web link (www.channable.com)
(TXT) w3m dump (www.channable.com)

| nerdponx wrote:
| This site saves 26 "statistics" cookies and 99 "marketing" cookies.
|
| Really? Is all that necessary?

| meetups323 wrote:
| Your user agent saves the cookies. If you don't like it, change it.

| hombre_fatal wrote:
| I do think cookies get unfair treatment.
|
| They are things that your browser happily rebroadcasts back to the server with no real UI for it outside of the shitty devtool bar made for devs, even after all this outcry about cookies.
|
| It reminds me of the meme of the guy riding a bicycle, throwing a branch into the spokes (rebroadcasting cookies), and then roaring in pain on the ground about how evil websites/advertisers are tracking him with cookies.
|
| That said, what a lame HN thread on a post about Haskell.

| eklavya wrote:
| I have come to accept it and just ignore it. Many times there would be a long thread having to do absolutely nothing about the topic at hand. Not a tangent but like completely unrelated, why are we even discussing this here kind of thing.
|
| I wish there was a good way to visually differentiate when a new top comment starts except by squinting and figuring out the whitespace from the left of the mobile screen, more painful than necessary I presume.

| Mindless2112 wrote:
| Use HackerWeb. Top-level comments are highlighted, and it automatically collapses threads to show only top-level comments when there are a lot of comments.
|
| https://hackerweb.app/#/item/31181595

| hombre_fatal wrote:
| I couldn't help myself. Always desperate to find an opportunity to shove my 2 cents into the world.
| So imagine my glee when you provided me with another one!
|
| Yeah, I think both (A) defaulting to auto-expanded threads and (B) making them annoying to collapse make HN worse than it could be.
|
| You tend to read the top-level thread because it's already there. And then it ends up being longer than you expected, or you're trapped in a subtree that just won't end, or you just want to see what other people are saying. And there's no good way to move past it.
|
| Would be nice to click the indentation to collapse the thread anywhere inside the tree.

| boogies wrote:
| I just scroll to the top and use the "next" link on the top comment (added with the prev and context links around October 27-28th last year I think).

| bawolff wrote:
| Ignoring the privacy bit - 125 cookies is quite a bit of per-request overhead, especially in HTTP/1.1 where they are not compressed. I would say it's poor website design.

| Rygian wrote:
| Why shift the burden onto the user and the user agent? The website is the only one to blame here.

| meetups323 wrote:
| Blaming the website for your own agent doing something you don't want it to is learned helplessness.
|
| Every marketing cookie generates revenue for the website in some way or another. The website wants revenue, so it asks the user agent to maintain those cookies. The user agent agrees. Then the operator of the user agent gets upset that the website asked their agent to store the cookies? Get upset that your agent agreed, not that a request was made.
|
| Or better yet, don't get upset at all and just solve the darned problem yourself. Is this Hacker News or Complier News?

| jstimpfle wrote:
| Cookies as a mechanism are useful and required for a solid modern web experience. However, tracking cookies are arguably the opposite of that. A typical modern website with marketing comes with, I don't know, 100s of cookies.
| Are you really arguing that the user should be required to vet each individual cookie whenever following a link with unvetted cookies?
|
| Or how do you solve this problem? Personally, the most I can be arsed to do is install some Adblock Plugin. I did that only a few months ago and I'm not even sure that it improved my experience by a lot.

| jimmaswell wrote:
| There is no problem to solve, the cookies can't hurt you and the website needs to stay afloat.

| jstimpfle wrote:
| To state the obvious, some people don't love the extensive profiles that are created of them.

| eternityforest wrote:
| Those people should be able to avoid the profiling, but any solution should be aimed at protecting those people, without impacting the 95% who don't care enough to give up convenience or pay for private services too much.

| jstimpfle wrote:
| Maybe my view is warped (I'm from Germany) but 95% seems a tad high...

| zasdffaa wrote:
| > and required for a solid modern web experience
|
| Absence of cookies doesn't make things unstable (non-solid?), and fuck knows what 'modern' is supposed to mean, or why it's good.
|
| > Or how do you solve this problem?
|
| Block all cookies except for rare moments like posting on HN, which then immediately get deleted. And no JS, which means CPU is trivial (so no burn-a-core-for-every-open-tab which is so common with page-sized pointless animations). Many problems can be solved if you want them to be.

| eternityforest wrote:
| How exactly will sites remember that you are logged in? And how would we have any web apps that aren't horrendous without JS?
|
| Also, where is this burn-a-core-for-every-open-tab stuff? Many websites are highly optimized and do not use much CPU. Not enough to be noticed without actually looking at the numbers anyway.
|
| What sites have page size animations these days?

| zasdffaa wrote:
| > How exactly will sites remember that you are logged in?
|
| I don't want them to.
| I log back in if necessary (browser remembers id/pswd). For those few I need to stay logged in, I use a VM and save the state - I'm more concerned about controlling JS than cookies in such cases.
|
| > And how would we have any web apps that aren't horrendous without JS?
|
| I don't use web apps. My tradeoff.
|
| > Also, where is this burn-a-core-for-every-open-tab stuff? Many websites are highly optimized and do not use much CPU.
|
| Oddly, it seems to be corporate bullshit sites that are the worst offenders. Can't find one but you're right, it's not all by any means. I retract.

| jstimpfle wrote:
| But you realize you're the oddball that considers the problem solved like that? I'm not sure that being a "hacker" means to straight out refuse things. You're missing out on a lot of fun and inspiring information (and yes, many many hours wasted to irrelevant content).

| zasdffaa wrote:
| You make your choices and I make mine. Should a person make the informed choice to immerse themselves in the web as-is with all its problems & risks, ok, but most people just pick the easy path then bitch after. I'm not one of them, and straight out refusal is in fact a viable option for me.
|
| If I do need anything more, there's VMs. BTW what 'fun and inspiring information' do you refer to? Shadertoy is a loss I grant, but what else?

| jstimpfle wrote:
| If you miss Shadertoy it won't be hard to imagine other similar things, of which there are plenty. Anything that requires interactivity beyond the one provided by HTML & CSS will obviously require Javascript. Any personalized experience (not only suggestions which yes are evil, but also personal storage) will obviously require cookies to function.
|
| Deleting Cookies on exit (and/or at regular intervals) will probably not help much in terms of avoiding tracking, especially if you log back in using your reinitialized cookies.
| zasdffaa wrote:
| > it won't be hard to imagine other similar things, of which there are plenty
|
| which again you don't give.
|
| > Anything that requires interactivity ... obviously require Javascript
|
| jeez, no shit, I get it.
|
| > (some defeatist blah about cookies)
|
| Whatever.
|
| You just persistently don't get it. These are my choices. I made them carefully. They suit me. They may not suit you. We could even compromise if you made an effort to see what I'm after but you won't/can't. Now please try to understand I'm not you, and just back off!

| Rygian wrote:
| Blaming the user-agent for accepting an abusive amount of cookies set by the website is outright bad faith.
|
| The only entity with any real power to decide which cookies the website uses is the website itself.
|
| Asking the user or the user agent to comb through cookies and decide, one by one, which ones seem marketing-related and which ones are technically required, and then block, is _way_ too much to ask from a regular internet user.
|
| I have tried, but fail to see good faith in your reply.

| grumbel wrote:
| The browser is the one who stores and sends cookies. It would be trivial to make that action explicit and only at the user's request. That wouldn't even be a new feature, that used to be how things worked 20 years ago. Lynx is however the only browser left that I know that still asks you before storing cookies.
|
| You don't even have to sift through cookies for this to work, you can just reject all by default until the user explicitly requests them to be stored (or use a whitelist or wait until the user tries a login that would necessitate a cookie, etc.) Lots of possibilities.
|
| > is way too much to ask from a regular internet user.
|
| That's kind of the point. By making it all transparent and seamless browser makers played into the hand of marketing companies.
| If cookies had a cost and would degrade the user experience, they might be thinking twice before putting hundreds of them on a site.
|
| Marketing companies are just making use of the tools they are given. And browser manufacturers gave them a lot of tools, while taking control away from the user.

| zasdffaa wrote:
| Word. Tired of these "I don't want this but I won't spend any time or money on fixing it so someone else should do it" posts.
|
| Hint: it's under Tools|Preferences in firefox/palemoon

| Rygian wrote:
| No, it's not under "Tools|Preferences."
|
| There is no setting anywhere, in any web browser, to "retain cookies that are technically necessary and reject marketing cookies" which is the desirable behaviour.

| zasdffaa wrote:
| Define marketing cookie for me - do you mean 3rd party?
|
| (Some possible control via Tools|Preferences|Exceptions... button allows you to customise by website, although I've never used it. Or just disallow all, which is what I do)
|
| ---
|
| Edit: answer the question please, there may be an easy solution to what you want.
|
| Edit2: No reply because god forbid there's an actual way you could take control, that would simply ruin everything (in a parallel universe, man complains the streets are rife with face stabbing but when presented with proof they're not, stabs self in face to prove otherwise).
|
| Biggest problem with learned helplessness is that they like it that way. Gives them something to be angrily resentful about.

| rini17 wrote:
| Easy, enable only cookies for the things you want (maintain your session with 1st website, plus core functionality like payments). Everything else is marketing cookies.
|
| I used umatrix for years but gave up. The guessing what to enable to get a site to work got tiresome, and IIRC there was also a problem with browser support.
| Rygian wrote:
| Definition of cookies I don't consent to: any cookie that is not mandatory for the site to technically work.

| zasdffaa wrote:
| You don't answer my question, then use a vague term of 'technically work' to ensure I can't give you useful info. tl;dr you don't want to be helped.

| matthewmacleod wrote:
| Blaming others for making legitimate complaints about pervasive bad practices is learned assholishness.
|
| We should all complain loudly and far more than we do about the creeping tendency of many companies to do so many obviously shitty things, instead of merely shrugging our shoulders.

| deathanatos wrote:
| Heh, so I actually do this.
|
| An _incredible_ amount of the web just breaks. Twitter, Reuters, Imgur. Like it's one thing if, when I attempt to log in, your login fails (and usually, logins fail to handle the error & will just loop back to the start, that's at least a _start_) but a lot of the web will have a flash-of-text and then nothing, & JS has crashed.

| Aardwolf wrote:
| If only Windows, Java and JavaScript could also move away from internal usage of UTF-16, it's purely a legacy format and the worst of both worlds (UTF-32 and UTF-8). Even worse is that Unicode itself, which should in theory be a list of codes for glyphs, modifiers and other script related values, that's independent of encoding, had to have some codes reserved for "surrogates" for the UTF-16 encoding anyway. UTF-8 doesn't need such a thing...

| cryptonector wrote:
| Microsoft is making improvements in their UTF-8 support. Getting rid of the `W` APIs will take forever. Java and JavaScript are even more stuck with UTF-16.

| Aardwolf wrote:
| UTF-8 support for filenames would be a great start, to support Windows filenames in a multiplatform way in C!

| cryptonector wrote:
| But what do you care how the file names are stored on disk, as long as you can read directories and traverse paths using UTF-8?
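A minimal sketch of the surrogate mechanism Aardwolf mentions above, in Python. It is an illustration only, not code from the thread or the linked article, and the helper names `to_surrogate_pair`, `utf16_code_units` and `utf8_rejects` are made up for the example. UTF-16 represents a code point above U+FFFF as two 16-bit code units taken from the reserved range U+D800 to U+DFFF, which is why those values cannot be assigned to ordinary characters. UTF-8 needs no such reservation, and Python's UTF-8 encoder refuses a lone surrogate.

```python
from typing import List, Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point above U+FFFF into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go into the low surrogate
    return high, low


def utf16_code_units(text: str) -> List[int]:
    """Return the UTF-16 code units of text, as produced by Python's own encoder."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def utf8_rejects(text: str) -> bool:
    """True if Python's UTF-8 encoder refuses to encode text."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return True
    return False


emoji = "\U0001F600"  # U+1F600, outside the Basic Multilingual Plane

# The hand-computed pair matches what the built-in UTF-16 encoder emits.
assert list(to_surrogate_pair(ord(emoji))) == utf16_code_units(emoji)

print([hex(unit) for unit in utf16_code_units(emoji)])  # ['0xd83d', '0xde00']
print(len(emoji.encode("utf-8")))                       # 4 (bytes, no surrogates involved)
print(utf8_rejects("\ud83d"))                           # True: a lone surrogate is not encodable
```

The last line is the practical face of the complaint: the range exists in the code space only so that UTF-16 can use it, and every other encoding has to treat it as forbidden.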
| layer8 wrote:
| Besides the surrogate characters there are also some other noncharacters: https://www.unicode.org/faq/private_use.html#noncharacters
|
| Because of modifier characters, control characters like for bidi, stuff like soft-hyphens and ligatures, locale-dependent semantics (upper/lowercase, collation etc.), the general discordance between glyphs and characters, and so on and so forth, Unicode is so complex, and in general always requires careful processing of code point (or code unit) sequences, that honestly the surrogate encoding doesn't make that much of a difference. It's just an additional wrinkle in a sea of wrinkles.

| Aardwolf wrote:
| I still find the surrogates different. Bidi, private use, ligatures, ... are script or locale related.
|
| Unicode uses numeric values from 0 to 1112063. You can invent all kinds of methods to encode numbers from 0 to 1112063 (variable length, fixed length, decimal, hexadecimal, anything else). But most ways I can think of to encode these numbers, including variable length ones that would use 8 bit or 16 bit primitives, don't require me to actually reserve some of those to-be-encoded numbers themselves for a special meaning. Yet for UTF-16 they managed to do it. Imagine that all other encodings out there would also want to reserve some Unicode values for their own purpose!

| layer8 wrote:
| You always have to work with sequences of code units anyway (instead of just single code points), so the individual reasons for that don't make much of a difference. It seems your rejection is more on aesthetic than on practical grounds.

| camgunz wrote:
| I have an old saw about UTF-16 not being an irredeemable format and UTF-8 eating the world being bad, and I'm happy to dig it out again.
|
| UTF-16 is great for lots of East Asian languages, which billions of people use.
| In UTF-8, most of those languages require 3 bytes to encode a 32-bit codepoint, in UTF-16 they only ever need 2. This ends up being a huge savings.
|
| The main benefit of UTF-8 if you're say, Chinese, is interop. Everything else is worse.
|
| You might think "but BOMs are super evil." Checking a BOM is extremely, extremely easy. Furthermore, you don't get to bail out of checking anything just by using UTF-8, you have to check to ensure you have _valid_ UTF-8. That's right, you gotta scan the whole bytestream anyway, so you may as well just check the 2-byte BOM at the beginning too.
|
| You might also think "what about ASCII compatibility?" ASCII compatibility is an anti-feature. You should never be indexing into UTF strings (you always have to iterate, or save the results of an iteration), upper/lowercasing isn't addition/subtraction, etc. etc. You also can't just forget about encodings as a result--you can store ASCII in something expecting UTF-8, but you definitely can't store UTF-8 in something expecting ASCII. So if you're sniffing/decoding/tagging a format anyway, you may as well be agnostic.
|
| You might also think "OK OK, you could be right, but what about HTML, which is mostly ASCII and would nearly double in size if it went from UTF-8 to UTF-16." Practically all HTML is gzipped, so the difference is pretty small, plus the majority of text isn't HTML (almost anything stored in a database, almost anything in a file on your computer, etc.)
|
| Different encodings are good at different things. There's no one superior encoding for all uses. What we need is text encoding agnosticism.
|
| ---
|
| In fairness, I will say I've heard that UTF-8 is pretty popular in countries with exactly the kind of languages I'm talking about, so the issue is mostly moot at this point.
| I just think UTF-16 gets a really bad rap, and I think we shouldn't just gloss over UTF-8 having won because it's good for European languages.

| JoshTriplett wrote:
| If you care about text size, you should compress your text; that'll save much more space, since it can optimize for what's actually used in the document.
|
| > ASCII compatibility is an anti-feature.
|
| ASCII compatibility is extremely useful if you're working with, for instance, filenames or programming languages. You can lex UTF-8 and handle separators like `/` or quotes like `"` and `'`, because those bytes can never occur otherwise.

| languageserver wrote:
| I am always extremely doubtful of these types of blogposts that take a well-known algorithm and somehow beat all others (including academia, bioinformatics tools, etc.) with a fancy implementation in <insert cool programming language 2022>

| poorlyknit wrote:
| (author here)
|
| I wrote this article during a short internship at Channable. Not to be apologetic but I think these kinds of articles are so prevalent because young or unpopular languages usually have worse documentation than established ones (naturally). I basically wrote down the things I learned during my internship that I found noteworthy.

| Nebasuke wrote:
| The article is about how they moved an existing (fast) implementation in Haskell in UTF-16 to an even faster implementation in Haskell by switching to UTF-8. This is stated in the first paragraph.
|
| The post they reference is also very honest: ..., the fastest Haskell implementation of the Aho-Corasick string searching algorithm, which powers string search in Channable.
|
| Basically the blog posts show that if you want to program in Haskell and still optimise, this is how you can do it. I think both posts are great resources and don't overstate their claims.

| danschuller wrote:
| I was taught Haskell at university and I'm old.
| Looking at its wiki page it's a 32 year old language not that much younger than 37 year old C++.

| crdrost wrote:
| Oh wow. That is really not very much pain, as described.
|
| I have to say, I never thought that the benefit of Haskell having a horrible native string type would be "you can just upgrade strings like any other dependency," which is really kinda slick. You think about how much pain there was for Py2 -> Py3 where one of the big sticking factors was all of the distinctions around strings and encoding and byte arrays... this is comparatively quite nice. Makes me wonder how much of a programming language can be hotswappable.

| resoluteteeth wrote:
| Utf8 vs utf16 as the internal representation of the Unicode string type is mostly just an implementation detail.
|
| This is very different from going from python2, which conflated bytes and ascii strings, to python3, which intentionally changed the api to properly distinguish sequences of bytes and strings.
___________________________________________________________________
(page generated 2022-04-27 23:01 UTC)