[HN Gopher] So Long Surrogates: How We Moved to UTF-8 in Haskell
       ___________________________________________________________________
        
       So Long Surrogates: How We Moved to UTF-8 in Haskell
        
       Author : wofo
       Score  : 103 points
       Date   : 2022-04-27 15:55 UTC (7 hours ago)
        
 (HTM) web link (www.channable.com)
 (TXT) w3m dump (www.channable.com)
        
       | nerdponx wrote:
       | This site saves 26 "statistics" cookies and 99 "marketing"
       | cookies.
       | 
       | Really? Is all that necessary?
        
         | meetups323 wrote:
         | Your user agent saves the cookies. If you don't like it, change
         | it.
        
           | hombre_fatal wrote:
           | I do think cookies get unfair treatment.
           | 
           | They are things that your browser happily rebroadcasts back
           | to the server with no real UI for it outside of the shitty
           | devtool bar made for devs, even after all this outcry about
           | cookies.
           | 
           | It reminds me of the meme of the guy riding a bicycle,
           | throwing a branch into the spokes (rebroadcasting cookies),
           | and then roaring in pain on the ground about how evil
           | websites/advertisers are tracking him with cookies.
           | 
           | That said, what a lame HN thread on a post about Haskell.
        
             | eklavya wrote:
             | I have come to accept it and just ignore it. Many times
             | there would be a long thread having to do absolutely
             | nothing about the topic at hand. Not a tangent but like
             | completely unrelated, why are we even discussing this here
             | kind of thing.
             | 
             | I wish there was a good way to visually differentiate when
             | a new top comment starts except by squinting and figuring
             | out the whitespace from the left of the mobile screen, more
             | painful than necessary I presume.
        
               | Mindless2112 wrote:
               | Use HackerWeb. Top-level comments are highlighted, and it
               | automatically collapses threads to show only top-level
               | comments when there are a lot of comments.
               | 
               | https://hackerweb.app/#/item/31181595
        
               | hombre_fatal wrote:
               | I couldn't help myself. Always desperate to find an
               | opportunity to shove my 2 cents into the world. So
               | imagine my glee when you provided me with another one!
               | 
               | Yeah, I think both (A) defaulting to auto-expanded
               | threads and (B) making them annoying to collapse make HN
               | worse than it could be.
               | 
               | You tend to read the top-level thread because it's
               | already there. And then it ends up being longer than you
               | expected, or you're trapped in a subtree that just won't
               | end, or you just want to see what other people are
               | saying. And there's no good way to move past it.
               | 
               | Would be nice to click the indentation to collapse the
               | thread anywhere inside the tree.
        
               | boogies wrote:
               | I just scroll to the top and use the "next" link on the
               | top comment (added with the prev and context links around
               | October 27-28th last year I think).
        
           | bawolff wrote:
           | Ignoring the privacy bit - 125 cookies is quite a bit of per
           | request overhead, especially in http/1.1 where they are not
           | compressed. I would say it's poor website design.
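
To put a rough number on that per-request overhead, here is a back-of-the-envelope sketch in Python. Only the count of 125 comes from the thread; the cookie names and the 32-character values are made up for illustration, and real cookies on the site may be much smaller or larger.

```python
# Hypothetical cookies: 125 of them, each with a 4-character name and a
# 32-character value. The real names and sizes on the site are unknown.
cookies = {f"c{i:03d}": "x" * 32 for i in range(125)}

# An HTTP/1.1 client sends all of them in one uncompressed header line.
header = "Cookie: " + "; ".join(f"{k}={v}" for k, v in cookies.items())

header_bytes = len(header.encode("ascii"))
print(f"{len(cookies)} cookies -> {header_bytes} bytes on every request")
```

Under these assumed sizes the header alone runs to several kilobytes, repeated on every request to the site.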
        
           | Rygian wrote:
           | Why shift the burden on the user and the user agent? The
           | website is the only one to blame here.
        
             | meetups323 wrote:
             | Blaming the website for your own agent doing something you
             | don't want it to is learned helplessness.
             | 
             | Every marketing cookie generates revenue for the website in
             | some way or another. The website wants revenue, so it asks
             | the user agent to maintain those cookies. The user agent
             | agrees. Then the operator of the user agent gets upset that
             | the website asked their agent to store the cookies? Get
             | upset that your agent agreed, not that a request was made.
             | 
             | Or better yet, don't get upset at all and just solve the
             | darned problem yourself. Is this Hacker News or Complier
             | News?
        
               | jstimpfle wrote:
               | Cookies as a mechanism are useful and required for a
               | solid modern web experience. However, tracking cookies
               | are arguably the opposite of that. A typical modern
               | website with marketing comes with, I don't know, 100s of
               | cookies. Are you really arguing that the user should be
               | required to vet each individual cookie whenever following
               | a link with unvetted cookies?
               | 
               | Or how do you solve this problem? Personally, the most I
               | can be arsed to do is install some Adblock Plugin. I did
               | that only a few months ago and I'm not even sure that it
               | improved my experience by a lot.
        
               | jimmaswell wrote:
               | There is no problem to solve, the cookies can't hurt you
               | and the website needs to stay afloat.
        
               | jstimpfle wrote:
               | To state the obvious, some people don't love the
               | extensive profiles that are created of them.
        
               | eternityforest wrote:
               | Those people should be able to avoid the profiling, but
               | any solution should be aimed at protecting those people,
               | without impacting the 95% who don't care enough to give
               | up convenience or pay for private services too much.
        
               | jstimpfle wrote:
               | Maybe my view is warped (I'm from Germany) but 95% seems
               | a tad high...
        
               | zasdffaa wrote:
               | > and required for a solid modern web experience
               | 
               | Absence of cookies doesn't make things unstable
               | (non-solid?), and fuck knows what 'modern' is supposed to
               | mean, or why it's good.
               | 
               | > Or how do you solve this problem?
               | 
               | Block all cookies except for rare moments like posting on
               | HN, which then immediately get deleted. And no JS, which
               | means CPU is trivial (so no burn-a-core-for-every-open-
               | tab which is so common with page-sized pointless
               | animations). Many problems can be solved if you want them
               | to be.
        
               | eternityforest wrote:
               | How exactly will sites remember that you are logged in?
               | And how would we have any web apps that aren't horrendous
               | without JS?
               | 
               | Also, where is this burn-a-core-for-every-open-tab stuff?
               | Many websites are highly optimized and do not use much
               | CPU. Not enough to be noticed without actually looking at
               | the numbers anyway.
               | 
               | What sites have page size animations these days?
        
               | zasdffaa wrote:
               | > How exactly will sites remember that you are logged in?
               | 
               | I don't want them to. I log back in if necessary (browser
               | remembers id/pswd). For those few I need to stay logged
               | in, I use a VM and save the state - I'm more concerned
               | about controlling JS than cookies in such cases.
               | 
               | > And how would we have any web apps that aren't
               | horrendous without JS?
               | 
               | I don't use web apps. My tradeoff.
               | 
               | > Also, where is this burn-a-core-for-every-open-tab
               | stuff? Many websites are highly optimized and do not use
               | much CPU.
               | 
               | Oddly, it seems to be corporate bullshit sites that are
               | the worst offenders. Can't find one but you're right,
               | it's not all by any means. I retract.
        
               | jstimpfle wrote:
               | But you realize you're the oddball that considers the
               | problem solved like that? I'm not sure that being a
               | "hacker" means to straight out refuse things. You're
               | missing out on a lot of fun and inspiring information
               | (and yes, many many hours wasted to irrelevant content).
        
               | zasdffaa wrote:
               | You make your choices and I make mine. Should a person
               | make the informed choice to immerse themselves in the web
               | as-is with all its problems & risks, ok, but most people
               | just pick the easy path then bitch after. I'm not one of
               | them, and straight out refusal is in fact a viable option
               | for me.
               | 
               | If I do need anything more, there's VMs. BTW what 'fun
               | and inspiring information' do you refer to? Shadertoy is
               | a loss I grant, but what else?
        
               | jstimpfle wrote:
               | If you miss Shadertoy it won't be hard to imagine other
               | similar things, of which there are plenty. Anything that
               | requires interactivity beyond the one provided by HTML &
               | CSS will obviously require Javascript. Any personalized
               | experience (not only suggestions which yes are evil, but
               | also personal storage) will obviously require cookies to
               | function.
               | 
               | Deleting Cookies on exit (and/or at regular intervals)
               | will probably not help much in terms of avoiding
               | tracking, especially if you log back in using your
               | reinitialized cookies.
        
               | zasdffaa wrote:
               | > it won't be hard to imagine other similar things, of
               | which there are plenty
               | 
               | which again you don't give.
               | 
               | > Anything that requires interactivity ... obviously
               | require Javascript
               | 
               | jeez, no shit, I get it.
               | 
               | > (some defeatist blah about cookies)
               | 
               | Whatever.
               | 
               | You just persistently don't get it. These are my choices.
               | I made them carefully. They suit me. They may not suit
               | you. We could even compromise if you made an effort to
               | see what I'm after but you won't/can't. Now please try to
               | understand I'm not you, and just back off!
        
               | Rygian wrote:
               | Blaming the user-agent for accepting an abusive amount of
               | cookies set by the website is outright bad faith.
               | 
               | The only entity with any real power to decide which
               | cookies the website uses is the website itself.
               | 
               | Asking the user or the user agent to comb through cookies
               | and decide, one by one, which ones seem marketing-related
               | and which ones are technically required, and then block,
               | is _way_ too much to ask from a regular internet user.
               | 
               | I have tried, but fail to see good faith in your reply.
        
               | grumbel wrote:
               | The browser is the one who stores and sends cookies. It
               | would be trivial to make that action explicit and only at
               | the users request. That wouldn't even be a new feature,
               | that used to be how things worked 20 years ago. Lynx is
               | however the only browser left that I know that still asks
               | you before storing cookies.
               | 
               | You don't even have to sift through cookies for this to
               | work: you can just reject all by default until the user
               | explicitly requests them to be stored (or use a whitelist,
               | or wait until the user tries a login that would
               | necessitate a cookie, etc.). Lots of possibilities.
               | 
               | > is way too much to ask from a regular internet user.
               | 
               | That's kind of the point. By making it all transparent
               | and seamless, browser makers played into the hands of
               | marketing companies. If cookies had a cost and would
               | degrade the user experience, they might be thinking twice
               | before putting hundreds of them on a site.
               | 
               | Marketing companies are just making use of the tools they
               | are given. And browser manufacturers gave them a lot of
               | tools, while taking control away from the user.
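
The "reject by default, allow a whitelist" idea can be sketched with Python's standard `http.cookiejar` module. This is an illustration of the policy rather than of any browser's behaviour, and the single whitelisted domain is a hypothetical choice.

```python
from http.cookiejar import CookieJar, DefaultCookiePolicy

# Hypothetical whitelist: the only site the user has chosen to trust.
ALLOWED = ["news.ycombinator.com"]

# With allowed_domains set, cookies are accepted and returned only for
# the listed domains; everything else is refused.
policy = DefaultCookiePolicy(allowed_domains=ALLOWED)
jar = CookieJar(policy=policy)

# is_not_allowed() reports whether the whitelist excludes a domain.
for domain in ["news.ycombinator.com", "www.channable.com"]:
    verdict = "rejected" if policy.is_not_allowed(domain) else "accepted"
    print(f"{domain}: cookies {verdict}")
```

A jar configured this way can be handed to `urllib.request.HTTPCookieProcessor`, so the whitelist applies to every request made through that opener.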
        
               | zasdffaa wrote:
               | Word. Tired of these "I don't want this but I won't spend
               | any time or money on fixing it so someone else should do
               | it" posts.
               | 
               | Hint: it's under Tools|Preferences in firefox/palemoon
        
               | Rygian wrote:
               | No, it's not under "Tools|Preferences."
               | 
               | There is no setting anywhere, in any web browser, to
               | "retain cookies that are technically necessary and reject
               | marketing cookies" which is the desirable behaviour.
        
               | zasdffaa wrote:
               | Define marketing cookie for me - do you mean 3rd party?
               | 
               | (Some possible control via
               | Tools|Preferences|Exceptions... button allows you to
               | customise by website, although I've never used it. Or
               | just disallow all, which is what I do)
               | 
               | ---
               | 
               | Edit: answer the question please, there may be an easy
               | solution to what you want.
               | 
               | Edit2: No reply because god forbid there's an actual way
               | you could take control, that would simply ruin everything
               | (in a parallel universe, man complains the streets are
               | rife with face stabbing but when presented with proof
               | they're not, stabs self in face to prove otherwise).
               | 
               | Biggest problem with learned helplessness is that they
               | like it that way. Gives them something to be angrily
               | resentful about.
        
               | rini17 wrote:
               | Easy, enable only cookies for the things you want
               | (maintain your session with 1st website, plus core
               | functionality like payments). Everything else are
               | marketing cookies.
               | 
               | I used umatrix for years but gave up. The guessing what
               | to enable to get a site to work got tiresome, and IIRC
               | there was also problem with browser support.
        
               | Rygian wrote:
               | Definition of cookies I don't consent to: any cookie that
               | is not mandatory for the site to technically work.
        
               | zasdffaa wrote:
               | You don't answer my question, then use a vague term of
               | 'technically work' to ensure I can't give you useful info.
               | tl;dr you don't want to be helped.
        
               | matthewmacleod wrote:
               | Blaming others for making legitimate complaints about
               | pervasive bad practices is learned assholishness.
               | 
               | We should all complain loudly and far more than we do
               | about the creeping tendency of many companies to do so
               | many obviously shitty things, instead of merely shrugging
               | our shoulders.
        
           | deathanatos wrote:
           | Heh, so I actually do this.
           | 
           | An _incredible_ amount of the web just breaks. Twitter,
           | Reuters, Imgur. Like it's one thing if, when I attempt to
           | log in, your login fails (and usually, logins fail to handle
           | the error & will just loop back to the start, that's at least
           | a _start_) but a lot of the web will have a flash-of-text
           | and then nothing, & JS has crashed.
        
       | Aardwolf wrote:
       | If only Windows, Java and JavaScript could also move away from
       | internal usage of UTF-16, it's purely a legacy format and the
       | worst of both worlds (UTF-32 and UTF-8). Even worse is that
       | Unicode itself, which should in theory be a list of codes for
       | glyphs, modifiers and other script related values, that's
       | independent of encoding, had to have some codes reserved for
       | "surrogates" for the UTF-16 encoding anyway. UTF-8 doesn't need
       | such a thing...
        
         | cryptonector wrote:
         | Microsoft is making improvements in their UTF-8 support.
         | Getting rid of the `W` APIs will take forever. Java and
         | JavaScript are even more stuck with UTF-16.
        
           | Aardwolf wrote:
           | UTF-8 support for filenames would be a great start, to
           | support Windows filenames in a multiplatform way in C!
        
             | cryptonector wrote:
             | But what do you care how the file names are stored on
             | disk, as long as you can read directories and traverse
             | paths using UTF-8?
        
         | layer8 wrote:
         | Besides the surrogate characters there are also some other
         | noncharacters:
         | https://www.unicode.org/faq/private_use.html#noncharacters
         | 
         | Because of modifier characters, control characters like for
         | bidi, stuff like soft-hyphens and ligatures, locale-dependent
         | semantics (upper/lowercase, collation etc.), the general
         | discordance between glyphs and characters, and so on and so
         | forth, Unicode is so complex, and in general always requires
         | careful processing of code point (or code unit) sequences, that
         | honestly the surrogate encoding doesn't make that much of a
         | difference. It's just an additional wrinkle in a sea of
         | wrinkles.
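
A small Python sketch of the point about sequences: the same visible character can be one code point or two, whichever encoding is used, so code that assumes "one unit is one character" breaks with or without surrogates. The sample string is chosen only for illustration.

```python
import unicodedata

decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)  # single code point U+00E9

# Code point counts differ although both render as the same character.
print(len(decomposed), len(composed))

# The same holds for UTF-8 bytes and for UTF-16 code units.
print(len(decomposed.encode("utf-8")), len(composed.encode("utf-8")))
print(len(decomposed.encode("utf-16-le")) // 2,
      len(composed.encode("utf-16-le")) // 2)

# A naive comparison treats the two spellings as different strings.
print(decomposed == composed)
```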
        
           | Aardwolf wrote:
           | I still find the surrogates different. Bidi, private use,
           | ligatures, ... are script or locale related.
           | 
           | Unicode uses numeric values from 0 to 1114111 (0x10FFFF). You can
           | invent all kinds of methods to encode numbers in that range
           | (variable length, fixed length, decimal, hexadecimal,
           | anything else). But most ways I can think of to encode these
           | numbers, including variable length ones that would use 8 bit
           | or 16 bit primitives, don't require me to actually reserve
           | some of those to-be-encoded numbers themselves for a special
           | meaning. Yet for UTF-16 they managed to do it. Imagine that
           | all other encodings out there would also want to reserve some
           | Unicode values for their own purpose!
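
For readers who have not seen the mechanism, here is a minimal Python sketch of how UTF-16 splits a code point above U+FFFF into two surrogates, and why the range U+D800..U+DFFF had to be taken out of the code space. The function name is made up for this example. Code points run from 0 to 0x10FFFF; removing the 2048 surrogates leaves 1112064 encodable scalar values.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary-plane code point into UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = code_point - 0x10000      # 20 bits remain
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low

high, low = to_surrogate_pair(0x1F600)  # U+1F600, an emoji outside the BMP
print(hex(high), hex(low))

# The values U+D800..U+DFFF are reserved so the scheme above stays
# unambiguous, which is why Python's UTF-8 encoder refuses them.
try:
    chr(0xD800).encode("utf-8")
    lone_surrogate_ok = True
except UnicodeEncodeError:
    lone_surrogate_ok = False

scalar_values = 0x110000 - 0x800  # all code points minus the surrogates
print(lone_surrogate_ok, scalar_values)
```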
        
             | layer8 wrote:
             | You always have to work with sequences of code units anyway
             | (instead of just single code points), so the individual
             | reasons for that don't make much of a difference. It
             | seems your rejection is more on aesthetic than on practical
             | grounds.
        
         | camgunz wrote:
         | I have an old saw about UTF-16 not being an irredeemable format
         | and UTF-8 eating the world being bad, and I'm happy to dig it
         | out again.
         | 
         | UTF-16 is great for lots of East Asian languages, which
         | billions of people use. In UTF-8, most of those languages
         | require 3 bytes to encode a code point; in UTF-16 they
         | usually need only 2. This ends up being a huge savings.
         | 
         | The main benefit of UTF-8 if you're say, Chinese, is interop.
         | Everything else is worse.
         | 
         | You might think "but BOMs are super evil." Checking a BOM is
         | extremely, extremely easy. Furthermore, you don't get to bail
         | out of checking anything just by using UTF-8, you have to check
         | to ensure you have _valid_ UTF-8. That's right, you gotta scan
         | the whole bytestream anyway, so you may as well just check the
         | 2-byte BOM at the beginning too.
         | 
         | You might also think "what about ASCII compatibility?" ASCII
         | compatibility is an anti-feature. You should never be indexing
         | into UTF strings (you always have to iterate, or save the
         | results of an iteration), upper/lowercasing isn't
         | addition/subtraction, etc. etc. You also can't just forget
         | about encodings as a result--you can store ASCII in something
         | expecting UTF-8, but you definitely can't store UTF-8 in
         | something expecting ASCII. So if you're
         | sniffing/decoding/tagging a format anyway, you may as well be
         | agnostic.
         | 
         | You might also think "OK OK, you could be right, but what about
         | HTML, which is mostly ASCII and would nearly double in size if
         | it went from UTF-8 to UTF-16." Practically all HTML is gzipped,
         | so the difference is pretty small, plus the majority of text
         | isn't HTML (almost anything stored in a database, almost
         | anything in a file on your computer, etc.)
         | 
         | Different encodings are good at different things. There's no
         | one superior encoding for all uses. What we need is text
         | encoding agnosticism.
         | 
         | ---
         | 
         | In fairness, I will say I've heard that UTF-8 is pretty popular
         | in countries with exactly the kind of languages I'm talking
         | about, so the issue is mostly moot at this point. I just think
         | UTF-16 gets a really bad rap, and I think we shouldn't just
         | gloss over UTF-8 having won because it's good for European
         | languages.
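
The size trade-off is easy to check in Python. The two sample strings below are arbitrary; real documents mix scripts, markup and whitespace, so the ratios will vary.

```python
samples = {
    "Chinese": "你好世界",      # 4 code points, all in the Basic Multilingual Plane
    "English": "hello world",  # 11 ASCII code points
}

sizes = {}
for label, text in samples.items():
    utf8 = len(text.encode("utf-8"))
    utf16 = len(text.encode("utf-16-le"))  # the "-le" codec writes no BOM
    sizes[label] = (utf8, utf16)
    print(f"{label}: UTF-8 {utf8} bytes, UTF-16 {utf16} bytes")

# Checking for a BOM is a two-byte comparison.
with_bom = "你好".encode("utf-16")  # the plain "utf-16" codec prepends a BOM
has_bom = with_bom[:2] in (b"\xff\xfe", b"\xfe\xff")
print(has_bom)
```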
        
           | JoshTriplett wrote:
           | If you care about text size, you should compress your text;
           | that'll save much more space, since it can optimize for
           | what's actually used in the document.
           | 
           | > ASCII compatibility is an anti-feature.
           | 
           | ASCII compatibility is extremely useful if you're working
           | with, for instance, filenames or programming languages. You
           | can lex UTF-8 and handle separators like `/` or quotes like
           | `"` and `'`, because those bytes can never occur otherwise.
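
A short Python sketch of that property. The path is invented for the example; the point is that byte 0x2F in UTF-8 always means `/`, while in UTF-16 the same byte value can be half of an unrelated code unit.

```python
path = "données/日本語/file.txt"

# In UTF-8 the byte 0x2F only ever means '/', so splitting raw bytes is safe.
parts = [p.decode("utf-8") for p in path.encode("utf-8").split(b"/")]
print(parts)

# U+012F (LATIN SMALL LETTER I WITH OGONEK) is the byte pair 2F 01 in
# UTF-16-LE, so a byte-level search for '/' finds a false match there.
ogonek_utf16 = "\u012f".encode("utf-16-le")
ogonek_utf8 = "\u012f".encode("utf-8")
print(b"/" in ogonek_utf16, b"/" in ogonek_utf8)
```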
        
       | languageserver wrote:
       | I am always extremely doubtful of these types of blogposts that
       | take a well-known algorithm and somehow beat all others
       | (including academia, bioinformatics tools, etc.) with a fancy
       | implementation in <insert cool programming language 2022>
        
         | poorlyknit wrote:
         | (author here)
         | 
         | I wrote this article during a short internship at Channable.
         | Not to be apologetic but I think these kinds of articles are so
         | prevalent because young or unpopular languages usually have
         | worse documentation than established ones (naturally). I
         | basically wrote down the things I learned during my internship
         | that I found noteworthy.
        
         | Nebasuke wrote:
         | The article is about how they moved an existing (fast)
         | implementation in Haskell in UTF-16 to an even faster
         | implementation in Haskell by switching to UTF-8. This is stated
         | in the first paragraph.
         | 
         | The post they reference is also very honest: "..., the fastest
         | Haskell implementation of the Aho-Corasick string searching
         | algorithm, which powers string search in Channable."
         | 
         | Basically the blog posts show that if you want to program in
         | Haskell and still optimise, this is how you can do it. I think
         | both posts are great resources and don't overstate their
         | claims.
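
For context, here is a minimal Python sketch of Aho-Corasick running directly over UTF-8 bytes. It is a textbook version written for this note, not Channable's Haskell implementation, and it makes no attempt at the optimisations the article describes. The function names and sample strings are made up, and the reported positions are byte offsets, not character offsets.

```python
from collections import deque

def build_automaton(needles):
    """Build a byte-level Aho-Corasick automaton from str needles."""
    goto = [{}]   # goto[state][byte] -> next state
    fail = [0]    # failure link for each state
    out = [[]]    # needles recognised on reaching each state
    for needle in needles:
        state = 0
        for byte in needle.encode("utf-8"):
            if byte not in goto[state]:
                goto[state][byte] = len(goto)
                goto.append({})
                fail.append(0)
                out.append([])
            state = goto[state][byte]
        out[state].append(needle)
    # Breadth-first pass: a state's failure link is the longest proper
    # suffix of its path that is also a path from the root.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for byte, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and byte not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(byte, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def find_all(automaton, haystack):
    """Return (byte_offset, needle) for every match in haystack."""
    goto, fail, out = automaton
    state = 0
    matches = []
    for pos, byte in enumerate(haystack.encode("utf-8")):
        while state and byte not in goto[state]:
            state = fail[state]
        state = goto[state].get(byte, 0)
        for needle in out[state]:
            start = pos + 1 - len(needle.encode("utf-8"))
            matches.append((start, needle))
    return matches

automaton = build_automaton(["he", "she", "hers", "日本"])
matches = find_all(automaton, "ushers 日本語")
print(matches)
```

Because UTF-8 is self-synchronising, a byte-level match of a valid needle in a valid haystack always starts on a code point boundary, so no separate decoding step is needed during the search.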
        
         | danschuller wrote:
         | I was taught Haskell at university and I'm old. Looking at its
         | wiki page, it's a 32-year-old language, not that much younger
         | than 37-year-old C++.
        
       | crdrost wrote:
       | Oh wow. That is really not very much pain, as described.
       | 
       | I have to say, I never thought that the benefit of Haskell having
       | a horrible native string type would be "you can just upgrade
       | strings like any other dependency," which is really kinda slick.
       | You think about how much pain there was for Py2 -> Py3 where one
       | of the big sticking factors was all of the distinctions around
       | strings and encoding and byte arrays... this is comparatively
       | quite nice. Makes me wonder how much of a programming language
       | can be hotswappable.
        
         | resoluteteeth wrote:
         | UTF-8 vs UTF-16 as the internal representation of the Unicode
         | string type is mostly just an implementation detail.
         | 
         | This is very different from going from Python 2, which conflated
         | bytes and ASCII strings, to Python 3, which intentionally
         | changed the API to properly distinguish sequences of bytes and
         | strings.
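
The Python 3 side of that distinction, as a short sketch. The sample word is arbitrary. The API change is visible to every caller, unlike a change of internal representation.

```python
text = "naïve"               # str: a sequence of Unicode code points
data = text.encode("utf-8")  # bytes: a sequence of integers 0..255

# The two lengths differ because 'ï' takes two bytes in UTF-8.
print(len(text), len(data))

# Mixing the two types is a TypeError, not a silent implicit conversion.
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(mixed_ok, data.decode("utf-8") == text)
```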
        
       ___________________________________________________________________
       (page generated 2022-04-27 23:01 UTC)