[HN Gopher] Brian Kernighan adds Unicode support to Awk
       Brian Kernighan adds Unicode support to Awk
       Author : ducktective
       Score  : 288 points
       Date   : 2022-08-20 18:32 UTC (4 hours ago)
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
       | nanna wrote:
       | Apparently Kernighan is also updating his 1988 Awk book this
       | summer.
       | https://irreal.org/blog/?p=10746
         | 7thaccount wrote:
         | That would be cool
       | ducktective wrote:
       | I became aware of this while watching Professor Brailsford's
       | interview with him (Computerphile channel):
       | https://www.youtube.com/watch?v=GNyQxXw_oMQ
       | (Around 7-8 minute mark)
       | Update:
       | At the 24-25 minute mark, he talks about the technologies he is
       | considering for writing his new book (he mentions troff and
       | groff).
       | He says he wanted to try "XeTeX" (which supports Unicode) but
       | "...I was going to download it as an experiment and they wanted 5
       | gigabytes and 5 gigabytes at the particular boonies place I'm
       | living would...mmm..not be finished yet!"
       | So there we go...We had the opportunity to read the mind of the
       | developer of awk and unix and co-author of the literal "C
       | Programming Language", confronted with the absolute state of the
       | tooling of the modern world.
         | mlyle wrote:
         | > He says he wanted to try "XeTeX" (which supports Unicode) but
         | "...I was going to download it as an experiment and they wanted
         | 5 gigabytes and 5 gigabytes at the particular boonies place I'm
         | living would...mmm..not be finished yet!"
         | Man, I think once you're Kernighan there should be like a
         | 1gigabit/sec symmetric circuit wherever you go just in case you
         | use it to do something else useful.
           | rustqt6 wrote:
           | You put it jokingly? But distinguished people should
           | definitely get some state managed perks like politicians do.
           | scoot wrote:
           | This isn't about bandwidth, it's about the size of modern
           | binaries.
             | inglor_cz wrote:
             | Pardon me, but why have modern binaries grown so big? When
             | I wrote my thesis in TeX, the entire installation would fit
             | in some 30 megabytes or so. It was actually right in the
             | uncomfortable middle. Far too big to carry around on a set
             | of diskettes, but a CD would be a waste of space.
               | mlyle wrote:
               | You can fit a decent TeX distribution in <100MB.
               | But if you want to have every macro package that everyone
               | everywhere likes, you're going to use some space.
               | Turing_Machine wrote:
               | I've often wondered why there isn't a dependency system
               | for TeX that lets you get only the packages you need...
               | feed it a document and tell it to automatically download
               | and install any missing packages.
               | There may be some technical reason why this isn't
               | practical. Anyone know, offhand?
               | mickmcq wrote:
               | There is a TeX distribution that does what you say,
               | TinyTeX. For some reason it is obscure and only really
               | used by R and R Studio users. That may be because it is
               | used to render R Markdown documents to pdf.
               | kccqzy wrote:
               | Totally practical and supported. See MiKTeX:
               | https://miktex.org/kb/just-enough-tex
               | But the reason I don't use it is because I don't always
               | have Internet when I need to write my document. I'd
               | rather download all the packages and their documentation
               | beforehand.
               | rdlw wrote:
               | MiKTeX also supports downloading individual packages,
               | though I wish there was an option to download, say, the
               | most commonly used 1GB or something.
             | mananaysiempre wrote:
             | TeX distributions are enormous, but the binaries themselves
             | are actually not big--the overhead imposed by Knuth's
             | obnoxious license, while nonzero (you can't modify the
             | original Pascal-in-WEB source, only patch it, so a manual
             | source port to a different language is painful enough that
             | nobody tried, we're all using an automatic Pascal-to-C
             | translation with lipstick on it), is not huge, and aside
             | from a smattering of utilities that's it for the binary
             | part.
             | It's just that the distros also include oodles of (plain
             | text!) macro packages for everything under the sun. There
             | are some legitimately large things such as fonts, but
             | generally speaking a full TeXLive or MiKTeX distribution is
             | bloat by ten thousand 100-kilobyte files, like a Python
             | distribution with the whole of PyPI included.
             | If you know what you want, you can probably fit a
             | comprehensive LaTeX workbench in under 50M, but it takes an
             | inordinate amount of time.
         | jfk13 wrote:
         | Sounds like he was looking at downloading a complete TeX Live
         | distribution; XeTeX itself isn't anything like that size (by a
         | couple orders of magnitude, at least).
           | ducktective wrote:
           | I think distros package it like TeX-full, TeX-minimal
           | etc...The one having documentation files is a couple of GiB
           | on Ubuntu...
           | I wonder what distro or editor he is using...
             | naves wrote:
             | Two years ago, he used macOS on a 13" MacBook Air and an
             | iMac, as per his conversation with Lex Fridman:
             | https://youtu.be/O9upVbGSBFo?t=2523
           | zimpenfish wrote:
           | MacTeX is 4.7GB which matches the 5GB he's talking about
           | and...
           | "MacTeX installs TeX Live, which contains TeX, LaTeX, AMS-
           | TeX, and virtually every TeX-related style file and font.
           | [...] MacTeX also installs the GUI programs TeXShop, LaTeXiT,
           | TeX Live Utility, and BibDesk. MacTeX installs Ghostscript,
           | an open source version of Postscript."
           | Which is, as you say, considerably more than just "XeTeX".
           | (Also those are universal binaries containing both Intel and
           | ARM versions which probably adds some heft.)
             | JadeNB wrote:
             | > (Also those are universal binaries containing both Intel
             | and ARM versions which probably adds some heft.)
             | Heh, I remember when "universal binaries" meant "PowerPC
             | and Intel". Different universes ....
               | jhbadger wrote:
               | In the early/mid 1990s there were even "fat binaries"
               | that had 68000 (the original Mac platform) and PowerPC
               | binaries back when PowerPC was the new thing.
         | maxnoe wrote:
         | The problem is that TeXLive still defaults to doing a full
         | install.
         | A full install means installing ~4000 packages, including their
         | source files (tens of thousands of tex files) and built
         | documentation (thousands of PDF files) and hundreds of free
         | fonts (otfs, ttfs, TeX's own formats).
         | This is _huge_ ( >7GB, not just the 5 GB claimed here).
         | However, you don't need 99 % of this for any given document.
         | Not installing the source files and documentation PDFs will
         | alone reduce the size by roughly half.
         | Only installing the packages you really need from a minimal
         | installation gives you a few hundred megabytes at most for even
         | complex documents.
         | It's a bit annoying to get the list of packages needed though,
         | since there is not really any working dependency management.
         | I wrote a python wrapper around the tex live installer [1] to
         | make this easy for CI jobs, see e.g. [2].
         | On a side note: I'd recommend luatex over xetex.
         | - [1] https://github.com/maxnoe/texlive-batch-installation/
         | - [2] https://github.com/pep-dortmund/toolbox-
         | workshop/blob/8b00f0...
           | jcelerier wrote:
           | On Arch Linux there's the texlive-core package, which does not
           | ship the PDF docs (most of the size). It should install about
           | 500 MB (most of which is fonts) and already provides enough to
           | build normal documents, including lualatex for Unicode
           | support.
           | JadeNB wrote:
           | TeXLive also comes with installation schemes that will give
           | you (if I remember the names correctly) bare, medium, and
           | full installations, if you prefer not to pick packages
           | yourself. Alternately, although I don't use it myself, I'm
           | sure you could use MiKTeX, which is much better about
           | on-demand package installation. (Or even Overleaf, if you don't
           | want to put anything on your local device!)
         | cfiggers wrote:
         | Watching this interview inspired me to start playing around
         | with groff. It has a very steep learning curve... And being as
         | old/niche as it is, I've found it very hard to find any active
         | community to get newbie questions answered. If anybody knows
         | where I could find that sort of thing, I'd be very grateful.
         | samatman wrote:
         | > _the absolute state of the tooling of the modern world_
         | Hah, TeX Live is... not that.
         | It's been enormous since I installed it off a CD in the 90s.
         | The idea, and it works, is that you can just compile anyone's
         | stuff out of the TeX ecosystem.
         | There is just... a lot... in it. You don't need a package
         | manager if you install the whole universe locally. Like I said:
         | not what _I_ would call a modern approach to tooling.
         | On the other hand, I have latex files from the mid-Noughties
         | and, I don't even need to check: they'll compile if I want them
         | to.
         | But yeah, if you want just a little piece of TeX here and
         | there, you're off the beaten track. That's not how TUG rolls.
           | rdlw wrote:
           | TeX Live can also be configured to install the bare minimum
           | TeX ecosystem (or just TeX+LaTeX), which only takes a few
           | minutes to download and install but results in hunting down
           | dependencies and manually installing them whenever you want
           | to use a new package.
           | It also seems quite slow to update, and a recent (?) name
           | change of `tools' to `latex-tools' seems to have broken
           | multicol, which drove me to MiKTeX. Internet connection
           | required, but far less headache.
         | stjohnswarts wrote:
         | He could master it in a week if he set his head to it. I don't
         | have one doubt of that. He just doesn't really need to.
       | svnpenn wrote:
       | Looks like it's not really done yet:
       | https://github.com/onetrueawk/awk/compare/master...unicode-s...
       | arduinomancer wrote:
       | For context this is 37 years after it was released (1985)
       | YesThatTom2 wrote:
       | Of course he did. Aho has better things to do and Weinberger is
       | too rich to write code any more.
       | ducktective wrote:
       | [off-topic]
       | Following the spirit of UNIX, I did a little analysis on the
       | upvotes this post got over time (fish-shell):
       |     while true
       |         curl -sL 'https://news.ycombinator.com/item?id=32534173' \
       |             | pup '#score_32534173 text{}' \
       |             | awk -F'[^0-9]*' '{print $1}' \
       |             | tee -a points
       |         sleep 15s
       |     end
       | (Initially I used `grep -Po '\d+'` but switched it with an awk
       | solution due to...context!)
       | I started it approx. when I posted it. Now ~2 hours have passed
       | since. Using `gnuplot`:
       |     f(x) = a*x+b
       |     fit f(x) "points" via a,b
       |     set terminal png size 1920,1080 enhanced font "Inconsolata,20"
       |     set output "HN-analysis.png"
       |     set grid
       |     set ylabel "points"
       |     set key bottom right
       |     set xlabel "sample # (15s interval)"
       |     plot 'points' w linesp lt 7 lw 3 lc rgb "orange", f(x) lc rgb 'blue' lw 2
       | We generate the plot: https://i.imgur.com/pS6AaI5.png
       | (The jump at sample #100 is due to a network error on my side)
       | And here are the coeffs. of a linear fit over the data (note that
       | every 4 samples is 1 minute, so this post got ~1.52 upvotes per
       | minute):
       |     a = 0.380809, b = 19.8437
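The per-minute figure above follows from the fitted slope: `a` is upvotes per sample, and with one sample every 15 seconds there are 4 samples per minute. A minimal Python sketch of the same idea, assuming x is simply the sample index; the function names and the synthetic data are hypothetical and are not the poster's measurements or gnuplot's own fitting algorithm:

```python
def linear_fit(ys):
    """Ordinary least-squares fit of y = a*x + b, with x = 0, 1, 2, ... (sample index)."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def upvotes_per_minute(slope_per_sample, sample_interval_s=15):
    """Convert a slope in upvotes/sample to upvotes/minute."""
    return slope_per_sample * (60 / sample_interval_s)


# Synthetic, perfectly linear data purely for illustration.
synthetic_points = [20 + 0.5 * i for i in range(8)]
a, b = linear_fit(synthetic_points)
print(a, b)

# Slope reported in the thread, converted to a per-minute rate.
print(round(upvotes_per_minute(0.380809), 2))
```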
       | etaioinshrdlu wrote:
       | I liked this interview with Brian with Lex Fridman:
       | https://www.youtube.com/watch?v=O9upVbGSBFo
         | neilpanchal wrote:
         | +1. Also, I like your username.
       | timakro wrote:
       | I believe no distro actually ships this version of awk by
       | default. They ship GNU awk which has Unicode support anyways.
         | svnpenn wrote:
         | Debian:
         | https://distrowatch.com/table.php?distribution=debian&pkglis...
           | timakro wrote:
           | So it turns out the default on Debian is mawk which does NOT
           | support Unicode. Thanks for pointing that out. This simple
           | test gives different results for gawk and mawk.
           | $ echo 'ö' | awk '{print length}'
             | layer8 wrote:
             | ...only if the current locale is set to use UTF-8 (or some
             | other variable-width encoding). Which nowadays the default
             | locale usually does, but in principle it doesn't need to
             | be.
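The difference in that test comes down to counting bytes versus counting code points. A small Python sketch of the same distinction, using ö as an assumed example of a non-ASCII character (the variable names are illustrative only):

```python
text = "\u00f6"  # ö, a single code point (U+00F6)

# A byte-oriented tool sees the UTF-8 encoding, which is two bytes long.
utf8_byte_length = len(text.encode("utf-8"))

# A Unicode-aware tool counts code points instead.
code_point_length = len(text)

# In a single-byte encoding such as Latin-1 the two counts agree,
# which is why the result depends on the locale in use.
latin1_byte_length = len(text.encode("latin-1"))

print(utf8_byte_length, code_point_length, latin1_byte_length)
```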
         | chasil wrote:
         | OpenBSD uses "The One True AWK."                 $ awk -V
         | awk version 20211208
         | Kernighan's version is likely used in other places where the
         | GPL is eschewed.
           | fanf2 wrote:
           | I think the other BSDs do too, including macOS.
       | lelandfe wrote:
       | > _Once I figure out how... I will try to submit a pull request.
       | I wish I understood git better, but in spite of your help, I
       | still don't have a proper understanding, so this may take a
       | while._
       | Even Kernighan struggles with git.
         | brudgers wrote:
         | Torvalds is a better programmer than that.
         | Pull requests are a feature of GitHub, not a part of git.
         | https://docs.github.com/en/pull-requests/collaborating-with-...
         | Blikkentrekker wrote:
         | The culture around p.r.s is truly a high barrier of entry for
         | many people.
         | Figuring out how all of this works is substantially more
         | difficult I find in practice than fixing many longstanding
         | trivial bugs in a great deal of software.
           | umanwizard wrote:
           | What's the alternative? The old way (which is still used by
           | many projects) is to send patches to mailing lists, which I
           | find more difficult: you need to learn how to generate the
           | patch from your source code repo, send the patch as an e-mail
           | (needing weird hacks like `git imap-send`), and then
           | configure your MUA not to mangle it somehow. Then you also
           | don't have a centralized search/tracking interface.
           | Some good reasons not to use GitHub are that you're
           | familiar with standard/traditional tools, or that you
           | prefer not to use centralized services. Both of those are
           | fine reasons! But "the traditional way is easier" isn't.
         | mordechai9000 wrote:
         | This reminds me of the relevant xkcd: https://xkcd.com/1597/
       | stakkur wrote:
       | It's somewhat comforting to hear even Brian K. say he doesn't
       | understand Git well.
         | rustqt6 wrote:
         | Git really is a mess. The fact that commits and not diffs have
         | hashes should be lampooned despite arguably a few small
         | benefits. Geniuses make mistakes too, and git is Linus's. The
         | only reason git is respected is because it came from Linus. If
         | it were from Microsoft it would get all the criticism it
         | deserves and then 20 times more.
         | layer8 wrote:
         | I came to the comments to say that it's reassuring. :)
           | cafard wrote:
           | Likewise.
       | tialaramex wrote:
       | The choice to use UTF-32 (ie Unicode code points as integers,
       | which might as well be 32-bit since your CPU definitely doesn't
       | have a suitably sized integer type) is unexpected, as I had seen
       | so many other systems just choose to work entirely in UTF-8 for
       | this problem.
       | Now, Brian obviously has much better instincts about performance
       | than I do and may even have tried some things and benchmarked
       | them, but my guess would have been that you should stay in UTF-8
       | because it's always faster for the typical cases.
         | bombcar wrote:
         | Is UTF-32 fixed size per char? Because then it allows simple
         | math that you can't do on UTF-8.
           | moomin wrote:
           | A "character" can be of fairly arbitrary length in Unicode,
           | so no.
             | fooster wrote:
             | Not to be contradictory, but Unicode is not a specific
             | encoding. UTF-8 is an encoding (with a variable length)
             | and UTF-32 is an encoding of a Unicode code point with a
             | fixed length.
           | valleyer wrote:
           | It's a fixed size per codepoint. Many clusters that appear
           | atomic in a text editor are made up of multiple codepoints.
           | The flag emojis are among the many examples.
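As a concrete illustration of the flag case: a flag is a pair of regional-indicator code points, so even a fixed-width encoding spends two units on what a reader sees as one symbol. A short Python sketch (the French flag is chosen arbitrarily as the example):

```python
# Regional indicator symbols F and R; most renderers draw them as one flag.
flag = "\U0001F1EB\U0001F1F7"

code_points = [f"U+{ord(c):04X}" for c in flag]
code_point_count = len(flag)                 # Python's len() counts code points
utf32_bytes = len(flag.encode("utf-32-le"))  # 4 bytes per code point
utf8_bytes = len(flag.encode("utf-8"))       # each of these code points needs 4 bytes

print(code_points, code_point_count, utf32_bytes, utf8_bytes)
```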
           | simias wrote:
           | It's always the tradeoff, some operations are simpler on
           | UTF-32 but they have additional memory (and therefore cache)
           | footprint and since you typically don't want to use UTF-32
           | externally you have to convert back and forth which is not
           | free.
           | I think these days people don't bother with UTF-32 too much
           | because it's not even like you have a clean "one 32bit int,
           | one character" relation anyway since some characters can be
           | built from multiple codepoints. Since generally most code
           | manipulating character strings is interested in characters
           | and not codepoints, UTF-32 is effectively a variable-length
           | encoding too...
             | layer8 wrote:
             | Another factor is that nowadays machine code execution is
             | much faster than memory accesses, so the trade-off of
             | requiring more program logic to process a more compact
             | format makes a lot of sense.
             | tialaramex wrote:
             | Right, somebody else might have actual metrics but I'd have
             | guessed actual regular expression patterns are split
             | something like:
             | 90% Only care about ASCII, thus individual bytes in UTF-8,
             | and so UTF-32 just wastes memory
             | 1% Care about individual code points, but spread over
             | multiple bytes (e.g. the double dagger ‡), UTF-32 is
             | perfect
             | 9% Care about multiple code points (to form e.g. a Flag, or
             | é written in combining form, or two women kissing) and so
             | UTF-32 doesn't really help again
           | happytoexplain wrote:
           | It is fixed size per code point, which are what developers
           | (and programming languages) sometimes casually call a
           | character, but in practice a character is a grapheme, which
           | can be multiple code points once you're outside the ASCII
           | range. But it can still be useful to count code points, which
           | would be faster in UTF-32.
           | Edit: Mixed up code units and code points.
             | thayne wrote:
             | And even then, in some languages at least, what constitutes
             | a grapheme isn't always well defined.
               | happytoexplain wrote:
               | True - I was thinking of Unicode's definition
               | ("[extended] grapheme clusters").
               | a1369209993 wrote:
               | > in some languages at least, what constitutes a grapheme
               | isn't always well defined.
               | Can you provide some examples? People _say_ this a lot,
               | but the cases I've been able to find tend to be things
               | like U+01F1 LATIN CAPITAL LETTER DZ, which is only not
               | well defined in the sense that Unicode defines it wrong
               | (as one character rather than two) presumably-on-purpose,
               | for compatibility with one or more older character
               | encodings.
               | happytoexplain wrote:
               | Is DZ "wrong" because it's not considered a digraph by
               | professionals, or because people don't agree that
               | digraphs should be considered single characters?
               | a1369209993 wrote:
               | "DZ" isn't 'wrong', it's a perfectly valid two-character
               | string consisting of "D" followed by "Z". Assigning to a
               | multi-character string an encoded representation that
               | isn't the concatenation of representations of each
               | character _in_ the string (especially while insisting
               | that that makes it a distinct character in its own right)
               | is what's wrong.
           | masklinn wrote:
           | > Because then it allows simple math that you can't do on
           | UTF-8.
           | That's not actually useful, because unicode itself is a
           | variable length encoding.
           | So it mostly blows up the size of your data.
           | Though it might have been selected for implementation
           | simplicity and / or backwards compatibility (e.g. same reason
           | why Python did it, then had to invent "flexible string
           | representation" because strings had become way too big to be
           | acceptable).
             | moefh wrote:
             | UTF-32 is a fixed-length encoding of Unicode[1], so it does
             | simplify things a lot for a regex engine.
             | [1] At least when talking about code points, which is what
             | matters for regular expressions (unless you want stuff like
             | \X which is not universally supported).
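To make the regex point concrete: when the matcher works on code points, `.` consumes one whole character, whereas on raw UTF-8 bytes the same character takes several steps. A small sketch using Python's `re` module as a stand-in; this only illustrates the general idea and says nothing about how awk's matcher is actually written:

```python
import re

dagger = "\u2021"                 # ‡, the double dagger
encoded = dagger.encode("utf-8")  # three bytes in UTF-8

# On a str pattern, "." matches exactly one code point.
matches_as_code_point = re.fullmatch(r".", dagger) is not None

# On a bytes pattern, "." matches one byte, so a single "." is not enough...
matches_as_single_byte = re.fullmatch(rb".", encoded) is not None

# ...and the pattern has to account for all three bytes.
matches_as_three_bytes = re.fullmatch(rb"...", encoded) is not None

print(matches_as_code_point, matches_as_single_byte, matches_as_three_bytes)
```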
             | dotancohen wrote:
             | Unicode is not an encoding, despite MS Notepad calling some
             | encoding "Unicode".
               | tialaramex wrote:
               | Unicode isn't a _storage_ encoding and so yeah, Notepad
               | shouldn't do that. However Unicode does encode
               | essentially all extant human writing systems into
               | integers called "code points" between zero and 0x10FFFF.
               | The Latin "capital A" is 65 for example.
               | However you'd probably like to store something more
               | compact than, say, JSON arrays of integers. So there are
               | also a bunch of encodings which turn the integers into
               | bytes. These encodings would work for any integers, but
               | they make most sense to encode Unicode's code points.
               | UTF-8 turns each code point into 1-4 bytes, a pair of
               | UTF-16 encodings turns them into one or two "code units"
               | each of 2 bytes either little or big endian. And UTF-32
               | just encodes them as native 32-bit integers but again
               | either little or big endian.
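The sizes described above can be checked directly. A minimal Python sketch; the helper name `encoded_sizes` and the sample characters are illustrative choices, not anything from the thread:

```python
def encoded_sizes(ch):
    """Return (code point, UTF-8 bytes, UTF-16 code units, UTF-32 bytes) for one character."""
    code_point = ord(ch)
    utf8_bytes = len(ch.encode("utf-8"))
    utf16_units = len(ch.encode("utf-16-le")) // 2  # each code unit is 2 bytes
    utf32_bytes = len(ch.encode("utf-32-le"))
    return code_point, utf8_bytes, utf16_units, utf32_bytes


for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:  # A, é, €, and an emoji
    cp, u8, u16, u32 = encoded_sizes(ch)
    print(f"U+{cp:04X}: UTF-8 {u8} byte(s), UTF-16 {u16} unit(s), UTF-32 {u32} bytes")
```

The explicit `-le` variants are used so that Python does not prepend a byte order mark, which would otherwise inflate the counts.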
               | formerly_proven wrote:
               | > Q: What is Unicode?
               | > A: Unicode is the universal character encoding,
               | maintained by the Unicode Consortium. This encoding
               | standard provides the basis for processing, storage and
               | interchange of text data in any language in all modern
               | software and information technology protocols.
               | https://home.unicode.org/basic-info/faq/
               | brewmarche wrote:
               | In your quote encoding refers to assigning numbers (code
               | points in Unicode parlance) to characters (I am
               | simplifying here, I know the definition of character in
               | Unicode is not that easy).
               | It's like a catalogue of scripts. We have to extend it
               | when we encounter new scripts that are not catalogued yet
               | (or when we create new emojis)
               | Converting a byte sequence to a Unicode code point
               | sequence and vice-versa is called transformation format
               | (or more generally an encoding form, but then might not
               | be deterministic) by Unicode (see
               | <https://www.unicode.org/faq/utf_bom.html#gen2>). Unicode
               | specifies UTF-8, -16 and -32. We do not have to change
               | these formats unless the catalogue hits the limits of 32
               | bits (not a big problem for UTF-8 but for the other two
               | formats). These formats are already able to encode code
               | points that are not assigned yet.
               | And the confusion now is that a lot of people call what
               | Unicode calls transformation format (i.e. the byte to
               | code point mapping) encoding as well. The term charset is
               | also used sometimes.
               | PS: Note that a goal of Unicode is to be able to
               | accommodate legacy encoding/charsets by having a broad
               | enough catalogue. This is so that these legacy encodings,
               | which may come with their own catalogue, can be mapped to
               | the Unicode catalogue. So we have control codes (even
               | though not part of any "proper" human script),
               | precomposed letters (there is a code point for à although
               | it could be represented by a + combining `), things like
               | the Greek terminal form of sigma separately encoded,
               | although that could be done in font-rendering (like
               | generally done for Arabic), and a lot more to aid with
               | mapping and roundtrips.
               | jll29 wrote:
               | Regardless of official terminology, there are two levels:
               | 1. Map a character to a unique number in a character set
               | (in Unicode: called codepoint)
               | 2. Map a number that represents a character in a
               | character set to a bit pattern for storage (transiently
               | or persistently, internally or externally). Unicode code
               | points can be bit-encoded in various ways: UTF8, UCS2 and
               | UCS4/UTF32.
               | The original code points permit the same character to be
               | represented in various ways, which makes equality checks
               | non-trivial: for instance a character like "ä" can be
               | represented as a single character or alternatively as a
               | composition of "a" + umlaut accent (2 characters).
               | So far, this is all about plain text, so we are not
               | talking about font families or character properties
               | (bold, italics, underlined) or orientation (super-script,
               | sub-script).
               | Ken Lunde's opus magnum is the standard book on
               | representing text in various languages other than
               | English, with a focus on Asian languages:
               | https://www.oreilly.com/library/view/cjkv-information-
               | proces...
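The equality problem mentioned above (one precomposed code point versus a base letter plus a combining mark) is usually handled by normalizing before comparing. A short Python sketch using the standard `unicodedata` module, with ä as an assumed example character:

```python
import unicodedata

precomposed = "\u00e4"   # ä as a single code point
decomposed = "a\u0308"   # "a" followed by COMBINING DIAERESIS

# The two strings render identically but compare unequal code point by code point.
raw_equal = precomposed == decomposed

# Normalizing both sides to the same form (NFC composes, NFD decomposes) fixes that.
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == decomposed

print(raw_equal, nfc_equal, nfd_equal)
```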
               | layer8 wrote:
               | Unicode uses the term "character encoding form" or
               | "character encoding scheme" for what is normally referred
               | to or abbreviated as "character encoding" or "charset"
               | (see e.g. RFC 8187), and uses "character encoding" or
               | "coded character set" for the abstract assignment of
               | natural numbers to the abstract characters in a character
               | repertoire, which is more usually referred to as just
               | "[coded] character set" (cf. also UCS = Universal Coded
               | Character Set). This different use of terminology can cause
               | confusion. The GP is correct that Unicode as a whole is
               | not what is colloquially meant by "encoding".
           | rustqt6 wrote:
           | At the rate emojis are being added, in a few decades it won't
           | be. Unless Biden mistakes his nuclear briefcase with
           | children's toys in his perpetual confusion (although
           | thankfully the American msm is keeping people calm by never
           | showing his regular gaffes)
           | alganet wrote:
           | You are right.
           | Unicode in UTF-8 will have variable char length. Plain ASCII
           | will be one byte for each char, but others might have up to 4
           | bytes. Anything dealing with it will have to be aware of
           | leading bytes.
           | UTF-32, on the other hand, will encode all chars, even plain
           | ASCII ones, using 4 bytes.
           | Take the "length of a string" function, for example. Porting
           | that from ASCII to UTF-32 is just dividing the length in
           | bytes by 4. For UTF-8, you'd have to iterate over each
           | character and figure out if there is a combination of bytes
           | that collapse into a single character.
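A rough sketch of the two length functions described above, in Python. The function names are hypothetical, the UTF-8 version assumes well-formed input, and both count code points rather than user-perceived characters:

```python
def utf32_length(data: bytes) -> int:
    """Code point count of UTF-32 data (no BOM): every code point is 4 bytes."""
    return len(data) // 4


def utf8_length(data: bytes) -> int:
    """Code point count of valid UTF-8 data.

    Continuation bytes look like 0b10xxxxxx, so counting every byte that is
    NOT a continuation byte counts one lead byte per code point.
    """
    return sum(1 for b in data if (b & 0xC0) != 0x80)


sample = "na\u00efve \u2021"  # "naïve ‡": 7 code points
print(utf32_length(sample.encode("utf-32-le")))  # constant-time arithmetic
print(utf8_length(sample.encode("utf-8")))       # has to look at every byte
```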
         | simias wrote:
         | He mentions that "The amount of actual change isn't too great,
         | so I think this might be ok" so I wonder if part of the
         | equation has more to do with avoiding messing with legacy code
         | rather than raw performance. If the current code expects all
         | codepoints to have a constant-width representation, it may be
         | complicated to add UTF-8 into the mix.
         | A complete guess on my part though, I never looked into AWK's
         | source code.
           | xonix wrote:
           | This sounds reasonable. When the GoAWK creator tried to add
           | Unicode support through UTF-8 he discovered that this had
           | drastic performance implications (rendering some algorithms
           | O(N^2) instead of O(N)) if done naively:
           | https://github.com/benhoyt/goawk/issues/35. Therefore the
           | change was reverted until a more efficient implementation
           | can be found.
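One way such a quadratic blow-up can arise, sketched in Python: if every "give me character i" call on UTF-8 data rescans from the start of the string, then a loop over all N characters does on the order of N^2 byte inspections. This is a generic illustration with hypothetical function names; it is not taken from the GoAWK issue and may differ from what actually happened there:

```python
def nth_code_point_offset(data, n):
    """Find code point n in valid UTF-8 by scanning from the start.

    Returns (byte offset of the code point, number of bytes examined).
    """
    seen = -1
    for offset, b in enumerate(data):
        if (b & 0xC0) != 0x80:      # lead byte: a new code point starts here
            seen += 1
            if seen == n:
                return offset, offset + 1
    raise IndexError(n)


def total_bytes_examined(data, count):
    """Work done by naively indexing code points 0..count-1 one at a time."""
    return sum(nth_code_point_offset(data, i)[1] for i in range(count))


# Doubling the input length roughly quadruples the work.
print(total_bytes_examined(b"a" * 100, 100))
print(total_bytes_examined(b"a" * 200, 200))
```

A fixed-width representation avoids the rescan because the offset of character i is just `4 * i`; the usual UTF-8 alternatives are to iterate sequentially or to cache offsets.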
         | fpoling wrote:
         | The code only uses UTF-32 in regular expressions where I
         | suppose it was much simpler to adapt the older code. The rest
         | uses UTF-8.
       | cyocum wrote:
       | Here is Brian Kernighan mentioning the Unicode work in an
       | interview: https://www.youtube.com/watch?v=GNyQxXw_oMQ
       (page generated 2022-08-20 23:00 UTC)