A Rebuttal of an Article Involving a Character Set

This article was spurred by ``UTF-8 Everywhere'' which may be found here:
https://utf8everywhere.org

As an aside, this domain is one of several I've seen which abuse the DNS in order to translate part
of common social media nonsense to the greater Internet; that's to write, it's common to attach to
an idea a fragment of text and then proliferate this text.  Registering an entire domain name purely
to have it point to an HTTP server providing a single document is stupid and wasteful.  It crossed
my mind to title this ``UTF-8 Nowhere'' but, clearly, reserving the UTF8NOWHERE second-level domain
and then having it point to a Gopher hole and website is stupid and a spectre of social media
nonsense with which I want naught to do.

Before reading the remainder of this article, I recommend reading my 2018-06-06 article detailing my
ideal machine text representation.

The document being rebutted presents as its purpose the support of the UTF-8 encoding of Unicode and
cites performance improvements, a reduction in system complexity, and the prevention of bugs as some
benefits; it also advocates UTF-8 be used as the internal representation of text.  The first section
closes by dismissing the iteration over discrete units of text as unimportant.  The purpose of this
article is to rebut these ideas and emphasize the deficiencies of Unicode, which are numerous and
have me believe it's a complicated and poorly-designed character set which should be discarded.

The background section begins by explaining how Unicode lacked proper foresight, and reading the
file linked reveals that its current iteration has utterly failed to meet its expectations.
Particularly amusing are the following excerpts:

 http://unicode.org/history/unicode88.pdf

 Rather than struggling to salvage obsolete 8-bit encodings via horrendous `extension' contrivances,
 we need to recognize that the current absence of a standard international/multilingual encoding is
 a unique opportunity to rethink and revitalize the design concepts behind text encoding.

 Nothing comes for free, and the price of Unicode's fixed-length 16-bit character code design is the
 twofold expansion of ASCII (or other 8-bit-based) text storage, as seen in the figure on the
 previous page.  This initially repugnant consequence becomes a great deal more attractive once the
 alternative is considered.
 The only alternative to fixed-length encoding is a variable-length scheme using some sort of flags
 to signal the length and interpretation of subsequent information units.  Such schemes require
 flag-parsing overhead effort to be expended for every basic text operation, such as get next
 character, get previous character, truncate text, etc.  Any number of variable-length encoding
 schemes are possible (this fact itself being a major drawback); several that have been implemented
 are described in a later section.

The facts section remarks on the several advantages UTF-8 has over UTF-16, and I don't disagree, yet
it's amusing to see massively advantaging English listed as a point in favor of UTF-8.  The supposed
ideal and universal encoding disadvantages every other language.  My ideal machine text system gives
every supported language an optimized representation.
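
As a rough illustration of that imbalance, and not something taken from either article, the Python
sketch below counts the UTF-8 bytes spent per code point for three short sample strings.  The
strings and the function name utf8_bytes_per_code_point are my own hypothetical choices; any text
in the same scripts would show the same ratios.

```python
# Hypothetical sample strings; the escapes spell out the exact code points used.
samples = {
    "English": "hello",                            # five Latin letters
    "Greek": "\u03b1\u03b2\u03b3",                 # alpha, beta, gamma
    "Japanese": "\u3053\u3093\u306b\u3061\u306f",  # five hiragana
}

def utf8_bytes_per_code_point(text):
    """Return the average number of UTF-8 bytes per code point in text."""
    return len(text.encode("utf-8")) / len(text)

for language, text in samples.items():
    print(language, utf8_bytes_per_code_point(text))
```

Under these assumptions the English sample costs one byte per code point, the Greek two, and the
Japanese three.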

The fourth section, ``Opaque data argument'', lists the POSIX approach to filenames as somehow ideal
through the trivial example of a file-manipulation tool.  Firstly, it's important to note that UTF-8
was created by Ken Thompson and Rob Pike; unsurprisingly, this pair specifically designed it to have
unnecessary qualities specifically for soothing the C language's delicate sensibilities.  A placemat
is an appropriate venue for such an encoding to have been designed.  Both the ASCII NUL and ASCII /
characters never appear as part of any other character in UTF-8, purely because this would burden C
and POSIX, which are accustomed to being catered to and accommodating nothing.  This section fails to
mention the file-manipulation tool benefits from conflating characters and integers, which is common
with C and POSIX, as almost no POSIX systems demand a filename be proper UTF-8; this damning flaw is
inexcusable as it means there are filenames which can't rightly be accessed by some of the languages
which do enforce a real notion of a character.  A Common Lisp program represents file systems with a
pathname abstraction, which must be a string composed of characters.  An Ada program can access such
malformed filenames, by virtue of Ada supporting several different variations on its Character type,
the smallest being Latin-1; since Ada is designed for real work spanning decades, where the solution
isn't to demand the world bend around you, Ada supports types of Character, Wide_Character, and also
Wide_Wide_Character, while also supporting several different Unicode encodings and types of Strings.
It's telling this section is quick to criticize Windows issues, then entirely ignore those of POSIX.
Closing on this section, it's as if Ken Thompson thought ``I haven't done enough damage.''.
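
To make the filename complaint concrete, here is a minimal Python sketch, assuming a filename is
handed over as raw bytes as POSIX permits.  The name raw_name and the helper is_valid_utf8 are
hypothetical and exist only for illustration; the surrogateescape error handler is a real part of
Python's codecs, and is shown as one example of the kind of workaround a language with a real
notion of a character is pushed into.

```python
# A hypothetical filename: POSIX permits any bytes except NUL and '/'.
# The byte 0xFF can never occur in well-formed UTF-8.
raw_name = b"report-\xff.txt"

def is_valid_utf8(name):
    """Return True if the byte string decodes as strict UTF-8."""
    try:
        name.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False

print(is_valid_utf8(raw_name))

# The workaround: smuggle the bad byte through as a lone surrogate code point,
# which isn't a real character, and undo the trick when encoding again.
smuggled = raw_name.decode("utf-8", errors="surrogateescape")
restored = smuggled.encode("utf-8", errors="surrogateescape")
print(restored == raw_name)
```

The string in the middle isn't valid Unicode text at all, which is precisely the point.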

The fifth section lists various unnecessary Unicode concepts and sophistry intended to obscure basic
concepts of various languages, such as characters, and is expanded upon in section eight.

The sixth section tries to dispel the obvious, great disadvantage of Asian text in UTF-8 by claiming
that ASCII text is the most common, by virtue of being used in HTML and other such formats.  This is
a really good reason to stop using so-called textual formats and instead use numerical formats, which I
touch on in my 2019-04-30 article and that concerning my ideal machine text system.  The notion that
an inefficient storage format such as UTF-8 can be dealt with through compression is laughable; goes
against the historical Unicode document again, in that it recommends storing a large text in a special
encoding; is misguided; and merely excuses inefficient formats.  The Asian languages aren't the only
ones which are disadvantaged, however; UTF-8 disadvantages each and every language that isn't English,
by giving English the most efficient encoding.  My ideal machine text system doesn't suffer this, as I don't
agree with the idea that a multi-lingual document should be encoded with a single character set.  In
my system, every language used is tagged and the encoding thereof can then be optimal.  I'm inclined
to believe the reason Unicode and UTF-8 are promoted so is due to the incapability of POSIX to truly
support multiple languages; such systems can only support a one true encoding and so the notion that
one must be selected and the evil others stomped out arises.  A proper system supports multiple ways
to store text, including multiple encodings, and then has no such issues; no languages are then made
disadvantaged.

The seventh and eighth sections regard operations and so-called myths, concerning Unicode and UTF-8.
A decent programming language usually represents a string as an array of characters and this brings
many advantages, such as more generic handling, orthogonality, etc.  Sophistry is used in an attempt
to argue obvious and fundamental operations on strings, several of which are shared with arrays, are
actually unnecessary.  That section should read as insanity to those with good taste.

In my conclusion, Unicode is a very poorly-designed character set, which is fundamentally misguided.
I dislike ASCII, in part because of its control character class which behaves differently from every
other character, and yet ASCII is tolerable if for no other reason than it is simple.  Proper system
design takes pain unto itself to eliminate edge cases.  A proper text system would localize language
edge cases, so that in a language in which the notion of a character makes sense the notion could be
used; in a language with the notion of one-to-one upper and lower cases, that could be employed; and
in a language where all text follows a certain flow, that flow could be used without issue, as three
examples.  The Unicode approach seems to be to pour all edge cases into a single container, and then
expect that the programmer will handle every single one, but this is unreasonable and doesn't happen
in practice, leading to broken systems or those which only accept a subset of Unicode.
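
One such edge case, chosen by me purely as an illustration, is case conversion.  The Python sketch
below assumes Python's built-in str.upper and str.lower, which follow Unicode's case mappings, and
uses the German sharp s as the sample character; the variable names are hypothetical.

```python
sharp_s = "\u00df"  # LATIN SMALL LETTER SHARP S

upper = sharp_s.upper()     # Unicode's full case mapping expands this to "SS"
round_trip = upper.lower()  # "ss", which is not the original character

print(len(sharp_s), len(upper))
print(round_trip == sharp_s)
```

A program assuming case conversion preserves length, or can be undone, is already broken by this
single character, though a language with a one-to-one notion of case would never need to care.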

In the example of stream I/O, it seems reasonable to give the terminal, file system, and TCP similar
interfaces, yet these have fundamentally different failure cases: reading from a terminal can't fail
as the others can, though it may wait indefinitely; a file system can attempt a read and yet fail
when the file doesn't exist or changes; and a TCP connection can fail at most any point and offers
the fewest methods for correction, in the worst case of a true network failure.

Similarly, Unicode and UTF-8 remove invariants and introduce failure cases.  There's real value in
using the invariants of a language and for this reason alone Unicode is fundamentally misguided.  In
UTF-8, the simple act of collecting a character can yield an invalid character or split the text
along an improper boundary.
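
The improper boundary is easy to demonstrate.  The following is a minimal Python sketch, assuming
text is stored as UTF-8 bytes and truncated by byte count; the sample word and the helper
truncate_bytes are hypothetical and chosen only for illustration.

```python
text = "na\u00efve"             # five code points; U+00EF needs two bytes in UTF-8
encoded = text.encode("utf-8")  # six bytes in total

def truncate_bytes(data, limit):
    """Keep at most limit bytes, with no regard for character boundaries."""
    return data[:limit]

cut = truncate_bytes(encoded, 3)  # ends in the middle of the two-byte sequence

try:
    cut.decode("utf-8")
    split_ok = True
except UnicodeDecodeError:
    split_ok = False

print(len(text), len(encoded), split_ok)
```

Truncating an array of fixed-width characters has no such failure case; here the caller must first
parse the bytes to learn where it's safe to cut.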

Another damning aspect of Unicode is its more recent undertaking of filling itself with garbage that
serves only to complicate and entertain, such as characters representing flags and humans in various
acts; one reason Unicode contains so many superfluous graphics is to accommodate those character
sets which already featured such, but I believe another reason is to serve as a graphical interface
toolkit, purely because the real toolkits for such on modern systems are overly complicated.  The
lowest part of the system which is reasonable to use then becomes this toolkit.

In closing, I believe my proposed machine text system is better, in that it encourages a far simpler
system that is also smaller, lacking in superfluous qualities, and has a mechanism for the rare case
of multiple languages in one document in a way that doesn't severely disadvantage most.

As an aside, I find it amusing how the eleventh FAQ answer recommends using the incorrect POSIX line
ending rather than the proper carriage return and line feed.  I don't consider it beyond possibility
that this article is truly naught but sophistry from the cult of C and POSIX, considering UTF-8 also
originated from the same place and all positions seem to conveniently align with that view.