Protocol pondering intensifies, Pt II
-------------------------------------

In the previous post in this series[1] I compared request formats for gopher and HTTP and thought a bit about what a good anonymous document system actually needs. I ended up deciding that the answer was nothing more than gopher already provides. In this post I'll continue that discussion, focussing instead on the response format.

Recall that a gopher server's response to a request consists of nothing more than the content. What does HTTP look like? Here's a quite light real world example, obtained by requesting the /index.html path from grex.org:

---------
HTTP/1.0 200 OK
Server: nginx
Date: Fri, 14 Jun 2019 19:16:18 GMT
Content-Type: text/html
Content-Length: 45
Last-Modified: Sat, 21 Apr 2018 12:23:32 GMT
Connection: close
ETag: "5adb2d44-2d"
Accept-Ranges: bytes

It works!

---------

The very first part, HTTP/1.0, is of course the protocol version. Notice that this was the last component of the request, but it's the first component of the response. What's all that about? Anyhow, in general I think it makes a heck of a lot of sense for a response to use the same protocol version as the request, which the client of course is already aware of, so this is dead weight.

The next part, "200", is a status code, indicating whether the request was successful or triggered an error. It's followed by a human-friendly version of the machine-friendly status code, in this case simply "OK". There are lots and lots of status codes in HTTP[2]!

Then we have a bunch of headers, which look just like the request headers from last post. There's a lot of dead weight in here, for simple purposes. The "Date:" is in there for cache-related reasons. In a protocol without caching, this is useless. Specifying the "Server:" software and version serves no useful purpose, and many webserver admins actually disable this feature to avoid giving away hints about which vulnerabilities might be applicable to their server.

The "Content-Length:" is useful when a single TCP/IP connection is used for multiple request/response pairs. There's some overhead involved in setting up and tearing down these connections, and as webpages started to trigger more and more requests - to fetch stylesheets, and scripts, and images - this overhead added up to a non-trivial part of the total time for a website to render. Re-using connections is one solution to this, but it means that the client needs to know when the server is done responding, and it can do this by counting bytes until the entire "Content-Length" has been received. A *better* solution to this problem is to Stop Making So Many Damn Requests, which means the server can signal the end of the content by just closing the connection, rendering this header useless.

Is there anything of value in here? I think so!
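
To make the anatomy above concrete, here is a rough Python sketch of pulling apart a response like the one from grex.org, including the byte counting that "Content-Length" enables on a re-used connection. It is only an illustration: the function name read_response, the two canned responses and the BytesIO object standing in for a network connection are all made up for the example, and a real HTTP client has to cope with far more than this.

```python
import io


def read_response(stream):
    """Read one HTTP-style response from a binary stream.

    Returns (version, status, reason, headers, body).
    """
    # Status line, e.g. b"HTTP/1.1 200 OK\r\n": version, code, human text.
    status_line = stream.readline().decode("ascii").rstrip("\r\n")
    version, status, reason = status_line.split(" ", 2)

    # Header lines run until the first empty line.
    headers = {}
    while True:
        line = stream.readline().decode("ascii").rstrip("\r\n")
        if not line:
            break
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()

    # On a re-used connection the body ends after Content-Length bytes,
    # not when the connection closes, so count bytes.
    length = int(headers.get("content-length", 0))
    body = stream.read(length)
    return version, int(status), reason, headers, body


# Two made-up responses back to back, standing in for one re-used connection.
fake_connection = io.BytesIO(
    b"HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\nContent-Length: 6\r\n\r\nfirst\n"
    b"HTTP/1.1 404 Not Found\r\nContent-Type: text/plain\r\nContent-Length: 5\r\n\r\ngone\n"
)

first = read_response(fake_connection)
second = read_response(fake_connection)
print(first[1], first[4])
print(second[1], second[4])
```

Without the byte count there would be no way to tell where the first body stops and the second status line starts, which is exactly the job the server closing the connection does for free.
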
The status code is interesting. Did you know that gopher has no real way to signal an error? You might be thinking "Hey, what about item type 3?", but the thing about item type 3 is, well, it's an item type. When do we see item types? In gopher menus, and in gopher menus only. If a gopher client sends a request for what it thinks should be a text file, but it's followed a misspelled selector and the file doesn't exist, the client isn't going to try to parse the response as a menu, so it's not going to have any way to recognise the error item type.

Indeed, if you request a non-existent selector from a gopher server, it'll say something to you like "Error: File or directory not found!" (this is what Gophernicus will say), but it's only you, as a human, who hopefully reads English, who can recognise this as an error. A simple script has *no* way to distinguish this situation from a totally successful transaction. Because of this, it's e.g. impossible to write a script to crawl a gopherhole looking for broken links. Well, maybe not impossible, but certainly non-trivial: you could figure out the particular server's idiosyncratic choice of error message by requesting a couple of randomly-generated long selectors which are highly unlikely to be in actual use, and use the most common response as the "404 equivalent string". Needless to say, this is not exactly simple. It's perhaps not the biggest problem in the world, but it's certainly a shortcoming of gopher which could very easily be avoided.

But much more interesting and important is the "Content-Type" header. Gopher, frankly, sucks at signalling content type. If you've arrived at a document via a gopher menu, then you know its item type. What if you want to request a document directly, not by following an item in a menu? Maybe your friend has told you the selector in an email or via XMPP. Maybe you bookmarked it last month.
You can request that document by just sending the path and a CRLF, but how do you know what kind of content you're getting back? If you don't somehow know it in advance, you need to figure it out for yourself by looking at it hard ("you" here being a gopher client, not an end user). This is the reason that gopher has its very own unique URL scheme with its own RFC (RFC4266), where the item type is introduced as an extra component of the path. You need to write gopher://zaibatsu.circumlunar.space/1/~solderpunk instead of just gopher://zaibatsu.circumlunar.space/~solderpunk because with the latter option your client would have no idea whether it should try to parse what comes back as a menu, display it as text or save it as a binary file. This problem is also the reason that if you write a gopher client with bookmark support, you need to store the item type along with host, port and selector.

Neither of these things is terribly hard, but they are examples of small, inelegant extra hoops which have to be jumped through because gopher, in this respect, is *too* simple. It's too simple to straightforwardly handle a perfectly reasonable situation like "I'd like to fetch this document from this server but I've never seen it appear in a menu because my friend just emailed me the link". To me it makes a *lot* of sense that the *only* piece of information you should need to request and then make use of a resource is that resource's path. That seems, well, simple.

This problem in gopher is more widespread than just not knowing what item type a document is. Even if you *know* that a path points to an item type 0 text file, you can have problems. One of the earliest bug reports I got after releasing VF-1 turned out to be the result of floodgap.com using iso-8859-1 text encoding to support accented characters in some of their content. VF-1 had just assumed that everything on gopher was ASCII, which turned out to be very wrong.
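
The failure mode is easy to reproduce. The following Python sketch is purely illustrative: the sample strings and the decode_naively helper are invented for the example and are not taken from VF-1 or from floodgap.com. It shows an "everything is ASCII" assumption breaking on a single accented character, and why falling back to a guess is not much better.

```python
# "café" in ISO-8859-1: the é is the single byte 0xE9, which is not ASCII.
raw = "café".encode("iso-8859-1")


def decode_naively(data):
    """Try ASCII, then UTF-8, then fall back to ISO-8859-1 as a last resort."""
    for encoding in ("ascii", "utf-8"):
        try:
            return encoding, data.decode(encoding)
        except UnicodeDecodeError:
            continue
    # ISO-8859-1 maps every byte to some character, so this never raises,
    # but it is only a guess: the server never said what the encoding was.
    return "iso-8859-1", data.decode("iso-8859-1")


guess, text = decode_naively(raw)
print(guess, text)

# The same fallback quietly produces garbage for other 8-bit encodings.
russian = "Привет".encode("koi8-r")
wrong_guess, garbled = decode_naively(russian)
print(wrong_guess, garbled == "Привет")
```

The second case is the nasty one: nothing raises an error, the client just displays the wrong characters.
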
There are a lot of encodings out in the wild on gopher. Standard gopher has no way of telling you what they are. The only way to write a client which can Just Go Anyway is to use some kind of third party library to try to "sniff" the encoding (VF-1 uses Chardet[3] for this). That's a hard problem, which is never guaranteed to be solvable, and is only possible using a big slab of natural language corpus statistics. This requirement massively flies in the face of the RFC1436-enshrined philosophy that "intelligence is held by the server". When all a protocol does is shovel a bunch of bytes down your throat and say "you figure out what this is and what to do with it!", you need a *very* intelligent client for it to really work out in all conditions.

I don't think it makes much sense to have every client repeat exactly the same set of expensive computations after requesting a document in order to figure out information that the server *already knows*, but didn't share. There's a saner alternative to this, and it's for the server to tell the client, succinctly, what it's actually getting. This can be implemented with a very small increase in protocol complexity, which can result in a very large decrease in client complexity.

Consider the following as a response format, in a hypothetical protocol which retains gopher's bare bones request format:

----------
<status> <MIME type> <text encoding>
<content>
----------

A concrete example:

----------
200 text/plain utf-8
Hello, world!
----------

The text encoding could be optional for non-text MIME types. We could get away from having to specify an encoding at all if this protocol specified "Thou shalt use UTF-8 and no other encoding shalt thou use", saving us ~5 bytes, but I dunno if that's too authoritarian. Yes, you can represent any language you like in UTF-8, but some languages can be represented more compactly in other encodings, and it seems like a good thing to provide the ability to minimise the number of bytes sent over the network.
Isn't that also part of the spirit of a minimalist protocol? A compromise: if you use UTF-8, it's valid to leave off the third component of the response header. UTF-8 is the implicit default, but other encodings are possible for a tiny extra cost.

For the sake of fully specifying a system, including a navigation solution, without any further discussion or design, let's keep gopher's menu system as is, and introduce a new pseudo-MIME type for it, like text/menu or something. I'm not saying this is a great idea, it just provides a complete concrete example to talk about for the rest of this post.

If we give gopher a complexity score of 1 and full-blown HTTP a complexity score of 100, I don't see how this new protocol can be reasonably scored higher than 10. It's still absolutely trivial to write a client for this protocol, a nice little weekend project. You can memorise the protocol easily so you don't need to look up a complicated RFC to remind yourself of some detail while coding. You can still cobble together a client out of standard unix utilities: the response header is guaranteed to be one line long, so you can just pipe what you get from the network through `tail -n +2` to cut it off. I'm not sure if that would work for binary files, admittedly, but for something vaguely gopher-like that's an edge case anyway. You could even still use telnet as a client for this protocol if you wanted to. Yes, you would see one short line of noise at the top of each file, but that's a heck of a lot better than seeing a full set of HTTP headers and I guarantee you'd get used to it and stop even consciously seeing it after a day of practice.

None of the extra information in this header represents any threat to a user's privacy. The network overhead is around 20 bytes per request, which is less than 1% of the size of a typical phlog post.

Compared to gopher, this protocol can:

* Use standard URLs without embedded item types, without any ambiguity.
* Serve plain text in any encoding under the sun, without ambiguity that would otherwise force the client to waste computational effort trying to identify the encoding.
* Serve any kind of non-text content under the sun, without ambiguity that would otherwise force the client to waste computational effort trying to identify the binary file format, and without being forced to categorise the content as one of a small number of pre-defined item types which are either hopelessly vague or, in 2019, just kind of whacky (e.g. gopher item type 5, "PC-DOS binary file of some sort").
* Precisely indicate error conditions in a machine-readable way.

In the example above I just copied HTTP's "200" status for "everything's fine", but in reality HTTP's three digit status codes are surely overkill for anything vaguely gopher-like. Status codes could probably be a single character. I haven't thought too much about applications of these. We *could* go nuts, implementing redirects and all sorts, but I'm not really keen. From time to time there are complaints on the gopher mailing list about badly behaved crawlers making too many requests per second and overloading servers, so a "too many requests, try again later" error code would seem a practical thing. I'm not imagining any situation where 99.9% of requests result in more than 3 or 4 statuses. It should be possible to learn all the status codes by heart easily.

This protocol is not as simple as gopher, but I would argue its power to weight ratio is substantially greater. It's still very simple, and it's still totally harmless. Crucially, it's non-extensible: the response header is not open ended, like HTTP's is, so people can't just add in whatever extra junk they like. I don't want to say that extensibility is a bad thing, it's often a very smart engineering solution to some particular problem, but I think I do want to say that extensibility is the enemy of intentionally brutal simplicity.
Optional extra cruft will inevitably accumulate and then become a de facto requirement.

In the third and final post in this series, I'll address possible solutions to the problem of navigation.

[1] gopher://zaibatsu.circumlunar.space:70/0/~solderpunk/phlog/protocol-pondering-intensifies.txt
[2] https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
[3] https://chardet.github.io/