Thoughts Concerning Numerical Logging Formats

As this writing  is a collection of my ideas  that have not been written down  before now, this page
will be  gradually updated and refined  whenever I recall more  and more of my  ideas concerning the
topic and feel the  want.  I will likely design more numerical logging  formats and that will likely
lead to new ideas I write here.


I prefer numerical data  formats, also known as binary formats; these tend  to be smaller, easier to
manipulate, and easier to verify, among other advantages.  A disadvantage of these is the need for
specialized tooling for creation, manipulation, and display with regard to human use.  What
pass for  textual formats can  be thought of  as having the advantage  of generic tooling,  but this
falls short  and I believe  their proliferation is  primarily due to  sloth with regards  to writing
proper tools.

Every logging format should accommodate several logging failure strategies, absent good excuse.  A
good logger should  preallocate some storage so that  it can detect a failure before  it happens and
warn an operator.  I see three primary strategies:  A failure to continue logging causes the program
to stop  serving requests,  preventing requests that  aren't logged; a  failure to  continue logging
causes the program to use the last available space to  log that it is no longer able to properly log
and continue serving  requests; and a failure to  log causes the program to begin  using older space
for logging, creating a modular log.
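
The three strategies can be sketched in Python over a preallocated run of record slots.  This is
only an illustration under stated assumptions: the names FixedLog, Strategy, LogFull, and the notice
marker are hypothetical, and a list stands in for storage that a real logger would preallocate on
disk.

```python
from enum import Enum

class Strategy(Enum):
    HALT = 1    # stop serving requests once logging is impossible
    NOTICE = 2  # spend the last slot on a notice, then serve unlogged
    WRAP = 3    # reuse the oldest space, making a modular log

class LogFull(Exception):
    """Raised under HALT so the caller can stop serving requests."""

class FixedLog:
    NOTICE_RECORD = b"\xff"  # hypothetical marker meaning "logging ceased here"

    def __init__(self, slots, strategy):
        self.records = [None] * slots  # stands in for preallocated storage
        self.strategy = strategy
        self.next = 0                  # count of writes made so far
        self.stopped = False

    def remaining(self):
        """Free slots; an operator could be warned as this nears zero."""
        if self.stopped:
            return 0
        return max(0, len(self.records) - self.next)

    def append(self, record):
        """Return True if the record was stored."""
        slots = len(self.records)
        if self.strategy is Strategy.WRAP:
            self.records[self.next % slots] = record
            self.next += 1
            return True
        if self.strategy is Strategy.HALT:
            if self.next == slots:
                raise LogFull("no space left; refuse further requests")
            self.records[self.next] = record
            self.next += 1
            return True
        # Strategy.NOTICE: the final slot is reserved for the notice.
        if self.stopped:
            return False
        if self.next == slots - 1:
            self.records[self.next] = self.NOTICE_RECORD
            self.stopped = True
            return False
        self.records[self.next] = record
        self.next += 1
        return True
```

Because the storage is preallocated, remaining() is known before any write is attempted, which is
what permits warning an operator ahead of the failure.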

A  log for  a  networked service  will generally  need  to store  the  IP addresses  related to  the
connections made.  It is obvious to store the octets directly, but supporting both IPv4 and IPv6
presents a complication.  A simple solution that provides a fixed length is storing a complete IPv6
address and storing IPv4 addresses as a subset of these; the obvious choice is the range of IPv6
addresses that map to IPv4.  Logged actions of the server itself could be marked by using an IPv4 or
IPv6 address of all zeroes, as both are special, and this is a valuable quality due to it being
trivial to test for.  A major disadvantage of this is that sixteen octets is one more
than  the worst-case  for storing  an IPv4  address textually,  which is  bad for  the common  case.
Another  mechanism that  could be  used  for storing  IP addresses  is  four or  sixteen octets,  as
determined by a  flag stored elsewhere; this has  the disadvantage of a variable field,  but has the
major advantage of consuming far less space in the common case.
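
Both mechanisms can be sketched with Python's standard ipaddress module.  The function names here
are hypothetical, and treating sixteen zero octets as the server's own marker is the convention
described above rather than anything the module provides.

```python
import ipaddress

SERVER_ACTION = bytes(16)  # all zeroes marks an action of the server itself

def pack_address(text):
    """Store any IP address as sixteen octets, mapping IPv4 into IPv6."""
    address = ipaddress.ip_address(text)
    if address.version == 4:
        # ::ffff:a.b.c.d is the IPv4-mapped range of IPv6.
        return bytes(10) + b"\xff\xff" + address.packed
    return address.packed

def unpack_address(octets):
    """Recover the address, or None for a server action."""
    if octets == SERVER_ACTION:
        return None
    address = ipaddress.IPv6Address(octets)
    # ipv4_mapped is None unless the address lies in the mapped range.
    return address.ipv4_mapped or address

def pack_flagged(text):
    """The alternative: a flag stored elsewhere, then four or sixteen octets."""
    address = ipaddress.ip_address(text)
    return address.version == 6, address.packed
```

The first pair always yields sixteen octets, so records stay fixed-length; pack_flagged yields four
octets for IPv4 at the cost of a variable field.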

Time is a particularly troublesome thing to store, relatively, and there are many different
methods that can be used to store it.  The primary issues of time are its relentless march forward,
leading to an unbounded  nature, and the accuracy desired, which can exacerbate  this; there are two
main  solutions to  these that  I see:  One can  use a  sufficiently large  unit or  one can  have a
mechanism  for seamlessly  using a  smaller unit  without exhausting  it.  The  former method  is so
popular that it  should need no introduction;  the idea is merely  to have a measure  that will last
longer than the  service ever will with any  likelihood.  The latter method can be  without end, but
does have complications; a system message could set the beginning of an epoch in some way, including
simply telling the system to move to the next, providing unbounded measure analogous to moving on an
infinite tape;  all time is  then relative to  this epoch.  The  latter method has  some unfortunate
disadvantages, including the need to find a system message that sets the epoch before records can be
decoded,  leading  to  either periodically  restating  the  epoch,  wasting  space, or  requiring  a
potentially long  period of time to  sift through each record  looking for such, assuming  it can be
found; further, this leads to complications  involving the modular logging failure recovery strategy
presented earlier, as records now depend on previous records in an intimate way.
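
The dependence of records on an earlier epoch message can be seen in a small Python sketch.  The
entry kinds "epoch" and "record" and the pair layout are hypothetical, chosen only to illustrate the
decoding problem.

```python
def absolute_times(entries):
    """Yield absolute times from (kind, value) pairs.

    An "epoch" entry sets the base that later offsets are relative to;
    a "record" entry carries a small offset from that base.
    """
    base = None
    for kind, value in entries:
        if kind == "epoch":
            base = value
        elif base is None:
            # No epoch seen yet, as when a modular log has overwritten it.
            yield None
        else:
            yield base + value
```

A record that precedes every surviving epoch message decodes to None here, which is the
complication with the modular strategy in miniature.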

Continuing with time, there's also the question of precision and convention.  The unit of the second
is generally  chosen as  the precision for  logging, but  I'm of the  opinion it's  often completely
unnecessary to have this precision and a lower precision can lead to great storage savings.  Varying
by the service,  it could be reasonable to  have granularities of minutes, hours, or  even days.  As
the precision lowers,  it becomes more important  for all records to be  properly relatively ordered
within a single unit,  but this is already a very important concern  for accurate logging.  The most
popular convention  for storing time seems  to be storing  a count of  the unit from an  epoch; this
approach is the easiest  to convert into many different date formats, but  is more difficult for any
single format than  other means; this method is also  the easiest to verify, by virtue  of having no
real invalid states.  The  other main approach I see is storing the  time symbolically; this has the
advantage of being trivial to display for human use, but has the corresponding disadvantage of being
more intimately tied to that particular way to  display the time; this method requires more checking
to determine if a date  is valid, since there will likely be invalid states  in the encoding.  A BCD
approach to storing dates symbolically is an early  thought and can be used, but concern for storage
use can quickly  have that change to an integer  encoding that is more compact.  As  an aside, a BCD
approach to  storing dates  would need at  least seven  digits, but octet  concerns would  have this
rounded to  eight; four could  be used for  the year, with  the latter four  being used in  pairs to
represent the month  and day; alternatively, five digits  could be used to represent  the year, with
the latter three  being used to indicate the  day within that year; this is  a pleasant arrangement.
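
The two conventions can be compared in a short Python sketch.  The function names are hypothetical,
and the 1970 epoch and three-octet width of the count are assumptions made only for the example.

```python
import datetime

def day_count(date, epoch=datetime.date(1970, 1, 1)):
    """Count of days from an assumed epoch, in three octets."""
    return (date - epoch).days.to_bytes(3, "big")

def bcd_ymd(date):
    """Four digits of year, two of month, two of day: four octets."""
    # bytes.fromhex packs two decimal digits per octet, which is BCD.
    return bytes.fromhex(f"{date.year:04d}{date.month:02d}{date.day:02d}")

def bcd_ordinal(date):
    """Five digits of year, three of day within the year: four octets."""
    day_of_year = date.timetuple().tm_yday
    return bytes.fromhex(f"{date.year:05d}{day_of_year:03d}")

def parse_bcd_ymd(octets):
    """Decode, rejecting the invalid states this encoding permits."""
    digits = octets.hex()
    if len(digits) != 8 or not digits.isdecimal():
        raise ValueError("not eight BCD digits")
    # datetime.date raises ValueError for impossible dates such as month 13.
    return datetime.date(int(digits[:4]), int(digits[4:6]), int(digits[6:]))
```

Every three-octet value of day_count names some day after the epoch, whereas parse_bcd_ymd needs
two separate checks, one for stray nibbles and one for impossible dates.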

I'll now describe thoughts for a numerical format for logging Gopher requests.  The format should be
simple,  consume  little storage,  and  support  everything reasonable.   The  only  data it  should
certainly store are the  selectors used; a length-prefixed vector will suffice,  with a single octet
used for the length.  Invalid selectors should also be stored, which this accommodates.  It is
perhaps valuable to store a flag indicating whether the request completed successfully or not.
Gopher  isn't a  busy protocol,  unfortunately, so  accuracy of  a second  is entirely  unnecessary;
regardless of the precision and convention chosen, it should consume no more than three octets.  The
IP address should be stored, but how is an important question.  As this format accumulates flags, it
may make sense  to have an octet  solely for flags, as they  can no longer be  placed elsewhere, and
this greatly  affects the  overall design of  the format, as  there are  far fewer than  eight flags
needed, permitting flags that may not otherwise be used.  An invalid selector is not only one that
references a resource that doesn't exist, but also one that fails to end in a CARRIAGE RETURN
followed by a LINE FEED.  Two octets per valid request, which could be figured to be the majority of
them, could be saved by using a flag to indicate a proper ending and then omitting these octets from
the selector vector; this does add complexity to determining the length of a selector, since it is
the length indicated if this flag is not set and two more if so, with an extra length check to
determine that a selector length wouldn't exceed the limit of 255 with this addition, meaning values
254 and 255 become invalid in this case.  Further, a flag could be used to determine if an IPv4 or
IPv6 address is
stored, along with  a flag indicating an action of  the server, rather than using an  address of all
zeroes.  However, a smaller set of flags could instead be stored in the time data or, more
appropriately, the selector length, with each flag halving its range; sixty-four isn't an
inappropriate limit; every selector in my Gopher hole, as of the time of writing, fits in less than
half of this limit.
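
The length rule for a full length octet with a proper-ending flag can be sketched in Python.  The
function names are hypothetical, and the limit of 255 is the one discussed above.

```python
CRLF = b"\r\n"

def strip_ending(received):
    """Split a received request into (ended_properly, selector to store)."""
    if received.endswith(CRLF):
        return True, received[:-len(CRLF)]
    return False, received

def received_length(indicated, ended_properly, limit=255):
    """Length of the request as received, given the stored length and flag."""
    if not ended_properly:
        return indicated
    if indicated + len(CRLF) > limit:
        raise ValueError("indicated lengths 254 and 255 are invalid with the flag set")
    return indicated + len(CRLF)
```

The extra check is the cost of the saving: with the flag set, the two largest indicated lengths no
longer describe anything that could have been received.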

Ultimately, in the  pursuit of simplicity, a flag  octet would be avoided, as all  my thinking would
leave half of it unused and that I find poor,  and so the final format is as follows: sixteen octets
indicate  the IPv4  or  IPv6 address,  with  all zeroes  indicating a  system  action; three  octets
represent the  time, likely  with fifteen bits  representing the  year as an  integer and  nine bits
representing the  day within  that year;  an octet  has its top  two bits  determine if  the request
successfully completed  and if the  selector ended properly,  with the latter  six being used  for a
length  of the  only variable-length  component, the  selector itself.   The fixed-length  header is
twenty  octets, also  being the  minimum  size, with  the  maximum size  being eighty-three  octets.
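
A minimal Python sketch of encoding and decoding this record follows.  The description above fixes
the field widths but not everything else, so several details are assumptions: the octets are taken
most significant first, the year occupies the high fifteen bits of the time, the topmost bit of the
last header octet is the completion flag and the next is the proper-ending flag, and the function
names are hypothetical.

```python
def encode_record(address, year, day, completed, ended_properly, selector):
    """Encode one record; address is sixteen octets, all zeroes for a system action."""
    if len(address) != 16:
        raise ValueError("address must be sixteen octets")
    if not 0 <= year < 1 << 15:
        raise ValueError("year must fit in fifteen bits")
    if not 0 <= day < 1 << 9:
        raise ValueError("day must fit in nine bits")
    if len(selector) > 63:
        raise ValueError("selector must fit a six-bit length")
    time = (year << 9 | day).to_bytes(3, "big")
    last = int(completed) << 7 | int(ended_properly) << 6 | len(selector)
    return address + time + bytes([last]) + selector

def decode_record(octets):
    """Decode one record from the start of octets."""
    if len(octets) < 20:
        raise ValueError("shorter than the twenty-octet header")
    address = octets[:16]
    time = int.from_bytes(octets[16:19], "big")
    last = octets[19]
    length = last & 0x3F
    selector = octets[20:20 + length]
    if len(selector) != length:
        raise ValueError("selector is truncated")
    return (address, time >> 9, time & 0x1FF,
            bool(last & 0x80), bool(last & 0x40), selector)
```

An empty selector gives the twenty-octet minimum and a selector of sixty-three octets gives the
eighty-three-octet maximum, matching the sizes stated above.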

This format is  exemplary, in that most  numerical logging formats would likely  heavily resemble it
and this shows the  strengths of this approach, as codes, lengths, and  other concerns are stored in
their most compact  general representation.  It is  clear by description how to  process this format
programmatically and rather easily.

This then sums up my thoughts on numerical logging formats.