[HN Gopher] Elements of a great markup language
       ___________________________________________________________________
        
       Elements of a great markup language
        
       Author : ingve
       Score  : 40 points
       Date   : 2022-10-29 05:55 UTC (2 days ago)
        
 (HTM) web link (matklad.github.io)
 (TXT) w3m dump (matklad.github.io)
        
       | tabtab wrote:
       | Re: _A good markup language describes an abstract hierarchical
       | structure of the document, and lets a separate program to adapt
       | that structure to the desired output._
       | 
       | I have to disagree. Often the abstract nature is hard to describe
       | and/or the constructs for it either don't exist, or need
       | cleaning/updating. Often I find myself saying, "I don't know why
       | it's more legible to format this thing such and such way, it just
       | is." Creating a good abstract language or category set for a
       | given domain is not easy. It's good to use abstraction where it's
       | practical, but often you just have to tell the system "just
       | format it like this!"
       | 
       | For example, the difference between a button and hyperlink is
       | often blurry. We could abstract it something like: [Action
       | FormatType="Button" Label="Send email to Mom"
       | ActionType="mailto:mom73@moms.sample"/] so that FormatType could
       | be changed to "Hyperlink", but most find this goofy and perhaps
       | long-winded.
       | 
       | As far as compact wiki-like shortcuts versus XML, the first often
       | has more escaping issues or confusion. It's not a free lunch, but
       | about best fitting intended use.
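
The [Action ...] example above can be sketched as an abstract element
plus a separate renderer that picks the presentation. This is only an
illustration in Python: the Action class, its field names, and the
render_html function are hypothetical, and the HTML chosen for each
FormatType is one of several reasonable options.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Action:
    """Abstract 'do something' element; it says nothing about its looks."""
    label: str
    target: str
    format_type: str = "Button"  # presentation hint, changeable later


def render_html(action: Action) -> str:
    """Turn the abstract action into concrete HTML for one output format."""
    label = escape(action.label)
    target = escape(action.target)
    if action.format_type == "Hyperlink":
        return f'<a href="{target}">{label}</a>'
    if action.format_type == "Button":
        return (f'<form action="{target}">'
                f'<button type="submit">{label}</button></form>')
    raise ValueError(f"unknown format type: {action.format_type}")


mail = Action("Send email to Mom", "mailto:mom73@moms.sample")
print(render_html(mail))
mail.format_type = "Hyperlink"  # same content, different presentation
print(render_html(mail))
```

Only format_type changes between the two calls. Whether that
indirection is worth the extra verbosity is the trade-off the comment
questions.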
        
         | einpoklum wrote:
         | Seconded. That quote may be a description of a markup language
         | that is easy for programs to work with, not one that's easy for
         | people to work with.
        
           | tabtab wrote:
           | Good point. What's people-friendly may not be machine-
           | friendly, and vice versa. Further, what's human writing
           | friendly may not be human _reader_ friendly. When writing,
           | verbosity reduction matters most; but when reading, clarity
           | often trumps verbosity concerns.
        
             | abathur wrote:
             | In a conversation elsewhere recently I wrote:
             | 
             | > I think Markdown is popular because it's a reasonably
             | intuitive bridge between how we format handwritten
             | documents (which is itself a blend of what makes sense on
             | the page, and an approximation of ~spoken rhetoric) and
             | HTML.
             | 
             | > In some cases (like the table example), what's trivial on
             | the page is painful with a keyboard. The technologies that
             | shaped the old idioms didn't have editability as a
             | selection pressure. You can make a few edits inline, but at
             | some point you'll just have to rewrite the document.
             | 
             | > Much of the power and pain of code are byproducts of the
             | machines forcing us to be explicit, but we didn't have the
             | foresight to spend centuries molding ourselves (human
             | languages, pedagogy, rhetoric) around that kind of
             | precision. Markdown meets us very close to where we are.
             | The other path, I think, entails developing toolchains that
             | do expect that precision and requiring the humans to do the
             | changing.
             | 
             | I guess the split between the ease of a format for
             | reading/writing is an extension of the difference between
             | shorthand or even cursive and print when it comes to these
             | affordances.
        
         | tannhaeuser wrote:
         | > _As far as compact wiki-like shortcuts versus XML, the first
         | often has more escaping issues or confusion._
         | 
         | It isn't an either-or, and never has been. XML was created as
         | an SGML subset, and SGML always had short references, i.e.
         | custom tokens that the parser replaces with something else in a
         | context-dependent way. For example, an asterisk can be replaced
         | by an <em> start-element tag to start emphasized text, and,
         | within emphasized text, asterisks can be replaced by </em> end-
         | element tags.
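
The context-dependent replacement described above can be illustrated
with a toy Python function. This is not how an SGML parser implements
short references (those are declared in the DTD); the function name
expand_short_refs and the choice to auto-close a dangling emphasis are
assumptions made for the sketch.

```python
def expand_short_refs(text: str) -> str:
    """Replace each '*' with <em> or </em>, depending on the context."""
    out = []
    in_emphasis = False  # context: are we currently inside emphasized text?
    for ch in text:
        if ch == "*":
            # The same token maps to a different tag in each context.
            out.append("</em>" if in_emphasis else "<em>")
            in_emphasis = not in_emphasis
        else:
            out.append(ch)
    if in_emphasis:
        out.append("</em>")  # close a dangling emphasis so tags stay balanced
    return "".join(out)


print(expand_short_refs("a *very* short example"))
# a <em>very</em> short example
```

The author types the compact token, while the document that comes out
of the parser contains only ordinary start and end tags.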
        
       | einpoklum wrote:
       | Since when is LaTeX a "lightweight markup language"?
       | 
       | * It's not at all lightweight, I'm pretty sure it's Turing
       | complete up the wazoo.
       | 
       | * You can reprogram the f'ing grammar, down to the interpretation
       | of individual input characters...
       | 
       | * In fact, I'm not sure you can even call it a markup language.
       | It's a macro-based programmable system on top of a typesetter.
        
         | taeric wrote:
         | For the generation of documents and papers, I'm not actually
         | sure that avoiding Turing completeness is really a worthy goal.
         | It can certainly go too far, but I think it is a sympathetic
         | view that the quest for complete separation between content and
         | presentation has given more victims to failed delivery than
         | macros have.
         | 
         | I do think that stability is something that needs some
         | discussion here, for a great markup language. Not stability as
         | in "doesn't crash", but in "doesn't change."
        
       | bscphil wrote:
       | IMO this misses the most important thing needed in a great markup
       | language. It should be readable, as a complete document, in its
       | existing text form without any transformation. The ideal syntax
       | would be so good that you would rarely want to transform the
       | document in any way.
       | 
       | I think inspiration in this area should come from an unusual
       | place: RFC documents (like
       | https://www.rfc-editor.org/rfc/rfc6877.txt). These documents are
       | fully readable in their standard text form. The only things
       | missing are conveniences like clickable links, which could be
       | supplied by a viewer application without transforming the text
       | of the document.
       | 
       | Markdown comes the closest to being directly readable, but it
       | makes a number of sacrifices to syntax in view of its intended
       | transformation to HTML. It's also limited in scope, patterned
       | after ASCII email syntax. So you miss document-oriented
       | conveniences like standardized headers, proper citations, and
       | pages - all of which RFC documents have!
       | 
       | That's in part the reason that a bunch of people have come behind
       | and written extensions to Markdown. IMO someone needs to come
       | back and gather the best of these into a consistent and simple
       | syntax that is oriented around the needs of plain text readers.
        
         | nooyurrsdey wrote:
         | > It should be readable, as a complete document, in its
         | existing text form without any transformation.
         | 
         | Is this _needed_, or is this nice to have?
         | 
         | A markup language annotates text and describes _how it should
         | be rendered_. It feels redundant to describe how a document
         | should be rendered (presumably for final consumption) _and_
         | have the document be readable as-is.
         | 
         | Case in point: I'd argue that HTML is a great markup language.
         | I wouldn't call it the most readable in its current form.
         | 
         | I agree with the spirit here, but it ultimately feels more
         | "nice to have" than truly required.
        
       | rrwo wrote:
       | Markdown is not markup. It's in the name.
       | 
       | The point of markdown is to be readable by humans, using similar
       | annotations that people have been using in ASCII text emails for
       | decades.
        
         | goto11 wrote:
         | Why doesn't markdown qualify as a markup language?
        
         | frou_dh wrote:
         | Kudos for fighting the good fight, but honestly this is a
         | losing battle. As time goes on, fewer and fewer of the people
         | in the world using Markdown are interested in this distinction.
        
         | bscphil wrote:
         | Markdown _is_ markup.
         | 
         | > Markdown is a lightweight markup language for creating
         | formatted text using a plain-text editor. -- Wikipedia
         | 
         | > Markdown is a text-to-HTML conversion tool for web writers.
         | Markdown allows you to write using an easy-to-read, easy-to-
         | write plain text format, then convert it to structurally valid
         | XHTML (or HTML). -- Gruber
         | 
         | Moreover, Markdown is constrained by the fact that it doesn't
         | have a semantics, strictly speaking. It has a syntax, and that
         | syntax is tied to the HTML canonical form of the document,
         | which any valid Markdown file can be transformed into.
         | 
         | The fact that Markdown is written in extremely simple, easy to
         | read (and brilliant IMO) syntax doesn't make it not markup.
        
       | woolybully wrote:
       | Jeffrey Kingston's Lout is pretty good, though it's only
       | lightweight if LaTeX is.
       | https://src.fedoraproject.org/lookaside/pkgs/lout/user-guide...
        
       | xigoi wrote:
       | Personally I've found that I prefer a consistent syntax with few
       | special characters -- like HTML, but less verbose -- which led to
       | the creation of xidoc: https://xidoc.nim.town/
        
         | emmanueloga_ wrote:
         | I have a very simple way of evaluating any sort of markup
         | language. How easy is it to edit tables?
         | 
         | https://htmlpreview.github.io/?https://github.com/jgm/djot/b...
         | 
         | In that example, xidoc requires a human to manually align ascii
         | chars to make it look pretty... That's a job that _computers
         | are good at_!
         | 
         | I would rather be able to define some sort of data block:
         | 
         |     [let "myvar;" [csv;
         |     col1,col2,col3
         |     1,2,3
         |     4,5,6
         |     ]]
         |     [output-table "myvar"];
         | 
         | ... or something.
         | 
         | I'm not sure right now how useful is markdown really... It was
         | a nice experiment but I mostly use it for bullet points (*) and
         | headings (#). I would say that's successful enough as a
         | contribution to internet culture! :-)
         | 
         | EDIT: seems like I commented on xidoc+djot at the same time,
         | but the criticism is the same for both languages :-)
         | 
         | https://djot.net/
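
The "data block" idea above (keep the table as CSV and let the computer
do the alignment) can be sketched in Python. The function name
csv_to_table and the pipe-and-dash output layout are assumptions for
illustration; the sketch also assumes a rectangular table whose first
row is the header.

```python
import csv
import io


def csv_to_table(data: str) -> str:
    """Render CSV text as an aligned pipe-and-dash table."""
    rows = [[cell.strip() for cell in row]
            for row in csv.reader(io.StringIO(data.strip())) if row]
    header, *body = rows
    # Each column is as wide as its widest cell.
    widths = [max(len(row[i]) for row in rows) for i in range(len(header))]

    def fmt(row):
        cells = (cell.ljust(w) for cell, w in zip(row, widths))
        return "| " + " | ".join(cells) + " |"

    rule = "|" + "|".join("-" * (w + 2) for w in widths) + "|"
    return "\n".join([fmt(header), rule, *(fmt(r) for r in body)])


print(csv_to_table("""
col1,col2,col3
1,2,3
4,5,6
"""))
```

Editing a cell then means changing one value in the CSV; the column
padding is recomputed on every render instead of being maintained by
hand.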
        
       | MilStdJunkie wrote:
       | A subject very near and dear to my heart. Asciidoc, flawed as it
       | is, is still the best game in town, with conditional content,
       | transclusion, and a particularly robust table model. Combined
       | with S1000D architecture, it makes a passable method of writing
       | publications for heavy industry in aerospace and defense. On the
       | double cheap. If your team has some small amount of tech chops.
       | 
       | The `include` is a problem, though, and the devs have known this
       | since at least 2017. Two directives under work here. The
       | `ainclude` is an extension that works in the subdoc direction,
       | and `subdoc` is a directive under development that's on the
       | milestones to go into core. Generic blocks is also on the
       | roadmap, and it's available via extension.
       | 
       | However.
       | 
       | There are deeper problems, I feel, inherent in the DNA of component
       | content systems themselves, as a basic concept. It plagues every
       | CCMS (component content management system), whether it's DITA,
       | S1000D, DocBook (with xincludes), ReST, or otherwise. It's not a
       | markup problem. In S1000D, I call it "the applicability trap".
       | 
       | To put this very briefly, you incur risk when you replace any
       | component of natural language with a constructed language. In the
       | case of CCMSs, you have an implied content architecture doing the
       | work of NL in between content "chunks", but not everyone
       | architects content, or, hell, even their product. The writers
       | won't know this until they start, and by then it's too late.
       | 
       | Without architecture, you have a bunch of chunks that used to be
       | structured linguistically, and now they're not doing much of
       | anything at all.
       | 
       | I can't tell you how many S1000D systems I've seen built around a
       | product that existed as little more than a twinkle in a
       | salesman's eye. You build out an SNS and an applicability
       | model[1], all from a bunch of emails or a spotty CAD diagram, and
       | tell the writers to get cracking. Then, N months and N millions
       | of dollars later, scrap everything because, lo and behold, the
       | actual _possible_ product completely invalidates the content
       | architecture.  "How did the doc set cost 19 million dollars?".
       | Well, it's not the markup, it's not the vendor, and whatever
       | miracle editor your salesmen buddies are pushing won't do jack to
       | fix it, either.
       | 
       | What the writers _need_ is a quantifiable test for architecture
       | before they start, or else, sooner or later, the Applicability
       | Trap will get them. Either that, or write individual BDMs[2] for
       | every single product variant that spews out of the pieholes of
       | business development.
       | 
       | I feel a little dumb getting worked up here. All of this crap is
       | going to be glitter in the river come 2030 anyway, because we'll
       | be training our pocket AIs for most of this.
       | 
       | [1] (sort of a structure of conditions to filter content)
       | 
       | [2] Big Dumb Manuals, like mamma used to make
        
       | abathur wrote:
       | Hmm. I found myself nodding along with what this has to say about
       | syntax design and the responsibilities of well-abstracted markup
       | languages/converters, but I was also a little surprised by some
       | apparent contradictions.
       | 
       | > Great markup format unambiguously interprets an input string as
       | an abstract tree model of a document. It doesn't ascribe
       | semantics to particular tag names or attributes.
       | 
       | Yes!
       | 
       | > Markup language which nails this perfectly is HTML. It directly
       | expresses this tree structure. Various viewers for HTML can then
       | render the document in a particular fashion. HTML's syntax itself
       | doesn't really care about tag names and semantics: you can
       | imagine authoring HTML documents using an alternative set of tag
       | names.
       | 
       | But it's a bit weird to uphold HTML when it seems to really be
       | valuing the underlying syntax and not the big heckin' standard
       | that details the valid elements, attributes, and their semantics.
       | :)
       | 
       | > Great markup language defines the semantics of converting text
       | to a document tree
       | 
       | This seems to conflict with the first one I quoted. Maybe I'm
       | missing something? Maybe the author is using semantic in
       | different ways, here? Markup languages tend to mix different
       | kinds of elements, and it can make thinking and communicating in
       | this space tricky! I started trying to untangle this knot
       | recently in
       | https://t-ravis.com/post/doc/what_color_is_your_markup/.
       | 
       | (I also found banning "semantic" from my writing helped me
       | untangle by making it easier to notice when I was letting it do
       | too much work.
       | https://t-ravis.com/post/doc/semantic_the_8_letter_s-word/)
        
         | nerdponx wrote:
         | > But it's a bit weird to uphold HTML when it seems to really
         | be valuing the underlying syntax and not the big heckin'
         | standard that details the valid elements, attributes, and their
         | semantics. :)
         | 
         | SGML?
        
       | gizmo wrote:
       | The main problem continues to be, in my view, that we don't have
       | the equivalent of plain text for formatted documents or trees.
       | Plain text can be copy-pasted, read by any program, edited
       | anywhere, and that's great. Binary formats are inaccessible and
       | hard to work with by comparison, but this is only because of
       | mundane technical reasons. We could easily make new binary
       | formats that would be terrific for lightly marked up text, and
       | the editors that go along with them.
       | 
       | Today bold, italic, tables don't fit in plain text, but carriage
       | returns and tab characters do, as well as Halloween pumpkin
       | emojis. There is no good reason for this, this is just where we
       | kind of ended up. You can't put a floating point number in text,
       | or a price, or a phone number, or date. And when you don't have
       | standards that support this you can't have any meaningful kind of
       | data exchange. So forget about copying a table from a spreadsheet
       | anywhere. Everybody ends up shoehorning this necessary
       | functionality in text editors and it won't ever work.
       | 
       | How can computing still be at the stage where if you want a table
       | you have to conjure up a bunch of pipe and dash characters that
       | hopefully compiles into a table that looks presentable? And then
       | you have to switch between "code view" and "presentation view" to
       | check if you did it right. It would be funny if this wasn't so
       | tragic.
        
       | fiddlosopher wrote:
       | > Markup language which completely falls over this is Markdown.
       | There's no way to express generic tree structure, conversion to
       | HTML with specific browser tags is hard-coded.
       | 
       | This isn't really a fair criticism. True, the original
       | Markdown.pl did not produce a generic tree structure, but that's
       | a fact about the program, not the syntax it parses. Many Markdown
       | and Commonmark implementations do support creation of an abstract
       | syntax tree. Pandoc has done this for the last 17 years. It also
       | provides nestable, generic containers as a syntax extension.
       | 
       | > It feels like there's a smaller, simpler language somewhere
       | 
       | Here's my attempt: <https://djot.net>.
        
         | AB1908 wrote:
         | Wait a minute, I recognize that GitHub handle! Are you John
         | MacFarlane? What a mind-blowing day to come across the creator
         | of pandoc. You've saved many a student from pain. Thanks for
         | everything.
        
       | djedr wrote:
       | > More or less, what I want from markup is to convert a text
       | string into a document tree:
       | 
       |     enum Element {
       |       Text(String),
       |       Node {
       |         tag: String,
       |         attributes: Map<String, String>,
       |         children: Vec<Element>,
       |       }
       |     }
       | 
       |     fn parse_markup(input: &str) -> Element { ... }
       | 
       | > Markup language which nails this perfectly is HTML.
       | 
       | The reason HTML nails it perfectly is because this is modeled
       | after HTML.
       | 
       | If I were to make up a markup language, I wouldn't follow that
       | model.
       | 
       | In particular I would get rid of attributes which to me are a
       | restricted kind of children with a specialized syntax.
       | 
       | This is both unnecessary and undesirable in many cases.
       | 
       | The major problem with attributes is the <String, String>
       | mapping. Once something is defined to be an attribute, it cannot
       | be sensibly extended without creating an unnecessary problem.
       | 
       | For example the `class` attribute in HTML looks like it was
       | originally designed to hold a single class name. Then people
       | realized that it would be desirable to have multiple classes per
       | element.
       | 
       | So, to keep things backwards-compatible, the value of the
       | attribute was extended to hold a space-separated list of classes
       | instead. Essentially creating a little DSL inside of the
       | attribute's value.
       | 
       | If `class` was instead a kind of child, initially limited to a
       | single instance per element, extending to multiple instances in a
       | backward-compatible manner would not require introducing the DSL.
       | It would be natural. Just allow many `class` children.
       | 
       | A valid argument in HTML in favor of attributes is conciseness.
       | But that's an artifact of the syntax.
       | 
       | We could make up a language with syntax for nodes as concise as
       | HTML attributes, eliminating that argument.
       | 
       | > It feels like there's a smaller, simpler language somewhere
       | 
       | Certainly.
       | 
       | I have experimented with many different designs for such a markup
       | language on top of Jevko[0]. One interesting design that
       | trivially maps to HTML looks like this:
       | 
       |     h1 [Title]
       | 
       |     p [
       |       paragraph
       |     ]
       | 
       |     p [
       |       [paragraph with a ]
       |       a [
       |         href=[...]
       |         [link]
       |       ]
       |     ]
       | 
       | Another one looks like this:
       | 
       |     [h1][Title]
       | 
       |     [p][paragraph]
       | 
       |     [p][
       |       paragraph with a [href[...] a][link]
       |     ]
       | 
       | Both are extremely simple and minimal, but also extensible and
       | lend themselves to writing by hand.
       | 
       | [0] https://news.ycombinator.com/item?id=33287620
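
The argument above about `class` can be made concrete with a small
Python sketch of a tree model that has no attribute map, only children.
The names Node, text_of, and values_of are hypothetical and chosen for
this illustration; they are not taken from Jevko or from any HTML
library.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A tree node with a tag and ordered children; no attribute map."""
    tag: str
    children: list = field(default_factory=list)  # items are str or Node


def text_of(node: Node) -> str:
    """Concatenate the direct text children of a node."""
    return "".join(c for c in node.children if isinstance(c, str))


def values_of(node: Node, tag: str) -> list:
    """Collect the text of every direct child node carrying the given tag."""
    return [text_of(c) for c in node.children
            if isinstance(c, Node) and c.tag == tag]


# Attribute style: several classes squeezed into one string value,
# which every consumer has to split again (the "little DSL").
attribute_style = {"class": "note wide"}
print(attribute_style["class"].split())  # ['note', 'wide']

# Child style: going from one class to many just repeats the child.
p = Node("p", [Node("class", ["note"]), Node("class", ["wide"]), "paragraph"])
print(values_of(p, "class"))  # ['note', 'wide']
print(text_of(p))             # paragraph
```

In the child style the list of classes falls out of the tree structure
itself, so no extra parsing rule is needed for the value.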
        
         | typon wrote:
         | You just described Lisp :)
        
           | djedr wrote:
           | Surely you mean S-expressions. They're great, but not as a
           | markup language.
           | 
           | What I show here is in fact even simpler and more flexible
           | than S-exps[0].
           | 
           | [0] For some details and polemic see this thread:
           | https://news.ycombinator.com/item?id=33334789 | TL;DR: Jevko
           | is well-defined, basically just unicode text + escapeable
           | brackets for making trees; it doesn't treat whitespace as a
           | separator/atmosphere (particularly important in markup); it
           | takes advantage of natural name-value pairing tendencies
           | (like tag-children); and it's closed under concatenation by
           | design
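
The TL;DR above (text plus escapable brackets that form a tree, with
whitespace preserved) can be sketched as a small recursive parser in
Python. This is a simplified illustration, not the Jevko reference
implementation: the backtick escape character, the function names, and
the {"subs": ..., "suffix": ...} result shape are assumptions made for
the sketch.

```python
OPEN, CLOSE, ESC = "[", "]", "`"  # ESC as backtick is an assumption


def parse(text: str) -> dict:
    """Parse bracketed text into {'subs': [(prefix, tree), ...], 'suffix': str}."""
    tree, _ = _parse(text, 0, 0)
    return tree


def _parse(text: str, pos: int, depth: int):
    subs, buf = [], []
    while pos < len(text):
        ch = text[pos]
        if ch == ESC:
            # An escape must be followed by one of the three special characters.
            if pos + 1 >= len(text) or text[pos + 1] not in (OPEN, CLOSE, ESC):
                raise ValueError(f"invalid escape at position {pos}")
            buf.append(text[pos + 1])
            pos += 2
        elif ch == OPEN:
            # Text collected so far becomes the prefix (name) of the subtree.
            child, pos = _parse(text, pos + 1, depth + 1)
            subs.append(("".join(buf), child))
            buf = []
        elif ch == CLOSE:
            if depth == 0:
                raise ValueError(f"unexpected ']' at position {pos}")
            return {"subs": subs, "suffix": "".join(buf)}, pos + 1
        else:
            buf.append(ch)  # whitespace is ordinary text, never a separator
            pos += 1
    if depth > 0:
        raise ValueError("missing ']' before end of input")
    return {"subs": subs, "suffix": "".join(buf)}, pos


print(parse("h1 [Title]"))
```

Because a document is just a sequence of prefixed subtrees followed by
trailing text, two inputs that each end in a closing bracket
concatenate into another valid input, which is one reading of the
"closed under concatenation" property mentioned above.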
        
       | vitiral wrote:
       | I recently made cxt[1] to try to solve many of the issues you
       | mention.
       | 
       | [1]: https://github.com/civboot/cxt
        
       | woolybully wrote:
       | With apologies to Dr. Knuth...
       | 
       | The most important thing in a markup language is the name. A
       | language will not succeed without a good name. I have recently
       | invented a very good name, and now I am looking for a suitable
       | language.
        
         | [deleted]
        
         | brudgers wrote:
         | Unfortunately, MIX is already taken.
        
       ___________________________________________________________________
       (page generated 2022-10-31 23:00 UTC)