[HN Gopher] Elements of a great markup language
___________________________________________________________________
Elements of a great markup language

Author : ingve
Score  : 40 points
Date   : 2022-10-29 05:55 UTC (2 days ago)

(HTM) web link (matklad.github.io)
(TXT) w3m dump (matklad.github.io)

| tabtab wrote:
| Re: _A good markup language describes an abstract hierarchical
| structure of the document, and lets a separate program to adapt
| that structure to the desired output._
|
| I have to disagree. Often the abstract nature is hard to describe
| and/or the constructs for it either don't exist, or need
| cleaning/updating. Often I find myself saying, "I don't know why
| it's more legible to format this thing such and such way, it just
| is." Creating a good abstract language or category set for a
| given domain is not easy. It's good to use abstraction where it's
| practical, but often you just have to tell the system "just
| format it like this!"
|
| For example, the difference between a button and a hyperlink is
| often blurry. We could abstract it as something like: [Action
| FormatType="Button" Label="Send email to Mom"
| ActionType="mailto:mom73@moms.sample"/] so that FormatType could
| be changed to "Hyperlink", but most find this goofy and perhaps
| long-winded.
|
| As far as compact wiki-like shortcuts versus XML, the first often
| has more escaping issues or confusion. It's not a free lunch, but
| about best fitting intended use.
| einpoklum wrote:
| Seconded. That quote may be a description of a markup language
| that is easy for programs to work with, not one that's easy for
| people to work with.
| tabtab wrote:
| Good point. What's people-friendly may not be
| machine-friendly, and vice versa. Further, what's human writing
| friendly may not be human _reader_ friendly. When writing,
| verbosity reduction matters most; but when reading, clarity
| often trumps verbosity concerns.
| abathur wrote:
| In a conversation elsewhere recently I wrote:
|
| > I think Markdown is popular because it's a reasonably
| intuitive bridge between how we format handwritten
| documents (which is itself a blend of what makes sense on
| the page, and an approximation of ~spoken rhetoric) and
| HTML.
|
| > In some cases (like the table example), what's trivial on
| the page is painful with a keyboard. The technologies that
| shaped the old idioms didn't have editability as a
| selection pressure. You can make a few edits inline, but at
| some point you'll just have to rewrite the document.
|
| > Much of the power and pain of code are byproducts of the
| machines forcing us to be explicit, but we didn't have the
| foresight to spend centuries molding ourselves (human
| languages, pedagogy, rhetoric) around that kind of
| precision. Markdown meets us very close to where we are.
| The other path, I think, entails developing toolchains that
| do expect that precision and requiring the humans to do the
| changing.
|
| I guess the split between the ease of a format for
| reading/writing is an extension of the difference between
| shorthand or even cursive and print when it comes to these
| affordances.
| tannhaeuser wrote:
| > _As far as compact wiki-like shortcuts versus XML, the first
| often has more escaping issues or confusion._
|
| It isn't an either-or, and never has been. XML was created
| as an SGML subset, and SGML always had short references, i.e.
| custom tokens the parser replaces with something else in a
| context-dependent way. For example, an asterisk can be replaced
| by an <em> start-element tag to start emphasized text, and,
| within emphasized text, asterisks can be replaced by </em>
| end-element tags.
| einpoklum wrote:
| Since when is LaTeX a "lightweight markup language"?
|
| * It's not at all lightweight, I'm pretty sure it's Turing
| complete up the wazoo.
|
| * You can reprogram the f'ing grammar, down to the interpretation
| of individual input characters...
|
| * In fact, I'm not sure you can even call it a markup language.
| It's a macro-based programmable system on top of a typesetter.
| taeric wrote:
| For the generation of documents and papers, I'm not actually
| sure that avoidance of Turing completeness is really a worthy
| goal. It can certainly go too far, but I think it is a sympathetic
| view that the quest for complete separation between content and
| presentation has given more victims to failed delivery than
| macros have.
|
| I do think that stability is something that needs some
| discussion here, for a great markup language. Not stability as
| in "doesn't crash", but in "doesn't change."
| bscphil wrote:
| IMO this misses the most important thing needed in a great markup
| language. It should be readable, as a complete document, in its
| existing text form without any transformation. The ideal syntax
| would be so good that you would rarely want to transform the
| document in any way.
|
| I think inspiration in this area should come from an unusual
| place: RFC documents (like
| https://www.rfc-editor.org/rfc/rfc6877.txt). These documents are
| fully readable in their standard text form. The only things
| missing are conveniences like clickable links, which could be
| supplied by a viewer application without transforming the text of
| the document.
|
| Markdown comes the closest to being directly readable, but it
| makes a number of sacrifices to syntax in view of its intended
| transformation to HTML. It's also limited in scope, patterned
| after ASCII email syntax. So you miss document-oriented
| conveniences like standardized headers, proper citations, and
| pages - all of which RFC documents have!
|
| That's in part the reason that a bunch of people have come behind
| and written extensions to Markdown.
| IMO someone needs to come
| back and gather the best of these into a consistent and simple
| syntax that is oriented around the needs of plain text readers.
| nooyurrsdey wrote:
| > It should be readable, as a complete document, in its
| existing text form without any transformation.
|
| Is this _needed_, or is this nice to have?
|
| A markup language annotates text and describes _how it should
| be rendered_. It feels redundant to describe how a document
| should be rendered (presumably for final consumption) _and_
| have the document be readable as-is.
|
| Case in point: I'd argue that HTML is a great markup language.
| I wouldn't call it the most readable in its current form.
|
| I agree with the spirit here, but it ultimately feels more
| "nice to have" than truly required.
| rrwo wrote:
| Markdown is not markup. It's in the name.
|
| The point of markdown is to be readable by humans, using similar
| annotations that people have been using in ASCII text emails for
| decades.
| goto11 wrote:
| Why doesn't markdown qualify as a markup language?
| frou_dh wrote:
| Kudos for fighting the good fight, but honestly this is a
| losing battle. As time goes on, fewer and fewer of the people
| in the world using Markdown are interested in this distinction.
| bscphil wrote:
| Markdown _is_ markup.
|
| > Markdown is a lightweight markup language for creating
| formatted text using a plain-text editor. -- Wikipedia
|
| > Markdown is a text-to-HTML conversion tool for web writers.
| Markdown allows you to write using an easy-to-read,
| easy-to-write plain text format, then convert it to structurally
| valid XHTML (or HTML). -- Gruber
|
| Moreover, Markdown is constrained by the fact that it doesn't
| have a semantics, strictly speaking. It has a syntax, and that
| syntax is tied to the HTML canonical form of the document,
| which any valid Markdown file can be transformed into.
|
| The fact that Markdown is written in extremely simple, easy to
| read (and brilliant IMO) syntax doesn't make it not markup.
| woolybully wrote:
| Jeffrey Kingston's Lout is pretty good, though it's only
| lightweight if LaTeX is.
| https://src.fedoraproject.org/lookaside/pkgs/lout/user-guide...
| xigoi wrote:
| Personally I've found that I prefer a consistent syntax with few
| special characters -- like HTML, but less verbose -- which led to
| the creation of xidoc: https://xidoc.nim.town/
| emmanueloga_ wrote:
| I have a very simple way of evaluating any sort of markup
| language. How easy is it to edit tables?
|
| https://htmlpreview.github.io/?https://github.com/jgm/djot/b...
|
| In that example, xidoc requires a human to manually align ascii
| chars to make it look pretty... That's a job that _computers
| are good at_!
|
| I would rather be able to define some sort of data block:
|
|     [let "myvar;" [csv;
|       col1,col2,col3
|       1,2,3
|       4,5,6
|     ]]
|     [output-table "myvar"];
|
| ... or something.
|
| I'm not sure right now how useful markdown really is... It was
| a nice experiment but I mostly use it for bullet points (*) and
| headings (#). I would say that's successful enough as a
| contribution to internet culture! :-)
|
| EDIT: seems like I commented on xidoc+djot at the same time,
| but the criticism is the same for both languages :-)
|
| https://djot.net/
| MilStdJunkie wrote:
| A subject very near and dear to my heart. Asciidoc, flawed as it
| is, is still the best game in town, with conditional content,
| transclusion, and a particularly robust table model. Combined
| with S1000D architecture, it makes a passable method of writing
| publications for heavy industry in aerospace and defense. On the
| double cheap. If your team has some small amount of tech chops.
|
| The `include` is a problem, though, and the devs have known this
| since at least 2017. Two directives under work here.
| The
| `ainclude` is an extension that works in the subdoc direction,
| and `subdoc` is a directive under development that's on the
| milestones to go into core. Generic blocks are also on the
| roadmap, and available via extension.
|
| However.
|
| There are deeper problems, I feel, inherent in the DNA of component
| content systems themselves, as a basic concept. They plague every
| CCMS (component content management system), whether it's DITA,
| S1000D, DocBook (with xincludes), ReST, or otherwise. It's not a
| markup problem. In S1000D, I call it "the applicability trap".
|
| To put this very briefly, you incur risk when you replace any
| component of natural language with a constructed language. In the
| case of CCMSs, you have an implied content architecture doing the
| work of NL in between content "chunks", but not everyone
| architects content, or, hell, even their product. The writers
| won't know this until they start, and by then it's too late.
|
| Without architecture, you have a bunch of chunks that used to be
| structured linguistically, and now they're not doing much of
| anything at all.
|
| I can't tell you how many S1000D systems I've seen built around a
| product that existed as little more than a twinkle in a
| salesman's eye. You build out an SNS and an applicability
| model[1], all from a bunch of emails or a spotty CAD diagram, and
| tell the writers to get cracking. Then, N months and N millions
| of dollars later, scrap everything because, lo and behold, the
| actual _possible_ product completely invalidates the content
| architecture. "How did the doc set cost 19 million dollars?".
| Well, it's not the markup, it's not the vendor, and whatever
| miracle editor your salesmen buddies are pushing won't do jack to
| fix it, either.
|
| What the writers _need_ is a quantifiable test for architecture
| before they start, or else, sooner or later, the Applicability
| Trap will get them.
| Either that, or write individual BDMs[2] for
| every single product variant that spews out of the pieholes of
| business development.
|
| I feel a little dumb getting worked up here. All of this crap is
| going to be glitter in the river come 2030 anyway, because we'll
| be training our pocket AIs for most of this.
|
| [1] (sort of a structure of conditions to filter content)
|
| [2] Big Dumb Manuals, like mamma used to make
| abathur wrote:
| Hmm. I found myself nodding along with what this has to say about
| syntax design and the responsibilities of well-abstracted markup
| languages/converters, but I was also a little surprised by some
| apparent contradictions.
|
| > Great markup format unambiguously interprets an input string as
| an abstract tree model of a document. It doesn't ascribe
| semantics to particular tag names or attributes.
|
| >
|
| Yes!
|
| > Markup language which nails this perfectly is HTML. It directly
| expresses this tree structure. Various viewers for HTML can then
| render the document in a particular fashion. HTML's syntax itself
| doesn't really care about tag names and semantics: you can
| imagine authoring HTML documents using an alternative set of tag
| names.
|
| But it's a bit weird to uphold HTML when it seems to really be
| valuing the underlying syntax and not the big heckin' standard
| that details the valid elements, attributes, and their semantics.
| :)
|
| > Great markup language defines the semantics of converting text
| to a document tree
|
| This seems to conflict with the first one I quoted. Maybe I'm
| missing something? Maybe the author is using semantic in
| different ways, here? Markup languages tend to mix different
| kinds of elements, and it can make thinking and communicating in
| this space tricky! I started trying to untangle this knot
| recently in
| https://t-ravis.com/post/doc/what_color_is_your_markup/.
|
| (I also found banning "semantic" from my writing helped me
| untangle by making it easier to notice when I was letting it do
| too much work.
| https://t-ravis.com/post/doc/semantic_the_8_letter_s-word/)
| nerdponx wrote:
| > But it's a bit weird to uphold HTML when it seems to really
| be valuing the underlying syntax and not the big heckin'
| standard that details the valid elements, attributes, and their
| semantics. :)
|
| SGML?
| gizmo wrote:
| The main problem continues to be, in my view, that we don't have
| the equivalent of plain text for formatted documents or trees.
| Plain text can be copy-pasted, read by any program, edited
| anywhere, and that's great. Binary formats are inaccessible and
| hard to work with by comparison, but this is only because of
| mundane technical reasons. We could easily make new binary
| formats that would be terrific for lightly marked up text, and
| the editors that go along with them.
|
| Today bold, italic, tables don't fit in plain text, but carriage
| returns and tab characters do, as well as Halloween pumpkin
| emojis. There is no good reason for this, this is just where we
| kind of ended up. You can't put a floating point number in text,
| or a price, or a phone number, or date. And when you don't have
| standards that support this you can't have any meaningful kind of
| data exchange. So forget about copying a table from a spreadsheet
| anywhere. Everybody ends up shoehorning this necessary
| functionality in text editors and it won't ever work.
|
| How can computing still be at the stage where if you want a table
| you have to conjure up a bunch of pipe and dash characters that
| hopefully compiles into a table that looks presentable? And then
| you have to switch between "code view" and "presentation view" to
| check if you did it right. It would be funny if this wasn't so
| tragic.
| fiddlosopher wrote:
| > Markup language which completely falls over this is Markdown.
| There's no way to express generic tree structure, conversion to
| HTML with specific browser tags is hard-coded.
|
| This isn't really a fair criticism. True, the original
| Markdown.pl did not produce a generic tree structure, but that's
| a fact about the program, not the syntax it parses. Many Markdown
| and CommonMark implementations do support creation of an abstract
| syntax tree. Pandoc has done this for the last 17 years. It also
| provides nestable, generic containers as a syntax extension.
|
| > It feels like there's a smaller, simpler language somewhere
|
| Here's my attempt: <https://djot.net>.
| AB1908 wrote:
| Wait a minute, I recognize that GitHub handle! Are you John
| MacFarlane? What a mind-blowing day to come across the creator
| of pandoc. You've saved many a student from pain. Thanks for
| everything.
| djedr wrote:
| > More or less, what I want from markup is to convert a text
| string into a document tree:
|
|     enum Element {
|       Text(String),
|       Node {
|         tag: String,
|         attributes: Map<String, String>,
|         children: Vec<Element>,
|       }
|     }
|
|     fn parse_markup(input: &str) -> Element { ... }
|
| > Markup language which nails this perfectly is HTML.
|
| The reason HTML nails it perfectly is because this is modeled
| after HTML.
|
| If I were to make up a markup language, I wouldn't follow that
| model.
|
| In particular I would get rid of attributes which to me are a
| restricted kind of children with a specialized syntax.
|
| This is both unnecessary and undesirable in many cases.
|
| The major problem with attributes is the <String, String>
| mapping. Once something is defined to be an attribute, it cannot
| be sensibly extended without creating an unnecessary problem.
|
| For example the `class` attribute in HTML looks like it was
| originally designed to hold a single class name. Then people
| realized that it would be desirable to have multiple classes per
| element.
|
| So, to keep things backwards-compatible, the value of the
| attribute was extended to hold a space-separated list of classes
| instead. Essentially creating a little DSL inside of the
| attribute's value.
|
| If `class` was instead a kind of child, initially limited to a
| single instance per element, extending to multiple instances in a
| backward-compatible manner would not require introducing the DSL.
| It would be natural. Just allow many `class` children.
|
| A valid argument in HTML in favor of attributes is conciseness.
| But that's an artifact of the syntax.
|
| We could make up a language with syntax for nodes as concise as
| HTML attributes, eliminating that argument.
|
| > It feels like there's a smaller, simpler language somewhere
|
| Certainly.
|
| I have experimented with many different designs for such a markup
| language on top of Jevko[0]. One interesting design that
| trivially maps to HTML looks like this:
|
|     h1 [Title]
|     p [ paragraph ]
|     p [
|       [paragraph with a ]
|       a [
|         href=[...]
|         [link]
|       ]
|     ]
|
| Another one looks like this:
|
|     [h1][Title]
|     [p][paragraph]
|     [p][
|       paragraph with a
|       [href[...] a][link]
|     ]
|
| Both are extremely simple and minimal, but also extensible and
| lend themselves to writing by hand.
|
| [0] https://news.ycombinator.com/item?id=33287620
| typon wrote:
| You just described Lisp :)
| djedr wrote:
| Surely you mean S-expressions. They're great, but not as a
| markup language.
|
| What I show here is in fact even simpler and more flexible
| than S-exps[0].
|
| [0] For some details and polemic see this thread:
| https://news.ycombinator.com/item?id=33334789
| TL;DR: Jevko
| is well-defined, basically just unicode text + escapable
| brackets for making trees; it doesn't treat whitespace as a
| separator/atmosphere (particularly important in markup); it
| takes advantage of natural name-value pairing tendencies
| (like tag-children); and it's closed under concatenation by
| design
| vitiral wrote:
| I recently made cxt[1] to try to solve many of the issues you
| mention.
|
| [1]: https://github.com/civboot/cxt
| woolybully wrote:
| With apologies to Dr. Knuth...
|
| The most important thing in a markup language is the name. A
| language will not succeed without a good name. I have recently
| invented a very good name, and now I am looking for a suitable
| language.
| [deleted]
| brudgers wrote:
| Unfortunately, MIX is already taken.
___________________________________________________________________
(page generated 2022-10-31 23:00 UTC)