[HN Gopher] Overlapping markup
       ___________________________________________________________________
        
       Overlapping markup
        
       Author : akkartik
       Score  : 65 points
       Date   : 2022-12-12 06:33 UTC (16 hours ago)
        
 (HTM) web link (en.wikipedia.org)
 (TXT) w3m dump (en.wikipedia.org)
        
       | laszlokorte wrote:
       | Wouldn't one obvious solution be to allow tags from different
       | namespaces to overlap? Maybe it is mentioned in the article but I
       | could not see it:
       | 
       |     <ns1:root>
       |     <ns2:root>
       |     <ns1:elemA>This is some <ns2:mark>content</ns1:elemA>
       |     <ns1:elemB>that is split</ns2:mark> into two nodes</ns1:elemB>
       |     </ns2:root>
       |     </ns1:root>
       | 
       | Then in this case two trees with common leaf nodes (4 text nodes)
       | are constructed. From the point of view of ns2-root there are only
       | 3 children (the 2 text nodes outside <mark> and the <mark> itself),
       | and from the point of view of ns1-root there are two children
       | (elemA and elemB).
       | 
       | Then when parsing one could even pre-select which namespaces to
       | parse and skip all others; for example, if I am only interested in
       | ns1, ns2 could be skipped during parsing.
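A minimal sketch of the "pre-select a namespace" idea above, assuming tags of the simple form <ns:name> with no attributes. The names TAG, SOURCE and parse_namespace are hypothetical, and this is not how any real XML parser behaves: standard XML rejects this input as not well-formed.

```python
import re

# Matches <ns:name> or </ns:name>; attributes are not handled (an assumption).
TAG = re.compile(r"<(/?)(\w+):(\w+)>")

SOURCE = (
    "<ns1:root><ns2:root>"
    "<ns1:elemA>This is some <ns2:mark>content</ns1:elemA>"
    "<ns1:elemB>that is split</ns2:mark> into two nodes</ns1:elemB>"
    "</ns2:root></ns1:root>"
)


def parse_namespace(source, namespace):
    """Build a (name, children) tree from one namespace's tags, skipping the rest."""
    root = ("#document", [])
    stack = [root]
    pos = 0
    for match in TAG.finditer(source):
        text = source[pos:match.start()]
        if text:
            # Text always goes to the innermost open element of the chosen namespace.
            stack[-1][1].append(text)
        pos = match.end()
        closing, ns, name = match.groups()
        if ns != namespace:
            continue  # tags from other namespaces are invisible in this view
        if closing:
            if stack[-1][0] != name:
                raise ValueError(f"mismatched </{ns}:{name}>")
            stack.pop()
        else:
            node = (name, [])
            stack[-1][1].append(node)
            stack.append(node)
    return root


ns1_tree = parse_namespace(SOURCE, "ns1")
ns2_tree = parse_namespace(SOURCE, "ns2")
```

Both views see the same four text fragments; only the grouping differs, which matches the child counts given in the comment.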
        
         | layer8 wrote:
         | Your proposal is very similar to SGML's CONCUR feature
         | mentioned in the Wikipedia article.
        
       | low_tech_punk wrote:
       | Maybe the "H" in HTML should stand for "Hierarchical"?
        
       | codetrotter wrote:
       | This Wikipedia article is sorely lacking concrete examples to
       | aid understanding.
       | 
       | Anyone care to add some examples to the article??
        
         | tannhaeuser wrote:
         | SGML's CONCUR feature (criticized but not described in that
         | Wikipedia article) allows tags to have optional _name groups_
         | specifying one or more document type names (that must be
         | declared in the prolog) to which the tag applies, and allows
         | tag pairs with disjoint document type name qualifiers to
         | overlap like this:
         | 
         | <(a)x>bla <(b|c)y>bla</(a)x></(b|c)y>
         | 
         | Traditionally used for poetry and lyrics/drama but could also
         | be useful for postal addresses, lyrics in certain types of
         | musical notation, in translations, and maybe specific text apps
         | such as subtitles/tracks for the hearing impaired. Basically,
         | wherever there's a desire to mark up text in more than a single
         | hierarchy.
        
         | teej wrote:
         | I immediately thought of this Vox breakdown of rhyming in rap.
         | https://youtu.be/QWveXdj6oZU
        
         | dspillett wrote:
         | The "Approaches and implementations" section includes some
         | clear (to my eyes at least) examples of overlapping lines and
         | sentences in poetry represented as html-like markup.
         | 
         | What sort of examples would improve the article's clarity for
         | you?
         | 
         | Wrt the existing examples, perhaps there should be a small
         | section before that, explicitly called "examples", that
         | contains a minimal summary of those examples to illustrate the
         | concept before the reader delves deeper.
        
           | codetrotter wrote:
           | Yeah, I agree with what you are saying. I was viewing this
           | article on mobile and it was hard to spot these examples on
           | mobile, because all sections are collapsed by default and
           | none of the sections had the examples stand out at a cursory
           | glance on mobile. Now that I am on a laptop I easily spot
           | them. I also agree with what you are saying that an explicit
           | section named examples would be good. Especially for mobile
           | reading.
        
         | FigmentEngine wrote:
         | Overlapping b and i elements:
         | 
         |     <p>he<b>ll<i>o w</b>or</i>ld</p>
         | 
         | Contrary to the article, it can still be represented as a tree
         | by decomposing the children into their own nodes (so in this
         | case characters become nodes, with child nodes expressing what
         | formatting is active, followed by the letter, and then turning
         | off all the active formatting).
        
           | admax88qqq wrote:
           | No that's just nesting. It's overlapping if the lifetime of a
           | child tag is greater than the lifetime of the parent tag.
           | 
           | For example, if you have two paragraphs and bold the end of
           | one and the start of the next:
           | 
           |     <p>hello <b>world</p> <p>this is</b> your captain speaking</p>
           | 
           | Obviously bold is a poor example as you can just terminate
           | and start a new bold without penalty. But if these were more
           | semantic elements like "sections" and "verses" and "lines"
           | then it might not be possible.
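The "terminate and start a new bold" step can be done mechanically. Below is a rough sketch, using the <b>/<i> example from upthread, that takes character ranges and emits properly nested tags by closing and reopening whatever is still active at each boundary. The function name to_nested_html and the (start, end, tag) tuple layout are assumptions made for illustration, and the text is not HTML-escaped.

```python
def to_nested_html(text, ranges):
    """Serialize possibly overlapping (start, end, tag) ranges as well-nested tags."""
    ranges = [r for r in ranges if r[0] < r[1]]  # ignore empty ranges
    boundaries = sorted({0, len(text)} | {p for s, e, _ in ranges for p in (s, e)})
    out = []
    open_stack = []  # ranges currently open, outermost first
    for i, pos in enumerate(boundaries):
        ending = [r for r in open_stack if r[1] == pos]
        if ending:
            # Close from the innermost tag down to the outermost one that ends here...
            first = min(open_stack.index(r) for r in ending)
            reopen = [r for r in open_stack[first:] if r[1] != pos]
            for r in reversed(open_stack[first:]):
                out.append(f"</{r[2]}>")
            del open_stack[first:]
            # ...then reopen the ones that were only closed to keep nesting valid.
            for r in reopen:
                out.append(f"<{r[2]}>")
                open_stack.append(r)
        # Open ranges that start here, longest first so they end up outermost.
        for r in sorted((r for r in ranges if r[0] == pos), key=lambda r: -r[1]):
            out.append(f"<{r[2]}>")
            open_stack.append(r)
        if i + 1 < len(boundaries):
            out.append(text[pos:boundaries[i + 1]])
    return "".join(out)


print(to_nested_html("hello world", [(2, 7, "b"), (4, 9, "i")]))
```

This only works for tags that can be split freely, such as bold and italic. A range standing for one "verse" or "sentence" would come out as two elements, which is exactly the loss the comment describes.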
        
             | chrismorgan wrote:
             | > _without penalty_
             | 
             | It's actually fiddlier than you may think. Take "Ta" for an
             | example: in most decent fonts, there will be a kerning pair
             | that tightens those characters, tucking the "a" underneath
             | the beam of the "T" a little. The shaper thus needs to
             | follow the actual fonts being used, for kerning purposes,
             | rather than the markup--but this is still visible at the
             | element level, with getBoundingClientRect().
             | 
             | Take this demo (which depends on your default font having
             | such a kerning pair; if it doesn't, you may need to find
             | one that does and change the font by inserting <html
             | style="font-family:sans-serif"> or similar after the
             | comma):
             | 
             |     data:text/html,<p>Ta<p><b>T</b>a<p>T<b>a</b><p><b>Ta</b><p><b>T</b><b>a</b><script>document.querySelectorAll("b").forEach(e=>console.log(e.getBoundingClientRect().width))</script>
             | 
             | This shows five variants of "Ta", with the last two being
             | <b>Ta</b> and <b>T</b><b>a</b>, and prints five numbers to
             | the console, the widths of each <b> element. Numbers one
             | and four (both corresponding to a <b>T</b>) differ if you
             | have a kerning pair such as I describe: for me, the first
             | is 11.7px, and the second 10.73333px (though it overflows
             | that width in its rendering) because of the <b>a</b> that
             | follows it. If you gave bold elements the style `display:
             | inline-block`, it wouldn't kern the pair and would thus go
             | back to 11.7px.
             | 
             | Most fonts could _really_ use italic-aware kerning (that
             | is, kerning a pair where one glyph is regular and the other
             | italic), but it's sadly not a thing.
        
       | TreeRingCounter wrote:
       | Can someone summarize this? 90% of the content on this page seems
       | like excessively-verbose nonsense.
        
         | thomascgalvin wrote:
         | Many, if not most, computer models represent data as a tree.
         | Some data, however, can't really be represented by a tree,
         | because a "thing" can have multiple parents.
         | 
         | The example in the link:
         | 
         | Example, with lines marked up:
         | 
         |     <line>I, by attorney, bless thee from thy mother,</line>
         |     <line>Who prays continually for Richmond's good.</line>
         |     <line>So much for that.--The silent hours steal on,</line>
         |     <line>And flaky darkness breaks within the east.</line>
         | 
         | With sentences marked up:
         | 
         |     <sentence>I, by attorney, bless thee from thy mother,
         |     Who prays continually for Richmond's good.</sentence>
         |     <sentence>So much for that.</sentence>
         |     <sentence>--The silent hours steal on,
         |     And flaky darkness breaks within the east.</sentence>
         | 
         | If you care about lines _and_ sentences, this is difficult to
         | represent as a tree.
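To make "difficult to represent as a tree" concrete, here is a small sketch that treats lines and sentences as character spans over the same text and reports pairs that cross, meaning they overlap without one containing the other. The helper names span_of and crosses are made up for this illustration.

```python
LINES = [
    "I, by attorney, bless thee from thy mother,",
    "Who prays continually for Richmond's good.",
    "So much for that.--The silent hours steal on,",
    "And flaky darkness breaks within the east.",
]
TEXT = "\n".join(LINES)


def span_of(first_words, last_words):
    """Return the (start, end) offsets in TEXT from first_words through last_words."""
    start = TEXT.index(first_words)
    end = TEXT.index(last_words, start) + len(last_words)
    return (start, end)


line_spans = [span_of(line, line) for line in LINES]
sentence_spans = [
    span_of("I, by attorney", "Richmond's good."),
    span_of("So much for that.", "So much for that."),
    span_of("--The silent", "within the east."),
]


def crosses(a, b):
    """True if the spans overlap but neither one contains the other."""
    (a0, a1), (b0, b1) = a, b
    overlap = a0 < b1 and b0 < a1
    nested = (a0 <= b0 and b1 <= a1) or (b0 <= a0 and a1 <= b1)
    return overlap and not nested


conflicts = [
    (i, j)
    for i, line in enumerate(line_spans)
    for j, sentence in enumerate(sentence_spans)
    if crosses(line, sentence)
]
print(conflicts)  # [(2, 2)]
```

Only the third line and the third sentence cross: the sentence starts mid-line and runs to the end of the next line. One crossing pair is enough to rule out a single tree that holds both hierarchies.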
        
           | TreeRingCounter wrote:
        
           | lioeters wrote:
           | One way to solve this could be to provide separate start/end
           | tags without inner content.
           | 
           |     <line-start/><sentence-start/>I, by attorney, bless thee from thy mother,<line-end/>
           |     <line-start/>Who prays continually for Richmond's good.<sentence-end/><line-end/>
        
             | thomascgalvin wrote:
             | Yeah, that's how the linked article does it, but that's ...
             | icky? It's still a token spanning multiple parents, it's
             | just masquerading as a couple of self-closing tags.
             | 
             | Which, of course, is the point of the article, and why this
             | is a difficult problem.
        
               | lioeters wrote:
               | Ah you're right, I should have read the article before
               | commenting, haha. I agree it's not an ideal solution. A
               | disadvantage I imagine is that this syntax pushes the
               | problem onto the parser/consumer to keep track of
               | overlapping regions.
               | 
               | > Milestones are empty elements that mark the beginning
               | and end of a component, typically using the XML ID
               | mechanism to indicate which "begin" element goes with
               | which "end" element.
               | 
               | https://en.wikipedia.org/wiki/Overlapping_markup#Milestones
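A sketch of what "pushing the problem onto the consumer" can look like: the reader strips the milestone markers and rebuilds ranges itself. It uses the <x-start/> and <x-end/> naming from the example upthread and pairs markers by kind with a stack, which assumes two ranges of the same kind never overlap. The quoted article describes pairing by XML ID instead, and that is not modelled here. parse_milestones is a hypothetical name.

```python
import re

# Empty elements such as <line-start/> or <sentence-end/>.
MILESTONE = re.compile(r"<(\w+)-(start|end)/>")


def parse_milestones(source):
    """Return (plain_text, ranges), where ranges are (kind, start, end) offsets."""
    parts = []
    ranges = []
    open_starts = {}  # kind -> stack of start offsets still waiting for an end
    length = 0        # length of the plain text collected so far
    pos = 0
    for match in MILESTONE.finditer(source):
        chunk = source[pos:match.start()]
        parts.append(chunk)
        length += len(chunk)
        pos = match.end()
        kind, edge = match.groups()
        if edge == "start":
            open_starts.setdefault(kind, []).append(length)
        else:
            # Raises KeyError or IndexError on an end marker with no matching start.
            ranges.append((kind, open_starts[kind].pop(), length))
    parts.append(source[pos:])
    return "".join(parts), ranges


SOURCE = (
    "<line-start/><sentence-start/>I, by attorney, bless thee from thy mother,<line-end/>\n"
    "<line-start/>Who prays continually for Richmond's good.<sentence-end/><line-end/>"
)
text, ranges = parse_milestones(SOURCE)
for kind, start, end in ranges:
    print(kind, repr(text[start:end]))
```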
        
       | captainmuon wrote:
       | Back in the day when I was in school, and there was an IE
       | monopoly, I wrote a simple HTML parser. Instead of parsing it
       | into a tree, it just recorded the beginning and end position of
       | tags as indices into the string. I think I did use a stack to
       | match nested tags properly. But overlapping markup was common
       | back then, and IE rendered it "correctly" IIRC. This simple
       | parser was enough to power a scraper (I don't remember what I was
       | scraping. Maybe a competitor's emule link site or something like
       | that :-P) and a crude rich text renderer, which I was very proud
       | of.
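A rough guess at what such a position-recording scanner might look like; this is not the original code, and the regex and the name scan_tags are assumptions. An end tag is matched with the most recent still-open start tag of the same name, so overlapping tags simply produce overlapping index ranges.

```python
import re

# Start or end tag; attributes are skipped, comments and self-closing tags are not handled.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")


def scan_tags(html):
    """Return (name, content_start, content_end) index triples into html."""
    spans = []
    open_tags = []  # (name, content_start) for tags not yet closed
    for match in TAG.finditer(html):
        closing, name = match.group(1), match.group(2).lower()
        if not closing:
            open_tags.append((name, match.end()))
            continue
        # Search from the innermost open tag outwards for the same name.
        for i in range(len(open_tags) - 1, -1, -1):
            if open_tags[i][0] == name:
                _, content_start = open_tags.pop(i)
                spans.append((name, content_start, match.start()))
                break
        # An end tag with no matching start tag is silently ignored.
    return spans


html = "<p>he<b>ll<i>o w</b>or</i>ld</p>"
for name, start, end in scan_tags(html):
    print(name, repr(html[start:end]))
```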
        
       | dejj wrote:
       | Consider Aftertext (draft): it separates the markup from the text
       | entirely. Overlapping markup ranges become trivial.
       | 
       | https://breckyunits.com/aftertext.html
        
         | masswerk wrote:
         | This is how styled SimpleText read-me files worked in classic
         | Mac OS. A normal file was plain text, but styles could be
         | appended based on indices (much like selection and regions work
         | in modern web APIs).
        
       | NWoodsman wrote:
       | Change my view: given any data storage medium, the smallest
       | granularity of data must also be the most-child element of any
       | markup language. Given the immense overhead of storing markups on
       | a granular level, processing markup therefore must be a perpetual
       | exercise in recursion.
       | 
       | I.e.
       | 
       |     Poem->Verse->Line-> <char>
       |     Book->Page->Chapter->Paragraph->Sentence->Word-> <char>
       |     HTML->Body->Div->P-> <char>
       | 
       | Therefore, any given letter (here as a <char> type) can retain a
       | back reference of parents, so the <char> object retains a hashset
       | of {Line,Word,P} parent type references representing three
       | domains, but really needs to be a Dictionary of key values, the
       | key being the domain name, the value being the parent name, so
       | that would be:
       | 
       | Domain: Poetry, Value: Line
       | 
       | Domain: Book Object Model, Value: Word
       | 
       | Domain: HTML, Value: P Element
       | 
       | We could then ask any letter arbitrarily "what is your Font Style
       | in your HTML context?" and it would be able to walk up the parent
       | P which obtains its style from a CSS markup, and return that
       | correctly. Or "What is your Poem's name in your Poetry context?"
       | and it could recurse up to the Poem element to find its Title.
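A minimal sketch of the per-character dictionary described above, assuming exactly one parent chain per domain. Node, Char, lookup and ask are hypothetical names, and the poem title is a placeholder.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    kind: str
    props: dict = field(default_factory=dict)
    parent: Optional["Node"] = None

    def lookup(self, key):
        """Walk up the parent chain until some ancestor defines key."""
        node = self
        while node is not None:
            if key in node.props:
                return node.props[key]
            node = node.parent
        return None


@dataclass
class Char:
    value: str
    parents: dict = field(default_factory=dict)  # domain name -> parent Node

    def ask(self, domain, key):
        """Answer a question about this character within one domain."""
        return self.parents[domain].lookup(key)


# Poetry domain: Poem -> Verse -> Line
poem = Node("Poem", {"title": "Example Poem"})
line = Node("Line", parent=Node("Verse", parent=poem))

# HTML domain: Body -> P, with the P overriding the inherited font style
body = Node("Body", {"font-style": "normal"})
p = Node("P", {"font-style": "italic"}, parent=body)

letter = Char("a", {"Poetry": line, "HTML": p})
print(letter.ask("Poetry", "title"))     # Example Poem
print(letter.ask("HTML", "font-style"))  # italic
```

The lookup is simple only because each domain contributes a single chain of ancestors to walk.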
        
         | jerf wrote:
         | Are you claiming the parents will always be unique? Because as
         | the article says, you can easily have this, where going to the
         | _right_ is a parent relationship:
         |                 -> Line -> Verse -> Poem
         |     char -> Word
         |                 -> Clause -> Sentence -> Poem
         | 
         | You can try adding a further constraint that any given property
         | must have only one path, so you can then recurse over the tree
         | and find the one match, but as your model gets richer you will
         | find that breaks.
         | 
         | And it's that last clause that is the killer for pretty much
         | anything: "As your model gets richer you will find that
         | breaks."
         | 
         | Plus the UI experience for that is awful. "I want to add this
         | property to this Line but you're telling me it's a duplicate
         | for some particular character? What the hell does that mean?
         | I'm not adding a property to the character!" etc. etc.
        
       | mdciotti wrote:
       | I've frequently wondered why a hierarchical approach is the norm
       | for text formatting. It seems that many problems could be solved
       | trivially using a text buffer and a list of formatting sequences
       | defined by a starting index and a length. The only place I've
       | seen this in practice is in Telegram's TL Schema [1]. Is this
       | method found anywhere else?
       | 
       | Edit to note: there is one obvious advantage to in-band markup
       | such as HTML -- streaming formatted content. Though I wonder if
       | this could be done with a non-hierarchical method, for example
       | using in-band start tags which also encode the length.
       | 
       | Edit 2: looks like Conde Nast maintains a similar technology
       | called atjson [2].
       | 
       | [1]: https://core.telegram.org/api/entities
       | 
       | [2]: https://github.com/CondeNast/atjson
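For illustration, a sketch of the "text buffer plus list of formatting sequences" model, with each entry given as (offset, length, kind). This is a generic sketch with a hypothetical function name, not the Telegram or atjson format; real formats also have to pin down what unit offsets are counted in.

```python
def segments(text, entities):
    """Split text at every formatting boundary and list the kinds active in each piece."""
    cuts = {0, len(text)}
    for offset, length, _ in entities:
        cuts.update((offset, offset + length))
    cuts = sorted(c for c in cuts if 0 <= c <= len(text))
    result = []
    for start, end in zip(cuts, cuts[1:]):
        active = frozenset(
            kind
            for offset, length, kind in entities
            if offset <= start and end <= offset + length
        )
        result.append((text[start:end], active))
    return result


text = "italic bold and italic bold"
entities = [(0, 22, "italic"), (7, 20, "bold")]  # the two ranges overlap
for piece, active in segments(text, entities):
    print(repr(piece), sorted(active))
```

Overlap needs no special handling here: a renderer just walks the pieces and applies whichever kinds are active.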
        
         | jake-low wrote:
         | There are a number of rich text editors that model documents as
         | a flat array of characters and a separate table of formatting
         | modifiers (each with an offset and length). Medium's text
         | editor is one of them. This post [1] on their engineering blog
         | introduced me to the idea, and I think it's a good starting
         | point for anyone interested in this topic.
         | 
         | ProseMirror (a JavaScript library for building rich text
         | editors) also employs a document model like this. The docs for
         | that project [2] do a good job of explaining how their
         | implementation of this idea works, and what problems it solves.
         | 
         | [1]: https://medium.engineering/why-contenteditable-is-terrible-1...
         | 
         | [2]: https://prosemirror.net/docs/guide/#doc
        
         | samwillis wrote:
         | That list of formatting sequences would have to be updated with
         | new indexes when the content of the buffer changed. Keeping the
         | two in sync wouldn't be trivial (for a computer or a human), a
         | tree of nodes fixes that and works for 99.99% of use cases.
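The bookkeeping in question, sketched for the simplest edit (inserting text). insert_text is a hypothetical helper, and the boundary policy is a choice rather than a rule: here, typing exactly at the start of a range pushes the range right, and typing exactly at its end does not extend it.

```python
def insert_text(text, ranges, pos, inserted):
    """Insert a string at pos and return (new_text, adjusted (start, end, kind) ranges)."""
    n = len(inserted)
    new_ranges = []
    for start, end, kind in ranges:
        if pos <= start:
            # Insertion before (or exactly at) the start: shift the whole range.
            start += n
            end += n
        elif pos < end:
            # Insertion strictly inside the range: the range grows.
            end += n
        # Otherwise the insertion is at or after the end: range unchanged.
        new_ranges.append((start, end, kind))
    return text[:pos] + inserted + text[pos:], new_ranges


text = "hello world"
ranges = [(6, 11, "bold")]  # "world"
new_text, new_ranges = insert_text(text, ranges, 0, "oh, ")
start, end, _ = new_ranges[0]
print(new_text, repr(new_text[start:end]))
```

Deletions and replacements need the same kind of case analysis, plus a decision about ranges that shrink to nothing.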
        
           | jerf wrote:
           | It may not be trivial, but it's a solved problem. Many rich
           | text UI widgets and corresponding backing data structures
           | exist today, based on a tagging system where tags can
           | trivially define regions that overlap with each other. It's
           | tricky and full of corner cases, but not _that_ hard if you
           | put your mind to it, and it's not computationally
           | inefficient either.
        
         | jcparkyn wrote:
         | I guess because it would be a total pain for humans to read and
         | write without specialised tooling. Imagine trying to add a word
         | at the start of your document.
        
         | jerf wrote:
         | "I've frequently wondered why a hierarchical approach is the
         | norm for text formatting."
         | 
         | 80/20, if not 90/10, effectiveness. Most people are not trying
         | to do what the Wikipedia article is talking about. About the
         | most complicated thing that people want to do is the moral
         | equivalent of <i>italic <b>bold and italic</i> bold</b>, and
         | you can losslessly convert that to <i>italic <b>bold and
         | italic</b></i><b> bold</b> for almost all practical purposes.
         | 
         | It isn't until you're getting very precise about what your tags
         | mean, for tags that intrinsically "cross" hierarchies like
         | that, that you start seeing these issues. And then by the time
         | you've gotten that far, you realize you have all sorts of
         | problems, as the article says.
         | 
         | But a good deal of the answer is that while the stuff mentioned
         | in the Wikipedia article is true and important, it's also
         | fairly specialist.
         | 
         | As for "The only place I've seen this in practice is in
         | Telegram's TL Schema [1]. Is this method found anywhere else?",
         | tag-based formatting is the norm for rich text widgets, which
         | generally can natively represent my first HTML example above in
         | its internal format. Generally if you dig into your favorite
         | language you'll find someone has already implemented this
         | efficiently as a library you can pick up if you want to use the
         | capability directly outside of a text widget. It has its own
         | consequences, as anyone who has ever fought with them may
         | realize, but it's not impossibly difficult to deal with.
         | 
         | It isn't a magic solution to everything either, though. Even if
         | it is what you think you want, a widget able to represent a
         | bold section starting in the middle of a paragraph, then
         | proceeding through the first three rows of a table, then
         | stopping in the middle of a paragraph in the third column of
         | the next row is generally weird. To some extent, people have a
         | certain hierarchiness to their thinking about these matters
         | too, whether it's cause or effect. But that hierarchiness is
         | messy; I think it's fair to say most people wouldn't "mean"
         | that bold to mean something in my table case, we don't
         | necessarily expect tags to proceed through tables like that,
         | but <i>i<b>bi</i>b</b> is something that people might
         | intuitively expect to be able to do. It's a fractally messy
         | space both in the computer science and human expectations, and
         | the fractal messiness only gets messier when we try to
         | harmonize those two things.
        
       | samwillis wrote:
       | There are so many odd edge cases in HTML, a good one I found was
       | with forms. If you open a <form> but don't have a closing tag,
       | the browser will close the form block "visually" at the end of
       | the form's immediate parent, as you would expect. All styles are
       | applied to it, or children via selectors, up to that
       | automatically inserted end point. It's how browsers handle most
       | unclosed block tags.
       | 
       | However, the form's "functionality" isn't closed at that point;
       | any inputs further down the page (outside of the form's DOM tree)
       | are included in the post/get when the form is submitted. Or at
       | least until another form is found in the DOM. Effectively an
       | unclosed form is two things, a visual block that is closed
       | automatically, and an "overlapping" form capturing inputs
       | indefinitely.
        
         | chrismorgan wrote:
         | This behaviour is defined and explained in the HTML spec with
         | the _form element pointer_
         | <https://html.spec.whatwg.org/multipage/parsing.html#form-ele...>:
         | 
         | > _The form element pointer points to the last form element
         | that was opened and whose end tag has not yet been seen. It is
         | used to make form controls associate with forms in the face of
         | dramatically bad markup, for historical reasons._
         | 
         | And search through the rest of the page for the term to find
         | how it's implemented--it's straightforward, just set on a
         | <form> open tag and reset on an (explicit) </form> close tag.
         | 
         | This is somewhat unreliable: _browsers_ support it, but tools
         | using XML pipelines are allowed to ignore it (§13.2.9), and
         | lots of JavaScript code will assume hierarchy rather than using
         | form.elements, and thus not catch such elements, or elements
         | that manually specify a form owner via the _form_ attribute.
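A toy model of the set-on-<form>, reset-on-</form> behaviour described above, to show why an unclosed form keeps collecting inputs. It is a deliberately simplified sketch and not the spec's algorithm: tokens are plain tuples, only <input> counts as a control, the form attribute is ignored, and a second <form> start tag before any </form> is left unmodelled. All names are hypothetical.

```python
def associate_controls(tokens):
    """Map each input name to the form it would be associated with (or None)."""
    form_pointer = None  # stands in for the parser's "form element pointer"
    form_count = 0
    owners = {}
    for token in tokens:
        kind, tag = token[0], token[1]
        if kind == "start" and tag == "form":
            if form_pointer is not None:
                raise NotImplementedError("nested <form> start tags are not modelled")
            form_count += 1
            form_pointer = f"form{form_count}"
        elif kind == "end" and tag == "form":
            form_pointer = None  # only an explicit </form> resets the pointer
        elif kind == "start" and tag == "input":
            owners[token[2]] = form_pointer
        # Every other token, including </div>, leaves the pointer untouched.
    return owners


unclosed = [
    ("start", "div"), ("start", "form"), ("start", "input", "a"),
    ("end", "div"),            # the form's parent ends, but there was no </form>
    ("start", "input", "b"),   # outside the form's subtree in the DOM
]
print(associate_controls(unclosed))  # {'a': 'form1', 'b': 'form1'}
```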
        
           | samwillis wrote:
           | Thanks! My 2 minutes of googling back when I found it didn't
           | surface that and I moved on to the next job.
           | 
           | Somehow despite coding html for 25 years I had either not
           | seen the input form attribute or forgotten about it. I
           | suspect the latter!
        
             | chrismorgan wrote:
             | Steps on finding this from the HTML spec:
             | 
             | 1 Start at https://html.spec.whatwg.org/multipage/. Or
             | https://html.spec.whatwg.org/ if you prefer, with
             | everything in one page, but that's a _big_ document. You
             | can also build it all locally yourself if you like. I have.
             | 
             | 2 "The form element" sounds like a good place to look.
             | https://html.spec.whatwg.org/multipage/forms.html#the-form-e...
             | 
             | 3 Look through the DOM interface listed, _elements_ sounds
             | promising. Find the explanation of that IDL attribute
             | below: "The elements IDL attribute must return an
             | HTMLFormControlsCollection rooted at the form element's
             | root, whose filter matches listed elements whose form owner
             | is the form element, with the exception of input elements
             | whose type attribute is in the Image Button state, which
             | must, for historical reasons, be excluded from this
             | particular collection." Roll your eyes at the bizarre
             | exclusion of <input type=image>, then focus on the term
             | _form owner_ which sounds relevant. That links you to
             | https://html.spec.whatwg.org/multipage/form-control-infrastr....
             | 
             | 4 Hmm... null, parser inserted flag, nearest ancestor form
             | element, form attribute. Parser inserted flag sounds
             | relevant (though it's just a flag, not the actual
             | association link). Also the note "They are also complicated
             | by rules in the HTML parser that, for historical reasons,
             | can result in a form-associated element being associated
             | with a form element that is not its ancestor."
             | 
             | 5 This is where having the _whole_ spec open, rather than
             | the multipage version, is handy: you can search the entire
             | document for the term "parser inserted flag" to see where
             | that gets set. You can also guess that it's going to be in
             | §13.2 _Parsing HTML documents_ (parsing.html). In the end,
             | it's
             | https://html.spec.whatwg.org/multipage/parsing.html#creating...:
             | "... then associate element with the form
             | element pointed to by the form element pointer and set
             | element's parser inserted flag." Ah hah!
             | 
             | 6 You have found the concept in the parser: "form element
             | pointer". You can then look through where it's used and
             | quickly see how it's set on <form> and unset on </form>,
             | thus deliberately handling the missing-</form> case.
             | 
             | You develop a feeling for this kind of thing over time. I
             | didn't know about the form element pointer (though I feel I
             | should have known about it), but this is a loose
             | description of what I did, though I was able to speed
             | through some of the steps, and I really should have just
             | started by looking at "An end tag whose tag name is
             | "form"", but at first I thought the claim was bogus.
        
               | samwillis wrote:
               | I think I got to point 2, found no reference in the form
               | tag section, and gave up.
               | 
               | But what's fascinating is that it describes the html
               | parser effectively implementing "overlapping markup", as
               | in the Wikipedia article, for this edge case for
               | backwards compatibility.
        
         | lordnacho wrote:
         | Why didn't they go the grammar nazi route? Define a spec and if
         | the page doesn't conform, draw an error message.
         | 
         | It's really annoying to have this kind of undefined behaviour
         | that might end up being relied upon.
        
         | jraph wrote:
         | This seems strange, how is that represented in the DOM, which
         | is strictly a tree?
        
           | samwillis wrote:
           | It's not in the DOM, from memory chrome dev tools even shows
           | a closing form tag where it's been inserted. I have no idea
           | how it's implemented internally.
           | 
           | Confused me for a while when debugging a legacy website. It
           | had actually been done intentionally to work around a rather
           | complex architecture.
        
             | laszlokorte wrote:
             | There exists a "form"-attribute for input elements that can
             | be used to associate input elements outside the form
             | hierarchy to be included in the form submission.
             | 
             | So the semantics of "form field outside the actual form"
             | are available anyway. When parsing a not-closed <form> the
             | browsers just make use of that.
        
           | ggus wrote:
           | Maybe it leverages the "form" optional attribute that can
           | specify the form the <input> element belongs to.
        
       ___________________________________________________________________
       (page generated 2022-12-12 23:00 UTC)