[HN Gopher] Overlapping markup
___________________________________________________________________
Overlapping markup

Author : akkartik
Score  : 65 points
Date   : 2022-12-12 06:33 UTC (16 hours ago)

(HTM) web link (en.wikipedia.org)
(TXT) w3m dump (en.wikipedia.org)

| laszlokorte wrote:
| Wouldn't one obvious solution be to allow tags from different
| namespaces to overlap? Maybe it is mentioned in the article but I
| could not see it:
|
|     <ns1:root>
|     <ns2:root>
|     <ns1:elemA>This is some <ns2:mark>content</ns1:elemA>
|     <ns1:elemB>that is split</ns2:mark> into two nodes</ns1:elemB>
|     </ns2:root>
|     </ns1:root>
|
| Then in this case two trees with common leaf nodes (4 text nodes)
| are constructed. From the point of view of ns2-root there are only
| 3 children (the 2 text nodes outside <mark> and the <mark>), and
| from the point of view of ns1-root there are two children (elemA
| and elemB).
|
| Then when parsing one could even pre-select which namespaces to
| parse and skip all others; for example, if I am only interested in
| ns1, ns2 could be skipped during parsing.

| layer8 wrote:
| Your proposal is very similar to SGML's CONCUR feature
| mentioned in the Wikipedia article.

| low_tech_punk wrote:
| Maybe the "H" in HTML should stand for "Hierarchical"?

| codetrotter wrote:
| This Wikipedia article is sorely lacking concrete examples to
| aid understanding.
|
| Anyone care to add some examples to the article?
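
A rough Python sketch of laszlokorte's namespace idea from the top of the thread, for readers who want something concrete. It is not an existing parser or standard: the function name `split_by_namespace`, the regular expression, and the (name, start, end) span format are all assumptions made for illustration. It strips the tags, keeps one shared text buffer, and records a separate list of ranges per namespace, so each namespace only has to nest properly within itself.

```python
import re

# Markup in the style of laszlokorte's example (hypothetical input).
SOURCE = (
    "<ns1:root><ns2:root>"
    "<ns1:elemA>This is some <ns2:mark>content</ns1:elemA>"
    "<ns1:elemB>that is split</ns2:mark> into two nodes</ns1:elemB>"
    "</ns2:root></ns1:root>"
)

# Assumed tag syntax: <ns:name> and </ns:name>, no attributes.
TAG = re.compile(r"<(/?)(\w+):(\w+)>")


def split_by_namespace(source):
    """Return (plain_text, spans); spans maps each namespace to a list of
    (element_name, start, end) character ranges into plain_text."""
    text_parts = []
    length = 0          # length of the plain text collected so far
    open_tags = {}      # namespace -> stack of (name, start_offset)
    spans = {}
    pos = 0
    for match in TAG.finditer(source):
        chunk = source[pos:match.start()]
        text_parts.append(chunk)
        length += len(chunk)
        pos = match.end()
        closing = match.group(1) == "/"
        ns, name = match.group(2), match.group(3)
        stack = open_tags.setdefault(ns, [])
        if not closing:
            stack.append((name, length))
            continue
        # Tags only need to nest within their own namespace.
        if not stack or stack[-1][0] != name:
            raise ValueError(f"mismatched </{ns}:{name}>")
        _, start = stack.pop()
        spans.setdefault(ns, []).append((name, start, length))
    text_parts.append(source[pos:])
    return "".join(text_parts), spans


text, spans = split_by_namespace(SOURCE)
for ns, ranges in spans.items():
    for name, start, end in ranges:
        print(ns, name, repr(text[start:end]))
```

A consumer interested only in ns1 can ignore `spans["ns2"]` entirely, which is the "skip the other namespaces" property the comment asks for.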

| tannhaeuser wrote:
| SGML's CONCUR feature (criticized but not described in that
| Wikipedia article) allows tags to have optional _name groups_
| specifying one or more document type names (that must be
| declared in the prolog) to which the tag applies, and allows
| tag pairs with disjoint document type name qualifiers to
| overlap like this:
|
|     <(a)x>bla <(b|c)y>bla</(a)x></(b|c)y>
|
| It is traditionally used for poetry and lyrics/drama, but could
| also be useful for postal addresses, lyrics in certain types of
| musical notation, translations, and maybe specific text apps
| such as subtitles/tracks for the hearing impaired. Basically,
| wherever there's a desire to mark up text in more than a single
| hierarchy.

| teej wrote:
| I immediately thought of this Vox breakdown of rhyming in rap.
| https://youtu.be/QWveXdj6oZU

| dspillett wrote:
| The "Approaches and implementations" section includes some
| clear (to my eyes at least) examples of overlapping lines and
| sentences in poetry represented as HTML-like markup.
|
| What sort of examples would improve the article's clarity for
| you?
|
| Wrt the existing examples, perhaps there should be a small
| section before that, explicitly called "examples", that
| contains a minimal summary of those examples to illustrate the
| concept before the reader delves deeper.

| codetrotter wrote:
| Yeah, I agree with what you are saying. I was viewing this
| article on mobile and it was hard to spot these examples
| there, because all sections are collapsed by default and
| none of the sections had the examples stand out at a cursory
| glance. Now that I am on a laptop I easily spot them. I also
| agree that an explicit section named "examples" would be
| good, especially for mobile reading.

| FigmentEngine wrote:
| Overlapping b and i elements:
|
|     <p>he<b>ll<i>o w</b>or</i>ld</p>
|
| Contrary to the article, it can still be represented as a tree,
| by decomposing the children into their own nodes (so in this
| case characters become nodes with child nodes expressing what
| formatting is active, followed by the letter, and then turning
| off all the active formatting).

| admax88qqq wrote:
| No, that's just nesting. It's overlapping if the lifetime of a
| child tag is greater than the lifetime of the parent tag.
|
| For example, if you have two paragraphs and bold the end of one
| and the start of the next:
|
|     <p>hello <b>world</p>
|     <p>this is</b> your captain speaking</p>
|
| Obviously bold is a poor example as you can just terminate
| and start a new bold without penalty. But if these were more
| semantic elements like "sections" and "verses" and "lines"
| then it might not be possible.

| chrismorgan wrote:
| > _without penalty_
|
| It's actually fiddlier than you may think. Take "Ta" for an
| example: in most decent fonts, there will be a kerning pair
| that tightens those characters, tucking the "a" underneath
| the beam of the "T" a little. The shaper thus needs to
| follow the actual fonts being used, for kerning purposes,
| rather than the markup--but this is still visible at the
| element level, with getBoundingClientRect().
|
| Take this demo (which depends on your default font having
| such a kerning pair; if it doesn't, you may need to find
| one that does and change the font by inserting <html
| style="font-family:sans-serif"> or similar after the
| comma):
|
|     data:text/html,<p>Ta<p><b>T</b>a<p>T<b>a</b><p><b>Ta</b><p><b>T</b><b>a</b><script>document.querySelectorAll("b").forEach(e=>console.log(e.getBoundingClientRect().width))</script>
|
| This shows five variants of "Ta", with the last two being
| <b>Ta</b> and <b>T</b><b>a</b>, and prints five numbers to
| the console, the widths of each <b> element.
| Numbers one and four (both corresponding to a <b>T</b>) differ
| if you have a kerning pair such as I describe: for me, the
| first is 11.7px, and the second 10.73333px (though it overflows
| that width in its rendering) because of the <b>a</b> that
| follows it. If you gave bold elements the style `display:
| inline-block`, it wouldn't kern the pair and would thus go
| back to 11.7px.
|
| Most fonts could _really_ use italic-aware kerning (that
| is, kerning a pair where one glyph is regular and the other
| italic), but it's sadly not a thing.

| TreeRingCounter wrote:
| Can someone summarize this? 90% of the content on this page seems
| like excessively verbose nonsense.

| thomascgalvin wrote:
| Many, if not most, computer models represent data as a tree.
| Some data, however, can't really be represented by a tree,
| because a "thing" can have multiple parents.
|
| The example in the link:
|
| Example, with lines marked up:
|
|     <line>I, by attorney, bless thee from thy mother,</line>
|     <line>Who prays continually for Richmond's good.</line>
|     <line>So much for that.--The silent hours steal on,</line>
|     <line>And flaky darkness breaks within the east.</line>
|
| With sentences marked up:
|
|     <sentence>I, by attorney, bless thee from thy mother,
|     Who prays continually for Richmond's good.</sentence>
|     <sentence>So much for that.</sentence>
|     <sentence>--The silent hours steal on,
|     And flaky darkness breaks within the east.</sentence>
|
| If you care about lines _and_ sentences, this is difficult to
| represent as a tree.

| TreeRingCounter wrote:

| lioeters wrote:
| One way to solve this could be to provide separate start/end
| tags without inner content.
|
|     <line-start/><sentence-start/>I, by attorney, bless thee from thy mother,<line-end/>
|     <line-start/>Who prays continually for Richmond's good.<sentence-end/><line-end/>

| thomascgalvin wrote:
| Yeah, that's how the linked article does it, but that's ...
| icky?
| It's still a token spanning multiple parents; it's
| just masquerading as a couple of self-closing tags.
|
| Which, of course, is the point of the article, and why this
| is a difficult problem.

| lioeters wrote:
| Ah you're right, I should have read the article before
| commenting, haha. I agree it's not an ideal solution. A
| disadvantage I imagine is that this syntax pushes the
| problem onto the parser/consumer to keep track of
| overlapping regions.
|
| > Milestones are empty elements that mark the beginning
| and end of a component, typically using the XML ID
| mechanism to indicate which "begin" element goes with
| which "end" element.
|
| https://en.wikipedia.org/wiki/Overlapping_markup#Milestones

| captainmuon wrote:
| Back in the day when I was in school, and there was an IE
| monopoly, I wrote a simple HTML parser. Instead of parsing it
| into a tree, it just recorded the beginning and end position of
| tags as indices into the string. I think I did use a stack to
| match nested tags properly. But overlapping markup was common
| back then, and IE rendered it "correctly" IIRC. This simple
| parser was enough to power a scraper (I don't remember what I was
| scraping. Maybe a competitor's eMule link site or something like
| that :-P) and a crude rich text renderer, which I was very proud
| of.

| dejj wrote:
| Consider Aftertext (draft): it separates the markup from the text
| entirely. Overlapping markup ranges become trivial.
|
| https://breckyunits.com/aftertext.html

| masswerk wrote:
| This is how styled SimpleText read-me files worked in classic
| Mac OS. A normal file was plain text, but styles could be
| appended based on indices (much like selection and regions work
| in modern web APIs).

| NWoodsman wrote:
| Change my view: given any data storage medium, the smallest
| granularity of data must also be the leaf element of any
| markup language.
| Given the immense overhead of storing markup on
| a granular level, processing markup therefore must be a perpetual
| exercise in recursion.
|
| I.e.
|
|     Poem->Verse->Line-> <char>
|     Book->Page->Chapter->Paragraph->Sentence->Word-> <char>
|     HTML->Body->Div->P-> <char>
|
| Therefore, any given letter (here as a <char> type) can retain a
| back reference to its parents, so the <char> object retains a
| hashset of {Line,Word,P} parent type references representing
| three domains, but really needs to be a dictionary of key-value
| pairs, the key being the domain name, the value being the parent
| name, so that would be:
|
| Domain: Poetry, Value: Line
|
| Domain: Book Object Model, Value: Word
|
| Domain: HTML, Value: P Element
|
| We could then ask any letter arbitrarily "What is your Font Style
| in your HTML context?" and it would be able to walk up to the
| parent P, which obtains its style from CSS, and return that
| correctly. Or "What is your Poem's name in your Poetry context?"
| and it could recurse up to the Poem element to find its Title.

| jerf wrote:
| Are you claiming the parents will always be unique? Because as
| the article says, you can easily have this, where going to the
| _right_ is a parent relationship:
|
|                   -> Line -> Verse -> Poem
|     char -> Word
|                   -> Clause -> Sentence -> Poem
|
| You can try adding a further constraint that any given property
| must have only one path, so you can then recurse over the tree
| and find the one match, but as your model gets richer you will
| find that breaks.
|
| And it's that last clause that is the killer for pretty much
| anything: "As your model gets richer you will find that
| breaks."
|
| Plus the UI experience for that is awful. "I want to add this
| property to this Line but you're telling me it's a duplicate
| for some particular character? What the hell does that mean?
| I'm not adding a property to the character!" etc. etc.
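
To make NWoodsman's per-character dictionary concrete, here is a small Python sketch. Everything in it is hypothetical and chosen for illustration: the `Node` and `Char` classes, the `lookup` helper, the domain keys "poetry" and "html", and the poem title. It assumes each domain gives a character exactly one chain of parents, which is precisely the assumption jerf's reply questions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """One element in a single hierarchy (Line, Poem, P, Body, ...)."""
    kind: str
    attrs: dict = field(default_factory=dict)
    parent: Optional["Node"] = None


@dataclass
class Char:
    """A leaf character holding one back reference per domain."""
    value: str
    parents: dict = field(default_factory=dict)  # domain name -> nearest Node


def lookup(char, domain, key):
    """Walk up the parent chain of one domain until some node defines key."""
    node = char.parents.get(domain)
    while node is not None:
        if key in node.attrs:
            return node.attrs[key]
        node = node.parent
    return None


# Poetry domain: Poem -> Verse -> Line
poem = Node("Poem", {"title": "Example poem"})
verse = Node("Verse", parent=poem)
line = Node("Line", parent=verse)

# HTML domain: Body -> P
body = Node("Body", {"font-style": "normal"})
para = Node("P", {"font-style": "italic"}, parent=body)

char = Char("I", {"poetry": line, "html": para})

print(lookup(char, "poetry", "title"))     # Example poem
print(lookup(char, "html", "font-style"))  # italic
```

The sketch only works because "poetry" maps to a single parent. Once a character can reach Poem both through Line and through Sentence, one key per domain is no longer enough, and each competing hierarchy would need its own key or the lookup would need to search several paths.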

| mdciotti wrote:
| I've frequently wondered why a hierarchical approach is the norm
| for text formatting. It seems that many problems could be solved
| trivially using a text buffer and a list of formatting sequences
| defined by a starting index and a length. The only place I've
| seen this in practice is in Telegram's TL Schema [1]. Is this
| method found anywhere else?
|
| Edit to note: there is one obvious advantage to in-band markup
| such as HTML -- streaming formatted content. Though I wonder if
| this could be done with a non-hierarchical method, for example
| using in-band start tags which also encode the length.
|
| Edit 2: looks like Conde Nast maintains a similar technology
| called atjson [2].
|
| [1]: https://core.telegram.org/api/entities
|
| [2]: https://github.com/CondeNast/atjson

| jake-low wrote:
| There are a number of rich text editors that model documents as
| a flat array of characters and a separate table of formatting
| modifiers (each with an offset and length). Medium's text
| editor is one of them. This post [1] on their engineering blog
| introduced me to the idea, and I think it's a good starting
| point for anyone interested in this topic.
|
| ProseMirror (a JavaScript library for building rich text
| editors) also employs a document model like this. The docs for
| that project [2] do a good job of explaining how their
| implementation of this idea works, and what problems it solves.
|
| [1]: https://medium.engineering/why-contenteditable-is-terrible-1...
|
| [2]: https://prosemirror.net/docs/guide/#doc

| samwillis wrote:
| That list of formatting sequences would have to be updated with
| new indexes when the content of the buffer changed. Keeping the
| two in sync wouldn't be trivial (for a computer or a human); a
| tree of nodes fixes that and works for 99.99% of use cases.

| jerf wrote:
| It may not be trivial, but it's a solved problem.
| Many rich text UI widgets and corresponding backing data
| structures exist today, based on a tagging system where tags
| can trivially define regions that overlap with each other. It's
| tricky and full of corner cases, but not _that_ hard if you
| put your mind to it, and it's not computationally
| inefficient either.

| jcparkyn wrote:
| I guess because it would be a total pain for humans to read and
| write without specialised tooling. Imagine trying to add a word
| at the start of your document.

| jerf wrote:
| "I've frequently wondered why a hierarchical approach is the
| norm for text formatting."
|
| 80/20, if not 90/10, effectiveness. Most people are not trying
| to do what the Wikipedia article is talking about. About the
| most complicated thing that people want to do is the moral
| equivalent of <i>italic <b>bold and italic</i> bold</b>, and
| you can losslessly convert that to <i>italic <b>bold and
| italic</b></i><b> bold</b> for almost all practical purposes.
|
| It isn't until you're getting very precise about what your tags
| mean, for tags that intrinsically "cross" hierarchies like
| that, that you start seeing these issues. And then by the time
| you've gotten that far, you realize you have all sorts of
| problems, as the article says.
|
| But a good deal of the answer is that while the stuff mentioned
| in the Wikipedia article is true and important, it's also
| fairly specialist.
|
| As for "The only place I've seen this in practice is in
| Telegram's TL Schema [1]. Is this method found anywhere else?",
| tag-based formatting is the norm for rich text widgets, which
| generally can natively represent my first HTML example above in
| their internal format. Generally if you dig into your favorite
| language you'll find someone has already implemented this
| efficiently as a library you can pick up if you want to use the
| capability directly outside of a text widget.
| It has its own consequences, as anyone who has ever fought with
| them may realize, but it's not impossibly difficult to deal with.
|
| It isn't a magic solution to everything either, though. Even if
| it is what you think you want, a widget able to represent a
| bold section starting in the middle of a paragraph, then
| proceeding through the first three rows of a table, then
| stopping in the middle of a paragraph in the third column of
| the next row is generally weird. To some extent, people have a
| certain hierarchiness to their thinking about these matters
| too, whether it's cause or effect. But that hierarchiness is
| messy; I think it's fair to say most people wouldn't "mean"
| that bold to mean something in my table case; we don't
| necessarily expect tags to proceed through tables like that,
| but <i>i<b>bi</i>b</b> is something that people might
| intuitively expect to be able to do. It's a fractally messy
| space both in the computer science and human expectations, and
| the fractal messiness only gets messier when we try to
| harmonize those two things.

| samwillis wrote:
| There are so many odd edge cases in HTML; a good one I found was
| with forms. If you open a <form> but don't have a closing tag,
| the browser will close the form block "visually" at the end of
| the form's immediate parent, as you would expect. All styles are
| applied to it, or children via selectors, up to that
| automatically inserted end point. It's how browsers handle most
| unclosed block tags.
|
| However, the form's "functionality" isn't closed at that point:
| any inputs further down the page (outside of the form's DOM tree)
| are included in the POST/GET when the form is submitted. Or at
| least until another form is found in the DOM. Effectively an
| unclosed form is two things: a visual block that is closed
| automatically, and an "overlapping" form capturing inputs
| indefinitely.
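
A toy Python model of the behaviour samwillis describes, for illustration only. It is not the HTML spec's tree-construction algorithm: the token format, the function name `associate_controls`, and the "form#N" labels are invented here, and it ignores templates, the `form` attribute, and scoping rules. The one idea it borrows is a parser-level pointer to the last opened form, which the reply below identifies as the spec's "form element pointer".

```python
def associate_controls(tokens):
    """tokens is a list of ("open", tag), ("close", tag) or ("input", name).

    Returns {input_name: (tree_parent, form_owner)} so the tree position and
    the form association of each control can be compared.
    """
    stack = ["html"]      # stack of open elements, innermost last
    form_pointer = None   # last <form> opened with no explicit </form> yet
    form_count = 0
    owners = {}
    for kind, value in tokens:
        if kind == "open":
            if value == "form":
                if form_pointer is not None:
                    continue  # a <form> inside an open form is ignored
                form_count += 1
                value = f"form#{form_count}"
                form_pointer = value
            stack.append(value)
        elif kind == "close":
            target = form_pointer if value == "form" else value
            if value == "form":
                form_pointer = None  # only an explicit </form> resets it
            if target in stack:
                # Closing an ancestor also closes everything still open in it.
                while stack.pop() != target:
                    pass
        elif kind == "input":
            owners[value] = (stack[-1], form_pointer)
    return owners


unclosed = [
    ("open", "div"), ("open", "form"), ("input", "a"),
    ("close", "div"),            # the form leaves the tree here...
    ("input", "b"),              # ...but the pointer still captures this input
]
print(associate_controls(unclosed))
```

In this model input "b" ends up with "html" as its tree parent while still being owned by "form#1", which is the split between the "visual" block and the form's "functionality" described above.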

| chrismorgan wrote:
| This behaviour is defined and explained in the HTML spec with
| the _form element pointer_
| <https://html.spec.whatwg.org/multipage/parsing.html#form-ele...>:
|
| > _The form element pointer points to the last form element
| that was opened and whose end tag has not yet been seen. It is
| used to make form controls associate with forms in the face of
| dramatically bad markup, for historical reasons._
|
| And search through the rest of the page for the term to find
| how it's implemented--it's straightforward, just set on a
| <form> open tag and reset on an (explicit) </form> close tag.
|
| This is somewhat unreliable: _browsers_ support it, but tools
| using XML pipelines are allowed to ignore it (§13.2.9), and
| lots of JavaScript code will assume hierarchy rather than using
| form.elements, and thus not catch such elements, or elements
| that manually specify a form owner via the _form_ attribute.

| samwillis wrote:
| Thanks! My 2 minutes of googling back when I found it didn't
| surface that and I moved on to the next job.
|
| Somehow despite coding HTML for 25 years I had either not
| seen the input form attribute or forgotten about it. I
| suspect the latter!

| chrismorgan wrote:
| Steps on finding this from the HTML spec:
|
| 1 Start at https://html.spec.whatwg.org/multipage/. Or
| https://html.spec.whatwg.org/ if you prefer, with
| everything in one page, but that's a _big_ document. You
| can also build it all locally yourself if you like. I have.
|
| 2 "The form element" sounds like a good place to look.
| https://html.spec.whatwg.org/multipage/forms.html#the-form-e...
|
| 3 Look through the DOM interface listed, _elements_ sounds
| promising.
| Find the explanation of that IDL attribute
| below: "The elements IDL attribute must return an
| HTMLFormControlsCollection rooted at the form element's
| root, whose filter matches listed elements whose form owner
| is the form element, with the exception of input elements
| whose type attribute is in the Image Button state, which
| must, for historical reasons, be excluded from this
| particular collection." Roll your eyes at the bizarre
| exclusion of <input type=image>, then focus on the term
| _form owner_ which sounds relevant. That links you to
| https://html.spec.whatwg.org/multipage/form-control-infrastr....
|
| 4 Hmm... null, parser inserted flag, nearest ancestor form
| element, form attribute. Parser inserted flag sounds
| relevant (though it's just a flag, not the actual
| association link). Also the note "They are also complicated
| by rules in the HTML parser that, for historical reasons,
| can result in a form-associated element being associated
| with a form element that is not its ancestor."
|
| 5 This is where having the _whole_ spec open, rather than
| the multipage version, is handy: you can search the entire
| document for the term "parser inserted flag" to see where
| that gets set. You can also guess that it's going to be in
| §13.2 _Parsing HTML documents_ (parsing.html). In the end,
| it's https://html.spec.whatwg.org/multipage/parsing.html#creating...:
| "... then associate element with the form
| element pointed to by the form element pointer and set
| element's parser inserted flag." Ah hah!
|
| 6 You have found the concept in the parser: "form element
| pointer". You can then look through where it's used and
| quickly see how it's set on <form> and unset on </form>,
| thus deliberately handling the missing-</form> case.
|
| You develop a feeling for this kind of thing over time.
| I didn't know about the form element pointer (though I feel I
| should have known about it), but this is a loose
| description of what I did, though I was able to speed
| through some of the steps, and I really should have just
| started by looking at "An end tag whose tag name is
| "form"", but at first I thought the claim was bogus.

| samwillis wrote:
| I think I got to point 2, found no reference in the form
| tag section, and gave up.
|
| But what's fascinating is that it describes the HTML
| parser effectively implementing "overlapping markup", as
| in the Wikipedia article, for this edge case for
| backwards compatibility.

| lordnacho wrote:
| Why didn't they go the grammar nazi route? Define a spec and if
| the page doesn't conform, show an error message.
|
| It's really annoying to have this kind of undefined behaviour
| that might end up being relied upon.

| jraph wrote:
| This seems strange; how is that represented in the DOM, which
| is strictly a tree?

| samwillis wrote:
| It's not in the DOM; from memory, Chrome dev tools even show
| a closing form tag where it's been inserted. I have no idea
| how it's implemented internally.
|
| Confused me for a while when debugging a legacy website. It
| had actually been done intentionally to work around a rather
| complex architecture.

| laszlokorte wrote:
| There exists a "form" attribute for input elements that can
| be used to associate input elements outside the form
| hierarchy with a form, so that they are included in the form
| submission.
|
| So the semantics of "form field outside the actual form"
| are available anyway. When parsing a not-closed <form> the
| browsers just make use of that.

| ggus wrote:
| Maybe it leverages the optional "form" attribute that can
| specify the form the <input> element belongs to.
___________________________________________________________________
(page generated 2022-12-12 23:00 UTC)