[HN Gopher] When XML in Word Became Illegal
       ___________________________________________________________________
        
       When XML in Word Became Illegal
        
       Author : ejz
       Score  : 92 points
       Date   : 2023-10-12 14:49 UTC (6 hours ago)
        
 (HTM) web link (blog.withedge.com)
 (TXT) w3m dump (blog.withedge.com)
        
       | jkaptur wrote:
       | "Microsoft.... built a custom XML tool into its word processor in
       | 2007... this was a tool for power users, and was only used by a
       | small percentage of its user base."
       | 
       | I'm definitely confused by that statement and its link, because
       | it implies the relevant tool is _the disk format for every Office
       | file_ , which has been described by an Excel program manager as
       | "complicated enough to reduce a grown programmer to tears."
       | https://www.joelonsoftware.com/2008/02/19/why-are-the-micros...
        
         | colejohnson66 wrote:
         | Nitpick: Joel is referring to the _old_ BIFF-style format (from
         | 2003 and before) in that quote. The new "Office Open XML"
         | formats are not mentioned in that post at all. However, one of
         | the many criticisms of the Office Open XML formats is that they
         | are, in some areas, nothing more than an XML serialization of
         | the BIFF records.
        
         | xanathar wrote:
         | The article says the feature has been removed; if it was the
         | disk format:
         | 
         | 1) it has never been removed, afaik Word still uses OOXML, so
         | Word would keep being infringing
         | 
         | 2) LibreOffice would probably be infringing too, as ODF is also
         | XML based
         | 
         | So... it has to be some other form of XML tool and not the file
         | format.
         | 
         | As for Joel's comment, IIRC he was an Excel PM _before_ OOXML;
         | in any case his blog post refers to the binary format that
         | precedes OOXML. I'm pretty sure OOXML is equally if not even
         | more complicated, as the products themselves are way more
         | complicated than they appear, but the fact is that he was
         | talking about a different thing.
         | 
         | Edit: as many users pointed out, it's not the file format
         | itself, but the ability to add arbitrary attributes/elements to
         | the file format XML as additional data.
        
         | blackboxlogic wrote:
         | I believe I ran into this issue a few years ago and discovered
         | the patent case when trying to work around it. The xml file format
         | allowed for arbitrary properties to be added (as xml does), and
         | we were trying to embed metadata in word files. But when MS
         | Word opened a file with anything extra in it it gave a warning
         | like "this file has extra stuff in it" and it automatically
         | removed anything that wasn't explicitly expected.
        
           | msp_yc wrote:
           | Not sure why this is downvoted, it's absolutely correct. I
           | tried this myself; it would have -greatly- simplified
           | scraping Word docs because the custom tags would have been
           | available for XPath querying. Alas, Word strips it all on
           | open.
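
A minimal sketch of the kind of query the two comments above describe, assuming a file that still contains `w:customXml` wrappers in `word/document.xml` (per the thread, Word strips them on open). The sample URI, the element name `invoiceNumber`, and the helpers `custom_xml_tags` and `build_sample` are hypothetical; the sample package is built in memory and is not a complete .docx.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def custom_xml_tags(docx_bytes):
    """Return (element name, enclosed text) for each w:customXml wrapper."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    found = []
    for node in root.iter(f"{{{W}}}customXml"):
        name = node.get(f"{{{W}}}element")
        # Collect the text runs nested inside the custom tag.
        text = "".join(t.text or "" for t in node.iter(f"{{{W}}}t"))
        found.append((name, text))
    return found


# Hypothetical document body; a real .docx also needs [Content_Types].xml
# and relationship parts, which this sketch leaves out.
SAMPLE = (
    f'<w:document xmlns:w="{W}"><w:body><w:p>'
    '<w:customXml w:uri="urn:example:invoice" w:element="invoiceNumber">'
    '<w:r><w:t>INV-001</w:t></w:r>'
    '</w:customXml>'
    '</w:p></w:body></w:document>'
)


def build_sample():
    """Zip the sample body into an in-memory package."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", SAMPLE)
    return buffer.getvalue()


print(custom_xml_tags(build_sample()))
```

With the tags left in place, pulling labelled values out of a document would be a short tree walk like this rather than text scraping.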
        
         | richard_todd wrote:
         | It's not referring to the XML formats.. it was a feature of
         | Word specifically which allowed you to embed a user-defined xml
         | schema in your Word document, and use XML data that fits the
         | schema in your document.
         | 
         | See https://www.zdnet.com/article/custom-xml-the-key-to-
         | patent-s...
         | 
         | (edit: grammar)
        
           | jkaptur wrote:
           | Ah, thanks for explaining.
        
         | tannhaeuser wrote:
         | Looking at the patent application, it doesn't appear to mention
         | XML at all (it does talk about SGML, though), and the
         | application appears to claim any mapping of a symbolic name to
         | style properties (think Word styles or CSS classes); in other
         | words, technical trivialities, reflecting poorly on US lawyers
         | and their patent law.
        
         | rbehrends wrote:
         | It's not about storing XML, it's (as far as I understand the
         | patent) about a specific representation of XML that can be more
         | efficient to read.
         | 
         | The patent is about representing documents with markup (XML or
         | otherwise) not by embedding them in the text, but rather having
         | them stripped and maintained as a separate list of (tag,
         | position) pairs, with the document only containing the raw
         | text.
         | 
         | I'm only surprised that Microsoft couldn't find prior art,
         | because having a (content-type, address) index at the beginning
         | of a file is not exactly an unusual representation. It also
         | reminds me that the USPTO's idiosyncratic usage of non-
         | obviousness doesn't really match my intuition.
        
           | ejz wrote:
           | This is a huge issue with the patent world in general.
           | There's just so much prior art out there, and you have to be
           | really clear about showing that it applies. This isn't a
           | patent case, but I have a great Google Maps case involving
           | Wi-Fi where a judge completely borked it. As for this
           | particular patent, I'm not enough of an XML expert to say
           | whether the court got it right here. But it is worth noting
           | that Microsoft tried to invalidate the patent several times
           | with USPTO and failed to do so there as well. So perhaps
           | there's something more to the patent than meets the eye, or
           | that it was novel at that time but not modern XML. Remember,
           | the actual i4i patent at issue was filed in 1994, and it only
           | matters if there was prior art from before 1994. It might
           | have been novel at the time.
        
             | rbehrends wrote:
             | > Remember, the actual i4i patent at issue was filed in
             | 1994, and it only matters if there was prior art from
             | before 1994. It might have been novel at the time.
             | 
             | I am aware of the date of the "invention". I was
             | programming on 8- and 16-bit computers in the 1980s and I
             | was using this and similar kinds of formats for non-textual
             | data, simply because it was easier to do this in assembler
             | than writing a parser, paired with the difficulty of
             | finding unused special bytes in binary data to separate
             | meta-information from the data proper.
             | 
             | And I was also talking about non-obviousness, not novelty.
        
               | ejz wrote:
               | Fair enough. I haven't seen the invalidation proceedings
               | and am clearly less of an expert than you. So don't know
               | whether they got it right. Non-obviousness is, erm, non-
               | obvious.
        
           | cm2187 wrote:
           | Am I right to understand that it would be the equivalent of
           | visual studio's wpf designer [1], where you have the WYSIWYG
           | editor side by side with an xml editor and you can make the
           | change in either of them and it translates into the other?
           | 
           | If it is, it would have been really really cool.
           | 
           | [1] https://i.stack.imgur.com/8pJnn.png
        
             | rbehrends wrote:
             | No. It's more like what the following piece of code
             | produces:
             | 
             |     def convert(xml):
             |         import re
             |         parsed = re.split(r"(<.+?>)", xml)
             |         output = parsed[0]
             |         tags_with_pos = []
             |         for i in range(1, len(parsed), 2):
             |             tags_with_pos.append((parsed[i], len(output)))
             |             output += parsed[i+1]
             |         return tags_with_pos, output
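
To make the representation in the snippet above concrete, here is a sketch that runs `convert` on a small input and then reverses it. The `restore` function is a hypothetical helper added for illustration; it is not from the comment or the patent, and it assumes the (tag, position) pairs are kept in document order.

```python
import re


def convert(xml):
    """Split markup into (tag, position) pairs plus the raw text."""
    parsed = re.split(r"(<.+?>)", xml)
    output = parsed[0]
    tags_with_pos = []
    for i in range(1, len(parsed), 2):
        # Position is the offset into the tag-free text where the tag sat.
        tags_with_pos.append((parsed[i], len(output)))
        output += parsed[i + 1]
    return tags_with_pos, output


def restore(tags_with_pos, text):
    """Hypothetical inverse: weave the tags back into the raw text."""
    pieces = []
    cursor = 0
    for tag, pos in tags_with_pos:
        pieces.append(text[cursor:pos])
        pieces.append(tag)
        cursor = pos
    pieces.append(text[cursor:])
    return "".join(pieces)


source = "<p>Hello <b>world</b>!</p>"
tags, raw = convert(source)
print(tags)  # [('<p>', 0), ('<b>', 6), ('</b>', 11), ('</p>', 12)]
print(raw)   # Hello world!
print(restore(tags, raw) == source)  # True
```

The raw text can be read or searched without touching the tag list, and the original markup is still recoverable from the two pieces.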
        
           | robertlagrant wrote:
           | > the USPTO's idiosyncratic usage of non-obviousness doesn't
           | really match my intuition
           | 
           | Remember that USPTO gets paid for each patent application,
           | and not penalised when it's later invalidated.
        
             | rbehrends wrote:
             | Well, it was apparently upheld twice on reexamination,
             | where they could have fixed that. The problem is more that
             | the bar for non-obviousness is so low, it's basically on
             | the floor. Paired with a discipline (software development),
             | where independent reinvention is common, this is just a
             | recipe for disaster.
        
         | Karellen wrote:
         | > it implies the relevant tool is _the disk format for every
         | Office file_
         | 
         | Does it imply that?
         | 
         | Another commenter has already pointed out why it's likely not
         | the case.
         | 
         | But also, I don't think the article is well written. Partly
         | because it doesn't clearly explain what the infringing tool
         | was, or did, or how it operated. Also I'm pretty sure there's a
         | typo in "ex part" instead of "ex parte". But another major
         | issue is the following:
         | 
         | > $40 million of that judgment [against Microsoft] was imposed
         | by the court as punishment for continually arguing that i4i was
         | a patent troll even though it had an operating business in a
         | manner that was "persistent, legally improper, and in direct
         | violation of the Court's instructions."
         | 
         | What?
         | 
         | Why would i4i operating in a manner that was persistent,
         | improper and in violation of the court's instructions preclude
         | it from also being a patent troll? It could do both?
         | 
         | Or is the "persistent..." descriptor meant to apply to
         | Microsoft? That might make more sense, but the "even though"
         | seems to be a comparison between two types of activity by one
         | entity - namely i4i.
         | 
         | But then again, I might be reading "it had an operating
         | business in a manner" wrong, because it feels ungrammatical to
         | me. I might not be putting the emphasis in the right place, and
         | that's what's causing me to misread the sentence?
         | 
         | The whole thing just feels confusing.
        
           | ejz wrote:
           | Thanks for reading. Sorry if this was confusing! Microsoft
           | said that i4i was a patent troll despite the court repeatedly
           | telling Microsoft to not do that. The judge referred to
           | Microsoft's repeated ignoring of its instructions as
           | "persistent" etc. i4i had an operating business; it wasn't a
           | patent troll. That operating business is niche and small, but
           | it is real. I have updated that sentence to make it clearer.
           | Thanks for your feedback!
        
             | ackfoobar wrote:
             | Depends on one's definition. I don't think "not having a
             | real product/service" is the defining characteristic of
             | "patent troll". Here's what Wikipedia says.
             | 
             | > attempts to enforce patent rights against accused
             | infringers far beyond the patent's actual value or
             | contribution to the prior art
             | 
             | > often do not manufacture products or supply services
             | based upon the patents in question
        
         | ejz wrote:
         | This isn't what Joel is talking about here.
         | 
         | On the backend, all .docx files use XML. Joel is saying the
         | root XML format was difficult to work with.
         | 
         | What my article is about is this: Microsoft used to allow users
         | to write their own custom XML rules on top of Word. (This was
         | mostly app developers using XML for macros rather than end
         | users, and overall it was very rare.) This is the feature that
         | was at issue with the patent.
         | 
         | Sorry if this was not clear!
        
           | jkaptur wrote:
           | Thanks for clarifying!
        
           | Jtsummers wrote:
           | > Joel is saying the root XML format was difficult to work
           | with.
           | 
           | Joel wasn't writing about the XML version of MS Office
           | documents, he was writing about the binary versions.
        
       | londons_explore wrote:
       | Anyone got a screenshot of this feature?
        
         | dbavaria wrote:
         | See here: https://learn.microsoft.com/en-
         | us/office/troubleshoot/word/c...
        
       | jandrese wrote:
       | > Indeed, as you work on your Excel clone, you'll discover all
       | kinds of subtle details about date handling. When does Excel
       | convert numbers to dates? How does the formatting work? Why is
       | 1/31 interpreted as January 31 of this year, while 1/50 is
       | interpreted as January 1st, 1950? All of these subtle bits of
       | behavior cannot be fully documented without writing a document
       | that has the same amount of information as the Excel source code.
       | 
       | A quick note to anybody building an Excel clone: If you want to
       | turn this insane date handling behavior of Excel into an optional
       | feature that can be disabled everybody will appreciate it.
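
A rough sketch of the 1/31 versus 1/50 behaviour in the quoted passage, for anyone wondering what such a rule looks like in code. This is a guess at the heuristic, not Excel's actual logic: the function name `interpret_slash_pair`, the month/day-first order, and the two-digit-year pivot at 30 are all assumptions, and the real behaviour varies by locale.

```python
import calendar
import datetime


def interpret_slash_pair(text, today):
    """Guess a date from 'a/b' along the lines of the quoted passage.

    Assumption: try month/day in the current year first; if b is not a
    valid day for month a, fall back to month/two-digit-year on the 1st.
    """
    first, second = (int(part) for part in text.split("/"))
    if not 1 <= first <= 12:
        return None  # not treated as a date in this sketch
    days_in_month = calendar.monthrange(today.year, first)[1]
    if 1 <= second <= days_in_month:
        return datetime.date(today.year, first, second)
    if 0 <= second <= 99:
        # Assumed pivot: 00-29 map to the 2000s, 30-99 to the 1900s.
        century = 2000 if second < 30 else 1900
        return datetime.date(century + second, first, 1)
    return None


today = datetime.date(2023, 10, 12)
print(interpret_slash_pair("1/31", today))  # 2023-01-31
print(interpret_slash_pair("1/50", today))  # 1950-01-01
```

Even this toy version shows why a clone would want the conversion behind a switch: the same two numbers mean different things depending on whether the second one happens to be a valid day.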
        
         | atoav wrote:
         | I always wondered why they won't just make it a popup button?
         | 
         | Default should be to not change anything, if a date is
         | recognized offer a button right next to the cell that allows
         | you to accept the suggestion to turn it into a fully fledged
         | date. Just make it so that pressing tab or shift enter or a
         | similar combination accepts that suggestion.
        
           | xigoi wrote:
           | https://xkcd.com/1172/
        
             | numpad0 wrote:
             | Just do https://xkcd.com/927/, happened once and it was
             | okay.
        
               | xigoi wrote:
               | What comes after .docx? .docxx? .docy? .docxi?
        
               | jimmaswell wrote:
               | docxEx, in Win32 fashion.
        
         | WirelessGigabit wrote:
         | Scientists will thank you:
         | https://www.theverge.com/2020/8/6/21355674/human-genes-renam...
        
           | qclibre22 wrote:
           | > Scientists will thank you
           | 
           | Scientists gave up and changed the gene names:
           | https://duckduckgo.com/?q=excel+gene+names+changed+septin1
        
         | jahav wrote:
         | It's also country specific.
         | 
         | I work on an Excel library and the text to number/date feature was
         | one of the less fun things to implement at least semi-correctly.
         | 
         | I remember my comment on the PR back then:
         | 
         | https://github.com/ClosedXML/ClosedXML/pull/1899
        
       | pjungwir wrote:
       | If only someone had filed a patent that blocked Word from
       | inserting curly quotes the wrong way, like '449.
        
       | willcipriano wrote:
       | The United States is #1 for protection of intellectual property
       | in the world according to the property rights index:
       | https://www.internationalpropertyrightsindex.org/
       | 
       | Real property on the other hand? The US is ranked 14th.
        
       | breakfastduck wrote:
       | I'm completely baffled as to how it's allowed to get a patent on
       | stuff like this.
       | 
       | Can I patent sending REST requests using JSON?
        
         | empath-nirvana wrote:
         | No, that's not how it works, you can't patent a specific
         | technology that's already been invented, what you do is wait
         | for a new technology to be invented and then patent doing some
         | obvious thing with the new technology.
         | 
         | Like, a good patent today would be: "Using a computer text
         | prediction engine to automatically review and approve code."
         | 
         | It probably would have been pretty smart to skim through all
         | the hacker news threads after ChatGPT came out patenting every
         | other comment.
        
           | donatj wrote:
           | XML editing had already been invented
        
         | jahav wrote:
         | You can try to patent anything, but the patent might not be
         | accepted.
         | 
         | The thing is that the patent office is funded by patent fees, so
         | there is an incentive to accept the patent plus they are often
         | hard to read.
        
           | svachalek wrote:
           | What I understand of US law is that there's very little in
           | the way of filing a patent. It's not really tested until
           | someone challenges it.
        
         | lucozade wrote:
         | You could apply for that patent but I would expect it to be
         | rejected due to prior art i.e. someone came up with it before
         | you. Even if it was accepted, if you tried to enforce it, it'd
         | definitely be challenged on prior art and you would very likely
         | lose because it wouldn't be hard to prove you weren't the first.
         | 
         | Now, why this particular patent exists, and seems so general,
         | is also likely related to prior art. What could be patented for
         | software was a bit murky until the late 1990's when it was
         | established that business methods implemented in software were
         | allowed. This led to a large flood of patents in that space.
         | 
         | One of the issues is that the Patent Office tends to look at
         | prior art as being "things that have already been patented" so
         | when rules change, a lot of things that seem obvious are up for
         | grabs because there's no prior patent. Now, these can be (and are)
         | challenged in court and, in court, they're more likely to
         | accept blatant prior usage in the wild. I don't know whether
         | this case won its challenge but it's possible that it didn't
         | because XML was quite new in the late 90s too.
         | 
         | Source: I have a patent from around that time that basically
         | covers anything in finance that's data driven from an XML
         | document. For about a decade, that covered a fairly large chunk
         | of finance. I never did anything about it as I disagreed in
         | principle with the premise of such an absurdly broad patent. I
         | agreed to it being patented solely for defensive reasons ie it
         | might prevent a competitor from egregiously attacking my
         | employer with patents.
        
         | renewiltord wrote:
         | If you're sufficiently creative, certainly. Some of my friends
         | patented something totally absurd: there's a transformation you
         | can easily do in software and lots of software does it quite
         | routinely. They did it twice. Patent issued.
        
       | yarone wrote:
       | A classic Joel on Software article about funny backwards
       | compatibility built into Excel:
       | https://www.joelonsoftware.com/2006/06/16/my-first-billg-rev...
        
         | Macha wrote:
         | I realised there is more time between now and that article,
         | than there is between that article and the events described
         | within.
        
       | FpUser wrote:
       | Looking at patent abstract [0] it basically patents separation of
       | information and structure. The latter can be used to present
       | information in various ways.
       | 
       | My take is that it is fucking obvious and I just simply do not
       | believe that the concept did not have prior art. It just shows
       | what a crooked business this whole modern patent system is.
       | 
       | [0] - "A system and method for the separate manipulation of the
       | architecture and content of a document, particularly for data
       | representation and transformations. The system, for use by
       | computer software developers, removes dependency on document
       | encoding technology. A map of metacodes found in the document is
       | produced and provided and stored separately from the document.
       | The map indicates the location and addresses of metacodes in the
       | document. The system allows of multiple views of the same
       | content, the ability to work solely on structure and solely on
       | content, storage efficiency of multiple versions and efficiency
       | of operation."
        
       | ClearDayDev wrote:
       | I've not read the patent, but it's definitely inaccurate to say
       | "Microsoft removed custom XML from Word." It's still possible to
       | create custom XML parts programmatically, and I suspect it's
       | quite commonly done. Also, I just checked, and Microsoft 365 has
       | a custom XML mapping tool on the developer tab. So it would be
       | interesting to know how Microsoft complied with the judgment and
       | the subsequent history of the feature.
        
       ___________________________________________________________________
       (page generated 2023-10-12 21:00 UTC)