[HN Gopher] When XML in Word Became Illegal ___________________________________________________________________ When XML in Word Became Illegal Author : ejz Score : 92 points Date : 2023-10-12 14:49 UTC (6 hours ago) (HTM) web link (blog.withedge.com) (TXT) w3m dump (blog.withedge.com) | jkaptur wrote: | "Microsoft.... built a custom XML tool into its word processor in | 2007... this was a tool for power users, and was only used by a | small percentage of its user base." | | I'm definitely confused by that statement and its link, because | it implies the relevant tool is _the disk format for every Office | file_, which has been described by an Excel program manager as | "complicated enough to reduce a grown programmer to tears." | https://www.joelonsoftware.com/2008/02/19/why-are-the-micros... | colejohnson66 wrote: | Nitpick: Joel is referring to the _old_ BIFF-style format (from | 2003 and before) in that quote. The new "Office Open XML" | formats are not mentioned in that post at all. However, one of | the many criticisms of the Office Open XML formats is that they | are, in some areas, nothing more than an XML serialization of | the BIFF records. | xanathar wrote: | The article says the feature has been removed; if it was the | disk format: | | 1) it has never been removed, afaik Word still uses OOXML, so | Word would keep being infringing | | 2) LibreOffice would probably be infringing too, as ODF is also | XML based | | So... it has to be some other form of XML tool and not the file | format. | | As for Joel's comment, IIRC he was an Excel PM _before_ OOXML; | in any case his blog post refers to the binary format that | precedes OOXML. I'm pretty sure OOXML is equally if not even | more complicated, as the products themselves are way more | complicated than they appear, but the fact is that he was | talking about a different thing. 
| | Edit: as many users pointed out, it's not the file format | itself, but the ability to add arbitrary attributes/elements to | the file format XML as additional data. | blackboxlogic wrote: | I believe I ran into this issue a few years ago and discovered | the patent case when trying to work around it. The xml file format | allowed for arbitrary properties to be added (as xml does), and | we were trying to embed metadata in word files. But when MS | Word opened a file with anything extra in it, it gave a warning | like "this file has extra stuff in it" and it automatically | removed anything that wasn't explicitly expected. | msp_yc wrote: | Not sure why this is downvoted, it's absolutely correct. I | tried this myself; it would have -greatly- simplified | scraping Word docs because the custom tags would have been | available for XPath querying. Alas, Word strips it all on | open. | richard_todd wrote: | It's not referring to the XML formats... it was a feature of | Word specifically which allowed you to embed a user-defined xml | schema in your Word document, and use XML data that fits the | schema in your document. | | See https://www.zdnet.com/article/custom-xml-the-key-to- | patent-s... | | (edit: grammar) | jkaptur wrote: | Ah, thanks for explaining. | tannhaeuser wrote: | Looking at the patent application, it doesn't appear to mention | XML at all (it does talk about SGML, though), and the | application appears to claim any mapping of a symbolic name to | style properties (think Word styles or CSS classes); in other | words, technical trivialities, reflecting poorly on US lawyers | and their patent law. | rbehrends wrote: | It's not about storing XML, it's (as far as I understand the | patent) about a specific representation of XML that can be more | efficient to read. 
| | The patent is about representing documents with markup (XML or | otherwise) not by embedding them in the text, but rather having | them stripped and maintained as a separate list of (tag, | position) pairs, with the document only containing the raw | text. | | I'm only surprised that Microsoft couldn't find prior art, | because having a (content-type, address) index at the beginning | of a file is not exactly an unusual representation. It also | reminds me that the USPTO's idiosyncratic usage of non- | obviousness doesn't really match my intuition. | ejz wrote: | This is a huge issue with the patent world in general. | There's just so much prior art out there, and you have to be | really clear about showing that it applies. This isn't a | patent case, but I have a great Google Maps case involving | Wi-Fi where a judge completely borked it. As for this | particular patent, I'm not enough of an XML expert to say | whether the court got it right here. But it is worth noting | that Microsoft tried to invalidate the patent several times | with the USPTO and failed to do so there as well. So perhaps | there's something more to the patent than meets the eye, or | that it was novel at that time but not modern XML. Remember, | the actual i4i patent at issue was filed in 1994, and it only | matters if there was prior art from before 1994. It might | have been novel at the time. | rbehrends wrote: | > Remember, the actual i4i patent at issue was filed in | 1994, and it only matters if there was prior art from | before 1994. It might have been novel at the time. | | I am aware of the date of the "invention". I was | programming on 8- and 16-bit computers in the 1980s and I | was using this and similar kinds of formats for non-textual | data, simply because it was easier to do this in assembler | than writing a parser, paired with the difficulty of | finding unused special bytes in binary data to separate | meta-information from the data proper. 
| | And I was also talking about non-obviousness, not novelty. | ejz wrote: | Fair enough. I haven't seen the invalidation proceedings | and am clearly less of an expert than you. So don't know | whether they got it right. Non-obviousness is, erm, non- | obvious. | cm2187 wrote: | Am I right to understand that it would be the equivalent of | visual studio's wpf designer [1], where you have the WYSIWYG | editor side by side with an xml editor and you can make the | change in either of them and it translates into the other? | | If it is, it would have been really really cool. | | [1] https://i.stack.imgur.com/8pJnn.png | rbehrends wrote: | No. It's more like what the following piece of code | produces:
|
|     def convert(xml):
|         import re
|         parsed = re.split(r"(<.+?>)", xml)
|         output = parsed[0]
|         tags_with_pos = []
|         for i in range(1, len(parsed), 2):
|             tags_with_pos.append((parsed[i], len(output)))
|             output += parsed[i+1]
|         return tags_with_pos, output
|
| robertlagrant wrote: | > the USPTO's idiosyncratic usage of non-obviousness doesn't | really match my intuition | | Remember that USPTO gets paid for each patent application, | and not penalised when it's later falsified. | rbehrends wrote: | Well, it was apparently upheld twice on reexamination, | where they could have fixed that. The problem is more that | the bar for non-obviousness is so low, it's basically on | the floor. Paired with a discipline (software development), | where independent reinvention is common, this is just a | recipe for disaster. | Karellen wrote: | > it implies the relevant tool is _the disk format for every | Office file_ | | Does it imply that? | | Another commenter has already pointed out why it's likely not | the case. | | But also, I don't think the article is well written. Partly | because it doesn't clearly explain what the infringing tool | was, or did, or how it operated. Also I'm pretty sure there's a | typo in "ex part" instead of "ex parte". 
But another major | issue is the following: | | > $40 million of that judgment [against Microsoft] was imposed | by the court as punishment for continually arguing that i4i was | a patent troll even though it had an operating business in a | manner that was "persistent, legally improper, and in direct | violation of the Court's instructions." | | What? | | Why would i4i operating in a manner that was persistent, | improper and in violation of the court's instructions preclude | it from also being a patent troll? It could do both? | | Or is the "persistent..." descriptor meant to apply to | Microsoft? That might make more sense, but the "even though" | seems to be a comparison between two types of activity by one | entity - namely i4i. | | But then again, I might be reading "it had an operating | business in a manner" wrong, because it feels ungrammatical to | me. I might not be putting the emphasis in the right place, and | that's what's causing me to misread the sentence? | | The whole thing just feels confusing. | ejz wrote: | Thanks for reading. Sorry if this was confusing! Microsoft | said that i4i was a patent troll despite the court repeatedly | telling Microsoft to not do that. The judge referred to | Microsoft's repeated ignoring of its instructions as | "persistent" etc. i4i had an operating business; it wasn't a | patent troll. That operating business is niche and small, but | it is real. I have updated that sentence to make it clearer. | Thanks for your feedback! | ackfoobar wrote: | Depends on one's definition. I don't think "not having a | real product/service" is the defining characteristic of | "patent troll". Here's what Wikipedia says. | | > attempts to enforce patent rights against accused | infringers far beyond the patent's actual value or | contribution to the prior art | | > often do not manufacture products or supply services | based upon the patents in question | ejz wrote: | This isn't what Joel is talking about here. 
| | On the backend, all .docx files use XML. Joel is saying the | root XML format was difficult to work with. | | What my article is about is this: Microsoft used to allow users | to write their own custom XML rules on top of Word. (This was | mostly app developers using XML for macros rather than end | users, and overall it was very rare.) This is the feature that | was at issue with the patent. | | Sorry if this was not clear! | jkaptur wrote: | Thanks for clarifying! | Jtsummers wrote: | > Joel is saying the root XML format was difficult to work | with. | | Joel wasn't writing about the XML version of MS Office | documents, he was writing about the binary versions. | londons_explore wrote: | Anyone got a screenshot of this feature? | dbavaria wrote: | See here: https://learn.microsoft.com/en- | us/office/troubleshoot/word/c... | jandrese wrote: | > Indeed, as you work on your Excel clone, you'll discover all | kinds of subtle details about date handling. When does Excel | convert numbers to dates? How does the formatting work? Why is | 1/31 interpreted as January 31 of this year, while 1/50 is | interpreted as January 1st, 1950? All of these subtle bits of | behavior cannot be fully documented without writing a document | that has the same amount of information as the Excel source code. | | A quick note to anybody building an Excel clone: If you want to | turn this insane date handling behavior of Excel into an optional | feature that can be disabled, everybody will appreciate it. | atoav wrote: | I always wondered why they won't just make it a popup button? | | Default should be to not change anything, if a date is | recognized offer a button right next to the cell that allows | you to accept the suggestion to turn it into a fully fledged | date. Just make it so that pressing tab or shift enter or a | similar combination accepts that suggestion. 
| xigoi wrote: | https://xkcd.com/1172/ | numpad0 wrote: | Just do https://xkcd.com/927/, happened once and it was | okay. | xigoi wrote: | What comes after .docx? .docxx? .docy? .docxi? | jimmaswell wrote: | docxEx, in Win32 fashion. | WirelessGigabit wrote: | Scientists will thank you: | https://www.theverge.com/2020/8/6/21355674/human-genes-renam... | qclibre22 wrote: | > Scientists will thank you | | Scientists gave up and changed the gene names: | https://duckduckgo.com/?q=excel+gene+names+changed+septin1 | jahav wrote: | It's also country specific. | | I work on an Excel library and the text to number/date feature was | one of the less fun things to implement at least semi-correctly. | | I remember my comment on the PR back then: | | https://github.com/ClosedXML/ClosedXML/pull/1899 | pjungwir wrote: | If only someone had filed a patent that blocked Word from | inserting curly quotes the wrong way, like '449. | willcipriano wrote: | The United States is #1 for protection of intellectual property | in the world according to the property rights index: | https://www.internationalpropertyrightsindex.org/ | | Real property on the other hand? The US is ranked 14th. | breakfastduck wrote: | I'm completely baffled as to how it's allowed to get a patent on | stuff like this. | | Can I patent sending REST requests using JSON? | empath-nirvana wrote: | No, that's not how it works, you can't patent a specific | technology that's already been invented, what you do is wait | for a new technology to be invented and then patent doing some | obvious thing with the new technology. | | Like, a good patent today would be: "Using a computer text | prediction engine to automatically review and approve code." | | It probably would have been pretty smart to skim through all | the hacker news threads after ChatGPT came out patenting every | other comment. 
| donatj wrote: | XML editing had already been invented | jahav wrote: | You can try to patent anything, but patent might not be | accepted. | | The thing is that patent office is funded by patent fees, so | there is an incentive to accept the patent plus they are often | hard to read. | svachalek wrote: | What I understand of US law is that there's very little in | the way of filing a patent. It's not really tested until | someone challenges it. | lucozade wrote: | You could apply for that patent but I would expect it to be | rejected due to prior art i.e. someone came up with it before | you. Even if it was accepted, if you tried to enforce it, it'd | definitely be challenged on prior art and you would very likely | lose because it wouldn't be hard to prove you weren't the first. | | Now, why this particular patent exists, and seems so general, | is also likely related to prior art. What could be patented for | software was a bit murky until the late 1990's when it was | established that business methods implemented in software were | allowed. This led to a large flood of patents in that space. | | One of the issues is that the Patent Office tends to look at | prior art as being "things that have already been patented" so | when rules change, a lot of things that seem obvious are up for | grabs because there's no prior patent. Now, these can be (and are) | challenged in court and, in court, they're more likely to | accept blatant prior usage in the wild. I don't know whether | this case won its challenge but it's possible that it didn't | because XML was quite new in the late 90s too. | | Source: I have a patent from around that time that basically | covers anything in finance that's data driven from an XML | document. For about a decade, that covered a fairly large chunk | of finance. I never did anything about it as I disagreed in | principle with the premise of such an absurdly broad patent. 
I | agreed to it being patented solely for defensive reasons ie it | might prevent a competitor from egregiously attacking my | employer with patents. | renewiltord wrote: | If you're sufficiently creative, certainly. Some of my friends | patented something totally absurd: there's a transformation you | can easily do in software and lots of software does it quite | routinely. They did it twice. Patent issued. | yarone wrote: | A classic Joel on Software article about funny backwards | compatibility built into Excel: | https://www.joelonsoftware.com/2006/06/16/my-first-billg-rev... | Macha wrote: | I realised there is more time between now and that article, | than there is between that article and the events described | within. | FpUser wrote: | Looking at patent abstract [0] it basically patents separation of | information and structure. The latter can be used to present | information in various ways. | | My take is that it is fucking obvious and I just simply do not | believe that the concept did not have prior art. It just shows | what a crooked business this whole modern patent system is. | | [0] - "A system and method for the separate manipulation of the | architecture and content of a document, particularly for data | representation and transformations. The system, for use by | computer software developers, removes dependency on document | encoding technology. A map of metacodes found in the document is | produced and provided and stored separately from the document. | The map indicates the location and addresses of metacodes in the | document. The system allows of multiple views of the same | content, the ability to work solely on structure and solely on | content, storage efficiency of multiple versions and efficiency | of operation." | ClearDayDev wrote: | I've not read the patent, but it's definitely inaccurate to say | "Microsoft removed custom XML from Word." 
It's still possible to | create custom XML parts programmatically, and I suspect it's | quite commonly done. Also, I just checked, and Microsoft 365 has | a custom XML mapping tool on the developer tab. So it would be | interesting to know how Microsoft complied with the judgment and | the subsequent history of the feature. ___________________________________________________________________ (page generated 2023-10-12 21:00 UTC)