[HN Gopher] USPTO to add surcharge on non-DOCX patent applicatio...
___________________________________________________________________
USPTO to add surcharge on non-DOCX patent applications in 2023

Author : zinekeller
Score  : 100 points
Date   : 2022-08-26 13:50 UTC (9 hours ago)

(HTM) web link (unblock.federalregister.gov)
(TXT) w3m dump (unblock.federalregister.gov)

| NotYourLawyer wrote:
| Can't wait to see what kinds of unredacted metadata people start uploading without a thought.
| nfriedly wrote:
| It sounds like they're going to try to catch and remove that - one of the bullet points under DOCX Benefits on https://www.uspto.gov/patents/docx reads:
|
| > _Privacy: provides automatic metadata detection (e.g. author and comments) and removal features to support the submission of only substantive information in the DOCX file._
|
| And then, further down in the FAQ, it says:
|
| > _What happens to the metadata in DOCX files?_
|
| > _Metadata is generally removed by applicants prior to submission. However, if metadata is found during the validation process, it is automatically removed prior to submission. Examples of metadata include author, company, last modified by, etc. The only information that is preserved is the size, page count, and word count._
|
| > _Outgoing DOCX documents (i.e. Office actions) from the USPTO to applicants will also have metadata removed._
| lesuorac wrote:
| I don't think you submitted the link you wanted to.
|
| I see https://unblock.federalregister.gov/ which doesn't take me anywhere useful.
| bsimpson wrote:
| That explains why I get a 503 every time I pass the captcha.
| zinekeller wrote:
| Sorry, I thought that the second time (!) I submitted this, the HN filter wouldn't mess up the link. Here it is (https://www.federalregister.gov/documents/2022/04/28/2022-09...), hopefully!
|
| In case the overaggressive HN filter ate it again, it's https www-federalregister-gov documents/2022/04/28/2022-09027/filing-patent-applications-in-docx-format
|
| Also dang, recently HN's filters are too aggressive - from sentence-casing USPTO (to Uspto which isn't even a pronounceable acronym) to removing Twitter's search queries.
| altairprime wrote:
| You should email dang using the footer Contact link about this thread, so that he sees it.
| nazgulsenpai wrote:
| I was wondering why docx format would be chosen instead of PDF but they answer it pretty completely here if anyone else is interested: https://www.uspto.gov/patents/docx
| [deleted]
| jedberg wrote:
| It looks like most of those translate to "We build our automated systems around DOCX so you get all our features if you use it".
|
| But it doesn't really say why they chose to build on docx.
| bragr wrote:
| > But it doesn't really say why they chose to build on docx.
| Is requiring the DOCX format just adding another step in the process for applicants?
| Actually, it's the opposite. The USPTO conducted a study and found that over 80% of applicants are authoring their applications in DOCX format (through writing tools such as Microsoft Word). Because the files are originally in a DOCX format, uploading the original file eliminates the step for the applicant to convert the document to PDF prior to submission. Instead, the applicant is able to save the step of converting because our system will do that automatically.
| apocalyptic0n3 wrote:
| > But it doesn't really say why they chose to build on docx.
|
| Having worked directly with their teams in the past (although not on this), a lot of their systems seemed to evolve naturally over time based on the needs present. In that industry, a large majority of the documents being passed back and forth are DOCX.
| So my semi-educated guess is someone built a system to handle some simple intake tasks for DOCX applications because a large majority already were, it evolved over a few years, and when they finally decided to fully automate the process, they decided to build upon what they had, which only supported DOCX, and it was cheaper/easier to mandate everyone submit in that format than to build a new system or add support for others.
| ramoz wrote:
| I get that you worked with them, it seems, but I would argue that your hunch is wrong here.
|
| Regulatory processes, business systems, and international integration are plagued by PDF OCR complexities. OCR creates systemic issues and an anatomy of complex system architectures. I'm sure XML is a typical downstream for parsing anyways. Use DOCX to enhance quality of the overall scope of integrations.
| apocalyptic0n3 wrote:
| They could just use standardized application forms the way they do for research reports they require (the "ISA ###" forms). Those forms are easily parsable by things like pdftk and don't require any OCR.
|
| I don't necessarily disagree with your point (since it makes complete sense), just wanting to point out that they already have a system in place for this using other means (although even there they are moving toward XML instead, likely because of what a pain it is to deal with text that exceeds the area of the input in PDFs)
| jedberg wrote:
| This makes the most sense.
| joshstrange wrote:
| As a few others have mentioned, the parsing alone means DOCX is a huge win over PDF. I had to parse a bunch of PDF data related to COVID and it was always a PITA. Every time they changed their layout even a little bit I had to rewrite parts of my extractor. The worst part? The headers/metadata showed it was all made in Word so they could have exported to DOCX as well as PDF if they wanted to but they only provided PDF.
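
To make the parsing argument concrete, here is a minimal Python sketch, not anything the USPTO is known to run. It relies on two facts about the format: a DOCX file is a ZIP archive, and the document body lives in the archive member word/document.xml as WordprocessingML. The function names docx_paragraphs and make_demo_archive are made up for this illustration, and the demo archive is deliberately stripped down (it carries only word/document.xml, so it is not a complete DOCX that Word would open).

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(source):
    """Return the plain text of each paragraph in a DOCX file.

    `source` may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across one or more <w:t> elements.
        runs = (node.text or "" for node in para.iter(f"{{{W_NS}}}t"))
        paragraphs.append("".join(runs))
    return paragraphs


def make_demo_archive(lines):
    """Hypothetical helper: build a stripped-down archive for the demo.

    It contains only word/document.xml, which is all docx_paragraphs()
    needs; a real DOCX also carries content-type and relationship parts.
    """
    body = "".join(
        f"<w:p><w:r><w:t>{escape(line)}</w:t></w:r></w:p>" for line in lines
    )
    document = (
        f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("word/document.xml", document)
    return buffer.getvalue()


if __name__ == "__main__":
    demo = make_demo_archive(["1. A widget comprising a frame.", "Fig. 1 & Fig. 2"])
    for text in docx_paragraphs(io.BytesIO(demo)):
        print(text)
```

The sketch only pulls paragraph text; tables, footnotes, and tracked changes need more handling. The contrast with PDF is that none of this depends on page layout, so a cosmetic change to the document does not break the extractor.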
| meragrin_ wrote:
| I guess you have little exposure to the industry. My experience is the vast majority already use Word or something else which supports DOCX. I cannot think of another format which practitioners have easy access to and would use. PDF just needs to go away for this process.
| jedberg wrote:
| I actually do have a lot of exposure to patents, and I know everyone uses DocX already. I'm just saying that that web page doesn't say why _they_ chose DocX, only why you should _use_ DocX.
| [deleted]
| [deleted]
| meragrin_ wrote:
| Sure it does. I'll give you it does not say why they chose it over other alternatives which I'm thinking is what you are looking for. Are there really any alternatives? The only real alternative I can think of is OpenDocument Format and I don't consider it an alternative. As they say on that page, 80% of their users already deal with DOCX so 80+% of them will have to convert to ODF. I can't imagine ODF having any sort of benefit worth requiring most people to convert their documents before sending.
| nescioquid wrote:
| To me, the salient question is why is the government officially adopting a proprietary file format? Why is it important to optimize for the trivial convenience of patent applicants?
|
| It seems more like rationalization than reason.
| dataflow wrote:
| It actually seems like a sane choice to me. PDF is good for rendering, but horrible for parsing. DOCX is a ZIP file with XML data. Maybe ODT or whatever would've been a better choice, I don't know what the format is like. But if you disregard the usual knee-jerk "but it's Microsoft!" reaction, it doesn't seem like a bad choice.
| ndiddy wrote:
| The Office Open XML file format is extremely complex, and takes up around 6,500 pages (compared to ~1000 for ODF). One thing you notice when reading the DOCX spec is that they designed it with the sole constraint that DOC files could easily be converted to DOCX.
| For example, you'll frequently see compatibility tags like "autoSpaceLikeWord95", "footnoteLayoutLikeWW8", "useWord2002TableStyleRules", and "lineWrapLikeWord6" that expose internal implementation details. Rather than creating a useful standard allowing all users to store their documents in a clean, portable way, Microsoft decided to make their standard faithfully reproduce all of the quirks and bugs of their legacy binary formats. It's so difficult to correctly implement the Office Open XML standard that even Microsoft took until Office 2013 to do so (the standard was approved in 2006).
| notriddle wrote:
| > "autoSpaceLikeWord95", "footnoteLayoutLikeWW8", "useWord2002TableStyleRules", and "lineWrapLikeWord6"
|
| I expect that whatever tooling the USPTO uses can probably just ignore those things. They're extracting metadata, not actually rendering it.
| dataflow wrote:
| Interesting! How do they compare feature-wise? I feel like there must be things each of them supports that the other one doesn't, but I don't know how consequential they are.
| Kye wrote:
| I don't know how thorough or accurate it is, but Microsoft has a list.
|
| https://support.microsoft.com/en-us/office/differences-betwe...
|
| The list was only a few lines the last time I looked years ago, so maybe they're actually trying to make a complete list.
| not2b wrote:
| I think they pretty much had to do that to preserve the formatting of existing documents for users who are force-upgraded by their employers to new Office versions. But it seems a scraper that just wants the information in the document can ignore almost all of those tags.
|
| edit: ninja'd.
| jfk13 wrote:
| So true. "It's XML, so it must be easy to parse and manipulate" is such a naive, even misleading attitude. If what you do is take a byzantine, legacy-encrusted implementation and just serialise its data structures to an XML representation, very little has been gained.
|
| [edit: but I will grant that almost anything is better than attempting to parse useful content from PDF.]
| fezfight wrote:
| I think the knee-jerk is against any alternative to Office, not against Office. Statistically speaking, trying to use anything reasonable that doesn't genuflect to Microsoft's monopoly is what seems to be met with a knee-jerk reply such as yours. As in, there's probably more people who don't care but hate the complaining about libre stuff than there are advocates for libre stuff.
| molsongolden wrote:
| Was just about to post this. Unzipping DOCX and parsing XML is much easier than accurately processing PDF submissions.
| oneplane wrote:
| OCR has come a long way, so much that visually interpreting a PDF is about as error-prone as parsing XML output from Microsoft in non-Microsoft software.
| programmarchy wrote:
| Try extracting tabular data from a PDF! With XML it's trivial, but for PDF you need highly specialized software packages to do this. One of the best, pdfplumber, is largely based [1] on a Master's thesis titled Algorithmic Extraction of Data in Tables in PDF Documents [2].
|
| [1] https://github.com/jsvine/pdfplumber/blob/stable/pdfplumber/...
|
| [2] https://trepo.tuni.fi/bitstream/handle/123456789/21520/Nurmi...
| jedberg wrote:
| I never said it was a bad choice, only that they didn't say why they chose it.
| dataflow wrote:
| I guess, but you were replying to "I was wondering why docx format would be chosen instead of PDF" seemingly unconvinced, so I assumed you thought PDF would've made more sense.
| Noted wrote:
| Nice to see they call out LibreOffice as a usable application as well.
| AdmiralAsshat wrote:
| INSTEAD of Open Office, no less!
| hedora wrote:
| They say that 80% of the submissions used to be converted to PDF from Word. I'd be interested to know where the other 20% came from.
| meragrin_ wrote:
| There's this one guy I've dealt with.
| He uses an editor he wrote himself. He'll convert his documents to Pages and then use Pages for any other conversion needed.
| NotYourLawyer wrote:
| I'm guessing Google Docs is the next most common, and then probably LibreOffice.
| meragrin_ wrote:
| My experience is a large distrust of cloud environments. I would expect Pages and Libre/Open Office to be more common than Google Docs.
| NotYourLawyer wrote:
| Oh right, I bet Pages is up there. I dunno though--I know lots of patent attorneys who are surprisingly non-technical and probably haven't given a thought to cloud security issues.
| bonyt wrote:
| My guess is that printed and scanned documents from Word make up a large component of this.
| pavon wrote:
| Do patent attorneys love Word Perfect as much as some other people in the legal profession?
| deathanatos wrote:
| > _Due to aggressive automated scraping of FederalRegister.gov and eCFR.gov, programmatic access to these sites is limited to access to our extensive developer APIs._
|
| Apparently. Then a captcha and a button to request access, which if you complete, returns a 500 Internal Server Error.
|
| ... my tax dollars are _hard_ at work, I see.
|
| The Wayback Machine hasn't got a snapshot, either, it seems.
| zevra wrote:
| It's apparently the wrong link, see: https://news.ycombinator.com/item?id=32609165
|
| The correct one is: https://www.federalregister.gov/documents/2022/04/28/2022-09...
| batmaniam wrote:
| Seriously, I've had my share of horror stories on government systems. Do they just not dogfood their own product? Where is their QA team? It's atrocious how basic tasks are so broken all the time.
| ceeplusplus wrote:
| Government can't afford to hire the competent talent, only the scraps after everyone else (even the consulting bodyshops) is done. The top GS pay bracket is lower than what entry-level engineers make at many companies (not just FAANG, but also defense, F500 companies, etc.)
| CobrastanJorji wrote:
| One of the many things that I found tempting about working for the U.S. Digital Service was that, while the GS-15 pay grade is definitely way less than I'd make in the private sector, my spouse's family is military/government and the difference between "hippy programming thingy" and "has a GS-15/O-6 job" would've been night and day. The one puts me in a pile of stereotypes, but the other says "oh, he's basically the bureaucratic equivalent of a Captain, that's very respectable."
| ceeplusplus wrote:
| Different circles I guess. While my spouse's family considers government jobs to be stable and somewhat respectable, there is a lot more respect for FAANG and other high paying jobs. One is respectable, the other is prestigious.
| nullc wrote:
| Machine learning is coming for the examiners' jobs. :P
| apocalyptic0n3 wrote:
| I've been working on some tools that integrate with USPTO (both from the application side and the validation side) for quite a few years now and they've been making a TON of formatting changes recently. A lot of their PDF forms have changed, they're requiring XML versions of all data we submit, they're handling classifications differently, etc. Their process always felt like it was stuck in the past and being handled manually by humans before and now it feels like they're moving everything toward automated intake and initial reviews. I imagine this change is for the same reasons, and that's a hefty fee to force it.
|
| This also likely means I will need to rework our systems to spit out docx instead of/in addition to PDFs, which will be a nightmare to do. So that's fun.
| MrLeap wrote:
| The consolation is that, if I remember correctly, docx is just a zip file containing xml.
|
| I made an xlsx exporter in actionscript3 (lol) years ago and it worked like this.
| What I ultimately did was make a "template" document, and my code just injected strings into key spots, zipped it up in memory and gave it to you as a file.xlsx. Probably took me 3 days?
|
| I didn't have the benefit of libraries so I imagine this is significantly easier in less hobbled environments, nodejs or whatever probably has a kitchen sink package to do it.
| lofatdairy wrote:
| That's exactly right. There are definitely nodejs docx templating packages (I've worked on codebases that used them in the past), but they're certainly not required provided your documents are reasonably simple.
|
| If anything, generating a pdf from various input files/structured text has been a much harder task. We generated docx files to allow for easy modification by non-technical staff, but to generate a pdf we had to use a headless instance of libreoffice since pandoc was struggling with the rendering.
| peteradio wrote:
| I'm a masochist in need of work.
| aaaddaaaaa1112 wrote:
| Ironlink wrote:
| In case anyone is looking for the size of the surcharge, I found it in the last row of this table: https://www.federalregister.gov/d/2020-16559/p-555
| colejohnson66 wrote:
| So, $100-400 depending on the size (CFR section 1.16(u)). That feels... excessive, but if processing a DOCX is automatic and PDFs require humans (I'm assuming), it makes sense.
| nfriedly wrote:
| Note: the correct link is https://www.federalregister.gov/documents/2022/04/28/2022-09...
|
| The fee is $100, $200, or $400 depending on the size of the document.
| emgeee wrote:
| The fee is $400
___________________________________________________________________
(page generated 2022-08-26 23:00 UTC)