hngopher.com

       [HN Gopher] USPTO to add surcharge on non-DOCX patent applicatio...
       ___________________________________________________________________
        
       USPTO to add surcharge on non-DOCX patent applications in 2023
        
       Author : zinekeller
       Score  : 100 points
       Date   : 2022-08-26 13:50 UTC (9 hours ago)
        
 (HTM) web link (unblock.federalregister.gov)
 (TXT) w3m dump (unblock.federalregister.gov)
        
       | NotYourLawyer wrote:
       | Can't wait to see what kinds of unredacted metadata people start
       | uploading without a thought.
        
         | nfriedly wrote:
         | It sounds like they're going to try to catch and remove that -
         | one of the bullet points under DOCX Benefits on
         | https://www.uspto.gov/patents/docx reads:
         | 
         | > _Privacy: provides automatic metadata detection (e.g. author
         | and comments) and removal features to support the submission of
         | only substantive information in the DOCX file._
         | 
         | And, then further down in the FAQ it says:
         | 
         | > _What happens to the metadata in DOCX files?_
         | 
         | > _Metadata is generally removed by applicants prior to
         | submission. However, if metadata is found during the validation
         | process, it is automatically removed prior to submission.
         | Examples of metadata include author, company, last modified by,
         | etc. The only information that is preserved is the size, page
         | count, and word count._
         | 
         | > _Outgoing DOCX documents (i.e. Office actions) from the USPTO
         | to applicants will also have metadata removed._
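
A rough sketch of what such a metadata check could look like, assuming the author fields sit in the `docProps/core.xml` part of the DOCX archive (where Word normally writes them). This is not the USPTO's validator: the function names and the list of fields treated as identifying are made up for illustration, and fields such as "company" live in a different part (`docProps/app.xml`) that this sketch does not touch.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

CORE_PART = "docProps/core.xml"
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
}
# Fields treated as identifying; this list is an assumption for illustration.
SENSITIVE = ("dc:creator", "cp:lastModifiedBy")


def find_metadata(docx_bytes: bytes) -> dict:
    """Return the non-empty identifying fields found in docProps/core.xml."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        if CORE_PART not in zf.namelist():
            return {}
        root = ET.fromstring(zf.read(CORE_PART))
    found = {}
    for tag in SENSITIVE:
        node = root.find(tag, NS)
        if node is not None and node.text:
            found[tag] = node.text
    return found


def strip_metadata(docx_bytes: bytes) -> bytes:
    """Copy the archive part by part, blanking the identifying fields."""
    src = zipfile.ZipFile(io.BytesIO(docx_bytes))
    out = io.BytesIO()
    with src, zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == CORE_PART:
                root = ET.fromstring(data)
                for tag in SENSITIVE:
                    node = root.find(tag, NS)
                    if node is not None:
                        node.text = ""
                data = ET.tostring(root, encoding="utf-8", xml_declaration=True)
            dst.writestr(item.filename, data)
    return out.getvalue()
```

Calling `find_metadata` before upload shows what would leak; `strip_metadata` returns a new archive and leaves every other part byte-for-byte as it was read.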
        
       | lesuorac wrote:
       | I don't think you submitted the link you wanted to.
       | 
       | I see https://unblock.federalregister.gov/ which doesn't take me
       | anywhere useful.
        
         | bsimpson wrote:
         | That explains why I get a 503 every time I pass the captcha.
        
         | zinekeller wrote:
         | Sorry, I thought the second time (!) I've submitted this that
         | the HN filter didn't mess up the link. Here it is (https://www.
         | federalregister.gov/documents/2022/04/28/2022-09...),
         | hopefully!
         | 
         | In case the overaggressive HN filter ate it again, it's
         | https www-federalregister-gov
         | documents/2022/04/28/2022-09027/filing-patent-applications-in-
         | docx-format
         | 
         | Also dang, recently HN's filters are too aggressive - from
         | sentence-casing USPTO (to Uspto which isn't even a
         | pronounceable acronym) to removing Twitter's search queries.
        
           | altairprime wrote:
           | You should email dang using the footer Contact link about
           | this thread, so that he sees it.
        
       | nazgulsenpai wrote:
       | I was wondering why docx format would be chosen instead of PDF
       | but they answer it pretty completely here if anyone else is
       | interested: https://www.uspto.gov/patents/docx
        
         | [deleted]
        
         | jedberg wrote:
         | It looks like most of those translate to "We build our
         | automated systems around DOCX so you get all our features if
         | you use it".
         | 
         | But it doesn't really say why they chose to build on docx.
        
           | bragr wrote:
           | > But it doesn't really say why they chose to build on docx.
           | 
           | > _Is requiring the DOCX format just adding another step in the
           | process for applicants?_
           | 
           | > _Actually, it's the opposite. The USPTO conducted a study and
           | found that over 80% of applicants are authoring their
           | applications in DOCX format (through writing tools such as
           | Microsoft Word). Because the files are originally in a DOCX
           | format, uploading the original file eliminates the step for
           | the applicant to convert the document to PDF prior to
           | submission. Instead, the applicant is able to save the step of
           | converting because our system will do that automatically._
        
           | apocalyptic0n3 wrote:
           | > But it doesn't really say why they chose to build on docx.
           | 
           | Having worked directly with their teams in the past (although
           | not on this), a lot of their systems seemed to evolve
           | naturally over time based on the needs present. In that
           | industry, a large majority of the documents being passed back
           | and forth are DOCX. So my semi-educated guess is someone
           | built a system to handle some simple intake tasks for DOCX
           | applications because a large majority already were, it
           | evolved over a few years, and when they finally decided to
           | fully automate the process, they decided to build upon what
           | they have which only supported DOCX and it was cheaper/easier
           | to mandate everyone submit in that format than to build a new
           | system or add support for others.
        
             | ramoz wrote:
             | I get that you worked with them it seems, but would argue
             | that your hunch is wrong here.
             | 
             | Regulatory processes, business systems, and international
             | integration are plagued by PDF OCR complexities. OCR
             | creates systemic issues and an anatomy of complex system
             | architectures. I'm sure XML is a typical downstream for
             | parsing anyways. Use DOCX to enhance quality of the overall
             | scope of integrations.
        
               | apocalyptic0n3 wrote:
               | They could just use standardized application forms the
               | way they do for research reports they require (the "ISA
               | ###" forms). Those forms are easily parsable by things
               | like pdftk and don't require any OCR.
               | 
               | I don't necessarily disagree with your point (since it
               | makes complete sense), just wanting to point out that
               | they already have a system in place for this using other
               | means (although even there they are moving toward XML
               | instead, likely because of what a pain it is to deal with
               | text that exceeds the area of the input in PDFs)
        
             | jedberg wrote:
             | This makes the most sense.
        
           | joshstrange wrote:
           | As a few others have mentioned, the parsing alone means DOCX
           | is a huge win over PDF. I had to parse a bunch of PDF data
           | related to COVID and it was always a PITA. Every time they
           | changed their layout even a little bit I had to rewrite parts
           | of my extractor. The worst part? The headers/metadata showed
           | it was all made in Word so they could have exported to DOCX
           | as well as PDF if they wanted to but they only provided PDF.
        
           | meragrin_ wrote:
           | I guess you have little exposure to the industry. My
           | experience is the vast majority already use Word or something
           | else which supports DOCX. I cannot think of another format
           | which practitioners have easy access to and would use. PDF
           | just needs to go away for this process.
        
             | jedberg wrote:
             | I actually do have a lot of exposure to patents, and I know
             | everyone uses DocX already. I'm just saying that that web
             | page doesn't say why _they_ chose DocX, only why you should
             | _use_ DocX.
        
               | [deleted]
        
               | [deleted]
        
               | meragrin_ wrote:
               | Sure it does. I'll give you it does not say why they
               | chose it over other alternatives which I'm thinking is
               | what you are looking for. Are there really any
               | alternatives? The only real alternative I can think of is
               | OpenDocument Format and I don't consider it an alternative.
               | As they say on that page, 80% of their users already deal
               | with DOCX so 80+% of them will have to convert to ODF. I
               | can't imagine ODF having any sort of benefit worth
               | requiring most people to convert their documents before
               | sending.
        
               | nescioquid wrote:
               | To me, the salient question is why is the government
               | officially adopting a proprietary file format? Why is it
               | important to optimize for the trivial convenience of
               | patent applicants?
               | 
               | It seems more like rationalization than reason.
        
           | dataflow wrote:
           | It actually seems like a sane choice to me. PDF is good for
           | rendering, but horrible for parsing. DOCX is a ZIP file with
           | XML data. Maybe ODT or whatever would've been a better
           | choice, I don't know what the format is like. But if you
           | disregard the usual knee-jerk "but it's Microsoft!" reaction,
           | it doesn't seem like a bad choice.
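
To make the "ZIP file with XML data" point concrete, here is a minimal sketch that pulls paragraph text out of a DOCX using only the Python standard library. The part name `word/document.xml` and the `w:` namespace are the usual WordprocessingML ones; `docx_paragraphs` and `make_demo_docx` are hypothetical helper names, and the demo archive holds just one part, so Word itself would not open it.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(docx_bytes: bytes) -> list[str]:
    """Return the text of each paragraph in word/document.xml."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(W + "p"):
        # A paragraph's text is split across runs; each run holds <w:t> nodes.
        paragraphs.append("".join(t.text or "" for t in p.iter(W + "t")))
    return paragraphs


def make_demo_docx(lines: list[str]) -> bytes:
    """Build a stripped-down archive with only the main document part."""
    body = "".join(
        f"<w:p><w:r><w:t>{escape(line)}</w:t></w:r></w:p>" for line in lines
    )
    xml = (
        '<w:document xmlns:w="http://schemas.openxmlformats.org/'
        'wordprocessingml/2006/main">'
        f"<w:body>{body}</w:body></w:document>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", xml)
    return buf.getvalue()
```

For example, `docx_paragraphs(make_demo_docx(["Claim 1", "Claim 2"]))` gives back the two strings in order. Styles, numbering and tracked changes are ignored entirely here, which is the part a real intake system would have to deal with.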
        
             | ndiddy wrote:
             | The Office Open XML file format is extremely complex, and
             | takes up around 6,500 pages (compared to ~1000 for ODF).
             | One thing you notice when reading the DOCX spec is that
             | they designed it with the sole constraint that DOC files
             | could easily be converted to DOCX. For example, you'll
             | frequently see compatibility tags like
             | "autoSpaceLikeWord95", "footnoteLayoutLikeWW8",
             | "useWord2002TableStyleRules", and "lineWrapLikeWord6" that
             | expose internal implementation details. Rather than
             | creating a useful standard allowing all users to store
             | their documents in a clean, portable way, Microsoft decided
             | to make their standard faithfully reproduce all of the
             | quirks and bugs of their legacy binary formats. It's so
             | difficult to correctly implement the Office Open XML
             | standard that even Microsoft took until Office 2013 to do
             | so (the standard was approved in 2006).
        
               | notriddle wrote:
               | > "autoSpaceLikeWord95", "footnoteLayoutLikeWW8",
               | "useWord2002TableStyleRules", and "lineWrapLikeWord6"
               | 
               | I expect that whatever tooling the USPTO uses can
               | probably just ignore those things. They're extracting
               | metadata, not actually rendering it.
        
               | dataflow wrote:
               | Interesting! How do they compare feature-wise? I feel
               | like there must be things each of them support that the
               | other one doesn't, but I don't know how consequential
               | they are.
        
               | Kye wrote:
               | I don't know how thorough or accurate it is, but
               | Microsoft has a list.
               | 
               | https://support.microsoft.com/en-us/office/differences-
               | betwe...
               | 
               | The list was only a few lines the last time I looked
               | years ago, so maybe they're actually trying to make a
               | complete list.
        
               | not2b wrote:
               | I think they pretty much had to do that to preserve the
               | formatting of existing documents for users who are force-
               | upgraded by their employers to new Office versions. But
               | it seems a scraper that just wants the information in the
               | document can ignore almost all of those tags.
               | 
               | edit: ninja'd.
        
               | jfk13 wrote:
               | So true. "It's XML, so it must be easy to parse and
               | manipulate" is such a naive, even misleading attitude. If
               | what you do is take a byzantine, legacy-encrusted
               | implementation and just serialise its data structures to
               | an XML representation, very little has been gained.
               | 
               | [edit: but I will grant that almost anything is better
               | than attempting to parse useful content from PDF.]
        
             | fezfight wrote:
             | I think the knee-jerk is against any alternative to Office,
             | not against Office. Statistically speaking, trying to use
             | anything reasonable that doesn't genuflect to Microsoft's
             | monopoly is what seems to be met with a knee-jerk reply
             | such as yours. As in, there's probably more people who
             | don't care but hate the complaining about libre stuff than
             | there are advocates for libre stuff.
        
             | molsongolden wrote:
             | Was just about to post this. Unzipping DOCX and parsing XML
             | is much easier than accurately processing PDF submissions.
        
               | oneplane wrote:
               | OCR has come a long way, so much that visually
               | interpreting a PDF is about as error-prone as parsing XML
               | output from Microsoft in non-Microsoft software.
        
               | programmarchy wrote:
               | Try extracting tabular data from a PDF! With XML it's
               | trivial, but for PDF you need highly specialized software
               | packages to do this. One of the best, pdfplumber, is
               | largely based [1] on a Master's thesis titled Algorithmic
               | Extraction of Data in Tables in PDF Documents [2].
               | 
               | [1] https://github.com/jsvine/pdfplumber/blob/stable/pdfp
               | lumber/...
               | 
               | [2] https://trepo.tuni.fi/bitstream/handle/123456789/2152
               | 0/Nurmi...
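
For comparison, a sketch of the XML side of that claim: reading the first table out of a `word/document.xml` string with the standard library. The `w:tbl` / `w:tr` / `w:tc` element names are the standard WordprocessingML ones; `SAMPLE` and `first_table` are made up for illustration, and nested tables or rows wrapped in content controls are not handled.

```python
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

# Hypothetical two-by-two table as it might appear inside word/document.xml.
SAMPLE = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/'
    'wordprocessingml/2006/main">'
    "<w:body><w:tbl>"
    "<w:tr><w:tc><w:p><w:r><w:t>Claim</w:t></w:r></w:p></w:tc>"
    "<w:tc><w:p><w:r><w:t>Status</w:t></w:r></w:p></w:tc></w:tr>"
    "<w:tr><w:tc><w:p><w:r><w:t>1</w:t></w:r></w:p></w:tc>"
    "<w:tc><w:p><w:r><w:t>Allo</w:t></w:r><w:r><w:t>wed</w:t></w:r></w:p></w:tc></w:tr>"
    "</w:tbl></w:body></w:document>"
)


def first_table(document_xml: str) -> list[list[str]]:
    """Return the first table as a list of rows, each a list of cell strings."""
    root = ET.fromstring(document_xml)
    table = root.find(f".//{W}tbl")
    if table is None:
        return []
    rows = []
    for row in table.findall(f"{W}tr"):
        cells = []
        for cell in row.findall(f"{W}tc"):
            # Cell text may be split across several runs; join the pieces.
            cells.append("".join(t.text or "" for t in cell.iter(f"{W}t")))
        rows.append(cells)
    return rows


print(first_table(SAMPLE))  # [['Claim', 'Status'], ['1', 'Allowed']]
```

The row and cell boundaries are explicit in the markup, so there is no geometry to infer, which is the step the PDF tools have to reconstruct from character positions.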
        
             | jedberg wrote:
             | I never said it was a bad choice, only that they didn't say
             | why they chose it.
        
               | dataflow wrote:
               | I guess, but you were replying to "I was wondering why
               | docx format would be chosen instead of PDF" seemingly
               | unconvinced, so I assumed you thought PDF would've made
               | more sense.
        
         | Noted wrote:
         | Nice to see they call out LibreOffice as a usable application
         | as well.
        
           | AdmiralAsshat wrote:
           | INSTEAD of Open Office, no less!
        
         | hedora wrote:
         | They say that 80% of the submissions used to be converted to
         | PDF from Word. I'd be interested to know where the other 20%
         | came from.
        
           | meragrin_ wrote:
           | There's this one guy I've dealt with. He uses an editor he
           | wrote himself. He'll convert his documents to Pages and then
           | use Pages for any other conversion needed.
        
           | NotYourLawyer wrote:
           | I'm guessing Google Docs is the next most common, and then
           | probably LibreOffice.
        
             | meragrin_ wrote:
             | My experience is a large distrust of cloud environments. I
             | would expect Pages and Libre/Open Office to be more common
             | than Google Docs.
        
               | NotYourLawyer wrote:
               | Oh right, I bet Pages is up there. I dunno though--I know
               | lots of patent attorneys who are surprisingly non-
               | technical and probably haven't given a thought to cloud
               | security issues.
        
           | bonyt wrote:
           | My guess is that printed and scanned documents from Word make
           | up a large component of this.
        
           | pavon wrote:
           | Do patent attorneys love WordPerfect as much as some other
           | people in the legal profession?
        
       | deathanatos wrote:
       | > _Due to aggressive automated scraping of FederalRegister.gov
       | and eCFR.gov, programmatic access to these sites is limited to
       | access to our extensive developer APIs._
       | 
       | Apparently. Then a captcha and a button to request access, which
       | if you complete, returns a 500 Internal Server Error.
       | 
       | ... my tax dollars are _hard_ at work, I see.
       | 
       | The Wayback Machine hasn't got a snapshot, either, it seems.
        
         | zevra wrote:
         | It's apparently the wrong link; see:
         | https://news.ycombinator.com/item?id=32609165
         | 
         | The correct one is:
         | https://www.federalregister.gov/documents/2022/04/28/2022-09...
        
         | batmaniam wrote:
         | Seriously, I've had my share of horror stories on government
         | systems. Do they just not dogfood their own product? Where is
         | their QA team? It's atrocious how basic tasks are so broken all
         | the time.
        
           | ceeplusplus wrote:
           | Government can't afford to hire the competent talent, only
           | the scraps after everyone else (even the consulting
           | bodyshops) are done. The top GS pay bracket is lower than
           | entry level engineers at many companies (not just FAANG, but
           | also defense, F500 companies, etc.)
        
             | CobrastanJorji wrote:
             | One of the many things that I found tempting about working
             | for the U.S. Digital Service was that, while the GS-15 pay
             | grade is definitely way less than I'd make in the private
             | sector, my spouse's family is military/government and the
             | difference between "hippy programming thingy" and "has a
             | GS-15/O-6 job" would've been night and day. The one puts me
             | in a pile of stereotypes, but the other says "oh, he's
             | basically the bureaucratic equivalent of a Captain, that's
             | very respectable."
        
               | ceeplusplus wrote:
               | Different circles I guess. While my spouse's family
               | considers government jobs to be stable and somewhat
               | respectable, there is a lot more respect for FAANG and
               | other high paying jobs. One is respectable, the other is
               | prestigious.
        
       | nullc wrote:
       | Machine learning is coming for the examiners' jobs. :P
        
       | apocalyptic0n3 wrote:
       | I've been working on some tools that integrate with USPTO (both
       | from the application side and the validation side) for quite a
       | few years now and they've been making a TON of formatting changes
       | recently. A lot of their PDF forms have changed, they're
       | requiring XML versions of all data we submit, they're handling
       | classifications differently, etc. Their process always felt like
       | it was stuck in the past and being handled manually by humans
       | before and now it feels like they're moving everything toward
       | automated intake and initial reviews. I imagine this change is
       | for the same reasons, and that's a hefty fee to force it.
       | 
       | This also likely means I will need to rework our systems to spit
       | out docx instead of/in addition to PDFs, which will be a
       | nightmare to do. So that's fun.
        
         | MrLeap wrote:
         | The consolation is that, if I remember correctly, docx is just
         | a zip file containing xml.
         | 
         | I made an xlsx exporter in actionscript3 (lol) years ago and it
         | worked like this. What I ultimately did was make a "template"
         | document, and my code just injected strings into key spots,
         | zipped it up in memory and gave it to you as a file.xlsx.
         | Probably took me 3 days?
         | 
         | I didn't have the benefit of libraries so I imagine this is
         | significantly easier in less hobbled environments, nodejs or
         | whatever probably has a kitchen sink package to do it.
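
A sketch of the template approach described above, transposed to DOCX and Python. Everything here is illustrative: the `{{NAME}}` placeholder syntax, `TEMPLATE_PARTS`, and `render` are assumptions, and a real template saved from Word would carry more parts (`[Content_Types].xml`, relationship files) and might split a placeholder across several runs.

```python
import io
import re
import zipfile
from xml.sax.saxutils import escape

# Hypothetical template: part name -> XML text with {{NAME}} placeholders.
TEMPLATE_PARTS = {
    "word/document.xml": (
        '<w:document xmlns:w="http://schemas.openxmlformats.org/'
        'wordprocessingml/2006/main">'
        "<w:body><w:p><w:r><w:t>{{TITLE}}</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>{{ABSTRACT}}</w:t></w:r></w:p></w:body></w:document>"
    ),
}

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def render(parts: dict, values: dict) -> bytes:
    """Fill the placeholders in every part and zip the result in memory."""
    out = io.BytesIO()
    with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, xml in parts.items():
            # Escape each value so '<' or '&' cannot break the markup.
            # A placeholder with no matching value raises KeyError.
            filled = PLACEHOLDER.sub(lambda m: escape(values[m.group(1)]), xml)
            zf.writestr(name, filled)
    return out.getvalue()
```

`render(TEMPLATE_PARTS, {"TITLE": "Widget", "ABSTRACT": "A device that..."})` returns the bytes of the archive. Substituting in a single regex pass means text injected for one placeholder is never rescanned for another.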
        
           | lofatdairy wrote:
           | That's exactly right. There are definitely nodejs docx
           | templating packages (I've worked on codebases that used them
           | in the past), but they're certainly not required provided
           | your documents are reasonably simple.
           | 
           | If anything, generating a pdf from various input
           | files/structured text has been a much harder task. We
           | generated docx files to allow for easy modification by non-
           | technical staff, but to generate a pdf we had to use a
           | headless instance of libreoffice since pandoc was struggling
           | with the rendering.
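
A sketch of the headless LibreOffice route mentioned above, assuming the `soffice` binary is on the PATH (on some systems it is installed as `libreoffice` instead). The helper names are hypothetical; the `--headless`, `--convert-to` and `--outdir` options are LibreOffice's own command-line flags.

```python
import shutil
import subprocess
from pathlib import Path


def convert_command(docx_path, out_dir, binary="soffice"):
    """Build the argument list for a headless DOCX-to-PDF conversion."""
    return [
        binary,
        "--headless",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(docx_path),
    ]


def convert_to_pdf(docx_path, out_dir, binary="soffice") -> Path:
    """Run the conversion and return the path where the PDF is expected."""
    if shutil.which(binary) is None:
        raise FileNotFoundError(f"{binary} not found on PATH")
    subprocess.run(convert_command(docx_path, out_dir, binary), check=True)
    # LibreOffice names the output after the input file's stem.
    return Path(out_dir) / (Path(docx_path).stem + ".pdf")
```

Keeping the command construction separate from the `subprocess.run` call makes it possible to log or inspect the exact invocation without LibreOffice installed.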
        
         | peteradio wrote:
         | I'm a masochist in need of work.
        
       | Ironlink wrote:
       | In case anyone is looking for the size of the surcharge, I found
       | it in the last row of this table:
       | https://www.federalregister.gov/d/2020-16559/p-555
        
         | colejohnson66 wrote:
         | So, $100-400 depending on entity size (37 CFR 1.16(u)). That
         | feels... excessive, but if processing a DOCX is automatic and
         | PDFs require humans (I'm assuming), it makes sense.
        
       | nfriedly wrote:
       | Note: the correct link is
       | https://www.federalregister.gov/documents/2022/04/28/2022-09...
       | 
       | The fee is $100, $200, or $400 depending on the size of the
       | filing entity (micro, small, or large).
        
       | emgeee wrote:
       | The fee is $400
        
       ___________________________________________________________________
       (page generated 2022-08-26 23:00 UTC)