hngopher.com

       [HN Gopher] So you want to modify the text of a PDF by hand (2020)
       ___________________________________________________________________
        
       Author : mutant_glofish
       Score  : 113 points
       Date   : 2023-09-03 06:24 UTC (1 days ago)
        
 (HTM) web link (gist.github.com)
 (TXT) w3m dump (gist.github.com)
        
       | rogeliodh wrote:
       | LibreOffice can open and edit PDFs. Last time I tried, it was
       | really good. Not sure what the limitations are.
        
         | lucb1e wrote:
         | For me it always seems to change the font from whatever was
         | built into the PDF (rendered just fine in any PDF reader) to a
         | random system font which completely breaks the spacing, making
         | different parts of the document overflow into each other
        
       | ks2048 wrote:
       | This seems to be missing an important point: at the end of PDF is
       | a table ("cross-reference" table) that stores the BYTE-OFFSET to
       | different objects in the file.
       | 
       | If you modify things within the file, typically these offsets
       | will change and the file will be corrupt. It looks like in this
       | article, maybe they were only interested in changing one number
       | to another, so none of the positions change.
       | 
       | But generally, adding, removing, or modifying things in the middle
       | of the file requires recomputing the xref table, so it becomes
       | much easier to use a library than to edit the text directly.
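       | 
       | To make the offset bookkeeping concrete, here is a minimal Python
       | sketch (not from the article; build_pdf and the three sample
       | objects are made up for illustration) that writes a tiny PDF and
       | records the byte offset of each object in a classic xref table:

```python
def build_pdf(objects):
    """Serialize object bodies (bytes) as objects 1..n and append an xref table."""
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset where "N 0 obj" starts
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"
    xref_pos = len(out)  # byte offset of the xref table itself
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry for object 0, head of the free list
    for off in offsets:
        out += b"%010d 00000 n \n" % off  # every entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


# Hypothetical sample objects: a catalog, a page tree, and one empty page.
objects = [
    b"<< /Type /Catalog /Pages 2 0 R >>",
    b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
    b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 200 200] >>",
]
pdf = build_pdf(objects)
```

       | A same-length edit such as replacing "200 200" with "300 300"
       | leaves every offset valid; inserting or deleting even one byte
       | before an object does not.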
        
         | userbinator wrote:
         | That's the weirdest part of the PDF spec IMHO. It's a mix of
         | both binary and text, with text-specified byte offsets. It
         | would be very interesting to read about why the format became
         | like that, if its authors would ever talk about it. My guess is
         | that it was meant to be completely textual at first (but then
         | requiring the xref table to have fixed-length entries is odd),
         | and then they decided binary would be more efficient.
        
           | Someone wrote:
           | > My guess is that it was meant to be completely textual at
           | first
           | 
           | It indeed started life as "not Turing complete postscript
           | with an index" (the index makes it easy to render just the
           | third page of a PDF file, something that's impossible in
           | postscript without rendering the first and second pages
           | first). Like postscript, that was a pure text format.
           | 
           | One nice feature is that you can append a few pieces and a
           | new index to an existing PDF file and get a new valid PDF
           | file (which would still contain its old index as a piece of
           | "junk DNA")
           | 
           | I think compression was added because users complained about
           | file sizes. Ascii85 (https://en.m.wikipedia.org/wiki/Ascii85)
           | grows binary data by 25%.
           | 
           | > but then requiring the xref table to have fixed-length
           | entries is odd
           | 
           | My guess is that made it easier to hack together a tool to
           | convert PDF to postscript.
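           | 
           | The 25% figure follows from Ascii85 packing every 4 bytes
           | into 5 printable characters. A quick check with Python's
           | standard base64 module (the sample bytes are arbitrary; an
           | all-zero 4-byte group would be shortened to a single "z",
           | so this shows the typical case, not a guarantee):

```python
import base64

raw = bytes(range(256))          # 256 arbitrary bytes, no all-zero 4-byte group
encoded = base64.a85encode(raw)  # every 4 input bytes become 5 ASCII characters

ratio = len(encoded) / len(raw)
print(len(raw), len(encoded), ratio)

# The encoding is lossless: decoding gives back the original bytes.
assert base64.a85decode(encoded) == raw
```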
        
           | detourdog wrote:
           | I actually was at an Acrobat/PDF launch event in midtown NYC.
           | It was an embedded file type that could be generated at the
           | time of publishing and all dependencies could either be
           | embedded or not.
           | 
           | This made a coherent point in a digital workflow that could
           | be saved and reprinted with ease. This was a big deal before
           | the portable document format came to be.
           | 
           | I once made a workflow that took PDF files from Word,
           | FileMaker, Excel, and MiniCAD. This all got combined into a
           | single 9,000 page PDF. The final PDF had coherent
           | thumbnails, page numbers, headers and footers.
           | 
           | It only took a couple of hours to get the final document
           | after pushing the go button.
        
           | pmarreck wrote:
           | The roots of PDF are PostScript, which is like Forth, and is
           | text-based, so that's why
        
         | bena wrote:
         | Ah. So it's a lot like editing compiled binaries.
         | 
         | You can modify binaries all you want as long as you preserve
         | the length of everything.
         | 
         | Some piece of software we had authenticated against a server,
         | but everything was done on the client. The client executed SQL
         | against the server directly, etc. Basically, the server checked
         | to see if this client would put you over the number of licenses
         | you purchased and that's it.
         | 
         | I ran it through a disassembler, found the part where it
         | performed the check, and was able to change it to a straight
         | JMP and then pad the rest of the space with NOPs.
        
         | gpvos wrote:
         | That's why they decode it with qpdf and re-encode it again
         | afterwards, so qpdf takes care of that. qpdf reconstructs the
         | original PDF structure, and I think it even tries to keep the
         | object numbers the same, but the offsets are recalculated
         | completely.
        
         | aidos wrote:
         | In my experience it's easiest just to break the xref table and
         | run something like "mutool clean" to fix it again. It can be
         | completely derived from the content so it's safe to do.
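         | 
         | As a rough illustration of why the table can be derived from
         | the content, this Python sketch scans for "N G obj" headers
         | and records where each one starts. It is not how mutool or
         | qpdf work internally (that is an assumption-free statement
         | only about this sketch): real repair tools also have to cope
         | with false matches inside binary streams and with objects
         | packed into compressed object streams. The function name and
         | sample bytes are made up.

```python
import re

# An object header looks like "12 0 obj" at the start of a line.
OBJ_HEADER = re.compile(rb"(?m)^(\d+) (\d+) obj\b")


def scan_object_offsets(data):
    """Map object number -> byte offset of its 'N G obj' header."""
    offsets = {}
    for match in OBJ_HEADER.finditer(data):
        # If an object number appears twice, the later definition wins,
        # which mirrors how appended updates override earlier objects.
        offsets[int(match.group(1))] = match.start()
    return offsets


sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"2 0 obj\n<< >>\nendobj\n"
)
print(scan_object_offsets(sample))
```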
        
       | Const-me wrote:
       | > I didn't see an obvious open-source tool that lets you dig into
       | PDF internals
       | 
       | That's a matter of the toolset. I program in C#, and I have had
       | good experience with this open source library:
       | https://www.nuget.org/packages/iTextSharp-LGPL/ It's a decade old
       | by now, but PDF ain't exactly a new format. That library is not
       | terribly bad for many practical use cases. Particularly good when
       | you only need to create the documents as opposed to editing them,
       | because for that use case you'd want to use an old version of the
       | format anyway, for optimal compatibility.
        
       | jl6 wrote:
       | This seems to be missing an important step in the use of qpdf's
       | --qdf mode: after you've finished editing, you need to run the
       | file through the fix-qdf utility to recalculate all the object
       | offsets and rebuild the cross-reference table that lives at the
       | end of the file (unless you only change bytes in-place rather
       | than adding or removing bytes).
       | 
       | My top 3 fun PDF facts:
       | 
       | 1) Although PDF documents are typically 8-bit binary files, you
       | can make one that is valid UTF-8 "plain text", even including
       | images, through the use of the ASCII85 filter.[0]
       | 
       | 2) PDF allows an incredible variety of freaky features (3D
       | objects, JavaScript, movies in an embedded flash object,
       | invisible annotations...). PDF/A is a much saner, safer subset.
       | 
       | 3) The PDF spec allows you to write widgets (e.g. form controls)
       | using "rich text", which is a subset of XHTML and CSS - but this
       | feature is very sparsely supported outside the official Adobe
       | Reader.
       | 
       | [0] For example: https://lab6.com/2
        
         | gpvos wrote:
         | After you've finished editing, just run it through qpdf without
         | parameters, as explained in the beginning of the article, and
         | it will recompress the data and recreate the xref table. No
         | need for yet another tool.
        
           | jl6 wrote:
           | I guess you could, but this is the source of the errors
           | (actually warnings) that the article mentions. Probably best
           | to fix the file with the provided tool (fix-qdf is
           | distributed with qpdf) rather than get in the habit of
           | ignoring warnings.
        
       | LispSporks22 wrote:
       | As I recall, words aren't even necessarily made up of contiguous
       | characters. Especially true in OCRed documents in PDF.
        
       | aleden wrote:
       | I'm surprised no one has mentioned qpdf.
       | 
       | https://qpdf.readthedocs.io/en/stable/overview.html
       | 
       | It turns a PDF (typically everything in it is compressed binary
       | blobs) into a mixed binary/ASCII file (which itself is a PDF)
       | that can be edited with vim.
        
         | rhaway84773 wrote:
         | It's mentioned in the gist
         | 
         | > To view the compressed data, you can use a command line tool
         | called qpdf.
        
         | chrnola wrote:
         | The linked article literally mentions qpdf within the first few
         | paragraphs.
        
         | gpvos wrote:
         | I'm not sure what you were reading, but the fine article is
         | centred around using qpdf.
        
       | seszett wrote:
       | Although this is an interesting dive into the PDF format, just
       | opening the PDF in LibreOffice or Inkscape usually works fine to
       | modify its text.
        
         | gcanyon wrote:
         | I'm interested in extracting the contents of a pdf form -- many
         | individual text boxes. You're saying libre office would likely
         | be able to parse that pdf into a usable format?
        
           | anon____ wrote:
           | With LibreOffice Draw you can edit the PDF (modify the text,
           | move or change images, etc), then save as pdf, but it can't
           | parse and save it as .odt, .doc, .html or similar.
        
             | ShadowBanThis01 wrote:
             | LibreOffice has some really perplexing functionality gaps.
             | 
             | The one that baffles me is that it doesn't understand its
             | own graphics format, so you have to export drawings to TIFF
             | or something (if I remember correctly).
        
           | pikrzyszto wrote:
           | Poppler ( https://poppler.freedesktop.org/ ) handles this for
           | you with its pdftotext utility. It also ships with a bunch of
           | other utilities for working with PDFs.
        
       | desgeeko wrote:
       | If you want to continue this journey and learn more about PDF,
       | you can read the anatomy of a file I documented recently:
       | https://pdfsyntax.dev/introduction_pdf_syntax.html
        
       | enriquto wrote:
       | You can do this:
       | 
       |     pdf2ps a.pdf    # convert to postscript "a.ps"
       |     vim a.ps        # edit postscript by hand
       |     ps2pdf a.ps     # convert back to pdf
       | 
       | Some complex PDFs (with embedded JavaScript, animations, etc.)
       | fail to work correctly after this back and forth. Yet for "plain"
       | documents this works alright. You can easily remove watermarks,
       | change some words and numbers, etc. Spacing is harder to modify.
       | Of course you need to know some postscript.
        
       | jordann wrote:
       | If you don't mind using Java, you can use the open source Apache
       | PDFBox library
       | 
       | https://pdfbox.apache.org/
       | 
       | It's relatively performant and it's a mature and supported
       | codebase that can accomplish most pdf tasks.
        
       | aidos wrote:
       | This topic comes up periodically as most people think PDFs are
       | some impenetrable binary format, but they're really not.
       | 
       | They are a graph of objects of different types. The types
       | themselves are well described in the official spec (I'm a sadist,
       | I read it for fun).
       | 
       | My advice is always to convert the pdf to a version without
       | compressed data like the author here has. My tool of choice is
       | mutool (mutool clean -d in.pdf out.pdf). Then just have a
       | rummage. You'll be surprised by how much you can follow.
       | 
       | In the article the author missed a step where you look at the
       | page object to see the resources. That's where the mapping from
       | the font name used in the content stream to the underlying object
       | is made.
       | 
       | There's also another important bit missing: most fonts are
       | subset into the PDF, i.e., only the glyphs that are needed are
       | kept in the font. I think that's often where the re-encoding
       | happens. ToUnicode is maintained to allow you to copy text (or
       | search in a PDF). It's a nice-to-have for users (in my
       | experience it's normally there and correct though).
        
         | esafak wrote:
         | It is a shame Adobe designed a format so hard to work with that
         | people are amazed when someone accomplishes what should be a
         | basic task with it.
         | 
         | Their design philosophy of creating a read-only format was
         | flawed to begin with. What's the first feature people are going
         | to ask for??
        
           | pwg wrote:
           | > It is a shame Adobe designed a format so hard to work with
           | 
           | PDF was not designed to be editable, nor for anyone to "work
           | with" it in that way.
           | 
           | It was designed (at least the original purpose circa 1989) to
           | represent printed pages electronically in a format that would
           | view and print identically everywhere. In fact, the initial
           | advertising for the "value" of the PDF format was exactly
           | this: no matter where a recipient viewed your PDF output, it
           | would look, and print, the same as everywhere else.
           | 
           | It was originally meant to be "electronic paper".
        
             | dylan604 wrote:
             | Wasn't the PDF format based on the Illustrator format?
             | 
             | The weird thing to me is people using a distribution format
             | as an original source. It's right up there with video
             | cameras shooting an acquisition source as an MP4 and all of
             | the negative baggage that comes with that.
        
               | userbinator wrote:
               | I believe Illustrator format is very similar to
               | PostScript.
        
           | mistrial9 wrote:
           | .. waves to Leonard Rosenthol
        
       | lucascacho wrote:
       | Every time I read about the hardships of interacting with the PDF
       | format, I gain more respect for Photopea, which has full PDF
       | editing support.
        
       | blincoln wrote:
       | The PDF specification is wild. My current favourite trivia is
       | that it supports all of Photoshop's layer blend modes for
       | rendering overlapping elements.[1] My second-favourite is that it
       | supports appended content that modifies earlier content, so one
       | should always look for forensic evidence in all distinct versions
       | represented in a given file.[2]
       | 
       | It's also a fun example of the futility of DRM. The spec includes
       | password-based encryption, and allows for different "owner" and
       | "user" passwords. There's a bitfield with options for things like
       | "prevent printing", "prevent copying text", and so forth,[3] but
       | because reading the document necessarily involves decrypting it,
       | one can use the "user" password to open an encrypted PDF in a
       | non-compliant tool,[4] then save the unencrypted version to get
       | an editable equivalent.
       | 
       | [1] "More than just transparency" section of
       | https://blog.adobe.com/en/publish/2022/01/31/20-years-of-tra...
       | 
       | [2] https://blog.didierstevens.com/2008/05/07/solving-a-
       | little-p...
       | 
       | [3] Page 61 of https://opensource.adobe.com/dc-acrobat-sdk-
       | docs/pdfstandard...
       | 
       | [4] For example, a script that uses the pypdf library.
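       | 
       | For a feel of how thin that bitfield is, here is a small Python
       | sketch that decodes a /P value from an encryption dictionary.
       | The bit numbers (1 = lowest bit) are as I recall them from the
       | standard security handler table in the PDF 1.7 spec, so check
       | them against [3] before relying on them; the function name and
       | sample values are made up.

```python
# Bit position (1-based, lowest bit first) -> operation it permits.
# Assumed from the PDF 1.7 user access permissions table.
PERMISSION_BITS = {
    3: "print",
    4: "modify contents",
    5: "copy or extract text and graphics",
    6: "add or modify annotations",
    9: "fill in form fields",
    10: "extract for accessibility",
    11: "assemble document",
    12: "print in high quality",
}


def decode_permissions(p):
    """Return the operations allowed by a signed 32-bit /P value."""
    value = p & 0xFFFFFFFF  # reinterpret the signed integer as raw bits
    return [
        name
        for bit, name in PERMISSION_BITS.items()
        if value & (1 << (bit - 1))
    ]


print(decode_permissions(-3904))  # all permission bits cleared
print(decode_permissions(-3900))  # only bit 3 set
print(decode_permissions(-4))     # every permission bit set
```

       | Nothing enforces these bits except the reader's good manners,
       | which is the point made above.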
        
         | aidos wrote:
         | To be fair, if you wanted to stop copying of text it would be
         | easiest just to drop the ToUnicode mapping against the fonts
         | and then it's a manual process for people to recreate them.
        
           | miki123211 wrote:
           | That also breaks search (and more importantly screen reader
           | accessibility), and if you're professionally required to
           | specifically produce PDFs with these security features
           | enabled, you're pretty likely to be working in a context
           | where that would be illegal.
        
         | userbinator wrote:
         | In the context of a format that was originally proprietary and
         | not widely available to everyone, and conceived in an era where
         | encryption was strongly controlled by export law, that sort of
         | security-by-obscurity was very common. Incidentally, a popular
         | cracking tutorial back then was to de-DRM the official reader
         | by patching the function that checks those permissions.
        
         | aardvark179 wrote:
         | Aren't the blend modes supported just the Porter-Duff
         | compositing modes? You might think that's overkill, but it's a
         | really good mapping of what other rendering pipelines offer and
         | it can really help reduce the work to produce a PDF.
        
           | pavlov wrote:
           | The original Porter-Duff compositing operators don't cover
           | Photoshop-style blending. Here's a link with pictures:
           | 
           | http://ssp.impulsetrain.com/porterduff.html
           | 
           | The Porter-Duff operators are appealingly rigorous and easy
           | to implement because they're simply the possible combinations
           | of a simple formula. But many of these operators are not very
           | useful either.
           | 
           | The Photoshop blending modes are practically the opposite:
           | they are not derived from anything mathematically appealing,
           | it's really just a collection of algorithms that Photoshop's
           | designers originally found useful. They reflect the
           | limitations of their early 1990s desktop computer
           | implementations (for example, no attempt is made to account
           | for gamma correction when combining the layers, which makes
           | many of these operations behave very differently from actual
           | light that they mean to emulate).
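           | 
           | To show the contrast in code: a Python sketch where all
           | twelve Porter-Duff operators fall out of one formula,
           | Fa*src + Fb*dst on premultiplied RGBA, next to "multiply",
           | which is just an ad hoc per-channel function. The names
           | composite and multiply_blend are made up for this example.

```python
# Each operator is a pair (Fa, Fb) of functions of the source alpha a_s
# and destination alpha a_d.
OPERATORS = {
    "clear":    (lambda a_s, a_d: 0,       lambda a_s, a_d: 0),
    "src":      (lambda a_s, a_d: 1,       lambda a_s, a_d: 0),
    "dst":      (lambda a_s, a_d: 0,       lambda a_s, a_d: 1),
    "src_over": (lambda a_s, a_d: 1,       lambda a_s, a_d: 1 - a_s),
    "dst_over": (lambda a_s, a_d: 1 - a_d, lambda a_s, a_d: 1),
    "src_in":   (lambda a_s, a_d: a_d,     lambda a_s, a_d: 0),
    "dst_in":   (lambda a_s, a_d: 0,       lambda a_s, a_d: a_s),
    "src_out":  (lambda a_s, a_d: 1 - a_d, lambda a_s, a_d: 0),
    "dst_out":  (lambda a_s, a_d: 0,       lambda a_s, a_d: 1 - a_s),
    "src_atop": (lambda a_s, a_d: a_d,     lambda a_s, a_d: 1 - a_s),
    "dst_atop": (lambda a_s, a_d: 1 - a_d, lambda a_s, a_d: a_s),
    "xor":      (lambda a_s, a_d: 1 - a_d, lambda a_s, a_d: 1 - a_s),
}


def composite(op, src, dst):
    """Combine two premultiplied (r, g, b, a) tuples with a Porter-Duff operator."""
    fa, fb = OPERATORS[op]
    a_s, a_d = src[3], dst[3]
    return tuple(fa(a_s, a_d) * s + fb(a_s, a_d) * d for s, d in zip(src, dst))


def multiply_blend(backdrop, source):
    """'Multiply' blend on non-premultiplied RGB: each channel is a product."""
    return tuple(b * s for b, s in zip(backdrop, source))


half_red = (0.5, 0, 0, 0.5)  # 50% transparent red, premultiplied
blue = (0, 0, 1, 1)          # opaque blue
print(composite("src_over", half_red, blue))
print(multiply_blend((0.5, 0.5, 0.5), (1, 0, 0.5)))
```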
        
       | crtified wrote:
       | This brings back horrible memories of working with large complex
       | maps back in the 2000s. Having various CAD and GIS applications
       | generate messy, inefficient spaghetti-coded PDF outputs - then
       | bouncing those PDFs around the Adobe apps of the time, to add
       | effects and other prettifications not available in the mapping
       | apps.
       | 
       | It would reach the point where things would start to break, and
       | .... "good times were had, by all".
        
       | schlowmo wrote:
       | PDF is such a weird format. Not so long ago I had to write some
       | Java code for manipulating PDFs: find a string, remove it and
       | place an image at the former string position. I should have known
       | better as I thought "Well, how hard can that be?"
       | 
       | What followed was a deep dive down the rabbit hole, a lot of
       | fiddling with the same tools the author of this gist is using
       | trying to make sense of it all.
       | 
       | The final solution worked better than I thought while at the same
       | time felt incredibly wrong.
       | 
       | I'm very thankful for all the (probably painful) work that went
       | into those open source PDF tools.
        
       | miki123211 wrote:
       | What people often miss about PDF is that it's closer to an image
       | format in some ways than to a Word document. Word documents, PDFs
       | and images are in document editing what DAW projects, midis and
       | mp3 files are in music and what Java source code, JVM bytecode
       | and pure x86 machine code are in software.
       | 
       | The primary purpose of a PDF file is to tell you what to display
       | (or print), with perfect clarity, in far fewer bytes than an
       | actual image would take. It exploits the fact that the document
       | creator knows about patterns in the document structure that, if
       | expressed properly, make the document much more compressible than
       | anything that an actual image compression algorithm could
       | accomplish. For example, if you have access to the actual font,
       | it's better to say "put these characters at these coordinates
       | with that much spacing between them" than to include every
       | occurrence of every character as a part of the image, hoping that
       | the compression algorithm notices and compresses away the
       | repetitions. Things like what character is part of what word, or
       | even what unicode codepoint is mapped to which font glyph are
       | basically unimportant if all you're after is efficiently
       | transferring the image of a document.
       | 
       | If you have an editable document, you care a lot more about the
       | semantics of the content, not just about its presentation. It
       | matters to you whether a particular break in the text is supposed
       | to be multiple spaces, the next column in a table or just a weird
       | page layout caused by an image being present. If you have some
       | text at the bottom of each page, you care whether that text was
       | put there by the document author multiple times, or whether it
       | was entered once and set as a footer. If you add a new paragraph
       | and have to change page layout, it matters to you that the last
       | paragraph on this page is a footnote and should not be moved to
       | the next one. If a section heading moves to another page, you
       | care about the fact that the table of contents should update
       | automatically and isn't just some text that the author has
       | manually entered. If you're a printer or screen, you care about
       | none of these things, you just print or display whatever you're
       | told to print or display. For a PDF, footnotes, section headings,
       | footers or tables of contents don't have to be special, they can
       | just be text with some meaningless formatting applied to it. This
       | is why making PDF work for any purpose which isn't displaying or
       | printing is never going to be 100% accurate. Of course, there are
       | efforts to remedy this, and a PDF-creating program is free to
       | include any metadata it sees fit, but it's by no means required
       | to do so.
       | 
       | This isn't necessarily the mental model that the PDF authors had
       | in mind, but it's a useful way to look at PDF and understand why
       | it is the way it is.
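       | 
       | As a sketch of what "put these characters at these coordinates"
       | looks like in practice, this Python helper emits the text
       | operators for one string in a page content stream. The helper
       | name is made up, and "F1" is a hypothetical font resource name
       | that would have to exist in the page's resources; with a subset
       | font using a custom encoding, the bytes inside the parentheses
       | would not be readable ASCII at all.

```python
def text_op(text, x, y, font="F1", size=12):
    """Return content-stream operators that draw `text` at (x, y)."""
    # Backslashes and parentheses must be escaped inside a PDF literal string.
    escaped = (
        text.replace("\\", "\\\\")
            .replace("(", "\\(")
            .replace(")", "\\)")
    )
    # BT/ET bracket a text object, Tf selects font and size,
    # Td moves to the position, Tj shows the string.
    return f"BT /{font} {size} Tf {x} {y} Td ({escaped}) Tj ET"


print(text_op("Hello (PDF)", 72, 720))
```

       | Note that nothing in this output says whether the string is a
       | heading, a footer, or half of a word.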
        
       | eschaton wrote:
       | Anybody trying to do this is missing the point of PDF: It's a
       | _page-description format_ and therefore only represents the
       | _marks on a page_ , not _document structure_.
       | 
       | One should not attempt to edit a PDF, one should edit the
       | document from which the PDF is generated.
        
         | lucb1e wrote:
         | I'll stop trying to edit PDFs when people stop sending me PDFs
         | that I want to edit.
         | 
         | Somehow it became "unprofessional" to just send meant-to-be-
         | editable documents around for everyone to enjoy, so this is
         | where we end up...
        
         | louthy wrote:
         | > It's a page-description format and therefore only represents
         | the marks on a page, not document structure
         | 
         | Maybe they should have called it 'Page Description Format'
         | then? Instead of 'Portable _Document_ Format'
        
         | layer8 wrote:
         | PDF does support incorporating information about the logical
         | document structure, aka Tagged PDF. It's optional, but
         | recommended for accessibility (e.g. PDF/UA). See chapters
         | 14.7-14.8 in [1].
         | 
         | [1] https://opensource.adobe.com/dc-acrobat-sdk-
         | docs/pdfstandard...
        
         | o1y32 wrote:
         | "should not" is meaningless here, because in the real world
         | there are tons of situations where people _want_ you to edit
         | PDF, one way or another
        
       | yair99dd wrote:
       | Inkscape 1.2's multipage support is great for editing graphics
       | and text in PDFs.
        
       ___________________________________________________________________
       (page generated 2023-09-04 23:00 UTC)