[HN Gopher] So you want to modify the text of a PDF by hand (2020) ___________________________________________________________________ So you want to modify the text of a PDF by hand (2020) Author : mutant_glofish Score : 113 points Date : 2023-09-03 06:24 UTC (1 days ago) (HTM) web link (gist.github.com) (TXT) w3m dump (gist.github.com) | rogeliodh wrote: | LibreOffice can open and edit PDFs. Last time I tried it was | really good. Not sure what limitations are there. | lucb1e wrote: | For me it always seems to change the font from whatever was | built into the PDF (rendered just fine in any PDF reader) to a | random system font which completely breaks the spacing, making | different parts of the document overflow into each other | ks2048 wrote: | This seems to be missing an important point: at the end of PDF is | a table ("cross-reference" table) that stores the BYTE-OFFSET to | different objects in the file. | | If you modify things within the file, typically these offsets | will change and the file will be corrupt. It looks like in this | article, maybe they were only interested in changing one number | to another, so none of the positions change. | | But, generally, adding/removing/modifying things in the middle of | the file require recomputing the xref table and thus become much | easier to use a library rather than direct text editing. | userbinator wrote: | That's the weirdest part of the PDF spec IMHO. It's a mix of | both binary and text, with text-specified byte offsets. It | would be very interesting to read about why the format became | like that, if its authors would ever talk about it. My guess is | that it was meant to be completely textual at first (but then | requiring the xref table to have fixed-length entries is odd), | and then they decided binary would be more efficient. 
| Someone wrote: | > My guess is that it was meant to be completely textual at | first | | It indeed started life as "not Turing complete postscript | with an index" (which makes it easy to render just the third | page of a PDF file, something that's impossible in postscript | without rendering the first and second pages first). Like | postscript, that was a pure text format. | | One nice feature is that you can append a few pieces and a | new index to an existing PDF file and get a new valid PDF | file (which would still contain its old index as a piece of | "junk DNA") | | I think compression was added because users complained about | file sizes. Ascii85 (https://en.m.wikipedia.org/wiki/Ascii85) | grows binary data by 25%. | | > but then requiring the xref table to have fixed-length | entries is odd | | My guess is that made it easier to hack together a tool to | convert PDF to postscript. | detourdog wrote: | I actually was at an Acrobat/PDF launch event in midtown NYC. | It was an embedded file type that could be generated at the | time of publishing and all dependencies could either be | embedded or not. | | This made a coherent point in a digital workflow that could | be saved and reprinted with ease. This was a big deal before | the portable document format came to be. | | I once made a workflow that took pdf files from Word, | filemaker, excel, and mini-cad. This all got combined into a | single 9,000 page pdf. The final pdf had coherent | thumbnails, page numbers and headers and footers. | | Only took a couple of hours to get the final document after | pushing the go button. | pmarreck wrote: | The roots of PDF are PostScript, which is like Forth, and is | text-based, so that's why | bena wrote: | Ah. So it's a lot like editing compiled binaries. | | You can modify binaries all you want as long as you preserve | the length of everything. | | Some piece of software we had authenticated against a server, | but everything was done on the client. 
The client executed SQL | against the server directly, etc. Basically, the server checked | to see if this client would put you over the number of licenses | you purchased and that's it. | | I had run it through a disassembler, found the part where it | performed the check, and was able to change it to a straight | JMP and then pad the rest of the space with NOPs. | gpvos wrote: | That's why they decode it with qpdf and re-encode it | afterwards, so qpdf takes care of that. qpdf reconstructs the | original PDF structure, and I think it even tries to keep the | object numbers the same, but the offsets are recalculated | completely. | aidos wrote: | In my experience it's easiest just to break the xref table and | run something like "mutool clean" to fix it again. It can be | completely derived from the content so it's safe to do. | Const-me wrote: | > I didn't see an obvious open-source tool that lets you dig into | PDF internals | | That's a matter of the toolset. I program in C#, and I have good | experience with that open source library: | https://www.nuget.org/packages/iTextSharp-LGPL/ It's a decade old | by now, but PDF ain't exactly a new format. That library is not | terribly bad for many practical use cases. Particularly good when | you only need to create the documents as opposed to editing them, | because for that use case you'd want to use an old version of the | format anyway, for optimal compatibility. | jl6 wrote: | This seems to be missing an important step in the use of qpdf's | --qdf mode: after you've finished editing, you need to run the | file through the fix-qdf utility to recalculate all the object | offsets and rebuild the cross-reference table that lives at the | end of the file (unless you only change bytes in-place rather | than adding or removing bytes). 
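The offset bookkeeping discussed above (the xref table storing byte offsets that go stale once bytes are added or removed) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not how qpdf, fix-qdf or mutool work internally: `make_objects`, `stream` and `build_pdf` are hypothetical helper names, the five-object document is made up for the demo, and the text is assumed to contain no parentheses or backslashes (which would need escaping in a PDF string).

```python
def stream(data: bytes) -> bytes:
    """Wrap raw bytes as an uncompressed PDF stream with a matching /Length."""
    return b"<< /Length %d >>\nstream\n" % len(data) + data + b"\nendstream"


def make_objects(text: str) -> list:
    """Bodies for objects 1..5 of a one-page document showing `text` (hypothetical layout)."""
    content = f"BT /F1 12 Tf 20 100 Td ({text}) Tj ET".encode("ascii")
    return [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 200 200] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        stream(content),
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]


def build_pdf(objects: list):
    """Serialize the objects, then derive the xref table from the real byte offsets."""
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte position where "N 0 obj" begins
        out += f"{num} 0 obj\n".encode("ascii") + body + b"\nendobj\n"
    xref_pos = len(out)
    out += f"xref\n0 {len(objects) + 1}\n".encode("ascii")
    out += b"0000000000 65535 f \n"  # entry 0 is the head of the free list
    for off in offsets:
        # Fixed-width entry: 10-digit offset, 5-digit generation, 'n', 2-byte EOL.
        out += f"{off:010d} 00000 n \n".encode("ascii")
    out += (
        f"trailer\n<< /Size {len(objects) + 1} /Root 1 0 R >>\n"
        f"startxref\n{xref_pos}\n%%EOF\n"
    ).encode("ascii")
    return bytes(out), offsets


pdf_a, off_a = build_pdf(make_objects("Hello"))
pdf_b, off_b = build_pdf(make_objects("Hello, world"))
print("offsets before edit:", off_a)
print("offsets after edit: ", off_b)
```

Objects 1 to 4 start at the same place in both files, but the font object after the edited stream moves, and so does the xref table itself. That is why an in-place edit that keeps the byte count is harmless while anything that adds or removes bytes needs the table rebuilt.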
| | My top 3 fun PDF facts: | | 1) Although PDF documents are typically 8-bit binary files, you | can make one that is valid UTF-8 "plain text", even including | images, through the use of the ASCII85 filter.[0] | | 2) PDF allows an incredible variety of freaky features (3D | objects, JavaScript, movies in an embedded flash object, | invisible annotations...). PDF/A is a much saner, safer subset. | | 3) The PDF spec allows you to write widgets (e.g. form controls) | using "rich text", which is a subset of XHTML and CSS - but this | feature is very sparsely supported outside the official Adobe | Reader. | | [0] For example: https://lab6.com/2 | gpvos wrote: | After you've finished editing, just run it through qpdf without | parameters, as explained in the beginning of the article, and | it will recompress the data and recreate the xref table. No | need for yet another tool. | jl6 wrote: | I guess you could, but this is the source of the errors | (actually warnings) that the article mentions. Probably best | to fix the file with the provided tool (fix-qdf is | distributed with qpdf) rather than get in the habit of | ignoring warnings. | LispSporks22 wrote: | As I recall, words aren't even necessarily made up of contiguous | characters. Especially true in OCRed documents in PDF. | aleden wrote: | I'm surprised no one has mentioned qpdf. | | https://qpdf.readthedocs.io/en/stable/overview.html | | It turns a PDF (typically everything in it is compressed binary | blobs) into a mixed binary/ASCII file (which itself is a PDF) | that can be edited with vim. | rhaway84773 wrote: | It's mentioned in the gist | | > To view the compressed data, you can use a command line tool | called qpdf. | chrnola wrote: | The linked article literally mentions qpdf within the first few | paragraphs. | gpvos wrote: | I'm not sure what you were reading, but the fine article is | centred around using qpdf. 
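The two Ascii85 remarks earlier in the thread (the 25% growth, and the ability to keep a PDF entirely in printable text) are easy to check with Python's standard library. A minimal sketch, assuming the sample data is arbitrary and `ascii85_overhead` is a hypothetical helper name; note that `adobe=True` adds both a `<~` prefix and a `~>` suffix, whereas as far as I know a PDF stream using the ASCII85Decode filter only carries the trailing `~>`.

```python
import base64


def ascii85_overhead(data: bytes) -> float:
    """Return encoded size / raw size for the Ascii85 payload (framing excluded)."""
    encoded = base64.a85encode(data, adobe=True)  # framed as <~ ... ~>
    payload = encoded[2:-2]                       # strip the <~ and ~> markers
    return len(payload) / len(data)


# 1024 bytes covering every byte value; no 4-byte group is all zeros,
# so the 'z' shortcut for zero groups never kicks in.
sample = bytes(range(256)) * 4
encoded = base64.a85encode(sample, adobe=True)

print("raw bytes:    ", len(sample))
print("encoded bytes:", len(encoded) - 4)
print("ratio:        ", ascii85_overhead(sample))

# The encoded form is pure ASCII and round-trips back to the original bytes.
encoded.decode("ascii")
assert base64.a85decode(encoded, adobe=True) == sample
```

Every 4 input bytes become 5 output characters, which is where the 25% figure comes from. Runs of zero bytes can come out smaller because of the `z` shortcut.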
| seszett wrote: | Although this is an interesting dive into the PDF format, just | opening the PDF in LibreOffice or Inkscape usually works fine to | modify its text. | gcanyon wrote: | I'm interested in extracting the contents of a pdf form -- many | individual text boxes. You're saying LibreOffice would likely | be able to parse that pdf into a usable format? | anon____ wrote: | With LibreOffice Draw you can edit the PDF (modify the text, | move or change images, etc), then save as pdf, but it can't | parse and save it as .odt, .doc, .html or similar. | ShadowBanThis01 wrote: | LibreOffice has some really perplexing functionality gaps. | | The one that baffles me is that it doesn't understand its | own graphics format, so you have to export drawings to TIFF | or something (if I remember correctly). | pikrzyszto wrote: | Poppler ( https://poppler.freedesktop.org/ ) handles this for | you with its pdftotext utility. It also ships with a bunch of other | utilities to work with PDFs | desgeeko wrote: | If you want to continue this journey and learn more about PDF, | you can read the anatomy of a file I documented recently: | https://pdfsyntax.dev/introduction_pdf_syntax.html | enriquto wrote: | You can do this:
|     pdf2ps a.pdf    # convert to postscript "a.ps"
|     vim a.ps        # edit postscript by hand
|     ps2pdf a.ps     # convert back to pdf
| | Some complex PDFs (with embedded javascript, animations, etc) fail | to work correctly after this back and forth. Yet for "plain" | documents this works alright. You can easily remove watermarks, | change some words and numbers, etc. Spacing is harder to modify. | Of course you need to know some postscript. | jordann wrote: | If you don't mind using Java, you can use the open source Apache | PDFBox library | | https://pdfbox.apache.org/ | | It's relatively performant and it's a mature and supported | codebase that can accomplish most pdf tasks. 
| aidos wrote: | This topic comes up periodically as most people think PDFs are | some impenetrable binary format, but they're really not. | | They are a graph of objects of different types. The types | themselves are well described in the official spec (I'm a sadist, | I read it for fun). | | My advice is always to convert the pdf to a version without | compressed data like the author here has. My tool of choice is | mutool (mutool clean -d in.pdf out.pdf). Then just have a | rummage. You'll be surprised by how much you can follow. | | In the article the author missed a step where you look at the | page object to see the resources. That's where the mapping from | the font name used in the content stream to the underlying object | is made. | | There's also another important bit missing - most fonts are | subset into the pdf. I.e., only the glyphs that are needed are | maintained in the font. I think that's often where the | re-encoding happens. ToUnicode is maintained to allow you to copy | text (or search in a PDF). It's a nice-to-have for users (in my | experience it's normally there and correct though). | esafak wrote: | It is a shame Adobe designed a format so hard to work with that | people are amazed when someone accomplishes what should be a | basic task with it. | | Their design philosophy of creating a read-only format was | flawed to begin with. What's the first feature people are going | to ask for?? | pwg wrote: | > It is a shame Adobe designed a format so hard to work with | | PDF was not designed to be editable, nor for anyone to "work | with" it in that way. | | It was designed (at least the original purpose circa 1989) to | represent printed pages electronically in a format that would | view and print identically everywhere. In fact, the initial | advertising for the "value" of the PDF format was exactly | this: no matter where a recipient viewed your PDF output, it | would look, and print, identically to everywhere else. 
| | It was originally meant to be "electronic paper". | dylan604 wrote: | Wasn't the PDF format based on the Illustrator format? | | The weird thing to me is people using a distribution format | as an original source. It's right up there with video | cameras shooting an acquisition source as an MP4 and all of | the negative baggage that comes with that. | userbinator wrote: | I believe Illustrator format is very similar to | PostScript. | mistrial9 wrote: | .. waves to Leonard Rosenthol | lucascacho wrote: | Every time I read about the hardships of interacting with the PDF | format, I gain more respect for Photopea, which has full PDF | editing support. | blincoln wrote: | The PDF specification is wild. My current favourite trivia is | that it supports all of Photoshop's layer blend modes for | rendering overlapping elements.[1] My second-favourite is that it | supports appended content that modifies earlier content, so one | should always look for forensic evidence in all distinct versions | represented in a given file.[2] | | It's also a fun example of the futility of DRM. The spec includes | password-based encryption, and allows for different "owner" and | "user" passwords. There's a bitfield with options for things like | "prevent printing", "prevent copying text", and so forth,[3] but | because reading the document necessarily involves decrypting it, | one can use the "user" password to open an encrypted PDF in a | non-compliant tool,[4] then save the unencrypted version to get | an editable equivalent. | | [1] "More than just transparency" section of | https://blog.adobe.com/en/publish/2022/01/31/20-years-of-tra... | | [2] https://blog.didierstevens.com/2008/05/07/solving-a- | little-p... | | [3] Page 61 of https://opensource.adobe.com/dc-acrobat-sdk- | docs/pdfstandard... | | [4] For example, a script that uses the pypdf library. 
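The permissions bitfield mentioned above is just a signed 32-bit integer stored under /P in the encryption dictionary, so decoding it takes only a few lines. A rough sketch, assuming the bit numbering as I recall it from the user access permissions table in the PDF 1.7 spec (bits counted from 1 at the low-order end); the flag names and `decode_permissions` are made up for illustration, so check the bit assignments against the spec before relying on them.

```python
# Bit positions (1 = lowest-order bit), as recalled from the PDF 1.7
# "user access permissions" table. The names are illustrative, not official.
PERMISSION_BITS = {
    3: "print",
    4: "modify",
    5: "copy_or_extract",
    6: "annotate",
    9: "fill_forms",
    10: "extract_for_accessibility",
    11: "assemble",
    12: "print_high_quality",
}


def decode_permissions(p: int) -> dict:
    """Map a /P value (signed 32-bit integer) to {permission name: allowed?}."""
    flags = p & 0xFFFFFFFF  # reinterpret the signed value as an unsigned bitfield
    return {
        name: bool(flags & (1 << (bit - 1)))
        for bit, name in PERMISSION_BITS.items()
    }


for value in (-4, -44, -3904):
    granted = [name for name, ok in decode_permissions(value).items() if ok]
    print(value, "->", granted or "nothing granted")
```

Nothing in the file enforces these flags. They are a request to the viewer, which is why a tool that ignores them can save an unrestricted copy once it has the user password.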
| aidos wrote: | To be fair, if you wanted to stop copying of text it would be | easiest just to drop the ToUnicode mapping against the fonts | and then it's a manual process for people to recreate them. | miki123211 wrote: | That also breaks search (and more importantly screen reader | accessibility), and if you're professionally required to | specifically produce PDFs with these security features | enabled, you're pretty likely to be working in a context | where that would be illegal. | userbinator wrote: | In the context of a format that was originally proprietary and | not widely available to everyone, and conceived in an era where | encryption was strongly controlled by export law, that sort of | security-by-obscurity was very common. Incidentally, a popular | cracking tutorial back then was to de-DRM the official reader | by patching the function that checks those permissions. | aardvark179 wrote: | Aren't the blend modes supported just the Porter-Duff | compositing modes? You might think that's overkill, but it's a | really good mapping of what other rendering pipelines offer and | it can really help reduce the work to produce a PDF. | pavlov wrote: | The original Porter-Duff compositing operators don't cover | Photoshop-style blending. Here's a link with pictures: | | http://ssp.impulsetrain.com/porterduff.html | | The Porter-Duff operators are appealingly rigorous and easy | to implement because they're simply the possible combinations | of a simple formula. But many of these operators are not very | useful either. | | The Photoshop blending modes are practically the opposite: | they are not derived from anything mathematically appealing, | it's really just a collection of algorithms that Photoshop's | designers originally found useful. 
They reflect the | limitations of their early 1990s desktop computer | implementations (for example, no attempt is made to account | for gamma correction when combining the layers, which makes | many of these operations behave very differently from actual | light that they mean to emulate). | crtified wrote: | This brings back horrible memories of working with large complex | maps back in the 2000s. Having various CAD and GIS applications | generate messy, inefficient spaghetti-coded PDF outputs - then | bouncing those PDFs around the Adobe apps of the time, to add | effects and other prettifications not available in the mapping | apps. | | It would reach the point where things would start to break, and | .... "good times were had, by all". | schlowmo wrote: | PDF is such a weird format. Not so long ago I had to write some | Java code for manipulating PDFs: find a string, remove it and | place an image at the former string position. I should have known | better as I thought "Well, how hard can that be?" | | What followed was a deep dive down the rabbit hole, a lot of | fiddling with the same tools the author of this gist is using | trying to make sense of it all. | | The final solution worked better than I thought while at the same | time felt incredibly wrong. | | I'm very thankful for all the (probably painful) work that went | into those open source PDF tools. | miki123211 wrote: | What people often miss about PDF is that it's closer to an image | format in some ways than to a Word document. Word documents, PDFs | and images are in document editing what DAW projects, midis and | mp3 files are in music and what Java source code, JVM bytecode | and pure x86 machine code are in software. | | The primary purpose of a PDF file is to tell you what to display | (or print), with perfect clarity, in much fewer bytes than an | actual image would take. 
It exploits the fact that the document | creator knows about patterns in the document structure that, if | expressed properly, make the document much more compressible than | anything that an actual image compression algorithm could | accomplish. For example, if you have access to the actual font, | it's better to say "put these characters at these coordinates | with that much spacing between them" than to include every | occurrence of every character as a part of the image, hoping that | the compression algorithm notices and compresses away the | repetitions. Things like what character is part of what word, or | even what unicode codepoint is mapped to which font glyph are | basically unimportant if all you're after is efficiently | transferring the image of a document. | | If you have an editable document, you care a lot more about the | semantics of the content, not just about its presentation. It | matters to you whether a particular break in the text is supposed | to be multiple spaces, the next column in a table or just a weird | page layout caused by an image being present. If you have some | text at the bottom of each page, you care whether that text was | put there by the document author multiple times, or whether it | was entered once and set as a footer. If you add a new paragraph | and have to change page layout, it matters to you that the last | paragraph on this page is a footnote and should not be moved to | the next one. If a section heading moves to another page, you | care about the fact that the table of contents should update | automatically and isn't just some text that the author has | manually entered. If you're a printer or screen, you care about | none of these things, you just print or display whatever you're | told to print or display. For a PDF, footnotes, section headings, | footers or tables of contents don't have to be special, they can | just be text with some meaningless formatting applied to it. 
This | is why making PDF work for any purpose which isn't displaying or | printing is never going to be 100% accurate. Of course, there are | efforts to remedy this, and a PDF-creating program is free to | include any metadata it sees fit, but it's by no means required | to do so. | | This isn't necessarily the mental model that the PDF authors had | in mind, but it's a useful way to look at PDF and understand why | it is the way it is. | eschaton wrote: | Anybody trying to do this is missing the point of PDF: It's a | _page-description format_ and therefore only represents the | _marks on a page_ , not _document structure_. | | One should not attempt to edit a PDF, one should edit the | document from which the PDF is generated. | lucb1e wrote: | I'll stop trying to edit PDFs when people stop sending me PDFs | that I want to edit. | | Somehow it became "unprofessional" to just send | meant-to-be-editable documents around for everyone to enjoy, so this is | where we end up... | louthy wrote: | > It's a page-description format and therefore only represents | the marks on a page, not document structure | | Maybe they should have called it 'Page Description Format' | then? Instead of 'Portable _Document_ Format' | layer8 wrote: | PDF does support incorporating information about the logical | document structure, aka Tagged PDF. It's optional, but | recommended for accessibility (e.g. PDF/UA). See chapters | 14.7-14.8 in [1]. | | [1] https://opensource.adobe.com/dc-acrobat-sdk- | docs/pdfstandard... | o1y32 wrote: | "should not" is meaningless here, because in the real world | there are tons of situations where people _want_ you to edit | PDF, one way or another | yair99dd wrote: | Inkscape+1.2 multipage support is Great for editing graphics and | text on PDFs ___________________________________________________________________ (page generated 2023-09-04 23:00 UTC)