[HN Gopher] New Ghostscript PDF interpreter
       ___________________________________________________________________
        
       New Ghostscript PDF interpreter
        
       Author : diskmuncher
       Score  : 148 points
       Date   : 2022-07-31 15:40 UTC (7 hours ago)
        
 (HTM) web link (www.ghostscript.com)
 (TXT) w3m dump (www.ghostscript.com)
        
       | mepian wrote:
       | "But Ghostscript's PDF interpreter was, as noted, written in
       | PostScript, and PostScript is not a great language for handling
       | error conditions and recovering."
       | 
       | Isn't C, their chosen replacement of PostScript, also
       | particularly bad at this?
        
         | daptaq wrote:
         | I'd say a language is bad at error handling if it doesn't let
         | you check if a procedure failed or not. What C does is that it
         | compiles even if you ignore this, which is a different issue.
         | Java, Rust, etc. wouldn't compile if you totally ignored it,
         | but that doesn't mean you have to do proper error handling,
         | beyond satisfying the compiler/type system.
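
A minimal sketch of the distinction drawn above, in C: the language lets a procedure report failure through its return value, but the caller can drop that value and the program still compiles. The names `parse_count`, `count_unchecked`, and `count_or_default` are hypothetical and only illustrate the point.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: parse a non-negative decimal integer.
 * Returns 0 and stores the value in *out on success, -1 on failure. */
static int parse_count(const char *text, long *out)
{
    assert(text != NULL && out != NULL);
    char *end = NULL;
    errno = 0;
    long value = strtol(text, &end, 10);
    if (errno != 0 || end == text || *end != '\0' || value < 0)
        return -1;
    *out = value;
    return 0;
}

/* Ignores the status: this compiles, but on failure n silently keeps
 * its initial value of 0. */
static long count_unchecked(const char *text)
{
    long n = 0;
    parse_count(text, &n);   /* return value dropped */
    return n;
}

/* Checks the status and falls back to a caller-supplied default. */
static long count_or_default(const char *text, long fallback)
{
    long n;
    if (parse_count(text, &n) != 0)
        return fallback;
    return n;
}
```

Both callers build; only the second one distinguishes "parsed 0" from "could not parse".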
        
           | ptx wrote:
           | Are there any languages that are bad at error handling then,
           | according to that definition? That don't let you return
           | values, set global flags, mutate arguments or in any other
           | way communicate back from a procedure?
        
         | colonwqbang wrote:
         | I also had a slight chuckle at this. However, I'm sure C is
         | still a great step up from Postscript.
         | 
         | It is however quite entertaining to read the predictable
         | comments from Rust/Java/C++ fans who are upset that they didn't
         | choose their favourite language.
        
       | forgotpwd16 wrote:
       | Surprised the decision wasn't made sooner.
        
       | vivegi wrote:
       | In the past when we had to use Ghostscript for PDF processing, we
       | always separated it out into its own process and added a whole
       | lot of error management externally.
       | 
       | Even if the application was fine, you would always encounter
       | PS/PDF files in the wild that kept stress-testing the
       | application's memory safety.
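
The "own process plus external error management" setup described above could look roughly like this on a POSIX system. This is a sketch under assumptions: `run_isolated` is a hypothetical helper, and the limits shown (CPU time and address space via `setrlimit`) only contain resource usage. They are not a security sandbox on their own.

```c
#include <assert.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run argv[0] in a child process with CPU-time and
 * address-space limits. Returns the child's exit status (0-255), or -1
 * if the child could not be created or was killed by a signal. */
static int run_isolated(char *const argv[], rlim_t cpu_seconds, rlim_t max_bytes)
{
    assert(argv != NULL && argv[0] != NULL);

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: apply limits, then replace the process image. */
        struct rlimit cpu = { cpu_seconds, cpu_seconds };
        struct rlimit mem = { max_bytes, max_bytes };
        if (setrlimit(RLIMIT_CPU, &cpu) != 0 || setrlimit(RLIMIT_AS, &mem) != 0)
            _exit(127);
        execvp(argv[0], argv);
        _exit(127);              /* only reached if exec failed */
    }

    /* Parent: a crash or limit violation in the child shows up here as a
     * signal or non-zero status instead of taking the application down. */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A caller would pass the interpreter's command line as `argv` (for example one starting with `gs`) and treat any result other than 0 as "this file could not be processed".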
        
       | diskmuncher wrote:
       | How interpreting PDF in Postscript became untenable
        
       | vintagedave wrote:
       | Given the mention of security issues in their custom PostScript
       | extensions, and that PDF files are often malformed, I wonder why
       | they chose C as the language for the new interpreter. I don't
       | want to write a typical HN comment ( _cough_ use Rust for
       | everything :)) but surely there is _some_ better language for
       | entirely new development of a secure and fast parser in 2022.
       | 
       | The post has no explanation of this choice. Does anyone know?
        
         | salmo wrote:
         | My guess is that since the rest of the project (not in PS
         | itself) is in C, it's in C. And it may be borrowing from the PS
         | interpreter codebase. I dunno.
         | 
         | Requiring another skillset, toolchain, etc. is onerous and has
         | to be weighed in those decisions. Rust is cool for sure, but
         | difficult to adopt in brownfield projects because of humans
         | more than tech.
         | 
         | Also, it wasn't written in 2022, just made the default now.
         | GS is a venerable codebase, and jumping on a "new" language
         | bandwagon may have seemed dangerous at the time it was started.
         | 
         | All conjecture. I'm not an expert or involved.
        
         | mkl95 wrote:
         | One reason may be that they want to build a high level wrapper
         | of that C API, something that is well documented in some
         | languages (e.g. Python)
        
         | lvh wrote:
         | We (Latacora) previously advised clients to encapsulate
         | GhostScript processing in something with a hard security
         | boundary (like a Lambda) and I am not expecting the new
         | implementation to change that.
        
         | h2odragon wrote:
         | I suspect they need portability more than most projects.
        
           | winter_blue wrote:
           | Are you kidding? Many other languages are as portable, if not
           | more portable.[a] Your point would be valid in 1972, not in
           | 2022. I can't believe you're regurgitating the same
           | "portability" from 50 years ago, today (unless you meant it
           | as a joke and forgot to include a /s).
           | 
           | [a] Languages targeting LLVM or supported by GCC are portable
           | to every target machine code / ISA / architecture supported
           | by those toolchains. JVM, JS, etc are portable to all the
           | platforms they support. You don't need to do any extra work
           | (of recompiling) if you use a bytecode VM / platform (for
           | example, like JVM).
        
             | mistrial9 wrote:
             | does an LLVM requirement fit the social and license goals
             | of this eco-system fundamental project?
        
             | zbentley wrote:
             | Well, there's portability and then there's portability.
             | Getting LLVM to emit artifacts on a given target is easy.
             | Getting assurance that big, complex interfaces that
             | integrate with the underlying OS in extremely specific ways
             | (e.g. your programming language's IO or concurrency system)
             | behave correctly on that target, and have appropriate
             | testing, community support, and documentation is another
             | thing entirely.
             | 
             | Like, I get it. The claim that "rust isn't portable" is
             | often used as a thought terminating cliche, and is often
             | wrong or irrelevant in context. But the claim "X uses LLVM,
             | LLVM can target environment Y, therefore X is fully
             | compatible with Y" is just as reductive and misleading.
        
         | jeffbee wrote:
         | WUFFS seems like a great option for this.
        
         | midislack wrote:
         | No, not more Rust activism. Please, anything but more of this.
         | Have some shame.
        
         | amluto wrote:
         | Beyond a lack of memory safety, C has another issue that makes
         | me dislike it for this kind of application: C has a very
         | minimal set of built in data structures. Combined with a lack
         | of generics, this means that using, say, a dictionary means
         | that quite a bit of the implementation gets hard coded into
         | every site that uses the dictionary. This is almost invariably
         | done with lots of pointers (since C has no better-constrained
         | reference type), and the result can be bug-prone and difficult
         | to refactor.
         | 
         | For all of C++'s faults, at least it's possible to use a map
         | (or unordered_set or whatever) and mostly avoid encoding the
         | fact that it's anything other than an associative container of
         | some sort at the call sites. This is especially true in C++11
         | or newer with auto.
        
           | SAI_Peregrinus wrote:
           | [WUFFS](https://github.com/google/wuffs) is made for stuff
           | like this, and it has a library available as transpiled C
           | code.
        
           | tgflynn wrote:
           | > this means that using, say, a dictionary means that quite a
           | bit of the implementation gets hard coded into every site
           | that uses the dictionary
           | 
           | I don't understand this part of your comment. There's nothing
           | preventing you from designing a nice well-encapsulated
           | map/dictionary data structure in C and I'm sure there are
           | many many libraries that do just that.
           | 
           | I do agree though that having such basic data structures in
           | the standard library, as modern C++ does, is usually
           | preferable.
        
             | simias wrote:
             | Lack of generics will do that, unless you consider that
             | blindly casting `void *` all over the place counts as
             | "well-encapsulated". Even with macro-soup designing a good
             | agnostic dictionary implementation for C is rather
             | challenging. Linked lists are _okay_ if you use something
             | like the kernel's list.h, but even then it's macro-heavy
             | and has its pitfalls.
             | 
             | In my work as an embedded developer I still use C a lot and
             | it's probably the programming language I know best and have
             | the most experience with but it would never cross my mind
             | to write a PDF interpreter in it unless I had a tremendous
             | reason to do so. There are so many better choices these
             | days.
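
For readers who have not seen the style being referred to, here is a minimal sketch of an intrusive singly linked list in the spirit of the kernel's `list.h`. It is not the kernel's code: `struct pdf_obj`, `list_push`, and `sum_ids` are hypothetical, and `container_of` is a simplified version without the kernel's type checks.

```c
#include <assert.h>
#include <stddef.h>

/* The link lives inside the element, so the list code never needs to
 * know the element type. */
struct list_node { struct list_node *next; };

/* Recover the enclosing struct from a pointer to its embedded member. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical element type. */
struct pdf_obj {
    int id;
    struct list_node link;
};

static void list_push(struct list_node **head, struct list_node *node)
{
    assert(head != NULL && node != NULL);
    node->next = *head;
    *head = node;
}

/* The pitfall: nothing checks that every node on this list really is
 * embedded in a struct pdf_obj. */
static int sum_ids(struct list_node *head)
{
    int total = 0;
    for (struct list_node *n = head; n != NULL; n = n->next)
        total += container_of(n, struct pdf_obj, link)->id;
    return total;
}
```

The list operations are reusable across types, but the cast hidden in `container_of` is exactly the kind of unchecked conversion the comment above is complaining about.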
        
               | tgflynn wrote:
               | Type safety and encapsulation are distinct issues. The
               | Linux kernel uses many well-encapsulated interfaces but
               | it's written in C and the typing reflects that
               | limitation.
               | 
               | Personally I haven't used straight C in years and would
               | never choose it over C++ unless platform constraints
               | required it, but a vast amount of very complex software
               | has been and continues to be written in C, including all
               | the widely used OS kernels, so I don't find it very
               | surprising that a new feature in a very old piece of
               | software would be written in it.
        
             | chrisseaton wrote:
             | > There's nothing preventing you from designing a nice
             | well-encapsulated map/dictionary data structure in C
             | 
             | When you write a set function for your map data structure,
             | what type do you make the key parameter?
        
               | rixed wrote:
               | size_t key_size, void *key
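
To make the `size_t key_size, void *key` answer concrete, here is a small sketch of a map whose set function takes keys as raw bytes. Everything in it (`struct map`, `map_set`, `map_get`, `map_free`, the linked-list storage) is a hypothetical illustration, not code from Ghostscript or any particular library.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct entry {
    struct entry *next;
    size_t key_size;
    void *key;     /* private copy of the caller's key bytes */
    void *value;   /* owned by the caller */
};

struct map { struct entry *head; };

static struct entry *map_find(const struct map *m, size_t key_size, const void *key)
{
    assert(m != NULL && key != NULL);
    for (struct entry *e = m->head; e != NULL; e = e->next)
        if (e->key_size == key_size && memcmp(e->key, key, key_size) == 0)
            return e;
    return NULL;
}

/* Returns 0 on success, -1 on allocation failure. */
static int map_set(struct map *m, size_t key_size, const void *key, void *value)
{
    struct entry *e = map_find(m, key_size, key);
    if (e != NULL) {
        e->value = value;
        return 0;
    }
    e = malloc(sizeof *e);
    if (e == NULL)
        return -1;
    e->key = malloc(key_size ? key_size : 1);
    if (e->key == NULL) {
        free(e);
        return -1;
    }
    memcpy(e->key, key, key_size);
    e->key_size = key_size;
    e->value = value;
    e->next = m->head;
    m->head = e;
    return 0;
}

static void *map_get(const struct map *m, size_t key_size, const void *key)
{
    struct entry *e = map_find(m, key_size, key);
    return e != NULL ? e->value : NULL;
}

static void map_free(struct map *m)
{
    while (m->head != NULL) {
        struct entry *e = m->head;
        m->head = e->next;
        free(e->key);
        free(e);
    }
}
```

The storage strategy is hidden behind three functions, but the compiler cannot catch a caller who passes the wrong `key_size` or mixes key types, and keys that are structs with padding bytes may compare unequal under `memcmp`.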
        
               | nextaccountic wrote:
               | And then eschew type safety
        
               | chrisseaton wrote:
               | > nice well-encapsulated
               | 
               | ...
               | 
               | > void *
        
               | tgflynn wrote:
               | Type safety and encapsulation aren't the same thing.
               | Encapsulation is about hiding implementation details from
               | the user of an API, which is what the comment I
               | originally replied to was claiming you couldn't do in C.
        
               | chrisseaton wrote:
               | The void * is (should have been!) an implementation
               | detail, and you're leaking it in the interface - that's
               | not encapsulation.
               | 
               | For example if I want to store a __int128 on a 64-bit
               | machine I'll have to deal with stuff like memory
               | allocation and lifetime myself, when the data structure
               | should do that.
        
               | mistrial9 wrote:
               | this is a pointer-based language so there are lots of
               | ways to solve that, but you know that already.. this is a
               | setup question.. of course its not useful to re-invent
               | critical, secure functions over and over yet, what if I
               | am not writing critical, secure functions anyway?
               | 
               | I would choose a key type that is natural to the
               | environment and problem.. unsigned integers are useful.
               | Which unsigned integer size? there are only a couple of
               | practical answers to that.. unless there is some massive
               | dataset, use a 32bit unsigned integer, like so much of
               | the software does right now.
        
               | thesz wrote:
               | Code from yalsat (stochastic SAT solver) [1] made me
               | learn something two years ago. I can declare an array of
               | some elements and make access to elements statically
               | typed. Same with maps, sets and others.
               | 
               | [1] https://github.com/msoos/yalsat/blob/main/yals.c#L49
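
One way to get statically typed element access out of C macros is sketched below. It is written in the spirit of the approach described above and is not copied from `yals.c`; the macro names `STACK`, `STACK_PUSH`, `STACK_AT`, and `STACK_FREE` are assumptions made for this example.

```c
#include <assert.h>
#include <stdlib.h>

/* Declares an anonymous struct whose element pointer has the real type T,
 * so reads and writes go through T rather than void *. */
#define STACK(T) struct { T *start; size_t count; size_t capacity; }

/* Appends a value, doubling the buffer when it is full. Aborts on
 * allocation failure to keep the sketch short. */
#define STACK_PUSH(s, value) do { \
    if ((s).count == (s).capacity) { \
        size_t new_cap_ = (s).capacity ? 2 * (s).capacity : 4; \
        void *grown_ = realloc((s).start, new_cap_ * sizeof *(s).start); \
        if (grown_ == NULL) abort(); \
        (s).start = grown_; \
        (s).capacity = new_cap_; \
    } \
    (s).start[(s).count++] = (value); \
} while (0)

/* Bounds-checked element access; note that i is evaluated twice. */
#define STACK_AT(s, i) (assert((size_t)(i) < (s).count), (s).start[(i)])

#define STACK_FREE(s) do { \
    free((s).start); \
    (s).start = NULL; \
    (s).count = (s).capacity = 0; \
} while (0)
```

With `STACK(int) ints = { NULL, 0, 0 };`, the expression `STACK_AT(ints, 0)` has type `int`, so assigning it to an incompatible pointer type is diagnosed by the compiler.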
        
           | Piezoid wrote:
           | Code reuse is achievable by (mis)using the preprocessor
           | system. It is possible to build a somewhat usable API, even
           | for intrusive data structures. (eg. the linux kernel and
           | klib[1])
           | 
           | I do agree that generics are required for modern programming,
           | but for some, the cost of complexity of modern languages
           | (compared to C) and the importance of compatibility seem to
           | outweigh the benefits.
           | 
           | [1]: http://attractivechaos.github.io/klib
        
         | MobiusHorizons wrote:
         | It looks like it needs to be interoperable with the rest of
         | their codebase, which was already written in C.
         | 
         | > The new PDF interpreter is written entirely in C, but
         | interfaces to the same underlying graphics library as the
         | existing PostScript interpreter. So operations in PDF should
         | render exactly the same as they always have (this is affected
         | slightly by differing numerical accuracy), all the same devices
         | that are currently supported by the Ghostscript family, and any
         | new ones in the future should work seamlessly.
        
           | Sytten wrote:
           | That is not an argument at least for rust since it's super
           | easy to consume and offer a C interface. I think it's more of
           | a shift in mentality that needs to occur.
        
             | MobiusHorizons wrote:
             | while it doesn't prevent rust from being used, it is still
             | a hurdle which must be overcome. Building and maintaining a
             | multi-language build system has significant costs,
             | especially with a project with as much history and wide use
             | as ghostscript.
        
             | dfox wrote:
             | It is so easy and well documented that first page of google
             | results for "rust autotools" does not contain anything
             | about how to integrate rust code into existing autotools
             | project.
             | 
             | Another issue is general subtle brokenness of rust tooling
             | on anything that is not linux on amd64.
        
         | asdff wrote:
         | I don't even actively code with rust, but just the fact
         | that it's been packaged as a dependency has been enough of a
         | headache for me. The latest issue is with some homebrew package
         | that has rust as a dependency. It turns out on macos mojave
         | rust needs to be built from source since there is no bottle. I
         | let it build for a full day and it still didn't finish
         | building, so I gave up. Then I installed rust independently
         | with rustup and successfully linked that install to brew, which
         | nearly worked, but failed with the cryptic "rustup could not
         | choose a version of cargo to run..." error that I can't make
         | any sense of, because the solution it gave for that error (to
         | download the latest stable release and set it as your toolchain
         | with 'rustup default stable') didn't do anything because that
         | was already done. The real salt on the wound is modern
         | google search bringing up nothing relevant.
        
       | neilv wrote:
       | Years back, I raised how evolved Ghostscript had been over a very
       | long time, together with the huge complexity of the PDF specs, as
       | a potential source of vulnerabilities.
       | 
       | (But maybe wasn't as much on people's radars, with all lower-
       | hanging fruit of other technology choices and practices going on,
       | outside of PDF.)
       | 
       | New code for a large spec is also interesting for potential
       | vulns, but maybe easier to get confidence about.
       | 
       | One neat direction they could go is to be considered more
       | trustworthy than the Adobe products. For example, if one is
       | thinking of a PDF engine as (among other purposes) supporting the
       | use case of a PDF viewer that's an agent of the interests of that
       | individual human user, then I suspect you're going to end up with
       | different attention and decisions affecting security (compared to
       | implementations from businesses focused on other goals).
       | 
       | (I say agent of the individual user, but that can also be aligned
       | with enterprise security, as an alternative to risk management
       | approaches that, e.g., ultimately will decide they're relying on
       | gorillas not to make it through the winter.)
        
         | asdff wrote:
         | Is there any work in this space on some oddball "contamination
         | protocol" type of security? Like you would assume everything is
         | contaminated and you do things that eliminate the potential for
         | cross contamination entirely, like they do in lab settings with
         | aseptic technique. In this case, it could mean printing out the
         | contaminated pdf on a system you don't care about being
         | contaminated, then scanning it with an airgapped scanner to
         | recover a 'sterile' pdf. It seems convoluted but I'm sure for
         | some applications that could be a good solution that requires
         | no improvement to pdf protocol.
        
           | neilv wrote:
           | I've heard of measures like that, including for the _other_
           | direction (i.e., redacting documents without leaking
           | information in the effectively opaque PDF format).
           | 
           | IMHO, having well-engineered tools handle data, and being
           | conservative about the trust/privileges given externally-
           | sourced data is at least complementary to the current "zero
           | trust" thinking among networks and nodes.
           | 
           | (Example: Does your spreadsheet really need arbitrary code
           | execution, in an imperfect sandbox, for all your nontechnical
           | users? Should what people might think is a self-contained
           | standalone text document file really phone home, to disclose
           | your activity and location, or have the potential to be
           | remotely memory-holed/disabled, along with attendant added
           | security risks from that added complexity and the additional
           | requirements it puts on host systems/tools to try to enforce
           | that questionable design?)
        
           | woodruffw wrote:
           | DARPA is funding fundamental research in this space,
           | specifically through programs like SafeDocs[1].
           | 
           | [1]: https://www.darpa.mil/program/safe-documents
        
       | aidos wrote:
       | Does anyone know much about the Artifex team? How big it is etc?
       | 
       | They seem to be the kings of working with PDFs. I've not really
       | looked at the Ghostscript code (and I'm surprised to hear their
       | interpreter was still in postscript), but I've looked through the
       | mupdf code and what I saw was really nice.
       | 
       | In any case, I appreciate the work they've done in providing
       | fantastic tools to the world for decades now.
        
         | petilon wrote:
         | I don't know the current team, but I have met its founder: L.
         | Peter Deutsch [1].
         | 
         | James Gosling, inventor of Java, once described him as the
         | "greatest programmer in the world". They both used to work at
         | Sun Microsystems.
         | 
         | [1] https://en.wikipedia.org/wiki/L._Peter_Deutsch
        
           | skemper911 wrote:
           | Three of the greatest programmers I've experienced worked
           | there, Peter, Tor, Raph. Hats off.
        
         | madmoose wrote:
         | Strangely this appears to be a new implementation not based on
         | MuPDF, so Artifex now has two implementations of a PDF
         | interpreter.
         | 
         | I wonder what made them decide to reimplement it instead of
         | reusing their existing code.
        
       | toddm wrote:
       | Ghostscript (well, gv) got me through the 1990s and beyond as
       | part of my TeX -> dvips -> gv workflow.
       | 
       | Kudos and thank you to those who maintain it and the associated
       | packages!
        
       | lordfosco wrote:
       | Most important part of the announcement - you can still revert
       | back to the former interpreter by setting the `-dNEWPDF=false`
       | flag.
       | 
       | While progress is always nice to see - I am also pleased that we
       | don't necessarily need to update all the scripts that depend on
       | ghostscript at once but can keep them running in their current
       | state.
        
         | ris wrote:
         | It's particularly fun for them to introduce this in a point
         | release. If this didn't warrant a major version bump I'm
         | frankly not sure what would.
        
       | mkl wrote:
       | > As time has gone on, and we have encountered more and more PDF
       | files with ever more unexpected deviations from the specification
       | 
       | Does anyone know of a collection of malformed PDF files? It would
       | be useful for testing PDF processing programs.
        
         | mdaniel wrote:
         | I wasn't able to readily find any collections, and searching
         | for anything plus the keyword "pdf" returns links to articles
         | _written in_ pdf
         | 
         | That said, this GitHub topic may have some pointers:
         | https://github.com/topics/malware-samples
        
         | svat wrote:
         | There are some here, as test files in the qpdf library:
         | https://github.com/qpdf/qpdf/tree/main/qpdf/qtest/qpdf
         | 
         | (But still, note: A couple of months ago I wrote a low-level
         | PDF parser--just parse the PDF file's bytes into PDF objects,
         | nothing more--and fed it all the PDF files that happened to be
         | present on my laptop, and ran into some files that (some) PDF
         | viewers open, but even qpdf doesn't. I say "even" because qpdf
         | is really good IMO.)
        
       | vfclists wrote:
       | Using C sounds like it will bring a whole new list of exploits
       | with it.
       | 
       | Not good!!
        
         | vodou wrote:
         | C is not inherently unsafe. Sure, it hasn't "memory safety" as
         | a feature. But there are loads of applications considered safe
         | written in C. An experienced C programmer (with the help of
         | tooling) can write safe C code. It is not impossible.
        
           | c7DJTLrn wrote:
           | That would explain all the vulnerabilities in systemd and
           | Linux. They just aren't experienced enough. Linus needs to
           | get in touch with an expert.
        
             | tinus_hn wrote:
             | I'm looking forward to your efforts in rewriting it in Rust
        
               | tptacek wrote:
               | So is everyone else! Can't happen soon enough.
        
           | vfclists wrote:
           | I guess "experienced C programmers" must be in short supply
           | although they have been writing C for years.
        
           | jcranmer wrote:
           | SQLite is the most stringently developed C code I'm aware of
           | --the test suite maintains 100% branch coverage, is routinely
           | run through all of the sanitizers, and it is regularly
           | fuzzed.
           | 
           | It _still_ accumulates CVEs:
           | https://www.sqlite.org/cves.html.
        
             | vodou wrote:
             | Are you aware of a way to develop fault free code? Please
             | share this knowledge then, please.
        
               | jcranmer wrote:
               | It's easy to develop fault-free code: just redefine all
               | those faults as (undocumented) features!
               | 
               | That's not a helpful answer, but it's basically the same
               | thing you're doing--redefining memory safety
               | vulnerabilities that would be precluded entirely by
               | writing in memory-safe languages as programmer faults.
        
               | tptacek wrote:
               | He's aware of a way to develop memory-corruption-fault
               | free code, obviously.
        
         | WesolyKubeczek wrote:
         | Of course, let's better use a PostScript interpreter also
         | written in C, so your exploits leveraging both at least look
         | like art.
        
         | midislack wrote:
         | Stop this.
        
       | kisamoto wrote:
       | Not sure why this is being posted now as this is from March...
       | 
       | But anyway - I understand why they have changed their interpreter
       | however the lack of major version bump threw me off. I use ps2pdf
       | to optimize pdfs (long story short - makes their size smaller)
       | and was alarmed when my pdfs suddenly ended up without the jpeg
       | backgrounds. Instead, purely black (although this did result in a
       | very small file size so who knows... :) )
       | 
       | Thankfully you can add `-dNEWPDF=false` to your command to use
       | the old parser. I'm yet to submit a bug report but it would be
       | nice if it was backwards compatible...
        
       ___________________________________________________________________
       (page generated 2022-07-31 23:00 UTC)