[HN Gopher] New Ghostscript PDF interpreter ___________________________________________________________________ Author : diskmuncher Score : 148 points Date : 2022-07-31 15:40 UTC (7 hours ago) (HTM) web link (www.ghostscript.com) | mepian wrote: | "But Ghostscript's PDF interpreter was, as noted, written in | PostScript, and PostScript is not a great language for handling | error conditions and recovering." | | Isn't C, their chosen replacement for PostScript, also | particularly bad at this? | daptaq wrote: | I'd say a language is bad at error handling if it doesn't let | you check if a procedure failed or not. What C does is that it | compiles even if you ignore this, which is a different issue. | Java, Rust, etc. wouldn't compile if you totally ignored it, | but that doesn't mean you have to do proper error handling, | beyond satisfying the compiler/type system. | ptx wrote: | Are there any languages that are bad at error handling then, | according to that definition? That don't let you return | values, set global flags, mutate arguments or in any other | way communicate back from a procedure? | colonwqbang wrote: | I also had a slight chuckle at this. However, I'm sure C is | still a great step up from PostScript. | | It is however quite entertaining to read the predictable | comments from Rust/Java/C++ fans who are upset that they didn't | choose their favourite language. | forgotpwd16 wrote: | Surprised the decision wasn't made sooner. | vivegi wrote: | In the past when we had to use Ghostscript for PDF processing, we | always separated it out into its own process and added a whole | lot of error management externally. | | Even if the application was fine, you would always encounter | PS/PDF files in the wild that kept stress-testing the | application's memory safety. 
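The "separate process" approach vivegi describes could look roughly like the C sketch below. It is a minimal, POSIX-only illustration and not what any commenter actually ran: `run_confined` and `cap_limit` are hypothetical names, and the CPU and address-space caps are only a cheap safeguard, not a substitute for a hard security boundary.

```c
#include <assert.h>
#include <errno.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: lower one resource limit for the calling process.
 * It never tries to raise the hard limit; the soft limit is set to the
 * resulting hard limit. */
static int cap_limit(int resource, rlim_t value)
{
    struct rlimit lim;
    if (getrlimit(resource, &lim) != 0)
        return -1;
    if (lim.rlim_max == RLIM_INFINITY || value < lim.rlim_max)
        lim.rlim_max = value;
    lim.rlim_cur = lim.rlim_max;
    return setrlimit(resource, &lim);
}

/* Hypothetical helper: run argv[0] (looked up on PATH) in a child process
 * with capped CPU time and address space.
 * Returns the child's exit status (0-255), 127 if the limits could not be
 * applied or the program could not be executed, and -1 if fork/waitpid
 * failed or the child was killed by a signal (for example SIGSEGV, or
 * SIGXCPU after exhausting its CPU budget). */
int run_confined(char *const argv[], rlim_t cpu_seconds, rlim_t max_bytes)
{
    assert(argv != NULL && argv[0] != NULL);

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: apply the limits, then replace ourselves with the tool. */
        if (cap_limit(RLIMIT_CPU, cpu_seconds) != 0 ||
            cap_limit(RLIMIT_AS, max_bytes) != 0)
            _exit(127);
        execvp(argv[0], argv);
        _exit(127); /* only reached if execvp failed */
    }

    /* Parent: wait for the child, retrying if a signal interrupts us. */
    int status;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A caller would pass a complete Ghostscript command line as `argv` and treat any non-zero result as "this file could not be processed", so that a crash or runaway job on a malformed file never takes the main application down with it.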
| diskmuncher wrote: | How interpreting PDF in PostScript became untenable | vintagedave wrote: | Given the mention of security issues in their custom PostScript | extensions, and that PDF files are often malformed, I wonder why | they chose C as the language for the new interpreter. I don't | want to write a typical HN comment ( _cough_ use Rust for | everything :)) but surely there is _some_ better language for | entirely new development of a secure and fast parser in 2022. | | The post has no explanation of this choice. Does anyone know? | salmo wrote: | My guess is that since the rest of the project (not in PS | itself) is in C, it's in C. And it may be borrowing from the PS | interpreter codebase. I dunno. | | Requiring another skillset, toolchain, etc. is onerous and has | to be weighed in those decisions. Rust is cool for sure, but | difficult to adopt in brownfield projects because of humans | more than tech. | | Also, it wasn't written in 2022, just made the default now. | GS is a venerable codebase, and jumping on a "new" language | bandwagon may have seemed dangerous at the time it was started. | | All conjecture. I'm not an expert or involved. | mkl95 wrote: | One reason may be that they want to build a high level wrapper | of that C API, something that is well documented in some | languages (e.g. Python) | lvh wrote: | We (Latacora) previously advised clients to encapsulate | Ghostscript processing in something with a hard security | boundary (like a Lambda) and I am not expecting the new | implementation to change that. | h2odragon wrote: | I suspect they need portability more than most projects. | winter_blue wrote: | Are you kidding? Many other languages are as portable, if not | more portable.[a] Your point would be valid in 1972, not in | 2022. I can't believe you're regurgitating the same | "portability" argument from 50 years ago, today (unless you meant it | as a joke and forgot to include a /s). 
| | [a] Languages targeting LLVM or supported by GCC are portable | to every target machine code / ISA / architecture supported | by those toolchains. JVM, JS, etc are portable to all the | platforms they support. You don't need to do any extra work | (of recompiling) if you use a bytecode VM / platform (for | example, like JVM). | mistrial9 wrote: | does an LLVM requirement fit the social and license goals | of this eco-system fundamental project? | zbentley wrote: | Well, there's portability and then there's portability. | Getting LLVM to emit artifacts on a given target is easy. | Getting assurance that big, complex interfaces that | integrate with the underlying OS in extremely specific ways | (i.e. your programming language's IO or concurrency system) | behave correctly on that target, and have appropriate | testing, community support, and documentation is another | thing entirely. | | Like, I get it. The claim that "rust isn't portable" is | often used as a thought terminating cliche, and is often | wrong or irrelevant in context. But the claim "X uses LLVM, | LLVM can target environment Y, therefore X is fully | compatible with Y" is just as reductive and misleading. | jeffbee wrote: | WUFFS seems like a great option for this. | midislack wrote: | No, not more Rust activism. Please, anything but more of this. | Have some shame. | amluto wrote: | Beyond a lack of memory safety, C has another issue that makes | me dislike it for this kind of application: C has a very | minimal set of built in data structures. Combined with a lack | of generics, this means that using, say, a dictionary means | that quite a bit of the implementation gets hard coded into | every site that uses the dictionary. This is almost invariably | done with lots of pointers (since C has no better-constrained | reference type), and the result can be bug-prone and difficult | to refactor. 
| | For all of C++'s faults, at least it's possible to use a map | (or unordered_set or whatever) and mostly avoid encoding the | fact that it's anything other than an associative container of | some sort at the call sites. This is especially true in C++11 | or newer with auto. | SAI_Peregrinus wrote: | [WUFFS](https://github.com/google/wuffs) is made for stuff | like this, and it has a library available as transpiled C | code. | tgflynn wrote: | > this means that using, say, a dictionary means that quite a | bit of the implementation gets hard coded into every site | that uses the dictionary | | I don't understand this part of your comment. There's nothing | preventing you from designing a nice well-encapsulated | map/dictionary data structure in C and I'm sure there are | many many libraries that do just that. | | I do agree though that having such basic data structures in | the standard library, as modern C++ does, is usually | preferable. | simias wrote: | Lack of generics will do that, unless you consider that | blindly casting `void *` all over the place counts as | "well-encapsulated". Even with macro-soup designing a good | agnostic dictionary implementation for C is rather | challenging. Linked lists are *okay* if you use something | like the kernel's list.h, but even then it's macro-heavy | and has its pitfalls. | | In my work as an embedded developer I still use C a lot and | it's probably the programming language I know best and have | the most experience with but it would never cross my mind | to write a PDF interpreter in it unless I had a tremendous | reason to do so. There are so many better choices these | days. | tgflynn wrote: | Type safety and encapsulation are distinct issues. The | Linux kernel uses many well-encapsulated interfaces but | it's written in C and the typing reflects that | limitation. 
| | Personally I haven't used straight C in years and would | never choose it over C++ unless platform constraints | required it, but a vast amount of very complex software | has been and continues to be written in C, including all | the widely used OS kernels, so I don't find it very | surprising that a new feature in a very old piece of | software would be written in it. | chrisseaton wrote: | > There's nothing preventing you from designing a nice | well-encapsulated map/dictionary data structure in C | | When you write a set function for your map data structure, | what type do you make the key parameter? | rixed wrote: | size_t key_size, void *key | nextaccountic wrote: | And then eschew type safety | chrisseaton wrote: | > nice well-encapsulated | | ... | | > void * | tgflynn wrote: | Type safety and encapsulation aren't the same thing. | Encapsulation is about hiding implementation details from | the user of an API, which is what the comment I | originally replied to was claiming you couldn't do in C. | chrisseaton wrote: | The void * is (should have been!) an implementation | detail, and you're leaking it in the interface - that's | not encapsulation. | | For example if I want to store a __int128 on a 64-bit | machine I'll have to deal with stuff like memory | allocation and lifetime myself, when the data structure | should do that. | mistrial9 wrote: | this is a pointer-based language so there are lots of | ways to solve that, but you know that already.. this is a | setup question.. of course it's not useful to re-invent | critical, secure functions over and over yet, what if I | am not writing critical, secure functions anyway? | | I would choose a key type that is natural to the | environment and problem.. unsigned integers are useful. | Which unsigned integer size? there are only a couple of | practical answers to that.. unless there is some massive | dataset, use a 32-bit unsigned integer, like so much of | the software does right now. 
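To make the `size_t key_size, void *key` exchange above concrete, here is a minimal sketch of such a map in C. Every name in it (`bytemap`, `bytemap_set`, and so on) is invented for illustration and does not come from Ghostscript or any commenter's code. The map copies key and value bytes, so it owns allocation and lifetime, but the interface is still untyped: nothing stops a caller from storing a `double` and reading back an `int`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* In a real project only this typedef and the prototypes would be in the
 * header; the struct definitions below would stay private to the .c file. */
typedef struct bytemap bytemap;

struct entry {
    struct entry *next;
    size_t key_size, val_size;
    unsigned char data[];          /* key bytes, then value bytes */
};

struct bytemap {
    struct entry **buckets;
    size_t nbuckets;
};

/* FNV-1a hash over the raw key bytes. */
static uint64_t hash_bytes(const void *key, size_t key_size)
{
    const unsigned char *p = key;
    uint64_t h = 14695981039346656037ULL;
    for (size_t i = 0; i < key_size; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    return h;
}

bytemap *bytemap_new(size_t nbuckets)
{
    bytemap *m = malloc(sizeof *m);
    if (!m)
        return NULL;
    m->nbuckets = nbuckets ? nbuckets : 16;
    m->buckets = calloc(m->nbuckets, sizeof *m->buckets);
    if (!m->buckets) {
        free(m);
        return NULL;
    }
    return m;
}

/* Returns the link that points at the matching entry, or the NULL link at
 * the end of the bucket's chain if the key is absent. */
static struct entry **find_slot(bytemap *m, const void *key, size_t key_size)
{
    struct entry **pp = &m->buckets[hash_bytes(key, key_size) % m->nbuckets];
    while (*pp && ((*pp)->key_size != key_size ||
                   memcmp((*pp)->data, key, key_size) != 0))
        pp = &(*pp)->next;
    return pp;
}

/* Copies key and value into the map, replacing any existing entry.
 * Returns 0 on success, -1 if memory could not be allocated. */
int bytemap_set(bytemap *m, const void *key, size_t key_size,
                const void *val, size_t val_size)
{
    assert(m && key && val);
    struct entry *e = malloc(sizeof *e + key_size + val_size);
    if (!e)
        return -1;
    e->key_size = key_size;
    e->val_size = val_size;
    memcpy(e->data, key, key_size);
    memcpy(e->data + key_size, val, val_size);

    struct entry **pp = find_slot(m, key, key_size);
    e->next = *pp ? (*pp)->next : NULL;
    free(*pp);                     /* drop the old entry, if there was one */
    *pp = e;
    return 0;
}

/* Returns a pointer to the stored value bytes (owned by the map), or NULL.
 * The pointer may not be aligned for the value's type: memcpy it out. */
const void *bytemap_get(bytemap *m, const void *key, size_t key_size,
                        size_t *val_size)
{
    assert(m && key);
    struct entry *e = *find_slot(m, key, key_size);
    if (!e)
        return NULL;
    if (val_size)
        *val_size = e->val_size;
    return e->data + e->key_size;
}

void bytemap_free(bytemap *m)
{
    if (!m)
        return;
    for (size_t i = 0; i < m->nbuckets; i++) {
        for (struct entry *e = m->buckets[i], *next; e; e = next) {
            next = e->next;
            free(e);
        }
    }
    free(m->buckets);
    free(m);
}
```

Both sides of the argument show up in this sketch: the representation is hidden behind a handful of functions, yet every call site still has to pass sizes by hand and copy values out through `const void *`, which is exactly the part a generic container in C++ or Rust would check at compile time.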
| thesz wrote: | Code from yalsat (stochastic SAT solver) [1] made me | learn something two years ago. I can declare an array of | some elements and make access to elements statically | typed. Same with maps, sets and others. | | [1] https://github.com/msoos/yalsat/blob/main/yals.c#L49 | Piezoid wrote: | Code reuse is achievable by (mis)using the preprocessor | system. It is possible to build a somewhat usable API, even | for intrusive data structures. (e.g. the Linux kernel and | klib[1]) | | I do agree that generics are required for modern programming, | but for some, the cost of complexity of modern languages | (compared to C) and the importance of compatibility seem to | outweigh the benefits. | | [1]: http://attractivechaos.github.io/klib | MobiusHorizons wrote: | It looks like it needs to be interoperable with the rest of their | codebase, which was already written in C | | > The new PDF interpreter is written entirely in C, but | interfaces to the same underlying graphics library as the | existing PostScript interpreter. So operations in PDF should | render exactly the same as they always have (this is affected | slightly by differing numerical accuracy), all the same devices | that are currently supported by the Ghostscript family, and any | new ones in the future should work seamlessly. | [deleted] | Sytten wrote: | That is not an argument, at least for Rust, since it's super | easy to consume and offer a C interface. I think it's more of | a shift in mentality that needs to occur. | MobiusHorizons wrote: | While it doesn't prevent Rust from being used, it is still | a hurdle which must be overcome. Building and maintaining a | multi-language build system has significant costs, | especially with a project with as much history and wide use | as Ghostscript. | dfox wrote: | It is so easy and well documented that the first page of Google | results for "rust autotools" does not contain anything | about how to integrate Rust code into an existing autotools | project. 
| | Another issue is the general subtle brokenness of Rust tooling | on anything that is not Linux on amd64. | asdff wrote: | I don't even actively code with Rust but just the fact | that it's been packaged as a dependency has been enough of a | headache for me. The latest issue is with some Homebrew package | that has Rust as a dependency. It turns out on macOS Mojave | Rust needs to be built from source since there is no bottle. I | let it build for a full day and it still didn't finish | building, so I gave up. Then I installed Rust independently | with rustup and successfully linked that install to brew, which | nearly worked, but failed with the cryptic "rustup could not | choose a version of cargo to run..." error that I can't make | any sense of, because the solution it gave for that error (download | the latest stable release and set it as your toolchain | with 'rustup default stable') didn't do anything because that | was already done. The real salt in the wound is modern | Google search bringing up nothing relevant. | [deleted] | neilv wrote: | Years back, I raised how evolved Ghostscript had been over a very | long time, together with the huge complexity of the PDF specs, as | a potential source of vulnerabilities. | | (But maybe it wasn't as much on people's radars, with all the lower- | hanging fruit of other technology choices and practices going on, | outside of PDF.) | | New code for a large spec is also interesting for potential | vulns, but maybe easier to get confidence about. | | One neat direction they could go is to be considered more | trustworthy than the Adobe products. For example, if one is | thinking of a PDF engine as (among other purposes) supporting the | use case of a PDF viewer that's an agent of the interests of that | individual human user, then I suspect you're going to end up with | different attention and decisions affecting security (compared to | implementations from businesses focused on other goals). 
| | (I say agent of the individual user, but that can also be aligned | with enterprise security, as an alternative to risk management | approaches that, e.g., ultimately will decide they're relying on | gorillas not to make it through the winter.) | asdff wrote: | Is there any work in this space on some oddball "contamination | protocol" type of security? Like you would assume everything is | contaminated and you do things that eliminate the potential for | cross contamination entirely, like they do in lab settings with | aseptic technique. In this case, it could mean printing out the | contaminated pdf on a system you don't care about being | contaminated, then scanning it with an airgapped scanner to | recover a 'sterile' pdf. It seems convoluted but I'm sure for | some applications that could be a good solution that requires | no improvement to the PDF format. | neilv wrote: | I've heard of measures like that, including for the _other_ | direction (i.e., redacting documents without leaking | information in the effectively opaque PDF format). | | IMHO, having well-engineered tools handle data, and being | conservative about the trust/privileges given to externally- | sourced data is at least complementary to the current "zero | trust" thinking among networks and nodes. | | (Example: Does your spreadsheet really need arbitrary code | execution, in an imperfect sandbox, for all your nontechnical | users? Should what people might think is a self-contained | standalone text document file really phone home, to disclose | your activity and location, or have the potential to be | remotely memory-holed/disabled, along with attendant added | security risks from that added complexity and the additional | requirements it puts on host systems/tools to try to enforce | that questionable design?) | woodruffw wrote: | DARPA is funding fundamental research in this space, | specifically through programs like SafeDocs[1]. 
| | [1]: https://www.darpa.mil/program/safe-documents | aidos wrote: | Does anyone know much about the Artifex team? How big it is etc? | | They seem to be the kings of working with PDFs. I've not really | looked at the Ghostscript code (and I'm surprised to hear their | interpreter was still in postscript), but I've looked through the | mupdf code and what I saw was really nice. | | In any case, I appreciate the work they've done in providing | fantastic tools to the world for decades now. | petilon wrote: | I don't know the current team, but I have met its founder: L. | Peter Deutsch [1]. | | James Gosling, inventor of Java, once described him as the | "greatest programmer in the world". They both used to work at | Sun Microsystems. | | [1] https://en.wikipedia.org/wiki/L._Peter_Deutsch | skemper911 wrote: | Three of the greatest programmers I've experienced worked | there, Peter, Tor, Raph. Hats off. | madmoose wrote: | Strangely this appears to be a new implementation not based on | MuPDF, so Artifex now has two implementations of a PDF | interpreter. | | I wonder what made them decide to reimplement it instead of | reusing their existing code. | toddm wrote: | Ghostscript (well, gv) got me through the 1990s and beyond as | part of my TeX -> dvips -> gv workflow. | | Kudos and thank you to those who maintain it and the associated | packages! | lordfosco wrote: | Most important part of the announcement - you can still revert | back to the former interpreter by setting the `-dNEWPDF=false` | flag. | | While progress is always nice to see - I am also pleased that we | don't necessarily need to update all the scripts that depend on | ghostscript at once but can keep them running in their current | state. | ris wrote: | It's particularly fun for them to introduce this in a point | release. If this didn't warrant a major version bump I'm | frankly not sure what would. 
| [deleted] | mkl wrote: | > As time has gone on, and we have encountered more and more PDF | files with ever more unexpected deviations from the specification | | Does anyone know of a collection of malformed PDF files? It would | be useful for testing PDF processing programs. | mdaniel wrote: | I wasn't able to readily find any collections, and searching | for anything plus the keyword "pdf" returns links to articles | _written in_ pdf | | That said, this GitHub topic may have some pointers: | https://github.com/topics/malware-samples | svat wrote: | There are some here, as test files in the qpdf library: | https://github.com/qpdf/qpdf/tree/main/qpdf/qtest/qpdf | | (But still, note: A couple of months ago I wrote a low-level | PDF parser--just parse the PDF file's bytes into PDF objects, | nothing more--and fed it all the PDF files that happened to be | present on my laptop, and ran into some files that (some) PDF | viewers open, but even qpdf doesn't. I say "even" because qpdf | is really good IMO.) | vfclists wrote: | Using C sounds like it will bring a whole new list of exploits | with it. | | Not good!! | vodou wrote: | C is not inherently unsafe. Sure, it doesn't have "memory safety" as | a feature. But there are loads of applications considered safe | written in C. An experienced C programmer (with the help of | tooling) can write safe C code. It is not impossible. | c7DJTLrn wrote: | That would explain all the vulnerabilities in systemd and | Linux. They just aren't experienced enough. Linus needs to | get in touch with an expert. | tinus_hn wrote: | I'm looking forward to your efforts in rewriting it in Rust | tptacek wrote: | So is everyone else! Can't happen soon enough. | vfclists wrote: | I guess "experienced C programmers" must be in short supply | although they have been writing C for years. 
| jcranmer wrote: | SQLite is the most stringently developed C code I'm aware of | --the test suite maintains 100% branch coverage, it is routinely | run through all of the sanitizers, and it is regularly | fuzzed. | | It _still_ accumulates CVEs: | https://www.sqlite.org/cves.html. | vodou wrote: | Are you aware of a way to develop fault free code? Please | share this knowledge then. | jcranmer wrote: | It's easy to develop fault-free code: just redefine all | those faults as (undocumented) features! | | That's not a helpful answer, but it's basically the same | thing you're doing--redefining memory safety | vulnerabilities that would be precluded entirely by | writing in memory-safe languages as programmer faults. | tptacek wrote: | He's aware of a way to develop memory-corruption-fault | free code, obviously. | WesolyKubeczek wrote: | Of course, let's better use a PostScript interpreter also | written in C, so your exploits leveraging both at least look | like art. | midislack wrote: | Stop this. | kisamoto wrote: | Not sure why this is being posted now as this is from March... | | But anyway - I understand why they have changed their interpreter, | however the lack of a major version bump threw me off. I use ps2pdf | to optimize pdfs (long story short - makes their size smaller) | and was alarmed when my pdfs suddenly ended up without the jpeg | backgrounds. Instead, purely black (although this did result in a | very small file size so who knows... :) ) | | Thankfully you can add `-d NEWPDF=false` to your command to use | the old parser. I've yet to submit a bug report but it would be | nice if it was backwards compatible... ___________________________________________________________________ (page generated 2022-07-31 23:00 UTC)