[HN Gopher] What is the smallest possible valid PDF?
       ___________________________________________________________________
        
       What is the smallest possible valid PDF?
        
       Author : oftenwrong
       Score  : 114 points
       Date   : 2020-03-20 14:16 UTC (8 hours ago)
        
 (HTM) web link (stackoverflow.com)
 (TXT) w3m dump (stackoverflow.com)
        
       | curveto wrote:
       | Technically, the %PDF doesn't have to occur at byte 0. But, like
       | many others inferred, that pushes onward toward real world
       | structure (vs. an academically correct but useless PDF).
       | 
       | If you want a fully covered example it'll need a trailer, xref
       | and at least one obj reference. ...and there are TWO flavors of
       | those (linearized and not). ...and flate coded and not.
       | 
       | So, for a test harness you'd actually want a collection of small
       | files.
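A rough sketch of one file such a collection might hold: a non-linearized PDF with an uncompressed xref table, a trailer, and a few referenced objects. The function name `build_minimal_pdf`, the three-object layout, and the 3x3 MediaBox are illustrative assumptions rather than anything taken from the thread, and whether a given reader accepts the result has not been verified here.

```python
def build_minimal_pdf() -> bytes:
    """Assemble a tiny, non-linearized PDF with a classic (uncompressed) xref table."""
    # Hypothetical minimal object set: catalog -> page tree -> one empty page.
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /Resources << >> /MediaBox [0 0 3 3] >>",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset where "N 0 obj" begins
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_offset = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry 0 heads the free list
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # every entry is exactly 20 bytes

    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)


if __name__ == "__main__":
    pdf = build_minimal_pdf()
    print(len(pdf), "bytes")
```

Linearized and Flate-compressed variants would need separate fixtures; this sketch covers only the plain case.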
        
         | bronson wrote:
         | Allowing %PDF anywhere also leads to false positives.
         | 
         | https://github.com/minad/mimemagic/issues/4
         | 
         | https://github.com/minad/mimemagic/issues?q=is%3Aissue+is%3A...
        
       | sushisource wrote:
       | Can anyone explain to me why PDF persists as the most common
       | document format for "official" correspondence? It's absolute dog-
       | vomit of a format, just unbelievably overwrought and unfriendly.
       | I wince every time I have to sign one, or, god forbid, actually
       | fill in some form.
       | 
       | Is the explanation really as lame as "They were there first and
       | it stuck"?
        
         | ghaff wrote:
         | Because it became, as a descendant of PostScript, a pretty
         | dominant standard for situations where you wanted to specify
         | the layout.
         | 
         | There's certainly a lot of legacy embedded in PDF which doesn't
         | help.
         | 
         | >Is the explanation really as lame as "They were there first
         | and it stuck"?
         | 
         | I'm not sure they were first but they became dominant
         | because... Adobe. And Adobe made it an open standard which
         | pretty much sealed the deal.
        
         | est31 wrote:
         | Can you name a competing format that ended up being "better"
         | than PDF? Not intending to say there is no such format. I'm
         | genuinely curious.
        
           | tonyedgecombe wrote:
           | XPS is better in many ways, apart from its obvious failure in
           | the market.
        
         | crazygringo wrote:
         | Why do you wince when you have to sign one?
         | 
         | As an engineer I can see why you would find the spec inelegant.
         | 
         | But as an end-user, using it to fill in forms couldn't be
         | easier.
        
           | pbhjpbhj wrote:
           | What do you use for form filling? During job applications
           | last year nearly all the PDF forms were just images, or
           | vector boxes that you couldn't fill ...
        
       | simonw wrote:
       | There's actually a good practical use-case for this: you're
       | building software that needs to detect PDF files (as a smaller
       | detail of what it does, not its primary purpose) and you want to
       | include a tiny one in your unit tests.
       | 
       | I did that here with tiny images in JPEG, PNG and GIF
       | https://github.com/simonw/datasette-render-images/blob/maste...
        
         | eesmith wrote:
         | Doesn't a PDF detector only need to check if the first few
         | bytes are '%PDF-1.' ?
         | 
         | That is, do you need a "valid PDF" detector or "more likely a
         | PDF than anything else" detector?
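A minimal sketch of the "more likely a PDF than anything else" kind of check, in Python. The name `looks_like_pdf`, the `strict` flag, and the 1024-byte search window are assumptions made for illustration; the window reflects the commonly cited leniency of some readers, not something stated in this thread.

```python
def looks_like_pdf(data: bytes, strict: bool = True) -> bool:
    """Guess whether `data` is a PDF from its header marker alone.

    strict=True  : the marker must be at byte 0.
    strict=False : the marker may sit anywhere in the first 1024 bytes
                   (assumed window; more prone to false positives).
    """
    marker = b"%PDF-"
    if strict:
        return data.startswith(marker)
    return marker in data[:1024]


if __name__ == "__main__":
    print(looks_like_pdf(b"%PDF-1.4\n..."))               # True
    print(looks_like_pdf(b"junk%PDF-1.4\n..."))           # False
    print(looks_like_pdf(b"junk%PDF-1.4\n...", False))    # True
```

Neither mode says anything about whether the rest of the file parses; it only answers the detection question.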
        
           | rovr138 wrote:
           | Depends. If you need to parse it, that might still cause
           | errors.
        
             | eesmith wrote:
             | Agreed.
             | 
             | Though when I think of "detector" I think more of
             | https://en.wikipedia.org/wiki/File_(command) and not
             | something which verifies the file is in the correct format.
        
               | BiteCode_dev wrote:
               | Well...
               | 
               |   file 'setupTests.ts'
               |   setupTests.ts: Java source, ASCII text
               | 
               | I wouldn't put too much trust in file.
        
               | jtvjan wrote:
               | The heuristic used for Java source looks like this:
               | 
               | 0 regex \^import.*;$ Java source
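To see why that rule fires on a TypeScript file, here is a small sketch using Python's `re` module as a stand-in for the regex engine in file (the two may differ in details). The sample `setupTests.ts` content is hypothetical, since the thread does not show the real file.

```python
import re

# Same pattern as the magic rule, with MULTILINE so ^ and $ match per line.
JAVA_RULE = re.compile(r"^import.*;$", re.MULTILINE)


def flagged_as_java(source: str) -> bool:
    """Return True if any line starts with 'import' and ends with ';'."""
    return JAVA_RULE.search(source) is not None


# Hypothetical contents of a setupTests.ts file.
typescript_source = "import '@testing-library/jest-dom';\n"

if __name__ == "__main__":
    print(flagged_as_java(typescript_source))  # True
```

Any language whose import statements end in a semicolon trips the same rule, which is consistent with the misdetection shown above.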
        
               | klodolph wrote:
               | I don't know what you expected. File is just there to
               | give a good guess at a file's format. There are a ton of
               | reasons why this problem is hard, and there are reasons
               | to make "file" less accurate in order to make the
               | implementation simpler and more secure.
               | 
               | But it will work fine for PDFs, often enough.
        
         | meshy wrote:
         | That use is indeed exactly what inspired the question!
        
         | anonydsfsfs wrote:
         | Check out https://github.com/mathiasbynens/small
        
           | lioeters wrote:
           | That is indeed a nice one, "Smallest possible syntactically
           | valid files of different types".
           | 
           | Following the GitHub link (datasette-render-images) in the
           | comment you replied to, there's a code comment with a link to
           | the same library (small).
        
           | hombre_fatal wrote:
           | Kinda silly including 0-byte files which is most of them. Now
           | you're just building a list of things that take empty input.
           | 
           | They should move those to a simple list in the README and
           | reserve the repo for non-empty files.
        
       | LargoLasskhyfv wrote:
       | Not exactly on topic, but this reminds me of somewhere around the
       | year 2000, where I had to produce "documentation" in a hurry, in
       | under an hour before delivering and deploying some systems. I
       | _think_ I used Dia, but am not sure anymore. Anyways, all
       | essential information in DIN A4 landscape mode, nice diagrams
       | with network structure, IP numbers and so on, ready.
       | 
       | Now what? Remember it was around the year 2000, what to use?
       | Floppy disks of course! Saved it, looked at it and thought it had
       | gone wrong somehow because it listed as 4KB only.
       | 
       | Used another floppy, low-level formatted with fdformat to be sure,
       | taking minutes, hurry, hurry! Saved again.
       | 
       | 4KB!? WT..?
       | 
       | Booted up another system, loaded the PDF with different readers,
       | worked.
       | 
       | Shrugged and hoped it worked at the customer's site also, which it
       | did, they even said it looked nice and clear.
       | 
       | If only they knew...
        
       | jansan wrote:
       | If you need something to watch I recommend "Funky File Formats"
       | where Ange Albertini shows for example how one file can be valid
       | in multiple file formats at the same time. Really amazing:
       | https://media.ccc.de/v/31c3_-_5930_-_en_-_saal_6_-_201412291...
        
       | matja wrote:
       | I wonder if afl-fuzz could do better. Context:
       | https://lcamtuf.blogspot.com/2014/11/pulling-jpegs-out-of-th...
        
       | NanoWar wrote:
       | Here is a great collection of smallest X:
       | https://github.com/mathiasbynens/small
        
         | oftenwrong wrote:
         | I would love to see a write-up (like the SO answer) for each of
         | those.
        
       ___________________________________________________________________
       (page generated 2020-03-20 23:00 UTC)