[HN Gopher] What is the smallest possible valid PDF?
___________________________________________________________________
What is the smallest possible valid PDF?

Author : oftenwrong
Score  : 114 points
Date   : 2020-03-20 14:16 UTC (8 hours ago)

(HTM) web link (stackoverflow.com)
(TXT) w3m dump (stackoverflow.com)

| curveto wrote:
| Technically, the %PDF doesn't have to occur at byte 0. But, like
| many others inferred, that pushes onward toward real world
| structure (vs. an academically correct but useless PDF).
|
| If you want a fully covered example it'll need a trailer, xref
| and at least one obj reference. ...and there are TWO flavors of
| those (linearized and not). ...and flate coded and not.
|
| So, for a test harness you'd actually want a collection of small
| files.
| bronson wrote:
| Allowing %PDF anywhere also leads to false positives.
|
| https://github.com/minad/mimemagic/issues/4
|
| https://github.com/minad/mimemagic/issues?q=is%3Aissue+is%3A...
| [deleted]
| sushisource wrote:
| Can anyone explain to me why PDF persists as the most common
| document format for "official" correspondence? It's absolute
| dog-vomit of a format, just unbelievably overwrought and
| unfriendly. I wince every time I have to sign one, or, god
| forbid, actually fill in some form.
|
| Is the explanation really as lame as "They were there first and
| it stuck"?
| ghaff wrote:
| Because it became, as a descendant of PostScript, a pretty
| dominant standard for situations where you wanted to specify
| the layout.
|
| There's certainly a lot of legacy embedded in PDF which doesn't
| help.
|
| >Is the explanation really as lame as "They were there first
| and it stuck"?
|
| I'm not sure they were first but they became dominant
| because... Adobe. And Adobe made it an open standard which
| pretty much sealed the deal.
| est31 wrote:
| Can you name a competing format that ended up being "better"
| than PDF? Not intending to say there is no such format. I'm
| genuinely curious.
| tonyedgecombe wrote:
| XPS is better in many ways, apart from its obvious failure in
| the market.
| crazygringo wrote:
| Why do you wince when you have to sign one?
|
| As an engineer I can see why you would find the spec inelegant.
|
| But as an end-user, using it to fill in forms couldn't be
| easier.
| pbhjpbhj wrote:
| What do you use for form filling? During job applications
| last year nearly all the PDF forms were just images, or
| vector boxes that you couldn't fill ...
| simonw wrote:
| There's actually a good practical use-case for this: you're
| building software that needs to detect PDF files (as a smaller
| detail of what it does, not its primary purpose) and you want to
| include a tiny one in your unit tests.
|
| I did that here with tiny images in JPEG, PNG and GIF
| https://github.com/simonw/datasette-render-images/blob/maste...
| eesmith wrote:
| Doesn't a PDF detector only need to check if the first few
| bytes are '%PDF-1.'?
|
| That is, do you need a "valid PDF" detector or "more likely a
| PDF than anything else" detector?
| [deleted]
| rovr138 wrote:
| Depends. If you need to parse it, that might still cause
| errors.
| eesmith wrote:
| Agreed.
|
| Though when I think of "detector" I think more of
| https://en.wikipedia.org/wiki/File_(command) and not
| something which verifies the file is in the correct format.
| BiteCode_dev wrote:
| Well...
|
| file 'setupTests.ts'
| setupTests.ts: Java source, ASCII text
|
| I wouldn't put too much trust in file.
| jtvjan wrote:
| The heuristic used for Java source looks like this:
|
| 0 regex \^import.*;$ Java source
| klodolph wrote:
| I don't know what you expected. File is just there to
| give a good guess at a file's format. There are a ton of
| reasons why this problem is hard, and there are reasons
| to make "file" less accurate in order to make the
| implementation simpler and more secure.
|
| But it will work fine for PDFs, often enough.
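The two kinds of check discussed in this part of the thread (a strict "first few bytes" test versus accepting %PDF further into the file) can be sketched roughly as follows. This is only an illustration, not how any particular library or file(1) does it: the function names are hypothetical, the 1024-byte search window is an assumption modelled on the leniency some readers are known to apply, and the marker is `%PDF-` rather than `%PDF-1.` so that PDF 2.0 headers would also match.

```python
PDF_MAGIC = b"%PDF-"
# Assumption: a lenient reader only scans a small prefix of the file.
SEARCH_WINDOW = 1024


def looks_like_pdf_strict(data: bytes) -> bool:
    """Return True only if the data begins with the PDF header marker."""
    return data.startswith(PDF_MAGIC)


def looks_like_pdf_lenient(data: bytes) -> bool:
    """Return True if the marker lies entirely within the search window."""
    return PDF_MAGIC in data[:SEARCH_WINDOW]


if __name__ == "__main__":
    # Hypothetical inputs: these are header fragments, not valid PDF files.
    samples = {
        "header at byte 0": b"%PDF-1.4\n",
        "header after junk": b"junk bytes\n%PDF-1.4\n",
        "java-looking text": b"import foo;\n",
    }
    for name, data in samples.items():
        print(name, looks_like_pdf_strict(data), looks_like_pdf_lenient(data))
```

Neither function says anything about whether the file would actually parse. The lenient variant is also the one that produces the false positives mentioned earlier, since any text that merely contains the marker near its start will match.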
| meshy wrote:
| That use is indeed exactly what inspired the question!
| anonydsfsfs wrote:
| Check out https://github.com/mathiasbynens/small
| lioeters wrote:
| That is indeed a nice one, "Smallest possible syntactically
| valid files of different types".
|
| Following the GitHub link (datasette-render-images) in the
| comment you replied to, there's a code comment with a link to
| the same library (small).
| hombre_fatal wrote:
| Kinda silly including 0-byte files which is most of them. Now
| you're just building a list of things that take empty input.
|
| They should move those to a simple list in the README and
| reserve the repo for non-empty files.
| LargoLasskhyfv wrote:
| Not exactly on topic, but this reminds me of somewhere around the
| year 2000, when I had to produce "documentation" in a hurry, in
| under an hour before delivering and deploying some systems. I
| _think_ I used Dia, but am not sure anymore. Anyways, all
| essential information in DIN A4 landscape mode, nice diagrams
| with network structure, IP numbers and so on, ready.
|
| Now what? Remember it was around the year 2000, what to use?
| Floppy disks of course! Saved it, looked at it and thought it had
| gone wrong somehow because it listed as 4KB only.
|
| Used another floppy, low-level formatted with fdformat to be
| sure, taking minutes, hurry, hurry! Saved again.
|
| 4KB!? WT..?
|
| Booted up another system, loaded the PDF with different readers,
| worked.
|
| Shrugged and hoped it worked at the customer's site also, which
| it did, they even said it looked nice and clear.
|
| If only they knew...
| jansan wrote:
| If you need something to watch I recommend "Funky File Formats"
| where Ange Albertini shows for example how one file can be valid
| in multiple file formats at the same time. Really amazing:
| https://media.ccc.de/v/31c3_-_5930_-_en_-_saal_6_-_201412291...
| matja wrote:
| I wonder if afl-fuzz could do better.
| Context:
| https://lcamtuf.blogspot.com/2014/11/pulling-jpegs-out-of-th...
| NanoWar wrote:
| Here is a great collection of smallest X:
| https://github.com/mathiasbynens/small
| oftenwrong wrote:
| I would love to see a write-up (like the SO answer) for each of
| those.
___________________________________________________________________
(page generated 2020-03-20 23:00 UTC)