[HN Gopher] Pdfgrep - a commandline utility to search text in PD...
___________________________________________________________________
 
Pdfgrep - a commandline utility to search text in PDF files
 
Author : kretaceous
Score  : 187 points
Date   : 2022-09-25 14:38 UTC (8 hours ago)
 
(HTM) web link (pdfgrep.org)
(TXT) w3m dump (pdfgrep.org)
 
| neilv wrote:
| I've long used `pdfgrep` in a very kludgey way, when stockpiling
| rare variants of particular used ThinkPad models (for looking up
| _actual_ specs based on an IBM "type" number shown in a photo on
| an eBay listing, since the seller's listing of specs is often
| incorrect).
|
| Example shell function:
|     t500grep() {
|         pdfgrep "$1" /home/user/doc/lenovo-psref-withdrawn-thinkpad-2005-to-2013-2013-12-447.pdf
|     }
|
| Example run:
|     $ t500grep 2082-3GU
|     2082-3GU T9400 2.53 2GB 15.4" WSXGA+ Cam GMA, HD 3650 160G
|     7200 DVD+-RW Intel 5100 2G Turbo 9 Bus 32 Aug 08
|
| The Lenovo services to look up this info come and go, and are
| also slow, but a saved copy of the data lives forever.
|
| (A non- or less-kludgey way would be to get the information from
| all my IBM/Lenovo PSREFs into a lovingly-engineered
| database/knowledge schema, simple CSV file, or `grep`-able ad hoc
| text file.)
| mistrial9 wrote:
| +1 -- after trying several toolsets over the years, _pdfgrep_ is
| currently used daily around here
| donio wrote:
| For Emacs users there is also
| https://github.com/jeremy-compostella/pdfgrep which lets you
| browse the results and open the original docs with the selected
| match highlighted.
|
| It's on MELPA.
| nanna wrote:
| Tried this the other day but couldn't figure out how to use it.
| Is it invoked from eshell, dired, a pdf-tools buffer, or what?
| radicalbyte wrote:
| Cool, about 15 years ago I built something similar for PDF,
| Office (OpenXML) and plain text as part of a search engine.
| Commercial/closed source of course, but it was super handy.
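The wrapper pattern in neilv's comment generalizes to a small
sketch like the one below, assuming pdfgrep is installed; the
function name and default PDF path are hypothetical placeholders,
not anything from the thread.

```shell
# Hypothetical generalization of the t500grep idea: search a saved
# spec PDF for a model string. Assumes pdfgrep is on PATH.
psref() {
    local pattern="$1"
    # Second argument overrides the (placeholder) default PDF path.
    local pdf="${2:-$HOME/doc/lenovo-psref.pdf}"
    # -n prefixes each match with its page number; -i ignores case.
    pdfgrep -ni "$pattern" "$pdf"
}

# Usage: psref 2082-3GU
# Usage: psref 2082-3GU /path/to/other-psref.pdf
```

Keeping the PDF path as a defaultable second argument means one
function covers every saved PSREF instead of one wrapper per file.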
| majkinetor wrote:
| For Windows, there is dnGrep: http://dngrep.github.io
| gabythenerd wrote:
| Love dnGrep. One of my friends used to combine all his PDFs
| into one and then use Adobe Reader to search, before I showed
| this to him. It's very powerful and also simple to use, even
| for non-technical users.
| moonshotideas wrote:
| Out of curiosity, how did you solve the issue of extracting text
| from the PDF error-free? Or did you use another package?
| kfarnung wrote:
| Looking at the list of dependencies, it seems like they use
| poppler-cpp to render the PDFs.
|
| https://gitlab.com/pdfgrep/pdfgrep#dependencies
| dmoo wrote:
| Poppler's pdftotext -layout is great
| Frost1x wrote:
| Curious as well. About a year ago I was implementing what I
| naively thought would be a simple verification that a specific
| string existed (case sensitive or insensitive) within a PDF's
| text, and I hit many cases where the text was clearly rendered
| in the document but many libraries couldn't identify it. My
| understanding is that there's a lot of variance in how a
| rendered PDF represents what one might assume is a simple
| string (which wasn't too surprising, since I don't like to make
| simplicity assumptions). I couldn't find anything at the time
| that seemed to be error-free.
|
| Short of applying document rendering with OCR and text
| recognition approaches, I ended up living with some error rate.
| I think pdfgrep was one of the tools I tested. Some other
| people just used libraries/tools as-is without any QA, but in
| my sample of several hundred verified documents, pdfgrep (and
| others) missed some.
| nip wrote:
| Tangential:
|
| Some time ago I built an automation [1] that identifies whether
| the given PDFs contain the specified keywords, outputting the
| result as a CSV file.
|
| Similar to PDFGrep, probably much slower, but potentially more
| convenient for people who prefer GUIs.
|
| [1] https://github.com/bendersej/pdf-keywords-extractor
| mistermann wrote:
| A bit of a tangent, but does anyone know of a good utility that
| can index a large number of PDF files so one can do fast keyword
| searches across all of them simultaneously (free or paid)? It
| seems like this sort of utility used to be very common 15 years
| ago, but local search has kind of died on the vine.
| [deleted]
| kranner wrote:
| DEVONsphere Express, Recoll, and also the latest major version
| of Calibre.
| crtxcr wrote:
| I am working on looqs, which can do that (and also renders the
| page immediately): https://github.com/quitesimpleorg/looqs
| donio wrote:
| Recoll is a nice one; it uses Xapian for the index.
|
| https://www.lesbonscomptes.com/recoll/
| summm wrote:
| Recoll?
| tombrossman wrote:
| Yes, +1 for Recoll. It can also OCR those PDFs that are just
| an image of a page of text, and not 'live' text. Read the
| install notes and install the helper applications.
|
| When searching I'll first try the application or system's
| native search utility, but most of the time I end up opening
| Recoll to actually find the thing or snippet of text I want,
| and it has never failed me.
|
| https://www.lesbonscomptes.com/recoll/pages/features.html#do...
| llanowarelves wrote:
| dtSearch
| sumnole wrote:
| While we're asking for tool tips: does anyone know of a tool
| that will cache/index web pages as the user browses, so that
| they can be searched/viewed offline later?
| pletnes wrote:
| macOS's Spotlight can do this AFAIK.
| pulvinar wrote:
| Yes, and Spotlight is also usable from the command line as
| mdfind, which has an -onlyin switch to restrict the search to
| a directory.
| marttt wrote:
| I've been using Ali G. Rudi's pdftxt with my own shell wrappers.
| From the homepage: "uses mupdf to extract text from pdf files;
| prefixes each line with its page number for searching."
|
| Usually I 1) pdftxt a file and 2) based on the results, jump to
| a desired page in Rudi's framebuffer PDF reader, fbpdf. For
| this, the page-number prefix in pdftxt is a particularly nice
| default. No temptations with too many command line options
| either.
|
| https://litcave.rudi.ir/pdftxt-0.7.tar.gz
| thriftwy wrote:
| The catdoc utility does the same for .doc MS Word files. Maybe
| for PDFs also.
| sauercrowd wrote:
| pdfgrep is great. Worked like a charm to diff updates to a
| contract.
| Findecanor wrote:
| Just what I needed to search my collection of comp-sci articles.
| Regular grep fails on most PDFs.
|
| I installed the Ubuntu package. Thanks!
| ankrgyl wrote:
| DocQuery (https://github.com/impira/docquery), a project I work
| on, lets you do something similar, but search over semantic
| information in the PDF files (using a large language model that
| is pre-trained to query business documents).
|
| For example:
|     $ docquery scan "What is the due date?" /my/invoices/
|     /my/invoices/Order1.pdf  What is the due date?: 4/27/2022
|     /my/invoices/Order2.pdf  What is the due date?: 9/26/2022
|     ...
|
| It's obviously a lot slower than "grepping", but very powerful.
| pugio wrote:
| Wow, this is exactly what I've been looking for, thank you! I
| just wish it were possible with these transformer models to
| extract a structured set of what the model "knows" (e.g. for
| easy search indexing). These natural language question systems
| are a little too fuzzy sometimes.
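The pdftxt-style page-prefixed workflow described above can be
approximated with pdfgrep itself across a whole directory of
papers. A minimal sketch, assuming pdfgrep is installed; the
wrapper name, search pattern, and directory are placeholders.

```shell
# Hypothetical wrapper: recursive, case-insensitive search over a
# directory of PDFs, with a page-number prefix on each match
# (similar in spirit to pdftxt's default output).
papergrep() {
    # -r recurse into the directory
    # -n prefix each match with its page number
    # -i case-insensitive matching
    pdfgrep -rni "$1" "${2:-$HOME/papers}"
}

# Usage: papergrep 'lock-free queue' ~/articles
```

The page-number prefix makes it easy to jump straight to the
matching page in whatever PDF viewer you use.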
| a1369209993 wrote:
| > to extract a structured set of what the model "knows"
|
| To be fair, that's impossible in the general case, since the
| model can know things (i.e. be able to answer queries) without
| knowing that it knows them (i.e. being able to produce a list
| of answerable queries by any means significantly more efficient
| than trying every query and seeing which ones work).
|
| As a reductio ad absurdum example, consider a 'model'
| consisting of a deniably encrypted key-value store, where
| it's outright cryptographically guaranteed that you can't
| efficiently enumerate the queries. Neural networks aren't
| _quite_ that bad, but (in the general-over-NNs case) they at
| least superficially _appear_ to be pretty close. (They're
| definitely not _reliably_ secure though; don't depend on
| that.)
| ankrgyl wrote:
| Can you tell me a bit more about your use case? A few things
| that come to mind:
|
| - There are some ML/transformer-based methods for extracting
| a known schema (e.g. NER) or an unknown schema (e.g. relation
| extraction).
|
| - We're going to add a feature to DocQuery called "templates"
| soon for some popular document types (e.g. invoices) + a
| document classifier which will automatically apply the
| template based on the doc type.
|
| - Our commercial product (http://impira.com/) supports all of
| this + is a hosted solution (many of our customers use us to
| automate accounts payable, process insurance documents, etc.)
| mbb70 wrote:
| Since you mention insurance documents, could you speak to
| how well this would extract data from a policy document like
| https://ahca.myflorida.com/medicaid/Prescribed_Drug/drug_cri... ?
|
| The unstoppable administrative engine that is the American
| healthcare system produces hundreds of thousands of
| continuously updated documents like this with no standardized
| format/structure.
|
| Manually extracting/normalizing this data into a queryable
| format is an industry all its own.
| ankrgyl wrote:
| It's very easy to try! Just plug that URL in here:
| https://huggingface.co/spaces/impira/docquery.
|
| I tried a few questions:
|     What is the development date? -> June 20, 2017
|     What is the medicine? -> SPINRAZA(r) (nusinersen)
|     How many doses -> 5 doses
|     Did the patient meet the review criteria? -> Patient met
|     initial review criteria.
|     Is the patient treated with Evrysdi? -> not
| pugio wrote:
| Your commercial product looks very cool, but my use case is
| in creating an offline-first local document storage system
| (data never reaches a cloud). I'd like to enable users to
| search through all documents for relevant pieces of
| information.
|
| The templates sound very cool - are they essentially just
| using a preset list of (natural language) queries tied to a
| particular document class? It seems like you're using a
| version of Donut for your document classification?
| ankrgyl wrote:
| > but my use case is in creating an offline-first local
| document storage system (data never reaches a cloud).
|
| Makes sense -- this is why we OSS'd DocQuery :)
|
| > The templates sound very cool - are they essentially
| just using a preset list of (natural language) queries
| tied to a particular document class? It seems like you're
| using a version of Donut for your document classification?
|
| Yes, that's the plan. We've done extensive testing with
| other approaches (e.g. NER) and realized that the benefits
| of use-case-specific queries (customizability, accuracy,
| flexibility for many use cases) outweigh the tradeoffs
| (NER only needs one execution for all fields).
|
| Currently, we support pre-trained Donut models for both
| querying and classification. You can play with it by
| adding the --classify flag to `docquery scan`. We're
| releasing some new stuff soon that should be faster and
| more accurate.
| pugio wrote:
| Sweet! I'll keep an eye on the repo. Thank you for open-
| sourcing DocQuery.
I agree with your reasoning: my
| current attempts to find an NER model that covers all my
| use cases have come up short.
| ultrasounder wrote:
| This is so epic. I was just ruminating about this particular
| use case. Who are your typical customers - supply chain or
| purchasing? Also, I notice that you do text extraction from
| invoices? Are you using something similar to CharGrid or its
| derivative BERTgrid? Wishing you and your team more success!
| ankrgyl wrote:
| Thank you ultrasounder! Supply chain, construction,
| purchasing, insurance, financial services, and healthcare are
| our biggest verticals, although we have customers doing just
| about anything you can imagine with documents!
|
| For invoices, we have a pre-trained model (demo here:
| https://huggingface.co/spaces/impira/invoices) that is pretty
| good at most fields, but within our product, it will
| automatically learn your formats as you upload documents and
| confirm/correct predictions. The pre-trained model is based
| on LayoutLM, and the additional learning we do uses a
| generative model (GMMs) that can learn from as little as one
| example.
|
| LMK if you have any other questions.
| greggsy wrote:
| I've been looking for an alternative to Acrobat's 'Advanced
| Search' capability, which allows you to define a search
| directory (like NP++). This feature is possible with pdfgrep
| and other tools, but the killer feature for me is that it
| displays results in context and allows you to quickly sift
| through dozens or hundreds of results with ease.
|
| It's literally the only reason why I have Acrobat installed.
| asicsp wrote:
| See also https://github.com/phiresky/ripgrep-all (`ripgrep`,
| but it can also search in PDFs, e-books, Office documents, zip,
| tar.gz, etc.)
| xenodium wrote:
| Was already a big fan of ripgrep and pdfgrep. rga (ripgrep-all)
| was such a hidden gem for me.
| MisterSandman wrote:
| Used this for online quizzes for one of my courses.
It was pretty
| good, but using the terminal for stuff like this still sucks
| since you can't just click into the PDF.
|
| I wish PDFs had a Ctrl-Shift-F search like VS Code that could
| search for text in multiple PDFs in a directory.
| darkteflon wrote:
| I use HoudahSpot on macOS for this. You can either invoke it
| globally with a custom hotkey (I use hyper-'s', for "search"),
| or from within a specific directory via an icon it adds to the
| top right corner of Finder.
| greggsy wrote:
| Advanced Search in Acrobat allows you to define a directory.
| It's the only reason I ever have it installed.
___________________________________________________________________
(page generated 2022-09-25 23:00 UTC)